Machine Learning Extraction with multiple PDF formats

syezids · March 20, 2023, 8:48pm

Hi guys, I have 3 types of invoices, each with minor differences in the PDF format. I want to extract the data from a table column named Details.

All 3 invoice types have this Details column, but one of the other columns may have a different name.

Can I train the bot to extract data from all 3 types?

The important thing is that I want to extract only this one ‘Details’ column, regardless of the minor differences between the 3 PDF formats.

I tried regex previously, but it does not work for all the invoice formats.
I have also tried Form Extractor, but with only 1 format. Can this activity work with multiple PDF formats?

Please advise.

Anil_G · March 20, 2023, 8:51pm

@syezids

Yes, you can train on different document types. It will work for all the types you train.

You can use classification as well to classify them into different groups.

Cheers

syezids · March 20, 2023, 9:18pm

Can you point me to samples or a guide for training on different types of PDF for extraction? I am already able to do the classification.
Thank you

Anil_G · March 20, 2023, 9:34pm

@syezids

Sure.

Create a project in AI Center.
Upload the files to a dataset. Try uploading at least 5-10 documents for each type; the more files, the better. Select an out-of-the-box ML model; there are table extraction models as well.
Perform the data labelling, that is, indicate which fields need to be extracted and what type of data each one holds.
Then you can train the pipeline, and then create a skill out of it.

For a start check this

Also, did you try the ML form extractors? Those should be able to extract as required, even if the column names are slightly different.

Cheers

sharon.palawandram · March 21, 2023, 1:30am

Hi,

So you have three types of “invoices” with small variations. Since the common denominator of your document types is the invoice, here’s how you can train and label the machine learning model.

The good news is that you only need one machine learning model. Instead of using regex and Form Extractor, switch to a machine learning model and use the out-of-the-box Invoices ML package.
Note that you can add custom fields to the Invoices model as well, so you don’t need to worry about adding column fields and regular fields. You can add and label them as you normally would in an out-of-the-box model.

So under ML packages, please use Invoices; that will be the only model you need to train.

Good luck!

syezids · March 21, 2023, 4:01am

@Anil_G @sharon.palawandram Hi both, thanks for the input.

Just to confirm, this would need AI Center enabled, right? We cannot do this without the out-of-the-box ML package? Or can we use just this one?

Anil_G · March 21, 2023, 4:05am

@syezids

If the pretrained model endpoint is working for you as expected, then no AI Center training is needed. Please check these public endpoints

If the pretrained models are not working, then you need to train them on your dataset, create a skill out of the result, and use that skill.

But for both you need the DU (Document Understanding) license.

Hope this helps

Cheers

syezids · March 21, 2023, 4:08am

Yep, the pretrained model only works well for 1 type of PDF invoice, not for the other 2.

To confirm my understanding: we cannot retrain the pretrained model itself, right?
Training can only be done with AI Center.

Anil_G · March 21, 2023, 4:11am

@syezids

Even in AI Center you start by selecting a pretrained model, and then you train it further on your own data.

Did you try the pretrained ML model already? What column did you map it to?

Cheers

syezids · March 21, 2023, 4:14am

OK, understood.

Yes, I have tried the pretrained ML model and mapped it to the Details column. It works for 1 type of PDF only, but not for the other 2.

Anil_G · March 21, 2023, 4:15am

@syezids

In the taxonomy, did you try creating columns with all three names and then mapping the same column to all three in the ML extractor?

Can you show how you did it?

Cheers

syezids · March 21, 2023, 4:29am

So we can map to 3 different invoice formats (to the same Details column in each format) and use that to extract? I thought a pretrained model could only map 1 type for each extraction.
Let me try this.

syezids · March 21, 2023, 5:27am

Hi @Anil_G, we only want to get 1 column (Details), but in the ML extractor this falls under items (which has multiple columns as a dataset). How should I handle this? Appreciate your help.

Anil_G · March 21, 2023, 5:28am

@syezids

I didn’t get your question. Can you elaborate?

Cheers

supermanPunch · March 21, 2023, 5:29am

Hi @syezids ,

Could you let us know whether the PDF invoices are digital or scanned, and whether there is any chance of receiving scanned documents as well?

If you are sure that the documents will be digital, we could check whether the extraction can be done using regex, string manipulation, or the Interop methods.

If the documents are scanned, then we can proceed with the Document Understanding methods.

Also, let us know if you could provide us with sample data for the 3 invoice types (if they are digital).

syezids · March 21, 2023, 5:50am

Using ML extractors, the Detail column that I want to map only falls under items in the ML capabilities, like this:

sample table header:

It is mentioned in this forum that Description falls under items:

Because I map only to the Detail column in the taxonomy, this throws an error in the extraction result:

Should I map to the whole table, or what’s the correct way to do this?

syezids · March 21, 2023, 5:52am

Hi @supermanPunch,

They are digital, but I’m afraid I cannot share a sample as it contains restricted data.
Can you explain what the Interop methods are?

Thanks

DanRagh · March 21, 2023, 5:56am

Hi @syezids

Even if you just need one column extracted, to access that line-item field you have to create an item (table) field in your taxonomy and then create the needed column inside it. Finally, map it to the ML Extractor fields in the same manner.

Anil_G · March 21, 2023, 5:59am

@syezids

You need not map the whole table. Just create a table field and map items to that table field, then map the column you need to the column field.

In the taxonomy you only have to create what is required.

Cheers
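
To make the "table field with one column" idea concrete, here is a rough Python sketch. It is only an illustration: the dictionaries below are hypothetical and are not UiPath's actual taxonomy or ML Extractor format, and the source column name `description` is an assumption based on the remark above that Description falls under items. The point is simply that the extractor returns line items with several columns, while the taxonomy declares one table field holding only the column you care about.

```python
# Hypothetical structures for illustration only; this is NOT UiPath's real
# taxonomy file format or ML Extractor output schema.
TAXONOMY = {
    "document_type": "Invoice",
    "fields": [
        # One table field with a single column, as suggested in the thread.
        {"name": "Items", "type": "table", "columns": ["Details"]},
    ],
}

# Assumed mapping: extractor line-item column -> taxonomy column.
COLUMN_MAP = {"description": "Details"}


def unmapped_targets(taxonomy, table_name, column_map):
    """Return mapped target columns that the taxonomy table does not declare."""
    table = next(f for f in taxonomy["fields"] if f["name"] == table_name)
    return [t for t in column_map.values() if t not in table["columns"]]


def project_items(extracted_rows, column_map):
    """Keep only the mapped columns from each extracted line item."""
    return [
        {target: row[source] for source, target in column_map.items() if source in row}
        for row in extracted_rows
    ]


# Example line items as an extractor might return them (made-up data).
SAMPLE_ROWS = [
    {"description": "Widget A", "quantity": "2"},
    {"description": "Widget B", "unit-price": "5.00"},
]
```

Calling `project_items(SAMPLE_ROWS, COLUMN_MAP)` keeps just the Details values, and `unmapped_targets` is a quick sanity check that every mapped column really exists in the table field.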

supermanPunch · March 21, 2023, 6:02am

@syezids ,

You could create or provide a sample PDF in the same format.

You could check the below on using Interop methods. They are not always reliable, but for some PDF types they work very well. Do check and let us know:

Also providing some references on extraction using regex methods:

We would need to identify the common patterns between the 3 invoice types and then use one pattern for all three.
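
As a rough idea of what one shared pattern could look like for digital PDFs, here is a minimal Python sketch. It rests on assumptions that are not from this thread: the PDF text has already been read into a list of lines, columns are separated by two or more spaces, and the table ends at the first blank line after the header. The function name `extract_details` is made up for the example. It locates the Details header by name rather than by position, so it does not matter which other columns a given format has.

```python
import re

# Columns are assumed to be separated by two or more whitespace characters.
COLUMN_GAP = re.compile(r"\s{2,}")


def extract_details(lines, header_name="Details"):
    """Return the values under `header_name` from a space-aligned text table."""
    column_index = None
    values = []
    for line in lines:
        stripped = line.strip()
        cells = COLUMN_GAP.split(stripped)
        if column_index is None:
            # Still searching for the header row that contains the column name.
            lowered = [cell.lower() for cell in cells]
            if header_name.lower() in lowered:
                column_index = lowered.index(header_name.lower())
            continue
        if not stripped:
            # Assumption: the first blank line after the header ends the table.
            break
        if column_index < len(cells):
            values.append(cells[column_index])
    return values


# Three made-up layouts where the Details column sits in different positions.
FORMAT_A = ["Invoice 001", "", "No  Details  Amount",
            "1  Widget A  10.00", "2  Widget B  5.50", "", "Total  15.50"]
FORMAT_B = ["Details  Qty  Price", "Service fee  1  20.00"]
FORMAT_C = ["Item  Unit Cost  Details", "7  3.00  Cable"]
```

Note that this breaks down when a row has an empty cell, because splitting on gaps shifts the remaining cells left. In that case the character offsets of the header would be a safer guide than the cell index.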
