Machine Learning Extraction with multiple PDF formats

Can you point me to samples or a guide for training extraction on different types of PDF? I already have the classification working.
Thank you

@syezids

Sure…

  1. Create a project in AI Center
  2. Upload the files to the dataset…try uploading at least 5-10 documents for each type…the more files the better…then select an out-of-the-box ML model…there are table extraction models as well…
  3. Perform the data labelling…that is, indicate which fields need to be extracted and what type of data each one is
  4. Then you can train the pipeline…and then create a skill out of it
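
Before uploading, it can help to check locally that every invoice type has enough samples. A minimal sketch, assuming the documents are sorted into one folder per type; the folder names and the `MIN_DOCS` threshold (taken from the low end of the 5-10 guideline in step 2) are assumptions, and nothing here talks to AI Center:

```python
from collections import Counter
from pathlib import PurePath

MIN_DOCS = 5  # assumption: lower bound of the "5-10 documents per type" guideline


def count_by_type(paths):
    """Count PDF files per document type, where the type is the parent folder name."""
    return Counter(
        PurePath(p).parent.name
        for p in paths
        if PurePath(p).suffix.lower() == ".pdf"
    )


def types_needing_more(counts, minimum=MIN_DOCS):
    """Return the document types that have fewer than `minimum` samples."""
    return sorted(t for t, n in counts.items() if n < minimum)


if __name__ == "__main__":
    # Hypothetical file list; in practice it could come from Path("dataset").rglob("*.pdf").
    files = [f"dataset/format_a/inv_{i}.pdf" for i in range(6)]
    files += [f"dataset/format_b/inv_{i}.pdf" for i in range(3)]
    counts = count_by_type(files)
    print(dict(counts))
    print("Need more samples:", types_needing_more(counts))
```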

For a start check this

Also, did you try the ML form extractors? Those should be able to extract as required…even if the column names are slightly different

Cheers

Hi,

So you have three types of “invoices” with small variations. Since the common denominator of document types you have is invoice, here’s how you can train and label the machine learning model.

  1. Good news is that you only need one machine learning model, so instead of using regex and forms extractor, you need to switch to a machine learning model, and use the invoices out of the box ML package.
  2. Note that in the invoices model you can add custom fields as well. So you don’t need to worry about adding column and regular fields. You can add and label as you would normally do in an out of the box model.

So under ML packages, please use Invoices, and that will be the only model you need to train.

Good luck!

@Anil_G @sharon.palawandram hi both, thanks for the input.

Just to confirm, this would need AI Center enabled, right? We cannot do this without the out-of-the-box ML package? Or can we use just this one?
image

@syezids

If the pretrained model endpoint is working for you as expected, then no AI Center training is needed…please check these public endpoints

If the pretrained models are not working…then you need to train them on your dataset, create a skill out of it, and use that skill

But for both you need the DU license

Hope this helps

Cheers

Yep, the pretrained model only works well for 1 type of PDF invoice, not for the other 2.

To confirm my understanding: with the pretrained model on its own, we cannot train it again, right?
Training can only be done with AI Center.

@syezids

Even in AI Center you will still select a pretrained model…and on top of it…you would train on your data…

Did you try the pretrained ML model already? What column did you map it to?

Cheers

OK, understood.

Yes, I have tried the pretrained ML model and mapped it to the Details column. It works for 1 type of PDF only, but not for the other 2.

@syezids

In the taxonomy, did you try creating columns with all three names and then mapping the same column to all three in the ML extractor?
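
The idea is that differently named headers all end up in one taxonomy column. A rough sketch of that alias mapping in Python; the header names in `HEADER_ALIASES` are hypothetical (the real ones depend on your three formats), and this only illustrates the idea, it is not how the ML extractor is configured in Studio:

```python
# Hypothetical header names; replace them with the actual headers of the three invoice formats.
HEADER_ALIASES = {
    "details": "Details",
    "description": "Details",
    "item details": "Details",
}


def canonical_column(header):
    """Map a raw table header to its taxonomy column name, or None if it is unknown."""
    # Collapse repeated whitespace and ignore case before the lookup.
    key = " ".join(header.split()).lower()
    return HEADER_ALIASES.get(key)


if __name__ == "__main__":
    for raw in ["Details", "  Description ", "Item   Details", "Qty"]:
        print(repr(raw), "->", canonical_column(raw))
```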

Can you show how you did it?

Cheers

So we can map the 3 different invoice formats (to the same Details column in each format) and use that for extraction, is that right? I thought the pretrained model could only map 1 type per extraction.
Let me try this.

Hi @Anil_G, we only want to get 1 column (Details), but in the ML extractor this falls under Items (which has multiple columns in the dataset). How should I handle this? Appreciate your help.

@syezids

Didn't get your question, can you elaborate?

Cheers

Hi @syezids ,

Could you let us know whether the PDF invoices are digital or scanned, and whether there is any chance of receiving scanned documents as well?

If you are sure that the documents will be digital, we could check whether the extraction can be done using regex or string manipulation, or maybe try the Interop methods.

If the documents are scanned, then we can proceed with the Document Understanding methods.

Also, let us know if you can provide sample data for the 3 invoice types (if they are digital).

Using the ML extractor, the Details column that I want to map only falls under Items in the ML capabilities, like this:

sample table header:
image

It is mentioned in this forum that Description falls under Items:

Because in the taxonomy I map to the Details column only, this throws an error on the extraction result:

Should I map to the whole table, or what is the correct way to do this?

Hi @supermanPunch,

It is digital, however I'm afraid I cannot share the sample as it contains restricted data.
Can you explain what the Interop methods are?

Thanks

Hi @syezids

Even if you just need one column to be extracted, to access that line-item field you will have to create an item (table) field in your taxonomy and then create the needed column inside it. Finally, map it to the ML Extractor fields in the same manner.

@syezids

You need not map whole table…just create a table field…and map item to the table field…and to the column field map the column you need

In taxonomy you have to just create as required…
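
To picture it: the taxonomy has one table field with a single column, and you read only that column from the extracted rows. A minimal sketch in plain Python; the `taxonomy` dict, the row layout, and the sample values are all invented for illustration and are not the actual UiPath taxonomy or extraction result format:

```python
# Illustrative only: one table field ("Items") that contains the single column we need.
taxonomy = {"Items": {"type": "table", "columns": ["Details"]}}

# Invented sample rows standing in for extracted line items.
extracted_items = [
    {"Details": "Consulting services", "Quantity": "2"},
    {"Details": "", "Quantity": "1"},
    {"Details": "Travel expenses"},
]


def column_values(line_items, column):
    """Return the non-empty values of one column from a list of table rows."""
    return [row[column] for row in line_items if row.get(column)]


if __name__ == "__main__":
    for name in taxonomy["Items"]["columns"]:
        print(name, "->", column_values(extracted_items, name))
```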

Cheers

@syezids ,

You could create or provide a Sample PDF of the same format.

You could check the below on using the Interop methods. They are not always reliable, but for some PDF types they work very well. Do check and let us know:

Also providing some references on extraction using regex methods:

We would need to identify the common patterns between the 3 Invoice types and then use that pattern as one for all three types.
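
As a rough illustration of one pattern covering all three types, here is a sketch in Python. The label variants (`Details`, `Description`, `Particulars`) and the `label: value` layout are assumptions made up for the example, since the real invoices cannot be shared; a table laid out in columns would need a different pattern:

```python
import re

# Hypothetical label variants for the three invoice formats; adjust to the real text.
DETAILS_PATTERN = re.compile(
    r"^\s*(?:Details|Description|Particulars)\s*[:\-]\s*(?P<value>.+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_details(text):
    """Return every value that follows one of the known labels, one per matching line."""
    return [m.group("value") for m in DETAILS_PATTERN.finditer(text)]


if __name__ == "__main__":
    # Invented sample texts, one per assumed format.
    samples = [
        "Invoice 001\nDetails: Web hosting\nTotal: 100",
        "DESCRIPTION - Annual licence  ",
        "particulars:Site visit",
    ]
    for text in samples:
        print(extract_details(text))
```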

Thanks a lot @Anil_G @DanRagh, I have it working now with your suggestions. Thanks @supermanPunch for your idea, I will learn more about that for future reference.

Thanks guys!
