Hi guys, I have 3 types of invoices, each with minor differences in the PDF format, and I want to extract the values from a table column named Details.
All 3 invoice types have this Details column, but one of the other columns may have a different name.
Can I train the bot to learn to extract data from these 3 types?
The important thing is that I want to extract only this one ‘Details’ column, regardless of the minor differences between the 3 PDF formats.
I have tried regex previously, but it does not work for all the different invoice formats.
I have also tried Form Extractor, but only with 1 format. Can this activity work with multiple PDF formats?
Upload the files to a dataset. Try uploading at least 5-10 documents for each type; the more files, the better. Then select an out-of-the-box ML model; there are table extraction models as well.
Perform the data labelling, that is, indicate which fields need to be extracted and what type of data each one holds.
Then you can train the pipeline and create a skill out of it.
For a start check this
Also, did you try the ML form extractors? Those should be able to extract as required, even if the column names are slightly different.
So you have three types of “invoices” with small variations. Since the common denominator of your document types is the invoice, here’s how you can train and label the machine learning model.
The good news is that you only need one machine learning model. Instead of using regex and the Form Extractor, switch to a machine learning model and use the Invoices out-of-the-box ML package.
Note that you can add custom fields to the Invoices model as well, so you don’t need to worry about adding column fields and regular fields. You can add and label them as you normally would in an out-of-the-box model.
So under ML packages, please use Invoices; that will be the only model you need to train.
So we can map the 3 different invoice formats (to the same Details column in each format) and use one model to extract them, is that right? I thought a pretrained model could only map 1 type for each extraction.
Let me try this.
Hi @Anil_G, we only want to get 1 column (Details), but in the ML extractor this falls under items (which has multiple columns as a dataset). How should I work on this? Appreciate your help.
Could you let us know whether the PDF invoices are in digital format or scanned, and whether there is any chance of receiving scanned documents as well?
If you are sure that the documents will be digital, we could check whether the extraction can be done using regex or string manipulation, or maybe try the Interop methods.
If the documents are scanned, then we can proceed with the Document Understanding methods.
Also, let us know if you could provide sample data for the 3 invoice types (if they are digital).
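If it is not obvious whether a batch of PDFs is digital or scanned, one rough check is to look at how much text the PDF's text layer yields per page. The sketch below is only a heuristic in Python, and it assumes the page text has already been pulled out with some PDF text reader; the function name `looks_scanned` and the `min_chars` threshold are made up for illustration.

```python
def looks_scanned(page_texts, min_chars=20):
    """Guess that a PDF is scanned when most pages have almost no text layer.

    page_texts: list of strings, one per page, as returned by whatever
    PDF text reader is in use (assumption: an empty string for image-only pages).
    """
    if not page_texts:
        return True  # nothing readable at all, so treat it as scanned
    nearly_empty = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return nearly_empty > len(page_texts) / 2


# A page with a real text layer versus image-only pages.
print(looks_scanned(["Invoice 001  No  Details  Amount  1  Widget A  10.00"]))  # False
print(looks_scanned(["", "  "]))                                                # True
```

Documents flagged this way would go through OCR and Document Understanding; the rest can be tried with regex or string manipulation first.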
Even if you need just one column to be extracted, to access that line-item field you will have to create an item field in your taxonomy and then create the needed column in that item. Finally, map it to the ML Extractor fields in the same manner.
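Once the line items come back from the extractor, keeping only the Details column is a small post-processing step. Here is a minimal sketch in Python, assuming the line items have been exported to a list of dictionaries keyed by column name; that shape, the function name `details_only` and the other column names are hypothetical and not the actual UiPath output format.

```python
def details_only(line_items, column="Details"):
    """Return the values of one column from extracted line items.

    Rows where the column is missing or empty are skipped.
    """
    return [row[column] for row in line_items if row.get(column)]


# Hypothetical line items from two invoice formats whose other columns differ.
items = [
    {"Details": "Widget A", "Amount": "10.00"},
    {"Details": "Cable", "Price": "5.00"},
    {"Details": "", "Price": "0.00"},  # empty Details cell, skipped
]
print(details_only(items))  # ['Widget A', 'Cable']
```

Because only the Details key is read, it does not matter that the other column is called Amount in one format and Price in another.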
You could create or provide a Sample PDF of the same format.
You could check the below on using Interop methods. It is not always reliable, but for some PDF types it works very well. Do check and let us know:
Also providing some references on extraction using regex methods:
We would need to identify the common patterns across the 3 invoice types and then use a single pattern for all three.
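To make the "common pattern" idea concrete: for digital PDFs, the one thing all three formats share is a header row containing Details, so the pattern can key on that header instead of on fixed positions. The sketch below is in Python and rests on assumptions: the PDF text has already been read with layout preserved, columns are separated by two or more spaces, and the table ends at the first blank line. The function name `extract_details` is made up for illustration.

```python
import re

# Assumption: columns in the extracted text are separated by 2+ spaces.
COLUMN_GAP = re.compile(r"\s{2,}")


def extract_details(text, column="Details"):
    """Return the values under `column` from the first table whose header has it."""
    values = []
    col_index = None
    for line in text.splitlines():
        cells = COLUMN_GAP.split(line.strip())
        if col_index is None:
            # Still searching for the header row (case-insensitive match).
            lowered = [cell.lower() for cell in cells]
            if column.lower() in lowered:
                col_index = lowered.index(column.lower())
            continue
        if not line.strip():
            break  # first blank line after the header ends the table
        if col_index < len(cells):
            values.append(cells[col_index])
    return values


format_a = "Invoice 001\n\nNo  Details          Amount\n1   Widget A         10.00\n2   Widget B         20.00\n\nTotal  30.00"
format_b = "Details        Qty  Total\nPaper          3    9.00"
print(extract_details(format_a))  # ['Widget A', 'Widget B']
print(extract_details(format_b))  # ['Paper']
```

The column position is read from the header each time, so a renamed or reordered neighbouring column does not matter. An empty cell in a row would shift the columns, though, which is one reason this approach is less robust than the ML extractor.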