Hi guys, I have 3 types of invoices, each with minor differences in the PDF format, and I want to extract the details from a table column named Details.
All 3 invoice types have this Details column, but one of the other columns may have a different name.
Can I train the bot to extract data from all 3 types?
The important thing is that I want to extract only this one ‘Details’ column, regardless of the minor differences between the 3 PDF formats.
I have tried regex previously, but it does not work for all the invoice formats.
I have also tried the Form Extractor, but with only 1 format. Can this activity work with multiple PDF formats?
Yes, you can train on different document types. It would work for all the types you train.
You can also use classification to sort them into different groups.
Can you point me to samples or a guide on training different types of PDF for extraction? I already have the classification working.
Create a project in AI Center.
Upload the files to a dataset. Try uploading at least 5-10 documents for each type; more files is better. Select an out-of-the-box ML model. There are table extraction models as well.
Perform the data labelling, that is, indicate which fields need to be extracted and what type of data each one holds.
Then you can train the pipeline and create a skill out of it.
For a start check this
Also, did you try the ML extractors? Those should be able to extract as required, even if the column names are slightly different.
So you have three types of “invoices” with small variations. Since the common denominator of your document types is the invoice, here’s how you can train and label the machine learning model.
The good news is that you only need one machine learning model. Instead of using regex and the Form Extractor, switch to a machine learning model and use the Invoices out-of-the-box ML package.
Note that in the Invoices model you can add custom fields as well, so you don’t need to worry about adding column fields and regular fields. You can add and label them as you normally would in an out-of-the-box model.
So under ML packages, please use Invoices; that will be the only model you need to train.
@Anil_G @sharon.palawandram Hi both, thanks for the input.
Just to confirm, this would need AI Center enabled, right? We cannot do this without the out-of-the-box ML package? Or can we use just this one?
If the pretrained model endpoint works for you as expected, then no AI Center training is needed. Please check these public endpoints.
If the pretrained models do not work, then you need to train them on your dataset, create a skill out of the result, and use that skill.
But for both you need the DU (Document Understanding) license.
Hope this helps.
Yep, the pretrained model only works well for 1 type of PDF invoice, not for the other 2.
To confirm my understanding: we cannot train the pretrained model again, right?
Training can only be done with AI Center.
Even in AI Center you select a pretrained model and then train it further on your own data.
Did you try the pretrained ML model already? What column did you map it to?
Yes, I have tried the pretrained ML model and mapped it to the Details column. It works for 1 type of PDF only, but not for the other 2.
In the taxonomy, did you try creating columns with all three names, mapping the same column to all three in the ML extractor, and then running it?
Can you show how you did it?
So we can map to the 3 different invoice formats (to the same Details column in each format) and use that for extraction, is that right? I thought that with a pretrained model we could only map 1 type per extraction.
Let me try this.
@Anil_G, we only want to get 1 column (Details), but in the ML extractor this falls under items (which has multiple columns as a dataset). How should I handle this? Appreciate your help.
Didn’t get your question, can you elaborate?
Could you let us know whether the PDF invoices are digital or scanned, and whether there is any chance of receiving scanned documents as well?
If you are sure that the documents will be digital, we could check whether the extraction can be done using regex or string manipulation, or maybe try the Interop methods.
If the documents are scanned, then we can proceed with the Document Understanding methods.
Also, let us know if you could provide sample data for the 3 invoice types (if they are digital).
Using ML extractors, the Detail column that I want to map only falls under items in the ML capabilities, like this:
Sample table header:
This forum post mentions that Description falls under items:
Because I map only to the Detail column in the taxonomy, this throws an error in the extraction result:
Should I map to the whole table, or what’s the correct way to do this?
It is digital, but I’m afraid I cannot share the sample as it contains restricted data.
Can you explain what the Interop methods are?
Even if you need just one column extracted, to access that line-item field you have to create an item field in your taxonomy and then create the needed column inside the item. Finally, map it to the ML Extractor fields in the same manner.
You need not map the whole table. Just create a table field, map items to the table field, and map the column you need to the column field.
In the taxonomy you only have to create what is required.
You could create or provide a sample PDF in the same format.
You could check the below on using Interop methods. It is not always reliable, but for some PDF types it works very well. Do check and let us know:
I believe the sample PDF shared contains sample data only and not confidential data. If it is confidential, do let me know. However, I am excluding the PDF sample that was provided to me in private.
Check the below workflow:
PDF_To_Excel_Corrected.zip (4.0 KB)
The data that you provided seemed to have a problem with the column names, and hence it was throwing an error. I corrected that part by implementing a counter/index to the column names if it was already h…
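The attached workflow itself is not reproduced here, so the following is only a rough Python sketch of the general idea of adding a counter to repeated column names. The function name `dedupe_column_names` and the `_1`, `_2` suffix style are assumptions made for illustration, not taken from the workflow.

```python
def dedupe_column_names(names):
    """Make column names unique by appending a counter to repeats.

    Hypothetical helper: the first occurrence keeps its name, and later
    occurrences get "_1", "_2", ... appended.
    """
    seen = {}    # column name -> how many times it has appeared so far
    result = []
    for name in names:
        count = seen.get(name, 0)
        seen[name] = count + 1
        result.append(name if count == 0 else f"{name}_{count}")
    return result


headers = ["Details", "Amount", "Details", "Details"]
print(dedupe_column_names(headers))
# ['Details', 'Amount', 'Details_1', 'Details_2']
```

This simple version does not guard against a header that already contains a suffixed name such as "Details_1"; real invoices may need a stricter scheme.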
Also providing some references on extraction using regex methods:
Could you try the steps below:
Use the Read PDF Text activity with PreserveFormatting set to True. You get the output as a String, say stored in a variable pdfText.
We can now use regex operations on this data to get what you need. First we recognise the pattern present in the data. The observable pattern is that each item in the table is separated by at least 2 spaces. Hence, we could use this pattern to capture these values s…
We would need to identify the patterns common to the 3 invoice types and then use a single pattern for all three.
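To make the idea concrete outside of UiPath, here is a minimal Python sketch of the pattern described above. It assumes the layout-preserved text separates table cells by two or more whitespace characters, that the header row contains a cell named exactly "Details", and that data rows have the same number of cells as the header. The function names and the sample text are made up for illustration; real invoices may need extra handling for empty cells or wrapped lines.

```python
import re

# Two or more consecutive whitespace characters are treated as a cell separator.
CELL_SEP = re.compile(r"\s{2,}")


def split_cells(line):
    """Split one layout-preserved line into its table cells."""
    return [cell for cell in CELL_SEP.split(line.strip()) if cell]


def extract_column(pdf_text, column="Details"):
    """Return the values found under `column` in rows below the header."""
    values = []
    col_index = None
    n_cols = 0
    for line in pdf_text.splitlines():
        cells = split_cells(line)
        if col_index is None:
            # Still looking for the header row that names the column.
            if column in cells:
                col_index = cells.index(column)
                n_cols = len(cells)
            continue
        # Keep only lines shaped like the header (same number of cells).
        if len(cells) == n_cols:
            values.append(cells[col_index])
    return values


# Made-up sample; only the other column names differ between formats.
sample = """Invoice No: 001

No   Details              Qty   Amount
1    Consulting service   2     100.00
2    Travel expenses      1     50.00

Total   150.00
"""
print(extract_column(sample))
# ['Consulting service', 'Travel expenses']
```

Because the column is located by its header name rather than by a fixed position, the same function can cope with formats in which another column is renamed, as long as the spacing assumption holds.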