Hi,
Is there a general way to extract a specific element or piece of text from a new (or unknown) PDF template, so that the same XAML file can be used to extract data from different PDF templates?
For this purpose you can use Read PDF Text and then build a Regex to find the specific words, or parts of words, that you want to extract from every PDF invoice.
Let's say you want to extract only the headers from the invoices, such as company, bill number…, and the string is the same across invoices; then you can use it like that.
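To make the idea concrete, here is a minimal sketch in Python rather than in a UiPath workflow. The sample text, the header labels, and the `extract_headers` helper are all assumptions for illustration; the real strings depend on what Read PDF Text returns for your invoices.

```python
import re

# Hypothetical text standing in for the output of Read PDF Text.
invoice_text = """Company: Acme Ltd
Bill number: INV-0042
Date: 2021-03-01
"""

# One pattern per header. The labels are assumptions about the invoices.
HEADER_PATTERNS = {
    "company": re.compile(r"Company\s*:\s*(.+)"),
    "bill_number": re.compile(r"Bill number\s*:\s*(\S+)"),
}


def extract_headers(text):
    """Return a dict of header name -> captured value (None if not found)."""
    result = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


headers = extract_headers(invoice_text)
```

The same patterns should carry over to the Matches activity more or less unchanged, as long as the header strings really are identical across the invoices.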
If the document is scanned, you need to use Read PDF With OCR, but the quality of the extracted text depends on the quality of the scanned document.
For better quality you should use Document Understanding, but there you need to specify templates for the invoices.
Cheers,
Dino
@dfilipovic,
How is it possible to get the values corresponding to each header? PDFs can have the value in a different position relative to its header, i.e. to the right of or below the header.
@Gopalakrishnan_K,
Yes, it is possible, and you can use this activity for that purpose: the regex searches for that combination, so the row is not important.
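A small Python sketch of why the row does not matter: `\s+` matches spaces and line breaks alike, so the value is found whether it sits to the right of the header or on the next line. The header "Invoice No" and both sample strings are made up, and this assumes the extracted text puts the value directly after the header, which is not guaranteed for multi-column layouts.

```python
import re

# \s+ matches both a space and a newline between header and value.
PATTERN = re.compile(r"Invoice No\s+(\S+)")


def invoice_number(text):
    """Return the token following the 'Invoice No' header, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical extracted text: value to the right of the header...
right_of_header = "Invoice No INV-7\nDate 2021-03-01"
# ...and value on the line below the header.
below_header = "Invoice No\nINV-7\nDate 2021-03-01"
```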
@dfilipovic,
I already tried it. I think the "Matches" activity only works depending on the invoice template. Is there any general way to read a specific element from the PDF "without any dependency" on the invoice template?
You can use it for different invoices, but the problem is that every invoice needs to follow the same logic. Take Total, for example: Total has to be displayed on every invoice so that you can grab the data. But "Total:" and "Total" are not the same, because the special character : means the match will extract ": 123456" rather than "123456".
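One way around the colon problem, sketched in Python with invented sample strings: make the colon optional with `:?` and capture only the digits, so both variants give the same value. The `extract_total` name is hypothetical, and the pattern assumes the total is a plain run of digits.

```python
import re

# ':?' makes the colon optional; only the digits are captured.
TOTAL_PATTERN = re.compile(r"\bTotal\s*:?\s*(\d+)")


def extract_total(text):
    """Return the digits following 'Total' or 'Total:', or None."""
    match = TOTAL_PATTERN.search(text)
    return match.group(1) if match else None


with_colon = extract_total("Total: 123456")
without_colon = extract_total("Total 123456")
```

Amounts with decimal or thousands separators would need a wider capture group than `\d+`.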
Workflow:
You can build your own machine learning models here on Microsoft with Cognitive Services and Form Recognizer. These features are also used by UiPath for their activities, since UiPath is a Microsoft partner.
Cheers,
Dino
If you want to learn more about Regex, check out my Megapost.
@dfilipovic,
I went through the demo videos for Document Understanding, but I got this error message:
"Data Extraction Scope: Index was outside the bounds of the array"
Also, I can't find the "Position based extractor" activity discussed in the demo video for PDF extraction.
I'm attaching the video: UiPath Document Understanding Demo 2: Data extraction configuration - YouTube
How can this be resolved?
Hi @Gopalakrishnan_K,
have you put Classification(0) as the result from the previous action? I had the same problem because I had put the result from the previous action in it. Then I saw that I only need to put in something with index 0.
You must also put an IF after classification that ends the process if classification is not successful. That can happen, and if you don't have anything to redirect to in that case, the process will break.
I think that the Position based extractor is this activity, but it was renamed in that video:
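As a language-neutral illustration of that IF (plain Python, not UiPath code): indexing element 0 of an empty result is what produces an "index out of bounds" style error, so check first. The `first_classification` helper and the list of results are hypothetical stand-ins for the classifier output.

```python
def first_classification(results):
    """Return the first classification result, or None if there is none.

    'results' is a hypothetical stand-in for the classifier output;
    an empty list models a document that was not classified.
    """
    if not results:
        # Nothing was classified: signal it instead of indexing [0].
        return None
    return results[0]


classified = first_classification(["Invoice"])
unclassified = first_classification([])
```

In the workflow, the `None` branch corresponds to ending or redirecting the process instead of entering the Data Extraction Scope.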
Cheers,
Dino
@dfilipovic,
I'm getting the error because of the same problem. What should be given instead of Classification(0)?
Hi @Gopalakrishnan_K,
no, that is fine. As you can see in my workflow, this is used in Form Extractor. What I wanted to tell you is that there is a possibility that the classifiers didn't classify your document correctly, and that's why you need to put an IF in the workflow like this (image 2).
@dfilipovic,
Some of the values (GST number, PDF with the name shown as a logo) are not getting extracted when using the machine learning extractor. Is there any way to extract them?