Read PDF text for new Invoice template PDF

Hi,
Is there a general way to extract a specific element or piece of text from a new (or unknown) PDF template, so that the same XAML file can be used to extract data from different PDF templates?

Hi @Gopalakrishnan_K,

For this purpose you can use Read PDF Text and then build a Regex to find the specific words, or parts of words, that you want to extract from every PDF invoice.
Let’s say you want to extract only the headers from the invoices, such as company, bill number… If the string is the same across the invoices, you can use it like that.
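
To make the idea concrete, here is a minimal sketch in Python rather than a UiPath workflow. The sample text, the field labels and the `extract_fields` helper are all assumptions made for illustration; in a real workflow the text would come from Read PDF Text and the patterns would go into a regex-matching activity.

```python
import re

# Hypothetical text, standing in for the output of Read PDF Text.
sample_text = """ACME Ltd
Bill number: INV-1042
Company: Example Traders
Total: 123456
"""

# One pattern per field. The labels are assumptions about what the invoices contain.
FIELD_PATTERNS = {
    "company": re.compile(r"Company\s*:?\s*(.+)", re.IGNORECASE),
    "bill_number": re.compile(r"Bill\s+number\s*:?\s*(\S+)", re.IGNORECASE),
}

def extract_fields(text):
    """Return the first match for each field, or None when the label is absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result

fields = extract_fields(sample_text)
```

This only works when the label text is the same across invoices, which is the condition stated above.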

If the document is scanned, you need to use Read PDF With OCR, but the quality of the extracted text depends on the quality of the scanned document.

For better quality you should use Document Understanding, but there you need to specify templates for the invoices.


Cheers,
Dino

@dfilipovic,
How is it possible to get the values corresponding to each header? PDFs can have the value in a different position relative to its header, i.e. to the right of the header or below it.

@Gopalakrishnan_K,
Yes, it is possible, and you can use this activity for that purpose, because the regex searches for that combination; the row is not important.
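
As a rough illustration of why the row may not matter, the Python sketch below lets the whitespace between header and value include a line break, so the value is found whether it sits to the right of the header or on the next line. The `value_for_header` helper and the sample strings are hypothetical. It assumes the extracted text keeps each value directly after its header, so it would not cover a table whose headers are all on one row with the values on the row below.

```python
import re

def value_for_header(text, header):
    """Return the text that follows `header`, on the same line or the next one."""
    # \s* also matches newlines, so a value placed below the header is reached too.
    pattern = re.compile(re.escape(header) + r"\s*:?\s*([^\n]+)", re.IGNORECASE)
    match = pattern.search(text)
    return match.group(1).strip() if match else None

# Hypothetical extracted text in two layouts.
value_right = "Invoice No: INV-7\nDate: 2020-01-01\n"
value_below = "Invoice No\nINV-7\nDate\n2020-01-01\n"

right_result = value_for_header(value_right, "Invoice No")
below_result = value_for_header(value_below, "Invoice No")
```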


@dfilipovic,
I already tried it. I think the ‘Matches’ activity only works in a way that depends on the invoice template. Is there any general way to read a specific element from the PDF “without any dependency” on the invoice template?

Hi @Gopalakrishnan_K,

You can use it for different invoices, but the problem is that every invoice needs to follow the same logic. Take Total as an example: Total should be displayed on every invoice so that you can grab the data. But ‘Total:’ and ‘Total’ are not the same, because the special character ‘:’ means the match will extract ‘: 123456’ rather than ‘123456’.
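
One possible way around the ‘Total:’ versus ‘Total’ difference is to make the colon optional and capture only the number. The Python sketch below shows the idea; the pattern, the `extract_total` helper and the sample strings are assumptions for illustration, not something taken from an actual invoice.

```python
import re

# \b keeps "Subtotal" from matching; the colon is optional; only the number is captured.
TOTAL_PATTERN = re.compile(r"\bTotal\b\s*:?\s*(\d+(?:[.,]\d+)*)", re.IGNORECASE)

def extract_total(text):
    """Return the number that follows the Total label, or None if there is none."""
    match = TOTAL_PATTERN.search(text)
    return match.group(1) if match else None

with_colon = extract_total("Total: 123456")
without_colon = extract_total("Total 123456")
```

Both calls return only the digits, without the leading colon.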

Workflow:

You can also build your own machine learning models at Microsoft with Cognitive Services and Form Recognizer. These features are also used by UiPath for its activities, since UiPath is a Microsoft partner.


Cheers,
Dino

If you want to learn more about Regex, check out my Megapost.

@dfilipovic,
I went through the demo videos for Document Understanding, but I got this error message:
‘Data Extraction Scope: Index was outside the bounds of the array’
Also, I can’t find the ‘Position based extractor’ activity discussed in the demo video for PDF extraction.
Here is the video: https://www.youtube.com/watch?v=ePUWVQftJCg&t=1s
How can this be resolved?

Hi @Gopalakrishnan_K,
Have you passed Classification(0) as the result from the previous action? I had the same problem because I had passed the whole result from the previous action. Then I saw that I only need to pass the element at index 0.
Also, you must put an IF after the classification that ends the process if the classification is not successful. That can happen, and if you have nothing to redirect to in that case, the process will break.
I think that Position based extractor is this activity, but it was renamed in that video:

Cheers,
Dino

@dfilipovic,
I’m getting the error because of the same problem. What should be given instead of classification(0)?

Hi @Gopalakrishnan_K,
No, that is fine. As you can see in my workflow, this is used in the Form Extractor. What I wanted to tell you is that the classifiers may not have classified your document correctly, and that’s why you need to put an IF in the workflow like this (image 2).
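
The logic behind that IF can be sketched in plain Python. This is not UiPath code: the `first_classification` helper and the sample lists are hypothetical, and the sketch assumes the classifier output behaves like a list that is empty when nothing was classified. Reading index 0 of an empty list is what produces an "index was outside the bounds" style of error, so the check comes first.

```python
def first_classification(results):
    """Return the first classification result, or None when nothing was classified."""
    if not results:
        # Nothing to extract from: the caller should end or redirect the process here.
        return None
    return results[0]

# Hypothetical classifier outputs.
classified = first_classification(["Invoice"])
unclassified = first_classification([])
```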

@dfilipovic,
Some of the values (GST number, a name that appears in the PDF as a logo) are not extracted when using the machine learning extractor. Is there any way to extract them?