Read PDF with regions

I tried to get the information from the PDF with Read PDF Text. But even umlauts are not extracted correctly, and the regions are not separated. Any idea how to get the information in a form where I can at least use regex or other string operations to read the numbers, addresses and costs?
For example, the first line in the table “Ihre Verbrauchskosten” should not include the text from the region on the right-hand side.
In Studio it shows “13 − Ihre VZaerbrhlunaucgehskn osten 2.263,86 €2.304,00 40,14Am 30.05.2018 € auf das wKontoerden IBAN wir d as”
Copying with Ctrl+C in Adobe Reader and using the copied text gives the best results. But I hope the Read PDF activity can do more?

@TP2B - Could you please let us know what data you would like to extract from this PDF? Perhaps you could show us with a screenshot or number tags on the PDF (for example: address is 2)…

Also, have you tried the Document Understanding method?

Get address:
image
Get items:
image

I want to put this into Excel, i.e.:
image

(I haven’t tried DU yet; perhaps that’s the only option?)

@TP2B

Hello Tobias,
nope, DU is not the only way. You can very easily extract information from form-based PDFs. The first step is to convert the PDF to JPG, the second step is to extract the image section that contains the information you need, and the third step is to use Tesseract OCR to read the data. The output CroppedImage is the input of Tesseract OCR.

Step zero is to add the UiPath.PDF.Activities library to your workflow.

Here is an example that extracts page 1:

image

Here is the Invoke Code routine that extracts exactly the section containing the data, in this case only the Verbrauchskosten:

image

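// Crop region in pixels, measured on Side1.jpg (here: the Verbrauchskosten row).
int x = 296, y = 1873, width = 834, height = 57;
Bitmap source = new Bitmap(@"Side1.jpg");
// CroppedImage is the output that is passed on to Tesseract OCR.
CroppedImage = source.Clone(new System.Drawing.Rectangle(x, y, width, height), source.PixelFormat);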

image

It is very easy to determine the section coordinates with MS Paint.

image

image
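
If the corners are read off in Paint as pixel positions, the crop rectangle can be derived from them. The following is only a sketch: the names CropRegion and FromCorners and the page size of 2480 x 3508 pixels are assumptions for illustration, not part of the workflow above. It also clips the region to the image size, since Bitmap.Clone is documented to throw when the rectangle lies outside the bitmap.

```csharp
using System;
using System.Drawing;

public static class CropRegion
{
    // Builds a crop rectangle from the top-left and bottom-right pixel
    // positions read off in an image editor, clipped to the image size.
    // (Hypothetical helper, not part of UiPath or System.Drawing.)
    public static Rectangle FromCorners(int left, int top, int right, int bottom,
                                        int imageWidth, int imageHeight)
    {
        Rectangle wanted = Rectangle.FromLTRB(left, top, right, bottom);
        Rectangle image = new Rectangle(0, 0, imageWidth, imageHeight);
        Rectangle clipped = Rectangle.Intersect(wanted, image);

        // Reject regions that are empty or lie completely outside the image.
        if (clipped.Width <= 0 || clipped.Height <= 0)
            throw new ArgumentException("Crop region lies outside the image.");
        return clipped;
    }

    public static void Main()
    {
        // Corners matching x=296, y=1873, width=834, height=57;
        // the page size is an assumed example value.
        Rectangle r = FromCorners(296, 1873, 1130, 1930, 2480, 3508);
        Console.WriteLine(r);
    }
}
```

Inside Invoke Code, only the rectangle calculation would be needed; the resulting Rectangle can be passed to source.Clone in place of the hard-coded values.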

With appropriate modularization you can use this very universally. Don’t forget to add the German language data for Tesseract.

With these coordinates you get all the information:

int x = 296, y = 1873, width = 834, height = 311;

image

Thank you very much for this good advice.
But let me add two points:
When my code uses capitals (“New” instead of “new”), it stops with “No compiled code to run” and “error CS1002: ; expected”.
I was not able to set the language in Tesseract. It says “Tesseract OCR: Error performing OCR: InvalidInputLanguage”.
Do I have to download the language files as described here: Installing OCR Languages?
I do not have these folders (I am using the exe version of Studio, but even in %localappdata% I do not have them).

Hello Tobias,
C# distinguishes between upper and lower case: new is correct, New is not.
Yes, download the file deu.traineddata and copy it to the path
c:\Users\YourName\.nuget\packages\uipath.vision\3.0.1\build\net461\tessdata\
after you have added the Tesseract OCR activity to your sequence. Once that is done, you can use German text recognition.
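
To check whether the language file has ended up in that folder, something like the following could be run as a small console program. It is a sketch that simply rebuilds the folder layout from the path above; the names TessdataCheck and TessdataFolder are made up, and the version folder (3.0.1 here) depends on the installed UiPath.Vision package.

```csharp
using System;
using System.IO;

public static class TessdataCheck
{
    // Rebuilds the tessdata folder below the user's NuGet package cache.
    // The layout is taken from the path quoted above; the version folder
    // differs per installation. (Hypothetical helper.)
    public static string TessdataFolder(string userProfile, string visionVersion)
    {
        return Path.Combine(userProfile, ".nuget", "packages", "uipath.vision",
                            visionVersion, "build", "net461", "tessdata");
    }

    public static void Main()
    {
        string profile = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
        string file = Path.Combine(TessdataFolder(profile, "3.0.1"), "deu.traineddata");

        // Report whether the German language data is present at that location.
        Console.WriteLine(File.Exists(file) ? "Found: " + file : "Missing: " + file);
    }
}
```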
Best regards
Stefan

Thank you Stefan, that was very helpful. I had to add the UiPath.Vision package as well; this wasn’t clear before. I added the language file to my NuGet folder and it gave a good result. (I changed the profile to “scan”, which takes longer, but all text is recognized correctly now.)
Many thanks!
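
For the regex part of the original question: once the OCR text is clean, the amounts could be pulled out roughly like this. This is a minimal sketch, not something from the workflow above; the class name AmountParser, the pattern, and the sample line are assumptions. The pattern only handles amounts with two decimals and dot-grouped thousands, such as 2.263,86, and it skips dates like 30.05.2018.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class AmountParser
{
    // German-style amounts with exactly two decimals, e.g. "40,14" or "2.263,86".
    private static readonly Regex Amount =
        new Regex(@"(?<!\d)\d{1,3}(?:\.\d{3})*,\d{2}(?!\d)");

    // Explicit separators, so parsing does not depend on installed culture data.
    private static readonly NumberFormatInfo GermanNumbers = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    // Returns all amounts found in the text, in order of appearance.
    public static List<decimal> ExtractAmounts(string ocrText)
    {
        var result = new List<decimal>();
        foreach (Match m in Amount.Matches(ocrText))
        {
            result.Add(decimal.Parse(
                m.Value,
                NumberStyles.AllowThousands | NumberStyles.AllowDecimalPoint,
                GermanNumbers));
        }
        return result;
    }

    public static void Main()
    {
        // Assumed example of a clean OCR line for the first table row.
        string line = "Ihre Verbrauchskosten 2.263,86 €";
        foreach (decimal value in ExtractAmounts(line))
            Console.WriteLine(value.ToString(CultureInfo.InvariantCulture));
    }
}
```

The parsed decimal values can then be written to Excel as numbers rather than as text.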
