Read PDF with regions

StefanSchnell · July 25, 2021, 3:55pm

Hello Tobias,
no, DU is not the only way. You can very easily extract information from a form-based PDF. The first step is to convert the PDF to JPG, the second step is to extract the image section that contains the information you need, and the third step is to use Tesseract OCR to read the data. The output of the second step, CroppedImage, is the input of Tesseract OCR.

Step zero is to add the UiPath.PDF.Activities library to your workflow.

Here is an example that extracts page 1:

Here is the Invoke Code routine that extracts exactly the section containing the data, in this case only the Verbrauchskosten (consumption costs):

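// Region of interest in pixels; x and y are the top-left corner
int x = 296, y = 1873, width = 834, height = 57;
// Load the page image that was exported from the PDF
Bitmap source = new Bitmap(@"Side1.jpg");
// CroppedImage is the Out argument that is passed on to Tesseract OCR
CroppedImage = source.Clone(new System.Drawing.Rectangle(x, y, width, height), source.PixelFormat);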

The section coordinates are very easy to determine with MS Paint: its status bar shows the pixel position of the cursor.

With appropriate modularization, for example by passing the coordinates in as arguments, you can reuse this for many different forms. Don’t forget to add the German language data for Tesseract.
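One possible way to modularize the cropping step is sketched here, assuming it lives in a small custom C# library rather than directly in an Invoke Code body. The names RegionCropper, FitsInside and Crop are hypothetical and not part of UiPath or System.Drawing; only Bitmap.Clone and Rectangle.Contains are existing framework calls. The check up front is there because a region outside the image makes Clone fail with an error that is hard to interpret.

```csharp
using System;
using System.Drawing;

// Hypothetical helper class, not part of any UiPath package.
public static class RegionCropper
{
    // True if the region is non-empty and lies completely inside an image of the given size.
    public static bool FitsInside(Rectangle region, Size imageSize)
    {
        return region.Width > 0 && region.Height > 0
            && new Rectangle(Point.Empty, imageSize).Contains(region);
    }

    // Returns the requested region of the page image.
    // The caller keeps ownership of the source bitmap.
    public static Bitmap Crop(Bitmap source, Rectangle region)
    {
        if (!FitsInside(region, source.Size))
            throw new ArgumentException("Region lies outside the image.", nameof(region));
        return source.Clone(region, source.PixelFormat);
    }
}
```

With such a helper, the routine above would shrink to something like CroppedImage = RegionCropper.Crop(new Bitmap(@"Side1.jpg"), new Rectangle(296, 1873, 834, 57)); and other fields only need a different rectangle.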

With these coordinates you get all the information:

int x= 296, y=1873, width=834, height=311;
