Read PDF with regions

@TP2B

Hello Tobias,
nope, DU is not the only way. You can very easily extract information from a form-based PDF. The first step is to convert the PDF to JPG, the second is to extract the image section that contains the information you need, and the third is to run Tesseract OCR on that section to get the data. The output CroppedImage is the input for Tesseract OCR.

Step zero is to add the UiPath.PDF.Activities library to your workflow.

Here is an example that extracts page 1:

image

Here is the Invoke Code routine that extracts exactly the section containing the data, in this case only the Verbrauchskosten (consumption costs):

image

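// Pixel coordinates of the region on the page image that holds the value.
int x = 296, y = 1873, width = 834, height = 57;
// Load the exported page and copy only that region into the CroppedImage output argument.
Bitmap source = new Bitmap(@"Side1.jpg");
CroppedImage = source.Clone(new System.Drawing.Rectangle(x, y, width, height), source.PixelFormat);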

image

It is very easy to determine the section coordinates with MS Paint: the status bar shows the pixel position of the cursor and the size of the current selection.

image

image
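If you read two opposite corners of the section from the image editor, the rectangle values follow by simple subtraction. The lines below are only a sketch of that arithmetic: the corner values are hypothetical readings, chosen so that they reproduce the rectangle used above, and the variable names are made up. Inside Invoke Code you would leave out the using line.

```csharp
using System;

// Hypothetical corner readings from the image editor (not taken from the screenshots).
int x1 = 296, y1 = 1873;   // top-left corner of the section
int x2 = 1130, y2 = 1930;  // bottom-right corner of the section

// Min/Abs make the result independent of which corner was read first.
int x = Math.Min(x1, x2);
int y = Math.Min(y1, y2);
int width = Math.Abs(x2 - x1);
int height = Math.Abs(y2 - y1);

Console.WriteLine($"x={x}, y={y}, width={width}, height={height}");
// prints "x=296, y=1873, width=834, height=57"
```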

With appropriate modularization you can reuse this approach for many different forms. Don't forget to add the German language data for Tesseract.
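One possible way to modularize this is to keep the regions in a lookup table and validate them before cropping. This is only a sketch under assumptions: the class name Regions, the key names and the helper ClampToImage are hypothetical; only the coordinates come from this post. Invoke Code accepts statements only, so a class like this would have to live somewhere else, for example in a small library referenced by the workflow.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

static class Regions
{
    // Named regions per form field; the key names are made up for illustration.
    public static readonly Dictionary<string, Rectangle> ByField =
        new Dictionary<string, Rectangle>
        {
            ["Verbrauchskosten"] = new Rectangle(296, 1873, 834, 57),
            ["AllRows"] = new Rectangle(296, 1873, 834, 311),
        };

    // Cuts a region down to the part that lies inside the image.
    // Bitmap.Clone throws if the rectangle reaches outside the source bitmap.
    public static Rectangle ClampToImage(Rectangle region, int imageWidth, int imageHeight)
    {
        Rectangle clamped = Rectangle.Intersect(
            region, new Rectangle(0, 0, imageWidth, imageHeight));
        if (clamped.Width <= 0 || clamped.Height <= 0)
            throw new ArgumentException("Region lies outside the image.", nameof(region));
        return clamped;
    }
}
```

The crop call then becomes source.Clone(Regions.ClampToImage(Regions.ByField["Verbrauchskosten"], source.Width, source.Height), source.PixelFormat), and a new form only needs new entries in the table.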

With these coordinates you get all the information:

int x= 296, y=1873, width=834, height=311;

image
