Read PDF - same content, different layout

Hi guys,

I have three single-page PDF files with the same content but different layouts.

Is there a way to get the same text from all three files?

I tried “Read PDF Text” with the following result:

The first example in particular is missing some content at the beginning of the text.

Thanks a lot.

Use Anchor Base with screen scraping.

Okay, and which element should be the anchor?

What do you mean by this?

Try setting Preserve Formatting to True in the Read PDF Text activity.

Well, I just want the whole content of the table in one single output string, like in example no. 3.

What I mean is that in example no. 1 the whole first table cell, “45500”, is missing, and the “EUR” in the third cell is converted to “E”.

In example no. 2, the “92” at the end of the output string is in the wrong position.

Do you understand what I mean?

I already tried setting “Preserve Formatting” to True, but it doesn’t change anything in the result.

Yes, this is the kind of output you get when you convert a PDF to text. It is not a 1:1 conversion.

If your PDF is an image, try reading it with the Read PDF With OCR activity.

If that doesn’t work, you can try the CV activities (if allowed in your organization). There is a CV extract table activity that converts a table to a data table.

PDF read Forum Qurey.zip (660.7 KB)
Use this workflow to scrape data from the PDF, or use screen scraping with a dynamic selector and a For Each loop. I hope this is helpful for you. If it is, please mark it as the solution.

Unfortunately, that does not work :frowning:
My PDF table looks slightly different from yours (please have a look at the example above).

So use screen scraping; it will work.

Not really… :frowning:
The result is not in a structured format; it is all mixed up.

using “full text” option:

using “OCR” option:

I can’t identify the table fields based on their content.
I need the result in exactly the same order as it is shown in the PDF table.

Use string manipulation to get the required result.
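
As a rough illustration of what string manipulation could look like here (a sketch in Python rather than a UiPath workflow; the sample strings and the function name `normalize_pdf_text` are made up for the example), collapsing all whitespace makes outputs that differ only in spacing and line breaks compare equal:

```python
import re


def normalize_pdf_text(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    return re.sub(r"\s+", " ", raw).strip()


# Hypothetical extraction results for the same table in two different layouts.
layout_a = "45500   Invoice\r\n100,00 EUR"
layout_b = "45500 Invoice 100,00   EUR\n"

# Both strings reduce to the same single-line text.
print(normalize_pdf_text(layout_a) == normalize_pdf_text(layout_b))
```

This only evens out spacing differences. It cannot bring back text that the extraction dropped (such as the missing “45500”) or repair values that come out in the wrong position.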

Hello Michael,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel:

In your case, you need to filter out multiple spaces. Here are the timestamps for you:

42:15 File 9: PDF with multiple spaces that need to be corrected
1:12:50 File 17: simple PDF, remove spaces from headers and from data
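
The multiple-spaces idea can be sketched in Python, assuming that columns in the extracted text are separated by at least two spaces while single spaces only occur inside a cell. The sample text and the function names are hypothetical, and this is not the code from the video:

```python
import re


def split_row(line: str) -> list[str]:
    """Split one text line into cells wherever two or more spaces occur."""
    parts = re.split(r" {2,}", line.strip())
    return [part.strip() for part in parts if part.strip()]


def text_to_rows(text: str) -> list[list[str]]:
    """Turn extracted text into a list of rows, skipping blank lines."""
    return [split_row(line) for line in text.splitlines() if line.strip()]


# Hypothetical text as a layout-preserving extraction might return it.
sample = "Order No    Amount   Currency\n45500       100,00   EUR\n"

for row in text_to_rows(sample):
    print(row)
```

If a layout puts only a single space between two columns, this rule merges them into one cell, so the threshold would need adjusting per file.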

Code:

Thanks,
Cristian Negulescu

Hi Cristian,

Many thanks for this video.
I’ll check whether it works for my case.

Regards,
Michael
