Read PDF text scraping consistency


I am trying to scrape invoice PDFs using the Read PDF activity and capture the required field values from the scraped data. Scraping works for every PDF type, but the way the data is scraped is not consistent.

See the attachment (12.2 KB), which contains the PDF and its scraped data in a txt file.

For multi-line fields, the text is sometimes scraped field by field, and sometimes the first line of every field is scraped before the second line of every field.

| Unit \n Price | Total \n Amount |

Is scraped as:

and sometimes it is scraped as:

Unit Total
Price Amount

Is there a way to keep the scraped data consistent?

If the same PDF is sometimes scraped one way and sometimes the other, then I don't think there is much you can do to make that more consistent.

We have a process that scrapes invoices for a dozen clients or so and for each one we had to give the bot a fair bit of leeway in how it processed data. For your example above, our process would be something like:
if lines 1 - 4 are one element each then price is on line 2
otherwise if lines 1 and 2 are 2 elements each, price is first element on line 2
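
To make those two rules concrete, here is a minimal Python sketch (not our actual workflow, which is built in UiPath). The function name `normalize_header` is hypothetical, and it assumes exactly two columns whose headers each wrap onto two lines, as in the "Unit Price" / "Total Amount" example.

```python
def normalize_header(lines):
    """Return the column headers from either scraped layout.

    Hypothetical helper: assumes two columns, each with a header that
    wraps onto two lines ("Unit Price", "Total Amount").
    """
    # Split every non-empty line into its whitespace-separated elements.
    tokens = [line.split() for line in lines if line.strip()]

    # Layout A: lines 1-4 hold one element each (scraped field by field).
    if len(tokens) >= 4 and all(len(t) == 1 for t in tokens[:4]):
        words = [t[0] for t in tokens[:4]]
        return [f"{words[0]} {words[1]}", f"{words[2]} {words[3]}"]

    # Layout B: lines 1 and 2 hold two elements each (scraped row by row).
    if len(tokens) >= 2 and all(len(t) == 2 for t in tokens[:2]):
        first, second = tokens[0], tokens[1]
        return [f"{first[0]} {second[0]}", f"{first[1]} {second[1]}"]

    # Anything else is a case we have not told the bot about yet.
    raise ValueError("unrecognized header layout")
```

Both `normalize_header(["Unit", "Price", "Total", "Amount"])` and `normalize_header(["Unit Total", "Price Amount"])` give the same two headers, so everything downstream only has to deal with one shape.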

It’s a bit of a pain, but as long as you know what your specific cases are you can very clearly tell the bot how to handle the data that it reads.

What if you don’t know what type of invoice you will get? In my case, I don’t know the invoice structure and that is why I need some consistent way to scrape the pdf data.

We have ours set up so that folder A is always client A, folder B is always client B, etc. That way we know the basic format from the beginning. If you can do something like that it would be best. Maybe add some identifying information to the file name?

Additionally, are there logos on any of the invoices? Or special headers? Something that the bot can check to figure out how it should go about scraping the data?

If there is no way to identify invoices like this then you would need one master scrape workflow with a great degree of branches to account for all of the possible formats that you get.
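
As a rough illustration of that identification step, here is a Python sketch that tries the folder name, then a file-name prefix, then header text in the scraped output. The client keys, marker strings, and the `identify_client` name are all made up for the example; logos cannot be detected this way because they do not appear in scraped text.

```python
from pathlib import Path

# Hypothetical markers: text expected somewhere in each client's invoice header.
HEADER_MARKERS = {
    "client_a": "ACME SUPPLIES LTD",
    "client_b": "GLOBEX INVOICE",
}


def identify_client(pdf_path, scraped_text):
    """Guess which client's layout applies, or return None if unknown."""
    path = Path(pdf_path)

    # 1. Folder name, e.g. invoices/client_a/1234.pdf
    if path.parent.name in HEADER_MARKERS:
        return path.parent.name

    # 2. File-name prefix, e.g. client_b_1234.pdf
    for client in HEADER_MARKERS:
        if path.name.lower().startswith(client + "_"):
            return client

    # 3. Header text found in the scraped output (case-insensitive).
    upper = scraped_text.upper()
    for client, marker in HEADER_MARKERS.items():
        if marker in upper:
            return client

    return None
```

A `None` result is the case where the invoice would have to go to the master workflow or to a person for review.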

Ok. I will try that. Thanks a lot!

I would suggest ABBYY for these types of files.
Alternatively, if the PDFs come from different clients, you can use an If condition and apply client-specific logic based on it, which might solve the issue.

Let us know if this helps,
Pavan H


Hi @pavanh003,

Thank you for your reply. I tried using Abbyy Cloud OCR but I am facing the same issue.

This is my PDF

Which is scraped as,

See the fields highlighted in yellow. According to the PDF, after delivery instructions the quantity field was supposed to be scraped.
I know I can add an If condition for each vendor, but I am trying to find a more generic approach.