Read PDF text scraping consistency

Pranav_Bafna · April 12, 2019, 9:13am

Hi,

I am trying to scrape invoice PDFs using the Read PDF activity and capture the required field values from the scraped data. Scraping works perfectly fine for any PDF type, but the layout of the scraped text is not consistent.

See this Invoice.zip (12.2 KB) containing the PDF and its scraped data in a txt file.

For a multi-line field, the text is sometimes scraped field by field, and sometimes the first line of every field is scraped before the second line of every field.

Example:
| Unit \n Price | Total \n Amount |

Is scraped as:
Unit
Price
Total
Amount

and sometimes it is scraped as:

Unit Total
Price Amount

Is there a way to keep the scraped data consistent?

DanielMitchell · April 12, 2019, 12:57pm

If the same PDF is sometimes scraped one way and sometimes the other, then I don’t think there is much you can do to make that more consistent.

We have a process that scrapes invoices for a dozen clients or so and for each one we had to give the bot a fair bit of leeway in how it processed data. For your example above, our process would be something like:
if lines 1 - 4 are one element each then price is on line 2
otherwise if lines 1 and 2 are 2 elements each, price is first element on line 2

It’s a bit of a pain, but as long as you know what your specific cases are you can very clearly tell the bot how to handle the data that it reads.
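To illustrate the branching described above, here is a minimal sketch in Python rather than a UiPath workflow. It assumes the scraped text has already been split into lines; the function name `header_columns` is hypothetical and it only covers the two layouts from the example in this thread.

```python
def header_columns(lines):
    """Rebuild the two column headers from either scraped layout.

    `lines` is the scraped header area, already split into lines.
    Only the two layouts discussed in this thread are handled.
    """
    # Split every non-empty line into whitespace-separated elements.
    tokens = [line.split() for line in lines if line.strip()]

    if len(tokens) >= 4 and all(len(t) == 1 for t in tokens[:4]):
        # Layout A: lines 1-4 hold one element each (field by field).
        words = [t[0] for t in tokens[:4]]
        return [f"{words[0]} {words[1]}", f"{words[2]} {words[3]}"]

    if len(tokens) >= 2 and all(len(t) == 2 for t in tokens[:2]):
        # Layout B: lines 1 and 2 hold two elements each (row by row).
        first, second = tokens[0], tokens[1]
        return [f"{a} {b}" for a, b in zip(first, second)]

    # Anything else is a case the bot has not been taught yet.
    raise ValueError("unrecognised header layout")


print(header_columns(["Unit", "Price", "Total", "Amount"]))
print(header_columns(["Unit Total", "Price Amount"]))
```

Both calls yield the same pair of headers, so the rest of the process can work from one normalised form. Raising on an unknown layout is a deliberate choice here: it surfaces a new case instead of silently misreading it.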

Pranav_Bafna · April 12, 2019, 2:10pm

What if you don’t know what type of invoice you will get? In my case, I don’t know the invoice structure and that is why I need some consistent way to scrape the pdf data.

DanielMitchell · April 12, 2019, 2:20pm

We have ours set up so that folder A is always client A, folder B is always client B, etc. That way we know the basic format from the beginning. If you can do something like that it would be best. Maybe add some identifying information to the file name?

Additionally, are there logos on any of the invoices? Or special headers? Something that the bot can check to figure out how it should go about scraping the data?

If there is no way to identify invoices like this, then you would need one master scrape workflow with a large number of branches to account for all of the possible formats that you get.
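The identification step could look roughly like the sketch below, again in Python rather than a UiPath workflow. The client keys, the marker strings and the function name `identify_client` are all hypothetical placeholders; it checks the folder name first, then the file name, then text from the invoice header.

```python
from pathlib import PurePath

# Hypothetical markers: replace with text that really appears
# in each client's invoice header.
HEADER_MARKERS = {
    "client_a": "ACME SUPPLIES",
    "client_b": "GLOBEX INVOICE",
}


def identify_client(pdf_path, scraped_text):
    """Return a client key, or None if the invoice cannot be identified."""
    path = PurePath(pdf_path)

    # 1. Folder convention: folder A is always client A, and so on.
    folder = path.parent.name.lower()
    if folder in HEADER_MARKERS:
        return folder

    # 2. Identifying information at the start of the file name.
    stem = path.stem.lower()
    for client in HEADER_MARKERS:
        if stem.startswith(client):
            return client

    # 3. A special header string somewhere in the scraped text.
    upper_text = scraped_text.upper()
    for client, marker in HEADER_MARKERS.items():
        if marker in upper_text:
            return client

    return None


print(identify_client("invoices/client_a/inv001.pdf", ""))
print(identify_client("inbox/scan17.pdf", "Acme Supplies Ltd\nInvoice 42"))
```

An invoice that comes back as `None` is one that needs either a new entry in the table or the master workflow with all the branches.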

Pranav_Bafna · April 12, 2019, 2:24pm

Ok. I will try that. Thanks a lot!

pavanh003 · April 12, 2019, 2:49pm

Hey,
I would suggest ABBYY for this type of file.
Alternatively, if the PDFs come from different clients, you can use an If condition to detect the client and apply the matching logic; that might solve the issue.
Otherwise, the ABBYY tool should be able to help you with this.

Let us know if this helps,
Regards,
Pavan H

Pranav_Bafna · April 15, 2019, 5:40am

Hi @pavanh003,

Thank you for your reply. I tried using ABBYY Cloud OCR, but I am facing the same issue.

This is my PDF: [image not preserved]

It is scraped as: [image: scrapedpdf]

See the fields highlighted in yellow. According to the PDF, the quantity field should have been scraped right after the delivery instructions.
I know I can add an If condition for each vendor, but I am trying to find a more generic approach.
