Extract tabular data from Read-Only PDF


#1

Hello,

I have to extract a table from a read-only PDF (which also contains other data). I tried converting the PDF to Word and Excel, and nothing works. All permissions on the PDF are disabled except print. The PDF is basically an image, so the content cannot be copied. I have tried all the scraping methods with no luck. I also tried some online converters (thinking I could at least automate this); some failed, and some worked but not perfectly (a single column of the table is split into multiple columns in the converted Excel file).

Is there any other way of achieving this?

TIA.


#2

I think "Read PDF with OCR" should work. Could you please attach a sample PDF file if it's not confidential?


#3

Hi @ddpadil,

Thank you for the quick response!

I did not try ‘Read PDF with OCR’. However, I tried to read the file using the Google OCR engine, which did not give me a proper result, so I assumed this method would give the same result. I will try reading the file with ‘Read PDF with OCR’ and check whether that works!


#4

I tried to extract using ‘Read PDF with OCR’. By reducing the scale I got a more accurate result, which still has some issues, e.g. an ‘i’ is prepended to each line. I can clean these up with some formatting, but I still cannot use this method, because it returns the result as a single string. I don’t have a way to split the data into the corresponding columns as in the PDF, because columns 3, 4 and 5 contain both numbers and letters, so I cannot filter them in any way. The other columns can be filtered and extracted into their columns with rules such as: if a word starts with a number, push it into this column; if the next word starts with $, push it into the next column; and so on.


#5

Just an idea… Why don’t you read each line as a string and replace each tab (space won’t work because of the names) with a comma? That way you get a structured pattern which you can use as a datatable.


#6

I could have tried that if the column/cell separators in the extracted string were tabs (I wish they were), but every word is separated by a space, irrespective of which column it was extracted from.
If you take the first row of the table in the string,

8/7/2012 007 168 RRR DDDD LLL 3633 LOOP LAKE RREEE DDFDF GA 30506 6855 GGGG GL ENN $2,000.00 $753.75 $1,246.25

In the PDF:
RRR DDDD LLL - cell 3
3633 LOOP LAKE RREEE DDFDF GA - cell 4
etc…

In my scraped string, there is no way to identify which cell of the row each word belongs to. Every word in the row is extracted one after the other, separated by a space.
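
To illustrate how far rule-based filtering can go on a row like the one above, here is a minimal Python sketch. It assumes (based only on the sample row) that every row starts with a date like 8/7/2012 and ends with exactly three dollar amounts; the function name split_row is made up for illustration. Everything between the date and the amounts stays as one unsplit string, because a space-separated string carries no information about where those cells end.

```python
import re

# Assumptions taken from the single sample row in this thread:
#   - the row starts with a date such as 8/7/2012
#   - the row ends with exactly three dollar amounts
# Everything in between is returned unsplit as "middle".
ROW_PATTERN = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/\d{4})\s+"
    r"(?P<middle>.+?)\s+"
    r"(?P<amounts>(?:\$[\d,]+\.\d{2}\s*){3})$"
)


def split_row(line):
    """Split one OCR'd row into date, unsplit middle text, and amounts.

    Returns None when the line does not match the assumed layout.
    """
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "date": match.group("date"),
        "middle": match.group("middle"),
        "amounts": match.group("amounts").split(),
    }


if __name__ == "__main__":
    row = ("8/7/2012 007 168 RRR DDDD LLL 3633 LOOP LAKE RREEE DDFDF GA "
           "30506 6855 GGGG GL ENN $2,000.00 $753.75 $1,246.25")
    print(split_row(row))
```

The date and the amounts come out cleanly, but the "middle" part still contains several cells glued together, which is exactly the problem described here.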

I raised this issue in the webinar, and they said they don’t have an easy/straightforward solution for this right now. The only possible way is to use the screen scraping method and extract each column separately: scrape only that column’s region, which gives a string output of all cells in that column, convert it into an array, repeat the same for the other columns, and then combine the arrays into a datatable. But in my case the data in the PDF may change at a later stage (the PDF is pulled from a place where it may get updated later, and I have to update my extracted file accordingly). Rows may get added or deleted, in which case even this solution fails.

If anybody has a static PDF that is a scanned image and needs to extract table-format data (single page), they can use this method. It extracts the data perfectly; I have tried it.
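
For anyone who wants to see the combining step of the column-by-column approach outside the workflow designer, here is a rough Python sketch. It assumes each scraped column region comes back as one string with one cell per line; the function name combine_columns and the second data row are made up for illustration and are not from the actual PDF.

```python
def combine_columns(column_texts):
    """Turn per-column scrape results into a list of table rows.

    column_texts: list of strings, one per scraped column region,
    each assumed to hold one cell per line.
    """
    columns = [
        [cell.strip() for cell in text.splitlines() if cell.strip()]
        for text in column_texts
    ]
    # If OCR dropped or merged a line in one column, the rows would be
    # misaligned, so refuse to combine columns of different lengths.
    lengths = {len(column) for column in columns}
    if len(lengths) > 1:
        raise ValueError(f"columns have different row counts: {sorted(lengths)}")
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    # Hypothetical scrape output for two column regions.
    date_column = "8/7/2012\n9/1/2012\n"
    amount_column = "$2,000.00\n$150.00\n"
    for row in combine_columns([date_column, amount_column]):
        print(row)
```

The length check is a deliberate choice: a cell that is empty or wraps onto two lines would shift every row below it, so it seems safer to fail loudly than to build a misaligned table.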

