Extract tabular data from Read-Only PDF

I could have tried that if the column/cell separators in the extracted string were tabs (I wish they were), but every word is separated by a space, regardless of which column it was extracted from.
Take the first row of the table as it appears in the extracted string:

8/7/2012 007 168 RRR DDDD LLL 3633 LOOP LAKE RREEE DDFDF GA 30506 6855 GGGG GL ENN $2,000.00 $753.75 $1,246.25

In the PDF:
RRR DDDD LLL - cell 3
3633 LOOP LAKE RREEE DDFDF GA - cell 4
etc…

In my scraped string, there is no way to tell which words belong to which cell of the row. Every word in the row is extracted one after another, separated by a space.
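To make the problem concrete, here is a small Python sketch (Python is only used for illustration; it is not what the scraping tool runs). It splits the sample row above on spaces and shows that the tokens carry no cell boundaries. The slice positions for cells 3 and 4 are hard-coded from looking at the PDF, which is exactly the information the scraped string does not contain.

```python
# The first table row, exactly as it comes out of the scraper.
row = ("8/7/2012 007 168 RRR DDDD LLL 3633 LOOP LAKE RREEE DDFDF GA "
       "30506 6855 GGGG GL ENN $2,000.00 $753.75 $1,246.25")

# Splitting on spaces yields one token per word, not one per cell.
tokens = row.split(" ")
print(len(tokens))

# Cells 3 and 4 can only be rebuilt if the token positions are already
# known from the PDF. These indices are hand-picked, not derived.
cell_3 = " ".join(tokens[3:6])
cell_4 = " ".join(tokens[6:12])
print(cell_3)
print(cell_4)

# Had the separators been tabs, this split would return one item per cell.
# With space-separated text it returns the whole row as a single item.
tab_cells = row.split("\t")
print(len(tab_cells))
```

Since the number of words per cell varies from row to row, fixed token positions like these would not work for the rest of the table.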

I raised this issue in the webinar, and they said they don't have an easy or straightforward solution for it right now. The only possible way is screen scraping: scrape each column's region separately, which gives a string containing all the cells in that column, convert that string into an array, repeat the same for the other columns, and then combine the arrays into a datatable. But in my case, the data in the PDF may change at a later stage (the PDF comes from a place where it may get updated later, and I have to update my extracted file accordingly). Rows may get added or deleted, in which case even this solution fails.
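The combining step of the column-by-column approach could look roughly like the Python sketch below. It assumes each column scrape returns one cell per line; that output format, the function name `combine_columns`, and the second sample row are all made up for illustration and do not come from the tool or from my PDF.

```python
def combine_columns(column_texts):
    """Turn per-column scraped strings into a list of table rows.

    Assumes one cell per line in each column string (an assumption
    about the scraper output, not a documented guarantee).
    """
    columns = []
    for text in column_texts:
        # Blank lines are dropped, so a genuinely empty cell would
        # shift that column up by one row and break the alignment.
        cells = [line.strip() for line in text.splitlines() if line.strip()]
        columns.append(cells)

    # Every column must yield the same number of cells, otherwise the
    # rows cannot be lined up reliably.
    lengths = {len(cells) for cells in columns}
    if len(lengths) > 1:
        raise ValueError(f"columns have different row counts: {sorted(lengths)}")

    # zip(*columns) pairs the i-th cell of every column into row i.
    return [list(row) for row in zip(*columns)]


# Hypothetical scrape results for cells 3 and 4; the second row is
# placeholder data.
column_3 = "RRR DDDD LLL\nAAA BBB"
column_4 = "3633 LOOP LAKE RREEE DDFDF GA\n12 SAMPLE ST"

rows = combine_columns([column_3, column_4])
for row in rows:
    print(row)
```

The row-count check only catches columns that disagree with each other. It does not help when rows are added or deleted in every column at once.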

If anybody has a static PDF that is a scanned image and needs to extract table-format data from a single page, they can use this method. It extracts the data perfectly; I have tried it.
