Extract Multiline Tabular Data from PDF using OCR

priscilla · January 8, 2019, 5:33am

Hello,

I’m trying to extract tabular data from a PDF using OCR and convert it into an Excel or CSV file. I know a lot of people have asked this question, but none of the suggested methods worked for me. The problem is that each table row holds multiple lines of text, so a normal OCR pass (left to right) interleaves the text of more than one column. The usual answers involve splitting the columns by tabs, but that doesn’t work for my multiline data.

I can’t post the pdf file but here is basically the structure of the table:

No. |  Date      | Description                             | Names          | Total     |
    |            |                                         |                | Payment   |
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
1.  | 18/03/2018 | Lorem ipsum dolor sit amet, consectetur | Lorem Ipsum,   | 3,020.75  |
    |            | adipiscing elit, sed do eiusmod tempor  | Dolor sit Amet,|           |
    |            | incididunt ut labore et dolore magna    | Consectetur,   |           |
    |            | aliqua. Ut enim ad minim veniam, quis   | Adipiscing     |           |
    |            | nostrud exercitation ullamco laboris    | Elit, Sed Do   |           |
    |            | nisi ut aliquip ex ea commodo           | Eiusmod        |           |
    |            | consequat.                              |                |           |
------------------------------------------------------------------------------------------
2.  | 20/03/2018 | Lorem ipsum dolor sit amet, consectetur | Lorem Ipsum,   | 5,381.50  |
    |            | adipiscing elit, sed do eiusmod tempor  | Dolor sit Amet,|           |
    |            | incididunt ut labore et dolore magna    | Consectetur,   |           |
    |            | aliqua. Ut enim ad minim veniam, quis   | Adipiscing     |           |
    |            | nostrud exercitation ullamco laboris    | Elit, Sed Do   |           |
    |            | nisi ut aliquip ex ea commodo           | Eiusmod        |           |
    |            | consequat.                              |                |           |

The “Description” and “Names” columns, as well as the “Total Payment” column header, span multiple lines per table row, so when I tried to use OCR, the data merged together.

Here is a snippet of the result:
1. 18/03/2018 Lorem ipsum dolor sit amet, consectetur Lorem Ipsum, 3,020.75 adipiscing elit, sed do eiusmod tempor Dolor sit Amet, incididunt ut labore et dolore magna....

How should I go about this problem?

P.S. The PDF is a scan of a physical document, which is why I used OCR instead of FullText or Native.

Thank you in advance!

Tuhin_Samanta · January 8, 2019, 7:13am

Hi Priscilla,

If the data inside the PDF is in a proper tabular/structured format (you can test this with UI Explorer by checking whether you can identify the individual cell, row, and table), you can use Data Scraping: select one cell in the table and you will get the entire table scraped into a DataTable.

Alternate Approaches:

If the table inside the PDF is not structured, you can use the Get OCR Text activity to retrieve data from individual cells.
You can try changing the accessibility options in the PDF reader (in Adobe Acrobat Reader: Edit → Accessibility → Reading Preferences).
Try different OCR engines (Google/Microsoft) to check which returns the better result.
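Editor's sketch for the first approach: if the positions of the table's vertical and horizontal rules are known, the per-cell regions can be derived from them instead of being hardcoded one by one. The pixel values and the `cell_boxes` helper below are hypothetical; how each region is then handed to an OCR step depends on the tool.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def cell_boxes(col_edges: List[int], row_edges: List[int]) -> List[List[Box]]:
    """Build one box per cell, row by row, from sorted column and row edges."""
    grid = []
    for top, bottom in zip(row_edges, row_edges[1:]):
        row = [(left, top, right, bottom)
               for left, right in zip(col_edges, col_edges[1:])]
        grid.append(row)
    return grid


# Hypothetical positions of the rules on the scanned page (assumed values).
COL_EDGES = [0, 40, 150, 560, 730, 840]  # 5 columns
ROW_EDGES = [120, 260, 400]              # 2 table rows

boxes = cell_boxes(COL_EDGES, ROW_EDGES)
for row in boxes:
    print(row)
```

Each box covers the full height of a table row, so a multiline cell is read as one region and its text never mixes with the neighbouring column.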

Thanks and Regards,
Tuhin

priscilla · January 8, 2019, 7:52am

Hi Tuhin,

Thanks for your reply!

I have several documents to scrape, and unfortunately not all of them have proper formatting. For those with proper formatting, I’ve managed to retrieve the data using Data Scraping, as you mentioned. So now I’m resorting to your alternate approaches for the documents without proper formatting. I have several further questions about those approaches, though.

Would this mean that I have to determine the coordinates of each individual cell, or is it possible to automate it using a pattern of some sort so that they are not hardcoded?

Which reading preference should I use in order to retrieve the data? I’m also not entirely sure how this changes the way Screen Scraper works. Would you be able to explain further what the connection is between Reading Preferences and Screen Scraper?

Regards,

Priscilla

Tuhin_Samanta · January 8, 2019, 8:01am

You can use relative scraping or Anchor Base; in that case, scraping will work even if the position changes.
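Editor's sketch of the anchor idea outside UiPath: if the OCR step returns each word together with its position, the header words can act as anchors that mark where each column starts, and every body word is then assigned to the column it falls under. The word tuples, the `tol` tolerance and all coordinates below are assumptions for illustration, not output from any specific engine.

```python
from typing import List, Tuple

Word = Tuple[str, int, int]  # (text, left x, top y), assumed OCR output shape


def column_starts(header_words: List[Word]) -> List[int]:
    """Use the left edge of each header word as the start of its column."""
    return sorted(x for _, x, _ in header_words)


def column_index(x: int, starts: List[int], tol: int = 5) -> int:
    """Index of the last column starting at or before x (within tol pixels)."""
    idx = 0
    for i, start in enumerate(starts):
        if x >= start - tol:
            idx = i
    return idx


def rebuild_row(words: List[Word], starts: List[int]) -> List[str]:
    """Bucket the words of one table row into columns, keeping reading order."""
    cells: List[List[str]] = [[] for _ in starts]
    # Sort top to bottom, then left to right. This assumes words on the same
    # text line share the same y; real OCR output may need rounding first.
    for text, x, _ in sorted(words, key=lambda w: (w[2], w[1])):
        cells[column_index(x, starts)].append(text)
    return [" ".join(cell) for cell in cells]


# First header line only, one anchor word per column (hypothetical positions).
header = [("No.", 0, 0), ("Date", 40, 0), ("Description", 150, 0),
          ("Names", 560, 0), ("Total", 730, 0)]
row_words = [
    ("1.", 0, 100), ("18/03/2018", 40, 100), ("Lorem", 150, 100),
    ("ipsum", 200, 100), ("Lorem", 560, 100), ("Ipsum,", 610, 100),
    ("3,020.75", 730, 100),
    ("adipiscing", 150, 120), ("elit,", 240, 120),
    ("Dolor", 560, 120), ("sit", 610, 120),
]

row = rebuild_row(row_words, column_starts(header))
print(row)
```

The words of one table row still have to be collected first, for example by taking everything between two horizontal separator lines.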

Which reading preference should I use in order to retrieve the data? I’m also not entirely sure how this changes the way Screen Scraper works. Would you be able to explain further what the connection is between Reading Preferences and Screen Scraper?

When you change the reading preference, the scraper reads in order (left to right and top to bottom). That makes it easier to use string manipulation on the scraped data.
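Editor's sketch of the kind of string manipulation meant here. It only applies if the scraped text keeps its line breaks and some column separator (a `|` is assumed below, as in the table layout from the first post). Fully flattened output like the snippet in the question cannot be repaired this way. The `merge_multiline_rows` helper is hypothetical.

```python
from typing import List


def merge_multiline_rows(lines: List[str], sep: str = "|") -> List[List[str]]:
    """Fold continuation lines (empty first cell) into the preceding record."""
    records: List[List[str]] = []
    for line in lines:
        if sep not in line:
            continue  # skip rules such as "-----" and blank lines
        cells = [cell.strip() for cell in line.split(sep)]
        if cells[-1] == "":
            cells.pop()  # drop the empty piece after a trailing separator
        if cells[0]:
            records.append(cells)  # a filled first cell starts a new record
        elif records:
            previous = records[-1]
            for i, cell in enumerate(cells):
                if cell and i < len(previous):
                    previous[i] = (previous[i] + " " + cell).strip()
    return records


sample = [
    "No. | Date | Description | Names | Total |",
    "    |      |             |       | Payment |",
    "-----",
    "1. | 18/03/2018 | Lorem ipsum dolor | Lorem Ipsum, | 3,020.75 |",
    "   |            | sit amet.         | Dolor        |          |",
]

records = merge_multiline_rows(sample)
for record in records:
    print(record)
```

The rule relies on the "No." column being filled only on the first line of each table row, which holds for the layout described in the question.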

priscilla · January 8, 2019, 8:45am

Would you be able to give me an example of the relative scraping or anchor base for tabular contents, as I can’t find it? I was thinking of trying the relative scraping, and using the table headers as the base region, with the cell contents as the scrape materials. However, I can only get the first data. How do I get the rest of the table’s data?

Also, I tried changing the reading preference, but all of the options seem to produce the same result for me. The problem is still the mixed columns.

Tuhin_Samanta · January 9, 2019, 5:09am

Hello Priscilla,

Could you please share the PDF from which you are trying to extract the data? I will make a sample workflow and share it with you.
You can also check the link: - YouTube

Regards,
Tuhin

priscilla · January 10, 2019, 2:02am

Sorry, I can’t share the PDF, but I found another PDF online that resembles what I’m working on: klampfl-fig3a.pdf (108.5 KB)

Uemoe · January 10, 2019, 4:59am

I think the answer to your problem is ABBYY FlexiCapture. At least it’s doable there. No cheap or easy solution exists for this.

Sudharsan_Ka · June 11, 2021, 1:58pm

Hello @Tuhin_Samanta ,
The video link you shared is no longer available on YouTube. Do you have any other links for the same task?
