Extract Multiline Tabular Data from PDF using OCR

Hello,

I’m trying to extract tabular data from a PDF using OCR and convert it into an Excel or CSV file. I know a lot of people have asked this question, but none of the suggested methods worked for me. The problem with mine is that each table row contains multiple lines of text. That is to say, a normal OCR read (left to right) mixes up data from more than one column. Usually the answers involve splitting the columns by tabs, but I can’t do that for mine because of the multiline data.

I can’t post the pdf file but here is basically the structure of the table:

No. |  Date      | Description                             | Names          | Total     |
    |            |                                         |                | Payment   |
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
1.  | 18/03/2018 | Lorem ipsum dolor sit amet, consectetur | Lorem Ipsum,   | 3,020.75  |
    |            | adipiscing elit, sed do eiusmod tempor  | Dolor sit Amet,|           |
    |            | incididunt ut labore et dolore magna    | Consectetur,   |           |
    |            | aliqua. Ut enim ad minim veniam, quis   | Adipiscing     |           |
    |            | nostrud exercitation ullamco laboris    | Elit, Sed Do   |           |
    |            | nisi ut aliquip ex ea commodo           | Eiusmod        |           |
    |            | consequat.                              |                |           |
------------------------------------------------------------------------------------------
2.  | 20/03/2018 | Lorem ipsum dolor sit amet, consectetur | Lorem Ipsum,   | 5,381.50  |
    |            | adipiscing elit, sed do eiusmod tempor  | Dolor sit Amet,|           |
    |            | incididunt ut labore et dolore magna    | Consectetur,   |           |
    |            | aliqua. Ut enim ad minim veniam, quis   | Adipiscing     |           |
    |            | nostrud exercitation ullamco laboris    | Elit, Sed Do   |           |
    |            | nisi ut aliquip ex ea commodo           | Eiusmod        |           |
    |            | consequat.                              |                |           |

The “Description” and “Names” columns, as well as the “Total Payment” column header, span multiple lines per table row, so when I tried to use OCR, the data merged together.

Here is a snippet of the result:
1. 18/03/2018 Lorem ipsum dolor sit amet, consectetur Lorem Ipsum, 3,020.75 adipiscing elit, sed do eiusmod tempor Dolor sit Amet, incididunt ut labore et dolore magna....

How should I go about this problem?

P.S. The PDF is a scan of a physical document, which is why I used OCR instead of FullText or Native.

Thank you in advance!

Hi Priscilla,

If the data inside the PDF is in a proper tabular/structured format (you can test this with UiExplorer by checking whether you can identify the individual cells, rows, and the table), you can use Data Scraping: select one cell in the table and you will get the entire table scraped into a DataTable.

Alternate approaches:

  1. If the table inside the PDF is not structured, you can use the Get OCR Text activity to retrieve data from individual cells.
  2. You can try changing the accessibility options in your PDF reader (in Adobe Acrobat Reader: Edit -> Accessibility -> Reading Preferences).
  3. Try different OCR engines (Google/Microsoft) to check which returns the better result.

Thanks and Regards,
Tuhin

Hi Tuhin,

Thanks for your reply!

I have several documents to scrape, and unfortunately not all of them have proper formatting. For those with proper formatting, I’ve managed to retrieve the data using Data Scraping, as you mentioned. So now I’m resorting to your alternate approaches for the documents without proper formatting. I have a few further questions about those approaches, though.


Regarding approach 1: would this mean that I have to determine the coordinates of each individual cell, or is it possible to automate it using a pattern of some sort so that they are not hardcoded?

Regarding approach 2: which reading preference should I use in order to retrieve the data? I’m also not entirely sure how this changes the way Screen Scraper works. Would you be able to explain further what the connection is between Reading Preferences and Screen Scraper?


Regards,
Priscilla

You can use relative scraping or an Anchor Base; that way, scraping will still work even if the position changes.
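To illustrate the idea behind anchor-relative scraping, here is a plain Python sketch of the geometry only. It is not the Anchor Base activity itself; the `Box` type, the `region_below` helper, and all pixel values are made up for the example. The target region is stored as an offset from the anchor, so it follows the anchor when the page shifts between scans.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical rectangle in page pixels."""
    left: int
    top: int
    width: int
    height: int


def region_below(anchor: Box, height: int) -> Box:
    """Return the region directly under the anchor, with the anchor's width."""
    return Box(anchor.left, anchor.top + anchor.height, anchor.width, height)


# Assumed position of the "Description" header on two scans;
# the second scan is shifted by (15, 40) pixels.
header_scan_1 = Box(left=190, top=50, width=410, height=30)
header_scan_2 = Box(left=205, top=90, width=410, height=30)

# The cell region is never hardcoded, only its relation to the header is.
cell_scan_1 = region_below(header_scan_1, height=140)
cell_scan_2 = region_below(header_scan_2, height=140)
```

Both regions have the same size and sit in the same place relative to their header, which is why the shift between scans does not matter.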

Which reading preference should I use in order to retrieve the data? I’m also not entirely sure how this changes the way Screen Scraper works. Would you be able to explain further what the connection is between Reading Preferences and Screen Scraper?

When you change the reading preference, the scraper reads in order (left to right and top to bottom). This will help you use string manipulation on the scraped data.
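As a rough sketch of what string manipulation can recover from text read in that order, assuming the flat output looks like the snippet in the original question: the row number, date, and amount have recognisable patterns, so they can be pulled out with regular expressions. The function name, the patterns, and the sample string are assumptions for illustration. Note that the Description and Names text stays interleaved, because the flat text carries no column information.

```python
import re

# A record starts with "<number>. <dd/mm/yyyy>" and runs until the next such
# start or the end of the text (assumed layout, based on the sample output).
RECORD = re.compile(
    r"(?P<no>\d+)\.\s+(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<body>.*?)"
    r"(?=\s+\d+\.\s+\d{2}/\d{2}/\d{4}\s|\Z)",
    re.S,
)
# Amounts such as 3,020.75 (assumed format: thousands commas, two decimals).
AMOUNT = re.compile(r"\b\d{1,3}(?:,\d{3})*\.\d{2}\b")


def split_records(text):
    """Split flat OCR text into one dict per table row."""
    records = []
    for match in RECORD.finditer(text):
        body = match.group("body")
        amount = AMOUNT.search(body)
        # Remove the first amount and normalise whitespace in what is left.
        rest = " ".join(AMOUNT.sub("", body, count=1).split())
        records.append({
            "no": int(match.group("no")),
            "date": match.group("date"),
            "total": amount.group(0) if amount else None,
            "mixed_text": rest,  # Description and Names, still interleaved
        })
    return records


SAMPLE = (
    "1. 18/03/2018 Lorem ipsum dolor sit amet, consectetur Lorem Ipsum, "
    "3,020.75 adipiscing elit, sed do eiusmod tempor Dolor sit Amet, "
    "2. 20/03/2018 Lorem ipsum dolor sit amet, consectetur Lorem Ipsum, "
    "5,381.50 adipiscing elit, sed do eiusmod tempor Dolor sit Amet,"
)
RECORDS = split_records(SAMPLE)
```

This gets the three easy columns, but separating Description from Names still needs position information.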

Would you be able to give me an example of relative scraping or Anchor Base for tabular content? I can’t find one. I was thinking of trying relative scraping, using the table headers as the base region and the cell contents as the scrape targets. However, I can only get the first value. How do I get the rest of the table’s data?

Also, I tried changing the reading preference, but all of them seem to produce the same result for me. The problem is still the mixed columns.
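One way to get every row without hardcoding each cell is to work from word positions rather than flat text. The sketch below is plain Python, not a UiPath workflow, and assumes the OCR engine can return a position for every word (Tesseract's TSV output does, for example). The word list, pixel values, and column boundaries are invented; in practice the boundaries would be measured from the header words. Each word goes to the column whose left edge it falls after, and a new table row starts whenever a token like "1." appears in the first column.

```python
import re
from bisect import bisect_right

COLUMN_NAMES = ["No.", "Date", "Description", "Names", "Total Payment"]
# Assumed left edge (in pixels) of each column, taken from the header row.
COLUMN_STARTS = [0, 50, 190, 610, 790]
ROW_START = re.compile(r"^\d+\.$")

# Hypothetical OCR output: (text, left, top) per word. Words on the same
# printed line are assumed to share the same "top" value.
WORDS = [
    ("Description", 200, 20),
    ("1.", 10, 100), ("18/03/2018", 60, 100), ("Lorem", 200, 100),
    ("ipsum", 260, 100), ("Lorem", 620, 100), ("Ipsum,", 690, 100),
    ("3,020.75", 800, 100),
    ("adipiscing", 200, 120), ("elit,", 300, 120),
    ("Dolor", 620, 120), ("sit", 680, 120),
    ("2.", 10, 200), ("20/03/2018", 60, 200), ("Lorem", 200, 200),
    ("Elit,", 620, 200), ("5,381.50", 800, 200),
]


def rebuild_rows(words, column_starts=COLUMN_STARTS, names=COLUMN_NAMES):
    """Group positioned words into table rows, one dict per row."""
    rows = []
    # Read top to bottom, then left to right.
    for text, left, top in sorted(words, key=lambda w: (w[2], w[1])):
        col = max(bisect_right(column_starts, left) - 1, 0)
        if col == 0 and ROW_START.match(text):
            rows.append({name: [] for name in names})
        if rows:  # anything above the first row marker (headers) is skipped
            rows[-1][names[col]].append(text)
    return [{n: " ".join(parts) for n, parts in row.items()} for row in rows]


ROWS = rebuild_rows(WORDS)
```

Each entry in `ROWS` is a dictionary keyed by column name, so it could be written to a CSV file with the standard `csv.DictWriter`. Real scans will need some tolerance when deciding which words share a line, since the `top` values rarely match exactly.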

Hello Priscilla,

Could you please share the PDF from which you are trying to extract the data? I will make a sample workflow and share it with you.
You can also check the link: https://youtu.be/jncjBCY4Auw

Regards,
Tuhin

Sorry I can’t share the pdf, but I found another pdf online that resembles what I’m working on: klampfl-fig3a.pdf (108.5 KB)

I think the answer to your problem is ABBYY FlexiCapture. At least it’s doable there. No cheap or easy solution exists for this.