Hello,
I’m trying to extract tabular data from a PDF using OCR and convert it into an Excel or CSV file. I know a lot of people have asked this question, but none of the suggested methods worked for me. The problem with my table is that each row contains multiple lines of data. That is to say, a normal OCR read (left to right, line by line) mixes up data from more than one column. The usual answers involve splitting the columns by tabs, but I can’t do that with mine because of the multiline data.
I can’t post the PDF file, but here is basically the structure of the table:
No. | Date       | Description                             | Names           | Total    |
    |            |                                         |                 | Payment  |
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
1.  | 18/03/2018 | Lorem ipsum dolor sit amet, consectetur | Lorem Ipsum,    | 3,020.75 |
    |            | adipiscing elit, sed do eiusmod tempor  | Dolor sit Amet, |          |
    |            | incididunt ut labore et dolore magna    | Consectetur,    |          |
    |            | aliqua. Ut enim ad minim veniam, quis   | Adipiscing      |          |
    |            | nostrud exercitation ullamco laboris    | Elit, Sed Do    |          |
    |            | nisi ut aliquip ex ea commodo           | Eiusmod         |          |
    |            | consequat.                              |                 |          |
------------------------------------------------------------------------------------------
2.  | 20/03/2018 | Lorem ipsum dolor sit amet, consectetur | Lorem Ipsum,    | 5,381.50 |
    |            | adipiscing elit, sed do eiusmod tempor  | Dolor sit Amet, |          |
    |            | incididunt ut labore et dolore magna    | Consectetur,    |          |
    |            | aliqua. Ut enim ad minim veniam, quis   | Adipiscing      |          |
    |            | nostrud exercitation ullamco laboris    | Elit, Sed Do    |          |
    |            | nisi ut aliquip ex ea commodo           | Eiusmod         |          |
    |            | consequat.                              |                 |          |
The “Description” and “Names” columns, as well as the “Total Payment” column header, span multiple lines per table row, so when I tried OCR, the data from different columns merged together.
Here is a snippet of the result:
1. 18/03/2018 Lorem ipsum dolor sit amet, consectetur Lorem Ipsum, 3,020.75 adipiscing elit, sed do eiusmod tempor Dolor sit Amet, incididunt ut labore et dolore magna....
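To make the expected result concrete, here is a minimal Python sketch of the merging I’m after. It assumes the column boundaries are already marked with `|`, as in my hand-typed table, which the raw OCR output does not give me, so it shows the target rather than a solution. The function names `merge_table_lines` and `rows_to_csv` and the shortened sample lines are just placeholders of mine.

```python
import csv
import io


def merge_table_lines(lines, n_cols=5):
    """Merge pipe-delimited physical lines into one logical row per record.

    Assumption: a line whose first cell ("No.") is non-empty starts a new
    record; every other line continues the previous record.
    """
    rows = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and horizontal rules made only of dashes.
        if not stripped or set(stripped) == {"-"}:
            continue
        cells = [cell.strip() for cell in stripped.split("|")]
        # The trailing "|" yields an extra empty cell; pad or trim to n_cols.
        cells = (cells + [""] * n_cols)[:n_cols]
        if cells[0] or not rows:
            rows.append(cells)
        else:
            # Continuation line: append each non-empty cell to its column.
            for i, cell in enumerate(cells):
                if cell:
                    rows[-1][i] = f"{rows[-1][i]} {cell}".strip()
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; fields containing commas get quoted."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Shortened version of the table structure, for illustration only.
    sample = [
        "No. | Date | Description | Names | Total |",
        "| | | | Payment |",
        "----------",
        "1. | 18/03/2018 | Lorem ipsum dolor sit amet, consectetur | Lorem Ipsum, | 3,020.75 |",
        "| | adipiscing elit, sed do eiusmod tempor | Dolor sit Amet,| |",
        "----------",
        "2. | 20/03/2018 | Lorem ipsum dolor sit amet, consectetur | Lorem Ipsum, | 5,381.50 |",
        "| | adipiscing elit, sed do eiusmod tempor | Dolor sit Amet,| |",
    ]
    print(rows_to_csv(merge_table_lines(sample)))
```

In other words, each numbered entry should end up as a single CSV row, with the wrapped “Description” and “Names” text joined back into one cell each.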
How should I go about this problem?
P.S. The PDF is a scan of a physical document, which is why I used OCR instead of FullText or Native.
Thank you in advance!