I am facing an issue. I have multiple PDF files that contain data as plain text as well as in tabular format. The tabular data can span multiple pages, for example in a bank statement. The starting page is not fixed either: the table may start on page 3 or on page 5. Its length also varies; it may run for 2, 3, or 4 pages.
What is the best approach to resolve this problem?
You can convert the PDF to Word and then grab the tables from it using the "index of the table". Each table has a unique index value, so a table that spans many pages is not an issue; you can still grab it easily.
If your table has multiple headers, this first approach may not work, because the table is read as text and the values are separated using string manipulation. If it is a standard single-header table, this will work just fine with some .replace(SEPARATORS,",").
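To make the separator replacement concrete, here is a minimal sketch in Python. It assumes the table has already been read out as plain text and that columns are separated by tabs, pipes, or runs of two or more spaces; the SEPARATORS pattern and the function name table_text_to_csv are my own illustration, not something the Word conversion gives you. It uses the csv module rather than a bare replace so that values containing commas (amounts like 1,200.00) stay in one cell.

```python
import csv
import io
import re

# Assumption: columns are split by tabs, pipes, or runs of 2+ spaces.
# Adjust this pattern to whatever separators your extracted text really uses.
SEPARATORS = re.compile(r"\t+|\s*\|\s*|\s{2,}")

def table_text_to_csv(table_text):
    """Convert raw single-header table text into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in table_text.splitlines():
        line = line.strip()
        if line:  # skip blank lines between rows
            # csv.writer quotes any cell that itself contains a comma
            writer.writerow(SEPARATORS.split(line))
    return out.getvalue()

if __name__ == "__main__":
    sample = "Date  Description  Amount\n01/02/2023  ATM withdrawal  1,200.00\n"
    print(table_text_to_csv(sample))
```

A single space is deliberately not treated as a separator, otherwise a description such as "ATM withdrawal" would be split into two columns.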
I have 20-page PDFs in which some pages have only text and some pages have data in tabular format. The page index of the tabular data is not fixed. For example, suppose pages 1 to 5 are only text and pages 6 to 10 are only tabular data. This index is not fixed; it varies from invoice to invoice.
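One way to cope with a table whose start page and length vary is to decide per page whether it looks tabular and then stitch the table pages together. The sketch below is only a heuristic under stated assumptions: the text of each page has already been extracted into a list of strings (by whatever tool you use), columns are separated by tabs or two or more spaces, and the header row is repeated unchanged on each table page. The names split_row, is_table_page, and collect_table and the thresholds are hypothetical.

```python
import re

# Assumption: a gap of 2+ spaces or a tab marks a column boundary.
COLUMN_GAP = re.compile(r"\s{2,}|\t+")

def split_row(line):
    """Split one text line into cells, dropping empty cells."""
    return [cell for cell in COLUMN_GAP.split(line.strip()) if cell]

def is_table_page(page_text, min_columns=3, min_rows=3):
    """Treat a page as tabular if enough of its lines split into enough columns."""
    rows = [split_row(line) for line in page_text.splitlines()]
    return sum(len(row) >= min_columns for row in rows) >= min_rows

def collect_table(pages, min_columns=3, min_rows=3):
    """Return (1-based table page numbers, header row, body rows) across all pages."""
    table_pages, header, body = [], None, []
    for number, text in enumerate(pages, start=1):
        if not is_table_page(text, min_columns, min_rows):
            continue  # text-only page
        table_pages.append(number)
        for line in text.splitlines():
            row = split_row(line)
            if len(row) < min_columns:
                continue  # page title, footer, or free text on a table page
            if header is None:
                header = row  # first wide row seen is taken as the header
            elif row != header:
                body.append(row)  # skip headers repeated on later pages
    return table_pages, header, body

if __name__ == "__main__":
    pages = [
        "Dear customer,\nThis is your statement.",
        "Date  Description  Amount\n01/01  Coffee  3.50\n02/01  Rent  900.00",
        "Date  Description  Amount\n03/01  Salary  2000.00\n04/01  Books  25.00",
    ]
    print(collect_table(pages))
```

Because the page numbers are discovered rather than configured, it does not matter whether the table starts on page 3 or page 6. Real statements will likely need a stricter row test, for example requiring a date in the first column.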
You can open any PDF in Word. One thing you need to check before anything else is whether the PDF contains rich text or scanned data (images).
With the approaches mentioned above you can only extract data if the PDF contains rich text, not physically scanned or software-scanned images embedded in the PDF content. OCR or deep-learning-based methods are better suited to that kind of data extraction.
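As a rough illustration of that check, the sketch below sorts pages into "has a text layer" and "probably scanned" based on how much text could be extracted from each page. It assumes you already have the extracted text per page as a list of strings; the function names and the 20-character threshold are arbitrary choices of mine and would need tuning on real files.

```python
def needs_ocr(page_text, min_chars=20):
    """True when a page yields too little extractable text to have a usable text layer."""
    visible = sum(1 for ch in page_text if ch.isalnum())
    return visible < min_chars

def route_pages(page_texts, min_chars=20):
    """Split 1-based page numbers into (text-layer pages, likely scanned pages)."""
    text_pages, ocr_pages = [], []
    for number, text in enumerate(page_texts, start=1):
        if needs_ocr(text, min_chars):
            ocr_pages.append(number)  # send these to OCR
        else:
            text_pages.append(number)  # direct text extraction should work
    return text_pages, ocr_pages

if __name__ == "__main__":
    print(route_pages(["Statement of account for January 2023", "", "  \n "]))
```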
The above approach fails on some PDF files. Can I use Document Understanding to pull the data from multiple PDF pages, e.g. a bank statement, when the table page is not fixed? Is this possible with DU, or should I go with ABBYY FlexiCapture?