PDF Table extraction

Dear Forum Team,

I am facing an issue. I have multiple PDF files that contain data both as plain text and in tabular format. The tabular data can span multiple pages, as in a bank statement. The starting page is not fixed either: the table may start on page 3 or on page 5. Its length also varies; it may run for 2, 3, or 4 pages.

What is the best approach to resolve this problem?

Regards
Anand

Hi @anand.t

You can convert the PDF to Word and then grab the tables from it using the “Index of the table”; each table has a unique index value. In this scenario, a table that spans many pages is not an issue. You can grab the table easily!

Thank you

Hi @anand.t,

  1. You can try this approach in this thread: Convert PDF Datatable to Excel - Build - UiPath Community Forum

If your table has multiple headers, this first approach may not work, because the table is read as text and the values are separated using string manipulation. If it is a standard single-header table, it will work just fine with some .replace(SEPARATORS,",").

  2. Possible solution for multiple headers (may need string manipulation): Brainstorming Solutions for Editing Data in PDFs - Build / Activities - UiPath Community Forum
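
To illustrate the separator-replacement idea for a single-header table, here is a minimal Python sketch (not UiPath code). It assumes the text read from the PDF keeps columns apart with tabs or runs of two or more spaces; the `SEPARATORS` pattern, the `text_table_to_csv` helper, and the sample rows are all hypothetical and would need adjusting to the real file.

```python
import csv
import io
import re

# Assumption: columns are separated by tabs or by two or more spaces.
SEPARATORS = re.compile(r"\t+| {2,}")


def text_table_to_csv(raw_text: str) -> str:
    """Turn whitespace-aligned table text into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        # Split on the separators instead of replacing them with ",",
        # so values that already contain a comma get quoted correctly.
        writer.writerow(SEPARATORS.split(line))
    return out.getvalue()


# Hypothetical sample of text read from a single-header table.
sample = (
    "Date        Description        Amount\n"
    "01/02/2021  ATM withdrawal     -50.00\n"
    "03/02/2021  Salary             1,200.00\n"
)
csv_text = text_table_to_csv(sample)
print(csv_text)
```

Splitting and letting the CSV writer do the quoting avoids the problem a plain replace has when a cell such as an amount already contains a comma.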

Thanks @jeevith and @Rakesh_Sampath

I have 20-page PDFs in which some pages contain only text and some pages contain data in tabular format. The page index of the tabular data is not fixed. For example, pages 1 to 5 may be text only and pages 6 to 10 tabular data only. This varies from invoice to invoice.

Any approach for this?

Regards
Anand

You can open any PDF in Word. One thing you need to check before anything else is whether the PDF contains rich text or scanned data (images).

You can only extract data with the approaches mentioned above if the PDF contains rich text, not if its content consists of physically scanned or software-scanned images. OCR or deep-learning-based methods are better suited to that kind of data extraction.

If your 20 pages are rich text, then take a look at this solution from @vvaidya for extracting tables: How to read table in a Word document - Build - UiPath Community Forum
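
For rich-text PDFs where the table pages move around, one possible way to find them is to read the text page by page and keep only the pages that look tabular. The Python sketch below is a rough heuristic under stated assumptions, not a UiPath activity: it assumes the per-page text is already available as a list of strings, that columns are separated by tabs or two or more spaces, and that the header row repeats on every table page. The helper names (`split_row`, `is_table_page`, `collect_table_rows`) and the thresholds are hypothetical.

```python
import re

# Assumption: columns are separated by tabs or by two or more spaces.
COLUMN_GAP = re.compile(r"\t+| {2,}")


def split_row(line):
    """Split one text line into its non-empty cells."""
    return [cell for cell in COLUMN_GAP.split(line.strip()) if cell]


def is_table_page(page_text, min_columns=3, min_ratio=0.6):
    """Guess whether most non-blank lines on a page look like table rows."""
    lines = [ln for ln in page_text.splitlines() if ln.strip()]
    if not lines:
        return False
    tabular = sum(1 for ln in lines if len(split_row(ln)) >= min_columns)
    return tabular / len(lines) >= min_ratio


def collect_table_rows(pages, min_columns=3):
    """Stitch rows from all table-like pages, keeping the header only once."""
    header, rows = None, []
    for page_text in pages:
        if not is_table_page(page_text, min_columns):
            continue  # text-only page, wherever it sits in the document
        for line in page_text.splitlines():
            cells = split_row(line)
            if len(cells) < min_columns:
                continue  # stray text on a table page
            if header is None:
                header = cells  # first tabular line is taken as the header
            elif cells != header:
                rows.append(cells)  # skip headers repeated on later pages
    return header, rows


# Hypothetical per-page text: one text page, then a table over two pages.
pages = [
    "Dear customer,\nplease find your statement below.",
    "Date  Description  Amount\n01/02/2021  ATM withdrawal  -50.00\n",
    "Date  Description  Amount\n03/02/2021  Salary  1200.00\n",
]
header, rows = collect_table_rows(pages)
print(header)
print(rows)
```

Because the check runs on every page, it does not matter whether the table starts on page 3 or page 6, or how many pages it covers. The thresholds would likely need tuning on real statements.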

Hi All,

The above approach fails for some PDF files. Can I use Document Understanding to pull the data from multi-page PDFs such as bank statements, where the page on which the table appears is not fixed? Is this possible with DU, or should I go with ABBYY FlexiCapture?

Need expert advice here.

See the example in this thread of mine.

Regards
Anand

Hello Anand,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel, and I also have examples with multiple pages:

45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
1:17:10 File 19 PDF with multiple pages and columns with multiple lines

Code:

Thanks,
Cristian Negulescu

Thanks for the help.