PDF Table extraction

Dear Forum Team,

I am facing an issue. I have multiple PDF files that contain data both as text and in tabular format. The tabular data can also span multiple pages, as in a bank statement. The page number is not fixed either: the table may start on page 3 or on page 5. Its length also varies; it may run for 4 pages, 3 pages, or 2 pages.

What is the best approach to resolve this problem?

Regards
Anand

Hi @anand.t

You can convert the PDF to Word and then grab the tables from it using the “index of the table”. Each table has a unique index value. In this scenario it is not an issue if a table spans many pages; you can still grab the table easily.

Thank you

Hi @anand.t,

  1. You can try this approach in this thread: Convert PDF Datatable to Excel - Build - UiPath Community Forum

If your table has multiple headers, this first approach may not work, because the table is read as plain text and the values are separated using string manipulation. If it is a standard single-header table, this will work just fine with some .replace(SEPARATORS,",").

  2. Possible solution for multiple headers (may need string manipulation): Brainstorming Solutions for Editing Data in PDFs - Build / Activities - UiPath Community Forum
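
As a rough illustration of the string-manipulation step in the first approach, here is a minimal VB.NET sketch. It assumes the text read from the PDF separates columns with runs of two or more spaces; the ToCsvLine name and the sample lines are made up for the example, so adjust the separator pattern to whatever your PDF actually produces.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SeparatorDemo
    ' Turns one line of extracted PDF text into a CSV line.
    ' Assumption: columns are separated by runs of two or more spaces.
    Function ToCsvLine(line As String) As String
        Return Regex.Replace(line.Trim(), " {2,}", ",")
    End Function

    Sub Main()
        ' Made-up sample lines standing in for text read from a PDF.
        Dim extracted As String() = {
            "Date        Description      Amount",
            "01/02/2023  Card payment     25.00",
            "03/02/2023  Salary           1500.00"
        }

        For Each line As String In extracted
            Console.WriteLine(ToCsvLine(line))
        Next
    End Sub
End Module
```

Values that themselves contain commas (for example an amount written as 1,500.00) would need quoting or a different delimiter.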

Thanks @jeevith and @Rakesh_Sampath

I have 20-page PDFs in which some pages have only text and some pages have data in tabular format. The page index of the tabular data is not fixed. For example, pages 1 to 5 may contain only text and pages 6 to 10 only tabular data. This varies from invoice to invoice.

Any approach for this?

Regards
Anand

You can open any PDF in Word. One thing you need to check before anything else is whether the PDF contains rich text or scanned data (images).

The approaches mentioned above can only extract data if the PDF contains rich text, not if its content consists of physically scanned or software-scanned images. OCR or deep-learning-based methods are better suited to that kind of data extraction.

If your 20 pages are rich text, take a look at this solution from @vvaidya for extracting the table(s): How to read table in a Word document - Build - UiPath Community Forum
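
Since the table pages are not fixed, one option is to read the text page by page and flag the pages whose lines look like table rows; a page that yields no text at all is a hint that it may be scanned. The sketch below is only an illustration: ClassifyPage, the date-based row pattern, the threshold of two rows, and the sample pages are all assumptions, not part of any UiPath activity.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module PageClassifier
    ' Assumption: a table row starts with a date such as 01/02/2023.
    Private ReadOnly RowPattern As New Regex("^\d{2}/\d{2}/\d{4}\s")

    ' Labels one page of extracted text as "no text", "table" or "text".
    Function ClassifyPage(pageText As String) As String
        If String.IsNullOrWhiteSpace(pageText) Then Return "no text (possibly scanned)"

        Dim lines As String() = pageText.Split(New String() {vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries)
        Dim rowCount As Integer = lines.Count(Function(l As String) RowPattern.IsMatch(l.Trim()))

        ' Assumption: two or more row-like lines mean the page holds table data.
        Return If(rowCount >= 2, "table", "text")
    End Function

    Sub Main()
        ' Made-up page texts standing in for what a per-page PDF read returns.
        Dim pages As String() = {
            "Dear customer," & vbLf & "please find your statement below.",
            "Date Description Amount" & vbLf & "01/02/2023 Card payment 25.00" & vbLf & "03/02/2023 Salary 1500.00",
            ""
        }

        For i As Integer = 0 To pages.Length - 1
            Console.WriteLine("Page " & (i + 1) & ": " & ClassifyPage(pages(i)))
        Next
    End Sub
End Module
```

With these sample pages, the first is labelled as text, the second as table, and the third as having no text. For a real statement the row pattern would have to match whatever reliably starts a row in your table.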

Hi All,

The above approach fails for some PDF files. Can I use Document Understanding to pull the data from multi-page PDFs such as bank statements, where the page on which the table appears is not fixed? Is that possible with DU, or should I go with ABBYY FlexiCapture?

Need expert advice here.

See the example in this thread of mine.

Regards
Anand

Hello Anand,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel, and I also have examples with multiple pages:

45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
1:17:10 File 19 PDF with multiple pages and columns with multiple lines

Code:

Thanks,
Cristian Negulescu

Thanks for the help.

Thank you so very much for putting this tutorial together. I watched it many times and learned a lot from applying the technique to different scenarios. My case is very similar to your case 15, but my process did not seem to pick up the correct rows across the various cases of 2nd and 3rd rows. Here is a sample of the unstructured table:

A12345 Apple - by the pound APP-001
B23456 Bagel - a 6 pack BAG-321 Flash Sale
No return
C34567 Cream - Soda special order CRS-999 Non-alcoholic
July Only Order by case
D45678 Danny - Flower red
E56789 Eraser - Head bulk bag 100 count ERA-980
No resale
F67890 Franks - veggie dog, world famous FAN-888
This week only
No return

I have tried to use a pattern match for the first line, like [A-Z]\d{5}. A line that matches is for sure a first row. Then, because of the uneven spaces between the 2nd and 3rd columns, I can try a pattern with a single space before and after the hyphen for the 2nd column, and a hyphen with no spaces around it for the 3rd column. As for detecting the columns within the first row, any run of consecutive spaces will be treated as a column separator.

Example row D45678 is the minimum; all rows will have at least that.

The second-row variation really throws me off: it can be in column 2 or column 3, see cases B23456 and E56789.

Once in a while I will see a case like F67890, where the 2nd and 3rd rows both belong to column 2.

Could you please show me how I can apply your technique to the scenario above?

Thank you!
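
For splitting the first row of each record, one possible sketch is a single regular expression with three named groups. It assumes column 1 is always one capital letter plus five digits and that column 3, when present, starts with a code shaped like APP-001, as in the sample above; SplitFirstRow and the group names col1, col2 and col3 are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FirstRowSplitter
    ' Assumption: column 3, when present, starts with a code shaped like APP-001.
    Private ReadOnly FirstRow As New Regex(
        "^(?<col1>[A-Z]\d{5})\s+(?<col2>.*?)(?:\s+(?<col3>[A-Z]{3}-\d{3}.*))?$")

    ' Returns "col1|col2|col3", or Nothing when the line is not a first row.
    Function SplitFirstRow(line As String) As String
        Dim m As Match = FirstRow.Match(line.Trim())
        If Not m.Success Then Return Nothing
        Return String.Join("|", m.Groups("col1").Value, m.Groups("col2").Value, m.Groups("col3").Value)
    End Function

    Sub Main()
        ' Prints: A12345|Apple - by the pound|APP-001
        Console.WriteLine(SplitFirstRow("A12345 Apple - by the pound APP-001"))
        ' Prints: D45678|Danny - Flower red|   (column 3 is empty)
        Console.WriteLine(SplitFirstRow("D45678 Danny - Flower red"))
    End Sub
End Module
```

This does not settle which column a continuation line such as "No return" belongs to. If the extracted text has lost the leading spaces of those lines, the text alone may not carry enough information to decide.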

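' Assumes strtmp (input text) and strout (output string) are supplied by the caller,
' e.g. as Invoke Code arguments, and that System.Text.RegularExpressions is imported.
Dim rows As String() = strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)

' A new record starts with one capital letter followed by five digits, e.g. A12345
Dim patternRowStart As String = "^[A-Z]\d{5}"
Dim patternColumn2 As String = " - "
Dim patternColumn3 As String = "-[^ ]"

Dim currentRow As String = ""
Dim col2 As String = ""
Dim col3 As String = ""

For Each row As String In rows
    Dim trimmedRow As String = row.Trim()

    If Regex.IsMatch(trimmedRow, patternRowStart) Then
        ' Write out the previous record before starting a new one
        If Not String.IsNullOrEmpty(currentRow) Then
            strout &= currentRow & "|" & col2 & "|" & col3 & Environment.NewLine
        End If

        currentRow = trimmedRow
        col2 = ""
        col3 = ""
    ElseIf Regex.IsMatch(trimmedRow, patternColumn2) Then
        col2 = trimmedRow
    ElseIf Regex.IsMatch(trimmedRow, patternColumn3) Then
        col3 = trimmedRow
    Else
        ' Any other continuation line is appended to the current record
        currentRow &= " " & trimmedRow
    End If
Next

' Write out the last record
If Not String.IsNullOrEmpty(currentRow) Then
    strout &= currentRow & "|" & col2 & "|" & col3 & Environment.NewLine
End If

' strout now holds one pipe-delimited line per record

This seemed to work.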