Tabular data extraction from pdf to excel

Hello,

How can i extract tabular format data from pdf into excel? i have single pdf with multiple pages with the below format of the table

the format of column names would be always
Case Format Kind. Hsm klp/kam @P

i need to extract this table data into excel.

  1. IF you have ABBYY then its a straight forward either using:
    a. Table format
    b. Repeating group.
  2. else need to check weather its a readable format or not, if it is a readable format
    a. Using generate data table you can convert that.
    b. write all extracted text to text file and import that file in excel directly.
  3. If it a OCR need to check the OCR confidence level when you read the data and then perform above actions mentioned.

Hi @sravyarao20,

if it’s digitally created pdf you can use DPF to text activity and to extract table you need to use Regex. so when using regex basically the idea is you need to create delimiter String and then you can use Datatable activity to convert created CSV into data-table and the you can use it

It’s readable format and only one pdf file where it as many tables with same column names I need to extract all the data with those column names

Yes it is digital pdf and column names are also fixed names which we need to extract from single pdf file

Hi @sravyarao20,

I am not able to see the PDF file, Kindly share sample PDF files if it is possible and does not have any sensitive data with that.

You can try with read pdf activity to read that and have to identify the line breaks using some method for each line times then add it to datatable. Finally you can write that datatable into a excel sheet.

Hi @sravyarao20

If possible plzz attach and share the pdf

Happy Automation :raised_hands:

Best Regards
Er Pratik Wavhal :robot::man_technologist:t4: :computer:

Hello Sravya,
In this video, I have 17 use-cases for extracting tables from PDF and write data in Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with a column with multiple words ON the LAST column
27:00 File 5 PDF with a column with multiple words ON inside column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces on that need to be correct
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection empty Cells
58:35 File 12 Big PDF with an empty line and Empty columns and partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF remove spaces from headers also remove space from Data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal

Code:

Thanks,
Cristian Negulescu

1 Like

@Cristian_Negulescu
I am trying to extract table from pdf to datatable. Given Input PDF screenshot below. I followed your steps from your video. But the text got extracted like shown below. Could you please guide me on how to go about it?

Input PDF:

Output Table text:
image

@arunasan - Did you set Preserve Format to True, in the Read PDF activity?

@prasath17 ,
I just did that and now it looks like as shown below. When I tried to generate data table, the headers and column values are not aligning properly. Screenshot below.


@arunasan
Please look at FILE 15 from the video example

Your code should be this one for your case:

      Dim strtmp As String
    Dim cnt As Int32 = 0
    Dim array() As String
    Dim Col1 As String = ""
	Dim Col4 As String = ""
    Dim Coltmp1 As String = ""
    Dim Coltmp2 As String = ""
    strtmp = strin.Substring(strin.IndexOf("(CCN)"), strin.LastIndexOf("TOTAL") - strin.IndexOf("(CCN)") - 6).Trim
    strout = "col1|col2|col3|col4|col5|col6|col7"
    For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)            
            cnt = line.Trim.Split(New String() {"  "}, StringSplitOptions.None).Count
            array = line.Trim.Split(New String() {"  "}, StringSplitOptions.None)
            Select Case cnt
                Case 7
                    If Col2.Length > 1 Then
                        strout = strout + Col1 + "|" + Coltmp1 + "|" + Col4+ "|" + Coltmp2 + Environment.NewLine
                    End If
                    Col1 = array(0)
                    Coltmp1 = array(1)+"|"+array(2)
                    Col4 = array(3)
					Coltmp2 = array(4)+"|"+array(5)+"|"+array(6)
                Case 1
                    Col4 = Col4 + " " + array(0)
                Case 2
                    Col1 = Col1 + " " + array(0)
                    Col4 = Col4 + " " + array(1)
            End Select
        End If
    Next
    strout = strout + Col1 + "|" + Coltmp1 + "|" + Col4+ "|" + Coltmp2 + Environment.NewLine

@Cristian_Negulescu
Thank you Cristian. My requirement is to loop through multiple pdf docs and extract data from the table. CCN is part of the column header. In some docs, it is CCN and in some it is Certification Number. The naming is not consistent. Also, when I executed your code for the pdf example i provided earlier, no records were added to excel. I tried to see what output strtmp is giving by executing this statement below in write line activity. Output began from (CCN) as part of the column values
wordContent.Substring(wordContent.IndexOf(“(CCN)”), wordContent.LastIndexOf(“TOTAL”) - wordContent.IndexOf(“(CCN)”) - 6).Trim

image

Hello @arunasan
In my code the issue code this If Col2.Length > 1 Then here need to put Coltmp1.Length
And also another error this line should be like this
strtmp = strin.Substring(strin.IndexOf(“(CCN)”+6), strin.LastIndexOf(“TOTAL”) - strin.IndexOf(“(CCN)”) - 6).Trim
I miss the +6

Now about multiple PDF you say that not all start with (CCN) but for sure they have something before the header that is stable.
Then to jump over the header the logic should be like this you need to jump off the first line with 6X" " and then Start the process when you find again another line with 6x" " then follow the logic from the code above
Thanks,
Cristian Negulescu

Thank you @Cristian_Negulescu. I made the changes. Still getting the same result. Here is the code:

I just verified multiple PDFs and almost all of the PDF tables have the first column header consistent(“Facility Name”). Can we use this instead of CCN? Here is another example from another PDF table. The same code will not work for this table, right?

I am not sure if I followed your header logic. Could you please explain?

@arunasan your PDFs are complex so if you need to find someone that know VB.NET or C# and build logic for you.
You can try this code for simple data extraction if this is enough for you is perfect, if not find a developer and give him multiple PDF to build logic.

Dim strtmp As String
    Dim first As Boolean = True
    strtmp = strin.Substring(strin.IndexOf("Facility Name"), strin.LastIndexOf("TOTAL") - strin.IndexOf("Facility Name") - 1).Trim
    strout = "col1|col2|col3|col4|col5|col6|col7" + Environment.NewLine
    For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
        cnt = line.Trim.Split(New String() {"  "}, StringSplitOptions.None).Count
        If (cnt = 7) Then
            If (first) Then
                first = False
            Else
                strout = strout + line.Trim.Replace("  ", "|") + Environment.NewLine
            End If
        End If
    Next

@Cristian_Negulescu , Thank you Cristian. I am trying to make it work for a simple data extraction for now. I tried your code. Excel is still empty. I will send you the sample pdf in message. Could you please take a look at my code below and let me know if I am missing something?
Also, strout contains only the col1|col2… Screenshot below.
image