Tabular data extraction from pdf to excel

sravyarao20 · September 10, 2020, 2:00pm

Hello,

How can i extract tabular format data from pdf into excel? i have single pdf with multiple pages with the below format of the table

the format of column names would be always
Case Format Kind. Hsm klp/kam @P

i need to extract this table data into excel.

Divyashreem · September 10, 2020, 7:20pm

IF you have ABBYY then its a straight forward either using:
a. Table format
b. Repeating group.
else need to check weather its a readable format or not, if it is a readable format
a. Using generate data table you can convert that.
b. write all extracted text to text file and import that file in excel directly.
If it a OCR need to check the OCR confidence level when you read the data and then perform above actions mentioned.

SamanGuruge · September 11, 2020, 1:44am

Hi @sravyarao20,

if it’s digitally created pdf you can use DPF to text activity and to extract table you need to use Regex. so when using regex basically the idea is you need to create delimiter String and then you can use Datatable activity to convert created CSV into data-table and the you can use it

sravyarao20 · September 11, 2020, 4:01am

It’s readable format and only one pdf file where it as many tables with same column names I need to extract all the data with those column names

sravyarao20 · September 11, 2020, 4:03am

Yes it is digital pdf and column names are also fixed names which we need to extract from single pdf file

sarathi125 · September 11, 2020, 10:23am

Hi @sravyarao20,

I am not able to see the PDF file, Kindly share sample PDF files if it is possible and does not have any sensitive data with that.

You can try with read pdf activity to read that and have to identify the line breaks using some method for each line times then add it to datatable. Finally you can write that datatable into a excel sheet.

Pratik_Wavhal · September 27, 2020, 10:45am

Hi @sravyarao20

If possible plzz attach and share the pdf

Happy Automation

Best Regards
Er Pratik Wavhal

Cristian_Negulescu · March 1, 2021, 7:51am

Hello Sravya,
In this video, I have 17 use-cases for extracting tables from PDF and write data in Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with a column with multiple words ON the LAST column
27:00 File 5 PDF with a column with multiple words ON inside column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces on that need to be correct
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection empty Cells
58:35 File 12 Big PDF with an empty line and Empty columns and partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF remove spaces from headers also remove space from Data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal

Code:

github.com

cristinegulescu/startUiPathFromSalesforce/blob/master/PDFdecode.txt

        'FILE1
        Dim strtmp As String
        strtmp = strin.Substring(strin.IndexOf("Number"), strin.IndexOf("Subtotal") - strin.IndexOf("Number")).Trim
        strout = strtmp.Replace(" ", "|")

        strtmp = strin.Substring(strin.IndexOf("Subtotal") + 8)
        strpar = strtmp.Substring(0, strtmp.IndexOf(Environment.NewLine)).Trim


        'FILE2
        Dim strtmp As String
        Dim strout As String
        strout = "Col1|Col2|Col3|Col4"
        strtmp = strin.Substring(strin.IndexOf("Vacancies") + 11).Trim
        For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
            If (line.Length > 3) Then
                If (IsNumeric(line(0))) And (line(1) = " ") And (line(2) = " ") Then
                    strout = strout + Environment.NewLine + line.Replace("  ", "").Replace("  ", "|").Trim
                ElseIf (line(0) = "") And (line(1) = " ") And (line(2) = " ") Then
                    strout = strout + line.Replace("  ", "$").Trim()

This file has been truncated. show original

Thanks,
Cristian Negulescu

arunasan · March 4, 2021, 2:44am

@Cristian_Negulescu
I am trying to extract table from pdf to datatable. Given Input PDF screenshot below. I followed your steps from your video. But the text got extracted like shown below. Could you please guide me on how to go about it?

Input PDF:

Output Table text:

prasath17 · March 4, 2021, 2:52am

@arunasan - Did you set Preserve Format to True, in the Read PDF activity?

arunasan · March 4, 2021, 2:59am

@prasath17 ,
I just did that and now it looks like as shown below. When I tried to generate data table, the headers and column values are not aligning properly. Screenshot below.

Cristian_Negulescu · March 4, 2021, 5:22am

@arunasan
Please look at FILE 15 from the video example

Your code should be this one for your case:

      Dim strtmp As String
    Dim cnt As Int32 = 0
    Dim array() As String
    Dim Col1 As String = ""
	Dim Col4 As String = ""
    Dim Coltmp1 As String = ""
    Dim Coltmp2 As String = ""
    strtmp = strin.Substring(strin.IndexOf("(CCN)"), strin.LastIndexOf("TOTAL") - strin.IndexOf("(CCN)") - 6).Trim
    strout = "col1|col2|col3|col4|col5|col6|col7"
    For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)            
            cnt = line.Trim.Split(New String() {"  "}, StringSplitOptions.None).Count
            array = line.Trim.Split(New String() {"  "}, StringSplitOptions.None)
            Select Case cnt
                Case 7
                    If Col2.Length > 1 Then
                        strout = strout + Col1 + "|" + Coltmp1 + "|" + Col4+ "|" + Coltmp2 + Environment.NewLine
                    End If
                    Col1 = array(0)
                    Coltmp1 = array(1)+"|"+array(2)
                    Col4 = array(3)
					Coltmp2 = array(4)+"|"+array(5)+"|"+array(6)
                Case 1
                    Col4 = Col4 + " " + array(0)
                Case 2
                    Col1 = Col1 + " " + array(0)
                    Col4 = Col4 + " " + array(1)
            End Select
        End If
    Next
    strout = strout + Col1 + "|" + Coltmp1 + "|" + Col4+ "|" + Coltmp2 + Environment.NewLine

arunasan · March 4, 2021, 5:20pm

Cristian_Negulescu:

Dim strtmp As String
    Dim cnt As Int32 = 0
    Dim array() As String
    Dim Col1 As String = ""
	Dim Col4 As String = ""
    Dim Coltmp1 As String = ""
    Dim Coltmp2 As String = ""
    strtmp = strin.Substring(strin.IndexOf("(CCN)"), strin.LastIndexOf("TOTAL") - strin.IndexOf("(CCN)") - 6).Trim
    strout = "col1|col2|col3|col4|col5|col6|col7"
    For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)            
            cnt = line.Trim.Split(New String() {"  "}, StringSplitOptions.None).Count
            array = line.Trim.Split(New String() {"  "}, StringSplitOptions.None)
            Select Case cnt
                Case 7
                    If Col2.Length > 1 Then
                        strout = strout + Col1 + "|" + Coltmp1 + "|" + Col4+ "|" + Coltmp2 + Environment.NewLine
                    End If
                    Col1 = array(0)
                    Coltmp1 = array(1)+"|"+array(2)
                    Col4 = array(3)
					Coltmp2 = array(4)+"|"+array(5)+"|"+array(6)
                Case 1
                    Col4 = Col4 + " " + array(0)
                Case 2
                    Col1 = Col1 + " " + array(0)
                    Col4 = Col4 + " " + array(1)
            End Select
        End If
    Next
    strout = strout + Col1 + "|" + Coltmp1 + "|" + Col4+ "|" + Coltmp2 + Environment.NewLine

@Cristian_Negulescu
Thank you Cristian. My requirement is to loop through multiple pdf docs and extract data from the table. CCN is part of the column header. In some docs, it is CCN and in some it is Certification Number. The naming is not consistent. Also, when I executed your code for the pdf example i provided earlier, no records were added to excel. I tried to see what output strtmp is giving by executing this statement below in write line activity. Output began from (CCN) as part of the column values
wordContent.Substring(wordContent.IndexOf(“(CCN)”), wordContent.LastIndexOf(“TOTAL”) - wordContent.IndexOf(“(CCN)”) - 6).Trim

Cristian_Negulescu · March 4, 2021, 7:10pm

Hello @arunasan
In my code the issue code this If Col2.Length > 1 Then here need to put Coltmp1.Length
And also another error this line should be like this
strtmp = strin.Substring(strin.IndexOf(“(CCN)”+6), strin.LastIndexOf(“TOTAL”) - strin.IndexOf(“(CCN)”) - 6).Trim
I miss the +6

Now about multiple PDF you say that not all start with (CCN) but for sure they have something before the header that is stable.
Then to jump over the header the logic should be like this you need to jump off the first line with 6X" " and then Start the process when you find again another line with 6x" " then follow the logic from the code above
Thanks,
Cristian Negulescu

arunasan · March 4, 2021, 7:45pm

Thank you @Cristian_Negulescu. I made the changes. Still getting the same result. Here is the code:

I just verified multiple PDFs and almost all of the PDF tables have the first column header consistent(“Facility Name”). Can we use this instead of CCN? Here is another example from another PDF table. The same code will not work for this table, right?

I am not sure if I followed your header logic. Could you please explain?

Cristian_Negulescu · March 5, 2021, 9:25am

@arunasan your PDFs are complex so if you need to find someone that know VB.NET or C# and build logic for you.
You can try this code for simple data extraction if this is enough for you is perfect, if not find a developer and give him multiple PDF to build logic.

Dim strtmp As String
    Dim first As Boolean = True
    strtmp = strin.Substring(strin.IndexOf("Facility Name"), strin.LastIndexOf("TOTAL") - strin.IndexOf("Facility Name") - 1).Trim
    strout = "col1|col2|col3|col4|col5|col6|col7" + Environment.NewLine
    For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
        cnt = line.Trim.Split(New String() {"  "}, StringSplitOptions.None).Count
        If (cnt = 7) Then
            If (first) Then
                first = False
            Else
                strout = strout + line.Trim.Replace("  ", "|") + Environment.NewLine
            End If
        End If
    Next

arunasan · March 5, 2021, 4:07pm

@Cristian_Negulescu , Thank you Cristian. I am trying to make it work for a simple data extraction for now. I tried your code. Excel is still empty. I will send you the sample pdf in message. Could you please take a look at my code below and let me know if I am missing something?
Also, strout contains only the col1|col2… Screenshot below.

Topic		Replies	Views
How to extract table data from pdf RPA Discussions general	10	3606	April 23, 2022
Extract PDF tabular data Studio datatable , excel , pdf , activities , data_scraping	10	1709	February 24, 2020
Extract table from pdf as it is Activities pdf , studio	15	6065	March 4, 2024
Extracting PDF data into Excel Help excel , pdf , activities	26	24659	May 6, 2021
How to extract a table from pdf to excel Studio excel , activities	18	6541	July 19, 2023

Most Active Users - Yesterday
Anil_G
lrtetala
ashokkarale
mkankatala
Anuja_Ravikant_Khamitkar
sullivanne
chaitanyaKumar
jagadeesh
OttoCorrect
Kismet_Tosun
More details...

Tabular data extraction from pdf to excel

Related Topics