IF you have ABBYY then its a straight forward either using:
a. Table format
b. Repeating group.
else need to check weather its a readable format or not, if it is a readable format
a. Using generate data table you can convert that.
b. write all extracted text to text file and import that file in excel directly.
If it a OCR need to check the OCR confidence level when you read the data and then perform above actions mentioned.
if it’s digitally created pdf you can use DPF to text activity and to extract table you need to use Regex. so when using regex basically the idea is you need to create delimiter String and then you can use Datatable activity to convert created CSV into data-table and the you can use it
I am not able to see the PDF file, Kindly share sample PDF files if it is possible and does not have any sensitive data with that.
You can try with read pdf activity to read that and have to identify the line breaks using some method for each line times then add it to datatable. Finally you can write that datatable into a excel sheet.
Hello Sravya,
In this video, I have 17 use-cases for extracting tables from PDF and write data in Excel:
2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with a column with multiple words ON the LAST column
27:00 File 5 PDF with a column with multiple words ON inside column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces on that need to be correct
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection empty Cells
58:35 File 12 Big PDF with an empty line and Empty columns and partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF remove spaces from headers also remove space from Data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal
@Cristian_Negulescu
I am trying to extract table from pdf to datatable. Given Input PDF screenshot below. I followed your steps from your video. But the text got extracted like shown below. Could you please guide me on how to go about it?
@prasath17 ,
I just did that and now it looks like as shown below. When I tried to generate data table, the headers and column values are not aligning properly. Screenshot below.
@Cristian_Negulescu
Thank you Cristian. My requirement is to loop through multiple pdf docs and extract data from the table. CCN is part of the column header. In some docs, it is CCN and in some it is Certification Number. The naming is not consistent. Also, when I executed your code for the pdf example i provided earlier, no records were added to excel. I tried to see what output strtmp is giving by executing this statement below in write line activity. Output began from (CCN) as part of the column values
wordContent.Substring(wordContent.IndexOf(“(CCN)”), wordContent.LastIndexOf(“TOTAL”) - wordContent.IndexOf(“(CCN)”) - 6).Trim
Hello @arunasan
In my code the issue code this If Col2.Length > 1 Then here need to put Coltmp1.Length
And also another error this line should be like this
strtmp = strin.Substring(strin.IndexOf(“(CCN)”+6), strin.LastIndexOf(“TOTAL”) - strin.IndexOf(“(CCN)”) - 6).Trim
I miss the +6
Now about multiple PDF you say that not all start with (CCN) but for sure they have something before the header that is stable.
Then to jump over the header the logic should be like this you need to jump off the first line with 6X" " and then Start the process when you find again another line with 6x" " then follow the logic from the code above
Thanks,
Cristian Negulescu
I just verified multiple PDFs and almost all of the PDF tables have the first column header consistent(“Facility Name”). Can we use this instead of CCN? Here is another example from another PDF table. The same code will not work for this table, right?
@arunasan your PDFs are complex so if you need to find someone that know VB.NET or C# and build logic for you.
You can try this code for simple data extraction if this is enough for you is perfect, if not find a developer and give him multiple PDF to build logic.
Dim strtmp As String
Dim first As Boolean = True
strtmp = strin.Substring(strin.IndexOf("Facility Name"), strin.LastIndexOf("TOTAL") - strin.IndexOf("Facility Name") - 1).Trim
strout = "col1|col2|col3|col4|col5|col6|col7" + Environment.NewLine
For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
cnt = line.Trim.Split(New String() {" "}, StringSplitOptions.None).Count
If (cnt = 7) Then
If (first) Then
first = False
Else
strout = strout + line.Trim.Replace(" ", "|") + Environment.NewLine
End If
End If
Next
@Cristian_Negulescu , Thank you Cristian. I am trying to make it work for a simple data extraction for now. I tried your code. Excel is still empty. I will send you the sample pdf in message. Could you please take a look at my code below and let me know if I am missing something?
Also, strout contains only the col1|col2… Screenshot below.