How to Extract a Table from a PDF?

Hi everyone,

I have multiple PDF bank statements and I want to extract the transaction table from each of them. How can I extract that table? I tried data scraping, but it works for only one PDF, even after I changed the selectors.

Thanks in advance,
Krishnareddy

We have two other options:
– screen scraping
or
– use the Read PDF Text or Read PDF With OCR activity, which returns the text as a string variable; the required values can then be obtained with string manipulation such as regex or the Split method
or
as a more advanced option, if we have an Adobe license, we can use the Microsoft Interop method to read the table structure in the PDF:
– save the PDF as a DOC file; that DOC file can then be read along with its table contents
But we need an Adobe license for that.
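
To illustrate the read-text-then-regex option above: once the PDF text is available as one string, a pattern applied line by line can pick out the transaction columns. Below is a minimal Python sketch of that idea, written outside UiPath. The row layout (date, description, amount, balance), the `ROW` pattern, the `parse_transactions` helper and the sample text are all assumptions made up for illustration; a real statement will need its own pattern.

```python
import re
from decimal import Decimal

# Assumed (hypothetical) row layout: "DD/MM/YYYY  description  amount  balance".
ROW = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})\s+"
    r"(?P<balance>-?[\d,]+\.\d{2})$"
)


def parse_transactions(text):
    """Return one dict per line that looks like a transaction row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, footer, or blank line: not a transaction row
        row = match.groupdict()
        # Drop thousands separators before converting to a number.
        row["amount"] = Decimal(row["amount"].replace(",", ""))
        row["balance"] = Decimal(row["balance"].replace(",", ""))
        rows.append(row)
    return rows


# Made-up sample standing in for the string returned by the PDF read step.
sample = """Date        Description          Amount     Balance
01/02/2020  ATM WITHDRAWAL       -500.00    9,500.00
03/02/2020  SALARY CREDIT ACME   25,000.00  34,500.00
Page 1 of 3"""

transactions = parse_transactions(sample)
for t in transactions:
    print(t["date"], t["description"], t["amount"], t["balance"])
```

Lines that do not start with a date are simply skipped, which is why this approach tends to break when a description wraps onto a second line; such layouts need extra handling.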

Hope this helps.

Cheers @krishnareddy

@krishnareddy

Maybe writing a custom extractor for your bank statements would work best: write a custom activity that you can use in the Data Extraction Scope, and get your table in such a way that it can also be validated or corrected by a human using the Present Validation Station activity.

Check out the links in here for more details about this approach:

Hope this helps,

Ioana

Are there any options for extracting a table from a PDF without any Flash Player?

Yes. Use the Machine Learning Extractor with an ML model trained for bank statements :)

Hello Krishna,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with a column with multiple words in the last column
27:00 File 5 PDF with a column with multiple words in an inner column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces that need to be corrected
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection for empty cells
58:35 File 12 Big PDF with an empty line, empty columns, and a partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF; remove spaces from headers and from data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal

Code:

Thanks,
Cristian Negulescu