Configure regex based extractor for table in PDF

How do I configure the Regex Based Extractor for each table, and for the rows inside each table, in a PDF?

Hi @SWATI_KAROT,

Could you please elaborate on your question and share some samples?

Thanks,
Arun VIgnesh S
He who serves the most, reaps the most.

Hi Arun, thanks for the reply. I am trying to extract a table from a PDF using IntelligentOCR and Document Understanding. There is a table in my PDF that I am not able to extract with the Form based extractor. Is there any way I can use the Regex based extractor for table extraction in a PDF?

Please find attached the PDF to have a look at the kind of table I am dealing with. Let me know if you need more inputs.

I have the same problem. The Form Extractor cannot extract a table from a PDF. It misses 50% of the data, and the 50% it does get is not 100% correct.
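
For anyone wondering what "one regex per table row" looks like in practice, here is a minimal sketch in Python rather than in a UiPath workflow. It assumes the PDF text has already been extracted as plain text and that every data row has the shape "description, integer quantity, decimal price". The sample text, the column names and the `extract_rows` helper are all made up for illustration; they are not taken from the PDF attached above or from any UiPath activity.

```python
import re

# Hypothetical text, as it might come out of a PDF text-extraction step.
SAMPLE_TEXT = """\
Item        Qty   Price
Widget A      2   10.50
Widget B     10    3.25
Total                41.50
"""

# One data row: a description, 2+ spaces, an integer, spaces, a decimal price.
# [ \t] is used instead of \s so that a match can never run across a line break.
ROW_PATTERN = re.compile(
    r"^(?P<item>.+?)[ \t]{2,}(?P<qty>\d+)[ \t]+(?P<price>\d+\.\d{2})[ \t]*$",
    re.MULTILINE,
)


def extract_rows(text):
    """Return one dict per line that matches the row pattern."""
    return [match.groupdict() for match in ROW_PATTERN.finditer(text)]


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_TEXT):
        print(row)
```

The header and the total line simply do not match the pattern, so they are skipped. The same pattern could probably be adapted for a regex-based extractor, but how it maps onto table fields there is something I have not verified.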

Hello Swati,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with multiple words in the last column
27:00 File 5 PDF with multiple words in an inner column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces that need to be corrected
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection, empty cells
58:35 File 12 Big PDF with an empty line and Empty columns and partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF: remove spaces from the headers and also from the data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal
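
To give a rough idea of the kind of logic cases like "multiple spaces" and "a column with multiple lines" need, here is a small Python sketch. It is not the code from the video or from the GitHub repository, only an illustration under two assumptions: columns are separated by two or more spaces, and a line that yields a single cell is wrapped text belonging to the last column of the previous row. The `parse_table` name and the sample text are hypothetical.

```python
import re

# Assumption: two or more spaces separate columns; single spaces stay inside a cell.
COLUMN_GAP = re.compile(r" {2,}")


def parse_table(text, n_columns):
    """Split text into rows of n_columns cells, merging wrapped lines."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip empty lines
        cells = COLUMN_GAP.split(stripped)
        if len(cells) == n_columns:
            rows.append(cells)
        elif rows and len(cells) == 1:
            # Assumption: a lone cell is the continuation of the last column.
            rows[-1][-1] += " " + cells[0]
        # Lines with any other cell count are ignored in this sketch.
    return rows


SAMPLE_TEXT = """\
ID   Name        Description
1    Widget A    Small blue widget
                 with two handles
2    Widget B    Large red widget
"""

if __name__ == "__main__":
    for row in parse_table(SAMPLE_TEXT, n_columns=3):
        print(row)
```

Real files usually need an extra rule per layout (empty cells, subtotals, page breaks), which is why each file in the list above is handled separately.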

Code:

Thanks,
Cristian Negulescu