Extract values in PDF

I need to extract data from a PDF or text file. I am using regex, but the patterns sometimes change. I need the highlighted value.


Which OCR engine are you using here? And have you tried Document Understanding with the Regex Based Extractor?

If I were you, I would write a while loop that finds the word TAX and then reads each value in that column one by one. You might need an OCR engine for this (Microsoft OCR usually preserves the layout of the data, so it still looks the way it did inside the table).
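The idea above can be sketched outside UiPath in plain Python. This is only an illustration, and it assumes the OCR output keeps the table's column alignment as spaces. The function name `extract_column` and the sample text are made up for the example, and it uses `for` loops rather than a while loop:

```python
import re


def extract_column(text, header="TAX"):
    """Return the tokens that sit under `header` in layout-preserved text."""
    lines = text.splitlines()
    # Find the header row and the character span of the header word.
    for idx, line in enumerate(lines):
        match = re.search(rf"\b{re.escape(header)}\b", line, flags=re.IGNORECASE)
        if match:
            start, end = match.span()
            break
    else:
        return []  # header word not found anywhere

    values = []
    for line in lines[idx + 1:]:
        if not line.strip():
            break  # assumption: a blank line ends the table
        # Take the first token whose span overlaps the header's span.
        for token in re.finditer(r"\S+", line):
            if token.start() < end and token.end() > start:
                values.append(token.group())
                break
    return values


# Hypothetical sample of layout-preserved OCR text.
sample = (
    "ITEM        QTY    TAX     TOTAL\n"
    "Widget      2      1.50    21.50\n"
    "Gadget      1      0.75    10.75\n"
    "\n"
    "Thank you\n"
)
print(extract_column(sample))
```

Matching by character position instead of a fixed regex is what makes this tolerant of rows whose other columns change shape. It breaks down if the OCR engine does not preserve spacing.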

In addition to what @SenzoD said:

You can try extracting the table directly with the epsilon activity, which helps extract tables from PDFs. Then you can use DataTable manipulation to get the data you need.
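Once the table is extracted, the manipulation step amounts to picking one column and cleaning its values. A rough Python sketch of that step follows, with rows modelled as dictionaries rather than a UiPath DataTable. The helper name `column_values` and the sample rows are hypothetical:

```python
from decimal import Decimal, InvalidOperation


def column_values(rows, column):
    """Collect the numeric values of one column, skipping non-numeric cells."""
    values = []
    for row in rows:
        # Drop thousands separators and surrounding whitespace before parsing.
        raw = (row.get(column) or "").replace(",", "").strip()
        try:
            values.append(Decimal(raw))
        except InvalidOperation:
            continue  # empty or non-numeric cell, e.g. a label row
    return values


# Hypothetical rows, as an extracted table might look.
rows = [
    {"ITEM": "Widget", "TAX": "1.50"},
    {"ITEM": "Gadget", "TAX": "1,200.75"},
    {"ITEM": "Subtotal", "TAX": ""},
]
print(column_values(rows, "TAX"))
```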

Regards

Nived N :robot:

Happy Automation :relaxed::relaxed::relaxed:

Hi @Noor_Shaik

Alternative:
Word 2016, 2019, or Office 365 can open your PDF. You can then extract the table from the Word document.

Hello Noor,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with a column with multiple words in the last column
27:00 File 5 PDF with a column with multiple words in an inside column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces that need to be corrected
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection empty Cells
58:35 File 12 Big PDF with an empty line and Empty columns and partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF: remove spaces from headers and from data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal

Code:

Thanks,
Cristian Negulescu

Thank you very much @Cristian_Negulescu. This technique will greatly speed up the extraction process.

OMG!!! :raised_hands: After watching your video and reviewing your code, I improved my VB.NET skills and :bomb: :bomb: boom!
Simply a lifesaver. Thank you, thank you, thank you.

Hi @Cristian_Negulescu, I need your support: I am struggling to extract data from a scanned PDF. The extracted data is not accurate. Kindly help me get accurate data from a scanned PDF (a table with columns and sub-columns).

Hello Seshu,
If the data is not accurate, you need to change the OCR engine. I'm not able to solve this from the code.
The VB.NET code works well when the data is unstructured but accurate. Try Document Understanding, or you can try ChatGPT like this:
UiPath and ChatGPT extract Tables from PDF (use case) (PDF table) (ChatGPT prompts) - YouTube
Thanks,
Cristian