Data scraping from a PDF with multiple pages and tables into Excel


I am trying to extract data from a table that spans multiple pages of a PDF file. The problem is that I need to extract only the data that meets a specific criterion (i.e. >= a specific amount). Once the criterion is met, the amount should be linked to the company name.

Any pointers?

For example

Thanks in advance !!!

Scrape the entire table, then use the Filter Data Table activity to filter the rows based on your condition.


How can I use the column name “maximum funds approved…” in the filter row? I get the error “the value for argument ‘column name’ is not set or is invalid”.

Hey, @19028426d !! Can you share a sample PDF with us so we can find the best solution?
Only if it doesn’t contain sensitive data.

From what I understand of your question, one possible solution would be to read the text with a READ PDF or OCR activity and use regex to extract the information of interest. If a value satisfies the condition, I would keep that data.
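To illustrate the regex idea outside of UiPath, here is a minimal Python sketch. It assumes the text step returns one table row per line, with the company name followed by an amount with two decimals at the end of the line. The sample text, the pattern, and the names `SAMPLE_TEXT`, `ROW_PATTERN`, and `extract_rows` are all made up for illustration; the real PDF layout will probably need a different pattern.

```python
import re

# Hypothetical text, as a PDF-to-text step might return it (not the real file).
SAMPLE_TEXT = """\
Acme Trading Ltd 1,250,000.00
Borealis Foods Inc 480,500.00
Cobalt Systems LLC 2,000,000.00
Page 1 of 3
"""

# Assumed row layout: company name, whitespace, then an amount such as
# 1,250,000.00 at the end of the line. Lines without such an amount
# (page footers, headers) simply do not match.
ROW_PATTERN = re.compile(
    r"^(?P<company>.+?)\s+(?P<amount>\d{1,3}(?:,\d{3})*\.\d{2})\s*$"
)


def extract_rows(text, threshold):
    """Return (company, amount) pairs whose amount is >= threshold."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            continue  # not a table row
        amount = float(match.group("amount").replace(",", ""))
        if amount >= threshold:
            rows.append((match.group("company"), amount))
    return rows


result = extract_rows(SAMPLE_TEXT, 1_000_000)
for company, amount in result:
    print(company, amount)
```

Because the company name and the amount are captured by the same match, the amount stays linked to its company after filtering.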

I hope it helps!!!

Hey, @19028426d ! I looked for a more refined solution that should give you much more reliable results. So let’s go!

Step 1 - Split your PDF into single pages so you can iterate over them one by one.

Step 2 - Use Document Understanding to extract the table from each single-page PDF. It’s very simple; watch this 20-minute video. (UiPath Document Understanding: Extract Tables Out of PDFs - YouTube)

Step 3 - Merge all extracted tables.

Step 4 - Filter the final table with your required condition.
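Steps 3 and 4 can be sketched in plain Python to show the logic. This is only an illustration, not the UiPath workflow: it assumes each page's extracted table is already available as a list of rows, and the column names `Company` and `Approved Amount`, the sample values, and the helper functions are placeholders I made up.

```python
# Hypothetical output of Steps 1-2: one extracted table per page,
# each table being a list of rows (dicts). All values are placeholders.
page_tables = [
    [
        {"Company": "Acme Trading Ltd", "Approved Amount": "1,250,000.00"},
        {"Company": "Borealis Foods Inc", "Approved Amount": "480,500.00"},
    ],
    [
        {"Company": "Cobalt Systems LLC", "Approved Amount": "2,000,000.00"},
    ],
]


def merge_tables(tables):
    """Step 3: concatenate the rows of every page, keeping page order."""
    return [row for table in tables for row in table]


def parse_amount(text):
    """Convert an amount string such as '1,250,000.00' to a float."""
    return float(text.replace(",", "").strip())


def filter_rows(rows, column, threshold):
    """Step 4: keep only the rows whose amount in `column` is >= threshold."""
    return [row for row in rows if parse_amount(row[column]) >= threshold]


merged = merge_tables(page_tables)
selected = filter_rows(merged, "Approved Amount", 1_000_000)
companies = [row["Company"] for row in selected]
print(companies)
```

Amounts are parsed to numbers before comparing, since comparing them as text would give wrong results for values with thousands separators.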

Take a look at the consistent result of the extraction I performed as a test:

Hope this helps!!!

Can you share your test file?
I am still following your steps.

The .xaml:
Main.xaml (37.4 KB)

The Taxonomy:
taxonomy.json (4.3 KB)

The Sample Data:
sample data.pdf (1.1 MB)