Data Extraction from PDFs

Sonam_Nagpal · January 14, 2022, 10:51am

Hi

I have a bunch of PDFs and want to extract a table from each of them, but data scraping is not working.
When I write the data to a text file, the headers appear 3 times, because the PDF has 3 pages and every page repeats the headers. How can I remove these headers?
I have to create a separate Excel file for every PDF.
Is there any approach you can suggest?

Thanks in advance

rahulsharma · January 14, 2022, 4:33pm

What’s the issue when you say that data scraping isn’t working?

Do you mean that your datatable (in which you store the extracted data) contains the headers as rows, because every page starts with them?

You can go ahead and build the whole datatable from the extracted data. Then use Lookup Data Table on it with the header value as the lookup value, and you'll get a row index returned. Then simply remove that row using

You can put this in a loop: run the lookup-and-remove sequence with a counter equal to the number of pages in the PDF, so that each repeated header row gets removed.

Hope this helps!
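
For readers who end up doing this step outside of a datatable activity, here is a minimal Python sketch of the same idea: keep the first header row and drop every later row that is identical to it. The function name and the sample rows are made up for illustration, and the sketch assumes the repeated headers match the first row exactly.

```python
def drop_repeated_headers(rows):
    """Keep the first row as the header and drop any later row identical to it."""
    if not rows:
        return []
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]


# Hypothetical sample: a 3-page table where every page starts with the header.
rows = [
    ["Item", "Qty"],
    ["Pen", "2"],
    ["Item", "Qty"],
    ["Ink", "1"],
    ["Item", "Qty"],
    ["Pad", "5"],
]
cleaned = drop_repeated_headers(rows)
```

Comparing against the header value directly means no page counter is needed, but it would also drop a genuine data row that happens to equal the header.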

pikorpa · January 14, 2022, 10:47pm

Hey,
You can always try using UiPath.Python.Activities

and use Python methods to get the table from the PDF files and convert it to CSV.
try this:

you can also try this:

Or here is a great tutorial on how to use Document Understanding:
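
As a rough illustration of the Python route mentioned above, here is one possible sketch that reads the table from every page of a PDF and writes one CSV per PDF. It is not the code from the links in this post. It assumes the third-party pdfplumber library is installed, that each page holds a single detectable table, and that pages after the first repeat the same header row. The function names and the folder layout are hypothetical.

```python
import csv
from pathlib import Path


def extract_page_tables(pdf_path):
    """Return one table (a list of rows) for each page where a table is found."""
    import pdfplumber  # third-party, assumed installed: pip install pdfplumber

    tables = []
    with pdfplumber.open(str(pdf_path)) as pdf:
        for page in pdf.pages:
            table = page.extract_table()  # None when no table is detected
            if table:
                tables.append(table)
    return tables


def merge_tables(tables):
    """Concatenate per-page tables, skipping header rows repeated on later pages."""
    merged = []
    for table in tables:
        # Skip the first row only when it equals the header already kept.
        start = 1 if merged and table[0] == merged[0] else 0
        merged.extend(table[start:])
    return merged


def write_csv(rows, csv_path):
    """Write the rows to a CSV file, which Excel can open directly."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def convert_folder(folder):
    """Write one CSV next to each PDF found in the (hypothetical) folder."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        rows = merge_tables(extract_page_tables(pdf_path))
        write_csv(rows, pdf_path.with_suffix(".csv"))
```

Calling `convert_folder` with the path of the folder that holds the PDFs would produce one output file per PDF. Scanned or image-only PDFs usually need OCR first, which this sketch does not cover.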
