Extracting PDF data from a website and storing it in an Excel sheet


#1

So, as the title says, I’m trying to extract PDF data (a table) from some websites and store it in an Excel sheet. I’m able to do that using web/data scraping, but I’m facing a couple of issues with some of the PDF files.

ONE: If the PDF has multiple pages and the header is repeated on each page, extra columns get generated. For instance, if the headers are Name, Address and Phone Number and there are 18 pages, 18 × 3 = 54 columns are generated.

TWO: When scraping a PDF table from one of the websites, I could only get partial data out of it. There were five columns in total, but data scraping only picked up three of them.

I’ll be grateful if anybody can help me out here.

Thank you. …


#2

For issue ONE: as a last resort, maybe you can scrape page by page (using Range), manipulate each datatable to remove the column header, and then use the Append Range activity to add it to the sheet.
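
To make the page-by-page idea concrete, here is a minimal sketch of the same logic in Python rather than as a workflow. It assumes each page has already been scraped into a list of rows whose first row is the header; `merge_pages` and the sample data are made up for illustration and are not part of any scraping tool.

```python
from typing import List

Row = List[str]
Table = List[Row]


def merge_pages(pages: List[Table]) -> Table:
    """Stack per-page tables vertically, keeping the header row only once."""
    # Skip pages where nothing was extracted.
    non_empty = [page for page in pages if page]
    if not non_empty:
        return []

    # Assumption: the first row of the first non-empty page is the header.
    header = non_empty[0][0]
    merged: Table = [header]
    for page in non_empty:
        # Drop the leading row only when it repeats the header.
        body = page[1:] if page[0] == header else page
        merged.extend(body)
    return merged


# Made-up sample data standing in for two scraped pages.
pages = [
    [["Name", "Address", "Phone Number"], ["Asha", "12 Hill Rd", "555-0101"]],
    [["Name", "Address", "Phone Number"], ["Ravi", "7 Lake St", "555-0102"]],
]
merged = merge_pages(pages)
```

Because the pages are appended as rows, the result keeps three columns no matter how many pages there are, instead of growing sideways.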

Or, once you have a datatable from reading the full PDF, you can remove the repeated column headers using a For Each loop.
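
A rough sketch of that loop, again in Python for illustration only. It assumes the full PDF was read into one table where the repeated headers show up as extra rows; if they show up as extra columns instead, the table would need reshaping first. `drop_repeated_headers` is a hypothetical helper name.

```python
from typing import List


def drop_repeated_headers(rows: List[List[str]]) -> List[List[str]]:
    """Keep the first row as the header and skip later rows that repeat it."""
    if not rows:
        return []

    def normalise(row: List[str]) -> List[str]:
        # Compare ignoring case and stray whitespace from the extraction.
        return [str(cell).strip().lower() for cell in row]

    header_key = normalise(rows[0])
    cleaned = [rows[0]]
    for row in rows[1:]:
        if normalise(row) == header_key:
            continue  # a header repeated from a later page
        cleaned.append(row)
    return cleaned


# Made-up sample: the header reappears in the middle of the data.
table = [
    ["Name", "Address", "Phone Number"],
    ["Asha", "12 Hill Rd", "555-0101"],
    [" name ", "Address", "Phone Number"],
    ["Ravi", "7 Lake St", "555-0102"],
]
cleaned = drop_repeated_headers(table)
```

The comparison is deliberately loose (case and whitespace are ignored) because text pulled from a PDF often carries stray spaces.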

For issue TWO: why weren’t the column names mapped properly in the image below? For example, the Pin column is mapped to the Date info.


#3

That’s one of the issues I’m facing, and it’s the only PDF I’m getting this issue with. As for your solution, I’ll see whether I can scrape page by page and get back to you.