I have been building a series of data scraping processes, which take data from a Word document and output it to Excel. The files I have now been given are PDFs, and for consistency, I am trying to keep all the inputs as Word documents.
So far, I have tried reading the PDF and then using a Word Application Scope with Append Text, but this removed the tables and formatting from the PDF. The tables are necessary for scraping the data. I have also tried this:
Changing a file's extension from .pdf to .docx does not convert its contents; it only changes the default application used to open the file. Your file remains a PDF with a different extension.
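To illustrate the point about extensions, here is a minimal Python sketch that looks at the first bytes of a file instead of its name. The function name `sniff_format` and the stand-in file are made up for this example, and it assumes the usual case where a PDF starts with `%PDF-` and a real .docx (a ZIP container) starts with `PK\x03\x04`.

```python
from pathlib import Path
import tempfile

def sniff_format(path):
    """Guess the real format from the first bytes, ignoring the extension."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        return "zip"  # .docx, .xlsx and similar Office files are ZIP containers
    return "unknown"

# Demo with a stand-in file: only the header bytes matter for this check.
with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "report.pdf"
    pdf.write_bytes(b"%PDF-1.7\n% placeholder, not a complete PDF\n")
    renamed = pdf.rename(pdf.with_suffix(".docx"))  # rename only, no conversion
    print(renamed.name, "->", sniff_format(renamed))  # report.docx -> pdf
```

The renamed file still reports as a PDF, which is why Word-based activities cannot treat it as a normal .docx.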
I don't know exactly what you're trying to achieve. Maybe you could use a text file instead, which roughly preserves the layout by padding with spaces.
You might be interested in pdftotext (with the -table option, for example).
If the goal is to convert the PDF to an Excel file: I have often used pdftotext, with or without the -raw option, to output text that I then parse with regex.
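A rough Python sketch of that pdftotext-plus-regex approach might look like the following. It assumes pdftotext is installed and on the PATH; the helper names `pdf_to_text` and `split_rows`, the default `-layout` option, and the sample text are all assumptions for illustration (swap in `-table` or `-raw` if your pdftotext build supports them and they suit your files better). The splitting rule, two or more spaces between cells, is only a heuristic and will need tuning for real documents.

```python
import re
import subprocess

def pdf_to_text(pdf_path, options=("-layout",)):
    """Run pdftotext and return the text; the final "-" sends output to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8", *options, str(pdf_path), "-"]
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")

def split_rows(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows

# Made-up sample of what layout-preserving output can look like.
sample = (
    "Item        Qty    Price\n"
    "Widget A      3     4.50\n"
    "\n"
    "Widget B     12    10.00\n"
)
for row in split_rows(sample):
    print(row)
# ['Item', 'Qty', 'Price']
# ['Widget A', '3', '4.50']
# ['Widget B', '12', '10.00']
```

For a real file you would call `split_rows(pdf_to_text("input.pdf"))` (the file name is a placeholder) and write the resulting rows to Excel with whatever tool the rest of the process already uses.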
Got a solution. A long and slow solution, but a solution nonetheless.
I opened Word with Start Process, then selected Open and typed in the PDF file path. Once it had loaded, I saved it as a .docx, closed it, and reopened it. Data can now be scraped from it, and it still has all the formatting and tables in the same locations as in the PDF.
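For anyone who wants to script the same open-and-save-as idea without clicking through the UI, here is a hedged Python sketch that drives Word through COM automation instead. This is a different technique from the UI steps above, not what I actually ran. It assumes Windows, a Word version that can open PDFs, and the pywin32 package; the function names are made up for the example, and the result will only be as good as Word's own PDF conversion.

```python
from pathlib import Path

WD_FORMAT_DOCX = 16  # wdFormatDocumentDefault in Word's WdSaveFormat enumeration

def docx_path_for(pdf_path):
    """Return the absolute output path: same folder and name, .docx extension."""
    return Path(pdf_path).resolve().with_suffix(".docx")

def pdf_to_docx_with_word(pdf_path):
    """Open a PDF in Word and save it as .docx (needs Windows, Word, pywin32)."""
    import win32com.client  # imported here so the file loads without pywin32

    source = Path(pdf_path).resolve()
    target = docx_path_for(source)
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Positional arguments: FileName, ConfirmConversions=False, ReadOnly=True.
        # ConfirmConversions=False is meant to skip the conversion prompt.
        doc = word.Documents.Open(str(source), False, True)
        try:
            doc.SaveAs2(str(target), WD_FORMAT_DOCX)
        finally:
            doc.Close(False)  # close without saving changes to the source
    finally:
        word.Quit()
    return target
```

Calling `pdf_to_docx_with_word("input.pdf")` (placeholder file name) should leave an `input.docx` next to the PDF, which can then go through the existing Word-based scraping process.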