Convert .pdf to .docx and keep tables/formatting

Hello,

I have been building a series of data scraping processes that take data from a Word document and output it to Excel. The files I have now been given are PDFs, and for consistency I am trying to keep all the inputs as Word documents.

So far, I have tried reading the PDF and then using a Word Application Scope with Append Text, but this removed the tables and formatting that were in the PDF. The tables are necessary for scraping the data. I have also tried this:


replace .pdf with .docx and open the files

This appears to convert the PDF to a Word document, but an error message comes up when it opens, saying there is an issue with the data.

Does anybody have any suggestions on how to overcome this issue? Any help is appreciated.

@william.coulson,

Changing a file’s extension from .pdf to .docx does not convert the data inside the file. It only changes the default application used to open it; your file remains a PDF with a different extension.
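
To illustrate the point about extensions, here is a minimal Python sketch that looks at the first bytes of a file instead of its name. The function name `sniff_format` is made up for this example. It relies on two well-known signatures: PDF files start with `%PDF-`, and .docx files are ZIP archives that start with `PK\x03\x04`.

```python
def sniff_format(path):
    """Guess the real format from the file's first bytes, ignoring its extension."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # ZIP container: could be .docx, .xlsx, .pptx or a plain .zip
        return "zip"
    return "unknown"
```

A PDF that has only been renamed to .docx will still be reported as "pdf" here, which is why Word complains when it opens it.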

I’m not sure what you are trying to achieve. Perhaps you could use a text file instead, roughly preserving the layout by padding with spaces.

You might be interested in pdftotext (with the -table option, for example).

If the goal is to get the PDF data into an Excel file, I have often used pdftotext, with or without the -raw option, to output text that I then parse with regex.
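
A rough Python sketch of that pdftotext-plus-regex approach follows. It assumes pdftotext is installed and on the PATH; the function names and the "two or more spaces separate columns" rule are illustrative choices, not something pdftotext guarantees, so real documents will likely need a more specific pattern.

```python
import re
import subprocess


def pdf_to_text(pdf_path, raw=False):
    """Run pdftotext and return its output; "-" sends the text to stdout."""
    mode = "-raw" if raw else "-layout"
    result = subprocess.run(
        ["pdftotext", mode, pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_rows(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines and page-break-only lines
        rows.append(re.split(r"\s{2,}", line.strip()))
    return rows
```

The resulting list of rows could then be written to Excel or CSV with whatever tool the rest of the process already uses.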

Hope it will help

Hi @msan,

I am trying to open the PDFs with Word, or convert the PDFs to .docx files, and then scrape data from them.

@william.coulson

With LibreOffice you get access to lowriter --invisible --convert-to docx *.pdf
https://manpages.debian.org/testing/libreoffice-writer/lowriter.1.en.html
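
If you want to drive that conversion from a script, a sketch in Python might look like the following. The function names are made up for this example. It swaps --invisible for --headless and adds --infilter=writer_pdf_import, on the assumption that LibreOffice otherwise treats the PDF as a drawing rather than a text document; please check both against your LibreOffice version.

```python
import subprocess


def build_convert_command(pdf_path, out_dir, binary="lowriter"):
    """Build the LibreOffice command line for a PDF to .docx conversion."""
    return [
        binary,
        "--headless",
        "--infilter=writer_pdf_import",  # assumption: import the PDF via Writer
        "--convert-to", "docx",
        "--outdir", out_dir,
        pdf_path,
    ]


def convert(pdf_path, out_dir, binary="lowriter"):
    """Run the conversion; raises CalledProcessError if LibreOffice fails."""
    subprocess.run(build_convert_command(pdf_path, out_dir, binary), check=True)
```

How well tables survive depends on the PDF, so it is worth inspecting a converted file before relying on it for scraping.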

Got a solution. A long and slow solution, but a solution nonetheless.

I opened Word with Start Process, then selected Open and typed in the PDF file path. Once it had loaded, I saved it as a .docx, closed it and reopened it. Data can now be scraped from it, and it still has all the formatting and tables in the same places as in the PDF.
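
For anyone who would rather script this than click through the Word UI, the same idea (let Word open the PDF, then save it as .docx) can be sketched with Python and COM automation. This is not the UiPath workflow described above. It assumes Windows, a Word version that can open PDFs (2013 or later), and the pywin32 package; the function names are hypothetical, and Word may still show a conversion prompt on some setups.

```python
from pathlib import Path

WD_FORMAT_XML_DOCUMENT = 12  # Word's wdFormatXMLDocument constant (.docx)


def docx_path_for(pdf_path):
    """Return the output path: same folder and name, with a .docx suffix."""
    return Path(pdf_path).with_suffix(".docx")


def convert_with_word(pdf_path):
    """Open a PDF in Word and save it as .docx (Windows + pywin32 only)."""
    import win32com.client  # imported here so the module loads elsewhere too

    src = Path(pdf_path).resolve()  # COM needs absolute paths
    dst = docx_path_for(src)
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(str(src), ConfirmConversions=False, ReadOnly=True)
        doc.SaveAs2(str(dst), FileFormat=WD_FORMAT_XML_DOCUMENT)
        doc.Close(False)  # close without prompting to save changes
    finally:
        word.Quit()
    return dst
```

Calling `convert_with_word` on each PDF in a folder would replace the open, save, close and reopen steps, though the conversion itself is still done by Word and will be just as slow.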
