Convert .pdf to .docx and keep tables/formatting

william.coulson · March 17, 2020, 11:02am

Hello,

I have been building a series of data scraping processes that take data from a Word document and output it to Excel. The files I have now been given are PDFs, and for consistency I am trying to keep all the inputs as Word documents.

So far, I have tried reading the PDF and then using a Word Application Scope with Append Text, but this removed the tables and formatting that were in the PDF. The tables are necessary for scraping the data. I have also tried this:

replace .pdf with .docx and open the files

That converts the PDF to a Word doc, but then an error message comes up when it opens, saying there is an issue with the data.

Does anybody have any suggestions on how to overcome this issue? Any help is appreciated.

msan · March 17, 2020, 11:22am

@william.coulson,

Changing a file’s extension from .pdf to .docx does not convert the data in the file; it only changes the default application used to open it. Your file remains a PDF with a different extension.
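
To illustrate the point, a file's real type can be checked from its first bytes instead of its name. Here is a minimal Python sketch (the helper name `sniff_format` is made up for this example): PDF files begin with `%PDF-`, while .docx files are ZIP archives and begin with `PK`.

```python
def sniff_format(path):
    """Guess the real format from the leading bytes, ignoring the extension."""
    with open(path, "rb") as fh:
        head = fh.read(5)
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK"):
        # .docx, .xlsx and other Office Open XML files are ZIP containers
        return "zip"
    return "unknown"
```

A PDF renamed to `something.docx` would still report `"pdf"` here, which is why Word complains when it opens the renamed file.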

I don’t know exactly what you’re trying to achieve. Maybe you could use a text file that roughly preserves the layout by inserting spaces.

You might be interested in pdftotext (with the -table option, for example).

If the goal is to convert the .pdf to an Excel file, I have often used pdftotext, with or without the -raw option, to output text that I then parse with Regex.
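
A rough sketch of that pdftotext-plus-Regex approach in Python, under a few assumptions: pdftotext is installed and on the PATH, the `-layout` option is used to keep columns aligned with spaces (swap in `-table` or `-raw` if your build supports them and they suit the file better), and table cells are separated by at least two spaces. The function names are invented for this example.

```python
import re
import subprocess

def pdf_to_layout_text(pdf_path):
    """Run pdftotext and return the extracted text.

    '-layout' keeps the physical layout; the trailing '-' sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout

def split_rows(text):
    """Split layout-preserved text into rows of cells.

    Assumption: cells are separated by runs of two or more whitespace characters.
    Lines that yield a single cell are treated as non-table text and skipped.
    """
    rows = []
    for line in text.splitlines():
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) > 1:
            rows.append(cells)
    return rows
```

The rows returned by `split_rows(pdf_to_layout_text("input.pdf"))` can then be written to Excel with whatever tool the rest of the process uses. The two-space rule is only a heuristic and will need tuning for tables whose cells contain wide gaps.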

Hope it helps.

william.coulson · March 17, 2020, 11:52am

Hi @msan,

I am trying to open the PDFs with Word, or convert the PDFs to .docx files, and scrape data from them.

msan · March 17, 2020, 12:28pm

@william.coulson

With LibreOffice you get access to lowriter --invisible --convert-to docx *.pdf
https://manpages.debian.org/testing/libreoffice-writer/lowriter.1.en.html
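
For anyone scripting this, the LibreOffice conversion could be wrapped in Python roughly as below. This is a sketch with assumptions: it calls `soffice` (the binary that `lowriter` wraps) with `--headless`, and it adds `--infilter=writer_pdf_import`, which as far as I know is needed on some versions so the PDF is imported by Writer and not Draw. The function names are hypothetical, and table fidelity is not guaranteed.

```python
import subprocess

def build_convert_command(pdf_path, out_dir, binary="soffice"):
    """Build the LibreOffice command line for a PDF -> docx conversion."""
    return [
        binary,
        "--headless",                      # no GUI
        "--infilter=writer_pdf_import",    # assumption: force the Writer PDF import filter
        "--convert-to", "docx",
        "--outdir", out_dir,
        pdf_path,
    ]

def convert(pdf_path, out_dir):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(pdf_path, out_dir), check=True)
```

Keeping the command construction separate from the call makes it easy to print and try the exact command by hand first.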

william.coulson · March 17, 2020, 2:43pm

Got a solution. A long and slow solution, but a solution nonetheless.

I opened Word with Start Process, then selected Open and typed in the PDF file path. Once it loaded, I saved it as a .docx, closed it and reopened it. Data can now be scraped from it, and it still has all the formatting and tables in the same locations as in the PDF.
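
The same open-in-Word-and-save-as-.docx idea can also be scripted without driving the UI. The sketch below is not what was done above; it swaps in Word's COM automation through the pywin32 package, and assumes Windows with a Word version that can open PDFs (2013 or later). The function names are made up for this example, and Word may still show a conversion prompt on some setups.

```python
import os

WD_FORMAT_DOCX = 16  # Word's wdFormatDocumentDefault constant (.docx)

def docx_path_for(pdf_path):
    """Return the output path: same folder and name, with a .docx extension."""
    root, _ = os.path.splitext(pdf_path)
    return root + ".docx"

def convert_with_word(pdf_path):
    """Open a PDF in Word and save it as .docx (Windows + pywin32 only)."""
    import win32com.client  # imported here so the helpers above work anywhere

    src = os.path.abspath(pdf_path)   # Word needs absolute paths
    dst = docx_path_for(src)
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(src, ConfirmConversions=False, ReadOnly=True)
        doc.SaveAs2(dst, FileFormat=WD_FORMAT_DOCX)
        doc.Close(False)              # close without saving changes to the source
    finally:
        word.Quit()
    return dst
```

This is usually faster than typing the path into the Open dialog, but the layout Word produces should be the same, since the conversion itself is still done by Word.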

system · March 20, 2020, 2:43pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
