I am working on a secured PDF that consists entirely of diagrams, and I need to convert it to Excel.
Because it is secured, the only thing I can do is convert the PDF to txt and then capture the content I need from the text into an Excel sheet.
There is one diagram I cannot convert: test.pdf (27.4 KB)
For security reasons I cannot share the real file, so this PDF is only an example of the diagram. I wrote it in Word and generated the PDF from that.
I want it to look like a normal table in Excel.
However, it looks like this in Notepad:
I checked both and I see what you mean: there are many useless blanks and line breaks in the Word file.
However, I suppose this is due to the PDF format, and the data itself is the same as in the PDF, right?
As far as I understand, this is the only way to read a PDF file while keeping its table format if you cannot use the data scraping method on the PDF (other than OCR).
So if the format is stable, I recommend extracting the PDF data this way and deleting the spaces you don't need.
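To illustrate the "delete the spaces you don't need" idea outside of any particular tool, here is a minimal Python sketch. It assumes the extracted text keeps columns apart with runs of two or more spaces and that blank lines are just layout leftovers; the function name `text_to_rows` and the sample text are made up for the example, not taken from the real PDF.

```python
import re

def text_to_rows(text):
    """Turn PDF-extracted text into a list of rows, each a list of cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the PDF layout
        # Assumption: two or more whitespace characters mark a column gap,
        # while a single space belongs to the cell's own text.
        rows.append(re.split(r"\s{2,}", line))
    return rows

# Hypothetical sample resembling text pulled out of a PDF table
sample = "Name      Qty    Price\n\n\nBolt M6   10     0.25\n   \nNut M6    12     0.10\n"

for row in text_to_rows(sample):
    print(row)
```

This only works while the layout is stable. If a cell can contain two spaces in a row, or columns are sometimes separated by a single space, the split rule needs to change.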
Thanks for your advice. I also tried that method, but it was too slow and didn't work as well as I had hoped, so I went back to capturing the data from the txt file. I want to use regular expressions for that, which seems achievable for me. If you are still interested, I have asked a new question.
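For anyone following along, a rough sketch of the regular-expression approach in Python might look like the code below. The line layout (an item code, a quantity, and a price) is purely an assumption for illustration, as are the names `LINE_RE`, `capture_rows`, and `rows_to_csv`; the real pattern depends on what the txt export actually contains. It writes CSV, which Excel can open directly.

```python
import csv
import io
import re

# Hypothetical pattern: item code (2 letters + 3 digits), quantity, price.
LINE_RE = re.compile(
    r"^\s*(?P<code>[A-Z]{2}\d{3})\s+(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})\s*$"
)

def capture_rows(text):
    """Keep only the lines that match LINE_RE and return their fields."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append([m.group("code"), int(m.group("qty")), float(m.group("price"))])
    return rows

def rows_to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["code", "qty", "price"])
    writer.writerows(rows)
    return buf.getvalue()

# Made-up sample text with noise lines that the pattern should ignore
sample = "Parts list\n\n  AB123     4      12.50\n-- page 1 --\n  CD456    10       3.00\n"

print(rows_to_csv(capture_rows(sample)))
```

Lines that do not match the pattern are dropped silently, so it is worth checking the row count against the source before trusting the output.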
You can first convert the scanned PDF into an editable PDF (activity “Correct rotation& convert scanned PDF to editable text & images”). Then export the editable PDF to Excel (activity “Export PDF files to other format”).