Read PDF Text Strange Anomaly

All the PDFs are invoices from one company, so they all share the same formatting.

The workflow:
Reads the PDF to text
Generates a DataTable
Saves as CSV

The anomaly is that when it gets down to the invoice rows, it concatenates 2 rows together. The same number of rows later it does the same thing again, and so on until the end of the row items.

I’ve tried using OCR, but Read PDF Text gives me the cleanest output for further manipulation.

Hi Rmorgan

Save the PDF text to a file, open it in an editor such as Notepad++, and turn on the display of “hidden” characters. Maybe a line-break character is missing at the end of some rows,
and that is why you get 2 rows in one line.
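
If you would rather check this in code than in an editor, here is a minimal Python sketch of the same idea. It is not a UiPath activity; `show_hidden` and the sample text are made up for illustration. It prints each line with its line ending made visible, so a row that is missing its line break shows up as two rows on one line.

```python
def show_hidden(text: str) -> list[str]:
    """Return each line as repr(), so characters like \\r and \\n become visible."""
    return [repr(line) for line in text.splitlines(keepends=True)]


# Made-up sample: the second and third rows are glued together
# because no line break separates them.
sample = "Item A 10.00\r\nItem B 20.00Item C 30.00\r\n"

for line in show_hidden(sample):
    print(line)
```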

yes they are missing

I believe this is happening at the end of each PDF page.
Check whether there is any character at the end of each page when you get the text from the PDF. If there is nothing, the only way I can think of is to read the PDF page by page.

Yeah, there’s no discernible character to mark the page break. I think my only option is to split the PDF into single pages, run Read PDF Text on each one, and stitch the results together.

What I recommend:
Get the PDF page count and loop through the pages (using the page range). Read each page's content into a temp variable, then either concatenate it with the previous pages in another variable or use append to Excel to write it out directly. And don't forget to add a new line when merging the pages.
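
To make the loop concrete, here is a rough Python sketch of the logic only, not the UiPath workflow itself. `merge_pages` and `read_page` are hypothetical names: `read_page` stands in for whatever reads a single page (for example, Read PDF Text limited to one page). Any trailing line break is trimmed from each page first, so joining with a new line never produces empty rows.

```python
from typing import Callable


def merge_pages(page_count: int, read_page: Callable[[int], str]) -> str:
    """Read pages 1..page_count one at a time and join them with a new line."""
    merged = []
    for page_no in range(1, page_count + 1):
        page_text = read_page(page_no)           # temp variable for this page
        merged.append(page_text.rstrip("\r\n"))  # drop any trailing line break
    return "\n".join(merged)                     # exactly one new line between pages


# Made-up page contents standing in for the real PDF reader.
fake_pages = {1: "Item A 10.00\nItem B 20.00", 2: "Item C 30.00\nItem D 40.00\n"}
print(merge_pages(2, fake_pages.get))
```

Without the explicit new line, the last row of page 1 and the first row of page 2 would end up on one line, which is the symptom described above.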

PS: I would prefer to store everything in variables and then export to Excel (or CSV) at the end.
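
As an illustration of "collect first, export once", here is a small Python sketch using the standard `csv` module. `rows_to_csv` and the sample rows are assumptions for the example; in UiPath the equivalent would be building the DataTable in memory and writing it out in a single step at the end.

```python
import csv
import io


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialize already-collected rows to CSV text in a single pass."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up rows, gathered in a variable before anything is written out.
invoice_rows = [["Item A", "10.00"], ["Item B", "20.00"]]
print(rows_to_csv(invoice_rows))
```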

Hope this helps you :slight_smile: