Data scraping on an inconsistent PDF

Hello devs who are more skilled than I.

I’m trying to build a simple PDF workflow that scrapes the contents of a PDF into a DataTable, but the PDF and its weird format are giving me problems.

A little info on the PDF: it’s structured like a table with 5 columns, with the headers repeated at the top of each of the 17 pages. The 5th column may be empty in certain rows. Both the 1st and 5th columns may also have text that wraps onto a new line. The document is a healthy mix of all of the above.

I went into the UI Explorer to investigate and found a few things. The wrapped text was defined as two separate selectors, even when both lines were in the same row. I also noticed that there were separate selectors for each of the following 3 cases: 1) the 1st line of a wrapped text in the 1st column, 2) the entire text in the 2nd column (in this case an ID number), and 3) both the 1st line of the wrapped text and the ID together. The same issue of wrapped text counting as two elements appears in the 5th column as well.

I tried to do some data scraping so I could see the results and it was just a total mess. There’s one main selector that covers an entire page and does a decent job but it doesn’t do well when it has to deal with text wrapping and the contents in the 2nd column. In cases where there is no wrapped text, the contents of the 2nd column are written into the 1st column. In the wrapped text cases, the content of the 2nd column appears on a new line in the first column, in the same row. It seems like the fifth column handles the text wrapping issue alright but that’s overwritten by the odd behavior of different pages.
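To make the column 1 / column 2 problem concrete, here is a minimal sketch (in Python, outside the workflow itself) of how the ID might be pulled back out of a merged column-1 cell. It assumes the ID is a standalone run of 4 or more digits, which is a guess on my part; `split_name_and_id` and `ID_PATTERN` are hypothetical names, not anything the scraper provides.

```python
import re

# Assumption: an ID is a standalone run of 4 or more digits.
# Adjust the pattern to whatever the real ID format is.
ID_PATTERN = re.compile(r"\b\d{4,}\b")


def split_name_and_id(cell):
    """Split a merged column-1 cell into (text, id).

    Returns (text, None) when no ID-like token is found.
    """
    match = ID_PATTERN.search(cell)
    if match is None:
        # No ID found: just normalize whitespace and line breaks.
        return " ".join(cell.split()), None
    # Everything except the ID token is treated as column-1 text.
    remainder = cell[:match.start()] + " " + cell[match.end():]
    return " ".join(remainder.split()), match.group()
```

This handles both cases described above: the ID appended on the same line, and the ID sitting on its own line between the wrapped lines of column 1. It would break if column 1 itself can contain long numbers.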

It seems that what I’m seeing as “rows” in the PDFs on different pages are grouped together and counted as a single UiElement, making for some weird data scraping results. On other pages with mostly one-line entries in column 1, the table looks passable but then it looks like it’s repeating the format for every column except the 1st one right after the 5th column.

These are the only issues I’m seeing right now. I’ve thought of a few potential solutions to the simpler things, like handling the 17 pages of data and splitting the first and second columns apart where text wraps, but as for the main issue, grabbing data from an inconsistent PDF, I’m really stuck.
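For the simpler parts, this is roughly the post-processing I have in mind, sketched in Python. It assumes the scrape can be reduced to a list of rows with up to 5 string cells each, and that a row with an empty 2nd column (the ID) is the wrapped continuation of the row above it. Both are assumptions about my data, and `clean_rows` is a hypothetical helper name.

```python
def clean_rows(rows, header):
    """Drop repeated header rows and fold wrapped lines into the row above.

    Assumption: a row whose second column (the ID) is empty is the
    continuation of the previous row, not a record of its own.
    """
    cleaned = []
    for row in rows:
        cells = [c.strip() for c in row]
        cells += [""] * (5 - len(cells))  # pad short rows to 5 columns
        if cells == header:
            continue  # header repeated at the top of a page
        if cells[1] == "" and cleaned:
            prev = cleaned[-1]
            # Only columns 1 and 5 wrap, so only those are merged.
            for i in (0, 4):
                if cells[i]:
                    prev[i] = (prev[i] + " " + cells[i]).strip()
            continue
        cleaned.append(cells)
    return cleaned
```

Every copy of the header is dropped, including the first, so the column names would have to be kept separately. This does nothing for the pages where several rows come back as a single element, which is the part I’m still stuck on.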

I’m looking at this as an interesting challenge of how I can generate a report from such a spotty source. If anyone could provide insight on any method that might work I’m all ears and do appreciate the assistance.

Thank you.

Hi,
To extract data from PDF files in different formats, I would suggest using ABBYY FlexiCapture (an OCR tool).

What does this tool offer that stands out from Tesseract and Microsoft OCR?