Extract PDF native text from grid table

I have a PDF file that contains numbers. Luckily they are native text, but they sit inside a grid of background borders that the read PDF activity does not recognize. Some cells are empty: the gap is easy to see visually, but the extracted text gives no indication that it is there. For example, if you have a table like this:
|214||231| |233|
The extracted text looks like 214 231 233

How can I “catch” these empty spaces, so that I know which cell has no value?

Split the data on the | character and remove any elements of the array equal to String.Empty. After that, trim each of the cells.
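As a rough illustration of that suggestion, here is a minimal Python sketch. It assumes the extracted row really contains the `|` characters shown in the example, which may not hold for a given PDF; the function name `split_cells` is just a placeholder.

```python
def split_cells(row: str) -> list[str]:
    """Split one extracted row on '|' and return the trimmed cell values."""
    parts = row.split("|")
    # Adjacent borders ("||") and the outer borders produce truly empty strings.
    # Dropping them leaves one element per cell; a blank cell survives as " ".
    cells = [p for p in parts if p != ""]
    # Trimming turns the whitespace-only element into "", marking the empty cell.
    return [c.strip() for c in cells]


row = "|214||231| |233|"
print(split_cells(row))  # ['214', '231', '', '233']
```

The empty string in third position is what marks the blank cell, so the filter has to run before the trim, not after it.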

Unfortunately, there is no | character. The extracted data contains only the numbers; the border lines are not recognized at all.

Is there additional space in the string when the cell is empty?

Looking at the screenshot, it appears to be part of a table. Try opening the PDF file in the Chrome browser and using data scraping on it. It should work.

If it isn’t client-related data, please upload a sample PDF file so that other devs can try different approaches and let you know the solution.

Cheers!

Thank you for the prompt replies. Unfortunately, there is no additional space either. I will try the Chrome solution and come back in a few days. I am unable to upload the file here because it contains sensitive information.

Book1.pdf (36.5 KB)

I have tried the Chrome method, but it gave nothing useful either.
Please find a sample document attached. In this document it is clearly visible that a value is missing in column B, but after reading the text from the PDF it is not clear whether the value is missing in B or in C.
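One general way to resolve this kind of ambiguity is to work from word positions instead of the flattened text. This is not something proposed in the thread. The sketch below is only an illustration under stated assumptions: it presumes some extraction tool can report an x coordinate for each word and that the x positions of the column borders are known. The function name `assign_to_columns` and all coordinates are hypothetical.

```python
from bisect import bisect_right


def assign_to_columns(words, column_edges):
    """Place each (x, text) word into the column whose borders enclose x.

    words: list of (x_center, text) tuples for one table row (assumed input).
    column_edges: sorted x positions of the vertical borders, including the
    left and right outer borders, so there are len(column_edges) - 1 columns.
    """
    n_cols = len(column_edges) - 1
    row = [""] * n_cols  # columns with no word stay as empty strings
    for x, text in words:
        idx = bisect_right(column_edges, x) - 1
        if 0 <= idx < n_cols:  # ignore words outside the table
            row[idx] = text
    return row


# Hypothetical coordinates: three columns (A, B, C), 50 units wide each.
edges = [0, 50, 100, 150]
words = [(25, "214"), (125, "233")]  # nothing was extracted for column B
print(assign_to_columns(words, edges))  # ['214', '', '233']
```

With positions available, the missing value lands unambiguously in column B instead of collapsing into a two-item list.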

Hi @NotFranmax,
Kindly check the attached workflow: Main.xaml (4.9 KB)

That solves it for the sample, but it does not work with my original file (which I cannot distribute). Anyway, thank you.