Extract text of a specific color from PDF


I'm working on an automation project in which, at some point, I have to extract a table from each of several PDF files, one at a time. I cannot use Data Scraping in Acrobat Reader because it doesn’t work on all the files.

So I decided to use the Read PDF Text activity, but here is my big issue:
Here you can see a portion of the table. The problem is that I have to count the number of “.” on each line (each one corresponds to a vacation day of the corresponding employee).
But the empty cells are not actually empty: each of them contains a white “.”, so when I extract the text there is no difference between empty cells and the others…

So my solution was to convert each PDF into JPEG files and use Document Understanding (OCR) to extract only the optically visible dots. I did it and my program is finished. The only issue is that some employees appear in several PDF files, and sometimes the OCR misreads a letter, so some employees are counted as two different employees…

So I was wondering if it is possible to extract only the black text so I could stop using OCR.

Do you have any ideas?

Thanks, and please excuse my bad English.

What happens when you copy/paste everything into Excel?

Tagging along to see some solutions :slight_smile:

Check this:
How to get text color in pdf?
It shows how you can find the color of a given piece of text in a PDF file. I guess you could then parse your document, keep only the cells that contain black dots, and ignore those that contain white ones.
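To make that idea a bit more concrete, here is a minimal Python sketch. It assumes the third-party PyMuPDF library (imported as `fitz`), whose `page.get_text("dict")` output lists text spans together with their color as an sRGB integer. The helper names `is_dark`, `count_visible_dots`, and `count_dots_in_pdf` are made up for this example, and it also assumes the hidden dots really are drawn as white text; if the PDF hides them some other way, this will not work as is.

```python
def is_dark(color, threshold=128):
    """Return True if an sRGB integer (e.g. 0x000000) is a dark color."""
    r = (color >> 16) & 0xFF
    g = (color >> 8) & 0xFF
    b = color & 0xFF
    # Treat the span as visible only if every channel is below the threshold.
    return max(r, g, b) < threshold


def count_visible_dots(page_dict):
    """Count the dark '.' characters in each text line of one page.

    `page_dict` is expected to have the shape PyMuPDF returns from
    page.get_text("dict"): blocks -> lines -> spans, where each span
    carries "text" and "color".
    """
    counts = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            dots = sum(
                span["text"].count(".")
                for span in line["spans"]
                if is_dark(span["color"])
            )
            counts.append(dots)
    return counts


def count_dots_in_pdf(path):
    """Return one list of per-line dot counts for each page of the PDF."""
    import fitz  # PyMuPDF (assumption: installed with `pip install pymupdf`)

    doc = fitz.open(path)
    try:
        return [count_visible_dots(page.get_text("dict")) for page in doc]
    finally:
        doc.close()
```

Note that a "line" here is whatever PyMuPDF groups together as one text line, which may not match a full table row in every PDF, so you might need to group spans by their vertical position instead.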

Hope that helps :wink:

Every empty cell still produces a dot.

It requires custom Python scripts and I don’t know how to write Python scripts, but thanks.