I am working on an automation project in which, at some point, I have to extract a table from each of several PDF files. I cannot use Data Scraping in Acrobat Reader because it doesn’t work on all the files.
So I decided to use the Read PDF Text activity, but here is my big issue:
Here you can see a portion of the table. The problem is: I have to count the number of “.” on each line (each one corresponds to a vacation day for the corresponding employee).
But the empty cells are not actually empty: every one of them contains a white “.”, so when I extract the text there is no difference between empty cells and the others…
So my solution was to convert each PDF into JPEG files and use Document Understanding (OCR) to extract only the optically visible dots. I did that and my program is finished. The only issue is that some employees appear in several PDF files, and sometimes the OCR misreads a letter, so some employees are counted as two different employees…
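To illustrate the name problem, here is a minimal sketch (in Python, outside UiPath) of how near-identical OCR readings could be merged with fuzzy matching from the standard library. The function name `merge_names`, the sample names and the 0.85 cutoff are all made up for this example, and a cutoff that is too low could merge two genuinely different employees.

```python
from difflib import get_close_matches


def merge_names(names, cutoff=0.85):
    """Map each OCR'd name to the first-seen spelling it closely resembles.

    `cutoff` is an illustrative similarity threshold (0..1), not a tuned value.
    """
    canonical = []  # spellings accepted so far, in order of first appearance
    mapping = {}
    for name in names:
        match = get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if match:
            # Close enough to a known spelling: treat as the same employee.
            mapping[name] = match[0]
        else:
            # No close spelling yet: treat as a new employee.
            canonical.append(name)
            mapping[name] = name
    return mapping


if __name__ == "__main__":
    # Hypothetical OCR output: the second name has "1" read instead of "I".
    print(merge_names(["MARTIN Paul", "MART1N Paul", "DURAND Anne"]))
```

This only hides the OCR errors, it does not remove them, which is why I would prefer to avoid OCR entirely.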
So I was wondering whether it is possible to extract only the black text, so that I could stop using OCR.
Do you have any ideas?
Thanks, and please excuse my bad English.
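To show what I mean by "only black text", here is a minimal sketch in Python, outside UiPath. It assumes the third-party PyMuPDF library, whose `page.get_text("dict")` output gives each text span a `color` value (an sRGB integer). The function names are made up for this example, and I am assuming the hidden dots are pure white (`0xFFFFFF`), which I have not verified in my files. It also assumes that one extracted text line corresponds to one table row, which may not hold for every PDF.

```python
WHITE = 0xFFFFFF  # assumed color of the invisible dots (not verified)


def count_visible_dots(page_dict, hidden_color=WHITE):
    """Count '.' characters per text line, skipping spans drawn in hidden_color.

    `page_dict` is expected to have the shape of PyMuPDF's
    page.get_text("dict"): blocks -> lines -> spans, where each span
    carries "text" and "color".
    """
    counts = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            visible = 0
            for span in line["spans"]:
                if span["color"] != hidden_color:
                    visible += span["text"].count(".")
            counts.append(visible)
    return counts


def count_dots_in_pdf(path):
    """Return one list of per-line dot counts for each page of the PDF."""
    import fitz  # PyMuPDF (third-party); imported here so the helper above has no dependency

    with fitz.open(path) as doc:
        return [count_visible_dots(page.get_text("dict")) for page in doc]
```

Note that a "." inside a name or an abbreviation would also be counted, so the name column would have to be excluded first.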