OCR for PDF files

I am trying to extract data from a PDF with OCR. The PDF is a scanned document containing a table, and it will always be in the same format. When I read the PDF with OCR, the output sometimes comes back as key-value pairs and sometimes with all the keys first and the values at the end on new lines. Could anyone please help me here?


Hi
Welcome back to the UiPath community.
May I know which OCR engine was used, Google or Microsoft?
Cheers @baskarn


I tried Microsoft, Tesseract, and OmniPage. OmniPage is better than Microsoft and Tesseract, but it still does not give the expected result. We also tried CV Get Text. Although the format of the scanned PDF is the same for all the files, we get an exception when we try to fetch the data. CV returns a result only for the one PDF file that we used to spy the element.

@baskarn
It might be because of how the PDF document is oriented.
Suppose a table like this, with a predefined format:

Name \tab shanmukh
role \tab\tab developer

In this case, if you run OCR, the output will be:

Name shanmukh
role
Developer

Regards
shanmukh


But I am able to get the details exactly like below for a few of the files:
Name Sanmukh
Role Developer

But for a few of the files, I am getting it like this:
Name
Role
Sanmukh
Developer

But the format of the content is still the same in the PDF file.

For a few of the files, I am also not able to get the text as expected (I get symbols/special characters instead of text).
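
For the two line orders described above (key and value on one line, or all keys first and the values afterwards), here is a minimal Python sketch of how the OCR text could be normalized into one dictionary. It is only an illustration, not a UiPath activity: the function names `match_key` and `pair_fields` are hypothetical, and it assumes the list of field names on the form is known in advance.

```python
def match_key(line, keys):
    """Return (key, rest_of_line) if the line starts with a known key, else (None, line)."""
    low = line.lower()
    # Try longer keys first so a short key does not shadow a longer one.
    for key in sorted(keys, key=len, reverse=True):
        if low.startswith(key.lower()):
            rest = line[len(key):]
            # Require a word boundary after the key.
            if not rest or not rest[0].isalnum():
                return key, rest.strip(" :\t")
    return None, line


def pair_fields(lines, keys):
    """Build {key: value} from OCR lines in either line order."""
    result = {}
    pending = []  # keys seen alone on a line, still waiting for a value
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        key, rest = match_key(line, keys)
        if key and rest:
            result[key] = rest            # "Name Sanmukh" style
        elif key:
            pending.append(key)           # key alone on its line
        elif pending:
            result[pending.pop(0)] = line  # next free value line
    return result


keys = ["Name", "Role"]
print(pair_fields(["Name Sanmukh", "Role Developer"], keys))
print(pair_fields(["Name", "Role", "Sanmukh", "Developer"], keys))
```

Both calls should yield the same mapping. One limitation to keep in mind: in the keys-first order, a blank field shifts every later value by one position, so this pairing is only safe when all fields are filled.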

Hi @baskarn, can you share the PDF if possible?


This is the format I have in the PDF. I have converted it to JPEG and hidden all the confidential information. This format is static for all the reports, and I need the data extracted with all the details.

I am not able to get it into UiPath to give you a XAML solution, but as you can see, the fields on the left side have the same naming convention, while the text boxes on the right side are aligned differently, with different spacing.

This may cause the keys to come first and the values later.

This template is basically filled in as an Excel file, then printed as the final document (for signature), scanned, and sent to us as a PDF so that we can extract the data. So I don't see any format/alignment issue here. Some of the fields are blank as well, so we cannot use regex either. Do we have any other option to get the data here?
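
Since blank fields rule out simply pairing lines in order, one option worth considering is to work from word positions instead of plain text. The Python sketch below is only an illustration under stated assumptions: it assumes the OCR engine can return each word together with its x and y position, that labels sit left of a fixed column boundary and values right of it, and that the pixel coordinates, `split_x`, and `y_tolerance` shown here are made-up values that would need tuning on the real scans. The function name `fields_from_words` is hypothetical.

```python
def fields_from_words(words, split_x, y_tolerance=10):
    """Group (text, x, y) words into rows by y, then split each row into label and value.

    Words left of split_x form the label; words at or right of it form the value.
    A row with a label but no value yields an empty string, so blank fields
    do not shift the values of the following rows.
    """
    rows = []  # each entry: [anchor_y, [(x, text), ...]]
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))   # same row as the previous word
        else:
            rows.append([y, [(x, text)]])   # start a new row

    fields = {}
    for _, row in rows:
        row.sort()  # left-to-right reading order within the row
        label = " ".join(t for x, t in row if x < split_x)
        value = " ".join(t for x, t in row if x >= split_x)
        if label:
            fields[label] = value
    return fields


# Hypothetical word positions (pixels); "Location" is a blank field.
words = [
    ("Name", 50, 100), ("Sanmukh", 300, 104),
    ("Role", 50, 150), ("Developer", 300, 147),
    ("Location", 50, 200),
]
print(fields_from_words(words, split_x=200))
```

Because rows are rebuilt from the vertical position of each word, the order in which the OCR engine emits the text no longer matters, and a missing value simply comes back as an empty string for that label.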