OCR for PDF files

baskarn · March 1, 2020, 4:39am

I am trying to extract data from a PDF using OCR. The PDF is a scanned document containing a table, and it is always in the same format. When I read the PDF with OCR, the output sometimes comes back as key-value pairs and sometimes with all the keys first and the values at the end on new lines. Could anyone please help me here?

Palaniyappan · March 2, 2020, 7:22am

Hi,
Welcome back to the UiPath community.
May I know which OCR engine was used, Google or Microsoft?
Cheers @baskarn

baskarn · March 2, 2020, 9:44am

I tried Microsoft, Tesseract, and OmniPage. OmniPage is better than Microsoft and Tesseract, but it still does not give the expected result. We also tried CV Get Text. Although the format of the scanned PDF is the same for all the files, we get an exception when we try to fetch the data. CV returns a result only for the one PDF file that we used to spy the element.

shanmukh_pothamsetty · March 2, 2020, 9:47am

@baskarn
It might be because of the orientation that was set for the PDF document.
Suppose there is a table like this with a predefined format:

Name    shanmukh
role            developer

In this case, if you run OCR, the output will be:

Name shanmukh
role
Developer

Regards,
shanmukh

baskarn · March 2, 2020, 10:28am

But for a few of the files, I am able to get the details exactly like this:
Name Sanmukh
Role Developer

For a few other files, however, I get them like this:
Name
Role
Sanmukh
Developer

The format of the content in the PDF file is still the same, though.

For a few of the files, I am also not able to get the text as expected (I get symbols/special characters instead of text).

shanmukh_pothamsetty · March 2, 2020, 10:29am

Hi @baskarn, can you share the PDF if possible?

baskarn · March 2, 2020, 10:48am

This is the format I have in the PDF. I have converted it to JPEG and hidden all the confidential information. This format is static for all the reports, and I need to extract the data with all the details.

shanmukh_pothamsetty · March 2, 2020, 10:58am

I am not able to bring it into UiPath to give you a XAML solution, but as you can see, the fields on the left side follow the same naming convention, while the text boxes on the right side are aligned differently, with inconsistent spacing.

This may be why the keys come first and the values later.
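
One possible way around the reading-order problem, assuming the OCR engine can return word coordinates and not just plain text, is to rebuild the rows from the positions instead of trusting the order of the text. Below is a minimal Python sketch of that idea, not a UiPath workflow. The function name `group_into_rows`, the `(text, x, y)` tuple shape, and the 10-pixel tolerance are all assumptions made for illustration.

```python
def group_into_rows(words, y_tolerance=10):
    """Rebuild table rows from OCR words using their positions.

    words: list of (text, x, y) tuples, where x and y are the left/top
    pixel coordinates of each word (hypothetical input shape).
    Returns one string per row, with words ordered left to right.
    """
    rows = []
    # Walk the words from top to bottom of the page.
    for text, x, y in sorted(words, key=lambda w: w[2]):
        # A word belongs to the current row if it is vertically close
        # to the first word that started that row.
        if rows and abs(y - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["items"].append((x, text))
        else:
            rows.append({"y": y, "items": [(x, text)]})
    # Inside each row, order the words by their x coordinate.
    return [" ".join(t for _, t in sorted(r["items"])) for r in rows]


# Words in the "keys first, values later" reading order.
sample = [
    ("Name", 10, 100),
    ("Role", 10, 140),
    ("Sanmukh", 200, 104),
    ("Developer", 200, 137),
]
print(group_into_rows(sample))
```

With the sample above, the two keys and two values are regrouped by vertical position, so each key ends up on the same row as its value even though the OCR emitted them in a different order. The tolerance would need tuning to the scan resolution.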

baskarn · March 2, 2020, 11:18am

This template is basically filled in as an Excel file, printed as a PDF for the final document (for signature), and then scanned. It is then sent to us so we can extract the data from the PDF. So I don't see any format/alignment issue here. Some of the fields are blank as well, so we cannot use regex either. Is there any other option to get the data here?
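
If only plain text is available, and the field labels are known in advance because the template is fixed, both output layouts seen in this thread could be normalized by matching on the labels rather than running regex on the values. Here is a rough Python sketch; `pair_fields` and the key list are hypothetical names. It has a real limitation: in the keys-first layout a blank field shifts the later values onto the wrong keys, so it is only reliable when every field is filled or when the OCR keeps key and value on the same line.

```python
def pair_fields(ocr_text, keys):
    """Map known field labels to values for both OCR layouts.

    Handles "Key Value" on one line as well as all keys first and
    values afterwards. keys is the list of labels on the template.
    """
    result = {key: "" for key in keys}
    pending = []  # keys that appeared without a value on their line
    # Try longer labels first so "Name" does not shadow a longer label.
    ordered = sorted(keys, key=len, reverse=True)
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        matched, rest = None, ""
        for key in ordered:
            if line.lower().startswith(key.lower()):
                tail = line[len(key):]
                # Require a separator so "Named" does not match "Name".
                if tail == "" or tail[0] in " \t:":
                    matched, rest = key, tail
                    break
        if matched is not None:
            value = rest.lstrip(" \t:").strip()
            if value:
                result[matched] = value
            else:
                pending.append(matched)
        elif pending:
            # A bare value line goes to the oldest key still waiting.
            result[pending.pop(0)] = line
    return result


fields = ["Name", "Role"]
print(pair_fields("Name Sanmukh\nRole Developer", fields))
print(pair_fields("Name\nRole\nSanmukh\nDeveloper", fields))
```

Both calls produce the same dictionary, so the downstream steps would not need to care which layout the OCR returned for a given file.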
