Different results reading a Native PDF File and Scanned PDF File with the same OCR

Jorge_Valbuena · March 1, 2022, 8:06pm

Hello everyone,

I’m working with Document Understanding and I have a question about the behavior of the UiPath Document OCR, Tesseract OCR and OmniPage OCR engines.

I have a native PDF file with two pages. It’s a structured document, and I’m using the Intelligent Form Extractor to extract data from it. The problem is that in the digitization phase the text doesn’t seem to be extracted correctly; for example, a word like RESIDENTIAL is extracted as 5(6,'(17,$/.

Keep in mind that it’s a native PDF and that I’ve processed other native PDFs with the same structure, size and resolution. I’m also able to select the text in the PDF, so I think it should be really easy for the OCR to digitize the content.

The strangest part is that I exported the pages of the native PDF as individual images, then put those images into a Word document and saved it as a new “scanned” PDF. When I ran the RPA process on that file, the data was extracted successfully, meaning that the OCR was able to interpret the text from the “scanned” PDF correctly.

Have any of you been in this situation? Why might this happen?

Thanks in advance

Jorge_Valbuena · March 6, 2022, 12:39am

So, it seems to be an issue with the encoding of some fonts. When I set the property ForceApplyOCR to true, I get the expected results from the data extraction.

I guess that with the proper OCR, the results will be 100% accurate or close to it.
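
The garbled sample from the first post fits this encoding explanation: each character of 5(6,'(17,$/ sits exactly 29 code points below the corresponding letter of RESIDENTIAL. A constant shift like this is what one might expect when a PDF's text layer exposes raw glyph codes from an embedded font without a usable mapping back to Unicode, although that cause is an assumption and is not confirmed anywhere in this thread. A minimal Python sketch to check the offset (the helper name `shift_text` is made up for illustration):

```python
def shift_text(text: str, offset: int) -> str:
    """Shift every character of text by a fixed number of code points."""
    return "".join(chr(ord(ch) + offset) for ch in text)


# Strings taken from the first post: what was extracted vs. what the PDF shows.
garbled = "5(6,'(17,$/"
expected = "RESIDENTIAL"

# Collect the per-character differences; a single value means a constant shift.
offsets = {ord(e) - ord(g) for e, g in zip(expected, garbled)}
print(offsets)

# Applying that one offset to the garbled text recovers the readable word.
print(shift_text(garbled, 29))
```

Because the shift is constant for this sample, the text layer looks systematically mis-encoded, not randomly corrupted, which would also explain why rendering the pages as images and running OCR on them (as ForceApplyOCR does) gives correct results.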

