Different results reading a Native PDF File and Scanned PDF File with the same OCR

Hello everyone,

I’m working with Document Understanding and I have a question about the behavior of UiPath Document OCR, Tesseract OCR and OmniPage OCR.

I have a native PDF file with two pages. It’s a structured document, and I’m using the Intelligent Form Extractor to extract data from it. The problem is that in the digitization phase the text doesn’t seem to be extracted correctly; for example, a word like RESIDENTIAL is extracted as 5(6,'(17,$/.

Keep in mind that it’s a native PDF and that I’ve processed other native PDFs with the same structure, size and resolution. I’m also able to select the text in the PDF, so I think it should be really easy for the OCR to digitize the content.

The strangest part is that I exported the pages of the native PDF as individual images, put those images into a Word document and saved it as a new “scanned” PDF. When I then ran the RPA process, the data was extracted successfully, meaning that the OCR was able to interpret the text from the “scanned” PDF correctly.

Have any of you been in this situation? Why might this happen?

Thanks in advance

So, it seems to be an issue with the encoding of some fonts. When I set the ForceApplyOCR property to true, I get the expected results from the data extraction.

I guess that with the right OCR engine, the results will be 100% accurate or close to it.
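
For anyone who runs into the same thing, the garbled string from my example supports the font encoding theory. Every character in 5(6,'(17,$/ is exactly 29 code points below the matching character in RESIDENTIAL. My assumption (I have not inspected the PDF internals) is that the embedded text layer stores font glyph codes rather than real character codes, so reading the text layer gives shifted characters while the rendered page still looks fine. The Python snippet below is only a quick check of that pattern; the function name `shift_text` is something I made up for the illustration, and it is not part of any UiPath API.

```python
def shift_text(text: str, offset: int) -> str:
    """Shift every character in `text` by `offset` code points."""
    return "".join(chr(ord(ch) + offset) for ch in text)


# The garbled text from the digitization output and the word shown on the page.
garbled = "5(6,'(17,$/"
expected = "RESIDENTIAL"

# Collect the per-character code point differences.
# A single value in the set means the whole word is shifted by a constant.
offsets = {ord(e) - ord(g) for g, e in zip(garbled, expected)}
print(offsets)  # {29}

# Applying the constant shift recovers the original word.
print(shift_text(garbled, 29))  # RESIDENTIAL
```

A constant shift like this would be very unlikely to come from an OCR misread, which is why forcing OCR on the rendered page (ForceApplyOCR set to true) gives the right text while the embedded text layer does not.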
