I’m working with document understanding and I have a question about the behavior of UiPath Document OCR, Tesseract OCR, and OmniPage OCR.
I have a native PDF file with two pages; it’s a structured document, and I’m using the Intelligent Form Extractor to extract data from it. The problem is that in the digitization phase the OCR doesn’t seem to extract the text correctly; for example, a word like RESIDENTIAL is being extracted as 5(6,’(17,$/.
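Out of curiosity, I checked whether the garbling follows a pattern. A quick sanity check (assuming plain ASCII codes) shows that every character of the garbled string is the correct character shifted down by the same constant offset of 29, which makes me suspect a broken character-code mapping in the PDF’s embedded font rather than a true recognition failure:

```python
# Garbled string pulled from the PDF text layer and the word it should be.
garbled = "5(6,'(17,$/"
expected = "RESIDENTIAL"

# Compute the ASCII offset between each expected and garbled character.
offsets = {ord(e) - ord(g) for g, e in zip(garbled, expected)}
print(offsets)  # {29} -- one constant offset for every character

# Shifting every garbled character back up by 29 recovers the word.
decoded = "".join(chr(ord(c) + 29) for c in garbled)
print(decoded)  # RESIDENTIAL
```

A uniform shift like this is typical of a native PDF whose embedded font maps character codes inconsistently, so any engine that reads the text layer directly (instead of rendering and recognizing the pixels) gets shifted codes back.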
Keep in mind that it’s a native PDF and that I’ve processed other native PDFs with the same structure, size, and resolution. I’m also able to select the text in the PDF, so I think it should be really easy for the OCR to digitize the content.
The strangest part is that I exported the pages of the native PDF as individual images, put those exported images into a Word document, and saved it as a new “scanned” PDF. When I ran the RPA process against that file, the data was extracted successfully, meaning the OCR was able to interpret the text from the “scanned” PDF correctly.
Have any of you been in this situation? Why might this happen?
Thanks in advance