"Scanned" PDF with vector-based text not properly read by UiPath

Hey,

I have a set of PDF files which look like digital PDFs, but in Adobe Acrobat you can only select the entire page. This hints at a scanned PDF, but zooming in reveals that all the text is vector text, which, in my opinion, should allow a smart and straightforward extraction.

However, Microsoft OCR does not give robust results.

Any ideas how I can improve that?

@BennyS

If you use the built-in OCR engines, you can expect somewhat inconsistent results.

For better OCR results, you can use one of the paid OCR tools, which are easy to integrate with UiPath.

Hope this helps.

Thanks

From my understanding, OCR “reads” the pixels on screen and matches the shapes it finds against the characters in its trained model (ML).

Whether the text in the PDF is vector text or not does not matter, because OCR works on the rendered image (the pixels on screen) rather than reading it as digital text.
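
Since OCR only sees pixels, one thing that may be worth trying is the resolution at which the page is rendered: vector text stays sharp at any zoom level, so a larger render gives the engine more pixels per character. Here is a rough sketch of the arithmetic in Python. The function names are made up for illustration, the A4 size and 10 pt font are just example values, and I am assuming your OCR activity or rasterizer lets you set a scale or DPI at all. The only hard fact used is that PDF page sizes are measured in points, with 72 points per inch.

```python
from __future__ import annotations

# PDF page sizes are expressed in points; 1 point = 1/72 inch.
POINTS_PER_INCH = 72


def render_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Pixel dimensions of a page rasterized at the given DPI (hypothetical helper)."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def glyph_height_px(font_size_pt: float, dpi: int = 300) -> float:
    """Approximate pixel height of text set at font_size_pt (hypothetical helper)."""
    return font_size_pt * dpi / POINTS_PER_INCH


if __name__ == "__main__":
    # An A4 page is roughly 595 x 842 points; 10 pt body text is an example value.
    for dpi in (72, 150, 300):
        print(dpi, render_size(595, 842, dpi), glyph_height_px(10, dpi))
```

At 72 DPI a 10 pt character is only about 10 pixels tall, while at 300 DPI it is about 42 pixels tall, which is usually much easier for an OCR engine to classify. Whether this helps with these particular files is something you would have to test.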

As @Srini84 mentioned, you may have to test different OCR engines to find the most accurate one.

Hi @BennyS

Could you please share a sample PDF for us to have a look?

Also, it would be easiest if you could share a sample project (or a screenshot of it) showing how you’ve approached reading this file so far.