"Scanned" PDF with vector-based text not properly read by UiPath

BennyS · January 27, 2022, 10:52am

Hey,

I have a set of PDF files that look like digital PDFs, but in Adobe Acrobat you can only select the entire page. This hints at a scanned PDF, but zooming in reveals that all text is vector text, which, in my opinion, should allow a smart and straightforward extraction.

However, Microsoft OCR is not leading to robust results.

Any ideas how I can improve that?

Srini84 · January 27, 2022, 10:59am

@BennyS

If you use the built-in OCR engines, you can expect somewhat random results.

For better OCR results, you can use one of the paid OCR tools,

which can easily be integrated with UiPath.

Hope this helps.

Thanks

ecSC2 · January 27, 2022, 11:30am

From my understanding, OCR “reads” the screen as pixels and matches the “shapes” it finds against the characters in its database (ML).
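To make the shape-matching idea concrete, here is a toy sketch in Python. It is only an illustration under heavy assumptions: the 3x3 bitmaps, the `TEMPLATES` table and the `match_glyph` helper are made up for this example, and real OCR engines use far more robust models than pixel-by-pixel comparison.

```python
# Toy illustration of "matching shapes against a database".
# TEMPLATES and match_glyph are hypothetical names, not part of any OCR engine.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def pixel_distance(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def match_glyph(bitmap):
    """Return the template character whose bitmap differs least from the input."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(TEMPLATES[ch], bitmap))


# A "T" with one stray pixel in the bottom-right corner.
noisy_t = ["111", "010", "011"]
print(match_glyph(noisy_t))
```

The point is that the matcher only ever sees a grid of pixels, so a noisy or low-resolution grid can push a glyph closer to the wrong template.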

Whether the PDF text is vector-based or not does not matter, because OCR works on the image/pixels on screen rather than reading them as digital text.
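As a rough sketch of what “pixels” means here: the pixel grid an OCR engine sees depends on the resolution at which the page is rendered. The snippet below assumes the default PDF unit of 1/72 inch per point; the `rendered_size` helper and the A4 page size of 595 x 842 points are just illustrative choices, not anything UiPath exposes under that name.

```python
# Sketch: how many pixels a page turns into at a given rendering resolution.
# Assumes the default PDF user-space unit of 1/72 inch per point.
POINTS_PER_INCH = 72


def rendered_size(width_pt, height_pt, dpi):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# An A4 page is roughly 595 x 842 points.
for dpi in (72, 150, 300):
    print(dpi, rendered_size(595, 842, dpi))
```

Because vector text has no fixed resolution, the same page can be rendered at a higher DPI to give each character more pixels before it is handed to OCR. Whether that helps in a particular case would need testing.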

As @Srini84 mentioned, you may have to test different OCR engines to find the most accurate one.

loginerror · January 29, 2022, 9:47am

Hi @BennyS

Could you please share a sample PDF for us to have a look?

Also, it might be easiest if you could share a sample project (or its screenshot) of how you’ve approached reading this file so far.
