BennyS
(Benjamin Spiess)
January 27, 2022, 10:52am
1
Hey,
I have a set of PDF files that look like digital PDFs, but in Adobe Acrobat you can only select the entire page. This hints at a scanned PDF, but zooming in reveals that all the text is vector text, which, in my opinion, should allow smart and straightforward extraction.
However, Microsoft OCR does not give robust results.
Any ideas on how I can improve that?
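One way to test the "vector text" hunch is to check whether the page content streams contain any text-showing operators at all. The sketch below is a rough heuristic, not a PDF parser: it assumes unencrypted files whose streams are either uncompressed or Flate-compressed, and the function name `has_text_operators` is made up for illustration. If it finds no `BT ... Tj/TJ ... ET` sequences, the glyphs were probably converted to plain outlines (paths), and there is no text layer for an extractor to read.

```python
import re
import zlib

# Matches the raw bytes between "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# A text object (BT ... ET) that contains a text-showing operator (Tj or TJ).
TEXT_SHOW_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def has_text_operators(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to draw text with Tj/TJ."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Assumption: compressed streams use FlateDecode (zlib).
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not zlib data; scan the bytes as they are
        if TEXT_SHOW_RE.search(data):
            return True
    return False


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(has_text_operators(handle.read()))
```

A `True` result only means text operators exist; extraction can still fail if the fonts lack a usable character mapping. A `False` result on a page that clearly shows text suggests outlined glyphs, in which case OCR on a rendered image is the remaining option.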
Srini84
(Srinivas Kadamati)
January 27, 2022, 10:59am
3
@BennyS
If you use the built-in OCR engines, you can expect inconsistent results.
For better OCR results, you can try one of the other paid OCR tools,
which integrate easily with UiPath.
Hope this helps.
Thanks
ecSC2
January 27, 2022, 11:30am
4
From my understanding, OCR “reads” the screen as pixels and matches the “shapes” it finds against the characters in its database (ML).
Whether the PDF contains vector text or not does not matter, because OCR works on the rendered image/pixels on screen rather than reading digital text.
As @Srini84 mentioned, you may have to test different OCR engines to find the most accurate one.
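To make the "matching shapes" idea concrete, here is a toy sketch in Python. It is not how Microsoft OCR or any real engine works internally (those use trained models); it only illustrates that the input is a grid of pixels and the output is the best-matching known shape. The 3x5 bitmaps and the names `TEMPLATES`, `hamming` and `classify` are invented for this example.

```python
# Tiny 3x5 bitmaps: "1" is an inked pixel, "0" is background.
TEMPLATES = {
    "I": ("111", "010", "010", "010", "111"),
    "L": ("100", "100", "100", "100", "111"),
    "T": ("111", "010", "010", "010", "010"),
}


def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def classify(glyph):
    """Return the template character whose bitmap differs in the fewest pixels."""
    return min(TEMPLATES, key=lambda ch: hamming(glyph, TEMPLATES[ch]))


if __name__ == "__main__":
    # A "T" with one pixel missing at the bottom, as render noise might cause.
    noisy_t = ("111", "010", "010", "010", "000")
    print(classify(noisy_t))
```

Because the matcher only ever sees pixels, the quality of the rendered image (resolution, scaling, contrast) affects the result far more than whether the source was vector or raster.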
loginerror
(Maciej Kuźmicz)
January 29, 2022, 9:47am
5
Hi @BennyS
Could you please share a sample PDF for us to have a look?
Also, it might be easiest if you could share a sample project (or its screenshot) of how you’ve approached reading this file so far.