Reading the text from a scanned doc with low quality

kristse121 · August 26, 2019, 4:01am

Hi Guys,

I am a beginner in RPA.

Which activity can I use to read the text from a scanned document when the scan quality is poor?

Thanks,
Kris

Shubham_Varshney · August 26, 2019, 4:25am

Hey @kristse121

If the scanned doc is a PDF, it can be read with the Read PDF activities (for a scan with no text layer, that means Read PDF With OCR).

If it’s an image, it will need to be read using OCR tools.

Hope this helps. Let me know if you run into any issues.

kristse121 · August 26, 2019, 4:34am

Hi Shubham,

Would you mind giving me your email address so that I can send it to you via email?

Thanks

ClaytonM · August 26, 2019, 5:04am

Hi.

Here’s the problem with scanned images:
The position of the image is inconsistent and shifts from document to document. Roughly speaking, OCR tries to place a square over each character on a grid, and if the document shifts, the squares would have to shift with it (which they don’t). As a result, characters get cut off when the OCR grid looks at them. For example, in some documents the 8s and 0s are read accurately, but in others those same characters are read as 3s and Cs, because part of each character is cut off. I hope that makes sense.

One thing that can help is using an OCR tool that lets you restrict which characters it recognizes, for example digits only. ABBYY FlexiCapture (not free) seemed like a very powerful tool for this. However, I can’t say for sure whether it fully solves the problem with scanned images, since I only used it during a short trial period.
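If your OCR tool has no such setting, a rough equivalent is to check the output afterwards. The sketch below is plain Python, not a UiPath or ABBYY API; `check_numeric_field` is a hypothetical helper that flags every character in a digits-only field that is not a digit, so a misread like `C` for `0` gets noticed instead of passing through silently.

```python
# Hypothetical post-processing check for a field that should contain digits only.
# This is not part of UiPath or ABBYY; it only inspects the string the OCR returned.
ALLOWED = set("0123456789")


def check_numeric_field(ocr_text):
    """Return (cleaned_text, suspicious_positions) for a digits-only field."""
    # Drop surrounding whitespace and stray spaces the OCR may have inserted.
    cleaned = ocr_text.strip().replace(" ", "")
    # Record the index of every character that cannot belong to a numeric field.
    suspicious = [i for i, ch in enumerate(cleaned) if ch not in ALLOWED]
    return cleaned, suspicious


cleaned, bad = check_numeric_field(" 10C8 3 ")
print(cleaned, bad)  # prints: 10C83 [2]
```

An empty list means the field looks numeric. It does not prove the digits are right: an 8 misread as a 3 still passes this check.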

Another thing is if you do this OCR in an attended environment, then the user can validate and correct the extraction during the automation. I believe UiPath has something being designed for this. There might even be a way for an unattended robot to interact with a user for validation, maybe where the automation pauses until the data is validated by the user.
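As a rough sketch of that validation idea, assuming your OCR engine reports a confidence score per field (many do, but check yours): anything below a threshold is handed to the user instead of being trusted. The field names, values, and the 0.90 threshold are made up for illustration, and `needs_review` is a hypothetical helper, not a UiPath activity.

```python
def needs_review(fields, min_confidence=0.90):
    """Return the names of fields whose OCR confidence is below the threshold.

    fields maps a field name to a (value, confidence) pair, with confidence in 0..1.
    """
    return [name for name, (_value, conf) in fields.items() if conf < min_confidence]


# Made-up extraction result for illustration only.
extracted = {
    "invoice_no": ("80834", 0.97),
    "total": ("1C8.00", 0.61),
}

for name in needs_review(extracted):
    # In an attended run this is where the user would be asked to confirm the value.
    print(f"Please verify {name}: {extracted[name][0]}")
```

With the sample data above, only `total` is sent for review.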

In any case, for OCR to give you reasonable accuracy, the scanned images need to be placed consistently in the PDF (not shifted), and the quality must not be too poor.
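One way to check how far a scan has moved is to compare where the dark content starts against a known good page. This is only a minimal sketch in plain Python, assuming the page is already available as rows of 0-255 grayscale values; `content_origin` and `estimate_shift` are hypothetical helpers, and the approach only handles simple translation, not rotation or scaling.

```python
def content_origin(gray, dark_below=128):
    """Return (x, y) of the top-left corner of the dark-pixel bounding box, or None if blank."""
    xs, ys = [], []
    for y, row in enumerate(gray):
        for x, value in enumerate(row):
            if value < dark_below:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys)


def estimate_shift(template, scan):
    """Return (dx, dy) by which the scan's content is offset from the template's, or None."""
    t = content_origin(template)
    s = content_origin(scan)
    if t is None or s is None:
        return None
    return s[0] - t[0], s[1] - t[1]


def blank_page(width, height):
    return [[255] * width for _ in range(height)]


template = blank_page(6, 5)
template[1][1] = 0
template[2][2] = 0

scan = blank_page(6, 5)
scan[2][3] = 0
scan[3][4] = 0

print(estimate_shift(template, scan))  # prints: (2, 1)
```

The resulting offset could be used to move the clipping regions before running OCR. A single dark speck near the page edge will throw the estimate off, so a real scan would need noise removal first.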

It will also depend on what you actually need from the document. For example, I have automated some signature validation: it uses Find Image near the signature box, resets the clipping region, then runs OCR inside the signature box to check whether there are enough pixels to count as a signature, along with another validation by the user.
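The "enough pixels" check can be illustrated outside UiPath as well. The sketch below swaps the OCR step for a direct count of dark pixels inside the box, which is a different technique from the workflow described above. It assumes the page is a list of rows of 0-255 grayscale values; the box coordinates and both thresholds are made-up values you would have to tune.

```python
def ink_ratio(gray, box, dark_below=128):
    """Fraction of pixels inside box that are darker than the threshold.

    gray: list of rows of 0-255 grayscale values.
    box: (left, top, right, bottom), with right and bottom exclusive.
    """
    left, top, right, bottom = box
    total = (right - left) * (bottom - top)
    if total <= 0:
        return 0.0
    dark = sum(
        1
        for y in range(top, bottom)
        for x in range(left, right)
        if gray[y][x] < dark_below
    )
    return dark / total


def looks_signed(gray, box, min_ratio=0.02):
    """Treat the box as signed when at least min_ratio of its pixels are dark."""
    return ink_ratio(gray, box) >= min_ratio


# Tiny made-up page: white everywhere except two dark pixels inside the box.
page = [[255] * 6 for _ in range(4)]
page[1][2] = 0
page[2][3] = 10

signature_box = (1, 1, 5, 3)
print(ink_ratio(page, signature_box))     # prints: 0.25
print(looks_signed(page, signature_box))  # prints: True
```

This only says that something dark is in the box, not that it is a valid signature, which is why the extra validation by the user still matters.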

EDIT: Also, you could try to get the original raw data of the PDF from another source, rather than reading it from the image.

Regards.

Shubham_Varshney · August 26, 2019, 5:12am

I have DM’ed it to you.

But do take a look at @ClaytonM’s reply; he’s right about the input.
