Reading the text from a scanned doc with low quality

Hi Guys,

I am a beginner in RPA.

Which activity can I use to read the text from a scanned document when the scan quality is poor?

Thanks,
Kris

Hey @kristse121

Do you have the scanned doc as a PDF? If so, it can be read with the Read PDF activities.

If it’s an image, it will have to be done using OCR tools :slight_smile:

Hope this helps. Do let me know if you run into any issues :slight_smile:

Hi Shubham,

Would you mind giving me your email address so that I can send it to you by email?

Thanks

Hi.

Here’s the problem with scanned images…
The position of the image is inconsistent and shifts from document to document. The way OCR works is that it tries to place a square over each character on a grid, so if the document shifts, the squares would need to shift with it (which they don’t). As a result, characters get cut off while the OCR grid is looking at them. For example, some documents will have their 8s and 0s read accurately, while others will have those same characters read as 3s and Cs, because part of each character is cut off when the OCR looks at it. I hope that makes sense.

One thing that can help is using an OCR tool that lets you restrict which characters it returns, for example digits only. ABBYY FlexiCapture (not free) seemed like a very powerful tool for this. However, I can’t say for sure whether it fully solves the scanned-image problem, since I only used it during a short trial period.
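
If your OCR tool has no such option, a rough sketch of the same idea is to clean up the text after extraction. The Python below is only an illustration: the function name and the look-alike table are my own assumptions, not something taken from ABBYY or UiPath, and the substitutions are guesses that should be checked by a person.

```python
from typing import Tuple

# Hypothetical look-alike table for a digits-only field (an assumption,
# not taken from any OCR product). Extend it for your own documents.
LOOKALIKES = {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}


def clean_numeric_field(raw: str) -> Tuple[str, bool]:
    """Return (digits, needs_review) for raw OCR text of a numeric field."""
    digits = []
    needs_review = False
    for ch in raw:
        if ch.isspace():
            continue  # ignore spacing the OCR may have inserted
        if ch in "0123456789":
            digits.append(ch)
        elif ch in LOOKALIKES:
            digits.append(LOOKALIKES[ch])
            needs_review = True  # the substitution is only a guess
        else:
            needs_review = True  # unknown character, dropped
    return "".join(digits), needs_review
```

Anything that comes back with `needs_review` set to `True` is a good candidate for a manual check rather than being trusted as-is.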

Another option is to run the OCR in an attended environment, so that the user can validate and correct the extraction during the automation. I believe UiPath has something being designed for this. There might even be a way for an unattended robot to interact with a user for validation, for example by pausing the automation until the user has validated the data.
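
To make the validation idea concrete, here is a minimal sketch in Python. It assumes your OCR engine reports a confidence score per field on a 0 to 1 scale, which is an assumption on my part (engines differ in whether and how they report this), and the function name and threshold are hypothetical.

```python
from typing import Dict, List, Tuple


def split_for_review(
    fields: Dict[str, Tuple[str, float]], threshold: float = 0.90
) -> Tuple[Dict[str, str], List[str]]:
    """Split extracted fields into auto-accepted values and names to review.

    `fields` maps a field name to (text, confidence), where confidence is
    assumed to be between 0 and 1.
    """
    accepted: Dict[str, str] = {}
    to_review: List[str] = []
    for name, (text, confidence) in fields.items():
        if confidence >= threshold and text.strip():
            accepted[name] = text
        else:
            # Low confidence or empty text: leave it for the user to check.
            to_review.append(name)
    return accepted, to_review


extracted = {
    "invoice_no": ("10383", 0.97),
    "total": ("3C.00", 0.41),
    "date": ("", 0.99),
}
accepted, to_review = split_for_review(extracted)
```

Only the fields in `to_review` would be shown to the user, so the person is not asked to re-check everything on every document.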

In any case, for OCR to give you a usable level of accuracy, the scanned images need to be placed consistently in the PDF (not shifted), and the quality must not be too poor.
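
If you cannot control how the pages are scanned, one possible workaround is to measure how far the content has moved and move the OCR region by the same amount. The sketch below is only an illustration under simple assumptions: the page is a grayscale grid of values from 0 (black) to 255 (white), the shift is a pure translation with no rotation, and there is no speckle noise near the page edges. All names are hypothetical.

```python
def content_offset(pixels, dark_below=128):
    """Return (row, col) of the top-left corner of the dark content,
    or None if the page has no dark pixels."""
    rows = [r for r, line in enumerate(pixels)
            if any(v < dark_below for v in line)]
    if not rows:
        return None
    cols = [c for line in pixels
            for c, v in enumerate(line) if v < dark_below]
    return min(rows), min(cols)


def shift_region(region, offset, reference_offset):
    """Move a (top, left, height, width) region by the distance the
    content moved relative to a reference (template) page."""
    d_row = offset[0] - reference_offset[0]
    d_col = offset[1] - reference_offset[1]
    top, left, height, width = region
    return (top + d_row, left + d_col, height, width)


template = [
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]
shifted = [
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 0, 0],
]
region_on_template = (1, 1, 1, 2)
region_on_shifted = shift_region(
    region_on_template, content_offset(shifted), content_offset(template)
)
```

A real scan would also need deskewing and noise removal before anything like this is reliable, so treat it as a starting point only.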

It will also depend on what you actually need from the document. For example, I have automated some signature validation, where it uses Find Image near the signature box, resets the clipping region, then runs OCR inside the signature box to see whether there are enough pixels to count as a signature, followed by another validation by the user.
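
As a rough illustration of the "enough pixels" check, here is a sketch that counts dark pixels directly instead of going through OCR, so it is a swapped-in technique and not the workflow described above. It assumes a grayscale grid of values from 0 (black) to 255 (white); the function names and the thresholds are hypothetical and would need tuning on real scans.

```python
def ink_ratio(pixels, box, dark_below=128):
    """Fraction of dark pixels inside box = (top, left, height, width)."""
    top, left, height, width = box
    if height <= 0 or width <= 0:
        raise ValueError("box must have a positive height and width")
    dark = 0
    for row in pixels[top:top + height]:
        for value in row[left:left + width]:
            if value < dark_below:
                dark += 1
    return dark / (height * width)


def looks_signed(pixels, box, min_ratio=0.02, max_ratio=0.60):
    """Guess whether the box holds a signature: some ink, but not a
    solid dark block (which is more likely a scanning artifact)."""
    ratio = ink_ratio(pixels, box)
    return min_ratio <= ratio <= max_ratio


blank = [[255] * 4 for _ in range(4)]
signed = [row[:] for row in blank]
signed[1][1] = 0
signed[2][2] = 0
solid = [[0] * 4 for _ in range(4)]
box = (0, 0, 4, 4)
```

With these sample grids, `looks_signed(signed, box)` is true while the blank and solid boxes are rejected. As in the original workflow, a user validation step afterwards is still a good idea.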

EDIT: Also, you could try to get the original raw data of the PDF from another source, rather than reading it from the image.

Regards.

I have DM’ed you my email address :slight_smile:

But do take a look at @ClaytonM’s reply; he’s right about the input :slight_smile: