OCR result is less accurate

Aswathy · September 15, 2017, 7:18am

The text output from Read PDF with OCR is much less accurate than what I get from screen scraping.

Is there any way to get more accurate results using Read PDF with OCR? Screen scraping is not suitable for my use case.

ddpadil · September 15, 2017, 7:28am

Hi,
No OCR is 100% accurate. But you can make use of ABBYY, which is far better than Google and Microsoft OCR.

PS: For the ABBYY OCR activity to work, you need to install ABBYY FineReader Engine and purchase a license for it.

rajukvm999 · September 15, 2017, 9:00am

Is there any plan in upcoming releases to get more accurate data using Read PDF with OCR?

ddpadil · September 15, 2017, 9:02am

@ovi

Aswathy · September 15, 2017, 10:20am

The result we get from screen scraping an image is accurate.
When we convert the same image into a PDF and read it dynamically using Read PDF with OCR, the accuracy is not even 10%.

What could be the reason for this?

Florent_Salendres · September 15, 2017, 12:27pm

Hello,

OCR generally performs better when working on a smaller area.
If the Read PDF with OCR activity does not give you the result you need, you can try scraping a smaller area as a test.

If the results are better on a smaller area, you could open the PDF via the user interface (Adobe or IE, for example), set the clipping region, and use an OCR activity. This is quite tedious to develop, but it can be an acceptable solution if your PDF quality is good and you only have a small amount of structured data to extract.

Cheers

ajith.jose · September 15, 2017, 1:02pm

Hi @Florent_Salendres,

This approach needs manual interaction. What if we want the system to automate it? I mean the case where we want to do the screen scraping dynamically.

Florent_Salendres · September 15, 2017, 1:31pm

Hi,

It does not necessarily need to be manual.

You could have a workflow that opens the PDF in the UI.

Set the clipping region to a specific part of the window.

Use the Get OCR Text activity to extract the data you need.

I’m attaching a small demo I made from an image shared on this forum. It attempts to extract the cells of a table in an image by attaching to the precise region of each cell.

This is not exactly what you want to achieve, but it could help you (or others) get on the right track.

Note that your IE zoom level needs to be set to 100% for the workflow to function.
ReadImageTable.zip (438.3 KB)
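
For anyone who cannot open the attachment, the per-cell clipping idea can be sketched outside UiPath. The Python below is only an illustration and is not taken from the workflow: it assumes the table is an evenly spaced grid, and the names `cell_regions`, `Rect`, and `margin` are hypothetical. It only computes one rectangle per cell; each rectangle would then be used as the clipping region for a separate OCR call.

```python
from typing import List, Tuple

# (left, top, width, height) in pixels, relative to the window or image.
Rect = Tuple[int, int, int, int]


def cell_regions(table: Rect, rows: int, cols: int, margin: int = 0) -> List[List[Rect]]:
    """Split a table rectangle into one clipping rectangle per cell.

    Assumes an evenly spaced grid. `margin` shrinks each cell on all
    sides so that grid lines are kept out of the OCR region.
    """
    if rows < 1 or cols < 1:
        raise ValueError("rows and cols must be at least 1")
    left, top, width, height = table
    cell_w = width // cols
    cell_h = height // rows
    if margin < 0 or 2 * margin >= min(cell_w, cell_h):
        raise ValueError("margin does not fit inside a cell")

    grid: List[List[Rect]] = []
    for r in range(rows):
        row: List[Rect] = []
        for c in range(cols):
            row.append((
                left + c * cell_w + margin,
                top + r * cell_h + margin,
                cell_w - 2 * margin,
                cell_h - 2 * margin,
            ))
        grid.append(row)
    return grid


if __name__ == "__main__":
    # A 300x200 table at (100, 50) with 2 rows and 3 columns.
    for row in cell_regions((100, 50, 300, 200), rows=2, cols=3, margin=5):
        print(row)
```

A real table rarely has perfectly even cells, so in practice the row and column boundaries would probably have to be supplied or detected rather than computed like this.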

Cheers

ClaytonM · September 15, 2017, 3:49pm

I’ve noticed this too. I believe it’s because screen scraping lets you set the zoom on the screen. The key to the best accuracy is getting the text you want to scrape to a size that fits the Scale of the OCR.

For example, if you have a Scale of 1 but the characters in the text are slightly bigger than what the OCR looks at, it cuts off part of the character, so it might see a 3 instead of an 8. And if the characters are slightly smaller, it might include part of the next character and cut that next character off.

Also, as far as I know, Scale only takes an integer, so a Scale of 4.9 is treated the same as 4. So in order to find the sweet spot of the OCR you need to adjust the zoom of the text, which lets you be more precise.

That was from my testing with Google OCR anyway.
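
To make the Scale/zoom arithmetic concrete, here is a rough Python sketch. It is only a model under stated assumptions: that zoom and Scale multiply together, that Scale is truncated to an integer as described above, and that there is some glyph height the OCR engine handles best. That `target_px` value is hypothetical and would have to be found by experiment; the function names are made up for illustration.

```python
def effective_height(glyph_px: float, zoom: float, scale: float) -> float:
    """Glyph height the OCR engine ends up looking at.

    Assumption: Scale is truncated to an integer, so 4.9 behaves like 4.
    """
    return glyph_px * zoom * int(scale)


def zoom_for_target(glyph_px: float, scale: float, target_px: float) -> float:
    """Zoom factor that brings a glyph to target_px for a given Scale."""
    int_scale = int(scale)
    if int_scale < 1:
        raise ValueError("scale must be at least 1")
    if glyph_px <= 0:
        raise ValueError("glyph_px must be positive")
    return target_px / (glyph_px * int_scale)


if __name__ == "__main__":
    glyph_px = 10.0   # character height in the source, in pixels
    target_px = 45.0  # hypothetical height the engine reads best

    # Scale alone can only reach multiples of the glyph height.
    print(effective_height(glyph_px, zoom=1.0, scale=4.9))
    print(effective_height(glyph_px, zoom=1.0, scale=4))

    # Zoom fills in the gap between integer Scale steps.
    zoom = zoom_for_target(glyph_px, scale=4, target_px=target_px)
    print(zoom, effective_height(glyph_px, zoom=zoom, scale=4))
```

The point is that Scale alone moves in whole steps, while zoom can be any fraction, so the two together can land on sizes that Scale by itself cannot.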

gregelliott · August 15, 2019, 6:16pm

@Florent_Salendres
Would you mind re-uploading this?
When I download it, I get an error saying “error detecting project version”.
