OCR result is less accurate


#1

The text returned by Read PDF with OCR is much less accurate than what I get from screen scraping.

Is there a way to get more accurate results from Read PDF with OCR? Screen scraping is not suitable for my use case.


#2

Hi,
No OCR is 100% accurate. But you can use ABBYY, which is far better than Google and Microsoft OCR.


PS: For the ABBYY OCR activity to work, you need to install ABBYY FineReader Engine and purchase a license for it.


#3

Is there any plan in upcoming releases to get more accurate data using Read PDF with OCR?


#4

@ovi


#5

The response we get from screen scraping an image is accurate.
When we convert the same image to PDF and read it dynamically with Read PDF with OCR, the accuracy is not even 10%.

What could be the reason for this?


#6

Hello,

OCR generally performs better on a smaller area.
If the Read PDF with OCR activity does not give you the result you need, you can try scraping a smaller area as a test.

If the results are better on a smaller area, you could open the PDF through the user interface (Adobe or IE, for example) and use a changed clipping region with an OCR activity. This is quite tedious to develop, but it can be an acceptable solution if the PDF quality is good and you only have a small amount of structured data to extract.

Cheers


#7

Hi @Florent_Salendres,

This requires manual interaction. What if we want the system to automate it? I mean the case where we want to do the screen scraping dynamically.


#8

Hi,

It does not necessarily need to be manual.

You could have a workflow that opens the PDF in the UI.

Set the clipping region to a specific part of the window.

Use the Get OCR Text activity to extract the data you need.

I’m attaching a small demo I made from an image shared on this forum. It attempts to extract the cells of a table in the image by attaching to the precise region of each cell.

This is not exactly what you want to achieve, but it may help you (or others) get started.

Note that your IE zoom level needs to be set to 100% for the workflow to work.
ReadImageTable.zip (438.3 KB)
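
For readers who cannot open the attachment, the idea of attaching to the region of each cell can be sketched outside of UiPath. This is only a minimal Python illustration, not the contents of the attached workflow: the `Region` type, the `cell_regions` helper, and the assumption of a uniform grid are all hypothetical. It computes one clipping rectangle per cell from the table's bounding box, which is the kind of value you would then feed to a clipping region before running OCR on each cell.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Hypothetical clipping rectangle in window pixels."""
    left: int
    top: int
    width: int
    height: int


def cell_regions(table: Region, rows: int, cols: int,
                 padding: int = 0) -> List[List[Region]]:
    """Split a table's bounding box into one clipping region per cell.

    Assumes a uniform grid (equal row heights and column widths).
    `padding` shrinks every cell inward so grid lines are left out.
    """
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    grid = []
    for r in range(rows):
        top = table.top + r * table.height // rows
        bottom = table.top + (r + 1) * table.height // rows
        row = []
        for c in range(cols):
            left = table.left + c * table.width // cols
            right = table.left + (c + 1) * table.width // cols
            row.append(Region(left + padding,
                              top + padding,
                              right - left - 2 * padding,
                              bottom - top - 2 * padding))
        grid.append(row)
    return grid


if __name__ == "__main__":
    # Example: a 3x3 table whose bounding box starts at (100, 200).
    for row in cell_regions(Region(100, 200, 300, 90), rows=3, cols=3, padding=2):
        print(row)
```

Running OCR once per small region follows the earlier point that OCR tends to do better on a smaller area. Real tables rarely have a perfectly uniform grid, so the boundaries would need adjusting per document.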

Cheers


#9

I’ve noticed this too. I believe it’s because screen scraping lets you set the zoom on the screen. The key to the best accuracy is getting the text you want to scrape to a size that fits the Scale of the OCR.

For example, if you have a Scale of 1 but the characters in the text are slightly bigger than what the OCR looks at, it cuts off part of the character, so it might see a 3 instead of an 8. And if the characters are slightly smaller, it might include part of the next character and cut that next character off.

Also, Scale is only an integer (as far as I know), so a Scale of 4.9 is the same as 4. So to find the sweet spot of the OCR, you need to adjust the zoom of the text, which lets you be more precise.

That was from my testing with Google OCR anyway.
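
To make the arithmetic concrete, here is a rough Python sketch of the reasoning above. It rests on assumptions, not on documented engine behavior: that Scale is truncated to an integer (as I observed), and that the OCR has some preferred character height, which I call `target_height_px` here. Both function names and that parameter are hypothetical, and the real preferred height would have to be found by experiment.

```python
def effective_scale(scale: float) -> int:
    """Scale actually applied, assuming the fractional part is dropped."""
    return int(scale)


def zoom_for_target(char_height_px: float, target_height_px: float,
                    scale: float) -> float:
    """Zoom factor that brings characters to the assumed preferred height.

    The OCR is assumed to see characters at
    char_height_px * zoom * effective_scale(scale) pixels.
    """
    applied = effective_scale(scale)
    if applied < 1:
        raise ValueError("scale must be at least 1")
    if char_height_px <= 0:
        raise ValueError("char_height_px must be positive")
    return target_height_px / (char_height_px * applied)


if __name__ == "__main__":
    # A Scale of 4.9 behaves like 4 under the truncation assumption.
    print(effective_scale(4.9))
    # 10 px characters, assumed 30 px sweet spot, Scale 2.9 (applied as 2).
    print(zoom_for_target(10, 30, 2.9))
```

Since Scale only moves in whole steps, the zoom is the knob that covers the values in between.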