Disabling the Tesseract engine's data dictionary

dowlagar · July 11, 2018, 1:02pm

I am using the Google OCR to scrape a GIF image. The fields I am interested in contain alphanumeric codes (i.e. a mix of letters and digits). At times, the engine incorrectly recognizes 0 (zero) as O (letter O). As the field is an ID, a single misread character defeats the whole purpose of the automation.

While investigating, I came across a Tesseract thread that talks about turning off the auto-correction / data dictionary in the engine. This should make the OCR read a zero as a zero instead of trying to form a “logical” word from the adjoining characters.
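
For reference, here is a minimal sketch of what turning off the dictionaries looks like when calling Tesseract directly from Python, outside UiPath. It assumes the `pytesseract` and `Pillow` packages and a local Tesseract install; `load_system_dawg` and `load_freq_dawg` are Tesseract's own config variables for its word lists, while `build_tesseract_config` and `read_id_field` are hypothetical helper names used only for illustration. Whether the UiPath Google OCR activity exposes these variables is not something this sketch answers, and disabling the dictionaries may reduce 0/O confusion without removing it.

```python
def build_tesseract_config(disable_dictionaries=True):
    """Build a config string that asks Tesseract to skip its word lists."""
    options = []
    if disable_dictionaries:
        # load_system_dawg: the main dictionary word list.
        options.append("-c load_system_dawg=0")
        # load_freq_dawg: the list of frequent words.
        options.append("-c load_freq_dawg=0")
    return " ".join(options)


def read_id_field(image_path):
    """Hypothetical helper: OCR one image with the dictionaries disabled."""
    # Imported here so the config builder works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    config = build_tesseract_config()
    text = pytesseract.image_to_string(Image.open(image_path), config=config)
    return text.strip()
```

The equivalent on the command line would pass the same two `-c` options to the `tesseract` executable.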

How can I do this in UiPath?

Thanks
Nitesh

Srini84 · October 18, 2018, 10:49am

@dowlagar Hi Nitesh, I am also working on OCR activities. I saw your thread, and I am stuck on a similar problem. Can you please tell me whether you have solved this?

Thanks,
Srinivas K
