Data Manager - reading Cyrillic with OCR

ydimitrova · March 11, 2021, 11:17am

Hello everyone,

The situation is:
I have Data Manager all set up and working. I also have a Microsoft OCR license, and I can use OmniPage OCR for free, since it requires no license.
I have invoices that I want to label and extract data from using Data Manager. Some of those invoices are in English and some are in Cyrillic (Bulgarian).

The problem is:
The invoices in English are read perfectly, but the ones in Cyrillic are not. For example, if the invoice contains “Описание”, Microsoft OCR reads it as “OnncaHne”, which is completely wrong.
OmniPage OCR reads “Описание” as “OnIcaHNe”.

The question:
Can you suggest how to make the OCR work not only for English but for Cyrillic too?

Regards,
Yoana

melanie · July 14, 2021, 9:39am

Hi,

I think you can use Tesseract OCR with the Bulgarian language; maybe it is “bul” that you need to specify in the Language property. I know that for Microsoft OCR you need to download the language pack on your own laptop (Installing OCR Languages), but it never worked for me.
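For anyone trying Tesseract outside of UiPath, here is a minimal Python sketch of how a combined Bulgarian and English language setting could be passed to the engine. It assumes the `pytesseract` and `Pillow` packages plus the `bul` and `eng` Tesseract language data are installed. The helper names `tesseract_lang` and `ocr_image` are made up for illustration, and this is not how the UiPath Tesseract OCR activity is configured.

```python
from typing import Iterable


def tesseract_lang(codes: Iterable[str]) -> str:
    """Join Tesseract language codes into the 'a+b' form, e.g. 'bul+eng'."""
    cleaned = [c.strip().lower() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_image(path: str, codes: Iterable[str] = ("bul", "eng")) -> str:
    """Run Tesseract on one image with the given languages (hypothetical helper)."""
    # Imported lazily so tesseract_lang() works without the OCR dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=tesseract_lang(codes))
```

Calling `ocr_image("invoice.png")` (the file name is only a placeholder) would then ask Tesseract to recognize both Bulgarian and English text on the same page.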

Good luck!

ydimitrova · July 16, 2021, 11:30am

Hi,

Thank you for the reply.

I found a workaround. I will try to explain it in case someone has the same problem and searches for a solution.
Before uploading the files into Data Manager, I run a process that digitizes them with OmniPage OCR, because it allows specifying two languages in its properties (example: “BUL, ENG”). I need two languages because the Cyrillic documents contain some English words too.
The steps in the process are:
Load Taxonomy → Digitize Document (with OmniPage OCR) → Data Extraction Scope (with a regex extractor containing only one parameter, ‘name’, with the regular expression defined as ‘abc’) → Train Extractors Scope (with ML Extractor Trainer, with only the Output Folder specified).
As a result, I get a zip file in the specified Output Folder that contains the metadata. This zip should be imported into Data Manager (DM), and then the documents are ready for labeling.
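Before importing, it can help to check that the digitized text really came out in Cyrillic rather than Latin look-alikes such as the “OnncaHne” output reported earlier in this thread. The following is a minimal Python sketch of such a check. It is not part of UiPath, the function names and the 0.5 threshold are assumptions chosen for illustration, and it only looks at the Unicode script of each letter.

```python
import unicodedata


def script_counts(text: str) -> dict:
    """Count Cyrillic and Latin letters in a string, ignoring everything else."""
    counts = {"cyrillic": 0, "latin": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    return counts


def looks_misread(text: str, min_cyrillic_ratio: float = 0.5) -> bool:
    """Flag text expected to be Cyrillic that is mostly Latin letters.

    The threshold is an arbitrary assumption; mixed Bulgarian/English
    documents may need a lower value.
    """
    counts = script_counts(text)
    total = counts["cyrillic"] + counts["latin"]
    if total == 0:
        return False  # no letters at all, nothing to judge
    return counts["cyrillic"] / total < min_cyrillic_ratio


if __name__ == "__main__":
    for sample in ("Описание", "OnncaHne", "OnIcaHNe"):
        print(sample, script_counts(sample), looks_misread(sample))
```

If a field that should hold Bulgarian text is flagged, the OCR engine was most likely run without the Bulgarian language enabled.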

Marzhan_Oshanova · May 19, 2023, 9:35am

Hello @ydimitrova! Have you found a way to process documents in Cyrillic?
