Data Manager - Unable to OCR Cyrillic (Non-Latin) Characters

Why is Data Manager not able to OCR Cyrillic (non-Latin) characters?

Currently, the OCR engines available in Data Manager (DM) are configured using only a URL and a key, with no language flag. This means that they work well only as long as they are able to automatically detect the language of the document.

Automatic language detection works well for some engines, like Google Cloud Vision OCR, but it works less well for others, especially OmniPage OCR or Microsoft Read OCR. Adding a language flag to the Data Manager OCR configuration screen is in our backlog, and the feature is in the product development pipeline.

In the meantime, there is a workaround that can work very well:

  • Run the documents through the Digitize activity with whatever OCR engine and language setting you want, and then import the results into DM using the ML Extractor Trainer activity.

The process automation should look like this:

  • Label the data outside of DM: build a simple workflow that does Digitize + no data extraction + Validation Station (attended or Action Center) + train extractors, to collect the data submitted by the user.
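
The order of the steps in that workflow can be sketched in plain Python. This is only an illustration of the data flow, not UiPath code: every name here (digitize, validate, collect_for_training, run_workaround, the "ru" language code, the engine name) is a hypothetical stand-in, and the real workflow would be built from the activities named above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DigitizedDocument:
    """Hypothetical stand-in for the output of the Digitize step."""
    path: str
    ocr_engine: str
    language: str
    text: str


@dataclass
class TrainingSet:
    """Hypothetical stand-in for the data later imported into DM."""
    samples: List[dict] = field(default_factory=list)


def digitize(path: str, ocr_engine: str, language: str) -> DigitizedDocument:
    # Placeholder: a real workflow would run OCR here with an explicit
    # language setting instead of relying on automatic detection.
    return DigitizedDocument(path, ocr_engine, language, text=f"<text of {path}>")


def validate(doc: DigitizedDocument,
             label_fn: Callable[[DigitizedDocument], Dict[str, str]]) -> Dict[str, str]:
    # Placeholder for the validation step: no data is pre-extracted,
    # so every field value comes from the user (modelled by label_fn).
    return label_fn(doc)


def collect_for_training(doc: DigitizedDocument, labels: Dict[str, str],
                         training_set: TrainingSet) -> None:
    # Placeholder for the trainer step: store the user-submitted labels
    # together with the digitized text.
    training_set.samples.append(
        {"path": doc.path, "language": doc.language,
         "text": doc.text, "labels": labels}
    )


def run_workaround(paths: List[str], ocr_engine: str, language: str,
                   label_fn: Callable[[DigitizedDocument], Dict[str, str]]) -> TrainingSet:
    training_set = TrainingSet()
    for path in paths:
        doc = digitize(path, ocr_engine, language)       # 1. digitize
        labels = validate(doc, label_fn)                 # 2. validate (no extraction)
        collect_for_training(doc, labels, training_set)  # 3. collect for training
    return training_set


if __name__ == "__main__":
    result = run_workaround(
        ["invoice_001.pdf"], "SomeOcrEngine", "ru",
        label_fn=lambda doc: {"total": "1500"},
    )
    print(len(result.samples))
```

The point of the sketch is that the language is chosen at the digitize step, outside DM, so the labelled data that reaches DM already contains correctly recognized text.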

Yes, this helps for the time being, but it conflicts with the documentation, which says not to train the model from scratch using Validation Station. :)