Computer vision to extract non-English characters?

Hi,

I am currently developing an automation that extracts a data table from a system. The system isn’t very well developed and gives me no way to extract information from it other than computer vision.

However, as I work at a Swedish company, a lot of the information contains the Swedish characters “å”, “ä” and “ö”. When I use CV Extract Table, it only gives me English characters: å and ä become “a”, and ö becomes “o”.

Is there any way to change the encoder of computer vision to include these characters? Thanks!

Below is an example of the data to extract.

image

@henry.wang

1. Custom OCR Language Packs: UiPath provides OCR language packs that you can download and install.

2. Tesseract OCR with Language Data: If you are using the Tesseract OCR engine within UiPath, you can configure it to recognize non-English characters by specifying the language data.

3. Post-Processing for Character Correction: After extracting text using Computer Vision OCR, you can implement a post-processing step in your automation workflow to replace incorrect characters with the correct ones. Use string manipulation functions to replace “a” with “å”, “o” with “ö”, and so on, in the extracted text.

Assign activity:
ExtractedText = ExtractedText.Replace("a", "å").Replace("o", "ö")

4. Custom OCR Engines: For advanced use cases, consider using third-party OCR engines that provide better support for non-English characters.

Thanks for the reply. However, I don’t think that the previous information alone will help me reach my goal here. Here are my questions/comments:

1. Custom OCR Language Packs: These are only for extracting text as far as I know. Or is there anything for extracting data tables? If so, what’s the name of the activity?

2. Tesseract OCR: How do I find these settings? Is it set somewhere in the Orchestrator?

3. Character Correction: This would not help, because only some of the “a” characters should be “å” or “ä”, not all of them.

4. Custom OCR: Do you have any suggestions on which ones would be good to use?
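
On point 3, a blanket replace is not the only form post-processing can take. If the table only contains values from a known set (city names, product names and so on), whole words could be matched against that list instead. The snippet below is only a rough sketch of that idea in Python, not something UiPath provides: `KNOWN_WORDS`, `fold`, `build_lookup` and `restore` are all made-up names, and the word list is invented for illustration. Inside UiPath the same logic would have to be ported to VB.NET or C#, for example in an Invoke Code activity.

```python
import re

# Hypothetical vocabulary: values expected in the table, spelled correctly.
KNOWN_WORDS = ["Göteborg", "Malmö", "Västerås", "Örebro", "Stockholm"]

# Mimics the loss described above: å and ä become a, ö becomes o.
_FOLD = str.maketrans("åäöÅÄÖ", "aaoAAO")


def fold(word):
    """Return the word as the OCR would report it, without diacritics."""
    return word.translate(_FOLD)


def build_lookup(words):
    """Map each folded, lower-cased spelling to its correct spelling.

    Spellings that fold to the same string (e.g. "far" and "får") are
    dropped, because the correct form cannot be decided from the text alone.
    """
    lookup = {}
    ambiguous = set()
    for w in words:
        key = fold(w).lower()
        if key in lookup and lookup[key] != w:
            ambiguous.add(key)
        lookup[key] = w
    for key in ambiguous:
        del lookup[key]
    return lookup


def restore(text, lookup):
    """Replace whole words that have exactly one known spelling."""
    def repl(match):
        word = match.group(0)
        return lookup.get(word.lower(), word)
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters.
    return re.sub(r"[^\W\d_]+", repl, text)
```

Words that are not in the list, or that are ambiguous, are left exactly as the OCR returned them, so this only helps when the expected values are known in advance. Note that a match is replaced with the casing stored in the list.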


Hello @henry.wang!
Did you find a solution to this problem? I am currently facing the exact same problem, only with the Norwegian characters ‘æ’, ‘ø’ and ‘å’.

I did test Tesseract OCR, but that only gets me so far. CV Extract Table cannot be set up to use Tesseract as far as I can see.

Tesseract gives me an IEnumerable with all the words and their corresponding X and Y coordinates on the screen. With some manipulation of the IEnumerable it should be possible to work out columns and rows from the coordinates, but it requires quite some work…
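
To give an idea of what that manipulation might look like, here is a minimal sketch in Python. It assumes the OCR result has already been converted to a list of words with a left (x) and top (y) coordinate, that the first row is a header with one word per column, and that columns are left-aligned. The `Word` class, the function names and the tolerance values are all assumptions for illustration, not part of UiPath or Tesseract.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on screen
    y: float  # top edge of the word on screen


def group_rows(words, y_tol=8):
    """Group words into rows: a word joins the current row when its y
    is within y_tol of the first word placed in that row."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(w.y - rows[-1][0].y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Order each row left to right.
    return [sorted(r, key=lambda w: w.x) for r in rows]


def to_table(words, y_tol=8, x_tol=5):
    """Build a list of rows, each a list of cell strings."""
    rows = group_rows(words, y_tol)
    if not rows:
        return []
    # Assumption: the header row has one word per column, so the left
    # edges of its words mark where each column starts.
    starts = [w.x for w in rows[0]]
    table = []
    for row in rows:
        cells = [[] for _ in starts]
        for w in row:
            # Pick the last column that starts at or before this word.
            col = max(
                (i for i, s in enumerate(starts) if s <= w.x + x_tol),
                default=0,
            )
            cells[col].append(w.text)
        table.append([" ".join(c) for c in cells])
    return table
```

With a header of “Namn” at x=10 and “Stad” at x=120, the words “Åsa” (x=10) and “Berg” (x=45) on the same line would end up together in the first column as “Åsa Berg”. Right-aligned or centered columns, or multi-word headers, would need a different way of finding the column boundaries.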