OmniOCR is not extracting all the words on all pages

Hello,

I am using the Digitize activity of Document Understanding, and within it I am using OmniPage OCR to extract all the words on an invoice. The invoice has 4 pages in total.

When I look at the text result (variable name: OmniText), it contains all the strings on all pages, but when I look at the words (variable name: OmniResult), it only has the words that are on the last page of the invoice. To look at the word key-value pairs I use a For Each activity.

Am I missing something?

By the way, my aim is to iterate through all the words on the invoice. The invoice is a PDF with OCR text on it. Is there any alternative way to do this?

Hi Burak,

Do you want to use the Digitize activities for any reason in particular? If your objective is just to extract ALL the words, maybe try the Read PDF activity (PDF package) and then use regular expressions to match every separate word. It’s not a fancy solution, but it may be a little more straightforward.
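To illustrate the regex idea, here is a minimal sketch in Python (used only to show the logic; it is not UiPath code). The sample text is made up and stands in for whatever the PDF reading step returns. The pattern `\S+` matches each run of non-whitespace characters, and the same pattern is also valid in .NET regular expressions.

```python
import re

# Made-up sample standing in for the text returned by a PDF reading step.
pdf_text = "Invoice No: 12345\nItem   Qty   Price\nWidget  2   9.99"

# \S+ matches one run of non-whitespace characters, so each match is one "word".
words = re.findall(r"\S+", pdf_text)

# Iterating the list yields whole words, not single characters.
for word in words:
    print(word)
```

Note that punctuation stays attached to a word here (for example `No:`), so a stricter pattern may be needed depending on the invoice layout.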

Hi Marti,

Thanks for the reply. I am open to any suggestions. My objective is just to extract all the words. I have tried using the “Read PDF Text” activity, but after that I am lost.

When I iterate over the text, it goes character by character rather than word by word, because it is all one string. I need to detect the start and end of the table on the invoice, and after that I need to extract the line items.
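For what it is worth, the table part could be approached roughly like this. This is only a sketch in Python to show the logic, not UiPath code: the function name `extract_table_lines`, the marker strings "Description" and "Total", and the sample text are all assumptions made for illustration, and a real invoice will need its own markers.

```python
def extract_table_lines(text, start_marker, end_marker):
    """Return the rows found after the line containing start_marker
    and before the line containing end_marker, each split into words."""
    rows = []
    inside_table = False
    for line in text.splitlines():
        if not inside_table:
            # Skip everything up to and including the table header line.
            if start_marker in line:
                inside_table = True
            continue
        if end_marker in line:
            # The table ends at the first line that contains the end marker.
            break
        if line.strip():
            # split() with no arguments splits on any whitespace: words, not characters.
            rows.append(line.split())
    return rows


# Made-up invoice text; the markers below are hypothetical.
sample_text = """Invoice No: 12345
Description Qty Price
Widget 2 9.99
Gadget 1 24.50
Total 44.48
Thank you"""

rows = extract_table_lines(sample_text, "Description", "Total")
for row in rows:
    print(row)
```

The text is split into lines first and then each line into words, so the loop works on line items instead of single characters.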

Hi @Arloth,

If you simply want to extract all the text from the document, you can also use the variable (OCR_text) under “Document Text” shown in your screenshot. This will contain all the text from the document.

Alternatively, if there is no specific requirement to use the DU framework, you can use the ‘Read PDF With OCR’ activity, which extracts the text from a PDF so that it can be used later in other activities.

Hope this helps.

Regards
Sonali

Adding to @sonaliaggarwal47's response: if you prefer to go with multiple documents and use the DU framework, UiPath Document OCR extraction is very good (though not so good with handwriting). But it requires licensing… Otherwise, OmniPage should extract in wise…

This could be down to the accuracy of the OCR engine. Try a different OCR engine, such as Google Cloud Vision OCR or Azure Form Extractor, for better handwriting capture.

OmniText should have all the text you need. Please use the DU Process found in the templates for this. Once you have the text, the next step is to extract the data and then process it. I’d also recommend taking a DU course on the Academy to learn more about how DU works.

From recent experience, the OCR and DU offering is staggeringly powerful. I will say, however, that you should make sure you are using the very latest versions of the OCR engines and DU packages.

I was having failures and missed extractions all over the place, but once I realized that most if not all of my packages were quite outdated, I updated them all and it has completely transformed the experience for me.

I did end up using UiPath Document OCR, however.

Andy