OmniOCR is not extracting all the words on all pages

Hello,

I am using the Digitize activity of Document Understanding, and within it I am using OmniPage OCR to extract all the words on an invoice. The invoice has 4 pages in total.

When I look at the text result (variable name: OmniText), it contains all the strings from all pages. But when I look at the words (variable name: OmniResult), it only has the words that are on the last page of the invoice. To look at the word key-value pairs I use a For Each activity.

Am I missing something?

By the way, my aim is to iterate through all the words on the invoice. The invoice is a PDF with OCR text on it. Is there any alternative way to do this?

Hi Burak,

Do you want to use the Digitize activities for any reason in particular? If your objective is just to extract ALL the words, maybe try the Read PDF Text activity (PDF package) and then use a regex to match every separate word. It's not a fancy solution, but it may be a bit more straightforward.
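To illustrate the regex idea outside of UiPath, here is a minimal sketch in Python. It assumes the text has already been read into a single string and treats a "word" as any run of non-whitespace characters. The function name split_words and the sample text are made up for the example; in a workflow the same pattern could be applied to the string returned by the PDF activity.

```python
import re

def split_words(text: str) -> list[str]:
    """Return every whitespace-separated token found in the text."""
    # \S+ matches one or more consecutive non-whitespace characters,
    # so spaces, tabs and line breaks all act as word separators.
    return re.findall(r"\S+", text)

# Hypothetical sample text standing in for the extracted PDF text.
sample = "Invoice No: 12345\nTotal due  99.50 EUR"

for word in split_words(sample):
    print(word)
```

Note that punctuation stays attached to the word (for example "No:"), which is usually what you want for invoice numbers and amounts.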

Hi Marti,

Thanks for the reply. I am open to any suggestions. My objective is just to extract all the words. I have tried using the "Read PDF Text" activity, but after that I am lost.

When I iterate over the text, it goes character by character rather than word by word, because the result is one single string. I need to detect the start and end of the table on the invoice, and after that I need to extract the line items.
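One possible way to approach the table part, sketched in Python under some loud assumptions: the table starts on the line containing a known header word and ends at a line starting with a known footer word. The markers "Description" and "Total", the function name extract_line_items, and the sample invoice text are all hypothetical; a real invoice will need its own markers, and a table that continues across pages may need extra handling.

```python
def extract_line_items(text, header_marker="Description", footer_marker="Total"):
    """Collect the rows between a header line and a footer line.

    Each row is returned as a list of whitespace-separated words.
    Both markers are assumptions and must be adapted to the invoice layout.
    """
    items = []
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_table:
            # Skip everything until the header row is seen.
            if header_marker in stripped:
                in_table = True
            continue
        if stripped.startswith(footer_marker):
            # The footer row marks the end of the table.
            break
        if stripped:
            # Splitting on whitespace gives words, not single characters.
            items.append(stripped.split())
    return items

# Hypothetical extracted text standing in for the invoice.
sample = (
    "ACME Ltd\n"
    "Invoice 001\n"
    "Description Qty Price\n"
    "Widget 2 10.00\n"
    "Gadget 1 5.50\n"
    "Total 25.50\n"
    "Thank you\n"
)

for row in extract_line_items(sample):
    print(row)
```

The key point is splitting the string first (by line, then by whitespace) before looping, instead of looping over the string directly.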

Hi @Arloth,

If you simply want to extract all the text from the document, you can also use the variable (OCR_text) under "Document Text" shown in your screenshot. It will contain all the text from the document.

Alternatively, if there is no specific requirement to use the DU framework, you can use the "Read PDF With OCR" activity, which extracts the text from the PDF so that it can be used later in other activities.

Hope this helps.

Regards
Sonali

Adding to @sonaliaggarwal47's response: if you prefer to work with multiple documents and use the DU framework, UiPath Document OCR extraction is very good (though not so good with handwriting). But it has licensing… Otherwise, OmniPage should extract in wise…