How to use the IntelligentOCR Package

Hi @Ioana_Gligan, awesome responses! This info will help me build better automations. I’ll PM you about our use of the IntelligentOCR package.

P.S. I have two more feature requests:

RegEx Classifier
This would let us define a RegEx that classifies a document whenever it matches. For example, we might have a RegEx that looks for a certain phone number pattern or a certain invoice name pattern, and we could then easily classify the document based on that match.

This is an improvement on the current static keyword classification.
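To make the request concrete, here is a minimal sketch of the idea in plain Python. It is not part of the IntelligentOCR package; the document type names, the patterns, and the `classify` helper are all made up for illustration.

```python
import re

# Hypothetical document types, each with one pattern that identifies it.
CLASSIFIER_PATTERNS = {
    "Invoice": re.compile(r"\bINV-\d{6}\b"),                 # assumed invoice name pattern
    "Support Ticket": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),  # assumed phone number pattern
}


def classify(text):
    """Return the first document type whose pattern matches, or None."""
    for doc_type, pattern in CLASSIFIER_PATTERNS.items():
        if pattern.search(text):
            return doc_type
    return None


print(classify("Invoice no. INV-004217 due next month"))
print(classify("Call us on (555) 010-2030"))
print(classify("Nothing recognisable here"))
```

A static keyword classifier can only look for fixed strings, while a pattern like the ones above also covers values that change from document to document.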

Multiple RegExes for each item
Sometimes a vendor writes its invoice number or PO number in several different ways. I would like to be able to connect multiple RegEx values to a single item in a document type. A RegEx that has no match is simply ignored. If exactly one value is matched, we can use it automatically. If there are multiple matches, we can let the user select the correct value for the item.

This would allow us to have a single “Invoice Number” item with multiple RegExes linked to it. This way, we don’t need multiple RegEx extractors for the same document type.

Currently, each item in a document type can have only one RegEx linked to it.
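A rough sketch of the behaviour I have in mind, written in plain Python rather than against the package. The patterns and the `candidates` and `resolve` helpers are hypothetical, and each pattern is assumed to wrap the value in one capture group.

```python
import re

# Hypothetical patterns, all feeding the single "Invoice Number" item.
INVOICE_NUMBER_PATTERNS = [
    r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)",
    r"\b(INV-\d+)\b",
]


def candidates(text, patterns):
    """Collect the distinct values captured by any of the patterns."""
    found = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            value = match.group(1)
            if value not in found:
                found.append(value)
    return found


def resolve(text, patterns):
    """Decide what to do with the item based on how many values were found."""
    values = candidates(text, patterns)
    if not values:
        return ("none", None)       # no RegEx matched, leave the item empty
    if len(values) == 1:
        return ("auto", values[0])  # one value, use it automatically
    return ("ask_user", values)     # several values, let the user choose


print(resolve("Invoice No: INV-1001", INVOICE_NUMBER_PATTERNS))
print(resolve("Invoice # A-77, ref INV-2002", INVOICE_NUMBER_PATTERNS))
print(resolve("no numbers here", INVOICE_NUMBER_PATTERNS))
```

The "ask_user" case is where the validation step would show the candidate values for the user to pick from.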


Hi @Ioana_Gligan,

I’ve seen the beta release of the regex table extractor that you showed in the private preview video and am trying it with a sample. However, when I run the workflow, nothing gets extracted when I view it in Validation Station.

The overall process is pretty simple, and for the regex extractor, the regex is as below (pretty sure the regex is valid as I’ve tested it separately).

Any pointers as to what could’ve gone wrong? I’m using 4.3.0-preview of the IntelligentOCR.Activities.

This is the sample PDF I’m using, downloaded from the web: https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf

Have you tried putting the RegEx inside of a capture group? For example (\d\d\d\d). Just \d\d\d\d doesn’t seem to work for some reason…
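As an illustration only, here is how the difference looks in plain Python. This assumes the extractor reads its values from capture groups, which is a guess based on the behaviour described and not something confirmed for the package.

```python
import re

text = "Order 2020 shipped"

without_group = re.search(r"\d\d\d\d", text)
with_group = re.search(r"(\d\d\d\d)", text)

# Both patterns match the same text...
print(without_group.group(0), with_group.group(0))

# ...but only the second one exposes a captured group to read a value from.
print(without_group.groups())
print(with_group.groups())
```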

Hi Warren,

How did you get OmniPage OCR?

  1. Try to add the “capture” flag to the elements you want to capture from your expression - you will find this setting in the regex editor as the last option for each line you add - or you can add the capture groups yourself.
  2. Try to test the regular expressions with the text output of the Digitize Document component. There might be differences between how the robot “reads” the document and how we perceive it.
  3. Double check that you activated the table field in the “Configure Extractors” wizard of the data extraction scope.
  4. Can you please send us a sample pdf document and a sample workflow to reproduce it?

Hi @Ioana_Gligan,

I’ve tried adding the capture flag, but it’s still not able to extract anything.
The table fields have been activated in the extractor (only the regex extractor is configured).

I’ve attached the workflow I’m using here; the PDF file is in the zip file as well.

You should be able to just run it in debug. It reads the PDF in the given directory and prints the digitized text with a Write Line (it looks the same as what I see in the PDF).

01 DocumentUnderstandingPoC.zip (84.8 KB)

Hi irahmat,

You can download the package via Manage Packages in Studio.

Hello @warren_lee,

Found the mistake.

You added a “Results” string on the table line, in the Regex configuration. That was causing the issue.

The way the table configuration works is as follows:

  • on the table row in the config wizard, you can optionally add a regular expression. If you don’t add one, the entire document will be considered as a huge “table” that row expressions should be applied to. If you do fill it in, that expression should CAPTURE the entire table area. You added “Results” in there, which matches, but returns the match “Results” only (not the actual table area).
  • on the rows line in the config wizard, you can optionally add a regular expression. This expression is applied to the output of the TABLE CAPTURE. If you don’t add one, each row (delimited by \n) will be considered as a normal “row” that column expressions should be applied to. If you do fill it in (for example if you have a case of rows that run across multiple lines), that expression should CAPTURE ALL rows in the table area. You left this blank, it was not a problem.
  • on each column row in the config wizard, you add a regular expression that is applied to EACH ROW CAPTURE.

So to solve your first issue (no results), remove the expression for the table (the “Results” keyword), and test your regexes for the columns.
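The three levels described above can be approximated in plain Python as follows. This is only a sketch of the logic, not the extractor's actual implementation: the `extract_table` helper, the sample text, and the patterns are made up, and every pattern is assumed to use one capture group.

```python
import re


def extract_table(text, column_patterns, table_pattern=None, row_pattern=None):
    # Table level: without a pattern, the whole document is the "table".
    if table_pattern:
        match = re.search(table_pattern, text, re.DOTALL)
        area = match.group(1) if match else ""
    else:
        area = text

    # Row level: without a pattern, every non-empty line is a row.
    if row_pattern:
        rows = [m.group(1) for m in re.finditer(row_pattern, area)]
    else:
        rows = [line for line in area.split("\n") if line.strip()]

    # Column level: each column pattern is applied to each row capture.
    table = []
    for row in rows:
        cells = {}
        for name, pattern in column_patterns.items():
            match = re.search(pattern, row)
            cells[name] = match.group(1) if match else None
        if any(value is not None for value in cells.values()):
            table.append(cells)
    return table


sample = "Results\nAlice 10\nBob 20\nTotals\nAll 30\n"
columns = {"name": r"^([A-Za-z]+)\s", "score": r"\s(\d+)$"}

# A table pattern that only captures the keyword leaves no rows to work on.
print(extract_table(sample, columns, table_pattern=r"(Results)"))

# A table pattern that captures the whole area between two markers.
print(extract_table(sample, columns, table_pattern=r"Results\n(.*?)\nTotals"))

# No table pattern: the column patterns run over every line of the document.
print(extract_table(sample, columns))
```

The first call mirrors the “Results” mistake: the pattern matches, but the captured area is just the word itself, so the column expressions have nothing to extract.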

Hope this helps,

Ioana

Hi @Ioana_Gligan,

Thanks! I removed “Results” from the table-level regex and that did the trick.
Do you mind expanding a bit more on the regex at the table level? I think that’s where I was a bit confused. From your comments above:

  • on the table row in the config wizard, you can optionally add a regular expression. If you don’t add one, the entire document will be considered as a huge “table” that row expressions should be applied to. If you do fill it in, that expression should CAPTURE the entire table area. You added “Results” in there, which matches, but returns the match “Results” only (not the actual table area).

Q: What exactly do you mean when you say the 'expression should CAPTURE the entire table area'? If you don't mind, it's always easier to illustrate with examples, so I've attached a sample PDF here with multiple tables to help walk through it.

Let’s say I have 3 tables in the PDF document (imagine for now that they all carry different information and have different patterns), and I’m only interested in extracting table #2 and all rows from that table.

How would I then go about using the ‘TABLE LEVEL REGEX’ to extract just that table area (table #2)? [Is it by a keyword that identifies the table? Or is it an expression that matches the whole table or parts of the table's column names?]
My understanding of the table-level regex is that it locates a given table area of interest and applies any row- or column-level regex to that output. This would be helpful, as I would not want the bot to scan through the whole document when there are multiple tables and I already know which ‘table area’ I would like to extract.

I’ve also attached the sample PDF here for your reference.
Appreciate your guidance!

SamplePDF.pdf (89.3 KB)

Similarly, how do I get the output of the Classify Document Scope?

I would like to see whether the bot has classified the document correctly based on the keyword provided. Also, please correct me if I am wrong: this keyword should be something that is present in the invoice, right?

I was going through one of the sample invoices provided. What if I want to extract the address as well? How can I achieve that?

I tried creating a field in the taxonomy, but the data is not getting extracted in Validation Station.

Steps are:

  • create field in taxonomy
  • go to the data extraction scope, click on “Configure Extractors”, and activate the address field. Make sure to put in the right string if you are still using the old version

I also recommend switching from the UiPath.MachineLearningExtractor (which has been discontinued) to UiPath.DocumentUnderstanding.ML.Activities, which is now officially released and will keep being updated.
