Hi @Ioana_Gligan, awesome responses! This info will help me build better automations. I’ll PM you about our use of the IntelligentOCR package.
P.S. I have two more feature requests:
RegEx Classifier
This would let us define a RegEx query that, on a match, classifies the document. For example, we might have a RegEx that looks for a certain phone number pattern, or a certain invoice name pattern, and we could then easily classify the document based on a regular expression match.
This is an improvement on the current static keyword classification.
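A minimal sketch of what such a classifier could do (the document-type names and patterns here are invented purely for illustration, not part of the package):

```python
import re

# Hypothetical mapping of document types to identifying patterns.
CLASSIFIER_RULES = {
    "AcmeInvoice": re.compile(r"INV-\d{6}\b"),               # invoice number pattern
    "ContactForm": re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}"),   # US phone number pattern
}

def classify(document_text):
    """Return the first document type whose pattern matches, else None."""
    for doc_type, pattern in CLASSIFIER_RULES.items():
        if pattern.search(document_text):
            return doc_type
    return None

print(classify("Please pay invoice INV-004211 by March 1."))  # -> AcmeInvoice
```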
Multiple RegExes for each item
Sometimes, a vendor may write the invoice number or PO number on an invoice in several different ways. I would like to be able to connect multiple RegEx values to a single item in a document type. If a RegEx produces no match, we skip it; if there is exactly one match, we use it automatically; and if there are multiple matches, we let the user select the correct value for the item.
This would allow us to have a single “Invoice Number” item with multiple RegExes linked to that one variable, so we wouldn’t need multiple RegEx extractors for the same document type.
Currently, you can only link one RegEx to an item that is part of the document type.
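The fallback logic requested above could look roughly like this (the patterns and function are illustrative, not the package’s actual API):

```python
import re

# Several hypothetical ways a vendor might write its invoice number.
INVOICE_NUMBER_PATTERNS = [
    re.compile(r"Invoice\s*#\s*(\d+)"),
    re.compile(r"Invoice\s+No\.?\s*(\d+)"),
    re.compile(r"INV-(\d+)"),
]

def extract_invoice_number(text):
    """Collect matches from all patterns. A single match would be used
    automatically; several would be presented to the user for selection."""
    matches = []
    for pattern in INVOICE_NUMBER_PATTERNS:
        m = pattern.search(text)
        if m:
            matches.append(m.group(1))
    if len(matches) == 1:
        return matches[0]   # unambiguous: use it automatically
    return matches          # empty or ambiguous: hand off to the user

print(extract_invoice_number("Invoice No. 7731 dated 2020-01-15"))  # -> 7731
```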
I’ve seen the beta release of the regex table extractor that you showed in the private preview video, and I’m trying it with a sample. However, when I run the workflow, nothing gets extracted when I view it in Validation Station.
The overall process is pretty simple, and for the regex extractor the regex is as below (I’m pretty sure the regex is valid, as I’ve tested it separately).
Any pointers as to what could’ve gone wrong? I’m using 4.3.0-preview of the IntelligentOCR.Activities.
Try to add the “capture” flag to the elements you want to capture from your expression - you will find this setting in the regex editor as the last option for each line you add - or you can add the capture groups yourself.
Try to test the regular expressions with the text output of the Digitize Document component. There might be differences between how the robot “reads” the document and how we perceive it.
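To illustrate why this matters: the digitized text can differ from the rendered PDF in whitespace and line breaks, so a regex that matches what you see on the page may fail against the digitizer’s output. A made-up example:

```python
import re

# The PDF may visually show "Total: $1,250.00", but the digitizer's text
# output can contain different whitespace or line breaks (example made up).
digitized_text = "Total :\n$ 1,250.00"

strict = re.compile(r"Total: \$([\d,.]+)")             # matches the PDF as we see it
tolerant = re.compile(r"Total\s*:\s*\$\s*([\d,.]+)")   # matches the digitized output

print(strict.search(digitized_text))             # -> None
print(tolerant.search(digitized_text).group(1))  # -> 1,250.00
```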
Double check that you activated the table field in the “Configure Extractors” wizard of the data extraction scope.
Can you please send us a sample PDF document and a sample workflow to reproduce it?
I’ve tried adding the capture flag, but it’s still not able to extract it.
The table fields have been activated in the extractor (only has regex extractor configured).
I’ve attached the workflow I’m using; the PDF file is in the zip file as well.
You should be able to just run it in debug; it will read the PDF in the given directory and print out the digitized text via Write Line (which looks the same as what I see in the PDF).
You added a “Results” string on the table line, in the Regex configuration. That was causing the issue.
The way the table configuration works is as follows:
on the table row in the config wizard, you can optionally add a regular expression. If you don’t add one, the entire document will be considered as a huge “table” that row expressions should be applied to. If you do fill it in, that expression should CAPTURE the entire table area. You added “Results” in there, which matches, but returns the match “Results” only (not the actual table area).
on the rows line in the config wizard, you can optionally add a regular expression. This expression is applied to the output of the TABLE CAPTURE. If you don’t add one, each row (delimited by \n) will be considered as a normal “row” that column expressions should be applied to. If you do fill it in (for example if you have a case of rows that run across multiple lines), that expression should CAPTURE ALL rows in the table area. You left this blank; that was not a problem.
on each column row in the config wizard, you add a regular expression that is applied to EACH ROW CAPTURE.
So to solve your first issue (no results), remove the expression for the table (the “Results” keyword) and test your regexes for the columns.
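The three levels above can be emulated outside the wizard to debug each expression in isolation (a rough sketch with invented sample text; the activity’s internals may differ):

```python
import re

document_text = """Header text
Results
A100  Widget  2
A200  Gadget  5
Footer text"""

# Level 1: the table expression must CAPTURE the whole table area,
# not just match a keyword. (?s) lets '.' cross line breaks.
table_re = re.compile(r"(?s)Results\n(.*?)\nFooter")
table_area = table_re.search(document_text).group(1)

# Level 2: no row expression, so rows are simply lines.
rows = table_area.split("\n")

# Level 3: column expressions applied to each row capture.
row_re = re.compile(r"(\w+)\s+(\w+)\s+(\d+)")
for row in rows:
    m = row_re.match(row)
    if m:
        print(m.groups())  # -> ('A100', 'Widget', '2') then ('A200', 'Gadget', '5')
```

Note how a bare keyword like `Results` at level 1 would capture nothing useful on its own; the capture group around the lines that follow it is what hands the table area down to level 2.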
Thanks! I removed “Results” from the table-level regex and that did the trick.
Do you mind expanding a bit more on the regex at the table level? I think that’s where I was a bit confused. From your comments above:
on the table row in the config wizard, you can optionally add a regular expression. If you don’t add one, the entire document will be considered as a huge “table” that row expressions should be applied to. If you do fill it in, that expression should CAPTURE the entire table area. You added “Results” in there, which matches, but returns the match “Results” only (not the actual table area).
Q: What exactly do you mean when you say the ‘expression should CAPTURE the entire table area’? If you don’t mind, it’s always easier to illustrate with examples - I’ve attached a sample PDF here with multiple tables to help walk through.
Let’s say I have 3 tables in the PDF document (imagine for now that they all carry different information and have different patterns), and I’m only interested in extracting table #2 and all rows from that table.
How would I then go about using the ‘TABLE LEVEL REGEX’ to extract just that table area (table #2)? Is it a keyword that identifies the table, or an expression that matches the whole table / parts of the table column names?
My understanding of the table-level regex is that it locates a given table area of interest and applies any row/column-level regex to that output. This would be helpful, as I would not want the bot to scan through the whole document when there are multiple tables, since I know which ‘table area’ I would like to extract.
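For instance, if table #2 sits between recognizable headings in the digitized text, a table-level expression could capture just that span (the headings and sample text below are invented for illustration):

```python
import re

digitized = """Table 1 - Orders
O1  10
Table 2 - Shipments
S1  3
S2  7
Table 3 - Returns
R1  1"""

# Capture only the area between the "Table 2" heading and the next heading
# (or the end of the text). (?s) lets '.' cross line breaks.
table2_re = re.compile(r"(?s)Table 2[^\n]*\n(.*?)(?=\nTable \d|\Z)")
table2_area = table2_re.search(digitized).group(1)
print(table2_area)  # -> "S1  3\nS2  7"
```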
I would like to see whether the bot has classified the document correctly based on the keyword provided. Also, please correct me if I am wrong: this keyword should be something that is present in the invoice, right?
Go to the Data Extraction Scope, click on Configure Extractors, and activate the address field. Make sure to put in the right string if you are still using the old version.
I also recommend switching from UiPath.MachineLearningExtractor (which has been discontinued) to UiPath.DocumentUnderstanding.ML.Activities, which is in official release now and will continue to be updated.
As you can see, in the extraction I was using the ‘Invoice ML Extractor’ to try to get the “Total”, using the ReceiptSwiss file in the anon folder (I’ve filtered the files to only get this particular jpg).
I can’t see anything wrong with the properties either.
RemoteException wrapping UiPath.DocumentUnderstanding.Persistence.OrchestratorException: Unexpected character encountered while parsing value: <. Path '', line 0, position 0.
   at UiPath.IntelligentOCR.Activities.BaseOrchestratorClientAsyncActivity.ThrowIfNeeded(Task task, Boolean suppressThrowException)
   at UiPath.IntelligentOCR.Activities.BaseOrchestratorClientAsyncActivity.EndExecute(AsyncCodeActivityContext context, IAsyncResult result)
   at System.Activities.AsyncCodeActivity.System.Activities.IAsyncCodeActivity.FinishExecution(AsyncCodeActivityContext context, IAsyncResult result)
   at System.Activities.AsyncCodeActivity.CompleteAsyncCodeActivityData.CompleteAsyncCodeActivityWorkItem.Execute(ActivityExecutor executor, BookmarkManager bookmarkManager)
Both of them offer extraction of data from a predefined form, and the first one, Intelligent Form Extractor, recognizes handwritten text through a service once we provide our API key. But the second one, Form Extractor, also asks for an API key, and there is no difference from previous versions of the package that have the same activity.
I am using UiPath.IntelligentOCR.Activities version 4.5.0-beta.878183 and am very interested to know whether this is a bug or whether the plan is to offer both activities with API authentication.
Hi,
I am trying to understand and use the IntelligentOCR package. What I don’t really understand is the learning part of the framework. I have some invoices, and I use the regex classifier to classify them. The first few documents have around a 70% confidence level; the documents that come after have much higher confidence levels, around 95%. After the process finished I checked the .json file for classification learning, and it looks the same as before the process started; there are no added values or anything, just the “ConfirmedNumberOfTimes” changed.
My question is: why did the classification confidence level grow if there isn’t any more data in the learn.json file to work with?
Thank you