How to use the IntelligentOCR Package

Hi @Ioana_Gligan and @warren_lee,

I’ve been facing the same problem when using the Create Document Validation Action activity.

Unexpected character encountered while parsing value: <. Path ‘’, line 0, position 0.

Have you managed to solve this problem?

Thanks!
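
For anyone landing on this message: a parser error that complains about `<` at position 0 usually means a JSON parser was handed markup (for example an HTML error or sign-in page) instead of JSON. The snippet below is only a minimal Python sketch of that failure mode, not the activity's actual code; the sample body and the `describe_payload` helper are made up for illustration.

```python
import json


def describe_payload(body: str) -> str:
    """Classify a response body before handing it to a JSON parser."""
    stripped = body.lstrip()
    if stripped.startswith("<"):
        # Looks like HTML/XML, so a JSON parser would fail on the first character.
        return "markup"
    try:
        json.loads(stripped)
    except json.JSONDecodeError:
        return "invalid"
    return "json"


# Made-up sample: an HTML page returned where JSON was expected.
html_body = "<html><body>Sign in</body></html>"

try:
    json.loads(html_body)
except json.JSONDecodeError as exc:
    # The parser stops at the very first character, the "<".
    print(f"parse failed at position {exc.pos}: {exc.msg}")

print(describe_payload(html_body))
```

If the body is markup, the thing to check is usually the URL or the authentication behind the call rather than the payload itself.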

Hello @paulo.kurihara,

Any chance you could share the failing workflow and your Studio version with me? I would like to try to reproduce the issue.

Ioana

Hello @marton.szaboo, and welcome to the community!

The confidence computation is a complex algorithm that keeps track of which words are found, where they are in the document, when a certain keyword or set of keywords has been added to the learning, and how many times a keyword has been reinforced. That is why it is growing :slight_smile:
You will notice that if the classifier makes a mistake and you correct the document type in the Validation Station (if you have more than one document type in there), new entries appear in the learning content.

Hope this helps,

Ioana
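
To make the "it keeps growing" behaviour a bit more concrete, here is a toy Python sketch of keyword learning that grows when a classification is corrected. It is purely illustrative: the class name, the scoring formula, and the data layout are assumptions made for this example and are not the real confidence algorithm, which also tracks word positions and timing.

```python
from __future__ import annotations

from collections import defaultdict


class ToyKeywordLearning:
    """Hypothetical stand-in for a keyword-based learning file."""

    def __init__(self) -> None:
        # document type -> keyword -> number of times it was reinforced
        self.learning: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def classify(self, words: list[str]) -> tuple[str | None, float]:
        """Return the best matching type and a naive score between 0 and 1."""
        best_type, best_score = None, 0.0
        for doc_type, keywords in self.learning.items():
            total = sum(keywords.values())
            hits = sum(keywords[w] for w in set(words) if w in keywords)
            score = hits / total if total else 0.0
            if score > best_score:
                best_type, best_score = doc_type, score
        return best_type, best_score

    def reinforce(self, doc_type: str, words: list[str]) -> None:
        """Record a confirmed or corrected document type for these words."""
        for word in set(words):
            self.learning[doc_type][word] += 1


learner = ToyKeywordLearning()
learner.reinforce("invoice", ["invoice", "total", "vat"])
learner.reinforce("receipt", ["receipt", "total", "cash"])

# A human corrects a document to "receipt": existing keywords are
# reinforced and the new word "change" is added to the learning.
learner.reinforce("receipt", ["receipt", "cash", "change"])
print(dict(learner.learning["receipt"]))
```

Every correction either bumps a counter or adds a keyword, which is why the stored learning only ever gets larger in this sketch.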

Hi @Ioana_Gligan,

Since I’m a new user and can’t attach files here, I’ve uploaded the zipped failing workflow to my Google Drive.

The link is: IntelligentOCR.zip - Google Drive

The Studio version is 2019.4.4, Community edition.

Since it always fails, the ContinueOnError property of the Create Document Validation Action activity is enabled, and you will need to disable it in order to see the exception it returns.

Thanks!

Hello @Iona, thanks for your great work here :slight_smile:

I am just getting started with document understanding. I would like to ask you about the following things you said:

“The machine learning extractor is pre-trained and does not expose the re-training capability at this moment.”

Is there any new out-of-the-box option? If not, do you have any idea when one will appear?

“In order to train extractors, you currently have to build your own”.

What do you think would be the best approach to tackle this? Connecting our workflow to an Azure/AWS machine learning instance?

All the best,

Hi again forum,

I have downloaded UiPath Studio Pro 2020.4.0.beta1731 Community and updated all packages, including prereleases, and there are currently no errors there. However, I have a missing activity that gives the following error:

Could not find member ‘SkipServerSideOCR’ in type ‘http://schemas.uipath.com/workflow/activities/documentunderstanding-ml:MachineLearningExtractor’. Row: 194, Column: 522

Any idea how to fix this issue?

Thank you in advance to all for your support!

All the best,

Hello @OsoDormilon,

I updated the archive - that error should go away now… sorry about that!

Ioana

Hello @paulo.kurihara,

I think the Studio version is the issue - 19.4 will be out of support in a couple of months… why don’t you switch to the preview channel in Studio (main menu / help / right side bar / switch to Preview), or install the latest Community?

Please let me know if this works once you set the persistence flag in project settings, and try it out on the latest version!

Thank you,

Ioana

Hi,

The project is giving some warnings because of the deprecated UiPath.MachineLearningExtractor package. I would suggest updating the projects available at the current link and updating their packages.

Thanks!

I was referring to the DocumentProcessing_IntelligentOCR300 project.

Cheers,

Hi @Ioana_Gligan,

Alright, I’ll give it a shot and let’s see what happens.

Thank you!

Hello @Ioana_Gligan,

I’ve noticed that the Digitize Document activity has a property that forces it to read the document with OCR (the ForceApplyOCR property). Now I’m wondering whether the opposite is possible: forcing it to read, for example, a PDF file natively, just like the Read PDF Text activity does. Sometimes the OCR result for a PDF is not good enough to capture all the information we need to extract, and most of the PDFs I’m working with don’t really need OCR, since they have extractable text.

So, basically, my question is whether it is possible to use Digitize Document as if it were Read PDF Text, in order to avoid OCR when it is not needed.

Thank you!

Thank you, I’ve played around with it a lot, and with your help I understand it now.

Hi Ioana, @paulo.kurihara,

Did you manage to look into this issue?

I tried again today and it’s still the same for me, so I’m just wondering if you have found anything that might have caused this?

This is the sample workflow I’m using (it has an ML extractor pointing to the receipt ML endpoint, without my API key in it, of course):

SampleDUActionCenterIntegration_Forum.zip (648.0 KB)

Hey @paulo.kurihara,

The Digitize Document activity does not apply OCR by default. If a PDF can be read natively, it is. Only if a page is covered too heavily by images, returns no text when read natively, or meets a couple of other conditions does the activity apply OCR.
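
The decision described above could be pictured roughly as in the Python sketch below. This is an assumption-laden illustration, not the activity's implementation: the `PageInfo` type, the `needs_ocr` function, and the 0.5 coverage threshold are all hypothetical, and the "couple of other conditions" are not modelled.

```python
from dataclasses import dataclass

# Hypothetical threshold: the thread does not say how much image
# coverage counts as "too much".
IMAGE_COVERAGE_LIMIT = 0.5


@dataclass
class PageInfo:
    native_text: str       # text obtained by reading the PDF page natively
    image_coverage: float  # fraction of the page area covered by images (0.0 to 1.0)


def needs_ocr(page: PageInfo, force_apply_ocr: bool = False) -> bool:
    """Decide whether a page would go through OCR in this simplified model."""
    if force_apply_ocr:
        # Mirrors the ForceApplyOCR property: always use OCR.
        return True
    if not page.native_text.strip():
        # Native reading returned no text, so fall back to OCR.
        return True
    # Otherwise use OCR only when images dominate the page.
    return page.image_coverage > IMAGE_COVERAGE_LIMIT


print(needs_ocr(PageInfo("Invoice 123", 0.1)))  # text page with a small logo
print(needs_ocr(PageInfo("Invoice 123", 0.9)))  # text page with a full-page background
```

In this model a page with perfectly extractable text can still be sent to OCR when a large background image pushes the coverage over the limit.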

Hi @warren_lee,

What Studio version are you using? Are you on the preview channel with the latest version?

Hi @warren_lee and @Ioana_Gligan,

My Studio is now the 2020.4.0 version, and it actually works now!

Thank you!

Hi Ioana and @paulo.kurihara,

I’m also on version 2020.4.0, and I finally figured out what the issue was!

It’s interesting because it appears to somehow be linked to the Orchestrator API endpoint on my robot.

My bot was connected to Orchestrator via the latest Community endpoint when I experienced the issue:

https://cloud.uipath.com

What I then tried was to disconnect my bot and reconnect it using:

https://platform.uipath.com

This did not work either and produced the same error, BUT:

I then connected it using:

https://platform.uipath.com/{my specific service}/{my specific tenant}

This full URL appears to resolve the issue. It makes me think that, for some reason, either Studio or this specific activity does not pass in the service or tenant information, so the activity can’t perform its background API operations against my Orchestrator… :thinking:

The reason I say it is specific to this activity is that my bot has always been able to connect to Orchestrator fine, and other Orchestrator operations have been working well.

Interesting observation though … :slight_smile:
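
As a quick sanity check for the URL shape described above, a small Python sketch like the following could flag a connection URL that lacks the service and tenant segments. The helper names and the placeholder values `my-service` and `my-tenant` are hypothetical, and the two-segment rule is just my reading of the observation in this post, not a documented requirement.

```python
from urllib.parse import urlparse


def has_service_and_tenant(orchestrator_url: str) -> bool:
    """Return True when the URL path carries at least two non-empty segments."""
    path = urlparse(orchestrator_url).path
    segments = [part for part in path.split("/") if part]
    return len(segments) >= 2


def build_orchestrator_url(base: str, service: str, tenant: str) -> str:
    """Join the base URL with the service and tenant path segments."""
    return f"{base.rstrip('/')}/{service}/{tenant}"


# The two bare endpoints from this post, plus a full URL with placeholders.
for url in (
    "https://cloud.uipath.com",
    "https://platform.uipath.com",
    build_orchestrator_url("https://platform.uipath.com", "my-service", "my-tenant"),
):
    print(url, has_service_and_tenant(url))
```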

@OsoDormilon @Ioana_Gligan I’m not sure whether there was ever a response about the train extractor activity for machine learning. I have been looking for a solution and was hoping UiPath would release something for this, but I haven’t seen anything yet. Is there any way to go about this?

Hi @Ioana_Gligan,

I have the same problem. My PDF can be read natively without any issue, but it contains some non-text images such as a logo and backgrounds, and because of them the PDF is always read with OCR and the result is very messy.

It would be better if there were an option to force extraction as text, is there a way to do it?