Document Understanding shouldn't require digitizing

In the process I’m building, there is zero reason to digitize the document. We are doing neither automatic classification nor data extraction. Requiring the digitizing step just adds time (it’s pretty slow, taking minutes for some documents).

If you aren’t classifying or extracting, what is your general process flow / end goal?

If you aren’t classifying or extracting, then what DU functionality are you using? That’s all it does, right?

Manual classification/splitting of the document.

We are classifying/splitting manually by presenting Classification Station to the user.

Gotcha.
Usually classification needs digitizing, since the text is used for classification. Since you are basically skipping all automatic classification and are just using this to get the pre-built form in Action Center, I think it’s a bit of an edge case.

Let me look into something, I think it might be possible to skip digitizing still, or at least to spoof it.

Yeah I tried to spoof it but wasn’t able to. Doesn’t mean it can’t be done.

I don’t think doing manual classification is an edge case, it’s the reason the Present Classification Station activity exists. We aren’t doing it in Action Center, we are doing it attended (not that there’s really a difference).

Perhaps my meaning wasn’t clear.
I do not mean to suggest that manual classification is an edge case; that is common. What I am saying is that skipping any kind of automatic classification first, which is the step that requires digitizing, seems like an edge case.

I don’t have an easy-to-access taxonomy, as I have been trying to use the ‘new’ Document Understanding actions (which, weirdly, seem to be missing Classification Station), so I don’t have an easy way to test. But if I look at the Present Classification Station action, it needs the following inputs.

AutomaticClassificationResults - Optional, so skip it.
DocumentObjectModel - Try just making an empty one?
DocumentPath - easy
DocumentText - “FOO BAR”
Taxonomy - put your taxonomy with document types there.

Maybe that can work? My idea is skip digitizing and just make the objects we need on the fly.

I did skip the Automatic Classification Results. That’s not the issue. The issue is the necessity of the Digitize step. The outputs of the Digitize activity are Document Text and Document Object Model, both of which are required inputs of the Present Classification Station activity.

The text appears skippable with just an empty string but the DOM is a complex object.

I know. I specifically posted a suggested way to skip them, Paul.

It would be pretty complex to spoof the DOM. There’s a lot in there and I’m not sure how much of it the classification station actually needs. That’s why I’m suggesting to UiPath via feedback that it would be useful if it weren’t required.

Did you try spoofing it the way I suggested, though? Making an empty DOM was easy in my screenshot, at least enough to make it compile.

We can try it out rather than just assume it’s hard. If it errors when it runs, then more work would indeed be needed.

Yes I tried that.

Shame.

I wonder if we can serialize an existing DOM from digitizing and store that; we could then deserialize it when the code is run. I know you want a proper fix, but as a community member I can only suggest workarounds. Let me know if that doesn’t interest you and I’ll stop.

I had the same thought, but I have no idea how to serialize it and then read it back in. After doing that I could adjust whatever becomes necessary, like the page count.

I had the idea to make a small dummy PDF to digitize and just hard-code that file into the Digitize activity. But then when Classification Station appeared it only showed the first page of the actual document (because the dummy PDF is only one page). So I tried to manually set the value of sourceFileDOM.Pages.Length and it told me it’s read-only. So that’s a no-go.

Damn, that’s annoying. Some progress, but indeed not enough.

You can try to store the DOM object by using the Newtonsoft Serialize and Deserialize activities (to load it back to a DOM object from a string).

We’d need to test whether it is in fact serializable. At least we now have more insight into the sort of things the DOM is being used for.

Regarding the read-only property: annoying, but we can work around that using ‘Reflection’.

Not sure if you’ve heard of that, Paul. It can be daunting and confusing, but it’s very powerful and allows you to do things such as changing a read-only field.
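To make the reflection idea a bit more concrete, here is a minimal sketch. It is written in Python rather than UiPath's .NET, so treat it purely as an analogy: the `Dom` class, its `_page_count` backing field and the `force_set` helper are all hypothetical stand-ins, not the real Document Object Model. The point is only that a read-only property usually wraps a private field, and that field can still be written if you reach past the property. One caveat: assuming `Pages` is an ordinary .NET array, the array's `Length` cannot be changed even with reflection, so in practice you would probably have to replace the `Pages` array itself.

```python
class Dom:
    """Hypothetical stand-in for a DOM that exposes a read-only page count."""

    def __init__(self, page_count: int) -> None:
        self._page_count = page_count  # private backing field

    @property
    def page_count(self) -> int:
        # No setter is defined, so ordinary assignment is rejected.
        return self._page_count


def force_set(obj: object, field_name: str, value: object) -> None:
    """Write a backing field directly, bypassing the read-only property."""
    if field_name not in vars(obj):
        raise AttributeError(f"{type(obj).__name__} has no field {field_name!r}")
    setattr(obj, field_name, value)


dom = Dom(page_count=1)

try:
    dom.page_count = 12  # the normal route fails
    normal_assignment_worked = True
except AttributeError:
    normal_assignment_worked = False

force_set(dom, "_page_count", 12)  # the "reflection" route succeeds
print(normal_assignment_worked, dom.page_count)
```

Whether Classification Station would accept a DOM patched this way is untested; the sketch only shows the mechanism.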

Let me know which you prefer to focus on first: seeing if you can store the DOM as a serialized string (skipping the Digitize step completely and avoiding the dummy PDF), or the Reflection approach?

There is a .serialize method on the Document datatype. I’m thinking maybe I could just serialize it, then during the process read it in as text, alter the values I need in the text, then deserialize. Think that would work?

No idea about that method, but now that I think about it, the DOM has to be serializable in order to transfer it to the Orchestrator.

I’d personally just use the Newtonsoft approach, since it’s an easy and consistent way to serialize and deserialize any object.
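The serialize, edit, deserialize round trip discussed above could look roughly like the sketch below. Again this is Python standing in for the .NET workflow, and everything in it is an assumption made for illustration: the `Dom` and `Page` classes, their fields and the `with_page_count` helper are hypothetical and do not reflect the real DOM schema. In UiPath the equivalent steps would presumably go through the Newtonsoft serialize and deserialize activities (or the `.serialize` method mentioned above) instead of `json`.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Page:
    index: int
    text: str


@dataclass
class Dom:
    document_id: str
    pages: List[Page]


def serialize(dom: Dom) -> str:
    """Turn the DOM into a JSON string that can be stored and reused."""
    return json.dumps(asdict(dom))


def deserialize(payload: str) -> Dom:
    """Rebuild a DOM object from the stored JSON string."""
    raw = json.loads(payload)
    pages = [Page(**page) for page in raw["pages"]]
    return Dom(document_id=raw["document_id"], pages=pages)


def with_page_count(payload: str, page_count: int) -> str:
    """Patch the stored JSON so it describes `page_count` empty pages."""
    raw = json.loads(payload)
    raw["pages"] = [{"index": i, "text": ""} for i in range(page_count)]
    return json.dumps(raw)


# Done once: digitize a one-page dummy file and keep the serialized result.
template = serialize(Dom(document_id="dummy", pages=[Page(index=0, text="")]))

# Done per run: patch the template to match the real document, then load it.
patched = with_page_count(template, page_count=5)
dom = deserialize(patched)
print(len(dom.pages))
```

Editing the parsed structure rather than the raw text keeps the JSON valid after the change. A real DOM almost certainly carries more per-page data than this, so which fields Classification Station actually needs would still have to be found by testing.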