Document Understanding shouldn't require digitizing

postwick · November 29, 2023, 7:50pm

In this process I’m building, there is zero reason to digitize the document. We are doing no automatic classification or data extraction. Requiring the digitizing step just adds time (it’s pretty slow, taking minutes for some documents).

rall · November 30, 2023, 2:07pm

If you aren’t classifying or extracting, what is your general process flow / end goal?

Jon_Smith · November 30, 2023, 2:59pm

If you aren’t classifying or extracting, then what DU functionality are you using? That’s all it does, right?

postwick · November 30, 2023, 3:11pm

Manual classification/splitting of the document.

postwick · November 30, 2023, 3:11pm

We are classifying/splitting manually by presenting Classification Station to the user.

Jon_Smith · November 30, 2023, 3:33pm

Gotcha,
So classification usually needs digitizing, since the text is used in classification. Since you are basically skipping all automatic classification and are just using this for the pre-built form in Action Center, I think it’s a bit of an edge case.

Let me look into something, I think it might be possible to skip digitizing still, or at least to spoof it.

postwick · November 30, 2023, 3:35pm

Yeah I tried to spoof it but wasn’t able to. Doesn’t mean it can’t be done.

I don’t think doing manual classification is an edge case, it’s the reason the Present Classification Station activity exists. We aren’t doing it in Action Center, we are doing it attended (not that there’s really a difference).

Jon_Smith · November 30, 2023, 3:38pm

Perhaps my meaning wasn’t clear.
I do not mean to suggest that manual classification is an edge case; that is common. I am saying that skipping any kind of automatic classification first, which requires digitizing, seems like an edge case.

Jon_Smith · November 30, 2023, 3:47pm

I don’t have a taxonomy that’s easy to access, as I have been trying to use the ‘new’ Document Understanding actions (which weirdly seem to be missing Classification Station), so I don’t have an easy way to test. But if I look at the Present Classification Station activity, it needs the following inputs.

AutomaticClassificationResults - Optional, so skip it.
DocumentObjectModel - Try just making an empty one?
DocumentPath - easy
DocumentText - “FOO BAR”
Taxonomy - put your taxonomy with document types there.

Maybe that can work? My idea is to skip digitizing and just make the objects we need on the fly.

postwick · November 30, 2023, 3:51pm

I did skip the Automatic Classification Results. That’s not the issue. The issue is the necessity of the Digitize step. The outputs of the Digitize activity are Document Text and Document Object Model, and both are required inputs of the Present Classification Station activity.

The text appears skippable with just an empty string but the DOM is a complex object.

Jon_Smith · November 30, 2023, 4:51pm

I know. I specifically posted a suggested way to skip them, Paul.

postwick · November 30, 2023, 9:49pm

It would be pretty complex to spoof the DOM. There’s a lot in there and I’m not sure how much of it the classification station actually needs. That’s why I’m suggesting to UiPath via feedback that it would be useful if it weren’t required.

Jon_Smith · December 1, 2023, 8:18am

Did you try the way to spoof it like I said though? Making an empty DOM was easy in my screenshot, at least to make it compile.

We can try it out rather than just assume it’s hard. If it errors when it runs, then more work would indeed be needed.

postwick · December 1, 2023, 1:37pm

Yes I tried that.

Jon_Smith · December 1, 2023, 3:01pm

Shame.

I wonder if we can serialize an existing DOM from digitizing and store that; we could then deserialize it when the code is run. I know you want a proper fix, but as a community member I can only suggest workarounds. Let me know if that doesn’t interest you and I’ll stop.

postwick · December 1, 2023, 3:08pm

I had the same thought but no idea how to serialize it to then read it back in. After doing that I could adjust things that become necessary like page count etc.

postwick · December 1, 2023, 6:44pm

I had the idea to make a small dummy PDF to digitize and just hard-code that file into the Digitize activity. But then when Classification Station appeared it only showed the first page of the actual document (because the dummy PDF is only one page). So I tried to manually set the value of sourceFileDOM.Pages.Length and it told me it’s read-only. So that’s a no-go.

Jon_Smith · December 4, 2023, 8:48am

Damn, that’s annoying. Some progress but indeed not enough.

You can try to store the DOM object by using the Newtonsoft Serialize and Deserialize activities (to load it back to a DOM object from a string).

We’d need to test if it is in fact serializable. At least we now have more insight into the sort of things the DOM is being used for.

Regarding the read-only property: annoying, but we can work around that using ‘Reflection’.

Not sure if you’ve heard of that, Paul. It can be daunting and confusing, but it’s very powerful and allows you to do things such as changing a read-only field.

Let me know which you prefer to focus on first: seeing if you can store the DOM as a serialized string (skipping Digitize completely and avoiding the dummy PDF), or the Reflection?
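
For anyone who hasn’t seen the idea, here is a rough Python analogy of what Reflection buys you. It is not .NET code and not the UiPath API: the `Dom` class, its `pages` field, and `page_count` are hypothetical stand-ins. In .NET the equivalent would go through `System.Reflection` (for example `FieldInfo.SetValue`). Note that an array’s `Length` itself can never be changed, so the page collection would have to be replaced rather than resized, and whether Classification Station would accept the result is untested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dom:
    """Hypothetical stand-in for a DOM whose fields are read-only."""
    pages: tuple

    @property
    def page_count(self) -> int:
        # Derived from the collection, like Pages.Length; it has no setter.
        return len(self.pages)


dom = Dom(pages=("page-0",))

# Normal assignment is rejected because the instance is frozen.
try:
    dom.pages = ("page-0", "page-1", "page-2")
    blocked = False
except AttributeError:  # FrozenInstanceError is a subclass of AttributeError
    blocked = True

# Going underneath the public API (the analogue of Reflection) swaps the
# backing collection, and the derived count follows.
object.__setattr__(dom, "pages", ("page-0", "page-1", "page-2"))

print(blocked, dom.page_count)
```

The point is only that the count changes because the whole collection was swapped, not because the length was written to directly.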

postwick · December 4, 2023, 5:56pm

There is a .serialize method on the Document datatype. I’m thinking maybe I could just serialize it, then during the process read it in as text, alter the values I need in the text, then deserialize. Think that would work?
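
To make the idea concrete, here is a minimal sketch of that round trip in Python, assuming the serialized DOM is JSON-like text. The field names (`documentId`, `pages`, `pageIndex`, `sections`) and the `expand_pages` helper are made up for illustration; the real DOM has far more in it, and I haven’t checked what the `.serialize` method actually emits.

```python
import json

# Hypothetical serialized DOM, as if produced once by digitizing a one-page
# dummy PDF and saved to a text file.
dummy_dom_json = json.dumps({
    "documentId": "dummy.pdf",
    "pages": [{"pageIndex": 0, "sections": []}],
})


def expand_pages(serialized: str, page_count: int) -> str:
    """Return serialized DOM text whose page list has page_count entries."""
    dom = json.loads(serialized)
    template = dom["pages"][0]
    # Copy the single dummy page once per real page, fixing up the index.
    dom["pages"] = [dict(template, pageIndex=i) for i in range(page_count)]
    return json.dumps(dom)


# At run time: read the stored text, patch it, then deserialize it again.
patched = expand_pages(dummy_dom_json, 5)
restored = json.loads(patched)
print(len(restored["pages"]))  # 5
```

Parsing the text before editing it, rather than doing string replacement, keeps the result valid when a value has to change in several places.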

Jon_Smith · December 4, 2023, 8:20pm

No idea about that method, but now that I think about it, the DOM has to be serializable in order to be transferred to the Orchestrator.

I’d personally just use the Newtonsoft approach, since it’s an easy and consistent way to serialize and deserialize any object.
