A Few Questions About Document Understanding

Hello everyone,

I have a few questions regarding Document Understanding:

1- I have 10 different invoice layouts and I want to extract 5 fields from each. Initially, should I create batches for each different layout in the Data Labeling step (e.g., first-batch, second-batch), or should I create separate datasets for each layout? What is the best practice?

2- In the Data Labeling step, if a field is present in 400 invoices but missing from 10 of them, I am asked to hide those fields during export. However, if I hide them, the fields will not be visible in Studio. What should I do in this situation? What is the best practice?

3- As I understand it, when developing a Document Understanding project for a client, we use “Present Validation Station” and “Machine Learning Extractor Trainer” in Studio to give the client an interface for approval and training. However, this does not seem very user-friendly. Should we present the “Present Validation Station” screen to the client while the process is running? What is the best practice?

4- When I want to add new data to my model and train it from Studio, what is the best practice? Where does the data trained from Studio end up in AI Center, and how is it integrated? When users approve data with “Present Validation Station” and train with “Machine Learning Extractor Trainer”, do we need to manually redo Data Labeling in AI Center for that data, or is this process automated?

I would appreciate your help with these questions.

Thank you.

Hi Tuncay, welcome to the community forum

1 - With only 10 invoice layouts, if they are all consistently and similarly structured, that's borderline territory where the Form Extractor is still worth considering: Document Understanding - Form Extractor (uipath.com). If you expect more layouts, or if the layouts themselves are not very structured, then definitely proceed with the Machine Learning Extractor.

You need one dataset for all of your layouts; to follow best practices, split it into a training set and an evaluation set, both containing examples from all the expected layouts. The training set is used to fine-tune the model, and the evaluation set is used to test the accuracy of your fine-tuned model. The idea is that the model is trained on the semantic meaning of the fields you need to extract, so it can then identify those fields from their semantic meaning on any type of invoice layout.
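Just to make the split concrete, here is a tiny Python sketch (illustrative only, not part of UiPath tooling) of a stratified roughly-80/20 split; the file names and layout IDs are placeholders for your own labeled documents. The only point it demonstrates is that every layout should end up in both the training and the evaluation set:

```python
import random
from collections import defaultdict

# Hypothetical list of labeled documents tagged with their layout;
# replace with your own file names and layout identifiers.
documents = [
    ("invoice_001.pdf", "layout_1"),
    ("invoice_002.pdf", "layout_1"),
    ("invoice_003.pdf", "layout_2"),
    # ... one entry per labeled document, covering all 10 layouts
]

# Group documents by layout so every layout contributes to both sets.
by_layout = defaultdict(list)
for name, layout in documents:
    by_layout[layout].append(name)

train_set, eval_set = [], []
for layout, docs in by_layout.items():
    random.shuffle(docs)
    cut = max(1, int(len(docs) * 0.8))  # roughly 80% train / 20% eval per layout
    train_set.extend(docs[:cut])
    eval_set.extend(docs[cut:])

print(f"training: {len(train_set)} docs, evaluation: {len(eval_set)} docs")
```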

2 - I’m sorry, I’m not sure I understand this one.

3 - The Present Validation Station activity is only for attended Document Understanding processes: it brings up the validation screen directly on the machine the automation is running on. For unattended automations you want to use the “Create Labeling Task” and “Wait for External Task” activities instead, so that a human can complete these tasks in Action Center.

4 - See AI Center - Using Data Labeling with human in the loop (uipath.com). You can follow these steps to take the output from either the external or attended validation action and add it back into your dataset. You can then manually or automatically rerun a training pipeline on the expanded dataset to retrain your model.
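To picture the cycle from that guide, here is a rough conceptual sketch in Python. Every function below is a hypothetical placeholder, not a real UiPath or AI Center API; in practice these steps happen through the validation action, the Data Labeling session, and a training pipeline in AI Center:

```python
# Conceptual sketch of the human-in-the-loop retraining cycle.
# Every function here is a hypothetical placeholder, NOT a real
# UiPath / AI Center API; the real steps are done via the validation
# action, the Data Labeling session, and a training pipeline.

def export_validated_documents() -> list:
    """Collect documents whose extraction results a human has
    corrected and confirmed in Validation Station / Action Center."""
    return []  # placeholder

def add_to_labeling_dataset(validated_docs: list) -> list:
    """Import the corrected documents and their labels back into the
    dataset behind your Data Labeling session."""
    return validated_docs  # placeholder

def run_training_pipeline(dataset: list) -> None:
    """Start (manually or on a schedule) a training pipeline on the
    expanded dataset to produce a new version of the ML model."""
    pass  # placeholder

# The loop: validate -> feed corrections back into the dataset -> retrain.
validated = export_validated_documents()
expanded_dataset = add_to_labeling_dataset(validated)
run_training_pipeline(expanded_dataset)
```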

This is all from memory and I haven’t made a DU project in a long time, so I invite the community to correct me if I have made any mistakes.


Hello,

Firstly, thank you for your response.

From what I understand, when I have 10 different invoice layouts, I should keep them all in the same dataset even though they are different layouts. Then I need to separate the dataset into batches like first-batch, second-batch for each invoice layout, and also create an evaluation batch for each layout. Do I need to run a training pipeline for each batch? I would appreciate detailed guidance on this matter as I am quite new to this.

I had not heard about the “Create Labeling Task” and “Wait for External Task” activities. I will look into them, but do you have any recommended detailed resources for these topics?

I would appreciate any help you can provide on these matters.

Thank you.

From what I understand, the batching feature is just there to make the labeling process more manageable, and perhaps to split the data into training and evaluation sets. You should not divide your layouts into different batches or subsets, though. Put most of your documents into a training set and a small subset into an evaluation set, and make sure both contain examples of all the layouts.

Using those Wait for External Task activities takes you into a deeper, more advanced type of workflow known as long-running workflows: you have to design your process keeping in mind that it will be suspended at some point while it waits for the validation, and resumed afterwards.
Studio - Orchestration Process (uipath.com)
Orchestrator - Working with long-running workflows (uipath.com)