DU Training Model

Latifa · November 8, 2024, 9:23am

Given the following scenario: You have a trained version of the Document Understanding Model with 1000 pages called v22.10.0.1. You have an evaluation dataset of 100 pages that gave a score of 0.72 for v22.10.0.1. The business team labeled 800 pages and they ask for an increment of the Model that would contain all 1000+800 pages.

What is the first recommended pipeline run configuration to create the new version?

A. Run a Pipeline on the Package with the following settings:
Pipeline type: Full
Package Major Version: 22.10.0
Package Minor Version: 1
Input Dataset: 800 pages
Evaluation Dataset: 100 pages

B. Run a Pipeline on the Package with the following settings:
Pipeline type: Training
Package Major Version: 22.10.0
Package Minor Version: 1
Input Dataset: 1000+800 pages
Evaluation Dataset: N/A

C. Run a Pipeline on the Package with the following settings:
Pipeline type: Full
Package Major Version: 22.10.0
Package Minor Version: 0
Input Dataset: 1000+800 pages
Evaluation Dataset: 100 pages

D. Run a Pipeline on the Package with the following settings:
Pipeline type: Evaluate
Package Major Version: 22.10.0
Package Minor Version: 0
Input Dataset: 1000+800 pages
Evaluation Dataset: 100 pages

Akash_Javalekar1 · November 8, 2024, 9:32am

Hi @Latifa, I think it is C. The major version remains the same, while the minor version can start from 0 since it represents a new iteration on the combined data. The new model is fully trained on the combined dataset (1000+800 pages) and then evaluated using the 100-page evaluation dataset. This way your model trains on the new dataset as part of running the Full pipeline.
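
To make the settings in option C concrete, here is a minimal Python sketch that just models them as plain data. This is not an AI Center API: `PipelineRun` and `full_retrain_from_base` are hypothetical names made up for illustration, and in practice these values are entered in the pipeline run form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelineRun:
    """Hypothetical container for the settings listed in the question."""
    pipeline_type: str               # e.g. "Full"
    major_version: str               # e.g. "22.10.0"
    minor_version: int               # 0 means the base version
    input_pages: int                 # size of the training dataset
    evaluation_pages: Optional[int]  # None when no evaluation set is used


def full_retrain_from_base(major_version: str,
                           existing_pages: int,
                           new_pages: int,
                           evaluation_pages: int) -> PipelineRun:
    """Build the option C settings: Full pipeline, minor version 0,
    trained on old + new pages and evaluated on the held-out set."""
    return PipelineRun(
        pipeline_type="Full",
        major_version=major_version,
        minor_version=0,
        input_pages=existing_pages + new_pages,
        evaluation_pages=evaluation_pages,
    )


option_c = full_retrain_from_base("22.10.0", 1000, 800, 100)
print(option_c)
```

Keeping the same 100-page evaluation dataset means the new score can be compared directly with the 0.72 obtained for v22.10.0.1.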

ashokkarale · November 8, 2024, 9:45am

@Latifa,

I think Option B would be the better option.

This configuration ensures that the model is trained on the entire dataset, including the newly labeled 800 pages, creating a comprehensive and updated version of the model. The evaluation dataset is not needed in this initial training run but can be used later to assess the model’s performance.

Anil_G · November 8, 2024, 9:48am

@Latifa

The correct one would be C.

It is always advised to retrain on the base version with the full dataset rather than to retrain incrementally.

Cheers

Latifa · November 8, 2024, 12:25pm

Thank you very much guys
