DU Training Model

Given the following scenario: You have a version of the Document Understanding model, called v22.10.0.1, that was trained on 1000 pages. You have an evaluation dataset of 100 pages that gave a score of 0.72 for v22.10.0.1. The business team has labeled 800 more pages and asks for a new version of the model trained on all 1000+800 pages.

What is the first recommended pipeline run configuration to create the new version?

  • A. Run a Pipeline on the Package with the following settings:
    Pipeline type: Full
    Package Major Version: 22.10.0
    Package Minor Version: 1
    Input Dataset: 800 pages
    Evaluation Dataset: 100 pages

  • B. Run a Pipeline on the Package with the following settings:
    Pipeline type: Training
    Package Major Version: 22.10.0
    Package Minor Version: 1
    Input Dataset: 1000+800 pages
    Evaluation Dataset: N/A

  • C. Run a Pipeline on the Package with the following settings:
    Pipeline type: Full
    Package Major Version: 22.10.0
    Package Minor Version: 0
    Input Dataset: 1000+800 pages
    Evaluation Dataset: 100 pages

  • D. Run a Pipeline on the Package with the following settings:
    Pipeline type: Evaluate
    Package Major Version: 22.10.0
    Package Minor Version: 0
    Input Dataset: 1000+800 pages
    Evaluation Dataset: 100 pages

Hi @Latifa, I think C. The major version remains the same, while the minor version can start from 0, since it represents a new iteration on the combined data. The new model is fully trained on the combined dataset (1000+800 pages) and then evaluated using the 100-page evaluation dataset. This way your model trains on the new dataset while running the full pipeline.

@Latifa,

I think Option B would be the better option.

This configuration ensures that the model is trained on the entire dataset, including the newly labeled 800 pages, creating a comprehensive and updated version of the model. The evaluation dataset is not needed in this initial training run but can be used later to assess the model’s performance.
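
Assessing the model later on the same 100-page evaluation dataset makes the new score directly comparable with the 0.72 that v22.10.0.1 obtained. A minimal sketch of that comparison follows; the helper name `compare_to_baseline` and the score 0.80 are hypothetical and used only for illustration, not taken from any real run or UiPath API.

```python
# Score of v22.10.0.1 on the 100-page evaluation dataset (from the question).
BASELINE_SCORE = 0.72


def compare_to_baseline(new_score: float, baseline: float = BASELINE_SCORE) -> str:
    """Describe how a new evaluation score relates to the baseline score."""
    delta = new_score - baseline
    if delta > 0:
        return f"improved by {delta:.2f}"
    if delta < 0:
        return f"regressed by {-delta:.2f}"
    return "unchanged"


# Hypothetical score for the retrained model; not a real result.
print(compare_to_baseline(0.80))  # improved by 0.08
```

The comparison is only meaningful if both versions are scored on the same evaluation pages.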

@Latifa

The correct one would be C.

It is always advised to retrain on the base version (minor version 0) with the full dataset rather than to retrain incrementally.
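
To make that rule concrete, here is a rough sketch that encodes the four options from the question as plain records and filters them by the advice above. The `PipelineRun` class and `follows_advice` helper are hypothetical names for illustration only; they are not part of any UiPath API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelineRun:
    """One answer option from the question (hypothetical record, not a UiPath type)."""
    option: str
    pipeline_type: str
    minor_version: int
    input_pages: int
    evaluation_pages: Optional[int]


EXISTING_PAGES = 1000
NEW_PAGES = 800
FULL_DATASET = EXISTING_PAGES + NEW_PAGES  # 1800 pages in total

OPTIONS = [
    PipelineRun("A", "Full", 1, NEW_PAGES, 100),
    PipelineRun("B", "Training", 1, FULL_DATASET, None),
    PipelineRun("C", "Full", 0, FULL_DATASET, 100),
    PipelineRun("D", "Evaluate", 0, FULL_DATASET, 100),
]


def follows_advice(run: PipelineRun) -> bool:
    """True if the run trains on the base version (minor 0) with the full dataset."""
    trains = run.pipeline_type in ("Full", "Training")
    return trains and run.minor_version == 0 and run.input_pages == FULL_DATASET


matching = [run.option for run in OPTIONS if follows_advice(run)]
print(matching)  # ['C']
```

Options A and B start from minor version 1, and D only evaluates, so C is the only one that satisfies the rule.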

Cheers

Thank you very much, guys.
