Optimizing AI Center Training Pipelines for Document Understanding Packages

If a model is trained on invoices from companies A, B, and C, and is later trained on invoices from company D using a separate dataset, does the new training add company D to what the model already knows? Or does it override what the model learned from A, B, and C, so that a few documents from A, B, and C must be included when training with company D's data?

Whenever existing documents change or new documents are introduced, do not start training the model immediately. Follow these steps instead:

  1. Test with the Existing Model:

    • Before making any changes, run the new documents (company D invoices) through the current model.
    • If the results are good, there's no need for further training.
    • If the results are decent but not optimal, keep using the current version and gather more data during validation using Validation Station or Action Center.
  2. Evaluate the Results:

    • If the results are not satisfactory, add the new documents (company D invoices) to the older dataset (company A, B, and C invoices). This combined dataset will now include all documents from A, B, C, and D.
  3. Retrain the Base Model:

    • Train the base model (v.0) on this new, larger dataset. This will generate a new version of the model (v.2) which is expected to perform better.
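
The decision flow in the steps above can be sketched in a few lines of Python. This is only an illustration, not part of AI Center: the names `field_accuracy`, `choose_next_step`, and `NextStep`, and the 0.90 and 0.75 thresholds, are assumptions made for this example. What counts as "good" or "decent" depends on your own process.

```python
from enum import Enum


class NextStep(Enum):
    KEEP_CURRENT_MODEL = "keep current model, no retraining"
    KEEP_AND_COLLECT = "keep current model, collect validated data"
    RETRAIN_FROM_BASE = "retrain base model (v.0) on the combined dataset"


def field_accuracy(predicted: list[dict], expected: list[dict]) -> float:
    """Share of expected fields extracted exactly, across all test documents."""
    total = 0
    correct = 0
    for pred, truth in zip(predicted, expected):
        for field, value in truth.items():
            total += 1
            correct += pred.get(field) == value
    return correct / total if total else 0.0


def choose_next_step(accuracy: float,
                     good: float = 0.90,        # hypothetical threshold
                     acceptable: float = 0.75   # hypothetical threshold
                     ) -> NextStep:
    """Map the test result on the new documents to one of the steps above."""
    if accuracy >= good:
        return NextStep.KEEP_CURRENT_MODEL
    if accuracy >= acceptable:
        return NextStep.KEEP_AND_COLLECT
    return NextStep.RETRAIN_FROM_BASE


# Hypothetical extraction results for two company D invoices.
predicted = [{"total": "100.00", "date": "2024-01-05"},
             {"total": "89.10", "date": None}]
expected = [{"total": "100.00", "date": "2024-01-05"},
            {"total": "98.10", "date": "2024-02-11"}]

accuracy = field_accuracy(predicted, expected)
print(accuracy, choose_next_step(accuracy).value)
```

Here two of the four expected fields match, so the sketch recommends retraining from the base model on the combined dataset.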

Key Considerations:

  • "v.0" refers to the base model version.
  • It is always recommended to retrain the base model (v.0) on the larger dataset (companies A, B, C, and D) rather than to train an intermediate version (v.1) on just the smaller dataset (only company D).
  • Training on a larger dataset helps the model identify more patterns across all of the document layouts, which improves overall performance.
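
As a rough illustration of the combined dataset described above, the following Python sketch merges the older documents (A, B, and C) with the new ones (D) into a single list and drops duplicates. The function name `combine_datasets` and the file names are hypothetical. In practice the documents would be labeled and exported from your own project rather than listed by hand.

```python
def combine_datasets(existing: dict[str, list[str]],
                     new: dict[str, list[str]]) -> list[str]:
    """Merge per-company document lists into one training set without duplicates."""
    combined: list[str] = []
    seen: set[str] = set()
    for source in (existing, new):
        for documents in source.values():
            for doc in documents:
                if doc not in seen:
                    seen.add(doc)
                    combined.append(doc)
    return combined


# Hypothetical file names, used only for illustration.
existing_docs = {
    "A": ["a_001.pdf", "a_002.pdf"],
    "B": ["b_001.pdf"],
    "C": ["c_001.pdf"],
}
new_docs = {"D": ["d_001.pdf", "d_002.pdf"]}

training_set = combine_datasets(existing_docs, new_docs)
print(len(training_set), "documents for retraining the base model (v.0)")
```

The point of the sketch is that the retraining input contains every company's documents, not only company D's.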