Document Understanding - Auto-retraining of ML Models

Dear Community,

The wait is over! The ability to automatically retrain ML models with data from Validation Station in an RPA workflow is finally here! :tada:

However, let us first make sure we understand what this feature is, and what it is not. There are two major phases in the lifecycle of any Machine Learning model: build phase, and maintenance phase.

In the build phase you use Data Manager to prepare the training dataset and the evaluation dataset, and try to get the best performance possible. At the same time, you build the RPA automation and business logic around the ML model, which is at least as important as the model itself for obtaining the return on investment you expect.

In the maintenance phase you try to maintain the high performance level achieved in the build phase; in other words, you prevent regressions.

Automatic retraining belongs firmly in the maintenance phase. Its main objective is to prevent the model from regressing as the data flowing through the process changes. In particular, data fed back from human validation via Validation Station should not be used to build a model from scratch. Building a model should be done by preparing training and evaluation datasets in Data Manager.

The description below assumes you have already created a Data Manager session in AI Center, carefully hand-labelled a high-quality training dataset and an evaluation dataset, trained a few versions of your ML model, tested them, ironed out any issues, and deployed the model to your RPA+AI automation.

The 3 components of Auto-Retraining

  1. The ML Extractor Trainer activity. Add this to your workflow inside a Train Extractors Scope and configure the scope properly: make sure the Framework Alias contains the same alias as the ML Extractor alias in the Data Extraction Scope, and select the Project and the Dataset associated with the Data Manager session containing the training and evaluation datasets mentioned above.


    You can find the Dataset name in the Data Labelling view in AI Center, next to the name of the Data Labelling session.

    What this activity does is create a folder called fine-tune inside your Dataset and write the exported documents there in three folders: documents, metadata, and predictions. From this folder the data will then be imported into Data Manager automatically, merged with the previously existing data, and exported in the right format to be consumed by a Training or Full pipeline.

  2. Data Manager - Scheduled Exports feature

    When you click the Export button you will see a dialog with a tab called Schedule (Preview). Open that tab and toggle Scheduling to On. Then select the time and the periodicity in days. Please note that AI Center training pipelines are mainly configured to run weekly, so a periodicity of 7 days is probably your best choice at the time this post is published.

    The Scheduled Export operation actually does two things: it imports the data which exists in the fine-tune folder created in Step 1, and then it exports the full dataset, including the previously existing data and the newly imported Validation Station data, into the export folder. So with each scheduled export, the dataset gets larger and larger.

    The scheduled import+export operation might take 1-2 hours, depending on how much data was sent by Step 1 during the previous week, so you might want to choose a time when you will not be using Data Manager: while an export operation is ongoing, no other exports or imports are allowed.

  3. Scheduled auto-retraining Pipeline

    When creating a Training or Full pipeline in AI Center, there are a few things to be careful about. First, you need to select the export folder of your dataset in the “Input Dataset” dropdown; if you do not select the export folder, auto-retraining will not work. Then you need to toggle the auto_retraining environment variable of the ML Package to True. Finally, you need to select a recurring day and time that leave enough time for the export from Data Manager to finish. So if the Data Manager export runs at 1 AM on Saturday, the pipeline might run at 2 or 3 AM on Saturday. If the export has not finished when the pipeline runs, the pipeline will use the previous export, and it might retrain on the same data it trained on the previous week.
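To make the folder layout from Step 1 concrete, here is a minimal Python sketch. The layout itself (a fine-tune folder containing documents, metadata and predictions subfolders) is from the post; the helper function and the local path are hypothetical, since in practice the folder lives inside your AI Center dataset, not on local disk:

```python
import tempfile
from pathlib import Path

# Subfolders the ML Extractor Trainer activity writes inside the
# "fine-tune" folder of the Dataset (per Step 1 above).
EXPECTED_SUBFOLDERS = {"documents", "metadata", "predictions"}

def missing_fine_tune_subfolders(dataset_root: str) -> list[str]:
    """Return the expected subfolders missing under <dataset_root>/fine-tune.

    Hypothetical helper: assumes you have a local copy of the dataset.
    """
    fine_tune = Path(dataset_root) / "fine-tune"
    if not fine_tune.is_dir():
        return sorted(EXPECTED_SUBFOLDERS)
    present = {p.name for p in fine_tune.iterdir() if p.is_dir()}
    return sorted(EXPECTED_SUBFOLDERS - present)

# Demo: build the layout in a temporary directory, then verify it.
with tempfile.TemporaryDirectory() as root:
    for sub in EXPECTED_SUBFOLDERS:
        (Path(root) / "fine-tune" / sub).mkdir(parents=True)
    print(missing_fine_tune_subfolders(root))  # []
```

A quick check like this can be handy when debugging why a scheduled export found nothing to import.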
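The timing constraint between Steps 2 and 3 boils down to a small calculation: the weekly pipeline should start no earlier than the export's worst-case finish time. The 1 AM Saturday export and the up-to-2-hour duration come from the post; the function below is purely illustrative arithmetic, not an AI Center API (schedules are configured in the UI):

```python
from datetime import datetime, timedelta

def earliest_pipeline_start(export_start: datetime,
                            worst_case_export: timedelta) -> datetime:
    """Earliest safe start for the weekly Training/Full pipeline:
    the export's start time plus its worst-case duration."""
    return export_start + worst_case_export

# Example from the post: export at 1 AM on a Saturday, up to 2 hours long.
export_start = datetime(2021, 5, 1, 1, 0)  # 2021-05-01 is a Saturday
pipeline_start = earliest_pipeline_start(export_start, timedelta(hours=2))
print(pipeline_start.strftime("%A %H:%M"))  # Saturday 03:00
```

Scheduling the pipeline any earlier risks the behaviour described in Step 3: it silently falls back to the previous week's export.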

And that’s all, folks! Let us know what you think below.

Your friendly Document Understanding team.