Document Understanding - Auto-retraining of ML Models

Dear Community,

The wait is over! The ability to automatically retrain ML models with data from Validation Station in an RPA workflow is finally here! :tada:

However, let us first make sure we understand what this feature is, and what it is not. There are two major phases in the lifecycle of any Machine Learning model: the build phase and the maintenance phase.

In the build phase you use Data Manager to prepare the training and evaluation datasets and try to get the best performance possible, while at the same time building the RPA automation and business logic around the ML model, which is at least as important as the model itself for obtaining the Return on Investment you expect.

In the maintenance phase you try to maintain the high performance level you achieved in the build phase; in other words, you prevent regressions.

Automatic retraining belongs firmly in the maintenance phase. Its main objective is to prevent the model from regressing as the data flowing through the process changes. In particular, data fed back from human validation using Validation Station should not be used to build a model from scratch; building a model should be done by preparing training and evaluation datasets in Data Manager.

The description below assumes you have already created a Data Manager session in AI Center, carefully hand-labelled a high-quality Training dataset and an Evaluation dataset, trained a few versions of your ML model, tested it, ironed out any issues, and deployed it to your RPA+AI automation.

The 3 components of Auto-Retraining

  1. The ML Extractor Trainer activity. Add this to your workflow inside a Train Extractors Scope, configure the scope properly, make sure the Framework Alias contains the same alias as the ML Extractor alias in the Data Extraction Scope, and select the Project and the Dataset associated with the Data Manager session containing the Training and Evaluation datasets mentioned above.


    You can see the Dataset name on the Data Labelling view in AI Center, next to the name of the Data Labelling session.

    What this activity does is create a folder called fine-tune inside your Dataset and write the exported documents there in 3 folders: documents, metadata and predictions. This is the folder from which the data will later be imported into Data Manager automatically, merged with the previously existing data, and exported in the right format to be consumed by a Training or Full pipeline.
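
    For reference, here is a minimal Python sketch of what the resulting layout looks like; the check_finetune_folder helper and the local MyDataset path are purely illustrative, since the actual dataset lives in AI Center storage:

    ```python
    from pathlib import Path

    # Illustrative only: assumes a local copy of the dataset. The folder
    # names below are the ones the ML Extractor Trainer activity creates.
    EXPECTED_SUBFOLDERS = ("documents", "metadata", "predictions")

    def check_finetune_folder(dataset_root: str) -> None:
        """Report whether the fine-tune folder has the expected layout."""
        fine_tune = Path(dataset_root) / "fine-tune"
        if not fine_tune.is_dir():
            print(f"missing: {fine_tune}")
            return
        for name in EXPECTED_SUBFOLDERS:
            sub = fine_tune / name
            count = len(list(sub.iterdir())) if sub.is_dir() else 0
            print(f"{sub}: {count} item(s)")

    check_finetune_folder("MyDataset")  # hypothetical local dataset copy
    ```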

  2. Data Manager - Scheduled Exports feature

    When clicking the Export button you will see a dialog with a tab called Schedule (Preview). Open that tab and toggle Scheduling to On. Then you can select the time and the periodicity in days. Please note that AI Center training pipelines are typically configured to run weekly, so a periodicity of 7 days is probably your best choice at the time this post is published.

    The Scheduled Export operation actually does 2 things: it imports the data which exists in the fine-tune folder created in Step 1, and then it exports the full dataset, including the previously existing data and the newly imported Validation Station data, into the export folder. So with each scheduled export, the dataset gets larger and larger.

    The scheduled import+export operation might take 1-2 hours, depending on how much data was sent from Step 1 during the previous week, so you might want to choose a time when you will not be using Data Manager, since no other exports or imports are allowed while an export operation is ongoing.
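
    To get a feel for how the dataset grows, here is a toy sketch; all the numbers are made up:

    ```python
    # Rough model of the merge-then-export behaviour: every scheduled run
    # folds the week's fine-tune documents into the full dataset before
    # exporting it again. The counts are invented for illustration.
    initial_docs = 1000           # manually labelled build-phase documents
    weekly_validation_docs = 150  # documents fed back from Validation Station

    total = initial_docs
    for week in range(1, 5):
        total += weekly_validation_docs  # the import + merge step
        print(f"week {week}: export contains {total} documents")
    ```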

  3. Scheduled auto-retraining Pipeline

    When creating a Training or Full Pipeline in AI Center, there are a few things to be careful with. First, you need to select the export folder in your dataset in the “Input Dataset” dropdown; if you do not select the export folder, the auto-retraining will not work. Then you need to toggle the auto_retraining environment variable of the ML Package to True. And finally, you need to select a Recurring day and time that leaves enough time for the export from Data Manager to finish. So if the Data Manager export runs at 1 AM on Saturday, the Pipeline might run at 2 or 3 AM on Saturday. If the export is not finished when the pipeline runs, it will use the previous export, and it might retrain on the same data it trained on the previous week.
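
    As a sanity check on the scheduling, here is a small Python sketch; the times mirror the example above, and the 2-hour export duration is the worst case mentioned in Step 2:

    ```python
    from datetime import datetime, timedelta

    # July 2, 2022 was a Saturday; the exact date only matters for the demo.
    export_start = datetime(2022, 7, 2, 1, 0)    # Data Manager export, 1 AM
    export_duration = timedelta(hours=2)         # worst case from Step 2
    pipeline_start = datetime(2022, 7, 2, 3, 0)  # auto-retraining pipeline, 3 AM

    if pipeline_start >= export_start + export_duration:
        print("OK: the pipeline starts after the export can finish")
    else:
        print("Warning: the pipeline may pick up last week's export")
    ```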

And that’s all folks! Let us know what you think below by hitting the Like button.

Your friendly Document Understanding team.


Hi @alexcabuz!

Thanks for the great news. We were impatiently waiting for this feature to be released!

Unfortunately, we keep running into problems with the auto-deployment of the pipeline. It seems that the folder structure is causing the problem, because the error received is: Pipeline failed due to ML Package Issue. The first error in the logs is: ERROR: images/ directory does not exist / is empty.

We must have configured the pipeline incorrectly, but I cannot see where.

As per your description above we:

  1. Carefully trained the OOB model for Invoices AUS
  2. Implemented the automatic upload within the RPA process to the same Data Manager dataset used for the training above. To ensure high quality and avoid overtraining, only documents where the Robot’s confidence is lower than x% are uploaded for retraining (see the sketch after this list).

After this, as expected, a new folder structure was created within the dataset:

  • Export
  • Fine Tune
  • Internal
  3. Created the scheduled export in Data Manager on a daily basis
  4. Created the auto-retraining pipeline. As described above, the export folder was selected as the pipeline’s input dataset.
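
For illustration, the confidence filter from step 2 looks roughly like this; the threshold and the shape of the extraction result are simplified stand-ins for our actual workflow:

```python
# Only documents where the Robot's confidence falls below the threshold
# are sent to Validation Station and, later, uploaded for retraining.
# The threshold ("x%") and the result structure are illustrative.
CONFIDENCE_THRESHOLD = 0.80

def needs_validation(extraction_result: dict) -> bool:
    """Return True if any extracted field is below the threshold."""
    confidences = [f["confidence"] for f in extraction_result["fields"]]
    return min(confidences, default=1.0) < CONFIDENCE_THRESHOLD

doc = {"fields": [{"name": "total", "confidence": 0.65},
                  {"name": "date", "confidence": 0.97}]}
print(needs_validation(doc))  # True -> goes to Validation Station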

The result of the pipeline is the error mentioned above. Is there any issue with the setup? From previous, manual pipeline runs we know that we must select a subfolder within the export folder for the pipeline to run successfully. However, with the new feature I expected this to be handled differently?

It would be a great help to understand where the issue lies in the process.

Thanks in advance
Andreas


Hey Andreas, did you manage to get it to work?

Thanks

Having the same issue.

I’m having an issue where I choose the export folder for the Pipeline, but it keeps failing with the error: “Document type du not valid, check that document type data is in dataset folder and follows folder structure”.

Can you provide more clarity and screenshots of the error?

Hi Sharon,

Here are some screenshots of the pipeline’s configuration:

As you can see, the Input directory is the export folder and auto_retraining is set to True. I have also set the recurring schedule, but the Status is always FAILED.

I have also attached the Input Directory.

This is the message from the logs:

Hi there. I am getting the same error.

It seems to me that the latest.txt file is not being used.

This seems to be an issue with the ML Packages: I tried using 22.4.0 and it works as expected, while higher versions get this error.
