Is retraining the base Document Understanding model more effective than using the default Invoices model?

We are using UiPath Document Understanding workflows specifically for processing invoices. We have been using the standard Invoices model to extract invoice values, which are then validated by accounts payable. This has been working quite well for some time.

We recently started looking into using the AI Fabric additions to retrain the Invoices model so that the automatically selected fields are more accurate. I was recently told that we cannot retrain the base Invoices model without purchasing GPU licensing, and that we will need to use the base Document Understanding model instead for CPU-only retraining. My concern is that the pipelines section of the documentation states that retraining one of the base models (invoices, receipts) takes advantage of the preexisting model, while training with the Document Understanding model “just trains a model from scratch on the dataset provided as input”.

The retrained Document Understanding model recently finished training on 176 of our invoices, but it does not seem to perform as well as the base Invoices model. Is it recommended that we get access to a GPU for retraining the actual Invoices model if we are solely interested in improving invoice processing?

We tried retraining the “Document Understanding” model again, this time with 300 invoices, but we still see significantly better results from the base “Invoices” model without any retraining. It looks like we may need to wait until retraining the “Invoices” model is available on CPU only before we can test this properly.
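For anyone wanting to put numbers on a comparison like this, here is a minimal sketch of a per-field exact-match score on a held-out set of validated invoices. It assumes you already have each model's extracted values as plain dictionaries; the function name and the field names (total, invoice_no) are hypothetical and are not part of any UiPath API.

```python
from collections import defaultdict


def field_accuracy(truth, predicted):
    """Return the exact-match rate per field across documents.

    truth, predicted: lists of dicts (one per document, same order),
    each mapping a field name to its value as a string.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth_doc, predicted_doc in zip(truth, predicted):
        for field, expected in truth_doc.items():
            totals[field] += 1
            # A missing field in the prediction counts as a miss.
            if predicted_doc.get(field) == expected:
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}


if __name__ == "__main__":
    # Hypothetical validated values and model output for two invoices.
    validated = [
        {"total": "10.00", "invoice_no": "A1"},
        {"total": "5.00", "invoice_no": "B2"},
    ]
    model_output = [
        {"total": "10.00", "invoice_no": "A1"},
        {"total": "5.50", "invoice_no": "B2"},
    ]
    print(field_accuracy(validated, model_output))
```

Running the same function on the output of both models, against the same validated invoices, gives a field-by-field view of where one model is ahead of the other.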

Hi, how do you set up the dataset for the Document Understanding model in AI Fabric?

We used the Data Manager to import real invoices, verified all fields within Data Manager, and then exported the dataset folder. Once it is exported, you just need to unzip the exported folder and upload it to a dataset under AI Fabric > Datasets.
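If you would rather script the unzip step, the following is a minimal sketch using only the Python standard library. The function name and the paths are hypothetical, and it makes no assumption about what the Data Manager export contains; it only extracts the archive and lists the files so you can check them before uploading the folder by hand.

```python
import zipfile
from pathlib import Path


def unzip_export(zip_path, dest_dir):
    """Extract an exported dataset archive and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Paths are returned relative to dest, sorted, for a quick visual check.
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    # Hypothetical file names; replace with your own export and target folder.
    for name in unzip_export("export.zip", "invoices_dataset"):
        print(name)
```

The upload to AI Fabric > Datasets is still done through the UI as described above.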

Thank you for the info. Based on further searching, Data Manager is only available on-prem, so I will need to request an on-prem trial before I can try using Document Understanding as an AI Fabric ML Skill, since there is no dataset to upload without using the Data Manager.

Hi @mtu,

The recently released Invoices model can be retrained on CPU as well. That was an issue which has been resolved.

For best results you should retrain the Invoices model using your own labelled data. This way you keep the knowledge of the out-of-the-box Invoices model while optimizing it a bit on your data.

To get a model trained from scratch (using the Document Understanding package) to the same level, you would need thousands of samples, at least 2000 I would say. So my recommendation is to make sure you are using the latest AI Fabric version (in the Cloud it’s automatic, but on-prem you might need to pull the latest aifmanager container). Then create an Invoices package and train it on your 300 docs, using version 1.0 as the base.

It should work on CPU, though it might take a whole day to run, depending on how many CPU cores you have.

Please update here how it worked.

This is great news! I will start retraining the invoices model again to see if it works now.

I actually got the same error again. I am using the AI Fabric tab, so I am assuming that I am on the latest version.

So it looks like I needed to create another MLPackage using the Invoices 2.0 model in order to begin Pipeline retraining on CPU only.

Is there a step-by-step tutorial for setting this up specifically for the Invoices model? The example in the AI Fabric overview does not use the Invoices model.

