Trainable ML model for invoice extraction - Pipeline failed

Hello All / @alexcabuz & @Lahiru.Fernando

I’m working on a use case where I want to extract data from invoices using Document Understanding and an ML model that I can retrain.

I have attached the use case of extracting the invoice here.
ML (3.2 MB)

I have generated the set of files needed for training through the Train Extractor Scope and ML Extractor Trainer. I have now got access to the Insider program and to Data Labeling.

I uploaded those files to Data Labeling, created a new dataset, created regular fields, and tried to export to AI Fabric.

Then I created both a Train pipeline and an Evaluation pipeline. Unfortunately, both pipelines failed with an error, and I’m not sure what to do next. I have attached the logs for both below:

2e67d5f1-47e4-492c-8715-5847f51ef9fc.docx (14.0 KB)

Can you help me figure out what I’m missing here?

Hello @Arun_Singh

First, ML Extractor Trainer data is for Fine Tuning of models, not for training. Fine tuning means you need to have a model already (trained using data labelled in Data Manager) and you use the ML Extractor Trainer data to fine tune it.

Second, it looks like your pipeline is not pointed at the right folder. The input folder needs to contain these 4 things: 2 folders called images and latest, and 2 files called split.csv and schema.json. Can you make sure this is the case and try again?
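A quick way to check the layout before uploading is a small script like the one below. This is only a minimal sketch: the function name `missing_dataset_items` is made up for illustration and is not part of any UiPath tooling, and it checks nothing beyond the presence of the four items named above.

```python
from pathlib import Path

# The four items the pipeline input folder is expected to contain.
REQUIRED_DIRS = ("images", "latest")
REQUIRED_FILES = ("split.csv", "schema.json")


def missing_dataset_items(dataset_root):
    """Return the required folders/files that are absent from dataset_root.

    Hypothetical helper for illustration only. Folders are reported with a
    trailing slash so they are easy to tell apart from files.
    """
    root = Path(dataset_root)
    missing = [f"{name}/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing
```

Calling `missing_dataset_items("path/to/exported/dataset")` returns an empty list when all four items sit directly in that folder. If the list is not empty, the pipeline is probably pointed one level too high or too low.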

Let me know how it goes,

Hello Alex,

First, thanks for the information you provided; it was very helpful for understanding the concept. Now I have a few issues and many questions.

As you said, I first downloaded the schema for the invoices using Data Manager and trained the ML model in a pipeline, which was successful. When I ran the process using the trained model, I could see that the confidence level had increased from 75% to 99% for 3 fields. However, for the 4th field, the items (the names of the products I purchased), the model still does not predict the purchased items when I run it through UiPath, even after training with the Data Manager data.

So, using the ML Extractor Trainer, I have extracted data that has 3 folders:

  1. documents - has all the input invoices in PDF format
  2. metadata - has all the JSON files for the input files
  3. predictions - No data

I compressed this and uploaded the folder to Data Manager, and it shows the items highlighted correctly. I then exported the data from Data Manager and used that export (the input folder containing the 2 folders called images and latest, and the 2 files called split.csv and schema.json) to retrain the previously trained model.

So in Pipelines, I created a new pipeline, chose the ML model with major version 5 (Invoices India ML model) and minor version 1 (previously it was 0), and started the training, which fails with an error. I have attached the logs below.

e12fd44a-3c19-4513-bffb-e3a9163cd8fd.txt (54.7 KB)

  1. How can I train my ML model again using the data from the ML Extractor Trainer?
  2. How do I use the evaluation and full pipeline runs in this scenario to get the confidence level?

Kindly help me in solving these hurdles.

First, as described on this documentation page, training requires at least 20-30 samples to give meaningful results, and even that holds only if the documents are low diversity, i.e. all from the same template: UiPath Document Understanding

In your case it looks like you have different layouts, so you might need more. The same page describes data volumes you should expect for high diversity documents, i.e. dozens, hundreds or thousands of layouts.

Second, you should always train on minor version 0, i.e. the OOB model provided by UiPath. Do not retrain a model that you have previously trained yourself, as that will give poor results. See the warning callout titled Retraining on top of previously trained models on this same documentation page.