AI Center Pipeline Version Accuracy Issues

Hi All, @nisargkadam23 @mukeshkala @Lahiru.Fernando

Can anyone give me a precise answer to my question about pipeline minor version selection?

Assume I have documents in different formats; let's treat each format as a collection.

The extraction fields I labelled in Document Manager are order_num and customer_name.

Collection 1 - 20 docs e.g format 1
Collection 2 - 25 docs e.g format 2
Collection 3 - 30 docs e.g format 3

I am going to extract the same fields from all the collections, namely order_num and customer_name.

I have trained all the docs and exported them. The first time, I ran the collection 1 pipeline and chose major version 22.10.1 and minor version 0.

The pipeline succeeded, and I tested a sample of 50 docs from collection 1 using the extraction method; it gave 90% accuracy.

Then I ran the collection 2 pipeline and again chose major version 22.10.1 and minor version 0.

In this case, the collection 2 documents give good accuracy, but when I retested collection 1 its accuracy had changed completely and was only 10%.

I also tried changing the pipeline minor version (0, 1, 2) for every pipeline run, but that doesn't work either. Any help is appreciated.

Is there a good approach to selecting the minor version of the pipeline?

Hi @SrenivasanKanna

For this you have to use classification as well. Because you have 3 different formats, each format needs to be identified by a different classification. After adding classification, train the first set on minor version 0 and you get minor version 1; then train the second set on minor version 1. Since classification is included, the bot will classify first and then extract.

Cheers

Is highlighting one field enough to classify documents?

Hi @SrenivasanKanna

You can go with one, but it's better to always give at least 3 to 5 different fields to classify.

Cheers

Thank you. I can add a classification field to an existing Document Manager data labelling session, right?

I think it won't work in my case, since I cannot add classification word options while creating the field.

Can I highlight and classify the document field at the time of labelling the docs? Is that possible?

Hi @SrenivasanKanna

Yes, you will see a classification field in AI Center when you go to data labelling.

Add the classification fields there.

And whenever you want the previously trained data to be retained, make sure you select the trained minor version, not 0. Selecting 0 is like training on a new dataset, and all the old training won't be there.

Cheers

Thank you. Please correct me on the two questions below.

  1. Minor version selection for pipeline runs:

Let's say I have collection 1 (30 docs of the same format), which I labelled, and then I ran the pipeline; the minor version changed from 0 to 1.

Then I ran the pipeline for the collection 2 docs with minor version 1 selected, and it changed to 2.

Assume that after a few months I get a few additional samples for collection 1 and train on them. In this case I need to select minor version 1, am I right? These docs relate to version 1.

  2. Classification field:

Do I need to add all the unique classification fields every time I train the docs?

Hi @SrenivasanKanna

No, you have to select minor version 2.

This is how it works.

You trained 0 to 1 with one set of documents.
Then you trained 1 to 2 with another set of documents.

Now, even if you are training with a similar set of documents, you still need to select minor version 2, because minor version 2 holds the results of both previous training runs.

If you train on minor version 1, the training done with the second set of documents will be lost.

So the conclusion is: whenever you train, train on the latest minor version if you also want to keep the old training results.

And whenever you label the fields, you have to label the classification field as well.
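The lineage described in this reply can be pictured with a small toy model. This is only a sketch of the reasoning, not UiPath code: `ModelVersion`, `train`, and the collection names are hypothetical, and it assumes (as the reply states) that a retrained version carries forward whatever its base version had learned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    """Toy stand-in for a trained minor version (hypothetical, not a UiPath type)."""
    minor: int
    seen: frozenset  # collections whose training this version carries


def train(base: ModelVersion, collection: str, next_minor: int) -> ModelVersion:
    # Assumption from the reply: the new version inherits the base's training.
    return ModelVersion(next_minor, base.seen | {collection})


base = ModelVersion(0, frozenset())          # out-of-the-box model, minor 0
v1 = train(base, "collection_1", 1)          # 0 -> 1
v2 = train(v1, "collection_2", 2)            # 1 -> 2

# New collection 1 samples arrive months later.
from_latest = train(v2, "collection_1_extra", 3)  # based on minor 2
from_v1 = train(v1, "collection_1_extra", 3)      # based on minor 1

print(sorted(from_latest.seen))  # includes collection_2
print(sorted(from_v1.seen))      # collection_2 is missing
```

In this model, basing the run on minor version 1 yields a version that never saw collection 2, which is the loss the reply warns about.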

Cheers

Thank you so much, that's how I trained all the collections.
Let me apply these classification rules and check the accuracy.
I will let you know soon.

I contacted the UiPath team, and they made these points:

You need at least 30-50 documents for each field. I request you increase the number of documents.

Also, the pipeline needs to be run on the base version with a complete dataset (Collection 1 + Collection 2) for better results with DU ML models.

I have trained more than 40-50 documents in each collection and ran the pipelines using both approaches:

- Always running the pipeline on base minor version 0

e.g.
Collection 1 pipeline run – Choose minor version 0
Collection 2 pipeline run – Choose minor version 0
Collection 3 pipeline run – Choose minor version 0

- Incrementing the pipeline minor version (0, 1, 2) for each collection's pipeline run

e.g.
Collection 1 pipeline run – Choose minor version 0
Collection 2 pipeline run – Choose minor version 1
Collection 3 pipeline run – Choose minor version 2

Neither method gives any accuracy improvement, and each run affects the previously trained models. I am not sure how to proceed.

Note

For successfully running Training or Full pipelines we strongly recommend at least 25 documents and at least 10 samples from each labelled field in your dataset. Otherwise the pipeline will show an error “Dataset Creation Failed”
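The thresholds in this note can be checked before exporting a dataset. The sketch below is an illustration only: `check_dataset` and the input shape (one set of labelled field names per document) are assumptions, not part of any UiPath API; only the numbers 25 and 10 come from the note.

```python
from collections import Counter

MIN_DOCUMENTS = 25          # from the note: at least 25 documents
MIN_SAMPLES_PER_FIELD = 10  # from the note: at least 10 samples per labelled field


def check_dataset(labelled_docs):
    """Return a list of problems; an empty list means the thresholds are met.

    labelled_docs: list of sets, each holding the field names labelled
    in one document (hypothetical input shape).
    """
    problems = []
    if len(labelled_docs) < MIN_DOCUMENTS:
        problems.append(
            f"only {len(labelled_docs)} documents, at least {MIN_DOCUMENTS} recommended"
        )
    counts = Counter(field for doc in labelled_docs for field in doc)
    for field, n in sorted(counts.items()):
        if n < MIN_SAMPLES_PER_FIELD:
            problems.append(
                f"field '{field}' has {n} samples, at least {MIN_SAMPLES_PER_FIELD} recommended"
            )
    return problems


both_fields = {"order_num", "customer_name"}
collection_1 = [both_fields] * 20
combined = [both_fields] * (20 + 25 + 30)  # collections 1, 2 and 3 together

print(check_dataset(collection_1))  # flags the document count
print(check_dataset(combined))      # no problems
```

With the collection sizes from the original question, collection 1 alone falls short of the document count, while the three collections combined pass.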

Retraining on top of previously trained models

As more data gets labelled, either using Data Manager or coming from Validation Station, best results are obtained by maintaining a single dataset and adding more data to it, and always retraining on the base model provided by UiPath, with minor version 0. It is strongly recommended to avoid retraining using a base model which you trained yourself previously (minor version 1 or higher).

https://docs.uipath.com/ai-fabric/v2020.7/docs/uipath-document-understanding

Hi @SrenivasanKanna

You need to either move all the documents into the same dataset and train 0 to 1, or, if you keep 3 separate datasets, train them one by one on successive minor versions.

According to the recommendation in the docs, the first approach gives the best results: move the 3 datasets, and any new data, into the same dataset and train.

When training on the same minor version, it is recommended to use one dataset, because all the variations need to be considered and that can only be done with a single dataset.

In other words, they want you to train on minor version 0 with all the available data placed in one dataset, so that when retraining you are training on both the old and the new data.

If that is not possible and you have a requirement to keep the datasets separate, then training on successive minor versions is the way to go.

Cheers

In my case, it's not a one-time training dataset. I train whenever I receive a new format of data in Document Manager.

Let's say I received 500 samples today and trained on them, and next month I may get another 200 in a different format to train on. In this case I have to use the latest minor version, not the base version.

Am I correct?

Hi @SrenivasanKanna

So here you have two approaches:

1. Always place the data into the same dataset and train on version 0. That way your old and new data are all trained together.
2. Place only the new data into a new dataset each time and then train on the current version.

UiPath recommends the first way because the dataset is easier to maintain, with all the data in a single folder.

The second approach is used when you receive a very high volume of samples every day and the dataset is so large that retraining on all the documents daily is not realistic.
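The two approaches differ only in which documents go into the run and which minor version serves as the base. A minimal sketch of that choice, where `plan_training_run` and the strategy names are hypothetical and nothing here calls AI Center:

```python
def plan_training_run(strategy, existing_docs, new_docs, latest_minor):
    """Return the dataset and base minor version for the next pipeline run.

    strategy: "single_dataset" (approach 1) or "incremental" (approach 2).
    These names are made up for illustration.
    """
    if strategy == "single_dataset":
        # Approach 1: old and new data together, always on the base model.
        return {"dataset": list(existing_docs) + list(new_docs), "base_minor_version": 0}
    if strategy == "incremental":
        # Approach 2: only the new data, on top of the latest trained version.
        return {"dataset": list(new_docs), "base_minor_version": latest_minor}
    raise ValueError(f"unknown strategy: {strategy}")


old = [f"old_{i}" for i in range(500)]
new = [f"new_{i}" for i in range(200)]

plan_1 = plan_training_run("single_dataset", old, new, latest_minor=1)
plan_2 = plan_training_run("incremental", old, new, latest_minor=1)

print(len(plan_1["dataset"]), plan_1["base_minor_version"])
print(len(plan_2["dataset"]), plan_2["base_minor_version"])
```

With the 500 and 200 samples mentioned earlier in the thread, approach 1 trains on all 700 documents from minor version 0, while approach 2 trains on the 200 new ones from the latest minor version.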

cheers

Thank you so much, I got it.