Can anyone give me a definitive answer to my question about pipeline minor version selection?
Assume I have documents in different formats; let's treat each format as a collection.
The extraction fields I labelled in Document Manager are order_num and customer_name.
Collection 1 - 20 docs (format 1)
Collection 2 - 25 docs (format 2)
Collection 3 - 30 docs (format 3)
I am going to extract the common fields, order_num and customer_name, from all the collections.
I labelled and exported all the docs. The first time, I ran the Collection 1 pipeline and chose major version 22.10.1 and minor version 0.
The pipeline succeeded, and when I tested a sample of 50 docs from Collection 1 using the extraction method, it gave 90% accuracy.
Then I ran the Collection 2 pipeline and chose major version 22.10.1 and minor version 0.
Now the Collection 2 documents give good accuracy, but when I retested Collection 1, its accuracy had changed completely: it gives only 10%.
Can anyone help? I tried changing the pipeline minor version (0, 1, 2) on every pipeline run, but it doesn't work.
Is there a good approach to selecting the minor version of the pipeline?
For this you have to use classification as well. Because you have 3 different formats, each format needs to be identified by a different classification. After adding classification, you train the first set on minor version 0, which gives you minor version 1; then you train the second set on minor version 1. Since classification is included, the bot will classify each document first and then extract.
Yes, you will see a classify field in AI Center when you go to data labelling.
Add the classify fields there.
And whenever you want the old training to still be present, make sure you select the previously trained minor version, not 0. Selecting 0 is like training a brand-new dataset, and all the old training will be gone.
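To make the classify-then-extract idea concrete, here is a rough Python sketch (illustrative only, not UiPath code; all the names here are made up):

```python
# Illustrative sketch only, not UiPath code: it shows why a classify
# field lets one model handle several formats. All names are made up.

FORMAT_LAYOUTS = {
    "format_1": {"order_num": "top-right", "customer_name": "header"},
    "format_2": {"order_num": "footer", "customer_name": "top-left"},
    "format_3": {"order_num": "body table", "customer_name": "footer"},
}

def classify(doc_text: str) -> str:
    # Stand-in for the trained classify field; the real model learns
    # this from the labelled classification data.
    if "format 2" in doc_text.lower():
        return "format_2"
    if "format 3" in doc_text.lower():
        return "format_3"
    return "format_1"

def extract(doc_text: str) -> dict:
    # Classify first, then extract using that format's layout.
    fmt = classify(doc_text)
    layout = FORMAT_LAYOUTS[fmt]
    # A real extractor would read values from those regions; we return
    # the routing decision to show the classify-then-extract flow.
    return {"doc_type": fmt, "layout_used": layout}

print(extract("Invoice in format 2 ..."))
```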
Thank you. Please correct me on the two questions below.
Minor version selection for pipeline runs:
Let's say I have Collection 1 (30 docs of the same format) labelled, and I ran the pipeline; the minor version changed from 0 to 1.
Then I ran the Collection 2 docs with minor version 1 selected, and it changed to 2.
Assume that after a few months I receive a few additional samples of the same format as Collection 1 and train on them. In this case, do I need to select pipeline minor version 1, since those documents relate to version 1?
Classification fields:
Do I need to add all the unique classification fields every time I train the docs?
You trained 0 to 1 with one set of documents.
Then you trained 1 to 2 with another set of documents.
Now, even if you are training with documents similar to the first set, you still need to select minor version 2, because minor version 2 holds the results of both previous versions in its memory.
If you train on minor version 1, the training done with the second set of documents will be lost.
So the conclusion is: whenever you train, train on the latest minor version if you want to keep the old training results.
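To see why, here is a minimal conceptual sketch in Python (ModelVersion and train are made-up names for illustration, not the AI Center API):

```python
# Minimal conceptual sketch of minor-version lineage (not the AI Center
# API; ModelVersion and train() are made-up names for illustration).

from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    minor: int
    learned: set = field(default_factory=set)

def train(base: ModelVersion, new_docs: set) -> ModelVersion:
    # Training on top of `base` keeps everything it already learned.
    return ModelVersion(base.minor + 1, base.learned | new_docs)

base = ModelVersion(0)                       # UiPath base model, minor 0
v1 = train(base, {"collection 1"})           # 0 -> 1
v2 = train(v1, {"collection 2"})             # 1 -> 2
print(v2.learned)                            # both collections remembered

# Retraining from minor 1 instead of 2 drops what version 2 learned:
v2_bad = train(v1, {"collection 1 extras"})  # collection 2 is not in here
print(v2_bad.learned)                        # {'collection 1', 'collection 1 extras'}
```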
And whenever you label the fields, you have to label the classification field as well.
I contacted the UiPath team and they made these points:
You need at least 30-50 documents for each field. I request you to increase the number of documents.
Also, the pipeline needs to be run on the base version with a complete dataset (Collection 1 + Collection 2) for better results with DU ML models.
I have trained more than 40-50 documents in each collection and ran pipelines with both kinds of logic.
- Always run the pipeline on base minor version 0, e.g.:
Collection 1 pipeline run – Choose minor version 0
Collection 2 pipeline run – Choose minor version 0
Collection 3 pipeline run – Choose minor version 0
- Increment the pipeline minor version (0, 1, 2) across the collection pipeline runs, e.g.:
Collection 1 pipeline run – Choose minor version 0
Collection 2 pipeline run – Choose minor version 1
Collection 3 pipeline run – Choose minor version 2
Neither method gives any accuracy improvement, and each run affects the previously trained models. I am not sure how to take it forward.
For successfully running Training or Full pipelines, we strongly recommend at least 25 documents and at least 10 samples from each labelled field in your dataset. Otherwise, the pipeline will show the error “Dataset Creation Failed”.
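As a pre-flight sanity check against those minimums, you could script something like this (a sketch only; how you count labelled samples per field depends on your Document Manager export, so field_counts here is an assumption):

```python
# Pre-flight check for the documented minimums: at least 25 documents
# and at least 10 labelled samples per field. Sketch only; how you
# count labels depends on your Document Manager export format.

MIN_DOCS = 25
MIN_SAMPLES_PER_FIELD = 10

def dataset_ready(num_docs: int, field_counts: dict[str, int]) -> bool:
    # Return True if the dataset meets the documented minimums.
    problems = []
    if num_docs < MIN_DOCS:
        problems.append(f"only {num_docs} docs, need {MIN_DOCS}")
    for field_name, count in field_counts.items():
        if count < MIN_SAMPLES_PER_FIELD:
            problems.append(f"{field_name}: {count} samples, need {MIN_SAMPLES_PER_FIELD}")
    for p in problems:
        print("NOT READY:", p)
    return not problems

# Example: 30 docs, but customer_name is under-labelled.
dataset_ready(30, {"order_num": 28, "customer_name": 7})
```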
Retraining on top of previously trained models
As more data gets labelled, either using Data Manager or coming from Validation Station, best results are obtained by maintaining a single dataset and adding more data to it, and always retraining on the base model provided by UiPath, with minor version 0. It is strongly recommended to avoid retraining using a base model which you trained yourself previously (minor version 1 or higher).
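In pseudo-workflow terms, that recommendation looks like this (a hedged sketch; train_pipeline is a hypothetical stand-in for launching a Training pipeline, not a real AI Center call):

```python
# Sketch of the documented recommendation: keep ONE dataset, append new
# data to it, and always retrain from the UiPath base model (minor 0).
# `train_pipeline` here is a hypothetical stand-in, not an AI Center API.

master_dataset: list[str] = []

def train_pipeline(dataset: list[str], base_minor: int) -> None:
    print(f"training {len(dataset)} docs on top of minor {base_minor}")

# Month 1: collections 1-3 all go into the same dataset.
master_dataset += ["collection 1 docs", "collection 2 docs", "collection 3 docs"]
train_pipeline(master_dataset, base_minor=0)

# Month 2: new samples arrive -> append them, then retrain from minor 0
# again (NOT from the minor version produced last month).
master_dataset += ["new collection 1 samples"]
train_pipeline(master_dataset, base_minor=0)
```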
You need to either move all the documents into the same dataset and train 0 to 1, or, if you keep 3 separate datasets, train them one by one on successive minor versions.
As per the recommendation on the site, the first approach gives the best results: move the 3 datasets, and any new data, into the same dataset and train.
When training on the same minor version, it is recommended to use one dataset, because all the variations need to be considered, and that can only be done with one dataset.
To understand the approach: they want you to train on minor version 0 with all the available data placed into one dataset, so that, ideally, when retraining you are training with both the old and the new data.
If that is not the case and you have a requirement to keep the datasets separate, then training on successive minor versions is your way.
In my case, it's not a train-once dataset. I train whenever I receive a new format of data in Document Manager.
Let's say today I received 500 samples and trained on them; I may get another format with 200 samples next month to train. So in this case I have to use the latest minor version, not the base version.
1. Always place the data into the same dataset and train on version 0. That way your old and new data are all trained together, so use version 0.
2. Place only the new data into a new dataset each time and then train on the current (latest) minor version.
UiPath recommends the first way because the dataset is easier to maintain and there is only one folder holding all the data.
The second approach is for when you receive a very high volume of samples every day and it is not realistic to retrain on all the documents daily once the dataset is very large (see the sketch below).
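For completeness, approach 2 in the same hedged sketch style (train_pipeline is again a made-up stand-in, not a real API):

```python
# Sketch of approach 2: each batch goes into its own small dataset, and
# you train on the LATEST minor so earlier batches stay in the model's
# memory. `train_pipeline` is a made-up stand-in, not an AI Center API.

def train_pipeline(docs: list[str], base_minor: int) -> int:
    # Pretend run: trains only `docs` on top of `base_minor` and
    # returns the newly minted minor version.
    print(f"train {len(docs)} docs on minor {base_minor} -> {base_minor + 1}")
    return base_minor + 1

latest = train_pipeline(["500 July samples"], base_minor=0)         # 0 -> 1
latest = train_pipeline(["200 August samples"], base_minor=latest)  # 1 -> 2
```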