We are using Document Understanding with the Google Vision API as the OCR engine, and we created an ML Skill trained on over 5,500 invoice files covering multiple languages and many different templates.
The problem is that data extraction accuracy is not high, and whenever we train on more invoices of a different kind, accuracy seems to go down. In other words, some fields that were extracted correctly before are no longer extracted correctly after training on the new invoices.
It feels like the more invoices we train on, the lower the extraction accuracy gets.
Someone said that training on too many invoices causes this problem, and that the ML Skill should be split by language or template, but I am not sure that is the best solution.
Please advise how we can keep data extraction accuracy high with a suitable ML Skill setup.
Yes, split it that way, so that the algorithm learns the pattern for one language at a time.
If there are multiple formats, make sure you create custom models to handle them. The video below shows the steps:
Building & Training Custom ML Models for Document Processing | RPA | UiPath - YouTube
Regarding the number of documents needed for training, that varies from case to case, but for a single format I recommend at least 7-10 invoices so the algorithm can learn it.
Given how many invoice files you have, 100 samples per language is a good sample size.
The model will improve more if there are separate ML Skills/models per language, since the model is trained by recognizing characters. You can either classify documents first and route them to the right model, or deploy a new model for each language.
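Conceptually, the classify-then-route option looks like the sketch below. This is a minimal Python illustration only, not UiPath workflow code: the langdetect library is just one way to classify the language, and the per-language skill names are placeholders for whatever ML Skills you actually deploy.

```python
# Minimal sketch: route each document to the ML Skill trained on its language.
# Assumptions: OCR text is already available; skill names are placeholders.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

# Hypothetical mapping from detected language code to a per-language ML Skill
SKILL_BY_LANGUAGE = {
    "en": "Invoices_EN_Skill",
    "de": "Invoices_DE_Skill",
    "ja": "Invoices_JA_Skill",
}
FALLBACK_SKILL = "Invoices_Generic_Skill"

def pick_skill(ocr_text: str) -> str:
    """Return the name of the ML Skill that should extract this document."""
    try:
        language = detect(ocr_text)   # e.g. "en", "de", "ja"
    except LangDetectException:       # empty or unreadable OCR output
        return FALLBACK_SKILL
    return SKILL_BY_LANGUAGE.get(language, FALLBACK_SKILL)

if __name__ == "__main__":
    sample = "Rechnungsnummer 4711, Gesamtbetrag 120,00 EUR, Lieferdatum 01.03.2023"
    print(pick_skill(sample))         # expected: Invoices_DE_Skill
```

The same routing can be done inside a UiPath workflow with a document classifier in front of the extractors; the point is simply that each document only ever reaches the model trained on its own language.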
Do try with 100 samples first, then maybe increase the size and see whether accuracy improves; however, there will be a maximum threshold at some point.
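To catch the kind of regression described above (fields that were extracted correctly before retraining but not after), it also helps to score every new model version against the same frozen, labeled validation set before promoting it. Below is a rough Python sketch of that comparison; the field names and data layout are assumptions, not something Document Understanding produces in this exact shape.

```python
# Sketch: compare per-field extraction accuracy of two model versions on the
# same labeled validation set, to spot fields that regressed after retraining.
# Assumption: each record is a dict of field name -> extracted/expected string.
from typing import Dict, List

def field_accuracy(predictions: List[Dict[str, str]],
                   labels: List[Dict[str, str]]) -> Dict[str, float]:
    """Exact-match accuracy per field over the whole validation set."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for pred, gold in zip(predictions, labels):
        for field, expected in gold.items():
            totals[field] = totals.get(field, 0) + 1
            if pred.get(field, "").strip() == expected.strip():
                hits[field] = hits.get(field, 0) + 1
    return {field: hits.get(field, 0) / totals[field] for field in totals}

def report_regressions(before: Dict[str, float], after: Dict[str, float]) -> None:
    """Print every field whose accuracy dropped with the new model version."""
    for field, old_score in before.items():
        new_score = after.get(field, 0.0)
        if new_score < old_score:
            print(f"REGRESSION {field}: {old_score:.2%} -> {new_score:.2%}")
```

If a retrained skill shows regressions on fields that used to work, keep the previous version live and look at what changed in the new training set (new language, new template, inconsistent labeling) before publishing it.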