I’m experiencing issues when trying to retrain the generic Document Understanding out-of-the-box package in AI Fabric.
Here’s what I tried:
I used the data labeling module to manually label all the data for each of my 10 training documents.
Afterwards, I created a full pipeline run using the generic, retrainable DU package (version 4.0).
I selected the folder created by Data Manager/Data Labeling as the input folder.
I created a separate folder containing PDF documents for evaluation and selected it as the evaluation dataset.
After activating this pipeline, it ran for a couple of hours. In the logs, I can see that it successfully reaches 150 epochs during training. However, an error occurs during the evaluation step:
ValueError: max_df corresponds to < documents than min_df
I've attached the full .log file to this post. What could be the cause of this error, and how can I fix it?
Looks like the classification model is failing (the currency field, I assume). My suggestion would be to add some additional documents where currency can be labelled (and label it on every page you have anyway). To see a decent improvement, I would try with at least 25-50 documents. The model is a deep learning architecture, and those are hungry for data: the more data it has, the better it can find patterns and make connections.
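For what it's worth, that exact error message comes from scikit-learn's text vectorizer, so my assumption is that the classifier builds a TF-IDF style vocabulary somewhere in the evaluation step. The error is raised when the document-frequency limits can't be satisfied because there are too few documents, which fits the diagnosis above. A minimal sketch reproducing it with made-up min_df/max_df values on a tiny corpus:

```python
# Hypothetical illustration: the same ValueError scikit-learn raises when the
# corpus is too small for the vectorizer's document-frequency limits.
from sklearn.feature_extraction.text import TfidfVectorizer

tiny_corpus = [
    "invoice total 100 usd",
    "invoice total 200 eur",
    "receipt total 300 usd",
]

# Made-up settings: keep terms appearing in at least 2 documents (min_df=2)
# but in at most 50% of documents (max_df=0.5). With only 3 documents,
# 0.5 * 3 = 1.5 < 2, so fitting raises:
#   ValueError: max_df corresponds to < documents than min_df
vectorizer = TfidfVectorizer(min_df=2, max_df=0.5)
vectorizer.fit_transform(tiny_corpus)
```

The fix on your side is the same either way: give it more labelled documents (for both training and evaluation) rather than trying to tune vectorizer settings you don't control.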
A trick to test it faster: in the pipeline's environment variables, set "ml_model.epochs" to 5. The run will finish quite fast and you'll be able to see whether there are any errors at all. If there are none, let it run for the full 150 epochs so it can learn the most from the dataset you have.