I’m experiencing issues when trying to retrain the generic Document Understanding out-of-the-box package in AI Fabric.
Here’s what I tried:
I used the data labeling module to manually label all the data for each of my 10 training documents.
Afterwards, I created a full pipeline run using the generic, retrainable DU package (version 4.0).
I selected the folder created by Data Manager/Data Labeling as the input folder.
I created a separate folder containing PDF documents for evaluation and selected it as the evaluation dataset.
After activating this pipeline, it ran for a couple of hours. In the logs, I can see that it successfully reaches 150 epochs during training. However, an error occurs during the evaluation step:
ValueError: max_df corresponds to < documents than min_df
I've attached the full .log file to this post. What could be the cause of this error, and how can I fix it?
Looks like the classification model is failing (the currency field, I assume). My suggestion would be to add some additional documents where currency can be labelled (and label it on every page you have anyway). To see a decent improvement, I would try with at least 25-50 documents. The model is a deep learning architecture, and those are hungry for data: the more data it has, the better it can find patterns and make connections.
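For what it's worth, that exact error message comes from scikit-learn's text vectorizer, so my assumption is that the classifier builds a TF-IDF style vocabulary somewhere in the evaluation step. The error is raised when the document-frequency limits can't be satisfied because there are too few documents, which fits the diagnosis above. A minimal sketch reproducing it with made-up min_df/max_df values on a tiny corpus:

```python
# Hypothetical illustration: the same ValueError scikit-learn raises when the
# corpus is too small for the vectorizer's document-frequency limits.
from sklearn.feature_extraction.text import TfidfVectorizer

tiny_corpus = [
    "invoice total 100 usd",
    "invoice total 200 eur",
    "receipt total 300 usd",
]

# Made-up settings: keep terms appearing in at least 2 documents (min_df=2)
# but in at most 50% of documents (max_df=0.5). With only 3 documents,
# 0.5 * 3 = 1.5 < 2, so fitting raises:
#   ValueError: max_df corresponds to < documents than min_df
vectorizer = TfidfVectorizer(min_df=2, max_df=0.5)
vectorizer.fit_transform(tiny_corpus)
```

The fix on your side is the same either way: give it more labelled documents (for both training and evaluation) rather than trying to tune vectorizer settings you don't control.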
A trick to test it faster: in the pipeline's environment variables, set "ml_model.epochs" to 5. The run will finish quite fast and you'll be able to see whether there are any errors at all. If there are none, let it run for the full 150 epochs so it can learn the most from the dataset you have.