The use case is as follows:
We are receiving a bunch of documents that have almost the same structure but come from different providers. The document type is something like Receipts or Bank Statements.
There are 2 fields that need to be extracted: Reference Number and Amount.
At first I defined one document type in the taxonomy with those 2 fields, but with this setup and the public endpoint ML models I didn’t get proper results, and I could not create different regex expressions to get the values from the different documents.
After that I created multiple document types in my taxonomy (one for each provider that we have), all with the same fields.
With this approach I can use Classification and use different Regex Expressions.
This is giving me better results overall, but it feels wrong to me, since the same could be done even without Document Understanding.
How should I approach this problem?
Should I create one document type in the taxonomy and use an ML model to try to extract the values, or should I continue with the approach that is at least giving me some results?
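For readers following along, the per-provider approach described in the question (classify first, then apply that provider's regex expressions) can be sketched roughly as below. This is plain Python for illustration only, not UiPath code; the provider names, document layouts, and patterns are all made up, and the provider label is assumed to come from a separate classification step.

```python
import re

# Hypothetical per-provider patterns. The provider names and the text
# layouts they match are invented for illustration.
PROVIDER_PATTERNS = {
    "provider_a": {
        "reference_number": re.compile(r"Reference No:\s*([A-Z0-9-]+)", re.I),
        "amount": re.compile(r"Total:\s*([\d.,]+)", re.I),
    },
    "provider_b": {
        "reference_number": re.compile(r"Transaction ID\s+(\d+)", re.I),
        "amount": re.compile(r"Amount due\s+[A-Z]{3}\s+([\d.,]+)", re.I),
    },
}


def extract_fields(provider, text):
    """Apply the regex set of the given (already classified) provider.

    Returns a dict with one entry per field; a field is None when its
    pattern does not match.
    """
    result = {}
    for field, pattern in PROVIDER_PATTERNS[provider].items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


# Example usage with two invented documents.
doc_a = "Reference No: AB-123\nTotal: 45.50"
doc_b = "Transaction ID 98765\nAmount due EUR 1,200.00"
fields_a = extract_fields("provider_a", doc_a)
fields_b = extract_fields("provider_b", doc_b)
```

The cost of this design is that every new provider needs a new classification label and a new pattern set.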
You can retrain the model on your own documents by labeling a few files in Data Labeling; that might give you better results. And if the documents are even a little different from each other, you can use classification in Data Labeling as well, so even with Document Understanding you should be able to classify them.
It is generally a good idea to try to create a single document type in your taxonomy that can be used to extract the common fields (e.g., Reference Number and Amount) from all of the different types of documents you are processing.
This will allow you to use a single set of extraction rules, which can be more efficient and easier to maintain than keeping a separate set of rules for each provider.
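As a rough illustration of what a single set of rules could look like, assuming the rules are regular expressions rather than an ML model, here is a minimal Python sketch. It keeps one list of candidate patterns per field and takes the first one that matches, so no classification step is needed. The patterns and sample layouts are hypothetical, and this is not UiPath code.

```python
import re

# Hypothetical candidate patterns per field, tried in order.
# One shared rule set covers every provider's layout.
FIELD_PATTERNS = {
    "reference_number": [
        re.compile(r"Reference No:\s*([A-Z0-9-]+)", re.I),
        re.compile(r"Transaction ID\s+(\d+)", re.I),
    ],
    "amount": [
        re.compile(r"Total:\s*([\d.,]+)", re.I),
        re.compile(r"Amount due\s+[A-Z]{3}\s+([\d.,]+)", re.I),
    ],
}


def extract_common_fields(text):
    """Return the first match for each field, or None if nothing matches."""
    result = {}
    for field, patterns in FIELD_PATTERNS.items():
        result[field] = None
        for pattern in patterns:
            match = pattern.search(text)
            if match:
                result[field] = match.group(1)
                break  # first matching pattern wins
    return result


# Example usage with two invented documents from different providers.
single_a = extract_common_fields("Reference No: AB-123\nTotal: 45.50")
single_b = extract_common_fields("Transaction ID 98765\nAmount due EUR 1,200.00")
```

With this layout, supporting a new provider means appending patterns to the lists, but the order matters: a loose pattern placed early can shadow a more specific one later in the list.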
Ultimately, the approach that works best will depend on the specific characteristics of your documents and the data you are trying to extract. It may be helpful to experiment with different approaches and see which one gives you the best results.