Thank you for taking the time to read this post. Unfortunately, there’s not enough information that properly explains how to use Classification fields in Data Manager.
As I understand, I need to define all the fields that are going to be used for classification purposes and for each of those fields I need to define a list of possible values. For example, all invoices have a field named “Currency” so I need to define a list with all the possible currencies I might find on the invoices I process.
And how does the model work?
Does it try to find any possible currency value on all the pages of the document being processed? Unlike Regular and Column fields, I can’t highlight the field I want to use as a classification field inside Data Manager, so I think the model would basically try to find the word USD anywhere in the document and, if it finds that word, it’s a step closer to classifying the document as an Invoice.
So for example, if I want to use the “Billing address” field as a classification field, all the optional values I need to fill in for the Classification field refer to different ways the “Billing address” field might appear in a document?
If I have to define all the possible values for each classification field, in which cases is an ML model approach better than the Intelligent Keyword Classifier?
And if my solution requires classifying 50 document types, do I need to deploy 50 different Machine Learning skills for classification and another 50 models for extraction? Or am I able to train a single model to classify 50 different document types?
The difference between classification fields in an ML model and the Intelligent Keyword Classifier is that classification fields within an ML model classify field-level values within a document, whereas the keyword classifier classifies document types.
You can also use the Machine Learning Classifier. You should consider using the Machine Learning Classifier if:
You need to classify single documents into different document types. No splitting is required.
The custom document types are very similar. A trained Machine Learning Classifier can differentiate more easily between two similar document types than the Intelligent Keyword Classifier.
You can find a simple description of classifying documents using intelligent keyword classification here:
Finally, can you elaborate on your 50 document types? Do they have the same fields to extract, or are they mutually exclusive documents with new and distinct fields for each document?
If you have 50 document types with the same fields to be extracted, you can train one machine learning model using Document Understanding. You can refer to the article on how to train high-performing models here:
Thank you for your quick answer. Here are all my comments related to your response:
I want to train and deploy a Machine Learning Classifier model capable of properly classifying 50 totally different document types, just a classifier, not an extraction model. So, can I train a single Machine Learning Classifier model to classify different document types that don’t share common fields? Or do I have to train 50 different Machine Learning Classifier models?
I understand that you’re able to classify a single document into different document types. If a single document contains an Invoice and a Form, will the classifier identify two different document types and automatically split the pages? How can you achieve that? By using a single Machine Learning Classifier model and Classification fields? How can I train a machine learning classifier?
Are classification fields used to classify a document or for extraction purposes? In a Data Labeling session, how do you use a classification field? Because I can’t highlight the field that represents the classification field.
Thank you for your reply. Please find my answers below.
You can train one model to identify 50 different document types. There are two ways to do this: you can either use the Intelligent Keyword Classifier to classify your documents, or you can use the machine learning model “Document Classifier” from ML Packages. Both options will help you classify different document types.
Document Classifier is a retrainable model for classifying any type of structured or semi-structured documents, building a model from scratch. This method lets you retrain the model if you later want to expand the document types you need to classify.
The Intelligent Keyword Classifier is a classifier that uses the word vector it learns from files of certain document types to perform document classification.
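To make the idea of "learning a word vector per document type" a bit more concrete, here is a minimal Python sketch. It is only a toy illustration and not UiPath's actual implementation: the functions `tokenize`, `train` and `classify`, the scoring rule, and the sample texts are all made up for this example. It also shows that a single classifier of this kind can cover several document types at once.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Build one word-frequency vector per document type.

    `samples` maps a document type to a list of example texts (hypothetical data).
    """
    vectors = {}
    for doc_type, texts in samples.items():
        counts = Counter()
        for text in texts:
            counts.update(tokenize(text))
        total = sum(counts.values()) or 1  # avoid dividing by zero on empty samples
        # Store relative frequencies so long samples do not dominate.
        vectors[doc_type] = {word: n / total for word, n in counts.items()}
    return vectors


def classify(vectors, text):
    """Return the document type whose word vector best matches `text`."""
    words = tokenize(text)
    scores = {
        doc_type: sum(vector.get(word, 0.0) for word in words)
        for doc_type, vector in vectors.items()
    }
    return max(scores, key=scores.get)


samples = {
    "Invoice": ["Invoice number 123 total amount due USD", "Invoice billing address total"],
    "Receipt": ["Receipt thank you for your purchase cash change", "Receipt store cashier change"],
}
vectors = train(samples)
print(classify(vectors, "Total amount due on this invoice"))
```

Adding a 3rd or a 50th document type here only means adding another key to `samples`; you would not need one classifier per type.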
Since you want to split the document types contained in a master document, do look into the Intelligent Keyword Classifier. It enables you to split documents and to train the classifier over time with the Intelligent Keyword Classifier Trainer. This will enable you to both split and classify documents in your automation.
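Conceptually, "split and classify" can be pictured as labeling each page and then grouping consecutive pages that share a label. The sketch below is a simplified illustration under that assumption, not how the Intelligent Keyword Classifier works internally; `classify_page`, `split_and_classify` and the keyword lists are hypothetical names and data.

```python
from itertools import groupby


def classify_page(page_text, keywords):
    """Return the first document type whose keyword appears on the page, or None."""
    lowered = page_text.lower()
    for doc_type, words in keywords.items():
        if any(word in lowered for word in words):
            return doc_type
    return None


def split_and_classify(pages, keywords):
    """Group consecutive pages sharing a type into (doc_type, first_page, last_page)."""
    labels = []
    for text in pages:
        label = classify_page(text, keywords)
        # A page with no keyword hit is treated as a continuation of the previous page.
        if label is None and labels:
            label = labels[-1]
        labels.append(label)
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, index, index + length - 1))
        index += length
    return segments


pages = ["INVOICE No. 42", "Line items continued", "Application Form: applicant name"]
keywords = {"Invoice": ["invoice"], "Form": ["application form"]}
print(split_and_classify(pages, keywords))
```

Note that this naive grouping would merge two invoices that directly follow each other into one segment, so a real splitter needs more signal than just the page label.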
Classification fields in Data Manager are data points which refer to a document as a whole. For instance, the Expense Type of a receipt (Food, Hotel, Airline, Transportation) or the Currency of an invoice (USD, EUR, JPY) would be examples of Classification fields.
Once you define your key-value pair schema in classification fields, the fields (e.g. Currency, Expense Type) will automatically show up as classification fields.
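One way to picture this is as a small data model: a classification field is a document-level label chosen from a fixed list, while regular fields hold values highlighted on the page. The Python sketch below is only an illustration of that distinction. The classes `ClassificationField` and `LabeledDocument` and the sample values are hypothetical and do not reflect Data Manager's real schema format.

```python
from dataclasses import dataclass, field


@dataclass
class ClassificationField:
    """A document-level field restricted to a fixed list of values."""
    name: str
    allowed_values: list

    def validate(self, value):
        """Return `value` unchanged, or raise if it is not in the allowed list."""
        if value not in self.allowed_values:
            raise ValueError(f"{value!r} is not a valid {self.name}")
        return value


@dataclass
class LabeledDocument:
    """One labeled sample: highlighted regular fields plus document-level labels."""
    regular_fields: dict = field(default_factory=dict)
    classification_fields: dict = field(default_factory=dict)


currency = ClassificationField("Currency", ["USD", "EUR", "JPY"])
expense_type = ClassificationField("Expense Type", ["Food", "Hotel", "Airline", "Transportation"])

# Illustrative sample: the total is a made-up value, not real data.
doc = LabeledDocument(regular_fields={"Total": "118.40"})
doc.classification_fields[currency.name] = currency.validate("EUR")
doc.classification_fields[expense_type.name] = expense_type.validate("Hotel")
print(doc.classification_fields)
```

This is why there is nothing to highlight for a classification field: its value is picked from the list for the whole document rather than marked at a position on a page.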
And for your use case, intelligent keyword classifier might help you. It will split and classify documents at the same time and can handle different variations of document types.
You can successfully deploy this solution with the Intelligent Keyword Classifier. Make sure you find unique keywords and positions to train the model. You might need to train with at least 20 samples per document type to start with and expand as you go. Please note that the number of documents you need for training really depends on the type of documents and their variability, so it can increase depending on your training and results.
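If you keep a list of which document type each training file belongs to, a quick check like the following can flag the types that are still below that starting point. This is just a small helper sketch: `under_sampled`, `MIN_SAMPLES` and the example counts are assumptions made up for illustration, and 20 is only the rough starting figure mentioned above.

```python
from collections import Counter

MIN_SAMPLES = 20  # rough starting point; may need to grow with document variability


def under_sampled(labels, minimum=MIN_SAMPLES):
    """Return {doc_type: count} for every type with fewer than `minimum` samples."""
    counts = Counter(labels)
    return {doc_type: n for doc_type, n in counts.items() if n < minimum}


# Hypothetical training set: one label per sample file.
labels = ["Invoice"] * 25 + ["Form"] * 12 + ["Receipt"] * 20
print(under_sampled(labels))
```

A document type with no samples at all will not appear in the result, so compare the output against your full list of expected types as well.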
Is there any tutorial on how to train and deploy an ML classifier model? All the official documentation focuses on extraction and not enough on classification: there are just two or three lines that “explain” the classifier model and one paragraph that “explains” classification fields. Even the invoice machine learning model schema only uses a single field as a classification field, and it’s very confusing to understand its purpose in a Labeling session for extraction purposes.
I agree with you that there’s little information on training and deploying an OOTB document classifier. One of the reasons is that, unlike the Intelligent Keyword Classifier, the document classifier does not split documents. This leads to more automations choosing the Intelligent Keyword Classifier over the ML classifier. UiPath might introduce document splitting and a more advanced version of the ML classifier in their next releases.