How to integrate the output of the Classifier Trainer with the Document Classifier?

Hello Community,

I see that my Classifier Trainer has generated a bunch of keywords and I can see them in the activity UI as shown below. I can also see that the learning.json file has been created with these new keywords.

How do I merge these additional keywords back to the main keywords json file?

I can do this manually for now, but that will not scale in a real application.

Is there a way I can do this via any OCR activity that would allow me to merge these files back to the Document Classifier to create a feedback loop?

image

thanks!

I tried this and my process still works without any issues. However, I don't yet have enough document samples to be sure that the solution really works.

Can anyone in the community please validate whether this is a good solution?

1. In the Document Classification scope, I added a second KB Classifier and configured it to use the learning.json file that the Classifier Trainer scope produces, as shown below:

image

2 & 3. Then I enabled both classifiers in the “Configure Classifiers” dialog, so that classification uses both JSON files.

4. And downstream, after the Validation station, in the Classifier Trainer, the path to the learning.json file is the same as the one shown above:

image

I'm not sure I quite understand what you mean by merging the keywords with another learning file. If you use Train Classifiers Scope with the same learning file, whatever is learned at that step gets merged with what was previously in the file.
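
To illustrate what such a merge amounts to, here is a minimal Python sketch. It is not UiPath's implementation, and the schema is an assumption: each file is taken to be a JSON array of entries with a hypothetical `DocumentTypeId` string and a `Keywords` list of strings. The real learning file may be structured differently, so inspect yours before adapting this.

```python
import json
from pathlib import Path


def merge_keyword_entries(base, learned):
    """Merge two lists of keyword entries, de-duplicating keywords per document type.

    ASSUMPTION: each entry looks like
    {"DocumentTypeId": "...", "Keywords": ["...", ...]}.
    This schema is hypothetical and may not match the actual file format.
    """
    merged = {}
    for entry in list(base) + list(learned):
        doc_type = entry["DocumentTypeId"]
        keywords = merged.setdefault(doc_type, [])
        for keyword in entry.get("Keywords", []):
            if keyword not in keywords:  # keep first-seen order, skip duplicates
                keywords.append(keyword)
    return [{"DocumentTypeId": d, "Keywords": k} for d, k in merged.items()]


def merge_keyword_files(base_path, learned_path, out_path):
    """Read both JSON files, merge them, and write the result to out_path."""
    base = json.loads(Path(base_path).read_text(encoding="utf-8"))
    learned = json.loads(Path(learned_path).read_text(encoding="utf-8"))
    merged = merge_keyword_entries(base, learned)
    Path(out_path).write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged
```

Writing to a separate output path, rather than overwriting either input, keeps the manually defined keywords recoverable if the merge turns out to be wrong.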

I have a keyword.json file with the manually defined keywords that I'm using in my Document Classifier. I tried to start with an empty JSON file, and despite numerous attempts I could not get it to work. (I posted a separate thread about this, linked below.)

In the Classifier Trainer, I set up an empty json file containing just the [ ]. This file is named learning.json when configuring the Keyword based Classifier Trainer.
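
For reference, an empty starting file like this could be created with a few lines of Python. This is only a sketch: the helper name `ensure_empty_learning_file` is made up for illustration, and the only things taken from my setup are the file name learning.json and the `[]` content.

```python
import json
from pathlib import Path


def ensure_empty_learning_file(path="learning.json"):
    """Create the file with an empty JSON array unless it already exists."""
    target = Path(path)
    if not target.exists():
        target.write_text("[]", encoding="utf-8")
    # Parse the content so that a malformed file fails loudly here
    # instead of later inside the workflow.
    return json.loads(target.read_text(encoding="utf-8"))
```

The existence check means a file that already holds learned keywords is never overwritten.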

After I processed at least three documents, I could see additional keywords in the previously empty learning.json file, so I decided to use it in a second KB Classifier at the beginning of the process, as shown in my screenshots.

In a related issue I'm facing now, I picked up a bunch of the UiPath certificates that we receive on completing training. They're almost identical to one another. I have tried at least four times to recreate the templates and define custom area mappings for these certificates. I've switched OCR engines and played with the scale as well. Despite all that, the Form Extractor pulls only the name from these certificates. I think that's because the name is in a large font and the rest of the details are in a regular font.

After defining tokens in the Validation station several times, I did not find any changes in the learning.json file.
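
To rule out that I was simply overlooking a small change, a check along these lines could compare the file before and after a validation run. It is a sketch in Python using only the standard library, and `file_fingerprint` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path):
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Usage idea: take a fingerprint before and after running the workflow.
# before = file_fingerprint("learning.json")
# ... run the Validation Station and the trainer ...
# after = file_fingerprint("learning.json")
# print("changed" if before != after else "unchanged")
```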

Therefore:
Should we or should we not start with an empty json when setting up Document Classifier?
If yes, how many files do I need to pass to it before it starts generating keywords?
If no, then is the Robot learning only during classification training?
What is the Robot learning each time when I define tokens in the Present validation station, especially when the Form Extractor fails?

My other thread related to KB Classifier:

thanks

Ah I understand now. Let’s see:

  1. Should we or should we not start with an empty json when setting up Document Classifier?
    Yes, if you have no previous information about the document you need to start from an empty file.
  2. If yes, how many files do I need to pass to it before it starts generating keywords?
    At least one file is sufficient: digitize the document, open a validation station, select the correct document type and select the keywords on the document. Click save. Pass the validation station result to a Train Classifiers Scope with Keyword Based Classifier Trainer.
  3. If no, then is the Robot learning only during classification training?
    Yes, the Classifier is only trained using the Manage Keywords wizard or through Train Classifiers Scope.
  4. What is the Robot learning each time when I define tokens in the Present validation station, especially when the Form Extractor fails?
    It will learn the selected keywords for the selected Document Type.

Thanks for taking the time to respond to my post.
I will set up another project specifically to test the functionality described in your response.

Andy

Hi @tudor.serban, I’m currently trying to test this item out:

I see what you mean. I digitized the document and then I passed an empty dataExtractionResults object to the Validation station:

image

And the Validation station came up with this:

I have to select the Document type from the drop-down on the left and proceed with the mappings between the document and the taxonomy. This makes sense.

Update: 

I have attached an XAML file that follows a series of steps to train the Classifier.
I hope I've understood this correctly and others find it helpful.

Process_Train_Classifier.xaml (14.8 KB)
thanks,
Andy