I’m having a problem with the ML Classifier. I’ve already trained and set up ML extraction and have been very happy with the results. However, I want to accurately classify the pages first so I’m not spending AI units on pages that don’t need ML extraction.
All of our documents start with a cover page. The rest of the pages can each be classified as a type of lab report. So, for example, I would expect the cover page, which never changes, to always be classified as such, so that I can send only certain lab report pages for ML extraction.
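To make the goal concrete, here is a minimal sketch of the routing I have in mind, written in plain Python rather than as a UiPath workflow. The function name `pages_for_extraction` is hypothetical, and the labels are the two class names that appear later in this post; it assumes a classifier has already produced one label per page.

```python
def pages_for_extraction(page_labels: list[str], extract_labels: set[str]) -> list[int]:
    """Return the 0-based indices of pages whose label should go to ML extraction.

    page_labels is assumed to hold one classifier label per page, in page order.
    """
    return [i for i, label in enumerate(page_labels) if label in extract_labels]


if __name__ == "__main__":
    # Hypothetical per-page result: a cover page followed by two lab report pages.
    labels = ["dmip-cover", "dmip-qwest", "dmip-qwest"]
    print(pages_for_extraction(labels, {"dmip-qwest"}))  # [1, 2]
```

Only the pages returned here would be sent on, so the cover page would never consume AI units.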
I started out with the Intelligent Keyword Classifier and it quickly started doing what I needed. It would always classify the first page as the cover page. However, it was missing classifications on the other pages because they are similar to one another, with only some differences to set them apart. This is when I decided to give the ML Classifier a try.
I set up the AI Center dataset, created the DU Classifier ML package, and trained it on some documents I had manually classified. After deploying the skill, I plugged in the ML version of the classifier. As I started looping through the files (hoping to manually classify more and improve the training), I noticed it was trying to classify the whole document as one type or another instead of page-by-page. This was happening 100% of the time. I have just started, so I have only trained it on 30 documents, but the Intelligent Keyword Classifier was trained on the same number and, at a minimum, it can consistently label the first page as the cover page.
What it should do (this is what Intelligent Keyword Classifier is doing):
What the ML Classifier does (it should have classified the first page as “dmip-cover” and the rest as “dmip-qwest”):
Any suggestions how I can get the ML Classifier to work as intended would be much appreciated!
It doesn’t seem to be related to the amount of training. I classified 100 docs this time and retrained the model. 100% of the time the ML Classifier will try to classify the entire document as a whole instead of page-by-page.
@Lahiru.Fernando - your videos on this topic have been extremely helpful so far. Have you seen anything like this and do you have any ideas?
This is a very interesting problem you have here. If you are certain your cover page always comes first, I would suggest splitting the PDF document into two: the cover page will be the first document and the other pages will be the second document. That is, if you haven’t been able to get the classifier to work as you would like.
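As a rough sketch of that split, assuming the cover page is always page 1: the function below only computes which page indices belong to each of the two documents. The name `split_cover_and_body` is hypothetical, and actually writing the two PDFs would need a PDF library or the equivalent workflow activity, which is not shown here.

```python
def split_cover_and_body(page_count: int) -> tuple[range, range]:
    """Return 0-based page index ranges for (cover, remaining pages).

    Assumes the first page is always the cover page.
    """
    if page_count < 1:
        raise ValueError("document must have at least one page")
    cover = range(0, 1)           # page 1 only
    body = range(1, page_count)   # pages 2..N, empty for a one-page document
    return cover, body


if __name__ == "__main__":
    cover, body = split_cover_and_body(5)
    print(list(cover), list(body))  # [0] [1, 2, 3, 4]
```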
Thanks for the suggestion. I do already have a solution for the cover page classification by using the Intelligent Keyword Classifier. That will reliably classify the first page as the cover page. It does a pretty good job classifying the other pages correctly as well, but it has some flaws. The ML Classifier should be a smarter version of the Intelligent Keyword Classifier, but in my case it is actually worse, since it will only try to classify the whole document as if it were one page.
Although I never did get an official response, I think I found the answer. At this time, at least, it doesn’t look like the ML Classifier supports classifying page-by-page. It makes sense that it would but perhaps running each page against the ML model would be too much for it to handle.
According to the ML Classifier documentation:
" * Your need to classify the single documents into different document types. No splitting is required."
To me, this was a bit confusing and could perhaps be reworded. I originally took it to mean that the classifier would assign types to a multi-page document on a page-by-page basis, and that for this reason we didn’t need to split the document up ourselves. I now believe it means the ML Classifier is not a good fit for use cases where page splitting is required.
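If the classifier only ever returns one label per document, one possible workaround is to feed it single-page documents and then merge consecutive pages that share a label. The following is only a sketch under that assumption, in plain Python: `classify_document` stands in for whatever call returns a label for one document, `toy_classifier` is a made-up substitute for demonstration, and nothing here is a UiPath API. Note that this makes one classifier call per page, so the cost in AI units would need to be checked.

```python
from itertools import groupby
from typing import Callable, Sequence


def classify_page_by_page(
    pages: Sequence[str],
    classify_document: Callable[[str], str],
) -> list[tuple[str, int, int]]:
    """Classify each page on its own, then merge runs of equal labels.

    Returns (label, first_page, last_page) tuples with 0-based, inclusive indices.
    """
    labels = [classify_document(page) for page in pages]
    ranges = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        ranges.append((label, start, start + length - 1))
        start += length
    return ranges


def toy_classifier(page_text: str) -> str:
    """Hypothetical stand-in for a whole-document classifier."""
    return "dmip-cover" if "cover" in page_text.lower() else "dmip-qwest"


if __name__ == "__main__":
    pages = ["Cover sheet", "Lab result A", "Lab result B"]
    print(classify_page_by_page(pages, toy_classifier))
```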