How does the Keyword-based classifier compute its confidence?
The Keyword-based classifier is case-insensitive, but it is sensitive to word order. If it is configured with, or has learned, "abc def", it only matches documents that contain this entire string.
If it is configured with, or has learned, "abc def", "xyz", and "blablab", it reports a high confidence only if all three strings are found in the document.
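The matching rules above can be illustrated with a minimal Python sketch. This is not the classifier's actual implementation: the function names `contains_phrase` and `set_matches` are hypothetical, and the sketch assumes plain substring matching. It also treats a keyword set as an all-or-nothing match, whereas the real classifier may still report a lower confidence when only some strings are found.

```python
def contains_phrase(document: str, phrase: str) -> bool:
    """Case-insensitive check that the whole phrase occurs as one contiguous string.

    Hypothetical helper: word order matters because the phrase is matched as-is.
    """
    return phrase.lower() in document.lower()


def set_matches(document: str, keyword_set: list[str]) -> bool:
    """Assumed rule: a keyword set counts as matched only if every string is found."""
    return all(contains_phrase(document, phrase) for phrase in keyword_set)
```

With these definitions, "ABC DEF" in a document matches the keyword "abc def", while "def abc" does not, and a set of three strings is only matched when all three appear.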
The closer the strings are to the start of the document, the higher the confidence. The more often the strings have been confirmed as a correct classification, the higher the confidence as well.
The Keyword-based classifier works best if it knows the “titles” of documents. These usually appear at or near the top of the document and vary little within the same document type.
The order of the keyword sets also matters. Order here means the top-to-bottom order in which the keyword sets have been defined: each set, starting from the top, is evaluated against the document, and the sooner a set matches, the higher the confidence.
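To make the effect of position and set order concrete, here is a toy Python sketch. The scoring formula is invented purely for illustration and is not the classifier's real formula, which is not documented here. The function name `sketch_confidence`, the `position_factor`, and the `order_factor` are all assumptions. The sketch also leaves out the contribution of how often a keyword set has been confirmed.

```python
def sketch_confidence(document: str, keyword_sets: list[list[str]]) -> float:
    """Toy score in [0, 1]; the formula is an assumption, not the real one."""
    text = document.lower()
    # Sets are tried in the top-to-bottom order in which they were defined.
    for rank, keyword_set in enumerate(keyword_sets):
        if not keyword_set:
            continue  # an empty set cannot match anything
        positions = [text.find(phrase.lower()) for phrase in keyword_set]
        if any(pos < 0 for pos in positions):
            continue  # at least one string is missing, try the next set
        # Assumed penalty: the later the last keyword appears, the lower the score.
        position_factor = 1.0 - max(positions) / max(len(text), 1)
        # Assumed penalty: sets defined further down score lower.
        order_factor = 1.0 / (rank + 1)
        return position_factor * order_factor
    return 0.0  # no keyword set matched
```

Under these assumptions, a document that starts with a keyword from the first set scores higher than one where the same keyword appears near the end, and higher than one that only matches a set defined further down the list.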