Train different documents separately, or group together? Manually adjust keywords?

I’m doing a Document Understanding project and we have these 8 documents:

  • General Ledger Debit
  • Certificate of Deposit
  • Traditional IRA Distribution
  • Traditional IRA Contribution
  • Roth IRA Distribution
  • Roth IRA Contribution
  • Education Savings Contribution
  • Education Savings Distribution

The end goal is to upload and categorize them into a document management system, where the categories will be:

  • CDAPPS
    – Certificate of Deposit
  • CD HO TKTS
    – General Ledger Debit
  • IRA Dist Tkt
    – Traditional IRA Distribution
    – Roth IRA Distribution
    – Education Savings Distribution
  • IRA Contr Tkt
    – Traditional IRA Contribution
    – Roth IRA Contribution
    – Education Savings Contribution

So is it better to have each separate type of document in the taxonomy and train them separately, or can I just have the four categories in the taxonomy and train the distributions together and the contributions together?

At the moment we are only able to use the keyword-based classifier, but I am using the Intelligent Keyword Classifier to automatically generate the list of keywords for each document. My concern is that the contribution and distribution documents share some keywords. For example:

Traditional IRA Contribution keywords:

“ira”, “contribution”, “traditional”, “horizon”, “term”, “type”, “receipt”, “account”, “rollover”, “plan”, “social”, “check”, “direct”, “eligible”, “code”, “days”, “investment”, “months”, “reason”, “repayment”, “retirement”, “postponed”, “zone”, “tax”, “disaster”, “rate”, “regular”, “transfer”, “owner”, “birth”, “qualified”, “distribution”, “security”, “pension”, “employee”, “apy”, “maturity”, “signature”, “dale”, “ft”, “ion”, “cda”, “election”, “transaction”, “da”, “accepted”, “treat”, “subject”, “agree”, “rules”

Traditional IRA Distribution keywords:

“ira”, “distribution”, “form”, “horizon”, “debit”, “traditional”, “receipt”, “premature”, “additional”, “note”, “exception”, “social”, “security”, “tax”, “withheld”, “required”, “reason”, “applies”, “disability”, “income”, “signature”, “requested”, “death”, “transaction”, “charged”, “administration”, “fee”, “subtotal”, “federal”, “prohibited”, “reverse”, “penalties”, “owner”, “local”, “documentation”, “account”, “paid”, “beneficiary”, “separate”, “normal”, “copies”, “completing”, “election”, “wi”, “payment”, “complete”, “identification”, “address”, “vifirst”, “hl”

Is it necessary to try to remove common keywords from both? Will that make it more accurate?
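
For reference, the overlap between the two lists above can be checked with a quick script. This is only a sketch in Python using plain set operations. The variable names are my own, and it says nothing about how the classifier actually weighs keywords, so it shows which terms are shared but not whether removing them improves accuracy:

```python
# Keyword lists copied from the two examples above.
contribution = {
    "ira", "contribution", "traditional", "horizon", "term", "type", "receipt",
    "account", "rollover", "plan", "social", "check", "direct", "eligible",
    "code", "days", "investment", "months", "reason", "repayment", "retirement",
    "postponed", "zone", "tax", "disaster", "rate", "regular", "transfer",
    "owner", "birth", "qualified", "distribution", "security", "pension",
    "employee", "apy", "maturity", "signature", "dale", "ft", "ion", "cda",
    "election", "transaction", "da", "accepted", "treat", "subject", "agree",
    "rules",
}
distribution = {
    "ira", "distribution", "form", "horizon", "debit", "traditional", "receipt",
    "premature", "additional", "note", "exception", "social", "security", "tax",
    "withheld", "required", "reason", "applies", "disability", "income",
    "signature", "requested", "death", "transaction", "charged",
    "administration", "fee", "subtotal", "federal", "prohibited", "reverse",
    "penalties", "owner", "local", "documentation", "account", "paid",
    "beneficiary", "separate", "normal", "copies", "completing", "election",
    "wi", "payment", "complete", "identification", "address", "vifirst", "hl",
}

# Keywords that appear in both lists, and those unique to each list.
shared = contribution & distribution
only_contribution = contribution - distribution
only_distribution = distribution - contribution

print("shared:", sorted(shared))
print("only in contribution:", sorted(only_contribution))
print("only in distribution:", sorted(only_distribution))
```

Note that "distribution" itself shows up in the contribution list, so it lands in the shared set.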

I believe you should train on all the documents you have using the Intelligent Keyword Classifier. For instance, if you train on the category “IRA Contr Tkt,” you need to include all related documents, such as Traditional IRA Contribution, Roth IRA Contribution, and Education Savings Contribution. However, it’s important to keep in mind that you won’t have the option to remove any words from the training.

I’m just copying the keywords the Intelligent Keyword Classifier gives me into the Keyword Based Classifier, so yes, I can edit the list.

You can copy them, but you cannot edit them, mate. Please check again.

Of course you can edit the keywords that are in the Keyword Based Classifier. You just click Manage Learning.

You can also directly edit the learning file, if you want.

Ohh, sorry, you are using the Keyword Based Classifier, so you can edit keywords from the manager. If you are using the Intelligent Keyword Classifier, though, you can’t edit keywords from the manager.
Also, the Keyword Based Classifier doesn’t have the splitting feature, so you need to double-check whether you need that feature or not, mate.