I’m working on a Document Understanding project that covers these 8 document types:
- General Ledger Debit
- Certificate of Deposit
- Traditional IRA Distribution
- Traditional IRA Contribution
- Roth IRA Distribution
- Roth IRA Contribution
- Education Savings Contribution
- Education Savings Distribution
The end goal is to upload and categorize them into a document management system, where the categories will be:
- CDAPPS
  – Certificate of Deposit
- CD HO TKTS
  – General Ledger Debit
- IRA Dist Tkt
  – Traditional IRA Distribution
  – Roth IRA Distribution
  – Education Savings Distribution
- IRA Contr Tkt
  – Traditional IRA Contribution
  – Roth IRA Contribution
  – Education Savings Contribution
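To make the mapping explicit, here it is written out as a Python lookup table. This is only a sketch for illustration: the names `CATEGORY_BY_DOCUMENT_TYPE` and `category_for` are my own and not anything from UiPath.

```python
# Hypothetical lookup table: document type -> document management system category.
# The pairs come straight from the two lists above; the names are made up.
CATEGORY_BY_DOCUMENT_TYPE = {
    "Certificate of Deposit": "CDAPPS",
    "General Ledger Debit": "CD HO TKTS",
    "Traditional IRA Distribution": "IRA Dist Tkt",
    "Roth IRA Distribution": "IRA Dist Tkt",
    "Education Savings Distribution": "IRA Dist Tkt",
    "Traditional IRA Contribution": "IRA Contr Tkt",
    "Roth IRA Contribution": "IRA Contr Tkt",
    "Education Savings Contribution": "IRA Contr Tkt",
}


def category_for(document_type: str) -> str:
    """Return the upload category for a classified document type."""
    return CATEGORY_BY_DOCUMENT_TYPE[document_type]


print(category_for("Roth IRA Distribution"))
```

With a table like this, classifying at the level of the 8 document types would still let the upload step resolve to one of the four categories.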
So is it better to give each document type its own entry in the taxonomy and train them separately, or can I put just the four categories in the taxonomy and train the distributions together and the contributions together?
At the moment we are only able to use the keyword-based classifier, but I am using the Intelligent Keyword Classifier to automatically generate the list of keywords for each document type. My concern is that some keywords are shared between the contribution and distribution documents. For example:
Traditional IRA Contribution keywords:
“ira”, “contribution”, “traditional”, “horizon”, “term”, “type”, “receipt”, “account”, “rollover”, “plan”, “social”, “check”, “direct”, “eligible”, “code”, “days”, “investment”, “months”, “reason”, “repayment”, “retirement”, “postponed”, “zone”, “tax”, “disaster”, “rate”, “regular”, “transfer”, “owner”, “birth”, “qualified”, “distribution”, “security”, “pension”, “employee”, “apy”, “maturity”, “signature”, “dale”, “ft”, “ion”, “cda”, “election”, “transaction”, “da”, “accepted”, “treat”, “subject”, “agree”, “rules”
Traditional IRA Distribution keywords:
“ira”, “distribution”, “form”, “horizon”, “debit”, “traditional”, “receipt”, “premature”, “additional”, “note”, “exception”, “social”, “security”, “tax”, “withheld”, “required”, “reason”, “applies”, “disability”, “income”, “signature”, “requested”, “death”, “transaction”, “charged”, “administration”, “fee”, “subtotal”, “federal”, “prohibited”, “reverse”, “penalties”, “owner”, “local”, “documentation”, “account”, “paid”, “beneficiary”, “separate”, “normal”, “copies”, “completing”, “election”, “wi”, “payment”, “complete”, “identification”, “address”, “vifirst”, “hl”
Do I need to remove the keywords that the two lists have in common? Would that make the classification more accurate?
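To see how much the two lists actually overlap, here is a quick sketch in plain Python, run outside UiPath. The function name `compare_keywords` is just something I made up for illustration. The keyword lists are copied exactly from the two lists above.

```python
# Keyword lists copied verbatim from the Intelligent Keyword Classifier output above.
contribution = [
    "ira", "contribution", "traditional", "horizon", "term", "type", "receipt",
    "account", "rollover", "plan", "social", "check", "direct", "eligible",
    "code", "days", "investment", "months", "reason", "repayment", "retirement",
    "postponed", "zone", "tax", "disaster", "rate", "regular", "transfer",
    "owner", "birth", "qualified", "distribution", "security", "pension",
    "employee", "apy", "maturity", "signature", "dale", "ft", "ion", "cda",
    "election", "transaction", "da", "accepted", "treat", "subject", "agree",
    "rules",
]
distribution = [
    "ira", "distribution", "form", "horizon", "debit", "traditional", "receipt",
    "premature", "additional", "note", "exception", "social", "security", "tax",
    "withheld", "required", "reason", "applies", "disability", "income",
    "signature", "requested", "death", "transaction", "charged",
    "administration", "fee", "subtotal", "federal", "prohibited", "reverse",
    "penalties", "owner", "local", "documentation", "account", "paid",
    "beneficiary", "separate", "normal", "copies", "completing", "election",
    "wi", "payment", "complete", "identification", "address", "vifirst", "hl",
]


def compare_keywords(a, b):
    """Return (shared, only_in_a, only_in_b) as sorted lists."""
    set_a, set_b = set(a), set(b)
    return sorted(set_a & set_b), sorted(set_a - set_b), sorted(set_b - set_a)


shared, only_contribution, only_distribution = compare_keywords(
    contribution, distribution
)

# Jaccard similarity: shared keywords divided by all distinct keywords.
jaccard = len(shared) / len(set(contribution) | set(distribution))

print("Shared keywords:", shared)
print("Only in Contribution:", only_contribution)
print("Only in Distribution:", only_distribution)
print(f"Jaccard similarity: {jaccard:.2f}")
```

The shared set includes words like "ira" and "traditional", and even "distribution", which shows up in the Contribution list as well.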
