I have developed an ML Skill for extracting the bank name from bank statements. What I notice is that the extraction can return a variety of names for the same bank, depending on the quality of the document and the type of bank statement received. Getting the bank name right is important in my process. Here are three example scenarios:
Extracted Bank Name: DES Bank | Real Bank Name: DBS Bank
Extracted Bank Name: Lind1an 0verseas Bank | Real Bank Name: Indian Overseas Bank or IOB
Extracted Bank Name: United Overseas Bank or UOB | Real Bank Name: Same as Extracted
I know that this type of problem is related to ML and, more specifically, falls under Natural Language Processing (NLP). What algorithm or solution can map the extracted bank name to the specific bank, no matter how bad the extraction is?
Yes, I have tried it, but in a different scenario.
Basically, it tries to find a match above some defined similarity percentage, such as 50% or 60%, instead of requiring a 100% match.
So in your case, I believe DES will have somewhere around a 70-80% match with DBS.
That’s why fuzzy string matching may give you a good result, perhaps with a few exceptions here and there, which can be tweaked once it is up and running to yield the best success rate.
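To make the threshold idea concrete, here is a minimal sketch using Python's standard `difflib` module. The helper names `similarity` and `is_match` and the 0.8 threshold are assumptions for illustration only; the exact score depends on which similarity measure you pick and on whether you compare the full name or just the abbreviation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(extracted: str, expected: str, threshold: float = 0.8) -> bool:
    """True when the two strings are at least `threshold` similar.

    The 0.8 default is an assumed starting point, not a recommended value.
    """
    return similarity(extracted, expected) >= threshold


if __name__ == "__main__":
    print(similarity("DES Bank", "DBS Bank"))
    print(is_match("DES Bank", "DBS Bank"))
    print(is_match("DES Bank", "Indian Overseas Bank"))
```

The threshold is the part you would tune once real extractions start coming in: too low and different banks collapse into one, too high and noisy extractions are rejected.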
This problem falls within the fields of computer science, linguistics and statistics, and an ML Skill or NLP is overkill to start off with.
Specifically, you are looking to calculate an edit distance between two strings. Based on this distance, you can infer whether the given string matches your required string.
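As a rough illustration of edit distance, the sketch below implements the Levenshtein distance (one common edit distance; other variants exist) in plain Python. The function names and the normalisation into a 0-1 score are my own assumptions, not something taken from the linked thread or the Azure function.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Scale the distance into a 0.0-1.0 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # One substitution (E -> B) separates the two strings.
    print(levenshtein("DES Bank", "DBS Bank"))  # 1
    print(normalized_similarity("DES Bank", "DBS Bank"))
```

A small distance relative to the string length suggests the extracted value is a noisy copy of the reference name; what counts as "small" is a threshold you would have to choose from your own data.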
You can read this thread for some background and two more approaches: Compare Names
I have updated my Azure function; you can use it for development, but don’t use it in a production environment.
Either way, fuzzy matching (as @Nithinkrishna suggested), calculating edit distances, or the approach suggested by @kumar.varun2 is not sufficient on its own in your case.
You will also need a mapping to the correct value string (the correct bank name). So, let’s say your match is higher than 80%: you will then need a mapper that returns what DES Bank means in your required names list / array / dictionary.
This way, if you have additional banks in the future, you just add them to your banks list / array / dictionary.
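A minimal sketch of such a mapper, assuming a hand-maintained dictionary and Python's standard `difflib.get_close_matches`. The `KNOWN_BANKS` entries, the `resolve_bank_name` name and the 0.8 cutoff are hypothetical and would need to be replaced with your own list and a threshold tuned on your data.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical reference data: known spelling (lowercase) -> canonical name.
# Adding a new bank later only means adding entries here.
KNOWN_BANKS = {
    "dbs bank": "DBS Bank",
    "dbs": "DBS Bank",
    "indian overseas bank": "Indian Overseas Bank",
    "iob": "Indian Overseas Bank",
    "united overseas bank": "United Overseas Bank",
    "uob": "United Overseas Bank",
}


def resolve_bank_name(extracted: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a noisy extracted name to its canonical bank name.

    Returns None when nothing in KNOWN_BANKS reaches the cutoff, so the
    caller can route the document to manual review.
    """
    key = extracted.strip().lower()
    candidates = get_close_matches(key, list(KNOWN_BANKS), n=1, cutoff=cutoff)
    return KNOWN_BANKS[candidates[0]] if candidates else None


if __name__ == "__main__":
    for name in ("DES Bank", "Lind1an 0verseas Bank", "UOB", "XYZ"):
        print(name, "->", resolve_bank_name(name))
```

Short abbreviations such as IOB and UOB differ by a single character, so fuzzy matching on them is risky; keeping the cutoff high for short strings, or requiring an exact match on abbreviations, may be the safer choice.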
Variability in your data will make this string comparison quite a challenge, so do expect to have many exceptions.
In contrast, if you use an ML Skill, this variability is good, but then you need a sufficient volume of cases, and the velocity (how soon you get new data and retrain the model) will dictate the performance of such matches using ML / NLP.