Predict the extracted text

aqiff · October 10, 2021, 4:46am

I have developed an ML Skill for extracting the bank name from bank statements. I notice that the extraction can return a variety of names for the same bank, depending on the quality of the document and the type of bank statement received. Getting the bank name right is important in my process. Here are three example scenarios:

Extracted Bank Name: DES Bank | Real Bank Name: DBS Bank
Extracted Bank Name: Lind1an 0verseas Bank | Real Bank Name: Indian Overseas Bank or IOB
Extracted Bank Name: United Overseas Bank or UOB | Real Bank Name: Same as Extracted

I know that this type of problem is related to ML and, more specifically, falls under Natural Language Processing (NLP). What algorithm or solution can map the extracted bank name to the correct bank, no matter how bad the extraction is?

Nithinkrishna · October 10, 2021, 5:46am

Hey @aqiff

Maybe fuzzy string matching would help.

Thanks
#nK

aqiff · October 10, 2021, 10:13am

I will try it out, @Nithinkrishna, and thanks. Have you tested it personally? Is it reliable?

Nithinkrishna · October 10, 2021, 10:18am

Hey @aqiff

Yes I have tried it, but actually in a different scenario.

Basically, it tries to find a match above a defined similarity percentage, such as 50% or 60%, instead of requiring a 100% exact match.

So in your case, DES will have somewhere around a 70-80% match with DBS, I believe.

That’s why fuzzy string matching may give you a good result, with perhaps a few exceptions here and there, which can be tweaked once it is up and running to yield the best success rate.
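As a rough illustration of the percentage idea, here is a minimal Python sketch using the standard library's `difflib.SequenceMatcher`. The helper name `similarity` and the choice to lowercase both strings are assumptions made for this example, not part of any particular fuzzy-matching activity, and other libraries may score the same pair differently.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Compare an extracted value against the real bank name.
score = similarity("DES Bank", "DBS Bank")
print(f"DES Bank vs DBS Bank: {score:.0%}")
```

A score like this can then be compared against whatever threshold you choose.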

Hope that helps.

Thank you
#nK

jeevith · October 10, 2021, 11:45am

Hi @aqiff,

This problem falls within the fields of computer science, linguistics, and statistics, and an ML Skill or NLP is overkill to start off with.

Specifically, you are looking to calculate an edit distance between two strings. Based on this distance you can infer if the given string matches your required string.
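To make the idea of an edit distance concrete, here is a small Python sketch of the Levenshtein distance, one common edit distance (the post above does not say which one it has in mind). It counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The function name `levenshtein` is just chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("DES Bank", "DBS Bank"))
print(levenshtein("lind1an 0verseas bank", "indian overseas bank"))
```

A small distance relative to the string length suggests the two strings refer to the same name.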

You can read this thread for some background and two more approaches: Compare Names

I have updated my Azure function; you can use it for development, but don’t use it in a production environment.

Either way, fuzzy matching (as @Nithinkrishna suggested), calculating edit distances, or the approach suggested by @kumar.varun2 is not sufficient on its own in your case.

You will also need a mapping to the correct value string (the correct bank name). So, let’s say your match is higher than 80%; you will then need a mapper that returns what DES Bank means in your list, array, or dictionary of required names.

This way, if you have additional banks in the future, you just add them to your list, array, or dictionary of banks.
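The threshold-plus-mapping idea above could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the `KNOWN_BANKS` dictionary, its aliases, the `resolve_bank` helper, and the use of `difflib.SequenceMatcher` as the similarity measure. The 0.8 default mirrors the 80% figure mentioned above and would need tuning on real data.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical mapping: canonical bank name -> spellings that may be extracted.
KNOWN_BANKS = {
    "DBS Bank": ["DBS Bank", "DBS"],
    "Indian Overseas Bank": ["Indian Overseas Bank", "IOB"],
    "United Overseas Bank": ["United Overseas Bank", "UOB"],
}


def resolve_bank(extracted: str, threshold: float = 0.8) -> Optional[str]:
    """Return the canonical name of the best match, or None if below threshold."""
    cleaned = extracted.strip().lower()
    best_name, best_score = None, 0.0
    for canonical, aliases in KNOWN_BANKS.items():
        for alias in aliases:
            score = SequenceMatcher(None, cleaned, alias.lower()).ratio()
            if score > best_score:
                best_name, best_score = canonical, score
    # Anything below the threshold is treated as an exception for manual review.
    return best_name if best_score >= threshold else None


for raw in ["DES Bank", "Lind1an 0verseas Bank", "UOB", "XYZ123"]:
    print(raw, "->", resolve_bank(raw))
```

Adding a new bank then only means adding one more entry to the dictionary, and values that return `None` can be routed to a human for validation.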

Variability in your data will make this string comparison quite a challenge, so do expect many exceptions.

In contrast, if you use an ML Skill, this variability is good, but you then need a sufficient volume of cases, and the velocity (how soon you get new data and retrain the model) will dictate the performance of such matches using ML / NLP.

aqiff · October 10, 2021, 12:10pm

Very informative, @jeevith. I will try out both solutions tomorrow and will come back here if I have any questions.

Ritaman_Baral · April 19, 2024, 11:54am

How do I use the fuzzy approach? Do you mean we can use the regex extractor?
