I am trying to find the similarity percentage of two strings (first name and last name), i.e. a metric that tells us how similar the two strings are.
I tried two approaches:
Levenshtein distance - Not preferred, as the results did not agree well with what I got when comparing the names manually (eyeballing).
SequenceMatcher in Python's difflib library - This is the best one I have found so far. It gives a sensible similarity percentage in most cases, but the percentage drops if the order of the first and last names is swapped.
For example:
John Smith vs John Smith = 100%
but
John Smith vs Smith John = 50%
But both can be the same person. Is there a method that handles this? Also, the number of words in a name is not fixed; some names have first name + second name + third name, etc.
Is there any way to detect this, or to get a better similarity percentage, when the order of the provided names is reversed? I'm open to any other methods or techniques.
Some data samples are below. Ideally, string 1 and string 2 should be treated as the same name, but how can that be achieved?
Check the post below on obtaining the similarity using a .NET package. Other methods are also proposed in that thread. Let us know if you find it helpful.
Hi, I can already get the similarity percentage; my problem is getting a better similarity percentage when the names are jumbled. I will give it a shot. Thanks.
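One common way to make the difflib score ignore word order is to sort the words of each name before comparing. The following is only a minimal sketch of that idea, not something proposed in this thread: the helper names `normalize` and `order_insensitive_ratio` are made up for illustration, and lowercasing plus splitting on whitespace is an assumption about how the names should be cleaned.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name, split it on whitespace, and sort the words."""
    return " ".join(sorted(name.lower().split()))


def order_insensitive_ratio(a: str, b: str) -> float:
    """SequenceMatcher ratio computed on the word-sorted form of both names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Plain comparison: affected by the swapped word order.
print(SequenceMatcher(None, "John Smith", "Smith John").ratio())
# Word-sorted comparison: both names normalize to "john smith".
print(order_insensitive_ratio("John Smith", "Smith John"))
```

Because the words are sorted first, this works for any number of words per name, but a missing or extra middle name will still lower the score.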
I tried this on my data and these are the results from the different algorithms. This was a wonderful learning experience for me. Thank you very much for sharing this.
Is there any way we can improve the results? I am looking for a good threshold value that can catch names that are reversed or in a different order. For example, I could write the code so that anything above 80% counts as a partial match of the names.
I'm open to options other than RPA. Are there any machine learning or Python activities for this?
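On the Python side, one possible approach is to compare the names word by word and then apply a threshold to the averaged score. The sketch below uses only the standard-library difflib; the function names, the greedy pairing of words, and the choice to ignore extra words in the longer name are all assumptions for illustration. The 0.8 cut-off is just the 80% figure mentioned above and would need tuning on pairs you have labelled by hand.

```python
from difflib import SequenceMatcher

# The 80% figure from the post; an assumption, to be tuned on real data.
THRESHOLD = 0.8


def token_ratio(a: str, b: str) -> float:
    """Similarity of two single words, between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def name_similarity(name1: str, name2: str) -> float:
    """Average, over the words of the shorter name, of the best match
    found among the words of the longer name (greedy pairing)."""
    words1 = name1.lower().split()
    words2 = name2.lower().split()
    if not words1 or not words2:
        return 0.0
    shorter, longer = sorted((words1, words2), key=len)
    remaining = list(longer)
    total = 0.0
    for word in shorter:
        best = max(remaining, key=lambda candidate: token_ratio(word, candidate))
        total += token_ratio(word, best)
        remaining.remove(best)  # each word may be paired only once
    return total / len(shorter)


def is_probable_match(name1: str, name2: str, threshold: float = THRESHOLD) -> bool:
    """True when the word-level similarity reaches the threshold."""
    return name_similarity(name1, name2) >= threshold


print(name_similarity("John Smith", "Smith John"))
print(is_probable_match("John Smith", "Mary Jones"))
```

Note that this version treats "John Smith" and "John Michael Smith" as a full match because unmatched extra words are ignored; if that is too lenient for your data, dividing by the length of the longer name instead would penalize them.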