I am trying to find the similarity percentage of two strings (first name and last name), i.e. a metric that tells us how similar two strings are.
I tried two approaches:
Levenshtein distance - Not preferred, as the results did not agree well with what I got when comparing the names manually (eyeballed).
SequenceMatcher from Python's difflib library - This is the best one I have found so far. It gives a reasonable similarity percentage in most cases, but the percentage drops if the order of the first and last names is swapped.
John Smith vs John Smith = 100%
John Smith vs Smith John = 50%
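A minimal sketch that reproduces the two numbers above, assuming `difflib.SequenceMatcher` with its default settings (the helper name `similarity` is hypothetical, just for illustration):

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return SequenceMatcher's ratio for two strings as a percentage."""
    # The first argument (isjunk) is None, so no characters are treated as junk.
    return SequenceMatcher(None, a, b).ratio() * 100


print(similarity("John Smith", "John Smith"))  # 100.0
print(similarity("John Smith", "Smith John"))  # 50.0
```

The second score is low because the comparison is character-order sensitive: only the longest common block ("Smith") is matched once the words are swapped.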
But both can be the same person. Is there any method that can handle this? Also, the number of words in a name is not fixed; some have a first name + second name + third name, etc.
Is there any way to identify such cases, or to get a better similarity percentage, when the order of the provided names is reversed? I'm open to any other methods or techniques.
Some data samples are below. Ideally, string 1 and string 2 would be treated as the same name, but how can that be achieved?
Is there any way we can improve the results? I am looking for a good threshold value that still catches names that are reversed or in a different order. For example, I could build the code so that anything above 80% counts as a match.
I'm open to options other than RPA. Are there any machine learning or Python activities I could use?
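One possible direction, sketched under assumptions rather than as a definitive answer: lowercase each name, split it into words, sort the words, and only then compare. Word order then no longer affects the score, and any number of name parts is handled. The function names and the `THRESHOLD` constant are hypothetical; the 80% value is just the figure mentioned above and would need tuning on real data.

```python
from difflib import SequenceMatcher

THRESHOLD = 80.0  # assumed cut-off taken from the 80% figure above; tune on real data


def token_sort_similarity(a: str, b: str) -> float:
    """Compare two names after sorting their words, so word order is ignored."""
    def normalize(name: str) -> str:
        # Lowercase, split on whitespace, and sort the words alphabetically.
        return " ".join(sorted(name.lower().split()))

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() * 100


def is_probable_match(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Return True when the order-insensitive score reaches the threshold."""
    return token_sort_similarity(a, b) >= threshold


print(token_sort_similarity("John Smith", "Smith John"))  # 100.0
print(is_probable_match("John Michael Smith", "Smith John Michael"))  # True
print(is_probable_match("John Smith", "Jane Doe"))  # False
```

Sorting the words removes order sensitivity but not spelling sensitivity, so typos and missing middle names still lower the score in the usual way.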