Find string partial similarity percentage

I am trying to find the similarity percentage of two strings (first name and last name), i.e. a metric that tells us how similar two strings are.
I tried two approaches:

  1. Levenshtein distance - Not preferred, as the results did not agree well with what I got when comparing the names manually (eyeballing).
  2. SequenceMatcher from Python's difflib library - This is the best one I have found so far. It gives a sensible similarity percentage in most cases, but the percentage drops if the order of the first and last names is jumbled.

For example:

John Smith vs John Smith = 100%
but
John Smith vs Smith John = 50%
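
For reference, a minimal sketch that reproduces these two figures with difflib, assuming the percentage is simply `SequenceMatcher.ratio()` multiplied by 100 (the helper name `similarity_percent` is made up for illustration):

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Return difflib's similarity ratio as a percentage (0-100)."""
    # None means "no junk filter": every character takes part in matching.
    return SequenceMatcher(None, a, b).ratio() * 100


print(similarity_percent("John Smith", "John Smith"))  # 100.0
print(similarity_percent("John Smith", "Smith John"))  # 50.0
```

The second score is low because SequenceMatcher aligns characters in order: once "Smith" is matched, "John" sits on opposite sides of it in the two strings and can no longer be matched.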

But both can be the same person. Is there any method that can account for this? Also, the number of words in a name is not fixed; some have a first name + second name + third name, etc.

Is there any way to detect, or to give a better similarity percentage, when the order of the provided names is reversed? I'm open to any other methods or techniques.

Some data samples are below. Ideally, String 1 and String 2 in each row should be treated as the same name, but how can that be achieved?

| String 1 | String 2 |
|---|---|
| will smith | smith will |
| Christian Max payne | Payne Max Christian |
| John Max William Defoe | William Defoe John Max |

Hi @amithvs ,

  1. Check for an exact match: strVal1.ToLower() = strVal2.ToLower()
  2. If that does not match, check whether one string contains the other: strVal1.ToLower().Contains(strVal2.ToLower()) or strVal2.ToLower().Contains(strVal1.ToLower())
  3. If that does not match either, split the string on spaces with strVal1.ToLower().Split(" "c) and check that all the resulting values are contained in strVal2.
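
Since the question mentions Python, here is a rough Python sketch of the three checks above. It is only one reading of the steps: the function name `names_match` is made up, the empty-string guard is an addition, and step 3 is interpreted as "every word of the first string appears as a word of the second".

```python
def names_match(name1: str, name2: str) -> bool:
    """Case-insensitive match: exact, containment, or word-by-word."""
    a, b = name1.lower().strip(), name2.lower().strip()

    # Step 1: exact match.
    if a == b:
        return True

    # Guard (an addition): an empty string is "contained" in every string,
    # so it is rejected before the containment check.
    if not a or not b:
        return False

    # Step 2: one string contains the other.
    if a in b or b in a:
        return True

    # Step 3: split on whitespace and check that every word of the first
    # name appears among the words of the second name.
    words_b = b.split()
    return all(word in words_b for word in a.split())


print(names_match("will smith", "smith will"))                    # True
print(names_match("Christian Max payne", "Payne Max Christian"))  # True
print(names_match("John Smith", "Jane Doe"))                      # False
```

Note that this gives a yes/no answer rather than a percentage, and it does not tolerate spelling differences inside a word.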

Regards,
Arivu


Hi @amithvs ,

Check the post below on obtaining the similarity using a .NET package. Other methods are also proposed in that thread. Let us know if you find it helpful.


Hi @amithvs ,

Please follow the post shared by @supermanPunch. It's an easy way to compare the strings and get the match percentage.

Regards,
Arivu


Thanks, I will try it and check whether the results are better than those from the Python library I was using.

Hi, I can already get a similarity percentage; my problem is getting a better similarity percentage when the names are jumbled. I will give it a shot. Thanks :slight_smile:

I have tried this in my case and these are the results from different algorithms. This was a wonderful learning experience for me. Thank you very much for sharing this.

Is there any way we can improve the results? I am looking for a good threshold value that can catch names that are reversed or in a different order. For example, I could build the code so that anything above 80% counts as a partial match of the names.

I'm open to options other than RPA. Are there any machine learning or Python activities?
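
One order-insensitive option in plain Python, sketched under assumptions: lowercase both names, sort their words alphabetically, and only then compare them with difflib's SequenceMatcher. This is not taken from the linked thread; the function name `token_sort_similarity` and the constant `THRESHOLD` (set to the 80% figure mentioned above) are illustrative, and a suitable threshold would need to be tuned on real data.

```python
from difflib import SequenceMatcher


def token_sort_similarity(name1: str, name2: str) -> float:
    """Percentage similarity after lowercasing and sorting the words."""
    # Sorting the words makes the comparison independent of word order.
    key1 = " ".join(sorted(name1.lower().split()))
    key2 = " ".join(sorted(name2.lower().split()))
    return SequenceMatcher(None, key1, key2).ratio() * 100


THRESHOLD = 80.0  # hypothetical cut-off; tune it on your own data

pairs = [
    ("will smith", "smith will"),
    ("Christian Max payne", "Payne Max Christian"),
    ("John Max William Defoe", "William Defoe John Max"),
]
for s1, s2 in pairs:
    score = token_sort_similarity(s1, s2)
    print(f"{s1!r} vs {s2!r}: {score:.0f}% match={score >= THRESHOLD}")
```

With the sample pairs from the question, each reordered name scores 100% because the sorted word lists are identical. Small spelling differences (for example "Jon Smith" vs "Smith John") lower the score gradually instead of failing outright, which is where the threshold comes in.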