Comparing two strings to display the similarity percentage and highlight matching words

I am trying to build content similarity detection by comparing one string against 100+ existing Excel cells from a database. After the string comparison, I want to display the similarity percentage and the matching words.

I tried the Levenshtein algorithm. However, after multiple trials, it did not return accurate results, because Levenshtein distance measures the character-level edit distance between the two strings and does not compare actual words.

What other way can I compare two strings word by word, then compute the percentage similarity and highlight the similar words in both strings, the way plagiarism checkers work?
For example:

The project will provide a platform and software solution to assist in planning & scheduling
The solution will offer a software platform to enhance planning & scheduling

Results should be around: 65%
And the highlighted words: (The, will, software, platform, to, planning, &, scheduling)
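
One way to get both a percentage and the matching words is a plain word-set comparison. The sketch below is only an illustration: the function names (`tokenize`, `word_overlap`, `highlight`) are made up for this example, and it assumes Jaccard similarity (shared words divided by all distinct words), lowercase matching, and splitting on whitespace so that "&" counts as a word. A different metric will give a different percentage.

```python
def tokenize(text):
    # Lowercase and split on whitespace so that "&" stays a token.
    return text.lower().split()

def word_overlap(text1, text2):
    """Return (percentage, shared_words) using Jaccard similarity on word sets."""
    words1, words2 = tokenize(text1), tokenize(text2)
    set1, set2 = set(words1), set(words2)
    union = set1 | set2
    if not union:
        return 0.0, []
    shared = set1 & set2
    # Keep the shared words in the order they first appear in text1.
    ordered = list(dict.fromkeys(w for w in words1 if w in shared))
    return 100.0 * len(shared) / len(union), ordered

def highlight(text, shared):
    # Wrap every matching word in square brackets.
    return " ".join(f"[{w}]" if w.lower() in shared else w for w in text.split())

a = "The project will provide a platform and software solution to assist in planning & scheduling"
b = "The solution will offer a software platform to enhance planning & scheduling"
percent, shared = word_overlap(a, b)
print(f"{percent:.1f}%")
print(shared)
print(highlight(b, shared))
```

For the two example sentences this finds 10 shared words out of 17 distinct words, which is about 58.8%. Words with punctuation attached (for example "scheduling.") would not match under this simple split, so real input may need extra cleanup first.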

@ClaytonM , @lakshman , @Palaniyappan

Hi @n3if

Instead of the Levenshtein algorithm, use the Jaro-Winkler algorithm. It gave the best results compared with the other algorithms.

Check the image below for a better understanding.
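
For readers who want to try Jaro-Winkler outside UiPath, here is a minimal pure-Python sketch. The function names and the default scaling factor `p=0.1` with a prefix limit of 4 characters are the commonly used values, assumed here rather than taken from this thread. Note that Jaro-Winkler still compares characters, not words, so it tends to suit short strings such as names.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    match_dist = max(max(len1, len2) // 2 - 1, 0)
    s1_matched = [False] * len1
    s2_matched = [False] * len2
    matches = 0
    for i in range(len1):
        start = max(0, i - match_dist)
        end = min(i + match_dist + 1, len2)
        for j in range(start, end):
            if s2_matched[j] or s1[i] != s2[j]:
                continue
            s1_matched[i] = s2_matched[j] = True
            matches += 1
            break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if not s1_matched[i]:
            continue
        while not s2_matched[k]:
            k += 1
        if s1[i] != s2[k]:
            out_of_order += 1
        k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * p * (1 - base)

print(round(jaro_winkler("MARTHA", "MARHTA") * 100, 1))
```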

Hope it helps!!

Hi @n3if ,

I don't know whether it will work for you, but you can try the Levenshtein Distance algorithm:

UiPath Studio compare strings using Levenshtein Distance Algorithm | code in description - Learn / Video Tutorials - UiPath Community Forum

Vinit Mhatre
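
As a rough illustration of how a Levenshtein distance can be turned into a percentage, here is a generic Python sketch. It is not the code from the linked tutorial; the names `levenshtein` and `similarity_percent` and the normalization by the longer input are assumptions for this example. Because it works on any sequence, passing lists of words instead of strings gives a word-level edit distance.

```python
def levenshtein(seq1, seq2):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(seq2) + 1))
    for i, item1 in enumerate(seq1, start=1):
        current = [i]
        for j, item2 in enumerate(seq2, start=1):
            cost = 0 if item1 == item2 else 1
            current.append(min(
                previous[j] + 1,         # delete item1
                current[j - 1] + 1,      # insert item2
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def similarity_percent(seq1, seq2):
    # Normalize the distance by the length of the longer input.
    longest = max(len(seq1), len(seq2))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(seq1, seq2) / longest)

# Character level
print(levenshtein("kitten", "sitting"))
# Word level: split both texts into word lists first
print(levenshtein("a b c".split(), "a x c".split()))
```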

Hi @n3if

I also faced the same issue with the Levenshtein algorithm. What works for me is this Python script, which compares the similarity between two texts based on their word frequencies. It turns each text into a word-count vector and calculates the cosine similarity between these vectors.

import math
import re
from collections import Counter

# Matches runs of word characters; punctuation such as "&" is ignored.
WORD = re.compile(r"\w+")

def get_cosine(vec1, vec2):
    # Dot product over the words that occur in both texts.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the two vector lengths (Euclidean norms).
    sum1 = sum(count ** 2 for count in vec1.values())
    sum2 = sum(count ** 2 for count in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    # Count how often each word occurs in the text.
    words = WORD.findall(text)
    return Counter(words)

def get_text(text1, text2):
    # Returns the cosine similarity as a value between 0.0 and 1.0.
    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)
    cosine = get_cosine(vector1, vector2)
    return cosine
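
The script above returns only a score. To also list the matching words, one option is to intersect the two word counters. The sketch below is a compact, standalone variant for illustration; `similarity_report` is a hypothetical name, and like the script it is case-sensitive and ignores punctuation such as "&".

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def similarity_report(text1, text2):
    """Return (percentage, shared_words) from word-count vectors."""
    vec1 = Counter(WORD.findall(text1))
    vec2 = Counter(WORD.findall(text2))
    # Counter intersection keeps only words present in both texts.
    shared = vec1 & vec2
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if not norm1 or not norm2:
        return 0.0, []
    return 100.0 * numerator / (norm1 * norm2), sorted(shared)

a = "The project will provide a platform and software solution to assist in planning & scheduling"
b = "The solution will offer a software platform to enhance planning & scheduling"
percent, shared = similarity_report(a, b)
print(f"{percent:.1f}%")
print(shared)
```

On the two example sentences from the question this gives about 72.5% with 9 shared words, since "&" is dropped by the `\w+` pattern.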

Hope this helps :slight_smile:
