I am trying to develop a way to detect content similarity by comparing one string against 100+ existing Excel cells from a database. After the string comparison, I want to display the similarity percentage and the matching words.
I tried the Levenshtein algorithm. However, after multiple trials it did not return accurate results, because Levenshtein distance measures the character-level edit distance between the two strings and does not compare actual words.
What other way can I compare two strings word by word, compute a similarity percentage, and highlight the similar words in both strings, the way plagiarism checkers do?
For example:
The project will provide a platform and software solution to assist in planning & scheduling
VS.
The solution will offer a software platform to enhance planning & scheduling
Results should be around: 65%
And the highlighted words: (The, will, software, platform, to, planning, &, scheduling)
I ran into the same issue with the Levenshtein algorithm. What works for me is this Python script, which compares two texts based on their word frequencies: it turns each text into a vector of word counts and calculates the cosine similarity between these vectors.
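For reference, the quantity the script computes can be written as below. The symbols are my own notation, not from any library: A_w and B_w are the number of times word w occurs in the first and second text.

```latex
\[
\operatorname{similarity}(A, B)
  = \frac{\sum_{w} A_w B_w}
         {\sqrt{\sum_{w} A_w^{2}} \; \sqrt{\sum_{w} B_w^{2}}}
\]
```

Only words that occur in both texts contribute to the numerator, and word order is ignored, so the result is 1.0 for texts with identical word counts and 0.0 for texts with no word in common.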
import math
import re
from collections import Counter

# Matches runs of word characters; punctuation such as "&" is skipped.
WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    # Only words present in both texts contribute to the dot product.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    sum1 = sum(vec1[x] ** 2 for x in vec1)
    sum2 = sum(vec2[x] ** 2 for x in vec2)
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty text has a zero-length vector; avoid dividing by zero.
    if not denominator:
        return 0.0
    return numerator / denominator


def text_to_vector(text):
    # Count how often each word occurs in the text.
    words = WORD.findall(text)
    return Counter(words)


def get_text(text1, text2):
    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)
    cosine = get_cosine(vector1, vector2)
    print(cosine)
    return cosine
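The script above only returns the score. To also get the matching words and to check one string against many cells, something along these lines might work. It is a minimal sketch, not a tested production solution: compare, tokenize and rank_cells are hypothetical helper names I made up, lower-casing the words is my own assumption, and the cell values are assumed to be already loaded into a Python list of strings.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def tokenize(text):
    # Assumption: lower-case so that "The" and "the" count as the same word.
    return [w.lower() for w in WORD.findall(text)]


def compare(text1, text2):
    """Return (similarity in percent, shared words in text1 order)."""
    tokens1, tokens2 = tokenize(text1), tokenize(text2)
    vec1, vec2 = Counter(tokens1), Counter(tokens2)
    shared = set(vec1) & set(vec2)

    # Cosine similarity of the two word-count vectors.
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = norm1 * norm2
    score = numerator / denominator if denominator else 0.0

    # Each shared word once, in the order it first appears in text1.
    matches = [w for w in dict.fromkeys(tokens1) if w in shared]
    return round(score * 100, 1), matches


def rank_cells(query, cells):
    """Compare query against every cell; best match first.

    Each row is (percent, matching_words, cell_text).
    """
    rows = [(*compare(query, cell), cell) for cell in cells]
    return sorted(rows, key=lambda row: row[0], reverse=True)


if __name__ == "__main__":
    a = ("The project will provide a platform and software solution "
         "to assist in planning & scheduling")
    b = ("The solution will offer a software platform "
         "to enhance planning & scheduling")
    percent, matches = compare(a, b)
    print(f"{percent}%", matches)
```

For the two sentences from the question this prints 72.5% and the words the, will, a, platform, software, solution, to, planning, scheduling. The "&" is not reported because the \w+ pattern skips punctuation, and the score is higher than the roughly 65% expected in the question because cosine similarity is only one of several possible word-overlap measures.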