Hi
What is the best way to compare two PDF documents? Is it then possible to work out what percentage of the documents is the same?
Thanks
Raychel
Hi @raychel.hall
Extract text from both PDFs using OCR and compare the extracted text using string comparison techniques.
Calculate the percentage of similarity between the two documents.
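One way to turn the comparison into a percentage is sketched below. It is only an illustration, assuming the text of both PDFs is already available as plain strings and that an edit-distance measure (Levenshtein) suits your documents. The class and method names (TextSimilarity, LevenshteinDistance, SimilarityPercent) are made up for this example and are not part of any UiPath package.

```csharp
using System;

// Hypothetical helper, not part of UiPath or any package mentioned in this thread.
public static class TextSimilarity
{
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions or substitutions needed to turn a into b.
    // Only two rows of the dynamic-programming table are kept in memory.
    public static int LevenshteinDistance(string a, string b)
    {
        if (a.Length == 0) return b.Length;
        if (b.Length == 0) return a.Length;

        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // The row just filled becomes the "previous" row of the next pass.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }

    // Assumption: similarity is defined as 1 - distance / length of the longer text.
    public static double SimilarityPercent(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 100.0; // two empty texts count as identical
        return (1.0 - (double)LevenshteinDistance(a, b) / longest) * 100.0;
    }
}
```

For example, TextSimilarity.SimilarityPercent("kitten", "sitting") gives roughly 57.1, because the distance is 3 and the longer string has 7 characters. The running time grows with the product of the two text lengths, so for very long documents it may be more practical to compare page by page or line by line.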
Hi @raychel.hall ,
If the documents are digital documents, you could use the Read PDF Text activity and get the text as output. Do this for both documents and then perform the string comparison. One component that performs the string comparison is provided below:
Hi Shanmathi
Thank you for the quick response, I’ll try that
Hi SupermanPunch
Thank you for your quick response, I’m going to try this now.
Hi Shanmathi
What string comparison method should I use?
Thanks
Raychel
Hi Raychel !
The Jaccard similarity index and cosine similarity are commonly used for document comparison, while the Levenshtein distance can be useful for comparing strings with a small number of differences.
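For instance, using Jaccard on the sets of distinct characters:
using System.Linq; // needed for Distinct, Intersect and Union
string string1 = "Hello world";
string string2 = "Hi universe";
// Unique characters of each string
char[] set1 = string1.ToCharArray().Distinct().ToArray();
char[] set2 = string2.ToCharArray().Distinct().ToArray();
// Jaccard index = size of the intersection / size of the union
int numCommon = set1.Intersect(set2).Count();
int numUnique = set1.Union(set2).Count();
double jaccardIndex = (double)numCommon / numUnique;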
This code calculates the Jaccard similarity index for the two strings and stores the result in the variable “jaccardIndex”. The value lies between 0 and 1, so multiply it by 100 to get a percentage.
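For whole documents, character sets tend to overlap almost completely, since most long texts use most of the alphabet. A variation that may work better is to apply the same Jaccard idea to sets of words. The sketch below assumes that splitting on whitespace and lowercasing is good enough for your text. It does not strip punctuation, and the names WordJaccard, SimilarityPercent and Tokenize are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating word-level Jaccard similarity.
public static class WordJaccard
{
    // Returns the share of distinct words the two texts have in common,
    // as a percentage of all distinct words found in either text.
    public static double SimilarityPercent(string text1, string text2)
    {
        HashSet<string> words1 = Tokenize(text1);
        HashSet<string> words2 = Tokenize(text2);

        // Two texts without any words are treated as identical.
        if (words1.Count == 0 && words2.Count == 0) return 100.0;

        int common = words1.Intersect(words2).Count();
        int total = words1.Union(words2).Count();
        return (double)common / total * 100.0;
    }

    // Assumption: words are separated by spaces, tabs or line breaks,
    // and upper/lower case should not matter.
    private static HashSet<string> Tokenize(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        return new HashSet<string>(
            text.ToLowerInvariant()
                .Split(separators, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

For example, "the quick brown fox" and "the slow brown dog" share 2 of 6 distinct words, so WordJaccard.SimilarityPercent returns about 33.3. Word order and repeated words are ignored, so two documents with the same vocabulary in a different order would still score 100.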
Hi Shanmathi
Just trying to work this out now. Please can you confirm what a char variable is?
The char[] type variable is used to store the sets of unique characters.
For example, for the string "Hello world", the ToCharArray() method would create the following array of characters: ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']. The Distinct() method would then remove duplicates, resulting in the set of unique characters: ['H', 'e', 'l', 'o', ' ', 'w', 'r', 'd'].
The resulting sets of unique characters are stored in the char[] type variables set1 and set2. These sets are then used in the next steps of the Jaccard similarity index calculation.
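In case it helps when choosing the variable type: in C#, char is the keyword for the .NET type System.Char, and char[] is an array of it (System.Char[]). The small sketch below shows this. The class and method names CharSetDemo and UniqueChars are just made up for the illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper showing what a char[] of unique characters holds.
public static class CharSetDemo
{
    // Splits the text into single characters and drops the duplicates.
    public static char[] UniqueChars(string text)
    {
        return text.ToCharArray().Distinct().ToArray();
    }

    // Full .NET name of the array type, i.e. what "char[]" stands for.
    public static string TypeName(char[] characters)
    {
        return characters.GetType().FullName;
    }
}
```

For "Hello world", CharSetDemo.UniqueChars returns an array of 8 characters, and CharSetDemo.TypeName reports its type as System.Char[].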
Hi Shanmathi
Thanks for that. Can you let me know what variable type the char is, please? I’ve searched for it in the variables but nothing is coming up. I’ve tried using Char but this error is coming up
Hi, the System.Char isn’t displayed on my list…
Hi @raychel.hall ,
I believe you are using a Windows compatibility project and hence were not able to use the package provided above.
There are other .NET packages that you could use to compute the difference/similarity between strings. One such package is StringSimilarity.Net. You could install the package from the Manage Packages section.
Check the workflow below, which contains the method to compute the similarity using the package mentioned:
StringSimilarity.zip (2.4 KB)
Similarly, there are other algorithms built into the package that you can use, but the final computation you need may vary depending on the algorithm.
Below is the GitHub source.
Let us know if you face any issues with the implementation.
Thanks Shanmathi, I have used that one but now I’m getting this error message
Are you able to help?
Regards
Raychel
Hi SupermanPunch
I have installed StringSimilarity.Net as you suggested
Have you assigned the char[] type to set2, @raychel.hall?
set2 = string2.ToCharArray().Distinct().ToArray();
Also assign the int datatype to the numCommon and numUnique variables, and the double datatype to the jaccardIndex variable.