Compare Documents and work out what percentage of the documents are the same

Hi

What is the best way to compare two PDF documents? Is it then possible to work out what percentage of the documents is the same?

Thanks
Raychel

Hi @raychel.hall
Extract text from both PDFs using OCR and compare the extracted text using string comparison techniques.
Calculate the percentage of similarity between the two documents.
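
As a rough illustration of the "calculate the percentage" step, here is a minimal sketch that assumes the text of both PDFs has already been extracted into two strings. It uses the Levenshtein edit distance, normalized by the longer string's length. The class and method names (TextSimilarity, LevenshteinDistance, SimilarityPercent) are made up for this example and are not part of UiPath or any package.

```csharp
using System;

public static class TextSimilarity
{
    // Levenshtein distance via dynamic programming, keeping only two rows.
    public static int LevenshteinDistance(string a, string b)
    {
        if (a.Length == 0) return b.Length;
        if (b.Length == 0) return a.Length;

        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1), // insert / delete
                    previous[j - 1] + cost);                       // substitute
            }
            // Swap the rows so "previous" holds the row just computed.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }

    // Hypothetical helper: 100 means identical, 0 means nothing in common.
    public static double SimilarityPercent(string a, string b)
    {
        int maxLength = Math.Max(a.Length, b.Length);
        if (maxLength == 0) return 100.0; // two empty texts count as identical
        return (1.0 - (double)LevenshteinDistance(a, b) / maxLength) * 100.0;
    }
}
```

For example, TextSimilarity.SimilarityPercent("kitten", "sitting") gives about 57.1, because the edit distance is 3 and the longer string has 7 characters. Note that the running time grows with the product of the two text lengths, so this may be slow on very large documents.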

Hi @raychel.hall ,

If the documents are digital (text-based) PDFs, you can use the Read PDF Text activity to get the text as output. Do this for both documents, then perform a string comparison on the two results. One component that performs the string comparison is provided below:

Hi Shanmathi
Thank you for the quick response, I’ll try that 🙂

Hi SupermanPunch

Thank you for your quick response, I’m going to try this now.

Hi Shanmathi
What string comparison method should I use?

Thanks

Raychel

Hi Raychel !
The Jaccard similarity index and cosine similarity are commonly used for document comparison, while the Levenshtein distance can be useful for comparing strings with a small number of differences.

For instance, using Jaccard:

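using System.Linq; // needed for Distinct, Intersect and Union

String string1 = "Hello world";
String string2 = "Hi universe";

// Sets of distinct characters in each string (case-sensitive)
char[] set1 = string1.ToCharArray().Distinct().ToArray();
char[] set2 = string2.ToCharArray().Distinct().ToArray();

// Characters found in both sets, and characters found in either set
int numCommon = set1.Intersect(set2).Count();
int numUnique = set1.Union(set2).Count();

// Jaccard index = size of intersection / size of union (between 0 and 1)
double jaccardIndex = (double)numCommon / numUnique;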

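This code calculates the Jaccard similarity index over the distinct characters of the two strings and stores the result in the variable “jaccardIndex”. For the two strings above, 4 characters are shared out of 13 distinct characters overall, so the index is 4/13, roughly 0.31. Multiply by 100 to get a percentage.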
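
For whole documents, comparing sets of words rather than sets of characters is usually more telling, since most long texts share nearly the same letters. Here is a minimal sketch of that variation, assuming the extracted text is already in two strings. The class name WordJaccard, the method names and the separator list are my own choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordJaccard
{
    // Lower-cases the text and splits it into a set of distinct words.
    // The separator list is an assumption; adjust it to your documents.
    private static HashSet<string> ToWordSet(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        return new HashSet<string>(
            text.ToLowerInvariant()
                .Split(separators, StringSplitOptions.RemoveEmptyEntries));
    }

    // Jaccard index over words, expressed as a percentage (0 to 100).
    public static double SimilarityPercent(string text1, string text2)
    {
        HashSet<string> words1 = ToWordSet(text1);
        HashSet<string> words2 = ToWordSet(text2);

        int unionCount = words1.Union(words2).Count();
        if (unionCount == 0) return 100.0; // both texts contain no words

        int commonCount = words1.Intersect(words2).Count();
        return 100.0 * commonCount / unionCount;
    }
}
```

For example, "The quick brown fox" and "the quick red fox" share 3 words out of 5 distinct words overall, so SimilarityPercent returns 60. Word order and repeated words are ignored by this measure.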

Hi Shanmathi
Just trying to work this out now. Please can you confirm what a char variable is?

The char[] type variable is used to store the sets of unique characters.

For example, for the string "Hello world", the ToCharArray() method would create the following array of characters: ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']. The Distinct() method would then remove duplicates, resulting in the set of unique characters: ['H', 'e', 'l', 'o', ' ', 'w', 'r', 'd'].

The resulting sets of unique characters are stored in the char[] type variables set1 and set2. These sets are then used in the next steps of the Jaccard similarity index calculation.

Hi Shanmathi, thanks for that 🙂 Can you let me know what variable type the char is, please? I’ve searched for it in the variables but nothing is coming up. I’ve tried using Char but this error is coming up.

Have you tried the System.Char data type? @raychel.hall

You can also use this:
Compare.xaml (10.9 KB)

Hi, the System.Char isn’t displayed on my list…
Screenshot 2

Hi Shanmathi

I tried to download your Compare.xaml but got this error. Do you know what it means?

Hi @raychel.hall ,

I believe you are using a Windows compatibility project and hence were not able to use the package provided above.
There are other .NET packages that you can use to compute the difference or similarity between strings. One such package is StringSimilarity.Net. You can install it from the Manage Packages section.

Check the workflow below, which contains the method to compute the similarity using that package:
StringSimilarity.zip (2.4 KB)

Visuals :
image

Similarly, there are other algorithms built into the package that you can use. The final computation you need to do will vary with the algorithm, though: some return a distance, others a similarity score.
image
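
To illustrate how the final computation differs between algorithms: an edit distance has to be normalized by the text length before it becomes a percentage, whereas a cosine similarity is already a score between 0 and 1 that only needs multiplying by 100. The sketch below is a self-contained cosine similarity over word counts. It deliberately does not use the package's API, and the names WordCosine, CountWords and Similarity are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCosine
{
    // Counts how often each lower-cased word occurs in the text.
    private static Dictionary<string, int> CountWords(string text)
    {
        return text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(word => word)
            .ToDictionary(group => group.Key, group => group.Count());
    }

    // Cosine of the angle between the two word-count vectors (0 to 1).
    public static double Similarity(string text1, string text2)
    {
        Dictionary<string, int> counts1 = CountWords(text1);
        Dictionary<string, int> counts2 = CountWords(text2);

        // Dot product over the words that occur in both texts.
        double dot = 0.0;
        foreach (KeyValuePair<string, int> pair in counts1)
        {
            int other;
            if (counts2.TryGetValue(pair.Key, out other))
            {
                dot += (double)pair.Value * other;
            }
        }

        double norm1 = Math.Sqrt(counts1.Values.Sum(c => (double)c * c));
        double norm2 = Math.Sqrt(counts2.Values.Sum(c => (double)c * c));
        if (norm1 == 0.0 || norm2 == 0.0) return 0.0; // at least one text has no words

        return dot / (norm1 * norm2);
    }
}
```

For example, WordCosine.Similarity("a a b", "a b b") is 0.8, so the two texts would be reported as about 80% similar.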

Below is the GitHub source.

Let us know if you are facing any issues with the implementation.

Hi @raychel.hall

Select the Char that is available in the image that you have shared.

Thanks Shanmathi, I have used that one but now I’m getting this error message
Screenshot4
Are you able to help?
Regards

Raychel

Hi SupermanPunch
I have installed StringSimilarity.Net as you suggested


But I am getting the following error messages


Are you able to help, please?
Thanks
Raychel

@raychel.hall ,

Does the workflow I provided above give you these errors?

@supermanPunch yes, it does.

Have you assigned the char[] type to set2? @raychel.hall

set2 = string2.ToCharArray().Distinct().ToArray();

Also assign the int data type to the numCommon and numUnique variables, and the double data type to the jaccardIndex variable.