# Compare Documents and work out what percentage of the documents are the same

Hi

What is the best way to compare two PDF documents? Is it then possible to work out what percentage of the documents is the same?

Thanks
Raychel

Hi @raychel.hall
Extract text from both PDFs using OCR and compare the extracted text using string comparison techniques.
Calculate the percentage of similarity between the two documents.
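
As a rough illustration of the "compare, then calculate a percentage" step, here is a minimal C# sketch that assumes the text of both PDFs has already been extracted into two strings. It uses the Levenshtein (edit) distance, which is only one possible comparison technique, and the class and method names (`TextSimilarity`, `SimilarityPercent`) are made up for this example rather than taken from any UiPath activity or package.

```csharp
using System;

public static class TextSimilarity
{
    // Levenshtein distance via dynamic programming, keeping only two rows.
    public static int LevenshteinDistance(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // i deletions turn the first i characters of a into ""
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1), // insertion, deletion
                    previous[j - 1] + cost);                        // substitution or match
            }
            // The row just filled becomes the "previous" row for the next character.
            int[] temp = previous;
            previous = current;
            current = temp;
        }
        return previous[b.Length];
    }

    // Hypothetical helper: 100 means identical, 0 means every position differs.
    public static double SimilarityPercent(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 100.0; // two empty texts are treated as identical
        return (1.0 - (double)LevenshteinDistance(a, b) / longest) * 100.0;
    }
}
```

Calling `TextSimilarity.SimilarityPercent(text1, text2)` gives a value from 0 to 100. The running time grows with the product of the two text lengths, so for very long documents it may be more practical to compare page by page or to use a set-based measure.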

If the documents are digital (not scanned), you could use the `Read PDF Text` activity and get the text as output. Do this for both documents and then perform the string comparison. One such component that performs the string comparison is provided below:

Hi Shanmathi
Thank you for the quick response, I'll try that.

Hi SupermanPunch

Thank you for your quick response, I'm going to try this now.

Hi Shanmathi
What string comparison method should I use?

Thanks

Raychel

Hi Raychel !
The Jaccard similarity index and cosine similarity are commonly used for document comparison, while the Levenshtein distance can be useful for comparing strings with a small number of differences.

For instance, using Jaccard

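```csharp
using System.Linq; // needed for Distinct, Intersect and Union

string string1 = "Hello world";
string string2 = "Hi universe";

// Unique characters of each string (the comparison is case-sensitive)
char[] set1 = string1.ToCharArray().Distinct().ToArray();
char[] set2 = string2.ToCharArray().Distinct().ToArray();

// Sizes of the intersection and the union of the two character sets
int numCommon = set1.Intersect(set2).Count();
int numUnique = set1.Union(set2).Count();

// Jaccard index = |intersection| / |union|, a value between 0 and 1
double jaccardIndex = (double)numCommon / numUnique;
```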

This code will calculate the Jaccard similarity index for the two strings and store the result in the variable `jaccardIndex`. The value lies between 0 and 1, so multiply it by 100 to get a percentage.
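
For whole documents it may be worth comparing sets of words rather than sets of characters, since long texts tend to contain most letters of the alphabet and a character-level index can then come out high even for unrelated documents. The following is a minimal sketch of that variant, not a definitive implementation: the names `WordJaccard`, `ToWordSet` and `SimilarityPercent` are hypothetical, words are split on whitespace only, and punctuation is left attached to words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordJaccard
{
    // Lower-case the text and split it on whitespace into a set of distinct words.
    public static HashSet<string> ToWordSet(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        return new HashSet<string>(
            text.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries));
    }

    // Jaccard index over word sets, scaled to 0..100.
    public static double SimilarityPercent(string text1, string text2)
    {
        HashSet<string> set1 = ToWordSet(text1);
        HashSet<string> set2 = ToWordSet(text2);

        int numUnique = set1.Union(set2).Count();
        if (numUnique == 0) return 100.0; // both texts contain no words: treated as identical

        int numCommon = set1.Intersect(set2).Count();
        return (double)numCommon / numUnique * 100.0;
    }
}
```

Because this is a set-based measure, word order and repeated words are ignored: two documents with the same vocabulary in a different order still score 100.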

Hi Shanmathi
Just trying to work this out now. Please can you confirm what a char variable is?

A `char` holds a single character. The `char[]` type (an array of `char`) is used here to store the sets of unique characters.

For example, for the string `"Hello world"`, the `ToCharArray()` method would create the following array of characters: `['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']`. The `Distinct()` method would then remove duplicates, resulting in the set of unique characters: `['H', 'e', 'l', 'o', ' ', 'w', 'r', 'd']`.

The resulting sets of unique characters are stored in the `char[]` type variables `set1` and `set2`. These sets are then used in the next steps of the Jaccard similarity index calculation.

Hi Shanmathi. Thanks for that. Can you let me know what variable type the char is, please? I've searched for it in the variables but nothing is coming up. I've tried using Char but this error is coming up

Have you tried System.Char data type? @raychel.hall

Also you can use this
Compare.xaml (10.9 KB)

Hi, the System.Char isn't displayed on my list…

Hi Shanmathi

I believe you are using a Windows compatibility project and hence were not able to use the package provided above.
There are other .NET packages that you could use to compute the difference or similarity between strings. One such package is `StringSimilarity.Net`. You can install it from the `Manage Packages` section.

Check the below workflow which contains the method to Compute the Similarity using the Package mentioned :
StringSimilarity.zip (2.4 KB)

Similarly, there are other algorithms built into the package that you can use, but the final computation you need to do on their result will vary from one algorithm to another.

Below is the GitHub source.

Let us know if you are facing any issues with the implementation.

Select the Char that is available in the image that you have shared.

Thanks Shanmathi, I have used that one but now I'm getting this error message

Are you able to help?
Regards

Raychel

Hi SupermanPunch
I have installed StringSimilarity.Net as you suggested

But I am getting the following error messages

Are you able to help, please?
Thanks
Raychel

Does the workflow I have provided above give you these errors ?

@supermanPunch yes, it does.

Have you assigned the `char[]` type to `set2`? @raychel.hall

```csharp
set2 = string2.ToCharArray().Distinct().ToArray();
```

Also assign the `int` data type to the `numCommon` and `numUnique` variables, and the `double` data type to the `jaccardIndex` variable.