Compare two datatables from pdf

Hi,

I’m trying to compare two PDFs that contain nearly the same list of people’s names. The first PDF contains the first name, one or more optional middle names, and the last name. The second PDF contains the same data, but in a different column order and written entirely in capital letters.

My goal is to extract the names that don’t exist in both documents. I uploaded two example documents with randomly generated data. Each table is missing ten different names.

Tabelle1.pdf (197.2 KB) Tabelle2.pdf (183.5 KB)

Thank you

Try it:

Main.xaml (9.5 KB)


Thank you very much.

But the workflow output contains too many names that exist in both documents. The final result should have 20 missing names: 10 missing entries per document. My problem is that I can’t always tell the difference between a middle name and a last name consisting of two names.

@devinsta
May I ask you to do some quality checks on my results?

Around 979 common names found

PDF1 Names not common with names in PDF2
string[12] { "Name MiddleName Surname", "Orelee Zsa zsa Swaffer", "Leonelle Bossom", "Luisa Jeniece Shernock", "Georgy Philippe", "Evangelia Rutherfoord", "Tobe Eldin Ranking", "Dante Jewett", "Celie Durrand", "Ally Rakel Marchent", "Katlin Semble", "Worthy Dark" }

PDF2 Names not common with names in PDF1
string[12] { "Surname Name MiddleName", "REDDOCH BENJI ", "COULT GIUSTINO NAHUM", "BRIATT ZECHARIAH ", "SIEUR MATTHIEW ", "LOWDES UNA ", "GREENHILL ARIELLA ", "RADSDALE ADRIA ", "SETCH TRUDEY ", "LAURENTIN VALENTINE ", "DAGLEAS SARENA ", "SWAFFER ORELEE ZSA ZSA" }

One item is wrongly reported as unmatched: SWAFFER ORELEE ZSA ZSA vs. Orelee Zsa zsa Swaffer

Can this be handled, e.g. by a manual post-check on only a small number of values, or can you check whether hidden characters or blanks are making the difference? Thanks
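One way to check for hidden characters is to list the code point of every character in a suspicious name. The following is only a minimal Python sketch of that idea, not part of the UiPath workflow; `describe_chars` and `hidden_chars` are hypothetical helper names, and treating everything except letters and the ordinary space as "hidden" is an assumption.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for every character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for ch in text
    ]


def hidden_chars(text):
    """Return only the entries that are neither a letter nor an ordinary space.

    Assumption: names consist of letters and plain spaces (U+0020) only,
    so anything else (non-breaking space, tab, CR/LF, ...) is suspicious.
    """
    return [
        entry for entry in describe_chars(text)
        if not (entry[0].isalpha() or entry[0] == " ")
    ]
```

Running `hidden_chars` on both spellings of the same name shows whether one of them contains, for example, a non-breaking space where the other has a normal blank.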


Thanks a lot,
the result would already be very good. The filtered names would be checked manually one more time anyway.

@devinsta
So can I take that as confirmation that the two lists are correct? If yes, I will revise my implementation and share it with you.


@devinsta

Find starter help here:
devinsta.xaml (13.7 KB)

Kindly note:

Some brief feedback on the implementation:

  • Reading all pages in one go with Read PDF Text causes a problem at the page breaks: e.g. Livy GarraltsCacilia Ric Mankor is read instead of Livy Garralts\n\rCacilia Ric Mankor, with \n\r being a line break.

    • That’s why you need the latest PDF activities: they provide the page count, which was used to fix the page breaks manually (@loginerror, was this behaviour expected?).
  • The assumption that a space separates the 3 columns was not confirmed: the name Orelee Zsa zsa Swaffer is one of the exceptions. There is also no non-cognitive way to know whether a name part belongs to the middle name or to the last name, e.g. Rube Van den Velde, where Van den Velde is a Dutch family name.
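The page-break issue from the first point can be illustrated outside of UiPath with a short Python sketch. It assumes the text has already been read page by page into a list of strings (the extraction itself is not shown); `join_pages` is a hypothetical helper name.

```python
def join_pages(pages):
    """Turn per-page text into one list of non-empty, trimmed lines.

    Assumption: `pages` holds the extracted text of each page as a
    separate string. Splitting every page into lines on its own keeps
    the last line of one page apart from the first line of the next.
    """
    lines = []
    for page_text in pages:
        for line in page_text.splitlines():
            line = line.strip()
            if line:
                lines.append(line)
    return lines


# The last name on page 1 has no trailing line break.
pages = ["Livy Garralts", "Cacilia Ric Mankor\r\n"]

merged_wrong = "".join(pages).strip()   # plain concatenation glues two names together
merged_right = join_pages(pages)        # page-wise splitting keeps them apart
```

With plain concatenation the two names end up in one line, which is the behaviour described above; processing each page separately avoids it.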

All in all, I implemented the following logic: if all tokens (name parts separated by spaces) of two entries match each other, they are treated as a matching name. The problem of a few false matches or missed matches was already mentioned above, so the unmatched names should still be reviewed manually.
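For readers who want to try the token idea outside of UiPath, here is a minimal Python sketch. It is not the logic from the attached workflow, only one possible reading of it: a name is reduced to its lower-cased tokens in sorted order, so column order and capitalisation no longer matter. `name_key` and `unmatched` are hypothetical names.

```python
def name_key(name):
    """Order- and case-insensitive key: the sorted, case-folded tokens of a name."""
    return tuple(sorted(name.casefold().split()))


def unmatched(names_a, names_b):
    """Return the names of each list whose token key does not occur in the other list."""
    keys_a = {name_key(n) for n in names_a}
    keys_b = {name_key(n) for n in names_b}
    only_a = [n for n in names_a if name_key(n) not in keys_b]
    only_b = [n for n in names_b if name_key(n) not in keys_a]
    return only_a, only_b


pdf1 = ["Orelee Zsa zsa Swaffer", "Leonelle Bossom", "Livy Garralts"]
pdf2 = ["SWAFFER ORELEE ZSA ZSA", "REDDOCH BENJI ", "GARRALTS LIVY"]
only_in_pdf1, only_in_pdf2 = unmatched(pdf1, pdf2)
```

Because the key ignores which token is a first, middle, or last name, multi-part names such as Van den Velde need no special handling. The price is that two different people whose names consist of the same tokens would be treated as a match, so a manual review of the result is still advisable.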

Let us know your feedback

