Self-join a DataTable to remove duplicates

Hi, I have a task to remove duplicate rows from a DataTable.

If I do it by brute force, comparing each row against every other row, the algorithm is very slow, especially with large data sets.

How do I use a self-join on the DataTable to remove duplicate rows quickly by processing them in bulk?

Input: raw file, contains duplicate rows
Output: Only unique rows

There is a built-in activity for that: Remove Duplicate Rows.

Regards,
Karthik Byggari

How do I specify the criteria that determine whether two rows refer to the same item? I.e. instead of comparing all the fields in a row, I want to specify custom logic as the comparison criteria.

E.g. 3 out of 5 important fields are the same, or "first name + last name" and "last name + first name" count as the same, etc.

@DEATHFISH

Correct me if I'm wrong. As I understand it, you need to remove duplicate rows based on a column, right?

For this, use the code below:
dt.AsEnumerable().GroupBy(Function(x) x("Column1")).Select(Function(m) m.First()).CopyToDataTable()
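If the duplicate check needs more than one column, the same GroupBy idea can probably be extended with a composite key. The following is only a sketch under assumptions: "Column2" and "Note" are made-up column names, the sample rows are invented, and "|" is assumed never to occur in the data.

```vb
Imports System
Imports System.Data
Imports System.Linq

Module CompositeKeySketch
    Sub Main()
        ' Hypothetical sample table; column names and rows are illustrative only.
        Dim dt As New DataTable()
        dt.Columns.Add("Column1", GetType(String))
        dt.Columns.Add("Column2", GetType(String))
        dt.Columns.Add("Note", GetType(String))
        dt.Rows.Add("A", "1", "first")
        dt.Rows.Add("A", "1", "repeat of the row above")
        dt.Rows.Add("A", "2", "different Column2")

        ' Group on a key built from both columns and keep the first row of each group.
        ' The "|" separator is an assumption: it must not appear in the data.
        Dim unique As DataTable = dt.AsEnumerable().
            GroupBy(Function(r) r("Column1").ToString() & "|" & r("Column2").ToString()).
            Select(Function(g) g.First()).
            CopyToDataTable()

        Console.WriteLine(unique.Rows.Count) ' prints 2
    End Sub
End Module
```

GroupBy hashes the key, so each row is visited once instead of being compared against every other row.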

I need to remove duplicate rows, where “duplicate row” is defined by multiple custom criteria.

Please state where I am supposed to key in these custom criteria/functions. Thanks.

@DEATHFISH

Can you tell me the condition (I mean the custom criteria)? Without knowing the condition we can't provide a correct solution.

The Remove Duplicate Rows activity only removes rows that are exact duplicates. You can't specify any condition there.

For your problem we have to write a query, and for that we need the condition (the logic).
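Just as a sketch of what such a query could look like, assuming the condition is "first name + last name is the same as last name + first name": normalize both values into one key and group on it. The column names FirstName and LastName, the helper NameKey and the sample rows are all hypothetical, and "|" is assumed not to occur in the names.

```vb
Imports System
Imports System.Data
Imports System.Linq

Module SwappedNameSketch
    ' Hypothetical helper: builds a key that ignores case, surrounding spaces
    ' and the order of the two name parts.
    Function NameKey(row As DataRow) As String
        Dim a As String = row("FirstName").ToString().Trim().ToLowerInvariant()
        Dim b As String = row("LastName").ToString().Trim().ToLowerInvariant()
        ' Put the two parts in a fixed (ordinal) order so "john|smith"
        ' is produced for both orderings.
        Return If(String.CompareOrdinal(a, b) <= 0, a & "|" & b, b & "|" & a)
    End Function

    Sub Main()
        ' Invented sample data for illustration.
        Dim dt As New DataTable()
        dt.Columns.Add("FirstName", GetType(String))
        dt.Columns.Add("LastName", GetType(String))
        dt.Rows.Add("John", "Smith")
        dt.Rows.Add("Smith", "John")
        dt.Rows.Add("Jane", "Doe")

        ' Rows with the same normalized key are treated as duplicates;
        ' the first row of each group is kept.
        Dim unique As DataTable = dt.AsEnumerable().
            GroupBy(Function(r) NameKey(r)).
            Select(Function(g) g.First()).
            CopyToDataTable()

        Console.WriteLine(unique.Rows.Count) ' prints 2
    End Sub
End Module
```

This only works when the condition can be turned into a single key per row. A rule like "3 out of 5 fields are the same" cannot be expressed as one key, so it would still need some pairwise comparison.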

I'm also interested in a solution for this, as I have a similar requirement. Please update me if you get a breakthrough.