Get count of duplicates from a datatable

Hi Everyone,

Need some help on the below scenario.

I have a datatable which has around 10 columns and around 400K ~ rows.
I want to get the count of duplicates from the datatable.

I tried some linq queries consisting of sequenceequals, itemarray to itemarray comparison but did not get the expected output also the execution time is huge due to may be number of rows.

Can anyone suggest me good solution for this.

Thanks!!

Hi,

Can you elaborate with specific sample?

Regards,

@Dhruvi_Arumugam

Please provide a sample data…alternately please tell if you want duplicates based on any specific column or all columns?

Cheers

@Yoichi

Thanks for the quick response.

Attached is some sample data.

I cannot attach the exact data so I have tried to prepare some mock.

I want the count of duplicate records from the datatable.
I have 10 columns and 400k to 500k rows, I want the number of duplicates which exist in this data i.e. exact matching records.
sample.xlsx (34.1 KB)

Thanks

Thanks for the quick response.

I have attached the data in my below post.
I have the use case for both the scenarios you mentioned but firstly I am checking for all the columns.

Thanks

1 Like

Hi,

Which column(s) do you want to check duplication? And what is expected value of the above sample?

Regards,

Hi @Dhruvi_Arumugam
*Read the excel using read range.
*Take row count count1=dt_input.Rows.Count
*dt_input=dt_input.DefaultView.ToTable(True)
It will remove the all the duplicate rows
*Take row count after deleting duplicate count2=dt_input.Rows.Count
*Count1-Count2. It will give you the duplicate record count.

Hope this helps

@Dhruvi_Arumugam

One way if you only want to count then

Nonduplicatedt = dt.DefaultView.ToTable(True,"Column1","Column2") will give you non duplicate rows based on the specified two column…if you want on all columsn then dont specify any column then it would consider all columns for duplicates

now countofduplicates = dt.rowcount - nonduplicatedt.rowcount will gvie you the duplicates count

Hope this helps

cheers

All the columns currently, and on the expectation point its that I should get the count of duplicates if out of 10 records only 4 are identical then I should get 4 i.e. the duplicate count similar like we have in excel

Your sample EXCEL shows more then 10 Columns

so it looks like the case of a subset of columns are used for the duplication identification. Just to start with a first step - Key by ColValue Concatenation Approach

arrDupliColFilter = new String(){“ColName1”, “ColName2”,…“ColName10” }

dtColFiltered = YourOrigDataTableVar.DefaultView.ToTable(false, arrDupliColFilter)

dtReport = dtColFiltered.Clone

Add DataColumn - Colname: “Count”, ColumnDataType = Int32

(From d in dtColFiltered.AsEnumerable
Let ck = String.Join(“_”, d.ItemArray)
Group d by k=ck into grp=Group
Let ra = grp.First().ItemArray.Append(grp.Count).toArray
Select r = dtReport.Rows.Add(ra)).CopyToDataTable

When execution time will fall in a too long running range, then in a next step we would check to optimize the 10 col key construction, used for grouping the data

@Dhruvi_Arumugam

Please try the above method and let us know if you find any issue

cheers

Hi @Anil_G @Yoichi @ppr

Apologies for giving sample file with extra columns. I have revised the file with only 10 limited columns please find the updated file. Please let me know your valuable inputs
sample.xlsx (32.4 KB)

Thanks

for a simple report / count you can use the method from Anil

we would modify by: dt.DefaultView.ToTable(True)

For the more detailed report described here:

…

we just modify:
dtReport = dtOrigDataTableVar.Clone

Add DataColumn - Colname: “Count”, ColumnDataType = Int32

(From d in dtOrigDataTableVar.AsEnumerable
Let ck = String.Join(“_”, d.ItemArray)
Group d by k=ck into grp=Group
Let ra = grp.First().ItemArray.Append(grp.Count).toArray
Select r = dtReport.Rows.Add(ra)).CopyToDataTable

I will test and let you know,
whereas in the below query how will I get the count?

(From d in dtOrigDataTableVar.AsEnumerable
Let ck = String.Join(“_”, d.ItemArray)
Group d by k=ck into grp=Group
Let ra = grp.First().ItemArray.Append(grp.Count).toArray
Select r = dtReport.Rows.Add(ra)).CopyToDataTable

Thanks

part will bring in the group member count ~ no. of duplicated rows related to the set of columns used for grouping the data (group by - grouping key)

Hi @ppr

I tried the below approach

but the bot is not ending and running endlessly(may be due to large data taking enormous time)
can you or @Anil_G suggest some better approach?

Thanks

Hi @Yoichi

Waiting for your valuable inputs as well if you have any?

Thanks

@Dhruvi_Arumugam

Can you try this

Duplicatecount = Dt.AsEnumerable.GroupBy(function(x) x("Column1").ToString+x("Column2").ToString).Where(function(x) x.Count>1).Sum(function(x) x.Count-1)

Include more columns if needed, if you want all columns try with x directly

Cheers

Can you elaborate this please

Thanks

@Dhruvi_Arumugam

If you want all columns to be included in finding duplicates then use as below

count = Dt.AsEnumerable.GroupBy(function(x) String.Join(",",x.ItemArray.Select(function(y) y.ToString))).Where(function(x) x.Count>1).sum(function(x) x.count-1)

If including only x then we are not getting correct count so changed it a bit to get itemarray of all columns and used in group by

cheers