[LINQ] - Remove Duplicate Rows with Multiple Unique Identifier Columns from Multiple Data Table

Good day fellow UiPath RPA Developers,

Is it possible to remove a transaction row (with only certain required columns) with duplicate(s) on another data table without merging the data tables? The reason why it cannot be merge and there are multiple data tables is that I only down-scaled. Assuming they will be merge on large-scale transaction, Excel and UiPath cannot handle more than 1 million rows that is why I divided the transaction rows with multiple data tables.

These are the input data tables:
image

And this is the EXPECTED OUTPUT:
image

Note: Only ‘UID’, ‘INV_NO’, ‘TXN_AMOUNT’, and ‘TRACE_NO’ columns are the required columns for determining the duplicates.

Is this possible to implement this on a LINQ approach? Any kind of help is much appreciated!

Best Regards,
Anthony Jr.

Hello @anthonyjr ,

Use the below LinQ :

(From itemsA in firstDT where( From itemsB in secondDT where string.Join(“,”,itemsB.ItemArray).Equals(string.Join(“,”,itemsA.ItemArray))
Select itemsB).ToArray.Count>1 Select temsA).CopyToDatatable.DefaultView.ToTable(True)

It will remove the duplicates and provide you the Unique records.

Regards,
Rohith

@anthonyjr ,

There’s one more solution for this if your data is huge then you can use excel as Data Source and execute SQL queries on it.

Please refer to the below thread for the same:

Regards,
Rohith

Thanks sir @rohith.prabhu

I think this will be only applicable for 2 data tables but it gave me an idea how to start the process and try to twist it up.

Thanks again!

Regards,
Anthony Jr.

1 Like

Thanks again sir @rohith.prabhu ,

I’ll look into and discover it. Thanks for the reference!

Best Regards,
Anthony Jr.

1 Like

We have some doubts that we would like to clear.

The current description is mentioning duplicates from dt-N in dt-N+1
Is this guaranteed or could it be also the case that dt-N has a row that is duplicated in dt-N+2,3,4… ?

As far we had understood a large data source has been split into x datatables. Thats why different data tables have to be checked, right?

to make it simpler you can merge all data into single DT (clear old DT to save diskspace) and remove duplicates in single query

“The current description is mentioning duplicates from dt-N in dt-N+1
Is this guaranteed or could it be also the case that dt-N has a row that is duplicated in dt-N+2,3,4… ?”

It is possible to have N+2,3,4 and so on depending on how large the original data source.

“As far we had understood a large data source has been split into x datatables. Thats why different data tables have to be checked, right?”

Yes, you are right sir. We are splitting large data source that is why it is splitted into x data tables.

Best Regards,
Anthony Jr.

We already tried that because some of our data reach up to more than 1 million transaction rows and has more than 20+ columns. It didn’t work with that kind of data and UiPath went to job stop.

than you have to check for duplicates in all the combinations available,

eg, say you have 4 tables:- need to check duplicates for 1 to 2, 2 to 3, 3 to 4, 4 to 1, 1 to 3, 2 to 4 so this will be better approach,
if you insert query inside another query will might throw job error for memory utilization.

to get the list of combinations you can refer this thread: c# - Get all pairs in a list using LINQ - Stack Overflow

not sure if this question was get and answered by you. Let me rephrase:

Your sample had not the case that table1 has a rows which is duplicated in Table3. Could this scenario also happen?

However to preview on solution approaches (when split tables are to kept)

  • loop over the tables and track the IDs in a Dictionary as unique Keys - dictLKID
  • retrieve IDs from checked table, calculate the duplicated IDs by using the dictLKID and remove the duplicated rows

The reason why we suggest this Dictionary approach is the outcome from another Topic where we reduced the processing time from 24+ hrs to seconds

1 Like

Sorry my English comprehension is not that good.

Yes it is also possible that Table1 has a rows which is duplicated in Table3. This scenario could happen.

Thank you sir @ppr ! I’ll look and discover the reference you suggested.

Best Regards,
Anthony Jr.

I see, I’ll discover this approach and try to test if this can be applicable on the process.

Thank you sir @megharajky for you recommendation!

Best Regards,
Anthony Jr.

1 Like

@anthonyjr
Kindly note the essential difference between the
dictionary approach and the Cartesian product approach (1 to 2, 2 to 3, 3 to 4, 4 to 1, 1 to 3, 2 to 4)

with the Dictionary approach and tracking the IDs once seen the heavy compare pairs can be ommited