Remove Duplicate rows based on single column and that too if cell starts with specific values

Hi,

I need to remove duplicates from a excel/DataTable using LINQ. Specifically, I need to remove 2nd occurrence duplicates for a single column(column2), maintaining other columns duplicates as it is…
Very imp is in 2nd column I need to keep other duplicates but need to remove only some special duplicates, which starts with some fix string (e.g. MN-J).

remove duplicates - removing all duplicate rows (not useful)
default view - not useful

Probably can be done using regex or linq.

Thanks in advance for your help!

Hi @S_Nitin ,

Could you maybe provide us with a Sample data and the Expected Output for that data, this way we will be able to help quicker and provide appropriate suggestions.

use assign activity and mention like this

NewDT  = yourDataTable.AsEnumerable().GroupBy(Function(row) row("Column2")).SelectMany(Function(group) If(group.Key.StartsWith("MN-J"), group.Take(1), group)).CopyToDataTable()

Hope this helps

Cheers @S_Nitin

Data.xlsx (10.1 KB)

Hi @S_Nitin ,

From the Explanation and the Input/Expected Output, it seems that you would want to Group By based on two columns (First & Second).

Could you maybe try the below and check :

OutputDT = DT.AsEnumerable.GroupBy(Function(x)x(0).ToString+x(1).ToString).Select(Function(x)x.First).CopyToDatatable

HI @Palaniyappan , Thanks for reply,
Query ran with some modification:
NewDT = yourDataTable.AsEnumerable().GroupBy(Function(row) row(“Column2”)ToString.Trim).SelectMany(Function(group) If(group.Key.StartsWith(“MN-J”), group.Take(1), group)).CopyToDataTable()

But it has grouped all similar value in column 2, we are not supposed to move other values actually. I am even not sure it’s possible or no

Hi @supermanPunch,
It is partially correct but it is deleting some other rows too.

actually let me explain why i want to delete those duplicates as it is transaction ID and column A is data for that, so if transaction id is repeating it might be taken as another transaction …hence i thought to remove those duplicate …

from 1 id to other new id all data should be of 1 transaction only

is there any other way for it ?

Let me redefine my issue
Delete the duplicate in column 2 but check the respective column 1 values too while deleting if column 1 value is different then dont delete and if column 1 value is same then delete that too.

Hi @S_Nitin

Check out the XAML file

11.10.2023_Forum_3.xaml (11.8 KB)

Regards
Gokul

Hii
@S_Nitin
Please use this LinQ

DT.AsEnumerable.GroupBy(Function(a) Tuple.Create(a(“Column Name1”).ToString,a(“C.Name2”).ToString)).Select(Function(b) b.First).CopyToDataTable

Cheers…!

1 Like

@Dilli_Reddy -
Thanks, I tried this query it is partially working; Means, along with column 2 duplicates it is also deleting column 1 exact duplicates. ; However, this is useful for me in other scenarios. :slight_smile:

Thanks !

hi @Gokul001
Thank you for your reply !

I tried but it is deleting other cells too, which are unique.
I will try to use similar logic., with some modifications.

Based on this input provide by you i have created the logic for that.

Is it working with tha above inpu file?

@Gokul001

Yes, working for this file and actual data too; however, it is arranging all the codes on collumn2. Hence unable to recognize what column1 values belong to which Col2 codes.

actually let me explain why i want to delete those duplicates as it is transaction ID and column 1 is data for that, so if transaction id is repeating it might be taken as another transaction …hence i thought to remove those duplicate …

from 1 id to other new id all data should be of 1 transaction only

Code Output-
image

Expected Output:
image

@S_Nitin ,

Could you provide us with the Data Scenario where the Logic provided does not work and provide us the Expected Output for that Data. We will be able to help you out faster and with accurate logic once we get the data samples.

Have you revied this Workflow @S_Nitin

  1. Will Filter the Column2 with contains M-IND store that in the Dt_Filter_M_IND.

  2. Will Filter the Column2 with not contains M-IND store that in the Dt_Filter.

  3. using LINQ expression i will Keep the First Row and element the second one. → DT_Dupli

  4. Use Merge data table - Source -> Dt_Filter and in the Destination - DT_Dupli

Hi @Gokul001 ,
Thank you for your time!
I have gone through the .xaml you have provided and understood too. It works! It is not deleting anything from col1 and deleting mentioned specific duplicates from column 2… Agreed !
But , it is appending col2’a imp data rows at the top, Hence it is difficult to understand the which rows belong to which special code in column 2.

Here, as per pic apple ,sea, Monday is the transaction data for M-IND.45.VC (id)

e.g.
image

Pic attached to show expected output.

@supermanPunch
Thanks for your time Arpan,

I have added 1 more row to my excel and explained the issue too.
Data.xlsx (138.1 KB)

Problem is it is deleting the rows which has same values in col1 and col2, so crucial data from col1 is getting deleted.

@S_Nitin ,

A Bit of analysing on the Excel sheet data submitted, can we come to a conclusion that if only all the column values are repeated in multiple rows we would require to delete it ?

If we are concentrating on the conditions of Delete part only, we arrive at that same conclusion. Because the condition of Sun, Chennai has repetition and not being removed is not specific to that value but as general condition (mentioned as this belongs to different code in col2 - could not properly get this part).

So with maybe slight modification I do get the Expected Output (with 1 row being interchanged) :

DT.AsEnumerable.GroupBy(Function(x)String.Join(",",x.ItemArray)).Select(Function(x)x.First).CopyToDatatable

Do let us know if this does not work.

@supermanPunch -
Thanks for reply!

Let me try to explain in better way-with new input file (attached)

  1. Delete row (only 2nd occurrence) if all the column values are repeated in multiple rows.
  2. But add condition like that 2) -( 2nd column value of that row must have unique value M-IND).
    If, this value is not there, keep the repeated rows as it is. e.g. Sun Chennai (has no M-IND in it)
    DataFinal.xlsx (140.0 KB)

Hi,

Finally, I achieved the expected output with vb.net code.
Invoked in UiPath.

Still, thanks you @supermanPunch @Gokul001 @Palaniyappan @Dilli_Reddy - I got to learn many linq queries by you all and inspired to upskill :slight_smile:
Maybe I was unable to put my issue properly here hence, I might not get the solution. As your give queries were working for the sample data i gave.

Code Below:

Dim rowsToRemove As New List(Of DataRow)

’ Create a Dictionary to track row counts based on column values
Dim rowCounts As New Dictionary(Of String, Integer)

’ Iterate through the DataTable
For Each row As DataRow In DT.Rows
’ Combine values from columns you want to use to check for repetition
Dim key As String = String.Join(“|”, row.ItemArray.Cast(Of Object)().Skip(1)) ’ Skip the first column (index 0)

' Check if the key is already in the Dictionary
If rowCounts.ContainsKey(key) Then
    ' This is a repeated row
    If row("ColumnNameForSecondColumn").ToString().Contains("M-IND") Then
        ' If the second column contains "M-IND," mark it for removal
        rowsToRemove.Add(row)
    End If
Else
    ' This is the first occurrence of this row, so add it to the Dictionary
    rowCounts(key) = 1
End If

Next

’ Remove the marked rows from the DataTable
For Each rowToRemove As DataRow In rowsToRemove
DT.Rows.Remove(rowToRemove)
Next