Remove duplicate rows based on conditions

Hi,
I’m trying to remove duplicate rows based on values in column ‘A’ from datatable, and from those rows that have same value in column ‘A’, I need to keep the one that contains value in column ‘B’. (if all duplicate rows are NULL in column ‘B’, then just keep one)

For example, my datatable looks like:
EmployeeID Department
2 A
6 B
1 C
5
3 A
3 A
3
1
4 B
5

After the operation, it will become:
EmployeeID Department
2 A
6 B
1 C
5
3 A
4 B

(Note: my datatable has more columns than this one)

I tried this query:
myDT.AsEnumerable().GroupBy(Function(x) convert.ToString(x.Field(of object)(“EmployeeID”))).SelectMany(function(gp) gp.ToArray().Take(1)).CopyToDataTable
But it keeps the row that contains NULL in column ‘B’. (I’m not quite sure why. Is it because the rowindex of these rows are smaller than those I wanna keep?)

Thanks in advance, I really appreciate your suggestions.

Heres an example sequence. hope it helps
(the below is the input and output of this sequence)

i have attached my sequence (, its called test.xaml) and input excel below
test.xaml (8.4 KB) example.xlsx (8.4 KB)

Sequence
image
Arguments
dt = datatable of input
listOfReadIDs = list to store all the unique ids that robot has already read


newdt = the output DT

flow

  1. read input excel into dt variable
  2. clone headers of dt into newDt
  3. for each row in dt, check if list contains employeeID, if it does then skip row, if not add row to newDt and to the list
    image
  4. write newDt to result excel

Use this below linq query, @JinGe
(From p In TestDt.Select() Group p By ID=New With { Key.a =p.Item(“EmployeeID”),Key.b=p.Item(“Department”)} Into Group Select Group(0)).ToArray.CopyToDataTable()

Hope this may help you :slight_smile:

Thank you for your reply. I tried this and found that if the order of input data changed like below, it will output a row with ‘3, Null’.
1612416226(1) 1612416531

@JinGe So, you require only “5, Null” and other data???..
And you don’t need any value with “3, Null” and so on…

Thank you. I tried but the query still filtered out the rows I need to keep.

Yes. I don’t need any value with “3, Null” because there is at least one row that indicates Employee 3 is in Department A. And I need “5, Null” because there is no row in the data table indicating that Employee 5 belongs to any department.

use this sequence instead,
test2.xaml (12.5 KB)
heres the result (rows rearranged)

Check this below workflow, @JinGe
Uipath_RemoveDuplicates.xaml (11.6 KB)
Hope this may help you :slight_smile:

@JinGe
the gamechange is the requirement of the handling the department blanks. Maybe we can reformulate the requirements as following:

  • Output the for each employee all distinct Department associations
    • if there is a missing association for the employee:
      • output the missing departement association only if the member does not have any other department association
  • distinct/Duplicate removal handling should work even when there are additional columns along with department,employeeID

Following LINQ should do this and also the initial row order is set:

(From d In dtData.AsEnumerable
Group d By k=d("EmployeeID").toString.Trim Into grp=Group
Let gf = grp.Where(Function (x) Not String.IsNullOrWhiteSpace(x("Department").toString.Trim)).DefaultIfEmpty(grp.First).toList
From g In gf
Group g By k1=g("EmployeeID").toString.Trim, k2=g("Department").toString.Trim Into grp2=Group
Let gf2 = grp2.First
Order By dtData.Rows.IndexOf(gf2)
Select gf2).CopyToDataTable

grafik

1 Like

Magic! Thank you so much! Using linq query really improved the performance a lot since I got a quite large dataset.

Thank you for your patience, your solution also works for me!

Thank you for your patience, this one solved the order issue!

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.