Remove duplicate rows based on specific columns?

Hi all,

I am trying to remove duplicate rows from an input file being processed by my bot.

How can you remove duplicate rows based on specific columns being identical and not all?

For example, I have to filter out duplicate rows in a data file where only criteria such as NAME + POLICY NO + START DATE are identical, to only keep 1 row. Regardless of what’s in the other columns.

For example.

FIRST NAME | SURNAME | POLICY NO | START DATE | ADDRESS | NOTES
John | Identical | 12345 | 01/01/2021 | 12 Test St | Dummy text here1
John | Identical | 12345 | 01/01/2021 | 12 Potato St | Dummy text here2
John | Identical | 12345 | 01/01/2021 | 12 Fly St | Dummy text here3
John | Identical | 12345 | 01/01/2021 | 12 Weird St | Dummy text here4

Remove all except 1 row… ???

FIRST NAME | SURNAME | POLICY NO | START DATE | ADDRESS | NOTES
John | Identical | 12345 | 01/01/2021 | 12 Test St | Dummy text here1

2 Likes

HI @Jon_G

Have you tried with Filter Datatable Activity?

Regards
Sudharsan

Hey, I have tried Filter activity but not sure how to fill this in to check for duplications. I could only see if column value = …

Hi @Jon_G

Try these activities

  1. Remove Duplicate Rows (or)
  2. Remove duplicates Range (Where you need to specify Range)

Hope it helps…

Regards
Sudharsan

@Jon_G

This can be done as follows

image

Instead of Build Data Table activity use the Read Range Activity to read the input excel file into the data table variable (inputDT)

The LINQ used is given below

(
	From row In inputDT
	Group row By 
	k1 = row("FIRST NAME").ToString.Trim,
	k2 = row("SURNAME").ToString.Trim,
	k3 = row("POLICY NO").ToString.Trim,
	k4 = row("START DATE").ToString.Trim
	Into grp = Group
	Select grp(0)
).CopyToDataTable

Finally you can write the output in an excel file using the Write Range Activity.

Please refer the attached xaml file

Remove duplicate rows based on specific columns.xaml (9.7 KB)

3 Likes

Wow you’re a legend! This works flawlessly. Thankyou.

Not familiar with LINQ so this is something I might try to learn more about. I’m sure this is a very common function that others will appreciate your solution.

Adding to this… how can I add something to this LINQ so that I can keep the row that contains the MAX or MIN value in the START DATE field?

@Jon_G

Bit confused about your requirement. Since we are looking at records which have duplicate START DATE, we cannot apply Min or Max on it.

Please provide an input and output example about your requirement.

Also, since this topic is closed raise it in a separate topic.

For your reference

There are Min and Max methods that can be used for getting a minimum and maximum value.

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.