Remove Rows with similar string value from a datatable

Hi,
can anyone help me wit the following problem
I open a excel workbook and store it into a data table. This data table can contain a large number of rows. Each of these rows contains a code in one of the columns - lets call it product code. The trick is that we don’t know what that product code means so we have to look it up. Since we have 5k+ lines we cannot lookup one code at the time, but there is a trick: codes that have a similar beginning or ending we can be 99% confident they are the same product \ but just some variation of it. Hence, we can say that 90% of the data are only repeated products but with a slightly different code - which we can now lookup and apply the same product name to all similar product numbers:
Example
234256756 - we can look this up and find that this is an u iPhone
------------------------- - and say that with 99% confidence the codes below are the same product
234251232
234253423
234255352
234259983
x3432343332 - except this one that is starting a new sequence codes for the same product
x3432343356
x3432343434

so if you can see that there will be a similarity in the first 50% or last 50% of characters as well as the overall product code length.

My approach was to:

  1. create a datatable where i will store only the unique codes - get the code for the first row and assign it to a variable - previousCode and add the first code to a data table called unique codes.
    Once i have that code in a for each row i move to the second row and assign the new code to a variable called current Code and check if the two strings match like:
    if previousCode.substring(0,cint(0,5previousCode.Length)) = currentCode.substring(0,cint(0,5currentCode.Length)) And - similar logic to check the last part as well (you understand the point i guess). if this criteria is matched we can say that this is not an unique code but a variation of the previous one. if there is no match this is new code that we add to the unique codes data table.

This works but its ridiculously slow. Is there any way to filter data tables based on similar but not same string.

@eddeta
Welcome to the forum

as a quick shot maybe a special GroupBy with LINQ can help:

From your data:
grafik

a statistic with Keys (first half of the Id) and its count can be generated:
[Key,Count
2342,5
x34323,3
]
and it is already filtering out the only 1 occurences

a short key full Key report can be generated (without 1 occurences Keys)
[Key,FullKey
2342,234256756
2342,234251232
2342,234253423
2342,234255352
2342,234259983
x34323,x3432343332
x34323,x3432343356
x34323,x3432343434
]

So an approach could be

  • lets generate the report (2time one for the start, one for end partial key)
  • use the found keys for further filtering and corrections

Find demo XAML here:
GroupBy_1Col_PartialKey.xaml (11.2 KB)

1 Like

This seems to be working fine! Now the only question is if you know how i can efficiently turn back the results to each line of the original data table - once i find the product names from the external website?

@eddeta
I am not sure if I understood your question rigth in total. Can you please help me and formulate it different. Also tell me which apprioach (key, sum Or part Key, Key) you have focused?

When it is about on how to correct the product names by a value for the different IDs, then I would suggest:

  • Requirement Analysis and estimation on average number of Correction Rows per Batch
  • input: Product Key (full key) new name
  • initial solution approach: filter for row(s) and update

But let us share some more details on the requirements and we will find a matching solution

Hi,
thanks for all the inputs and for still sticking with the topic.
I was focusing on the product key solution since the number of all products that are similar is irrelevant at the moment - i need to find their names, not to count them.
So i tried to use the linq solution you gave me but ran into some problems. I don’t use linq almost never so not too experienced with that.
The main issue i encountered now is that the variable part of the name is not only the beining, but can be the middle part as well meaning:
123456
123457
123468 - are the same since the 1234 part is the same
but look at this
12345_genericProcuct
12353_genericProduct
14221_genericProduct - this is still grouped as the same but its not - the left part has changed after and the right had a constant part of the string.
i was tinking to introduce something like the edit distance model to calculate the exact percentage of string that is different rather then just checking the 50% of the string from the left.

another topic is that i need to create keys and keep track of them to be able to vlook up them later to a table that i will enrich with the same calculated key column.

So imagine i have a sorce data - where i need to add another column with product code key - this is calculated based on the product code similarity %
After i ran through this table i copy out the keys and filter them so i am left with only 1 entry of each unique (each product key will contain only 1 compleete product key used for searching).

I have now a second datatable which i am going to feed into the product management software to get the product names. Once i get the names for each complete product code, they will be written into the table, in which i have also the product key stored.

So now i will need to find the way to feed the product names from the unique product keys table to the source data table that needs to be enriched - so i will have 1 product code name matching 1 product code key will need to be written into multiple rows matching the product key code. Probably lookup type of operation will do the job.

it is hard to follow, just provide some data samples illustrating your input and exptected output along with short clear description. Thanks

1 Like