A set of activities that get or merge similar rows or columns in a datatable based on fuzzy search (similarity)

Hello guys, I'm thinking of an idea for a set of activities that helps to merge or get duplicate rows or columns based on fuzzy logic (or similarity, in simple words).

Activities Category:

String:

1) Fuzzy Column Search: Obtain matching columns based on the provided string.

Properties Category:

Inputs:

a) String - The string to be compared with the columns.
b) Datatable - The datatable in which you want the string to be compared with the columns.

Options:

Algorithm: Drop Down List (Jaro Winkler Distance & Levenshtein Distance) - Jaro Winkler Distance is the default.

Match Rate: Drop Down List of percentages (10%, 20%, 30%, … 100%) - Set the desired similarity match rate (default is 80%).

Outputs:

a) Matched Col Index (int[]) - Get the index of columns that closely match the string.
b) Matched Col Name (string[]) - Get names of columns that closely match the string.
c) Average Distance (double) - Average similarity score across all the columns, from 0 to 1: 0 means no column matches the string, 1 means every column matches the string, and a value in between (e.g., 0.804554545 or 0.96854842626) indicates partial similarity.
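
As a rough illustration of how Fuzzy Column Search could behave, here is a minimal Python sketch. It assumes the string is compared against column names (the post does not say whether names or cell data are meant), that Average Distance is the mean score over all columns, and it uses a plain textbook Jaro-Winkler without the optional boost threshold. The function names `jaro`, `jaro_winkler`, and `fuzzy_column_search` are made up for this example.

```python
from typing import List, Tuple


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in the range 0..1 (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def fuzzy_column_search(text: str, column_names: List[str],
                        match_rate: float = 0.8) -> Tuple[List[int], List[str], float]:
    """Return (matched indexes, matched names, average score).

    Assumption: columns are matched by name, and the average is taken
    over every column, not only the matched ones.
    """
    scores = [jaro_winkler(text, name) for name in column_names]
    indexes = [i for i, s in enumerate(scores) if s >= match_rate]
    names = [column_names[i] for i in indexes]
    average = sum(scores) / len(scores) if scores else 0.0
    return indexes, names, average
```

For example, searching for "Invoice" among the columns "Invoice", "Invoise", and "Total" with the default 80% match rate would return indexes 0 and 1, since "Total" scores well below the threshold.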

Datatable:

1) Get Duplicate Columns: Retrieve duplicate columns in a datatable.

Properties Category:

Inputs:

a) Datatable - The datatable in which you want to find duplicate columns.

Options:

Algorithm: Drop Down List (Jaro Winkler Distance & Levenshtein Distance) - Jaro Winkler Distance is the default.
Match Rate: Drop Down List of percentages (10%, 20%, 30%, … 100%) - Set the desired similarity match rate (default is 80%).

Outputs:

a) Matched Col Index (List<int[]> or other datatype) - Get indexes of column pairs (see below for understanding of pairs).
b) Matched Col Name (List<string[]> or other datatype) - Get names of column pairs (see below for understanding of pairs).
c) Average Distance (double) - Average similarity score across all the columns, from 0 to 1: 0 means no columns match each other, 1 means all columns match each other, and a value in between (e.g., 0.804554545 or 0.96854842626) indicates partial similarity.

2) Get Duplicate Rows: Identify duplicate rows in a datatable.

Properties Category:

Inputs:

a) Datatable - The datatable in which you want to find duplicate rows.

Options:

Algorithm: Drop Down List (Jaro Winkler Distance & Levenshtein Distance) - Jaro Winkler Distance is the default.
Match Rate: Drop Down List of percentages (10%, 20%, 30%, … 100%) - Set the desired similarity match rate (default is 80%).

Outputs:

a) Matched Row Index (List<int[]> or other datatype) - Get indexes of row pairs (see below for understanding of pairs).
b) Average Distance (double) - Average similarity score across all the rows, from 0 to 1: 0 means no rows match each other, 1 means all rows match each other, and a value in between (e.g., 0.804554545 or 0.96854842626) indicates partial similarity.

3) Merge Duplicate Columns: Merge duplicate columns in a datatable.

Properties Category:

Inputs:

a) Datatable - The datatable in which you want to find duplicate columns.

Options:
Algorithm: Drop Down List (Jaro Winkler Distance & Levenshtein Distance) - Jaro Winkler Distance is the default.
Match Rate: Drop Down List of percentages (10%, 20%, 30%, … 100%) - Merge only if the similarity is at least the given percentage (80% is default).

Outputs:

a) Matched Col Index (List<int[]> or other datatype) - Get indexes of column pairs (see below for understanding of pairs).
b) Matched Col Name (List<string[]> or other datatype) - Get names of column pairs (see below for understanding of pairs).
c) Average Distance (double) - Average similarity score across all the columns, from 0 to 1: 0 means no columns match each other, 1 means all columns match each other, and a value in between (e.g., 0.804554545 or 0.96854842626) indicates partial similarity.
d) Datatable - Output Datatable after merging.

4) Merge Duplicate Rows: Merge duplicate rows in a datatable.

Properties Category:

Inputs:

a) Datatable - The datatable in which you want to find duplicate rows.

Options:

Algorithm: Drop Down List (Jaro Winkler Distance & Levenshtein Distance) - Jaro Winkler Distance is the default.

Match Rate: Drop Down List of percentages (10%, 20%, 30%, … 100%) - Merge only if the similarity is at least the given percentage (80% is default).

Outputs:

a) Matched Row Index (List<int[]> or other datatype) - Get indexes of row pairs (see below for understanding of pairs).
b) Average Distance (double) - Average similarity score across all the rows, from 0 to 1: 0 means no rows match each other, 1 means all rows match each other, and a value in between (e.g., 0.804554545 or 0.96854842626) indicates partial similarity.
c) Datatable - Output Datatable after merging.

// My Post Content Starts Here

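What is fuzzy search?
Fuzzy search uses algorithms such as Jaro-Winkler or Levenshtein to compare two strings and measure the distance between them, so near-identical values can be treated as matches even when they are not exactly equal. More about it here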
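
To make the idea of "distance" concrete, here is a minimal Python sketch of the Levenshtein distance and one common way to turn it into a 0 to 1 score that can be compared against a Match Rate. The function names and the scaling by the longer string's length are assumptions for illustration, not part of the proposed activities.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution, or free if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

With this scaling, "Invoice No" and "Invoice Num" are two edits apart and score about 0.82, so they would count as a match at the default 80% rate.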

Explaining the Pairs:

Suppose we want to get duplicate columns. The activity takes any (important: any) first column it encounters as the starting point for a new pair. It goes through the rows in that column, stores the data, moves to the next available column, and checks whether that column's data matches the first column according to the configuration provided in the properties. If it matches, the column is stored in that pair and the process repeats; if it does not match, the activity moves on to the next available column. Once all columns have been checked in the first round and the similar ones added to the pair, the resulting indexes (e.g., 0, 3, 9, 8) are considered Pair 1. The columns used in Pair 1 are then removed from the temporary copy of the datatable, and a second round starts with a new first column to build Pair 2 (e.g., 1, 5, 7). This way, the activity identifies all duplicate columns and returns a List<Int> for Col Index and a List<String> for column names.
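
The rounds described above could be sketched roughly like this in Python. Several things here are assumptions, not part of the idea as posted: the datatable is represented as a list of columns, two columns are compared by averaging their cell-by-cell scores, a column with no match is not reported as a pair, and `difflib.SequenceMatcher` from the standard library stands in for Jaro-Winkler or Levenshtein to keep the example short. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def text_similarity(a: str, b: str) -> float:
    # Stand-in scorer from the standard library (0..1);
    # it is neither Jaro-Winkler nor Levenshtein.
    return SequenceMatcher(None, a, b).ratio()


def column_similarity(col_a: Sequence[str], col_b: Sequence[str]) -> float:
    """Assumed rule: average the cell-by-cell similarity of two columns."""
    if not col_a:
        return 1.0
    total = sum(text_similarity(str(x), str(y)) for x, y in zip(col_a, col_b))
    return total / len(col_a)


def group_duplicate_columns(columns: List[Sequence[str]],
                            match_rate: float = 0.8) -> List[List[int]]:
    """Return groups ("pairs") of column indexes that look like duplicates."""
    remaining = list(range(len(columns)))  # columns not yet placed in a pair
    groups: List[List[int]] = []
    while remaining:
        seed = remaining.pop(0)            # first available column starts a round
        group = [seed]
        for idx in remaining[:]:           # iterate over a copy while removing
            if column_similarity(columns[seed], columns[idx]) >= match_rate:
                group.append(idx)
                remaining.remove(idx)      # used columns leave the pool
        if len(group) > 1:                 # a lone column is not a duplicate
            groups.append(group)
    return groups
```

Called on the four columns `["Alice", "Bob"]`, `["1", "2"]`, `["Alise", "Bob"]`, `["1", "2"]`, this groups columns 0 and 2 together and columns 1 and 3 together. Note that every column is compared only against the first column of its round, so the result can depend on which column happens to come first.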

The same approach is applied to ‘get duplicate rows’, ‘merge duplicate columns’, and ‘merge duplicate rows’.

Looking for suggestions: What could be a better datatype for the pairs? Are List<Int> and List<String> good?

Any suggestions on the idea/theory are really appreciated.