Fuzzy string matching with unclean data

I need to fuzzy match a transaction description from a bank export to a vendor list. The vendor list should be error-free, but the transaction description might contain garbage. Here’s an actual transaction I’ve seen:

Vendor list: Kum & Go
Transaction description: Kum&GO 0222 GRINNELL

I’ve read a few other threads on fuzzy matching but I didn’t see any that specifically accounted for one of the strings being unclean such as in the above.

@andy2kh

Welcome to the community

It is not a garbage value but looks like encoded string

We can use system.web.httputility.HtmlDecode("Kum&GO 0222 GRINNELL").Tolower.Contains("Kum & Go".ToLower) OrElse system.web.httputility.HTMLDecode("Kum&GO 0222 GRINNELL").Tolower.Contains("Kum & Go".ToLower.Replace(" ",""))

you can use variables instead of constant strings

image

Cheers

Hi @andy2kh ,

Could you also let us know other example data that you have come across or are present ?

We could identify from the different cases, what solution could fit more appropriately.

In my experience, the transactions typically fall within 3 categories, the first being those with encoded strings, the second that simply lists the vendor as normal, and the third which abbreviates the vendor in some way.

Vendor: Fareway
Description: PURCHASE AUTHORIZED ON 06/15 FAREWAY STORES GRINNELL P000000382698902

Vendor: Amazon
Description: AMZN.COM*783ZAFR2 A

It’s also possible that there also be some small variations in the vendor list, like if someone named the vendor “Kum and Go” instead of “Kum & Go”, or “Sams Club” instead of “Sam’s Club”, or “John Smith” instead of “John Adam Smith”.

@andy2kh

I guess then you can go with this

This uses Levenshtein algorithm to perfomr string match…

Apart from that if you have ai center then you can go with the text classification model as well…where you can train mode on different descriptions and possible set of vendor names to know the prcentage matches

Cheers

If I use Levenshtein distance, won’t that cause issues when one of the strings is unclean? For example, with the Amazon transaction I mentioned:

Vendor: Amazon
Description: AMZN.COM*783ZAFR2 A

Won’t the Levenshtein distance be unnecessarily high because of all the junk that comes after AMZN?

@andy2kh

Ideally it would not…because it looks for the group words from input

Cheerd

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.