I want to compare multiple keywords from excel sheet or datatable to Input column string from inputfile to check matching words

Hi friends,

I have one input file which consists 15000+ transaction also I have one template file which consists country with keywords. So I need to check Input column of each transaction to check any keywords from template file exist in input file. If keyword found then It will return Country name of matched keyword in Country1 /Country2 column as per match.

Note- If keyword found in Input transaction then it should be seperate word not substring of any word. E.g. (In given transaction xyz , sender- NaviMumbai)

So If searching keyword is Mumbai then…It will be false match.

Input File-

Input Country1 Country2 Country3 Country4
Please check transaction xyz from Sender-Mumbai India
Please check transaction xyz from London and Received by Surat England India
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-wuhan China
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-HK
Please check transaction xyz from Sender-Dhaka
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-Dubai
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-Berlin Germany
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-Japan
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-Islamabad Pakistan
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-Dubai
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-Egypt

Template file

Country Keywords
India IN,Mumbai,Delhi,Mysore,Surat,Hyderabad,Kolkata,Bangalore
Iraq IQ,BAGHDAD,BASRA,NAFJAF,KIRKUK,KARBALA,NASIRIYAS,AMARA
China CN,SHANGHAI,beijing,chongqing,guangzhou,tianjin,shenzhen,hangzhou,wuhan
France FRA,Paris,Lyon,Nice,Nantes,Reims,Lille
England UK,London,Manchester,Bristol,Liverpool,Southampton
Russia Moscow,Kazan,Omsk,Volgograd,Krasnoyarsk,Samara
Afghanistan AF,Kabul,kandahar,jalalabad,kandahar,herat,ghazni,khanabad
Zimbabwe Harare, Bulawayo, Manicaland, Midlands, Mashonaland West, Masvingo
Germany Berlin,Munich,Hamburg,stuttgart,essen,Leipzig
Morocco MA,Casablanca,Rabat,fez,Tangier,Agadir,kenitra,oujda,meknes
Pakistan Karachi,Lahore,Islamabad,Faisalabad,Peshawar,Multan,Bahawalpur,Sargodha
Jamaica JM,Kingston,Portmore,Montego,saint catherine,mandeville,maypen,holdharbour
Mali bamako,gao,sikasso,kalabancoro,koutiala,segou,kayes,kati,mopti,niono
Iran Hamadan, Ilam,Karaj, Tehran, East Azerbaijan, Semnan, Birjand, Bushehr
Spain Madrid,Barcelona,Seville,bilbao.valencia,cadiz,Girona

If the template is static or will be having less changes. I suggest make it as a two column data, and then read in a dictionary. This way you can retrieve the Country name of a passed city name. Eg., if getCoutry is the dictionary then getCountry(“Mumbai”) will give you India.

Now the searching part, As the City would be after the keyword Sender-, then you can create a different column for the City of each transaction and then just read the template datatable created at first, loop through it and for each element in the City column do a Lookup in the secind datatable(input data), this will give you the index of the city and then you can add the country in the corresponding field using getCountry("")

Hope this helps, Happy automating!!!

@Mansi_Mhatre
have a look on following Building blocks helping you to implement the case

Country Lookup Dictionary

for a later check a Country Lookup Dictionary will help. We can do from the Template Excel

Essential

LINQ

grafik

Demo XAML:
DTToDict_KeysFromCSVString.xaml (10.2 KB)

Tokenizing the Input String

We can tokenize each word from String with e.g Regex. Kindly note a split on space would not work e.g. for Sender-Mumbai
grafik

Country LookUp

With the help of LookUp Dictionary and the tokenized string we can do it by following


dictTokenCountry2.Keys.Intersect(arrTokens).Select(Function (x) dictTokenCountry(x)).toArray

Kindly note: we did the Prototype just on small sample data base with the intention to showcase the building blocks.

Let us know if you need further help for the implementation.

Actually I received only input string in Input file and I want output after comparing this input with template country keywords…so basically input file have only input column and i want output as previously provided snapshot (like country1,country2,etc).

Input:

Input Country1 Country2 Country3 Country4
Please check transaction xyz from Sender-Mumbai
Please check transaction xyz from Sender-London and Received by Surat
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-wuhan
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-HK
Please check transaction xyz from Sender-Dhaka

Output: after comparing

Input Country1 Country2 Country3 Country4
Please check transaction xyz from Sender-Mumbai India
Please check transaction xyz from Sender-London and Received by Surat England India
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-wuhan China
Please check transaction xyz from Sender-USA
Please check transaction xyz from Sender-HK
Please check transaction xyz from Sender-Dhaka

@Mansi_Mhatre
thanks for your feedback

the provided building blocks will do it. You have only to fill the different countries, after retrieval
grafik

in case of you need further help, maybe you can share your XAML, so we can check what was already implemented

Facing issue for keywords which are contain 2 or more words e.g. United States or Abu Dhabi or Republic of Ireland.
arrtTokens variable is splitting values on the basis of space and seperate out them so what to do for this scenario?

arrayTokens will consider them as ‘United’ ,‘States’ but it should be searching for United States.

just share these new samples, so we can have a look at it. Maybe the data can be controlled by some reliable seperator info, or other lookup lists

Country Keywords
India IN,Mumbai Hub,Delhi,Mysore,Surat city,Jammu and Kashmir,West bengal,Bangalore
Iraq IQ,BAGHDAD City,BASRA,NAFJAF Al sarf,KIRKUK,KARBALA,NASIRIYAS,AMARA
China CN,SHANGHAI,beijing,chongqing,guangzhou,tianjin,shenzhen,hangzhou,wuhan
France FRA,Paris city,Lyon,Nice,Nantes,Reims,Lille
England UK,United Kingdom,London,Manchester,Bristol,Liverpool,Southampton
Russia Moscow,Kazan,Omsk,Volgograd,Krasnoyarsk,Samara
Afghanistan AF,Kabul,jalalabad,kandahar,herat,ghazni,khanabad
Zimbabwe Harare, Bulawayo, Manicaland, Midlands, Mashonaland West, Masvingo
Germany Berlin,Munich city,Hamburg,stuttgart,essen,Leipzig

Like for Mumbai Hub or West bengal or Baghdad city keywords.

And how to search in particular keywords i.e.
dictTokenCountry2.Keys.Intersect(arrTokens).Select(Function (x) dictTokenCountry(x)).toArray
directly string not in array. When i try to replace arrTokens with input string it is getting error.
I want to search keywords directly in data.rows(0)(0).tostring i.e. input string instead off adding it in separate array variable for each word.

Input data is in unstructured format so can we directly search Keyword dictionary keys with input string instead of array(arrTokens)…and store result in array…just need to check if any keys exist in the string and return matched key values in array. No problem if it will return output for substring keywords found. E.g. keyword is Mumbai and string is I am from NaviMumbai. Then also it should return output with country name.

can you give the sample of unstructured data?
Give the data as is (replace any confidential info with junk data)

This is confidential information so we cannot share address also as it contains customer details…you can refer previous input table and consider input data as 2-3 address and some currency details instead of that previously provided input string(Please check transaction xyz from Sender-Mumbai). I received 20-30k data daily. Previous solution is will only applicable if keyword is single word e.g. Mumbai, IN or Dubai but not for keywords with 2 or more word like Abu Dhabi or united states of america or north korea.

I need sample data to identify the junk data

junk data is highlighted in bold below
Please check transaction xyz from London and Received by Surat
Please check transaction xyz from Sender-USA

you can replace confidential data with any junk data like actual name with celebrity names, address with “Street Addres1”
Basically I want starting and ending anchors to identify the junk data, so that we left with only required data

There is no specific format for address and keywords can be found anywhere inside the input string. Keyword or keywords can be found at the start or mid or end and for most of the cases keywords not found inside the input string.

Input demo for reference(not fixed):
JRHISXXXX AL XYZ BANK( HEAD OFFICE) RIYADH SA Date: 11 July 2019 Currency : INR(INDIAN RUPEE) Currency: USD(US DOLLAR) ROHIT SHARMA, LLC 264 SIDDHI VINAYAK APART NERUL, NAVI MUMBAI IN 400703 78 AJINKYA RAHANE H.N.B-107 SUPER TECH VALLEY GANESH NAGAR KOLIWADA DOMBIVLI INDIA 89

Something like this i will receive in input string. But address or details will be different for each transaction.

so from this you need “RIYADH SA” and “INDIA”… right ?

I want any country names if any keywords found inside string from below sample keyword list.

Country Keywords
India IN,Mumbai Hub,Delhi,Mysore,Surat city,Jammu and Kashmir,West bengal,Bangalore
Iraq IQ,BAGHDAD City,BASRA,NAFJAF Al sarf,KIRKUK,KARBALA,NASIRIYAS,AMARA
China CN,SHANGHAI,beijing,chongqing,guangzhou,tianjin,shenzhen,hangzhou,wuhan
France FRA,Paris city,Lyon,Nice,Nantes,Reims,Lille
England UK,United Kingdom,London,Manchester,Bristol,Liverpool,Southampton
Russia Moscow,Kazan,Omsk,Volgograd,Krasnoyarsk,Samara
Afghanistan AF,Kabul,jalalabad,kandahar,herat,ghazni,khanabad
Zimbabwe Harare, Bulawayo, Manicaland, Midlands, Mashonaland West, Masvingo
Germany Berlin,Munich city,Hamburg,stuttgart,essen,Leipzig

If any keywords found then it should return country name of that keyword
Else skip the transaction it no keywords found.

BlankProcess3.zip (16.6 KB)
Check if this works for you
You will have to make some adjustment for keyword though

I want to create 2 dictionary for country.

Like you created dictTokenCountry dictionary.
But I want 2 dictionary , one should have only values which contain single word and another dictionary will contain dictionary of more than one word.

Country Keywords
India IN,Mumbai Hub,Delhi,Mysore,Surat city,Jammu and Kashmir,West bengal,Bangalore
Iraq IQ,BAGHDAD City,BASRA,NAFJAF Al sarf,KIRKUK,KARBALA,NASIRIYAS,AMARA
China CN,SHANGHAI,beijing,chongqing,guangzhou,tianjin,shenzhen,hangzhou,wuhan
France FRA,Paris city,Lyon,Nice,Nantes,Reims,Lille
England UK,United Kingdom,London,Manchester,Bristol,Liverpool,Southampton
Russia Moscow,Kazan,Omsk,Volgograd,Krasnoyarsk,Samara
Afghanistan AF,Kabul,jalalabad,kandahar,herat-e-shama,ghazni,khanabad
Zimbabwe Harare, Bulawayo, Manicaland, Midlands, Mashonaland West, Masvingo
Germany Berlin,Munich city,Hamburg,stuttgart,st.essen,Leipzig

From the given keywords.
Single word dictionary should contain only one word.
e.g. IN, Delhi,Mysore,SHANGHAI,herat-e-shama,Harare st.essen,etc.

Another dictionary which contain more than one word.
Mumbai Hub, Jammu and Kashmir, United Kingdom,etc.

Each keyword will in Keywords column will be seperated by comma.