Need help in splitting text

I have a dynamic .txt file that contains job advertisements which follow a specific pattern. The pattern is as follows: “Company Name has a vacancy for the occupation of DRIVER , suitably qualified applicants can contact 1717171717 or CompanyNAme@GMAIL.COM ”. The company name and all other bold details are different in each advertisement, and each advertisement ENDS with an email address. The .txt file contains continuous text without line breaks or delimiters. I need to break the text after every email address so that each advertisement stands alone, and output the results to an XLS file. Can anyone help, please?
jobs-test.txt (22.2 KB)

Hi @Anwar_Mirza

Please try below and see if this helps.

Let’s say your string is in variable “jobs” and Arr is an array of strings.

Arr = jobs.Split({".com"}, StringSplitOptions.None)

Then you can loop through the array and refer to each of the values. You may have to concatenate ".com" back onto the end of each value, because Split removes the delimiter.
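The split-then-re-append idea above can be sketched outside UiPath as well. The following is a minimal Python illustration, not the VB.NET expression itself; the function name `split_on_dot_com` and the sample text are hypothetical, and it assumes the split should ignore case so that ".COM" and ".com" are both handled.

```python
import re

def split_on_dot_com(jobs):
    """Split the text after every ".com" (any casing) and keep the suffix."""
    # The capturing group makes re.split return the delimiter as well,
    # so the original casing (".COM" or ".com") can be re-attached.
    parts = re.split(r"(\.com)", jobs, flags=re.IGNORECASE)
    ads = []
    # parts alternates: text, delimiter, text, delimiter, ..., trailing text.
    # Any trailing text after the last ".com" is ignored in this sketch.
    for i in range(0, len(parts) - 1, 2):
        ads.append((parts[i] + parts[i + 1]).strip())
    return ads

sample = ("Acme has a vacancy for DRIVER, contact 1717 or acme@GMAIL.COM "
          "Beta has a vacancy for COOK, contact 1818 or beta@gmail.com")
ads = split_on_dot_com(sample)
```

Note that this only works for addresses ending in ".com"; any other ending would be left attached to the following advertisement.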

@Anwar_Mirza

When I looked at your text, some emails contain .COM and some .com. In this file we want to ignore the case and split after the email address regardless of case. For this you can use the regex below:

System.Text.RegularExpressions.Regex.Matches(strValue, ".*?[\w.]+@[\w.]+", System.Text.RegularExpressions.RegexOptions.IgnoreCase)

I will give a sample flow screenshot that you can build along with.

The type of lstvar should be MatchCollection.

Output in Excel
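To show what the pattern above does, here is a minimal Python sketch (Python's `re` accepts the same pattern text). The names `AD_PATTERN` and `extract_ads` and the sample string are illustrative assumptions, not part of the original flow. The lazy `.*?` consumes the advertisement text up to the first thing that looks like an email address, so each match is one advertisement ending in its address.

```python
import re

# Same pattern as in the post: lazy text, then local-part@domain.
AD_PATTERN = re.compile(r".*?[\w.]+@[\w.]+", re.IGNORECASE)

def extract_ads(text):
    """Return one string per advertisement, each ending with its email."""
    return [m.group(0).strip() for m in AD_PATTERN.finditer(text)]

sample = ("Acme has a vacancy for the occupation of DRIVER , suitably qualified "
          "applicants can contact 1717171717 or Acme@GMAIL.COM "
          "Beta Co has a vacancy for the occupation of COOK , suitably qualified "
          "applicants can contact 1818181818 or hr@beta.bh")
ads = extract_ads(sample)
```

Because the pattern keys on the @ sign rather than on a fixed ending, it does not depend on the address ending in ".com".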

Please mark it as solution if you find it helpful!!
Happy Automation!!

@Anwar_Mirza

you can use this to get all the values

System.Text.RegularExpressions.Regex.Split(str,"(?<=\.com)\s*",System.Text.RegularExpressions.RegexOptions.IgnoreCase).ToArray

This will give you an array of strings in which each item represents one part.
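The lookbehind split can be tried in Python too, since the fixed-width lookbehind `(?<=\.com)` is valid there as well. This is only a sketch: the function name and sample text are hypothetical, and it assumes Python 3.7 or later, where `re.split` also splits on empty matches. Empty pieces (for example the one produced at the very end of the text) are filtered out.

```python
import re

def split_after_dot_com(text):
    """Split right after every ".com" (any casing), keeping the suffix."""
    # The lookbehind asserts that ".com" precedes the split point without
    # consuming it, so only the following whitespace is removed.
    pieces = re.split(r"(?<=\.com)\s*", text, flags=re.IGNORECASE)
    return [p for p in pieces if p]

sample = ("Acme needs a DRIVER, contact acme@GMAIL.COM "
          "Beta needs a COOK, contact beta@gmail.com")
ads = split_after_dot_com(sample)
```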


cheers

Hi Anwar,

To split the text, use the expression below and store the result in a MatchCollection variable:
System.Text.RegularExpressions.Regex.Matches(
strInput,
"([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\s+(.*?)(?=[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}|\z)",
System.Text.RegularExpressions.RegexOptions.Singleline
)
Use it inside a For Each, where item is the current match:
New String() { item.Groups(2).Value.Trim + Environment.NewLine + item.Groups(1).Value }

Then you can loop through the collection, add each item to a data table, and write the data table to Excel.
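The "one advertisement per row" step can be sketched as follows. This is a plain Python stand-in that writes CSV (which Excel can open) rather than a UiPath data table or a real XLS file; the function name `write_ads`, the column header, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

def write_ads(ads, handle):
    """Write one advertisement per row to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["Advertisement"])  # hypothetical header name
    for ad in ads:
        writer.writerow([ad])

# Usage: an in-memory buffer here; for a real file, open it with newline="".
buffer = io.StringIO()
write_ads(["Acme needs a DRIVER acme@gmail.com",
           "Beta needs a COOK hr@beta.bh"], buffer)
```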

Thank you all for sharing your knowledge. I will test all of them and share the results. Although the majority of the emails end with .com, I do have emails ending with .bh, .au, .sa and so forth. I will see what I can do about this, and I would appreciate it if you could share your thoughts about it as well.

@Anwar_Mirza

System.Text.RegularExpressions.Regex.Matches(strValue, ".*?[\w.]+@[\w.]+", System.Text.RegularExpressions.RegexOptions.IgnoreCase)

This pattern should work for all of your scenarios. Give it a try and see whether it works.
Here we are keying on the @ symbol, so the ending can be anything (.com, .ah, .in, and so on) and it will still work.

Thank you, I just tested it and it is working fine. However, I struggled with quite a few cases where there is a space inside the email address itself, which causes the line to break in the wrong place. Is there a way to trim these stray spaces from the string?

Thanks @sonaliaggarwal47, however this will only work with emails ending with .COM; other emails, ending with .in for instance, will not be captured.

Thanks @Anil_G but this will only capture emails ending with .com, other emails will not be captured.

Can you show me an example of the scenario in which you are getting the error?

@Anwar_Mirza

Let me know all the possible endings and we can include them as well.

cheers

Hi @Anwar_Mirza

In that case, use the syntax below:

Arr = str.Split({".com", ".bh", ".au", ".sa"}, StringSplitOptions.None)

If there can be more endings, such as .in, simply add them to the list above and it should work.

@Anwar_Mirza

Can you give the syntax below a try and see?

System.Text.RegularExpressions.Regex.Matches(strValue, ".*?[\w\s.]+@\s*[\w\s.]+\.\s*\w+", System.Text.RegularExpressions.RegexOptions.IgnoreCase)


Please see the attachment: there is a space after the @, and sometimes the space is after the dot (.).

It only selects those lines with spaces after the @, and ignores other lines. Output is as below:

Thanks I will try it first thing tomorrow and let you know. Thank you so much @sonaliaggarwal47

Mainly .bh, .au, .in, .org, .net, and .sa. I think I will define this as a variable in the project and update it each time something new pops up.
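Keeping the endings in one variable, as described above, could look roughly like this. It is a Python sketch under assumptions: the names `EMAIL_SUFFIXES`, `build_ad_pattern`, and `split_ads` are hypothetical, the list holds only the endings mentioned in this thread, and an address with an ending that is not in the list would be merged into the following advertisement.

```python
import re

# Endings mentioned in the thread; extend the list when a new one pops up.
EMAIL_SUFFIXES = [".com", ".bh", ".au", ".in", ".org", ".net", ".sa"]

def build_ad_pattern(suffixes):
    """Build a regex that matches one advertisement ending in a known suffix."""
    # Longest suffixes first, each escaped so the dot is taken literally.
    ordered = sorted(suffixes, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    # Lazy text up to "@", then the domain, then a known suffix that is
    # followed by whitespace or the end of the text.
    return re.compile(rf".*?@\S*?(?:{alternatives})(?=\s|$)", re.IGNORECASE)

def split_ads(text, suffixes=EMAIL_SUFFIXES):
    pattern = build_ad_pattern(suffixes)
    return [m.group(0).strip() for m in pattern.finditer(text)]

sample = ("Acme needs a DRIVER, contact 1717 or acme@GMAIL.COM "
          "Beta needs a COOK, contact hr@beta.bh "
          "Gamma needs a CLERK, contact jobs@gamma.com.au")
ads = split_ads(sample)
```

The lookahead is what lets a compound ending such as ".com.au" stay in one piece: ".com" is rejected there because it is not followed by whitespace.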

@Anwar_Mirza

Try the pattern below; it was giving the correct output:

System.Text.RegularExpressions.Regex.Matches(strValue, ".*?(?:[\w.-]+@\s*[\w.-]+\s*\.\s*\w+)", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
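An alternative to matching the stray spaces is to remove them first and then split. The sketch below is a Python illustration of that idea, not something posted in this thread: the function names and sample text are hypothetical, and the dot rule is a heuristic that assumes an advertisement never ends with a full stop directly after the address (otherwise the next word would be glued onto the email).

```python
import re

def normalise_emails(text):
    """Remove a stray space after "@" or after a dot in the domain."""
    # "name@ gmail.com" -> "name@gmail.com"
    text = re.sub(r"@\s+", "@", text)
    # "name@gmail. com" -> "name@gmail.com"; only a dot that directly
    # follows the domain part after "@" is touched.
    text = re.sub(r"(@[\w-]+(?:\.[\w-]+)*\.)\s+(?=[A-Za-z]{2,}\b)", r"\1", text)
    return text

def extract_ads(text):
    """Clean the addresses, then take everything up to each email."""
    cleaned = normalise_emails(text)
    return [m.group(0).strip()
            for m in re.finditer(r".*?[\w.-]+@[\w.-]+", cleaned)]

sample = ("Acme needs a DRIVER, contact acme@ GMAIL.COM "
          "Beta needs a COOK, contact hr@beta. bh "
          "Gamma needs a CLERK, contact jobs@gamma.com")
ads = extract_ads(sample)
```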

Happy Automation!!

@Anwar_Mirza

You can use it like this; it is more generic and handles all types of endings:

System.Text.RegularExpressions.Regex.Split(str,"(?<=\.[A-Za-z]{2,3})\s+",System.Text.RegularExpressions.RegexOptions.IgnoreCase).ToArray

cheers
