Formatting Screen Scraping Output

Hi there team, thank you in advance for your help. I scraped a page that has employee info. When I first scraped it, it looked like this:

Employee Profile


Employee Name



John Smith


Employee Email



john.smith@employee.com

As you can see, there were several \n characters. I did some research and found System.Text.RegularExpressions.Regex.Replace(employeeResults,"[\n]","") and it removed the new lines, but now everything ends up on one long line. It looks like this now:

Employee Profile              Employee Name                      John Smith                Employee Email                       john.smith@employee.com

I want it to look like this:

Employee Profile
Employee Name John Smith
Employee Email john.smith@employee.com

How can I accomplish this?
Thank you in advance!

Hi,

Can you try the following expression?

System.Text.RegularExpressions.Regex.Replace(System.Text.RegularExpressions.Regex.Replace(yourString,"(\r?\n){1,}",vbCrLf),"(?<=(Employee Name|Employee Email))\r\n"," ")
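For readers who want to test the idea outside UiPath, here is a minimal Python sketch of the same two-step approach: collapse each run of line breaks into one, then turn the line break that follows a known label into a space. The function name join_labels and the labels parameter are made up for this illustration, and it uses "\n" where the expression above uses vbCrLf. Python's re module only allows fixed-width lookbehinds, so the sketch captures the label instead of looking behind for it.

```python
import re

def join_labels(text, labels=("Employee Name", "Employee Email")):
    # Step 1: collapse every run of line breaks into a single "\n".
    collapsed = re.sub(r"(?:\r?\n)+", "\n", text)
    # Step 2: replace the line break right after a known label with a space.
    # The label is captured and re-emitted (stand-in for the .NET lookbehind).
    label_alt = "|".join(re.escape(label) for label in labels)
    return re.sub(rf"({label_alt})\n", r"\1 ", collapsed)

# Scraped text shaped like the example in the question.
scraped = (
    "Employee Profile\n\n\nEmployee Name\n\n\n\nJohn Smith\n\n\n"
    "Employee Email\n\n\n\njohn.smith@employee.com"
)
result = join_labels(scraped)
print(result)
```

Like the expression above, this only handles the labels it is told about, so any other label stays on its own line.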

Regards,

@olmccb

Can you try this? It is just an extension of what you have done:

System.Text.RegularExpressions.Regex.Replace(employeeResults,"[\n]","").Replace("Employee",Environment.NewLine + "Employee").Trim
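As a rough illustration of what this does, here is a small Python sketch of the same steps: strip the line breaks, put a line break in front of every "Employee", then trim. The helper name break_before_label is hypothetical. The last step, collapsing the leftover runs of spaces on each line, is an addition that is not part of the expression above.

```python
import re

def break_before_label(text, marker="Employee"):
    # Remove the original line breaks, as in the question.
    flat = re.sub(r"\n", "", text)
    # Start a new line in front of every occurrence of the marker, then trim.
    broken = flat.replace(marker, "\n" + marker).strip()
    # Addition (not in the original expression): collapse runs of spaces.
    return [" ".join(line.split()) for line in broken.split("\n")]

flat_input = (
    "Employee Profile      Employee Name      John Smith      "
    "Employee Email      john.smith@employee.com"
)
lines = break_before_label(flat_input)
print("\n".join(lines))
```

Note that this relies on every label starting with the word "Employee", so a label with a different first word would not get its own line.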

Cheers

Thank you so very much for your quick response, I really appreciate it. Yoichi, this is awesome, but I have to apologize because I was not clear. The information for each employee is going to be dynamic. Some employees may have no email and some may have one or more, so the Employee Email label may or may not be present. Likewise, some may have a website and others may not, so the Employee Website label may or may not be present. All in all, the number of labels will be different for each employee. This is my challenge. Thank you in advance!!

Hi,

Can you share your specific examples (input and expected output)?

Regards,

Thank you @Anil_G for your help. I added an update to my question. Thank you!


@olmccb

If you can share variations of the input and the required output, it will be easier to provide a solution.

Cheers

OK @Yoichi, here is an example, and once again thank you for helping!

input:
Employee Profile         Employee Name           John Smith         Employee Email      johnsmith@email.com            Employee Website   www.johnsmith.com

Employee Profile         Employee Name          Jane Smith          Employee Email   janesmith@email.com    janesmith1email.com          Birthday   jan 2, 2000

Employee Profile         Employee Name          Roger Smith       


Expected Output:

Employee Profile
Employee Name  John Smith
Employee Email  johnsmith@email.com
Employee Website  www.johnsmith.com

Employee Profile 
Employee Name Jane Smith
Employee Email  janesmith@email.com
	               janesmith1email.com
Birthday jan 2, 2000

Employee Profile
Employee Name Roger Smith

See how each employee has different details. I get the details in one sweep by using the Get Value activity. I was thinking of converting the string into an array, but then the array has all the spaces in between from the original string, and if I remove them it's just one long string again. Can we make what you created dynamic?
Thank you!
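On the array idea: one way to avoid ending up with empty entries is to split on runs of two or more whitespace characters instead of on single spaces. Below is a minimal Python sketch of that, with the hypothetical helper name tokenize. It assumes that labels and values are always separated by at least two spaces and that no value contains a double space, which holds for the sample input above but may not hold for every page.

```python
import re

def tokenize(flat):
    # Split on runs of 2+ whitespace characters so that single spaces
    # inside a value ("Jane Smith", "jan 2, 2000") are preserved.
    return [token for token in re.split(r"\s{2,}", flat.strip()) if token]

flat_line = (
    "Employee Profile         Employee Name          Jane Smith          "
    "Employee Email   janesmith@email.com    janesmith1email.com          "
    "Birthday   jan 2, 2000"
)
tokens = tokenize(flat_line)
print(tokens)
```

Each label and each value then becomes its own array element, with no blank entries in between.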

Hi,

Can you try the following sample?

mc = System.Text.RegularExpressions.Regex.Matches(yourString,"Employee Profile[\s\S]*?(?=Employee Profile|$)")

Then, in a For Each over mc (m is each System.Text.RegularExpressions.Match):

name = System.Text.RegularExpressions.Regex.Match(m.Value,"Employee Name[\s\S]+?(?=Employee Email|Birthday|Employee Website|$)").Value

email = System.Text.RegularExpressions.Regex.Replace(System.Text.RegularExpressions.Regex.Match(m.Value.Trim,"Employee Email[\s\S]+?(?=Birthday|Employee Website|$)").Value,"(\S+@\S+)","$1"+vbcrlf).Trim

birthday = System.Text.RegularExpressions.Regex.Match(m.Value,"Birthday[\s\S]+?(?=Employee Website|$)").Value

website = System.Text.RegularExpressions.Regex.Match(m.Value,"Employee Website[\s\S]+").Value

Then

String.Join(vbcrlf,{profile,name,email,birthday,website}.Where(Function(s) not String.IsNullOrWhiteSpace(s)))

Sample20230124-4L.zip (2.9 KB)
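For anyone who wants to experiment with the approach outside UiPath, here is a minimal Python sketch of a variation on it. It is not the content of the attached sample. It splits the text into one block per "Employee Profile", then finds each label and everything up to the next label. The LABELS list, the function name format_profiles, and the rule of splitting values on two or more spaces are all assumptions made for this illustration; the label list would need to be extended for any other label that can appear.

```python
import re

# Assumed set of labels; extend this list for other fields.
LABELS = ["Employee Name", "Employee Email", "Employee Website", "Birthday"]
LABEL_ALT = "|".join(re.escape(label) for label in LABELS)

def format_profiles(text):
    formatted = []
    # One block per profile: from "Employee Profile" up to the next one (or the end).
    for block in re.findall(r"Employee Profile[\s\S]*?(?=Employee Profile|$)", text):
        lines = ["Employee Profile"]
        # Each label plus everything up to the next label (or the end of the block).
        for m in re.finditer(rf"({LABEL_ALT})([\s\S]*?)(?={LABEL_ALT}|$)", block):
            # Assumption: separate values are divided by 2+ whitespace characters.
            values = [v for v in re.split(r"\s{2,}", m.group(2).strip()) if v]
            first = values[0] if values else ""
            lines.append(f"{m.group(1)} {first}".strip())
            lines.extend(values[1:])  # extra values go on their own lines
        formatted.append("\n".join(lines))
    return "\n\n".join(formatted)

sample = (
    "Employee Profile         Employee Name           John Smith         "
    "Employee Email      johnsmith@email.com            Employee Website   www.johnsmith.com\n\n"
    "Employee Profile         Employee Name          Jane Smith          "
    "Employee Email   janesmith@email.com    janesmith1email.com          Birthday   jan 2, 2000\n\n"
    "Employee Profile         Employee Name          Roger Smith       "
)
output = format_profiles(sample)
print(output)
```

Because labels that are missing from a block simply produce no match, the same code handles employees with different sets of fields.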

Regards,

@Yoichi thank you very much for this!! I am working on implementing it in my workflow and will report back!! Much love, thank you!!

Hi there @Yoichi, I'm implementing your solution and have created my MatchCollection variable. But when I use it in the For Each loop, I get a message saying "Value" is not a member of MatchCollection. I googled and searched the topics here in the forum but couldn't find a solution.

Hi,

Can you try setting the TypeArgument of the For Each activity to Match (System.Text.RegularExpressions.Match)?

Regards,

It worked, thank you!!


Thank you so very much @Yoichi!! It worked in my workflow. I had to modify it a bit, but it did the job!! Thank you very much!!

