How to retrieve specific data from HTML

Hi guys,

I am about to give up on this case and need your support. Let me try to explain it.

The use case is based on e-mails reaching a shared mailbox. The body of those mails is as follows:

[image: mail body]

Each line contains information that has to be retrieved and then entered into a web page.

My problem is that the body of those mails comes in HTML format, and I haven’t been able to extract the data with any method I can think of. Here are the “insides” of the HTML file. I marked in yellow the tag that corresponds to each piece of data I want to retrieve (in green):

I tried several methods (Get Full Text, screen scraping…) and I’m not able to get a structured entity that I could start working from. If Regex is the best solution here, I need your knowledge and expertise, as I’m not that good at Regex.

Thanks everyone for your suggestions and, please, let me know if something is unclear.

Hi @jferre

Open the file with the Use Application/Browser activity, then extract the data from it.

Regards,

Hi @supriya117 ,

I already tried that, and I got no text as a result. I am surely doing something wrong, so I’d appreciate it if someone could shed some light here :slight_smile: .

@jferre

Save the Outlook mail in “*.mht” format by using the save Outlook mail activity, then open it in the browser.

I am not using the Outlook client at all. Everything is done with Exchange activities.

And I would prefer not to use Outlook if it can be avoided.

Let’s assume the email is sent in HTML body format.

We then have:
myMailVar.BodyAsHtml - returns the body as HTML
myMailVar.Body - returns the body as text

Then we can extract the values, e.g. with a regex.

Check this at your end by doing the following.

Samples:
[image: sample]
[image: sample]
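
To make the regex step a bit more concrete, here is a minimal sketch in Python (used only for illustration; the pattern would need to be ported and checked before using it in a UiPath regex activity). The sample body and the field names `Customer`, `Order ID` and `Amount` are hypothetical, because the real mail content is not shown in this thread. It assumes the plain-text body has one `Label: value` pair per line.

```python
import re

# Hypothetical plain-text body; the real field names are not shown in the thread.
SAMPLE_BODY = """\
Customer: ACME Corp
Order ID: 12345
Amount: 99.50 EUR
"""

# Assumed layout: one "Label: value" pair per line.
# Group 1 is the label (no colon allowed), group 2 is the rest of the line.
LINE_PATTERN = re.compile(
    r"^[ \t]*([^:\r\n]+?)[ \t]*:[ \t]*(.*?)[ \t]*\r?$",
    re.MULTILINE,
)


def extract_fields(body: str) -> dict[str, str]:
    """Return a dict that maps each label to its value."""
    return {label: value for label, value in LINE_PATTERN.findall(body)}


fields = extract_fields(SAMPLE_BODY)
print(fields["Order ID"])  # 12345
```

Only the first colon on a line separates label and value, so a value such as `10:30` would stay intact.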

Hi @ppr ,

Thanks for your suggestion. I was already able to save the body of the mail in a text file. My problem comes right afterwards. I don’t know how to get rid of all the rubbish (unwanted tags and symbols) and leave only the data I need.

If some Regex ‘guru’ could lend a hand here I would really appreciate it. :wink:

As shown above, we can get the text only. Can you share with us what was done at your end to retrieve the body text? Thanks

Sure. Basically, I get all the e-mails in scope with a ‘Get Exchange Mail Message’ activity and I store them in a list of Mail Messages.

Then I do a For Each and process each mail on that list. I can read the body information by assigning item.Body.ToString.

So, the last step is to save that information with a ‘Write Text File’ activity. And now I have a plain-text file (which I can save as *.txt or *.html) containing the information I mentioned in my first post.

I need to extract the information from this file, and that’s where I am stuck.

item.Body is similar to what we did in the Immediate panel, but there we got plain text only.
Please do the same at your end and share a screenshot with us. Thanks

I can’t see the difference between item.Body and item.Body.ToString. I guess they are exactly the same, and I insist that this is not the real issue.

What I need here is to extract the information from the plain-text file that contains the HTML code. And I guess that’s a topic for Regex. :man_shrugging:
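
For the regex-only route, one common (and admittedly fragile) approach is to delete the tags and decode the entities, so that only the visible text remains. The sketch below is in Python purely for illustration. The HTML fragment is hypothetical, since the actual markup was only posted as an image, and the patterns would likely need adjusting to the real file. Regex is not a full HTML parser, so this is a quick cleanup under those assumptions rather than a robust solution.

```python
import html
import re

# Hypothetical fragment standing in for the saved mail body.
SAMPLE_HTML = (
    "<html><head><style>p {margin:0}</style></head><body>"
    "<p><b>Customer:</b>&nbsp;ACME Corp</p>"
    "<p><b>Order ID:</b> 12345</p>"
    "</body></html>"
)


def html_to_lines(markup: str) -> list[str]:
    """Strip tags and entities, returning the non-empty text lines."""
    # Drop <style>/<script> blocks, whose content is never visible text.
    text = re.sub(r"(?is)<(style|script)\b.*?</\1>", "", markup)
    # Treat <br> and the end of common block-level elements as line breaks.
    text = re.sub(r"(?i)<br\s*/?>|</(p|div|tr|li)>", "\n", text)
    # Remove every remaining tag.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &nbsp; and &amp;, then normalise whitespace.
    text = html.unescape(text).replace("\xa0", " ")
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


print(html_to_lines(SAMPLE_HTML))
```

The resulting lines are plain `Label: value` text, which is much easier to match with a simple pattern than the raw HTML.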

OK, it was not shared, but you are stating that Body also returns HTML-tagged text.

If so, we can check yourMailVar.Headers("PlainText") to get the text only.

Otherwise, we can also use HTML Agility Pack (https://html-agility-pack.net) to help with the parsing.
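
HTML Agility Pack is a .NET library, so the sketch below does not show its API. It only illustrates the parser-based idea, using Python's standard `html.parser` module as a stand-in. It assumes, hypothetically, that each piece of data sits in a table row with a label cell followed by a value cell; the real markup from the first post may be structured differently.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell without an enclosing <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the cell.
        if self._in_cell:
            self.rows[-1][-1] += data


def table_to_dict(markup: str) -> dict[str, str]:
    """Map the first cell of each row (label) to the second cell (value)."""
    parser = TableCellParser()
    parser.feed(markup)
    parser.close()
    result = {}
    for row in parser.rows:
        cells = [" ".join(cell.split()) for cell in row]
        if len(cells) >= 2 and cells[0]:
            result[cells[0].rstrip(":")] = cells[1]
    return result


# Hypothetical markup; the real tags were only shown in a screenshot.
SAMPLE_HTML = """
<table>
  <tr><td><b>Customer:</b></td><td>ACME&nbsp;Corp</td></tr>
  <tr><td><b>Order ID:</b></td><td>12345</td></tr>
</table>
"""
print(table_to_dict(SAMPLE_HTML))
```

Compared with a regex, a parser keeps working when attributes, line breaks or nested formatting tags change, which is the main reason to consider this route.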
