Accurate way of scraping specific email information?

Hi there,

I built a bot which uses ‘Index of’ to extract text from a templated email which is sent to us when customers fill in an application form. I’ve had no problems with it, except for two failures over the weekend.

The issue is that there is a free-typing message box which customers can write in as part of the email which is sent to us and it could potentially screw things up. So, the issue I had:

If I wanted to get the ‘Annual cost’ field from the email, I could set two markers either side of this value like:

Marker 1: [Annual cost:] £1000 (value needed)
Marker 2: [Contract Length]: 48 months.

The issue then is that if the customer writes the words ‘Annual cost’ or ‘Contract Length’ in the free-typing message box, those words appear first in the email, so the bot picks up part of their message instead of the value and the whole thing crashes out.

Is there any way I can stop this from happening? Maybe I can add in a try catch which somehow selects the second value? Is there an easy way to do this?

Thanks a lot

You can use a regular expression to find the match. For example, if a line has ‘Annual cost’ in it, it should be followed by a colon, a space, a currency symbol and then the actual amount. The same goes for ‘Contract Length’: a colon followed by a space, a number and then a word.
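
As a rough illustration of that idea, here is a minimal sketch in Python rather than the .NET expressions a bot would use. The email body is made up, and the character class `[£$€]` is only an assumed set of currency symbols. The point is that a loose mention of the field name in the free-text message does not match, because it is not followed by the colon and a value.

```python
import re

# Hypothetical email body: the free-text message mentions the field names
# before the real templated fields appear.
body = (
    "Message: Is the Annual cost fixed for the whole Contract Length?\n"
    "Annual cost: £1000\n"
    "Contract Length: 48 months\n"
)

# Field name, colon, optional whitespace, a currency symbol, then the amount.
annual_cost = re.search(r"Annual cost:\s*([£$€][\d,.]+)", body)

# Field name, colon, optional whitespace, digits, whitespace, then a word.
contract_length = re.search(r"Contract Length:\s*(\d+\s+\w+)", body)

print(annual_cost.group(1))
print(contract_length.group(1))
```

If the real email has no colon after the field name, the same approach still applies as long as the pattern describes whatever does follow the label in the template.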

Hi,

There are a few ways to achieve this, I think.

The first way is to remove the free-typing message using index-of or regex before searching.
The second way is to extract the target string using a regex with a stricter condition.
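
To make the first way concrete, here is a minimal index-of style sketch in Python. Everything about the sample is an assumption: the body text is invented, `Application details` is a hypothetical heading that follows the free-text box, and `text_between` is a helper written only for this example. It also shows a variant that starts from the last occurrence of the label, which assumes the templated fields always come after the customer's message.

```python
# Hypothetical email body; the real template will differ.
body = (
    "Message:\n"
    "Can the Annual cost be reduced if the Contract Length is longer?\n"
    "Application details\n"
    "Annual cost £1000\n"
    "Contract Length 48 months\n"
)


def text_between(text, start_marker, end_marker):
    """Return the stripped text between the first start_marker and the next end_marker."""
    start = text.index(start_marker) + len(start_marker)
    end = text.index(end_marker, start)
    return text[start:end].strip()


# Variant A: cut away everything before the (assumed) heading that follows
# the free-text box, then use the two markers as before.
template_part = body[body.index("Application details"):]
annual_cost = text_between(template_part, "Annual cost", "Contract Length")

# Variant B: start from the LAST occurrence of the label instead of the first.
last_label = body.rindex("Annual cost")
annual_cost_last = text_between(body[last_label:], "Annual cost", "Contract Length")

print(annual_cost)
print(annual_cost_last)
```

Both variants depend on where the free-text box sits in the email, so they are only as reliable as that assumption.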

If you can share a whole sample string, we might be able to suggest a more specific approach.

Regards,

I actually just re-reviewed the email and it lacks the ‘:’; that was only in there for my explanation.

The issue is that these words already show up earlier in the email, so there are two lots (which would confuse regex).

Apologies for the crude image, I had to quickly replicate the email and remove any identifying information (for company data security).

As you can see in yellow, ‘monthly payments’ appears twice because of the free message. I want the ticked value, but the bot returns the circled (crossed) one because it is looking for those words and finds them in the message first.

Hi,

For now, can you try the following expression?

System.Text.RegularExpressions.Regex.Match(strMailBody,"(?<=Monthly payment\s+\d+\s+monthly\s+payments\s+)\p{Sc}[.\d]+").Value

Regards,

This works! Thank you. I’ve had a similar issue with the ‘annual mileage’ - would there be a way that I get this in a similar fashion?

Regex seems so confusing ahaha

Hi,

The following will work.

System.Text.RegularExpressions.Regex.Match(strMailBody,"(?<=Annual\s+mileage\s+)\d+\s+per\s+year").Value
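
For anyone trying to see why these patterns skip the free-text mention, here is a rough Python sketch of the same idea. It is not a drop-in replacement: Python's `re` module has no `\p{Sc}` and only allows fixed-width lookbehind, so the sketch uses a capture group and an assumed currency class `[£$€]` instead. The email body is invented to resemble the wording the two patterns expect.

```python
import re

# Hypothetical email body: the customer's message mentions both labels
# before the templated fields appear.
body = (
    "Message: what would the Monthly payment be with a lower Annual mileage?\n"
    "Monthly payment 48 monthly payments £250.50\n"
    "Annual mileage 10000 per year\n"
)

# The label must be followed by "<number> monthly payments" and then the amount,
# so the loose mention in the message cannot match.
monthly = re.search(
    r"Monthly payment\s+\d+\s+monthly\s+payments\s+([£$€][.\d]+)", body
)

# The label must be followed directly by "<number> per year".
mileage = re.search(r"Annual\s+mileage\s+(\d+\s+per\s+year)", body)

# re.search returns None when nothing matches, so check before reading the group.
monthly_value = monthly.group(1) if monthly else ""
mileage_value = mileage.group(1) if mileage else ""

print(monthly_value)
print(mileage_value)
```

The pattern works because it describes the text around the value in the template, not just the label, and a customer's message is unlikely to reproduce that exact wording.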

We can also get the other items in the same way.

Regards,

Thank you for your excellent knowledge - did you use a regex website to generate this? I feel it would be more reliable to get all the values this way, but would have taken me years to write it ahaha.

Hi,

"Did you use a regex website to generate this?"

To confirm that a regex pattern is correct, I use a regex test site. But I usually write the pattern myself, based on some frequently used patterns.

Perhaps the following post helps you.

Regards,
