Use Matches to obtain strings of text within a paragraph

Katie_Vooght · June 17, 2020, 12:30pm

Hi,

I would like to pull all triathlon website links from the following text using Matches:

href="/triathlon.com/rmstechnology/home">RMS Technology

Home
About
<div class="j10yRb" role="presentation"

For example: "/triathlon.com/rmstechnology/home"

How would I do this? Should I use the advanced option?


ppr · June 17, 2020, 1:37pm

@Katie_Vooght

Can you check whether one of the following options would be a better fit for the task:

using Find Children, filtering to all a elements (links), and retrieving the href attribute value
using XML processing, filtering to all a elements, and retrieving the href attribute value
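
The idea behind both options (walk the markup, keep only the a elements, read their href attribute) can be sketched outside UiPath as well. The following is a minimal Python illustration using the standard library's html.parser as a stand-in for Find Children or XML processing; it is not the UiPath activity itself. The sample markup, the "about" link, and the names HrefCollector and collect_hrefs are assumptions made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> element it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only anchor elements are of interest; <link>, <div>, etc. are skipped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collect_hrefs(html):
    """Return the href values of all <a> elements in the given markup."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical sample markup, loosely modelled on the snippet in the question.
sample = (
    '<a href="/triathlon.com/rmstechnology/home">RMS Technology</a>'
    '<a href="/triathlon.com/rmstechnology/about">About</a>'
    '<link href="https://fonts.googleapis.com/css?family=Roboto" rel="stylesheet">'
)
links = collect_hrefs(sample)
print(links)
```

Because only a elements are inspected, the stylesheet link is left out without any extra filtering.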

About regex: a quick and dirty approach could be:
[image: suggested regex pattern]

Katie_Vooght · June 19, 2020, 4:13pm

(?<=href=)".*" seems to start in the right place, but then it highlights all the text after it, even text that is not needed. Is there a better way to indicate where the match should stop?

ppr · June 19, 2020, 4:28pm

@Katie_Vooght
Yes, as it was the simplest possible regex pattern, it also takes the surrounding double quotes. I don't know your regex skills, so I aimed to keep it as simple as possible.

However, that is not a problem, as the quotes can easily be removed.

Use the Matches activity and configure the Pattern, Input, and Output properties.
Afterwards, run the following within an Assign:
left side: Urls (a String() variable)
right side: Matches.Select(Function (m) m.ToString.Replace(Chr(34).ToString, "").Trim).ToArray

and you will get a string array with all URLs.

Another approach would be to work with regex groups and refer to the URL surrounded by double quotes.
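
As a rough illustration of the points above, here is a minimal Python sketch (Python's re module is used only so the patterns can be tried outside UiPath; for these constructs it should behave like the .NET engine). It contrasts the greedy pattern with a variant that stops at the closing quote, then shows the quote-stripping step and the group-based alternative. The bounded [^"]* pattern and the sample line are assumptions added for this example, not something posted in the thread.

```python
import re

# Sample line assembled from the snippet in the question (closing tags assumed).
sample = (
    '<a href="/triathlon.com/rmstechnology/home">RMS Technology</a> '
    '<div class="j10yRb" role="presentation">'
)

# Greedy: .* runs on to the LAST double quote on the line.
greedy = re.findall(r'(?<=href=)".*"', sample)

# Bounded: [^"]* cannot cross a double quote, so the match ends at the closing quote.
bounded = re.findall(r'(?<=href=)"[^"]*"', sample)

# Strip the surrounding quotes, mirroring the Replace(Chr(34).ToString, "") step.
urls = [m.replace('"', "").strip() for m in bounded]

# Group approach: capture only the text between the quotes.
grouped = [m.group(1) for m in re.finditer(r'href="([^"]*)"', sample)]

print(greedy)
print(urls)
print(grouped)
```

The greedy result swallows everything up to role="presentation", which is the behaviour described in the question; the bounded and grouped variants both yield only the URL.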

bcorrea · June 19, 2020, 5:43pm

Try this expression:
<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1
From every match returned, you will want to get group 2 (group 1 only holds the opening quote character):
Match.Groups(2).Value
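
To see which group holds what, the expression can be tried in a small Python sketch (again only a stand-in for the UiPath Matches activity). The sample markup, the "about" link, and the name ANCHOR_HREF are assumptions for illustration.

```python
import re

# Group 1 captures the opening quote character, group 2 the URL itself;
# \1 requires the same quote character to close the attribute value.
ANCHOR_HREF = re.compile(r'''<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1''')

# Hypothetical markup with both double-quoted and single-quoted attributes.
sample = (
    '<a class="nav" href="/triathlon.com/rmstechnology/home">RMS Technology</a>'
    "<a href='/triathlon.com/rmstechnology/about'>About</a>"
)

quotes = [m.group(1) for m in ANCHOR_HREF.finditer(sample)]
urls = [m.group(2) for m in ANCHOR_HREF.finditer(sample)]

print(quotes)
print(urls)
```

Because the pattern requires a leading <a, href attributes on other elements such as link are not matched.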

Katie_Vooght · June 21, 2020, 6:07pm

How can I adapt <?href=(["'])(.*?)\1 to ensure it only picks up links containing /triathlon.com/rmstechnology? At the moment it is also picking up links such as:

href="https://fonts.googleapis.com/css?family=Google+Sans:400,500|Roboto:300,400,500,700|Source+Code+Pro:400,700&display=swap"

ppr · June 21, 2020, 8:06pm

@Katie_Vooght
maybe this helps:
(?<=href=")(\/triathlon\.com\/rmstechnology\/.*)(")

refer to group 1
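
A minimal Python sketch of this kind of filter follows, under one stated change: it swaps the greedy .* for [^"]* so that group 1 stops at the first closing quote even when more quotes follow on the same line. That swap, the sample markup, and the name TRIATHLON_HREF are assumptions for illustration and not part of the pattern as posted.

```python
import re

# Lookbehind anchors the match right after href=" and group 1 must start
# with the /triathlon.com/rmstechnology/ prefix; [^"]* ends at the closing quote.
TRIATHLON_HREF = re.compile(r'(?<=href=")(/triathlon\.com/rmstechnology/[^"]*)(")')

# Hypothetical page fragment containing one unwanted and one wanted link.
sample = (
    '<link href="https://fonts.googleapis.com/css?family=Google+Sans:400,500" rel="stylesheet">'
    '<a href="/triathlon.com/rmstechnology/home">RMS Technology</a>'
)

urls = [m.group(1) for m in TRIATHLON_HREF.finditer(sample)]
print(urls)
```

The fonts.googleapis.com link is skipped because the text after href=" does not begin with the required prefix.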

If it does not work as expected, then please share your clearly described requirements and sample values with us.
