Copy all href links from a website and paste into Google sheets

Hi,

I would like to copy all of the href values (e.g. href=“//ichef.bbci.co.uk”) from a website's page source after pressing ‘Ctrl-U’, for example on this page:

Then I would like to copy the text into a google sheet.

What is the best way to do this? Data scraping does not seem to work. Maybe Get Attribute will work, but I am not sure how to get all of the links this way.

Thanks,

Katie


@Katie_Vooght, Hi and welcome to the community

Personally I would use the Get Visible Text activity, which gets everything on the screen, and then use the Matches activity to find only the text next to href=

Hi @Katie_Vooght

You can use the FindChildren activity with the FIND_DESCENDANTS scope to get the elements you’re interested in. You can tweak the filter as you like (for example "<webctrl tag='A|LINK' matching:tag='regex' />"; I don’t have my Studio here to check, but you get the idea).

You can then iterate over the elements and get their href attribute (GetAttribute is an option, as you mentioned)
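For anyone reading along outside Studio, the same idea (filter on the A and LINK tags, then read each element's href attribute) can be sketched in Python with the standard-library html.parser module. This is only an illustrative sketch, not the UiPath workflow: HrefCollector, collect_hrefs and the sample markup are names made up for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        if tag in ("a", "link"):
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_hrefs(html):
    """Return every href found on <a> and <link> tags, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    return parser.hrefs


# Made-up markup standing in for a real page.
sample = '<link href="//ichef.bbci.co.uk"><a href="/news">News</a><p>no link</p>'
print(collect_hrefs(sample))
```

Changing the tuple ("a", "link") plays the same role as editing the tag filter in the selector.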

EDIT: Another approach would be to get the source code as text, with an HTTP activity for example, and then use either an HTML parser or a regex. If you prefer this approach, I can elaborate.

Thank you!

How do I set up ‘Get Visible Text’ so that it searches the whole page and not just the selection I click on?

Very new to this!

Thanks msan,

Please could you elaborate on your EDIT method? I have tried Find children and cannot seem to get it to work.

Hi @Katie_Vooght,

Here is an activity to get all the web links; you can also use it to verify whether the links are active or not.

Note: see the attached video.

Thank you that is really helpful. I am trying to use the Weblink Extractor. In the For each activity I get an exclamation mark when I put ‘item’ in the Write Line activity. Do you know why this might be?

Thanks,
Katie

Is there a way to get the text from the whole webpage and not just the section of the webpage that is visible on screen?

@Katie_Vooght

In your Write Line activity, just replace item with item.ToString

Your error will go away, and you will be able to see the output.

I am uploading my xaml file for your reference

BBC2.xaml (5.0 KB)

Hi @Katie_Vooght,

Question is already answered but here it is:

Use the HTTP Request activity with your URL as the endpoint and you’ll have the page source code as output (String). You can then use a regex to find the links, for example with pattern = "(?<=\bhref\="")[^""]+?(?="")". A better approach would be to use an HTML parser (I’m a noob in VB and I don’t know any, but in Python you can use BeautifulSoup).
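To show what that pattern does, here is a rough Python sketch of the regex step. The doubled quotes ("") in the pattern above are VB string escaping for a single ", so the Python spelling uses one quote. The fetch step is left out: sample_source is made-up markup standing in for the HTTP Request output, and extract_hrefs is a hypothetical helper name.

```python
import re

# Python spelling of the pattern from the post: match whatever sits between
# href=" and the next double quote.
HREF_PATTERN = re.compile(r'(?<=\bhref=")[^"]+?(?=")')


def extract_hrefs(page_source):
    """Return every double-quoted href value found in the page source."""
    return HREF_PATTERN.findall(page_source)


# Made-up markup standing in for the page source returned by an HTTP request.
sample_source = (
    '<link rel="dns-prefetch" href="//ichef.bbci.co.uk">'
    '<a class="x" href="/sport">Sport</a>'
)

links = extract_hrefs(sample_source)

# One link per line, which pastes into a single spreadsheet column.
print("\n".join(links))
```

Note that this pattern only catches href values wrapped in double quotes; single-quoted or unquoted attributes are skipped, which is one reason a parser is the safer choice.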

In the attached workflow, you’ll find one sequence with HTTP Request and Regex and another sequence with FindChildren.

For each case, you’ll find the result both as a string with NewLine as the separator and as an Array of String. With FindChildren, I filter A elements only, but you can edit the selector. Each variable’s scope is kept to its respective sequence.

Scrap_Urls.xaml (13.3 KB)
