Web Scraping - Only 1 element in pages

Hi, all - particularly anyone who can help!

I am entirely new to UiPath and not a programmer. I would like to scrape data from some web pages that share the same general layout but have only one of each item (e.g., title or date) per web page. This means that, when using the Data Scraping tab, selecting a second element (of the same kind as the first) to “teach” UiPath will not work.

Could anyone please lead me by the nose to the method needed? I don’t mind working, but some specific pointers would really help because, while the route I have taken looks superficially attractive, it does not work.

Hoping someone can help.

Stuart

Hi Stuart,

Welcome to the forum! :)

Could you please provide some more detail on the topic? Screenshots of the application and the workflow you’ve developed so far would hugely help us guide you through :)

Best Regards,
Filip

Hi, Filip, thanks for the reply.

To be specific, I want to harvest data from the site Fuel Cells Works.

The top level is:

Top

This leads to individual pages, where only one given element exists. I want to extract things like the URL, date (text), content (text) and title (text) into a CSV file. An example page (presently the first) is at:

Example page

If I try starting from the top level page, when I click, the web browser naturally takes me to the next page. So this will not work for the two elements required to teach UiPath. (And I don’t really want to get data from this page, just use it as the top level to move on from.)

If I try starting from the example page (one news item per page), there is only one element on that page, so I cannot satisfy the requirement of selecting a second similar element from the page.

Thanks again for your help, and if this is insufficient explanation, please do not hesitate to come back for more. I am most grateful that you have replied.

Stuart

@jonessl,

I think I see what you’re trying to do.

You need to use data scraping (but I believe you know that, judging by your mention of the “second element”)

  1. open the top-level page with articles

  2. run data scraping wizard

  3. press next to initiate scraping

  4. you’ll be able to select the article title without navigating to the article itself

  5. after selecting the item, click “next” on the wizard to start selecting the second element

  6. after you select the header of the second article, you’ll have to check the URL option and rename the columns; after that, a preview with the content of the first page should appear :)

  • you can later extract correlated data (date, content, etc.)

you’ve got the whole process described here: https://www.youtube.com/watch?v=CIsJGvvdz6Q

Hope that’s what you meant and my answer helps you :)

Best Regards,
Filip

Hi, Filip.

Thanks again for the reply.

As you indicate, using the example “top level” page (fuelcellsworks.com/news) I can select the First and Second Element, and if this were all I was after I would not have a problem.

However, I should give you some further information.

The top level page is only relevant as a landing page. This is static. But the articles that it points to change, so the long title of the second link (the example) will be at the top of that page one day and something else the next, hence the idea of starting from the top level page. The top level page cannot be used to harvest data as it only contains short summaries of data.

When I move to the example pages (and there are hundreds of them), the full data set is available. However, each of these pages has only one of each element (such as the date or the title). There is no second element to click or select. The only relevant things aside from the news article are the Next and Previous links to the next and previous news items, each of which opens a similar page.

Hence, while I can specify the First and Second Element (using Data Scraping) on the top level page, I cannot do so on the pages this points to, as each page only has a First Element.

Does this explain things better? Again, I very much appreciate your responses so far. Can you point me to a method to deal with this contingency?

Thanks again,

Stuart

@jonessl,

Then the most reliable way to get all the articles would be to first scrape as much as possible from the top-level page and then loop through the output table: for each row, navigate to the article using the item’s URL and extract the missing information directly from the detailed article view.
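For readers who think more easily in code, the same two-stage idea (scrape the list, then visit each URL for the single-occurrence fields) can be sketched in Python. This is only an illustration of the logic, not what the attached workflow contains. The tag choices (`h2`, `h1`, `time`, `article`), the `PAGES` sample data and the `fetch` callable are all assumptions made up for the sketch; the real site’s markup would need to be inspected.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (text, href) for each <a> inside an <h2> (assumed listing markup)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_h2 = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False


class FieldExtractor(HTMLParser):
    """Collects the text of tags that occur once per page, e.g. <h1> -> title."""

    def __init__(self, tag_to_field):
        super().__init__()
        self.tag_to_field = tag_to_field
        self.values = {name: "" for name in tag_to_field.values()}
        self._active = set()

    def handle_starttag(self, tag, attrs):
        if tag in self.tag_to_field:
            self._active.add(self.tag_to_field[tag])

    def handle_data(self, data):
        for name in self._active:
            self.values[name] += data

    def handle_endtag(self, tag):
        if tag in self.tag_to_field:
            self._active.discard(self.tag_to_field[tag])


def scrape(listing_url, fetch):
    """Stage 1: read article URLs from the listing. Stage 2: visit each one."""
    collector = LinkCollector()
    collector.feed(fetch(listing_url))
    rows = []
    for _short_title, url in collector.links:
        # Assumed detail-page markup: one <h1>, one <time>, one <article>.
        extractor = FieldExtractor({"h1": "title", "time": "date", "article": "content"})
        extractor.feed(fetch(url))
        row = {"url": url}
        # Collapse runs of whitespace so each value fits on one CSV line.
        row.update({k: " ".join(v.split()) for k, v in extractor.values.items()})
        rows.append(row)
    return rows


def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "title", "date", "content"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up in-memory pages standing in for the real site (hypothetical data).
PAGES = {
    "/news": "<h2><a href='/news/a1'>Short A</a></h2><h2><a href='/news/a2'>Short B</a></h2>",
    "/news/a1": "<h1>First article</h1><time>2000-01-01</time><article><p>Full text of A.</p></article>",
    "/news/a2": "<h1>Second article</h1><time>2000-01-02</time><article><p>Full text of B.</p></article>",
}

if __name__ == "__main__":
    print(to_csv(scrape("/news", PAGES.__getitem__)))
```

The point of the design is that the detail pages never need a “second element”: once the URL is known, each field is read by a fixed rule for that page layout.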

quick demo attached :)
Main.xaml (12.9 KB)

Best Regards,
Filip

Hi, Filip. Thanks again.

Can you please give me some guidance as to how to make use of the XAML file you sent? Is it to illustrate the flow (in terms of the box titles “Sequence”, “Articles Scraping”, etc.) or is it directly useful (i.e. I can use it directly as a basis to do what I want)? If the latter, could you please give me an indication of “how”?

Sorry, remember I am not a programmer. (But I am willing to learn and invest time).

Postscript: I have loaded the file into the Studio. It correctly gets the URLs, titles and long content! But for the first top-level page only (i.e. up to the 11th record, where page 1 finishes). I needed to change one of the variables to “Title”, but that may have been irrelevant. It would be really helpful if you could indicate in broad terms how you very kindly made this flow - i.e. by what method - so I can try to build on it, and adapt it to other web addresses as I source new content - one day!
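On the “first page only” point, the general logic for covering every listing page is to keep following the “next” link until there is none. A rough Python sketch of that loop follows, purely as an illustration. The `class="article"` and `rel="next"` attributes, the `PAGES` data and the `fetch` callable are assumptions invented for the example, not the real site’s markup or anything in the attached workflow.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collects article links and the "next page" link from one listing page."""

    def __init__(self):
        super().__init__()
        self.article_urls = []
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        if a.get("rel") == "next":          # assumed marker for the Next link
            self.next_url = a.get("href")
        elif a.get("class") == "article":   # assumed marker for article links
            self.article_urls.append(a.get("href"))


def collect_all_article_urls(start_url, fetch, max_pages=50):
    """Walk listing pages via their Next link until it runs out."""
    urls, seen, page_url = [], set(), start_url
    # Stop on a missing Next link, a page we already visited, or the page cap.
    while page_url and page_url not in seen and len(seen) < max_pages:
        seen.add(page_url)
        parser = ListingParser()
        parser.feed(fetch(page_url))
        urls.extend(parser.article_urls)
        page_url = parser.next_url
    return urls


# Made-up listing pages (hypothetical data).
PAGES = {
    "/news": '<a class="article" href="/a1">A1</a>'
             '<a class="article" href="/a2">A2</a>'
             '<a rel="next" href="/news?page=2">Next</a>',
    "/news?page=2": '<a class="article" href="/a3">A3</a>',
}

if __name__ == "__main__":
    print(collect_all_article_urls("/news", PAGES.__getitem__))
```

The `seen` set and `max_pages` cap are there so that a listing whose Next link points back at itself cannot loop forever.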

Kind regards,

Stuart

@jonessl,

Yes, you can use it, but you’ll need to make some adjustments.
The workflow I’ve sent you contains the basic logic for extracting the list of articles, then looping through every item to extract the full article content, and finally saving it all to CSV (though I didn’t go all the way to rename every sequence and make it “bulletproof”).

You can go ahead, open the xaml file with UiPath Studio and give it a go.
Once you’ve gone through the example, you can work on extending the functionality, i.e. adding columns with additional data by reusing the “add column” activity with different parameters (name, type, length of the other columns), then scraping the necessary information from the article page and assigning it to the relevant column value, as I did with the full article text.
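The “add a column, then fill it row by row” pattern described above looks roughly like this when written out in Python. It is only a sketch of the logic, not the workflow itself: `add_column`, `fill_column`, the `DATES` lookup and the sample rows are hypothetical names and data invented for the illustration.

```python
def add_column(table, name, default=""):
    """Give every row a new column with an empty starting value."""
    for row in table:
        row.setdefault(name, default)


def fill_column(table, name, extract):
    """Loop through the rows and assign each one the value extracted for it."""
    for row in table:
        row[name] = extract(row)


# Rows as they might look after the list scrape (made-up sample data).
table = [
    {"title": "Short A", "url": "/news/a1"},
    {"title": "Short B", "url": "/news/a2"},
]

# Stand-in for "open the article page and read its date" (hypothetical lookup).
DATES = {"/news/a1": "2000-01-01", "/news/a2": "2000-01-02"}

add_column(table, "date")
fill_column(table, "date", lambda row: DATES[row["url"]])

if __name__ == "__main__":
    for row in table:
        print(row)
```

Each extra field (date, full title, and so on) repeats the same two steps with a different column name and a different extraction rule.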

Also,
Have you gone through the UiPath Academy? They’ve got quite a few courses there that will help you understand the basics of UiPath and get you going :) As you progress through the courses you’ll learn to manage error handling, logging and other things that are crucial for bot stability :)

Best Regards,
Filip

Hi, Filip. I am doing this on a train! My train completes its journey in 20 min, so if you could kindly indicate how you made up this flow, I will have a go tomorrow. As you can see, I have now tried running the XAML file using Studio. I will also look at the videos and try to learn some basics. Thanks again.

And I really appreciate what you have done even if there are errors. It is pure kindness on your part and I can only give you my thanks.

Hi @jonessl
Is your problem solved?
If yes, can you share the details of the solution?

Best Regards,
Pooja M