I am entirely new to UiPath and not a programmer. I would like to scrape data from some web pages that share the same general layout but have only one of each item (e.g., title or date) per page. This means that, when using the Data Scraping tab, selecting a second element of the same kind to "teach" UiPath will not work.
Could anyone please lead me by the nose to the method needed? I don't mind putting in the work, but some specific pointers would really help, because while the route I have taken looks attractive on the surface, it does not work.
Could you please provide some more detail on the topic? Show screenshots of the application and share the workflow you've developed so far :) That would hugely help us guide you through.
This leads to individual pages, where only one instance of each element exists. I want to extract things like the URL, date (text), content (text) and title (text) into a CSV file. An example page (presently the first) is at:
If I try starting from the top-level page, clicking an item naturally takes the web browser to the next page, so this will not work for selecting the two elements required to teach UiPath. (And I don't really want to get data from this page, just use it as the top level to move on from.)
If I try starting from the example page (one news item per page), there is only one element on that page, so I cannot satisfy the requirement of selecting a second similar element from the page.
Thanks again for your help, and if this is insufficient explanation, please do not hesitate to come back for more. I am most grateful that you have replied.
After selecting the item, click "Next" on the wizard to start searching for the second element.
After you select the header of the second article, you'll have to tick the URL option and rename the columns, but after that a preview with the content of the first page should appear.
You can later extract the correlated data (date, content, etc.); a rough non-UiPath sketch of this first step follows below.
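Not a UiPath thing, but if it helps to picture the result: the wizard's first pass basically builds a table of title/URL pairs from the listing page. Here is a rough sketch of that same step in plain Python (the CSS selector is my guess at the site's markup, and the file name is just an example; the wizard works the real selectors out for you):

```python
# Rough Python analogue of the Data Scraping wizard's first step:
# build a table of (Title, URL) pairs from the top-level news page.
# The CSS selector below is an assumption about the site's markup.
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page = requests.get("https://fuelcellsworks.com/news", timeout=30)
soup = BeautifulSoup(page.text, "html.parser")

rows = []
for link in soup.select("article h2 a"):  # assumed headline links
    rows.append({
        "Title": link.get_text(strip=True),
        "URL": urljoin(page.url, link.get("href", "")),
    })

# Save the listing table; the next step loops through it.
with open("listing.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Title", "URL"])
    writer.writeheader()
    writer.writerows(rows)
```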
As you indicate, using the example "top level" page (fuelcellsworks.com/news) I can select the First and Second Element, and if this were all I was after I would not have a problem.
However, I should give you some further information.
The top-level page is only relevant as a landing page. It is static, but the articles it points to change, so the long title of the second link (the example) will be at the top of that page one day and something else the next; hence the idea of starting from the top-level page. The top-level page cannot be used to harvest data, as it only contains short summaries.
When I move to the example pages (and there are hundreds of them), the full data set is available. However, each of these pages has only one instance of each element (such as the date or title). There is no second element to click or select. The only relevant things aside from the news article are the Next and Previous links to the next and previous news items, each of which opens a similar page.
Hence, while I can specify the First and Second Element (using Data Scraping) on the top-level page, I cannot do so on the pages it points to, as each of those pages only has a First Element.
Does this explain things better? And, again with much appreciation for your responses so far, can you point me to a method for dealing with this situation?
Then the most reliable way to get all the articles would be to first scrape as much as possible from the top-level page, and then loop through the output table: navigate to each article based on the URL of the item and extract the missing information directly from the detailed article view.
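To make that concrete outside of UiPath, here is a minimal Python sketch of the same loop, assuming a listing.csv of Title/URL rows like the one produced above. In the workflow this corresponds to a For Each Row over the extracted table, opening each URL and scraping the full article text; the selector is again an assumption about the article pages, not taken from the real site.

```python
# Sketch of "loop through the output table, open each URL, grab the
# missing information". Reads the Title/URL table from listing.csv and
# writes articles.csv with the full article text added.
import csv

import requests
from bs4 import BeautifulSoup

def scrape_article_text(url):
    """Fetch one article page and return its full text."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    body = soup.select_one("article")  # assumed container of the article body
    return body.get_text(" ", strip=True) if body else ""

with open("listing.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))     # the output table from the listing step

for row in rows:                       # loop through the output table
    row["Content"] = scrape_article_text(row["URL"])

with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Title", "URL", "Content"])
    writer.writeheader()
    writer.writerows(rows)
```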
Can you please give me some guidance on how to make use of the XAML file you sent? Is it to illustrate the flow (in terms of the box titles "Sequence", "Articles Scraping", etc.), or is it directly useful (i.e. can I use it directly as a basis to do what I want)? If the latter, could you please give me an indication of how?
Sorry, remember I am not a programmer. (But I am willing to learn and invest time).
Postscript: I have loaded the file into Studio. It correctly gets the URLs, titles and long content! But only for the first top-level page (i.e. up to the 11th record, where page 1 finishes). I needed to change one of the variables to "Title", but that may have been irrelevant. It would be really helpful if you could indicate in broad terms how you very kindly made this flow (i.e. by what method), so I can try to build on it and adapt it to other web addresses as I source new content, one day!
Yes, you can use it, but you'll need to make some adjustments.
The workflow I've sent you contains the basic logic for extracting the list of articles, looping through every item to extract the full article content, and then saving it all into a CSV. (Though I didn't go all the way to rename every sequence and make it "bulletproof".)
You can go ahead, open the XAML file with UiPath Studio, and give it a go.
Once you go through the example, you can work on extending the functionality, i.e. adding columns with additional data by reusing the Add Data Column activity with different parameters (name, type, length), then scraping the necessary information from the article page and assigning it to the relevant column value, as I did with the full article text.
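For example, adding the date works the same way as the full text did. As a non-UiPath illustration of that pattern (add a column, then fill it per row while visiting each article), here is a short Python/pandas sketch; the column name, file names and the date selector are assumptions:

```python
# Python analogue of Add Data Column plus an Assign inside For Each Row:
# add a Date column to the existing table and fill it from each article page.
import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.read_csv("articles.csv")   # table produced by the earlier steps
df["Date"] = ""                    # analogue of Add Data Column ("Date", String)

for i, row in df.iterrows():
    soup = BeautifulSoup(requests.get(row["URL"], timeout=30).text, "html.parser")
    date = soup.select_one("time") # assumed element holding the publish date
    df.at[i, "Date"] = date.get_text(strip=True) if date else ""

df.to_csv("articles.csv", index=False)
```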
Also, have you gone through the UiPath Academy? They've got quite a few courses there that will help you understand the basics of UiPath and get you going, and as you progress through the courses you'll learn how to manage error handling, logging and other things that are quite crucial for bot stability.
Hi, Filip. I am doing this on a train! My train completes its journey in 20 minutes, so if you could kindly indicate how you made up this flow, I will have a go tomorrow. As you can see, I have now tried running the XAML in Studio. I will also look at the videos and try to learn some basics. Thanks again.
And I really appreciate what you have done even if there are errors. It is pure kindness on your part and I can only give you my thanks.