senek
January 29, 2022, 10:53am
1
Hi,
I am having some problems with web scraping. I am new to UiPath, so any help is appreciated. I want to scrape article titles from a website.
Basic web scraping is not enough: it gives me only part of the title plus some text underneath it (I don't need that text in the title column; I would love to put it in a second column).
I managed to get the title I want using the "Get Attribute" activity, but I want to automate that process. Is there a way to loop it?
The Page looks like this:
Can you help me? I want to export a list of titles and the text underneath them to Excel.
Hey @senek
Kindly use the Data Scraping wizard; it should handle this seamlessly.
Thanks
#nK
senek
January 29, 2022, 12:40pm
3
Hey, it doesn’t work as it should. This is the result:
Whole thing instead of a title
Hey @senek
I hope you indicated only the title?
Also, if it's a public site, please share the link so we can check…
Thanks
#nK
senek
January 29, 2022, 4:07pm
5
Yes, I indicated only the title, but I guess the website is a little bit tricky. Sure, here you go: https://rpa.hybrydoweit.pl/
ppr
(Peter Preuss)
January 29, 2022, 4:20pm
6
We can do it by manually editing the extract config:
<extract>
  <row exact="1">
    <webctrl tag="div"/>
    <webctrl tag="article" idx="1"/>
  </row>
  <column exact="1" name="Column1" attr="text">
    <webctrl tag="div"/>
    <webctrl tag="h3" idx="1"/>
  </column>
  <column exact="1" name="Column2" attr="href">
    <webctrl tag="a" idx="1"/>
  </column>
</extract>
senek
January 29, 2022, 4:50pm
7
Thank you! It's amazing. Can you tell me how you worked it out? Is there a way to learn it (haha, I bet there is)?
Also, I copied it and it does not work:
ppr
(Peter Preuss)
January 31, 2022, 9:33am
8
@senek
A little experience is helpful, but in general we can proceed in a straightforward way like this:
Start with the Data Scraping wizard.
When we cannot indicate more detailed selectors, e.g. we get only the blocks like:
then we check the structure of the web page.
We can see that it is clearly divided into different sections:
Now we do the following:
indicate the article blocks in the wizard
indicate the article blocks again in the wizard for a second column
With the second, correlated indication the wizard generates the row extract definition for us:
<extract>
  <row exact="1">
    <webctrl tag="div"/>
    <webctrl tag="article" idx="1"/>
  </row>
  <!-- column definitions not shown -->
</extract>
We then refer back to the structure of the website and post-edit the extract config XML manually.
In this alternative example we used the title from the image (its alt attribute):
<extract>
  <row exact="1">
    <webctrl tag="div"/>
    <webctrl tag="article" idx="1"/>
  </row>
  <column exact="1" name="Column1" attr="alt">
    <webctrl tag="img" idx="1"/>
  </column>
  <column exact="1" name="Column2" attr="href">
    <webctrl tag="a" idx="1"/>
  </column>
</extract>
So we got the full title instead of the shortened text that ends with … for long titles.
Once we have finished and confirmed the wizard, we cross-check the selector of the Extract Structured Data activity and verify that it targets the list of all articles correctly and conforms to what we configured in the row extract definition.
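If both the visible heading text and the full title are wanted in one pass, the two configs in this thread can probably be merged into a single three-column definition. This is only a sketch that reuses the row and column selectors posted above; the column names ShortTitle, FullTitle and Link are hypothetical, and the combination has not been checked against the site.

```xml
<extract>
  <!-- Row: one result row per <article> element (same row definition as above) -->
  <row exact="1">
    <webctrl tag="div"/>
    <webctrl tag="article" idx="1"/>
  </row>
  <!-- Visible text of the <h3> heading; long titles may be shortened with … -->
  <column exact="1" name="ShortTitle" attr="text">
    <webctrl tag="div"/>
    <webctrl tag="h3" idx="1"/>
  </column>
  <!-- Full title read from the image's alt attribute -->
  <column exact="1" name="FullTitle" attr="alt">
    <webctrl tag="img" idx="1"/>
  </column>
  <!-- Link target of the first anchor in the article -->
  <column exact="1" name="Link" attr="href">
    <webctrl tag="a" idx="1"/>
  </column>
</extract>
```

The idea is that each column element contributes one column to the extracted table, so adding a column block should be enough to pull another attribute from the same article row.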
Also have a look here:
This HowTo shows how Data Scraping can be configured to also retrieve non-standard information from a web table. After indicating the different data columns with the wizard, the extract data definition was post-edited and changed to the relevant attributes, e.g. value (Text field), src (Image Source), class (CSS Class Name), title (Hover Text), href (Url).
Introduction
The following web table is used for data scraping, and the non-text information should also be retrieved.
[image]
We…
Hello there. New to UiPath as well. I don't have that on my menu. Has UiPath updated it, or is it not in CE?
Hey @Carl_Robillos
It is the Table Extraction menu item in the toolbar (see the screenshot above).
Thanks
#nK