Article title web scraping - problem

Hi,

I am facing some problems with web scraping. I am new to UiPath, so any help is appreciated. I want to scrape article titles from a website.
Basic web scraping is not enough - it gives me only part of the title plus some text underneath it (I don’t need that text in the title column; I would love to put it in a second column).
I managed to get the title I want using the “get attribute” activity, but I want to automate that process. Is there a way to loop it?

The Page looks like this:

Can you help me? I want to export the list of titles and the text underneath them to Excel.

Hey @senek

Please use the Data Scraping wizard; it should handle this directly.

Thanks
#nK

Hey, it doesn’t work as it should. This is the result:
image

I get the whole block instead of just the title :frowning:

Hey @senek

I hope you indicated only the title?

Also, if it’s a public site, please share the link so we can check…

Thanks
#nK

Yes, I indicated only the title, but I guess the website is a little bit tricky. Sure, here you go: https://rpa.hybrydoweit.pl/

We can do it by manually editing the extract config:

<extract>
	<row exact="1">
		<webctrl tag="div"/>
		<webctrl tag="article" idx="1"/>
	</row>
	<column exact="1" name="Column1" attr="text">
		<webctrl tag="div"/>
		<webctrl tag="h3" idx="1"/>
	</column>
	<column exact="1" name="Column2" attr="href">
		<webctrl tag="a" idx="1"/>
	</column>
</extract>
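
For anyone trying to read a config like this: a minimal Python sketch (not a UiPath feature, just standard-library XML parsing) that lists which tag path defines a row and which attribute each column pulls. The `describe` helper is a hypothetical name, and the interpretation of `row`/`column`/`webctrl` is only what can be read from the config above.

```python
import xml.etree.ElementTree as ET

# The extract config from the post above, copied as a string.
EXTRACT_CONFIG = """
<extract>
    <row exact="1">
        <webctrl tag="div"/>
        <webctrl tag="article" idx="1"/>
    </row>
    <column exact="1" name="Column1" attr="text">
        <webctrl tag="div"/>
        <webctrl tag="h3" idx="1"/>
    </column>
    <column exact="1" name="Column2" attr="href">
        <webctrl tag="a" idx="1"/>
    </column>
</extract>
"""


def describe(config_xml):
    """Summarise the row path and the (name, attr, path) of every column."""
    root = ET.fromstring(config_xml.strip())

    def tag_path(node):
        # Join the tag names of the nested webctrl elements, outermost first.
        return " > ".join(w.get("tag") for w in node.findall("webctrl"))

    columns = [
        (col.get("name"), col.get("attr"), tag_path(col))
        for col in root.findall("column")
    ]
    return {"row": tag_path(root.find("row")), "columns": columns}


summary = describe(EXTRACT_CONFIG)
print("row:", summary["row"])
for name, attr, path in summary["columns"]:
    print(f"{name}: attribute '{attr}' of {path}")
```

Read this way, each `article` inside a `div` is one row, Column1 is the text of the `h3`, and Column2 is the `href` of the first `a`.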

Thank you! It’s amazing. Can you tell me how you determined it? Is there a way to learn it (haha, I bet there is)?

Also, I copied it and it does not work:

@senek
A little experience is helpful, but in general we can proceed in a straightforward way like this:

  • start with the Data Scraping wizard
  • when the wizard cannot produce more detailed selectors, e.g. we get only the whole blocks like:

    THEN: we check the structure of the webpage

We can see that it is clearly divided into different sections:

Now we do the following:

  • indicate the article blocks in wizard
  • indicate the article blocks again in the wizard for a second column

With the second, correlated indication, the wizard generates the row extract definition for us:

<extract>
	<row exact="1">
		<webctrl tag="div"/>
		<webctrl tag="article" idx="1"/>
	</row>
	<!-- column definitions are added in the next step -->
</extract>

Then we refer back to the structure of the website and post-edit the extract config XML manually.

In this alternative example we used the title from the image's alt attribute:

<extract>
	<row exact="1">
		<webctrl tag="div"/>
		<webctrl tag="article" idx="1"/>
	</row>
	<column exact="1" name="Column1" attr="alt">
		<webctrl tag="img" idx="1"/>
	</column>
	<column exact="1" name="Column2" attr="href">
		<webctrl tag="a" idx="1"/>
	</column>
</extract>

So we get the full title instead of the shortened text ending with … for long titles.
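
To make the row/column idea concrete outside of UiPath, here is a rough Python sketch using only the standard library. The HTML in `SAMPLE_HTML` is made up for illustration and is an assumption about the general shape of such a page (an `article` per entry, with an `img`, a link and a shortened `h3`); it is not the real markup of the site. `ArticleParser` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Hypothetical markup: one <article> per row, shortened text in <h3>,
# full title in the image's alt attribute.
SAMPLE_HTML = """
<div>
  <article>
    <a href="/post/1"><img src="1.png" alt="A very long article title that gets cut off"></a>
    <h3>A very long article title...</h3>
    <p>Teaser text under the title.</p>
  </article>
  <article>
    <a href="/post/2"><img src="2.png" alt="Short title"></a>
    <h3>Short title</h3>
    <p>Another teaser.</p>
  </article>
</div>
"""


class ArticleParser(HTMLParser):
    """Collect one dict per <article>: h3 text, first img alt, first a href."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None      # the row currently being filled, if any
        self._in_h3 = False   # True while inside an <h3> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self._row = {"h3_text": "", "img_alt": None, "href": None}
        elif self._row is None:
            return  # ignore everything outside an article
        elif tag == "h3":
            self._in_h3 = True
        elif tag == "img" and self._row["img_alt"] is None:
            self._row["img_alt"] = attrs.get("alt")
        elif tag == "a" and self._row["href"] is None:
            self._row["href"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "article" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_h3 and self._row is not None:
            self._row["h3_text"] += data.strip()


parser = ArticleParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
for row in rows:
    print(row["img_alt"], "|", row["h3_text"], "|", row["href"])
```

In this made-up sample the `h3` text of the first article is cut off, while the `alt` value carries the full title, which is the same reason the config above switches Column1 from `attr="text"` to `attr="alt"`.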

Once we have finished and confirmed the wizard, we cross-check the selector of the Extract Structured Data activity and verify that it targets the list of all articles correctly and conforms to what we configured in the row extract definition.

Also have a look here:

Hello there. I'm new to UiPath as well. I don’t have that in my menu. Has UiPath updated it, or is it not available in CE?

Hey @Carl_Robillos

It is the Table Extraction menu item in the toolbar (see the screenshot above).

Thanks
#nK

Oh thanks!
