Hello Genius people,
I’m trying to scrape data from Google News, extracting the name, URL and summary of each search result.
It worked like a charm for a while, but now I’ve realized the extraction XML is no longer reliable and needs some editing.
Can anyone help me edit this XML and explain how it should be done?
RunCmdAsDiffUsers.xaml (9.6 KB)
ppr
(Peter Preuss)
2
The structure changed. Kindly note the g-card tag.
Try reconfiguring the extraction, or try the following extract XML:
<extract>
  <row exact='1'>
    <webctrl tag='div' />
    <webctrl tag='g-card' idx='1' />
    <webctrl tag='div' idx='1' />
    <webctrl tag='div' idx='1' />
    <webctrl tag='a' idx='1' />
    <webctrl tag='div' idx='1' />
    <webctrl tag='div' idx='2' />
  </row>
  <column exact='1' name='Head' attr='text' name2='Url' attr2='href'>
    <webctrl tag='div' />
    <webctrl tag='g-card' idx='1' />
    <webctrl tag='div' idx='1' />
    <webctrl tag='div' idx='1' />
    <webctrl tag='a' idx='1' />
    <webctrl tag='div' idx='1' />
    <webctrl tag='div' idx='2' />
    <webctrl tag='div' idx='2' />
  </column>
  <column exact='1' name='Summary' attr='text'>
    <webctrl tag='div' />
    <webctrl tag='g-card' idx='1' />
    <webctrl tag='div' idx='1' />
    <webctrl tag='div' idx='1' />
    <webctrl tag='a' idx='1' />
    <webctrl tag='div' idx='1' />
    <webctrl tag='div' idx='2' />
    <webctrl tag='div' idx='3' />
  </column>
</extract>
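To see how the row and column definitions relate, it can help to print each one as a tag path. The following is only a rough sketch in Python using the standard library. The helper name `selector_paths` is made up for illustration and is not part of UiPath; the XML string is the extract metadata from this post.

```python
import xml.etree.ElementTree as ET

# Extract metadata copied from the post above.
EXTRACT_XML = """<extract>
<row exact='1'>
<webctrl tag='div' />
<webctrl tag='g-card' idx='1' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='1' />
<webctrl tag='a' idx='1' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
</row>
<column exact='1' name='Head' attr='text' name2='Url' attr2='href'>
<webctrl tag='div' />
<webctrl tag='g-card' idx='1' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='1' />
<webctrl tag='a' idx='1' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
</column>
<column exact='1' name='Summary' attr='text'>
<webctrl tag='div' />
<webctrl tag='g-card' idx='1' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='1' />
<webctrl tag='a' idx='1' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='3' />
</column>
</extract>"""


def selector_paths(extract_xml):
    """Hypothetical helper: map 'row' and each column name to its list of tag steps."""
    root = ET.fromstring(extract_xml)
    paths = {}
    for node in root:
        # The <row> element has no name attribute; columns are keyed by their name.
        label = "row" if node.tag == "row" else node.get("name")
        steps = []
        for ctrl in node.findall("webctrl"):
            idx = ctrl.get("idx")
            steps.append(ctrl.get("tag") + (f"[{idx}]" if idx else ""))
        paths[label] = steps
    return paths


if __name__ == "__main__":
    for label, steps in selector_paths(EXTRACT_XML).items():
        print(f"{label}: {' > '.join(steps)}")
```

Printed this way, both column paths start with the same seven steps as the row path and differ only in the final div index, so if the page structure changes again, the shared prefix (including the g-card step) is the part to update in all three places.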
Thank you so much for your help. May I ask how I can recognize this g-card tag? I couldn’t find it when I inspected the page.
ppr
(Peter Preuss)
4
Just right-click a card and select Inspect Element. In the browser's F12 developer tools you can check the structure.
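If you would rather search saved page source than click through the developer tools, a small script can list where a tag such as g-card sits in the markup. This is a minimal sketch using Python's standard `html.parser`. The sample HTML is invented for illustration and is not Google's actual markup, and `TagPathFinder` and `find_tag_paths` are hypothetical names. It also assumes the tag is present in the HTML you saved, which may not hold for content rendered by scripts.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathFinder(HTMLParser):
    """Records the chain of open ancestor tags each time the target tag opens."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # one "a > b > target" string per match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == self.target:
            self.paths.append(" > ".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def find_tag_paths(html, target="g-card"):
    finder = TagPathFinder(target)
    finder.feed(html)
    return finder.paths


# Invented sample markup, only to demonstrate the helper.
SAMPLE_HTML = (
    "<div><div><g-card><div><div>"
    "<a href='https://example.com/story'><div>Headline</div></a>"
    "</div></div></g-card></div></div>"
)

if __name__ == "__main__":
    for path in find_tag_paths(SAMPLE_HTML):
        print(path)
```

Each reported path corresponds to the chain of parent elements you would see in the Elements panel of the developer tools when a card is selected.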