Extract data without loss from multiple pages with varying load times


#1

I have a sequence where I open this URL and must extract the first column and the URL of every entry. I’ve set the selectors through the Data Scraping tool, and I’ve specified the next page selector as:
"<html app='chrome.exe' title='report aziende*' /> <webctrl parentid='tabella*' tag='LI' aaname=' &gt;&gt;' />"
The next selector works, and it appears to extract the data, but the result depends heavily on the page’s loading time. If everything goes well and I set a low number of results, it works; otherwise it skips some results. I’ve also tried a higher “DelayBetweenPagesMS”, and sometimes that is better, but not perfect. If I run the entire sequence for all 3771 pages, it does not work and I cannot get all the results.
Can you suggest a method to get ALL the data? Furthermore, it throws an error when it reaches the last page, because the next selector disappears! How can I solve this problem as well? Thanks


#2

Can no one give me a solution? :disappointed:


#3

For the “next page” selector you could use a Try Catch activity.
Regarding the data itself, could you please share the selector you are using to extract the data from the first column?
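
The try-catch idea can be sketched outside UiPath as a loop that treats a missing next-page control as the normal end of pagination rather than as a failure. This is only a minimal Python illustration: `FakeBrowser`, `NextPageNotFound`, and `scrape_all` are hypothetical names invented for the sketch, not UiPath activities or APIs.

```python
class NextPageNotFound(Exception):
    """Hypothetical error raised when the next-page control is absent."""


class FakeBrowser:
    """Stand-in for a real browser session so the sketch can run on its own."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def read_rows(self):
        # Return a copy of the rows on the page currently shown.
        return list(self.pages[self.index])

    def click_next(self):
        # On the last page the control has disappeared, so the click fails.
        if self.index + 1 >= len(self.pages):
            raise NextPageNotFound("next-page control is missing")
        self.index += 1


def scrape_all(browser):
    """Collect rows page by page until the next-page control is gone."""
    rows = []
    while True:
        rows.extend(browser.read_rows())
        try:
            browser.click_next()
        except NextPageNotFound:
            break  # expected on the last page, not an error
    return rows


demo = FakeBrowser([["a", "b"], ["c"], ["d", "e"]])
all_rows = scrape_all(demo)
print(all_rows)
```

The point of the design is that the last page is scraped before the failed click is caught, so no rows are lost when the loop ends.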


#4

This is the selector (generated by the Data Scraping tool):
<webctrl id='tabellaQueryProvince' tag='DIV' />

And the ExtractMetaData is:
<extract> <column exact='1' name='Nome' attr='text' name2='URL' attr2='href'> <webctrl tag='div' class='div-riga'/> <webctrl tag='div' class='div-nome' idx='1'/> <webctrl tag='a' idx='1'/> </column> </extract>

I ran a test with TimeoutMS set to 300000 and DelayBetweenPagesMS set to 2500, and after 3 hours and 51 minutes it had scraped 55612 rows instead of 56557. So it lost 945 results (63 pages).
I really cannot understand why it skips those results!
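
As a quick check of those figures, the arithmetic can be written out as below. The row counts come from the test above; the value of 15 rows per page is an assumption inferred from 945 missing rows over 63 pages, since the page size is not stated anywhere in the thread.

```python
expected_rows = 56557   # total rows the site should yield
scraped_rows = 55612    # rows actually scraped in the test run

missing_rows = expected_rows - scraped_rows

# Assumption: 15 rows per page, inferred from 945 / 63.
rows_per_page = 15
missing_pages = missing_rows // rows_per_page

# Share of the dataset that was lost, as a percentage.
loss_percent = 100 * missing_rows / expected_rows

print(missing_rows, missing_pages, round(loss_percent, 2))
```

Under that assumption the loss is a whole number of pages, which suggests entire pages are being skipped rather than individual rows within a page.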


#5

Does the entire page load when the robot clicks the next button?

Try setting WaitForReady to Complete.


#6

Yes, I’ve already set the WaitForReady parameter to Complete! I cannot understand how it is possible that it skips some pages! The problem is that the page is not refreshed on every “next button” click, because it uses AJAX and loads only the new table page! It’s also difficult to work out which pages are missing, because it skips them randomly and it is impossible to check across 3771 pages!
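
When only the table is replaced via AJAX, a common general workaround is to wait until the table content actually differs from the previous page before scraping, instead of relying on a fixed delay. The following is a minimal Python sketch of that polling idea, not a UiPath feature: `wait_for_change` and the simulated `reads` sequence are hypothetical and exist only to make the example runnable.

```python
import time


def wait_for_change(read_value, previous, timeout_s=30.0, poll_s=0.25):
    """Poll read_value() until it differs from previous or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        current = read_value()
        if current != previous:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("table content did not change in time")
        time.sleep(poll_s)


# Simulated table: the first two reads still show the old page's first row,
# as would happen while the AJAX request is in flight.
reads = iter(["row-A", "row-A", "row-B"])
new_first_row = wait_for_change(lambda: next(reads), previous="row-A", poll_s=0.01)
print(new_first_row)
```

One caveat with this approach: if two consecutive pages happened to start with the same value, the check would time out, so the compared value should be something that is expected to differ between pages.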


#7

Most likely it is because of the next link selector. Does it have an idx value?

Are you still scraping each column and merging later, or do you just want the Nome column data?


#8

This is my next link selector, and it doesn’t have an idx value:

`<html app='chrome.exe' title='report aziende*' /><webctrl parentid='tabella*' tag='LI' aaname=' &gt;&gt;' />`

I just want the “Nome” column and the URL.


#9

Not sure then. One thing to try is removing the leading space in aaname in the next link selector to see if it works.


#10

I found this parameter through UI Explorer… If I remove the space it doesn’t work, because the selector cannot be found. The next link selector itself does the job, but I don’t know whether it sometimes switches pages too fast to scrape the data in time. If I open a different link with the same selectors but a dataset with fewer elements, it scrapes all the rows and none are missing. If I run the same sequence on a bigger dataset, it skips some data… I cannot understand why…