Data scraping with a next page button when the internet cuts off

Hey guys, so I’ve run into a problem yet again and want to figure it out once and for all.

So I’m doing data scraping on a specific website. I’m basically going through around 10k URLs to scrape data, and it takes quite a bit of time. I thought about using the next page button when scraping, so I wouldn’t have to load a new webpage every time. The problem is that from time to time my internet connection cuts off, and I can’t figure out how to handle that.

Imagine you go to a URL, for example a Google search for iPhone, and you want to scrape all the URLs returned for this keyword. You use “Extract table data” with the next page button inside it, but let’s say that when you get to page 8, your internet cuts off. How can I make it restart the scraping from the same page where the internet cut off?

I hope I managed to explain it clearly, and I’m waiting for your ideas. Thanks in advance! :slight_smile:

Hi,

You can add a row to the config file (or to a separate temporary Excel file) to hold the page number, and update that row as the process iterates through the pages.

Create a variable called pageNumber and read it from the config file at the beginning of the process. Before the data scraping starts, you can go directly to the page that the pageNumber variable refers to.
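
To make the idea concrete, here is a minimal sketch in Python rather than as a UiPath workflow. It assumes a small JSON file stands in for the config row, and the names `load_page`, `save_page`, `scrape_all` and the `scrape_page` callback are hypothetical, not part of any UiPath API. The saved number is the next page to scrape, so a dropped connection on page 8 leaves 8 in the file.

```python
import json
import os
import tempfile


def load_page(path):
    """Return the saved page number, or 1 if there is no usable checkpoint."""
    try:
        with open(path, encoding="utf-8") as f:
            return int(json.load(f)["pageNumber"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        return 1


def save_page(path, page):
    """Store the next page to scrape (the stand-in for the config row)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"pageNumber": page}, f)


def scrape_all(path, last_page, scrape_page):
    """Scrape from the saved page up to last_page, checkpointing after each page."""
    page = load_page(path)
    while page <= last_page:
        scrape_page(page)      # may raise ConnectionError if the internet drops
        page += 1
        save_page(path, page)  # written only after the page succeeded


def demo():
    """Simulate a connection drop on page 8, then a second run that resumes."""
    path = os.path.join(tempfile.mkdtemp(), "config.json")
    scraped = []
    failed_once = []

    def fake_scrape(page):  # hypothetical scraper that fails once on page 8
        if page == 8 and not failed_once:
            failed_once.append(True)
            raise ConnectionError("simulated dropped connection")
        scraped.append(page)

    try:
        scrape_all(path, 10, fake_scrape)
    except ConnectionError:
        pass  # the first run stops here, as it would when the internet cuts off
    resumed_from = load_page(path)
    scrape_all(path, 10, fake_scrape)  # the second run continues from the checkpoint
    return resumed_from, scraped
```

In the demo, the second call to `scrape_all` picks up at page 8 instead of page 1, because the checkpoint is only advanced after a page has been scraped successfully.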

Hope it helps :slight_smile:

Hi and thanks for the response,

Could you maybe show an example of it? I’m trying to figure out what you are saying, but it is hard to even imagine this task. Really grateful for your response and thanks! :slight_smile:

It would be better to use an Asset.

Hi Paul,

Is it possible to give some more information on the Asset thing?

Get the value of the Asset to determine where to start in the data. When processing, update the Asset to reflect the last row processed.
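
As a rough illustration of that read-then-update pattern, the Python sketch below uses an in-memory object as a stand-in for the Asset. `InMemoryAsset`, `process_rows` and `handle_row` are hypothetical names for this example only; a real Asset lives in Orchestrator and would be read and updated from the workflow instead.

```python
class InMemoryAsset:
    """Hypothetical stand-in for an Orchestrator Asset holding one integer."""

    def __init__(self, value=0):
        self.value = value

    def get(self):
        return self.value

    def set(self, value):
        self.value = value


def process_rows(rows, asset, handle_row):
    """Start at the row count stored in the asset and update it after each row."""
    start = asset.get()  # number of rows already processed
    for index in range(start, len(rows)):
        handle_row(rows[index])  # may raise if the connection drops
        asset.set(index + 1)     # record progress only after success


def demo_rows():
    """Fail once on the fourth row, then resume from the stored position."""
    rows = ["url-%d" % i for i in range(5)]
    asset = InMemoryAsset(0)
    done = []
    state = {"failed": False}

    def handle(row):  # hypothetical row handler that fails once on url-3
        if row == "url-3" and not state["failed"]:
            state["failed"] = True
            raise ConnectionError("simulated dropped connection")
        done.append(row)

    try:
        process_rows(rows, asset, handle)
    except ConnectionError:
        pass
    after_failure = asset.get()
    process_rows(rows, asset, handle)  # resumes at the row that failed
    return after_failure, done
```

Because the value is stored outside the running process, a restart after a failure begins at the first unprocessed row rather than at the top of the list.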

Thanks again Paul, but is it possible to get a visual example of the task? I’ve tried my best, but I failed every time.
Thanks!

Hi @postwick,

How does the asset thing work? I’m trying to do research on it, but can’t find anything.

They’re a basic feature of Orchestrator.