Not scraping data from entire page

I am currently creating a bot to scrape information from real estate listings on zillow.com. I used the Data Scraping wizard, and after identifying the areas of the website I would like it to scrape, it presents all the scraped data from the page.

I was able to check that all the correct information was highlighted and I scrolled down the page to make sure the highlighting pattern continued on all the listings on the page.

After selecting all the correlated data that I would like scraped, the generated table displays all the desired information from every listing on the page, matching the highlighted sections I selected in the previous step. Note: in this step, all the information from the first page was properly and accurately displayed, and there are 40 listings per page.

When I tell the bot to scrape the first 40 listings, it only scrapes the first 9 per page before moving to the next page. This makes the bot go through 5 pages to scrape 40 listings when it should be getting all the results from the first page. It is as if the bot can't see past the 9th listing on each page, even though it recognized them when I used the Data Scraping wizard in the steps above. I have been troubleshooting for days and have not found a solution. Help would be much appreciated.


Hi, welcome!

I can’t be sure, but this website might be lazy loading.

Try sending the Page Down key until you reach the bottom of the page and then do the scraping. Do you get all the desired results then?

I was following a Udemy course while making this bot, and although the lecturer did all the same steps I did, his bot scrapes the data from all 40 listings per page while mine only scrapes the first 9.

Assuming the website I am scraping is lazy loading, is it possible to add a delay to the scanning process per page? I know I can create a delay between steps in a bot, but I am not sure how to add a small delay that lets the page load while scraping, since the data scraping action is a closed loop that goes through the pages without letting me add a command in the middle of its activity.

After adding 7 PgDn hotkeys (I am assuming there is a better way to do this) to reach the end of the page, the bot scrapes all the data from the first page properly. The problem arises when it tries to scrape data from the 2nd page and beyond, because I don't have a way to make the browser scroll to the bottom of the webpage during the data scraping process. The only idea I have is to create a loop that attaches to the browser, scrolls to the bottom of the page, extracts the data, and then clicks the next-page button, but that seems like a clumsy workaround and won't be a sustainable fix if I ever want to scrape data from other websites that also lazy load.

Do you have any other ideas? Is there a way to set the data scraping activity to wait until the webpage is fully loaded? And do you have any idea as to why the data scraping works for other people’s bots on the same webpage and why mine may have issues (like stated in my last reply)?

Something like this has been added for scraping SAP tables, but I don’t think it is implemented in classic data scraping! :frowning: Check out this thread as well: DataScrapping on an infinite scrolling page

You could try around with different browsers! Chrome/Firefox/Edge might give different results.

Actually I would do it like you mentioned before and I’ve done this before in an old academy assignment:

  • Use an Element Exists activity to check whether a button for the next page exists.
  • Add a While activity using that Element Exists output boolean as its condition.
  • Add the same Element Exists activity (you might need to wildcard the selector) to the While body.
  • Then click the next-page button inside the While body (this click will fail on the last page, because there is no button; you can catch it with Try/Catch or only execute it inside an If based on the Element Exists output boolean).

You now have a structure that loops through the pages until you reach the last one.

There is! Put a Repeat activity with a PgDn hotkey, repeated 7 times, at the start of the While body. Then do your data scraping, but do not indicate that there is a next page; we use our own structure for that. And there you should have it:

While the button for the next page exists, your bot will scroll down, scrape the data, and click next. On the last page it will check for the next-page button, scroll down, scrape the data, and not click.
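The page loop above can be sketched in plain Python pseudologic. This is just the control flow, not UiPath code: the page list, the index, and the comments stand in for the Send Hotkey, Extract Data, Element Exists, and Click activities.

```python
# Minimal simulation of the pagination structure described above.
# A "site" is modeled as a list of pages, each page a list of listings;
# in UiPath the commented steps would be real activities instead.

def scrape_all_pages(pages):
    all_rows = []
    i = 0
    while True:
        # Send Hotkey: 7x PgDn here to defeat lazy loading
        all_rows.extend(pages[i])      # Extract Data (single page, no "next")
        if i + 1 >= len(pages):        # Element Exists: no next-page button
            break                      # last page: scraped, but no click
        i += 1                         # Click: next-page button
    return all_rows

site = [["listing %d-%d" % (p, n) for n in range(40)] for p in range(3)]
print(len(scrape_all_pages(site)))  # 120
```

The key point the sketch shows: the last page is still scraped, but the click is skipped because the existence check fails first.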

You will probably need to merge the resulting DataTable from the current page into a master DataTable; then you should get all the results from all the pages in one DataTable :wink:

Hope this helps!

This helps a lot, but I ran into a few problems. When I use the Element Exists activity on the last page of the website, the boolean returns 'true', since the "Next page" button still shows/exists; however, it is greyed out and has no hyperlink behind it (which Element Exists apparently doesn't care about). Because of this, the loop you suggested runs forever and throws no errors to catch, so I am unable to make the NextPageExists boolean 'false' to exit the loop. I don't know if there is a version of Element Exists that only matches valid elements with a hyperlink behind them; if there is, I would just use that and the loop would stop on the last page. Do you know a better way to do the loop with Element Exists that doesn't put the bot into an infinite loop?

Instead, I created a workaround where I have a variable holding the number of items I would like scraped, as well as the maximum number of items that can be scraped per page. Since each Zillow page has 40 listings, the maximum number of items per page is 40.

So the work around looks like this:

[My Variables]:

Int ItemsToScrape (how many items I would like to scrape in total; this needs input from the user)

Int MaxPerPage (in this case 40 because Zillow only displays 40 items per page and this is set as the default)

Int ScrapesNextRun (how many items the data-scraping activity should scrape per run)

Int LeftToScrape (after each scrape, how many more items I still want to scrape; this is ItemsToScrape - ScrapesNextRun)

While LeftToScrape <> 0 {

If: LeftToScrape - MaxPerPage > 0 {

ScrapesNextRun = MaxPerPage

LeftToScrape = LeftToScrape - MaxPerPage

}

Else If: LeftToScrape - MaxPerPage < 0 {

ScrapesNextRun = LeftToScrape

LeftToScrape = 0

}

Else (so when LeftToScrape and MaxPerPage are the same) {

ScrapesNextRun = MaxPerPage

LeftToScrape = LeftToScrape - MaxPerPage (in this case this equals 0, which stops the loop when control gets back to the While)

}

Then I used the Repeat Action activity 7 times (for the PgDn hotkey) and started the data scraping activity for a single page. Inside the Properties panel of the data scraping activity, I set the number of results to scrape to the variable ScrapesNextRun.

I then went through the ExtractedData table, putting the rows into a new DataTable named "BuiltDataTable", and then I cleared the ExtractedData table; otherwise it would not overwrite the information in subsequent runs of the single-page data scraping activity. (Is that what you meant by a master DataTable, or is there a better solution than what I did?)

} (end of while loop)

Export to Excel Document

End of Program.
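For what it's worth, the three branches of the If/Else above all reduce to taking the smaller of LeftToScrape and MaxPerPage. Here is a hedged sketch of the same counting logic in Python (the function and names are illustrative, not part of the workflow):

```python
def plan_scrapes(items_to_scrape, max_per_page=40):
    """Yield ScrapesNextRun for each pass of the While loop above.
    min() covers all three branches (greater, less, and equal)."""
    left_to_scrape = items_to_scrape
    while left_to_scrape != 0:
        scrapes_next_run = min(left_to_scrape, max_per_page)
        left_to_scrape -= scrapes_next_run
        yield scrapes_next_run

print(list(plan_scrapes(100)))  # [40, 40, 20]
print(list(plan_scrapes(80)))   # [40, 40]
```

Collapsing the branches this way would shrink the If/Else-If/Else block to a single Assign in the workflow.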

I can foresee a problem: if I tell the bot to scrape more listings than are available for the given city, I don't have a way to stop it when the next-page button is not responsive (the same problem I have with the Element Exists method). However, through this workaround I am able to scrape as many items as I choose, as long as that does not exceed the number of listings for the city.

Do you have any feedback on my solution or answers to my questions?

Thanks a lot for your help, without your solutions I wouldn’t have known how to go on in trying to solve this problem.

Also, what did you mean by "you might need to wildcard the selector"?

Again, thanks so much for your help, I really appreciate it

Hi,

happy to help!

How inconvenient!

Selectors for the same button on the next page might be different! Sometimes a page number or some id finds its way into the selector, and you need to wildcard it with * or ? (or a variable/argument).
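As a purely hypothetical illustration (the tag and attribute names here are made up; the real ones come from inspecting the button in UiExplorer), a selector that embeds a page number might look like the first line, and the wildcarded version on the second line would match the button on every page:

```
<webctrl tag='A' aaname='Next page' parentid='search-pagination-2' />
<webctrl tag='A' aaname='Next page' parentid='search-pagination-*' />
```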

You can examine the element (the next-page button) with the UiExplorer tool and look for attributes that change when the button is greyed out. Then use a Get Attribute activity and a conditional expression based on that attribute in the While condition. For UiAutomation, UiExplorer is your best friend!
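A sketch of that Get Attribute idea in plain Python, with the element modeled as a dict. The attribute name "disabled" is an assumption for illustration; the real attribute that changes on the greyed-out button has to be found in UiExplorer (it might be something else entirely, such as a class or state attribute).

```python
# The attribute name "disabled" below is hypothetical -- inspect the
# real greyed-out button in UiExplorer to find the attribute that
# actually changes on the last page.

def next_page_available(button):
    # In UiPath terms: Element Exists (button is not None) combined
    # with Get Attribute feeding the While condition.
    return button is not None and button.get("disabled") != "true"

print(next_page_available({"disabled": "false"}))  # True  (keep paging)
print(next_page_available({"disabled": "true"}))   # False (last page, stop)
print(next_page_available(None))                   # False (button missing)
```

Using this as the While condition stops the loop on the last page even though the button still exists, which is exactly the infinite-loop case described above.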

Kind of. I recommend using the Merge DataTable activity to merge the current page's DataTable into a master DataTable. If you keep the scope of the current DataTable variable inside the While body, you shouldn't even need to clear it, since it gets reinitialized in that scope, no?
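The scoping point can be illustrated in plain Python, with lists standing in for DataTables (this is not UiPath code, just the shape of the idea): the per-page table is created fresh inside the loop body, so it never needs clearing, and only the master survives the loop.

```python
# Master "DataTable" lives outside the loop; the per-page table is
# re-created on each iteration (same effect as scoping ExtractDataTable
# to the While body), so no Clear step is needed.
master = []
for page in range(3):
    page_rows = [(page, n) for n in range(2)]  # fresh table every pass
    master.extend(page_rows)                   # Merge DataTable equivalent
print(len(master))  # 6
```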

It is a good workaround for your problem at hand until you have a more dynamic and stable solution! Well done. However, you should probably rely on external assets, such as an Orchestrator asset or a Config file as in the REFramework, to avoid hardcoding and therefore inflexible process input.

Practice makes perfect! Until then UiAutomation is a lot of trial and error :wink:

Hi, sorry for the late response.

Ahh, ok - I knew about changing the selector with a “*”, but I didn’t know that it is possible to do so with a “?” as well.

Interesting solution. I’ll have to try that and see how it goes.

Hmm, using the scopes as a way to reinitialize the data table makes sense. I will have to try that and see how it works.

Yeah, that makes sense. The problem is that I don't know what assets exist and what functions they serve, so I don't know if there is an Orchestrator asset that solves my problem. How would you suggest searching for the corresponding Orchestrator assets when you have a problem?

Yeah, it definitely does, but I am grateful for the help you provided :slight_smile:


Little misunderstanding :smiley: You can set the values of Orchestrator assets yourself! You configure them how you need them, and if this flexible value ever changes, you don't need to republish your process; just update the asset and you're done :+1:
