So we’re scraping an HTML table of links (several different URLs, each with anywhere from a dozen to 50 or so links). Unfortunately the table we scrape doesn’t contain some of the data we need, such as the descriptive title of the page being linked to and the date it was last updated. So we’re resorting to iterating over the scraped links and using Browser.NavigateTo to open each page, where we do some text scraping.
This is painfully slow. Trying to think of something a bit peppier, I’ve played with the idea of downloading the raw HTML — without actually navigating to and rendering the page — and then doing string searches etc. to pull out what we need.
I’ve tried using System.Net.WebClient.DownloadString, which works in general, but not with the specific URLs we’re dealing with here (I suspect it has something to do with security certificates, redirects, or something similar).
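For context, the direction I’ve been sketching is something like the following (C#; untested against our actual sites, and the URL is just a placeholder): an HttpClient configured to follow redirects and skip certificate validation, which I’m guessing might get around whatever is tripping up WebClient.

```csharp
using System;
using System.Net.Http;

static class HtmlFetcher
{
    // Build an HttpClient that follows redirects and skips TLS
    // certificate validation (only acceptable for sites you trust).
    public static HttpClient CreateClient()
    {
        var handler = new HttpClientHandler
        {
            AllowAutoRedirect = true,
            ServerCertificateCustomValidationCallback =
                HttpClientHandler.DangerousAcceptAnyServerCertificateValidationCallback
        };
        var client = new HttpClient(handler);
        // Some servers reject requests without a browser-like User-Agent.
        client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0");
        return client;
    }

    static void Main()
    {
        using var client = CreateClient();
        // Placeholder URL; substitute one of the scraped links.
        string html = client.GetStringAsync("https://example.com/")
                            .GetAwaiter().GetResult();
        Console.WriteLine(html.Length);
    }
}
```

(If it matters, I assume something equivalent could run from an Invoke Code activity rather than a custom library, but I haven’t gotten that far.)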
I’ve also looked at the HTTP Request activity but haven’t been able to get it to work. Though I’ve seen some samples of how to use it to download FILES and such, I haven’t found a clear example of how to use it to download or stream the HTML itself.
Open to any suggestions or insight anyone can offer.
ddk