Data Scraping: Scrape new elements until the first duplicated row is found

Hello guys,

I used the Data Scraping wizard to scrape around 10,000 rows of structured data.

Every day the website updates the content, usually around 1 - 5 elements.

So what I want is to restart the Data Scraping wizard every 24 hours to get the new elements.

However, as the number of new elements changes daily, it’s hard to tell the bot when to stop scraping new data.

My current workaround is the following:

  1. Scrape the whole database once (the ~10,000 rows I talked about)
  2. Reuse the scraping mechanism from 1), but limit the scraped elements to 50 (because it is very unlikely that more than 50 new elements get uploaded within 24 hours)
  3. Merge the database from 1) with the new elements from 2)
  4. Count duplicated rows
  5. If duplicated rows = 0, send a mail to me (because it means that either there were exactly 50 new elements or the bot didn’t scrape everything)
  6. Remove duplicated rows
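
For anyone who wants to see the logic of steps 2) to 6) outside of the wizard, here is a rough sketch in plain Python. It is not UiPath code; the function name `merge_daily_scrape`, the tuple representation of rows, and the `needs_review` flag (standing in for the mail from step 5) are all made up for illustration.

```python
def merge_daily_scrape(existing_rows, scraped_rows, limit=50):
    """Merge a limited daily scrape into the rows already stored.

    Rows are assumed to be hashable (e.g. tuples of cell values).
    Returns (merged_rows, new_rows, needs_review).
    """
    existing = set(existing_rows)
    seen_new = set()
    new_rows = []
    overlap = 0  # scraped rows that were already in the stored data

    for row in scraped_rows[:limit]:
        if row in existing:
            overlap += 1
        elif row not in seen_new:
            seen_new.add(row)
            new_rows.append(row)

    # No overlap means the scrape never reached known data, so either
    # the limit was too small or something went wrong (step 5).
    needs_review = overlap == 0

    # New rows first, then the stored rows, with duplicates left out (step 6).
    merged_rows = new_rows + list(existing_rows)
    return merged_rows, new_rows, needs_review
```

The membership checks against a set keep this cheap even for the full ~10,000 rows, since only the 50 scraped rows are looped over.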

I’m not too happy with this workaround because it’s not very efficient.

Tl;dr: Is there a way to check within the Data Scraping wizard if a duplicated row was added, and if so, finish the wizard?

You should find a way to detect new items, for example through tags or a way to order by date. If there is none, you will never be sure without fetching everything again.
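
If the site can be ordered newest first, the "stop at the first duplicate" idea from the question could look roughly like the sketch below. This is plain Python and only an illustration: `collect_new_rows` is a hypothetical helper, `pages` is assumed to be any iterable that yields one list of rows per result page, and the newest-first ordering is an assumption that the site has to guarantee.

```python
def collect_new_rows(pages, known_rows):
    """Collect rows page by page until the first already-known row.

    Assumes rows arrive newest first, so everything after the first
    known row is assumed to be known as well.
    """
    known = set(known_rows)
    new_rows = []
    for page in pages:
        for row in page:
            if row in known:
                # First duplicate found: stop without requesting more pages.
                return new_rows
            new_rows.append(row)
    return new_rows
```

If `pages` is a generator that loads each page on demand, no further pages are requested once the first known row shows up.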

By creating a DataView that contains only the unique rows from your table, you can detect duplicate rows by comparing its row count with the row count of the original table.
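
The count comparison itself is simple. Here is a minimal sketch in plain Python, where a set of row tuples stands in for the unique-rows view; in .NET the distinct table would presumably come from `DataView.ToTable` with the distinct flag set to true. The function name `count_duplicate_rows` is made up for the example.

```python
def count_duplicate_rows(rows):
    """Return how many rows are repeats of an earlier row.

    Rows are assumed to be hashable (e.g. tuples of cell values).
    """
    unique_rows = set(rows)  # plays the role of the unique-rows view
    return len(rows) - len(unique_rows)
```

A result of 0 means the merged table has no duplicates, which is the condition checked before sending the mail in the workaround.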