Hi, I need your suggestions. We use Data Scraping to scrape data from multiple pages, right?
For example, I have 10 pages on the web and I am scraping data from them.
Suppose I set it to fetch 50 records. I run the bot and it fetches the first 50 records from the pages and gives them to me, no matter whether those 50 records are on page 1, page 2, or any other page.
Now my question: if I run the bot a second time, I don't want to fetch the records that were fetched during the first run. Can we skip the first 50 records the bot already fetched and start from record 51? Is it possible, and if yes, how?
Can I set this condition on the Data Scraping activity itself, so that it leaves out the records already fetched? @Lahiru.Fernando @lakshman @Palaniyappan
You can’t directly skip rows while data scraping.
Instead, fetch all the pages and take only the required number of rows from the scraped data table.
Good question.
I’m not 100% sure on whether we can set that filtering option within the data scraping activity itself. But you can do a workaround to get it done.
In the first iteration, do the scraping and add the data from that datatable to another datatable. A simple Merge Data Table activity can do this. That way, in each iteration you store all the data you scrape in a separate holding datatable variable.
Example
Iteration 1
Do the scraping to a datatable.
Merge it to holding datatable variable
Iteration 2
Scrape the data to a datatable
Merge it
And so on…
So all the data you scrape ends up in the second (holding) datatable. Since you may scrape the same data again in later iterations, it will naturally contain duplicate records. To get rid of them, you can simply use the Remove Duplicate Rows activity.
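To make the merge-then-remove-duplicates idea concrete, here is a minimal sketch in plain Python rather than UiPath activities. The function name `merge_unique` and the sample rows are made up for illustration, and a row counts as a duplicate only when every column matches.

```python
def merge_unique(holding, scraped):
    """Append rows from `scraped` to `holding`, skipping rows already present."""
    seen = {tuple(row) for row in holding}
    for row in scraped:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            holding.append(row)
    return holding


holding = []  # plays the role of the holding datatable
merge_unique(holding, [("A", 10), ("B", 20)])  # iteration 1
merge_unique(holding, [("B", 20), ("C", 30)])  # iteration 2 overlaps with iteration 1
print(holding)
```

Here the duplicates are filtered while merging instead of in a separate step afterwards; the end result should be the same as merging everything and then removing duplicates.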
However, if you are planning separate executions, the datatable approach has to be modified a bit: the data from the first run is not available to the second, because the two are separate executions. In that case, write the data from each execution to a temporary Excel file. In each execution, scrape the data and append it to the Excel file. Then, at the end, read the file and remove the duplicates.
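The same idea across separate executions could look roughly like this. It is only a sketch under some assumptions: a CSV file stands in for the temporary Excel file, the helper names `append_rows` and `read_unique` are invented for this example, and the two calls to `append_rows` simulate two separate runs of the bot.

```python
import csv
import os
import tempfile


def append_rows(path, rows):
    """Append the rows scraped in one execution to the shared file."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def read_unique(path):
    """Read every stored row back, keeping only the first copy of each."""
    seen = set()
    unique = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                unique.append(row)
    return unique


# Hypothetical file location; a real workflow would use a fixed, known path.
store = os.path.join(tempfile.mkdtemp(), "scraped.csv")
append_rows(store, [["A", "10"], ["B", "20"]])  # execution 1
append_rows(store, [["B", "20"], ["C", "30"]])  # execution 2
result = read_unique(store)
print(result)
```

Note that values read back from a CSV are strings, so the comparison is done on the text of each cell.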
Thanks for the response, bro. But I want to skip the records from the start. I would set the condition that @KarthikByggari mentioned in the post, skip those records, and start from record 51. I still have to check whether it works, because I want the filtering to happen before fetching, not after downloading everything and then removing records.
Can I use it directly? First I use the Data Scraping activity and store the result in a DataTable variable, say ExtractedDT. Can I then pass this variable in the Assign activity in place of myDatatable and assign the result to another DataTable variable, say NewDataTable?
Is this the right approach? It shows an error, though; see the screenshot attached in the comment above.
I didn’t declare it. I don’t want to skip the rows. My question, as described in the post above, is this.
Let’s say I am extracting items from a website. In the first run I want to fetch the first 50 records, no matter how many pages they span. The next time I run it, I want to start from item 51. So how do I skip the first 50 records? I hope that is clear. Note: I don’t have any data for the first 50 items. I want to fetch directly from item 51 onward.
With the Skip method you skip the number of rows you want, and then with the Take method you fetch the number of records you are looking for.
Skip(skipRows): skips the given number of records
Take(50): fetches the number of records passed inside the parentheses
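For anyone who wants to see the Skip/Take logic outside UiPath, here is a small Python sketch of the same windowing. The function `skip_take` and the sample item names are hypothetical; `itertools.islice` plays the role of Skip followed by Take. As mentioned earlier in the thread, this still works on rows that were already scraped; it does not stop the scraper from fetching them.

```python
from itertools import islice


def skip_take(rows, skip_rows, take_rows):
    """Skip the first `skip_rows` rows, then return up to `take_rows` rows."""
    return list(islice(rows, skip_rows, skip_rows + take_rows))


# Pretend the scraper returned 120 items across all pages.
scraped = [f"item-{i}" for i in range(1, 121)]

first_run = skip_take(scraped, 0, 50)    # records 1 to 50
second_run = skip_take(scraped, 50, 50)  # records 51 to 100
print(first_run[0], first_run[-1], second_run[0], second_run[-1])
```

For the second run to know it should skip 50 rows, the skip count has to be kept somewhere between runs (a file or a config value, for example); that part is not shown here.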