Data Scraping Suggestion

Hi, I need your suggestion. We use Data Scraping to scrape data from multiple pages, right?
For example, I have 10 pages on the web and I am scraping data from them.
Suppose I set it to fetch 50 records. I run the bot and it fetches the first 50 records from the pages and gives them to me, no matter whether those 50 records are on page 1, page 2, or any other page.
Now my question: when I run it a second time, I don't want to fetch the records I already fetched in the first run. Can we skip the first 50 records that the bot already fetched and start from record 51? Is it possible, and if yes, how?
Can I set this condition on the Data Scraping activity itself, so that it does not return records that were already fetched?
@Lahiru.Fernando @lakshman @Palaniyappan

Is this a possible scenario?

You can't directly skip rows while data scraping.
Instead, fetch all pages and take only the required number of rows from the scraped data table.

newdt = myDataTable.AsEnumerable().Skip(skipRows).Take(50).CopyToDataTable

Skip - Will skip the unwanted rows
Take - Will take the next 50 rows
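
To make the Skip/Take behaviour concrete, here is a minimal sketch of the same logic in Python rather than in the VB.NET expression above. The `skip_take` helper and the 120-row `scraped` list are hypothetical stand-ins for the scraped data table, not part of UiPath.

```python
from itertools import islice

def skip_take(rows, skip_rows, take_rows):
    """Return up to take_rows items after skipping the first skip_rows."""
    return list(islice(rows, skip_rows, skip_rows + take_rows))

# Hypothetical stand-in for the scraped table: 120 rows.
scraped = [f"record {i}" for i in range(1, 121)]

first_run = skip_take(scraped, 0, 50)     # records 1 to 50
second_run = skip_take(scraped, 50, 50)   # records 51 to 100
third_run = skip_take(scraped, 100, 50)   # only 20 rows are left
```

Note that when fewer rows remain than requested, you simply get the remaining ones. In .NET, `CopyToDataTable` throws an `InvalidOperationException` when the sequence has no rows at all, so it is worth checking the row count before the Assign if `skipRows` can exceed the table size.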

Regards,
Karthik Byggari

Thanks for your quick response bro.

So where can I use this expression?
newdt = myDataTable.AsEnumerable().Skip(skipRows).Take(50).CopyToDataTable

Assign Activity.

Create one variable newdt of type DataTable.

So myDataTable is a DataTable variable that is created from the extracted structured data.

So whenever I set this number, it will skip the initial records and start from the next one?
Take(50)

Hi @balkishan

Good question.
I’m not 100% sure on whether we can set that filtering option within the data scraping activity itself. But you can do a workaround to get it done.

In the first iteration, do the scraping and add the data from that datatable to another datatable. A simple Merge Data Table activity can do this. This way, in each iteration you store all the data you scrape in a second, holding datatable variable.

Example
Iteration 1
Do the scraping to a datatable.
Merge it to holding datatable variable
Iteration 2
Scrape the data to a datatable
Merge it
And so on…

So all the data you scrape ends up in the holding datatable. Since you may scrape the same data as in previous iterations, it will contain duplicate records. To get rid of them, you can simply use the Remove Duplicate Rows activity :slight_smile:

However, if you are planning separate executions, the datatable approach has to be modified a bit, because the data from the first execution is not available to the second one. In that case, write the data to a temporary Excel file instead: in each execution, scrape the data and append it to the Excel file. Then, at the end, read the file back and remove the duplicates.
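
As a rough illustration of the append-then-deduplicate idea across separate executions, here is a small Python sketch. It uses a CSV file in place of the Excel file, and the file name, helper functions, and sample rows are all made up for the example.

```python
import csv
import os
import tempfile

def append_rows(path, rows):
    """Append the rows scraped in one execution to the holding file."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

def read_unique_rows(path):
    """Read the holding file back and drop duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                unique.append(row)
    return unique

# Hypothetical data: the second execution re-scrapes one row from the first.
holding_file = os.path.join(tempfile.mkdtemp(), "scraped.csv")
append_rows(holding_file, [["1", "Item A"], ["2", "Item B"]])   # execution 1
append_rows(holding_file, [["2", "Item B"], ["3", "Item C"]])   # execution 2

unique_rows = read_unique_rows(holding_file)
```

Here a row counts as a duplicate only when every column matches, so a record whose text changed on the site between runs would be kept twice.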

Yes.

Thanks for the response bro. But I want to skip from the start. I would set the condition that @KarthikByggari mentioned in his post, skip those records, and start from record 51. I still have to check whether it works, because I want to filter before fetching, not download everything and then remove rows.

Yep… looks like @KarthikByggari has a good approach… :slight_smile:

Can I use it directly? First I use the Data Scraping activity and store the result in a DataTable variable, say ExtractedDT. Can I then pass this variable in the Assign activity in place of myDataTable and assign the result to another DataTable variable, say NewDataTable?
Is that the right approach? It is showing an error, though; a screenshot is attached in the comment above.

May I know what this skipRows variable is and what it does?

Hey @balkishan,

It seems like you forgot to declare the skipRows variable. It should be of type int, and its value should be the number of rows you want to skip.

Maybe you have declared it but the scope is incorrect. Please check that.

I didn't declare it. I don't want to skip the rows. My question, as in the post above, is this:
Let's say I am extracting items from a website. In the first run I want to fetch the first 50 records, no matter how many pages they span. The next time I run it, I want to start from item 51. So how do I skip the first 50 records? Hope you understand.
Note: I don't have any data for the first 50 items. I want to fetch directly from item 51.

With the Skip method you skip the number of rows you want, and then with the Take method you fetch the number of records you are looking for.
Skip(skipRows): skips that many records
Take(50): fetches the number of records passed inside the parentheses

myDataTable.AsEnumerable().Skip(skipRows).Take(50).CopyToDataTable
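
One way to get "start from record 51 on the second run" out of this is to remember how many rows earlier runs already returned and use that number as `skipRows`. The Python sketch below shows that bookkeeping only; the JSON state file, the function names, and the 120-item `scraped` list are assumptions for illustration, and the scrape itself still has to return all rows first, as noted earlier in the thread.

```python
import json
import os
import tempfile

PAGE_SIZE = 50  # same value as Take(50) in the thread

def load_offset(path):
    """Return how many rows earlier runs consumed (0 on the first run)."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        return json.load(f)["skip_rows"]

def save_offset(path, skip_rows):
    """Remember the offset so the next execution can pick it up."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"skip_rows": skip_rows}, f)

def run_once(scraped_rows, state_path):
    """One execution: skip what earlier runs returned, take the next batch."""
    skip_rows = load_offset(state_path)
    batch = scraped_rows[skip_rows:skip_rows + PAGE_SIZE]
    save_offset(state_path, skip_rows + len(batch))
    return batch

# Hypothetical stand-in for the scraped items: 120 of them.
state_file = os.path.join(tempfile.mkdtemp(), "offset.json")
scraped = list(range(1, 121))

run1 = run_once(scraped, state_file)   # items 1 to 50
run2 = run_once(scraped, state_file)   # items 51 to 100
run3 = run_once(scraped, state_file)   # items 101 to 120
```

This only works if the site returns the items in the same order on every run; if new items are inserted at the top, a fixed offset will repeat or miss records.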

But can you tell me what this skipRows variable is and where to use it?

skipRows is the variable of type int.

You need to ensure the datatype is int and also the scope should be correct.

Sure

shared one to one

Received successfully

It is not showing me any error. All sorted.