In LinkedIn, under the Jobs section, I ran a job search with the location set to “United States”. When I got the list of jobs, I ran data scraping on it and tried to write the data to a CSV file. However, I noticed that if I try to scrape 50 results, I do get 50, but the data is picked up seemingly at random from the LinkedIn results pages. Can anyone please check this workflow and help me out?
test2.xaml (9.9 KB)
@ddpadil I am facing the same issue in IE as well.
When you say “picked randomly”, do you mean the results are in a different order, or that data scraping failed to pick up the data available on the screen?
Each page has 25 items, so if I scrape 50 results, it should ideally go only as far as the 2nd page. Instead it goes through all the pages (e.g. if there are 70 pages, it goes through all 70), picks data from any of them, and ends up with a total count of 50.
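As a rough sanity check of that expectation, here is a minimal Python sketch (it is not part of the UiPath workflow, and `pages_needed` is a hypothetical helper name) that computes how many result pages a scrape should need to visit, assuming 25 items per page as described above:

```python
import math

def pages_needed(max_results: int, page_size: int = 25) -> int:
    """Number of result pages that must be visited to collect max_results rows."""
    if max_results <= 0:
        return 0
    # Round up: a partially used page still has to be visited.
    return math.ceil(max_results / page_size)

print(pages_needed(50))  # 2
```

So for 50 results the scrape should stop after page 2; visiting more pages than that suggests the result limit or the next-page step is not behaving as intended.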
Oh, got it.
I tried the scenario with LinkedIn. You’re right, it’s not even taking 100 results even though that is set in the property.
It just goes to the second page and stops, but I only get 10 results (it is supposed to be 20).
Which version are you using?
I am using 2017.1.6435
Oh, the new one. I was going to tell you to use the new version.
@ovi I guess the data scraping activity needs improvement!
Can you try increasing the delay value in the DelayBetweenPages property of the Extract Structured Data activity?
I’m not sure what the problem is exactly but here’s a working example that might be helpful: Main.xaml (14.1 KB)
You can also check whether the next-page button is selected properly during scraping. You have to select the surrounding button, not the arrow, because they have different selectors.
I don’t want to put the URL in the Open Browser activity; I only want to perform data scraping. I did also try it this way, but I am still not getting the desired result.
I have tried selecting the next-page button in every way I could.
I did that but still didn’t get the desired result.
You don’t have to use Open Browser if the page is already open, so you can remove the Open Browser activity from the example. Doesn’t the example that I provided give the desired result?
No, it is not giving the desired result.
Can you please help me understand what is the desired result? I thought it was to extract the first 50 jobs (meaning the first 2 pages) into a csv file.
Yes, that is exactly the requirement. I am getting a total of 50 results, but they are being picked up seemingly at random from the LinkedIn pages. The scrape goes through all the pages and picks 50 records from among them.
Hmm, see, that’s why I’m confused, because my example gets only the first 50 jobs from the first 2 pages and then the data scraping stops on page 2. The resulting CSV file contains the first 50 jobs in the order they appear on the site. Are you sure you are looking at the most up-to-date data and that the jobs are sorted by “Relevance”? I don’t know what else could be the issue.
I would like to ask one thing: in the workflow that you uploaded, you created an Output Data Table activity and stored its output in “str”, but this output variable is not used anywhere in the workflow. Can you please tell me what it is for?
Now, as for the issue, let me explain it in more detail.
Each LinkedIn job search page has 25 entries, and suppose the data spans approximately 10 pages. Now let’s suppose I need to scrape 50 results through data scraping.
When I run either the process you provided or the one I created, the issue is the same: data scraping does not end on the 2nd page. It goes through all 10 pages (i.e. 250 records), picks 3-4 records from the first page, 5-6 from the second, 2-3 from the third, and so on, until it has collected a total of 50 records, which it writes to the CSV file. If you then compare the CSV records with the LinkedIn pages, the first records appear to be captured properly, but as you go down the list the records no longer match.
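One way to pin down where the CSV starts to diverge from the site is to compare the scraped titles against the titles as they are listed on the page, in order. A minimal Python sketch, run outside of UiPath; `first_mismatch` and the sample job titles are hypothetical and only for illustration:

```python
from typing import Optional, Sequence

def first_mismatch(scraped: Sequence[str], expected: Sequence[str]) -> Optional[int]:
    """Return the index of the first row where the two lists differ,
    or None if they are identical."""
    for i, (got, want) in enumerate(zip(scraped, expected)):
        if got != want:
            return i
    # All shared positions match; a length difference is still a mismatch.
    if len(scraped) != len(expected):
        return min(len(scraped), len(expected))
    return None

# Hypothetical sample data: the order on the site vs. the order in the CSV.
expected = ["Job A", "Job B", "Job C", "Job D"]
scraped = ["Job A", "Job B", "Job D", "Job C"]
print(first_mismatch(scraped, expected))  # 2
```

Knowing the row at which the mismatch starts (for example, right after the first page boundary at row 25) might help narrow down whether paging or row selection is the cause.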
And yes, I have the updated data and it is sorted by relevance, but I don’t think this should matter, because data scraping should capture whatever record it is pointed at.
I hope I was able to clarify it now.
I used the Output Data Table activity together with a Write Line activity to print the result of the data scraping to the console before writing to csv, just to make sure the data scraping works as expected. I then deleted the Write Line activity as it was not needed but I forgot to delete the Output Data Table activity. It can be safely deleted because it doesn’t affect the flow.
As for the problem, I am out of ideas. I know it’s working for me but I can’t explain why it’s not working for you and I can’t reproduce the behavior you get. Maybe someone else could help and I hope you figure it out.
Can you please try picking the second or third row when choosing the second set of data during data scraping?
E.g. choose the first row as the first set and the second or third row as the second set of data.