Web Scraping URL error


#1

Hi,
I am scraping data plus the corresponding URLs from certain sites on the internet. However, some of the URLs are only being partially scraped, i.e. instead of:
“http://www.irishtimes.com/news/offbeat/she-has-a-nice-smile-trump-singles-out-rté-s-caitriona-perry-1.3136445”,
I just get “/news/offbeat/she-has-a-nice-smile-trump-singles-out-rt%C3%A9-s-caitriona-perry-1.3136445”.
Any ideas why this is happening? It only happens for some websites and not for others.


#2

Hi,
May I know what activity you're using to get the URL?
Or you could attach the workflow for better understanding.


#3

SampleFlow.xaml (3.0 KB)

Here is a sample of what I’m doing, and the site that is being scraped from is the one that is causing the URL error.
Thanks for your help.


#4

Oops. Looks like your workflow is blank.


#5

Apologies,
My workflow won’t upload for some reason.
Essentially, I am running a For Each loop to search three terms in the Irish Times website search box. Then, on the search page, I use a recording to scrape the headline and URL of the top 5 results for that search. I then append these results to an Excel sheet.


#6

How about concatenating with "http://www.irishtimes.com", as this part stays the same, and appending the rest: "http://www.irishtimes.com" + "Scraped URL"
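
To illustrate the idea, here is a minimal sketch in Python rather than in the workflow itself. The name BASE_URL and the helper prepend_base are made up for this example; the prefix is taken from the full URL quoted in post #1.

```python
# Assumed fixed prefix for Irish Times links (from the example URL in post #1).
BASE_URL = "http://www.irishtimes.com"


def prepend_base(scraped_url: str) -> str:
    """Join the fixed site prefix with the partial (relative) scraped URL."""
    return BASE_URL + scraped_url


# The partial URL quoted in post #1.
partial = "/news/offbeat/she-has-a-nice-smile-trump-singles-out-rt%C3%A9-s-caitriona-perry-1.3136445"
print(prepend_base(partial))
```

This only works as-is when the scraped value starts with "/", as it does in the example above.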


#7

I was thinking of doing something along those lines. The only problem is that I am scraping URLs from other websites too, so I will only need to perform the concatenation for the Irish Times URLs. This is tricky because the number of URLs scraped from each website varies slightly each time I run the program, depending on how many relevant links there are. Is there any way to attach the string “www.irishtimes.com” to the scraped URL before appending to the Excel sheet?
Thanks for your help.


#8

There are two ways of doing it:
1. Use a condition: if website = “www.irishtimes.com”, then use concatenation.
2. Once you scrape the URL you have an output variable; you can concatenate using an Assign activity and pass the new variable into Excel.
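
The two options can be combined. Below is a rough sketch in Python, not the actual workflow: the rows list, the second site, and the helper complete_url are hypothetical and only stand in for the scraped output variable. It assumes each scraped row records which website it came from, and that partial URLs start with "/".

```python
# Hypothetical scraped rows: (website the result came from, scraped URL).
rows = [
    ("www.irishtimes.com", "/news/offbeat/example-1.3136445"),
    ("www.example.com", "http://www.example.com/story/123"),
]


def complete_url(website: str, scraped_url: str) -> str:
    """Prefix the site only for Irish Times links that came back partial."""
    if website == "www.irishtimes.com" and scraped_url.startswith("/"):
        return "http://" + website + scraped_url
    # URLs from other sites (or already complete ones) pass through unchanged.
    return scraped_url


# Build the corrected rows before they are written to the Excel sheet.
fixed_rows = [(site, complete_url(site, url)) for site, url in rows]
for site, url in fixed_rows:
    print(site, url)
```

Because the check runs per row, it does not matter how many URLs each website returns on a given run.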