When we scrape data, we find the wizard will give us a choice of two columns for each field we scrape:
We don’t always get what we want from the wizard. Consider this example:
a) Go to http://www.amazon.com/.
b) Type in looper pedals and hit Enter.
(This will get us to our scraping point in our workflow.)
We’ll use these images for our preliminary scrape.
Set up a workflow as shown in the image.
Double-click the “Attach Browser” activity called “Attach looper page”.
On our first run-through, the workflow might look like this.
Now we’ll see what we get as a default. All we want is the pedal description and its picture.
Open the “Data Scraping” wizard.
Select the first field as shown:
Get the second field.
Duplicate the steps above for this text:
Configure the columns for the pedal description:
Preview your data.
Hit “Extract correlated data”.
Repeat steps 1-4 for the image to the left of each description.
When done, configure the columns as shown.
Now we’re ready for our first run, but we’re going to look at the XML first.
Before our run, we’ll check our XML.
In the Properties pane, click the “…” at the far right of the image.
(Be sure you have the activity “Get looper data” selected.)
Here’s our XML (Hint: what we’re after has a rectangle around it …)
Let’s run the workflow as usual.
When we’re done, we’ll have a file “firstloopers.csv” in the project folder.
There it is:
Now let’s see if we got what we wanted: a pedal description and a URL for its picture.
Wait a minute. That’s an awfully long URL, and when we click on it, we get redirected!
We want the image too, not just the product page! How do we do that?
Okay, we’ll edit the XML and see what we need. First, we’d better find what we’re looking for.
We’ll inspect the element in Firefox.
Right-click on the image to the left of:
Which should look like this:
The right-click (context) menu in Firefox has an “Inspect Element (Q)” selection.
That’s what we want.
Here’s what Firefox shows us (cut off at the right edge, unfortunately).
Okay, we can see there’s a JPG in there, right after ‘src=’. That’s what we want.
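To make the difference between the two attributes concrete, here is a minimal Python sketch using only the standard library. The markup is made up for illustration (it is not copied from the Amazon page), and the class name is our own. It shows that the link around the picture carries an href (the long redirect URL), while the picture itself carries a src (the JPG).

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modelled on a search-result tile.
# It is NOT the real Amazon HTML; the URLs are placeholders.
SAMPLE_HTML = """
<a href="https://www.example.com/redirect?item=123">
  <img src="https://www.example.com/images/looper-pedal.jpg" alt="Looper pedal">
</a>
"""


class LinkAndImageCollector(HTMLParser):
    """Collects href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.hrefs.append(attributes["href"])
        elif tag == "img" and "src" in attributes:
            self.srcs.append(attributes["src"])


collector = LinkAndImageCollector()
collector.feed(SAMPLE_HTML)
print("href:", collector.hrefs)  # where the link goes
print("src: ", collector.srcs)   # where the picture file lives
```

The first scrape gave us the equivalent of the href list; what we are after is the src list.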
How do we get it?
It turns out to be simple.
We don’t have to change much. Here’s what our XML looks like after the edit:
See? All we did was add name2="Pedal URL" attr2='src' to the line shown.
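If you would rather see the edit spelled out than squint at a screenshot, here is a rough Python sketch of the same change. The XML below is a hypothetical stand-in: the element layout, the column name “Pedal Link”, and the tags inside it are assumptions for illustration, and your wizard output will look different. Only name2="Pedal URL" and attr2='src' come from the edit described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction metadata. The structure and the "Pedal Link"
# column name are assumptions; copy the real XML from the Properties pane.
SAMPLE_METADATA = """
<extract>
  <row exact="1">
    <webctrl tag="li" />
  </row>
  <column exact="1" name="Pedal Description" attr="text">
    <webctrl tag="h2" />
  </column>
  <column exact="1" name="Pedal Link" attr="href">
    <webctrl tag="img" />
  </column>
</extract>
"""


def add_second_attribute(xml_text, column_name, name2, attr2):
    """Add name2/attr2 to the column whose name matches column_name."""
    root = ET.fromstring(xml_text)
    for column in root.iter("column"):
        if column.get("name") == column_name:
            column.set("name2", name2)
            column.set("attr2", attr2)
    return ET.tostring(root, encoding="unicode")


edited = add_second_attribute(SAMPLE_METADATA, "Pedal Link", "Pedal URL", "src")
print(edited)
```

In practice the edit is quick enough to type by hand in the XML editor; the sketch is only meant to show exactly which two attributes get added, and where.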
Now let’s change the name of the CSV file to “secondloopers.csv” and run our workflow again.
Success. Here’s an excerpt from “secondloopers.csv” with a little re-formatting:
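If you want to double-check a result file without eyeballing every row, a small Python sketch like the one below can flag rows whose URL does not look like an image file. The rows here are made up, not the real contents of “secondloopers.csv”, and the “Pedal Description” header is an assumption; adjust the column names to match your own file.

```python
import csv
import io

# Made-up rows for illustration; NOT the real contents of secondloopers.csv.
SAMPLE_CSV = """Pedal Description,Pedal URL
Example Looper Pedal A,https://www.example.com/images/pedal-a.jpg
Example Looper Pedal B,https://www.example.com/redirect?item=456
"""

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")


def rows_without_image_url(csv_text, url_column="Pedal URL"):
    """Return rows whose URL path does not end in a known image suffix."""
    reader = csv.DictReader(io.StringIO(csv_text))
    suspect_rows = []
    for row in reader:
        # Ignore any query string before checking the file extension.
        path = row[url_column].lower().split("?")[0]
        if not path.endswith(IMAGE_SUFFIXES):
            suspect_rows.append(row)
    return suspect_rows


suspect = rows_without_image_url(SAMPLE_CSV)
for row in suspect:
    print("Not an image URL:", row["Pedal Description"])
```

To run it against a real file, read the file's text and pass that in instead of SAMPLE_CSV.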
There’s more you can do by editing the XML from the Data Scraping wizard, but it will have some surprises in store for you as well. Experiment!
More next time. The files below may help if you get stuck.