(Tutorial) How to customize data scraping by editing XML


#1

When we scrape data, the Data Scraping wizard offers a choice of two columns for each field:

  1. Text
  2. URL

We don’t always get what we want from the wizard. Consider this example:

a) Go to http://www.amazon.com/.
b) Type “looper pedals” into the search box and hit Enter.
(This will get us to our scraping point in our workflow.)

We’ll use these images for our preliminary scrape.

Set up a workflow as shown in the image.

workflow

Double-click the “Attach Browser” activity called “Attach looper page”.

In our first run through, the workflow might look like this.

Now we’ll see what we get as a default. All we want is the pedal description and its picture.

Open the “Data Scraping” wizard.

select1

Select the first field as shown:

Get the second field.

select2

Duplicate the steps above for this text:

tc text

Configure the columns for the pedal description:

configure columns

Preview your data.
preview data 1

Hit “Extract correlated data”.

Repeat the same steps for the image to the left of each description.

When done, configure the columns as shown.

URL configure

Now we’re ready for our first run, but before that we’ll check the XML.

In the Properties pane, click the “…” at the far right of the image.
To view XML
(Be sure you have the activity “Get looper data” selected)

Here’s our XML (Hint: what we’re after has a rectangle around it …)

Let’s run the workflow as usual.
When we’re done, we’ll have a file “firstloopers.csv” in the project folder.
There it is:

Now let’s see if we got what we wanted: a pedal description and a URL for its picture.

Wait a minute. That’s an awfully long URL - and when we click on it, we see we got a redirect!

We want the image too, not just the product page! How do we do that?
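A quick way to tell the two kinds of link apart is to look at the path part of the URL: an image link normally ends in an image extension, while a product-page or redirect link does not. Here is a minimal Python sketch, assuming that convention holds. The function name `looks_like_image_url` and the sample URLs are made up for illustration and are not values from the scrape.

```python
from urllib.parse import urlparse

# Extensions we treat as "this is an image" (an assumption, extend as needed).
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")

def looks_like_image_url(url):
    """Return True if the URL's path ends in a common image extension."""
    # Only the path is checked, so query strings do not confuse the test.
    path = urlparse(url).path.lower()
    return path.endswith(IMAGE_EXTENSIONS)

# Hypothetical examples, not real scraped values.
print(looks_like_image_url("https://example.com/images/pedal.jpg"))
print(looks_like_image_url("https://example.com/redirect?url=%2Fpedal"))
```

Running a check like this over the URL column of the CSV would show at a glance whether we scraped pictures or pages.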

Okay, we’ll edit the XML and see what we need. First, we’d better find what we’re looking for.
We’ll inspect the element in Firefox.

Right-click on the image to the left of:


Which should look like this:
rowin
The right-click (context) menu in Firefox has an “Inspect Element (Q)” selection.
That’s what we want.

Here’s what Firefox shows us (cut off at the right edge, unfortunately).

Okay, we can see there’s a JPG in there, right after ‘src=’. That’s what we want.
How do we get it?
It turns out to be simple.
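To make the idea concrete: the value we are after is simply the `src` attribute of the `img` tag, as opposed to the `href` of the link wrapped around it. The Python sketch below pulls `src` values out of an HTML snippet with the standard library parser. The snippet, the class name `ImgSrcCollector`, and the helper `image_sources` are hypothetical, not copied from the Amazon page.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def image_sources(html):
    """Return the src values of all <img> tags in the given HTML text."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return collector.sources

# Hypothetical markup: a product link wrapped around a thumbnail image.
sample = '<a href="/product/123"><img src="https://example.com/img/pedal.jpg" alt="Looper pedal"></a>'
print(image_sources(sample))
```

The `href` belongs to the `a` tag and leads to the product page; the `src` belongs to the `img` tag and leads to the picture itself.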

We don’t have to change much. Here’s what our XML looks like after the edit:

See? All we did was add name2="Pedal URL" attr2='src' to the line shown.
Now let’s change the name of the CSV file to “secondloopers.csv” and run our workflow again.
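If you would rather make this kind of edit with a script than by hand, here is a rough Python sketch. The metadata in it is a made-up stand-in: the `extract`, `row`, `column` and `webctrl` layout follows what the wizard typically generates, but the selectors and the column name “Pedal link” are placeholders, not the real XML from this workflow. The helper `add_second_attribute` is likewise hypothetical. Only name2="Pedal URL" and attr2='src' come from the edit described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata; selectors and the first column names are placeholders.
METADATA = """
<extract>
  <row exact="1">
    <webctrl tag="li" />
  </row>
  <column exact="1" name="Description" attr="text">
    <webctrl tag="h2" />
  </column>
  <column exact="1" name="Pedal link" attr="href">
    <webctrl tag="img" />
  </column>
</extract>
"""

def add_second_attribute(xml_text, column_name, name2, attr2):
    """Add name2/attr2 to the <column> whose name matches column_name."""
    root = ET.fromstring(xml_text.strip())
    for column in root.iter("column"):
        if column.get("name") == column_name:
            column.set("name2", name2)
            column.set("attr2", attr2)
    return ET.tostring(root, encoding="unicode")

edited = add_second_attribute(METADATA, "Pedal link", "Pedal URL", "src")
print(edited)
```

The printed XML can be pasted back into the property in place of the original, but check it against what the wizard produced for your own page before running the workflow.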

Success. Here’s an excerpt from “secondloopers.csv” with a little re-formatting:

There’s more you can do by editing the XML from the Data Scraping wizard, but it will have some surprises in store for you as well. Experiment!

More next time. The files below may help if you get stuck.

Regards,
burque505

tutorial.xaml (16.4 KB)
tutorial2.xaml (17.2 KB)


#2

Can you please clarify whether the XML format/schema is documented somewhere?
It would be great to understand how to edit the XML without having to go through the wizard. Can you point to the documentation?

Thanks