(Tutorial) How to customize data scraping by editing XML

When we scrape data, the Data Scraping wizard offers us two columns for each field we scrape:

  1. Text
  2. URL

We don’t always get what we want from the wizard. Consider this example:

a) Go to http://www.amazon.com/.
b) Type in “looper pedals” and hit Enter.
(This will get us to our scraping point in our workflow.)

We’ll use these images for our preliminary scrape.

Set up a workflow as shown in the image.

(Screenshot: workflow)

Double-click the “Attach Browser” activity called “Attach looper page”.

In our first run through, the workflow might look like this.

Now we’ll see what we get as a default. All we want is the pedal description and its picture.

Open the “Data Scraping” wizard.

(Screenshot: select1)

Select the first field as shown:

Get the second field.

(Screenshot: select2)

Duplicate the steps above for this text:

(Screenshot: tc text)

Configure the columns for the pedal description:

(Screenshot: configure columns)

Preview your data.
(Screenshot: preview data 1)

Hit “Extract correlated data”.

Repeat the steps above for the image to the left of each description.

When done, configure the columns as shown.

(Screenshot: URL configure)

Now we’re ready for our first run, but before we run anything we’ll check the XML the wizard generated.

In the Properties pane, click the “…” at the far right of the image.
(Screenshot: to view XML)
(Be sure you have the activity “Get looper data” selected)

Here’s our XML (Hint: what we’re after has a rectangle around it …)

Let’s run the workflow as usual.
When we’re done, we’ll have a file “firstloopers.csv” in the project folder.
Here it is:

Now let’s see if we got what we wanted: a pedal description and a URL for its picture.

Wait a minute. That’s an awfully long URL - and when we click on it, we see we got a redirect!

We want the image too, not just the product page! How do we do that?

Okay, we’ll edit the XML and see what we need. First, we’d better find what we’re looking for.
We’ll inspect the element in Firefox.

Right-click on the image to the left of:


Which should look like this:
(Screenshot: rowin)
The right-click (context) menu in Firefox has an “Inspect Element (Q)” selection.
That’s what we want.

Here’s what Firefox shows us (cut off at the right edge, unfortunately).

Okay, we can see there’s a JPG in there, right after ‘src=’. That’s what we want.
How do we get it?
It turns out to be simple.

We don’t have to change much. Here’s what our XML looks like after the edit:

See? All we did was add name2="Pedal URL" attr2='src' to the line shown.
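
Since the screenshot of the edited XML isn't reproduced here, this is a rough sketch of what such a line might look like after the edit. Only name2="Pedal URL" and attr2='src' come from the tutorial itself. The column name, the attr value, and the webctrl selector are placeholders assumed for illustration; the wizard will have generated its own values for the Amazon page.

```xml
<!-- Hypothetical sketch: only name2 and attr2 are taken from the tutorial. -->
<!-- The name, attr and webctrl values are placeholders, not wizard output. -->
<column exact="1" name="Pedal image" attr="text" name2="Pedal URL" attr2='src'>
  <webctrl tag="img" />
</column>
```

The idea is that name2 gives the extra CSV column its header and attr2 names the HTML attribute whose value goes into that column, so everything else on the line can stay as the wizard wrote it.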
Now let’s change the name of the CSV file to “secondloopers.csv” and run our workflow again.

Success. Here’s an excerpt from “secondloopers.csv” with a little re-formatting:

There’s more you can do by editing the XML from the Data Scraping wizard, but it will have some surprises in store for you as well. Experiment!

More next time. The files below may help if you get stuck.

Regards,
burque505

tutorial.xaml (16.4 KB)
tutorial2.xaml (17.2 KB)

Can you please clarify whether the XML format/schema is documented somewhere?
It would be great to understand how to edit the XML without having to go through the wizard. Can you point to the documentation?

thanks

I agree with @Sam_Marko. If we had a proper understanding of the XML, we could edit it much more confidently.

What if there is no src attribute when inspecting the element? I’m struggling with an empty URL column.

Hi @bp777

This depends. An example of the situation would make it clearer.

Hello @loginerror,
I have a web page with a list of elements (anchors) that contain the data I need to scrape.
Here is what I want to do: first, scrape the URLs from those elements, then use a For Each Row activity to open each link and scrape data from it.
The problem: the URL column is empty. Also, when inspecting the element, there is no src attribute.
I tried the same thing with Amazon and it worked without a problem. The URL column was full, and everything was easy from there.

It depends on the attribute name, see here:
(Screenshot: image)

To get the content from your page, you can Inspect the Element in your browser to see the attribute name that contains the URL. Then, replace the attr name in the XML definition of the Data Scraping activity from src to the one you want to extract.
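
As a hedged illustration of that swap: anchor elements normally carry their link in the href attribute rather than src, so for a list of anchors the column definition might end up looking something like the sketch below. The column names and the webctrl selector are assumptions for illustration only; keep whatever the wizard generated for your page and change just the attr2 value.

```xml
<!-- Hypothetical sketch: attr2 switched from src to href for anchor elements. -->
<!-- Column names and the webctrl selector are placeholders, not wizard output. -->
<column exact="1" name="Link text" attr="text" name2="Link URL" attr2="href">
  <webctrl tag="a" />
</column>
```

If the URL column is still empty after the change, it is worth inspecting the element again to check which tag actually holds the attribute, since the link may sit on a parent or child of the element you selected.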