How to write proper ExtractMetadata for data scraping?

I’m trying to extract data from a table in a web page. The table is quite simple: it has 8 columns, and the first row is the header. I can use the data scraping wizard to create an Extract Structured Data activity with everything set up, and it does extract the text from the table cells, but I need more than just text.

The 7th column contains an image and a link to download a PDF file, and the link is what I’m interested in. I have tried to customize the ExtractMetadata property using pieces of information gathered from the Internet. Currently the property contains this XML:

<extract-table>
	<row>
		<webctrl parentname='alaformi' tag='TABLE' />
		<webctrl tag='TR' />
	</row>
	<column attr='text'>
		<webctrl parentname='alaformi' tag='TABLE' />
		<webctrl tag='TR' />
		<webctrl tag='TD' idx='1' />
	</column>
	<column attr='href'>
		<webctrl parentname='alaformi' tag='TABLE' />
		<webctrl tag='TR' />
		<webctrl tag='TD' idx='7' />
		<webctrl tag='A' />
	</column>
</extract-table>

This is very much a product of trial and error, since there is no documentation for the XML. It also seems that the content doesn’t matter as long as the root element is extract-table: the result is always the same, text from all columns.

What I need from this table is the text from the 1st column and the href attribute of the A tag in the 7th column. What am I doing wrong here? Any help is appreciated.

What does the selector of the Extract Data activity look like?
If possible, please also share the URL with us. Thanks.

I’m afraid I can’t share the URL, since a login is required to access the data.

The selector is

<html app='msedge.exe' title='IWF' />
<webctrl src='laskulista.w?toiminto=FIRST' tag='FRAME' />
<webctrl parentname='alaformi' tag='TABLE' />

where the “msedge” line comes from Attach Browser selector.

@rturkia
Give this a try.
selector:

<Use your parent selector here/>
<Add more specifics here if needed/>
<webctrl parentname='alaformi' tag='TBODY' />

metadata:

<extract>
	<row exact='1'>
		<webctrl tag='tr'/>
	</row>
	<column exact='1' name='Column1' attr='text'>
		<webctrl tag='tr'/>
		<webctrl tag='td' idx='1'/>
	</column>
	<column exact='1' name='Column2' attr='href'>
		<webctrl tag='tr'/>
		<webctrl tag='td' idx='7'/>
		<webctrl tag='a' idx='1'/>
	</column>
</extract>


Keep in mind:
The first extracted row is the header row, so you may want to delete it afterwards (2021 packages and the modern experience may offer different options or behave differently).

Framesets lose the hierarchy, so the selector focuses on the specifics of the particular page that holds the table.
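
For anyone wondering what the metadata above is asking for, here is a rough Python sketch of the same logic outside UiPath. It is only an illustration and says nothing about how UiPath processes the XML internally. The sample markup, the file names, and the `extract_rows` helper are all made up, since the real page is behind a login. For every `tr` it takes the text of the 1st `td` and the `href` of the first `a` inside the 7th `td`, then drops the header row.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup (assumption): 8 columns, header in the first row,
# and a link wrapping an image in the 7th column.
SAMPLE = """
<table>
  <tbody>
    <tr><td>Invoice</td><td/><td/><td/><td/><td/><td>File</td><td/></tr>
    <tr><td>1001</td><td/><td/><td/><td/><td/><td><a href="doc1.pdf"><img src="pdf.png"/></a></td><td/></tr>
    <tr><td>1002</td><td/><td/><td/><td/><td/><td><a href="doc2.pdf"><img src="pdf.png"/></a></td><td/></tr>
  </tbody>
</table>
"""


def extract_rows(markup):
    """Return (text of 1st cell, href of first <a> in 7th cell) for every <tr>."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        if len(cells) < 7:
            continue  # skip rows that have no 7th cell
        text = "".join(cells[0].itertext()).strip()
        link = cells[6].find(".//a")  # first <a> anywhere inside the 7th cell
        href = link.get("href") if link is not None else None
        rows.append((text, href))
    return rows


all_rows = extract_rows(SAMPLE)
data_rows = all_rows[1:]  # the first extracted row is the header, so drop it
print(data_rows)
```

The header row comes back with no link, which is the same reason the first row of the extracted DataTable usually needs to be removed.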

Now it works, thank you very much, this saved my day!

I wish UiPath would release documentation for the metadata XML.
