How to write proper ExtractMetadata for data scraping?

rturkia · October 21, 2021, 3:26pm

I’m trying to extract data from a table in a web page. The table is quite simple: there are 8 columns and the first row is the header. I can use the data scraping wizard to create an Extract Structured Data activity with all the necessary settings, and it does extract data from the table cells, but I need more than just text.

The 7th column contains a link to download a PDF file and an image, and the link is what I’m interested in. I have tried to customize the ExtractMetadata property using pieces of information gathered from the Internet. Currently the property contains this XML:

<extract-table>
	<row>
		<webctrl parentname='alaformi' tag='TABLE' />
		<webctrl tag='TR' />
	</row>
	<column attr='text'>
		<webctrl parentname='alaformi' tag='TABLE' />
		<webctrl tag='TR' />
		<webctrl tag='TD' idx='1' />
	</column>
	<column attr='href'>
		<webctrl parentname='alaformi' tag='TABLE' />
		<webctrl tag='TR' />
		<webctrl tag='TD' idx='7' />
		<webctrl tag='A' />
	</column>
</extract-table>

This is very much a product of trial and error, since there is no documentation of the XML. It also seems that it doesn’t matter what is in it, as long as the root element is extract-table: the result is always the same, text from all columns.

What I need from this table is text from the 1st column and href attribute of A tag from the 7th column. What am I not doing right here? Any help is appreciated.

ppr · October 21, 2021, 3:29pm

What does the selector of the Extract Data activity look like?
If possible, please also share the URL with us. Thanks

rturkia · October 21, 2021, 3:36pm

I’m afraid I can’t share the url since log in is required to access the data.

The selector is

<html app='msedge.exe' title='IWF' />
<webctrl src='laskulista.w?toiminto=FIRST' tag='FRAME' />
<webctrl parentname='alaformi' tag='TABLE' />

where the “msedge” line comes from Attach Browser selector.

ppr · October 21, 2021, 4:21pm

@rturkia
Give this a try.
Selector:

<Use your parent selector/>
<if needed to bring more specifics on it/>
<webctrl parentname='alaformi' tag='TBODY' />

Metadata:

<extract>
	<row exact='1'>
		<webctrl tag='tr'/>
	</row>
	<column exact='1' name='Column1' attr='text'>
		<webctrl tag='tr'/>
		<webctrl tag='td' idx='1'/>
	</column>
	<column exact='1' name='Column2' attr='href'>
		<webctrl tag='tr'/>
		<webctrl tag='td' idx='7'/>
		<webctrl tag='a' idx='1'/>
	</column>
</extract>


Keep in mind:
The first extracted row is the header row, so you may need to delete it (the 2021 packages / modern activities may have different options or behaviours).

Framesets lose the hierarchy, so the selector focuses on the specifics of the particular web page that holds the table.
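
For readers who want to see what the row and column definitions in the metadata above amount to, here is a rough Python sketch. It is not UiPath code and does not reflect how UiPath works internally; it only mimics the idea (one output row per TR, text from the 1st TD, href of the first A inside the 7th TD) using the standard library. The class name, helper function, and sample table are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableLinkExtractor(HTMLParser):
    """Per <tr>, collect (text of one <td>, href of first <a> in another <td>)."""

    def __init__(self, text_col=1, link_col=7):
        super().__init__()
        self.text_col = text_col    # 1-based, like idx='1' in the metadata
        self.link_col = link_col    # 1-based, like idx='7' in the metadata
        self.rows = []
        self._col = 0               # index of the current <td> within the row
        self._in_cell = False
        self._text = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: reset the per-row state.
            self._col, self._text, self._href = 0, [], None
        elif tag == "td":
            self._col += 1
            self._in_cell = True
        elif tag == "a" and self._in_cell and self._col == self.link_col:
            if self._href is None:  # keep only the first link in the cell
                self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._col > 0:
            self.rows.append(("".join(self._text).strip(), self._href))

    def handle_data(self, data):
        if self._in_cell and self._col == self.text_col:
            self._text.append(data)


def make_row(cells):
    """Build one <tr> from a list of cell contents (sample data only)."""
    return "<tr>" + "".join(f"<td>{c}</td>" for c in cells) + "</tr>"


# Hypothetical 8-column table whose first row is the header.
html = "<table>" + "".join([
    make_row(["Invoice", "c2", "c3", "c4", "c5", "c6", "PDF", "c8"]),
    make_row(["1001", "", "", "", "", "",
              '<a href="doc1.pdf"><img src="pdf.png"></a>', ""]),
    make_row(["1002", "", "", "", "", "",
              '<a href="doc2.pdf"><img src="pdf.png"></a>', ""]),
]) + "</table>"

parser = TableLinkExtractor()
parser.feed(html)

# The header comes back as the first row, so drop it afterwards.
data_rows = parser.rows[1:]
print(data_rows)
```

As in the real extraction, the header row is returned first (with no link) and has to be removed in a separate step.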

rturkia · October 21, 2021, 4:29pm

Now it works, thank you very much, this saved my day!

I wish UiPath would release documentation for the metadata XML.

ppr · October 21, 2021, 4:46pm

system · October 24, 2021, 4:47pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
