I’m trying to extract data from a table in a web page. The table is quite simple: it has 8 columns, and the first row is the header. I can use the data scraping wizard to create an Extract Structured Data activity with everything necessary set, and it does extract the text from the table cells, but I need more than just text.
The 7th column contains a link to download a PDF file and an image; the link is what I’m interested in. I have tried to customize the ExtractMetadata property using pieces of information gathered from the Internet. Currently the property contains this XML:
<extract-table>
  <row>
    <webctrl parentname='alaformi' tag='TABLE' />
    <webctrl tag='TR' />
  </row>
  <column attr='text'>
    <webctrl parentname='alaformi' tag='TABLE' />
    <webctrl tag='TR' />
    <webctrl tag='TD' idx='1' />
  </column>
  <column attr='href'>
    <webctrl parentname='alaformi' tag='TABLE' />
    <webctrl tag='TR' />
    <webctrl tag='TD' idx='7' />
    <webctrl tag='A' />
  </column>
</extract-table>
This is very much a product of trial and error, since there is no documentation for this XML. It also seems that the content doesn’t matter as long as the root element is extract-table: the result is always the same, text from all columns.
What I need from this table is the text from the 1st column and the href attribute of the A tag in the 7th column. What am I doing wrong here? Any help is appreciated.
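To make the expected result concrete, here is a small Python sketch of the output I’m after. It is not UiPath code and says nothing about how ExtractMetadata works; it only uses the standard html.parser module on a made-up table. The sample markup, the class name FirstTextAndLink and the function extract are all hypothetical, and the sketch assumes the header row uses TH cells and that there are no nested tables.

```python
from html.parser import HTMLParser


class FirstTextAndLink(HTMLParser):
    """Collect (text of 1st cell, href of first <a> in 7th cell) per row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished (text, href) pairs
        self._col = 0        # 1-based index of the current <td> in the row
        self._in_td = False  # True while inside a <td>
        self._text = []      # text fragments collected from column 1
        self._href = None    # href found in column 7, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # New row: reset the per-row state.
            self._col, self._text, self._href = 0, [], None
        elif tag == "td":
            self._col += 1
            self._in_td = True
        elif tag == "a" and self._in_td and self._col == 7 and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._col > 0:
            # Rows without any <td> (assumed: the <th> header row) are skipped.
            self.rows.append(("".join(self._text).strip(), self._href))

    def handle_data(self, data):
        if self._in_td and self._col == 1:
            self._text.append(data)


def extract(html):
    """Return a list of (first-column text, seventh-column href) tuples."""
    parser = FirstTextAndLink()
    parser.feed(html)
    return parser.rows


# Hypothetical 8-column table: one header row and one data row.
SAMPLE = (
    "<table><tr>" + "<th>h</th>" * 8 + "</tr>"
    "<tr><td> Doc A </td>" + "<td>x</td>" * 5 +
    "<td><a href='/files/a.pdf'><img src='pdf.png'></a></td><td>y</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    print(extract(SAMPLE))  # [('Doc A', '/files/a.pdf')]
```

So for each data row I’d like two values, the plain text of the first cell and the link target from the seventh, rather than the text of every column.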