[HowTo] Data Scraping - Advanced Configuration - Text Field, Image Source, Url, CSS Classname, Hover text

This HowTo shows how Data Scraping can be configured to also retrieve non-standard information from a web table. After the data columns have been indicated with the wizard, the extract data definition is edited afterwards and each column is changed to the relevant attribute, e.g. value (text field), src (image source), class (CSS class name), title (hover text), href (URL).

Introduction

The following web table is used for data scraping; the non-text information should be retrieved as well.

[screenshot]

We are interested in the following details:

  • ID
  • Name
  • Task
  • Circle type
  • Hover text of circle
  • Prio info
  • URL

Preparation / Analysis

It is always recommended to do a quick check with the browser's developer tools (F12) and/or UiExplorer. The table looks like this:
[screenshot]

The quick look shows us:

  • it is organized in a tabular structure based on a table element (instead of a table built from divs)
  • the different information sources are identified and marked in yellow
  • the first row holds the headers, which are placed in TH tags

So it looks good; let's do the retrieval.

Data Scraping configuration

First Column (ID)

Start the Data Scraping wizard:

  • Select Element dialog - click Next
  • click on the first ID value
  • the following dialog is displayed:
    [screenshot]
  • click No (Nein) - we want fine control over the retrieval configuration
  • Select Second Element dialog - click Next
  • click on the second ID value
  • the following dialog is shown:
    [screenshot]
  • no URL extraction is required; the column name is set later

The following preview is shown:
[screenshot]

Second Column (Name)

  • in the preview dialog, click Extract Correlated Data
    [screenshot]
  • as with the first column, indicate the first element - the first name
  • indicate the second element - the second name
  • the result is:
    [screenshot]

Regardless of whether the selectors are correct or invalid, the empty column values are correct.

An empty result is returned because the name is not text in the data cell. The name is the value of a text field (refer to the screenshot above).
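Why the text is empty while the value attribute holds the name can be illustrated outside the wizard with a small Python sketch. The row markup, the sample name and the helper names `cell_text` and `cell_input_value` are hypothetical and only mimic the structure described above; the fragment is parsed as well-formed XML for simplicity, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified row: the ID is plain cell text,
# the name sits in the value attribute of a text field.
ROW_HTML = '<tr><td>1</td><td><input type="text" value="Alice"/></td></tr>'


def cell_text(cell):
    """Return the text content of a cell (comparable to extracting 'text')."""
    return "".join(cell.itertext()).strip()


def cell_input_value(cell):
    """Return the value attribute of the first input in the cell, or None."""
    field = cell.find(".//input")
    return None if field is None else field.get("value")


row = ET.fromstring(ROW_HTML)
id_cell, name_cell = row.findall("td")

print(repr(cell_text(id_cell)))           # text of the ID cell
print(repr(cell_text(name_cell)))         # text of the name cell (no text nodes)
print(repr(cell_input_value(name_cell)))  # value attribute of the text field
```

The second cell contains no text nodes at all, so a text-based extraction has nothing to return; the name only becomes visible once the attribute is read.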

Let's adapt the extraction with the following steps:

  • click Edit Data Definition
    [screenshot]
  • validate that the extraction is selecting an input element
  • check that the second table cell is selected: td idx='2'
  • change the attribute from text to value:
    [screenshot]
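The same change can be pictured as an edit on the extract metadata XML. The sketch below is only an illustration in Python: the XML layout (`extract`, `row`, `column`, `webctrl` with an `attr` attribute), the column names and the helper `set_column_attr` are assumptions, so copy the real definition from the wizard rather than relying on this shape.

```python
import xml.etree.ElementTree as ET

# Assumed layout of an extract data definition; the real XML produced by
# the wizard may contain further elements and attributes.
EXTRACT_XML = """
<extract>
  <row exact="1">
    <webctrl tag="table" />
    <webctrl tag="tr" />
  </row>
  <column exact="1" name="Column1" attr="text">
    <webctrl tag="table" />
    <webctrl tag="tr" />
    <webctrl tag="td" idx="1" />
  </column>
  <column exact="1" name="Column2" attr="text">
    <webctrl tag="table" />
    <webctrl tag="tr" />
    <webctrl tag="td" idx="2" />
    <webctrl tag="input" />
  </column>
</extract>
"""


def set_column_attr(extract_xml, column_name, new_attr):
    """Return the definition with the named column's attr replaced."""
    root = ET.fromstring(extract_xml.strip())
    for column in root.findall("column"):
        if column.get("name") == column_name:
            column.set("attr", new_attr)
    return ET.tostring(root, encoding="unicode")


# Switch the second column from the cell text to the text field's value.
updated = set_column_attr(EXTRACT_XML, "Column2", "value")
print(updated)
```

Only the second column is touched; the first column keeps reading the cell text.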

And validate the newly generated preview:
[screenshot]

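Additional columns

  • repeat the steps from the first column and add the other columns by indicating each column's first element value and second element value
  • click Edit Data Definition and change each column's attribute to the one that carries the information, e.g. src (image source), title (hover text), class (CSS class name), href (URL)

Result:
[screenshot]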

Final Result

[screenshot]
The datatable with the extracted values. The PrioInfo values are the different CSS classes. In a subsequent conversion step this info can also be mapped, e.g. …circle-up = HIGH etc.
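Such a conversion could be sketched in Python as follows. Only the pairing of a class ending in circle-up with HIGH comes from the text; the other suffixes, the labels MEDIUM, LOW and UNKNOWN, and the function name `map_priority` are hypothetical placeholders to be replaced with the classes actually found in the PrioInfo column.

```python
# Suffix of the CSS class -> priority label.
PRIORITY_BY_CLASS_SUFFIX = {
    "circle-up": "HIGH",        # pairing mentioned in the text
    "circle-right": "MEDIUM",   # hypothetical
    "circle-down": "LOW",       # hypothetical
}


def map_priority(css_class, default="UNKNOWN"):
    """Map a scraped class attribute (possibly several classes) to a label."""
    for token in css_class.split():
        for suffix, label in PRIORITY_BY_CLASS_SUFFIX.items():
            if token.endswith(suffix):
                return label
    return default


# Hypothetical class values as they might appear in the PrioInfo column.
for scraped in ["icon x-circle-up", "icon x-circle-down", ""]:
    print(repr(scraped), "->", map_priority(scraped))
```

Matching on the suffix keeps the sketch independent of whatever prefix the real class names carry.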

Tips

  • After each edit of the extract data definition, copy the modified extract metadata XML to the clipboard
  • First add / select all the different columns, and edit the extract data definition at the end.
    • Reason: when a new column is added after the extract data definition has been modified, the modifications are reset. That is also why the partial results are copied to the clipboard
  • If the preview results look suspicious after heavy editing rounds, stop the wizard and restart it

Downloads

HowTo_TableFieldClassImgLink.zip (175.5 KB)

Questions

For questions about your own retrieval case, open a new topic and get individual support.


Cool article! I moved it to our FAQ category.


Hi @ppr, will the steps be the same even if the table is represented in a div table format?

@supermanPunch
We have also done this in some projects where the data was organized in rows and columns represented e.g. by divs.

The most important part is to define a reliable row iterator selector and consistent selectors for the correlated data within the extract data definition.
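The idea of one row iterator plus per-row lookups for the correlated data can be sketched in Python. The markup, the class names `grid`, `row` and `cell`, and the function `scrape_rows` are hypothetical, and the fragment is parsed as well-formed XML for simplicity.

```python
import xml.etree.ElementTree as ET

# Hypothetical div-based "table": each row is a div, each cell a nested div.
PAGE = """
<div class="grid">
  <div class="row"><div class="cell">1</div><div class="cell"><a href="/items/1">Open</a></div></div>
  <div class="row"><div class="cell">2</div><div class="cell"><a href="/items/2">Open</a></div></div>
</div>
"""


def scrape_rows(markup):
    """Iterate over the row divs and read the correlated data of each row."""
    root = ET.fromstring(markup.strip())
    records = []
    # Row iterator: must match every row and nothing else.
    for row in root.findall(".//div[@class='row']"):
        # Correlated data: always looked up relative to the current row.
        cells = row.findall("div[@class='cell']")
        link = cells[1].find("a")
        records.append({
            "ID": (cells[0].text or "").strip(),
            "Url": None if link is None else link.get("href"),
        })
    return records


for record in scrape_rows(PAGE):
    print(record)
```

Because every lookup starts from the current row, the values of one record cannot drift into another, which is the consistency the selectors need to guarantee.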

Awesome! Thanks a lot!

Great article! Thank you.
Is there a way to extract “everything” that is in the block instead of extracting the Text, Class, Value, etc.?
In my case, each element contains “structured data” inside (4 elements). But sometimes 1 element is missing, so I would like to extract everything (everything = source code) so I can parse it manually.
Any idea how can I do that?
Thanks!

Nice

@ppr - what if the UI Element is not part of a table?

Please see: Get list of running web apps using browser task manager

@tsverthoff
This HowTo describes the approach applied to web pages / web applications.
Your referenced topic is a different case.

Better to keep the discussion about your linked topic there itself.

I agree. That is why I specifically posted that question to a new topic, rather than posting the content of the topic here.

How can I loop data scraping in a web page where the table spans multiple pages and is inside the pallet?

In this case, once the inner pages (1-18) are done, it should move to the next outer page (1 of 3) and then do the data scraping.

@agathiyanv

if not already done, just open a topic for your case and we will pick it up from there

This has been shown clearly in the below video, please watch and understand the steps in detail.

The video shows well what the tip/hint is about.
