Is there a way to get the innerHTML of a webpage by editing the Data Definition of the Data Scraping tool? Something like this would be helpful
@Sunitha_Premakumaran
If you only want to extract a single element or column that does not span multiple pages, Find Children combined with a later Get Attribute on the individual items may be a working alternative.
Is the website public? Can you share some more information?
Option 1: pass a UiElement variable, e.g. one retrieved with Find Children or Find Element
OR (use either Option 1 or Option 2, not both at once)
Option 2: define the selector
I am trying to extract the questions and answers from that URL. The data is structured as questions and answers. I need to extract the answers with the URLs in
Just illustrate a dataset of two samples, e.g. what you want from the headline (the group, the question, the answers, the links from the answers), and then we can work out a starting point.
As it is semi-structured (e.g. the number of links in the answers varies), we may go for a combined approach using Data Scraping, Find Children, Get Attribute and dynamic selectors.
The aim is to construct a DataTable that looks like this:
Questions
What is Compute Engine? What can it do?
Answers:
Compute Engine is an Infrastructure-as-a-Service product offering
flexible, self-managed virtual machines hosted on Google’s infrastructure. Compute Engine
includes Linux and Windows based virtual machines running on KVM, local and durable storage
options, and a simple REST based API for configuration and control. The service
integrates with Google Cloud Platform technologies such as Cloud Storage, App Engine,
and BigQuery to extend beyond the basic
computational capability to create more complex and sophisticated
apps.
@Sunitha_Premakumaran
Let’s continue from the results above. A closer look at the structure showed that the content is not offered in a strictly structured form (Question - Answer). Instead it is available in the form (Question - Answer Part 1 … Answer Part n). There are also tables and other exceptions present.
However, to get you started I built the following demo implementation, to give you an initial entry point and show some working tools.
Reading in all article parts is done with Find Children,
filtering to all direct children.
With a switch on the HTML tag name, the sequence of children is processed (demo: Log Messages).
Just take it as a base and add the handling of further structure variations and the output to a DataTable / Excel on your own. Here is the XAML: Sunitha_Premakumaran_2.xaml (11.3 KB)
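To make the switch-on-tag-name idea easier to follow outside of the XAML, here is a rough Python sketch of the same grouping logic. It is not what the workflow does internally. It assumes the children arrive as (tag name, text) pairs, that questions sit in heading tags and answer parts in paragraph, list or table tags, and the names QUESTION_TAGS, ANSWER_TAGS and group_children as well as the shortened sample rows are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: which tags mark a question and which mark an answer part.
QUESTION_TAGS = {"h2", "h3"}
ANSWER_TAGS = {"p", "ul", "ol", "table"}


def group_children(children: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn a flat (tag, text) sequence into Question / Answer rows."""
    rows = []
    current = None
    for tag, text in children:
        tag = tag.lower()
        if tag in QUESTION_TAGS:
            # A heading opens a new row.
            current = {"Question": text.strip(), "Parts": []}
            rows.append(current)
        elif tag in ANSWER_TAGS and current is not None:
            # Answer Part 1 ... Answer Part n go under the open question.
            current["Parts"].append(text.strip())
        # Other tags, and content before the first question, are skipped.
    return [
        {"Question": row["Question"], "Answer": " ".join(row["Parts"])}
        for row in rows
    ]


# Hypothetical sample, shortened from the page text quoted above.
sample = [
    ("H3", "What is Compute Engine? What can it do?"),
    ("P", "Compute Engine is an Infrastructure-as-a-Service product."),
    ("P", "It integrates with Google Cloud Platform technologies."),
    ("DIV", "not part of a question or answer"),
]
rows = group_children(sample)
for row in rows:
    print(row["Question"], "->", row["Answer"])
```

Each resulting row maps to one DataTable row with a Questions and an Answers column. Tables and other exceptions would need their own case in the switch.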
To address your initial question (getting the innerHTML), here is an alternative that retrieves the entire article part as innerHTML. XAML is here: Sunitha_Premakumaran_1.xaml (6.4 KB)
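Once an answer is available as an innerHTML string, the links inside it still have to be pulled out. As a minimal sketch of that step, assuming the string is handed to a small script outside UiPath, Python's standard html.parser can collect them. The class LinkCollector, the function extract_links and the example.com URLs are hypothetical and only serve the illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are ignored.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(inner_html):
    parser = LinkCollector()
    parser.feed(inner_html)
    parser.close()
    return parser.links


# Made-up fragment standing in for one answer's innerHTML.
fragment = (
    '<p>See the <a href="https://example.com/docs">documentation</a> '
    'and the <a href="https://example.com/pricing">pricing page</a>.</p>'
)
links = extract_links(fragment)
for text, href in links:
    print(text, href)
```

Inside UiPath itself, the same result could probably be reached with Find Children on the answer element followed by Get Attribute for href on each link, as suggested earlier in the thread.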