Data scraping to get the inner HTML of a tag

Hi,

Is there a way to get the inner HTML of a web page element by editing the data definition of the Data Scraping tool? Something like this would be helpful:

<extract>
	<column exact="1" name="Answers" attr="html">
		<webctrl tag="div" class="devsite-article-body clearfix&#10;            " idx="1"/>
		<webctrl tag="p"/>
	</column>
</extract>

Thanks!

@Sunitha_Premakumaran
Try letting the Data Scraping wizard create the extract XML, and then change the attr attribute of the column to innerhtml.

Thanks for the reply, @ppr, but it’s not working.

@Sunitha_Premakumaran
If you only want to extract a single element or column that does not span multiple pages, Find Children combined with a later Get Attribute on the individual items might be a working alternative.

Is the website public? Can you share some more information?

Assuming this is the URL, I need to extract the URL from here. How can I pass the selector as an argument to Get Attribute?

Give me some time; I have to rush now and will give you feedback soon.

In the meantime, just play with Indicate on Screen and point it at the element of interest.

@Sunitha_Premakumaran
I am not sure if I got your question right.
Have a look at the following:
[screenshot]

I configured the selector to point to the link from the post.
With the href attribute I extracted the linked URL.

Let me know your feedback and any questions that are still open.

[screenshot]
Option 1: pass a UiElement variable, e.g. one retrieved with Find Children or Find Element
OR (only one of Option 1 and Option 2 can be used, not both on the same activity)
Option 2: define the selector

I am trying to extract the questions and answers from that URL. The data is structured as questions and answers. I need to extract the answers, including the URLs in them.

OK, now I understand better.

Just illustrate a dataset of two samples, e.g. what you want from the headline (the group, the question, the answers, the links from the answers), and then we can work out a starter help.

As it is semi-structured (e.g. the number of links in the answers varies), we may go for a combined approach using data scraping, Find Children, Get Attribute, and dynamic selectors.

The aim is to construct a DataTable that looks like this:

Questions

  1. What is Compute Engine? What can it do?

Answers:

  1. Compute Engine is an Infrastructure-as-a-Service product offering
    flexible, self-managed virtual machines hosted on Google’s infrastructure. Compute Engine
    includes Linux and Windows based virtual machines running on KVM, local and durable storage
    options, and a simple REST based API for configuration and control. The service
    integrates with Google Cloud Platform technologies such as
    Cloud Storage, App Engine,
    and BigQuery to extend beyond the basic
    computational capability to create more complex and sophisticated
    apps.

and so on…
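
As a rough illustration of that target shape (not taken from any workflow in this thread), the sketch below flattens question/answer records into table rows. Python is used only for illustration; `records`, `to_rows`, and `to_csv` are hypothetical names, and the sample text is abbreviated from the example above.

```python
import csv
import io

# Hypothetical input: one record per question, answer parts kept as a list.
records = [
    {
        "question": "What is Compute Engine? What can it do?",
        "answers": [
            "Compute Engine is an Infrastructure-as-a-Service product offering "
            "flexible, self-managed virtual machines hosted on Google's infrastructure.",
            "The service integrates with Google Cloud Platform technologies such as "
            "Cloud Storage, App Engine, and BigQuery.",
        ],
    },
]


def to_rows(records):
    """Flatten each record into one (number, question, answer) row."""
    return [
        (number, record["question"], " ".join(record["answers"]))
        for number, record in enumerate(records, start=1)
    ]


def to_csv(rows):
    """Serialize the rows with a header line, similar to a DataTable dump."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["No", "Question", "Answer"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = to_rows(records)
print(to_csv(rows))
```

Joining the answer parts into one cell is only one possible choice; keeping one row per answer paragraph would work as well if the links need to stay attached to their paragraph.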

@Sunitha_Premakumaran
OK, your requirement makes sense to me.
It's a semi-structured scenario, but it should be solvable.

It is semi-structured because there can be 1…n answer paragraphs, so data scraping is strained and may fail.

With Find Children and a later, more detailed decomposition of the found elements, the retrieval could be realized.

I will take a further look at a starter help after my work shift.

@Sunitha_Premakumaran
Let’s continue from the results above. A closer look at the structure showed that the content is not offered in a structured form (Question - Answer). Instead, it is available in the form (Question - Answer Part 1…Answer Part n). There are also tables and other exceptions present.

However, to get you started, I built the following demo implementation to give you an initial entry point and to show some working tools.

Reading in all article parts is done with Find Children:
[screenshot]
with the filter set to all direct children.

The sequence of children is processed with a Switch on the HTML tag name (in the demo: log messages).

For answers, an additional Find Children fetches the links from the answer block as well:

As a first result, it logs:
[screenshot]

Just take the demo as a base and add handling for more structure variations and the output to a DataTable / Excel on your own. Here is the XAML:
Sunitha_Premakumaran_2.xaml (11.3 KB)
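
For readers without the XAML at hand, the logic described above (read the direct children of the article, switch on the tag name, collect links inside answer paragraphs) can be sketched outside UiPath. This is a minimal Python illustration, not the attached workflow: the sample markup is hypothetical, and the mapping of `h2` to group, `h3` to question, and `p` to answer part is an assumption about the page.

```python
from html.parser import HTMLParser

# Elements without an end tag; they must not change the nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

# Hypothetical sample markup; the real page structure may differ.
SAMPLE = """
<div class="devsite-article-body clearfix">
  <h2>General</h2>
  <h3>What is Compute Engine?</h3>
  <p>Compute Engine is an IaaS product. See <a href="/storage">Cloud Storage</a>.</p>
  <p>It also works with <a href="/bigquery">BigQuery</a>.</p>
  <h3>How do I get started?</h3>
  <p>Read the quickstart.</p>
</div>
"""


class ArticleWalker(HTMLParser):
    """Visits the direct children of the article div and switches on their tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0         # 0 = outside the article, 1 = inside the div
        self.child_tag = None  # tag name of the direct child being read
        self.text, self.links = [], []
        self.group = None
        self.records = []      # one dict per question
        self.skipped = []      # tags this sketch does not handle (e.g. tables)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Assumption: the article container carries this class token.
            if tag == "div" and "devsite-article-body" in (attrs.get("class") or "").split():
                self.depth = 1
            return
        if tag in VOID:
            return
        self.depth += 1
        if self.depth == 2:  # a new direct child starts
            self.child_tag, self.text, self.links = tag, [], []
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_data(self, data):
        if self.depth >= 2:
            self.text.append(data)

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID:
            return
        if self.depth == 2:
            self.finish_child()
        self.depth -= 1

    def finish_child(self):
        text = " ".join("".join(self.text).split())
        if self.child_tag == "h2":
            self.group = text
        elif self.child_tag == "h3":
            self.records.append({"group": self.group, "question": text,
                                 "answers": [], "links": []})
        elif self.child_tag == "p" and self.records:
            self.records[-1]["answers"].append(text)
            self.records[-1]["links"].extend(self.links)
        else:
            self.skipped.append(self.child_tag)


walker = ArticleWalker()
walker.feed(SAMPLE)
walker.close()
for record in walker.records:
    print(record["question"], len(record["answers"]), record["links"])
```

Anything that is not a heading or a paragraph ends up in `skipped`, which is where the tables and other exceptions mentioned above would need their own cases.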

To address your initial question (getting innerhtml), here is an alternative that retrieves the entire article part as innerhtml. The XAML is here:
Sunitha_Premakumaran_1.xaml (6.4 KB)
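
To make clear what "the entire article part as innerhtml" means, here is a small sketch, again in Python and not the attached workflow, that returns the markup between the first matching start tag and its end tag. The `InnerHtml` class and `inner_html` helper are hypothetical names, the sample markup is made up, and comments inside the element are ignored for brevity.

```python
from html.parser import HTMLParser

# Elements without an end tag; they must not change the nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class InnerHtml(HTMLParser):
    """Collects the raw markup inside the first element matching tag and class."""

    def __init__(self, tag, class_token):
        # Keep entities such as &amp; undecoded so the markup stays as written.
        super().__init__(convert_charrefs=False)
        self.tag, self.class_token = tag, class_token
        self.depth = 0      # 0 = target element not open
        self.done = False   # True once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == self.tag and self.class_token in classes:
                self.depth = 1
            return
        self.parts.append(self.get_starttag_text())
        if tag not in VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied without changing depth.
        if self.depth > 0 and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0 and not self.done:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0 and not self.done:
            self.parts.append(f"&#{name};")


def inner_html(markup, tag, class_token):
    parser = InnerHtml(tag, class_token)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Hypothetical sample page.
PAGE = ('<body><div class="devsite-article-body clearfix">'
        '<p>See <a href="/storage">Cloud Storage</a> &amp; more.</p><br>'
        '</div><p>footer</p></body>')
print(inner_html(PAGE, "div", "devsite-article-body"))
```

The returned string still contains the `<a href=…>` tags, which is why inner HTML is a convenient starting point when the answers need to keep their links.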

Thanks, I figured it out!
