I am trying to scrape the course content (the list of topics) from a Udemy course page.
For example, go to this URL: Free Python Tutorial - Python 3 in 100 Minutes | Udemy
There you will find the Course Content, which is laid out as a tree. I am able to scrape the main topics, but I am not able to get the subtopics under them.
I tried the Data Scraping wizard and finding child elements. If you have any ideas, please let me know.
Hmm… are you using the data scraping wizard for this? When you select one of the content items, does it highlight all the sub-contents under it, or is the selected item the only one that gets highlighted?
Additionally, since you can have multiple topics per course, you might need to go with two data scraping wizards: one to extract the main topics and another to extract the subtopics. Then later you can merge the data together… Have you tried anything like that?
Yes, I tried with two data scraping wizards. I am able to populate the main content, but I am not able to scrape all of the sub-contents: I can populate them for one topic only, not for all the topics.
E.g.: I am able to scrape all the main content, from “Introduction and Welcome Message” to “Bonus Lectures”, but for sub-contents I am able to scrape only the one under a single main topic (“Course Overview” under “Introduction and Welcome Message”).
Please let me know if you have any ideas.
Hmm… let me try this for the above link you shared… I may be able to find something…
I get the point you were describing here.
So as you see, it only takes the selected line. If I select the first row, it only takes that row. Handling this is quite tricky because each topic can have a different number of subtopics, and each subtopic gets captured as a different column. Also, we don’t know in advance how many subtopics a topic will have.
I guess the better way to handle this is by manually configuring your data scraping.
- Use the Extract Structured Data activity and provide the entire course content region as the scraping region.
- Provide the metadata for the table yourself.
I suppose this should work for you…
@Jan_Brian_Despi & @lakshman, if you guys have any better solutions, please share them here, as this is the only option that I can think of.
As @Lahiru.Fernando mentioned, use Extract Structured Data, or you can give the Get Full Text activity a try and indicate the entire region. It will return the full visible text along with the hidden text (the subtopics).
If you want to extract all the sub-contents/topics in one go, you can use the metadata below:
<column exact="1" name="Column1" attr="text">
<webctrl tag="div" class="main-content-wrapper" idx="1"/>
<webctrl tag="div" class="main-content" idx="1"/>
<webctrl tag="div" class="container container--component-margin" idx="1"/>
<webctrl tag="div" class="row row--component-margin" idx="2"/>
<webctrl tag="div" class="col-xxs-8 left-col" idx="1"/>
<webctrl tag="div" class="clp-component-render" idx="1"/>
<webctrl tag="div" class="curriculum-wrapper" idx="1"/>
<webctrl tag="div" class="ud-component--clp--curriculum" idx="1"/>
<webctrl tag="div" class="lectures-container collapse in" idx="1"/>
<webctrl tag="div" class="lecture-container"/>
</column>
You need to remove idx="1" from the last webctrl line of the metadata (the lecture-container div, as done above), because with it only the first element matching the description would be extracted.
Modifying the metadata is much easier when you compare it against the web page’s HTML in the DOM explorer.
Please let me know if you need any clarification.
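To illustrate why dropping the index on the last element matters, here is a minimal Python sketch run outside UiPath. The markup is a hypothetical, heavily simplified stand-in for the real page (only the lecture-container class name comes from the metadata above, and two of the lecture titles are made up), so treat it as an analogy rather than as how UiPath evaluates the metadata internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far more deeply nested.
SAMPLE = """
<div class="lectures-container">
  <div class="lecture-container">Course Overview</div>
  <div class="lecture-container">Example Lecture A</div>
  <div class="lecture-container">Example Lecture B</div>
</div>
"""

root = ET.fromstring(SAMPLE)
path = ".//div[@class='lecture-container']"

# Roughly what keeping idx="1" on the last line does: only the first match.
first_only = [root.find(path).text]

# Roughly what dropping idx does: every element that fits the description.
all_matches = [element.text for element in root.findall(path)]

print(first_only)
print(all_matches)
```

The parent elements keep their index because there should be exactly one of each; only the repeating element at the end of the chain is left unindexed.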
@Lahiru.Fernando, @lakshman, @Dominic - Thanks for your inputs, they are really appreciated.
@Dominic, the solution is good and it is working fine. Can I get only the topic/sub-content name in the 1st column, without the duration, and the duration in the 2nd column?
Please share your ideas on this.
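One possible way to separate the two values, if the scraped column is post-processed outside the scraping step, is to split each cell with a regular expression. This is only a sketch: it assumes each cell holds the title followed by a duration written as mm:ss or hh:mm:ss, which may not match what the page actually returns, and the function name split_title_and_duration is made up for this example.

```python
import re

# Assumption: the duration is the last thing in the cell, as mm:ss or hh:mm:ss.
DURATION_AT_END = re.compile(
    r"^(?P<title>.*?)\s*(?P<duration>\d{1,2}:\d{2}(?::\d{2})?)\s*$",
    re.DOTALL,
)


def split_title_and_duration(cell: str) -> tuple[str, str]:
    """Return (title, duration); duration is empty when none is found."""
    match = DURATION_AT_END.match(cell)
    if match is None:
        return cell.strip(), ""
    return match.group("title").strip(), match.group("duration")


# Example cells; the durations here are invented for illustration.
for cell in ["Course Overview 02:15", "Bonus Lectures"]:
    print(split_title_and_duration(cell))
```

Cells without a recognizable duration are passed through unchanged, so no row is lost if the assumption about the format turns out to be wrong for some entries.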
@Lahiru.Fernando, @lakshman, @Dominic
Thanks all for your inputs, I am able to extract all the subtopics successfully. Thank you for the prompt response as well.
Also, please let me know whether there is any way to get each topic inserted alongside its subtopics.
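As a rough idea of what pairing each subtopic with its parent topic could look like, here is a small Python sketch. It assumes a hypothetical structure in which every section element contains both its title and its own lectures; the section and section-title class names and the bonus lecture titles are invented for the example, so the real page would need its own selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each section holds its title and its own lectures.
SAMPLE = """
<div class="curriculum">
  <div class="section">
    <span class="section-title">Introduction and Welcome Message</span>
    <div class="lecture-container">Course Overview</div>
  </div>
  <div class="section">
    <span class="section-title">Bonus Lectures</span>
    <div class="lecture-container">Example Bonus A</div>
    <div class="lecture-container">Example Bonus B</div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Walk section by section so every lecture is paired with its own topic.
rows = []
for section in root.findall("./div[@class='section']"):
    topic = section.findtext("./span[@class='section-title']")
    for lecture in section.findall("./div[@class='lecture-container']"):
        rows.append((topic, lecture.text))

for topic, subtopic in rows:
    print(topic, "|", subtopic)
```

Iterating per section, rather than scraping topics and subtopics as two flat lists, is what keeps the pairing correct when topics have different numbers of subtopics.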
Great answer Dominic - You really are the ‘DOM’ explorer.