Hi.
We are scraping prices from a website. Normally we can pick out just the price, name and URL, but on this particular website each product sits in a sort of section of its own (see picture). What can we do to scrape only the price, name and URL of these products as well?
Hi @IngKim
If the issue is that the UiExplorer is only selecting the big box and does not allow you to select the price, then you will need to find a workaround.
Two possible solutions:
- Check whether selecting the big boxes gives you some text. You could then use a regex to find the price in the output string.
- Try to manually edit the XML of the Data Scraping tool, using Inspect Element on the website as a guide. Unfortunately, this is trickier and not well documented yet, so you would need to post your webpage for someone to come up with the proper XML to extract the price.
I would suggest going with the first option.
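As a rough illustration of the first option, here is a minimal Python sketch of pulling a price out of the text of a whole product box. The sample text, the product name in it, and the assumed price format (digits with optional space separators followed by `,-`) are all assumptions, not taken from the actual page. A UiPath workflow would use .NET regex rather than Python, but a pattern like this one should carry over.

```python
import re

# Hypothetical text, standing in for whatever the big box returns.
box_text = "Falketind Gore-Tex Jacket 3 499,- Add to cart"

# Assumed price format: digits, optional space-separated thousands groups,
# then ",-". Adjust this to match the real output string.
PRICE_PATTERN = re.compile(r"(\d+(?: \d{3})*),-")


def extract_price(text):
    """Return the first price found in text as an int, or None if there is none."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    # Drop the thousands separators before converting to a number.
    return int(match.group(1).replace(" ", ""))


print(extract_price(box_text))
```

If the box text contains other numbers directly in front of the price, the pattern would need to be tightened for that page.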
Hi @IngKim,
Can you please share the URL of the website that you are trying to scrape the data from?
Hi.
Thank you. Yes, it is: https://www.addnature.no/search.html?id=0&strSearchQuery=Norrøna
Ingrid
See the attachment. It was developed with the Community Edition 2018.3, so it might not be compatible with 2018.2.
I manually customized the XML of the Data Scraping tool to give you the expected result:
<extract>
<row exact='1'>
<webctrl class='product-list-gallery-wrapper cyc-margin_top-2' tag='div' idx='1' />
<webctrl class='cyc-grid cyc-grid--gutters cyc-flex--wrap js-galleryRow gallery' tag='div' idx='1' />
<webctrl class='cyc-grid_cell gallery_item js-galleryItem' tag='div' />
<webctrl tag='div' />
<webctrl class='is-relative' tag='div' idx='1' />
<webctrl tag='div' />
</row>
<column name='Item Name' attr='title' exact='1'>
<webctrl class='product-list-gallery-wrapper cyc-margin_top-2' tag='div' idx='1' />
<webctrl class='cyc-grid cyc-grid--gutters cyc-flex--wrap js-galleryRow gallery' tag='div' idx='1' />
<webctrl class='cyc-grid_cell gallery_item js-galleryItem' tag='div' />
<webctrl tag='div' />
<webctrl class='is-relative' tag='div' idx='1' />
<webctrl tag='div' />
<webctrl tag='a' />
</column>
<column name='Price' attr='text' exact='1'>
<webctrl class='product-list-gallery-wrapper cyc-margin_top-2' tag='div' idx='1' />
<webctrl class='cyc-grid cyc-grid--gutters cyc-flex--wrap js-galleryRow gallery' tag='div' idx='1' />
<webctrl class='cyc-grid_cell gallery_item js-galleryItem' tag='div' />
<webctrl tag='div' />
<webctrl class='is-relative' tag='div' idx='1' />
<webctrl tag='div' />
<webctrl tag='div' class='cyc-margin_top-2 cyc-margin_leftright-3 cyc-height_5'/>
<webctrl tag='div' />
<webctrl tag='span' class='cyc-typo_body-2 cyc-color-text_sale cyc-margin_right-1' />
</column>
</extract>
See here for the project:
DataScrapFromPage.zip (58.7 KB)
Hi @IngKim,
You can try the method mentioned by @loginerror.
Or (this could be a tedious process, though):
You can use data scraping to extract the URL from each image and store it in a DataTable.
Then iterate through each link.
Copy the link, paste it into the browser and let the page load.
From there you can use the Get Text activity and modify the selector to get the price of the product.
Thank you so very much. It works perfectly. I am trying to copy what you have done into my sequence but can't get it to work. Is there any way to merge your piece into my sequence?
It is probably because you need to re-create my variables; once you do, everything should work.
If not, simply replace your XML code with my XML code from the ‘Extract Structured Data’ activity. That is the bit that gets it done.
Hi @pruthvisiddhartha.
Thank you for the answer. I will try @loginerror's suggestion first, then yours if that doesn't work. It is great to have options, so thank you very much.
Ingrid
Simply replace it with the selector for the next page. You can temporarily add a Click activity, point it at the next button on the first page, and then copy the selector value from that Click activity into that box from the screenshot. Afterwards, delete the Click activity, of course.
Hi again @loginerror
I did that and got no error when entering the selector value. But once I pressed OK, there was a new error…
Ingrid
Please include double quotation marks on each side of the selector. This should fix it.
Can you take a wider screenshot so I can see which variable you are adding this value to?
YES!! It worked. Thank you so much. The same problem occurs on other websites as well. Could I use the same XML?
I don't think so. I arrived at this one by trial and error, and it is specific to that one webpage. Hopefully for the other webpages you can use the Data Scraping wizard without modifying the XML (which is the proper way).
Feel free to do these steps to learn how to modify the XML:
- Right-click the element you want to scrape in your browser and select "Inspect Element".
- Compare my XML with the way it looks in the source of the page.
It is really just a list of HTML tags, running from the top of the tree down to the element that holds your value.
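To make that comparison easier, here is a small Python sketch that reads extract XML of the kind posted earlier in the thread and prints each column as a top-to-bottom chain of tags. It uses a shortened copy of the Price column. The helper names `describe_step` and `column_paths` are made up for this illustration and are not part of UiPath, and the ` -> ` separator is only for display.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Price column from the XML posted earlier in the thread.
EXTRACT_XML = """
<extract>
  <column name='Price' attr='text' exact='1'>
    <webctrl class='product-list-gallery-wrapper cyc-margin_top-2' tag='div' idx='1' />
    <webctrl class='cyc-grid_cell gallery_item js-galleryItem' tag='div' />
    <webctrl tag='div' />
    <webctrl tag='span' class='cyc-typo_body-2 cyc-color-text_sale cyc-margin_right-1' />
  </column>
</extract>
"""


def describe_step(webctrl):
    """Render one <webctrl> step as 'tag.first-class', or just 'tag' if it has no class."""
    tag = webctrl.get("tag", "?")
    classes = webctrl.get("class", "").split()
    return f"{tag}.{classes[0]}" if classes else tag


def column_paths(xml_text):
    """Map each column name to its chain of steps, outermost element first."""
    root = ET.fromstring(xml_text)
    return {
        column.get("name"): " -> ".join(
            describe_step(step) for step in column.findall("webctrl")
        )
        for column in root.findall("column")
    }


for name, path in column_paths(EXTRACT_XML).items():
    print(f"{name}: {path}")
```

Reading the printed chain next to the browser's Inspect Element view shows which step of the XML corresponds to which nested element on the page.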
Hi,
I want to scrape a data table using the Data Scraping option, without using the Get Attribute or Check activity. I need to get the checkbox status. Is that possible?
https://editor.datatables.net/examples/api/checkbox.html
The above is the URL. When I extract the data, I get all the fields except the checkbox column. Within the Extract Data option I need to get the checkbox status, such as false or true, or checked or unchecked.
I tried changing the extract metadata, but I am not getting any result.
Thanks,
tej.