Filtering OuterHtml Attribute

Hello,

I am trying to count the number of images/videos on a website. The only solution I have found so far is to grab the outer HTML, which gives a list of the tab items inside. However, there are multiple tags in the outer HTML code. Would there be a way to filter through the entire HTML code and see which has the greatest number, perhaps by filtering the string? If the tab item number is greater than the previous one, then store that instead. In addition, if there is a way to count videos, that would be great. Thank you.

Image Example

@kodm You can run a regex match for “.jpg” on the outer HTML and count the matches. That should give the total number of images present. Have you tried it?

Hi @supermanPunch, thank you for your response. Unfortunately this doesn’t work, because “.jpg” is repeated multiple times in the HTML and the count comes out inaccurate. Is there anything else you think might work? I tried matching “slide” as well, but instead of 6 slides it returns 8.
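To illustrate why the raw match count is too high, here is a minimal Python sketch (outside UiPath, purely for illustration). The `sample_html` string and both function names are made up for this example and are not taken from the real page. Note also that in a regex an unescaped `.` matches any character, so `\.jpg` is the safer pattern. Whether the number of distinct URLs equals the number of gallery images depends on the page markup, so treat this as an assumption to check rather than a fix.

```python
import re

# Hypothetical outer HTML: the same image file is referenced by several attributes.
sample_html = """
<li><a class="thumbnail-link" href="/img/a.jpg"><img src="/img/a.jpg" data-lgimg="/img/a.jpg"></a></li>
<li><a class="thumbnail-link" href="/img/b.jpg"><img src="/img/b.jpg"></a></li>
"""

def count_jpg_mentions(html):
    """Count every literal '.jpg' occurrence (this is what over-counts)."""
    return len(re.findall(r"\.jpg", html))

def count_distinct_jpg_urls(html):
    """Collect quoted attribute values ending in .jpg and count the unique ones."""
    urls = re.findall(r"""["']([^"']*\.jpg)(?:\?[^"']*)?["']""", html)
    return len(set(urls))

print(count_jpg_mentions(sample_html))       # every mention of ".jpg"
print(count_distinct_jpg_urls(sample_html))  # unique image URLs only
```

In this made-up sample there are two images, but “.jpg” appears five times, which is the same kind of mismatch described above.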

@kodm Is that website public ?

Hi @supermanPunch. Yes, it is. It’s a shopping website. I’ll post it: https://www.sephora.com/product/innisfree-daily-uv-defense-sunscreen-spf-36-P456392?skuId=2338325&icid2=products%20grid:p456392:product. For this particular link it should count 4 images and one video; however, when I regex “.jpg” I get more results.

@kodm

As mentioned in your other posts, it can be done with data scraping:
[Screenshot: 4 images, 1 video (scrolled to the right for the screenshot)]

Data scraping result:
[Screenshot: one column for the image info, one column for the video info]

Done with the following extract metadata:

<extract>
	<row exact="1">
		<webctrl tag="div" class="owl-stage-outer" idx="1"/>
		<webctrl tag="div" class="owl-stage" idx="1"/>
		<webctrl tag="div" />
		<webctrl tag="li" idx="1"/>
	</row>
	<column exact="1" name="ImageInfo" attr="src">
		<webctrl tag="a" class="thumbnail-link" idx="1"/>
		<webctrl tag="img" idx="1"/>
	</column>
	<column exact="1" name="VideoInfo" attr="data-lgimg">
		<webctrl tag="a" class="productthumbnail video" idx="1"/>
	</column>
</extract>

Doing it with Find Children follows the same logic:

  • filter on the a elements and analyse their classes:
    productthumbnail video for a video, thumbnail-link for an image

So a quick count can be implemented.
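The class-based counting described above can be sketched in Python (outside UiPath, for illustration only). This is a minimal sketch under stated assumptions: the `sample_html` markup is invented, and `ThumbnailCounter` and `count_media` are hypothetical names. Only the two class values, `thumbnail-link` and `productthumbnail video`, come from the extract XML in this post.

```python
from html.parser import HTMLParser

class ThumbnailCounter(HTMLParser):
    """Counts <a> elements by class: video thumbnails vs. image thumbnails."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.videos = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Check the video classes first, in case a video anchor also
        # carries the image class (an assumption, not verified on the page).
        if "productthumbnail" in classes and "video" in classes:
            self.videos += 1
        elif "thumbnail-link" in classes:
            self.images += 1

def count_media(html):
    """Return (image_count, video_count) for the given outer HTML."""
    parser = ThumbnailCounter()
    parser.feed(html)
    return parser.images, parser.videos

# Hypothetical markup: four image thumbnails and one video thumbnail.
sample_html = (
    "<ul>"
    + '<li><a class="thumbnail-link" href="#"><img src="x.jpg"></a></li>' * 4
    + '<li><a class="productthumbnail video" data-lgimg="{}"></a></li>'
    + "</ul>"
)

images, videos = count_media(sample_html)
print(images, videos)
```

Counting anchors by class avoids the repeated “.jpg” problem, because each thumbnail contributes exactly one a element regardless of how often its file name appears.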

But since data scraping retrieves more information in one pass, that approach is suggested.

Kindly note:


@ppr Thank you for your response. I’ll make sure to avoid duplicate topics. However, this does not work for me. Whenever I try data scraping, it returns no image info or video info. Am I data scraping the wrong pattern? I click on each image on the slide so it follows the next one ahead of it, but I get a null result.

@kodm
This is quite OK and expected. When you indicate the elements to create the column selectors, the result will be blank at the beginning, because it takes the text attribute for scraping. But once you open Edit Data Definition and make the adoptions (adaptations to the extract XML), it will show, e.g., the src of an image. Just give it a try and be guided by my snippet.

Kindly note:

  • after editing the extract XML, always copy it to the clipboard. Sometimes it fails on the first run, and then you can redo it more easily and quickly
  • sometimes it gets confused after too many unclear edits. Stop scraping and do it again

@ppr Thank you for the help. When I click on Edit Data Definition, it provides me with the following code even though the result is blank. However, when I paste the results into an Excel file as instructed, it is still blank. What did you mean by adoptions? Here are the results I got.

Web Scrape Result

@kodm
Comparing your extract metadata XML with the extract metadata XML provided in my post above shows a really big difference, and for sure it will not work.

What did you mean by adoptions?

I just configured the first column by indicating the preview images (first, second), then I configured a second column by indicating the same first and second preview images. Sure, it shows an empty result, but after clicking on Edit Data Definition the first rework of the extract metadata can be done.

Try replacing the extract XML generated during configuration with my extract XML, or try again (but configure the 2 columns properly via the definition of correlated data).