Table Extraction: half the records contain an image URL and half do not!

AmarBouz · March 6, 2021, 9:52am

Hi,
When I use Table Extraction I obtain the following results:

Some records are perfect, such as https://ih1.redbubble.net/image.2164397085.2836/st,small,507x507-pad,600x600,f8f8f8.jpg

And some records are wrong, such as data:image/gif;base64,R0lGODdhFQAXAPAAANba3wAAACwAAAAAFQAXAAACFISPqcvtD6OctNqLs968+w+GolUAADs=

Best Regards

supermanPunch · March 6, 2021, 10:37am

@AmarBouz Can you let us know whether the website you are scraping is public? If it is, could you share the link so that we can analyse it?

Also, can you try tweaking the Metadata property of the Extract Structured Data activity by following the steps mentioned in the post below?

Let us know if you are still not able to get the data you need.

loginerror · March 6, 2021, 10:41am

Hi @AmarBouz

Interesting. It seems the website encodes some GIFs as base64 strings directly in the 'src' attribute (a data URI). Think of it as an image saved as a string.

Try to decode it for yourself:

If that works, you could probably use some code to convert these strings to image files.
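As a rough illustration of that idea, here is a minimal Python sketch (not a UiPath activity) that decodes the data URI from the first post and reads the image size from the GIF header. The function names `decode_data_uri` and `save_data_uri` are made up for this example; only the standard library is used.

```python
import base64
from pathlib import Path
from typing import Tuple


def decode_data_uri(data_uri: str) -> Tuple[str, bytes]:
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


def save_data_uri(data_uri: str, path: str) -> int:
    """Decode the data URI and write the bytes to path (hypothetical helper)."""
    _, raw = decode_data_uri(data_uri)
    return Path(path).write_bytes(raw)


# The "wrong" record quoted in the first post of this thread.
src = (
    "data:image/gif;base64,"
    "R0lGODdhFQAXAPAAANba3wAAACwAAAAAFQAXAAACFISPqcvtD6OctNqLs968+w+GolUAADs="
)

mime_type, raw = decode_data_uri(src)

# In a GIF file, bytes 6-7 hold the width and bytes 8-9 the height (little-endian).
width = int.from_bytes(raw[6:8], "little")
height = int.from_bytes(raw[8:10], "little")

print(mime_type, raw[:6], f"{width}x{height}", len(raw), "bytes")
```

For this sample the header reports a very small GIF (21x23 pixels), which suggests the string is a stand-in image rather than the product picture, so saving it to a file would not give you the image you are after.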

AmarBouz · March 6, 2021, 4:47pm

Hi,
The website is public:
https://www.redbubble.com/
I entered the following search phrase: “Sticker Artmark8 Thank You Arabs”

And I obtain:

supermanPunch · March 7, 2021, 11:50am

@AmarBouz Can you check the workflow below:
Data Scraping.zip (4.0 KB)

To work around this problem, I hover over every 4th element in each row and then let the Extract Structured Data activity do its work.
However, this is time-consuming, because it has to hover over a large number of items.
I also tried converting the base64-encoded image to an image file, but unfortunately I did not succeed: the result did not render as the expected image.

It looks like the webpage lazy-loads the image of each listed item. Until an item has been scrolled into view, its real image is not loaded, so its 'src' attribute does not yet hold the image URL when Data Scraping is run directly.

However, if you scroll through the items, either one by one or row by row, the images on the scrolled parts are loaded fully, and you can then read the proper 'src' value.
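To check how many rows were scraped before their image had loaded, one option is to post-process the extracted column and separate real URLs from base64 placeholders. The Python sketch below is only an illustration under that assumption; `split_records` is a hypothetical helper, and the sample values are the two records quoted in the first post.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # (row index, src value)


def split_records(srcs: List[str]) -> Tuple[List[Record], List[Record]]:
    """Separate loaded image URLs from lazy-loading placeholders.

    A value starting with "data:" is treated as a placeholder, meaning the
    row was scraped before its image was scrolled into view.
    """
    loaded: List[Record] = []
    pending: List[Record] = []
    for index, src in enumerate(srcs):
        target = pending if src.strip().lower().startswith("data:") else loaded
        target.append((index, src))
    return loaded, pending


scraped = [
    "https://ih1.redbubble.net/image.2164397085.2836/st,small,507x507-pad,600x600,f8f8f8.jpg",
    "data:image/gif;base64,R0lGODdhFQAXAPAAANba3wAAACwAAAAAFQAXAAACFISPqcvtD6OctNqLs968+w+GolUAADs=",
]

loaded, pending = split_records(scraped)
print("loaded rows:", [i for i, _ in loaded])
print("rows to re-scrape after scrolling:", [i for i, _ in pending])
```

The row indices in `pending` tell you which items still need to be scrolled into view (or hovered) before running the extraction again.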

A little bit about lazy loading:

If you inspect the last element, or the last few elements, of the web page (the ones not yet scrolled through), you will see the base64 encoding of an image in 'src' (I suspect a partial encoding):

Can you check the workflow provided and let us know whether it works? If you have already found a solution, please post it so that it helps others as well.

I would also ask you to keep the post open for suggestions from others.
