Table Extraction: half of the records contain an image URL, and half do not!

Hi,
When I use Table Extraction I obtain the following results:

Some records are correct, such as https://ih1.redbubble.net/image.2164397085.2836/st,small,507x507-pad,600x600,f8f8f8.jpg

And some records are wrong, such as data:image/gif;base64,R0lGODdhFQAXAPAAANba3wAAACwAAAAAFQAXAAACFISPqcvtD6OctNqLs968+w+GolUAADs=

Best Regards

@AmarBouz Can you let us know whether the website you are scraping is public? If it is, could you share the link so that we can analyse it?

Also, can you try tweaking the Metadata property in the Extract Structured Data activity by following the steps in the post below?

Let us know if you are still unable to get the data you need.

Hi @AmarBouz

Interesting. It would seem like the website chose to encode GIFs as base64 strings. Think of it as an image saved as a string.

Try to decode it for yourself:

If that works, you could probably use some code to convert these to files :slight_smile:
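As a rough illustration of that idea, here is a minimal Python sketch (not a UiPath activity) that splits a base64 data URI into its MIME type and raw bytes and reads the image size from the GIF header. The helper names `decode_data_uri` and `save_data_uri` and the output path are hypothetical; the sample string is the one quoted earlier in this thread.

```python
import base64
from typing import Tuple


def decode_data_uri(uri: str) -> Tuple[str, bytes]:
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything before the first comma is the header, the rest is the payload.
    header, _, payload = uri[len("data:"):].partition(",")
    if not header.endswith(";base64"):
        raise ValueError("this sketch only handles base64 data URIs")
    mime_type = header[: -len(";base64")]
    return mime_type, base64.b64decode(payload)


def save_data_uri(uri: str, path: str) -> None:
    """Write the decoded bytes to a file (hypothetical helper)."""
    _, data = decode_data_uri(uri)
    with open(path, "wb") as handle:
        handle.write(data)


# The "wrong" record quoted in the original question.
SAMPLE = (
    "data:image/gif;base64,"
    "R0lGODdhFQAXAPAAANba3wAAACwAAAAAFQAXAAACFISPqcvtD6OctNqLs968+w+GolUAADs="
)

mime_type, data = decode_data_uri(SAMPLE)
# In a GIF file, bytes 6-9 hold the width and height as little-endian 16-bit values.
width = int.from_bytes(data[6:8], "little")
height = int.from_bytes(data[8:10], "little")
print(mime_type, len(data), "bytes,", width, "x", height)
```

For the string quoted above, the decoded data is only 53 bytes and the header reports a 21x23 image, which suggests a small placeholder rather than the product picture itself.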


Hi,
The website is public:
https://www.redbubble.com/
I entered the following search phrase: “Sticker Artmark8 Thank You Arabs”

And obtained:

@AmarBouz Can you check the workflow below:
Data Scraping.zip (4.0 KB)

To work around the problem, I hover over every 4th element in a row and then let the Extract Structured Data activity do its work.
This is time-consuming, though, because it has to hover over a large number of items.
I also tried converting the base64-encoded image to an image file, but unfortunately it did not render properly.

It seems the webpage lazy-loads the image of each listed item. If Data Scraping runs directly, images that have not been loaded yet still have an incomplete 'src' value.

However, if you scroll through the items, either one by one or row by row, the images in the scrolled parts load fully and you can then get the proper 'src' value.
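To check how many rows were scraped before their image had loaded, the extracted table can be split on whether 'src' is still a data URI. The following is a minimal Python sketch, not part of the attached workflow; the function name `split_by_src` and the `src` column name are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def split_by_src(records: List[Record], src_key: str = "src") -> Tuple[List[Record], List[Record]]:
    """Separate rows with a real image URL from rows still holding a data URI."""
    loaded: List[Record] = []
    placeholders: List[Record] = []
    for record in records:
        src = record.get(src_key, "")
        if src.startswith("data:"):
            # Image not loaded yet: the page still shows an inline placeholder.
            placeholders.append(record)
        else:
            loaded.append(record)
    return loaded, placeholders


# Two rows shaped like the records quoted in this thread.
rows = [
    {"src": "https://ih1.redbubble.net/image.2164397085.2836/st,small,507x507-pad,600x600,f8f8f8.jpg"},
    {"src": "data:image/gif;base64,R0lGODdhFQAXAPAAANba3wAAACwAAAAAFQAXAAACFISPqcvtD6OctNqLs968+w+GolUAADs="},
]

loaded, placeholders = split_by_src(rows)
print(len(loaded), "loaded,", len(placeholders), "still placeholders")
```

If the second list is not empty, those items probably need to be scrolled into view (or hovered over) and scraped again.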

A little bit about Lazy Loading :

Inspecting the last element, or the last few elements (not yet scrolled into view), shows the base64 encoding of the image (suspected partial encoding):

Can you check the workflow provided and let us know whether it works? If you have already found a solution, please post it so that it helps others as well.

I would also ask you to keep the post open for suggestions from others :slightly_smiling_face:
