Facing issues in Comment extraction from Amazon.in website

arijit1213 · September 7, 2020, 5:24am

Hi,
I want to extract comments from the amazon.in website. I built a workflow that extracts all reviews from Amazon and inserts the comments into a database.

I used data scraping to extract title, posted_by, post_date, rating and review, and added one extra data column, product_id.
This is my result output.

But extracting all comments with data scraping every day is a time-consuming process, so I want to extract only the new comments that are not yet present in the database. How can I do this?
Or what would be the best approach to extract new comments?
Thank you.

ppr · September 7, 2020, 11:28am

@arijit1213
This is just an idea for an alternative flow (I assume you are doing data scraping with paging):

order the reviews by most recent
grab the review count
define a threshold that forks between scraping all reviews and scraping only the differences

In case of scraping only the differences:

scrape the first page
check whether the latest review from the previous retrieval is part of it
if the old latest review is not found, go to the next page and repeat
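
The difference-scraping steps above can be sketched roughly as follows. This is plain Python rather than UiPath activities, and only a sketch under assumptions: `fetch_page` is a hypothetical stand-in for whatever scrapes one page of reviews (sorted by most recent), `review_key` assumes that posted_by, post_date and title together identify a review, and `max_pages` is an arbitrary safety limit in case the old latest review has been deleted from the site.

```python
from typing import Callable, Dict, List, Tuple

Review = Dict[str, str]


def review_key(review: Review) -> Tuple[str, str, str]:
    # Assumption: these three columns together identify one review.
    return (review["posted_by"], review["post_date"], review["title"])


def scrape_new_reviews(
    fetch_page: Callable[[int], List[Review]],
    last_known_key: Tuple[str, str, str],
    max_pages: int,
) -> List[Review]:
    """Collect reviews page by page until the last known review shows up."""
    new_reviews: List[Review] = []
    for page_no in range(1, max_pages + 1):
        page = fetch_page(page_no)  # hypothetical: returns rows of one page
        if not page:
            break  # no more pages available
        for review in page:
            if review_key(review) == last_known_key:
                # Everything after this point was retrieved last time.
                return new_reviews
            new_reviews.append(review)
    return new_reviews
```

If the old latest review is never found, the loop simply returns everything it saw up to `max_pages`, so duplicate detection before the database insert is still worthwhile.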

Merging and consolidating the new differences into the existing reviews can be done by:

Join (left join - dtnew to dtOld)
duplicate detection
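
To illustrate the join and duplicate detection idea, here is a small Python sketch (not a UiPath workflow). It keeps only the rows of the newly scraped data that have no match in the old data, which is what a left join followed by filtering out the matched rows would give. The key columns are an assumption; any combination that identifies a review in your table would do.

```python
from typing import Dict, Iterable, List, Sequence

Row = Dict[str, str]

# Assumption: these columns together identify a review.
KEY_COLUMNS = ("product_id", "posted_by", "post_date", "title")


def rows_to_insert(
    old_rows: Iterable[Row],
    new_rows: Iterable[Row],
    key_columns: Sequence[str] = KEY_COLUMNS,
) -> List[Row]:
    """Return the rows of new_rows that are not already in old_rows."""
    seen = {tuple(row[c] for c in key_columns) for row in old_rows}
    result: List[Row] = []
    for row in new_rows:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)  # also drops duplicates inside new_rows itself
            result.append(row)
    return result
```

Only the returned rows would then need to be inserted into the database.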

arijit1213 · September 7, 2020, 1:04pm

@ppr
Can you send me the workflow?

ppr · September 7, 2020, 1:06pm

@arijit1213
There is no workflow, as I only sketched a conceptual approach for the retrieval.
But we can help you set up the building blocks, e.g. finding duplicates, in case you need further help.

arijit1213 · September 7, 2020, 1:12pm

I have set the reviews to be ordered by most recent.
What should I do after that? Can you explain these two points:
how to grab the review count,
and how to define a threshold that forks between scraping all reviews and scraping only the differences?

arijit1213 · September 10, 2020, 2:22am

So I want to scrape the first page and then check whether all of the scraped data from the first page is already present in the database. If all rows from the first page are present in the database, close the tab. If one or more unmatched rows are found on the first page, I want to go to the next page, scrape the second page and check whether its scraped data is present in the database. I want to repeat the same process for the following pages. How can I do this?
Thank you

ppr · September 10, 2020, 8:11am

@arijit1213
You are on the right track. So just divide the different steps into smaller tasks and map them to the corresponding activities:

Finding out the review count:

Get Text, then e.g. Regex for extracting the count and CInt for the conversion to Integer
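
As a rough illustration of the regex and conversion step, here is a Python sketch (in a workflow the equivalent would be a regex match followed by CInt). The pattern is an assumption about how the count appears in the text returned by Get Text, so adjust it to what the page really shows.

```python
import re


def parse_review_count(text: str) -> int:
    """Extract the number that directly precedes the word 'reviews'."""
    # Assumption: the text looks like "567 global reviews" or
    # "Showing 1-10 of 2,345 reviews"; thousands separators are commas.
    match = re.search(r"([\d,]+)\s+(?:global\s+)?reviews?", text, re.IGNORECASE)
    if match is None:
        raise ValueError("no review count found in: %r" % text)
    # Remove the separators before converting to an integer.
    return int(match.group(1).replace(",", ""))
```

Raising an error when nothing matches makes a changed page layout visible instead of silently continuing with a wrong count.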

Forking, i.e. deciding whether to read all reviews or only the differences:

if activity - if NoOfReviews < XXX then… else…
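
The same decision written as a tiny Python function, only to make the fork explicit. The threshold has no default on purpose: its value depends on how long a full scrape takes for you, and the mode names are made up for this sketch.

```python
def choose_scrape_mode(no_of_reviews: int, threshold: int) -> str:
    """Pick a full scrape for small products, difference scraping otherwise."""
    if no_of_reviews < threshold:
        # Few reviews: scraping everything again is cheap enough.
        return "all"
    # Many reviews: only look for the ones added since the last run.
    return "differences"
```

For example, with a hypothetical threshold of 100, a product with 40 reviews would be scraped completely and one with 500 reviews would go through the difference scraping branch.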

Finding the most recent post in the existing, previously retrieved data (table):

filtering on different columns: filter datatable, datatablevar.Select… or LINQ
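
A sketch of finding the most recent stored review, again in Python instead of Filter Data Table or LINQ. It assumes that the post_date column holds text containing a date such as "7 September 2020" with an English month name; if your column looks different, the pattern and the format string need to change.

```python
import re
from datetime import date, datetime
from typing import Dict, List

Row = Dict[str, str]


def parse_post_date(text: str) -> date:
    """Pull a 'day month-name year' date out of the post_date text."""
    match = re.search(r"\d{1,2} [A-Za-z]+ \d{4}", text)
    if match is None:
        raise ValueError("no date found in: %r" % text)
    return datetime.strptime(match.group(0), "%d %B %Y").date()


def latest_review(rows: List[Row]) -> Row:
    """Return the stored row with the newest post_date."""
    return max(rows, key=lambda row: parse_post_date(row["post_date"]))
```

Several reviews can share the same date, so it is safer to compare on more columns than the date alone (for example posted_by and title as well) when checking whether this review appears on a scraped page.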

etc.
