Facing issues extracting comments from the Amazon.in website

Hi,
I want to extract comments from the Amazon.in website. I built a workflow that extracts all reviews from Amazon and inserts the comments into a database.

I used data scraping to extract title, posted_by, post_date, rating and review, and added one extra data column, product_id.
This is my result output.
image

But extracting all comments with data scraping every day takes a long time, so I want to extract only the new comments that are not yet in the database. How can I do this?
Or what would be the best approach to extract only the new comments?
Thank you.

@arijit1213
It's just an idea for an alternate flow (I assume you are doing data scraping with paging):

  • order the reviews by most recent
  • grab the review count
  • define a threshold that decides between scraping all reviews and scraping only the differences

In case of scraping only the differences:

  • scrape the first page
  • check whether the latest review from the previous retrieval is on it
  • if that review is not found, go to the next page and repeat
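
The loop above could be sketched roughly like this. It is only an illustration in Python, not a UiPath workflow: `fetch_page` is a hypothetical stand-in for whatever scrapes one page of reviews (sorted by most recent), and identifying a review by posted_by, post_date and title is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

Review = Dict[str, str]


def review_key(review: Review) -> Tuple[str, str, str]:
    # Assumption: these three columns together identify a review.
    return (review["posted_by"], review["post_date"], review["title"])


def scrape_new_reviews(
    fetch_page: Callable[[int], List[Review]],  # hypothetical page scraper
    last_known: Optional[Review],
    max_pages: int = 50,
) -> List[Review]:
    """Collect reviews that appear before last_known, newest first."""
    stop_key = review_key(last_known) if last_known else None
    new_reviews: List[Review] = []
    for page_number in range(1, max_pages + 1):
        page = fetch_page(page_number)
        if not page:
            break  # no further pages
        for review in page:
            if review_key(review) == stop_key:
                return new_reviews  # reached data that is already stored
            new_reviews.append(review)
    return new_reviews
```

If `last_known` is `None` (nothing stored yet), the loop simply reads every page up to `max_pages`, which is the full scrape.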

Merging and consolidating the new differences into the existing reviews can be done by:

  • Join (left join, dtNew to dtOld)
  • duplicate detection
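
As a rough sketch of what the join plus duplicate detection boils down to: keep only the rows of the new table that have no match in the old one. This is plain Python over lists of dicts rather than DataTables, and the choice of key columns in `KEY_COLUMNS` is an assumption.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]

# Assumption: these columns together identify one review.
KEY_COLUMNS = ("product_id", "posted_by", "post_date", "title")


def row_key(row: Row) -> Tuple[str, ...]:
    return tuple(row[column] for column in KEY_COLUMNS)


def rows_missing_from_old(dt_new: List[Row], dt_old: List[Row]) -> List[Row]:
    """Return the rows of dt_new that have no matching key in dt_old."""
    known_keys = {row_key(row) for row in dt_old}
    unmatched: List[Row] = []
    for row in dt_new:
        key = row_key(row)
        if key not in known_keys:
            unmatched.append(row)
            known_keys.add(key)  # also drops duplicates inside dt_new itself
    return unmatched
```

Only the returned rows would then need to be inserted into the database.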

@ppr
Can you send me the workflow?

@arijit1213
There is no workflow, as I only sketched a conceptual approach for the retrieval.
But we can help you set up the building blocks, e.g. finding duplicates, in case you need further help.

I have set the reviews to be sorted by most recent.
What can I do after that? Can you explain these two points: how to grab the review count, and how to define a threshold that decides between scraping all reviews and scraping only the differences?

So I want to scrape the first page and then check whether all of the scraped data from that page is already in the database. If every row from the first page is present in the database, close the tab. If one or more unmatched rows are found on the first page, I want to go to the next page, scrape it, and check whether its data is present in the database. I want to repeat the same process for the following pages. How can I do this?
Thank you.

@arijit1213
You are on the right track. So just divide the different steps into smaller tasks and map them to the corresponding actions:

Finding out the review count:

  • Get Text, then e.g. a Regex for extracting the count and CInt for the conversion to Integer
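
To illustrate the regex plus integer conversion step, here is a small Python sketch (in the workflow itself this would be the Regex and CInt equivalents). The sample texts are made up, since the exact wording Amazon shows next to the count is an assumption here.

```python
import re


def extract_review_count(text: str) -> int:
    """Pull the number in front of the word 'review(s)' out of a scraped text."""
    # Allows thousands separators and one optional word such as "global".
    match = re.search(r"(\d[\d,]*)\s+(?:\w+\s+)?reviews?\b", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no review count found in: {text!r}")
    # Drop the thousands separators before converting to an integer.
    return int(match.group(1).replace(",", ""))


print(extract_review_count("Showing 1-10 of 1,234 reviews"))
print(extract_review_count("567 global reviews"))
```

If the text on the page is worded differently, only the pattern needs to change.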

Forking between reading all reviews and reading only the differences:

  • If activity: if NoOfReviews < XXX then… else…
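
The fork itself is just one comparison. A tiny Python sketch, where the threshold is left as a parameter and the mapping of the branches (few reviews means scrape everything, otherwise scrape only the differences) is an assumption:

```python
def choose_scrape_mode(no_of_reviews: int, threshold: int) -> str:
    """Return "all" for a full scrape or "differences" for an incremental one."""
    # Assumption: below the threshold a full scrape is cheap enough.
    if no_of_reviews < threshold:
        return "all"
    return "differences"
```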

Finding the most recent post in the existing, previously retrieved data (table):

  • filtering on different columns: Filter Data Table, datatablevar.Select… or LINQ

etc.
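
For the "most recent post" step, the filtering could look roughly like this in Python (the Select or LINQ version in the workflow would do the same thing). It assumes the stored post_date values are ISO strings such as 2020-03-12. If the scraped date is in another format it would need to be parsed first.

```python
from datetime import date
from typing import Dict, List, Optional

Row = Dict[str, str]


def latest_review(rows: List[Row], product_id: str) -> Optional[Row]:
    """Return the newest stored review for one product, or None if there is none."""
    candidates = [row for row in rows if row["product_id"] == product_id]
    if not candidates:
        return None
    # Assumption: post_date is stored as an ISO date string (YYYY-MM-DD).
    return max(candidates, key=lambda row: date.fromisoformat(row["post_date"]))
```

The row returned here is the one to look for while paging through the freshly sorted reviews.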