Web Scraping and Sentiment Analysis of Amazon Reviews

Note: Since the code in this post is outdated, as of 3/4/2019 a new post on Scraping Amazon and Sentiment Analysis (along with other NLP topics such as Word Embedding and Topic Modeling) are available through the links!

How to Scrape the Web in R

Most things on the web are actually scrapable. By selecting certain elements or paths of any given webpage and extracting parts of interest (also known as parsing), we are able to obtain data. A simple example of webscraping in R can be found in this awesome blog post on R-bloggers.

We will use Amazon for an example in this post. Let’s say we have the ASIN code of a product B0043WCH66. Let’s scrape the product name of this on Amazon. The URL of Amazon’s product pages are easy to build; simply concatenate the ASIN code to the “base” URL as such: https://www.amazon.com/dp/B0043WCH66.

We build the URL, and point to a specific node #productTitle of the HTML web page using the CSS selector (read about CSS Selector and how to obtain it using the SelectorGadget here). Finally, we clean and parse the text to obtain just the product name:

pacman::p_load(XML, dplyr, stringr, rvest, audio)

#Remove all white space
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

prod_code = "B0043WCH66"
url <- paste0("https://www.amazon.com/dp/", prod_code)
doc <- read_html(url)

#obtain the text in the node, remove "\n" from the text, and remove white space
prod <- html_nodes(doc, "#productTitle") %>% html_text() %>% gsub("\n", "", .) %>% trim()
prod

## [1] "Bose® MIE2i Mobile Headset"

With this simple code, we were able to obtain the product name of this ASIN code.

Now say we want to scrape more data of the product BoseÂ® MIE2i Mobile Headset. We will use a function amazonscraper (available on my github). We will pull the first 10 pages of reviews:

#Source funtion to Parse Amazon html pages for data
source("https://raw.githubusercontent.com/rjsaito/Just-R-Things/master/Text%20Mining/amazonscraper.R")

pages <- 10

reviews_all <- NULL
for(page_num in 1:pages){
  url <- paste0("http://www.amazon.com/product-reviews/",prod_code,"/?pageNumber=", page_num)
  doc <- read_html(url)

  reviews <- amazon_scraper(doc, reviewer = F, delay = 2)
  reviews_all <- rbind(reviews_all, cbind(prod, reviews))
}

## 'data.frame':    100 obs. of  9 variables:
##  $ prod        : chr  "Bose® MIE2i Mobile Headset" "Bose® MIE2i Mobile Headset" "Bose® MIE2i Mobile Headset" "Bose® MIE2i Mobile Headset" ...
##  $ title       : chr  "Clarity, Comfort, Construction" "Awesome!" "Bose's signature sound works in its advantage" "Very disappointed as the wires are not lasting." ...
##  $ author      : chr  "Richard Blumberg" "James R. Spitznas" "John Barta" "Rick Gillis" ...
##  $ date        : chr  "November 20, 2010" "December 28, 2010" "June 29, 2011" "September 6, 2012" ...
##  $ ver.purchase: int  1 0 0 1 1 0 1 0 1 0 ...
##  $ format      : chr  "Package Type: Standard Packaging" "Package Type: Standard Packaging" "Package Type: Standard Packaging" "Package Type: Standard Packaging" ...
##  $ stars       : int  5 4 4 2 1 1 1 1 5 1 ...
##  $ comments    : chr  "These are the sixth or seventh set of earbuds I've had for a succession of iPods and iPhones. I'm slightly hard of hearing, and"| __truncated__ "I purchased these to replace the Sennheiser's that I had been using at the gym, cycling and skiing.  BTW I will never buy anoth"| __truncated__ "When you take 100 dollar earbuds the biggest hurdle is the technology. No matter how accurate the earbud WANTS to be, its going"| __truncated__ "While I would normally boast of the Bose name, this time I am let down. I have had these earphones for just over a year, and I "| __truncated__ ...
##  $ helpful     : int  577 63 166 18 59 14 11 177 93 23 ...

With amazonscraper, we obtained several values for each of the first 100 reviews of the product.

Sentiment Analysis in R

Now that we were able to obtain all this data, what can we do with this? Sure, we can read through all these reviews to see what people are saying about this product or how they feel about it, but that doesn’t seem like a good use of time. That’s where Sentiment Analysis comes in handy.

Sentiment Analysis is a Natural Langauge Processing method that allows us to obtain the general sentiment or “feeling” on some text. Sure we can just look at the star ratings themselves, but actually star ratings are not always consistent with the sentiment of the reviews. Sentiment is measured on a polar scale, with a negative value representing a negative sentiment, and positive value representing a positive sentiment.

Package ‘sentimentr’ allows for quick and simple yet elegant sentiment analysis, where sentiment is obtained on each sentences within reviews and aggregated over the whole review. In this method of sentiment analysis, sentiment is obtained by identifying tokens (any element that may represent a sentiment, i.e. words, punctiation, symbols) within the text that represent a postive or negative sentiment, and scores the text based on number of positive tokens, negative tokens, length of text, etc:

pacman::p_load_gh("trinker/sentimentr")

sent_agg <- with(reviews_all, sentiment_by(comments))
head(sent_agg)

##    element_id word_count        sd ave_sentiment
## 1:          1        670 0.4730184    0.16014619
## 2:          2        373 0.2602493    0.06357467
## 3:          3        724 0.3943997    0.26884453
## 4:          4        164 0.1905274   -0.01444510
## 5:          5        406 0.4812642    0.03993990
## 6:          6        112 0.3986540    0.03549016

par(mfrow=c(1,2))
with(reviews_all, hist(stars))
with(sent_agg, hist(ave_sentiment))

mean(reviews_all$stars)

## [1] 3.5

mean(sent_agg$ave_sentiment)

## [1] 0.1512848

You can see here there is a major inconsistency between stars and sentiment, even just by comparing the distrubution of both. In addition, while the average star rating is 3.5, the average sentiment is actually distrubuted around near 0 (neutral sentiment).

Now let’s see how these sentiments are actually being determined at the sentence level. Let’s obtain the reviews with highest sentiment and lowest sentiment, and take a look. The function highlight in sentimentr allows us to do this easisly.

best_reviews <- slice(reviews_all, top_n(sent_agg, 3, ave_sentiment)$element_id) with(best_reviews, sentiment_by(comments)) %>% highlight()

worst_reviews <- slice(reviews_all, top_n(sent_agg, 3, -ave_sentiment)$element_id) with(worst_reviews, sentiment_by(comments)) %>% highlight()

While the positive reviews have all positive sentiments, the negative reviews are actually a mix of positive and negative, where the negative significantly outweights the positive.

While these sentiments do not perfectly capture the true sentiments in these reviews, it is a quick and decently accurate method to quickly obtain the sentiments of these reviewers.

This method of sentiment analysis is a simple approach, and there are a number of widely known methods of sentiment anaylsis (one of which I am interested is in a machine learning approach to sentiment analysis) that involve analysing text by considering sequence of words and relationships between these sequence of words (here is a basic explanation in this youtube video).

18 thoughts on “Web Scraping and Sentiment Analysis of Amazon Reviews”

Alexander Mafael says:

November 22, 2016 at 3:23 pm

Hey Riki, cool stuff! I have one question, as I am not as familiar with R as you: Is it possible to save the data for each ASIN/Review Dataset as a csv for latter transportation to other statistics programs? Thanks for your help! Best, Alex

LikeLike

1. That's Awesome says:
  
  September 23, 2017 at 11:52 am
  
  This is really late but yes, you can!
  use this code
  
  write.csv(reviews_all,’reviews.csv’)
  
  It will save in your computer’s R working directory
  
  LikeLiked by 1 person
  
  1. Mars says:
    
    October 6, 2017 at 5:45 am
    
    I have a problem when I run this: “Error in data.frame(title, author, date, ver.purchase, format, stars, :
    arguments imply differing number of rows: 10, 9”
    
    What this means? Thank you very much!
    
    LikeLike
  2. Riki Saito says:
    
    December 30, 2017 at 9:04 pm
    
    This occurs most likely because one (or more) of the variables you are scraping had a missing value from the original page on Amazon, thus skipped over a value and only pulled 9 values instead of 10 – you might want to look into each variable and see which one is missing a a value.
    
    LikeLike
  3. Riki Saito says:
    
    December 30, 2017 at 9:06 pm
    
    Another trick I like to use is:
    
    write.csv(reviews_all, file.choose(new = T))
    
    This will open an interactive window and will prompt you to select the folder you want to save the file in and create a file name.
    
    LikeLike
jeffrey t monroe says:

January 9, 2017 at 1:04 pm

Not sure why, but this code works inconsistently. Curious if maybe it’s a server issue.

LikeLiked by 2 people

1. Riki Saito says:
  
  December 30, 2017 at 9:09 pm
  
  If you are referring to the the web scraping – there is a caveat on pulling large amounts of data as websites typically don’t want people rendering new pages so much and so frequently, – you’ll want to consider throttling your calls (i.e. add a sleep time of a couple seconds in between each page call).
  
  LikeLike
  
  1. Tony says:
    
    May 24, 2018 at 11:14 am
    
    This isn’t working for me either on any product I’ve tried.
    
    Error in data.frame(title, author, date, ver.purchase, format, stars, :
    arguments imply differing number of rows: 10, 0
    
    LikeLike
Ashay Jain says:

June 28, 2017 at 5:54 am

This is really cool. Works fine for me. I was searching for this from a long time.

LikeLiked by 1 person

Diggy says:

March 20, 2018 at 3:35 pm

Hello Riki, thanks for the article. I have a question, I’ve seen this website which says that anonymizes your data https://proxycrawl.com how would you use it for amazon following your tutorial? Thanks

LikeLike

vlarun@ymail.com says:

July 26, 2018 at 3:16 pm

This code is not working with the error
Error in data.frame(stars, comments, helpful, stringsAsFactors = F) :
arguments imply differing number of rows: 10, 0

LikeLike

1. Giacomo says:
  
  August 23, 2018 at 1:01 pm
  
  Did you find a solution?
  
  LikeLike
  
2. Faiza says:
  
  January 28, 2019 at 5:45 am
  
  me too i have the same error , did u find solution????
  
  LikeLike
  
Anj A says:

February 28, 2019 at 5:09 am

I am getting the same error as many commenters – I have included it below with traceback:

Error in data.frame(title, author, date, ver.purchase, format, stars, :
arguments imply differing number of rows: 8, 0

3. stop(gettextf(“arguments imply differing number of rows: %s”,
paste(unique(nrows), collapse = “, “)), domain = NA)
2. data.frame(title, author, date, ver.purchase, format, stars,
comments, helpful, stringsAsFactors = F) at amazonscraper.R#57
1. amazon_scraper(doc, reviewer = F, delay = 2)

Any suggestions on how to debug this would be appreciated. Does the function itself have to be modified? Is there a problem with how the data is being appended? I have tried this with multiple different products.

LikeLike

1. Anj A says:
  
  February 28, 2019 at 5:51 am
  
  To follow up on the above – I looked further and found that when the code ran, for “author” and “helpful” the values were not being registered correctly.
  
  author the output resulted in:
  character(0)
  
  helpful the output resulted in:
  numeric(0)
  
  When i commented out the lines of code pulling this data, the code ran fine. However I would like to pull data from the “helpful” field. I am not familiar with CSS selector but I think the code for these two fields has to be modified to pull this data correctly. Can anyone advise me as to how to do that?
  
  LikeLike
  
Pingback: Web Scraping Amazon Reviews (March 2019) – Just R Things
Riki Saito says:

March 4, 2019 at 12:01 am

Hi everyone,

I’m happy to announce that I’ve updated the Amazon web scraping R function in my re-release version of this post. Please visit the link below to find the update:

https://justrthings.com/2019/03/03/web-scraping-amazon-reviews-march-2019/

LikeLike

Emma says:

April 26, 2022 at 9:39 am

Amazing article, worth reading.I was searching for info like this for a long time! Keep those posts coming! Retailgators helps you scrape data from all websites like Amazon. It’s specially designed to make data scraping a totally painless exercise.

Know more here: Amazon Product Data Scraper

LikeLike

	Emma on Web Scraping and Sentiment Ana…
	retailgators9315 on Web Scraping Amazon Reviews in…
	Web Scraping Amazon… on Web Scraping Amazon Reviews in…
	iWeb Scraping Servic… on Web Scraping Amazon Reviews in…
	alexkommy on Web Scraping Amazon Reviews in…

Just R Things

Topics in Data Science with R (and sometimes Python)

Web Scraping and Sentiment Analysis of Amazon Reviews

How to Scrape the Web in R

Sentiment Analysis in R

18 thoughts on “Web Scraping and Sentiment Analysis of Amazon Reviews”

Leave a comment Cancel reply

How to Scrape the Web in R

Sentiment Analysis in R

Share this:

18 thoughts on “Web Scraping and Sentiment Analysis of Amazon Reviews”

Leave a comment Cancel reply