Note: Since the code in this post is outdated, as of 3/4/2019 a new post on Scraping Amazon and Sentiment Analysis (along with posts on other NLP topics such as Word Embedding and Topic Modeling) is available through the links!

How to Scrape the Web in R

Most things on the web can actually be scraped. By selecting certain elements or paths of a given webpage and extracting the parts of interest (also known as parsing), we are able to obtain data. A simple example of web scraping in R can be found in this awesome blog post on R-bloggers.

We will use Amazon as an example in this post. Let’s say we have the ASIN code of a product, B0043WCH66, and want to scrape its product name from Amazon. The URLs of Amazon’s product pages are easy to build: simply concatenate the ASIN code to the “base” URL, like so: https://www.amazon.com/dp/B0043WCH66.

[Image: product_page.PNG — the Amazon product page for this ASIN]

We build the URL and point to a specific node, #productTitle, of the HTML page using its CSS selector (read about CSS selectors and how to obtain them using SelectorGadget here). Finally, we clean and parse the text to obtain just the product name:

pacman::p_load(XML, dplyr, stringr, rvest, audio)

#Remove leading and trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

prod_code <- "B0043WCH66"
url <- paste0("https://www.amazon.com/dp/", prod_code)
doc <- read_html(url)

#obtain the text in the node, remove "\n" from the text, and remove white space
prod <- html_nodes(doc, "#productTitle") %>% html_text() %>% gsub("\n", "", .) %>% trim()
prod
## [1] "Bose® MIE2i Mobile Headset"

With this simple code, we were able to obtain the product name for this ASIN.
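The same pattern scales to more than one product. Here is a minimal sketch that simply wraps the steps above in a reusable function (the politeness delay is my own addition, not part of the original code):

#Sketch: wrap the title-scraping steps above into a reusable function (reuses trim())
get_product_title <- function(asin, delay = 2) {
  url <- paste0("https://www.amazon.com/dp/", asin)
  doc <- read_html(url)
  Sys.sleep(delay)  #pause between requests to be polite to the server
  html_nodes(doc, "#productTitle") %>% html_text() %>% gsub("\n", "", .) %>% trim()
}

get_product_title("B0043WCH66")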

Now say we want to scrape more data about the product, the Bose® MIE2i Mobile Headset. We will use a function, amazonscraper (available on my GitHub), to pull the first 10 pages of reviews:

#Source function to parse Amazon HTML pages for data
source("https://raw.githubusercontent.com/rjsaito/Just-R-Things/master/Text%20Mining/amazonscraper.R")

pages <- 10

#Loop over the review pages, scrape each, and append to one data frame
reviews_all <- NULL
for(page_num in 1:pages){
  url <- paste0("https://www.amazon.com/product-reviews/", prod_code, "/?pageNumber=", page_num)
  doc <- read_html(url)

  reviews <- amazon_scraper(doc, reviewer = F, delay = 2)
  reviews_all <- rbind(reviews_all, cbind(prod, reviews))
}
str(reviews_all)
## 'data.frame':    100 obs. of  9 variables:
##  $ prod        : chr  "Bose® MIE2i Mobile Headset" "Bose® MIE2i Mobile Headset" "Bose® MIE2i Mobile Headset" "Bose® MIE2i Mobile Headset" ...
##  $ title       : chr  "Clarity, Comfort, Construction" "Awesome!" "Bose's signature sound works in its advantage" "Very disappointed as the wires are not lasting." ...
##  $ author      : chr  "Richard Blumberg" "James R. Spitznas" "John Barta" "Rick Gillis" ...
##  $ date        : chr  "November 20, 2010" "December 28, 2010" "June 29, 2011" "September 6, 2012" ...
##  $ ver.purchase: int  1 0 0 1 1 0 1 0 1 0 ...
##  $ format      : chr  "Package Type: Standard Packaging" "Package Type: Standard Packaging" "Package Type: Standard Packaging" "Package Type: Standard Packaging" ...
##  $ stars       : int  5 4 4 2 1 1 1 1 5 1 ...
##  $ comments    : chr  "These are the sixth or seventh set of earbuds I've had for a succession of iPods and iPhones. I'm slightly hard of hearing, and"| __truncated__ "I purchased these to replace the Sennheiser's that I had been using at the gym, cycling and skiing.  BTW I will never buy anoth"| __truncated__ "When you take 100 dollar earbuds the biggest hurdle is the technology. No matter how accurate the earbud WANTS to be, its going"| __truncated__ "While I would normally boast of the Bose name, this time I am let down. I have had these earphones for just over a year, and I "| __truncated__ ...
##  $ helpful     : int  577 63 166 18 59 14 11 177 93 23 ...

With amazonscraper, we obtained several fields for each of the first 100 reviews of the product.
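As a quick sanity check on the scraped data, a sketch like the one below (using the dplyr verbs we already loaded) tallies the reviews by star rating:

#Sketch: count reviews and sum helpful votes by star rating
reviews_all %>%
  group_by(stars) %>%
  summarise(n_reviews = n(), total_helpful = sum(helpful))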

Sentiment Analysis in R

Now that we have all this data, what can we do with it? Sure, we could read through all these reviews to see what people are saying about this product and how they feel about it, but that doesn’t seem like a good use of time. That’s where Sentiment Analysis comes in handy.

Sentiment Analysis is a Natural Language Processing method that allows us to obtain the general sentiment or “feeling” of some text. Sure, we could just look at the star ratings themselves, but star ratings are not always consistent with the sentiment of the reviews. Sentiment is measured on a polar scale, with a negative value representing a negative sentiment and a positive value representing a positive sentiment.
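To make the polar scale concrete, here is a toy sketch (the two sentences are made up for illustration) that scores one positive and one negative sentence with sentimentr, which we load in the next code block:

#Toy illustration of polarity: sentiment() returns a signed score per sentence
sentimentr::sentiment(c("These earbuds sound wonderful.",
                        "The wires broke and I am very disappointed."))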

Package ‘sentimentr’ allows for quick, simple, yet elegant sentiment analysis: sentiment is obtained on each sentence within a review and aggregated over the whole review. In this method, sentiment is obtained by identifying tokens (any element that may represent a sentiment, i.e., words, punctuation, symbols) in the text that carry a positive or negative sentiment, and the text is scored based on the number of positive tokens, the number of negative tokens, the length of the text, etc.:

pacman::p_load_gh("trinker/sentimentr")

sent_agg <- with(reviews_all, sentiment_by(comments))
head(sent_agg)
##    element_id word_count        sd ave_sentiment
## 1:          1        670 0.4730184    0.16014619
## 2:          2        373 0.2602493    0.06357467
## 3:          3        724 0.3943997    0.26884453
## 4:          4        164 0.1905274   -0.01444510
## 5:          5        406 0.4812642    0.03993990
## 6:          6        112 0.3986540    0.03549016
par(mfrow=c(1,2))
with(reviews_all, hist(stars))
with(sent_agg, hist(ave_sentiment))

[Plot: side-by-side histograms of star ratings and of average review sentiment]

mean(reviews_all$stars)
## [1] 3.5
mean(sent_agg$ave_sentiment)
## [1] 0.1512848

You can see here that there is a major inconsistency between stars and sentiment, even just by comparing the two distributions. In addition, while the average star rating is 3.5, the sentiment scores are actually distributed around 0 (a near-neutral sentiment).
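One quick way to quantify this inconsistency is to correlate the two measures directly; a one-line sketch (the rows of sent_agg line up with the rows of reviews_all by construction):

#Sketch: correlation between star ratings and review-level sentiment
cor(reviews_all$stars, sent_agg$ave_sentiment)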

Now let’s see how these sentiments are actually being determined at the sentence level. Let’s obtain the reviews with the highest and the lowest sentiment and take a look. The highlight function in sentimentr allows us to do this easily.

best_reviews <- slice(reviews_all, top_n(sent_agg, 3, ave_sentiment)$element_id)
with(best_reviews, sentiment_by(comments)) %>% highlight()

[Image: positive_reviews.PNG — highlighted sentences of the three most positive reviews]

worst_reviews <- slice(reviews_all, top_n(sent_agg, 3, -ave_sentiment)$element_id)
with(worst_reviews, sentiment_by(comments)) %>% highlight()

[Image: negative_reviews.PNG — highlighted sentences of the three most negative reviews]

While the positive reviews contain all positive sentiments, the negative reviews are actually a mix of positive and negative sentences, where the negative significantly outweighs the positive.
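If you want to see exactly which tokens are driving these scores, sentimentr can list them; a minimal sketch applied to the lowest-sentiment reviews:

#Sketch: list the positive and negative tokens found in each of the worst reviews
with(worst_reviews, extract_sentiment_terms(comments))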

While these sentiment scores do not perfectly capture the true sentiments of these reviews, this is a quick and decently accurate way to gauge how these reviewers feel.

This method of sentiment analysis is a simple approach, and there are a number of widely known methods of sentiment analysis (one I am particularly interested in is a machine learning approach) that analyze text by considering sequences of words and the relationships between those sequences (a basic explanation is given in this YouTube video).