Sentiment analysis is a useful tool for gathering information on emotions relayed through text. It relies on text analysis systems to interpret the polarity of the opinions expressed in text — positive, negative or neutral — as well as emotions or interest level. Sentiment analysis programs optimize data comprehension by automatically iterating through text and providing a general conclusion of the sentiment behind a piece of text.

A typical use of sentiment analysis is to collect information on customer satisfaction. Thousands of words can be used in a written review of something, so text processing with sentiment analysis is an efficient way to aggregate relevant information about a product to determine the overall opinion of it.

This tutorial will teach you how to code an effective sentiment analysis program that looks at the data from customer reviews of Amazon products. In Part 1, you will convert data formats to organize the data and make it easier to work with. Part 2 is the sentiment analysis portion, where you will create word clouds that show the most commonly used words in the reviews of a product. Lastly, you will check the performance of your model through training and testing, and produce a confusion matrix that visualizes your model’s accuracy.

By the end of the tutorial, you should be able to:

  • Import and properly call several programs to use in your code
  • Convert data from a json gzip format into a more readable chart-style dataframe from Pandas
  • Use the asin, or Amazon standard identification number, to decipher how many reviews correlate to a product
  • Use the “wordcloud” Python library to create diagrams showing common words in reviews
  • Train and test your model by splitting up your data
  • Use scikit-learn to perform logistic regression, vectorization, and other steps to test the model’s accuracy

Let’s get started!


Getting Started

For this tutorial, you will need to download this dataset. All of our outcomes will be drawn from this dataset, which holds reviews for Amazon's most and least reviewed Sports and Outdoor products.

First, import the necessary Python libraries and packages.

import gzip
import itertools
import string
import wordcloud
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import pylab as pl
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from collections import Counter
from sklearn import svm
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

%matplotlib inline

Part 1: EDA and Preprocessing

The dataset you downloaded comes as a gzip-compressed JSON file, so we first have to convert it into something more visual and easier to extract from. gzip is a file compression format; each line of the decompressed file holds one review record. A Pandas dataframe is a great alternative, since it organizes information into rows and columns, making it easier to read and extract data from.

We’re going to create some of our own functions to help ourselves out in the process.

  • parse_gz opens a .gz formatted file using the gzip library so we can start reading the data.
  • convert_to_DF builds a Pandas dataframe from the parsed records, one row per review.

#Code provided via

def parse_gz(path):
    g =, 'rb')
    for l in g:
        yield eval(l)

def convert_to_DF(path):
    i = 0
    df = {}
    for d in parse_gz(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

Next, we convert the dataset to a dataframe. The dataset should be downloaded into the same folder as this code file:

sports_outdoors = convert_to_DF('reviews_Sports_and_Outdoors_5.json.gz')

If you want to see how many reviews are in this dataframe:

print('Dataset size: {:,} reviews'.format(len(sports_outdoors)))

To print out the first 3 results of the dataframe:

sports_outdoors.head(3)
# note: replace the number for more or fewer results

We can see that the review time is currently stored as plain text. To work with it as a real date, we can use Pandas’ .to_datetime function:

sports_outdoors["reviewTime"] = pd.to_datetime(sports_outdoors["reviewTime"])

We also want to organize the columns according to their relevance:

sports_outdoors = sports_outdoors[['asin', 'summary', 'reviewText', 'overall', 'reviewerID', 'reviewerName', 'helpful', 'reviewTime', 'unixReviewTime']]

Note: asins (Amazon Standard Identification Numbers) are unique to each product; we’ll use them later to count the number of reviews for each product.

Let's check the first 3 results again to see our changes:

sports_outdoors.head(3)

We can see the columns are now ordered by the relevance we set above, and the reviewTime column now holds proper datetimes. We can also check out the last 3 results:

sports_outdoors.tail(3)
Finding the Number of Reviews of Unique Products [asin]

Continuing to use the Pandas library to explore our dataframe, we're going to get the number of unique items that are in this dataframe.

products = sports_outdoors['overall'].groupby(sports_outdoors['asin']).count()
print("Number of Unique Products in the Sports & Outdoors Category = {}".format(products.count()))
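Conceptually, the groupby/count above is just tallying how many reviews share each asin. The same idea sketched with collections.Counter (which we imported earlier) on a handful of asins, where the third asin is made up for illustration:

```python
from collections import Counter

# Three reviews of one product, one review each of two others
asins = ["B001HBHNHE", "B0000XXXXX", "B001HBHNHE", "B003Z6HUZE", "B001HBHNHE"]
review_counts = Counter(asins)

# Number of unique products = number of distinct asins
unique_products = len(review_counts)
most_reviewed, count = review_counts.most_common(1)[0]
```

Pandas does the same bookkeeping for us, at scale, with `groupby('asin').count()`.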

Most and Least Reviewed Products

First, we’re going to sort our dataframe so that it’s ordered from most reviewed to least reviewed.

sorted_products = products.sort_values(ascending=False)

Let’s take a look at the top 20 most reviewed products. From the sorted_products series, we’ll print the first 20 items, one per line, and then print how many reviews the most reviewed product has.

print("Top 20 Reviewed Products:\n")
print(sorted_products[:20], end='\n\n')
print('Most Reviewed Product, B001HBHNHE - has {} reviews'.format(sorted_products.iloc[0]))

If you save your code and run it, you should see this printed:

Turns out, the most reviewed product is this 9mm Pistol Magazine Loader.


Now let’s get the least reviewed product. The code is pretty much the same. Return to your code. You can comment out the code that prints the top 20 products but keep the sorting element.

print("Bottom 20 Reviewed Products:\n")
print(sorted_products[18337:], end='\n\n')
print('Least Reviewed Product (Sorted), B003Z6HUZE - has {} reviews'.format(sorted_products.iloc[-1]))

Save and run.

This will return in the same format as the most reviewed products. The least reviewed product is a 2-Position Web Nylon Knife Sheath.


Now, we have to process our dataset before modeling. Below are our top 11 results. Here, we have punctuation and words that might not matter. Let’s clean it up!


We’ll start with stopwords. Stopwords are common words that carry little meaning on their own (e.g. “the,” “a,” “and”). These words show up so often that they would drown out the words we actually care about if they weren’t taken out. Luckily, NLTK (the Natural Language Toolkit) ships stopword lists that help us remove all of them! Many languages are supported, but we’re going to use English, since the reviews are mainly in English.

stops = stopwords.words('english')
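To see what the filtering accomplishes, here is a quick sketch using a tiny hand-picked stopword set standing in for NLTK’s full English list (which has well over a hundred entries):

```python
# Tiny stand-in stopword set; NLTK's stopwords.words('english') is much larger
toy_stops = {"the", "a", "and", "is"}

review = "the strap is sturdy and the price is great"
kept = [w for w in review.split() if w not in toy_stops]
# Only the meaningful words survive the filter
```

The filler words vanish, leaving the words that actually carry sentiment.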

Tokenizing means splitting a large piece of text into smaller parts (tokens), usually words. Lemmatizing means reducing the different inflected forms of a word (e.g. “-ed,” “-ing” endings) to a single base form, so they all fall under one word. This is important because we don’t want near-duplicate words to take up critical space where other words could be.
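As a quick illustration of both steps, here is a sketch with naive whitespace tokenization and a made-up mini lemma dictionary standing in for WordNet:

```python
import string

text = "The dogs are running."
tokens = text.split()  # naive whitespace tokenization
# Strip punctuation characters from each token
no_punc = ["".join(ch for ch in t if ch not in string.punctuation) for t in tokens]
# Hypothetical lookup table; WordNetLemmatizer does this properly for real text
toy_lemmas = {"dogs": "dog", "running": "run"}
lemmas = [toy_lemmas.get(t.lower(), t.lower()) for t in no_punc]
```

NLTK’s word_tokenize and WordNetLemmatizer handle the many edge cases this sketch ignores.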

The 2 functions below will help us do this.

def tokenize(text):
    tokenized = word_tokenize(text)
    no_punc = []
    for token in tokenized:
        word = "".join(char for char in token if char not in string.punctuation)
        if word and word.lower() not in stops:  # drop empty strings and stopwords
            no_punc.append(word)
    tokens = lemmatize(no_punc)
    return tokens

def lemmatize(tokens):
    lmtzr = WordNetLemmatizer()
    lemma = [lmtzr.lemmatize(t) for t in tokens]
    return lemma

Now we’re going to apply these changes to finish cleaning our dataset.

reviews = sports_outdoors['reviewText'].apply(lambda x: tokenize(x))

You can see that the reviews have now been split into a list form. Using a list will give us easy access when modeling.

Part 2: Modeling

Classification / Sentiment Analysis (LogReg, Multinomial)

We’re going to make a word cloud with the top words that appear in the written reviews. This will show the most commonly used words in all the reviews of a product, so we can determine if the overall sentiment is positive or negative for a product.

Creating a word cloud is made easy thanks to the wordcloud library that we imported at the beginning. background_color sets the color behind the words. max_font_size sets the largest font size, used for the most common word, and relative_scaling controls how much the font size shrinks as a word’s count goes down. This is how we can tell visually how often a word is used in comparison to others.

There are many more features included that you’re welcome to explore!

cloud = wordcloud.WordCloud(background_color='gray', max_font_size=60, relative_scaling=1).generate(' '.join(sports_outdoors.reviewText))
fig = plt.figure(figsize=(20, 10))
plt.imshow(cloud)
plt.axis('off')
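Under the hood, a word cloud is a visualization of word frequencies. The counts it draws from can be sketched with collections.Counter on a made-up review snippet:

```python
from collections import Counter

# Hypothetical review text, for illustration only
text = "great fit great price poor stitching great fit"
freqs = Counter(text.split())
top_two = freqs.most_common(2)  # the words the cloud would render largest
```

The most frequent words get the biggest font sizes; relative_scaling governs how quickly the sizes shrink down the ranking.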

Let’s print the top 3 results again:

sports_outdoors.head(3)

We also want to label each review’s sentiment using its star rating. We will do this by assuming that negative reviews are rated 1-3 stars, which we will label as 0, and that positive reviews are rated 4-5 stars, labeled as 1. These labels (0 or 1) will go in a new column that we will add, so it’s clear which reviews have a positive star value and which have a negative one.

sports_outdoors['pos_neg'] = [1 if x > 3 else 0 for x in sports_outdoors.overall]

Run the result.
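The labeling rule is easy to sanity-check on a few toy star ratings:

```python
# Hypothetical overall ratings: 4-5 stars map to 1 (positive), 1-3 stars to 0 (negative)
ratings = [5.0, 2.0, 4.0, 1.0, 3.0]
labels = [1 if x > 3 else 0 for x in ratings]
```

Only the 4- and 5-star ratings come out labeled 1, exactly as the new pos_neg column should.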


review_text = sports_outdoors["reviewText"]

Using both methods to determine a review’s rating, we can check to see if the star value corresponds to the sentiment in the text reviews for products. This will show us how accurately the stars represent the actual opinion of a product.

Train/Test Split

Training data is used to help the machine learn with a dataset, while test data will be used to test the accuracy of the model. Training is how the machine learns how to work with the data, and also gives us an opportunity to check for holes in the program. Testing is where we put our model to work with new data, and check how well it works.

We want to keep the data used for training and testing separate. Since the training data was used to train the model, if it were also used in testing, the testing process would be redundant. The model will also appear more accurate than it actually is if the data is reused, since it will easily predict the results for the data it was trained with. The model learned from the training data, so using it again will inaccurately reflect a ‘smarter’ system.

x_train, x_test, y_train, y_test = train_test_split(sports_outdoors.reviewText, sports_outdoors.pos_neg, random_state=0)
print("x_train shape: {}".format(x_train.shape), end='\n')
print("y_train shape: {}".format(y_train.shape), end='\n\n')
print("x_test shape: {}".format(x_test.shape), end='\n')
print("y_test shape: {}".format(y_test.shape), end='\n\n')

Run result: x_train shape: (222252,)

y_train shape: (222252,)

x_test shape: (74085,)

y_test shape: (74085,)

Logistic Regression

We are going to be using scikit-learn for logistic regression. Logistic regression will compute the probability of an event’s occurrence. In our case, we want to look at the occurrence of words in the reviews for each product.
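At its core, logistic regression passes a weighted sum of the input features through the logistic (sigmoid) function, which squashes any real-valued score into a probability between 0 and 1. A minimal sketch:

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# A score of 0 is maximally uncertain; large positive/negative scores
# push the probability toward 1 or 0
p_neutral = sigmoid(0.0)
p_positive = sigmoid(4.0)
p_negative = sigmoid(-4.0)
```

scikit-learn learns the feature weights that produce those scores; predictions above 0.5 become class 1 (positive reviews here), the rest class 0.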


Vectorization is the process of turning text into a numeric representation. CountVectorizer will tokenize our training data and count the occurrences of each word. fit() is the step that learns the vocabulary from the training text; transform() then maps each review onto counts over that vocabulary.
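What CountVectorizer builds can be sketched by hand on a toy corpus: collect a sorted vocabulary (the “fit” step), then count each vocabulary word per document (the “transform” step):

```python
from collections import Counter

docs = ["great product great value", "bad product"]  # toy corpus

# "fit": collect the sorted vocabulary across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

# "transform": one count-vector per document, one column per vocab word
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]
```

The real vectorizer additionally stores the result as a sparse matrix and, with min_df=5, drops terms that appear in fewer than 5 reviews.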

#Vectorize X_train
vectorizer = CountVectorizer(min_df=5).fit(x_train)
X_train = vectorizer.transform(x_train)

Run result:

X_train:
<222252x28733 sparse matrix of type '<class 'numpy.int64'>'
	with 12428687 stored elements in Compressed Sparse Row format>

The number of features is how many distinct terms were kept from the original (raw) documents. get_feature_names returns the sorted list of those terms, so the number of features is simply the length of that list. (In newer scikit-learn versions this method is called get_feature_names_out.)

feature_names = vectorizer.get_feature_names()
print("Number of features: {}".format(len(feature_names)))

Run result: Number of features: 28733

Training Data

We also want to test our model’s accuracy, so we can make sure that we can trust it. cross_val_score gives an estimate of the accuracy of our model using only the training data.
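With cv=5, the training data is cut into 5 folds; each fold takes one turn as the validation set while the other 4 train the model, and the 5 accuracies are averaged. A sketch of how contiguous fold boundaries fall (scikit-learn also handles uneven remainders and optional shuffling):

```python
def fold_bounds(n_samples, k):
    """Return (start, stop) index pairs for k equal contiguous validation folds."""
    size = n_samples // k
    return [(i * size, (i + 1) * size) for i in range(k)]

# 10 samples, 5 folds: each fold holds out 2 samples for validation
bounds = fold_bounds(10, 5)
```

Averaging over folds gives a more stable accuracy estimate than a single split would.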

scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.3f}".format(np.mean(scores)))

Run result: Mean cross-validation accuracy: 0.888

Our accuracy is 0.888. That means the model will be pretty reliable, which is good news!

We’re also going to do accuracy estimations on our testing data.

logreg = LogisticRegression(C=0.1).fit(X_train, y_train)
X_test = vectorizer.transform(x_test)
log_y_pred = logreg.predict(X_test)
logreg_score = accuracy_score(y_test, log_y_pred)
print("Accuracy: {:.3f}".format(logreg_score))

Run result: Accuracy:   0.890

print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))

Run result: Training set score: 0.907

Test set score: 0.890

A confusion matrix is a visualized summary of the predicted results. Most importantly, it can tell us the type of errors that are being made. This is your standard confusion matrix:

  • True Positive (TP) : Observation is positive, and was predicted to be positive.
  • False Negative (FN) : Observation is positive, but was predicted to be negative.
  • True Negative (TN) : Observation is negative, and was predicted to be negative.
  • False Positive (FP) : Observation is negative, but was predicted to be positive.
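These four counts can be computed directly from paired true/predicted labels. On a few toy labels (1 = positive, 0 = negative):

```python
# Hypothetical labels, for illustration only
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = pairs.count((1, 1))  # positive, predicted positive
fn = pairs.count((1, 0))  # positive, predicted negative
tn = pairs.count((0, 0))  # negative, predicted negative
fp = pairs.count((0, 1))  # negative, predicted positive
```

scikit-learn’s confusion_matrix tallies exactly these counts and arranges them in the [[TN, FP], [FN, TP]] layout shown below.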

log_cfm = confusion_matrix(y_test, log_y_pred)
print("Confusion matrix:")
print(log_cfm, end='\n\n')
print(np.array([['TN', 'FP'],[ 'FN' , 'TP']]))

Run the result.

Confusion matrix:

[[ 4645  6289]
 [ 1831 61320]]

[['TN' 'FP']
 ['FN' 'TP']]

We will now plot our data as an actual matrix. We’re going to use matplotlib.pyplot (plt), a library that makes plotting simple. imshow() from matplotlib creates the 2D image you see in the results: the grid has one cell per element of the matrix, and each cell’s background color depends on where its number falls on the color scale used by imshow().

plt.imshow(log_cfm, interpolation='nearest')
for i, j in itertools.product(range(log_cfm.shape[0]), range(log_cfm.shape[1])):
    plt.text(j, i, log_cfm[i, j], horizontalalignment="center", color="white")
plt.ylabel('True label (Recall)')
plt.xlabel('Predicted label (Precision)')
plt.title('Logistic Reg | Confusion Matrix')

Run the result.

The F1 score measures a test’s accuracy as the harmonic mean of precision and recall, which makes it especially useful in information retrieval. The F1 score is best at 1 and worst at 0.
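Plugging the counts from our confusion matrix (TP = 61320, FP = 6289, FN = 1831) into the precision, recall, and F1 formulas reproduces the score f1_score reports:

```python
# Counts read off the logistic regression confusion matrix above
tp, fp, fn = 61320, 6289, 1831

precision = tp / (tp + fp)                        # of predicted positives, how many were right
recall = tp / (tp + fn)                           # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
# round(f1, 3) → 0.938
```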

log_f1 = f1_score(y_test, log_y_pred)
print("Logistic Reg - F1 score: {:.3f}".format(log_f1))

Run the result.

Logistic Reg - F1 score: 0.938

Our F1 score is pretty close to 1. This means our model is accurate. Success!


In this tutorial, we walked through how to format usable data for analysis, how to efficiently gather the sentiments behind text reviews, and the proper way to train and test a model in order to determine the accuracy of its results. With the help of sentiment analysis, we can optimize the way we understand opinion and emotion in text, and draw valuable conclusions about the products discussed in written reviews.

This method of customer review analysis can be very useful in the real world. Paired with additional data, sentiment analysis can give insight for marketing strategies and business models. For example, people’s shopping patterns can be collected and compared to the sentiment analysis, illuminating how trustworthy or valuable certain reviews are. The sentiments can also be checked against the star ratings, like we did in this tutorial, to see how accurately the star values represent what the customers are saying.

Sentiment analysis offers an innovative intersection of programming and psychology, where we can efficiently analyze customer and reviewer needs and opinions through code. Rather than gathering customer surveys or using other feedback methods, we can use computer systems to detect how a group perceives something, and quickly make necessary changes to promote satisfaction. Thus, sentiment analysis programs are very powerful tools in data analysis and marketing.