## Description

Naive Bayes (Multinomial) for sentiment analysis

In this assignment you will implement the Naive Bayes (Multinomial) classifier for Sentiment Analysis of an IMDB movie review dataset (a highly polarized dataset with 50000 movie reviews). The primary task is to classify the reviews into negative and positive.

More about the dataset: http://ai.stanford.edu/~amaas/data/sentiment/

For the Multinomial model, each document is represented by a vector of integer-valued variables, i.e., $x = (x_1, x_2, \dots, x_{|V|})^T$, where each variable $x_i$ corresponds to the $i$-th word in a vocabulary $V$ and represents the number of times it appears in the document. The probability of observing a document $x$ given its class label $y$ is defined as (for example, for $y = 1$):

$$p(x \mid y = 1) = \prod_{i=1}^{|V|} P(w_i \mid y = 1)^{x_i}$$

Here we assume that, given the class label $y$, each word in the document follows a multinomial distribution over $|V|$ outcomes, and $P(w_i \mid y = 1)$ is the probability that a randomly selected word is word $i$ for a document of the positive class. Note that $\sum_{i=1}^{|V|} P(w_i \mid y) = 1$ for $y = 0$ and $y = 1$. Your implementation needs to estimate $p(y)$ and $P(w_i \mid y)$ for $i = 1, \dots, |V|$ and $y = 1, 0$. For $p(y)$, you can use the MLE. For $P(w_i \mid y)$, you MUST use Laplace smoothing. One useful thing to note is that the probability of observing a document given its class label, i.e., $p(x \mid y)$, can and will become extremely small because it is the product of many probabilities. As a result, you will run into underflow issues. To avoid this problem, your implementation should operate on the logs of the probabilities.
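For reference, the estimates described above are commonly written as follows. The notation here is an assumption and does not come from the handout: $N$ is the number of training documents, $N_c$ the number with label $c$, $n_{i,c}$ the total count of word $w_i$ over the training documents of class $c$, and $\alpha > 0$ the smoothing strength ($\alpha = 1$ is classic Laplace smoothing).

$$\hat{p}(y = c) = \frac{N_c}{N}, \qquad \hat{P}(w_i \mid y = c) = \frac{n_{i,c} + \alpha}{\sum_{j=1}^{|V|} n_{j,c} + \alpha |V|}$$

Taking logs turns the product over words into a sum, so a document $x$ can be scored without underflow and assigned to the class with the larger score:

$$s_c(x) = \log \hat{p}(y = c) + \sum_{i=1}^{|V|} x_i \log \hat{P}(w_i \mid y = c), \qquad \hat{y} = \arg\max_{c \in \{0, 1\}} s_c(x)$$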

### 1 Description of the dataset

The provided data set comes in two parts:

- IMDB.csv: This contains a single column called Reviews, where each row contains a movie review. There are a total of 50K rows. The first 30K rows should be used as your training set (to train your model). The next 10K should be used as the validation set (use this for parameter tuning). The last 10K rows should be used as the test set (predict the labels).
- IMDB labels.csv: This contains 40K labels. Please use the first 30K labels for the training data and the last 10K labels for the validation data. The labels for the test data are not provided; we will use them to evaluate your predicted labels.
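The split described above can be sketched as a small helper. This is only an illustration: `split_dataset` and its argument names are hypothetical, and it assumes the reviews and labels have already been loaded, in file order, into sequences that support slicing (lists, NumPy arrays, or pandas Series).

```python
from typing import Sequence, Tuple

N_TRAIN, N_VAL = 30_000, 10_000  # sizes stated in the handout


def split_dataset(reviews: Sequence, labels: Sequence,
                  n_train: int = N_TRAIN, n_val: int = N_VAL) -> Tuple:
    """Return (train, val, test) following the row order of the files."""
    if len(labels) != n_train + n_val:
        raise ValueError("expected one label per training and validation row")
    train = (reviews[:n_train], labels[:n_train])
    val = (reviews[n_train:n_train + n_val], labels[n_train:])
    test = reviews[n_train + n_val:]  # no labels are provided for the test rows
    return train, val, test
```

Keeping the sizes as parameters makes it easy to try the helper on a handful of rows before running on the full 50K reviews.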

### 2 Data cleaning and generating BOW representation

Data Cleaning. Pre-processing is needed to make the texts cleaner and easier to process. The Reviews column contains comments provided by users about the movie. Such "dirty text" requires further cleaning. Typical cleaning steps include a) removing HTML tags; b) removing special characters; c) converting text to lower case; d) replacing punctuation characters with spaces; and e) removing stopwords (e.g., articles and pronouns) from consideration. You will not need to implement these functionalities; we will provide starter code containing these functions for you to use.
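To make the cleaning steps concrete, here is a minimal sketch of what such a function might look like. It is not the provided starter code: `clean_review` and the tiny `STOPWORDS` set are hypothetical, and the starter functions may behave differently.

```python
import re

# Tiny illustrative stopword list (an assumption); a real list would be much longer.
STOPWORDS = {"a", "an", "the", "i", "he", "she", "it", "they", "this", "that"}


def clean_review(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags such as <br />
    text = text.lower()                         # convert to lower case
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # punctuation and special characters become spaces
    tokens = [t for t in text.split() if t not in STOPWORDS]  # remove stopwords
    return " ".join(tokens)
```

Lower-casing before the character filter lets the pattern list only `a-z`; the order of the other steps matters less.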

Generating BOW representation. To transform variable-length reviews into fixed-length vectors, we use the Bag of Words technique. It uses a list of words called the "vocabulary", so that given an input text we can output a vector of word counts, one for each word in the vocabulary. You can use the CountVectorizer functionality from sklearn to go over the full 50K reviews to generate the vocabulary and create the feature vectors representing each review. Note that CountVectorizer has several tunable parameters that can directly impact the resulting feature representation. These include max_features, which specifies the maximum number of features (keeping the terms with the highest frequency), and max_df and min_df, which filter a word out of the vocabulary if its document frequency is too high (> max_df) or too low (< min_df), respectively.
### 3 What you need to do
1. (10 pts) Apply the data cleaning and feature generation steps described above to generate the BOW
representation for all 50k reviews. For this step, use the default values for max_df and min_df
and set max_features = 2000.
2. (20 pts) Train a multinomial Naive Bayes classifier with Laplace smoothing (α = 1) on the training
set. This involves learning $P(y = 1)$, $P(y = 0)$, $P(w_i \mid y = 1)$ for $i = 1, \dots, |V|$, and
$P(w_i \mid y = 0)$ for $i = 1, \dots, |V|$ from the training data (the first 30k reviews and their associated labels).
3. (20 pts) Apply the learned Naive Bayes model to the validation set (the next 10k reviews) and
report the validation accuracy of your model. Apply the same model to the test data and
output the predictions to a file containing a single column of 10k labels (0 for negative, 1 for
positive). Please name the file test-prediction1.csv.
4. (20 pts) Tuning the smoothing parameter α. Train the Naive Bayes classifier with different values of α
between 0 and 2 (incrementing by 0.2). For each α value, apply the resulting model to the validation
data to make predictions and measure the prediction accuracy. Report the results in a plot
with the value of α on the x-axis and the validation accuracy on the y-axis. Comment on how the
validation accuracy changes as α changes and provide a short explanation for your observation. Identify
the best α value based on your experiments, apply the corresponding model to the test data, and
output the predictions to a file named test-prediction2.csv.
5. (20 pts) Tune your heart out. For the last part, you are required to tune the parameters of
CountVectorizer (max_features, max_df, and min_df). You can freely choose the range of values
you wish¹ to test for these parameters and use the validation set to select the best model. Please
describe your strategy for choosing the value ranges and report the best parameters (as measured by
the prediction accuracy on the validation set) and the resulting model's validation accuracy. You
are also required to apply the chosen best model to make predictions for the test data and
output the predictions to a file named test-prediction3.csv.

¹ You are encouraged to try your best to tune these parameters. Higher validation accuracy and testing accuracy will be
rewarded with possible bonus points.
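Steps 2 to 4 above can be sketched roughly as follows. This is one possible shape for the code, not a reference solution. The function names are hypothetical, `X` is assumed to be a matrix of word counts with one row per review, and `y` is assumed to be a NumPy array of 0/1 labels. The sketch rejects α ≤ 0, because with α = 0 any word unseen in a class gets probability zero and its log is not finite; how to treat that endpoint of the sweep is left open here.

```python
import numpy as np


def train_nb(X, y, alpha=1.0):
    """Fit multinomial Naive Bayes on count features X (n_docs, |V|) and 0/1 labels y."""
    if alpha <= 0:
        raise ValueError("this sketch needs alpha > 0 to keep every log finite")
    y = np.asarray(y)
    log_prior = np.log(np.array([(y == c).mean() for c in (0, 1)]))  # MLE of p(y)
    # Total count of each word within each class, shape (2, |V|).
    counts = np.vstack([np.asarray(X[y == c].sum(axis=0)).ravel() for c in (0, 1)])
    smoothed = counts + alpha  # Laplace smoothing
    log_lik = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
    return log_prior, log_lik


def predict_nb(X, log_prior, log_lik):
    """Return the class (0 or 1) with the larger log-score for each row of X."""
    scores = X @ log_lik.T + log_prior  # log p(y) + sum_i x_i log P(w_i | y)
    return np.asarray(scores).argmax(axis=1)


def accuracy(y_true, y_pred):
    return float((np.asarray(y_true) == np.asarray(y_pred)).mean())


def sweep_alpha(X_train, y_train, X_val, y_val, alphas):
    """Validation accuracy for each smoothing value in alphas."""
    results = {}
    for alpha in alphas:
        log_prior, log_lik = train_nb(X_train, y_train, alpha)
        results[float(alpha)] = accuracy(y_val, predict_nb(X_val, log_prior, log_lik))
    return results
```

The dictionary returned by `sweep_alpha` maps each α to its validation accuracy, which is the data needed for the plot in step 4. Calling `np.savetxt` with `fmt="%d"` is one way to write one predicted label per line; the handout does not say whether a header row is expected.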