# Assignment 5: Naive Bayes Classifier for Text Classification

\$35.00

Category:
Rate this product

UVA CS 4501-03: Machine Learning
1 Naive Bayes Classifier for Text-based Movie Review Classification
In this programming assignment, you are expected to implement different Naive Bayes classifiers in Python for a text-based movie review classification task. A ZIP file "data sets naive bayes.zip" containing two sets of data samples (i.e., a training set and a test set) of movie reviews is provided to you through Collab. We expect you to submit a source-code file named "naiveBayes.py" containing the necessary and required functions for training, testing, and evaluation.
For a naive Bayes classifier, when given an unlabeled document d, the predicted class is

    c_d = argmax_c P(c | d),

where c is the value of the target class variable. For the target movie review classification task, c = pos or c = neg. For example, if P(c = pos | d) = 3/4 and P(c = neg | d) = 1/4, we use the MAP rule to classify the document d into the "positive" class. The conditional probability P(c | d) is calculated through Bayes' rule:

    P(c | d) = P(c) P(d | c) / P(d) ∝ P(c) P(d | c)
This assignment requires you to implement three types of naive Bayes classifiers: the first two follow the multinomial assumption, and the third uses the multivariate Bernoulli assumption.
1.1 Preprocessing
(Q1) You are required to implement the first choice of preprocessing described below.
You can earn 1 point of extra credit if you also implement the second choice of preprocessing and discuss the classification results based on it.
• (First choice): to get a consistent result from all the students, please use a predefined dictionary
including the following words: {love, wonderful, best, great, superb, still, beautiful, bad, worst, stupid,
waste, boring, ?, !, UNK }.
– Besides, for the token "love", you should also count the tokens "loving", "loved", and "loves", since their stemmed versions are the same as the token "love". In other words, the frequency of "love" should include the words "love", "loving", "loves", and "loved". You do not need NLTK for this; a simple string.replace("loved", "love"), string.replace("loves", "love"), string.replace("loving", "love") suffices. No other preprocessing is necessary (e.g., stemming, stopword removal).
– UNK represents unknown words not included in the dictionary.
– In summary, thetaPos and thetaNeg are each vectors of length 15.
• (Second choice): normally, as a first step, you would build a vocabulary of unique words from the training corpus, ranked by frequency. Then you would keep only the top K words that appear more than a certain number of times (e.g., 3 times) in the whole training corpus.
– It is recommended that you use stemming and stopword removal for this (you can use a python
package such as NLTK).
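As a concrete illustration, the first-choice preprocessing can be done with a few string replacements. This is only a sketch: the helper name `preprocess`, the lowercasing, and the splitting of "?" and "!" into their own tokens are illustrative assumptions, not part of the required API.

```python
# Fixed 15-word dictionary from the assignment (UNK is the catch-all slot).
VOCAB = ["love", "wonderful", "best", "great", "superb", "still",
         "beautiful", "bad", "worst", "stupid", "waste", "boring",
         "?", "!", "UNK"]

def preprocess(text):
    """Tokenize a review, folding the 'love' variants into 'love'."""
    text = text.lower()
    # Fold the variants before tokenizing, as the assignment suggests.
    for variant in ("loving", "loved", "loves"):
        text = text.replace(variant, "love")
    # Separate '?' and '!' so they are counted as their own tokens.
    text = text.replace("?", " ? ").replace("!", " ! ")
    return text.split()
```

Note that "loving" is replaced before "loves"; otherwise the substring "love" inside "loving" would be mangled by an earlier replacement of a shorter variant.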
1.2 Build “bag of words” (BOW) Document Representation
• (Q2) You are required to provide the following function to convert a text document into a feature vector:
BOWDj = transfer(fileDj, vocabulary)
where fileDj is the location of file j.
• (Q3) Read the training and test documents into BOW vector representations using the above function. Then store the features in matrices Xtrain and Xtest, and use ytrain and ytest to store the labels. You are required to provide the following function:
Xtrain, Xtest, ytrain, ytest = loadData(textDataSetsDirectoryFullPath)
– "textDataSetsDirectoryFullPath" is the real full path of the file directory that you get from unzipping the data file. For instance, it is "/HW3/data sets/" on the instructor's laptop.
Note: Xtrain and Xtest are matrices with each column representing a document (in BOW vector
format). ytrain and ytest are vectors with a label at each position. These should all be represented
using a python list or numpy matrix.
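A minimal sketch of `transfer` and `loadData` follows. The subdirectory layout assumed here (`training_set`/`test_set` folders, each with `pos` and `neg` subfolders, and +1/−1 labels) is an assumption about how the ZIP file unpacks; adjust it to the actual layout. Documents are stored as columns, as the note above requires.

```python
import os
import numpy as np

VOCAB = ["love", "wonderful", "best", "great", "superb", "still",
         "beautiful", "bad", "worst", "stupid", "waste", "boring",
         "?", "!", "UNK"]

def transfer(fileDj, vocabulary):
    """Convert one document into a BOW count vector over `vocabulary`.

    Words outside the vocabulary are counted under the final UNK slot.
    """
    BOWDj = np.zeros(len(vocabulary))
    with open(fileDj) as f:
        text = f.read().lower()
    for variant in ("loving", "loved", "loves"):
        text = text.replace(variant, "love")
    for token in text.replace("?", " ? ").replace("!", " ! ").split():
        if token in vocabulary:
            BOWDj[vocabulary.index(token)] += 1
        else:
            BOWDj[-1] += 1          # UNK
    return BOWDj

def loadData(textDataSetsDirectoryFullPath):
    """Load both splits; the folder names below are assumed, not given."""
    def load_split(split):
        X, y = [], []
        for label, cls in ((1, "pos"), (-1, "neg")):
            d = os.path.join(textDataSetsDirectoryFullPath, split, cls)
            for name in sorted(os.listdir(d)):
                X.append(transfer(os.path.join(d, name), VOCAB))
                y.append(label)
        return np.array(X).T, np.array(y)   # columns = documents
    Xtrain, ytrain = load_split("training_set")
    Xtest, ytest = load_split("test_set")
    return Xtrain, Xtest, ytrain, ytest
```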
1.3 Multinomial Naive Bayes Classifier (MNBC) Training Step
• We need to learn P(cj) and P(wi | cj) from the training set. Through MLE, we use the relative-frequency estimate with Laplace smoothing to estimate these parameters.
• Since we have the same number of positive and negative samples, P(c = −1) = P(c = 1) = 1/2.
• (Q4) You are required to provide the following function (and module) for grading:
thetaPos, thetaNeg = naiveBayesMulFeature_train(Xtrain, ytrain)
• (Q5) Provide the resulting values of thetaPos and thetaNeg in the writeup.
Note: Pay attention to the MLE estimator plus smoothing; here we choose α = 1.
Note: thetaPos and thetaNeg should be Python lists or numpy arrays (both 1-D vectors).
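The training step can be sketched as below, assuming documents are stored as columns of Xtrain and labels are +1/−1 as in the priors above. This is one possible implementation of the required signature, not a reference solution.

```python
import numpy as np

def naiveBayesMulFeature_train(Xtrain, ytrain):
    """Multinomial NB training: relative frequency with Laplace smoothing.

    Xtrain: one document per column, one vocabulary word per row.
    Returns per-class word probabilities P(wi | c), each summing to 1.
    """
    Xtrain = np.asarray(Xtrain, dtype=float)
    ytrain = np.asarray(ytrain)
    V = Xtrain.shape[0]                    # vocabulary size
    alpha = 1.0                            # Laplace smoothing, alpha = 1
    pos_counts = Xtrain[:, ytrain == 1].sum(axis=1)
    neg_counts = Xtrain[:, ytrain == -1].sum(axis=1)
    thetaPos = (pos_counts + alpha) / (pos_counts.sum() + alpha * V)
    thetaNeg = (neg_counts + alpha) / (neg_counts.sum() + alpha * V)
    return thetaPos, thetaNeg
```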
1.4 Multinomial Naive Bayes Classifier (MNBC) Testing+Evaluate Step
• (Q6) You are required to provide the following function (and module) for grading:
yPredict, Accuracy = naiveBayesMulFeature_test(Xtest, ytest, thetaPos, thetaNeg)
Add the resulting Accuracy into the writeup.
• (Q7) Use "sklearn.naive_bayes.MultinomialNB" from the scikit-learn package to perform training and testing. Compare the results with your MNBC. Add the resulting Accuracy to the writeup.
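One way to sketch the testing step is to score each document with log-probabilities, which avoids floating-point underflow on long documents; since the class priors are equal, they cancel and can be omitted from the score.

```python
import numpy as np

def naiveBayesMulFeature_test(Xtest, ytest, thetaPos, thetaNeg):
    """Predict by comparing sum_i x_i * log P(wi | c) across the two classes."""
    Xtest = np.asarray(Xtest, dtype=float)
    ytest = np.asarray(ytest)
    # Each column of Xtest is a document, so Xtest.T has one row per document.
    score_pos = Xtest.T @ np.log(thetaPos)
    score_neg = Xtest.T @ np.log(thetaNeg)
    yPredict = np.where(score_pos >= score_neg, 1, -1)
    Accuracy = float(np.mean(yPredict == ytest))
    return yPredict, Accuracy

# For the Q7 comparison (requires scikit-learn; note sklearn expects one
# document per ROW, hence the transposes):
# from sklearn.naive_bayes import MultinomialNB
# clf = MultinomialNB(alpha=1.0).fit(Xtrain.T, ytrain)
# print(clf.score(Xtest.T, ytest))
```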
1.5 Multivariate Bernoulli Naive Bayes Classifier (BNBC)
• We need to learn P(cj), P(wi = false | cj), and P(wi = true | cj) from the training set. MLE gives the relative frequency as the estimate of the parameters; we add Laplace smoothing when estimating them.
• Essentially, we simply do counting to estimate P(wi = true | c):

    P(wi = true | c) = (# files that include wi and are in class c + 1) / (# files in class c + 2)

    P(wi = false | c) = 1 − P(wi = true | c)

• Since we have the same number of positive and negative samples, P(c = −1) = P(c = 1) = 1/2.
• (Q10) You are required to provide the following function (and module) for grading:
thetaPosTrue, thetaNegTrue = naiveBayesBernFeature_train(Xtrain, ytrain)
• (Q11) Provide the resulting parameter estimates in the writeup.
• (Q12) You are required to provide the following function (and module) for grading:
yPredict, Accuracy = naiveBayesBernFeature_test(Xtest, ytest, thetaPosTrue, thetaNegTrue)
Add the resulting Accuracy to the writeup.
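The Bernoulli counting estimate above translates almost directly into code. As before, this sketch assumes documents are columns and labels are +1/−1; absence of a word contributes log(1 − θ) to the document's score.

```python
import numpy as np

def naiveBayesBernFeature_train(Xtrain, ytrain):
    """Estimate P(wi = true | c) by counting documents, with +1/+2 smoothing."""
    present = np.asarray(Xtrain) > 0          # does word i appear in doc j?
    ytrain = np.asarray(ytrain)
    pos = present[:, ytrain == 1]
    neg = present[:, ytrain == -1]
    thetaPosTrue = (pos.sum(axis=1) + 1) / (pos.shape[1] + 2)
    thetaNegTrue = (neg.sum(axis=1) + 1) / (neg.shape[1] + 2)
    return thetaPosTrue, thetaNegTrue

def naiveBayesBernFeature_test(Xtest, ytest, thetaPosTrue, thetaNegTrue):
    """Score each doc by log P(wi present or absent | c); equal priors cancel."""
    present = np.asarray(Xtest) > 0
    ytest = np.asarray(ytest)
    score_pos = (present.T @ np.log(thetaPosTrue)
                 + (~present).T @ np.log(1 - thetaPosTrue))
    score_neg = (present.T @ np.log(thetaNegTrue)
                 + (~present).T @ np.log(1 - thetaNegTrue))
    yPredict = np.where(score_pos >= score_neg, 1, -1)
    Accuracy = float(np.mean(yPredict == ytest))
    return yPredict, Accuracy
```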
1.6 How will your code be checked ?
In collab, you will find the sample codes named “naiveBayes.py”. “textDataSetsDirectoryFullPath” and
“testFileDirectoryFullPath” are string inputs. We will run the command line: “python naiveBayes.py textDataSetsDirectoryFullPath testFileDirectoryFullPath” to check your code if it can print the result of the
following functions in the table.
thetaPos, thetaNeg = naiveBayesMulFeature_train(Xtrain, ytrain)
yPredict, Accuracy = naiveBayesMulFeature_test(Xtest, ytest, thetaPos, thetaNeg)
thetaPosTrue, thetaNegTrue = naiveBayesBernFeature_train(Xtrain, ytrain)
yPredict, Accuracy = naiveBayesBernFeature_test(Xtest, ytest, thetaPosTrue, thetaNegTrue)
Congratulations! You have implemented a state-of-the-art machine-learning toolset for an important web application.
2 Sample Exam Questions:
Each assignment covers a few sample exam questions to help you prepare for the midterm and the final. (Please ignore the point values mentioned in some of the exam questions.)
Question 1. Bayes Classifier
Suppose you are given the following set of data with three Boolean input variables a, b, and c, and a
single Boolean output variable G.
| a | b | c | G |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 |
| 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 |
For item (a), assume we are using a naive Bayes classifier to predict the value of G from the values of
the other variables.
(a) According to the naive Bayes classifier, what is P(G = 1 | a = 1 ∧ b = 1)? The answer is 1/3. (Please provide detailed middle steps of how you get this number.)
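As a sanity check on the counting, the posterior can be computed mechanically with exact fractions (the prior and conditional counts come straight from the table; this is a verification script, not part of the required answer):

```python
from fractions import Fraction

# Rows of the table above: (a, b, c, G)
data = [(1,0,1,1), (1,1,1,1), (0,1,1,0), (1,1,0,0),
        (1,0,1,0), (0,0,0,1), (0,0,0,1), (0,0,1,0)]

def cond(var_idx, val, g):
    """P(variable = val | G = g), estimated by counting rows."""
    rows = [r for r in data if r[3] == g]
    return Fraction(sum(r[var_idx] == val for r in rows), len(rows))

def prior(g):
    return Fraction(sum(r[3] == g for r in data), len(data))

# Unnormalized naive Bayes scores for G, given a = 1 and b = 1.
s1 = prior(1) * cond(0, 1, 1) * cond(1, 1, 1)   # (1/2)(1/2)(1/4) = 1/16
s0 = prior(0) * cond(0, 1, 0) * cond(1, 1, 0)   # (1/2)(1/2)(1/2) = 1/8
posterior = s1 / (s1 + s0)
print(posterior)   # 1/3
```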
Please provide a one-sentence justification for the following TRUE/FALSE questions.
(b) (True/False) Naive Bayes classifiers and logistic regression both directly model p(C|X).
Answer: False. Logistic regression directly models p(C|X); a Naive Bayes classifier models p(X|C) and p(C).
(c) (True/False) The Gaussian Naive Bayes classifier and the Gaussian mixture model are similar, since both assume that p(X | cluster = i) follows a Gaussian distribution.
(d) (True/False) When you train a Gaussian Naive Bayes classifier on the data samples provided in the following figure, using a separate covariance for each class, Σ1 ≠ Σ2, the decision boundary will be linear. Please provide a one-sentence justification.
Answer: False. A Gaussian Naive Bayes classifier with a separate Σ per class results in a quadratic decision boundary.
Question 2. Another Bayes Classifier
Suppose we are given the following dataset, where A,B,C are input binary random variables, and y is a
binary output whose value we want to predict.
| A | B | C | y |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 |
(a) (5 points) How would a naive Bayes classifier predict y given this input:
A = 0, B = 0, C = 1. Assume that in case of a tie the classifier always prefers to predict 0 for y.
Answer: The classifier will predict 1.
P(y = 0) = 3/7; P(y = 1) = 4/7
P(A = 0|y = 0) = 2/3; P(B = 0|y = 0) = 1/3; P(C = 1|y = 0) = 1/3
P(A = 0|y = 1) = 1/4; P(B = 0|y = 1) = 1/2; P(C = 1|y = 1) = 1/2
The predicted y maximizes P(A = 0|y)P(B = 0|y)P(C = 1|y)P(y):
P(A = 0|y = 0)P(B = 0|y = 0)P(C = 1|y = 0)P(y = 0) = 2/63 ≈ 0.0317
P(A = 0|y = 1)P(B = 0|y = 1)P(C = 1|y = 1)P(y = 1) = 1/28 ≈ 0.0357
Hence, the predicted y is 1.
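The arithmetic above can be verified with exact fractions (a verification script, not part of the answer):

```python
from fractions import Fraction

# Rows of the table above: (A, B, C, y)
data = [(0,0,1,0), (0,1,0,0), (1,1,0,0), (0,0,1,1),
        (1,1,1,1), (1,0,0,1), (1,1,0,1)]

def score(y):
    """Unnormalized naive Bayes score for the input A=0, B=0, C=1."""
    rows = [r for r in data if r[3] == y]
    prior = Fraction(len(rows), len(data))
    lik = Fraction(1)
    for idx, val in ((0, 0), (1, 0), (2, 1)):   # A = 0, B = 0, C = 1
        lik *= Fraction(sum(r[idx] == val for r in rows), len(rows))
    return prior * lik

print(score(0), float(score(0)))   # 2/63, approximately 0.0317
print(score(1), float(score(1)))   # 1/28, approximately 0.0357
```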
(b) (5 points) Suppose you know for a fact that A, B, and C are independent random variables. In this case, is it possible for any other classifier to do better than a naive Bayes classifier? (The dataset is irrelevant for this question.)