## Description

EECS 445 Introduction to Machine Learning

Project 1 – SVM: Yelp!

1 Introduction

The EECS445 team has recently created a website, Kelp, where users can submit reviews of restaurants,

hotels, and other businesses. Unfortunately, we’ve found that many people were reluctant to indicate whether

they felt happy, sad, or neutral when writing their reviews. We now have thousands of reviews, but no

corresponding ratings. This poses a problem: we do not want to manually read through each review to

determine its sentiment! Thankfully, we can harness the powers of machine learning to estimate these

ratings.

In this project, we have given you data from Yelp, since Yelp’s reviews are very similar to the target reviews on Kelp. The Yelp dataset contains hundreds of thousands of reviews and ratings of restaurants,

hotels, and other businesses. You will work with a subset of this dataset to train various Support Vector

Machines (SVMs) to classify the sentiment of a review. In this process, we will also explore some very

useful scikit-learn packages and data science techniques.

1.1 Requirements:

1. Updated version of Anaconda: https://docs.continuum.io/anaconda/install/

2. Updated version of scikit-learn (0.19.0) in your Anaconda install

• You can verify your version of Anaconda-managed packages by running conda list

• If needed, you can update the package by running conda update scikit-learn

• You may create an Anaconda environment for the project with only the packages you will need,

or you may simply use the Anaconda instance of Python which will make packages managed by

Anaconda available for your use. Please reference the Anaconda documentation.

1

1.2 Getting Started

To get started, download Project1 from Drive. It should contain the following files:

• data/dataset.csv

• data/heldout.csv

• data/imbalanced.csv

• project1.py

• helper.py

• test output.py

The files dataset.csv and imbalanced.csv have reviews from Yelp. These csv files have 3 columns:

content, label, rating. Each row in the csv file corresponds to one review. The content column contains the

text of the actual review. The label column has binary labels: 1 if the review is positive and −1 otherwise.

The rating column is a multiclass label: 1 if positive, 0 if neutral, and −1 if negative.

You will use the content and label columns for most of the project. The final challenge portion, however,

will use rating instead of label.

The helper file helper.py provides functions that allow you to read in the data from csv files. The file

project1.py contains skeleton code for the project, along with the helper function select classifier

which you may implement to return SVM classifiers depending on the given input parameters. The file

test output.py allows you to test your output csv file before submission to make sure the format is

correct.

The data for each part of the project has already been read in for you in the main function of the

skeleton code. Please do not change how the data is read in; doing so may affect your results.

The skeleton code project1.py provides specifications for functions that you will implement:

• extract dictionary(df)

• generate feature matrix(df, word dict)

• cv performance(clf, X, y, k=5, metric=‘accuracy’)

• select param linear(X, y, k=5, metric=‘accuracy’, C range=[], penalty=‘l2’)

• plot weight(X, y, penalty, metric, C range)

• select param quadratic(X, y, k=5, metric=‘accuracy’, param range = [])

• Optional: select classifier(penalty=‘l2’, c=1.0, degree=1, r=0.0,

class weight=‘balanced’)

• Optional: performance(y true, y pred, metric=‘accuracy’)

2

2 Feature Extraction [20 pts]

We will use a bag-of-words model to convert each review into a feature vector. A bag-of-words model treats

a text as a collection of words, disregarding word order. The first step in building a bag-of-words representation involves building a dictionary. A dictionary contains all of the unique words in the dataset. For this

part of the project, we will convert all text to lowercase and exclude punctuation in the dictionary (replace

punctuation with a space). For example, a review containing “The&restaurant%$$was the best!!” will

yield a dictionary {‘the’:0, ‘restaurant’:1, ‘was’:2, ‘best’:3}. Note that the (key,

value) pairs are (word, index) where the index assigns a unique identifier based on when the word

was first found. This dictionary will come in handy when mapping each review to the feature space.

Given a dictionary containing d unique words, we can transform the n variable-length reviews into n feature vectors of length d, by setting the i

th element of the j

th feature vector to 1 if the i

th word is in the

j

th review, and 0 otherwise. Given that the four words {‘the’:0, ‘restaurant’:1, ‘was’:2,

‘best’:3} are the only four words we ever encounter, the review “BEST restaurant ever!!” would map

to the feature vector [0, 1, 0, 1].

Note that we do not consider case. Also, note that since the word “ever” was not in the original dictionary,

it is ignored as a feature. In real-world scenarios, there may be words in test data that you do not encounter

in training data. There are many interesting methods for dealing with this that you may explore in part 5.

(a) Start by implementing the extract dictionary(df) function. You will use this function to read

all unique words contained in dataset.csv into a dictionary (as in the example above). You can start

implementing this function by removing all the punctuation in the dataset. While removing punctuation,

please make sure that you do not accidentally combine two different words that are separated by a punctuation mark. For instance, after you remove punctuation from “Restaurant was awesome!Yay”, you

should produce “Restaurant was awesome Yay”, not “Restaurant was awesomeYay”. After removing

all the punctuation, you should convert all the words to lowercase and start building your dictionary.

Your function should return a dictionary of d unique words.

Note: You will need to report the number of unique words in 2(c).

Hint: You might find string.punctuation along with the method string.replace() useful.

(b) Next, implement the generate feature matrix(df, word dict) function. For each review

j, construct a feature vector of length d, where the i

th entry in the feature vector is 1 if the i

th word

in the dictionary is present in review j, or 0 otherwise. Assuming that there are n reviews total, return

the feature vectors as an (n, d) feature matrix, where each row represents a review, and each column

represents whether or not a specific word appeared in that review.

(c) The function get split binary data in helper.py uses the functions you implemented in (a)

and (b). Examine how it is implemented. Then, use get split binary data to get the training

feature matrix X train.

In your write-up, include the following:

• The value of d which you recorded after extracting the training data (the number of unique words).

You should be able to extract d from the size of the training feature matrix.

• The average number of non-zero features per rating in the training data. You will need to calculate

this on X train.

3

3 Hyperparameter and Model Selection [40 pts]

In the skeleton code, the reviews have been transformed into a feature matrix X train and a label vector

y train using the functions you implemented in question 2. Test data X test, y test has also been

read in for you. You will use these data for all of question 3. You may notice that X train, y train

only have 1000 reviews, while the dataset.csv file has 3000. Here, we only give you a subset of the

data to train on. You will work with all 3000 reviews in question 5.

We will learn a classifier to separate the training data into positive and non-positive (i.e., “negative”) ratings.

The labels in y train are binary labels in {−1, 1}, where −1 means “poor or average” and 1 means “good.”

This is a binary classification problem, which you know how to solve!

For the classifier, we will use SVMs with two different kernels: linear and quadratic. In parts 3.1-3.3 we

will make use of the sklearn.svm.SVC class. At first, we will explicitly set only two of the initialization

parameters of SVC(): the kernel, and C. In addition, we will use the following methods in the SVC class:

fit(X,y), predict(X) and decision function(X) – please use predict(X) when measuring

accuracy and decision function(X) for other metrics (see the documentation for more details).

As discussed in lecture, SVMs have hyperparameters that must be set by the user. For both linear-kernel and

quadratic-kernel SVMs, we will select hyperparameters using 5-fold cross-validation (CV) on the training

data. We will select the hyperparameters that lead to the ‘best’ mean performance across all five folds. The

result of hyperparameter selection often depends upon the choice of performance measure. Here, we will

consider the following performance measures: Accuracy, F1-Score, AUROC, Precision, Sensitivity, and

Specificity.

Note: When calculating the F1-score, it is possible that a divide-by-zero may occur which throws a warning.

Consider how this metric is calculated, perhaps by reviewing the relevant scikit-learn documentation.

Some of these measures are already implemented as functions in the sklearn.metrics submodule.

Please use sklearn.metrics.roc auc score for AUROC. You can use the values from

sklearn.metrics.confusion matrix to calculate the others (Note – the confusion matrix is just

the table of Predicted vs. Actual label counts, that is, the True Positive, False Positive, True Negative, and

False Negative counts for binary classification). Make sure to read the documentation carefully, as when

calling this function you will want to set labels=[1,-1] for a deterministic ordering of your confusion

matrix output.

3.1 Hyperparameter Selection for a Linear-Kernel SVM [20 pts]

(a) To begin, implement the function cv performance(clf, X, y, k=5,metric=‘accuracy’)

as defined in the skeleton code. Here you will make use of the fit(X,y), predict(X), and

decision function(X) methods in the SVC class. The function returns the mean k-fold CV

performance for the performance metric passed into the function. The default metric is ‘accuracy’,

however your function should work for all of the metrics listed above. It may be useful to have a helper

function that calculates each performance metric. For instance: performance(y true, y pred,

metric=‘accuracy’)

You may have noticed that the proportion of the two classes (positive and non-positive) are equal in the

training data. When dividing the data into folds for CV, you should try to keep the class proportions

roughly the same across folds; in this case, the class proportions should be roughly equal across folds,

4

since the original training data has equal class proportions.

You must implement this function without using the scikit learn implementation of CV. However,

you may employ the following class for splitting the data: sklearn.model selection.StratifiedKFold().

Do not shuffle points when using this function (i.e., do not set shuffle=True). This is so the generated folds are consistent for the same dataset across runs of the entire program.

In your write-up, briefly describe why it might be beneficial to maintain class proportions across folds.

(b) Now implement the select param linear(X, y, k=5, metric=‘accuracy’, C range

= [], penalty=’l2’) function to choose a value of C for a linear SVM based on the training data

and the specified metric. Note that scikit-learn uses a slightly different formulation of SVM from the

one we introduced in lecture, namely:

minimize

θ,b,ξ ¯

i

||¯θ||2

2

+ C

Xn

i=1

ξi

subject to y

(i)

(

¯θ · x¯

(i) + b) ≥ 1 − ξi

ξi ≥ 0, ∀i = 1, 2, …, n

Essentially, the C is inversely proportional to the λ we used in lecture. Your function should call

your CV function (implemented earlier) passing in instances of SVC(kernel=‘linear’, C=c,

class weight=‘balanced’) with a range of values for C chosen in powers of 10 between 10−3

and 103

(i.e. 10−3

, 10−2

, . . . , 102

, 103

). You may choose to implement and use the helper function

select classifier to instantiate the needed classifier.

(c) Finally, using the training data from question 2 and the functions implemented here, find the best setting

for C for each performance measure. Report your findings in tabular format with three columns: names

of the performance measures, along with the corresponding values of C and the mean cross-validation

performance. The table should follow the format given below:

Performance Measures C Performance

Accuracy

F1-Score

AUROC

Precision

Sensitivity

Specificity

Your select param linear function returns the ‘best’ value of C given a range of values. Note:

as we are working with a fairly large feature matrix, this may take several minutes (our project solution

time is about 12 minutes for this question on our test computer).

Also, in your write-up, describe how the 5-fold CV performance varies with C. If you have to train

a final model, which performance measure would you optimize for when choosing C? Explain your

choice. This performance measure will be used in part d.

5

(d) Now, using the value of C that maximizes your chosen performance measure, create an SVM as in the

previous question. Again, you may choose to use the helper function select classifier. Train

this SVM on the training data X train, y train. Report the performance of this SVM on the test

data X test, y test for each metric below.

Performance Measures Performance

Accuracy

F1-Score

AUROC

Precision

Sensitvity

Specificity

(e) Finish the implementation of the plot weight(X, y, penalty, metric, C range) function. In this function, you need to find the L0-norm of ¯θ, the parameter vector learned by the SVM, for

each value of C in the given range. Finding out how to get the vector ¯θ from a SVC object may require

you to dig into the documentation. The L0-norm is given as follows, for ¯θ ∈ R

d

:

k

¯θk0 =

X

d

i=1

I{θi 6= 0}

where I{a 6= 0} is 0 if a is 0 and 1 otherwise.

Use the complete training data X train, Y train, i.e, don’t use cross-validation for this part. Once

you implement the function, the existing code will plot L0-norm k

¯θk0 against C and save it to a file. In

your write-up, include the produced plot and describe any interesting trends you observe.

(f) Recall that each coefficient of ¯θ is associated with a word. The more positive a coefficient is, the more

the presence of the associated word indicates that the review is positive, and similarly with negative

coefficients. In this way, we can use these coefficients to find out what word-rating associations our

SVM has learned.

Using C = 0.1 (for consistency with our results), train an SVM on X train, Y train. On this

trained SVM, find the top 4 most positive coefficients and the top 4 most negative coefficients of ¯θ.

Using the dictionary created on the training data dictionary binary, find the words that these

coefficients correspond to, and report both the coefficients and the corresponding words. As before, you

may choose to use the helper function select classifier.

Positive Coefficient Word

Negative Coefficient Word

6

3.2 Hyperparameter Selection for a Quadratic-Kernel SVM [10 pts]

Similar to the hyperparameter selection for a linear-kernel SVM, you will perform hyperparameter selection for a quadratic-kernel SVM. Here we are assuming a kernel of the form (¯x · x¯

0 + r)

2

, where r is a

hyperparameter.

(a) Implement the select param quadratic(X, y, k=5, metric=‘accuracy’, param range

= []) function to choose a setting for C and r as in the previous part. Your function should call your

CV function (implemented earlier) passing in instances of SVC(kernel=‘poly’, degree=2,

C=c, coef0=r, class weight=‘balanced’) with the same range of C that we use in 3.1(b).

You should also use the same range for r.

Again, you may choose to use the helper function select classifier. The function argument

param range should be a matrix with two columns with first column as values of C and second

column as values of r. You will need to try out the range of parameters via two methods:

i) Grid Search: In this case, we look at all the specified values of C in a given set and choose the best

setting after tuning. For this question, the values should be considered in powers of 10 for both C

(between 10−3

and 103

) and r (between 10−3

and 103

) [A total of 49 pairs]. This code will take a

substantial time to run (our project solution runs in about 17 minutes on our test computer).

ii) Random Search: Instead of assigning fixed values for the parameter C and r, we can also sample

random values from a particular distribution. For this case, we will be sampling from a log-uniform

distribution, i.e., the log of random variables follows a uniform distribution:

P[a ≤ log C ≤ b] = k(b − a)

for some constant k so the distribution is valid. In other words, we sample a uniform distribution

with the same range as above to yield exponents xi

, and corresponding values of C = 10xi

.

In your case, the values should range from the powers of 10 which you used in part (i). Choose 25

pairs of such sampled pairs of (C, r). Again, this code may take time to run (our solution runs in

about 12 minutes on our test computer).

(b) Finally, using the training data from question 2 and the function implemented here, find the best values

for C and r for AUROC and both tuning schemes mentioned above. Report your findings in tabular

format. The table should have four columns: Tuning Scheme, C, r and AUROC. Again, in the case of

ties, report the lower C and the lower r values that perform the best. Your table should look be similar

to the following:

Tuning Scheme C r AUROC

Grid Search

Random Search

How does the 5-fold CV performance vary with C and r? Also, is the use of random search better than

grid search? Give reasons for your conclusion.

7

3.3 Learning Non-linear Classifiers with a Linear-Kernel SVM [5 pts]

Here, we will explore the use of an explicit feature mapping in place of a kernel. (Note: you do not need to

write any code for question 3.3)

(a) Describe a feature mapping, φ(¯x), that maps your data to a feature space similar to the one implied by

the quadratic kernel from the question above.

(b) Instead of using a quadratic-kernel SVM, we could simply map the data to this higher dimensional

space via this mapping and then learn a linear classifier in this higher-dimensional space. What are the

tradeoffs (pros and cons) of using an explicit feature mapping over a kernel? Explain.

3.4 Linear-Kernel SVM with L1 Penalty and Squared Hinge Loss [5 pts]

In this part of the project, you will explore the use of a different penalty (i.e., regularization term) and a

different loss function. In particular, we will use the L1 penalty and squared hinge loss which corresponds

to the following optimization problem.

minimize

θ,b ¯

||¯θ||1 + C

Xn

i=1

loss(y

(i)

(

¯θ · x¯

(i) + b))

where loss(z) = max{0,(1 − z)}

2

. We will make use of the LinearSVC() class, which uses the

squared hinge loss and allows us to specify the penalty. We will consider only a linear-kernel SVM

and the original (untransformed) features. When calling LinearSVC please use the following settings:

LinearSVC(penalty=‘l1’, dual=False, C=c, class weight=’balanced’). As always,

you may implement and use the helper function select classifier to instantiate your SVM classifier.

(a) Using the training data from question 2 and 5-fold CV, find the best setting for C that maximizes the

AUROC using grid search CV with range specified in 3.1(b). When we say ”grid search” here, we mean

searching a one-dimensional grid as the only hyperparameter that is changing is C. In the case of ties,

report the lower C value. Report your findings.

(b) Similar to 3.1(e), plot the L0-norm of the learned parameter ¯θ against C using complete training data

and no cross-validation. You should be able to re-use the function plot weight with different input

parameters without writing additional code. Include the plot in your write-up.

(c) Beyond any change in performance you may notice, what effect does the L1 penalty have on the optimal

solution? (Hint: Have a careful look at the plots you generated!)

(d) Note that using the Squared Hinge Loss (as opposed to the Hinge Loss) changes the objective function

as shown above. What effect do you expect this will have on the optimal solution?

8

4 Asymmetric Cost Functions and Class Imbalance [20 pts]

The training data we have provided you with so far is balanced: the data contain an equal number of positive

and negative ratings (500 ratings per class). But this is not always the case. In many situations, you are given

imbalanced data, where one class may be represented more than the others.

In this section, you will investigate the objective function of the SVM, and how we can modify it to fit

situations with class imbalance. Recall that the objective function for an SVM in scikit-learn is as follows:

minimize

θ,b,ξ ¯

i

||¯θ||2

2

+ C

Xn

i=1

ξi

subject to y

(i)

¯θ · φ(¯x

(i)

) + b

?

≥ 1 − ξi

ξi ≥ 0, ∀i = 1, 2, 3, …, n

We can modify it in the following way:

minimize

θ,b,ξ ¯

i

||¯θ||2

2

+ Wp ∗ C

X

i|y

(i)=1

ξi + Wn ∗ C

X

i|y

(i)=−1

ξi

subject to y

(i)

¯θ · φ(¯x

(i)

) + b

?

≥ 1 − ξi

ξi ≥ 0, ∀i = 1, 2, 3, …, n

where P

i|y

(i)=1 is a sum over all indices i where the corresponding point is positive y

(i) = 1. Similarly,

P

i|y

(i)=−1

is a sum over all indices i where the corresponding point is negative y

(i) = −1.

4.1 Arbitrary class weights [5 pts]

Wp and Wn are called “class weights” and are built-in parameters in scikit-learn.

(a) Describe how this modification will change the solution. If Wn is much greater than Wp, what does this

mean in terms of classifying positive and negative points?

(b) Create a linear-kernel SVM with hinge loss and L2-penalty with C = 0.01. This time, when calling

SVC, set class weight={-1: 10, 1: 1}, or implement and use your select classifier

helper function. This corresponds to setting Wn = 10 and Wp = 1. Train this SVM on the training

data X train, y train. Report the performance of the modified SVM on the test data X test,

y test for each metric below.

Performance Measures Performance

Accuracy

F1-Score

AUROC

Precision

Sensitvity

Specificity

(c) Also, answer the following: Compared to your work in question 3.1(d), which performance measures

were affected the most by the new class weights? Why do you suspect this is the case?

Note: We set C = 0.01 to ensure that interesting trends can be found regardless of your work in

question 3. This may mean that your value for C differs in 4 and 3.1(d). In a real machine learning

setting, you’d have to be more careful about how you compare models.

4.2 Imbalanced data [5 pts]

You just saw the effect of arbitrarily setting the class weights when our training set is already balanced. Let’s

return to the class weights you are used to: Wn = Wp. We turn our attention to class imbalance. Using the

functions you wrote in part 2, we have provided you with a second feature matrix and vector of binary labels

IMB features, IMB labels. This class-imbalanced data set has 800 positive points and 200 negative

points. It also comes with a corresponding test feature matrix and label vector pair IMB test features,

IMB test labels, which has the same class imbalance.

(a) Create a linear-kernel SVM with hinge loss, L2-penalty and as before, C = 0.01. Set class weight={-1:

1, 1: 1}, which returns the SVM to the formulation you have seen in class. Now train this SVM

on the class-imbalanced data IMB features, IMB labels provided.

Use this classifier to predict the provided test data IMB test features, IMB test labels and

report the accuracy, specificity, sensitivity, precision, AUROC, and F1-Score of your predictions:

Class Weights Performance Measures Performance

Wn = 1, Wp = 1 Accuracy

Wn = 1, Wp = 1 Sensitivity

Wn = 1, Wp = 1 Specificity

Wn =?, Wp =? Precision

Wn = 1, Wp = 1 AUROC

Wn =?, Wp =? F1-Score

(b) How has training on an imbalanced data set affected performance?

4.3 Choosing appropriate class weights [5 pts]

(a) Now we will return to setting the class weights given the situation we explored in part 4.2. Using what

you have done in the preceding parts, find an appropriate setting for the class weights that mitigates

the situation in part 4.2 and improves the classifier trained on the imbalanced data set. That is, find

class weights that give a good mean cross validation performance, (Think: which performance metric(s)

are informative in this situation, and which metric(s) are less meaningful? Make sure the metric you

use for cross-validation is a good choice given the imbalanced class weights). Report here your strategy for choosing these parameters. This question requires you to choose hyperparameters based on

cross-validation; you should not be using the test data to choose hyperparameters.

(b) Use your custom classifier to predict the provided test data IMB test features, IMB text labels

again, and report the accuracy, specificity, sensitivity, precision, AUROC, and F1-Score of your predictions:

10

Class Weights Performance Measures Performance

Wn =?, Wp =? Accuracy

Wn =?, Wp =? Sensitivity

Wn =?, Wp =? Specificity

Wn =?, Wp =? Precision

Wn =?, Wp =? AUROC

Wn =?, Wp =? F1-Score

4.4 The ROC curve [5 pts]

(a) Given the above results, we are interested in investigating the AUROC metric more. First, provide a plot

of the ROC curve with labeled axes for both Wn = 1, Wp = 1 and your custom setting of Wn, Wp from

above. Put both curves on the same set of axes. Make sure to label the plot in a way that indicates which

curve is which.

(b) Test your understanding: You’ve seen that class weights can help improve the performance of a classifier trained on imbalanced data. Based on what you observed in part (a) of this question, briefly propose

an alternative method for “fixing” a classifier that is trained on imbalanced data. That is, what could

you do to create a classifier that achieves similar results to the classifier in 4.3(b) without changing the

class weights?

5 Challenge [20 pts]

Now, a challenge: in the previous problems, we had transformed the data into a binary dataset by combining

multiple ratings to generate two labels.

For this challenge, you will consider the original multiclass ratings of the reviews. We have already prepared

a held-out test set heldout features for this challenge, and multiclass training data multiclass features,

multiclass labels. This training data has 3000 reviews, 1000 of each class. You must work only

with the provided data; acquiring new data to train your model is not permitted. Your goal is to train

a multiclass classifier using SVC and LinearSVC classes to predict the true ratings of the held-out test set,

i.e., you will train your model on multiclass features and test on heldout features.

Note that the class balance of this training set matches the class balance of the heldout set. Also note that,

given the size of the data and the feature matrix, training may take several minutes.

In order to attempt this challenge, we encourage you to apply what you have learned about hyperparameter

selection and consider the following extensions:

1. Try different feature engineering methods. The bag-of-words models we have used so far are

simplistic. There are other methods to extract different features from the raw data, such as:

(a) Using a different method for extracting words from the reviews

(b) Using only a subset of the raw features

(c) Using the number of times a word occurs in a review as a feature (rather than binary 0, 1 features

indicating presence)

11

(d) Scaling or normalizing the data

(e) Alternative feature representations

2. Read about one-vs-one and one-vs-all. These are the two standard methods to implement multiclass

classifier using binary classifiers. You should understand the differences between them and implement

at least one.

You will have to save the output of your classifier into a csv file using the helper function

generate challenge labels(y, uniqname) we have provided. The base name of the output

file must be your Uniqname followed by the extension csv. For example, the output filename for a user

with Uniqname foo would be foo.csv. This file will be submitted according to the instructions at the

end of the file. You may use the file test output.py to ensure that your output has the correct format. To run this file, simply run python test output.py -i Uniqname.csv, replacing the file

Uniqname.csv with your generated output file.

We will evaluate your performance in this challenge based on two components:

1. Write-Up and Code [10 pts]: We will evaluate how much effort you have applied to attempt this

challenge based on your write-up and code. Ensure that both are present. Within your write-up,

you must provide discussions of the choices you made when designing the following components of

your classifier:

• Feature engineering

• Hyperparameter selection

• Algorithm selection (e.g., quadratic vs. linear kernel)

• Multiclass method (e.g., one-vs-rest vs. one-vs-all)

• Any techniques you used that go above and beyond current course material

2. Test Scores [10 pts]: We will evaluate your classifier based on accuracy. Consider the following

confusion matrix:

-1 0 1

-1 x1

0 x2

1 y1 x3

where each column corresponds to the actual class and each row corresponds to the predicted class.

For instance, y1 in the matrix above is the number of reviews with true rating −1 (poor), but are

classified as a review with rating 1 (good) by your model. The accuracy for a multiclass classification

problem is defined as follows:

accuracy =

x1 + x2 + x3

n

where n is the number of samples.

12

REMEMBER to submit your project report by 9:00pm ET on February 5, 2018 to Gradescope.

Include your code as an appendix (copy and pasted) in your report. Please try to format lines of code

so they are visible within the pages.

Upload your file uniqname.csv containing the label predictions for the held-out data at this link

https://tinyurl.com/eecs445-w18 providing your umich.edu email.

Appendix A: Approximate Run-times for Programming Problems

• Problem 3.1 c: around 12 minutes

• Problem 3.2 a i: around 17 minutes

• Problem 3.2 a ii: around 12 minutes

• Problem 4.3 b: around 5-8 minutes

N.B. these are approximate times, not exact. Different computers will result in different run-times, so do not

panic if yours is a little different. Algorithmic optimization can also improve run-time noticeably in certain

cases. However, if it is taking more than twice as long, something might be wrong.

Appendix B: Lecture Topics and Concepts

• Problem 3.1 a, b, e, f, Problem 3.4 c, d

– 01/22: Lec 5: Support Vector Machines; Primal Formulation; Geometric Margin

• Problem 3.2 a, Problem 3.3 a, Problem 4.1 a

– 01/24: Lec 6: Dual Formulation; Kernels

• Problem 3.1 c, d, Problem 3.2 b, Problem 3.3 b, Problem 3.4 a, b, Problem 4.1 b, c, Problem 4.2

a, b, Problem 4.3 a, b, Problem 4.4 a, b

– 01/29: Lec 7: Performance Measures; Dimensionality Reduction

The date provided for a problem indicates the last lecture in which relevant material is discussed. Some

problems can be completed before their listed date.

13