Project 1 – SVM: Yelp!




Rate this product

EECS 445 Introduction to Machine Learning

Project 1 – SVM: Yelp!

1 Introduction
The EECS445 team has recently created a website, Kelp, where users can submit reviews of restaurants,
hotels, and other businesses. Unfortunately, we’ve found that many people were reluctant to indicate whether
they felt happy, sad, or neutral when writing their reviews. We now have thousands of reviews, but no
corresponding ratings. This poses a problem: we do not want to manually read through each review to
determine its sentiment! Thankfully, we can harness the powers of machine learning to estimate these
In this project, we have given you data from Yelp, since Yelp’s reviews are very similar to the target reviews on Kelp. The Yelp dataset contains hundreds of thousands of reviews and ratings of restaurants,
hotels, and other businesses. You will work with a subset of this dataset to train various Support Vector
Machines (SVMs) to classify the sentiment of a review. In this process, we will also explore some very
useful scikit-learn packages and data science techniques.
1.1 Requirements:
1. Updated version of Anaconda:
2. Updated version of scikit-learn (0.19.0) in your Anaconda install
• You can verify your version of Anaconda-managed packages by running conda list
• If needed, you can update the package by running conda update scikit-learn
• You may create an Anaconda environment for the project with only the packages you will need,
or you may simply use the Anaconda instance of Python which will make packages managed by
Anaconda available for your use. Please reference the Anaconda documentation.
1.2 Getting Started
To get started, download Project1 from Drive. It should contain the following files:
• data/dataset.csv
• data/heldout.csv
• data/imbalanced.csv
• test
The files dataset.csv and imbalanced.csv have reviews from Yelp. These csv files have 3 columns:
content, label, rating. Each row in the csv file corresponds to one review. The content column contains the
text of the actual review. The label column has binary labels: 1 if the review is positive and −1 otherwise.
The rating column is a multiclass label: 1 if positive, 0 if neutral, and −1 if negative.
You will use the content and label columns for most of the project. The final challenge portion, however,
will use rating instead of label.
The helper file provides functions that allow you to read in the data from csv files. The file contains skeleton code for the project, along with the helper function select classifier
which you may implement to return SVM classifiers depending on the given input parameters. The file
test allows you to test your output csv file before submission to make sure the format is
The data for each part of the project has already been read in for you in the main function of the
skeleton code. Please do not change how the data is read in; doing so may affect your results.
The skeleton code provides specifications for functions that you will implement:
• extract dictionary(df)
• generate feature matrix(df, word dict)
• cv performance(clf, X, y, k=5, metric=‘accuracy’)
• select param linear(X, y, k=5, metric=‘accuracy’, C range=[], penalty=‘l2’)
• plot weight(X, y, penalty, metric, C range)
• select param quadratic(X, y, k=5, metric=‘accuracy’, param range = [])
• Optional: select classifier(penalty=‘l2’, c=1.0, degree=1, r=0.0,
class weight=‘balanced’)
• Optional: performance(y true, y pred, metric=‘accuracy’)
2 Feature Extraction [20 pts]
We will use a bag-of-words model to convert each review into a feature vector. A bag-of-words model treats
a text as a collection of words, disregarding word order. The first step in building a bag-of-words representation involves building a dictionary. A dictionary contains all of the unique words in the dataset. For this
part of the project, we will convert all text to lowercase and exclude punctuation in the dictionary (replace
punctuation with a space). For example, a review containing “The&restaurant%$$was the best!!” will
yield a dictionary {‘the’:0, ‘restaurant’:1, ‘was’:2, ‘best’:3}. Note that the (key,
value) pairs are (word, index) where the index assigns a unique identifier based on when the word
was first found. This dictionary will come in handy when mapping each review to the feature space.
Given a dictionary containing d unique words, we can transform the n variable-length reviews into n feature vectors of length d, by setting the i
th element of the j
th feature vector to 1 if the i
th word is in the
th review, and 0 otherwise. Given that the four words {‘the’:0, ‘restaurant’:1, ‘was’:2,
‘best’:3} are the only four words we ever encounter, the review “BEST restaurant ever!!” would map
to the feature vector [0, 1, 0, 1].
Note that we do not consider case. Also, note that since the word “ever” was not in the original dictionary,
it is ignored as a feature. In real-world scenarios, there may be words in test data that you do not encounter
in training data. There are many interesting methods for dealing with this that you may explore in part 5.
(a) Start by implementing the extract dictionary(df) function. You will use this function to read
all unique words contained in dataset.csv into a dictionary (as in the example above). You can start
implementing this function by removing all the punctuation in the dataset. While removing punctuation,
please make sure that you do not accidentally combine two different words that are separated by a punctuation mark. For instance, after you remove punctuation from “Restaurant was awesome!Yay”, you
should produce “Restaurant was awesome Yay”, not “Restaurant was awesomeYay”. After removing
all the punctuation, you should convert all the words to lowercase and start building your dictionary.
Your function should return a dictionary of d unique words.
Note: You will need to report the number of unique words in 2(c).
Hint: You might find string.punctuation along with the method string.replace() useful.
(b) Next, implement the generate feature matrix(df, word dict) function. For each review
j, construct a feature vector of length d, where the i
th entry in the feature vector is 1 if the i
th word
in the dictionary is present in review j, or 0 otherwise. Assuming that there are n reviews total, return
the feature vectors as an (n, d) feature matrix, where each row represents a review, and each column
represents whether or not a specific word appeared in that review.
(c) The function get split binary data in uses the functions you implemented in (a)
and (b). Examine how it is implemented. Then, use get split binary data to get the training
feature matrix X train.
In your write-up, include the following:
• The value of d which you recorded after extracting the training data (the number of unique words).
You should be able to extract d from the size of the training feature matrix.
• The average number of non-zero features per rating in the training data. You will need to calculate
this on X train.
3 Hyperparameter and Model Selection [40 pts]
In the skeleton code, the reviews have been transformed into a feature matrix X train and a label vector
y train using the functions you implemented in question 2. Test data X test, y test has also been
read in for you. You will use these data for all of question 3. You may notice that X train, y train
only have 1000 reviews, while the dataset.csv file has 3000. Here, we only give you a subset of the
data to train on. You will work with all 3000 reviews in question 5.
We will learn a classifier to separate the training data into positive and non-positive (i.e., “negative”) ratings.
The labels in y train are binary labels in {−1, 1}, where −1 means “poor or average” and 1 means “good.”
This is a binary classification problem, which you know how to solve!
For the classifier, we will use SVMs with two different kernels: linear and quadratic. In parts 3.1-3.3 we
will make use of the sklearn.svm.SVC class. At first, we will explicitly set only two of the initialization
parameters of SVC(): the kernel, and C. In addition, we will use the following methods in the SVC class:
fit(X,y), predict(X) and decision function(X) – please use predict(X) when measuring
accuracy and decision function(X) for other metrics (see the documentation for more details).
As discussed in lecture, SVMs have hyperparameters that must be set by the user. For both linear-kernel and
quadratic-kernel SVMs, we will select hyperparameters using 5-fold cross-validation (CV) on the training
data. We will select the hyperparameters that lead to the ‘best’ mean performance across all five folds. The
result of hyperparameter selection often depends upon the choice of performance measure. Here, we will
consider the following performance measures: Accuracy, F1-Score, AUROC, Precision, Sensitivity, and
Note: When calculating the F1-score, it is possible that a divide-by-zero may occur which throws a warning.
Consider how this metric is calculated, perhaps by reviewing the relevant scikit-learn documentation.
Some of these measures are already implemented as functions in the sklearn.metrics submodule.
Please use sklearn.metrics.roc auc score for AUROC. You can use the values from
sklearn.metrics.confusion matrix to calculate the others (Note – the confusion matrix is just
the table of Predicted vs. Actual label counts, that is, the True Positive, False Positive, True Negative, and
False Negative counts for binary classification). Make sure to read the documentation carefully, as when
calling this function you will want to set labels=[1,-1] for a deterministic ordering of your confusion
matrix output.
3.1 Hyperparameter Selection for a Linear-Kernel SVM [20 pts]
(a) To begin, implement the function cv performance(clf, X, y, k=5,metric=‘accuracy’)
as defined in the skeleton code. Here you will make use of the fit(X,y), predict(X), and
decision function(X) methods in the SVC class. The function returns the mean k-fold CV
performance for the performance metric passed into the function. The default metric is ‘accuracy’,
however your function should work for all of the metrics listed above. It may be useful to have a helper
function that calculates each performance metric. For instance: performance(y true, y pred,
You may have noticed that the proportion of the two classes (positive and non-positive) are equal in the
training data. When dividing the data into folds for CV, you should try to keep the class proportions
roughly the same across folds; in this case, the class proportions should be roughly equal across folds,
since the original training data has equal class proportions.
You must implement this function without using the scikit learn implementation of CV. However,
you may employ the following class for splitting the data: sklearn.model selection.StratifiedKFold().
Do not shuffle points when using this function (i.e., do not set shuffle=True). This is so the generated folds are consistent for the same dataset across runs of the entire program.
In your write-up, briefly describe why it might be beneficial to maintain class proportions across folds.
(b) Now implement the select param linear(X, y, k=5, metric=‘accuracy’, C range
= [], penalty=’l2’) function to choose a value of C for a linear SVM based on the training data
and the specified metric. Note that scikit-learn uses a slightly different formulation of SVM from the
one we introduced in lecture, namely:
θ,b,ξ ¯
+ C
subject to y
¯θ · x¯
(i) + b) ≥ 1 − ξi
ξi ≥ 0, ∀i = 1, 2, …, n
Essentially, the C is inversely proportional to the λ we used in lecture. Your function should call
your CV function (implemented earlier) passing in instances of SVC(kernel=‘linear’, C=c,
class weight=‘balanced’) with a range of values for C chosen in powers of 10 between 10−3
and 103
(i.e. 10−3
, 10−2
, . . . , 102
, 103
). You may choose to implement and use the helper function
select classifier to instantiate the needed classifier.
(c) Finally, using the training data from question 2 and the functions implemented here, find the best setting
for C for each performance measure. Report your findings in tabular format with three columns: names
of the performance measures, along with the corresponding values of C and the mean cross-validation
performance. The table should follow the format given below:
Performance Measures C Performance
Your select param linear function returns the ‘best’ value of C given a range of values. Note:
as we are working with a fairly large feature matrix, this may take several minutes (our project solution
time is about 12 minutes for this question on our test computer).
Also, in your write-up, describe how the 5-fold CV performance varies with C. If you have to train
a final model, which performance measure would you optimize for when choosing C? Explain your
choice. This performance measure will be used in part d.
(d) Now, using the value of C that maximizes your chosen performance measure, create an SVM as in the
previous question. Again, you may choose to use the helper function select classifier. Train
this SVM on the training data X train, y train. Report the performance of this SVM on the test
data X test, y test for each metric below.
Performance Measures Performance
(e) Finish the implementation of the plot weight(X, y, penalty, metric, C range) function. In this function, you need to find the L0-norm of ¯θ, the parameter vector learned by the SVM, for
each value of C in the given range. Finding out how to get the vector ¯θ from a SVC object may require
you to dig into the documentation. The L0-norm is given as follows, for ¯θ ∈ R
¯θk0 =
I{θi 6= 0}
where I{a 6= 0} is 0 if a is 0 and 1 otherwise.
Use the complete training data X train, Y train, i.e, don’t use cross-validation for this part. Once
you implement the function, the existing code will plot L0-norm k
¯θk0 against C and save it to a file. In
your write-up, include the produced plot and describe any interesting trends you observe.
(f) Recall that each coefficient of ¯θ is associated with a word. The more positive a coefficient is, the more
the presence of the associated word indicates that the review is positive, and similarly with negative
coefficients. In this way, we can use these coefficients to find out what word-rating associations our
SVM has learned.
Using C = 0.1 (for consistency with our results), train an SVM on X train, Y train. On this
trained SVM, find the top 4 most positive coefficients and the top 4 most negative coefficients of ¯θ.
Using the dictionary created on the training data dictionary binary, find the words that these
coefficients correspond to, and report both the coefficients and the corresponding words. As before, you
may choose to use the helper function select classifier.
Positive Coefficient Word
Negative Coefficient Word
3.2 Hyperparameter Selection for a Quadratic-Kernel SVM [10 pts]
Similar to the hyperparameter selection for a linear-kernel SVM, you will perform hyperparameter selection for a quadratic-kernel SVM. Here we are assuming a kernel of the form (¯x · x¯
0 + r)
, where r is a
(a) Implement the select param quadratic(X, y, k=5, metric=‘accuracy’, param range
= []) function to choose a setting for C and r as in the previous part. Your function should call your
CV function (implemented earlier) passing in instances of SVC(kernel=‘poly’, degree=2,
C=c, coef0=r, class weight=‘balanced’) with the same range of C that we use in 3.1(b).
You should also use the same range for r.
Again, you may choose to use the helper function select classifier. The function argument
param range should be a matrix with two columns with first column as values of C and second
column as values of r. You will need to try out the range of parameters via two methods:
i) Grid Search: In this case, we look at all the specified values of C in a given set and choose the best
setting after tuning. For this question, the values should be considered in powers of 10 for both C
(between 10−3
and 103
) and r (between 10−3
and 103
) [A total of 49 pairs]. This code will take a
substantial time to run (our project solution runs in about 17 minutes on our test computer).
ii) Random Search: Instead of assigning fixed values for the parameter C and r, we can also sample
random values from a particular distribution. For this case, we will be sampling from a log-uniform
distribution, i.e., the log of random variables follows a uniform distribution:
P[a ≤ log C ≤ b] = k(b − a)
for some constant k so the distribution is valid. In other words, we sample a uniform distribution
with the same range as above to yield exponents xi
, and corresponding values of C = 10xi
In your case, the values should range from the powers of 10 which you used in part (i). Choose 25
pairs of such sampled pairs of (C, r). Again, this code may take time to run (our solution runs in
about 12 minutes on our test computer).
(b) Finally, using the training data from question 2 and the function implemented here, find the best values
for C and r for AUROC and both tuning schemes mentioned above. Report your findings in tabular
format. The table should have four columns: Tuning Scheme, C, r and AUROC. Again, in the case of
ties, report the lower C and the lower r values that perform the best. Your table should look be similar
to the following:
Tuning Scheme C r AUROC
Grid Search
Random Search
How does the 5-fold CV performance vary with C and r? Also, is the use of random search better than
grid search? Give reasons for your conclusion.
3.3 Learning Non-linear Classifiers with a Linear-Kernel SVM [5 pts]
Here, we will explore the use of an explicit feature mapping in place of a kernel. (Note: you do not need to
write any code for question 3.3)
(a) Describe a feature mapping, φ(¯x), that maps your data to a feature space similar to the one implied by
the quadratic kernel from the question above.
(b) Instead of using a quadratic-kernel SVM, we could simply map the data to this higher dimensional
space via this mapping and then learn a linear classifier in this higher-dimensional space. What are the
tradeoffs (pros and cons) of using an explicit feature mapping over a kernel? Explain.
3.4 Linear-Kernel SVM with L1 Penalty and Squared Hinge Loss [5 pts]
In this part of the project, you will explore the use of a different penalty (i.e., regularization term) and a
different loss function. In particular, we will use the L1 penalty and squared hinge loss which corresponds
to the following optimization problem.
θ,b ¯
||¯θ||1 + C
¯θ · x¯
(i) + b))
where loss(z) = max{0,(1 − z)}
. We will make use of the LinearSVC() class, which uses the
squared hinge loss and allows us to specify the penalty. We will consider only a linear-kernel SVM
and the original (untransformed) features. When calling LinearSVC please use the following settings:
LinearSVC(penalty=‘l1’, dual=False, C=c, class weight=’balanced’). As always,
you may implement and use the helper function select classifier to instantiate your SVM classifier.
(a) Using the training data from question 2 and 5-fold CV, find the best setting for C that maximizes the
AUROC using grid search CV with range specified in 3.1(b). When we say ”grid search” here, we mean
searching a one-dimensional grid as the only hyperparameter that is changing is C. In the case of ties,
report the lower C value. Report your findings.
(b) Similar to 3.1(e), plot the L0-norm of the learned parameter ¯θ against C using complete training data
and no cross-validation. You should be able to re-use the function plot weight with different input
parameters without writing additional code. Include the plot in your write-up.
(c) Beyond any change in performance you may notice, what effect does the L1 penalty have on the optimal
solution? (Hint: Have a careful look at the plots you generated!)
(d) Note that using the Squared Hinge Loss (as opposed to the Hinge Loss) changes the objective function
as shown above. What effect do you expect this will have on the optimal solution?
4 Asymmetric Cost Functions and Class Imbalance [20 pts]
The training data we have provided you with so far is balanced: the data contain an equal number of positive
and negative ratings (500 ratings per class). But this is not always the case. In many situations, you are given
imbalanced data, where one class may be represented more than the others.
In this section, you will investigate the objective function of the SVM, and how we can modify it to fit
situations with class imbalance. Recall that the objective function for an SVM in scikit-learn is as follows:
θ,b,ξ ¯
+ C
subject to y

¯θ · φ(¯x
) + b
≥ 1 − ξi
ξi ≥ 0, ∀i = 1, 2, 3, …, n
We can modify it in the following way:
θ,b,ξ ¯
+ Wp ∗ C
ξi + Wn ∗ C
subject to y

¯θ · φ(¯x
) + b
≥ 1 − ξi
ξi ≥ 0, ∀i = 1, 2, 3, …, n
where P
(i)=1 is a sum over all indices i where the corresponding point is positive y
(i) = 1. Similarly,
is a sum over all indices i where the corresponding point is negative y
(i) = −1.
4.1 Arbitrary class weights [5 pts]
Wp and Wn are called “class weights” and are built-in parameters in scikit-learn.
(a) Describe how this modification will change the solution. If Wn is much greater than Wp, what does this
mean in terms of classifying positive and negative points?
(b) Create a linear-kernel SVM with hinge loss and L2-penalty with C = 0.01. This time, when calling
SVC, set class weight={-1: 10, 1: 1}, or implement and use your select classifier
helper function. This corresponds to setting Wn = 10 and Wp = 1. Train this SVM on the training
data X train, y train. Report the performance of the modified SVM on the test data X test,
y test for each metric below.
Performance Measures Performance
(c) Also, answer the following: Compared to your work in question 3.1(d), which performance measures
were affected the most by the new class weights? Why do you suspect this is the case?
Note: We set C = 0.01 to ensure that interesting trends can be found regardless of your work in
question 3. This may mean that your value for C differs in 4 and 3.1(d). In a real machine learning
setting, you’d have to be more careful about how you compare models.
4.2 Imbalanced data [5 pts]
You just saw the effect of arbitrarily setting the class weights when our training set is already balanced. Let’s
return to the class weights you are used to: Wn = Wp. We turn our attention to class imbalance. Using the
functions you wrote in part 2, we have provided you with a second feature matrix and vector of binary labels
IMB features, IMB labels. This class-imbalanced data set has 800 positive points and 200 negative
points. It also comes with a corresponding test feature matrix and label vector pair IMB test features,
IMB test labels, which has the same class imbalance.
(a) Create a linear-kernel SVM with hinge loss, L2-penalty and as before, C = 0.01. Set class weight={-1:
1, 1: 1}, which returns the SVM to the formulation you have seen in class. Now train this SVM
on the class-imbalanced data IMB features, IMB labels provided.
Use this classifier to predict the provided test data IMB test features, IMB test labels and
report the accuracy, specificity, sensitivity, precision, AUROC, and F1-Score of your predictions:
Class Weights Performance Measures Performance
Wn = 1, Wp = 1 Accuracy
Wn = 1, Wp = 1 Sensitivity
Wn = 1, Wp = 1 Specificity
Wn =?, Wp =? Precision
Wn = 1, Wp = 1 AUROC
Wn =?, Wp =? F1-Score
(b) How has training on an imbalanced data set affected performance?
4.3 Choosing appropriate class weights [5 pts]
(a) Now we will return to setting the class weights given the situation we explored in part 4.2. Using what
you have done in the preceding parts, find an appropriate setting for the class weights that mitigates
the situation in part 4.2 and improves the classifier trained on the imbalanced data set. That is, find
class weights that give a good mean cross validation performance, (Think: which performance metric(s)
are informative in this situation, and which metric(s) are less meaningful? Make sure the metric you
use for cross-validation is a good choice given the imbalanced class weights). Report here your strategy for choosing these parameters. This question requires you to choose hyperparameters based on
cross-validation; you should not be using the test data to choose hyperparameters.
(b) Use your custom classifier to predict the provided test data IMB test features, IMB text labels
again, and report the accuracy, specificity, sensitivity, precision, AUROC, and F1-Score of your predictions:
Class Weights Performance Measures Performance
Wn =?, Wp =? Accuracy
Wn =?, Wp =? Sensitivity
Wn =?, Wp =? Specificity
Wn =?, Wp =? Precision
Wn =?, Wp =? AUROC
Wn =?, Wp =? F1-Score
4.4 The ROC curve [5 pts]
(a) Given the above results, we are interested in investigating the AUROC metric more. First, provide a plot
of the ROC curve with labeled axes for both Wn = 1, Wp = 1 and your custom setting of Wn, Wp from
above. Put both curves on the same set of axes. Make sure to label the plot in a way that indicates which
curve is which.
(b) Test your understanding: You’ve seen that class weights can help improve the performance of a classifier trained on imbalanced data. Based on what you observed in part (a) of this question, briefly propose
an alternative method for “fixing” a classifier that is trained on imbalanced data. That is, what could
you do to create a classifier that achieves similar results to the classifier in 4.3(b) without changing the
class weights?
5 Challenge [20 pts]
Now, a challenge: in the previous problems, we had transformed the data into a binary dataset by combining
multiple ratings to generate two labels.
For this challenge, you will consider the original multiclass ratings of the reviews. We have already prepared
a held-out test set heldout features for this challenge, and multiclass training data multiclass features,
multiclass labels. This training data has 3000 reviews, 1000 of each class. You must work only
with the provided data; acquiring new data to train your model is not permitted. Your goal is to train
a multiclass classifier using SVC and LinearSVC classes to predict the true ratings of the held-out test set,
i.e., you will train your model on multiclass features and test on heldout features.
Note that the class balance of this training set matches the class balance of the heldout set. Also note that,
given the size of the data and the feature matrix, training may take several minutes.
In order to attempt this challenge, we encourage you to apply what you have learned about hyperparameter
selection and consider the following extensions:
1. Try different feature engineering methods. The bag-of-words models we have used so far are
simplistic. There are other methods to extract different features from the raw data, such as:
(a) Using a different method for extracting words from the reviews
(b) Using only a subset of the raw features
(c) Using the number of times a word occurs in a review as a feature (rather than binary 0, 1 features
indicating presence)
(d) Scaling or normalizing the data
(e) Alternative feature representations
2. Read about one-vs-one and one-vs-all. These are the two standard methods to implement multiclass
classifier using binary classifiers. You should understand the differences between them and implement
at least one.
You will have to save the output of your classifier into a csv file using the helper function
generate challenge labels(y, uniqname) we have provided. The base name of the output
file must be your Uniqname followed by the extension csv. For example, the output filename for a user
with Uniqname foo would be foo.csv. This file will be submitted according to the instructions at the
end of the file. You may use the file test to ensure that your output has the correct format. To run this file, simply run python test -i Uniqname.csv, replacing the file
Uniqname.csv with your generated output file.
We will evaluate your performance in this challenge based on two components:
1. Write-Up and Code [10 pts]: We will evaluate how much effort you have applied to attempt this
challenge based on your write-up and code. Ensure that both are present. Within your write-up,
you must provide discussions of the choices you made when designing the following components of
your classifier:
• Feature engineering
• Hyperparameter selection
• Algorithm selection (e.g., quadratic vs. linear kernel)
• Multiclass method (e.g., one-vs-rest vs. one-vs-all)
• Any techniques you used that go above and beyond current course material
2. Test Scores [10 pts]: We will evaluate your classifier based on accuracy. Consider the following
confusion matrix:
-1 0 1
-1 x1
0 x2
1 y1 x3
where each column corresponds to the actual class and each row corresponds to the predicted class.
For instance, y1 in the matrix above is the number of reviews with true rating −1 (poor), but are
classified as a review with rating 1 (good) by your model. The accuracy for a multiclass classification
problem is defined as follows:
accuracy =
x1 + x2 + x3
where n is the number of samples.
REMEMBER to submit your project report by 9:00pm ET on February 5, 2018 to Gradescope.
Include your code as an appendix (copy and pasted) in your report. Please try to format lines of code
so they are visible within the pages.
Upload your file uniqname.csv containing the label predictions for the held-out data at this link providing your email.
Appendix A: Approximate Run-times for Programming Problems
• Problem 3.1 c: around 12 minutes
• Problem 3.2 a i: around 17 minutes
• Problem 3.2 a ii: around 12 minutes
• Problem 4.3 b: around 5-8 minutes
N.B. these are approximate times, not exact. Different computers will result in different run-times, so do not
panic if yours is a little different. Algorithmic optimization can also improve run-time noticeably in certain
cases. However, if it is taking more than twice as long, something might be wrong.
Appendix B: Lecture Topics and Concepts
• Problem 3.1 a, b, e, f, Problem 3.4 c, d
– 01/22: Lec 5: Support Vector Machines; Primal Formulation; Geometric Margin
• Problem 3.2 a, Problem 3.3 a, Problem 4.1 a
– 01/24: Lec 6: Dual Formulation; Kernels
• Problem 3.1 c, d, Problem 3.2 b, Problem 3.3 b, Problem 3.4 a, b, Problem 4.1 b, c, Problem 4.2
a, b, Problem 4.3 a, b, Problem 4.4 a, b
– 01/29: Lec 7: Performance Measures; Dimensionality Reduction
The date provided for a problem indicates the last lecture in which relevant material is discussed. Some
problems can be completed before their listed date.