HW 6 Margins and Slack Variables


Rate this product

1. Margins and Slack Variables (5 points)
Let (w, b, ξ) be the solution of the optimal soft-margin hyperplane based on training data
(x1, y1), . . . ,(xn, yn). Show that if xi
is a margin error (see notes), then the distance from xi
to the hyperplane
{x : yi(wTx + b) = 1}
is proportional to ξi
, and give the constant of proportionality.
2. Support Vector Machines and Cross Validation (10 points)
Download the file diabetes scaled.mat from Canvas. This contains variables X and y for a
classification problem. There are 768 labeled feature vectors. The features are
1. Number of times pregnant
2. Plasma glucose concentration
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
The binary labels are
Binary label:
+1 = tested positive for diabetes
-1 = non-positive
Reserve the last 268 instances for testing. The goal is to minimize the probability of error on
the test data. On the 500 training examples, train an SVM using the Cauchy kernel
k(u, v) = ?
1 +
ku − vk
using a 5-fold cross validation estimate of the probability of error to select the parameters C
and σ. Report the selected parameters, the cross-validation error estimate at those parameters, the test error, and the number of support vectors of the final classifier. Also, submit
your code.
• Matlab users: we have provided you with the file smo.m, which implements the SMO
algorithm mentioned in the notes. Type help smo to see how it works. If you are
curious, take a look “under the hood” and see if you can understand what it is doing.
Alteratively, Matlab provides an SVM solver in one of its toolboxes, or in Python, you
can use the scikit-learn library and run “from sklearn.svm import SVC.” Details of this
function can be found at
• For consistency of everyone’s solutions, please search over the following grid of parameter
values (in Matlab notation):
grid_sigma = 2.^(0:5);
grid_C = 2.^(6:11);
So you are searching over 36 parameter combinations.
• SVM solvers often have a tolerance parameters that determines the algorithm’s termination criterion. For example, the provided smo.m routine has a tolerance parameter that
yields good results when set to 0.01. Setting the parameter too small can lead to long
run times, and setting it to large can lead to poor convergence. Try to find a middle
• Please don’t use randomization to select the folds, even though this is generally a good
idea. These data have already been randomized. This way everyone should get a similar
• Please also don’t worry about making sure the class proprtions are the same within each
fold. Once again this is generally a good idea, but in this problem it is not necessary.
3. PCA (5 points each)
Read over the proof of PCA based on the generalized Rayleigh quotient. Then solve the
following problems. The first problem motivates a method for selecting k, as discussed at the
end of the first set of notes on PCA.
a. Let k ∈ {0, 1, . . . , d} be arbitrary. Show that
kxi − µ − Aθik
2 = n
λj ,
where A ranges over all d × k matrices with orthonormal columns.
Hint: This is easy if you use properties of the trace operator.
b. Give a condition involving the spectral decomposition of the sample covariance matrix
that is both necessary and sufficient for the subspace hAi in PCA to be unique.
4. Eigenfaces (5 points each)
In this exercise you will apply PCA to a modified version of the Extended Yale Face Database
B. The modified database is available in the file yalefaces.mat on Canvas. The modification
I made was simply to reduce the resolution of each image by a factor of 4×4 = 16 to hopefully
avoid computational and memory bottlenecks.
For a whirlwind tour of the data, issue the following commands. In Matlab:
load yalefaces % loads the 3-d array yalefaces
for i=1:size(yalefaces,3)
x = double(yalefaces(:,:,i));
% pause(.1)
In Python:
import numpy as np
import as sio
import matplotlib.pyplot as plt
import time
yale = sio.loadmat(’yalefaces.mat’)
yalefaces = yale[’yalefaces’]
for i in range(0,yalefaces.shape[2]):
x = yalefaces[:,:,i]
ax.imshow(x, extent=[0, 1, 0, 1])
plt.imshow(x, cmap=plt.get_cmap(’gray’))
Uncomment the pause/sleep command to slow things down a bit. What you will see are
several different subjects (38 total) under a variety of lighting conditions.
a. By viewing each image as a vector in a high dimensional space, perform PCA on the full
dataset. Hand in a plot the sorted eigenvalues (use the semilogy command in Matlab;
plt.semilogy in Python) of the sample covariance matrix. How many principal components are needed to represent 95% of the total variation? 99%? What is the percentage
reduction in dimension in each case? Useful commands in Matlab:reshape, eig, svd,
mean, diag, and Python: np.reshape, np.linalg.eig, np.linalg.svd, np.mean,
b. Hand in a 4 × 5 array of subplots showing principal eigenvectors (‘eigenfaces’) 0 through
19 as images, treating the sample mean as the zeroth order principal eigenvector. Comment on what facial or lighting variations some of the different principal components
are capturing. Useful commands in Matlab: subplot, imagesc, colormap(gray), in
Python: plt.imshow(x, cmap=plt.get cmap(’gray’)), and for subplots a useful link
is examples/subplots demo.html
c. Submit your code.
PLEASE NOTE: For uniformity, please do not standardize the features as described in
the “preprocessing” section of the PCA notes.
Not to be turned in.
1. Cross Validation Bias and Variance
Let Z
n = ((X1, Y1), . . . ,(Xn, Yn)) denote training data for a supervised learning problem.
Let θ be a fixed value of the tuning parameter(s), and let fb
θ,Zn denote the model learned by
some supervised learning algorithm when applied to Zn.
Let us define the bias of K-fold cross-validation to be
Bθ,n = EZn [RbCV (fb
θ,Zn )] − EZn [R(fb
θ,Zn )],
and the variance to be
Vθ,n = EZn
RbCV (fb
θ,Zn ) − EZn [RbCV (fb
θ,Zn )]?2
Assume that the sequence en = EZn [R(fb
θ,Zn )] decreases monotonically to a real number
∗ ≥ 0.
(a) Argue that Bθ,n < 0, which means that CV tends to be optimistic. Also argue that the
magnitude of the bias decreases as K increases (for fixed n).
(b) Aruge that CV is asymptotically unbiased, meaning that Bθ,n → 0 as n → ∞ for fixed
(c) Argue that for fixed n, the variance of CV increases as K increases.

Open chat
Need help?
Can we help you?