CS 5350/6350: Machine Learning Homework 4


Rate this product

CS 5350/6350: Machine Learning
Homework 4
1 Paper Problems [40 points + 10 bonus]
1. [9 points] The learning of soft SVMs is formulated as the following optimization
⊤w + C
s.t. ∀1 ≤ i ≤ N, yi(w
⊤xi + b) ≥ 1 − ξi
ξi ≥ 0
where N is the number of the training examples. As we discussed in the class, the slack
variables {ξi} are introduced to allow the training examples to break into the margin
so that we can learn a linear classifier even when the data is not linearly separable.
(a) [3 point] What values ξi can take when the training example xi breaks into the
(b) [3 point] What values ξi can take when the training example xi stays on or outside
the margin?
(c) [3 point] Why do we incorporate the term C ·
in the objective function?
What will happen if we throw out this term?
(a) When a training example xi breaks into the margin, it means that the example
is either on the wrong side of the margin or has crossed over to the wrong side of the
decision boundary. Therefore, ξi ≥ 0.
If xi
is on the wrong side of the decision boundary, ξi ≥ 1, since yi(wT xi + b) < 0
making 1 − ξi < 0. The further into the margin or the wrong side xi
is, the larger the
ξi gets/.
(b) For training examples that are correctly classified and either on the margin or
outside the margin (but on the correct side), ξi ≈ 0. In this situation, yi(wT xi +b) ≥ 0
and therefore 1 − ξi ≤ 1, which holds true if ξi = 0.
(c) The term C·
is added to the objective function to penalize the misclassification
or the margin violations. C is a regularization parameter that controls the trade-off
between maximizing the margin and minimizing the classification error.
If we remove this term, the optimization would solely focus on maximizing the margin
without considering the classification errors or violations of the margin, i.e. the SVM
would revert to a hard-margin SVM, which requires perfect linear separability of the
data. This would likely result in a model that does not generalize well to unseen data,
especially if the training data is not linearly separable.
2. [6 points] Write down the dual optimization problem for soft SVMs. Please clearly
indicate the constraints, and explain how it is derived. (Note: do NOT directly copy
slides content, write down your own understanding.)
Solution. The dual problem is derived from the primal problem (mentioned in the
previous problem) by applying the Lagrange multipliers method. The primal problem
involves direct optimization over the weights w and bias b, along with the slack variables
. In the dual problem, we use Lagrange multipliers for the constraints of the primal
problem, leading to an optimization problem in terms of these multipliers.
The dual problem for a soft-margin SVM can be formulated as the following:
αi −
yiyjαiαj ⟨xi
, xj ⟩] (1)
with the following constraints:
αiyi = 0
0 ≤ αi ≤ C, ∀i = 1, . . . , N
Here αi are the Lagrange multipliers for the training examples, and C is the regularization parameter from the primal problem.
Constraint explanation:
The constraint PN
i=1 αiyi = 0 comes from the condition for the optimality regarding
the bias term b in the primal problem. It ensures that the solution lies in the decision
boundary hyperplane.
The constraints 0 ≤ αi ≤ C are derived from the primal problem’s constraints related
to the slack variables ξi
. They ensure that each multiplier respects the margin violation
penalty and the regularization parameter C.
1. Lagrangian Formulation: Start with the primal problem and introduce Lagrange
multipliers for each of the constraints. This results in the Lagrangian, which is a
function of w, b, {ξi}, and the Lagrange multipliers.
2. Conditions for Optimality: Compute the partial derivatives of the Lagrangian with
respect to w, b, and {ξi}, and set them to zero. These are the KKT (Karush-KuhnTucker) conditions for optimality.
3. Substituting Back into the Lagrangian: Substitute the expressions obtained from
the conditions for optimality back into the Lagrangian. This step eliminates w, b, and
{ξi} from the problem, resulting in a formulation that depends only on the Lagrange
multipliers αi
4. Dual Problem: What results is the dual problem, which is a maximization problem
in terms of the αis. This dual problem is easier to solve computationally, especially
when dealing with kernels for non-linear classification.
The dual formulation is particularly powerful because it allows the use of kernel functions to enable SVMs to perform non-linear classification. This approach is computationally more efficient, especially for large datasets, and it inherently supports the
extension to non-linear classifiers through the kernel trick.
3. [10 points] Continue with the dual form. Suppose after the training procedure, you
have obtained the optimal parameters.
(a) [4 points] What parameter values can indicate if an example stays outside the
(b) [6 points] if we want to find out which training examples just sit on the margin
(neither inside nor outside), what shall we do? Note you are not allowed to
examine if the functional margin (i.e., yi(w⊤xi + b)) is 1.
(a) The value of the Lagrange multiplier αi
for each training example xi
indicates its
position relative to the margin. For a training example that lies outside the margin
(correctly classified and not a support vector), the corresponding Lagrange multiplier αi
will be 0. This is because such examples do not contribute to the decision boundary in
SVMs. In other words, these are the examples that the model is most confident about,
lying beyond the reach of the margin and not affecting the position or orientation of
the decision boundary.
(b) To find out which training examples are exactly on the margin without examining
the functional margin yi(wT xi +b), we can look at the values of αi
. Training examples
that sit exactly on the margin are the support vectors that define the decision boundary.
They are characterized by Lagrange multipliers αi that are greater than 0 but less
than C, i.e. 0 < αi < C. The reason for this is that these examples are marginally
compliant with the classification condition. They are pivotal in defining the margin
and the decision boundary. The constraint 0 < αi < C is a direct consequence of the
slack variables ξi being exactly 0 for these points (since they are on the margin). and
the dual formulation constraints.
4. [6 points] How can we use the kernel trick to enable SVMs to perform nonlinear
classification? What is the corresponding optimization problem?
Solution. Using the kernel trick in SVMs to enable nonlinear classification involves
modifying the dual optimization problem. The key idea is to replace the linear inner
product ⟨xi
, xj ⟩ in the dual problem with a kernel function K(xi
, xj ). This kernel
function implicitly maps the data to a higher-dimensional space where it might be
linearly separable.
The dual optimization problem with the kernel trick for nonlinear classification is
formulated as follows:
i=1 αi −
i,j=1 yiyjαiαjK(xi
, xj )]
with the following constraints:
αiyi = 0
0 ≤ αi ≤ C, ∀i = 1, . . . , N
where K(xi
, xj )) is the kernel function, which computes the dot product between xi
and xj
in the transformed feature space.
5. [9 points] Suppose we have the training dataset shown in Table 1. We want to learn a
SVM classifier. We initialize all the model parameters with 0. We set the learning rates
for the first three steps to {0.01, 0.005, 0.0025} and hyperparameter C = 1. Please list
the sub-gradients of the SVM objective w.r.t the model parameters for the first three
steps, when using the stochastic sub-gradient descent algorithm.
x1 x2 x3 y
0.5 −1 0.3 1
−1 −2 −2 −1
1.5 0.2 −2.5 1
Table 1: Dataset
Solution. The primal form of the linear SVM objective function is:
||w||2 + C
max(0, 1 − yi(w
T xi + b))
The sub-gradients of the SVM objective with respect to the model parameters w and
b for the first three steps of the stochastic sub-gradient descent algorithm:
1. Step 1 learning rate = 0.01:
– Sub-gradient w.r.t w: [−0.5, 1.0, −0.3]
– Sub-gradient w.r.t b: −1.0
2. Step 2 learning rate = 0.005:
– Sub-gradient w.r.t w: [−0.995, −2.01, −1.997]
– Sub-gradient w.r.t b: 1.0
3.. Step 3 learning rate = 0.005:
– Sub-gradient w.r.t w: [−1.490025, −0.199995, 2.512985]
– Sub-gradient w.r.t b: −1.0
6. [Bonus][10 points] Let us derive a dual form for Perceptron. Recall, in each step of
Perceptron, we add to the current weights w (including the bias parameter) yixi
some misclassified example (xi
, yi). We initialize w with 0. So, instead of updating
w, we can maintain for each training example i a mistake count ci — the number of
times the data point (xi
, yi) has been misclassified.
• [2 points] Given the mistake counts of all the training examples, {c1, . . . , cN },
how can we recover w? How can we make predictions with these mistake counts?
• [3 points] Can you develop an algorithm that uses mistake counts to learn the
Perceptron? Please list the pseudo code.
• [5 points] Can you apply the kernel trick to develop an nonlinear Perceptron? If
so, how do you conduct classification? Can you give the pseudo code for learning
this kernel Perceptron?
(a) Given the mistake counts {c1, . . . , cN } for all training examples, the weight vector w
can be the following: w =
i=1 ciyixi
. To make predictions for a new example x, we use
the sign of wT x, which is equivalent to the sign of PN
i=1 ciyi(x
i x)
(b) The pseudo code for a Perceptron algorithm that uses mistake counts is shown below.
Data: ci = 0 for all examples
for each epoch or until convergence do
for each training example (xi
, yi) do
if yi∗ (sum over j(cj ∗ yj ∗ dot(xj
, xi))) ← 0 then
ci = ci + 1
Algorithm 1: Problem b
(c) Applying the kernel trick to develop a nonlinear Perceptron involves using a kernel
function K(xi
, xj ) to compute the dot product in a transformed feature space. This allows
the algorithm to learn nonlinear decision boundaries.
To conduct classification in this setting, the following expression is used:
, x)
The pseudo code for learning this kernel Perceptron would be:
Data: ci = 0 for all examples
Choose a kernel function K
for each epoch or until convergence do
for each training example (xi
, yi) do
if yi∗ (sum over j(cj ∗ yj ∗ K(xj
, xi))) ← 0 then
ci = ci + 1
Algorithm 2: Problem a
In this algorithm, the kernel function K allows us to implicitly map the data into a higherdimensional space and compute dot products in this space without explicitly transforming
the data points. This approach can effectively capture complex patterns and nonlinear
relationships in the data.
2 Practice [60 points + 10 bonus ]
1. [2 Points] Update your machine learning library. Please check in your implementation
of Perceptron, voted Perceptron and average Perceptron algorithms. Remember last
time you created the folders “Perceptron”. You can commit your code into the corresponding folders now. Please also supplement with concise descriptions
about how to use your code to run these algorithms (how to call the command, set the
parameters, etc). Please create a new folder “SVM” in the same level as these folders.
Solution. The changes made to the GitHub repository, can be seen in said GitHub
2. [28 points] We will first implement SVM in the primal domain with stochastic subgradient descent. We will reuse the dataset for Perceptron implementation, namely,
“” in Canvas. The features and labels are listed in the file “classification/datadesc.txt”. The training data are stored in the file “classification/train.csv”, consisting
of 872 examples. The test data are stored in “classification/test.csv”, and comprise
of 500 examples. In both the training and test datasets, feature values and labels are
separated by commas. Set the maximum epochs T to 100. Don’t forget to shuffle the
training examples at the start of each epoch. Use the curve of the objective function
(along with the number of updates) to diagnosis the convergence. Try the hyperparameter C from {
873 ,
873 ,
873 }. Don’t forget to convert the labels to be in {1, −1}.
(a) [12 points] Use the schedule of learning rate: γt =
1+ γ0
. Please tune γ0 > 0 and
a > 0 to ensure convergence. For each setting of C, report your training and test
Solution. Values for γ0 and a when converging first happened for each of the C
C = 0.1145475372279496, γ0 = 0.1, a = 1
C = 0.572737686139748, γ0 = 0.1, a = 1
C = 0.8018327605956472, γ0 = 0.1, a = 1
C Training Error Testing Error
873 0.15252293577981652 0.17
873 0.07110091743119266 0.074
873 0.06995412844036697 0.086
Table 2: learning rate: γt =
1+ γ0
The rest of the solution can be seen in Table 2.
(b) [12 points] Use the schedule γt =
. Report the training and test error for each
setting of C.
Solution. Values for γ0 and a when converging first happened for each of the C
C = 0.1145475372279496, γ0 = 0.1, a = 1
C = 0.572737686139748, γ0 = 0.1, a = 1
C = 0.8018327605956472, γ0 = 0.1, a = 1
The rest of the solution can be seen in Table 3.
(c) [6 points] For each C, report the differences between the model parameters learned
from the two learning rate schedules, as well as the differences between the training/test errors. What can you conclude?
C Training Error Testing Error
873 0.3795871559633027 0.378
873 0.30160550458715596 0.284
873 0.22821100917431192 0.23
Table 3: learning rate: γt =
Parameter Differences:
The parameter differences vary with different values of C. Specifically, for 500
873 , the
parameter difference is the highest, indicating that the model parameters are more
sensitive to this particular value of C when compared to the other values. For C =
and C =
873 , the parameter differences are lower, suggesting more stability in the
learned model parameters at these regularization strengths.
Training and Testing Error Differences:
The training and testing error differences also vary with C. The error differences are
relatively higher for C =
873 and lower for C =
873 . This indicates that the model’s
performance is more variable at the intermediate regularization strength (C =
873 ) and
more stable at the higher regularization strength (C =
873 ).
The higher error difference at C =
873 suggests that this particular regularization
strength may not be as effective in terms of achieving consistent performance across
different learning rate schedules.
C Parameter Difference Training Error Difference Testing Error Difference
873 0.23058077240441846 0.2270642201834862 0.208
873 0.4184984035737148 0.23050458715596328 0.20999999999999996
873 0.22365662239461243 0.15825688073394495 0.14400000000000002
Table 4: Comparison
3. [30 points] Now let us implement SVM in the dual domain. We use the same
dataset, “”. You can utilize existing constrained optimization libraries.
For Python, we recommend using “scipy.optimize.minimize”, and you can learn how
to use this API from the document at
reference/generated/scipy.optimize.minimize.html. We recommend using SLSQP
to incorporate the equality constraints. For Matlab, we recommend using the internal function “fmincon”; the document and examples are given at https://www. For R, we recommend using the “nloptr”
package with detailed documentation at
(a) [10 points] First, run your dual SVM learning algorithm with C in {
873 ,
873 ,
873 }.
Recover the feature weights w and the bias b. Compare with the parameters
learned with stochastic sub-gradient descent in the primal domain (in Problem
2) and the same settings of C, what can you observe? What do you conclude
and why? Note that if your code calculates the objective function with a double
loop, the optimization can be quite slow. To accelerate, consider writing down the
objective in terms of the matrix and vector operations, and treat the Lagrange
multipliers that we want to optimize as a vector! Recall, we have discussed about
it in our class.
Weights and Bias: The weights and bias obtained from the dual domain implementation are quite different from those obtained via stochastic sub-gradient descent.
This difference is expected since the primal and dual forms tackle the optimization problem from different perspectives. The primal form directly optimizes the
decision boundary, while the dual form focuses on the Lagrange multipliers, which
are then used to compute the weights and bias. The magnitudes and directions of
the weights can vary significantly, especially depending on the nature of the data
and the model complexity.
Errors: There is a notable difference in the training and test errors between the
two methods. The stochastic sub-gradient descent method shows varied errors
based on the learning rate schedules, while the dual domain implementation’s errors are consistent across different values of C. However, overall the dual domain
implementation errors are higher than the ones of the single domain. It’s important to note that the dual implementation may be more sensitive to the choice of
kernel and hyperparameters.
In summary, while both approaches aim to solve the same problem, the differences
in methodology lead to variations in the model parameters and performance. The
choice between them can depend on factors like the nature of the data, the need
for a linear or non-linear decision boundary, and the specific requirements of the
application at hand.
(b) [15 points] Now, use Gaussian kernel in the dual form to implement the nonlinear SVM. Note that you need to modify both the objective function and the
prediction. The Gaussian kernel is defined as follows:
, xj ) = exp(−
∥xi − xj∥
Test γ from {0.1, 0.5, 1, 5, 100} and the hyperparameter C from {
873 ,
873 ,
873 }.
List the training and test errors for the combinations of all the γ and C values.
What is the best combination? Compared with linear SVM with the same settings
of C, what do you observe? What do you conclude and why?
Solution. The comparison of all the combinations of different C and γ0 values
can be seen in Table 5.
Effect of γ and C: In the nonlinear SVM, the training and test errors vary significantly with different combinations of γ and C. Particularly, lower values of γ (e.g.,
0.1 and 0.5) generally yield better test performance, especially for C = 500/873
and C = 700/873. In contrast, as γ increases, the test error tends to increase,
indicating potential overfitting.
Comparison with Linear SVM: The linear SVM shows relatively consistent performance across different values of C, but the errors are generally higher compared
to the best-performing settings of the nonlinear SVM. This suggests that the nonlinear SVM can capture complex patterns in the data more effectively than the
linear SVM.
Best Performing Combination: For the nonlinear SVM, the best combination
appears to be C = 500/873 or C = 700/873 with γ = 0.5 or γ = 0.1, yielding the
lowest test errors.
Conclusions: Nonlinearity Advantage: The nonlinear SVM with a Gaussian kernel
can model complex, non-linear decision boundaries more effectively than the linear
SVM, leading to better generalization in certain cases.
Hyperparameter Sensitivity: The nonlinear SVM is sensitive to the choice of γ and
C. Proper tuning of these parameters is crucial for achieving optimal performance.
Risk of Overfitting: Higher values of γ in the nonlinear SVM lead to overfitting,
as indicated by increased test errors. This shows the importance of balancing the
model’s complexity with its ability to generalize.
The Gaussian kernel in the nonlinear SVM allows for capturing nonlinear relationships in the data, which the linear SVM cannot model. The variance parameter
γ in the Gaussian kernel controls the level of nonlinearity. Lower γ values lead
to smoother decision boundaries, which can generalize better for certain datasets.
The linear SVM’s performance is limited by its assumption of linear separability,
making it less effective for datasets with complex, nonlinear patterns.
(c) [5 points] Following (b), for each setting of γ and C, list the number of support
vectors. When C =
873 , report the number of overlapped support vectors between
consecutive values of γ, i.e., how many support vectors are the same for γ = 0.01
and γ = 0.1; how many are the same for γ = 0.1 and γ = 0.5, etc. What do you
observe and conclude? Why?
The number of support vectors and their overlap between different γ values provide insight into how the model complexity and decision boundary change with γ.
A high number of support vectors typically indicates a more complex model, potentially leading to overfitting, especially for higher values of γ (narrower Gaussian
kernel). Overlap in support vectors between different γ values suggests similarity
in the decision boundaries learned by the model for those values. If the overlap
is high, it indicates that the decision boundary does not change significantly between those γ values. Low overlap might indicate a more significant change in
the model’s behavior as γ changes.
(d) [Bonus] [10 points] Implement the kernel Perceptron algorithm you developed in
Problem 8 (Section 1). Use Gaussian kernel and test γ from {0.1, 0.5, 1, 5, 100}.
List the training and test errors accordingly. Compared with the nonlinear SVM,
C γ0 Training Error Testing Error
873 0.1 0.045871559633027525 0.044
873 0.5 0.1559633027522936 0.2
873 1 0.2775229357798165 0.332
873 5 0.30619266055045874 0.454
873 100 0.0 0.528
873 0.1 0.03784403669724771 0.024
873 0.5 0.008027522935779817 0.008
873 1 0.005733944954128441 0.012
873 5 0.18463302752293578 0.408
873 100 0.0 0.532
873 0.1 0.10321100917431193 0.076
873 0.5 0.0 0.002
873 1 0.0011467889908256881 0.002
873 5 0.0 0.22
873 100 0.0 0.532
Table 5: Comparison 2
what do you observe? what do you conclude and why?
Model Complexity and Regularization: The kernel Perceptron, lacking a regularization term, might be expected to overfit more than the SVM. However, in this
case, it generalizes well, likely due to the nature of the dataset and the effectiveness of the Gaussian kernel at capturing its structure.
Robustness to Hyperparameters: The kernel Perceptron appears more robust to
changes in γ compared to the nonlinear SVM, which may be more sensitive to the
choice of both C and γ.
Suitability for Different Data Types: The effectiveness of the kernel Perceptron
suggests that it can be a strong candidate for certain types of datasets, particularly
where the decision boundary is complex but can be effectively captured by the
chosen kernel.
Trade-offs in Model Selection: The choice between a kernel Perceptron and an
SVM would depend on the specific dataset, the need for regularization, and the
sensitivity to hyperparameters. The SVM’s regularization term can be crucial for
preventing overfitting, especially in cases with more noise or less clear-cut decision
The kernel Perceptron’s simplicity and lack of regularization make it less prone
to the kind of overfitting that might be expected. Its effectiveness on this dataset
indicates that the Gaussian kernel is well-suited to capturing the underlying data
structure. The nonlinear SVM’s varied performance across different γ and C
values highlights the importance of hyperparameter tuning in achieving optimal
performance. The regularization term in SVM helps control overfitting but also
introduces sensitivity to C and γ.
γ0 Training Error Testing Error
0.1 0.0 0.002
0.5 0.0 0.004
1 0.0 0.004
5 0.0 0.002
100 0.0 0.008
Table 6: Bonus


There are no reviews yet.

Be the first to review “CS 5350/6350: Machine Learning Homework 4”

Your email address will not be published. Required fields are marked *

Scroll to Top