CS 5350/6350: Machine Learning

Homework 3

1 Paper Problems [36 points + 15 bonus]

1. [8 points] Suppose we have a linear classifier for 2 dimensional features. The classification boundary, i.e., the hyperplane, is 2x1 + 3x2 − 4 = 0 (x1 and x2 are the two input features).

x1   x2   label
 1    1     1
 1   -1    -1
 0    0    -1
-1    3     1

Table 1: Dataset 1


(a) [4 points] Now we have a dataset in Table 1. Does the hyperplane have a margin for the dataset? If yes, what is the margin? Please use the formula we discussed in class to compute it. If no, why not? (Hint: when can a hyperplane have a margin?)

x1   x2   label
 1    1     1
 1   -1    -1
 0    0    -1
-1    3     1
-1   -1     1

Table 2: Dataset 2

(b) [4 points] We have a second dataset in Table 2. Does the hyperplane have a margin for the dataset? If yes, what is the margin? If no, why?

Solution. A hyperplane can have a margin only if it perfectly separates the two classes in the dataset. The margin is defined as the distance from the hyperplane to the nearest point of either class. For a general hyperplane h: w^T x + b = 0, the distance from a point x0 to h is dist(x0, h) = |w^T x0 + b| / ||w||. For the hyperplane 2x1 + 3x2 − 4 = 0 we have weight vector w = [2, 3] and bias term b = −4, so the distance from a point (x1, x2) to the hyperplane is

distance = |2x1 + 3x2 − 4| / √(2² + 3²) = |2x1 + 3x2 − 4| / √13.

(a) To check whether the hyperplane 2x1 + 3x2 − 4 = 0 has a margin for the dataset in Table 1, we compute the distance from each point to the hyperplane and take the minimum of these distances. First, however, we must check the sign of the decision function at each point. If the sign matches the point's label, the point is on the correct side of the hyperplane; if any point is on the wrong side, the hyperplane has no margin for the dataset.

For the point (1, 1) with label 1: 2·1 + 3·1 − 4 = 1 > 0
For the point (1, −1) with label −1: 2·1 + 3·(−1) − 4 = −3 < 0
For the point (0, 0) with label −1: 2·0 + 3·0 − 4 = −4 < 0
For the point (−1, 3) with label 1: 2·(−1) + 3·3 − 4 = 3 > 0

Since the sign of the decision function matches the label for every point, the hyperplane separates the two classes, and we can proceed to compute the distances:

For the point (1, 1) with label 1: distance = |2·1 + 3·1 − 4| / √13 = 1/√13 ≈ 0.28
For the point (1, −1) with label −1: distance = |2·1 + 3·(−1) − 4| / √13 = 3/√13 ≈ 0.83
For the point (0, 0) with label −1: distance = |2·0 + 3·0 − 4| / √13 = 4/√13 ≈ 1.11
For the point (−1, 3) with label 1: distance = |2·(−1) + 3·3 − 4| / √13 = 3/√13 ≈ 0.83

The margin is the minimum of these distances, 1/√13 ≈ 0.28. Therefore, the hyperplane has a margin of 1/√13 ≈ 0.28 for the dataset in Table 1.

(b) To check whether the hyperplane 2x1 + 3x2 − 4 = 0 has a margin for the dataset in Table 2, we again compute the distance from each point to the hyperplane and take the minimum, after first checking the sign of the decision function at each point. If any point is on the wrong side, the hyperplane has no margin for the dataset.

For the point (1, 1) with label 1: 2·1 + 3·1 − 4 = 1 > 0
For the point (1, −1) with label −1: 2·1 + 3·(−1) − 4 = −3 < 0
For the point (0, 0) with label −1: 2·0 + 3·0 − 4 = −4 < 0
For the point (−1, 3) with label 1: 2·(−1) + 3·3 − 4 = 3 > 0
For the point (−1, −1) with label 1: 2·(−1) + 3·(−1) − 4 = −9 < 0

Since the decision function for the point (−1, −1) with label 1 is negative, this point is on the wrong side of the hyperplane. Therefore, the hyperplane does not have a margin for the dataset in Table 2.
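The sign checks and distance computations above are easy to verify numerically. A small sketch (the points and hyperplane are from Tables 1 and 2; the helper names are mine):

```python
import math

w, b = [2, 3], -4  # hyperplane 2*x1 + 3*x2 - 4 = 0

def score(p):
    return w[0] * p[0] + w[1] * p[1] + b

def margin(points):
    # A margin exists only if every point is on its label's side.
    if any(y * score(p) <= 0 for p, y in points):
        return None
    return min(abs(score(p)) for p, y in points) / math.hypot(*w)

table1 = [((1, 1), 1), ((1, -1), -1), ((0, 0), -1), ((-1, 3), 1)]
table2 = table1 + [((-1, -1), 1)]

print(margin(table1))  # 1/sqrt(13), about 0.277
print(margin(table2))  # None: (-1, -1) is misclassified
```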

2. [8 points] Now, let us look at margins for datasets. Please review what we have discussed in the lecture and slides. A margin for a dataset is not a margin of a hyperplane!

x1   x2   label
-1    0    -1
 0   -1    -1
 1    0     1
 0    1     1

Table 3: Dataset 3

(a) [4 points] Given the dataset in Table 3, can you calculate its margin? If you cannot, please explain why.

(b) [4 points] Given the dataset in Table 4, can you calculate its margin? If you cannot, please explain why.

x1   x2   label
-1    0    -1
 0   -1     1
 1    0    -1
 0    1     1

Table 4: Dataset 4

Solution. The margin of a dataset is defined as the maximum margin achievable by any hyperplane. To calculate it, we need to find the weight vector w that maximizes the margin, where the margin for a given weight vector w is the minimum distance from any data point to the decision boundary defined by w.

However, finding the optimal weight vector w that maximizes the margin is a complex optimization problem and is better solved with an algorithm, for example the SVM algorithm. Nevertheless, for simple datasets such as those in Tables 3 and 4, we may be able to find the optimal weight vector and the margin by inspection or by simple geometric arguments.

(a) Looking at the dataset in Table 3, we can see that the data points are linearly separable. (Note that the boundary x1 = 0 does not work: the points (0, −1) and (0, 1) lie on it.) By inspection, the decision boundary x1 + x2 = 0 separates the points with label −1 from the points with label 1. The margin for a given decision boundary is margin = min_i |w^T xi + b| / ||w||; with w = [1, 1] and b = 0, every one of the four points satisfies |x1 + x2| = 1 and so lies at distance 1/√2 from the boundary. The negative points lie on the line x1 + x2 = −1 and the positive points on the parallel line x1 + x2 = 1, which are √2 apart, so no hyperplane can do better. The margin of the dataset is therefore 1/√2 ≈ 0.71.

(b) Looking at the dataset in Table 4, we can see that the data points are not linearly separable. For any candidate w = [w1, w2] and bias b, correctly classifying the two positive points (0, −1) and (0, 1) requires (−w2 + b) > 0 and (w2 + b) > 0, which sum to 2b > 0; correctly classifying the two negative points (−1, 0) and (1, 0) requires (−w1 + b) < 0 and (w1 + b) < 0, which sum to 2b < 0, a contradiction. Therefore, no decision boundary separates the two labels, and the margin is undefined.
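Both answers can be checked numerically. A sketch, using the boundary x1 + x2 = 0 (w = [1, 1], b = 0) for Dataset 3, and a plain Perceptron run on Dataset 4 that never completes a mistake-free epoch, consistent with the impossibility argument:

```python
import math

def margin(points, w, b):
    # Minimum distance to the boundary, or None if any point is misclassified.
    norm = math.hypot(*w)
    dists = []
    for (x1, x2), y in points:
        s = w[0] * x1 + w[1] * x2 + b
        if y * s <= 0:
            return None
        dists.append(abs(s) / norm)
    return min(dists)

data3 = [((-1, 0), -1), ((0, -1), -1), ((1, 0), 1), ((0, 1), 1)]
data4 = [((-1, 0), -1), ((0, -1), 1), ((1, 0), -1), ((0, 1), 1)]

print(margin(data3, [1, 1], 0))  # 1/sqrt(2), about 0.707

# Perceptron on Dataset 4: a mistake-free epoch would mean the data are
# linearly separable, which is impossible for this XOR-like labeling.
w, b = [0.0, 0.0], 0.0
converged = False
for epoch in range(1000):
    mistakes = 0
    for (x1, x2), y in data4:
        if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
            w[0] += y * x1; w[1] += y * x2; b += y
            mistakes += 1
    if mistakes == 0:
        converged = True
        break
print(converged)  # False
```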

3. [Bonus] [5 points] Let us review the Mistake Bound Theorem for Perceptron discussed in our lecture. If we change the second assumption to be as follows: suppose there exists a vector u ∈ R^n and a positive γ such that for each (xi, yi) in the training data, yi(u^T xi) ≥ γ. What is the upper bound for the number of mistakes made by the Perceptron algorithm? Note that u need not be a unit vector.

Solution. The standard Mistake Bound Theorem assumes that there exists a unit vector u such that for all (xi, yi), yi(u^T xi) ≥ γ, where γ > 0. The modification drops the requirement that u be a unit vector. To find the mistake bound in this case, we can normalize u by its length to make it a unit vector and then apply the unmodified theorem.

Given u ∈ R^n (not necessarily a unit vector) and γ > 0 such that yi(u^T xi) ≥ γ, we can define the unit vector v = u/||u||. Then for each (xi, yi):

yi(v^T xi) = yi(u^T xi) / ||u|| ≥ γ / ||u||

Applying the standard theorem with margin γ/||u|| gives the modified mistake bound:

R² / (γ/||u||)² = (R||u|| / γ)²

where R is the maximum norm of any xi in the dataset, i.e., R = max_i ||xi||. Thus, the new mistake bound is proportional to the square of the product of the norm of u and the maximum norm of any input vector xi, divided by the square of γ. Therefore, the upper bound on the number of mistakes the Perceptron algorithm can make under the modified assumption that u need not be a unit vector is (R||u||/γ)².

4. [10 points] We want to use Perceptron to learn a disjunction as follows: f(x1, x2, . . . , xn) = ¬x1 ∨ . . . ∨ ¬xk ∨ xk+1 ∨ . . . ∨ x2k (note that 2k < n). The training set consists of all 2^n Boolean input vectors in the instance space. Please derive an upper bound on the number of mistakes made by Perceptron in learning this disjunction.

Solution. To derive an upper bound on the number of mistakes the Perceptron will make on the given disjunction, we can apply the Mistake Bound Theorem. We need to determine the appropriate values of R and γ for the function f(x1, x2, . . . , xn) = ¬x1 ∨ . . . ∨ ¬xk ∨ xk+1 ∨ . . . ∨ x2k, where 2k < n.

First, we find R = max_i ||xi||, the maximum norm of any input vector. For Boolean vectors with components 0 or 1, each component contributes at most 1 to the sum of squares, so the norm is at most √n:

||x|| ≤ √(1² + 1² + · · · + 1²) = √n

Next, we find the margin γ. Since we are dealing with Boolean input vectors, the disjunction outputs true (1) for any input where at least one of its literals is true. We can set our weight vector u such that:

– We assign a weight of +1 to the positions of the literals ¬x1 through ¬xk, since their logical negation in the function should trigger a true outcome.
– We assign a weight of +1 to the positions of the literals xk+1 through x2k, since they are directly included in the disjunction.
– We assign a weight of 0 to the remaining features.

Given that any example where the disjunction is true has at least one of these features active, the smallest positive value of u^T x over correctly classified examples is at least 1. Therefore, γ = 1.

Based on these findings, with R = √n and γ = 1, the Mistake Bound Theorem gives the number of mistakes M the Perceptron will make:

M ≤ (R/γ)² = (√n/1)² = n

Therefore, the upper bound on the number of mistakes the Perceptron will make on this disjunction, given the training set of all 2^n Boolean input vectors, is n.
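As an empirical sanity check, the disjunction is linearly separable over Boolean inputs (given a bias term), so Perceptron converges to zero training error on the full instance space. A sketch for the small case n = 6, k = 2 (the loop structure and variable names are mine; the epoch cap is just a safety budget):

```python
from itertools import product

n, k = 6, 2
# Target disjunction f = (not x1) or (not x2) or x3 or x4, labels in {-1, +1}.
def label(x):
    return 1 if (not x[0]) or (not x[1]) or x[2] or x[3] else -1

data = [(x, label(x)) for x in product([0, 1], repeat=n)]

# Standard Perceptron with a bias term, run until a mistake-free epoch.
w, b = [0.0] * n, 0.0
mistakes = 0
for epoch in range(300):
    errs = 0
    for x, y in data:
        if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
            for i in range(n):
                w[i] += y * x[i]
            b += y
            errs += 1
    mistakes += errs
    if errs == 0:
        break
print(errs == 0, mistakes)  # converges after finitely many mistakes
```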

5. [10 points] Prove that linear classifiers in a plane cannot shatter any 4 distinct points.

Solution. To show that linear classifiers in a plane cannot shatter any set of 4 distinct points, we must demonstrate that no linear classifier (i.e., no line in the plane) can separate the 4 points under every possible labeling. This relates to the concept of VC-dimension (Vapnik–Chervonenkis dimension).

A set of points is said to be shattered by a hypothesis class if, for every possible assignment of labels to those points, there exists a classifier in the class that perfectly separates the points into their assigned classes. To shatter 4 points in the plane with a linear classifier, we would need to be able to draw a line that separates any possible labeling of the points into 2 classes (say positive and negative).

Proof by contradiction:

Step 1: Assume 4 points can be shattered.
Assume we have 4 distinct points in a plane and that a linear classifier can separate them under every possible labeling.

Step 2: Consider 3 points in a non-collinear configuration.
First note that if any 3 of the 4 points are collinear, the labeling that gives the middle point the opposite label from the two outer points cannot be realized by any line, so such a configuration is already not shattered. Assume, then, that no 3 of the points are collinear. For any 3 of the 4 points, we can always draw a line that separates one of the points from the other two, regardless of how we label them.

Step 3: Introduce the fourth point.
Now we introduce the fourth point. There are two cases:

1) The fourth point is on the same side of the line as the single point. If we want to classify this fourth point into the opposite class from the single point, we cannot draw a line that separates it from the single point without also misclassifying one of the other points.

2) The fourth point is on the same side as the other two points. In this case, if we want to separate the single point and the fourth point from the other two, we again face the problem that any line that correctly classifies the single point cannot correctly classify the fourth point without misclassifying one of the two.

In both cases, there is a labeling for which no line separates the points into the two required classes.

Step 4: Show the contradiction.
By exhibiting at least one labeling of the points that cannot be separated by a single line (i.e., one configuration that cannot be realized), we demonstrate that our initial assumption, that 4 points can be shattered by a linear classifier, is false.

Therefore, we conclude that linear classifiers in a plane cannot shatter any set of 4 distinct points. The VC-dimension of a linear classifier in the plane (a line) is 3, since it can shatter a set of 3 non-collinear points but no set of 4.
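The key case above can be made concrete. For four points in convex position, labeling the two diagonal pairs oppositely (the XOR labeling) is never realizable: a separating line would have to separate the two diagonals of the quadrilateral, but those diagonals cross. A small sketch using the unit-square corners as an example:

```python
def ccw(a, b, c):
    # Twice the signed area of triangle abc; the sign gives the orientation.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p, q, r, s):
    # Proper intersection test for segments pq and rs.
    return (ccw(p, q, r) * ccw(p, q, s) < 0) and (ccw(r, s, p) * ccw(r, s, q) < 0)

# XOR labeling on the unit square: positives on one diagonal, negatives on the other.
pos = [(0, 0), (1, 1)]
neg = [(0, 1), (1, 0)]

# If a line separated pos from neg, the segment joining the positives and the
# segment joining the negatives would lie on opposite sides of it and could not
# intersect. They do intersect, so no separating line exists.
print(segments_cross(pos[0], pos[1], neg[0], neg[1]))  # True
```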

6. [Bonus] [10 points] Consider the infinite hypothesis space H of all rectangles in a plane. Each rectangle corresponds to a classifier: all the points inside the rectangle are classified as positive, and all others as negative. What is VC(H)?

Solution. The VC-dimension (Vapnik–Chervonenkis dimension) of a hypothesis space is the maximum number of points that can be shattered by that space. To shatter a set of points means that for every possible way of labeling these points (with two classes, positive and negative), there exists a hypothesis in the space that classifies the points exactly as they are labeled.

For the hypothesis space H of all (axis-aligned) rectangles in the plane, let us consider how many points we can shatter:

One point: We can always draw a rectangle around a single point to classify it as positive and leave all other points as negative. So, VC(H) ≥ 1.

Two points: We can draw a rectangle such that both points are inside (both positive), or one is inside and the other is outside (one positive, one negative, in either arrangement). So, VC(H) ≥ 2.

Three points: For three points in general position (no point inside the bounding box of the other two), we can draw a rectangle that includes any one of the three points (the others being outside), any two of the points (the other being outside), or all three. So, VC(H) ≥ 3.

Four points: If we place four points in a diamond configuration (one leftmost, one rightmost, one topmost, one bottommost, e.g. (±1, 0) and (0, ±1)), we can shatter this set as well: for every labeling, the bounding box of the positive points excludes all negative points, so some rectangle realizes it. So, VC(H) ≥ 4.

Five points: Consider five points arranged as the four corners of a rectangle with one point inside. If we label the four corner points positive and the center point negative, there is no way to draw a rectangle that includes all four corner points while excluding the center point. More generally, for any five distinct points, labeling the leftmost, rightmost, topmost, and bottommost points positive and a remaining point negative cannot be realized, since the bounding box of the four extreme points contains the fifth. Thus, no set of five points can be shattered by our hypothesis space of rectangles.

Therefore, since we can shatter four points but not five, the VC-dimension of the hypothesis space of all rectangles in a plane is VC(H) = 4.
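For axis-aligned rectangles, a labeling is realizable exactly when the bounding box of the positive points contains no negative point, which makes shattering easy to check by brute force. A sketch, assuming (as the solution does implicitly) axis-aligned rectangles; the diamond and five-point configurations are my examples:

```python
from itertools import product

def realizable(pos, neg):
    # An axis-aligned rectangle realizing the labeling exists iff the bounding
    # box of the positives contains no negative point.
    if not pos:
        return True
    xs = [p[0] for p in pos]; ys = [p[1] for p in pos]
    return not any(min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)
                   for (x, y) in neg)

def shattered(points):
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, l in zip(points, labels) if l]
        neg = [p for p, l in zip(points, labels) if not l]
        if not realizable(pos, neg):
            return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]         # 4 points: shattered
five = [(1, 1), (1, -1), (-1, 1), (-1, -1), (0, 0)]  # corners plus center: not
print(shattered(diamond), shattered(five))  # True False
```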

2 Practice [64 points]

1. [2 Points] Update your machine learning library. Please check in your implementation of ensemble learning and the least-mean-square (LMS) method from HW1 to your GitHub repository. Remember that last time you created the folders “Ensemble Learning” and “Linear Regression”; you can commit your code into the corresponding folders now. Please also supplement README.md with concise descriptions of how to use your code to run Adaboost, bagging, random forest, and LMS with batch gradient and stochastic gradient (how to call the command, set the parameters, etc.). Please create a new folder “Perceptron” at the same level as these folders.

Solution. Changes can be seen on GitHub: https://github.com/milenabel/CS6350-ML2023/.

2. We will implement Perceptron for a binary classification task: bank-note authentication. Please download the data “bank-note.zip” from Canvas. The features and labels are listed in the file “bank-note/data-desc.txt”. The training data are stored in the file “bank-note/train.csv”, consisting of 872 examples. The test data are stored in “bank-note/test.csv” and comprise 500 examples. In both the training and test datasets, feature values and labels are separated by commas.

(a) [16 points] Implement the standard Perceptron. Set the maximum number of epochs T to 10. Report your learned weight vector, and the average prediction error on the test dataset.

Solution. Learned weight vector: [53, −61.086591, −42.70582, −40.30786, −3.146269].

Average prediction error on the test dataset: 0.02.
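The standard Perceptron used here can be sketched as follows; this is a minimal numpy version on a toy dataset standing in for the bank-note CSVs (the epoch count T matches the assignment; fixed example order is a simplification, and shuffling each epoch is typical):

```python
import numpy as np

def perceptron_train(X, y, T=10):
    """Standard Perceptron; returns the augmented weight vector [b, w1, ..., wd]."""
    Xa = np.hstack([np.ones((len(X), 1)), X])  # prepend a constant bias feature
    w = np.zeros(Xa.shape[1])
    for _ in range(T):
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:             # mistake: update the weights
                w += yi * xi
    return w

def avg_error(w, X, y):
    Xa = np.hstack([np.ones((len(X), 1)), X])
    return float(np.mean(np.sign(Xa @ w) != y))

# Toy separable data standing in for the bank-note train/test CSVs.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, 1, 1])
w = perceptron_train(X, y, T=10)
print(w, avg_error(w, X, y))  # zero error on this toy set
```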


(b) [16 points] Implement the voted Perceptron. Set the maximum number of epochs T to 10. Report the list of the distinct weight vectors and their counts (the number of correctly predicted training examples). Use this set of weight vectors to predict each test example. Report the average test error.

Solution. The table of distinct vectors and their counts can be seen on GitHub as CSV and TeX files. (The TeX file failed to insert in this document due to its size.) Link: https://github.com/milenabel/CS6350-ML2023/blob/main/Perceptron/figs/…

The average test error: 0.014
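A minimal sketch of the voted Perceptron on the same kind of toy stand-in data; each distinct weight vector is stored with its survival count c, and test predictions are a count-weighted vote of signs:

```python
import numpy as np

def voted_perceptron_train(X, y, T=10):
    """Voted Perceptron: returns a list of (weight vector, count) pairs."""
    Xa = np.hstack([np.ones((len(X), 1)), X])  # prepend a bias feature
    w = np.zeros(Xa.shape[1])
    vectors, c = [], 0
    for _ in range(T):
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:             # mistake: retire w with its count
                if c > 0:
                    vectors.append((w.copy(), c))
                w = w + yi * xi
                c = 1
            else:
                c += 1
    vectors.append((w.copy(), c))
    return vectors

def voted_predict(vectors, X):
    Xa = np.hstack([np.ones((len(X), 1)), X])
    votes = sum(c * np.sign(Xa @ w) for w, c in vectors)
    return np.sign(votes)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, 1, 1])
vecs = voted_perceptron_train(X, y, T=10)
print(voted_predict(vecs, X))
```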

(c) [16 points] Implement the average Perceptron. Set the maximum number of epochs T to 10. Report your learned weight vector. Comparing with the list of weight vectors from (b), what can you observe? Report the average prediction error on the test data.

Solution. Learned weight vector: [36.38211009, −46.60072831, −29.06735968, −30.14480192, −8.94214493].

Similar vectors from the part (b) table (row index, weight vector, count):
131: [36., −43.811101, −33.60522, −28.54354, −10.474387], count 84
133: [36., −42.690241, −25.54332, −37.17104, −11.082187], count 6
135: [36., −44.920141, −27.05802, −34.04964, −9.699887], count 49
137: [36., −48.670141, −31.00822, −25.49584, −9.661117], count 23
141: [36., −44.993221, −29.15622, −37.91314, −5.236147], count 38

Magnitude Differences: Although the weight vectors from the voted perceptron vary across the epochs, the average perceptron consolidates the weights into a single vector that captures the average influence across all updates. The average weights do not exactly match any single vector from the voted perceptron but appear to be a consolidation.

Influence of Counts: The counts associated with the weight vectors from the voted perceptron indicate how many times a particular weight vector was used to make correct predictions. While the average perceptron does not account for this directly in its final weight vector, the impact of repeatedly used weights in the voted perceptron may be reflected in the average weight vector, owing to frequent updates in similar directions.

Observation on the Test Error: To compare the effectiveness of the learned models, we look at the average prediction error on the test data. This gives an empirical measure of how well each model generalizes to unseen data. Both the voted and average perceptrons produced the same error on the test data (as reported below).

Insight into Algorithm Behavior: The voted perceptron generates multiple models, each with its own voting power, while the average perceptron seeks a single model that performs well on average. The average perceptron's weight vector represents the cumulative knowledge gained over all training examples and epochs, smoothed over time.

Conclusion: The average perceptron's learned weight vector is a single representation that captures the overall trend of the weight adjustments made during training, while the voted perceptron retains individual weight vectors with varying degrees of influence. Comparing them can reveal how stable the learning process is and whether certain features are consistently influential across different iterations and versions of the perceptron algorithm. The differences in weight magnitudes and the recurrence of certain features across multiple vectors suggest that, while there is a general direction of learning, the stochastic order in which the training data are presented produces variations that the average perceptron's single vector smooths out.

Average prediction error on the test dataset: 0.014.
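A minimal sketch of the average Perceptron on the same kind of toy data; the returned vector is the running average of w over every training step, which is the single consolidated vector discussed above:

```python
import numpy as np

def average_perceptron_train(X, y, T=10):
    """Average Perceptron: returns the average of w over all training steps."""
    Xa = np.hstack([np.ones((len(X), 1)), X])  # prepend a bias feature
    w = np.zeros(Xa.shape[1])
    a = np.zeros_like(w)                       # running sum of weight vectors
    for _ in range(T):
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:             # mistake: standard update
                w += yi * xi
            a += w                             # accumulate after every example
    return a / (T * len(X))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, 1, 1])
w_avg = average_perceptron_train(X, y, T=10)
Xa = np.hstack([np.ones((len(X), 1)), X])
print(np.sign(Xa @ w_avg))
```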

(d) [14 points] Compare the average prediction errors for the three methods. What do you conclude?

Solution. When comparing the average prediction errors for the standard, voted, and average perceptron, we can draw the following conclusions:

Error Rates: The voted and average perceptrons both have an average test error of 0.014, while the standard perceptron has a slightly higher average test error of 0.02. This indicates that both the voted and average perceptrons perform better than the standard perceptron on the test data.

Consistency and Stability: The voted and average perceptrons incorporate the history of the weight adjustments made throughout training. The voted perceptron does this by maintaining a weighted vote for each weight vector, whereas the average perceptron maintains a cumulative sum of the weight vectors. Both methods appear to offer a more consistent and stable hypothesis that generalizes better to unseen data than the standard perceptron, which keeps only the last weight vector found at the end of training.

Noise: The similar performance of the voted and average perceptrons suggests that these methods are more robust to noise in the training data. By considering the history of the weight vectors, these algorithms may be less prone to overfitting the noise than the standard perceptron.

Algorithm Efficiency: In terms of computational efficiency, the average perceptron is preferable to the voted perceptron: it only needs to evaluate a single final weight vector at test time, whereas the voted perceptron must compute predictions using potentially many weight vectors and sum their weighted votes.

In conclusion, while the standard perceptron is the simplest algorithm and the easiest to understand, the voted and average perceptrons offer better performance, with the added complexity justified by a lower average prediction error. Both are more suitable for practical applications where prediction accuracy on unseen data is crucial. The choice between voted and average then comes down to considerations of implementation complexity, memory requirements, and computational efficiency at prediction time.
