Sale!

Machine Learining Homework 2

$30.00

Category:

Description

5/5 - (2 votes)

CS 5350/6350: Machine Learining
Homework 2

General Instructions
• You are welcome to talk to other members of the class about the homework. I am more concerned
that you understand the underlying concepts. However, you should write down your own solution.
Please keep the class collaboration policy in mind.
• Feel free discuss the homework with the instructor or the TAs.
• Your written solutions should be brief and clear. You need to show your work, not just the final
answer, but you do not need to write it in gory detail. Your assignment should be no more than
10 pages. Every extra page will cost a point.
• Handwritten solutions will not be accepted.
• The homework is due by midnight of the due date. Please submit the homework on Canvas.
• Some questions are marked For 6350 students. Students who are registered for CS 6350 should do
these questions. Students who have registered for CS 5350 are welcome to do this question for extra
credit.
1 Warm up: Linear Classifiers and Boolean Functions
[10 points] Please indicate if each of the following boolean funcitons is linearly separable. If
it is linearly separable, write down a linear threshold unit equivalent to it.
1. [2 points] x1 ∨ ¬x2 ∨ x3
2. [2 points] (x1 ∨ x2) ∧ (x2 ∨ x3)
3. [2 points] (x1 ∧ ¬x2) ∨ x3
4. [2 points] x1 xor x2 xor x3
5. [2 points] ¬x1 ∧ x2 ∧ ¬x3
1
2 Mistake Bound Model of Learning
1. [20 points] Consider an instance space consisting of integer points on the two dimensional plane (x1, x2) with −80 ≤ x1, x2 ≤ 80. Let C be a concept class defined on this
instance space. Each function fl
in C is defined by a length l (with 1 ≤ l ≤ 80) as
follows:
fl(x1, x2) = {
+1 |x1| ≤ l and |x2| ≤ l
−1 otherwise
(1)
Our goal is to come up with a error-driven algorithm that will learn the correct function
f ∈ C that correctly classifies a dataset.
Side notes
(a) Recall that a concept class is the set of functions from which the true target
function is drawn and the hypothesis space is the set of functions that the learning
algorithm searches over. In this question, both these are the same set.
(b) Assume that there is no noise. That is, assume that the data is separable using
the hypothesis class.
Questions
(a) [5 points] Determine |C|, the size of concept class.
(b) [5 points] To design an error driven learning algorithm, we should be able to first
write down what it means to make a mistake. Suppose our current guess for the
function is fl defined as in Equation 1 above. Say we get an input point (x
t
1
, xt
2
)
along with its label y
t
. Write down an expression (an equality or an inequality) in
terms of x
t
1
, xt
2
, yt and l that checks whether the current hypothesis fl has made
a mistake.
(c) [5 points] Next, we need to specify how we will update a hypothesis if there is
an error. Since fl
is completely defined in terms of l, we only need to update l.
How will you update l if there is an error? Consider errors for both positive and
negative examples.
(d) [5 points] Use the answers from the previous two steps to write a mistake-driven
learning algorithm to learn the function. Please write the algorithm concisely in
the form of pseudocode. What is the maximum number of mistakes that this
algorithm can make on any dataset?
2. [20 points] Recall from class that the Halving algorithm assumes that there is the
true (hidden) function is in the concept class C with N elements and tries to find it.
In this setting, we know the number of mistakes made by the algorithm is O(log N).
Another way to think about this setting is that we are trying to predict with expert
advice. That is, we have a pool of N so called experts, only one of whom is perfect.
As the halving algorithm proceeds, it cuts down this pool by at least half each time a
mistake is made.
2
Suppose, instead of one perfect expert, we have M perfect experts in our pool. Show
that the mistake bound of the same Halving algorithm in this case is O(log N
M
). (Hint:
To show this, consider the stopping condition of the algorithm. At what point, will
the algorithm stop making mistakes?)
3 The Perceptron Algorithm and its Variants
For this question, you will experiment with the Perceptron algorithm and some variants on
a data set.
3.1 The task and data
Image you are an network security expert and you are given a collection of URLs. Now, your
goal is to build a classifier that identifies which among these are phishing websites.
We will use Phishing data set from the UCI Machine Learning repository1
to study this.
The data has been preprocessed into a standard format. Use the training/development/test
files called phishing.train, phishing.dev and phishing.test. These files are in the
LIBSVM format, where each row is a single training example. The format of the each row
in the data is:
<label> <index1>:<value1> <index2>:<value2> …
Here <label> denotes the label for that example. The rest of the elements of the row
is a sparse vector denoting the feature vector. For example, if the original feature vector is
[0, 0, 1, 2, 0, 3], this would be represented as 3:1 4:2 6:3. That is, only the non-zero entries
of the feature vector are stored.
3.2 Algorithms
You will implement several variants of the Perceptron algorithm. Note that each variant has
different hyper-parameters, as described below. Use 5-fold cross-validation to identify the
best hyper-parameters as you did in the previous homework. To help with this, we have split
the training set into five parts training00.data–training04.data in the folder CVSplits.
1. Simple Perceptron: Implement the simple batch version of Perceptron as described
in the class. Use a fixed learning rate η chosen from {1, 0.1, 0.01}. An update will be
performed on an example (x, y) if y(wT x + b) < 0 as:
w ← w + ηyx,
b ← b + ηy.
Hyper-parameter: Learning rate η ∈ {1, 0.1, 0.01}
Two things bear additional explanation.
1https://archive.ics.uci.edu/ml/datasets/Phishing+Websites
3
(a) First, note that in the formulation above, the bias term b is explicitly mentioned.
This is because the features in the data do not include a bias feature. Of course,
you could choose to add an additional constant feature to each example and
not have the explicit extra b during learning. (See the class lectures for more
information.) However, here, we will see the version of Perceptron that explicitly
has the bias term.
(b) Second, in this specific case, if w and b are initialized with zero, then the fixed
learning rate will have no effect. To see this, recall the Perceptron update from
above.
Now, if w and b are initialized with zeroes and a fixed learning rate η is used,
then we can show that the final parameters will be equivalent to having a learning
rate 1. The final weight vector and the bias term will be scaled by η compared
to the unit learning rate case, which does not affect the sign of wT x + b.
To avoid this, you should initialize the all elements of the weight vector w and
the bias to a small random number between -0.01 and 0.01.
2. Perceptron with dynamic learning rate: Implement a Perceptron whose learning
rate decreases as η0
1+t
, where η0 is the starting learning rate, and t is the time step.
Note that t should keep increasing across epochs. (That is, you should initialize t to 0
at the start and keep incrementing for each update.)
Hyper-parameter: Initial learning rate η0 ∈ {1, 0.1, 0.01}
3. Margin Perceptron: This variant of Perceptron will perform an update on an example (x, y) if y(wT x + b) < µ, where µ is an additional positive hyper-parameter,
specified by the user. Note that because µ is positive, this algorithm can update the
weight vector even when the current weight vector does not make a mistake on the
current example. You need to use the decreasing learning rate as before.
Hyper-parameters:
(a) Initial learning rate η0 ∈ {1, 0.1, 0.01}
(b) Margin µ ∈ {1, 0.1, 0.01}
4. Averaged Perceptron Implement the averaged version of the original Perceptron
algorithm from the first question. Recall from class that the averaged variant of the
Perceptron asks you to keep two weight vectors (and two bias terms). In addition to
the original parameters (w, b), you will need to update the averaged weight vector a
and the averaged bias ba as:
(a) a ← a + w
(b) ba ← ba + b
This update should happen once for every example in every epoch, irrespective of
whether the weights were updated or not for that example. In the end, the learning
algorithm should return the averaged weights and the averaged bias.
4
(Technically, this strategy can be used with any of the variants we have seen here.
For this homework, we only ask you to implement the averaged version of the original Perceptron. However, you are welcome to experiment with averaging the other
variants.)
5. Aggressive Perceptron with Margin, (For 6350 Students) This algorithm is an
extension of the margin Perceptron and performs an aggressive update as follows:
If y(wT x) ≤ µ then
(a) Update wnew ← wold + ηyx
Unlike the standard Perceptron algorithm, here the learning rate η is given by
η =
µ − y(wT x)
xT x + 1
As with the margin Perceptron, there is an additional positive parameter µ.
Explanation of the update. We call this the aggressive update because the learning rate is derived from the following optimization problem:
When we see that y(wT x) ≤ µ, we try to find new values of w such that y(wT x) = µ
using
min
wnew
1
2
||wnew − wold||2
such that y(wT x) = µ.
That is, the goal is to find the smallest change in the weights so that the current
example is on the right side of the weight vector.
By substituting (a) from above into this optimization problem, we will get a single
variable optimization problem whose solution gives us the η defined above. You can
think of this algorithm as trying to tune the weight vector so that the current example
is correctly classified right after the update.
Implement this aggressive Perceptron algorithm.
Hyper-parameters: µ ∈ {1, 0.1, 0.01}
3.3 Experiments
For all 5 settings above (4 for undergraduate students), you need to do the following things:
1. Run cross validation for ten epochs for each hyper-parameter combination to get the
best hyper-parameter setting. Note that for cases when you are exploring combinations of hyper-parameters (such as the margin Perceptron), you need to try out all
combinations.
5
2. Train the classifier for 20 epochs. At the end of each training epoch, you should measure the accuracy of the classifier on the development set. For the averaged Perceptron,
use the average classifier to compute accuracy.
3. Use the classifier from the epoch where the development set accuracy is highest to
evaluate on the test set.
3.4 What to report
1. [8 points] Briefly describe the design decisions that you have made in your implementation. (E.g, what programming language, how do you represent the vectors, etc.)
2. [2 points] Majority baseline: Consider a classifier that always predicts the most frequent label. What is its accuracy on test and development set?
3. [10 points per variant] For each variant above (5 for 6350 students, 4 for 5350 students),
you need to report:
(a) The best hyper-parameters
(b) The cross-validation accuracy for the best hyperparameter
(c) The total number of updates the learning algorithm performs on the training set
(d) Development set accuracy
(e) Test set accuracy
(f) Plot a learning curve where the x axis is the epoch id and the y axis is the dev set
accuracy using the classifier (or the averaged classifier, as appropriate) at the end
of that epoch. Note that you should have selected the number of epochs using
the learning curve (but no more than 20 epochs).
Experiment Submission Guidelines
1. The report should detail your experiments. For each step, explain in no more than a
paragraph or so how your implementation works. Describe what you did. Comment
on the design choices in your implementation. For your experiments, what algorithm
parameters did you use? Try to analyze this and give your observations.
2. Your report should be in the form of a pdf file, LATEX is recommended.
3. Your code should run on the CADE machines. You should include a shell script,
run.sh, that will execute your code in the CADE environment. Your code should
produce similar output to what you include in your report.
You are responsible for ensuring that the grader can execute the code using only the
included script. If you are using an esoteric programming language, you should make
sure that its runtime is available on CADE.
4. Please do not hand in binary files! We will not grade binary submissions.
5. Please look up the late policy on the course website.
6