COMS 4771: Machine Learning

Homework 3

Note: In your write-up please only provide the specific information requested. In particular,

no code need be submitted.

This homework will use the same data sets that you used in the last homework, from the file

spam.mat which can be found on Canvas or the course webpage. Loading this file into MATLAB will

generate two matrices: train_spam, the training set, with 3601 labeled examples, and test_spam,

the test set, with 1000 labeled examples. Each example represents an email with 57 features and a

binary label indicating whether the email is spam (label = 1) or not (label = 0). A description of the

data can be found at https://archive.ics.uci.edu/ml/datasets/Spambase, with detailed descriptions of the features at https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names. The rows of the data set found there have been randomly permuted

to generate the train and test sets.

In this homework, you will experiment with AdaBoost using decision stumps as the base learner.

Decision stumps are extremely simple classifiers which use a single feature and a threshold to make

their prediction: in other words, they’re depth 1 binary decision trees. To train a decision stump,

we compute, for each feature i, and for each threshold value θ among the values of feature i, the

cost of the decision rule which splits the training set into two parts (examples with xi > θ,

and examples with xi ≤ θ) and uses the majority label as the prediction in each part. The cost is

simply the number of mistakes made by this decision rule (this corresponds to classification error

as the measure of uncertainty: don’t worry about Gini index and entropy in this homework). To

use decision stumps in AdaBoost, we need to generalize this training method to allow weights on

examples.
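
For concreteness, here is a minimal MATLAB sketch of the unweighted stump-training procedure just described; the function name, the assumption that the last column of the data matrix holds the 0/1 label, and the struct fields are all illustrative choices, not part of the assignment. Part 1 asks you to generalize this to weighted examples.

% Illustrative sketch (not for submission): unweighted stump training as
% described above. Assumes data is n-by-(d+1) with the 0/1 label in the
% last column; all names here are hypothetical.
function stump = train_stump_sketch(data)
    y = data(:, end);
    d = size(data, 2) - 1;
    best_cost = inf;
    for i = 1:d                          % features are tried in index order
        xi = data(:, i);
        for theta = unique(xi)'          % thresholds: values of feature i
            lo = (xi <= theta);          % part with xi <= theta
            hi = ~lo;                    % part with xi > theta
            % predict 1 only on a strict majority of 1s (ties go to 0)
            p_lo = sum(y(lo)) > sum(lo) / 2;
            p_hi = sum(y(hi)) > sum(hi) / 2;
            cost = sum(y(lo) ~= p_lo) + sum(y(hi) ~= p_hi);
            if cost < best_cost          % strict < keeps the earliest
                best_cost = cost;        % minimizer, i.e. the lowest index
                stump = struct('feature', i, 'theta', theta, ...
                               'pred_le', p_lo, 'pred_gt', p_hi);
            end
        end
    end
end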

Part 1 of assignment: (weightage: 1%.) Give pseudocode for training a decision stump

when training examples are given weights. Use the same tie-breaking rules as the last homework:

in case there are two features xi and xj that both yield minimum cost decision stumps, break the

tie by choosing the lower indexed one (i.e. i if i < j). Also, if for any leaf, the costs of predicting

1 and 0 are the same, break the tie in favor of the 0 label.
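
One implementation note you may find convenient (an observation about MATLAB, not a requirement): if you collect the best achievable cost for each feature into a vector, say a hypothetical costs_per_feature, MATLAB's min returns the index of the first minimizer among ties, which implements the lower-index tie-breaking rule automatically:

[best_cost, best_i] = min(costs_per_feature);   % first index wins ties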

Next, implement your pseudocode in MATLAB. Using this code, write a MATLAB function

function params = hw3_train_adaboost(train_data, num_rounds)

where the integer parameter num_rounds is the number of rounds of AdaBoost to run on the training

set train_data, using your decision stump training code as the weak learner in each round. The

output params is any convenient MATLAB data structure to represent the ensemble of decision stumps

(and their weights) computed by AdaBoost.
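
To fix ideas, here is a hedged sketch of one possible shape for this function, storing the ensemble as a cell array of structs; train_weighted_stump (your Part 1 code) and stump_predict (a helper returning ±1 predictions) are assumed names, and the weight and update formulas follow the standard AdaBoost recipe from lecture.

% Hedged sketch of hw3_train_adaboost; the params layout and the helper
% names train_weighted_stump and stump_predict are assumptions.
function params = hw3_train_adaboost(train_data, num_rounds)
    n = size(train_data, 1);
    y = 2 * train_data(:, end) - 1;       % map labels {0,1} -> {-1,+1}
    D = ones(n, 1) / n;                   % start from the uniform distribution
    params = cell(num_rounds, 1);
    for t = 1:num_rounds
        stump = train_weighted_stump(train_data, D);  % weak learner (Part 1)
        h = stump_predict(stump, train_data);         % +/-1 predictions
        eps_t = sum(D .* (h ~= y));                   % weighted error
        alpha = 0.5 * log((1 - eps_t) / eps_t);       % vote weight (assumes
        D = D .* exp(-alpha * y .* h);                %   0 < eps_t < 1)
        D = D / sum(D);                               % renormalize weights
        params{t} = struct('stump', stump, 'alpha', alpha);
    end
end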


Write another function which computes predictions for test data given the parameters computed

by the hw3_train_adaboost function, and the error rate (the fraction of mistakes, i.e. #mistakes/#examples) made on the test data, with the following

signature:

function loss = hw3_test_adaboost(params, test_data)
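
A matching sketch under the same assumed params layout (again illustrative, with stump_predict as the hypothetical ±1 prediction helper):

% Hedged sketch of hw3_test_adaboost, matching the layout sketched above.
function loss = hw3_test_adaboost(params, test_data)
    y = 2 * test_data(:, end) - 1;        % labels as {-1,+1}
    score = zeros(size(test_data, 1), 1);
    for t = 1:numel(params)
        score = score + params{t}.alpha * ...
                stump_predict(params{t}.stump, test_data);
    end
    pred = sign(score);    % weighted-majority vote; sign(0) is 0 here, so
                           % pick a tie rule and apply it consistently
    loss = mean(pred ~= y);               % error rate = #mistakes/#examples
end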

Part 2 of assignment: (weightage: 3%.) Run AdaBoost using hw3_train_adaboost with

train_data = train_spam and num_rounds = 1, 2, . . . , 100. For every value of num_rounds, compute the training error by using hw3_test_adaboost on test_data = train_spam, as well as the

test error by using hw3_test_adaboost on test_data = test_spam. Plot curves (on a single graph)

of the training error and test error on the y-axis with the number of rounds on the x-axis, and

include the graph in your homework submission. Also report a few specific values of the training

and test errors in a table with the following format:

num_rounds    training error rate    test error rate
   1
   2
   4
   8
  16
  32
  64
 100
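
One straightforward (if not the fastest) way to drive this experiment, assuming spam.mat has been loaded, is to retrain for each round count; the hint below describes how a single 100-round run can be reused instead. A rough sketch:

% Sketch of the Part 2 experiment; retrains from scratch for each round
% count for clarity (see the hint below for a faster prefix-reuse approach).
T = 100;
train_err = zeros(T, 1);
test_err = zeros(T, 1);
for r = 1:T
    params = hw3_train_adaboost(train_spam, r);
    train_err(r) = hw3_test_adaboost(params, train_spam);
    test_err(r)  = hw3_test_adaboost(params, test_spam);
end
plot(1:T, train_err, 1:T, test_err);
xlabel('number of rounds'); ylabel('error rate');
legend('training error', 'test error');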

Although it is not required for this homework, you may wish to compare the numbers you get

using AdaBoost with decision stumps with the numbers you got for training decision trees in the

last homework. To do a fair comparison, note that a complete decision tree of depth d has 2^d leaf nodes, each of which can be thought of as a decision stump, so compare such a decision tree to a classifier constructed by AdaBoost running for 2^d rounds.

Hint: Since AdaBoost constructs the ensemble incrementally, the output params computed by running AdaBoost for 100 rounds can also be used to reconstruct the output of AdaBoost for any smaller number of rounds.
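
With the cell-array layout sketched earlier, that reconstruction is just taking a prefix of the ensemble, e.g.:

params_100 = hw3_train_adaboost(train_spam, 100);        % train once
r = 25;                                                  % any round count <= 100
loss_r = hw3_test_adaboost(params_100(1:r), test_spam);  % first r stumps only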

Part 3 of assignment: (weightage: 2%.) Polynomial regression is a technique for solving

regression problems by fitting a bounded-degree polynomial to the data rather than a linear

function, as in ordinary least squares. The simplest case is when each example is represented by a single

scalar feature x ∈ R. The class of regression functions, Fd, is the set of all polynomial functions of

x with degree bounded by d, where d is a parameter. Thus, the class can be defined as

Fd = {c0 + c1 x + c2 x^2 + . . . + cd x^d | c0, c1, . . . , cd ∈ R}.

Suppose we use squared loss to measure goodness of fit, i.e. for an example (x, y), if the prediction is ŷ, then the loss is (y − ŷ)^2. So the polynomial regression problem is to find the polynomial f ∈ Fd which minimizes training error on some training set {(x1, y1), (x2, y2), . . . , (xn, yn)}, i.e.

arg min_{f∈Fd} (1/n) Σ_{i=1}^{n} (yi − f(xi))^2.



Give a closed-form formula for solving the polynomial regression problem, ignoring any numerical

issues.
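
Although no code is required for this part, you can sanity-check whatever formula you derive against MATLAB's built-in least-squares polynomial fit on synthetic data; the vectors x and y and the degree d below are hypothetical:

x = linspace(-1, 1, 50)';                    % hypothetical scalar features
y = 2 - x + 3 * x.^2 + 0.1 * randn(50, 1);   % noisy degree-2 targets
d = 2;
c = polyfit(x, y, d);    % least-squares coefficients, highest degree first
% compare c against the coefficients produced by your closed-form formula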

Extra credit: (weightage: 2%.) We have informally argued that the AdaBoost algorithm

uses the weighting mechanism to force the weak learner to focus on the problematic examples in the

next iteration. In this question we will find some rigorous justification for this argument. Compute

the value of

Σ_{(x,y)∈S} Dt+1(x, y) · y ft(x).

Here, the notation is the same as used in the lecture slides: ft is the classifier found by AdaBoost in round t, and Dt+1 is the distribution computed from Dt using ft. Based on the value you computed, what can you say about how good a classifier ft is when examples are drawn from Dt+1?

Extra credit: (weightage: 1%.) In your experiments, you may have noticed something

curious about the error rates (either training or test error) when you run AdaBoost for 1 round or

2 rounds. Explain why this happens.

