COMS 4771 HW2

You are allowed to work in groups of (at max) three students. These group members don’t necessarily have to be the same from previous homeworks. Only one submission per group is required by the due date on Gradescope. Name and UNI of all group members must be clearly specified on the homework. You must cite all the references you used to do this homework. You must show your work to receive full credit.

1 [Multi-class classification] In class, we have primarily been focusing on binary classifiers. In this problem, we will study how one can use binary classification to do multiclass classification. Suppose we want to construct a k-class classifier f that maps data from some input space X to the label space {1, 2, …, k}. There are two popular methods to construct f by combining the results of multiple binary classifiers.

– one-vs-rest (OvR) technique – for each class i ∈ [k], one can view the classification problem as computing a function f_i : R^d → {class i, not class i} (i.e., assigning examples from class i the label 1 and all other classes the label 0). One can combine the results for each f_i to construct a multiclass classifier f.

– one-vs-one (OvO) technique – for each pair of classes i, j ∈ [k] (i, j distinct), one can view the classification problem as computing a function f_ij : R^d → {class i, class j} (taking only training points with labels i or j). One can combine the results for each f_ij to construct a multiclass classifier f.
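As a concrete illustration of the two combination schemes (a minimal sketch only, not the formal definition part (i) below asks for), suppose each OvR classifier exposes a real-valued "class i" score and each OvO classifier returns the winning label of its pair; labels are in {1, …, k}:

```python
import numpy as np

def ovr_predict(scorers, x):
    # scorers[i] is the class-(i+1)-vs-rest classifier's real-valued score;
    # predict the class whose "this class" score is largest (k binary calls).
    return int(np.argmax([f(x) for f in scorers])) + 1

def ovo_predict(pairwise, k, x):
    # pairwise[(i, j)] returns the winning label (i or j);
    # predict by majority vote over all k(k-1)/2 pairs.
    votes = np.zeros(k + 1)
    for f in pairwise.values():
        votes[f(x)] += 1
    return int(np.argmax(votes))
```

The scorer and pairwise-winner interfaces here are assumptions for illustration; ties can be broken arbitrarily.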

(i) Describe how one can design a multiclass classifier f from the OvR and OvO techniques. Make sure to precisely (mathematically) define f for each technique. For a new test example, how many calls to binary classification are made in each technique?

(ii) Assuming that your base binary classifiers can only be linear, show a training dataset for each of the following cases. (Your example training dataset for each case must have the following properties – (i) the number of classes in the dataset is k ≥ 2, (ii) the dataset contains an equal number of datapoints per class, and (iii) each class contains at least two datapoints.)

– OvR gives better accuracy than OvO
– OvO gives better accuracy than OvR
– For any ε > 0, both OvO and OvR give accuracy of at most ε (your example training set can depend on ε)
– For any ε > 0, both OvO and OvR give accuracy of at least 1 − ε (your example training set can depend on ε)

(iii) From part (i) we observed that the OvO/OvR techniques make a certain number of calls to binary classification during test time. Suppose our goal is to minimize the number of calls made to binary classification during test time (let’s call this quantity c). Propose a technique to construct a k-class classifier f from binary classifiers that minimizes c. For your proposed technique, what is c? (i.e., express it in terms of parameters of the data, such as the number of classes k, the number of datapoints n, the dimensionality of your dataset d, etc.) Prove that your technique indeed minimizes c, that is, that there is no other technique that makes fewer binary classification calls during test time and still achieves comparable accuracy.

2 [Constrained optimization] Show that the distance from the hyperplane g(x) = w · x + w_0 = 0 to a point x_a is |g(x_a)|/‖w‖ by minimizing the squared distance ‖x − x_a‖² subject to the constraint g(x) = 0.
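The claimed formula is easy to sanity-check numerically (a throwaway sketch assuming numpy; the particular w, w_0, and x_a below are made up): the constrained minimizer is the orthogonal projection of x_a onto the hyperplane, and its distance to x_a matches |g(x_a)|/‖w‖.

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), 2.0   # hypothetical hyperplane g(x) = w·x + w0
xa = np.array([1.0, -2.0])          # hypothetical query point

g = w @ xa + w0                     # g(x_a)
dist_formula = abs(g) / np.linalg.norm(w)

# The minimizer of ||x - x_a||^2 subject to g(x) = 0 (via Lagrange multipliers)
# is the orthogonal projection of x_a onto the hyperplane:
x_star = xa - (g / (w @ w)) * w
assert abs(w @ x_star + w0) < 1e-9          # x_star satisfies the constraint
dist_direct = np.linalg.norm(x_star - xa)   # agrees with dist_formula
```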

3 [A better output Perceptron algorithm guarantee] In class, we saw that when the training sample S is linearly separable with a maximum margin γ > 0, then the Perceptron algorithm run cyclically over S is guaranteed to converge after T ≤ (R/γ)² updates, where R is the radius of the sphere containing the sample points. This does not guarantee, however, that the hyperplane solution returned by Perceptron, i.e., w_T, achieves a margin close to γ.

(i) Show an example training dataset S in R² that has margin γ, and an order of updates made by the Perceptron algorithm where the hyperplane solution returned has arbitrarily bad margin on S.

(ii) Consider the following modification to the perceptron algorithm:

Modified Perceptron Algorithm
Input: training dataset S = (x_i, y_i)_{i=1,…,n}
Output: learned vector w
– Initialize w_0 := 0, t := 0
– while there exists an example (x, y) ∈ S such that 2y(w_t · x) ≤ γ‖w_t‖
  – set w_{t+1} := w_t + yx
  – set t := t + 1
– return w_t.
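The pseudocode above transcribes directly into a few lines (a sketch assuming numpy, labels in {−1, +1}, and a caller-supplied margin parameter γ; the update cap is only a safeguard against non-separable input):

```python
import numpy as np

def modified_perceptron(X, y, gamma, max_updates=100_000):
    # X: (n, d) data matrix, y: labels in {-1, +1}.
    # Update on any example with 2 y (w_t · x) <= gamma * ||w_t||.
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        violations = np.where(2 * y * (X @ w) <= gamma * np.linalg.norm(w))[0]
        if len(violations) == 0:
            return w              # every example now has normalized margin > gamma/2
        i = violations[0]
        w = w + y[i] * X[i]       # standard perceptron update
    return w
```

Note that with w_0 = 0 the condition 2y(w_t · x) ≤ γ‖w_t‖ holds trivially at the start, so the first update always fires.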

(a) If the Modified Perceptron Algorithm (MPA) terminates after T rounds, what margin guarantee is achieved by the hyperplane w_T returned by MPA? Justify your answer.

(b) We will now prove, step by step, the mistake bound for the Modified Perceptron Algorithm (MPA).

i. Show that after T rounds Tγ ≤ ‖w_T‖, and observe that if ‖w_T‖ < 4R²/γ, then T < 4R²/γ².

In what follows, we will assume that ‖w_T‖ ≥ 4R²/γ.

ii. Show that for any iteration t at which a mistake was made, the following holds:

‖w_t‖² ≤ (‖w_{t−1}‖ + γ/2)² + R².

iii. Infer from this that for any iteration t, we have

‖w_t‖ ≤ ‖w_{t−1}‖ + γ/2 + R²/(‖w_{t−1}‖ + ‖w_t‖ + γ/2).

iv. Using the previous question, show that for any iteration t such that either ‖w_{t−1}‖ ≥ 4R²/γ or ‖w_t‖ ≥ 4R²/γ, we have

‖w_t‖ ≤ ‖w_{t−1}‖ + (3/4)γ.

v. Show that ‖w_0‖ ≤ R ≤ 4R²/γ. Since by assumption we have ‖w_T‖ ≥ 4R²/γ, conclude that there must exist some largest iteration t_0 such that ‖w_{t_0−1}‖ ≤ 4R²/γ and ‖w_{t_0}‖ ≥ 4R²/γ.

vi. Show that ‖w_T‖ ≤ ‖w_{t_0−1}‖ + (3/4)Tγ, and finally deduce the mistake bound.

4 [Making data linearly separable by feature space mapping] Consider the infinite-dimensional feature space mapping

Φ_σ : R → R^∞
x ↦ ( 1{|α − x| < σ} · exp( −1/(1 − (|α − x|/σ)²) ) )_{α∈R}.

(It may be helpful to sketch the function f(α) := 1{|α| < 1} · exp(−1/(1 − α²)) for understanding the mapping and answering the questions below.)

(i) Show that for any n distinct points x_1, . . . , x_n, there exists a σ > 0 such that the mapping Φ_σ can linearly separate any binary labeling of the n points.

(ii) Show that one can efficiently compute the dot products in this feature space by giving an analytical formula for Φ_σ(x) · Φ_σ(x′) for arbitrary points x and x′.

(iii) Given an input space X and a feature space mapping φ that maps elements from X to a (possibly infinite-dimensional) inner product space V, let K : X × X → R be a kernel function that can efficiently compute the inner products in V, that is, for any x, x′ ∈ X,

K(x, x′) = φ(x) · φ(x′).

Consider a binary classification algorithm that predicts the label of an unseen instance according to the class with the closest average. Formally, given a training set S = (x_1, y_1), . . . , (x_m, y_m), for each y ∈ {±1} define

c_y := (1/m_y) Σ_{i : y_i = y} φ(x_i),

where m_y = |{i : y_i = y}|. Assume that m_+ and m_− are nonzero. Then, the algorithm outputs the following decision rule:

h(x) := 1 if ‖φ(x) − c_+‖ ≤ ‖φ(x) − c_−‖, and 0 otherwise.

(a) Let w := c_+ − c_− and let b := (1/2)(‖c_−‖² − ‖c_+‖²). Show that

h(x) = sign(⟨w, φ(x)⟩ + b).

(b) Show that h(x) can be expressed via K(·, ·), without accessing individual entries of φ(x) or w, thus showing that h(x) is efficiently computable.
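To make the claim in (b) concrete, here is one hypothetical way such a kernelized rule can be organized (a sketch only; the helper name and the use of a plain kernel callable are assumptions, not part of the assignment):

```python
import numpy as np

def kernel_mean_classifier(Xtr, ytr, K):
    # Closest-average rule h(x) using only kernel evaluations K(·, ·).
    pos = [x for x, yy in zip(Xtr, ytr) if yy == 1]
    neg = [x for x, yy in zip(Xtr, ytr) if yy == -1]
    # ||c_+||^2 = (1/m_+^2) * sum over all positive pairs of K(x_i, x_j);
    # np.mean over the full m_+ x m_+ table computes exactly that (same for c_-).
    cpos2 = np.mean([[K(a, b) for b in pos] for a in pos])
    cneg2 = np.mean([[K(a, b) for b in neg] for a in neg])
    b = 0.5 * (cneg2 - cpos2)
    def h(x):
        # <w, phi(x)> = mean over positives of K(x_i, x) - mean over negatives
        s = np.mean([K(xi, x) for xi in pos]) - np.mean([K(xi, x) for xi in neg])
        return 1 if s + b >= 0 else 0
    return h
```

Every quantity in the rule is obtained from K alone; φ and w are never materialized.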

5 [Predicting Restaurant-reviews via Perceptrons]

Download the review dataset hw2data 1.zip from the course website. This dataset is comprised of reviews of restaurants in Pittsburgh; the label indicates whether or not the reviewer-assigned rating is at least four (on a five-point scale). The data are in CSV format (where the first line is the header); the first column is the label (label; 0 or 1), and the second column is the review text (text). The text has been processed to remove non-alphanumeric symbols and capitalization. The data has already been separated into training data reviews tr.csv and test data reviews te.csv.

The first challenge in dealing with text data is to come up with a reasonable “vectorial” representation of the data so that one can use standard machine learning classifiers with it.

Data-representations

In this problem, you will experiment with the following different data representations.

1. Unigram representation.

In this representation, there is a feature for every word t, and the feature value associated with a word t in a document d is

tf(t; d) := number of times word t appears in document d.

(tf is short for term frequency.)

2. Term frequency-inverse document frequency (tf-idf) weighting.

This is like the unigram representation, except the feature associated with a word t in a document d from a collection of documents D (e.g., training data) is

tf(t; d) × log_10(idf(t; D)),

where tf(t; d) is as defined above, and

idf(t; D) := |D| / (number of documents in D that contain word t).

This representation puts more emphasis on rare words and less emphasis on common words. (There are many variants of tf-idf that are unfortunately all referred to by the same name.)

Note: When you apply this representation to a new document (e.g., a document in the test set), you should still use the idf defined with respect to D. This, however, becomes problematic if a word t appears in a new document but did not appear in any document in D: in this case, idf(t; D) = |D|/0 = ∞. It is not obvious what should be done in these cases. For this homework assignment, simply ignore words t that do not appear in any document in D.


3. Bigram representation.

In addition to the unigram features, there is a feature for every pair of words (t_1, t_2) (called a bigram), and the feature value associated with a bigram (t_1, t_2) in a given document d is

tf((t_1, t_2); d) := number of times bigram (t_1, t_2) appears consecutively in document d.

In the sequence of words “a rose is a rose”, the bigrams that appear are: (a, rose), which appears twice; (rose, is); and (is, a).

For all of these representations, make sure to do data “lifting”.
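A minimal, library-free sketch of the three representations as dict-of-counts features (the function names are mine; whichever tokenization and “lifting” convention you use is up to you):

```python
import math
from collections import Counter

def unigram_tf(doc):
    # tf(t; d): count of each whitespace-separated word in the document
    return Counter(doc.split())

def bigram_tf(doc):
    # tf((t1, t2); d): count of each consecutive word pair
    toks = doc.split()
    return Counter(zip(toks, toks[1:]))

def tfidf_features(docs):
    # tf(t; d) * log10(|D| / #{documents in D containing t}), per the definition above
    tfs = [unigram_tf(d) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())          # document frequency of each word
    n = len(docs)
    return [{t: c * math.log10(n / df[t]) for t, c in tf.items()} for tf in tfs]
```

Sparse dicts like these keep memory usage proportional to the number of distinct words actually seen, which matters once bigrams are included.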

Online Perceptron with online-to-batch conversion

One can implement the Online Perceptron algorithm with the following online-to-batch conversion process.

– Run Online Perceptron to make two passes through the training data. Before each pass, randomly shuffle the order of the training examples. Note that a total of 2n + 1 linear classifiers ŵ_1, . . . , ŵ_{2n+1} are created during the run of the algorithm, where n is the number of training examples.

– Return the linear classifier ŵ_final given by the simple average of the final n + 1 linear classifiers:

ŵ_final := (1/(n + 1)) Σ_{i=n+1}^{2n+1} ŵ_i.

Recall that as Online Perceptron is making its pass through the training examples, it only updates its weight vector in rounds in which it makes a prediction error. So, for example, if ŵ_1 does not make a mistake in classifying the first training example, then ŵ_2 = ŵ_1. You can use this fact to keep the memory usage of your code relatively modest.
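The procedure above can be sketched as follows (assuming numpy, labels already mapped to {−1, +1}, and feature vectors as rows of X; per the memory note, only a running sum is kept during the second pass rather than all 2n + 1 vectors):

```python
import numpy as np

def averaged_perceptron(X, y, seed=0):
    # Two shuffled passes of Online Perceptron; returns the average of the
    # final n + 1 weight vectors w_{n+1}, ..., w_{2n+1}.
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for p in range(2):
        if p == 1:
            w_sum = w.copy()              # this is w_{n+1}
        for i in rng.permutation(n):
            if y[i] * (w @ X[i]) <= 0:    # mistake: update; otherwise w is unchanged
                w = w + y[i] * X[i]
            if p == 1:
                w_sum += w                # accumulates w_{n+2}, ..., w_{2n+1}
    return w_sum / (n + 1)
```

For the sparse text features of this problem you would store w as a dict keyed by feature name rather than a dense array; the logic is otherwise identical.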

(i) What are some potential issues with the Unigram representation of textual data?

(ii) Implement the online Perceptron algorithm with online-to-batch conversion. You must submit your code via Courseworks to receive full credit.

(iii) Which of the three representations gives better predictive performance? As always, you should justify your answer with appropriate performance graphs demonstrating the superiority of one representation over the others. Example things to consider: you should evaluate how the classifier behaves on a holdout ‘test’ sample for various splits of the data; how does the training sample size affect the classification performance; etc.

(iv) For the classifier based on the unigram representation, determine the 10 words that have the highest (i.e., most positive) weights; also determine the 10 words that have the lowest (i.e., most negative) weights.


6 [Neural networks as universal function approximators] Here we will experimentally verify that (feed-forward) neural networks can approximate any smooth function. Recall that a feed-forward neural network is simply a combination of multiple ‘neurons’ such that the output of one neuron is fed as the input to another neuron. More precisely, a neuron ν_i is a computational unit that takes in an input vector x and returns a weighted combination of the inputs (plus the bias associated with neuron ν_i) passed through an activation function σ : R → R, that is: ν_i(x; w_i, b_i) := σ(w_i · x + b_i). With this notation, we can define a layer ℓ of a neural network N^ℓ that takes in a d_I-dimensional input and returns a d_O-dimensional output as

N^ℓ(x) := ( ν^ℓ_1(x), ν^ℓ_2(x), . . . , ν^ℓ_{d_O}(x) ),   x ∈ R^{d_I}
        = σ(W_ℓ^T x + b_ℓ),   with W_ℓ ∈ R^{d_I × d_O} defined as [w^ℓ_1, . . . , w^ℓ_{d_O}], and b_ℓ ∈ R^{d_O} defined as [b^ℓ_1, . . . , b^ℓ_{d_O}]^T.

Here w^ℓ_i and b^ℓ_i refer to the weight and the bias associated with neuron ν^ℓ_i in layer ℓ, and the activation function σ is applied pointwise. An L-layer (feed-forward) neural network F_{L-layer} is then defined as a network consisting of network layers N^1, . . . , N^L, where the input to layer i is the output of layer i − 1. By convention, the input to the first layer (layer 1) is the actual input data.

(i) Consider a nonlinear activation function σ : R → R defined as x ↦ 1/(1 + e^{−x}). Show that ∂σ/∂x = σ(x)(1 − σ(x)).
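The identity in part (i) is easy to sanity-check numerically with a central finite difference (a throwaway check, not a substitute for the required derivation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# check dσ/dx = σ(x)(1 − σ(x)) at a few points via central differences
h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    analytic = sigmoid(x) * (1.0 - sigmoid(x))
    assert abs(numeric - analytic) < 1e-8
```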

(ii) Consider a single-layer feed-forward neural network that takes a d_I-dimensional input and returns a d_O-dimensional output, defined as F_{1-layer}(x) := σ(W^T x + b) for some d_I × d_O weight matrix W and a d_O × 1 vector b. Given a training dataset (x_1, y_1), . . . , (x_n, y_n), we can define the average error (with respect to the network parameters) of predicting y_i from input example x_i as:

E(W, b) := (1/2n) Σ_{i=1}^{n} ‖F_{1-layer}(x_i) − y_i‖².

What are ∂E/∂W and ∂E/∂b?

(Note: we can use this gradient in a descent-type procedure to minimize this error and learn a good setting of the weight matrix that can predict y_i from x_i.)

(iii) Although reasonably expressive, single-layer neural networks are not flexible enough to approximate arbitrary smooth functions. Here we will focus on approximating one-dimensional functions f : R → [0, 1] (the range of the function is taken in the interval [0, 1] only for convenience; one can transform any bounded function to this interval by simple scaling and translation) with a multi-layer neural network. Consider an L-layer neural network F_{L-layer} as described above. Given a sample of input-output pairs (x_1, y_1), . . . , (x_n, y_n) generated from an unknown fixed function f, one can approximate f by F_{L-layer} by learning the appropriate parameters by minimizing the following error function:

E(W_1, b_1, . . . , W_L, b_L) := (1/2n) Σ_{i=1}^{n} ‖F_{L-layer}(x_i) − y_i‖² = (1/2n) Σ_{i=1}^{n} ‖N^L ∘ · · · ∘ N^1(x_i) − y_i‖².

Assuming that our input data is one-dimensional, that is, each x_i ∈ R and y_i ∈ [0, 1], implement a gradient descent procedure to learn the parameters of a two-layer (i.e., L = 2) feed-forward neural network. Note that since each y_i is 1-dimensional, layer N^2 only contains one neuron. Layer N^1 can contain an arbitrary number, say k, of neurons. Graphically, the network you are implementing looks as follows.

[Figure: input x feeds a hidden layer of k neurons, which feed a single output neuron producing the network output ŷ.]

Here is the pseudocode of your implementation.

Learn 2-layer neural network for 1-d functions
input: data (x_1, y_1), . . . , (x_n, y_n), size of the intermediate layer k
– Initialize weight parameters (W_1, b_1), (W_2, b_2) randomly
– Repeat until convergence:
  – for each training example (x_i, y_i)
    – compute the network output ŷ_i on x_i
    – compute gradients ∂E/∂W_2, ∂E/∂b_2, ∂E/∂W_1, ∂E/∂b_1
    – update weight parameters:
      – W_2^new := W_2 − η ∂E/∂W_2,  b_2^new := b_2 − η ∂E/∂b_2
      – W_1^new := W_1 − η ∂E/∂W_1,  b_1^new := b_1 − η ∂E/∂b_1


You must submit your code on Courseworks to receive full credit.

(iv) Download hw2data 2.mat from the course website. It contains two vector-valued variables X and Y. Each row in Y is a noisy realization of applying an unknown smooth 1-dimensional function to the corresponding row in X. You can visualize the relationship between these variables by plotting X on the x-axis and Y on the y-axis.

Use your implementation from part (iii) to learn the unknown mapping function which can yield Y from X. Once the network parameters are learned, plot the network output on the y-axis (in red) along with the given Y values (in blue) for each input value X on the x-axis.
