COMS 4771 HW3

You are allowed to work in groups of (at max) three students. These group members don't necessarily have to be the same as in previous homeworks. Only one submission per group is required by the due date. The name and UNI of all group members must be clearly specified on the homework. You must cite all the references you used to do this homework. You must show your work to receive full credit.

1(a) [Mixing Ridge and Lasso regression] You want to design a regressor with good prediction accuracy. Knowing that various kinds of regularization help, you decide to incorporate both the lasso and ridge penalties in the design of your optimization criterion. Given training data $(x_1, y_1), \dots, (x_n, y_n)$ (where each $x_i \in \mathbb{R}^d$ and each $y_i \in \mathbb{R}$), you come up with the following optimization:

$$\arg\min_w \; \|wX - y\|^2 + \lambda \left[ \alpha \|w\|_2^2 + (1 - \alpha) \|w\|_1 \right],$$

where: (i) $X$ is a $d \times n$ data matrix whose $i$-th column corresponds to the training example $x_i$, (ii) $y$ is a $1 \times n$ output vector whose $i$-th component corresponds to the training output $y_i$, and (iii) $w$ is a $1 \times d$ parameter weight vector. $\lambda \ge 0$ and $\alpha \in [0, 1]$ are known trade-off parameters.

Show that this optimization can be turned into a lasso regression by appropriately augmenting the training data.
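Though the exercise asks for a derivation on paper, it can help to sanity-check a candidate augmentation numerically. The NumPy sketch below is my own illustration (not part of the assignment): it uses the problem's $d \times n$ column-per-example convention and one particular augmentation of $(X, y)$; justifying *why* it works is exactly what the exercise asks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
X = rng.standard_normal((d, n))    # d x n matrix, columns are examples x_i
y = rng.standard_normal(n)         # 1 x n output vector
lam, alpha = 0.7, 0.3              # arbitrary trade-off parameters

def elastic_objective(w):
    # ||wX - y||^2 + lam * [alpha ||w||_2^2 + (1 - alpha) ||w||_1]
    return (np.sum((w @ X - y) ** 2)
            + lam * (alpha * np.sum(w ** 2) + (1 - alpha) * np.sum(np.abs(w))))

# Candidate augmentation (to be justified in your writeup): append
# sqrt(lam * alpha) * I_d as d extra "examples" with target output 0,
# and keep only an l1 penalty with weight lam * (1 - alpha).
X_aug = np.hstack([X, np.sqrt(lam * alpha) * np.eye(d)])
y_aug = np.hstack([y, np.zeros(d)])

def lasso_objective(w):
    return np.sum((w @ X_aug - y_aug) ** 2) + lam * (1 - alpha) * np.sum(np.abs(w))

# The two objectives agree for every w, so their minimizers coincide.
w = rng.standard_normal(d)
print(np.isclose(elastic_objective(w), lasso_objective(w)))   # True
```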

1(b) [Bayesian interpretation of ridge regression] Consider the following data generating process for a linear regression problem in $\mathbb{R}^d$. Nature first selects $d$ weight coefficients $w_1, \dots, w_d$ as $w_i \sim N(0, \tau^2)$ i.i.d. Given $n$ examples $x_1, \dots, x_n \in \mathbb{R}^d$, nature generates the output variable $y_i$ as

$$y_i = \sum_{j=1}^{d} w_j x_{i,j} + \epsilon_i,$$

where $\epsilon_i \sim N(0, \sigma^2)$ i.i.d.

Show that finding the coefficients $w_1, \dots, w_d$ that maximize $P[w_1, \dots, w_d \mid (x_1, y_1), \dots, (x_n, y_n)]$ is equivalent to minimizing the ridge optimization criterion.
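As an illustration (not required for the proof), the sketch below simulates this generative process and numerically checks that the negative log-posterior and a ridge objective with $\lambda = \sigma^2/\tau^2$ differ only by a positive constant factor, so they share the same minimizer. That value of $\lambda$ is the relationship your derivation should produce; treat the code as a spot-check, not as the argument itself.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, tau, sigma = 3, 50, 2.0, 0.5   # arbitrary illustrative values

# Nature's draw: w_j ~ N(0, tau^2) i.i.d., then y_i = <w, x_i> + eps_i
w_true = rng.normal(0.0, tau, size=d)
X = rng.standard_normal((n, d))       # here rows are the examples x_i
y = X @ w_true + rng.normal(0.0, sigma, size=n)

def neg_log_posterior(w):
    # -log P(w | data) up to an additive constant: Gaussian likelihood
    # of the y_i given w, plus the Gaussian prior on w (Bayes rule)
    return (np.sum((X @ w - y) ** 2) / (2 * sigma ** 2)
            + np.sum(w ** 2) / (2 * tau ** 2))

def ridge_objective(w, lam):
    return np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)

# With lam = sigma^2 / tau^2, the ridge objective is exactly
# 2 * sigma^2 times the negative log-posterior, so both are
# minimized by the same w.
lam = sigma ** 2 / tau ** 2
for w in rng.standard_normal((3, d)):
    assert np.isclose(ridge_objective(w, lam),
                      2 * sigma ** 2 * neg_log_posterior(w))
print("objectives match up to the constant factor", 2 * sigma ** 2)
```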


2 [Combining multiple classifiers] The concept of "wisdom-of-the-crowd" posits that the collective knowledge of a group, as expressed through their aggregated actions or opinions, is superior to the decision of any one individual in the group. Here we will study a version of the "wisdom-of-the-crowd" for binary classifiers: how can one combine prediction outputs from multiple, possibly low-quality, binary classifiers to achieve an aggregate high-quality final output? Consider the following iterative procedure to combine classifier results.

Input:

– $S$ – a set of training samples: $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, where each $y_i \in \{-1, +1\}$
– $T$ – number of iterations (also, number of classifiers to combine)
– $F$ – a set of (possibly low-quality) classifiers. Each $f \in F$ is of the form $f : X \to \{-1, +1\}$

Output:

– $F$ – a set of selected classifiers $\{f_1, \dots, f_T\}$, where each $f_i \in F$
– $A$ – a set of combination weights $\{\alpha_1, \dots, \alpha_T\}$

Iterative Combination Procedure:

– Initialize distribution weights $D_1(i) = \frac{1}{m}$ [for $i = 1, \dots, m$]
– for $t = 1, \dots, T$ do
– // $\epsilon_j$ is the weighted error of the $j$-th classifier w.r.t. $D_t$
– Define $\epsilon_j := \sum_{i=1}^{m} D_t(i) \cdot \mathbf{1}[y_i \neq f_j(x_i)]$ [for each $f_j \in F$]
– // select the classifier with the smallest (weighted) error
– $f_t = \arg\min_{f_j \in F} \epsilon_j$
– $\epsilon_t = \min_{f_j \in F} \epsilon_j$
– // recompute weights w.r.t. performance of $f_t$
– Compute classifier weight $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
– Compute distribution weight $D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i f_t(x_i))$
– Normalize distribution weights $D_{t+1}(i) = \frac{D_{t+1}(i)}{\sum_i D_{t+1}(i)}$
– endfor
– return weights $\alpha_t$, and classifiers $f_t$ for $t = 1, \dots, T$.

Final Combined Prediction:

– For any test input $x$, define the aggregation function as $g(x) := \sum_t \alpha_t f_t(x)$, and return the prediction as $\mathrm{sign}(g(x))$.
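The procedure above can be sketched in plain NumPy, representing the classifier set $F$ by a precomputed $k \times m$ prediction matrix. The 1-D threshold classifiers ("stumps") in the demo are an invented toy example, not part of the assignment:

```python
import numpy as np

def combine_classifiers(preds, y, T):
    """Iterative combination procedure from the problem statement.

    preds : k x m matrix with preds[j, i] = f_j(x_i) in {-1, +1}
    y     : length-m label vector in {-1, +1}
    Returns chosen classifier indices, weights alpha_t, and errors eps_t.
    """
    k, m = preds.shape
    D = np.full(m, 1.0 / m)               # D_1(i) = 1/m
    chosen, alphas, errors = [], [], []
    for _ in range(T):
        # eps_j: weighted error of the j-th classifier w.r.t. D_t
        eps = (D * (preds != y)).sum(axis=1)
        j = int(np.argmin(eps))           # f_t = argmin over F
        eps_t = eps[j]
        alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)
        # reweight w.r.t. the performance of f_t, then renormalize
        D = D * np.exp(-alpha_t * y * preds[j])
        D /= D.sum()
        chosen.append(j); alphas.append(alpha_t); errors.append(eps_t)
    return chosen, alphas, errors

def aggregate_predict(preds, chosen, alphas):
    # g(x) = sum_t alpha_t f_t(x); predict sign(g(x))
    g = sum(a * preds[j] for j, a in zip(chosen, alphas))
    return np.sign(g)

# Toy demo with hypothetical 1-D threshold classifiers ("stumps")
x = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
y = np.array([1, 1, -1, 1, -1, -1])
thresholds = np.array([0.15, 0.25, 0.45, 0.65, 0.8])
preds = np.where(x[None, :] < thresholds[:, None], 1, -1)   # k x m
chosen, alphas, errors = combine_classifiers(preds, y, T=5)
final = aggregate_predict(preds, chosen, alphas)
print("training error:", np.mean(final != y))
```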

We'll prove the following statement: If for each iteration $t$ there is some $\gamma_t > 0$ such that $\epsilon_t = \frac{1}{2} - \gamma_t$ (that is, assuming that at each iteration the error of the classifier $f_t$ is just $\gamma_t$ better than random guessing), then the error of the aggregate classifier satisfies

$$\mathrm{err}(g) := \frac{1}{m} \sum_i \mathbf{1}[y_i \neq \mathrm{sign}(g(x_i))] \le \exp\left(-2 \sum_{t=1}^{T} \gamma_t^2\right).$$

That is, the error of the aggregate classifier $g$ decreases exponentially fast with the number of combinations $T$!

(i) Let $Z_t := \sum_i D_{t+1}(i)$ (i.e., $Z_t$ denotes the normalization constant for the weighted distribution $D_{t+1}$). Show that

$$D_{T+1}(i) = \frac{1}{m} \cdot \frac{1}{\prod_t Z_t} \exp(-y_i g(x_i)).$$
(ii) Show that the error of the aggregate classifier $g$ is upper bounded by the product of the $Z_t$: $\mathrm{err}(g) \le \prod_t Z_t$.
(hint: use the fact that the 0-1 loss is upper bounded by the exponential loss)

(iii) Show that $Z_t = 2\sqrt{\epsilon_t(1 - \epsilon_t)}$.
(hint: noting $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i f_t(x_i))$, separate the expression into the correctly and incorrectly classified cases and express it in terms of $\epsilon_t$)

(iv) By combining the results from (ii) and (iii), we have that $\mathrm{err}(g) \le \prod_t 2\sqrt{\epsilon_t(1 - \epsilon_t)}$. Now show that:

$$\prod_t 2\sqrt{\epsilon_t(1 - \epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right),$$

thus establishing that $\mathrm{err}(g) \le \exp\left(-2 \sum_t \gamma_t^2\right)$.

3 [Low-dimensional information-preserving transformations] (hashing the cube) You have a collection of nonzero distinct binary vectors $x_1, \dots, x_m \in \{0, 1\}^n$. To facilitate later lookup, you decide to hash them to vectors of length $p < n$ by means of a linear mapping $x_i \mapsto Ax_i$, where $A$ is a $p \times n$ matrix with 0-1 entries, and all computations are performed modulo 2. Suppose the entries of the matrix are picked uniformly at random (i.e., each is an independent coin toss).

(i) Pick any $1 \le i \le m$ and any $b \in \{0, 1\}^p$. Show that the probability (over the choice of $A$) that $x_i$ hashes to $b$ is exactly $1/2^p$. (Hint: focus on a coordinate $1 \le j \le n$ for which $x_{ij} = 1$.)

(ii) Pick any $1 \le i < j \le m$. What is the probability that $x_i$ and $x_j$ hash to the same vector? This is called a collision.

(iii) Show that if $p \ge 2 \log_2 m$, then with probability at least 1/2, there are no collisions among the $x_i$. Thus: to avoid collisions, it is enough to linearly hash into $O(\log m)$ dimensions!

(question credit: Prof. Sanjoy Dasgupta)
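A small simulation can make part (i) concrete: hashing a fixed nonzero vector with freshly drawn random matrices $A$ should hit any fixed bucket $b$ with frequency close to $1/2^p$. The sketch below is illustrative only (the sizes $n$, $p$ and the trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 12, 8          # arbitrary illustrative sizes with p < n

def hash_vec(A, x):
    # linear hash over GF(2): x -> A x (mod 2)
    return (A @ x) % 2

# Empirical check of part (i): for a fixed nonzero x and a fixed bucket b,
# drawing A uniformly at random should hit b with frequency close to 1/2^p.
x = rng.integers(0, 2, size=n)
x[0] = 1              # make sure x is nonzero
b = np.zeros(p, dtype=int)
trials, hits = 100_000, 0
for _ in range(trials):
    A = rng.integers(0, 2, size=(p, n))
    if np.array_equal(hash_vec(A, x), b):
        hits += 1
print(hits / trials, "vs", 1 / 2 ** p)   # the two should be close
```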


4 [Regression Competition!] You'll compete with your classmates on designing a good-quality regressor for a large-scale dataset.

(i) Sign up on http://www.kaggle.com with your Columbia email address.

(ii) Visit the COMS 4771 competition at: https://www.kaggle.com/t/07fabc88beea472f8108fadc91d9c029 and develop a regressor for the large-scale dataset available there. You can use any publicly available resource to develop your regressor. (You don't need to submit your code for this question.)

(iii) Your pdf writeup should describe the design of your regressor: What preprocessing techniques and regressor did you use? Why did you make these choices? What resources did you use, and which were helpful? What worked and what didn't?

Make sure to cite all the resources that you used.

Evaluation criterion:

– You must use your (and your group members') UNIs as your team name in order to get points. For example:

· If you have two group members with UNIs ab1234 and xy0987,
· the team name should be: ab1234 xy0987

– Your team must get a Mean Absolute Error (MAE) score of at most 7.000 on the test dataset to get full credit.

– Top ten teams (across all sections) will get extra points!
