COMS 4771 HW1

You are allowed to work in groups of at most three students. Only one submission per group is required by the due date. The name and UNI of all group members must be clearly specified on the homework. You must cite all the references you used to do this homework. You must show your work to receive full credit.

1 [Maximum Likelihood Estimation] Here we shall examine some properties of Maximum Likelihood Estimation (MLE).

(i) Consider the density p(x|θ) ∝ x² e^{−x/θ} if x ≥ 0, and 0 otherwise, for some θ > 0. Suppose that n samples x₁, . . . , xₙ are drawn i.i.d. from p(x|θ). What is the MLE of θ given the samples?
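Before committing to a closed-form answer, it can help to sanity-check it numerically. The sketch below (Python with numpy and scipy, not required for the homework) uses the fact that, once normalized, this density is a Gamma distribution with shape 3 and scale θ; it draws samples at a known θ and maximizes the log-likelihood numerically, which should agree with whatever formula you derive.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_true = 2.0
# normalized, p(x|theta) = x^2 e^{-x/theta} / (2 theta^3): a Gamma(3, theta)
x = rng.gamma(shape=3.0, scale=theta_true, size=100_000)

def neg_log_lik(theta):
    # per-sample log-density: 2 log x - x/theta - log(2 theta^3)
    return -np.sum(2 * np.log(x) - x / theta - np.log(2 * theta**3))

res = minimize_scalar(neg_log_lik, bounds=(1e-3, 100.0), method="bounded")
print(res.x)  # lands near theta_true = 2.0
```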

(ii) Consider the density p(x|θ) ∝ 1 if −θ ≤ x ≤ θ, and 0 otherwise, for some θ > 0. Suppose that n samples x₁, . . . , xₙ are drawn i.i.d. from p(x|θ). What is the MLE of θ given the samples?
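A quick numerical check works here too. The sketch below (Python with numpy, an illustration rather than a solution) evaluates the log-likelihood of samples drawn at a known θ over a grid of candidate values; the normalized density is 1/(2θ) on its support and zero elsewhere, so the grid maximizer shows where the MLE must sit.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 1.5
x = rng.uniform(-theta_true, theta_true, size=1_000)

thetas = np.linspace(0.01, 4.0, 4_000)
# log-likelihood: -n log(2 theta) when every sample is inside [-theta, theta],
# and -inf otherwise (some sample falls outside the support)
log_lik = np.where(thetas >= np.abs(x).max(),
                   -len(x) * np.log(2 * thetas), -np.inf)
print(thetas[np.argmax(log_lik)])  # the first grid point that covers all samples
```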

(iii) Recall the Gaussian density: p(x|µ, σ²) := (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), for some mean parameter µ ∈ R and variance parameter σ² > 0. Suppose that n samples x₁, . . . , xₙ are drawn i.i.d. from p(x|µ, σ²). Show that if µ is unknown, then the MLE σ²_ML is not an unbiased estimator of the variance σ² for all sample sizes n. What simple modification can we make to the estimate to make it unbiased?
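A simulation makes the bias tangible. The sketch below (Python with numpy; the particular µ, σ², and n are arbitrary illustrative choices) averages the MLE variance estimate over many repeated samples of size n and compares it with the true σ².

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n = 0.0, 4.0, 5        # arbitrary illustrative values
trials = 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
mu_ml = x.mean(axis=1, keepdims=True)        # per-trial MLE of the mean
sigma2_ml = ((x - mu_ml) ** 2).mean(axis=1)  # per-trial MLE of the variance

print(sigma2_ml.mean())  # noticeably below sigma2 = 4.0 for this small n
```
Running this shows the average estimate sitting systematically below σ², which is the bias the problem asks you to establish analytically.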

(iv) Show that for the MLE θ_ML of a parameter θ ∈ R^d and any known differentiable function g : R^d → R^k, the MLE of g(θ) is g(θ_ML). From this result infer the MLE for the standard deviation (σ) in the same setting as in Part (iii).

2 [Universality of Decision Trees]

(i) Show that any binary classifier g : {0, 1}^D → {0, 1} can be implemented as a decision tree classifier. That is, for any classifier g there exists a decision tree classifier T with k nodes n₁, . . . , n_k (each n_i with a corresponding threshold t_i), such that g(x) = T(x) for all x ∈ {0, 1}^D.

(ii) What is the best possible bound one can give on the maximum height of such a decision tree T (from Part (i))? For what function g is the bound tight?
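One way to build intuition for both parts is to fit an off-the-shelf tree to the complete truth table of some g. A sketch in Python with scikit-learn, using a randomly chosen g on D = 4 bits purely as an illustration:

```python
import itertools
import numpy as np
from sklearn.tree import DecisionTreeClassifier

D = 4
X = np.array(list(itertools.product([0, 1], repeat=D)))  # all 2^D inputs
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=len(X))                      # an arbitrary g

tree = DecisionTreeClassifier(random_state=0).fit(X, y)  # no depth cap
print(tree.score(X, y))   # 1.0: the tree reproduces g on every input
print(tree.get_depth())   # with binary features, each path splits a feature at most once
```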

3 [Designing the optimal predictor for continuous output spaces] We studied in class that the “Bayes Classifier” f := arg max_y P[Y = y | X = x] is optimal in the sense that it minimizes generalization error over the underlying distribution, that is, it maximizes E_{x,y}[1[f(x) = y]]. But what can we say when the output space Y is continuous?

Consider predictors of the kind g : X → R that predict a real-valued output for a given input x ∈ X. One intuitive way to define the quality of such a predictor g is as

Q(g) := E_{x,y}[(g(x) − y)²].

Observe that one would want a predictor g with the lowest Q(g).

Show that if one defines the predictor as f(x) := E[Y |X = x], then Q(f) ≤ Q(g) for any g, thereby showing that f is the optimal predictor with respect to Q for continuous output spaces.
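Before proving the claim, a Monte-Carlo experiment can make it concrete. In the sketch below (Python with numpy; the sine model and the competing predictors are arbitrary choices for illustration), y = sin(x) + noise, so the conditional mean is f(x) = sin(x), and its estimated Q comes out smaller than that of every other predictor tried.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200_000)
y = np.sin(x) + rng.normal(scale=0.5, size=x.size)  # so E[Y|X=x] = sin(x)

def Q(g):
    """Monte-Carlo estimate of E[(g(x) - y)^2]."""
    return np.mean((g(x) - y) ** 2)

print(Q(np.sin))                     # near 0.25, the irreducible noise variance
print(Q(lambda t: 0.9 * np.sin(t)))  # slightly worse
print(Q(np.cos))                     # much worse
```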

4 [Analyzing iterative optimization] In this problem, we will analyze the Richardson iteration (from HW0) for finding β ∈ R^d that (approximately) minimizes ‖Aβ − b‖₂² for a given matrix A ∈ R^{n×d} and vector b ∈ R^n.

Recall the Richardson iteration, given as follows:

– Initially, β^(0) = (0, . . . , 0) ∈ R^d is the zero vector in R^d.

– For k = 1, 2, . . . , N:

  – Compute β^(k) := β^(k−1) + ηA^T(b − Aβ^(k−1)).

Above, η > 0 is a fixed positive number called the step size, and N is the total number of iterations. Define M := A^T A and v := A^T b.

(i) Show that the matrix M is symmetric positive semi-definite.

Throughout, assume that the eigenvalues of M, denoted by λ₁, . . . , λ_d, satisfy λ_i < 1/η for all i = 1, . . . , d.

(ii) Prove (e.g., using mathematical induction) that, for any positive integer N,

β^(N) = η ∑_{k=0}^{N−1} (I − ηM)^k v.

(Here, for a square matrix B, we have B⁰ = I, B¹ = B, B² = BB, B³ = BBB, and so on.)

(iii) What are the eigenvalues of η ∑_{k=0}^{N−1} (I − ηM)^k? Give your answer in terms of λ₁, . . . , λ_d, η, and N.

(iv) Let β̂ be any vector in the range of M satisfying Mβ̂ = v. Prove that

‖β^(N) − β̂‖₂² ≤ e^{−2ηλ_min N} ‖β̂‖₂²,

where λ_min is the smallest non-zero eigenvalue of M.

Hint: You may use the fact that 1 + x ≤ e^x for any x ∈ R.

This implies that as the number of iterations N increases, the difference between our estimate β^(N) and β̂ decreases exponentially!
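The whole analysis can be checked empirically. The sketch below (Python with numpy; the dimensions and the step size are arbitrary choices that satisfy the λ_i < 1/η assumption) runs the iteration on a random full-rank instance, writing the update equivalently as β + η(v − Mβ), and checks the squared error at every step against the bound from Part (iv).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
M, v = A.T @ A, A.T @ b

lams = np.linalg.eigvalsh(M)
eta = 0.9 / lams.max()               # guarantees lambda_i < 1/eta
lam_min = lams[lams > 1e-10].min()   # smallest non-zero eigenvalue
beta_hat = np.linalg.solve(M, v)     # here M is invertible, so M beta_hat = v

beta = np.zeros(d)
for N in range(1, 51):
    beta = beta + eta * (v - M @ beta)  # Richardson update
    err = np.sum((beta - beta_hat) ** 2)
    bound = np.exp(-2 * eta * lam_min * N) * np.sum(beta_hat ** 2)
    assert err <= bound                 # the Part (iv) bound holds at every step
print(err, bound)                       # both shrink geometrically with N
```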

5 [A comparative study of classification performance of handwritten digits] Download the datafile hw1data.mat from the class website. This datafile contains 10,000 images (each of size 28×28 pixels = 784 dimensions) of handwritten digits along with the associated labels. Each handwritten digit belongs to one of the 10 possible categories {0, 1, . . . , 9}. There are two variables in this datafile: (i) Variable X is a 10,000×784 data matrix, where each row is a sample image of a handwritten digit. (ii) Variable Y is the 10,000×1 label vector where the i-th entry indicates the label of the i-th sample image in X.

Special note for those who are not using Matlab: Python users can use scipy to read in the mat file, R users can use the R.matlab package to read in the mat file, and Julia users can use the MAT.jl package.

To visualize this data (in Matlab): say you want to see the actual handwritten character image of the 77th datasample. You may run the following code (after the data has been loaded):

figure;
imagesc(1-reshape(X(77,:),[28 28])');
colormap gray;

To see the associated label value:

Y(77)
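For Python users, a rough equivalent of the Matlab snippet above (assuming scipy and matplotlib are installed, and that hw1data.mat sits in the working directory) might look like:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import loadmat

def show_digit(X, i):
    """Display row i of X (0-indexed) as a 28x28 grayscale image."""
    img = 1 - X[i].reshape(28, 28).T  # transpose matches the Matlab reshape
    plt.imshow(img, cmap="gray")
    plt.axis("off")
    plt.show()

# data = loadmat("hw1data.mat")
# X, Y = data["X"], data["Y"]
# show_digit(X, 76)   # Matlab's X(77,:) is row index 76 in Python
# print(Y[76])        # the associated label
```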

(i) Create a probabilistic classifier (as discussed in class) to solve the handwritten digit classification problem. The class conditional densities of your probabilistic classifier should be modeled by a Multivariate Gaussian distribution. It may help to recall that the MLEs for the parameters of a Multivariate Gaussian are:

µ_ML := (1/n) ∑_{i=1}^n x_i

Σ_ML := (1/n) ∑_{i=1}^n (x_i − µ_ML)(x_i − µ_ML)^T

You must submit your code via Courseworks to receive full credit.
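A minimal sketch of such a classifier in Python with numpy follows. The `reg` ridge term added to each covariance is an assumption of this sketch, not part of the MLE: the raw 784×784 MLE covariances of the digit data are typically singular, so some regularization (or an equivalent workaround) is needed before inverting them.

```python
import numpy as np

def fit_gaussian_classifier(X, y, reg=0.1):
    """Per-class Gaussian MLE (mean, covariance) plus empirical class priors.
    reg*I keeps each covariance invertible (an assumption of this sketch)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        diff = Xc - mu
        Sigma = diff.T @ diff / len(Xc) + reg * np.eye(X.shape[1])
        params[c] = (mu, Sigma, len(Xc) / len(X))
    return params

def predict(params, X):
    classes = sorted(params)
    scores = []
    for c in classes:
        mu, Sigma, prior = params[c]
        diff = X - mu
        _, logdet = np.linalg.slogdet(Sigma)
        # per-row Mahalanobis distance diff^T Sigma^{-1} diff
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
        # log N(x|mu,Sigma) + log prior, dropping the shared 2*pi constant
        scores.append(-0.5 * (logdet + maha) + np.log(prior))
    return np.array(classes)[np.argmax(np.vstack(scores), axis=0)]
```

On the digit data one would fit on a training split of X, Y and compare `predict` against held-out labels.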

(ii) Create a k-Nearest Neighbor classifier (with Euclidean distance as the metric) to solve the handwritten digit classification problem.

You must submit your code via Courseworks to receive full credit.
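A compact k-NN sketch in Python with numpy (labels are assumed to be non-negative integers, as with the digits; for the full 10,000-image dataset you may want to predict in batches to keep the distance matrix small):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """k-nearest-neighbor prediction under Euclidean distance, majority vote."""
    # squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = ((X_test ** 2).sum(1)[:, None]
          + (X_train ** 2).sum(1)[None, :]
          - 2 * X_test @ X_train.T)
    nn = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest training points
    votes = y_train[nn]
    # majority vote per test point (labels must be non-negative ints)
    return np.array([np.bincount(v).argmax() for v in votes])
```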

(iii) Which classifier (the one developed in Part (i) or the one developed in Part (ii)) is better? You must justify your answer with appropriate performance graphs demonstrating the superiority of one classifier over the other. Example things to consider: you should evaluate how the classifier behaves on a holdout ‘test’ sample for various splits of the data; how does the training sample size affect the classification performance.

(iv) As discussed in class, there are several metrics one can use in Nearest Neighbor classification. Do a similar analysis to justify which of the three metrics (L1, L2, or L∞) is better for the handwritten digit classification problem.
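As a starting point, scipy's `cdist` computes all three candidate metrics with the same call shape (metric names follow scipy's conventions), so a Nearest Neighbor implementation only needs its distance computation swapped out:

```python
import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[1.0, -2.0, 3.0]])
b = np.zeros((1, 3))
print(cdist(a, b, metric="cityblock")[0, 0])  # L1:  6.0
print(cdist(a, b, metric="euclidean")[0, 0])  # L2:  sqrt(14) ≈ 3.742
print(cdist(a, b, metric="chebyshev")[0, 0])  # L∞: 3.0
```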


6 [Understanding model complexity and overfitting] Here we will empirically study the tradeoff between model complexity and generalizability.

(i) Build a decision tree classifier for the digit dataset that we already introduced in Question 5. In building your decision tree, you may use any reasonable uncertainty measure to determine the feature and threshold to split at in each cell. Make sure the depth of the tree is adjustable with a hyperparameter K.

You must submit your code via Courseworks to receive full credit.

(ii) Ensure that there is a random split between training and test data. Plot the training error and test error as a function of K.
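A sketch of this experiment in Python with scikit-learn (the synthetic X, y below is only a stand-in so the snippet runs on its own; substitute the digit data from Question 5):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# stand-in data so the sketch is self-contained; replace with X, Y from hw1data.mat
rng = np.random.default_rng(0)
X = rng.random((2000, 784))
y = (X[:, :10].sum(axis=1) > 5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

depths = range(1, 21)
train_err, test_err = [], []
for K in depths:
    tree = DecisionTreeClassifier(max_depth=K, random_state=0).fit(X_tr, y_tr)
    train_err.append(1 - tree.score(X_tr, y_tr))
    test_err.append(1 - tree.score(X_te, y_te))
# plot train_err and test_err against depths (e.g. with matplotlib) to see
# training error fall as K grows while test error eventually stops improving
```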

(iii) Do the trends change for different random splits of training and test data?

(iv) How do you explain the difference in the behavior of training and testing error as a function of K?

(v) Based on your analysis, what is a good setting of K if you were to deploy your decision tree classifier to classify handwritten digits?
