## Description

A Mathematical Introduction to Data Science

Homework 3. MLE and James-Stein Estimator

The problem below marked by ∗ is optional, with bonus credit. For the experimental problem, include source code that is runnable under standard settings.

1. Maximum Likelihood Method: consider $n$ random samples from a multivariate normal distribution, $X_i \in \mathbb{R}^p \sim \mathcal{N}(\mu, \Sigma)$, $i = 1, \ldots, n$.

(a) Show that the log-likelihood function is
$$\ell_n(\mu, \Sigma) = -\frac{n}{2} \operatorname{trace}(\Sigma^{-1} S_n) - \frac{n}{2} \log \det(\Sigma) + C,$$
where $S_n = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)(X_i - \mu)^T$ and $C$ is a constant that does not depend on $\mu$ and $\Sigma$;

(b) Show that $f(X) = \operatorname{trace}(A X^{-1})$ with $A, X \succ 0$ has the first-order approximation
$$f(X + \Delta) \approx f(X) - \operatorname{trace}(X^{-1} A X^{-1} \Delta),$$
hence formally $df(X)/dX = -X^{-1} A X^{-1}$ (note: $(I + X)^{-1} \approx I - X$);

(c) Show that $g(X) = \log \det(X)$ with $X \succ 0$ has the first-order approximation
$$g(X + \Delta) \approx g(X) + \operatorname{trace}(X^{-1} \Delta),$$
hence $dg(X)/dX = X^{-1}$ (note: consider the eigenvalues of $X^{-1/2} \Delta X^{-1/2}$);

(d) Use these formal derivatives with respect to positive semi-definite matrix variables to show that the maximum likelihood estimator of $\Sigma$ is
$$\hat{\Sigma}^{\mathrm{MLE}}_n = S_n.$$

A reference for (b) and (c) can be found in *Convex Optimization* by Boyd and Vandenberghe, examples in Appendix A.4.1 and A.4.3: https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
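As an optional numerical sanity check for Problem 1 (not part of the required derivation; assumes NumPy), the sketch below verifies that the trace form of the log-likelihood matches a direct evaluation of the Gaussian log-density, and that nearby perturbations of $S_n$ do not increase the likelihood, consistent with $\hat{\Sigma}^{\mathrm{MLE}}_n = S_n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((p, p))
Sigma_true = A @ A.T + p * np.eye(p)     # a random positive-definite covariance
X = rng.multivariate_normal(mu, Sigma_true, size=n)

Sn = (X - mu).T @ (X - mu) / n           # S_n with the true mean plugged in

def loglik_trace(Sigma):
    """l_n(mu, Sigma) in trace form, dropping the constant C."""
    Sinv = np.linalg.inv(Sigma)
    return -0.5 * n * np.trace(Sinv @ Sn) - 0.5 * n * np.log(np.linalg.det(Sigma))

def loglik_direct(Sigma):
    """Direct sum of per-sample Gaussian quadratic forms, dropping the same C."""
    Sinv = np.linalg.inv(Sigma)
    D = X - mu
    quad = np.einsum('ij,jk,ik->i', D, Sinv, D)
    return -0.5 * quad.sum() - 0.5 * n * np.log(np.linalg.det(Sigma))

# The two forms agree...
assert np.isclose(loglik_trace(Sigma_true), loglik_direct(Sigma_true))
# ...and S_n beats positive-definite perturbations of itself.
for _ in range(20):
    E = 0.1 * rng.standard_normal((p, p))
    assert loglik_trace(Sn) >= loglik_trace(Sn + E @ E.T)
```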

2. Shrinkage: suppose $y \sim \mathcal{N}(\mu, I_p)$.


(a) Consider the ridge regression
$$\min_{\mu} \; \frac{1}{2} \|y - \mu\|_2^2 + \frac{\lambda}{2} \|\mu\|_2^2.$$

Show that the solution is given by
$$\hat{\mu}^{\mathrm{ridge}}_i = \frac{1}{1 + \lambda} y_i.$$
Compute the risk (mean square error) of this estimator. The risk of the MLE is obtained as the special case $C = I$ of the linear estimator in Problem 3.
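A small optional check of part (a) (an illustrative sketch assuming NumPy, not the required proof): the closed-form minimizer $y/(1+\lambda)$ should beat random perturbations on the ridge objective, and a Monte Carlo estimate of the risk should agree with the standard bias-variance split $\big(p + \lambda^2 \|\mu\|^2\big)/(1+\lambda)^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, lam = 5, 0.7
y = rng.standard_normal(p)

def ridge_obj(mu):
    # 0.5 * ||y - mu||^2 + 0.5 * lam * ||mu||^2
    return 0.5 * np.sum((y - mu) ** 2) + 0.5 * lam * np.sum(mu ** 2)

mu_ridge = y / (1.0 + lam)
for _ in range(100):
    assert ridge_obj(mu_ridge) <= ridge_obj(mu_ridge + 0.5 * rng.standard_normal(p))

# Monte Carlo risk vs. the closed-form variance + squared-bias decomposition.
mu_true = np.array([2.0, -1.0, 0.0, 0.5, 1.5])
draws = rng.standard_normal((200000, p)) + mu_true
mc_risk = np.mean(np.sum((draws / (1 + lam) - mu_true) ** 2, axis=1))
closed = (p + lam ** 2 * np.sum(mu_true ** 2)) / (1 + lam) ** 2
assert abs(mc_risk - closed) < 0.05
```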

(b) Consider the LASSO problem
$$\min_{\mu} \; \frac{1}{2} \|y - \mu\|_2^2 + \lambda \|\mu\|_1.$$

Show that the solution is given by soft-thresholding:
$$\hat{\mu}^{\mathrm{soft}}_i = \mu^{\mathrm{soft}}(y_i; \lambda) := \operatorname{sign}(y_i)(|y_i| - \lambda)_+.$$

For the choice $\lambda = \sqrt{2 \log p}$, show that the risk is bounded by
$$\mathbb{E} \|\hat{\mu}^{\mathrm{soft}}(y) - \mu\|^2 \leq 1 + (2 \log p + 1) \sum_{i=1}^{p} \min(\mu_i^2, 1).$$
Under what conditions on $\mu$ is this risk smaller than that of the MLE? Note: see *Gaussian Estimation* by Iain Johnstone, Lemma 2.9 and the reasoning before it.
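Since the LASSO objective here separates across coordinates, the soft-thresholding formula can be sanity-checked coordinate-wise against a brute-force grid search (an optional sketch assuming NumPy, not the required proof):

```python
import numpy as np

lam = 1.2
ys = np.array([3.0, -0.4, 1.2, -2.5, 0.0])

def soft(y, lam):
    # sign(y) * (|y| - lam)_+
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Per-coordinate objective: 0.5 * (y_i - m)^2 + lam * |m|, minimized over a grid.
grid = np.linspace(-5, 5, 20001)
for yi in ys:
    obj = 0.5 * (yi - grid) ** 2 + lam * np.abs(grid)
    brute = grid[np.argmin(obj)]
    assert abs(brute - soft(yi, lam)) < 1e-3
```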

(c) Consider the $\ell_0$ regularization
$$\min_{\mu} \; \|y - \mu\|_2^2 + \lambda^2 \|\mu\|_0,$$
where $\|\mu\|_0 := \sum_{i=1}^{p} I(\mu_i \neq 0)$. Show that the solution is given by hard-thresholding:

$$\hat{\mu}^{\mathrm{hard}}_i = \mu^{\mathrm{hard}}(y_i; \lambda) := y_i \, I(|y_i| > \lambda).$$
Rewriting $\hat{\mu}^{\mathrm{hard}}(y) = (1 - g(y))\, y$, is $g(y)$ weakly differentiable? Why?
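An optional coordinate-wise check of the hard-thresholding formula (a sketch assuming NumPy, not the required proof): for each coordinate there are only two candidate minimizers of $(y_i - \mu_i)^2 + \lambda^2 I(\mu_i \neq 0)$, namely $\mu_i = y_i$ at cost $\lambda^2$ and $\mu_i = 0$ at cost $y_i^2$, and comparing them reproduces the rule:

```python
import numpy as np

lam = 1.0
ys = np.array([2.0, -0.3, 1.0001, -0.9999, 0.0])

def hard(y, lam):
    # y * I(|y| > lam)
    return y * (np.abs(y) > lam)

for yi in ys:
    keep_cost = lam ** 2      # cost of mu_i = y_i: residual 0, penalty lam^2
    drop_cost = yi ** 2       # cost of mu_i = 0:  residual y_i^2, no penalty
    best = yi if keep_cost < drop_cost else 0.0
    assert hard(yi, lam) == best
```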

(d) Consider the James-Stein estimator
$$\hat{\mu}^{\mathrm{JS}}(y) = \left(1 - \frac{\alpha}{\|y\|^2}\right) y.$$

Show that the risk is
$$\mathbb{E} \|\hat{\mu}^{\mathrm{JS}}(y) - \mu\|^2 = \mathbb{E}\, U_\alpha(y),$$
where $U_\alpha(y) = p - (2\alpha(p - 2) - \alpha^2)/\|y\|^2$. Find the optimal $\alpha^* = \arg\min_\alpha U_\alpha(y)$. Show that for $p > 2$, the risk of the James-Stein estimator is smaller than that of the MLE for all $\mu \in \mathbb{R}^p$.
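An optional Monte Carlo illustration of the dominance claim (a sketch assuming NumPy; it takes $\alpha = p - 2$, the value that maximizes $2\alpha(p-2) - \alpha^2$ in $U_\alpha$, and one arbitrary mean vector):

```python
import numpy as np

rng = np.random.default_rng(2)
p, reps = 10, 100000
alpha = p - 2
mu = np.linspace(-2, 2, p)             # an arbitrary mean vector

y = rng.standard_normal((reps, p)) + mu
sq = np.sum(y ** 2, axis=1)
js = (1 - alpha / sq)[:, None] * y     # James-Stein shrinkage toward 0
js_risk = np.mean(np.sum((js - mu) ** 2, axis=1))

# The MLE (identity estimator) has risk exactly p; James-Stein does better.
assert js_risk < p
```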


(e) In general, an odd, monotone, unbounded function $\Theta_\lambda : \mathbb{R} \to \mathbb{R}$ with parameter $\lambda \geq 0$ is called a shrinkage rule if it satisfies
[shrinkage] $0 \leq \Theta_\lambda(|t|) \leq |t|$;
[odd] $\Theta_\lambda(-t) = -\Theta_\lambda(t)$;
[monotone] $\Theta_\lambda(t) \leq \Theta_\lambda(t')$ for $t \leq t'$;
[unbounded] $\lim_{t \to \infty} \Theta_\lambda(t) = \infty$.

Which of the rules above are shrinkage rules?
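One way to explore this question numerically (an optional sketch assuming NumPy, checking the four properties on a grid for the soft rule; the same template can be applied to the other rules):

```python
import numpy as np

lam = 1.0
t = np.linspace(0.0, 10.0, 1001)

def soft(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

vals = soft(t, lam)
assert np.all((0 <= vals) & (vals <= t))   # [shrinkage] 0 <= Theta(|t|) <= |t|
assert np.allclose(soft(-t, lam), -vals)   # [odd] Theta(-t) = -Theta(t)
assert np.all(np.diff(vals) >= 0)          # [monotone] nondecreasing on the grid
assert vals[-1] > 5                        # [unbounded] grows along the grid
```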

3. Necessary Condition for Admissibility of Linear Estimators: consider the linear estimator for $y \sim \mathcal{N}(\mu, \sigma^2 I_p)$,
$$\hat{\mu}_C(y) = Cy.$$
Show that $\hat{\mu}_C$ is admissible only if
(a) $C$ is symmetric;
(b) $0 \leq \rho_i(C) \leq 1$, where $\rho_i(C)$ are the eigenvalues of $C$;
(c) $\rho_i(C) = 1$ for at most two indices $i$.
These conditions are satisfied by the MLE when $p = 1$ and $p = 2$.
Reference: Theorem 2.3 in *Gaussian Estimation* by Iain Johnstone, http://statweb.stanford.edu/~imj/Book100611.pdf
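As optional background for this problem (a sketch assuming NumPy, not part of the required proof), the risk of a linear estimator decomposes as $\sigma^2 \operatorname{trace}(C C^T) + \|(C - I)\mu\|^2$; the check below verifies this by Monte Carlo for one diagonal $C$, and illustrates condition (b) by showing that clipping an eigenvalue above 1 down to 1 lowers the risk:

```python
import numpy as np

rng = np.random.default_rng(3)
p, sigma, reps = 4, 1.0, 200000
mu = np.array([1.0, 0.0, -2.0, 0.5])
C = np.diag([1.0, 0.8, 0.5, 1.2])       # eigenvalue 1.2 > 1 violates condition (b)

y = sigma * rng.standard_normal((reps, p)) + mu
mc_risk = np.mean(np.sum((y @ C.T - mu) ** 2, axis=1))
closed = sigma ** 2 * np.trace(C @ C.T) + np.sum(((C - np.eye(p)) @ mu) ** 2)
assert abs(mc_risk - closed) < 0.05

# Clipping the offending eigenvalue to 1 reduces both variance and bias terms.
c_clip = np.minimum(np.diag(C), 1.0)
closed_clip = sigma ** 2 * np.sum(c_clip ** 2) + np.sum(((c_clip - 1) * mu) ** 2)
assert closed_clip < closed
```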

4. ∗James-Stein Estimator for $p = 1, 2$ and an upper bound: if we use SURE to calculate the risk of the James-Stein estimator,
$$R(\hat{\mu}^{\mathrm{JS}}, \mu) = \mathbb{E}\, U(Y) = p - \mathbb{E}_\mu \frac{(p - 2)^2}{\|Y\|^2} < p = R(\hat{\mu}^{\mathrm{MLE}}, \mu),$$
it seems that for $p = 1$ the James-Stein estimator should still have lower risk than the MLE for any $\mu$. Can you find out what happens in the $p = 1$ and $p = 2$ cases?

Moreover, can you derive the following upper bound for the risk of the James-Stein estimator?
$$R(\hat{\mu}^{\mathrm{JS}}, \mu) \leq p - \frac{(p - 2)^2}{p - 2 + \|\mu\|^2} = 2 + \frac{(p - 2) \|\mu\|^2}{p - 2 + \|\mu\|^2}.$$
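An optional Monte Carlo probe of the claimed bound (a sketch assuming NumPy; it uses $\alpha = p - 2$ and a few mean vectors $\mu = s \cdot \mathbf{1}$, so $\|\mu\|^2 = s^2 p$, with a small slack for sampling error):

```python
import numpy as np

rng = np.random.default_rng(4)
p, reps = 8, 100000

for scale in [0.0, 0.5, 2.0]:
    mu = scale * np.ones(p)
    y = rng.standard_normal((reps, p)) + mu
    sq = np.sum(y ** 2, axis=1)
    js = (1 - (p - 2) / sq)[:, None] * y
    risk = np.mean(np.sum((js - mu) ** 2, axis=1))
    bound = p - (p - 2) ** 2 / (p - 2 + scale ** 2 * p)
    assert risk <= bound + 0.1     # slack for Monte Carlo error (tight at mu = 0)
```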