COMS 4771: Machine Learning

Homework 5

Note: This homework is optional and will only count towards extra credit.

In class, the form of the Expectation-Maximization (EM) algorithm we studied is the most

common one, and is sometimes called “soft EM” since the E-step computes soft assignments of

the hidden variables. There is a variant called “hard EM” where the E-step is replaced by a hard

assignment of the hidden variables.

Specifically, suppose the observed data point is denoted by X ∈ X , the hidden variable is

denoted by Y ∈ [k] = {1, 2, . . . , k}, and their joint distribution is obtained from a parameterized

model with parameters denoted by θ. Let Prθ[·] denote probabilities in the distribution with

parameters θ.

The (soft) EM algorithm as we studied it in class repeats the following two steps until convergence, starting from an arbitrary initialization of the parameters θ:

• E-step: For each training example x^(i) and for all j ∈ [k], compute

    q_j^(i) = Pr_θ[Y = j | X = x^(i)].

• M-step: Update θ as follows:

    θ ← arg max_{θ′} Σ_{i=1}^{n} Σ_{j=1}^{k} q_j^(i) · ln Pr_{θ′}[X = x^(i), Y = j].
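As an illustration of the E-step, here is a minimal sketch for an assumed toy model (a mixture of two unit-variance 1-D Gaussians; the model, the `(pi, mu)` parameter layout, and the function names are my own, not from the handout):

```python
import numpy as np

# Hedged sketch: E-step posteriors for an assumed toy mixture of two
# unit-variance 1-D Gaussians. Pr_theta[X = x, Y = j] = pi_j * N(x; mu_j, 1).

def joint(x, j, pi, mu):
    # joint probability density of (X = x, Y = j) under parameters (pi, mu)
    return pi[j] * np.exp(-0.5 * (x - mu[j]) ** 2) / np.sqrt(2.0 * np.pi)

def e_step(xs, pi, mu):
    # q[i, j] = Pr_theta[Y = j | X = x_i], by Bayes' rule on the joint
    q = np.array([[joint(x, j, pi, mu) for j in range(len(pi))] for x in xs])
    return q / q.sum(axis=1, keepdims=True)
```

Each row of the returned matrix is a soft assignment: it sums to 1 and gives the posterior weight of each mixture component for that example.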

In hard EM, the E-step above is replaced by a hard assignment of the hidden variable to the

most likely value for the given data point under the current distribution (i.e. with parameters θ).

We will call this the E’-step. The M-step stays the same.

The hard EM algorithm repeats the following two steps until convergence, starting from an

arbitrary initialization of the parameters θ:

• E’-step: For every training example x^(i), compute

    j^(i) = arg max_{j ∈ [k]} Pr_θ[Y = j | X = x^(i)],

breaking ties arbitrarily, and set

    q_j^(i) = 1 if j = j^(i), and q_j^(i) = 0 otherwise.


• M-step: The same as the M-step in the soft EM algorithm.
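The E’-step above can be sketched as a one-hot rounding of the soft posteriors (the `(n, k)` matrix interface is an assumption of this sketch):

```python
import numpy as np

# Hedged sketch: the E'-step replaces soft posteriors with a one-hot
# assignment to the most likely component. `q_soft` is assumed to be the
# (n, k) matrix of posteriors Pr_theta[Y = j | X = x_i] from the E-step.

def e_prime_step(q_soft):
    j_star = q_soft.argmax(axis=1)   # j^(i); ties broken by first index,
                                     # which the handout permits
    q = np.zeros_like(q_soft)        # q_j^(i) = 1 iff j = j^(i)
    q[np.arange(q_soft.shape[0]), j_star] = 1.0
    return q
```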

Now consider implementing the hard EM algorithm for the following specific scenario. We have

a Gaussian mixture model where all the k Gaussians have the same covariance matrix, I, the

identity matrix. Thus, in particular, the parameters of the model are

θ = (π1, µ1, π2, µ2, . . . , πk, µk),

where πj are the mixing weights, and µj are the centers of the Gaussians. Samples are generated

from this mixture model exactly as in class (viz., as described on slide 4 of the lecture). The setting

is a bit simpler in that the covariance matrices are already known (i.e., equal to I).
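As a concrete sketch of this generative process (the function and variable names are illustrative assumptions, not taken from the lecture slides):

```python
import numpy as np

# Hedged sketch of the generative process for the identity-covariance GMM:
# draw Y ~ Categorical(pi), then X | Y = j ~ N(mu_j, I).

def sample_gmm(n, pis, mus, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    ys = rng.choice(len(pis), size=n, p=pis)               # hidden labels Y
    xs = mus[ys] + rng.standard_normal((n, mus.shape[1]))  # add N(0, I) noise
    return xs, ys
```

Here `pis` holds the mixing weights π_1, …, π_k and the rows of `mus` are the centers µ_1, …, µ_k.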

Part 1 of assignment (weightage: 4%). For the mixture of Gaussians setting described

above, work out the precise implementation of the E’- and M-steps.

Part 2 of assignment (weightage: 1%). The hard EM algorithm you derived above should

be almost (but not exactly) the same as another algorithm we have already studied in class. Which

algorithm? What is the difference between the two algorithms?
