ECE421 –

Assignment 3:

Unsupervised Learning and Probabilistic Models

Submission: Submit both your report (a single PDF file) and all codes on Quercus.

Objectives:

In this assignment, you will implement learning and inference procedures for some of the probabilistic models described in class, apply your solutions to some simulated datasets, and analyze

the results.

General Note:

• Full points are given for complete solutions, including justifying the choices or assumptions

you made to solve each question. A written report should be included in the final submission.

• Programming assignments are to be solved and submitted individually. You are encouraged

to discuss the assignment with other students, but you must solve it on your own.

• Please ask all questions related to this assignment on Piazza, using the tag assignment3.

• There are 3 starter files attached, helper.py, starter_kmeans.py and starter_gmm.py which

will help you with your implementation.

1

1 K-MEANS [9 PT.]

1 K-means [9 pt.]

K-means clustering is one of the most widely used data analysis algorithms. It is used to summarize

data by discovering a set of data prototypes that represent clusters of data. The data prototypes are

usually referred to as cluster centers. Usually, K-means clustering proceeds by alternating between

assigning data points to clusters and then updating the cluster centers. In this assignment, we

will investigate a different learning algorithm that directly minimizes the K-means clustering loss

function.

1.1 Learning K-means

The K cluster centers can be thought of as K, D-dimensional parameter vectors and we can place

them in a K × D parameter matrix µ, where the k

th row of the matrix denotes the k

th cluster

center µk

. The goal of K-means clustering is to learn µ such that it minimizes the loss function,

L(µ) = PN

n=1 minK

k=1 kxn − µkk

2

2

, where N is the number of training observations.

Even though the loss function is not smooth due to the “min” operation, one may still be able

to find its solutions through iterative gradient-based optimization. The “min” operation leads to

discontinuous derivatives, in a way that is similar to the effect of the ReLU activation function,

but nonetheless, a good gradient-based optimizer can work effectively.

1. Implement the distanceFunc() function in starter_kmeans.py file to calculate the squared

pairwise distance for all pair of N data points and K clusters.

def distanceFunc(X, MU):

# Inputs

# X: is an NxD matrix (N observations and D dimensions)

# MU: is an KxD matrix (K means and D dimensions)

# Outputs

# pair_dist: is the squared pairwise distance matrix (NxK)

Hint: To properly use broadcasting, you can first add a dummy dimension to X, by using

tf.expand_dims or torch.unsqueeze, so that its new shape becomes (N, 1, D).

For the dataset data2D.npy, set K = 3 and find the K-means clusters µ by minimizing

the L(µ) using the gradient descent optimizer. The parameters µ should be initialized by

sampling from the standard normal distribution. Include a plot of the loss vs the number of

updates.

Use the Adam optimizer for this assignment with the following hyper-parameters:

learning_rate=0.1, beta1=0.9, beta2=0.99, epsilon=1e-5. The learning should converge

within a few hundred updates.

2. Hold out 1/3 of the data for validation, and for each value of K = 1, 2, 3, 4, 5:

(a) Train a K-means model.

2

2 MIXTURES OF GAUSSIANS [16 PT.]

(b) Include a 2D scatter plot of training data points colored by their cluster assignments.

(c) Compute and report the percentage of the training data points belonging to each of the

K clusters. (include this information in the legend of the scatter plot.)

(d) Compute and report the loss function over validation data. (include this in the caption

of the scatter plot.)

Based on the scatter plots, comment on the best number of clusters to use.

2 Mixtures of Gaussians [16 pt.]

Mixtures of Gaussians (MoG) can be interpreted as a probabilistic version of K-means clustering. For each data vector, MoG uses a latent variable z to represent the cluster assignment and uses a joint probability model of the cluster assignment variable and the data vector: P(x, z) = P(z)P(x | z). For N iid training cases, we have P(X, z) = QN

n=1 P(xn, zn). The

Expectation-Maximization (EM) algorithm is the most commonly used technique to learn a MoG.

Like the standard K-means clustering algorithm, the EM algorithm alternates between updating

the cluster assignment variables and the cluster parameters. What makes it different is that instead of making hard assignments of data vectors to cluster centers (the “min” operation above),

the EM algorithm computes probabilities for different cluster centers, P(z|x). These are computed

from P(z = k|x) = P(x, z = k)/

PK

j=1 P(x, z = j).

While the Expectation-Maximization (EM) algorithm is typically the go-to learning algorithm to

train MoG and is guaranteed to converge to a local optimum, it suffers from slow convergence. In

this assignment, we will explore a different learning algorithm that makes use of gradient descent.

2.1 The Gaussian cluster mode [7 pt.]

Each of the K mixture components in the MoG model occurs with probability π

k = P(z = k).

The data model is a multivariate Gaussian distribution centered at the cluster mean (data center)

µ

k ∈ R

D. We will consider a MoG model where it is assumed that for the multivariate Gaussian

for cluster k, different data dimensions are independent and have the same standard deviation, σ

k

.

1. Use the K-means distance function distanceFunc implemented in 1.1 to implement log_GaussPDF

function by computing the log probability density function for cluster k: log N (x ; µ

k

, σk

2

)

for all pair of N data points and K clusters. Include the snippets of the Python code.

2. Write a vectorized Tensorflow Python function that computes the log probability of the

cluster variable z given the data vector x: log P(z|x). The log Gaussian pdf function

implemented above should come in handy. The implementation should use the function

reduce_logsumexp() provided in the helper functions file. Include the snippets of the Python

code and comment on why it is important to use the log-sum-exp function instead of using

tf.reduce_sum.

3

2.2 Learning the MoG [9 pt.] 2 MIXTURES OF GAUSSIANS [16 PT.]

2.2 Learning the MoG [9 pt.]

The marginal data likelihood for the MoG model is as follows (here “marginal” refers to summing

over the cluster assignment variables):

P(X) = Y

N

n=1

P(xn) = Y

N

n=1

X

K

k=1

P(zn = k)P(xn | zn = k)

=

Y

n

X

k

π

kN (xn ; µ

k

, σk

2

)

The loss function we will minimize is the negative log likelihood L(µ, σ, π) = − log P(X). The

maximum likelihood estimate (MLE) is a set of the model parameters µ, σ, π that maximize the

log likelihood or, equivalently, minimize the negative log likelihood.

1. Implement the loss function using log-sum-exp function and perform MLE by directly optimizing the log likelihood function using gradient descent.

Note that the standard deviation has the constraint of σ ∈ [0,∞). One way to deal with

this constraint is to replace σ

2 with exp(φ) in the math and the software, where φ is an

unconstrained parameter. In addition, π has a simplex constraint, that is P

k

π

k = 1. We

can again replace this constrain with unconstrained parameter ψ through a softmax function

π

k = exp(ψ

k

)/

P

k

0 exp(ψ

k

0

). A log-softmax function, logsoftmax, is provided for convenience

in the helper functions file.

For the dataset data2D.npy, set K = 3 and report the best model parameters it has learned.

Include a plot of the loss vs the number of updates.

2. Hold out 1/3 of the data for validation, and for each value of K = 1, 2, 3, 4, 5:

(a) Train a MoG model.

(b) Include a 2D scatter plot of training data points colored by their cluster assignments.

(c) Compute and report the percentage of the training data points belonging to each of the

K clusters. (include this information in the legend of the scatter plot.)

(d) Compute and report the loss function over validation data. (include this in the caption

of the scatter plot.)

Explain which value of K is best, based on the validation loss.

3. Run both the K-means and the MoG learning algorithms on data100D.npy for K = {5, 10, 15,

20, 30} (Hold out 1/3 of the data for validation). Comment on how many clusters you think

are within the dataset by looking at the MoG validation loss and K-means validation loss.

Compare the learnt results of K-means and MoG.

4