Graphical Models for Denoising

We have seen several variants of the grid-structured Ising model where we have a binary-valued variable at

each position of a grid. Here we consider the grid Potts model, which has the same graphical model structure but allows K possible labels at each node of the undirected graph: y_{i,j} ∈ {1, . . . , K}.

y_{1,1}  y_{1,2}  y_{1,3}
y_{2,1}  y_{2,2}  y_{2,3}
y_{3,1}  y_{3,2}  y_{3,3}

In particular consider a conditional Potts model for image denoising. The input x will consist of a picture

with pixels xij , where each pixel is one of K different colors and has been perturbed by random noise.

Each random variable yij represents the color we think each pixel should take after denoising. Unary

potentials represent the prior belief for that pixel based on its observed value. Neighbor potentials enforce

smoothness of the labeling. Specifically, the unary potential is θ(y_{i,j} = k) = 10 · δ(x_{i,j} = k), and for all neighboring pixels n ∈ {(i − 1, j), (i + 1, j), (i, j − 1), (i, j + 1)},

θ(y_{i,j}, y_n) =
    10   if y_{i,j} = y_n
    2    if |y_{i,j} − y_n| = 1
    0    otherwise
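These potentials translate directly into Python. A minimal sketch (the function names are ours, not part of the assignment):

```python
import numpy as np

def unary_potential(x, i, j, k):
    """theta(y_ij = k) = 10 * delta(x_ij = k): reward agreeing with the observed pixel."""
    return 10.0 if x[i, j] == k else 0.0

def pairwise_potential(k, k_prime):
    """Neighbor potential: strong reward for equal labels, a weaker one for adjacent labels."""
    if k == k_prime:
        return 10.0
    if abs(k - k_prime) == 1:
        return 2.0
    return 0.0
```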

This is obviously a simplified and discretized view of the true denoising problem, but it gives us a reasonable

place to start. As an example consider the problem with K = 2 and noise over the image of a spiral.

[Note: for the example problems we will show k=1 as red, k=2 as green, and k=3 as blue. We will represent

this as the last dimension of the image in a one-hot encoding, so x[i, j, 0] = 1 for red, x[i, j, 1] = 1 for green,

and x[i, j, 2] = 1 for blue. Here red is “close” to green and green is close to blue, but red is not close to blue.

This is not supposed to be physically true, just part of the problem.]
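For reference, the one-hot color encoding described in the note can be built in one line of numpy (a minimal sketch; the array names are ours):

```python
import numpy as np

K = 3
# 0-indexed labels: 0 = red (k=1), 1 = green (k=2), 2 = blue (k=3)
labels = np.array([[0, 1],
                   [2, 1]])
# One-hot along the last dimension: x[i, j, k] == 1 iff pixel (i, j) has color k+1.
x = np.eye(K)[labels]
```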


Problem 1 (Variational Inference for Denoising, 30pts)

For the problems below we have provided a set of example images in the form of numpy arrays including

a small sample, a flag, a bullseye, and the large spiral above.

1. First, as a sanity check, consider the 3×3 image small with K = 2. Compute the true posterior marginal probability p(y_{i,j} | x) of each node by brute force.

2. Now consider a variational-inference based approach to this problem. Using mean-field factorization, with q(y) factored to each node of the graph, derive local mean field updates.

q_{1,1}  q_{1,2}  q_{1,3}
q_{2,1}  q_{2,2}  q_{2,3}
q_{3,1}  q_{3,2}  q_{3,3}

3. Implement these mean-field updates with a synchronous schedule in PyTorch/numpy. (This means that all parameters are updated with expectations from the previous time step.) Run for 30 epochs, starting with q set to a uniform distribution. Graph the results on the small images and compare the variational values to the exact marginals from the brute-force approach on the small example. Note: running on the spiral example will likely require a fast/matrix implementation.

4. Implement Loopy Belief Propagation with a synchronous or asynchronous schedule in PyTorch/numpy, following the algorithm given in Murphy (Section 22.2.2). Run for 30 epochs using the starting conditions in Algorithm 22.1. Compare to the mean-field approach.

5. (Optional) Write out the Integer Linear Programming formulation for the maximum assignment

problem. What is the advantage of mean field compared to the ILP approach?

6. (Optional) Install the PuLP toolkit in Python. Implement the ILP formulation for this problem and compare your solution on the smaller images.
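As a shape reference for part 3, a single synchronous mean-field sweep might look like the following sketch, where q has shape (H, W, K), theta_unary holds the per-pixel unary scores, and theta_pair is the K × K pairwise potential matrix (all conventions here are ours, not a required implementation):

```python
import numpy as np

def mean_field_step(q, theta_unary, theta_pair):
    """One synchronous mean-field update: every q_ij is recomputed from the
    previous iterate's neighbor expectations, then renormalized."""
    # Sum of neighbor marginals at each pixel (zero-padded at the borders).
    nbr = np.zeros_like(q)
    nbr[1:, :] += q[:-1, :]   # neighbor above
    nbr[:-1, :] += q[1:, :]   # neighbor below
    nbr[:, 1:] += q[:, :-1]   # neighbor to the left
    nbr[:, :-1] += q[:, 1:]   # neighbor to the right
    # log q_ij(k) ∝ theta_unary(k) + sum over neighbors n of E_q[theta(k, y_n)]
    logits = theta_unary + nbr @ theta_pair
    logits -= logits.max(axis=-1, keepdims=True)  # for numerical stability
    q_new = np.exp(logits)
    return q_new / q_new.sum(axis=-1, keepdims=True)
```

Because the update is synchronous, `nbr` is built entirely from the previous iterate before any pixel is overwritten.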

Modeling users and jokes with a Bayesian latent bilinear model

The next two questions develop Bayesian inference methods for the simplest version of the latent bilinear model you used to model joke ratings in HW3. The data set we'll use is the same as in HW3, a modified and preprocessed variant of the Jester data set. However, to make things easier (and to make being Bayesian more worthwhile) we'll subsample to only 10% of the training data. The remaining ratings will form your test set.

The model

The model is the same as in HW3, but with Gaussian priors on the latent parameter matrices U and V .

Let r_{i,j} ∈ {1, 2, 3, 4, 5} be the rating of user i on joke j. A latent linear model introduces a vector u_i ∈ R^K for each user and a vector v_j ∈ R^K for each joke. Then each rating is modeled as a noisy version of the appropriate inner product. Specifically,

r_{i,j} ∼ N(u_i^T v_j, σ²).

Fix σ² to be 1.0, and start with K = 2. We put independent Gaussian priors on each element of U and V:

U_{i,k} ∼ N(0, σ²_U = 5)
V_{j,k} ∼ N(0, σ²_V = 5)
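To make the generative story concrete, here is a sketch of sampling a synthetic ratings matrix from these priors (the dimensions are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_jokes, K = 100, 40, 2
sigma2, sigma2_U, sigma2_V = 1.0, 5.0, 5.0

# Draw the latent matrices from their independent Gaussian priors.
U = rng.normal(0.0, np.sqrt(sigma2_U), size=(n_users, K))
V = rng.normal(0.0, np.sqrt(sigma2_V), size=(n_jokes, K))

# Each rating is a noisy inner product: r_ij ~ N(u_i^T v_j, sigma^2).
R = U @ V.T + rng.normal(0.0, np.sqrt(sigma2), size=(n_users, n_jokes))
```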


Problem 2 (Stochastic Variational Inference, 30pts)

Recall that variational inference optimizes a lower bound on the log marginal likelihood (integrating out parameters θ), like so:

log p(x) = log ∫ p(x, θ) dθ = log ∫ p(x | θ) p(θ) dθ                                            (1)

         = log ∫ q_λ(θ) (1 / q_λ(θ)) p(x | θ) p(θ) dθ = log E_{q_λ} [ (1 / q_λ(θ)) p(x | θ) p(θ) ]   (2)

         ≥ E_{q_λ} log [ (1 / q_λ(θ)) p(x | θ) p(θ) ]

         = −E_{q_λ} log q_λ(θ) + E_{q_λ} log p(θ) + E_{q_λ} log p(x | θ) = L(λ)                 (3)
               (entropy)            (prior)             (likelihood)

In this case, θ = (U, V) and x = R:

L(λ) = −E_{q_λ} log q_λ(U, V) + E_{q_λ} log p(U, V) + Σ_{n=1}^{N} E_{q_λ} log p(r_n | U, V)      (4)

This is a general formula that works for many different priors, likelihoods, and variational approximations. Here we will keep things simple and choose q(U, V) to be a Gaussian factorized over every single entry of each of the matrices U and V, i.e. the same form as the prior. Thus our variational parameters will consist of a mean and a variance for each entry in U and V: λ^{(µ_U)}_{ik}, λ^{(σ²_U)}_{ik}, λ^{(µ_V)}_{jk}, and λ^{(σ²_V)}_{jk}.

1. Derive the expression for the KL divergence between two univariate Gaussians.

2. Exploiting the conditional independence of the model, we can write the variational objective

(which we want to maximize) as:

L(λ) = −KL(q_λ(U) ‖ p(U)) − KL(q_λ(V) ‖ p(V)) + Σ_{n=1}^{N} E_{q_λ} log p(r_n | U, V)

Simplify the first two terms of this objective to get a closed-form expression.

3. The third term is the likelihood of the data under an expectation w.r.t. the variational parameters. Assume that we approximate this term using a single sample of the rows ũ_i and ṽ_j for each rating r_{i,j}. Write out the full objective with this approximation for the last term.

4. Unfortunately this is difficult to optimize, since the sampled variables depend on the variational parameters λ. An alternative method, known as reparameterization, replaces expectations of the form E_{X∼N(µ,σ²)}[f(X)] with E_{Z∼N(0,1)}[f(Zσ + µ)]. Rewrite the objective in this form using sampled dummy variables z̃ (and no ũ_i or ṽ_j).

5. Using PyTorch, set up this model using nn.Embedding for the variational parameters. For numerical stability, store the log of the variance in the embedding table, and also initialize this table

with very low values, e.g. logvar.weight.data = -1000. For K = 2, optimize the variational

parameters for 10 epochs over the sampled data. Use Adam with learning rate 0.001.

Plot the training and test-set log-likelihood as a function of the number of epochs, as well as the

marginal likelihood lower bound. That is to say: at the end of each epoch, evaluate the log of the

average predictive probability of all ratings in the training and test sets using 100 samples from

q(U,V). The lower bound is the sum of entropy, prior and likelihood terms, while the training-set

and test-set likelihoods only use the likelihood term.

6. Fit your variational model for K = 1 to K = 10, and plot the training-set log-likelihood, test-set

log-likelihood, and lower bound for each value of K. How do the shapes of these curves differ?
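Parts 1 and 4 can be checked against the following PyTorch sketch of the closed-form Gaussian KL and the reparameterized sampler (a sketch under the factorized-Gaussian assumptions above, not a required implementation):

```python
import math
import torch

def reparam_sample(mu, logvar):
    """Reparameterization trick: X = mu + sigma * Z with Z ~ N(0, I), so
    gradients flow through mu and logvar rather than through the sample."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def kl_gaussian(mu_q, logvar_q, mu_p=0.0, var_p=5.0):
    """Sum over entries of KL(N(mu_q, var_q) || N(mu_p, var_p)) for
    univariate Gaussians; the prior defaults match the model above."""
    var_q = torch.exp(logvar_q)
    return 0.5 * torch.sum(
        math.log(var_p) - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )
```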

Problem 3 (Gibbs Sampling, 25pts)


In this problem we will consider a different sampling-based approach for estimating the posterior.

1. Write down the conditional distributions of U and V, conditioned on all the other variables as well as the training data:

p(U_i | V, R)
p(V_j | U, R)

Because the model is bilinear, these updates should have fairly simple forms. Here U_i denotes the latent parameters corresponding to the ith user, and V_j those corresponding to the jth joke.

2. A Gibbs sampler is an alternative method for approximating the posteriors of intractable models. The method works by repeatedly alternating between drawing samples of U conditioned on V, and then samples of V conditioned on U. (We will derive this in greater detail in coming classes.)

Give the pseudocode for running this algorithm using the conditional distributions from above.

3. Run the Gibbs sampler for 100 steps (i.e. update both U and V 100 times). Plot the training and

test-set log-likelihood as a function of the number of steps through your training set. That is, use

all previous samples of U, V to evaluate the predictive probability of all ratings.

4. One reason to be Bayesian is that you don’t have to worry about overfitting. Run your Gibbs

sampler for K = 1 to K = 10, and plot the training and test-set log-likelihood for each value

of K. How do the shapes of these curves differ from the curves you saw when doing maximum

likelihood estimation in HW3?
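One possible shape of a single sweep for part 2 is sketched below, assuming a fully observed ratings matrix R and the standard Gaussian conditional updates for a bilinear Gaussian model; check it against your own derivation from part 1 rather than taking it as given:

```python
import numpy as np

def gibbs_sweep(R, U, V, sigma2=1.0, sigma2_prior=5.0, rng=None):
    """One Gibbs sweep: resample every row of U given V, then every row of V
    given U, from their Gaussian conditionals (fully observed R assumed)."""
    rng = rng if rng is not None else np.random.default_rng()
    K = U.shape[1]
    for i in range(R.shape[0]):
        # Posterior precision combines the prior with the likelihood terms.
        prec = np.eye(K) / sigma2_prior + V.T @ V / sigma2
        cov = np.linalg.inv(prec)
        mean = cov @ (V.T @ R[i] / sigma2)
        U[i] = rng.multivariate_normal(mean, cov)
    for j in range(R.shape[1]):
        prec = np.eye(K) / sigma2_prior + U.T @ U / sigma2
        cov = np.linalg.inv(prec)
        mean = cov @ (U.T @ R[:, j] / sigma2)
        V[j] = rng.multivariate_normal(mean, cov)
    return U, V
```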
