Assignment #4 Graphical Models for Denoising



Graphical Models for Denoising
We have seen several variants of the grid-structured Ising model where we have a binary-valued variable at
each position of a grid. Here we consider the grid Potts model which has the same graphical model structure,
but instead with multiple labels K at each node of the undirected graph yi,j ∈ {1, . . . , K}.
y1,1 y1,2 y1,3
y2,1 y2,2 y2,3
y3,1 y3,2 y3,3
In particular consider a conditional Potts model for image denoising. The input x will consist of a picture
with pixels xij , where each pixel is one of K different colors and has been perturbed by random noise.
Each random variable yij represents the color we think each pixel should take after denoising. Unary
potentials represent the prior belief for that pixel based on its observed value. Neighbor potentials enforce
smoothness of the labeling. Specifically, θ(yij = k) = 10 ∗ δ(xij = k), and for all neighboring pixels
n ∈ {(i − 1, j),(i + 1, j),(i, j − 1),(i, j + 1)},
θ(yi,j , yn) =

10 yi,j = yn
2 |yi,j − yn| = 1
0 o.w.
This is obviously a simplified and discretized view of the true denoising problem, but it gives us a reasonable
place to start. As an example consider the problem with K = 2 and noise over the image of a spiral.
[Note: for the example problems we will show k=1 as red, k=2 as green, and k=3 as blue. We will represent
this as the last dimension of the image in a one-hot encoding, so x[i, j, 0] = 1 for red, x[i, j, 1] = 1 for green,
and x[i, j, 2] = 1 for blue. Here red is “close” to green and green is close to blue, but red is not close to blue.
This is not supposed to be physically true, just part of the problem.]
Problem 1 (Variational Inference for Denoising, 30pts)
For the problems below we have provided a set of example images in the form of numpy arrays including
a small sample, a flag, a bullseye, and the large spiral above.
1. First as a sanity check, consider the 3×3 image small with K = 2. Compute using brute-force the
true posterior marginal probability p(yi,j |x) of any node.
2. Now consider a variational-inference based approach to this problem. Using mean-field factorization, with q(y) factored to each node of the graph, derive local mean field updates.
q1,1 q1,2 q1,3
q2,1 q2,2 q2,3
q3,1 q3,2 q3,3
3. Implement these mean-field updates with a synchronous schedule in PyTorch/numpy. (This means
that all parameters are updated with expectations from the previous time step.). Run for 30 epochs
starting with q set to a uniform distribution. Graph the results on the small images and compare
to the brute-force approach. Compare the variational values to the exact marginals for the small
example. Note: running on the spiral example will likely require a fast/matrix implementation.
4. Implement Loopy Belief Propagation with a synchronous or non-synchronous schedule in PyTorch/Numpy following the algorithm given in Murphy (Section 22.2.2). Run for 30 epochs using
the starting conditions in in Algorithm 22.1. Compare to the mean field approach.
5. (Optional) Write out the Integer Linear Programming formulation for the maximum assignment
problem. What is the advantage of mean field compared to the ILP approach?
6. (Optional) Install the PuLP toolkit in python. Implement the ILP formulation for this problem.
Compare your solution for the smaller images.

Modeling users and jokes with a Bayesian latent bilinear model
The next two questions will develop Bayesian inference methods for the simplest version of the latent bilinear
model you used to model jokes ratings in HW3. The data set we’ll use is the same as in HW3, a modified
and preprocessed variant of the Jester data set. However, to make things easier (and to make being Bayesian
more worthwhile) we’ll only use subsampling to 10% of the training data. The other ratings will
form your test set.
The model
The model is the same as in HW3, but with Gaussian priors on the latent parameter matrices U and V .
Let ri,j ∈ {1, 2, 3, 4, 5} be the rating of user i on joke j. A latent linear model introduces a vector ui ∈ R
for each user and a vector vj ∈ R
K for each joke. Then, each rating is modeled as a noisy version of the
appropriate inner product. Specifically,
ri,j ∼ N (u
vj , σ2

Fix σ

to be 1.0, and start with K = 2. We put independent Gaussian priors on each element of U and V :
Ui,k ∼ N (0, σ2
U = 5)
Vi,k ∼ N (0, σ2
V = 5)
Problem 2 (Stochastic Variational Inference, 30pts)
Recall that variational inference optimizes a lower bound on the log marginal likelihood (integrating
out parameters θ), like so:
log p(x) = log Z
p(x, θ)dθ = log Z
p(x|θ)p(θ)dθ (1)
= log Z
p(x|θ)p(θ)dθ = log Eqλ
p(x|θ)p(θ)dθ (2)
≥ Eqλ

= −Eqλ
log qλ(θ)
| {z }
+ Eqλ
log p(θ)
| {z }
+ Eqλ
log p(x|θ)
| {z }
= L(λ) (3)
In this case, θ = U, V and x = R:
L(λ) = −Eqλ
log qλ(U, V ) + Eqλ
log p(U, V ) + X
log p(rn|U, V ) (4)
This is a general formula that works for many different priors, likelihoods and variational approximations. Here we will keep things simple and choose q(U, V ) to be a Gaussian factorized over every single
entry of each matrix for U and V , e.g. the same form as the prior. Thus our variational parameters
will consist of a mean and variance for each entry in U and V: λ
ik , λ

ik , λ
(µV )
jk , and λ

2V )
jk .
1. Derive the expression for the KL divergence between two univariate Gaussians.
2. Exploiting the conditional independence of the model, we can write the variational objective
(which we want to maximize) as:
L(λ) = −KL(qλ(U) || p(U)) − KL(qλ(V ) || p(V )) + X
log p(rn|U, V )
Simplify the first two terms of this model to get a closed form expression.
3. The third term is the likelihood of the data under an expectation wrt the variational parameters.
Assume that we approximate this term using a single sample of rows ˜ui and ˜vj for each rating
ri,j . Write out the full objective with this approximation for the last term.
4. Unfortunately this is difficult to optimize, since the sampled variables depend on the variational
parameters λ. An alternative method, known as reparameterization, replaces expectation of the
form EX∼N(µ,σ)
[f(X)], in terms of EZ∼N(0,1)[f(Zσ+µ)]. Rewrite the objective in this form using
sampled dummy variables ˜z (and no ˜ui or ˜vj ).
5. Using PyTorch, set up this model using nn.Embedding for the variational parameters. For numerical stability, store the log of the variance in the embedding table, and also initialize this table
with very low values, e.g. = -1000. For K = 2, optimize the variational
parameters for 10 epochs over the sampled data. Use Adam with learning rate 0.001.
Plot the training and test-set log-likelihood as a function of the number of epochs, as well as the
marginal likelihood lower bound. That is to say: at the end of each epoch, evaluate the log of the
average predictive probability of all ratings in the training and test sets using 100 samples from
q(U,V). The lower bound is the sum of entropy, prior and likelihood terms, while the training-set
and test-set likelihoods only use the likelihood term.
6. Fit your variational model for K = 1 to K = 10, and plot the training-set log-likelihood, test-set
log-likelihood, and lower bound for each value of K. How do the shapes of these curves differ?

Problem 3 (Gibbs Sampling, 25pts)
In this problem we will consider a different sampling-based approach for estimating the posterior.
1. Write down the conditional equations for U and V. That is to say, write their conditional distributions, conditioned on all the other variables as well as the training data:
| V, R)
p(Vj | U, R)
Because the model is bi-linear, these updates should have fairly simple forms. Here, we mean Ui
to mean the latent parameters corresponding to the ith user, and Vj to mean those for the jth
2. A Gibbs sampler is an alternative model for computing the posteriors of intractable models. The
method works by repeatedly alternating between drawing samples of U conditioned on V , and
then samples of V conditioned on U. (We will derive in greater detail in coming classes).
Give the pseudocode for running this algorithm using the posterior equations from above.
3. Run the Gibbs sampler for 100 steps (i.e. update both U and V 100 times). Plot the training and
test-set log-likelihood as a function of the number of steps through your training set. That is, use
all previous samples of U, V to evaluate the predictive probability of all ratings.
4. One reason to be Bayesian is that you don’t have to worry about overfitting. Run your Gibbs
sampler for K = 1 to K = 10, and plot the training and test-set log-likelihood for each value
of K. How do the shapes of these curves differ from the curves you saw when doing maximum
likelihood estimation in HW3?


There are no reviews yet.

Be the first to review “Assignment #4 Graphical Models for Denoising”

Your email address will not be published.