CS294-158 Deep Unsupervised Learning

HW1: Autoregressive Models

1 Warmup

First, run the following code. It will generate a dataset of samples x ∈ {1, . . . , 100}. Take the first 80% of the samples as a training set and the remaining 20% as a test set.

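    import numpy as np

    def sample_data():
        # Draw 10,000 samples from a two-component Gaussian mixture, clipped to [0, 1].
        count = 10000
        rand = np.random.RandomState(0)  # fixed seed so the dataset is reproducible
        a = 0.3 + 0.1 * rand.randn(count)
        b = 0.8 + 0.05 * rand.randn(count)
        mask = rand.rand(count) < 0.5  # picks component a or b for each sample
        samples = np.clip(a * mask + b * (1 - mask), 0.0, 1.0)
        # Discretize into integer bins 1..100.
        return np.digitize(samples, np.linspace(0.0, 1.0, 100))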
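The 80/20 split described above can be sketched as follows. The helper name `split_train_test` is hypothetical, and the stand-in array only exists so the block runs on its own; in the assignment the input would be the output of `sample_data()`.

```python
import numpy as np

def split_train_test(data, train_fraction=0.8):
    """Return (train, test): the first `train_fraction` of `data`, then the rest."""
    n_train = int(round(len(data) * train_fraction))
    return data[:n_train], data[n_train:]

# Stand-in for sample_data() so this sketch is self-contained.
data = np.arange(1, 10001)
train, test = split_train_test(data)
```

A validation set can be carved out of `train` in the same way, since the handout leaves that choice open.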

Let $\theta = (\theta_1, \ldots, \theta_{100}) \in \mathbb{R}^{100}$, and define the model

$$p_\theta(x) = \frac{e^{\theta_x}}{\sum_{x'} e^{\theta_{x'}}} \tag{1}$$
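For reference, the per-sample negative log-likelihood of model (1) and its gradient have a simple closed form, which may help when checking an SGD implementation by hand. This is the standard softmax identity rather than something stated in the handout; dividing a value in nats by $\ln 2$ converts it to bits.

```latex
-\log p_\theta(x) = -\theta_x + \log \sum_{x'} e^{\theta_{x'}},
\qquad
\frac{\partial}{\partial \theta_k}\bigl[-\log p_\theta(x)\bigr]
  = p_\theta(k) - \mathbb{1}[x = k].
```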

Fit pθ with maximum likelihood via stochastic gradient descent on the training set, using θ initialized to zero. Use your favorite version of stochastic gradient descent, and optimize your hyperparameters on a validation set of your choice. Provide these deliverables:

1. Over the course of training, record the average negative log likelihood of the training data

(per minibatch) and validation data (for your entire validation set). Plot both on the same

graph – the x-axis should be training steps, and the y-axis should be negative log likelihood;

feel free to compute and report the validation performance less frequently. Report the test

set performance of your final model. Be sure to report all negative log likelihoods in bits.

2. Plot the model probabilities in a bar graph, with {1, . . . , 100} on the x-axis and a real number

in [0, 1] on the y-axis. Next, draw 1000 samples from your model, and plot their empirical

frequencies on a new bar graph with the same axes. How do both plots compare visually to

the data distribution?
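A minimal NumPy sketch of the fitting and sampling steps listed above, assuming plain minibatch SGD. The function names, the hyperparameters (`steps`, `batch_size`, `lr`), and the stand-in training data are all assumptions made for illustration, not values prescribed by the assignment.

```python
import numpy as np

def log_softmax(theta):
    """Numerically stable log of the model in equation (1), in nats."""
    shifted = theta - theta.max()
    return shifted - np.log(np.exp(shifted).sum())

def nll_bits(theta, x):
    """Average negative log-likelihood of samples x (values 1..100), in bits."""
    return -log_softmax(theta)[x - 1].mean() / np.log(2)

def fit(train, num_values=100, steps=2000, batch_size=128, lr=0.5, seed=0):
    """Minibatch SGD on theta, initialized to zero (hyperparameters are guesses)."""
    rng = np.random.RandomState(seed)
    theta = np.zeros(num_values)
    history = []  # per-minibatch training NLL in bits, recorded before each update
    for _ in range(steps):
        batch = train[rng.randint(0, len(train), size=batch_size)]
        history.append(nll_bits(theta, batch))
        probs = np.exp(log_softmax(theta))
        freqs = np.bincount(batch - 1, minlength=num_values) / batch_size
        theta -= lr * (probs - freqs)  # gradient of the mean NLL in nats
    return theta, history

def sample(theta, n, seed=0):
    """Draw n samples in {1, ..., len(theta)} from the fitted model."""
    rng = np.random.RandomState(seed)
    return rng.choice(len(theta), size=n, p=np.exp(log_softmax(theta))) + 1

# Stand-in training data with values in 1..100 (the assignment uses sample_data()).
train = np.random.RandomState(1).binomial(99, 0.3, size=8000) + 1
theta, history = fit(train)
final_bits = nll_bits(theta, train)
draws = sample(theta, 1000)
```

The `history` list and periodic calls to `nll_bits` on a validation split give the two curves for the first deliverable; `np.exp(log_softmax(theta))` and `np.bincount(draws - 1, minlength=100) / 1000` give the heights for the two bar graphs.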


2 Two-dimensional data

In this problem, you will work with bivariate data of the form x = (x1, x2), where x1, x2 ∈

{0, 1, . . . , 199}. In the file called distribution.npy, you are provided with a 2-dimensional array of floating point numbers representing the joint distribution of x: element (i, j) of this array is

the joint probability pdata(x1 = i, x2 = j).
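One way to draw integer pairs from a joint probability table of this shape is to sample a flat index and unravel it. The sketch below is an illustration under assumptions: `sample_pairs` is a hypothetical helper, and the random stand-in table replaces the real array, which would come from `np.load("distribution.npy")`.

```python
import numpy as np

def sample_pairs(joint, n, seed=0):
    """Draw n integer pairs (x1, x2) from a 2-D table of joint probabilities."""
    rng = np.random.RandomState(seed)
    flat = joint.ravel() / joint.sum()  # renormalize to guard against rounding
    idx = rng.choice(flat.size, size=n, p=flat)
    x1, x2 = np.unravel_index(idx, joint.shape)  # row index is x1, column is x2
    return np.stack([x1, x2], axis=1)

# Hypothetical stand-in table so the block runs without the data file.
joint = np.random.RandomState(1).rand(200, 200)
joint /= joint.sum()
data = sample_pairs(joint, 100000)
```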

Sample a dataset of 100,000 points from this distribution. Treat the first 80% as a training set

and the remaining 20% as a test set. Implement and train both of these models for this data:

1. pθ(x) = pθ(x1)pθ(x2|x1), where pθ(x1) is a distribution represented in the same way as in

part 1, and pθ(x2|x1) is a multilayer perceptron (MLP) that takes x1 as input and produces a

distribution over x2. (You have some freedom in designing the architecture of this MLP. For

example, it can read the x1 input either as a real number or as a one-hot vector. Experiment

with such designs and pick what works best on validation data.)

2. pθ(x) represented as a Masked Autoencoder for Distribution Estimation (MADE).¹
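For the MADE option, the core ingredient is a pair of binary masks that enforce the autoregressive ordering. The sketch below builds them for one hidden layer over one-hot inputs; the function name, the hidden width of 64, and the one-hot encoding are assumptions, and the masks would still need to be multiplied into the weights of an actual network.

```python
import numpy as np

def made_masks(num_vars=2, num_values=200, hidden=64, seed=0):
    """Input-to-hidden and hidden-to-output masks for a one-hidden-layer MADE."""
    rng = np.random.RandomState(seed)
    # Each one-hot input/output unit inherits the degree (1..num_vars) of its variable.
    unit_degree = np.repeat(np.arange(1, num_vars + 1), num_values)
    # Hidden degrees lie in 1..num_vars-1 (all equal to 1 when num_vars == 2).
    hidden_degree = rng.randint(1, num_vars, size=hidden)
    # A hidden unit may see inputs whose degree does not exceed its own.
    in_mask = (hidden_degree[:, None] >= unit_degree[None, :]).astype(np.float32)
    # An output unit may see hidden units of strictly smaller degree.
    out_mask = (unit_degree[:, None] > hidden_degree[None, :]).astype(np.float32)
    return in_mask, out_mask

in_mask, out_mask = made_masks()
# connectivity[o, i] > 0 means output unit o can depend on input unit i.
connectivity = out_mask @ in_mask
```

With two variables, the logits for x1 end up depending only on their bias, and the logits for x2 depend only on the one-hot encoding of x1, which matches the factorization pθ(x1)pθ(x2|x1).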

Fit both models with maximum likelihood, paying attention to performance on a validation set.

Provide these deliverables for both models:

1. Over the course of training, record the average negative log likelihood of the training data

(per minibatch) and the validation data (for the entire validation set). Plot both on the same

graph – the x-axis should be training steps, and the y-axis should be negative log likelihood

– and report the test set performance of your final model. Report all negative log likelihoods

as bits per dimension (i.e. bits divided by 2).

2. Draw samples and plot them in a 2D histogram with 200 bins on each side. (Consider using

the hist2d function in matplotlib.)
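If the bin counts themselves are needed (for example, to compare against the data distribution numerically), `np.histogram2d` computes them without plotting. This is a sketch with a hypothetical helper name and random stand-in samples; the bin edges are placed at half-integers so that each of the 200 integer values gets its own bin.

```python
import numpy as np

def histogram_counts(samples, num_values=200):
    """Count integer pairs into a num_values x num_values grid."""
    # One bin per integer value: edges at -0.5, 0.5, ..., num_values - 0.5.
    edges = np.arange(num_values + 1) - 0.5
    counts, _, _ = np.histogram2d(samples[:, 0], samples[:, 1], bins=[edges, edges])
    return counts

# Stand-in samples; in the assignment these would be drawn from the trained model.
samples = np.random.RandomState(0).randint(0, 200, size=(5000, 2))
counts = histogram_counts(samples)
```

The resulting array can be displayed with an image plot, or the same samples can be passed to matplotlib's hist2d directly.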

3 High-dimensional data

Now, you will train more powerful models on high-dimensional data: colored MNIST digits. The dataset you are provided has 60,000 training images and 10,000 test images. Each image has height $H = 28$ and width $W = 28$, with $C = 3$ channels (red, green, and blue). Let $x = (x_1, \ldots, x_{HW})$ represent one image, where each $x_i = (x_i^1, \ldots, x_i^C)$ is a vector of channel values for one pixel. Each $x_i^c$ will take on a value in $\{0, 1, 2, 3\}$.

First, design an autoregressive model of the form

$$p_\theta(x) = \prod_{i=1}^{HW} p_\theta(x_i \mid x_{1:i-1}) = \prod_{i=1}^{HW} \prod_{c=1}^{C} p_\theta(x_i^c \mid x_{1:i-1}) \tag{2}$$

i.e. a model which is autoregressive over space, but factorized over channels. Use the masked PixelCNN architecture² to implement spatial dependencies. We recommend following the architecture from Table 1 in the linked paper. Start with a 7×7 masked convolution of type A, followed by 12 layers of residual blocks containing masked convolutions of 1×1, 3×3 and 1×1 of type B (Figure 5 in the paper). The logits for the softmax layer are then obtained after two 1×1 masked convolution layers of type B.

¹ MADE: https://arxiv.org/abs/1502.03509

² PixelCNN: https://arxiv.org/abs/1601.06759
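The type A and type B masks mentioned above differ only in whether the center position of the kernel is visible. The sketch below builds the spatial part of such a mask for raster-scan ordering; the helper name `conv_mask` is hypothetical, and any channel-wise masking is deliberately left out, so this is an illustration rather than the full construction from the paper.

```python
import numpy as np

def conv_mask(kernel_size, mask_type):
    """Spatial mask for a masked convolution under raster-scan pixel ordering."""
    assert mask_type in ("A", "B")
    mask = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    center = kernel_size // 2
    mask[:center, :] = 1.0        # every row above the center row
    mask[center, :center] = 1.0   # center row, strictly left of the center
    if mask_type == "B":
        mask[center, center] = 1.0  # type B also sees the center position
    return mask

mask_a = conv_mask(7, "A")  # for the first 7x7 layer
mask_b = conv_mask(3, "B")  # for the 3x3 layers inside residual blocks
```

In practice the mask would be broadcast over the input and output channel dimensions and multiplied into the kernel weights before each convolution.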

Next, introduce the capacity to model dependencies among channels:

$$p_\theta(x) = \prod_{i=1}^{HW} p_\theta(x_i \mid x_{1:i-1}) = \prod_{i=1}^{HW} \prod_{c=1}^{C} p_\theta(x_i^c \mid x_i^{1:c-1}, x_{1:i-1}) \tag{3}$$

To do so, use a MADE over the channels, conditioned on previous pixels with the PixelCNN structure. In other words, the joint probability for one pixel $x_i$, conditioned on previous pixels $x_{1:i-1}$, should have the form:

$$p_\theta(x_i^1, \ldots, x_i^C \mid x_{1:i-1}) = g_\theta\left(x_i^1, \ldots, x_i^C \mid \phi_\theta(x_{1:i-1})\right) \tag{4}$$

where $\phi_\theta(x_{1:i-1})$ is a feature vector summarizing $x_{1:i-1}$ using the PixelCNN structure, and

$$g_\theta(x_i^1, \ldots, x_i^C \mid \phi) = \prod_{c=1}^{C} g_\theta(x_i^c \mid x_i^{1:c-1}, \phi) \tag{5}$$

is a MADE that takes φ as an auxiliary input.
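To make the conditioning in equation (5) concrete, here is a deliberately simplified sketch: a single masked linear layer in the spirit of MADE, where the logits for channel c see the one-hot values of earlier channels and all of φ. The function names, the feature size of 128, and the absence of hidden layers are assumptions; a fuller MADE would add masked hidden layers between input and output.

```python
import numpy as np

def channel_made_mask(num_channels=3, num_values=4, feature_dim=128):
    """Mask of shape (C * V, C * V + F) for one masked linear layer."""
    chan = np.repeat(np.arange(num_channels), num_values)  # channel id per one-hot unit
    # Logits for channel c may see one-hot inputs of channels strictly before c.
    channel_part = (chan[:, None] > chan[None, :]).astype(np.float64)
    # Every logit may see all of phi, which only depends on previous pixels.
    feature_part = np.ones((num_channels * num_values, feature_dim))
    return np.concatenate([channel_part, feature_part], axis=1)

def pixel_logits(weights, bias, mask, x_onehot, phi, num_values=4):
    """Return per-channel logits of shape (C, num_values) for one pixel."""
    inputs = np.concatenate([x_onehot, phi])
    return ((weights * mask) @ inputs + bias).reshape(-1, num_values)

rng = np.random.RandomState(0)
mask = channel_made_mask()
weights = rng.randn(*mask.shape)
bias = rng.randn(mask.shape[0])
phi = rng.randn(128)

def onehot_pixel(values, num_values=4):
    """Flattened one-hot encoding of one pixel's channel values."""
    return np.eye(num_values)[list(values)].ravel()

base = pixel_logits(weights, bias, mask, onehot_pixel((1, 2, 3)), phi)
blue_changed = pixel_logits(weights, bias, mask, onehot_pixel((1, 2, 0)), phi)
red_changed = pixel_logits(weights, bias, mask, onehot_pixel((0, 2, 3)), phi)
```

Changing the last channel leaves every logit unchanged, while changing the first channel leaves only the first channel's logits unchanged, which is the dependency pattern equation (5) asks for.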

Here are some tips that you may find useful for designing and training these models:

• You will need only a 4-way softmax for every prediction, as opposed to a 256-way softmax in

the PixelCNN paper. This is because the dataset is quantized to two bits per color channel.

• You can set number of filters for each convolution to 128. You can use the ReLU nonlinearity

throughout.

• Consider using layer normalization³ to improve performance.

• With a learning rate of $10^{-3}$ and a batch size of 128, you should be able to achieve a negative log-likelihood of 0.11 bits/dim in approximately 50 epochs. This should take about 30 minutes.
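As a sanity check on the 4-way softmax and the bits/dim figure in the tips above, the loss can be computed directly from logits. The sketch assumes a logits layout of (N, H, W, C, 4) and integer targets of shape (N, H, W, C); both the layout and the function name `bits_per_dim` are assumptions.

```python
import numpy as np

def bits_per_dim(logits, targets):
    """Mean negative log-likelihood in bits per dimension.

    logits: (N, H, W, C, 4) unnormalized scores; targets: (N, H, W, C) ints in 0..3.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return -picked.mean() / np.log(2)  # nats to bits, averaged over all dimensions

# All-zero logits give a uniform prediction over the 4 values.
logits = np.zeros((2, 28, 28, 3, 4))
targets = np.random.RandomState(0).randint(0, 4, size=(2, 28, 28, 3))
uniform_bpd = bits_per_dim(logits, targets)
```

A uniform prediction costs log2(4) = 2 bits per dimension, so an untrained model should start near that value.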

Provide these deliverables for both models:

1. Plot training and validation losses over time and report the test set performance of your final

model, just as in previous parts of this assignment. Report all negative log likelihoods in bits

per dimension.

2. Generate and display 100 samples from your model. You may want to scale the values

{0, 1, 2, 3} to {0, . . . , 255} for display purposes.

3. Visualize the receptive field of the conditional pixel distribution at (14, 14, 0). To do so,

compute the gradient of the log probability of the model with respect to the input image

at randomly initialized parameters. Turn the resulting gradient into a visualization by computing its elementwise absolute value and taking the maximum over channels – this should

transform the 28×28×3 gradient into a 28×28 image.
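The post-processing step in the last deliverable can be sketched as below. Computing the gradient itself requires an automatic differentiation framework and is not shown; the synthetic gradient and the helper name `receptive_field_image` are stand-ins for illustration.

```python
import numpy as np

def receptive_field_image(grad):
    """Collapse a (28, 28, 3) input gradient into a (28, 28) map."""
    # Elementwise absolute value, then maximum over the channel axis.
    return np.abs(grad).max(axis=-1)

# Synthetic gradient standing in for one computed by an autodiff framework.
grad = np.zeros((28, 28, 3))
grad[13, 10, 1] = -0.5
grad[14, 13, 2] = 0.25
field = receptive_field_image(grad)
```

Nonzero entries of `field` mark the input pixels that can influence the chosen conditional distribution.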

³ Layer Normalization: https://arxiv.org/abs/1607.06450
