## Description

CS 281- Assignment #2 v

NOTE: you must show derivations for your answers unless a question explicitly mentions that no justification

is required.

Problem 1 (Spherical Gaussian, 10pts)

One intuitive way to summarize a probability density is via the mode, as this is the “most likely”

value in some sense. A common example of this is using the maximum a posteriori (MAP) estimate

of a model’s parameters. In high dimensions, however, the mode becomes less and less representative

of typical samples. Consider variates from a D-dimensional zero mean spherical Gaussian with unit

variance:

x ∼ N (0D,ID),

where 0D indicates a column vector of D zeros and ID is a D × D identity matrix.

1. Compute the distribution that this implies over the distance of these points from the origin. That

is, compute the distribution over √

xTx, if x is a realization from N (0D,ID). (Note: Consider

transformations of a Gamma distribution described in Murphy 2.4.5.)

2. Make a plot that shows this probability density function for several different values of D, up to

D = 100.

3. Make a plot of the cumulative distribution function (CDF) over this distance distribution for D = 100.

A closed-form solution may be difficult to compute, so you can do this numerically.)

4. From examining the CDF we can think about where most of the mass lives as a function of radius.

For example, most of the mass for D = 100 is within a thin spherical shell. From eyeballing the

plot, what are the inner and outer radii for the shell that contains 90% of the mass in this case?

Problem 2 (Hurdle Models for Count Data, 10pts)

In this problem we consider predictive models of count data. For instance given information about the

student x, can we predict how often they went to the gym that week y? A natural choice is to use a

Poisson GLM i.e. y conditioned on x is modeled as a Poisson distribution.

However, in practice, it is common for count data of this form to follow a bi-modal distribution over

count data. For instance, our data may come from a survey asking students how often they went to the

gym in the past week. Some would do so frequently, some would do it occasionally but not in the past

week (a random zero), and a substantial percentage would never do so.

When modeling this count data with generalized linear models, we may observe more zero examples

than expected from our model. In the case of a Poisson, the mode of the distribution is the integer part

of the mean. A Poisson GLM may therefore be inadequate when means can be relatively large but the

mode of the output is 0. Such data is common when many data entries have 0 outputs and many also

have much larger outputs, so the mode of output is 0 but the overall mean is not near 0. This problem

is known as zero-inflation.

This problem considers handling zero-inflation with a two-part model called a hurdle model. One

part is a binary model such as a logistic model for whether the output is zero or positive. Conditional on

a positive output, the “hurdle is crossed” and the second part uses a truncated model that modifies an

ordinary distribution by conditioning on a positive output. This model can handle both zero inflation

and zero deflation.

Suppose that the first part of the process is governed by probabilities p(y > 0 | x) = π and p(y =

0 | x) = 1 − π; and the second part depends on {y ∈ Z | y > 0} and follows a probability mass function

f(y | x) that is truncated-at-zero. The complete distribution is therefore:

P(y = 0 | x) = 1 − π

P(y = j | x) = π

f(j | x)

1 − f(0 | x)

, j = 1, 2, …

One choice of parameterization is to use a logistic regression model for π:

π = σ(x

>w1)

and use a Poisson GLM for f with mean parameters λ (see Murphy 9.3):

λ = exp(x

>w2)

(a) Suppose we observe N data samples {(xn, yn)}

N

n=1. Write down the log-likelihood for the hurdle

model assuming an unspecified mass function f. Give an maximum likelihood estimation approach

for the specified parts of the model.

(b) Assume now that we select Poisson distribution for f. Show that the truncated-at-zero Poisson

distribution (as used in the hurdle model) is a member of the exponential family. Give its the

sufficient statistics, natural parameters and log-partition function.

(c) What is the mean and variance of a truncated Poisson model with mean parameter λ? If we

observe n i.i.d. samples from a truncated Poisson distribution, what is the maximum likelihood

estimate of λ? (Note: Give an equation which could be solved numerically to obtain the MLE. )

(d) Now assume that we using a hurdle model as a GLM with f as a Poisson distribution. Show

that this is a valid GLM (exponential family for y), derive its log-likelihood, and give its sufficient

statistics.

Problem 3 (Directed Graphical and Naive Bayes, 10pts)

To draw the DGMs for this problem, we recommend using the tikzbayesnet library. For example the

following is drawn in LATEX:

y

wx

τ

N

This problem focuses on modeling a joint distribution of random variables, p(y, x1, . . . , xV ), consisting

of discrete variables. These variables represent a class label y ∈ {1, . . . , C} and features x1 . . . , xV each

of each can take on a values xv ∈ {0, 1}.

(a) Let V = 4. Use the chain rule to select any valid factorization of this joint distribution into

univariate distributions. Draw the directed graphical model corresponding to this factorization.

(b) What is the sum of the sizes of the conditional probability tables associated with this graphical

model. Can you reduce the order of magnitude of this value with a different DGM?

(c) Now consider a naive Bayes factorization of this model, given by,

p(y, x1, . . . , xV ) ≈ p(y)

Y

v

p(xv|y).

Draw a directed graphical model for this factorization. What is the size of the conditional probability tables required to fully express any factored distribution of this form?

(d) In class, we parameterized naive Bayes such that the class distribution is Categorical with a

Dirichlet prior, and the class-conditional distributions are Bernoulli with a Beta prior. Extend the

graphical model above to show the generative model of N data points and include the parameters

and hyper-parameters as random variables.

(e) Assuming the data obeys the naive Bayes assumptions, answer the following questions as true/false

using your directed graphical model. Justify your answer.

• For a given example, features x1 and x2 are independent.

• The class labels y are always conditionally independent of the class-conditional parameters.

• Upon observing the class distribution parameters, the class labels are conditionally independent.

• Upon observing the class distribution parameters, the features are conditionally independent.

• Upon observing the class distribution hyper-parameters, the class labels are conditionally

independent.

(f) For the next problem, we will utilize naive Bayes for a problem where each example has a bag

or multiset of items. A bag is a set that may contain multiple instances of the same value. One

approach is to ignore this property and use xv as an indicator function for each item type. An

alternative is to model xv with sample space {0, . . . , D}, where D is the maximum times an item

appears and to use a Dirichlet-Categorical for the class-conditional. Give one benefit and one

drawback of this approach. Propose a third option for modeling this distribution.

Problem 4 (Naive Bayes Implementation, 10pts)

You will now implement a naive Bayes classifier for for sentiment classification. For this problem you

will use the IMDB Movie Reviews dataset which consists of positive and negative movie reviews . Here

are two example reviews:

there is no story! the plot is hopeless! a filmed based on a car with a

stuck accelerator, no brakes, and a stuck automatic transmission gear

lever cannot be good! … i feel sorry for the actors … poor script …

heavily over-dramatized … this film was nothing but annoying,

stay away from it! [negative review]

i had forgotten both how imaginative the images were, and how witty

the movie … anyone interested in politics or history will love the movie’s

offhand references – anyone interested in romance will be moved – this

one is superb. [positive review]

As noted in the last problem, it is common to think of the input data as a bag/multiset. In text

applications, sentences are often represented as a bag-of-words, containing how many times each word

appears in the sentence. For example, consider two sentences:

• We like programming. We like food.

• We like CS281.

A vocabulary is constructed based on these two sentences:

[“We”, “like”, “programming”, “food”, “CS281”]

Then the two sentences are represented as the number of occurrences of each word in the vocabulary

(starting from position 1):

• [0, 2, 2, 1, 1, 0]

• [0, 1, 1, 0, 0, 1]

We have included a utility file utils.py that does this mapping. For these problems you can therefore

treat text in this matrix representation.

• Implement a Naive Bayes classifier using a Bernoulli class-conditional with a Beta prior where

each feature is an indicator that a word appears at least once in the bag.

• Implement a Naive Bayes classifier using a Categorical class-conditional with a Dirichlet prior.

Here the features represent that count of each word in the bag.

• For both models, experiment with various settings for the priors. For the Dirichlet prior on the

class, begin with α = 1 (Laplace Smoothing). Do the same for the class-conditional prior (be

it Dirichlet or Beta). Keeping uniformity, vary the magnitude to .5 and smaller. If the classes

are unbalanced in the dataset, does it help to use a larger α for the less-often occuring class?

Optionally, choose class-conditional priors based on an outside text source. Validate your choices

on the validation set, and report accuracy on the test set.

• (Optional) With the bag-of-words representation, would the model be able to capture phrases like

“don’t like”? An alternative to the bag-of-words model is known as the bag-of-bigrams model,

where a bigram is two consecutive words in a sentence. Modify utils.py to include bigram

features with either model and see if they increase accuracy.

• (Optional Reading) Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

http://www.aclweb.org/anthology/P/P12/P12-2.pdf#page=118

Problem 5 (Logistic Regression with Autograd, 15pts)

In the previous problem, you implemented a Naive Bayes classifier for sentiment classification on the

IMDB Movie Reviews dataset. In this problem, you will apply logistic regression to the same task.

(a) `1-regularized logistic regression. Consider a model parameterized by w:

p(w) = 1

2b

exp(−

kwk1

b

)

p(y = 1|x, w) = σ(w>x)

p(y = 0|x, w) = 1 − σ(w>x)

where σ(·) is the sigmoid function. Note that we are imposing a Laplacian prior on w, see Murphy,

2.4.4.

(i) Given a dataset {(x

(i)

, y(i)

)}

N

i=1, derive the necessary gradient updates for MAP of w.

a

(ii) Show that for some constant λ, MAP inference of w is equivalent to minimizing

−

1

N

X

N

i=1

log p(y

(i)

|x

(i)

, w) + λkwk1

(b) Implementation using PyTorch automatic differentiation.b

(i) Using the bag-of-words feature representation from the previous question, train a logistic

regression model using PyTorch autograd and torch.nn. Report test accuracy. Select regularization strength λ based on validation accuracy.

(ii) Which 5 words correspond to the largest weight indices, per class, in the learnt weight vectors?

Which 5 words correspond to the least weight indices?

(iii) Study how sparsity (i.e percentage of zero elements in a vector) of the parameter vector

changes with different values of λ. Again, tune λ on the validation set and report the test

accuracies on the test set. Suggested values to try are {0, 0.001, 0.01, 0.1, 1}. You can treat

parameters with < 1e − 4 absolute values as zeros.

Problem 6 (Neural Networks, 5pts)

In the previous problem, we have implemented a Logistic Regression classifier using PyTorch. Logistic Regression can be seen as a 1-layer neural network. With PyTorch automatic differentiation,

implementing a multi-layer neural network only requires incremental change to our logistic regression

implementation.

(a) Implement a multi-layer neural network for IMDB classification and report accuracy on the test

set. You are free to design the network structure (number of hidden units, activation function)

and choose the optimization methods (SGD or ADAM, regularization or not, etc.).

(b) (Optional) Implement sentiment classification based on Convolutional Neural Networks. We recommend reading Yoon Kim (2014) Convolution Neural Networks for Sentence Classification. Note

that in this part, you need to treat the text as a sequence of word vectors instead of bag-ofwords. You can do this by forwarding batch.text[0] to torch.nn.Embedding(vocab size,

embedding dim) after setting the weights using the pretrained vectors from

text field.vocab.vectors.