CS 281 - Assignment #2

NOTE: you must show derivations for your answers unless a question explicitly mentions that no justification
is required.
Problem 1 (Spherical Gaussian, 10pts)
One intuitive way to summarize a probability density is via the mode, as this is the “most likely”
value in some sense. A common example of this is using the maximum a posteriori (MAP) estimate
of a model’s parameters. In high dimensions, however, the mode becomes less and less representative
of typical samples. Consider variates from a D-dimensional zero-mean spherical Gaussian with unit covariance:
x ∼ N(0D, ID),
where 0D indicates a column vector of D zeros and ID is a D × D identity matrix.
1. Compute the distribution that this implies over the distance of these points from the origin. That is, compute the distribution over √(x^T x), if x is a realization from N(0D, ID). (Note: consider transformations of a Gamma distribution, described in Murphy 2.4.5.)
2. Make a plot that shows this probability density function for several different values of D, up to
D = 100.
3. Make a plot of the cumulative distribution function (CDF) of this distance distribution for D = 100. (A closed-form solution may be difficult to compute, so you can do this numerically.)
4. From examining the CDF we can think about where most of the mass lives as a function of radius.
For example, most of the mass for D = 100 is within a thin spherical shell. From eyeballing the
plot, what are the inner and outer radii for the shell that contains 90% of the mass in this case?
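The distance √(x^T x) of a standard Gaussian variate from the origin follows a chi distribution with D degrees of freedom, which is enough to sketch parts 1-4 numerically. A minimal pure-Python sketch (the helper names are illustrative, not required by the assignment):

```python
import math

def chi_pdf(r, d):
    """Density of ||x|| for x ~ N(0, I_d): the chi distribution with d dof."""
    if r <= 0:
        return 0.0
    log_f = ((d - 1) * math.log(r) - r * r / 2.0
             - (d / 2.0 - 1.0) * math.log(2.0) - math.lgamma(d / 2.0))
    return math.exp(log_f)

def chi_cdf(r, d, steps=20000):
    """Numerical CDF via trapezoidal integration of chi_pdf on [0, r]."""
    h = r / steps
    total = 0.5 * (chi_pdf(0.0, d) + chi_pdf(r, d))
    for k in range(1, steps):
        total += chi_pdf(k * h, d)
    return total * h

# For D = 100 the mass concentrates in a thin shell near sqrt(D) = 10:
# chi_cdf(11.2, 100) - chi_cdf(8.8, 100) is roughly 0.9.
```

Plotting `chi_pdf` for several D (and `chi_cdf` for D = 100) reproduces the figures requested above; the mode sits at √(D − 1), which for D = 100 is already far from the mode of the density of x itself at the origin.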

Problem 2 (Hurdle Models for Count Data, 10pts)
In this problem we consider predictive models of count data. For instance, given information x about a student, can we predict how often they went to the gym in a given week, y? A natural choice is a Poisson GLM, i.e., y conditioned on x is modeled as a Poisson distribution.
However, in practice, count data of this form often follows a bi-modal distribution. For instance, our data may come from a survey asking students how often they went to the gym in the past week. Some would go frequently, some would go occasionally but not in the past week (a random zero), and a substantial percentage would never go.
When modeling this count data with generalized linear models, we may observe more zero examples
than expected from our model. In the case of a Poisson, the mode of the distribution is the integer part
of the mean. A Poisson GLM may therefore be inadequate when means can be relatively large but the
mode of the output is 0. Such data is common when many data entries have 0 outputs and many also
have much larger outputs, so the mode of output is 0 but the overall mean is not near 0. This problem
is known as zero-inflation.
This problem considers handling zero-inflation with a two-part model called a hurdle model. One
part is a binary model such as a logistic model for whether the output is zero or positive. Conditional on
a positive output, the “hurdle is crossed” and the second part uses a truncated model that modifies an
ordinary distribution by conditioning on a positive output. This model can handle both zero inflation
and zero deflation.
Suppose that the first part of the process is governed by probabilities p(y > 0 | x) = π and p(y = 0 | x) = 1 − π, and the second part depends on {y ∈ Z | y > 0} and follows a probability mass function f(y | x) that is truncated at zero. The complete distribution is therefore:

P(y = 0 | x) = 1 − π
P(y = j | x) = π · f(j | x) / (1 − f(0 | x)),   j = 1, 2, …
One choice of parameterization is to use a logistic regression model for π:
π = σ(x^T w)
and a Poisson GLM for f with mean parameter λ (see Murphy 9.3), using a separate weight vector v:
λ = exp(x^T v)
(a) Suppose we observe N data samples {(x_n, y_n)}_{n=1}^{N}. Write down the log-likelihood for the hurdle model assuming an unspecified mass function f. Give a maximum likelihood estimation approach for the specified parts of the model.
(b) Assume now that we select a Poisson distribution for f. Show that the truncated-at-zero Poisson distribution (as used in the hurdle model) is a member of the exponential family. Give its sufficient statistics, natural parameters, and log-partition function.
(c) What are the mean and variance of a truncated Poisson model with mean parameter λ? If we observe n i.i.d. samples from a truncated Poisson distribution, what is the maximum likelihood estimate of λ? (Note: give an equation which could be solved numerically to obtain the MLE.)
(d) Now assume that we use the hurdle model as a GLM with f as a Poisson distribution. Show that this is a valid GLM (exponential family for y), derive its log-likelihood, and give its sufficient statistics.
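For part (c), the mean of the zero-truncated Poisson, E[y] = λ / (1 − e^{−λ}), is monotone increasing in λ, so the MLE equation ȳ = λ̂ / (1 − e^{−λ̂}) can be solved numerically by bisection. A small sketch under those assumptions (function names are illustrative):

```python
import math

def truncated_poisson_mean(lam):
    """E[y] for Poisson(lam) conditioned on y > 0."""
    return lam / (1.0 - math.exp(-lam))

def truncated_poisson_mle(ybar, lo=1e-8, hi=50.0, iters=100):
    """Solve ybar = lam / (1 - exp(-lam)) for lam by bisection.

    The right-hand side increases from 1 as lam grows, so a solution
    exists whenever ybar > 1 (which holds for zero-truncated samples).
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if truncated_poisson_mean(mid) < ybar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As a check, `truncated_poisson_mle(truncated_poisson_mean(2.0))` recovers λ = 2; the same one-dimensional solve is what a Newton or bisection step inside the hurdle-model MLE would perform.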

Problem 3 (Directed Graphical Models and Naive Bayes, 10pts)
To draw the DGMs for this problem, we recommend using the tikzbayesnet library.
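A minimal sketch of what such a figure source might look like, assuming the bayesnet TikZ library is installed (the node layout here is illustrative, not the assignment's actual example):

```latex
\documentclass{standalone}
\usepackage{tikz}
\usetikzlibrary{bayesnet,positioning}
\begin{document}
\begin{tikzpicture}
  % unobserved class label
  \node[latent] (y) {$y$};
  % observed feature x_v, conditioned on y
  \node[obs, below=of y] (x) {$x_v$};
  \edge {y} {x};
  % plate replicating x_v over the V features
  \plate {plateV} {(x)} {$V$};
\end{tikzpicture}
\end{document}
```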
This problem focuses on modeling a joint distribution of random variables, p(y, x1, . . . , xV ), consisting of discrete variables. These variables represent a class label y ∈ {1, . . . , C} and features x1, . . . , xV, each of which can take on values xv ∈ {0, 1}.
(a) Let V = 4. Use the chain rule to select any valid factorization of this joint distribution into
univariate distributions. Draw the directed graphical model corresponding to this factorization.
(b) What is the sum of the sizes of the conditional probability tables associated with this graphical model? Can you reduce the order of magnitude of this value with a different DGM?
(c) Now consider a naive Bayes factorization of this model, given by
p(y, x1, . . . , xV ) ≈ p(y) ∏_{v=1}^{V} p(xv | y).
Draw a directed graphical model for this factorization. What is the size of the conditional probability tables required to fully express any factored distribution of this form?
(d) In class, we parameterized naive Bayes such that the class distribution is Categorical with a
Dirichlet prior, and the class-conditional distributions are Bernoulli with a Beta prior. Extend the
graphical model above to show the generative model of N data points and include the parameters
and hyper-parameters as random variables.
(e) Assuming the data obeys the naive Bayes assumptions, answer the following questions as true/false using your directed graphical model. Justify your answers.
• For a given example, features x1 and x2 are independent.
• The class labels y are always conditionally independent of the class-conditional parameters.
• Upon observing the class distribution parameters, the class labels are conditionally independent.
• Upon observing the class distribution parameters, the features are conditionally independent.
• Upon observing the class distribution hyper-parameters, the class labels are conditionally independent.
(f) For the next problem, we will utilize naive Bayes for a problem where each example has a bag, or multiset, of items. A bag is a set that may contain multiple instances of the same value. One approach is to ignore this property and use xv as an indicator function for each item type. An alternative is to model xv with sample space {0, . . . , D}, where D is the maximum number of times an item appears, and to use a Dirichlet-Categorical distribution for the class-conditional. Give one benefit and one drawback of this approach. Propose a third option for modeling this distribution.
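For parts (a)-(c), the table sizes can be counted mechanically: a chain-rule factorization p(y) p(x1 | y) p(x2 | y, x1) … stores one entry per parent configuration per value of each variable, while naive Bayes only needs C + 2CV entries. A sketch of that arithmetic (the exact counting convention, e.g. whether redundant last rows are stored, may differ from what your course expects):

```python
def chain_rule_cpt_size(C, V):
    """Entries in p(y) plus p(x_v | y, x_1..x_{v-1}) for binary x_v."""
    total = C  # table for p(y)
    for v in range(1, V + 1):
        # parent configurations: C classes times 2^(v-1) earlier features,
        # with 2 entries per configuration for the binary x_v
        total += C * (2 ** (v - 1)) * 2
    return total

def naive_bayes_cpt_size(C, V):
    """Entries in p(y) plus V tables p(x_v | y) for binary x_v."""
    return C + 2 * C * V

# With C = 3 classes and V = 4 binary features:
# chain_rule_cpt_size(3, 4) -> 93, naive_bayes_cpt_size(3, 4) -> 27
```

The chain-rule total grows exponentially in V while the naive Bayes total grows linearly, which is the order-of-magnitude reduction part (b) asks about.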

Problem 4 (Naive Bayes Implementation, 10pts)
You will now implement a naive Bayes classifier for sentiment classification. For this problem you will use the IMDB Movie Reviews dataset, which consists of positive and negative movie reviews. Here
are two example reviews:
there is no story! the plot is hopeless! a filmed based on a car with a
stuck accelerator, no brakes, and a stuck automatic transmission gear
lever cannot be good! … i feel sorry for the actors … poor script …
heavily over-dramatized … this film was nothing but annoying,
stay away from it! [negative review]
i had forgotten both how imaginative the images were, and how witty
the movie … anyone interested in politics or history will love the movie’s
offhand references – anyone interested in romance will be moved – this
one is superb. [positive review]
As noted in the last problem, it is common to think of the input data as a bag/multiset. In text
applications, sentences are often represented as a bag-of-words, containing how many times each word
appears in the sentence. For example, consider two sentences:
• We like programming. We like food.
• We like CS281.
A vocabulary is constructed based on these two sentences:
[“We”, “like”, “programming”, “food”, “CS281”]
Then the two sentences are represented as the number of occurrences of each word in the vocabulary
(starting from position 1):
• [0, 2, 2, 1, 1, 0]
• [0, 1, 1, 0, 0, 1]
We have included a utility file that does this mapping. For these problems you can therefore treat the text in this matrix representation.
• Implement a Naive Bayes classifier using a Bernoulli class-conditional with a Beta prior where
each feature is an indicator that a word appears at least once in the bag.
• Implement a Naive Bayes classifier using a Categorical class-conditional with a Dirichlet prior. Here the features represent the count of each word in the bag.
• For both models, experiment with various settings for the priors. For the Dirichlet prior on the class, begin with α = 1 (Laplace smoothing). Do the same for the class-conditional prior (be it Dirichlet or Beta). Keeping the prior uniform, vary the magnitude to 0.5 and smaller. If the classes are unbalanced in the dataset, does it help to use a larger α for the less frequently occurring class? Optionally, choose class-conditional priors based on an outside text source. Validate your choices on the validation set, and report accuracy on the test set.
• (Optional) With the bag-of-words representation, would the model be able to capture phrases like "don't like"? An alternative to the bag-of-words model is known as the bag-of-bigrams model, where a bigram is two consecutive words in a sentence. Modify your implementation to include bigram features with either model and see if they increase accuracy.
• (Optional Reading) Baselines and Bigrams: Simple, Good Sentiment and Topic Classification
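To make the first bullet concrete, here is a hedged pure-Python sketch of Bernoulli naive Bayes with Beta(α, α) smoothing on the class-conditionals, shown on toy lists rather than the provided matrix representation (all names are illustrative):

```python
import math

def train_bernoulli_nb(X, y, num_classes, alpha=1.0):
    """Smoothed estimates (posterior means): Laplace-smoothed class
    priors and theta[c][v] = p(x_v = 1 | y = c) under a Beta(alpha,
    alpha) prior on each Bernoulli parameter."""
    n, V = len(X), len(X[0])
    counts = [0] * num_classes
    ones = [[0] * V for _ in range(num_classes)]
    for xs, c in zip(X, y):
        counts[c] += 1
        for v, xv in enumerate(xs):
            if xv:
                ones[c][v] += 1
    prior = [(counts[c] + 1.0) / (n + num_classes) for c in range(num_classes)]
    theta = [[(ones[c][v] + alpha) / (counts[c] + 2.0 * alpha)
              for v in range(V)] for c in range(num_classes)]
    return prior, theta

def predict_bernoulli_nb(x, prior, theta):
    """Return argmax_c of log p(y=c) + sum_v log p(x_v | y=c)."""
    best_c, best_lp = None, -math.inf
    for c, (pc, th) in enumerate(zip(prior, theta)):
        lp = math.log(pc)
        for xv, t in zip(x, th):
            lp += math.log(t) if xv else math.log(1.0 - t)
        if lp > best_lp:
            best_c, best_lp = c, lp
    return best_c
```

The Categorical/Dirichlet variant in the second bullet replaces the per-word Bernoulli counts with word-occurrence counts and the Beta pseudo-counts with Dirichlet pseudo-counts, but the train/predict structure is the same.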

Problem 5 (Logistic Regression with Autograd, 15pts)
In the previous problem, you implemented a Naive Bayes classifier for sentiment classification on the
IMDB Movie Reviews dataset. In this problem, you will apply logistic regression to the same task.
(a) ℓ1-regularized logistic regression. Consider a model parameterized by w:
p(w) = ∏_{j=1}^{D} (1 / (2b)) exp(−|w_j| / b)
p(y = 1 | x, w) = σ(w^T x)
p(y = 0 | x, w) = 1 − σ(w^T x)
where σ(·) is the sigmoid function. Note that we are imposing a Laplacian prior (with scale b) on w; see Murphy.
(i) Given a dataset {(x^(i), y^(i))}_{i=1}^{N}, derive the necessary gradient updates for MAP estimation of w.
(ii) Show that for some constant λ, MAP inference of w is equivalent to minimizing

− Σ_{i=1}^{N} log p(y^(i) | x^(i), w) + λ‖w‖1
(b) Implementation using PyTorch automatic differentiation.
(i) Using the bag-of-words feature representation from the previous question, train a logistic regression model using PyTorch autograd and torch.nn. Select the regularization strength λ based on validation accuracy, and report test accuracy.
(ii) Which 5 words correspond to the largest weights, per class, in the learnt weight vectors? Which 5 words correspond to the smallest weights?
(iii) Study how the sparsity (i.e., the percentage of zero elements in a vector) of the parameter vector changes with different values of λ. Again, tune λ on the validation set and report accuracies on the test set. Suggested values to try are {0, 0.001, 0.01, 0.1, 1}. You can treat parameters with absolute value < 1e−4 as zeros.
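Part (a)(i) can be checked by hand before moving to PyTorch: the gradient of the negative log-posterior is Σ_i (σ(w^T x^(i)) − y^(i)) x^(i) plus a λ·sign(w) subgradient from the ℓ1 term. A small pure-Python sketch of that update (names are illustrative; in part (b) autograd replaces all of this):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def l1_logreg_subgradient(w, X, y, lam):
    """Subgradient of -sum_i log p(y_i | x_i, w) + lam * ||w||_1."""
    g = [0.0] * len(w)
    for xs, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xs))) - yi
        for j, xj in enumerate(xs):
            g[j] += err * xj
    for j, wj in enumerate(w):
        # subgradient of lam * |w_j|; 0 is a valid choice at w_j = 0
        g[j] += lam * (1.0 if wj > 0 else -1.0 if wj < 0 else 0.0)
    return g

def subgradient_descent(X, y, lam, lr=0.1, steps=500):
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = l1_logreg_subgradient(w, X, y, lam)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w
```

Plain subgradient descent does not drive weights exactly to zero the way proximal methods do, which is why part (iii) suggests thresholding small absolute values when measuring sparsity.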

Problem 6 (Neural Networks, 5pts)
In the previous problem, we implemented a logistic regression classifier using PyTorch. Logistic regression can be seen as a 1-layer neural network. With PyTorch automatic differentiation, implementing a multi-layer neural network only requires an incremental change to our logistic regression code.
(a) Implement a multi-layer neural network for IMDB classification and report accuracy on the test
set. You are free to design the network structure (number of hidden units, activation function)
and choose the optimization methods (SGD or ADAM, regularization or not, etc.).
(b) (Optional) Implement sentiment classification based on Convolutional Neural Networks. We recommend reading Yoon Kim (2014), Convolutional Neural Networks for Sentence Classification. Note that in this part, you need to treat the text as a sequence of word vectors instead of a bag-of-words. You can do this by forwarding batch.text[0] to torch.nn.Embedding(vocab_size, embedding_dim) after setting the weights using the pretrained vectors from text_field.vocab.vectors.
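To see how small the change from logistic regression is, here is a one-hidden-layer forward pass in plain Python (illustrative only; in practice you would compose torch.nn.Linear layers and let autograd handle the backward pass):

```python
import math

def linear(W, b, x):
    """W is a list of rows; returns W @ x + b as a list."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def relu(z):
    return [max(0.0, zi) for zi in z]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Logistic regression is sigmoid(linear(W2, b2, x)); the MLP
    just inserts one extra linear layer and a ReLU before it."""
    h = relu(linear(W1, b1, x))
    return sigmoid(linear(W2, b2, h)[0])
```

Everything about training (loss, optimizer, regularization) carries over from the previous problem unchanged; only this forward function grows.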