CS 4650/7650 : Homework 2 SOLVED




5/5 - (2 votes)

CS 4650/7650 : Homework 2

1. This homework has two parts: questions 1–3 are theory questions, and Q4 is a
programming assignment with some parts requiring a written answer.
We will be using Gradescope to collect your assignments. Please read the following
instructions for submitting to Gradescope carefully!
(a) Each subproblem must be submitted on a separate page. When submitting to
Gradescope (under HW2 Writing), make sure to mark which page(s) correspond
to each problem or subproblem. For instance, Q1 has 5 subproblems, so the
solution to each must start on a new page.
(b) For the coding problem (Q4), please upload ‘hw2 skeleton’,
‘hw2 skeleton’, ‘ngram.ipynb’ and ‘rnn.ipynb’ under the HW2 Code
assignment on Gradescope. Write your solutions for Q4 (b), (c), (d), (e) in your
writeup, and attach pdf exports of ‘ngram.ipynb’ and ‘rnn.ipynb’, including
outputs, to your writeup.
(c) Note: This is a large class and Gradescope’s assignment segmentation features
are essential. Failure to follow these instructions may result in parts of your
assignment not being graded. We will not entertain regrading requests for
failure to follow instructions.
2. LATEX solutions are strongly encouraged (a solution template is available on the class
website), but scanned handwritten copies are also acceptable. Hard copies are not
3. We generally encourage collaboration with other students. You may discuss the
questions and potential directions for solving them with another student. However,
you need to write your own solutions and code separately, and not as a group activity.
Please list the students you collaborated with on the submission site.
Homework 1
Deadline: September 16, 3:29pm ET
CS 4650/7650
Natural Language
1. Language Modeling is a technique that allows us to compute the probabilities of
word sequences. The probability of a sequence W = w
1 = {w1, w2…wN }, with the
use of the chain rule, can be estimated as the product of probabilities of each word
given the history, as shown (leaving out the exact notation for the i = 1 edge case):
P(W) = P(w1, w2…wN )
= P(w1) P(w2|w1) P(w3|w1, w2)…P(wN |w1, w2…wN−1)
(a) Using an n-gram model allows us to approximate the above probability using
only a subset of of n−1 words from the history at each step. Simplify the above
expression for the general n-gram case, and the bigram case (n = 2). [3 pts]
(b) A common way to have markers for the start and the end of sentence is to add
the [BOS] (beginning of sentence) and [EOS] (end of sentence) tokens at the
start and end of each sentence. Consider the following text snippet:
[BOS] i made cheese at home [EOS]
[BOS] i like home made cheese [EOS]
[BOS] cheese made at home is tasty [EOS]
[BOS] i like cheese that is salty [EOS]
Using the expression derived in (a), find the probability of the following sequence
as per the bigram model: P([BOS] I like cheese made at home [EOS]). [5 pts]
(c) In practice, instead of raw probability, perplexity is used as the metric for evaluating
a language model. Define perplexity and find the value of perplexity for the
sequence in (b) for the bigram case. [2 pts]
(d) One way to deal with unseen word arrangements in the test set is to use Laplace
smoothing, which adds 1 to all bigram counts, before we normalize them into
probabilities. An alternative to Laplace smoothing (add-1 smoothing) is add-k
smoothing, where k is a fraction that allows assigning a lesser probability mass
to unseen word arrangements. Find the probability of the sequence in (b) with
add-k smoothing for k = 0.1. [5 pts]
(e) To deal with unseen words in the test set, a common way is to fix a vocabulary
by thresholding on the frequency of words, and assigning an [UNK] token to
represent all out-of-vocabulary words. In the example from (b), use a threshold
of count(w) > 1 to fix the vocabulary. Find the probability for the following
sequence for an add-0.1 smoothed bigram model:
P([BOS] i like pepperjack cheese [EOS]). [5 pts]
2 of 7
Homework 1
Deadline: September 16, 3:29pm ET
CS 4650/7650
Natural Language
2. Kneser-Ney smoothing is an n-gram smoothing method which addresses cases where
we may not have seen an n-gram very often. In this case, we can back off to probabilities
of n-1 grams. Kneser-Ney allows us to do that while taking into account the diversity
of n-1-grams, or their tendency to appear in unique contexts. For instance, if we
want to finish the sentence “I want to go to the movie ” using a bigram model
but have not observed “movie theatre” as a bigram before, we may use unigram
probability of “theatre” in our corpus. However, another word such as “Francisco”
may have a higher unigram probability than “theatre”. Since “Francisco” only appears
next to “San”, we can say it doesn’t appear in many diverse contexts. We can use
Kneser-Ney smoothing to take the diversity of contexts of each unigram into account.
The full formula for the bigram version of Kneser-Ney smoothing follows:
|wi−1) = max(C(wi−1, wi) − d, 0)
| {z }
discounted bigram probability
+ λ(wi−1) · PCONT INUAT ION (wi)
| {z }
joint unigram probability
λ(wi−1) = d
v C(wi−1, v)
· |w : C(wi−1, w) > 0| (2)
PCONT INUAT ION (wi) = |v ∈ V : C(v, wi) > 0|
w0 |v ∈ V : C(v, w0
) > 0|
, (3)
where V is our word vocabulary.
When combining bigram probability with unigram probability, we need to discount
the counts of bigrams to save some probability mass to distribute for unigrams.
The discount constant d in equation 2 discounts some probability mass from the
high bigram counts, and the normalizing constant λ in equation 3 distributes it
across unigram counts. Additionally, PCONT INUAT ION measures how likely a word
w appears as a novel continuation.
Assume that you have collected the data in the following tables, and assume that
all other observed counts are 0. In the bigram table, rows represent wi−1, columns
represent wi: e.g. C(computer, keyboard) = 6.
C(wi−1, wi) computer keyboard monitor store
computer 0 6 8 8
keyboard 2 0 0 3
monitor 1 1 0 4
store 1 0 0 0
Table 1: Bigram frequency. Rows = wi−1, columns = wi
Consider the following sentence fragment S: “I shopped at the computer ”.
You need to determine whether the sentence is more likely to end with “store” or
3 of 7
Homework 1
Deadline: September 16, 3:29pm ET
CS 4650/7650
Natural Language
computer 22
keyboard 10
monitor 15
store 15
Table 2: Unigram frequency.
(a) Compute the raw bigram probabilities for the candidate words {store, monitor}
to complete the sentence S, i.e. P(store|computer) and P(monitor|computer).
Is one word more likely than the other, and if so which one? [2 pts]
(b) Compute the Kneser-Ney smoothed bigram probability of the candidate words
{store, monitor} to complete the sentence. Use d = 0.5 as the discount term.
Is one word more likely than the other, and if so which one? If the result has
changed, why do you think it changed? [5 pts]
(c) Change the discount term to d = 0.1 and re-compute the Kneser-Ney smoothed
bigram probability of the candidate words {store, monitor} to complete the
sentence. Is one word more likely than the other, and if so which one? If the
result has changed, why do you think it changed? [3 pts]
4 of 7
Homework 1
Deadline: September 16, 3:29pm ET
CS 4650/7650
Natural Language
3. Neural Networks We learned about neural networks in class. The combination of
different model architectures, non-linearities through different activation functions,
and various ways to avoid overfitting make neural networks a flexible and robust
technique to learn from data.
(a) Activation functions Assume a 2-layer feedforward network y = f(x) such
a = W(1)x + b
z = σ(a)
y = W(2)z + b
where W(i) and b
(i) are the weights and biases for the i
th linear layer and σ(·)
is the sigmoid activation function at the hidden layer. Derive an equivalent
network f
0 — by adjusting W(i) and b
(i) — such that f
(x) = f(x) (i.e both
networks produce the same output for a given input) but f
uses the tanh(·)
activation function at the hidden layer. [4 pts]
(b) Dropout Among many regularization techniques to avoid overfitting in neural
networks, dropout is the simplest. During training, dropout randomly sets units
in the hidden layer h ∈ R
Dh to zero with probability pdrop (dropping different
units each minibatch), and then multiplies h by a constant γ. We can write this
hdrop = γd   h
where d ∈ {0, 1}
Dh is a mask vector where each entry is 0 with probability pdrop
and 1 with probability (1 − pdrop), and   is the element-wise product operation
(also known as a Hadamard product). γ is chosen such that the expected value
of hdrop is h i.e :
Epdrop [hdrop]i = hi ∀i ∈ {1, . . . , Dh}.
What should be the value of γ in terms of pdrop? Show the calculation to justify
your answer. [4 pts]
(c) Of the types of neural nets covered in class, which would you think is most
applicable to the task of language modeling? Explain using explicit elements
from the LM formulation and from the network you choose. [2 pts]
5 of 7
Homework 1
Deadline: September 16, 3:29pm ET
CS 4650/7650
Natural Language
4. In the textbook, language modeling was defined as the task of predicting the next
word in a sequence given the previous words. In this assignment, you will implement
both character-level and word-level n-gram language models. You will also implement
a character-level RNN language model. You need to submit your code ‘hw2 skeleton’,
‘hw2 skeleton’, ‘ngram.ipynb’ and ‘rnn.ipynb’ to HW2 Code in Gradescope.
Also, you should answer the questions for (b) (c) (d) (e) in your writeup, and export
‘ngram.ipynb’ and ‘rnn.ipynb’ with output to a pdf and attach to your writeup. Your
writeup should be uploaded to HW2 Writing.
(a) Complete the scripts ‘hw2 skeleton’ and ‘hw2 skeleton’. Detailed
instructions can be found in ‘ngram.ipynb’. You should also use test cases in
‘ngram.ipynb’ to get development results for (c) and (d). Submit ‘hw2 skeleton’
and ‘hw2 skeleton’ to HW2 Code on Gradescope, where you will see
the scores for your code. Character-level N-gram language models accounts for
20 points. Word-level n-gram language models accounts for 10 points, which
are bonus for CS 4650. [CS 7650: 30pts, CS 4650: 20 pts + bonus 10pts]
(b) Observe the generation results of your character-level and word-level n-gram
language models (n ≥ 1). The paragraphs which character-level n-gram language
models generate all start with F.The paragraphs which word-level n-gram language
models generate all start with In. Did you get such results? Explain what is
going on. (CS 4650 only need to explain this phenomenon in character-level
N-gram language model.) [2 pts]
(c) [Bonus for CS 4650, Required for CS 7650] Compare the generation results of
character-level and word-level n-gram language models. Which do you think
is better?
Compare the perplexity of ‘shakespeare sonnets.txt’ when using character-level
and word-level n-gram language models. Explain what you found. [2 pts]
(d) When you compute perplexity, you can play with different sets of hyper-parameters
in both character-level and word-level n-gram language models. You can tune
n, k and λ. Please report here the best results and the corresponding hyper-parameters
in development sets. For character-level n-gram language models [CS 4650 +
CS 7650], the development set is ‘shakespeare sonnets.txt’ [2 pts]. For word-level
n-gram language models [Bonus for CS 4650, Required for CS 7650], the
development sets are ‘shakespeare sonnets.txt’ [2 pts] and ‘val e.txt’ [2 pts].
(e) For RNN language models, you should complete the forward method of the
RNN class in ‘rnn.ipynb’. You need to figure out the code and tune the hyperparameters.
You should also
i. Copy a paragraph generated by your model here, [4 pts]
ii. Report hyperparameters and perplexity on the development set
‘shakespeare sonnets.txt’ here, [4 pts] and
iii. Compare the results of character-level RNN language model and character-level
n-gram language model here. [2 pts]
6 of 7
Homework 1
Deadline: September 16, 3:29pm ET
CS 4650/7650
Natural Language
Do not forget to export ‘rnn.ipynb’, including output, to pdf and attach to your
7 of 7