
601.465/665 — Natural Language Processing
Assignment 3: Smoothed Language Modeling

Probabilistic models are an indispensable part of modern NLP. This assignment will try to convince
you that even simplistic and linguistically stupid models like n-gram models can be very useful, provided
their parameters are estimated carefully. See reading section A.
You now know enough about probability to build and use some trigram language models. You will
experiment with different types of smoothing. You will also get experience in running corpus experiments
over training, development, and test sets. This is the only assignment in the course to focus on that.
Reading: Read the handout attached to the end of this assignment!
Collaboration: You may work in teams of up to 2 on this assignment. That is, if you choose, you
may collaborate with 1 partner from the class, handing in a single homework with multiple names on it.
Do the work together; don't divide it up: if you didn't work on a question, you don't deserve credit for it!
Your solutions should emerge from collaborative real-time discussions with both of you present.
Programming language: We ask you to use Python 3 for this assignment. To ease the transition (as
this is the first assignment to require Python), we will give you some useful code as a starting point.1 On
getting programming help: Since this is an upper-level NLP class, not a programming class, I don't want
you wasting time on low-level issues like how to handle I/O or hash tables of arrays. If you find yourself
wasting time on programming issues, then by all means seek help from someone who knows the language
better! Your responsibility is the NLP stuff—you do have to design, write, and debug the interesting code
and data structures on your own. But I don't consider it cheating if another hacker (or the TA) helps you
with your I/O routines or language syntax or compiler warning messages. These aren't Interesting™.
How to hand in your written work: Via Gradescope as before. Besides the comments you embed
in your source files, put all other notes, documentation, and answers to questions in a README.pdf file.
How to test and hand in your code:
• Place your code and trained parameters into a single submission directory, which you will zip and
upload separately on Gradescope. We will post more detailed instructions on Piazza.
• For the parts where we tell you exactly what to do, an autograder will check that you got it right.
• For the open-ended challenge, an autograder will run your system and score its accuracy on “dev-test” and “final-test” data.
– You should get a decent grade if you do at least as well as the “baseline” system provided by
the TAs. Better systems will get higher grades.
– Each time you submit a version of your system, you'll see the results on “dev-test” data and
how they compare to the baseline system. You'll also see what other teams tried and how their
accuracies compared to yours. (Teams will use made-up names, not real names.)
1 It counts word n-grams in a corpus, using hash tables, and uses the counts to calculate simple probability estimates. We also
supply a method that calls an optimization library to maximize a function.
– The “dev-test” results are intended to help you develop your system. Grades will be based on
the “final-test” data that you have never seen.
1 Warmup: Word embeddings
The big part of this assignment will involve various kinds of smoothed language models. Often, smoothing
involves treating similar events in similar contexts as having similar probabilities. (Backoff is one example.)
One strategy will be to treat similar words as having similar probabilities. But what does “similar
words” mean? More concretely, how will we measure whether two words are similar?
Log into the ugrad machines, and look in the directory /home/arya/hw-lm/lexicons/. A lexicon
lists useful properties of the words of a language. Each of the words-*.txt files2 is a lexicon that lists
over 70,000 words of English, with a representation of each word as a d-dimensional vector. In other words,
it embeds these words into the vector space R^d.
We can now define “similar words” as words with similar vectors. A word may have many syntactic
and semantic properties—some words are transitive verbs, some words are plural, some words refer to
animals. These properties can be considered by a log-linear model, and words with similar vectors tend to
have similar properties. The larger d is, the more room the vector has to encode interesting properties.
The vectors in these particular files were produced automatically by Google's word2vec program.3
Their individual dimensions are hard to interpret as natural properties. It would certainly be nice if dimension 2 (for example) represented the “animal” property, so that a word that referred to an animal would be
one whose vector ~v = (v1, v2, . . . , vd) had strongly positive v2. In practice, however, word2vec chooses an
arbitrary basis for the vector space. So the “animal” direction—to the extent that there is one—is actually
represented by some arbitrary vector ~u. The animal words tend to be those whose vectors ~v have strongly
positive ~v · ~u.
(Geometrically, this dot product measures how far ~v extends in direction ~u (it projects ~v onto the ~u
vector). Algebraically, it computes a certain linear combination of v1, v2, . . . , vd. As a special case, if ~u
were (0, 1, 0, 0, . . .), then ~v · ~u would in fact return v2. But there's no reason to expect that ~u would be
that simple.)
You'll be using these lexicons in the next section. For this part, you'll just warm up by writing a short
program to get a feel for vector embeddings.
1. Your program should print the 10 words most similar to seattle, other than seattle
itself, according to the 50-dimensional embeddings, if you run it as
python3 words-50.txt seattle
They should be printed in decreasing order of similarity.
2 The file format should be easy to figure out. The file words-d.txt can be regarded as a matrix with about 70,000 rows and
d columns. Each row is labeled by a word. The first line of the file is special: it gives the number of rows and columns.
3 The details are not important for this assignment, but the vectors are optimized by gradient descent so as to arrange that the
vector for each word token wi will be predictable (to the extent possible) from the average vector of nearby word tokens (wj for
j ≠ i and i − 5 ≤ j ≤ i + 5). If you're curious, you can find details in Mikolov et al. (2013): “Distributed Representations of
Words and Phrases and their Compositionality”. We ran the CBOW method over the first 1 billion characters of English Wikipedia.
word2vec doesn't produce vector representations for rare words (c(w) < 5), so we first replaced rare words with the special
symbol ool (“out of lexicon”), forcing word2vec to learn a vector representation for ool.
To measure the similarity of two words with vectors ~v and ~w, please use the cosine similarity, which
is the cosine of the angle θ between the two vectors:
cos θ = (~v · ~w) / (||~v|| ||~w||) = Σ_{i=1}^{d} v_i w_i / ( √(Σ_{i=1}^{d} v_i²) · √(Σ_{i=1}^{d} w_i²) ) ∈ [−1, 1]
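For instance, the top-10 search might be sketched as follows. This is a minimal sketch, not the required implementation: the file-format assumptions follow footnote 2, the function names are mine, and heapq.nlargest is one way to honor the later hint about not sorting the whole lexicon.

```python
import heapq
import math

def read_lexicon(path):
    """Parse a words-d.txt lexicon: the first line gives the row and
    column counts; each later line is a word followed by d numbers."""
    lexicon = {}
    with open(path) as f:
        next(f)  # skip the special header line
        for line in f:
            word, *nums = line.split()
            lexicon[word] = [float(x) for x in nums]
    return lexicon

def cosine(v, w):
    """Cosine of the angle between vectors v and w."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(vi * vi for vi in v)) *
                  math.sqrt(sum(wi * wi for wi in w)))

def most_similar(lexicon, query, k=10):
    """The k words most similar to `query`, excluding `query` itself,
    in decreasing order of similarity."""
    v = lexicon[query]
    scored = ((cosine(v, w), word)
              for word, w in lexicon.items() if word != query)
    return [word for _, word in heapq.nlargest(k, scored)]
```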
2 What are the most similar words to seattle, dog, communist, jpg, the, and google? Play around
some more. What are some examples that work “well” or “poorly”?4 What patterns do you notice?
So far you have been using d = 50. What happens for larger or smaller values of d?
Hint: Start by making your program find only the single most similar word, which should be easy
enough. Then you can generalize this method to find the 10 most similar words. (Since you only
want the top 10, it's not necessary to sort the whole lexicon by similarity—that method is acceptable,
but inefficient.)
Note: You can copy the lexicon files to your personal machine. But to avoid downloading such large
files, it may be easier to run directly on the ugrad machines. If you do this, please
don't waste space by making fresh copies of the files on the ugrad machines. Just use them
at their current location. If you like, create symbolic links (shortcuts) to them by typing “ln -s
/home/arya/hw-lm/lexicons/* .” in your own working directory.
5 Submit on Gradescope.
2. Extra credit: Now extend your program so that it can also be run as follows:
python3 words-50.txt king --minus man --plus woman
This should find the 10 words most similar to the vector king − man + woman (other than those three
words themselves).
(The old command python3 words-50.txt seattle should still work. Note that
it is equivalent to python3 words-50.txt seattle --minus seattle --plus
seattle, and you may want to implement it that way.)
You can regard the above command as completing the analogy
man : woman :: king : ?
Try some more analogies, this time using the 200-dimensional vectors. For example:
king – man + woman
paris – france + uk
hitler – germany + italy
child – goose + geese
goes – eats + ate
car – road + air
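The vector arithmetic behind these analogies can be sketched as follows; this is a minimal sketch assuming a lexicon dictionary loaded as in part 1 (the `analogy` and `cosine` helper names are mine, not part of the assignment interface).

```python
import heapq
import math

def cosine(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm = (math.sqrt(sum(x * x for x in v)) *
            math.sqrt(sum(x * x for x in w)))
    return dot / norm

def analogy(lexicon, a, minus, plus, k=10):
    """Nearest neighbors of vector(a) - vector(minus) + vector(plus),
    excluding the three query words themselves."""
    target = [ai - mi + pi for ai, mi, pi in
              zip(lexicon[a], lexicon[minus], lexicon[plus])]
    exclude = {a, minus, plus}
    scored = ((cosine(target, vec), w)
              for w, vec in lexicon.items() if w not in exclude)
    return [w for _, w in heapq.nlargest(k, scored)]
```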
6 Come up with some analogy questions of your own. Which ones work well or poorly? What happens
to the results if you use (say) d = 10 instead of d = 200?
7 Why does this work at all? Be sure to discuss the role of the vector king − man before woman is
added. What does it tell you about the vectors that they can solve analogies?
8 Submit your extended program on Gradescope instead of the original program.
2 Smoothed language modeling
1. Your starting point is the sample program fileprob. It can be found on the ugrad machines
(ugrad{1,2,…,24}) in the directory /home/arya/hw-lm/code.
You're welcome to develop directly on these machines, if you want to avoid transferring the data.
Each language-specific subdirectory contains an INSTRUCTIONS file explaining how to get the program running. Those instructions will let you automatically compute the log2-probability of three
sample files (speech/{sample1,sample2,sample3}). Try it!
Next, you should spend a little while looking at those sample files yourself, and in general, browsing
around the /home/arya/hw-lm directory to see what's there. See reading sections B and C for more
information about the datasets.
9 If a language model is built from the speech/switchboard-small corpus, using add-0.01 smoothing, what is the model's perplexity per word on each of the three sample files? (You can compute this
from the log2-probability that fileprob prints out, as discussed in class and in your textbook. Use
the command wc -w on a file to find out how many words it contains.)
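The conversion from fileprob's total log2-probability to perplexity might look like this (a minimal sketch; the numbers in the comment are hypothetical, not actual fileprob output):

```python
def perplexity_per_word(log2prob, num_words):
    """Perplexity = 2 ** (cross-entropy per word), where the
    cross-entropy per word is -log2prob / num_words bits."""
    return 2 ** (-log2prob / num_words)

# e.g. if fileprob reported a total log2-probability of -7200 bits
# on a 1000-word file (hypothetical numbers), the cross-entropy is
# 7.2 bits/word and the perplexity is 2**7.2, roughly 147.
```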
10 What happens to the log2-probabilities and perplexities if you train instead on the larger switchboard
corpus? Why?
2. Modify fileprob to obtain a new program textcat that does text categorization. The two programs should share code for smoothed language models, so that when you write new smoothing
methods later, they will immediately be available from both programs. See the INSTRUCTIONS file
for programming-language-specific directions about which files to copy, alter, and submit for this question.
textcat should be run from the command line almost exactly like fileprob. However, as additional arguments,
• training needs to specify two training corpora rather than one: train1 and train2. These correspond to data for the two classes in your text categorization problem.
• testing needs to specify the prior probability of train1.
For example, you could train your system with a line like
textcat TRAIN add1 words-10.txt gen spam
which saves the trained models in a file but prints no output. In this example, gen and spam are
the training corpora, corresponding to “genuine” and spam emails. words-10.txt is a lexicon containing word vectors, which will be used by the log-linear method (see problem 6) but is ignored by the other smoothing methods.
4 One type of ‘working poorly’ was explored in this paper at NeurIPS. Since then, gender bias in NLP tools has become a hot
research area at Hopkins and elsewhere. But there are other types of ‘working poorly’ that you can find.
Then you would test like this:
textcat TEST add1 words-10.txt gen spam 0.7 foo.txt bar.txt baz.txt
which loads the models from before and uses them to classify the remaining files. It should print
output that labels each file with a training corpus name (in this case gen or spam):
spam foo.txt
spam bar.txt
gen baz.txt
1 files were more probably gen (33.33%)
2 files were more probably spam (66.67%)
In other words, it classifies each file by printing its maximum a posteriori class (the file name of the
training corpus that probably generated it). Then it prints a summary.
The number 0.7 on the test command line specifies your prior probability that a test file will be gen.
(See reading section C.3.)
Please use the exact output formats above. If you would like to print any additional output lines for
your own use, please direct them to STDERR; we provide examples.
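The maximum a posteriori decision rule compares log-prior plus log-likelihood under each model. A minimal sketch, assuming the two log2-probabilities have already been computed by the shared language-model code (the function name and argument layout are mine):

```python
import math

def classify(log2prob1, log2prob2, prior1, name1, name2):
    """MAP decision: choose the class that maximizes
    log2 p(class) + log2 p(file | class)."""
    score1 = math.log2(prior1) + log2prob1
    score2 = math.log2(1 - prior1) + log2prob2
    return name1 if score1 >= score2 else name2
```

Working in log space avoids underflow: the file probabilities themselves would be astronomically small floating-point numbers.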
As reading section D explains, both language models built by textcat should use the same finite
vocabulary. Define this vocabulary to be all words that appeared ≥ 3 times in the union of the two
training corpora, plus oov. Your add-λ model doesn't actually need to store the set of words in
the vocabulary, but it does need to know its size V, because the add-1 smoothing method estimates
p(z | xy) as (c(xyz) + 1) / (c(xy) + V). We've provided code to find V for you—see the INSTRUCTIONS file for details.
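The provided code already finds V for you; purely for intuition, the threshold rule could be sketched like this (a sketch under the assumption that “union” means the combined token counts of the two corpora):

```python
from collections import Counter

def build_vocab(train1_tokens, train2_tokens, threshold=3):
    """Vocabulary = words appearing >= threshold times in the
    combined training corpora, plus the special oov symbol."""
    counts = Counter(train1_tokens) + Counter(train2_tokens)
    vocab = {w for w, c in counts.items() if c >= threshold}
    vocab.add("oov")
    return vocab  # V = len(vocab)
```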
3. In this question, you will evaluate your textcat program on one of two tasks. You can do either
language identification (the english spanish directory) or else spam detection (the gen spam directory). Have a look at the development data in both directories to see which one floats your boat.
(Don't peek at the test data!)
Using add-1 smoothing, run textcat on all the dev data for your chosen task:5
• For the language ID task, classify the files english spanish/dev/english/*/* using the
training corpora en.1K and sp.1K.
Then classify english spanish/dev/spanish/*/* similarly. Note that for this corpus, the
“words” are actually letters. Use 0.7 as your prior probability of English.
• Or, for the spam detection task, classify the files gen spam/dev/gen/* using the training
corpora gen and spam.
Then classify gen spam/dev/spam/* similarly. Use 0.7 as your prior probability of gen.
To simplify the wording below, I'll assume that you've chosen the spam detection task.
11 (a) From the results, you should be able to compute a total error rate for the technique. That is,
what percentage of the dev files were classified incorrectly?
12 (b) How small do you have to make the prior probability of gen before textcat classifies all the
dev files as spam?
5 It may be convenient to use symbolic links to avoid typing long filenames. E.g.,
ln -s /home/arya/hw-lm/english_spanish/train ˜/estrain will create a subdirectory estrain under your home
directory; this subdirectory is really just a shortcut to the official training directory.
(c) Now try add-λ smoothing for λ ≠ 1. First, use fileprob to experiment by hand with different
values of λ > 0. (You'll be asked to discuss in question 4b why λ = 0 probably won't work.)
13 What is the minimum cross-entropy per token that you can achieve on the gen development
files (when estimating a model from gen training files with add-λ smoothing)? How about for
the spam development files?
Note: To check that you are smoothing correctly, the autograder will run your code on small
training and testing files.
(d) In principle, you could apply different amounts of smoothing to the gen and spam models, and
this might be wise—for example, if their training sets have different rates of novel words.
However, for simplicity, your textcat program in this assignment smooths both models in
exactly the same way. 14 So what is the minimum cross-entropy per token that you can achieve
on all development files together, if both models are smoothed with the same λ?
(To measure cross-entropy per token, find the total number of bits that it takes to predict all
of the development files from their respective models. This means running fileprob twice:
once for the gen data and once for the spam data.6 Add the two results, and then divide by the
total number of tokens in all of the development files.)
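The bookkeeping in the parenthetical above is just a weighted average; a sketch (the numbers in the comment are hypothetical):

```python
def cross_entropy_per_token(log2probs, token_counts):
    """Combine several fileprob runs: total bits needed to predict
    all the dev files, divided by the total number of tokens."""
    total_bits = -sum(log2probs)   # log2-probabilities are negative
    total_tokens = sum(token_counts)
    return total_bits / total_tokens

# e.g. gen dev: log2-prob -48000 over 6000 tokens; spam dev:
# log2-prob -36000 over 4000 tokens (hypothetical numbers)
# => (48000 + 36000) / (6000 + 4000) = 8.4 bits per token
```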
15 What value of λ gave you this minimum cross-entropy? Call this λ*. (See reading section E
for why you are using cross-entropy to select λ*.)
(e) Each of the dev files has a length. For language ID, the length in characters is given by the
directory name and is also embedded in the filename (as the first number). For spam detection,
the length in words is embedded in the filename (as the first number).
16 Come up with some way to quantify or graph the relation between file length and the classification accuracy of add-λ* on development data. (Feel free to use Piazza to discuss how to do
this.) Write up your results. (This may be easier to do on your local machine, with a library
like the matplotlib Python package. You could also use this simple online tool.)
(f) Now try increasing the amount of training data. (Keep using add-λ*, for simplicity.) 17 Compute
the overall error rate on dev data for training sets of different sizes. Graph the training size
versus classification accuracy.
• For the language ID task, use training corpora of 6 different sizes: en.1K vs. sp.1K (1000
characters each); en.2K vs. sp.2K (2000 characters each); and similarly for 5K, 10K, 20K,
and 50K.
• Or, for the spam detection task, use training corpora of 4 different sizes: gen vs. spam;
gen-times2 vs. spam-times2 (twice as much training data); and similarly for . . .-times4
and . . .-times8.
4. Reading section F gives an overview of several smoothing techniques beyond add-λ.
(a) At the end of question 2, V was carefully defined to include oov. So if you saw 19,999 different
18 word types in training data, then V = 20,000. What would go wrong with the UNIFORM
estimate if you mistakenly took V = 19,999? What would go wrong with the ADDL estimate?
19 (b) What would go wrong with the ADDL estimate if we set λ = 0? (Remark: This naive historical estimate is commonly called the maximum-likelihood estimate, because it maximizes the
probability of the training corpus.)
6 This is not quite the right thing to do, actually. Running fileprob on gen uses only a vocabulary derived from gen, whereas
textcat is going to use a larger vocabulary derived from gen ∪ spam. If this were a research paper, I'd insist on using textcat's
vocabulary to tune λ*. But for this assignment, take the shortcut and just run fileprob.
(c) Let's see on paper how backoff behaves with novel trigrams. 20 If c(xyz) = c(xyz′) = 0, then
does it follow that p̂(z | xy) = p̂(z′ | xy) when those probabilities are estimated by BACKOFF ADDL smoothing? In your answer, work out and state the value of p̂(z | xy) in this case.
How do these answers change if c(xyz) = c(xyz′) = 1?
21 (d) In the BACKOFF ADDL scheme, how does increasing λ affect the probability estimates? (Think
about your answer to the previous question.)
5. It's time to implement the remaining smoothing methods, in addition to what's already in the code.
Add support for add-λ smoothing with backoff, BACKOFF ADDL. (This should allow both fileprob
and textcat to use that method.) See the INSTRUCTIONS file for language-specific instructions.
This should be just a few lines of code. You will only need to understand how to look up counts in
the hash tables. Just study how the existing methods do it.
Hint: So p̂(z | xy) should back off to p̂(z | y), which should back off to p̂(z), which backs off to
. . . what? Figure it out!
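The recursion in that hint could be sketched as follows. This is a sketch under the assumption that BACKOFF ADDL (reading section F) takes the common form in which each level adds λV pseudo-counts distributed according to the next-lower level's estimate, bottoming out in the uniform distribution 1/V; check the reading for the exact formula before implementing.

```python
def prob_backoff_addl(x, y, z, counts, V, lam):
    """Assumed scheme: p(z|xy) = (c(xyz) + lam*V*p(z|y)) / (c(xy) + lam*V),
    backing off through p(z|y) and p(z) to the uniform distribution 1/V.
    `counts` maps n-gram tuples to counts; () maps to the token total."""
    p_uniform = 1.0 / V
    p_z = ((counts.get((z,), 0) + lam * V * p_uniform)
           / (counts.get((), 0) + lam * V))
    p_zy = ((counts.get((y, z), 0) + lam * V * p_z)
            / (counts.get((y,), 0) + lam * V))
    return ((counts.get((x, y, z), 0) + lam * V * p_zy)
            / (counts.get((x, y), 0) + lam * V))
```

Note that with all counts zero, every level collapses to the uniform estimate 1/V, which is relevant to question 4c above.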
You will submit trained add-λ* models as in question 3d. For simplicity, just use the same λ* as in
that question, even though some other λ might work better with backoff.
6. (a) Add support for the LOGLIN model. See the INSTRUCTIONS file for language-specific instructions. Just as add-0.01 smoothing was selected by passing the model name add0.01 to
fileprob and textcat, a log-linear model with C = 1 should be selected by passing the
model name loglin1. (C is the regularization coefficient used during training: see reading
section H.1.)
Your code will need to compute p̂(z | xy) using the features in reading section F.4.1. It should
refer to the current parameters ~θ (the entries of the X and Y matrices).
You can use embeddings of your choice from the lexicons directory. (See the README file in
that directory. Make sure to use word embeddings for gen/spam, but character embeddings
for english/spanish.) These word embeddings were derived from Wikipedia, a large diverse
corpus with lots of useful evidence about the usage of many English words.
(b) Implement stochastic gradient ascent (Algorithm 1 in reading section H.2) to find the X and Y
matrices that maximize F(~θ). See the INSTRUCTIONS file in the code directory for details. For
the autograder's sake, when loglin is specified on the command line, please train for E = 10
epochs, use the exact hyperparameters suggested in reading section I.1, and print output in the
following format (this is printing F(~θ) rather than Fi(~θ)):
Training from corpus en.1K
Vocabulary size is 30 types including OOV and EOS
epoch 1: F=-1.83284
epoch 2: F=-1.64174
... [you should print these epochs too]
epoch 10: F=-1.58733
Finished training on 992 tokens
(c) Training a log-linear model takes significantly more time than ADDL smoothing. Therefore,
we recommend that you experiment with the language ID task for this part.
Try your program out first by training a log-linear language model on en.1K, with character
embeddings d = 10 (chars-10.txt) and regularization strength C = 1. As a check, your
numbers should match those shown in 6b above.
You should now be able to measure cross-entropies and text categorization error rates under your fancy new language model! textcat should work as before. It will construct two
LOGLIN models as above, and then compare the probabilities of a new document (dev or test)
under these models.
22 Report cross-entropy and text categorization accuracy with C = 1, but also experiment with
other values of C > 0, including a small value such as C = 0.05. Let C* be the best value you
find. Using C = C*, play with different embedding dimensions and report the results. How
and when did you use the training, development, and test data? What did you find? How do
your results compare to add-λ backoff smoothing?
23 (d) Now you get to have some fun! Add some new features to LOGLIN and report the effect on
its performance. Some possible features are suggested in reading section J. You should make at
least one non-trivial improvement; you can do more for extra credit, including varying hyperparameters and training protocols (reading sections I.1 and I.5).
Your improved method should be selected by using the command-line argument improved (in
place of add1, loglin1, etc.). You will submit your system to Gradescope for autograding,
including the trained model.
You are free to submit many versions of your system—with different implementations of improved.
All will show up on the leaderboard, with comments, so that you and your classmates can see
what works well. For final grading, the autograder will take the submitted version of your
system that worked best on dev-test data, and then evaluate its performance on final-test data.
(e) Extra credit: In your submission, you can also include a trained model for the spam detection
task. This doesn't require very much extra work in principle, but training will be much slower,
because of the larger vocabulary (which slows down the normalizing sum over the vocabulary)
and the larger training corpus (which slows down the sum over training tokens). You may need
to adjust C, using development data as usual. See lexicons/README
for lexicons of embeddings that you could try. To speed up training, you could try a smaller
training set, a smaller vocabulary, or a lower-dimensional embedding. Report what you did.
7. Extra credit: We have been assuming a finite vocabulary by replacing all unknown words with a
special oov symbol. But an alternative is an open-vocabulary language model (reading section D.5).
24 Devise a sensible way to estimate the word trigram probability p(z | xy), backing off to a letter
n-gram model of z if z is an unknown word. Describe how you would train the letter n-gram model.
Just give the formulas for your estimate—you don't have to implement and test your idea.
• x and/or y and/or z may be unknown; be sure you make sensible estimates of p(z | xy) in all
these cases
• be sure that Σ_z p(z | xy) = 1
601.465/665 — Natural Language Processing
Reading for Assignment 3: Smoothed Language Modeling
Prof. Kevin Duh and Jason Eisner — Fall 2019
We don't have a required textbook for this course. Instead, handouts like this one are the main readings.
This handout accompanies assignment 3, which refers to it.
A Are trigram models useful?
Why build n-gram models when we know they are a poor linguistic theory? Answer: A linguistic system
without statistics is often fragile, and may break when run on real data. It will also be unable to resolve
ambiguities. So our first priority is to get some numbers into the system somehow. An n-gram model is a
starting point, and may get reasonable results even though it doesn't have any real linguistics yet.
Speech recognition. Speech recognition systems have made heavy use of trigram models for decades.
Alternative approaches that don't look at the trigrams do worse. One can do better by building fancy
language models that combine trigrams with syntax, topic, and so on. But only a little better—dramatic
improvements over trigram models are hard to get. In the language modeling community, a rule of thumb
was that you had enough for a Ph.D. dissertation if you had managed to reduce a standard trigram model's
perplexity per word by 10% (equivalent to a cross-entropy reduction of just 0.152 bits per word).
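The 0.152 figure follows directly from the definition perplexity = 2^(cross-entropy); a quick check:

```python
import math

# A 10% perplexity reduction means the new perplexity is 0.9 times the
# old one. Since perplexity = 2 ** cross_entropy, the cross-entropy
# therefore drops by -log2(0.9) bits per word.
reduction_bits = -math.log2(0.9)
print(round(reduction_bits, 3))  # 0.152
```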
Machine translation. In the same way, machine translation (MT) systems have often included 5-gram
models trained on quite massive amounts of data. An MT system has to generate a new fluent sentence of
English, and 5-grams do a better job than 3-grams of memorizing common phrases and local grammatical structure.
Why doesn't a speech recognition system need 5-grams? Because it is not generating a new sentence.
It only has to determine what words have already been said by an English speaker. A 3-gram model helps
to choose between “flower,” “flour,” and “floor” by using one word of context on either side. That already
provides most of the value that we can get out of local context. Going to a 5-gram model wouldn't help
too much with this choice, because it still wouldn't look at enough of the sentence to determine whether
we're talking about gardening, baking, or cleaning.
Neural language models. In the 2010's, language models based on recurrent neural networks finally
started to show dramatic gains over trigram models. These neural language models are much slower to
train, but they are not limited to trigrams, and can learn to notice complex ways in which the context
affects the next word. We won't quite be covering them in this course, but the starting point is the word
embeddings used in this assignment and the previous one.
Neural language models have become increasingly popular since the mid-2010's. However, cleverly
smoothed 7-gram models can still do about as well by looking at lots of features of the previous 6-gram,
according to Pelemans et al. (2016).
B Boundary symbols
Remember from the previous assignment that a language model estimates the probability of any word
sequence ~w. In a trigram model, we use the chain rule and backoff to assume that
p(~w) = ∏_{i=1}^{N} p(w_i | w_{i−2}, w_{i−1})
with start and end boundary symbols handled as in the previous assignment.
In other words, w_N = eos (“end of sequence”), while for i < 1, w_i = bos (“beginning of sequence”).
Thus, ~w consists of N − 1 words plus an eos symbol. Notice that we do not generate bos but condition on
it (it was always there). Conversely, we do generate eos but never condition on it (nothing follows it). The
boundary symbols bos, eos are special symbols that do not appear among w_1 . . . w_{N−1}.
In assignment 3, we will consider every file to implicitly start with bos and end with eos. A file might
be a sentence, or an email message, or a fragment of text.
Some files also contain sentence boundary symbols <s> and </s>. You should consider these to be
just ordinary words.
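A minimal sketch of this padding convention for trigram extraction (the helper name is mine; note that eos is generated as the last event, while the two bos symbols appear only as context, matching the discussion above):

```python
def trigrams(tokens):
    """Pad with two bos symbols and one eos, then yield the
    (w_{i-2}, w_{i-1}, w_i) triples used by p(w_i | w_{i-2}, w_{i-1})."""
    padded = ["bos", "bos"] + tokens + ["eos"]
    for i in range(2, len(padded)):
        yield padded[i - 2], padded[i - 1], padded[i]
```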
C Datasets for Assignment 3
Assignment 3 will have corpora for two tasks: language identification and spam detection. Each corpus
has a README file that you should look at.
C.1 The train/dev/test split
Each corpus has already been divided for you into training, development, and test sets, which are in separate directories.
You will collect counts on the training set, tune the “hyperparameters” like λ to maximize performance
on the development set, and then evaluate your performance on the test set.
In this case, we actually have two test sets.
• The “dev test” set is for use as you develop your system. If you experiment on it repeatedly, there is
a danger that you will “overfit” to it—that is, you might find your way to a method that seems really
good, but is actually only good for that particular dataset, not in general.
• Thus, for fairness, we have to find out whether your system can do well in a blind test, when we run
it on data you've never seen. Your grade will therefore be determined based on a new “final test” set.
C.2 Domain mismatch
In this assignment, the training set is unfortunately a bit different from the others. To simplify the
command-line syntax for you, I've assumed that the training set consists of a single big file. That means
that there is only one bos and one eos in the whole training set.
By contrast, bos and eos are much more common in the dev set and the test sets. This is a case of
domain mismatch, where the training data is somewhat unlike the test data.
Domain mismatch is a common source of errors in applied NLP. In this case, the only practical implication is that your language models won't do a very good job of modeling what tends to happen at the
very start or very end of a file—because it has seen only one file!
C.3 Class ratios
In the assignment, you'll have to specify a prior probability that a file will be genuine email (rather than
spam) or English (rather than Spanish). In other words, how often do you expect the real world to produce
genuine email or English in your test data? We will ask you to guess 0.7.
Of course, your system won't know the true fraction on test data, because it doesn't know the true
classes—it is trying to predict them.
We can try to estimate the fraction from training data: a certain fraction of the documents happen to
be genuine email, and a certain fraction are English.1 In this case, the prior probability is a parameter of
the model, to be estimated from training data (with smoothing!) like any other parameter.
But if you think that test data might have a different rate of spam or Spanish than training data, then
the prior probability is not necessarily something that you should represent within the model and estimate
from training data. Instead it can be used to represent your personal guess about what you think test data
will be like.
Indeed, in the assignment, you'll use training data only to get the smoothed language models, which
define the likelihood of the different classes. This leaves you free to specify your prior probability of the
classes on the command line. This setup would let you apply the system to different test datasets about
which you have different prior beliefs—the spam-infested email account that you abandoned, versus your
new private email account that only your family knows about.
Does it seem strange to you that a guess or assumption might have a role in statistics? That is actually
central to the Bayesian view of statistics—which says that you can't get something for nothing. Just as you
can't get theorems without assuming axioms, you can't get posterior probabilities without assuming prior
probabilities.
D The vocabulary
D.1 Choosing a finite vocabulary
All the smoothing methods assume a finite vocabulary, so that they can easily allocate probability to all
the words. But is this assumption justified? Aren't there infinitely many potential words of English that
might show up in a test corpus (like xyzzy and JacrobinsteinIndustries and fruitylicious)?
Yes there are . . . so we will force the vocabulary to be finite by a standard trick. Choose some fixed,
finite vocabulary at the start. Then add one special symbol oov that represents all other words. You
should regard these other words as nothing more than variant spellings of the oov symbol.
Note that OOV stands for “out of vocabulary,” not for “out of corpus,” so OOV words may have token
count > 0 and in-vocabulary words may have count 0.
D.2 Consequences for evaluating a model
For example, when you are considering the test sentence
i saw snuffleupagus on the tv
what you will actually compute is the probability of
¹These numbers are actually derived from dev data rather than training data, because of the domain mismatch issue above—the training data are not divided into documents.
i saw oov on the tv
which is really the total probability of all sentences of the form
i saw [some out-of-vocabulary word] on the tv
Admittedly, this total probability is higher than the probability of the particular sentence involving snuffleupagus. But in most of this assignment, we only wish to compare the probability of the snuffleupagus sentence under different models. Replacing snuffleupagus with oov raises the sentence’s probability under all the models at once, so it need not invalidate the comparison.²
D.3 Comparing apples to apples
We do have to make sure that if snuffleupagus is regarded as oov by one model, then it is regarded as
oov by all the other models, too. It’s not appropriate to compare p_model1(i saw oov on the tv) with p_model2(i saw snuffleupagus on the tv), since the former is actually the total probability of many sentences, and so will tend to be larger.
So all the models must have the same finite vocabulary, chosen up front. In principle, this shared vocabulary could be any list of words that you pick by any means, perhaps using some external dictionary.
Even if the context “oov on” never appeared in the training corpus, the smoothing method is required to give a reasonable value anyway to p(the | oov, on), for example by backing off to p(the | on). Similarly, the smoothing method must give a reasonable (non-zero) probability to p(oov | i, saw).
Because we’re merging all out-of-vocabulary words into a single word oov, we avoid having to decide
how to split this probability among them.
D.4 How to choose the vocabulary
How should you choose the vocabulary? For this assignment, simply take it to be the set of word types that appeared ≥ 3 times anywhere in training data. Then augment this set with a special oov symbol. Let V be the size of the resulting set (including oov). Whenever you read a training or test word, you should immediately convert it to oov if it’s not in the vocabulary. This is fast to check if you store the vocabulary in a hash set.
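The recipe above can be sketched in a few lines of Python. This is only an illustration under the stated policy (count threshold 3, a reserved oov symbol); the function names are ours, not part of the assignment’s starter code:

```python
from collections import Counter

def build_vocab(training_tokens, threshold=3):
    """Vocabulary = word types appearing >= threshold times in training data, plus 'oov'."""
    counts = Counter(training_tokens)
    vocab = {w for w, c in counts.items() if c >= threshold}
    vocab.add("oov")          # the special symbol standing for all other words
    return vocab              # V = len(vocab), including oov

def replace_oov(token, vocab):
    """Convert any out-of-vocabulary token to 'oov' as soon as it is read."""
    return token if token in vocab else "oov"

tokens = ["the", "the", "the", "cat", "cat", "cat", "sat", "sat"]
vocab = build_vocab(tokens)            # {'the', 'cat', 'oov'}: 'sat' appears only twice
converted = replace_oov("sat", vocab)  # 'sat' maps to 'oov'
```

Because `vocab` is a Python set (a hash set), the membership test in `replace_oov` takes constant time.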
To help you understand/debug your programs, we have grafted brackets onto all out-of-vocabulary words in one of the datasets (the speech directory, where the training data is assumed to be train/switchboard). This lets you identify such words at a glance. In this dataset, for example, we convert uncertain to [uncertain]—this doesn’t change its count, but does indicate that this is one of the words that your code ought to convert to oov.
D.5 Open-vocabulary language modeling
In this assignment, we assume a fixed finite vocabulary. However, an open-vocabulary language model does not limit itself in advance to a finite vocabulary. Question 7 (extra credit) explores this possibility.
An open-vocabulary model must be able to assign positive probability to any word—that is, to any string of letters that might ever arise. If the alphabet is finite, you could do this with a letter n-gram model. Question 7 explores a more elegant approach that may also work better for text categorization.
Such a model is sensitive to the spelling and length of the unknown word. Longer words will generally receive lower probabilities, which is why it is possible for the probabilities of all unknown words to sum to 1, even though there are infinitely many of them. (Just as 1/2 + 1/4 + 1/8 + · · · = 1.)
E Performance metrics
In this assignment, you will measure your performance in two ways. To measure the predictive power of the model, you will use cross-entropy (per token). To measure how well the model does at a task, you will use error rate (per document). In both cases, smaller is better.
Error rate may be what you really care about! However, it doesn’t give a lot of information on a small dev set. If your dev set has only 100 documents, then the error rate can only be one of the numbers {0/100, 1/100, . . . , 100/100}. It can tell you if your changes helped by correcting a wrong answer. But it can’t tell you that your changes were “moving in the right direction” by merely increasing the probability of right answers.
In particular, for some of the tasks we are considering here, the error rate is just not very sensitive to the smoothing parameter λ: there are many λ values that will give the same integer number of errors on dev data. That is why you will use cross-entropy to select your smoothing parameter λ on dev data: it will give you clearer guidance.
E.1 Other possible metrics
As an alternative, could you devise a continuously varying version of the error rate? Yes, because our system³ doesn’t merely compute a single output class for each document. It constructs a probability distribution over those classes, using Bayes’ Theorem. So we can evaluate whether that distribution puts high probability on the correct answer.
• One option is the expected error rate. Suppose document #1 is gen. If the system thinks p(gen |
document1) = 0.49, then sadly the system will output spam, which ordinary error rate would
count as 1 error. But suppose you pretend—just for evaluation purposes—that the system chooses its
output randomly from its posterior distribution (“stochastic decoding” rather than “MAP decoding”).
In that case, it only has probability 0.51 of choosing spam, so the expected number of errors on this
document is only 0.51. Partial credit!
Notice that expected error rate gives us a lot of credit for increasing p(gen | document1) from 0.01 to 0.49, and little additional credit for increasing it to 0.51. By contrast, the actual error rate only gives us credit for the increase from 0.49 to 0.51, since that’s where the actual system output would change.
• Another continuous error metric is the log-loss, which is the system’s expected surprisal about the correct answer. The system’s surprisal on document 1 is − log2 p(gen | document1) = − log2 0.49 = 1.03 bits.
Both expected error rate and log-loss are averages over the documents that are used to evaluate. So document 1 contributes 0.51 errors to the former average, and contributes 1.03 bits to the latter average.
³Unlike rule-based classifiers and some other ML classifiers.
In general, a single document contributes a number in [0, 1] to the expected error rate, but a number in [0, ∞] to the log-loss. In particular, a system that thinks that p(gen | document1) = 0 is infinitely surprised by the correct answer (− log2 0 = ∞). So optimizing for log-loss would dissuade you infinitely strongly from using this system . . . basically on the grounds that a system that is completely confident in even one wrong answer can’t possibly have the correct probability distribution. To put it more precisely, if the dev set has size 100, then changing the system’s behavior on a single document can change the error rate or the expected error rate by at most 1/100—after all, it’s just one document!—whereas it can change the log-loss by an unbounded amount.
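The three document-level metrics discussed above are easy to compute side by side. Here is a sketch for the two-class case; the function name and the input convention (a list holding the system’s posterior probability of each document’s true class) are ours:

```python
import math

def metrics(posteriors_on_correct):
    """posteriors_on_correct[i] = system's posterior probability of document i's true class."""
    n = len(posteriors_on_correct)
    error_rate = sum(p < 0.5 for p in posteriors_on_correct) / n        # MAP decoding, 2 classes
    expected_error = sum(1 - p for p in posteriors_on_correct) / n      # stochastic decoding
    log_loss = sum(-math.log2(p) for p in posteriors_on_correct) / n    # bits per document
    return error_rate, expected_error, log_loss

# Document 1 from the text: p(gen | document1) = 0.49, so MAP decoding errs (1 error),
# the expected error is 0.51, and the surprisal is -log2(0.49) = 1.03 bits.
err, exp_err, ll = metrics([0.49])
```

Note how a tiny improvement from 0.49 to 0.51 flips `error_rate` from 1 to 0 but barely moves the two continuous metrics.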
What is the relation between the log-loss and cross-entropy metrics? They are both average surprisals.⁴ However, they are very different:

metric        | what it evaluates               | probability used    | units              | long docs count more?
log-loss      | the whole classification system | p(gen | document1)  | bits per document  | no
cross-entropy | the gen model within the system | p(document1 | gen)  | bits per gen token | yes
E.2 Generative vs. discriminative
The error rate, expected error rate, and log-loss are all said to be discriminative metrics because they measure how well the system discriminates between correct and incorrect classes. By contrast, the cross-entropy is a generative metric because it measures how good the system would be at randomly generating the observed data.
Methods for setting hyperparameters or parameters are said to be generative or discriminative according to whether they optimize a generative or discriminative metric.
A generative model includes a probability distribution p(input) that accounts for the input data. Thus, this assignment uses generative models (language models). A discriminative model only tries to predict output from input, possibly using p(output | input). Thus, a conditional log-linear model for text classification would be discriminative. A discriminative model or training procedure focuses less on explaining the input data and more on solving a particular task—less science, more engineering.
F Smoothing techniques
Here are the smoothing techniques we’ll consider, writing p̂ for our smoothed estimate of p.
F.1 Uniform distribution (UNIFORM)
p̂(z | xy) is the same for every xyz; namely,

p̂(z | xy) = 1/V   (1)

where V is the size of the vocabulary including oov.
⁴Technically, you could regard the log-loss as a conditional cross-entropy . . . to be precise, it’s the conditional cross-entropy between empirical and system distributions over the output class. By contrast, the metric you’ll use on this assignment is the cross-entropy between empirical and system distributions over the input text. The output and the input are different random variables, so log-loss is quite different from the cross-entropy we’ve been using to evaluate a language model!
F.2 Add-λ (ADDL)
Add a constant λ ≥ 0 to every trigram count c(xyz):

p̂(z | xy) = (c(xyz) + λ) / (c(xy) + λV)   (2)

where V is defined as above. (Observe that λ = 1 gives the add-one estimate. And λ = 0 gives the naive historical estimate c(xyz)/c(xy).)
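Equation (2) can be sketched directly in Python. The count dictionaries and function name here are ours, for illustration only:

```python
def p_addl(z, x, y, counts3, counts2, lam, V):
    """Add-lambda estimate of p(z | xy), as in equation (2)."""
    # counts3 maps trigram tuples to counts; counts2 maps bigram tuples to counts.
    return (counts3.get((x, y, z), 0) + lam) / (counts2.get((x, y), 0) + lam * V)

# Toy counts: c(i saw the) = 2, c(i saw a) = 1, so c(i saw) = 3.
counts3 = {("i", "saw", "the"): 2, ("i", "saw", "a"): 1}
counts2 = {("i", "saw"): 3}
V = 5  # toy vocabulary size, including oov
p = p_addl("the", "i", "saw", counts3, counts2, lam=1.0, V=V)  # (2 + 1) / (3 + 1*5) = 0.375
```

Setting `lam=0` recovers the unsmoothed estimate 2/3, and `lam=1` gives add-one smoothing.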
F.3 Add-λ backoff (BACKOFF ADDL)
Suppose both z and z′ have rarely been seen in context xy. These small trigram counts are unreliable, so we’d like to rely largely on backed-off bigram estimates to distinguish z from z′:

p̂(z | xy) = (c(xyz) + λV · p̂(z | y)) / (c(xy) + λV)   (3)

where p̂(z | y) is a backed-off bigram estimate, which is estimated recursively by a similar formula. (If p̂(z | y) were the UNIFORM estimate 1/V instead, this scheme would be identical to ADDL.)
So the formula for p̂(z | xy) backs off to p̂(z | y), whose formula backs off to p̂(z), whose formula backs off to . . . what? Figure it out!
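One possible answer, sketched as a recursion (our own naming; the base case shown here—backing off from the unigram estimate to the uniform 1/V—is one natural way to bottom out, not the assignment’s official specification):

```python
def p_backoff(z, context, counts, lam, V):
    """
    Backed-off add-lambda estimate, as in equation (3), recursing on shorter contexts:
    p(z | xy) backs off to p(z | y), then to p(z), then to the uniform 1/V.
    counts[k] maps k-gram tuples to counts; counts[0][()] is the corpus size N.
    """
    if len(context) == 0:
        backoff = 1.0 / V                                 # base case: uniform distribution
    else:
        backoff = p_backoff(z, context[1:], counts, lam, V)  # drop the earliest context word
    n = len(context) + 1
    num = counts[n].get(context + (z,), 0) + lam * V * backoff
    den = counts[n - 1].get(context, 0) + lam * V
    return num / den

# Toy corpus of N = 4 tokens: "the cat" appears once, "the" twice.
counts = {
    0: {(): 4},
    1: {("the",): 2, ("cat",): 1, ("sat",): 1},
    2: {("the", "cat"): 1},
    3: {},
}
p = p_backoff("cat", ("the",), counts, lam=1.0, V=3)
# p(cat) = (1 + 1)/(4 + 3) = 2/7, so p(cat | the) = (1 + 3*(2/7))/(2 + 3) = 13/35
```

Because the numerators over all z sum to the denominator, each level of the recursion yields a proper probability distribution over the vocabulary.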
F.4 Conditional log-linear modeling (LOGLIN)
In the previous assignment, you learned how to construct log-linear models. Let’s restate that construction in our current notation.⁵
Given a trigram xyz, our model p̂ is defined by

p̂(z | xy)  def=  u(xyz) / Z(xy)   (4)
u(xyz)  def=  exp( Σ_k θ_k · f_k(xyz) )  =  exp( θ⃗ · f⃗(xyz) )   (5)
Z(xy)  def=  Σ_{z′} u(xyz′)   (6)
Here f⃗(xyz) is the feature vector extracted from xyz, and θ⃗ is the model’s weight vector. The sum Σ_{z′} in (6) ranges over the V words in the vocabulary (including oov) in order to ensure that you end up with a probability distribution over this chosen vocabulary. That is the goal of all these language models; you saw a similar construction in the previous assignment.
⁵Unfortunately, the tutorial also used the variable names x and y, but to mean something different than they mean in this assignment. The previous notation is pretty standard in machine learning.
F.4.1 Bigrams and skip-bigram features from word embeddings
What features should we use in the log-linear model?
A natural idea is to use one binary feature for each specific unigram z, bigram yz, and trigram xyz (see reading section J.2 below).
Instead, however, let’s start with the following model based on word embeddings:

u(xyz)  def=  exp( x⃗⊤X z⃗ + y⃗⊤Y z⃗ )   (7)

where the vectors x⃗, y⃗, z⃗ are specific d-dimensional embeddings of the word types x, y, z, while X, Y are d × d matrices. The ⊤ superscript is the matrix transposition operator, used here to transpose a column vector to get a row vector.
This model may be a little hard to understand at first, so here’s some guidance.
What’s the role of the word embeddings? Note that the language model is still defined as a conditional probability distribution over the vocabulary. The lexicon, which you will specify on the command line, is merely an external resource that lets the model look up some attributes of the vocabulary words. Just like the dictionary on your shelf, it may also list information about some words you don’t need, and it may lack information about some words you do need. In short, the existence of a lexicon doesn’t affect the interpretation of Σ_{z′} in (6): that formula remains the same regardless of whether the model’s features happen to consult a lexicon!
For oov, or for any other type in your vocabulary that has no embedding listed in the lexicon, your features should back off to the embedding of ool—a special “out of lexicon” symbol that stands for “all other words.” ool is listed in the lexicon, just as oov is included in the vocabulary.
Note that even if a specific out-of-vocabulary word is listed in the lexicon, you must not use that listing.⁶ For an out-of-vocabulary word, you are supposed to be computing probabilities like p(oov | xy), which is the probability of the whole oov class—it doesn’t even mention the specific word that was replaced by oov. (See reading section D.2.)
Is this really a log-linear model? Now, what’s up with (7)? It’s a valid formula: you can always get a probability distribution by defining p̂(z | xy) = (1/Z(xy)) · exp(any function of x, y, z that you like)! But is (7) really a log-linear function? Yes it is! Let’s write out those d-dimensional vector-matrix-vector multiplications more explicitly:

u(xyz) = exp( Σ_{j,m} x_j X_jm z_m  +  Σ_{j,m} y_j Y_jm z_m )   (8)
       = exp( Σ_{j,m} X_jm · (x_j z_m)  +  Σ_{j,m} Y_jm · (y_j z_m) )   (9)

⁶This issue would not arise if we simply defined the vocabulary to be the set of words that appear in the lexicon. This simple strategy is certainly sensible, but it would slow down normalization because our lexicon is quite large.
This does have the log-linear form of (5). Suppose d = 2. Then implicitly, we are using a weight vector θ⃗ of length d² + d² = 8, defined by

⟨ θ1, θ2, θ3, θ4, θ5, θ6, θ7, θ8 ⟩  =  ⟨ X11, X12, X21, X22, Y11, Y12, Y21, Y22 ⟩   (10)

for a vector f⃗(xyz) of 8 features

⟨ f1(xyz), f2(xyz), . . . , f8(xyz) ⟩  =  ⟨ x1z1, x1z2, x2z1, x2z2, y1z1, y1z2, y2z1, y2z2 ⟩   (11)

Remember that the optimizer’s job is to automatically manipulate some control sliders. This particular model with d = 2 has a control panel with 8 sliders, arranged in two d × d grids (X and Y). The point is that we can also refer to those same 8 sliders as θ1, . . . , θ8 if we like. Which features are these sliders (weights) connected to? The ones in (11): if we adopt those feature definitions, then our general log-linear formula (5) will yield up our specific model (9) (= (7)) as a special case.
As always, θ⃗ is incredibly important: it determines all the probabilities in the model.
Is this a sensible model? The feature definitions in (11) are pairwise products of embedding dimensions. Why on earth would such features be useful? First imagine that the embedding dimensions were bits (0 or 1). Then x2z1 = 1 iff (x2 = 1 and z1 = 1), so you could think of multiplication as a kind of feature conjunction. Multiplication has a similar conjunctive effect even when the embedding dimensions are in R. For example, suppose z1 > 0 indicates the degree to which z is a human-like noun, while x2 > 0 indicates the degree to which x is a verb whose direct objects are usually human.⁷ Then the product x2z1 will be larger for trigrams like kiss the baby and marry the policeman. So by learning a positive weight X21 (nicknamed θ3 above), the optimizer can drive p̂(baby | kiss the) higher, at the expense of probabilities like p̂(benzene | kiss the). p̂(bunny | kiss the) might be somewhere in the middle since bunnies are a bit human-like and thus bunny1 might be numerically somewhere between baby1 and benzene1.
Example. As an example, let’s calculate the letter trigram probability p̂(s | er). Suppose the relevant letter embeddings and the feature weights are given by

e⃗ = (−.5, 1),   r⃗ = (0, .5),   s⃗ = (.5, .5),   X = [1 0; 0 .5],   Y = [2 0; 0 1]

(the matrices are written with rows separated by “;”).
⁷You might wonder: What if the embedding dimensions don’t have such nice interpretations? What if z1 doesn’t represent a single property like humanness, but rather a linear combination of such properties? That actually doesn’t make much difference. Suppose z⃗ can be regarded as M z̃ where z̃ is a more interpretable vector of properties. (Equivalently: each z_j is a linear combination of the properties in z̃.) Then x⃗⊤X z⃗ can be expressed as (M x̃)⊤ X (M z̃) = x̃⊤ (M⊤XM) z̃. So now it’s X̃ = M⊤XM that can be regarded as the matrix of weights on the interpretable products. If there exists a good X̃ and M is invertible, then there exists a good X as well, namely X = (M⊤)⁻¹ X̃ M⁻¹.
First, we compute the unnormalized probability.

u(ers) = exp( [−.5 1] [1 0; 0 .5] [.5; .5]  +  [0 .5] [2 0; 0 1] [.5; .5] )
       = exp(−.5 × 1 × .5 + 1 × .5 × .5 + 0 × 2 × .5 + .5 × 1 × .5) = exp 0.25 = 1.284

We then normalize u(ers).

p̂(s | er)  def=  u(ers) / ( u(era) + u(erb) + · · · + u(erz) )  =  1.284 / ( · · · )
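The arithmetic above is easy to reproduce with numpy (the variable names are ours). Each `@` below is one of the vector-matrix products written out in the example:

```python
import numpy as np

# Embeddings and weights from the worked example (d = 2).
e = np.array([-0.5, 1.0])
r = np.array([0.0, 0.5])
s = np.array([0.5, 0.5])
X = np.array([[1.0, 0.0], [0.0, 0.5]])
Y = np.array([[2.0, 0.0], [0.0, 1.0]])

# Unnormalized probability u(ers) = exp(e^T X s + r^T Y s), as in equation (7).
u_ers = np.exp(e @ X @ s + r @ Y @ s)   # exp(0.25) = 1.284
```

To get p̂(s | er) you would divide `u_ers` by the sum of `u(er·)` over all 26 letters (plus oov), exactly as in the normalization step above.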
Speedup. The example illustrates that the denominator

Z(xy) = Σ_{z′} u(xyz′) = Σ_{z′} exp( x⃗⊤X z⃗′ + y⃗⊤Y z⃗′ )

is expensive to compute because of the summation over all z′ in the vocabulary. Fortunately, you can compute x⃗⊤X z⃗′ ∈ R simultaneously for all z′.⁸ The results can be found as the elements of the row vector x⃗⊤XE, where E is a d × V matrix whose columns are the embeddings of the various words z′ in the vocabulary. This is easy to see, and computing this vector still requires just as many scalar multiplications and additions as before . . . but we have now expressed the computation as a pair of vector-matrix multiplications, (x⃗⊤X)E, which you can perform using a matrix library (such as numpy or jblas). That can be considerably faster than looping over all z′. That is because the library call is highly optimized and exploits hardware support for matrix operations (e.g., parallelism).
F.5 Other smoothing schemes
Numerous other smoothing schemes exist. In past years, for example, our course assignments have used Katz backoff with Good–Turing discounting. (Good–Turing is attractive but a bit tricky in practice, and has to be combined with backoff.)
In practical settings, the most popular n-gram smoothing scheme is something called modified Kneser–Ney. One can also use a more principled Bayesian method based on the hierarchical Pitman–Yor process—the result is very close to modified Kneser–Ney.
Remember: While these techniques are effective, a really good language model would do more than just smooth n-gram probabilities well. To predict a word sequence as accurately as a human can finish another human’s sentence, it would go beyond the whole n-gram family to consider syntax, semantics, and topic. This is an active area of research that uses grammars, recurrent neural networks, and other machinery.
G Safe practices for working with log-probabilities
G.1 Use natural log for internal computations
In this assignment, as in most of mathematics, log means loge (the log to base e, or natural log, sometimes written ln). This is also the standard behavior of the log function in most programming languages.
⁸The same trick works for y⃗⊤Y z⃗′, of course.
With natural log, the calculus comes out nicely, thanks to the fact that (d/dx) log x = 1/x. It’s only with natural log that the gradient of the log-likelihood of a log-linear model can be directly expressed as observed features minus expected features.
On the other hand, information theory conventionally talks about bits, and quantities like entropy and cross-entropy are conventionally measured in bits. Bits are the unit of − log2 probability. A probability of 0.25 is reported “in negative-log space” as − log2 0.25 = 2 bits. Some people do report that value more simply as − loge 0.25 = 1.386 nats. But it is more conventional to use bits as the unit of measurement. (The term “bits” was coined in 1948 by Claude Shannon to refer to “binary digits,” and “nats” was later defined by analogy to refer to the use of natural log instead of log base 2. The unit conversion factor is 1/ loge 2 ≈ 1.443 bits/nat.)
Even if you are planning to print bit values, it’s still wise to standardize on loge-probabilities for all of your formulas, variables, and internal computations. Why? They’re just easier! If you tried to use negative log2-probabilities throughout your computation, then whenever you called the log function or took a derivative, you’d have to remember to convert the result. It’s too easy to make a mistake by omitting this step or by getting it wrong. So the best practice is to do this conversion only when you print: at that point convert your loge-probability to bits by dividing by − loge 2 ≈ −0.693. This is a unit conversion.
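The print-time conversion is a one-liner (the function name is ours):

```python
import math

def nats_to_bits(log_e_prob):
    """Convert a natural-log probability to bits of surprisal (a pure unit conversion)."""
    return log_e_prob / -math.log(2)

# A probability of 0.25 is -log2(0.25) = 2.0 bits of surprisal:
bits = nats_to_bits(math.log(0.25))
```

All internal arithmetic stays in natural log; only the final printed value is divided by −loge 2.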
G.2 Avoid exponentiating big numbers (not necessary for this assignment)
Log-linear models require calling the exp function. Unfortunately, exp(710) is already too large for a 64-bit floating-point number to represent, and will generate a runtime error (“overflow”). Conversely, exp(−746) is too close to 0 to represent, and will simply return 0 (“underflow”).
That shouldn’t be a problem for this assignment if you stick to the language ID task. If you are experiencing an overflow issue there, then your parameters probably became too positive or too negative as you ran stochastic gradient descent. That could mean that your stepsize was too large, or more likely, that you have a bug in your computation of the gradient.
But to avoid these problems elsewhere—including with the spam detection task—the standard trick is to represent all values “in log-space.” In other words, simply store 710 and −746 rather than attempting to exponentiate them.
But how can you do arithmetic in log-space? Suppose you have two numbers p, q, which you are
representing in memory by their logs, lp and lq.
• Multiplication: You can represent pq by its log, log(pq) = log(p) + log(q) = lp + lq. That is, multiplication corresponds to log-space addition.
• Division: You can represent p/q by its log, log(p/q) = log(p) − log(q) = lp − lq. That is, division corresponds to log-space subtraction.
• Addition: You can represent p+q by its log. log(p+q) = log(exp(lp)+exp(lq)) = logsumexp(lp, lq).
(But this is unstable, so you can either write a safe variant or use a version from a library.)
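A safe variant of logsumexp can be written as follows (a sketch; libraries such as scipy also provide one). The trick is to factor out the larger argument before exponentiating, so the exp is always of a non-positive number:

```python
import math

def logsumexp(lp, lq):
    """Stable log(exp(lp) + exp(lq)): factor out the larger argument first."""
    if lp < lq:
        lp, lq = lq, lp
    # Now lq - lp <= 0, so exp(lq - lp) <= 1 and cannot overflow.
    return lp + math.log1p(math.exp(lq - lp))

# log(0.5 + 0.5) computed entirely in log space, without leaving it:
total = logsumexp(math.log(0.5), math.log(0.5))   # log(1.0) = 0.0
```

Note that calling `math.log(math.exp(710) + math.exp(710))` directly would overflow, while `logsumexp(710, 710)` returns 710 + log 2 without trouble.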
For training a log-linear model, you can work almost entirely in log space, representing u and Z in memory by their logs, log u and log Z. In order to compute the expected feature vector in (18) below, you will need to come out of log space and find p(z′ | xy) = u(xyz′)/Z for each word z′. But computing u(xyz′) and Z separately is dangerous: they might be too large or too small. Instead, rewrite p(z′ | xy) as exp(log u(xyz′) − log Z). Since u(xyz′) ≤ Z, this is exp of a negative number, so it will never overflow. It might underflow to 0 for some words z′, but that’s ok: it just means that p(z′ | xy) really is extremely close to 0, and so f⃗(xyz′) should make only a negligible contribution to the expected feature vector.
H Training a log-linear model
H.1 The training objective
To implement the conditional log-linear model, the main work is to train θ⃗ (given some training data and a regularization coefficient C). As usual, you’ll set θ⃗ to maximize

F(θ⃗)  def=  (1/N) ( Σ_{i=1}^{N} log p̂(w_i | w_{i−2} w_{i−1})  −  C · Σ_k θ_k² )   (14)

(the first sum is the log likelihood; the second term is the L2 regularizer), which is the regularized log-likelihood per word token. (There are N word tokens.)
So we want θ⃗ to make our training corpus probable, or equivalently, to make the N events in the corpus (including the final eos) probable on average given their bigram contexts. At the same time, we also want the weights in θ⃗ to be close to 0, other things equal (regularization).⁹
The regularization coefficient C ≥ 0 can be selected based on dev data.
H.2 Stochastic gradient descent
Fortunately, concave functions like F(θ⃗) in (14) are “easy” to maximize. You can implement a simple stochastic gradient descent (SGD) method to do this optimization.
More properly, this should be called stochastic gradient ascent, since we are maximizing rather than minimizing, but that’s just a simple change of sign. The pseudocode is given by Algorithm 1. We rewrite the objective F(θ⃗) given in (14) as an average of local objectives F_i(θ⃗) that each predict a single word, by moving the regularization term into the summation:

F(θ⃗)  =  (1/N) Σ_{i=1}^{N} [ log p̂(w_i | w_{i−2} w_{i−1})  −  (C/N) · Σ_k θ_k² ]   (16)

Call the bracketed summand F_i(θ⃗), so that

F(θ⃗)  =  (1/N) Σ_{i=1}^{N} F_i(θ⃗)   (17)

The gradient of this average, ∇F(θ⃗), is therefore the average value of ∇F_i(θ⃗).
⁹As explained on the previous homework, this can also be interpreted as maximizing p(θ⃗ | w⃗)—that is, choosing the most probable θ⃗ given the training corpus. By Bayes’ Theorem, p(θ⃗ | w⃗) is proportional to

p(w⃗ | θ⃗) · p(θ⃗)   (15)

(the likelihood times the prior). Let’s assume an independent Gaussian prior over each θ_k, with variance σ². Then if we take C = 1/(2σ²), maximizing (14) is just maximizing the log of (15). The reason we maximize the log is to avoid underflow, and because the derivatives of the log happen to have a simple “observed − expected” form (since the log sort of cancels out the exp in the definition of u(xyz)).
Algorithm 1 Stochastic gradient ascent
Input: initial stepsize γ0, initial parameter values θ⃗(0), training corpus D = (w1, w2, · · · , wN), regularization coefficient C, number of epochs E
1: procedure SGATrain
2:   θ⃗ ← θ⃗(0)
3:   t ← 0                                  ▷ number of updates so far
4:   for e : 1 → E :                        ▷ do E passes over the training data, or “epochs”
5:     for i : 1 → N :                      ▷ loop over summands of (16)
6:       γ ← γ0 / (1 + γ0 · (2C/N) · t)     ▷ current stepsize—decreases gradually
7:       θ⃗ ← θ⃗ + γ · ∇F_i(θ⃗)               ▷ move θ⃗ slightly in a direction that increases F_i(θ⃗)
8:       t ← t + 1
9:   return θ⃗
Discussion. On each iteration, the algorithm picks some word i and pushes θ⃗ in the direction ∇F_i(θ⃗), which is the direction that gets the fastest increase in F_i(θ⃗). The updates from different i will partly cancel one another out,¹⁰ but their average direction is ∇F(θ⃗), so their average effect will be to improve the overall objective F(θ⃗). Since we are training a log-linear model, our F(θ⃗) is a concave function with a single global maximum; a theorem guarantees that the algorithm will converge to that maximum if allowed to run forever (E = ∞).
How far the algorithm pushes θ⃗ is controlled by γ, known as the “step size” or “learning rate.” This starts at γ0, but needs to decrease over time in order to guarantee convergence of the algorithm. The rule in line 6 for gradually decreasing γ is the one recommended by Bottou (2012), “Stochastic gradient descent tricks,” which you should read in full if you want to use this method “for real” on your own problems.
Note that t increases and the stepsize decreases on every pass through the inner loop. This is important because N might be extremely large in general. Suppose you are training on the whole web—then the stochastic gradient ascent algorithm should have essentially converged even before you finish the first epoch! See reading section I.5 for some more thoughts about epochs.
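The loop structure of Algorithm 1 can be sketched as follows. This is only an illustration (the function signature and the 2C/N decay constant in the stepsize rule are our reading of the algorithm, following Bottou’s recommendation); your real implementation will plug in the gradient from reading section H.3:

```python
def sga_train(corpus, grad_Fi, theta0, gamma0, C, epochs):
    """
    Stochastic gradient ascent over local objectives F_i, as in Algorithm 1.
    grad_Fi(theta, i) must return the gradient of F_i at theta, as a list.
    """
    theta = list(theta0)
    N = len(corpus)
    t = 0                                                # number of updates so far
    for epoch in range(epochs):
        for i in range(N):
            gamma = gamma0 / (1 + gamma0 * (2 * C / N) * t)  # decaying stepsize
            g = grad_Fi(theta, i)
            theta = [th + gamma * gk for th, gk in zip(theta, g)]  # ascend, not descend
            t += 1
    return theta

# Toy usage: maximize F(theta) = -(theta - 2)^2 / 2, whose gradient is (2 - theta).
theta = sga_train([2.0], lambda th, i: [2.0 - th[0]], [0.0], 0.5, 0.0, 50)
# theta[0] converges toward the maximizer 2.0
```

Note that the stepsize shrinks on every update (not once per epoch), matching the remark above about training on very large N.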
H.3 The gradient vector
The gradient vector ∇F_i(θ⃗) is merely the vector of partial derivatives ( ∂F_i/∂θ_1, . . . , ∂F_i/∂θ_K ), where F_i(θ⃗) was defined in (16). As you’ll recall from the previous assignment, each partial derivative takes a simple and beautiful form:¹¹

∂F_i/∂θ_k  =  f_k(xyz)  −  Σ_{z′} p̂(z′ | xy) · f_k(xyz′)  −  (2C/N) · θ_k   (18)

where the first term is the observed value of feature f_k, the second is the expected value of f_k according to the current p̂, and the third pulls θ_k towards 0. Here x, y, z respectively denote w_{i−2}, w_{i−1}, w_i, and the summation variable z′ in the second term ranges over all V words in the vocabulary, including oov. So the partial derivative combines three values: the observed feature count in the training data, the expected feature count according to the current p̂ (which is based on the entire current θ⃗, not just θ_k), and the current weight θ_k itself.

¹⁰For example, in the training sentence eat your dinner but first eat your words, ∇F_3(θ⃗) is trying to raise the probability of dinner, while ∇F_8(θ⃗) is trying to raise the probability of words (at the expense of dinner!) in the same context.
H.4 The gradient for the embedding-based model
When we use the specific model in (7), the feature weights are the entries of the X and Y matrices, as shown in (9). The partial derivatives with respect to these weights are

∂F_i/∂X_jm  =  x_j z_m  −  Σ_{z′} p̂(z′ | xy) · x_j z′_m  −  (2C/N) · X_jm   (19)
∂F_i/∂Y_jm  =  y_j z_m  −  Σ_{z′} p̂(z′ | xy) · y_j z′_m  −  (2C/N) · Y_jm   (20)

where as before, we use x⃗, y⃗, z⃗, z⃗′ to denote the embeddings of the words x, y, z, z′. Thus, the update to θ⃗ (Algorithm 1, line 7) is

(∀j, m = 1, 2, . . . , d)   X_jm ← X_jm + γ · ∂F_i/∂X_jm
(∀j, m = 1, 2, . . . , d)   Y_jm ← Y_jm + γ · ∂F_i/∂Y_jm
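Equation (19) can be computed for all j, m at once with numpy outer products (a sketch; the function name and toy setup are ours, and the gradient for Y is symmetric):

```python
import numpy as np

def grad_X(x, y, z_idx, X, Y, E, C, N):
    """
    Gradient of F_i with respect to the whole matrix X, as in equation (19):
    observed products x_j z_m, minus expected products under the current p-hat,
    minus the regularization pull toward 0.
    E is d x V; column j holds the embedding of vocabulary word j.
    z_idx is the index of the observed word z.
    """
    scores = (x @ X) @ E + (y @ Y) @ E      # log u(xyz') for every z' in the vocabulary
    p = np.exp(scores - scores.max())       # softmax, stabilized against overflow
    p /= p.sum()                            # p-hat(z' | xy) for every z'
    observed = np.outer(x, E[:, z_idx])     # matrix of products x_j z_m
    expected = np.outer(x, E @ p)           # sum_z' p-hat(z'|xy) x_j z'_m, all j,m at once
    return observed - expected - (2 * C / N) * X

# Toy setup: d = 2, V = 4, observed word is vocabulary word 1.
rng = np.random.default_rng(1)
x, y = rng.normal(size=2), rng.normal(size=2)
X, Y = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
E = rng.normal(size=(2, 4))
g = grad_X(x, y, 1, X, Y, E, C=0.5, N=10)
```

Note that `E @ p` collapses the sum over z′ into a single matrix-vector product, the same trick as in the Speedup discussion of reading section F.4.1.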
I Practical hints for stochastic gradient ascent
I.1 Choose your hyperparameters carefully
In practice, the convergence rate of stochastic gradient ascent is sensitive to the initial guess θ⃗(0) and the learning rate γ0. It’s common to take θ⃗(0) = 0⃗ (e.g., initialize all entries of X and Y to 0), and we recommend trying γ0 = 0.1 for spam detection or γ0 = 0.01 for language ID.
(Note that the assignment asks you to use the hyperparameters recommended above when loglin is selected (question 6b). This will let you and us check that your implementation is correct. However, you may want to experiment with different settings, and you are free to use those other settings when improved is selected, to get better performance.)
¹¹If you prefer to think of computing the whole gradient vector at once, using vector computations, you can equivalently write this as

∇F_i(θ⃗)  =  f⃗(xyz)  −  Σ_{z′} p̂(z′ | xy) · f⃗(xyz′)  −  (2C/N) · θ⃗
I.2 Don’t modify the parameters as you compute the gradient
Make sure that at line 7 of Algorithm 1, you compute the entire gradient before modifying θ⃗. If you were to update each parameter immediately after computing its partial derivative, then the subsequent partial derivatives would be incorrect.
I.3 Compute the gradient efficiently
Optimization should be reasonably fast (seconds per epoch).
Be careful that you compute the second term of (18) in time O(V). If you’re not careful, you might use O(V²) time, because each of the V summands has a factor p̂(z′ | xy) whose denominator Z(xy) itself takes O(V) time to compute. The salvation is that xy remains the same throughout example i. So when you’re computing a gradient vector ∇F_i(θ⃗), you only have to compute Z(xy) once and reuse it for all z′ and all k.¹² Warning: However, as Z(xy) depends on θ⃗, it will become out-of-date once you update θ⃗. You can’t save it to reuse for some future example j where w_{j−2} w_{j−1} also equals xy (not even if j = i so that you’re processing the very same example again).
Constant factors can also play a big role in the speed. For the particular log-linear model used here,
reading section F.4.1 gave a way to speed up the gradient computation. In general, replacing for-loops
with library vector/matrix operations (via numpy) will probably speed you up a lot.
If you really want to get fancy, you could completely eliminate the need to sum over the whole vocabulary by predicting each word one bit at a time, as in the hierarchical log-bilinear language model of Mnih
and Hinton (2009); then you are making log2 V binary predictions, instead of a V-way prediction.
I.4 Compute the gradient correctly: The finite-difference check
It’s easy to make a mistake when computing the gradient, which will mess everything up. One way to test your gradient computation code is to use the finite-difference approximation (see Bottou (2012), section 4.2, or Vieira (2017)). We won’t require this test, but it can help you debug your code; it’s generally wise to test that what you’re computing really is the gradient!
Suppose you’ve just computed F_i(θ⃗) for your current i and θ⃗. Remember that F_i is differentiable at θ⃗ (and everywhere), which means it can be approximated locally by a linear function. So try adding a small number δ to one of the weights θ_k to obtain a new weight vector θ⃗′. By the definition of partial derivatives, you’d predict that

F_i(θ⃗′)  ≈  F_i(θ⃗)  +  δ · ∂F_i/∂θ_k (θ⃗)

You can check this prediction by actually evaluating F_i(θ⃗′). By making δ small enough (say 1e-6), you should be able to make the predictions match the truth arbitrarily well (say to within 1e-10). If you can’t, then either there’s a bug in your gradient computation or (unlikely) you need to be using higher-precision floating-point arithmetic.
(Your program doesn't have to check every single partial derivative if that's too slow: each time you
compute a partial derivative, you can flip a weighted coin to decide whether or not to check it. Once you're
sure that the gradient computation is correct, you can turn the checks off.)
More generally, you can check that Fi(~θ + ~δ) ≈ Fi(~θ) + ~δ · ∇Fi(~θ), where ~δ is any small vector,
perhaps a vector of small random numbers, or a vector that is mostly 0's, or a small multiple of ∇Fi(~θ)
itself. (Again, you don't have to do this check every time you compute a gradient.) A slightly better version
is the symmetric finite-difference approximation,

(1/2) (Fi(~θ + ~δ) − Fi(~θ − ~δ)) ≈ ~δ · ∇Fi(~θ)

which should be even more accurate, although it requires two extra computations of Fi instead of just one.
12Alternatively, note that you can rewrite the second term from the previous footnote as (Σz′ u(xyz′) ~f(xyz′)) / (Σz′ u(xyz′)). Then a single
loop over z′ serves to compute the numerator (a vector) and the denominator (the scalar Z(xy)) in parallel. You can then
increment ~θ by γ/Z(xy) times the numerator vector.
Thinking about why all these approximations are valid may help you remember how partial derivatives work.
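As a concrete illustration, here is the symmetric check applied to a small stand-in function (the names F and grad_F are placeholders for your own Fi and ∇Fi):

```python
import numpy as np

# A toy differentiable function standing in for F_i; the same check applies
# unchanged to your real per-example log-likelihood.
def F(theta):
    return float(theta[0] ** 2 + 3.0 * theta[0] * theta[1])

def grad_F(theta):
    # analytic gradient of F, the thing we want to verify
    return np.array([2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0]])

def finite_difference_ok(F, grad_F, theta, delta=1e-6, tol=1e-8):
    """Symmetric check: (F(theta+d) - F(theta-d))/2 should equal d . grad."""
    d = delta * np.random.randn(len(theta))  # a small random vector ~delta
    predicted = float(d @ grad_F(theta))
    actual = 0.5 * (F(theta + d) - F(theta - d))
    return abs(predicted - actual) < tol

theta = np.array([0.7, -1.2])
assert finite_difference_ok(F, grad_F, theta)
```

If your analytic gradient has a bug, `predicted` and `actual` will disagree by far more than the tolerance, which is exactly the signal you want while debugging.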
I.5 How much training is enough?
In reading H.2, you may have wondered how to choose E, the number of epochs. The following
improvements are not required for the assignment, but they might help. You should read this section in any case.
We are using a fixed number of epochs only to keep things simple. A better approach is to continue
running until the function appears to have converged "well enough." For example, you could stop if the
average gradient over the past epoch (or the past m examples) was very small. Or you could evaluate
the accuracy or cross-entropy on development data at the end of each epoch (or after each group of m
examples), and stop if it has failed to improve (say) 3 times in a row; then you can use the parameter vector
~θ that performed best on development data.
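A minimal sketch of that early-stopping recipe (train_one_epoch and dev_cross_entropy are placeholders for your own routines; patience 3 matches the "3 times in a row" above):

```python
def train(theta, train_one_epoch, dev_cross_entropy, patience=3, max_epochs=100):
    """Run epochs until dev cross-entropy fails to improve `patience` times
    in a row; return the parameters that did best on development data."""
    best_theta, best_xent, bad = list(theta), float("inf"), 0
    for epoch in range(max_epochs):
        theta = train_one_epoch(theta)
        xent = dev_cross_entropy(theta)
        if xent < best_xent:
            best_theta, best_xent, bad = list(theta), xent, 0
        else:
            bad += 1              # no improvement this epoch
            if bad >= patience:
                break
    return best_theta

# Toy demo: theta is [epoch counter]; dev cross-entropy bottoms out at epoch 2.
xents = [99, 5, 3, 4, 4, 4, 9, 9, 9, 9]
best = train([0], lambda th: [th[0] + 1], lambda th: xents[th[0]])
assert best == [2]  # the epoch-2 parameters are returned, not the last ones
```

Note that ties count as failures to improve here; you could also use a small minimum-improvement threshold.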
In theory, stochastic gradient descent shouldn't even use epochs. There should only be one loop, not
two nested loops. At each iteration, you pick a random example from the training corpus, and update
~θ based on that example. That's why it is called "stochastic" (i.e., random). The insight here is that the
regularized log-likelihood per token, namely F(~θ), is actually just the average value of Fi(~θ) over all
of the examples (see (16)). So if you compute the gradient on one example, it is the correct gradient on
average (since the gradient of an average is the average gradient). So line 7 is going in the correct direction
on average if you choose a random example at each step.
In practice, a common approach to randomization is to still use epochs, so that each example is visited
once per epoch, but to shuffle the examples into a random order at the start of training or at the start of each
epoch. To see why shuffling can help, imagine that the first half of your corpus consists of Democratic
talking points and the second half consists of Republican talking points. If you shuffle, your stochastic
gradients will roughly alternate between the two, like alternating between left and right strokes when
you paddle a canoe; thus, your average direction over any short time period will be roughly centrist. By
contrast, since Algorithm 1 doesn't shuffle, it will paddle left for half of each epoch and then right for
the other half, which will make significantly slower progress in the desired centrist direction.
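The shuffled-epoch variant can be sketched in a few lines (sgd_update stands in for the parameter update of Algorithm 1's line 7):

```python
import random

def run_epoch(theta, examples, sgd_update):
    """One epoch still visits every example exactly once,
    but in a fresh random order."""
    order = list(range(len(examples)))
    random.shuffle(order)  # reshuffle at the start of each epoch
    for i in order:
        theta = sgd_update(theta, examples[i])
    return theta

# Toy demo: record the visit order and confirm every example is seen once.
visited = []
run_epoch(0.0, ["D1", "D2", "R1", "R2"], lambda th, ex: visited.append(ex) or th)
assert sorted(visited) == ["D1", "D2", "R1", "R2"]
```

Shuffling indices (rather than the examples themselves) avoids copying a large corpus each epoch.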
J Ideas for log-linear features
Here are some ideas for extending your log-linear model. Most of them are not very hard, although training
may be slow. Or you could come up with your own!
Adding features means throwing some more parameters into the definition of the unnormalized
probability. For example, extending the definition (7) with additional features (in the case d = 2) gives

u(xyz) def= exp( ~x⊤X~z + ~y⊤Y~z + θ9f9(xyz) + θ10f10(xyz) + · · · )
          = exp( θ1f1(xyz) + · · · + θ8f8(xyz) [as defined in (10)–(11)] + θ9f9(xyz) + θ10f10(xyz) + · · · )   (24)
J.1 Unigram log-probability
A possible problem with the model so far is that it doesn't have any parameters that keep track of how
frequent specific words are in the training corpus! Rather, it backs off from the words to their embeddings.
Its probability estimates are based only on the embeddings.
One way to fix that (see section J.2 below) would be to have a binary feature fw for each word w in
the vocabulary, such that fw(xyz) is 1 if z = w and 0 otherwise.
But first, here's a simpler method: just add a single non-binary feature defined by

funigram(xyz) = log p̂unigram(z)   (25)

where p̂unigram(z) is estimated by add-1 smoothing. Surely we have enough training data to learn an
appropriate weight for this single feature. In fact, because every training token wi provides evidence about
this single feature, its weight will tend to converge quickly to a reasonable value during SGD.
This is not the only feature in the model; as usual, you will use SGD to train the weights of all features
to work together, computing the gradient via (18). Let β = θunigram denote the weight that we learn for
the new feature. By including this feature in our definition of u(xyz), we are basically multiplying a
factor of (p̂unigram(z))^β into the numerator u(xyz) (check (5) to see that this is true). This means that in
the special case where β = 1 and X = Y = 0, we simply have u(xyz) = p̂unigram(z), so that the LOGLIN
model gives exactly the same probabilities as the add-1 smoothed unigram model p̂unigram. However, by
training the parameters, we might learn to trust the unigram model less (0 < β < 1) and rely more on the
word embeddings (X, Y ≠ 0) to judge which words z are likely in the context xy.
A simpler way to implement this scheme is to define

funigram(xyz) = log(c(z) + 1)   (where c(z) is the count of z in training data)   (26)

This gives the same model, since p̂unigram(z) is just c(z) + 1 divided by a constant, and our model
renormalizes u(xyz) by a constant anyway.
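A sketch of (26), using Counter to get c(z) from training data (the context arguments x and y are unused, since this is a unigram feature):

```python
from collections import Counter
import math

def make_unigram_feature(train_tokens):
    """Build f_unigram(xyz) = log(c(z) + 1) from training counts."""
    c = Counter(train_tokens)
    def f_unigram(x, y, z):
        return math.log(c[z] + 1)  # Counter returns 0 for unseen words
    return f_unigram

f = make_unigram_feature(["the", "cat", "the"])
assert f(None, None, "the") == math.log(3)  # c("the") = 2
assert f(None, None, "dog") == 0.0          # log(0 + 1)
```

Precomputing the Counter once keeps each feature evaluation O(1).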
J.2 Unigram, bigram, and trigram indicator features
Try adding a unigram feature fw for each word w in the vocabulary. That is, fw(xyz) is 1 if z = w and 0
otherwise. Does this work better than the log-unigram feature from section J.1?
Now try also adding a binary feature for each bigram and trigram that appears at least 3 times in
training data. How good is the resulting model?
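Collecting the qualifying n-grams is a one-pass job with Counter (a sketch; the threshold of 3 matches the suggestion above):

```python
from collections import Counter

def frequent_ngrams(tokens, n, min_count=3):
    """Return the set of n-grams occurring at least min_count times;
    each surviving n-gram gets one binary indicator feature."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {g for g, k in grams.items() if k >= min_count}

toks = ["a", "b", "a", "b", "a", "b", "c"]
bigrams = frequent_ngrams(toks, 2)
assert ("a", "b") in bigrams      # occurs 3 times
assert ("b", "c") not in bigrams  # occurs only once
```

At prediction time, the indicator feature for n-gram g fires on a trigram xyz exactly when the relevant suffix of (x, y, z) equals g.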
In all cases, you will want to tune C on development data to prevent overfitting. This is important: the
original model had only 2d² + 1 parameters, where d is the dimensionality of the embeddings, but your new
model has enough parameters that it can easily overfit the training data. In fact, if C = 0, the new model
will exactly predict the unsmoothed probabilities, as if you were not smoothing at all (add-0)! The reason
is that the maximum of the concave function F(~θ) = ΣN i=1 Fi(~θ) is achieved when its partial derivatives
are 0. So for each unigram feature fw defined in the previous paragraph, we have, from equation (18) with
C = 0,

∂F(~θ)/∂θw = Σi fw(xyz)  −  Σi Σz′ p̂(z′ | xy) fw(xyz′)

where the first sum is the observed count of w in the corpus and the second is the predicted count of w in
the corpus. Hence SGD will adjust ~θ until this is 0, that is, until the predicted count of w exactly matches the observed
count c(w). For example, if c(w) = 0, then SGD will try to allocate 0 probability to word w in all contexts
(no smoothing), by driving θw → −∞. Taking C > 0 prevents this by encouraging θw to stay close to 0.
J.3 Embedding-based features on unigrams and trigrams
Oddly, (7) only includes features that evaluate the bigram yz (via weights in the Y matrix) and the
skip-bigram xz (via weights in the X matrix). After all, you can see in (9) that the features have the form yjzm
and xjzm. This seems weaker than ADDL BACKOFF. Thus, add unigram features of the form zm and
trigram features of the form xhyjzm.
J.4 Embedding-based features based on more distant skip-bigrams
For a log-linear model, there's no reason to limit yourself to trigram context. Why not look at 10 previous
words rather than 2 previous words? In other words, your language model can use the estimate p(wi |
wi−10, wi−9, . . . , wi−1).
There are various ways to accomplish this. You may want to reuse the X matrix at all positions i −
10, i−9, . . . , i−2 (while still using a separate Y matrix at position i−1). This means that having the word
"bread" anywhere in the recent history (except at position i−1) will have the same effect on wi. Such a
design is called "tying" the feature weights: if you think of different positions having different features
associated with them, you are insisting that certain related features have weights that are "tied together"
(i.e., they share a weight).
You could further improve the design by saying that "bread" has weaker influence when it is in the
more distant past. This could be done by redefining the features: for example, in your version of (9), you
could scale down the feature value (xjzm) by the number of word tokens that fall between x and z.13
Note: The provided code has separate methods for 3-grams, 2-grams, and 1-grams. To support general
n-grams, you'll want to replace these with a single method that takes a list of n words. It's probably easiest
to streamline the provided code so that it does this for all smoothing methods.
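One way to sketch the tied-X design with distance scaling (all names and toy values are illustrative; a real model would use d-dimensional embeddings):

```python
import numpy as np

def score(context_embs, z_emb, X, Y):
    """Score candidate z given a list of context word embeddings: the last
    position uses Y, every earlier position shares one tied X matrix, and
    each earlier term is scaled down by its distance from z."""
    s = context_embs[-1] @ Y @ z_emb              # position i-1 uses Y
    for k, w_emb in enumerate(context_embs[:-1]):
        distance = len(context_embs) - 1 - k      # tokens between this word and z
        s += (w_emb @ X @ z_emb) / distance       # tied X, weaker when far away
    return float(s)

# Toy 1-dimensional demo: with two context words (positions i-2 and i-1),
# the score is 3.0 from Y plus 2.0/1 from the scaled, tied X term.
X, Y = np.array([[2.0]]), np.array([[3.0]])
e = np.array([1.0])
assert score([e, e], e, X, Y) == 5.0
```

Dividing by the distance implements the fixed scaling scheme; footnote 13's learned multipliers λ would replace the `1/distance` factor with trainable parameters.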
J.5 Spelling-based features
The word embeddings were automatically computed based on which words tend to appear near one
another. They don't consider how the words are spelled! So, augment each word's embedding with
additional dimensions that describe properties of the spelling. For example, you could have dimensions that
ask whether the word ends in -ing, -ed, etc. Each dimension will be 1 or 0 according to whether the word
has the relevant property.
13A fancier approach is to learn how much to scale down this influence. For example, you could keep the feature value defined
as (xjzm), but say that the feature weights for position i−6 (for example) are given by the matrix λ6X. Now X is shared across
all positions, but the various multipliers such as λ6 are learned by SGD along with the entries of X and Y. If you learn that λ6 is
close to 0, then you have learned that wi−6 has little influence on wi. (In this case, the model is technically log-quadratic rather
than log-linear, and the objective function is no longer concave, but SGD will probably find good parameters anyway. You will
have to work out the partial derivatives with respect to the entries of λ as well as X and Y.)
Just throw in a dimension for each suffix that is common in the data. You could also include properties
relating to word length, capitalization patterns, vowel/consonant patterns, etc.: anything that you think
might help!
You could easily come up with thousands of properties in this way. Fortunately, a given word such as
burgeoning will have only a few properties, so the new embeddings will be sparse. That is, they consist
mostly of 0's with a few nonzero elements (usually 1's). This situation is very common in NLP. As a result,
you don't need to store all the new dimensions: you can compute them on demand when you are computing
summations like Σj Σm Yjm · (yj zm) in (9). In such a summation, j ranges over possible suffixes of
y and m ranges over possible suffixes of z (among other properties, including the original dimensions). To
compute the summation, you only have to loop over the few dimensions j for which yj ≠ 0 and the few
dimensions m for which zm ≠ 0. (All other summands are 0 and can be skipped.)
It is easy to identify these few dimensions. For example, burgeoning has the -ing property but not any
of the other 3-letter-suffix properties. In the trigram xyz = demand was burgeoning, the summation
would include a feature weight Yjm for j = -was and m = -ing, which is included because yz has that
particular pair of suffixes and so yjzm = 1. In practice, Y can be represented as a hash map whose keys
are pairs of properties, such as pairs of suffixes.
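A sketch of this sparse representation (the property inventory here is a tiny illustrative stand-in for the thousands of real properties):

```python
def properties(word, suffixes=("-ing", "-ed", "-was")):
    """Return the word's few active spelling properties (a sparse embedding)."""
    props = {s for s in suffixes if word.endswith(s[1:])}
    if word[:1].isupper():
        props.add("CAP")  # one example of a non-suffix property
    return props

def y_score(y_word, z_word, Y):
    """Sum Y[(j, m)] over the few active property pairs; all other
    dimensions contribute 0 and are never even looped over."""
    return sum(Y.get((j, m), 0.0)
               for j in properties(y_word) for m in properties(z_word))

# Y as a hash map keyed by pairs of properties, as suggested above.
Y = {("-was", "-ing"): 1.5}
assert properties("burgeoning") == {"-ing"}
assert y_score("was", "burgeoning", Y) == 1.5
```

The loop cost depends only on how many properties each word has, not on the total number of properties.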
J.6 Meaning-based features
If you can find online dictionaries or other resources, you may be able to obtain other, linguistically
interesting properties of words. You can then proceed as with the spelling features above.
J.7 Repetition
As words tend to repeat, you could have a feature that asks whether wi appeared in the set {wi−10, wi−9, . . . , wi−1}.
This feature will typically get a positive weight, meaning that recently seen words are likely to appear
again. Since 10 is arbitrary, you should actually include similar features for several different history sizes:
for example, another feature asks whether wi appeared in {wi−20, wi−19, . . . , wi−1}.
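These binary features are only a few lines to compute (history sizes 10 and 20 match the examples above):

```python
def repetition_features(words, i, history_sizes=(10, 20)):
    """For each history size h, return 1 if words[i] occurred among the
    previous h words and 0 otherwise."""
    return [1 if words[i] in words[max(0, i - h):i] else 0
            for h in history_sizes]

# Toy demo: "a" repeats every 3 tokens, so it appears in both windows.
w = ["a", "b", "c"] * 10
assert repetition_features(w, 15) == [1, 1]
```

Note that the two features are nested: whenever the 10-word feature fires, the 20-word one does too, so their learned weights are effectively additive.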
J.8 Ensemble modeling
Recall that (25) included the log-probability of another model as a feature within your LOGLIN model. You
could include other log-probabilities in the same way, such as smoothed bigram or trigram probabilities.
The LOGLIN model then becomes an "ensemble model" that combines the probabilities of several other
models, learning how strongly to weight each of these other models.
If you want to be fancy, your log-linear model can include various trigram-model features, each of
which returns log p̂trigram(z | xy) but only when c(xy) falls into a particular range, and returns 0 otherwise.
Training might learn different weights for these features. That is, it might learn that the trigram model is
trustworthy when the context xy is well-observed, but not when it is rarely observed.
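The bucketed version can be sketched as follows (the thresholds and the helper names p_trigram and c_xy are illustrative, not part of the assignment):

```python
import math

def bucketed_features(x, y, z, p_trigram, c_xy, thresholds=(1, 10, 100)):
    """One feature per count range for c(xy); only the bucket containing
    the observed context count returns log p_trigram, the rest return 0."""
    lo = [0] + list(thresholds)
    hi = list(thresholds) + [float("inf")]
    logp = math.log(p_trigram(x, y, z))
    return [logp if lo_k <= c_xy(x, y) < hi_k else 0.0
            for lo_k, hi_k in zip(lo, hi)]

# Toy demo: context count 12 falls in the [10, 100) bucket (index 2 of 4).
feats = bucketed_features("a", "b", "c",
                          p_trigram=lambda x, y, z: 0.25,
                          c_xy=lambda x, y: 12)
assert feats == [0.0, 0.0, math.log(0.25), 0.0]
```

SGD can then assign each bucket its own weight, learning separate trust levels for well-observed and rarely observed contexts.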