601.465/665 — Natural Language Processing
Assignment 3: Smoothed Language Modeling
Probabilistic models are an indispensable part of modern NLP. This assignment will try to convince
you that even simplistic and linguistically stupid models like n-gram models can be very useful, provided
their parameters are estimated carefully. See reading section A.
You now know enough about probability to build and use some trigram language models. You will
experiment with different types of smoothing. You will also get experience in running corpus experiments
over training, development, and test sets. This is the only assignment in the course to focus on that.
Reading: Read the handout attached to the end of this assignment!
Collaboration: You may work in teams of up to 2 on this assignment. That is, if you choose, you
may collaborate with 1 partner from the class, handing in a single homework with multiple names on it.
Do the work together; don’t divide it up: if you didn’t work on a question, you don’t deserve credit for it!
Your solutions should emerge from collaborative real-time discussions with both of you present.
Programming language: We ask you to use Python 3 for this assignment. To ease the transition (as
this is the first assignment to require Python), we will give you some useful code as a starting point.1 On
getting programming help: Since this is an upper-level NLP class, not a programming class, I don’t want
you wasting time on low-level issues like how to handle I/O or hash tables of arrays. If you find yourself
wasting time on programming issues, then by all means seek help from someone who knows the language
better! Your responsibility is the NLP stuff: you do have to design, write, and debug the interesting code
and data structures on your own. But I don’t consider it cheating if another hacker (or the TA) helps you
with your I/O routines or language syntax or compiler warning messages. These aren’t Interesting™.
How to hand in your written work: Via Gradescope as before. Besides the comments you embed
in your source files, put all other notes, documentation, and answers to questions in a README.pdf file.
How to test and hand in your code:
1 • Place your code and trained parameters into a single submission directory, which you will zip and
upload separately on Gradescope. We will post more detailed instructions on Piazza.
• For the parts where we tell you exactly what to do, an autograder will check that you got it right.
• For the open-ended challenge, an autograder will run your system and score its accuracy on “dev-test” and “final-test” data.
– You should get a decent grade if you do at least as well as the “baseline” system provided by
the TAs. Better systems will get higher grades.
– Each time you submit a version of your system, you’ll see the results on “dev-test” data and
how they compare to the baseline system. You’ll also see what other teams tried and how their
accuracies compared to yours. (Teams will use made-up names, not real names.)
1 It counts word n-grams in a corpus, using hash tables, and uses the counts to calculate simple probability estimates. We also
supply a method that calls an optimization library to maximize a function.
– The “dev-test” results are intended to help you develop your system. Grades will be based on
the “final-test” data that you have never seen.
1 Warm-up: Word embeddings
The big part of this assignment will involve various kinds of smoothed language models. Often, smoothing
involves treating similar events in similar contexts as having similar probabilities. (Backoff is one example.)
One strategy will be to treat similar words as having similar probabilities. But what does “similar
words” mean? More concretely, how will we measure whether two words are similar?
Log into the ugrad machines, and look in the directory /home/arya/hwlm/lexicons/. A lexicon
lists useful properties of the words of a language. Each of the words*.txt files2 is a lexicon that lists
over 70,000 words of English, with a representation of each word as a d-dimensional vector. In other words,
it embeds these words into the vector space R^d.
We can now define “similar words” as words with similar vectors. A word may have many syntactic
and semantic properties: some words are transitive verbs, some words are plural, some words refer to
animals. These properties can be considered by a log-linear model, and words with similar vectors tend to
have similar properties. The larger d is, the more room the vector has to encode interesting properties.
The vectors in these particular files were produced automatically by Google’s word2vec program.3
Their individual dimensions are hard to interpret as natural properties. It would certainly be nice if dimension 2 (for example) represented the “animal” property, so that a word that referred to an animal would be
one whose vector ~v = (v1, v2, . . . , vd) had strongly positive v2. In practice, however, word2vec chooses an
arbitrary basis for the vector space. So the “animal” direction (to the extent that there is one) is actually
represented by some arbitrary vector ~u. The animal words tend to be those whose vectors ~v have strongly
positive ~v · ~u.
(Geometrically, this dot product measures how far ~v extends in direction ~u (it projects ~v onto the ~u
vector). Algebraically, it computes a certain linear combination of v1, v2, . . . , vd. As a special case, if ~u
were (0, 1, 0, 0, . . .), then ~v · ~u would in fact return v2. But there’s no reason to expect that ~u would be
that simple.)
You’ll be using these lexicons in the next section. For this part, you’ll just warm up by writing a short
program to get a feel for vector embeddings.
1. Your program, findsim.py, should print the 10 words most similar to seattle, other than seattle
itself, according to the 50-dimensional embeddings, if you run it as
python3 findsim.py words50.txt seattle
They should be printed in decreasing order of similarity.
2 The file format should be easy to figure out. The file wordsd.txt can be regarded as a matrix with about 70,000 rows and
d columns. Each row is labeled by a word. The first line of the file is special: it gives the number of rows and columns.
3 The details are not important for this assignment, but the vectors are optimized by gradient descent so as to arrange that the
vector for each word token wi will be predictable (to the extent possible) from the average vector of nearby word tokens (wj for
j ≠ i and i − 5 ≤ j ≤ i + 5). If you’re curious, you can find details in Mikolov et al. (2013): “Distributed Representations of
Words and Phrases and their Compositionality”. We ran the CBOW method over the first 1 billion characters of English Wikipedia.
word2vec doesn’t produce vector representations for rare words (c(w) < 5), so we first replaced rare words with the special
symbol ool (“out of lexicon”), forcing word2vec to learn a vector representation for ool.
To measure the similarity of two words with vectors ~v and ~w, please use the cosine similarity, which
is the cosine of the angle θ between the two vectors:

cos θ = (~v / ‖~v‖) · (~w / ‖~w‖) = (~v · ~w) / (‖~v‖ ‖~w‖) = (Σ_{i=1}^{d} vi wi) / (√(Σ_{i=1}^{d} vi²) √(Σ_{i=1}^{d} wi²)) ∈ [−1, 1]
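To make the formula concrete, here is a minimal sketch in Python. (The lexicon dict mapping words to vectors is a hypothetical stand-in for whatever structure your findsim.py builds when parsing words50.txt.)

```python
import heapq
import math

def cosine(v, w):
    """Cosine similarity: dot(v, w) / (||v|| * ||w||), a value in [-1, 1]."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(vi * vi for vi in v)) *
                  math.sqrt(sum(wi * wi for wi in w)))

def most_similar(lexicon, query, k=10):
    """The k words (other than query itself) whose vectors have the highest
    cosine similarity to lexicon[query], in decreasing order of similarity."""
    target = lexicon[query]
    scored = ((cosine(vec, target), word)
              for word, vec in lexicon.items() if word != query)
    # heapq.nlargest keeps a size-k heap, so there is no need to sort
    # the whole 70,000-word lexicon by similarity.
    return [word for score, word in heapq.nlargest(k, scored)]
```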
2 What are the most similar words to seattle, dog, communist, jpg, the, and google? Play around
some more. What are some examples that work “well” or “poorly”? What patterns do you notice?
3 So far you have been using d = 50. What happens for larger or smaller values of d?
4 Hint: Start by making findsim.py find only the single most similar word, which should be easy
enough. Then you can generalize this method to find the 10 most similar words. (Since you only
want the top 10, it’s not necessary to sort the whole lexicon by similarity; that method is acceptable,
but inefficient.)
Note: You can copy the lexicon files to your personal machine. But to avoid downloading such large
files, it may be easier to run findsim.py directly on the ugrad machines. If you do this, please
don’t waste space by making fresh copies of the files on the ugrad machines. Just use them
at their current location. If you like, create symbolic links (shortcuts) to them by typing “ln -s
/home/arya/hwlm/lexicons/* .” in your own working directory.
5 Submit findsim.py on Gradescope.
2. Extra credit: Now extend your program so that it can also be run as follows:
python3 findsim.py words50.txt king --minus man --plus woman
This should find the 10 words most similar to the vector king − man + woman (other than those three
words themselves).
(The old command python3 findsim.py words50.txt seattle should still work. Note that
it is equivalent to python3 findsim.py words50.txt seattle --minus seattle --plus
seattle, and you may want to implement it that way.)
You can regard the above command as completing the analogy
man : woman :: king : ?
Try some more analogies, this time using the 200-dimensional vectors. For example:
king – man + woman
paris – france + uk
hitler – germany + italy
child – goose + geese
goes – eats + ate
car – road + air
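In vector terms, each line above asks for the words nearest to a target built by addition and subtraction. A sketch, reusing the same kind of word-to-vector dict as before (the helper names here are illustrative, not from the starter code):

```python
import heapq
import math

def analogy(lexicon, positive, negative, k=10):
    """Rank words by cosine similarity to sum(positive vectors) minus
    sum(negative vectors), excluding the query words themselves."""
    d = len(next(iter(lexicon.values())))
    target = [0.0] * d
    for w in positive:
        target = [t + x for t, x in zip(target, lexicon[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, lexicon[w])]
    exclude = set(positive) | set(negative)

    def cos(v, w):
        dot = sum(a * b for a, b in zip(v, w))
        return dot / (math.sqrt(sum(a * a for a in v)) *
                      math.sqrt(sum(b * b for b in w)))

    scored = ((cos(vec, target), word)
              for word, vec in lexicon.items() if word not in exclude)
    return [word for score, word in heapq.nlargest(k, scored)]
```

With this helper, the command-line query king --minus man --plus woman amounts to analogy(lexicon, ["king", "woman"], ["man"]).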
6 Come up with some analogy questions of your own. Which ones work well or poorly? What happens
to the results if you use (say) d = 10 instead of d = 200?4
7 Why does this work at all? Be sure to discuss the role of the vector king − man before woman is
added. What does it tell you about the vectors that they can solve analogies?
8 Submit your extended findsim.py program on Gradescope instead of the original findsim.py.
2 Smoothed language modeling
1. Your starting point is the sample program fileprob. It can be found on the ugrad machines
(ugrad{1,2,...,24}.cs.jhu.edu) in the directory /home/arya/hwlm/code.
You’re welcome to develop directly on these machines, if you want to avoid transferring the data.
Each language-specific subdirectory contains an INSTRUCTIONS file explaining how to get the program running. Those instructions will let you automatically compute the log2-probability of three
sample files (speech/{sample1,sample2,sample3}). Try it!
Next, you should spend a little while looking at those sample files yourself, and in general, browsing
around the /home/arya/hwlm directory to see what’s there. See reading sections B and C for more
information about the datasets.
9 If a language model is built from the speech/switchboard-small corpus, using add-0.01 smoothing, what is the model’s perplexity per word on each of the three sample files? (You can compute this
from the log2-probability that fileprob prints out, as discussed in class and in your textbook. Use
the command wc -w on a file to find out how many words it contains.)
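The conversion itself is one line of arithmetic; the numbers in the usage comment below are made up, not taken from the actual corpora:

```python
def perplexity_per_word(log2_prob, num_words):
    """Perplexity per word = 2 ** (cross-entropy per word)
                           = 2 ** (-log2_prob / num_words)."""
    return 2.0 ** (-log2_prob / num_words)

# e.g. if fileprob reported log2 p(file) = -7000 for a 1000-word file:
print(perplexity_per_word(-7000.0, 1000))  # 128.0
```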
10 What happens to the log2-probabilities and perplexities if you train instead on the larger switchboard
corpus? Why?
2. Modify fileprob to obtain a new program textcat that does text categorization. The two programs should share code for smoothed language models, so that when you write new smoothing
methods later, they will immediately be available from both programs. See the INSTRUCTIONS file
for programming-language-specific directions about which files to copy, alter, and submit for this
problem.
textcat should be run from the command line almost exactly like fileprob. However, as additional arguments,
• training needs to specify two training corpora rather than one: train1 and train2. These correspond to data for the two classes in your text categorization problem.
• testing needs to specify the prior probability of train1.
For example, you could train your system with a line like
textcat TRAIN add1 words10.txt gen spam
which saves the trained models in a file but prints no output. In this example, gen and spam are
the training corpora, corresponding to “genuine” and spam emails. words10.txt is a lexicon containing word vectors, which will be used by the log-linear method (see problem 6) but is ignored for
now.
4 One type of ‘working poorly’ was explored in this paper at NeurIPS. Since then, gender bias in NLP tools has become a hot
research area at Hopkins and elsewhere. But there are other types of ‘working poorly’ that you can find.
Then you would test like this:
textcat TEST add1 words10.txt gen spam 0.7 foo.txt bar.txt baz.txt
which loads the models from before and uses them to classify the remaining files. It should print
output that labels each file with a training corpus name (in this case gen or spam):
spam foo.txt
spam bar.txt
gen baz.txt
1 files were more probably gen (33.33%)
2 files were more probably spam (66.67%)
In other words, it classifies each file by printing its maximum a posteriori class (the file name of the
training corpus that probably generated it). Then it prints a summary.
The number 0.7 on the test command line specifies your prior probability that a test file will be gen.
(See reading section C.3.)
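The decision rule itself is just Bayes’ theorem in log space. A minimal sketch (the function and argument names are illustrative, not the starter code’s):

```python
import math

def classify(log2_prob_gen, log2_prob_spam, prior_gen):
    """MAP decision: pick the class maximizing log2 p(class) + log2 p(file | class),
    where log2_prob_* are the models' log2-probabilities of the file
    (the kind of number fileprob prints)."""
    score_gen = math.log2(prior_gen) + log2_prob_gen
    score_spam = math.log2(1.0 - prior_gen) + log2_prob_spam
    return "gen" if score_gen >= score_spam else "spam"

# A file that the spam model likes slightly better can still be labeled gen
# when the prior favors gen:
print(classify(-1000.0, -999.5, 0.7))  # gen
```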
Please use the exact output formats above. If you would like to print any additional output lines for
your own use, please direct it to STDERR; we provide examples.
As reading section D explains, both language models built by textcat should use the same finite
vocabulary. Define this vocabulary to be all words that appeared ≥ 3 times in the union of the two
training corpora, plus oov. Your add-λ model doesn’t actually need to store the set of words in
the vocabulary, but it does need to know its size V, because the add-1 smoothing method estimates
p(z | xy) as (c(xyz) + 1) / (c(xy) + V). We’ve provided code to find V for you; see the INSTRUCTIONS file for details.
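In general, add-λ estimates p(z | xy) as (c(xyz) + λ) / (c(xy) + λV), with add-1 as the special case λ = 1. Spelled out as a tiny function (the Counter tables here are hypothetical stand-ins for the hash tables the starter code maintains):

```python
from collections import Counter

def addl_prob(z, x, y, trigram, bigram, V, lam):
    """Add-lambda estimate: p(z | xy) = (c(xyz) + lam) / (c(xy) + lam * V)."""
    return (trigram[(x, y, z)] + lam) / (bigram[(x, y)] + lam * V)
```

Note that a Counter returns 0 for unseen events, so novel trigrams automatically get probability λ / (c(xy) + λV) rather than zero.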
3. In this question, you will evaluate your textcat program on one of two tasks. You can do either
language identification (the english_spanish directory) or else spam detection (the gen_spam directory). Have a look at the development data in both directories to see which one floats your boat.
(Don’t peek at the test data!)
Using add-1 smoothing, run textcat on all the dev data for your chosen task:5
• For the language ID task, classify the files english_spanish/dev/english/*/* using the
training corpora en.1K and sp.1K.
Then classify english_spanish/dev/spanish/*/* similarly. Note that for this corpus, the
“words” are actually letters. Use 0.7 as your prior probability of English.
• Or, for the spam detection task, classify the files gen_spam/dev/gen/* using the training
corpora gen and spam.
Then classify gen_spam/dev/spam/* similarly. Use 0.7 as your prior probability of gen.
To simplify the wording below, I’ll assume that you’ve chosen the spam detection task.
11 (a) From the results, you should be able to compute a total error rate for the technique. That is,
what percentage of the dev files were classified incorrectly?
12 (b) How small do you have to make the prior probability of gen before textcat classifies all the
dev files as spam?
5 It may be convenient to use symbolic links to avoid typing long filenames. E.g.,
ln -s /home/arya/hwlm/english_spanish/train ~/estrain will create a subdirectory estrain under your home
directory; this subdirectory is really just a shortcut to the official training directory.
(c) Now try add-λ smoothing for λ ≠ 1. First, use fileprob to experiment by hand with different
values of λ > 0. (You’ll be asked to discuss in question 4b why λ = 0 probably won’t work
well.)
13 What is the minimum cross-entropy per token that you can achieve on the gen development
files (when estimating a model from gen training files with add-λ smoothing)? How about for
spam?
Note: To check that you are smoothing correctly, the autograder will run your code on small
training and testing files.
(d) In principle, you could apply different amounts of smoothing to the gen and spam models, and
this might be wise (for example, if their training sets have different rates of novel words).
However, for simplicity, your textcat program in this assignment smooths both models in
exactly the same way. 14 So what is the minimum cross-entropy per token that you can achieve
on all development files together, if both models are smoothed with the same λ?
(To measure cross-entropy per token, find the total number of bits that it takes to predict all
of the development files from their respective models. This means running fileprob twice:
once for the gen data and once for the spam data.6 Add the two results, and then divide by the
total number of tokens in all of the development files.)
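The arithmetic, with made-up numbers in the usage line:

```python
def cross_entropy_per_token(log2_probs, token_counts):
    """Total bits needed to predict all dev files (the negated log2-probabilities,
    e.g. one total per fileprob run), divided by the total number of tokens."""
    return -sum(log2_probs) / sum(token_counts)

# e.g. gen dev files: log2 p = -7000 over 800 tokens; spam: -3000 over 200 tokens
print(cross_entropy_per_token([-7000.0, -3000.0], [800, 200]))  # 10.0
```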
15 What value of λ gave you this minimum cross-entropy? Call this λ∗. (See reading section E
for why you are using cross-entropy to select λ∗.)
(e) Each of the dev files has a length. For language ID, the length in characters is given by the
directory name and is also embedded in the filename (as the first number). For spam detection,
the length in words is embedded in the filename (as the first number).
Come up with some way to quantify or graph the relation between file length and the classification accuracy of add-λ∗ on development data. (Feel free to use Piazza to discuss how to do
this.) 16 Write up your results. (This may be easier to do on your local machine, with a library
like the matplotlib Python package. You could also use this simple online tool.)
(f) Now try increasing the amount of training data. (Keep using add-λ∗, for simplicity.) 17 Compute
the overall error rate on dev data for training sets of different sizes. Graph the training size
versus classification accuracy.
• For the language ID task, use training corpora of 6 different sizes: en.1K vs. sp.1K (1000
characters each); en.2K vs. sp.2K (2000 characters each); and similarly for 5K, 10K, 20K,
and 50K.
• Or, for the spam detection task, use training corpora of 4 different sizes: gen vs. spam;
gentimes2 vs. spamtimes2 (twice as much training data); and similarly for . . .times4
and . . .times8.
4. Reading section F gives an overview of several smoothing techniques beyond add-λ.
(a) At the end of question 2, V was carefully defined to include oov. So if you saw 19,999 different
word types in training data, then V = 20,000. 18 What would go wrong with the UNIFORM
estimate if you mistakenly took V = 19,999? What would go wrong with the ADDL estimate?
19 (b) What would go wrong with the ADDL estimate if we set λ = 0? (Remark: This naive historical estimate is commonly called the maximum-likelihood estimate, because it maximizes the
probability of the training corpus.)
6 This is not quite the right thing to do, actually. Running fileprob on gen uses only a vocabulary derived from gen, whereas
textcat is going to use a larger vocabulary derived from gen ∪ spam. If this were a research paper, I’d insist on using textcat’s
vocabulary to tune λ∗. But for this assignment, take the shortcut and just run fileprob.
20 (c) Let’s see on paper how backoff behaves with novel trigrams. If c(xyz) = c(xyz′) = 0, then
does it follow that p̂(z | xy) = p̂(z′ | xy) when those probabilities are estimated by BACKOFF ADDL smoothing? In your answer, work out and state the value of p̂(z | xy) in this case.
How do these answers change if c(xyz) = c(xyz′) = 1?
21 (d) In the BACKOFF ADDL scheme, how does increasing λ affect the probability estimates? (Think
about your answer to the previous question.)
5. It’s time to implement the remaining smoothing methods, in addition to what’s already in the code.
Add support for add-λ smoothing with backoff, ADDL BACKOFF. (This should allow both fileprob
and textcat to use that method.) See the INSTRUCTIONS file for language-specific instructions.
This should be just a few lines of code. You will only need to understand how to look up counts in
the hash tables. Just study how the existing methods do it.
Hint: So p̂(z | xy) should back off to p̂(z | y), which should back off to p̂(z), which backs off to
. . . what? Figure it out!
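One standard way to set up such a recursion looks like the sketch below. Treat it as an illustration only: check reading section F for the scheme this assignment actually defines, and note that the base case used here (backing off to the uniform estimate 1/V) is an assumption of the sketch.

```python
from collections import Counter

def prob_backoff_addl(z, x, y, c3, c2, c1, N, V, lam):
    """Add-lambda smoothing with backoff (a sketch, not the official solution):
    p(z|xy) backs off to p(z|y), which backs off to p(z), which backs off to 1/V.
    c3, c2, c1 are trigram/bigram/unigram count tables; N is the token count."""
    p_z = (c1[z] + lam * V * (1.0 / V)) / (N + lam * V)       # unigram -> uniform
    p_zy = (c2[(y, z)] + lam * V * p_z) / (c1[y] + lam * V)   # bigram -> unigram
    return (c3[(x, y, z)] + lam * V * p_zy) / (c2[(x, y)] + lam * V)
```

A useful sanity check on any such scheme is that the estimates still sum to 1 over the vocabulary for every fixed context xy.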
You will submit trained add-λ∗ models as in question 3d. For simplicity, just use the same λ∗ as in
that question, even though some other λ might work better with backoff.
6. (a) Add support for the LOGLIN model. See the INSTRUCTIONS file for language-specific instructions. Just as add-0.01 smoothing was selected by passing the model name add0.01 to
fileprob and textcat, a log-linear model with C = 1 should be selected by passing the
model name loglin1. (C is the regularization coefficient used during training: see reading
section H.1.)
Your code will need to compute p̂(z | xy) using the features in reading section F.4.1. It should
refer to the current parameters ~θ (the entries of the X and Y matrices).
You can use embeddings of your choice from the lexicons directory. (See the README file in
that directory. Make sure to use word embeddings for gen/spam, but character embeddings
for english/spanish.) These word embeddings were derived from Wikipedia, a large diverse
corpus with lots of useful evidence about the usage of many English words.
(b) Implement stochastic gradient ascent (Algorithm 1 in reading section H.2) to find the X and Y
matrices that maximize F(~θ). See the INSTRUCTIONS file in the code directory for details. For
the autograder’s sake, when loglin is specified on the command line, please train for E = 10
epochs, use the exact hyperparameters suggested in reading section I.1, and print output in the
following format (this is printing F(~θ) rather than Fi(~θ)):
Training from corpus en.1K
Vocabulary size is 30 types including OOV and EOS
epoch 1: F=1.83284
epoch 2: F=1.64174
... [you should print these epochs too]
epoch 10: F=1.58733
Finished training on 992 tokens
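The overall loop structure of stochastic gradient ascent is sketched below in generic form. The learning-rate schedule and the gradient here are placeholders; Algorithm 1 and reading section I.1 specify the actual ones to use for LOGLIN.

```python
def sgd_ascent(examples, grad_Fi, theta, gamma0=0.1, epochs=10):
    """Stochastic gradient ascent: for each example i in turn, step theta in
    the direction of the gradient of that single example's objective F_i."""
    t = 0  # total update count across epochs
    for epoch in range(epochs):
        for ex in examples:
            gamma = gamma0 / (1 + 0.01 * t)   # a hypothetical decaying rate
            g = grad_Fi(theta, ex)            # gradient of F_i at current theta
            theta = [th + gamma * gi for th, gi in zip(theta, g)]
            t += 1
    return theta
```

For instance, with a single example and F(θ) = −(θ − 5)², whose gradient is 2(5 − θ), repeated updates drive θ toward the maximizer 5.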
(c) Training a log-linear model takes significantly more time than ADDL smoothing. Therefore,
we recommend that you experiment with the language ID task for this part.
Try your program out first by training a log-linear language model on en.1K, with character
embeddings d = 10 (chars10.txt) and regularization strength C = 1. As a check, your
numbers should match those shown in 6b above.
You should now be able to measure cross-entropies and text categorization error rates under your fancy new language model! textcat should work as before. It will construct two
LOGLIN models as above, and then compare the probabilities of a new document (dev or test)
under these models.
22 Report cross-entropy and text categorization accuracy with C = 1, but also experiment with
other values of C > 0, including a small value such as C = 0.05. Let C∗ be the best value you
find. Using C = C∗, play with different embedding dimensions and report the results. How
and when did you use the training, development, and test data? What did you find? How do
your results compare to add-λ backoff smoothing?
23 (d) Now you get to have some fun! Add some new features to LOGLIN and report the effect on
its performance. Some possible features are suggested in reading section J. You should make at
least one nontrivial improvement; you can do more for extra credit, including varying hyperparameters and training protocols (reading sections I.1 and I.5).
Your improved method should be selected by using the command-line argument improved (in
place of add1, loglin1, etc.). You will submit your system to Gradescope for autograding,
including the trained model.
You are free to submit many versions of your system, with different implementations of improved.
All will show up on the leaderboard, with comments, so that you and your classmates can see
what works well. For final grading, the autograder will take the submitted version of your
system that worked best on dev-test data, and then evaluate its performance on final-test data.
(e) Extra credit: In your submission, you can also include a trained model for the spam detection
task. This doesn’t require very much extra work in principle, but training will be much slower
because of the larger vocabulary (slowing down the sum over z′) and the larger training corpus
(slowing down the sum over i). You may need to adjust C, using development data as usual. See
lexicons/README for lexicons of embeddings that you could try. To speed up training, you could
try a smaller training set, a smaller vocabulary, or a lower-dimensional embedding. Report what
you did.
7. Extra credit: We have been assuming a finite vocabulary by replacing all unknown words with a
special oov symbol. But an alternative is an open-vocabulary language model (reading section D.5).
24 Devise a sensible way to estimate the word trigram probability p(z | xy), backing off to a letter
n-gram model of z if z is an unknown word. Describe how you would train the letter n-gram model.
Just give the formulas for your estimate; you don’t have to implement and test your idea.
Notes:
• x and/or y and/or z may be unknown; be sure you make sensible estimates of p(z | xy) in all
these cases
• be sure that Σz p(z | xy) = 1
601.465/665 — Natural Language Processing
Reading for Assignment 3: Smoothed Language Modeling
Prof. Kevin Duh and Jason Eisner — Fall 2019
We don’t have a required textbook for this course. Instead, handouts like this one are the main readings.
This handout accompanies assignment 3, which refers to it.
A Are trigram models useful?
Why build n-gram models when we know they are a poor linguistic theory? Answer: A linguistic system
without statistics is often fragile, and may break when run on real data. It will also be unable to resolve
ambiguities. So our first priority is to get some numbers into the system somehow. An n-gram model is a
starting point, and may get reasonable results even though it doesn’t have any real linguistics yet.
Speech recognition. Speech recognition systems have made heavy use of trigram models for decades.
Alternative approaches that don’t look at the trigrams do worse. One can do better by building fancy
language models that combine trigrams with syntax, topic, and so on. But only a little better: dramatic
improvements over trigram models are hard to get. In the language modeling community, a rule of thumb
was that you had enough for a Ph.D. dissertation if you had managed to reduce a standard trigram model’s
perplexity per word by 10% (equivalent to a cross-entropy reduction of just 0.152 bits per word).
Machine translation. In the same way, machine translation (MT) systems have often included 5-gram
models trained on quite massive amounts of data. An MT system has to generate a new fluent sentence of
English, and 5-grams do a better job than 3-grams of memorizing common phrases and local grammatical
phenomena.
Why doesn’t a speech recognition system need 5-grams? Because it is not generating a new sentence.
It only has to determine what words have already been said by an English speaker. A 3-gram model helps
to choose between “flower,” “flour,” and “floor” by using one word of context on either side. That already
provides most of the value that we can get out of local context. Going to a 5-gram model wouldn’t help
too much with this choice, because it still wouldn’t look at enough of the sentence to determine whether
we’re talking about gardening, baking, or cleaning.
Neural language models. In the 2010’s, language models based on recurrent neural networks finally
started to show dramatic gains over trigram models. These neural language models are much slower to
train, but they are not limited to trigrams, and can learn to notice complex ways in which the context
affects the next word. We won’t quite be covering them in this course, but the starting point is the word
embeddings used in this assignment and the previous one.
Neural language models have become increasingly popular since the mid-2010’s. However, cleverly
smoothed 7-gram models can still do about as well by looking at lots of features of the previous 6-gram,
according to Pelemans et al. (2016).
B Boundary symbols
Remember from the previous assignment that a language model estimates the probability of any word
sequence ~w. In a trigram model, we use the chain rule and backoff to assume that

p(~w) = ∏_{i=1}^{N} p(wi | wi−2, wi−1)

with start and end boundary symbols handled as in the previous assignment.
In other words, wN = eos (“end of sequence”), while for i < 1, wi = bos (“beginning of sequence”).
Thus, ~w consists of N − 1 words plus an eos symbol. Notice that we do not generate bos but condition on
it (it was always there). Conversely, we do generate eos but never condition on it (nothing follows it). The
boundary symbols bos, eos are special symbols that do not appear among w1 . . . wN−1.
In assignment 3, we will consider every file to implicitly start with bos and end with eos. A file might
be a sentence, or an email message, or a fragment of text.
Some files also contain sentence boundary symbols <s> and </s>. You should consider these to be
just ordinary words.
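The padding convention above can be sketched as a small helper that turns a file’s token sequence into the trigram events to be counted (symbol names as in the text):

```python
def trigrams(tokens):
    """Pad with two bos symbols and one eos, then yield all trigram events
    (w_{i-2}, w_{i-1}, w_i); the model conditions on bos but never generates it,
    and generates eos but never conditions on it."""
    padded = ["bos", "bos"] + list(tokens) + ["eos"]
    return [tuple(padded[i - 2:i + 1]) for i in range(2, len(padded))]

print(trigrams(["i", "saw"]))
# [('bos', 'bos', 'i'), ('bos', 'i', 'saw'), ('i', 'saw', 'eos')]
```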
C Datasets for Assignment 3
Assignment 3 will have corpora for two tasks: language identification and spam detection. Each corpus
has a README file that you should look at.
C.1 The train/dev/test split
Each corpus has already been divided for you into training, development, and test sets, which are in separate directories.
You will collect counts on the training set, tune the “hyperparameters” like λ to maximize performance
on the development set, and then evaluate your performance on the test set.
In this case, we actually have two test sets.
• The “dev test” set is for use as you develop your system. If you experiment on it repeatedly, there is
a danger that you will “overfit” to it: that is, you might find your way to a method that seems really
good, but is actually only good for that particular dataset, not in general.
• Thus, for fairness, we have to find out whether your system can do well in a blind test, when we run
it on data you’ve never seen. Your grade will therefore be determined based on a new “final test” set.
C.2 Domain mismatch
In this assignment, the training set is unfortunately a bit different from the others. To simplify the
command-line syntax for you, I’ve assumed that the training set consists of a single big file. That means
that there is only one bos and one eos in the whole training set.
By contrast, bos and eos are much more common in the dev set and the test sets. This is a case of
domain mismatch, where the training data is somewhat unlike the test data.
Domain mismatch is a common source of errors in applied NLP. In this case, the only practical implication is that your language models won’t do a very good job of modeling what tends to happen at the
very start or very end of a file, because they have seen only one file!
C.3 Class ratios
In the assignment, you’ll have to specify a prior probability that a file will be genuine email (rather than
spam) or English (rather than Spanish). In other words, how often do you expect the real world to produce
genuine email or English in your test data? We will ask you to guess 0.7.
Of course, your system won’t know the true fraction on test data, because it doesn’t know the true
classes; it is trying to predict them.
We can try to estimate the fraction from training data. It happens that 2/3 of the documents are genuine
email, and 1/2 are English.1 In this case, the prior probability is a parameter of the model, to be estimated
from training data (with smoothing!) like any other parameter.
But if you think that test data might have a different rate of spam or Spanish than training data, then
the prior probability is not necessarily something that you should represent within the model and estimate
from training data. Instead it can be used to represent your personal guess about what you think test data
will be like.
Indeed, in the assignment, you’ll use training data only to get the smoothed language models, which
define the likelihood of the different classes. This leaves you free to specify your prior probability of the
classes on the command line. This setup would let you apply the system to different test datasets about
which you have different prior beliefs: the spam-infested email account that you abandoned, versus your
new private email account that only your family knows about.
Does it seem strange to you that a guess or assumption might have a role in statistics? That is actually
central to the Bayesian view of statistics, which says that you can’t get something for nothing. Just as you
can’t get theorems without assuming axioms, you can’t get posterior probabilities without assuming prior
probabilities.
D The vocabulary
D.1 Choosing a finite vocabulary
All the smoothing methods assume a finite vocabulary, so that they can easily allocate probability to all
the words. But is this assumption justified? Aren’t there infinitely many potential words of English that
might show up in a test corpus (like xyzzy and JacrobinsteinIndustries and fruitylicious)?
Yes there are . . . so we will force the vocabulary to be finite by a standard trick. Choose some fixed,
finite vocabulary at the start. Then add one special symbol oov that represents all other words. You
should regard these other words as nothing more than variant spellings of the oov symbol.
Note that OOV stands for “out of vocabulary,” not for “out of corpus,” so OOV words may have token
count > 0 and in-vocabulary words may have count 0.
D.2 Consequences for evaluating a model
For example, when you are considering the test sentence
    i saw snuffleupagus on the tv
what you will actually compute is the probability of
    i saw oov on the tv
which is really the total probability of all sentences of the form
    i saw [some out-of-vocabulary word] on the tv
Admittedly, this total probability is higher than the probability of the particular sentence involving
snuffleupagus. But in most of this assignment, we only wish to compare the probability of the
snuffleupagus sentence under different models. Replacing snuffleupagus with oov raises the sentence's
probability under all the models at once, so it need not invalidate the comparison.²

¹These numbers are actually derived from dev data rather than training data, because of the domain mismatch issue above—the
training data are not divided into documents.
D.3 Comparing apples to apples
We do have to make sure that if snuffleupagus is regarded as oov by one model, then it is regarded as
oov by all the other models, too. It's not appropriate to compare p_model1(i saw oov on the tv) with
p_model2(i saw snuffleupagus on the tv), since the former is actually the total probability of many
sentences, and so will tend to be larger.
So all the models must have the same finite vocabulary, chosen up front. In principle, this shared
vocabulary could be any list of words that you pick by any means, perhaps using some external dictionary.
Even if the context "oov on" never appeared in the training corpus, the smoothing method is required
to give a reasonable value anyway to p(the | oov, on), for example by backing off to p(the | on).
Similarly, the smoothing method must give a reasonable (nonzero) probability to p(oov | i, saw).
Because we're merging all out-of-vocabulary words into a single word oov, we avoid having to decide
how to split this probability among them.
D.4 How to choose the vocabulary
How should you choose the vocabulary? For this assignment, simply take it to be the set of word types
that appeared ≥ 3 times anywhere in training data. Then augment this set with a special oov symbol. Let
V be the size of the resulting set (including oov). Whenever you read a training or test word, you should
immediately convert it to oov if it's not in the vocabulary. This is fast to check if you store the vocabulary
in a hash set.
To help you understand/debug your programs, we have grafted brackets onto all out-of-vocabulary
words in one of the datasets (the speech directory, where the training data is assumed to be train/switchboard).
This lets you identify such words at a glance. In this dataset, for example, we convert uncertain to
[uncertain]—this doesn't change its count, but does indicate that this is one of the words that your code
ought to convert to oov.
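The recipe above is short to implement. Here is a minimal sketch (the function names and the toy corpus are ours, not part of the assignment's starter code):

```python
from collections import Counter

OOV = "oov"

def build_vocab(training_tokens, threshold=3):
    """Vocabulary = word types appearing >= threshold times in training data, plus oov."""
    counts = Counter(training_tokens)
    vocab = {w for w, c in counts.items() if c >= threshold}
    vocab.add(OOV)
    return vocab  # a hash set, so membership checks are O(1)

def normalize(token, vocab):
    """Convert a token to oov as soon as it is read, if it's not in the vocabulary."""
    return token if token in vocab else OOV

train = ["the", "cat", "sat", "the", "cat", "sat", "the", "dog", "sat"]
vocab = build_vocab(train)       # {"the", "sat", "oov"}; so V = 3
print(normalize("dog", vocab))   # "dog" appeared only once, so it prints "oov"
```

Note that the same `normalize` call must be applied to training, dev, and test tokens alike, so that all models see the same vocabulary.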
D.5 Open-vocabulary language modeling
In this assignment, we assume a fixed finite vocabulary. However, an open-vocabulary language model
does not limit itself in advance to a finite vocabulary. Question 7 (extra credit) explores this possibility.
An open-vocabulary model must be able to assign positive probability to any word—that is, to any
string of letters that might ever arise. If the alphabet is finite, you could do this with a letter n-gram
model!
²Problem 7 explores a more elegant approach that may also work better for text categorization.
Such a model is sensitive to the spelling and length of the unknown word. Longer words will generally
receive lower probabilities, which is why it is possible for the probabilities of all unknown words to sum
to 1, even though there are infinitely many of them. (Just as 1/2 + 1/4 + 1/8 + ··· = 1.)
E Performance metrics
In this assignment, you will measure your performance in two ways. To measure the predictive power of
the model, you will use cross-entropy (per token). To measure how well the model does at a task, you will
use error rate (per document). In both cases, smaller is better.
Error rate may be what you really care about! However, it doesn't give a lot of information on a
small dev set. If your dev set has only 100 documents, then the error rate can only be one of the numbers
{0/100, 1/100, ..., 100/100}. It can tell you if your changes helped by correcting a wrong answer. But it can't tell
you that your changes were "moving in the right direction" by merely increasing the probability of right
answers.
In particular, for some of the tasks we are considering here, the error rate is just not very sensitive to
the smoothing parameter λ: there are many λ values that will give the same integer number of errors on
dev data. That is why you will use cross-entropy to select your smoothing parameter λ on dev data: it will
give you clearer guidance.
E.1 Other possible metrics
As an alternative, could you devise a continuously varying version of the error rate? Yes, because our
system³ doesn't merely compute a single output class for each document. It constructs a probability
distribution over those classes, using Bayes' Theorem. So we can evaluate whether that distribution puts
high probability on the correct answer.
• One option is the expected error rate. Suppose document #1 is gen. If the system thinks p(gen |
document1) = 0.49, then sadly the system will output spam, which ordinary error rate would
count as 1 error. But suppose you pretend—just for evaluation purposes—that the system chooses its
output randomly from its posterior distribution ("stochastic decoding" rather than "MAP decoding").
In that case, it only has probability 0.51 of choosing spam, so the expected number of errors on this
document is only 0.51. Partial credit!
Notice that expected error rate gives us a lot of credit for increasing p(gen | document1) from 0.01
to 0.49, and little additional credit for increasing it to 0.51. By contrast, the actual error rate only
gives us credit for the increase from 0.49 to 0.51, since that's where the actual system output would
change.
• Another continuous error metric is the log-loss, which is the system's expected surprisal about the
correct answer. The system's surprisal on document 1 is −log₂ p(gen | document1) = −log₂ 0.49 =
1.03 bits.
Both expected error rate and log-loss are averages over the documents that are used to evaluate.
So document 1 contributes 0.51 errors to the former average, and contributes 1.03 bits to the latter
average.
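Both metrics are easy to compute once you know the system's posterior probability of each document's correct class. A minimal sketch (helper names are ours):

```python
import math

def expected_error_rate(p_correct):
    """Average probability of a *wrong* guess under stochastic decoding.
    p_correct[i] is the system's posterior probability of document i's correct class."""
    return sum(1 - p for p in p_correct) / len(p_correct)

def log_loss(p_correct):
    """Average surprisal (in bits) about the correct class."""
    return sum(-math.log2(p) for p in p_correct) / len(p_correct)

# Document 1 from the text: the correct class gen has posterior probability 0.49.
print(round(expected_error_rate([0.49]), 2))   # 0.51
print(round(log_loss([0.49]), 2))              # 1.03
```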
³Unlike rule-based classifiers and some other ML classifiers.
In general, a single document contributes a number in [0, 1] to the expected error rate, but a number
in [0, ∞] to the log-loss. In particular, a system that thinks that p(gen | document1) = 0 is infinitely
surprised by the correct answer (−log₂ 0 = ∞). So optimizing for log-loss would dissuade you
infinitely strongly from using this system ... basically on the grounds that a system that is completely
confident in even one wrong answer can't possibly have the correct probability distribution. To
put it more precisely, if the dev set has size 100, then changing the system's behavior on a single
document can change the error rate or the expected error rate by at most 1/100—after all, it's just one
document!—whereas it can change the log-loss by an unbounded amount.
What is the relation between the log-loss and cross-entropy metrics? They are both average surprisals.⁴
However, they are very different:

metric         | what it evaluates               | probability used    | units              | long docs count more?
log-loss       | the whole classification system | p(gen | document1)  | bits per document  | no
cross-entropy  | the gen model within the system | p(document1 | gen)  | bits per gen token | yes
E.2 Generative vs. discriminative
The error rate, expected error rate, and log-loss are all said to be discriminative metrics because they
measure how well the system discriminates between correct and incorrect classes. By contrast, the
cross-entropy is a generative metric because it measures how good the system would be at randomly
generating the observed data.
Methods for setting hyperparameters or parameters are said to be generative or discriminative according
to whether they optimize a generative or discriminative metric.
A generative model includes a probability distribution p(input) that accounts for the input data. Thus,
this assignment uses generative models (language models). A discriminative model only tries to predict
output from input, possibly using p(output | input). Thus, a conditional log-linear model for text
classification would be discriminative. A discriminative model or training procedure focuses less on
explaining the input data and more on solving a particular task—less science, more engineering.
F Smoothing techniques
Here are the smoothing techniques we'll consider, writing p̂ for our smoothed estimate of p.
F.1 Uniform distribution (UNIFORM)
p̂(z | xy) is the same for every xyz; namely,

    p̂(z | xy) = 1/V                                                                  (1)

where V is the size of the vocabulary including oov.
⁴Technically, you could regard the log-loss as a conditional cross-entropy ... to be precise, it's the conditional cross-entropy
between empirical and system distributions over the output class. By contrast, the metric you'll use on this assignment is the
cross-entropy between empirical and system distributions over the input text. The output and the input are different random
variables, so log-loss is quite different from the cross-entropy we've been using to evaluate a language model!
F.2 Add-λ (ADDL)
Add a constant λ ≥ 0 to every trigram count c(xyz):

    p̂(z | xy) = (c(xyz) + λ) / (c(xy) + λV)                                          (2)

where V is defined as above. (Observe that λ = 1 gives the add-one estimate. And λ = 0 gives the naive
historical estimate c(xyz)/c(xy).)
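Equation (2) can be implemented directly from trigram and bigram count tables. A sketch (the helper names and toy counts here are ours):

```python
from collections import Counter

def addl_prob(z, x, y, tri, bi, lam, V):
    """Equation (2): p̂(z | xy) = (c(xyz) + λ) / (c(xy) + λV).
    tri[(x,y,z)] = c(xyz), bi[(x,y)] = c(xy); V counts the whole vocabulary, including oov."""
    return (tri[(x, y, z)] + lam) / (bi[(x, y)] + lam * V)

# Toy counts: the context "i saw" was seen twice, both times followed by "the".
vocab = ["the", "a", "dog", "cat", "oov"]
tri = Counter({("i", "saw", "the"): 2})
bi = Counter({("i", "saw"): 2})
p = addl_prob("the", "i", "saw", tri, bi, lam=1.0, V=len(vocab))
print(p)   # (2 + 1) / (2 + 5) = 3/7
```

Using `Counter` means unseen n-grams silently count as 0, which is exactly what the formula needs.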
F.3 Add-λ backoff (BACKOFF ADDL)
Suppose both z and z′ have rarely been seen in context xy. These small trigram counts are unreliable, so
we'd like to rely largely on backed-off bigram estimates to distinguish z from z′:

    p̂(z | xy) = (c(xyz) + λV · p̂(z | y)) / (c(xy) + λV)                              (3)

where p̂(z | y) is a backed-off bigram estimate, which is estimated recursively by a similar formula. (If
p̂(z | y) were the UNIFORM estimate 1/V instead, this scheme would be identical to ADDL.)
So the formula for p̂(z | xy) backs off to p̂(z | y), whose formula backs off to p̂(z), whose formula
backs off to ... what? Figure it out!
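Once you've answered that question for yourself, the recursion is only a few lines of Python. Below is one possible sketch (the function names are ours, the base case encodes one natural answer, and for simplicity it ignores the bos/eos bookkeeping you'll need in the real assignment):

```python
from collections import Counter

def backoff_addl(lam, V, tri, bi, uni, N):
    """Equation (3), applied recursively: trigram backs off to bigram, bigram to unigram,
    and here the unigram estimate backs off to the uniform 1/V.
    N is the total number of training tokens; V includes oov."""
    def p_z(z):
        return (uni[z] + lam * V * (1.0 / V)) / (N + lam * V)
    def p_z_y(z, y):
        return (bi[(y, z)] + lam * V * p_z(z)) / (uni[y] + lam * V)
    def p_z_xy(z, x, y):
        return (tri[(x, y, z)] + lam * V * p_z_y(z, y)) / (bi[(x, y)] + lam * V)
    return p_z_xy

# With no counts at all, every level falls through to the uniform estimate 1/V.
p = backoff_addl(lam=1.0, V=4, tri=Counter(), bi=Counter(), uni=Counter(), N=0)
print(p("z", "x", "y"))   # 0.25
```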
F.4 Conditional log-linear modeling (LOGLIN)
In the previous assignment, you learned how to construct log-linear models. Let's restate that construction
in our current notation.⁵ Given a trigram xyz, our model p̂ is defined by

    p̂(z | xy)  def=  u(xyz) / Z(xy)                                                  (4)

where

    u(xyz)  def=  exp( Σ_k θ_k · f_k(xyz) )  =  exp( θ⃗ · f⃗(xyz) )                  (5)
    Z(xy)   def=  Σ_z u(xyz)                                                          (6)

Here f⃗(xyz) is the feature vector extracted from xyz, and θ⃗ is the model's weight vector. Σ_z sums over
the V words in the vocabulary (including oov) in order to ensure that you end up with a probability
distribution over this chosen vocabulary. That is the goal of all these language models; you saw a similar
Σ_z in BACKOFF WB.
⁵Unfortunately, the tutorial also used the variable names x and y, but to mean something different than they mean in this
assignment. The previous notation is pretty standard in machine learning.
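Equations (4)–(6) amount to a softmax over the vocabulary. A minimal dict-based sketch (names are ours; a real implementation would vectorize this, as in reading section F.4.1):

```python
import math

def loglin_prob(z, x, y, vocab, theta, features):
    """Equations (4)-(6): p̂(z | xy) = u(xyz) / Z(xy), where u(xyz) = exp(θ⃗ · f⃗(xyz)).
    features(x, y, z) returns a dict of feature values; theta maps feature names to weights."""
    def u(cand):
        return math.exp(sum(theta.get(name, 0.0) * val
                            for name, val in features(x, y, cand).items()))
    Z = sum(u(zp) for zp in vocab)   # normalize over the whole vocabulary, including oov
    return u(z) / Z

vocab = ["a", "b", "c"]
feats = lambda x, y, z: {"unigram=" + z: 1.0}   # one binary feature per unigram
print(loglin_prob("a", "x", "y", vocab, {}, feats))                  # 1/3 when all weights are 0
print(loglin_prob("a", "x", "y", vocab, {"unigram=a": 1.0}, feats))  # larger than 1/3
```

With all weights zero, this reduces to the UNIFORM estimate 1/V; raising a feature's weight raises the probability of the trigrams that carry that feature.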
F.4.1 Bigrams and skip-bigram features from word embeddings
What features should we use in the log-linear model?
A natural idea is to use one binary feature for each specific unigram z, bigram yz, and trigram xyz
(see reading section J.2 below).
Instead, however, let's start with the following model based on word embeddings:

    u(xyz)  def=  exp( x⃗ᵀ X z⃗ + y⃗ᵀ Y z⃗ )                                         (7)

where the vectors x⃗, y⃗, z⃗ are specific d-dimensional embeddings of the word types x, y, z, while X, Y are
d × d matrices. The ᵀ superscript is the matrix transposition operator, used here to transpose a column
vector to get a row vector.
This model may be a little hard to understand at first, so here's some guidance.
What's the role of the word embeddings? Note that the language model is still defined as a conditional
probability distribution over the vocabulary. The lexicon, which you will specify on the command line, is
merely an external resource that lets the model look up some attributes of the vocabulary words. Just
like the dictionary on your shelf, it may also list information about some words you don't need, and it
may lack information about some words you do need. In short, the existence of a lexicon doesn't affect
the interpretation of Σ_z in (6): that formula remains the same regardless of whether the model's features
happen to consult a lexicon!
For oov, or for any other type in your vocabulary that has no embedding listed in the lexicon, your
features should back off to the embedding of ool—a special "out of lexicon" symbol that stands for "all
other words." ool is listed in the lexicon, just as oov is included in the vocabulary.
Note that even if a specific out-of-vocabulary word is listed in the lexicon, you must not use that
listing.⁶ For an out-of-vocabulary word, you are supposed to be computing probabilities like p(oov | xy),
which is the probability of the whole oov class—it doesn't even mention the specific word that was replaced
by oov. (See reading section D.2.)
Is this really a log-linear model? Now, what's up with (7)? It's a valid formula: you can always
get a probability distribution by defining p̂(z | xy) = (1/Z(xy)) exp(any function of x, y, z that you like)!
But is (7) really a log-linear function? Yes it is! Let's write out those d-dimensional vector-matrix-vector
multiplications more explicitly:

    u(xyz) = exp( Σ_{j=1..d} Σ_{m=1..d} x_j X_{jm} z_m  +  Σ_{j=1..d} Σ_{m=1..d} y_j Y_{jm} z_m )          (8)
           = exp( Σ_{j=1..d} Σ_{m=1..d} X_{jm} · (x_j z_m)  +  Σ_{j=1..d} Σ_{m=1..d} Y_{jm} · (y_j z_m) )  (9)

⁶This issue would not arise if we simply defined the vocabulary to be the set of words that appear in the lexicon. This simple
strategy is certainly sensible, but it would slow down normalization because our lexicon is quite large.
This does have the log-linear form of (5). Suppose d = 2. Then implicitly, we are using a weight vector θ⃗
of length d² + d² = 8, defined by

    ⟨ θ1,  θ2,  θ3,  θ4,  θ5,  θ6,  θ7,  θ8 ⟩
      ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓
    ⟨ X11, X12, X21, X22, Y11, Y12, Y21, Y22 ⟩                                        (10)

for a vector f⃗(xyz) of 8 features

    ⟨ f1(xyz), f2(xyz), f3(xyz), f4(xyz), f5(xyz), f6(xyz), f7(xyz), f8(xyz) ⟩
        ↓        ↓        ↓        ↓        ↓        ↓        ↓        ↓
    ⟨  x1z1,    x1z2,    x2z1,    x2z2,    y1z1,    y1z2,    y2z1,    y2z2  ⟩         (11)

Remember that the optimizer's job is to automatically manipulate some control sliders. This particular
model with d = 2 has a control panel with 8 sliders, arranged in two d×d grids (X and Y). The point is that
we can also refer to those same 8 sliders as θ1, ..., θ8 if we like. What features are these sliders (weights)
connected to? The ones in (11): if we adopt those feature definitions, then our general log-linear formula
(5) will yield up our specific model (9) (= (7)) as a special case.
As always, θ⃗ is incredibly important: it determines all the probabilities in the model.
Is this a sensible model? The feature definitions in (11) are pairwise products of embedding dimensions.
Why on earth would such features be useful? First imagine that the embedding dimensions were bits (0
or 1). Then x2z1 = 1 iff (x2 = 1 and z1 = 1), so you could think of multiplication as a kind of feature
conjunction. Multiplication has a similar conjunctive effect even when the embedding dimensions are in
ℝ. For example, suppose z1 > 0 indicates the degree to which z is a human-like noun, while x2 > 0
indicates the degree to which x is a verb whose direct objects are usually human.⁷ Then the product x2z1
will be larger for trigrams like kiss the baby and marry the policeman. So by learning a positive
weight X21 (nicknamed θ3 above), the optimizer can drive p̂(baby | kiss the) higher, at the expense
of probabilities like p̂(benzene | kiss the). p̂(bunny | kiss the) might be somewhere in the middle
since bunnies are a bit human-like and thus bunny1 might be numerically somewhere between baby1 and
benzene1.
Example. As an example, let's calculate the letter trigram probability p̂(s | er). Suppose the relevant
letter embeddings and the feature weights are given by

    e⃗ = [−.5, 1]ᵀ,   r⃗ = [0, .5]ᵀ,   s⃗ = [.5, .5]ᵀ,   X = [[1, 0], [0, .5]],   Y = [[2, 0], [0, 1]]

⁷You might wonder: What if the embedding dimensions don't have such nice interpretations? What if z1 doesn't represent a
single property like humanness, but rather a linear combination of such properties? That actually doesn't make much difference.
Suppose z⃗ can be regarded as Mz̃ where z̃ is a more interpretable vector of properties. (Equivalently: each z_j is a linear
combination of the properties in z̃.) Then x⃗ᵀXz⃗ can be expressed as (Mx̃)ᵀX(Mz̃) = x̃ᵀ(MᵀXM)z̃. So now it's X̃ = MᵀXM
that can be regarded as the matrix of weights on the interpretable products. If there exists a good X̃ and M is invertible, then
there exists a good X as well, namely X = (Mᵀ)⁻¹X̃M⁻¹.
First, we compute the unnormalized probability.

    u(ers) = exp( [−.5  1] [[1, 0], [0, .5]] [.5, .5]ᵀ  +  [0  .5] [[2, 0], [0, 1]] [.5, .5]ᵀ )
           = exp(−.5 × 1 × .5 + 1 × .5 × .5 + 0 × 2 × .5 + .5 × 1 × .5) = exp 0.25 = 1.284

We then normalize u(ers).

    p̂(s | er)  def=  u(ers) / Z(er)  =  u(ers) / (u(era) + u(erb) + ··· + u(erz))  =  1.284 / (1.284 + ···)   (12)
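You can reproduce this arithmetic with a few lines of numpy (assuming numpy is available):

```python
import numpy as np

# Embeddings and weights from the example above (d = 2)
e = np.array([-0.5, 1.0])
r = np.array([0.0, 0.5])
s = np.array([0.5, 0.5])
X = np.array([[1.0, 0.0], [0.0, 0.5]])
Y = np.array([[2.0, 0.0], [0.0, 1.0]])

u_ers = np.exp(e @ X @ s + r @ Y @ s)   # e⃗ᵀXs⃗ = 0 and r⃗ᵀY s⃗ = 0.25, so exp(0.25)
print(round(float(u_ers), 3))           # 1.284
```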
Speedup. The example illustrates that the denominator

    Z(xy) = Σ_{z′} u(xyz′) = Σ_{z′} exp( x⃗ᵀ X z⃗′ + y⃗ᵀ Y z⃗′ )                      (13)

is expensive to compute because of the summation over all z′ in the vocabulary. Fortunately, you can
compute x⃗ᵀXz⃗′ ∈ ℝ simultaneously for all z′.⁸ The results can be found as the elements of the row vector
x⃗ᵀXE, where E is a d × V matrix whose columns are the embeddings of the various words z′ in the
vocabulary. This is easy to see, and computing this vector still requires just as many scalar multiplications
and additions as before ... but we have now expressed the computation as a pair of vector-matrix
multiplications, (x⃗ᵀX)E, which you can perform using a matrix library (such as numpy or jblas). That can
be considerably faster than looping over all z′. That is because the library call is highly optimized and
exploits hardware support for matrix operations (e.g., parallelism).
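Here is a sketch of that speedup in numpy, checked against the naive loop (function names are ours):

```python
import numpy as np

def Z_vectorized(x_vec, y_vec, X, Y, E):
    """Z(xy) via equation (13), with the sum over z′ done by matrix operations.
    E is d × V: column j holds the embedding of the j-th vocabulary word."""
    scores = (x_vec @ X) @ E + (y_vec @ Y) @ E   # length-V vector of x⃗ᵀXz⃗′ + y⃗ᵀY z⃗′
    return np.exp(scores).sum()

def Z_loop(x_vec, y_vec, X, Y, E):
    """Reference implementation: an explicit (slow) loop over the vocabulary."""
    return sum(np.exp(x_vec @ X @ E[:, j] + y_vec @ Y @ E[:, j])
               for j in range(E.shape[1]))

rng = np.random.default_rng(0)
d, V = 3, 7
x_vec, y_vec = rng.normal(size=d), rng.normal(size=d)
X, Y, E = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, V))
print(np.isclose(Z_vectorized(x_vec, y_vec, X, Y, E), Z_loop(x_vec, y_vec, X, Y, E)))  # True
```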
F.5 Other smoothing schemes
Numerous other smoothing schemes exist. In past years, for example, our course assignments have used
Katz backoff with Good–Turing discounting. (Good–Turing is attractive but a bit tricky in practice, and
has to be combined with backoff.)
In practical settings, the most popular n-gram smoothing scheme is something called modified
Kneser–Ney. One can also use a more principled Bayesian method based on the hierarchical Pitman–Yor
process—the result is very close to modified Kneser–Ney.
Remember: While these techniques are effective, a really good language model would do more than
just smooth n-gram probabilities well. To predict a word sequence as accurately as a human can finish
another human's sentence, it would go beyond the whole n-gram family to consider syntax, semantics,
and topic. This is an active area of research that uses grammars, recurrent neural networks, and other
techniques.
G Safe practices for working with log-probabilities
G.1 Use natural log for internal computations
In this assignment, as in most of mathematics, log means log_e (the log to base e, or natural log, sometimes
written ln). This is also the standard behavior of the log function in most programming languages.
⁸The same trick works for y⃗ᵀY z⃗′, of course.
With natural log, the calculus comes out nicely, thanks to the fact that (d/dZ) log Z = 1/Z. It's only with
natural log that the gradient of the log-likelihood of a log-linear model can be directly expressed as
observed features minus expected features.
On the other hand, information theory conventionally talks about bits, and quantities like entropy and
cross-entropy are conventionally measured in bits. Bits are the unit of −log₂ probability. A probability
of 0.25 is reported "in negative-log space" as −log₂ 0.25 = 2 bits. Some people do report that value more
simply as −log_e 0.25 = 1.386 nats. But it is more conventional to use bits as the unit of measurement.
(The term "bits" was coined in 1948 by Claude Shannon to refer to "binary digits," and "nats" was later
defined by analogy to refer to the use of natural log instead of log base 2. The unit conversion factor is
1/log 2 ≈ 1.443 bits/nat.)
Even if you are planning to print bit values, it's still wise to standardize on log_e probabilities for all of
your formulas, variables, and internal computations. Why? They're just easier! If you tried to use negative
log₂ probabilities throughout your computation, then whenever you called the log function or took a
derivative, you'd have to remember to convert the result. It's too easy to make a mistake by omitting this
step or by getting it wrong. So the best practice is to do this conversion only when you print: at that point
convert your log_e probability to bits by dividing by −log 2 ≈ −0.693. This is a unit conversion.
G.2 Avoid exponentiating big numbers (not necessary for this assignment)
Log-linear models require calling the exp function. Unfortunately, exp(710) is already too large for a 64-bit
floating-point number to represent, and will generate a runtime error ("overflow"). Conversely, exp(−746)
is too close to 0 to represent, and will simply return 0 ("underflow").
That shouldn't be a problem for this assignment if you stick to the language ID task. If you are
experiencing an overflow issue there, then your parameters probably became too positive or too negative
as you ran stochastic gradient descent. That could mean that your stepsize was too large, or more likely,
that you have a bug in your computation of the gradient.
But to avoid these problems elsewhere—including with the spam detection task—the standard trick is to
represent all values "in log-space." In other words, simply store 710 and −746 rather than attempting to
exponentiate them.
But how can you do arithmetic in log-space? Suppose you have two numbers p, q, which you are
representing in memory by their logs, lp and lq.
• Multiplication: You can represent pq by its log, log(pq) = log(p) + log(q) = lp + lq. That is,
multiplication corresponds to log-space addition.
• Division: You can represent p/q by its log, log(p/q) = log(p) − log(q) = lp − lq. That is, division
corresponds to log-space subtraction.
• Addition: You can represent p + q by its log: log(p + q) = log(exp(lp) + exp(lq)) = logsumexp(lp, lq).
(But this is unstable, so you can either write a safe variant or use a version from a library.)
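A safe variant factors the larger exponent out of the sum before exponentiating. A sketch (in practice you could instead use a library version such as scipy.special.logsumexp):

```python
import math

def logsumexp2(lp, lq):
    """Stable log(exp(lp) + exp(lq)): factor out the larger exponent so exp never overflows."""
    if lp == -math.inf and lq == -math.inf:
        return -math.inf        # log(0 + 0)
    m = max(lp, lq)
    return m + math.log(math.exp(lp - m) + math.exp(lq - m))

lp, lq = 710.0, 708.0                 # p and q themselves would overflow a 64-bit float
print(lp + lq)                        # log of the product pq
print(round(logsumexp2(lp, lq), 3))   # log of the sum p + q: 710 + log(1 + e^-2) ≈ 710.127
```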
For training a log-linear model, you can work almost entirely in log space, representing u and Z in
memory by their logs, log u and log Z. In order to compute the expected feature vector in (18) below,
you will need to come out of log space and find p(z′ | xy) = u′/Z for each word z′. But computing
u′ and Z separately is dangerous: they might be too large or too small. Instead, rewrite p(z′ | xy) as
exp(log u′ − log Z). Since u′ ≤ Z, this is exp of a negative number, so it will never overflow. It might
underflow to 0 for some words z′, but that's ok: it just means that p(z′ | xy) really is extremely close to 0,
and so f⃗(xyz′) should make only a negligible contribution to the expected feature vector.
H Training a log-linear model
H.1 The training objective
To implement the conditional log-linear model, the main work is to train θ⃗ (given some training data and
a regularization coefficient C). As usual, you'll set θ⃗ to maximize

    F(θ⃗)  def=  (1/N) Σ_{i=1..N} log p̂(w_i | w_{i−2} w_{i−1})  −  C · Σ_k θ_k²      (14)

where the first term is the log-likelihood and the second term is the L2 regularizer. This is the regularized
log-likelihood per word token. (There are N word tokens.)
So we want θ⃗ to make our training corpus probable, or equivalently, to make the N events in the corpus
(including the final eos) probable on average given their bigram contexts. At the same time, we also want
the weights in θ⃗ to be close to 0, other things equal (regularization).⁹
The regularization coefficient C ≥ 0 can be selected based on dev data.
H.2 Stochastic gradient descent
Fortunately, concave functions like F(θ⃗) in (14) are "easy" to maximize. You can implement a simple
stochastic gradient descent (SGD) method to do this optimization.
More properly, this should be called stochastic gradient ascent, since we are maximizing rather than
minimizing, but that's just a simple change of sign. The pseudocode is given by Algorithm 1. We rewrite
the objective F(θ⃗) given in (14) as an average of local objectives F_i(θ⃗) that each predict a single word, by
moving the regularization term into the summation.

    F(θ⃗) = (1/N) Σ_{i=1..N} [ log p̂(w_i | w_{i−2} w_{i−1}) − (C/N) · Σ_k θ_k² ]     (16)
         = (1/N) Σ_{i=1..N} F_i(θ⃗)                                                   (17)

where F_i(θ⃗) denotes the bracketed summand in (16). The gradient of this average, ∇F(θ⃗), is therefore
the average value of ∇F_i(θ⃗).
⁹As explained on the previous homework, this can also be interpreted as maximizing p(θ⃗ | w⃗)—that is, choosing the most
probable θ⃗ given the training corpus. By Bayes' Theorem, p(θ⃗ | w⃗) is proportional to

    p(w⃗ | θ⃗) · p(θ⃗)                                                                (15)

where the first factor is the likelihood and the second is the prior. Let's assume an independent Gaussian prior over each θ_k,
with variance σ². Then if we take C = 1/(2σ²), maximizing (14) is just maximizing the log of (15). The reason we maximize
the log is to avoid underflow, and because the derivatives of the log happen to have a simple "observed − expected" form (since
the log sort of cancels out the exp in the definition of u(xyz)).
Algorithm 1 Stochastic gradient ascent
Input: Initial stepsize γ0, initial parameter values θ⃗(0), training corpus D = (w1, w2, ..., wN),
regularization coefficient C, number of epochs E
 1: procedure SGATrain
 2:   θ⃗ ← θ⃗(0)
 3:   t ← 0                                  ▷ number of updates so far
 4:   for e : 1 → E :                        ▷ do E passes over the training data, or "epochs"
 5:     for i : 1 → N :                      ▷ loop over summands of (16)
 6:       γ ← γ0 / (1 + γ0 · (2C/N) · t)     ▷ current stepsize—decreases gradually
 7:       θ⃗ ← θ⃗ + γ · ∇F_i(θ⃗)             ▷ move θ⃗ slightly in a direction that increases F_i(θ⃗)
 8:       t ← t + 1
 9:   return θ⃗
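Algorithm 1 transcribes almost line-for-line into Python. A sketch, assuming a callback grad_Fi that returns ∇F_i(θ⃗) (the function names and the toy objective are ours):

```python
import numpy as np

def sga_train(theta0, grad_Fi, N, gamma0, C, epochs):
    """Algorithm 1: stochastic gradient ascent with the decreasing stepsize from line 6."""
    theta = theta0.copy()
    t = 0                                                  # number of updates so far
    for _ in range(epochs):                                # E passes over the data
        for i in range(N):                                 # loop over summands of (16)
            gamma = gamma0 / (1 + gamma0 * (2 * C / N) * t)  # current stepsize
            theta = theta + gamma * grad_Fi(i, theta)      # ascend: + not -
            t += 1
    return theta

# Toy check: with local objectives F_i(theta) = -(theta - a_i)^2, whose gradient is
# -2(theta - a_i), the maximizer of the average objective is the mean of the a_i.
a = np.array([1.0, 2.0, 3.0])
theta = sga_train(np.array([0.0]), lambda i, th: -2 * (th - a[i]),
                  N=3, gamma0=0.1, C=1.0, epochs=200)
print(theta)   # close to [2.0]
```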
Discussion. On each iteration, the algorithm picks some word i and pushes θ⃗ in the direction ∇F_i(θ⃗),
which is the direction that gets the fastest increase in F_i(θ⃗). The updates from different i will partly
cancel one another out,¹⁰ but their average direction is ∇F(θ⃗), so their average effect will be to improve
the overall objective F(θ⃗). Since we are training a log-linear model, our F(θ⃗) is a concave function with a
single global maximum; a theorem guarantees that the algorithm will converge to that maximum if allowed
to run forever (E = ∞).
How far the algorithm pushes θ⃗ is controlled by γ, known as the "step size" or "learning rate." This
starts at γ0, but needs to decrease over time in order to guarantee convergence of the algorithm. The rule
in line 6 for gradually decreasing γ is the one recommended by Bottou (2012), "Stochastic gradient descent
tricks," which you should read in full if you want to use this method "for real" on your own problems.
Note that t increases and the stepsize decreases on every pass through the inner loop. This is important
because N might be extremely large in general. Suppose you are training on the whole web—then the
stochastic gradient ascent algorithm should have essentially converged even before you finish the first
epoch! See reading section I.5 for some more thoughts about epochs.
H.3 The gradient vector
The gradient vector ∇F_i(θ⃗) is merely the vector of partial derivatives ( ∂F_i(θ⃗)/∂θ1, ∂F_i(θ⃗)/∂θ2, ... ),
where F_i(θ⃗) was defined in (16). As you'll recall from the previous assignment, each partial derivative
takes a simple and beautiful form¹¹

    ∂F_i(θ⃗)/∂θ_k = f_k(xyz) − Σ_{z′} p̂(z′ | xy) f_k(xyz′) − (2C/N) θ_k              (18)

(the observed value of feature f_k, minus its expected value according to the current p̂, minus a term that
pulls θ_k towards 0), where x, y, z respectively denote w_{i−2}, w_{i−1}, w_i, and the summation variable
z′ in the second term ranges over all V words in the vocabulary, including oov. This obtains the partial
derivative by summing multiples of three values: the observed feature count in the training data, the
expected feature count according to the current p̂ (which is based on the entire current θ⃗, not just θ_k),
and the current weight θ_k itself.
¹⁰For example, in the training sentence eat your dinner but first eat your words, ∇F3(θ⃗) is trying to raise the
probability of dinner, while ∇F8(θ⃗) is trying to raise the probability of words (at the expense of dinner!) in the same context.
H.4 The gradient for the embedding-based model
When we use the specific model in (7), the feature weights are the entries of the X and Y matrices, as
shown in (9). The partial derivatives with respect to these weights are

    ∂F_i(θ⃗)/∂X_{jm} = x_j z_m − Σ_{z′} p̂(z′ | xy) x_j z′_m − (2C/N) X_{jm}          (19)
    ∂F_i(θ⃗)/∂Y_{jm} = y_j z_m − Σ_{z′} p̂(z′ | xy) y_j z′_m − (2C/N) Y_{jm}          (20)

where as before, we use x⃗, y⃗, z⃗, z⃗′ to denote the embeddings of the words x, y, z, z′. Thus, the update
to θ⃗ (Algorithm 1, line 7) is

    (∀j, m = 1, 2, ..., d)   X_{jm} ← X_{jm} + γ · ∂F_i(θ⃗)/∂X_{jm}                   (21)
    (∀j, m = 1, 2, ..., d)   Y_{jm} ← Y_{jm} + γ · ∂F_i(θ⃗)/∂Y_{jm}                   (22)
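Equations (19) and (20) vectorize nicely: the expected embedding Σ_{z′} p̂(z′ | xy) z⃗′ is a single matrix-vector product. A numpy sketch for one training example (function name ours; E is the d × V embedding matrix from reading section F.4.1):

```python
import numpy as np

def grads_XY(x_vec, y_vec, z_vec, E, X, Y, C, N):
    """Equations (19)-(20) for one training example, vectorized over the vocabulary.
    E is d × V (column j = embedding of vocabulary word j); z_vec is the observed word's embedding."""
    scores = (x_vec @ X) @ E + (y_vec @ Y) @ E   # log u(xyz′) for every z′
    scores = scores - scores.max()               # stabilize before exponentiating
    p = np.exp(scores)
    p /= p.sum()                                 # p̂(z′ | xy) for every z′
    expected_z = E @ p                           # Σ_z′ p̂(z′ | xy) z⃗′
    gX = np.outer(x_vec, z_vec - expected_z) - (2 * C / N) * X
    gY = np.outer(y_vec, z_vec - expected_z) - (2 * C / N) * Y
    return gX, gY
```

The update of line 7 then adds γ times these two matrices to X and Y, as in (21)–(22).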
I Practical hints for stochastic gradient ascent
I.1 Choose your hyperparameters carefully
In practice, the convergence rate of stochastic gradient ascent is sensitive to the initial guess θ⃗(0) and
the learning rate γ0. It's common to take θ⃗(0) = 0⃗ (e.g., initialize all entries of X and Y to 0), and we
recommend trying γ0 = 0.1 for spam detection or γ0 = 0.01 for language ID.
(Note that the assignment asks you to use the hyperparameters recommended above when loglin
is selected (question 6b). This will let you and us check that your implementation is correct. However,
you may want to experiment with different settings, and you are free to use those other settings when
improved is selected, to get better performance.)
¹¹If you prefer to think of computing the whole gradient vector at once, using vector computations, you can equivalently write
this as

    ∇F_i(θ⃗) = f⃗(xyz) − Σ_{z′} p̂(z′ | xy) f⃗(xyz′) − (2C/N) θ⃗.
I.2 Don’t modify the parameters as you compute the gradient
Make sure that at line 7 of Algorithm 1, you compute the entire gradient before modifying ~θ. If you were
to update each parameter immediately aer computing its partial derivative, then the subsequent partial
derivatives would be incorrect.
I.3 Compute the gradient efficiently
Optimization should be reasonably fast (seconds per epoch).
Be careful that you compute the second term of (18) in time O(V). If you're not careful, you might
use O(V²) time, because each of the V summands has a factor p̂(z′ | xy) whose denominator Z(xy) itself
takes O(V) time to compute. The salvation is that xy remains the same throughout example i. So when
you're computing a gradient vector ∇F_i(θ⃗), you only have to compute Z(xy) once and reuse it for all z′
and all k.¹² Warning: However, as Z(xy) depends on θ⃗, it will become out-of-date once you update θ⃗. You
can't save it to reuse for some future example j where w_{j−2}w_{j−1} also equals xy (not even if j = i so
that you're processing the very same example again).
Constant factors can also play a big role in the speed. For the particular log-linear model used here,
reading section F.4.1 gave a way to speed up the gradient computation. In general, replacing for-loops
with library vector/matrix operations (via numpy) will probably speed you up a lot.
If you really want to get fancy, you could completely eliminate the need to sum over the whole
vocabulary by predicting each word one bit at a time, as in the hierarchical log-bilinear language model of
Mnih and Hinton (2009); then you are making log₂ V binary predictions, instead of a V-way prediction.
I.4 Compute the gradient correctly: e nitedierence check
It’s easy to make a mistake when computing the gradient, which will mess everything up. One way to test
your gradient computation code is to use the nitedierence approximation (see Bouou (2012), section
4.2, or Vieira (2017)). We won’t require this test, but it can help you debug your code; it’s generally wise
to test that what you’re computing really is the gradient!
Suppose you’ve just computed Fi(~θ) for your current i and ~θ. Remember that Fi is differentiable at ~θ (and everywhere), which means it can be approximated locally by a linear function. So try adding a small number δ to one of the weights θk to obtain a new weight vector ~θ′. By the definition of partial derivatives, you’d predict that

Fi(~θ′) ≈ Fi(~θ) + δ · ∂Fi(~θ)/∂θk.
You can check this prediction by actually evaluating Fi(~θ′). By making δ small enough (say 1e-6), you should be able to make the prediction match the truth arbitrarily well (say to within 1e-10). If you can’t, then either there’s a bug in your gradient computation or (unlikely) you need to be using higher-precision floating-point arithmetic.
(Your program doesn’t have to check every single partial derivative if that’s too slow: each time you compute a partial derivative, you can flip a weighted coin to decide whether or not to check it. Once you’re sure that the gradient computation is correct, you can turn the checks off.)
More generally, you can check that Fi(~θ + ~δ) ≈ Fi(~θ) + ~δ · ∇Fi(~θ), where ~δ is any small vector: perhaps a vector of small random numbers, or a vector that is mostly 0’s, or a small multiple of ∇Fi(~θ)
12Alternatively, note that you can rewrite the second term from the previous footnote as (Σz′ u(xyz′) ~f(xyz′)) / (Σz′ u(xyz′)). Then a single loop over z′ serves to compute the numerator (a vector) and the denominator (the scalar Z(xy)) in parallel. You can then increment ~θ by γ/Z(xy) times the numerator vector.
itself. (Again, you don’t have to do this check every time you compute a gradient.) A slightly better version is the symmetric finite-difference approximation,

(1/2) (Fi(~θ + ~δ) − Fi(~θ − ~δ)) ≈ ~δ · ∇Fi(~θ),

which should be even more accurate, although it requires two extra computations of Fi instead of just one.
Thinking about why all these approximations are valid may help you remember how partial derivatives work.
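If you want to try the check, here is a minimal self-contained sketch of the symmetric version. The function names and the toy objective are invented for illustration; your own Fi and gradient code would plug in.

```python
import random

def finite_difference_check(F, grad_F, theta, delta=1e-6, tol=1e-8):
    """Compare the analytic gradient against the symmetric
    finite-difference approximation:
       (F(theta + d) - F(theta - d)) / 2  ≈  d · grad_F(theta)."""
    g = grad_F(theta)
    # A small random perturbation vector ~δ.
    d = [delta * (random.random() - 0.5) for _ in theta]
    plus = F([t + di for t, di in zip(theta, d)])
    minus = F([t - di for t, di in zip(theta, d)])
    predicted = sum(di * gi for di, gi in zip(d, g))
    actual = (plus - minus) / 2
    return abs(predicted - actual) < tol

# Toy concave objective F(θ) = -(θ0² + 3θ1²) with gradient (-2θ0, -6θ1).
F = lambda th: -(th[0] ** 2 + 3 * th[1] ** 2)
grad = lambda th: [-2 * th[0], -6 * th[1]]
ok = finite_difference_check(F, grad, [0.7, -1.2])
print(ok)  # True
```

In your code you would call such a check occasionally during training (the weighted-coin idea above) and disable it once the gradient is trusted.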
I.5 How much training is enough?
In reading H.2, you may have wondered how to choose E, the number of epochs. The following improvements are not required for the assignment, but they might help. You should read this section in any case.
We are using a fixed number of epochs only to keep things simple. A better approach is to continue running until the function appears to have converged “well enough.” For example, you could stop if the average gradient over the past epoch (or the past m examples) was very small. Or you could evaluate the accuracy or cross-entropy on development data at the end of each epoch (or after each group of m examples), and stop if it has failed to improve (say) 3 times in a row; then you can use the parameter vector ~θ that performed best on development data.
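The dev-set early-stopping strategy just described might be sketched as follows. Here `train_epoch` and `dev_loss` are hypothetical stand-ins for your own SGD pass and your dev-data evaluation; only the control flow is the point.

```python
def train_with_early_stopping(train_epoch, dev_loss, theta,
                              patience=3, max_epochs=100):
    """Keep the parameter vector that scored best on development data;
    stop once the dev loss has failed to improve `patience` times in a row."""
    best_theta, best_loss, bad = list(theta), float("inf"), 0
    for _ in range(max_epochs):
        theta = train_epoch(theta)   # one SGD pass over the training data
        loss = dev_loss(theta)       # e.g. cross-entropy on dev data
        if loss < best_loss:
            best_theta, best_loss, bad = list(theta), loss, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_theta

# Simulated run: the scripted dev loss bottoms out at epoch 2 and then
# keeps rising, so training stops early and returns epoch 2's parameters.
scripted = iter([3.0, 2.0, 2.5, 2.6, 2.7, 1.0])
best = train_with_early_stopping(lambda th: [th[0] + 1],
                                 lambda th: next(scripted), [0])
print(best)  # [2]
```

Returning the best-on-dev parameters (rather than the final ones) is what makes this "early stopping" rather than just a stopping rule.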
In theory, stochastic gradient descent shouldn’t even use epochs. There should only be one loop, not two nested loops. At each iteration, you pick a random example from the training corpus and update ~θ based on that example. That’s why it is called “stochastic” (i.e., random). The insight here is that the regularized log-likelihood per token, namely F(~θ), is actually just the average value of Fi(~θ) over all of the examples (see (16)). So if you compute the gradient on one example, it is the correct gradient on average (since the gradient of an average is the average gradient). So line 7 is going in the correct direction on average if you choose a random example at each step.
In practice, a common approach to randomization is to still use epochs, so that each example is visited once per epoch, but to shuffle the examples into a random order at the start of training or at the start of each epoch. To see why shuffling can help, imagine that the first half of your corpus consists of Democratic talking points and the second half consists of Republican talking points. If you shuffle, your stochastic gradients will roughly alternate between the two, like alternating between left and right strokes when you paddle a canoe; thus, your average direction over any short time period will be roughly centrist. By contrast, since Algorithm 1 doesn’t shuffle, it will paddle left for half of each epoch and then right for the other half, which will make significantly slower progress in the desired centrist direction.
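A per-epoch shuffle only takes a few lines to add to the SGD loop. This sketch uses a toy one-dimensional concave objective just to show the control flow; the names are illustrative, and the update adds γ times the gradient because the objective is being maximized.

```python
import random

def sgd_shuffled(examples, theta, grad_Fi, gamma=0.1, epochs=100, seed=0):
    """SGD over epochs, shuffling the examples into a fresh random order
    at the start of each epoch so updates are not biased by corpus order."""
    rng = random.Random(seed)
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            g = grad_Fi(theta, examples[i])   # gradient on one example
            # Ascend, since we are maximizing a concave objective.
            theta = [t + gamma * gk for t, gk in zip(theta, g)]
    return theta

# Toy objective: Fi(θ; x) = -(θ - x)², whose gradient is 2(x - θ).
# The average of Fi over the examples is maximized at their mean, 2.0.
theta = sgd_shuffled([1.0, 3.0], [0.0], lambda th, x: [2 * (x - th[0])])
print(theta[0])  # close to 2.0
```

With a fixed step size the iterate keeps oscillating slightly between the pulls of the two examples, which is exactly the canoe-paddling picture: the short-term average direction is centrist.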
J Ideas for log-linear features
Here are some ideas for extending your log-linear model. Most of them are not very hard, although training may be slow. Or you could come up with your own!
Adding features means throwing some more parameters into the definition of the unnormalized probability. For example, extending the definition (7) with additional features (in the case d = 2) gives

u(xyz) def= exp(~x⊤X~z + ~y⊤Y~z + θ9 f9(xyz) + θ10 f10(xyz) + ...)   (23)
         = exp(θ1 f1(xyz) + ... + θ8 f8(xyz) + θ9 f9(xyz) + θ10 f10(xyz) + ...)   (24)

where θ1 f1(xyz) + ... + θ8 f8(xyz) is as defined in (10)–(11).
J.1 Unigram log-probability
A possible problem with the model so far is that it doesn’t have any parameters that keep track of how frequent specific words are in the training corpus! Rather, it backs off from the words to their embeddings. Its probability estimates are based only on the embeddings.
One way to fix that (see section J.2 below) would be to have a binary feature fw for each word w in the vocabulary, such that fw(xyz) is 1 if z = w and 0 otherwise.
But first, here’s a simpler method: just add a single non-binary feature defined by

funigram(xyz) = log p̂unigram(z)   (25)

where p̂unigram(z) is estimated by add-1 smoothing. Surely we have enough training data to learn an appropriate weight for this single feature. In fact, because every training token wi provides evidence about this single feature, its weight will tend to converge quickly to a reasonable value during SGD.
This is not the only feature in the model; as usual, you will use SGD to train the weights of all features to work together, computing the gradient via (18). Let β = θunigram denote the weight that we learn for the new feature. By including this feature in our definition of u(xyz), we are basically multiplying a factor of (p̂unigram(z))^β into the numerator u(xyz) (check (5) to see that this is true). This means that in the special case where β = 1 and X = Y = 0, we simply have u(xyz) = p̂unigram(z), so that the LOGLIN model gives exactly the same probabilities as the add-1 smoothed unigram model p̂unigram. However, by training the parameters, we might learn to trust the unigram model less (0 < β < 1) and rely more on the word embeddings (X, Y ≠ 0) to judge which words z are likely in the context xy.
A simpler way to implement this scheme is to define

funigram(xyz) = log(c(z) + 1)   (where c(z) is the count of z in training data)   (26)

This gives the same model, since p̂unigram(z) is just c(z) + 1 divided by a constant, and our model renormalizes u(xyz) by a constant anyway.
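Equation (26) is cheap to implement. A minimal sketch, assuming tokens are plain strings (the helper name is invented):

```python
from collections import Counter
import math

def make_unigram_feature(train_tokens):
    """Return f_unigram with f_unigram(z) = log(c(z) + 1), which equals
    the add-1 smoothed log-unigram-probability up to an additive constant
    (the model's normalization absorbs that constant)."""
    c = Counter(train_tokens)
    # Counter returns 0 for unseen words, so this never takes log of 0.
    return lambda z: math.log(c[z] + 1)

f_unigram = make_unigram_feature(["the", "cat", "the"])
print(f_unigram("the"))  # log 3 ≈ 1.0986
print(f_unigram("dog"))  # 0.0 for unseen words
```

The feature value only needs to be computed once per candidate word, so this adds essentially no cost to training.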
J.2 Unigram, bigram, and trigram indicator features

Try adding a unigram feature fw for each word w in the vocabulary. That is, fw(xyz) is 1 if z = w and 0 otherwise. Does this work better than the log-unigram feature from section J.1?
Now try also adding a binary feature for each bigram and trigram that appears at least 3 times in
training data. How good is the resulting model?
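One way to collect such indicator features, thresholding bigrams and trigrams at 3 training occurrences as suggested above, is sketched below. The set-of-keys representation is illustrative, not the provided code’s.

```python
from collections import Counter

def collect_indicator_features(train_trigrams, min_count=3):
    """One binary feature per vocabulary word, plus one per bigram and
    trigram seen at least min_count times in training data."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for x, y, z in train_trigrams:
        uni[z] += 1
        bi[y, z] += 1
        tri[x, y, z] += 1
    feats = {("uni", z) for z in uni}
    feats |= {("bi", yz) for yz, n in bi.items() if n >= min_count}
    feats |= {("tri", xyz) for xyz, n in tri.items() if n >= min_count}
    return feats

def active_features(feats, x, y, z):
    """The indicator features that fire (value 1) on the trigram xyz."""
    candidates = (("uni", z), ("bi", (y, z)), ("tri", (x, y, z)))
    return {k for k in candidates if k in feats}

data = [("a", "b", "c")] * 3 + [("a", "b", "d")]
feats = collect_indicator_features(data)
common = active_features(feats, "a", "b", "c")  # all three levels fire
rare = active_features(feats, "a", "b", "d")    # only the unigram feature
print(len(common), rare)  # 3 {('uni', 'd')}
```

Because at most three of these indicator features fire on any given trigram, the dot product with ~θ stays cheap even though the feature set is huge.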
In all cases, you will want to tune C on development data to prevent overfitting. This is important: the original model had only 2d² + 1 parameters, where d is the dimensionality of the embeddings, but your new model has enough parameters that it can easily overfit the training data. In fact, if C = 0, the new model will exactly predict the unsmoothed probabilities, as if you were not smoothing at all (add-0)! The reason is that the maximum of the concave function F(~θ) = Σ_{i=1}^N Fi(~θ) is achieved when its partial derivatives are 0. So for each unigram feature fw defined in the previous paragraph, we have, from equation (18) with
C = 0,

∂F(~θ)/∂θw = Σ_{i=1}^N ∂Fi(~θ)/∂θw   (27)
           = Σ_{i=1}^N fw(xyz) − Σ_{i=1}^N Σ_{z′} p̂(z′ | xy) fw(xyz′)   (28)

where the first sum in (28) is the observed count of w in the corpus and the second is the predicted count of w in the corpus.
Hence SGD will adjust ~θ until this is 0, that is, until the predicted count of w exactly matches the observed
count c(w). For example, if c(w) = 0, then SGD will try to allocate 0 probability to word w in all contexts
(no smoothing), by driving θw → −∞. Taking C > 0 prevents this by encouraging θw to stay close to 0.
J.3 Embedding-based features on unigrams and trigrams

Oddly, (7) only includes features that evaluate the bigram yz (via weights in the Y matrix) and the skip-bigram xz (via weights in the X matrix). After all, you can see in (9) that the features have the form yj zm and xj zm. This seems weaker than ADDλ BACKOFF. Thus, add unigram features of the form zm and trigram features of the form xh yj zm.
J.4 Embedding-based features based on more distant skip-bigrams

For a log-linear model, there’s no reason to limit yourself to trigram context. Why not look at 10 previous words rather than 2 previous words? In other words, your language model can use the estimate p(wi | wi−10, wi−9, . . . , wi−1).
There are various ways to accomplish this. You may want to reuse the X matrix at all positions i−10, i−9, . . . , i−2 (while still using a separate Y matrix at position i−1). This means that having the word “bread” anywhere in the recent history (except at position wi−1) will have the same effect on wi. Such a design is called “tying” the feature weights: if you think of different positions as having different features associated with them, you are insisting that certain related features have weights that are “tied together” (i.e., they share a weight).
You could further improve the design by saying that “bread” has weaker influence when it is in the more distant past. This could be done by redefining the features: for example, in your version of (9), you could scale down the feature value (xj zm) by the number of word tokens that fall between x and z.13
Note: The provided code has separate methods for 3-grams, 2-grams, and 1-grams. To support general n-grams, you’ll want to replace these with a single method that takes a list of n words. It’s probably easiest to streamline the provided code so that it does this for all smoothing methods.

13A fancier approach is to learn how much to scale down this influence. For example, you could keep the feature value defined as (xj zm), but say that the feature weights for position i−6 (for example) are given by the matrix λ6X. Now X is shared across all positions, but the various multipliers such as λ6 are learned by SGD along with the entries of X and Y. If you learn that λ6 is close to 0, then you have learned that wi−6 has little influence on wi. (In this case, the model is technically log-quadratic rather than log-linear, and the objective function is no longer concave, but SGD will probably find good parameters anyway. You will have to work out the partial derivatives with respect to the entries of λ as well as X and Y.)

J.5 Spelling-based features

The word embeddings were automatically computed based on which words tend to appear near one another. They don’t consider how the words are spelled! So, augment each word’s embedding with additional dimensions that describe properties of the spelling. For example, you could have dimensions that ask whether the word ends in ing, ed, etc. Each dimension will be 1 or 0 according to whether the word has the relevant property.
Just throw in a dimension for each suffix that is common in the data. You could also include properties relating to word length, capitalization patterns, vowel/consonant patterns, etc.: anything that you think might help!
You could easily come up with thousands of properties in this way. Fortunately, a given word such as burgeoning will have only a few properties, so the new embeddings will be sparse. That is, they consist mostly of 0’s with a few nonzero elements (usually 1’s). This situation is very common in NLP. As a result, you don’t need to store all the new dimensions: you can compute them on demand when you are computing summations like Σ_{j=1}^d Σ_{m=1}^d Yjm · (yj zm) in (9). In such a summation, j ranges over possible suffixes of y and m ranges over possible suffixes of z (among other properties, including the original dimensions). To compute the summation, you only have to loop over the few dimensions j for which yj ≠ 0 and the few dimensions m for which zm ≠ 0. (All other summands are 0 and can be skipped.)
It is easy to identify these few dimensions. For example, burgeoning has the ing property but not any of the other 3-letter-suffix properties. In the trigram xyz = demand was burgeoning, the summation would include a feature weight Yjm for j = was and m = ing, which is included because yz has that particular pair of suffixes and so yj zm = 1. In practice, Y can be represented as a hash map whose keys are pairs of properties, such as pairs of suffixes.
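The hash-map representation of Y might look like the following sketch. The suffix list and the property-naming scheme are invented for illustration; a real feature set would include many more properties.

```python
def spelling_properties(word, suffixes=("ing", "ed", "ly")):
    """The few spelling properties of a word that are 1 (here just a
    small invented suffix list)."""
    return {"suffix=" + s for s in suffixes if word.endswith(s)}

def sparse_score(Y, y_word, z_word):
    """Sum Y[j, m] over the active properties j of y and m of z.
    Y is a hash map keyed by pairs of properties; missing pairs weigh 0."""
    return sum(Y.get((j, m), 0.0)
               for j in spelling_properties(y_word)
               for m in spelling_properties(z_word))

# A single learned weight for the (ly, ing) suffix pair, as an example.
Y = {("suffix=ly", "suffix=ing"): 0.5}
print(sparse_score(Y, "quickly", "burgeoning"))  # 0.5
print(sparse_score(Y, "cat", "dog"))             # 0 (no active properties)
```

Only the handful of active property pairs are ever touched, which is why the thousands of possible dimensions never need to be stored explicitly.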
J.6 Meaning-based features

If you can find online dictionaries or other resources, you may be able to obtain other, linguistically interesting properties of words. You can then proceed as with the spelling features above.
J.7 Repetition

As words tend to repeat, you could have a feature that asks whether wi appeared in the set {wi−10, wi−9, . . . , wi−1}. This feature will typically get a positive weight, meaning that recently seen words are likely to appear again. Since 10 is arbitrary, you should actually include similar features for several different history sizes: for example, another feature asks whether wi appeared in {wi−20, wi−19, . . . , wi−1}.
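Such repetition features are easy to compute from the recent history. A small illustrative sketch (the window sizes are the arbitrary choices discussed above):

```python
def repetition_features(history, z, windows=(10, 20)):
    """For each window size n, a binary feature asking whether the
    candidate word z already appeared among the previous n tokens."""
    return {n: int(z in history[-n:]) for n in windows}

history = ["the", "stock", "market", "fell", "as", "traders", "sold"]
print(repetition_features(history, "market"))  # {10: 1, 20: 1}
print(repetition_features(history, "rose"))    # {10: 0, 20: 0}
```

For long documents you would keep a running set (or counts) per window rather than rescanning the list, but the feature definition is the same.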
J.8 Ensemble modeling

Recall that (25) included the log-probability of another model as a feature within your LOGLIN model. You could include other log-probabilities in the same way, such as smoothed bigram or trigram probabilities. The LOGLIN model then becomes an “ensemble model” that combines the probabilities of several other models, learning how strongly to weight each of these other models.
If you want to be fancy, your log-linear model can include various trigram-model features, each of which returns log p̂trigram(z | xy) but only when c(xy) falls into a particular range, and returns 0 otherwise. Training might learn different weights for these features. That is, it might learn that the trigram model is trustworthy when the context xy is well-observed, but not when it is rarely observed.
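The binned trigram features could be sketched as follows. The bin boundaries and the helper names are invented; your smoothed trigram model and your context counts would plug in.

```python
import math

def binned_trigram_features(p_trigram, c, x, y, z,
                            bins=((0, 1), (1, 5), (5, float("inf")))):
    """One feature per count bin: returns log p̂_trigram(z|xy) in the bin
    containing c(xy), and 0 in every other bin, so SGD can learn a
    separate weight for each count range."""
    logp = math.log(p_trigram(x, y, z))
    n = c(x, y)
    return [logp if lo <= n < hi else 0.0 for lo, hi in bins]

# Toy stand-ins for a smoothed trigram model and bigram context counts.
p_trigram = lambda x, y, z: 0.25
counts = {("of", "the"): 7}
c = lambda x, y: counts.get((x, y), 0)
active = binned_trigram_features(p_trigram, c, "of", "the", "people")
print(active)  # only the well-observed bin fires: [0.0, 0.0, log 0.25]
```

During training, the weight learned for the high-count bin will typically be larger, reflecting that the trigram model is more trustworthy in well-observed contexts.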