Character-based Language Models : Assignment 5



Character-based Language Models :
Assignment 5
In the textbook, language modeling was defined as the task of predicting the next word in a sequence given
the previous words. In this assignment, we will focus on the related problem of predicting the next character in
a sequence given the previous characters.
The learning goals of this assignment are to:
Understand how to compute language model probabilities using maximum likelihood estimation
Implement basic smoothing, back-off and interpolation.
Have fun using a language model to probabilistically generate texts.
Use a set of language models to perform text classification.
Try out a more advanced form of language modeling using a Recurrent Neural Network.
Here are the materials that you should download for this assignment:
Skeleton python code (/downloads/hw5/
training data for text classification task (/downloads/hw5/
dev data for text classification task (/downloads/hw5/
test file for leaderboard (/downloads/hw5/cities_test.txt)
Part 1: Unsmoothed Maximum Likelihood
Character-Level Language Models
We’re going to be starting with some nice, compact code for character-level language models
( that was written by Yoav Goldberg
( Below is Yoav’s code for training a language model.
Note: all of this code is included in the provided code stub called . No need to
copy and paste.
Train a language model
Note: we provide you this code in the Python stub file (/downloads/hw5/
def train_char_lm(fname, order=4, add_k=1):
”’ Trains a language model.
This code was borrowed from
fname: Path to a text corpus.
order: The length of the n-grams.
add_k: k value for add-k smoothing. NOT YET IMPLMENTED
A dictionary mapping from n-grams of length n to a list of tuples.
Each tuple consists of a possible net character and its probability.
# TODO: Add your implementation of add-k smoothing.
data = open(fname).read()
lm = defaultdict(Counter)
pad = “~” * order
data = pad + data
for i in range(len(data)-order):
history, char = data[i:i+order], data[i+order]
def normalize(counter):
s = float(sum(counter.values()))
return [(c,cnt/s) for c,cnt in counter.items()]
outlm = {hist:normalize(chars) for hist, chars in lm.items()}
return outlm
fname is a file to read the characters from. order is the history size to consult. Note that we pad the data
with leading ~ so that we also learn how to start.
Now you can train a language model. First grab some text like this corpus of Shakespeare:
$ wget`
Now train the model:
lm = train_char_lm(“shakespeare_input.txt”, order=4)
P(hello world)
Ok. Now we can look-up the probability of the next letter given some history. Here are some example queries:
>>> lm[‘hell’]
[(‘!’, 0.06912442396313365), (‘ ‘, 0.22119815668202766), (“‘”, 0.018433179723502304),
(‘i’, 0.03225806451612903), (‘\n’, 0.018433179723502304), (‘-‘, 0.059907834101382486)
, (‘,’, 0.20276497695852536), (‘o’, 0.15668202764976957), (‘.’, 0.1336405529953917),
(‘s’, 0.009216589861751152), (‘;’, 0.027649769585253458), (‘:’, 0.018433179723502304)
, (‘?’, 0.03225806451612903)]
Actually, let’s pretty print the output, and sort the letters based on their probabilities.
import pprint
import operator
def print_probs(lm, history):
probs = sorted(lm[history],key=lambda x:(-x[1],x[0]))
pp = pprint.PrettyPrinter()
OK, print again:
>>> print_probs(lm, “hell”)
[(‘ ‘, 0.22119815668202766),
(‘,’, 0.20276497695852536),
(‘o’, 0.15668202764976957),
(‘.’, 0.1336405529953917),
(‘!’, 0.06912442396313365),
(‘-‘, 0.059907834101382486),
(‘?’, 0.03225806451612903),
(‘i’, 0.03225806451612903),
(‘;’, 0.027649769585253458),
(‘\n’, 0.018433179723502304),
(“‘”, 0.018433179723502304),
(‘:’, 0.018433179723502304),
(‘s’, 0.009216589861751152)]
This means that hell can be followed by any of these dozen characters:
and that the probability of o given hell is 15.7%, . The most probable character to see
after hell is a space, .
The distribution of letters that occur after worl is different than the distribution of letters that occur after
hell . Here is that distribution:
>>> print_probs(lm, “worl”)
[(‘d’, 1.0)]
What does that mean? It means that in our corpus, the only possible continuation that we observed for worl
was the letter d , and we assign 100% of probability mass to it, .
p(o ∣ hell) = 0.157
p(o ∣ hell) = 0.221
p(d ∣ worl) = 1.0
Let’s write some Shakespeare!
Generating text with the model is simple. To generate a letter, we will look up the last n characters, and then
sample a random letter based on the probability distribution for those letters. Here’s Yoav’s code for that:
from random import random
def generate_letter(lm, history, order):
history = history[-order:]
dist = lm[history]
x = random()
for c,v in dist:
x = x – v
if x <= 0: return c
To generate a passage of text, we just seed it with the initial history and run letter generation in a loop,
updating the history at each turn. We’ll stop generating after a specified number of letters.
def generate_text(lm, order, nletters=500):
”’Generates a bunch of random text based on the language model.

lm: The output from calling train_char_lm.
history: A sequence of previous text.
order: The length of the n-grams in the language model.

A letter
history = “~” * order
out = []
for i in range(nletters):
c = generate_letter(lm, history, order)
history = history[-order:] + c
return “”.join(out)
Now, try generating some Shakespeare with different order n-gram models. You should try running the
following commands.
>>> lm = train_char_lm(“shakespeare_input.txt”, order=2)
>>> print(generate_text(lm, 2))
>>> lm = train_char_lm(“shakespeare_input.txt”, order=3)
>>> print(generate_text(lm, 3))
>>> lm = train_char_lm(“shakespeare_input.txt”, order=4)
>>> print(generate_text(lm, 4))
>>> lm = train_char_lm(“shakespeare_input.txt”, order=7)
>>> print(generate_text(lm, 7))
What do you think? Is it as good as 1000 monkeys working at 1000 typewriters
What the F?
Try generating a bunch of short passages:
>>> print(generate_text(lm, 5, 40))
First, and quence
Shall we gave it. Now
>>> print(generate_text(lm, 5, 40))
First die.
I prithee,
>>> print(generate_text(lm, 5, 40))
First marriage,
And scarce it: wretches
>>> print(generate_text(lm, 5, 40))
First, the midsummer;
We make us allia s
>>> print(generate_text(lm, 5, 40))
First choose
Which now,
Where like thee.
>>> lm = train_char_lm(“shakespeare_input.txt”, order=4)
>>> print(generate_text(lm, 4, 40))
First blood
To grace and leade
>>> print(generate_text(lm, 4, 40))
First, are almightly,
Am I to bedew the
>>> print(generate_text(lm, 4, 40))
First Senato, come unexamination hast br
>>> lm = train_char_lm(“shakespeare_input.txt”, order=2)
>>> print(generate_text(lm, 2, 40))
Histed mor ituffe bonguis hon tract
>>> print(generate_text(lm, 2, 40))
Fir my fat,
Forromfor intre You to lor c
Do you notice anything? They all start with F! In fact, after we hit a certain order, the first word is always First?
Why is that? Is the model trying to be clever? First, generate the word First. Explain what is going on in your
Part 2: Perplexity, smoothing, back-off and
In this part of the assignment, you’ll adapt Yoav’s code in order to implement several of the techniques
described in Section 4.2 of the Jurafsky and Martin textbook (
How do we know whether a LM is good? There are two basic approaches:
1. Task-based evaluation (also known as extrinsic evaluation), where we use the LM as part of some other
task, like automatic speech recognition, or spelling correcktion, or an OCR system that tries to covert a
professor’s messy handwriting into text.
2. Intrinsic evaluation. Intrinsic evaluation tries to directly evalute the goodness of the language model by
seeing how well the probability distributions that it estimates are able to explain some previously
unseen test set.
Here’s what the textbook says:
For an intrinsic evaluation of a language model we need a test set. As with
many of the statistical models in our field, the probabilities of an N-gram model
come from the corpus it is trained on, the training set or training corpus. We can
then measure the quality of an N-gram model by its performance on some
unseen data called the test set or test corpus. We will also sometimes call test
sets and other datasets that are not in our training sets held out corpora
because we hold them out from the training data.
So if we are given a corpus of text and want to compare two different N-gram
models, we divide the data into training and test sets, train the parameters of
both models on the training set, and then compare how well the two trained
models fit the test set.
But what does it mean to “fit the test set”? The answer is simple: whichever
model assigns a higher probability to the test set is a better model.
We’ll implement the most common method for intrinsic metric of language models: perplexity. The perplexity
of a language model on a test set is the inverse probability of the test set, normalized by the number of words.
For a test set W = w1w2. . . wN :
Perplexity(W) = P(w1w2. . . wN)
P(w1w2. . . wN)

= ∏
N 1
P(w ∣ . . . ) i w1 wi−1

OK – let’s implement it. Here’s a possible function signature for perplexity. (We might update it during class on
Wednesday). Give it a go.
def perplexity(test_filename, lm, order=4):
”’Computes the perplexity of a text file given the language model.

test_filename: path to text file
lm: The output from calling train_char_lm.
order: The length of the n-grams in the language model.
test = open(test_filename).read()
pad = “~” * order
test = pad + data

A couple of things to keep in mind:
1. Remember to pad the front of the file
2. Numeric underflow is going to be a problem, so consider using logs.
In your report:
Discuss the perplexity for text that is similar and different from Shakespeare’s plays. We provide you two dev
text files (/downloads/hw5/, a New York Times article and several of Shakespeare’s sonnets, but
feel free to experiment with your own text.
Laplace Smoothing and Add-k Smoothing
Laplace Smoothing is described in section 4.4.1. Laplace smoothing adds one to each count (hence its
alternate name add-one smoothing). Since there are V words in the vocabulary and each one was
incremented, we also need to adjust the denominator to take into account the extra V observations.
A variant of Laplace smoothing is called Add-k smoothing or Add-epsilon smoothing. This is described in
section Add-k 4.4.2. Let’s change the function definition of train_char_lm so that it takes a new argument,
add_k , which specifies how much to add. By default, we’ll set it to one, so that it acts like Laplace
def train_char_lm(fname, order=4, add_k=1):
# Your code here…
PLaplace(wi) =
counti + 1
N + V
Next, let’s implement interpolation. The idea here is to calculate the higher order n-gram probabilities also
combining the probabilities for lower-order n-gram models. Like smoothing, this helps us avoid the problem of
zeros if we haven’t observed the longer sequence in our training data. Here’s the math:
where .
Now, write a back-off function:
def calculate_prob_with_backoff(char, history, lms, lambdas):
”’Uses interpolation to compute the probability of char given a series of
language models trained with different length n-grams.
char: Character to compute the probability of.
history: A sequence of previous text.
lms: A list of language models, outputted by calling train_char_lm.
lambdas: A list of weights for each lambda model. These should sum to 1.

Probability of char appearing next in the sequence.
You should also write a helper function to set the lambdas. Here’s a function definition that gives you access
to a development set. You can also experiment with setting them manually.
# returns a list of lambda values that weight the contribution of n-gram model
def set_lambdas(lms, dev_filename):
”’Returns a list of lambda values that weight the contribution of each n-gram mode
This can either be done heuristically or by using a development set.
lms: A list of language models, outputted by calling train_char_lm.
dev_filename: Path to a development text file to optionally use for tuning the lm
Probability of char appearing next in the sequence.
Part 3: Text Classification using LMs
Pba ( | ) = P( | ) + P( | ) + P( ) ckoff wi wi−2wi−1 λ1 wi wi−2wi−1 λ2 wi wi−1 λ3 wi
λ1 + λ2 + λ3 = 1
Language models can be applied to text classification. If we want to classify a text into a category
. We can pick the category that has the largest posterior probability given the text. That
Using Bayes rule, this can be rewritten as:
If we assume that all classes are equally likely, then we can just drop the term:
Here is the likelihood of under category , which can be computed by training language models
for all texts associated with category . This technique of text classification is drawn from literature on
authorship identification (, where the approach is to
learn a separate language model for each author, by training on a data set from that author. Then, to
categorize a new text D, they use each language model to calculate the likelihood of D under that model, and
pick the category that assigns the highest probability to D.
Try it! We’ll set up a leaderboard for a text classification task. Your job is to configure a set of language models
that perform the best on the text classification task. We will use the city names dataset, which you should
have already downloaded.
Later in this assignment, you will use a recurrent neural network on the same dataset in order to compare
Part 4: Character-Level Recurrent Neural
You will be using Pytorch for this assignment, and instead of providing you source code, we ask you to build
off a couple Pytorch tutorials. Pytorch is one of the most popular deep learning frameworks in both industry
and academia, and learning its use will be invaluable should you choose a career in deep learning.
Using miniconda
Miniconda is a package, dependency and environment management for python (amongst other languages). It
lets you install different versions of python, different versions of various packages in different environments
which makes working on multiple projects (with different dependencies) easy.
There are two ways to use miniconda,
1. Use an existing installation from another user (highly recommended): On biglab , add the
following line at the end of your ~/.bashrc file.
export PATH=”/home1/m/mayhew/miniconda3/bin:$PATH”
c ∈ C = c1 , . . . , cN c
c = argma P(c|D) ∗ xc∈C
c = argma P(D|c)P(c) ∗ xc∈C
= argmax P(D|c) c∈C
P(D ∣ c) D c
Then run the following command
source ~/.bashrc
If you run the command $ which conda , the output should be
/home1/m/mayhew/miniconda3/bin/conda .
2. Installing Miniconda from scratch: On biglab , run the following commands. Press Enter/Agree to all
prompts during installation.
$ wget
$ chmod +x
$ bash
After successful installation, running the command $ which conda should output
/home1/m/$USERNAME/miniconda3/bin/conda .
Installing Pytorch and Jupyter
For this assignment, you’ll be using Pytorch ( and Jupyter (
1. If you followed our recommendation and used the existing miniconda installation from 1. above, you’re
good to go. Stop wasting time and start working on the assignment!
2. Intrepid students who installed their own miniconda version from 2. above, need to install their own
copy of Pytorch and Jupyter. To install Pytorch, run the command
conda install pytorch-cpu torchvision -c pytorch
To check, run python and import torch . This should run without giving errors. To install jupyter, run
the command (it might take a while)
conda install jupyter
Running the command jupyter –version should yield the version installed.
How to use Jupyter notebook
For this homework, you have the option of using jupyter notebook (, which lets you
interactively edit your code within the web browser. Jupyter reads files in the .ipynb format. To launch from
biglab, do the following.
1. On biglab , navigate to the directory with your code files and type jupyter notebook –port 8888
–no-browser .
2. In your local terminal, set up port forward by typing ssh -N -f -L
localhost:8888:localhost:8888 [email protected] .
3. In your local web browser, navigate to localhost:8888 .
What’s a char-rnn?
Good question! Andrej Karpathy, a researcher at OpenAI, has written an excellent blog post which you should
read before beginning this assignment. (
Karpathy shows results for several fun domains:
Shakespeare plays
Paul Graham essays
LaTeX documents
Linux source code
Baby names
In this assignment you will follow the Pytorch tutorial code to implement your own char-rnn, and then test it on
a dataset of your choice. You will also train on our provided training set, and submit to the leaderboard, where
we will measure your model’s complexity on our test set.
Classification using char-rnn
Follow the tutorial code
Read through the tutorial here (
that builds a char-rnn that is used to classify baby names by their country of origin. While we strongly
recommend you carefully read through the tutorial, you will find it useful to build off the released code here
Make sure you can reproduce the tutorial’s results on the tutorial’s provided baby-name dataset before
moving on.
Switch to city names dataset
Download the city names dataset.
training sets (/downloads/hw5/
validation set (/downloads/hw5/
test file for leaderboard (/downloads/hw5/cities_test.txt)
Modify the tutorial code to instead read from city names dataset. The tutorial code problematically used the
same text file for both training and evaluation. We learned in class about how this is not a great idea. For the
city names dataset we provide you separate train and validation sets, as well as a test file for the leaderboard.
All training should be done on the train set and all evaluation (including confusion matrices and accuracy
reports) on the validation set. You will need to change the data processing code to get this working.
Specifically, you’ll need to modify the code in the 3rd code block to create two variables
category_lines_train and category_lines_val . In addition, to handle unicode, you might need to
replace calls to open with calls to, “r”,encoding=’utf-8′,
errors=’ignore’) .
Warning: you’ll want to lower the learning rating to 0.02 or less or you might get NaNs when training.
Attribution: the city names dataset is derived from Maxmind
(’s dataset.
Experimentation and Analysis
Complete the following analysis on the city names dataset, and include your finding in the report.
1. Write code to output accuracy on the validation set. Include your best accuracy in the report. (For a
benchmark, the TAs were able to get accuracy above 50%) Discuss where your model is making
mistakes. Use a confusion matrix plot to support your answer.
2. Modify the training loop to periodically compute the loss on the validation set, and create a plot with the
training and validation loss as training progresses. Is your model overfitting? Include the plot in your
report. TODO: Phrase better?
3. Experiment with the learning rate. You can try a few different learning rates and observe how this affects
the loss. Another common practice is to drop the learning rate when the loss has plateaued. Use plots
to explain your experiments and their effects on the loss.
4. Experiment with the size of the hidden layer or the model architecture How does this affect validation
Write code to make predictions on the provided test set. The test set has one unlabeled city name per line.
You code should output a file labels.txt with one two-letter country code per line. Extra credit will be given
to the top leaderboard submissions. Here are some ideas for improving your leaderboard performance:
Play around with the vocabulary (the all_letters variable), for example modifying it to only include
lowercase letters, apostrophe, and the hyphen symbol.
Test out label smoothing
Try a more complicated architecture, for example, swapping out the RNN for LSTM or GRU units.
Use a different initalization for the weights, for example, small random values instead of 0s
In your report, describe your final model and training parameters.
Part 5: Text generation using char-rnn
In this section, you will be following more Pytorch tutorial code in order to reproduce Karpathy’s text
generation results. Read through the tutorial here
(, and then download this ipython
notebook ( to base your own code
You will notice that the code is quite similar to that of the classification problem. The biggest difference is in
the loss function. For classification, we run the entire sequence through the RNN and then impose a loss only
on the final class prediction. For the text generation task, we impose a loss at each step of the RNN on the
predicted character. The classes in this second task are the possible characters to predict.
Experimenting with your own dataset
Be creative! Pick some dataset that interests you. Here are some ideas:
ABC music format (
This assignment is based on The Unreasonable Effectiveness of Character-level Language Models
( by Yoav Goldberg and The Unreasonable
Effectiveness of Recurrent Neural Networks ( by Andrej
Karpathy. Daphne Ippolito, John Hewitt, and Chris Callison-Burch adapted their work into a homework assignment
for UPenn’s CIS 530 class in Spring 2018.
Donald Trump speeches (
Webster dictionary (
Jane Austen novels (
Here are the deliverables that you will need to submit:
code (.zip). It should be written in Python 3 and include a README.txt briefly explaining how to run
labels.txt predictions for leaderboard.
Recommended readings
Language Modeling with N-grams. ( Dan Jurafsky and James H.
Martin. Speech and Language Processing (3rd edition draft) .
The Unreasonable Effectiveness of Recurrent Neural Networks. ( Andrej Karpathy. Blog post. 2015.
The Unreasonable Effectiveness of Character-level Language Models.
( Yoav Goldberg. Response to Andrej
Karpathy’s blog post. 2015.
Language Independent Authorship Attribution using Character Level Language Models.
( Fuchun Pen, Dale Schuurmans, Vlado Keselj, Shaojun
Wan. EACL 2003.


There are no reviews yet.

Be the first to review “Character-based Language Models : Assignment 5”

Your email address will not be published.