601.465/665 — Natural Language Processing
Assignment 6: Finite-State Programming

This assignment exposes you to finite-state programming. You will build finite-state machines automatically
using open-source toolkits. You'll also get to work with word pronunciations and character-level edits.
Collaboration: You may work in pairs on this assignment, as it is fairly long and involved. That is, if you
choose, you may collaborate with one partner from the class, handing in a single homework with multiple names
on it. However:
1. You are expected to do the work together, not divide it up: your solutions should emerge from collaborative
real-time discussions with the whole group present.
2. Your README file should describe at the top what each of you contributed, so that we know you shared the
work fairly.
3. Your partner must not be the same partner as you had for HW5. Make new friends! 🙂
In any case, observe academic integrity and never claim any work by third parties as your own.
Reading: There is no separate reading this time. Instead, we'll give you information and instructions as you work
through the assignment.
What to hand in (via Gradescope): As usual, you should submit a README.pdf file with answers to all the
questions in the text. We'll also ask you to submit a .zip archive of all the grammar files you create:

File          Questions   Should contain
binary.grm    1, 2, 6     First, Second, Disagreements, Triplets, NotPillars, Oddlets, WFlip, WeightedMultipath
rewrite.grm   3, 4        Cross, BitFlip1, BitFlip2, Parity1, Parity2, Parity3, UnParity, Split
chunker.grm   5a          NP, MakeNmod, TransformNP, BracketTransform, BracketResults
noisy.grm     7, 8, 9     CompleteWord, DelSpaces, SpellText, Generate, RandomWord, Spell (revised), PrintText (revised)
Software (not Python this time!):
• OpenFST is a very efficient C++ toolkit for building and manipulating semiring-weighted FSMs. You can use
the C++ API to directly specify states, arcs, and weights, or to combine existing FSMs through operations like
union and composition. You can also store these FSMs in .fst files and manipulate them with command-line
utilities like fstunion and fstcompose.
A symbol table (.sym file, or part of some .fst files) specifies the internal representation of the upper or
lower alphabet. E.g., the integers 1, 2, 3, ... might internally represent the letters a, b, c, ... or perhaps the
words aardvark, aback, abacus, ... . OpenFST uses these integers to label arcs in its data structures and file
format (epsilon arcs are labeled with 0). It is only the symbol table that tells what a given FSM's integer labels are
supposed to mean.
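For concreteness, a symbol table is just a small text file pairing each symbol with its integer. A minimal illustrative sketch (these particular symbols are made up, not from any file in the assignment):

```
<epsilon>   0
a           1
b           2
zygote      3
```

Here arc label 0 is reserved for epsilon, and labels 1-3 stand for whatever symbols the table declares.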
However, we will not be using OpenFST directly (nor its Python interface, Pynini). Instead, we will use two packages
that provide a fairly friendly interface to OpenFST:
∗Many thanks to Frank Ferraro, who co-wrote this assignment and wrote the accompanying scripts, and to Jason Eisner.
• Thrax is an extended regular expression language that you can use to define collections of finite-state machines. A Thrax grammar can be created with any text editor and is stored in a .grm file. The Thrax compiler
compiles this into a .far file, an "fst archive" that contains multiple named OpenFST machines and symbol
tables.
• Since regular expressions are not good at specifying the topology of n-gram models, there's also the NGram
toolkit, which builds an n-gram backoff language model from a corpus. It supports many types of smoothing.
The resulting language model is represented as a weighted FSA in OpenFST format.
First, get set up!
Optional: OpenFST, NGram, and Thrax are installed on the ugrad machines (as well as the graduate network).
It's probably easiest to do the assignment there. But if you would prefer to install a copy on your own machine,
1. Download and install OpenFST:
http://www.openfst.org/twiki/bin/view/FST/FstDownload
• Important: Make sure you run configure with the flag --enable-far=yes. If you don't, Thrax won't
work!
• You should also use the flag --enable-ngram-fsts=yes.
• Do not use --enable-static=no.
• After installation, you may need to set the environment variable LD_LIBRARY_PATH to where the OpenFST libraries are.
2. Download and install Thrax:
http://www.openfst.org/twiki/bin/view/GRM/ThraxDownload
3. Download and install NGram:
http://www.openfst.org/twiki/bin/view/GRM/NGramDownload
4. To view drawings of FSMs, download and install graphviz:
http://www.graphviz.org/Download.php.
On Linux systems, you can just do sudo apt-get install graphviz.
5. Download a copy of the assignment directory hw-ofst from the ugrad network: /usr/local/data/cs465/hw-ofst/.
You can copy this directory to your local machine or symlink to it on ugrad.
Look in the hw-ofst directory. Our scripts are in the bin subdirectory, which you should probably add to your
PATH so that you can execute these scripts without saying where they live. Run the following command (with the
bash shell), and maybe put it in your ~/.bashrc so that it will be executed automatically next time you log in:
export PATH=${PATH}:/usr/local/data/cs465/hw-ofst/bin
We've given you a script grmtest to help streamline the compilation and testing of Thrax code. Its usage is:
grmtest <grm file> <transducer_name> [max number output lines]
This script compiles the specified .grm file into a .far file (using a makefile produced by thraxmakedep), and
then passes the standard input through the exported FST named by <transducer_name>. You'll get to try it
out below.
Warning: If the output string is the empty string ε, then for some reason grmtest skips printing it. This seems
to be a bug in thraxrewritetester, which grmtest calls. Just be aware of it.
1. Now get to know Thrax. We highly recommend looking through the online manual[2] and perhaps the
commented examples that come with Thrax.[3] The following tutorial leads you through some of the basic FSM
operations you can do in Thrax.
(a) Let's first define some simple FSMs over a binary alphabet. Type the following declarations into a new
file binary.grm.
Zero = "0";
One = "1";
Bit = Zero | One;
export First = Optimize[Zero Zero* Bit* One One One One?];
This defines four named FSMs using Thrax's regular expression syntax (http://www.openfst.org/
twiki/bin/view/GRM/ThraxQuickTour#Standard_Library_Functions_Opera).[4] Each definition
ends in a semicolon. The first and second FSMs accept only the strings 0 and 1, respectively. The third
defines our entire alphabet, and hence accepts either 0 or 1. The fourth accepts some subset of Bit*.
We can compile this collection of named FSMs into an fst archive (a ".far file"). More precisely, the
archive provides only the FSMs that have been marked with an export declaration; so here Zero, One,
and Bit are just intermediate variables that help define the exported FSM First.
i. Try compiling and running it using our grmtest script:
$ grmtest binary.grm First
[compiler messages appear here]
Input string: [type your input here]
The FSA First is interpreted as the identity FST on the corresponding language. So entering an
input string will transduce it to itself if it is in that language, and otherwise will fail to transduce.
Type Ctrl-D to quit.
You’ll get an error if you try running grmtest binary.grm Zero, because Zero wasn’t exported.
ii. (Q1) What language does First accept (describe it in English)? Why are 0 and 1 quoted in the .grm file?
iii. Let's get information about First. First, we need to extract the FSA First from the FST archive:[5]
$ far2fst binary.far First
Now use the fstinfo shell command[6] to analyze First.fst:
$ fstinfo First.fst
(Q2) Look over this output: how many states are there? How many arcs?
[2] http://www.openfst.org/twiki/bin/view/GRM/ThraxQuickTour
[3] /usr/local/share/thrax/grammars/ on the ugrad or grad machines.
[4] Whereas XFST defines many special infix operators, Thrax instead writes most operators (and all user-defined functions) using the
standard form Function[arguments]. Thrax uses square brackets [] for these function calls, and parentheses () for grouping. Optionality
is denoted with ? and composition with @. There is apparently no way to write a wildcard that means "any symbol"; you need to define
Sigma = "a"|"b"|"c"|... and then you can use Sigma within other regular expressions.
[5] Use far2fst binary.far to extract all exported FSTs, or far2fst binary.far First Second to extract multiple ones that you
specify.
[6] http://man.sourcentral.org/f14/1+fstinfo
iv. Optionally, look at a drawing of First (as an identity transducer over the alphabet {0, 1}):
$ fstview First.fst
Note that fstview is a wrapper script that we are providing for you.[7] The picture will take a few
seconds to appear if the graphics pixels are being sent over a remote X connection.[8]
(b) Now let’s look at equivalent ways of describing the same language.
i. Can you find a more concise way of defining First's language? Add it to binary.grm as a new
regexp Second, of the form
export Second = Optimize[ …fill something in here… ];
Run grmtest to check that First and Second seem to behave the same on some inputs.
ii. Here's how to check that First and Second really act the same on all possible inputs, i.e., that they
define the same language:
export Disagreements = Optimize[ (First - Second) | (Second - First) ];
(Q3) If First and Second are equivalent, then what strings should Disagreements accept?
(Q4) To check that, run fstinfo on Disagreements.fst. From the output, can you conclude that
First and Second must be equivalent?
Note: The fstequal utility is another way to check:
if fstequal First.fst Second.fst; then echo same; else echo different; fi
One way to program fstequal would be to construct the Disagreements FSA.
(c) You might have wondered about those Optimize[ ... ] functions. The Thrax documentation notes
that Optimize
...involves a combination of removing epsilon arcs, summing arc weights, and determinizing and
minimizing the machine ... [Details are here.]
To find out what difference that made, make a new file binary-unopt.grm that is a copy of binary.grm
with the Optimize functions removed. Then try:
grmtest binary-unopt.grm First # and type Ctrl-D to exit
far2fst binary-unopt.far
fstview First.fst Second.fst Disagreements.fst
Questions:
i. (Q5) Although First and Second may be equal, their unoptimized FSMs have different sizes and different
topologies, reflecting the different regular expressions that they were compiled from. How big is
each one?
ii. (Q6) The drawing of the unoptimized Disagreements.fst shows that it immediately branches at the
start state into two separate sub-FSAs. Why? (Hint: Look back at the regexp that defined Disagreements.)
iii. Now test some sample inputs with
grmtest binary-unopt.grm First
(Q7) How are the results different from the optimized version? Why?
[7] After printing fstinfo, it calls fstdraw to produce a logical description of the drawing, then makes the drawing using the Graphviz
package's dot command, and finally displays the drawing using evince. Each of these commands has many tweakable options. What if
you're running on your own machine and don't have evince? Then edit the fstview script to use a different PDF viewer such as xreader,
atril, xpdf, or acroread.
[8] If you start getting "can't open display" errors, then try connecting via ssh -Y instead of ssh -X. An alternative is to copy the (small)
.pdf file to your local machine and use your local image viewer. Mac users might also like the free remote file browser Cyberduck.
(d) You may not want to call Optimize on every machine or regular sub-expression. The documentation
offers the following warning:
When using composition, it is often a good idea to call Optimize[] on the arguments; some compositions can be massively sped up via argument optimization. However, calling Optimize[] across the
board (which one can do via the flag --optimize_all_fsts) often results in redundant work and can
slow down compilation speeds on the whole. Judicious use of optimization is a bit of a black art.
(Q8) If you optimize Disagreements without first optimizing First and Second, what do you get and why?
2. Now try some slightly harder examples. Extend your binary.grm to also export FSAs for the following
languages. (You are welcome to define helper FSAs beyond these.)
(a) Triplets: Binary strings where 1 only occurs in groups of three or more, such as 000000 and
0011100001110001111111.
(b) NotPillars: All binary strings except for even-length strings of 1's: ε, 11, 1111, 111111, 11111111,
... (These correspond to binary numbers of the form 2^(2k) − 1 written in standard form.) Some strings
that are in this language are 0, 1, 000011, 111, 0101, 011110.
(c) Oddlets: Binary strings where 1's only appear in groups of odd length. Be careful testing this one! (Note
that 0000 is in this language, because 1's don't appear in it at all.)
You will extend binary.grm further in question 6.
3. So far we have only constructed FSAs. But Thrax also allows FSTs. Create a new file rewrite.grm in which
you will define some FSTs for this question and the next question.
Complicated FSTs can be built up from simpler ones by concatenation, union, composition, and so on. But
where do you get some FSTs to start with? You need the built-in : operator:
input : output
which gives the cross-product of the input and output languages.
In addition, any FSA also behaves as the identity FST on its language.
Place the following definition into rewrite.grm:
export Cross = "a" (("b":"x")* | ("c"+ : "y"*) | ("":"fric")) "a";
Note that "" denotes the empty string ε.
(a) (Q9) What is the input language of this relation (answer with an ordinary regexp)?
(b) (Q10) Give inputs that are mapped by Cross to 0 outputs, 1 output, 2 outputs, and more than 2 outputs.
(c) How would you describe the Cross relation in English? (You do not have to hand in an answer for this,
but at least think about it.)
(d) (Q11) Make an Optimized version of Cross and look at it with fstview. Is it consistent with your answers
above? How many states and arcs?
Check your answers by transducing some inputs, using the following commands:
grmtest rewrite.grm Cross
grmtest rewrite.grm Cross 3
The second version of the command limits the number of outputs that are printed for each input that you
enter.
4. If you have a simple FST, T, then you can make a more complicated one using context-dependent rewriting.
Thrax's CDRewrite operator is similar to the -> operator in XFST. It applies T "everywhere it can" within the
input string, until there are no unreplaced substrings that could have been replaced. Schematically, an input
string in which two substrings are replaced gets divided into alternating regions:

[contains no replaceable substring] [transduced using T] [contains no replaceable substring] [transduced using T] [contains no replaceable substring]

Each of the 2 replaced substrings appears in an appropriate context: it falls immediately between some
substring that matches Left and some substring that matches Right. The 2 replaced substrings are not allowed
to overlap, but the contexts can overlap with the replaced substrings and with one another.[9]
If you want to require that the maximal substring to the left matches Left, then start Left with the special
symbol [BOS], which can only match at the beginning of the string. Similarly for Right and [EOS] (end of
string).
The above example shows only one way of dividing up the input string into regions that are transduced by
T and regions that are left alone. If there are other ways of dividing up this input string, then the rewrite
transducer will try them too, so it will map this input to multiple outputs.[10]
CDRewrite[T, Left, Right, Any, Dir, Force] specifies a rewrite transducer with arguments
• T : any FST
• Left, Right : unweighted FSAs describing the left and right contexts in which T should be applied.
They may contain the special symbols "[BOS]" and "[EOS]", respectively. (These symbols are only to
be used when describing contexts, as in these arguments to CDRewrite, which interprets them specially.
They do not appear in the symbol table.)
• Any : a minimized FSA for Σ*, where Σ is the alphabet of input and output characters. The FST produced
by CDRewrite will only allow input or output strings that are in Any, so be careful!
[9] If you are wondering how this is accomplished, have a look at Mohri & Sproat (1996), section 3.1.
[10] So there is no notion here of selecting the regions to transduce in some deterministic way, such as the left-to-right longest match used
by the XFST @-> operator.
• Dir : the direction of replacement.
– 'sim' specifies "simultaneous transduction": Left and Right are matched against the original
input string. So all the substrings to replace are identified first, and then they are all transduced in
parallel.
– 'ltr' says to perform replacements "in left-to-right order." A substring should be replaced if Left
matches its left context after any replacements to the left have been done, and Right matches its
right context before any replacements to the right have been done.
– 'rtl' uses "right-to-left" order, the other way around.
• Force : how aggressive to be in replacement?
– 'obl' ("obligatory," like -> in XFST) says that the unreplaced regions may not contain any more
replaceable substrings, as illustrated above. That is, they may not contain a substring that matches
the input language of T and which falls between substrings that match Left and Right.
– 'opt' ("optional," like (->) in XFST) says it's okay to leave replaceable substrings unreplaced. Since
the rewrite transducer has the freedom to replace them or not, it typically has even more outputs
per input.
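It may help to see all six arguments assembled in one toy grammar of our own (not one of the required machines): it rewrites ab as x, but only between a c on the left and a d on the right.

```
Sigma = "a" | "b" | "c" | "d" | "x";
# Rewrite ab -> x obligatorily, left to right, only in the context c _ d.
export ToyRewrite = CDRewrite["ab" : "x", "c", "d", Sigma*, 'ltr', 'obl'];
# e.g.  cabd -> cxd   (ab falls between c and d, so it must be rewritten)
#       abd  -> abd   (no c on the left, so nothing is rewritten)
```

Changing 'obl' to 'opt' would let cabd map to both cabd and cxd.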
Define the following FSTs in your rewrite.grm, and test them out with grmtest:
(a) BitFlip1: Given a string of bits, changes every 1 to 0 and every 0 to 1. This is called the "1's complement" of the input string. Define this without using CDRewrite.
(b) BitFlip2: Like BitFlip1, but now it should work on any string of digits (e.g., transducing 1123401 to
0023410). Define this version using CDRewrite.
Hint: The Any argument in this case should be Digit* where
Bit = "0" | "1";
Digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
(c) Parity1: Transduces even binary numbers to 0 and odd binary numbers to 1. Write this one without
using CDRewrite.
(Q12) It's always good to think through the unusual cases. Some people think the empty string ε is a valid
binary number (representing 0), while others don't. What does your transducer think?
(d) Parity2: Same thing as Parity1, but use the Reverse function in your definition. So start by writing
a transducer that keeps the first bit of its input instead of the last bit.
(e) Parity3: Same thing as Parity1, but this time use CDRewrite in your definition.
(Q13) What does this transducer think about ε?
Hint: You may find it helpful to use [BOS] and [EOS], or the composition operator @.
(f) (Q14) UnParity: Define this as Invert[Parity3]. What does it do?
(g) Split: Split up a binary string by nondeterministically inserting spaces into the middle, so that input
0011 maps to the eight outputs
{0011, 001 1, 00 11, 00 1 1, 0 011, 0 01 1, 0 0 11, 0 0 1 1}
Hint: Use CDRewrite["":" ", ...] and figure out the other arguments.
(h) (✌Q15) Extra credit: SplitThree: Similar to Split, but always splits the input string into exactly three
(nonempty) binary strings. This will produce multiple outputs for strings longer than 3 bits, and no
outputs for strings shorter than 3 bits.
Hint: Compose Split with something else. The composition operator is @.
5. Bit strings are great, but let's move to natural language. You know from HW1 that precisely describing syntax
can be challenging, and from HW4 that recovering the full parse of a sentence can be slow. So, what if you
just want a quick finite-state method for finding simple NPs in a sentence? This could be useful for indexing
text for search engines, or as a preprocessing step that speeds up subsequent parsing. Or it could be part of
a cascade of FSTs that do information extraction (e.g., if you want to know who manufactures what in this
world, you could scan the web for phrases like "X manufactures Y" where X and Y are NPs).
Identifying interesting substrings of a sentence is called "chunking." It is simpler than parsing because the
"chunks" are not nested recursively. This tutorial question will lead you through building an FST that does
simple NP chunking.
We will assume that the input sentence has already been tagged (perhaps by a tagging FST, which you could
compose with this chunking FST). You'll build the following objects:
• An FSA that accepts simple noun phrases: an optional determiner, followed by zero or more adjectives
Adj, followed by one or more nouns Noun. This will match a "base" NP such as the ghastly orange
tie, or Mexico, though not the recursive NP the ghastly orange tie from Mexico that I never
wear.
The regexp defining this FSA is a kind of simple grammar. To make things slightly more interesting, we
suppose that the input has two kinds of determiners: quantifiers (e.g., every) are tagged with Quant
whereas articles (e.g., the) are tagged with Art.
• A transducer that matches exactly the same input as the previous regular expression, and outputs a
transformed version where non-final Noun tags are replaced by Nmod ("nominal modifier") tags. For
example, it would map the input Adj Noun Noun Noun deterministically to Adj Nmod Nmod Noun (as
in delicious peanut butter filling). It would map the input Adj to no outputs at all, since that
input is not a noun phrase and therefore does not allow even one accepting path.
• A transducer that reads an arbitrary input string and outputs a single version where all the maximal
noun phrases (chosen greedily from left to right) have been transformed as above and bracketed.
(a) You'll be editing the provided file chunker.grm:
import 'byte.grm' as bytelib;
import 'tags.grm' as tags;
Sigma = (tags.Tags) | (bytelib.kBytes);
SigmaStar = Optimize[Sigma*];
Copy this file along with byte.grm and tags.grm from the grammars/ directory. Line 2 defines tags
to be the collection of FSMs that are exported by tags.grm. Expressions like tags.Tags in line 3 then
refer to individual FSMs in that collection. You should look at these other files referenced by lines 1–2.
Now:
i. Define an FSA NP that accepts an optional article (Art) or quantifier (Quant); followed by an arbitrary number of adjectives (Adj); followed by at least one noun (Noun). We would like to write:
export NP = Optimize[(Art|Quant)? Adj* Noun+];
(Q16) What goes wrong? (Hint: look at importable FSMs from tags.) Fix the definition in chunker.grm,
and (Q17) in your README, provide evidence of what you were able to accept.
You will use the fixed chunker.grm for the rest of this question.
(Note: Really we should be working over the alphabet of tags rather than the default alphabet of
ASCII characters. Later in the assignment we'll see how to define other alphabets using symbol
tables.)
ii. Have a look at NP:
far2fst chunker.far NP
fstview NP.fst
(Q18) How many states and arcs are there? Comment on the structure of the machine.
(b) In a noun-noun compound, such as the old meat packing district, the nouns meat and packing
act as nominal modifiers. Define and try out a transducer MakeNmod, using CDRewrite, that replaces
Noun with Nmod immediately before any Noun. So ArtAdjNounNounNoun as in the example becomes
ArtAdjNmodNmodNoun.
To define MakeNmod, you'll need to figure out what arguments to use to CDRewrite.
(c) Now define an FST
export TransformNP = Optimize[NP @ MakeNmod];
i. (Q19) Describe in words what this composition is doing.
ii. (Q20) What are the results on ArtAdjNounNounNoun and AdjNounNounNounVerb?
iii. (Q21) What is the size of TransformNP compared to MakeNmod?
iv. (Q22) How does the topology of TransformNP differ from that of NP?
(d) This FST transduces a noun phrase to one that has <angle brackets> around it:
export BracketNP = ("" : "<") NP ("" : ">");
Here the NP language is being interpreted as the identity relation on that language, and concatenated
with two other simple regular relations. So BracketNP reads ε, any NP, ε and writes <, the same NP, >.
(Q23) What, if anything, is the difference between the following?
export Brackets1 = Optimize[SigmaStar (BracketNP SigmaStar)*];
export Brackets2 = CDRewrite[BracketNP, "", "", SigmaStar, 'sim', 'obl'];
Try them out on short and long strings, such as ArtAdj, AdjNoun,
and VerbArtAdjNounNounNounVerbPrepNoun.
(e) Now define BracketTransform to be like Brackets2, except that it should not only bracket noun
phrases but also apply the transformation defined by TransformNP within each noun phrase. This should
be a fairly simple change.
(f) One interesting thing about FSTs is that you can pass many strings through an FST at once. Define
BracketResults to be the regular language that you get by applying BracketTransform to all strings
of the form Quant Noun+ Verb at once.
(Hint: Check out the Project operator in the Thrax documentation.[11] You may want to optimize the
result.)
[11] This operator is used to "project" a relation onto the upper or lower language, like the .u and .l operators in XFST. Why is that called
projection? Consider a set of points on the plane: {(x1, y1), (x2, y2), ...}. The projection of this set onto the x axis is {x1, x2, ...} and the
projection onto the y axis is {y1, y2, ...}. Same thing when projecting a regular relation, except that each (xi, yi) pair is a pair of strings.
You can check the FSA by using fstview on BracketResults (note that it may be drawn as an identity FST). To print out the strings it accepts (up to a limit), run grmtest on the cross-product machine
"":BracketResults, and enter an empty input string to see all the outputs.
(g) Extra credit: To get a sense of how CDRewrite might do its work, define your own version of TransformNP
that does not use CDRewrite. It should be a composition of three FSTs:
• Optionally replace each Noun with Nmod, without using CDRewrite. This will transduce a single
input to many outputs.
• Check that no Noun is followed by another Noun or Nmod. This filters outputs where you didn't
replace enough nouns.
• Check that every Nmod is followed by a Noun or another Nmod. This filters outputs where you replaced
too many nouns. (Hint: It is similar to the XFST "restrict" operator that we defined in class.)
(✌Q24) Call your version TransformNP2.
6. In OpenFST, you can define weighted FSMs. By default, OpenFST, NGram, and Thrax all use the tropical
semiring ⟨ℝ ∪ {±∞}, ⊕ = min, ⊗ = +⟩.[12] Thus, weights can be interpreted as costs. Concatenating two
paths or regexps will sum their costs, whereas if a string is accepted along two parallel paths or by a union of
regexps, then it gets the minimum of their costs.
Augment an FSM's definition by appending a weight w in angle brackets, < and >, and wrapping the entire
definition in parentheses.
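As a quick illustration of this syntax and of the tropical ⊕ and ⊗ (these toy machines are our own, not part of the assignment):

```
# Concatenation combines costs with ⊗ = +; parallel paths combine with ⊕ = min.
WA = ("a" <1>);                  # accepts a with cost 1
WB = ("b" <2>);                  # accepts b with cost 2
export WConcat = WA WB;          # accepts ab with cost 1 + 2 = 3
export WUnion = WA | ("a" <5>);  # accepts a along two paths; the best cost is min(1, 5) = 1
```

Running grmtest on WUnion would show the cheaper of the two parallel paths as the best output cost.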
(a) i. (Q25) What is the minimum-weight string accepted by this FSA, and what is the weight of that string?
(Remember that parentheses in Thrax just represent grouping, not optionality.)
(Zero <1>) (Bit+ <0.2>) (One <0.5>)
ii. (Q26) What is the minimum-weight pair of strings accepted by this FST, and what is the weight of that
pair?
(Zero : One <1>) (Bit+ <0.2>) (One : One One <0.75>)
(b) In your old binary.grm file, define a weighted transducer WFlip that accepts the language of the above
FSA and, reading left to right:
• Nondeterministically flips the leftmost bit. Flipping has weight 2, while not flipping has weight 1.
• In the Bit+ portion, replaces every 0 with 01 (at cost 0.5), and replaces every 1 with 0 (at cost 0.4).
• Accepts the final 1 bit with weight 0.5.
Don't use CDRewrite here.
For example, WFlip should produce the following:
Input string: 0011
Output string: 00101 <cost=2.4>
Output string: 10101 <cost=3.4>
(c) Now let's consider cases where we aggregate the weights of multiple paths using ⊕.
i. (Q27) In your README, name any two binary strings x, y: for example, (x, y) = (00, 1). In binary.grm,
define WeightedMultipath to be a simple weighted FST of your choice, such that the particular
pair (x, y) that you named will be accepted along at least two different paths, of different weights.
(Q28) To confirm that these two accepting paths exist, view a drawing of the machine, and use grmtest
to find out what x maps to. What are the weights of these paths?
[12] And currently a portion of the Thrax we're using supports only this semiring.
ii. (Q29) Now define WeightedMultipathOpt = Optimize[WeightedMultipath]. In this FST, how many
paths accept (x, y) and what are their weights? Why?
iii. Suppose T is an FST with weights in some semiring, and x and y are strings. So T accepts the pair
(x, y) along 0 or more weighted paths.
(Q30) Describe, in English, what the following weighted languages or relations tell you about T:
T_out = Project[ T, 'output'];      # erases input from T
xT_out = Project[ x @ T, 'output']; # erases input x from x @ T
Ty_in = Project[ T @ y, 'input'];   # erases output y from T @ y
xTy = x @ T @ y;
exTye = ("":x) @ T @ (y:"");        # erases input x & output y from x @ T @ y
xT_out_opt = Optimize[xT_out];
Ty_in_opt = Optimize[Ty_in];
exTye_opt = Optimize[exTye];
(Q31) How big is the last FSM, in general? (Q32) Why do the last three FSMs have practical importance?[13]
You can try these all out for the case where T is WeightedMultipath and x and y denote the strings
(x, y) you named above.
(d) (✌Q33) Extra credit: Define an FSM NoDet that has no deterministic equivalent. (Unweighted FSAs can always be determinized, but it turns out that either outputs (FSTs) or weights can make determinization
impossible in some cases that have cycles.)
How will you know you've succeeded? Because the Thrax compiler will run forever on the line
Determinize[RmEpsilon[NoDet]]: the determinization step will be building up an infinitely large machine. (RmEpsilon eliminates ε arcs
from the FSM, which is the first step of determinization. Then Determinize handles the rest of the
construction.)
7. Throughout the remainder of this assignment, we'll be focused on building noisy-channel decoders, where
weighted FSTs really shine. Your observed data y is assumed to be a distorted version of the "true" data x. We
would like to "invert" the distortion as best we can, using Bayes' Theorem. The most likely value of x is

x* ≜ argmax_x Pr(x | y) = argmax_x Pr(x, y) = argmax_x Pr(x) · Pr(y | x)    (1)

where Pr(x) is the "language model" and Pr(y | x) is the "channel model."
In finite-state land, both language and channel models should be easy to represent. A channel is a weighted
FST that can be defined with Thrax, while a language model is a weighted FSA that can be straightforwardly
built with the NGram toolkit.[14] To make it even easier to build a language model, we've given you a wrapper
script, make-lm:
make-lm corpus.txt
[13] In general, one might want to do one of these computations for many different x (or y) values. Rather than compiling a new Thrax file for
each x value, you could use other means to create x at runtime and combine it with T. For example, to work with BracketTransform
from question 5e, try typing this pipeline at the command line: echo "ArtNounNounNoun" | fstcompilestring | fstcompose
- BracketTransform.fst | fstproject --project output | fstoptimize | fstview. (Other handy utilities discussed in this
handout are fstrandgen for getting random paths, fstshortestpath for getting the lowest-cost paths, farprintstrings for printing
strings from these paths, and our grmfilter script for transducing a file.) Or you could just use the C++ API to OpenFST.
[14] Language models can't be concisely described with regular expressions: at least, not the standard ones.
By default, this will create a Kneser–Ney back-off trigram language model called corpus.fst. Every sentence
of corpus.txt should be on its own line.
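Once both models exist as FSTs, decoding per equation (1) is just composition plus a shortest-path search. A schematic sketch of our own (not a required machine; channel.fst is a hypothetical file name, and we assume both machines share a symbol table):

```
# Schematic noisy-channel decoder: compose Pr(x) with Pr(y | x).
LM = LoadFst['entrain.fst'];       # weighted FSA scoring Pr(x)
Channel = LoadFst['channel.fst'];  # weighted FST scoring Pr(y | x); hypothetical
export NoisyChannel = Optimize[LM @ Channel];
# Composing an observed y on the right of NoisyChannel, running
# fstshortestpath, and projecting to the input side recovers the best x.
```

In the tropical semiring, the path cost of the composition sums the language-model and channel costs, so the shortest path is the argmax of equation (1).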
(a) Create the default language model for the provided €le data/entrain.txt. Each line of this €le
represents an observed sentence from the English training data from the HMM assignment.15 Notice
that the sentences have already been tokenized for you (this could be done by another FST, of course).
You should now have a language model entrain.fst in your working directory, along with two other
€les you’ll need later: entrain.alpha is the alphabet of word types and entrain.sym is a symbol table
that assigns internal numbers to them.
Look at the files, and try this:
wc -w entrain.txt # number of training tokens
wc -l entrain.alpha # number of training types
fstinfo entrain.fst # size of the FSA language model
(b) The NGram package contains a number of useful shell commands for using FST-based n-gram models.
Three that you may find particularly interesting are ngramprint, ngramrandgen, and ngramperplexity.
You can read up on all three if you like (use the --help flag).
Sampling several sentences from the language model is easy:
ngramrandgen --max_sents=5 entrain.fst | farprintstrings
Each <epsilon> represents a backoff decision in the FSM. You can see that there is a lot of backoff from
2 to 1 to 0 words of context. That's because this corpus is very small by NLP standards: the smoothing
method realizes that very few of the possible words in each context have been seen yet.
To read the sentences more easily, pass the flag --remove_epsilon to ngramrandgen (this prevents
<epsilon>s from being printed). Alternatively, just use our fstprintstring script to print a random
output from any FST:
fstprintstring entrain.fst
If you want to know the most probable path in the language model, you can use fstshortestpath,
which runs the Viterbi algorithm.16
fstshortestpath entrain.fst | fstprintstring
34 How long are the strings in both cases, and why? What do you notice about backoff?
(c) Recommended: Although it's a small language model, entrain.fst is far too large to view in its
entirety. To see what a language model FSA looks like, try making a tiny corpus, tiny-corpus.txt:
just 2–3 sentences of perhaps 5 words each. Try to reuse some words across sentences. Build a language
model tiny-corpus.fst and then look at it with fstview. There is nothing to hand in for this question.
(d) Now, let's actually use the entrain.fst language model. Copy noisy.grm from the grammars/ directory:
import 'byte.grm' as bytelib; # load a simple grammar (.grm)
export LM = LoadFst['entrain.fst']; # load trigram language model (.fst)
vocab = SymbolTable['entrain.sym']; # load model's symbol table (.sym)
15 Because the NGram toolkit assumes that sentence boundaries are marked by newlines, we've omitted ###.
16 In general fstprintstring will print a randomly chosen string, but in the output of fstshortestpath, there is only one string to
choose from.
Each line loads a different kind of external resource. In particular, the second line loads the trigram
language model FSA, and the third line loads the symbol table used by that FSA. The symbol table consists
of the vocabulary of the language model, as well as the OOV symbol <unk> ("unknown").17
You can therefore use LM to transduce some strings:
grmtest-with-symbols noisy.grm LM entrain.sym entrain.sym
35 What is the result of transducing the following? Explain your answers. What are the domain and range
of the relation LM?
• Andy cherished the barrels each house made .
• If only the reporters had been nice .
• Thank you
We now want to compose LM with a noisy channel FST. Because the language model is nothing more than
an FSA, we can use it in Thrax. Of course, we're going to have to be careful about symbol tables: the noisy
channel's input alphabet must be the same as LM's output alphabet.
Remember to be careful when creating FSTs over a nonstandard alphabet. If you write
("barrels barrels" : ("" | "ship"))*;
then the input to this FST must be a multiple of 15 symbols in the default byte alphabet. But if you write
("barrels barrels".vocab : ("".vocab | "ship".vocab))*;
then Thrax will parse the quoted strings using the vocab symbol table. So here, the input must be an even
number of symbols in the vocab alphabet. Writing "Thank".vocab will give an error because that word is
not in the symbol table (it's not in the file entrain.sym from which vocab was loaded).
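A quick way to see why the same quoted string has different lengths under the two alphabets (plain Python, nothing Thrax-specific):

```python
s = "barrels barrels"
print(len(s))          # 15 symbols in the byte alphabet (characters, including the space)
print(len(s.split()))  # 2 symbols in the vocab alphabet (whole words)
```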
A noisy channel that "noises up" some text might modify the sequence of words (over the vocab alphabet) or
the sequence of letters (over the byte alphabet). If you want to do the latter, you'll need to convert words to
letters. Recall from question ‰ that the StringFile function interprets its given tab-separated text file as an
FST, with the domain and range as the second and third parameters, respectively. So we'll add the following
line to noisy.grm:
Spell = Optimize[StringFile['entrain.alpha', vocab, byte]];
This maps a word to its spelling, just as Pronounce in ‰ mapped a word to its pronunciation.
(e) In noisy.grm, define a transducer called CompleteWord that could be used to help people enter text
more quickly. The input should be the first few characters in a word, such as barr. Each output should
be a word that starts with those characters, such as barrel or barrage.
36 Use LM to assign a cost to each word, so that each completed word is printed together with its cost. Is a
word's cost equal to the unigram probability of the word, or something else?
Hint: Be careful to think about the input and output alphabets, and to pass them as arguments to
grmtest-with-symbols. The input alphabet should be byte (not byte.sym), as explained in question ‰.
✌37 (f) Extra credit: Now define CompleteWordInContext similarly. Here the input is a sequence (separated
by spaces) of 0 or more complete words followed by a partial word. Each output is a single word that
completes the partial word, as before. But this time the cost depends on the context: that's what language
models are for.
✌38 Try it out and give a few example inputs that illustrate how the context can affect the ranking of the
different completions.
Hint: You might not be able to get away with exporting CompleteWordInContext as a single transducer;
it's rather big because it spells out the words in the language model. It will be more efficient to use the
pipelining trick from question ‰. In your README, tell the graders what command to enter in order to
try out your pipeline, and give the original CompleteWordInContext definition that your pipeline was
derived from.
8. Question 7 defined our language model, Pr(x). Now let's compose it with some channel models Pr(y | x)
that we'll define. In this question, we'll practice by working through a simple deterministic noisy channel.
(a) Still working in noisy.grm, define a deterministic transducer DelSpaces that deletes all spaces. Define
this using CDRewrite, and use the alphabet bytelib.kGraph | bytelib.kSpace. Using grmtest
you should be able to replicate the following:
Input string: If only the reporter had been nice .
Output string: Ifonlythereporterhadbeennice.
Input string: Thank you .
Output string: Thankyou.
Input string: The reporter said to the city that everyone is killed .
Output string: Thereportersaidtothecitythateveryoneiskilled.
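While you test your CDRewrite solution, the intended string-level behavior of DelSpaces matches this one-line Python reference (this is only a checking aid, not the required FST):

```python
def del_spaces(s):
    # Delete every space, exactly as the DelSpaces transducer should.
    return s.replace(" ", "")

print(del_spaces("If only the reporter had been nice ."))
# Ifonlythereporterhadbeennice.
```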
(b) grmtest will transduce each string that you type in, providing multiple outputs when they exist. To
transduce a whole file to a single output, once you've tested your transducer, we've provided another
wrapper script grmfilter:
$ grmfilter
Usage:
cat input.txt | grmfilter [-opts] <grammar file> <name1>,<name2>,...
-r: select a random path
-s: find shortest path (default)
-h: print this help message (and exit)
Just like grmtest, it takes two required arguments, a .grm file and a comma-separated list of FST names
defined in that file. It reads strings from the standard input, one per line, and writes their transductions
to the standard output. The output string comes from one of the paths that accept the input. The default
(which can be signaled explicitly with the -s flag) is to choose a maximum-probability path. The
alternative (the -r flag) is to select a path randomly in proportion to its probability. We know each path's
probability because its total cost gives the negative log of its probability.
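The relationship between path costs and probabilities, and the difference between -s and -r, can be sketched in Python. The three paths and their costs here are hypothetical:

```python
import math
import random

# Hypothetical competing paths with their total costs (negative log-probabilities).
costs = {"path A": 0.69, "path B": 1.61, "path C": 2.30}

# -s behavior: the minimum-cost path is the maximum-probability path.
shortest = min(costs, key=costs.get)
print(shortest)   # path A

# -r behavior: sample a path with probability proportional to exp(-cost).
probs = {path: math.exp(-c) for path, c in costs.items()}
total = sum(probs.values())
r = random.uniform(0.0, total)
for path, p in probs.items():
    r -= p
    if r <= 0.0:
        sampled = path
        break
print(sampled)    # "path A" most often, "path C" rarely
```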
17 The make-lm script takes the vocabulary to be all the words that appear in training data, except for a random subset of the singletons.
These singletons are removed from the vocabulary to ensure that some training words will be OOV, allowing the smoother to estimate the
probability of OOV in different contexts. Ideally the smoothing method would figure this out on its own.
Try running DelSpaces on the text file data/entest.txt, which contains the first 50 sentences of the
English test data entest from the HMM assignment. Save the result as entest-noisy.txt. In general,
you should use the -r flag to pass text through a noisy channel, so that it will randomly noise up the
output (although in this introductory case the channel happens to be deterministic):
grmfilter -r noisy.grm DelSpaces < entest.txt > entest-noisy.txt
Uh-oh! Someone got into your files and used your own DelSpaces against you! Now how will you ever read
any of your files?
After despairing for a while, you realize that you can just reverse DelSpaces's actions. So you try Invert[DelSpaces],
but unfortunately that turns Ifonlythereporterhadbeennice. back into all kinds of things like
I fon lyt he reporterh adbeenni ce.
The correct solution is somewhere in that list of outputs, but you need to find it. What a perfect opportunity
to use your language model LM and the Viterbi algorithm for finding the most probable path!
The idea is that the text actually came from the generative process (1), which can be represented as the composition
Generate = LM @ DelSpaces; # almost right!
Unfortunately the output of LM is words, but the input to DelSpaces is characters. So they won't compose.
You will need to stick a transducer SpellText in between. This transducer represents another deterministic
step in the generative process that resulted in the noisy sequence of characters.
(c) Define SpellText in noisy.grm. It should spell the first input word, output a space, spell the second
input word, output another space, and so on. This yields the kind of text that actually appeared in
entrain.txt (there is a space after each word in a sentence, including the last).
Now revise your definition of Generate to use SpellText.
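The spelling convention that SpellText must follow (a space after every word, including the last) can be pinned down with a tiny Python reference function; this is just to clarify the target strings, not a hint at the Thrax definition:

```python
def spell_text(words):
    # Append a space after each word, including the final one.
    return "".join(w + " " for w in words)

print(repr(spell_text(["Thank", "you", "."])))   # 'Thank you . '
```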
(d) Now you should be able to decode noisy text via
Decode = Invert[Generate];
Unfortunately, this machine will be too large (and slow to compile). So you should use the same approach
as in question ‰, and ask grmtest to pass the noisy text through a sequence of inverted machines.
Important: At the end of your sequence of machines, you should add PrintText, which you can define
for now to be equal to SpellText. This has the effect of pretty-printing the decoded result. It will turn
the recovered sequence of words back into characters, and put spaces between the words.
Using grmtest in this way, try decoding each of the following. Note that the lowest-cost results are
shown first. 39 Discuss the pattern of results, and their costs, in your README:
• Ifonlythereporterhadbeennice.
• If only.
• ThereportersaidtothecitythatEveryoneIskilled.
• Thankyou.
(e) The reason Thankyou failed is that we didn't account for OOVs. The vocabulary has an OOV symbol
<unk>, but it is treated like any other word in the vocabulary.18 So LM will accept phrases like <unk>
you, but not Thank you.
So just as we described how to spell in the above questions, we'll now describe how to spell OOV words.
We'll say that <unk> can rewrite as an arbitrarily long sequence of non-space text characters (bytelib.kGraph):
RandomChar = bytelib.kGraph <4.54>;
RandomWord = Optimize[(RandomChar (RandomChar <w1>)* ) <w2>];
SpellOOV = "<unk>".vocab : RandomWord;
The weight in RandomChar says that each of the 94 characters in bytelib.kGraph has the same
probability, namely 1/94, since −log(1/94) ≈ 4.54.
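You can verify the arithmetic behind that weight directly:

```python
import math

# Cost of one character under a uniform distribution over the 94
# printable non-space characters in bytelib.kGraph.
cost = -math.log(1 / 94)
print(round(cost, 2))   # 4.54
```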
How about RandomWord? When you define it in noisy.grm, you'll have to give actual numbers for the
numeric weights w1 and w2. Try setting w1 = 0.1 and w2 = 2.3. To check out the results, try these
commands:
grmtest noisy.grm RandomWord # evaluate cost of some strings
far2fst noisy.far RandomWord # (get the FSA for commands below)
fstprintstring RandomWord.fst # generate a random string
fstview RandomWord.fst # look at the FSA
40 i. What do w1 and w2 represent? Hint: the costs 0.1 and 2.3 are the negative logs of 0.9 and 0.1.
41 ii. For each n ≥ 0, what is the probability pn that the string generated by RandomWord will have length n?
42 iii. What is the sum of those probabilities, ∑_{n=0}^∞ pn?
43 iv. How would you change w1 and w2 to get longer random words on average?
44 v. If you decreased both w1 and w2, then what would happen to the probabilities of the random words?
How would this affect the behavior of your decoder? Why?
45 vi. How could you improve the probability models RandomChar and RandomWord?
Once you've answered those questions, reset w1 = 0.1 and w2 = 2.3 and proceed.
(f) Now, revise Spell so that it is not limited to spelling words in the dictionary, but can also randomly
spell <unk>. (Hint: Use SpellOOV.)
Also revise PrintText so that if your decoder finds an unknown word <unk>, you will be able to print
that as the 5-character string "<unk>".
To check your updated decoder, try running the sentences from question 8d through it. Again discuss
the pattern of results. Remember that if you want, you can add an extra argument to grmtest to limit
the number of outputs printed per input.
(g) Remember that your goal was to de-noise your corrupted files, whose spaces were removed by DelSpaces.
Just run grmfilter again, but with three differences:
• Before, you were converting entest.txt to entest-noisy.txt. Now you should convert
entest-noisy.txt to entest-recovered.txt.
• Instead of running the noisy channel forward, run it backward, using your pipeline from 8d. You
can leave out the PrintText step of the pipeline, since grmfilter is a bit smarter than grmtest
about how it prints outputs.
18Except by some of the ngram utilities that we’re not using.
• Since you want the most likely decoding and not a random decoding, don't use the -r flag this time.
46 Look at the results in entest-recovered.txt. What kinds of errors can you spot? Does this qualitative
error analysis give you any ideas about how to improve your decoder?
(h) Suppose you'd like to quantify your performance. The metric we'll consider is the edit distance between
entest.txt and entest-recovered.txt.
Edit distance counts the minimum number of edits needed to transduce one string (x) into another (y).
The possible edits are
• substitute one letter for another;
• insert a letter;
• delete a letter;
• copy a letter unchanged.
Each of these operations has a cost associated with it. We'll stick with the standard unweighted edit
distance metric, in which substitutions, insertions, and deletions all have cost 1; copying a character unchanged
has cost 0. For simplicity we will treat the unknown word symbol as if it really were the 5-character word
<unk>, which must be edited into the true word.
As you know, edit distance can easily be calculated using weighted finite-state machines:
Sigma = bytelib.kBytes;
export Edit = (Sigma | (("" | Sigma) : ("" | Sigma) <1>) )*;
The Edit machine transduces an input string x one byte at a time: at each step, it either passes an input
character through with cost 0, or does an insert, delete, or substitute with cost 1. That gives an edited
version y. The cheapest way to get from x to a given y corresponds to the shortest path through x
@ Edit @ y. As we saw in class, that machine has the form of a (|x| + 1) × (|y| + 1) grid with horizontal,
vertical, and diagonal transitions. It has exponentially many paths, of various total costs, that represent
different sequences of edit operations for turning x into y.
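The quantity that the shortest path through x @ Edit @ y computes is the familiar edit-distance dynamic program over that same grid; here is a standalone Python version for comparison:

```python
# Cell (i, j) holds the cheapest way to edit x[:i] into y[:j], so the
# table has the same (|x|+1) x (|y|+1) shape as the x @ Edit @ y machine.
def edit_distance(x, y):
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        d[i][0] = i                                  # i deletions
    for j in range(len(y) + 1):
        d[0][j] = j                                  # j insertions
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1   # copy is free
            d[i][j] = min(d[i - 1][j] + 1,           # delete (vertical arc)
                          d[i][j - 1] + 1,           # insert (horizontal arc)
                          d[i - 1][j - 1] + sub)     # copy/substitute (diagonal)
    return d[len(x)][len(y)]

print(edit_distance("kitten", "sitting"))   # 3
```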
We've given you an edit-distance script to calculate the edit distances between the corresponding
lines of two files:
editdist entest.txt entest-recovered.txt
This will compare each recovered sentence to the original. Do the scores match your intuitive, qualitative
results from 8g?
Please look at grammars/editdist.grm, the Thrax grammar used by the editdist script. You'll see
that it's more complicated than Edit, but this construction reduces the size of the overall machine by
a couple of orders of magnitude. While it still computes x @ Edit @ y, it splits Edit up into two separate
machines, Edit1 and Edit2. We still find the shortest path, but now through
(x @ Edit1) @ (Edit2 @ y);
By doing the composition this way, both x and y are able to impose their own constraints (what letters
actually appear) on Edit1 and Edit2, thus reducing the size of the intermediate machines. The resulting
FST can be built quite quickly, though as mentioned before, it does have (|x| + 1) × (|y| + 1) states and a
similar number of arcs.
✌47 (i) Extra credit: How can you modify your pipeline so that it recovers an appropriate spelling of each
unknown word, rather than <unk>? For example, decoding Thankyou should give Thank you rather
than <unk> you.19
✌48 9. Extra credit (but maybe the real point of this assignment): Finally, it's time to have some fun. We just
set up a noisy-channel decoder to handle a simple deterministic noisy channel. Now try it for some other
noisy channels! The framework is nearly identical: just replace DelSpaces with some other FST. For each
type of noisy channel,
i. Define your channel in noisy.grm as a weighted FST.
ii. Explain in README what you implemented and why, and how you chose the weights.
iii. Use your channel to corrupt entest.txt into entest-noisy.txt.
iv. Use your inverted channel, the language model, and SpellText to decode entest-noisy.txt back
into entest-recovered.txt.
v. Look at the files and describe in your README what happened, with some examples.
vi. Report the edit distance between entest.txt and entest-recovered.txt.
Have fun designing some of the channels below. Each converts a string of bytes into a string of bytes. In
general, make them non-deterministic (in contrast to DelSpaces), and play with the weights.
(a) DelSomeSpaces: Nondeterministically delete none, some, or all spaces from an input string.
(b) DelSuffixes: Delete various word endings. You may find
http://grammar.about.com/od/words/a/comsuffixes.htm helpful.
(c) Typos: Introduce common typos or misspellings. You may get some inspiration from
http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings
or
http://en.wikipedia.org/wiki/File:Qwerty.svg
Some real-world typos are due to the fact that some words sound similar, so if you’re ambitious, you
might be able to make some use of data/cmudict.txt or the Pronounce transducer.
(d) Telephone: Deterministically convert (lower-case) letters to digits from the phone keypad. For example,
a rewrites as 2.
(e) Tinyphone: Compose your Telephone FST with another FST that allows common cellphone keypad
typos. For example, there should be a small chance of deleting a digit, doubling a digit, or substituting
one of the adjacent digits for it.
(f) Try composing some of these machines in various orders. As usual, give examples of what happens, and
discuss the interactions.
Feel free to try additional noisy channels for more extra credit. You could consider capitalization, punctuation,
or something crazy out of your imagination.20
19 The recovered spelling is determined by the language model and the channel model. It won't always match the noisy spelling. E.g., if the
noisy channel tends to change letters into lowercase, then decoding Thank you might yield THANK you.
20 It might be fun to replace each word deterministically with its rhyming ending, using your WordEnding FST from question ‰ (composed
with something that transduces ARPAbet characters to the byte alphabet). Then your noisy-channel decoder will find the highest-probability
string that rhymes word-by-word with your original input text. Should be amusing.