Problem set 2: Viterbi Sequence Labeling




This project focuses on sequence labeling with Hidden Markov models and implementing the Viterbi algorithm.
The target domain is part-of-speech tagging on the Universal Dependencies dataset.
import numpy as np
from collections import defaultdict, Counter

from gtnlplib import preproc, viterbi, most_common, clf_base
from gtnlplib import naive_bayes, scorer, constants, tagger_base, hmm

import matplotlib.pyplot as plt
# this enables you to create inline plots in the notebook
%matplotlib inline
1. Data Processing
The test data will be released around 48 hours before the deadline. The part-of-speech tags are defined on the universal dependencies website.

## Define the file names
TRAIN_FILE = constants.TRAIN_FILE
DEV_FILE = constants.DEV_FILE
TEST_FILE = constants.TEST_FILE # You do not have this for now
Here is demo code for using the function conll_seq_generator()

## Demo
all_tags = set()
for i,(words, tags) in enumerate(preproc.conll_seq_generator(TRAIN_FILE,max_insts=100000)):
    for tag in tags:
        all_tags.add(tag)
print all_tags
set([u'ADV', u'NOUN', u'NUM', u'ADP', u'PRON', u'SCONJ', u'PROPN', u'DET', u'SYM', u'INTJ', u'PART', u'PUNCT', u'VERB', u'X', u'AUX', u'CONJ', u'ADJ'])
Deliverable 1.1: Counting words per tag. ( 0.5 points )

Implement get_tag_word_counts in the most_common module.

Input: filename for data file, to be passed as argument to preproc.conll_seq_generator
Output: dict of counters, where keys are tags
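The "dict of counters" output can be pictured on a toy corpus. The sketch below is only an illustration under stated assumptions: toy_tag_word_counts and toy_corpus are hypothetical names, the input is an in-memory list of (words, tags) pairs rather than a data file read through preproc.conll_seq_generator, and it is written for Python 3 rather than the notebook's Python 2.

```python
from collections import Counter, defaultdict

def toy_tag_word_counts(tagged_sentences):
    """Map each tag to a Counter of the words observed with that tag."""
    counts = defaultdict(Counter)
    for words, tags in tagged_sentences:
        for word, tag in zip(words, tags):
            counts[tag][word] += 1
    return counts

# Hypothetical two-sentence corpus standing in for the training file.
toy_corpus = [
    (["they", "can", "fish"], ["PRON", "AUX", "VERB"]),
    (["they", "fish"], ["PRON", "VERB"]),
]
toy_counters = toy_tag_word_counts(toy_corpus)
for tag in sorted(toy_counters):
    print(tag, toy_counters[tag].most_common(3))
```

Keys are tags and values are Counter objects, so most_common(3) can be called on each value exactly as in the cell that follows.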
# this block uses your code to find the most common words per tag
counters = most_common.get_tag_word_counts(TRAIN_FILE)
for tag,tag_ctr in counters.iteritems():
    print tag,tag_ctr.most_common(3)
ADV [(u'so', 371), (u'just', 353), (u'when', 306)]
NOUN [(u'time', 385), (u'people', 233), (u'way', 187)]
NUM [(u'one', 332), (u'two', 157), (u'2', 129)]
ADP [(u'of', 3424), (u'in', 2705), (u'to', 1783)]
PRON [(u'I', 3121), (u'you', 1920), (u'it', 1468)]
SCONJ [(u'that', 982), (u'if', 450), (u'as', 312)]
PROPN [(u'Bush', 211), (u'US', 162), (u'Iraq', 119)]
DET [(u'the', 8142), (u'a', 3588), (u'The', 884)]
SYM [(u'$', 251), (u'-', 101), (u':)', 34)]
INTJ [(u'Please', 141), (u'please', 111), (u'Yes', 34)]
PART [(u'to', 3221), (u'not', 805), (u"n't", 655)]
PUNCT [(u'.', 8645), (u',', 7021), (u'"', 1352)]
VERB [(u'is', 1738), (u'was', 808), (u'have', 748)]
X [(u'etc', 39), (u'1', 29), (u'2', 29)]
AUX [(u'will', 811), (u'would', 578), (u'can', 578)]
CONJ [(u'and', 4843), (u'or', 697), (u'but', 602)]
ADJ [(u'other', 268), (u'good', 251), (u'new', 195)]
2. Tagging as classification
Now you will implement part-of-speech tagging via classification.

Tagging quality is evaluated using tagger_base.eval_tagger, which takes three arguments:

a tagger, which is a function taking a list of words and a tagset as arguments
an output filename
a test file
You will want to use lambda expressions to create the first argument for this function, as shown below. Here’s how it works. I provide a tagger that labels everything as a noun.

# here is a tagger that just tags everything as a noun
noun_tagger = lambda words, alltags : ['NOUN' for word in words]
confusion = tagger_base.eval_tagger(noun_tagger,'nouns',all_tags=all_tags)
print scorer.accuracy(confusion)
Deliverable 2.1 Classification-based tagging. ( 0.5 points )

Now do the same thing, but build your tagger as a classifier. To do this, implement make_classifier_tagger in the tagger_base module.

Input: defaultdict of weights
Output: function from (list of word tokens, list of all possible tags) -> tags for each word
The function that you output should use your predict() function from pset 1. You are free to edit this function if you don’t think you got it right in pset 1.

classifier_noun_tagger = tagger_base.make_classifier_tagger(most_common.get_noun_weights())
confusion = tagger_base.eval_tagger(classifier_noun_tagger,'all-nouns.preds',all_tags=all_tags)
print scorer.accuracy(confusion)
Deliverable 2.2 Tagging words by their most common tag. ( 0.5 points )

Now build a classifier tagger that tags each word with its most common tag in the training set. To do this, implement get_most_common_word_weights in the most_common module.

Input: training file
Output: defaultdict of weights
Each word should get the tag that is most frequently associated with it in the training data. If the word does not appear in the training data, the weights should be set so that the tagger outputs ‘NOUN’.
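One way to picture how weights can encode "most common tag, else NOUN" is sketched below. Everything here is an assumption for illustration: toy_most_common_weights and toy_predict are hypothetical helpers, OFFSET is a stand-in for constants.OFFSET, the scoring rule (word weight plus offset weight, argmax over tags) is a simplified stand-in for the pset 1 predict() function, and the code targets Python 3.

```python
from collections import Counter, defaultdict

OFFSET = "**OFFSET**"  # hypothetical stand-in for constants.OFFSET

def toy_most_common_weights(tag_word_counts):
    """Give weight 1 to (best_tag, word) and a smaller offset weight to NOUN."""
    word_tag_counts = defaultdict(Counter)
    for tag, word_counter in tag_word_counts.items():
        for word, count in word_counter.items():
            word_tag_counts[word][tag] += count
    weights = defaultdict(float)
    for word, tag_counter in word_tag_counts.items():
        best_tag = tag_counter.most_common(1)[0][0]
        weights[(best_tag, word)] = 1.0
    # The offset is below 1.0, so it only decides the tag for unseen words.
    weights[("NOUN", OFFSET)] = 0.5
    return weights

def toy_predict(word, weights, all_tags):
    """Simplified linear classifier: score = word weight + offset weight."""
    def score(tag):
        return weights.get((tag, word), 0.0) + weights.get((tag, OFFSET), 0.0)
    return max(sorted(all_tags), key=score)

# Hypothetical counts in the dict-of-counters format from Deliverable 1.1.
toy_counts = {
    "PRON": Counter({"they": 5}),
    "AUX": Counter({"can": 4}),
    "NOUN": Counter({"can": 1, "fish": 3}),
    "VERB": Counter({"can": 2, "fish": 2}),
}
toy_theta = toy_most_common_weights(toy_counts)
toy_tags = [toy_predict(w, toy_theta, toy_counts.keys())
            for w in ["they", "can", "fish", "zyzzyva"]]
print(toy_tags)
```

The design choice worth noting is the relative size of the two weights: any positive offset smaller than the per-word weight leaves seen words unaffected while breaking the all-zero tie for unseen words in favor of NOUN.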

theta_mc = most_common.get_most_common_word_weights(constants.TRAIN_FILE)
tagger_mc = tagger_base.make_classifier_tagger(theta_mc)
tags = tagger_mc(['They','can','can','fish'],all_tags)
print tags
[u'PRON', u'AUX', u'AUX', u'NOUN']
tags = tagger_mc(['The','old','man','the','boat','.'],all_tags)
print tags
[u'DET', u'ADJ', u'NOUN', u'DET', u'NOUN', u'PUNCT']
Now let’s run your tagger on the dev data.

confusion = tagger_base.eval_tagger(tagger_mc,'most-common.preds',all_tags=all_tags)
print scorer.accuracy(confusion)
Deliverable 2.3 Naive Bayes as a tagger. ( 0.5 points )

Now use a Naive Bayes approach to set the weights for the classifier tagger. For this, you can copy in your Naive Bayes code from pset 1. If you don’t think you got it right, you are free to change it now.

sorted_tags = sorted(counters.keys())
print ' '.join(sorted_tags)
nb_weights = naive_bayes.estimate_nb([counters[tag] for tag in sorted_tags],
This gives weights for each tag-word pair that represent $\log P(\text{word} \mid \text{tag})$:
print nb_weights[('ADJ','bad')], nb_weights[(u'ADJ',u'good')]
print nb_weights[(u'PRON',u'they')], nb_weights[(u'PRON',u'They')], nb_weights[(u'PRON',u'good')]
print nb_weights[(u'PRON',u'.')], nb_weights[(u'NOUN',u'.')], nb_weights[(u'PUNCT',u'.')]
-5.38657494998 -3.92169758048
-3.11635765265 -4.29498560641 -14.4453722986
-14.4453722986 -15.0676307485 -1.01587097152
These weights should correspond to probabilities that sum to one for each tag.

vocab = set([word for tag,word in nb_weights.keys() if word != constants.OFFSET])
print sum(np.exp(nb_weights[('ADJ',word)]) for word in vocab)
print sum(np.exp(nb_weights[('DET',word)]) for word in vocab)
print sum(np.exp(nb_weights[('PUNCT',word)]) for word in vocab)
We have zero weights for out-of-vocabulary (OOV) terms; more on this later.

print nb_weights[('ADJ','baaaaaaaaad')]
As constructed here, the Naive Bayes tagger also does not correctly estimate weights for the offset, $\log P(\text{tag})$:
print nb_weights[('VERB',constants.OFFSET)]
print nb_weights[('ADV',constants.OFFSET)]
print nb_weights[('PRON',constants.OFFSET)]
As a result, the accuracy is not as good as the most-common-tag tagger from above.

confusion = tagger_base.eval_tagger(tagger_base.make_classifier_tagger(nb_weights),'nb-simple.preds')
dev_acc = scorer.accuracy(confusion)
print dev_acc
Deliverable 2.4 Fixing the Naive Bayes tagger ( 0.5 points )

Implement naive_bayes.estimate_nb_tagger, which should take two arguments:

A list of word counters for each tag
A smoothing value
It should return weights so that

$\theta[(\text{tag}, \text{word})] = \log P(\text{word} \mid \text{tag})$. If your naive_bayes is correct, it already does this.
$\theta[(\text{tag}, \text{offset})] = \log P(\text{tag})$. You will need to add some code to make this happen.
All probabilities should be smoothed relative frequency estimates.

Your implementation should call naive_bayes.estimate_nb.
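The two kinds of weights can be illustrated on toy counts. This is a minimal sketch, not the expected solution: toy_nb_tagger_weights is a hypothetical function that does not call naive_bayes.estimate_nb, OFFSET stands in for constants.OFFSET, the tag prior is assumed to be estimated from per-tag token totals with the same additive smoothing, and the code targets Python 3.

```python
import math
from collections import Counter, defaultdict

OFFSET = "**OFFSET**"  # hypothetical stand-in for constants.OFFSET

def toy_nb_tagger_weights(tag_word_counts, smoothing):
    """Return smoothed log P(word | tag) and log P(tag) weights, plus the vocabulary."""
    vocab = {word for ctr in tag_word_counts.values() for word in ctr}
    tag_totals = {tag: sum(ctr.values()) for tag, ctr in tag_word_counts.items()}
    grand_total = sum(tag_totals.values())
    num_tags = len(tag_word_counts)
    weights = defaultdict(float)
    for tag, ctr in tag_word_counts.items():
        # Emission weights: additive smoothing over the shared vocabulary.
        denom = tag_totals[tag] + smoothing * len(vocab)
        for word in vocab:
            weights[(tag, word)] = math.log((ctr[word] + smoothing) / denom)
        # Offset weight: smoothed relative frequency of the tag (an assumption).
        weights[(tag, OFFSET)] = math.log(
            (tag_totals[tag] + smoothing) / (grand_total + smoothing * num_tags))
    return weights, vocab

toy_counts = {
    "NOUN": Counter({"fish": 3, "can": 1}),
    "VERB": Counter({"fish": 2, "can": 2, "swim": 1}),
}
toy_theta, toy_vocab = toy_nb_tagger_weights(toy_counts, 0.01)
emission_sums = {tag: sum(math.exp(toy_theta[(tag, w)]) for w in toy_vocab)
                 for tag in toy_counts}
prior_sum = sum(math.exp(toy_theta[(tag, OFFSET)]) for tag in toy_counts)
print(emission_sums, prior_sum)
```

Both sums come out to one up to floating-point error, which is the same property the notebook cells check for the real weights.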

theta_nb_fixed = naive_bayes.estimate_nb_tagger(counters,.01)
# emission weights still sum to one
print sum(np.exp(theta_nb_fixed[('ADJ',word)]) for word in vocab)
# emission weights are identical to theta_nb
print nb_weights[('ADJ','okay')],theta_nb_fixed[('ADJ','okay')]
-7.36649959838 -7.36649959838
# but the offsets are now correct
print theta_nb_fixed[('VERB',constants.OFFSET)]
print theta_nb_fixed[('ADV',constants.OFFSET)]
print theta_nb_fixed[('PRON',constants.OFFSET)]
Offsets should correspond to log-probabilities $\log P(y)$ such that $\sum_y P(y) = 1$.

sum(np.exp(theta_nb_fixed[(tag,constants.OFFSET)]) for tag in all_tags)
Now let’s apply the tagger

confusion = tagger_base.eval_tagger(tagger_base.make_classifier_tagger(theta_nb_fixed),
dev_acc = scorer.accuracy(confusion)
print dev_acc
Just as good as the heuristic tagger from above.

3. Viterbi Algorithm
In this section you will implement the Viterbi algorithm. To get warmed up, let’s work out an example by hand. For simplicity, there are only two tags, NOUN and VERB. Here are the parameters:

$\log P_E(\cdot \mid N)$    they: -1, can: -3, fish: -3
$\log P_E(\cdot \mid V)$    they: -11, can: -2, fish: -4
$\log P_T(\cdot \mid N)$    N: -5, V: -2, END: -2
$\log P_T(\cdot \mid V)$    N: -1, V: -3, END: -3
$\log P_T(\cdot \mid \text{START})$    N: -1, V: -2
where $P_E(\cdot \mid \cdot)$ is the emission probability and $P_T(\cdot \mid \cdot)$ is the transition probability.

In class we discuss the sentence They can fish. Now work out a more complicated example: “They can can fish”, where the second “can” refers to the verb of putting things into cans.
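To see how the parameters combine, the log-score of a single candidate tagging of the in-class sentence "They can fish" can be written out term by term. The tagging N V N is chosen here only for illustration; the layout of the sum is a sketch, and each number is read directly from the table above.

```latex
\begin{align*}
\log P(\text{they can fish},\ \text{N V N})
&= \log P_T(N \mid \text{START}) + \log P_E(\text{they} \mid N) \\
&\quad + \log P_T(V \mid N) + \log P_E(\text{can} \mid V) \\
&\quad + \log P_T(N \mid V) + \log P_E(\text{fish} \mid N) \\
&\quad + \log P_T(\text{END} \mid N) \\
&= (-1) + (-1) + (-2) + (-2) + (-1) + (-3) + (-2) = -12
\end{align*}
```

Each word contributes one transition term and one emission term, and the end of the sentence contributes one extra transition to END. The trellis avoids enumerating every such sum by keeping only the best partial score for each tag at each position.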

Deliverable 3.1 Work out the trellis by hand, and fill in the table in the file (0.5 points)

Implementing Viterbi
Here are some predefined weights, corresponding to the weights from the problem 3.1.

START_TAG = constants.START_TAG
TRANS = constants.TRANS
END_TAG = constants.END_TAG
EMIT = constants.EMIT

hand_weights = {('NOUN','they',EMIT):-1,\
Deliverable 3.2 Building HMM features. (0.5 points)

Implement hmm_features in the hmm module to compute the HMM features for the function $f(x, y_m, y_{m-1}, m)$.

Expected behavior is shown below. Note how constants.EMIT and constants.TRANS are used to distinguish emission and transition features.

sentence = "they can can fish".split()
print sentence
['they', 'can', 'can', 'fish']
print hmm.hmm_features(sentence,'Noun','Verb',2)
print hmm.hmm_features(sentence,'Noun',START_TAG,0)
print hmm.hmm_features(sentence,END_TAG,'Verb',4)
defaultdict(<type 'int'>, {('Noun', 'can', '--EMISSION--'): 1, ('Noun', 'Verb', '--TRANS--'): 1})
defaultdict(<type 'int'>, {('Noun', 'they', '--EMISSION--'): 1, ('Noun', '--START--', '--TRANS--'): 1})
defaultdict(<type 'int'>, {('--END--', 'Verb', '--TRANS--'): 1})
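The expected behavior above suggests a feature function of roughly the following shape. This is a sketch under assumptions: toy_hmm_features is a hypothetical name, the four string constants are stand-ins for the values in gtnlplib.constants, the rule "no emission feature once m is past the last token" is inferred from the END_TAG example, and the code targets Python 3.

```python
from collections import defaultdict

# Hypothetical stand-ins for constants.EMIT, constants.TRANS,
# constants.START_TAG and constants.END_TAG.
EMIT, TRANS = "--EMISSION--", "--TRANS--"
START_TAG, END_TAG = "--START--", "--END--"

def toy_hmm_features(tokens, curr_tag, prev_tag, m):
    """Return the transition feature and, if token m exists, the emission feature."""
    feats = defaultdict(int)
    feats[(curr_tag, prev_tag, TRANS)] = 1
    if m < len(tokens):
        feats[(curr_tag, tokens[m], EMIT)] = 1
    return feats

toy_sentence = "they can can fish".split()
print(dict(toy_hmm_features(toy_sentence, "Noun", "Verb", 2)))
print(dict(toy_hmm_features(toy_sentence, "Noun", START_TAG, 0)))
print(dict(toy_hmm_features(toy_sentence, END_TAG, "Verb", 4)))
```

The third element of each key is what keeps an emission feature such as (tag, word) from colliding with a transition feature such as (tag, previous tag).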
Deliverable 3.3 The Viterbi recurrence. (1 point)

Implement viterbi_step in the viterbi module of gtnlplib. This method computes the best path score and the corresponding back pointer for a given tag and word; you will call it from the main Viterbi routine.

Input 1: A tag to calculate the best path for
Input 2: An index of the word to calculate the best path for
Input 3: A list of words to tag
Input 4: A feature function, like hmm.hmm_features
Input 5: A dict of weights
Input 6: A list of all possible tags
Input 7: A list of dicts representing the best scores for the previous trellis layer
Output 1: The score of the best-scoring sequence
Output 2: The tag in the previous trellis layer corresponding to the best score
There are a lot of inputs, but the code itself will not be very complex. Make sure you understand what each input represents before starting to write a solution.
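The two outputs correspond to the two halves of the Viterbi recurrence. The notation below is an assumption for illustration: $v_m(k)$ is the best score of any tag sequence ending in tag $k$ at position $m$, $b_m(k)$ is the back pointer, $\theta$ is the weight dict, and $f$ is the feature function.

```latex
\begin{align*}
v_m(k) &= \max_{k'} \left[ v_{m-1}(k') + \theta \cdot f(x, k, k', m) \right] \\
b_m(k) &= \operatorname*{arg\,max}_{k'} \left[ v_{m-1}(k') + \theta \cdot f(x, k, k', m) \right]
\end{align*}
```

For the first word ($m = 0$), the previous layer contains only the start tag with score 0, so the maximization ranges over a single candidate.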

You can run your viterbi step on the example in 3.1, by building on the code below.

print viterbi.viterbi_step('NOUN',0,sentence,hmm.hmm_features,hand_weights,{START_TAG:0})
print viterbi.viterbi_step('VERB',0,sentence,hmm.hmm_features,hand_weights,{START_TAG:0})
(-2, '--START--')
(-13, '--START--')
print viterbi.viterbi_step('NOUN',1,sentence,
print viterbi.viterbi_step('VERB',1,sentence,
(-10, 'NOUN')
(-6, 'NOUN')
Deliverable 3.4 Build the Viterbi trellis. (0.5 points)

Use viterbi_step to implement build_trellis in the viterbi module.

This function should take:

A list of tokens to be tagged
A feature function
A defaultdict of weights
A tag set
It should output a list of dicts. In each dict should be key-value pairs of the form tag:(score,prev_tag). See the example output below.

all_tags = ['NOUN','VERB']
# let's change the weights a little
hand_weights['NOUN','they',EMIT] = -2
hand_weights['VERB','fish',EMIT] = -5
hand_weights['VERB','VERB',TRANS] = -2
trellis = viterbi.build_trellis(sentence,hmm.hmm_features,hand_weights,all_tags)
for line in trellis:
    print line
{'VERB': (-13, '--START--'), 'NOUN': (-3, '--START--')}
{'VERB': (-7, 'NOUN'), 'NOUN': (-11, 'NOUN')}
{'VERB': (-11, 'VERB'), 'NOUN': (-11, 'VERB')}
{'VERB': (-18, 'VERB'), 'NOUN': (-15, 'VERB')}
Deliverable 3.5 Implement the Viterbi algorithm. (0.5 points)

Specifically, use build_trellis to implement viterbi_tagger in the viterbi module.

Your first task will be to figure out the score of the best path, and the final node in the trellis. Use a transition feature involving constants.END_TAG to compute this.

Then iterate backwards through the trellis to construct the best path.


Inputs:
A list of tokens to be tagged
A feature function
A defaultdict of weights
A tag set

Outputs:
Best-scoring tag sequence
Score of best-scoring tag sequence
(['NOUN', 'VERB', 'VERB', 'NOUN'], -17)
(['NOUN', 'VERB', 'VERB', 'VERB', 'VERB', 'VERB', 'NOUN'], -29)
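The whole pipeline (recurrence, trellis, backtrace) can be sketched in a few self-contained functions. This is an illustrative toy, not the gtnlplib implementation: every toy_ name is hypothetical, the string constants are stand-ins for gtnlplib.constants, scoring is done by direct weight lookup instead of a feature function, ties are broken arbitrarily, and the code targets Python 3. The weights are the hand weights from section 3 with the three modified values from the trellis example.

```python
EMIT, TRANS = "--EMISSION--", "--TRANS--"      # hypothetical stand-ins
START_TAG, END_TAG = "--START--", "--END--"
NEG_INF = float("-inf")

# Keys are (tag, word, EMIT) or (current tag, previous tag, TRANS).
toy_weights = {
    ("NOUN", "they", EMIT): -2, ("NOUN", "can", EMIT): -3, ("NOUN", "fish", EMIT): -3,
    ("VERB", "they", EMIT): -11, ("VERB", "can", EMIT): -2, ("VERB", "fish", EMIT): -5,
    ("NOUN", "NOUN", TRANS): -5, ("VERB", "NOUN", TRANS): -2, (END_TAG, "NOUN", TRANS): -2,
    ("NOUN", "VERB", TRANS): -1, ("VERB", "VERB", TRANS): -2, (END_TAG, "VERB", TRANS): -3,
    ("NOUN", START_TAG, TRANS): -1, ("VERB", START_TAG, TRANS): -2,
}

def toy_local_score(tokens, curr_tag, prev_tag, m, weights):
    """Transition score plus emission score (no emission past the last token)."""
    total = weights.get((curr_tag, prev_tag, TRANS), NEG_INF)
    if m < len(tokens):
        total += weights.get((curr_tag, tokens[m], EMIT), NEG_INF)
    return total

def toy_viterbi_step(tag, m, tokens, weights, prev_scores):
    """Return (best score, best previous tag) for `tag` at position m."""
    def path_score(prev_tag):
        return prev_scores[prev_tag] + toy_local_score(tokens, tag, prev_tag, m, weights)
    best_prev = max(prev_scores, key=path_score)
    return path_score(best_prev), best_prev

def toy_build_trellis(tokens, weights, tags):
    """Return one dict per token, mapping tag -> (score, previous tag)."""
    trellis = []
    prev_scores = {START_TAG: 0}
    for m in range(len(tokens)):
        layer = {tag: toy_viterbi_step(tag, m, tokens, weights, prev_scores) for tag in tags}
        trellis.append(layer)
        prev_scores = {tag: score for tag, (score, _) in layer.items()}
    return trellis

def toy_viterbi_tagger(tokens, weights, tags):
    """Return (best tag sequence, its score) by following back pointers from END."""
    trellis = toy_build_trellis(tokens, weights, tags)
    last_scores = {tag: score for tag, (score, _) in trellis[-1].items()}
    best_score, last_tag = toy_viterbi_step(END_TAG, len(tokens), tokens, weights, last_scores)
    path = [last_tag]
    for layer in reversed(trellis[1:]):
        path.append(layer[path[-1]][1])   # back pointer of the tag chosen so far
    path.reverse()
    return path, best_score

toy_sentence = "they can can fish".split()
toy_trellis = toy_build_trellis(toy_sentence, toy_weights, ["NOUN", "VERB"])
toy_result = toy_viterbi_tagger(toy_sentence, toy_weights, ["NOUN", "VERB"])
print(toy_result)  # (['NOUN', 'VERB', 'VERB', 'NOUN'], -17)
```

The final step reuses the same recurrence with END_TAG as the current tag, which is one way to apply the transition feature involving the end tag. Because ties are broken arbitrarily, a back pointer on a tied cell (such as VERB in the last layer here) may differ from the one printed above while the scores stay the same.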
4. Hidden Markov Model: Estimation
You will now implement the estimation for a hidden Markov model.

We’ll start with the tag transitions.

The following function, which we provide, computes the tag transition counts.

tag_trans_counts = most_common.get_tag_trans_counts(TRAIN_FILE)
This function returns a dict of counters, where the keys are tags.

Each counter is the frequency of tags following a given tag, e.g.:

print tag_trans_counts['DET']
print tag_trans_counts[START_TAG]
Counter({u'NOUN': 9676, u'ADJ': 3600, u'PROPN': 1488, u'VERB': 323, u'NUM': 252, u'ADV': 250, u'PUNCT': 201, u'ADP': 185, u'DET': 166, u'PRON': 65, u'SYM': 32, u'X': 16, u'AUX': 16, u'CONJ': 10, u'PART': 4, u'SCONJ': 1})
Counter({u'PRON': 3533, u'PROPN': 1466, u'DET': 1259, u'ADV': 997, u'VERB': 826, u'NOUN': 789, u'ADP': 536, u'ADJ': 486, u'SCONJ': 445, u'PUNCT': 433, u'NUM': 406, u'INTJ': 398, u'AUX': 296, u'CONJ': 290, u'X': 243, u'SYM': 97, u'PART': 43})
Deliverable 4.1 Estimate transition log-probabilities for an HMM. (0.5 points)

Implement compute_transition_weights in the hmm module.

You should use the function get_tag_trans_counts in the most_common module.


Input: Transition counts (from get_tag_trans_counts)
Output: Defaultdict with weights for transition features, in the form $(y_m, y_{m-1}, \text{TRANS})$

Don’t forget to assign smoothed probabilities to transitions which do not appear in the counts.
Do not assign probabilities for transitions to the start tag, which can only come first. This will also affect your computation of the denominator, since you are not smoothing the probability of transitions to the start tag.
Don’t forget to assign probabilities to transitions to the end tag; this too will affect your denominator.
As always, probabilities should sum to one (this time conditioned on the previous tag).
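The hints above can be checked on toy counts. The sketch below rests on assumptions: toy_transition_weights is a hypothetical function, the string constants are stand-ins for gtnlplib.constants, the set of possible next tags is taken to be every tag that appears as a key of the counts (minus the start tag) plus the end tag, and unassigned transitions default to a log-probability of negative infinity. It targets Python 3.

```python
import math
from collections import Counter, defaultdict

TRANS = "--TRANS--"                            # hypothetical stand-ins
START_TAG, END_TAG = "--START--", "--END--"

def toy_transition_weights(trans_counts, smoothing):
    """Return smoothed log P(next tag | previous tag), keyed (next, prev, TRANS)."""
    # START can never be a next tag; END always can.
    next_tags = [tag for tag in trans_counts if tag != START_TAG] + [END_TAG]
    weights = defaultdict(lambda: float("-inf"))
    for prev_tag, ctr in trans_counts.items():
        # The denominator smooths only over legal next tags.
        denom = sum(ctr[tag] for tag in next_tags) + smoothing * len(next_tags)
        for tag in next_tags:
            weights[(tag, prev_tag, TRANS)] = math.log((ctr[tag] + smoothing) / denom)
    return weights, next_tags

# Hypothetical counts: keys are previous tags, values count the following tag.
toy_trans_counts = {
    START_TAG: Counter({"NOUN": 3, "VERB": 1}),
    "NOUN": Counter({"VERB": 2, END_TAG: 2}),
    "VERB": Counter({"NOUN": 1, END_TAG: 1}),
}
toy_trans, toy_next_tags = toy_transition_weights(toy_trans_counts, 0.001)
row_sums = {prev: sum(math.exp(toy_trans[(tag, prev, TRANS)]) for tag in toy_next_tags)
            for prev in toy_trans_counts}
print(row_sums)
print(toy_trans[("NOUN", "NOUN", TRANS)], toy_trans[(START_TAG, "NOUN", TRANS)])
```

The unseen NOUN-to-NOUN transition gets a small but finite log-probability from smoothing, while a transition into the start tag stays at negative infinity because it is never assigned.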
hmm_trans_weights = hmm.compute_transition_weights(tag_trans_counts,.001)
print tag_trans_counts[START_TAG]['NOUN'], hmm_trans_weights[('NOUN',START_TAG,TRANS)]
print tag_trans_counts[START_TAG]['VERB'], hmm_trans_weights[('VERB',START_TAG,TRANS)]
print tag_trans_counts['DET']['VERB'], hmm_trans_weights[('VERB','DET',TRANS)]
print tag_trans_counts['DET']['INTJ'], hmm_trans_weights[('INTJ','DET',TRANS)]
print tag_trans_counts['DET']['NOUN'], hmm_trans_weights[('NOUN','DET',TRANS)]
print tag_trans_counts['VERB'][START_TAG], hmm_trans_weights[(START_TAG,'VERB',TRANS)]
#print tag_trans_counts[END_TAG]['VERB'] # will throw key error
print hmm_trans_weights[('VERB',END_TAG,TRANS)]
789 -2.76615186681
826 -2.72032347091
323 -3.92034540383
0 -16.605756102
9676 -0.520596847943
0 -inf
These log-probabilities should normalize to 1 when summing over $y_m$:
all_tags = tag_trans_counts.keys() + [END_TAG]
print sum(np.exp(hmm_trans_weights[(tag,'NOUN',TRANS)]) for tag in all_tags)
print sum(np.exp(hmm_trans_weights[(tag,'SYM',TRANS)]) for tag in all_tags)
Deliverable 4.2 (1 point 4650; 0.5 points 7650)

Now implement compute_HMM_weights in the hmm module. You should use two functions:

compute_transition_weights, which you just implemented
naive_bayes.estimate_nb, which you used in part 2 to compute the emission log-probabilities, $\log P(w \mid y)$.


Output: defaultdict of hidden Markov model weights
A tricky issue is how to handle unseen words and illegal tag-tag transitions. We will have to handle these cases slightly differently:

For illegal transitions, we need to assign zero probability, which is $-\infty$ log-probability. You can use -np.inf for these weights.
For unseen words, we don’t want to assign zero probability, since unseen words are likely to appear in the test set. Use a defaultdict for this, with log-probability = 0, but don’t include this in the normalizer. We’ll talk more about whether this is a good solution later.
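The difference between the two cases shows up directly in how a defaultdict behaves. This is a small sketch with hypothetical names (toy_theta and the string constants standing in for gtnlplib.constants), written for Python 3; float("-inf") is used so the block needs no imports beyond the standard library, and it compares equal to -np.inf.

```python
from collections import defaultdict

EMIT, TRANS = "--EMISSION--", "--TRANS--"   # hypothetical stand-ins
START_TAG = "--START--"

toy_theta = defaultdict(float)                          # missing keys read as 0.0
toy_theta[("NOUN", "fish", EMIT)] = -3.0                # seen word: estimated log-probability
toy_theta[(START_TAG, "NOUN", TRANS)] = float("-inf")   # illegal: nothing transitions into START

unseen_score = toy_theta[("NOUN", "notarealword", EMIT)]
illegal_score = toy_theta[(START_TAG, "NOUN", TRANS)]
print(unseen_score, illegal_score)
```

Illegal transitions therefore have to be stored explicitly, while unseen emissions need no entry at all. Note that reading a missing key from a defaultdict also inserts it, so code that later iterates over the keys will see any words that were looked up.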
theta_hmm,_ = hmm.compute_HMM_weights(TRAIN_FILE,.01)
print theta_hmm['NOUN','right',EMIT], theta_hmm['ADV','right',EMIT]
print theta_hmm['PRON','she',EMIT], theta_hmm['DET','she',EMIT]
print theta_hmm['NOUN','notarealword',EMIT]
-7.41746204765 -5.33071419765
-4.5722924085 -14.3151646115
print sum(np.exp(theta_hmm['NOUN',word,EMIT]) for word in vocab)
print sum(np.exp(theta_hmm['DET',word,EMIT]) for word in vocab)
Deliverable 4.3 (0.5 points)

We can now combine Viterbi and the HMM weights to compute the tag sequence for the example sentence. Make sure your implementation passes the test for this deliverable, and explain (in whether you think these predicted tags are correct, based on your understanding of the universal part-of-speech tag set.

viterbi.viterbi_tagger(['they', 'can', 'can', 'fish'],hmm.hmm_features,theta_hmm,all_tags)
([u'PRON', u'AUX', u'AUX', u'NOUN'], -31.294333658962991)
Deliverable 4.4 (0.5 points)

As shown in the cell below, the HMM weights include a weight of zero for the emission of unseen words. In, please explain:

Why this is a violation of the HMM probability model explained in the notes;
How, if at all, this will affect the overall tagging.
print theta_hmm['NOUN','right',EMIT], theta_hmm['ADV','right',EMIT]
print theta_hmm['PRON','she',EMIT], theta_hmm['DET','she',EMIT]
print theta_hmm['ADJ','thisworddoesnotappear',EMIT]
-7.41746204765 -5.33071419765
-4.5722924085 -14.3151646115
Deliverable 4.5 (0.5 points)

Run your HMM tagger on the dev data and test data, using the code blocks below.
# this is just for fun
for i,(words,_) in enumerate(preproc.conll_seq_generator(DEV_FILE)):
    print i,
    pred_tags = viterbi.viterbi_tagger(words,hmm.hmm_features,theta_hmm,all_tags)[0]
    for word,pred_tag in zip(words,pred_tags):
        print "%s/%s"%(word,pred_tag),
    print  # end the output line for this sentence
    if i >= 2: break
0 From/ADP the/DET AP/NOUN comes/VERB this/DET story/NOUN :/PUNCT
1 President/PROPN Bush/PROPN on/ADP Tuesday/PROPN nominated/VERB two/NUM individuals/NOUN to/PART replace/VERB retiring/PART jurists/VERB on/ADP federal/ADJ courts/NOUN in/ADP the/DET Washington/PROPN area/NOUN ./PUNCT
2 Bush/PROPN nominated/VERB Jennifer/PROPN M./PROPN Anderson/PROPN for/ADP a/DET 15/NUM -/PUNCT year/NOUN term/NOUN as/ADP associate/NOUN judge/NOUN of/ADP the/DET Superior/PROPN Court/PROPN of/ADP the/DET District/PROPN of/ADP Columbia/PROPN ,/PUNCT replacing/VERB Steffen/ADP W./PROPN Graae/PROPN ./PUNCT
tagger = lambda words, all_tags : viterbi.viterbi_tagger(words,
confusion = tagger_base.eval_tagger(tagger,'hmm-dev-en.preds')
print scorer.accuracy(confusion)
# you don’t have en-ud-test.conllu, so you can’t run this
te_confusion = scorer.get_confusion('data/en-ud-test.conllu','hmm-te-en.preds')
print scorer.accuracy(te_confusion)
IOError: [Errno 2] No such file or directory: 'data/en-ud-test.conllu'
5. Part of Speech tagging in Japanese
Now you will evaluate the tagger you just implemented on a Japanese dataset that uses the same set of part-of-speech tags.

Deliverable 5.1 (1 point 4650, 0.5 points 7650)

First, let’s compare the tag distributions between Japanese and English. Using your get_tag_trans_counts, identify the top three tags that are most likely to follow VERBS and NOUNS in each language.

List these tags in, and try to explain some of the differences you see. You may want to do a little research about Japanese on wikipedia. (If you really want to dig in, check out this research paper, which describes the annotation of the Japanese dataset that we are using:

from gtnlplib.constants import JA_TRAIN_FILE, JA_DEV_FILE, JA_TEST_FILE_HIDDEN, JA_TEST_FILE
from gtnlplib.constants import TRAIN_FILE, START_TAG, END_TAG
tag_trans_counts_en = most_common.get_tag_trans_counts(TRAIN_FILE)
tag_trans_counts_ja = most_common.get_tag_trans_counts(JA_TRAIN_FILE)
print tag_trans_counts_en['VERB'].most_common(3)
print tag_trans_counts_ja['VERB'].most_common(3)
print tag_trans_counts_en['NOUN'].most_common(3)
print tag_trans_counts_ja['NOUN'].most_common(3)
[(u'DET', 5134), (u'ADP', 4553), (u'PRON', 4086)]
[(u'NOUN', 6236), ('--END--', 5232), (u'PUNCT', 3018)]

[(u'PUNCT', 10100), (u'ADP', 7158), (u'NOUN', 4346)]
[(u'NOUN', 19477), (u'VERB', 12790), (u'PUNCT', 4528)]
Deliverable 5.2 (0.5 points)

Now, run the code below to calculate the HMM weights for Japanese, and evaluate a viterbi tagger on the Japanese dev set and test set.

theta_hmm_ja, all_tags_ja = hmm.compute_HMM_weights(JA_TRAIN_FILE,.01)
tagger = lambda words, all_tags : viterbi.viterbi_tagger(words,
confusion = tagger_base.eval_tagger(tagger,'hmm-dev-ja.preds',testfile=JA_DEV_FILE)
print scorer.accuracy(confusion)
# you don’t have the test file, so you can’t run this
confusion_te_ja = scorer.get_confusion(JA_TEST_FILE,'hmm-test-ja.preds')
print scorer.accuracy(confusion_te_ja)
IOError: [Errno 2] No such file or directory: 'data/ja-ud-test.conllu'
6. 7650 “only”
Deliverable 6 Research question (1 point)

Find an example of sequence labeling for a task other than part-of-speech tagging, in a paper at ACL, NAACL, EMNLP, EACL, or TACL, within the last five years (2012-2017). Put your response in


The title, authors, and venue of the paper.
What is the task they are trying to solve?
What methods do they use? HMM, CRF, max-margin Markov network, something else?
Which features do they use?
What methods and features are most effective?
As before, students in 4650 may optionally submit an answer to this question; if you do, you will be graded on the 7650 rubric.