Assignment 2: n-Gram Part of Speech Tagger

LING 380/780

Neural Network Models of Linguistic Structure

Total Points: 100

In this assignment, you will build a part of speech tagger—a model that assigns parts

of speech (i.e., grammatical categories like “noun” or “verb”) to words in a text. You

will do this by training a neural network that takes as input a word (with or without its

surrounding context) and then predicts the part of speech of that word. For many words,

the assignment of part of speech is straightforward: the word bird is always a noun, while

the word annoy is always a verb. However, there are also cases of part of speech ambiguity:

break can be either a noun or verb, depending on context.

For the programming portion, you are provided with an incomplete NumPy implementation

of a multi-layer perceptron. You will need to complete the codebase, including the forward

and backward passes for different types of network objects as well as the core of the

stochastic gradient descent algorithm. The amount of code you will have to write is quite

modest, but each line will require you to think very carefully. In some cases, you will need

to derive array-based mathematical expressions before implementing them in NumPy.

After you have completed your implementation of the neural network, you will run some

experiments to analyze the representations learned by your neural network.

1 Model Implementation

In this first part of the assignment, you will implement the part of speech tagger as a

multi-layer perceptron.

1.1 Installation

For this assignment, you will continue to use NumPy, which you learned in the previous

assignment. In addition, this assignment requires you to install the PyConll package, which


provides an interface for reading CoNLL-X files. CoNLL-X (Buchholz and Marsi, 2006) is

a common file format used in NLP. It represents natural language sentences where each

token is annotated with grammatical properties, including its part of speech tag. Please

install PyConll using pip by running the following command in Terminal (Mac OS and

Linux) or Command Prompt (Windows).

pip install pyconll
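For reference, the following is a minimal sketch (not part of the starter code) of how PyConll can read a CoNLL-U file directly. The provided data_loader module handles file reading for this assignment, so you should not need to call PyConll yourself; the file path below is only illustrative.

import pyconll

# Load a CoNLL-U file; the result is iterable over sentences
corpus = pyconll.load_from_file("data/en_ewt-ud-train.conllu")

# Each sentence is iterable over tokens, and each token carries
# a surface form and a universal POS tag
for sentence in corpus:
    for token in sentence:
        print(token.form, token.upos)
    break  # print only the first sentence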

1.2 Task Specification and Network Architecture

In the POS tagging task, the neural network tagger will receive an n-gram

w = w_{-k} w_{-k+1} ... w_{-1} w_0 w_1 w_2 ... w_k,

where n is odd and k = (n − 1)/2. The goal is to classify the middle token w_0 into one of the
following 17 possible parts of speech, defined by the Universal Dependencies project.

ADJ: adjective ADP: adposition

ADV: adverb AUX: auxiliary

CCONJ: coordinating conjunction DET: determiner

INTJ: interjection NOUN: noun

NUM: numeral PART: particle

PRON: pronoun PROPN: proper noun

PUNCT: punctuation SCONJ: subordinating conjunction

SYM: symbol VERB: verb

X: other

Figure 1 shows the architecture of the neural network POS tagger you will implement for

this task. This network consists of an embedding layer, an output layer, and a variable

number of hidden layers determined by the hyperparameter h. Below we explain the various

parts of the model.

• Input Representation. The input to the model is a mini-batch of n-grams, represented as a 2D array of shape (batch size, n). Each row (i.e., set of entries along dimension 0) of the array represents an n-gram, while each column (i.e., set of entries along dimension 1) represents all the w_i s in the batch for some position i where −k ≤ i ≤ k. The entries of the matrix are indices: integers that represent one of the possible tokens in the vocabulary, including the [UNK], [BOS], and [EOS] tokens.

• Embedding Layer. The embedding layer takes a batch of indices as input and

replaces each index with its corresponding word embedding vector. The output of

the embedding layer is a 3D array of shape (batch size, n, embedding size). The

embedding layer is parameterized by a word embedding matrix of shape (vocabulary

size, embedding size).


• Hidden Layers. Each hidden layer consists of a linear layer followed by a tanh

activation function. The output of the hidden layer, an array of shape (batch size,

hidden size), is a batch of vectors known as hidden representations, where the hidden

size is a hyperparameter of the model. Hidden representations are understood to

be latent features computed by the network—we don’t know what they represent or

how they are interpreted. The first hidden layer takes an input of shape (batch size,

n × embedding size) containing all the embeddings in the mini-batch concatenated

together; the output of the Embedding layer must be reshaped to this format. The

other hidden layers take an input of shape (batch size, hidden size) containing the

outputs of the previous layers.

• Output Layer. The output layer, the final layer of the network, consists solely of a linear layer. For reasons of computational efficiency, we exclude the softmax activation from the output layer, and instead implement softmax as part of the cross-entropy loss function. The output of the output layer (which is the final output of the network) is an array of shape (batch size, 17) containing confidence scores assigned by the model to each of the 17 possible POS tags. These scores are on the logit scale, meaning that the softmax of the scores is interpreted as a probability distribution over the possible POS tags. For each row of the output, we interpret the column containing the highest confidence score to represent the network’s predicted POS tag for the corresponding item in the mini-batch. (A sketch of the array shapes flowing through the network appears after this list.)
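To make the data flow concrete, here is a rough sketch of the array shapes that pass through the architecture for a single mini-batch. It is illustrative only: the variable names do not appear in the starter code, and random arrays stand in for the real computations.

import numpy as np

batch_size, n, embedding_size, hidden_size = 32, 3, 50, 16
vocab_size, num_tags = 100, 17

# Input: a mini-batch of n-grams, each token encoded as an index
ngrams = np.random.randint(vocab_size, size=(batch_size, n))       # (32, 3)

# Embedding layer: one embedding vector per token
embedded = np.random.rand(batch_size, n, embedding_size)           # (32, 3, 50)

# Concatenate the n embeddings before the first hidden layer
flattened = embedded.reshape(batch_size, n * embedding_size)       # (32, 150)

# Each hidden layer: a linear map followed by tanh
hidden = np.tanh(np.random.rand(batch_size, hidden_size))          # (32, 16)

# Output layer: one confidence score (logit) per POS tag
logits = np.random.rand(batch_size, num_tags)                      # (32, 17)

# Predicted tag: the column with the highest confidence score
predicted_tags = logits.argmax(axis=-1)                            # (32,)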

Problem 1. Suppose our model uses the following hyperparameters.

• Vocabulary size: 100

• n (length of n-grams): 3 (i.e., k = 1)

• Embedding size: 50

• Hidden size: 16

• h (number of hidden layers): 1

How many trainable parameters does the neural network have? (In other words, how many

array entries are there among the embedding matrix as well as the weight matrix and bias

vector for all the hidden layers and the output layer?) Which part of the network has the

most parameters?

1.3 Starter Code

The file archive for this assignment should have the following directory structure.

• data
  – en_ewt-ud-train.conllu
  – en_ewt-ud-dev.conllu
  – en_ewt-ud-test.conllu
  – glove_embeddings.txt
  – unambiguous_pos_tags.csv
• data_loader.py
• layers.py
• loss.py
• metrics.py
• model.py
• train.py

The three .conllu files in the data folder contain the sentences that we will be using for training, validating, and testing our POS tagger. The dataset files are in CoNLL-U format, a variant of CoNLL-X. The file glove_embeddings.txt is similar to the file word2vec_embeddings.txt from Assignment 1, except that the word embeddings are trained using the GloVe model (Pennington et al., 2014) instead of the word2vec model. The file unambiguous_pos_tags.csv will be used in Part 2 (Analysis) of the assignment.

The .py files in the root directory are Python modules—Python files containing code that can be used in other files and scripts using the import keyword. For example, the function load_embeddings in the file data_loader.py can be used as follows:

# Import load_embeddings from the data_loader module
from data_loader import load_embeddings

# Call the load_embeddings function
all_tokens, all_embeddings = load_embeddings()

As you can see, the name of each module is simply its filename, but without the .py

extension. A collection of modules within a folder, such as this one, is known as a package.

Now, examine the contents of each module in the package. Even though you only have to

modify a relatively small number of locations in the code, it is important to understand

the overall structure of the package and the role each module plays in the code. Broadly

speaking, the modules are organized into the following three functional categories.

Data Interface. In the data_loader module, we have provided functions that read CoNLL-X files and convert them into Python data structures that you will use in other parts of the code. We have also provided (a slightly modified version of) the load_embeddings function from Assignment 1. When run for the first time, this function loads the GloVe embeddings from the glove_embeddings.txt file and then saves them in binary format as a Pickle file. The second time it is run, it loads the binary file (much more quickly).
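The caching behavior works roughly as in the following sketch. The file names, arguments, and parsing logic here are only illustrative and may differ from what data_loader.py actually does.

import os
import pickle

def load_cached(text_path, cache_path):
    # Fast path: a binary cache already exists
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # Slow path: parse the text file, then cache the result
    tokens, vectors = [], []
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            tokens.append(parts[0])
            vectors.append([float(x) for x in parts[1:]])
    with open(cache_path, "wb") as f:
        pickle.dump((tokens, vectors), f)
    return tokens, vectors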

The data_loader module contains two important classes.

• data_loader.Vocabulary: A Vocabulary represents a collection of possible tokens and a mapping of each token form to a unique numerical identifier known as its index. You will need to create two Vocabularys in this assignment: one that represents possible tokens in the text and one that represents possible POS tags.

• data_loader.Dataset: A Dataset represents a collection of n-grams, where n is odd, labeled with the POS tag of the token in the middle. You will use the from_conll function to load data from a CoNLL-X file and convert it into NumPy arrays, and you will use the get_batches function to divide the data into mini-batches. Like load_embeddings, the from_conll function will store a binary version of the data which will be loaded (more quickly) on subsequent calls.

Example:

# Import everything from data_loader
from data_loader import *

# Create vocabularies
all_tokens, _ = load_embeddings()
token_vocab = Vocabulary(all_tokens + ["[UNK]", "[BOS]", "[EOS]"])
pos_tag_vocab = Vocabulary(all_pos_tags)

# Load a CoNLL-U file and convert it to trigrams (n=3)
train_data = Dataset.from_conll("data/en_ewt-ud-train.conllu",
                                token_vocab, pos_tag_vocab,
                                ngram_size=3)

# Loop over mini-batches of size 16
for ngrams, pos_tags in train_data.get_batches(16):
    ...

Neural Network Components. The modules layers, model, and loss contain code

used to implement the multi-layer perceptron and cross-entropy loss function. The functionality of each module is as follows.


• layers contains Python classes for three types of network layers: Embedding, Linear,

and Tanh. You will need to complete the forward and backward computations in the

implementations of these classes.

• model contains the MultiLayerPerceptron class, which represents the full neural

network that you will train. An object of this class contains within it objects of the

various layer types, which are assembled into the architecture depicted above. The

forward and backward methods perform the forward and backward computations

by calling the forward and backward methods for each layer in the appropriate order

(forward for the forward computation, in reverse for the backward computation).

• loss contains code that combines the softmax activation function of the neural network with the cross-entropy loss function. As you will soon discover, implementing

these two components together is more efficient than implementing them separately.

You will need to implement the backward computation of this combined softmax–loss

function unit.

At the heart of these three modules is the layers.Layer class, an abstract class¹ that represents neural network layers. Please study the API for our neural networks by looking at the functions declared by Layer. The abstract class comes with code for resetting gradients to 0 (the clear_grad function) and updating parameters during SGD (the update function). Each subclass of Layer must implement a forward pass (forward function) and a backward pass (backward function) for backpropagation. The constructor (__init__ method) of each Layer subclass must declare its parameters and place them within the params dict. Layer objects themselves can be called as functions; this has the effect of calling the forward function while saving the input to forward. For example:

import numpy as np
from layers import Linear

linear = Linear(2, 3)
x = np.random.rand(5, 2)

# Calls linear.forward
print(linear(x))

# Stored layer input
print(linear.x)

When working with Layers, you should always run the forward pass by calling the Layer

object, and never call the forward function directly.

¹Please read this blog post if you don’t know what an abstract class is: https://www.geeksforgeeks.org/abstract-classes-in-python/


Notice that model.MultiLayerPerceptron and loss.CrossEntropySoftmaxLoss are also

subclasses of layers.Layer, even though they represent objects that are not typically

thought of as “neural network layers.” This is because both classes can be included in a

computation graph, and therefore require a forward and backward pass implementation,

making them compatible with the Layer interface.

Training and Evaluation Code. The remaining modules contain code for training,

validating, and testing a POS tagger.

• train contains functions to train and test your network. Below if __name__ == "__main__", the train.py file also contains a Python script that can be called from the command line.² Doing so will train a network using SGD with one particular configuration of hyperparameters and test the performance of the model with the best validation accuracy.

• metrics provides code for the assessment of average loss and accuracy, which will be

used during training.

1.4 Forward Computations

For this part of the assignment, you will implement the forward pass for the layers.Tanh

activation function and layers.Linear layer. The forward pass of the layers.Embedding

layer and model.MultiLayerPerceptron model has already been implemented for you.

Problem 2. Implement the forward function for layers.Tanh. Please use the np.tanh

function for the forward pass.

Problem 3. Implement the forward function for layers.Linear. The forward function

should return the value of the linear map

W x + b

for input x, where W is the weight matrix of the layer and b is the bias vector. To do

this, you will need to have access to the parameters of the linear layer, which are stored in

the params dict. These can be retrieved as self.params["w"] for the weight matrix and self.params["b"] for the bias vector.

Hints:

1. Be sure to do your computations via matrix multiplication and do not use loops

to calculate the separate values—otherwise your implementation will be too slow!

²The line if __name__ == "__main__" allows a Python file to be used either as a module or a script. The code under the if statement is not executed when the file is used as a module.


2. Your Linear must be able to apply to matrices of shape (batch size, hidden size)

or (batch size, n × embedding size), but the weight matrix has shape (output size,

input size) and the bias vector has shape (output size,). This means that you will not

be able to perform the matrix multiplication exactly as in the equation given above.

Instead, you may need to transpose one of the arrays in the forward pass in order to

make the matrix multiplication work.

1.5 Backward Computations

Next, you will implement the backward pass for the layers.Tanh activation function

and model.MultiLayerPerceptron model. The backward pass of layers.Embedding and

layers.Linear has already been implemented for you.

Problem 4. Implement the backward functions for layers.Tanh. Use the following

formula for the gradient of tanh:

d/dx tanh(x) = 1 − tanh(x)².

Recall that the input to backward is the Jacobian

δ = ∂L / ∂tanh(x).

The shape of δ is the same as that of the output tanh(x). Each entry of δ contains the partial derivative of L with respect to the corresponding entry of tanh(x). The return value of backward needs to be the Jacobian

∂L / ∂x,

represented as an array of the same shape as x where each entry contains the partial derivative of L with respect to the corresponding entry of x.

Problem 5. Implement the backward functions for layers.Linear. In addition to returning the Jacobian ∂L/∂x, you will also need to update the values of self.grad["w"] and self.grad["b"], which contain the gradients ∇_W L = (∂L/∂W)^⊤ and ∇_b L = (∂L/∂b)^⊤ respectively.³ The gradients are represented as arrays of the same shape as their respective parameters. Your code will need to compute the gradients ∇_W L and ∇_b L using the chain rule and add them to the existing values stored in self.grad. (As we learned in class, this is called “gradient accumulation.”)

Problem 6. Implement the backward function of model.MultiLayerPerceptron. The

forward function has already been implemented for you: it first applies the forward of the

³If W ∈ R^{p×q}, then technically ∇_W L is a p×q matrix while ∂L/∂W is a 1×p×q 3D array (a “tensor”). The “transpose operation” is really just getting rid of that first dimension with only one row.


embedding layer, then reshapes its output to concatenate the word embeddings together,

then applies the forward of each of the layers. Remember that model.MultiLayerPerceptron does not include the softmax function.

As with layers.Tanh, the input to backward will be the Jacobian δ:

δ = ∂L / ∂ŷ,

where ŷ is the output shown in Figure 1 (which does not include softmax). Your code should iterate through the list of layers in self.layers (in reverse order), applying each layer’s backward method to compute a new value of δ representing the partial derivative of the loss with respect to the input to that layer, in accordance with the chain rule. For example, if z is the input to the output layer, then we can push one step backward through the output layer as follows:

∂L/∂z = δ ∂ŷ/∂z.

(Notice that the provided implementations of layers.Embedding.backward and layers.Linear.backward store the gradients with respect to their parameters in the grad dict in addition to returning the gradients with respect to their inputs.)

Hints:

1. As with the previous problem, you need to express the computations in terms of

matrix operations, rather than for loops. Once again, work through the derivatives

on pencil and paper. First, apply the chain rule to compute the derivatives with

respect to individual units, weights, and biases. Next, take the formulas you’ve

derived, and express them in matrix form. You should be able to express all of

the required computations using only matrix addition, matrix multiplication, matrix

transpose, and elementwise operations—no for loops!

2. Notice that the output and δ of the Embedding layer have shape (batch size, n, embedding size), while the output of the backward function for the first Linear layer (i.e., the first layer in the layers list) has shape (batch size, n × embedding size). When propagating the gradient to the Embedding, you will need to reshape δ to the correct shape by un-concatenating the word embeddings in the n-gram.

1.6 Stochastic Gradient Descent

SGD is implemented by the layers.Layer.update function, which updates the parameters

(in the params dict) of a layer based on gradients (in the grad dict) previously computed

by backward.
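As a reminder from class, a single SGD step updates each parameter θ of a layer using the gradient ∇_θ L stored for it in the grad dict and the learning rate η (however the learning rate is supplied to update in the starter code):

θ ← θ − η ∇_θ L.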


Problem 7. Implement the update function for the layers.Layer base class. This will

implement SGD for all the layers in layers in one fell swoop, since those layers do not

override the update function.

Problem 8. Implement the update function for model.MultiLayerPerceptron. This

function will need to call update on the embedding layer as well as all the hidden layers

and output layer of the multi-layer perceptron.

1.7 Implementing Softmax and Cross-Entropy Loss

Now, you will implement the softmax activation function and the cross-entropy loss function. Recall that these are defined as follows:

softmax(x) = e^x / (1^⊤ e^x)

L_CE(softmax(ŷ), y) = −ln(softmax(ŷ)_y) = −ln( e^{ŷ_y} / Σ_{i=1}^{17} e^{ŷ_i} ).

Unfortunately, the softmax function suffers from issues of numerical stability: because e^x grows extremely quickly relative to x, implementing softmax directly from its definition may lead to overflow errors. To try to avoid this, we shift x by a constant so that e^x will not be too large.
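A quick NumPy experiment (illustrative only) shows why the shift is needed: exponentiating large logits overflows to inf, and the resulting division produces nan, whereas subtracting the maximum entry first leaves the computation well defined.

import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: np.exp overflows to inf, and inf / inf gives nan
naive = np.exp(logits) / np.exp(logits).sum()

# Shifted softmax: subtract the maximum entry first
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()

print(naive)   # [nan nan nan]
print(stable)  # [0.09003057 0.24472847 0.66524096]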

Problem 9. Implement the softmax function in the loss module. On input ŷ, this function should return the vector

softmax(ŷ − c),

where

c = max_i (ŷ_i)

is the largest (i.e., most positive) entry of ŷ. The input and output of softmax should both be arrays of shape (batch size, 17). The constant c should be computed separately for each logit vector in the batch (i.e., the max should be taken along the last dimension).

Problem 10. Next, verify that your code for Problem 9 is a valid implementation of

softmax. Prove that for any input vector x and scalar c,

softmax(x) = softmax(x + c).

That is, show that the softmax function is invariant to constant offsets in the input. This means that the stabilization trick of subtracting the maximum entry of ŷ does not affect the output of softmax.


To complete the code for your neural network POS tagger, you will simultaneously implement the backward pass for both the cross-entropy loss function and the softmax activation

function. It turns out that there is a simple mathematical expression for the gradient of

these two units put together, which makes it more efficient to implement them as a single

layer than as two separate layers.

Problem 11. Implement the forward function of CrossEntropySoftmaxLoss using your own implementation of softmax from Problem 9.

Problem 12. Implement the backward function of CrossEntropySoftmaxLoss. Since the δ of the loss function is always 1,⁴ this backward function does not take a delta parameter. Its output should be the δ for the model.MultiLayerPerceptron model:

δ = ∂L/∂ŷ = ∂/∂ŷ L_CE(softmax(ŷ), y).

As before, your output should be a matrix of shape (batch size, 17) where each row represents an example in the input batch, and each column contains the derivative of the loss with respect to the corresponding column of ŷ.

Hints:

1. Using pencil and paper, write out the definition of L_CE(softmax(ŷ), y) and expand this expression. Then, compute the partial derivative of this expression with respect to each entry ŷ_i of the output layer’s output.

2. After you have computed ∂L/∂ŷ_i by hand, try to find a clean, array-based expression for δ.

1.8 Training Code

The final step is to train the model. The functions train_epoch and run_trial from train implement the main training procedure. However, train_epoch, which trains a given model for a single epoch, is incomplete—you are responsible for filling in the gaps.

Problem 13. Fill in the marked lines of code in the main loop of train.train_epoch (starting from line 31 in the file). This for-loop iterates over mini-batches of data. You

will need to implement one step of the SGD algorithm: first clear the stored gradients, then

perform the forward pass, then perform the backward pass, then update the parameters of

the model.

Once you have implemented the training loop, there are two ways you can train the model.

The first is to call the train.py file as a script. You can do this from Terminal (Mac OS

and Linux) or Command Prompt (Windows) by calling:

⁴Noting that L is the sum of losses, convince yourself that this is the case.


python train.py

or by executing a system call in Jupyter Notebook:

!python train.py

Another way to train the model is to recreate the training script at the bottom of the

train.py file in your own script or in a Jupyter Notebook. You will need to import the

run_trial function as follows:

from train import run_trial

We will not grade any scripts you choose to write for training the model, including the

script at the bottom of train.py. Please do not include any extra Python scripts or

Jupyter Notebooks with your submission.

2 Analysis

In the second part of the assignment, you will try to understand what your neural network

has learned during training.

2.1 Part of Speech Ambiguity

Most English words only have one part of speech; for these words, the network simply needs

to memorize their POS tag. Words with multiple parts of speech are more challenging for

the model. Can your trained network handle these more difficult cases?

Problem 14. Find three words that can take on more than one part of speech. For each

of your three words, give two 3-grams with your word in the middle: one that forces your

word to have one part of speech, and one that forces your word to have a different part of

speech.

Problem 15. Train a 3-gram POS tagger and feed your six 3-grams from Problem 14

into your POS tagger. Report the testing accuracy of your model; make sure that it is at

least 75%. What POS tags does your tagger assign to the six 3-grams? For which of the

3-grams does your model make a correct prediction?

Hints:

1. Try using the following hyperparameters for training your model: learning rate 0.1,

batch size 5, step size 4, = .25, 1 layer.

2. Use the following code template in a Python script or Jupyter Notebook to run your

model on a 3-gram.


import numpy as np

# Create Vocabularies
token_vocab = Vocabulary(all_tokens + ["[UNK]", "[BOS]", "[EOS]"])
pos_tag_vocab = Vocabulary(all_pos_tags)

# Prepare model input
tokens = ["Hello", "world", "!"]
ngrams = np.array(token_vocab.get_ngrams(tokens, 3))[1:2]

# Get model output
predictions = model(ngrams).argmax(axis=-1)
pos_tags = [pos_tag_vocab.get_form(p) for p in predictions]
print(pos_tags)

Problem 16. Find three symmetrical skip-grams with window size 1 (i.e., three 3-grams

with the middle word missing) such that only words belonging to one particular part of

speech can be inserted into the middle of the skip-gram. Then, form three 3-grams by

inserting [UNK] into the middle of your three skip-grams. What POS tag is assigned to

your three 3-grams by the POS tagger you trained in Problem 15?

2.2 POS and Word Embeddings

The POS tagging model learns embeddings that are optimized to carry out the tagging

task. In this final part, you will consider the structure of this embedding space and how

it facilitates the task of POS tagging. To do this, we will focus our attention on the

embeddings of POS-unambiguous words, i.e., words that appear tagged with only a single

part of speech in the training data. POS-ambiguous words will be represented as some

combination of the multiple parts of speech with which they are associated (sensitive to

the frequency with which they occur in each), and therefore will not be easily interpretable.

In the data folder we have provided with this assignment, there is a CSV file unambiguous_pos_tags.csv, which contains all of the POS-unambiguous words from the training set.

You can load these words, together with their corresponding POS tags, using Python’s csv

module:

import csv

# Open a CSV file for reading
with open("data/unambiguous_pos_tags.csv", "r") as f:
    # Convert the CSV file to a list, remove header row
    unambig_pos_tags = list(csv.reader(f))[1:]


In order to explore the high-dimensional embeddings of these words, we must perform some

sort of dimensionality reduction. In this case, we will use a technique called t-distributed

stochastic neighbor embedding, or t-SNE. (You don’t need to know what this is for the

assignment, but if you are curious, you can read about it at the end of Chapter 1 of the

course notes.) The important point is that nearby points in the low-dimensional space that

t-SNE produces will correspond to nearby points in the high-dimensional space of word

embeddings—i.e., the embedding will be (roughly) isometric. To create t-SNE embeddings, you need to look up the embeddings corresponding to the words in the unambig_pos_tags list

you just created. Given an embedding matrix embeddings, you can do this for just the first

1000 embeddings—performing t-SNE for larger sets gets computationally expensive—and

produce a 2-dimensional plot of the result as follows:

from sklearn.manifold import TSNE   # Creates t-SNE embeddings
import matplotlib.pyplot as plt     # Creates plots


def plot_embeddings_by_pos(tsne_embeddings, unambig_pos_tags,
                           token_vocab, pos_tag_vocab):
    """
    Plots a set of t-SNE embeddings and organizes the
    points by POS tag.
    """
    for pos in pos_tag_vocab:
        # Get indices for all words whose POS tag is pos
        indices = [token_vocab.get_index(w)
                   for w, p in unambig_pos_tags if p == pos]

        # Remove indices outside of the first 1,000 words
        indices = [i for i in indices if i < 1000]

        # Add the points for these indices to the plot,
        # assigning a unique color to this POS tag
        plt.plot(tsne_embeddings[indices, 0],
                 tsne_embeddings[indices, 1],
                 marker=".",
                 linestyle="",
                 markersize=12,
                 label=pos)

    plt.legend(loc="best", bbox_to_anchor=(-0.1, 1.1))


# You can omit if __name__ == "__main__": if running in a
# Jupyter Notebook
if __name__ == "__main__":
    # Extract embeddings from a model
    embeddings = model.embedding_layer.params["embeddings"]
    embeddings = embeddings[:1000]

    # Fit t-SNE embeddings
    tsne_model = TSNE(n_components=2)
    tsne_embeddings = tsne_model.fit_transform(embeddings)

    # Plot the embeddings
    plot_embeddings_by_pos(tsne_embeddings, unambig_pos_tags,
                           token_vocab, pos_tag_vocab)

Problem 17. First, initialize a MultiLayerPerceptron model without loading pre-trained

GloVe embeddings and create t-SNE plots for the model’s randomly initialized embeddings.

Then, train the model until it reaches at least 75% accuracy, and create t-SNE plots for the

model’s (trained) word embeddings. How do these plots compare? What does this tell us

about the information encoded in the network’s word embeddings pre- and post-training?

Now create a t-SNE plot for pre-trained GloVe or word2vec embeddings, loaded from

glove_embeddings.txt or the word2vec_embeddings.txt file from Assignment 1, that have not undergone any POS-tag training. What can you conclude about the degree to

which the pre-trained embeddings encode information about part of speech? Given your

knowledge of word2vec (assume that GloVe is also based on distributional semantics), is

this what you would expect? Why?

3 Submission Instructions

To submit your completed assignment, please upload the following files to CodePost. Please

ensure that your files have the same filenames as indicated below. Failure to submit your

assignment correctly will result in a deduction of 5 points.

• All the modules in the provided Python code package, with your code filled in. Do

not include the data directory, and do not change any of the filenames in the package.

• A Markdown document called assignment2.md, containing your responses to Problems 1, 10, 14, 15, 16, and 17.

• Any images of t-SNE plots that are embedded in your assignment2.md file for Problem 17.


References

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X Shared Task on Multilingual Dependency Parsing. In Proceedings of the Tenth Conference on Computational Natural

Language Learning (CoNLL-X), pages 149–164, New York, NY, USA. Association for

Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical

Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
