Advanced Vector Space Models : Assignment
In this assignment, we will examine some advanced uses of vector representations of words. We are going to look at two
1. Solving word relation problems like analogies using word embeddings.
2. Discovering the different senses of a “polysemous” word by clustering together its synonyms. You will use an open
source Python package for creating and manipulating word vectors called gensim. Gensim lets you easily train word
embedding models like word2vec.
In order to use the gensim package, you’ll have to be using Python version 3.6 or higher. On my Mac, I did the
brew install python3
pip3 install gensim
Then when I ran python, I used the command python3 instead of just python
Here are the materials that you should download for this assignment:
question1.txt (/downloads/hw4/question1.txt) A template for answering question 1.
data.zip (/downloads/hw4/data.zip) Contains all the data
vectorcluster.py (/downloads/hw4/vectorcluster.py) Main code stub
evaluate.py (/downloads/hw4/evaluate.py) Evaluation script
writeup.tex (/downloads/hw4/writeup.tex) Report template.
makecooccurrences.py (/downloads/hw4/makecooccurrences.py) Script to make cooccurrences (optional
Tokenized Reuters RCV1 Corpus (http://www.cis.upenn.edu/~cis530/18sp/data/reuters.rcv1.tokenized.gz)
Google’s pretrained word2vec vectors (https://code.google.com/archive/p/word2vec/), under the heading
“Pretrained word and phrase vectors”
Part 1: Exploring Analogies and Other Word
Word2vec is a very cool word embedding method that was developed by Thomas Mikolov and his collaborators. One of
the noteworthy things about the method is that it can be used to solve word analogy problems like man is to king as
woman is to [blank] The way that it works is to perform vector math. They take the vectors representing king, man and
woman and perform some vector arithmetic to produce a vector that is close to the expected answer.
So We can find the nearest a vector in the vocabulary by looking for
. Omar Levy has a nice explnation of the method in this Quora post
(https://www.quora.com/How-does-Mikolovs-word-analogy-for-word-embedding-work-How-can-I-code-such-afunction), and in his paper Linguistic Regularities in Sparse and Explicit Word Representations
king − man + woman ≈ queen
argmax cos(x, king − man + woman)
In addition to solving this sort of analogy problem, the same sort of vector arithmetic was used with word2vec
embeddings to find relationships between pairs of words like the following:
In the first part of this homework, you will play around with the gensim library
(https://radimrehurek.com/gensim/index.html) library. You will use gensim load a dense vector model trained using
word2vec , and use it to manipulate and analyze the vectors.
You can start by experimenting on your own, or reading through this tutorial on using word2vec with gensim (https://raretechnologies.com/word2vec-tutorial/). You should familiarize yourself with the KeyedVectors documentation
The questions below are designed to familiarize you with the gensim Word2Vec package, and get you thinking about
what type of semantic information word embeddings can encode. You’ll submit your answers to these questions when
you submit your other homework materials.
Load the word vectors using the following Python commands:
from gensim.models import KeyedVectors
vecfile = ‘GoogleNews-vectors-negative300.bin’
vecs = KeyedVectors.load_word2vec_format(vecfile, binary=True)
What is the dimensionality of these word embeddings? Provide an integer answer.
What are the top-5 most similar words to picnic (not including picnic itself)? (Use the function
According to the word embeddings, which of these words is not like the others? [’tissue’, ‘papyrus’,
‘manila’, ‘newsprint’, ‘parchment’, ‘gazette’] (Use the function
Solve the following analogy: “leg” is to “jump” as X is to “throw”. (Use the function
gensim.models.KeyedVectors.wv.most_similar with positive and negative arguments.)
We have provided a file called question1.txt for you to submit answers to the questions above.
Part 2: Creating Word Sense Clusters
Many natural language processing (NLP) tasks require knowing the sense of polysemous words, which are words with
multiple meanings. For example, the word bug can mean
1. a creepy crawly thing
2. an error in your computer code
3. a virus or bacteria that makes you sick
4. a listening device planted by the FBI
In past research my PhD students and I have looked into automatically deriving the different meaning of polysemous
words like bug by clustering their paraphrases. We have developed a resource called the paraphrase database (PPDB)
(http://paraphrase.org/) that contains of paraphrases for tens of millions words and phrases. For the target word bug, we
have an unordered list of paraphrases including: insect, glitch, beetle, error, microbe, wire, cockroach, malfunction,
microphone, mosquito, virus, tracker, pest, informer, snitch, parasite, bacterium, fault, mistake, failure and many others. We
used automatic clustering group those into sets like:
These clusters approximate the different word senses of bug. You will explore the main idea underlying our word sense
clustering method: which measure the similarity between each pair of paraphrases for a target word and then group
together the paraphrases that are most similar to each other. This affinity matrix gives an example of one of the methods
for measuring similarity that we tried in our paper (https://www.cis.upenn.edu/~ccb/publications/clustering-paraphrasesby-word-sense.pdf):
Here the darkness values give an indication of how similar paraprhases are to each other. For instance sim(insect, pest) >
In this assignment, we will use vector representations in order to measure their similarities of pairs of paraprhases. You will
play with different vector space representations of words to create clusters of word senses.
In this image, we have a target word “bug”, and a list of all synonyms (taken from WordNet). The 4 circles are the 4 senses
of “bug.” The input to the problem is all the synonyms in a single list, and the task is to separate them correctly. As
humans, this is pretty intuitive, but computers aren’t that smart. We will use this task to explore different types of word
You can read more about this task in these (https://www.cis.upenn.edu/~ccb/publications/clustering-paraphrases-byword-sense.pdf) papers (https://cs.uwaterloo.ca/~cdimarco/pdf/cs886/Pantel+Lin02.pdf).
Clustering with Word Vectors
We expect that you have read Jurafsky and Martin, chapters 15 (https://web.stanford.edu/~jurafsky/slp3/15.pdf) and 16
(https://web.stanford.edu/~jurafsky/slp3/16.pdf). Word vectors, also known as word embeddings, can be thought of
simply as points in some high-dimensional space. Remember in geometry class when you learned about the Euclidean
plane, and 2-dimensional points in that plane? It’s not hard to understand distance between those points – you can even
measure it with a ruler. Then you learned about 3-dimensional points, and how to calculate the distance between these.
These 3-dimensional points can be thought of as positions in physical space.
Now, do your best to stop thinking about physical space, and generalize this idea in your mind: you can calculate a
distance between 2-dimensional and 3-dimensional points, now imagine a point with 300 dimensions. The dimensions
don’t necessarily have meaning in the same way as the X,Y, and Z dimensions in physical space, but we can calculate
distances all the same.
This is how we will use word vectors in this assignment: as points in some high-dimensional space, where distances
between points are meaningful. The interpretation of distance between word vectors depends entirely on how they were
made, but for our purposes, we will consider distance to measure semantic similarity. Word vectors that are close together
should have meanings that are similar.
With this framework, we can see how to solve our synonym clustering problem. Imagine in the image below that each
point is a (2-dimensional) word vector. Using the distance between points, we can separate them into 3 clusters. This is
(Image taken from Wikipedia (https://en.wikipedia.org/wiki/K-means_clustering))
The data to be used for this assignment consists of sets of paraphrases corresponding to one of 56 polysemous target
Target Paraphrase set
note.v comment mark tell observe state notice say remark mention
hot.a raging spicy blistering red-hot live
(Here the .v following the target note indicates the part of speech.)
Your objective is to automatically cluster each paraphrase set such that each cluster contains words pertaining to a single
sense, or meaning, of the target word. Note that a single word from the paraphrase set might belong to one or more
For evaluation, we take the set of ground truth senses from WordNet (http://wordnet.princeton.edu).
The development data consists of two files – a words file (the input), and a clusters file (to evaluate your output). The
words file dev_input.txt is formatted such that each line contains one target, its paraphrase set, and the number of
ground truth clusters k, separated by a :: symbol:
target.pos :: k :: paraphrase1 paraphrase2 paraphrase3 …
You can use k as input to your clustering algorithm.
The clusters file dev_output.txt contains the ground truth clusters for each target word’s paraphrase set, split over k
target.pos :: 1 :: paraphrase2 paraphrase6
target.pos :: 2 :: paraphrase3 paraphrase4 paraphrase5
target.pos :: k :: paraphrase1 paraphrase9
For testing, you will receive only words file test_input.txt containing the test target words and their paraphrase sets.
Your job is to create an output file, formatted in the same way as dev_output.txt , containing the clusters produced by
your system. Neither order of senses, nor order of words in a cluster matter.
There are many possible ways to evaluate clustering solutions. For this homework we will rely on the paired F-score,
which you can read more about in this paper (https://www.cs.york.ac.uk/semeval2010_WSI/paper/semevaltask14.pdf).
The general idea behind paired F-score is to treat clustering prediction like a classification problem; given a target word
and its paraphrase set, we call a positive instance any pair of paraphrases that appear together in a ground-truth cluster.
Once we predict a clustering solution for the paraphrase set, we similarly generate the set of word pairs such that both
words in the pair appear in the same predicted cluster. We can then evaluate our set of predicted pairs against the ground
truth pairs using precision, recall, and F-score.
We have provided an evaluation script that you can use when developing your own system. You can run it as follows:
python evaluate.py <GROUND-TRUTH-FILE> <PREDICTED-CLUSTERS-FILE>
On the dev data, a random baseline gets about 20%, the word cooccurrence matrix gets about 36%, and the word2vec
vectors get about 30%.
1. Sparse Representations
Your next task is to generate clusters for the target words in test_input.txt based on a feature-based (not dense)
vector space representation. In this type of VSM, each dimension of the vector space corresponds to a specific feature,
such as a context word (see, for example, the term-context matrix described in Chapter 15.1.2 of Jurafsky & Martin
You will calculate cooccurrence vectors on the Reuters RCV1 corpus. Download a tokenized and cleaned version here
(http://www.cis.upenn.edu/~cis530/18sp/data/reuters.rcv1.tokenized.gz). The original is here
Use the provided script, makecooccurrences.py , to build these vectors. Be sure to set D and W to what you want.
It can take a long time to build cooccurrence vectors, so we have pre-built a set, included in the data.zip, called
coocvec-500mostfreq-window-3.vec.filter . To save on space, these include only the words used in the given files.
You will add K-means clustering to vectorcluster.py . Here is an example of the K-means code:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=k).fit(X)
The baseline system for this section represents words using a term-context matrix M of size |V| x D , where |V| is the
size of the vocabulary and D=500. Each feature corresponds to one of the top 500 most-frequent words in the corpus.
The value of matrix entry M[i][j] gives the number of times the context word represented by column j appeared
within W=3 words to the left or right of the word represented by row i in the corpus. Using this representation, the
baseline system clusters each paraphrase set using K-means.
While experimenting, write out clusters for the dev input to dev_output_features.txt and use the evaluate.py
script to compare against the provided dev_output.txt .
Implementing the baseline will score you a B, but why not try and see if you can do better? You might try experimenting
with different features, for example:
What if you reduce or increase D in the baseline implementation?
Does it help to change the window W used to extract contexts?
Play around with the feature weighting – instead of raw counts, would it help to use PPMI?
Try a different clustering algorithm that’s included with the scikit-learn clustering package (http://scikitlearn.org/stable/modules/clustering.html), or implement your own.
What if you include additional types of features, like paraphrases in the Paraphrase Database
(http://www.paraphrase.org) or the part-of-speech of context words?
The only feature types that are off-limits are WordNet features.
Turn in the predicted clusters that your VSM generates in the file test_output_features.txt . Also provide a brief
description of your method in writeup.pdf , making sure to describe the vector space model you chose, the clustering
algorithm you used, and the results of any preliminary experiments you might have run on the dev set. We have provided a
LaTeX file shell, writeup.tex , which you can use to guide your writeup.
2. Dense Representations
Finally, we’d like to see if dense word embeddings are better for clustering the words in our test set. Run the word
clustering task again, but this time use a dense word representation.
For this task, use files:
Google’s pretrained word2vec vectors (https://code.google.com/archive/p/word2vec/), under the heading
“Pretrained word and phrase vectors”
The Google file is very large (~3.4GB), so we have also included in the data.zip a file called GoogleNews-vectorsnegative300.filter , which is filtered to contain only the words in the dev/test splits.
Modify vectorcluster.py to load dense vectors.
The baseline system for this section uses the provided word vectors to represent words, and K-means for clustering.
As before, achieving the baseline score will get you a B, but you might try to see if you can do better. Here are some
Try downloading a different dense vector space model from the web, like Paragram
(http://www.cs.cmu.edu/~jwieting/) or fastText
Train your own word vectors, either on the provided corpus or something you find online. You can use the
gensim.models.Word2Vec package for the skip-gram or CBOW models, or GLOVE
(https://nlp.stanford.edu/projects/glove/). Try experimenting with the dimensionality.
Retrofitting (https://www.cs.cmu.edu/~hovy/papers/15HLT-retrofitting-word-vectors.pdf) is a simple way to add
additional semantic knowledge to pre-trained vectors. The retrofitting code is available here
(https://github.com/mfaruqui/retrofitting). Experiment with different lexicons, or even try counter-fitting
As in question 2, turn in the predicted clusters that your dense vector representation generates in the file
test_output_dense.txt . Also provide a brief description of your method in writeup.pdf that includes the vectors
you used, and any experimental results you have from running your model on the dev set.
In addition, do an analysis of different errors made by each system – i.e. look at instances that the word-context matrix
representation gets wrong and dense gets right, and vice versa, and see if there are any interesting patterns. There is no
right answer for this.
3. The Leaderboard
In order to stir up some friendly competition, we would also like you to submit the clustering from your best model to a
leaderboard. Copy the output file from your best model to a file called test_output_leaderboard.txt , and include it
with your submission.
We made the clustering problem deliberately easier by providing you with k , the number of clusters, as an input. But in
most clustering situations the best k isn’t obvious. To take this assignment one step further, see if you can come up with
a way to automatically choose k . We have provided an additional test set, test_nok_input.txt , where the k field
has been zeroed out. See if you can come up with a method that clusters words by sense, and chooses the best k on its
own. (Don’t look at the number of WordNet synsets for this, as that would ruin all the fun.) The baseline system for this
portion always chooses k=5 . You can submit your output to this part in a file called
test_nok_output_leaderboard.txt . Be sure to describe your method in writeup.pdf .
Here are the deliverables that you will need to submit:
question1.txt file with answers to questions from Exploration
simple VSM clustering output test_output_features.txt
dense model clustering output test_output_dense.txt
your favorite clustering output for the leaderboard, test_output_leaderboard.txt (this will probably be a
copy of either test_output_features.txt or test_output_dense.txt )
writeup.pdf (compiled from writeup.tex )
your code (.zip). It should be written in Python 3.
(optional) the output of your model that automatically chooses the number of clusters,
Vector Semantics. (https://web.stanford.edu/~jurafsky/slp3/15.pdf) Dan Jurafsky and James H. Martin. Speech and
Stephen Mayhew, Anne Cocos and Chris Callison-Burch developed this homework assignment for UPenn’s CIS 530 class in
Language Processing (3rd edition draft) .
Semantics with Dense Vectors. (https://web.stanford.edu/~jurafsky/slp3/16.pdf) Dan Jurafsky and James H. Martin.
Speech and Language Processing (3rd edition draft) .
Efficient Estimation of Word Representations in Vector Space. (https://arxiv.org/pdf/1301.3781.pdf?) Tomas Mikolov, Kai
Chen, Greg Corrado, Jeffrey Dean. ArXiV 2013. Abstract
Linguistic Regularities in Continuous Space Word Representations. (https://www.aclweb.org/anthology/N13-1090) Tomas
Mikolov, Wen-tau Yih, Geoffrey Zweig. NAACL 2013. Abstract BibTex
Discovering Word Senses from Text. (https://www.semanticscholar.org/paper/Discovering-word-senses-from-text-PantelLin/) Patrick Pangel and Dekang Ling. KDD 2002. Abstract BibTex
Clustering Paraphrases by Word Sense. (https://www.cis.upenn.edu/~ccb/publications.html) Anne Cocos and Chris
Callison-Burch. NAACL 2016. Abstract BibTex