Deep Learning for NLP

Coding Assignment 1

CS 544

Introduction

The goal of this coding assignment is to get you familiar with TensorFlow and to walk you through some practical Deep Learning techniques. You will be starting with code that is similar to the one taught in the first NLP Deep Learning class. Recall that the code we taught in class implemented a 3-layer neural network over the document vectors. The output layer classified each document into (positive/negative) and (truthful/deceptive). You will utilize the dataset from coding assignments 1 and 2. In this assignment, you will:

• Improve the tokenization.

• Convert the first layer into an embedding layer, which makes the model somewhat more interpretable. Many recent Machine Learning efforts strive to make models more interpretable, but sometimes at the expense of prediction accuracy.

• Increase the generalization accuracy of the model by implementing input sparse dropout – TensorFlow's (dense) dropout layer does not work out-of-the-box, as explained later.

• Visualize the learned embeddings using t-SNE.

In order to start the assignment, please download the starter code from:

• http://sami.haija.org/cs544/DL1/starter.py

You can run this code as:

    python starter.py path/to/coding1/and/2/data/

Note: This assignment will automatically be graded by a script, which verifies the implementation of the tasks one-by-one. It is important that you stick to these guidelines: only implement your code in places marked by ** TASK, and do not change the signatures of the methods tagged ** TASK, or else the grading script will fail to find them and you will get a zero for the corresponding parts. Otherwise, feel free to create as many helper functions as you wish!

Finally, you might find the first NLP Deep Learning lecture slides useful.

• This assignment is due Thursday, April 4. We are working on Vocareum integration; nonetheless, you are advised to start early (before we finish the Vocareum integration). You can submit until April 7, but all submissions after April 4 will receive penalties.


[10 points] Task 1: Improve Tokenization

The current tokenization:

    # ** TASK 1.
    def Tokenize(comment):
      """Receives a string (comment) and returns array of tokens."""
      words = comment.split()
      return words

is crude. It splits on whitespace only (spaces, tabs, new-lines). It leaves all other punctuation in place, e.g. single- and double-quotes, exclamation marks, etc. – there should be no reason to have both terms "house" and "house?" in the vocabulary. While a perfect tokenization can be quite involved, let us only slightly improve the existing one. Specifically, you should split on any non-letter. You might find the Python standard re package useful.

• Update the code of Tokenize to work as described; one possible approach is sketched below. A correct implementation should reduce the number of tokens by about half.
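For concreteness, here is a minimal sketch of one possible implementation (the regular expression is an assumption; any correct split on non-letters is fine):

    import re

    def Tokenize(comment):
      """Receives a string (comment) and returns array of tokens."""
      # Split on any run of non-letter characters; re.split can produce
      # empty strings at the boundaries, so drop those.
      words = [w for w in re.split(r'[^a-zA-Z]+', comment) if w]
      return words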


[20 + 6.5 points] Task 2: Convert the 1st layer into an embedding layer

Our goal here is to replace the first layer with something equivalent to tf.nn.embedding_lookup followed by averaging, but without using the function tf.nn.embedding_lookup, as we aim to understand the underlying mathematics behind embeddings and we do not (yet) want to discuss variable-length representations in TensorFlow [1].

The end goal of this task is to make the output of this layer represent every comment (document) by the average embedding of the words appearing in the comment. For example, suppose we represent the document by a vector $x \in \mathbb{R}^{|V|}$, with $|V|$ being the size of the vocabulary and entry $x_i$ being the number of times word $i$ appears in the document. Then, we would like the output of the embedding layer for document $x$ to be:

$$\sigma\!\left(\frac{x^\top Y}{\|x\|}\right) \qquad (1)$$

where $\sigma$ is an element-wise activation function. We wish to train the embedding matrix $Y \in \mathbb{R}^{|V| \times d}$, which will embed each word in a $d$-dimensional space (each word embedding lives in one row of the matrix $Y$). The denominator $\|x\|$ is there to compute the average; it can be the L1 or the L2 norm of the vector. In this exercise, use the L2 norm.
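For intuition, here is a tiny numpy check of Equation 1 (the toy sizes, the random matrix, and the choice of tanh for $\sigma$ are illustrative assumptions; the assignment settles on tanh below in any case):

    import numpy as np

    V, d = 5, 3                          # toy vocabulary size and embedding size
    Y = np.random.randn(V, d)            # embedding matrix; one word per row
    x = np.array([2., 0., 1., 0., 0.])   # word counts for one document
    # Equation 1 with sigma = tanh and the L2 norm as ||x||:
    doc_embedding = np.tanh(x @ Y / np.linalg.norm(x))
    print(doc_embedding.shape)           # (3,): one d-dimensional document vector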

Equation 1 should make our model more interpretable. Note the following differences between the above embedding layer and a traditional fully-connected (FC) layer, with transformation $\sigma(x^\top W + b)$:

1. FC layers have an additional bias vector $b$. We do not want the bias vector; its presence makes the embeddings trickier to visualize or to port to other applications. Here, $W$ corresponds to the embedding dictionary $Y$.

2. As mentioned, the input vector $x$ to Equation 1 should be normalized. If $x$ is a matrix, then the normalization should be row-wise. (Hint: you can use tf.nn.l2_normalize.)

3. Modern fully-connected layers typically have $\sigma = \mathrm{ReLU}$. Embeddings generally have either (1) no activation or (2) a squashing activation (e.g. tanh, or L2 normalization). We will opt for (2), specifically the tanh activation, as option (1) might force us to choose an adaptive learning rate [2] for the embedding layer.

4. In a standard FC layer, the parameter $W$ would be L2-regularized, i.e. by adding $\lambda \|W\|_2^2$ to the overall minimization objective (where the scalar coefficient $\lambda$ is generally set to a small value such as 0.0001 or 0.00001). When training embeddings, we only want to regularize the words that appear in the document, rather than *all* embeddings at every optimization update step. Specifically, we want to regularize by replacing the standard L2 regularization with

$$\lambda \left\| \frac{x^\top Y}{\|x\|} \right\|_2^2$$

[1] Variable-length representations will likely be covered in the next coding assignment.
[2] Adaptive learning rates are incorporated in training algorithms such as AdaGrad and ADAM.


In this task, you will represent the embedding transformation using the fully-connected functionality; you must edit the code of the function FirstLayer. Here are your sub-tasks:

3 points  Replace the ReLU activation with tanh.

4 points  Remove the bias vector.

7 points  Replace the fully-connected layer's L2 regularization with manual regularization. Specifically, call tf.add_loss on $R(Y)$, choosing the $R(Y)$ defined in Part i of the bonus below. Unlike the bonus questions, here you will let TensorFlow determine the gradient and update rule. Hint: tf.add_loss adds to the collection tf.GraphKeys.REGULARIZATION_LOSSES.

4 points  Preprocess the layer input by passing it through L2 normalization, i.e. $x := \sigma(x)$ with $\sigma(x) = \frac{x}{\|x\|_2}$.

2 points  Add batch normalization.

6.5 points  Bonus: Work out the analytical expression of the gradient of the regularization $R(Y)$ with respect to the parameters $Y$. Provide a TensorFlow operator that carries out the update by hand (without using automatic differentiation). The update should act as $Y := Y - \eta \frac{\partial R(Y)}{\partial Y}$, where $\eta \in \mathbb{R}_+$ is the learning rate. Zero credit will be given to all solutions utilizing tf.gradients(). However, you are allowed to "test it locally" by comparing your expression with the output of tf.gradients(), so long as you don't call the function (in)directly from the Embedding*Update functions below.

3 points  Part i: $R(Y) = \lambda \left\| \frac{x^\top Y}{\|x\|} \right\|_2^2$. Implement it in EmbeddingL2RegularizationUpdate.

3.5 points  Part ii: $R(Y) = \lambda \left\| \frac{x^\top Y}{\|x\|} \right\|_1$. Implement it in EmbeddingL1RegularizationUpdate.

– PLEASE do not discuss the bonus questions on Piazza, with the TAs, or among yourselves. You must be the sole author of the implementation. However, you are allowed to discuss them after you submit, but only with those who have also submitted.

– Note: The functions Embedding*Update are not invoked in the code. That is okay! Our grading script will invoke them to check for correctness.
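To be clear about the mechanics only (this does not give away the derivation): a hand-carried update is typically expressed with tf.assign, where grad below stands for whatever analytical expression you derive for $\frac{\partial R(Y)}{\partial Y}$. The function name and signature here are illustrative:

    import tensorflow as tf  # TensorFlow 1.x

    def ManualGradientStep(Y, grad, learning_rate):
      """Returns an op that applies Y := Y - learning_rate * grad."""
      # tf.assign overwrites the variable Y when the returned op is run;
      # no tf.gradients() is involved.
      return tf.assign(Y, Y - learning_rate * grad)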

The completion of the above tasks should successfully convert the first layer to an embedding layer (Equation 1). Especially for the last sub-task, you might find the documentation of tf.contrib.layers.fully_connected useful.
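Putting the sub-tasks together, a hypothetical sketch of FirstLayer could look like the following (the signature, the 40-dimensional output, the batch-summed regularizer, and the placement of batch normalization are assumptions; keep the starter code's actual ** TASK signature):

    import tensorflow as tf  # TensorFlow 1.x, matching the starter code

    def FirstLayer(net, l2_reg_val, is_training):
      """Hypothetical sketch of the embedding layer of Equation 1."""
      # Row-wise L2-normalize the input: x := x / ||x||_2.
      net = tf.nn.l2_normalize(net, 1)
      # Linear map x^T Y with no bias (biases_initializer=None removes b);
      # kept linear here so the pre-activation can be regularized manually.
      net = tf.contrib.layers.fully_connected(
          net, 40, activation_fn=None, biases_initializer=None)
      # Manual regularization lambda * || x^T Y / ||x|| ||_2^2, summed over
      # the batch and added to the standard regularization collection.
      tf.losses.add_loss(l2_reg_val * tf.reduce_sum(tf.square(net)),
                         tf.GraphKeys.REGULARIZATION_LOSSES)
      # Squashing activation, then batch normalization (ordering is a choice).
      net = tf.nn.tanh(net)
      return tf.contrib.layers.batch_norm(net, is_training=is_training)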


[25 points] Task 3: Sparse Input Dropout

At this point, after improving the tokenization and converting the first layer into an embedding layer, the model accuracy might have dropped… Do not worry! In fact, the model's train accuracy has improved at this point: but we do not care about that! We always care only about the model's generalization capability, i.e. its performance on unseen test examples, as we do not want it to over-fit (i.e. memorize) the training data while simultaneously performing badly on test data.

Thankfully, we have Deep Learning techniques to improve generalization. Specifically, we will be using Dropout.

Dropping out document terms helps generalization. For example, if a document contains terms "A B C D", then in one training batch the document could look like "A B D", and in another it could look like "A C D", and so on. This essentially prevents our 3-layer neural network from "memorizing" what the document looks like, as it appears different every time (there are exponentially many configurations in which a document can appear under dropout, and all configurations are equally likely).

The issue is that we cannot use TensorFlow's dropout layer out-of-the-box, as it is designed for dense vectors and matrices. Specifically, if we perform tf.contrib.layers.dropout on the input data using:

    net = tf.contrib.layers.dropout(x, keep_prob=0.5, is_training=is_training)

then TensorFlow will drop half of the entries in x. But this is almost useless, because most entries of x are already zero (most words do not occur in most documents). We wish to be efficient and drop out exactly the words that appear in the documents, rather than entries that are already zero.

Thankfully, we have students to implement sparse dropout for us! There are many possible ways to implement sparse dropout. Your task is to:

• Trace the usage of SparseDropout and fill in its body. It currently reads as:

    # ** TASK 3
    def SparseDropout(slice_x, keep_prob=0.3):
      """Sets random (1 - keep_prob) non-zero elements of slice_x to zero.

      Args:
        slice_x: 2D numpy array (batch_size, vocab_size)

      Returns:
        2D numpy array (batch_size, vocab_size)
      """
      return slice_x

Use a vectorized implementation with numpy's advanced indexing. Slow solutions (i.e. using Python for-loops) will receive at most 15/25 points.
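For illustration, one possible vectorized sketch (an assumption about the approach, not the only acceptable one) drops each non-zero entry independently:

    import numpy as np

    def SparseDropout(slice_x, keep_prob=0.3):
      """Sets random (1 - keep_prob) non-zero elements of slice_x to zero."""
      # Row/column indices of all non-zero entries (words present in documents).
      rows, cols = np.nonzero(slice_x)
      # True marks an entry to drop, so a keep_prob fraction survives in
      # expectation; advanced indexing zeroes them all at once, in place.
      drop = np.random.rand(rows.size) >= keep_prob
      slice_x[rows[drop], cols[drop]] = 0
      return slice_x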


[10 points] Task 4: Tracking Auto-Created TensorFlow Variables

TensorFlow is arguably the best abstraction for describing a graph of mathematical operations and programming Machine Learning models, but it is not perfect. One weakness of TensorFlow is that it does not provide easy access to variables that are automatically created by the layers (e.g. the fully-connected layer). Oftentimes, one would like to grab a handle on a specific variable in a specific layer, e.g. to visualize the embeddings, as we will do in the next task.

To do this task, you will find the function tf.trainable_variables() helpful. Hint: you can print the contents of tf.trainable_variables() in BuildInferenceNetwork, before and after FirstLayer. Your task is:

10 points  Modify the code of BuildInferenceNetwork. In it, populate EMBEDDING_VAR with a reference to the tf.Variable that holds the embedding dictionary Y. A snapshot of the code is here for your reference:

    def BuildInferenceNetwork(x):
      """From a tensor x, runs the neural network forward to compute outputs.

      This essentially instantiates the network and all its parameters.

      Args:
        x: Tensor of shape (batch_size, vocab_size) which contains a sparse
          matrix where each row is a training example and contains counts of
          words in the document that are known by the vocabulary.

      Returns:
        Tensor of shape (batch_size, 2) where the 2 columns represent class
        memberships: one column discriminates between (negative and positive)
        and the other discriminates between (deceptive and truthful).
      """
      global EMBEDDING_VAR
      EMBEDDING_VAR = None  # ** TASK 4: Move and set appropriately.

      ## Build layers starting from input.
      net = x
      # ... continues to construct `net` layer-by-layer ...

Set EMBEDDING_VAR to a tf.Variable reference object. Keep the first line: `global EMBEDDING_VAR`.
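One hypothetical way to do this (assuming FirstLayer creates exactly one new trainable variable, its weight matrix Y; adapt to the starter code's actual call) is to diff tf.trainable_variables() around the layer construction:

    # Inside BuildInferenceNetwork, after `global EMBEDDING_VAR`:
    vars_before = set(tf.trainable_variables())
    net = FirstLayer(net)  # illustrative call; keep the starter code's own
    created = [v for v in tf.trainable_variables() if v not in vars_before]
    # The embedding dictionary Y is the newly created (|V| x d) weight matrix.
    EMBEDDING_VAR = created[0]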


[25 points] Task 5: Visualizing the embedding layer

We want to visualize the embeddings learned by our Deep Network. The embedding layer learns Y, a 40-dimensional embedding for each word in the vocabulary. You will project the 40 dimensions onto 2 dimensions using sklearn's t-SNE. Rather than visualizing all the words, we will choose 4 kinds of words: words indicating the positive class (shown in blue), words indicating the negative class (shown in orange), words describing furniture (red), and words describing location (green). Notice that the words that are useful for this classification task occupy different parts of the embedding space: you can easily separate the orange and the blue points with a separating hyperplane. In contrast, words not indicative of the classes (e.g. furniture, location) are not as well clustered [3].

Successfully visualizing the embeddings using t-SNE should look like this:

[Figure: 2D t-SNE scatter plot of selected word embeddings. Positive-class words (e.g. "relaxing", "luxury", "superb", "fantastic") and negative-class words (e.g. "dirty", "rude", "worst", "terrible") form two well-separated clusters, while furniture words (e.g. "couch", "table", "pillow") and location words (e.g. "blocks", "avenue", "doorman") are less cleanly clustered.]

This is a fairly open-ended task, but there should be decent documentation in the TASK 5 functions that you should implement: ComputeTSNE and VisualizeTSNE. Note: you must separately upload the PDF produced by VisualizeTSNE onto Vocareum with the name tsne_embeds.pdf.
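A minimal sketch of the two functions follows, assuming the embedding matrix has already been fetched into a numpy array, and that word_to_id (a dict from word to vocabulary index) and color_to_words (a dict from matplotlib color name to word list) exist; both names are illustrative:

    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    def ComputeTSNE(embedding_matrix):
      """Projects the (|V|, 40) embedding matrix onto 2 dimensions."""
      return TSNE(n_components=2).fit_transform(embedding_matrix)

    def VisualizeTSNE(points_2d, color_to_words, word_to_id):
      """Scatter-plots the four chosen word groups and saves the PDF."""
      plt.figure(figsize=(10, 10))
      for color, words in color_to_words.items():
        for word in words:
          x, y = points_2d[word_to_id[word]]
          plt.scatter(x, y, c=color)
          plt.annotate(word, (x, y))
      plt.savefig('tsne_embeds.pdf')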

[3] In word2vec, which we will learn soon, the training is unsupervised, as the document classes are not known. As a result, all semantically similar words should cluster around one another (not just the words indicative of any classes, since classes are not present during training).
