CPSC 477

HW3

Problem Grade

QUESTION 1 (Language Modeling) /10

QUESTION 2 (Recurrent Neural Networks) /10

QUESTION 3 (Morphology) /10

QUESTION 4 (Text Similarity) /10

QUESTION 5 (Dimensionality Reduction and SVD) /10

QUESTION 6 (Naïve Bayes Classifier) /10

QUESTION 7 (Training Neural Networks) /10

QUESTION 8 (Word Embeddings) /10

TOTAL /80

Submission Instructions

1. Submit each solution on a separate sheet.

2. Include your name, netID and problem number on each page.

QUESTION 1 (Language Modeling)

You are in a noisy coffee shop, diligently studying for your midterm, and your friend is trying to get your attention using a vocabulary of only two words. She has said a sentence, but you couldn't hear one of the words:

(w_1 = 'hi'; w_2 = 'yo'; w_3 = ???; w_4 = 'yo')

Assume that your friend was generating words from this first-order Markov model:

P('hi' | 'hi') = 0.7    P('yo' | 'hi') = 0.3
P('hi' | 'yo') = 0.5    P('yo' | 'yo') = 0.5

Given these parameters, what is the posterior probability that the missing word is 'hi' versus 'yo'? Show your work.
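Since both candidates appear in otherwise identical sentences, the posterior follows from Bayes' rule: score each completed sentence by its joint probability and normalize. A minimal Python sketch of that scoring, not a worked solution (the names trans and sequence_score are our own):

    # trans[(prev, cur)] = P(cur | prev), the first-order Markov model above.
    trans = {
        ('hi', 'hi'): 0.7, ('hi', 'yo'): 0.3,
        ('yo', 'hi'): 0.5, ('yo', 'yo'): 0.5,
    }

    def sequence_score(words):
        """Product of P(w_t | w_{t-1}) over consecutive word pairs."""
        score = 1.0
        for prev, cur in zip(words, words[1:]):
            score *= trans[(prev, cur)]
        return score

    # Joint scores of the two candidate completions; normalizing them gives
    # the posterior over the missing word w_3 (the probability of the first
    # word is the same for both candidates, so it cancels).
    scores = {c: sequence_score(['hi', 'yo', c, 'yo']) for c in ('hi', 'yo')}
    posterior = {c: s / sum(scores.values()) for c, s in scores.items()}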

QUESTION 2 (Recurrent Neural Networks)

Your goal will be to set up weights for a simple recurrent network (SRN, also known as an Elman network) to solve two simple sequence tasks. The network reads in a sequence x_1, ..., x_t and produces outputs y_1, ..., y_t. We consider the final output y_t to be the answer given by the network.

We will assume a threshold activation function:

f(x) = 1 if x > 0; otherwise f(x) = 0.

The network has one simple recurrent layer followed by a linear layer (shown in the diagram below). The dimension of both the input to the network and its output should be 1.

1. Construct an SRN with 1 hidden unit that reads a sequence of 1s and 0s and outputs 1 if any of x_1, ..., x_t had the value 1. Justify the correctness of your network.

2. Construct an SRN with 2 hidden units to solve the parity task (read in a sequence of 1s and 0s and return the sum mod 2, shown in the table below). Justify the correctness of your network.

x      y
0      0
1      1
010    1
0110   0
10101  1
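To make the setup concrete, here is a minimal numpy sketch of the forward pass you are asked to configure (our own illustration: the zero initial hidden state h_0 = 0 and all names are assumptions, and the weights are for you to choose):

    import numpy as np

    def step(x):
        """Threshold activation: 1 where x > 0, else 0, componentwise."""
        return (x > 0).astype(float)

    def srn_forward(xs, W_xh, W_hh, b_h, W_hy, b_y):
        """One recurrent layer of threshold units, then a linear output layer.

        h_t = step(W_xh * x_t + W_hh h_{t-1} + b_h);  y_t = W_hy h_t + b_y.
        Returns the final output y_t, taken as the network's answer.
        """
        h = np.zeros(len(b_h))            # assume h_0 = 0
        for x in xs:                      # each x_t is a scalar (input dim 1)
            h = step(x * W_xh + W_hh @ h + b_h)
        return W_hy @ h + b_y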

QUESTION 3 (Morphology)

Refer to the slides for lecture #143 for background on this question.

1. Consider the English verb drink. Give two inflected forms of drink, and two

derivational forms of drink.

2. Identify all the morphemes in infallibilities. For each morpheme, say whether it is a

prefix, suffix, or root.

3. Consider the word unnegotiable. Draw and label two morphological trees for this word, one which corresponds to the meaning "not able to be negotiated", and another which corresponds to "able to be unnegotiated".

QUESTION 4 (Text Similarity)

We first need to introduce a function that compares the similarity of two sentences.

Recall the definition of the cosine distance between two vectors x and y:

cosine(x, y) = ( Σ_{i=1}^n x_i y_i ) / ( sqrt(Σ_{i=1}^n x_i^2) * sqrt(Σ_{i=1}^n y_i^2) )
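For reference, this formula translates directly into code. A small sketch, assuming plain Python sequences of equal length n with nonzero norms:

    import math

    def cosine(x, y):
        """cosine(x, y) as defined above: dot product over product of norms."""
        dot = sum(xi * yi for xi, yi in zip(x, y))
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        norm_y = math.sqrt(sum(yi * yi for yi in y))
        return dot / (norm_x * norm_y)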

Part A: How would you define a mapping from sentences s to vectors f(s) such that the cosine distance cosine(f(s_1), f(s_2)) is a measure of the similarity between two sentences s_1 and s_2? We will give full marks for any reasonable mapping.

Part B: Now, we will define a dynamic programming algorithm that computes a sentence alignment of two translated documents. The alignment score is the sum of the similarity scores of the aligned sentences. Our goal is to find an alignment with the highest score.

We will consider alignments of the following form:

• a sentence can be aligned to an empty sentence. This happens when one of the translators omits a sentence.

• a sentence can be aligned to exactly one sentence.

• a sentence can be aligned to two sentences. This happens when one of the translators either breaks or joins sentences.

Our sentence alignment algorithm recursively computes the alignment matrix F, indexed by i and j, one index for each sentence. The value stored in F(i, j) is the score of the best alignment between the first i sentences of x and the first j sentences of y. s(x_i, y_j) is the similarity between sentence x_i and sentence y_j.

Based on the above information, please:

Define F(0, 0), F(i, 0) and F(0, j).

Define F(i, j) for i > 0 and j > 0.

Part C: Next, we will modify the alignment score to be the same as before, but to include a fixed penalty p each time a sentence is aligned to an empty sentence. p is a parameter, which is ≥ 0, chosen by the user of the algorithm. Describe a modified dynamic programming method that takes this new penalty into account.

QUESTION 5 (Dimensionality Reduction and SVD)

(a) Prove that the left singular vectors of a matrix A are the right singular vectors of A^T.

(b) Calculate the covariance matrix for the matrix M shown below.

M = [ 1 1 0 1
      0 0 0 1
      1 1 0 0 ]

(c) Find the singular values of the matrix M.
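A numerical sanity check for parts (b) and (c) is easy with numpy (our own suggestion; the by-hand calculation is still what the question asks for). Note that np.cov defaults to treating rows as variables and centers the data; match whatever covariance convention was used in lecture:

    import numpy as np

    M = np.array([[1, 1, 0, 1],
                  [0, 0, 0, 1],
                  [1, 1, 0, 0]], dtype=float)

    print(np.cov(M))                            # covariance matrix, part (b)
    print(np.linalg.svd(M, compute_uv=False))   # singular values, part (c)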

QUESTION 6 (Naïve Bayes Classifier)

Suppose you are given the following set of data with three Boolean input variables X, Y, and Z, and a single Boolean output variable K.

X  Y  Z  K
1  0  1  0
1  1  1  1
0  1  1  0
1  1  0  0
1  0  1  0
0  0  0  1
0  0  0  1
0  0  1  0

(For parts (a) and (b), assume we are using a Naïve Bayes classifier to predict the value of K from the values of the other variables.)

(a) According to the Naïve Bayes classifier, what is P(K = 1 | X = 1, Y = 1, Z = 0)?

(b) According to the Naïve Bayes classifier, what is P(K = 0 | X = 1, Y = 1)?
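The probabilities in parts (a) and (b) come from counting rows of the table. A small counting sketch (our own illustration, not a substitute for showing your work):

    # Rows of the table above, as (X, Y, Z, K) tuples.
    data = [(1, 0, 1, 0), (1, 1, 1, 1), (0, 1, 1, 0), (1, 1, 0, 0),
            (1, 0, 1, 0), (0, 0, 0, 1), (0, 0, 0, 1), (0, 0, 1, 0)]

    def p(event):
        """Empirical probability of an event over the eight rows."""
        return sum(event(r) for r in data) / len(data)

    def p_given(event, cond):
        """Empirical conditional probability P(event | cond)."""
        rows = [r for r in data if cond(r)]
        return sum(event(r) for r in rows) / len(rows)

    # Example: P(X = 1 | K = 1).
    print(p_given(lambda r: r[0] == 1, lambda r: r[3] == 1))

    # Naive Bayes treats X, Y, Z as conditionally independent given K: for
    # part (a), compare P(K=1) P(X=1|K=1) P(Y=1|K=1) P(Z=0|K=1) with the
    # matching product for K=0, then normalize the two products.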

(c) Given P(Z | X) = 0.7 and P(Z | Y) = 0.4. Do you have enough information to compute P(Z | X, Y)? If not, write "not enough info". If so, compute the value of P(Z | X, Y) from the above information.

(d) Given P(X, Y) = 0.2, P(X) = 0.3, P(Z) = 1. Do you now have enough information to compute P(Z | X, Y)? If not, write "not enough info". If so, compute the value of P(Z | X, Y) from the above information.

QUESTION 7 (Training Neural Networks)

This question asks you to train a three-layer neural network for a classification task. Let H_1 and H_2 denote the numbers of hidden units in the two hidden layers, let D be the dimension of the input x, and let K be the number of classes. The three-layer network has the following form:

h_1 = ReLU(W_1 x + b_1)
h_2 = ReLU(W_2 h_1 + b_2)
p = Softmax(W_3 h_2 + b_3)

where the parameters of the network are W_1 ∈ R^(H_1 × D), b_1 ∈ R^(H_1), W_2 ∈ R^(H_2 × H_1), b_2 ∈ R^(H_2), W_3 ∈ R^(K × H_2), b_3 ∈ R^(K).

ReLU is the "rectified linear unit" ReLU(x) = max{0, x}, applied componentwise, and Softmax maps a K-vector to the probability simplex according to

Softmax(v)_i = exp(v_i) / Σ_{j=1}^{K} exp(v_j)

For a given input x, this specifies how to calculate the probabilities p(y = k | x) for k = 1, 2, ..., K.

For a given training set {(x_i, y_i)}, consider the loss function:

L = -(1/n) Σ_{i=1}^{n} log p(y = y_i | x_i) + (λ/2) ( ||W_1||^2 + ||W_2||^2 + ||W_3||^2 )

where ||A||^2 means the sum of the squares of the entries of the matrix A.

Give formulas for the derivatives ∂L/∂W_l and ∂L/∂b_l for each layer l = 1, 2, 3 of the network.

Hints:

i. You should ignore the fact that ReLU(x) is not differentiable at x_i = 0.

ii. You might try to first do the calculations with ReLU() replaced by the identity function.
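Before deriving the gradients, it can help to spell out the forward computation and the loss. A minimal numpy sketch of the definitions above (our own illustration: it assumes zero-indexed class labels y in {0, ..., K-1}, lam stands for the regularization weight λ, and subtracting the max inside softmax is a numerically safer but mathematically equivalent form):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def softmax(v):
        e = np.exp(v - v.max())   # shift by max for numerical stability
        return e / e.sum()

    def forward(x, params):
        """Returns the vector of probabilities p(y = k | x) for all classes."""
        W1, b1, W2, b2, W3, b3 = params
        h1 = relu(W1 @ x + b1)
        h2 = relu(W2 @ h1 + b2)
        return softmax(W3 @ h2 + b3)

    def loss(xs, ys, params, lam):
        """Average negative log-likelihood plus the L2 penalty on W1, W2, W3."""
        W1, _, W2, _, W3, _ = params
        nll = -np.mean([np.log(forward(x, params)[y]) for x, y in zip(xs, ys)])
        reg = lam / 2 * sum((W ** 2).sum() for W in (W1, W2, W3))
        return nll + reg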

QUESTION 8 (Word Embeddings)

Word2Vec represents a family of word-embedding algorithms that are commonly used in

a variety of contexts.

(a) Let's consider a recommender system for an online music-streaming service. It includes information about how frequently users play pairs of songs together (e.g., song X is commonly played together with song Y). Explain how you would use ideas similar to Word2Vec to recommend the next song to a user who has played some songs from this list.

(b) Although pre-trained word vectors work quite well in many applications, it is sometimes better to "re-train" (i.e., continue to learn) the word vectors as parameters of our neural network. Explain why retraining the word vectors may hurt the model performance if the dataset for the specific task is too small.

(c) Give two examples of how we can evaluate word-embedding vectors.

(d) A simple way to produce word embeddings for a vocabulary of words is to create one-hot vectors for each word: a vector with 0s everywhere and only a single 1 in the position corresponding to the word's index in the vocabulary. Give at least two reasons why word2vec-produced word embeddings are better than these one-hot vectors.
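As a concrete illustration for part (d), a tiny sketch of the one-hot construction (the toy vocabulary is our own):

    import numpy as np

    vocab = ["cat", "dog", "piano"]                # toy vocabulary

    def one_hot(word):
        """All zeros except a single 1 at the word's vocabulary index."""
        v = np.zeros(len(vocab))
        v[vocab.index(word)] = 1.0
        return v

    # Any two distinct one-hot vectors are orthogonal:
    print(one_hot("cat") @ one_hot("dog"))         # 0.0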