Homework I

Problem 1: Cosine and Dot Product Similarity (60 points)

In this homework assignment, you are required to compare the retrieval performance of two similarity measures: dot product and cosine similarity. The document collection has already been preprocessed, with one file for each document. The collection of cleaned-up documents and queries can be downloaded from Canvas (Assignment/Homework I/hw1 data.zip). Upon unzipping the file, you will see two folders. The folder named docs contains all documents, with one file for each document. Similarly, the folder named queries contains one file for each query.
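The folder layout described above could be read as in the following minimal sketch. The helper name `load_folder` is hypothetical, and the sketch assumes the preprocessed files are plain text whose words are separated by whitespace, which the assignment does not state explicitly.

```python
from pathlib import Path


def load_folder(folder):
    """Map each file name in `folder` to its list of tokens.

    Assumption: files are plain text with whitespace-separated words.
    """
    texts = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # errors="ignore" skips bytes that are not valid UTF-8
            text = path.read_text(encoding="utf-8", errors="ignore")
            texts[path.name] = text.split()
    return texts


# Hypothetical usage, once the archive has been unzipped:
# docs = load_folder("docs")
# queries = load_folder("queries")
```

Keeping the file name as the key makes it easy to report which documents were retrieved for each query.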

You first need to extract the vocabulary from the document collection and create a vector representation for each document and query. Let $n$ be the number of unique words extracted from the document collection. Let $d = (d_1, \ldots, d_n)^\top \in \mathbb{R}^n$ denote the vector representation of a document, where $d_i$ is the term frequency of the $i$th term in the vocabulary. Similarly, you can denote a query by $q = (q_1, \ldots, q_n)^\top \in \mathbb{R}^n$. Two similarity measures will be computed and compared. For dot product similarity, the document-query similarity is computed as

$$S_{\mathrm{dot}}(d, q) = d^\top q = \sum_{i=1}^{n} d_i q_i = d_1 q_1 + d_2 q_2 + \cdots + d_n q_n$$
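To make the vector representation and the dot product concrete, here is a minimal sketch on a toy collection. The function names and the toy data are illustrative assumptions, not part of the assignment. Query words that do not occur in the vocabulary are ignored, because the vocabulary is extracted from the document collection only.

```python
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of the unique words across all tokenized documents."""
    return sorted({word for tokens in docs for word in tokens})


def term_frequency_vector(tokens, vocabulary):
    """Term-frequency vector of `tokens`, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def dot_similarity(d, q):
    """Dot product of two equal-length vectors."""
    return sum(di * qi for di, qi in zip(d, q))


# Toy data (hypothetical): two tokenized documents and one tokenized query.
docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
query = ["apple", "cherry"]

vocab = build_vocabulary(docs)
doc_vectors = [term_frequency_vector(tokens, vocab) for tokens in docs]
q_vec = term_frequency_vector(query, vocab)
scores = [dot_similarity(d, q_vec) for d in doc_vectors]
```

Dense lists keep the sketch close to the formula. For a large vocabulary, a sparse representation such as the `Counter` itself may be more practical.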

For cosine similarity, the document-query similarity can be computed by

$$S_{\cos}(d, q) = \frac{d^\top q}{\|d\|_2 \, \|q\|_2} = \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^2} \, \sqrt{\sum_{i=1}^{n} q_i^2}}$$
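The cosine formula can be sketched in a few lines. Returning 0 for a zero vector is an assumption made here to avoid division by zero; the assignment does not say how to treat that case. The toy vectors are hypothetical and chosen so that one document is an exact multiple of the other.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_d * norm_q)


# Toy vectors (hypothetical): long_doc is short_doc with every count doubled.
long_doc = [4, 2, 0]
short_doc = [2, 1, 0]
query = [1, 0, 0]

cos_long = cosine_similarity(long_doc, query)
cos_short = cosine_similarity(short_doc, query)
```

Here the dot products differ (4 versus 2) while the two cosine scores agree up to floating-point rounding, since cosine similarity divides out the document length.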

For each query, you are asked to compute the similarities between the query and all documents using both similarity measures, and return the 10 documents with the largest scores (you can break ties randomly when documents have identical scores). You will then compare the documents returned by the two similarity measures and discuss your observations. In particular, you need to submit the following for this homework:

1. For each of the five queries and for each similarity measure, report the list of the 10 most similar documents (i.e., documents with the largest similarity scores).

2. By looking at the content of the original documents, decide the relevance of the returned documents to the query, and compare the performance of the two similarity measures.
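One possible way to select the 10 highest-scoring documents with random tie-breaking is sketched below. The function name `top_k` and the dictionary layout (document name mapped to score) are assumptions for illustration. The sketch shuffles first and then relies on Python's stable sort, so documents with equal scores end up in random relative order.

```python
import random


def top_k(scores, k=10, rng=None):
    """Return the k document ids with the largest scores, ties broken randomly.

    `scores` maps a document id to its similarity score.
    """
    rng = rng or random.Random()
    items = list(scores.items())
    rng.shuffle(items)  # randomize order so ties are not decided by input order
    # sort() is stable, so equal scores keep their shuffled relative order
    items.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in items[:k]]


# Hypothetical scores for four documents.
example_scores = {"a": 3, "b": 1, "c": 3, "d": 2}
best_two = top_k(example_scores, k=2)
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible across runs.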


Problem 2: Singular Value Decomposition (40 points)

Let $X \in \mathbb{R}^{n \times d}$ ($n \ge d$) denote a matrix with the singular value decomposition given by $X = U \Sigma V^\top$, where $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{d \times d}$ are orthonormal matrices satisfying $U^\top U = I_d$ and $V^\top V = I_d$, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$ is a diagonal matrix with $\sigma_i \ge 0$, $i = 1, \ldots, d$. You are asked to compute

$$(\lambda I_d + X^\top X)^{-1} X^\top$$

using $U$, $\Sigma$, and $V$, where $I_d$ is an identity matrix of size $d \times d$.
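As a possible starting point (a sketch, not the full solution), the orthonormality conditions allow $X^\top X$ to be rewritten in terms of $V$ and $\Sigma$ alone. The identity $V V^\top = I_d$ used below holds because $V$ is square. The condition $\lambda > 0$ is an assumption not stated in the problem; it guarantees that the inverse exists.

```latex
\begin{aligned}
X^\top X &= V \Sigma U^\top U \Sigma V^\top = V \Sigma^2 V^\top, \\
\lambda I_d + X^\top X &= \lambda V V^\top + V \Sigma^2 V^\top
  = V \left( \lambda I_d + \Sigma^2 \right) V^\top .
\end{aligned}
```

Because $\lambda I_d + \Sigma^2$ is diagonal, its inverse can be taken entry by entry. What remains is to invert the product and multiply by $X^\top = V \Sigma U^\top$.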


CSCE 633 Machine Learning

