
CSE 190: Midterm


Section 1: Regression

Q1: Restaurants & ratings (10 marks)

Suppose we collected the following data about restaurants from Yelp!:

Name                    Average Rating   Takes reservations?   Take-out?   Price   Good for
Oceana Coastal Kitchen  4.5              Yes                   No          $$$     Breakfast
Beyer Deli              5.0              No                    Yes         $       Lunch
Werewolf                4.5              Yes                   Yes         $$      Brunch
C Level                 4.0              No                    Yes         $$      Lunch, Dinner
Cucina Urban            4.5              Yes                   Yes         $$      Dinner

and that from this data we want to estimate

av. rating ≃ θ0 + θ1[takes reservations] + θ2[has take-out] + θ3[price]

1. What is the average rating across all restaurants (1 mark)? A:

2. What is the Mean Squared Error of a predictor that just predicts the average rating for all items (1 mark)? A:
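For reference, both quantities can be checked in a few lines (a minimal sketch using the five ratings from the table above):

```python
# Ratings from the table above.
ratings = [4.5, 5.0, 4.5, 4.0, 4.5]

mean_rating = sum(ratings) / len(ratings)

# MSE of a predictor that always outputs the mean rating.
mse = sum((r - mean_rating) ** 2 for r in ratings) / len(ratings)

print(mean_rating)  # 4.5
print(mse)          # 0.1
```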

3. Suppose we’d like to write down the above expression for the rating in the form y ≃ Xθ. Complete the following equation to do so:

   [ 4.5 ]     [ 1  1  0  3 ]
   [  .  ]  ≃  [      .     ] θ
   [  .  ]     [      .     ]

(1 mark)

4. In the expression y ≃ Xθ, which term encodes the labels, which term encodes the features, and which term encodes the parameters (1 mark)? labels:    features:    parameters:

5. Suppose that after fitting our model for the rating we obtain θ = [7, 0.5, −1, −1]^T. What is the interpretation of θ0 = 7 in this expression (1 mark)?

A:

6. What is the interpretation of θ3 = −1 (1 mark)?

A:

7. Write down the predictions made by the model when θ = [7, 0.5, −1, −1]^T:

   predictions = [ 4.5 ]
                 [  .  ]
                 [  .  ]

(1 mark)

8. What is the Mean Squared Error of the predictions you computed above (1 mark)? A:
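The predictions in questions 7–8 can be checked mechanically; a sketch assuming price is encoded as the number of ‘$’ signs, consistent with the row [1 1 0 3] given above:

```python
import numpy as np

# Feature rows: [1, takes reservations, take-out, price ($ count)].
X = np.array([
    [1, 1, 0, 3],  # Oceana Coastal Kitchen
    [1, 0, 1, 1],  # Beyer Deli
    [1, 1, 1, 2],  # Werewolf
    [1, 0, 1, 2],  # C Level
    [1, 1, 1, 2],  # Cucina Urban
])
y = np.array([4.5, 5.0, 4.5, 4.0, 4.5])
theta = np.array([7, 0.5, -1, -1])

predictions = X @ theta
mse = np.mean((y - predictions) ** 2)
print(predictions)  # [4.5 5.  4.5 4.  4.5]
print(mse)          # 0.0
```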

9. Suppose you wanted to incorporate the ‘Good for’ field (the last column of the above table) into your

model. How would you represent the features in order to do so? Answer this by writing down the model

you would use:

av. rating ≃ θ0 + θ1[takes reservations] + θ2[has take-out] + θ3[price] + A:


and by completing the feature matrix using your representation:

   X = [ 1  1  0  3 ]
       [      .     ]
       [      .     ]

(2 marks)
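Since ‘Good for’ is categorical and can hold several values at once (e.g. “Lunch, Dinner”), one common representation is a binary indicator per category. A sketch (the category list is an assumption drawn from the table):

```python
categories = ["Breakfast", "Brunch", "Lunch", "Dinner"]

# 'Good for' values from the table, one set per restaurant.
good_for = [
    {"Breakfast"},        # Oceana Coastal Kitchen
    {"Lunch"},            # Beyer Deli
    {"Brunch"},           # Werewolf
    {"Lunch", "Dinner"},  # C Level
    {"Dinner"},           # Cucina Urban
]

# One binary indicator per category, appended to each feature row.
onehot = [[1 if c in g else 0 for c in categories] for g in good_for]
for row in onehot:
    print(row)
```

Note that a restaurant good for both lunch and dinner simply gets two 1s, so this handles the multi-valued rows directly.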

Q2: Training, testing, & model selection (6 marks)

Suppose we are training regressors to minimize the regularized Mean Squared Error

   (1 / |train|) Σ_{(x,y)∈train} (y − x·θ)² + λ‖θ‖₂²
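Setting the gradient of this objective to zero gives the ridge-regression closed form θ = (XᵀX + NλI)⁻¹ Xᵀy with N = |train| (the N appears because the squared-error term is averaged). A minimal numpy sketch, illustrative rather than part of the exam:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/N) * sum (y - x.theta)^2 + lam * ||theta||_2^2."""
    N, d = X.shape
    # Closed form: theta = (X^T X + N * lam * I)^{-1} X^T y.
    return np.linalg.solve(X.T @ X + N * lam * np.eye(d), X.T @ y)

# Tiny example: with lam = 0 this reduces to ordinary least squares.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
print(ridge_fit(X, y, 0.0))   # ~[1. 1.]  (exact fit y = 1 + x)
print(ridge_fit(X, y, 10.0))  # coefficients shrunk toward zero
```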

10. Suppose that we fit some model for λ ∈ {0.01, 0.1, 1, 10, 100, 1000} and obtain the following performance

on the training and validation sets:

[Figure: training error and validation error plotted against λ]

Which value of λ would you select based on the results above (1 mark)? λ =

11. Answer the following questions about training, validation, and test sets:

(a) What is the role of a validation set (1 mark)?

A:

(b) How does the training error typically vary with λ (1 mark)?

A:

(c) What is meant by under/over fitting? Which values of λ in the above figure correspond to maximum

over/under fitting (1 mark)?

A:

12. Further suppose that we consider two different feature representations (model X and model Y), and two

different regularization parameters (λ = 1 and λ = 10) and obtain the following results on the training

and validation sets:

model              training error   validation error
model X, λ = 1     23.34            ?
model X, λ = 10    ?                ?
model Y, λ = 1     ?                18.32
model Y, λ = 10    25.98            ?

(‘?’ indicates an unknown value.)

Assuming that our training/validation/test sets are large, independent samples, is the above information

enough to determine which model and which value of λ we would expect to yield the best performance


on the test set? If so, which model and which value of λ would you expect to perform best and why?

Explain your answer (2 marks).

A:

Section 2: Classification

Q3: Fantasy novels (6 marks)

Suppose we have a database consisting of the following books:

Title                                          Genre                  Prediction
The Circle of Sorcerers                        Fantasy                True
Knights: The Eye of Divinity                   Fantasy
Superman/Batman: Sorcerer Kings                Graphic Novel
In the Blood                                   Mystery
Remains of the Day                             Literature & Fiction
Blood Song                                     Fantasy
Flame Moon                                     Fantasy
The Book of The Sword: A History of Daggers    History
A Storm of Swords                              Fantasy
The Storm Book                                 Children's

Further, suppose we are given the following classifier to classify Fantasy vs. non-Fantasy books:

if (Title contains 'Sorcerer' or 'Blood' or 'Knights' or 'Moon' or 'Storm'):
    return True
else:
    return False

13. What are the predictions made by this classifier? Write your answers in the last column of the table

above (1 mark).

14. Of these predictions, what is the number of true positives, true negatives, false positives, and false negatives (1 mark)?

true positives:    true negatives:    false positives:    false negatives:

A:

15. What are the true positive rate (hint: TP / (TP + FN)), true negative rate, and balanced error rate (1

mark)?

true positive rate:    true negative rate:    balanced error rate:

A:
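Counts and rates like these can be tabulated programmatically; a sketch, with labels taken from the Genre column and predictions from the keyword rule above:

```python
# Titles and whether each book is Fantasy (from the table above).
books = [
    ("The Circle of Sorcerers", True),
    ("Knights: The Eye of Divinity", True),
    ("Superman/Batman: Sorcerer Kings", False),
    ("In the Blood", False),
    ("Remains of the Day", False),
    ("Blood Song", True),
    ("Flame Moon", True),
    ("The Book of The Sword: A History of Daggers", False),
    ("A Storm of Swords", True),
    ("The Storm Book", False),
]
keywords = ["Sorcerer", "Blood", "Knights", "Moon", "Storm"]

preds = [any(k in title for k in keywords) for title, _ in books]
labels = [label for _, label in books]

tp = sum(p and l for p, l in zip(preds, labels))
tn = sum(not p and not l for p, l in zip(preds, labels))
fp = sum(p and not l for p, l in zip(preds, labels))
fn = sum(not p and l for p, l in zip(preds, labels))

tpr = tp / (tp + fn)            # true positive rate
tnr = tn / (tn + fp)            # true negative rate
ber = 1 - (tpr + tnr) / 2       # balanced error rate
```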

16. In class we saw three approaches to classification: naïve Bayes, logistic regression, and support-vector machines. Describe one benefit of each approach compared to the other two (3 marks).

naïve Bayes:

logistic regression:

SVM:

Section 3: Communities & clustering

Q4: Algorithms for community detection, dimensionality reduction, and clustering

Recall three algorithms we saw in class to detect communities: connected components, ratio cut, and clique

percolation (pseudocode is given as Algorithms 1, 2, and 3 at the end of the test).


17. Identify the communities that would be produced on the graphs below using these three algorithms.

Circle the communities directly in the space below (some more graphs are provided on the final page in

case you need to re-write your answer):

[Four copies of the graph appear here, one per setting:]
Connected components (1 mark)
Ratio cut, 2 communities (1 mark)
Clique percolation, k = 3 (1 mark)
Clique percolation, k = 2 (1 mark)

18. Suppose we are given the following 2-dimensional data X, and wish to cluster it so as to minimize the reconstruction error Σ_{x∈X} ‖x̄ − x‖₂². Separate the points into three clusters such that the reconstruction error (when replacing each point by its cluster centroid) would be minimized. Draw the clusters directly in the space below (1 mark):

19. By replacing each point with one of the three centroids above, we have effectively ‘compressed’ the data,

since each (2-d) point is replaced by a (1-d) integer. Another way to compress the data would be to

perform Principal Component Analysis, and discard the lowest variance dimension, which would also

result in a 1-d representation of the data. Out of these two possible compressed representations, which

one would result in the lower reconstruction error on the above data, and why (1 mark)?

A:

20. In class we saw hierarchical clustering, an algorithm that works by successively joining clusters whose

centroids are nearest. Pseudocode is given in Algorithm 4 over the page.

Suppose you are given the following set of points:

[Figure: seven points labeled a–g]

Step   Clusters merged        List of clusters
0      (initialization)       {a}, {b}, {c}, {d}, {e}, {f}, {g}
1      {a} merges with {b}    {a, b}, {c}, {d}, {e}, {f}, {g}
2
3
4
5
6                             {a, b, c, d, e, f, g}

If we were to perform hierarchical clustering on this data, in what order would the clusters be joined?

Answer this question by completing the table above (2 marks).


Algorithm 1 Connected components

Two nodes a and b should be assigned to the same community if and only if a is reachable from b, and b is

reachable from a
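This rule can be sketched as a breadth-first search over an adjacency-dict graph (illustrative; the example graph is hypothetical):

```python
from collections import deque

def connected_components(adj):
    """adj: dict mapping each node to a set of neighbours (undirected)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # BFS from an unvisited node collects its whole component.
        queue, comp = deque([start]), set()
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(comp)
    return components

# Hypothetical graph with two components.
adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
print(connected_components(adj))  # [{1, 2, 3}, {4, 5}]
```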

Algorithm 2 Ratio cut

Choose communities C that minimize

   (1/2) Σ_{c∈C} cut(c, c̄) / |c|

where cut(c, c̄) is the number of edges in the cut between c and its complement c̄, and |c| is the size of community c

Algorithm 3 Clique percolation with parameter k

Initially, all k-cliques in the graph are communities

while there are two communities that have a (k − 1)-clique in common do

merge both communities into a single community
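A compact sketch of this procedure, using the standard formulation in which communities are unions of k-cliques connected through (k−1)-node overlaps (the example graph is hypothetical):

```python
from itertools import combinations

def clique_percolation(adj, k):
    """adj: dict mapping each node to a set of neighbours (undirected)."""
    nodes = sorted(adj)
    # Enumerate all k-cliques: every pair of members must be an edge.
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    # Union-find over cliques: merge cliques sharing k-1 nodes.
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    # Each group of connected cliques forms one community.
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return list(groups.values())

# Hypothetical graph: two overlapping triangles plus a pendant node.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
print(clique_percolation(adj, 3))  # [{1, 2, 3, 4}]
```

Note that node 5 belongs to no 3-clique, so (as in clique percolation generally) it ends up in no community.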

Algorithm 4 Hierarchical clustering

Initially, every point is assigned to its own cluster

while there is more than one cluster do

Compute the center of each cluster

Combine the two clusters with the nearest centers
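Algorithm 4 can be sketched in a few lines of Python for 2-d points (illustrative; the example points are hypothetical, not those of Q20):

```python
def hierarchical_clustering(points):
    """Repeatedly merge the two clusters with nearest centroids,
    returning the sequence of merges (centroid linkage)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Centroid of each cluster.
        centers = [(sum(x for x, _ in c) / len(c),
                    sum(y for _, y in c) / len(c)) for c in clusters]
        # Find the pair of clusters with the nearest centers.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = ((centers[i][0] - centers[j][0]) ** 2
                     + (centers[i][1] - centers[j][1]) ** 2)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((list(clusters[i]), list(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

pts = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 10)]
for a, b in hierarchical_clustering(pts):
    print(a, "+", b)
```

Tracing the example: the two tight pairs merge first, then the two resulting pairs merge, and the outlier (10, 10) joins last, which is exactly the kind of merge table Q20 asks for.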

Write any additional answers/corrections/comments here:

Here are a few more graphs in case you need to re-write your solutions to Q17:
