CS322 Assignment 2

Written Questions:

1. 4.1

2. 4.2

3. 4.3

4. Congrats! You’re the mayor of Northfield! As mayor — you are interested in mitigating damage from

potential flooding of the Cannon river. It is of the utmost importance that you begin preparations for

flood control early (e.g., ordering sandbags, etc.) if you believe that there is going to be a flood this year.

To help you make this prediction, you do the same thing you do every year: reach-out to two teams of

hydrologists from Carleton (team A, and team B) to get their opinions. For simplicity, assume that the

Cannon flooding in a year is a binary outcome, and that these same teams have made predictions for the

past 20 years. From your records, you know that over the past 20 years, the Cannon has flooded 4 times.

This time, team A predicts that the Cannon will not flood, and points-out that they have achieved 80%

accuracy in predictions over the past 20 years. On the other hand, Team B predicts that the Cannon

will flood, but, according to your records, they have only achieved 60% prediction accuracy over the past

20 years. Which team’s prediction should you base your decision off of? If you can’t decide — what

additional information (if any) would you like to know from the teams? Ideally, your answer should

include discussions of accuracy, precision, recall, constant prediction, etc.

5. Criminal trials in the United States can be viewed as classifiers: juries attempt to make predictions based

on evidence about the guilt or innocence of a defendant. Even well-intentioned juries make mistakes; the

Innocence Project, for example, is a non-profit organization that “is committed to exonerating wrongly

convicted people through the use of DNA testing;”1

they have helped to overturn over 300 wrongful

convictions. Blackstone’s Ratio,2 a legal philosophy credited to Juror William Blackstone, states:

It is better that ten guilty persons escape than that one innocent suffer.

Discuss how Blackstone’s ratio relates to precision and recall. In your discussion, you should answer the

question of whether Blackstone’s Ratio advocates for high-precision juries or high-recall juries.

6. Logistic regression estimates binary probabilities, given a feature vector x. The probability that y = 1 is

modeled by computing the sigmoid of a dot product of w and x plus a bias, i.e.: P(y = 1|x) = σ(w

T x+b)

and P(y = 0|x) = 1 − σ(w

T x + b). For a true datapoint (xi

, yi), we can compute the likelihood of the

logistic regression parameters w, b given yi (the true label) and xi (the vector of features) as

Likelihood(w, b|xi

, yi) = P(y = yi

|W, b, xi)

= σ(w

T xi + b)

yi

· (1 − σ(w

T xi + b))1−yi

Note that we have used yi

, the true label, as a “switch” variable. So — if yi

is 1, the likelihood of the

parameters is estimated by σ(w

T xi + b). If the true label is 1, then the higher w

T xi + b is, the more

likely W and b are. If yi

is 0, the likelihood of the parameters is estimated by 1 − σ(w

T xi + b), i.e., for

datapoints with true label 0, the smaller w

T xi + b is, the more likely the parameters w, b are.

Binary crossentropy is a loss function that attempts to maximize the log-likelihood of all of the datapoints

(because the datapoints are assumed to be independent draws, we can compute their joint probability as

a product, i.e. P((x1, y1),(x2, y2),(x3, y3). . . |w, b) is assumed to be equal to P(x1, y1|w, b)·P(x2, y2|w, b)·

P(x3, y3|w, b). . .. Maximizing log likelihood is equivalent to minimizing negative log likelihood. Letting

zi = w

T xi + b, the loss function (which we will minimize — lower loss is good!) for logistic regression

1https://www.innocenceproject.org/

2https://en.wikipedia.org/wiki/Blackstone%27s ratio

CS322, Spring 2019 Assignment 2

with binary crossentropy loss is:

− log(Likelihood(w, b|{xi

, yi})) = − log(Y

i

P(y = yi

|w, b, xi))

= −

X

i

log(P(y = yi

|w, b, xi))

= −

X

i

log(σ(zi)

yi

· (1 − σ(zi + b))1−yi

)

= −

X

i

yi

· log σ(zi) + (1 − yi) log(σ(−zi))

L({xi

, yi}, w, b) = X

i

− yi

· log σ(zi) − (1 − yi) log(σ(−zi))

Let the loss for datapoint i be defined as the term in the sum corresponding to datapoint i, i.e., Li =

− yi

· log σ(zi) − (1 − yi) log(σ(−zi))

where zi = w

T xi + b.

(a) When yi = 1, what is the algebraic expression for the loss Li

incurred by a given setting of w, b (in

terms of w, b, xi)?

(b) Say w, b were set such that w

T xi + b was 3 for this example. What is P(yi = 1|x)? What is the raw

value of the loss? Is this a good setting of the parameters for this example, or a bad setting?

(c) Say instead that w, b were set such that w

T xi + b was -5 for this example. What is P(yi = 1|x)?

What is the raw value of the loss? Is this a good setting of the parameters for this example, or a bad

setting?

(d) Say instead that w

T xi + b was still -5, but yi was actually 0. What is P(yi = 0|xi)? What is the

algebraic expression for the loss incurred on this example for a given setting of w, b? What is the raw

value of the loss?

(e) To minimize the negative log likelihood with respect to the vector w and the bias value b for a

datapoint (xi

, yi), we may use a gradient-based method like gradient descent. Thus — we need to

compute the gradient of the loss function with respect to each parameter. We will drop the subscript

i in this question for notational ease…

i. What is dL

dz =

d

dz

− y · log σ(z) − (1 − y) log(σ(−z))

? Potentially helpful facts: d

dz σ(z) =

σ(z) · (1 − σ(z)) and d

dz log(z) = 1

z

ii. What is dz

db =

d

db (w

T x + b)?

iii. What is dz

dwi

=

d

dwi

w

T x + b =

d

dwi

P

j wjxj + b?

iv. Wrapping up — what is the gradient of the loss L with respect to the parameter b? (Note that

the chain rule states: dL

db =

dL

dz

dz

db ).

v. Wrapping up — what is the gradient of the loss L with respect to the parameter wi? (Note that

the chain rule states: dL

dwi

=

dL

dz

dz

dwi

)

Now that you’ve derived the gradient of the loss with respect to each parameter, you could write your

own gradient descent loop, i.e., with learning rate α, wi

:= wi−α∗

dL

dwi

for each wi

, and b := b−α∗

dL

db .

You aren’t required to do so, but you did do the hardest parts!

7. Please complete intro.py which contains some numpy introduction exercises. Insert your answers in the

code (see the instructions at the top of the file).