HOMEWORK 2

CSC2515

1. Bias and Variance Decomposition for the ℓ2-regularized Mean Estimator – 25pts.

This exercise helps you become more comfortable with the bias and variance calculations and

decomposition. It focuses on the simplified setting of the mean estimator.

Consider a r.v. Y with an unknown distribution p. This random variable has an (unknown) mean µ = E[Y] and variance σ² = Var[Y] = E[(Y − µ)²]. Consider a dataset D = {Y_1, . . . , Y_n} with independently sampled Y_i ∼ p.

(a) [3pt] Show that the sample average estimator h_avg(D) = (1/n) Σ_{i=1}^n Y_i is the solution of the following optimization problem:

\[
h_{\mathrm{avg}}(D) \leftarrow \operatorname*{argmin}_{m \in \mathbb{R}} \; \frac{1}{n} \sum_{i=1}^{n} |Y_i - m|^2 .
\]

Recall that we have the following bias-variance decomposition for the mean estimator h(D):

\[
\mathbb{E}_{D}\left[ |h(D) - \mu|^2 \right]
= \underbrace{\left| \mathbb{E}_{D}[h(D)] - \mu \right|^2}_{\text{bias}}
+ \underbrace{\mathbb{E}_{D}\left[ \left| h(D) - \mathbb{E}_{D}[h(D)] \right|^2 \right]}_{\text{variance}} .
\]

(b) [2pt] Compute the bias and variance of havg(D).

(c) [5pt] Consider the ℓ2-regularized mean estimator hλ(D), defined for any λ ≥ 0 as

\[
h_{\lambda}(D) \leftarrow \operatorname*{argmin}_{m \in \mathbb{R}} \; \frac{1}{n} \sum_{i=1}^{n} |Y_i - m|^2 + \lambda |m|^2 .
\]

Notice that this is similar to ridge regression, but for a single random variable. Provide an explicit formula for the estimator hλ(D) in terms of the sample average and λ.


(d) [6pt] Compute the bias and variance of this regularized estimator.

(e) [5pt] Visualize E_D[|hλ(D) − µ|²], the bias, and the variance terms for a range of λ. As a starting point, you can choose µ = 1, σ² = 9, and n = 10.
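As a sketch of how such a visualization could be set up (assuming, for concreteness, Gaussian samples, since p is left unspecified), the bias and variance terms can be estimated by Monte Carlo over many simulated datasets; the estimator is computed here by grid search over m, so no closed-form answer from part (c) is presupposed:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n = 1.0, 9.0, 10        # starting values suggested above
lambdas = np.linspace(0.0, 20.0, 21)
n_trials = 300                       # simulated datasets per lambda

def h_lambda(Y, lam):
    # Minimize (1/n) sum_i (Y_i - m)^2 + lam * m^2 by grid search over m.
    grid = np.linspace(-5.0, 5.0, 2001)
    loss = ((Y[:, None] - grid[None, :]) ** 2).mean(axis=0) + lam * grid ** 2
    return grid[np.argmin(loss)]

bias2, var = [], []
for lam in lambdas:
    ests = np.array([h_lambda(rng.normal(mu, np.sqrt(sigma2), n), lam)
                     for _ in range(n_trials)])
    bias2.append((ests.mean() - mu) ** 2)   # squared bias estimate
    var.append(ests.var())                   # variance estimate
```

Plotting `bias2`, `var`, and their elementwise sum against `lambdas` with matplotlib then gives the requested figure.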

(f) [4pt] Explain the visualization from the previous part, in terms of the effect of bias and

variance on the expected squared error.


2. Different Flavours of Regression – 20 pts. We have seen in the lecture how to solve the regression problem with linear models and the squared error loss. In this question, you will learn two other regression methods: Poisson regression and locally weighted regression. You will derive the associated loss functions and solutions for these methods.

Poisson regression: Recall that the squared error has a probabilistic justification: if we assume that the target values are contaminated with zero-mean Gaussian noise, the maximum likelihood estimate coincides with the minimizer of the least squares loss.

We may use probabilistic models other than the Gaussian distribution to model the targets.

One such choice is the Poisson distribution, which is a discrete probability distribution that can

be used to model the number of times a certain event has happened. The probability mass function

of a random variable Z with a Poisson distribution with rate λ ∈ (0, ∞) is

\[
P\{Z = k; \lambda\} = p(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!} . \tag{2.1}
\]

This is the probability that the number of events Z is k ∈ {0, 1, . . . }. It can be shown that

the expected value and the variance of this random variable are both λ, that is, E[Z] = λ and

Var[Z] = λ.
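These two facts can be checked numerically from the probability mass function itself (truncating the infinite sums, which is safe for a moderate rate):

```python
import math

lam = 3.5  # an arbitrary rate, chosen for illustration

def pmf(k):
    # Poisson probability mass function, Eq. (2.1)
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Truncate the infinite sums; the tail beyond k = 100 is negligible for lam = 3.5.
mean = sum(k * pmf(k) for k in range(100))
var = sum((k - mean) ** 2 * pmf(k) for k in range(100))
```

Both `mean` and `var` come out equal to `lam` up to truncation error.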

The Poisson distribution can be used to model the number of events in a given amount of time

or space. Some examples are the number of calls to a call centre in an hour, network packages

arriving at a router in a minute, meteorites hitting Earth each year, goals or scores in a soccer or

hockey match, bikes used at a bicycle sharing facility per-hour, and “soldiers killed by horse-kicks

each year in the Prussian cavalry”.

We may use the Poisson distribution to model a regression problem when the target has a count

interpretation. In that case, instead of having a fixed rate λ ∈ (0, ∞), the rate λ is a function

of the input, which means that λ : X → (0, ∞). This modelling might be more natural for some

problems than using a zero-mean Gaussian noise contamination.

(a) [5pt] Suppose that we are given a dataset {Z_1, . . . , Z_N}, all drawn independently from a Poisson distribution (2.1) with unknown parameter λ. Derive the maximum likelihood estimate of λ. Show the individual steps and note where you use the data assumptions (identically and independently distributed).

Hint: Use the standard approach to derive the MLE via $\operatorname{argmax}_{\lambda} \log \prod_{i} p(Z_i; \lambda)$.
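Once you have a candidate formula, a numerical cross-check (with a synthetic dataset and an arbitrary true rate, both assumptions for illustration) is to locate the maximizer of the log-likelihood on a grid and compare it with your derived answer:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
Z = rng.poisson(4.2, size=500)   # synthetic data with an arbitrary rate of 4.2

# Evaluate the log-likelihood on a grid of candidate rates and take the argmax.
grid = np.linspace(0.5, 10.0, 2000)
loglik = poisson.logpmf(Z[:, None], grid[None, :]).sum(axis=0)
lam_hat = grid[np.argmax(loglik)]
```

The grid maximizer `lam_hat` should agree with your closed-form MLE up to the grid resolution.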

(b) [5pt] The Poisson regression model is based on considering the λ parameter of the Poisson

distribution a function of x and a weight w, which should be determined. The relation is

specified by

log λ(x; w) = w⊤x.

Here the logarithm of the rate has a linear model, as opposed to the rate itself. This ensures

that the rate is always non-negative.

This rate function leads to the following statistical model for a target y ∈ {0, 1, . . . } conditioned on the input x:

\[
p(y \mid x; w) = \frac{\exp(y \, w^{\top} x) \cdot \exp\!\left(-e^{w^{\top} x}\right)}{y!} .
\]


Derive the maximum likelihood function of this model given a dataset consisting of i.i.d. input and response variables {(x^{(1)}, y^{(1)}), . . . , (x^{(N)}, y^{(N)})}. Note that the resulting estimator does not have a closed form solution, so you only have to simplify the resulting loss function as far as possible.

Locally Weighted Regression is a non-parametric algorithm: the model does not learn a fixed set of parameters, as is done in regression with a linear model. Rather, parameters are computed individually for each data point x. The next two questions help you derive locally weighted regression.

(c) [5pt] The weighted least squares cost uses positive weights a^{(1)}, . . . , a^{(N)}, one per datapoint, to construct a parameter estimate of the form:

\[
w^{*} \leftarrow \operatorname*{argmin}_{w} \; \frac{1}{2} \sum_{i=1}^{N} a^{(i)} \left( y^{(i)} - w^{\top} x^{(i)} \right)^{2} .
\]

Show that the solution to this optimization problem is given by the formula

\[
w^{*} = \left( X^{\top} A X \right)^{-1} X^{\top} A y ,
\]

where X is the design matrix (defined in class) and A is a diagonal matrix with A_{ii} = a^{(i)}.
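A quick numerical check of this closed form is possible in numpy (the design matrix, true weights, and per-datapoint weights below are synthetic, chosen for illustration); solving the linear system is preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))            # synthetic design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=N)
a = rng.uniform(0.5, 2.0, size=N)      # arbitrary positive weights
A = np.diag(a)

# Solve (X^T A X) w = X^T A y rather than inverting X^T A X explicitly.
w_star = np.linalg.solve(X.T @ A @ X, X.T @ A @ y)
```

With near-noiseless data, `w_star` should recover `w_true` closely regardless of the weights.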

(d) [5pt] Locally weighted least squares combines ideas from k-NN and linear regression. For each query x, we first compute distance-based weights for each training example

\[
a^{(i)}(x) = \frac{\exp\left( -\frac{\|x - x^{(i)}\|^{2}}{2\tau^{2}} \right)}{\sum_{j=1}^{N} \exp\left( -\frac{\|x - x^{(j)}\|^{2}}{2\tau^{2}} \right)}
\]

for some temperature parameter τ > 0. We then construct a local solution

\[
w^{*}(x) \leftarrow \operatorname*{argmin}_{w} \; \frac{1}{2} \sum_{i=1}^{N} a^{(i)}(x) \left( y^{(i)} - w^{\top} x^{(i)} \right)^{2} .
\]

The prediction then takes the form ŷ = w^{*}(x)^{⊤} x. How does this algorithm behave as the

x. How does this algorithm behave as the

temperature τ → 0? What happens as τ → ∞? What is the disadvantage of this approach

compared to the least squares regression in terms of computational complexity?
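When implementing these weights (which you will need in Question 3.2), note that the exponentials can underflow for small τ; one numerically stable sketch, using a toy training set as an illustrative assumption, is:

```python
import numpy as np

def lwr_weights(x, X_train, tau):
    # Softmax over negative scaled squared distances. Subtracting the largest
    # exponent before exponentiating avoids underflow when tau is small.
    d2 = ((X_train - x) ** 2).sum(axis=1)
    logits = -d2 / (2.0 * tau ** 2)
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

# Toy data, for illustration only.
X_train = np.array([[0.0], [1.0], [3.0]])
x = np.array([0.9])
w_small_tau = lwr_weights(x, X_train, tau=0.05)
w_large_tau = lwr_weights(x, X_train, tau=100.0)
```

Comparing `w_small_tau` with `w_large_tau` hints at the limiting behaviours this part asks about.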


3. Implementing Regression Methods in Python – 55 pts. This question will take

you step-by-step through implementing and empirically studying several regression methods on

the Capital Bikesharing dataset. We are interested in predicting the number of used bikes

on a per-hour basis based on calendar features (e.g., time, weekday, holiday) and environmental

measurements (e.g., humidity, temperature).

We ask that you provide us with both a Jupyter notebook source file as well as an exported

PDF file of the notebook. Your code needs to be functional once downloaded. Non-functional code

will result in losing all marks for this question. If your code is non-modular or otherwise difficult

to understand, you risk losing a significant portion of the marks, even if the code is functional.

Environment setup: For this question you are strongly encouraged to use the following Python

packages:

• sklearn

• matplotlib

• numpy

• pandas

• autograd

Note that you may NOT use the sklearn implementations for least squares regression, Poisson

regression, and locally weighted least squares regression since that would trivialize the exercise.

You may however use your code from the gradient descent tutorial.

3.1. Initial data analysis – 15 pts.

(a) [0pt] Load the Capital Bikesharing Dataset from the downloaded hour.csv file as a pandas

dataframe.

(b) [2pt] Describe and summarize the data in terms of number of data points, dimensions, and

used data types.

(c) [5pt] Present a single grid containing plots for each feature against the target. Choose the

appropriate axis for dependent vs. independent variables.

Hint: use the pyplot.tight_layout function to make your grid readable.

(d) [5pt] Perform a correlation analysis on the data and plot the correlation matrix as a colored

image. State which feature is the most positively, most negatively, and least correlated with

the target column cnt.

(e) [1pt] Drop the following columns from the dataframe: instant, atemp, registered, casual,

dteday.

(f) [2pt] Shuffle the dataframe’s rows using sklearn.utils.shuffle with random state 0.

Split the data into a training set and a test set on index 10000.
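A minimal sketch of this step; the toy dataframe below stands in for the real data, which would instead be loaded with pandas.read_csv("hour.csv"):

```python
import pandas as pd
from sklearn.utils import shuffle

# Toy stand-in frame with the same row count ballpark as hour.csv;
# replace with pd.read_csv("hour.csv") in the actual notebook.
df = pd.DataFrame({"cnt": range(12000), "temp": [0.5] * 12000})

df = shuffle(df, random_state=0).reset_index(drop=True)
train, test = df.iloc[:10000], df.iloc[10000:]
```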

3.2. Regression implementations – 40 pts.

(a) [5pt] Implement the ordinary least squares regression algorithm as discussed in class. You

are free to implement the closed form solution.


(b) [3pt] Fit the ordinary least squares regression model to the pre-processed data. Report the coefficient of determination, also known as the R² score, of your model. The R² score between the correct target labels y and the predicted target labels ŷ can be calculated as:

\[
R^{2}(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} \left( y_i - \frac{1}{n} \sum_{j=1}^{n} y_j \right)^{2}}
\]

Hint: you may use sklearn.metrics.r2_score.
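A direct transcription of this formula (on toy arrays, chosen for illustration) can be cross-checked against the sklearn implementation:

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, 1.0, 4.0, 1.5])       # toy target labels
y_hat = np.array([2.5, 0.5, 4.0, 2.0])   # toy predictions

ss_res = ((y - y_hat) ** 2).sum()        # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()     # total sum of squares around the mean
r2 = 1.0 - ss_res / ss_tot
```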

(c) [5pt] You will find that the fit of the model is not very good. This is partly because the dataset contains categorical input variables. So far, your implementation uses these categorical features as continuous inputs, which hurts performance. Instead, it

is advised to explicitly encode these input variables as categorical variables. Recall that one

way of dealing with categorical variables is to replace them with 1-hot-encoded vectors. The

following columns in the dataframe are known to be categorical: season, mnth, hr, weekday,

weathersit. Substitute these features with 1-hot-encoded vectors in the dataframe. Use this

dataframe for all upcoming questions. Make sure that you split the dataframe as described

in Question 3.1 (f).

Hint: use pandas.get_dummies.
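A minimal sketch of the substitution on a toy dataframe (the column values here are made up for illustration; the real frame has all five categorical columns listed above):

```python
import pandas as pd

# Toy frame with one categorical column standing in for season/mnth/hr/etc.
df = pd.DataFrame({"season": [1, 2, 2, 4], "temp": [0.2, 0.5, 0.6, 0.1]})

# Replace the categorical column with 1-hot indicator columns.
df_1hot = pd.get_dummies(df, columns=["season"])
```

`get_dummies` drops the original column and adds one indicator column per observed category (here `season_1`, `season_2`, `season_4`).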

(d) [2pt] Re-fit the model with the new data and report the updated R² score.

(e) [5pt] Implement the locally weighted regression algorithm as described in Question 2.

(f) [5pt] Fit the locally weighted regression model to the data with τ = 1. Report the R² score of your model. Verify whether the expected behaviour for τ → 0 and τ → ∞ described in your answer to Question 2 holds, and describe what you observe.

For this question, you may use a reduced test set with the first 200 samples from the full

test set. If you do so, please mark this clearly on your solution.

(g) [2pt] Plot a histogram of the target variable. What distribution does the target follow?

(h) [5pt] Implement the Poisson regression algorithm as described in Question 2.

Since the maximum likelihood estimate is not solvable in closed form, you will need to implement gradient descent. You may use autograd to compute the gradient of the loss function, but implement the gradient descent procedure by hand as presented in the tutorial.
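As a sketch of what the by-hand gradient descent loop could look like (here with a manually derived gradient standing in for autograd, so you can cross-check your autograd version; the learning rate, step count, and data below are illustrative assumptions, not prescribed values):

```python
import numpy as np

def poisson_nll(w, X, y):
    # Average negative log-likelihood of the model log lambda(x; w) = w^T x,
    # dropping the log(y!) terms, which do not depend on w.
    eta = X @ w
    return (np.exp(eta) - y * eta).mean()

def fit_poisson_gd(X, y, lr=0.1, steps=3000):
    # Plain full-batch gradient descent from a zero initialization.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (np.exp(X @ w) - y) / len(y)  # gradient of poisson_nll
        w -= lr * grad
    return w
```

Replacing the hand-written `grad` line with `autograd.grad(poisson_nll)(w, X, y)` should give the same iterates, which is a useful consistency check.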

(i) [3pt] Fit the Poisson model to the data. Report the fraction of explained Tweedie deviance, also known as the D score, of your model. The D score between the correct target labels y and the predicted target labels ŷ can be calculated as:

\[
D(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} 2 \left( y_i \log(y_i / \hat{y}_i) + \hat{y}_i - y_i \right)
\]

Hint: you may use sklearn.metrics.d2_tweedie_score with power=1.
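The average in this formula can be transcribed directly, using scipy's xlogy to handle zero counts (y_i = 0) by the convention 0·log 0 = 0; the toy arrays below are for illustration, and the result can be cross-checked against sklearn's mean Poisson deviance:

```python
import numpy as np
from scipy.special import xlogy
from sklearn.metrics import mean_poisson_deviance

y = np.array([2.0, 0.0, 5.0, 3.0])       # toy count targets
y_hat = np.array([1.5, 0.5, 4.0, 3.5])   # toy positive predictions

# xlogy(y, y / y_hat) returns 0 where y is 0, so zero counts are safe.
dev = np.mean(2.0 * (xlogy(y, y / y_hat) - y + y_hat))
```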


(j) [5pt] For Linear Regression and Poisson Regression, report the final weights. Which are the most and least significant features in each model? Justify your answer. If a feature is categorical, report both the feature name and the 1-hot index. Briefly explain why visualizing weight contributions is not meaningful for locally weighted linear regression.
