## Description

DS 303 Homework 6

Instructions: Homework is to be submitted on Canvas by the deadline stated above. Please

clearly print your name and student ID number on your HW.

Show your work (including calculations) to receive full credit. Please work hard to make your

submission as readable as you possibly can – this means no raw R output or code (unless

it is asked for specifically or needed for clarity).

Code should be submitted with your homework as a separate file (for example, a

.R file, text file, Word file, or .Rmd are all acceptable). You should mark sections of the

code that correspond to different homework problems using comments (e.g. ##### Problem 1

#####).

Problem 1: Concept Review

1. Explain in plain language (using limited statistics terminology) why lasso can set some of the

regression coefficients to be 0 exactly, while ridge regression cannot. You may include a figure

if that is helpful.

2. Suppose we estimate the regression coefficients in a linear regression model by minimizing

Xn

i=1

yi − β0 −

X

p

j=1

βjxij

2

+ λ

X

p

j=1

β

2

j

for a particular value of λ. For parts (a) through (e), indicate which of i. through v. is

correct. Justify your answer.

a. As we increase λ from 0, the training MSE will:

i. Increase initially, and then eventually start decreasing in an inverted U shape.

ii. Decrease initially, and then eventually start decreasing in an inverted U shape.

iii. Steadily increase.

iv. Steadily decrease.

v. Remain constant.

b. Repeat (a) for test MSE.

c. Repeat (a) for variance.

d. Repeat (a) for (squared) bias.

e. Repeat (a) for irreducible error.

DS 303: Homework 6 1 Fall 2021

Problem 2: Build a predictive model

We will work with the Boston housing data set; it is part of library(ISLR2). Your goal here is

to build a predictive model that can predict per capita crime rate. Split your data into a training

set and test such that 90% of the observations go into the training set and the remaining 10% go

into the test set. Your model building should include the following components:

1. A least-square model with the predictors chosen using a model selection technique of your

choice. Explain and justify what technique you have chosen. Call the from this step Model1.

– Do you think Model1 needs an interaction term between any predictors? Justify.

– Do you think Model1 requires higher order terms to model any non-linearities? Justify.

– Do you think Model1 can be improved using a regression spline? Justify.

2. A ridge regression with the optimal λ chosen using 10-fold cross-validation. Compare your

models using the λ that gives the smallest CV error and the λ based on the one standard

error rule. Call these models Model2a and Model2b, respectively. Report both models.

3. A lasso regression with the optimal λ chosen using 10-fold cross-validation. Compare your

models using the λ that gives the smallest CV error and the λ based on the one standard

error rule. Call these models Model3a and Model3b, respectively. Report both models.

4. Propose a model (or set of models) that seems to perform well on this dataset. Make sure you

are evaluating your model performance using the test set and not using the training error.

Report your chosen model(s) here. Does your chosen model involve all of the features in the

data set? Why or why not?

Problem 3: Bootstrap

We will continue working with the Boston housing data set.

a. Based on this data set, provide an estimate for the population mean of medv. Call this

estimate ˆµ.

b. Provide an estimate of the standard error of ˆµ using an analytical formula. Interpret this

result.

c. Now the estimate the standard error ˆµ using the bootstrap. How does this compare to your

answer from (b)?

d. Using bootstrap, provide a 95% confidence interval for the mean of medv. Compare it to

results using analytical formulas.

e. Based on this data set, provide an estimate ˆµmed for the median value of medv.

f. We would like to estimate the standard error of ˆµmed. Since there is no simple formula for

computing the standard error of the median, use bootstrap. Comment on your findings.

g. Based on this data set, provide an estimate ˆµ0.1, the 10th percentile of medv.

h. Use bootstrap to estimate the standard error of ˆµ0.1. Comment on your findings.

DS 303: Homework 6 2 Fall 2021

Problem 4: Properties of Bootstrap

a. What is the probability that the first bootstrap observation is the jth observation from the

original sample? Justify your answer.

b. What is the probability that the first bootstrap observation is not the jth observation from

the original sample? Justify your answer.

c. What is the probability that the jth observation from the original sample is not in the

bootstrap sample?

d. When n = 5, what is the probability that the jth observation is in the bootstrap sample?

e. When n = 100, what is the probability that the jth observation is in the bootstrap sample?

f. When n = 10, 000, what is the probability that the jth observation is in the bootstrap sample?

g. Create a plot (in R) that displays, for each integer value of n from 1 to 100,000, the probability

that the jth observation is in the bootstrap sample. Comment on what you observe.

h. Investigate numerically the probability that a bootstrap sample of size n = 100 contains the

jth observation. Here j = 5. We repeatedly create bootstrap samples, and each time we

record whether or not the fourth observation is contained in the bootstrap sample. You may

use the following code:

results <- rep(NA, 10000)

for(i in 1:10000){

results[i] <- sum(sample(1:100, rep=TRUE) == 4) > 0

}

mean(results)

Comment on your findings.

DS 303: Homework 6 3 Fall 2021