## Description

DS 303 Homework 5 (Fall 2021)

Instructions: Homework is to be submitted on Canvas by the deadline stated above. Please clearly print your name and student ID number on your HW.

Show your work (including calculations) to receive full credit. Please make your submission as readable as possible: no raw R output or code, unless it is asked for specifically or needed for clarity.

Code should be submitted with your homework as a separate file (for example, a .R file, text file, Word file, or .Rmd file are all acceptable). Mark the sections of code that correspond to different homework problems using comments (e.g. ##### Problem 1 #####).

## Problem 1: Concept Review

a. Best subset selection produces a collection of p + 1 models M_0, M_1, M_2, ..., M_p. These represent the 'best' model of each size, where 'best' is defined as the model with the smallest RSS. Is it true that the predictors in M_k must be a subset of the predictors in M_{k+1}? In other words, if M_1 : Y ~ X_1, must M_2 also contain X_1? And if M_2 contains X_1 and X_2, must M_3 also contain X_1 and X_2? Explain your answer.

b. Same question as part (a), but now we carry out forward stepwise selection instead of best subset selection.

c. Suppose we perform best subset, forward stepwise, and backward stepwise selection on a single data set. Each approach yields p + 1 models containing 0, 1, 2, ..., p predictors. Best subset selection gives us a best model with k predictors; call this M_{k,subset}. Forward stepwise selection gives us a best model with k predictors; call this M_{k,forward}. Backward stepwise selection gives us a best model with k predictors; call this M_{k,backward}. Which of these three models has the smallest training MSE? Explain your answer. Hint: consider the cases k = 0 and k = p first, then the case k = 1, then the cases k = 2, ..., p - 1.

d. Same setup as part (c). Which of these three models has the smallest test MSE? Explain your answer.

## Problem 2: Simulation Studies

a. Use the rnorm() function to generate a predictor X of length n = 100, as well as an error vector ε of length n = 100. Assume that ε has variance 1.

b. Generate a response vector Y of length n = 100 according to the model

   Y = β0 + β1 X + β2 X^2 + β3 X^3 + ε,

   where β0, β1, β2, and β3 are constants of your choice.
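For parts (a) and (b), a minimal sketch of the data-generating step might look like the following; the seed and β values here are arbitrary placeholders, not required choices:

```r
# Simulate the predictor, the error term, and the cubic response
set.seed(303)                      # placeholder seed, for reproducibility only
n <- 100
X <- rnorm(n)                      # predictor of length 100
eps <- rnorm(n, mean = 0, sd = 1)  # error term with variance 1
b0 <- 1; b1 <- 2; b2 <- -3; b3 <- 0.5  # example constants of your choice
Y <- b0 + b1*X + b2*X^2 + b3*X^3 + eps # response vector of length 100
```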

c. Use best subset selection to choose the best model containing the predictors X, X^2, ..., X^10. What is the best model obtained according to BIC and adjusted R^2? Report the coefficients of the best model obtained. Note: you will need to use the data.frame() function to create a single data set containing both X and Y.
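One way to set this up is with regsubsets() from the leaps package, as in lecture. This is only a sketch; the seed and β values are placeholders standing in for your choices from parts (a) and (b):

```r
library(leaps)  # provides regsubsets()

set.seed(303)                                # placeholder seed
X <- rnorm(100)
Y <- 1 + 2*X - 3*X^2 + 0.5*X^3 + rnorm(100)  # example betas from part (b)

# Single data frame holding Y and the candidate predictors X, X^2, ..., X^10
df <- data.frame(Y = Y, poly(X, 10, raw = TRUE))

regfit <- regsubsets(Y ~ ., data = df, nvmax = 10)  # best subset selection
reg.summ <- summary(regfit)

k.bic <- which.min(reg.summ$bic)      # model size chosen by BIC
k.adjr2 <- which.max(reg.summ$adjr2)  # model size chosen by adjusted R^2
coef(regfit, k.bic)                   # coefficients of the BIC-best model
```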

d. Repeat (c) using forward selection and also using backward selection. Report the best models obtained according to BIC and adjusted R^2 for both approaches. How do your answers compare to the results in part (c)?

e. Now fit a lasso model to the simulated data, again using X, X^2, ..., X^10 as predictors. Use 10-fold cross-validation to select the optimal value of λ. Present a plot of the cross-validation error as a function of λ. Report the resulting coefficient estimates and discuss the results obtained.
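A sketch of the lasso step with cv.glmnet() from the glmnet package; again the seed and β values are placeholders for your own choices:

```r
library(glmnet)  # provides cv.glmnet()

set.seed(303)                                # placeholder seed
X <- rnorm(100)
Y <- 1 + 2*X - 3*X^2 + 0.5*X^3 + rnorm(100)  # example betas from part (b)
xmat <- poly(X, 10, raw = TRUE)              # columns X, X^2, ..., X^10

cv.out <- cv.glmnet(xmat, Y, alpha = 1, nfolds = 10)  # alpha = 1 is the lasso
plot(cv.out)                         # CV error as a function of log(lambda)
best.lam <- cv.out$lambda.min        # optimal lambda from 10-fold CV
coef(cv.out, s = best.lam)           # coefficient estimates at that lambda
```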

f. Now generate a response vector Y according to the model

   Y = β0 + β7 X^7 + ε,

   and perform best subset selection and the lasso (again using the predictors X, X^2, ..., X^10). Discuss the results obtained.

## Problem 3: Ridge Regression

For this problem, we will use the College data set in the ISLR2 R package. Our aim is to predict the number of applications (Apps) received using the other variables in the data set.

a. Split the data set into a training set and a test set. Please use set.seed(12) so that we all have the same results.

b. Fit a least squares linear model (using all predictors) on the training set, and report the test MSE obtained.

c. Fit a ridge regression model (using all predictors) on the training set. The glmnet function, by default, internally scales the predictor variables so that they have standard deviation 1. Explain why this scaling is necessary when implementing regularized models.

d. Find an optimal λ for the ridge regression model on the training set by using 5-fold cross-validation. Report the optimal λ here.

e. Using that optimal λ, evaluate your trained ridge regression model on the test set. Report the test MSE obtained. Is there an improvement over the model from part (b)?
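For parts (d) and (e), the workflow might look like the following sketch. The 50/50 split is one possible choice and is an assumption here, not something the problem fixes:

```r
library(ISLR2)   # College data
library(glmnet)  # ridge regression via cv.glmnet()

set.seed(12)
College <- na.omit(College)
x <- model.matrix(Apps ~ ., data = College)[, -1]
y <- College$Apps
train <- sample(1:nrow(x), nrow(x)/2)  # one possible 50/50 split (assumption)
test <- (-train)

cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0, nfolds = 5)  # alpha = 0 is ridge
cv.out$lambda.min                      # optimal lambda from 5-fold CV

ridge.pred <- predict(cv.out, s = cv.out$lambda.min, newx = x[test, ])
test.mse <- mean((ridge.pred - y[test])^2)  # test MSE at the optimal lambda
```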

f. Fit a lasso regression model on the training set. Find the optimal λ using 5-fold cross-validation. Report the optimal λ and the test MSE obtained.

g. Comment on your results. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these three approaches?


## Problem 4: Regularized Regression Models

For this problem, we will continue with the Hitters example from lecture. Our aim is to predict the salary of baseball players based on their career statistics.

a. We will start with a little data cleaning. We'll also split the data into a training set and a test set. So that we all get the same results, please use the following code:

```r
library(ISLR2)
Hitters = na.omit(Hitters)
n = nrow(Hitters) # there are 263 observations
x = model.matrix(Salary ~ ., data = Hitters)[,-1] # 19 predictors
Y = Hitters$Salary
set.seed(1)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
Y.test = Y[test]
```

b. Fit a ridge regression model. Replicate the example we had in class to obtain the optimal λ using 10-fold CV. Present a plot of the cross-validation error as a function of λ. Report that value here and call it λ_min^ridge.

c. Naturally, if we had taken a different training/test split or a different set of folds to carry out cross-validation, our optimal λ, and therefore our test error, would change. An alternative is to select λ using the one-standard-error rule. The idea is that, instead of picking the λ that produces the smallest CV error, we pick the model whose CV error is within one standard error of the lowest point on the curve you produced in part (b). The intention is to produce a more parsimonious model. The glmnet package does all of this hard work for you, and we can extract the λ based on this rule using cv.out$lambda.1se (assuming your cv.glmnet object is named cv.out). Report that λ here and call it λ_1se^ridge.
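Assuming the setup code from part (a), both tuning-parameter values for parts (b) and (c) can be read off the same cv.glmnet object; a sketch:

```r
library(ISLR2)
library(glmnet)

# Setup from part (a)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
Y <- Hitters$Salary
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)

cv.out <- cv.glmnet(x[train, ], Y[train], alpha = 0, nfolds = 10)  # ridge, 10-fold CV
plot(cv.out)        # CV error as a function of log(lambda)
cv.out$lambda.min   # lambda minimizing CV error: lambda_min^ridge
cv.out$lambda.1se   # one-standard-error rule: lambda_1se^ridge
```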

d. Fit a lasso regression model. Replicate the example we had in class to obtain the optimal λ using 10-fold CV. Report that value here and call it λ_min^lasso. Also report the optimal λ chosen using the one-standard-error rule and call it λ_1se^lasso.

e. You now have four values for the tuning parameter: λ_min^ridge, λ_1se^ridge, λ_min^lasso, and λ_1se^lasso. Now evaluate the ridge regression models on your test set using λ = λ_min^ridge and λ = λ_1se^ridge. Evaluate the lasso models on your test set using λ = λ_min^lasso and λ = λ_1se^lasso. Compare the obtained test errors and report them here. Which model performs best in terms of prediction? Do you have any intuition as to why?

f. Report the coefficient estimates from ridge regression using λ_min^ridge and λ_1se^ridge, and likewise for the lasso models. How do the ridge regression estimates compare to those from the lasso? How do the coefficient estimates obtained using λ_min compare to those from the one-standard-error rule?
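One way to line up the four sets of coefficient estimates for comparison; a sketch assuming the part (a) setup, with cv.ridge and cv.lasso as placeholder names:

```r
library(ISLR2)
library(glmnet)

# Setup from part (a)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
Y <- Hitters$Salary
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)

cv.ridge <- cv.glmnet(x[train, ], Y[train], alpha = 0, nfolds = 10)
cv.lasso <- cv.glmnet(x[train, ], Y[train], alpha = 1, nfolds = 10)

# Side-by-side coefficient estimates at lambda.min and lambda.1se
coef.tab <- cbind(
  ridge.min = as.numeric(coef(cv.ridge, s = "lambda.min")),
  ridge.1se = as.numeric(coef(cv.ridge, s = "lambda.1se")),
  lasso.min = as.numeric(coef(cv.lasso, s = "lambda.min")),
  lasso.1se = as.numeric(coef(cv.lasso, s = "lambda.1se")))
rownames(coef.tab) <- rownames(coef(cv.ridge, s = "lambda.min"))
coef.tab
```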

g. If you were to make a recommendation to an upcoming baseball player who wants to make it big in the major leagues, what handful of features would you tell this player to focus on?
