## Description

DS 303: Homework 4 (Fall 2021)

Instructions: Homework is to be submitted on Canvas by the deadline stated above. Please clearly print your name and student ID number on your HW. Show your work (including calculations) to receive full credit. Please make your submission as readable as possible – this means no raw R output or code (unless it is asked for specifically or needed for clarity).

Code should be submitted with your homework as a separate file (for example, a .R file, text file, Word file, or .Rmd file are all acceptable). Mark the sections of code that correspond to different homework problems using comments (e.g. ##### Problem 1 #####).

## Problem 1: Best subset selection

The data for this problem come from a study by Stamey et al. (1989), who examined the relationship between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). The last column (train) indicates which observations were used in the training set and which were used in the test set.

Read in the prostate data set using the following code:

```r
prostate = read.table('…/prostate.data', header = TRUE)
```

In place of '…', specify the path where you saved the dataset.

Our response of interest here is the log prostate-specific antigen (lpsa). We will use this data set to practice three common subset selection approaches.

a. Approach 1: Perform best subset selection on the entire data set with lpsa as the response. For each model size, you will obtain a 'best' model (size here is just the number of predictors in the model): M1 is the best model with 1 predictor (size 1), M2 is the best model with 2 predictors (size 2), and so on. Create a table of the AIC, BIC, adjusted R², and Mallows' Cp for each model size. Report the model with the smallest AIC, smallest BIC, largest adjusted R², and smallest Mallows' Cp. Do they lead to different results? Using your own judgement, choose a final model.
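One way this table might be assembled is sketched below, assuming the `leaps` package is installed. Since the file path to the prostate data is elided above, the block uses simulated stand-in data with the same column names; swap in the real `prostate` data frame from the `read.table()` call. The AIC formula used here is valid up to an additive constant, which does not affect model ranking.

```r
library(leaps)  # regsubsets() for best subset selection

# Stand-in for the prostate data (replace with the real read.table() call above).
set.seed(303)
n <- 97; p <- 8
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- c("lcavol","lweight","age","lbph","svi","lcp","gleason","pgg45")
prostate <- data.frame(X, lpsa = X %*% c(0.6, 0.3, 0, 0.1, 0.7, 0, 0, 0) + rnorm(n))

regfit <- regsubsets(lpsa ~ ., data = prostate, nvmax = p)
s <- summary(regfit)

# summary() reports BIC, adjusted R^2, and Cp directly; AIC can be recovered
# from the RSS: AIC = n*log(RSS/n) + 2*(size + 1), up to an additive constant.
sizes <- 1:p
aic <- n * log(s$rss / n) + 2 * (sizes + 1)
tab <- data.frame(size = sizes, AIC = aic, BIC = s$bic, adjR2 = s$adjr2, Cp = s$cp)
print(tab)

# Best size under each criterion:
c(AIC = which.min(aic), BIC = which.min(s$bic),
  adjR2 = which.max(s$adjr2), Cp = which.min(s$cp))
```

The criteria can disagree (BIC penalizes size more heavily than AIC), which is why the problem asks for your own judgement at the end.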

b. Approach 2: The dataset has already been split into a training and a test set. Construct your training and test set based on this split. You may use the following code for convenience:

```r
train = subset(prostate, train == TRUE)[, 1:9]
test = subset(prostate, train == FALSE)[, 1:9]
```


For each model size, you will obtain a 'best' model. Fit each of those models on the training set. Then evaluate model performance on the test set by computing each model's test MSE. Choose a final model based on prediction accuracy. Fit that model to the full dataset and report your final model here.
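A sketch of the train/test evaluation, again assuming the `leaps` package and using simulated stand-in data (with a train indicator column) in place of the real prostate file. Since `regsubsets` objects have no predict method, predictions are built by hand from the coefficient vector of each size's best model:

```r
library(leaps)

# Stand-in data with the train/test indicator (replace with the real prostate data).
set.seed(303)
n <- 97; p <- 8
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- c("lcavol","lweight","age","lbph","svi","lcp","gleason","pgg45")
prostate <- data.frame(X, lpsa = X %*% c(0.6, 0.3, 0, 0.1, 0.7, 0, 0, 0) + rnorm(n),
                       train = rep(c(TRUE, FALSE), length.out = n))

train <- subset(prostate, train == TRUE)[, 1:9]
test  <- subset(prostate, train == FALSE)[, 1:9]

regfit <- regsubsets(lpsa ~ ., data = train, nvmax = p)
test.mat <- model.matrix(lpsa ~ ., data = test)   # design matrix for the test set

test.mse <- rep(NA, p)
for (i in 1:p) {
  coefi <- coef(regfit, id = i)                   # best size-i model's coefficients
  pred  <- test.mat[, names(coefi)] %*% coefi     # predictions on the test set
  test.mse[i] <- mean((test$lpsa - pred)^2)
}
test.mse
which.min(test.mse)   # model size with the best prediction accuracy
```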

c. Approach 3: This approach is used to select the optimal model size, not which predictors end up in our model. Split the dataset into K folds (you decide what K should be). We will perform best subset selection within each of the K training sets. Here are more detailed instructions:

i. For each fold k = 1, . . . , K:

1. Perform best subset selection using all the data except for those in fold k (the training set). For each model size, you will obtain a 'best' model.
2. For each 'best' model, evaluate the test MSE on the data in fold k (the test set).
3. Store the test MSE for each model.

Once you have completed this for all K folds, take the average of your test MSEs for each model size. In other words, for all K models of size 1, you will compute their K-fold cross-validated error; for all K models of size 2, you will compute their K-fold cross-validated error; and so on. Report your 8 CV errors here.

ii. Choose the model size that gives you the smallest CV error. Now perform best subset selection on the full data set again in order to obtain this final model. Report that model here. (For example, suppose cross-validation selected a 5-predictor model. I would perform best subset selection on the full data set again in order to obtain the 5-predictor model.)
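The loop in step i can be sketched as follows, with K = 5 as one possible choice, the `leaps` package assumed, and simulated stand-in data in place of the real prostate set (predictors plus lpsa only):

```r
library(leaps)

# Stand-in data (replace with the real prostate data: 8 predictors plus lpsa).
set.seed(303)
n <- 97; p <- 8
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- c("lcavol","lweight","age","lbph","svi","lcp","gleason","pgg45")
dat <- data.frame(X, lpsa = X %*% c(0.6, 0.3, 0, 0.1, 0.7, 0, 0, 0) + rnorm(n))

K <- 5
folds <- sample(rep(1:K, length.out = n))   # random fold assignment
cv.err <- matrix(NA, K, p)                  # one row per fold, one column per size

for (k in 1:K) {
  fit <- regsubsets(lpsa ~ ., data = dat[folds != k, ], nvmax = p)
  test.mat <- model.matrix(lpsa ~ ., data = dat[folds == k, ])
  for (i in 1:p) {
    coefi <- coef(fit, id = i)
    pred  <- test.mat[, names(coefi)] %*% coefi
    cv.err[k, i] <- mean((dat$lpsa[folds == k] - pred)^2)
  }
}

cv.mean <- colMeans(cv.err)   # the 8 cross-validated errors, one per model size
cv.mean
which.min(cv.mean)            # optimal model size for step ii
```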


## Problem 2: Cross-validation

a. Explain how k-fold cross-validation is implemented.

b. What are the advantages and disadvantages of k-fold cross-validation relative to:

i. The validation set approach?

ii. LOOCV?

c. For the following questions, we will perform cross-validation on a simulated data set. Generate a simulated data set such that Y = X − 2X² + ε, where ε ∼ N(0, 1/2). Fill in the following code:

```r
set.seed(1)
x = rnorm(100)
error = ??
y = ??
```

d. Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:

M1: a linear model with X
M2: a polynomial regression model of degree 2
M3: a polynomial regression model of degree 3
M4: a polynomial regression model of degree 4

You may find it helpful to use the data.frame() function to create a single data set containing both X and Y.
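The four LOOCV errors can be computed with `cv.glm()` from the `boot` package (a `glm()` fit with the default Gaussian family is just least squares). The data-generation lines below are one plausible completion of the blanks in (c), reading N(0, 1/2) as an error variance of 1/2, i.e. sd = sqrt(1/2):

```r
library(boot)   # cv.glm() computes cross-validation errors

set.seed(1)
x <- rnorm(100)
# One reading of (c): error variance 1/2, i.e. standard deviation sqrt(1/2).
error <- rnorm(100, mean = 0, sd = sqrt(1/2))
y <- x - 2 * x^2 + error
dat <- data.frame(x, y)

loocv.err <- rep(NA, 4)
for (d in 1:4) {
  fit <- glm(y ~ poly(x, d), data = dat)       # least squares via glm()
  loocv.err[d] <- cv.glm(dat, fit)$delta[1]    # LOOCV estimate (K = n by default)
}
loocv.err   # errors for M1, M2, M3, M4
```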

e. Repeat the above step using another random seed, and report your results. Are your results the same as what you got in (d)? Why?

f. Which of the models in (d) had the smallest LOOCV error? Is this what you expected? Explain your answer.

g. Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (d) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?

## Problem 3: Forward and backward selection

We will use the College data set in the ISLR2 library to predict the number of applications (Apps) each university received. Randomly split the data set so that 90% of the data belong to the training set and the remaining 10% belong to the test set. Implement forward and backward selection on the training set only. For each approach, report the best model based on AIC. From these two models, pick a final model based on their performance on the test set. Report both models' test MSEs and summarize your final model.
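The workflow can be sketched with base R's `step()`, which adds or drops terms by AIC at each step. The block below substitutes a small simulated data frame (with a few made-up College-style column names) for `ISLR2::College`, since the package may not be installed; with the real data, replace the stand-in with `library(ISLR2); dat <- College` and an explicit full formula over its predictors:

```r
# Stand-in for ISLR2::College (swap in the real data set when available).
set.seed(303)
n <- 777
dat <- data.frame(matrix(rnorm(n * 5), n, 5))
names(dat) <- c("Accept", "Enroll", "Top10perc", "Outstate", "Expend")
dat$Apps <- 2 * dat$Accept + dat$Enroll + rnorm(n)

idx   <- sample(n, size = floor(0.9 * n))   # 90/10 train/test split
train <- dat[idx, ]
test  <- dat[-idx, ]

full.form <- Apps ~ Accept + Enroll + Top10perc + Outstate + Expend
null.fit  <- lm(Apps ~ 1, data = train)
full.fit  <- lm(full.form, data = train)

# step() selects by AIC; forward grows from the null model, backward prunes the full one.
fwd <- step(null.fit, scope = full.form, direction = "forward", trace = 0)
bwd <- step(full.fit, direction = "backward", trace = 0)

mse <- function(fit) mean((test$Apps - predict(fit, newdata = test))^2)
c(forward = mse(fwd), backward = mse(bwd))   # pick the final model from these
```

On a real data set the two directions need not select the same model, which is why the problem has you compare them on the held-out test set.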
