## Description

DS 303 Homework 1

Instructions: Homework is to be submitted on Canvas by the deadline stated above. Please clearly print your name and student ID number on your HW.

Show your work (including calculations) to receive full credit. Please make your submission as readable as you can – this means no raw R output or code (unless it is asked for specifically or needed for clarity).

Code should be submitted with your homework as a separate file (for example, a .R file, text file, Word file, or .Rmd file are all acceptable). Mark the sections of your code that correspond to different homework problems using comments (e.g. ##### Problem 1 #####).

### Problem 1: Bias-variance decomposition

a. Provide a sketch of the typical (squared) bias, variance, expected test MSE, training MSE, and irreducible error curves on a single plot, as we go from less flexible statistical learning methods towards more flexible methods. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values of each curve. There should be 5 curves; make sure to label each one.

b. Explain why each of the five curves has the shape displayed in part (a).

### Problem 2: Multiple linear regression

For this problem, we will use the Boston data set, which is part of the ISLR2 package. To access the data set, install the ISLR2 package and load it into your R session:

install.packages("ISLR2") # you only need to do this one time.
library(ISLR2) # you will need to do this every time you open a new R session.

To get a snapshot of the data, run head(Boston). To find out more about the data set, type ?Boston.

We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.

a. How many rows (n) are in the data set? How many variables are in the data set? What does the variable lstat represent?

b. Fit a simple linear regression model with crim as the response and lstat as the predictor. Describe your results. What are the estimated coefficients from this model? Report them here.

Note: a simple linear regression is just a regression model with a single predictor.

DS 303: Homework 1, Fall 2021

c. Repeat this process for each predictor in the data set. That means for each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
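If helpful, the mechanics of fitting one simple regression per predictor can be sketched with a loop; this is only a scaffold (the object names are ours), and the interpretation and plots are still up to you:

```r
library(ISLR2)  # for the Boston data set

# Fit one simple linear regression per predictor and collect the slopes
predictors <- setdiff(names(Boston), "crim")
slr_slopes <- sapply(predictors, function(p) {
  fit <- lm(Boston$crim ~ Boston[[p]])
  coef(fit)[2]  # the estimated slope for this predictor
})
slr_slopes  # named vector of simple-regression slope estimates
```

The p-values for each slope come from summary() on each fitted model; the loop above only collects the point estimates.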

d. Fit a multiple regression model to predict the response using all of the predictors. You can do this with a single line of code:

lm(crim ~ ., data = Boston)

Summarize your results. For which predictors can we reject the null hypothesis H0: βj = 0?
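To read off the t-test for each coefficient, summary() on the fitted model is the standard route (a sketch; the object name is ours):

```r
library(ISLR2)

fit_all <- lm(crim ~ ., data = Boston)  # multiple regression on all predictors
summary(fit_all)  # the Pr(>|t|) column gives the p-value for each H0: beta_j = 0
```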

e. How do your results from (c) compare to your results from (d)? Create a table (or a plot) comparing the simple linear regression coefficients from (c) to the multiple regression coefficients from (d). Describe what you observe. How does this provide evidence that using many simple linear regression models is not sufficient compared to a multiple linear regression model?

f. First run set.seed(1) to ensure we all get the same values. Then split the Boston data set into a training set and a test set. On the training set, fit a multiple linear regression model to predict the response using all of the predictors. Report the training MSE and test MSE you obtain from this model.
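One way to sketch the split-and-evaluate mechanics in R; the 50/50 split fraction below is an assumption (the problem does not fix one), and the MSE is just the mean squared prediction error:

```r
library(ISLR2)

set.seed(1)  # so everyone gets the same split
n <- nrow(Boston)
train <- sample(1:n, size = floor(n / 2))  # 50/50 split is an assumption

fit <- lm(crim ~ ., data = Boston, subset = train)

# MSE = average squared difference between observed and predicted crim
train_mse <- mean((Boston$crim[train] - predict(fit, Boston[train, ]))^2)
test_mse  <- mean((Boston$crim[-train] - predict(fit, Boston[-train, ]))^2)
```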

g. On the training set you created in part (f), fit a multiple linear regression model to predict the response using only the predictors zn, indus, nox, dis, rad, ptratio, and medv. Report the training MSE and test MSE you obtain from this model. How do they compare to your results in part (f)? Are these results surprising, or are they what you expected?

### Problem 3: Properties of least squares estimators via simulations

Simulations are a very powerful tool data scientists use to deepen our understanding of model behavior and theory.

Let's pretend we know that the true underlying population regression line is as follows (this is almost never the case in real life):

Yi = 2 + 3 × X1i + 5 × log(X2i) + εi (i = 1, . . . , n), εi ∼ N(0, 1/2).

a. What are the true values for β0, β1, and β2?

b. Generate 100 observations Yi under this normal error model. You can use the following code to generate X1 and X2:

X1 = seq(0,10,length.out=100) # generates 100 equally spaced values from 0 to 10.
X2 = runif(100) # generates 100 uniform values.
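The generation step can then be completed by drawing the errors and applying the model; the error standard deviation below (sqrt(1/2)) is our reading of the model statement above, so adjust it if your class used a different value:

```r
set.seed(1)  # optional, for reproducibility

X1 <- seq(0, 10, length.out = 100)  # 100 equally spaced values from 0 to 10
X2 <- runif(100)                    # 100 Uniform(0, 1) values

# Errors from the normal error model; sd = sqrt(1/2) is our reading of N(0, 1/2)
eps <- rnorm(100, mean = 0, sd = sqrt(1/2))
Y <- 2 + 3 * X1 + 5 * log(X2) + eps
```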

c. Draw a scatterplot of X1 and Y and a scatterplot of X2 and Y. Describe what you observe.

d. Design a simple simulation to show that β̂1 is an unbiased estimator of β1.
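A minimal skeleton for this kind of simulation (the replication count B = 1000 is an arbitrary choice of ours): hold X1 and X2 fixed, regenerate the errors many times, refit the model, and collect the slope estimates:

```r
set.seed(1)
X1 <- seq(0, 10, length.out = 100)
X2 <- runif(100)

B <- 1000                 # number of simulated data sets (arbitrary choice)
beta1_hat <- numeric(B)
for (b in 1:B) {
  eps <- rnorm(100, mean = 0, sd = sqrt(1/2))  # our reading of the error model
  Y <- 2 + 3 * X1 + 5 * log(X2) + eps
  fit <- lm(Y ~ X1 + log(X2))                  # fit the correctly specified model
  beta1_hat[b] <- coef(fit)["X1"]
}

mean(beta1_hat)  # unbiasedness: this should be very close to the true beta1 = 3
```

The same skeleton, with `coef(fit)["log(X2)"]` collected instead, covers part (f).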


e. Plot a histogram of the sampling distribution of the β̂1's you generated. Add a vertical line to the plot showing β1 = 3.

f. Design a simple simulation to show that β̂2 is an unbiased estimator of β2.

g. Plot a histogram of the sampling distribution of the β̂2's you generated. Add a vertical line to the plot showing β2 = 5.
