## Description

SCHOOL OF MATHEMATICS AND STATISTICS

MATH3821 Statistical Modelling and Computing

Assignment One

Number of exercises: 5 (one per page)

INSTRUCTIONS: This assignment is to be done by a group of at most 5 students. The same mark will be given to each student within the group, unless I have good reasons to believe that somebody did not contribute appropriately. It is strongly advised that you use the RStudio software and its File/New File/R Markdown.../PDF capability to produce a PDF file that you will submit on Moodle (see instructions on Moodle close to the due date). The computing language you will be using is called R Markdown (see the first few lessons starting at https://rmarkdown.rstudio.com/lesson-1.html for a quick introduction). For typesetting mathematical formulae, you will need to have a distribution of the LaTeX software installed on your computer (e.g., TeX Live or MiKTeX). Microsoft Windows and Unix users might consider using the install_tinytex() function from the R package tinytex. Another option (Microsoft Windows only) is the install.MikTeX() function from the installr R package. MacOS users should consider installing the MacTeX software directly; this is not an R package (see https://www.tug.org/mactex/).
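As a rough sketch of the setup step described above, the helper below installs the tinytex R package if it is missing and then installs the TinyTeX distribution only when it is not already the one in use. The function name `setup_latex` is hypothetical (it is not part of any package), an internet connection is assumed, and the behaviour when another LaTeX distribution is already present may differ from what is shown here.

```r
# Hypothetical helper: one-off LaTeX setup via the tinytex package.
setup_latex <- function() {
  if (!requireNamespace("tinytex", quietly = TRUE)) {
    install.packages("tinytex")
  }
  # is_tinytex() is TRUE when the LaTeX distribution in use is TinyTeX
  if (!tinytex::is_tinytex()) {
    tinytex::install_tinytex()
  }
  invisible(tinytex::is_tinytex())
}

# Run once from the R console, not inside the assignment document:
# setup_latex()
```

The call is left commented out so that knitting a document never triggers an installation by accident.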

Only one student in the group should submit the PDF file, with the names of the other students in the group clearly indicated in the document.

We declare that this assessment item is our own work, except where acknowledged, and has not been submitted for academic credit elsewhere. We acknowledge that the assessor of this item may, for the purpose of assessing this item, reproduce this assessment item and provide a copy to another member of the University; and/or communicate a copy of this assessment item to a plagiarism checking service (which may then retain a copy of the assessment item on its database for the purpose of future plagiarism checking). We certify that we have read and understood the University Rules in respect of Student Academic Misconduct.

| Name | Student No | Signature | Date |
|------|------------|-----------|------|
|      |            |           |      |

Question One

Recall the Simple Linear Regression (SLR) model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$.

(a) Show that the SLR model can be expressed in the following form:

$$Y_i = \alpha + \beta_1 (x_i - \bar{x}) + \epsilon_i.$$

(b) Provide an interpretation for the parameter $\alpha$.

(c) Find closed-form formulae for the least squares parameter estimates $\hat{\alpha}$ and $\hat{\beta}_1$.

(d) Find the variances of the estimates $\hat{\alpha}$ and $\hat{\beta}_1$ and the covariance between them, $\mathrm{Cov}(\hat{\alpha}, \hat{\beta}_1)$.

(e) Using the method of gradient descent with $\gamma = 0.00001$, find the estimates for the following simulated data:

    set.seed(1234567)
    x = runif(1000)
    eps = rnorm(1000)
    y = 5 + 10*x + eps

Start the algorithm at the initial value $(\alpha^{[0]}, \beta_1^{[0]}) = (0, 0)$. Use the convergence criterion that the $L_2$ norm of the score is less than 0.00001. Show that the results are comparable to the closed-form solutions found in (c). Report the number of iterations required. Make sure that you provide all the workings/derivations that are needed to implement the above algorithm.
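To illustrate the general shape of such an algorithm, here is a minimal sketch of gradient descent for the centred model on a small toy data set. Everything in it is an assumption made for illustration: the data, the step size `gamma = 0.005`, the tolerance, and the function names `grad_rss` and `gradient_descent`. It also minimises the residual sum of squares, whose gradient differs from the normal-model score only by the factor $-1/(2\sigma^2)$, so the stopping rule is not identical to the one stated in the exercise.

```r
set.seed(1)                      # toy data, not the assignment's
n  <- 50
x  <- runif(n)
y  <- 2 + 3 * x + rnorm(n, sd = 0.5)
xc <- x - mean(x)                # centred covariate

# Gradient of RSS(alpha, beta) = sum((y - alpha - beta * xc)^2)
grad_rss <- function(theta) {
  r <- y - theta[1] - theta[2] * xc
  c(-2 * sum(r), -2 * sum(r * xc))
}

gradient_descent <- function(theta0, gamma, tol = 1e-6, max_iter = 1e5) {
  theta <- theta0
  for (iter in seq_len(max_iter)) {
    g <- grad_rss(theta)
    # Stop when the L2 norm of the gradient is below the tolerance
    if (sqrt(sum(g^2)) < tol) {
      return(list(theta = theta, iterations = iter - 1))
    }
    theta <- theta - gamma * g   # move against the gradient
  }
  stop("no convergence within max_iter")
}

fit_gd <- gradient_descent(c(0, 0), gamma = 0.005)
fit_lm <- coef(lm(y ~ xc))       # closed-form least squares fit
fit_gd$theta
fit_lm
```

Comparing `fit_gd$theta` with `fit_lm` is a quick sanity check; a step size that is too large for the data makes the iterates diverge, which is why `gamma` has to be chosen with the sample size in mind.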

(f) Plot the data and include the fitted regression line.

(g) Using the Newton-Raphson method, find the estimates for the same simulated data. Use the same initial value $(\alpha^{[0]}, \beta_1^{[0]}) = (0, 0)$ and convergence criterion as in (e). Did it take more or fewer iterations than in part (e)? Why? Make sure that you provide all the workings/derivations that are needed to implement the above algorithm.
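For orientation, the generic Newton-Raphson update for minimising a twice-differentiable objective can be written as below. The symbols $S$ (the objective), $H$ (its Hessian) and $\theta$ (the parameter vector) are notation assumed here, not taken from the course notes.

```latex
\theta^{[k+1]} = \theta^{[k]} - H\left(\theta^{[k]}\right)^{-1} \nabla S\left(\theta^{[k]}\right),
\qquad \theta = (\alpha, \beta_1)^T .
```

Unlike gradient descent, the step is scaled by the inverse Hessian rather than by a fixed constant $\gamma$.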

Question Two

Consider $n$ independent binary random variables $Y_1, \ldots, Y_n$ with

$$P(Y_i = 1) = \pi_i \quad \text{and} \quad P(Y_i = 0) = 1 - \pi_i.$$

The probability function of $Y_i$ is

$$\pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i},$$

where $Y_i = 0$ or $1$.

(a) Show that this probability function belongs to the exponential family of distributions.

(b) Show that the natural parameter is

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right).$$

(c) Show that $E(Y_i) = \pi_i$ using the cumulant generator $c(\theta)$ in the definition of the exponential family.

(d) Suppose the link function is

$$g(\pi) = \log\left(\frac{\pi}{1 - \pi}\right) = x^T \beta.$$

Show that this is equivalent to modelling the probability $\pi$ as

$$\pi = \frac{e^{x^T \beta}}{1 + e^{x^T \beta}}.$$
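The two expressions above can also be checked numerically. The short sketch below uses a few arbitrary values of the linear predictor (the vector `eta` is a placeholder, not data from the assignment) and compares the explicit formulae with base R's `plogis()` and `qlogis()`, which implement the inverse logit and the logit respectively.

```r
# Hypothetical linear-predictor values, chosen only for illustration
eta <- c(-3, -1, 0, 0.5, 2)

# Inverse logit written out explicitly
pi_manual <- exp(eta) / (1 + exp(eta))

# Logit applied to the result should recover eta
eta_back <- log(pi_manual / (1 - pi_manual))

# Base R equivalents
pi_builtin <- plogis(eta)

all.equal(pi_manual, pi_builtin)
all.equal(eta_back, eta)
all.equal(qlogis(pi_builtin), eta)
```

A numerical check of this kind is not a proof, but it is a useful guard against algebra slips.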

(e) Sketch the graph of $\pi$ against $x$ for the particular case $x^T \beta = \beta_1 + \beta_2 x$, where $\beta_1$ and $\beta_2$ are constants. How would you interpret this graph if $x$ is the dose of an insecticide and $\pi$ is the probability of an insect dying?

(f) Does the following probability density function

$$f(y; \theta) = \frac{1}{\phi} \exp\left[\frac{y - \theta}{\phi} - \exp\left(\frac{y - \theta}{\phi}\right)\right],$$

where $\phi > 0$ is regarded as a nuisance parameter, belong to the exponential family?

Question Three

The Titanic was a British luxury passenger liner that sank when it struck an iceberg about 640 km south of Newfoundland on April 14–15, 1912, on its maiden voyage to New York City from Southampton, England. The data in the file titanic.txt (available from the assignment section on Moodle) classify the people on board the ship according to their Sex, Age, and Class (first, second, or third).

(a) Read the file titanic.txt (see Moodle) into a variable called titanic. Display the first six lines of titanic and then provide a summary of the variables in the dataset using summary.
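The layout of titanic.txt is not described here, so the following is only a sketch under the assumption that it is a whitespace-delimited text file with a header row. To stay self-contained it writes a tiny stand-in file first; the column names and the three rows are placeholders, not the real data, and `demo` is a hypothetical variable name.

```r
# Write a tiny stand-in file so the example runs on its own.
# Column names and values are placeholders, not the real data.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Name PClass Age Sex Survived",
  "A 1st 29 female 1",
  "B 2nd 35 male 0",
  "C 3rd NA male 0"
), tmp)

# header = TRUE takes variable names from the first line;
# the default separator is any run of whitespace.
demo <- read.table(tmp, header = TRUE, stringsAsFactors = TRUE)

head(demo)      # first six rows (only three exist here)
summary(demo)   # per-variable summary, including NA counts
```

If the real file uses another delimiter (for example tabs or commas), the `sep` argument of `read.table()` would need to be set accordingly.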

(b) Compute the number of men and women on the Titanic. Calculate the survival rates for each sex. Conduct a test of whether the survival rates for men and women are the same against the alternative that they are different. What are the hypotheses, test statistic, p-value and conclusion of the test?
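One common way to compare two proportions in R is `prop.test()`. The sketch below runs it on simulated stand-in data (the groups "A" and "B", the sample size and the success probabilities are all invented for illustration) and is not necessarily the test intended by the course.

```r
set.seed(2024)
# Simulated stand-in data: group labels and a binary outcome
group   <- factor(sample(c("A", "B"), 300, replace = TRUE))
outcome <- rbinom(300, 1, ifelse(group == "A", 0.6, 0.3))

counts <- table(group, outcome)           # rows: groups, columns: 0/1
rates  <- prop.table(counts, margin = 1)  # row-wise proportions

# Two-sided test of H0: the two success proportions are equal.
# prop.test() takes the success counts and the group sizes.
res <- prop.test(x = counts[, "1"], n = rowSums(counts))
res$statistic
res$p.value
```

By default `prop.test()` applies a continuity correction; passing `correct = FALSE` gives the uncorrected chi-squared statistic.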

(c) Fit a logistic regression model with response Survived and predictor Age, and provide an interpretation for the fitted coefficient for Age using the odds ratio with a factor change and a standardized factor change in the variable Age.

(d) Plot the graph of Survived versus Age. Then add both a fitted logistic curve and a loess smoother to the graph. Explain what the differences are between these two fits. Fit again, but this time add a quadratic term in Age. Does the fitted curve now match the smoother more accurately? Provide all plots in a single graph, with correctly defined labels, titles and a legend.

(e) Use the method of scoring to compute an estimate of the parameters of the logistic regression model with Survived as the response and Age and a quadratic term in Age as explanatory variables, and provide your R code for it. You must also present the calculations that you used to come up with your algorithm.
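As a generic illustration of the method of scoring for a logistic model, the sketch below fits an intercept, a linear and a quadratic term to simulated data and compares the result with `glm()`. The data, the starting value of zero, the tolerance and the function name `fisher_scoring` are assumptions made for this example; it is a sketch of the technique, not the course's reference implementation.

```r
set.seed(42)
n  <- 200
z  <- rnorm(n)                      # simulated covariate
X  <- cbind(1, z, z^2)              # intercept, linear, quadratic
yb <- rbinom(n, 1, plogis(-0.5 + z - 0.3 * z^2))

fisher_scoring <- function(X, y, tol = 1e-10, max_iter = 50) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    p    <- plogis(drop(X %*% beta))          # fitted probabilities
    U    <- crossprod(X, y - p)               # score vector X'(y - p)
    info <- crossprod(X, X * (p * (1 - p)))   # information X'WX
    step <- drop(solve(info, U))              # scoring step I^{-1} U
    beta <- beta + step
    if (sqrt(sum(step^2)) < tol) break        # stop on a negligible step
  }
  list(beta = beta, iterations = iter)
}

fit_fs  <- fisher_scoring(X, yb)
fit_glm <- glm(yb ~ z + I(z^2), family = binomial)
max(abs(fit_fs$beta - coef(fit_glm)))   # should be close to zero
```

For the logit link the observed and expected information coincide, so this update is the same as a Newton-Raphson step.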

(f) Check that, using the code in (e), you obtain estimates of the coefficients numerically close to the ones given by the glm() function.

(g) Write, and provide, R code to compute an estimate of the variance-covariance matrix of the corresponding estimators (using the first approach presented in the slide entitled “Estimation of the variance” in Chapter 2).

(h) Check the numerical closeness of the result obtained using your code from (g) to the one you get when using the vcov() function.

(i) Fit the logistic regression model with terms for an intercept, Age, $\text{Age}^2$, Sex, and PClass. Obtain tests on the basis of the deviance for adding each of the terms to a mean function that already includes the other terms (in the order given above), and summarize the results of each of the tests via a p-value and a one-sentence summary of the results.
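Deviance-based tests of this kind can be obtained in R with `anova()` on a fitted `glm` object. The sketch below uses simulated data with made-up variables `age` and `grp`, so the numbers it prints say nothing about the Titanic data.

```r
set.seed(7)
n   <- 300
age <- runif(n, 1, 70)                                  # simulated covariate
grp <- factor(sample(c("a", "b"), n, replace = TRUE))   # simulated factor
y   <- rbinom(n, 1, plogis(0.5 - 0.03 * age + 1.2 * (grp == "b")))

fit <- glm(y ~ age + I(age^2) + grp, family = binomial)

# Sequential analysis of deviance: each row tests the term added
# to a model that already contains all terms listed above it.
aod <- anova(fit, test = "Chisq")
aod
```

The rows follow the order of the terms in the model formula, so the formula should list the terms in the order in which they are meant to be added. If each term is instead to be tested given all the others, `drop1(fit, test = "Chisq")` is an alternative.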

(j) Provide a plot that illustrates the relationship between Age, Sex and survival rate. Make sure that you include a title and a legend.

Question Four

In this question we will examine binomial response data. Consider the single response $Y$ with $Y \sim \text{binomial}(n, \pi)$.

(a) Find the Wald statistic $(\hat{\pi} - \pi) I(\pi) (\hat{\pi} - \pi)$, where $\hat{\pi}$ is the maximum likelihood estimator of $\pi$ and $I(\pi)$ is the information.

(b) Verify that the Wald statistic is the same as the score statistic $U^T I(\pi)^{-1} U$ in this case.

(c) Find the deviance

$$2\left[\log L(\hat{\pi}; y) - \log L(\pi; y)\right].$$

(d) For large samples, both the Wald/score statistic and the deviance approximately have the $\chi^2(1)$ distribution. For $n = 10$ and $y = 3$, use both statistics to assess the adequacy of the models:

(i) $\pi = 0.1$;

(ii) $\pi = 0.3$;

(iii) $\pi = 0.5$.

Do the two statistics lead to the same conclusions?
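Once a statistic has been computed, comparing it with the $\chi^2(1)$ distribution takes two base R calls. In the sketch below, `stat` is a placeholder value chosen for illustration, not a result from this exercise, and the 5% level is an assumed significance level.

```r
stat <- 2.5   # placeholder value of a statistic, not from the exercise

crit <- qchisq(0.95, df = 1)                      # 5% critical value
pval <- pchisq(stat, df = 1, lower.tail = FALSE)  # upper-tail p-value

stat > crit   # TRUE would indicate rejection at the 5% level
pval
```

Using `lower.tail = FALSE` avoids the loss of precision that `1 - pchisq(stat, 1)` can suffer for large statistics.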

(e) Give the three parts of the GLM for the binomial regression model with a fixed number of trials:

• state the law of $Y$;

• prove it is a member of the exponential family;

• give the parameters (notably the mean $\mu_i$) and the canonical link function.
