Assignment #3 STA355H1S

Instructions: Solutions to problems 1 and 2 are to be submitted on Quercus (PDF files

only) – the deadline is 11:59pm on March 20. You are strongly encouraged to do problems

3 through 6 but these are not to be submitted for grading.

Problems to hand in:

1. Suppose that X1, · · · , Xn are independent Gamma random variables with pdf

f(x; λ, α) = λ^α x^(α−1) exp(−λx) / Γ(α) for x > 0

where λ > 0 and α > 0 are unknown parameters. Given X1 = x1, · · · , Xn = xn, the

likelihood function is

L(λ, α) = λ^(nα) (∏_{i=1}^n xi^(α−1)) exp(−λ ∑_{i=1}^n xi) / [Γ(α)]^n

(a) Assume the following prior distribution for (λ, α):

π(λ, α) = (1/10000) exp(−λ/100) exp(−α/100) for λ, α > 0

Given X1 = x1, · · · , Xn = xn, show that the posterior density of α is

π(α|x1, · · · , xn) = K(x1, · · · , xn) · [Γ(nα + 1)/[Γ(α)]^n] · exp(α ∑_{i=1}^n ln(xi) − α/100) · (1/100 + ∑_{i=1}^n xi)^(−(nα+1))

(b) Data on intervals (in hours) between failures of air conditioning units on ten Boeing aircraft are given (and analyzed) in Example T of Cox & Snell¹ (1981). These data (199 observations) are provided in a file on Quercus. Using the prior for (λ, α) in part (a), compute the posterior distribution for α.

Note: To compute the posterior density, you will need to compute the normalizing constant – on Quercus, I will give some suggestions on how to do this. A simple estimate of α is α̂ = x̄²/s², where x̄ and s² are the sample mean and variance, respectively. We would expect the posterior to be concentrated around this estimate.

¹Cox, D.R. and Snell, E.J. (1981). Applied Statistics: Principles and Examples. Chapman and Hall, New York.
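One way to carry out the computation in part (b) is to evaluate the unnormalized posterior from part (a) on a grid of α values, working on the log scale to avoid overflow, and then normalize numerically. The sketch below is an illustrative Python version (not the R suggestions promised on Quercus), and the simulated `x_data` is only a stand-in for the 199 failure intervals:

```python
import numpy as np
from scipy.special import gammaln

def log_post_unnorm(alpha, x):
    """Unnormalized log posterior of alpha from part (a), lambda integrated out."""
    n = len(x)
    s, ls = float(np.sum(x)), float(np.sum(np.log(x)))
    return (gammaln(n * alpha + 1) - n * gammaln(alpha)
            + alpha * ls - alpha / 100.0
            - (n * alpha + 1) * np.log(1.0 / 100.0 + s))

# Grid evaluation; x_data is simulated stand-in data, not the Quercus file.
rng = np.random.default_rng(1)
x_data = rng.gamma(shape=2.0, scale=1.0, size=50)
grid = np.linspace(0.2, 6.0, 1000)
lp = log_post_unnorm(grid, x_data)
post = np.exp(lp - lp.max())              # rescale before exponentiating
post /= post.sum() * (grid[1] - grid[0])  # numerical normalizing constant
```

The posterior mode should then sit near the simple estimate α̂ = x̄²/s² mentioned in the note.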

(c) [Bonus] We may be interested in “testing” whether an Exponential model is an appropriate model for the data. One approach to doing this is to put a prior probability θ on the

Exponential model (i.e. a Gamma model with α = 1) and prior probability 1−θ on the more

general Gamma model. This leads to a prior distribution on (λ, α) having a point mass θ

(which is a hyperparameter) at α = 1 so that

π(λ, 1) = (θ/100) exp(−λ/100) for λ > 0

and

π(λ, α) = ((1 − θ)/10000) exp(−λ/100) exp(−α/100) otherwise

(This type of prior is an example of a “spike-and-slab” prior used in Bayesian model selection.) We then have

P(α = 1|x1, · · · , xn) = ∫_0^∞ π(λ, 1)L(λ, 1) dλ / [ ∫_0^∞ π(λ, 1)L(λ, 1) dλ + ∫_0^∞ ∫_0^∞ π(λ, α)L(λ, α) dλ dα ].

For θ = 0.1, 0.2, 0.3, · · · , 0.9, evaluate P(α = 1|x1, · · · , xn). (This is easier than it looks as

much of the work has been done in part (b).)
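One way to organize the computation in part (c): both marginal likelihoods can be handled on the log scale, with the λ integrals done analytically (they are Gamma integrals) and the remaining α integral done numerically. The following Python sketch is illustrative only, with simulated data standing in for the air conditioning data and an arbitrary grid for the α integral:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_gamma(alpha, x):
    # log of the lambda-integral of L(lambda, alpha)*exp(-lambda/100),
    # i.e. the slab marginal up to the factor (1 - theta)/10000
    n, s, ls = len(x), float(np.sum(x)), float(np.sum(np.log(x)))
    return (gammaln(n * alpha + 1) - n * gammaln(alpha) + (alpha - 1) * ls
            - alpha / 100.0 - (n * alpha + 1) * np.log(s + 1.0 / 100.0))

def prob_alpha_is_1(theta, x):
    n, s = len(x), float(np.sum(x))
    # spike marginal: (theta/100) * Gamma(n + 1) / (s + 1/100)^(n + 1)
    log_m1 = (np.log(theta / 100.0) + gammaln(n + 1)
              - (n + 1) * np.log(s + 1.0 / 100.0))
    grid = np.linspace(1e-3, 10.0, 4000)   # numerical alpha-integral for the slab
    lg = log_marginal_gamma(grid, x)
    c = lg.max()
    log_m2 = (np.log((1.0 - theta) / 10000.0) + c
              + np.log(np.exp(lg - c).sum() * (grid[1] - grid[0])))
    m = max(log_m1, log_m2)
    return np.exp(log_m1 - m) / (np.exp(log_m1 - m) + np.exp(log_m2 - m))

rng = np.random.default_rng(2)
x_data = rng.exponential(scale=1.0, size=100)  # stand-in for the real data
probs = {round(th, 1): prob_alpha_is_1(th, x_data)
         for th in np.arange(0.1, 1.0, 0.1)}
```

Note that only the spike term depends on θ (linearly) and only the slab term depends on 1 − θ, so the α integral need only be computed once for all nine values of θ.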

2. Suppose that F is a continuous distribution with density f. The mode of the distribution

here is defined to be the global maximizer of the density function. (In other applications, it

is useful to think of modes as local maxima of the density function.)

In some applications (for example, when the data are “contaminated” with outliers), the

centre of the distribution is better described by the mode than by the mean or median;

however, unlike the mean and median, the mode turns out to be a difficult parameter to

estimate. There is a document on Quercus that discusses a few of the various methods for

estimating the mode.

The τ-shorth is the shortest interval that contains at least a fraction τ of the observations X1, · · · , Xn; it will have the form [X_(a), X_(b)] where b − a + 1 ≥ τn. A number of mode estimators are based on the observations lying in [X_(a), X_(b)]; for example, we can take the sample mean or the sample median of these observations. (Note that taking the sample mean of the observations in [X_(a), X_(b)] is like a trimmed mean although the trimming here is typically asymmetrical.) In this problem, we will consider estimating the mode using the midpoint of the interval, that is, µ̂ = (X_(a) + X_(b))/2. This estimator is called a Venter estimator. An R function to compute this estimator for a given value of τ is available on Quercus in a file venter.txt.
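The Venter estimator is simple to implement directly. The following stand-alone Python sketch (an illustrative version, not the venter.txt function from Quercus) finds the shortest interval containing m = ⌈τn⌉ order statistics and returns its midpoint:

```python
import numpy as np

def venter(x, tau):
    """Midpoint of the shortest interval [X_(a), X_(b)] containing
    at least a fraction tau of the sample (the tau-shorth)."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    m = int(np.ceil(tau * n))              # points the interval must contain
    widths = xs[m - 1:] - xs[:n - m + 1]   # widths of all candidate intervals
    a = int(np.argmin(widths))
    return (xs[a] + xs[a + m - 1]) / 2.0

# For a unimodal symmetric sample the estimate should sit near the mode (0 here).
mid = venter(np.random.default_rng(0).normal(size=2000), 0.25)
```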

(a) For the Hidalgo stamp thickness data considered in Assignment #2, compute Venter

estimates for various values of τ . How small does τ need to be in order that the estimate

“makes sense”? (Recall from Assignment #2 that the density seemed to have a number of

local maxima but one clear global maximum.)

(b) Suppose that the underlying density f is asymmetric and unimodal. The choice of τ for

the Venter estimator involves a bias-variance tradeoff: If τ is small then we should expect the

estimator to have a small bias but larger variance with the bias increasing and the variance

decreasing as τ increases.

Suppose that X1, · · · , Xn are independent Gamma random variables with density

f(x; α) = x^(α−1) exp(−x) / Γ(α)

where we will assume that α > 1; in this case, the mode is α − 1.

Using Monte Carlo simulation, estimate the MSE of the Venter estimator for τ = 0.5 and

τ = 0.1, sample sizes n = 100 and n = 1000, and shape parameters α = 2 and α = 10. (8

simulations in total.) For α = 2 and α = 10, which Venter estimator seems to be better (on

the basis of MSE)?
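A minimal Monte Carlo sketch for part (b) in Python, assuming a `venter` midpoint function as described above (repeated here so the sketch is self-contained); the replication count `nrep` is an arbitrary illustrative choice:

```python
import numpy as np

def venter(x, tau):
    # midpoint of the shortest interval holding ceil(tau*n) order statistics
    xs = np.sort(x)
    n = len(xs)
    m = int(np.ceil(tau * n))
    a = int(np.argmin(xs[m - 1:] - xs[:n - m + 1]))
    return (xs[a] + xs[a + m - 1]) / 2.0

def mse_venter(alpha, n, tau, nrep=500, seed=0):
    """Monte Carlo MSE of the Venter estimator of the Gamma(alpha) mode alpha - 1."""
    rng = np.random.default_rng(seed)
    est = np.array([venter(rng.gamma(alpha, size=n), tau) for _ in range(nrep)])
    return float(np.mean((est - (alpha - 1.0)) ** 2))

mse_tau_half = mse_venter(2.0, 100, 0.5, nrep=200)
mse_tau_tenth = mse_venter(2.0, 100, 0.1, nrep=200)
```

Running this over the eight (τ, n, α) combinations gives the table of MSEs asked for.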

(c) The density of the Venter estimator µ̂ under the Gamma model in part (b) can be estimated using the fact that µ̂/(X1 + · · · + Xn) and X1 + · · · + Xn are independent. (This fact follows from Basu's Theorem, which you may encounter in STA452/453.)

We can exploit this independence as follows: Define T = X1 + · · · + Xn; T has a Gamma distribution with shape parameter nα. Then

P(µ̂ ≤ x) = P(µ̂/T ≤ x/T)

= ∫_0^∞ P(µ̂/T ≤ x/T | T = t) · t^(nα−1) exp(−t)/Γ(nα) dt

= ∫_0^∞ P(µ̂/T ≤ x/t | T = t) · t^(nα−1) exp(−t)/Γ(nα) dt

= ∫_0^∞ P(µ̂/T ≤ x/t) · t^(nα−1) exp(−t)/Γ(nα) dt

where the final equality uses the independence of µ̂/T and T.

Thus given (µ̂1, T1), · · · , (µ̂N, TN), we can estimate

P(µ̂/T ≤ x/t)

by

(1/N) ∑_{i=1}^N I(µ̂i/Ti ≤ x/t) = (1/N) ∑_{i=1}^N I(t ≤ Ti x/µ̂i)

and so we can estimate P(µ̂ ≤ x) by

(1/N) ∑_{i=1}^N ∫_0^(xTi/µ̂i) t^(nα−1) exp(−t)/Γ(nα) dt

and differentiating, we obtain the density estimate

f̂(x) = (1/N) ∑_{i=1}^N (Ti/µ̂i) (Ti x/µ̂i)^(nα−1) exp(−xTi/µ̂i)/Γ(nα)

Using the simulation data in part (b) for n = 100, α = 2 and τ = 0.5, compute f̂(x).
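The estimate f̂ is just an average of N rescaled Gamma(nα) densities, one per simulated pair, and is best computed on the log scale since nα is large. An illustrative Python sketch, with synthetic (µ̂i, Ti) pairs standing in for the actual part (b) simulation output:

```python
import numpy as np
from scipy.special import gammaln

def fhat(xgrid, mu_hat, T, n, alpha):
    """Average over i of (Ti/mui)*(Ti*x/mui)^(n*alpha-1)*exp(-x*Ti/mui)/Gamma(n*alpha),
    computed on the log scale to avoid overflow."""
    r = (np.asarray(T, dtype=float) / np.asarray(mu_hat, dtype=float))[:, None]
    xg = np.asarray(xgrid, dtype=float)[None, :]
    k = n * alpha
    logf = np.log(r) + (k - 1.0) * np.log(r * xg) - r * xg - gammaln(k)
    return np.exp(logf).mean(axis=0)

# Synthetic stand-ins: in practice (mu_hat, T) come from the part (b) runs.
rng = np.random.default_rng(3)
n, alpha = 100, 2.0
T = rng.gamma(n * alpha, size=200)         # T = X1 + ... + Xn ~ Gamma(n*alpha)
mu_hat = rng.normal(1.0, 0.05, size=200)   # fake Venter estimates near the mode 1
xs = np.linspace(0.5, 1.6, 400)
dens = fhat(xs, mu_hat, T, n, alpha)
```

Since each summand is a genuine density in x, f̂ integrates to 1 automatically.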

Supplemental problems (not to hand in):

3. Suppose that X1, · · · , Xn are independent random variables with density or mass function f(x; θ) and suppose that we estimate θ using the maximum likelihood estimator θ̂; we estimate its standard error using the observed Fisher information estimator

se(θ̂) = { −∑_{i=1}^n ℓ″(Xi; θ̂) }^(−1/2)

where ℓ′(x; θ), ℓ″(x; θ) are the first two partial derivatives of ln f(x; θ) with respect to θ.

Alternatively, we could use the jackknife to estimate the standard error of θ̂; if our model is correct then we would expect (hope) that the two estimates are similar. In order to investigate this, we need to be able to get a good approximation to the “leave-one-out” estimators {θ̂_{−i}}.

(a) Show that θ̂_{−i} satisfies the equation

ℓ′(Xi; θ̂_{−i}) = ∑_{j=1}^n ℓ′(Xj; θ̂_{−i}).

(b) Expand the right hand side in (a) in a Taylor series around θ̂ to show that

θ̂_{−i} − θ̂ ≈ ℓ′(Xi; θ̂_{−i}) / ∑_{j=1}^n ℓ″(Xj; θ̂) ≈ ℓ′(Xi; θ̂) / ∑_{j=1}^n ℓ″(Xj; θ̂)

and so

θ̂• = (1/n) ∑_{i=1}^n θ̂_{−i} ≈ θ̂.

(You should try to think about the magnitude of the approximation error but a rigorous

proof is not required.)

(c) Use the results of part (b) to derive an approximation for the jackknife estimator of the

standard error. Comment on the differences between the two estimators – in particular, why

is there a difference? (Hint: What type of model – parametric or non-parametric – are we

assuming for the two standard error estimators?)

(d) For the air conditioning data considered in Assignment #1, compute the two standard

error estimates for the parameter λ in the Exponential model (f(x; λ) = λ exp(−λx) for

x ≥ 0). Do these two estimates tell you anything about how well the Exponential model fits

the data?
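For the Exponential model in (d), both estimates have simple forms: ℓ″(x; λ) = −1/λ², so the observed-information standard error is λ̂/√n with λ̂ = 1/x̄, and the leave-one-out MLEs are available exactly, so no Taylor approximation is needed. An illustrative Python sketch (the air conditioning data itself is on Quercus; simulated data is used here):

```python
import numpy as np

def exp_se_estimates(x):
    """MLE lambda_hat = 1/xbar with (i) observed-information and (ii) jackknife SEs.
    Uses l''(x; lambda) = -1/lambda^2 for the Exponential log-likelihood."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    lam_hat = 1.0 / x.mean()
    se_fisher = lam_hat / np.sqrt(n)         # {n / lambda_hat^2}^(-1/2)
    loo = (n - 1) / (x.sum() - x)            # exact leave-one-out MLEs
    se_jack = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
    return lam_hat, se_fisher, se_jack

rng = np.random.default_rng(5)
x_sim = rng.exponential(scale=2.0, size=200)  # simulated stand-in, true lambda = 0.5
lam_hat, se_fisher, se_jack = exp_se_estimates(x_sim)
```

When the Exponential model holds, the two standard errors should be close; a large discrepancy is a hint of model misspecification.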

4. Suppose that X1, · · · , Xn are independent continuous random variables with density f(x; θ) where θ is real-valued. We are often not able to observe the Xi's exactly; rather, we only know whether they belong to some region Bk (k = 1, · · · , m). An example of this is interval censoring in survival analysis, where we are unable to observe an exact failure time but know that the failure occurs in some finite time interval. Intuitively, we should be able to estimate θ more efficiently with the actual values of {Xi}; in this problem, we will show that this is true (at least) for MLEs.

Assume that B1, · · · , Bm are disjoint sets such that P(Xi ∈ ∪_{k=1}^m Bk) = 1. Define independent discrete random variables Y1, · · · , Yn where Yi = k if Xi ∈ Bk; the probability mass function of Yi is

p(k; θ) = Pθ(Xi ∈ Bk) = ∫_{Bk} f(x; θ) dx for k = 1, · · · , m.

Also define

I_X(θ) = Var_θ[∂/∂θ ln f(Xi; θ)] = ∫_{−∞}^∞ [∂/∂θ ln f(x; θ)]² f(x; θ) dx

I_Y(θ) = Var_θ[∂/∂θ ln p(Yi; θ)] = ∑_{k=1}^m [∂/∂θ ln p(k; θ)]² p(k; θ).

Under the standard MLE regularity conditions, the MLE of θ based on X1, · · · , Xn will have variance approximately 1/{nI_X(θ)} while the MLE based on Y1, · · · , Yn will have variance approximately 1/{nI_Y(θ)}.

(a) Assume the usual regularity conditions for f(x; θ), in particular, that f(x; θ) can be

differentiated with respect to θ inside integral signs with impunity! Show that IX(θ) ≥ IY (θ)

and indicate under what conditions there will be strict inequality.

Hints: (i) f(x; θ)/p(k; θ) is a density function on Bk.

(ii) For any function g,

∫_{−∞}^∞ g(x)f(x; θ) dx = ∑_{k=1}^m p(k; θ) ∫_{Bk} g(x) [f(x; θ)/p(k; θ)] dx.

(iii) For any random variable U, E(U²) ≥ [E(U)]² with strict inequality unless U is constant.

(b) Under what conditions on B1, · · · , Bm will IX(θ) ≈ IY (θ)?
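As a concrete numerical check of parts (a) and (b), consider the Exponential density f(x; θ) = θ exp(−θx) binned into intervals [kδ, (k+1)δ): here I_X(θ) = 1/θ², and I_Y(θ) can be summed directly from p(k; θ) = exp(−θkδ){1 − exp(−θδ)}. This Python sketch is illustrative, with the bin width δ and the truncation point kmax chosen arbitrarily:

```python
import numpy as np

def info_exponential_binned(theta, delta, kmax=2000):
    """I_Y(theta) for Exponential(theta) data observed only through the bins
    [k*delta, (k+1)*delta), k = 0, 1, ...; truncated at kmax bins."""
    k = np.arange(kmax)
    p = np.exp(-theta * k * delta) * (1.0 - np.exp(-theta * delta))
    # d/dtheta ln p_k = -k*delta + delta*exp(-theta*delta)/(1 - exp(-theta*delta))
    score = (-k * delta
             + delta * np.exp(-theta * delta) / (1.0 - np.exp(-theta * delta)))
    return float(np.sum(p * score ** 2))

# For theta = 1, I_X = 1; coarse bins lose information, fine bins recover it.
coarse = info_exponential_binned(1.0, delta=2.0)
fine = info_exponential_binned(1.0, delta=0.01)
```

The pattern matches part (b): as the partition {Bk} becomes finer, I_Y(θ) climbs toward I_X(θ).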

5. In seismology, the Gutenberg-Richter law states that, in a given region, the number of earthquakes N greater than a certain magnitude m satisfies the relationship

log10(N) = a − b × m

for some constants a and b; the parameter b is called the b-value and characterizes the seismic activity in a region. The Gutenberg-Richter law can be used to predict the probability of large earthquakes, although it is a very crude instrument. On Quercus, there is a file containing magnitudes for 433 earthquakes in California of magnitude (rounded to the nearest tenth) 5.0 and greater from 1932–1992.

(a) If we have earthquakes of (exact) magnitudes M1, · · · , Mn greater than some known m0, the Gutenberg-Richter law suggests that M1, · · · , Mn can be modeled as independent random variables with a shifted Exponential density

f(x; β) = β exp(−β(x − m0)) for x ≥ m0

where β = b × ln(10) and m0 is assumed known. However, if the magnitudes are rounded to the nearest δ then they are effectively discrete random variables taking values xk = m0 + δ/2 + kδ for k = 0, 1, 2, · · · with probability mass function

p(xk; β) = P(m0 + kδ ≤ M < m0 + (k + 1)δ)
= exp(−βkδ) − exp(−β(k + 1)δ)
= exp(−β(xk − m0 − δ/2)) {1 − exp(−βδ)} for k = 0, 1, 2, · · ·

If X1, · · · , Xn are the rounded magnitudes, find the MLE of β. (There is a closed-form expression for the MLE in terms of the sample mean of X1, · · · , Xn.)

(b) Compute the MLE of β for the earthquake data (using m0 = 4.95 and δ = 0.1) as well

as estimates of its standard error using (i) the Fisher information and (ii) the jackknife. Use

these to construct approximate 95% confidence intervals for β. How similar are the intervals?
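One way to check the closed-form answer in (a) without giving it away is to maximize the discretized log-likelihood ∑_{i=1}^n [−β(Xi − m0 − δ/2) + ln(1 − exp(−βδ))] numerically. The Python sketch below is illustrative, using simulated rounded magnitudes in place of the Quercus data (with m0 = 4.95 and δ = 0.1 as in part (b), and a true b-value of 1 so β = ln(10) ≈ 2.30):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def beta_mle_rounded(x, m0, delta):
    """MLE of beta for magnitudes rounded to the grid m0 + delta/2 + k*delta,
    found by numerically minimizing the negative log-likelihood."""
    x = np.asarray(x, dtype=float)
    def negll(beta):
        return (beta * np.sum(x - m0 - delta / 2)
                - len(x) * np.log1p(-np.exp(-beta * delta)))
    return minimize_scalar(negll, bounds=(1e-6, 50.0), method="bounded").x

# Simulated check: exact magnitudes from the shifted Exponential, then rounded.
rng = np.random.default_rng(4)
m0, delta, beta_true = 4.95, 0.1, np.log(10.0)
m = m0 + rng.exponential(1.0 / beta_true, size=5000)  # exact magnitudes
k = np.floor((m - m0) / delta)
x = m0 + delta / 2 + k * delta                        # rounded magnitudes
beta_hat = beta_mle_rounded(x, m0, delta)
```

The numerical maximizer should agree with your closed-form expression evaluated at the same sample mean.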

6. In genetics, the Hardy-Weinberg equilibrium model characterizes the distributions of

genotype frequencies in populations that are not evolving and is a fundamental model of

population genetics. In particular, for a genetic locus with two alleles A and a, the frequencies

of the genotypes AA, Aa, and aa are

Pθ(genotype = AA) = θ², Pθ(genotype = Aa) = 2θ(1 − θ), and Pθ(genotype = aa) = (1 − θ)²

where θ, the frequency of the allele A in the population, is unknown.

In a sample of n individuals, suppose we observe XAA = x1, XAa = x2, and Xaa = x3

individuals with genotypes AA, Aa, and aa, respectively. Then the likelihood function is

L(θ) = {θ²}^(x1) {2θ(1 − θ)}^(x2) {(1 − θ)²}^(x3).

(The likelihood function follows from the fact that we can write

Pθ(genotype = g) = {θ²}^I(g=AA) {2θ(1 − θ)}^I(g=Aa) {(1 − θ)²}^I(g=aa);

multiplying these together over the n individuals in the sample gives the likelihood function.)

(a) Find the MLE of θ and give an estimator of its standard error using the observed Fisher

information.

(b) A useful family of prior distributions for θ is the Beta family:

π(θ) = [Γ(α + β)/(Γ(α)Γ(β))] θ^(α−1) (1 − θ)^(β−1) for 0 ≤ θ ≤ 1

where α > 0 and β > 0 are hyperparameters. What is the posterior distribution of θ given

XAA = x1, XAa = x2, and Xaa = x3?
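As a quick orientation to the Beta family before attacking (b): the density integrates to 1 over [0, 1] and has mean α/(α + β), so the hyperparameters control the location and concentration of the prior. The hyperparameter values in this Python sketch are arbitrary illustrations, not recommendations:

```python
import numpy as np
from scipy.stats import beta as beta_dist

# Evaluate a few (arbitrary) members of the Beta family on a grid.
grid = np.linspace(1e-4, 1.0 - 1e-4, 2001)
integrals = {}
for a, b in [(1.0, 1.0), (2.0, 2.0), (8.0, 2.0)]:
    dens = beta_dist.pdf(grid, a, b)
    integrals[(a, b)] = float(dens.sum() * (grid[1] - grid[0]))  # should be ~1
    print(f"Beta({a}, {b}): prior mean = {a / (a + b):.3f}")
```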
