STA442 Homework 1

Question 1: Flies

In this experiment, we want to investigate the effect of thorax length and mating activity has

on the expected lifespan of fruitflies. Fruitflies are placed in solitary, with one virgin female

and eight virgin females. The experiment also contains two control groups, which are kept

with either one pregnant female or eight pregnant females. The thorax lengths of fruitflies are

also measured since it is known to affect the lifespan of fruitflies.

We use Gamma GLM to model the lifetimes as a function of the thorax length and mating

activity. In the model, we have centred the predictor variable of thorax length by subtraction

of 0.84 from every value in the variable. We have also scaled the variables using exp().

We get the estimated parameters below:

Fig 1. Estimated Parameters of the Gamma GLM

From Figure 1, we can see that:

● For the increase of every one unit length of thorax length, there will be an increase of

approximately 14.70 days.

● Comparing flies kept in isolation, the flies kept with one virgin female have their

lifespan reduced by 11%, whereas the flies kept with eight virgin females have their

lifespan reduced by 34%.

I attach the modelled and empirical distribution below, to indicate the Gamma GLM is a good

fit for the data:

Fig 2. Empirical Distribution of the Data and the Model Fit

In conclusion, the lifespan of fruitflies is indeed affected by the presence and frequency of

their mating activities. Fruitflies with no mating activities have a higher lifespan comparing to

fruitflies with mating activities. Fruitflies with a lower frequency of mating activities also have

a higher lifespan comparing wit fruitflies with a higher frequency of mating activities.

Question 2: Smoking

Summary

We have analyzed the 2014 American National Youth Tobacco survey, to see if race and

gender are factors contributing to the increase in the use of substances such as chewing

tobacco and hookah in the American youth population. From the analysis, we have found

that race is a significant contributing factor in the use of chewing tobacco. The regular use of

chewing tobacco, snuff or dip is around 2 times and 5 times more common amongst

Americans of European ancestry than Hispanic Americans and African Americans

respectively. On the other hand, gender is not a significant contributing factor in the use of

hookah. The use of Hookah among women is only 4% more than men.

Introduction

In this report, we will be analyzing the 2014 American National Youth Tobacco survey, to

investigate if demographics have effects on the smoking habits of the American population.

The first topic we want to investigate is whether race is a factor contributing to the regular

use of chewing tobacco, snuff or dip, regular use meaning the youths have used these

substances on 1 or more days in the past 30 days. We are especially interested in seeing

the comparison between Americans or European ancestry, Hispanic-Americans and

African-Americans. The second topic we want to investigate is whether gender is a factor

affecting the likelihood of having used a hookah or waterpipe on at least one occasion, given

the other demographic characteristics are similar.

Methods

We will be using the logistic regression in this research, as in both topics, we want to model

probabilities. For the variables in our model, we will be including age, gender, race and living

area (urban or rural) of the youth.

In the first topic, the “Odds” refers to the odds of regular use of chewing tobacco, snuff or

dip. Since we want to see if the race is a significant factor contributing to that odds, the null

hypothesis of this topic is:

In the second topic, the “Odds” refers to the odds of having used a hookah or waterpipe on

at least one occasion. Given that all demographic characteristics except gender are similar,

and we want to see if gender is a significant factor contributing to that odds, the null

hypothesis of this topic is:

Then, using the likelihood ratio test, we compare the likelihood of the data under the full

model against the likelihood of the data under a model without race or gender respectively.

From that, we can test whether the observed difference in model fit is statistically significant.

Results

Fig 3. Likelihood Ratio Test for Regular Use of Chewing Tobacco, Snuff or Dip

From the LRT in Figure 3, given the p-value <0.05, we can reject the null hypothesis, and

conclude that race is a significant factor contributing to the odd of regular use of chewing

tobacco.

Fig 4. Odds of Regular Use of Chewing Tobacco, Snuff or Dip

To compare the regular use of these substances amongst Americans of European ancestry,

Hispanic-Americans and African-Americans, we can look at the coefficients of our model in

Figure 4. The odds of regular use of these substances among Hispanic-Americans and

African-Americans are 0.22 and 0.48 times, the odds of that among Americans of European

ancestry respectively. It also seems that youth living in rural areas tends to use these

substances more regularly than youth living in the urban area.

Fig 5. Likelihood Ratio Test for Having Used Hookah or Waterpipe

From the LRT in Figure 4, given the p-value 0.05, which is statistically not significant, and

hence we cannot reject the null hypothesis. Hence it seems that the reduced model without

the gender variable is adequate in fitting the data.

Fig 6. Odds of Having Used Hookah or Waterpipe

From Figure 6, we can see that the odds of having used Hookah or Waterpipe for a female is

1.04 times than that of a male, given all other demographic characteristics stay the same.

But again given p-value 0.05, we cannot determine whether or not gender is a significant

factor contributing to the odds of having used hookah or water pipes.

STA442 Homework 1 Appendix

Question 1

data(fruitfly, package=faraway)

throaxC = fruitfly$thorax – 0.84 # Use quantile(fruitfly$thorax) to center variable

mod1 = glm(longevity ~ throaxC + activity, family=Gamma(link = log), data=fruitfly)

mod1coeff = summary(mod1)$coef

mod1coeff[,1] = round(exp(mod1coeff[,1]),2) # Use to rescale variables

knitr::kable(rbind(mod1coeff,shape=c(1/summary(mod1)$dispersion, NA, NA, NA)), digits=4,

caption = “Estimated Parameters of the Gamma GLM”)

Table 1: Estimated Parameters of the Gamma GLM

Estimate Std. Error t value Pr(|t|)

(Intercept) 63.1200 0.0377 109.9177 0.0000

throaxC 14.7000 0.2277 11.8044 0.0000

activityone 1.0600 0.0534 1.0357 0.3024

activitylow 0.8900 0.0533 -2.1844 0.0309

activitymany 1.0900 0.0541 1.5240 0.1302

activityhigh 0.6600 0.0539 -7.6874 0.0000

shape 28.1455 NA NA NA

shape = 1/summary(mod1)$dispersion

scale = exp(mod1$coef[“(Intercept)”])/shape

xSeq = seq(0,100,len=100)

hist(fruitfly$longevity, prob = TRUE, breaks = 20, xlab = “longevity”,ylim = c(0,0.040),

main=”Empirical Distribution of the Data and the Model Fit”)

lines(xSeq, dgamma(xSeq, shape = shape,scale = scale), col = “blue”)

1

Empirical Distribution of the Data and the Model Fit

longevity

Density

20 40 60 80 100

0.00 0.01 0.02 0.03 0.04

Question 2

smokeUrl = http://pbrown.ca/teaching/appliedstats/data/smoke.RData

(smokeFile = tempfile(fileext=.RData))

## [1] “/var/folders/d6/69h871p92l59n9sx36t_bz000000gn/T//RtmpDV0x4b/file33cb37bbaf98.RData”

download.file(smokeUrl, smokeFile, mode=wb)

(load(smokeFile))

## [1] “smoke” “smokeFormats”

smokeFormats[smokeFormats$colName == Tried_cigarette_smkg_even, ]

## ID label

## 23 qn7 Tried cigarette smkg, even 1 or 2 puffs

## shortLabel colName

## 23 Tried cigarette smkg even 1 or 2 puffs Tried_cigarette_smkg_even

smoke$everSmoke = factor(smoke$Tried_cigarette_smkg_eve, levels=1:2, labels=c(yes,no))

smokeSub = smoke[smoke$Age != 9 & !is.na(smoke$Race)

& !is.na(smoke$ever_tobacco_hookah_or_wa)

& !is.na(smoke$chewing_tobacco_snuff_or), ]

smokeAgg = reshape2::dcast(smokeSub,

Age + Sex + Race + RuralUrban ~ chewing_tobacco_snuff_or,

length)

## Using everSmoke as value column: use value.var to override.

2

smokeAgg = na.omit(smokeAgg)

smokeAgg = smokeAgg[-7]

dim(smokeAgg)

## [1] 207 6

# smokeModel

smokeAgg$y = cbind(smokeAgg$TRUE, smokeAgg$FALSE)

smokeFit = glm(y ~ Age + Sex + Race + RuralUrban,

family=binomial(link=logit), data=smokeAgg)

# We want to scale the variable Age,

# since the center age of intercept is 15, we substract 15 from values of variable

smokeAgg$ageC = smokeAgg$Age – 15

smokeFit = glm(y ~ ageC + Sex + Race + RuralUrban,

family=binomial(link=logit), data=smokeAgg)

smokeTable = as.data.frame(summary(smokeFit)$coef)

# LRT

smokeFitReduced = glm(y ~ ageC + Sex + RuralUrban,

family=binomial(link=logit), data=smokeAgg)

knitr::kable(anova(smokeFit,smokeFitReduced,test = “Chisq”),

digits = 2,caption = “Likelihood Ratio Test of

Regular Use of Chewing Tobacco, Snuff or Dip”)

Table 2: Likelihood Ratio Test of Regular Use of Chewing Tobacco,

Snu or Dip

Resid. Df Resid. Dev Df Deviance Pr(Chi)

198 265.33 NA NA NA

203 427.36 -5 -162.04 0

# After renaming the variables and using knitr

rownames(smokeTable) = gsub(“Race|RuralUrban|C$”, “”,

rownames(smokeTable) )

rownames(smokeTable) = gsub(“SexF”,”Female”,

rownames(smokeTable))

smokeTable[,1] = exp(smokeTable[,1])

knitr::kable(smokeTable, digits=4,

caption = “Odds of Regular Use of Chewing Tobacco, Snuff or Dip”)

Table 3: Odds of Regular Use of Chewing Tobacco, Snu or Dip

Estimate Std. Error z value Pr(|z|)

(Intercept) 0.0473 0.0849 -35.9626 0.0000

age 1.4106 0.0214 16.1053 0.0000

Female 0.1680 0.1107 -16.1118 0.0000

black 0.2244 0.1720 -8.6872 0.0000

hispanic 0.4791 0.1073 -6.8565 0.0000

asian 0.2190 0.3424 -4.4357 0.0000

native 1.1070 0.2866 0.3549 0.7227

pacific 2.4259 0.3973 2.2303 0.0257

3

Estimate Std. Error z value Pr(|z|)

Rural 2.5635 0.0894 10.5322 0.0000

smokeAgg1 = reshape2::dcast(smokeSub,

Age + Sex + Race + RuralUrban ~ ever_tobacco_hookah_or_wa,

length)

## Using everSmoke as value column: use value.var to override.

smokeAgg1 = na.omit(smokeAgg1)

smokeAgg1 = smokeAgg1[-7]

dim(smokeAgg1)

## [1] 207 6

# smokeModel

smokeAgg1$y = cbind(smokeAgg1$TRUE, smokeAgg1$FALSE)

smokeFit1 = glm(y ~ Age + Sex + Race + RuralUrban,

family=binomial(link=logit), data=smokeAgg1)

# We want to scale the variable Age,

# since the center age of intercept is 15, we substract 15 from values of variable

smokeAgg1$ageC = smokeAgg1$Age – 15

smokeFit1 = glm(y ~ ageC + Sex + Race + RuralUrban,

family=binomial(link=logit), data=smokeAgg1)

smokeTable1 = as.data.frame(summary(smokeFit1)$coef)

smokeFitReduced1 = glm(y ~ ageC + Sex + RuralUrban,

family=binomial(link=logit), data=smokeAgg1)

knitr::kable(anova(smokeFit1,smokeFitReduced1,test = “Chisq”),

digits = 2,caption = “Likelihood Ratio Test for Having Used Hookah or Waterpipe”)

Table 4: Likelihood Ratio Test for Having Used Hookah or Waterpipe

Resid. Df Resid. Dev Df Deviance Pr(Chi)

198 371.79 NA NA NA

203 625.40 -5 -253.61 0

# After renaming the variables and using knitr

rownames(smokeTable1) = gsub(“Race|RuralUrban|C$”, “”,

rownames(smokeTable1) )

rownames(smokeTable1) = gsub(“SexF”,”Female”,

rownames(smokeTable1))

smokeTable1[,1] = exp(smokeTable1[,1])

knitr::kable(smokeTable1, digits=4,

caption = “Odds of Having Used Hookah or Waterpipe”)

Table 5: Odds of Having Used Hookah or Waterpipe

Estimate Std. Error z value Pr(|z|)

(Intercept) 0.1780 0.0441 -39.1338 0.0000

4

Estimate Std. Error z value Pr(|z|)

age 1.5221 0.0116 36.2011 0.0000

Female 1.0414 0.0431 0.9424 0.3460

black 0.5249 0.0711 -9.0705 0.0000

hispanic 1.4155 0.0486 7.1475 0.0000

asian 0.5234 0.1188 -5.4507 0.0000

native 1.1773 0.1905 0.8569 0.3915

pacific 2.7478 0.2705 3.7366 0.0002

Rural 0.6794 0.0445 -8.6916 0.0000

5

Sale!