Exercises D
Statistics III - CdL SSE
Tommaso Rigon
Università degli Studi di Milano-Bicocca
The theoretical exercises described below are quite difficult. At the exam, you can expect a simplified version of them; otherwise, they would represent a formidable challenge for most of you.
On the other hand, the data analyses are more or less aligned with what you may encounter in the final examination.
Data analysis
Britishdoc dataset
The data in the Britishdoc data frame, available in the MLGdata R package, come from a prospective study on mortality among doctors in the United Kingdom. The original source of the dataset is the textbook:
Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.
For different age groups (age), and for smokers and non-smokers (smoke, with levels y and n), the dataset reports the total number of person-years observed (person.years) and the number of deaths due to heart attack (deaths).
Import the data and then:
a. Identify the statistical units, the response variable, and the explanatory variables, indicating the type (quantitative or qualitative) of each variable.
b. Conduct a graphical analysis to assess the dependence of the mortality rate, \text{mortality rate} = \frac{\text{deaths}}{\text{person.years}}, on age, and to compare the trends between smokers and non-smokers.
c. Using pen and paper, specify a Poisson regression model, additive in age and smoking status, for the logarithm of the mortality rate. Treat age as a numerical variable, for example using the midpoints of the age classes.
d. Fit the model described in (c) using R.
e. Report the estimates and confidence intervals for the model coefficients. Provide an interpretation of the estimated parameters.
f. Evaluate the goodness of fit of the model.
g. Fit a model that also includes the interaction between smoking status and age. Test whether it can be omitted or not.
h. Assess whether it is appropriate to add the square of the age variable as an additional explanatory variable.
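As a rough illustration of the workflow in the points above, the following sketch fits a Poisson model for rates on simulated data, not on Britishdoc. It assumes one common specification, in which the logarithm of the exposure enters as an offset, so that \log(\mu_i / \text{person.years}_i) is linear in age and smoking status. The data frame dat, the column age_mid, the midpoints, and the "true" coefficients are all hypothetical choices made only for the example.

```r
# Simulated stand-in for a rates data frame (NOT the Britishdoc data).
set.seed(123)
dat <- expand.grid(age_mid = c(40, 50, 60, 70, 80),   # hypothetical class midpoints
                   smoke   = c("n", "y"))
dat$person.years <- round(runif(nrow(dat), 2000, 50000))

# Hypothetical true log-rate, additive in age and smoking status
log_rate   <- -11 + 0.08 * dat$age_mid + 0.4 * (dat$smoke == "y")
dat$deaths <- rpois(nrow(dat), dat$person.years * exp(log_rate))

# Additive model for the log rate: log(person.years) is an offset
m_add <- glm(deaths ~ age_mid + smoke + offset(log(person.years)),
             family = poisson, data = dat)

# Wald intervals on the rate-ratio scale
exp(cbind(estimate = coef(m_add), confint.default(m_add)))

# Goodness of fit: residual deviance compared with a chi-squared distribution
p_gof <- pchisq(deviance(m_add), df.residual(m_add), lower.tail = FALSE)

# Interaction and quadratic term, each compared with the additive model
m_int  <- update(m_add, . ~ . + age_mid:smoke)
m_quad <- update(m_add, . ~ . + I(age_mid^2))
anova(m_add, m_int,  test = "Chisq")
anova(m_add, m_quad, test = "Chisq")
```

With the real data, the numerical age variable would have to be built from the actual age classes, and the chi-squared approximation for the deviance is only reasonable when the expected counts are not too small.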
Crabs dataset
The Crabs dataset is available here and comes from a study of female horseshoe crabs [1] on an island in the Gulf of Mexico.
During spawning season, a female migrates to the shore to breed. With a male attached to her posterior spine, she burrows into the sand and lays clusters of eggs. The eggs are fertilized externally, in the sand beneath the pair. During spawning, other male crabs may cluster around the pair and may also fertilize the eggs. These male crabs are called satellites.
The response outcome for each of the n = 173 female crabs is her y = number of satellites. Explanatory variables are the female crab’s color (color; 1 = medium light; 2 = medium; 3 = medium dark; 4 = dark), spine condition (spine; 1 = both good; 2 = one worn or broken; 3 = both worn or broken), weight (weight in kg), and carapace width (width in cm). The source of the dataset is the textbook:
Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.
Import the data and then:
a. Obtain a scatterplot of width vs y, coloring points according to color.
b. Fit two linear models, say m_lin_red and m_lin_full, having y as the response variable and: (i) width as covariate; (ii) width and color as covariates (treat color as categorical).
c. Test if color can be dropped from the linear model using an F test.
d. The response variable y is a count; therefore, its distribution cannot be Gaussian. Does this mean we should not use a linear model in this case? Under which assumption is this still a valid method?
e. Check the diagnostic plots of model m_lin_full. Is there any evidence of heteroskedasticity? In particular, does the variance depend on the mean? If so, adjust the standard errors and obtain a corrected F-test for the same hypothesis of point c., using robust standard errors.
f. Fit two Poisson regression models using the canonical link, say m_pois_red and m_pois_full, having y as the response variable and the same covariates of point b.
g. Test if color can be dropped from the Poisson model using a log-likelihood ratio test. Compare the results with those obtained in points b. and e. Which result is more trustworthy?
h. Check the diagnostic plots of m_pois_full as well as the potential presence of overdispersion. Are there influential observations?
i. Fit two quasi-likelihood models, say m_quasi_red and m_quasi_full, using the same covariates of points b. and f. Repeat the test of point g. and comment on the results.
j. Check the diagnostic plots of m_quasi_full as well as the potential presence of overdispersion. Are there influential observations?
k. Interpret the estimated coefficients of m_quasi_red and m_lin_red. Provide confidence intervals for the considered quantities.
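The comparison between a classical and a heteroskedasticity-robust test in points c. and e. can be sketched as follows. This is only one possible approach: it uses simulated data (the data frame sim and its generating coefficients are hypothetical, and only the sample size mirrors the Crabs data) and a hand-computed HC0 sandwich covariance in base R, whereas dedicated packages offer other variants.

```r
# Simulated count data with a crab-like structure (NOT the Crabs data).
set.seed(7)
n   <- 173
sim <- data.frame(width = rnorm(n, mean = 26, sd = 2),
                  color = factor(sample(1:4, n, replace = TRUE)))
sim$y <- rpois(n, exp(-3 + 0.15 * sim$width))   # variance grows with the mean

m_lin_red  <- lm(y ~ width, data = sim)
m_lin_full <- lm(y ~ width + color, data = sim)

# Classical F test, valid under homoskedastic errors
anova(m_lin_red, m_lin_full)

# HC0 sandwich covariance: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
X     <- model.matrix(m_lin_full)
e     <- residuals(m_lin_full)
bread <- solve(crossprod(X))
V_hc0 <- bread %*% crossprod(X * e) %*% bread

# Robust Wald test that all color coefficients are zero
idx <- grep("^color", colnames(X))
b   <- coef(m_lin_full)[idx]
W   <- drop(t(b) %*% solve(V_hc0[idx, idx]) %*% b)
q   <- length(idx)
c(F_robust = W / q,
  p_value  = pf(W / q, q, df.residual(m_lin_full), lower.tail = FALSE))
```

The robust statistic is the Wald quadratic form divided by the number of tested coefficients and is referred here to an F distribution; using a chi-squared reference for W itself is an equally common choice.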
[1] See https://horseshoecrab.org and https://en.wikipedia.org/wiki/Horseshoe_crab for details about horseshoe crabs, including pictures of their mating.
Homicide dataset
The data contained in the data frame Homicide, available in the MLGdata R package, report the responses of n = 1308 individuals in the United States to the question:
“How many people do you personally know who have been victims of homicide in the past 12 months?”
The observed variables are:
count: the reported number of victims
race: the race of the respondent (0 = White, 1 = Black)
Let y_i denote the response of subject i, for i = 1, \dots, 1308, and let x_i = 1 for Black respondents and x_i = 0 for White respondents.
Import the data and then:
a. Fit a Poisson regression model with canonical link and linear predictor \eta_i = \beta_1 + \beta_2 x_i. Interpret the estimates of the regression coefficients.
b. Compute the observed frequencies of the response variable for White and Black respondents and compare them with those expected under the fitted Poisson model.
c. Compute the Pearson X^2 goodness-of-fit statistic. Can it be used to test the goodness of fit of the model in this case? Why?
d. Evaluate whether the data exhibit overdispersion relative to the Poisson model. Moreover, fit a quasi-Poisson model and check the diagnostics. Note: this point requires that you have already studied quasi-likelihoods, discussed in Unit E.
e. Assess whether it might be appropriate to fit a zero-inflated model to these data.
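The checks requested above (Pearson statistic, dispersion estimate, observed versus expected zeros, quasi-Poisson fit) can be sketched on simulated data as follows. The covariate x, the latent variable u, and all numerical values are hypothetical and are not taken from the Homicide data.

```r
# Simulated overdispersed counts with a binary covariate (NOT the Homicide data).
set.seed(1)
n <- 1000
x <- rbinom(n, 1, 0.2)                      # hypothetical binary covariate
u <- rgamma(n, shape = 0.5, rate = 0.5)     # latent heterogeneity with mean 1
y <- rpois(n, exp(-1 + 1.2 * x) * u)

m_pois <- glm(y ~ x, family = poisson)
mu_hat <- fitted(m_pois)

# Pearson X^2 and the moment estimate of the dispersion parameter
X2      <- sum((y - mu_hat)^2 / mu_hat)
phi_hat <- X2 / df.residual(m_pois)

# Observed number of zeros versus the number expected under the Poisson fit
zeros <- c(observed = sum(y == 0), expected = sum(dpois(0, mu_hat)))

# Quasi-Poisson fit: same point estimates, standard errors rescaled by the dispersion
m_quasi <- glm(y ~ x, family = quasipoisson)
summary(m_quasi)$dispersion
```

A dispersion estimate well above 1, or many more observed zeros than expected, points to a departure from the Poisson model; which alternative is appropriate (quasi-likelihood, a mixture, or zero inflation) still has to be argued case by case.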
Theoretical
Conditionally on positive covariates U_i = u_i, the random variables Y_i \mid U_i = u_i are Poisson distributed with mean \mu u_i, where \mu > 0 is an unknown parameter. Suppose the covariates U_i are iid with mean \mathbb{E}(U_i)=1 and variance \mathrm{Var}(U_i)=\tau.
Show that the Y_i's are iid with mean \mathbb{E}(Y_i)=\mu and variance \mathrm{Var}(Y_i) = \mu + \tau\mu^2. Comment on the result and the issue of overdispersion.
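Before attempting the proof, the stated moments can be checked numerically. The sketch below assumes, purely for illustration, that U_i follows a gamma distribution with mean 1 and variance \tau; the exercise itself does not specify the distribution of U_i, and the values of mu and tau are hypothetical.

```r
# Monte Carlo check of E(Y) = mu and Var(Y) = mu + tau * mu^2.
set.seed(42)
n   <- 1e5
mu  <- 3      # hypothetical value
tau <- 0.5    # hypothetical value

# Gamma(shape = 1/tau, rate = 1/tau) has mean 1 and variance tau
u <- rgamma(n, shape = 1 / tau, rate = 1 / tau)
y <- rpois(n, mu * u)

c(sample_mean = mean(y), target = mu)
c(sample_var  = var(y),  target = mu + tau * mu^2)
```

The simulation does not replace the analytical argument, but it makes the excess of the variance over the mean visible.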