Exercises D

Statistics III - CdL SSE

Author: Tommaso Rigon

Affiliation: Università degli Studi di Milano-Bicocca


The theoretical exercises described below are quite difficult. At the exam, you can expect a simplified version of them; otherwise, they would represent a formidable challenge for most of you.

On the other hand, the data analyses are more or less aligned with what you may encounter in the final examination.

The vast majority of these exercises are taken from the textbooks Salvan et al. (2020) and Agresti (2015), possibly with a few minor modifications. You can consult these textbooks if you need additional exercises.

Data analysis

Britishdoc dataset

The data in the Britishdoc data frame, available in the MLGdata R package, come from a prospective study on mortality among doctors in the United Kingdom. The original source of the dataset is the textbook:

Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.

For different age groups (age), and for smokers and non-smokers (smoke, with levels y and n), the dataset reports the total number of person-years observed (person.years) and the number of deaths due to heart attack (deaths).

Import the data and then:

a. Identify the statistical units, the response variable, and the explanatory variables, indicating the type (quantitative or qualitative) of each variable.
b. Conduct a graphical analysis to assess how the mortality rate, \text{mortality rate} = \frac{\text{deaths}}{\text{person.years}}, depends on age, and to compare the trends between smokers and non-smokers.
c. Using pen and paper, specify a Poisson regression model, additive in age and smoking status, for the logarithm of the mortality rate. Treat age as a numerical variable, for example by using the midpoints of the age classes.
d. Fit the model specified in point c. using R.
e. Report the estimates and confidence intervals for the model coefficients. Provide an interpretation of the estimated parameters.
f. Evaluate the goodness of fit of the model.
g. Fit a model that also includes the interaction between smoking status and age. Test whether the interaction can be omitted.
h. Assess whether it is appropriate to add the square of the age variable as an additional explanatory variable.
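
As a hedged illustration of the workflow in this exercise, the following R sketch fits a Poisson model for rates, with the logarithm of the exposure as an offset, on simulated data. Everything about the data is an assumption made for illustration: the data frame sim, the age midpoints, the person-years, and the true coefficients are invented and are not the Britishdoc values. Wald intervals from confint.default are used for simplicity; profile-likelihood intervals are an alternative.

```r
# Simulated stand-in for a rate dataset (NOT the Britishdoc data):
# all numbers below are hypothetical and chosen only for illustration.
set.seed(1)
sim <- expand.grid(age_mid = c(40, 50, 60, 70, 80), smoke = c("n", "y"))
sim$smoke <- factor(sim$smoke, levels = c("n", "y"))
sim$person_years <- round(runif(nrow(sim), 5000, 40000))

# Assumed true model: log(rate) = -12 + 0.1 * age_mid + 0.4 * I(smoke == "y")
rate <- exp(-12 + 0.1 * sim$age_mid + 0.4 * (sim$smoke == "y"))
sim$deaths <- rpois(nrow(sim), rate * sim$person_years)

# Additive model for the log mortality rate; the offset carries the exposure
m_add <- glm(deaths ~ age_mid + smoke + offset(log(person_years)),
             family = poisson, data = sim)

# Estimates with Wald 95% confidence intervals (log scale, then rate ratios)
est <- cbind(estimate = coef(m_add), confint.default(m_add))
print(est)
print(exp(est))

# Residual deviance compared with a chi-squared distribution as a fit check
p_gof <- pchisq(deviance(m_add), df.residual(m_add), lower.tail = FALSE)
print(p_gof)

# Likelihood ratio test for the age-by-smoking interaction
m_int <- update(m_add, . ~ . + age_mid:smoke)
lrt_int <- anova(m_add, m_int, test = "Chisq")
print(lrt_int)

# Likelihood ratio test for a quadratic age effect
m_quad <- update(m_add, . ~ . + I(age_mid^2))
lrt_quad <- anova(m_add, m_quad, test = "Chisq")
print(lrt_quad)
```

With the real data, the same calls would apply after replacing sim with the imported data frame and constructing the numerical age variable from the age classes.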

Crabs dataset

The Crabs dataset is available here and comes from a study of female horseshoe crabs¹ on an island in the Gulf of Mexico.

During spawning season, a female migrates to the shore to breed. With a male attached to her posterior spine, she burrows into the sand and lays clusters of eggs. The eggs are fertilized externally, in the sand beneath the pair. During spawning, other male crabs may cluster around the pair and may also fertilize the eggs. These male crabs are called satellites.

The response for each of the n = 173 female crabs is y, her number of satellites. Explanatory variables are the female crab’s color (color; 1 = medium light; 2 = medium; 3 = medium dark; 4 = dark), spine condition (spine; 1 = both good; 2 = one worn or broken; 3 = both worn or broken), weight (weight, in kg), and carapace width (width, in cm). The source of the dataset is the textbook:

Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.

Import the data and then:

a. Obtain a scatterplot of width vs y, coloring points according to color.
b. Fit two linear models, say m_lin_red and m_lin_full, having y as the response variable and:
   1. width as covariate;
   2. width and color as covariates (treat color as categorical).
c. Test whether color can be dropped from the linear model using an F test.
d. The response variable y is a count; therefore, its distribution cannot be Gaussian. Does this mean we should not use a linear model in this case? Under which assumptions is this still a valid method?
e. Check the diagnostic plots of model m_lin_full. Is there any evidence of heteroskedasticity? In particular, does the variance depend on the mean? If so, adjust the standard errors and obtain a corrected F test for the same hypothesis as in point c., using robust standard errors.
f. Fit two Poisson regression models using the canonical link, say m_pois_red and m_pois_full, having y as the response variable and the same covariates as in point b.
g. Test whether color can be dropped from the Poisson model using a log-likelihood ratio test. Compare the results with those obtained in points c. and e. Which result is more trustworthy?
h. Check the diagnostic plots of m_pois_full as well as the potential presence of overdispersion. Are there influential observations?
i. Fit two quasi-likelihood models, say m_quasi_red and m_quasi_full, using the same covariates as in points b. and f. Repeat the test of point g. and comment on the results.
j. Check the diagnostic plots of m_quasi_full as well as the potential presence of overdispersion. Are there influential observations?
k. Interpret the estimated coefficients of m_quasi_red and m_lin_red. Provide confidence intervals for the considered quantities.
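
The comparison between Poisson and quasi-Poisson tests in this exercise can be sketched as follows. This is a minimal illustration on simulated data, not an analysis of the Crabs dataset: the data frame sim, the range of width, and the negative binomial generating mechanism are assumptions introduced here, while the model names follow those in the text. In the simulation color has no effect by construction, so the sketch shows how overdispersion changes the test for dropping it.

```r
# Simulated overdispersed counts (NOT the Crabs data); values are hypothetical
set.seed(2)
n <- 173
sim <- data.frame(
  width = runif(n, 21, 33),
  color = factor(sample(1:4, n, replace = TRUE))
)
mu <- exp(-3 + 0.15 * sim$width)         # color has no effect in this simulation
sim$y <- rnbinom(n, size = 1, mu = mu)   # negative binomial: Var = mu + mu^2

# Poisson fits with the canonical (log) link
m_pois_red  <- glm(y ~ width, family = poisson, data = sim)
m_pois_full <- glm(y ~ width + color, family = poisson, data = sim)

# Likelihood ratio test for dropping color under the Poisson model
lrt <- anova(m_pois_red, m_pois_full, test = "Chisq")
print(lrt)

# Pearson estimate of the dispersion: values well above 1 suggest overdispersion
phi_hat <- sum(residuals(m_pois_full, type = "pearson")^2) /
  df.residual(m_pois_full)
print(phi_hat)

# Quasi-Poisson fits: same coefficients, standard errors scaled by sqrt(phi_hat)
m_quasi_red  <- glm(y ~ width, family = quasipoisson, data = sim)
m_quasi_full <- glm(y ~ width + color, family = quasipoisson, data = sim)

# F test for dropping color, which accounts for the estimated dispersion
ftest <- anova(m_quasi_red, m_quasi_full, test = "F")
print(ftest)
```

Diagnostic plots for any of these fits can be obtained with plot() applied to the fitted model object, and cooks.distance() gives a first look at influential observations.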

¹ See https://horseshoecrab.org and https://en.wikipedia.org/wiki/Horseshoe_crab for details about horseshoe crabs, including pictures of their mating.

Homicide dataset - Partial solution: Homicide.R

The data contained in the data frame Homicide, available in the MLGdata R package, report the responses of n = 1308 individuals in the United States to the question:

“How many people do you personally know who have been victims of homicide in the past 12 months?”

The observed variables are:

count: the reported number of victims
race: the race of the respondent (0 = White, 1 = Black)

Let y_i denote the response of subject i, for i = 1, \dots, 1308, and let x_i = 1 for Black respondents and x_i = 0 for White respondents.

Import the data and then:

a. Fit a Poisson regression model with canonical link and linear predictor \eta_i = \beta_1 + \beta_2 x_i. Interpret the estimates of the regression coefficients.
b. Compute the observed frequencies of the response variable for White and Black respondents and compare them with those expected under the fitted Poisson model.
c. Compute the Pearson goodness-of-fit statistic X^2. Can it be used to test the goodness of fit of the model in this case? Why?
d. Evaluate whether the data exhibit overdispersion relative to the Poisson model. Moreover, fit a quasi-Poisson model and check the diagnostics. Note: this point requires that you have already studied quasi-likelihoods, discussed in Unit E.
e. Assess whether it might be appropriate to fit a zero-inflated model to these data.
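
The comparison of observed and expected frequencies can be sketched in R as follows. This is only an illustration under stated assumptions: the data are simulated from a gamma-mixed Poisson and are not the Homicide data, and the names x, y, and compare, as well as all numerical values, are hypothetical. Comparing the observed and expected counts in the zero column gives an informal first check on whether a zero-inflated model might be worth considering.

```r
# Simulated stand-in (NOT the Homicide data); all values are hypothetical
set.seed(3)
n <- 1308
x <- rbinom(n, 1, 0.12)                    # hypothetical binary group indicator
lambda <- exp(-2.4 + 1.7 * x)
# Gamma-mixed Poisson to induce overdispersion (mixing variable has mean 1)
y <- rpois(n, lambda * rgamma(n, shape = 0.3, rate = 0.3))

# Poisson regression with log link and a single binary covariate
m_pois <- glm(y ~ x, family = poisson)
mu_hat <- unname(exp(coef(m_pois)[1] + coef(m_pois)[2] * c(0, 1)))

# Observed vs expected frequencies of 0, 1, 2, 3+ within group g (0 or 1)
compare <- function(g) {
  yg <- y[x == g]
  obs <- table(factor(pmin(yg, 3), levels = 0:3))
  p <- dpois(0:2, mu_hat[g + 1])
  expd <- length(yg) * c(p, 1 - sum(p))
  out <- rbind(observed = as.numeric(obs), expected = round(expd, 1))
  colnames(out) <- c("0", "1", "2", "3+")
  out
}
print(compare(0))
print(compare(1))

# Group-wise mean and variance: a variance above the mean signals overdispersion
print(tapply(y, x, mean))
print(tapply(y, x, var))

# Quasi-Poisson fit and its estimated dispersion parameter
m_quasi <- glm(y ~ x, family = quasipoisson)
phi_hat <- summary(m_quasi)$dispersion
print(phi_hat)
```

Because the model has a single binary covariate, the fitted means coincide with the two group sample means, which is why the expected frequencies can be computed directly from dpois within each group.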

Theoretical

Exercise A

Conditionally on positive covariates U_i = u_i, the random variables Y_i \mid U_i = u_i are independent and Poisson distributed with mean \mu u_i, where \mu > 0 is an unknown parameter. Suppose the covariates U_i are iid with mean \mathbb{E}(U_i)=1 and variance \mathrm{Var}(U_i)=\tau.

Show that the Y_i are iid with mean \mathbb{E}(Y_i)=\mu and variance \mathrm{Var}(Y_i) = \mu + \tau\mu^2. Comment on the result and on the issue of overdispersion.
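
Before attempting the proof, the stated moments can be checked numerically. The sketch below is a Monte Carlo illustration, not a derivation. It assumes a specific distribution for U_i, namely a gamma with shape 1/\tau and rate 1/\tau, which has mean 1 and variance \tau; the exercise itself does not specify the distribution of U_i, and the values of mu and tau are arbitrary.

```r
# Monte Carlo check of E(Y) = mu and Var(Y) = mu + tau * mu^2.
# Assumption: U ~ Gamma(shape = 1/tau, rate = 1/tau), so E(U) = 1, Var(U) = tau.
set.seed(4)
n   <- 1e6
mu  <- 3
tau <- 0.5

u <- rgamma(n, shape = 1 / tau, rate = 1 / tau)  # positive, mean 1, variance tau
y <- rpois(n, mu * u)                            # Y | U = u ~ Poisson(mu * u)

# Empirical moments next to the theoretical values
print(c(mean_mc = mean(y), mean_theory = mu))
print(c(var_mc = var(y), var_theory = mu + tau * mu^2))
```

The empirical variance exceeds the empirical mean, as the formula predicts whenever \tau > 0, which is the overdispersion the exercise asks you to comment on.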

References

Agresti, A. (2015), Foundations of Linear and Generalized Linear Models, Wiley.

Salvan, A., Sartori, N., and Pace, L. (2020), Modelli lineari generalizzati, Springer.