The theoretical exercises described below are quite difficult. In the exam, you can expect simplified versions of them; in their current form, they would be a formidable challenge for most of you.
The data analyses, on the other hand, are broadly in line with what you may encounter in the final examination.
The vast majority of these exercises are taken from the textbooks Salvan et al. (2020) and Agresti (2015), possibly with a few minor modifications. You can consult these textbooks if you need additional exercises.
Data analysis
The data in the Britishdoc data frame, available in the MLGdata R package (Agresti, 2015, Exercise 7.36), come from a prospective study on mortality among doctors in the United Kingdom.
For different age groups (age), and for smokers and non-smokers (smoke, with levels y and n), the dataset reports the total number of person-years observed (person.years) and the number of deaths due to heart attack (deaths).
Import the data and then:
(a) Identify the statistical units, the response variable, and the explanatory variables, indicating the type (quantitative or qualitative) of each variable.
(b) Conduct a graphical analysis to assess the dependence of the mortality rate
$$\text{mortality rate} = \frac{\text{deaths}}{\text{person.years}}$$
on age, and to compare the trends between smokers and non-smokers.
(c) Using pen and paper, specify a Poisson regression model, additive in age and smoking status, for the logarithm of the mortality rate. Treat age as a numerical variable, for example by using the midpoints of the age classes.
(d) Fit the model described in (c) using R.
(e) Report the estimates and confidence intervals for the model coefficients. Provide an interpretation of the estimated parameters.
(f) Evaluate the goodness of fit of the model.
(g) Fit a model that also includes the interaction between smoking status and age, and test whether the interaction term can be omitted.
(h) Assess whether it is appropriate to add the square of the age variable as an additional explanatory variable.
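The Poisson rate models in this exercise can be sketched in R as follows. This is a minimal sketch under stated assumptions, not a model solution: the helper names `class_midpoint` and `fit_rate_models` are hypothetical, and the code assumes that the age classes are labelled in the form "35-44", so that midpoints can be parsed from the labels. The logarithm of the person-years enters as an offset, so the linear predictor describes the log mortality rate.

```r
# Hypothetical helper: midpoint of an age class labelled "lo-hi", e.g. "35-44" -> 39.5.
# Assumption: each label consists of two numbers separated by a hyphen.
class_midpoint <- function(labels) {
  bounds <- strsplit(as.character(labels), "-", fixed = TRUE)
  sapply(bounds, function(b) mean(as.numeric(b)))
}

# Hypothetical helper: fit the additive and the interaction Poisson models for rates.
# offset(log(person.years)) makes the coefficients refer to log(deaths / person.years).
fit_rate_models <- function(data) {
  data$age_mid <- class_midpoint(data$age)
  additive <- glm(deaths ~ age_mid + smoke + offset(log(person.years)),
                  family = poisson, data = data)
  with_interaction <- update(additive, . ~ . + age_mid:smoke)
  list(additive = additive,
       with_interaction = with_interaction,
       # Likelihood ratio test of the additive model against the interaction model
       lrt = anova(additive, with_interaction, test = "Chisq"))
}

# Usage with the course data, only if the MLGdata package is installed
if (requireNamespace("MLGdata", quietly = TRUE)) {
  data(Britishdoc, package = "MLGdata")
  fits <- fit_rate_models(Britishdoc)
  print(summary(fits$additive))
  print(fits$lrt)
}
```

If the age labels in the data frame follow a different pattern, `class_midpoint` has to be adapted accordingly. A quadratic term can be examined in the same way, for instance by adding `I(age_mid^2)` to the formula.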
The data contained in the data frame Homicide, available in the MLGdata R package, report the responses of $n = 1308$ individuals in the United States to the question:
“How many people do you personally know who have been victims of homicide in the past 12 months?”
The observed variables are:
count: the reported number of victims
race: the race of the respondent (0 = White, 1 = Black)
Let $y_i$ denote the response of subject $i$, for $i = 1, \dots, 1308$, and let $x_i = 1$ for Black respondents and $x_i = 0$ for White respondents.
Import the data and then:
(a) Fit a Poisson regression model with canonical link and linear predictor $\eta_i = \beta_1 + \beta_2 x_i$. Interpret the estimates of the regression coefficients.
(b) Compute the observed frequencies of the response variable for White and Black respondents and compare them with those expected under the fitted Poisson model.
(c) Compute the Pearson goodness-of-fit statistic $X^2$. If necessary, combine higher response values into a single category and assess the overall goodness of fit of the model.
(d) Evaluate whether the data exhibit overdispersion relative to the Poisson model.
(e) Optional: assess whether it might be appropriate to fit a zero-inflated model to these data.
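The comparison between observed and expected frequencies, and the Pearson statistic computed on pooled categories, can be sketched in R as follows. This is a minimal sketch, not a model solution: the helper names `pooled_frequencies` and `pearson_x2` are hypothetical, and the pooling threshold `max_count = 3` is an arbitrary illustrative choice that should be adapted to the data.

```r
# Hypothetical helper: observed and expected frequencies of the counts
# 0, 1, ..., max_count - 1, with all values >= max_count pooled into a last category.
# `y` contains the observed counts and `mu` the fitted Poisson means (same length).
pooled_frequencies <- function(y, mu, max_count = 3) {
  values <- 0:(max_count - 1)
  observed <- c(sapply(values, function(k) sum(y == k)),
                sum(y >= max_count))
  expected <- c(sapply(values, function(k) sum(dpois(k, mu))),
                sum(ppois(max_count - 1, mu, lower.tail = FALSE)))
  data.frame(category = c(as.character(values), paste0(max_count, "+")),
             observed = observed,
             expected = expected)
}

# Pearson statistic on the pooled categories: sum of (O - E)^2 / E
pearson_x2 <- function(freq) {
  sum((freq$observed - freq$expected)^2 / freq$expected)
}

# Usage with the course data, only if the MLGdata package is installed
if (requireNamespace("MLGdata", quietly = TRUE)) {
  data(Homicide, package = "MLGdata")
  fit <- glm(count ~ race, family = poisson, data = Homicide)
  mu <- fitted(fit)
  # One frequency table for each level of race
  by_race <- lapply(split(seq_along(mu), Homicide$race), function(idx) {
    pooled_frequencies(Homicide$count[idx], mu[idx])
  })
  print(by_race)
  print(sapply(by_race, pearson_x2))
  # Rough overdispersion check: Pearson chi-square divided by the residual
  # degrees of freedom; values well above 1 suggest overdispersion.
  print(sum(residuals(fit, type = "pearson")^2) / df.residual(fit))
}
```

The expected frequencies sum to the number of observations by construction, which is a quick sanity check on the table before the statistic is computed.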
References
Agresti, A. (2015), Foundations of Linear and Generalized Linear Models, Wiley.
Salvan, A., Sartori, N., and Pace, L. (2020), Modelli lineari generalizzati, Springer.