Exercises C
Statistics III - CdL SSE
Tommaso Rigon
Università degli Studi di Milano-Bicocca
The theoretical exercises described below are quite difficult. At the exam, you can expect a simplified version of them; otherwise, they would represent a formidable challenge for most of you.
On the other hand, the data analyses are more or less aligned with what you may encounter in the final examination.
Data analysis
Heart dataset
The data in the Heart dataframe included in the MLGdata library report the number of confirmed myocardial infarctions in a sample of 360 patients hospitalized with suspected infarction (Hand et al., 1994, p. 45).
For each level of the enzyme Creatine Kinase (IU per liter), grouped into classes (ck), the dataset provides:
- the number of confirmed infarctions (ha),
- the number of non-confirmed infarctions (nha),
- and the midpoint value of the variable ck (mck).
The goal is to evaluate the influence of the Creatine Kinase enzyme level on the probability of infarction.
Import the data and then:
Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, qualitative nominal, qualitative ordinal).
Conduct a graphical analysis to assess whether a linear regression model may be appropriate.
Specify a generalized linear model to analyze the problem.
Fit in R the generalized linear model from the previous point and comment on the results.
Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained estimates.
For each element of the summary output of the glm object in R, indicate which quantity is being calculated, matching them with the formulas in the slides.
Evaluate the goodness of fit of the model.
Assess whether it is appropriate to introduce a quadratic term in mck in the linear predictor and evaluate the goodness of fit of the expanded model.
Plot the scatter diagram of the points (x_i, y_i), i = 1, \dots, 13, with x_i equal to mck and y_i equal to the corresponding proportion of infarctions. Superimpose on this diagram the curves of the predicted values from the two fitted models.
Evaluate whether the introduction of the quadratic term significantly increases the variability of the estimates. Discuss the interpretability of the quadratic model.
Obtain a 95% confidence interval for the probability of infarction corresponding to a Creatine Kinase value of 150.
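As a possible starting point for the model specification and for the interval at a Creatine Kinase value of 150, the block below sketches one standard formulation. The index i over the ck classes, the totals m_i, the coefficient names \beta_1, \beta_2, and the symbols v_{jk} for the entries of the estimated covariance matrix of (\hat\beta_1, \hat\beta_2) are notational assumptions and may differ from the slides.

```latex
\begin{aligned}
Y_i &\sim \mathrm{Binomial}(m_i, \pi_i) \ \text{independently}, \qquad
  y_i = \texttt{ha}_i, \quad m_i = \texttt{ha}_i + \texttt{nha}_i, \\
\mathrm{logit}(\pi_i) &= \log\frac{\pi_i}{1 - \pi_i} = \beta_1 + \beta_2 x_i, \qquad x_i = \texttt{mck}_i, \\[4pt]
\hat\eta_0 &= \hat\beta_1 + 150\,\hat\beta_2, \qquad
\widehat{\mathrm{se}}(\hat\eta_0)^2 = v_{11} + 2 \cdot 150\, v_{12} + 150^2\, v_{22}, \\
\text{95\% interval for } \pi(150) &:\quad
\left( \frac{e^{L}}{1 + e^{L}},\ \frac{e^{U}}{1 + e^{U}} \right),
\qquad (L, U) = \hat\eta_0 \mp 1.96\, \widehat{\mathrm{se}}(\hat\eta_0).
\end{aligned}
```

Building the interval on the linear predictor scale and then transforming its endpoints keeps it inside (0, 1); a delta-method interval computed directly on the probability scale is an alternative that does not have this guarantee.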
Beetles dataset
For the logit model considered in the slides applied to the Beetles data, with linear predictor \beta_1 + \beta_2 x, verify that \psi = -\beta_1 / \beta_2 corresponds to the log-dose x_{0.5} at which the probability that an insect is killed is equal to 0.5.
Apply the delta method to obtain an approximate 95% Wald confidence interval for \psi.
Obtain the analogous interval for the dose, defined as \tilde{\psi} = 10^{x_{0.5}} = 10^{\psi} (lethal dose 50).
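The derivation can be outlined as follows. This is a sketch that assumes the parametrization \mathrm{logit}\{\pi(x)\} = \beta_1 + \beta_2 x, with x the log-dose in base 10; the symbols v_{jk}, denoting the entries of the estimated covariance matrix of (\hat\beta_1, \hat\beta_2), are notation introduced here and not taken from the slides.

```latex
\begin{aligned}
\pi(x_{0.5}) = 0.5
  &\iff \beta_1 + \beta_2 x_{0.5} = 0
  \iff x_{0.5} = -\frac{\beta_1}{\beta_2} = \psi, \\[4pt]
\nabla \psi(\beta) &= \left( -\frac{1}{\beta_2},\ \frac{\beta_1}{\beta_2^2} \right)
  = -\frac{1}{\beta_2}\,(1,\ \psi), \\[4pt]
\widehat{\mathrm{var}}(\hat\psi) &\approx
  \frac{1}{\hat\beta_2^2} \left( v_{11} + 2 \hat\psi\, v_{12} + \hat\psi^2\, v_{22} \right), \\[4pt]
\text{95\% interval for } \psi &:\quad
  (L, U) = \hat\psi \mp 1.96 \sqrt{\widehat{\mathrm{var}}(\hat\psi)}, \\
\text{95\% interval for } \tilde\psi = 10^{\psi} &:\quad (10^{L},\ 10^{U}).
\end{aligned}
```

Transforming the endpoints is one option for the dose; a second application of the delta method, with \widehat{\mathrm{se}}(10^{\hat\psi}) \approx \log(10)\, 10^{\hat\psi}\, \widehat{\mathrm{se}}(\hat\psi), gives a symmetric interval around 10^{\hat\psi} instead.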
kalythos dataset
Male inhabitants of the Greek island of Kalythos suffer from a congenital eye disease, whose effects become more pronounced at older ages.
A sample of male islanders of different ages was examined, and the number of blind individuals was recorded, yielding the results in the table below (Silvey, 1975, Exercise 4.2).
| Age | 20 | 35 | 45 | 55 | 70 |
|---|---|---|---|---|---|
| Number of observed individuals | 50 | 50 | 50 | 50 | 50 |
| Number of blind individuals | 6 | 17 | 26 | 37 | 44 |
Using a logit or probit model:
Estimate the LD50, i.e., the age at which the probability of blindness is equal to 0.5, and its corresponding variance.
Compare the results between the logit and probit models.
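For the logit case, the computation can be sketched without any statistical library. The code below is a minimal illustration in Python rather than R, assuming the model logit(\pi_i) = \beta_1 + \beta_2 \cdot age_i fitted by Fisher scoring; the function names `score_and_cov` and `fit_logit` are hypothetical helpers written for this example, and the probit model is not covered.

```python
import math

# Data from the table above (Silvey, 1975, Exercise 4.2).
age = [20, 35, 45, 55, 70]
examined = [50, 50, 50, 50, 50]
blind = [6, 17, 26, 37, 44]


def score_and_cov(b1, b2, x, y, m):
    """Score vector and inverse Fisher information of the binomial logit model."""
    s1 = s2 = i11 = i12 = i22 = 0.0
    for xi, yi, mi in zip(x, y, m):
        p = 1.0 / (1.0 + math.exp(-(b1 + b2 * xi)))
        w = mi * p * (1.0 - p)  # binomial variance, used as weight
        s1 += yi - mi * p
        s2 += (yi - mi * p) * xi
        i11 += w
        i12 += w * xi
        i22 += w * xi * xi
    det = i11 * i22 - i12 * i12
    # Inverse of the 2x2 information matrix: entries (v11, v12, v22).
    return (s1, s2), (i22 / det, -i12 / det, i11 / det)


def fit_logit(x, y, m, max_iter=100, tol=1e-10):
    """Fisher scoring started at (0, 0); returns estimates and covariance."""
    b1 = b2 = 0.0
    for _ in range(max_iter):
        (s1, s2), (v11, v12, v22) = score_and_cov(b1, b2, x, y, m)
        d1 = v11 * s1 + v12 * s2
        d2 = v12 * s1 + v22 * s2
        b1, b2 = b1 + d1, b2 + d2
        if abs(d1) + abs(d2) < tol:
            break
    # Covariance re-evaluated at the final estimate.
    _, cov = score_and_cov(b1, b2, x, y, m)
    return (b1, b2), cov


(b1, b2), (v11, v12, v22) = fit_logit(age, blind, examined)

# LD50 solves b1 + b2 * age = 0.
ld50 = -b1 / b2

# Delta method: gradient of -b1/b2 with respect to (b1, b2).
g1, g2 = -1.0 / b2, b1 / b2 ** 2
var_ld50 = g1 * g1 * v11 + 2.0 * g1 * g2 * v12 + g2 * g2 * v22

print(f"b1 = {b1:.4f}, b2 = {b2:.4f}")
print(f"LD50 = {ld50:.2f}, variance = {var_ld50:.3f}")
```

The estimates can be checked against the output of a binomial glm fit in R, which should agree up to numerical tolerance since both rely on the same iterative scheme for the canonical link.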
Germination dataset
The data contained in the Germination dataframe (Cox and Snell, 1989, Example 3.2), which can be found in the MLGdata library, were obtained from a 2 \times 2 factorial experiment (two factors, each with two levels) to compare two seeds, Orobanche aegyptiaca 75 and Orobanche aegyptiaca 73, germinated on two different root extracts: bean and cucumber.
For each combination, the total number of seeds m and the number of seeds that germinate s are reported.
Import the data and then:
Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, qualitative nominal, qualitative ordinal).
Specify an appropriate model to evaluate the effect of the two factors (seed and root used) on the probability of germination. Indicate both the model without interaction and the model with interaction.
Verify that the model with interaction is equivalent to a model that assumes the number of seeds that germinate in each experiment is a realization of a binomial distribution with success probabilities \pi_i, i=1,\dots,21, corresponding to:
- \pi_{75F} for Orobanche aegyptiaca 75 on bean root;
- \pi_{75C} for Orobanche aegyptiaca 75 on cucumber root;
- \pi_{73F} for Orobanche aegyptiaca 73 on bean root;
- \pi_{73C} for Orobanche aegyptiaca 73 on cucumber root.
Fit the generalized linear model in R using the canonical link, and assess the significance of the interaction effect.
Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained estimates.
For each element of the summary output of the glm object in R, indicate which quantity is calculated, providing a correspondence with the formulas in the slides.
Evaluate the goodness of fit of the model, with particular attention to potential overdispersion.
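The two models and their link with the four cell probabilities can be written, for instance, as below. This is a sketch under an assumed coding of the factors: x_{i1} = 1 for Orobanche aegyptiaca 73 (0 for 75) and x_{i2} = 1 for cucumber root (0 for bean). The baseline levels R actually uses depend on how the factors are stored in the dataframe, and S_i denotes the number of germinated seeds out of m_i.

```latex
\begin{aligned}
S_i &\sim \mathrm{Binomial}(m_i, \pi_i) \ \text{independently}, \qquad i = 1, \dots, 21, \\
\text{without interaction:}\quad \mathrm{logit}(\pi_i) &= \beta_1 + \beta_2 x_{i1} + \beta_3 x_{i2}, \\
\text{with interaction:}\quad \mathrm{logit}(\pi_i) &= \beta_1 + \beta_2 x_{i1} + \beta_3 x_{i2} + \beta_4 x_{i1} x_{i2}, \\[4pt]
\mathrm{logit}(\pi_{75F}) &= \beta_1, \qquad
\mathrm{logit}(\pi_{75C}) = \beta_1 + \beta_3, \\
\mathrm{logit}(\pi_{73F}) &= \beta_1 + \beta_2, \qquad
\mathrm{logit}(\pi_{73C}) = \beta_1 + \beta_2 + \beta_3 + \beta_4 .
\end{aligned}
```

Under the interaction model the map from (\beta_1, \beta_2, \beta_3, \beta_4) to the four logits is one-to-one, which is the key step in the equivalence to be verified; under the additive model the four logits are constrained by \mathrm{logit}(\pi_{73C}) - \mathrm{logit}(\pi_{73F}) = \mathrm{logit}(\pi_{75C}) - \mathrm{logit}(\pi_{75F}).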
Theoretical
Let Y \sim \mathrm{Binomial}(m, \pi). Consider the logit parametrization \theta = \log\{\pi/(1-\pi)\}.
Obtain the Wald test for H_0: \theta = 0 against H_1: \theta > 0.
Assume m = 25 and verify that the test statistic with y = 24 is smaller than the test statistic with y = 23. Explain why this behavior is anomalous.
Verify that the anomaly does not occur for the likelihood-ratio test.
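The numerical part of this exercise can be checked with a few lines of code. The sketch below is written in Python and assumes the Wald statistic uses the standard error evaluated at the maximum likelihood estimate, \widehat{\mathrm{var}}(\hat\theta) = m / \{y (m - y)\}, and that the one-sided likelihood-ratio test is based on the signed root of the usual statistic. The function names are hypothetical, and both functions require 0 < y < m.

```python
import math


def wald_statistic(y, m):
    """Wald statistic for H0: theta = 0, theta = logit(pi), with 0 < y < m."""
    theta_hat = math.log(y / (m - y))
    # Inverse Fisher information for theta, evaluated at the MLE.
    se = math.sqrt(m / (y * (m - y)))
    return theta_hat / se


def signed_lr_statistic(y, m):
    """Signed root of the likelihood-ratio statistic for H0: theta = 0."""
    pi_hat = y / m
    loglik_hat = y * math.log(pi_hat) + (m - y) * math.log(1.0 - pi_hat)
    loglik_null = m * math.log(0.5)  # theta = 0 corresponds to pi = 0.5
    w = max(2.0 * (loglik_hat - loglik_null), 0.0)
    return math.copysign(math.sqrt(w), pi_hat - 0.5)


m = 25
for y in (23, 24):
    print(y, round(wald_statistic(y, m), 3), round(signed_lr_statistic(y, m), 3))
```

With m = 25, the Wald statistic is about 3.31 for y = 23 and about 3.11 for y = 24, so it decreases even though y = 24 is further from the null hypothesis, whereas the signed likelihood-ratio statistic increases with y.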