Exercises C
Statistics III - CdL SSE
Tommaso Rigon
Università degli Studi di Milano-Bicocca
Homepage
The theoretical exercises described below are quite difficult. At the exam, you can expect a simplified version of them; otherwise, they would represent a formidable challenge for most of you.
On the other hand, the data analyses are more or less aligned with what you may encounter in the final examination.
Data analysis
Heart dataset
The data in the Heart dataframe included in the MLGdata library report the number of confirmed myocardial infarctions in a sample of 360 patients hospitalized with suspected infarction. These data were originally collected in Edinburgh Royal Infirmary; the original study was published as
Smith, A. F. (1967) Diagnostic value of serum-creatinine-kinase in a coronary care unit. Lancet, 2, 178.
For each level of the enzyme Creatine Kinase (IU per liter), grouped into classes (ck), the dataset provides:
- the number of confirmed infarctions (
ha),
- the number of non-confirmed infarctions (
nha),
- and the midpoint value of the variable
ck(mck).
The goal is to evaluate the influence of the Creatine Kinase enzyme level on the probability of infarction.
Import the data and then:
Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, qualitative nominal, qualitative ordinal).
Conduct a graphical analysis to assess whether a linear regression model may be appropriate.
Specify a generalized linear model to analyze the problem.
Fit in R the generalized linear model from the previous point and comment on the results.
Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained estimates.
For each element of the
summaryoutput of theglmobject in R, indicate which quantity is being calculated, matching them with the formulas in the slides.Evaluate the goodness of fit of the model.
Assess whether it is appropriate to introduce a quadratic term in
mckin the linear predictor and evaluate the goodness of fit of the expanded model.Plot the scatter diagram of the points (x_i, y_i), i = 1, \dots, 13, with x_i equal to
mckand y_i equal to the corresponding proportion of infarctions. Superimpose on this diagram the curves of the predicted values from the two fitted models.Evaluate whether the introduction of the quadratic term significantly increases the variability of the estimates. Discuss the interpretability of the quadratic model.
Obtain a 95% confidence interval for the probability of infarction corresponding to a Creatine Kinase value of 150.
Germination dataset
The data contained in the Germination dataframe, that can be found in the MLGdata library, were obtained from a 2 \times 2 factorial experiment (two factors seed and root, each with two levels) to compare two seeds, Orobanche aegyptiaca 75 (seed = 075) and Orobanche aegyptiaca 73 (seed = 073), germinated on two different root extracts: bean (root = F) and cucumber (root = C). The source of the data is
Cox, D. R. e Snell, E. J. (1989). Analysis of Binary Data, 2nd ed. London: Chapman & Hall/CRC.
For each combination, the total number of seeds m_i (variable m) and the number of seeds that germinate S_i (variable s) are reported.
Import the data and then:
Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, qualitative nominal, qualitative ordinal).
Specify an appropriate model to evaluate the effect of the two factors (seed and root used) on the probability of germination. Indicate both the model without interaction and the model with interaction.
Verify that the model with interaction is equivalent to a model that assumes the number of seeds that germinate in each experiment is a realization of a binomial distribution with success probabilities \pi_i, i=1,\dots,21, corresponding to:
- \pi_{75F} for Orobanche aegyptiaca 75 on bean root;
- \pi_{75C} for Orobanche aegyptiaca 75 on cucumber root;
- \pi_{73F} for Orobanche aegyptiaca 73 on bean root;
- \pi_{73C} for Orobanche aegyptiaca 73 on cucumber root.
- \pi_{75F} for Orobanche aegyptiaca 75 on bean root;
Fit the generalized linear model in R using the canonical link, and assess the significance of the interaction effect.
Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained estimates.
For each element of the
summaryoutput of theglmobject in R, indicate which quantity is calculated, providing a correspondence with the formulas in the slides.Evaluate the goodness of fit of the model, with particular attention to potential overdispersion.
kalythos dataset
Male inhabitants of the Greek island of Kalythos suffer from a congenital eye disease, whose effects become more pronounced at older ages. Data are taken from
Silvey, S. D. (1970), Statistical Inference. Penguin, London.
A sample of male islanders of different ages was examined, and the number of blind individuals was recorded, yielding the results in the table below
| Age | 20 | 35 | 45 | 55 | 70 |
|---|---|---|---|---|---|
| Number of observed individuals | 50 | 50 | 50 | 50 | 50 |
| Number of blind individuals | 6 | 17 | 26 | 37 | 44 |
Using a logit or probit model:
Estimate the
LD50, i.e., the age at which the probability of blindness is equal to 0.5, and its corresponding variance.Compare the results between the logit and probit models.
Code for importing the data in R:
Endometrial dataset
Heinze and Schemper (2002) described a study about endometrial cancer that analyzed how HG = histology grade (0 = low grade, 1 = high grade) relates to three risk factors: NV = neovasculation (1= present, 0 = absent), PI = pulsatility index of arteria uterina (ranging from 0 to 49), and EH = endometrium height (ranging from 0.27 to 3.61).
The Endometrial dataset is available here; the source is the following paper:
Heinze, G., and Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21 (16), 2409–2419.
Import the data and consider the main effects logistic regression model in which the binary variable HG is the response and NV, PI, and EH are the covariates. Then
Compute the contingency table for the variable
HGandNV. Comment the results.Fit a logistic regression model using all covariates. Do you note anything suspicious? Would you trust the estimate for \beta_2, its p-value and the associated confidence intervals? What happened?
The issue with this dataset is known as quasi-separation, which complicates estimation because the maximum likelihood estimate for some coefficients does not exist (e.g., \hat{\beta}_2 = \infty). This dataset is discussed in detail in Section 5.7.1 of Agresti (2015). An advanced solution is provided below, whose details are obviously not required at the exam.
library(brglm2)
m_corrected <- glm(HG ~ NV + PI + EH, family = "binomial", data = Endometrial, method = "brglmFit")
summary(m_corrected)
Call:
glm(formula = HG ~ NV + PI + EH, family = "binomial", data = Endometrial,
method = "brglmFit")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4740 -0.6706 -0.3411 0.3252 2.6123
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.77456 1.48869 2.535 0.011229 *
NV 2.92927 1.55076 1.889 0.058902 .
PI -0.03475 0.03958 -0.878 0.379915
EH -2.60416 0.77602 -3.356 0.000791 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 104.903 on 78 degrees of freedom
Residual deviance: 56.575 on 75 degrees of freedom
AIC: 64.575
Type of estimator: AS_mixed (mixed bias-reducing adjusted score equations)
Number of Fisher Scoring iterations: 6
Theoretical
Let Y \sim \mathrm{Binomial}(m, \pi). Consider the logit parametrization \theta = \log\{\pi/(1-\pi)\}.
Obtain the Wald test for H_0: \theta = 0 against H_1: \theta > 0.
Assume m = 25 and verify that the test statistic with y = 24 is smaller than the test statistic with y = 23. Explain why this behavior is anomalous.
Verify that the anomaly does not occur for the likelihood-ratio test.
Explain how the expression for \text{var}(\hat{\beta}) in logistic regression suggest that the standard errors of \hat{\beta}_j tend to be smaller as you obtain more data.
Answer this for (a) grouped data with m_i increasing, and (b) ungrouped data with n increasing.
Suppose the logistic model holds in which x is uniformly distributed on the interval [0, 100] and \text{logit}(\pi_i) = -2.0 + 0.04 x_i. Randomly generate 100 independent observations from this model. Plot the response residuals against x and against the fitted values.
Why do residual plots for binary data have this appearance?
Let S_i = m_i Y_i be independent S_i \overset{\textup{ind}}{\sim} \text{Binomial}(m_i, \pi_i) random variables for i=1,\dots,n. Consider the null model, in which which \pi_1 = \cdots = \pi_n.
Show that \hat{\pi} = \sum_{i=1}^n m_i Y_i / \sum_{i=1}^n m_i, i.e. the weighted average of the proportions.
When all m_i = 1 (binary data), show that X^2 = n. Conclude that the X^2 is uninformative in this case.
Let Y_i be independent \text{Bernoulli}(\pi_i) random variables, for i=1,\dots,n. For the logit model \text{logit}(\pi_i) = \beta_0 + \beta_1 x_i, show that the deviance depends on \hat{\pi}_i but not on y_i. Hence, it is not useful for checking model fit.
(This exercise and the previous one show that goodness of fit statistics are not useful for binary data.)