Exercises C

Statistics III - CdL SSE

Author

Affiliation

Tommaso Rigon

Università degli Studi di Milano-Bicocca

Homepage

The theoretical exercises described below are quite difficult. At the exam, you can expect a simplified version of them; otherwise, they would represent a formidable challenge for most of you.

On the other hand, the data analyses are more or less aligned with what you may encounter in the final examination.

The vast majority of these exercises are taken from the textbooks Salvan et al. (2020) and Agresti (2015), possibly with a few minor modifications. You can consult these textbooks if you need additional exercises.

Data analysis

Heart dataset - Partial solution: Heart.R

The data in the Heart dataframe included in the MLGdata library report the number of confirmed myocardial infarctions in a sample of 360 patients hospitalized with suspected infarction. These data were originally collected in Edinburgh Royal Infirmary; the original study was published as

Smith, A. F. (1967) Diagnostic value of serum-creatinine-kinase in a coronary care unit. Lancet, 2, 178.

For each level of the enzyme Creatine Kinase (IU per liter), grouped into classes (ck), the dataset provides:

the number of confirmed infarctions (ha),
the number of non-confirmed infarctions (nha),
and the midpoint value of the variable ck (mck).

The goal is to evaluate the influence of the Creatine Kinase enzyme level on the probability of infarction.

Import the data and then:

Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, qualitative nominal, qualitative ordinal).
Conduct a graphical analysis to assess whether a linear regression model may be appropriate.
Specify a generalized linear model to analyze the problem.
Fit in R the generalized linear model from the previous point.
Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained estimates.
Plot the scatter diagram of the points (x_i, y_i), i = 1, \dots, 13, with x_i equal to mck and y_i equal to the corresponding proportion of infarctions. Superimpose on this diagram the curves of the predicted values from the two fitted models.
For each element of the summary output of the glm object in R, indicate which quantity is being calculated, matching them with the formulas in the slides.
Obtain a 95% confidence interval for the probability of infarction corresponding to a Creatine Kinase value of 150.
Evaluate the goodness of fit of the model.
Assess whether it is appropriate to introduce a quadratic term in mck in the linear predictor and evaluate the goodness of fit of the expanded model.
Evaluate whether the introduction of the quadratic term increases the variability of the estimates. Discuss the interpretability of the quadratic model.

Germination dataset - Partial solution: Germination.R

The data contained in the Germination dataframe, that can be found in the MLGdata library, were obtained from a 2 \times 2 factorial experiment (two factors seed and root, each with two levels) to compare two seeds, Orobanche aegyptiaca 75 (seed = 075) and Orobanche aegyptiaca 73 (seed = 073), germinated on two different root extracts: bean (root = F) and cucumber (root = C). The source of the data is

Cox, D. R. e Snell, E. J. (1989). Analysis of Binary Data, 2nd ed. London: Chapman & Hall/CRC.

For each combination, the total number of seeds m_i (variable m) and the number of seeds that germinate S_i (variable s) are reported.

Import the data and then:

Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, qualitative nominal, qualitative ordinal).
Specify an appropriate model to evaluate the effect of the two factors (seed and root used) on the probability of germination. Indicate both the model without interaction and the model with interaction.
Verify that the model with interaction is equivalent to a model that assumes the number of seeds that germinate in each experiment is a realization of a binomial distribution with success probabilities \pi_i, i=1,\dots,21, corresponding to:
- \pi_{75F} for Orobanche aegyptiaca 75 on bean root;
- \pi_{75C} for Orobanche aegyptiaca 75 on cucumber root;
- \pi_{73F} for Orobanche aegyptiaca 73 on bean root;
- \pi_{73C} for Orobanche aegyptiaca 73 on cucumber root.
Fit the generalized linear model in R using the canonical link, and assess the significance of the interaction effect.
Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained estimates.
For each element of the summary output of the glm object in R, indicate which quantity is calculated, providing a correspondence with the formulas in the slides.
Evaluate the goodness of fit of the model, with particular attention to potential overdispersion.

kalythos dataset - Partial solution: kalythos.R

Male inhabitants of the Greek island of Kalythos suffer from a congenital eye disease, whose effects become more pronounced at older ages. Data are taken from

Silvey, S. D. (1970), Statistical Inference. Penguin, London.

A sample of male islanders of different ages was examined, and the number of blind individuals was recorded, yielding the results in the table below

Age	20	35	45	55	70
Number of observed individuals	50	50	50	50	50
Number of blind individuals	6	17	26	37	44

Using a logit or probit model:

Estimate the LD50 (lethal dose), i.e., the age at which the probability of blindness is equal to 0.5.
Compare the results between the logit and probit models.
☠️ - Optional. Compute an estimate for the variance of LD50 and use it to construct a confidence interval.

Code for importing the data in R:

kalythos <- data.frame(
  age   = c(20, 35, 45, 55, 70),
  blind = c(6, 17, 26, 37, 44),
  total = c(50, 50, 50, 50, 50)
)

Endometrial dataset - Partial solution: Endometrial.R

Heinze and Schemper (2002) described a study about endometrial cancer that analyzed how HG = histology grade (0 = low grade, 1 = high grade) relates to three risk factors: NV = neovasculation (1= present, 0 = absent), PI = pulsatility index of arteria uterina (ranging from 0 to 49), and EH = endometrium height (ranging from 0.27 to 3.61).

The Endometrial dataset is available here; the source is the following paper:

Heinze, G., and Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21 (16), 2409–2419.

Import the data and consider the main effects logistic regression model in which the binary variable HG is the response and NV, PI, and EH are the covariates. Then

Compute the contingency table for the variable HG and NV. Comment the results.
Fit a logistic regression model using all covariates. Do you note anything suspicious? Would you trust the estimate for \beta_2, its p-value and the associated confidence intervals? What happened?

Theoretical

Exercise A - Solution: notes.pdf

Let Y \sim \mathrm{Binomial}(m, \pi). Consider the logit parametrization \theta = \log\{\pi/(1-\pi)\}.

Obtain the Wald test for H_0: \theta = 0 against H_1: \theta > 0.
Assume m = 25 and verify that the test statistic with y = 24 is smaller than the test statistic with y = 23. Explain why this behavior is anomalous.
Verify that the anomaly does not occur for the likelihood-ratio test.

Exercise B - Solution: notes.pdf

Explain how the expression for \text{var}(\hat{\beta}) in logistic regression suggest that the standard errors of \hat{\beta}_j tend to be smaller as you obtain more data.

Answer this for (a) grouped data with m_i increasing, and (b) ungrouped data with n increasing.

Exercise C

Suppose the logistic model holds in which x is uniformly distributed on the interval [0, 100] and \text{logit}(\pi_i) = -2.0 + 0.04 x_i. Randomly generate 100 independent observations from this model. Plot the response residuals against x and against the fitted values.

Why do residual plots for binary data have this appearance?

Exercise D

Let S_i = m_i Y_i be independent S_i \overset{\textup{ind}}{\sim} \text{Binomial}(m_i, \pi_i) random variables for i=1,\dots,n. Consider the null model, in which which \pi_1 = \cdots = \pi_n.

Show that \hat{\pi} = \sum_{i=1}^n m_i Y_i / \sum_{i=1}^n m_i, i.e. the weighted average of the proportions.
When all m_i = 1 (binary data), show that X^2 = n. Conclude that the X^2 is uninformative in this case.

Exercise E

Let Y_i be independent \text{Bernoulli}(\pi_i) random variables, for i=1,\dots,n. For the logit model \text{logit}(\pi_i) = \beta_0 + \beta_1 x_i, show that the deviance depends on \hat{\pi}_i but not on y_i. Hence, it is not useful for checking model fit.

(This exercise and the previous one show that goodness of fit statistics are not useful for binary data.)

References

Agresti, A. (2015), Foundations of Linear and Generalized Linear Models, Wiley.

Salvan, A., Sartori, N., and Pace, L. (2020), Modelli lineari generalizzati, Springer.