Exercises B

Statistics III - CdL SSE

Author

Affiliation

Tommaso Rigon

Università degli Studi di Milano-Bicocca

Homepage

The theoretical exercises described below are quite difficult. At the exam, you can expect a simplified version of them; otherwise, they would represent a formidable challenge for most of you.

On the other hand, the data analyses are more or less aligned with what you may encounter in the final examination.

The vast majority of these exercises are taken from the textbooks Salvan et al. (2020) and Agresti (2015), possibly with a few minor modifications. You can consult these textbooks if you need additional exercises.

Data analysis

Seed dataset - Partial solution: seed.R

The dataset Seed of the MLGdata library was obtained from an experiment designed to evaluate whether, and to what extent, the amount of fertilizer influences the germination of a seed. Twenty seeds were used, and for each seed:

fert indicates the amount of fertilizer,
x is a binary variable equal to 1 if the seed germinated, and 0 otherwise.

Import the data and then:

Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, nominal qualitative, ordinal qualitative).
Conduct an exploratory analysis to assess the relationship between the response and the covariates.
Using pen and paper, specify an appropriate GLM adopting the canonical link function.
Fit the GLM specified in (c) using R.
Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained values.
For each element in the summary output of the fitted glm object in R, indicate what quantity is being computed, providing the correspondence with the formulas in the slides.

Wool dataset - Partial solution: Wool.R

The dataset Wool contained in the MLGdata library were obtained from an experiment aimed at evaluating the effect of three variables, length (x1), width (x2), and load (x3), on the number of test cycles until rupture (y) of a wool yarn. The data were originally analyzed in the famous paper Box and Cox (1964) paper:

Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series A, 143, 383-430.

For each of the three variables x1, x2, and x3, three levels were fixed:

Length: 250, 300, 350 mm (coded as -1, 0, 1)
Width: 8, 9, 10 mm (coded as -1, 0, 1)
Load: 40, 45, 50 g (coded as -1, 0, 1)

Import the data and then:

Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, qualitative nominal, qualitative ordinal).
Using pen and paper, specify a normal linear model for the logarithmic transformation of the response.
Fit in R the linear model from the previous point.
Assess the goodness of fit of the model, compare the usual standard errors with the robust ones. Discuss whether a transformation of the response other than the logarithmic one may be more appropriate.
Write down the expression of the estimated curve.
Obtain a 95% confidence interval for the mean number of cycles to rupture for a test with length = 300 mm, width = 10 mm, and load = 40 g. For the same values of length, width, and load, obtain a prediction interval for the response.
For the same data, considering the untransformed response and using pen and paper, specify a generalized linear model with Gamma response and logarithmic link function.
Fit in R the generalized linear model from the previous point.
Write down the expression of the estimated curve.
Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained values.
For each element of the summary output of the glm object in R, indicate which quantity is being calculated, matching them with the formulas in the slides.
Evaluate the goodness of fit of the Gamma model.
Obtain a 95% confidence interval for the mean response in an experiment with length = 300 mm, width = 10 mm, and load = 40 g, using the fitted Gamma model.
Compare the results of the analysis based on the normal linear model with those of the analysis based on the Gamma model.

dde dataset - Partial solution: dde.R

The dataset dde dataset collect data from a sample of n = 2312 pregnant women and can be downloaded here. This dataset was also analyzed in Statistica I. The original data are described in this study:

Longnecker, M. P., Klebanoff, M. A., Zhou, H., J. W. Brock (2001). Association between maternal serum concentration of the DDT metabolite DDE and preterm and small-for-gestational-age babies at birth. Lancet, 358 (9276), 110-114.

The DDE is measured, i.e., a substance related to DDT, present in maternal serum during the third trimester of pregnancy. The variable GAD (Gestational Age at Delivery) measures the day of pregnancy on which delivery occurred.

DDT is extremely effective against malaria mosquitoes and is therefore widely used in areas where malaria is endemic. At the same time, DDT may pose a health risk, especially for pregnant women.

Research question. We are interested in understanding the impact of DDE on the GAD, and in particular in estimating the pre-term delivery (< 37th week, i.e. < 259 days).

Import the data and then:

Specify a several “reasonable” models for the main research question. Decide which model is the most appropriate, based on predictive accuracy and diagnostic tools.
Does the DDT impact the gestational age at delivery?
Provide a prediction for the GAD for a mother being exposed to a dose of 100 (mg/L) of DDE and compare it with the prediction for a mother not being exposed to DDE.
Compute the probabilities of pre-term birth for all the observed values, namely \psi_i = \mathbb{P}(Y_i < 259; \bm{x}_i), \qquad i=1,\dots,n. Do they differ across models? Hint: use the function pnorm if you considered a normal model and pgamma if you considered a gamma GLM. Be careful with the parametrization!
Provide an estimate for the probability of pre-term delivery \psi for a mother being exposed to a dose of 100 (mg/L) of DDE and compare it with the probability for a mother not being exposed to DDE.
☠️ - Optional. Provide a prediction interval for the GAD for a mother being exposed to a dose of 100 (mg/L) of DDE and compare it with the prediction for a mother not being exposed to DDE.

Houses dataset - Solution: Sections 3.4 and 4.7 in Agresti (2015).

What affects the selling price of a house? The Houses dataset is available here and comprises n = 100 observations on recent home sales in Gainesville, Florida.

Variables listed are selling price (in thousands of dollars, price), size of house (in square feet, size), annual property tax bill (in dollars, taxes), number of bedrooms (beds), number of bathrooms (baths), and whether the house is new (new, with 1 = new and 0 = not new). The original source of the dataset is the textbook:

Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.

Since these 100 observations are from one city alone, we cannot use them to make inferences about the relationships in general. But for illustrative purposes, we treat them as a random sample of a conceptual population of home sales in this market and analyze how selling price seems to relate to these characteristics. The goal is the predict the sale price as a function of other characteristics

Import the data and then

Summarize the data with descriptive statistics and plots.
Using a forward/stepwise/backward selection procedures with all five predictors together with judgments about practical significance, select and interpret a (generalized) linear model for selling price.
Check whether results depend on any influential observations.

This exercise is intentionally left without strict instructions, so you can freely explore the data and develop your own modeling strategy. As is often the case, multiple approaches can yield reasonable results.

Theoretical

Exercise A - Solution: notes.pdf

Let Y be a random variable with an inverse Gaussian distribution, with support [0, \infty) and probability density function

p(y ; \xi, \lambda) = \left( \frac{\lambda}{2 \pi} \right)^{1/2}y^{-3/2}e^{\sqrt{\lambda\xi}}\exp\left\{-\frac{1}{2}\left(\frac{\lambda}{y} + \xi y\right)\right\}, \qquad y > 0, \; \xi \ge 0, \; \lambda > 0. Show that this distribution belongs to the exponential dispersion family, and identify its characteristic elements (canonical parameter, a_i(\cdot), b(\cdot), c(\cdot) functions, dispersion parameter, variance function).

Exercise B - Solution: notes.pdf

Let Y be a random variable with a negative binomial distribution, representing the number of independent Bernoulli trials with constant success probability \pi \in (0,1) required to obtain k successes. The support is S = \{k, k+1, \dots\} and the probability mass function is p(y; \pi) = \binom{y-1}{k-1} \pi^k (1-\pi)^{\,y-k}, \qquad y \in S.

Verify that, assuming k is known, this distribution belongs to the exponential-dispersion family and identify its characteristic elements: canonical parameter, a_i(\cdot), b(\cdot), c(\cdot) functions, dispersion parameter, and variance function.
Using the properties of exponential families, recover the well-known relations
\mathbb{E}(Y) = \frac{k}{\pi}, \qquad \text{var}(Y) = \frac{k(1-\pi)}{\pi^2}.
Show that the variance function is quadratic in \mu.
Does the distribution still form an exponential-dispersion family if k is treated as unknown?

Exercise C - Solution: notes.pdf

Consider the general likelihood equations for a GLM:

Specialize the general formulas and re-obtain the likelihood equations for the binomial regression model (Beetles data with p = 2) presented in the slides, using the canonical link function.
Specialize the general formulas and re-obtain the likelihood equations for the Poisson regression model (Aids data with p = 2) presented in the slides, using the canonical link function.
Obtain the likelihood equations for a binomial regression model (Beetles data with p = 2), using the Cauchy link, which is defined as g(\mu) = \tan(\pi(\mu - 1/2)). This link indeed is such that g(\mu):(0, 1)\to \mathbb{R}.

Hint: there are almost no calculations to do at points i. and ii., only substitutions. This exercise is designed to make you familiar with the notation and the general formulas. Point iii. is more elaborate and you will need to use the derivative of the Cauchy link, which is g'(\mu) = \pi /\sin^2(\pi \mu).

Exercise D - Solution: notes.pdf

Explicitly derive the contribution of a single observation to the deviance (i.e. d_i) for a Gamma generalized linear model. Then, write down the expression of the deviance D(\hat{\bm{\mu}}; \boldsymbol{y}) for a sample of size n.

Exercise E - Solution: notes.pdf

Let Y_i be independent random variables, with corresponding values x_1,\dots,x_n of a univariate quantitative covariate. Consider a GLM with no intercept and p=1, i.e. \eta_i = \beta x_i, \qquad i=1,\dots,n, under each of the following exponential–dispersion models (mean–variance pairs):

Y_i \sim \text{ED}(\mu_i, \phi) and \mu_i \in \mathbb{R}, with \phi > 0 known.
Y_i \sim \text{ED}(\mu_i, \mu_i) and \mu_i \in (0, \infty).
Y_i \sim \text{ED}(\mu_i,\mu_i(1 - \mu_i)) and \mu_i \in (0, 1).

For each model consider the canonical link and:

Specify the statistical model in terms of the “standard” probability distribution and write the log-likelihood as a function of \beta.
Derive the score function \ell_*(\beta) and the observed/expected Fisher information I.
Obtain the maximum likelihood estimator \hat\beta and provide an approximation to its sampling distribution.
For a fixed covariate value x_i, construct an approximate 1-\alpha confidence interval for the linear predictor \eta_i = \beta x_i. Then derive the corresponding confidence interval for \mu_i. Indicate in which of the four cases these intervals are also exact 1-\alpha intervals.

Exercise F - Solution: notes.pdf

Suppose Y_i has a Poisson distribution with g(\mu_i) = \beta_1 + \beta_2 x_i, where x_i = 1 for i = 1, \dots, n_A (group A) and x_i = 0 for i = n_A + 1, \dots, n_A + n_B (group B), i.e. a dummy variable. Assume all observations are independent.

Show that, for the log-link function g(x) = \log{x}, the GLM likelihood equations imply that the fitted means \hat\mu_A and \hat\mu_B equal the respective sample means.
Using the likelihood equations, show that the same result holds for any link function for this Poisson model,
Using the likelihood equations, show that the same result holds for any GLM of the form g(\mu_i) = \beta_1 + \beta_2 x_i with a binary indicator predictor. Note: if the weights \omega_i are non-constant, then \hat\mu_A and \hat\mu_B equal the respective weighted means.

Exercise G - Solution: notes.pdf

In a generalized linear model that uses a non-canonical link function, explain why it need not be true that
\sum_{i=1}^n \hat{\mu}_i = \sum_{i=1}^n y_i. Hence, the response residuals need not have a mean of 0.

Then, explain why a GLM with a canonical link function and constant weights \omega_i requires the inclusion of an intercept term in order to ensure that this equality holds.

Exercise H - Solution: notes.pdf

In selecting explanatory variables for a linear model, what is inadequate about the strategy of selecting the model with lowest deviance (or the highest R^2 in linear models)?

References

Agresti, A. (2015), Foundations of Linear and Generalized Linear Models, Wiley.

Salvan, A., Sartori, N., and Pace, L. (2020), Modelli lineari generalizzati, Springer.