Exercises B

Statistics III - CdL SSE

Author: Tommaso Rigon

Affiliation: Università degli Studi di Milano-Bicocca

The theoretical exercises described below are quite difficult. At the exam, you can expect a simplified version of them; otherwise, they would represent a formidable challenge for most of you.

On the other hand, the data analyses are more or less aligned with what you may encounter in the final examination.

The vast majority of these exercises are taken from the textbooks Salvan et al. (2020) and Agresti (2015), possibly with a few minor modifications. You can consult these textbooks if you need additional exercises.

Data analysis

The dataset Seed of the MLGdata library was obtained from an experiment designed to evaluate whether, and to what extent, the amount of fertilizer influences the germination of a seed. Twenty seeds were used, and for each seed:

  • fert indicates the amount of fertilizer,
  • x is a binary variable equal to 1 if the seed germinated, and 0 otherwise.

Import the data and then:

  1. Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, nominal qualitative, ordinal qualitative).

  2. Conduct an exploratory analysis to assess the relationship between the response and the covariates.

  3. Specify an appropriate Generalized Linear Model (GLM), adopting the canonical link function.

  4. Fit the GLM specified in point 3 using R.

  5. Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained values.

  6. For each element in the summary output of the fitted glm object in R, indicate what quantity is being computed, providing the correspondence with the formulas in the slides.
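The workflow in points 3 to 5 can be sketched in R as follows. This is only an illustration under stated assumptions: the data frame `seed_sim` is simulated as a hypothetical stand-in for `Seed` (so that the code runs without the MLGdata library), the coefficients used to generate it are arbitrary, and the intervals are Wald intervals; other choices, such as profile-likelihood intervals, are equally legitimate.

```r
# Hypothetical stand-in for the Seed data: 20 seeds, fertilizer amount (fert),
# germination indicator (x). The generating coefficients are arbitrary.
set.seed(123)
seed_sim <- data.frame(fert = seq(0.5, 10, length.out = 20))
seed_sim$x <- rbinom(20, size = 1, prob = plogis(-1 + 0.25 * seed_sim$fert))

# Exploratory look: germination proportion below and above the median dose
tapply(seed_sim$x, seed_sim$fert > median(seed_sim$fert), mean)

# Bernoulli GLM with the canonical (logit) link
fit_logit <- glm(x ~ fert, family = binomial(link = "logit"), data = seed_sim)
summary(fit_logit)

# Wald 95% confidence intervals on the log-odds scale ...
ci_wald <- confint.default(fit_logit, level = 0.95)

# ... and on the odds-ratio scale, which is easier to interpret
exp(cbind(estimate = coef(fit_logit), ci_wald))
```

With the real data, replacing `seed_sim` by the imported `Seed` data frame should be the only change needed, provided the column names match those listed above.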

The data in the Wool dataset (Hand et al., 1994, p. 328), contained in the MLGdata library, were obtained from an experiment aimed at evaluating the effect of three variables, length (x1), width (x2), and load (x3), on the number of test cycles until rupture (y) of a wool yarn.

For each of the three variables x1, x2, and x3, three levels were fixed:

  • Length: 250, 300, 350 mm (coded as -1, 0, 1)
  • Width: 8, 9, 10 mm (coded as -1, 0, 1)
  • Load: 40, 45, 50 g (coded as -1, 0, 1)

Import the data and then:

  1. Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, qualitative nominal, qualitative ordinal).

  2. Specify a normal linear model for the logarithmic transformation of the response.

  3. Fit in R the linear model from the previous point.

  4. Assess the goodness of fit of the model and consider whether a transformation of the response other than the logarithmic one may be more appropriate.

  5. Write down the expression of the estimated curve.

  6. Obtain a 95% confidence interval for the mean number of cycles to rupture for a test with length = 300 mm, width = 10 mm, and load = 40 g. For the same values of length, width, and load, obtain a prediction interval for the response.

  7. For the same data, considering the untransformed response, specify a generalized linear model with Gamma response and logarithmic link function.

  8. Fit in R the generalized linear model from the previous point.

  9. Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained values.

  10. For each element of the summary output of the glm object in R, indicate which quantity is being calculated, matching them with the formulas in the slides.

  11. Evaluate the goodness of fit of the Gamma model.

  12. Write down the expression of the estimated curve.

  13. Obtain a 95% confidence interval for the mean response in an experiment with length = 300 mm, width = 10 mm, and load = 40 g, using the fitted Gamma model.

  14. Compare the results of the analysis based on the normal linear model with those of the analysis based on the Gamma model.
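A possible skeleton for the two analyses is sketched below. It rests on several assumptions: `wool_sim` is a simulated, hypothetical stand-in for `Wool`, the three levels of each factor are assumed to be coded -1, 0, 1 from lowest to highest, and the generating coefficients are arbitrary. Note that exponentiating an interval computed on the log scale in the linear model gives an interval for exp(E[log Y]), which under normality of log Y is the median rather than the mean of Y; whether this is acceptable is part of the comparison asked for in the last point.

```r
# Hypothetical stand-in for the Wool data: 3^3 factorial design, coded levels -1, 0, 1
set.seed(1)
wool_sim <- expand.grid(x1 = c(-1, 0, 1), x2 = c(-1, 0, 1), x3 = c(-1, 0, 1))
mu_true <- exp(6.3 + 0.8 * wool_sim$x1 - 0.6 * wool_sim$x2 - 0.4 * wool_sim$x3)
wool_sim$y <- rgamma(27, shape = 30, rate = 30 / mu_true)

# Normal linear model for log(y)
fit_lm <- lm(log(y) ~ x1 + x2 + x3, data = wool_sim)

# Gamma GLM with log link for the untransformed response
fit_gamma <- glm(y ~ x1 + x2 + x3, family = Gamma(link = "log"), data = wool_sim)

# length = 300 mm, width = 10 mm, load = 40 g in coded units (assumed coding)
new_x <- data.frame(x1 = 0, x2 = 1, x3 = -1)

# Linear model: confidence and prediction intervals on the log scale, mapped back with exp()
ci_lm <- exp(predict(fit_lm, newdata = new_x, interval = "confidence", level = 0.95))
pi_lm <- exp(predict(fit_lm, newdata = new_x, interval = "prediction", level = 0.95))

# Gamma GLM: Wald interval for the linear predictor, mapped back with exp()
pred <- predict(fit_gamma, newdata = new_x, type = "link", se.fit = TRUE)
ci_gamma <- exp(pred$fit + c(-1, 1) * qnorm(0.975) * pred$se.fit)

ci_lm
pi_lm
ci_gamma
```

Building the interval on the linear-predictor scale and then transforming keeps the limits positive, which would not be guaranteed by a Wald interval computed directly on the scale of the mean.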

Theoretical

Let Y be a random variable with an inverse Gaussian distribution, with support (0, \infty) and probability density function

f(y \mid \xi, \lambda) = \left( \frac{\lambda}{2 \pi} \right)^{1/2}y^{-3/2}e^{\sqrt{\lambda\xi}}\exp\left\{-\frac{1}{2}\left(\frac{\lambda}{y} + \xi y\right)\right\}, \qquad y > 0, \; \xi \ge 0, \; \lambda > 0.

Show that this distribution belongs to the exponential dispersion family, and identify its characteristic elements (canonical parameter, a_i(\cdot), b(\cdot), c(\cdot) functions, dispersion parameter, variance function).
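As a reminder of the target, the exponential dispersion form is usually written as below. This is the common parametrization (as in Agresti, 2015) and is stated here as an assumption: the slides may use slightly different symbols, for instance a_i(\phi) = \phi / \omega_i with known weights \omega_i in place of a(\phi).

```latex
f(y; \theta, \phi) = \exp\left\{ \frac{\theta y - b(\theta)}{a(\phi)} + c(y, \phi) \right\},
\qquad
\mathbb{E}(Y) = \mu = b'(\theta),
\qquad
\text{var}(Y) = a(\phi)\, b''(\theta),
\qquad
V(\mu) = b''(\theta(\mu)).
```

Solving an exercise of this kind then amounts to rewriting the given density so that the terms linear in y, the terms depending only on the parameters, and the terms depending only on y and the dispersion can be matched with \theta, b(\theta), and c(y, \phi), respectively.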

Let Y be a random variable with a negative binomial distribution, representing the number of independent Bernoulli trials with constant success probability \pi \in (0,1) required to obtain k successes. The support is S = \{k, k+1, \dots\} and the probability mass function is P(Y = y) = \binom{y-1}{k-1} \pi^k (1-\pi)^{\,y-k}, \qquad y \in S.

  1. Verify that, assuming k is known, this distribution belongs to the exponential-dispersion family and identify its characteristic elements (canonical parameter, a_i(\cdot), b(\cdot), c(\cdot) functions, dispersion parameter, variance function).

  2. Using the properties of exponential families, recover the well-known relations
    \mathbb{E}(Y) = \frac{k}{\pi}, \qquad \text{var}(Y) = \frac{k(1-\pi)}{\pi^2}.

  3. Show that the variance function is quadratic in \mu.

  4. Does the distribution still form an exponential-dispersion family if k is treated as unknown?
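Before attempting the algebra, the moments implied by the probability mass function above can be evaluated numerically. The sketch below uses arbitrary illustrative values k = 3 and success probability 0.4 (named `p` in the code to avoid clashing with R's constant `pi`), truncates the infinite support far in the tail, and reports the mean of both the number of trials Y and the number of failures Y - k, since the two are easily confused.

```r
k <- 3
p <- 0.4                      # success probability (called pi in the text)
y <- k:2000                   # support {k, k + 1, ...}, truncated far in the tail

# pmf of the number of trials needed to obtain k successes
pmf <- choose(y - 1, k - 1) * p^k * (1 - p)^(y - k)

total         <- sum(pmf)                        # should be 1 up to rounding
mean_trials   <- sum(y * pmf)                    # E(Y)
mean_failures <- sum((y - k) * pmf)              # E(Y - k)
var_trials    <- sum((y - mean_trials)^2 * pmf)  # var(Y) = var(Y - k)

c(total = total, mean_trials = mean_trials,
  mean_failures = mean_failures, var_trials = var_trials)

# Cross-check: R's dnbinom() is parametrized by the number of failures y - k
max(abs(pmf - dnbinom(y - k, size = k, prob = p)))
```

Repeating the computation for a few values of `k` and `p` gives a quick way to validate whatever closed-form expressions for the mean and the variance function come out of the derivation.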

  1. Specialize the general formulas and re-obtain the likelihood equations for the binomial regression model (Beetles data with p = 2) presented in the slides, using the canonical link function.

  2. Specialize the general formulas and re-obtain the likelihood equations for the Poisson regression model (Aids data with p = 2) presented in the slides, using the canonical link function.

  3. Obtain the likelihood equations for a binomial regression model (Beetles data with p = 2), using the Cauchy link, defined as g(\mu) = \tan(\pi(\mu - 1/2)), where \pi denotes the mathematical constant. This link maps the unit interval onto the real line, that is, g : (0, 1) \to \mathbb{R}.

Tip: there are almost no calculations to do at points 1 and 2; this exercise is designed to make you familiar with the notation and the general formulas. Point 3 is more elaborate, and you will need the derivative of the Cauchy link, which is g'(\mu) = \pi /\sin^2(\pi \mu).
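The derivative of the Cauchy link quoted in the tip can be verified numerically. The sketch below compares it with a central finite difference and with R's built-in `cauchit` link object; the grid of values for \mu and the step size `h` are arbitrary choices.

```r
# Cauchy (cauchit) link and its derivative with respect to mu
g <- function(mu) tan(pi * (mu - 0.5))
g_prime <- function(mu) pi / sin(pi * mu)^2

mu <- seq(0.05, 0.95, by = 0.05)

# Central finite differences as an independent check of g'
h <- 1e-6
g_prime_num <- (g(mu + h) - g(mu - h)) / (2 * h)

# R's link object: linkfun is g, mu.eta is d mu / d eta, i.e. 1 / g'(mu)
cauchit <- make.link("cauchit")

max(abs(g(mu) - cauchit$linkfun(mu)))               # link function agrees
max(abs(g_prime(mu) - 1 / cauchit$mu.eta(g(mu))))   # derivative agrees with R
max(abs(g_prime(mu) - g_prime_num) / g_prime(mu))   # and with finite differences
```

All three discrepancies should be negligible. In practice the same model can be fitted with `glm(..., family = binomial(link = "cauchit"))`, which is convenient for checking hand-derived likelihood equations at the fitted values.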

Explicitly derive the contribution of a single observation to the deviance (i.e. d_i) for a Gamma generalized linear model with a canonical link (inverse link). Then, write down the expression of the deviance D(\hat{\boldsymbol{\mu}}; \boldsymbol{y}) for a sample of size n.
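Once an expression for d_i has been derived, it can be checked against the unit deviances that R computes internally. The sketch below uses simulated data with arbitrary coefficients (an assumption made only to have something to fit) and relies on the `dev.resids` component of the `Gamma()` family object, taking all prior weights equal to 1.

```r
set.seed(42)
n <- 50
x <- runif(n, 1, 3)
# Gamma response with mean 1 / (0.5 + 0.3 x), so the inverse link is linear in x
y <- rgamma(n, shape = 5, rate = 5 * (0.5 + 0.3 * x))

fit_inv <- glm(y ~ x, family = Gamma(link = "inverse"))
mu_hat <- fitted(fit_inv)

# Unit deviances d_i as computed by R (prior weights equal to 1)
d_i <- Gamma()$dev.resids(y, mu_hat, wt = rep(1, n))

# Their sum is the (unscaled) residual deviance reported by glm()
c(sum_d_i = sum(d_i), deviance = deviance(fit_inv))
```

A hand-derived formula can then be coded as a function of `y` and `mu_hat` and compared with `d_i` element by element.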

Let Y_i be independent, with corresponding values x_1,\dots,x_n of a univariate quantitative covariate. Consider a GLM with no intercept, \eta_i = \beta x_i, \qquad i=1,\dots,n, under each of the following exponential–dispersion models (mean–variance pairs):

  1. Y_i \sim \text{ED}(\mu_i, \phi) with V(\mu_i)=1, with \phi known.
  2. Y_i \sim \text{ED}(\mu_i,\phi) with V(\mu_i)=\mu_i^2, with \phi known.
  3. Y_i \sim \text{ED}(\mu_i,\phi_i) with V(\mu_i)=\mu_i(1-\mu_i).
  4. Y_i \sim \text{ED}(\mu_i,\phi) with V(\mu_i)=\mu_i.

For each model:

  1. Specify the statistical model in terms of the “standard” probability distribution and write the log-likelihood as a function of \beta.

  2. Derive the score function \ell_*(\beta), the observed information J, and the expected Fisher information I.

  3. Obtain the maximum likelihood estimator \hat\beta and provide an approximation to its sampling distribution.

  4. For a fixed covariate value x_i, construct an approximate 95% confidence interval for the linear predictor \eta_i = \beta x_i. Then derive the corresponding confidence interval for \mu_i. Indicate in which of the four cases these intervals are also exact 95% intervals.
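Analytical results for the score and the information can be checked numerically. The sketch below does so for the case V(\mu_i) = \mu_i, read here as a Poisson model with log link; the link is an assumption, since the text does not fix it, and the simulated data and the true value of \beta are arbitrary. The observed information is approximated by a numerical Hessian of the negative log-likelihood and compared with the standard error reported by `glm`.

```r
set.seed(7)
n <- 40
x <- runif(n, 0.2, 1.5)
y <- rpois(n, lambda = exp(0.8 * x))

# GLM with no intercept: eta_i = beta * x_i
fit0 <- glm(y ~ 0 + x, family = poisson(link = "log"))
beta_hat <- unname(coef(fit0))

# Negative log-likelihood as a function of the scalar beta
negloglik <- function(b) -sum(dpois(y, lambda = exp(b * x), log = TRUE))

# Observed information at beta_hat via a numerical Hessian
J_num <- optimHess(beta_hat, negloglik)[1, 1]

# Approximate standard error: numerical versus glm()
se_num <- 1 / sqrt(J_num)
c(se_numeric = se_num, se_glm = sqrt(vcov(fit0)[1, 1]))

# Wald 95% intervals for beta and, at a fixed x0, for eta and mu
ci_beta <- beta_hat + c(-1, 1) * qnorm(0.975) * se_num
x0 <- 1
ci_eta <- x0 * ci_beta
ci_mu <- exp(ci_eta)
```

The two standard errors agree because, with a canonical link, observed and expected information coincide; with a non-canonical link the numerical Hessian would instead correspond to the observed information only.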

Suppose y_i has a Poisson distribution with g(\mu_i) = \beta_0 + \beta_1 x_i, where x_i = 1 for i = 1, \dots, n_A (group A) and x_i = 0 for i = n_A + 1, \dots, n_A + n_B (group B), i.e. a dummy variable. Assume all observations are independent.

  1. Show that, for the log-link function g(\mu) = \log{\mu}, the GLM likelihood equations imply that the fitted means \hat\mu_A and \hat\mu_B equal the respective sample means.

  2. Using the likelihood equations, show that the same result holds for any link function for this Poisson model.

  3. Using the likelihood equations, show that the same result holds for any GLM of the form g(\mu_i) = \beta_0 + \beta_1 x_i with a binary indicator predictor.
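The claim in the first two points can be illustrated empirically, which is of course not a proof. The sketch below simulates two groups with arbitrary sizes and means (assumptions made only for illustration) and fits the Poisson model under three different links; `fit_means` is a hypothetical helper introduced here.

```r
set.seed(2024)
n_A <- 15
n_B <- 25
x <- rep(c(1, 0), times = c(n_A, n_B))         # 1 = group A, 0 = group B
y <- rpois(n_A + n_B, lambda = ifelse(x == 1, 6, 3))

sample_means <- c(B = mean(y[x == 0]), A = mean(y[x == 1]))

# Hypothetical helper: fitted mean of each group under a given link
fit_means <- function(lk) {
  fit <- glm(y ~ x, family = poisson(link = lk))
  mu <- fitted(fit)
  c(B = unname(mu[x == 0][1]), A = unname(mu[x == 1][1]))
}

fitted_by_link <- vapply(c("log", "sqrt", "identity"), fit_means, numeric(2))

# One row per link, plus the sample means for comparison
rbind(t(fitted_by_link), sample = sample_means)
```

Up to the convergence tolerance of the fitting algorithm, every row of the output should coincide with the sample means, whatever the link.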

In a generalized linear model that uses a non-canonical link function, explain why it need not be true that
\sum_{i=1}^n \hat{\mu}_i = \sum_{i=1}^n y_i. Hence, the residuals need not have a mean of 0. Then, explain why a GLM with a canonical link function requires the inclusion of an intercept term in order to ensure that this equality holds.
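A small numerical experiment can make the statement concrete before explaining it. The sketch below uses simulated Poisson data with arbitrary coefficients (an assumption for illustration only) and compares the sum of the raw residuals under three fits: canonical link with intercept, canonical link without intercept, and a non-canonical (square-root) link with intercept.

```r
set.seed(99)
n <- 60
x <- runif(n, 0, 2)
y <- rpois(n, lambda = exp(0.3 + 0.7 * x))

fit_canon <- glm(y ~ x, family = poisson(link = "log"))      # canonical, with intercept
fit_noint <- glm(y ~ 0 + x, family = poisson(link = "log"))  # canonical, no intercept
fit_sqrt  <- glm(y ~ x, family = poisson(link = "sqrt"))     # non-canonical, with intercept

# Sum of raw residuals y_i - mu_hat_i for each fit
resid_sums <- sapply(
  list(canonical = fit_canon, no_intercept = fit_noint, sqrt_link = fit_sqrt),
  function(fit) sum(y - fitted(fit))
)
resid_sums
```

Only the first sum is guaranteed to be zero, up to the convergence tolerance of the algorithm; the other two are generally different from zero, although they may happen to be small for a particular dataset.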

In selecting explanatory variables for a linear model, what is inadequate about the strategy of selecting the model with lowest deviance (or the highest R^2 in linear models)?
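The behaviour behind this question is easy to reproduce. In the sketch below, which uses simulated data with arbitrary settings, five covariates that are unrelated to the response by construction are added to a correctly specified linear model; the object names are hypothetical.

```r
set.seed(11)
n <- 50
x <- rnorm(n)
noise <- matrix(rnorm(n * 5), n, 5)   # five covariates unrelated to y by construction
y <- 1 + 2 * x + rnorm(n)

fit_small <- lm(y ~ x)
fit_big <- lm(y ~ x + noise)

# Residual sum of squares (the deviance of a normal linear model) and R^2
c(dev_small = deviance(fit_small), dev_big = deviance(fit_big))
c(R2_small = summary(fit_small)$r.squared, R2_big = summary(fit_big)$r.squared)
```

Because the two models are nested, the larger one cannot have a higher deviance or a lower R^2 than the smaller one, regardless of whether the additional covariates carry any information.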

References

Agresti, A. (2015), Foundations of Linear and Generalized Linear Models, Wiley.
Salvan, A., Sartori, N., and Pace, L. (2020), Modelli lineari generalizzati, Springer.