Exercises B
Statistics III - CdL SSE
Tommaso Rigon
Università degli Studi di Milano-Bicocca
Homepage
The theoretical exercises described below are quite difficult. At the exam, you can expect a simplified version of them; otherwise, they would represent a formidable challenge for most of you.
On the other hand, the data analyses are more or less aligned with what you may encounter in the final examination.
Data analysis
The dataset Seed of the MLGdata library was obtained from an experiment designed to evaluate whether, and to what extent, the amount of fertilizer influences the germination of a seed. Twenty seeds were used, and for each seed:
fertindicates the amount of fertilizer,
xis a binary variable equal to1if the seed germinated, and0otherwise.
Import the data and then:
Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, nominal qualitative, ordinal qualitative).
Conduct an exploratory analysis to assess the relationship between the response and the covariates.
Using pen and paper, specify an appropriate GLM adopting the canonical link function.
Fit the GLM specified in (c) using R.
Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained values.
For each element in the
summaryoutput of the fittedglmobject in R, indicate what quantity is being computed, providing the correspondence with the formulas in the slides.
The dataset Wool contained in the MLGdata library were obtained from an experiment aimed at evaluating the effect of three variables, length (x1), width (x2), and load (x3), on the number of test cycles until rupture (y) of a wool yarn. The data were originally analyzed in the famous paper Box and Cox (1964) paper:
Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series A, 143, 383-430.
For each of the three variables x1, x2, and x3, three levels were fixed:
- Length: 250, 300, 350 mm (coded as
-1,0,1)
- Width: 8, 9, 10 mm (coded as
-1,0,1)
- Load: 40, 45, 50 g (coded as
-1,0,1)
Import the data and then:
Identify the statistical units, the response variable, and the covariates, specifying the type of each variable (continuous quantitative, discrete quantitative, qualitative nominal, qualitative ordinal).
Using pen and paper, specify a normal linear model for the logarithmic transformation of the response.
Fit in R the linear model from the previous point.
Assess the goodness of fit of the model, compare the usual standard errors with the robust ones. Discuss whether a transformation of the response other than the logarithmic one may be more appropriate.
Write down the expression of the estimated curve.
Obtain a 95% confidence interval for the mean number of cycles to rupture for a test with length = 300 mm, width = 10 mm, and load = 40 g. For the same values of length, width, and load, obtain a prediction interval for the response.
For the same data, considering the untransformed response and using pen and paper, specify a generalized linear model with Gamma response and logarithmic link function.
Fit in R the generalized linear model from the previous point.
Write down the expression of the estimated curve.
Report the estimates and confidence intervals for the coefficients. Provide an interpretation of the obtained values.
For each element of the
summaryoutput of theglmobject in R, indicate which quantity is being calculated, matching them with the formulas in the slides.Evaluate the goodness of fit of the Gamma model.
Obtain a 95% confidence interval for the mean response in an experiment with length = 300 mm, width = 10 mm, and load = 40 g, using the fitted Gamma model.
Compare the results of the analysis based on the normal linear model with those of the analysis based on the Gamma model.
The dataset dde dataset collect data from a sample of n = 2312 pregnant women and can be downloaded here. This dataset was also analyzed in Statistica I. The original data are described in this study:
Longnecker, M. P., Klebanoff, M. A., Zhou, H., J. W. Brock (2001). Association between maternal serum concentration of the DDT metabolite DDE and preterm and small-for-gestational-age babies at birth. Lancet, 358 (9276), 110-114.
The DDE is measured, i.e., a substance related to DDT, present in maternal serum during the third trimester of pregnancy. The variable GAD (Gestational Age at Delivery) measures the day of pregnancy on which delivery occurred.
DDT is extremely effective against malaria mosquitoes and is therefore widely used in areas where malaria is endemic. At the same time, DDT may pose a health risk, especially for pregnant women.
Research question. We are interested in understanding the impact of DDE on the GAD, and in particular in estimating the pre-term delivery (< 37th week, i.e. < 259 days).
Import the data and then:
Specify a several “reasonable” models for the main research question. Decide which model is the most appropriate, based on predictive accuracy and diagnostic tools.
Does the DDT impact the gestational age at delivery?
Provide a prediction for the
GADfor a mother being exposed to a dose of 100 (mg/L) ofDDEand compare it with the prediction for a mother not being exposed toDDE.Compute the probabilities of pre-term birth for all the observed values, namely \psi_i = \mathbb{P}(Y_i < 259; \bm{x}_i), \qquad i=1,\dots,n. Do they differ across models? Hint: use the function
pnormif you considered a normal model andpgammaif you considered a gamma GLM. Be careful with the parametrization!Provide an estimate for the probability of pre-term delivery \psi for a mother being exposed to a dose of 100 (mg/L) of
DDEand compare it with the probability for a mother not being exposed toDDE.☠️ - Optional. Provide a prediction interval for the
GADfor a mother being exposed to a dose of 100 (mg/L) ofDDEand compare it with the prediction for a mother not being exposed toDDE.
Houses dataset
What affects the selling price of a house? The Houses dataset is available here and comprises n = 100 observations on recent home sales in Gainesville, Florida.
Variables listed are selling price (in thousands of dollars, price), size of house (in square feet, size), annual property tax bill (in dollars, taxes), number of bedrooms (beds), number of bathrooms (baths), and whether the house is new (new, with 1 = new and 0 = not new). The original source of the dataset is the textbook:
Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.
Since these 100 observations are from one city alone, we cannot use them to make inferences about the relationships in general. But for illustrative purposes, we treat them as a random sample of a conceptual population of home sales in this market and analyze how selling price seems to relate to these characteristics. The goal is the predict the sale price as a function of other characteristics
Import the data and then
Summarize the data with descriptive statistics and plots.
Using a forward/stepwise/backward selection procedures with all five predictors together with judgments about practical significance, select and interpret a (generalized) linear model for selling price.
Check whether results depend on any influential observations.
This exercise is intentionally left without strict instructions, so you can freely explore the data and develop your own modeling strategy. As is often the case, multiple approaches can yield reasonable results. A detailed discussion of this dataset can be found in Sections 3.4 and 4.7 of Agresti (2015).
Theoretical
Let Y be a random variable with an inverse Gaussian distribution, with support [0, \infty) and probability density function
p(y ; \xi, \lambda) = \left( \frac{\lambda}{2 \pi} \right)^{1/2}y^{-3/2}e^{\sqrt{\lambda\xi}}\exp\left\{-\frac{1}{2}\left(\frac{\lambda}{y} + \xi y\right)\right\}, \qquad y > 0, \; \xi \ge 0, \; \lambda > 0. Show that this distribution belongs to the exponential dispersion family, and identify its characteristic elements (canonical parameter, a_i(\cdot), b(\cdot), c(\cdot) functions, dispersion parameter, variance function).
Let Y be a random variable with a negative binomial distribution, representing the number of independent Bernoulli trials with constant success probability \pi \in (0,1) required to obtain k successes. The support is S = \{k, k+1, \dots\} and the probability mass function is p(y; \pi) = \binom{y-1}{k-1} \pi^k (1-\pi)^{\,y-k}, \qquad y \in S.
Verify that, assuming k is known, this distribution belongs to the exponential-dispersion family and identify its characteristic elements: canonical parameter, a_i(\cdot), b(\cdot), c(\cdot) functions, dispersion parameter, and variance function.
Using the properties of exponential families, recover the well-known relations
\mathbb{E}(Y) = \frac{k}{\pi}, \qquad \text{var}(Y) = \frac{k(1-\pi)}{\pi^2}.
Show that the variance function is quadratic in \mu.
Does the distribution still form an exponential-dispersion family if k is treated as unknown?
Specialize the general formulas and re-obtain the likelihood equations for the binomial regression model (
Beetlesdata with p = 2) presented in the slides, using the canonical link function.Specialize the general formulas and re-obtain the likelihood equations for the Poisson regression model (
Aidsdata with p = 2) presented in the slides, using the canonical link function.Obtain the likelihood equations for a binomial regression model (
Beetlesdata with p = 2), using the Cauchy link, which is defined as g(\mu) = \tan(\pi(\mu - 1/2)). This link indeed is such that g(\mu):(0, 1)\to \mathbb{R}.
Hint: there are almost no calculations to do at points i. and ii., this exercise is designed to make you familiar with the notation and the general formulas. Point iii. is more elaborate and you will need to use the derivative of the Cauchy link, which is g'(\mu) = \pi /\sin^2(\pi \mu).
Explicitly derive the contribution of a single observation to the deviance (i.e. d_i) for a Gamma generalized linear model with a canonical link (inverse link). Then, write down the expression of the deviance D(\hat{\bm{\mu}}; \boldsymbol{y}) for a sample of size n.
Let Y_i be independent, with corresponding values x_1,\dots,x_n of a univariate quantitative covariate. Consider a GLM with no intercept and p=1, i.e. \eta_i = \beta x_i, \qquad i=1,\dots,n, under each of the following exponential–dispersion models (mean–variance pairs):
- Y_i \sim \text{ED}(\mu_i, \phi) with V(\mu_i)=1 and \mu_i \in \mathbb{R}, with \phi known.
- Y_i \sim \text{ED}(\mu_i,\phi) with V(\mu_i)=\mu_i^2 and \mu_i \in (0, \infty), with \phi known.
- Y_i \sim \text{ED}(\mu_i,\phi_i) with V(\mu_i)=\mu_i(1-\mu_i) and \mu_i \in (0, 1).
- Y_i \sim \text{ED}(\mu_i,\phi) with V(\mu_i)=\mu_i and \mu_i \in (0, \infty).
For each model:
Specify the statistical model in terms of the “standard” probability distribution and write the log-likelihood as a function of \beta.
Derive the score function \ell_*(\beta), the observed information J, and the expected Fisher information I.
Obtain the maximum likelihood estimator \hat\beta and provide an approximation to its sampling distribution.
For a fixed covariate value x_i, construct an approximate 95% confidence interval for the linear predictor \eta_i = \beta x_i. Then derive the corresponding confidence interval for \mu_i. Indicate in which of the four cases these intervals are also exact 95% intervals.
Suppose Y_i has a Poisson distribution with g(\mu_i) = \beta_1 + \beta_2 x_i, where x_i = 1 for i = 1, \dots, n_A (group A) and x_i = 0 for i = n_A + 1, \dots, n_A + n_B (group B), i.e. a dummy variable. Assume all observations are independent.
Show that, for the log-link function g(x) = \log{x}, the GLM likelihood equations imply that the fitted means \hat\mu_A and \hat\mu_B equal the respective sample means.
Using the likelihood equations, show that the same result holds for any link function for this Poisson model,
Using the likelihood equations, show that the same result holds for any GLM of the form g(\mu_i) = \beta_1 + \beta_2 x_i with a binary indicator predictor. Note: if the weights \omega_i are non-constant, then \hat\mu_A and \hat\mu_B equal the respective weighted means.
In a generalized linear model that uses a non-canonical link function, explain why it need not be true that
\sum_{i=1}^n \hat{\mu}_i = \sum_{i=1}^n y_i.
Hence, the response residuals need not have a mean of 0.
Then, explain why a GLM with a canonical link function and constant weights \omega_i requires the inclusion of an intercept term in order to ensure that this equality holds.
In selecting explanatory variables for a linear model, what is inadequate about the strategy of selecting the model with lowest deviance (or the highest R^2 in linear models)?