Generalized Linear Models

Statistics III - CdL SSE

Tommaso Rigon

Università degli Studi di Milano-Bicocca

Homepage

  • This unit will cover the following topics:

    • Exponential dispersion families
    • Likelihood, inference, and testing
    • Iteratively Re-weighted Least Squares (IRLS)
    • Deviance, model checking, and residuals
    • Model selection
  • GLMs are regression models with a linear predictor, where the response variable follows an exponential dispersion family.

  • The symbol 📖 means that a few extra steps are discussed in the handwritten notes.

The content of this Unit is covered in Chapter 2 of Salvan et al. (2020). Alternatively, see Chapter 4 of Agresti (2015) or Chapter 6 of Azzalini (2008).

Introduction

Preliminaries

  • GLMs are a class of regression models in which a response random variable Y_i is modeled as a function of a vector of covariates \bm{x}_i \in \mathbb{R}^p.

  • The random variables Y_i are not restricted to be Gaussian. For example:

    • Y_i \in \{0,1\}, known as binary regression
    • Y_i \in \{0,1,\dots\}, known as count regression
    • Y_i \in (0,\infty) or Y_i \in (-\infty,\infty)
  • Gaussian linear models are a special case of GLMs, arising when Y_i \in (-\infty,\infty).

  • The response random variables are collected in the random vector \bm{Y} = (Y_1,\dots,Y_n)^T, whose observed realization is \bm{y} = (y_1,\dots,y_n)^T.

  • The design matrix \bm{X} is an n \times p non-stochastic matrix containing the covariate values. The jth variable (column) is denoted by \tilde{\bm{x}}_j, while the ith observation (row) is \bm{x}_i.

  • We assume that \bm{X} has full rank, that is, \text{rk}(\bm{X}) = p with p \le n.

Beetles data, from Bliss (1935)

  • The Beetles dataset originates from Bliss (1935). It records the number of adult flour beetles that died after a 5-hour exposure to gaseous carbon disulphide.
     n  deaths  logdose
    59       6   1.6907
    60      13   1.7242
    62      18   1.7552
    56      28   1.7842
    63      52   1.8113
    59      53   1.8369
    62      61   1.8610
    60      60   1.8839
  • We aim to predict the proportion of deaths as a function of logdose.

  • Modeling death proportions directly with linear models is inappropriate. A variable transformation provides a more principled solution, but it comes with drawbacks.

Beetles data, a dose-response plot

  • There is a clear positive, non-linear relationship between the proportion of deaths and logdose. The response variable takes values in [0, 1].

Modelling the Beetles data

  • Let S_i be the number of dead beetles out of m_i, and let x_i denote the log-dose. By definition, S_i \in \{0, 1, \dots, m_i\} for i = 1,\dots,8.

  • It is natural to model the counts S_i as independent binomial random variables, counting the number of deaths out of m_i individuals. In other words: S_i \overset{\text{ind}}{\sim} \text{Binomial}(m_i, \pi_i), \qquad i = 1,\dots,8, where \pi_i is the probability of death at a given dose x_i. Moreover, we have \mathbb{E}\left(\frac{S_i}{m_i}\right) = \pi_i = \mu_i.

  • A modeling approach, called logistic regression, specifies:
    g(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_1 + \beta_2 x_i \quad \Longrightarrow \quad \pi_i = g^{-1}(\beta_1 + \beta_2 x_i) = \frac{\exp(\beta_1 + \beta_2 x_i)}{1 + \exp(\beta_1 + \beta_2 x_i)}. for some parameters \beta_1, \beta_2 \in \mathbb{R}. Note that \pi_i \in (0, 1) by construction.

Beetles data, fitted model

  • The maximum likelihood estimates are \hat{\beta}_1 = -60.72 and \hat{\beta}_2 = 34.3. This yields the predictive curve \hat{\pi}(x) = g^{-1}(\hat{\beta}_1 + \hat{\beta}_2 x), which estimates the mean proportion \mathbb{E}(S_i / m_i).
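As a sanity check, this model can be fitted in R with glm(); the sketch below uses the table reported earlier, with the variable names n, deaths and logdose chosen here for convenience.

# Beetles data, typed in from the table above
beetles <- data.frame(
  n       = c(59, 60, 62, 56, 63, 59, 62, 60),
  deaths  = c(6, 13, 18, 28, 52, 53, 61, 60),
  logdose = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
)

# Binomial GLM with logit link: the response is cbind(successes, failures)
fit_logit <- glm(cbind(deaths, n - deaths) ~ logdose,
                 family = binomial(link = "logit"), data = beetles)
coef(fit_logit)   # should be close to the estimates reported above

# Estimated death probability over a grid of log-doses
pi_hat <- predict(fit_logit,
                  newdata = data.frame(logdose = seq(1.65, 1.95, by = 0.01)),
                  type = "response")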

A comparison with old tools I 📖

Let Y_i = S_i / m_i be the proportion of deaths. A direct application of linear models implies: Y_i = \beta_1 + \beta_2 x_i + \epsilon_i. The coefficients \beta_1 and \beta_2 are then estimated by OLS, using Y_i as the response.

  • The prediction \hat{\beta}_1 + \hat{\beta}_2 x_i is unrestricted, meaning it could produce values like β€œ1.3” or β€œ-2” as estimated proportions, which is clearly undesirable.

  • The additive structure Y_i = \beta_1 + \beta_2 x_i + \epsilon_i cannot hold with iid errors \epsilon_i, because S_i, and thus Y_i, are discrete. As a result, the errors are always heteroscedastic.

  • If m_i = 1, i.e. when the data are binary, all the above issues are exacerbated.

This approach is sometimes called the linear probability model. Before GLMs, it was considered acceptable despite its issues, but by modern standards it should not be used.
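For comparison, a minimal sketch of the linear probability model, reusing the beetles data frame defined in the previous sketch; note how the predictions are not constrained to [0, 1].

# Linear probability model: OLS on the raw proportions
beetles$prop <- beetles$deaths / beetles$n
fit_lpm <- lm(prop ~ logdose, data = beetles)

# Unrestricted predictions: at low and high doses they fall outside [0, 1]
predict(fit_lpm, newdata = data.frame(logdose = c(1.60, 1.95)))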

A comparison with old tools II 📖

We consider the empirical logit transformation of Y_i = S_i / m_i, obtaining
\text{logit}(\tilde{Y}_i) = \log\left(\frac{S_i + 0.5}{m_i - S_i + 0.5}\right) = \beta_1 + \beta_2 x_i + \epsilon_i, \qquad \tilde{Y}_i = \frac{S_i + 0.5}{m_i +1}. A correction term is necessary because otherwise g(\cdot) = \text{logit}(\cdot) is undefined when S_i = 0 or S_i = m_i. The predictions belong to (0, 1), since \hat{\pi}_i = g^{-1}[\mathbb{E}\{g(\tilde{Y}_i)\}] = g^{-1}(\hat{\beta}_1 + \hat{\beta}_2 x_i) = \frac{\exp(\hat{\beta}_1 + \hat{\beta}_2 x_i)}{1 + \exp(\hat{\beta}_1 + \hat{\beta}_2 x_i)}, in which \hat{\beta}_1 and \hat{\beta}_2 are estimated with OLS using \text{logit}(\tilde{Y}_i) as the response.

  • The interpretation of \hat{\beta}_1 and \hat{\beta}_2 is less clear, as they refer to the mean of \text{logit}(\tilde{Y}_i) rather than to \mathbb{E}(Y_i).

  • An arbitrary boundary correction is needed.

  • Inference is problematic and requires further corrections, because of heteroscedastic errors.

  • This approach is not compatible with the reasonable assumption S_i \sim \text{Binomial}(m_i, \pi_i).
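A sketch of the empirical logit approach, again reusing the beetles data frame from the first sketch: transform, fit by OLS, then map the fitted line back to the probability scale.

# Empirical logit: the 0.5 correction avoids logit(0) and logit(1)
beetles$emp_logit <- log((beetles$deaths + 0.5) / (beetles$n - beetles$deaths + 0.5))
fit_elogit <- lm(emp_logit ~ logdose, data = beetles)

# Back-transform the fitted linear predictor to obtain values in (0, 1)
eta_hat <- predict(fit_elogit, newdata = data.frame(logdose = beetles$logdose))
pi_hat  <- exp(eta_hat) / (1 + exp(eta_hat))   # equivalently: plogis(eta_hat)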

A comparison with old tools III

  • The black line is the predicted curve of a logistic regression GLM. The orange line is the predicted curve of a linear model. The blue line is the predicted curve of a linear model after an empirical logit transformation.

Aids data

  • Number of AIDS deaths in Australia in a sequence of three-month periods between 1983 and 1986.
1983-1 1984-1 1985-1 1986-1 1983-2 1984-2 1985-2
deaths 0 1 2 3 1 4 8
period 1 2 3 4 5 6 7
1986-2 1983-3 1984-3 1985-3 1986-3 1983-4 1984-4
deaths 17 23 32 20 24 37 45
period 8 9 10 11 12 13 14
  • We are interested in predicting the number of deaths as a function of the period of time.

  • The response variable Y_i \in \{0, 1, \dots\} is a non-negative count.

Aids data, scatter plot

  • There is a clear positive association between period and deaths. However, the increase appears to be faster than linear. Note that both the mean and the variability of Y_i increase over time.

Modelling the Aids data

  • Let Y_i be the number of deaths, and let x_i denote the period. By definition, Y_i \in \{0, 1, \dots\} are non-negative counts, for i = 1,\dots,14.
  • We model Y_i as independent Poisson random variables, counting the number of deaths: Y_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i), \qquad i = 1,\dots,14, where \mu_i is the mean of Y_i, namely \mathbb{E}(Y_i) = \mu_i.

  • A modeling approach, called Poisson regression, specifies:
    g(\mu_i) = \log(\mu_i) = \beta_1 + \beta_2 x_i \quad \Longrightarrow \quad \mu_i = g^{-1}(\beta_1 + \beta_2 x_i) = \exp(\beta_1 + \beta_2 x_i), for some parameters \beta_1, \beta_2 \in \mathbb{R}. Note that \mu_i > 0 by construction.

  • Under this specification, the variances of the observations are
    \text{var}(Y_i) = \mu_i = \exp(\beta_1 + \beta_2 x_i), which increases with x_i, as desired. This implies that Y_1,\dots,Y_n are heteroscedastic, but this is not an issue in GLMs, as this aspect is automatically accounted for.

Aids data, fitted model

  • The maximum likelihood estimates are \hat{\beta}_1 = 0.30 and \hat{\beta}_2 = 0.26. This yields the predictive curve \hat{\mu}(x) = \exp(\hat{\beta}_1 + \hat{\beta}_2 x), which estimates the mean \mathbb{E}(Y_i).
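A minimal R sketch that fits this Poisson log-linear model with glm(), using the deaths and period values listed in the table above; the object names aids and fit_pois are chosen here for convenience and are reused in later sketches.

# Aids data: quarterly death counts and time period, from the table above
aids <- data.frame(
  deaths = c(0, 1, 2, 3, 1, 4, 8, 17, 23, 32, 20, 24, 37, 45),
  period = 1:14
)

# Poisson GLM with (canonical) logarithmic link
fit_pois <- glm(deaths ~ period, family = poisson(link = "log"), data = aids)
coef(fit_pois)     # should be close to the estimates reported above
fitted(fit_pois)   # estimated means exp(beta1_hat + beta2_hat * x_i)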

A comparison with old tools I

We consider the variance-stabilizing transformation \sqrt{Y_i}, obtaining the model
\sqrt{Y_i} = \beta_1 + \beta_2 x_i + \epsilon_i. The predictions are non-negative, since \hat{\mu}_i = \{\hat{\mathbb{E}}(\sqrt{Y_i})\}^2 = (\hat{\beta}_1 + \hat{\beta}_2 x_i)^2, in which \hat{\beta}_1 and \hat{\beta}_2 are estimated with OLS using \sqrt{Y_i} as the response.

  • The interpretation of \hat{\beta}_1 and \hat{\beta}_2 is less clear, as they refer to the mean of \sqrt{Y_i} rather than to \mathbb{E}(Y_i).

  • This approach is not compatible with the reasonable assumption Y_i \sim \text{Poisson}(\mu_i), and it is only valid as an asymptotic approximation.

To compare such a model with a similar specification, we also fit another Poisson GLM in which Y_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i), \qquad \sqrt{\mu_i} = \beta_1 + \beta_2 x_i, \qquad i=1,\dots,14.
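A sketch of the two alternatives just described, reusing the aids data frame from the previous sketch: OLS on the square-root scale, and a Poisson GLM with square-root link.

# (a) Linear model on the square-root scale, fitted by OLS
fit_sqrt_lm <- lm(sqrt(deaths) ~ period, data = aids)
mu_hat_lm   <- fitted(fit_sqrt_lm)^2   # back-transformed predictions

# (b) Poisson GLM with square-root link: sqrt(mu_i) = beta_1 + beta_2 * x_i
fit_sqrt_glm <- glm(deaths ~ period, family = poisson(link = "sqrt"), data = aids)
mu_hat_glm   <- fitted(fit_sqrt_glm)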

A comparison with old tools II

  • The black line is the predicted curve of a Poisson regression GLM with logarithmic link. The orange line is the predicted curve of a linear model with a square-root transformation. The blue line is the predicted curve of a Poisson regression GLM with square-root link.

The components of a GLM

  • Random component. This specifies the probability distribution of the response variable Y_i. The observations \bm{y} =(y_1,\dots,y_n) are treated as realizations of independent random variables with that distribution.
  • Linear predictor. For a parameter vector \bm{\beta} = (\beta_1,\dots,\beta_p)^T and an n \times p design matrix \bm{X}, the linear predictor is \bm{\eta} = \bm{X}\beta. We will also write \eta_i = \bm{x}_i^T\beta = \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \qquad i=1,\dots,n.
  • Link function. This is an invertible and differentiable function g(\cdot) applied to each component of the mean \mu_i = \mathbb{E}(Y_i) that relates it to the linear predictor: g(\mu_i) = \eta_i = \bm{x}_i^T\beta, \qquad \Longrightarrow \qquad \mu_i = g^{-1}(\eta_i) = g^{-1}(\bm{x}_i^T\beta).

Note that, in general, we cannot express the response in an additive way Y_i = g^{-1}(\eta_i) + \epsilon_i.

Random component of a GLM

  • In GLMs the random variables Y_i are independent and they are distributed according to an exponential dispersion family, whose definition will be provided in a few slides.

  • For now, it suffices to say that the distributions most commonly used in Statistics, such as the normal, binomial, gamma, and Poisson, are exponential family distributions.

  • Exponential dispersion families are characterized by their mean and variance. Let v(\mu) > 0 be a function of the mean, called the variance function, and let a_i(\phi) > 0 be functions of an additional unknown parameter \phi > 0, called the dispersion.

In a GLM the observations are independent draws from a distribution \text{ED}(\mu_i, a_i(\phi)v(\mu_i)): Y_i \overset{\text{ind}}{\sim} \text{ED}(\mu_i, a_i(\phi)v(\mu_i)), \qquad \mathbb{E}(Y_i) = \mu_i, \qquad g(\mu_i) = \bm{x}_i^T\beta, with \mu_i \in \mathcal{M}. Moreover, the variance is connected to the mean via v(\mu): \text{var}(Y_i) = a_i(\phi) v(\mu_i), where a_i(\phi) = \phi / \omega_i and \omega_i are known weights. Special cases are a_i(\phi) = \phi and a_i(\phi) = 1.

Notable examples

In a Gaussian linear model we consider the identity link g(\mu) = \mu and let Y_i \overset{\text{ind}}{\sim}\text{N}(\mu_i, \sigma^2), \qquad \mu_i = \bm{x}_i^T\beta. The unknown variance \sigma^2 = \phi is called dispersion in GLMs. The parameter space is \mathcal{M} = \mathbb{R}, whereas a_i(\phi) = \phi and the variance function is constant v(\mu) = 1 (homoscedasticity).

In a binomial regression model with logit link g(\mu) = \text{logit}(\mu) we let Y_i = S_i/m_i and S_i \overset{\text{ind}}{\sim}\text{Binomial}(m_i, \pi_i),\qquad \mathbb{E}\left(Y_i\right) = \pi_i = \mu_i, \qquad \text{logit}(\mu_i) = \bm{x}_i^T\beta. We have a_i(\phi) = 1/m_i and v(\mu) = \mu(1-\mu). There is no dispersion parameter.

In Poisson regression with logarithmic link g(\mu) = \log(\mu) we let Y_i \overset{\text{ind}}{\sim}\text{Poisson}(\mu_i), \qquad \log(\mu_i) = \bm{x}_i^T\beta. We have a_i(\phi) = 1 and v(\mu) = \mu. There is no dispersion parameter.
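These choices are mirrored by R's family objects, which store the link and the variance function used by glm(); a quick way to inspect them:

gaussian()$variance(2)      # v(mu) = 1 for any mu            -> returns 1
poisson()$variance(2)       # v(mu) = mu                      -> returns 2
binomial()$variance(0.25)   # v(mu) = mu * (1 - mu)           -> returns 0.1875
Gamma()$variance(2)         # v(mu) = mu^2                    -> returns 4

binomial()$linkfun(0.5)     # logit link at mu = 0.5          -> returns 0
poisson()$linkinv(1)        # inverse of the log link at 1    -> returns exp(1)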

Exponential dispersion families

Overview

  • Figure 1 of Efron (2023): the three levels of statistical modeling.
  • The prime role of exponential families in the theory of statistical inference was first emphasized by Fisher (1934).

  • Most well-known distributionsβ€”such as Gaussian, Poisson, Binomial, and Gammaβ€”are instances of exponential families.

Definition

The density of Y_i belongs to an exponential dispersion family if it can be written as p(y_i; \theta_i, \phi) = \exp\left\{\frac{\theta_i y_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)\right\}, where y_i \in \mathcal{Y} \subseteq \mathbb{R}, \theta_i \in \Theta \subseteq\mathbb{R} and a_i(\phi) = \phi / \omega_i where \omega_i are known positive weights. The parameter \theta_i is called natural parameter while \phi is called dispersion parameter.

  • By specifying the functions a_i(\cdot), b(\cdot), and c(\cdot), one obtains a particular parametric model.

  • The support \mathcal{Y} of Y_i does not depend on the parameters \phi or \theta_i and b(\cdot) can be differentiated infinitely many times. In particular, this is a regular statistical model.

  • As mentioned, special cases are a_i(\phi) = \phi and a_i(\phi) = 1. When a_i(\phi) = 1 and c(y_i, \phi) = c(y_i) we obtain p(y_i; \theta_i) = \exp\left\{\theta_i y_i - b(\theta_i) + c(y_i)\right\}, which is called natural exponential family of order 1.
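As a quick numerical check of the definition, the sketch below rewrites the Poisson probability mass function in natural exponential family form, using the identifications derived later in this unit (theta = log(mu), b(theta) = exp(theta), c(y) = -log(y!)).

# Poisson pmf written as exp{theta * y - b(theta) + c(y)}
mu    <- 3.5
theta <- log(mu)      # natural parameter
y     <- 0:10

dens_ed <- exp(theta * y - exp(theta) - lfactorial(y))
all.equal(dens_ed, dpois(y, lambda = mu))   # TRUE: same distribution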

Mean and variance I 📖

  • Let us consider the log-likelihood contribution of the ith observation, which is defined as \ell(\theta_i, \phi; y_i) = \log{p(y_i; \theta_i, \phi)} = \frac{\theta_i y_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi). If you prefer, this is the log-likelihood when the sample size is n = 1 and we only observe Y_i.

  • The score and hessian functions, namely the first and second derivatives with respect to \theta_i, are \frac{\partial}{\partial \theta_i} \ell(\theta_i, \phi; y_i) = \frac{y_i - b'(\theta_i)}{a_i(\phi)}, \qquad \frac{\partial^2}{\partial \theta_i^2}\ell(\theta_i, \phi; y_i) = \frac{-b''(\theta_i)}{a_i(\phi)}, where b'(\cdot) and b''(\cdot) denote the first and second derivative of b(\cdot).

  • Recall the following Bartlett identities, valid in any regular statistical model: \begin{aligned} \mathbb{E}\left(\frac{\partial}{\partial \theta_i} \ell(\theta_i, \phi; Y_i) \right) &= 0, \\ \mathbb{E}\left\{\left(\frac{\partial}{\partial \theta_i} \ell(\theta_i, \phi; Y_i) \right)^2\right\} = \text{var}\left(\frac{\partial}{\partial \theta_i} \ell(\theta_i, \phi; Y_i) \right) &= \mathbb{E}\left(-\frac{\partial^2}{\partial \theta_i^2}\ell(\theta_i, \phi; Y_i)\right). \end{aligned}

Mean and variance II 📖

  • Specializing Bartlett identities in exponential dispersion families, we obtain \mathbb{E}\left(\frac{Y_i - b'(\theta_i)}{a_i(\phi)}\right) = 0, \qquad \text{var}\left(\frac{Y_i - b'(\theta_i)}{a_i(\phi)}\right) = \frac{\text{var}(Y_i)}{a_i(\phi)^2} = \frac{b''(\theta_i)}{a_i(\phi)}. Re-arranging the terms, we finally get the following key result.

Let Y_i be an exponential dispersion family, identified by the functions a_i(\cdot), b(\cdot) and c(\cdot), and with natural parameter \theta_i. Then the mean and the variance of Y_i equal \mathbb{E}(Y_i) = b'(\theta_i), \qquad \text{var}(Y_i) = a_i(\phi) b''(\theta_i).

  • The mean \mu_i = b'(\theta_i) does not depend on the dispersion parameter.

  • We have b''(\cdot) > 0 because \text{var}(Y_i) > 0, which means that b(\cdot) is a convex function.

  • Moreover, the function b'(\theta) is continuous and monotone increasing and hence invertible.

Mean parametrization, variance function

Let Y_i be an exponential dispersion family, identified by the functions a_i(\cdot), b(\cdot) and c(\cdot), and with natural parameter \theta_i, then \mu(\theta_i):= \mu_i = \mathbb{E}(Y_i) = b'(\theta_i). The function \mu(\cdot) : \Theta \to\mathcal{M} is one-to-one and invertible, that is, a reparametrization of \theta_i. We call \mu_i the mean parametrization of an exponential dispersion family.

  • The inverse relationship, re-obtaining \theta_i as a function of \mu_i, is denoted with \theta_i = \theta(\mu_i) = b'^{-1}(\mu_i).

  • Using this notation, we can express the variance of Y_i as a function of \mu_i as follows \text{var}(Y_i) = a_i(\phi)b''(\theta_i) = a_i(\phi)b''(\theta(\mu_i)) = a_i(\phi)v(\mu_i), where v(\mu_i) := b''(\theta(\mu_i)) is the variance function.

  • The domain \mathcal{M} and the variance function v(\mu) characterize the function b(\cdot) and the entire distribution, for any given a_i(\phi). This justifies the notation Y_i \sim \text{ED}(\mu_i, a_i(\phi)v(\mu_i)).

Gaussian distribution 📖

  • Let Y_i \sim \text{N}(\mu_i, \sigma^2). The density function of Y_i can be written as \begin{aligned} p(y_i; \mu_i, \sigma^2) &= \frac{1}{\sqrt{2 \pi \sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \mu_i)^2\right\} \\ &=\exp\left\{\frac{y_i \mu_i - \mu_i^2/2}{\sigma^2}- \frac{\log(2\pi\sigma^2)}{2}-\frac{y_i^2}{2\sigma^2}\right\} \end{aligned}

  • Then, we can recognize the following relationships: \theta_i = \theta(\mu_i) = \mu_i, \quad a_i(\phi) = \phi = \sigma^2, \quad b(\theta_i) = \frac{\theta_i^2}{2}, \quad c(y_i, \phi) = - \frac{\log(2\pi\phi)}{2}-\frac{y_i^2}{2\phi}. In the Gaussian case, the mean parametrization and the natural parametrization coincide. Moreover, the dispersion \phi coincides with the variance \sigma^2.

  • Using the results we previously discussed, we obtain the well-known relationships \mathbb{E}(Y_i) = b'(\theta_i) = \theta_i, \qquad \text{var}(Y_i) = a_i(\phi)b''(\theta_i) = \phi. The variance function v(\mu_i) = 1 is constant. We will write Y_i \sim \text{ED}(\mu_i, \phi) with \mu_i \in \mathcal{M} = \mathbb{R}.

Poisson distribution 📖

Let Y_i \sim \text{Poisson}(\mu_i). The probability mass function of Y_i can be written as \begin{aligned} p(y_i; \mu_i) &= \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}=\exp\{y_i \log(\mu_i) - \mu_i - \log(y_i!)\} \\ &=\exp\{y_i \theta_i - e^{\theta_i} - \log(y_i!)\}, \qquad y_i = 0, 1, 2,\dots. \end{aligned}

  • Then, we can recognize the following relationships: \begin{aligned} \theta_i &= \theta(\mu_i) = \log(\mu_i), \quad &&a_i(\phi) = 1, \\ b(\theta_i) &= e^{\theta_i}, \quad &&c(y_i, \phi) = c(y_i) = -\log(y_i!). \end{aligned} There is no dispersion parameter since a_i(\phi) = 1.

  • Using the results we previously discussed, we obtain the well-known relationships \begin{aligned} \mathbb{E}(Y_i) &= b'(\theta_i) = e^{\theta_i} = \mu_i, \\ \text{var}(Y_i) &= a_i(\phi) b''(\theta_i) = e^{\theta_i} = \mu_i. \end{aligned} The variance function v(\mu_i) = \mu_i is linear. We will write Y_i \sim \text{ED}(\mu_i, \mu_i) with \mu_i \in (0, \infty).

Gamma distribution I 📖

  • Let Y_i \sim \text{Gamma}(\alpha, \lambda_i). The density function of Y_i can be written as \begin{aligned} p(y_i; \alpha, \lambda_i) &= \frac{\lambda_i^\alpha y_i^{\alpha-1} e^{-\lambda_i y_i}}{\Gamma(\alpha)} \\ &=\exp\left\{\alpha\log{\lambda_i} - \lambda_i y_i + (\alpha-1)\log{y_i} - \log{\Gamma}(\alpha)\right\} \\ &=\exp\left\{\alpha\left(\log{\lambda_i} - \frac{\lambda_i}{\alpha} y_i\right) + (\alpha-1)\log{y_i} - \log{\Gamma}(\alpha)\right\} \\ &=\exp\left\{\frac{\theta_i y_i + \log(-\theta_i)}{\phi} - (1/\phi)\log{\phi}+ (1/\phi - 1)\log{y_i} - \log{\Gamma}(1/\phi)\right\}, \qquad y_i > 0, \end{aligned} having defined the dispersion \phi = 1/\alpha and the natural parameter \theta_i = -\lambda_i/\alpha.

  • Then, we can recognize the following relationships: \begin{aligned} \quad a_i(\phi) &= \phi, \qquad b(\theta_i) = - \log(-\theta_i), \\ c(y_i, \phi) &= - (1/\phi)\log{\phi}+ (1/\phi - 1)\log{y_i} - \log{\Gamma}(1/\phi). \end{aligned}

Gamma distribution II 📖

  • Using the results we previously discussed, we obtain the well-known relationships \mathbb{E}(Y_i) = b'(\theta_i) = - \frac{1}{\theta_i} = \frac{\alpha}{\lambda_i} = \mu_i, \qquad \text{var}(Y_i) = a_i(\phi)b''(\theta_i) = \frac{\phi}{\theta_i^2} = \frac{\alpha}{\lambda_i^2}.

  • At the same time, we can write the inverse relationship linking \theta_i to the mean as \theta_i = \theta(\mu_i) = - \frac{1}{\mu_i} from which we finally obtain the following quadratic variance function v(\mu_i) = \mu_i^2.

  • We will write Y_i \sim \text{ED}(\mu_i, \phi\mu_i^2) with \mu_i \in (0, \infty).
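A small numerical check of the chain b(theta) -> b''(theta(mu)) -> v(mu) in the Gamma case, using a finite-difference approximation of the second derivative.

# Gamma family: b(theta) = -log(-theta), theta(mu) = -1/mu
b  <- function(theta) -log(-theta)
h  <- 1e-4
b2 <- function(theta) (b(theta + h) - 2 * b(theta) + b(theta - h)) / h^2  # numerical b''

mu    <- 2.5
theta <- -1 / mu
c(numerical = b2(theta), exact = mu^2)   # both approximately 6.25, i.e. v(mu) = mu^2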

Binomial distribution I 📖

  • Let S_i \sim \text{Binomial}(m_i, \pi_i), with \pi_i \in (0, 1). The random variable Y_i = S_i/m_i has density \begin{aligned} p(y_i; m_i, \pi_i) &= \binom{m_i}{m_i y_i}\pi_i^{m_i y_i}(1 - \pi_i)^{m_i - m_i y_i}\\ &=\binom{m_i}{m_i y_i}\left(\frac{\pi_i}{1 - \pi_i}\right)^{m_i y_i}(1 - \pi_i)^{m_i}\\ &=\exp\left\{m_iy_i\log\left(\frac{\pi_i}{1 - \pi_i}\right) + m_i\log(1 - \pi_i) + \log\binom{m_i}{m_i y_i}\right\}, \end{aligned} for y_i \in \{0, 1/m_i, 2/m_i, \dots, 1\}. This can be written as p(y_i; m_i, \pi_i) =\exp\left\{\frac{y_i\theta_i - \log\{1 + \exp(\theta_i)\}}{1/m_i}+ \log\binom{m_i}{m_i y_i}\right\}, where the natural parameter is \theta_i = \text{logit}(\pi_i) = \log\{\pi_i/(1-\pi_i)\}.

Binomial distribution II 📖

  • Note that \mathbb{E}(Y_i) = \mathbb{E}(S_i / m_i) = \pi_i = \mu_i. This means there is no dispersion parameter \phi, and \theta_i = \text{logit}(\mu_i), \quad a_i(\phi) = \frac{1}{m_i}, \quad b(\theta_i) = \log\{1 + \exp(\theta_i)\}, \quad c(y_i) = \log\binom{m_i}{m_i y_i}.

  • Using the general formulas, we therefore obtain \begin{aligned} \mathbb{E}(Y_i) &= b'(\theta_i) = \frac{\exp(\theta_i)}{1 + \exp(\theta_i)} = \mu_i, \\ \text{var}(Y_i) &= a_i(\phi)b''(\theta_i) = \frac{1}{m_i}\frac{\exp(\theta_i)}{[1 + \exp(\theta_i)]^2} = \frac{\mu_i (1 - \mu_i)}{m_i}, \end{aligned} from which we see that the variance function v(\mu_i) = \mu_i(1-\mu_i) is quadratic.

  • We will write Y_i \sim \text{ED}(\mu_i, \mu_i(1-\mu_i)) with \mu_i \in \mathcal{M} = (0, 1).

Notable exponential dispersion families

Model \text{N}(\mu_i, \sigma^2) \text{Gamma}(\alpha, \alpha/\mu_i) \frac{1}{m_i}\text{Binomial}(m_i, \mu_i) \text{Poisson}(\mu_i)
Support \mathcal{Y} \mathbb{R} [0, \infty) \{0, 1/m_i,\dots, 1\} \mathbb{N}
\theta_i \mu_i - 1/\mu_i \log\left(\frac{\mu_i}{1 - \mu_i}\right) \log{\mu_i}
Parametric space \Theta \mathbb{R} (-\infty, 0) \mathbb{R} \mathbb{R}
b(\theta_i) \theta_i^2/2 -\log(-\theta_i) \log\{1 + \exp(\theta_i)\} \exp(\theta_i)
\phi \sigma^2 1/\alpha 1 1
a_i(\phi) \sigma^2 1/\alpha 1/m_i 1
\mathcal{M} \mathbb{R} (0, \infty) (0, 1) (0, \infty)
v(\mu_i) 1 \mu_i^2 \mu_i(1-\mu_i) \mu_i

The list of exponential dispersion families does not end here. Other examples are the inverse Gaussian, the negative binomial and hyperbolic secant distributions.

Likelihood, inference, and testing

Likelihood function

  • Let Y_i \overset{\text{ind}}{\sim}\text{ED}(\mu_i, a_i(\phi)v(\mu_i)) be the response variable of a GLM, with g(\mu_i) = \bm{x}_i^T\beta. The joint distribution of the responses \bm{Y} = (Y_1,\dots,Y_n) is p(\bm{y}; \beta, \phi) = \prod_{i=1}^np(y_i; \beta, \phi) = \prod_{i=1}^n \exp\left\{\frac{y_i\theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)\right\}, with \theta_i = \theta(\mu_i) = \theta(g^{-1}(\bm{x}_i^T\beta)).

  • The log-likelihood function therefore is \ell(\beta, \phi) = \sum_{i=1}^n\left\{\frac{y_i\theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)\right\}.

  • In general, there is no sufficient statistic with dimension smaller than n.
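The log-likelihood can be evaluated directly; a sketch for the Poisson fit of the Aids data (reusing aids and fit_pois from the earlier sketch), where theta_i = log(mu_i), b(theta_i) = mu_i, a_i(phi) = 1 and c(y_i) = -log(y_i!).

beta_hat <- coef(fit_pois)
mu_hat   <- exp(beta_hat[1] + beta_hat[2] * aids$period)

# Direct evaluation of sum_i { y_i * theta_i - b(theta_i) + c(y_i) }
ll_manual <- sum(aids$deaths * log(mu_hat) - mu_hat - lfactorial(aids$deaths))
all.equal(ll_manual, as.numeric(logLik(fit_pois)))   # TRUE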

Likelihood equations I 📖

  • To conduct inference using the classical theory (as in Statistics II), we need the first and second derivatives of the log-likelihood, that is, the score function \frac{\partial}{\partial \beta_r}\ell(\beta; \phi), \qquad r=1,\dots,p, and the observed information matrix \bm{J}, whose elements are j_{rs} = - \frac{\partial}{\partial \beta_r}\frac{\partial}{\partial \beta_s}\ell(\beta; \phi), \qquad r, s=1,\dots,p.

  • These quantities have a simple expression in the end, but getting there requires quite a bit of calculus.

Let us focus on the estimation of \beta, assuming for now that \phi is a known parameter, as is the case in binomial or Poisson regression.

This assumption is not restrictive, even when \phi is actually unknown. In fact, we will show that the maximum likelihood estimate \hat{\beta} does not depend on \phi, and that \beta and \phi are orthogonal.

Likelihood equations II 📖

  • Let us begin by noting that \frac{\partial}{\partial \beta_r}\ell(\beta; \phi) = \sum_{i=1}^n\frac{1}{a_i(\phi)} \left(y_i \frac{\partial \theta_i}{\partial \beta_r} - \frac{\partial b(\theta_i)}{\partial \beta_r} \right), \qquad r = 1,\dots,p. Such an expression can be simplified because \frac{\partial b(\theta_i)}{\partial \beta_r} = b'(\theta_i)\frac{\partial \theta_i}{\partial \beta_r} = \mu_i\frac{\partial \theta_i}{\partial \beta_r}, which implies that the score function will have the following structure: \frac{\partial}{\partial \beta_r}\ell(\beta; \phi) = \sum_{i=1}^n\frac{1}{a_i(\phi)}(y_i - \mu_i)\frac{\partial \theta_i}{\partial \beta_r}, \qquad r=1,\dots,p.

  • Recall that a_i(\phi) = \phi/\omega_i; since the common factor 1/\phi cancels, the maximum likelihood estimator is obtained by solving: \sum_{i=1}^n\omega_i(y_i - \mu_i)\frac{\partial \theta_i}{\partial \beta_r} = 0, \qquad r=1,\dots,p.

Likelihood equations III 📖

Let f(x) be a function with inverse g(x) = f^{-1}(x) and first derivative f'(x). Then \frac{\partial g}{\partial{x}} = [f^{-1}]'(x) = \frac{1}{f'(f^{-1}(x))}.

  • Recall that g(\mu_i) = \bm{x}_i^T\beta = \eta_i and that \theta_i = \theta(\mu_i) is the inverse of \mu(\theta_i). As an application of the above lemma: \frac{\partial \theta_i}{\partial \mu_i} = \theta'(\mu_i) = \frac{1}{\mu'(\theta(\mu_i))}= \frac{1}{b''(\theta(\mu_i))} = \frac{1}{v(\mu_i)}. Moreover, since \mu_i = g^{-1}(\eta_i), we obtain \frac{\partial \mu_i}{\partial \eta_i} = \frac{1}{g'(g^{-1}(\eta_i))} = \frac{1}{g'(\mu_i)}.
  • Summing up, the chain rule of derivation for composite functions gives: \frac{\partial \theta_i}{\partial \beta_r} = \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i} \frac{\partial \eta_i}{\partial\beta_r} = \frac{1}{v(\mu_i)}\frac{1}{g'(\mu_i)}x_{ir}, \qquad r=1,\dots,p.

Likelihood equations IV 📖

  • Combining all the above equations, we obtain an explicit formula for the score function \frac{\partial}{\partial \beta_r}\ell(\beta; \phi) = \frac{1}{\phi}\sum_{i=1}^n \omega_i \frac{(y_i - \mu_i)}{v(\mu_i)}\frac{x_{ir}}{g'(\mu_i)} = \sum_{i=1}^n \frac{(y_i - \mu_i)}{\text{var}(Y_i)}\frac{x_{ir}}{g'(\mu_i)}, \qquad r=1,\dots,p.

The maximum likelihood estimator solves the likelihood equations: \sum_{i=1}^n \omega_i \frac{(y_i - \mu_i)}{v(\mu_i)}\frac{x_{ir}}{g'(\mu_i)} = 0, \qquad r=1,\dots,p, which do not depend on \phi. In matrix notation \bm{D}^T \bm{V}^{-1}(\bm{y} - \bm{\mu}) = \bm{0}, where \bm{V} = \text{diag}(v(\mu_1)/\omega_1,\dots,v(\mu_n)/\omega_n) and \bm{D} is an n \times p matrix whose elements are d_{ir} = \frac{\partial \mu_i}{\partial \beta_r} =\frac{\partial \mu_i}{\partial \eta_i} \frac{\partial \eta_i}{\partial\beta_r} =\frac{1}{g'(\mu_i)}x_{ir}, \qquad i=1,\dots,n, \quad r=1,\dots,p.
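These equations can be verified numerically at the maximum likelihood estimate; a sketch for the Poisson fit of the Aids data (from the earlier sketch), where \omega_i = 1, v(\mu) = \mu and, with the log link, g'(\mu) = 1/\mu.

X      <- model.matrix(fit_pois)   # n x p design matrix
mu_hat <- fitted(fit_pois)

# With the log link, D = diag(mu) X and V = diag(mu), so D^T V^{-1} (y - mu)
# reduces to X^T (y - mu); at the MLE it is zero up to convergence tolerance
t(X) %*% (aids$deaths - mu_hat)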

Examples of estimating equations

Let Y_i \sim \text{ED}(\mu_i, \phi) with g(\mu_i) = \mu_i, namely the Gaussian linear model with the identity (canonical) link. The likelihood equations are \bm{X}^T(\bm{y} - \bm{X}\beta) = \bm{0}, which are also called normal equations. Their solution over \beta is the OLS estimator \hat{\beta} = (\bm{X}^T\bm{X})^{-1}\bm{X}^T\bm{y}.

Let Y_i \sim \text{ED}(\mu_i, \phi/\omega_i) with g(\mu_i) = \mu_i, namely the Gaussian linear model with the identity (canonical) link and heteroscedastic errors. The likelihood equations are \bm{X}^T\bm{\Omega}(\bm{y} - \bm{X}\beta) = \bm{0}, where \bm{\Omega} = \text{diag}(\omega_1,\dots,\omega_n). Their solution over \beta is the weighted least squares estimator \hat{\beta}_\text{wls} = (\bm{X}^T\bm{\Omega}\bm{X})^{-1}\bm{X}^T\bm{\Omega}\bm{y}.

Let Y_i \sim \text{ED}(\mu_i, \mu_i) with g(\mu_i) = \log{\mu_i}, namely a Poisson regression model with the logarithmic (canonical) link. The likelihood equations can be solved numerically \bm{X}^T(\bm{y} - \bm{\mu}) = \bm{0}, \qquad \bm{\mu} = (e^{\bm{x}_1^T\beta}, \dots, e^{\bm{x}_n^T\beta}).

Example: Aids data

  • In the Aids data, we specified a Poisson regression model with \mathbb{E}(Y_i) = \exp(\beta_1 + \beta_2 x_i).

  • The maximum likelihood estimate (\hat{\beta}_1, \hat{\beta}_2) solve simultaneously: \sum_{i=1}^n y_i = \sum_{i=1}^n \exp(\beta_1 + \beta_2x_i), \quad \text{and }\quad \sum_{i=1}^n x_i y_i = \sum_{i=1}^n x_i\exp(\beta_1 + \beta_2 x_i).

  • This system does not always admit a solution. This happens, for example, in the extreme case \sum_{i=1}^ny_i = 0, occurring when all counts equal zero.

  • Using the Aids data we have \sum_{i=1}^ny_i = 217 and \sum_{i=1}^nx_i y_i = 2387. Via numerical methods we solve the above system of equations and we obtain \hat{\beta}_1 = 0.304 and \hat{\beta}_2 = 0.259.

  • The estimated mean values are \hat{\mu}_i = \widehat{\mathbb{E}}(Y_i) = \exp(0.304 + 0.259 x_i) and, in particular, the estimated mean for the next period is \hat{\mu}_{i+1} = \exp(0.304 + 0.259 (x_i +1)) = \exp(0.259) \hat{\mu}_i = 1.296 \hat{\mu}_i. In other words, the estimated number of deaths increases by approximately 30% every quarter.
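Both equations can be checked on the fitted model from the earlier sketch:

mu_hat <- fitted(fit_pois)
c(sum(aids$deaths), sum(mu_hat))                               # both equal 217
c(sum(aids$period * aids$deaths), sum(aids$period * mu_hat))   # both equal 2387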

Observed and expected information I

  • Let us first consider the negative derivative of the score function, that is the observed information matrix \bm{J} with entries: \begin{aligned} j_{rs} &= -\frac{\partial}{\partial \beta_s}\left[\frac{\partial}{\partial \beta_r}\ell(\beta; \phi)\right] = -\frac{\partial}{\partial \beta_s}\sum_{i=1}^n\frac{1}{a_i(\phi)}(y_i - \mu_i)\frac{\partial \theta_i}{\partial \beta_r} \\ &=\sum_{i=1}^n\frac{1}{a_i(\phi)}\left[\frac{\partial\mu_i}{\partial \beta_s}\frac{\partial\theta_i}{\partial \beta_r} - (y_i - \mu_i) \frac{\partial^2\theta_i}{\partial \beta_r \partial \beta_s}\right], \qquad r,s = 1,\dots,p. \end{aligned}

  • Let \bm{I} = \mathbb{E}(\bm{J}) be the p \times p Fisher information matrix associated with \beta, whose elements are i_{rs} = \mathbb{E}(j_{rs}) = \mathbb{E}\left(- \frac{\partial}{\partial \beta_r}\frac{\partial}{\partial \beta_s}\ell(\beta; \phi)\right), \qquad r,s = 1,\dots,p.

  • Thus, the Fisher information matrix simplifies substantially: since \mathbb{E}(Y_i - \mu_i) = 0, the term involving the second derivative of \theta_i vanishes in expectation, giving: i_{rs} = \sum_{i=1}^n\frac{1}{a_i(\phi)}\frac{\partial\mu_i}{\partial \beta_s}\frac{\partial\theta_i}{\partial \beta_r}, \qquad r,s = 1,\dots,p.

Observed and expected information II

  • In the previous slides we already computed the explicit values of these derivatives: \frac{\partial\mu_i}{\partial \beta_s} = \frac{x_{is}}{g'(\mu_i)}, \qquad \frac{\partial\theta_i}{\partial \beta_r} = \frac{x_{ir}}{v(\mu_i) g'(\mu_i)}.

Combining the above equations, we obtain that the Fisher information \bm{I} of a GLM has entries i_{rs} = \frac{1}{\phi}\sum_{i=1}^n \omega_i \frac{x_{ir} x_{is}}{v(\mu_i)g'(\mu_i)^2} = \sum_{i=1}^n \frac{x_{ir}x_{is}}{\text{var}(Y_i) g'(\mu_i)^2}, \qquad r,s = 1,\dots,p. In matrix notation, we have that \bm{I} = \bm{X}^T\bm{W}\bm{X}, where \bm{W} = \text{diag}(w_1,\dots,w_n) and w_i are weights such that w_i = \frac{1}{\phi}\frac{\omega_i}{v(\mu_i) g'(\mu_i)^2} = \frac{1}{\text{var}(Y_i) g'(\mu_i)^2}, \qquad i=1,\dots,n.

Further considerations

  • The observed and expected information matrices \bm{J} and \bm{I}, as well as weights \bm{W}, depend on \beta and \phi. We write \hat{\bm{J}}, \hat{\bm{I}} and \hat{\bm{W}} to indicate that \beta and \phi have been estimated with \hat{\beta} and \hat{\phi}.
  • If \bm{X} has full rank and g'(\mu) \neq 0, then \bm{I} is positive definite for any value of \beta and \phi.
  • Under the canonical link, we have \bm{J} = \bm{I}, and both matrices are positive definite if \text{rk}(\bm{X}) = p.

  • This implies that the log-likelihood function is concave because its second derivative is negative definite, so any solution to the estimating equations is also a global optimum.

  • The Fisher information matrix could also be computed exploiting the second Bartlett identity, namely i_{rs} = \mathbb{E}\left[\left(\frac{\partial}{\partial \beta_r}\ell(\beta; \phi)\right)\left(\frac{\partial}{\partial \beta_s}\ell(\beta; \phi)\right)\right], \qquad r,s = 1,\dots,p, as in Agresti (2015). Of course, the final result coincides with ours.

☠️ - Orthogonality of \beta and \phi

  • Let us now consider the case in which \phi is unknown, so that a_i(\phi) = \phi/\omega_i. We obtain: j_{r \phi} = - \frac{\partial}{\partial \beta_r}\frac{\partial}{\partial \phi}\ell(\beta; \phi) = \frac{1}{\phi^2}\sum_{i=1}^n \omega_i(y_i - \mu_i)\frac{\partial \theta_i}{\partial \beta_r}, \qquad r = 1,\dots,p, whose expected value is i_{r\phi} = \mathbb{E}(j_{r\phi}) = 0 since \mathbb{E}(Y_i) = \mu_i.
  • This means the Fisher information matrix accounting for \phi takes the form: \begin{bmatrix} \bm{I} & \bm{0} \\ \bm{0} & i_{\phi \phi} \end{bmatrix} \qquad\Longrightarrow\qquad \begin{bmatrix} \bm{I} & \bm{0} \\ \bm{0} & i_{\phi \phi} \end{bmatrix}^{-1} = \begin{bmatrix} \bm{I}^{-1} & \bm{0} \\ \bm{0} & 1 /i_{\phi \phi} \end{bmatrix} where [\bm{I}]_{rs} = i_{rs} are the elements associated to \beta as before.

The parameters \beta and \phi are orthogonal and their maximum likelihood estimates are asymptotically independent.

Moreover, the matrices \bm{I} and \bm{I}^{-1} are sufficient for inference on \beta, and there is no need to compute i_{\phi \phi}. Note that the maximum likelihood estimate \hat{\beta} can also be computed without knowing \phi.

Asymptotic distribution of \hat{\beta}

The asymptotic distribution of the maximum likelihood estimator is \hat{\beta} \, \dot{\sim} \, \text{N}_p\left(\beta, (\bm{X}^T\bm{W}\bm{X})^{-1}\right), for large values of n and under mild regularity conditions on \bm{X}.

  • This means that, under correct specification and mild conditions on \bm{X}, the maximum likelihood estimator is asymptotically unbiased, with known asymptotic variance: \mathbb{E}(\hat{\beta}) \approx \beta, \qquad \text{var}(\hat{\beta}) \approx (\bm{X}^T\bm{W}\bm{X})^{-1}.

  • In practice, since \bm{W} depends on \beta and \phi, we rely on the following approximation: \widehat{\text{var}}(\hat{\beta}) = (\bm{X}^T\hat{\bm{W}}\bm{X})^{-1}, where we plug the consistent estimates \hat{\beta} and \hat{\phi} into \bm{W}, obtaining \hat{\bm{W}}.

  • Note that the asymptotic variance implicitly takes into account heteroscedasticity.

Example: Aids data

  • In the Aids data, we specified a Poisson regression model with \mathbb{E}(Y_i) = \exp(\beta_1 + \beta_2 x_i) and estimated \hat{\beta} = (0.304, 0.259).

  • This means that the weights are estimated as \hat{\bm{W}} = \text{diag}(\hat{\mu}_1,\dots,\hat{\mu}_n) = \text{diag}(1.755, \dots, 50.863), from which we obtain the estimated Fisher information matrix: \bm{X}^T\hat{\bm{W}}\bm{X} = \begin{bmatrix} \sum_{i=1}^n\hat{\mu}_i & \sum_{i=1}^n x_i\hat{\mu}_i\\ \sum_{i=1}^n x_i\hat{\mu}_i & \sum_{i=1}^n x_i^2\hat{\mu}_i \end{bmatrix} = \begin{bmatrix} 217 & 2387\\ 2387 & 28279.05 \end{bmatrix}.

  • Hence, the estimated covariance matrix of the maximum likelihood estimator is \widehat{\text{var}}(\hat{\beta}) = (\bm{X}^T\hat{\bm{W}}\bm{X})^{-1} = \begin{bmatrix} 0.06445 & -0.00544 \\ -0.00544 & 0.00049 \end{bmatrix}.

  • Therefore the estimated standard errors are [\widehat{\text{se}}(\hat{\beta})]_j = \sqrt{[(\bm{X}^T\hat{\bm{W}}\bm{X})^{-1}]_{jj}} \quad\Longrightarrow \quad \widehat{\text{se}}(\hat{\beta}) = (0.254, 0.022).
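The whole computation can be reproduced from the earlier Poisson fit; a sketch in which, because of the log link, the estimated weights w_i are simply the fitted means.

X     <- model.matrix(fit_pois)
W_hat <- diag(fitted(fit_pois))          # for the log-link Poisson, w_i = mu_hat_i

info_hat <- t(X) %*% W_hat %*% X         # estimated Fisher information X^T W X
V_hat    <- solve(info_hat)              # estimated covariance matrix of beta_hat
sqrt(diag(V_hat))                        # standard errors, approximately (0.254, 0.022)

all.equal(unname(V_hat), unname(vcov(fit_pois)), tolerance = 1e-6)   # essentially identical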

Tests and confidence intervals

Example: Aids data


z test of coefficients:

            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.303655   0.253867  1.1961   0.2317    
period      0.258963   0.022238 11.6448   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
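Output in this format comes from the Wald z statistics of the fitted model; a hedged sketch using the earlier fit (the header shown above matches lmtest::coeftest, assuming that package is installed).

summary(fit_pois)$coefficients          # estimates, standard errors, z values, p-values
# library(lmtest); coeftest(fit_pois)   # prints a table like the one shown above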

Example: Aids data

                 2.5 %    97.5 %
(Intercept) -0.1939158 0.8012249
period       0.2153764 0.3025494
                 2.5 %    97.5 %
(Intercept) -0.2146576 0.7816308
period       0.2164770 0.3037480
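The first table is consistent with Wald intervals (estimate plus/minus z_{0.975} times the standard error), the second with profile-likelihood intervals; a sketch of how both can be computed from the earlier fit.

confint.default(fit_pois, level = 0.95)  # Wald intervals: beta_hat -/+ 1.96 * se
confint(fit_pois, level = 0.95)          # profile-likelihood intervals (uses MASS profiling)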

Delta method and fitted values

Example: Aids data

IRLS algorithm

Numerical methods for maximum likelihood estimation

Iteratively reweighted least squares I

Iteratively reweighted least squares II

Iteratively reweighted least squares III

Estimation of \phi

Deviance, model checking, residuals

Model selection

References

Agresti, A. (2015), Foundations of Linear and Generalized Linear Models, Wiley.
Azzalini, A. (2008), Inferenza statistica, Springer Verlag.
Efron, B. (2023), Exponential Families in Theory and Practice, Cambridge University Press.
Fisher, R. A. (1934), β€œTwo new properties of mathematical likelihood,” Proceedings of the Royal Society of London. Series A, 144, 285–307.
Salvan, A., Sartori, N., and Pace, L. (2020), Modelli lineari generalizzati, Springer.