Statistics III - CdL SSE
Università degli Studi di Milano-Bicocca
GLMs for count data are very common and have theoretical connections with binary and binomial models.
This unit focuses on Poisson regression models.
I will not cover the analysis of contingency tables.
Such a topic is nonetheless discussed in the textbook but is not part of the exam.
The most important aspects have already been covered in Unit B.
In a Poisson regression model, we observe Y_i independent Poisson random variables, so that Y_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i), \qquad g(\mu_i) = \eta_i = \bm{x}_i^T \beta, \qquad i=1,\dots,n.
The canonical link is g(\cdot) = \log(\cdot); the multiplicative structure it implies for the mean is discussed below.
Under the canonical link, the likelihood equations are \sum_{i=1}^n(y_i - \mu_i)x_{ir} = 0, \qquad r=1,\dots,p. The solution therefore has a nice interpretation as a method of moments estimator, in that \sum_{i=1}^n y_i x_{ir} = \sum_{i=1}^n\mathbb{E}(Y_i) x_{ir}, \qquad r=1,\dots,p.
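As an illustration (a minimal sketch with simulated data; the variable names are hypothetical), the likelihood equations can be checked numerically after fitting a Poisson GLM with glm() in R:

```r
set.seed(1)
n  <- 200
x  <- runif(n)                       # a single covariate
mu <- exp(0.5 + 1.2 * x)             # true mean under the log link
y  <- rpois(n, mu)                   # simulated Poisson responses

fit <- glm(y ~ x, family = poisson)  # canonical (log) link by default

# Likelihood equations: sum_i (y_i - mu_hat_i) x_{ir} = 0 for every column of X
X <- model.matrix(fit)
colSums((y - fitted(fit)) * X)       # numerically zero
```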
Under the logarithmic link, the mean has a multiplicative structure, namely \mu_i = \exp(\bm{x}_i^T \beta) = \exp(\beta_1)^{x_{i1}} \times \cdots \times \exp(\beta_p)^{x_{ip}} = \prod_{j=1}^p \alpha_j^{x_{ij}}, \qquad \alpha_j = \exp(\beta_j).
As a result, a unit increase of the jth covariate, from x_{ij} to x_{ij} + 1, has the following impact on the new mean, say \mu_\text{new}: \mu_\text{new} = \alpha_1^{x_{i1}} \times \cdots \times \alpha_j^{x_{ij} + 1} \times \cdots \times \alpha_p^{x_{ip}} = \alpha_j \left( \alpha_1^{x_{i1}} \times \cdots \times \alpha_p^{x_{ip}} \right) = \alpha_j \mu_i. In other words, the regression parameters, once exponentiated, can be interpreted as relative changes of the mean, namely \alpha_j - 1 = \exp(\beta_j) - 1 = \frac{\mu_\text{new} - \mu_i}{\mu_i}.
The interpretation in terms of relative changes is a consequence of the logarithmic link function. The same interpretation therefore applies whenever this link is used, for instance in a Gamma GLM.
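Continuing the sketch above (it reuses the simulated data and the fit object), the exponentiated coefficients reported by R can be read as multiplicative effects on the mean, and \alpha_j - 1 as a relative change:

```r
alpha <- exp(coef(fit))              # alpha_j = exp(beta_j)
alpha["x"] - 1                       # relative change of the mean per unit increase of x

# Check: the fitted mean at x + 1 equals alpha["x"] times the fitted mean at x
mu0 <- predict(fit, newdata = data.frame(x = 0), type = "response")
mu1 <- predict(fit, newdata = data.frame(x = 1), type = "response")
mu1 / mu0                            # equals exp(coef(fit))["x"]
```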
Often the expected value of a response count Y_i is proportional to an index t_i, the exposure.
For instance, t_i might be an amount of time and/or a population size, such as in modeling crime counts for various cities. Or, it might be a spatial area, such as in modeling counts of plant species.
In these cases, the sample rate is Y_i / t_i, with expected value \mu_i / t_i. With explanatory variables, a model for the expected rate under a logarithmic link has the form
\log\left(\frac{\mu_i}{t_i}\right) = \bm{x}_i^T\beta \qquad \implies \qquad \log{\mu_i} = \bm{x}_i^T\beta + \log{t_i}.
Because \log(\mu_i / t_i) = \log{\mu_i} - \log{t_i}, the model makes the adjustment \log{t_i} to the linear predictor. This adjustment term is called an offset, implemented in R using the offset option.
The fit corresponds to using \log{t_i} as an explanatory variable in the linear predictor for \log(\mu_i) and forcing its coefficient to equal 1.
Summarising, for this model the response counts Y_i \sim \text{Poisson}(\mu_i) satisfy \mu_i = t_i \exp(\bm{x}_i^T\beta): the mean is proportional to the exposure t_i, with a proportionality constant that depends on the values of the covariates.
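A minimal sketch of the rate model in R, assuming simulated exposures t_i (all names are illustrative): the term \log{t_i} enters the linear predictor as an offset whose coefficient is fixed at 1.

```r
set.seed(2)
n  <- 150
t  <- runif(n, 10, 1000)                       # exposures (e.g., population sizes)
x  <- rnorm(n)
mu <- t * exp(-3 + 0.4 * x)                    # mu_i = t_i * exp(x_i^T beta)
y  <- rpois(n, mu)

# Offset in the formula: log(t) is added to the linear predictor with coefficient 1
fit_rate <- glm(y ~ x + offset(log(t)), family = poisson)

# Equivalent specification through the offset argument of glm()
fit_rate2 <- glm(y ~ x, family = poisson, offset = log(t))

cbind(coef(fit_rate), coef(fit_rate2))         # identical estimates
```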
In Poisson regression the main assumption is that Y_i \sim \text{Poisson}(\mu_i), implying that \text{var}(Y_i) = \mu_i, where implicitly we have set \phi = 1.
However, the analysis of the residuals or the value of the X^2 statistic may reveal that the data exhibit overdispersion, namely that the correct model satisfies \text{var}(Y_i) = \textcolor{red}{\phi} \mu_i, with \phi > 1. In that case the Poisson regression model is misspecified.
The two most common solutions to overdispersion are the following: (i) quasi-likelihood (quasi-Poisson) estimation, which keeps the Poisson mean structure but estimates the dispersion \phi from the data, typically through the Pearson X^2 statistic; (ii) the negative binomial model, which accounts for the extra variability through an additional parameter.
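As a rough illustration (simulated data, deliberately generated with extra-Poisson variability), the Pearson X^2 statistic divided by its degrees of freedom estimates \phi, and a quasi-Poisson fit reports this estimate directly:

```r
set.seed(3)
n  <- 300
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)
y  <- rnbinom(n, mu = mu, size = 2)            # overdispersed counts: var > mean

fit_pois <- glm(y ~ x, family = poisson)

# Estimate of phi: Pearson X^2 / (n - p); values well above 1 suggest overdispersion
X2 <- sum(residuals(fit_pois, type = "pearson")^2)
X2 / df.residual(fit_pois)

# Quasi-Poisson: same point estimates, standard errors inflated by sqrt(phi_hat)
fit_quasi <- glm(y ~ x, family = quasipoisson)
summary(fit_quasi)$dispersion
```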
In practice, the frequency of zero outcomes is often larger than expected under a Poisson regression.
Because the mode of a Poisson distribution is the integer part of its mean, a Poisson GLM can be inadequate when the mean is relatively large but the modal response is 0.
Such data are called zero-inflated. This often occurs when the population is a mixture of two groups: one group whose response is necessarily zero (structural zeros) and another group whose counts follow an ordinary count distribution.
Example: the number of times individuals report exercising (e.g., going to a gym) in the past week: many individuals never exercise, producing structural zeros, while the counts for those who do exercise may be well described by a Poisson distribution.
The two most common solutions to zero-inflation are the following: (i) zero-inflated models (e.g., the zero-inflated Poisson, ZIP), which mix a point mass at zero with a standard count distribution; (ii) hurdle models, which model the zero/positive dichotomy and the positive counts separately.
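A minimal sketch of a zero-inflation diagnostic (simulated data; the mixing proportion and names are hypothetical): compare the observed fraction of zeros with the fraction implied by a fitted Poisson GLM.

```r
set.seed(4)
n  <- 500
x  <- rnorm(n)
mu <- exp(1 + 0.3 * x)
never <- rbinom(n, 1, 0.4)                     # structural zeros (e.g., "never exercises")
y <- ifelse(never == 1, 0, rpois(n, mu))       # zero-inflated Poisson counts

fit_pois <- glm(y ~ x, family = poisson)

# Observed versus Poisson-implied proportion of zeros
mean(y == 0)
mean(dpois(0, fitted(fit_pois)))               # much smaller than the observed proportion

# A zero-inflated or hurdle model can then be fitted, e.g. with zeroinfl() or
# hurdle() from the pscl package.
```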