BISP14 workshop
Bayesian inference for generalized linear models via quasi-posteriors
Warm thanks
Davide Agnoletto (Duke University)
David Dunson (Duke University)
Foundations
- De Finetti’s representation Theorem (De Finetti 1937) has a central role in Bayesian statistics because it provides the fundamental justification for the two approaches to Bayesian inference:
- the hypothetical approach;
- the predictive approach.
- While representing opposite interpretations of the Theorem, the two approaches are intrinsically connected.
Hypothetical approach
The hypothetical approach is the most common way of operating within the Bayesian community.
In a parametric setting, the prior \Pi has support on \Theta\subseteq\mathbb{R}^p, p<\infty, with each \boldsymbol{\theta}\in\Theta indexing a distribution in the class \mathcal{F}_{\Theta}=\{F_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \Theta\subseteq\mathbb{R}^p\}.
Bayes’ rule takes the well-known formulation: \pi(\boldsymbol{\theta}\mid y_1,\ldots,y_n) \propto \pi(\boldsymbol{\theta}) \prod_{i=1}^n f_{\boldsymbol{\theta}}(y_i), where \pi and f_{\boldsymbol{\theta}} denote the probability density functions associated with \Pi and F_{\boldsymbol{\theta}}, respectively.
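For instance, with binary observations y_i \in \{0, 1\}, f_{\theta}(y_i) = \theta^{y_i}(1 - \theta)^{1 - y_i} and a \mathrm{Beta}(a, b) prior, this update yields \pi(\theta \mid y_1, \ldots, y_n) = \mathrm{Beta}\left(a + \sum_{i=1}^n y_i,\; b + n - \sum_{i=1}^n y_i\right).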
However, when the link between the observations and the parameter of interest cannot be expressed through a distribution function, the traditional hypothetical approach fails.
Solution: generalized posterior distributions, sometimes called Gibbs posteriors.
This is a lively recent topic; see, for instance: Chernozhukov and Hong (2003); Bissiri et al. (2016); Heide et al. (2020); Grünwald and Mehta (2020); Knoblauch et al. (2022); Matsubara et al. (2022); Matsubara et al. (2023); Jewson and Rossell (2022); Rigon et al. (2023).
Generalizations of the hypothetical approach
- Bissiri et al. (2016) showed that the generalized posterior \pi_\omega(\boldsymbol{\theta} \mid \mathbf{y}_{1:n}) \propto \pi(\boldsymbol{\theta}) \exp\left\{ - \omega \sum_{i=1}^n \ell(\boldsymbol{\theta}; y_i) \right\} is the only coherent update of the prior beliefs about \boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}\in\Theta} \int_{\mathcal{Y}} \ell(\boldsymbol{\theta}; y)\, F_0(\mathrm{d}y), where \ell(\boldsymbol{\theta}; y) is a loss function, \omega is the loss scale, and F_0 is the unknown true sampling distribution (a toy numerical sketch is given after this list).
Learning the loss scale \omega from the data is a delicate task. Assuming a prior for \omega can lead to degenerate estimates if not accompanied by additional adjustments to the loss function.
However, there are several solutions for its calibration: Holmes and Walker (2017); Lyddon et al. (2019); Syring and Martin (2019); Matsubara et al. (2023).
- Our contribution: Bayesian inference for generalized linear models via quasi-posteriors.
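To make the update above concrete, here is a toy numerical sketch of a generalized (Gibbs) posterior evaluated on a grid; the absolute-error loss, the Gaussian prior, the fixed loss scale \omega, and the simulated data are all illustrative assumptions, not choices made in the talk.

```python
import numpy as np

# Toy sketch of a generalized (Gibbs) posterior on a grid: absolute-error loss
# ell(theta; y) = |y - theta| (so theta* is the population median), a N(0, 5^2)
# prior, and a fixed loss scale omega -- all illustrative assumptions.

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=200)       # skewed data: median != mean

omega = 1.0                                    # loss scale, held fixed here
grid = np.linspace(0.0, 5.0, 2001)
dx = grid[1] - grid[0]

log_prior = -0.5 * (grid / 5.0) ** 2
loss = np.abs(y[:, None] - grid[None, :]).sum(axis=0)   # sum_i |y_i - theta|
log_post = log_prior - omega * loss

post = np.exp(log_post - log_post.max())       # unnormalized density on the grid
post /= post.sum() * dx                        # normalize numerically

print("generalized-posterior mean:", np.sum(grid * post) * dx)
print("sample median:", np.median(y))
```

With this loss the target \theta^* is the population median, so the generalized posterior concentrates around the sample median rather than the sample mean.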
Generalized Bayes for GLMs
Generalized linear models (GLMs) are routinely used to model a wide variety of data.
The Bayesian approach to GLMs is also very popular, e.g., because it naturally accommodates random effects, complex penalties, prior information, and more.
However, GLMs are often misspecified, which can compromise inferential conclusions.
A common case is overdispersion, i.e., proportion or count data exhibiting larger variability than the model assumes.
Traditional solutions have important drawbacks:
- Model-based: may lead to computational bottlenecks and can result again in misspecification.
- Nonparametric: increased computational cost and loss of efficiency and interpretability.
- We rely on a semi-parametric approach, making only assumptions on the mean and variance of the response while preserving computational tractability.
Second order assumptions
Let Y_i \in \mathcal{Y} denote a response variable, \boldsymbol{x}_i \in \mathbb{R}^p be a vector of covariates for i = 1, \ldots, n, and \boldsymbol{\beta} \in \mathbb{R}^p be the parameter of interest.
Standard GLMs assume that the observations y_i are independent realizations of Y_i \mid \boldsymbol{x}_i, whose distribution belongs to the exponential dispersion family.
Here, instead, we only impose second-order assumptions: \mathbb{E}(Y_i \mid \boldsymbol{x}_i) = \mu_i(\boldsymbol{\beta}) = g^{-1}(\boldsymbol{x}_i^\top \boldsymbol{\beta}) and \mathrm{var}(Y_i \mid \boldsymbol{x}_i) = \psi V\{\mu_i(\boldsymbol{\beta})\}, for a known link function g, a known variance function V, and a dispersion parameter \psi > 0.
- Although the mean and variance functions need to be carefully chosen to fit the data, the resulting inferences are robust to misspecification of higher-order moments.
Quasi-likelihood
Under the second-order assumptions, it is possible to specify the so-called log-quasi-likelihood function (Wedderburn 1974): \ell_Q(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}, \psi) = \sum_{i=1}^n \ell_Q(\boldsymbol{\beta}; y_i, \boldsymbol{x}_i, \psi) = \sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{\psi V(t)} \, \mathrm{d}t, where a is an arbitrary constant that does not depend on \boldsymbol{\beta}.
The above integral can be written in closed form for many choices of variance functions, including those associated with exponential family distributions.
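For instance, with the Poisson-type variance function V(t) = t (an illustrative choice), \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{\psi V(t)} \, \mathrm{d}t = \frac{1}{\psi}\left\{ y_i \log \mu_i(\boldsymbol{\beta}) - \mu_i(\boldsymbol{\beta}) \right\} + \text{const}, where the constant collects the terms in a that do not depend on \boldsymbol{\beta}; up to additive constants, this is 1/\psi times the Poisson log-likelihood.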
- Quasi-likelihoods retain many properties of genuine likelihoods, such as unbiased estimating equations and the information identity: \mathbb{E}\left\{ \nabla \ell_Q(\boldsymbol{\beta}; \mathbf{Y}, \mathbf{X}, \psi) \right\} = 0, \qquad \mathbb{E}\left\{ -\nabla^2 \ell_Q(\boldsymbol{\beta}; \mathbf{Y}, \mathbf{X}, \psi) \right\} = \mathbb{E}\left\{ \nabla \ell_Q \nabla \ell_Q^\top \right\}, where \nabla denotes the gradient with respect to \boldsymbol{\beta} (the quasi-score is written out explicitly after this list).
- Under mild conditions, the maximum quasi-likelihood estimator is consistent and has the smallest asymptotic variance among estimators derived from linear estimating equations (McCullagh 1983).
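To see why the estimating equation is unbiased, differentiate the integral defining \ell_Q: by the Leibniz rule, \nabla \ell_Q(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}, \psi) = \sum_{i=1}^n \frac{y_i - \mu_i(\boldsymbol{\beta})}{\psi V\{\mu_i(\boldsymbol{\beta})\}} \nabla \mu_i(\boldsymbol{\beta}). Each summand depends on the data only through y_i - \mu_i(\boldsymbol{\beta}), which has zero mean when \mathbb{E}(Y_i \mid \boldsymbol{x}_i) = \mu_i(\boldsymbol{\beta}), so the expected quasi-score vanishes at that value of \boldsymbol{\beta}.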
Quasi-posteriors I
The quasi-posterior is a rational update of a belief distribution within the generalized Bayesian framework, with loss function: \ell(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}) = - \sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{V(t)} \, \mathrm{d}t.
The dispersion parameter \psi plays the role of a loss-scale parameter for the quasi-posterior.
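A minimal computational sketch of sampling from such a quasi-posterior follows; the variance function V(\mu) = \mu, the log link, the Gaussian prior, and the plain random-walk Metropolis kernel are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

# Sketch: random-walk Metropolis targeting a quasi-posterior with V(mu) = mu
# and log link, so the log-quasi-likelihood is sum_i {y_i log(mu_i) - mu_i}/psi,
# combined with independent N(0, 10^2) priors on the coefficients (assumptions).

rng = np.random.default_rng(0)

def log_quasi_lik(beta, y, X, psi):
    mu = np.exp(X @ beta)                        # log link
    return np.sum(y * np.log(mu) - mu) / psi     # closed form for V(mu) = mu

def log_quasi_post(beta, y, X, psi, prior_sd=10.0):
    return log_quasi_lik(beta, y, X, psi) - 0.5 * np.sum((beta / prior_sd) ** 2)

def rw_metropolis(y, X, psi, n_iter=5000, step=0.02):
    beta = np.zeros(X.shape[1])
    draws = np.empty((n_iter, X.shape[1]))
    logp = log_quasi_post(beta, y, X, psi)
    for t in range(n_iter):
        prop = beta + step * rng.standard_normal(X.shape[1])
        logp_prop = log_quasi_post(prop, y, X, psi)
        if np.log(rng.uniform()) < logp_prop - logp:   # Metropolis accept/reject
            beta, logp = prop, logp_prop
        draws[t] = beta
    return draws
```

Any standard MCMC scheme could be used here; the random-walk kernel is simply the shortest one to write down.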
Quasi-posteriors II
The quasi-posterior represents subjective uncertainty about the unknown parameter value: \boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \int_{\mathcal{Y}^n} \ell(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}) \, F_0(\mathrm{d}\mathbf{y} \mid \mathbf{X}), which is assumed to be unique (Bissiri et al. 2016).
The definition of \boldsymbol{\beta}^* does not automatically guarantee that \boldsymbol{\beta}^* = \boldsymbol{\beta}_0, the data-generating value (a sufficient condition is sketched below).
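Differentiating under the integral (assuming the usual regularity conditions, an assumption here), the first-order condition for \boldsymbol{\beta}^* is \sum_{i=1}^n \frac{\mathbb{E}_{F_0}(Y_i \mid \boldsymbol{x}_i) - \mu_i(\boldsymbol{\beta}^*)}{V\{\mu_i(\boldsymbol{\beta}^*)\}} \nabla \mu_i(\boldsymbol{\beta}^*) = 0. If the mean function is correctly specified, i.e. \mathbb{E}_{F_0}(Y_i \mid \boldsymbol{x}_i) = \mu_i(\boldsymbol{\beta}_0), then \boldsymbol{\beta}_0 solves this equation; if the solution is unique, it follows that \boldsymbol{\beta}^* = \boldsymbol{\beta}_0.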
Calibration of the dispersion I
- Based on a comparison with the Bayesian bootstrap, Lyddon et al. (2019) propose calibrating the dispersion \psi by setting it equal to: \psi_{\text{LLB}} = \frac{\mathrm{tr}\{j(\boldsymbol{\beta}_0)\}}{\mathrm{tr}\{j(\boldsymbol{\beta}_0) h(\boldsymbol{\beta}_0)^{-1} j(\boldsymbol{\beta}_0)\}}, where we define j(\boldsymbol{\beta}) := \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\nabla^2 \ell(\boldsymbol{\beta}; \mathbf{Y}, \mathbf{X})\right], \qquad h(\boldsymbol{\beta}) := \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\nabla \ell(\boldsymbol{\beta}; \mathbf{Y}, \mathbf{X}) \nabla \ell(\boldsymbol{\beta}; \mathbf{Y}, \mathbf{X})^\top\right].
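A sketch of this calibration in code, written for V(\mu) = \mu with log link and with sample averages replacing the limiting expectations (all illustrative assumptions, not the paper's code):

```python
import numpy as np

# Plug-in estimate of the Lyddon-Holmes-Walker loss scale psi_LLB for the
# quasi-posterior loss with V(mu) = mu and log link. Per observation:
# grad loss = -(y_i - mu_i) x_i and Hessian of the loss = mu_i x_i x_i^T.

def psi_llb(beta_hat, y, X):
    mu = np.exp(X @ beta_hat)
    n = len(y)
    j_hat = (X * mu[:, None]).T @ X / n                  # plug-in for j(beta_0)
    h_hat = (X * ((y - mu) ** 2)[:, None]).T @ X / n     # plug-in for h(beta_0)
    jhj = j_hat @ np.linalg.solve(h_hat, j_hat)          # j h^{-1} j
    return np.trace(j_hat) / np.trace(jhj)
```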
Calibration of the dispersion II
- As \psi_0 is typically unknown, we can use the classical method of moments estimator: \widehat{\psi} = \frac{1}{n - p} \sum_{i=1}^n \frac{(y_i - \widehat{\mu}_i)^2}{V(\widehat{\mu}_i)}, where \widehat{\mu}_i = \mu_i(\widehat{\boldsymbol{\beta}}), which is fast and consistent (McCullagh and Nelder 1989).
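A minimal sketch of this estimator, written for V(\mu) = \mu with log link (illustrative choices) and taking \widehat{\boldsymbol{\beta}} as the maximum quasi-likelihood estimate:

```python
import numpy as np

# Method-of-moments estimator of the dispersion from the slide:
# sum_i (y_i - mu_hat_i)^2 / V(mu_hat_i) divided by n - p, with V(mu) = mu.

def psi_mom(beta_hat, y, X):
    mu_hat = np.exp(X @ beta_hat)
    n, p = X.shape
    return np.sum((y - mu_hat) ** 2 / mu_hat) / (n - p)
```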
Simulation results I
Data are generated from a distribution with \mathbb{E}(Y_i) = \mu_i(\boldsymbol{\beta}_0)= \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_0) and \mathrm{var}(Y_i) = \psi_0 \mu_i(\boldsymbol{\beta}_0) (not a Poisson!), with parameters \boldsymbol{\beta}_0 = (3.5,\; 1.5,\; -1.0,\; 0.5) and \psi_0 = 3.5.
We computed posterior means and 95\% credible intervals.
The sample size is n = 100; estimates are averages over multiple simulated datasets.
| | | Poisson | Negative Binomial | DFD-Bayes | Quasi-posterior | Quasi-posterior + alternative \hat{\psi} |
|---|---|---|---|---|---|---|
| \beta_1 | Mean | 3.50 (0.035) | 3.49 (0.047) | 57.46 (105.41) | 3.50 (0.035) | 3.50 (0.035) |
| | Cover. | 0.715 | 0.920 | 0.201 | 0.945 | 0.970 |
| \beta_2 | Mean | 1.50 (0.020) | 1.51 (0.040) | 5.26 (6.31) | 1.50 (0.020) | 1.50 (0.020) |
| | Cover. | 0.675 | 0.960 | 0.454 | 0.945 | 0.990 |
| \beta_3 | Mean | -1.00 (0.020) | -1.01 (0.034) | -3.98 (6.53) | -1.00 (0.020) | -1.00 (0.020) |
| | Cover. | 0.715 | 0.995 | 0.479 | 0.950 | 0.965 |
| \beta_4 | Mean | 0.50 (0.018) | 0.50 (0.037) | 2.55 (7.37) | 0.50 (0.018) | 0.50 (0.018) |
| | Cover. | 0.655 | 0.965 | 0.526 | 0.950 | 0.970 |
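One concrete data-generating mechanism with exactly these first two moments is sketched below; the specific negative binomial construction and the covariate design are illustrative assumptions, since the slides only fix the mean and the variance.

```python
import numpy as np

# Generate counts with E(Y_i) = exp(x_i' beta_0) and var(Y_i) = psi_0 * E(Y_i):
# a negative binomial with success probability 1/psi_0 and size mu_i/(psi_0 - 1)
# has exactly these two moments. The covariate design is an arbitrary choice.

rng = np.random.default_rng(1)

beta_0 = np.array([3.5, 1.5, -1.0, 0.5])
psi_0 = 3.5
n_obs = 100

X = np.column_stack([np.ones(n_obs), rng.uniform(-0.5, 0.5, size=(n_obs, 3))])
mu = np.exp(X @ beta_0)
y = rng.negative_binomial(n=mu / (psi_0 - 1.0), p=1.0 / psi_0)
```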
Simulation results II
Data are generated from a distribution with \mathbb{E}(Y_i) = \mu_i(\boldsymbol{\beta}_0)= \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_0) and \mathrm{var}(Y_i) = \psi_0 \mu_i(\boldsymbol{\beta}_0) (not a Poisson!), with parameters \boldsymbol{\beta}_0 = (3.5,\; 1.5,\; -1.0,\; 0.5) and \psi_0 = 3.5.
We computed posterior means and 95\% credible intervals.
The sample size is n = 1000; estimates are averages over multiple simulated datasets.
| | | Poisson | Negative Binomial | DFD-Bayes | Quasi-posterior | Quasi-posterior + alternative \hat{\psi} |
|---|---|---|---|---|---|---|
| \beta_1 | Mean | 3.50 (0.010) | 3.50 (0.015) | 4.64 (8.45) | 3.50 (0.010) | 3.50 (0.010) |
| | Cover. | 0.690 | 0.835 | 0.070 | 0.945 | 0.955 |
| \beta_2 | Mean | 1.50 (0.005) | 1.50 (0.012) | 1.87 (1.37) | 1.50 (0.005) | 1.50 (0.005) |
| | Cover. | 0.665 | 0.910 | 0.510 | 0.950 | 0.960 |
| \beta_3 | Mean | -1.00 (0.005) | -1.00 (0.010) | -1.21 (0.81) | -1.00 (0.005) | -1.00 (0.005) |
| | Cover. | 0.680 | 0.960 | 0.690 | 0.955 | 0.950 |
| \beta_4 | Mean | 0.50 (0.005) | 0.50 (0.009) | 0.54 (0.35) | 0.50 (0.005) | 0.50 (0.005) |
| | Cover. | 0.715 | 0.950 | 0.810 | 0.950 | 0.940 |
Thank you!
The main paper is:
Agnoletto, D., Rigon, T., and Dunson, D. B. (2025+). Bayesian inference for generalized linear models via quasi-posteriors. Biometrika, to appear.