Bayesian inference for generalized linear models via quasi-posteriors
Università degli Studi di Milano-Bicocca
2025-05-27
Davide Agnoletto (Duke University)
David Dunson (Duke University)
De Finetti’s representation theorem
Let (Y_n)_{n\ge 1}, Y_n\in\mathcal{Y}, be a sequence of exchangeable random variables with probability law P. Then there exists a unique probability measure \Pi such that, for any n\ge 1,
P(y_1,\ldots,y_n) = \int_{\mathcal{F}} \prod_{i=1}^n F(y_i)\,\Pi(\mathrm{d}F).
The hypothetical approach is the most common way of operating within the Bayesian community.
In a parametric setting, \Pi has support on a class of distributions indexed by a finite-dimensional parameter, \mathcal{F}_{\boldsymbol{\theta}}=\{F_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \Theta\subseteq\mathbb{R}^p\}, with p<\infty.
Bayes’ rule takes the well-known formulation: \pi(\boldsymbol{\theta}\mid y_1,\ldots,y_n) \propto \pi(\boldsymbol{\theta}) \prod_{i=1}^n f_{\boldsymbol{\theta}}(y_i), where \pi and f_{\boldsymbol{\theta}} denote the probability density functions associated with \Pi and F_{\boldsymbol{\theta}}, respectively.
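As a toy illustration of this parametric update (a minimal sketch with made-up data, not taken from the paper), the snippet below evaluates a Beta-Bernoulli posterior on a grid by multiplying prior and likelihood and normalizing numerically.

```python
import numpy as np

# Bernoulli likelihood with a Beta(2, 2) prior on theta; made-up binary data.
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])
theta = np.linspace(0.001, 0.999, 999)                    # grid over the parameter space

log_prior = np.log(theta) + np.log(1 - theta)             # Beta(2, 2) kernel, up to a constant
log_lik = y.sum() * np.log(theta) + (len(y) - y.sum()) * np.log(1 - theta)

log_post = log_prior + log_lik                            # pi(theta | y) is proportional to pi(theta) * prod_i f_theta(y_i)
post = np.exp(log_post - log_post.max())
post /= post.sum()                                        # normalize over the uniform grid

print("posterior mean:", (theta * post).sum())            # conjugate answer: (2 + 6) / (2 + 2 + 8) = 0.667
```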
However, when the link between observations and the parameter of interest cannot be expressed through a distribution function, the traditional hypothetical approach fails.
Solution: generalized posterior distributions, sometimes called Gibbs posteriors.
This is a lively recent topic; see, for instance: Chernozhukov and Hong (2003); Bissiri et al. (2016); Heide et al. (2020); Grünwald and Mehta (2020); Knoblauch et al. (2022); Matsubara et al. (2022); Matsubara et al. (2023); Jewson and Rossell (2022); Rigon et al. (2023).
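In this framework, the prior is updated through a loss function rather than a likelihood (Bissiri et al. 2016): \pi_\omega(\boldsymbol{\theta} \mid y_1,\ldots,y_n) \propto \pi(\boldsymbol{\theta}) \exp\left\{-\omega \sum_{i=1}^n \ell(\boldsymbol{\theta}; y_i)\right\}, where \omega > 0 is a loss-scale parameter.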
Learning the loss scale \omega from the data is a delicate task. Assuming a prior for \omega can lead to degenerate estimates if not accompanied by additional adjustments to the loss function.
However, there are several solutions for its calibration: Holmes and Walker (2017); Lyddon et al. (2019); Syring and Martin (2019); Matsubara et al. (2023).
Generalized linear models (GLMs) are routinely used to model a wide variety of data.
The Bayesian approach to GLMs is also extremely popular, e.g., because it naturally accommodates random effects, complex penalizations, prior information, and more.
However, GLMs often incur misspecification, which can compromise inferential conclusions.
A common case is overdispersion, i.e., when proportion or count observations show larger variability than that assumed by the model.
Traditional solutions have important drawbacks.
Let Y_i \in \mathcal{Y} denote a response variable, \boldsymbol{x}_i \in \mathbb{R}^p be a vector of covariates for i = 1, \ldots, n, and \boldsymbol{\beta} \in \mathbb{R}^p be the parameter of interest.
Standard GLMs assume that observations y_i are independent realizations of Y_i \mid \boldsymbol{x}_i, whose distribution belongs to the exponential dispersion family.
We assume the second-order conditions: \mathbb{E}\{Y_i\} = \mu_i = g^{-1}(\boldsymbol{x}_i^\top \boldsymbol{\beta}), \quad \mathrm{var}\{Y_i\} = \psi\, V(\mu_i), where g(\cdot) is a link function, V(\cdot)>0 is a variance function, and \psi \in (0,\infty) is a dispersion parameter.
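For instance, V(\mu)=\mu with the log link describes quasi-Poisson counts with \mathrm{var}(Y_i)=\psi\,\mu_i, while V(\mu)=\mu(1-\mu) with the logit link describes quasi-binomial proportions; setting \psi=1 recovers the nominal Poisson and Bernoulli variances.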
We let (\boldsymbol{\beta}_0, \psi_0) be the true values for the parameters (\boldsymbol{\beta}, \psi) and we assume the data are generated under F_0(\mathrm{d}y \mid \boldsymbol{x}) = F(\mathrm{d}y \mid \boldsymbol{x}, \boldsymbol{\beta}_0, \psi_0).
Under the second-order assumptions, it is possible to specify the so-called log-quasi-likelihood function (Wedderburn 1974): \ell_Q(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}, \psi) = \sum_{i=1}^n \ell_Q(\boldsymbol{\beta}; y_i, \boldsymbol{x}_i, \psi) = \sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{\psi V(t)} \, \mathrm{d}t, where a is an arbitrary constant that does not depend on \boldsymbol{\beta}.
The above integral can be written in closed form for many choices of variance functions, including those associated with exponential family distributions.
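For example, with the Poisson-type variance function V(t)=t one obtains \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{t}\,\mathrm{d}t = y_i \log \mu_i(\boldsymbol{\beta}) - \mu_i(\boldsymbol{\beta}) + c_i, where c_i = a - y_i \log a does not depend on \boldsymbol{\beta}; up to such constants this is exactly the Poisson log-likelihood, which the quasi-likelihood recovers when \psi = 1.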
Let \exp\{\ell_Q(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}, \psi)\} be the quasi-likelihood function and p(\boldsymbol{\beta}) be the prior distribution for \boldsymbol{\beta}. We define the quasi-posterior distribution for \boldsymbol{\beta} as: p_Q(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}, \psi) \propto p(\boldsymbol{\beta}) \exp \left\{ \ell_Q(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}, \psi) \right\} = p(\boldsymbol{\beta}) \exp \left\{ \frac{1}{\psi} \sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{V(t)} \, \mathrm{d}t \right\}.
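As a concrete illustration (a minimal sketch, not the paper's implementation), the code below evaluates this log quasi-posterior for overdispersed counts with V(\mu)=\mu and log link under a Gaussian prior, and samples it with a plain random-walk Metropolis step; the data, prior scale, and tuning parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_quasi_posterior(beta, y, X, psi, prior_sd=10.0):
    """Log quasi-posterior for V(mu) = mu with log link, up to an additive constant."""
    mu = np.exp(X @ beta)                               # mu_i = exp(x_i^T beta)
    log_quasi_lik = np.sum(y * np.log(mu) - mu) / psi   # closed-form integral for V(t) = t
    log_prior = -0.5 * np.sum(beta**2) / prior_sd**2    # independent N(0, prior_sd^2) prior
    return log_quasi_lik + log_prior

def rw_metropolis(y, X, psi, n_iter=5000, step=0.02):
    """Plain random-walk Metropolis sampler targeting the quasi-posterior."""
    beta = np.zeros(X.shape[1])
    draws = np.empty((n_iter, X.shape[1]))
    logp = log_quasi_posterior(beta, y, X, psi)
    for t in range(n_iter):
        prop = beta + step * rng.standard_normal(beta.shape)
        logp_prop = log_quasi_posterior(prop, y, X, psi)
        if np.log(rng.uniform()) < logp_prop - logp:     # Metropolis accept/reject
            beta, logp = prop, logp_prop
        draws[t] = beta
    return draws

# Tiny synthetic example (made-up data; psi treated as known here).
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.5])))
draws = rw_metropolis(y, X, psi=1.0)
print("posterior means:", draws[2500:].mean(axis=0))
```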
The quasi-posterior is a rational update of a belief distribution within the generalized Bayesian framework, with loss function: \ell(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}) = - \sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{V(t)} \, \mathrm{d}t.
The dispersion parameter \psi plays the role of a loss-scale parameter for the quasi-posterior.
The quasi-posterior represents subjective uncertainty about the unknown parameter value: \boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \int_{\mathcal{Y}} \ell(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}) \, F_0(d\mathbf{y} \mid \mathbf{X}), which is assumed to be unique (Bissiri et al. 2016).
The definition of \boldsymbol{\beta}^* does not automatically guarantee that \boldsymbol{\beta}^* = \boldsymbol{\beta}_0.
Theorem (Agnoletto, R., Dunson, 2025)
Assume the second-order conditions are well-specified, and suppose the target of inference \boldsymbol{\beta}^* is unique. Then, for quasi-posteriors, \boldsymbol{\beta}^* must coincide with the true value \boldsymbol{\beta}_0: \boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \int_{\mathcal{Y}} \sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{t - y_i}{V(t)} \, \mathrm{d}t \, F_0(d\mathbf{y} \mid \mathbf{X}) = \boldsymbol{\beta}_0.
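A heuristic for why this holds: exchanging differentiation and integration, the gradient of each summand of the expected loss is \frac{\partial}{\partial \boldsymbol{\beta}} \int_{\mathcal{Y}} \int_a^{\mu_i(\boldsymbol{\beta})} \frac{t - y_i}{V(t)}\,\mathrm{d}t\, F_0(\mathrm{d}y_i \mid \boldsymbol{x}_i) = \frac{\mu_i(\boldsymbol{\beta}) - \mu_i(\boldsymbol{\beta}_0)}{V\{\mu_i(\boldsymbol{\beta})\}}\, \frac{\partial \mu_i(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}, which vanishes at \boldsymbol{\beta} = \boldsymbol{\beta}_0: the first-order conditions are the quasi-score equations in expectation, and they only require the mean to be correctly specified.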
Proposition (Agnoletto, R., Dunson, 2025)
If the second-order conditions hold, namely \mathbb{E}(Y_i) = g^{-1}(\mathbf{x}_i^\top \boldsymbol{\beta}_0) and \mathrm{var}(Y_i) = \psi_0 V\{\mu_i(\boldsymbol{\beta}_0)\}, then for quasi-posteriors with loss \ell(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}) = -\sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{V(t)} \mathrm{d}t, the loss-likelihood bootstrap loss scale satisfies \psi_{\text{LLB}} = \psi_0.
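In practice \psi_0 is unknown and has to be estimated. A standard moment-based option from the quasi-likelihood literature is the Pearson-type estimator \hat{\psi} = \frac{1}{n-p}\sum_{i=1}^n \frac{\{y_i - \mu_i(\hat{\boldsymbol{\beta}})\}^2}{V\{\mu_i(\hat{\boldsymbol{\beta}})\}}; this is shown only as an illustration and is not necessarily the calibration, nor the "alternative \hat{\psi}", used in the paper.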
Theorem (Agnoletto, R., Dunson, 2025)
Assume the second-order conditions are correctly specified. Let S_1, S_2, \ldots \subseteq \mathbb{R}^p be a sequence of convex credible sets of level \rho \in (0,1). Then, under mild conditions and setting \psi = \psi_0: \mathbb{P}(\boldsymbol{\beta}_0 \in S_n \mid \mathbf{y}, \mathbf{X}, \psi_0) \to \rho \quad \text{as } n \to \infty.
Data are generated from a distribution with \mathbb{E}(Y_i) = \mu_i(\boldsymbol{\beta}_0) = \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_0) and \mathrm{var}(Y_i) = \psi_0\, \mu_i(\boldsymbol{\beta}_0) (not a Poisson!), with parameters \boldsymbol{\beta}_0 = (3.5,\; 1.5,\; -1.0,\; 0.5) and \psi_0 = 3.5; one possible generating mechanism is sketched after the table below.
We computed posterior means and 95\% credible intervals.
The sample size is n = 100; estimates are averages over multiple simulated datasets.
| | | Poisson | Negative Binomial | DFD-Bayes | Quasi-posterior | Quasi-posterior + alternative \hat{\psi} |
|---|---|---|---|---|---|---|
| \beta_1 | Mean | 3.50 (0.035) | 3.49 (0.047) | 57.46 (105.41) | 3.50 (0.035) | 3.50 (0.035) |
| | Cover. | 0.715 | 0.920 | 0.201 | 0.945 | 0.970 |
| \beta_2 | Mean | 1.50 (0.020) | 1.51 (0.040) | 5.26 (6.31) | 1.50 (0.020) | 1.50 (0.020) |
| | Cover. | 0.675 | 0.960 | 0.454 | 0.945 | 0.990 |
| \beta_3 | Mean | -1.00 (0.020) | -1.01 (0.034) | -3.98 (6.53) | -1.00 (0.020) | -1.00 (0.020) |
| | Cover. | 0.715 | 0.995 | 0.479 | 0.950 | 0.965 |
| \beta_4 | Mean | 0.50 (0.018) | 0.50 (0.037) | 2.55 (7.37) | 0.50 (0.018) | 0.50 (0.018) |
| | Cover. | 0.655 | 0.965 | 0.526 | 0.950 | 0.970 |
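For reference, counts matching this mean-variance relationship can be generated, for instance, from a negative binomial with success probability 1/\psi_0; the paper does not state its exact simulation design, so the covariate design and seed below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, psi0, n = np.array([3.5, 1.5, -1.0, 0.5]), 3.5, 100

X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # made-up design: intercept + 3 covariates
mu = np.exp(X @ beta0)                                      # E(Y_i) = exp(x_i^T beta0)

# Negative binomial with success probability 1/psi0 has mean mu_i and variance psi0 * mu_i
# (requires psi0 > 1); one possible mechanism, not necessarily the paper's.
y = rng.negative_binomial(mu / (psi0 - 1.0), 1.0 / psi0)

print("Pearson-type check (should be close to psi0):", np.mean((y - mu) ** 2 / mu))
```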
Same data-generating process as before: \mathbb{E}(Y_i) = \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_0), \mathrm{var}(Y_i) = \psi_0\, \mu_i(\boldsymbol{\beta}_0), \boldsymbol{\beta}_0 = (3.5,\; 1.5,\; -1.0,\; 0.5), \psi_0 = 3.5.
We again computed posterior means and 95\% credible intervals.
The sample size is now n = 1000; estimates are averages over multiple simulated datasets.
| | | Poisson | Negative Binomial | DFD-Bayes | Quasi-posterior | Quasi-posterior + alternative \hat{\psi} |
|---|---|---|---|---|---|---|
| \beta_1 | Mean | 3.50 (0.010) | 3.50 (0.015) | 4.64 (8.45) | 3.50 (0.010) | 3.50 (0.010) |
| | Cover. | 0.690 | 0.835 | 0.070 | 0.945 | 0.955 |
| \beta_2 | Mean | 1.50 (0.005) | 1.50 (0.012) | 1.87 (1.37) | 1.50 (0.005) | 1.50 (0.005) |
| | Cover. | 0.665 | 0.910 | 0.510 | 0.950 | 0.960 |
| \beta_3 | Mean | -1.00 (0.005) | -1.00 (0.010) | -1.21 (0.81) | -1.00 (0.005) | -1.00 (0.005) |
| | Cover. | 0.680 | 0.960 | 0.690 | 0.955 | 0.950 |
| \beta_4 | Mean | 0.50 (0.005) | 0.50 (0.009) | 0.54 (0.35) | 0.50 (0.005) | 0.50 (0.005) |
| | Cover. | 0.715 | 0.950 | 0.810 | 0.950 | 0.940 |
The main paper is:
Agnoletto, D., Rigon, T., and Dunson, D. B. (2025+). Bayesian inference for generalized linear models via quasi-posteriors. Biometrika, to appear.