Bayesian inference for generalized linear models via quasi-posteriors
Università degli Studi di Milano-Bicocca
2025-05-27
Davide Agnoletto (Duke University)
David Dunson (Duke University)
De Finetti’s representation theorem
Let (Y_n)_{n\ge 1}, Y_n\in\mathcal{Y}, be a sequence of exchangeable random variables with probability law P. Then there exists a unique probability measure \Pi such that, for any n\ge 1,
P(y_1,\ldots,y_n) = \int_{\mathcal{F}} \prod_{i=1}^n F(y_i)\,\Pi(\mathrm{d}F).
The hypothetical approach is the most common way of operating within the Bayesian community.
In a parametric setting, \Pi has support on a class of distributions indexed by a finite-dimensional parameter, \mathcal{F}_{\boldsymbol{\theta}}=\{F_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \Theta\subseteq\mathbb{R}^p\}, with p<\infty.
Bayes’ rule takes the well-known formulation: \pi(\boldsymbol{\theta}\mid y_1,\ldots,y_n) \propto \pi(\boldsymbol{\theta}) \prod_{i=1}^n f_{\boldsymbol{\theta}}(y_i), where \pi and f_{\boldsymbol{\theta}} denote the probability density functions associated with \Pi and F_{\boldsymbol{\theta}}, respectively.
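As a toy illustration of this parametric update (a minimal sketch with made-up data, not taken from the paper), the snippet below evaluates a Beta-Bernoulli posterior on a grid by multiplying prior and likelihood and normalizing numerically.

```python
import numpy as np

# Bernoulli likelihood with a Beta(2, 2) prior on theta; made-up binary data.
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])
theta = np.linspace(0.001, 0.999, 999)                    # grid over the parameter space

log_prior = np.log(theta) + np.log(1 - theta)             # Beta(2, 2) kernel, up to a constant
log_lik = y.sum() * np.log(theta) + (len(y) - y.sum()) * np.log(1 - theta)

log_post = log_prior + log_lik                            # pi(theta | y) is proportional to pi(theta) * prod_i f_theta(y_i)
post = np.exp(log_post - log_post.max())
post /= post.sum()                                        # normalize over the uniform grid

print("posterior mean:", (theta * post).sum())            # conjugate answer: (2 + 6) / (2 + 2 + 8) = 0.667
```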
However, when the link between observations and the parameter of interest cannot be expressed through a distribution function, the traditional hypothetical approach fails.
Solution: generalized posterior distributions, sometimes called Gibbs posteriors.
This is a lively recent topic; see, for instance: Chernozhukov and Hong (2003); Bissiri et al. (2016); Heide et al. (2020); Grünwald and Mehta (2020); Knoblauch et al. (2022); Matsubara et al. (2022); Matsubara et al. (2023); Jewson and Rossell (2022); Rigon et al. (2023).
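In this framework, the prior is updated through a loss function rather than a likelihood (Bissiri et al. 2016): \pi_\omega(\boldsymbol{\theta} \mid y_1,\ldots,y_n) \propto \pi(\boldsymbol{\theta}) \exp\left\{-\omega \sum_{i=1}^n \ell(\boldsymbol{\theta}; y_i)\right\}, where \omega > 0 is a loss-scale parameter.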
Learning the loss scale \omega from the data is a delicate task. Assuming a prior for \omega can lead to degenerate estimates if not accompanied by additional adjustments to the loss function.
However, there are several solutions for its calibration: Holmes and Walker (2017); Lyddon et al. (2019); Syring and Martin (2019); Matsubara et al. (2023).
Generalized linear models (GLMs) are routinely used to model a wide variety of data.
The Bayesian approach to GLMs is also extremely popular, e.g., because it naturally accommodates random effects, complex penalizations, prior information, and more.
However, GLMs often incur misspecification, which can compromise inferential conclusions.
A common case is overdispersion, i.e., when proportion or count observations show larger variability than that assumed by the model.
Traditional solutions have important drawbacks.
Let Y_i \in \mathcal{Y} denote a response variable, \boldsymbol{x}_i \in \mathbb{R}^p be a vector of covariates for i = 1, \ldots, n, and \boldsymbol{\beta} \in \mathbb{R}^p be the parameter of interest.
Standard GLMs assume that observations y_i are independent realizations of Y_i \mid \boldsymbol{x}_i, whose distribution belongs to the exponential dispersion family.
We assume the second-order conditions: \mathbb{E}\{Y_i\} = \mu_i = g^{-1}(\boldsymbol{x}_i^\top \boldsymbol{\beta}), \quad \mathrm{var}\{Y_i\} = \psi\, V(\mu_i), where g(\cdot) is a link function, V(\cdot)>0 is a variance function, and \psi \in (0,\infty) is a dispersion parameter.
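For instance, V(\mu)=\mu with the log link describes quasi-Poisson counts with \mathrm{var}(Y_i)=\psi\,\mu_i, while V(\mu)=\mu(1-\mu) with the logit link describes quasi-binomial proportions; setting \psi=1 recovers the nominal Poisson and Bernoulli variances.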
We let (\boldsymbol{\beta}_0, \psi_0) be the true values for the parameters (\boldsymbol{\beta}, \psi) and we assume the data are generated under F_0(\mathrm{d}y \mid \boldsymbol{x}) = F(\mathrm{d}y \mid \boldsymbol{x}, \boldsymbol{\beta}_0, \psi_0).
Under the second-order assumptions, it is possible to specify the so-called log-quasi-likelihood function (Wedderburn 1974): \ell_Q(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}, \psi) = \sum_{i=1}^n \ell_Q(\boldsymbol{\beta}; y_i, \boldsymbol{x}_i, \psi) = \sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{\psi V(t)} \, \mathrm{d}t, where a is an arbitrary constant that does not depend on \boldsymbol{\beta}.
The above integral can be written in closed form for many choices of variance functions, including those associated with exponential family distributions.
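For example, with the Poisson-type variance function V(t)=t one obtains \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{t}\,\mathrm{d}t = y_i \log \mu_i(\boldsymbol{\beta}) - \mu_i(\boldsymbol{\beta}) + c_i, where c_i = a - y_i \log a does not depend on \boldsymbol{\beta}; up to such constants this is exactly the Poisson log-likelihood, which the quasi-likelihood recovers when \psi = 1.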
Let \exp\{\ell_Q(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}, \psi)\} be the quasi-likelihood function and p(\boldsymbol{\beta}) be the prior distribution for \boldsymbol{\beta}. We define the quasi-posterior distribution for \boldsymbol{\beta} as: p_Q(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}, \psi) \propto p(\boldsymbol{\beta}) \exp \left\{ \ell_Q(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}, \psi) \right\} = p(\boldsymbol{\beta}) \exp \left\{ \frac{1}{\psi} \sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{V(t)} \, \mathrm{d}t \right\}.
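As a concrete illustration (a minimal sketch, not the paper's implementation), the code below evaluates this log quasi-posterior for overdispersed counts with V(\mu)=\mu and log link under a Gaussian prior, and samples it with a plain random-walk Metropolis step; the data, prior scale, and tuning parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_quasi_posterior(beta, y, X, psi, prior_sd=10.0):
    """Log quasi-posterior for V(mu) = mu with log link, up to an additive constant."""
    mu = np.exp(X @ beta)                               # mu_i = exp(x_i^T beta)
    log_quasi_lik = np.sum(y * np.log(mu) - mu) / psi   # closed-form integral for V(t) = t
    log_prior = -0.5 * np.sum(beta**2) / prior_sd**2    # independent N(0, prior_sd^2) prior
    return log_quasi_lik + log_prior

def rw_metropolis(y, X, psi, n_iter=5000, step=0.02):
    """Plain random-walk Metropolis sampler targeting the quasi-posterior."""
    beta = np.zeros(X.shape[1])
    draws = np.empty((n_iter, X.shape[1]))
    logp = log_quasi_posterior(beta, y, X, psi)
    for t in range(n_iter):
        prop = beta + step * rng.standard_normal(beta.shape)
        logp_prop = log_quasi_posterior(prop, y, X, psi)
        if np.log(rng.uniform()) < logp_prop - logp:     # Metropolis accept/reject
            beta, logp = prop, logp_prop
        draws[t] = beta
    return draws

# Tiny synthetic example (made-up data; psi treated as known here).
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.5])))
draws = rw_metropolis(y, X, psi=1.0)
print("posterior means:", draws[2500:].mean(axis=0))
```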
The quasi-posterior is a rational update of a belief distribution within the generalized Bayesian framework, with loss function: \ell(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}) = - \sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{V(t)} \, \mathrm{d}t.
The dispersion parameter \psi plays the role of a loss-scale parameter for the quasi-posterior.
The quasi-posterior represents subjective uncertainty about the unknown parameter value: \boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \int_{\mathcal{Y}} \ell(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}) \, F_0(d\mathbf{y} \mid \mathbf{X}), which is assumed to be unique (Bissiri et al. 2016).
The definition of \boldsymbol{\beta}^* does not automatically guarantee that \boldsymbol{\beta}^* = \boldsymbol{\beta}_0.
Theorem (Agnoletto, R., Dunson, 2025)
Assume the second-order conditions are well-specified, and suppose the target of inference \boldsymbol{\beta}^* is unique. Then, for quasi-posteriors, \boldsymbol{\beta}^* must coincide with the true value \boldsymbol{\beta}_0: \boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \int_{\mathcal{Y}} \sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{t - y_i}{V(t)} \, \mathrm{d}t \, F_0(d\mathbf{y} \mid \mathbf{X}) = \boldsymbol{\beta}_0.
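A heuristic for why this holds: exchanging differentiation and integration, the gradient of each summand of the expected loss is \frac{\partial}{\partial \boldsymbol{\beta}} \int_{\mathcal{Y}} \int_a^{\mu_i(\boldsymbol{\beta})} \frac{t - y_i}{V(t)}\,\mathrm{d}t\, F_0(\mathrm{d}y_i \mid \boldsymbol{x}_i) = \frac{\mu_i(\boldsymbol{\beta}) - \mu_i(\boldsymbol{\beta}_0)}{V\{\mu_i(\boldsymbol{\beta})\}}\, \frac{\partial \mu_i(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}, which vanishes at \boldsymbol{\beta} = \boldsymbol{\beta}_0: the first-order conditions are the quasi-score equations in expectation, and they only require the mean to be correctly specified.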
Proposition (Agnoletto, R., Dunson, 2025)
If the second-order conditions hold, namely \mathbb{E}(Y_i) = g^{-1}(\mathbf{x}_i^\top \boldsymbol{\beta}_0) and \mathrm{var}(Y_i) = \psi_0 V\{\mu_i(\boldsymbol{\beta}_0)\}, then for quasi-posteriors with loss \ell(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}) = -\sum_{i=1}^n \int_a^{\mu_i(\boldsymbol{\beta})} \frac{y_i - t}{V(t)} \mathrm{d}t, the loss-likelihood bootstrap loss scale satisfies \psi_{\text{LLB}} = \psi_0.
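In practice \psi_0 is unknown and has to be estimated. A standard moment-based option from the quasi-likelihood literature is the Pearson-type estimator \hat{\psi} = \frac{1}{n-p}\sum_{i=1}^n \frac{\{y_i - \mu_i(\hat{\boldsymbol{\beta}})\}^2}{V\{\mu_i(\hat{\boldsymbol{\beta}})\}}; this is shown only as an illustration and is not necessarily the calibration, nor the "alternative \hat{\psi}", used in the paper.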
Theorem (Agnoletto, R., Dunson, 2025)
Assume the second-order conditions are correctly specified. Let S_1, S_2, \ldots \subseteq \mathbb{R}^p be a sequence of convex credible sets of level \rho \in (0,1). Then, under mild conditions and setting \psi = \psi_0: \mathbb{P}(\boldsymbol{\beta}_0 \in S_n \mid \mathbf{y}, \mathbf{X}, \psi_0) \to \rho \quad \text{as } n \to \infty.
Data are generated from a distribution with \mathbb{E}(Y_i) = \mu_i(\boldsymbol{\beta}_0) = \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_0) and \mathrm{var}(Y_i) = \psi_0\, \mu_i(\boldsymbol{\beta}_0) (not a Poisson!), with parameters \boldsymbol{\beta}_0 = (3.5,\; 1.5,\; -1.0,\; 0.5) and \psi_0 = 3.5; one possible generating mechanism is sketched after the table below.
We computed posterior means and 95\% credible intervals.
The sample size is n = 100; estimates are averages over multiple simulated datasets.
| | | Poisson | Negative Binomial | DFD-Bayes | Quasi-posterior | Quasi-posterior + alternative \hat{\psi} |
|---|---|---|---|---|---|---|
| \beta_1 | Mean | 3.50 (0.035) | 3.49 (0.047) | 57.46 (105.41) | 3.50 (0.035) | 3.50 (0.035) |
| | Cover. | 0.715 | 0.920 | 0.201 | 0.945 | 0.970 |
| \beta_2 | Mean | 1.50 (0.020) | 1.51 (0.040) | 5.26 (6.31) | 1.50 (0.020) | 1.50 (0.020) |
| | Cover. | 0.675 | 0.960 | 0.454 | 0.945 | 0.990 |
| \beta_3 | Mean | -1.00 (0.020) | -1.01 (0.034) | -3.98 (6.53) | -1.00 (0.020) | -1.00 (0.020) |
| | Cover. | 0.715 | 0.995 | 0.479 | 0.950 | 0.965 |
| \beta_4 | Mean | 0.50 (0.018) | 0.50 (0.037) | 2.55 (7.37) | 0.50 (0.018) | 0.50 (0.018) |
| | Cover. | 0.655 | 0.965 | 0.526 | 0.950 | 0.970 |
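For reference, counts matching this mean-variance relationship can be generated, for instance, from a negative binomial with success probability 1/\psi_0; the paper does not state its exact simulation design, so the covariate design and seed below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, psi0, n = np.array([3.5, 1.5, -1.0, 0.5]), 3.5, 100

X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # made-up design: intercept + 3 covariates
mu = np.exp(X @ beta0)                                      # E(Y_i) = exp(x_i^T beta0)

# Negative binomial with success probability 1/psi0 has mean mu_i and variance psi0 * mu_i
# (requires psi0 > 1); one possible mechanism, not necessarily the paper's.
y = rng.negative_binomial(mu / (psi0 - 1.0), 1.0 / psi0)

print("Pearson-type check (should be close to psi0):", np.mean((y - mu) ** 2 / mu))
```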
Same data-generating process as before: \mathbb{E}(Y_i) = \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_0), \mathrm{var}(Y_i) = \psi_0\, \mu_i(\boldsymbol{\beta}_0), \boldsymbol{\beta}_0 = (3.5,\; 1.5,\; -1.0,\; 0.5), \psi_0 = 3.5.
We again computed posterior means and 95\% credible intervals.
The sample size is now n = 1000; estimates are averages over multiple simulated datasets.
| | | Poisson | Negative Binomial | DFD-Bayes | Quasi-posterior | Quasi-posterior + alternative \hat{\psi} |
|---|---|---|---|---|---|---|
| \beta_1 | Mean | 3.50 (0.010) | 3.50 (0.015) | 4.64 (8.45) | 3.50 (0.010) | 3.50 (0.010) |
| | Cover. | 0.690 | 0.835 | 0.070 | 0.945 | 0.955 |
| \beta_2 | Mean | 1.50 (0.005) | 1.50 (0.012) | 1.87 (1.37) | 1.50 (0.005) | 1.50 (0.005) |
| | Cover. | 0.665 | 0.910 | 0.510 | 0.950 | 0.960 |
| \beta_3 | Mean | -1.00 (0.005) | -1.00 (0.010) | -1.21 (0.81) | -1.00 (0.005) | -1.00 (0.005) |
| | Cover. | 0.680 | 0.960 | 0.690 | 0.955 | 0.950 |
| \beta_4 | Mean | 0.50 (0.005) | 0.50 (0.009) | 0.54 (0.35) | 0.50 (0.005) | 0.50 (0.005) |
| | Cover. | 0.715 | 0.950 | 0.810 | 0.950 | 0.940 |
The main paper is:
Agnoletto, D., Rigon, T., and Dunson, D. B. (2025+). Bayesian inference for generalized linear models via quasi-posteriors. Biometrika, to appear.