Exponential families
Statistical Inference - PhD EcoStatData
This unit will cover the following topics:
- One-parameter and multiparameter exponential families
- Likelihood, inference, sufficiency and completeness
- Exponential dispersion families
The prime role of exponential families in the theory of statistical inference was first emphasized by Fisher (1934).
Most well-known distributions—such as Gaussian, Poisson, Binomial, and Gamma—are instances of exponential families.
Exponential families are the distributions typically considered when presenting the usual “regularity conditions”.
- With a few minor exceptions, this presentation will closely follow Chapters 5 and 6 of Pace and Salvan (1997).
Overview
- Figure 1 of Efron (2023). Three levels of statistical modeling.
One-parameter exponential families
Exponential tilting
Let Y be a non-degenerate random variable with support \mathcal{Y} \subseteq \mathbb{R} and density f_0(y) with respect to a dominating measure \nu(\mathrm{d}y).
We aim to build a parametric family \mathcal{F} = \{f(\cdot\,;\theta) : \theta \in \Theta \subseteq \mathbb{R} \} with common support \mathcal{Y} such that f_0 is a special case, namely f_0 \in \mathcal{F}.
A strategy for doing this is called exponential tilting, namely we could set f(y; \theta) \propto e^{\theta y}f_0(y). Thus, if f(y;\theta) is generated via exponential tilting, then f(y; 0) = e^0 f_0(y) = f_0(y).
Let us define the mapping M_0:\mathbb{R}\rightarrow (0,\infty], M_0(\theta):=\int_\mathcal{Y}e^{\theta y}f_0(y)\,\nu(\mathrm{d}y), \qquad \theta \in \mathbb{R}. If M_0(\theta) is finite in a neighborhood of the origin, it is the moment generating function of Y.
Moreover, we define the set \tilde{\Theta} \subseteq \mathbb{R} as the set of all \theta such that M_0(\theta) is finite, i.e. \tilde{\Theta} = \{\theta \in \mathbb{R} : M_0(\theta) < \infty\}.
Natural exponential family of order one
- The mapping K(\theta) = K_0(\theta) = \log{M_0(\theta)} is the cumulant generating function of f_0. It is finite if and only if M_0(\theta) is finite.
The natural parameter space \tilde{\Theta} is the widest possible and must be an interval; see exercises. The family \mathcal{F}_{\text{en}}^1 is said to be full, whereas a subfamily of \mathcal{F}_{\text{en}}^1 with \Theta \subseteq \tilde{\Theta} is non-full.
By definition, all the densities f(y;\theta) \in \mathcal{F}_{\text{en}}^1 have the same support.
Moment generating function
- In regular problems, the functions M_0(\theta) and K_0(\theta) associated with a r.v. Y with density f_0 are finite in a neighborhood of the origin. A sufficient condition is that \tilde{\Theta} is an open set (regular \mathcal{F}_\text{en}^1).
Cumulant generating function
Refer to Pace and Salvan (1997), Section 3.2.5 for detailed derivations. Standardized cumulants \kappa_3/\kappa_2^{3/2} and \kappa_4/\kappa_2^2 are the skewness and the (excess of) kurtosis of Y.
Example: uniform distribution đź“–
Let Y \sim \text{Unif}(0,1) so that f_0(y) = 1 for y \in [0,1]. The exponential tilting of f_0 gives f(y; \theta) \propto e^{\theta y}f_0(y) = e^{\theta y}, \qquad y \in [0,1], \quad \theta \in \mathbb{R}.
The normalizing constant, that is, the moment generating function, is M_0(\theta)= \mathbb{E}(e^{\theta Y}) = \int_0^1 e^{\theta y} \mathrm{d}y = \frac{e^{\theta y}}{\theta}\Big|_{y=0}^{y=1} = \frac{e^\theta - 1}{\theta}, \qquad \theta \neq 0, with M_0(0) = 1. Note that M_0 is continuous since \lim_{\theta \to 0}(e^\theta - 1)/\theta = 1.
Consequently, we have M_0(\theta) < \infty for all \theta \in \mathbb{R} and the natural parameter space is \tilde{\Theta} = \mathbb{R}, which is an open set. The resulting density is f(y; \theta) = \frac{\theta e^{\theta y}}{e^{\theta} - 1} = \exp\{\theta y - K(\theta)\}, \qquad y \in [0, 1], where K(\theta) = \log\{(e^\theta - 1)/\theta\}.
It holds in general that \tilde{\Theta} = \mathbb{R} whenever f_0 has bounded support; thus, the family is regular.
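As a quick numerical sanity check (a minimal sketch in base R, not part of the original notes), one can compare the closed-form M_0(\theta) with numerical integration and verify that the tilted density integrates to one:

```r
# Minimal sketch: tilted uniform family.
M0 <- function(theta) ifelse(theta == 0, 1, (exp(theta) - 1) / theta)
theta <- 1.5
integrate(function(y) exp(theta * y), 0, 1)$value   # matches M0(1.5)
M0(theta)
f_tilted <- function(y) theta * exp(theta * y) / (exp(theta) - 1)
integrate(f_tilted, 0, 1)$value                     # ~ 1
```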
Example: Poisson distribution đź“–
Let Y \sim \text{Poisson}(1) so that f_0(y) = e^{-1}/y! for y \in \mathbb{N}. The exponential tilting of f_0 gives f(y; \theta) \propto e^{\theta y}f_0(y) = \frac{e^{\theta y}e^{-1}}{y!}, \qquad y \in \mathbb{N}, \quad \theta \in \mathbb{R}.
The normalizing constant, that is, the moment generating function, is M_0(\theta)= \mathbb{E}(e^{\theta Y}) = e^{-1}\sum_{y=0}^\infty \frac{e^{\theta y}}{y!} = \exp\{e^\theta - 1\}, \qquad \theta \in \mathbb{R}.
Consequently, we have M_0(\theta) < \infty for all \theta \in \mathbb{R} and the natural parameter space is \tilde{\Theta} = \mathbb{R}, which is an open set. The resulting density is f(y; \theta) = \frac{e^{\theta y} e^{-1}}{y!}\frac{e^{-e^\theta}}{e^{-1}} = \frac{e^{-1}}{y!}\exp\{\theta y - (e^\theta - 1)\} = \frac{\lambda^y e^{-\lambda}}{y!}, \qquad y \in \mathbb{N}, so that K(\theta) = e^\theta - 1, having defined \lambda = e^\theta.
In other words, the tilted density is again a Poisson distribution with mean e^\theta.
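A one-line numerical check (a sketch in base R) that tilting a Poisson(1) baseline by \theta gives a Poisson distribution with mean e^\theta:

```r
# Sketch: the tilted Poisson(1) density coincides with dpois(., exp(theta)).
theta <- 0.7
y <- 0:15
tilted <- dpois(y, lambda = 1) * exp(theta * y - (exp(theta) - 1))  # f0(y) exp{theta*y - K(theta)}
max(abs(tilted - dpois(y, lambda = exp(theta))))                    # numerically zero
```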
Example: exponential family generated by a Gaussian đź“–
Let Y \sim \text{N}(0,1) so that f_0(y) = 1/(\sqrt{2\pi})e^{-y^2/2} for y \in \mathbb{R}. The exponential tilting of f_0 gives f(y; \theta) \propto e^{\theta y}f_0(y) = \frac{1}{\sqrt{2\pi}}e^{\theta y -y^2/2}, \qquad y,\theta \in \mathbb{R}.
The normalizing constant, that is, the moment generating function, is M_0(\theta)= \mathbb{E}(e^{\theta Y}) = \frac{1}{\sqrt{2\pi}}\int_\mathbb{R}e^{\theta y -y^2/2}\mathrm{d}y = e^{\theta^2/2}, \qquad \theta \in \mathbb{R}.
Consequently, we have M_0(\theta) < \infty for all \theta \in \mathbb{R} and the natural parameter space is \tilde{\Theta} = \mathbb{R}, which is an open set. The resulting density is f(y; \theta) = \frac{1}{\sqrt{2\pi}}e^{\theta y}e^{-y^2/2}e^{-\theta^2/2} = \frac{e^{-y^2/2}}{\sqrt{2\pi}}\exp\{\theta y - \theta^2/2\} = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(y - \theta)^2}, \qquad y \in \mathbb{R}, so that K(\theta) = \theta^2/2.
In other words, the tilted density is again a Gaussian distribution with mean \theta.
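Analogously, a minimal sketch checking that tilting a standard Gaussian shifts its mean to \theta:

```r
# Sketch: the tilted N(0,1) density coincides with dnorm(., mean = theta).
theta <- -0.4
y <- seq(-4, 4, by = 0.5)
tilted <- dnorm(y) * exp(theta * y - theta^2 / 2)   # f0(y) exp{theta*y - K(theta)}
max(abs(tilted - dnorm(y, mean = theta)))           # numerically zero
```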
Closure under exponential tilting đź“–
Let \mathcal{F}_{\text{en}}^1 be an exponential family with parameter \psi and natural parameter space \tilde{\Psi}, with density f(y; \psi) = f_0(y)\exp\{\psi y - K(\psi)\}. The exponential tilting of f(y; \psi) gives f(y; \theta, \psi) \propto e^{\theta y} f(y; \psi) \propto f_0(y) \exp\{(\theta + \psi)y\}, and the normalizing constant of f_0(y) \exp\{(\theta + \psi)y\} is therefore
\int_\mathcal{Y} f_0(y) \exp\{(\theta + \psi)y\} \, \nu(\mathrm{d}y) = M_0(\theta + \psi). Thus, for any \theta and \psi such that M_0(\theta + \psi) < \infty, the corresponding density is f(y; \theta, \psi) = f_0(y) \exp\{(\theta + \psi)y - K(\theta + \psi)\}, which is again a member of the exponential family \mathcal{F}_{\text{en}}^1, with updated parameter \theta + \psi.
Moments and cumulants
The functions M_0(\theta) and K(\theta) = K_0(\theta) of an \mathcal{F}_\text{en}^1 refer to the baseline density f_0(y). Indeed, for any fixed \theta, the moment generating function of f(y; \theta) \in \mathcal{F}_\text{en}^1 is M_\theta(t) := \int_\mathcal{Y} e^{ty} f(y; \theta)\, \nu(\mathrm{d}y) = \frac{1}{M_0(\theta)} \int_\mathcal{Y} e^{(t + \theta)y} f_0(y)\, \nu(\mathrm{d}y) = \frac{M_0(t + \theta)}{M_0(\theta)}, \quad t + \theta \in \tilde{\Theta}.
Consequently, the cumulant generating function of f(y; \theta) relates to K_0 as follows: K_\theta(t) = \log M_\theta(t) = K_0(t + \theta) - K_0(\theta), \quad t + \theta \in \tilde{\Theta}.
Textbooks sometimes suppress additive constants in defining K_0(\theta), e.g. using e^\theta instead of e^\theta-1. This is inconsequential (constants cancel in K_\theta(t)) but somewhat misleading.
Mean value mapping I
- Moments and cumulants exist for every \theta \in \text{int}\:\tilde{\Theta}. In particular, the cumulants are \kappa_k = \frac{\partial^k}{\partial t^k} K_\theta(t) \Big|_{t = 0} = \frac{\partial^k}{\partial t^k} \left[ K(t + \theta) - K(\theta) \right] \Big|_{t = 0} = \frac{\partial^k}{\partial \theta^k} K(\theta), \qquad k \ge 1.
If f_0 is non-degenerate, then \text{var}_\theta(Y) > 0, implying that K(\theta) is a strictly convex function and that \mu(\theta) is smooth and strictly increasing, namely a one-to-one map.
Thus, if \mathcal{F}_\text{en}^1 is a regular exponential family, then \tilde{\Theta} = \text{int}\:\tilde{\Theta} and \mu(\theta) is a reparametrization.
Mean value mapping II
Hence, if \mathcal{Y} = \{0, 1, \dots, N\}, then C = [0,N]. If \mathcal{Y} = \mathbb{N}, then C = [0, \infty). If \mathcal{Y} = \mathbb{R}, then C = \mathbb{R}.
Because of the properties of expectations, \mu(\theta) \in \text{int}\:C(\mathcal{Y}) for all \theta \in \text{int}\:\tilde{\Theta}, namely \mathcal{M} \subseteq \text{int}\:C(\mathcal{Y}). Indeed, \text{int}\:C(\mathcal{Y}) is an open interval whose extremes are the infimum and supremum of \mathcal{Y}.
Both definitions naturally generalize to the multivariate case when C, \mathcal{Y} \subseteq \mathbb{R}^p, for p > 1.
Mean value mapping III đź“–
In a regular exponential family, the mean value mapping \mu(\theta) is a reparametrization, meaning that for each \theta \in \tilde{\Theta}, there exists a unique mean \mu \in \mathcal{M} such that \mu = \mu(\theta).
Moreover, in regular families, a much stronger result holds: for each value of y \in \text{int}\:C(\mathcal{Y}), there exists a unique \theta \in \tilde{\Theta} such that \mu(\theta) = y.
This establishes a duality between the expectation space \mathcal{M} and the sample space. Any value in \text{int}\:C can be “reached”, that is, there exists a distribution f(y; \theta) with that mean.
This correspondence is crucial in maximum likelihood estimation and inference.
This theorem can actually be strengthened: a necessary and sufficient condition for \mathcal{M} = \text{int}\:C is that the family \mathcal{F}_\text{en}^1 is steep (a regular family is also steep); see Pace and Salvan (1997).
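To illustrate this duality numerically (a sketch using the tilted-uniform family from the earlier example, not part of the original notes), any y in \text{int}\:C = (0, 1) can be matched by solving \mu(\theta) = y:

```r
# Sketch: for the tilted uniform family, solve mu(theta) = y for a given y in (0, 1).
M0 <- function(theta) integrate(function(y) exp(theta * y), 0, 1)$value
mu <- function(theta) integrate(function(y) y * exp(theta * y), 0, 1)$value / M0(theta)
y_obs <- 0.8
theta_hat <- uniroot(function(theta) mu(theta) - y_obs, interval = c(-50, 50))$root
c(theta_hat, mu(theta_hat))   # mu(theta_hat) equals 0.8
```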
A non-regular and non-steep exponential family
- Let us consider an exponential family \mathcal{F}_\text{en}^1 generated by the density f_0(y) = c \frac{e^{-|y|}}{1 + y^4}, \qquad y \in \mathbb{R}, for some normalizing constant c > 0. The exponential tilting of f_0 gives f(y; \theta) \propto e^{\theta y}f_0(y) \propto \frac{e^{-|y| + \theta y}}{1 + y^4}, \qquad y \in \mathbb{R}, \quad \theta \in \tilde{\Theta}.
- The function M_0(\theta) is unavailable in closed form; however, \tilde{\Theta} = [-1,1] since M_0(\theta) < \infty if and only if \theta \in [-1, 1].
- Since \tilde{\Theta} is a closed set, the exponential family is not regular (and is not steep either). In fact, one can show that \lim_{\theta \to 1} \mu(\theta) = a < \infty, implying that \mathcal{M} = (-a, a), \qquad \text{ whereas } \qquad \text{int}\:C = \mathbb{R}.
- In other words, there are no values of \theta such that \mu(\theta) = y for any y > a, which implies, for instance, that the method of moments will encounter difficulties in estimating \theta.
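The failure of steepness can be seen numerically (a minimal sketch; the normalizing constant c cancels in \mu(\theta)):

```r
# Sketch: for f0(y) proportional to exp(-|y|) / (1 + y^4), the mean mapping
# mu(theta) stays bounded as theta -> 1, even though int C = R.
kern <- function(y) exp(-abs(y)) / (1 + y^4)
M0 <- function(theta) integrate(function(y) exp(theta * y) * kern(y), -Inf, Inf)$value
mu <- function(theta)
  integrate(function(y) y * exp(theta * y) * kern(y), -Inf, Inf)$value / M0(theta)
sapply(c(0, 0.9, 0.99, 0.999, 1), mu)   # approaches a finite limit a
```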
Variance function I đź“–
- The importance of the variance function V(\mu) is related to the following characterization result due to Morris (1982).
Variance function II đź“–
The characterization theorem of Morris (1982) is constructive in nature, as its proof provides a practical way of determining K(\theta) from (\mathcal{M}, V(\mu)). In particular, the function K(\cdot) must satisfy K\left(\int_{\mu_0}^\mu \frac{1}{V(m)}\mathrm{d}m\right) = \int_{\mu_0}^\mu \frac{m}{V(m)}\mathrm{d}m, where \mu_0 is an arbitrary point in \mathcal{M}.
For example, let \mathcal{M} = (0, \infty) and V(\mu) = \mu^2. Then, choosing \mu_0=1 gives K\left(1 - \frac{1}{\mu}\right) = \log\mu, and therefore \theta(\mu) = 1 - 1/\mu, giving \tilde{\Theta} = (-\infty, 1) and \mu(\theta) = (1 - \theta)^{-1}. Hence we obtain K(\theta) = -\log(1 - \theta), which corresponds to the exponential density f_0(y) = e^{-y}, for y > 0.
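A small numerical check (a sketch) that the recovered K(\theta) = -\log(1 - \theta) indeed has variance function V(\mu) = \mu^2:

```r
# Sketch: check numerically that K(theta) = -log(1 - theta) satisfies K''(theta) = mu(theta)^2.
K  <- function(theta) -log(1 - theta)
mu <- function(theta) 1 / (1 - theta)
theta <- 0.3; h <- 1e-4
K2 <- (K(theta + h) - 2 * K(theta) + K(theta - h)) / h^2   # numerical second derivative
c(K2, mu(theta)^2)                                          # approximately equal
```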
Well-known exponential families
Notation | \text{N}(\psi, 1) | \text{Poisson}(\psi) | \text{Bin}(N, \psi) | \text{Gamma}(\nu,\psi), \nu > 0 |
---|---|---|---|---|
\mathcal{Y} | \mathbb{R} | \mathbb{N} | \{0, 1, \dots, N\} | (0, \infty) |
Natural param. | | | | |
\theta(\psi) | \psi | \log{\psi} | \log\{\psi/(1 - \psi)\} | 1 - \psi |
f_0(y) | (\sqrt{2\pi})^{-1}e^{-\frac{1}{2}y^2} | e^{-1}/ y! | \binom{N}{y}\left(\frac{1}{2}\right)^N | y^{\nu - 1}e^{-y}/\Gamma(\nu) |
K(\theta) | \theta^2/2 | e^\theta-1 | N \log(1 + e^\theta) - N\log{2} | -\nu \log(1-\theta) |
\tilde{\Theta} | \mathbb{R} | \mathbb{R} | \mathbb{R} | (-\infty, 1) |
Mean param. | | | | |
\mu(\theta) | \theta | e^\theta | N e^\theta/(1 + e^{\theta}) | \nu/(1-\theta) |
\mathcal{M} | \mathbb{R} | (0, \infty) | (0, N) | (0, \infty) |
V(\mu) | 1 | \mu | \mu(1 - \mu/ N) | \mu^2/\nu |
Quadratic variance functions
There is more in the paper by Morris (1982). Specifically, he focused on the subclass of quadratic variance functions, which can be written as V(\mu) = a + b\mu + c\mu^2, for some known constants a, b, and c.
Morris (1982) showed that, up to transformations such as convolution, there exist only six families within \mathcal{F}_\text{en}^1 that possess a quadratic variance function. These are: (i) the normal, (ii) the Poisson, (iii) the gamma, (iv) the binomial, (v) the negative binomial, and (vi) a sixth family.
The sixth (less known) distribution is called the generalized hyperbolic secant, and it has density f(y; \theta) = \frac{\exp\left\{\theta y + \log\cos{\theta}\right\}}{2\cosh(\pi y/2)}, \qquad y \in \mathbb{R}, \quad \theta \in (-\pi/2, \pi/2), with cumulant generating function K(\theta) = -\log\cos\theta, mean function \mu(\theta) = \tan{\theta} and variance function V(\mu) = \sec^2(\theta) = 1 + \mu^2, and \mathcal{M} = \mathbb{R}. It is also a regular exponential family.
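A quick numerical check (a sketch) of the generalized hyperbolic secant density: it integrates to one and its mean equals \tan\theta.

```r
# Sketch: generalized hyperbolic secant at theta = 0.8 (inside (-pi/2, pi/2)).
ghs <- function(y, theta) exp(theta * y + log(cos(theta))) / (2 * cosh(pi * y / 2))
theta <- 0.8
integrate(function(y) ghs(y, theta), -Inf, Inf)$value       # ~ 1
integrate(function(y) y * ghs(y, theta), -Inf, Inf)$value   # ~ tan(theta)
tan(theta)
```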
A general definition of exponential families I
A general definition of exponential families II
When s(y) is an arbitrary function of y, then \mathcal{F}_\text{e}^1 is broader than \mathcal{F}_\text{en}^1.
Without loss of generality, we can focus on the natural parametrization \theta \in \Theta and a baseline density h(y) = f_0(y), meaning that f(y;\theta) \in \mathcal{F}_\text{e}^1 can be written as f(y; \theta) = f_0(y)\exp\{\theta s(y) - K(\theta)\}, because the general case would be a reparametrization of this one.
Let Y \sim f(y; \theta), with f(y; \theta) \in \mathcal{F}_\text{e}^1. Then, the random variable S = s(Y) has density f_S(s; \theta) = \tilde{f}_0(s)\exp\{\theta s - K(\theta)\}, for some baseline density \tilde{f}_0(s), namely f_S(s; \theta) \in \mathcal{F}_\text{en}^1. If, in addition, s(y) is a one-to-one mapping, then Y = s^{-1}(S) is just a transformation of an \mathcal{F}_\text{en}^1.
Multiparameter exponential families
Natural exponential families of order p
Let Y be a non-degenerate random variable with support \mathcal{Y} \subseteq \mathbb{R}^p and density f_0(y) with respect to a dominating measure \nu(\mathrm{d}y).
Let us define the mapping M_0:\mathbb{R}^p\rightarrow (0,\infty], M_0(\theta):=\int_\mathcal{Y}e^{\theta^T y}f_0(y)\,\nu(\mathrm{d}y), \qquad \theta \in \mathbb{R}^p.
- The family \mathcal{F}_{\text{en}}^p is said to be full, whereas a subfamily of \mathcal{F}_{\text{en}}^p with \Theta \subseteq \tilde{\Theta} is non-full. Moreover, the family \mathcal{F}_{\text{en}}^p is said to be regular if \tilde{\Theta} is an open set.
Example: multinomial distribution I đź“–
Let Y = (Y_1,\dots,Y_{p-1}) \sim \text{Multinom}(N; 1/p,\dots,1/p) be a multinomial random vector with uniform probabilities, so that its density f_0 is
f_0(y) = \frac{N!}{y_1!\cdots y_p!}\left(\frac{1}{p}\right)^N, \qquad y = (y_1,\dots,y_{p-1}) \in \mathcal{Y} \subseteq \mathbb{R}^{p-1}, where \mathcal{Y} = \{(y_1,\dots,y_{p-1}) \in \{0,\dots,N\}^{p-1} : \sum_{j=1}^{p-1} y_j \le N\}, having set y_p := N - \sum_{j=1}^{p-1} y_j. The exponential tilting of f_0 yields f(y; \theta) \propto f_0(y) e^{\theta^T y} = \frac{N!}{y_1!\cdots y_p!}\left(\frac{1}{p}\right)^N e^{\theta_1 y_1 + \cdots + \theta_{p-1} y_{p-1}}, \qquad y \in \mathcal{Y}, \;\theta \in \mathbb{R}^{p-1}.
As a consequence of the multinomial theorem, the normalizing constant, that is, the moment generating function, is
M_0(\theta) = \mathbb{E}\left(e^{\theta^T Y}\right) = \left(\frac{1}{p}\right)^N(1 + e^{\theta_1} + \cdots + e^{\theta_{p-1}})^N. Thus M_0(\theta) < \infty for all \theta \in \mathbb{R}^{p-1} and the natural parameter space is the open set \tilde{\Theta} = \mathbb{R}^{p-1}.
Example: multinomial distribution II đź“–
The resulting tilted density is f(y; \theta) = f_0(y)e^{\theta^Ty - K(\theta)} = \frac{N!}{y_1!\cdots y_p!}\frac{e^{\theta_1 y_1 + \cdots + \theta_{p-1}y_{p-1}}}{(1 + e^{\theta_1} + \cdots + e^{\theta_{p-1}})^N}, where K(\theta) = \log{M_0(\theta)} = N\log(1 + e^{\theta_1} + \cdots + e^{\theta_{p-1}}) - N\log{p}.
In other words, the tilted density is again a multinomial distribution with parameters N and probabilities \pi_j = e^{\theta_j} / (1 + e^{\theta_1} + \cdots + e^{\theta_{p-1}}). In fact, we can write: \begin{aligned} f(y; \theta) &= \frac{N!}{y_1!\cdots y_p!}\frac{e^{\theta_1 y_1} \cdots e^{\theta_p y_p}}{(\sum_{j=1}^p e^{\theta_j})^{y_1} \cdots (\sum_{j=1}^p e^{\theta_j})^{y_p}} = \frac{N!}{y_1!\cdots y_p!} \prod_{j=1}^p\left(\frac{e^{\theta_j}}{\sum_{k=1}^p e^{\theta_k}}\right)^{y_j} \\ &= \frac{N!}{y_1!\cdots y_p!} \prod_{j=1}^p\pi_j^{y_j}. \end{aligned} where we defined \theta_p := 0, so that \sum_{j=1}^pe^{\theta_j} = 1 + e^{\theta_1} + \cdots + e^{\theta_{p-1}}, recalling that \sum_{j=1}^py_j = N.
The tilted density belongs to a regular natural exponential family \mathcal{F}_\text{en}^{p-1} of order p-1.
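A numerical verification (a sketch in base R) that the tilted uniform multinomial coincides with a multinomial whose probabilities are the softmax of the natural parameters:

```r
# Sketch: exponential tilting of Multinom(N; 1/p, ..., 1/p) gives a multinomial with
# probabilities pi_j = exp(theta_j) / sum_k exp(theta_k), where theta_p = 0.
N <- 5; p <- 3
theta <- c(0.5, -1)                                   # natural parameters (theta_p = 0)
pr <- exp(c(theta, 0)) / sum(exp(c(theta, 0)))
y <- c(2, 1); y_full <- c(y, N - sum(y))              # counts, last one implied
log_f0 <- lfactorial(N) - sum(lfactorial(y_full)) - N * log(p)
K <- N * log(1 + sum(exp(theta))) - N * log(p)
exp(log_f0 + sum(theta * y) - K)                      # tilted density f(y; theta)
dmultinom(y_full, size = N, prob = pr)                # identical value
```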
Example: independent exponential families đź“–
Let Y = (Y_1,\dots,Y_p) be a random vector of independent random variables each belonging to a full natural exponential family \mathcal{F}_\text{en}^1 of order 1, with density f(y_j; \theta_j) = f_j(y_j)\exp\{\theta_j y_j - K_j(\theta_j)\}, \qquad \theta_j \in \tilde{\Theta}_j.
Let \theta = (\theta_1,\dots,\theta_p). Because of independence, the joint distribution of Y is \begin{aligned} f(y;\theta) &= \prod_{j=1}^p f(y_j;\theta_j) = \prod_{j=1}^pf_j(y_j)\exp\{\theta_j y_j - K_j(\theta_j)\} \\ &= \left[\prod_{j=1}^pf_j(y_j)\right]\exp\left\{\sum_{j=1}^p\theta_jy_j - \sum_{j=1}^pK_j(\theta_j)\right\} \\ & = f_0(y)\exp\{\theta^Ty - K(\theta)\} \end{aligned} where f_0(y) = \prod_{j=1}^pf_j(y_j), K(\theta) = \sum_{j=1}^pK_j(\theta_j) and the natural parameter space is \tilde{\Theta} = \tilde{\Theta}_1 \times \cdots \times \tilde{\Theta}_p.
Thus, f(y;\theta) is an \mathcal{F}_\text{en}^p, in which K(\theta) is a separable function.
Mean value mapping and other properties
Let Y \sim f(y; \theta), with f(y; \theta) \in \mathcal{F}_\text{en}^p. The cumulant generating function is
K_\theta(t) = \log M_\theta(t) = K_0(t + \theta) - K_0(\theta), \qquad t + \theta \in \tilde{\Theta}. In particular, the first two moments of Y are obtained as:
\mu(\theta) := \mathbb{E}_\theta(Y) = \frac{\partial}{\partial \theta} K(\theta), \qquad \text{var}_\theta(Y) = \frac{\partial}{\partial \theta^\top} \mu(\theta) = \frac{\partial^2}{\partial \theta \partial \theta^\top} K(\theta). If f_0 is non-degenerate, then the covariance matrix \text{var}_\theta(Y) is positive definite, implying that K(\theta) is a strictly convex function and that \mu(\theta) is a smooth one-to-one map.
The definitions of mean value mapping \mu(\theta), its range \mathcal{M}, the convex hull C(\mathcal{Y}) of the sample space, and the variance function V(\mu) also naturally extend to the multi-parameter setting.
Refer to Jorgensen (1987) for an extension of the results of Morris (1982) about V(\mu).
Independence of the components
- This theorem essentially establishes that if the baseline density f_0(\cdot) has independent components, then the exponential tilting preserves independence.
This generalizes a well-known fact about multivariate Gaussians, which are in fact an \mathcal{F}_\text{en}^p.
In practice, if the Hessian matrix of K(\theta) is diagonal, then the natural observations are independent. This occurs whenever K(\theta) is separable.
Marginal and conditional distributions
Consider an \mathcal{F}_\text{en}^p family, so that f(y; \theta) = f_0(y)\exp\{\theta^T y - K(\theta)\}.
Let y = (t, u) be a partition of the natural observations y, where t has k components and u has p-k components. Let us partition \theta accordingly, so that \theta = (\tau, \zeta) and f(y; \tau, \zeta) = f_0(y)\exp\{\tau^Tt + \zeta^Tu - K(\tau,\zeta)\}, \qquad (\tau,\zeta) \in \tilde{\Theta}.
Conditional likelihoods
The previous result on marginal and conditional laws is not just an elegant probabilistic fact: it has important inferential applications.
Often, we can split the parameter vector \theta into a parameter of interest \tau and a nuisance parameter \zeta, which we are not interested in learning.
A practical drawback of this approach is that the conditional cumulant generating function K_u(\tau) is not always available in closed form, albeit with notable exceptions.
The approach is valid, in the sense that a likelihood based on f_{T\mid U=u} is a genuine likelihood. On the other hand, note that the full likelihood would be based on f(y;\tau, \zeta) = f_U(u; \tau, \zeta) f_{T\mid U=u}(t; u, \tau), and thus the conditional likelihood discards information, namely it neglects f_U(u; \tau, \zeta).
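A classical illustration (a standard example, offered here as a sketch rather than material from these notes): for two independent Poisson counts, conditioning on their sum eliminates the overall rate and leaves a binomial model that depends only on the ratio of the two means.

```r
# Sketch: if Y1 ~ Poisson(lambda1) and Y2 ~ Poisson(lambda2) independently, then
# Y1 | (Y1 + Y2 = u) ~ Binomial(u, lambda1 / (lambda1 + lambda2)), so a conditional
# likelihood for the rate ratio can ignore the nuisance "total rate".
lambda1 <- 2; lambda2 <- 5
u <- 6; y1 <- 2
joint  <- dpois(y1, lambda1) * dpois(u - y1, lambda2)
marg_u <- dpois(u, lambda1 + lambda2)
joint / marg_u                                              # conditional density
dbinom(y1, size = u, prob = lambda1 / (lambda1 + lambda2))  # same value
```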
A general definition of exponential families I
- If q > p, then \psi is not identifiable and this possibility should be discarded.
- If q = p, then \theta(\psi) must be a one-to-one mapping, i.e. a reparametrization, otherwise the model is again not identifiable.
- If q < p, we have a (p,q)-curved exponential family, which corresponds to a restriction of the natural parameter space.
Curved exponential families
- Figure 4.1 of Efron (2023), Chapter 4: three levels of statistical modeling, with a fourth level added to represent curved exponential families.
A general definition of exponential families II
We refer to Efron (2023), Chapter 4 for a detailed discussion on curved exponential families. From now on, we will focus on the p = q case.
Without loss of generality, we can focus on the natural parametrization \theta \in \Theta \subseteq \mathbb{R}^p and a baseline density h(y) = f_0(y), meaning that f(y;\theta) \in \mathcal{F}_\text{e}^p can be written as f(y; \theta) = f_0(y)\exp\{\theta^T s(y) - K(\theta)\}, because the general case would be a reparametrization of this one.
Let Y \sim f(y; \theta), with f(y; \theta) \in \mathcal{F}_\text{e}^p. Then, the random vector S = s(Y) = (s_1(Y),\dots,s_p(Y)) has density f_S(s; \theta) = \tilde{f}_0(s)\exp\{\theta^T s - K(\theta)\}, for some baseline density \tilde{f}_0(s), namely f_S(s; \theta) \in \mathcal{F}_\text{en}^p. If, in addition, s(y) is a one-to-one mapping, then Y = s^{-1}(S) is just a transformation of an \mathcal{F}_\text{en}^p.
Example: gamma distribution đź“–
The family \text{Gamma}(\nu, \lambda) with \nu,\lambda > 0 is an \mathcal{F}_\text{e}^2. In fact, its density is \begin{aligned} f(y; \nu, \lambda) &= \frac{\lambda^\nu}{\Gamma(\nu)}y^{\nu -1}e^{-\lambda y} = \frac{1}{y}\exp\{\nu\log{y} - \lambda y - \log\Gamma(\nu) + \nu\log{\lambda} \} \\ &=h(y)\exp\{\theta(\psi)^T s(y) - G(\psi)\}, \end{aligned} where h(y) = y^{-1}, the sufficient statistic is s(y) = (s_1(y), s_2(y)) = (\log{y}, y), whereas the natural parameter and the cumulant generating function are \theta(\psi) = (\theta_1(\psi), \theta_2(\psi)) = (\nu, -\lambda), \qquad G(\psi) = \log{\Gamma(\nu)} - \nu\log{\lambda}, having set \psi = (\nu, \lambda).
As previously shown, this implies that the family f(s; \theta) = f_0(s)\exp\{\theta^Ts - \log{\Gamma(\theta_1)} + \theta_1\log(-\theta_2)\}, \qquad \theta \in \tilde{\Theta}, is a regular natural exponential family of order 2, with some baseline density f_0(s).
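A sketch (using simulated data, not part of the original material) showing maximum likelihood estimation for the gamma family based only on the sufficient statistic s(y) = (\sum_i \log y_i, \sum_i y_i):

```r
# Sketch: gamma MLE using the exponential-family form of the log-likelihood;
# the term sum(log(h(y))) = -sum(log(y)) is constant in (nu, lambda) and omitted.
set.seed(123)
y <- rgamma(500, shape = 3, rate = 2)
s <- c(sum(log(y)), sum(y)); n <- length(y)
negloglik <- function(par) {                # par = (nu, lambda)
  nu <- par[1]; lambda <- par[2]
  -(nu * s[1] - lambda * s[2] - n * lgamma(nu) + n * nu * log(lambda))
}
optim(c(1, 1), negloglik, method = "L-BFGS-B", lower = c(1e-6, 1e-6))$par  # close to (3, 2)
```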
Example: von Mises distribution I
Let Y be a random variable describing an angle, so that \mathcal{Y} = (0, 2\pi) and let us consider the uniform density on the circle, namely h(y) = \frac{1}{2\pi}, \qquad y \in (0, 2\pi).
We define a tilted density f(y; \theta) \in \mathcal{F}_\text{e}^2 by considering s(y) = (\cos{y}, \sin{y}), i.e. the Cartesian coordinates of y on the unit circle. This choice of s(y) ensures the appealing periodicity property f(y;\theta) = f(y + 2k\pi;\theta).
More precisely, let \theta = (\theta_1,\theta_2) and define the parametric family of densities f(y; \theta) = h(y)\exp\{\theta^Ts(y) - K(\theta)\}, \qquad \theta \in \tilde{\Theta}, where h(y) = 1/(2\pi). The normalizing constant has a “closed form” \exp\{K(\theta)\} = \frac{1}{2\pi}\int_0^{2\pi}\exp\{\theta_1\cos(y) + \theta_2\sin(y)\}\mathrm{d}y = \mathcal{I}_0(||\theta||_2), where \mathcal{I}_\nu(\cdot) is known as the modified Bessel function of the first kind and order \nu.
It is easy to check that K(\theta) < \infty for all values of \theta \in \mathbb{R}^2, therefore \tilde{\Theta} = \mathbb{R}^2. This completes the definition of what is known as the von Mises distribution.
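A numerical check (a sketch) that the von Mises normalizing constant is indeed the modified Bessel function \mathcal{I}_0 evaluated at ||\theta||_2:

```r
# Sketch: the von Mises normalizing constant equals besselI(||theta||, nu = 0).
theta <- c(1.2, -0.7)
num <- integrate(function(y) exp(theta[1] * cos(y) + theta[2] * sin(y)) / (2 * pi),
                 lower = 0, upper = 2 * pi)$value
c(num, besselI(sqrt(sum(theta^2)), nu = 0))   # the two values coincide
```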
Example: von Mises distribution II
Instead of the natural parametrization, it is often convenient to consider a reparametrization \psi =(\tau, \gamma), defined through the one-to-one mapping \theta(\psi) = (\tau\cos{\gamma}, \tau\sin{\gamma}), \qquad \psi \in \tilde{\Psi} = (0, \infty) \times (0, 2\pi).
Using this parametrization, thanks to well-known trigonometric identities, we obtain the more familiar formulation of the von Mises distribution, which is f(y; \psi) = h(y)\exp\{\theta(\psi)^T s(y) - G(\psi)\} = \frac{1}{2\pi \mathcal{I}_0(\tau)}e^{\tau\cos(y - \gamma)}, \qquad y \in (0, 2\pi), so that \gamma \in (0,2\pi) can be interpreted as the location and \tau > 0 as the precision.
We also note that the distribution of s(Y) is a regular natural exponential family of order 2, with density f_S(s; \theta) = \frac{1}{2\pi}\exp\{\theta^Ts - \log\mathcal{I}_0(||\theta||_2)\}, \qquad s \in \mathcal{S} = \{(s_1,s_2) \in \mathbb{R}^2 : s_1^2 + s_2^2 = 1\}, clarifying that S = s(Y) is a random vector taking values on the unit circle.
Example: wind direction in Venice I
The von Mises distribution is sometimes regarded as the “Gaussian distribution for circular data”. To provide a concrete example, let us consider the wind directions measured at the San Giorgio meteorological station in Venice.
Measurements are recorded every 5 minutes, from 14-04-2025 to 18-04-2025, for a total of n = 1153 observations. The variable wind_dir is recorded in degrees, i.e. between 0 and 360.
The dataset is available here. The original source is the webpage of Venice municipality.
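A minimal sketch for loading and preparing the data (the file name is an assumption and should be adapted to the dataset linked above; the column name wind_dir follows the description given earlier):

```r
# Sketch: read the Venice wind data and build the natural observations s(y).
# "venice_wind.csv" is a placeholder for the file linked above.
library(readr)
wind <- read_csv("venice_wind.csv")
y_rad <- wind$wind_dir * pi / 180        # degrees -> radians, in (0, 2*pi)
s <- cbind(cos(y_rad), sin(y_rad))       # s(y) = (cos y, sin y)
colMeans(s)                              # average of the natural observations
```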
Example: wind direction in Venice II
- This is a somewhat misleading graphical representation of wind directions evolving over time. Indeed, the “spikes” are not real: the angles 1° and 359° are, in fact, very close.
Example: wind direction in Venice III
- A better graphical representation of wind directions and wind speed, using Cartesian coordinates. From this windrose, it is clear that the winds were coming mostly from the East.
Inference
Independent sampling, sufficiency and completeness
Let Y_1,\dots,Y_n be iid random vectors with density f(y; \theta), where f(y; \theta) \in \mathcal{F}_\text{e}^p and, without loss of generality, we let f(y; \theta) = f_0(y)\exp\{\theta^T s(y) - K(\theta)\}. The likelihood function is L(\theta; \bm{y}) = \prod_{i=1}^n f_0(y_i)\exp\left\{\theta^T s(y_i) - K(\theta)\right\} \propto \exp\left\{\theta^T \sum_{i=1}^n s(y_i) - n K(\theta)\right\}, from which we see that s = \sum_{i=1}^ns(y_i) =(\sum_{i=1}^n s_1(y_i),\dots,\sum_{i=1}^n s_p(y_i)) is the minimal sufficient statistic as long as n \ge p, which has fixed dimension p whatever the sample size.
Inference can therefore be based on the random vector S = \sum_{i=1}^n s(Y_i), whose distribution is f_S(s; \theta) = \tilde{f}_0(s)\exp\{\theta^T s - \tilde{K}(\theta)\}, with \tilde{K}(\theta) = n K(\theta) and for some density \tilde{f}_0(s). In other words f_S(s; \theta) \in \mathcal{F}_\text{en}^p.
Sufficiency and completeness
Thus the log-likelihood function, after a reduction via sufficiency, is \ell(\theta) = \ell(\theta; s) = \theta^Ts - n K(\theta), \qquad \theta \in \tilde{\Theta}, with S = \sum_{i=1}^n s(Y_i) being distributed as a \mathcal{F}_\text{en}^p with cumulant generating function n K(\theta).
The completeness of S in exponential families is a classical result that enables the use of the Rao–Blackwell–Lehmann–Scheffé theorem for finding the UMVUE.
The existence of a minimal sufficient statistic that performs a non-trivial dimensionality reduction, from n to p, is a major simplification.
This appears to be a characteristic of exponential families, with the only exception of non-regular cases (e.g. models whose support depends on the parameter, such as the uniform on (0, \theta)).
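A small sketch (with simulated Poisson data) of inference based on the sufficient statistic alone: the MLE solves n K'(\theta) = s, which for the Poisson gives \hat{\theta} = \log(\bar{y}).

```r
# Sketch: i.i.d. Poisson data; the log-likelihood depends on the sample only through
# s = sum(y), and maximizing it reproduces the closed-form MLE theta_hat = log(mean(y)).
set.seed(42)
y <- rpois(200, lambda = 3)
s <- sum(y); n <- length(y)
loglik <- function(theta) theta * s - n * (exp(theta) - 1)
theta_hat <- optimize(loglik, interval = c(-10, 10), maximum = TRUE)$maximum
c(theta_hat, log(mean(y)))   # coincide
```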
Likelihood quantities
Optimality properties of point estimates
Maximum likelihood estimation
Asymptotic theory
Firth corrections
Example: wind direction in Venice IV
Example: wind direction in Venice V
Exponential dispersion families
References and study material
Main references
- Pace and Salvan (1997)
- Chapter 5 (Exponential families)
- Chapter 6 (Exponential dispersion families)
- Davison (2003)
- Chapter 5 (Models)
- Efron and Hastie (2016)
- Chapter 5 (Parametric models and exponential families)
- Efron (2023)
- Chapter 1 (One-parameter exponential families)
- Chapter 2 (Multiparameter exponential families)
Morris (1982)
Morris (1982, AoS) is a seminal paper in the field of exponential families.
It is a must-read, as it encompasses and overviews many of the results discussed in this unit.
It also shows that exponential families with quadratic variance are infinitely divisible, provided that c \ge 0.
The paper covers several advanced topics, including:
- orthogonal polynomials;
- limiting results;
- large deviations;
- …and more.
Jorgensen (1987)
Jorgensen (1987, JRSSB) is another seminal paper in the field of exponential dispersion families.
It studies a multivariate extension of exponential dispersion models of Nelder and Wedderburn (1972).
It characterizes the entire class in terms of variance function, extending Morris (1982).
It also describes a notion of asymptotic normality called small sample asymptotics.
It is a discussion (read) paper; among the discussants we find J.A. Nelder, A.C. Davison, and C.N. Morris.
Diaconis and Ylvisaker (1979)
Bayesian statistics also greatly benefits from the use of exponential families.
Diaconis and Ylvisaker (1979, AoS) is a seminal paper on the topic of conjugate priors.
Broadly speaking, conjugate priors always exist for exponential families.
These are known as the Diaconis–Ylvisaker conjugate priors.
Classical priors such as beta–Bernoulli and Poisson–gamma are special cases.
The posterior expectation under the mean parametrization is a linear combination of the data and the prior mean.
Consonni and Veronese (1992)
Consonni and Veronese (1992, JASA) is another Bayesian contribution which refines the results of Diaconis and Ylvisaker (1979).
It investigates when a conjugate prior specified on the mean parameter \mu of a natural exponential family leads to a linear posterior expectation of \mu.
The main result shows that this posterior linearity holds if and only if the variance function is quadratic.
The paper also explores the monotonicity of the posterior variance of \mu with respect to both the sample size and the prior sample size.