A Bayesian theory for the estimation of biodiversity

Università of Milano-Bicocca

Tommaso Rigon

Università degli Studi di Milano-Bicocca

2026-06-09

Warm thanks (statistical team)

Ching-Lung Hsu

(Duke University)

Alessandro Zito

(Harvard School of Public Health)

David Dunson

(Duke University)

Warm thanks (ecological team, and more…)

Otso Ovaskainen

(University of Jyväskylä)

Tomas Roslin

(University of Helsinki)

International BARCODE OF LIFE

  • …and Niittynen, P., Hebert, P. D. N., Zakharov, E., Ratnasingham, S., and the entire iBOL Consortium!,

A journey into statistical biodiversity

Approximately 80 years ago… Fisher et al. (1943)

Early ideas (Good 1953; Good and Toulmin 1956)

The Ewens sampling formula Ewens (1972)

The Dirichlet process (Ferguson 1973)

… and species sampling models (Pitman 1996)

Ewens formula governs the neutral theory! (Hubbell 2001)

All ties together (McCullagh 2016)

Species sampling models

  • Let X_1,\dots,X_n be some collection of species or taxa. Suppose X_n are conditionally iid samples from a species sampling model: (X_n \mid \tilde{p}) \overset{\text{iid}}{\sim} \tilde{p}, \qquad \tilde{p}(\cdot) = \sum_{h=1}^{\infty}\pi_h \delta_{Z_h}(\cdot), \qquad n \ge 1, where (\pi_h)_{h \ge 1} is a set of probabilities (species proportions) and Z_h represent distinct species.

  • The discreteness of \tilde{p} identifies Y^{(n)}=y distinct taxa X_1^*, \ldots, X_y^* with frequencies n_1, \ldots, n_y, called abundances in ecology.

  • Gibbs-type priors have emerged as the most natural extension of the DP (De Blasi et al. 2015).

  • The predictive distribution of a Gibbs-type prior is given by: \mathbb{P}(X_{n+1} \in \cdot \mid X_1,\dots,X_n) = \frac{V_{n+1, y+1}}{V_{n,y}}P(\cdot) + \frac{V_{n+1, y}}{V_{n,y}}\sum_{j=1}^y(n_j - \sigma)\delta_{X^*_j}(\cdot). where (a)_{n} denotes a rising factorial, and \sigma < 1 is the discount parameter. The V_{n,y}’s are non-negative weights satisfying a forward recursive equation.

Three notable examples

  • Dirichlet multinomial (\sigma < 0). For \sigma < 0 and H \in \mathbb{N}, a valid set of Gibbs coefficients is given by: V_{n, y}(\sigma, H) := \frac{|\sigma|^{y-1}\prod_{j=1}^{y-1}(H - j)}{(H|\sigma| +1)_{n-1}} \mathbb{I}(y \le H).

  • Dirichlet process (\sigma = 0). For \sigma = 0 and \alpha > 0, a valid set of Gibbs coefficients is given by: V_{n, y}(\alpha) := \frac{\alpha^y}{(\alpha)_n}.

  • \sigma-stable Poisson–Kingman process (0 < \sigma < 1). For \sigma \in (0,1) and gamma \gamma > 0, a valid set of Gibbs coefficients is defined as: V_{n, y}(\sigma, \gamma) := \frac{\sigma^y\gamma^y}{\Gamma(n-y\sigma)f_{\sigma}\left(\gamma^{-1/\sigma}\right)} \int_{0}^{1}s^{n-1-y\sigma}f_{\sigma}\left(\left(1-s\right)\gamma^{-1/\sigma}\right)\, \mathrm{d}s, where f_\sigma(t) = (\pi)^{-1} \sum_{h=1}^\infty (-1)^{h+1}\sin(h \pi \sigma )\Gamma(h\sigma + 1) / t^{h\sigma + 1}.

A characterization theorem

The Gibbs coefficients V_{n, y} satisfy the recursive equation in the following three cases:

  • If \sigma < 0, whenever V_{n, y} = \sum_{h=1}^\infty V_{n, y}(\sigma, h) p(h), for some discrete random variable H \in \mathbb{N} with pdf p(h), where the V_{n, y}(\sigma, h)’s are those of the Dirichlet multinomial.

  • If \sigma = 0, whenever V_{n, y} = \int_{\mathbb{R}^+} V_{n, y}(\alpha) p(\mathrm{d}\alpha), for some positive random variable \alpha with probability measure p(\mathrm{d}\alpha), where the V_{n, y}(\alpha)’s are those of the Dirichlet process.

  • If \sigma \in (0,1), whenever V_{n, y} = \int_{\mathbb{R}^+} V_{n, y}(\sigma, \gamma) p(\mathrm{d}\gamma), for some positive random variable \gamma with probability measure p(\mathrm{d}\gamma), where the V_{n, y}(\sigma, \gamma)’s are those of the \sigma-stable PK.

  • The Dirichlet multinomial, Dirichlet process, and \sigma-stable PK form the foundation of any Gibbs-type prior.

  • In fact, any Gibbs-type process can be represented hierarchically, involving a suitable prior distribution for the key parameters H, \alpha, and \gamma.

The quantification of biodiversity (Lijoi et al. 2007a)

  • The simplest measure of biodiversity is arguably the taxon richness Y^{(n)} = y.

  • A priori, the distribution of Y^{(n)} induced by a Gibbs-type prior has a simple form: \mathbb{P}(Y^{(n)} = y)=V_{n,y}\frac{\mathscr{C}(n,y;\sigma)}{\sigma^y}, where \mathscr{C}(n, y;\sigma) denotes a generalized factorial coefficient.

  • The a priori expectations \mathbb{E}(K_1),\dots,\mathbb{E}(K_n) define a model-based rarefaction curve.

  • The posterior distribution of the number of previously unobserved taxa Y_m^{(n)} is \mathbb{P}(Y_m^{(n)}= j \mid X_1,\dots,X_n)=\frac{V_{n + m, y + j}}{V_{n, y}}\frac{\mathscr{C}(m, j;\sigma, -n + y\sigma)}{\sigma^j}, \qquad j=0,\dots,m, where \mathscr{C}(m, j;\sigma, -n + y\sigma) is the noncentral generalized factorial coefficient.

  • The posterior expectations \mathbb{E}(Y^{(n+1)} \mid Y^{(n)} = y), \dots, \mathbb{E}(Y^{(n + m)} \mid Y^{(n)} = y) represents a model-based extrapolation of the accumulation curve.

Rarefaction and extrapolation

The \sigma-diversity (Pitman 2003)

  • Let Y^{(n)} be the number of distinct values arising from a Gibbs-type prior:
    1. Let \sigma < 0 and V_{n, y}(\sigma, H) be the weights of a Dirichlet multinomial, then Y^{(n)} \rightarrow H a.s.
    2. Let \sigma = 0 and V_{n, y}(\alpha) be the weights of a Dirichlet process, then Y^{(n)} / \log(n) \rightarrow \alpha a.s.
    3. Let \sigma \in (0,1) and V_{n, y}(\sigma, \gamma) be the weights of a \sigma-stable PK, then Y^{(n)} / n^\sigma \rightarrow \gamma a.s.
  • Moreover, consider a generic set of weights V_{n, y} and let c_\sigma(n) = \begin{cases} 1, & \sigma <0,\\[2mm] \log(n), & \sigma = 0,\\[1mm] n^{\sigma}, & \sigma \in (0,1) \end{cases} Then, as n \rightarrow \infty: \frac{Y^{(n)}}{c_\sigma(n)} \overset{\textup{a.s.}}{\longrightarrow} S_{\sigma}. The r.v. S_{\sigma} is the \sigma-diversity and its distribution coincides with the prior for H, \alpha, and \gamma.

The \sigma-diversity (Pitman 2003)

  • Broadly speaking, the \sigma-diversity can be seen as a rescaled richness measure.

  • The \sigma-diversity is deterministic and assumed to be known in the Dirichlet multinomial, Dirichlet process, and \sigma-stable PK.

  • However, the \sigma-diversities H, \alpha, or \gamma are typically unknown, and they can be estimated employing a prior distribution, leading to a Gibbs-type prior.

  • The posterior law of the \sigma-diversity is a key quantity for measuring biodiversity. The posterior law of S_\sigma has an elegant connection with accumulation curves.

Theorem (Rigon et al. 2025b)

Let X_1,\dots,X_{n+m} be a sample from a Gibbs-type prior with Y^{(n+m)} distinct values. Then:

\left(\frac{Y^{(n+m)}}{c_\sigma(m)} \mid X_1,\dots,X_n \right) \overset{d}{\longrightarrow} \tilde{S}_{\sigma}, \qquad \tilde{S}_\sigma \overset{d}{=} (S_\sigma \mid X_1,\dots,X_n), as m \rightarrow \infty, where S_\sigma is the \sigma-diversity.

Tree species in the Amazon Basin (Hubbell et al. 2008)

  • How to get a Bayesian estimator for the biodiversity? We can place a prior on \alpha and the compute its posterior distribution using Bayes theorem.

  • Posterior distribution of the \sigma-diversity \alpha, under a (conjugate) Stirling-gamma prior (Zito et al. 2024). The dotted lines represent 95% credible intervals. The dashed line is the posterior mean.

Historical summary and extensions

  • The fundamental biodiversity number \alpha is a growth-adjusted richness: \frac{Y^{(n)}}{\log n} \stackrel{\mathrm{a.s.}}{\longrightarrow} \alpha, \qquad n\to \infty

  • The \alpha from the neutral theory, DP and Fisher log-series is the same for large n: \underbrace{y = \hat{\alpha}^\mathrm{Fisher}\log\Big(1 + \frac{n}{\hat{\alpha}^\mathrm{Fisher}}\Big)}_{\text{Solution from Fisher}} \ \stackrel{\text{Under large } n}{\approx} \ \underbrace{\sum_{j = 1}^{n} \frac{\hat{\alpha}^{\text{DP}}}{\hat{\alpha}^{\text{DP}} + j - 1} = y}_{\text{MLE from Hubbell and DP}}

  • There are a lot of possible extensions. This is a very biased list:

Specimens and BINs in Barcode of Life

  • Our data: global Malaise Trap project.

  • We have a total of 1,781,340 arthropods split across 154,688 BINs and N = 2415 sampling sites.

The Hubbell regression

  • The number of distinct species from a DP is an exponential family! \mathbb{P}(Y^{(n)} = y; \alpha) = \frac{\alpha^y}{(\alpha)_n}|s(n, y)|, \qquad y \in\{ 1, \ldots, n\}, where (\alpha)_n = \Gamma(\alpha + n)/\Gamma(\alpha) and |s(n,y)| are signless Stirling numbers of the first kind.

The Hubbell regression is a GLM where:

  • The distinct species Y_i^{(n_i)} for i=1,\dots,N are independent draws from the distribution f(y_i; \theta_i) = \exp \left\{y_i \theta_i - \log[\Gamma(e^{\theta_i} + n_i) - \Gamma(e^{\theta_i})] + \log|s(n_i, y_i)|\right\}, with \alpha_i = e^{\theta_i} whose mean and variance functions are \mu_i = \mu(\theta_i, n_i) = \sum_{j=0}^{n_i-1} \frac{e^{\theta_i}}{e^{\theta_i} + j}, \qquad V(\mu(\theta_i, n_i)) = \sum_{j=0}^{n_i-1} \frac{e^{\theta_i}}{e^{\theta_i} + j}\left(1 - \frac{e^{\theta_i}}{e^{\theta_i} + j}\right).

  • The link function g depends on \eta_i and n_i and satisfies \mathbb{E}(Y_i\mid \bm{x}_i, n_i) = \mu_i = g^{-1}(\bm{x}_i^\top \bm{\beta}, n_i).

Regression-based biodiversity indexes

Theorem (Regression-based diversity)

For the mean function of the a Hubbell regression model with the aforementioned non-canonical link, it holds \frac{\mu(\bm{x}_i^\top \bm{\beta}; n, \sigma)}{c_n(\sigma)} \longrightarrow S_\sigma(\bm{x}_i^\top\bm{\beta}), \qquad n \to \infty, Moreover:

  • If \sigma = 0, then c_n(\sigma) = \log n (logarithmic growth) and S_\sigma(\bm{x}_i^\top\bm{\beta}) = e^{\bm{x}_i^\top\bm{\beta}} = \alpha_i;
  • If \sigma \in (0, 1), then c_n(\sigma) = n^{\sigma} (polynomial growth) and S_\sigma(\bm{x}_i^\top\bm{\beta}) = e^{\bm{x}_i^\top\bm{\beta}}/\sigma = \alpha_i / \sigma;
  • if \sigma < 0, then c_n(\sigma) = 1 (finite richness) and S_\sigma(\bm{x}_i^\top\bm{\beta}) \approx C_\sigma^{-1}e^{\bm{x}_i^\top\bm{\beta}/(1-\sigma)} = C_\sigma^{-1} \alpha_i^{1/(1-\sigma)}, with C_\sigma a known constant.
  • This gives a novel notion of regression-based biodiversity.

Iteratively reweighted least squares (IRLS)

  1. Update the mean: \mu_i^{(t)} \gets \sum_{j = 0}^{n_i-1} \frac{e^{\eta_i^{(t)}}}{e^{\eta_i^{(t)}} + j^{1-\sigma}}, \qquad \eta_i^{(t)} = \bm{x}_i^\top \bm{\beta}^{(t)}.

  2. Numerically retrieve the diversity \alpha_i and compute variances: \alpha_i^{(t)} \gets \texttt{uniroot}\!\left(\sum_{j=0}^{n_i-1}\frac{x}{x+j}=\mu_i^{(t)}\right)\texttt{\$root}, \qquad v_i^{(t)} \gets \sum_{j=0}^{n_i-1} j\,\frac{\alpha_i^{(t)}}{(\alpha_i^{(t)}+j)^2}.

  3. Compute working weights: w_i^{(t)} \gets \frac{1}{v_i^{(t)}} \sum_{j=0}^{n_i-1} j\frac{\alpha_i^{(t)}} {(\alpha_i^{(t)}+j^{1-\sigma})^2}.

  4. Update the regression coefficients:

    \bm{\beta}^{(t+1)} \gets (\bm{X}^\top \bm{W}^{(t)} \bm{X})^{-1} \bm{X}^\top \bm{W}^{(t)} \bm{z}^{(t)}, \qquad z_i^{(t)} \gets \eta_i^{(t)} + \frac{y_i - \mu_i^{(t)}}{(w_i^{(t)}v_i^{(t)})^{1/2}}.

Composite marginal likelihood and standard errors

  • The assumption behind GLMs is that observations are independent. In our case, we might have seen the same species (or BIN) at two different locations!

  • This information is lost when calculating the number Y_i^{(n_i)}. However, we can interpret L(\boldsymbol{\beta}) \propto \prod_{i=1}^N \frac{\alpha_i^{y_i}} {(\alpha_i)_{n_i}}, as a composite marginal likelihood. The estimating equation is still unbiased and estimates consistent.

  • We propose spatially heteroskedastic-consistent standard errors (Conley 1999) that use the Jaccard Index (or similar indices) between two samples: \widehat{\mathrm{se}}(\hat{\boldsymbol{\beta}}) = (\bm{X}^\top \hat{\bm{W}}\bm{X})^{-1} \textstyle\sum_{ik} {\color{#8B0000}{s_{ik}}}\, {\color{blue}{\hat{\bm{u}}_i\hat{\bm{u}}_k^\top}}\, (\bm{X}^\top \hat{\bm{W}}\bm{X})^{-1}, \qquad {\color{blue}{\hat{\bm{u}}_i}} = \frac{\partial}{\partial\boldsymbol{\beta}} \log \mathcal{L}(\hat{\boldsymbol{\beta}}), where {\color{#8B0000}{s_{ik}}} \in [0,1] is the fraction of shared BINs between locations i and k.

Our scientific questions

  • Is (actual) evapotranspiration the main driver of diversity?

  • How does anthropogenic impact affect diversity?

Modelling approach

  • Goal: test the joint effect of evapotranspiration and human footprint
HubbelGLM(cbind(n, y) ~ ., family = hubbell(sigma = 0), data = data)

Model considered (in order of complexity)

  • M0: no covariates
  • M1: AET
  • M2: + fixed effect for realm
  • M3: + time (julian week, shifter by 6 months if below equator, 1 degree Fourier transform
  • M4: + spatial gradient (ns(Latitude, 6))
  • M5: + zone and hfp and interaction
  • M6: + controls (wetlands, wind speed, collection days, residuals of regression of temperature with latitude)
  • M7: + anomalies for weather (temperature, humidity, precipitation, wind speed)

Hubbell achieves the best predictive performance

  • We run a 10-fold cross-validation experiment. Our benchmarks are other GLMs (matching the growth rate with n).

  • For each nested specification, we split the data into 10 folds, train all models in 90% of the data and predict the remaining 10% (cycle through).

Key results

  • Model M0: \sigma is estimated as 0.562. Square root!

  • Model M1: AET explains nearly 30% of the total variation in species richness

  • Model M7:

    • A 10 mm/year increase in AET leads to a 12.4\% increase in diversity on average (\beta = 0.012; P < 0.0001).
    • A 5 units on the HFP scale is associated with a 6.1\% reduction in diversity in tropical areas (\beta = −0.013; P < 0.0001), and a 7.5\% reduction in dry zones (\beta = −0.016; P < 0.004)
    • Polar areas show an interesting increase of 8.2\% (\beta = 0.016; P < 0.038)
    • No detected effect in continental areas

Effects for all diversity indices: variations

Effects for all diversity indices: global scales

Scenario evaluation

  • Tropical sites with high human activity (HFP > 30) miss a median of 197 species (21.3\% of potential richness).
  • Dry sites miss 202 species (29.2\%).
  • Temperate sites miss 26 species (4.3\%)
  • Continental sites miss 70 species (9.23\%)
  • A 5‑unit intensification of human presence globally leads to up to an 8\% reduction in potential richness

  • A 10\% increase in AET is associated with an average 3.81\% increase in potential richness.

Summary and future directions

  • Fisher’s \alpha is the same quantity described by Hubbell’s unified neutral theory of biodiversity, and it has deep connections with the Dirichlet process.

  • The Hubbell regression extends the neutral theory in a covariate-dependent GLM framework.

  • Different growth rates are captured via \sigma < 0, \sigma = 0, and \sigma \in (0,1).

  • From here to infinity: next steps

    1. Extend to mixed models, allowing for random effects;
    2. More on nonparametrics: GAMs

Rigon, T., Hsu, C., and Dunson D.B. (2025). A Bayesian theory for estimation of biodiversity. arXiv:2502.01333

Zito, A., Rigon, T., Roslin, T., Niittynen, P., Hebert, P. D. N., Zakharov, E., Ratnasingham, S., iBOL Consortium, Ovaskainen, O. and Dunson, D. B. (2026). Predicting global biodiversity via Hubbell regression. bioRxiv

Software: the R package HubbellGLM

Hubbell regression paper available on biorXiv

References

Camerlenghi, F., Lijoi, A., Orbanz, P., and Prünster, I. (2019), “Distribution theory for hierarchical processes,” Ann. Statist., 47, 67–92.
Conley, T. G. (1999), “GMM estimation with cross sectional dependence,” Journal of Econometrics, 92, 1–45.
De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prunster, I., and Ruggiero, M. (2015), “Are Gibbs-type priors the most natural generalization of the Dirichlet process?” IEEE Trans. Pattern Anal. Mach. Intell., IEEE, 37, 212–229.
Ewens, W. (1972), “The sampling theory of selectively neutral alleles,” Theoretical Population Biology, 3, 87–112.
Ferguson, T. S. (1973), A Bayesian analysis of some nonparametric problems,” Ann. Statist., 1, 209–230.
Fisher, R. A., Corbet, A. S., and Williams, C. B. (1943), “The relation between the number of species and the number of individuals in a random sample of an animal population,” J. Anim. Ecol., 12, 42–58.
Franzolini, B., Lijoi, A., Prünster, I., and Rebaudo, G. (2025), “Multivariate species sampling models,” arXiv:2503.24004.
Ghilotti, L., Camerlenghi, F., and Rigon, T. (2025), “Bayesian analysis of product feature allocation models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Gnedin, A. (2010), Species sampling model with finitely many types,” Electron. Comm. Prob., 15, 79–88.
Gnedin, A., and Pitman, J. (2005), Exchangeable Gibbs partitions and Stirling triangles,” Zapiski Nauchnykh Seminarov, POMI, 325, 83–102.
Good, I. J. (1953), “The population frequencies of species and the estimation of population parameters,” Biometrika, 40, 237–264.
Good, I. J., and Toulmin, G. H. (1956), “The number of new species, and the increase in population coverage, when a sample is increased,” Biometrika, 43, 45–63.
Hubbell, S. P. (2001), The unified neutral theory of biodiversity and biogeography, Princeton University Press.
Hubbell, S. P., He, F., Condit, R., Borda-de-Água, L., Kellner, J., and Ter Steege, H. (2008), How many tree species are there in the Amazon and how many of them will go extinct? Proceedings of the National Academy of Sciences of the United States of America, 105, 11498–11504.
Lijoi, A., Mena, R. H., and Prünster, I. (2007a), “Bayesian nonparametric estimation of the probability of discovering new species,” Biometrika, Oxford University Press, 94, 769–786.
Lijoi, A., Mena, R. H., and Prünster, I. (2007b), Controlling the reinforcement in Bayesian non-parametric mixture models,” J. R. Statist. Soc. B, Wiley Online Library, 69, 715–740.
Lijoi, A., Prünster, I., and Rigon, T. (2020), The Pitman–Yor multinomial process for mixture modeling,” Biometrika, 107, 891–906.
McCullagh, P. (2016), Two Early Contributions to the Ewens Saga,” Statist. Sci., 31, 23–26.
Pitman, J. (1996), “Some developments of the Blackwell-MacQueen urn scheme,” in Statistics, probability and game theory, IMS lecture notes monogr. ser., Inst. Math. Statist., Hayward, CA, pp. 245–267.
Pitman, J. (2003), Poisson-Kingman partitions,” Lecture Notes-Monograph Series, 40, 1–34.
Pitman, J., and Yor, M. (1997), The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator,” Ann. Probab., 25, 855–900.
Rigon, T., Hsu, C., and B., D. D. (2025a), A Bayesian theory for estimation of biodiversity,” arXiv:2502.01333.
Rigon, T., Petrone, S., and Scarpa, B. (2025b), Enriched Pitman-Yor processes,” Scand. J. Statist.
Zito, A., Rigon, T., and Dunson, D. B. (2023), Inferring taxonomic placement from DNA barcoding allowing discovery of new taxa,” Meth. Ecol. Evol., 14, 529–542.
Zito, A., Rigon, T., and Dunson, D. B. (2024), Bayesian nonparametric modeling of latent partitions via Stirling-gamma priors,” Bayesian Analysis.