A Bayesian theory for the estimation of biodiversity

Università of Milano-Bicocca

A joint work with Alessandro Zito, Ching-Lung Hsu et al. presented at the University of Milano-Bicocca.

Author

Affiliation

Tommaso Rigon

Università degli Studi di Milano-Bicocca

Published

June 9, 2026

Warm thanks (statistical team)

Ching-Lung Hsu

(Duke University)

Alessandro Zito

(Harvard School of Public Health)

David Dunson

(Duke University)

Warm thanks (ecological team, and more…)

Otso Ovaskainen

(University of Jyväskylä)

Tomas Roslin

(University of Helsinki)

International BARCODE OF LIFE

…and Niittynen, P., Hebert, P. D. N., Zakharov, E., Ratnasingham, S., and the entire iBOL Consortium!,

A journey into statistical biodiversity

Approximately 80 years ago… Fisher et al. (1943)

Early ideas (Good 1953; Good and Toulmin 1956)

The Ewens sampling formula Ewens (1972)

The Dirichlet process (Ferguson 1973)

… and species sampling models (Pitman 1996)

Ewens formula governs the neutral theory! (Hubbell 2001)

All ties together (McCullagh 2016)

Species sampling models

Let X_1,\dots,X_n be some collection of species or taxa. Suppose X_n are conditionally iid samples from a species sampling model: (X_n \mid \tilde{p}) \overset{\text{iid}}{\sim} \tilde{p}, \qquad \tilde{p}(\cdot) = \sum_{h=1}^{\infty}\pi_h \delta_{Z_h}(\cdot), \qquad n \ge 1, where (\pi_h)_{h \ge 1} is a set of probabilities (species proportions) and Z_h represent distinct species.
The discreteness of \tilde{p} identifies Y^{(n)}=y distinct taxa X_1^*, \ldots, X_y^* with frequencies n_1, \ldots, n_y, called abundances in ecology.
Gibbs-type priors have emerged as the most natural extension of the DP (De Blasi et al. 2015).
The predictive distribution of a Gibbs-type prior is given by: \mathbb{P}(X_{n+1} \in \cdot \mid X_1,\dots,X_n) = \frac{V_{n+1, y+1}}{V_{n,y}}P(\cdot) + \frac{V_{n+1, y}}{V_{n,y}}\sum_{j=1}^y(n_j - \sigma)\delta_{X^*_j}(\cdot). where (a)_{n} denotes a rising factorial, and \sigma < 1 is the discount parameter. The V_{n,y}’s are non-negative weights satisfying a forward recursive equation.

Three notable examples

Dirichlet multinomial (\sigma < 0). For \sigma < 0 and H \in \mathbb{N}, a valid set of Gibbs coefficients is given by: V_{n, y}(\sigma, H) := \frac{|\sigma|^{y-1}\prod_{j=1}^{y-1}(H - j)}{(H|\sigma| +1)_{n-1}} \mathbb{I}(y \le H).
Dirichlet process (\sigma = 0). For \sigma = 0 and \alpha > 0, a valid set of Gibbs coefficients is given by: V_{n, y}(\alpha) := \frac{\alpha^y}{(\alpha)_n}.
\sigma-stable Poisson–Kingman process (0 < \sigma < 1). For \sigma \in (0,1) and gamma \gamma > 0, a valid set of Gibbs coefficients is defined as: V_{n, y}(\sigma, \gamma) := \frac{\sigma^y\gamma^y}{\Gamma(n-y\sigma)f_{\sigma}\left(\gamma^{-1/\sigma}\right)} \int_{0}^{1}s^{n-1-y\sigma}f_{\sigma}\left(\left(1-s\right)\gamma^{-1/\sigma}\right)\, \mathrm{d}s, where f_\sigma(t) = (\pi)^{-1} \sum_{h=1}^\infty (-1)^{h+1}\sin(h \pi \sigma )\Gamma(h\sigma + 1) / t^{h\sigma + 1}.

A characterization theorem

Theorem (Gnedin and Pitman 2005)

The Gibbs coefficients V_{n, y} satisfy the recursive equation in the following three cases:

If \sigma < 0, whenever V_{n, y} = \sum_{h=1}^\infty V_{n, y}(\sigma, h) p(h), for some discrete random variable H \in \mathbb{N} with pdf p(h), where the V_{n, y}(\sigma, h)’s are those of the Dirichlet multinomial.
If \sigma = 0, whenever V_{n, y} = \int_{\mathbb{R}^+} V_{n, y}(\alpha) p(\mathrm{d}\alpha), for some positive random variable \alpha with probability measure p(\mathrm{d}\alpha), where the V_{n, y}(\alpha)’s are those of the Dirichlet process.
If \sigma \in (0,1), whenever V_{n, y} = \int_{\mathbb{R}^+} V_{n, y}(\sigma, \gamma) p(\mathrm{d}\gamma), for some positive random variable \gamma with probability measure p(\mathrm{d}\gamma), where the V_{n, y}(\sigma, \gamma)’s are those of the \sigma-stable PK.

The Dirichlet multinomial, Dirichlet process, and \sigma-stable PK form the foundation of any Gibbs-type prior.
In fact, any Gibbs-type process can be represented hierarchically, involving a suitable prior distribution for the key parameters H, \alpha, and \gamma.

The quantification of biodiversity (Lijoi et al. 2007a)

The simplest measure of biodiversity is arguably the taxon richness Y^{(n)} = y.
A priori, the distribution of Y^{(n)} induced by a Gibbs-type prior has a simple form: \mathbb{P}(Y^{(n)} = y)=V_{n,y}\frac{\mathscr{C}(n,y;\sigma)}{\sigma^y}, where \mathscr{C}(n, y;\sigma) denotes a generalized factorial coefficient.
The a priori expectations \mathbb{E}(K_1),\dots,\mathbb{E}(K_n) define a model-based rarefaction curve.
The posterior distribution of the number of previously unobserved taxa Y_m^{(n)} is \mathbb{P}(Y_m^{(n)}= j \mid X_1,\dots,X_n)=\frac{V_{n + m, y + j}}{V_{n, y}}\frac{\mathscr{C}(m, j;\sigma, -n + y\sigma)}{\sigma^j}, \qquad j=0,\dots,m, where \mathscr{C}(m, j;\sigma, -n + y\sigma) is the noncentral generalized factorial coefficient.
The posterior expectations \mathbb{E}(Y^{(n+1)} \mid Y^{(n)} = y), \dots, \mathbb{E}(Y^{(n + m)} \mid Y^{(n)} = y) represents a model-based extrapolation of the accumulation curve.

Rarefaction and extrapolation

The \sigma-diversity (Pitman 2003)

Let Y^{(n)} be the number of distinct values arising from a Gibbs-type prior:
1. Let \sigma < 0 and V_{n, y}(\sigma, H) be the weights of a Dirichlet multinomial, then Y^{(n)} \rightarrow H a.s.
2. Let \sigma = 0 and V_{n, y}(\alpha) be the weights of a Dirichlet process, then Y^{(n)} / \log(n) \rightarrow \alpha a.s.
3. Let \sigma \in (0,1) and V_{n, y}(\sigma, \gamma) be the weights of a \sigma-stable PK, then Y^{(n)} / n^\sigma \rightarrow \gamma a.s.
Moreover, consider a generic set of weights V_{n, y} and let c_\sigma(n) = \begin{cases} 1, & \sigma <0,\\[2mm] \log(n), & \sigma = 0,\\[1mm] n^{\sigma}, & \sigma \in (0,1) \end{cases} Then, as n \rightarrow \infty: \frac{Y^{(n)}}{c_\sigma(n)} \overset{\textup{a.s.}}{\longrightarrow} S_{\sigma}. The r.v. S_{\sigma} is the \sigma-diversity and its distribution coincides with the prior for H, \alpha, and \gamma.

The \sigma-diversity (Pitman 2003)

Broadly speaking, the \sigma-diversity can be seen as a rescaled richness measure.
The \sigma-diversity is deterministic and assumed to be known in the Dirichlet multinomial, Dirichlet process, and \sigma-stable PK.
However, the \sigma-diversities H, \alpha, or \gamma are typically unknown, and they can be estimated employing a prior distribution, leading to a Gibbs-type prior.
The posterior law of the \sigma-diversity is a key quantity for measuring biodiversity. The posterior law of S_\sigma has an elegant connection with accumulation curves.

Theorem (Rigon et al. 2025b)

Let X_1,\dots,X_{n+m} be a sample from a Gibbs-type prior with Y^{(n+m)} distinct values. Then:

\left(\frac{Y^{(n+m)}}{c_\sigma(m)} \mid X_1,\dots,X_n \right) \overset{d}{\longrightarrow} \tilde{S}_{\sigma}, \qquad \tilde{S}_\sigma \overset{d}{=} (S_\sigma \mid X_1,\dots,X_n), as m \rightarrow \infty, where S_\sigma is the \sigma-diversity.

Tree species in the Amazon Basin (Hubbell et al. 2008)

How to get a Bayesian estimator for the biodiversity? We can place a prior on \alpha and the compute its posterior distribution using Bayes theorem.
Posterior distribution of the \sigma-diversity \alpha, under a (conjugate) Stirling-gamma prior (Zito et al. 2024). The dotted lines represent 95% credible intervals. The dashed line is the posterior mean.

Historical summary and extensions

The fundamental biodiversity number \alpha is a growth-adjusted richness: \frac{Y^{(n)}}{\log n} \stackrel{\mathrm{a.s.}}{\longrightarrow} \alpha, \qquad n\to \infty
The \alpha from the neutral theory, DP and Fisher log-series is the same for large n: \underbrace{y = \hat{\alpha}^\mathrm{Fisher}\log\Big(1 + \frac{n}{\hat{\alpha}^\mathrm{Fisher}}\Big)}_{\text{Solution from Fisher}} \ \stackrel{\text{Under large } n}{\approx} \ \underbrace{\sum_{j = 1}^{n} \frac{\hat{\alpha}^{\text{DP}}}{\hat{\alpha}^{\text{DP}} + j - 1} = y}_{\text{MLE from Hubbell and DP}}
There are a lot of possible extensions. This is a very biased list:
- The choice \sigma \in (0, 1) allows for polynomial growth (Lijoi et al. 2007b; Pitman and Yor 1997) while \sigma < 0 leads to a finite richness (Gnedin 2010; Lijoi et al. 2020).
- Multiple samples, a.k.a. partial exchangeability (Camerlenghi et al. 2019; Franzolini et al. 2025)
- Enriched processes, taxonomical data (Rigon et al. 2025a; Rigon et al. 2025b; Zito et al. 2023).
- Incidence data and feature models (Ghilotti et al. 2025).
- What about covariates?

Specimens and BINs in Barcode of Life

Our data: global Malaise Trap project.
We have a total of 1,781,340 arthropods split across 154,688 BINs and N = 2415 sampling sites.

How can we link covariates with biodiversity?

For outcome Y_i and covariates \bm{x}_i = (x_{i1}, \ldots, x_{ip}) \in \mathbb{R}^p, i = 1, \ldots, N, a generalized linear model (GLM) extends the linear model in non-Gaussian settings via three ingredients:
The exponential dispersion family directing each Y_i: f(y_i; \theta_i, \phi) = \exp\left\{\frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)\right\}, where \theta_i \in \Theta is the natural parameter, \phi > 0 is the dispersion, and a_i,b,c are given functions.
The linear predictor \eta_i = \bm{x}_i^\top \bm{\beta}, with \bm{\beta} \in \mathbb{R}^p.
A monotonic link function g that links \eta_i with the mean of Y_i, namely \mu(\theta_i) = \mu_i: \mathbb{E}(Y_i \mid \bm{x}_i) = \mu_i = g^{-1}(\eta_i). The link is canonical when g is such that \theta_i = \eta_i.
Is there an exponential family that is related to a sample-specific diversity \alpha_i?

The Hubbell regression

The number of distinct species from a DP is an exponential family! \mathbb{P}(Y^{(n)} = y; \alpha) = \frac{\alpha^y}{(\alpha)_n}|s(n, y)|, \qquad y \in\{ 1, \ldots, n\}, where (\alpha)_n = \Gamma(\alpha + n)/\Gamma(\alpha) and |s(n,y)| are signless Stirling numbers of the first kind.

The Hubbell regression is a GLM where:

The distinct species Y_i^{(n_i)} for i=1,\dots,N are independent draws from the distribution f(y_i; \theta_i) = \exp \left\{y_i \theta_i - \log[\Gamma(e^{\theta_i} + n_i) - \Gamma(e^{\theta_i})] + \log|s(n_i, y_i)|\right\}, with \alpha_i = e^{\theta_i} whose mean and variance functions are \mu_i = \mu(\theta_i, n_i) = \sum_{j=0}^{n_i-1} \frac{e^{\theta_i}}{e^{\theta_i} + j}, \qquad V(\mu(\theta_i, n_i)) = \sum_{j=0}^{n_i-1} \frac{e^{\theta_i}}{e^{\theta_i} + j}\left(1 - \frac{e^{\theta_i}}{e^{\theta_i} + j}\right).
The link function g depends on \eta_i and n_i and satisfies \mathbb{E}(Y_i\mid \bm{x}_i, n_i) = \mu_i = g^{-1}(\bm{x}_i^\top \bm{\beta}, n_i).

The canonical link

Using the canonical link, we have: \log \alpha_i = \theta_i = \bm{x}_i^\top \bm{\beta} = \eta_i, \qquad \mu_i = \mu(\bm{x}_i^\top \bm{\beta}, n_i) = \sum_{j = 0}^{n_i-1} \frac{e^{\bm{x}_i^\top \bm{\beta}}}{e^{\bm{x}_i^\top \bm{\beta}} + j} = g^{-1}(\bm{x}_i^\top \bm{\beta}, n_i).
Coefficients \bm{\beta} have the classic interpretation of log-linear models in terms of \alpha-diversity

If x_{ip} ↑ by 1 unit → \alpha_i ↑ by 100(e^{\beta_p} - 1)\%

The saturated model corresponds to the case in which \hat{\alpha}_i \approx \hat{\alpha}^{\text{Fisher}}_i. The null model corresponds to the case \hat{\alpha}_i = \hat{\alpha} for all sites, i.e. there is no variation in biodiversity across sampling sites.

Theorem (Poisson regression with log-log offset)

Call \gamma = 0.5772\dots the Euler’s constant. Under large values of n, when each n_i \to \infty we have Y^{(n)}_i \;\dot{\sim}\; 1 + \mathrm{Poisson}\left(\exp\left\{\bm{x}_i^\top\bm{\beta} + \log(\gamma + \log n_i)\right\}\right). Informally, we will say that \hat{\bm{\beta}}^{\mathrm{Hubbell}} \approx \hat{\bm{\beta}}^{\mathrm{Poisson}} .

Beyond the logarithmic growth: non-canonical links

Extensions of the neutral theory of Hubbell (2001) proposes a power-law behavior with a dispersal limitation parameter \omega \ge 0 so that \mathbb{E}\left(Y_i^{(n)}\right) = \sum_{j = 0}^{n_i-1} j^{-\omega} \frac{\alpha_i}{\alpha_i + j}.
Our proposal: a flexible growth depending on \sigma < 1 via non-canonical link \mu_i = \mu(\bm{x}_i^\top\bm{\beta}, n_i, \sigma) = \sum_{j = 0}^{n_i-1} \frac{e^{\bm{x}_i^\top\bm{\beta}}}{e^{\bm{x}_i^\top\bm{\beta}} + j^{1-\sigma}} = g^{-1}(\bm{x}_i^\top\bm{\beta}, n_i, \sigma), \qquad \sigma < 1. The parameter \sigma plays the same role it has in Gibbs-type priors.

Regression-based biodiversity indexes

Theorem (Regression-based diversity)

For the mean function of the a Hubbell regression model with the aforementioned non-canonical link, it holds \frac{\mu(\bm{x}_i^\top \bm{\beta}; n, \sigma)}{c_n(\sigma)} \longrightarrow S_\sigma(\bm{x}_i^\top\bm{\beta}), \qquad n \to \infty, Moreover:

If \sigma = 0, then c_n(\sigma) = \log n (logarithmic growth) and S_\sigma(\bm{x}_i^\top\bm{\beta}) = e^{\bm{x}_i^\top\bm{\beta}} = \alpha_i;
If \sigma \in (0, 1), then c_n(\sigma) = n^{\sigma} (polynomial growth) and S_\sigma(\bm{x}_i^\top\bm{\beta}) = e^{\bm{x}_i^\top\bm{\beta}}/\sigma = \alpha_i / \sigma;
if \sigma < 0, then c_n(\sigma) = 1 (finite richness) and S_\sigma(\bm{x}_i^\top\bm{\beta}) \approx C_\sigma^{-1}e^{\bm{x}_i^\top\bm{\beta}/(1-\sigma)} = C_\sigma^{-1} \alpha_i^{1/(1-\sigma)}, with C_\sigma a known constant.

This gives a novel notion of regression-based biodiversity.

Iteratively reweighted least squares (IRLS)

Update the mean: \mu_i^{(t)} \gets \sum_{j = 0}^{n_i-1} \frac{e^{\eta_i^{(t)}}}{e^{\eta_i^{(t)}} + j^{1-\sigma}}, \qquad \eta_i^{(t)} = \bm{x}_i^\top \bm{\beta}^{(t)}.
Numerically retrieve the diversity \alpha_i and compute variances: \alpha_i^{(t)} \gets \texttt{uniroot}\!\left(\sum_{j=0}^{n_i-1}\frac{x}{x+j}=\mu_i^{(t)}\right)\texttt{\$root}, \qquad v_i^{(t)} \gets \sum_{j=0}^{n_i-1} j\,\frac{\alpha_i^{(t)}}{(\alpha_i^{(t)}+j)^2}.
Compute working weights: w_i^{(t)} \gets \frac{1}{v_i^{(t)}} \sum_{j=0}^{n_i-1} j\frac{\alpha_i^{(t)}} {(\alpha_i^{(t)}+j^{1-\sigma})^2}.
Update the regression coefficients:

\bm{\beta}^{(t+1)} \gets (\bm{X}^\top \bm{W}^{(t)} \bm{X})^{-1} \bm{X}^\top \bm{W}^{(t)} \bm{z}^{(t)}, \qquad z_i^{(t)} \gets \eta_i^{(t)} + \frac{y_i - \mu_i^{(t)}}{(w_i^{(t)}v_i^{(t)})^{1/2}}.

Composite marginal likelihood and standard errors

The assumption behind GLMs is that observations are independent. In our case, we might have seen the same species (or BIN) at two different locations!
This information is lost when calculating the number Y_i^{(n_i)}. However, we can interpret L(\boldsymbol{\beta}) \propto \prod_{i=1}^N \frac{\alpha_i^{y_i}} {(\alpha_i)_{n_i}}, as a composite marginal likelihood. The estimating equation is still unbiased and estimates consistent.
We propose spatially heteroskedastic-consistent standard errors (Conley 1999) that use the Jaccard Index (or similar indices) between two samples: \widehat{\mathrm{se}}(\hat{\boldsymbol{\beta}}) = (\bm{X}^\top \hat{\bm{W}}\bm{X})^{-1} \textstyle\sum_{ik} {\color{#8B0000}{s_{ik}}}\, {\color{blue}{\hat{\bm{u}}_i\hat{\bm{u}}_k^\top}}\, (\bm{X}^\top \hat{\bm{W}}\bm{X})^{-1}, \qquad {\color{blue}{\hat{\bm{u}}_i}} = \frac{\partial}{\partial\boldsymbol{\beta}} \log \mathcal{L}(\hat{\boldsymbol{\beta}}), where {\color{#8B0000}{s_{ik}}} \in [0,1] is the fraction of shared BINs between locations i and k.

Our scientific questions

Is (actual) evapotranspiration the main driver of diversity?
How does anthropogenic impact affect diversity?

Modelling approach

Goal: test the joint effect of evapotranspiration and human footprint

HubbelGLM(cbind(n, y) ~ ., family = hubbell(sigma = 0), data = data)

Model considered (in order of complexity)

M0: no covariates
M1: AET
M2: + fixed effect for realm
M3: + time (julian week, shifter by 6 months if below equator, 1 degree Fourier transform
M4: + spatial gradient (ns(Latitude, 6))
M5: + zone and hfp and interaction
M6: + controls (wetlands, wind speed, collection days, residuals of regression of temperature with latitude)
M7: + anomalies for weather (temperature, humidity, precipitation, wind speed)

Hubbell achieves the best predictive performance

We run a 10-fold cross-validation experiment. Our benchmarks are other GLMs (matching the growth rate with n).
For each nested specification, we split the data into 10 folds, train all models in 90% of the data and predict the remaining 10% (cycle through).

Key results

Model M0: \sigma is estimated as 0.562. Square root!
Model M1: AET explains nearly 30% of the total variation in species richness
Model M7:
- A 10 mm/year increase in AET leads to a 12.4\% increase in diversity on average (\beta = 0.012; P < 0.0001).
- A 5 units on the HFP scale is associated with a 6.1\% reduction in diversity in tropical areas (\beta = −0.013; P < 0.0001), and a 7.5\% reduction in dry zones (\beta = −0.016; P < 0.004)
- Polar areas show an interesting increase of 8.2\% (\beta = 0.016; P < 0.038)
- No detected effect in continental areas

Effects for all diversity indices: variations

Effects for all diversity indices: global scales

Scenario evaluation

Tropical sites with high human activity (HFP > 30) miss a median of 197 species (21.3\% of potential richness).
Dry sites miss 202 species (29.2\%).
Temperate sites miss 26 species (4.3\%)
Continental sites miss 70 species (9.23\%)

A 5‑unit intensification of human presence globally leads to up to an 8\% reduction in potential richness
A 10\% increase in AET is associated with an average 3.81\% increase in potential richness.

Summary and future directions

Fisher’s \alpha is the same quantity described by Hubbell’s unified neutral theory of biodiversity, and it has deep connections with the Dirichlet process.
The Hubbell regression extends the neutral theory in a covariate-dependent GLM framework.
Different growth rates are captured via \sigma < 0, \sigma = 0, and \sigma \in (0,1).
From here to infinity: next steps
1. Extend to mixed models, allowing for random effects;
2. More on nonparametrics: GAMs

Rigon, T., Hsu, C., and Dunson D.B. (2025). A Bayesian theory for estimation of biodiversity. arXiv:2502.01333

Zito, A., Rigon, T., Roslin, T., Niittynen, P., Hebert, P. D. N., Zakharov, E., Ratnasingham, S., iBOL Consortium, Ovaskainen, O. and Dunson, D. B. (2026). Predicting global biodiversity via Hubbell regression. bioRxiv

Software: the R package `HubbellGLM`

Hubbell regression paper available on biorXiv

References

Camerlenghi, F., Lijoi, A., Orbanz, P., and Prünster, I. (2019), “Distribution theory for hierarchical processes,” Ann. Statist., 47, 67–92.

Conley, T. G. (1999), “GMM estimation with cross sectional dependence,” Journal of Econometrics, 92, 1–45.

De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prunster, I., and Ruggiero, M. (2015), “Are Gibbs-type priors the most natural generalization of the Dirichlet process?” IEEE Trans. Pattern Anal. Mach. Intell., IEEE, 37, 212–229.

Ewens, W. (1972), “The sampling theory of selectively neutral alleles,” Theoretical Population Biology, 3, 87–112.

Ferguson, T. S. (1973), “A Bayesian analysis of some nonparametric problems,” Ann. Statist., 1, 209–230.

Fisher, R. A., Corbet, A. S., and Williams, C. B. (1943), “The relation between the number of species and the number of individuals in a random sample of an animal population,” J. Anim. Ecol., 12, 42–58.

Franzolini, B., Lijoi, A., Prünster, I., and Rebaudo, G. (2025), “Multivariate species sampling models,” arXiv:2503.24004.

Ghilotti, L., Camerlenghi, F., and Rigon, T. (2025), “Bayesian analysis of product feature allocation models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Gnedin, A. (2010), “Species sampling model with finitely many types,” Electron. Comm. Prob., 15, 79–88.

Gnedin, A., and Pitman, J. (2005), “Exchangeable Gibbs partitions and Stirling triangles,” Zapiski Nauchnykh Seminarov, POMI, 325, 83–102.

Good, I. J. (1953), “The population frequencies of species and the estimation of population parameters,” Biometrika, 40, 237–264.

Good, I. J., and Toulmin, G. H. (1956), “The number of new species, and the increase in population coverage, when a sample is increased,” Biometrika, 43, 45–63.

Hubbell, S. P. (2001), The unified neutral theory of biodiversity and biogeography, Princeton University Press.

Hubbell, S. P., He, F., Condit, R., Borda-de-Água, L., Kellner, J., and Ter Steege, H. (2008), “How many tree species are there in the Amazon and how many of them will go extinct?” Proceedings of the National Academy of Sciences of the United States of America, 105, 11498–11504.

Lijoi, A., Mena, R. H., and Prünster, I. (2007a), “Bayesian nonparametric estimation of the probability of discovering new species,” Biometrika, Oxford University Press, 94, 769–786.

Lijoi, A., Mena, R. H., and Prünster, I. (2007b), “Controlling the reinforcement in Bayesian non-parametric mixture models,” J. R. Statist. Soc. B, Wiley Online Library, 69, 715–740.

Lijoi, A., Prünster, I., and Rigon, T. (2020), “The Pitman–Yor multinomial process for mixture modeling,” Biometrika, 107, 891–906.

McCullagh, P. (2016), “Two Early Contributions to the Ewens Saga,” Statist. Sci., 31, 23–26.

Pitman, J. (1996), “Some developments of the Blackwell-MacQueen urn scheme,” in Statistics, probability and game theory, IMS lecture notes monogr. ser., Inst. Math. Statist., Hayward, CA, pp. 245–267.

Pitman, J. (2003), “Poisson-Kingman partitions,” Lecture Notes-Monograph Series, 40, 1–34.

Pitman, J., and Yor, M. (1997), “The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator,” Ann. Probab., 25, 855–900.

Rigon, T., Hsu, C., and B., D. D. (2025a), “A Bayesian theory for estimation of biodiversity,” arXiv:2502.01333.

Rigon, T., Petrone, S., and Scarpa, B. (2025b), “Enriched Pitman-Yor processes,” Scand. J. Statist.

Zito, A., Rigon, T., and Dunson, D. B. (2023), “Inferring taxonomic placement from DNA barcoding allowing discovery of new taxa,” Meth. Ecol. Evol., 14, 529–542.

Zito, A., Rigon, T., and Dunson, D. B. (2024), “Bayesian nonparametric modeling of latent partitions via Stirling-gamma priors,” Bayesian Analysis.

Warm thanks (statistical team)

Warm thanks (ecological team, and more…)

A journey into statistical biodiversity

Approximately 80 years ago… Fisher et al. (1943)

Early ideas (Good 1953; Good and Toulmin 1956)

The Ewens sampling formula Ewens (1972)

The Dirichlet process (Ferguson 1973)

… and species sampling models (Pitman 1996)

Ewens formula governs the neutral theory! (Hubbell 2001)

All ties together (McCullagh 2016)

Species sampling models

Three notable examples

A characterization theorem

The quantification of biodiversity (Lijoi et al. 2007a)

Rarefaction and extrapolation

The \sigma-diversity (Pitman 2003)

The \sigma-diversity (Pitman 2003)

Tree species in the Amazon Basin (Hubbell et al. 2008)

Historical summary and extensions

Specimens and BINs in Barcode of Life

How can we link covariates with biodiversity?

The Hubbell regression

The canonical link

Beyond the logarithmic growth: non-canonical links

Regression-based biodiversity indexes

Iteratively reweighted least squares (IRLS)

Composite marginal likelihood and standard errors

Our scientific questions

Modelling approach

Hubbell achieves the best predictive performance

Key results

Effects for all diversity indices: variations

Effects for all diversity indices: global scales

Scenario evaluation

Summary and future directions

Software: the R package HubbellGLM

Hubbell regression paper available on biorXiv

References

Software: the R package `HubbellGLM`