Statistics III - CdL SSE
Università degli Studi di Milano-Bicocca
“I would like to think of myself as a scientist, who happens largely to specialise in the use of statistics.”
Sir David Cox (1924-2022)
Statistica III is a monographic course on Generalized Linear Models (GLMs), a broadly applicable regression technique.
This is a B.Sc.-level course, but it has some prerequisites; it is assumed that you have already been exposed to:
In Statistica III we extend linear models within a unified and elegant framework.
Regression is such an important topic that the tour will continue at the M.Sc. level in CLAMSES: in the Data Mining course I will cover penalized methods and nonparametric regression.
Indeed, GLMs can arguably be regarded as one of the most influential statistical ideas of the 20th century.
Classical linear models and least squares began with the work of Gauss and Legendre, who applied the method to astronomical data.
Their idea, in modern terms, was to predict the mean of a normal, or Gaussian, distribution as a function of a covariate: \mathbb{E}(Y_i) = \beta_1 + \beta_2 x_i, \qquad i=1,\dots,n.
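As a small illustrative sketch (not part of the original notes, using made-up data), the least-squares estimates of \beta_1 and \beta_2 in this simple linear model can be computed directly from the normal equations:

```python
import numpy as np

# Hypothetical data: one covariate x and response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix X = [1, x]; solve the normal equations X'X beta = X'y
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Estimated means E(Y_i) = beta_1 + beta_2 * x_i
fitted = X @ beta
```

In practice one would use a dedicated routine (e.g. `numpy.linalg.lstsq` or a regression package), but the normal equations make the link to the formula above explicit.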
As early as 1922, Fisher introduced a more advanced non-linear model, designed to handle proportion data of the form S_i / m.
Through some modelling and calculus, Fisher derived a binomial model for S_i, with \mathbb{E}(S_i/m) = \pi_i = 1 - \exp\{-\exp(\beta_1 + \beta_2 x_i)\}, \qquad i = 1, \dots, n, where \pi_i \in (0, 1) is the success probability of a binomial distribution.
The corresponding inverse relationship is known as the complementary log-log link function: \beta_1 + \beta_2 x_i = \log\{-\log(1-\pi_i)\}.
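A quick numerical sketch (illustrative only) can confirm that the complementary log-log link and the mean function above are indeed inverses of each other:

```python
import math

def cloglog(pi):
    """Complementary log-log link: eta = log(-log(1 - pi))."""
    return math.log(-math.log(1.0 - pi))

def inv_cloglog(eta):
    """Inverse link (Fisher's mean function): pi = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))

# Round trip: applying the link after its inverse recovers pi on (0, 1)
for pi in (0.1, 0.5, 0.9):
    assert abs(inv_cloglog(cloglog(pi)) - pi) < 1e-12
```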
Dyke and Patterson (1952) also considered the case of modelling proportions, but specified \mathbb{E}(S_i/m) = \pi_i = \frac{\exp(\beta_1 + \beta_2 x_i)}{1 + \exp(\beta_1 + \beta_2 x_i)}, \qquad i = 1, \dots, n.
The corresponding inverse relationship is known as the logit link function: \beta_1 + \beta_2 x_i = \text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right). Indeed, this approach is now known as logistic regression.
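Analogously to the complementary log-log case, a small sketch (illustrative only) showing that the logit link and the Dyke–Patterson mean function are mutual inverses:

```python
import math

def logit(pi):
    """Logit link: eta = log(pi / (1 - pi))."""
    return math.log(pi / (1.0 - pi))

def expit(eta):
    """Inverse logit: pi = exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Round trip: expit undoes logit on (0, 1); logit(0.5) = 0
for pi in (0.1, 0.5, 0.9):
    assert abs(expit(logit(pi)) - pi) < 1e-12
```

Note that `expit` is the standard name for the inverse logit (e.g. `scipy.special.expit`), which is why logistic regression models \pi_i through this S-shaped curve.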
We will use several textbooks throughout this course, some more specialized than others. They are listed in order of importance: