Statistics III - CdL SSE
Università degli Studi di Milano-Bicocca
“Everything should be made as simple as possible, but not simpler”
Attributed to Albert Einstein
This unit will cover the following topics:
The main theme is: what should we do when the assumptions of linear models are violated?
We will push the linear model to its limit, using it even when it is not supposed to work.
The symbol 📖 means that a few extra steps are discussed in the handwritten notes.
The variables we will consider are: city.distance, engine.size, n.cylinders, curb.weight, and fuel.
We assume you are already familiar with linear models. The following is a brief recap rather than a full discussion.
Let us consider the variables city.distance (y), engine.size (x) and fuel (z).
A simple linear regression Y_i = \beta_1 + \beta_2 x_i + \epsilon_i, \qquad i=1,\dots,n, could be easily fit by least squares…
… but the plot suggests that the relationship between city.distance and engine.size is not well approximated by a linear function.
… and also that fuel has a non-negligible effect on the response.
A general and more flexible formulation for modeling the relationship between a vector of fixed covariates \bm{x}_i = (x_{i1},\dots,x_{ip})^T \in \mathbb{R}^p and a random variable Y_i \in \mathbb{R} is Y_i = f(\bm{x}_i; \beta) + \epsilon_i, \qquad i=1,\dots,n, where the “errors” \epsilon_i are iid random variables, having zero mean and variance \sigma^2.
To estimate the unknown parameters \beta, a possibility is to rely on the least squares criterion: we seek the minimum of the objective function D(\beta) = \sum_{i=1}^n\{y_i - f(\bm{x}_i; \beta)\}^2, using n pairs of covariates \bm{x}_i = (x_{i1},\dots,x_{ip})^T and the observed realizations y_i of the random variables Y_i, for i = 1,\dots,n. The optimal value is denoted by \hat{\beta}.
The predicted values, which estimate \mathbb{E}(Y_i), are \hat{y}_i = f(\bm{x}_i; \hat{\beta}), for i=1,\dots,n.
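As an illustration, here is a minimal sketch (on simulated data, with an arbitrary choice of f) of how the objective D(\beta) can be minimized numerically in R:

```r
# Least squares for a generic f(x; beta), minimized numerically (simulated data)
set.seed(1)
n <- 100
x <- runif(n, 0.5, 5)                                 # a single covariate
f <- function(x, beta) beta[1] * exp(-beta[2] * x)    # an illustrative non-linear mean function
y <- f(x, c(10, 0.5)) + rnorm(n, sd = 0.5)            # simulated responses

D <- function(beta) sum((y - f(x, beta))^2)           # objective function D(beta)
fit <- optim(par = c(1, 1), fn = D)                   # numerical minimization
fit$par                                               # least squares estimate of beta
```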
Let us consider again the variables city.distance (y), engine.size (x) and fuel (z).
Which function f(x,z;\beta) should we choose?
Linear model
In a linear model the response variable Y_i is related to the covariates through the function \mathbb{E}(Y_i) =f(\bm{x}_i; \beta) = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} =\bm{x}_i^T\beta, where \bm{x}_i = (x_{i1},\dots,x_{ip})^T is a vector of covariates and \beta = (\beta_1,\dots,\beta_p)^T is the corresponding vector of coefficients.
The response random variables are collected in the random vector \bm{Y} = (Y_1,\dots,Y_n)^T, whose observed realization is \bm{y} = (y_1,\dots,y_n)^T.
The design matrix is an n \times p matrix containing the values of the covariates, defined by \bm{X} = \begin{bmatrix} x_{11} & \cdots & x_{1p}\\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}.
Least squares estimate (OLS)
If the design matrix has full rank, that is, if \text{rk}(\bm{X}^T\bm{X}) = p, then the least squares estimate has an explicit solution: \hat{\beta} = (\bm{X}^T\bm{X})^{-1}\bm{X}^T \bm{y}.
The quantity D(\hat{\beta}) is the so-called deviance, which is equal to D(\hat{\beta}) = ||\bm{y} - \hat{\bm{y}}||^2 = \bm{y}^T(I_n - \bm{H})\bm{y}, where \bm{H} = \bm{X}(\bm{X}^T\bm{X})^{-1}\bm{X}^T is the hat (projection) matrix and \hat{\bm{y}} = \bm{H}\bm{y} is the vector of fitted values.
Moreover, a typical estimate for the residual variance \sigma^2 is obtained as follows: s^2 = \frac{D(\hat{\beta})}{n - p} = \frac{1}{n-p}\sum_{i=1}^n(y_i - \bm{x}_i^T\hat{\beta})^2.
To evaluate the goodness of fit, we can calculate the coefficient of determination: R^2 = 1 - \frac{\text{(``Residual deviance'')}}{\text{(``Total deviance'')}} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2}.
Recall that the errors \bm{\epsilon} have zero mean, \mathbb{E}(\bm{\epsilon}) = 0, and are homoscedastic and uncorrelated, \text{var}(\bm{\epsilon}) = \sigma^2I_n.
Then, the estimator \hat{\beta} is unbiased \mathbb{E}(\hat{\beta}) = \beta and its variance is \text{var}(\hat{\beta}) = \sigma^2 (\bm{X}^T\bm{X})^{-1}. Since \sigma^2 is also unknown, we can estimate the variances of \hat{\beta} as follows: \widehat{\text{var}}(\hat{\beta}) = s^2 (\bm{X}^T\bm{X})^{-1}.
The standard errors of the components of \hat{\beta} correspond to the square root of the diagonal of the above covariance matrix.
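The formulas above translate directly into matrix computations. A minimal sketch on simulated data (in practice, lm() does all of this via a QR decomposition):

```r
# OLS "by hand", mirroring the formulas above (simulated data)
set.seed(2)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # design matrix with intercept
beta <- c(1, 2, -1)
y <- as.vector(X %*% beta + rnorm(n))

XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y        # least squares estimate
H        <- X %*% XtX_inv %*% t(X)        # hat matrix
y_hat    <- H %*% y                       # fitted values
s2       <- sum((y - y_hat)^2) / (n - p)  # estimate of the residual variance
se       <- sqrt(diag(s2 * XtX_inv))      # standard errors of beta_hat

summary(lm(y ~ X - 1))                    # same estimates, via lm()
```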
Let us additionally assume that the errors follow a Gaussian distribution: \epsilon_i \overset{\text{iid}}{\sim} \text{N}(0, \sigma^2).
This implies that the distribution of the estimator \hat{\beta} is \hat{\beta} \sim \text{N}_p(\beta, \sigma^2 (\bm{X}^T\bm{X})^{-1}).
Confidence intervals and Wald tests can be obtained through classical inferential theory.
The diagonal elements h_i \in [0, 1] of the matrix \bm{H} are called leverages, and it holds that \text{var}(\hat{Y}_i) = \sigma^2 h_i, \qquad \text{var}(Y_i - \hat{Y}_i) = \sigma^2 (1 - h_i), \qquad \text{cor}(Y_i, \hat{Y}_i) = \sqrt{h_i}. The leverage h_i determines the precision with which \hat{Y}_i predicts Y_i: when h_i is close to 1, \text{cor}(Y_i, \hat{Y}_i) \approx 1, so changes in a single point Y_i lead to significant changes in \hat{Y}_i.
Leverages also appear in the definition of standardized residuals: \tilde{r}_i = \frac{r_i}{\sqrt{s^2(1 - h_i)}} = \frac{y_i - \bm{x}_i^T\hat{\beta}}{\sqrt{s^2(1 - h_i)}}, where r_i = y_i - \bm{x}_i^T\hat{\beta} are the (raw) residuals.
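Both quantities are directly available from a fitted lm object. A short sketch, using R's built-in cars data purely for illustration:

```r
# Leverages and standardized residuals from a fitted linear model
fit   <- lm(dist ~ speed, data = cars)  # built-in data, for illustration only
h     <- hatvalues(fit)                 # leverages h_i (diagonal of H)
r     <- residuals(fit)                 # raw residuals r_i
r_std <- rstandard(fit)                 # standardized residuals r_i / sqrt(s^2 (1 - h_i))
plot(h, r_std)                          # high-leverage, large-residual points deserve attention
```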
Our first attempt for predicting city.distance (y) via engine.size (x) and fuel (z) is:
f(x, z; \beta) = \beta_1 + \beta_2 x + \beta_3 x^2 + \beta_4 x^3 + \beta_5 I(z = \texttt{gas}).
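A sketch of how this model could be fitted in R, assuming the data are stored in a data frame called auto (the name is ours; the variable names are those listed above):

```r
# First model: cubic polynomial in engine.size plus the fuel dummy (data frame name assumed)
m1 <- lm(city.distance ~ engine.size + I(engine.size^2) + I(engine.size^3) + fuel,
         data = auto)
summary(m1)
```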
We obtain the following summary for the regression coefficients \hat{\beta}.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 28.045 | 3.076 | 9.119 | 0.000 |
| engine.size | -10.980 | 3.531 | -3.109 | 0.002 |
| engine.size^2 | 2.098 | 1.271 | 1.651 | 0.100 |
| engine.size^3 | -0.131 | 0.139 | -0.939 | 0.349 |
| fuel_gas | -3.214 | 0.427 | -7.523 | 0.000 |

| r.squared | sigma | deviance |
|---|---|---|
| 0.5973454 | 1.790362 | 634.6687 |
A second model uses the logarithm of the response and of engine.size (together with the fuel dummy); its estimates are:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.060 | 0.047 | 64.865 | 0 |
| log(engine.size) | -0.682 | 0.040 | -17.129 | 0 |
| fuel_gas | -0.278 | 0.038 | -7.344 | 0 |

| r.squared.original | r.squared | sigma | deviance |
|---|---|---|---|
| 0.5847555 | 0.6196093 | 0.1600278 | 5.121777 |
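A sketch of how this second fit could be obtained, under the same assumptions on the data frame as before:

```r
# Second model: log-transformed response and log-transformed engine size
m2 <- lm(log(city.distance) ~ log(engine.size) + fuel, data = auto)
summary(m2)
plot(fitted(m2), rstandard(m2))   # residual plot: some heteroscedasticity may remain
```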
This second model is more parsimonious, and yet it reaches satisfactory predictive performance.
It is also more coherent with the nature of the data: the predictions cannot be negative, and the relationship between engine size and the consumption is monotone.
Yet, there is still some heteroscedasticity in the residuals: is this due to a covariate that has not been included in the model?
Let us consider two additional variables: curb.weight (w) and n.cylinders (v).
A richer model, therefore, could be: \log{Y_i} = \beta_1 + \beta_2 \log{x_i} + \beta_3 \log{w_i} + \beta_4 I(z_i = \texttt{gas}) + \beta_5 I(v_i = 2) + \epsilon_i, for i=1,\dots,n. The estimates are:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 9.423 | 0.482 | 19.549 | 0.000 |
| log(engine.size) | -0.180 | 0.051 | -3.504 | 0.001 |
| log(curb.weight) | -0.943 | 0.072 | -13.066 | 0.000 |
| fuel_gas | -0.353 | 0.022 | -15.934 | 0.000 |
| cylinders2_TRUE | -0.481 | 0.052 | -9.301 | 0.000 |

| r.squared.original | r.squared | sigma | deviance |
|---|---|---|---|
| 0.869048 | 0.8819199 | 0.0896089 | 1.589891 |
In this third model, we handled the outliers appearing in the residual plots, which turn out to correspond to the group of cars with 2 cylinders.
The diagnostic plots are also very much improved, although still not perfect.
The estimates are coherent with our expectations, based on common knowledge. Have a look at the book (Azzalini and Scarpa (2012)) for a detailed explanation of \beta_4!
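A sketch of the corresponding fit, again assuming a data frame auto; the 2-cylinder indicator is encoded here with I(), which is one possible way to reproduce the cylinders2_TRUE term in the table:

```r
# Third model: adding log(curb.weight) and an indicator for 2-cylinder cars
m3 <- lm(log(city.distance) ~ log(engine.size) + log(curb.weight) + fuel +
           I(n.cylinders == 2), data = auto)
summary(m3)
par(mfrow = c(2, 2)); plot(m3)    # standard diagnostic plots
```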
The car dataset is available from the textbook (A&S) website:
Classical assumptions of linear models
(A.1) Linear structure, namely \bm{Y} = \bm{X}\beta + \bm{\epsilon} with \mathbb{E}(\bm{\epsilon}) = 0, implying \mathbb{E}(\bm{Y}) = \bm{X}\beta.
(A.2) Homoscedasticity and uncorrelatedness of the errors, namely \text{var}(\bm{\epsilon}) = \sigma^2 I_n.
(A.3) Gaussianity, namely \bm{\epsilon} \sim \text{N}_n(0, \sigma^2 I_n). In other words, the errors \epsilon_i \overset{\text{iid}}{\sim}N(0, \sigma^2) are iid Gaussian random variables with zero mean and variance \sigma^2.
It is also commonly asked that \text{rk}(\bm{X}) = p, otherwise the model is not identifiable.
A plane can still fly with one of its engines on fire, but this is hardly an appealing situation.
Similarly, robust estimators may work under model misspecification, but this does not mean we should neglect checking whether the original assumptions hold.
Let us consider the case in which assumptions (A.1)-(A.2) are valid but (A.3) is not, that is \mathbb{E}(\bm{\epsilon}) = 0 and \text{var}(\bm{\epsilon}) = \sigma^2 I_n, but \epsilon does not follow a Gaussian distribution.
For example, \epsilon_i may follow a Laplace distribution, a skew-Normal, a logistic distribution, a Student’s t distribution, etc.
Under (A.1)-(A.2), even without requiring normality of the errors (A.3), we obtain the usual formulas: \mathbb{E}(\hat{\beta}) = \beta, \qquad \text{var}(\hat{\beta}) = \sigma^2(\bm{X}^T\bm{X})^{-1}. Moreover, because of the Gauss-Markov theorem, the OLS estimator \hat{\beta} is the most efficient within the class of linear and unbiased estimators (BLUE) for any distribution of the errors \bm{\epsilon}.
When the errors are non-Gaussian, the exact inferential results are no longer valid; in particular, \hat{\beta} no longer follows a Gaussian distribution in finite samples.
However, a central limit theorem can be invoked under very mild conditions on the design matrix \bm{X}.
Thus, when the sample size n is large enough, then the following approximation holds \hat{\beta} \:\dot{\sim}\: \text{N}_p(\beta, \sigma^2(\bm{X}^T\bm{X})^{-1}), from which confidence intervals and test statistics can be obtained as usual. The approximation is excellent if the errors are symmetric around 0.
Non-normality of the errors is not a major concern: the OLS estimator preserves most of its properties, including approximate normality for sufficiently large n.
There is often an over-emphasis on testing whether the residuals are Gaussian. However, even if normality is rejected, the practical implications are minimal.
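A small simulation can make this concrete: even with heavy-tailed (Student's t) errors, the sampling distribution of a coefficient estimate looks approximately Gaussian for moderate n. This is only an illustrative sketch:

```r
# Sampling distribution of beta_hat with non-Gaussian errors (illustrative simulation)
set.seed(3)
n <- 200; B <- 2000
x <- runif(n)
beta2_hat <- replicate(B, {
  y <- 1 + 2 * x + rt(n, df = 5)    # heavy-tailed errors, not Gaussian
  coef(lm(y ~ x))[2]
})
hist(beta2_hat, breaks = 40, freq = FALSE)  # roughly Gaussian, centred at the true value 2
```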
Suppose now that the linearity assumption (A.1) is valid but homoscedasticity of the errors (A.2) is not. Instead, we consider heteroscedastic errors: \text{var}(\bm{\epsilon}) = \bm{\Sigma},\quad \text{or equivalently} \quad \text{var}(Y_i) = \sigma^2_i, \quad i=1,\dots,n, where \bm{\Sigma} = \text{diag}(\sigma^2_1,\dots,\sigma_n^2) is a diagonal matrix with positive entries.
The OLS estimator is still unbiased, with a modified covariance structure: \mathbb{E}(\hat{\beta}) = \beta, \qquad \text{var}(\hat{\beta}) = (\bm{X}^T\bm{X})^{-1}\bm{X}^T\bm{\Sigma}\bm{X} (\bm{X}^T\bm{X})^{-1}. If in addition we assume Gaussianity of the errors, that is \bm{\epsilon} \sim \text{N}_n(0,\bm{\Sigma}), then \hat{\beta} \sim \text{N}_p(\beta, (\bm{X}^T\bm{X})^{-1}\bm{X}^T\bm{\Sigma}\bm{X} (\bm{X}^T\bm{X})^{-1}). Under suitable but mild conditions on \bm{X} and \bm{\Sigma}, the estimator is also consistent.
In the presence of heteroscedasticity, the OLS estimator still gives a sensible point estimate. However, it is no longer efficient and the classical standard errors are wrong.
A potential approach is to accept the inefficiency of the OLS estimator in this scenario and correct the standard errors.
The elements of \bm{\Sigma} are unknown, but we can estimate them from the data. Note that \text{var}(r_i) = \text{var}(y_i - \bm{x}_i^T\hat{\beta}) = \sigma^2_i(1 - h_i), suggesting the estimate \hat{\sigma}^2_i = r_i^2 / (1 - h_i).
This leads to the so-called sandwich estimator of the covariance matrix: \widehat{\text{var}}(\hat{\beta}) = (\bm{X}^T\bm{X})^{-1}\bm{X}^T\hat{\bm{\Sigma}}\bm{X} (\bm{X}^T\bm{X})^{-1}, where \hat{\bm{\Sigma}} = \text{diag}(\hat{w}_1,\dots,\hat{w}_n) and \hat{w}_i = r_i^2 / (1 - h_i).
These are known as White’s heteroscedasticity-consistent standard errors.
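A minimal sketch of the sandwich estimator computed "by hand" on simulated heteroscedastic data (packages such as sandwich provide ready-made implementations):

```r
# White's heteroscedasticity-consistent standard errors, computed from the formulas above
set.seed(4)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 2 * x)    # heteroscedastic errors
fit <- lm(y ~ x)

X <- model.matrix(fit)
r <- residuals(fit)
h <- hatvalues(fit)
Sigma_hat  <- diag(r^2 / (1 - h))              # estimated error variances
XtX_inv    <- solve(t(X) %*% X)
V_sandwich <- XtX_inv %*% t(X) %*% Sigma_hat %*% X %*% XtX_inv
sqrt(diag(V_sandwich))                         # robust standard errors
sqrt(diag(vcov(fit)))                          # classical standard errors, for comparison
```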
Let us consider again the case of heteroscedastic errors: \text{var}(\bm{\epsilon}) = \sigma^2\bm{\Omega}^{-1},\quad \text{or equivalently} \quad \text{var}(Y_i) = \sigma^2_i = \frac{\sigma^2}{\omega_i}, \quad i=1,\dots,n, where \bm{\Omega} = \text{diag}(\omega_1,\dots,\omega_n) is a diagonal matrix of positive weights. However, here we assume that the weights \omega_1,\dots,\omega_n are known, a common situation in survey design.
Let us define the standardized quantities: \bm{Y}^* = \bm{\Omega}^{1/2}\bm{Y}, \qquad \bm{X}^* = \bm{\Omega}^{1/2}\bm{X}. This is equivalent to say that Y_i^* = \sqrt{\omega_i} Y_i and x_{ij}^* = \sqrt{\omega_i} x_{ij}. Then, it is easy to show that \mathbb{E}(\bm{Y}^*) = \bm{X}^*\beta, \qquad \text{var}(\bm{Y}^*) = \sigma^2\bm{\Omega}^{1/2}\bm{\Omega}^{-1}\bm{\Omega}^{1/2}= \sigma^2 I_n, namely the assumptions (A.1) and (A.2) are valid in the transformed scale.
In other words, after a suitable transformation, we have reduced the problem to a standard linear model.
Thus an estimator for \beta, based on the transformed data, is obtained by minimizing the deviance \begin{aligned} D_\text{wls}(\beta) &= (\bm{y}^* - \bm{X}^*\beta)^T(\bm{y}^* - \bm{X}^*\beta) = (\bm{y} - \bm{X}\beta)^T\bm{\Omega}(\bm{y} - \bm{X}\beta) \\ &=\sum_{i=1}^n \omega_i (y_i - \bm{x}_i^T\beta)^2, \end{aligned} which is a weighted version of the original quadratic loss: a high weight corresponds to a low variance.
The estimate minimizing D_\text{wls}(\beta), in both the transformed and the original scale, is \hat{\beta}_\text{wls} = [(\bm{X}^*)^T\bm{X}^*]^{-1}(\bm{X}^*)^T\bm{y}^* = (\bm{X}^T\bm{\Omega}\bm{X})^{-1}\bm{X}^T\bm{\Omega}\bm{y}, and it is referred to as the weighted least squares (WLS) estimator of \beta.
Such an estimator is unbiased and efficient (BLUE), with \mathbb{E}(\hat{\beta}_\text{wls}) = \beta, \qquad \text{var}(\hat{\beta}_\text{wls}) = \sigma^2 (\bm{X}^T\bm{\Omega}\bm{X})^{-1}. Moreover, if \bm{\epsilon} \sim \text{N}_n(0, \sigma^2\bm{\Omega}^{-1}) it also coincides with the maximum likelihood estimator.
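A sketch of weighted least squares in R, with known weights, both via lm() and via the explicit formula (simulated data):

```r
# Weighted least squares with known weights omega_i (simulated data)
set.seed(5)
n <- 100
x <- runif(n)
omega <- runif(n, 0.5, 2)                        # known positive weights
y <- 1 + 2 * x + rnorm(n, sd = 1 / sqrt(omega))  # var(eps_i) = sigma^2 / omega_i, with sigma^2 = 1

fit_wls <- lm(y ~ x, weights = omega)            # minimizes sum_i omega_i (y_i - x_i' beta)^2

X <- cbind(1, x)                                 # explicit formula (X' Omega X)^{-1} X' Omega y
beta_wls <- solve(t(X) %*% (omega * X), t(X) %*% (omega * y))
cbind(coef(fit_wls), beta_wls)                   # the two estimates coincide
```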
While the model may have been incorrectly specified for the original data, it could become appropriate once the transformations are considered, namely g(Y_i) = h_1(\bm{x}_i)\beta_1 + \cdots + h_p(\bm{x}_i)\beta_p + \epsilon_i, \qquad i=1,\dots,n, where g(\cdot) and h_j(\cdot) for j=1,\dots,p are non-linear and known functions.
This idea is conceptually simple and powerful. It also shows that linear models are capable of capturing non-linear relationships, as long as they remain linear in the parameters.
However, choosing g(\cdot) and h_j(\cdot) in practice is not simple. In our case study, we proceeded by trial and error and used contextual information to guide our final choice.
Box-Cox transform
If the data y_i are positive, we may consider a parametric class of transformations: g_\lambda(y) = \frac{y^\lambda - 1}{\lambda}, \qquad \lambda \neq 0, and g_\lambda(y) = \log{y} when \lambda = 0. This is the celebrated Box-Cox transform.
The case \lambda = 1 corresponds to no transformation, \lambda = 1/2 to the square root, \lambda = 0 to the logarithm, and \lambda = -1 to the reciprocal.
We estimate \lambda from the data using maximum likelihood, so that the data themselves can inform us about the best transformation. We assume g_\lambda(Y_i) = \bm{x}_i^T\beta + \epsilon_i, \qquad \epsilon_i \sim \text{N}(0, \sigma^2), \qquad i=1,\dots,n.
The aim of the transformation is to produce a response for which the variance of \epsilon_i is constant with an approximately normal distribution.
By assumption, the distribution of the transformed data \bm{Z}_\lambda = (g_\lambda(Y_1), \dots,g_\lambda(Y_n))^T is Gaussian, therefore their joint density is f_Z(\bm{z}_\lambda) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}(\bm{z}_\lambda - \bm{X}\beta)^T(\bm{z}_\lambda - \bm{X}\beta)\right\}.
Using standard tools of probability theory, we can obtain the density of the original data: f_Y(\bm{y})= f_Z(g_\lambda(y_1),\dots,g_\lambda(y_n))\prod_{i=1}^n\left|\frac{\partial g_\lambda(y_i)}{\partial y_i}\right|, \quad \text{where}\quad \left|\frac{\partial g_\lambda(y_i)}{\partial y_i}\right| = y_i^{\lambda - 1}. The additional term is the determinant of the Jacobian of the transformation.
The log-likelihood therefore is \ell(\beta, \sigma^2, \lambda) = -\frac{n}{2}\log{\sigma^2} - \frac{1}{2\sigma^2}(\bm{z}_\lambda - \bm{X}\beta)^T(\bm{z}_\lambda - \bm{X}\beta) + (\lambda - 1)\sum_{i=1}^n\log{y_i}.
Note that, for any given value of \lambda, the maximum likelihood estimates are \hat{\beta}_\lambda = (\bm{X}^T\bm{X})^{-1}\bm{X}^T\bm{z}_\lambda, \qquad \hat{\sigma}^2_\lambda = \frac{1}{n}(\bm{z}_\lambda - \bm{X}\hat{\beta}_\lambda)^T(\bm{z}_\lambda - \bm{X}\hat{\beta}_\lambda).
We can plug the above estimates into the log-likelihood. This gives the profile log-likelihood for \lambda, which admits a very simple expression:
\ell_P(\lambda) = \ell(\hat{\beta}_\lambda, \hat{\sigma}^2_\lambda, \lambda) = -\frac{n}{2}\log{\hat{\sigma}^2_\lambda} + (\lambda -1)\sum_{i=1}^n\log{y_i},
which must be numerically maximized over \lambda, e.g. using optim in R.
The optimal value \hat{\lambda} = \arg\max\ell_P(\lambda), as well as a confidence interval for it, may offer guidance in choosing the right transformation.
Box and Cox suggested using this approach as an exploratory tool. For instance, an optimal value \hat{\lambda} = 0.4210283 is hard to interpret but it could suggest a square root transformation.
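A sketch of the profile log-likelihood \ell_P(\lambda) and its numerical maximization (simulated data; MASS::boxcox() provides the same functionality, with a plot):

```r
# Box-Cox: profile log-likelihood for lambda, maximized numerically (simulated data)
set.seed(6)
n <- 100
x <- runif(n, 1, 5)
y <- exp(0.5 + 0.3 * x + rnorm(n, sd = 0.2))   # positive responses; true transformation is the log
X <- cbind(1, x)

profile_loglik <- function(lambda) {
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  sigma2_hat <- mean(lm.fit(X, z)$residuals^2)  # beta and sigma^2 profiled out
  -n / 2 * log(sigma2_hat) + (lambda - 1) * sum(log(y))
}

opt <- optimize(profile_loglik, interval = c(-2, 2), maximum = TRUE)
opt$maximum     # estimated lambda (should be close to 0 here)
```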
Let Y_i \sim \text{Poisson}(\mu_i) with mean \mathbb{E}(Y_i) = \mu_i = f(\bm{x}_i;\beta) = \text{var}(Y_i). Note that Y_i \,\dot{\sim}\, \text{N}(\mu_i, \mu_i), that is, Y_i is approximately Gaussian for large values of \mu_i. However, the data are heteroscedastic.
In modeling count data, we could transform the counts so that, at least approximately, the variance of g(Y_i) is constant and ordinary least squares methods can be used. The classical choice is g(y) = \sqrt{y}, for which the delta method gives \text{var}(\sqrt{Y_i}) \approx 1/4, a constant.
Let Y_i \sim \text{Binomial}(\pi_i, m_i), with success probability \pi_i = f(\bm{x}_i; \beta) and m_i trials. For large values of m_i, the Gaussian approximation Y_i \,\dot{\sim}\, \text{N}(m_i \pi_i, m_i\pi_i(1 - \pi_i)) holds. However, the data are heteroscedastic, because \text{var}(Y_i) = m_i \pi_i(1- \pi_i).
Thus, a variance stabilizing transformation in this case is g_{m_i}(y) = \sqrt{m_i}\arcsin\left(\frac{2 y}{m_i} - 1\right), because in fact we have that \text{var}(g_{m_i}(Y_i)) \approx \left(\frac{\sqrt{m_i}}{\sqrt{1 - (2\pi_i-1)^2}} \frac{2}{m_i}\right)^2 m_i \pi_i(1- \pi_i) = 1.
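A quick numerical check of the variance stabilization (simulated binomial data, with arbitrary m and \pi):

```r
# Numerical check: the arcsine transformation approximately stabilizes the variance at 1
set.seed(7)
m <- 50; p <- 0.3
y <- rbinom(10000, size = m, prob = p)
g <- sqrt(m) * asin(2 * y / m - 1)
var(y)   # approximately m * p * (1 - p) = 10.5
var(g)   # approximately 1
```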
In the case of transformations of the response variable we let \mathbb{E}(g(Y_i)) = \bm{x}_i^T\beta. However:
g(\mathbb{E}(Y_i)) \neq \mathbb{E}(g(Y_i)) \quad \Longrightarrow \quad \mathbb{E}(Y_i) \neq g^{-1}(\bm{x}_i^T\beta).
Thus \hat{y}_i = g^{-1}(\bm{x}_i^T\hat{\beta}) is a reasonable prediction for Y_i and not an estimate for its mean.
When g(y) = \log{y} this distinction can be made explicit, because we have g^{-1}(\mathbb{E}\{g(Y_i)\}) = g^{-1}(\bm{x}_i^T\beta) = \exp(\bm{x}_i^T\beta), \qquad \mathbb{E}(Y_i) = \exp(\bm{x}_i^T\beta + \sigma^2/2), the former being the geometric mean of Y_i, whereas the latter is the usual mean.
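A tiny numerical illustration of this difference under log-normality (arbitrary values of \bm{x}_i^T\beta and \sigma^2):

```r
# exp(E[log Y]) versus E[Y] for a log-normal response
eta <- 1; sigma2 <- 0.25   # eta plays the role of x_i' beta
exp(eta)                   # g^{-1}(E[g(Y_i)]): the geometric mean (and median)
exp(eta + sigma2 / 2)      # E[Y_i]: the usual mean, always larger
```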
Suppose Y_i \sim \text{Binomial}(\pi_i, m_i). The variance stabilizing transformation is not fully satisfactory in practice. In particular, it is clearly not applicable when m_i = 1 and Y_i \in \{0, 1\}, a very common problem called binary regression.
If we know that Y_i follows, say, a Bernoulli or a Gamma distribution, then we should use the appropriate likelihood rather than a Gaussian approximation.
Generalized Linear Models provide a much more elegant solution to the above problem.
Comments and criticisms
Is this a good model?
The overall fit seems satisfactory at first glance, especially if we aim at predicting the urban distance of cars with an average engine size (i.e., between 1.5L and 3L).
On the other hand, this model is unsuitable for extrapolation. Indeed:
It is plausible that we can find a better one, so what’s next?