Data Mining - CdL CLAMSES
Università degli Studi di Milano-Bicocca
In this unit we will cover the following topics:
We have seen that fully nonparametric methods are plagued by the curse of dimensionality.
GAMs and MARS are semi-parametric approaches that keep the model complexity under control so that:
The running example is about trawl data from the Great Barrier Reef.
The trawl dataset
We consider the trawl dataset, which refers to a survey of the fauna on the sea bed lying between the coast of northern Queensland and the Great Barrier Reef.
The response variable is Score, a standardized numeric quantity measuring the amount of fish caught at a given location.
We want to predict the catch score, as a function of a few covariates:
Latitude and Longitude of the sampling position. The longitude can be seen as a proxy of the distance from the coast in this specific experiment;
Depth of the sea at the sampling position;
Zone of the sampling region, either open or closed to commercial fishing;
Year of the sampling, which can be either 1992 or 1993.
Having removed a few observations due to missingness, we split the data into a training set (119 obs.) and a test set (30 obs.). The full trawl dataset is available in the sm R package.
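For concreteness, here is a minimal sketch of the data preparation in R. The column name Score1 and the use of data(trawl) are assumptions based on the sm package documentation; the seed and the exact split are illustrative.

```r
library(sm)              # contains the trawl data frame
data(trawl)              # recent versions of sm; older ones used provide.data()

trawl <- na.omit(trawl)  # drop observations with missing values

set.seed(123)            # arbitrary seed, for reproducibility
id_train    <- sample(nrow(trawl), 119)
trawl_train <- trawl[id_train, ]   # 119 training observations
trawl_test  <- trawl[-id_train, ]  # remaining observations form the test set
```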
A linear model for the trawl dataset
Let us begin our analysis by trying to predict the Score using a linear model of the form
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \qquad i=1,\dots,n,
where the covariates correspond to the variables of the trawl dataset, so that
\begin{aligned}
\texttt{Score}_i = \beta_0 &+ \beta_1 \texttt{Latitude}_i + \beta_2 \texttt{Longitude}_i + \\
&+ \beta_3\texttt{Depth}_i + \beta_4 I(\texttt{Zone}_i = \texttt{Closed}) + \beta_5 I(\texttt{Year}_i = \texttt{1993}).
\end{aligned}
Such a model can be estimated using ordinary least squares, resulting in:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 297.690 | 26.821 | 11.099 | 0.000 |
Latitude | 0.256 | 0.222 | 1.151 | 0.252 |
Longitude | -2.054 | 0.187 | -10.955 | 0.000 |
Depth | 0.020 | 0.007 | 3.003 | 0.003 |
Zone_Closed | -0.116 | 0.102 | -1.143 | 0.255 |
Year_1993 | 0.127 | 0.103 | 1.242 | 0.217 |
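A hedged sketch of how this fit can be obtained in R, assuming the data objects created above and that the response column is named Score1:

```r
# Ordinary least squares fit; Zone and Year are treated as factors so that
# the dummy variables Zone_Closed and Year_1993 are created automatically.
fit_lm <- lm(Score1 ~ Latitude + Longitude + Depth + factor(Zone) + factor(Year),
             data = trawl_train)
summary(fit_lm)   # coefficient table analogous to the one above
```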
[Figure: loess estimate of the relationship between Score and Longitude.]
Any regression function can be written as an ANOVA-type decomposition into main effects and interactions of increasing order, f(\bm{x}) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{j < k} f_{jk}(x_j, x_k) + \cdots. By imposing suitable constraints, this decomposition can be made unique.
More importantly, this decomposition gives us an intuition on how to build non-linear models with a simplified structure.
In GAMs we include only the main effects and we exclude the interaction terms.
Generalized linear models (GLMs) are a special case of GAMs, in which f_j(x_{ij}) = \beta_j x_{ij}.
Suppose we model each f_j(x) = \sum_{m = 1}^{M_j}\beta_{mj} h_{mj}(x) with a basis expansion, for example using regression splines.
In a regression problem we need to minimize, over the unknown \beta parameters, the loss \sum_{i=1}^n\left\{y_i - \beta_0 - \sum_{j=1}^pf_j(x_{ij})\right\}^2 subject to the constraints \sum_{i=1}^n f_j(x_{ij}) = 0, for j=1,\dots,p.
When f_j are regression splines, the above loss can be minimized using least squares. The identifiability issue could be handled by removing the intercept term from each spline basis.
However, here we consider an alternative and iterative minimization method, which is similar to the coordinate descent algorithm we employed for the elastic-net.
The backfitting algorithm for additive regression models
Initialize \hat{\beta}_0 = \bar{y} and set f_j(x_j) = 0, for j=1,\dots,p.
Cycle over j = 1,\dots,p, 1,\dots,p, \dots, until convergence:
Update the jth function by smoothing the partial residuals via \mathcal{S}_j, so that \hat{f}_j(x) \leftarrow \mathcal{S}_j\left[\left\{x_{ij},\; y_i - \hat{\beta}_0 - \sum_{k \neq j} \hat{f}_k(x_{ik})\right\}_{i=1}^n\right].
Center the function by subtracting its mean \hat{f}_j(x) \leftarrow \hat{f}_j(x) - \frac{1}{n}\sum_{i=1}^n\hat{f}_j(x_{ij}).
The backfitting algorithm, when the f_j are modeled as regression splines, is an instance of the Gauss-Seidel method for solving the corresponding least-squares problem. Convergence is guaranteed under standard conditions.
Interestingly, even when \mathcal{S}_j are smoothing splines the convergence of backfitting is guaranteed; the proof for this statement is less straightforward.
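A minimal, illustrative implementation of backfitting in R, using smooth.spline() as the smoother \mathcal{S}_j (this is a sketch of the algorithm above, not the method used later by mgcv):

```r
# Backfitting for an additive regression model.
# X: numeric matrix of predictors (n x p), y: numeric response vector.
backfit <- function(X, y, tol = 1e-6, max_iter = 100) {
  n <- nrow(X); p <- ncol(X)
  beta0 <- mean(y)                      # intercept estimate
  f_hat <- matrix(0, n, p)              # current values of f_j(x_ij)
  for (iter in seq_len(max_iter)) {
    f_old <- f_hat
    for (j in seq_len(p)) {
      # partial residuals: remove the intercept and all the other terms
      r_j <- y - beta0 - rowSums(f_hat[, -j, drop = FALSE])
      s_j <- smooth.spline(X[, j], r_j)          # the smoother S_j
      f_hat[, j] <- predict(s_j, X[, j])$y
      # centering step: each f_j must average to zero
      f_hat[, j] <- f_hat[, j] - mean(f_hat[, j])
    }
    if (max(abs(f_hat - f_old)) < tol) break     # convergence check
  }
  list(beta0 = beta0, f = f_hat)
}
```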
Local scoring algorithm for additive logistic regression
Initialize \hat{\beta}_0 = \text{logit}(\bar{y}) and set f_j(x_j) = 0, for j=1,\dots,p.
Iterate until convergence:
Define the quantities \hat{\eta}_i = \hat{\beta}_0 + \sum_{j=1}^p\hat{f}_j(x_{ij}) and \hat{\pi}_i = \{1 + \exp(-\hat{\eta}_i)\}^{-1}.
Construct the working response z_i = \hat{\eta}_i + \frac{y_i - \hat{\pi}_i}{\hat{\pi}_i(1 - \hat{\pi}_i)}, \qquad i=1,\dots,n.
Construct the weights w_i = \hat{\pi}_i(1 - \hat{\pi}_i), for i=1,\dots,n.
Run a weighted backfitting algorithm with the z_i as responses and the w_i as weights, which produces a new set of estimates \hat{f}_1,\dots,\hat{f}_p.
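In practice one rarely codes local scoring by hand: an additive logistic regression can be fitted directly, for instance with mgcv, which uses penalized IRLS internally. The formula and data names below are purely illustrative:

```r
library(mgcv)
# y is a binary (0/1) response; x1 and x2 are continuous covariates in dat
fit_logit_gam <- gam(y ~ s(x1) + s(x2), family = binomial, data = dat)
summary(fit_logit_gam)
```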
Each f_j(x;\beta) is a natural cubic spline, therefore the penalized least squares criterion is \mathscr{L}(\beta; \lambda) = \sum_{i=1}^n\left\{y_i - \beta_0 - \sum_{j=1}^pf_j(x_j; \beta_j)\right\}^2 + \sum_{j=1}^p\lambda_j \beta_j^T\bm{\Omega}_j\beta_j, whose joint minimization over \beta is available in closed form.
Hence, a direct algorithm that minimizes \mathscr{L}(\beta; \lambda) is used instead of backfitting.
In GAMs there are p smoothing parameters \lambda_1,\dots,\lambda_p that must be selected. We can proceed in the usual way, e.g. considering the generalized cross-validation criteria: \text{GCV}(\lambda_1,\dots,\lambda_p) = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - \text{df}/n}\right)^2.
An alternative criterion in this context is the REML (Restricted Maximum Likelihood), which is the marginal likelihood of the corresponding Bayesian model.
In practice, many software packages numerically optimize \text{GCV}(\lambda_1,\dots,\lambda_p), the REML, or other criteria, as a function of \lambda_1,\dots,\lambda_p, using e.g. the Newton-Raphson method.
Such an approach is particularly convenient in combination with smoothing splines, because the derivatives needed for Newton’s method are available in closed form.
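In mgcv the criterion used to select \lambda_1,\dots,\lambda_p is controlled by the method argument; a sketch with illustrative formula and data names:

```r
library(mgcv)
fit_gcv  <- gam(y ~ s(x1) + s(x2), data = dat, method = "GCV.Cp")  # GCV-based selection
fit_reml <- gam(y ~ s(x1) + s(x2), data = dat, method = "REML")    # REML-based selection
fit_reml$sp   # the selected smoothing parameters lambda_1, ..., lambda_p
```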
When p is large there is need to remove the potentially irrelevant variables. There exist several variable selection ideas for GAMs, but we will not cover the details here.
Option 1. Stepwise regression. Perhaps the simplest method, although it is not as efficient as in linear models because we cannot exploit the same computational tricks.
Option 2. COSSO: Component Selection and Smoothing Operator (Lin and Zhang, 2006). It’s an idea based on combining lasso-type penalties and GAMs.
Option 3. SpAM: Sparse Additive Models (Ravikumar, Liu, Lafferty and Wasserman, 2009). Similar to the above, but it exploits a variation of the non-negative garrote.
Option 4. Double-penalty and shrinkage (Marra and Wood, 2011). It acts on the penalty term of smoothing splines so that high values of \lambda_1,\dots,\lambda_p lead to constant functions (see the sketch after this list).
Option X. Fancy name. Yet another method for variable selection with GAMs.
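As an example, the double-penalty approach of Option 4 is available in mgcv through the select argument (shrinkage smoothers such as bs = "ts" are an alternative); formula and data names are illustrative:

```r
library(mgcv)
# select = TRUE adds an extra penalty to each smooth term, so that a term
# can be shrunk all the way to zero and effectively dropped from the model.
fit_sel <- gam(y ~ s(x1) + s(x2) + s(x3), data = dat, select = TRUE)
summary(fit_sel)   # terms with effective df close to zero are candidates for removal
```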
A GAM for the trawl data
Let us get back to the trawl data. A specification based on a GAM could be
\begin{aligned}
\texttt{Score}_i = \beta_0 &+ f_1(\texttt{Longitude}_i)+ f_2(\texttt{Latitude}_i) + f_3(\texttt{Depth}_i) +\\
&+ \beta_1 I(\texttt{Zone}_i = \texttt{Closed}) + \beta_2 I(\texttt{Year}_i = \texttt{1993}).
\end{aligned}
In GAMs the predictors are not necessarily modeled using nonparametric methods. Indeed, it is common to have a combination of smooth functions and linear terms.
Besides, it does not make sense to “smooth” a dummy variable.
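A sketch of how this specification can be fitted with mgcv, assuming the objects and column names used in the earlier snippets (the exact options used to produce the table below are not reported in the slides):

```r
library(mgcv)
fit_gam <- gam(Score1 ~ s(Longitude) + s(Latitude) + s(Depth) +
                 factor(Zone) + factor(Year),
               data = trawl_train, method = "REML")
summary(fit_gam)           # parametric coefficients and effective df of the smooths
plot(fit_gam, pages = 1)   # estimated smooth functions f_1, f_2, f_3
```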
term | estimate | std.error | df |
---|---|---|---|
(Intercept) | 0.849 | 0.088 | 1 |
Zone_Closed | -0.075 | 0.099 | 1 |
Year_1993 | 0.149 | 0.093 | 1 |
s(Longitude) | - | - | 4.694 |
s(Latitude) | - | - | 1 |
s(Depth) | - | - | 2.447 |
[Figures: estimated smooth effects of Longitude, Latitude and Depth from the GAM fitted to the trawl data.]
In the first place, it seems confirmed that Longitude has a marked non-linear impact on the catch score, as the initial analysis was suggesting.
In particular, the catch score is high when the sampling location is close to the coast (but not too close!), and then it suddenly decreases.
The effective degrees of freedom of Latitude are \text{df}_2 = 1, meaning that the estimated \hat{f}_2 collapsed to a linear term. The corresponding smoothing parameter \lambda_2 is very high.
Overall, the effect due to Latitude looks small or not present at all.
Depth seems to have a relevant effect on the Score, but this is likely due to a few leverage points at the right extreme of the Depth range.
Zone and Year seem to have a minor effect.
The naïve Bayes classifier expresses the binary classification probability \text{pr}(y = 1 \mid \bm{x}) as \text{pr}(y = 1 \mid \bm{x}) = \frac{\pi_1 \prod_{j=1}^p p_{j1}(x_j)}{\pi_0\prod_{j=1}^p p_{j0}(x_j) + \pi_1\prod_{j=1}^p p_{j1}(x_j)} = \frac{\pi_1\prod_{j=1}^p p_{j1}(x_j)}{p(\bm{x})}.
Hence, using class 0 as a baseline, we can derive the following expression: \log\frac{\text{pr}(y = 1 \mid \bm{x})}{\text{pr}(y = 0 \mid \bm{x})} = \log\frac{\pi_1\prod_{j=1}^p p_{j1}(x_j)}{\pi_0\prod_{j=1}^p p_{j0}(x_j)} = \log\frac{\pi_1}{\pi_0} + \sum_{j=1}^p\log\frac{p_{j1}(x_j)}{p_{j0}(x_j)} = \beta_0 + \sum_{j=1}^pf_j(x_j).
Therefore, although naïve Bayes and GAMs are fitted in quite different ways, there is a tight connection between the two methods.
Naïve Bayes has a generalized additive model structure. This also suggests that the “additive assumption” is linked to the notion of independence among the covariates.
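As a concrete illustration (not in the original slides), suppose each class-conditional density p_{jk}(x_j) is Gaussian with class-specific mean \mu_{jk} and common variance \sigma_j^2. Then f_j(x_j) = \log\frac{p_{j1}(x_j)}{p_{j0}(x_j)} = \frac{\mu_{j1} - \mu_{j0}}{\sigma_j^2}\,x_j + \text{const}, so Gaussian naïve Bayes with common variances reduces to an additive logistic model with linear f_j; allowing class-specific variances yields quadratic f_j.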
The mgcv R package
GAMs were introduced by Hastie and Tibshirani in 1986, together with the backfitting algorithm.
Simon Wood (2003) described thin-plate regression splines and their estimation (no backfitting).
Simon Wood (2004, 2011) invented methods for estimating \lambda_1,\dots,\lambda_p in an efficient and stable manner.
Marra and Wood (2011) discussed many methods for practical variable selection for GAMs.
For further details, there is a recent and advanced book by Simon Wood (2017) entitled “Generalized Additive Models: An Introduction with R”.
The mgcv package in R (by Simon Wood) implements everything mentioned here.
Pros
GAMs can automatically model non-linear relationships. This can potentially make more accurate predictions for the response.
GAMs, like linear models, are interpretable: the variation of the fitted response, holding all but one predictor fixed, does not depend on the values of the other predictors.
In practice, this means that we can plot the fitted functions \hat{f}_j separately to examine the roles of the predictors in modelling the response.
Cons
The additive assumption is quite strong, but it is still possible to manually add interactions as in the linear regression case.
MARS are a generalization of GAMs that avoid the additivity assumption.
MARS allow modeling of non–linear interactions and not just non–linear marginal effects.
MARS are at the same time:
MARS combine many of the techniques we have seen in this course into a single sophisticated algorithm.
Once the basis functions are specified, the estimate for \hat{\beta} is straightforward, using for example least squares or the IWLS algorithm in the classification case.
The main distinction with GAMs is that in MARS the basis functions are estimated from the data and therefore they are not pre-specified in advance.
Let us consider the following set of pairs of basis functions (linear splines): \mathcal{C} = \{(x_j - \xi)_+, (\xi - x_j)_+ : \xi \in \{x_{1j},\dots,x_{nj}\}, \ j=1,\dots,p \}. For example, two basis functions could be h_1(\bm{x}) = (x_1 - 0.5)_+ and h_2(\bm{x}) = (0.5 - x_1)_+.
The knots are placed at the observed data values. Hence, there are in total 2 n p possible basis functions among which we can choose.
Hence, after the first step of the MARS algorithm, our model for example could be f(\bm{x}; \beta) = \beta_0 + \sum_{m=1}^2\beta_m h_m(\bm{x}) = \beta_0 + \beta_1 (x_1 - 0.5)_+ + \beta_2(0.5 - x_1)_+.
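A tiny R sketch of these hinge bases and of the least-squares fit after the first forward step (the knot 0.5 and the synthetic data are purely illustrative):

```r
hinge_pos <- function(x, xi) pmax(x - xi, 0)   # (x - xi)_+
hinge_neg <- function(x, xi) pmax(xi - x, 0)   # (xi - x)_+

set.seed(1)
x1 <- runif(100)                                         # synthetic covariate
y  <- 1 + 2 * hinge_pos(x1, 0.5) - hinge_neg(x1, 0.5) +  # piecewise-linear signal
      rnorm(100, sd = 0.1)                               # plus noise

# given the pair of bases, the coefficients are estimated by least squares
fit <- lm(y ~ hinge_pos(x1, 0.5) + hinge_neg(x1, 0.5))
coef(fit)
```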
In the subsequent step, we consider a new pair of basis functions (x_j - \xi)_+, (\xi - x_j)_+ in \mathcal{C}, but this time we are allowed to perform two kinds of operations: we can either add the new pair to the model as it is (a main effect), or multiply it by one of the basis functions already in the model (an interaction).
MARS algorithm (degree d)
Initialize f(\bm{x}; \beta) = \beta_0 and let K be the maximum number of pairs of basis functions, so that M = 2K.
Identify the initial pair of basis functions h_1 and h_2 in \mathcal{C} that minimize the loss function. Then, let h_0(\bm{x}) = 1 and set \mathcal{M}_1 = \{h_0, h_1, h_2\}.
For k = 1,\dots,K - 1, do:
Let \mathcal{M}_k = \{h_0, h_1,\dots,h_{2k}\} be the basis functions already present in the model.
Consider a novel pair of bases \tilde{h}_1,\tilde{h}_2 \in \mathcal{C} \setminus \mathcal{M}_k. A candidate pair is obtained by multiplying \tilde{h}_1,\tilde{h}_2 with one of the bases in \mathcal{M}_k. Note that h_0 \in \mathcal{M}_k.
A valid candidate basis does not contain the same variable x_j more than once in the product, and it must involve at most d product terms.
Identify, among the candidates above, the pair that reduces the loss function the most. This results in a new pair of bases h_{2k+1}, h_{2k+2}.
Set \mathcal{M}_{k+1} \leftarrow \mathcal{M}_k \cup \{h_{2k+1}, h_{2k+2}\}.
Return the collection of models \mathcal{M}_1,\dots,\mathcal{M}_K.
The final model with M terms is very likely to overfit the data. Hence, it is important to remove some of the bases using backward elimination or best subset selection.
The optimal reduced model can be selected via cross-validation, but generalized cross-validation is often preferred due to computational reasons.
Unfortunately, it is not clear how to compute the effective degrees of freedom that are needed in the \text{GCV} formula.
In MARS, however, we do not have any miraculous simple formula like in LAR or ridge.
Simulation studies suggest that, for every knot placed, we should pay a price of about 3 degrees of freedom. However, this result is quite heuristic.
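In R, a common implementation of MARS is the earth package (the slides do not name the software, so this is an assumption, as are the column names):

```r
library(earth)
fit_mars1 <- earth(Score1 ~ ., data = trawl_train, degree = 1)  # additive MARS
fit_mars2 <- earth(Score1 ~ ., data = trawl_train, degree = 2)  # allows 2-way interactions
summary(fit_mars1)   # selected hinge functions, coefficients and GCV of the pruned model
```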
The basis functions used in MARS have the advantage of operating locally.
When the basis functions in \mathcal{C} are multiplied together, the result is nonzero only over the small part of the feature space where both component functions are nonzero.
Hence, the estimated function is built up parsimoniously, by making small local modifications to the fit obtained at the previous step.
This is important, since one should “spend” degrees of freedom carefully in high dimensions, to avoid incurring into the curse of dimensionality.
The construction of the model is hierarchical: a basis function involving a new variable can enter the model only through a product with a basis function that is already in the model.
Hence, an interaction of a higher order can only be introduced when interactions of a lower order are present.
This constraint, introduced for computational reasons, does not necessarily reflect the real behavior of the data, but it often helps in interpreting the results.
MARS fit on the trawl data (d = 1)
Term | Basis function | Coefficient |
---|---|---|
h_0(\bm{x}) | 1 | 1.382 |
h_1(\bm{x}) | (\texttt{Longitude} - 143.28)_+ | -4.275 |
h_2(\bm{x}) | (\texttt{Longitude} - 143.58)_+ | 3.984 |
The estimated function f(\bm{x}; \hat{\beta}) is remarkably simple and it involves only the Longitude. Moreover, the relationship between Score and Longitude is non-linear.
Both these considerations are consistent with the previous findings obtained using GAMs.
MARS fit on the trawl data (d = 2)
Term | Basis function | Coefficient |
---|---|---|
h_0(\bm{x}) | 1 | 1.318 |
h_1(\bm{x}) | (\texttt{Longitude} - 143.28)_+ | -5.388 |
h_2(\bm{x}) | (\texttt{Longitude} - 143.58)_+ | 4.172 |
h_3(\bm{x}) | I(\texttt{Year} = \texttt{1993})(\texttt{Longitude} - 143.05)_+ | 0.679 |
h_4(\bm{x}) | [\texttt{Latitude} - (-11.72)]_+ (\texttt{Longitude} - 143.05)_+ | 1.489 |
As expected, a degree 2 MARS leads to a more sophisticated fit, involving interactions between Year and Longitude as well as between Latitude and Longitude.
We can explore these effects using partial plots.
Predictive performance on the test set (MAE and RMSE):
Metric | Null model | Linear model | GAM | MARS (degree 1) | MARS (degree 2) |
---|---|---|---|---|---|
MAE | 0.611 | 0.361 | 0.315 | 0.305 | 0.334 |
RMSE | 0.718 | 0.463 | 0.408 | 0.390 | 0.407 |
Modelling the effect of Longitude in a non-linear fashion was indeed a good idea that improved the predictive performance.
Somewhat surprisingly, the best model is the extremely simple (yet effective) MARS of degree 1.
The weak interaction effects captured by the MARS of degree 2 led to an increased variance of the estimates, slightly deteriorating the fit.
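A sketch of how the test-set metrics in the table can be computed, reusing the (assumed) objects from the earlier snippets:

```r
mae  <- function(y, y_hat) mean(abs(y - y_hat))          # mean absolute error
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))     # root mean squared error

y_test     <- trawl_test$Score1
pred_gam   <- predict(fit_gam,   newdata = trawl_test)
pred_mars1 <- as.numeric(predict(fit_mars1, newdata = trawl_test))

c(MAE = mae(y_test, pred_gam),   RMSE = rmse(y_test, pred_gam))
c(MAE = mae(y_test, pred_mars1), RMSE = rmse(y_test, pred_mars1))
```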
Pros
Cons
MARS can quickly overfit the data if the model complexity is not kept under control.
Unfortunately, this is not always easy to do, as there is limited theoretical support to guide us in the choice of the effective degrees of freedom.
Comments and criticism of linear models
By simple graphical inspection, it seems that the relationship between Score and Longitude is non-linear.
Also, an interaction effect between Year and Longitude could be present.
These considerations support the idea that a nonparametric approach might be more appropriate.
However, the number of covariates is p = 5 and therefore a fully nonparametric estimation would not be feasible, because of the curse of dimensionality.
We need a simplified modelling strategy that accounts for non-linearities but at the same time is not fully flexible.