Exercises C

Data mining - CdL CLAMSES

Tommaso Rigon

Università degli Studi di Milano-Bicocca

Theoretical exercises

C.1 - Equivalence between PCR and OLS when k = p

Show that when the number of principal components is k = p, the predicted values of PCR and ordinary least squares coincide, namely \bm{X}\hat{\beta}_\text{ols} = \bm{Z}\hat{\gamma}_\text{pcr}, where the definitions of \bm{Z} and \hat{\gamma} = (\hat{\gamma}_1,\dots, \hat{\gamma}_p) are given in this slide. Moreover, show that when k = p, then \hat{\beta}_\text{ols} = \hat{\beta}_\text{pcr}, where \hat{\beta}_\text{pcr} has been defined in this slide.

To be added. Refer to the notes available on e-learning.
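
As a complement, the following R sketch checks the statement numerically on simulated data (a sanity check, not a proof; the sample size, seed and variable names are arbitrary choices). It builds the principal components from the eigenvectors of \bm{X}^T\bm{X} and verifies that, with k = p components, the PCR and OLS fitted values and coefficients coincide.

# Numerical sanity check: PCR with k = p components reproduces the OLS fit
set.seed(123)
n <- 50; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))                  # centered and scaled predictors
y <- as.numeric(scale(X %*% runif(p) + rnorm(n), scale = FALSE))  # centered response

V <- eigen(crossprod(X))$vectors                        # loadings of the principal components
Z <- X %*% V                                            # all p principal components
beta_ols  <- solve(crossprod(X), crossprod(X, y))
gamma_pcr <- solve(crossprod(Z), crossprod(Z, y))

max(abs(X %*% beta_ols - Z %*% gamma_pcr))              # ~ 0: identical fitted values
max(abs(beta_ols - V %*% gamma_pcr))                    # ~ 0: beta_pcr = V gamma_pcr = beta_ols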

C.2 - Centering and scaling the predictors

Part I

Suppose the covariates \tilde{\bm{x}}_j were scaled (same variance) but not centered (different means). Moreover, suppose the response y was not centered. Consider the following estimator (\hat{\beta}_0, \hat{\beta}_\text{ridge}) = \arg\min_{\beta_0,\beta} \sum_{i=1}^n(y_{i} - \beta_0 - \bm{x}_{i}^T\beta)^2 + \lambda \sum_{j=1}^p\beta_j^2. Show that \hat{\beta}_\text{ridge} can be equivalently obtained using the centered data, that is \hat{\beta}_\text{ridge} = \arg\min_{\beta} \sum_{i=1}^n\{(y_{i} - \bar{y}) - (\bm{x}_{i} - \bar{\bm{x}})^T\beta\}^2 + \lambda \sum_{j=1}^p\beta_j^2, and that \hat{\beta}_0 = \bar{y} - \bar{\bm{x}}^T\hat{\beta}_\text{ridge}.

Part II

Suppose the covariates \tilde{\bm{x}}_j were not scaled (different standard deviations s_j) and not centered (different means). Moreover, suppose the response y was not centered. Consider the following scaled-ridge estimator (\hat{\beta}_0, \hat{\beta}_\text{scaled-ridge}) = \arg\min_{\beta_0,\beta} \sum_{i=1}^n\left(y_{i} - \beta_0 - \bm{x}_i^T\beta\right)^2 + \lambda \sum_{j=1}^p s^2_j \beta_j^2. Consider now the following estimator \hat{\beta}_\text{ridge} = \arg\min_{\beta} \sum_{i=1}^n\left\{(y_{i} - \bar{y}) - \sum_{j=1}^p\left(\frac{x_{ij} - \bar{x}_j}{s_j}\right)\beta_j\right\}^2 + \lambda \sum_{j=1}^p\beta_j^2.

As mentioned in this slide, show that the coefficients of ridge regression, expressed in the original scale, are \hat{\beta}_0 = \bar{y} - \bar{\bm{x}}^T\hat{\beta}_\text{scaled-ridge}, \qquad \hat{\beta}_\text{scaled-ridge} = \text{diag}(1 / s_1,\dots, 1/s_p) \hat{\beta}_\text{ridge}.
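
A numerical check of Part I may help fix ideas (a sketch on simulated data; the unpenalized-intercept normal equations below are one way of translating the first criterion into matrix form, not part of the exercise statement).

# Ridge with an unpenalized intercept vs. ridge on centered data + recovered intercept
set.seed(1)
n <- 100; p <- 3; lambda <- 2
X <- matrix(rnorm(n * p, mean = 5), n, p)
X <- sweep(X, 2, apply(X, 2, sd), "/")                  # same variance, different means
y <- 1 + X %*% c(1, -1, 0.5) + rnorm(n)

Xa <- cbind(1, X)                                       # add the intercept column
P  <- diag(c(0, rep(1, p)))                             # the intercept is not penalized
coef_full <- solve(crossprod(Xa) + lambda * P, crossprod(Xa, y))

Xc <- scale(X, scale = FALSE); yc <- y - mean(y)        # centered data
beta_ridge <- solve(crossprod(Xc) + lambda * diag(p), crossprod(Xc, yc))
beta0 <- mean(y) - drop(colMeans(X) %*% beta_ridge)

max(abs(coef_full - c(beta0, beta_ridge)))              # ~ 0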

C.3 - Ridge estimator

Verify the correctness of the statement of this slide. In other words, show that the ridge estimator, under standard centering and scaling assumptions and defined as \hat{\beta}_\text{ridge} = (\bm{X}^T\bm{X} + \lambda I_p)^{-1}\bm{X}^T\bm{y}, is the minimizer of the penalized loss function \sum_{i=1}^n(y_{i} - \bm{x}_{i}^T\beta)^2 + \lambda \sum_{j=1}^p\beta_j^2. Hint: the proof is a simple adaptation of the one for ordinary least squares.

To be added. Refer to the notes available on e-learning.
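
As a quick numerical check (a sketch, not the requested proof), the closed-form expression can be compared with the output of a generic numerical minimizer of the penalized loss; the simulated data below are illustrative.

set.seed(2)
n <- 60; p <- 4; lambda <- 3
X <- scale(matrix(rnorm(n * p), n, p))                  # centered and scaled predictors
y <- as.numeric(scale(X %*% rnorm(p) + rnorm(n), scale = FALSE))  # centered response

beta_closed <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

pen_loss <- function(beta) sum((y - X %*% beta)^2) + lambda * sum(beta^2)
beta_num <- optim(rep(0, p), pen_loss, method = "BFGS")$par

max(abs(beta_closed - beta_num))                        # small numerical error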

C.4 - Ridge data-augmentation

Show that the ridge regression estimator \hat{\beta}_\text{ridge} can be obtained by ordinary least squares regression on an augmented data set.

We augment the centered matrix \bm{X} with p additional rows \sqrt{\lambda} I_p and augment \bm{y} with p zeros, namely we consider the augmented dataset \tilde{\bm{X}} = \begin{pmatrix} \bm{X}\\ \sqrt{\lambda}I_p\\ \end{pmatrix}, \qquad \tilde{\bm{y}} = \begin{pmatrix}\bm{y} \\ 0_p\end{pmatrix}. By introducing artificial data having response value zero, the fitting procedure is forced to shrink the coefficients towards zero. Show that \hat{\beta}_\text{ridge} = (\tilde{\bm{X}}^T\tilde{\bm{X}})^{-1} \tilde{\bm{X}}^T\tilde{\bm{y}}.

To be added. Refer to the notes available on e-learning.
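
Again, a numerical check in R (a sketch with simulated data): ordinary least squares on the augmented data reproduces the closed-form ridge estimate of Exercise C.3.

set.seed(3)
n <- 80; p <- 5; lambda <- 4
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.numeric(scale(X %*% rnorm(p) + rnorm(n), scale = FALSE))

X_aug <- rbind(X, sqrt(lambda) * diag(p))               # p artificial rows
y_aug <- c(y, rep(0, p))                                # with response equal to zero

beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
beta_aug   <- coef(lm(y_aug ~ X_aug - 1))               # OLS on the augmented data
max(abs(beta_ridge - beta_aug))                         # ~ 0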

C.5 - Ridge regression with identical variables

Suppose we run a ridge regression with parameter \lambda on a single variable \tilde{\bm{x}}_1 (centered and scaled) and get a coefficient \hat{\beta}_1. We now include an exact copy \tilde{\bm{x}}_2 = \tilde{\bm{x}}_1 and refit our ridge regression.

Show that both coefficients are identical and derive their value.
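
Before attempting the derivation, you may want to see the phenomenon numerically (a sketch using the closed-form ridge estimator of Exercise C.3 on simulated data; seed and sample size are arbitrary).

set.seed(4)
n <- 100; lambda <- 1
x1 <- as.numeric(scale(rnorm(n)))                       # centered and scaled variable
y  <- as.numeric(scale(2 * x1 + rnorm(n), scale = FALSE))

beta_single <- solve(crossprod(x1) + lambda, crossprod(x1, y))   # single-variable ridge

X2 <- cbind(x1, x1)                                     # exact copy of the variable
beta_double <- solve(crossprod(X2) + lambda * diag(2), crossprod(X2, y))

beta_single                                             # coefficient with one variable
beta_double                                             # the two coefficients coincide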

C.6 - Lasso solution with a single predictor

Consider the lasso problem with a single predictor \hat{\beta}_\text{lasso} = \arg\min_{\beta}\frac{1}{2n}\sum_{i=1}^n(y_{i} - x_{i}\beta)^2 + \lambda |\beta|. Show that \hat{\beta}_\text{lasso} has an explicit expression, which is \hat{\beta}_\text{lasso} = \begin{cases} \text{cov}(x,y) - \lambda, \qquad &\text{if} \quad \text{cov}(x,y) > \lambda \\ 0, \qquad &\text{if} \quad |\text{cov}(x,y)| \le \lambda\\ \text{cov}(x,y) + \lambda, \qquad &\text{if} \quad \text{cov}(x,y) < -\lambda. \\ \end{cases} Show in addition that \hat{\beta}_\text{ridge} = \frac{1}{\lambda + 1}\text{cov}(x,y) =\frac{1}{\lambda + 1}\hat{\beta}_\text{ols} = \frac{1}{\lambda + 1}\frac{1}{n}\sum_{i=1}^n x_{i}y_{i}. Note. In ridge regression, you need to include an n^{-1} scaling factor in the penalized loss: this makes the values of \lambda comparable between ridge and lasso. More precisely, you need to consider the ridge solution as a special case of this equation with \alpha = 0.
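
A numerical check of the lasso expression (a sketch; the formula above implicitly requires the predictor to be standardized so that n^{-1}\sum_{i=1}^n x_i^2 = 1, which is imposed below, and the simulated data are illustrative).

set.seed(5)
n <- 200; lambda <- 0.1
x <- rnorm(n); x <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))  # so that mean(x^2) = 1
y <- 0.4 * x + rnorm(n); y <- y - mean(y)

cxy <- mean(x * y)                                      # cov(x, y) in the notation above
beta_soft <- sign(cxy) * max(abs(cxy) - lambda, 0)      # soft-thresholding expression

pen_loss <- function(b) sum((y - x * b)^2) / (2 * n) + lambda * abs(b)
beta_num <- optimize(pen_loss, interval = c(-10, 10))$minimum

c(beta_soft, beta_num)                                  # essentially equal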

C.7 - The tied covariances in LAR

Consider a regression problem with k variables \tilde{\bm{x}}_j having mean zero and standard deviation one. The response variable is centered as well. Moreover, suppose that each variable has identical absolute covariance with the response: \left|\text{cov}(\tilde{\bm{x}}_j, \bm{y})\right| = \left|\frac{1}{n}\sum_{i=1}^n x_{ij} y_i \right|= \lambda_{k-1}, \qquad j=1,\dots,k. Let \hat{\beta} be the least squares estimate and let us denote with r_i(\alpha) = y_i - \alpha \bm{x}_i^T\hat{\beta} the residuals obtained after moving a fraction \alpha \in [0, 1] toward the least squares fit.

Prove that the following relationship holds: \left|\text{cov}(\tilde{\bm{x}}_j, \bm{r}(\alpha))\right| = \left|\frac{1}{n}\sum_{i=1}^n x_{ij}\, r_i(\alpha)\right| = \lambda_{k-1} (1 - \alpha), \qquad j=1,\dots,k, implying that the magnitude of the covariances with the residuals remains tied across the covariates, decreasing linearly as we progress toward the least squares fit \bm{x}_i^T\hat{\beta}.

Deduce that this property implies that the covariances with the residuals in the LAR algorithm are tied, by setting \alpha = (\lambda_{k-1} - \lambda) / \lambda_{k-1} with \lambda_k \le \lambda \le \lambda_{k-1}.

Hint. This problem is easier if you use matrix notation, that is, considering the covariances n^{-1}\bm{X}^T\bm{r}(\alpha) and the partial residuals \bm{r}(\alpha) = \bm{y} - \alpha\bm{X}\hat{\beta}.

To be added. Refer to the notes available on e-learning.
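
A numerical illustration (a sketch): the construction below generates a response whose covariance with each column of \bm{X} is exactly \lambda_{k-1}, and then checks that the covariances with \bm{r}(\alpha) shrink by the factor 1 - \alpha while remaining tied. The construction of the response and the chosen constants are illustrative.

set.seed(6)
n <- 200; k <- 3; lam <- 0.5; alpha <- 0.3
X <- scale(matrix(rnorm(n * k), n, k))
eps <- residuals(lm(rnorm(n) ~ X))                      # noise orthogonal to the columns of X
y <- X %*% solve(crossprod(X) / n, rep(lam, k)) + eps
drop(crossprod(X, y) / n)                               # tied covariances, all equal to lam

beta_hat <- solve(crossprod(X), crossprod(X, y))
r <- y - alpha * X %*% beta_hat                         # partial residuals r(alpha)
drop(crossprod(X, r) / n)                               # all equal to lam * (1 - alpha)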

C.8 - Ridge and lasso with orthogonal predictors

When the predictors are mutually orthogonal, lasso and ridge become simpler. Let \bm{Z} = (\tilde{\bm{z}}_1,\dots,\tilde{\bm{z}}_p) be the design matrix and suppose that \bm{Z} is orthogonal and standardized, which means that \bm{Z}^T\bm{Z} = I_p. Moreover, suppose the predictors and the response have been centered, that is \sum_{i=1}^ny_i = \sum_{i=1}^n z_{ij} = 0.

  • Find an explicit expression for \hat{\beta}_\text{ridge}.

  • Find an explicit expression for \hat{\beta}_\text{lasso}.

Discuss the results.
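
A numerical illustration is given below (a sketch): \bm{Z} is built with orthonormal, centered columns, ridge is computed from the closed form of Exercise C.3, and the lasso solution is obtained coordinate by coordinate under the criterion \frac{1}{2}\sum_{i=1}^n(y_i - \bm{z}_i^T\beta)^2 + \lambda\sum_{j=1}^p|\beta_j|; with a different scaling of the loss the threshold changes accordingly.

set.seed(7)
n <- 100; p <- 4; lambda <- 0.8
Xc <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)
Z  <- qr.Q(qr(Xc))                                      # orthonormal columns, still centered
y  <- drop(Z %*% c(3, -2, 0.1, 1) + rnorm(n, sd = 0.2)); y <- y - mean(y)

beta_ols   <- crossprod(Z, y)                           # since Z'Z = I_p
beta_ridge <- solve(crossprod(Z) + lambda * diag(p), crossprod(Z, y))
cbind(beta_ridge, beta_ols / (1 + lambda))              # ridge = uniform shrinkage of OLS

soft  <- function(b, t) sign(b) * pmax(abs(b) - t, 0)   # soft-thresholding operator
pen_j <- function(b, zy) 0.5 * b^2 - b * zy + lambda * abs(b)    # separable lasso criterion
beta_lasso <- sapply(beta_ols, function(zy) optimize(pen_j, c(-10, 10), zy = zy)$minimum)
cbind(soft(beta_ols, lambda), beta_lasso)               # lasso = soft-thresholded OLS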

C.9 - ☠️ Support vector machine as a ridge minimization problem

Consider this exercise only if you already know Support Vector Machines (SVM). In SVMs, we seek the optimal values \hat{\beta}_0, \hat{\beta} that solve the following quadratic programming problem: \min_{\beta_0, \beta} \frac{1}{2}\sum_{j=1}^p\beta_j^2 + C \sum_{i=1}^n\xi_i, \quad \text{subject to}\quad \xi_i \ge 0, \quad y_i(\beta_0 + \bm{x}_i^T\beta) \ge 1 - \xi_i, \quad \forall i, where C > 0 is a cost parameter and y_i\in\{-1, 1\} is the categorical response. The SVM classifier is then obtained as \hat{G}(x) = \text{sign}(\hat{f}(x)), with \hat{f}(x) = \hat{\beta}_0 + \bm{x}^T\hat{\beta}.

Show that the SVM estimates correspond to the solution of the following ridge problem (\hat{\beta}_0,\hat{\beta}) = \arg\min_{\beta_0,\beta}\: \underbrace{\sum_{i=1}^n[1 - y_i(\beta_0 + \bm{x}_i^T\beta)]_+}_{\text{hinge loss}} + \underbrace{\frac{\lambda}{2}\sum_{j=1}^p\beta_j^2}_{\text{ridge penalty}}, in which \lambda = 1/C. Hence, SVMs are closely related to logistic regression with responses y_i \in \{-1,1\}, in which the negative log-likelihood \ell(\beta_0,\beta; \bm{y}) = \sum_{i=1}^n\log[1 + \exp\{-y_i (\beta_0 + \bm{x}_i^T\beta)\}], is replaced by the hinge loss \ell_\text{hinge}(\beta_0,\beta; \bm{y}) = \sum_{i=1}^n[1 - y_i(\beta_0 + \bm{x}_i^T\beta)]_+. A detailed discussion is offered in Section 12.3.2 of the textbook The Elements of Statistical Learning.
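
The equivalence can also be explored numerically. The sketch below minimizes the hinge-plus-ridge criterion with a generic optimizer and compares the result with a linear SVM fitted via e1071::svm; the use of e1071, the extraction of (\hat{\beta}_0, \hat{\beta}) from its output, and the simulated data are assumptions of this sketch, not part of the exercise, and the agreement is only approximate because the optimizer is generic.

library(e1071)                                          # assumed available; not required by the exercise
set.seed(8)
n <- 100
X <- matrix(rnorm(n * 2), n, 2)
y <- ifelse(X[, 1] - X[, 2] + rnorm(n, sd = 0.5) > 0, 1, -1)

C_cost <- 1
fit_svm <- svm(X, factor(y, levels = c(1, -1)), kernel = "linear", cost = C_cost, scale = FALSE)
beta_svm  <- drop(t(fit_svm$coefs) %*% fit_svm$SV)      # linear weight vector
beta0_svm <- -fit_svm$rho
# libsvm may return the weights with the opposite sign, depending on the internal class
# ordering; flip them if the implied classifier disagrees with the observed labels
if (mean(sign(beta0_svm + X %*% beta_svm) == y) < 0.5) { beta_svm <- -beta_svm; beta0_svm <- -beta0_svm }

hinge_ridge <- function(par, lambda) {
  f <- par[1] + X %*% par[-1]
  sum(pmax(1 - y * f, 0)) + (lambda / 2) * sum(par[-1]^2)
}
fit_num <- optim(c(0, 0, 0), hinge_ridge, lambda = 1 / C_cost,
                 method = "Nelder-Mead", control = list(maxit = 5000, reltol = 1e-10))

rbind(svm = c(beta0_svm, beta_svm), hinge_ridge = fit_num$par)   # approximately equal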

Practical exercises

C.10 - The Hitters dataset

Consider the Hitters dataset, which is available in the ISLR R library. After removing the missing values, consider regression models to predict Salary as a function of the available covariates.

Use best subset selection, principal components regression, ridge regression, and the lasso to handle the presence of potentially irrelevant variables.
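
A possible starting point in R is sketched below; the packages (leaps, pls, glmnet) and the cross-validation choices are one option among many and are not prescribed by the exercise.

library(ISLR); library(leaps); library(pls); library(glmnet)

Hitters <- na.omit(Hitters)                             # remove players with missing Salary
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]     # design matrix without the intercept
y <- Hitters$Salary

# Best subset selection, compared e.g. through BIC
best_sub <- regsubsets(Salary ~ ., data = Hitters, nvmax = ncol(x))
which.min(summary(best_sub)$bic)

# Principal components regression, number of components chosen by cross-validation
fit_pcr <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")
validationplot(fit_pcr, val.type = "MSEP")

# Ridge and lasso, with lambda chosen by cross-validation
fit_ridge <- cv.glmnet(x, y, alpha = 0)
fit_lasso <- cv.glmnet(x, y, alpha = 1)
coef(fit_lasso, s = "lambda.min")                       # several coefficients are set to zero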

C.11 - Implementation of pathwise coordinate optimization

Implement the pathwise coordinate optimization algorithm that is described in this slide. Use it to predict the Salary of the baseball players on the Hitters dataset.
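
One possible implementation is sketched below, under the 1/(2n) loss scaling of Exercise C.6 and with standardized predictors; the convergence criterion, the grid of \lambda values and the use of warm starts are implementation choices that may differ from the algorithm described in the slides.

soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)    # soft-thresholding operator

# Coordinate descent for the lasso at a single value of lambda
lasso_cd <- function(X, y, lambda, beta = rep(0, ncol(X)), tol = 1e-8, maxit = 1000) {
  xsq <- colMeans(X^2)                                  # (1/n) sum_i x_ij^2
  r <- y - X %*% beta                                   # current residuals
  for (it in seq_len(maxit)) {
    beta_old <- beta
    for (j in seq_len(ncol(X))) {
      rho <- mean(X[, j] * r) + xsq[j] * beta[j]        # (1/n) x_j' (partial residual)
      beta_new <- soft(rho, lambda) / xsq[j]
      r <- r - X[, j] * (beta_new - beta[j])            # cheap residual update
      beta[j] <- beta_new
    }
    if (max(abs(beta - beta_old)) < tol) break
  }
  beta
}

# Pathwise version: decreasing grid of lambda values, each solution used as a warm start
lasso_path <- function(X, y, lambdas) {
  beta <- rep(0, ncol(X))
  sapply(sort(lambdas, decreasing = TRUE), function(l) beta <<- lasso_cd(X, y, l, beta))
}

# Application to the Hitters data
library(ISLR)
Hitters <- na.omit(Hitters)
X <- scale(model.matrix(Salary ~ ., data = Hitters)[, -1])   # standardized predictors
y <- Hitters$Salary - mean(Hitters$Salary)                   # centered response
lambda_max <- max(abs(crossprod(X, y) / nrow(X)))            # smallest lambda with all zeros
lambdas <- exp(seq(log(lambda_max), log(0.01 * lambda_max), length.out = 50))
beta_path <- lasso_path(X, y, lambdas)                       # one column per lambda value
matplot(log(sort(lambdas, decreasing = TRUE)), t(beta_path), type = "l",
        xlab = "log(lambda)", ylab = "coefficients")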