Exercises C

Data mining - CdL CLAMSES

Author: Tommaso Rigon

Affiliation: Università degli Studi di Milano-Bicocca

The theoretical exercises described below could be quite difficult. At the exam, you can expect a simplified version of them; otherwise, they would represent a formidable challenge for most of you.

Theoretical exercises

Show that when the number of principal components is k = p, the predicted values of PCR and ordinary least squares coincide, namely \bm{X}\hat{\beta}_\text{ols} = \bm{Z}\hat{\gamma}_\text{pcr}, where the definition of \bm{Z} and \hat{\gamma} = (\hat{\gamma}_1,\dots, \hat{\gamma}_p) is given in this slide. Moreover, show that when k = p, \hat{\beta}_\text{ols} = \hat{\beta}_\text{pcr}, where \hat{\beta}_\text{pcr} has been defined in this slide.
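A quick numerical sanity check of this claim (not a proof) can be sketched in plain Python for p = 2. The toy data are made up, and the code assumes the usual construction, which may differ in notation from the slide: \bm{Z} = \bm{X}V with V the eigenvectors of \bm{X}^T\bm{X}, \hat{\gamma}_j = \tilde{\bm{z}}_j^T\bm{y} / \tilde{\bm{z}}_j^T\tilde{\bm{z}}_j, and \hat{\beta}_\text{pcr} = V\hat{\gamma}.

```python
import math

# Toy centered data with p = 2 (made up for illustration).
X = [[-1.5, 0.5], [-0.5, -1.0], [0.5, 1.5], [1.5, -1.0]]
y = [-2.0, 0.5, 2.5, -1.0]

# Entries of the symmetric matrix X^T X = [[a, b], [b, d]] and of X^T y = (u, v).
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
d = sum(r[1] * r[1] for r in X)
u = sum(r[0] * yi for r, yi in zip(X, y))
v = sum(r[1] * yi for r, yi in zip(X, y))

# Eigenvectors of a symmetric 2x2 matrix from the rotation angle theta,
# where tan(2 * theta) = 2b / (a - d). The columns of V are the loadings.
theta = 0.5 * math.atan2(2 * b, a - d)
V = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

# Principal components Z = X V (all k = p = 2 of them).
Z = [[r[0] * V[0][k] + r[1] * V[1][k] for k in range(2)] for r in X]

# The columns of Z are orthogonal, so each gamma_k is a simple regression.
gamma = [sum(z[k] * yi for z, yi in zip(Z, y)) / sum(z[k] ** 2 for z in Z)
         for k in range(2)]

# PCR coefficients on the original scale, and the PCR fitted values.
beta_pcr = [V[j][0] * gamma[0] + V[j][1] * gamma[1] for j in range(2)]
fit_pcr = [z[0] * gamma[0] + z[1] * gamma[1] for z in Z]

# OLS via Cramer's rule on the 2x2 normal equations.
det = a * d - b * b
beta_ols = [(d * u - b * v) / det, (a * v - b * u) / det]
fit_ols = [r[0] * beta_ols[0] + r[1] * beta_ols[1] for r in X]

print(beta_ols, beta_pcr)
```

The two coefficient vectors and the two sets of fitted values should agree up to floating-point error.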

Part I

Suppose the covariates \tilde{\bm{x}}_j were scaled (same variance) but not centered (different means). Moreover, suppose the response y was not centered. Consider the following estimator (\hat{\beta}_0, \hat{\beta}_\text{ridge}) = \arg\min_{\beta_0,\beta} \sum_{i=1}^n(y_{i} - \beta_0 - \bm{x}_{i}^T\beta)^2 + \lambda \sum_{j=1}^p\beta_j^2. Show that \hat{\beta}_\text{ridge} can be equivalently obtained using the centered data, that is \hat{\beta}_\text{ridge} = \arg\min_{\beta} \sum_{i=1}^n\{(y_{i} - \bar{y}) - (\bm{x}_{i} - \bar{\bm{x}})^T\beta\}^2 + \lambda \sum_{j=1}^p\beta_j^2, and that \hat{\beta}_0 = \bar{y} - \bar{\bm{x}}^T\hat{\beta}_\text{ridge}.
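The statement can be checked numerically before attempting the proof. The following is a minimal sketch with a single covariate (p = 1) and made-up data; the function names are hypothetical. It solves the 2x2 normal equations of the problem with an unpenalized intercept and compares the slope with the one obtained from the centered data.

```python
# Made-up data with a single, non-centered covariate.
x = [1.0, 2.0, 4.0, 7.0, 8.0]
y = [2.0, 1.5, 5.0, 6.5, 9.0]
lam = 3.0

def ridge_with_intercept(x, y, lam):
    """Minimize sum (y_i - b0 - x_i * b)^2 + lam * b^2 over (b0, b)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Normal equations: n*b0 + sx*b = sy and sx*b0 + (sxx + lam)*b = sxy,
    # solved here by Cramer's rule.
    det = n * (sxx + lam) - sx * sx
    b0 = (sy * (sxx + lam) - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return b0, b

def ridge_centered(x, y, lam):
    """Minimize sum {(y_i - ybar) - (x_i - xbar) * b}^2 + lam * b^2 over b."""
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x) + lam
    return num / den

b0, b = ridge_with_intercept(x, y, lam)
b_centered = ridge_centered(x, y, lam)
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
print(b, b_centered)            # the two slopes should coincide
print(b0, ybar - xbar * b)      # and so should the two intercepts
```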

Part II

Suppose the covariates \tilde{\bm{x}}_j were not scaled (different variances s_j^2) and not centered (different means). Moreover, suppose the response y was not centered. Consider the following scaled-ridge estimator (\hat{\beta}_0, \hat{\beta}_\text{scaled-ridge}) = \arg\min_{\beta_0,\beta} \sum_{i=1}^n\left(y_{i} - \beta_0 - \bm{x}_i^T\beta\right)^2 + \lambda \sum_{j=1}^p s^2_j \beta_j^2. Consider now the following estimator \hat{\beta}_\text{ridge} = \arg\min_{\beta} \sum_{i=1}^n\left\{(y_{i} - \bar{y}) - \sum_{j=1}^p\left(\frac{x_{ij} - \bar{x}_j}{s_j}\right)\beta_j\right\}^2 + \lambda \sum_{j=1}^p\beta_j^2.

As mentioned in this slide, show that the coefficients of ridge regression, expressed in the original scale, are \hat{\beta}_0 = \bar{y} - \bar{\bm{x}}^T\hat{\beta}_\text{scaled-ridge}, \qquad \hat{\beta}_\text{scaled-ridge} = \text{diag}(1 / s_1,\dots, 1/s_p) \hat{\beta}_\text{ridge}.

Show that the ridge regression estimator \hat{\beta}_\text{ridge} can be obtained by ordinary least squares regression on an augmented data set.

We augment the centered matrix \bm{X} with p additional rows \sqrt{\lambda} I_p and augment \bm{y} with p zeros, namely we consider the augmented dataset \tilde{\bm{X}} = \begin{pmatrix} \bm{X}\\ \sqrt{\lambda}I_p\\ \end{pmatrix}, \qquad \tilde{\bm{y}} = \begin{pmatrix}\bm{y} \\ 0_p\end{pmatrix}. By introducing artificial data having response value zero, the fitting procedure is forced to shrink the coefficients towards zero. Show that \hat{\beta}_\text{ridge} = (\tilde{\bm{X}}^T\tilde{\bm{X}})^{-1} \tilde{\bm{X}}^T\tilde{\bm{y}}.
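As a sanity check of the identity (again, not a proof), the sketch below compares the direct ridge solution with ordinary least squares on the augmented data for p = 2. The data and the helper `gram_solve_2` are made up for illustration; the helper solves (\bm{X}^T\bm{X} + \text{ridge}\, I_2)\beta = \bm{X}^T\bm{y} by Cramer's rule.

```python
import math

def gram_solve_2(X, y, ridge=0.0):
    """Solve (X^T X + ridge * I) beta = X^T y for a two-column matrix X."""
    a = sum(r[0] * r[0] for r in X) + ridge
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + ridge
    u = sum(r[0] * yi for r, yi in zip(X, y))
    v = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    return ((d * u - b * v) / det, (a * v - b * u) / det)

# Toy centered data (columns of X and y sum to zero).
X = [[-1.5, 0.5], [-0.5, -1.0], [0.5, 1.5], [1.5, -1.0]]
y = [-2.0, 0.5, 2.5, -1.0]
lam = 2.0

# Direct ridge solution (X^T X + lam * I)^{-1} X^T y.
beta_ridge = gram_solve_2(X, y, ridge=lam)

# Augmented data: p extra rows sqrt(lam) * I_p and p extra zeros in y.
root = math.sqrt(lam)
X_aug = X + [[root, 0.0], [0.0, root]]
y_aug = y + [0.0, 0.0]

# Plain least squares (no penalty) on the augmented data.
beta_aug = gram_solve_2(X_aug, y_aug)

print(beta_ridge, beta_aug)
```

The extra rows contribute exactly \lambda I_p to the Gram matrix and nothing to the cross-product with the response, which is why the two solutions agree.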

Suppose we run a ridge regression with parameter \lambda on a single variable \tilde{\bm{x}}_1 (centered and scaled) and get a coefficient \hat{\beta}_1. We now include an exact copy \tilde{\bm{x}}_2 = \tilde{\bm{x}}_1 and refit our ridge regression.

Show that both coefficients are identical and derive their value.

Consider the lasso problem with a single predictor, centered and standardized so that n^{-1}\sum_{i=1}^n x_i = 0 and n^{-1}\sum_{i=1}^n x_i^2 = 1, \hat{\beta}_\text{lasso} = \arg\min_{\beta}\frac{1}{2n}\sum_{i=1}^n(y_{i} - x_{i}\beta)^2 + \lambda |\beta|. Show that \hat{\beta}_\text{lasso} has an explicit expression, which is \hat{\beta}_\text{lasso} = \begin{cases} \text{cov}(x,y) - \lambda, \qquad &\text{if} \quad \text{cov}(x,y) > \lambda \\ 0, \qquad &\text{if} \quad |\text{cov}(x,y)| \le \lambda\\ \text{cov}(x,y) + \lambda, \qquad &\text{if} \quad \text{cov}(x,y) < -\lambda \end{cases} where \text{cov}(x,y) = \frac{1}{n}\sum_{i=1}^n x_{i}y_{i}. Show in addition that \hat{\beta}_\text{ridge} = \frac{1}{\lambda + 1}\text{cov}(x,y) =\frac{1}{\lambda + 1}\hat{\beta}_\text{ols} = \frac{1}{\lambda + 1}\frac{1}{n}\sum_{i=1}^n x_{i}y_{i}.

Note. In ridge regression, you need to include an n^{-1} scaling factor in the penalized loss: this makes the values of \lambda comparable between ridge and lasso. More precisely, you need to consider the ridge solution as a special case of this equation with \alpha = 0.
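The closed-form expressions can be compared with a brute-force minimization over a fine grid. This is a minimal sketch on made-up data in which the predictor satisfies n^{-1}\sum_i x_i = 0 and n^{-1}\sum_i x_i^2 = 1. The ridge loss used here, (2n)^{-1}\sum_i (y_i - x_i\beta)^2 + (\lambda/2)\beta^2, is an assumption: it is the scaling that reproduces the factor 1/(\lambda + 1) stated above.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator, i.e. the piecewise expression above."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Made-up data: x has mean 0 and mean square 1.
x = [-1.0, -1.0, 1.0, 1.0]
y = [-1.2, -0.4, 0.3, 1.3]
n = len(x)
lam = 0.3
cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / n

def lasso_loss(beta):
    rss = sum((yi - xi * beta) ** 2 for xi, yi in zip(x, y))
    return rss / (2 * n) + lam * abs(beta)

def ridge_loss(beta):
    # Assumed scaling of the ridge penalty (see the lead-in).
    rss = sum((yi - xi * beta) ** 2 for xi, yi in zip(x, y))
    return rss / (2 * n) + 0.5 * lam * beta ** 2

# Brute-force minimizers over a grid with step 0.001.
grid = [k / 1000 for k in range(-2000, 2001)]
lasso_grid = min(grid, key=lasso_loss)
ridge_grid = min(grid, key=ridge_loss)

# Closed-form candidates.
lasso_closed = soft_threshold(cov_xy, lam)
ridge_closed = cov_xy / (1 + lam)

print(lasso_closed, lasso_grid)
print(ridge_closed, ridge_grid)
```

Each grid minimizer should agree with the corresponding closed-form value up to the grid resolution. The lasso sets the coefficient exactly to zero once \lambda exceeds |\text{cov}(x,y)|, whereas the ridge estimate only shrinks proportionally.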

When the predictors are mutually orthogonal, lasso and ridge become simpler. Let \bm{Z} = (\tilde{\bm{z}}_1,\dots,\tilde{\bm{z}}_p) be the design matrix and suppose that \bm{Z} is orthogonal and standardized, which means that \bm{Z}^T\bm{Z} = I_p. Moreover, suppose the predictors and the response have been centered, that is \sum_{i=1}^ny_i = \sum_{i=1}^n z_{ij} = 0.

  • Find an explicit expression for \hat{\beta}_\text{ridge}.

  • Find an explicit expression for \hat{\beta}_\text{lasso}.

Discuss the results.

Consider this exercise only if you already know Support Vector Machines (SVM). In SVMs, we seek the optimal values \hat{\beta}_0, \hat{\beta} that solve the following quadratic programming problem: \min_{\beta_0, \beta} \frac{1}{2}\sum_{j=1}^p\beta_j^2 + C \sum_{i=1}^n\xi_i, \quad \text{subject to}\quad \xi_i \ge 0, \quad y_i(\beta_0 + \bm{x}_i^T\beta) \ge 1 - \xi_i, \quad \forall i, where C > 0 is a cost parameter and y_i\in\{-1, 1\} is the binary response. The SVM classifier is then obtained as \hat{G}(\bm{x}) = \text{sign}(\hat{f}(\bm{x})), with \hat{f}(\bm{x}) = \hat{\beta}_0 + \bm{x}^T\hat{\beta}.

Show that the SVM estimates correspond to the solution of the following penalized problem (\hat{\beta}_0,\hat{\beta}) = \arg\min_{\beta_0,\beta}\: \underbrace{\sum_{i=1}^n[1 - y_i(\beta_0 + \bm{x}_i^T\beta)]_+}_{\text{hinge loss}} + \underbrace{\lambda\sum_{j=1}^p\beta_j^2}_{\text{ridge penalty}}, in which \lambda = 1/(2C). Hence, SVMs are highly similar to a penalized logistic regression with responses y_i \in \{-1,1\}, in which the negative log-likelihood \ell(\beta_0,\beta; \bm{y}) = \sum_{i=1}^n\log[1 + \exp\{-y_i (\beta_0 + \bm{x}_i^T\beta)\}] has been replaced by the hinge loss \ell_\text{hinge}(\beta_0,\beta; \bm{y}) = \sum_{i=1}^n[1 - y_i(\beta_0 + \bm{x}_i^T\beta)]_+. A detailed discussion is offered in Section 12.3.2 of the textbook The Elements of Statistical Learning.
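To see how close the two losses are, it helps to evaluate both as functions of the margin m_i = y_i(\beta_0 + \bm{x}_i^T\beta). The short sketch below does this for a few made-up margin values; the function names are hypothetical. Note also that, for fixed (\beta_0, \beta), the smallest slack \xi_i compatible with both constraints of the quadratic program is precisely the hinge term [1 - m_i]_+.

```python
import math

def hinge(margin):
    """Hinge loss [1 - m]_+, also the smallest feasible slack for margin m."""
    return max(0.0, 1.0 - margin)

def logistic(margin):
    """Logistic loss log(1 + exp(-m)) for a response coded as -1 / +1."""
    return math.log1p(math.exp(-margin))

# Made-up margins: a badly misclassified point, a point on the decision
# boundary, a point exactly on the margin, and a confidently correct one.
margins = [-2.0, 0.0, 1.0, 3.0]
for m in margins:
    print(m, hinge(m), logistic(m))
```

Both losses decrease in the margin and grow roughly linearly for badly misclassified points. The hinge loss is exactly zero for m \ge 1, while the logistic loss is strictly positive everywhere.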

Coding exercises

Consider the Hitters dataset, which is available in the ISLR R library. Having removed the missing values, consider a regression model to predict the Salary as a function of the available covariates.

Use best subset, principal components, ridge regression, and lasso to handle the presence of potentially irrelevant variables.

Implement the pathwise coordinate optimization algorithm that is described in this slide. Use it to predict the Salary of the baseball players on the Hitters dataset.
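As a starting point, here is a minimal sketch of one common form of pathwise coordinate descent for the lasso. It is an assumption that this matches the slide: it uses the loss (2n)^{-1}\sum_i (y_i - \bm{x}_i^T\beta)^2 + \lambda\sum_j|\beta_j|, cyclic updates by soft-thresholding, and warm starts along a decreasing grid of \lambda values. The function names and the toy data are made up; applying it to Hitters still requires removing missing values, encoding the categorical covariates, and centering the data.

```python
def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_path(X, y, lambdas, n_sweeps=200, tol=1e-10):
    """Lasso solutions along a grid of penalties, for centered X (rows) and y."""
    n, p = len(X), len(X[0])
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    resid = list(y)  # residuals y - X beta for beta = 0
    path = []
    for lam in sorted(lambdas, reverse=True):  # warm start from previous lam
        for _ in range(n_sweeps):
            max_change = 0.0
            for j in range(p):
                # Mean cross-product of x_j with the partial residual,
                # i.e. the residual with the j-th contribution added back.
                rho = sum(X[i][j] * resid[i] for i in range(n)) / n
                rho += col_ms[j] * beta[j]
                new = soft_threshold(rho, lam) / col_ms[j]
                delta = new - beta[j]
                if delta != 0.0:
                    for i in range(n):
                        resid[i] -= X[i][j] * delta
                    beta[j] = new
                max_change = max(max_change, abs(delta))
            if max_change < tol:
                break
        path.append((lam, list(beta)))
    return path

# Toy centered data (made up); not the Hitters dataset.
X = [[-1.5, 0.5], [-0.5, -1.0], [0.5, 1.5], [1.5, -1.0]]
y = [-2.0, 0.5, 2.5, -1.0]
path = lasso_path(X, y, [1.0, 0.5, 0.1, 0.0])
for lam, beta in path:
    print(lam, beta)
```

Starting from the largest \lambda, where all coefficients are zero, and reusing each solution as the initial value for the next \lambda is what makes the procedure pathwise. With \lambda = 0 the iterations converge to the least squares fit.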