Exercises A

Data mining - CdL CLAMSES

Author: Tommaso Rigon

Affiliation: Università degli Studi di Milano-Bicocca


Theoretical exercises

A.1 - Properties of the projection matrix \bm{H}

Suppose the n\times p design matrix \bm{X} has full rank, that is \text{rk}(\bm{X}) = p, with p < n. Prove that \text{rk}(\bm{H}) = \text{tr}(\bm{H}) = p. Moreover, show that \bm{H} = \bm{H}^T and that \bm{H}^2 = \bm{H}. Finally, show that if the intercept is included in the model, then the sum of the elements of each row of \bm{H} (and hence of each column, by symmetry) equals 1, that is \sum_{j=1}^n[\bm{H}]_{ij} = 1, \qquad i=1,\dots,n.

Hint. You may want to look up the properties of projection matrices in your favorite linear algebra textbook, otherwise this exercise becomes quite hard.

Suppose \bm{X} has full rank, that is \text{rk}(\bm{X}) = p. Then the QR decomposition \bm{X} = \bm{Q}\bm{R} holds, with \bm{Q} \in \mathbb{R}^{n \times p} such that \text{rk}(\bm{Q}) = p and \bm{Q}^T\bm{Q} = I_p, and with \bm{R} \in \mathbb{R}^{p \times p} upper triangular and invertible. In particular, since \bm{X}^T\bm{X} = \bm{R}^T\bm{R}, we obtain \bm{H} = \bm{X}(\bm{X}^T\bm{X})^{-1}\bm{X}^T = \bm{Q}\bm{R}(\bm{R}^T\bm{R})^{-1}\bm{R}^T\bm{Q}^T = \bm{Q}\bm{Q}^T, a fact used in the alternative proofs below.


We first show that \bm{H} is symmetric, i.e. \bm{H}^T = \bm{H}. In fact: \bm{H}^T = \left[\bm{X}(\bm{X}^T\bm{X})^{-1}\bm{X}^T\right]^T = (\bm{X}^T)^T\left[(\bm{X}^T\bm{X})^{-1}\right]^T\bm{X}^T = \bm{X}(\bm{X}^T\bm{X})^{-1}\bm{X}^T = \bm{H}, where we used the fact that \left[(\bm{X}^T\bm{X})^{-1}\right]^T =(\bm{X}^T\bm{X})^{-1}, because the inverse of the symmetric matrix \bm{X}^T\bm{X} is itself symmetric.


We now offer two alternative proofs to show that \text{tr}(\bm{H}) = p. The first proof is based on the following property of the trace operator: \text{tr}(\bm{A}\bm{B}) = \text{tr}(\bm{B}\bm{A}). Hence, \text{tr}(\bm{H}) = \text{tr}(\bm{X}(\bm{X}^T\bm{X})^{-1}\bm{X}^T) = \text{tr}\left(\bm{X}^T\bm{X}(\bm{X}^T\bm{X})^{-1}\right) = \text{tr}(I_p) = p. Alternatively, using the QR decomposition: \text{tr}(\bm{H}) = \text{tr}(\bm{Q}\bm{Q}^T) = \text{tr}(\bm{Q}^T\bm{Q}) = \text{tr}(I_p) = p.


The matrix \bm{H} is idempotent, meaning that \bm{H}^2 = \bm{H}. In fact, \bm{H}^2 = \bm{H}\bm{H} = \bm{X}(\bm{X}^T\bm{X})^{-1}\underbrace{\bm{X}^T \bm{X}(\bm{X}^T\bm{X})^{-1}}_{=I_p}\bm{X}^T = \bm{X}(\bm{X}^T\bm{X})^{-1}\bm{X}^T = \bm{H}. Alternatively, using the QR decomposition: \bm{H}^2 = \bm{H}\bm{H} = \bm{Q}\underbrace{\bm{Q}^T\bm{Q}}_{=I_p}\bm{Q}^T = \bm{Q}\bm{Q}^T = \bm{H}.


We now show that the rank of \bm{H} is p. To do so, we consider the spectral decomposition of \bm{H}. Suppose \lambda is an eigenvalue of the symmetric matrix \bm{H}, which means there is an (eigen-)vector \bm{x} \in \mathbb{R}^n, with \bm{x} \neq \bm{0}_n, such that \bm{H}\bm{x} = \lambda \bm{x}. We now exploit the fact that \bm{H} is idempotent: \lambda \bm{x} = \bm{H}\bm{x} = \bm{H} \underbrace{\bm{H} \bm{x}}_{=\lambda \bm{x}} = \lambda \underbrace{\bm{H} \bm{x}}_{=\lambda \bm{x}} = \lambda^2 \bm{x}. We have shown that \lambda \bm{x} = \lambda^2 \bm{x} for \bm{x} \neq \bm{0}_n. This equality holds if and only if \lambda(\lambda - 1)\bm{x} = \bm{0}_n, which implies that the eigenvalues of \bm{H} are necessarily \lambda \in \{0, 1\}. Incidentally, this proves that \bm{H} is positive semi-definite (not requested by the exercise).

Let \lambda_1,\dots,\lambda_n be the eigenvalues of \bm{H}. We proved above that \text{tr}(\bm{H}) = p and, since the trace equals the sum of the eigenvalues, \text{tr}(\bm{H}) = \sum_{i=1}^n \lambda_i = p. Thus, there are necessarily p eigenvalues equal to 1 and n - p eigenvalues equal to 0. In particular, \bm{H} has exactly p nonzero eigenvalues and, being symmetric, its rank equals the number of nonzero eigenvalues, implying that \text{rk}(\bm{H}) = p.


Finally, we show that the sum of each row (and, by symmetry, of each column) of \bm{H} equals 1 when the intercept is included. This follows from the fact that \bm{H} projects any vector of \mathbb{R}^n onto the column space of \bm{X}.

If the intercept is included in the model, then \bm{1}_n is a column of \bm{X}, say \bm{1}_n = \bm{X}\bm{a} with \bm{a} = (1,0,\dots,0)^T, and therefore \bm{H}\bm{1}_n = \bm{X}(\bm{X}^T\bm{X})^{-1}\bm{X}^T\bm{X}\bm{a} = \bm{X}\bm{a} = \bm{1}_n. The ith entry of \bm{H}\bm{1}_n is precisely \sum_{j=1}^n [\bm{H}]_{ij}, which proves the claim.
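These algebraic properties can also be verified numerically. A minimal R sketch with simulated data (the sample size, the number of predictors, and the seed are arbitrary):

```r
set.seed(123)
n <- 50
p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # design matrix with intercept
H <- X %*% solve(crossprod(X)) %*% t(X)              # hat matrix H = X (X'X)^{-1} X'

all.equal(H, t(H))      # symmetry: H = H'
all.equal(H, H %*% H)   # idempotency: H^2 = H
sum(diag(H))            # trace: equals p = 4
qr(H)$rank              # rank: equals p = 4
range(rowSums(H))       # row sums: all equal to 1 (intercept included)
```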

A.2 - Positive definiteness of \bm{X}^T\bm{X}

Prove the statement of Proposition A.1. In other words, suppose the n\times p matrix \bm{X} has full rank, that is \text{rk}(\bm{X}) = p, with p < n. Then, show that \bm{X}^T\bm{X} is positive definite.
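Although it does not replace the proof, the claim is easy to illustrate numerically: for a simulated full-rank \bm{X}, the quadratic form \bm{a}^T\bm{X}^T\bm{X}\bm{a} = ||\bm{X}\bm{a}||^2 is strictly positive for any nonzero \bm{a}, and all the eigenvalues of \bm{X}^T\bm{X} are strictly positive. A sketch in R:

```r
set.seed(1)
n <- 50
p <- 4
X <- matrix(rnorm(n * p), n, p)                  # full rank with probability one
a <- rnorm(p)                                    # an arbitrary nonzero vector

drop(t(a) %*% crossprod(X) %*% a)                # a' X'X a = ||Xa||^2 > 0
sum((X %*% a)^2)                                 # same quantity, computed directly
eigen(crossprod(X), symmetric = TRUE)$values     # all eigenvalues strictly positive
```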

A.3 - OLS with orthogonal predictors

Prove the statement of Proposition A.2. In other words, show that when the predictors are orthogonal the least squares estimate is \hat{\beta}_j = \frac{\tilde{\bm{z}}_j^T\bm{y}}{\tilde{\bm{z}}_j^T\tilde{\bm{z}}_j}, \qquad j=1,\dots,p.

Assume, in addition, that \bm{Y} = \bm{Z}\beta + \bm{\epsilon} and that the errors \epsilon_i \overset{\text{iid}}{\sim} \text{N}(0, \sigma^2). Then, obtain the covariance matrix of \hat{\beta} and conclude that the estimators \hat{\beta}_j and \hat{\beta}_{j'} are independent, for any j \neq j'.
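A numerical check in R: construct a matrix with orthogonal columns (e.g. via a QR decomposition) and compare the coordinate-wise formula with the joint least squares solution. The simulated data and the true coefficients below are arbitrary:

```r
set.seed(42)
n <- 100
p <- 3
Z <- qr.Q(qr(matrix(rnorm(n * p), n, p)))        # orthogonal columns: Z'Z = I_p
y <- drop(Z %*% c(1, -2, 0.5)) + rnorm(n)

beta_joint <- drop(solve(crossprod(Z), crossprod(Z, y)))   # usual OLS estimate
beta_marg  <- drop(crossprod(Z, y)) / colSums(Z^2)         # z_j'y / z_j'z_j, one j at a time
all.equal(beta_joint, beta_marg)                           # TRUE
```

Note also that, under normal errors, the covariance matrix \sigma^2(\bm{Z}^T\bm{Z})^{-1} is diagonal when the columns of \bm{Z} are orthogonal, which is the key observation for the independence part of the exercise.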

A.4 - Final step of the Gram-Schmidt algorithm

Consider the last equation of this slide, that is \hat{\beta}_p = (\tilde{\bm{z}}_p^T\bm{y}) / (\tilde{\bm{z}}_p^T \tilde{\bm{z}}_p). Realize that this estimate can be regarded as the final (additional) step of the Gram-Schmidt algorithm, as mentioned in Algorithm 2.1 of the textbook Azzalini & Scarpa (2011).
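The claim can be checked numerically: regress the last column of the design matrix on all the other columns, take the residuals \tilde{\bm{z}}_p, and compare (\tilde{\bm{z}}_p^T\bm{y})/(\tilde{\bm{z}}_p^T\tilde{\bm{z}}_p) with the corresponding coefficient of the full least squares fit. A sketch in R with simulated data:

```r
set.seed(7)
n <- 100
p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% c(1, 2, -1, 0.5)) + rnorm(n)

z_p <- resid(lm(X[, p] ~ X[, -p] - 1))        # residual of x_p on the other columns
beta_p_gs  <- sum(z_p * y) / sum(z_p^2)       # (z_p' y) / (z_p' z_p)
beta_p_ols <- unname(coef(lm(y ~ X - 1))[p])  # least squares coefficient of x_p
all.equal(beta_p_gs, beta_p_ols)              # TRUE
```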

A.5 - Recursive least squares

Verify the correctness of the recursive least squares equations described in this slide.

The proof is already concisely written in the slides; you just need to convince yourself of the correctness of every step and add the missing details.

The second part of this exercise is quite hard and optional, because it involves a lot of algebraic steps. Verify the correctness of the deviance formula: ||\bm{y}_{(n + 1)} - \bm{X}_{(n+1)}\hat{\beta}_{(n+1)}||^2 = ||\bm{y}_{(n)} - \bm{X}_{(n)}\hat{\beta}_{(n)}||^2 + v_{(n)} e_{n+1}^2, which is mentioned here without proof.
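As a sanity check (not a proof), the deviance recursion can be verified numerically on simulated data. The sketch below assumes that e_{n+1} = y_{n+1} - \bm{x}_{n+1}^T\hat{\beta}_{(n)} is the prediction error and that v_{(n)} = 1/(1 + \bm{x}_{n+1}^T(\bm{X}_{(n)}^T\bm{X}_{(n)})^{-1}\bm{x}_{n+1}); these definitions are meant to match the slides, but the notation there may differ slightly.

```r
set.seed(2024)
n <- 60
p <- 3
X <- cbind(1, matrix(rnorm((n + 1) * (p - 1)), n + 1, p - 1))
y <- drop(X %*% c(1, -1, 2)) + rnorm(n + 1)

Xn <- X[1:n, ]; yn <- y[1:n]              # first n observations
xnew <- X[n + 1, ]; ynew <- y[n + 1]      # the (n + 1)-th observation

beta_n <- drop(solve(crossprod(Xn), crossprod(Xn, yn)))      # OLS on the first n rows
rss_n  <- sum((yn - drop(Xn %*% beta_n))^2)
e_new  <- ynew - sum(xnew * beta_n)                          # prediction error e_{n+1}
v_n    <- 1 / (1 + drop(t(xnew) %*% solve(crossprod(Xn)) %*% xnew))

beta_n1 <- drop(solve(crossprod(X), crossprod(X, y)))        # OLS on all n + 1 rows
rss_n1  <- sum((y - drop(X %*% beta_n1))^2)

all.equal(rss_n1, rss_n + v_n * e_new^2)                     # TRUE
```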

A.6 - Separability in logistic regression

Consider a logistic regression model for binary data with a single predictor x_i \in \mathbb{R}, so that f(x_i) = \beta_0 + \beta_1 x_i and \mathbb{P}(Y_i = 1) = \pi(x_i) = 1/[1 + \exp\{-f(x_i)\}]. Suppose there exists a point x_0 \in \mathbb{R} that perfectly separates the two binary outcomes.

Investigate the behavior of the maximum likelihood estimate (\hat{\beta}_0, \hat{\beta}_1) in this scenario. Then, generalize this result to the multivariate case, when \bm{x}_i \in \mathbb{R}^p. That is, suppose there exists a vector \bm{x}_0 \in \mathbb{R}^p that perfectly separates the binary outcomes.

A detailed description of this separability issue is provided in the Biometrika paper by Albert and Anderson (1984). See also the paper by Rigon and Aliverti (2023) for a simple correction.
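The phenomenon is easy to reproduce empirically: with perfectly separated data the log-likelihood has no maximizer, and the estimates grow in absolute value as the optimizer is allowed to iterate (R's glm typically warns that fitted probabilities numerically 0 or 1 occurred). A minimal sketch with artificial data separated at x_0 = 0:

```r
x <- c(-3, -2, -1, 1, 2, 3)   # x_0 = 0 perfectly separates the two outcomes
y <- c(0, 0, 0, 1, 1, 1)

# Allowing more IRLS iterations only makes the estimates larger in absolute value
fit <- glm(y ~ x, family = binomial, control = glm.control(maxit = 200))
coef(fit)   # huge estimates: the maximum likelihood estimate does not exist
```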

A.7 - Standardization of the input variables

In a linear model, suppose the original input values x_{ij} are standardized, namely the transformed variables z_{ij} are such that z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, where \bar{x}_j and s_j are the mean and the standard deviation of the jth variable. We denote by \hat{\beta} the OLS estimate based on the original data and by \hat{\gamma} the OLS estimate based on the standardized data.

Prove that the predicted values coincide, that is: \hat{y}_i = \bm{x}_i^T\hat{\beta} = \bm{z}_i^T\hat{\gamma}, \qquad i=1,\dots,n, where \bm{x}_i = (x_{i1},\dots,x_{ip})^T and \bm{z}_i = (z_{i1},\dots,z_{ip})^T. Hence, standardization of the inputs in ordinary least squares has no effect on the predictions. Similar linear operations on the inputs, such as normalization, would lead to the same conclusion.

To be added. The exercise has been discussed in class.
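In the meantime, the claim can be verified numerically: fit the model on the original inputs and on the standardized inputs, with the intercept included in both cases, and compare the fitted values. A minimal R sketch with simulated data:

```r
set.seed(11)
n <- 100
p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n)

Z <- scale(X)                             # standardized inputs: (x_ij - xbar_j) / s_j
fit_x <- lm(y ~ X)                        # original inputs, intercept included
fit_z <- lm(y ~ Z)                        # standardized inputs, intercept included
all.equal(fitted(fit_x), fitted(fit_z))   # TRUE: identical predicted values
```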

Practical exercises

A.8 - Recursive least squares

Implement a function ols_function(X, y) that computes the least squares estimate using the recursive least squares algorithm, described in this slide.

This procedure is also described in detail in Algorithm 2.2 of Azzalini & Scarpa (2011).
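A possible implementation is sketched below. It assumes the standard recursive updating equations, with the inverse (\bm{X}^T\bm{X})^{-1} updated through the Sherman–Morrison formula and the recursion initialized on the first p observations; the notation and the initialization used in the slides and in Algorithm 2.2 may differ.

```r
# Recursive least squares: a sketch assuming the standard updating equations,
# with the recursion initialized on the first p observations.
ols_function <- function(X, y) {
  X <- as.matrix(X)
  y <- as.numeric(y)
  n <- nrow(X)
  p <- ncol(X)
  V    <- solve(crossprod(X[1:p, , drop = FALSE]))               # (X'X)^{-1} after p rows
  beta <- drop(V %*% crossprod(X[1:p, , drop = FALSE], y[1:p]))  # OLS on the first p rows
  for (i in (p + 1):n) {
    x_new <- X[i, ]
    e_new <- y[i] - sum(x_new * beta)                            # prediction error
    V     <- V - (V %*% x_new %*% t(x_new) %*% V) / drop(1 + t(x_new) %*% V %*% x_new)
    beta  <- beta + drop(V %*% x_new) * e_new                    # updated coefficients
  }
  beta
}

# Quick check against the direct least squares solution, on simulated data
set.seed(5)
X <- cbind(1, matrix(rnorm(200 * 2), 200, 2))
y <- drop(X %*% c(1, 2, -1)) + rnorm(200)
all.equal(ols_function(X, y), drop(solve(crossprod(X), crossprod(X, y))))   # TRUE
```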

A.9 - Iteratively re-weighted least squares for logistic regression

Implement a function called logistic_mle(X, y) which computes the maximum likelihood estimate for a logistic regression model using the iteratively re-weighted least squares algorithm, as described here.

Verify that the output of logistic_mle and the built-in glm function coincide, using the heart dataset.
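A possible sketch of the iteratively re-weighted least squares iterations is given below, assuming that X already includes the intercept column; the starting values and the stopping rule are arbitrary choices and may differ from those in the slides. The final comparison uses simulated data, since the course dataset is not loaded here.

```r
# IRLS for logistic regression: a sketch of the standard weighted least squares updates
logistic_mle <- function(X, y, tol = 1e-10, maxit = 100) {
  X <- as.matrix(X)
  y <- as.numeric(y)
  beta <- rep(0, ncol(X))                    # starting values (an arbitrary choice)
  for (it in 1:maxit) {
    eta  <- drop(X %*% beta)
    prob <- 1 / (1 + exp(-eta))              # pi(x_i)
    W    <- prob * (1 - prob)                # IRLS weights
    z    <- eta + (y - prob) / W             # working response
    beta_new <- drop(solve(crossprod(X, W * X), crossprod(X, W * z)))  # weighted LS step
    if (max(abs(beta_new - beta)) < tol) {
      beta <- beta_new
      break
    }
    beta <- beta_new
  }
  beta
}

# Comparison with the built-in glm function, here on simulated data
set.seed(3)
X <- cbind(1, matrix(rnorm(500 * 2), 500, 2))
y <- rbinom(500, 1, 1 / (1 + exp(-drop(X %*% c(0.5, 1, -1)))))
cbind(irls = logistic_mle(X, y), glm = coef(glm(y ~ X - 1, family = binomial)))
```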