Computational Statistics II
Homework 1
The homeworks are not graded, but the results may be sent to tommaso.rigon@unimib.it to receive feedback.
Introduction
In this homework we consider the CAGE cancer dataset that is described in the tutorial genes_tutorial.md. This is the same dataset that is presented in the article Durante (2019), Conjugate Bayes for Probit Regression via Unified Skew-Normal Distributions.
Note. This is a high-dimensional problem, so adjust your expectations about the performance of the algorithms accordingly.
Dataset description (from the genes_tutorial.md document)
The focus is on learning how gene expression (monitored at p - 1 = 516 tags) relates to the probability of a cancerous tissue. Data are available for n = 74 measurements and can be downloaded at Cancer SAGE by clicking here. The download provides a directory `SAGE_filtered_small_dataset`, which contains several datasets. Here the focus is on `dataset_74-516.csv`.
The dataframe `dataset_gene` contains information on the response variable (first column of `dataset_gene`) and on the covariates (subsequent columns of `dataset_gene`). More specifically, the first column `dataset_gene[, 1]` contains the names of the tissues, each followed by a letter which is either N (normal) or C (cancerous). Exploiting this information, let us create the response by hand.
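The response construction described above can be sketched as follows. The tutorial itself works in R, so this is only an illustrative Python/NumPy version; the names `dataset_gene` and `y_data` follow the text, while `make_response` is a hypothetical helper:

```python
import numpy as np

def make_response(labels):
    """Map tissue labels ending in 'N' (normal) or 'C' (cancerous) to 0/1."""
    return np.array([1 if str(lab).strip().endswith("C") else 0 for lab in labels])

# Hypothetical usage, assuming the CSV has been loaded into a dataframe:
# dataset_gene = pd.read_csv("dataset_74-516.csv")
# y_data = make_response(dataset_gene.iloc[:, 0])
```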
The design matrix comprising the covariates can be easily obtained by extracting the remaining columns of `dataset_gene`. Following Gelman et al. (2008), such covariates are also rescaled and an intercept term is added.
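As a minimal sketch of this preprocessing step (in Python rather than R): Gelman et al. (2008) suggest centering each numeric predictor and rescaling it to have standard deviation 0.5; the function name and the `scale` default below are assumptions for illustration:

```python
import numpy as np

def standardize_with_intercept(X, scale=0.5):
    """Center each column, rescale to standard deviation `scale`
    (0.5 as suggested by Gelman et al., 2008), and prepend an
    intercept column of ones. Assumes no constant columns."""
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0) * scale
    return np.column_stack([np.ones(X.shape[0]), Xs])
```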
Homework
Model description
Let \textbf{y} = (y_1,\dots,y_n)^\intercal be the vector of the observed binary responses (variable `y_data`) and let \textbf{X} be the corresponding design matrix (object `X_data`), suitably standardized, whose generic row is \textbf{x}_i = (x_{i1},\dots,x_{ip})^\intercal, for i=1,\dots,n. Consider a generalized linear model such that
y_i \mid \pi_i \overset{\text{ind}}{\sim} \text{Bern}(\pi_i), \qquad \pi_i = g(\eta_i), \qquad \eta_i = \beta_1x_{i1} + \cdots + \beta_p x_{ip},
where g(\cdot) = \Phi(\cdot) is the cumulative distribution function of a standard normal (the `pnorm` function), corresponding to a probit link, and let the prior be \beta \sim N(0, 16 I_p).
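All three algorithms below need the (unnormalized) log-posterior of this model. A sketch in Python/SciPy, assuming `y` and `X` are the arrays built above; `log Phi` is evaluated through `norm.logcdf` for numerical stability, using 1 - \Phi(\eta) = \Phi(-\eta):

```python
import numpy as np
from scipy.stats import norm

def log_posterior(beta, y, X, tau2=16.0):
    """Unnormalized log-posterior of the probit model with beta ~ N(0, tau2 * I_p)."""
    eta = X @ beta
    loglik = np.sum(y * norm.logcdf(eta) + (1 - y) * norm.logcdf(-eta))
    logprior = -0.5 * np.dot(beta, beta) / tau2
    return loglik + logprior
```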
Assignments
Implement a random walk Metropolis algorithm for sampling from the posterior distribution of \beta. Tune the covariance matrix of the Gaussian proposal distribution by trial and error.
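A random walk Metropolis sampler can be sketched as follows (Python rather than R; the proposal covariance `Sigma` is the quantity to be tuned by trial and error, and the function signature is only an assumption):

```python
import numpy as np

def rwm(logpost, beta0, Sigma, n_iter, rng=None):
    """Random walk Metropolis with Gaussian proposal N(beta_current, Sigma)."""
    rng = np.random.default_rng(rng)
    p = len(beta0)
    L = np.linalg.cholesky(Sigma)          # proposal scale, tuned by trial and error
    chain = np.empty((n_iter, p))
    beta = np.asarray(beta0, dtype=float)
    lp = logpost(beta)
    accepted = 0
    for t in range(n_iter):
        prop = beta + L @ rng.standard_normal(p)
        lp_prop = logpost(prop)
        # symmetric proposal: acceptance ratio is just the posterior ratio
        if np.log(rng.uniform()) < lp_prop - lp:
            beta, lp = prop, lp_prop
            accepted += 1
        chain[t] = beta
    return chain, accepted / n_iter
```

In the high-dimensional setting of this homework (p = 517), expect very small proposal scales to be needed for a reasonable acceptance rate.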
Obtain a rough estimate of the posterior covariance matrix using the Laplace approximation; refer to Chopin & Ridgway (2017) for the details. Then, run a random walk Metropolis based on this approximation.
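One quick way to obtain the Laplace approximation is to locate the posterior mode numerically and reuse the optimizer's inverse-Hessian estimate as a rough covariance; the sketch below uses BFGS for this, which is only one of several options (Chopin & Ridgway, 2017, discuss Newton-type alternatives):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approx(logpost, beta0):
    """Laplace approximation: the MAP estimate and a rough posterior
    covariance taken from the BFGS inverse-Hessian approximation
    at the mode (minimizing the negative log-posterior)."""
    res = minimize(lambda b: -logpost(b), beta0, method="BFGS")
    return res.x, res.hess_inv
```

The returned covariance (or a scaled version of it, e.g. multiplied by 2.38^2 / p) can then be plugged into the random walk Metropolis proposal.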
Implement a MALA algorithm for sampling from the posterior distribution of \beta, based on the approximation of the covariance matrix obtained in the previous point.
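A preconditioned MALA step can be sketched as below (again in Python; it assumes a gradient function for the log-posterior is available, and uses `Sigma` — e.g. the Laplace covariance — as a preconditioner, with the usual Metropolis-Hastings correction for the asymmetric Langevin proposal):

```python
import numpy as np

def mala(logpost, grad_logpost, beta0, Sigma, step, n_iter, rng=None):
    """Preconditioned MALA: Langevin proposal
    N(beta + 0.5 * step * Sigma @ grad, step * Sigma)."""
    rng = np.random.default_rng(rng)
    p = len(beta0)
    L = np.linalg.cholesky(Sigma)

    def propose_mean(b):
        return b + 0.5 * step * (Sigma @ grad_logpost(b))

    def log_q(to, frm):
        # log-density of N(propose_mean(frm), step * Sigma), up to a constant
        d = to - propose_mean(frm)
        z = np.linalg.solve(L, d)
        return -0.5 * np.dot(z, z) / step

    chain = np.empty((n_iter, p))
    beta = np.asarray(beta0, dtype=float)
    lp = logpost(beta)
    accepted = 0
    for t in range(n_iter):
        prop = propose_mean(beta) + np.sqrt(step) * (L @ rng.standard_normal(p))
        lp_prop = logpost(prop)
        # asymmetric proposal: include both forward and backward densities
        log_alpha = lp_prop - lp + log_q(beta, prop) - log_q(prop, beta)
        if np.log(rng.uniform()) < log_alpha:
            beta, lp = prop, lp_prop
            accepted += 1
        chain[t] = beta
    return chain, accepted / n_iter
```

For the probit model, the gradient of the log-likelihood is available in closed form through the ratio of the normal pdf and cdf, so `grad_logpost` need not be computed numerically.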