Computational Statistics II

Homework 1

Author
Affiliation

Tommaso Rigon

DEMS

The homeworks are not graded, but the results may be sent to tommaso.rigon@unimib.it to receive feedbacks.

Introduction

In this homework we consider the CAGE cancer dataset that is described in the tutorial genes_tutorial.md. This is the same dataset that is presented in the article Durante (2019), Conjugate Bayes for Probit Regression via Unified Skew-Normal Distributions.

Note. This is a high-dimensional problem, so adjust your expectations about the performance of the algorithms accordingly.

Dataset description (from the gene_tutorial.md document)

The focus is on learning how gene expression (monitored at p - 1 = 516 tags) relates to the probability of a cancerous tissue. Data are available for n = 74 measurements and can be downloaded at Cancer SAGE by clicking here. The download provides a directory SAGE_filtered_small_dataset which contains several datasets. Here the focus is on dataset_74-516.csv.

rm(list = ls())
dataset_gene <- read.csv("dataset_74-516.csv", header = TRUE, sep = "")

The dataframe dataset_gene contains information on the response variable (first column of dataset_gene), and on the covariates (subsequent columns of dataset_gene). More specifically, the first column dataset_gene[, 1] contains names of tissues followed by a letter which is either N (normal) or C (cancerous). Exploiting this information, let us create the response by hand.

y_data <- c(
  0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
  0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1
)

The design matrix comprising the covariates can be easily obtained by extracting the remaining columns in dataset_gene. Following Gelman et al. (2008), such covariates are also rescaled and an intercept term is added.

X_data <- cbind(1, scale(dataset_gene[, -1]))

Homework

Model description

Let \textbf{y} = (y_1,\dots,y_n)^\intercal be the vector of the observed binary responses (variable y_data) and let \textbf{X} be the corresponding design matrix (object X_data) whose generic row is \textbf{x}_i = (x_{i1},\dots,x_{ip})^\intercal, for i=1,\dots,n, suitably standardized. Consider a generalized linear model such that

y_i \mid \pi_i \overset{\text{ind}}{\sim} \text{Bern}(\pi_i), \qquad \pi_i = g(\eta_i), \qquad \eta_i = \beta_1x_{i1} + \cdots + \beta_p x_{ip}, where g(\cdot) is the probit link (pnorm function) and let the prior be \beta \sim N(0, 16 I_p).

Assignments

  1. Implement a random walk Metropolis algorithm for sampling from the posterior distribution of \beta. Tune the covariance matrix of the Gaussian proposal distribution by trial and error.

  2. Obtain a rough estimate of the posterior covariance matrix using the Laplace approximation; refer to Chopin & Ridgway (2017) for the details. Then, run a random walk Metropolis based on this approximation.

  3. Implement a MALA algorithm for the sampling from the posterior distribution of \beta based approximation for the covariance matrix obtained in the previous point.