Data mining

CdL CLAMSES

Author: Tommaso Rigon

Affiliation: Università degli Studi di Milano-Bicocca

This is the website of the Data Mining course (6 CFU) of the “Corso di Laurea Magistrale in Scienze Statistiche ed Economiche (CLAMSES)”, Università degli Studi di Milano-Bicocca.

Teaching material

Required

Optional

  • Efron, B. and Hastie, T. (2016), Computer Age Statistical Inference, Cambridge University Press.
  • Lewis, Kane, Arnold (2019), A Computational Approach to Statistical Learning, Chapman and Hall/CRC.

Slides and lecture notes

The slides are meant to be viewed in HTML. However, if you really want to convert the HTML slides into PDF files, you can follow the instructions in the Quarto documentation.

Topic | Notes | Slides | Code
Introduction | Introduction | Slides Introduction |
A-B-C | Unit A | Slides Unit A | Code A
Lab 1 (Computations for linear models) | | | Code Lab 1
Exercises | Exercises A | |
Optimism, conflicts and trade-offs | Unit B | Slides Unit B | Code B
Exercises | Exercises B | |
Lab 2 (Ames housing) | | | Code Lab 2
Shrinkage and variable selection | Unit C | Slides Unit C | Code C
Exercises | Exercises C | |
Lab 3-4 (Ames housing) | | | Code Lab 3-4
Nonparametric regression | Unit D | Slides Unit D | Code D
Exercises | Exercises D | |
Lab 5 (Auto) | | | Code Lab 5
The curse of dimensionality | Unit E | Slides Unit E |
Additive models | Unit F | Slides Unit F | Code F
Exercises | Exercises F | |
Lab 6 (Ames housing) | | | Code Lab 6
FAQ - The statistical culture | Unit G | |

Exam

Rules

  • The exam consists of two parts:
    • (20 / 30) Written examination: a pen-and-paper exam on the theoretical aspects of the course.
    • (10 / 30) Individual assignment: a data challenge.
      • You will be given a prediction task, and you will need to submit your predictions and produce a report of at most 4 pages;
      • You will make use of the Kaggle platform (optional);
      • Further info will be provided in due time.
  • Both parts are mandatory, and you need to submit the assignment before attempting the written part. The report expires one year after the end of the course.
  • The final grade is the sum of the two scores above.

Mock exam

An example of a written examination is provided at this link.

Prerequisites

Knowledge of (i) linear algebra, (ii) linear models, and (iii) generalized linear models (GLMs) is highly recommended.

Knowledge of topics covered in the courses Probability and Statistics M and Advanced Statistics M is also highly recommended.

Office hours

To schedule a meeting, please write to tommaso.rigon@unimib.it. Office hours are held every Tuesday at 17:30.

Acknowledgments

The primary sources of this website’s content are the textbooks Azzalini and Scarpa (2011) and Hastie, Tibshirani, and Friedman (2009). A few pictures have been taken from the textbooks; in these cases the original source is cited.

I have made use of several other books, scientific articles, and datasets to complement the main textbooks. They are all cited at the end of each unit. Students are encouraged to consult the books: these slides are just a concise summary.

I am also grateful to Aldo Solari, who graciously shared the material of the courses Data Mining and Statistical Learning. These websites have been used as the basis for this course.

All remaining mistakes are mine.