Data mining
CdL CLAMSES
This is the website of the Data Mining course (6 CFU) of the “Corso di Laurea Magistrale in Scienze Statistiche ed Economiche (CLAMSES)”, Università degli Studi di Milano-Bicocca.
Teaching material
Required
- Azzalini, A. and Scarpa, B. (2011), Data Analysis and Data Mining, Oxford University Press.
- Hastie, T., Tibshirani, R. and Friedman, J. (2009), The Elements of Statistical Learning, Second Edition, Springer.
Optional
- Efron, B. and Hastie, T. (2016), Computer Age Statistical Inference, Cambridge University Press.
- Lewis, Kane and Arnold (2019), A Computational Approach to Statistical Learning, Chapman and Hall/CRC.
Slides and lecture notes
The slides are meant to be viewed in HTML. However, if you really want to convert the HTML slides into PDF files, you can follow the instructions in the Quarto documentation.
Topic | Notes | Slides | Code |
---|---|---|---|
Introduction | Introduction | Slides Introduction | |
A-B-C | Unit A | Slides Unit A | Code A |
Lab 1 (Computations for linear models) | Code Lab 1 | | |
Exercises | Exercises A | | |
Optimism, conflicts and trade-offs | Unit B | Slides Unit B | Code B |
Exercises | Exercises B | | |
Lab 2 (Ames housing) | Code Lab 2 | | |
Shrinkage and variable selection | Unit C | Slides Unit C | Code C |
Exercises | Exercises C | | |
Lab 3-4 (Ames housing) | Code Lab 3-4 | | |
Nonparametric regression | Unit D | Slides Unit D | Code D |
Exercises | Exercises D | | |
Lab 5 (Auto) | Code Lab 5 | | |
The curse of dimensionality | Unit E | Slides Unit E | |
Additive models | Unit F | Slides Unit F | Code F |
Exercises | Exercises F | | |
Lab 6 (Ames housing) | Code Lab 6 | | |
FAQ - The statistical culture | Unit G | | |
Exam
Rules
The exam consists of two parts:
(20 / 30) Written examination: a pen-and-paper exam about the theoretical aspects of the course.
(10 / 30) Individual assignment: a data challenge.
- You will be given a prediction task, and you will need to submit your predictions and produce a report of at most 4 pages;
- You will make use of the Kaggle platform (optional);
- Further info will be provided in due time.
Both parts are mandatory, and you need to submit the assignment before attempting the written part. The report expires one year after the end of the course.
The final grade is obtained as the sum of the above scores.
Mock exam
An example of a written examination is provided at this link.
Prerequisites
Knowledge of (i) linear algebra, (ii) linear models, and (iii) generalized linear models (GLMs) is highly recommended.
Knowledge of topics covered in the courses Probability and Statistics M and Advanced Statistics M is also highly recommended.
Office hours
To schedule a meeting, please write to tommaso.rigon@unimib.it. Office hours are held every Tuesday at 17:30.
Acknowledgments
The primary sources of this website’s content are the textbooks Azzalini and Scarpa (2011) and Hastie, Tibshirani, and Friedman (2009). A few pictures have been taken from the textbooks; in these cases the original source is cited.
I have made use of several other books, scientific articles, and datasets to complement the main textbooks. They are all cited at the end of each unit. Students are encouraged to consult the books: these slides are just a concise summary.
I am also grateful to Aldo Solari, who graciously shared the material of the courses Data Mining and Statistical Learning. Those websites laid the foundations for this course.
All remaining mistakes are mine.