Data Mining - CdL CLAMSES
Università degli Studi di Milano-Bicocca
Nowadays, predictive algorithms have become mainstream in popular culture thanks to some spectacular successes:
And yet, there is a lot of confusion about the history and the boundaries of the field. For instance, what is “data mining”?
And what are then the differences, if any, with statistics, machine learning, statistical learning, and data science?
What applied problems cannot be solved with classical statistical tools? Why?
Let us consider some real case studies…
The marketing department of a telecommunications company is interested in analyzing customer behavior.
Hence, the data science team would like to predict the telephone traffic of every single customer.
Traffic is measured as the total number of seconds of outgoing calls made in a given month by each customer.
Appropriate estimates of the overall traffic provide the necessary elements for:
The dataset has n = 30,619 customers and p = 99 covariates, i.e., measures of customer activity in previous months.
These are observational data, collected for other purposes rather than for this analysis. The data simply “exist”; there is no sampling design.
The data are dirty and often stored in a large data warehouse (DWH).
The dimension of the data is large in both directions: large n and large p. Hence:
The relationship between covariates and the response is complex, thus, it is hard to believe our models will be “true.” They are all wrong!
However, having a lot of data means we can split them, using the first half for estimation and the other half for testing.
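A minimal sketch of this idea in R, assuming the customers sit in a hypothetical data frame `traffic_data` with response `traffic` (the names are illustrative, not those of the actual dataset):

```r
# Random 50/50 split: estimate on one half, assess predictions on the other.
# `traffic_data` and `traffic` are hypothetical names used only for this sketch.
set.seed(123)
n        <- nrow(traffic_data)
train_id <- sample(n, size = floor(n / 2))

train <- traffic_data[train_id, ]
test  <- traffic_data[-train_id, ]

fit  <- lm(traffic ~ ., data = train)     # estimation half
pred <- predict(fit, newdata = test)      # test half
mean((test$traffic - pred)^2)             # estimated prediction error (MSE)
```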
Expression matrix of p = 6830 genes (rows) and n = 64 samples (columns), for the human tumor data.
100 randomly chosen rows are shown.
The picture is a heatmap, ranging from bright green (under-expressed) to bright red (overexpressed).
Missing values are gray. The rows and columns are displayed in a randomly chosen order.
Goal: predict cancer class based on expression values.
The main statistical difficulty here is that p > n!
Logistic regression and discriminant analysis would not work here: when p > n the estimates do not exist, since the logistic likelihood has no maximizer and the sample covariance matrix is singular.
Is it even possible to fit a model in this context?
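As a toy illustration of the issue (simulated data, not the gene expression matrix): with p > n, an unregularized logistic regression is not even well defined.

```r
# Simulated toy example with more covariates than observations (p > n).
set.seed(1)
n <- 64; p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, size = 1, prob = 0.5)

fit <- glm(y ~ X, family = binomial)   # unregularized logistic regression
sum(is.na(coef(fit)))                  # most coefficients are NA (not estimable)
# R typically also warns that fitted probabilities of 0 or 1 occurred:
# the remaining coefficients diverge, i.e. the maximum likelihood estimate
# does not exist.
```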
None of the previous case studies can be solved using traditional tools; in fact:
The objective is predicting a response variable in the most accurate way. Classical statistics has broader goals including, but not limited to, prediction.
We need a paradigm shift to address the above issues.
For instance, if reality is non-linear, what about going nonparametric? We could let the data speak without making any assumption about the relationship between y and x.
Moreover, if p-values and residual plots are no longer informative in this context, how do we validate our predictions?
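One possible sketch of both points, on simulated data: fit a rigid linear model and a nonparametric smoother (here `loess`, just one of many choices), and judge them by their prediction error on held-out data rather than by p-values.

```r
# Simulated non-linear truth: compare a linear fit and a loess smoother
# by their mean squared error on a held-out test set.
set.seed(42)
n <- 400
x <- runif(n, 0, 3)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x, y)

train_id <- sample(n, n / 2)
train <- dat[train_id, ]
test  <- dat[-train_id, ]

fit_lm    <- lm(y ~ x, data = train)      # rigid linear assumption
fit_loess <- loess(y ~ x, data = train)   # "let the data speak"

# na.rm guards against the few test points outside the training range,
# which loess does not extrapolate by default.
mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2, na.rm = TRUE)
c(linear = mse(fit_lm), loess = mse(fit_loess))
```

On this non-linear example the smoother should achieve a clearly lower test error, without any p-value or residual plot entering the comparison.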
After his Ph.D., Breiman eventually resigned from his academic position and went into full-time freelance consulting, where he worked for thirteen years.
Breiman joined the UC Berkeley Statistics Department in 1980.
Leo Breiman died in 2005 at the age of 77. He invented many of the mainstream predictive tools: CART, bagging, random forests, stacking.
It is tempting to fully embrace the pure predictive viewpoint, as Breiman did in his career, especially in light of the recent media attention and public interest.
“Statistical Modeling: The Two Cultures” is a highly influential paper written by an outstanding statistician.
In some cases, the paper may sound exaggerated and at times confrontational. These were different times.
It was also a discussion paper!
Two other giants of the discipline, Sir David Cox (who died in 2022) and Bradley Efron, were among the discussants and raised several critical points.
It is premature to delve into those criticisms. We will get back to them at the end of the course once you have enough knowledge to understand them.
If you are in this class today, it means…
You have already studied a lot of real analysis, linear algebra, and probability;
You know how to estimate the parameters of a statistical model, to construct and interpret confidence intervals, p-values, etc. You know the principles of inference;
You know how to explore data using the R statistical software and other tools (SAS, Python, etc.). You know principal component analysis and perhaps even factor models;
You know how to fit linear models and how to interpret the associated empirical findings. You are familiar with R^2s, likelihood ratio tests, logistic regression, and so on;
You may have attended a course named “data mining” before, and studied essential tools like linear discriminant analysis, k-nearest neighbors…
| Unit | Description |
|---|---|
| A-B-C | Linear models. Data modeling, the old-fashioned way. Advanced computations. |
| Optimism, conflicts and trade-offs | Bias-variance trade-off. Training and test paradigm, cross-validation. Information criteria, optimism. |
| Shrinkage and variable selection | Best subset selection, principal component regression. Ridge regression. Lasso and LARS. Elastic net. |
| Nonparametric estimation | Local linear regression. Regression and smoothing splines. |
| Additive models | Generalized additive models (GAM). Multivariate adaptive regression splines (MARS). |
Predictive interpretability means a transparent understanding of the driving factors behind the predictions. An example is a linear model with few (highly relevant) variables.
For example, if I change the value of a set of covariates, what is the impact on the predictions? (See the sketch after this list.)
This is useful, especially within the context of ethical AI and machine learning.
However, the predictive relevance of a variable does not imply a causal effect on the response.
Finding causal relationships requires careful thinking, a suitable sampling design, or both.
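A tiny, self-contained sketch of the “what if I change a covariate” question, on simulated data with made-up variable names:

```r
# Effect of changing one covariate on the prediction (illustrative data only).
set.seed(7)
dat   <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(50, sd = 0.5)

fit <- lm(y ~ x1 + x2, data = dat)

baseline <- data.frame(x1 = 0, x2 = 0)
modified <- data.frame(x1 = 1, x2 = 0)   # increase x1 by one unit

predict(fit, newdata = modified) - predict(fit, newdata = baseline)
# For a linear model this difference is simply the estimated coefficient of x1
# (close to the true value 2); for flexible "black box" models it must be
# computed numerically and may depend on the values of the other covariates.
```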
Azzalini & Scarpa (2011)
Data mining represents the work of processing, graphically or numerically, large amounts or continuous streams of data, with the aim of extracting information useful to those who possess them.
Hand et al. (2001)
Data mining is fundamentally an applied discipline […] data mining requires an understanding of both statistical and computational issues. (p. xxviii)
[…]
The most fundamental difference between classical statistical applications and data mining is the size of the data. (p. 19)
At this point, it may sound natural to ask yourself: what is statistics?
Statistics existed before data mining, machine learning, data science, and all these fancy new names.
Statistical regression methods trace back to Gauss and Legendre in the early 1800s. Their goal was indeed prediction!
Davison (2003)
Statistics concerns what can be learned from data.
Hand (2011)
Statistics […] is the technology of extracting meaning from data.
Sure, old-fashioned statistics is often insufficient to address modern challenges.
But statistics has profoundly changed over the years, broadening its boundaries.
The road was paved by Tukey in the 60s, with further exhortations by Breiman.
Modern statistics also encompasses:
Feel free to call it “data science” if you like the bright, shiny new term.
While data science and data mining have strong roots in statistics, one cannot deny the existence of two distinct, albeit often overlapping, communities.
For lack of a better term, we will call these communities the statisticians and the computer scientists, as identified by their backgrounds and studies.
| Statisticians | Computer Scientists |
|---|---|
| Parameters | Weights |
| Covariate | Feature |
| Observation | Instance |
| Response | Label |
| R | Python |
| Regression / Classification | Supervised learning |
| Density estimation, clustering | Unsupervised learning |
| Lasso / Ridge penalty | L^1 and L^2 penalty |
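To make the last row of the table concrete, these are the standard penalized least-squares criteria (anticipating the shrinkage unit): ridge penalizes the squared L^2 norm of the coefficients, while the lasso penalizes their L^1 norm.

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.
$$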
Several “automatic” tools have been developed over the years, tempting generations of analysts with push-button pipelines.
Those who choose to “press the button”:
More or less advanced knowledge of the methods is essential for:
Competence in computational aspects is helpful for better evaluating the output of the computer, e.g., in terms of its reliability.
If you are not making the choices, somebody else is!
“Quelli che s’innamoran di pratica sanza scienzia son come ’l nocchier ch’entra in navilio senza timone o bussola, che mai ha certezza dove si vada.”
(“Those who fall in love with practice without science are like the helmsman who boards a ship without rudder or compass, and is never certain where he is going.”)
Leonardo da Vinci
The exam is made of two parts:
(20 / 30) Written examination: a pen-and-paper exam about the theoretical aspects of the course.
(10 / 30) Individual assignment: a data challenge.
Both parts are mandatory, and you need to submit the assignment before attempting the written part. The assignment report expires one year after the end of the course.
The final grade is obtained as the sum of the above scores.
“Those who ignore statistics are condemned to reinvent it.”
Bradley Efron, Stanford University.