TRANSCRIPT
1
Lecture 1(i): Introduction & Roadmap
Foundations of Data Science: Algorithms and Mathematical Foundations
Mihai [email protected]
CDT in Mathematics of Random Systems, University of Oxford
September 25, 2019
2
Course overview
• combines both theoretical and practical approaches
• Goal 1: understand the mathematical, statistical and algorithmic foundations behind some of the state-of-the-art algorithms for tasks in machine learning/data mining
  - organization and visualization of data clouds
  - dimensionality reduction •
  - clustering •
  - network analysis •
  - classification
  - regression
  - ranking •
• Goal 2: exposure to practical examples drawn from a wide range of topics including social network analysis, finance, statistics, etc.
3
Books
Textbook
• An Introduction to Statistical Learning, by James, Witten, Hastie, and Tibshirani, freely available at
http://faculty.marshall.usc.edu/gareth-james/ISL/
Slides will also be made available.
• More advanced + good reference book: The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman. http://www-stat.stanford.edu/ElemStatLearn
• Popular reference in ML/data mining: Pattern Recognition and Machine Learning, by Christopher Bishop.
• Ten Lectures and Forty-Two Open Problems in the Mathematics of Data Science, by Afonso S. Bandeira
https://people.math.ethz.ch/~abandeira/TenLecturesFortyTwoProblems.pdf
4
Prerequisites
Probability:
• event, random variable, indicator variable
• probability mass function, probability density function, cumulative distribution function
• joint distribution, marginal distribution
• conditional probability, Bayes’s rule
• independence
• expectation, variance
• uniform, exponential, binomial, Poisson, Gaussian distributions
Statistics:
• sampling from a population; mean, variance, standard deviation, median, covariance, correlation, and their sample versions; histogram
• linear regression, response and predictor variables, coefficients, residuals
5
Prerequisites
Linear algebra:
• vector and matrix arithmetic
• eigenvalues and eigenvectors of a matrix
Programming:
• arithmetic (scalar, vector, and matrix operations)
• writing functions
• reading in data sets from csv files
• using and manipulating data structures (subset a data frame)
• installing, loading, and using packages
• plotting
Wilson et al., Good enough practices in scientific computing, PLoS Computational Biology, 2017 Jun 22;13(6)
6
A list of tentative topics
1. Introduction & syllabus
2. Statistical learning
3. Measures of correlation and dependency
4. Linear regression 1 (Simple regression)
5. Linear regression 2 (Multiple regression)
6. Singular Value Decomposition (SVD), rank-k approximation, Principal Component Analysis (PCA)
7. PCA in high dimensions and random matrix theory (Marcenko-Pastur); applications to finance
8. Basics of spectral graph theory
Nonlinear dimensionality reduction methods:
9. Diffusion Maps and Vector Diffusion Maps
10. Multidimensional scaling and ISOMAP, Locally Linear Embedding (LLE)
11. Kernel PCA
7
Topics
Clustering:
12. Clustering: K-means, K-medoids, Hierarchical clustering
13. Spectral clustering and Cheeger’s inequality
14. Constrained clustering, clustering of signed graphs and correlation clustering
Modern regression:
15. Ridge regression
16. LASSO
Estimation from pairwise measurements:
17. The Page-Rank algorithm
18. Ranking from pairwise incomplete noisy information
19. Group synchronization
8
Additional Topics (if we had more time)
1. Classification: logistic regression, support vector machines, linear discriminant analysis.
2. Tree-based methods for classification and regression
3. Bagging, Boosting, Random Forests
4. The Multiplicative Weights Update Method: a Meta-Algorithm and Applications
• Arora, Sanjeev, Elad Hazan, Satyen Kale; Theory of Computing 2012
5. Johnson-Lindenstrauss Lemma; approximate nearest neighbors
9
Additional Topics (if we had more time)
Other potential topics
7. electrical networks and random walks
8. graph sparsification and Laplacian linear system solvers
9. graph embeddings from noisy distances
10. low-rank matrix completion
11. non-negative matrix factorization
12. matrix algorithms using sampling; sketch of a large matrix
10
What is data science/data mining?
Given (usually very large) data sets, how do you discover structural properties and make predictions?
Two very broad categories of problems:
• Unsupervised learning: discover structure. E.g., given measurements X1, ..., Xn, learn some underlying group structure based on the pattern of similarity between pairs of points ("working blind")
• Supervised learning: make predictions. E.g., given measurements (X1, Y1), ..., (Xn, Yn), learn a model to predict Yi from Xi
• Semi-supervised learning: both predictor and response measurements are available for only m of the n observations (with m << n).
11
This course
• Combines both applied and theoretical perspectives, though for some of the topics the emphasis will be on the algorithms/methodology
• Aim to understand what it is that we are trying to do
• Often, it’s not enough to load your data in R/Matlab/Python, use any available packages and expect to get an answer that makes sense (reasonable)
• Understand your data! Data is often very messy, noisy and incomplete
• Be able to identify what the end goal is, and based on that (and the given data) identify what tools are available
12
Tradeoffs: exact versus approximation
• Many problems are computationally hard to solve exactly
• In order to come up with tractable algorithms (that run in polynomial time), aim for an approximate solution
• Approximation algorithms can often perform well (sometimes they even find the exact solution!) and scale well computationally when applied to very large problems
• Polynomial-time algorithms are often not enough; some applications demand close to linear-time complexity (sometimes even sublinear!)
13
Bias-variance tradeoff
In supervised learning, when moving beyond the training set:
• If the model is too simple, the solution is biased and does not fit the data
• If the model is too complex, the solution is very sensitive to small changes in the data
Problem of simultaneously minimizing two sources of error:
• bias: difference between the truth and what you expect to learn
  - high bias leads to missing out on the relevant relations between features and target outputs (underfitting)
  - bias decreases with more complex models
• variance: difference between what you learn from a particular dataset and what you expect to learn. Arises from sensitivity to small fluctuations in the training set.
  - high variance leads to overfitting: modeling the random noise in the training data, rather than the intended outputs
  - variance decreases with simpler models
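The two sources of error above can be estimated numerically. The following is a minimal sketch, not part of the original slides: it assumes a hypothetical ground truth f(x) = sin(2πx), Gaussian noise, a fixed design grid, and polynomial fits of increasing degree as the "simple" and "complex" models. The function name `bias_variance` and all parameter values are illustrative choices.

```python
import numpy as np

def bias_variance(degree, n_datasets=200, n_points=30, noise_sd=0.3, seed=0):
    """Monte Carlo estimate of squared bias and variance for a polynomial fit.

    Returns (bias^2, variance), each averaged over a grid of evaluation points.
    """
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)      # assumed "true" function (hypothetical)
    x = np.linspace(0.0, 1.0, n_points)      # fixed design: only the noise changes
    x_eval = np.linspace(0.05, 0.95, 50)     # points where predictions are compared
    preds = np.empty((n_datasets, x_eval.size))
    for d in range(n_datasets):
        y = f(x) + rng.normal(0.0, noise_sd, n_points)   # a fresh noisy training set
        coef = np.polyfit(x, y, degree)                  # least-squares polynomial fit
        preds[d] = np.polyval(coef, x_eval)
    mean_pred = preds.mean(axis=0)                       # "what you expect to learn"
    bias_sq = float(np.mean((mean_pred - f(x_eval)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

results = {deg: bias_variance(deg) for deg in (1, 3, 9)}
for deg, (b, v) in results.items():
    print(f"degree {deg}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

Under these assumptions the straight line (degree 1) should show the largest squared bias and the smallest variance, and the degree-9 polynomial the reverse.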
14
Interpretability versus forecasting power
• trade-off between a model that is interpretable and one that predicts well under general circumstances
• essay on the distinction between explanatory and predictive modeling:
To Explain or to Predict?, by Galit Shmueli, Statistical Science 2010, Vol. 25, No. 3, 289–310
https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf
15
Occam’s razor
• a problem-solving principle known as the ’law of parsimony’
• when faced with different competing hypotheses that predict equally well, choose the one with the fewest assumptions
• more complex models may often provide better predictions, but in the absence of differences in predictive power, the fewer assumptions the better
16
Statistics vs. Machine Learning
• Brian D. Ripley: "machine learning is statistics minus any checking of models and assumptions"
"Statistical Modeling: The Two Cultures", Leo Breiman, Statistical Science, 16 (3), 2001, argued that
• statisticians rely too heavily on data modeling and assumptions
• machine learning techniques are making progress by instead relying on the predictive accuracy of models
17
Statistics vs. Machine Learning
More recently, statisticians have focused more on finite-sample properties, algorithms and massive data sets (big data).
Some still ongoing differences between the two communities:
• Statistics papers are more formal and often come with proofs, while Machine Learning papers are more open to new methodologies even if the theory is lacking (for now)
• The Machine Learning community primarily publishes in conferences and proceedings, while statisticians use journal papers (a much slower process)
• Some statisticians still focus on areas which are well outside the scope of ML (survey design, sampling, industrial statistics, survival analysis, etc.)
18
Final thoughts: No universal data mining recipe book
• hard to say which methods will work best in what situations
• sometimes, it’s crystal clear what method one should follow
• most often, we have little intuition a priori about which approach or set of tools we should use. Need to
  - understand the data first and the task at hand
  - understand the methods and their assumptions
  - make an educated guess on how to proceed
• often, customized tools are required to handle problem particularities (e.g., the response variable is highly imbalanced, as in default rate prediction)
• sometimes (if enough resources are available) one tries many different methods, and chooses the one which gives the best (out-of-sample) results
  - most competitions are won by ensemble methods: techniques that create multiple models and then combine them to produce improved results
  - stacking: considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model
19
Alzheimer’s Markers Predict Start of Mental Decline
20
Classification of handwritten digits
Figure: Automatic detection of handwritten postal codes.
21
Facebook: friend suggestions
Figure: People you may know.
22
Forecasting the stock market
Figure: Price of Google stock.
Source: http://businessforecastblog.com/forecasting-googles-stock-price-goog-on-20-trading-day-horizons/
23
Netflix: the $ 1,000,000 Prize
Figure: Movies you might enjoy.
24
eHarmony: predict whom you might fall in love with
Figure: Love is in the air (statistically speaking)
25
Identifying patterns in migration networks
Figure: Eigenvector colourings for the similarity matrix $W_{ij} = \frac{M_{ij}^2}{P_i P_j}$.
26
Financial time series clustering
• Clustering of the empirical correlation matrix of 1500 time series (stocks contained in the S&P 1500 index)
• Compute the bottom k = 10 eigenvectors of L, and run a standard machine learning clustering algorithm (k-means++) to recover k clusters.
Figure: Left: the adjacency matrix A with rows/columns sorted according to cluster membership. Right: Sector decomposition of the recovered clusters (based on a standard classification of the US economy into sectors). See link for details: GICS link.
M. Cucuringu, P. Davies, A. Glielmo, H. Tyagi, SPONGE: A generalized eigenproblem for clustering signed networks, AISTATS 2019 (python code available)
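The "bottom k eigenvectors of L, then k-means" recipe can be sketched on a toy graph. This is not the SPONGE algorithm from the cited paper (which solves a generalized eigenproblem for signed networks); it is a simplified illustration that assumes an unsigned graph, the combinatorial Laplacian L = D - A, and a plain Lloyd's k-means with a deterministic farthest-point initialisation standing in for k-means++. The function `spectral_clusters` and the two-clique example are hypothetical.

```python
import numpy as np

def spectral_clusters(A, k, n_iter=50):
    """Cluster the nodes of a graph with symmetric adjacency matrix A into k groups."""
    L = np.diag(A.sum(axis=1)) - A             # combinatorial Laplacian L = D - A
    _, vecs = np.linalg.eigh(L)                # eigh returns eigenvalues in ascending order
    emb = vecs[:, :k]                          # bottom-k eigenvectors as node coordinates

    # Farthest-point initialisation (a deterministic stand-in for k-means++).
    centers = [emb[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(d2)])
    centers = np.array(centers)

    # Lloyd iterations: assign each node to its nearest center, then update centers.
    for _ in range(n_iter):
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            emb[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy graph: two 5-node cliques joined by a single bridging edge.
n = 5
A = np.zeros((2 * n, 2 * n))
A[:n, :n] = 1
A[n:, n:] = 1
np.fill_diagonal(A, 0)
A[n - 1, n] = A[n, n - 1] = 1

labels = spectral_clusters(A, k=2)
print(labels)
```

On this toy graph the two cliques end up in different clusters. For real data one would more likely rely on a library implementation of k-means++ rather than the hand-written loop used here to keep the sketch self-contained.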
27
Yogi Berra:
It’s tough to make predictions, especially about the future
28
Lecture 1(ii): Statistical Learning
Chapter 2 in ”An Introduction to Statistical Learning” by James, Witten, Hastie,and Tibshirani, freely available at
http://faculty.marshall.usc.edu/gareth-james/ISL/
29
Advertising data set
• sales of a product in 200 different markets
• + budgets for the product in each of those markets for three different media: TV, radio, and newspaper
• Goal: predict sales given the three media budgets
• input variables (denoted by X1, X2, ...)
  - X1: TV budget
  - X2: radio budget
  - X3: newspaper budget
• inputs are also known as predictors, independent variables, features, or simply variables
• the output variable (sales) is the response or dependent variable (denoted by Y)
30
Advertising data set
31
• observe a quantitative response Y and p different predictors, X = (X1, X2, ..., Xp)
• assume there is some relationship between Y and X = (X1, X2, ..., Xp):
$Y = f(X) + \varepsilon$
• f is a fixed but unknown function
• ε is a random error term, independent of X, with mean zero
32
Why Estimate f?
Task I: Prediction
• $\hat{Y} = \hat{f}(X)$
• $\hat{f}$ is our estimate for f
• $\hat{Y}$ is the resulting prediction
• it’s OK if $\hat{f}$ is a black box algorithm
• we only care about forecasting Y
33
Reducible error vs. irreducible error
Reducible error:
• $\hat{f}$ is an imperfect estimate for f; this leads to reducible error
• we can potentially improve the accuracy of $\hat{f}$ by using more complex models to estimate f
Irreducible error:
• even if we perfectly estimate f, so that $\hat{Y} = f(X)$, our prediction is not perfect
• recall Y is also a function of ε, and ε ⊥ X
• variability associated with ε also affects the accuracy of our predictions
• the irreducible error is most often positive (ε may contain unmeasured variables that impact Y; unmeasurable variation)
34
Prediction
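Fix a given estimate $\hat{f}$ and a set of predictors X.
Our prediction: $\hat{Y} = \hat{f}(X)$
Consider the average/expected value of the squared error:
\begin{align}
E[(Y - \hat{Y})^2] &= E[(f(X) + \varepsilon - \hat{f}(X))^2] = E[((f(X) - \hat{f}(X)) + \varepsilon)^2] \tag{1}\\
&= E[(f(X) - \hat{f}(X))^2 + 2(f(X) - \hat{f}(X))\varepsilon + \varepsilon^2] \tag{2}\\
&= E[(f(X) - \hat{f}(X))^2] + 0 + E[\varepsilon^2] \tag{3}\\
&= E[(f(X) - \hat{f}(X))^2] + \mathrm{Var}[\varepsilon] \tag{4}\\
&= \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{\mathrm{Var}[\varepsilon]}_{\text{Irreducible}} \tag{5}
\end{align}
The cross term in (2) vanishes because $\hat{f}$ and X are fixed and ε has mean zero; for the same reason $E[\varepsilon^2] = \mathrm{Var}[\varepsilon]$.
Our goal: develop techniques for estimating f that minimize the reducible error.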
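The decomposition into reducible and irreducible error can be checked by simulation. The sketch below is an illustration only: the "true" f, the fixed estimate, the noise level and the point x0 are all made-up values, chosen so that the two terms are easy to compute by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: 2.0 + 3.0 * x         # assumed "true" f (hypothetical)
f_hat = lambda x: 1.5 + 3.2 * x     # assumed fixed estimate of f (hypothetical)
sigma = 0.5                         # standard deviation of the error term eps
x0 = 1.0                            # the fixed predictor value X

eps = rng.normal(0.0, sigma, size=1_000_000)   # mean-zero noise, independent of X
y = f(x0) + eps                                # Y = f(X) + eps
y_hat = f_hat(x0)                              # prediction Y_hat = f_hat(X)

mse = float(np.mean((y - y_hat) ** 2))         # Monte Carlo estimate of E[(Y - Y_hat)^2]
reducible = (f(x0) - f_hat(x0)) ** 2           # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                       # Var[eps]

print(f"simulated MSE        : {mse:.4f}")
print(f"reducible+irreducible: {reducible + irreducible:.4f}")
```

With these values the reducible part is 0.3² = 0.09 and the irreducible part is 0.5² = 0.25, so the simulated mean squared error should be close to 0.34 up to Monte Carlo error.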
35
Task II: Inference
Want to understand how Y changes as a function of X1, ..., Xp
Cannot treat $\hat{f}$ as a black box; need to know its exact form.
• Which predictors are associated with the response?
  - most often, only a small fraction of the predictors are relevant for Y
• What is the relationship between the response and each predictor?
  - positive/negative relationship; may depend on the values of other predictors
• Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is this not enough?
  - most data out there is non-linear; need non-linear methods
36
Inference for the Advertising data set
• Which media contribute to sales?
• Which media generate the biggest boost in sales?
• How much increase in sales is associated with a given increase in TV advertising?
37
Prediction + inference
Real estate setting: relate values of homes to
• crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses
Inference:
• How do individual input variables affect the prices?
• How much extra will a house be worth if it has a view of the river?
Prediction:
• Predict the value of a home given its characteristics
38
Linear vs Nonlinear
• linear methods: allow for relatively simple and interpretable inference, but may give less accurate predictions
• non-linear methods: can give very accurate predictions for Y, but the model is less interpretable, which makes inference more challenging
39
Estimating f
• Parametric methods: Linear Regression, Model Selection, Generalized Linear Models, Mixture Models, Classification (linear, logistic, support vector machines), Graphical Models, Structured Prediction, Hidden Markov Models
• Nonparametric methods: Nonparametric Regression and Density Estimation, Nonparametric Classification, Boosting, Clustering and Dimension Reduction, PCA, Manifold Methods, Principal Curves, Spectral Methods, The Bootstrap and Subsampling, Nonparametric Bayes.
40
Parametric Methods
1. make an assumption about the functional form or shape of f, e.g., in a linear model, f is assumed to be linear:
$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p$
2. fit or train the model; only need to estimate the p + 1 coefficients βi such that
$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p$
• use (ordinary) least squares or generalized linear models
• this reduces the problem of finding f to finding p + 1 coefficients
Downside: the chosen linear model might not match the (unknown) ground truth form of f.
If we aim for more flexible and complex models:
• need to estimate a larger number of parameters
• may lead to overfitting (fit the noise too closely)
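The two steps above (assume a linear form, then estimate the p + 1 coefficients by ordinary least squares) can be sketched as follows. The data here are synthetic and only loosely shaped like the Advertising setting (200 observations, 3 predictors); the coefficient values in `beta_true` are invented for illustration and are not estimates from the real data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3                                  # 200 "markets", 3 "media budgets"
X = rng.uniform(0, 100, size=(n, p))           # synthetic predictors X1, X2, X3

# Hypothetical ground truth: intercept followed by one coefficient per predictor.
beta_true = np.array([3.0, 0.05, 0.2, 0.0])
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 1.0, n)   # Y = f(X) + eps

# Step 2: estimate the p + 1 coefficients by ordinary least squares.
X1 = np.column_stack([np.ones(n), X])          # prepend a column of ones for beta_0
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("true coefficients     :", beta_true)
print("estimated coefficients:", np.round(beta_hat, 3))
```

Because the synthetic data really are linear, the estimates land close to `beta_true`; the downside noted above arises when the true f is not of the assumed form.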
41
Income data set
Figure: The blue surface represents the true underlying relationship between income and years of education and seniority, which is known since the data are simulated. The red dots indicate the observed values of these quantities for 30 individuals.
42
Linear model
income ≈ β0 + β1 × education + β2 × seniority
43
Linear model for the Income data
income ≈ β0 + β1 × education + β2 × seniority
• True f has some curvature not captured by the linear fit
• OK for capturing the positive relationship between years of education/seniority and income
44
Non-parametric Methods
• no explicit assumptions about the functional form of f (assumption-free approach)
• search for an estimate of f that best fits the data, with some smoothness assumptions
• +: can fit a wider range of functions f
• -: a much larger number of observations is needed (compared to parametric approaches)
Example: splines with varying levels of smoothness (allowing for a rougher/tighter fit)
45
Smooth splines
46
Rough splines
Figure: Perfect fit on the given data!
47
Rough splines (cont.)
Figure: ... but far from the ground truth f (right plot)!
This is a good example of overfitting: the fit does not produce accurate estimates of the response on new observations that were not part of the original training data set.
48
The Trade-Off Between Prediction Accuracy and Model Interpretability
• linear regression: fairly inflexible
• splines: considerably more flexible (can fit a much wider range of possible shapes to estimate f)
Inference:
• restrictive models are much more interpretable
• linear model: easy to understand the relationship between Y and X1, X2, ..., Xp
Very flexible approaches (splines, boosting, etc.)
• can lead to such complicated estimates of f that it is hard to understand how any individual predictor is associated with the response
LASSO:
• less flexible
• linear model + sparsity of [β0, β1, ..., βp]
• more interpretable; only a small subset of predictors matter
49
Flexibility vs. Interpretability
Generalized additive models (GAMs)
• extend the linear model to allow for certain non-linear relationships
$y_i = \beta_0 + f_1(x_{i,1}) + f_2(x_{i,2}) + \ldots + f_p(x_{i,p}) + \varepsilon_i$
• more flexible than linear regression
• less interpretable than linear regression (the relationship between each predictor and the response is modeled by a curve)
• main limitation: the model is restricted to be additive (with many variables, important interactions can be missed)
Fully non-linear methods: bagging, boosting, and support vector machines with non-linear kernels
• highly flexible
• harder to interpret
50
Flexibility vs. Interpretability
51
For inference
• better off using simple and relatively inflexible statistical learning methods
For prediction
• interpretability is not an issue
• should we use the most flexible model available?
• surprisingly, not always!
• sometimes we get more accurate (out-of-sample) predictions using a less flexible method
• overfitting in highly flexible methods is a real concern/danger
52
Supervised Versus Unsupervised Learning
• supervised: there exists a response Y
• unsupervised: there is NO response Y (e.g., clustering)
• semi-supervised learning: some of the observations have an associated response Y, but some do not
Personal note:
• unsupervised learning: a means to an end!
• in many/most applications, the ultimate goal is prediction
• augment the unsupervised learning results (clustering, non-linear dimensionality reduction) with a prediction step (even a simple linear regression)
53
Performance evaluation
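• mean squared error (MSE)
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$
• $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for observation i
• training MSE: computed on the data used to fit $\hat{f}$
• test data: observations not used in fitting
• we care most/only about out-of-sample performance
• test MSE:
$\mathrm{Average}\,(y_0 - \hat{f}(x_0))^2$
on the test observations (x0, y0) we have not seen yet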
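To make the distinction between training MSE and test MSE concrete, here is a small sketch on simulated data. The ground truth sin(2πx), the noise level, the sample sizes and the polynomial degrees are all assumptions chosen for illustration; polynomial degree plays the role of model flexibility.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error between observed and predicted responses."""
    return float(np.mean((y - y_pred) ** 2))

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # assumed "true" f (hypothetical)
sigma = 0.3                           # noise standard deviation

x_train = rng.uniform(0, 1, 40)
y_train = f(x_train) + rng.normal(0, sigma, 40)
x_test = rng.uniform(0, 1, 1000)      # observations never used for fitting
y_test = f(x_test) + rng.normal(0, sigma, 1000)

train_mse, test_mse = {}, {}
for degree in (1, 3, 10):             # increasing flexibility
    coef = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(y_train, np.polyval(coef, x_train))
    test_mse[degree] = mse(y_test, np.polyval(coef, x_test))
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

Since the polynomial models are nested, the training MSE cannot increase with the degree, whereas the test MSE has no such guarantee.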
54
Flexibility vs. MSE (non-linear)
Horizontal dashed line indicates Var(ε), the irreducible error.
55
Flexibility vs. MSE (linear)
56
Remarks
• test MSE >> train MSE
• U-shape in the test MSE
• as model flexibility increases, training MSE will decrease, but the test MSE may not
• overfitting: training MSE <<< test MSE
• overfitting: when fitting the noise rather than properties of f
• overfitting: when a less flexible model yields a smaller test MSE
• cross-validation: estimating test MSE using the training data
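Cross-validation, mentioned in the last remark, can be sketched as a k-fold loop: hold out one fold, fit on the rest, and average the held-out errors. The helper `kfold_mse`, the simulated data and the candidate degrees below are illustrative assumptions, not material from the slides.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a degree-`degree` polynomial fit by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))              # shuffle before splitting into folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                         # held-out fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coef, x[val])
        errors.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 100)   # simulated training data

cv = {d: kfold_mse(x, y, d) for d in (1, 3, 12)}
best = min(cv, key=cv.get)                     # degree with the smallest CV estimate
for d, err in cv.items():
    print(f"degree {d:2d}: CV estimate of test MSE = {err:.3f}")
print("selected degree:", best)
```

Only the training data are used here, yet each prediction is made for an observation the corresponding fit has not seen, which is why the average serves as an estimate of the test MSE.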
57
R data sets (textbook)
• The UC Irvine Machine Learning Data Set Repository: http://archive.ics.uci.edu/ml/
• S&P 500/1500: daily open and close prices for 2003-2019
• Networks data sets: directed, signed, temporal, multilayer