TRANSCRIPT
Introduction to Machine Learning and Deep
Learning
CNRS Formation Entreprise
Stéphane Gaïffas – Karine Tribouley
Today
Classification and regression trees (CART)
Bagging
Random forests
Boosting, adaboost, gradient boosting
Tree principle
Construct a recursive partition using a tree-structured set of “questions” (splits around a given value of a variable)
A simple average to predict class probabilities (classification)
Average in each leaf (regression)
Remarks
Quality of the prediction depends on the tree (the partition)
Issue: finding the optimal tree is hard!
Heuristic construction of a tree
Greedy approach: a top-down step in which branches are created (branching)
Heuristic construction of a tree
Start from a single region containing all the data (called root)
Recursively split nodes (i.e. rectangles) along a certain dimension (feature) at a certain value (threshold)
Leads to a guillotine partition of the feature space
Which is equivalent to a binary tree
Remarks
Greedy: no regret strategy on the choice of the splits!
Quite the opposite of gradient descent!
Heuristic: choose a split so that the two new regions are as homogeneous as possible...
Homogeneous means:
For classification: class distributions in a node should be close to a Dirac mass
For regression: labels should be concentrated around their mean in a node
How to quantify homogeneity?
Variance (regression)
Gini index, Entropy (classification)
Among others
Splitting
We want to split a node $N$ into a left child node $N_L$ and a right child node $N_R$
The children depend on a cut: a (feature, threshold) pair denoted by $(j, t)$
Introduce
$$N_L(j, t) = \{x \in N : x_j < t\} \quad \text{and} \quad N_R(j, t) = \{x \in N : x_j \geq t\}.$$
Finding the best split means finding the best pair $(j, t)$
Compare the impurity of $N$ with those of $N_L(j, t)$ and $N_R(j, t)$ for all pairs $(j, t)$
Using the information gain of each pair $(j, t)$
For regression, the impurity is usually the variance of the node
$$V(N) = \sum_{i : x_i \in N} (y_i - \bar y_N)^2 \quad \text{where} \quad \bar y_N = \frac{1}{|N|} \sum_{i : x_i \in N} y_i$$
with $|N| = \#\{i : x_i \in N\}$, and the information gain is given by
$$IG(j, t) = V(N) - \frac{|N_L(j, t)|}{|N|} V(N_L(j, t)) - \frac{|N_R(j, t)|}{|N|} V(N_R(j, t))$$
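As a concrete illustration (not code from the course), here is a minimal NumPy sketch of this variance-based information gain for one candidate split; the function names are made up for the example, and $V$ is the sum of squared deviations as defined above.

```python
import numpy as np

def variance_impurity(y):
    """V(N): sum of squared deviations from the node mean (the slide's definition)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) > 0 else 0.0

def information_gain_regression(x_j, y, t):
    """IG(j, t) = V(N) - |N_L|/|N| V(N_L) - |N_R|/|N| V(N_R) for threshold t on feature x_j."""
    left, right = y[x_j < t], y[x_j >= t]
    n = len(y)
    return (variance_impurity(y)
            - len(left) / n * variance_impurity(left)
            - len(right) / n * variance_impurity(right))
```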
For classification with labels $y_i \in \{1, \dots, K\}$, we first compute the class distribution $p_N = (p_{N,1}, \dots, p_{N,K})$ where
$$p_{N,k} = \frac{\#\{i : x_i \in N,\, y_i = k\}}{|N|}.$$
Then, we can consider an impurity measure $I$ such as:
$$G(N) = G(p_N) = \sum_{k=1}^{K} p_{N,k}(1 - p_{N,k}) \quad \text{(Gini index)}$$
$$H(N) = H(p_N) = -\sum_{k=1}^{K} p_{N,k} \log_2(p_{N,k}) \quad \text{(Entropy index)}$$
and for $I = G$ or $I = H$ we consider the information gain
$$IG(j, t) = I(N) - \frac{|N_L(j, t)|}{|N|} I(N_L(j, t)) - \frac{|N_R(j, t)|}{|N|} I(N_R(j, t))$$
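Similarly, a minimal NumPy sketch of the Gini and entropy impurities and the resulting information gain for one candidate split (illustrative names, not library code):

```python
import numpy as np

def gini(y):
    """G(N) = sum_k p_k (1 - p_k) over the class proportions in the node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def entropy(y):
    """H(N) = -sum_k p_k log2 p_k over the class proportions in the node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(x_j, y, t, impurity=gini):
    """IG(j, t) = I(N) - |N_L|/|N| I(N_L) - |N_R|/|N| I(N_R)."""
    left, right = y[x_j < t], y[x_j >= t]
    n = len(y)
    return (impurity(y)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))
```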
CART builds the partition iteratively
For each leaf N of the current tree
Find the best (feature, threshold) pair $(j, t)$ that maximizes $IG(j, t)$
Create the two new children of the leaf
Stop if some stopping criterion is met
Otherwise continue
Stopping criteria
Maximum depth of the tree
Maximum number of leaves
All leaves have fewer than a chosen number of samples
Impurity in all leaves is small enough
Testing error is increasing
Etc...
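As an aside, most of these stopping criteria map directly onto hyper-parameters of scikit-learn's decision trees; a minimal sketch with arbitrary values:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                # maximum depth of the tree
    max_leaf_nodes=32,          # maximum number of leaves
    min_samples_leaf=10,        # every leaf keeps at least this many samples
    min_impurity_decrease=1e-3, # only split if impurity decreases by at least this much
)
```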
Remarks
CART with Gini is probably the most used technique
Other criteria are possible: χ² homogeneity, losses other than least-squares, etc.
More remarks
Usually overfits (maximally grown tree typically overfits)
Usually leads to poor prediction
Leads to nice interpretable results
Is the basis of more powerful techniques: random forests, boosting (ensemble methods)
Example on iris dataset
using R with the rpart and rpart.plot libraries
[Figure: pairs plot of the iris dataset over Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, coloured by Species (setosa, versicolor, virginica).]
[Figure: fitted rpart tree. The root (100% of samples, class proportions .33/.33/.33) splits on Petal.Length < 2.5: the "yes" branch is a pure setosa leaf (33%); the "no" branch (67%) splits on Petal.Width < 1.8 into a versicolor leaf (36%, proportions .00/.91/.09) and a virginica leaf (31%, proportions .00/.02/.98).]
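The slide uses R's rpart; as a rough equivalent (not the slide's code), here is a sketch with scikit-learn that fits a depth-2 tree on iris and prints the learned splits, which should be close to those in the figure (petal length first, then petal width):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow CART tree on the iris dataset (Gini impurity by default)
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Text view of the learned partition
print(export_text(clf, feature_names=iris.feature_names))
```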
Bagging
Decision trees suffer from high variance
Bagging is a general-purpose procedure to (hopefully) reduce the variance of any statistical learning method
Particularly useful for decision trees
Bagging = Bootstrap Aggregation = aggregate (average) classifiers fitted on bootstrapped samples
Algorithm 1 Bagging
1: for $b = 1, \dots, B$ do
2: Sample with replacement $D_n^{(b)}$ from the original dataset $D_n$
3: Fit a regressor (resp. classifier) $f^{(b)}$ on $D_n^{(b)}$
4: end for
5: Aggregate $f^{(1)}, \dots, f^{(B)}$ into a single regressor (resp. classifier) using an average (resp. majority vote)
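A minimal sketch of Algorithm 1 for regression, using scikit-learn decision trees as the base learners ($B$ and the helper names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=100, random_state=0):
    """Fit B trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Aggregate by averaging the B predictions (majority vote for classification)."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```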
Random forests combine the following ingredients
1. Regression or classification trees (as “weak” learners). These trees can be maximally grown in the random forest algorithm
2. Bagging of such trees: each tree is trained on a sample with replacement of the dataset
Instead of a single tree, it uses an average of the predictions given by several trees: this reduces the noise and the variance of a single tree
This is even more true if the trees are not correlated: bagging helps in reducing the correlation of the trees
3. Column subsampling: only a random subset of features is tried when looking for splits
Also called “feature bagging”
Reduces even further the correlation between trees
In particular when some features are particularly strong
ExtraTrees: extremely randomized trees
A variant of random forests
Features are selected using some impurity (Gini, Entropy)
Thresholds are selected uniformly at random in the feature range
Faster; random thresholds act as a form of regularization
(trees are combined and averaged: not so bad to select the thresholds at random...)
Using scikit-learn: decision functions of
tree.DecisionTreeClassifier
ensemble.{ExtraTreesClassifier,RandomForestClassifier}
[Figure: decision functions and test accuracies of the three classifiers on toy datasets (panels: Input data, Decision Tree, Extra Trees, Random Forest).]
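A sketch of how such a comparison can be set up; the toy dataset (make_moons) and settings are assumptions, not necessarily what the slide used:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Extra trees":   ExtraTreesClassifier(n_estimators=100, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    # Test accuracy of each classifier
    print(name, model.fit(X_train, y_train).score(X_test, y_test))
```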
Boosting for the binary classification problem
Features $x_i \in \mathbb{R}^d$
Labels $y_i \in \mathcal{Y}$ with $\mathcal{Y} = \{-1, 1\}$ for binary classification and $\mathcal{Y} = \mathbb{R}$ for regression
Weak classifier
We have a finite set $\mathcal{H}$ of “weak classifiers”
Each weak classifier $h : \mathbb{R}^d \to \mathcal{Y}$ from $\mathcal{H}$ is a very simple classifier
Weak classifier examples
Decision tree that splits a single feature once: “horizontal” or “vertical” hyperplanes. It's called a stump
A stump is a tree of depth 1
For regression: linear regression model using one or two features
For classification: logistic regression using one or two features
A strong learner
Linearly combine the weak learners to get a stronger one
$$g_B(x) = \sum_{b=1}^{B} \eta_b h_b(x),$$
where $\eta_b \in \mathbb{R}$.
This principle is called boosting
Each b = 1, . . . ,B is called a boosting step
It is an ensemble method: it combines many weak learners to improve performance
Gradient boosting
General-purpose method: many choices of losses, many choices of weak learners
An example is AdaBoost: ADAptive BOOSTing, for binary classification
Look for $h_1, \dots, h_B \in \mathcal{H}$ and $\eta_1, \dots, \eta_B \in \mathbb{R}$ such that
$$g_B(x) = \sum_{b=1}^{B} \eta_b h_b(x)$$
minimizes an empirical risk
$$\frac{1}{n} \sum_{i=1}^{n} \ell(y_i, g_B(x_i)),$$
where $\ell$ is a loss (least-squares, logistic, etc.)
Problem.
Even given $\eta_1, \dots, \eta_B \in \mathbb{R}$, we need to minimize over $|\mathcal{H}|^B$ combinations to find $h_1, \dots, h_B$
The size of $\mathcal{H}$ is typically $O(d)$ (number of features)
Gradient boosting is a greedy algorithm
Iterative algorithm: at step $b + 1$, look for $h_{b+1}$ and $\eta_{b+1}$ such that
$$(\eta_{b+1}, h_{b+1}) = \operatorname{argmin}_{\eta \in \mathbb{R},\, h \in \mathcal{H}} \sum_{i=1}^{n} \ell(y_i, g_b(x_i) + \eta h(x_i))$$
and put $g_{b+1} = g_b + \eta_{b+1} h_{b+1}$.
This defines a sequence $(\eta_1, g_1), \dots, (\eta_B, g_B)$.
Still a problem
Exact minimization at each step is too hard
Gradient boosting idea
Replace exact minimization by a gradient step
Approximate the gradient step by an element of H
Gradient boosting. Replace
$$\operatorname{argmin}_{h \in \mathcal{H}} \sum_{i=1}^{n} \ell(y_i, g_b(x_i) + \eta h(x_i))$$
by
$$\operatorname{argmin}_{u \in \mathbb{R}^n} \sum_{i=1}^{n} \ell(y_i, g_b(x_i) + u_i)$$
and don't minimize, but do a step in the direction given by
$$\delta_b = \Big( \nabla_u \sum_{i=1}^{n} \ell(y_i, g_b(x_i) + u_i) \Big)\Big|_{u=0},$$
i.e. $\delta_b(i) = \ell'(y_i, g_b(x_i))$ where $\ell'(y, z) = \partial \ell(y, z)/\partial z$. If $\ell(y, z) = \frac{1}{2}(y - z)^2$ this is given by
$$\delta_b = \begin{bmatrix} g_b(x_1) - y_1 & \cdots & g_b(x_n) - y_n \end{bmatrix}^\top$$
Gradient boosting
Obviously $\delta_b \neq \begin{bmatrix} h(x_1) & \cdots & h(x_n) \end{bmatrix}^\top$ for all $h \in \mathcal{H}$.
So, we find an approximation from $\mathcal{H}$:
$$(h, \nu) = \operatorname{argmin}_{h \in \mathcal{H},\, \nu \in \mathbb{R}} \sum_{i=1}^{n} \big( \nu h(x_i) - \delta_b(i) \big)^2.$$
For least-squares this means
$$(h, \nu) = \operatorname{argmin}_{h \in \mathcal{H},\, \nu \in \mathbb{R}} \sum_{i=1}^{n} \big( \nu h(x_i) - g_b(x_i) + y_i \big)^2,$$
meaning that we look for a weak learner $h$ that corrects the current residuals obtained from the current $g_b$
Gradient boosting
1: Put $g_0 = m$ with $m = \operatorname{argmin}_{m \in \mathbb{R}} \sum_{i=1}^{n} \ell(y_i, m)$
2: for $b = 0, \dots, B - 1$ do
3: $\quad \delta_b(i) \leftarrow -\big( \nabla_u \ell(y_i, g_b(x_i) + u) \big)\big|_{u=0}$ for $i = 1, \dots, n$
4: $\quad h_{b+1} \leftarrow h$ with $(h, \nu) = \operatorname{argmin}_{h \in \mathcal{H},\, \nu \in \mathbb{R}} \sum_{i=1}^{n} \big( \nu h(x_i) - \delta_b(i) \big)^2$
5: $\quad \eta_{b+1} \leftarrow \operatorname{argmin}_{\eta \in \mathbb{R}} \sum_{i=1}^{n} \ell(y_i, g_b(x_i) + \eta h_{b+1}(x_i))$
6: $\quad g_{b+1} \leftarrow g_b + \eta_{b+1} h_{b+1}$
7: end for
8: return the boosting learner $g_B(x) = \sum_{b=1}^{B} \eta_b h_b(x)$.
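A minimal sketch of this algorithm for the least-squares loss, with regression stumps from scikit-learn as the weak learners and a fixed step size in place of the exact line-search step (names and default values are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, B=100, eta=0.1):
    """Least-squares gradient boosting with stumps: each stump fits the current residuals."""
    g0 = y.mean()                              # argmin_m sum_i (y_i - m)^2
    pred = np.full_like(y, g0, dtype=float)
    stumps = []
    for _ in range(B):
        residuals = y - pred                   # negative gradient of the squared loss
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
        pred += eta * stump.predict(X)         # gradient step approximated within H
        stumps.append(stump)
    return g0, stumps

def gradient_boosting_predict(g0, stumps, X, eta=0.1):
    return g0 + eta * sum(s.predict(X) for s in stumps)
```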
Adaboost
Gradient boosting with the exponential loss $\ell(y, z) = e^{-yz}$
(binary classification only)
Construct recursively
$$g_{b+1}(x) = g_b(x) + \eta_{b+1} h_{b+1}(x)$$
where
$$(h_{b+1}, \eta_{b+1}) = \operatorname{argmin}_{h \in \mathcal{H},\, \eta \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i (g_b(x_i) + \eta h(x_i))}$$
Adaboost
Quite surprisingly, this leads to an easy and explicit algorithm
Algorithm 2 Adaboost
1: Put $p_1(i) = 1/n$ for $i = 1, \dots, n$
2: for $b = 1, \dots, B$ do
3: $\quad h_b \in \operatorname{argmin}_{h \in \mathcal{H}} \sum_{i=1}^{n} p_b(i) \mathbf{1}_{y_i h(x_i) < 0}$
4: $\quad \varepsilon_b \leftarrow \sum_{i=1}^{n} p_b(i) \mathbf{1}_{y_i h_b(x_i) < 0}$
5: $\quad \eta_b \leftarrow \frac{1}{2} \log\big( \frac{1 - \varepsilon_b}{\varepsilon_b} \big)$
6: $\quad Z_b \leftarrow 2 \sqrt{\varepsilon_b (1 - \varepsilon_b)}$
7: $\quad p_{b+1}(i) \leftarrow p_b(i)\, e^{-\eta_b y_i h_b(x_i)} / Z_b$
8: end for
9: return a boosting classifier $g_B(x) = \mathrm{sgn}\big( \sum_{b=1}^{B} \eta_b h_b(x) \big)$
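A minimal sketch of Algorithm 2, using depth-1 scikit-learn trees fitted with the weights $p_b$ as the weak classifiers; labels are assumed to be in {-1, +1} and B is an arbitrary choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, B=50):
    """AdaBoost with stumps; y must take values in {-1, +1}."""
    n = len(y)
    p = np.full(n, 1.0 / n)                 # p_1(i) = 1/n
    stumps, etas = [], []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        h = stump.predict(X)
        eps = p[h != y].sum()               # weighted classification error
        eps = np.clip(eps, 1e-12, 1 - 1e-12)  # avoid division by zero in the sketch
        eta = 0.5 * np.log((1 - eps) / eps)
        p = p * np.exp(-eta * y * h)        # re-weight: up-weight the mistakes
        p /= p.sum()                        # normalisation (the role of Z_b)
        stumps.append(stump)
        etas.append(eta)
    return stumps, etas

def adaboost_predict(stumps, etas, X):
    scores = sum(eta * s.predict(X) for s, eta in zip(stumps, etas))
    return np.sign(scores)
```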
Choice of hyper-parameters
For the dictionary $\mathcal{H}$:
If $\mathcal{H}$ contains trees, choose their depth (usually small... no more than four or five)
If $\mathcal{H}$ contains generalized linear models (logistic regression), select the number of features (usually one or two)
Number of iterations $B$ (usually a few hundred to a few thousand, until the test error stops improving)
Check XGBoost's documentation for more details on the hyper-parameters
State-of-the-art implementations of Gradient Boosting
With many extra clever tricks...
XGBoost
https://xgboost.readthedocs.io/en/latest/
https://github.com/dmlc/xgboost
Has been state-of-the-art for years
LightGBM
https://lightgbm.readthedocs.io/en/latest/
https://github.com/Microsoft/LightGBM
Faster (and better?) than XGBoost
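A minimal usage sketch with XGBoost's scikit-learn interface on synthetic data (hyper-parameter values are arbitrary placeholders, not recommendations from the course):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor   # pip install xgboost

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shallow trees, small learning rate, a few hundred boosting rounds
model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data
```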
Grandmother's recipe. If you have “tabular” data (no images, signals, text, etc.)
Spend time on feature engineering
Spend time on feature engineering
...
Spend time on feature engineering
Try out RF or GBM methods before diving into a deep learning method (tomorrow)
This afternoon: practical session
Prediction of load curves
Continuous labels / time series
Using several things, including XGBoost and LightGBM