![Page 1: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/1.jpg)
Gradient Boosted Regression Trees
scikit
Peter Prettenhofer (@pprett)DataRobot
Gilles Louppe (@glouppe)Universite de Liege, Belgium
![Page 2: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/2.jpg)
Motivation
![Page 3: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/3.jpg)
Motivation
![Page 4: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/4.jpg)
Outline
1 Basics
2 Gradient Boosting
3 Gradient Boosting in Scikit-learn
4 Case Study: California housing
![Page 5: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/5.jpg)
About us
Peter
• @pprett
• Python & ML ∼ 6 years
• sklearn dev since 2010
Gilles
• @glouppe
• PhD student (Liege,Belgium)
• sklearn dev since 2011Chief tree hugger
![Page 6: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/6.jpg)
Outline
1 Basics
2 Gradient Boosting
3 Gradient Boosting in Scikit-learn
4 Case Study: California housing
![Page 7: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/7.jpg)
Machine Learning 101
• Data comes as...
• A set of examples {(xi , yi )|0 ≤ i < n samples}, with
• Feature vector x ∈ Rn features, and
• Response y ∈ R (regression) or y ∈ {−1, 1} (classification)
• Goal is to...
• Find a function y = f (x)
• Such that error L(y , y) on new (unseen) x is minimal
![Page 8: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/8.jpg)
Classification and Regression Trees [Breiman et al, 1984]
MedInc <= 5.04
MedInc <= 3.07 MedInc <= 6.82
AveRooms <= 4.31 AveOccup <= 2.37
1.62 1.16 2.79 1.88
AveOccup <= 2.74 MedInc <= 7.82
3.39 2.56 3.73 4.57
sklearn.tree.DecisionTreeClassifier|Regressor
![Page 9: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/9.jpg)
Function approximation with Regression Trees
0 2 4 6 8 10x
8
6
4
2
0
2
4
6
8
10y
ground truthRT d=1RT d=3RT d=20
Deprecated
• Nowadays seldom used alone
• Ensembles: Random Forest, Bagging, or Boosting(see sklearn.ensemble)
![Page 10: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/10.jpg)
Function approximation with Regression Trees
0 2 4 6 8 10x
8
6
4
2
0
2
4
6
8
10y
ground truthRT d=1RT d=3RT d=20
Deprecated
• Nowadays seldom used alone
• Ensembles: Random Forest, Bagging, or Boosting(see sklearn.ensemble)
![Page 11: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/11.jpg)
Outline
1 Basics
2 Gradient Boosting
3 Gradient Boosting in Scikit-learn
4 Case Study: California housing
![Page 12: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/12.jpg)
Gradient Boosted Regression Trees
Advantages
• Heterogeneous data (features measured on different scale),
• Supports different loss functions (e.g. huber),
• Automatically detects (non-linear) feature interactions,
Disadvantages
• Requires careful tuning
• Slow to train (but fast to predict)
• Cannot extrapolate
![Page 13: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/13.jpg)
Boosting
AdaBoost [Y. Freund & R. Schapire, 1995]
• Ensemble: each member is an expert on the errors of itspredecessor
• Iteratively re-weights training examples based on errors
2 1 0 1 2 3x0
2
1
0
1
2
x1
2 1 0 1 2 3x0
2 1 0 1 2 3x0
2 1 0 1 2 3x0
sklearn.ensemble.AdaBoostClassifier|Regressor
Huge success
• Viola-Jones Face Detector (2001)
• Freund & Schapire won the Godel prize 2003
![Page 14: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/14.jpg)
Boosting
AdaBoost [Y. Freund & R. Schapire, 1995]
• Ensemble: each member is an expert on the errors of itspredecessor
• Iteratively re-weights training examples based on errors
2 1 0 1 2 3x0
2
1
0
1
2
x1
2 1 0 1 2 3x0
2 1 0 1 2 3x0
2 1 0 1 2 3x0
sklearn.ensemble.AdaBoostClassifier|Regressor
Huge success
• Viola-Jones Face Detector (2001)
• Freund & Schapire won the Godel prize 2003
![Page 15: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/15.jpg)
Gradient Boosting [J. Friedman, 1999]
Statistical view on boosting
• ⇒ Generalization of boosting to arbitrary loss functions
Residual fitting
2 6 10x
2.0
1.5
1.0
0.5
0.0
0.5
1.0
1.5
2.0
2.5
y
Ground truth
2 6 10x
∼
tree 1
2 6 10x
+
tree 2
2 6 10x
+
tree 3
sklearn.ensemble.GradientBoostingClassifier|Regressor
![Page 16: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/16.jpg)
Gradient Boosting [J. Friedman, 1999]
Statistical view on boosting
• ⇒ Generalization of boosting to arbitrary loss functions
Residual fitting
2 6 10x
2.0
1.5
1.0
0.5
0.0
0.5
1.0
1.5
2.0
2.5
y
Ground truth
2 6 10x
∼
tree 1
2 6 10x
+
tree 2
2 6 10x
+
tree 3
sklearn.ensemble.GradientBoostingClassifier|Regressor
![Page 17: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/17.jpg)
Functional Gradient Descent
Least Squares Regression
• Squared loss: L(yi , f (xi )) = (yi − f (xi ))2
• The residual ∼ the (negative) gradient ∂L(yi , f (xi ))∂f (xi )
Steepest Descent
• Regression trees approximate the (negative) gradient
• Each tree is a successive gradient descent step
4 3 2 1 0 1 2 3 4y−f(x)
0
1
2
3
4
5
6
7
8
L(y,f
(x))
Squared errorAbsolute errorHuber error
4 3 2 1 0 1 2 3 4y ·f(x)
0
1
2
3
4
5
6
7
8
L(y,f
(x))
Zero-one lossLog lossExponential loss
![Page 18: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/18.jpg)
Functional Gradient Descent
Least Squares Regression
• Squared loss: L(yi , f (xi )) = (yi − f (xi ))2
• The residual ∼ the (negative) gradient ∂L(yi , f (xi ))∂f (xi )
Steepest Descent
• Regression trees approximate the (negative) gradient
• Each tree is a successive gradient descent step
4 3 2 1 0 1 2 3 4y−f(x)
0
1
2
3
4
5
6
7
8
L(y,f
(x))
Squared errorAbsolute errorHuber error
4 3 2 1 0 1 2 3 4y ·f(x)
0
1
2
3
4
5
6
7
8
L(y,f
(x))
Zero-one lossLog lossExponential loss
![Page 19: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/19.jpg)
Outline
1 Basics
2 Gradient Boosting
3 Gradient Boosting in Scikit-learn
4 Case Study: California housing
![Page 20: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/20.jpg)
GBRT in scikit-learn
How to use it
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2
>>> X, y = make_hastie_10_2(n_samples=10000)
>>> est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
>>> est.fit(X, y)
...
>>> # get predictions
>>> pred = est.predict(X)
>>> est.predict_proba(X)[0] # class probabilities
array([ 0.67, 0.33])
Implementation
• Written in pure Python/Numpy (easy to extend).
• Builds on top of sklearn.tree.DecisionTreeRegressor (Cython).
• Custom node splitter that uses pre-sorting (better for shallow trees).
![Page 21: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/21.jpg)
Examplefrom sklearn.ensemble import GradientBoostingRegressor
est = GradientBoostingRegressor(n_estimators=2000, max_depth=1).fit(X, y)
for pred in est.staged_predict(X):
plt.plot(X[:, 0], pred, color=’r’, alpha=0.1)
0 2 4 6 8 10x
8
6
4
2
0
2
4
6
8
10
y
High bias - low variance
Low bias - high variance
ground truthRT d=1RT d=3GBRT d=1
![Page 22: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/22.jpg)
Model complexity & Overfittingtest_score = np.empty(len(est.estimators_))
for i, pred in enumerate(est.staged_predict(X_test)):
test_score[i] = est.loss_(y_test, pred)
plt.plot(np.arange(n_estimators) + 1, test_score, label=’Test’)
plt.plot(np.arange(n_estimators) + 1, est.train_score_, label=’Train’)
0 200 400 600 800 1000n_estimators
0.0
0.5
1.0
1.5
2.0
Err
or
Lowest test error
train-test gap
TestTrain
Regularization
GBRT provides a number of knobs to controloverfitting
• Tree structure
• Shrinkage
• Stochastic Gradient Boosting
![Page 23: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/23.jpg)
Model complexity & Overfittingtest_score = np.empty(len(est.estimators_))
for i, pred in enumerate(est.staged_predict(X_test)):
test_score[i] = est.loss_(y_test, pred)
plt.plot(np.arange(n_estimators) + 1, test_score, label=’Test’)
plt.plot(np.arange(n_estimators) + 1, est.train_score_, label=’Train’)
0 200 400 600 800 1000n_estimators
0.0
0.5
1.0
1.5
2.0
Err
or
Lowest test error
train-test gap
TestTrain
Regularization
GBRT provides a number of knobs to controloverfitting
• Tree structure
• Shrinkage
• Stochastic Gradient Boosting
![Page 24: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/24.jpg)
Regularization: Tree structure
• The max depth of the trees controls the degree of features interactions
• Use min samples leaf to have a sufficient nr. of samples per leaf.
![Page 25: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/25.jpg)
Regularization: Shrinkage
• Slow learning by shrinking tree predictions with 0 < learning rate <= 1
• Lower learning rate requires higher n estimators
0 200 400 600 800 1000n_estimators
0.0
0.5
1.0
1.5
2.0
Err
or
Requires more trees
Lower test error
Test Train Test learning_rate=0.1
Train learning_rate=0.1
![Page 26: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/26.jpg)
Regularization: Stochastic Gradient Boosting• Samples: random subset of the training set (subsample)
• Features: random subset of features (max features)
• Improved accuracy – reduced runtime
0 200 400 600 800 1000n_estimators
0.0
0.5
1.0
1.5
2.0
Err
or
Even lower test error
Subsample alone does poorly
Train Test Train subsample=0.5, learning_rate=0.1
Test subsample=0.5, learning_rate=0.1
![Page 27: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/27.jpg)
Hyperparameter tuning
1. Set n estimators as high as possible (eg. 3000)
2. Tune hyperparameters via grid search.
from sklearn.grid_search import GridSearchCV
param_grid = {’learning_rate’: [0.1, 0.05, 0.02, 0.01],
’max_depth’: [4, 6],
’min_samples_leaf’: [3, 5, 9, 17],
’max_features’: [1.0, 0.3, 0.1]}
est = GradientBoostingRegressor(n_estimators=3000)
gs_cv = GridSearchCV(est, param_grid).fit(X, y)
# best hyperparameter setting
gs_cv.best_params_
3. Finally, set n estimators even higher and tunelearning rate.
![Page 28: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/28.jpg)
Outline
1 Basics
2 Gradient Boosting
3 Gradient Boosting in Scikit-learn
4 Case Study: California housing
![Page 29: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/29.jpg)
Case Study
California Housing dataset
• Predict log(medianHouseValue)
• Block groups in 1990 census
• 20.640 groups with 8 features(median income, median age, lat,lon, ...)
• Evaluation: Mean absolute erroron 80/20 split
Challenges
• Heterogeneous features
• Non-linear interactions
![Page 30: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/30.jpg)
Predictive accuracy & runtime
Train time [s] Test time [ms] MAE
Mean - - 0.4635Ridge 0.006 0.11 0.2756SVR 28.0 2000.00 0.1888RF 26.3 605.00 0.1620GBRT 192.0 439.00 0.1438
0 500 1000 1500 2000 2500 3000n_estimators
0.0
0.1
0.2
0.3
0.4
0.5
err
or
TestTrain
![Page 31: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/31.jpg)
Model interpretation
Which features are important?
>>> est.feature_importances_
array([ 0.01, 0.38, ...])
0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18Relative importance
HouseAge
Population
AveBedrms
Latitude
AveOccup
Longitude
AveRooms
MedInc
![Page 32: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/32.jpg)
Model interpretationWhat is the effect of a feature on the response?
from sklearn.ensemble import partial_dependence import as pd
features = [’MedInc’, ’AveOccup’, ’HouseAge’, ’AveRooms’,
(’AveOccup’, ’HouseAge’)]
fig, axs = pd.plot_partial_dependence(est, X_train, features,
feature_names=names)
1.5 3.0 4.5 6.0 7.5MedInc
0.4
0.2
0.0
0.2
0.4
0.6
Part
ial dependence
2.0 2.5 3.0 3.5 4.0 4.5AveOccup
0.4
0.2
0.0
0.2
0.4
0.6
Part
ial dependence
10 20 30 40 50 60HouseAge
0.4
0.2
0.0
0.2
0.4
0.6
Part
ial dependence
4 5 6 7 8AveRooms
0.4
0.2
0.0
0.2
0.4
0.6
Part
ial dependence
2.0 2.5 3.0 3.5 4.0AveOccup
10
20
30
40
50
House
Age
-0.1
2
-0.0
5
0.0
2
0.09
0.1
6
0.23
Partial dependence of house value on nonlocation featuresfor the California housing dataset
![Page 33: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/33.jpg)
Model interpretation
Automatically detects spatial effects
longitude
lati
tude
-1.54
-1.22
-0.91
-0.60
-0.28
0.03
0.34
0.66
0.97
part
ial dep.
on m
edia
n h
ouse
valu
e
longitudela
titu
de
-0.15
-0.07
0.01
0.09
0.17
0.25
0.33
0.41
0.49
0.57
part
ial dep.
on m
edia
n h
ouse
valu
e
![Page 34: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/34.jpg)
Summary
• Flexible non-parametric classification and regression technique
• Applicable to a variety of problems
• Solid, battle-worn implementation in scikit-learn
![Page 35: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/35.jpg)
Thanks! Questions?
![Page 36: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/36.jpg)
Benchmarks
0.00.20.40.60.81.01.2
Err
or
gbmsklearn-0.15
0.00.51.01.52.02.53.0
Tra
in t
ime
Arc
ene
Bost
on
Calif
orn
ia
Covty
pe
Exam
ple
10
.2
Expedia
Madelo
n
Sola
r
Spam
YahooLT
RC
bio
resp
dataset
0.0
0.2
0.4
0.6
0.8
1.0
Test
tim
e
![Page 37: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/37.jpg)
Tipps & Tricks 1
Input layout
Use dtype=np.float32 to avoid memory copies and fortan layout for slight
runtime benefit.
X = np.asfortranarray(X, dtype=np.float32)
![Page 38: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/38.jpg)
Tipps & Tricks 2
Feature interactionsGBRT automatically detects feature interactions but often explicit interactionshelp.
Trees required to approximate X1 − X2: 10 (left), 1000 (right).
x 0.00.2
0.40.6
0.81.0
y
0.00.2
0.40.6
0.81.0
x -
y
0.3
0.2
0.1
0.0
0.1
0.2
0.3
x 0.00.20.40.60.81.0y
0.00.2
0.40.6
0.81.0
x -
y
1.0
0.5
0.0
0.5
1.0
![Page 39: Gradient Boosted Regression Trees in scikit-learn](https://reader034.vdocuments.us/reader034/viewer/2022042507/54c64e304a7959ad7b8b4580/html5/thumbnails/39.jpg)
Tipps & Tricks 3
Categorical variables
Sklearn requires that categorical variables are encoded as numerics. Tree-based
methods work well with ordinal encoding:
df = pd.DataFrame(data={’icao’: [’CRJ2’, ’A380’, ’B737’, ’B737’]})
# ordinal encoding
df_enc = pd.DataFrame(data={’icao’: np.unique(df.icao,
return_inverse=True)[1]})
X = np.asfortranarray(df_enc.values, dtype=np.float32)