DATA 201 - Techniques in Data Science
Lecture 15: Ensemble Learning
Binh Nguyen
School of Mathematics and Statistics, Victoria University of Wellington
Adapted from “Hands-On Machine Learning with Scikit-Learn and TensorFlow” by Aurelien Geron
Table of contents
1. Voting Classifiers
2. Bagging and Pasting
3. Random Patches and Random Subspaces
4. Random Forests
5. Boosting
Voting Classifiers
Training diverse classifiers
• Suppose we have trained a few classifiers, each one achieving a
similar accuracy.
• There is a simple way to create a better classifier...
Hard voting classifier predictions
• Hard voting: aggregate the predictions of each classifier and predict
the class that gets the most votes. This voting classifier often
achieves a higher accuracy than the best classifier in the ensemble.
• If each classifier is a weak learner, the ensemble can still be a strong
learner, provided there are a sufficient number of weak learners and
they are sufficiently diverse.
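The claim about weak learners can be checked with a small simulation. This is only a sketch under assumptions that are not in the slides: a binary problem, 1,000 classifiers that are each right 51% of the time, and errors that are fully independent (real classifiers trained on the same data rarely are).

```python
import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1000   # number of weak learners in the ensemble (assumed)
n_instances = 2000     # number of simulated test instances (assumed)
p_correct = 0.51       # each weak learner is right 51% of the time (assumed)

# correct[i, j] is True when classifier j classifies instance i correctly.
# Independence between classifiers is the key (and optimistic) assumption.
correct = rng.random((n_instances, n_classifiers)) < p_correct

individual_accuracy = correct.mean()
# A hard-voting ensemble is right when more than half of its members are right.
majority_accuracy = (correct.sum(axis=1) > n_classifiers / 2).mean()

print(f"mean individual accuracy: {individual_accuracy:.3f}")
print(f"majority-vote accuracy:   {majority_accuracy:.3f}")
```

The majority vote comes out well above the accuracy of any single simulated learner. The gain shrinks when the classifiers make correlated errors.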
Example
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                  dual=False, fit_intercept=True,
                                  intercept_scaling=1, l1_ratio=None,
                                  max_iter=100, multi_class='auto',
                                  n_jobs=None, penalty='l2', random_state=42,
                                  solver='lbfgs', tol=0.0001, verbose=0,
                                  warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                  ccp_alpha=0.0, class_weight=None,
                                  crit...oob_score=False, random_state=42,
                                  verbose=0, warm_start=False)),
                             ('svc',
                              SVC(C=1.0, break_ties=False, cache_size=200,
                                  class_weight=None, coef0=0.0,
                                  decision_function_shape='ovr', degree=3,
                                  gamma='scale', kernel='rbf', max_iter=-1,
                                  probability=False, random_state=42,
                                  shrinking=True, tol=0.001, verbose=False))],
                 flatten_transform=True, n_jobs=None, voting='hard',
                 weights=None)
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912

log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train);

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92
Soft voting
• If all classifiers are able to estimate class probabilities (i.e., they have
a predict_proba() method), then we can predict the class with
the highest class probability, averaged over all the individual
classifiers ⇒ soft voting.
• Soft voting often achieves higher performance than hard voting
because it gives more weight to highly confident votes.
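A tiny worked example shows how the two voting rules can disagree. The probabilities below are made up for illustration: two classifiers weakly prefer class 0, while a third is very confident about class 1.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE instance from three classifiers.
# Columns are P(class 0) and P(class 1).
probas = np.array([
    [0.55, 0.45],   # classifier A: weakly prefers class 0
    [0.55, 0.45],   # classifier B: weakly prefers class 0
    [0.05, 0.95],   # classifier C: strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard vote:", hard_prediction)
print("soft vote:", soft_prediction, "with mean probabilities", mean_proba)
```

Hard voting picks class 0 (two votes to one), whereas soft voting picks class 1 because the confident vote dominates the average.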
Switching voting method
Now let's try a hard voting classifier again. We do not actually need to retrain the
classifier; we can just set voting to "hard":

voting_clf.voting = 'hard'

# This loop refits every classifier because it reuses the earlier evaluation loop;
# voting_clf itself would not need refitting after the switch.
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912
Comments
• Ensemble methods work best when the predictors are as independent
from one another as possible.
• One way to get diverse classifiers is to train them using very
different algorithms. This increases the chance that they will make
very different types of errors, improving the ensemble’s accuracy.
Bagging and Pasting
Introduction
• One way to get a diverse set of classifiers is to use very different
training algorithms.
• Another method is to use the same training algorithm for every
predictor, but to train them on different random subsets of the
training set.
- Bagging: when sampling is performed with replacement.
- Pasting: when sampling is performed without replacement.
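The difference between the two sampling schemes can be sketched with NumPy alone. The training-set size and subset sizes below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 1000        # size of a hypothetical training set
subset_size = 1000    # instances drawn for each bagging predictor

# Bagging: sample indices WITH replacement (a bootstrap sample).
bagging_idx = rng.choice(n_train, size=subset_size, replace=True)
# Pasting: sample indices WITHOUT replacement (here, half the training set).
pasting_idx = rng.choice(n_train, size=subset_size // 2, replace=False)

bagging_unique_fraction = np.unique(bagging_idx).size / n_train
print(f"bagging: {bagging_unique_fraction:.1%} of training instances appear at least once")
print(f"pasting: {np.unique(pasting_idx).size} unique indices out of {pasting_idx.size} drawn")
```

A bootstrap sample as large as the training set contains only about 63% of the distinct instances on average (1 - 1/e), with the rest being repeats. A pasting sample never repeats an instance.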
Pasting/bagging training set sampling and training
Bootstrap (sampling with replacement)
After training
• The ensemble makes a prediction for a new instance by aggregating
the predictions of all predictors.
• The aggregation function is typically the statistical mode for
classification, or the average for regression.
• Each individual predictor has a higher bias than if it were trained on
the original training set, but aggregation reduces both bias and
variance.
• Generally, the ensemble has a similar bias but a lower variance than
a single predictor trained on the original training set.
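The two aggregation functions can be written out by hand. This is a minimal sketch on made-up predictions from three predictors; the helper name majority_vote is hypothetical and not part of any library.

```python
import numpy as np

# Hypothetical predictions: one row per predictor, one column per instance.
class_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])
value_preds = np.array([
    [2.0, 3.5],
    [2.4, 3.1],
    [1.9, 3.3],
])

def majority_vote(preds):
    """Return the most frequent class label in each column (statistical mode)."""
    n_classes = int(preds.max()) + 1
    # Count the votes per class for every instance (column).
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return counts.argmax(axis=0)

ensemble_classes = majority_vote(class_preds)   # classification: mode
ensemble_values = value_preds.mean(axis=0)      # regression: average

print("classification:", ensemble_classes)
print("regression:    ", ensemble_values)
```

Ties in the vote are broken in favour of the lowest class label here, which is simply how argmax behaves; other tie-breaking rules are possible.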
Bagging in sklearn
sklearn.ensemble.BaggingClassifier

class sklearn.ensemble.BaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)

A Bagging classifier.

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting [Rb1846455d0e5-1]. If samples are drawn with replacement, then the method is known as Bagging [Rb1846455d0e5-2]. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces [Rb1846455d0e5-3]. Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches [Rb1846455d0e5-4].

Read more in the User Guide.

Parameters:
base_estimator : object or None, optional (default=None)
    The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
n_estimators : int, optional (default=10)
    The number of base estimators in the ensemble.
max_samples : int or float, optional (default=1.0)
    The number of samples to draw from X to train each base estimator.
    - If int, then draw max_samples samples.
    - If float, then draw max_samples * X.shape[0] samples.
max_features : int or float, optional (default=1.0)
    The number of features to draw from X to train each base estimator.
    - If int, then draw max_features features.
    - If float, then draw max_features * X.shape[1] features.
bootstrap : boolean, optional (default=True)
    Whether samples are drawn with replacement. If False, sampling without replacement is performed.
bootstrap_features : boolean, optional (default=False)
    Whether features are drawn with replacement.
Example - Bagging
In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [3]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.904

In [4]:
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))

0.856
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

In [ ]:
fig = plt.figure(figsize=(11,4))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=14)
plt.subplot(122)
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=14)
plt.show()
fig.savefig("DT_without_and_with_bagging.pdf", bbox_inches='tight')
A single Decision Tree versus a bagging ensemble of 500 trees
[Figure: two decision-boundary plots with axes x1 and x2. Left panel: Decision Tree. Right panel: Decision Trees with Bagging.]
Comments
• The BaggingClassifier automatically performs soft voting
instead of hard voting if the base classifier can estimate class
probabilities (i.e., if it has a predict_proba() method).
• The ensemble has a comparable bias but a smaller variance.
• Bootstrapping introduces a bit more diversity in the subsets that
each predictor is trained on, so bagging ends up with a slightly
higher bias than pasting.
• However, this also means that predictors end up being less correlated
so the ensemble’s variance is reduced.
• Overall, bagging is generally preferred.
• Cross-validation should be done to evaluate both bagging and
pasting and select the one that works best.
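The last point can be sketched with cross_val_score. This reuses the moons data from the examples, but with 100 trees and 5 folds, which are arbitrary choices made here to keep the run short.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

cv_scores = {}
for name, bootstrap in (("bagging", True), ("pasting", False)):
    clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=100,       # fewer trees than in the lecture, to keep this quick
        max_samples=100,
        bootstrap=bootstrap,    # True: with replacement, False: without
        random_state=42,
    )
    # Mean accuracy over 5 cross-validation folds.
    cv_scores[name] = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

for name, score in cv_scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```

On a dataset this small the two scores are usually close, so the comparison is best repeated with several random seeds before choosing one method.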
Example - Pasting
In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=False, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [3]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.912

In [4]:
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))

0.856
Random Patches and Random Subspaces
Introduction
• The BaggingClassifier class also supports sampling the features.
This is controlled by two hyper-parameters: max_features and
bootstrap_features.
• Random Patches: sampling both training instances and features.
• Random Subspaces: keeping all training instances (i.e.,
bootstrap=False and max_samples=1.0) but sampling features
(i.e., bootstrap_features=True and/or max_features smaller
than 1.0).
• Sampling features results in even more predictor diversity, trading a
bit more bias for a lower variance.
• This is useful when dealing with high-dimensional inputs (such as
images).
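The two configurations might look as follows with BaggingClassifier. The synthetic 20-feature dataset and the specific fractions (0.7 and 0.5) are assumptions chosen only to illustrate the settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 20-feature dataset, used only to illustrate the settings.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
)

# Random Subspaces: keep all training instances, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False,
    random_state=42,
)

patches_clf.fit(X, y)
subspaces_clf.fit(X, y)

# estimators_features_ lists the feature indices each base estimator was given.
n_features_per_tree = len(subspaces_clf.estimators_features_[0])
print("features seen by the first subspace tree:", n_features_per_tree)
```

With max_features=0.5 and 20 input features, each tree in the subspace ensemble is trained on 10 of them.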
Random Forests
Introduction
• A Random Forest is an ensemble of Decision Trees, generally trained
via the bagging method (or sometimes pasting), typically with
max_samples set to the size of the training set.
• Instead of building a BaggingClassifier and passing it a
DecisionTreeClassifier, we can instead use the
RandomForestClassifier and RandomForestRegressor classes.
Example
In [11]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16, random_state=42),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [12]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

In [13]:
np.sum(y_pred == y_pred_rf) / len(y_pred)  # almost identical predictions

Out[13]: 0.976
Boosting
Introduction
• A boosting method trains predictors sequentially, each trying to
correct its predecessor.
• The most popular boosting methods are AdaBoost (Adaptive
Boosting) and Gradient Boosting.
1. AdaBoost
• One way for a new predictor to correct its predecessor is to pay a bit
more attention to the training instances that the predecessor
underfitted.
- A first base classifier is trained and used to make predictions on the
training set.
- The relative weight of misclassified training instances is then
increased.
- A second classifier is trained using the updated weights and again
makes predictions on the training set; the weights are updated again, and so on.
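The steps above can be sketched from scratch with decision stumps. The slides do not state the update formulas, so the ones below (weighted error rate, predictor weight alpha = learning_rate * log((1 - error) / error), multiplying the weights of misclassified instances by exp(alpha)) are the usual textbook formulation of discrete AdaBoost for two classes, taken here as an assumption. The helper ada_predict is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X)
learning_rate = 1.0
weights = np.full(m, 1 / m)        # start with equal instance weights
stumps, alphas = [], []

for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=weights)
    missed = stump.predict(X) != y

    # Weighted error rate of this predictor (the weights sum to 1).
    error = weights[missed].sum()
    if error == 0 or error >= 0.5:
        break                       # perfect, or no better than chance: stop
    # Predictor weight: the more accurate the stump, the larger its say.
    alpha = learning_rate * np.log((1 - error) / error)

    # Boost the weights of the misclassified instances, then renormalise.
    weights[missed] *= np.exp(alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def ada_predict(X_new):
    """Weighted vote: each stump votes for its predicted class with weight alpha."""
    votes = np.zeros((len(X_new), 2))
    for stump, alpha in zip(stumps, alphas):
        votes[np.arange(len(X_new)), stump.predict(X_new)] += alpha
    return votes.argmax(axis=1)

train_accuracy = (ada_predict(X) == y).mean()
first_stump_accuracy = (stumps[0].predict(X) == y).mean()
print(f"first stump: {first_stump_accuracy:.3f}, ensemble: {train_accuracy:.3f}")
```

In practice AdaBoostClassifier does all of this internally; the loop is only meant to make the reweighting visible.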
Building consecutive predictors
from sklearn.svm import SVC

m = len(X_train)

fig = plt.figure(figsize=(11, 4))
for subplot, learning_rate in ((121, 1), (122, 0.5)):
    sample_weights = np.ones(m)
    plt.subplot(subplot)
    for i in range(5):
        svm_clf = SVC(kernel="rbf", C=0.05, gamma="scale", random_state=42)
        svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
        y_pred = svm_clf.predict(X_train)
        sample_weights[y_pred != y_train] *= (1 + learning_rate)
        plot_decision_boundary(svm_clf, X, y, alpha=0.2)
        plt.title("learning_rate = {}".format(learning_rate), fontsize=16)
    if subplot == 121:
        plt.text(-0.7, -0.50, "1", fontsize=14)
        plt.text(-0.6, -0.10, "2", fontsize=14)
        plt.text(-0.5, 0.30, "3", fontsize=14)
        plt.text(-0.4, 0.55, "4", fontsize=14)
        plt.text(-0.3, 1.00, "5", fontsize=14)

plt.show()
Decision boundaries of consecutive predictors
[Figure: two panels with axes x1 and x2. Left panel: learning_rate = 1, with the five consecutive decision boundaries labelled 1 to 5. Right panel: learning_rate = 0.5.]
• The plot on the right represents the same sequence of predictors
except that the learning rate is halved.
• This means the misclassified instance weights are boosted half as
much at every iteration.
AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
plot_decision_boundary(ada_clf, X, y)

[Figure: decision boundary of ada_clf, with axes x1 and x2.]
2. Gradient Boosting
• Combines multiple decision trees to create a more powerful model.
• These models can be used for regression and classification.
• Works by building trees in a serial manner, where each tree tries to
correct the mistakes of the previous one.
• Strong pre-pruning is used.
• Often uses very shallow trees, of depth 1 to 5.
• Besides the pre-pruning and the number of trees in the ensemble,
another important parameter is the learning rate, which controls
how strongly each tree tries to correct the mistakes of the previous
trees.
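The idea of each tree correcting its predecessors can be sketched for regression with squared error, where the mistakes to correct are simply the residuals. The toy data, the tree depth of 2, the 20 trees and the learning rate of 0.5 are all assumptions made for this illustration; GradientBoostingRegressor and GradientBoostingClassifier implement a more general version of the same loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic toy data, made up for this illustration.
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

learning_rate = 0.5
n_trees = 20
prediction = np.full_like(y, y.mean())   # start from a constant model
trees, mse_history = [], []

for _ in range(n_trees):
    residuals = y - prediction            # what the ensemble still gets wrong
    # Shallow tree (strong pre-pruning) fitted to the current residuals.
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residuals)
    # Add a damped correction; learning_rate controls how strong it is.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE after 1 tree:  {mse_history[0]:.5f}")
print(f"training MSE after {n_trees} trees: {mse_history[-1]:.5f}")
```

A smaller learning_rate makes each correction weaker, so more trees are needed to reach the same training error.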
Example
In [11]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

In [5]:
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

Accuracy on training set: 1.000
Accuracy on test set: 0.965

In [6]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

Accuracy on training set: 0.991
Accuracy on test set: 0.972

In [7]:
gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
gbrt.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

Accuracy on training set: 0.988
Accuracy on test set: 0.965
Feature importances
In [ ]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)

fig = plt.figure(figsize=(8, 6))
n_features = cancer.data.shape[1]
plt.barh(np.arange(n_features), gbrt.feature_importances_, align='center')
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.ylim(-1, n_features)
fig.savefig("gbc_cancer.pdf", bbox_inches='tight')

[Figure: horizontal bar chart of the feature importances for the breast cancer features, from mean radius to worst fractal dimension. x-axis: Feature importance (0.00 to 0.30); y-axis: Feature.]
More about Gradient Boosting
• Gradient boosted decision trees are among the most powerful and
widely used models for supervised learning.
• They require careful tuning of the parameters and may take a long
time to train.
• The algorithm works well without scaling and on a mixture of binary
and continuous features.
• It often does not work well on high-dimensional sparse data.
• n_estimators and learning_rate are interconnected (a lower
learning_rate means that more trees are needed to build a model
of similar complexity).
• In contrast to random forests, where a higher n_estimators value is
always better, increasing n_estimators in gradient boosting leads
to a more complex model, which may lead to overfitting.
• Another important parameter is max_depth (or alternatively
max_leaf_nodes), to reduce the complexity of each tree.
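One way to see how n_estimators and learning_rate interact is to track test accuracy as trees are added, using staged_predict. The sketch below reuses the breast cancer split from the example; the three learning rates, the 200 trees and max_depth=1 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

staged_test_accuracy = {}
for learning_rate in (1.0, 0.1, 0.01):
    gbrt = GradientBoostingClassifier(
        n_estimators=200, learning_rate=learning_rate, max_depth=1, random_state=0)
    gbrt.fit(X_train, y_train)
    # staged_predict yields the ensemble's predictions after 1, 2, ..., n trees.
    staged_test_accuracy[learning_rate] = [
        accuracy_score(y_test, y_pred) for y_pred in gbrt.staged_predict(X_test)
    ]

for learning_rate, scores in staged_test_accuracy.items():
    print(f"learning_rate={learning_rate}: "
          f"10 trees -> {scores[9]:.3f}, 200 trees -> {scores[-1]:.3f}")
```

Plotting each list against the number of trees shows where adding more trees stops helping for a given learning rate, which is a reasonable starting point for tuning both parameters together.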
Questions?