Decision Trees and Random Forests
Dalya Baron (Tel Aviv University)
XXX Winter School, November 2018
Decision Trees
Decision tree: a non-parametric model, constructed during training, which is described by a tree-like graph. It can be used for classification or regression.
Decision Tree Construction
Input training set: a list of objects with measured features and known labels.
Classes: “black” and “brown” galaxies.
Measured features: r (arcsec), B (mag), V (mag).
[Figure: three panels showing N as a function of r (arcsec), B (mag), and V (mag).]
[Figure: panels of N versus r (arcsec), B (mag), and V (mag), with a first split marked at r = 5″ that divides the objects into r < 5″ and r > 5″.]
Classes: “black” and “brown” galaxies.
Measured features: r (arcsec), g (mag), i (mag).
[Figure: decision tree built from the training set, with splits at r = 5″, g = 14 mag, i = 13 mag, and r = 4.8″.]
Stop criterion (simplest version): each terminal node contains objects from a single class.
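As a minimal sketch of the construction step, the snippet below grows a tree with scikit-learn's `DecisionTreeClassifier` until every terminal node is pure, which is what happens when no depth limit is set. The galaxy catalogue is synthetic and the labeling rule is invented purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Hypothetical training set: 200 galaxies with r (arcsec), g (mag), i (mag).
n = 200
X = np.column_stack([
    rng.uniform(1.0, 10.0, n),   # r (arcsec)
    rng.uniform(12.0, 16.0, n),  # g (mag)
    rng.uniform(11.0, 15.0, n),  # i (mag)
])
# Invented labeling rule, used only so that there are known labels to train on.
y = np.where((X[:, 0] < 5.0) & (X[:, 2] > 13.0), "black", "brown")

# With max_depth=None and default settings, splitting continues
# until each terminal node holds objects from a single class.
tree = DecisionTreeClassifier(criterion="gini", max_depth=None, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["r", "g", "i"]))
train_accuracy = tree.score(X, y)
```

Because every terminal node is pure, the tree reproduces the training labels exactly. That says nothing yet about how it performs on unseen objects.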
Decision Tree Prediction
Input set: a list of objects with measured features and unknown labels.
Objects are propagated through the tree according to their measured features.
[Figure: the trained decision tree (splits at r = 5″, g = 14 mag, i = 13 mag, and r = 4.8″), with two terminal nodes labeled class “brown” and three labeled class “black”.]
Example: what is the predicted label for a galaxy with the measured features r = 3″, g = 15 mag, i = 14 mag?
Prediction: "black" galaxy!
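The propagation step can be sketched in a few lines of plain Python. The split thresholds below are taken from the slide, but the layout of the tree is an assumption, since the structure of the figure cannot be recovered from the text; the names `TREE` and `predict` are hypothetical.

```python
# Hypothetical tree: thresholds come from the slide, the layout is assumed.
# Internal nodes are (feature, threshold, branch_if_below, branch_otherwise);
# terminal nodes are plain class names.
TREE = ("r", 5.0,
        ("i", 13.0,
         "brown",
         ("g", 14.0, "brown", "black")),
        "black")


def predict(node, galaxy):
    """Walk from the root to a terminal node and return its class."""
    while not isinstance(node, str):
        feature, threshold, below, otherwise = node
        node = below if galaxy[feature] < threshold else otherwise
    return node


galaxy = {"r": 3.0, "g": 15.0, "i": 14.0}  # r in arcsec, g and i in mag
label = predict(TREE, galaxy)
print(label)
```

For this assumed layout the example galaxy passes r < 5″, fails i < 13 mag, fails g < 14 mag, and ends in a “black” terminal node.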
Decision Trees: Pros & Cons
Advantages:
(1) Non-linear model, which is constructed during training.
(2) In its simplest version, very few free parameters.
(3) Handles numerous features and numerous objects.
(4) No need to scale the feature values to the same “units”.
(5) Produces classification probability (in its more complex version).
(6) Produces feature importance.
Feature importance & feature selection
[Figure: the example decision tree, with splits on r, g, and i and terminal nodes labeled class “brown” or class “black”.]
Rule of thumb: the higher a feature is in a decision tree, the more important it is for the classification task. The locations of features within the tree can be used to produce feature importance.
In our example, the feature importance ranking is: r, i, and then g.
Useful trick: add non-informative features to your dataset (a feature with random values, or a constant feature). If a physical feature is ranked as less important than these, remove it!
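A minimal sketch of the trick above, assuming scikit-learn and an invented dataset. Note that scikit-learn's `feature_importances_` is based on the impurity decrease contributed by each feature rather than purely on its position in the tree, so it is a related but not identical measure to the rule of thumb.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: the label depends on feature "a" only; "b" is irrelevant.
a = rng.normal(size=n)
b = rng.normal(size=n)
random_feature = rng.normal(size=n)   # the added non-informative feature

y = (a > 0).astype(int)
flip = rng.random(n) < 0.1            # 10% label noise, so the tree is not trivial
y[flip] = 1 - y[flip]

X = np.column_stack([a, b, random_feature])
names = ["a", "b", "random"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importance = dict(zip(names, tree.feature_importances_))

# Keep only the physical features that rank above the random one.
selected = [name for name in names[:-1] if importance[name] > importance["random"]]
print(importance, selected)
```

Whether the irrelevant feature `b` survives the cut depends on the random draw, which is one reason to repeat the test with several random features or seeds before discarding anything.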
Decision Trees: Pros & Cons
Disadvantages:
(1) Usually does not generalize well to unseen datasets:
    (a) Mediocre performance on the test set.
    (b) Tends to overfit.
Random Forests
Random Forest is an ensemble of decision trees, where randomness is injected into the training process of each individual tree with a bagging approach.
Bagging:
- Each decision tree is trained on a randomly-selected subset of the training set (in standard bagging, a bootstrap sample drawn with replacement).
- In each node of the decision tree, only a randomly-selected subset of the features is considered.
[Figure: two trees from the forest, labeled decision tree #1 and decision tree #2. Each uses different splits on r, g, and i, and each terminal node is labeled class “brown” or class “black”.]
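The two sources of randomness can be sketched by building a small forest by hand from scikit-learn decision trees. The dataset is synthetic, the number of trees is arbitrary, and the trees are combined here by a simple majority vote, which is one common aggregation choice rather than the only one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class dataset, used only for illustration.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

rng = np.random.default_rng(0)
trees = []
for k in range(51):  # odd number of trees, so a two-class vote cannot tie
    # Randomness 1: bootstrap sample, drawn from the training set with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Randomness 2: max_features="sqrt" makes every split consider
    # only a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=k)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote over the trees (labels are 0/1).
votes = np.stack([t.predict(X_test) for t in trees])   # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (forest_pred == y_test).mean()
print(accuracy)
```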
Random Forest Prediction
Verikas+ 2016
Hyperparameters:
(1) Number of trees in the forest.
(2) Number of randomly-selected features to consider in each split.
(3) Splitting criterion (also for Decision Trees).
(4) Class weight.
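For readers using scikit-learn, the four hyperparameters above correspond to constructor arguments of `RandomForestClassifier`. The values chosen below are illustrative assumptions, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,         # (1) number of trees in the forest
    max_features="sqrt",      # (2) number of randomly-selected features per split
    criterion="gini",         # (3) splitting criterion ("entropy" is another option)
    class_weight="balanced",  # (4) class weight, here inversely proportional to class frequency
    random_state=0,
)
params = forest.get_params()
print({k: params[k] for k in ("n_estimators", "max_features", "criterion", "class_weight")})
```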
Random Forest: Pros & Cons
Advantages:
(1) Same advantages as in a single Decision Tree.
(2) Specifically, can handle thousands of features!
(3) Generalizes well to unseen datasets.
(4) Easily parallelizable.
Disadvantages:
(1) Cannot handle measurement uncertainties (true for most ML algorithms!).
[Figure: panels titled Input data, Decision Tree, Random Forest, and AdaBoost. Source: http://scikit-learn.org/stable]
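The generalization and parallelization points can be checked with a short experiment. The sketch below uses scikit-learn's `make_moons` toy dataset as a stand-in (an assumption, it is not the data behind the figure) and compares a single fully-grown tree with a forest on a held-out test set.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class toy dataset, split in half into training and test sets.
X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# n_jobs=-1 trains the trees in parallel on all available cores.
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X_train, y_train)

scores = {
    "tree_train": tree.score(X_train, y_train),
    "tree_test": tree.score(X_test, y_test),
    "forest_test": forest.score(X_test, y_test),
}
print(scores)
```

The gap between `tree_train` and `tree_test` is the overfitting of a single tree; comparing `tree_test` with `forest_test` shows how much the ensemble recovers on this particular dataset.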
Random Forest: Examples
https://cs.stanford.edu/~karpathy/svmjs/demo/demoforest.html
Probabilistic Random Forest
A Random Forest that takes into account the uncertainties in both the features and the input labels. The Probabilistic Random Forest treats all measurements as random variables (see Reis+18).
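One way to picture what "treating a measurement as a random variable" means at a single split is sketched below. It assumes a Gaussian uncertainty on the feature, so that an object no longer goes strictly left or right but has a probability for each branch. This is an illustration of the idea only; the function `prob_left` is hypothetical and is not the interface of the PRF code.

```python
import math


def prob_left(x, sigma, threshold):
    """P(true value < threshold) for a measurement x with Gaussian uncertainty sigma."""
    if sigma == 0:
        # No uncertainty: fall back to the ordinary deterministic split.
        return 1.0 if x < threshold else 0.0
    return 0.5 * (1.0 + math.erf((threshold - x) / (sigma * math.sqrt(2.0))))


# Hypothetical measurement: r = 4.9 +/- 0.2 arcsec, tested against a split at r = 5 arcsec.
p_left = prob_left(4.9, 0.2, 5.0)
p_right = 1.0 - p_left
print(round(p_left, 3), round(p_right, 3))
```

A regular decision tree would send this object to the r < 5″ branch with certainty, even though the measurement lies only half a standard deviation below the threshold.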
PRF is able to handle a dataset with missing values!!!
Unsupervised Random Forest
Random Forest can be used as an unsupervised algorithm, to produce pair-wise similarity for the objects in our sample.
Why do we need to measure distances between objects?
[Figure: example spectra, normalized flux versus wavelength (nm), with panels numbered 1, 2, and 3.]
Input dataset: a list of objects with measured features, but no labels!
Random Forest is trained to distinguish between real and synthetic datasets.
[Figure: two scatter plots of Feature 2 versus Feature 1. One panel shows the original data (group A), the other the synthetic data (group B).]
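The slides do not spell out how the synthetic dataset is generated. A minimal sketch, assuming one common recipe in which each feature is shuffled independently so that its own distribution is preserved while the correlations between features are destroyed, could look like this (the data and all names are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical real data (group A): two correlated features.
f1 = rng.uniform(10, 100, 300)
real = np.column_stack([f1, f1 + rng.normal(0, 5, 300)])

# Assumed recipe for group B: shuffle each column independently.
synthetic = np.column_stack(
    [rng.permutation(real[:, j]) for j in range(real.shape[1])]
)

# Label the two groups and train the forest to tell them apart.
X = np.vstack([real, synthetic])
y = np.concatenate([
    np.ones(len(real), dtype=int),        # 1 = real (group A)
    np.zeros(len(synthetic), dtype=int),  # 0 = synthetic (group B)
])

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)
```

An out-of-bag score well above 0.5 indicates that the real data contain structure (here, the correlation between the two features) that the forest has learned to recognize.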
We train the Random Forest to distinguish between groups A and B.
For group A (real data), we propagate the objects through the trees and obtain a similarity matrix.
[Figure: similarity matrix.]
[Figure sequence: pairs of objects are propagated through a tree. For one pair the annotation reads similarity += 1; for another pair it reads similarity += 0.]
The process is repeated for all the trees in the forest. Therefore, the similarity ranges from 0 to N, the number of trees in the forest.
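The counting procedure can be sketched with NumPy, assuming that a pair of objects gains similarity in a given tree when both end up in the same terminal node (the usual definition for this kind of Random Forest similarity). The terminal-node indices below are made up; with scikit-learn, an array of this shape can be obtained from `forest.apply(X)`.

```python
import numpy as np

# Hypothetical terminal-node indices: rows are objects, columns are trees.
leaves = np.array([
    [1, 4, 2],
    [1, 4, 3],
    [2, 5, 3],
])
n_trees = leaves.shape[1]

# For every pair (a, b): +1 for each tree in which both objects
# share a terminal node, +0 otherwise.
similarity = (leaves[:, None, :] == leaves[None, :, :]).sum(axis=2)
print(similarity)
```

Objects 0 and 1 share a terminal node in two of the three trees, so their similarity is 2; objects 0 and 2 never do, so theirs is 0. Each diagonal entry equals the number of trees, and dividing the matrix by `n_trees` rescales it to the range 0 to 1.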
Questions?