Random Forests 13-06-2015
TRANSCRIPT
Random Forests
Cork Big Data & Analytics Group
Decision Trees
• One of the older ML algorithms (CART – Breiman et al. 1984)
• One of the most popular (Rexer Data Miner Survey 2013)
• Really versatile: handles non-linear relationships, missing data, outliers, and categorical or numerical targets – you name it!
• Easily interpreted – the rules can be presented as a table, or as a series of if-then statements for each "split"
• Can also be represented visually
• Variants: CART, ID3, C4.5, CHAID, C5.0
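The "series of if-then statements" interpretation can be made concrete with a tiny sketch. This is not from the slides: the feature names and thresholds below are invented for illustration, and Python stands in for the kind of rule set a CART-style tree would emit.

```python
# A hypothetical two-split decision tree written out as the
# if-then rules a CART-style tree produces. Feature names and
# thresholds are made up for illustration.
def predict(age, income):
    if age < 30:
        if income < 40_000:
            return "no"
        return "yes"
    return "yes"

print(predict(25, 30_000))  # -> "no"  (young, low income)
print(predict(45, 30_000))  # -> "yes" (the first split already decides)
```

Each path from the root to a leaf is one rule, which is what makes a single tree so easy to present as a table or a flowchart.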
Note: hope you don’t mind the political example!
Decision Trees (cont’d.)
• Decision Trees have low bias – the created model generally approximates reality well
• On the other hand, they have high variance – a model tends to perform differently on different samples of the data
• We need consistent performance, so what now?
• How about we "grow" a bunch of decision trees and average them up?
• Breiman thought about this, and in 2001 developed…
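The "grow a bunch of trees and average them" idea is bagging (bootstrap aggregating). Here is a minimal stdlib-Python sketch, under two simplifying assumptions not in the slides: the data is one-dimensional, and a trivial mean-split "stump" stands in for a full decision tree.

```python
import random
import statistics

def fit_stump(xs, ys):
    """A stand-in for a tree: split x at its mean, predict the
    mean of y on each side of the split."""
    t = statistics.mean(xs)
    left = [y for x, y in zip(xs, ys) if x < t] or ys
    right = [y for x, y in zip(xs, ys) if x >= t] or ys
    lm, rm = statistics.mean(left), statistics.mean(right)
    return lambda x: lm if x < t else rm

def bagged_predict(xs, ys, x_new, n_trees=50, seed=0):
    """Bagging: fit each 'tree' on a bootstrap resample of the
    rows, then average the individual predictions."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]  # sample with replacement
        model = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(model(x_new))
    return statistics.mean(preds)

xs = list(range(10))
ys = [2 * x for x in xs]
print(bagged_predict(xs, ys, 1.0))  # low end of the data -> low prediction
print(bagged_predict(xs, ys, 8.0))  # high end -> higher prediction
```

Averaging over bootstrap resamples is exactly the variance-reduction move the slide motivates: each individual stump is unstable, but the ensemble mean is far less sensitive to which rows were sampled.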
Random Forests
• Mimics an ensemble of "experts" making a decision
• Grows a bunch of bagged decision trees, each using random subsets of the variables (to reduce variance)
• Fast (relatively) and scalable, with all the benefits of decision trees
• Has several parameters to tweak for performance
• Implemented in all major ML software and libraries
• But – it is a "black box": no rules, no visualizations, little inference
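The "subsets of variables" step is what separates a random forest from plain bagging: each split considers only a random sample of the predictors. A common default size for that sample is the square root of the number of predictors (for classification). A stdlib-Python sketch of just that step, with invented numbers:

```python
import math
import random

def candidate_features(n_features, rng):
    """Pick the random subset of variables a single split is
    allowed to consider (sqrt(p) is a common default)."""
    mtry = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), mtry)

rng = random.Random(42)
print(candidate_features(9, rng))  # 3 of the 9 feature indices, no repeats
```

Because different splits see different variables, the trees decorrelate, and averaging decorrelated trees cuts variance more than averaging near-identical ones.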
Random Forests (cont’d.)
• Gives you "free" cross-validation (by calculating the out-of-bag (OOB) error)
• This means shorter training time
• Calculates variable importance
• Partial dependence plots
• Now supports censored (survival) data
• Handles class imbalance
• Can create very large objects in memory
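Why is the OOB error "free"? Sampling n rows with replacement leaves out roughly a fraction (1 - 1/n)^n ≈ 1/e ≈ 37% of the rows each time, and each tree can be scored on exactly the rows it never saw, with no separate validation split. A stdlib-Python sketch of that left-out fraction:

```python
import random

def oob_indices(n, rng):
    """Rows never drawn into one bootstrap sample of size n:
    these are 'out of bag' for the corresponding tree."""
    in_bag = {rng.randrange(n) for _ in range(n)}
    return [i for i in range(n) if i not in in_bag]

rng = random.Random(0)
fractions = [len(oob_indices(1000, rng)) / 1000 for _ in range(100)]
print(sum(fractions) / len(fractions))  # close to 1/e, about 0.368
```

Aggregating each row's predictions from only the trees for which it was out of bag gives an honest error estimate as a by-product of training, which is the "shorter training time" point above.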
Random Forests in R
• randomForest
• randomForestSRC
• ggRandomForests
• party
• randomForestCI (swager on GitHub)
• edarf (zmjones on GitHub)
• Boruta
Tuning Parameters
• Number of Trees
• Number of Variables
• Prior Class Weights
• Cutoff
• Sample Size
• Node Size
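For readers using the randomForest package listed earlier, these slide labels correspond to arguments of R's `randomForest()` (names per that package's documentation). A plain mapping, written here as a Python dictionary only so it fits the notation of the other examples:

```python
# Slide's tuning knobs -> argument names in R's randomForest().
tuning = {
    "Number of Trees":     "ntree",     # more trees: steadier, slower
    "Number of Variables": "mtry",      # variables tried at each split
    "Prior Class Weights": "classwt",   # priors for imbalanced classes
    "Cutoff":              "cutoff",    # vote threshold per class
    "Sample Size":         "sampsize",  # rows drawn per tree
    "Node Size":           "nodesize",  # minimum terminal node size
}
print(tuning["Number of Variables"])  # -> "mtry"
```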
Some Resources
• James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning
• Kuhn, Johnson, Applied Predictive Modeling
• Jones, Linder, Exploratory Data Analysis Using Random Forests (article)
• Package vignettes on CRAN
• CrossValidated.com
THANK YOU
[email protected]