Random Forests 13-06-2015
TRANSCRIPT
Random Forests
Cork Big Data & Analytics Group
Decision Trees
• One of the older ML algorithms (CART – Breiman et al. 1984)
• One of the most popular (Rexer Data Miner Survey 2013)
• Really versatile: handles non-linear relationships, missing data, outliers, and categorical or numerical targets – you name it!
• Easily interpreted – the rules can be presented as a table, or as a series of if-then statements for each "split"
• Can also be represented visually
• Variants: CART, ID3, C4.5, CHAID, C5.0
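The "series of if-then statements" interpretation can be made concrete with a tiny sketch. This is not from the slides: the feature names and thresholds below are invented for illustration, and Python stands in for the kind of rule set a CART-style tree would emit.

```python
# A hypothetical two-split decision tree written out as the
# if-then rules a CART-style tree produces. Feature names and
# thresholds are made up for illustration.
def predict(age, income):
    if age < 30:
        if income < 40_000:
            return "no"
        return "yes"
    return "yes"

print(predict(25, 30_000))  # -> "no"  (young, low income)
print(predict(45, 30_000))  # -> "yes" (the first split already decides)
```

Each path from the root to a leaf is one rule, which is what makes a single tree so easy to present as a table or a flowchart.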
Note: hope you don’t mind the political example!
Decision Trees (cont’d.)
• Decision Trees have low bias – the created model generally approximates reality well
• On the other hand, they have high variance – a model tends to perform differently on different samples of the data
• We need consistent performance, so what now?
• How about we "grow" a bunch of decision trees and average them up?
• Breiman thought about this, and in 2001 developed…
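The "grow a bunch of trees and average them" idea is bagging (bootstrap aggregating). Here is a minimal stdlib-Python sketch, under two simplifying assumptions not in the slides: the data is one-dimensional, and a trivial mean-split "stump" stands in for a full decision tree.

```python
import random
import statistics

def fit_stump(xs, ys):
    """A stand-in for a tree: split x at its mean, predict the
    mean of y on each side of the split."""
    t = statistics.mean(xs)
    left = [y for x, y in zip(xs, ys) if x < t] or ys
    right = [y for x, y in zip(xs, ys) if x >= t] or ys
    lm, rm = statistics.mean(left), statistics.mean(right)
    return lambda x: lm if x < t else rm

def bagged_predict(xs, ys, x_new, n_trees=50, seed=0):
    """Bagging: fit each 'tree' on a bootstrap resample of the
    rows, then average the individual predictions."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]  # sample with replacement
        model = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(model(x_new))
    return statistics.mean(preds)

xs = list(range(10))
ys = [2 * x for x in xs]
print(bagged_predict(xs, ys, 1.0))  # low end of the data -> low prediction
print(bagged_predict(xs, ys, 8.0))  # high end -> higher prediction
```

Averaging over bootstrap resamples is exactly the variance-reduction move the slide motivates: each individual stump is unstable, but the ensemble mean is far less sensitive to which rows were sampled.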
Random Forests
• Mimics an ensemble of "experts" making a decision
• Grows a bunch of bagged decision trees, each using random subsets of the variables (to reduce variance)
• Fast (relatively) and scalable, with all the benefits of decision trees
• Has several parameters to tweak for performance
• Implemented in all major ML software and libraries
• But – it is a "black box": no rules, no visualizations, little inference
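The "subsets of variables" step is what separates a random forest from plain bagging: each split considers only a random sample of the predictors. A common default size for that sample is the square root of the number of predictors (for classification). A stdlib-Python sketch of just that step, with invented numbers:

```python
import math
import random

def candidate_features(n_features, rng):
    """Pick the random subset of variables a single split is
    allowed to consider (sqrt(p) is a common default)."""
    mtry = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), mtry)

rng = random.Random(42)
print(candidate_features(9, rng))  # 3 of the 9 feature indices, no repeats
```

Because different splits see different variables, the trees decorrelate, and averaging decorrelated trees cuts variance more than averaging near-identical ones.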
Random Forests (cont’d.)
• Gives you "free" cross-validation (by calculating the out-of-bag (OOB) error)
• This means shorter training time
• Calculates variable importance
• Partial dependence plots
• Now supports censored (survival) data
• Handles class imbalance
• Can create very large objects in memory
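Why is the OOB error "free"? Sampling n rows with replacement leaves out roughly a fraction (1 - 1/n)^n ≈ 1/e ≈ 37% of the rows each time, and each tree can be scored on exactly the rows it never saw, with no separate validation split. A stdlib-Python sketch of that left-out fraction:

```python
import random

def oob_indices(n, rng):
    """Rows never drawn into one bootstrap sample of size n:
    these are 'out of bag' for the corresponding tree."""
    in_bag = {rng.randrange(n) for _ in range(n)}
    return [i for i in range(n) if i not in in_bag]

rng = random.Random(0)
fractions = [len(oob_indices(1000, rng)) / 1000 for _ in range(100)]
print(sum(fractions) / len(fractions))  # close to 1/e, about 0.368
```

Aggregating each row's predictions from only the trees for which it was out of bag gives an honest error estimate as a by-product of training, which is the "shorter training time" point above.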
Random Forests in R
• randomForest
• randomForestSRC
• ggRandomForests
• party
• randomForestCI (swager on GitHub)
• edarf (zmjones on GitHub)
• Boruta
Tuning Parameters
• Number of Trees
• Number of Variables
• Prior Class Weights
• Cutoff
• Sample Size
• Node Size
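For readers using the randomForest package listed earlier, these slide labels correspond to arguments of R's `randomForest()` (names per that package's documentation). A plain mapping, written here as a Python dictionary only so it fits the notation of the other examples:

```python
# Slide's tuning knobs -> argument names in R's randomForest().
tuning = {
    "Number of Trees":     "ntree",     # more trees: steadier, slower
    "Number of Variables": "mtry",      # variables tried at each split
    "Prior Class Weights": "classwt",   # priors for imbalanced classes
    "Cutoff":              "cutoff",    # vote threshold per class
    "Sample Size":         "sampsize",  # rows drawn per tree
    "Node Size":           "nodesize",  # minimum terminal node size
}
print(tuning["Number of Variables"])  # -> "mtry"
```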
Some Resources
• James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning
• Kuhn, Johnson, Applied Predictive Modeling
• Jones, Linder, Exploratory Data Analysis Using Random Forests (article)
• Package vignettes on CRAN
• CrossValidated.com
THANK YOU
[email protected]