TRANSCRIPT
Discovering Regression Structure with a Bayesian Ensemble
Ed George, University of Pennsylvania (joint work with H. Chipman and R. McCulloch)
The Loeb Research Lecture in Mathematics, Washington University in St. Louis
November 3, 2011
Hugh Chipman and Rob McCulloch
We’ve been working on this for a while!
CHIPMAN, GEORGE & MCCULLOCH (1998). Bayesian CART model search. JASA.
DENISON, MALLICK & SMITH (1998). A Bayesian CART algorithm. Biometrika.
CHIPMAN, GEORGE & MCCULLOCH (2002). Bayesian treed models. Machine Learning.
CHIPMAN, GEORGE & MCCULLOCH (2007). Bayesian ensemble learning. In Neural Information Processing.
CHIPMAN, GEORGE & MCCULLOCH (2010). BART: Bayesian Additive Regression Trees. Annals of Applied Statistics.
Example: The Boston Housing Data Regression Problem
Each observation corresponds to a geographic district. y = log median house price in the district. 13 x variables describing the district, e.g., crime rate, % poor, riverfront, size, air quality, etc. n = 506 observations.
Each boxplot depicts 20 out-of-sample RMSEs for one version of a method, e.g., neural nets with a given number of nodes and decay value.
Smaller is better. BART wins!
Performance Comparisons across 42 Datasets
Compared out-of-sample performance of 6 methods:
Neural networks (single layer)
Random forests
Boosting (Friedman's gradient boosting machine)
Linear regression with the lasso
BART-cv
BART-default (*no* tuning of parameters)
Data from Kim, Loh, Shih and Chaudhuri (2006) (thanks, Wei-Yin!). Up to 65 predictors and 2953 observations.
For each dataset: created 20 random splits into 5/6 train and 1/6 test, and used 5-fold cross-validation on the training set to pick tuning constants. This gives 20*42 = 840 out-of-sample RMSEs for each method. For comparisons, we divided each RMSE by the best RMSE for that split (a sketch of this bookkeeping follows below).
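A minimal sketch of the relative-RMSE bookkeeping described above, assuming the results are stored in an array indexed by (method, dataset, split); the random numbers stand in for the real results.

```python
import numpy as np

# Hypothetical array of out-of-sample RMSEs, one per (method, dataset, split);
# the random numbers here are placeholders for the real results.
n_methods, n_datasets, n_splits = 6, 42, 20
rng = np.random.default_rng(0)
rmse = rng.uniform(1.0, 2.0, size=(n_methods, n_datasets, n_splits))

# For each (dataset, split), divide every method's RMSE by the smallest RMSE
# achieved on that split, so a ratio of 1.0 means "as good as the best method".
best = rmse.min(axis=0, keepdims=True)
relative_rmse = rmse / best

# Summarize each method by the distribution of its 42 * 20 = 840 ratios.
for m in range(n_methods):
    ratios = relative_rmse[m].ravel()
    print(f"method {m}: median relative RMSE {np.median(ratios):.3f}")
```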
Relative RMSE Performance Ratios to Best
1.2 means 20% worse than the best performer. BART-cv is best (but not by too much). BART-default does amazingly well.
A General Model Free Regression Setup
(Trying to assume as little as possible)
• Data: n observations of y and x = (x1,...,xp)
• Suppose: Y = f(x) + ε, ε symmetric with mean 0
• Unknowns: f and the distribution of ε
Goal: to make inference about f(x) = E(Y | x) and/or a future y.
A General Bayesian Ensemble Idea
For Y = f(x) + ε, approximate f(x) by something of the form
f(x) = g(x; θ1) + g(x; θ2) + ... + g(x; θm)
θ1, θ2, …, θm iid ~ π
Then use the posterior distribution of f given Y = y to draw inferences about f.
This can work well when the prior π is calibrated so that, roughly, every one of the g(x; θj) components is small, but not too small!
BART is a Bayesian Ensemble Approach
BART estimates f(x) = E(Y | x) with a sum-of-trees (very flexible). BART provides
• posterior intervals for f(x)
• estimates of marginal effects of each xj
• model-free importance rankings of x1,...,xp
• model-free interaction detection
Note that it is wiser to build a model from selected variables than to select variables based on an assumed model! BART can be regarded as a pre-model-building tool.
A Single Regression Tree Model
[Tree diagram: two decision rules, x5 < c vs. x5 ≥ c and x2 < d vs. x2 ≥ d, leading to three bottom nodes with µ1 = -2, µ2 = 5, and µ3 = 7.]
Let g(x;θ), θ = (T, M) be a regression tree function that assigns a µ value to x
Let T denote the tree structure including the decision rules Let M = {µ1, µ2, … µb} denote the set of bottom node µ's.
A Single Tree Model: Y = g(x;θ) + ε
The Coordinate View of g(x;θ)
Easy to see that g(x;θ) is just a step function.
[Figure: the same tree shown alongside its coordinate view: the (x2, x5) plane is partitioned by the cutpoints d (on x2) and c (on x5) into three rectangles on which g(x;θ) takes the values µ1 = -2, µ2 = 5, and µ3 = 7.]
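As a concrete illustration of the step-function view, here is a minimal sketch (not the BayesTree implementation) of a single regression tree g(x; T, M) built from the two decision rules above; the cutpoints c and d and the nesting of the rules are placeholder assumptions.

```python
import numpy as np

def g_single_tree(x, c=0.5, d=0.5):
    """A single regression tree g(x; T, M): two decision rules, three bottom nodes.

    x is a vector whose components x[1] and x[4] play the roles of x2 and x5
    (0-based indexing); c and d are placeholder cutpoints.
    """
    x2, x5 = x[1], x[4]
    if x5 < c:                            # one possible nesting of the two rules
        return -2.0 if x2 < d else 5.0    # bottom nodes mu1 = -2, mu2 = 5
    return 7.0                            # bottom node mu3 = 7

# g is a step function of (x2, x5): constant on each of three rectangles.
x = np.array([0.3, 0.2, 0.9, 0.1, 0.8])
print(g_single_tree(x))   # x5 = 0.8 >= c, so mu3 = 7
```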
The BART Ensemble Model
Let θ = ((T1,M1), (T2,M2), …, (Tm,Mm)) identify a set of m trees and their µ's.
Y = g(x;T1,M1) + g(x;T2,M2) + ... + g(x;Tm,Mm) + σ z, z ~ N(0,1)
E(Y | x, θ) is the sum of all the corresponding µ's at each tree's bottom node. Such a model combines additive and interaction effects.
[Figure: several small trees whose bottom-node µ's (µ1, µ2, µ3, µ4, …) are summed.]
Remark: We here assume z ~ N(0,1) for simplicity, but will later see a successful extension to a general Dirichlet process (DP) error model.
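A minimal sketch of how such a sum-of-trees function is evaluated; the dictionary tree representation, the two toy trees, and their parameter values are illustrative assumptions, not an actual BART fit.

```python
import numpy as np

def g_tree(x, tree):
    """Evaluate one regression tree stored as nested dicts:
    {'var': j, 'cut': c, 'left': subtree_or_mu, 'right': subtree_or_mu}."""
    node = tree
    while isinstance(node, dict):
        node = node['left'] if x[node['var']] < node['cut'] else node['right']
    return node   # a bottom-node mu

def f_sum_of_trees(x, trees):
    """E(Y | x, theta): add up the mu contributed by each of the m trees."""
    return sum(g_tree(x, t) for t in trees)

# Two illustrative trees; an actual BART fit carries many small trees (e.g. m = 200).
trees = [
    {'var': 0, 'cut': 0.5, 'left': -0.1, 'right': 0.2},
    {'var': 3, 'cut': 0.3,
     'left': {'var': 1, 'cut': 0.7, 'left': 0.0, 'right': 0.05},
     'right': -0.15},
]
x = np.array([0.6, 0.8, 0.1, 0.2, 0.9])
print(f_sum_of_trees(x, trees))   # 0.2 + 0.05 = 0.25
```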
Completing the Model with a Regularization Prior
Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z, z ~ N(0,1)
π((T1,M1), …, (Tm,Mm), σ)
g(x;T1,M1), g(x;T2,M2), ..., g(x;Tm,Mm) is a highly redundant “over-complete basis” with many, many parameters.
We refer to π as a regularization prior because it is deliberately chosen to keep each g(x;Ti,Mi) effect small.
BART Implementation
BART uses a fully Bayesian specification. Information about all the parameters, namely θ = ((T1,M1), …, (Tm,Mm), σ), is captured by the posterior
π(θ | y) ∝ p(y | θ) π(θ)
The two main steps to implement BART:
1. Construct the prior π(θ)
Independent tree-generating process on T1,..,Tm
Use the observed y to properly scale π(M | T)
Extremely effective default choice available
2. Calculate the posterior π(θ | y)
Bayesian backfitting MCMC
Interweaving marginalization and regeneration of θ
R package BayesTree available on CRAN
Connections to Other Modeling Ideas
Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z plus π((T1,M1), …, (Tm,Mm), σ)
Bayesian nonparametrics: lots of parameters (to make the model flexible) plus a strong prior to shrink towards simple structure (regularization). BART shrinks towards additive models with some interaction.
Dynamic random basis: g(x;T1,M1), ..., g(x;Tm,Mm) are dimensionally adaptive.
Gradient boosting: the overall fit becomes the cumulative effort of many “weak learners”.
Some Distinguishing Features of BART
Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z plus π((T1,M1), …, (Tm,Mm), σ)
BART is NOT Bayesian model averaging of a single tree model.
Unlike boosting and random forests, the BART algorithm updates a fixed set of m trees, over and over.
Choose m large for best estimation of E[Y|x] and prediction: more trees yields more approximation flexibility.
Choose m small for variable selection: fewer trees forces the x's to compete for entry.
Example: Friedman's Simulated Data
Y = f(x) + σ z, z ~ N(0,1)
where f(x) = 10 sin(πx1x2) + 20(x3 - .5)² + 10x4 + 5x5 + 0x6 + … + 0x10
10 x's, but only the first 5 matter!
Friedman (1991) used n = 100 observations from this model with σ = 1 to illustrate the potential of MARS.
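A minimal sketch of simulating this test function, assuming the setup described above (n = 100 training observations, σ = 1, and 10 predictors drawn iid Uniform(0,1), as in Friedman (1991)).

```python
import numpy as np

def friedman_f(X):
    """Friedman's (1991) test function; columns 5-9 of X are pure noise predictors."""
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3]
            + 5 * X[:, 4])

rng = np.random.default_rng(1)
n, p, sigma = 100, 10, 1.0
X = rng.uniform(size=(n, p))                        # x's iid Uniform(0, 1)
y = friedman_f(X) + sigma * rng.standard_normal(n)

# 1000 held-out x's for measuring RMSE of an estimate f_hat against the true f:
X_test = rng.uniform(size=(1000, p))
f_test = friedman_f(X_test)
```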
Comparison of BART with Other Methods
Performance measured on 1000 out-of-sample x's by
RMSE = sqrt( (1/1000) Σi ( f̂(xi) − f(xi) )² ), the sum running over the 1000 test points.
Cross Validation Domain for Comparisons
Applying BART to the Friedman Example
We applied BART with m = 100 trees to n = 100 observations of the Friedman example.
[Figure: 95% posterior intervals vs. the true f(x), in-sample and out-of-sample, and the σ draws by MCMC iteration. Red: the m = 1 model; blue: the m = 100 model.]
Detecting Low Dimensional Structure in High Dimensional Data
Added many useless x's to Friedman's example.
[Figure: in-sample and out-of-sample posterior intervals vs. f(x), and σ draws, with 20 x's, 100 x's, and 1000 x's.]
With only 100 observations on y and 1000 x's, BART yielded “reasonable” results!
High Dimensional Out-of-Sample RMSE Performance Comparisons
[Boxplot panels for p = 10, p = 100, and p = 1000.]
n = 100 throughout!
Increasing the number of Trees in BART
Rapid decrease in RMSE followed by slow increase (Friedman data again)
The Effect of a Low Signal-to-Noise Ratio
Look what happens when σ is increased with the Friedman model.
BART is Robust to Prior Settings
BART's robust RMSE performance is shown below (again on the Friedman data), where the prior hyperparameter (ν, q, k, m) choices are varied.
Measuring Variable Importance in BART
[The example tree again, with decision rules on x2 and x5.]
When Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z is fit to data, we can count how many times a predictor is used in the trees.
For example, in the tree here, x2 and x5 are each used once. The importance of each xk can thus be measured by its overall usage frequency. This approach is most effective when the number of trees m is small.
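A minimal sketch of this usage-frequency idea, reusing the nested-dict tree representation from the earlier sketch; in practice the counts would be accumulated over the posterior draws of the trees.

```python
from collections import Counter

def variable_usage(trees, p):
    """Count how often each predictor index 0..p-1 appears in a decision rule,
    and return usage frequencies (larger frequency = more important predictor)."""
    counts = Counter()

    def walk(node):
        if isinstance(node, dict):        # interior node: records its split variable
            counts[node['var']] += 1
            walk(node['left'])
            walk(node['right'])

    for t in trees:
        walk(t)
    total = sum(counts.values())
    return {j: counts.get(j, 0) / total for j in range(p)}

# With the two toy trees from the earlier sketch, x1, x2 and x4 are each used once:
trees = [
    {'var': 0, 'cut': 0.5, 'left': -0.1, 'right': 0.2},
    {'var': 3, 'cut': 0.3,
     'left': {'var': 1, 'cut': 0.7, 'left': 0.0, 'right': 0.05},
     'right': -0.15},
]
print(variable_usage(trees, p=5))
```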
Creating a Competitive Bottleneck
Variable Selection via BART
[Diagram: the predictors x1, …, xi, …, xp feeding into BART to explain Y.]
BART is an automatic attractor of x's to explain Y. Those x's which most explain Y are attracted most. A small number of trees m creates a bottleneck which excludes useless x's.
[Figure: variable usage frequencies as the number of trees m is reduced.]
Variable Selection via BART
Using m = 10 trees, it remains effective with n = 100 and p = 90
BART Offers Estimates of Individual Predictor Effects
Partial Dependence Plot of Crime Effect in Boston Housing Data
These are estimates of f3(x3) = (1/n) ∑i f(x3, xi,c), where xc = x \ x3.
Partial Dependence Plots for the Friedman Example
The Marginal Effects of x1 – x10
Example: HIV Data Analysis
Y = LDHL (log of HDL level)
x's = CD4, Age, Sex, Race, Study, PI1, PI2, NNRTI2, NRTI1, NRTI2, ABI_349, CRC_71, CRC_72, CRC_55, CRC_73, CRC_10, ABI_383, ABI_387, ABI_391, ABI_395, ABI_400, ABI_401, CRC_66, CRC_67, CRC_68, CRC_69
n = 458 patients
For this data, least squares yields R² = 26% while BART yields R² = 42%.
The BART Fit for the HIV Data
BART suggests there is not a strong signal in x for this y.
Variable Selection via BART with m = 5, 10, 20, 50, 200 trees
Partial Dependence Plots Suggest Genotype Effects
For example, the average predictive effect of ABI_383.
Predictive Inference about Interaction of NNRTI2 Treatment and ABI_383 Genotype
There appears to be no interaction effect.
The Football Data
Each observation (n = 245) corresponds to a football game.
y = Team A points - Team B points
29 x's, each the difference between the two teams on some measure, e.g., x10 is average points against (defense) per game, for Team A minus for Team B.
Variable Selection for the Football Data
For each draw, for each variable, calculate the percentage of time that variable is used in a tree; then average over trees.
Subtle point: you can't have too many trees, or variables come in without really doing anything.
Marginal Effects of the Variables
Just used variables 2, 7, 10, and 14. Here are the four univariate partial-dependence plots.
Example: Drug Discovery
Goal: to predict the “activity” of a compound against a biological target.
Data: n = 2604 compounds of which 604 are active. Y = 0 or 1 according to whether the compound is active. x's: 266 variables that characterize the compound's molecular structure.
BART is easily extended to estimate P[Y = 1 | x] by Albert-Chib augmentation with a probit link. We applied BART to rank compounds by their P[Y = 1 | x] estimates.
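A minimal sketch of the Albert-Chib data-augmentation step under a probit link: given the current sum-of-trees fit f, latent variables Z are drawn from normals truncated to match the observed 0/1 outcomes, and the trees are then updated treating Z as the continuous response with σ = 1. The function name `update_trees` is a placeholder, not the BayesTree API.

```python
import numpy as np
from scipy.stats import truncnorm

def draw_latent_z(f_x, y, rng=None):
    """Albert-Chib step: Z_i ~ N(f(x_i), 1) truncated to (0, inf) if y_i = 1
    and to (-inf, 0) if y_i = 0, so that P(Y = 1 | x) = Phi(f(x))."""
    lower = np.where(y == 1, -f_x, -np.inf)   # truncnorm bounds are on the standard scale
    upper = np.where(y == 1, np.inf, -f_x)
    return truncnorm.rvs(lower, upper, loc=f_x, scale=1.0, random_state=rng)

# One conceptual MCMC sweep (update_trees stands in for the sum-of-trees update):
#   z = draw_latent_z(f_x, y)
#   trees = update_trees(trees, X, z, sigma=1.0)
#   f_x = np.array([f_sum_of_trees(x, trees) for x in X])
```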
BART Posterior Intervals for 20 Compounds with Highest Predicted Activity
[Two panels: in-sample and out-of-sample.]
Variable Importance in Drug Discovery with m = 5, 10, 20 trees
[Two panels: all 266 x's, and the top 25 x's.]
A Sketch of the BART MCMC Algorithm
The “parameter” is θ = ((T1,M1), …, (Tm,Mm), σ) in the model
Y = g(x;T1,M1) + g(x;T2,M2) + ... + g(x;Tm,Mm) + σ z
“Simple” Gibbs sampler:
(1) Iteratively sample each (Tj,Mj) given Y, σ, and the other (Tj,Mj)'s (Bayesian backfitting): subtract all but g(x;Tj,Mj) from Y to update (Tj,Mj).
(2) Sample σ given Y and all the (Tj,Mj)'s: subtract all the g(x;Tj,Mj)'s from Y to update σ.
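A minimal sketch of this backfitting loop; `update_tree` and `draw_sigma` are placeholders for the conditional draws described above (the tree move itself is sketched separately below), and `g_tree` is the evaluator from the earlier sketch.

```python
import numpy as np

def bart_gibbs_sweep(X, y, trees, sigma, update_tree, draw_sigma):
    """One sweep of the "simple" Gibbs sampler over theta = ((T1,M1),...,(Tm,Mm), sigma).

    update_tree(X, residual, sigma, tree) and draw_sigma(residual) are placeholders
    for the conditional draws; g_tree comes from the earlier sum-of-trees sketch.
    """
    fits = np.array([[g_tree(x, t) for x in X] for t in trees])   # m x n fitted values
    total_fit = fits.sum(axis=0)
    for j in range(len(trees)):
        # (1) Bayesian backfitting: the partial residual leaves out tree j only.
        partial_residual = y - (total_fit - fits[j])
        trees[j] = update_tree(X, partial_residual, sigma, trees[j])
        fits[j] = np.array([g_tree(x, trees[j]) for x in X])
        total_fit = fits.sum(axis=0)
    # (2) Update sigma given the residuals from all the trees.
    sigma = draw_sigma(y - total_fit)
    return trees, sigma
```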
Simulating each pair from p(T, M | data)
The simulation of each pair (T, M) in (1) is done via
p(T, M | data) = p(T | data) p(M | T, data)
That is, we first margin out M and draw T, then draw M given T.
Sampling M from p(M | T, data) is routine: just simulate µ's from the posterior under our conjugate prior. Sampling from p(T | data) is much more involved: we essentially apply our earlier Bayesian CART algorithm to the residuals obtained as Y minus the sum of the other g(x;Tj,Mj)'s.
Simulating p(T | data) with the Bayesian CART Algorithm
Because p(T | data) is available in closed form (up to a norming constant), we use a Metropolis-Hastings algorithm. Our proposal moves around tree space via local modifications such as growing a node (propose a more complex tree) or pruning (propose a simpler tree).
[Diagram: a grow proposal and a prune proposal.]
Such modifications are accepted according to their compatibility with p(T | data).
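A minimal sketch of the Metropolis-Hastings step over tree space; `log_marginal_lik`, `log_tree_prior`, and `propose` are placeholders for the closed-form quantities and the grow/prune proposal mechanism mentioned above.

```python
import copy
import math
import random

def mh_tree_step(tree, data, log_marginal_lik, log_tree_prior, propose):
    """One Metropolis-Hastings move on T, targeting p(T | data) up to a norming constant.

    propose(tree) must return (new_tree, log_q_forward, log_q_backward) for a local
    modification such as a grow or prune move.
    """
    new_tree, log_q_forward, log_q_backward = propose(copy.deepcopy(tree))
    log_accept = (log_marginal_lik(new_tree, data) + log_tree_prior(new_tree)
                  - log_marginal_lik(tree, data) - log_tree_prior(tree)
                  + log_q_backward - log_q_forward)
    if math.log(random.random()) < log_accept:
        return new_tree    # proposal accepted
    return tree            # proposal rejected: keep the current tree
```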
The Overall BART MCMC Algorithm
Each iteration of the BART algorithm provides a complete update of the BART parameters (T1,M1), …, (Tm,Mm) and σ. This is a Markov chain converging to the posterior. Surprisingly quick burn-in and convergence.
The Redundant, Dynamic Random Basis in Action
As the chain runs, each tree contributes a small part to the fit, and the fit is swapped around from tree to tree. We often observe that an individual tree grows quite large and then collapses back to a single node. This illustrates how each tree is dimensionally adaptive.
Using the MCMC Output to Draw Inference
At iteration i we draw from the posterior of f:
f̂i(·) = g(·; Ti1, Mi1) + g(·; Ti2, Mi2) + ... + g(·; Tim, Mim)
To estimate f(x) we simply average the draws f̂i(x) at x. Posterior uncertainty is captured by the variation of the f̂i(x), e.g., a 95% HPD region estimated by the middle 95% of the values.
Similarly, it is straightforward to estimate functionals of f such as the partial dependence functions
fk(xk) = (1/n) ∑i f(xk, xi,c), where xc = x \ xk.
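A minimal sketch of these summaries, assuming `f_draws` is an (ndraws x n) array of posterior draws f̂i(x) at a set of points and `f_of` evaluates one posterior draw of f at an arbitrary x; both are placeholders for whatever the fitting code returns.

```python
import numpy as np

def posterior_summary(f_draws):
    """Posterior mean and a 95% interval (middle 95% of the draws) at each point;
    f_draws has one row per MCMC draw and one column per point."""
    mean = f_draws.mean(axis=0)
    lower, upper = np.percentile(f_draws, [2.5, 97.5], axis=0)
    return mean, lower, upper

def partial_dependence(f_of, X, k, grid):
    """f_k(x_k) = (1/n) sum_i f(x_k, x_ic): fix coordinate k at each grid value
    and average f over the observed values of the remaining coordinates."""
    values = []
    for val in grid:
        X_mod = X.copy()
        X_mod[:, k] = val                 # overwrite the k-th predictor everywhere
        values.append(np.mean([f_of(x) for x in X_mod]))
    return np.array(values)
```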
A Sketch of the Prior
First, introduce prior independence as follows:
π((T1,M1), …, (Tm,Mm), σ) = [ Π π(Tj,Mj) ] π(σ) = [ Π π(µij | Tj) π(Tj) ] π(σ)
Thus we only need to choose π(T), π(µ | T) = π(µ), and π(σ).
π(T)
We specify a process that grows trees:
Step 1) Grow a tree structure with successive biased coin flips
Step 2) Randomly assign variables to decision nodes
Step 3) Randomly assign splitting rules to decision nodes
[Histogram: the implied marginal prior on the number of bottom nodes (1 through 7). Hyperparameters are chosen to put prior weight on small trees!]
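A minimal sketch of such a tree-growing process. The depth-dependent split probability alpha * (1 + depth)**(-beta) is the form used in the published BART papers (roughly alpha = 0.95, beta = 2 as defaults); treat the exact numbers here as assumptions rather than the talk's settings.

```python
import random

def grow_tree_prior(p, depth=0, alpha=0.95, beta=2.0):
    """Draw a tree structure from the prior: a node at the given depth becomes a
    decision node with probability alpha * (1 + depth) ** (-beta) (the biased coin);
    decision nodes get a uniformly chosen variable and a uniform cutpoint in (0, 1)."""
    if random.random() < alpha * (1 + depth) ** (-beta):
        return {'var': random.randrange(p),                          # Step 2
                'cut': random.random(),                              # Step 3
                'left': grow_tree_prior(p, depth + 1, alpha, beta),  # Step 1, recursively
                'right': grow_tree_prior(p, depth + 1, alpha, beta)}
    return None   # bottom node; its mu is drawn separately from pi(mu | T)

random.seed(0)
print(grow_tree_prior(p=10))   # prior draws are typically very small trees
```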
π(µ | T)
For each bottom node, let µ ~ N(µµ, σµ²), which induces E[Y|x] ~ N(m µµ, m σµ²).
Since we are pretty sure that E[Y|x] ∈ (ymin, ymax), we choose µµ and σµ such that, for a prechosen k (such as 2 or 3),
(m µµ − k√m σµ, m µµ + k√m σµ) = (ymin, ymax).
This is conveniently done by shifting and rescaling y so that (ymin, ymax) = (-0.5, 0.5) and then setting µµ = 0 and k√m σµ = 0.5, i.e. σµ = 0.5/(k√m).
Note how the prior adapts to m: σµ gets smaller as m gets larger. Default choice is k = 2.
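A minimal sketch of this calibration, computing σµ from k and m after rescaling y; the numbers are purely illustrative.

```python
import numpy as np

def mu_prior_scale(y, m, k=2.0):
    """Rescale y to (-0.5, 0.5) and return (y_rescaled, sigma_mu) with
    sigma_mu = 0.5 / (k * sqrt(m)), so that the sum of m bottom-node draws keeps
    E[Y|x] within (y_min, y_max) with high prior probability."""
    y_min, y_max = y.min(), y.max()
    y_rescaled = (y - y_min) / (y_max - y_min) - 0.5
    sigma_mu = 0.5 / (k * np.sqrt(m))
    return y_rescaled, sigma_mu

y = np.array([3.1, 4.7, 2.8, 5.0, 3.9])
y_star, sigma_mu = mu_prior_scale(y, m=200, k=2.0)
print(sigma_mu)   # sigma_mu shrinks as the number of trees m grows
```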
π(σ)
Let σ² ~ νλ/χ²ν and consider ν = 3, 5 or 10.
To set λ, we use a rough overestimate σ̂ of σ based on the data (such as sd(y) or the least-squares estimate from the saturated linear regression). Determine λ by setting a quantile such as .75, .95 or .99 of the prior at this rough estimate.
[Figure: the three priors we would consider when σ̂ = 2.]
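A minimal sketch of solving for λ: with σ² ~ νλ/χ²ν, requiring P(σ < σ̂) = q means P(χ²ν > νλ/σ̂²) = q, so νλ/σ̂² is the (1 − q) quantile of χ²ν. scipy's chi2 supplies the quantile; the numbers are illustrative.

```python
from scipy.stats import chi2

def calibrate_lambda(sigma_hat, nu=3, q=0.95):
    """Choose lambda so that the prior puts probability q on sigma < sigma_hat,
    where sigma^2 ~ nu * lambda / chi2_nu."""
    # P(sigma^2 < sigma_hat^2) = P(chi2_nu > nu * lambda / sigma_hat^2) = q,
    # so nu * lambda / sigma_hat^2 is the (1 - q) quantile of chi2_nu.
    return sigma_hat ** 2 * chi2.ppf(1 - q, df=nu) / nu

print(calibrate_lambda(sigma_hat=2.0, nu=3, q=0.95))
```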
Weakening the Error Distribution Assumption
Simply replace
εi ~ N(0, σ²) iid
by
εi ~ N(0, σi²), σi ~ G, G ~ DP(α, G0)
where DP is the Dirichlet process.
G0 is the same prior we used for the default σ prior. BART MCMC is still well behaved. This is not a model for heteroscedasticity (future work!!), just a more general error distribution.
So, we now have our fully flexible model specification:
Y = f(x) + ε, ε symmetric with mean 0
Issue: is the richer version of BART too flexible to work? Back to the Friedman example.
Original case: σi ≡ 1
Mixture case: σi = 1 with probability .75, σi = 10 with probability .25.
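A minimal sketch of generating the mixture-error version of the Friedman data used in this comparison, reusing `friedman_f` from the earlier sketch; the seed and layout are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.uniform(size=(n, p))

# Original case: homoscedastic N(0, 1) errors.
y_normal = friedman_f(X) + rng.standard_normal(n)

# Mixture case: sigma_i = 1 with probability .75 and sigma_i = 10 with probability .25.
sigma_i = np.where(rng.uniform(size=n) < 0.75, 1.0, 10.0)
y_mixture = friedman_f(X) + sigma_i * rng.standard_normal(n)
```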
RMSE Comparison of Normal vs DP Models
Top pair: normal errors; bottom pair: mixture errors. Left boxplot: normal model; right boxplot: DP model. Similar performance for normal errors; DP is much better for mixture errors.
BART Extensions and Applications
ABREVEYA & MCCULLOCH (2006). Reversal of fortune: A statistical analysis of penalty calls in the national hockey league.
ABU-NIMEH, NAPPA, WANG & NAIR (2008). Bayesian Additive Regression Trees-Based Spam Detection for Enhanced Email Privacy. 2008 Third International Conference on Availability, Reliability and Security.
ABU-NIMEH, NAPPA, WANG & NAIR (2009). Hardening Email Security using Bayesian Additive Regression Trees. Machine Learning.
ABU-NIMEH, NAPPA, WANG & NAIR (2009). Distributed Phishing Detection by Applying Variable Selection using Bayesian Additive Regression Trees. IEEE International Conference on Communications.
CHIPMAN, GEORGE, LEMP & MCCULLOCH (2010). Bayesian Flexible Modeling of Trip Durations. Transportation Research Part B: Methodological.
ELIASBERG, HUI & ZHANG (2010). Green-lighting Movie Scripts: Revenue Forecasting and Risk Management.
GREEN & KERN (2010). Modeling heterogeneous treatment effects in large-scale experiments using Bayesian Additive Regression Trees.
BART Extensions and Applications (cont.)
HILL (2010). Bayesian Nonparametric Modeling for Causal Inference. JCGS.
HILL & ZUBIZARRETA (2010). Causal Inference in Observational Studies with Missing Data: A Bayesian Nonparametric Approach.
KOURTELLOS, TAN & ZHANG (2007). Is the Relationship between Aid and Economic Growth Nonlinear? Journal of Macroeconomics.
WANG & KARR (2010). Preserving data utility via BART. JSPI.
YU, LI, FANG & PENG (2010). An adaptive sampling scheme guided by BART - with an application to predict processor performance. Canadian Journal of Statistics.
YU, MACEACHERN & PERUGGIA (2009). Bayesian Synthesis.
ZHANG & HAERDLE (2010). The Bayesian additive classification tree applied to credit risk modelling. Computational Statistics & Data Analysis.
ZHANG, SHIH & MULLER (2007). A spatially-adjusted Bayesian additive regression tree model to merge two datasets. Bayesian Analysis.
ZHOU & LIU (2008). Extracting sequence features to predict protein-DNA binding: A comparative study. Nucleic Acids Research.
Plenty More to Do
• Further Error Structure Magic
• Monotone BART
• Model Building via BART
Thank You!