TRANSCRIPT
Discovering Regression Structure with a Bayesian Ensemble
Ed George, University of Pennsylvania (joint work with H. Chipman and R. McCulloch)
The Loeb Research Lecture in Mathematics, Washington University in St. Louis
November 3, 2011
Hugh Chipman and Rob McCulloch
We’ve been working on this for a while!
CHIPMAN, GEORGE & MCCULLOCH (1998). Bayesian CART model search. JASA.
DENISON, MALLICK & SMITH (1998). A Bayesian CART algorithm. Biometrika.
CHIPMAN, GEORGE & MCCULLOCH (2002). Bayesian treed models. Machine Learning.
CHIPMAN, GEORGE & MCCULLOCH (2007). Bayesian ensemble learning. In Neural Information Processing.
CHIPMAN, GEORGE & MCCULLOCH (2010). BART: Bayesian Additive Regression Trees. Annals of Applied Statistics.
Example: The Boston Housing Data Regression Problem
Each observation corresponds to a geographic district. y = log median house price in the district. 13 x variables describing the district, e.g., crime rate, % poor, riverfront, size, air quality, etc. n = 506 observations.
Each boxplot depicts 20 out-of-sample RMSEs for one version of a method, e.g., neural nets with a given number of nodes and decay value.
Smaller is better. BART wins!
Performance Comparisons across 42 Datasets
Compared out-of-sample performance of 6 methods:
Neural networks (single layer)
Random forests
Boosting (Friedman's gradient boosting machine)
Linear regression with the lasso
BART-cv
BART-default (*no* tuning of parameters)
Data from Kim, Loh, Shih and Chaudhuri (2006) (thanks, Wei-Yin!). Up to 65 predictors and 2953 observations.
For each dataset: created 20 random splits into 5/6 train and 1/6 test, and used 5-fold cross-validation on the training set to pick tuning constants. This gives 20*42 = 840 out-of-sample RMSEs for each method. For comparisons, we divided each RMSE by the best RMSE for that split (a sketch of this bookkeeping follows below).
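A minimal sketch of the relative-RMSE bookkeeping described above, assuming the results are stored in an array indexed by (method, dataset, split); the random numbers stand in for the real results.

```python
import numpy as np

# Hypothetical array of out-of-sample RMSEs, one per (method, dataset, split);
# the random numbers here are placeholders for the real results.
n_methods, n_datasets, n_splits = 6, 42, 20
rng = np.random.default_rng(0)
rmse = rng.uniform(1.0, 2.0, size=(n_methods, n_datasets, n_splits))

# For each (dataset, split), divide every method's RMSE by the smallest RMSE
# achieved on that split, so a ratio of 1.0 means "as good as the best method".
best = rmse.min(axis=0, keepdims=True)
relative_rmse = rmse / best

# Summarize each method by the distribution of its 42 * 20 = 840 ratios.
for m in range(n_methods):
    ratios = relative_rmse[m].ravel()
    print(f"method {m}: median relative RMSE {np.median(ratios):.3f}")
```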
Relative RMSE Performance Ratios to Best
1.2 means 20% worse than the best performer. BART-cv is best (but not by too much). BART-default does amazingly well.
A General Model Free Regression Setup
(Trying to assume as little as possible)
• Data: n observations of y and x = (x1,...,xp)
• Suppose: Y = f(x) + ε, ε symmetric with mean 0
• Unknowns: f and the distribution of ε
Goal: to make inference about f(x) = E(Y | x) and/or a future y.
A General Bayesian Ensemble Idea
For Y = f(x) + ε, approximate f(x) by something of the form
f(x) = g(x; θ1) + g(x; θ2) + ... + g(x; θm)
θ1, θ2, …, θm iid ~ π
Then use the posterior distribution of f given Y = y to draw inferences about f.
This can work well when the prior π is calibrated so that, roughly, every one of the g(x; θj) components is small, but not too small!
BART is a Bayesian Ensemble Approach
BART estimates f(x) = E(Y | x) with a sum-of-trees (very flexible). BART provides
• posterior intervals for f(x)
• estimates of marginal effects of each xj
• model-free importance rankings of x1,...,xp
• model-free interaction detection
Note that it is wiser to build a model from selected variables than to select variables based on an assumed model! BART can be regarded as a pre-model-building tool.
A Single Regression Tree Model
[Tree diagram: two decision rules, x5 < c vs. x5 ≥ c and x2 < d vs. x2 ≥ d, leading to three bottom nodes with µ1 = -2, µ2 = 5, and µ3 = 7.]
Let g(x;θ), θ = (T, M) be a regression tree function that assigns a µ value to x
Let T denote the tree structure including the decision rules Let M = {µ1, µ2, … µb} denote the set of bottom node µ's.
A Single Tree Model: Y = g(x;θ) + ε
The Coordinate View of g(x;θ)
Easy to see that g(x;θ) is just a step function.
[Figure: the same tree shown alongside its coordinate view: the (x2, x5) plane is partitioned by the cutpoints d (on x2) and c (on x5) into three rectangles on which g(x;θ) takes the values µ1 = -2, µ2 = 5, and µ3 = 7.]
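As a concrete illustration of the step-function view, here is a minimal sketch (not the BayesTree implementation) of a single regression tree g(x; T, M) built from the two decision rules above; the cutpoints c and d and the nesting of the rules are placeholder assumptions.

```python
import numpy as np

def g_single_tree(x, c=0.5, d=0.5):
    """A single regression tree g(x; T, M): two decision rules, three bottom nodes.

    x is a vector whose components x[1] and x[4] play the roles of x2 and x5
    (0-based indexing); c and d are placeholder cutpoints.
    """
    x2, x5 = x[1], x[4]
    if x5 < c:                            # one possible nesting of the two rules
        return -2.0 if x2 < d else 5.0    # bottom nodes mu1 = -2, mu2 = 5
    return 7.0                            # bottom node mu3 = 7

# g is a step function of (x2, x5): constant on each of three rectangles.
x = np.array([0.3, 0.2, 0.9, 0.1, 0.8])
print(g_single_tree(x))   # x5 = 0.8 >= c, so mu3 = 7
```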
The BART Ensemble Model
Let θ = ((T1,M1), (T2,M2), …, (Tm,Mm)) identify a set of m trees and their µ's.
Y = g(x;T1,M1) + g(x;T2,M2) + ... + g(x;Tm,Mm) + σ z, z ~ N(0,1)
E(Y | x, θ) is the sum of all the corresponding µ's at each tree's bottom node. Such a model combines additive and interaction effects.
[Figure: several small trees whose bottom-node µ's (µ1, µ2, µ3, µ4, …) are summed.]
Remark: We here assume z ~ N(0,1) for simplicity, but will later see a successful extension to a general Dirichlet process (DP) error model.
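A minimal sketch of how such a sum-of-trees function is evaluated; the dictionary tree representation, the two toy trees, and their parameter values are illustrative assumptions, not an actual BART fit.

```python
import numpy as np

def g_tree(x, tree):
    """Evaluate one regression tree stored as nested dicts:
    {'var': j, 'cut': c, 'left': subtree_or_mu, 'right': subtree_or_mu}."""
    node = tree
    while isinstance(node, dict):
        node = node['left'] if x[node['var']] < node['cut'] else node['right']
    return node   # a bottom-node mu

def f_sum_of_trees(x, trees):
    """E(Y | x, theta): add up the mu contributed by each of the m trees."""
    return sum(g_tree(x, t) for t in trees)

# Two illustrative trees; an actual BART fit carries many small trees (e.g. m = 200).
trees = [
    {'var': 0, 'cut': 0.5, 'left': -0.1, 'right': 0.2},
    {'var': 3, 'cut': 0.3,
     'left': {'var': 1, 'cut': 0.7, 'left': 0.0, 'right': 0.05},
     'right': -0.15},
]
x = np.array([0.6, 0.8, 0.1, 0.2, 0.9])
print(f_sum_of_trees(x, trees))   # 0.2 + 0.05 = 0.25
```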
Completing the Model with a Regularization Prior
Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z, z ~ N(0,1)
π((T1,M1), …, (Tm,Mm), σ)
g(x;T1,M1), g(x;T2,M2), ..., g(x;Tm,Mm) is a highly redundant “over-complete basis” with many, many parameters.
We refer to π as a regularization prior because it is deliberately chosen to keep each g(x;Ti,Mi) effect small.
BART Implementation
BART uses a fully Bayesian specification. Information about all the parameters, namely θ = ((T1,M1), …, (Tm,Mm), σ), is captured by the posterior
π(θ | y) ∝ p(y | θ) π(θ)
The two main steps to implement BART:
1. Construct the prior π(θ)
Independent tree-generating process on T1,..,Tm
Use the observed y to properly scale π(M | T)
Extremely effective default choice available
2. Calculate the posterior π(θ | y)
Bayesian backfitting MCMC
Interweaving marginalization and regeneration of θ
R package BayesTree available on CRAN
Connections to Other Modeling Ideas
Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z plus π((T1,M1), …, (Tm,Mm), σ)
Bayesian nonparametrics: lots of parameters (to make the model flexible) plus a strong prior to shrink towards simple structure (regularization). BART shrinks towards additive models with some interaction.
Dynamic random basis: g(x;T1,M1), ..., g(x;Tm,Mm) are dimensionally adaptive.
Gradient boosting: the overall fit becomes the cumulative effort of many “weak learners”.
Some Distinguishing Features of BART
Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z plus π((T1,M1), …, (Tm,Mm), σ)
BART is NOT Bayesian model averaging of a single tree model.
Unlike boosting and random forests, the BART algorithm updates a fixed set of m trees, over and over.
Choose m large for best estimation of E[Y|x] and prediction: more trees yields more approximation flexibility.
Choose m small for variable selection: fewer trees forces the x's to compete for entry.
Example: Friedman's Simulated Data
Y = f(x) + σ z, z ~ N(0,1)
where f(x) = 10 sin(πx1x2) + 20(x3 - .5)² + 10x4 + 5x5 + 0x6 + … + 0x10
10 x's, but only the first 5 matter!
Friedman (1991) used n = 100 observations from this model with σ = 1 to illustrate the potential of MARS.
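A minimal sketch of simulating this test function, assuming the setup described above (n = 100 training observations, σ = 1, and 10 predictors drawn iid Uniform(0,1), as in Friedman (1991)).

```python
import numpy as np

def friedman_f(X):
    """Friedman's (1991) test function; columns 5-9 of X are pure noise predictors."""
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3]
            + 5 * X[:, 4])

rng = np.random.default_rng(1)
n, p, sigma = 100, 10, 1.0
X = rng.uniform(size=(n, p))                        # x's iid Uniform(0, 1)
y = friedman_f(X) + sigma * rng.standard_normal(n)

# 1000 held-out x's for measuring RMSE of an estimate f_hat against the true f:
X_test = rng.uniform(size=(1000, p))
f_test = friedman_f(X_test)
```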
Comparison of BART with Other Methods
Performance measured on 1000 out-of-sample x's by
RMSE = sqrt( (1/1000) Σi ( f̂(xi) − f(xi) )² ), the sum running over the 1000 test points.
Cross Validation Domain for Comparisons
Applying BART to the Friedman Example
We applied BART with m = 100 trees to n = 100 observations of the Friedman example.
[Figure: 95% posterior intervals vs. the true f(x), in-sample and out-of-sample, and the σ draws by MCMC iteration. Red: the m = 1 model; blue: the m = 100 model.]
Detecting Low Dimensional Structure in High Dimensional Data
Added many useless x's to Friedman's example.
[Figure: in-sample and out-of-sample posterior intervals vs. f(x), and σ draws, with 20 x's, 100 x's, and 1000 x's.]
With only 100 observations on y and 1000 x's, BART yielded “reasonable” results!
High Dimensional Out-of-Sample RMSE Performance Comparisons
[Boxplot panels for p = 10, p = 100, and p = 1000.]
n = 100 throughout!
Increasing the number of Trees in BART
Rapid decrease in RMSE followed by slow increase (Friedman data again)
The Effect of a Low Signal-to-Noise Ratio
Look what happens when σ is increased with the Friedman model.
BART is Robust to Prior Settings
BART's robust RMSE performance is shown below (again on the Friedman data), where the prior hyperparameter (ν, q, k, m) choices are varied.
Measuring Variable Importance in BART
[The example tree again, with decision rules on x2 and x5.]
When Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z is fit to data, we can count how many times a predictor is used in the trees.
For example, in the tree here, x2 and x5 are each used once. The importance of each xk can thus be measured by its overall usage frequency. This approach is most effective when the number of trees m is small.
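A minimal sketch of this usage-frequency idea, reusing the nested-dict tree representation from the earlier sketch; in practice the counts would be accumulated over the posterior draws of the trees.

```python
from collections import Counter

def variable_usage(trees, p):
    """Count how often each predictor index 0..p-1 appears in a decision rule,
    and return usage frequencies (larger frequency = more important predictor)."""
    counts = Counter()

    def walk(node):
        if isinstance(node, dict):        # interior node: records its split variable
            counts[node['var']] += 1
            walk(node['left'])
            walk(node['right'])

    for t in trees:
        walk(t)
    total = sum(counts.values())
    return {j: counts.get(j, 0) / total for j in range(p)}

# With the two toy trees from the earlier sketch, x1, x2 and x4 are each used once:
trees = [
    {'var': 0, 'cut': 0.5, 'left': -0.1, 'right': 0.2},
    {'var': 3, 'cut': 0.3,
     'left': {'var': 1, 'cut': 0.7, 'left': 0.0, 'right': 0.05},
     'right': -0.15},
]
print(variable_usage(trees, p=5))
```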
Creating a Competitive Bottleneck
Variable Selection via BART
[Diagram: the predictors x1, …, xi, …, xp feeding into BART to explain Y.]
BART is an automatic attractor of x's to explain Y. Those x's which most explain Y are attracted most. A small number of trees m creates a bottleneck which excludes useless x's.
[Figure: variable usage frequencies as the number of trees m is reduced.]
Variable Selection via BART
Using m = 10 trees, it remains effective with n = 100 and p = 90
BART Offers Estimates of Individual Predictor Effects
Partial Dependence Plot of Crime Effect in Boston Housing Data
These are estimates of f3(x3) = (1/n) ∑i f(x3, xi,c), where xc = x \ x3.
Partial Dependence Plots for the Friedman Example
The Marginal Effects of x1 – x10
Example: HIV Data Analysis
Y = LDHL (log of HDL level)
x's = CD4, Age, Sex, Race, Study, PI1, PI2, NNRTI2, NRTI1, NRTI2, ABI_349, CRC_71, CRC_72, CRC_55, CRC_73, CRC_10, ABI_383, ABI_387, ABI_391, ABI_395, ABI_400, ABI_401, CRC_66, CRC_67, CRC_68, CRC_69
n = 458 patients
For this data, least squares yields R² = 26% while BART yields R² = 42%.
The BART Fit for the HIV Data
BART suggests there is not a strong signal in x for this y.
Variable Selection via BART with m = 5, 10, 20, 50, 200 trees
Partial Dependence Plots Suggest Genotype Effects
For example, the average predictive effect of ABI_383.
Predictive Inference about Interaction of NNRTI2 Treatment and ABI_383 Genotype
There appears to be no interaction effect.
The Football Data
Each observation (n = 245) corresponds to a football game.
y = Team A points - Team B points
29 x's, each the difference between the two teams on some measure, e.g., x10 is average points against (defense) per game, for Team A minus for Team B.
Variable Selection for the Football Data
For each draw, for each variable, calculate the percentage of time that variable is used in a tree; then average over trees.
Subtle point: you can't have too many trees, or variables come in without really doing anything.
Marginal Effects of the Variables
Just used variables 2, 7, 10, and 14. Here are the four univariate partial-dependence plots.
Example: Drug Discovery
Goal: to predict the “activity” of a compound against a biological target.
Data: n = 2604 compounds of which 604 are active. Y = 0 or 1 according to whether the compound is active. x's: 266 variables that characterize the compound's molecular structure.
BART is easily extended to estimate P[Y = 1 | x] by Albert-Chib augmentation with a probit link. We applied BART to rank compounds by their P[Y = 1 | x] estimates.
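A minimal sketch of the Albert-Chib data-augmentation step under a probit link: given the current sum-of-trees fit f, latent variables Z are drawn from normals truncated to match the observed 0/1 outcomes, and the trees are then updated treating Z as the continuous response with σ = 1. The function name `update_trees` is a placeholder, not the BayesTree API.

```python
import numpy as np
from scipy.stats import truncnorm

def draw_latent_z(f_x, y, rng=None):
    """Albert-Chib step: Z_i ~ N(f(x_i), 1) truncated to (0, inf) if y_i = 1
    and to (-inf, 0) if y_i = 0, so that P(Y = 1 | x) = Phi(f(x))."""
    lower = np.where(y == 1, -f_x, -np.inf)   # truncnorm bounds are on the standard scale
    upper = np.where(y == 1, np.inf, -f_x)
    return truncnorm.rvs(lower, upper, loc=f_x, scale=1.0, random_state=rng)

# One conceptual MCMC sweep (update_trees stands in for the sum-of-trees update):
#   z = draw_latent_z(f_x, y)
#   trees = update_trees(trees, X, z, sigma=1.0)
#   f_x = np.array([f_sum_of_trees(x, trees) for x in X])
```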
BART Posterior Intervals for 20 Compounds with Highest Predicted Activity
[Two panels: in-sample and out-of-sample.]
Variable Importance in Drug Discovery with m = 5, 10, 20 trees
[Two panels: all 266 x's, and the top 25 x's.]
A Sketch of the BART MCMC Algorithm
The “parameter” is θ = ((T1,M1), …, (Tm,Mm), σ) in the model
Y = g(x;T1,M1) + g(x;T2,M2) + ... + g(x;Tm,Mm) + σ z
“Simple” Gibbs sampler:
(1) Iteratively sample each (Tj,Mj) given Y, σ, and the other (Tj,Mj)'s (Bayesian backfitting): subtract all but g(x;Tj,Mj) from Y to update (Tj,Mj).
(2) Sample σ given Y and all the (Tj,Mj)'s: subtract all the g(x;Tj,Mj)'s from Y to update σ.
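A minimal sketch of this backfitting loop; `update_tree` and `draw_sigma` are placeholders for the conditional draws described above (the tree move itself is sketched separately below), and `g_tree` is the evaluator from the earlier sketch.

```python
import numpy as np

def bart_gibbs_sweep(X, y, trees, sigma, update_tree, draw_sigma):
    """One sweep of the "simple" Gibbs sampler over theta = ((T1,M1),...,(Tm,Mm), sigma).

    update_tree(X, residual, sigma, tree) and draw_sigma(residual) are placeholders
    for the conditional draws; g_tree comes from the earlier sum-of-trees sketch.
    """
    fits = np.array([[g_tree(x, t) for x in X] for t in trees])   # m x n fitted values
    total_fit = fits.sum(axis=0)
    for j in range(len(trees)):
        # (1) Bayesian backfitting: the partial residual leaves out tree j only.
        partial_residual = y - (total_fit - fits[j])
        trees[j] = update_tree(X, partial_residual, sigma, trees[j])
        fits[j] = np.array([g_tree(x, trees[j]) for x in X])
        total_fit = fits.sum(axis=0)
    # (2) Update sigma given the residuals from all the trees.
    sigma = draw_sigma(y - total_fit)
    return trees, sigma
```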
Simulating each pair from p(T, M | data)
The simulation of each pair (T, M) in (1) is done via
p(T, M | data) = p(T | data) p(M | T, data)
That is, we first margin out M and draw T, then draw M given T.
Sampling M from p(M | T, data) is routine: just simulate µ's from the posterior under our conjugate prior. Sampling from p(T | data) is much more involved: we essentially apply our earlier Bayesian CART algorithm to the residuals obtained as Y minus the sum of the other g(x;Tj,Mj)'s.
Simulating p(T | data) with the Bayesian CART Algorithm
Because p(T | data) is available in closed form (up to a norming constant), we use a Metropolis-Hastings algorithm. Our proposal moves around tree space via local modifications such as growing a node (propose a more complex tree) or pruning (propose a simpler tree).
[Diagram: a grow proposal and a prune proposal.]
Such modifications are accepted according to their compatibility with p(T | data).
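A minimal sketch of the Metropolis-Hastings step over tree space; `log_marginal_lik`, `log_tree_prior`, and `propose` are placeholders for the closed-form quantities and the grow/prune proposal mechanism mentioned above.

```python
import copy
import math
import random

def mh_tree_step(tree, data, log_marginal_lik, log_tree_prior, propose):
    """One Metropolis-Hastings move on T, targeting p(T | data) up to a norming constant.

    propose(tree) must return (new_tree, log_q_forward, log_q_backward) for a local
    modification such as a grow or prune move.
    """
    new_tree, log_q_forward, log_q_backward = propose(copy.deepcopy(tree))
    log_accept = (log_marginal_lik(new_tree, data) + log_tree_prior(new_tree)
                  - log_marginal_lik(tree, data) - log_tree_prior(tree)
                  + log_q_backward - log_q_forward)
    if math.log(random.random()) < log_accept:
        return new_tree    # proposal accepted
    return tree            # proposal rejected: keep the current tree
```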
The Overall BART MCMC Algorithm
Each iteration of the BART algorithm provides a complete update of the BART parameters (T1,M1), …, (Tm,Mm) and σ. This is a Markov chain converging to the posterior. Surprisingly quick burn-in and convergence.
The Redundant, Dynamic Random Basis in Action
As the chain runs, each tree contributes a small part to the fit, and the fit is swapped around from tree to tree. We often observe that an individual tree grows quite large and then collapses back to a single node. This illustrates how each tree is dimensionally adaptive.
Using the MCMC Output to Draw Inference
At iteration i we draw from the posterior of f:
f̂i(·) = g(·; Ti1, Mi1) + g(·; Ti2, Mi2) + ... + g(·; Tim, Mim)
To estimate f(x) we simply average the draws f̂i(x) at x. Posterior uncertainty is captured by the variation of the f̂i(x), e.g., a 95% HPD region estimated by the middle 95% of the values.
Similarly, it is straightforward to estimate functionals of f such as the partial dependence functions
fk(xk) = (1/n) ∑i f(xk, xi,c), where xc = x \ xk.
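A minimal sketch of these summaries, assuming `f_draws` is an (ndraws x n) array of posterior draws f̂i(x) at a set of points and `f_of` evaluates one posterior draw of f at an arbitrary x; both are placeholders for whatever the fitting code returns.

```python
import numpy as np

def posterior_summary(f_draws):
    """Posterior mean and a 95% interval (middle 95% of the draws) at each point;
    f_draws has one row per MCMC draw and one column per point."""
    mean = f_draws.mean(axis=0)
    lower, upper = np.percentile(f_draws, [2.5, 97.5], axis=0)
    return mean, lower, upper

def partial_dependence(f_of, X, k, grid):
    """f_k(x_k) = (1/n) sum_i f(x_k, x_ic): fix coordinate k at each grid value
    and average f over the observed values of the remaining coordinates."""
    values = []
    for val in grid:
        X_mod = X.copy()
        X_mod[:, k] = val                 # overwrite the k-th predictor everywhere
        values.append(np.mean([f_of(x) for x in X_mod]))
    return np.array(values)
```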
A Sketch of the Prior
First, introduce prior independence as follows:
π((T1,M1), …, (Tm,Mm), σ) = [ Π π(Tj,Mj) ] π(σ) = [ Π π(µij | Tj) π(Tj) ] π(σ)
Thus we only need to choose π(T), π(µ | T) = π(µ), and π(σ).
π(T)
We specify a process that grows trees:
Step 1) Grow a tree structure with successive biased coin flips
Step 2) Randomly assign variables to decision nodes
Step 3) Randomly assign splitting rules to decision nodes
[Histogram: the implied marginal prior on the number of bottom nodes (1 through 7). Hyperparameters are chosen to put prior weight on small trees!]
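A minimal sketch of such a tree-growing process. The depth-dependent split probability alpha * (1 + depth)**(-beta) is the form used in the published BART papers (roughly alpha = 0.95, beta = 2 as defaults); treat the exact numbers here as assumptions rather than the talk's settings.

```python
import random

def grow_tree_prior(p, depth=0, alpha=0.95, beta=2.0):
    """Draw a tree structure from the prior: a node at the given depth becomes a
    decision node with probability alpha * (1 + depth) ** (-beta) (the biased coin);
    decision nodes get a uniformly chosen variable and a uniform cutpoint in (0, 1)."""
    if random.random() < alpha * (1 + depth) ** (-beta):
        return {'var': random.randrange(p),                          # Step 2
                'cut': random.random(),                              # Step 3
                'left': grow_tree_prior(p, depth + 1, alpha, beta),  # Step 1, recursively
                'right': grow_tree_prior(p, depth + 1, alpha, beta)}
    return None   # bottom node; its mu is drawn separately from pi(mu | T)

random.seed(0)
print(grow_tree_prior(p=10))   # prior draws are typically very small trees
```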
π(µ | T)
For each bottom node, let µ ~ N(µµ, σµ²), which induces E[Y|x] ~ N(m µµ, m σµ²).
Since we are pretty sure that E[Y|x] ∈ (ymin, ymax), we choose µµ and σµ such that, for a prechosen k (such as 2 or 3),
(m µµ − k√m σµ, m µµ + k√m σµ) = (ymin, ymax).
This is conveniently done by shifting and rescaling y so that (ymin, ymax) = (-0.5, 0.5) and then setting µµ = 0 and k√m σµ = 0.5, i.e. σµ = 0.5/(k√m).
Note how the prior adapts to m: σµ gets smaller as m gets larger. Default choice is k = 2.
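A minimal sketch of this calibration, computing σµ from k and m after rescaling y; the numbers are purely illustrative.

```python
import numpy as np

def mu_prior_scale(y, m, k=2.0):
    """Rescale y to (-0.5, 0.5) and return (y_rescaled, sigma_mu) with
    sigma_mu = 0.5 / (k * sqrt(m)), so that the sum of m bottom-node draws keeps
    E[Y|x] within (y_min, y_max) with high prior probability."""
    y_min, y_max = y.min(), y.max()
    y_rescaled = (y - y_min) / (y_max - y_min) - 0.5
    sigma_mu = 0.5 / (k * np.sqrt(m))
    return y_rescaled, sigma_mu

y = np.array([3.1, 4.7, 2.8, 5.0, 3.9])
y_star, sigma_mu = mu_prior_scale(y, m=200, k=2.0)
print(sigma_mu)   # sigma_mu shrinks as the number of trees m grows
```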
π(σ)
Let σ² ~ νλ/χ²ν and consider ν = 3, 5 or 10.
To set λ, we use a rough overestimate σ̂ of σ based on the data (such as sd(y) or the least-squares estimate from the saturated linear regression). Determine λ by setting a quantile such as .75, .95 or .99 of the prior at this rough estimate.
[Figure: the three priors we would consider when σ̂ = 2.]
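A minimal sketch of solving for λ: with σ² ~ νλ/χ²ν, requiring P(σ < σ̂) = q means P(χ²ν > νλ/σ̂²) = q, so νλ/σ̂² is the (1 − q) quantile of χ²ν. scipy's chi2 supplies the quantile; the numbers are illustrative.

```python
from scipy.stats import chi2

def calibrate_lambda(sigma_hat, nu=3, q=0.95):
    """Choose lambda so that the prior puts probability q on sigma < sigma_hat,
    where sigma^2 ~ nu * lambda / chi2_nu."""
    # P(sigma^2 < sigma_hat^2) = P(chi2_nu > nu * lambda / sigma_hat^2) = q,
    # so nu * lambda / sigma_hat^2 is the (1 - q) quantile of chi2_nu.
    return sigma_hat ** 2 * chi2.ppf(1 - q, df=nu) / nu

print(calibrate_lambda(sigma_hat=2.0, nu=3, q=0.95))
```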
Weakening the Error Distribution Assumption
Simply replace
εi ~ N(0, σ²) iid
by
εi ~ N(0, σi²), σi ~ G, G ~ DP(α, G0)
where DP is the Dirichlet process.
G0 is the same prior we used for the default σ prior. BART MCMC is still well behaved. This is not a model for heteroscedasticity (future work!!), just a more general error distribution.
So, we now have our fully flexible model specification:
Y = f(x) + ε, ε symmetric with mean 0
Issue: is the richer version of BART too flexible to work? Back to the Friedman example.
Original case: σi ≡ 1
Mixture case: σi = 1 with probability .75, σi = 10 with probability .25.
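A minimal sketch of generating the mixture-error version of the Friedman data used in this comparison, reusing `friedman_f` from the earlier sketch; the seed and layout are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.uniform(size=(n, p))

# Original case: homoscedastic N(0, 1) errors.
y_normal = friedman_f(X) + rng.standard_normal(n)

# Mixture case: sigma_i = 1 with probability .75 and sigma_i = 10 with probability .25.
sigma_i = np.where(rng.uniform(size=n) < 0.75, 1.0, 10.0)
y_mixture = friedman_f(X) + sigma_i * rng.standard_normal(n)
```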
RMSE Comparison of Normal vs DP Models
Top pair: normal errors; bottom pair: mixture errors. Left boxplot: normal model; right boxplot: DP model. Similar performance for normal errors; DP is much better for mixture errors.
BART Extensions and Applications
ABREVEYA & MCCULLOCH (2006). Reversal of fortune: A statistical analysis of penalty calls in the national hockey league.
ABU-NIMEH, NAPPA, WANG & NAIR (2008). Bayesian Additive Regression Trees-Based Spam Detection for Enhanced Email Privacy. 2008 Third International Conference on Availability, Reliability and Security.
ABU-NIMEH, NAPPA, WANG & NAIR (2009). Hardening Email Security using Bayesian Additive Regression Trees. Machine Learning.
ABU-NIMEH, NAPPA, WANG & NAIR (2009). Distributed Phishing Detection by Applying Variable Selection using Bayesian Additive Regression Trees. IEEE International Conference on Communications.
CHIPMAN, GEORGE, LEMP & MCCULLOCH (2010). Bayesian Flexible Modeling of Trip Durations. Transportation Research Part B: Methodological.
ELIASBERG, HUI & ZHANG (2010). Green-lighting Movie Scripts: Revenue Forecasting and Risk Management.
GREEN & KERN (2010). Modeling heterogeneous treatment effects in large-scale experiments using Bayesian Additive Regression Trees.
BART Extensions and Applications (cont.)
HILL (2010). Bayesian Nonparametric Modeling for Causal Inference. JCGS.
HILL & ZUBIZARRETA (2010). Causal Inference in Observational Studies with Missing Data: A Bayesian Nonparametric Approach.
KOURTELLOS, TAN & ZHANG (2007). Is the Relationship between Aid and Economic Growth Nonlinear? Journal of Macroeconomics.
WANG & KARR (2010). Preserving data utility via BART. JSPI.
YU, LI, FANG & PENG (2010). An adaptive sampling scheme guided by BART - with an application to predict processor performance. Canadian Journal of Statistics.
YU, MACEACHERN & PERUGGIA (2009). Bayesian Synthesis.
ZHANG & HAERDLE (2010). The Bayesian additive classification tree applied to credit risk modelling. Computational Statistics & Data Analysis.
ZHANG, SHIH & MULLER (2007). A spatially-adjusted Bayesian additive regression tree model to merge two datasets. Bayesian Analysis.
ZHOU & LIU (2008). Extracting sequence features to predict protein-DNA binding: A comparative study. Nucleic Acids Research.
Plenty More to Do
• Further Error Structure Magic
• Monotone BART
• Model Building via BART
Thank You!