Cliff Notes on Ecological Niche Modeling with RandomForest (ensembles)

Falk Huettmann, EWHALE lab

University of Alaska, Fairbanks, AK 99775

Email [email protected] Tel. 907 474 7882

Modeling Ecological Niches

[Figure: geographic space (sampling space; axes: longitude, latitude) mapped to ecological space (model space; axes: environmental factor a, environmental factor b) => predictions]

A Super Model

LM, GLM, GAM, CART, MARS, NN, GARP, TN, RF, GDM, Maxent…

=> Ensembles

Linear regression

A starting point…

‘Mean’, SD

One formula capturing the data: y = a + bx

Response Variable ~ Predictor1 (Y ~ X)

[Figure: plot of Y against X]
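The single formula y = a + bx can be illustrated with a minimal least-squares sketch. The data and the names `fit_line`, `temperature` and `abundance` are hypothetical and only serve the illustration; they are not from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # intercept
    return a, b


# Hypothetical data: one predictor (X) and one response (Y), exactly y = 1 + 2x
temperature = [2.0, 4.0, 6.0, 8.0, 10.0]
abundance = [5.0, 9.0, 13.0, 17.0, 21.0]

a, b = fit_line(temperature, abundance)


def predict(x):
    """Apply the derived relationship to a new predictor value."""
    return a + b * x
```

Once a and b are estimated, `predict` can be applied to any location where the predictor is known, which is the same idea the later slides apply with many predictors.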

Common Ground

A Multiple Regression framework

Response Variable ~ Predictor1 + Predictor2 + Predictor3…

Traditionally, we used 1-5 predictors

But: 1 to 1000s of predictors are possible

‘One single algorithm’ that explains the relationship between response and predictors

The derived relationship can be predicted to other locations with known predictors

GLM vs CART etc.

Linear (~unrealistic): ‘Mean’, SD => potentially low r2

Non-linear (driven by data): ‘Mean’? SD? CART, TreeNet & RandomForest (there are many other algorithms!)

Our Free Algorithms …

R-Project: RandomForest (Fortran, C …)

http://rweb.stat.umn.edu/R/library/randomForest/html/00Index.html

TreeNet (free 30 day trial)

http://salford-systems.com/products.php

Tree/CART - Family

Classification & Regression Tree (CART) => Binary recursive partitioning

Leo Breiman 1984, and others

[Figure: example tree with binary YES/NO splits on Temp>15, Precip<100 and Temp<5]

Tree/CART - Family

Binary splits: widely used concept

Multiple splits: rarely used, yet

Binary split recursive partitioning (the same predictor can re-occur elsewhere as a ‘splitter’)

Splits maximize the homogeneity of the resulting nodes (minimal within-node variance)

Stopping rules for the number of branches are based on optimization/cross-validation

Terminal nodes show means (regression tree) or categories (classification tree)

Free of data assumptions! No significances.

[Figure: classification tree with terminal nodes A, B, C, A, B; regression tree with terminal node means 0.3, 3, 0.1, 2, 2.3]
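Binary recursive partitioning for a regression tree can be sketched in a few lines. This is a simplified illustration, not the CART or rpart implementation: the stopping rule is a plain minimum node size (`min_size`, an assumption) rather than cross-validation, and the data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (within-node variance, unscaled)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, ys):
    """Search every predictor and threshold for the split with the lowest total SSE."""
    best = None  # (total_sse, column, threshold)
    for col in range(len(rows[0])):
        for t in sorted(set(r[col] for r in rows))[1:]:
            left = [y for r, y in zip(rows, ys) if r[col] < t]
            right = [y for r, y in zip(rows, ys) if r[col] >= t]
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, col, t)
    return best


def grow(rows, ys, min_size=2):
    """Binary recursive partitioning; terminal nodes hold the mean response."""
    # Simplified stopping rule: nodes with fewer than 2 * min_size cases are not split.
    split = best_split(rows, ys) if len(ys) >= 2 * min_size else None
    if split is None or split[0] >= sse(ys):
        return sum(ys) / len(ys)
    _, col, t = split
    left = [(r, y) for r, y in zip(rows, ys) if r[col] < t]
    right = [(r, y) for r, y in zip(rows, ys) if r[col] >= t]
    return (col, t,
            grow([r for r, _ in left], [y for _, y in left], min_size),
            grow([r for r, _ in right], [y for _, y in right], min_size))


def predict(tree, row):
    """Follow the decision rules until a terminal node is reached."""
    while isinstance(tree, tuple):
        col, t, left, right = tree
        tree = left if row[col] < t else right
    return tree


# Hypothetical cases: (temperature, precipitation) and a numeric response
rows = [(3, 50), (4, 150), (10, 60), (12, 140), (18, 70), (20, 160)]
ys = [0.1, 0.1, 2.0, 2.0, 3.0, 3.0]
tree = grow(rows, ys)
```

In this toy data set temperature is chosen as the splitter twice (at 10 and at 18), which illustrates that the same predictor can re-occur further down the tree.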

CART Salford (rpart in R)

Nice to interpret (e.g. for small trees, or when following through specific decision rules until the end)

[Figure: relative cost (y-axis, 0.70 to 0.90) against number of nodes (x-axis, 0 to 500)]

Importance Value

DEM        100.00
TAIR_AUG    77.58
PREC_AUG    69.46
HYDRO       54.59
POP         47.39
LDUSE       40.88

CART Salford (rpart in R)

ROC curves for accuracy tests (from withheld test data)

e.g. correctly predicted absence approx. 77%

e.g. correctly predicted presence approx. 85%

=> Apply to a dataset for predictions

[Figure: ROC curve from withheld test data; label: Optimum]
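The accuracy figures above (correctly predicted presence and absence, and the ROC curve) can be computed from withheld test data roughly as follows. The observations, scores and the 0.5 threshold are hypothetical; the AUC is computed here in its rank form, as the share of presence/absence pairs in which the presence receives the higher score.

```python
def confusion_rates(y_true, scores, threshold):
    """Return (share of presences correctly predicted, share of absences correctly predicted)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Area under the ROC curve; ties between a presence and an absence count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical withheld test data: 1 = presence, 0 = absence
observed = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

presence_rate, absence_rate = confusion_rates(observed, predicted, 0.5)
area = auc(observed, predicted)
```

The two rates depend on the chosen threshold, whereas the AUC summarizes the ranking over all thresholds.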

TreeNet (~a sequence of CARTs): ‘boosting’

[Figure: a sequence of trees added together (+ + + +)]

The more nodes… the more detail… the slower

Many trees make for a ‘net of trees’, or ‘a forest’ => Leo Breiman + Data Mining

TreeNet (~a sequence of CARTs): ‘boosting’

Importance Value

Variable    Score
LDUSE       100.00
TAIR_AUG     97.62
HYDRO        94.35
DEM          94.01
PREC_AUG     90.17
POP          82.54
HMFPT        81.46

[Figure: risk (y-axis, 0.0 to 0.4) against number of trees (x-axis, 0 to 110)]

[Figure: Pct. Class 1 (y-axis, 0 to 100) against Pct. Population (x-axis, 0 to 100)]

ROC curves for accuracy tests

e.g. correctly predicted absence approx. 97%

e.g. correctly predicted presence approx. 92%

=> Apply to a dataset for predictions

Each tree explains the remaining variance

Difficult to interpret, but good graphs
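The idea that each tree in the sequence explains what the previous trees left unexplained can be sketched as boosting with one-split trees (stumps) on the residuals. This is a generic squared-error boosting sketch, not the TreeNet implementation; `learning_rate`, `n_trees` and the data are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """One-split regression tree on a single predictor, minimizing squared error."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=100, learning_rate=0.1):
    """Fit stumps in sequence; each one is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    # risk[k] is the mean squared training error after k trees
    risk = [sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)]
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x < t else right_mean)
                 for x, p in zip(xs, preds)]
        risk.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, risk


def predict(base, stumps, x, learning_rate=0.1):
    """Sum of the base value and all (shrunken) stump contributions."""
    return base + sum(learning_rate * (lm if x < t else rm) for t, lm, rm in stumps)


# Hypothetical data: occurrence is high only at intermediate distances
distance = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]
occurrence = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
base, stumps, risk = boost(distance, occurrence)
```

Plotting `risk` against the number of trees gives a curve of the same kind as the risk plot above, although here it is training error rather than error on withheld data.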

TreeNet: Graphic Output example

Response Curve

[Figure: response curve; x-axis: Distance to Lake (m); y-axis: Bear Occurrence (Partial Dependence), from ‘no’ to ‘yes’]

(the function above is virtually impossible to fit in linear algorithms => misleading coefficients, e.g. from LMs, GLMs)
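A partial dependence curve like the response curve above can be computed for any fitted model: one predictor is set to each value on a grid in turn, for all cases, and the predictions are averaged. In this sketch `toy_model` is a hypothetical stand-in for a fitted model, and the rows and grid are made up.

```python
def partial_dependence(model, rows, col, grid):
    """Average prediction when predictor `col` is fixed at each grid value for all rows."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[col] = value      # overwrite one predictor, keep the others
            total += model(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical model: occurrence is high only at intermediate distances to a lake."""
    distance_m, elevation_m = row
    near_lake = 1.0 if 200 <= distance_m <= 800 else 0.0
    return 0.8 * near_lake + 0.0001 * elevation_m


# Hypothetical cases: (distance to lake in m, elevation in m)
rows = [(100, 500), (400, 1000), (900, 1500), (1500, 2000)]
grid = [0, 500, 1000]
curve = partial_dependence(toy_model, rows, col=0, grid=grid)
```

The resulting curve rises and then falls again, a shape that a single linear coefficient cannot represent.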

RandomForest (Prasad et al. 2006, Furlanello et al. 2003, Breiman 2001)

‘Boosting & Bagging’ algorithms (~Ensemble)

[Figure: data table with rows (cases) 1 to 5 and predictor columns DEM, Slope, Aspect, Climate, Land-cover; each tree is built from a random set of rows (cases) and a random set of columns (predictors), giving random set 1, random set 2, …]

Average final tree from e.g. >2000 trees, done by VOTING

Boosting & Bagging algorithms

Bagging: optimization based on in-bag and out-of-bag samples

In RF no pruning => difficult to overfit (robust)

Handles ‘noise’, interactions and categorical data fine!
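The bagging scheme described above (random sets of rows, random sets of columns, voting, and out-of-bag samples) can be sketched as follows. This is a simplified illustration and not the randomForest implementation: each tree is only a one-split stump, and the random columns are drawn once per tree as in the slide's diagram, whereas Breiman's algorithm draws them at every split. All data and parameter values are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, cols):
    """Best single binary split (fewest misclassified cases) among the given columns."""
    best = None
    for col in cols:
        for t in sorted(set(r[col] for r in rows))[1:]:
            left = [y for r, y in zip(rows, labels) if r[col] < t]
            right = [y for r, y in zip(rows, labels) if r[col] >= t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, col, t, l_lab, r_lab)
    if best is None:  # no split possible: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("-inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    col, t, l_lab, r_lab = stump
    return l_lab if row[col] < t else r_lab


def random_forest(rows, labels, n_trees=200, n_cols=2, seed=1):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    oob_votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]      # random set of rows, with replacement
        cols = rng.sample(range(len(rows[0])), n_cols)     # random set of columns
        stump = fit_stump([rows[i] for i in in_bag], [labels[i] for i in in_bag], cols)
        forest.append(stump)
        for i in set(range(n)) - set(in_bag):              # out-of-bag rows test this tree
            oob_votes[i][predict_stump(stump, rows[i])] += 1
    voted = [(i, v.most_common(1)[0][0]) for i, v in enumerate(oob_votes) if v]
    oob_error = sum(labels[i] != lab for i, lab in voted) / len(voted)
    return forest, oob_error


def predict(forest, row):
    """Final prediction by majority vote over all trees."""
    return Counter(predict_stump(s, row) for s in forest).most_common(1)[0][0]


# Hypothetical cases: (DEM, slope, aspect); presence (1) occurs at higher elevations
rows = [(100, 5, 90), (200, 20, 180), (300, 10, 270), (400, 30, 0),
        (600, 8, 45), (700, 25, 135), (800, 12, 225), (900, 28, 315)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest, oob_error = random_forest(rows, labels)
```

The out-of-bag error comes for free: every case is left out of roughly a third of the bootstrap samples, so each tree can be tested on cases it never saw, without a separate test set.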


RandomForest and GIS: Spatial Modeling

[Figure: workflow. GIS overlays => table of response and predictors => RandomForest (quantification): train & develop model => apply model => GIS visualization of predictions]

aaahhhhuuhhhh?! - Makes sense because of... - No, wait a minute, that’s wrong…


RandomForest: Why so good and useable?

Allows for:

Works multivariate (100s of predictors)

Best Possible Predictions

Best Possible Clustering (without a response variable)

Tracking of Complex Interactions

Predictor Ranking

Handling Noisy Data

Fast & convenient applications

Allows for multiple (!) response variables!

Algorithms: RandomForest (R, Fortran, Salford), YAIMPUTE (R), PARTY (R) …

=> Change in World’s Science
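Predictor ranking, one of the items listed above, is commonly done by permutation: one predictor column is shuffled and the resulting drop in accuracy is taken as that predictor's importance. The sketch below is a generic illustration under that assumption; `toy_model` is a hypothetical stand-in for a fitted model and the data are made up.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=1):
    """Mean drop in accuracy when one predictor column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importance = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each case with the shuffled value in column `col`
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importance.append(sum(drops) / n_repeats)
    return importance


def toy_model(row):
    """Hypothetical model: presence above 500 m elevation; slope is ignored."""
    dem, slope = row
    return 1 if dem > 500 else 0


rows = [(100, 5), (200, 20), (300, 10), (400, 30),
        (600, 8), (700, 25), (800, 12), (900, 28)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

scores = permutation_importance(toy_model, rows, labels)
ranking = sorted(zip(["DEM", "SLOPE"], scores), key=lambda p: p[1], reverse=True)
```

A predictor the model never uses (slope in this toy case) gets an importance of zero, because shuffling it cannot change any prediction.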

What to read, for instance…

http://www.stat.berkeley.edu/~breiman/RandomForests/

Breiman, L. 2001. Statistical modeling: the two cultures. Statistical Science 16(3): 199-231.

Craig, E., and F. Huettmann. 2008. Using “blackbox” algorithms such as TreeNet and Random Forests for data-mining and for finding meaningful patterns, relationships and outliers in complex ecological data: an overview, an example using golden eagle satellite data and an outlook for a promising future. Chapter IV in Intelligent Data Analysis: Developing New Methodologies through Pattern Discovery and Recovery (Hsiao-fan Wang, Ed.). IGI Global, Hershey, PA, USA.

Magness, D.R., F. Huettmann, and J.M. Morton. 2008. Using Random Forests to provide predicted species distribution maps as a metric for ecological inventory & monitoring programs. Pages 209-229 in T.G. Smolinski, M.G. Milanova & A-E. Hassanien (eds.). Applications of Computational Intelligence in Biology: Current Trends and Open Problems. Studies in Computational Intelligence, Vol. 22, Springer-Verlag Berlin Heidelberg. 428 pp.

Prasad, A.M., L.R. Iverson, and A. Liaw. 2006. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9: 181-199.

(and Hastie & Tibshirani, Furlanello et al. 2003, Elith et al. 2006 etc.)

From now on, simply referred to as …

A Super Model

LM, GLM, CART, MARS, NN, GARP, TN, RF, GDM, Maxent…

=> Ensembles

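Some Super Models: Ensembles

LM, GLM, CART, MARS, NN, GARP, TN, RF, GDM, Maxent…

Find the best model for a given section of your data => the best possible fit & prediction

[Figure: Ivory Gull example; presence/absence (Pres/Abs) against predictors, with different sections of the data fitted by RF, LM, log and poly models]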

On Greyboxes, Philosophy and Science

[Figure: Data => Algorithm with a Known Behavior => (Data Mining) Prediction & Accuracy]

Such a statistical relationship will be found by either CART, TN, RF or LM, GLM

GLMs as a blackbox!? YES. Just think of software implementations, Max-Likelihood, Model Fitting, AIC and Research Design (sensu Keating & Cherry 1994)

[Figure: model performance (0% to 100%) over time, from GLM to ANN to Boosting, Bagging …; improvement increases]

Parsimony, Inference and Prediction?!

Sole focus on predictions and their accuracies, whereas…

…R2, p-values and traditional inference (variable rankings, AIC) are of lower relevance

Why Parsimony?

No real need for optimizing the fit and for parsimony when prediction is the goal

Global accuracy metrics: ROC, AUC, kappa, meta-analysis … (instead of p-values and significance levels or AIC)
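Of the global accuracy metrics named above, kappa is easy to compute by hand: it compares the observed agreement between predictions and observations with the agreement expected by chance. The sketch below uses Cohen's kappa on hypothetical presence/absence data.

```python
def cohens_kappa(observed, predicted):
    """Cohen's kappa: agreement between observed and predicted classes, corrected for chance."""
    n = len(observed)
    classes = set(observed) | set(predicted)
    p_observed = sum(o == p for o, p in zip(observed, predicted)) / n
    p_chance = sum((observed.count(c) / n) * (predicted.count(c) / n) for c in classes)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical withheld test data: 1 = presence, 0 = absence
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
kappa = cohens_kappa(observed, predicted)
```

Here 8 of 10 cases agree (0.8), chance agreement is 0.52, and kappa is therefore 0.28 / 0.48, about 0.58.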
