Cliff Notes on Ecological Niche Modeling with RandomForest (ensembles) Falk Huettmann EWHALE lab University of Alaska Fairbanks AK 99775 Email fhuettmann@alaska.edu Tel. 907 474 7882


Page 1: Cliff Notes on Ecological Niche Modeling with RandomForest (ensembles) Falk Huettmann EWHALE lab University of Alaska Fairbanks AK 99775 Email fhuettmann@alaska.edu

Cliff Notes on Ecological Niche Modeling with RandomForest (ensembles)

Falk Huettmann
EWHALE lab
University of Alaska
Fairbanks AK 99775

Email fhuettmann@alaska.edu Tel. 907 474 7882

Page 2:

Modeling Ecological Niches

[Figure: two panels. Geographic Space, with axes Longitude and Latitude (Sampling Space); Ecological Space, with axes Environmental factor a and Environmental factor b (Model Space => Predictions).]

Page 3:

A Super Model

LM, GLM, GAM, CART, MARS

NN, GARP

TN, RF

GDM, Maxent, …

=> Ensembles

Page 4:

Linear regression

A starting point…

‘Mean’, SD

One formula capturing the data: y = a + bx

Response Variable ~ Predictor1 (Y ~ X)

[Figure: plot of Y against X.]
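As a rough illustration of "one formula capturing the data", the sketch below fits y = a + bx by ordinary least squares. The function name `fit_line` and the four data points are made up for this example and do not come from the slides.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept
    return a, b

# Made-up illustration data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(x, y)

def predict(x_new):
    """Apply the fitted formula to a new predictor value."""
    return a + b * x_new
```

Once a and b are known, the same formula can be applied to any location where the predictor is known, which is the idea the later slides extend to many predictors.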

Page 6:

Common Ground

A Multiple Regression framework

Response Variable ~ Predictor1 + Predictor2 + Predictor3…

Traditionally, we used 1-5 predictors

But: 1 to 1000s of predictors are possible

‘One single algorithm’ that explains the relationship between response and predictors

The derived relationship can be used to predict at other locations where the predictors are known

Page 7:

GLM vs CART etc.

Linear (~unrealistic): ‘Mean’, SD => potentially low r2

Non-linear (driven by data): ‘Mean’ ? SD ?

CART, TreeNet & RandomForest (there are many other algorithms!)

Page 8:

Our Free Algorithms …

RandomForest: R-Project; Fortran, C …

http://rweb.stat.umn.edu/R/library/randomForest/html/00Index.html

TreeNet:

http://salford-systems.com/products.php (free 30 day trial)

Page 10:

Tree/CART - Family

Leo Breiman 1984, and others

Classification & Regression Tree (CART) => Binary recursive partitioning

[Figure: example tree with binary YES/NO splits on Temp > 15, Precip < 100 and Temp < 5.]

Page 14:

Tree/CART - Family

Binary splits: widely used concept

Multiple splits: rarely used, yet

Binary split recursive partitioning (the same predictor can re-occur elsewhere as a ‘splitter’)

Splits are chosen to maximize homogeneity (minimize variance) within nodes

Stopping rules for the number of branches are based on optimization/cross-validation

Terminal nodes show means (Regression Tree) or categories (Classification Tree)

Leo Breiman 1984, and others

Free of data assumptions! No significances.

[Figure: a Classification Tree with categorical terminal nodes (A, B, C; A, B) and a Regression Tree with numeric terminal nodes (0.3, 3, 0.1; 2, 2.3).]
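The core step of binary recursive partitioning can be sketched as a search for the single split that leaves the two child nodes as homogeneous as possible. The code below is a minimal illustration for a regression tree, not the CART or rpart implementation; the helper names (`sse`, `best_split`) and the four made-up cases are assumptions for this example.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, response):
    """Try every predictor and every midpoint threshold; keep the binary
    split with the smallest summed within-node squared error.
    rows: list of dicts {predictor_name: value}; response: list of floats.
    Returns (predictor, threshold, left_mean, right_mean) or None."""
    best = None
    for name in rows[0]:
        values = sorted({r[name] for r in rows})
        # candidate thresholds: midpoints between neighbouring observed values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, response) if r[name] <= t]
            right = [y for r, y in zip(rows, response) if r[name] > t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, name, t,
                        sum(left) / len(left), sum(right) / len(right))
    return None if best is None else best[1:]

# Made-up cases: the response is low when cold and high when warm
rows = [
    {"temp": 2, "precip": 80},
    {"temp": 4, "precip": 120},
    {"temp": 16, "precip": 90},
    {"temp": 18, "precip": 110},
]
response = [0.1, 0.3, 2.0, 2.3]
split = best_split(rows, response)
```

A full tree would apply `best_split` again to each child node until a stopping rule is met; the terminal-node means are what a regression tree reports.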

Page 16:

CART Salford (rpart in R)

Nice to interpret (e.g. for small trees, or when following specific decision rules through to the end)

[Figure: relative cost (approx. 0.70 to 0.90) versus number of nodes (0 to 500), with the optimum marked.]

Importance Value

Variable    Score
DEM         100.00
TAIR_AUG     77.58
PREC_AUG     69.46
HYDRO        54.59
POP          47.39
LDUSE        40.88

ROC curves for accuracy tests

From withheld test data

e.g. correctly predicted absence approx. 77%

e.g. correctly predicted presence approx. 85%

=> Apply to a dataset for predictions

Page 17:

TreeNet (~a sequence of CARTs): ‘boosting’

[Figure: tree + tree + tree + tree + …]

The more nodes … the more detail … the slower

Many trees make for a ‘net of trees’, or ‘a forest’ => Leo Breiman + Data Mining

Page 18:

TreeNet (~a sequence of CARTs): ‘boosting’

[Figure: tree + tree + tree + tree + …; each explains remaining variance.]

The more nodes … the more detail … the slower

Importance Value

Variable    Score
LDUSE       100.00
TAIR_AUG     97.62
HYDRO        94.35
DEM          94.01
PREC_AUG     90.17
POP          82.54
HMFPT        81.46

[Figure: risk (0.0 to 0.4) versus number of trees (0 to 110).]

ROC curves for accuracy tests

[Figure: ROC plot, Pct. Class 1 (0 to 100) versus Pct. Population (0 to 100).]

e.g. correctly predicted absence approx. 97%

e.g. correctly predicted presence approx. 92%

=> Apply to a dataset for predictions

Difficult to interpret, but good graphs
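The idea that each tree in the sequence explains the variance left over by the previous ones can be sketched with one-split trees (stumps) fitted to residuals. This is a simplified least-squares boosting illustration under stated assumptions, not the TreeNet algorithm itself; the function names, the learning rate of 0.5 and the hump-shaped toy data are made up for this example.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on one predictor (least squares).
    Returns (threshold, left_mean, right_mean)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = (sum((r - lm) ** 2 for r in left)
                + sum((r - rm) ** 2 for r in right))
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    return best[1:]

def boost(x, y, n_trees=50, learning_rate=0.5):
    """Fit stumps in sequence, each to the residuals left by the ones before."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        pred = [p + learning_rate * (lm if xi <= t else rm)
                for xi, p in zip(x, pred)]
    return base, learning_rate, stumps

def predict(model, xi):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if xi <= t else rm)
                      for t, lm, rm in stumps)

def mse(model, x, y):
    """Mean squared error of the model on the given data."""
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Made-up hump-shaped response that a straight line cannot follow
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 1, 1, 0, 0]
baseline_mse = sum((yi - sum(y) / len(y)) ** 2 for yi in y) / len(y)
mse_1 = mse(boost(x, y, n_trees=1), x, y)
mse_50 = mse(boost(x, y, n_trees=50), x, y)
```

On this toy data the training error falls as trees are added, which mirrors the risk-versus-number-of-trees plot; in practice the number of trees would be chosen on withheld test data.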

Page 21:

TreeNet: Graphic Output example

[Figure: response curve. Bear occurrence (partial dependence, yes/no) against distance to lake (m). Annotations on the plot: ‘?’ or ‘?’.]

(The function above is virtually impossible to fit with linear algorithms => misleading coefficients, e.g. from LMs, GLMs)
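A response curve of this kind can be approximated for any fitted model by the usual partial-dependence recipe: fix one predictor at a grid value in every row, predict, and average. The sketch below assumes a generic `predict` callable; `toy_model`, the predictor names and the numbers are hypothetical stand-ins, not the bear model from the slide.

```python
def partial_dependence(predict, rows, name, grid):
    """For each grid value, set predictor `name` to that value in every row,
    predict, and average the predictions over all rows."""
    curve = []
    for value in grid:
        preds = [predict({**row, name: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve

def toy_model(row):
    """Hypothetical stand-in for a fitted model: occurrence is likely only
    at intermediate distances to a lake, slightly more so at low elevation."""
    near_enough = 200 <= row["dist_lake_m"] <= 1000
    return (0.8 if near_enough else 0.1) + (0.1 if row["elev_m"] < 500 else 0.0)

rows = [
    {"dist_lake_m": 50, "elev_m": 300},
    {"dist_lake_m": 700, "elev_m": 900},
]
curve = partial_dependence(toy_model, rows, "dist_lake_m", [0, 500, 2000])
```

The resulting curve rises and then falls again, a shape that a single linear coefficient cannot describe.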

Page 25:

RandomForest (Prasad et al. 2006, Furlanello et al. 2003, Breiman 2001)

Boosting & Bagging algorithms (~Ensemble)

[Figure: data table with cases 1 to 5 as rows and the predictors DEM, Slope, Aspect, Climate, Land-cover as columns. Each tree is grown on a random set of rows (cases) and a random set of columns (predictors): random set 1, random set 2, …]

Average final tree from e.g. >2000 trees, done by VOTING

Bagging: optimization based on in-bag and out-of-bag samples

In RF no pruning => difficult to overfit (robust)

Handles ‘noise’, interactions and categorical data fine!

Difficult to interpret, but good graphs
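The three ingredients named on this slide (random rows, random columns, voting) can be sketched in a few lines. This is only an illustration of the sampling and voting steps, with no trees actually grown; the function names and the seed are assumptions. Note also that in Breiman's formulation the predictor subset is redrawn at every split, whereas the sketch draws it once per tree for brevity.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (in-bag); rows that were never
    drawn form the out-of-bag set used for internal accuracy checks."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

def random_predictors(names, k, rng):
    """Random subset of k predictor columns."""
    return rng.sample(names, k)

def majority_vote(votes):
    """Class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(42)  # fixed seed so the illustration is repeatable
predictors = ["DEM", "Slope", "Aspect", "Climate", "Land-cover"]
for tree in range(3):
    in_bag, oob = bootstrap_indices(5, rng)
    cols = random_predictors(predictors, 2, rng)
    print(tree, in_bag, oob, cols)

# Combine the per-tree predictions for one case
print(majority_vote(["presence", "absence", "presence"]))
```

Each tree would be trained on its in-bag rows and scored on its out-of-bag rows, which is how a forest estimates its own accuracy without a separate test set.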

Page 28:

RandomForest and GIS: Spatial Modeling

Table (Response, Predictors) => RandomForest (quantification): Train & Develop Model => Apply Model to GIS Overlays => GIS Visualization of Predictions

aaahhhhuuhhhh ?! - Makes sense because of... - No, wait a minute, that’s wrong…

Page 29:

RandomForest: Why so good and usable?

Allows for:

Multivariate work (100s of predictors)

Best possible predictions

Best possible clustering (without a response variable)

Tracking of complex interactions

Predictor ranking

Handling noisy data

Fast & convenient applications

Multiple (!) response variables!

Algorithms: RandomForest (R, Fortran, Salford), YAIMPUTE (R), PARTY (R), …

=> Change in World’s Science

Page 30:

What to read, for instance…

http://www.stat.berkeley.edu/~breiman/RandomForests/

Breiman, L. 2001. Statistical modeling: the two cultures. Statistical Science 16(3): 199-231.

Craig, E., and F. Huettmann. 2008. Using “blackbox” algorithms such as TreeNet and Random Forests for data-mining and for finding meaningful patterns, relationships and outliers in complex ecological data: an overview, an example using golden eagle satellite data and an outlook for a promising future. Chapter IV in Intelligent Data Analysis: Developing New Methodologies through Pattern Discovery and Recovery (Hsiao-fan Wang, Ed.). IGI Global, Hershey, PA, USA.

Magness, D.R., F. Huettmann, and J.M. Morton. 2008. Using Random Forests to provide predicted species distribution maps as a metric for ecological inventory & monitoring programs. Pages 209-229 in T.G. Smolinski, M.G. Milanova & A-E. Hassanien (eds.). Applications of Computational Intelligence in Biology: Current Trends and Open Problems. Studies in Computational Intelligence, Vol. 22, Springer-Verlag Berlin Heidelberg. 428 pp.

Prasad, A.M., L.R. Iverson, and A. Liaw. 2006. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9: 181-199.

(and Hastie & Tibshirani, Furlanello et al. 2003, Elith et al. 2006 etc.)

Page 31:

From now on, simply referred to as …

Page 32:

A Super Model

LM, GLM, CART, MARS

NN, GARP

TN, RF

GDM, Maxent, …

=> Ensembles

Page 33:

Some Super Models: Ensembles

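LM, GLM, CART, MARS

NN, GARP

TN, RF

GDM, Maxent, …

Find the best model for a given section of your data => the best possible fit & prediction

[Figure: Ivory Gull presence/absence (Pres/Abs) against predictors, with different sections of the data fitted by different models (RF, LM, log, poly).]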

Page 35:

On Greyboxes, Philosophy and Science

Data => Algorithm with a Known Behavior => (Data Mining) Prediction & Accuracy

Such a statistical relationship will be found by either CART, TN, RF or LM, GLM

Page 37:

On Greyboxes, Philosophy and Science

[Figure: model performance (0% to 100%) over time, from GLM to ANN to Boosting, Bagging …; improvement increases.]

Data => Algorithm with a Known Behavior => (Data Mining) Prediction & Accuracy

GLMs as a blackbox!? YES. Just think of software implementations, Max-Likelihood, Model Fitting, AIC and Research Design (sensu Keating & Cherry 1994)

Page 38:

Parsimony, Inference and Prediction ?!

Sole focus on predictions and their accuracy, whereas…

…R2, p-values and traditional inference (variable rankings, AIC) are of lower relevance

Why Parsimony ?

No real need for optimizing the fit and for parsimony when prediction is the goal

Global accuracy metrics: ROC, AUC, kappa, meta analysis … (instead of p-values and significance levels or AIC)
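Two of the accuracy metrics named above can be computed directly from observed presence/absence labels and model scores. The sketch below uses the standard rank-based definition of AUC and the standard Cohen's kappa formula; the function names, the 0.5 threshold and the six made-up cases are assumptions for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the share of presence/absence pairs in which the presence case gets
    the higher score (ties count one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cohen_kappa(labels, predicted):
    """Agreement between observed and predicted classes, corrected for the
    agreement expected by chance."""
    n = len(labels)
    observed = sum(l == p for l, p in zip(labels, predicted)) / n
    classes = set(labels) | set(predicted)
    expected = sum((labels.count(c) / n) * (predicted.count(c) / n)
                   for c in classes)
    return (observed - expected) / (1 - expected)

# Made-up withheld test data: 1 = presence, 0 = absence
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

auc_value = auc(labels, scores)
kappa_value = cohen_kappa(labels, predicted)
```

AUC is threshold-free, while kappa depends on the cut-off used to turn scores into presence/absence, so the two are usually reported together.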

[Figure: relative cost (approx. 0.70 to 0.90) versus number of nodes (0 to 500).]