
    Ecological Modelling 135 (2000) 147–186

    Predictive habitat distribution models in ecology

    Antoine Guisan a,*, Niklaus E. Zimmermann b,1

    a Swiss Center for Faunal Cartography (CSCF), Terreaux 14, CH-2000 Neuchâtel, Switzerland
    b Swiss Federal Research Institute WSL, Zuercherstr. 111, 8903 Birmensdorf, Switzerland

    Received 5 October 1999; received in revised form 25 May 2000; accepted 12 July 2000

    Abstract

    With the rise of new powerful statistical techniques and GIS tools, the development of predictive habitat distribution models has rapidly increased in ecology. Such models are static and probabilistic in nature, since they statistically relate the geographical distribution of species or communities to their present environment. A wide array of models has been developed to cover aspects as diverse as biogeography, conservation biology, climate change research, and habitat or species management. In this paper, we present a review of predictive habitat distribution modeling. The variety of statistical techniques used is growing. Ordinary multiple regression and its generalized form (GLM) are very popular and are often used for modeling species distributions. Other methods include neural networks, ordination and classification methods, Bayesian models, locally weighted approaches (e.g. GAM), environmental envelopes or even combinations of these models. The selection of an appropriate method should not depend solely on statistical considerations. Some models are better suited to reflect theoretical findings on the shape and nature of the species' response (or realized niche). Conceptual considerations include e.g. the trade-off between optimizing accuracy versus optimizing generality. In the field of static distribution modeling, the latter is mostly related to selecting appropriate predictor variables and to designing an appropriate procedure for model selection. New methods, including threshold-independent measures (e.g. receiver operating characteristic (ROC)-plots) and resampling techniques (e.g. bootstrap, cross-validation) have been introduced in ecology for testing the accuracy of predictive models. The choice of an evaluation measure should be driven primarily by the goals of the study. This may possibly lead to the attribution of different weights to the various types of prediction errors (e.g. omission, commission or confusion). Testing the model in a wider range of situations (in space and time) will permit one to define the range of applications for which the model predictions are suitable. In turn, the qualification of the model depends primarily on the goals of the study that define the qualification criteria and on the usability of the model, rather than on statistics alone. © 2000 Elsevier Science B.V. All rights reserved.

    Keywords: Biogeography; Plant ecology; Vegetation models; Species models; Model formulation; Model calibration; Model predictions; Model evaluation; Model credibility; Model applicability; GIS; Statistics

    www.elsevier.com/locate/ecolmodel

    Abbreviations: CART, classification and regression trees; CC, climate change; CCA, canonical correspondence analysis; DEM, digital elevation model; ENFA, ecological niche factor analysis; GAM, generalized additive model; GLM, generalized linear model; LS, least squares; PCA, principal component analysis; RDA, redundancy analysis.

    * Corresponding author. Tel.: +41-32-7249297; fax: +41-32-7177969. E-mail addresses: [email protected] (A. Guisan), [email protected] (N.E. Zimmermann). 1 Co-corresponding author.

    0304-3800/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved.

    PII: S0304-3800(00)00354-9


    1. Introduction

    The analysis of species–environment relationship has always been a central issue in ecology. The importance of climate to explain animal and plant distribution was recognized early on (Humboldt and Bonpland, 1807; de Candolle, 1855). Climate in combination with other environmental factors has been much used to explain the main vegetation patterns around the world (e.g. Salisbury, 1926; Cain, 1944; Good, 1953; Holdridge, 1967; McArthur, 1972; Box, 1981; Stott, 1981; Walter, 1985; Woodward, 1987; Ellenberg, 1988). The quantification of such species–environment relationships represents the core of predictive geographical modeling in ecology. These models are generally based on various hypotheses as to how environmental factors control the distribution of species and communities.

    Besides its prime importance as a research tool in autecology, predictive geographical modeling recently gained importance as a tool to assess the impact of accelerated land use and other environmental change on the distribution of organisms (e.g. climate – Lischke et al., 1998; Kienast et al., 1995, 1996, 1998; Guisan and Theurillat, 2000), to test biogeographic hypotheses (e.g. Mourell and Ezcurra, 1996; Leathwick, 1998), to improve floristic and faunistic atlases (e.g. Hausser, 1995) or to set up conservation priorities (Margules and Austin, 1994). A variety of statistical models is currently in use to simulate either the spatial distribution of terrestrial plant species (e.g. Hill, 1991; Buckland and Elston, 1993; Carpenter et al., 1993; Lenihan, 1993; Huntley et al., 1995; Shao and Halpin, 1995; Franklin, 1998; Guisan et al., 1998, 1999), aquatic plants (Lehmann et al., 1997; Lehmann, 1998), terrestrial animal species (e.g. Pereira and Itami, 1991; Aspinall, 1992; Augustin et al., 1996; Corsi et al., 1999; Mace et al., 1999; Manel et al., 1999; Mladenoff et al., 1995, 1999), fishes (Lek et al., 1996; Mastrorillo et al., 1997), plant communities (e.g. Fischer, 1990; Brzeziecki et al., 1993; Zimmermann and Kienast, 1999), vegetation types (e.g. Brown, 1994; Van de Rijt et al., 1996), plant functional types (e.g. Box, 1981, 1995, 1996), biomes and vegetation units of similar complexity (Monserud and Leemans, 1992; Prentice et al., 1992; Tchebakova et al., 1993, 1994; Neilson, 1995), plant biodiversity (e.g. Heikkinen, 1996; Wohlgemuth, 1998), or animal biodiversity (Owen, 1989; Fraser, 1998) (see also Scott et al., in press for numerous additional examples of plant and animal species distribution models). Such static, comparative, models are opposed to more mechanistic models of ecosystem processes (Peters, 1991; Jones, 1992; Pickett et al., 1994; Lischke et al., 1998). Since only very few species have been studied in detail in terms of their dynamic responses to environmental change, static distribution modeling often remains the only approach for studying the possible consequences of a changing environment on species distribution (Woodward and Cramer, 1996).

    The development of predictive models is coherent with Peters's (1991, p. 274) view of a "more rigorously scientific, more informative and more useful ecology". The use of, and theoretical limitations of, static models compared with dynamic approaches have been described in several papers (e.g. Decoursey, 1992; Korzukhin et al., 1996; Lischke et al., 1998). Franklin (1995) provides a review of some currently used statistical techniques in vegetation and plant species modeling. Particular aspects of model development (e.g. verification, calibration, evaluation (= validation), qualification) have been covered in more specific papers, with very special attention given in recent years to evaluation and its usefulness for testing ecological models (e.g. Loehle, 1983; Oreskes et al., 1994; Rykiel, 1996). Since then, new statistical techniques for calibrating and testing predictive models have emerged (see e.g. Scott et al., in press).

    The aim of this paper is to review the various steps of predictive modeling, from the conceptual model formulation to prediction and application (Fig. 1). We discuss the importance of differentiating between model formulation, model calibration, and model evaluation. Additionally, we provide an overview of specific analytical, statistical methods currently in use.


    Fig. 1. Overview of the successive steps (1–5) of the model building process, when two data sets – one for fitting and one for evaluating the model – are available. Model evaluation is either made: (a) on the calibration data set using bootstrap, cross-validation or Jack-knife techniques; (b) on the independent data set, by comparing predicted to observed values using preferentially a threshold-independent measure (such as the ROC-plot approach for presence/absence models).
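    The evaluation step sketched in Fig. 1 can be illustrated with a minimal example. The sketch below assumes a presence/absence data set held in arrays X (predictors) and y (0/1 observations) and uses scikit-learn, which is not mentioned in the paper, purely to show k-fold cross-validation scored with the threshold-independent area under the ROC curve.

        # Minimal sketch (not from the paper): k-fold cross-validation of a
        # presence/absence model, scored with the threshold-independent ROC AUC.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 3))                         # hypothetical environmental predictors
        y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)  # hypothetical presence/absence data

        model = LogisticRegression(max_iter=1000)
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print("cross-validated AUC per fold:", np.round(auc, 2))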


    2. Conceptual model formulation

    The process, which ends with the formulation of an ecological model, usually starts from an underlying ecological concept (e.g. the pseudo-equilibrium assumption in static distribution modeling). We consider it crucial to base the formulation of an ecological model on an underlying conceptual framework. Hereafter, we discuss a selection of important conceptual aspects.

    2.1. General patterns in the geographical distribution of species

    The core theory of predictive modeling of biotic entities originates from major trends published in the field of biogeography. Here, our aim is not to summarize all the patterns and processes of geographic range limitation, which are best provided by specific review papers (e.g. Brown et al., 1996), but to illustrate the link with the conceptual model formulation through examples.

    A matter of primary interest is the relative importance of biotic versus abiotic factors at the margins of a species' range. Brown et al. (1996) recall that "in most ecological gradients, the majority of species appear to find one direction to be physically stressful and the other to be biologically stressful". This was stressed for an elevation gradient (Guisan et al., 1998) and already suggested for latitudinal gradients by Dobzhansky (1950) and McArthur (1972). In a more general way, physical limits are caused by environmental and physiological constraints (i.e. direct and resource gradients in the sense of Austin et al. (1984) and Austin and Gaywood (1994)) under suboptimal conditions along these gradients (e.g. too cold, too dry). A discussion on (1) using causal rather than non-causal factors and (2) considering inter-species competition for fitting a static model follows in the next sections and in the final perspectives.


    2.2. Generality, reality, and precision

    Nature is too complex and heterogeneous to be predicted accurately in every aspect of time and space from a single, although complex, model. Levins (1966) formulated the principle that only two out of three desirable model properties (generality, reality, precision) can be improved simultaneously (Fig. 2), while the third property has to be sacrificed. This trade-off leads to a distinction of three different groups of models (Sharpe, 1990; Prentice and Helmisaari, 1991; Korzukhin et al., 1996), and its associated constraints are consequential when selecting modeling approaches for specific project goals.

    The first group of models (i) focuses on generality and precision. Such models are called analytical (Pickett et al., 1994) or mathematical2 (Sharpe, 1990), and are designed to predict accurate response within a limited or simplified reality. The Lotka–Volterra equations and their variants (Volterra, 1926; May, 1981), the general logistic growth equation, or the Blackman growth law (Assmann ex Sharpe, 1990) are examples of analytical models. A second group of models (ii) is designed to be realistic and general. They are called mechanistic (e.g. Prentice, 1986a), physiological (e.g. Leersnijder, 1992), causal (e.g. Decoursey, 1992) or process models (e.g. Korzukhin et al., 1996), and they base predictions on real cause–effect relationships. Thus they may also be considered as general, since these relationships are considered as biologically functional (Woodward, 1987). A model of this group is not judged primarily on predicted precision, but rather on the theoretical correctness of the predicted response (Pickett et al., 1994). A third group of models (iii) sacrifices generality for precision and reality. Such models are called empirical (Decoursey, 1992; Korzukhin et al., 1996), statistical (Sharpe and Rykiel, 1991), or phenomenological (Pickett et al., 1994; Leary, 1985). The mathematical formulation of such a model is not expected to describe realistic 'cause and effect' relationships between model parameters and predicted response, nor to inform about underlying ecological functions and mechanisms, its main purpose being to condense empirical facts (Wissel, 1992)3.

    Although Levins' classification is helpful, it is somewhat misleading. In practice it can be difficult to classify a specific model (Korzukhin et al., 1996). Predictive distribution models are generally categorized as empirical models; however, Prentice et al. (1992) argue that their (predictive) global vegetation model, rigorously based on independent physiological data and physiological first principles, is as mechanistic as one could achieve with limited data. Also, Korzukhin et al. (1996) point out that process and empirical models can both have either a high or low degree of generality depending on the nature of the object being modeled, and that mechanistic models by their nature do not necessarily have to be imprecise. They conclude that precision, generality and reality are not always mutually exclusive. Similarly, Peters (1991, p. 32) notices that "there is no necessary conflict between precision and generality".

    Fig. 2. A classification of models based on their intrinsic properties. After Levins (1966) and Sharpe (1990).

    3 Several authors do not distinguish clearly between analytical and empirical models (e.g. Loehle, 1983; Wissel, 1992), they use multiple criteria to arrange models along these gradients (e.g. Pickett et al., 1994), or they introduce subclasses to Levins' model classification (Kimmins and Sollins, 1989; Kimmins et al., 1990).

    2 We find this term misleading, since all of the three principal modeling approaches may rely on more or less extensive mathematical formulations.


    Loehle (1983) recognizes two distinct types of models: calculating tools and theoretical models. The first can be put in the class of empirical models, since "they are intended only to inform us about the configuration of the world" (Peters, 1991, p. 105), whereas the theoretical models are synonymous with the mechanistic models, capable of predicting response from plausible causal relations.

    We argue that Levins' classification and trade-offs are nevertheless useful in a conceptual context (see below). They help to focus on one or the other characteristic in model building, depending on the overall goal of the modeling effort. Predictive vegetation models are generally empirical by nature. However, they can be based on physiologically meaningful parameters (e.g. Prentice et al., 1992; Lenihan, 1993), and can thus be described as more mechanistic than models based, say, on topographic parameters only (e.g. Burke et al., 1989; Moore et al., 1991). This difference summarizes the main axis along which most of the predictive vegetation models can be arranged: in most cases it is the trade-off between precision and generality.

    2.3. Direct versus indirect predictors

    From a mechanistic point of view, it is desirable to predict the distribution of biotic entities on the basis of the ecological parameters that are believed to be the causal, driving forces for their distribution and abundance. Such ecological factors are generally sampled from digital maps, since they are usually difficult or expensive to measure. However, they often tend to be less precise than pure topographic characteristics. Most bioclimatic maps are developed by elevation-sensitive spatial interpolations of climate station data (Hutchinson and Bischof, 1983; Daly et al., 1994; Thornton et al., 1997). This introduces spatial uncertainties because of (i) interpolation errors, (ii) a lack of sufficient station data, and (iii) the fact that standard climate stations do not reveal the biologically relevant microclimates. Soil (and nutrient) and geology maps are even more difficult to derive. They are usually generated at very coarse resolution and are often drawn up using vegetation as a delineation criterion. On the other hand, available digital elevation models (DEM) tend to be relatively accurate, even in mountainous terrain. Thus, directly derived topographic variables (slope, aspect, topographic position, or slope characteristics) are generated without much loss of precision. It is thus not surprising that predictive vegetation models, developed for mountainous terrain at relatively high spatial resolution, are based partially or completely on topographical factors (Fischer, 1990; Moore et al., 1991; Brzeziecki et al., 1993; Brown, 1994; Guisan et al., 1998, 1999). On the contrary, large-scale predictive models are generally based solely on biophysical parameters, since topography no longer has any predictive power at such coarse resolution (Box, 1981; Prentice et al., 1992; Lenihan, 1993; Huntley et al., 1995; Neilson, 1995).
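    As an illustration of how such topographic variables are derived, the sketch below computes slope and aspect from a gridded DEM with a simple finite-difference scheme; the synthetic array, the 25 m cell size and the use of numpy are assumptions of the example, and aspect conventions differ between GIS packages.

        # Minimal sketch: slope and aspect derived from a DEM grid (finite differences).
        import numpy as np

        cell_size = 25.0                                                   # assumed grid resolution, metres
        dem = np.random.default_rng(1).normal(1500, 100, size=(50, 50))    # hypothetical elevations

        dz_dy, dz_dx = np.gradient(dem, cell_size)                         # elevation change per metre
        slope_deg = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
        aspect_deg = (np.degrees(np.arctan2(-dz_dx, dz_dy)) + 360) % 360   # one common convention: 0 = north

        print(round(slope_deg.mean(), 1), round(aspect_deg[0, 0], 1))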

    The distinction between topographic and bioclimatic variables is important in the discussion of precision versus generality. Austin (1980, 1985), Austin et al. (1984), and Austin and Smith (1989) defined three types of ecological gradients, namely resource, direct, and indirect gradients. Resource gradients address matter and energy consumed by plants or animals (nutrients, water, light for plants; food, water for animals). Direct gradients are environmental parameters that have physiological importance, but are not consumed (temperature, pH). Indirect gradients are variables that have no direct physiological relevance for a species' performance (slope, aspect, elevation, topographic position, habitat type, geology; Fig. 3 gives an example for vascular plants). They are most easily measured in the field and are often used because of their good correlation with observed species patterns. Indirect variables usually replace a combination of different resources and direct gradients in a simple way (Guisan et al., 1999).

    However, one drawback of using such indirect parameters is that a model can only be applied within a limited geographical extent without significant errors, because in a different region the same topographic position can reveal a different combination of direct and resource gradients. Walter and Walter (1953) called this the "law of relative site constancy" (Gesetz der relativen Standortkonstanz).


    Fig. 3. Example of a conceptual model of the relationships between resources, direct and indirect environmental gradients (see e.g. Austin and Smith, 1989), and their influence on growth, performance, and geographical distribution of vascular plants and vegetation.

    It describes the fact that species tend to compensate regional differences in climatic conditions by selecting comparable microsites through changes in their topographic positions. In turn, the use of direct and resource gradients as predictive parameters – in which case predictions are based on what is supposed to be more physiologically 'mechanistic' – ensures that the model is more general and applicable over larger areas. Furthermore, direct and resource gradients help to pave the way towards incorporating dynamic aspects of vegetation succession in spatially explicit models, as proposed by Solomon and Leemans (1990) or by Halpin (1994).

    2.4. Fundamental versus realized niche

    The fundamental niche is primarily a function of physiological performance and ecosystem constraints. As an example, Woodward (1987, 1992) analyzed the mechanistic relationships between climate parameters and the fundamental response of plants. The realized niche additionally includes biotic interactions and competitive exclusion (Ellenberg, 1953; Malanson et al., 1992; Malanson, 1997). The concept of the ecological niche was clarified by Hutchinson (1957) and recently revisited by several authors in the context of predictive modeling (e.g. Austin et al., 1990; Westman, 1991; Malanson et al., 1992; de Swart et al., 1994; Rutherford et al., 1995; Franklin, 1995; Leibold, 1995). Differentiating between the fundamental and the realized niche of a species is particularly important because it determines whether a simulated distribution is predicted from theoretical physiological constraints or rather from field-derived observations.

    Strict mechanistic models parameterize the fundamental niche and additionally implement rules of competitive behavior to finally result in predictions of the realized response. As an example, Prentice et al. (1992) based their model primarily on theoretical and physiological constraints, and they added simple rules to cope with succession and dominance. Static predictive models are generally based on large empirical field data sets; thus, they are likely to predict the realized (ecological) niche. This seriously limits applications in changing environmental situations. However, Malanson et al. (1992) demonstrate how empirically fitted response surfaces can be altered on the basis of theoretical and physiological principles to design a more fundamental response.

    2.5. Equilibrium versus non-equilibrium

    Static distribution models are developed from simple statistically or theoretically derived response surfaces. Thus, they automatically assume equilibrium – or at least pseudo-equilibrium (Lischke et al., 1998) – between the environment and observed species patterns. The non-equilibrium concept is more realistic in ecology (Pickett et al., 1994), because it includes equilibrium as a possible state (Clark, 1991). However, a model based on the non-equilibrium concept must be (i) dynamic and (ii) stochastic. Static distribution models are conceptually unable to cope with non-equilibrium situations, since they do not distinguish between the transient and equilibrium responses of species to a stochastically and dynamically changing environment. Hence, considering a state of equilibrium is a necessary assumption for the purpose of large-scale distribution modeling. This limitation is less restrictive for species, or communities, which are relatively persistent or react slowly to variability in environmental conditions (e.g. arctic and alpine).

    Such a drawback is compensated by large-scale predictions obtained with less effort, and by the advantage that no detailed knowledge of the physiology and behavior of the species involved is necessary. Situations with strong disturbance, human influence, or successional dynamics can thus only be modeled with difficulty (Brzeziecki et al., 1993; Guisan et al., 1999; Lees and Ritman, 1991; Zimmermann and Kienast, 1999). However, it is sometimes possible to include such factors as predictive parameters4. The alternative to static, equilibrium modeling is dynamic simulation modeling (Korzukhin et al., 1996; Lischke et al., 1998). However, since such models require intensive knowledge of the species involved, most of them have been developed for well-investigated species and habitats. Only few dynamic models have yet been developed in a spatially explicit way that allows simulations on larger spatial scales (e.g. Urban et al., 1991; Moore and Noble, 1993; Roberts, 1996; He et al., 1999; He and Mladenoff, 1999).

    2.6. Species versus community approach

    Another major discussion in this field is whether a model is said to be 'gleasonian' or 'clementsian' (Prentice et al., 1992), simulating species either individually or as community assemblages (see also Franklin, 1995). This discussion is embedded in the community-continuum debate (Clements, 1916, 1936; Gleason, 1926; Cooper, 1926; Whittaker, 1967; McIntosh, 1967; Austin, 1985; Austin and Smith, 1989; Collins et al., 1993).

    A main argument for species modeling is the paleoecological evidence that plant species assemblages have never been stable, mostly due to past variations in climate (Webb, 1981; Prentice, 1986b; Ritchie, 1986).

    4 Fischer (1990) identified land use as the factor with the highest predictive power when modeling community distribution in a human-disturbed landscape. Lees and Ritman (1991) discussed the trade-off between spectral (remotely sensed data) and spatial (environmental GIS data) accuracy to respectively predict disturbed (e.g. rural or urban) versus undisturbed (e.g. rainforest) landscapes. Box (1981) as well as Prentice et al. (1992) included hierarchical rules to cope with succession, by simulating mature stages only.


    Fig. 4. Criteria for model selection. Examples of considerations for modeling vascular plant species in space and/or time, given a set of possible criteria to reach the project goals. CC: climatic change.

    Modern species assemblages do not have long histories (Davis, 1983; Birks, 1993) and therefore, communities are not likely to move as entities under changing climatic conditions (Birks, 1986; Huntley and Webb, 1988). Hence, modeling species instead of communities comes closer to what is believed to be realistic5. An alternative to modeling communities is to simulate a selection of dominant species, and to classify their superimposed distributions afterwards, in order to generate simulated community maps (Lenihan, 1993; Austin, 1998; Guisan and Theurillat, 2000). This approach solves the problem of the missing logic of arbitrary a priori classifications. However, when predicting future potential distributions based on static models and scenarios of environmental change (e.g. climatic change), the same limitations apply to species and community models. Both approaches are based on equilibrium assumptions (between observed response and environmental conditions) and lack the possibility of simulating the individualistic behavior of species (i.e. seed dispersal, migration, plasticity, adaptation, etc.).

    2.7. Conceptual guidelines

    So far, we have discussed that conceptual considerations based on the overall goal of the study are essential for the design of an appropriate model. Fig. 4 gives an example of possible considerations for modeling the distribution of vascular plants. In general, if high predictive precision is required to model the distribution of biological entities on a large spatial scale under present environmental conditions (e.g. Hausser, 1995), then static modeling is a valid and powerful approach.

    5 Franklin (1995) discusses the different points of view in the context of predictive vegetation mapping. She concludes that: "…communities (and ecotones) are geographic entities, and therefore can be predictively mapped. However, predictive mapping of species distributions presents far fewer definitional uncertainties or abstractions. Often the distribution of species assemblages (plant communities) or functional types… is predicted owing to methodological considerations (lack of sufficient data to model species distributions) rather than strong loyalty to the community concept."


    It may also be selected for modeling large-scale potential distributions under environmental change scenarios, but limitations apply, as previously discussed. For modeling at small spatial scales and in complex topography the use of indirect variables may provide better predictions, while for simulations at large spatial scales the use of direct and resource gradients should be the first choice.

    3. Sampling design, field survey, spatial scales, and the geographical modeling context

    The formulation of the conceptual model leads ideally (i) to the choice of an appropriate spatial scale (reviewed by Wiens, 1989; see also Hengeveld, 1987; Fitzgerald and Lees, 1994a) for conducting the study, and (ii) to the selection of a set of conceptually (e.g. physiologically) meaningful explanatory variables for the predictive model (Fig. 3). Additionally, it may be helpful to design an efficient sampling strategy by identifying those gradients that are believed to play a key role in the model and should thus be considered primarily to stratify the sampling (Mohler, 1983; Austin and Heyligers, 1989, 1991; Wessels et al., 1998). The main environmental gradients in the study area (be it a small catchment or a large area) can be identified in a preliminary exploratory analysis (e.g. Dufrêne and Legendre, 1991; Aspinall and Lees, 1994) and used to define a sampling strategy that is especially designed to meet the requirements of the model objectives (Mohler, 1983).

    The gradsect approach – originally proposed by Helman (1983) and later improved by Gillison and Brewer (1985) and by Austin and Heyligers (1989, 1991) – represents a compromise between randomized sampling (distribution and replication) of multiple gradients along transects (stratification) and minimizing survey costs (accessibility). It can easily be designed in a geographical information system (Cocks and Baird, 1991; Neldner et al., 1995; Franklin et al., in press) and can thus be adapted to the spatial resolution of any study area. Austin and Heyligers (1991, p. 36) provide a detailed list of the general steps that can be used to design a reasonable survey based on the gradsect method.

    Designing the sampling according to a random-stratified scheme is another classical approach that can also be set up in a GIS. However, difficulties may arise in studies that involve many environmental gradients and many species, since setting up a multi-gradient stratified sampling is a particularly demanding task (Goedickemeier et al., 1997; Lischke et al., 1998). Individual GIS layers have to be stratified and intersected in order to delineate the polygons from which random samples have to be drawn. Each polygon represents a specific combination of environmental conditions, and each of these combinations (stratum) is usually present in multiple polygons. Next, a given number of polygons per stratum is selected randomly where samples have to be located. Restrictions may be applied, such as excluding polygons smaller than a lower threshold or larger than an upper threshold, to ensure the sampling of polygons of similar surface area. Finally, the sample location within the polygon can be selected randomly or systematically (e.g. at the centroid of the polygon)6.
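    A minimal sketch of this stratum-wise random selection is given below. It assumes the intersected polygons have already been exported to a table with 'stratum' and 'area_ha' columns; the area thresholds, the number of polygons per stratum and the use of pandas are illustrative assumptions, not recommendations from the paper.

        # Minimal sketch: random selection of polygons per environmental stratum,
        # after excluding polygons that are too small or too large.
        import pandas as pd

        polygons = pd.DataFrame({                      # hypothetical intersected GIS polygons
            "polygon_id": range(12),
            "stratum":   ["A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C"],
            "area_ha":   [0.2, 3.1, 5.0, 4.2, 0.1, 6.3, 2.2, 9.0, 3.3, 4.4, 50.0, 2.8],
        })

        lower, upper, n_per_stratum = 1.0, 10.0, 2     # assumed thresholds and sample size
        candidates = polygons[(polygons.area_ha >= lower) & (polygons.area_ha <= upper)]
        selected = candidates.groupby("stratum").sample(n=n_per_stratum, random_state=0)
        print(selected[["stratum", "polygon_id"]])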

    A quantitative comparison between gradsect, stratified (habitat-specific), systematic and random sampling is given in Wessels et al. (1998) and partly also in Goedickemeier et al. (1997). It shows that the gradsect approach gives results comparable to stratified sampling for approximating species richness patterns in the study area, and both are superior to systematic and random sampling in their study context. However, if sampling is aimed at building species distribution models, one will want to have all environmental combinations equally sampled, which can be done but is not enforced by the gradsect approach. The use of a stratifying approach, although involving more effort, will provide such a guarantee.

    6 To improve on this, multiple locations per polygon can be selected to ensure backup points for cases where the actual environmental characteristics in the field do not correspond to the strata defined from the information available in the GIS (physiography, vegetation, or soil maps, etc.). Each sampling site is then best localized in the field using DGPS (see e.g. Goedickemeier et al. (1997) or Cherix et al. (1998)).


    Hence, if species richness and biodiversity are the focus, stratifying will require the proportional representation of all habitats in order to ensure sampling of as much as possible of the entire species pool in each habitat. Conversely, if a quantitative analysis of the relationship between species distribution and environmental descriptors is the focus, one will attempt to sample an equal number of replicates per environmental combination (stratum). It is thus fundamental to discuss the use of a particular method in a well-defined ecological context.

    If a data set was not collected according to a stratified strategy (as in many observational studies), one may resample a fixed-size subset of observations from the existing data set within each environmental stratum. If there are not enough observations available for a stratum, we suggest designing a complementary field survey that aims at sampling additional observations in under-represented strata, until the fixed-size number of replicates is reached. The new – resampled – data set is then more valid for statistical, analytical purposes.

    The minimum distance between two sampling sites should be defined prior to sampling, according to an exploratory spatial autocorrelation study (see Legendre, 1993 for a review). Autocorrelation occurs when sampling points are located so close to each other that the postulate of independence between observations is violated. In such a case, pseudoreplication occurs (see Heffner et al., 1996), i.e. the number of degrees of freedom (d.f.) is lower than (n − k − 1), the usual number of d.f. used for inferences when all observations are independently and identically distributed (i.i.d.), with n the number of sites and k the number of parameters in the model. Hence, in order to avoid spatial autocorrelation, a distance larger than the minimum distance at which autocorrelation occurs has to be chosen (e.g. 150 m for alpine communities; Fischer, 1994). Such a minimum distance criterion can easily be implemented in the design of a random-stratified sampling. If the sampling distance is too low to avoid autocorrelation, then an alternative is to include autocorrelation in the model (Malanson, 1985; Roxburgh and Chesson, 1998), through adding an autocorrelative term (Augustin et al., 1996), adding the distance between the observation points (Leathwick, 1998) or smoothing model predictions by a trend surface analysis (TSA; Pereira and Itami, 1991)7.
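    The minimum-distance criterion can be enforced with a simple greedy thinning of candidate sites, as sketched below; the 150 m threshold follows the alpine example cited above, while the coordinate array and the greedy rule itself are illustrative assumptions.

        # Minimal sketch: greedily discard candidate sites closer than d_min to a kept site.
        import numpy as np
        from scipy.spatial.distance import cdist

        rng = np.random.default_rng(2)
        xy = rng.uniform(0, 1000, size=(80, 2))    # hypothetical site coordinates in metres
        d_min = 150.0                              # minimum allowed distance between sites

        dist = cdist(xy, xy)
        kept = []
        for i in range(len(xy)):
            if all(dist[i, j] >= d_min for j in kept):
                kept.append(i)
        print(len(kept), "sites retained out of", len(xy))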

    The environmental information gathered at the required spatial resolution for the entire study area is best stored in a GIS. Four main sources may be identified for the gathering of such environmental data:
    1. field surveys or observational studies;
    2. printed or digitized maps;
    3. remote sensing data (numerical aerial photographs and satellite images);
    4. maps obtained from GIS-based modeling procedures.

    Field data (1) can be field measurements (e.g. abiotic characteristics of a site) or a network of meteorological measurements aimed at further interpolating climatic maps (4). Spatial data on geology, soil units, or hydrology most commonly originate from existing printed or digitized maps (2). Land use, rocky surfaces, snow cover, potential moisture or vegetation maps can be derived from aerial photographs or satellite scenes (3). Examples of distribution studies including remote sensing data are Frank (1988), Hodgson et al. (1988), Bagby and Brian (1992), Fitzgerald and Lees (1992), Herr and Queen (1993), Guisan et al. (1998), or Franklin et al. (2000). In certain cases, e.g. in mountainous areas where distortion can be high due to the steep relief, aerial photographs have to be rectified with the help of a DEM (see Guisan et al., 1998; Fig. 5) before they can be properly georeferenced.

    In many cases the main prerequisite for spatial modeling is the DEM (Fig. 5). It constitutes the basis for generating new maps of environmental variables and determines the spatial resolution of all derived maps. However, it is important to distinguish between spatial resolution and map accuracy.

    7 The above discussion is not relevant to those studies which aim to interpolate the sampled property (e.g. a crop biomass), without recourse to any additional environmental information. In such cases, the use of spatial interpolation techniques to design an optimal sampling strategy, e.g. kriging (Atkinson, 1996), is more appropriate.


    Fig. 5. The central role of the DEM in predictive habitat distribution modeling, as a basis for: (i) generating additional environmental information through geostatistical modeling (e.g. climatic data); (ii) rectifying aerial photographs and satellite images in mountainous areas; (iii) planning field surveys (e.g. stratification design).

    Only rigorous testing of the errors of mapped entities or gradients will allow one to assess map accuracy. Generally, all maps derived from interpolations, calculations or combinations are less precise than the maps from which they originate. Thus, a DEM and its basic derivatives – slope, aspect, topographic position and curvature – are usually the most accurate maps available, though not necessarily those with the highest predictive potential.

    Modeling of environmental variables has become more and more powerful. Fractal geometry can, for instance, be used to characterize vegetation complexity and help to interpret aerial photographs (Van Hees, 1994). Various spatial interpolation techniques and approaches to deal with the spatial context (in the sense of Fitzgerald and Lees, 1994a), such as regional autocorrelation and topographic dependencies, have been employed to generate a multitude of bioclimatic maps (Hutchinson and Bischof, 1983; Daly et al., 1994; Dubayah and Rich, 1995; Kumar et al., 1997; Thornton et al., 1997). These publicly available maps or modeling tools drastically improve the prospects of building physiologically more meaningful models of the environment. Still, the trade-off between modeling accuracy and 'physiological' correctness remains to be evaluated in terms of the overall goal of the model.
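    As a much simplified illustration of elevation-sensitive interpolation (not the method of any of the cited tools), the sketch below reduces station temperatures to sea level with an assumed lapse rate, interpolates them by inverse-distance weighting, and re-applies the lapse rate at the elevation of each DEM cell; all numbers are hypothetical.

        # Minimal sketch: lapse-rate corrected inverse-distance interpolation of
        # station temperatures onto DEM cells (a crude stand-in for bioclimatic mapping).
        import numpy as np

        lapse = -0.0065                                  # assumed lapse rate, degC per metre
        stations = np.array([[200., 300., 450.],         # x, y, elevation (m) of three stations
                             [800., 700., 1200.],
                             [500., 900., 900.]])
        t_obs = np.array([14.0, 9.5, 11.0])              # hypothetical mean temperatures
        t_sea = t_obs - lapse * stations[:, 2]           # reduce observations to sea level

        def interpolate(x, y, elev):
            d = np.hypot(stations[:, 0] - x, stations[:, 1] - y) + 1e-6
            w = 1.0 / d**2                               # inverse-distance weights
            return np.sum(w * t_sea) / np.sum(w) + lapse * elev

        print(round(interpolate(600., 600., 1000.), 2))  # temperature for one DEM cell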

    Related to the problem of accuracy is the task of selecting an appropriate data set to parameterize the model. Most predictive variables vary along topographical gradients. Depending on the spatial resolution and the techniques used to generate these maps, topographic variables can be used to evaluate the correspondence between digital and field-observed attributes for any location. Selecting a subset of plots that meet criteria of high correspondence between digital and real topography can significantly improve the model parameterization (Zimmermann, 1996; Zimmermann and Kienast, 1999). This way, the influence of mapping errors on the model parameterization can be reduced, although predictive accuracy may not be improved. It is obvious that measuring appropriate additional topographic parameters during the survey is important to make full use of such modeling steps.
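    A simple way to implement such a selection is to keep only the plots whose field-recorded topography matches the DEM-derived values within a tolerance, as in the sketch below; the 5 degree tolerance and the column names are assumptions of the example.

        # Minimal sketch: retain plots where field-measured and DEM-derived slope agree.
        import pandas as pd

        plots = pd.DataFrame({                        # hypothetical plot table
            "plot_id":     [1, 2, 3, 4],
            "slope_field": [12.0, 30.0, 5.0, 22.0],   # degrees, measured in the field
            "slope_dem":   [14.0, 18.0, 6.0, 21.0],   # degrees, derived from the DEM
        })
        tolerance = 5.0                               # assumed maximum allowed discrepancy
        good = plots[(plots.slope_field - plots.slope_dem).abs() <= tolerance]
        print(good.plot_id.tolist())                  # plots kept for model calibration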


    4. Statistical model formulation

    The statistical formulation has also been called verification by some authors (e.g. Rykiel, 1996) and is often presented in the 'Statistical analysis' section of most papers dealing with vegetation or single species modeling. By model formulation, we mean the choice of (i) a suitable algorithm for predicting a particular type of response variable and estimating the model coefficients, and (ii) an optimal statistical approach with regard to the modeling context.

    Most statistical models are specific to a particular type of response variable and its associated theoretical probability distribution (or density function). As a consequence, the latter has to be tested prior to the selection of an adequate model. This can be done, e.g. for a continuous response variable, by comparing its empirical distribution to the theoretical probability distribution (goodness-of-fit χ² test) or their cumulative forms (Kolmogorov–Smirnov test). A summary of some available techniques for modeling respectively quantitative (i.e. metric; ratio or interval scale; continuous or discrete), semi-quantitative (i.e. ordinal) or qualitative (i.e. nominal; categorical) response variables is given in Table 1 and illustrated with examples from the literature. Often, more than one technique, that may or may not belong to the same statistical approach, can be applied appropriately to the same response variable (Table 1).
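    For a continuous response, the Kolmogorov–Smirnov comparison mentioned above can be run as in the sketch below (scipy and the simulated data are assumptions of the example); a χ² goodness-of-fit test would be used analogously for binned or discrete data.

        # Minimal sketch: test whether a continuous response is compatible with a
        # normal distribution (Kolmogorov-Smirnov test on the cumulative distributions).
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        y = rng.lognormal(mean=1.0, sigma=0.6, size=300)   # hypothetical skewed response

        # Note: estimating the parameters from the same data makes the test approximate.
        stat, p = stats.kstest(y, "norm", args=(y.mean(), y.std(ddof=1)))
        print(f"KS statistic = {stat:.3f}, p = {p:.4f}")   # small p: normality is doubtful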

    The importance of a correct statistical model formulation is best illustrated with an example. Consider predicting the potential abundance of a species (or ground cover for plant species; see Brown et al., 1995 for a review of models for the spatial distribution of species abundance). Using standard least squares (LS) regression in this case might be inadequate for the following reasons. Abundance is often recorded as a discrete variable and its distribution might reasonably follow: (i) a Poisson distribution (Vincent and Haworth, 1983), (ii) a negative binomial distribution (May, 1975; Oksanen et al., 1990; White and Bennetts, 1996), (iii) a canonical log-normal distribution (May, 1975), (iv) a broken-stick distribution (May, 1975) or, in cases where it was directly sampled on a semi-quantitative scale, (v) an ordinal distribution (Guisan et al., 1998; Guisan and Harrell, 2000; Guisan, in press), rather than the normal distribution that underlies LS regression. Often, log-transforming the response variable solves such situations only unsatisfactorily (Vincent and Haworth, 1983); it is more appropriate to use maximum-likelihood (ML) estimation to fit the model, as this technique can deal with any parametric distribution. Thus, if abundance originates for instance from counts of individuals, one would preferably use generalized linear models (GLM; ML-based) with a Poisson or a negative-binomial distribution (Vincent and Haworth, 1983; Nicholls, 1989), rather than LS regression applied to a log-transformation of the response variable (log-linear models)8. Table 1 provides an overview of examples of response variables and their corresponding statistical models.
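    The recommendation to fit count data by maximum likelihood rather than by LS regression on log-transformed counts can be sketched as follows; statsmodels and the simulated data are assumptions of the example, not part of the original study.

        # Minimal sketch: Poisson and negative-binomial GLMs for abundance counts,
        # fitted by maximum likelihood instead of LS regression on log(counts).
        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(4)
        temperature = rng.uniform(0, 25, size=200)               # hypothetical predictor
        counts = rng.poisson(np.exp(0.5 + 0.08 * temperature))   # hypothetical abundances

        X = sm.add_constant(temperature)
        poisson_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
        negbin_fit = sm.GLM(counts, X, family=sm.families.NegativeBinomial()).fit()
        print(poisson_fit.params, negbin_fit.params)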

    If necessary, the appropriateness of a statistical model can be checked by a series of tests (e.g. Marshall et al., 1995 in the case of LS regression) or graphical methods (e.g. diagnostic plots). A quantile–quantile plot (QQ-plot) of regression residuals can be used to check whether the assumption about the distribution of errors holds. Similarly, a plot of standardized residuals against fitted values can also help to identify unexpected patterns in the deviance (i.e. the variance for GLMs). In such a case, one solution is to reconsider the definition of the deviance in the model9. With ordinal regression models, partial residual plots can be used to check the ordinality of the response variable y against each explanatory variable xi (Guisan and Harrell, 2000).
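    The two diagnostic plots mentioned here can be produced directly from a fitted GLM, as in the sketch below; matplotlib, statsmodels and the simulated Poisson data are assumptions about the tooling, not the paper's own procedure.

        # Minimal sketch: QQ-plot of deviance residuals and a residuals-vs-fitted plot
        # for a Poisson GLM fitted to simulated count data.
        import matplotlib.pyplot as plt
        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(5)
        x = rng.uniform(0, 25, size=200)
        y = rng.poisson(np.exp(0.5 + 0.08 * x))           # hypothetical counts
        fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()

        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
        sm.qqplot(fit.resid_deviance, line="45", ax=ax1)  # check the assumed error distribution
        ax1.set_title("QQ-plot of deviance residuals")
        ax2.scatter(fit.fittedvalues, fit.resid_deviance, s=10)
        ax2.axhline(0, color="grey")                      # look for unexpected deviance patterns
        ax2.set_xlabel("fitted values")
        ax2.set_ylabel("deviance residuals")
        plt.tight_layout()
        plt.show()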

    The formulation of a theoretically appropriate model may fail to provide satisfying results because (i) the data themselves are not good enough, (ii) the resolution of the spatial scale is not appropriate, or (iii) the sampling design was not intended for this purpose.

    8 The use of a binomial GLM with a logistic link to predict the presence/absence of plant communities (e.g. Zimmermann and Kienast, 1999) or of plant species (Lischke et al., 1998; Franklin, 1998), or the use of a proportional odds (PO) regression model to predict ordinal abundance (Guisan et al., 1998; Guisan and Harrell, 2000; Guisan, in press), are other examples of appropriate model formulations.

    9 Using, e.g., the 'quasi' family in the S-PLUS statistical software (see Guisan et al., 1999).


    Table 1
    Some statistical approaches and models, with literature examples, for three different types of response variables: quantitative (ratio or interval scales), semi-quantitative (ordinal), and qualitative (nominal). The original table lists, for each entry: the statistical approach, possible modeling techniques, the type and examples of the response variable, its probability distribution, the type of predictions, and examples of habitat modeling studies. Its two-page layout could only be partly reconstructed; entries are grouped here by type of response variable.

    Quantitative (continuous) response variable (e.g. percent cover, species richness, biomass); Gaussian distribution:
    – MULREG: WA, LS, LOWESS, GLM, GAM, regression tree; predictions: probability; studies: Hill, 1991; Huntley et al., 1995; Heikkinen, 1996; Gottfried et al., 1998; Guisan et al., 1999.
    – ORDIN: CANOCO; predictions: distribution.

    Quantitative (discrete) response variable (e.g. individual counts, species richness):
    – MULREG: GLM, GAM; Poisson distribution; predictions: probability; studies: Vincent and Haworth, 1983; Guisan, 1997.
    – MULREG: GLM, GAM; negative binomial distribution; predictions: probability.

    Semi-quantitative (ordinal) response variable (e.g. abundance scale, discretized continuous variable, true ordinal such as phenological stages):
    – MULREG: PO model, CR model; predictions: probability; studies: Guisan, 1997; Guisan et al., 1998; Guisan and Harrell, 2000; Guisan, in press.
    – MULREG: stereotype model; predictions: probability.

    Qualitative (categorical, nominal) response variable (e.g. vegetation units, plant communities):
    – MULREG: polychotomous logit regression; multinomial distribution; predictions: probability; studies: Davis and Goetz, 1990.
    – CLASSIF: classification tree; predictions: class; studies: Walker and Moore, 1988; Burke et al., 1989; Moore et al., 1991; Lees and Ritman, 1991.
    – CLASSIF: maximum-likelihood classification (MLC); predictions: distribution; studies: Frank, 1988.
    – CLASSIF: rule-based classification; predictions: class; studies: Twery et al., 1991; Lenihan and Neilson, 1993; Li, 1995.
    – DISCR: DFA; predictions: discriminant score; studies: Lowell, 1991.
    – ENV: boxcar, convex hull, point-to-point metrics; predictions: degree of confidence; studies: Box, 1981; Busby, 1986; Carpenter et al., 1993; Tchebakova et al., 1993.

    Presence/absence (p/a) response variable (p/a, relative abundance); binomial distribution:
    – MULREG: GLM, GAM, regression tree; predictions: probability; studies: Nicholls, 1989; Austin et al., 1990, 1994; Yee and Mitchell, 1991; Lenihan, 1993; Brown, 1994; Van de Rijt et al., 1996; Guisan, 1997; Saetersdal and Birks, 1997; Franklin, 1998; Leathwick, 1998; Zimmermann and Kienast, 1999; Guisan et al., 1999; Guisan and Theurillat, 2000.
    – CLASSIF: classification tree; predictions: class; studies: Franklin, 1998; Franklin et al., 2000.
    – ENV: boxcar, convex hull, point-to-point metrics; predictions: degree of confidence; studies: Busby, 1986, 1991; Walker and Cocks, 1991; Shao and Halpin, 1995; Huntley et al., 1995.
    – BAYES: Bayes formula; predictions: probability; studies: Skidmore, 1989; Fischer, 1990; Aspinall, 1992; Brzeziecki et al., 1993.

    In some cases, the model predictions can be greatly improved by applying a particular category of statistical algorithms. As an example, the use of more empirically based techniques, such as generalized additive models (GAM) instead of LS regression or GLM, proved to be more satisfying in some studies (e.g. Yee and Mitchell, 1991).

    In the following sections, we discuss the main statistical approaches (to modeling) grouped into seven categories (see Table 1): multiple regression and its generalized forms, classification techniques, environmental envelopes, ordination techniques, Bayesian approaches, neural networks, and a seventh category including other potential approaches or approaches involving several methods (mixed approach).

    4.1. Generalized regressions

    Regression relates a response variable to a single (simple regression) or a combination (multiple regression) of environmental predictors (explanatory variables). The predictors can be the environmental variables themselves or, in order to prevent multicollinearity in the data, orthogonalized components derived from the environmental variables through multivariate analysis (e.g. partial least squares (PLS), Nilsson et al., 1994; Heikkinen, 1996; Birks, 1996; principal component regression (PCR), Brown et al., 1993; Saetersdal and Birks, 1997). One possible multicollinearity diagnostic is the variance inflation factor (VIF; Montgomery and Peck, 1982). It should be preferred over simply analyzing pairwise correlations between predictors, as near-linear dependencies may involve more than two predictors. The classical LS regression approach is theoretically valid only when the response variable is normally distributed and the variance does not change as a function of the mean (homoscedasticity). GLMs constitute a more flexible family of regression models, which allow other distributions for the response variable and non-constant variance functions to be modeled. In GLMs, the combination of predictors, the linear predictor (LP), is related to the mean of the response variable through a link function. Using such link functions allows (i) transformation to linearity, and (ii) the predictions to be maintained within the range of coherent values for the response variable. By doing so, GLMs can handle distributions such as the Gaussian, Poisson, binomial, or gamma – with respective link functions set e.g. to identity, logarithm, logit and inverse. Contrary to LS regressions, which could predict biologically unfeasible values (e.g. probabilities higher than 100% or negative values; see Guisan, in press), GLMs yield predictions within the limits of observed values (e.g. presence/absence [1/0] and probability values in between these extremes)10.
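    The two points made here, screening predictors with the VIF and keeping binomial predictions within [0, 1] via the logit link, can be sketched with statsmodels (an assumption of the example) as follows.

        # Minimal sketch: variance inflation factors for the predictors, then a
        # binomial GLM with logit link whose predictions stay between 0 and 1.
        import numpy as np
        import statsmodels.api as sm
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        rng = np.random.default_rng(6)
        temp = rng.uniform(0, 25, 300)
        elev = 2500 - 80 * temp + rng.normal(0, 100, 300)     # strongly correlated with temp
        rain = rng.uniform(500, 2000, 300)
        X = sm.add_constant(np.column_stack([temp, elev, rain]))

        for i, name in enumerate(["const", "temp", "elev", "rain"]):
            print(name, round(variance_inflation_factor(X, i), 1))   # large VIF: collinearity

        presence = (temp + rng.normal(0, 5, 300) > 12).astype(int)   # hypothetical response
        fit = sm.GLM(presence, X, family=sm.families.Binomial()).fit()
        print(fit.predict(X).min(), fit.predict(X).max())            # bounded in [0, 1]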

    If the response is not linearly related to a predictor, a transformed term of the latter can be included in the model. A regression model including higher order terms is called a polynomial regression. Second order polynomial regressions simulate unimodal symmetric responses (e.g. Fig. 6a), whereas third order or higher terms allow skewed and bimodal responses, or even a combination of both, to be simulated. Other transformation functions (also called parametric smoothers; see Oksanen, 1997) used to simulate more specific response shapes include: (i) β-functions (Austin et al., 1994), (ii) hierarchical sets of models (Huisman et al., 1993), or (iii) a set of 'n-transformed' functions (Usó-Domènech et al., 1997).
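    A second order polynomial term for a unimodal response can be added simply by including the squared predictor in the linear predictor, as in the sketch below (statsmodels and the simulated gradient are again assumptions of the example).

        # Minimal sketch: logit GLM with first- and second-order terms of one gradient,
        # producing a unimodal, symmetric response curve along that gradient.
        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(7)
        elev = rng.uniform(500, 2500, 400)                       # hypothetical gradient
        optimum, width = 1600.0, 300.0
        p_true = np.exp(-0.5 * ((elev - optimum) / width) ** 2)  # bell-shaped true response
        presence = rng.binomial(1, p_true)

        X = sm.add_constant(np.column_stack([elev, elev ** 2]))  # linear + quadratic term
        fit = sm.GLM(presence, X, family=sm.families.Binomial()).fit()
        b0, b1, b2 = fit.params
        print("estimated optimum:", round(-b1 / (2 * b2)))       # vertex of the fitted parabola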

Alternative regression techniques to relate the distribution of biological entities to environmental gradients are based on non-parametric smoothing functions of predictors (Fig. 6b)11.

10 A Gaussian GLM with identity link is equivalent to a LS regression. Binomial GLMs with logit link are commonly used in distribution modeling under the name of logit regression. The latter also constitute the core of ordinal regression models (Guisan and Harrell, 2000) and polychotomous regression models (Davis and Goetz, 1990).

11 This raises new questions such as: 'is a skewed response curve best integrated in a model by a parametric polynomial (e.g. Ferrer-Castán et al., 1995), a Beta-function (e.g. Austin et al., 1994), a non-linear function (e.g. from a hierarchical set of functions; Huisman et al., 1993) or a smoothed function (e.g. Yee and Mitchell, 1991)?'


Fig. 6. Examples of response curves for different statistical approaches used to model distribution of plants and vegetation. (a) Generalized linear model with second order polynomial terms; (b) generalized additive model with smoothed spline functions; (c) classification tree; (d) environmental envelope of the BIOCLIM type; (e) canonical correspondence analysis; (f) Bayesian modeling according to Aspinall (1992); pp = posterior probability of presence of the modeled species, ppp = a priori probability of presence, ppa = a priori probability of absence, pcp = product of conditional probabilities of presence of the various predictor classes, pca = product of conditional probabilities of absence of the various predictor classes.

They include running means, locally weighted regression, or locally weighted density functions (Venables and Ripley, 1994). GAMs are commonly used to implement non-parametric smoothers in regression models (Yee and Mitchell, 1991; Brown, 1994; Austin and Meyers, 1996; Bio et al., 1998; Franklin, 1998; Lehmann, 1998; Leathwick, 1998)12. This technique applies smoothers independently to each predictor and additively calculates the component response. Alternatively, multidimensional smoothers can be applied (Huntley et al., 1989, 1995), the drawbacks of which are discussed by Friedman and Stuetzle (1981) and Yee and Mitchell (1991).
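As a sketch of this additive, smoother-based approach, one possibility is the generalized additive model interface of statsmodels (GLMGam with B-spline smoothers); the data frame, variable names and degrees of freedom are all hypothetical:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.gam.api import GLMGam, BSplines

# Hypothetical presence/absence records with two environmental gradients.
df = pd.read_csv('releves.csv')  # columns: presence (0/1), temperature, precipitation

# One univariate B-spline smoother per predictor, combined additively.
smoother = BSplines(df[['temperature', 'precipitation']], df=[6, 6], degree=[3, 3])
gam = GLMGam.from_formula('presence ~ 1', data=df, smoother=smoother,
                          family=sm.families.Binomial())
res = gam.fit()
print(res.summary())
```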

Furthermore, regression models can be improved by incorporating additional information on ecological processes such as dispersal or connectivity. An example is the recent implementation of autocorrelative functions in a logistic GLM (Augustin et al., 1996; Preisler et al., 1997), as previously suggested by Malanson (1985).

12 GAMs are sometimes used in an exploratory way, to identify the most probable shape of response curves that are then best fitted with a GLM (e.g. Brown, 1994; Franklin, 1998), but can also be used directly to fit the final model (e.g. Lehmann, 1998; Leathwick, 1998).


    4.2. Classification techniques

The category of classification encompasses a broad range of techniques. It includes techniques such as classification (qualitative response; Fig. 6c) and regression (quantitative response) trees (CART; Breiman et al., 1984; e.g. in Moore et al., 1991; Lees and Ritman, 1991; Franklin, 1998; Franklin et al., 2000)13, rule-based classification (e.g. Twery et al., 1991; Lenihan and Neilson, 1993) and maximum-likelihood classification (MLC; e.g. Franklin and Wilson, 1991). Within some GIS software (ArcInfo, Imagine, GRASS, Ilwis, Idrisi), the latter is generally called 'supervised classification' when it relies on a training data set, and 'unsupervised classification' when no training data set is required. We found no direct examples of the application of linear machines to predict plant or animal distribution, i.e., machine learning multivariate or hybrid decision trees related to CART (Brodley and Utgoff, 1995). However, examples of their use to derive land cover from satellite scenes do exist (J. Franklin, personal communication). All these techniques assign a class of the binary (Fig. 6c) or multinomial response variable to each combination of (nominal or continuous) environmental predictors.

CART, decision tree, and other rule-based classifications are sometimes placed within artificial intelligence (AI) techniques, although not all of these are based on classification techniques (e.g. neural networks; Fitzgerald and Lees, 1992). Some expert systems or decision-tree classifiers are built from interrelating simple rules deduced from previous understanding of the phenomena to be modeled (e.g. from literature, laboratory experiments, expert knowledge). In a more empirical way, algorithms such as CART allow the rules to be induced directly from the observations (Lees and Ritman, 1991).
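A sketch of inducing such rules from observations with a classification tree (scikit-learn here, rather than the original CART software); the input file and predictor names are hypothetical and the tree is kept deliberately shallow:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical observations: presence (0/1) and environmental predictors.
df = pd.read_csv('releves.csv')
X, y = df[['temperature', 'precipitation', 'slope']], df['presence']

# A shallow tree keeps the induced rules readable (cf. Fig. 6c).
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```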

4.3. Environmental envelopes

Until recently, many large-scale vegetation or species models were based on environmental envelope techniques (e.g. Holdridge, 1967; Box, 1981; Busby, 1991; Carpenter et al., 1993; Shao and Halpin, 1995). Busby (1986, 1991) developed the BIOCLIM model – a fitted, species-specific, p-dimensional environmental envelope (Boxcar) – to model plant species distribution in Australia, using one-by-one degree latitude–longitude grid cells. This approach is based on calculating a minimal rectilinear envelope in a multi-dimensional climatic space (see Fig. 6d for a two-dimensional example). Holdridge (1967) applied the same approach to ecosystems, and Box (1981) to plant functional types.
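A minimal sketch in the spirit of such a rectilinear envelope, using percentile bounds in a two-dimensional climate space; all data are invented for illustration:

```python
import numpy as np

def rectilinear_envelope(records, percentile=5):
    """Lower/upper bounds per climate variable from presence records
    (records: array of shape [n_sites, n_variables])."""
    lo = np.percentile(records, percentile, axis=0)
    hi = np.percentile(records, 100 - percentile, axis=0)
    return lo, hi

def inside_envelope(sites, lo, hi):
    """True where a candidate site falls within the envelope on every axis."""
    return np.all((sites >= lo) & (sites <= hi), axis=1)

# Hypothetical presence records and candidate grid cells in climate space.
presences = np.random.default_rng(1).normal([8.0, 1200.0], [2.0, 250.0], (50, 2))
candidates = np.array([[7.5, 1150.0], [15.0, 400.0]])
lo, hi = rectilinear_envelope(presences)
print(inside_envelope(candidates, lo, hi))  # e.g. [ True False]
```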

In an attempt to enclose record sites in the environmental space more tightly than by the BIOCLIM rectilinear envelope, Walker and Cocks (1991) developed the HABITAT model using convex polytope envelopes (convex hulls). Interestingly, both models give very similar results although they differ in their classification procedure. A rectilinear envelope can be defined from a very simple classification tree as well (only one dichotomy per predictor), whereas the more complex polytope envelope would need a more detailed tree including more terminal nodes14. The DOMAIN model, developed by Carpenter et al. (1993), is more complex. Its approach is not that of classification trees; instead, it is based on a point-to-point similarity metric (a measure of multivariate distances) and has proved to be more suitable for applications where available records are limited.

Prentice et al. (1992) developed a more mechanistic version of the environmental envelope model family by fitting only the lower limits of direct and resource gradients in order to predict the predominant plant functional types on a global scale. Thus, these limits represent the minimum requirements for plant growth. Upper limits for functional types along environmental gradients are assumed to be caused by competition rather than by intolerance or environmental constraints. Consequently, the functional types are ranked by their ability to dominate, and a functional type is replaced automatically by a more competitive type, once the latter is simulated.

14 Quantitative comparisons of environmental envelope and classification tree approaches can be found in Walker and Cocks (1991) or Skidmore et al. (1996).

    13 A comparison between classification trees and GLM/GAM can be found in Franklin (1998).



    4.4. Ordination techniques

Most models that use ordination techniques to predict the distribution of species or communities are based on canonical correspondence analysis (CCA; see Hill, 1991; Birks, 1996; Gottfried et al., 1998; Ohmann and Spiess, 1998; Guisan et al., 1999)15. In this direct gradient analysis the principal ordination axes are constrained to be a linear combination of environmental descriptors (ter Braak, 1988). The method is based on reciprocal averaging of species and site scores (Hill, 1973). Thus, the assumed distribution of species is Gaussian (Fig. 6e), with an upper and a lower limit of occurrence and an optimum along gradients. This method is appropriate for data sets with many zeros (i.e. absences)16. Although the method relies on several postulates (e.g. Gaussian response curves, species having equal ecological tolerances and equal maxima along canonical axes, with the positions of the modes distributed uniformly along a much larger range than the species tolerances) that may theoretically invalidate its use in many situations, ter Braak (1985) argues that the method is still robust when such postulates are violated.

Redundancy analysis (RDA) – the canonical counterpart of principal component analysis (PCA) – is less frequently used to simulate the environmentally dependent distribution of communities and taxa. The method is similar to LS regression with linear terms only. However, the deviation of the residuals is calculated perpendicular to the fitted axes, since there is no independent gradient in ordination. Thus, the underlying response model is a linear or monotonic distribution of species along environmental gradients, which limits the applicability of this method to short (truncated) environmental gradients only (Jongman et al., 1995; ter Braak, 1988). Hence, the approach is of limited use on the scale of landscape modeling, where large gradients are usually analyzed.

    4.5. Bayesian approach

Models based on Bayesian statistics combine a priori probabilities of observing species (e.g. Skidmore, 1989; Aspinall, 1992; Aspinall and Veitch, 1993) or communities (e.g. Fischer, 1990; Brzeziecki et al., 1993) with their probabilities of occurrence conditional on the value (or class of values) of each environmental predictor (Fig. 6f). Conditional probabilities p(y|xi) can be, for instance, the relative frequencies of species occurrence within discrete classes of a nominal predictor. A priori probabilities can be based on previous results or literature. This results in an a posteriori predicted probability of the modeled entity at a given site with known environmental attributes. In vegetation mapping, a posteriori probabilities are calculated for each vegetation unit and the unit with the highest probability is predicted at every candidate site (Fischer, 1990; Brzeziecki et al., 1993).
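A minimal numerical sketch of this combination of prior and conditional probabilities, following the quantities named in Fig. 6f; the frequencies are invented for illustration:

```python
import numpy as np

def posterior_presence(prior_presence, cond_presence, cond_absence):
    """Posterior probability of presence at a site, in the spirit of Fig. 6f:
    prior_presence - a priori probability of presence (ppp)
    cond_presence  - conditional probabilities of the observed predictor
                     classes given presence (one value per predictor)
    cond_absence   - the same conditional probabilities given absence"""
    ppp, ppa = prior_presence, 1.0 - prior_presence
    pcp = np.prod(cond_presence)   # product over predictor classes (pcp)
    pca = np.prod(cond_absence)    # product over predictor classes (pca)
    return ppp * pcp / (ppp * pcp + ppa * pca)

# Hypothetical site falling in classes with these relative frequencies:
print(posterior_presence(0.5, [0.6, 0.3], [0.2, 0.4]))  # ≈ 0.69
```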

    4.6. Neural networks

The recourse to artificial neural networks (ANN) as used by Fitzgerald and Lees (1992, 1994a,b), Civco (1993), Tan and Smeins (1994), Chen et al. (1995), Lees (1996), Lek et al. (1996) or Manel et al. (1999) is a promising area of predictive habitat distribution modeling. However, most published applications of ANN in ecology are concerned with the field of remote sensing (see Fitzgerald and Lees, 1992) and (non-spatial) predictive assessment of environmental changes (see Spitz and Lek, 1999). Very few examples exist of ANN being used to predict the spatial distribution of species or communities from biophysical descriptors. Fitzgerald and Lees (1992) had recourse to a back-propagation neural network to predict the distribution of nine land-use/floristic classes from a mix of remote-sensing (LANDSAT TM bands 2, 4, 7), topographic (elevation, aspect, slope, catchment) and geological data. They obtained results as good as those using a classification tree applied to the same data (e.g. Lees and Ritman, 1991; Lees, 1996).

15 A comparison between CCA and GLM modeling is given in Guisan et al. (1999).

16 Yet, the CCA-approach has mainly been used to predict presence–absence of species at spatial locations.


Lek et al. (1996) showed that neural network models are more powerful than multiple regression models when modeling nonlinear relationships. The full classification procedure in the case of neural networks is a complex non-parametric process that is sometimes seen as a 'black art', even by computer scientists (Caudill, 1991). Detailed discussions of the method are provided e.g. by Hepner et al. (1990) or Benediktsson et al. (1991).
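For illustration, a small feed-forward network can be fitted with scikit-learn's multi-layer perceptron (trained by back-propagation); the input file, column names and network size are hypothetical:

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: presence (0/1) plus topographic/spectral predictors.
df = pd.read_csv('training_sites.csv')
X, y = df.drop(columns='presence'), df['presence']

# Small back-propagation network; predictors are standardized first.
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                  random_state=0))
net.fit(X, y)
p_occ = net.predict_proba(X)[:, 1]  # probability of presence for each site
```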

    4.7. Other approaches

Other existing techniques are not listed in Table 1. Simple models can be developed directly within a GIS, using overlays of environmental variables, measures of variation, measures of similarity and final rules for combining single probabilities. An example is given in Martinez-Taberner et al. (1992). For each cell of the DEM and for each environmental variable, a modified Jaccard index is used to measure the overlap between the degree of seasonal variation and the environmental tolerance of each species. These values are then considered to be probabilities of occurrence of the species as a function of a single environmental predictor. Finally, to calculate an overall probability of occurrence, all predictors are combined using a geometric mean17.
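A minimal sketch of the final combination step only (the geometric mean of single-predictor probabilities for one grid cell); the values are invented:

```python
import numpy as np

# Hypothetical per-predictor probabilities of occurrence for one grid cell
# (e.g. from overlap indices between seasonal variation and species tolerance).
p_single = np.array([0.8, 0.4, 0.6])

# Overall probability as the geometric mean of the single-predictor values;
# any zero forces the combined probability to zero.
p_overall = p_single.prod() ** (1.0 / p_single.size)
print(round(p_overall, 2))  # ≈ 0.58
```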

Examples of distribution modeling exist that have recourse to discriminant function analysis (DFA; e.g. Frank, 1988; Lowell, 1991; Corsi et al., 1999). When modeling vegetation units, individual discriminant functions are produced for each unit, from which a discriminant score is calculated. Further assignment of any candidate site to one of the units is based on the highest calculated score. This approach relies on similar assumptions as those used for least squares regression (independence of predictors, constant variance), except that here the data for each unit need to be drawn from a multivariate normal population (Huberty, 1994). Interestingly, using a GLM with binomial distribution for a binary response variable (i.e. logistic regression) can be considered a more flexible way to run a discriminant analysis.
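A sketch of such a discriminant analysis of vegetation units with scikit-learn; the input table and column names are hypothetical:

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical training sites labelled with vegetation units plus predictors.
df = pd.read_csv('vegetation_releves.csv')
X, units = df.drop(columns='unit'), df['unit']

lda = LinearDiscriminantAnalysis().fit(X, units)
# Each candidate site is assigned the unit with the highest discriminant score.
predicted_unit = lda.predict(X)
```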

The ecological niche factor analysis (ENFA) was first suggested by Perrin (1984), developed later by Hausser (1995) and recently implemented in the BIOMAPPER package (Hirzel et al., 2000). It differs from CCA or RDA in the sense that it considers only one species at a time. Like the environmental envelope approach, it presents the advantage of requiring only presence data, a frequent situation in the case of animal observations where absences are difficult to assess in the field. Using a marginality index and up to three tolerance indices, ENFA situates the species–environmental envelope within the multidimensional environmental envelope that is defined by considering all mapping units within the study area.

With the MONOMAX approach (Bayes and Mackey, 1991; Mackey, 1993), a suite of algorithms fits a monotonic maximum-likelihood function through iterative processes (which Mackey calls 'dynamic programming'). One severe drawback of this method is that the probability of a response variable can be determined from a maximum of two predictor variables at a time. In turn, a clear advantage of such a non-parametric method (also called a distribution-free method) is that no assumptions about the distribution of the data or the residuals, nor about their variance, are needed, which makes such a tool particularly suitable for exploratory analyses (Mackey, 1993).

    5. Model calibration

This step results in the adjustment of the mathematical model that was selected for the specific data set at hand. Rykiel (1996) defines calibration as "the estimation and adjustment of model parameters and constants to improve the agreement between model output and a data set".

17 Additional examples of boolean approaches (overlay rules) can be found in Turner and Baumer (1984), Franklin et al. (1986), Scepan et al. (1987), Cibula and Nyquist (1987), Hodgson et al. (1988), Mead et al. (1988), Shaw and Atkinson (1988), Agee et al. (1989), Stoms et al. (1990), Breininger et al. (1991), Bagby and Brian (1992), Chang et al. (1992), Schuster (1994), or Herr and Queen (1993).


Although we agree with this definition, we would like to broaden it to encompass the more global phase of model construction, which includes the selection of explanatory variables.

To enhance the accuracy and predictive power of a model, the number of explanatory variables used must be reduced to a reasonable number (Harrell et al., 1996). Thus, one of the most difficult tasks is to decide which explanatory variables, or combinations of variables, should enter the model. Estimating their coefficients, once they are selected, is usually a straightforward task.

The selection of predictors can be made arbitrarily (which we do not recommend), automatically (e.g. by stepwise procedures in LS regression, GLMs, and CCA), by following physiological principles, or by following shrinkage rules (Harrell et al., 1996, 1998). The latter approach seems to be promising in the case of main-effect GLM models (i.e., models without interaction terms). Overall, Harrell et al. (1996) suggest that no more than m/10 predictors should be included in the final model, where m is the total number of observations or the number of observations in the least represented category in the case of a binary response.

By predictors, we also mean all their possible transformations such as polynomial terms, b-functions (Austin et al., 1994), smoothed empirical functions (GAMs; Yee and Mitchell, 1991), or the use of significant ordination axes (Franklin et al., 1995) as in orthogonalized regressions (e.g. Heikkinen, 1996). In the latter case, however, the biological interpretation of such artificial components as well as the link to a conceptual model might be difficult. Clearly, the selection of a set of direct and resource gradients (Austin, 1980, 1985) for calibrating a model is particularly promising if ecological significance and interpretability are to be optimized (Prentice et al., 1992).

Variable transformation is closely bound to the primary identification of species' response curves to environmental gradients (see first section). Once their shape is approximated (see e.g. Huisman et al., 1993; Austin and Gaywood, 1994; Austin et al., 1994; Bio et al., 1998), the statistical model attempts to reproduce and formalize this shape (Austin, 1987). For example, Brown (1994) and Franklin (1998) used non-parametric GAMs to explore the response of species to the environmental predictors and then used GLMs to reproduce the identified shapes with adequate parametric terms in the model.

Particular problems arise with skewed responses, as their simulation in LS or GLMs is not easy. Third order and higher polynomials can simulate skewness in the responses, but they tend to reveal spurious and biologically unfeasible response shapes that are more difficult to interpret (Austin et al., 1990; Hastie and Tibshirani, 1987; Huisman et al., 1993; Austin et al., 1994; Bio et al., 1998). This is specifically true outside the range of values used for calibrating the model (e.g. outside the study area, or for evaluating climate change impacts on plant distribution; see Guisan, 1997). Oksanen (1997) concludes that b-functions (Austin et al., 1994) are not appropriate in this case, and that using Huisman's approach of hierarchical models (Huisman et al., 1993) is a preferable alternative, although no clear explanation is given on how to implement such non-linear models within, for example, a GLM. Besides, choosing appropriate initial values for the estimation of parameters in non-linear models might also constitute an additional limitation to their common use in further modeling studies (Huisman, personal communication).

Parameter estimation is an objective mathematical operation described in most statistical textbooks and available in any statistical package (SAS, S-PLUS, SPSS, SYSTAT, etc.). The fit of most models is characterized by a measure of the variance reduction (or deviance reduction in the case of maximum-likelihood estimation (MLE) techniques). In GLMs, the model is optimized through deviance reduction, which can easily be converted into an estimated D2 – the equivalent of R2 in LS models – by the following formula:

D2 = (Null deviance − Residual deviance)/Null deviance,    (1)

where the null deviance is the deviance of the model with the intercept only, and the residual deviance is the deviance that remains unexplained by the model after all final variables have been included. A perfect model has no residual deviance and its D2 takes the value 1.


In the case of LS regression, Weisberg (1980) and Nagelkerke (1991) argue that such a measure is not representative of the real fit, unless the number of observations n and the number of predictors p in the model are taken into account (i.e. weighting by the residual degrees of freedom). For this purpose, Weisberg (1980) suggests a new measure that is commonly called the adjusted R2. It can be defined similarly for GLMs as

adjusted D2 = 1 − [(n − 1)/(n − p)] × [1 − D2],    (2)

where D replaces R in Weisberg's original formula. The value of the adjusted D2 increases with an increasing number of observations (n) or a decreasing number of parameters (p) in the model.
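Both measures can be read directly off a fitted statsmodels GLM (the model and data frame are the hypothetical ones used earlier); note that p is taken here as the number of estimated parameters including the intercept, which is one reasonable reading of Eq. (2):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

fit = smf.glm('presence ~ temperature + precipitation',
              data=df, family=sm.families.Binomial()).fit()

d2 = (fit.null_deviance - fit.deviance) / fit.null_deviance   # Eq. (1)
n, p = fit.nobs, fit.df_model + 1                             # p counts the intercept here
adj_d2 = 1 - ((n - 1) / (n - p)) * (1 - d2)                   # Eq. (2)
print(round(d2, 3), round(adj_d2, 3))
```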

The adjusted D2 – or R2 – is an ideal measure to compare models that include different combinations of variables and interaction terms. Generally, the model for which the deviance reduction is maximal is considered the best, and is further used for prediction purposes. During the variable selection procedure, the deviance reduction associated with each variable is tested for significance at a given confidence level (usually 0.05). The test depends on the method used for estimating the coefficients and the related variance or deviance distribution. For GLMs, McCullagh and Nelder (1989, p. 36) stated: "Once we depart from the Normal-theory linear model we generally lack an exact theory for the distribution of the deviance […] Usually we rely on the χ2 approximation for differences between deviances". Additionally, a Student t-test, using the standard error associated with the estimated model coefficients, is necessary to check whether the coefficients differ significantly from zero.
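A sketch of such a deviance-based (χ2) test between two nested binomial GLMs, reusing the hypothetical data frame and formula from the earlier sketches:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Does adding precipitation significantly reduce the deviance?
fit_reduced = smf.glm('presence ~ temperature',
                      data=df, family=sm.families.Binomial()).fit()
fit_full = smf.glm('presence ~ temperature + precipitation',
                   data=df, family=sm.families.Binomial()).fit()

dev_change = fit_reduced.deviance - fit_full.deviance
df_change = fit_full.df_model - fit_reduced.df_model
p_value = chi2.sf(dev_change, df_change)  # χ2 approximation for the deviance difference
```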

In tree-based techniques the model will attempt to predict the data exactly, so that no fit needs to be characterized and the evaluation of the model may take place immediately after the model calibration. With classification and regression trees (CART) this generally results in over-fitted trees (Chambers and Hastie, 1993) with almost as many terminal nodes as there are observations. Hence, the model is not parsimonious and no reduction in complexity has been achieved. Pruning – which is cutting the tree at a certain complexity level to limit the number of terminal nodes – combined with cross-validation (CV), can be used to cut the tree down to a more 'optimal' number of terminal nodes (Breiman et al., 1984; Chambers and Hastie, 1993; see Franklin et al., 2000 for an application). However, if more than one observation is left out at a time, the result of such optimized pruning can be subject to change from one run to another, since it relies on a random partitioning of the training data set. In this case, we recommend replicating the procedure and choosing the most frequent or average number of terminal nodes.
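A sketch of this pruning/cross-validation idea with present-day tools (scikit-learn's cost-complexity pruning rather than the original CART software); X and y are the hypothetical predictor table and response used earlier:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Cost-complexity pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate each candidate complexity level and keep the best one.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=10).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```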

Walker and Cocks (1991) describe one way to calibrate environmental envelopes. Instead of using the same set of environmental parameters for all species (as in BIOCLIM), they propose the selection of a subset using the CART algorithm (Breiman et al., 1984). This subset of predictors is then used to define the multidimensional envelope that best encloses the occurrences of the species. Their HABITAT model uses a refined set of habitat decision rules which divide the global envelope into sub-envelopes of varying sizes in an optimal way (Walker and Cocks, 1991). The proportion of species' occurrences over the total number of observations in each sub-envelope is then used linearly as a measure of the degree of membership (not a probabilistic concept; see Zadeh, 1965) of each new site to each sub-envelope of the species. Another approach for calibrating an environmental envelope is proposed by Carpenter et al. (1993). Their DOMAIN model is based on a point-to-point similarity metric (Gower, 1971) between a candidate site and the most similar record site in environmental space. Again, the predictions are not probabilistic, but an expression of the degree of classification confidence.

In constrained ordination methods (also called 'direct gradient analysis' or 'direct ordination') like CCA, the model calibration is very similar to linear regression, except that here the goodness-of-fit criterion is to "minimize the ratio of the mean within-species sum-of-squares of the variance to the overall sum of squares" (Hill, 1991). As in regressions, explanatory variables can be selected stepwise.


Posterior to the ordination, each axis can be tested for significance through Monte Carlo permutations. A subset of environmental predictors can also be defined as covariables, which allows the removal of their effect (i.e. 'partial out') from the ordination of the remaining set of explanatory variables (Borcard et al., 1992). This is especially useful in cases where the effects of particular variables are to be singled out from the background variation imposed by other variables (ter Braak, 1988). This is achieved in partial canonical ordination, a method that was first applied by Borcard et al. (1992) and by Borcard and Legendre (1994) to detect hidden spatial gradients that are still unexplained by present ecological gradients. Finally, variables can be declared 'passive', which means that their vector is plotted with the other environmental variables in the resulting ordination bi- or tri-plot, but they are not actually used with the other predictors to calculate the linear combination that best explains the system of ordination axes.

An overall measure of the CCA fit is given both by the trace (or total inertia) of the underlying correspondence analysis (CA) and by the proportion of variance in the species data that is explained by each canonical axis. The trace is the total variance in the species data (i.e. the sum of all eigenvalues). It is measured by the χ2 of the sample-by-species table (Greenacre, 1984) divided by N, the table's grand total (see ter Braak and Smilauer, 1998). The fit of a particular species by k CCA axes is given cumulatively and expressed as a fraction of the variance of that species. The species variance is calculated as the χ2 of the sample-by-species table divided by the species' column total (for more details, see Greenacre, 1984, or ter Braak and Smilauer, 1998). The reported fits are the regression sums of squares of the weighted regression of the data for the species, expressed as a fraction of the total sum of squares for the species (i.e. in a similar way as D2 in GLMs), on the 1–k ordination axes. The overall percentage of explained variance is obtained by adding all axes. These measures of the fit are discussed in more detail in ter Braak and Smilauer (1998).

In addition, the species–environment correlation can be measured for each axis as the correlation of the respective multidimensional coordinates of the species occurrences in both the species and the environmental space. The latter result from multiple regression predictions of the species coordinates on the environmental variables. A high species–environment correlation does not necessarily mean that a considerable amount of the species data is explained by the environmental variables (ter Braak, 1988), and thus it is not a good measure of the fit (see also Guisan et al., 1999).

Calibrating a Bayesian model to predict the distribution of species or vegetation units is equivalent to calculating the multivariate state conditional probability of each considered entity, given the values of the environmental predictors (Aspinall, 1992; Brzeziecki et al., 1993). The significance of each habitat attribute for discriminating or not discriminating the occurrence of the modeled entity can be assessed through χ2 frequency analysis (Aspinall, 1992). The resulting χ2 scores can be used to decide which predictor should be included in the model. If prior information is available, e.g. about the overall frequency of the modeled entities in the study area, it can be set as the prior probabilities. If no prior information is available, the prior probabilities can be defined as equal and assigned an arbitrary value. Fischer (1993) set prior probabilities using data from a systematic sampling, whereas Brzeziecki et al. (1993) did not distinguish prior probabilities because the training data set lacked any statistical sampling procedure. In Aspinall (1992), prior probabilities are estimates of the probabilities of presence and absence. Both can be set to 0.5 if the assumption of equal probability for presence and absence is chosen. As an alternative, they can be set according to the proportion of all sites in which the entity is present (Aspinall, 1992). Qualitative predictors can be treated as in parallelepiped (PPD) classification (see Binz and Wildi, 1988), by assigning probability 1 if a vegetation type occurs at least once (or another defined threshold) within the qualitative class, and 0 if it never occurs within the respective class (Brzeziecki et al., 1993). These values are then multiplied by the probabilities originating from the Bayes formula. Thus, a zero value for any of the qualitative predictors will set the overall probability to zero.


Discriminant functions are often calibrated using Wilks' λ goodness-of-fit statistic, which provides a similar measure of overall model fit as the R2 in multiple regression. It is distributed as an F-ratio when the number of units modeled is less than or equal to 4, or when there are only two predictors (Lowell, 1991; Huberty, 1994).

During model calibration, the individual influence of each observation on the model fitting can usually be evaluated graphically (e.g. in LS, GLMs). In regression analyses, such screening allows the identification of outliers and leverage points, the removal of which (supported by biological reasons) may contribute to improving the fit of the model. In LS regressions, they are derived from the residual calculations (in the case of outliers) and from the hat matrix (in the case of leverage points). Another measure of influence is based on the jack-knife method (Efron and Tibshirani, 1993). It is performed by fitting the model with all but one of the n observations, leaving out successively one observation at a time. This procedure leads to the calculation of empirical influence values for each observation. These values can be plotted as a function of the observation number to detect possible outlying observations. A similar approach can be used for cross-validating the model when no held-back data are available (see Section 7).
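A leave-one-out sketch of such empirical influence values for a binomial GLM, again with the hypothetical data frame and formula used earlier; the influence measure chosen here (largest absolute change in any coefficient) is only one possible choice:

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

formula = 'presence ~ temperature + precipitation'
full = smf.glm(formula, data=df, family=sm.families.Binomial()).fit()

# Refit the model n times, each time leaving one observation out.
influence = []
for i in range(len(df)):
    loo = smf.glm(formula, data=df.drop(df.index[i]),
                  family=sm.families.Binomial()).fit()
    influence.append(np.abs(loo.params - full.params).max())

# Plotting 'influence' against the observation number highlights outlying observations.
```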

    6. Model predictions

Once the plant species' or community's multiple response (i.e. its ecological profile) is derived by any of the modeling techniques previously described, its potential distribution within the modeled area can be predicted. Modeling the potential distribution of plant species or communities is equivalent to modeling their potential habitat (Schuster, 1994; sensu Whittaker et al., 1973), which led some authors to call such maps 'potential habitat distribution maps' (PHDMs). Potential distribution maps can be defined in several ways, as cartographic representations of:
1. the probability of occurrence (e.g. from logistic GLMs; Fig. 7a);
2. the most probable abundance (e.g. from ordinal GLMs; Fig. 7b);
3. the predicted occurrence based on non-probabilistic metrics (e.g. from CCA; Fig. 7c);
4. the most probable entity (e.g. from hierarchical considerations; Fig. 7d).

Although GIS are widely used tools in all types of spatially explicit studies, they still lack important statistical procedures for predictive purposes. This is a serious flaw because not all statistically derived models are similarly easy to implement in a GIS environment.

Logistic regression and supervised classification techniques are available in most GIS packages but they remain largely insufficient when applying most of the previously cited methodological steps (e.g. no stepwise selection procedure for logistic regression is available in ArcInfo). Moreover, they do not provide graphical checking of the model fitting (e.g. regression diagnostics), and the final evaluation of the model predictions cannot be made immediately. In turn, models cannot be easily calibrated outside of the GIS, since most statistical packages cannot read GIS maps directly, and the interchange files are generally huge in size.

GLM models are easy to implement in a GIS, as long as the inverse of their link function can be calculated. Each model is generated by simply multiplying each regression coefficient with its related predictor variable and summing the resulting terms. The results of the calculations are obtained on the scale of the LP, so that the inverse link transformation is necessary to obtain probability values on the scale of the original response variable (Guisan et al., 1998, 1999). With a binomial GLM, for instance, the inverse logistic transformation is

p(y) = exp(LP)/(1 + exp(LP)),    (3)

where LP is the linear predictor fitted by logistic regression. Such a transformation is necessary to obtain probability values between 0 and 1. Ordinal GLMs are implemented on the same basis in a GIS (see Guisan and Harrell, 2000).
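A sketch of applying Eq. (3) outside a GIS, over raster layers held as NumPy arrays; the file names and coefficient values are hypothetical:

```python
import numpy as np

# Hypothetical raster layers (2-D arrays on the same grid) and fitted coefficients.
temperature = np.load('temperature_grid.npy')
precipitation = np.load('precipitation_grid.npy')
b0, b1, b2 = -4.2, 0.35, 0.001   # intercept and slopes from the calibrated GLM

lp = b0 + b1 * temperature + b2 * precipitation   # linear predictor per grid cell
p_map = np.exp(lp) / (1.0 + np.exp(lp))           # Eq. (3): probability of occurrence
```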

Implementing classification models in a GIS depends on the specific approach chosen. Supervised classification techniques (using an MLC algorithm) are available in most GIS.


Fig. 7. Predicted maps representing (a) the probability of occurrence of a species (Spring Mountains, Nevada; see Guisan et al., 1999), (b) the distribution of the most probable abundance values of a species at each pixel (Belalp area, Swiss Alps; from Guisan, 1997), (c) the potential distribution of a species based on a non-probabilistic metric (CART) (Shoshone National Forest, Wyoming; Zimmermann, N.E. and Roberts, D.W., unpublished data) and (d) the most probable entity (vegetation map; Pays d'En-Haut, Swiss Alps; from Zimmermann, 1996; the legend numbers represent vegetation types).


Classification and regression trees can be implemented by using the CART software (Breiman et al., 1984; CART, 1984) directly linked to a GIS, or by writing a GIS-specific macro to reconstruct the tree using conditional statements (e.g. Lagacherie, 1992). However, the latter solution becomes laborious to implement when trees are complex. Rule-based classifications are the easiest to implement in a GIS, as they can be performed th