
Ecological Modelling 133 (2000) 225–245

Evaluating the predictive performance of habitat models developed using logistic regression

Jennie Pearce*, Simon Ferrier
NSW National Parks and Wildlife Service, PO Box 402, Armidale, NSW, 2350, Australia

Accepted 3 May 2000

Abstract

The use of statistical models to predict the likely occurrence or distribution of species is becoming an increasingly important tool in conservation planning and wildlife management. Evaluating the predictive performance of models using independent data is a vital step in model development. Such evaluation assists in determining the suitability of a model for specific applications, facilitates comparative assessment of competing models and modelling techniques, and identifies aspects of a model most in need of improvement. The predictive performance of habitat models developed using logistic regression needs to be evaluated in terms of two components: reliability or calibration (the agreement between predicted probabilities of occurrence and observed proportions of sites occupied), and discrimination capacity (the ability of a model to correctly distinguish between occupied and unoccupied sites). Lack of reliability can be attributed to two systematic sources, calibration bias and spread. Techniques are described for evaluating both of these sources of error. The discrimination capacity of logistic regression models is often measured by cross-classifying observations and predictions in a two-by-two table, and calculating indices of classification performance. However, this approach relies on the essentially arbitrary choice of a threshold probability to determine whether or not a site is predicted to be occupied. An alternative approach is described which measures discrimination capacity in terms of the area under a relative operating characteristic (ROC) curve relating relative proportions of correctly and incorrectly classified predictions over a wide and continuous range of threshold levels. Wider application of the techniques promoted in this paper could greatly improve understanding of the usefulness, and potential limitations, of habitat models developed for use in conservation planning and wildlife management. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Logistic regression; Model evaluation; Prediction; Relative operating characteristic curve

www.elsevier.com/locate/ecolmodel

* Corresponding author. Present address: Landscape Analysis and Applications Section, Canadian Forest Service-Sault Ste. Marie, Great Lakes Forestry Centre, 1219 Queen Street East, PO Box 490, Sault Ste. Marie, Ont., Canada P6A 5M7. Tel.: +1-705-7595740 ext. 2306; fax: +1-705-7595700.

E-mail address: [email protected] (J. Pearce).

0304-3800/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved.

PII: S0304-3800(00)00322-7

1. Introduction

The use of statistical models to predict the likely occurrence or distribution of species is becoming an increasingly important tool in conservation planning and wildlife management. Such modelling often employs logistic regression to model the presence or absence of a species at a set of survey sites in relation to environmental or habitat variables, thereby enabling the probability of occurrence of the species to be predicted at unsurveyed sites. These models are usually fitted using generalised linear modelling (McCullagh and Nelder, 1989) or generalised additive modelling (Hastie and Tibshirani, 1990). For example, logistic regression has been used widely to predict the occurrence and habitat use of endangered vertebrate species (Ferrier, 1991; Lindenmayer et al., 1991; Mills et al., 1993; Pearce et al., 1994; Mladenoff et al., 1999), game species (Straw et al., 1986; Diffenbach and Owen, 1989) and vascular plants (Austin et al., 1990), to examine the response of species to environmental perturbation (Reckhow et al., 1987), and to model the regional distributions of a large number of fauna and flora species to provide information for regional forest conservation planning (Osborne and Tigar, 1992; NSW NPWS, 1994a,b).

Evaluating the predictive performance of models is a vital step in model development. Such evaluation assists in determining the suitability of a model for specific applications. It also provides a basis for comparing different modelling techniques and competing models, and for identifying aspects of a model most in need of improvement. Although the use of statistical modelling techniques such as logistic regression is increasing, relatively little attention has been devoted to the development and application of appropriate evaluation techniques for assessing the predictive performance of habitat models.

To obtain an unbiased estimate of a model's predictive performance, evaluation is best undertaken with independent data collected from sites other than those used to develop the model. These independent sites should be sampled representatively from the region across which the model is likely to be extrapolated. If independent data are not available, then statistical resampling techniques such as cross-validation (Stone, 1974) and jackknifing (Efron, 1982) may be used to reduce bias in the measurement of predictive performance. In cross-validation the model development sites are divided into K groups of roughly equal size (in the special case of jackknifing each group consists of just one site). For each group of sites a model is fitted to the data from the other K−1 groups. This model is used to predict a probability of occurrence for each of the sites in the group excluded from the fitting of the model. This procedure is repeated for all groups until predicted values have been calculated for all sites. These predicted values are then used to assess the accuracy of the predictive model. Cross-validation is a less rigorous approach to model evaluation than using a truly independent dataset, particularly in situations where the model development sites are not distributed representatively across the region under consideration, in terms of both environmental and geographical coverage.
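To make the procedure concrete, a minimal cross-validation sketch in Python is given below. This is our illustration rather than anything from the original study; it assumes the habitat model can be stood in for by scikit-learn's logistic regression, with X holding the environmental predictors and y the presence/absence records for the development sites.

    # K-fold cross-validation sketch (jackknifing when k equals the number of sites).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    def cross_validated_predictions(X, y, k=10, seed=0):
        """Predict a probability of occurrence for every site using a model
        fitted without that site's group."""
        preds = np.empty(len(y))
        folds = KFold(n_splits=k, shuffle=True, random_state=seed)
        for train_idx, test_idx in folds.split(X):
            model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
            preds[test_idx] = model.predict_proba(X[test_idx])[:, 1]  # P(present)
        return preds  # assess predictive accuracy against y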

Good predictions are both reliable and discriminatory. Reliable predictions may be used at 'face value', as each predicted probability is an accurate estimate of the likelihood of detecting the species at a given site. A model with good discrimination ability is one that can correctly discriminate between occupied and unoccupied sites in an evaluation dataset, irrespective of the reliability of the predicted probabilities. The measurement of discrimination performance requires a different approach to that used to measure reliability.

This paper adopts a framework for model evaluation developed by Murphy and Winkler (1987, 1992) that explicitly defines the links between model reliability and discrimination. Techniques for measuring each aspect of predictive performance are described and their calculation demonstrated using models developed for two species in north-east New South Wales. The paper also discusses the relevance and importance of each of these measures in the application of predictive models to conservation planning and management. While the evaluation techniques presented are described primarily in relation to logistic regression, these techniques can potentially be applied to any type of model that generates predicted probabilities of species occurrence (e.g. decision trees, artificial neural networks).

This paper considers only a single component of model performance, that of the accuracy of model predictions. Other, equally important aspects of model performance, such as the rationality and interpretability of the explanatory variables included in the model and the validity of the predicted shape of the response curves, will not be discussed here.

Fig. 1. Framework for assessing the predictive performance of models, describing the links between each measurable component of performance described in the text.

2. A framework for evaluating predictive performance

Murphy and Winkler (1987, 1992) have developed a framework for assessing predictive performance of models that explicitly links model reliability and discrimination ability. This framework, summarised in Fig. 1, also allows prediction error to be partitioned between a number of sources. Their approach is based on the joint distribution of predictions, denoted p, and observations, denoted x, which can be represented as p(p,x). The best approach to understanding this joint distribution is to consider it as a frequency table of observed and predicted values, as shown in Fig. 2. For models derived from species presence/absence data the value predicted by the model for each evaluation site is the probability of occurrence of the species concerned, p. The observation obtained at each evaluation site is the presence or absence of the species, x. To define specific characteristics of the performance of a predictive model, Murphy and Winkler factorise the joint distribution of p and x into a conditional distribution and a marginal distribution, using two different factorisations. Each factorisation may be considered as viewing the frequency table from a different direction. The first factorisation is based on the predictions (the columns of Fig. 2), and involves the conditional distribution of the observations given the predictions, p(x|p), and the marginal distribution of the predictions, p(p):

p(p,x) = p(x|p) · p(p)   (1)

The marginal distribution of the predictions describes the frequency distribution of predicted probability values within the evaluation sample (shown by the bottom row of the frequency table in Fig. 2). The conditional distribution of the observations given the predictions describes the distribution of observed values (presence or absence) obtained for each unique predicted probability (the individual columns of Fig. 2).

The second factorisation is based on the observations (the rows of Fig. 2), and involves the conditional distribution of the predictions given the observations, p(p|x), and the marginal distribution of the observations, p(x):

p(p,x) = p(p|x) · p(x)   (2)

The marginal distribution of the observations describes the distribution of observed values (presence or absence) within the evaluation sample (shown by the third column of the frequency table in Fig. 2). The conditional distribution of the predictions given the observations describes the distribution of predicted values for each unique observed value (the individual rows of Fig. 2). For presence-absence data, the observations have only two unique values, and so the conditional distribution can be thought of as consisting of two frequency distributions, the first describing the distribution of predicted values for evaluation sites at which the species was observed, and the second describing the distribution of predicted values for sites at which the species was not observed.

Fig. 2. The framework described by Murphy and Winkler (1987, 1992) can be considered as a frequency table of observations, x, and predicted values from the model, p, for a given set of evaluation sites.

2.1. Factorisation based on predictions

Murphy and Winkler (1987) describe the factorisation based on predictions as the calibration-refinement factorisation, where the conditional distribution p(x|p) reflects model calibration, and the marginal distribution p(p) reflects model refinement.

Refinement relates to the range of predictions produced by the model for a given set of sites. A model is well refined if predictions cover the full probability range, with predicted values near both one and zero. The variance of the predictions (sp) provides a measure of model refinement, with large values indicating a greater level of refinement than small values. It is necessary to have at least a moderate level of refinement in order to be able to examine model performance further.

Calibration relates to the level of agreement between predictions generated by a model and actual observations. This can be examined graphically by breaking the predicted probability range up into classes, and plotting the proportion of evaluation sites that are observed to be occupied within each of these classes against the median predicted value of each class, as shown in Fig. 3. If the model is well calibrated then the points should lie along a 45° line. The lack of agreement may be partitioned into three components: bias, spread, and unexplained error. These components and their implications for species modelling are described in detail later in this paper.

Fig. 3. Relationship between the predicted probability of occurrence and the observed proportion of evaluation sites occupied for the Yellow Box E. melliodora predictive model developed for north-east New South Wales. The graph is developed by plotting the proportion of evaluation sites found to be occupied within each of ten equi-interval predicted probability classes. These proportions are then plotted against the median value of each class. The number of evaluation sites and the confidence interval of the observed proportion of observations are shown for each class. The regression line describes the overall relationship between the observed and predicted values.
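A calibration graph of this kind can be constructed directly from the evaluation predictions. The sketch below is our illustration (not code from the paper): it bins the predicted probabilities into ten equal-interval classes and returns, for each non-empty class, the median prediction, the observed proportion of occupied sites, and the class sample size.

    import numpy as np

    def calibration_points(pred, obs, n_classes=10):
        """Points for a calibration graph (cf. Fig. 3): pred holds predicted
        probabilities, obs holds 1/0 presence records for the same sites."""
        edges = np.linspace(0.0, 1.0, n_classes + 1)
        cls = np.clip(np.digitize(pred, edges) - 1, 0, n_classes - 1)
        points = []
        for c in range(n_classes):
            in_c = cls == c
            if in_c.any():
                points.append((np.median(pred[in_c]),  # x: median prediction
                               obs[in_c].mean(),       # y: proportion occupied
                               int(in_c.sum())))       # class sample size
        return points  # plot y against x; perfect calibration lies on the 45° line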


2.2. Factorisation based on observations

The factorisation based on the observations is described by Murphy and Winkler (1987) as the discrimination-base rate factorisation, where the conditional distribution p(p|x) measures the discrimination ability of a model, and the marginal distribution p(x) is the base rate.

The base rate indicates how often different values of x (presence or absence) occur and, in species modelling, therefore describes the prevalence of a species at a sampled set of sites (i.e. species rarity or prior probability of occurrence). Murphy and Winkler call this distribution the base rate because it provides information on the probability of a species being observed as present at a randomly selected site, in the absence of a predictive model. The base rate is measured as the proportion of evaluation sites at which a species is observed to be present. This value needs to be moderately large in order for the predictive performance of a model to be examined.

The conditional distribution p(p|x) indicates how often different values of p are predicted for sites at which a particular value of x is observed. For presence/absence observations of a species, p(p|x=1) describes the proportion of occasions that each value of p is predicted for sites at which the species has been recorded as present, and p(p|x=0) describes the proportion of occasions that each p value is predicted for sites at which the species has been recorded as absent. These conditional probabilities indicate how well a model discriminates between sites at which the species has been observed and those at which it has not been observed. The discrimination ability of a model can be examined graphically by plotting a frequency distribution of the predicted values for occupied sites, and comparing this with a frequency distribution of the predicted values associated with unoccupied sites (see Fig. 8a). A model with good discrimination ability will show little overlap between these two distributions, with the predicted values of occupied sites being greater on average than those of unoccupied sites. The measurement of discrimination capacity is described later in this paper.

2.3. Relationship between calibration/refinement and discrimination

The two factorisations described by Murphy and Winkler (1987) are equivalent,

p(x|p) · p(p) = p(p|x) · p(x)   (3)

so that a model which has good calibration and refinement must also have good discrimination (as the base rate is fixed and constant). However, the converse is not necessarily true, as calibration and refinement may both be poor, yet still combine in a manner which gives good discrimination.

As suggested by the above equation, components of the calibration-refinement factorisation influence the discrimination capacity of a model. Refinement affects how well a model will potentially discriminate between observed presence and absence, as the likelihood of good discrimination increases as the predictions span a larger range of the 0 to 1 probability distribution. For example, a model that always predicts a value of 0.5, whether a site is occupied or not, lacks refinement and will have no discriminatory power, even though it may exhibit excellent calibration. As the range of predicted probabilities increases, the model has an increased likelihood of being able to discriminate between positive and negative outcomes. That is, good refinement does not necessarily imply good discrimination, but improves the potential for good discrimination. A model has good discriminatory power if the predicted probability range associated with occupied sites is higher than, and has little overlap with, the probability range associated with unoccupied sites.

However, although these two factorisations are related, they tell us about different aspects of the predictive performance of a model. The calibration/refinement factorisation tells us about the reliability of predictions; that is, how closely predicted probabilities match observed proportions of occurrence. The discrimination/base rate factorisation tells us about how well predictions can discriminate between observed presence and absence, regardless of the absolute value of those predictions. The relative importance of each performance measure (calibration, refinement and discrimination) depends on the use of the model and the experience of the user. If the predictions are to be used as an absolute measure of probability of occurrence, then knowledge of prediction calibration (reliability) and refinement (sharpness) is essential, otherwise the predictions may be misleading. If, on the other hand, the predictions from a model are to be used only as a relative measure of the likelihood of occurrence (for example, to rank areas for reservation), then only discrimination ability need be examined. However, if this relative index is to be broken into two or more classes (for example, suitable versus unsuitable habitat, or areas with high, moderate and low conservation potential), then model calibration and refinement need to be examined (at least visually) to select appropriate class thresholds.

3. Measuring discrimination performance

The discrimination performance of wildlife habitat models derived by logistic regression is often assessed by examining the agreement between predictions and actual observations, using a 2×2 classification table as shown in Fig. 4 (Lindenmayer et al., 1990; Pearce et al., 1994). A species is predicted to be present or absent at a site based on whether the predicted probability for the site is higher or lower than a specified threshold probability value.

The table can be used to calculate four indices describing the predictive performance of models. Two of these indices, sensitivity (or the true positive fraction) and specificity (or the true negative fraction), measure the proportion of sites at which the observations and the predictions agree. The other two indices, the false positive fraction and the false negative fraction, measure the proportion of sites at which the observations and the predictions disagree. These four indices are defined as follows (Fig. 4):

Sensitivity = (number of positive sites correctly predicted) / (total number of positive sites in sample) = A / (A + C)   (4)

Specificity = (number of negative sites correctly predicted) / (total number of negative sites in sample) = D / (B + D)   (5)

False positive fraction = (number of false positive predictions) / (total number of negative sites in sample) = B / (B + D)   (6)

False negative fraction = (number of false negative predictions) / (total number of positive sites in sample) = C / (A + C)   (7)

Fig. 4. The classification table describing the agreement between the observed presence and absence of a species and the predicted presence or absence of a species. Each of the values A, B, C and D represents a number of observations, so that their sum equals the sample size of the evaluation sample (A+B+C+D).

Using these indices, the accuracy (i.e. the total fraction of the sample that is correctly predicted by the model) can be calculated as:

Accuracy = (A + D) / (A + B + C + D)   (8)
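All five quantities follow directly from the counts A, B, C and D of Fig. 4. A minimal sketch (ours, not the authors'), assuming pred and obs as in the earlier examples:

    import numpy as np

    def classification_indices(pred, obs, threshold):
        """Indices (4)-(8) from the 2x2 table at a given decision threshold."""
        pp = pred >= threshold                 # predicted present
        A = np.sum(pp & (obs == 1))            # true positives
        B = np.sum(pp & (obs == 0))            # false positives
        C = np.sum(~pp & (obs == 1))           # false negatives
        D = np.sum(~pp & (obs == 0))           # true negatives
        return {"sensitivity": A / (A + C),               # Eq. (4)
                "specificity": D / (B + D),               # Eq. (5)
                "false positive fraction": B / (B + D),   # Eq. (6)
                "false negative fraction": C / (A + C),   # Eq. (7)
                "accuracy": (A + D) / (A + B + C + D)}    # Eq. (8)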

However, this measure of accuracy can be misleading, as its interpretation depends on a knowledge of the prior probability of occurrence (or base rate) of the species in question, p(x). For example, if the species occurs at only 5% of sites surveyed, a high predictive accuracy can be obtained if the species is predicted to never occur. The predictions will then be correct 95% of the time. A true accuracy measure should not be sensitive to the relative frequency of the species within the test sample. This measure is also a poor indicator of relative predictive accuracy, as two models may have the same accuracy, but perform differently with respect to the types of correct and incorrect predictions they provide. The incorrect predictions from one model may be all false negatives, while for another model they might be all false positives. The nature of incorrect predictions must be examined further to properly interpret the performance of a model, as different types of error have different implications for how a model can be applied.

As described by Murphy and Winkler (1987), the discrimination capacity of a model can be evaluated graphically by plotting the two conditional distributions p(p|x=1) and p(p|x=0) on the same axis and examining the degree of overlap. Two such distributions are depicted in Fig. 5. The horizontal axis represents the predicted probability of a site being occupied, as derived from the model. The two curves represent the frequency distribution of predicted probabilities for two sets of evaluation sites: sites at which the species was observed as present (occupied sites) and sites at which the species was recorded as absent (unoccupied sites). The distribution of predicted values for the occupied sites, p(p|x=1), should lie to the right of that for the unoccupied sites, p(p|x=0), if a model is to be of any use in terms of discrimination ability. Unless this discrimination capacity is perfect, the two distributions will overlap, and some predicted values may be associated with either occupied or unoccupied sites. To generate a 2×2 classification table, a decision threshold (the vertical line) must be identified to separate sites predicted to be occupied from those predicted to be unoccupied. For a given decision threshold the proportion of the distribution of occupied sites falling to the right of this threshold defines the sensitivity (true positive fraction) of a model, while the proportion of unoccupied sites falling to the left of the threshold defines the specificity (or true negative fraction), as indicated in Fig. 5 (Metz, 1978). The proportion of the unoccupied distribution to the right of the threshold is the false positive fraction, while the proportion of the occupied distribution to the left of the threshold is the false negative fraction. The sensitivity, specificity, false positive rate and false negative rate will vary as the decision threshold is changed.

Fig. 5. An example of the model that underlies ROC analysis. The curves represent the frequency distribution of probabilities predicted by a model for occupied and unoccupied sites within a data set for which the real distribution of the species is known. A threshold probability, represented by the vertical line, separates sites predicted to be occupied from sites predicted to be unoccupied.

These traditional measures of discrimination capacity depend on the arbitrary choice of a decision threshold, which introduces a further complication into the interpretation of the classification statistics. The choice of a decision threshold in habitat modelling is usually based partly on knowledge of the prior probability of occurrence of the species of interest, i.e. its rarity, and partly on value judgements regarding the consequences of various kinds of correct and incorrect decisions (Metz, 1986; Fielding and Bell, 1997). For example, if a species is endangered and the model is intended to identify potential re-introduction sites, then it is important that the habitat chosen is indeed suitable for the species, to minimise the risk of failure. This would require the choice of a relatively high threshold probability that would result in the identification of only those sites with a high predicted probability of occurrence. The number of true positive and false positive values would be correspondingly low, as few sites would be predicted to be occupied. Conversely, if the model is used to identify areas within which proposed development may impact the species (e.g. as part of an environmental impact assessment), then it is important to be precautionary by identifying all potentially suitable habitats. The decision threshold would therefore need to be more lenient, and the true positive rate and the false positive rate would be expected to be high, as the model predicts more sites to be occupied. The choice of an actual threshold value in both of these cases is essentially arbitrary, and different practitioners may choose different values. However, this choice strongly affects the relative frequencies of correct and incorrect predictions, and thus the measurement of discrimination performance. A true measure of discrimination capacity should be valid for all situations in which a model will be employed, with any decision threshold that may be appropriate for a given application.

Metz (1986) and, more recently from an ecological perspective, Fielding and Bell (1997) have reviewed several of the most commonly used discrimination indices, including those traditionally employed in wildlife habitat studies. They found most indices to be unsuitable as an unbiased measure of accuracy, being dependent on species rarity and/or the choice of a threshold probability. However, both Metz (1986) and Fielding and Bell (1997) found that one index did meet the requirements of an unbiased discrimination index. This index is derived from the area under a relative operating characteristic curve.

3.1. The relative operating characteristic curve

From Fig. 5 it can be seen that the true positive fraction (sensitivity) plus the false negative fraction must add to 1, as do the true negative fraction (specificity) and the false positive fraction. That is, for a given observed state (positive or negative), the number of correct predictions plus the number of incorrect predictions must equal the number of observations with that state. Thus, the various indices defined above are related by the following equations:

sensitivity + false negative fraction = 1   (9)

specificity + false positive fraction = 1   (10)

In the notation of Murphy and Winkler (1987), these equations can also be written as:

p(P|x=1) + p(A|x=1) = 1   (11)

p(A|x=0) + p(P|x=0) = 1   (12)

where P is a predicted value greater than or equal to the threshold probability, and A is a predicted value less than the threshold probability.

Because of these constraints, it is only necessary to specify one fraction from each equation above to describe all four performance measures. Typically, the sensitivity and the false positive fraction are specified, as these fractions describe the performance of the positive predictions, and vary in the same direction as the decision threshold is varied. Varying the decision threshold incrementally across the predicted probability range of a model will generate a series of pairs of sensitivity and false positive values. Each pair of sensitivity and false positive values can be plotted as the y and x coordinates, respectively, on a graph such as that shown in Fig. 6. This series of points defines a smooth curve, which is called the relative operating characteristic (ROC) curve (Metz, 1978).

Fig. 6. The ROC graph in which the sensitivity (true positive proportion) is plotted against the false positive proportion for a range of threshold probabilities. A smooth curve is drawn through the points to derive the ROC curve. The 45° line represents the sensitivity and false positive values expected to be achieved by chance alone for each decision threshold.
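In code, the curve can be traced by sweeping the threshold across the predicted probability range and collecting the (false positive fraction, sensitivity) pairs. The sketch below is our illustration, using 0.01-spaced thresholds; the final helper estimates the area under the empirical curve by the trapezoidal rule, a method discussed further in Section 3.3.

    import numpy as np

    def roc_points(pred, obs, step=0.01):
        """Sensitivity and false positive fraction over a sweep of thresholds."""
        thresholds = np.arange(0.0, 1.0 + step, step)
        pos, neg = pred[obs == 1], pred[obs == 0]
        tpf = np.array([(pos >= t).mean() for t in thresholds])  # sensitivity
        fpf = np.array([(neg >= t).mean() for t in thresholds])  # false positives
        return fpf, tpf  # plot tpf (y) against fpf (x) to draw the ROC curve

    def trapezoid_area(fpf, tpf):
        """Trapezoidal-rule area; fpf and tpf both fall as the threshold rises."""
        return float(np.sum((fpf[:-1] - fpf[1:]) * (tpf[:-1] + tpf[1:]) / 2))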

The ROC curve describes the compromises that are made between the sensitivity and false positive fractions as the decision threshold is varied. Furthermore, the sensitivity and false positive values are independent of the prevalence of a species, because they are expressed as a proportion of all sites with a given observed state (Swets, 1988). ROC analysis is therefore independent of both species prevalence and decision threshold effects.

3.2. Deriving the ROC curve

Several techniques are available for fitting the ROC curve to the sensitivity and false positive data. Any number of sensitivity and false positive pairs may be used to define and fit the ROC curve, although most applications use at least five points. However, depending on the curve fitting technique used, larger numbers of sensitivity and false positive pairs will result in a better fit. We have generally employed thresholds spaced at 0.01 intervals across the predicted probability range.

The simplest technique for fitting the ROC curve is to plot the points manually with error bars, calculated using the following formulae (Metz, 1978):

Standard error of sensitivity fraction (TPF) = √[TPF × (1 − TPF) / ((No. occupied sites − 1) × (No. occupied sites))]   (13)

Standard error of false positive fraction (FPF) = √[FPF × (1 − FPF) / ((No. unoccupied sites − 1) × (No. unoccupied sites))]   (14)

A smooth curve is then drawn through the points and error bars. More objective and reliable methods require that some assumptions be made regarding the functional form of the ROC curve. Many functional forms have been proposed, but the binormal form is used most widely. According to the binormal model, each ROC curve is assumed to have the same functional form as that generated by two Gaussian distributions, or variables that can, by some monotonic transformation, be transformed into Gaussian form. The binormal assumption allows the ROC curve to be transformed to a straight line when plotted on normal deviate axes, as shown in Fig. 7 (Metz, 1986). Other distributional forms, including logistic, triangular and exponential, also yield approximately straight lines on a binormal graph (Swets, 1986b). The Gaussian approach is therefore applicable even if the form of the underlying distributions is not Gaussian.

Fig. 7. The ROC curve plotted on binormal axes.

The binormal ROC curve can be described by two parameters, a and b; a is the y-intercept, and b is the gradient of the straight line representing the ROC curve when plotted against normal deviate axes. Assuming a Gaussian distributional form, a can be interpreted as the distance between the means of the two distributions, (μx=1 − μx=0)/σx=0, and b can be interpreted as the ratio of the standard deviations of the two distributions, σx=1/σx=0.

For a binormal graph the task of curve-fitting becomes one of choosing numerical values for a and b that best represent the measured data. This is usually done using maximum likelihood estimation algorithms (Metz, 1986).

If the two underlying decision distributions have equal variance then the ROC curve is symmetrical about the minor axis, as shown in Fig. 6.

However, in practice the curve tends to be skewed either toward the origin or toward the top right hand corner of the graph. On binormal axes, the symmetrical ROC curve has a gradient of 1. If the curve is asymmetrical, the gradient will tend to be less than 1 if the curve is skewed toward the origin, and greater than 1 if skewed toward the top right hand corner of the graph. These differences occur due to the distributions of the two decision variables not having equal variance. If the curve is skewed toward the origin, then the variance of the distribution for the unoccupied decision variable is larger than that for the occupied sites.

In practice, we have found the Gaussian approach to be a rapid and easily applied technique for fitting the ROC curve to the sensitivity and false positive pairs. However, we also recommend displaying the actual sensitivity and false positive points on the ROC curve, so that departures from the Gaussian assumption, if present, may be observed.
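A rough way to examine the binormal assumption is to transform each (FPF, TPF) pair to normal deviates and fit a straight line, whose intercept and slope estimate a and b. Maximum likelihood estimation (Metz, 1986) is the rigorous route; the least-squares sketch below (ours, not the authors') is intended only as a quick diagnostic:

    import numpy as np
    from scipy.stats import norm

    def binormal_fit(fpf, tpf):
        """Least-squares estimates of the binormal intercept a and slope b
        on normal-deviate axes: z(TPF) = a + b * z(FPF)."""
        keep = (fpf > 0) & (fpf < 1) & (tpf > 0) & (tpf < 1)  # deviates undefined at 0, 1
        z_fpf, z_tpf = norm.ppf(fpf[keep]), norm.ppf(tpf[keep])
        b, a = np.polyfit(z_fpf, z_tpf, 1)  # polyfit returns slope first, then intercept
        return a, b  # b near 1 suggests a symmetrical (equal-variance) curve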

The ROC curve provides a graphical approach to assessing discrimination capacity over a range of threshold probabilities. A model that has no discrimination ability will generate an ROC curve that follows the 45° line. That is, for all decision threshold values, the sensitivity equals the false positive fraction. Perfect discrimination is indicated when the ROC curve follows the left hand and top axes of the unit square. That is, for all threshold values, the true positive fraction equals one and the false positive fraction equals zero.

The ROC curve can also be used to identify an appropriate threshold value for a given application, by balancing the cost that would arise from an incorrect decision against the benefit to be gained by a correct decision (Metz, 1978; Hilden, 1991). Murtaugh (1996) further describes, with examples, the application of ROC methodology in assessing the accuracy of ecological indicators.

3.3. Deriving a discrimination index from the ROC curve

Once an ROC curve is developed, a single index is required to describe the discrimination capacity of the model. Hilden (1991) argues that a summary measure of discrimination accuracy must incorporate cost-benefit information. However, in ecological applications the costs and benefits associated with correct and incorrect predictions are often quite intangible, making their quantification extremely difficult and, at best, arbitrary. It is therefore most practicable to develop a summary measure that assumes that the costs and benefits are equal.

Swets (1986a) has reviewed several discrimination measures that are consistent with the use of a variable decision threshold (as employed in ROC curves) and do not require cost-benefit information. Unfortunately, most of these measures assume that the two underlying decision distributions have equal variance, an assumption that is rarely met in ecological applications. Swets concluded that the best discrimination index in a range of situations appears to be the area under the ROC curve expressed as a proportion of the total area of the unit square defined by the false positive and true positive axes. This index ranges from 0.5 for models with no discrimination ability, to 1 for models with perfect discrimination.

This index can also be interpreted in terms of the true positive and false positive values used to create the curve. Areas between 0.5 and 0.7 indicate poor discrimination capacity, because the sensitivity rate is not much higher than the false positive rate. Values between 0.7 and 0.9 indicate a reasonable discrimination ability appropriate for many uses, and values higher than 0.9 indicate very good discrimination, because the sensitivity rate is high relative to the false positive rate (Swets, 1988). Hanley and McNeil (1982) have shown that the ROC index can also be interpreted as the probability that a model will correctly distinguish between two observations, one positive and the other negative. In other words, if a positive observation and a negative observation are selected at random, the index is an estimate of the probability that the model will predict a higher likelihood of occurrence for the positive observation than for the negative observation.

The simplest technique to measure the area under the ROC curve is the direct approach, which uses the trapezoidal rule to calculate the area directly from the points on the graph. This approach, however, can underestimate the true area under the curve if the sample is not well refined.

If the curve is derived using the parametric Gaussian approach, then the area under the curve, A, can be calculated using the two parameters a and b describing the straight line on binormal axes (Swets and Pickett, 1982). Alternatively, Brownie et al. (1986) have demonstrated how the mean and variance of the predicted values for occupied and unoccupied sites can be used to directly calculate the area under the ROC curve if the predicted values are normally distributed. This approach can result in more precise estimates of the area under the ROC curve, because the predictions and observations do not need to be first converted to true positive and false positive counts based on a series of probability thresholds. Instead the raw predicted probabilities are used directly to calculate the area under the ROC curve. Both of these statistical techniques depend on the degree to which the data meet the binormal assumption, but this assumption may not be met with ecological data.

Bambar (1975) recognised that the area under the ROC curve is intimately connected with the statistic W calculated in the Wilcoxon or Mann–Whitney statistical test of the difference between two samples (Sokal and Rohlf, 1981). The Mann–Whitney statistic therefore provides a means by which the area under an ROC curve may be calculated without the need to assume normality. The Mann–Whitney statistic is based on a comparison between the ranks of predicted values associated with positive observations and the ranks of predicted values associated with negative observations. Hanley and McNeil (1982) further describe this approach, and also provide a formula by which the standard error of the area may be calculated. Bootstrapping may also be used to calculate the standard error of W (Efron and Tibshirani, 1993). The bootstrap is performed by randomly selecting a site, with replacement, n times from the total set of n evaluation sites (i.e. a given site can be represented more than once in the sample). Each sample selected in this manner is used to calculate a W value. This is repeated a large number of times, and the generated sample of W values is then used to estimate the standard error of the original Mann–Whitney value. The bootstrapping approach is preferable to that presented by Hanley and McNeil (1982).

Of the four techniques for calculating an ROC index of discrimination ability, the Mann–Whitney technique (with bootstrapping to provide a standard error) is the most reliable approach for ecological applications as it makes no distributional assumptions. This is the approach that will be used below. However, if it is known that the two decision distributions are binormal, then the parametric approach of Brownie et al. (1986) is recommended as probably providing the most accurate results.
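A sketch of this recommended approach is given below (our illustration, not code from the paper). The rank-sum form of the Mann–Whitney statistic yields the ROC area directly, and resampling the evaluation sites with replacement provides the bootstrap standard error:

    import numpy as np
    from scipy.stats import rankdata

    def roc_area(pred, obs):
        """Area under the ROC curve via the Mann-Whitney statistic;
        no distributional assumptions, ties handled by midranks."""
        ranks = rankdata(pred)
        n1, n0 = np.sum(obs == 1), np.sum(obs == 0)
        r1 = ranks[obs == 1].sum()  # rank sum for occupied sites
        return (r1 - n1 * (n1 + 1) / 2) / (n1 * n0)

    def roc_area_se(pred, obs, n_boot=200, seed=0):
        """Bootstrap standard error of the ROC area."""
        rng = np.random.default_rng(seed)
        n, areas = len(obs), []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)  # sample n sites with replacement
            if 0 < obs[idx].sum() < n:   # need both classes in the resample
                areas.append(roc_area(pred[idx], obs[idx]))
        return float(np.std(areas, ddof=1))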

The following example illustrates the evaluation of discrimination performance:

3.4. Example 1

The distribution of Yellow Box Eucalyptus melliodora in north-east New South Wales has been modelled as a function of environmental and geographical attributes using field survey data from 2223 sites distributed throughout the region (NSW NPWS, 1994b). Yellow Box was recorded as present at 80 of these sites.

The following logistic regression model was fitted to the Yellow Box survey data using forward stepwise generalised additive modelling (Hastie and Tibshirani, 1990):

logit(p) = s(soil depth, df=4) + s(moisture index, df=3) + s(rainfall, df=2) + s(temperature, df=2)   (15)

The predictive accuracy of this model was then tested using independent evaluation data collected at a further 407 sites within the region (NSW NPWS, 1995a). Yellow Box was recorded at 22 of these evaluation sites.

Before calculating an ROC curve, the discrimination ability of the model was assessed visually by comparing the distribution of predicted probabilities for occupied sites with the distribution of the predicted probabilities for unoccupied sites, as shown in Fig. 8a. The graph indicates that predicted values for sites at which the species was recorded are, on average, higher than those for unoccupied sites, suggesting the model has good discrimination ability. The refinement of values predicted by the model is also good, with predictions ranging from 0 to approximately 0.7.

Fig. 8. Discrimination capacity of the distribution model developed for Yellow Box E. melliodora in north-east New South Wales. A: Distribution of predicted probability values associated with either occupied sites or unoccupied sites. B: The ROC curve.

To examine the discrimination performance of the model over a range of threshold levels, the proportion of evaluation sites correctly predicted to be occupied (sensitivity or true positive rate), and the proportion of sites incorrectly predicted to be occupied (false positive rate), were calculated for 70 threshold values spread evenly across the range of available predicted values (from 0.0 to 0.7). These sensitivity and false positive values were then plotted against each other, and a smooth curve drawn through the points (using the Gaussian assumption), to produce the ROC curve shown in Fig. 8b.

To obtain a summary measure of discrimination capacity, the area under the ROC curve was calculated using the non-parametric approach based on the Mann–Whitney statistic. The non-parametric approach was employed because Fig. 8b indicates that the Gaussian assumption does not provide a good fit for this species, as the fitted curve underestimates the area under the observed ROC curve. The standard error of the Mann–Whitney statistic was calculated using a bootstrap sample of 200. The discrimination capacity of the model was calculated to be 0.916 ± 0.033, indicating that the model can correctly discriminate between occupied and unoccupied sites 91.6% of the time. In other words, if a pair of evaluation sites (one occupied and the other unoccupied) is chosen at random, then there is a 0.916 probability that the model will predict a higher likelihood of occurrence for the occupied site than for the unoccupied site.

4. Measuring model calibration

While discrimination deals with the ability of a model to distinguish between occupied and unoccupied sites, calibration describes the agreement between observations and predicted values (or goodness-of-fit), and therefore describes the reliability with which a model predicts the probability of a site being occupied. The deviance has traditionally been used to evaluate logistic regression models by calculating the amount by which a model reduces null deviance (or the percentage of null deviance explained by the model). The deviance, however, is uninformative about the goodness-of-fit of a model developed from binary data, as it depends only on the fitted probabilities (Collett, 1991, p. 64). However, we can examine the agreement between observations and predictions to examine how and why the predictions depart from the observations. As described earlier (Fig. 1), calibration can be separated into two measurable components, bias and spread, and a third component, unexplained error.

These components of calibration can be best understood by re-examining Fig. 3. If a regression line is fitted to the logits of the predicted probabilities and the observed proportions, then bias and spread can be thought of as the intercept and slope, respectively, of this line. The regression line will be a straight line if both the predicted and observed axes are transformed to a logit scale, but will be a curved logistic line when plotted against untransformed probability axes, as shown in Fig. 3. Perfect calibration is represented by a 45° line (the dotted line in Fig. 3). Bias describes a consistent overestimate or underestimate of the probability of occurrence of a species, resulting in a consistent upward or downward shift in the regression line across the entire predicted probability range. In other words, the intercept of the regression line is either too high or too low. It occurs because the prevalence of the species in the evaluation data is higher or lower than that in the original model development data due, for example, to the use of different survey techniques, or seasonal variation in the abundance or detectability of the species. Bias may also arise when a model, developed in one region or environment, is applied to another region or environment where the species is more or less prevalent.

Spread describes the systematic departure of the regression line, fitted to the predicted and observed values, from a gradient of 45°. On logit axes, a slope of 1, in the absence of bias, implies the predictions follow the 45° line. A slope greater than 1 indicates that predicted values greater than 0.5 are underestimating the occurrence of the species and that predicted values less than 0.5 are overestimating the occurrence of the species. A gradient between 0 and 1 implies that predicted values less than 0.5 are underestimating the occurrence of the species and that predicted values greater than 0.5 are overestimating the occurrence of the species. A gradient significantly different from 1 indicates misspecification of a model.

The third component of calibration, unexplained error, is due to the variability of individual records around the regression line fitted to predicted and observed values, and describes variation not accounted for by the bias and spread of a model. Some of this variation arises because particular covariate patterns or habitat types were not well represented in the development of the model, and may be identified by analysing deviance, bias or spread residuals (Miller et al., 1991). The rest of the unexplained error is due to random variation, the source of which is not identifiable through residual analysis. In wildlife habitat studies this component of the deviance is expected to be quite large due to error in measurement of species presence and environmental predictors, and to factors not included in the model that may influence habitat selection or species distribution (such as competition, predation, or biogeographical barriers to dispersal).

A reliable model (well calibrated) should be able to correctly predict the actual proportion of sites occupied by the species of interest. That is, E(x|p) = p. Cox (1958) formalised an approach to detecting bias and spread by using logistic regression to model the relationship between the logit of the predicted probabilities (pi) for the evaluation sites and the observed occurrences (p(xi=1)) at these sites. This modelled relationship takes the form:

ln[p(xi=1)/p(xi=0)] = a + b·ln[pi/(1−pi)]   (16)

The coefficients a and b in this relationship represent bias and spread, respectively. In a perfectly calibrated model a will have a value of zero and b will have a value of one.
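Eq. (16) is an ordinary logistic regression of the evaluation observations on the logit of the model's predictions, so any GLM routine can fit it. A sketch using statsmodels (our choice of library, not the authors'), assuming predictions strictly between 0 and 1:

    import numpy as np
    import statsmodels.api as sm

    def calibration_fit(pred, obs, eps=1e-6):
        """Fit Eq. (16): logit P(x=1) = a + b * logit(p).
        Returns the bias (a) and spread (b) estimates and the fitted model."""
        p = np.clip(pred, eps, 1 - eps)   # guard against logit(0) and logit(1)
        logit_p = np.log(p / (1 - p))
        X = sm.add_constant(logit_p)      # columns: [1, logit(p)]
        fit = sm.GLM(obs, X, family=sm.families.Binomial()).fit()
        a, b = fit.params                 # perfect calibration: a = 0, b = 1
        return a, b, fit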

If a = 0, then the observed proportion (p(xi=1)) at pi = 0.5 will be 0.5, regardless of the value of b.

If b > 1 (Fig. 9(A)), then predicted values less than 0.5 are overestimating the observed proportion of occurrence and predicted values greater than 0.5 are underestimating observed occurrence. If 0 < b < 1, then the reverse is occurring, with predicted values less than 0.5 underestimating and those above 0.5 overestimating (Fig. 9(B)). If b < 0, then the overall trend in predicted probabilities is wrong, with values less than 0.5 being associated with a higher observed proportion of occurrence than those above 0.5.

At a predicted (p) value of 0.5, ln[p(xi=1)/p(xi=0)] = a, and a = 0 implies p(xi=1) = 0.5. Thus a reflects the overall bias of the model if b = 1. However, if b ≠ 1, then a describes the model bias at p = 0.5. This occurs because a is a function of b:

a = μp − b·μx   (17)

The predicted probabilities are generally too low if a > 0, and too high if a < 0 (Fig. 9(C, D)). Fig. 9(E, F) illustrate the relationship between model predictions and observations if both spread error and bias are present.

Cox (1958) devised score tests to evaluate the bias and spread of binary regression models. Miller et al. (1991) provide equivalent tests based on likelihood ratio statistics that are more accurate and easier to apply. These tests will be used here.

The first of these likelihood ratio tests examines the hypothesis that a = 0, b = 1, and is calculated as the difference between two deviance values. The first deviance value, D(0,1), is simply the deviance of observations at the evaluation sites in relation to raw predictions from the original model. This is equivalent to forcing a = 0 and b = 1 in Eq. (16), i.e. by assuming that the model has perfect calibration. The second deviance value, D(a,b), is the deviance of observations in relation to the adjusted predictions, obtained by applying Eq. (16) using fitted values for a and b. In other words, these values are used to correct predictions for bias and spread. The test evaluates the significance of this correction by comparing the difference between the two deviance values:

D(0,1) − D(a,b)   (18)

to a χ2 distribution with two degrees of freedom.


Fig. 9. Calibration diagrams showing the effect of bias and spread on the agreement between model predictions and observations. (A) No bias with spread error (gradient greater than 1); (B) no bias with spread error (gradient less than 1); (C) positive bias with correct spread; (D) negative bias with correct spread; (E) positive bias with spread error (gradient greater than 1); (F) negative bias with spread error (gradient greater than 1).


The second likelihood ratio test examines the hypothesis that a = 0 | b = 1, and is a test for bias given appropriate spread. The following test statistic is compared to a χ2 distribution with one degree of freedom:

D(0,1) − D(a,1)   (19)

where D(a,1) is the deviance of observations in relation to predictions adjusted by applying Eq. (16) using a fitted value for a but forcing b to equal 1.

The third likelihood ratio test examines the hypothesis that b = 1 | a, and is a test for incorrect spread given no bias. The following test statistic is again compared to a χ2 distribution with one degree of freedom:

D(a,1) − D(a,b)   (20)
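Building on the calibration fit sketched earlier, the three tests reduce to differences of binomial deviances referred to χ2 distributions. The helper below is our illustration under the same assumptions; D(a,1) is obtained by fitting an intercept-only GLM with logit(p) as an offset, which forces b = 1:

    import numpy as np
    from scipy.stats import chi2
    import statsmodels.api as sm

    def binomial_deviance(obs, p, eps=1e-9):
        """Deviance of binary observations against predicted probabilities."""
        p = np.clip(p, eps, 1 - eps)
        return -2 * np.sum(obs * np.log(p) + (1 - obs) * np.log(1 - p))

    def calibration_tests(pred, obs, eps=1e-6):
        p = np.clip(pred, eps, 1 - eps)
        logit_p = np.log(p / (1 - p))
        d01 = binomial_deviance(obs, p)  # D(0,1): raw predictions as-is
        dab = sm.GLM(obs, sm.add_constant(logit_p),
                     family=sm.families.Binomial()).fit().deviance  # D(a,b)
        da1 = sm.GLM(obs, np.ones((len(p), 1)), family=sm.families.Binomial(),
                     offset=logit_p).fit().deviance                 # D(a,1)
        return {"a=0, b=1 (Eq. 18)": (d01 - dab, chi2.sf(d01 - dab, 2)),
                "a=0 | b=1 (Eq. 19)": (d01 - da1, chi2.sf(d01 - da1, 1)),
                "b=1 | a (Eq. 20)": (da1 - dab, chi2.sf(da1 - dab, 1))}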

The following example illustrates the evaluation of model calibration.


4.1. Example 2

As described in Example 1, the distribution of Yellow Box E. melliodora has been modelled within north-east New South Wales as a function of several regional scale environmental variables using field survey data from 2223 sites. The occurrence of the species has also been surveyed at an additional 407 sites within the region. These additional data are now used to evaluate the reliability of the model's predictions in terms of calibration, bias and spread.

The overall agreement between model predictions and the observed occurrence of Yellow Box within the evaluation data was examined graphically. This was undertaken by breaking the predicted probability range into ten classes, and calculating the proportion of occupied sites within each class. The proportion of occupied sites and the median predicted probability for each class were then plotted against each other, and a smooth curve drawn through the points, as shown in Fig. 3. If there were perfect agreement between predicted probabilities from the model and the observed occurrence of the species, the points in this graph would lie along the 45° line. Significant systematic departure of the points from this 45° line indicates the existence of calibration bias and/or spread error. Examination of the curve in Fig. 3 suggests that the model may lack calibration due to both bias and incorrect spread within the predictions.

To determine whether this lack of calibration is statistically significant, the observed data were modelled as a function of the logit of the predicted values to produce the following logistic regression model:

ln[p(xi=1)/p(xi=0)] = −0.3987 + 0.6659·ln[pi/(1−pi)]   (21)

This model was used to fit the smooth curve to the points in Fig. 3. If the observations agree with the predictions, then the intercept and gradient of this regression line should not differ significantly from 0 and 1, respectively.

To determine whether the a and b coefficients in the above model differ significantly from 0 and 1, thereby suggesting significant bias and/or spread error, the likelihood ratio tests reported by Miller et al. (1991) were applied. D(a,b), the deviance of observed values in relation to the logistic regression model presented in Eq. (21), was calculated to be 112.451, and the null deviance from this model, D(0,1), to be 115.697 with 405 degrees of freedom. Therefore, using Eq. (18):

D(0,1) − D(a,b) = 115.697 − 112.451 = 3.246   (22)

This value was not significant when compared to a χ2 distribution with two degrees of freedom. Therefore, the a and b coefficients do not differ significantly from 0 and 1, respectively. Consequently, the curve in Fig. 3 does not depart significantly from the 45° line, and the model accurately predicts the observed probability of occurrence of Yellow Box at the evaluation sites. If this test had rejected the hypothesis that a = 0, b = 1, then the alternative hypotheses that a = 0 | b = 1 (bias given appropriate spread) or b = 1 | a (incorrect spread in the absence of bias) could have been evaluated further using the likelihood ratio tests presented in Eqs. (19) and (20).

5. Comparing the performance of two or more models

It is often necessary to compare the performance of two or more models for a species, developed using different explanatory variables or different modelling techniques, in order to choose the most appropriate model for a given application. The reliability of predictions generated by two or more models can be compared by examining differences in residual deviance, D(0,1), of evaluation data in relation to predictions, and by comparing levels of prediction error in terms of calibration bias and spread. The model providing the most reliable predictions is that with the smallest residual deviance and the least error in prediction spread and bias.

The difference between the areas under two ROC curves generated by two or more models provides a measure of the comparative discrimination capacity of these models when applied to independent evaluation data. The best model is that generating the largest ROC area. However, it is important to note that, if two ROC curves intersect, then neither model performs consistently better than the other across the entire range of decision thresholds (Brownie et al., 1986). This information needs to be considered when interpreting differences in the summary index of average discrimination performance.

The significance of the difference between the areas under two ROC curves (A1 and A2) generated using independent data can be calculated as a critical ratio test:

$$Z = \frac{A_1 - A_2}{\sqrt{SE_{A_1}^2 + SE_{A_2}^2}} \qquad (23)$$

The value of Z is then tested against a table of areas under the normal probability density function. If this value is significant at a specified probability threshold, this is evidence that the true ROC areas are different.
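A minimal sketch of this test, assuming the two areas and their standard errors have already been estimated from independent data (the function name and arguments are illustrative):

from math import sqrt
from scipy.stats import norm

def critical_ratio(a1, se1, a2, se2):
    """Eq. (23): compare two ROC areas estimated from independent data."""
    z = (a1 - a2) / sqrt(se1 ** 2 + se2 ** 2)
    return z, 2 * norm.sf(abs(z))   # two-sided p-value from the normal table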

However, if the two ROC curves are generated using the same evaluation data, the two areas are likely to be positively correlated, and thus estimation of the standard error of the difference in area under the two curves, the denominator in Eq. (23), needs to account for this correlation. Hanley and McNeil (1983) have developed the following equation to incorporate r, the correlation between predictions for the two models:

$$SE(A_1 - A_2) = \sqrt{SE_{A_1}^2 + SE_{A_2}^2 - 2\,r\,SE_{A_1}\,SE_{A_2}} \qquad (24)$$

The right side of this equation replaces the denominator in the critical ratio test in Eq. (23).
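Combining Eqs. (23) and (24) gives the correlated-data version of the test; again a hedged sketch with illustrative names, where `r` is obtained as described in the next paragraph:

from math import sqrt
from scipy.stats import norm

def critical_ratio_correlated(a1, se1, a2, se2, r):
    """Eqs. (23) and (24): compare ROC areas from the same evaluation data."""
    se_diff = sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)
    z = (a1 - a2) / se_diff
    return z, 2 * norm.sf(abs(z))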

To calculate r, it is necessary to determine two intermediate correlation coefficients: the correlation between predictions from the two models for the positive events, and the correlation between predictions for the negative events. These are calculated independently using either the Pearson product moment correlation coefficient or the Spearman rank correlation coefficient. The average of these two correlation values, and the average ROC area of the two models, are then compared to the table provided by Hanley and McNeil (1983) to determine r. (The average correlation value can be used to approximate r if the table is not available.)
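The intermediate correlations can be sketched as follows, assuming `p1` and `p2` hold the two models' predictions and `y` the observed outcomes; note this uses the ordinary Spearman coefficient rather than the many-ties variant, and the averaging shortcut rather than the Hanley and McNeil (1983) table, so it yields the approximation mentioned above.

import numpy as np
from scipy.stats import spearmanr

# Correlation between the two models' predictions, separately for the
# positive (occupied) and negative (unoccupied) events.
r_pos = spearmanr(p1[y == 1], p2[y == 1]).correlation
r_neg = spearmanr(p1[y == 0], p2[y == 0]).correlation

# Their average approximates r when the Hanley-McNeil table is unavailable.
r_approx = (r_pos + r_neg) / 2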

When comparing two ROC curves it is important to consider not only the area under each curve, as discussed, but also the shape of the curves and whether they intersect. The area under the curve provides a summary measure of model discrimination accuracy. Consequently, the ROC curve with the larger area is, on average, more accurate. However, the shape of the ROC curve is also important, and describes the trade-off between true positive and false positive rates as the threshold probability is altered. Therefore, two ROC curves that intersect provide different levels of accuracy depending on which threshold probability level is selected. This trade-off may affect the choice of a model for a given application.

The following example demonstrates the analysis of the difference in discrimination capacity between two models derived from the same evaluation data.

5.1. Example 3

The distribution of the small reptile Calyptotis scutirostrum in north-east New South Wales has been modelled using presence/absence data from 836 sites as a function of a number of climatic, topographical and contextual variables (NSW NPWS, 1994a). The discrimination capacity of this distributional model was determined by calculating the area under an ROC curve developed using data from 125 independent evaluation sites within the region (NSW NPWS, 1995b; Clode and Burgman, 1997). One hundred threshold probabilities were used to calculate the curve. The model was found to have good discrimination ability, with an ROC area of 0.811 ± 0.038.
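A curve of this kind can be traced by sweeping a grid of thresholds; the sketch below (an illustration, not the original code, using hypothetical arrays `y` and `p` as before) computes the area with the trapezoidal rule over 100 thresholds.

import numpy as np

def roc_area(y, p, n_thresholds=100):
    """Area under an ROC curve traced over a grid of thresholds."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    tpr = np.array([(p[y == 1] >= t).mean() for t in thresholds])
    fpr = np.array([(p[y == 0] >= t).mean() for t in thresholds])
    # fpr falls as the threshold rises, so the trapezoid sum is negated.
    return -np.trapz(tpr, fpr)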

In an attempt to further improve the discrimination capacity of the model, microhabitat variables measured at each of the 125 evaluation sites were added to the model. To evaluate this refined model, a jackknife procedure was employed to calculate independent predicted probability values for each of the 125 validation sites. These values were then used to develop a second ROC curve. The discrimination capacity of this refined model was calculated to be 0.888 ± 0.028.
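The jackknife step can be sketched as a leave-one-out refit; this is one plausible implementation rather than the original procedure, assuming a design matrix `X` (with an intercept column) and outcomes `y` for the 125 sites.

import numpy as np
import statsmodels.api as sm

def jackknife_predictions(X, y):
    """Leave-one-out predicted probabilities for each evaluation site."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        fit = sm.GLM(y[mask], X[mask], family=sm.families.Binomial()).fit()
        preds[i] = fit.predict(X[i:i + 1])[0]   # predict the held-out site
    return preds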


To determine whether the addition of microhabitat variables provides a significant improvement in discrimination capacity over the original model, the significance of this improvement was calculated using a critical ratio test, modified to control for correlation between the two discrimination indices. This correlation exists because the same 125 evaluation sites were used to calculate the area under the ROC curve for both predictive models.

Following the procedure of Hanley and McNeil (1983), the degree of correlation between the two ROC area measurements was calculated to be 0.982 using the many-ties version of the Spearman rank correlation test (Conover, 1980). This value was calculated from the mean of the correlation between the predictions for occupied sites for each model (r = 0.979), and the predictions for unoccupied sites for each model (r = 0.984). The average correlation value of 0.982 was incorporated into the calculation of a critical ratio test statistic, using Eqs. (23) and (24):

$$Z = \frac{A_1 - A_2}{\sqrt{SE_{A_1}^2 + SE_{A_2}^2 - 2\,r\,SE_{A_1}\,SE_{A_2}}} = \frac{0.888 - 0.811}{\sqrt{0.038^2 + 0.028^2 - 2 \times 0.982 \times 0.038 \times 0.028}} = 6.55 \qquad (25)$$

The resulting Z-value was significant at the 1% level when compared to the normal distribution. Therefore, adding the microhabitat information to the predictive model significantly improved the ability of the model to discriminate between occupied and unoccupied sites.
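The arithmetic of Eq. (25) can be verified directly (an illustrative check):

from math import sqrt

z = (0.888 - 0.811) / sqrt(0.038 ** 2 + 0.028 ** 2
                           - 2 * 0.982 * 0.038 * 0.028)
# z is approximately 6.55, far beyond the 1% critical value of about 2.58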

6. Conclusion

Wider application of the techniques described in this paper could improve understanding of the usefulness, and potential limitations, of habitat models developed for use in conservation planning and wildlife management. Evaluation of predictive performance can also assist in determining the suitability of a model for specific applications.

There are three main ways in which predictions of species occurrence derived from logistic regression models may be used. First, predictions can be used as an absolute estimate of the probability of a species occurring at a site; that is, the predicted probabilities are used at face value. Second, predictions can be used merely as a relative index of likelihood of occurrence, where higher values indicate sites more likely to be occupied by the species. Third, predicted probabilities can be converted to predicted presence/absence by applying a threshold to the predicted probability range. Each of these uses requires knowledge of different components of predictive performance.

If the predictions are to be used at face value, for example to estimate the total population size for a species by predicting the probability of the species occurring at a large set of sites within a region, then knowledge of model calibration, bias and spread is essential to interpret the predicted probability values. It is important to know something about the expected magnitude and nature of differences between predicted probabilities and observed proportions of occurrence.
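For instance (a hypothetical one-line illustration), with well-calibrated predicted probabilities `p` over all candidate sites in a region, the expected number of occupied sites is simply their sum:

# Expected number of occupied sites under a well-calibrated model.
expected_occupied = p.sum()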

However, in many applications of logistic regression models, exact estimation of the probability of species occurrence is not required. All that is required is an index of the relative suitability of sites within a region, so that areas may be ranked according to their importance as habitat or their likelihood of containing the species of interest. Examples of these applications include maps of relative habitat suitability to aid species management, or the ranking of priority areas within a region for selection of conservation reserves. This type of application requires knowledge of the degree to which higher predicted probabilities are associated with the presence of a species. The discrimination index provides a summary measure of this capability. A graph relating observed occurrence to predicted probability, such as that shown in Fig. 3, can also provide useful information on the rank order relationship between predictions and observations.

An understanding of discrimination capacity is particularly important if a model is to be used to delineate areas predicted to be occupied, or to contain suitable habitat, from areas predicted to be unoccupied, or to contain unsuitable habitat. An understanding of model calibration is also important in this case, to inform selection of an appropriate threshold probability to distinguish sites predicted to be occupied from sites predicted to be unoccupied.

Most studies utilising the threshold probability approach (Lindenmayer et al., 1990; Pearce et al., 1994) have assumed that the selection of a threshold, such as 0.5, implies that sites with a probability greater than 0.5 will be occupied more than 50% of the time. However, if a model is not well calibrated and has significant bias or spread error, then a predicted value of 0.5 will not relate to an observed value of 0.5, but to a higher or lower observed rate. Therefore, choosing a threshold probability without any information on bias and spread will greatly reduce the confidence that can be placed in any map that is produced. If predictions from a model tend to overestimate the occurrence of a species, then the predicted distribution will include not only areas of suitable habitat (true positive sites) but also substantial areas of unsuitable habitat (false positive sites). However, if a model tends to underestimate the occurrence of a species, then areas of potentially occupied habitat will remain unidentified (false negative sites). For a rare species this may have devastating ramifications, because some populations may fail to be identified and protected. The ROC curve can also provide useful information for selecting an appropriate probability threshold by describing the trade-off between correctly predicting the occurrence of a species (true positive) and incorrectly predicting the presence of the species (false positive).

The evaluation techniques described in this paper can also be used to identify aspects of a model most in need of improvement. Each of the statistics provides information on possible reasons for a lack of agreement between predictions and observations. Calibration bias suggests that the prevalence of the species in the model development data and the evaluation data are different, and can arise because of differences in methodology between the surveys that collected these data sets (e.g. different detection techniques, different seasons), or differences between the regions or environments covered by these surveys. It is important to develop models using data that are representative of the situations in which a model is to be applied. Lack of model refinement and the presence of spread error suggest that important explanatory variables are missing from the model, that non-discriminatory variables have been included in the model, or that the weighting of variables within the model places too much emphasis on variables that are only weakly related to the occurrence of the species in the validation data. Lack of discrimination ability usually arises because the explanatory variables in the model are not strongly associated with the presence of the species. Fielding and Bell (1997) provide a detailed discussion of potential sources of classification error in ecological applications, and their effect on the classification ability of a model.

The model evaluation techniques described in this paper can play an important role in the development and application of models for conservation planning and wildlife management. Such evaluation needs to be included routinely as part of the model development process. Models cannot be applied confidently without knowledge of their predictive accuracy and the nature and source of prediction errors.

Acknowledgements

The work described in this paper was performed as part of two consultancies funded by the Australian Nature Conservation Agency and Environment Australia. We thank Andrew Taplin and Dave Barrett from these agencies for their support and encouragement. We also thank David Meagher for commenting on an early draft of the manuscript.

References

Austin, M.P., Nicholls, A.O., Margules, C.R., 1990. Measurement of the realised qualitative niche: environmental niches of five Eucalyptus species. Ecol. Monogr. 60, 161–177.

Bamber, D., 1975. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J. Math. Psych. 12, 387–415.

Brownie, C., Habicht, J., Cogill, B., 1986. Comparing indicators of health or nutritional status. Am. J. Epidemiol. 124, 1031–1044.

Clode, D., Burgman, M., 1997. Joint Old Growth Forests Project: Summary Report. NSW National Parks and Wildlife Service and State Forests of NSW, Sydney.

Collett, D., 1991. Modelling Binary Data. Chapman and Hall, London.

Conover, W.J., 1980. Practical Nonparametric Statistics, 2nd ed. Wiley, New York.

Cox, D.R., 1958. Two further applications of a model for binary regression. Biometrika 45, 562–565.

Diefenbach, D.R., Owen, R.B., 1989. A model of habitat use by breeding American Black Ducks. J. Wildl. Manage. 53, 383–389.

Efron, B., 1982. The Jackknife, the Bootstrap, and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 38. SIAM, Philadelphia.

Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. Chapman and Hall, New York.

Ferrier, S., 1991. Computer-based extension of forest fauna survey data: current issues, problems and directions. In: Lunney, D. (Ed.), Conservation of Australia's Forest Fauna. Royal Zoological Society of New South Wales, Sydney, pp. 221–227.

Fielding, A.H., Bell, J.F., 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ. Conserv. 24, 38–49.

Hanley, J.A., McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36.

Hanley, J.A., McNeil, B.J., 1983. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148, 839–843.

Hastie, T.J., Tibshirani, R., 1990. Generalized Additive Models. Chapman and Hall, London.

Hilden, J., 1991. The area under the ROC curve and its competitors. Med. Dec. Making 11, 95–101.

Lindenmayer, D.B., Cunningham, R.B., Tanton, M.T., Smith, A.P., Nix, H.A., 1990. Habitat requirements of the mountain brushtail possum and the greater glider in the montane ash-type forests of the central highlands of Victoria. Aust. Wildl. Res. 17, 467–478.

Lindenmayer, D.B., Cunningham, R.B., Tanton, M.T., Nix, H.A., Smith, A.P., 1991. The conservation of arboreal marsupials in the montane ash forests of the Central Highlands of Victoria, south-east Australia. III. The habitat requirements of Leadbeater's Possum, Gymnobelideus leadbeateri, and models of the diversity and abundance of arboreal marsupials. Biol. Cons. 56, 295–315.

McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models, 2nd ed. Chapman and Hall, London.

Metz, C.E., 1978. Basic principles of ROC analysis. Sem. Nuclear Med. 8, 283–298.

Metz, C.E., 1986. ROC methodology in radiologic imaging. Invest. Rad. 21, 720–733.

Miller, M.E., Hui, S.L., Tierney, W.M., 1991. Validation techniques for logistic regression models. Stat. Med. 10, 1213–1226.

Mills, L.S., Fredrickson, R.J., Moorhead, B.B., 1993. Characteristics of old-growth forests associated with northern spotted owls in Olympic National Park. J. Wildl. Manage. 57, 315–321.

Mladenoff, D.J., Sickley, T.A., Wydeven, A.P., 1999. Predicting gray wolf landscape recolonisation: logistic regression models vs. new field data. Ecol. Appl. 9, 37–44.

Murphy, A.H., Winkler, R.L., 1987. A general framework for forecast verification. Monthly Weather Rev. 115, 1330–1338.

Murphy, A.H., Winkler, R.L., 1992. Diagnostic verification of probability forecasts. Int. J. Forecasting 7, 435–455.

Murtaugh, P.A., 1996. The statistical evaluation of ecological indicators. Ecol. Appl. 6, 132–139.

NSW NPWS, 1994a. Fauna of north-east NSW forests. North East Forests Biodiversity Study Report No. 3. NSW National Parks and Wildlife Service, Sydney, New South Wales, Australia.

NSW NPWS, 1994b. Flora of north-east NSW forests. North East Forests Biodiversity Study Report No. 4. NSW National Parks and Wildlife Service, Sydney, New South Wales, Australia.

NSW NPWS, 1995a. Vegetation Survey and Mapping of Upper North-eastern New South Wales. NSW National Parks and Wildlife Service, Sydney, New South Wales, Australia.

NSW NPWS, 1995b. Vertebrates of Upper North East New South Wales. NSW National Parks and Wildlife Service, Sydney, New South Wales, Australia.

Osborne, P.E., Tigar, B.J., 1992. Interpreting bird atlas data using logistic models: an example from Lesotho, Southern Africa. J. Appl. Ecol. 29, 55–62.

Pearce, J.L., Burgman, M.A., Franklin, D.C., 1994. Habitat selection by helmeted honeyeaters. Wildl. Res. 21, 53–63.

Reckhow, K.H., Black, R.W., Stockton, T.B. Jr, Vogt, J.D., Wood, J.G., 1987. Empirical models of fish response to lake acidification. Can. J. Fish. Aquat. Sci. 44, 1432–1442.

Sokal, R.R., Rohlf, F.J., 1981. Biometry, 2nd ed. W.H. Freeman, New York.

Stone, M., 1974. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. B 36, 111–147.

Straw, J.A. Jr, Wakely, J.S., Hudgins, J.E., 1986. A model for management of diurnal habitat for American woodcock in Pennsylvania. J. Wildl. Manage. 50, 378–383.

Swets, J.A., 1986a. Indices of discrimination or diagnostic accuracy: their ROCs and implied models. Psych. Bull. 99, 100–117.

Swets, J.A., 1986b. Form of empirical ROCs in discrimination and diagnostic tasks: implications for theory and measurement of performance. Psych. Bull. 99, 181–198.

Swets, J.A., 1988. Measuring the accuracy of diagnostic systems. Science 240, 1285–1293.

Swets, J.A., Pickett, R.M., 1982. Evaluation of Diagnostic Systems. Academic Press, New York.
