The predictive validation of ecological and environmental models



Ecological Modelling, 68 (1993) 33-50 Elsevier Science Publishers B.V., Amsterdam


The predictive validation of ecological and environmental models

M. Power

Department of Biology, University of Waterloo, Waterloo, Ont., Canada

(Received 11 March 1992; accepted 1 December 1992)

ABSTRACT

Power, M., 1993. The predictive validation of ecological and environmental models, Ecol. Modelling, 68: 33-50.

The widespread use of numerical models for the study of ecological and environmental phenomena requires some means of assessing model correctness. While validation is such a process, it has tended to focus on the in-sample goodness-of-fit properties of models. Good performance at the model estimation and calibration stage, however, does not guarantee correct predictions. Despite the emphasis on prediction in many ecological and environmental models, predictive validation techniques are typically ignored in the modelling literature. Drawing on the econometric and statistical forecasting literature, this paper reviews the available methods and suggests that a structured approach to predictive model validation be adopted. The approach assumes the candidate models have been validated in-sample and proposes that the models first be evaluated for statistical adequacy before being compared and ranked. The evaluative phase eliminates models with poor statistical properties. The comparative phase selects, from the subset of statistically adequate models, the model with the best predictive properties. The approach is argued to increase user confidence and help ensure that the results of modelling exercises are effectively and appropriately used in ecological and environmental decision-making processes.

INTRODUCTION

There seems little need to justify the application of quantitative models to the study of complicated ecological systems. The analytical intractability of such systems, the degree of nonlinearity often inherent in the systems and their stochastic nature underscore the utility of quantitative modelling approaches. The use, however, of complicated numerical and computational techniques is not all that is required of the modelling process. If accurate, or meaningful, results are to be obtained and used, it is necessary to know how much confidence can be placed in model results. Users must be assured that the model actually corresponds to the system being studied. The process of providing the required assurance of the correspondence has been referred to as model validation (Naylor et al., 1966).

Correspondence to: M. Power, Department of Biology, University of Waterloo, Waterloo, Ont., Canada N2L 3G1.

0304-3800/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved

Van Horn (1971) defined validation as any process which is designed to assess the correspondence between the model and the system. Naylor and Finger (1967) refine the notion of validation to encompass the statistical techniques available for testing the goodness-of-fit of empirical data. The goodness-of-fit approach is based largely on the methods developed in applied economic modelling for the validation of econometric models. The most commonly used of the suggested techniques is the test of the hypothesis that the regression line of observed versus predicted values passes through the origin with a slope of unity (Cohen and Cyert, 1961; Carter, 1986). While useful, the comparison of model output with actual data has been criticized for requiring the completion of the model formulation, estimation and actual data collection processes before any model validation can be undertaken (Miller et al., 1976).

In the ecological modelling literature other, more sophisticated, goodness-of-fit tests have been proposed. Feldman et al. (1984) propose a test of the hypothesis that model outputs are unbiased. Reynolds and Deaton (1982) propose a series of tests that determine whether the distribution of model predictions is the same as that for the observed values for each set of explanatory variables used in the model. Dent and Blackie (1979) apply the hypothesis-testing procedure to model outputs as a means of determining whether the distribution of model results can be statistically regarded as being drawn from the same population as the observed system values. Such tests follow the route first suggested by Naylor and Finger (1967) and might easily be expanded to include the techniques of analysis of variance, the chi-squared test, factor analysis, the Kolmogorov-Smirnov test, the techniques of spectral analysis and a variety of non-parametric tests.

Good performance of the model at the calibration stage, however, does not guarantee correctly predicted behaviour. Furthermore, as Wallach and Goffinet (1989) have argued, in many cases the goal of modelling is prediction. As such it seems reasonable to quantify the predictive accuracy of a model and use that as a criterion for assessing model quality. This, strictly speaking, is not consistent with the notions of validation defined by Naylor and Finger (1967). Gass (1983), however, has expanded the definition and distinguishes between three types of validation. The distinctions are useful insofar as they clearly differentiate between the notions of validation as initially suggested by Naylor and Finger (1967) and the


notions of prediction as suggested by Wallach and Goffinet (1989). Validity is defined generically as the measurement of how well model-generated and real system data compare. A model is considered to be replicatively valid if it matches the data already acquired from the real system and used in the formulation and estimation phases of model design and construction. Replicative validity corresponds to the econometric notions of goodness-of-fit and the statistical notions of distributional similarity. A model is predictively valid if it can match data before the data are acquired from the real system and structurally valid if it reproduces real system behaviour in a way that can be construed as being reflective of the operating characteristics of the real system (Ziegler, 1976).

Shannon (1975), Banks and Carson (1984), Law and Kelton (1991) and others describe standard statistical techniques useful for the replicative validation of models. These procedures include tests of means and variances, analysis of variance, goodness-of-fit testing, regression and correlation analysis and confidence interval construction. Predictive validation, however, is typically ignored in the standard modelling references and one must turn to the statistical forecasting literature to find discussions of the available methods. Even here, however, the emphasis tends to be placed more on model building and estimation than on validation. Accordingly, the purpose of the present paper is to introduce the available techniques to the modelling literature and comment on the ways in which they can be used to validate and compare alternative models.

DATA SPLITTING

There is no assurance that the model providing the best fit to the available sample data will most successfully predict the future behaviour of a system. The principle has been recognized for some time in the statistical literature (Montgomery and Peck, 1982) and justifies the development of measures designed specifically to evaluate model predictions. The collection of new data and the testing of the predictive powers of the model against such data is clearly the most robust procedure for evaluating predictive capabilities. In this context, Montgomery and Peck recommended 15-20 new observations as being sufficient for evaluative purposes. If collecting new data is not possible, as is the case with many ecological and environmental modelling exercises, then data splitting techniques may be employed. Data splitting involves the division of the available data into two data sets: an estimation and a prediction data set (Snee, 1977). The estimation data set is used to estimate model parameters and to assess the replicative validity of the resulting model. The prediction data set is used exclusively for predictive validation. Data splitting, also called cross-validation (Stone, 1974), requires a means of dividing the data into the two distinct sets. Time is the most often used approach but is applicable only to time-series data. Approaches for cross-sectional data relying on randomized selection of the data elements for each set can be found in Snee (1977).

Each of the techniques discussed below presumes the existence of a predictive data set, the elements of which have not been used to estimate, or replicatively validate, model parameters. Conceptually the technique of predictive validation is equivalent to determining how well a model performs over a sample data set when confronted with information not used in the estimation process. A model with high predictive validity will perform well and is presumed to have captured the nature of system dynamics. A model with low predictive validity will perform poorly and has either failed to capture the nature of system dynamics, or has been invalidated by structural shifts in the system.
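As a concrete illustration, a minimal sketch of time-based data splitting in Python; the observations and the size of the hold-out set are hypothetical:

```python
# Illustrative sketch: splitting a time series into an estimation set and a
# prediction set, holding back the most recent observations for validation.

def split_time_series(data, n_predict):
    """Hold out the last n_predict observations for predictive validation."""
    if not 0 < n_predict < len(data):
        raise ValueError("n_predict must leave a non-empty estimation set")
    return data[:-n_predict], data[-n_predict:]

# Hypothetical annual observations of a system variable
observations = [12.1, 11.8, 13.0, 12.6, 12.9, 13.4, 12.2, 13.1, 13.6, 12.8]
estimation, prediction = split_time_series(observations, 3)
```

For cross-sectional data, the randomized selection schemes of Snee (1977) would replace the simple chronological cut used here.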

A good model will produce small, uncorrelated predictive errors. The predictive error at time t, e_t, is defined as the model prediction at time t, M_t, minus the system value at time t, S_t. That is:

e_t = M_t - S_t.    (1)

The best model will produce the smallest errors. This suggests that in assessing predictive validity one should first evaluate the predictive proper- ties of a model and then compare acceptable models before selecting a single model as being the best of a series of alternatives.

EVALUATING MODEL PREDICTIONS

The mean error of a series of predictions is termed the predictive bias. The existence of bias is evidence that the constructed model makes systematic errors in the prediction of system values. Accordingly, models with good predictive properties will have a bias measure close to zero where bias, given n values in the data estimation set and m values in the predictive validation set, is computed as:

Bias = (1/m) Σ_{t=1}^{m} e_{n+t}.    (2)

The standardized variable:

W = √m Bias/σ,    (3)

will have an approximate standard normal distribution (Hogg and Tanis, 1988, p. 323) and an overall test for predictive bias can be obtained by referring W to a standard normal table where the probability of observing a


value of W = √m Bias/σ can easily be determined. For a two-tailed test, the probability of |W| < 1.96 is 0.95 and values of |W| > 1.96 are indicative of statistically significant predictive bias in the model.

In practice, σ is not known and must be estimated as the sample standard deviation, s, from the data values, S_1, …, S_n, used to estimate the model. Abraham and Ledolter (1983) have suggested that a better approximation of the predictive bias is obtained by referring W = √m Bias/s to a t-table with n - g degrees of freedom where g is the number of parameters fitted to the data set. If n is large, the refinement will make little difference and either test may be used.
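A sketch of the bias test of Eqs. (2) and (3) in Python; the predictive errors and the estimation-set standard deviation s below are hypothetical values chosen for illustration:

```python
import math

def bias_test(errors, s):
    """Bias (Eq. 2) and the standardized variable W (Eq. 3), with sigma
    estimated by the sample standard deviation s of the estimation data."""
    m = len(errors)
    bias = sum(errors) / m
    W = math.sqrt(m) * bias / s
    return bias, W

# Hypothetical predictive errors e_{n+1}, ..., e_{n+m} and estimation-set s
errors = [0.4, -0.2, 0.3, 0.1, -0.1]
bias, W = bias_test(errors, s=0.5)
significant = abs(W) > 1.96   # two-tailed test at a = 0.05
```

With a large estimation set the normal critical value 1.96 suffices; otherwise the result would be referred to a t-table with n - g degrees of freedom, as noted above.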

A second test for model adequacy is obtained by referring the predictive sum of squared errors to the chi-squared distribution. This is done by constructing the statistic Q_e where:

Q_e = Σ_{t=1}^{m} e²_{n+t} / σ²    (4)

and comparing it to the percentage points of the χ² distribution with m degrees of freedom. Because σ² is estimated from the data used for model estimation, a better approximation is to construct the statistic:

Q_e = Σ_{t=1}^{m} e²_{n+t} / (m s²)    (5)

and refer the result to an F-table with m and n - g degrees of freedom (Abraham and Ledolter, 1983). A value of the calculated statistic with chosen level of significance a exceeding the tabular value with the appropriate degrees of freedom provides evidence of predictive inadequacies. Torantola (1987) has argued that sum-of-squares-based statistics are too sensitive to outlier predictive errors and are consequently less preferable than mean-error-based measures. The argument, however, is no less true of mean-error-based methods than it is of sum-of-squares-based statistics and the existence of extreme errors (outliers) might well be taken as evidence of predictive inadequacies in their own right.
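The statistic of Eq. (5) can be sketched as follows, with the same hypothetical errors as before; the comparison against F-table percentage points is left to standard tables:

```python
def predictive_sum_of_squares(errors, s):
    """Q_e of Eq. (5): sum of squared predictive errors over m * s^2,
    referred to an F-table with m and n - g degrees of freedom."""
    m = len(errors)
    return sum(e * e for e in errors) / (m * s * s)

# Hypothetical predictive errors and estimation-set standard deviation
errors = [0.4, -0.2, 0.3, 0.1, -0.1]
Qe = predictive_sum_of_squares(errors, s=0.5)
```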

The predictive errors produced by a model should also be uncorrelated. Accordingly, one expects their sample autocorrelations r_k (k = 1, 2, …) to be approximately zero. The sample autocorrelations are given by:

r_k = Σ_{t=k+1}^{m} (M_t - M̄)(M_{t-k} - M̄) / Σ_{t=1}^{m} (M_t - M̄)²    (6)


where M̄ is the mean of the predicted data values and k is the chosen number of lags for which r_k is calculated. The significance of the calculated sample autocorrelations is determined by comparing each of the r_k values with the standard error term 1/√m and rejecting the null hypothesis that the autocorrelations are zero at significance level a = 0.05 if √m |r_k| > 1.96 (Abraham and Ledolter, 1983). If the null hypothesis is rejected, the predictions at t and t + k are autocorrelated and evidence of predictive inadequacy exists. It should be remembered, however, that as the test procedure is applied to each of the r_k values separately, the probability of the results being simultaneously true is not given by the level of significance, a, but by (1 - a)^m. For a sample of any significant size, the probability of observing at least one r_k outside the bounds defined by ±1.96/√m is quite large. For example, with m = 10 and a = 0.05 the probability of observing a single r_k that fails the test is 0.40.
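A sketch of the autocorrelation check. Eq. (6) is written in terms of a generic series; here it is applied to a hypothetical series of predictive errors, whose serial independence is what is being tested:

```python
def sample_autocorrelation(x, k):
    """Lag-k sample autocorrelation of the series x, per Eq. (6)."""
    m = len(x)
    xbar = sum(x) / m
    num = sum((x[t] - xbar) * (x[t - k] - xbar) for t in range(k, m))
    den = sum((v - xbar) ** 2 for v in x)
    return num / den

# Hypothetical predictive errors
errors = [0.4, -0.2, 0.3, 0.1, -0.1, 0.2, -0.3, 0.1]
m = len(errors)
r1 = sample_autocorrelation(errors, 1)
fails = (m ** 0.5) * abs(r1) > 1.96   # reject zero autocorrelation at a = 0.05
```

In practice the test would be repeated for several lags, bearing in mind the simultaneous-inference caveat discussed above.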

Predictive errors are also required to be normally distributed (Draper and Smith, 1981; Montgomery and Peck, 1982). This can be checked by testing the errors e_{n+1}, …, e_{n+m} for normality using a Shapiro-Wilks or Anderson-Darling test. While most statistics references recommend the use of a Kolmogorov-Smirnov test, D'Agostino (1986, p. 406) has described the test as an "historical curiosity" and recommended that it "should never be used". The Shapiro-Wilks test statistic is recommended as being the most powerful of the available tests, but because of the problems associated with ties and the requirement for extensive tables in its calculation, the Anderson-Darling test statistic is of more practical use.

As the Anderson-Darling test is not commonly described, a brief description of the statistic and its use follows. Details can be found in Stephens (1986). The version of the test described below assumes normality is being tested for when neither the population mean nor variance is known. Instead they are estimated from the sample data, in this instance the errors in the predictive data set, in the usual manner. The test is then completed as follows:
1. Calculate the standard normal variable W_{n+i} = (e_{n+i} - ē)/s (i = 1, …, m).
2. Calculate the value Z_i = Φ(W_{n+i}) (i = 1, …, m) where Φ(Z) denotes the cumulative probability of the standard normal distribution to the value of Z. This can be obtained easily from standard statistical tables or computer routines such as Hill (1973).
3. Arrange the Z_i values in ascending order and define the order statistics Z_(1) to Z_(m), where Z_(1) is the smallest of the Z_i values.
4. Define Z̄ = Σ_{i=1}^{m} Z_i/m.


5. Calculate the statistic:

A² = -m - (1/m) Σ_{i=1}^{m} (2i - 1)[ln Z_(i) + ln(1 - Z_(m+1-i))].

6. Modify the statistic as follows:

A² = A²(1.0 + 0.75/n + 2.25/n²).

The modification obviates the need for large tables. One may reject the hypothesis that prediction errors are normally distributed if A² ≥ C_{1-a}, where C_{1-a} denotes the critical value for the test statistic with significance level a. The values for C with a equal to 0.10, 0.05 and 0.01 are respectively: 0.631, 0.752 and 1.035 (Stephens, 1986, table 4.7).

Finally, plots of the values M_t and S_t can also prove useful in evaluating

the predictive power of a model. If predictions are identical to actual system observations, the plotted points will form a 45° line through the origin. Departures from the line indicate model inadequacy. The procedure has been formalized by fitting the linear regression:

S_{n+t} = β_0 + β_1 M_{n+t} + ε_{n+t}    (7)

for (t = 1, …, m) and testing the significance of the β_0 and β_1 parameters with standard t-tests. Rejection of the hypothesis that β_0 = 0 or β_1 = 1 is a sign that the model does not predict well.
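Both the Anderson-Darling procedure (steps 1-6) and the regression check of Eq. (7) are readily sketched in code. In this sketch the step-6 correction is applied with the prediction-set size m standing in for n, and all data are hypothetical:

```python
import math
import statistics

def phi(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def anderson_darling_normal(errors):
    """A-D statistic for normality (steps 1-6), mean and variance estimated."""
    m = len(errors)
    ebar = statistics.mean(errors)
    s = statistics.stdev(errors)
    z = sorted(phi((e - ebar) / s) for e in errors)   # steps 1-3
    a2 = -m - sum((2 * (i + 1) - 1) * (math.log(z[i]) + math.log(1.0 - z[m - 1 - i]))
                  for i in range(m)) / m              # step 5
    return a2 * (1.0 + 0.75 / m + 2.25 / m ** 2)      # step 6 modification

def regression_validation(S, M):
    """Fit S = b0 + b1*M (Eq. 7); return t-statistics for b0 = 0 and b1 = 1."""
    m = len(S)
    Mbar, Sbar = sum(M) / m, sum(S) / m
    Sxx = sum((x - Mbar) ** 2 for x in M)
    Sxy = sum((M[i] - Mbar) * (S[i] - Sbar) for i in range(m))
    b1 = Sxy / Sxx
    b0 = Sbar - b1 * Mbar
    sigma2 = sum((S[i] - b0 - b1 * M[i]) ** 2 for i in range(m)) / (m - 2)
    t0 = b0 / math.sqrt(sigma2 * (1.0 / m + Mbar ** 2 / Sxx))
    t1 = (b1 - 1.0) / math.sqrt(sigma2 / Sxx)
    return t0, t1

# Hypothetical predictive errors and prediction/observation pairs
errors = [0.4, -0.2, 0.3, 0.1, -0.1, 0.2, -0.3, 0.1]
A2 = anderson_darling_normal(errors)
reject_normality = A2 >= 0.752   # critical value at a = 0.05

t0, t1 = regression_validation(S=[1.1, 1.9, 3.2, 3.9, 5.1],
                               M=[1.0, 2.0, 3.0, 4.0, 5.0])
```

The t-statistics would be referred to a t-table with m - 2 degrees of freedom in the usual way.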

COMPARING MODEL PREDICTIONS

A model with predictive validity must, at a minimum, meet the requirements discussed above. Meeting them, however, does not guarantee that any specific model is the best of all possible models or the best of a series of competing models. Confirming that a model is the best of all possible models is not possible with the tools currently available. At best one can use the evaluative tools discussed above to compare and rank competing models in terms of their predictive bias and accuracy. The five statistics defined below are commonly used.

Mean error: (1/m) Σ_{t=1}^{m} e_{n+t},    (8)

Mean percent error: (100/m) Σ_{t=1}^{m} e_{n+t}/S_{n+t},    (9)

Mean square error: (1/m) Σ_{t=1}^{m} e²_{n+t},    (10)

Mean absolute error: (1/m) Σ_{t=1}^{m} |e_{n+t}|,    (11)

Mean absolute percent error: (100/m) Σ_{t=1}^{m} |e_{n+t}|/|S_{n+t}|.    (12)

The first two measure predictive bias and should be close to zero. The other three measure predictive accuracy and should be as small as possible. Models with smaller bias and accuracy measures are preferred to those with larger measures.
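The five measures of Eqs. (8)-(12) can be computed together; the prediction and observation series below are hypothetical:

```python
def comparison_statistics(M, S):
    """Bias and accuracy measures of Eqs. (8)-(12) over the prediction set."""
    m = len(M)
    e = [M[i] - S[i] for i in range(m)]   # predictive errors, Eq. (1)
    return {
        "ME":   sum(e) / m,                                          # Eq. (8)
        "MPE":  100.0 * sum(e[i] / S[i] for i in range(m)) / m,      # Eq. (9)
        "MSE":  sum(x * x for x in e) / m,                           # Eq. (10)
        "MAE":  sum(abs(x) for x in e) / m,                          # Eq. (11)
        "MAPE": 100.0 * sum(abs(e[i]) / abs(S[i]) for i in range(m)) / m,  # Eq. (12)
    }

# Hypothetical predictions M and system observations S
stats = comparison_statistics(M=[10.5, 9.8, 11.2], S=[10.0, 10.0, 11.0])
```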

A particularly useful statistic for comparing alternative models is Theil's inequality coefficient (Theil, 1966). Originally suggested as a systematic means of evaluating the forecasts obtained from econometric models, the statistic may equally well be applied to ecological and environmental models designed specifically to predict change in an observed system variable and is calculated as follows:

U² = Σ_{t=1}^{m} (M_{n+t} - S_{n+t})² / Σ_{t=1}^{m} S²_{n+t}    (13)

The value of U is bounded as follows: 0 ≤ U < ∞. If M_{n+t} = S_{n+t} then U = 0 and the model produces perfect predictions. If M_{n+t} = 0, then U = 1 and the model produces predictions of system behaviour that are no better, and more costly, than a naive zero change prediction. Finally, if U > 1, then the predictive power of the model is worse than the naive no-change prediction and better predictions of system behaviour could be made by assuming that S_{n+1} = S_n.

The numerator in the inequality coefficient is the root mean square error for the predicted data set and is critical to the definition of U. The denominator simply allows U to be expressed independently of the units of measure used in the model and facilitates comparisons between models. Although closely related to the mean square error measure of Eq. (10), and thus a means of assessing predictive accuracy, the numerator may be decomposed into three terms each of which indicates a different source of predictive error (Theil, 1966). The three components, referred to as partial inequality coefficients, are defined as follows:


1. The bias proportion (U_B): ascribes differences in the predicted and actual system observations to differences in their means.

U_B = (M̄ - S̄)² / [(1/m) Σ_{t=1}^{m} (M_{n+t} - S_{n+t})²]    (14)

where M̄ and S̄ are the means of the predicted and the actual data series over the range of the observations n + 1 to n + m.

2. The variance proportion (U_V): ascribes differences in the predicted and actual system values to differences in their variances.

U_V = (s_M - s_S)² / [(1/m) Σ_{t=1}^{m} (M_{n+t} - S_{n+t})²]    (15)

where s_M and s_S are the standard deviations of the predicted and the actual data series over the range of observations n + 1 to n + m:

s²_M = (1/m) Σ_{t=1}^{m} (M_{n+t} - M̄)²,  s²_S = (1/m) Σ_{t=1}^{m} (S_{n+t} - S̄)².    (16)

3. The covariance proportion (U_C): ascribes differences in the predicted and actual system values to imperfect covariation in the two series.

U_C = 2(1 - r_MS) s_M s_S / [(1/m) Σ_{t=1}^{m} (M_{n+t} - S_{n+t})²]    (17)

where r_MS is the correlation coefficient of the predicted and the actual data series computed over the range of observations n + 1 to n + m:

r_MS = (1/m) Σ_{t=1}^{m} (M_{n+t} - M̄)(S_{n+t} - S̄) / (s_M s_S).    (18)

Insofar as the measures are proportions, U_B + U_V + U_C = 1. The bias and variance proportions of the inequality coefficient can typically be reduced by increasing the size of the model estimation data set (t = 1, …, n). This


can, however, limit the data available for predictive validation if data are either difficult or expensive to collect. There is little that can be done to reduce, or eliminate, the covariance proportion of the inequality coefficient. As predictions are not typically perfectly correlated with actual outcomes (Pindyck and Rubinfeld, 1981), the covariance proportion of the inequality coefficient will generally not fall to zero.
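A sketch of the inequality coefficient and its decomposition. The denominators of Eqs. (14)-(17) are taken here as the prediction-set mean squared error, (1/m) Σ (M_{n+t} - S_{n+t})², which guarantees the three proportions sum to one; the data are hypothetical:

```python
import math

def theil_decomposition(M, S):
    """Theil's U (Eq. 13) and its bias, variance and covariance proportions
    (Eqs. 14-17), computed over the prediction set."""
    m = len(M)
    mse = sum((M[i] - S[i]) ** 2 for i in range(m)) / m
    U = math.sqrt(sum((M[i] - S[i]) ** 2 for i in range(m)) /
                  sum(s * s for s in S))
    Mbar, Sbar = sum(M) / m, sum(S) / m
    sM = math.sqrt(sum((x - Mbar) ** 2 for x in M) / m)
    sS = math.sqrt(sum((x - Sbar) ** 2 for x in S) / m)
    r = sum((M[i] - Mbar) * (S[i] - Sbar) for i in range(m)) / (m * sM * sS)
    UB = (Mbar - Sbar) ** 2 / mse           # bias proportion
    UV = (sM - sS) ** 2 / mse               # variance proportion
    UC = 2.0 * (1.0 - r) * sM * sS / mse    # covariance proportion
    return U, UB, UV, UC

# Hypothetical predictions and observations over the prediction set
U, UB, UV, UC = theil_decomposition(M=[10.5, 9.8, 11.2, 10.1],
                                    S=[10.0, 10.0, 11.0, 10.4])
```

Because the same mean squared error appears in all three denominators, UB + UV + UC = 1 holds identically, which is a convenient check on any implementation.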

The Janus quotient, first proposed by Gadd and Wold (1964), is a useful measure of model predictive accuracy as well as an indicator of possible changes in model structure. The latter attribute is particularly important for ecological and environmental models, many of which are specifically designed to predict the behaviour of remediated or altered systems. The quotient is calculated as follows:

J² = [Σ_{t=1}^{m} (M_{n+t} - S_{n+t})²/m] / [Σ_{t=1}^{n} (M_t - S_t)²/n]    (19)

The numerator is the predictive mean squared error of the model and is useful in its own right for predictive validation. The denominator is the replicative mean squared error of the model. The Janus quotient thus forms the ratio of the predictive and replicative validity measures to provide an index of the change in model accuracy between the estimation and prediction data sets. J will vary between 0 and ∞. If the system structure and the predictive ability of the model remain more or less constant outside the sample estimation period, J will be approximately 1. The higher J becomes, the poorer the predictive performance of the model relative to replicative performance. High J values are suggestive of changes in model structure or an over-emphasis on the tuning of the constructed model to the estimation data set. Spreit (1985) has argued that the tasks of comparison and selection should be guided by a trade-off between fit, model parsimony and accuracy. Selecting a model on the basis of fit alone runs the risk that the resulting model will be unnecessarily complicated and contaminated with the information obtained from the noise contained in the estimation data set. A contaminated model will not be truly reflective of the system being modelled and cannot be expected to produce the unbiased predictions required of good predictive models. While the Janus coefficient will detect incidences of over-fitting or structural shifts, a low J coefficient is not indicative of good predictive performance per se. That is because in forming the ratio of predictive to replicative performance the Janus coefficient obstructs direct inferences being made about the predictive performance of the model.
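A sketch of Eq. (19), taking the replicative (in-sample) and predictive (out-of-sample) errors as given; the error series are hypothetical:

```python
def janus_quotient_squared(pred_errors, fit_errors):
    """J^2 of Eq. (19): predictive MSE over replicative MSE."""
    pred_mse = sum(e * e for e in pred_errors) / len(pred_errors)
    fit_mse = sum(e * e for e in fit_errors) / len(fit_errors)
    return pred_mse / fit_mse

# Hypothetical prediction-set and estimation-set errors
J2 = janus_quotient_squared(pred_errors=[0.4, -0.3, 0.5],
                            fit_errors=[0.2, -0.1, 0.1, -0.2, 0.1])
```

A J² well above 1, as in this illustrative case, would suggest over-fitting to the estimation data or a structural shift in the system.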


DISCUSSION

Validation, be it replicative, structural or predictive, is an attempt to increase the degree of confidence that the events inferred by a model will in fact occur under the assumed conditions. Accordingly, validation increases model utility by increasing user confidence. It is, however, a dangerous myth to believe that the validation exercise can unquestionably establish the truth of a model. At best models can be shown to be credible. The measures defined above produce a series of quantitative judgements on the confidence a user should place in a model. Furthermore, they establish a means of comparing and ranking alternative models. They do not, however, precisely define a procedure for establishing predictive validity. In that regard, much can be learned from the computer science discipline and the notion of structured programming. The success of the structured programming heuristic in ensuring the integrity and validity of computer code suggests similar approaches be adopted to predictive validation. A structured approach to predictive validation would follow the steps of evaluation and comparison. Assuming that the considered models had been replicatively and structurally validated, the evaluation phase of predictive validation would involve the following:
1. an assessment of model predictive bias and its statistical significance,
2. an assessment of model accuracy and its statistical significance,
3. an assessment of the serial independence of model predictions (autocorrelation),
4. an assessment of the normality of model predictions.
Models having no significant predictive bias, statistically adequate accuracy, and serially independent and normally distributed predictions are considered as statistically acceptable prediction models.

The possibility of there being several such models, however, requires the completion of the comparative phase of predictive validation. Models lacking the properties of a good statistical prediction procedure are removed from the set of those being compared and ranked as candidates for the best predictive model. The comparative phase, then, consists of the application of statistical and analytical procedures designed to select from among the group of statistically acceptable models the model that best predicts system behaviour. Bias and accuracy statistics enable model users to choose the model that most consistently and precisely predicts system behaviour. Models with smaller bias and accuracy measures are preferred to those with larger measures. While useful, the bias and accuracy measures do not maximize the information on predictive performance available to the model user. The inequality coefficient approach to comparing alternative models, however, has the advantage of producing both a relative


index of predictive performance and an indication of the source of predictive errors. Accordingly, model users gain insights into the methods that might prove fruitful in terms of improving predictive performance. The Janus coefficient, though not as analytically useful, does aid in the detection of structural shifts in the system being modelled and, for ecological and environmental modelling exercises, may be a particularly useful measure of the relative in-sample and out-of-sample predictive performance.

Application of the structured evaluative and comparative approach to predictive validation leads directly to the selection of the best predictive model from among those considered. Furthermore, the inclusion of a comparative phase places emphasis on the need to compare and rank any new model against its predecessors. The difficulty associated with accurately predicting the behaviour of many systems suggests that absolute accuracy is initially less important than relative accuracy. A model, though in an absolute sense inaccurate, that predicts system behaviour relatively more accurately than previous models is nonetheless an improvement. Finally, as many models have a large predictive component to their design, they will only play a role in decision making if users can be convinced of their validity. A systematic approach to validation has the potential to accomplish that task.
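The two-phase procedure can be sketched as a small driver routine. The adequacy predicate and accuracy score below are purely illustrative placeholders standing in for the statistical tests and measures discussed above:

```python
# Sketch of the structured approach: phase 1 drops statistically inadequate
# models; phase 2 ranks the survivors by a predictive accuracy measure.

def predictively_validate(models, is_statistically_adequate, accuracy_score):
    """Return the statistically acceptable models, best (smallest score) first."""
    adequate = [m for m in models if is_statistically_adequate(m)]
    return sorted(adequate, key=accuracy_score)

# Toy usage: models represented as (name, predictive MSE) pairs, with a
# hypothetical adequacy threshold standing in for the evaluative tests
models = [("A", 0.9), ("B", 0.2), ("C", 0.4)]
ranked = predictively_validate(models,
                               is_statistically_adequate=lambda m: m[1] < 0.5,
                               accuracy_score=lambda m: m[1])
```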

THE STOCK-RECRUITMENT EXAMPLE

Stock-recruitment (SR) models are models of density-dependent relationships in the theory of population regulation. The principle of density dependence is that as population increases, or decreases, it becomes correspondingly more, or less, difficult to survive. Changes in population are associated with changes in mortality that serve to keep the population fluctuating about an equilibrium point. SR models, of which there are several (Elliott, 1985), describe the number of recruits (R) to the population as a function of the parental stock (S). The number of recruits can be estimated at different life stages in the life cycle (e.g. egg to larval, larval to juvenile) or over the entire life cycle. Losses between stages t, the stock stage, and t + 1, the recruit stage, are composed of density-independent and density-dependent components. The relationship between parental stock and recruits is typically defined as follows:

R = aS·f(S)   (20)

where a determines the degree of density-independent loss and f(S) defines the density-dependent component (Elliott, 1985). Various forms of the SR relationship have been proposed, including: Ricker (1954), Beverton and Holt (1957), Cushing (1973) and Hassell (1975). If S can be observed, then it makes sense to predict R as a function of S as long as the two are significantly related. This is, in fact, the basis of estimated SR relationships and the rationale for their use in the management of many species. The relationship offers fisheries managers a convenient solution to the recruit prediction problem and the potential for determining, given stock size, the balance between harvest and replacement critical to maintaining long-term sustainable yields.

Elliott (1984, table 3) provides data on both the estimated density of eggs and older (aged over 1 year) fish per square metre for each year-class in a population of brown trout (Salmo trutta) in an English Lake District stream for the years 1967 to 1983. The data are ideally suited to SR modelling and provide a convenient illustration of the techniques involved in the predictive validation of ecological and environmental models.

The data were split into estimation and prediction data sets. The observations for the years 1967 to 1978 were used to estimate the considered models, and the observations for the years 1979 to 1983 were used for the predictive validation of the proposed models. Elliott (1984) considers two SR models, the dome-shaped Ricker (1954) and the asymptotic-shaped Beverton-Holt (1957) models, defined below.

The Ricker model:

R = aS·e^(-βS)   (21)

The Beverton-Holt model:

R = aS/(1.0 + βS)   (22)
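For concreteness, the two candidate models can be expressed as small functions. This is an illustrative sketch rather than code from the original study; parameter names follow Eqs. (21) and (22), and the example values are the Table 1 estimates.

```python
import math

def ricker(s, a, b):
    """Ricker model, Eq. (21): R = a*S*exp(-beta*S)."""
    return a * s * math.exp(-b * s)

def beverton_holt(s, a, b):
    """Beverton-Holt model, Eq. (22): R = a*S / (1 + beta*S)."""
    return a * s / (1.0 + b * s)

# With the Table 1 estimates and the 1979 stock of 17.23 eggs per square metre:
print(round(ricker(17.23, 0.027, 0.012), 2))         # 0.38
print(round(beverton_holt(17.23, 0.103, 0.128), 2))  # 0.55
```

The dome shape of the Ricker curve and the asymptotic shape of the Beverton-Holt curve follow directly from these forms: the exponential term eventually pulls the Ricker prediction back down as S grows, while the Beverton-Holt prediction levels off at a/β.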

Non-linear estimation of the models was completed using the SYSTAT non-linear estimation routine (Wilkinson, 1989). The resulting parameter estimates for the estimation data set are given in Table 1. Once estimated, the models were used to predict the recruitment values. The actual recruitment data (fish aged over 1 year per m²), the stock values (eggs per

TABLE 1

Stock-recruitment model estimation results

Stock-recruitment model    Parameter (a)    Parameter (β)
Ricker                     0.027            0.012
Beverton-Holt              0.103            0.128

The parameters a and β for each of the considered stock-recruitment models were estimated with the non-linear estimation routine described in Wilkinson (1989).
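The SYSTAT routine used in the paper is not reproduced here. As a hypothetical stand-in, note that the Ricker model can be log-linearized as ln(R/S) = ln(a) - βS and fitted by ordinary least squares; the data below are synthetic (generated from assumed parameter values), not Elliott's.

```python
import math

def fit_ricker_loglinear(stock, recruits):
    """Estimate (a, beta) in R = a*S*exp(-beta*S) via OLS on ln(R/S) = ln(a) - beta*S."""
    y = [math.log(r / s) for s, r in zip(stock, recruits)]
    n = len(stock)
    mx = sum(stock) / n
    my = sum(y) / n
    slope = (sum((s - mx) * (yi - my) for s, yi in zip(stock, y))
             / sum((s - mx) ** 2 for s in stock))
    return math.exp(my - slope * mx), -slope  # (a, beta)

# Synthetic, noise-free data generated from assumed values a = 0.027, beta = 0.012:
s = [10.0, 40.0, 80.0, 120.0, 160.0]
r = [0.027 * si * math.exp(-0.012 * si) for si in s]
a_hat, beta_hat = fit_ricker_loglinear(s, r)  # recovers ~0.027 and ~0.012
```

With noisy field data the log-linear fit and a direct non-linear fit generally give somewhat different estimates, because the transformation changes the error structure; the sketch is only meant to show the mechanics of estimation.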

TABLE 2

Stock-recruitment data and model predictions

Year    Stock data    Recruitment data    Ricker predictions    Beverton-Holt predictions
1979    17.23         0.78                0.38                  0.55
1980    12.03         0.53                0.28                  0.49
1981    127.43        0.50                0.75                  0.76
1982    132.63        0.80                0.73                  0.76
1983    41.30         0.72                0.68                  0.68

The data in columns 2 and 3 were drawn from Elliott's (1984) table 3. The stock data represent eggs per m² and the recruitment data fish aged over 1 year per m² in May/June of the specified year. The stock data were combined with Eqs. (21) and (22) to produce the model predictions given in columns 4 and 5.

m²) used to make the predictions, and the model recruitment predictions for the years 1979 to 1983, are given in Table 2. The data in columns 2 and 3 give the stock and recruitment data from Elliott (1984). Columns 4 and 5 use the stock data from column 2 to predict the recruitment data using Eqs. (21) and (22) respectively.
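As a check on the arithmetic, the prediction columns of Table 2 can be regenerated from Eqs. (21) and (22) and the Table 1 parameter estimates. This is an illustrative sketch; the rounded values agree with the table.

```python
import math

params = {"ricker": (0.027, 0.012), "beverton_holt": (0.103, 0.128)}
stock = {1979: 17.23, 1980: 12.03, 1981: 127.43, 1982: 132.63, 1983: 41.30}

predictions = {}
for year, s in stock.items():
    a_r, b_r = params["ricker"]
    a_b, b_b = params["beverton_holt"]
    predictions[year] = (
        round(a_r * s * math.exp(-b_r * s), 2),  # Eq. (21), Ricker
        round(a_b * s / (1.0 + b_b * s), 2),     # Eq. (22), Beverton-Holt
    )

# predictions[1979] == (0.38, 0.55), and so on down Table 2.
```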

The predictions and actual data were then used to complete a statistical evaluation of the proposed models. The evaluation applied Eqs. (2) through (6) to the data to produce the results given in Table 3. Reference to Table 3 indicates no statistical bias in the predictions of either model. In both cases the computed W-statistic was less than the corresponding t-table

TABLE 3

Statistical evaluation of the predictive models

Statistic                           Ricker model   Beverton-Holt model   Critical value (α = 0.05)   Significant
Bias                                -0.103         -0.019                NA                          NA
W-statistic                         -1.290         -0.234                ±1.812                      No
Q-statistic                         1.812          0.769                 3.330                       No
Anderson-Darling statistic          0.242          0.332                 0.752                       No
Autocorrelation (√m·|r_k| > 1.96)   No             No                    1.96                        No

Table 3 gives the computed statistic values for the Ricker and Beverton-Holt recruitment predictions for the listed statistical tests, the critical value for each test, where applicable, at the 0.05 level of significance, and comments on the statistical significance of each test result. The absence of statistical significance in any of the completed tests supports the null hypotheses concerning the statistical adequacy of the proposed models as prediction tools.

TABLE 4

Statistical comparison of the predictive models

Predictive statistic          Ricker model   Beverton-Holt model
Mean error                    -0.103         -0.019
Mean percent error            -12.766        0.729
Mean square error             0.058          0.025
Mean absolute error           0.202          0.122
Mean absolute percent error   32.418         24.264
Theil's U² statistic          1.640          1.932
Janus coefficient             1.805          0.883

Table 4 gives the computed values for each of the statistics proposed for use in the comparison of statistically adequate predictive models. The first group of statistics assesses the bias and accuracy of the models. Theil's U² statistic compares the predictive ability of the models with a naive no-change prediction, and the Janus coefficient compares the in- and out-of-sample predictive abilities of the models as a means of detecting over-fitting or structural shifts. Further details on the use and interpretation of each statistic are given in the text.

value for the 0.05 level of significance with n - g, here 10, degrees of freedom. Likewise, the Q-statistic gives no indication of predictive inadequacies in either model. At the 0.05 level of significance the computed statistic is less than the critical value given in the F-table for 5 and 10 degrees of freedom. The computed Anderson-Darling statistic for both models was also less than the critical value at the 0.05 level of significance. The results provide no evidence for the rejection of the null hypothesis that the prediction errors are normally distributed. Finally, examination of the prediction error sample autocorrelations indicates that none of the √m·|r_k| values exceeds 1.96 for either model. The results confirm that the prediction errors of both models are uncorrelated. Based on the results of Table 3, both the Ricker and the Beverton-Holt SR models are statistically acceptable as predictive tools.
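The autocorrelation entry of Table 3 can be sketched in code. This is a hypothetical re-implementation rather than the paper's own routine (Eqs. (2) to (6) are defined earlier in the article): it tests whether √m·|r_k| exceeds 1.96 at any low-order lag of the prediction errors.

```python
def no_autocorrelation(errors, max_lag=2, z=1.96):
    """Check sqrt(m)*|r_k| < z for lags 1..max_lag, i.e. errors look like white noise."""
    m = len(errors)
    mean = sum(errors) / m
    d = [e - mean for e in errors]
    denom = sum(di * di for di in d)
    for k in range(1, max_lag + 1):
        r_k = sum(d[t] * d[t + k] for t in range(m - k)) / denom
        if m ** 0.5 * abs(r_k) >= z:
            return False
    return True

# Ricker prediction errors (predicted minus observed, from Table 2):
print(no_autocorrelation([-0.40, -0.25, 0.25, -0.07, -0.04]))  # True
```

With only five prediction-period observations the test has little power, which is worth bearing in mind when interpreting the "no significant autocorrelation" verdict.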

Once statistical adequacy has been confirmed, the next step is to compare the predictive power of each model over the range of the prediction data set. The comparison was completed using the statistics defined in Eqs. (8) to (13) and the results are presented in Table 4. The first group of statistics assesses the predictive bias and accuracy of the two models and suggests that the Beverton-Holt model is superior to the Ricker model. The U² statistic assesses the ability of the models to produce predictions of change in recruitment from period to period that are better than a naive no-change prediction. The fact that neither model produces predictions superior to the naive no-change approach suggests that neither

model is useful as a dynamic predictive tool. The result is not atypical of the prediction problem: models which predict levels well often do not predict period-to-period changes nearly as well. In this instance the chosen models are both static and not specifically designed to capture the dynamics of recruitment change. Accordingly, the poor period-to-period predictive performance could have been expected a priori.
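The first block of Table 4 can be recomputed directly from the Table 2 values. The sketch below assumes errors are defined as predicted minus observed; small discrepancies from the published figures arise from rounding in the tabulated predictions.

```python
def comparison_stats(pred, obs):
    """Bias and accuracy measures, with error defined as predicted minus observed."""
    e = [p - o for p, o in zip(pred, obs)]
    n = len(e)
    return {
        "mean error": sum(e) / n,
        "mean percent error": 100.0 * sum(ei / oi for ei, oi in zip(e, obs)) / n,
        "mean square error": sum(ei * ei for ei in e) / n,
        "mean absolute error": sum(abs(ei) for ei in e) / n,
        "mean absolute percent error": 100.0 * sum(abs(ei) / oi for ei, oi in zip(e, obs)) / n,
    }

observed = [0.78, 0.53, 0.50, 0.80, 0.72]     # recruitment data, 1979-1983
ricker_pred = [0.38, 0.28, 0.75, 0.73, 0.68]  # Ricker predictions from Table 2
stats = comparison_stats(ricker_pred, observed)  # mean error ~ -0.10, MSE ~ 0.058
```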

Finally, the Janus coefficients for the two models indicate that the Beverton-Holt model is superior. The computed coefficient for the Beverton-Holt model suggests that the predictive ability of the model remains more or less constant within and outside the sample estimation period. The higher J value for the Ricker model, on the other hand, suggests that the model performs better at in-sample than out-of-sample prediction. Elliott (1984), in assessing the two models, selected the Ricker model as the best description of the available stock-recruitment data because of its superior in-sample prediction abilities. While the choice is justifiable on the grounds of an improved r² measure, it cannot be supported from a predictive point of view. Accordingly, the example provides an instructive instance of the inherent conflict that often arises between the desire for fit and the need for predictive accuracy (Spreit, 1985) and illustrates the need to employ a different set of evaluative tools when assessing the predictive powers of ecological and environmental models.
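Following Gadd and Wold (1964), the Janus coefficient (J²) is the ratio of out-of-sample to in-sample mean square prediction error: values near 1 indicate stable predictive performance, while values well above 1 suggest over-fitting or a structural shift. A minimal sketch follows; the error series in the example are hypothetical, since Elliott's estimation-period residuals are not reproduced here.

```python
def janus(out_errors, in_errors):
    """Janus coefficient J^2: out-of-sample MSE divided by in-sample MSE."""
    def mse(errors):
        return sum(e * e for e in errors) / len(errors)
    return mse(out_errors) / mse(in_errors)

# Identical in- and out-of-sample error behaviour gives J^2 = 1.0;
# larger out-of-sample errors push J^2 above 1, as for the Ricker model.
print(janus([0.1, -0.2, 0.15], [0.1, -0.2, 0.15]))  # 1.0
```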

CONCLUSIONS

Validation is an attempt to increase the degree of confidence that users have in any proposed model and should be completed for any model having a predictive purpose to its design and construction. There is a wide variety of approaches to assessing the predictive validity of models, and among the suggested approaches none is clearly superior. The variety of measures, and the lack of agreement on which, if any, is best, does not, however, release the modeller from the obligation of validation. Montgomery and Peck (1982, p. 426) argue that, since the model developer does not control the uses to which the model is put, "whenever possible all the validation techniques be used." Though referring to a smaller set of methods than discussed above, the advice is sound and should be heeded in all modelling exercises having probable predictive applications.

Furthermore, when applying the available validation techniques, it is suggested that a structured approach consisting of model evaluation and comparison be adopted. The approach is consistent with the methodologies used elsewhere in the statistical modelling literature. See, for example, the

identification, estimation and diagnostic-checking methodology used in time-series analysis. The evaluation phase ensures that the models considered in the comparative phase are statistically adequate. The comparative phase then selects the model with the best predictive performance. The procedural, criteria-based aspects of the structured approach are designed to improve user confidence and model utility by demonstrating the statistical integrity and comparative worth of the selected model. Both are important, as models with high user confidence are more likely to be used in decision making. Insofar as models of ecological and environmental systems offer much potential for understanding the consequences of human action that time and money might not otherwise allow, the importance of predictive validation cannot be overstated.

REFERENCES

Abraham, B. and Ledolter, J., 1983. Statistical Methods for Forecasting. John Wiley and Sons, New York, NY, 405 pp.

Banks, J. and Carson, J.S., II, 1984. Discrete-event System Simulation. Prentice-Hall, Englewood Cliffs, NJ, 514 pp.

Beverton, R.J.H. and Holt, S.J., 1957. On the dynamics of exploited fish populations. Fish. Invest. Lond. Ser. 2, 19: 1-533.

Carter, N., 1986. Simulation modelling. In: G.D. McLean, R.G. Garret and W.G. Ruesink (Editors), Plant Virus Epidemics. Academic Press, Sydney, N.S.W., pp. 193-215.

Cohen, K.J. and Cyert, R.M., 1961. Computer models in dynamic economics, Q. J. Econ., 75: 112-127.

Cushing, D.H., 1973. Dependence of recruitment on parent stock. J. Fish. Res. Board Can., 30: 1965-1976.

D'Agostino, R.B., 1986. Tests for the normal distribution. In: R.B. D'Agostino and M.A. Stephens (Editors), Goodness-of-fit Techniques. Marcel Dekker, New York, NY, pp. 367-419.

Dent, J.B. and Blackie, M.J., 1979. Systems Simulation in Agriculture. Applied Science, London, 180 pp.

Draper, N. and Smith, H., 1981. Applied Regression Analysis, second edition. John Wiley and Sons, New York, NY, 709 pp.

Elliott, J.M., 1984. Numerical changes and population regulation in young migratory trout Salmo trutta in a Lake District stream, 1966-83. J. Anim. Ecol., 53: 327-350.

Elliott, J.M., 1985. The choice of a stock-recruitment model for migratory trout Salmo trutta in an English Lake District stream. Arch. Hydrobiol., 104: 145-168.

Feldman, R.M., Curry, G.L. and Wherly, T.E., 1984. Statistical procedure for validating a simple population model. Environ. Entomol., 13: 1446-1451.

Gadd, A. and Wold, H., 1964. The Janus coefficient: a measure for the accuracy of prediction. In: H.O.A. Wold (Editor), Econometric Model Building: Essays on the Causal Chain Approach. North-Holland, Amsterdam, pp. 229-235.

Gass, S.I., 1983. Decision-aiding models: validation, assessment, and related issues in policy analysis. Oper. Res., 31: 603-631.

Hassell, M.P., 1975. Density dependence in single-species populations. J. Anim. Ecol., 44: 283-295.

Hill, I.D., 1973. Algorithm AS 66: the normal integral. Appl. Stat., 22: 424-427.

Hogg, R.V. and Tanis, E.A., 1988. Probability and Statistical Inference, third edition. Macmillan, New York, NY, 658 pp.

Law, A.M. and Kelton, W.D., 1991. Simulation Modeling and Analysis, second edition. McGraw-Hill, New York, NY, 759 pp.

Miller, D.R., Butler, G. and Bramall, L., 1976. Validation of ecological system models. J. Environ. Manage., 4: 383-401.

Montgomery, D.C. and Peck, E.A., 1982. Introduction to Linear Regression Analysis. John Wiley and Sons, New York, NY, 504 pp.

Naylor, T.H. and Finger, J.M., 1967. Verification of computer simulation models. Manage. Sci., 14: 92-101.

Naylor, T.H., Balintfy, J.L., Burdick, D.S. and Chu, K., 1966. Computer Simulation Techniques. John Wiley and Sons, New York, NY, 352 pp.

Pindyck, R.S. and Rubinfeld, D.L., 1981. Econometric Models and Economic Forecasts, second edition. McGraw-Hill, New York, NY, 630 pp.

Reynolds, M.R., Jr. and Deaton, M.L., 1982. Comparison of some tests for validation of stochastic simulation models. Commun. Stat. Simul. Comput., 11: 769-799.

Ricker, W.E., 1954. Stock and recruitment. J. Fish. Res. Board Can., 11: 559-623.

Shannon, R.E., 1975. Systems Simulation: the Art and Science. Prentice-Hall, Englewood Cliffs, NJ, 387 pp.

Snee, R.D., 1977. Validation of regression models: methods and examples. Technometrics, 19: 415-428.

Spreit, J.A., 1985. Structural characterization: an overview. In: M.A. Barker and P. Young (Editors), IFAC Proceedings Identification and System Parameter Estimation. Pergamon Press, Oxford, pp. 749-756.

Stephens, M.A., 1986. Tests based on EDF statistics. In: R.B. D'Agostino and M.A. Stephens (Editors), Goodness-of-fit Techniques. Marcel Dekker, New York, NY, pp. 97-193.

Stone, M., 1974. Cross-validating choice and assessment of statistical predictions (with discussion). J. R. Stat. Soc. B, 36: 111-147.

Tarantola, A., 1987. Inverse Problem Theory: Methods for Data Fitting and Model Parameter Estimation. Elsevier, Amsterdam, 613 pp.

Theil, H., 1966. Applied Econometric Forecasting. North-Holland, Amsterdam, 474 pp.

Van Horn, R.L., 1971. Validation of simulation results. Manage. Sci., 17: 247-258.

Wallach, D. and Goffinet, B., 1989. Mean squared error of prediction as a criterion for evaluating and comparing system models. Ecol. Modelling, 44: 299-306.

Wilkinson, L., 1989. SYSTAT: The System for Statistics. SYSTAT Inc., Evanston, IL.

Zeigler, B.P., 1976. Theory of Modelling and Simulation. John Wiley and Sons, New York, NY, 435 pp.