advances in water resources - peoplemye/pdf/paper24.pdf · 2011. 4. 16. · dom log hydraulic...
Post on 17-Mar-2021
2 Views
Preview:
TRANSCRIPT
Advances in Water Resources xxx (2011) xxx–xxx
Contents lists available at ScienceDirect
Advances in Water Resources
journal homepage: www.elsevier .com/ locate/advwatres
Bayesian analysis of data-worth considering model and parameter uncertainties
Shlomo P. Neuman a,⇑, Liang Xue a, Ming Ye b, Dan Lu b
a Department of Hydrology and Water Resources, University of Arizona, Tucson, AZ 85721, United Statesb Department of Scientific Computing, Florida State University, Tallahassee, FL 32306, United States
a r t i c l e i n f o
Article history:Available online xxxx
Keywords:Data worthValue of informationModel uncertaintyParameter uncertaintyBayesian model averaging
0309-1708/$ - see front matter � 2011 Elsevier Ltd. Adoi:10.1016/j.advwatres.2011.02.007
⇑ Corresponding author. Tel.: +1 520 885 3482; faxE-mail address: neuman@hwr.arizona.edu (S.P. Ne
Please cite this article in press as: Neuman SP e(2011), doi:10.1016/j.advwatres.2011.02.007
a b s t r a c t
The rational management of water resource systems requires an understanding of their response to exist-ing and planned schemes of exploitation, pollution prevention and/or remediation. Such understandingrequires the collection of data to help characterize the system and monitor its response to existing andfuture stresses. It also requires incorporating such data in models of system makeup, water flow andcontaminant transport. As the collection of subsurface characterization and monitoring data is costly,it is imperative that the design of corresponding data collection schemes be cost-effective, i.e., that theexpected benefit of new information exceed its cost. A major benefit of new data is its potential to helpimprove one’s understanding of the system, in large part through a reduction in model predictive uncer-tainty and corresponding risk of failure. Traditionally, value-of-information or data-worth analyses haverelied on a single conceptual-mathematical model of site hydrology with prescribed parameters. Yetthere is a growing recognition that ignoring model and parameter uncertainties render model predictionsprone to statistical bias and underestimation of uncertainty. This has led to a recent emphasis onconducting hydrologic analyses and rendering corresponding predictions by means of multiple models.We describe a corresponding approach to data-worth analyses within a Bayesian model averaging(BMA) framework. We focus on a maximum likelihood version (MLBMA) of BMA which (a) is compatiblewith both deterministic and stochastic models, (b) admits but does not require prior information aboutthe parameters, (c) is consistent with modern statistical methods of hydrologic model calibration, (d)allows approximating lead predictive moments of any model by linearization, and (e) updates modelposterior probabilities as well as parameter estimates on the basis of potential new data both beforeand after such data become actually available. We describe both the BMA and MLBMA versions theoret-ically and implement MLBMA computationally on a synthetic example with and without linearization.
� 2011 Elsevier Ltd. All rights reserved.
1. Introduction approach [55]. Many today prefer a fourth approach based on va-
The world’s water supply is threatened by overexploitation andcontamination. To manage this supply in an optimal and sustain-able manner, it is necessary to understand the response of waterresource systems to existing and planned schemes of exploitationand pollution prevention and/or remediation. Such understandingrequires the collection of suitable data to help characterize the sys-tem and monitor its response to existing and future stresses. It alsorequires incorporating such data in suitable models of water flowand contaminant transport.
As noted by Back [3], three strategies have traditionally beenused to determine the magnitude of a data collection effort: mini-mizing cost for a specific level of accuracy or precision, minimizinguncertainty for a given budget, or responding to regulatory de-mands on data quantity and quality. Various combinations of thesestrategies have also been described such as a fitness-for-purpose
ll rights reserved.
: +1 520 844 1795.uman).
t al. Bayesian analysis of data-
lue-of-information or data-worth analysis. Here the decision tocollect additional data, or the design of a data collection program,is based on cost-effectiveness. A program is considered cost-effec-tive if the expected benefit from the new information exceeds thecost. A major benefit of new data is its potential to help improveone’s understanding of the system, in large part through a reduc-tion in model predictive uncertainty. This benefit, however, isworth the cost only if it has the potential to impact decisionsconcerning management of the water resource system.
Value-of-information or data-worth analyses incorporating sta-tistical decision theory have been applied to various water-relatedproblems in the 1970s [12,13,23,36] and to groundwater problemsin the 1980s [4,22,33,47]. More recent applications to groundwaterresource and contamination issues have been reported in[1,11,21,27–30,37,45,50,51]. James and Freeze [28] proposed aBayesian decision-making framework to evaluate the worth of datain the context of contaminated groundwater that has been widelycited in the subsequent literature. A comprehensive reviewfocusing on health risk assessment can be found in [61]. Additionalrecent publications of relevance include [18,41].
worth considering model and parameter uncertainties. Adv Water Resour
Fig. 1. Decision tree (after [3]).
2 S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx
A major limitation of many existing approaches is that they relyon a single conceptual-mathematical model of geologic or wa-tershed makeup and of hydrologic processes therein. Yet hydro-logic environments are open and complex, rendering them proneto multiple interpretations and mathematical descriptions, includ-ing parameterizations. This is true regardless of the quantity andquality of available data. Predictions and analyses of uncertaintybased on a single hydrologic concept are prone to statistical bias(caused by reliance on an inadequate model) and underestimationof uncertainty (caused by under-sampling of the relevant modelspace). Analyses of environmental data-worth which explore howdifferent sets of conditioning data impact the predictiveuncertainty of multiple models in a Bayesian context include[19,49,54]; whereas Freer et al. [19] employ Generalized LikelihoodUncertainty Estimation (GLUE; see [5,6]), Rojas et al. [49] combineGLUE with Bayesian model averaging (BMA; [7,17,25,32,34]).Diggle and Lophaven [16] describe a Bayesian approach to geosta-tistical design. Nowak et al. [43] introduce a Bayesian approach todata worth analysis when flow and transport take place in a ran-dom log hydraulic conductivity field. Whereas flow and transportare described by a single (linearized) model each having knownparameters, other than those describing spatial variations in loghydraulic conductivity, the latter is characterized by a single driftmodel and a continuous family of variogram models having uncer-tain parameters.
In a similar spirit, we propose in this paper a multimodel ap-proach to optimum value-of-information or data-worth analysesthat is based on model averaging within a Bayesian framework.Our approach is general in that it considers multiple models ofany kind, all having uncertain parameters; whereas parameterizingmodels in the manner of [43] is elegant and computationally effi-cient, it is unfortunately limited to a narrow range of variogrammodels and does not, generally, apply to other models such asthose of flow and transport. We prefer BMA over GLUE because it(a) rests on rigorous statistical theory, (b) is compatible with deter-ministic as well as stochastic models and, (c) in its maximumlikelihood (ML) version (MLBMA), is consistent with current MLmethods of hydrologic model calibration [35,38–40,57–60].Whereas BMA (like the closely related approach in [43]) reliesheavily on prior parameter statistics, MLBMA can do without suchstatistics or otherwise update them on the basis of potential newdata both before and after they are collected. We describe the pro-posed BMA and MLBMA approaches theoretically, outline ways toimplement the MLBMA version computationally, and illustrate thelatter on a synthetic example. Our proposed methodology shouldbe of help in designing the collection of hydrologic characterizationand monitoring data in a cost-effective manner by maximizingtheir benefit under given cost constraints. The benefit would ac-crue from optimum gain in information, or reduction in predictiveuncertainty, upon considering jointly not only traditional sourcesof uncertainty such as those affecting model parameters and thereliability of data but also, most importantly, lack of certaintyabout the conceptual-mathematical models that underlie the anal-ysis and the scenarios under which the system would operate inthe future. The methodology should apply to a broad range of mod-els representing natural processes in ubiquitously open and com-plex earth and environmental systems.
Fig. 2. Steps of Bayesian approach (after [3]).
2. Background
2.1. Bayesian decision analysis framework
One way to cast the data-worth issue would be within a Bayes-ian risk-cost-benefit decision framework such as that of Freezeet al. [20,21]. Suppose without loss of generality that the data are
Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-(2011), doi:10.1016/j.advwatres.2011.02.007
intended to help one decide whether or not a contaminated siteshould be remediated. This decision problem is illustrated inFig. 1 [3] by a decision tree in which U is the decision objective;Ui is an objective function associated with each decision alterna-tive (i = 1,2) defined as
Ui ¼ Bi � Ci � cPiCfi; ð1Þ
where Bi is the benefit and Ci the investment cost, risk being ex-pressed as the product cPiCfi of a risk aversion factor c, the probabil-ity of failure Pi and the cost of failure Cfi; C+ designates acontaminated and C� an uncontaminated state of the site; costsand benefits occurring at the triangular terminal nodes (only thecost of failure is indicated in the figure). Collecting additional infor-mation generally causes the risk term to decrease due to a decreasein the probability of failure. The corresponding increase in Ui is theexpected value (worth) of the new data. The final outcome of theanalysis depends on the choice of decision rule one adopts [42];for example, maximizing Ui would result in the largest benefitand lowest cost.
According to Back [3] the Bayesian approach to data-worthanalysis entails five steps as illustrated in Fig. 2. The steps include(1) defining one or more data collection (sampling) programs, (2)postulating a prior probability for the state of the site (e.g., contam-inated or uncontaminated), (3) using Bayes’ theorem to update theprior probability to a posterior probability conditional on the newdata, corresponding to each data collection program, (4) estimatingcorresponding costs and benefits, and (5) computing the worth ofdata or value of information using a given decision model andusing the results to optimize the data collection scheme. This workalso considers using linearized estimation of uncertainty to updatethe prediction variance in step 3. We add that data-worth analysesactually include a pre-posterior and a posterior mode. In the pos-terior mode, a given set of data is evaluated in hindsight, afterspending the money to collect it. In the pre-posterior mode, possi-ble sampling schemes are analyzed for the worth of data not yetcollected. This mode requires averaging over all such data,weighted by their pre-posterior probabilities, as we do below.
worth considering model and parameter uncertainties. Adv Water Resour
S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx 3
As noted, collecting additional information generally reducesrisk due to a decrease in the probability of failure. A reduction inthe probability of failure comes about through a reduction inuncertainty about the expected system state, present or future.The impact of hydrologic data on this expectation and the associ-ated uncertainty are often evaluated by means of a hydrologicmodel. Commonly, the model is considered to be certain whileits parameters (and in some cases its forcing terms such as sourcesand boundary conditions) are treated as being uncertain due toinsufficient and error-prone data [18]. As already noted, we knowof only one work that considers the impact of data on model pre-dictive uncertainty within a Bayesian framework by consideringthe model itself to be uncertain [19] and another work that param-eterizes this uncertainty [43]. Below we provide background aboutBayesian model averaging and its maximum likelihood versionwhich we propose to employ for this same purpose.
2.2. Bayesian model averaging (BMA)
Consider a random vector, D, the multivariate statistics ofwhich are to be predicted with a set M of K mutually independentmodels, Mk, each characterized by a vector of parameters hk, condi-tional on a discrete set of data, D (the case of correlated models hasrecently been considered in [52]). In analogy to the case of a scalarD [25] we write the joint posterior (conditional) distribution of Das
pðDjDÞ ¼XK
k¼1
pðDjD;MkÞpðMkjDÞ; ð2Þ
i.e., as the average over all models of the joint posterior distribu-tions p(DjD,Mk) associated with individual models, weighted bythe model posterior probabilities p(MkjD). These weights are givenby Bayes’ rule in the form
pðMkjDÞ ¼pðDjMkÞpðMkÞPKl¼1pðDjMlÞpðMlÞ
; ð3Þ
where
pðDjMkÞ ¼Z
pðDjMk; hkÞpðhkjMkÞdhk ð4Þ
is the integrated likelihood of model Mk, p(DjMk,hk) being the jointlikelihood of this model and its parameters, p(hkjMk) the prior den-sity of hk under model Mk, and p(Mk) the prior probability of Mk. Thelikelihood p(DjMk,hk) contains a model of D measurement errors,and the prior density p(hkjMk) may contain a model of parametermeasurement errors [8]. All probabilities are implicitly conditionalon the choice of models entering into the set M.
The posterior mean and covariance of D are given, throughextension of Draper’s [17] analysis to our multivariate case, by
EðDjDÞ ¼XK
k¼1
EðDjD;MkÞpðMkjDÞ; ð5Þ
CovðDjDÞ ¼XK
k¼1
CovðDjD;MkÞpðMkjDÞ þXK
k¼1
½EðDjD;MkÞ
� EðDjDÞ�½EðDjD;MkÞ � EðDjDÞ�T pðMkjDÞ; ð6Þ
where the superscript T denotes transpose. Eq. (6) is a discreteexpression of the law of total covariance,
CovðDjDÞ ¼ EMk jDCovðDjD;MkÞ þ CovMk jDEðDjD;MkÞ; ð7Þ
where EMk jDCovðDjD;MkÞ is the within-model component ofCov(DjD) and CovMk jDEðDjD;MkÞ is its between-model component.We will also be interested in the trace
Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-(2011), doi:10.1016/j.advwatres.2011.02.007
Tr½CovðDjDÞ� ¼ Tr½EMk jDCovðDjD;MkÞ� þ Tr½CovMk jDEðDjD;MkÞ� ð8Þ
which provides a scalar measure of the posterior variance of D. Thelatter is of interest because, for K > 1, one generally has Tr½CovMk jDEðDjD;MkÞ� > 0 so that Tr½CovðDjDÞ� > Tr½EMk jDCovðDjD;MkÞ�. Hencethe consideration of multiple models generally results in greaterpredictive uncertainty, as measured by Tr[Cov(DjD)], than theuncertainty associated with a single model, as measured byTr[Cov(DjD,Mk)]. Note that, for any consistent normkk, kCovðDjDÞk 6 kEMk jDCovðDjD;MkÞk þ kCovMk jDEðDjD;MkÞk wherethe equality holds if and only if the two right hand side argumentsare linearly dependent, which is generally not the case. Itfollows that kCovMk jDEðDjD;MkÞk > 0 does not generally implykCovðDjDÞk > kEMk jDCovðDjD;MkÞk.
2.3. Maximum likelihood Bayesian model averaging (MLBMA)
BMA defines the integrated likelihood p(DjMk) of model Mk en-tirely in terms of the prior parameter density p(hkjMk) of modelparameters, having thus no provision for the conditioning of modelparameters on measurements D (i.e., for the estimation of optimummodel parameters on the basis of D using inverse methods). Instead,it requires computing the integral in (4) through exhaustive sam-pling of the prior parameter space hk for each model followed bynumerical integration. One way to resolve both issues is to replacehk by an estimate, hD
k , which maximizes the likelihood p(DjMk,hk).Obtaining such maximum likelihood (ML) estimates entails calibrat-ing each model against (conditioning on) the data D usingwell-established statistical inverse methods [8–10,24,44,48].Approximating p(DjD,Mk) in (2) by p(DjD,Mk)ML, where the sub-script indicates approximation based on ML estimation of hk, wasshown to be useful (for the case of a scalar D) in [17,46,56]. Neuman[40] proposed evaluating the weights p(MkjD) in (2), (5), and (6)based on a result due to Kashyap [31]. The corresponding expressionis [57]
pðMkjDÞ ’ pðMkjDÞML ¼exp � 1
2 dKICDk
� �pðMkÞPK
l¼1 exp � 12 dKICD
l
� �pðMlÞ
; ð9Þ
where
dKICDk ¼ KICD
k � KICDmin; ð10Þ
KICDk ¼ � 2 ln p DjMk; h
Dk
� �ML� 2 ln p hD
k jMk
� �ML
þ Nk lnND
2p
!þ ln jFkðDjMkÞMLj; ð11Þ
KICDk being the so-called Kashyap model selection (or information)
criterion for model Mk, KICDmin its minimum value over all candidate
models, and �2lnp(DjMk,hk)ML � 2lnp(hkjMk)ML a negative log like-lihood incorporating prior measurements of the parameters (ifavailable; see [8]), evaluated at hD
k . Here Nk is the dimension of hk
(number of adjustable parameters associated with model Mk), ND
is the dimension of D (number of discrete data points, which mayinclude measured parameter values), and Fk is the normalized (byND) observed (as opposed to ensemble mean) Fisher informationmatrix having components
Fk;nm ¼ �1
ND
@2 ln pðDjMk; hkÞ@hkn@hkm
" #hk¼hD
k
: ð12Þ
In the limit of large ND=Nk, KICDk reduces asymptotically to the
so-called Bayesian selection (or information) criterion (e.g. [59])
BICDk ¼ �2 ln p DjMk; h
Dk
� �MLþ Nk ln ND: ð13Þ
worth considering model and parameter uncertainties. Adv Water Resour
4 S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx
3. Effect of data augmentation on uncertainty
3.1. BMA framework
Suppose that the original data set D is augmented by anotherhypothetical data set, C, which has not yet been collected and istherefore uncertain. Assume that the multivariate statistics of D,predicted with the model set M, can be conditioned on the aug-mented data set {D,C}. Assume further that the multivariate statis-tics of C, conditional on D, can be predicted either via BMA or viaMLBMA with a set P of I mutually independent statistical models,Pi, having parameters pi. The models Pi may be independent of Mk,may form extensions of Mk or may coincide with the latter as in thecomputational example given later in this paper. Then analogy to(2) and (5)–(8) implies
pðDjDÞ ¼ ECjDpðDjD;CÞ; ð14Þ
pðDjD;CÞ ¼XK
k¼1
pðDjD;C;MkÞpðMkjD;CÞ; ð15Þ
EðDjDÞ ¼ ECjDEðDjD;CÞ; ð16Þ
EðDjD;CÞ ¼XK
k¼1
EðDjD;C;MkÞpðMkjD;CÞ; ð17Þ
CovðDjDÞ ¼ ECjDCovðDjD;CÞ þ CovCjDEðDjD;CÞ; ð18Þ
CovðDjD;CÞ ¼XK
k¼1
CovðDjD;C;MkÞpðMkjD;CÞ þXK
k¼1
½EðDjD;C;MkÞ
� EðDjD;CÞ�½EðDjD;C;MkÞ � EðDjD;CÞ�T pðMkjD;CÞ;ð19Þ
Tr½CovðDjDÞ� ¼ Tr½ECjDCovðDjD;CÞ� þ Tr½CovCjDEðDjD;CÞ�; ð20Þ
where C is implicitly conditional on the choice of models P and onD. The term Tr[CovCjDE(DjD,C)] = Tr[Cov(DjD)] � Tr[ECjDCov(DjD,C)]represents the difference between the total trace conditional on Dand the expected trace conditional jointly on D and C. As this differ-ence is positive, conditioning on D and C jointly results in a lowertrace than conditioning on D alone. The difference could be viewedas an extended version of the A-criterion in optimal design.
The joint posterior distribution (weight) of each model is given,analogously to (3), by Bayes’ rule
pðMkjD;CÞ ¼pðD;CjMkÞpðMkÞPKl¼1pðD;CjMlÞpðMlÞ
; ð21Þ
where, in analogy to (4), the joint integrated likelihood of D and C is
pðD;CjMkÞ ¼Z
pðD;CjMk; hkÞpðhkjMkÞdhk: ð22Þ
The joint likelihood of D and C under the integral lends itself toBayesian updating according to
pðD;CjMk; hkÞ ¼ pðCjD;Mk; hkÞpðDjMk; hkÞ: ð23Þ
3.2. MLBMA framework
Let hD;Ck be an ML estimate of hk obtained by maximizing the
joint likelihood p(D,CjMk,hk) under the integral in (22) with respectto these parameters. Then MLBMA would entail approximatingEqs. (15), (17), (19)–(21) via
pðDjD;CÞ ’ pðDjD;CÞML ¼XK
k¼1
pðDjD;C;MkÞMLpðMkjD;CÞML; ð24Þ
EðDjD;CÞ ’ EðDjD;CÞML ¼XK
k¼1
EðDjD;C;MkÞMLpðMkjD;CÞML; ð25Þ
Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-(2011), doi:10.1016/j.advwatres.2011.02.007
CovðDjD;CÞ ’ CovðDjD;CÞML
¼XK
k¼1
CovðDjD;C;MkÞMLpðMkjD;CÞML
þXK
k¼1
½EðDjD;C;MkÞML � EðDjD;CÞML�½EðDjD;C;MkÞML
� EðDjD;CÞML�T pðMkjD;CÞML; ð26Þ
Tr½CovðDjD;CÞ� ’ Tr½CovðDjD;CÞML�¼ Tr½EMk jDCovðDjD;C;MkÞML�þ Tr½CovMk jDEðDjD;C;MkÞML�; ð27Þ
pðMkjD;CÞ ’ pðMkjD;CÞML ¼pðD;CjMkÞMLpðMkÞPKl¼1pðD;CjMlÞMLpðMlÞ
: ð28Þ
In analogy to (9)–(12), (28) can be rewritten as
pðMkjD;CÞML ¼exp � 1
2 dKICD;Ck
� �pðMkÞPK
l¼1 exp � 12 dKICD;C
l
� �pðMlÞ
; ð29Þ
dKICD;Ck ¼ KICD;C
k � KICD;Cmin; ð30Þ
KICD;Ck ¼ � 2 ln pðD;CjMk; h
D;Ck ÞML � 2 ln p h
D;Ck jMk
� �ML
þ Nk lnND;C
2p
!þ ln jFkðD;CjMkÞMLj; ð31Þ
where KICD;Cmin is the minimum value of KICD;C
k over all K models, Mk;�2lnp(D,CjMk,hk)ML � 2lnp(hkjMk)ML is a negative log likelihoodincorporating prior measurements of the parameters (if available),evaluated at h
D;Ck ; Nk is the dimension of h
D;Ck (number of estimated
parameters); ND,C is the joint dimension of D and C (number of dis-crete data points); and Fk is the normalized observed Fisher infor-mation matrix having components
Fk;nm ¼ �1
ND;C
@2 ln pðD;CjMk; hkÞ@hkn @hkm
" #hk¼h
D;Ck
: ð32Þ
Minimizing �2lnp(D,CjMk,hk) � 2lnp(hkjMk) with respect to hk
can be optionally simplified by linearizing the residuals enteringinto this negative log likelihood about ML estimates hD
k , based so-lely on D, obtained through minimization of �2lnp(DjMk,hk) �2lnp(hkjMk) with respect to hk. This option is not explored in thepresent paper.
Computational implementation of the proposed MLBMA frame-work is detailed in Appendix A. Appendix B shows how some ofthese computations can be rendered more efficient throughlinearization.
4. Synthetic geostatistical example
Our proposed approach to the assessment of data-worth can beimplemented by using either BMA or MLBMA. Here we do so usingMLBMA by considering multiple variogram models of a zero-meanspatially correlated random field, Z(x), having point support in twodimensions, x = (x1,x2)T. In particular, we use a modified version ofthe sequential Gaussian simulation code SGSIM [14] to generate anunconditional realization of Z on a grid of 50 � 50 nodes using atruncated power variogram model with Gaussian modes (TpvG) gi-ven in [15],
worth considering model and parameter uncertainties. Adv Water Resour
S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx 5
cðs; kuÞ ¼ r2ðkuÞ 1� exp �p4
sku
� �2" #
þ p4
sku
� �2" #H
8<:
� C 1� H;p4
sku
� �2" #)
; 0 < H < 1; ð33Þ
where s = kx1 � x2k is separation distance (lag) between Z values atany two points x1 and x2, ku is an upper cutoff scale proportional todomain size, A is a coefficient, H is a Hurst scaling exponent,r2ðkuÞ ¼ Ak2H
u =2H is variance (sill) and C(�, �) is the incompletegamma function. The corresponding integral scale is I(ku) = 2Hku/(1 + 2H). We set the parameters of the TpvG model equal toh = (A,H,ku)T = (0.1,0.25,5)T which correspond to r2 = 0.45 andI = 1.67. We then generate a ‘‘true’’ sample of 2500 Z values at50 � 50 nodes of a square grid, spaced a unit distance apart, asshown in Fig. 3. After verifying that a sample variogram based onall the generated values reproduces the original TpvG very closelywe select 100 Z values at randomly located nodes to comprise a vec-tor D of ‘‘available’’ data, 20 values to form a vector C0 (C0 represent-ing true but unknown data values, C their estimates) at otherrandomly located ‘‘potential new’’ sampling nodes, and those atthe remaining 2380 nodes to make up a vector D of ‘‘unknown’’ val-ues that we wish to predict via MLBMA. The latter is based either onD or on {D,C} where C are values simulated randomly at the ‘‘poten-tial new’’ sampling nodes conditional on D.
To predict D we consider a set M of K = 3 alternative variogrammodels, Mk, having parameters hk (purposely excluding the gener-ating, or ‘‘true,’’ TpvG model): exponential (Exp), Gaussian (Gau)and spherical (Sph). Each model, Mk, is assigned an equal priorprobability, p(Mk) = 1/3, and is calibrated against D to yield ML esti-mates hD
k of hk by minimizing the joint negative log likelihood�2lnp(DjMk,hk) � 2lnp(hkjMk). The process, denoted by MLD as ex-plained in Appendix A, also yields corresponding parameter esti-mation covariance matrices CD
k , Kashyap criteria KICDk according
to (11) and Bayesian criteria BICDk according to (13). For comparison
we also compute information theoretic criteria
AICDk ¼ �2 ln pðDjMkÞML þ 2Nk; ð34Þ
AICcDk ¼ �2 ln pðDjMkÞML þ 2Nk þ
2NkðNk þ 1ÞND � Nk � 1
ð35Þ
Fig. 3. ‘‘True’’ random field Z (colored) generated using TpvG with parametersh = (A,H,ku)T = (0.1,0.25,5)T at 2500 grid nodes (shown). D represent locations of‘‘available’’ data and C those of ‘‘potential new’’ data.
Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-(2011), doi:10.1016/j.advwatres.2011.02.007
introduced, respectively, by Akaike [2] and Hurvich and Tsai [26].KICD
k and BICDk are used to compute posterior model probabilities
(or, in the case of AICDk and AICcD
k , model averaging weights),pðMkjDÞMLD , for each model via (9). The results are listed in Table1 and the corresponding fits illustrated in Fig. 4. It is evident thatthe sample D of available data is not large enough to reproducecorrectly the TpvG model used to generate it; the correspondingsample and fitted variograms underestimate the true sill and over-estimate the true integral scale. All criteria favor the spherical mod-el, assigning a very low posterior probability (or weight) to theGaussian model. Our model averaged results are based on KICD
k .Augmenting the sample to include {D,C0} is still not enough to
reproduce correctly the TpvG model; the sample and fitted vario-grams underestimate the true sill to a greater extent than wasthe case with D alone but overestimate the true integral scale toa lesser extent (compare Figs. 4 and 5, and Tables 1 and 2). Thepreference of all criteria for the spherical model is now more pro-nounced (less ambiguous) than it was in the case of D. These re-sults indicate a need to account for the effect of potential newdata on parameter estimation and model weighting, as we do nextby following the procedure in Appendix A.
Predictions of D are uncertain due to random spatial fluctua-tions in Z as well as uncertainty about the variogram model, Mk,and its parameters, hk. As D is generally nonlinear in hk, one canestimate its lead moments either through linearization, as de-scribed in Appendix B, or via Monte Carlo simulation. We start withthe latter option by (a) drawing Rh = 2000 random realizations, h
rhk ,
of hk from a multivariate normal distribution hk � N hDk ;C
Dk
� �where rh = 1,2, . . . ,Rh for each Mk; (b) obtaining kriging estimatesEðDpjD;Mk; h
rhk ÞMLD and kriging variances VarðDpjD;Mk; h
rhk ÞMLD and
covariances CovðDpDqjD;Mk; hrhk ÞMLD for all components Dp and Dq
of D; and (c) averaging these over all Rh realizations to obtain
EðDpjD;MkÞMLD ¼ Ehk½EðDpjD;Mk; hkÞMLD �
’ 1Rh
XRh
r¼1
EðDpjD;Mk; hrhk ÞMLD ; ð36Þ
CovðDpDqjD;MkÞMLD ¼ EhkCovðDpDqjD;Mk; hkÞMLD
� �þ Covhk
EðDpDqjD;Mk; hkÞMLD
� �; ð37Þ
Table 1Parameter estimates; negative log likelihoods NLL; model selection criteria AIC, AICc,BIC and KIC; prior and posterior model probabilities; and rankings of variogrammodels based on D.
Model Exp Gau Sph
Sill estimate 0.404 0.401 0.399Stda of sill 0.226 0.233 0.222Integral scale estimate 3.686 1.979 2.681Stda of integral scale 0.574 0.187 0.283NLL �145.14 �136.13 �147.23NLL rank 2 3 1p(Mk) 1/3 1/3 1/3AIC �139.14 �130.13 �141.23AIC rank 2 3 1p(MkjD)AIC 25.98% 0.29% 73.74%AICc �138.91 �129.90 �141.00AICc rank 2 3 1p(MkjD)AICc 25.98% 0.29% 73.74%BIC �131.04 �122.03 �133.12BIC rank 2 3 1p(MkjD)BIC 25.98% 0.29% 73.74%KIC �146.89 �135.94 �147.29KIC rank 2 3 1p(MkjD)KIC 44.92% 0.19% 54.90%
a Std represents standard deviation.
worth considering model and parameter uncertainties. Adv Water Resour
V
L
Fig. 4. TpvG, sample and fitted variograms based on D; see statistics in Table 1.
V
L
Fig. 5. TpvG, sample and fitted variograms based on {D,C0}; see statistics in Table 2.
Table 2Parameter estimates; negative log likelihoods NLL; model selection criteria AIC, AICc,BIC and KIC; prior and posterior model probabilities; and rankings of variogrammodels based on {D,C0}.
Model Exp Gau Sph
Sill estimate 0.371 0.372 0.373Std⁄ of sill 0.214 0.220 0.210Integral scale estimate 2.409 1.907 2.672Std⁄ of integral scale 0.403 0.151 0.311NLL �183.86 �175.59 �188.36NLL rank 2 3 1p(Mk) 1/3 1/3 1/3AIC �177.86 �169.59 �182.36AIC rank 2 3 1p(MkjD,C0)AIC 9.52% 0.15% 90.33%AICc �177.67 �169.40 �182.17AICc rank 2 3 1p(MkjD,C0)AICc 9.52% 0.15% 90.33%BIC �169.25 �160.99 �173.76BIC rank 2 3 1p(MkjD,C0)BIC 9.52% 0.15% 90.33%KIC �184.80 �174.87 �188.51KIC rank 2 3 1p(MkjD,C0)KIC 13.51% 0.09% 86.39%
⁄ Std represents standard deviation.
6 S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx
Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-(2011), doi:10.1016/j.advwatres.2011.02.007
where
EhkCovðDjD;Mk; hkÞMLD
� �’ 1
Rh
XRh
rh¼1
CovðDjD;Mk; hrhk ÞMLD ð38Þ
and [53]
Cov DpDqjD;Mk; hrhk
� MLD ¼ Covðxp � xqjMk; h
rhk Þ
þX
m2Np
Xn2Nq
kpmkqnCovðxm � xnjMk; hrhk Þ
�X
m2Np
kpmCovðxm � xpjMk; hrhk Þ
�Xn2Nq
kqnCovðxn � xqjMk; hrhk Þ; ð39Þ
Covðxp � xqjMk; hrhk Þ being the covariance between Z values at spatial
locations xp and xq according to variogram model Mk with parame-ters h
rhk , kpm the kriging weight of data at xm used to estimate Z at
point xp, and Np the number of data points employed for this pur-pose. When Dp = Dq, (39) reduces to
VarðDpjD;Mk; hrhk ÞMLD ¼ Covð0jMk; h
rhk Þ
�X
m2Np
kpmCovðxm � xpjMk; hrhk Þ � lp ð40Þ
where Cov(0) is the sill of variogram model Mk and lp a Lagrangemultiplier entering into the kriging equations associated with theestimation of Dp. The last term in (37) is computed as
Covhk½EðDpDqjD;Mk; hkÞMLD �
’ 1Rh � 1
XRh
rh¼1
EðDpjD;Mk; hrhk ÞhD
k� EðDpjD;MkÞhD
k
h i
� EðDqjD;Mk; hrhk ÞhD
k� EðDqjD;MkÞhD
k
h ið41Þ
We verify that Rh = 2000 is large enough for Tr½CovðDjD;MkÞMLD � tostabilize for each model.
Once EðDpjD;MkÞMLD and CovðDpDqjD;MkÞMLD have been deter-mined for each Mk by Monte Carlo simulation, we compute theirmodel averaged equivalents according to
EðDpjDÞMLD ¼XK
k¼1
EðDpjD;MkÞMLD pðMkjDÞMLD ; ð42Þ
CovðDpDqjDÞMLD ¼XK
k¼1
CovðDpDqjD;MkÞMLD pðMkjDÞMLD
þXK
k¼1
½EðDpjD;MkÞMLD �EðDpjDÞMLD �½EðDqjD;MkÞMLD
�EðDqjDÞMLD �pðMkjDÞMLD : ð43Þ
A similar procedure is used to obtain EðDpjD;C0ÞMLD;C0 andCovðDpDqjD;C0ÞMLD;C0 . Figs. 6 and 7 show how VarðDpjDÞMLD andVarðDpjD;C0ÞMLD;C0 , respectively, vary across the grid, the differenceVarðDpjDÞMLD � VarðDpjD;C0ÞMLD;C0 between them being depicted inFig. 8. It is seen that enlarging the data base from 100 (the dimen-sion of D) to 120 (the dimension of {D,C0}) reduces the predictionvariance across the grid, most noticeably in its upper rightquadrant.
Since in real applications C0 would not be available, the nextstep is to compute the statistics of potential data C conditionalon D. Since in our case D and C describe the same attribute at dif-ferent locations, the statistics of C are obtained simply upon replac-ing D with C in the above procedure. We use these statistics togenerate Rc = 200 random realizations, CrC ; rc ¼ 1;2; . . . ;Rc , of Cby considering it to be multivariate normal, C � N EðCjDÞMLD ;
�CovðCjDÞMLD �; we see no way to avoid Monte Carlo simulation
worth considering model and parameter uncertainties. Adv Water Resour
DD
D
D
D
D
D
D
D
D
D
D
D
D
DD
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
DD
D
DD
D D
DD
DD
D
D
D
D
D
D
D
D
D
D D
D
DD
DDD
D
D
D
D
DD
D
D
DDD
D
D
D
D
D
D
DD
D
D
D
D
D
D
D
D
x
y
0 10 20 30 40 500
10
20
30
40
500.460.440.420.400.380.360.340.320.300.280.260.240.220.200.180.160.140.120.100.080.060.040.02
Fig. 6. Variation of VarðDpjDÞMLD across the grid.
DD
D
D
D
D
D
D
D
D
D
D
D
D
DD
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
DD
D
DD
D D
DD
DD
D
D
D
D
D
D
D
D
D
D D
D
DD
DDD
D
D
D
D
D
D
DDD
D
D
D
D
D
D
DD
D
D
D
D
D
D
D
D
C
C
C
C C
C
CC
C
C
C
C
C
C
CC
C
DD
CC
C
x
y
0 10 20 30 40 500
10
20
30
40
50 0.460.440.420.400.380.360.340.320.300.280.260.240.220.200.180.160.140.120.100.080.060.040.02
Fig. 7. Variation of VarðDpjD;C0 ÞMLD;C0 across the grid.
C
C
C
C C
C
CC
C
C
C
C
C
C
CC
C
CC
C
x
y
0 10 20 30 40 500
10
20
30
40
500.460.440.420.400.380.360.340.320.300.280.260.240.220.200.180.160.140.120.100.080.060.040.02
Fig. 8. Variation of VarðDpjDÞMLD � VarðDpjD;C0ÞMLD;C0 across the grid.
Fig. 9. C0 (red), EðCjDÞMLD (blue), 200 realizations CrC (gray) and 95% confidenceinterval of all 200 realizations (dashed black). (For interpretation of the referencesin color in this figure legend, the reader is referred to the web version of this article.)
DD
D
D
D
D
D
D
D
D
D
D
D
D
DD
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
DD
D
DD
D D
DD
DD
D
D
D
D
D
D
D
D
D
D D
D
DD
DDD
D
D
D
D
DD
D
D
DDD
D
D
D
D
D
D
DD
D
D
D
D
D
D
D
D
x
y
0 10 20 30 40 500
10
20
30
40
500.460.440.420.400.380.360.340.320.300.280.260.240.220.200.180.160.140.120.100.080.060.040.02
Fig. 10. Variation of VarðDpjDÞMLD;C across the grid; compare with Fig. 6 forVarðDpjDÞMLD .
DD
D
D
D
D
D
D
D
D
D
D
D
D
DD
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
DD
D
DD
D D
DD
DD
D
D
D
D
D
D
D
D
D
D D
D
DD
DDD
D
D
D
D
DD
D
D
DDD
D
D
D
D
D
D
DD
D
D
D
D
D
D
D
D
C
C
C
C C
C
CC
C
C
C
C
C
C
CC
C
CC
C
x
y
0 10 20 30 40 500
10
20
30
40
500.460.440.420.400.380.360.340.320.300.280.260.240.220.200.180.160.140.120.100.080.060.040.02
Fig. 11. Variation of ECjDVarðDpjD;CÞMLD;C across the grid; compare with Fig. 7 forVarðDpjD;C0ÞMLD;C0 .
S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx 7
Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-worth considering model and parameter uncertainties. Adv Water Resour(2011), doi:10.1016/j.advwatres.2011.02.007
V
8 S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx
without neglecting the effect of C on parameter and model uncer-tainties. Fig. 9 verifies that EðCjDÞMLD represents a smoothed ver-sion of C0, which in turn is well within the range of simulated CrC
realizations.For every realization CrC we predict D the same way as before
but now conditional on an expanded data base fD;CrC g. This yieldsEðDpjD;CrC ÞMLD;C and CovðDpDqjD;CrC ÞMLD;C where the subscript MLD,C
denotes ML estimation based on fD;CrC g. The latter are then aver-aged over all realizations of CrC to obtain
EðDpjDÞMLD;C ¼ ECjDEðDpjD;CÞMLD;C ’ 1RC
XRC
rC¼1
EðDpjD;CrC ÞMLD;C ; ð44Þ
Fig. 13. Trace values obtained upon estimating lead moments of D through MonteCarlo simulation (MC) and second-order approximation (MA) according to Appen-dix B. Indexes 1–5 represent Tr½CovðDjDÞMLD �; Tr½CovðDjD;C0 Þ
MLD;C0 �; Tr½CovðDjDÞMLD;C �;Tr½CovCjDEðDjD;CÞMLD;C � and Tr½ECjDCovðDjD;CÞMLD;C �, respectively.
CovðDpDqjDÞMLD;C ¼ECjDCovðDpDqjD;CÞMLD;C þCovCjDEðDpDqjD;CÞMLD;C
ð45Þ
and Tr½CovðDjDÞMLD;C �. The final step entails computing Tr½CovCjDEðDjD;CÞMLD;C � which, we recall, represents the variance reductionmeasure Tr½CovðDjDÞMLD;C � � Tr½ECjDCovðDjD;CÞMLD;C �; we verify thatsample estimates of all these three terms stabilize after 200realizations. Variations of VarðDjDÞMLD;C ; ECjDVarðDjD;CÞMLD;C andVarCjDEðDjD;CÞMLD;C across the grid are shown, respectively, inFigs. 10–12. A comparison of Figs. 6 and 10 reveals that even thoughVarðDjDÞMLD;C tends to exceed VarðDjDÞMLD ðTr½CovðDjDÞMLD;C � ¼ 712:95while Tr½CovðDjDÞMLD � ¼ 682:44 due, most likely, to sampling errorsstemming from size and accuracy limitations on D) the twoquantities have near identical spatial patterns. Likewise, thoughECjDVarðDjD;CÞMLD;C in Fig. 11 tends to exceed VarðDjD;C0ÞMLD;C0 inFig. 7 these too have near identical spatial patterns. One therefore ex-pects the estimated variance reduction VarCjDEðDjD;CÞMLD;C to exhibita pattern similar to that of the true variance reduction VarðDjDÞMLD�VarðDjD;C0Þ
MLD;C0 , a fact verified through a comparison of Figs. 12and 8.Correspondingly Tr½CovCjDEðDjD;CÞMLD;C � ¼ 54:61 approximatesclosely the true trace reduction Tr½CovðDjDÞMLD � � Tr½CovðDjD;C0ÞMLD;C0 Þ ¼ 58:95.
The above results are based on Monte Carlo evaluation of allmoments. Estimating the lead moments of D through linearizationas described in Appendix B brings about a 10-fold reduction incentral processor time without any serious effect on accuracy:the value of Tr½CovCjDEðDjD;CÞMLD;C � drops from 54.61 to 54.54 andthat of Tr½CovðDjDÞMLD � � Tr½CovðDjD;C0ÞMLD;C0 � from 58.95 to 55.3,all spatial patterns remaining virtually unchanged. A visual com-parison of Tr½CovðDjDÞMLD �; Tr½CovðDjD;C0ÞMLD;C0 �; Tr½CovðDjDÞMLD;C �;
C
C
C
C C
C
CC
C
C
C
C
C
C
CC
C
CC
C
x
y
0 10 20 30 40 500
10
20
30
40
500.460.440.420.400.380.360.340.320.300.280.260.240.220.200.180.160.140.120.100.080.060.040.02
Fig. 12. Variation of VarCjDEðDpjD;CÞMLD;C across the grid; compare with Fig. 8 forVarðDpjDÞMLD � VarðDpjD;C0ÞMLD;C0 .
Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-(2011), doi:10.1016/j.advwatres.2011.02.007
Tr½CovCjDEðDjD;CÞMLD;C � and Tr½ECjDCovðDjD;CÞMLD;C � values obtainedby the two methods is provided in Fig. 13.
5. Conclusions
Our paper leads to the following major conclusions:
1. A multimodel approach to optimum value-of-information ordata-worth analyses has been proposed based on a Bayesianmodel averaging (BMA) framework. We have focused on a max-imum likelihood (MLBMA) variant of BMA that (a) is compatiblewith both deterministic and stochastic models, (b) admits butdoes not require prior information about the parameters, (c)is consistent with modern statistical methods of hydrologicmodel calibration, (d) allows approximating lead predictivemoments of any model by linearization, and (e) updates modelposterior probabilities as well as parameter estimates on thebasis of potential new data both before and after such databecome actually available.
2. The proposed approach should be of help in designing the col-lection of hydrologic characterization and monitoring data ina cost-effective manner by maximizing their benefit undergiven cost constraints. Benefits would accrue from optimumgain in information, or reduction in predictive uncertainty,upon considering jointly not only traditional sources of uncer-tainty such as those affecting model parameters and the reli-ability of data but also lack of certainty about the underlyingmodels.
3. Implementation of the proposed approach on a synthetic geo-statistical problem in two space dimensions demonstrates aneed to account for the impact of potential new data on modeland parameter uncertainties. Though neither existing nor apotentially augmented set of data are sufficient to identify cor-rectly the underlying geostatistical model (variogram) and itsparameters, they nevertheless yield self-consistent results andallow identifying quite accurately the impacts of potentialnew data on the spatial distribution and magnitude of corre-sponding reductions in predictive variance.
4. Approximating lead predictive moments associated with eachmodel by linearization, as described in Appendix B, has yieldedresults comparable to those obtained via Monte Carlo simula-tion with a much lesser expenditure of computational effort.The extent to which such linearization would work in stronglynonlinear situations remains an open question.
worth considering model and parameter uncertainties. Adv Water Resour
S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx 9
Acknowledgements
This research was supported in part through a contract betweenthe University of Arizona and Vanderbilt University under the Con-sortium for Risk Evaluation with Stakeholder Participation (CRESP)III, funded by the US Department of Energy. The third and fourthauthors were supported in part by NSF-EAR Grant 0911074 andDOE-ERSP Grant DE-SC0002687.
Appendix A. Computational implementation of MLBMAframework
To assess the impact of data augmentation within the aboveMLBMA framework computationally we propose the followingapproach:
1. Postulate a set M of K mutually independent geostatistical,statistical or stochastic models, Mk, with parameters hk forthe desired output vector, D.
2. Obtain ML estimates hDk of hk by calibrating each Mk against
available data D through minimization of the log likelihood�2lnp(DjMk,hk) � 2lnp(hkjMk), then compute the corre-sponding estimation covariance CD
k and KICDk .
3. Compute pðMkjDÞMLD ¼ exp �12dKICD
kð ÞpðMkÞPK
l¼1exp �1
2dKICDlð ÞpðMlÞ
where the sub-
script MLD designates the ML estimation process in step 2.4. For each model Mk estimate EðDjD;MkÞMLD and
CovðDjD;MkÞMLD either through second-order approxima-tions, as described in Appendix B, or via Monte Carlo simu-lation (both options are explored in our synthetic example):a. Draw random samples (realizations) of hk from a multi-
variate Gaussian distribution with mean hDk and covari-
ance CDk .
b. Estimate EðDjD;Mk; hkÞMLD and CovðDjD;Mk; hkÞMLD for eachrealization of hk.
c. Average over all realizations of hk to obtain sample esti-mates of EðDjD;MkÞMLD ¼ Ehk
EðDjD;Mk; hkÞMLD and
Please(2011
CovðDjD;MkÞMLD ¼ EhkCovðDjD;Mk; hkÞMLD
þ CovhkEðDjD;Mk; hkÞMLD :
5. Compute EðDjDÞMLD ¼PK
k¼1EðDjD;MkÞMLD pðMkjDÞMLD and
CovðDjDÞMLD¼XK
k¼1
CovðDjD;MkÞMLD pðMkjDÞMLD
þXK
k¼1
EðDjD;MkÞMLD�EðDjDÞMLD
� �EðDjD;MkÞMLD
�
�EðDjDÞMLD
�T pðMkjDÞMLD
and/or Tr½CovðDjDÞMLD �.6. Postulate a set P of I alternative geostatistical, statistical or
stochastic models, Pi, with parameters pi for a potential dataset C; the models Pi may be independent of Mk, may formextensions of Mk or may coincide with the latter as in thecomputational examples given in this paper.
7. Predict multivariate statistics of C, conditional on D, eithervia BMA or via MLBMA by means of the model set P; inthe case of MLBMA the procedure would parallel thatdescribed for D in steps (2)–(6).
8. Estimate EðDjD;CÞMLD;C and CovðDjD;CÞMLD;C , where the sub-script MLD,C designates the ML estimation process in step 2but now with respect to an augmented data set {D,C}, eitherthrough second-order approximations, as described inAppendix B, followed by step 10 or via Monte Carlo simula-
cite this article in press as: Neuman SP et al. Bayesian analysis of data-), doi:10.1016/j.advwatres.2011.02.007
tion by using the statistics of C from step 7 to generate ran-dom realizations of C (both options are explored in oursynthetic example); for each realization and for each modelMk:a. Optionally linearize the residuals entering into the nega-
tive log likelihood �2lnp(D,CjMk,hk) � 2lnp(hkjMk) abouthD
k (this option is not explored in this paper).b. Obtain ML estimates hD;C
k of hk by minimizing this negativelog likelihood with respect to hk, then compute the corre-sponding estimation covariance CD;C
k and KICD;Ck .
c. Compute pðMkjD;CÞMLD;C ¼ exp �12dKICD;C
kð ÞpðMkÞPK
l¼1exp �1
2dKICD;Clð ÞpðMlÞ
.
d. For each model Mk estimate EðDjD;C;MkÞMLD;C andCovðDjD;C;MkÞMLD;C via Monte Carlo simulation:i. Draw random samples (realizations) of hk from a mul-
tivariate Gaussian distribution with mean hD;Ck and
covariance CD;Ck .
ii. Estimate EðDjD;C;Mk; hkÞMLD;C and CovðDjD;C;Mk;
hkÞMLD;C for each realization of hk.iii. Average over all realizations of hk to obtain sample
estimates of
worth
considEðDjD;C;MkÞMLD;C ¼ EhkEðDjD;C;Mk; hkÞMLD;C and
CovðDjD;C;MkÞMLD;C ¼ EhkCovðDjD;C;Mk; hkÞMLD;C
þ CovhkEðDjD;C;Mk; hkÞMLD;C :
e. Compute EðD j D; CÞMLD;C ¼PK
k¼1EðD j D; C; MkÞMLD;C�pðMkjD;CÞMLD;C and
CovðDjD;CÞMLD;C ¼XK
k¼1
CovðDjD;C;MkÞMLD;C pðMkjD;CÞMLD;C
þXK
k¼1
EðDjD;C;MkÞMLD;C �EðDjD;CÞMLD;C
� ��½EðDjD;C;MkÞMLD;C �EðDjD;CÞMLD;C �T
�pðMkjD;CÞMLD;C
and/or Tr½CovðDjD;CÞMLD;C �.9. Average over all realizations of C to obtain sample estimates
of
EðDjDÞMLD;C ¼ ECjDEðDjD;CÞMLD;C ;
CovðDjDÞMLD;C ¼ ECjDCovðDjD;CÞMLD;C þ CovCjDEðDjD;CÞMLD;C
and/or Tr½CovðDjDÞMLD;C �; note that due to inevitable samplingerrors and approximations associated with the ML estimationprocess EðDjDÞMLD;C , CovðDjDÞMLD;C and Tr½CovðDjDÞMLD;C � ob-tained at this step would generally differ, though ideally notby much, from EðDjDÞMLD , CovðDjDÞMLD and Tr½CovðDjDÞMLD � ob-tained at step 5.
10. Repeat steps 6–9 for different sets C1,C2,C3, . . . of potentialdata and select that set which maximizes the differenceTr½CovCjDEðDjD;CÞMLD;C � ¼ Tr CovðDjDÞMLD;C
� �� Tr½ECjDCovðDjD;
CÞMLD;C � between the trace conditional on D and the expectedtrace conditional on D and C (this step is not explored in thepresent paper).
Appendix B. Second-order moment approximations
Step 4 of the MLBMA procedure in Appendix A entails an op-tional approximation of EðDjD;MkÞMLD and CovðDjD;MkÞMLD via leadorder expansions of EðDjD;Mk; hkÞMLD and CovðDjD;Mk; hkÞMLD in hk
ering model and parameter uncertainties. Adv Water Resour
10 S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx
about hDk ; step 8 entails a similar approximation of EðDjD;C;MkÞMLD;C
and CovðDjD;C;MkÞMLD;C via lead order expansion ofEðDjD;C;Mk; hkÞMLD;C and CovðDjD;C;Mk; hkÞMLD;C in hk about h
D;Ck . In
the context of step 4 one would start with second-orderexpansions
EðDpjD;Mk;hkÞMLD ’ EðDpjD;Mk; hDk Þþ
XNk
n¼1
@EðDpjD;Mk;hkÞMLD
@hkn
hD
k
dhkn
þ12
XNk
n¼1
XNk
m¼1
@2EðDpjD;Mk;hkÞMLD
@hkn@hkm
hD
k
dhkndhkm;
ðB1Þ
CovðDpDqjD;Mk;hkÞMLD ’ Cov DpDqjD;Mk; hDk
� �
þXNk
n¼1
@CovðDpDqjD;Mk;hkÞMLD
@hkn
hD
k
dhkn
þ12
XNk
n¼1
XNk
m¼1
@2Cov DpDqjD;Mk;hk
� MLD
@hkn@hkm
hD
k
dhkndhkm;
ðB2Þ
where Dp is the pth component of D and dhk ¼ hk � hDk . Since
EhkðdhkÞ ¼ 0 one has, to second order,
EhkEðDpjD;Mk; hkÞMLD ’ E DpjD;Mk; h
Dk
� �
þ 12
XNk
n¼1
XNk
m¼1
@2EðDpjD;Mk; hkÞMLD
@hkn@hkm
hD
k
Cknm;
ðB3Þ
EhkCovðDpDqjD;Mk;hkÞMLD ’ CovðDpDqjD;Mk; h
Dk Þ
þ12
XNk
n¼1
XNk
m¼1
@2CovðDpDqjD;Mk;hkÞMLD
@hkn@hkm
hD
k
Cknm;
ðB4Þ
CovhkEðDpDqjD;Mk; hkÞMLD ¼ Ehk
EðDpjD;Mk; hkÞMLD
��� Ehk
EðDpjD;Mk; hkÞMLD
�� EðDqjD;Mk; hkÞMLD
�� Ehk
EðDqjD;Mk; hkÞMLD
��’XNk
n¼1
XNk
m¼1
@EðDpjD;Mk; hkÞMLD
@hkn
hD
k
� @EðDqjD;Mk; hkÞMLD
@hkm
hD
k
Cknm; ðB5Þ
CovðDpDqjD;MkÞMLD ¼EhkCovðDpDqjD;Mk;hkÞMLD
þCovhkEðDpDqjD;Mk;hkÞMLD
’CovðDpDqjD;Mk;hDk Þ
þ12
XNk
n¼1
XNk
m¼1
@2CovðDpDqjD;Mk;hkÞMLD
@hkn@hkm
hD
k
Cknm
þXNk
n¼1
XNk
m¼1
@EðDpjD;Mk;hkÞMLD
@hkn
hD
k
@EðDqjD;Mk;hkÞMLD
@hkm
hD
k
Cknm;
ðB6Þ
where Cknm ¼ EhkðdhkndhkmÞ is the (n,m)th component of CD
k . An anal-ogous approach would apply to step 8 of Appendix A.
Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-(2011), doi:10.1016/j.advwatres.2011.02.007
References
[1] Abbaspour KC, Schulin R, Schlappi E, Fluhler H. A Bayesian approach forincorporating uncertainty and data worth in environmental projects. EnvironModel Assess 1996;1:151–258.
[2] Akaike H. A new look at statistical model identification. IEEE Trans AutomatControl 1974;AC-19:716–22.
[3] Back P-E. A model for estimating the value of sampling programs and theoptimal number of samples for contaminated soil. Environ Geol2007;52:573–85. doi:10.1007/s00254-0488-6.
[4] Ben-Zvi M, Berkowitz B, Kesler S. Pre-posterior analysis as a tool for dataevaluation: application to aquifer contamination. Water Resour Manage1988;2:11–20.
[5] Beven KJ, Binley AM. The future of distributed models: model calibration anduncertainty prediction. Hydrol Process 1992;6:279–98.
[6] Beven KJ, Freer J. Equifinality, data assimilation, and uncertainty estimation inmechanistic modelling of complex environmental systems using the GLUEmethodology. J Hydrol 2001;249:11–29.
[7] Box G. Sampling and Bayes’ inference in scientific modelling and robustness. JRoy Statist Soc, Ser A 1980;143(4):383–430.
[8] Carrera J, Neuman SP. Estimation of aquifer parameters under transient andsteady state conditions: 1. Maximum likelihood method incorporating priorinformation. Water Resour Res 1986;22(2):199–210.
[9] Carrera J, Neuman SP. Estimation of aquifer parameters under transient andsteady state conditions: 3. Application to synthetic and field data. WaterResour Res 1986;22(2):228–42.
[10] Carrera J, Medina A, Axness C, Zimmerman T. Formulations and computationalissues of the inversion of random fields. In: Dagan G, Neuman SP, editors.Subsurface flow and transport: a stochastic approach. Cambridge, UnitedKingdom: Cambridge University Press; 1997. p. 62–79.
[11] Dausman AM, Doherty J, Langevin CD, Sukop MC. Quantifying data worthtoward reducing predictive uncertainty. Ground Water 2010;48(5):729–40.
[12] Davis DR, Kisiel CC, Duckstein L. Bayesian decision theory applied to design inhydrology. Water Resour Res 1972;8(1):33–41.
[13] Davis DR, Dvoranchik WM. Evaluation of the worth of additional data. WaterResour Res 1971;7(4):700–7.
[14] Deutsch CV, Journel AG. GSLIB: geostatistical software library and user’s guide.2nd ed. New York: Oxford University Press; 1998.
[15] Di Federico V, Neuman SP. scaling of random fields by means of truncatedpower variograms and associated spectra. Water Resour Res1997;33(5):1075–85.
[16] Diggle P, Lophaven S. Bayesian geostatistical design. Scand J Statist2006;33:53–64. doi:10.1111/j.1467-9469.2005.00469.x.
[17] Draper D. Assessment and propagation of model uncertainty. J Roy Statist SocB 1995;57(1):45–97.
[18] Feyen L, Gorelick SM. Framework to evaluate the worth of hydraulicconductivity data for optimal groundwater resources management inecologically sensitive areas. Water Resour Res 2005;41:W03019.doi:10.1029/2003WR002901.
[19] Freer J, Beven K, Ambroise B. Bayesian estimation of uncertainty in runoffprediction and the value of data: an application of the GLUE approach. WaterResour Res 1996;32(7):2161–73.
[20] Freeze RA, Massman J, Smith L, Sperling T, James B. Hydrogeological decisionanalysis: 1. A framework. Ground Water 1990;28(5):738–66.
[21] Freeze RA, Brude J, Massman J, Sperling T, Smith L. Hydrogeological decisionanalysis: 4. The concept of data worth and its use in the development of siteinvestigation strategies. Ground Water 1992;30(4):574–88.
[22] Grosser PW, Goodman AS. Determination of groundwater samplingfrequencies through Bayesian decision theory. Civil Eng Syst1985;2(December):186–294.
[23] Gates JS, Kisiel CC. Worth of additional data to a digital computer model of agroundwater basin. Water Resour Res 1974;10(5):1031–8.
[24] Hernandez AF, Neuman SP, Guadagnini A, Carrera J. Inverse stochastic momentanalysis of steady state flow in randomly heterogeneous media. Water ResourRes 2006;42:W05425. doi:10.1029/2005WR004449.
[25] Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: atutorial. Statist Sci 1999;14(4):382–417.
[26] Hurvich CM, Tsai C-L. Regression and time series model selection in smallsample. Biometrika 1989;76(2):99–104.
[27] IT-Corporation. Value of information analysis for corrective action unit no. 98,Frenchman Flat, US Department of Energy, USA; 1997.
[28] James BR, Freeze RA. The worth of data in predicting aquitard continuity inhydrogeological design. Water Resour Res 1993;29(7):2049–65.
[29] James BR, Gorelick SM. When enough is enough: the worth of monitoringdata in aquifer remediation design. Water Resour Res 1994;30(12):3499–513.
[30] James BR, Gwo J-P, Toran L. Risk-cost decision framework for aquiferremediation design. J Water Resour Plan Manage 1996;122(6):414–20.
[31] Kashyap RL. Optimal choice of AR and MA parts in autoregressive movingaverage models. IEEE Trans Pattern Anal Mach Intell 1982;4(2):99–104.
[32] Kass R, Raftery A. Bayes factors. J Am Statist Assoc 1995;90(430):773–95.[33] Kaunas JR, Haimes YY. Risk management of groundwater contamination in a
multiobjective framework. Water Resour Res 1985;21(11):1721–30.[34] Leamer E. Specification searches: ad hoc inference with nonexperimental data.
1st ed. New York: John Wiley & Sons; 1978.
worth considering model and parameter uncertainties. Adv Water Resour
S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx 11
[35] Li X, Tsai FT-C. Bayesian model averaging for groundwater head prediction anduncertainty analysis using multimodel and multimethod. Water Resour Res2009;45:W09403. doi:10.1029/2008WR007488.
[36] Maddock T. Management model as a tool for studying the worth of data. WaterResour Res 1973;9(2):270–80.
[37] McNulty G, Deshler B, Dove H. Value of information analysis. Nevada Test Site,USA; 1997.
[38] Meyer PD, Ye M, Rockhold ML, Neuman SP, Cantrell KJ. Combined estimationof hydrogeologic conceptual model, parameter, and scenario uncertainty withapplication to uranium transport at the Hanford Site 300 Area, NUREG/CR-6940, (PNNL-16396), US Nuclear Regulatory Commission, Washington, DC;2007.
[39] Morales-Casique E, Neuman SP, Vesselinov VV. Maximum likelihood Bayesianaveraging of airflow models in unsaturated tuff. Stochast Environ Res RiskAssess 2010;24:863–80. doi:10.1007/s00477-010-0383-2.
[40] Neuman SP. Maximum likelihood Bayesian averaging of alternativeconceptual-mathematical models. Stochast Environ Res Risk Assess2003;17(5):291–305. doi:10.1007/s00477-003-0151-7.
[41] Norberg T, Rosen L. Calculating the optimal number of contaminant samplesby means of data worth analysis. Environmetrics 2006;17:705–19.
[42] Norrman J. Decision analysis under risk and uncertainty at contaminated sites.A literature review. Department of Geology, Chalmers University ofTechnology, Goteborg; 2001.
[43] Nowak W, de Barros FPJ, Rubin Y. Bayesian geostatistical design: Task-drivensite investigation when the geostatistical model is uncertain. Water ResourRes 2010;46:W035535. doi:10.1029/2009WR008312.
[44] Poeter EP, Hill MC. MMA, a computer code for multi-model analysis. USGeological Survey Techniques and Methods TM6-E3; 2007.
[45] Rada ES, Schultz GA. Worth of hydrological data in water resources projects.Application: Bolivians Amazons zone. In: Braga Jr PF, Fernandez-Jauregui CA,editors. Water management of the Amazon Basin. UNESCO; 1998.
[46] Raftery AE, Madigan D, Volinsky CT. Accounting for model uncertainty insurvival analysis improves predictive performance. In: Bernardo J, Beerger J,Dawid A, Smith A, editors. Bayesian statistics. United Kingdom: OxfordUniversity Press; 1996. p. 323–49.
[47] Reichard EG, Evans JS. Assessing the value of hydrogeological information forrisk-based remedial action decisions. Water Resour Res 1989;25(7):1451–60.
[48] Riva M, Guadagnini A, Neuman SP, Bianchi Janetti E, Malama B. Inverseanalysis of stochastic moment equations for transient flow in randomly
Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-(2011), doi:10.1016/j.advwatres.2011.02.007
heterogeneous media. Adv Water Resour 2009;32:1495–507. doi:10.1016/j.advwatres.2009.07.003.
[49] Rojas R, Feyen L, Batelaan O, Dassargues A. On the value of conditioning data toreduce conceptual model uncertainty in groundwater modeling. Water ResourRes, in press. doi:10.1029/2009WR008822.
[50] Rosen L, LeGrand He. An outline of a guidance framework for assessinghydrogeological risks at early stages. Ground Water 1997;35(2):195–204.
[51] Russell KT, Rabideau AJ. Decision analysis for pump-and-treat design.Groundwater Monitor Remediat 2000(Summer):159–268.
[52] Sain SR, Furrer R. Combining climate model output via model correlations.Stochast Environ Res Risk Assess 2010;24(6):821–9. doi:10.1007/s00477-010-0380-5.
[53] Samper FJ, Neuman SP. Estimation of spatial covariance structures by adjointstate maximum likelihood cross-validation: 1. Theory. Water Resour Res1989;25(3):351–62.
[54] Sohn M, Small M, Pantazidou M. Reducing uncertainty in site characterizationusing Bayes Monte Carlo methods. J Environ Eng – ASCE2000;126(10):893–902. doi:10.1061/(ASCE)0733-9372(2000)126:10(893).
[55] Thompson M, Fearn T. What exactly is fitness for purpose in analyticalmeasurements? Analyst 1996;121(March):275–8.
[56] Volinsky CT, Madigan D, Raftery AE, Kronmal RA. Bayesian model averaging inproportional hazard models: assessing the risk of a stroke. J Roy Statist Soc, SerC 1997;46:433–48.
[57] Ye M, Neuman SP, Meyer PD. Maximum Likelihood Bayesian averaging ofspatial variability models in unsaturated fractured tuff. Water Resour Res2004;40:W05113. doi:10.1029/2003WR002557.
[58] Ye M, Neuman SP, Meyer PD, Pohlmann KF. Sensitivity analysis andassessment of prior model probabilities in MLBMA with application tounsaturated fractured tuff. Water Resour Res 2005;41:W12429. doi:10.1029/2005WR004260.
[59] Ye M, Meyer PD, Neuman SP. On model selection criteria in multimodelanalysis. Water Resour Res 2008;44:W03428. doi:10.1029/2008WR006803.
[60] Ye M, Lu D, Neuman SP, Meyer PD. Comment on ‘‘Inverse groundwatermodeling for hydraulic conductivity estimation using Bayesian modelaveraging and variance window’’ by Frank T.-C. Tsai and Xiaobao Li. WaterResour Res 2010;46:W02801. doi:10.1029/2009WR008501.
[61] Yokota F, Thompson K. Value of information literature analysis: a reviewof applications in health risk assessment. Med Decis Making 2004;24:287–98.
worth considering model and parameter uncertainties. Adv Water Resour
top related