Data Mining of Environmental Models for Sensitivity Analysis
Tom Stockton
Paul Black, Andy Schuh, Kate Catlett, John Tauxe
Neptune and Company, Inc.
www.neptuneandco.com
Knowledge Re-discovery
Issue
How to conduct a sensitivity analysis of a complex, high-dimensional probabilistic environmental model?
Decision Modeling
1. Decision model, build and solve
  – Decision actions and outcomes
  – Utility (costs, liabilities, desires)
  – Probabilistic model
    • Scenario
    • Model
    • Parameter
2. Sensitivity analysis (knowledge re-discovery)
3. Value of information analysis (OUT-path)
4. Data collection
5. Update model (Bayesian or ad hoc)
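Decision Modeling
$$U(d \mid I) = \sup_d \int_S \int_M \int_{\theta_M} \int_Y U(d \mid y, S, M, \theta_M)\, p(S)\, p(M \mid S)\, p(\theta_M \mid S)\, p(I \mid \theta_M, M, S)\, p(y \mid \theta_M, M, S)\; dy\, dS\, dM\, d\theta_M$$
with the factors:
• $U(d \mid y, S, M, \theta_M)$: utility function
• $p(S)$: scenario uncertainty
• $p(M \mid S)$: model uncertainty
• $p(\theta_M \mid S)$: parameter uncertainty
• $p(I \mid \theta_M, M, S)$: data likelihood
• $p(y \mid \theta_M, M, S)$: risk predictive distribution
where:
U = utility, loss, cost
d = decision
I = information/data
y = risk
M = model structure
$\theta_M$ = model parameters
S = scenario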
Sensitivity Analysis
Given a model:
Y = f (X) [Y = GoldSim(X)]
Sensitivity analysis is aimed at describing the influence of each input variable Xi on the model response Y
Sensitivity Measures
• One-At-A-Time (OAT)
• Differential Analysis
$$\frac{\partial f(X)}{\partial X_i}$$
• Global
  – Statistical
    • scatter plots, correlation, regression, rank transformations
  – Data mining
    • Sobol, FAST, MARS, MART
Desirable Properties of an SA Measure
• Efficiency
  – account for all effects while being computationally affordable
• Simplicity
  – implementable and interpretable
• Model Independent
  – the method can handle non-linearity and non-monotonicity (across time and space)
K. Chan, S. Tarantola, and A. Saltelli, 2000, Variance-Based Methods, in Sensitivity Analysis, A. Saltelli, K. Chan, E.M. Scott (eds.), John Wiley and Sons.
Sensitivity Measures
• OAT and Differential Analysis, for complex probabilistic models, are often
  – not efficient, and
  – not model independent
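A minimal sketch of why OAT is not model independent, assuming a hypothetical toy response (`toy_model`) with an interaction term; the model, base points, and step size are illustrative and not taken from the presentation. The same input can look inert or influential depending on where the other inputs are held.

```python
import numpy as np

def oat_effects(model, base, delta=1e-3):
    """One-at-a-time effects: perturb each input alone from a base point
    and report the finite-difference slope of the response."""
    base = np.asarray(base, dtype=float)
    y0 = model(base)
    effects = np.empty(base.size)
    for i in range(base.size):
        x = base.copy()
        x[i] += delta                      # move only input i
        effects[i] = (model(x) - y0) / delta
    return effects

def toy_model(x):
    # Hypothetical response with an interaction between x[1] and x[2].
    return x[0] + 2.0 * x[1] * x[2]

# Same model, two base points that differ only in x[1].
effects_a = oat_effects(toy_model, [1.0, 0.0, 3.0])
effects_b = oat_effects(toy_model, [1.0, 3.0, 3.0])
print(effects_a)  # slope for x[2] is 0 because x[1] = 0 at this base point
print(effects_b)  # slope for x[2] is about 6 at the second base point
```

Because the result depends on the base point, a full picture would need many base points, which is what makes OAT expensive for high-dimensional models.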
Global Sensitivity Measures
• Sensitivity measure
$$S_i = \frac{\operatorname{Var}_{X_i}\left[\operatorname{E}(Y \mid x_i)\right]}{\operatorname{Var}(Y)}$$
• Build a statistical model of the model response and the model inputs using the Monte Carlo simulation results
• Decompose the variance of the output and attribute it to the input variables
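The variance ratio above can be estimated directly from Monte Carlo results. The following is a rough sketch, not the method used in the presentation: it approximates E(Y | x_i) by averaging Y inside equal-count bins of x_i. The function name `first_order_index`, the bin count, and the linear test model are assumptions made for illustration.

```python
import numpy as np

def first_order_index(x, y, n_bins=20):
    """Estimate Var[E(Y | x)] / Var(Y) from Monte Carlo samples by
    averaging y inside equal-count bins of x."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    # Bin label 0..n_bins-1 for every sample, using interior edges only.
    labels = np.searchsorted(edges[1:-1], x, side="right")
    counts = np.bincount(labels, minlength=n_bins)
    sums = np.bincount(labels, weights=y, minlength=n_bins)
    used = counts > 0
    cond_means = sums[used] / counts[used]
    # Count-weighted variance of the conditional means around the overall mean.
    var_cond_mean = np.sum(counts[used] * (cond_means - y.mean()) ** 2) / y.size
    return var_cond_mean / y.var()

# Hypothetical test model with independent uniform inputs: Y = X1 + 0.5 X2,
# for which the exact indices are 0.8 and 0.2.
rng = np.random.default_rng(0)
X = rng.uniform(size=(20000, 2))
Y = X[:, 0] + 0.5 * X[:, 1]
S = [first_order_index(X[:, i], Y) for i in range(2)]
print(S)
```

Binning is only practical for main effects and for a modest number of inputs, which is part of the motivation for fitting a statistical model to the simulation results instead.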
Standardized Rank Regression
SRR
– Rank Y and X_i and scale the ranks to a mean of 0 and a variance of 1 for convenience
– Based on the ranks of Y and X_i:
$$y = \sum_{i=1}^{p} \beta_i x_i$$
– Assuming the X_i are independent:
$$\operatorname{Var}(Y) = \sum_{i=1}^{p} \beta_i^2 \operatorname{Var}(X_i) \quad \text{so} \quad S_i = \beta_i^2$$
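The SRR recipe can be sketched in a few lines of NumPy. This is an illustrative implementation under the stated independence assumption; the helper names (`ranks`, `srr_sensitivity`) and the monotone test model are hypothetical.

```python
import numpy as np

def ranks(v):
    """Ranks 1..n of a 1-D array (ties broken by order of appearance)."""
    order = np.argsort(v, kind="stable")
    r = np.empty(v.size, dtype=float)
    r[order] = np.arange(1, v.size + 1)
    return r

def srr_sensitivity(X, y):
    """Squared standardized rank regression coefficients, one per input."""
    Xr = np.column_stack([ranks(X[:, i]) for i in range(X.shape[1])])
    yr = ranks(y)
    # Scale ranks to mean 0 and variance 1.
    Xs = (Xr - Xr.mean(axis=0)) / Xr.std(axis=0)
    ys = (yr - yr.mean()) / yr.std()
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta ** 2

# Hypothetical monotone, nonlinear model; the third input has no effect.
rng = np.random.default_rng(1)
X = rng.uniform(size=(2000, 3))
y = np.exp(3.0 * X[:, 0] + X[:, 1])
S = srr_sensitivity(X, y)
print(S)
```

The rank transform handles the monotone nonlinearity here, but the same code would assign almost no sensitivity to an input whose effect is non-monotonic.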
Fourier Amplitude Sensitivity Test
FAST
– Explores the multidimensional space of the input factors along a search curve and analyzes the model response along that curve with a Fourier transform.
– Handles main and interaction effects
K. Chan, S. Tarantola, and A. Saltelli, 2000, Variance-Based Methods, in Sensitivity Analysis, A. Saltelli, K. Chan, E.M. Scott (eds.), John Wiley and Sons.
Issues
• Differential Analysis
  – not feasible: requires derivatives of complex models
• SRR and OAT
  – not model independent: trouble with nonmonotonic, nonlinear models
  – not efficient: trouble with interaction effects in high-dimensional models
• FAST
  – not efficient: requires separate model runs
Possible Solutions
• Data mine the probabilistic model output
  – Multivariate Adaptive Regression Splines (MARS)
  – Multiple Additive Regression Trees (MART)
Data Mining
• MARS
  – Non-parametric recursive partitioning approach that fits separate splines to distinct intervals of the predictor variables.
• MART
  – Explores the multidimensional space of the input factors using gradient boosting of additive regression models.
• Advantages
  – Search for interactions between variables, allowing any degree of interaction to be considered.
  – Tracks very complex data structures in high-dimensional data.
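Sensitivity Indices via ANOVA Decomposition
$$\hat f(\mathbf{X}) = a_0 + \sum_{K_m = 1} f_i(x_i) + \sum_{K_m = 2} f_{ij}(x_i, x_j) + \sum_{K_m = 3} f_{ijk}(x_i, x_j, x_k)$$
$$\hat f_{-s}(\mathbf{X}) = a_0 + \sum_{i \neq s} f_i(x_i) + \sum_{\{i,j\} \not\ni s} f_{ij}(x_i, x_j) + \sum_{\{i,j,k\} \not\ni s} f_{ijk}(x_i, x_j, x_k)$$
$$\mathrm{SPR}_{-s} = \sum \left(y - \hat f_{-s}(\mathbf{X})\right)^2$$
$$S_s = \frac{\mathrm{SPR}_{-s}}{\mathrm{SPR}}$$
Sensitivity indices are calculated using basis functions not including x_s.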
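The idea of dropping the basis functions that involve x_s can be illustrated with a small additive spline fit. This is a sketch under several assumptions not in the slides: fixed knots instead of an adaptive MARS search, main-effect hinge functions only, and mean-centered basis columns so that each dropped component has zero mean. The names `hinge_basis` and `spr_without` are hypothetical.

```python
import numpy as np

def hinge_basis(X, knots):
    """Mirrored hinge functions max(0, x_i - t), max(0, t - x_i) for each
    input i and knot t, plus the index of the input owning each column."""
    cols, owner = [], []
    for i in range(X.shape[1]):
        for t in knots:
            cols.append(np.maximum(0.0, X[:, i] - t))
            cols.append(np.maximum(0.0, t - X[:, i]))
            owner += [i, i]
    B = np.column_stack(cols)
    return B - B.mean(axis=0), np.array(owner)   # centered columns

def spr_without(X, y, knots=(0.25, 0.5, 0.75)):
    """Residual sum of squares of the fitted model after removing, in turn,
    every basis function that involves input s."""
    B, owner = hinge_basis(X, knots)
    A = np.column_stack([np.ones(y.size), B])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    spr = np.empty(X.shape[1])
    for s in range(X.shape[1]):
        keep = np.concatenate([[True], owner != s])   # intercept always kept
        resid = y - A[:, keep] @ coef[keep]
        spr[s] = np.sum(resid ** 2)
    return spr

# Hypothetical test model: strong x0 effect, weak x1 effect, no x2 effect.
rng = np.random.default_rng(3)
X = rng.uniform(size=(500, 3))
y = 4.0 * X[:, 0] + X[:, 1] + 0.01 * rng.normal(size=500)
spr = spr_without(X, y)
print(spr)
```

A larger residual sum of squares after removing an input's basis functions indicates that the response is more sensitive to that input.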
Analytical Example
Sobol’ g-function
$$y = \prod_{i=1}^{p} g_i(x_i)$$
$$g_i(x_i) = \frac{|4x_i - 2| + a_i}{1 + a_i}$$
$$x_i = \frac{1}{2} + \frac{1}{\pi} \arcsin\left(\sin(\omega_i s)\right)$$
Saltelli A., Tarantola S., and Chan K.P.-S. (1999), “A Quantitative Model-Independent Method for Global Sensitivity Analysis of Model Output,” Technometrics, 41, 39-55.
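For reference, the g-function is straightforward to code. The sketch below also computes first-order indices from the standard closed form for independent inputs uniform on [0, 1], where each factor has partial variance (1/3)/(1 + a_i)^2. The slides do not state which formula or normalization produced their analytic values, so these numbers need not match any tabulated column exactly; the function names are illustrative.

```python
import numpy as np

def g_function(x, a):
    """Sobol' g-function; x has shape (n, p) or (p,), a has shape (p,)."""
    x = np.atleast_2d(x)
    return np.prod((np.abs(4.0 * x - 2.0) + a) / (1.0 + a), axis=1)

def analytic_first_order(a):
    """First-order indices V_i / V for independent inputs uniform on [0, 1]."""
    a = np.asarray(a, dtype=float)
    v_i = (1.0 / 3.0) / (1.0 + a) ** 2          # partial variance of each factor
    v_total = np.prod(1.0 + v_i) - 1.0          # total variance of the product
    return v_i / v_total

a = np.array([0.0, 1.0, 4.5, 9.0, 99.0, 99.0, 99.0, 99.0])
S = analytic_first_order(a)
print(np.round(S, 4))
```

Small values of a_i make an input important; a_i = 99 makes it nearly inert.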
Example: Sobol’ g-function
                       Sensitivities
Input  a     ω     Analytic  MART    MARS    FAST    SRR
x1     0     23    0.73      0.565   0.733   0.773   0.0005
x2     1     55    0.23      0.281   0.224   0.193   0.0015
x3     4.5   77    0.032     0.094   0.036   0.025   0.045
x4     9     97    0.009     0.05    0.009   0.008   0.197
x5     99    107   0.0001    0.005   0.0006  0.0002  0.207
x6     99    113   0.0001    0.004   0.0000  0.0005  0.437
x7     99    121   0.0001    0.0     0.0000  0.0001  0.007
x8     99    125   0.0001    0.0     0.0000  0.0002  0.105
Saltelli A., Tarantola S., and Chan K.P.-S. (1999), “A Quantitative Model-Independent Method for Global Sensitivity Analysis of Model Output,” Technometrics, 41, 39-55.
[Figure: decision-analysis flow diagram. Recoverable labels: Management Options (Institutional Controls, Site Maintenance, Waste Acceptance, Closure, Monitoring/Surveillance); Existing Inventory; Future Inventory; Fate & Transport; Contamination; Risk; Ecosystem; MOP & IHI; Occupational; Cumulative (CA); Regulations & Guidance; Cost; Analysis Costs; ALARA Costs; Monitoring Costs; Disposal Costs; Closure Costs; Disposal Fees; Potential Liabilities; Public Benefit; Budgets; Cost-Benefit Analysis; Management; Uncertainty analysis; Sensitivity analysis; Value of Information; Research, Monitoring, Information & Data Collection; Choose Management Options & Update Management Plan; Maintenance Review, Periodic Review, Waste Acceptance Decision, Closure Decision. Decision node: "Can the risk be managed to regulatory thresholds at an acceptable cost with an acceptable level of uncertainty?" (YES / NO). Legend: sequence numbers 1 to 8 and an iteration loop.]
Simulation Results
• Model inputs (X)
  – Inventory
  – Fate and transport
    • Upward advection
    • Biotic transport
• Model response (Y)
  – “EPA-SUM”
Model Response
[Figure: probability distribution of the model response. x-axis: EPA Sum (log scale, ticks from 1.0e-030 to 1.0e+001); y-axis: Probability (ticks from 0.0001 to 1.0000).]
Relative Influence Plot
[Figure: relative influence (0 to 1) of Ant2 MaxDepth, Ant2 NestWidth, Dry Bulk Density, Kd Np, Kd U, Solubility U, Termite1 b, and Upward Flux Rate. Key: MART, SRR.]
Partial Dependence Plots
[Figure: eight panels of partial dependence (y-axis) against Upward Flux Rate, Kd Np, Termite1 b, Ant2 NestWidth, Dry Bulk Density, Kd U, Solubility U, and Ant2 MaxDepth. Key: MART, SRR, Density.]
Co-partial Dependence Plot
Variation Explained
        Time     SRR    MART/MARS
GCD     10,000   0.91   0.99
LANL    50       0.87   0.94
        100      0.86   0.96
        500      0.75   0.91
        1,000    0.71   0.95
        10,000   0.71   0.93
Sensitivity Convergence
[Figure: measure of relative sensitivity (x-axis) by simulation size (y-axis) for 100, 500, 1000, 2500, and 5000 simulations, each with MART and SRR. Panels: Upward.Flux.Rate, Kd.def.Kd.Np, Ant2.Data.NestWidth, Dry.Bulk.Density, Kd.def.Kd.U, Termite1.Data.b.]
Upward Flux OAT
[Figure: EPA Sum (y-axis, log scale from 1e-18 to 1e-10) against Upward flux rate (x-axis, 0.00005 to 0.00035).]
Summary
• MART and MARS appear to provide an
  – Efficient
  – Simple (?)
  – Model Independent
  approach to data mining probabilistic model results for sensitivity analysis
Finally…
• The decision context:
  – Is the uncertainty in the model response too high?
  – Is there value in reducing input uncertainty?
  – SA and cost are used to estimate the value of collecting additional information.
FAST
$$y = \sum_j \{A_j \cos(js) + B_j \sin(js)\} \qquad (1)$$
where $A_j$ and $B_j$ are the Fourier coefficients and can be estimated via a fast Fourier transform algorithm.
The spectrum of the Fourier transform is
$$\Lambda_j = A_j^2 + B_j^2 \qquad (2)$$
Summing all $\Lambda_j$ provides an estimate of the total variance in $y$:
$$\hat D = \sum_{j \in Z} \Lambda_j \qquad (3)$$
Summing $\Lambda_j$ over only the frequency embedded in $x_i$ and its associated higher harmonics, $Z^0$, provides an estimate of the variance due to the uncertainty in $x_i$:
$$\hat D_i = \sum_{j \in Z^0} \Lambda_j \qquad (4)$$
The sensitivity of $y$ to $x_i$ is then given by
$$\hat S_i = \hat D_i / \hat D \qquad (5)$$
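Steps (1) to (5) can be sketched with NumPy as follows. This is a simplified classic FAST under stated assumptions, not the authors' implementation: Fourier coefficients are computed by direct summation rather than an FFT, only positive frequencies are summed (the resulting factor of 2 cancels in the ratio), and four harmonics per input are used. The test function, sample size, and function names are choices made for the example; the frequencies are taken from the g-function example.

```python
import numpy as np

def g_function(x, a):
    """Sobol' g-function for x of shape (n, p) and a of shape (p,)."""
    return np.prod((np.abs(4.0 * x - 2.0) + a) / (1.0 + a), axis=1)

def fast_first_order(model, omegas, n_samples=2049, n_harmonics=4):
    """Classic FAST first-order indices along a single search curve."""
    omegas = np.asarray(omegas, dtype=int)
    k = np.arange(1, n_samples + 1)
    s = np.pi * (2 * k - n_samples - 1) / n_samples        # grid on (-pi, pi)
    x = 0.5 + np.arcsin(np.sin(np.outer(s, omegas))) / np.pi   # search curve
    y = model(x)
    j = np.arange(1, (n_samples - 1) // 2 + 1)             # positive frequencies
    A = y @ np.cos(np.outer(s, j)) / n_samples             # Fourier coefficients
    B = y @ np.sin(np.outer(s, j)) / n_samples
    spectrum = A ** 2 + B ** 2                             # Lambda_j
    D = 2.0 * spectrum.sum()                               # total variance
    harmonics = np.outer(omegas, np.arange(1, n_harmonics + 1))
    D_i = 2.0 * spectrum[harmonics - 1].sum(axis=1)        # omega_i and harmonics
    return D_i / D, D, y

a = np.array([0.0, 1.0, 4.5, 9.0, 99.0, 99.0, 99.0, 99.0])
omegas = [23, 55, 77, 97, 107, 113, 121, 125]
S, D, y = fast_first_order(lambda x: g_function(x, a), omegas)
print(np.round(S, 4))
```

The sample size has to exceed twice the highest harmonic retained (here 4 × 125), which is why FAST needs its own dedicated set of model runs rather than reusing an existing Monte Carlo sample.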
MARS
• Non-parametric recursive partitioning approach that fits separate splines to distinct intervals of the predictor variables.
• Both the selected variables and the knots are found via a brute-force, exhaustive search procedure, optimized simultaneously by evaluating a "lack of fit" criterion.
• Searches for interactions between variables, allowing any degree of interaction to be considered.
• Tracks very complex data structures in high-dimensional data.
J.H. Friedman, (1991), “Multivariate Adaptive Regression Splines,” The Annals of Statistics, 19, 1-14
Software: Trevor Hastie and Robert Tibshirani, MDA Library for R (‘GNU S’).
Ross Ihaka and Robert Gentleman, (1996) R: A Language for Data Analysis and Graphics, Journal of Computational and Graphical Statistics, 5, 3, 299-314. www.r-project.org.
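The brute-force search over variables and knots described on the MARS slide can be illustrated with a single forward step. This sketch is not the MDA library's algorithm: it uses the plain residual sum of squares as a stand-in for the lack-of-fit criterion, adds only one mirrored hinge pair, and does no pruning. The function name and test data are hypothetical.

```python
import numpy as np

def best_hinge_pair(X, y):
    """Try every (variable, knot) pair and keep the mirrored hinge pair
    max(0, x - t), max(0, t - x) with the smallest residual sum of squares."""
    n, p = X.shape
    best_rss, best_var, best_knot = np.inf, None, None
    for i in range(p):
        for t in np.unique(X[:, i])[1:-1]:       # interior data values as knots
            basis = np.column_stack([
                np.ones(n),
                np.maximum(0.0, X[:, i] - t),
                np.maximum(0.0, t - X[:, i]),
            ])
            coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
            rss = np.sum((y - basis @ coef) ** 2)
            if rss < best_rss:
                best_rss, best_var, best_knot = rss, i, t
    return best_var, best_knot, best_rss

# Hypothetical non-monotonic response with a kink at 0.3 in the second input.
rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 3))
y = np.abs(X[:, 1] - 0.3)
var, knot, rss = best_hinge_pair(X, y)
print(var, knot)
```

Because each hinge pair is local to one side of the knot, this kind of basis can follow the non-monotonic shapes that defeat rank regression.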
MART
• Multiple Additive Regression Trees
  – Explores the multidimensional space of the input factors using gradient boosting of additive regression models.
  – Handles main and interaction effects.
  – Fast
K. Chan, S. Tarantola, and A. Saltelli, 2000, Variance-Based Methods, in Sensitivity Analysis, A. Saltelli, K. Chan, E.M. Scott (eds.), John Wiley and Sons.
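The gradient-boosting idea behind MART can be sketched with one-split trees (stumps) under squared-error loss. This is a deliberately reduced illustration, not the MART software: stumps capture only main effects, whereas deeper trees would be needed for interactions, and the relative influence here is simply each input's share of the accumulated squared-error reduction. All names and tuning values are assumptions.

```python
import numpy as np

def best_stump(X, r):
    """Best single split (gain, variable, threshold, left mean, right mean)
    for residuals r, found by scanning every split point of every input."""
    n, p = X.shape
    k = np.arange(1, n)                          # candidate left-node sizes
    best = (-np.inf, 0, 0.0, 0.0, 0.0)
    for j in range(p):
        order = np.argsort(X[:, j])
        xs, rs = X[order, j], r[order]
        left = np.cumsum(rs)[:-1]
        right = rs.sum() - left
        # Reduction in squared error achieved by each candidate split.
        gain = left ** 2 / k + right ** 2 / (n - k) - rs.sum() ** 2 / n
        gain[xs[1:] == xs[:-1]] = -np.inf        # cannot split between ties
        i = int(np.argmax(gain))
        if gain[i] > best[0]:
            best = (gain[i], j, 0.5 * (xs[i] + xs[i + 1]),
                    left[i] / k[i], right[i] / (n - k[i]))
    return best

def mart_relative_influence(X, y, n_trees=200, learning_rate=0.1):
    """Boost stumps on the residuals and accumulate split gains per input."""
    pred = np.full(y.size, y.mean())
    influence = np.zeros(X.shape[1])
    for _ in range(n_trees):
        gain, j, thr, left_mean, right_mean = best_stump(X, y - pred)
        influence[j] += max(gain, 0.0)
        pred += learning_rate * np.where(X[:, j] <= thr, left_mean, right_mean)
    return influence / influence.sum(), pred

# Hypothetical test: four-input Sobol' g-function with a = (0, 1, 4.5, 9).
rng = np.random.default_rng(4)
X = rng.uniform(size=(1000, 4))
a = np.array([0.0, 1.0, 4.5, 9.0])
y = np.prod((np.abs(4.0 * X - 2.0) + a) / (1.0 + a), axis=1)
rel, pred = mart_relative_influence(X, y)
print(np.round(rel, 3))
```

Because the trees are fitted to an existing sample of inputs and responses, the same Monte Carlo runs used for uncertainty analysis can be reused for sensitivity analysis.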