Inference for complex models
Richard Wilkinson, r.d.wilkinson@sheffield.ac.uk
School of Maths and Statistics, University of Sheffield
23 April 2015


  • Inference for complex models

    Richard Wilkinson, r.d.wilkinson@sheffield.ac.uk

    School of Maths and Statistics, University of Sheffield

    23 April 2015

  • Computer experiments

    Rohrlich (1991): Computer simulation is

    ‘a key milestone somewhat comparable to the milestone that started the empirical approach (Galileo) and the deterministic mathematical approach to dynamics (Newton and Laplace)’

    Challenges for statistics: how do we make inferences about the world from a simulation of it?

    how do we relate simulators to reality?

    how do we estimate tunable parameters?

    how do we deal with computational constraints?

    how do we make uncertainty statements about the world that combine models, data and their corresponding errors?

    There is an inherent lack of quantitative information on the uncertainty surrounding a simulation, unlike in physical experiments.


  • Bayesian statistics

    Represent all uncertainties as probability distributions:

    π(θ|D) = π(D|θ)π(θ) / π(D)

    π(θ|D) is the posterior distribution.
    - Always hard to compute: SMC², PGAS, tempered NUTS-HMC

    π(D|θ) is the likelihood function.
    - For complex models it can be slow to compute: GP emulators
    - It can also be impossible to compute in some cases: ABC
    - Often an intractable integral over latent variables X:

    π(D|θ) = ∫ π(D|X) π(X|θ) dX

    Relating the simulator to reality can make specifying π(D|θ) particularly difficult: simulator discrepancy modelling

    π(D) is the model evidence or normalising constant.
    - Requires us to integrate, and is thus harder to compute than π(θ|D): SMC², nested sampling
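For a toy one-dimensional problem, all three quantities in Bayes' rule can be computed by brute force on a grid, which makes their roles concrete. A minimal sketch in Python, with a hypothetical Gaussian prior and likelihood chosen purely for illustration (none of this is from the slides):

```python
import numpy as np

# Hypothetical toy problem: theta ~ N(0, 4) a priori, D | theta ~ N(theta, 1).
# On a 1-d grid, Bayes' rule can be normalised directly; the methods named
# above (SMC^2, nested sampling, ...) exist because this brute force fails
# once theta is high dimensional or the likelihood is slow or intractable.
theta = np.linspace(-10.0, 10.0, 2001)
D = 1.5

prior = np.exp(-theta**2 / (2 * 4.0)) / np.sqrt(2 * np.pi * 4.0)   # pi(theta)
likelihood = np.exp(-0.5 * (D - theta)**2) / np.sqrt(2 * np.pi)    # pi(D|theta)

unnormalised = likelihood * prior
evidence = unnormalised.sum() * (theta[1] - theta[0])   # pi(D), by quadrature
posterior = unnormalised / evidence                     # pi(theta|D)

print(evidence, theta[np.argmax(posterior)])
```

Even here, computing π(D) means integrating over θ; every extra dimension multiplies the grid size, which is exactly why the evidence is the hardest of the three quantities.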


  • Uncertainty Quantification (UQ) for computer experiments

    Calibration
    - Estimate unknown parameters θ
    - Usually via the posterior distribution π(θ|D)
    - Or history matching

    Uncertainty analysis
    - f(x) a complex simulator. If we are uncertain about x, e.g., X ∼ π(x), what is π(f(X))?

    Sensitivity analysis
    - X = (X1, . . . , Xd)^T. Can we decompose Var(f(X)) into contributions from each Var(Xi)?
    - If we can improve our knowledge of any Xi, which should we choose to minimise Var(f(X))?

    Simulator discrepancy
    - f(x) is imperfect. How can we quantify or correct simulator discrepancy?

    Data assimilation
    - Find π(x1:t | y1:t)
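When the simulator is cheap enough to run many times, the uncertainty and sensitivity analyses above can both be approximated by plain Monte Carlo. A sketch with a hypothetical additive stand-in for f (chosen so the variance decomposition is exact), not any simulator from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in simulator (a real f would be expensive): additive form, so
# Var(f(X)) decomposes exactly into per-input contributions.
def f(x):
    return 3.0 * x[..., 0] + x[..., 1] ** 2

# Uncertainty analysis: push input uncertainty X ~ pi(x) through f.
X = rng.normal(size=(100_000, 2))
fX = f(X)
print(fX.mean(), fX.var())

# First-order sensitivity of X0: Var(E[f(X) | X0]) / Var(f(X)).
# For this f, E[f | X0] = 3 X0 + E[X1^2], so the index is 9 / (9 + 2) = 9/11.
cond_mean = 3.0 * X[:, 0] + (X[:, 1] ** 2).mean()
S0 = cond_mean.var() / fX.var()
print(S0)
```

For a general (non-additive) f the conditional expectation is not available in closed form and the first-order indices are estimated with dedicated Sobol' sampling schemes; the idea is the same.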


  • Meta-modelling

    Surrogate modelling

    Emulation

  • Code uncertainty

    For complex simulators, run times might be long, ruling out brute-force approaches such as Monte Carlo methods.

    Consequently, we will only know the simulator output at a finite number of points. We call this code uncertainty.

    All inference must be done using a finite ensemble of model runs

    Dsim = {(θi, f(θi))}i=1,...,N

    If θ is not in the ensemble, then we are uncertain about the value of f(θ).


  • Meta-modelling

    Idea: if the simulator is expensive, build a cheap model of it and use this in any analysis.

    ‘a model of the model’

    We call this meta-model an emulator of our simulator.

    Gaussian process emulators are the most popular choice of emulator.

    Built using an ensemble of model runs Dsim = {(θi, f(θi))}i=1,...,N, they give an assessment of their prediction accuracy, π(f(θ)|Dsim).


  • Meta-modelling: Gaussian Process Emulators

    Gaussian processes provide a flexible nonparametric distribution for our prior beliefs about the functional form of the simulator:

    f(·) ∼ GP(m(·), σ²c(·, ·))

    where m(·) is the prior mean function, and c(·, ·) is the prior covariance function (positive semi-definite). Gaussian processes are closed under Bayesian updating: the posterior f(·)|Dsim is again a Gaussian process.

    Definition: if f(·) ∼ GP(m(·), σ²c(·, ·)), then for any collection of inputs x1, . . . , xn the vector

    (f(x1), . . . , f(xn))^T ∼ MVN(m(x), σ²Σ)

    where Σij = c(xi, xj).
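The definition above is what makes emulation practical: conditioning the multivariate normal on the ensemble gives the posterior mean and variance of f at new inputs in closed form. A minimal numpy sketch, using sin as a stand-in for the expensive simulator and hand-picked, fixed hyperparameters (in practice m, σ² and the correlation lengths would be estimated):

```python
import numpy as np

# Squared-exponential covariance sigma^2 c(a, b); hyperparameters fixed by hand.
def k(a, b, sigma2=1.0, ell=1.0):
    return sigma2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

f = np.sin                                   # stand-in for the expensive simulator
x_train = np.linspace(0.0, 2 * np.pi, 8)     # design points of the ensemble D_sim
y_train = f(x_train)                         # (zero prior mean m = 0 assumed)

x_star = np.linspace(0.0, 2 * np.pi, 100)
K = k(x_train, x_train) + 1e-8 * np.eye(8)   # jitter for numerical stability
K_s = k(x_star, x_train)

# MVN conditioning: posterior mean and variance of f(x_star) given D_sim
mean = K_s @ np.linalg.solve(K, y_train)
var = k(x_star, x_star).diagonal() - np.einsum("ij,ji->i", K_s, np.linalg.solve(K, K_s.T))

print(np.abs(mean - f(x_star)).max(), var.max())
```

The posterior variance is zero at the design points and grows between them, which is exactly the quantified code uncertainty π(f(θ)|Dsim) described above.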


  • Gaussian Process Illustration: zero mean

  • Gaussian Process Illustration

  • Gaussian Process Illustration

  • Challenges

    Design: if we can afford n simulator runs, which parameters should we run it at?

    High dimensional inputs
    - If θ is multidimensional, then even short run times can rule out brute force approaches

    High dimensional outputs
    - Spatio-temporal.

    Incorporating physical knowledge

    Difficult behaviour, e.g., switches, step-functions, non-stationarity...

  • Uncertainty quantification for Carbon Capture and Storage (EPSRC): transport

    [Figure: pressure-volume isotherms at 290K, 296K, 304.2K (Tc), 307K and 310K, showing the critical point and the co-existing vapour and liquid branches. Axes: Volume [m^3/mol] against Pressure [MPa].]

    Technical challenges:

    How do we find non-parametric Gaussian process models that i) obey the fugacity constraints and ii) have the correct asymptotic behaviour?

    How do we fit parametric equations of state (Peng-Robinson and variants)? Tempered NUTS-HMC.

  • Storage

    Knowledge of the physical problem is encoded in a simulator f.

    Inputs: permeability field K (2d field)

    ↓ f(K)

    Outputs: stream function (2d field), concentration (2d field), surface flux (1d scalar), ...

    [Figures: the input permeability field, the true truncated streamfield, and the true truncated concfield. Surface flux = 6.43, ...]

  • CCS examples

    Left = true, right = emulated; 118 training runs, held-out test set.

    [Figures: true and emulated streamfield; true and emulated concfield.]

  • ABC: inference for complex stochastic models

  • Estimating Divergence Times

  • Forward simulation

    Model evolution and fossil finds.

    Let τ be the temporal gap between the divergence time and the oldest fossil.

    The posterior for τ is then used as a prior for a genetic analysis.

    The likelihood function π(D|θ) is intractable, but the model is cheap to simulate from.

  • Approximate Bayesian Computation (ABC)

    Wilkinson 2008/2013, Wilkinson and Tavaré 2009

    If the likelihood function is intractable, then ABC is one of the few approaches we can use to do inference.

    Uniform Rejection Algorithm
    - Draw θ from π(θ)
    - Simulate X ∼ f(θ)
    - Accept θ if ρ(D, X) ≤ ε

    ε reflects the tension between computability and accuracy.

    As ε → ∞, we get observations from the prior, π(θ). If ε = 0, we generate observations from π(θ|D).

    ABC does not require explicit knowledge of the likelihood function.


  • ε = 10

    [Figure: left, scatter of θ against D with the acceptance band ±ε around the observed D marked; right, the ABC posterior density for θ against the true posterior.]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2
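The rejection algorithm on this toy model is only a few lines of code. A vectorised sketch (seed and sample size are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# The toy model above: theta ~ U[-10, 10], X ~ N(2(theta+2)theta(theta-2), 0.1 + theta^2),
# with rho(D, X) = |D - X| and observed data D = 2.
def simulate(theta):
    mean = 2 * (theta + 2) * theta * (theta - 2)
    return rng.normal(mean, np.sqrt(0.1 + theta**2))

D, eps, N = 2.0, 1.0, 200_000
theta = rng.uniform(-10.0, 10.0, N)        # draw theta from the prior
X = simulate(theta)                        # simulate pseudo-data X ~ f(theta)
accepted = theta[np.abs(X - D) <= eps]     # accept if rho(D, X) <= eps

print(len(accepted) / N)
```

Shrinking eps trades acceptance rate for accuracy, which is exactly the tension the sequence of ε values below illustrates.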

  • ε = 7.5

    [Figure: as above, with a narrower acceptance band and the resulting ABC density against the true posterior.]

  • ε = 5

    [Figure: as above, with a narrower acceptance band and the resulting ABC density against the true posterior.]

  • ε = 2.5

    [Figure: as above, with a narrower acceptance band and the resulting ABC density against the true posterior.]

  • ε = 1

    [Figure: as above, with a narrow acceptance band and the resulting ABC density against the true posterior.]

  • Rejection ABC

    If the data are too high dimensional we never observe simulations that are ‘close’ to the field data: the curse of dimensionality. Reduce the dimension using summary statistics, S(D).

    Approximate Rejection Algorithm With Summaries
    - Draw θ from π(θ)
    - Simulate X ∼ f(θ)
    - Accept θ if ρ(S(D), S(X)) < ε

    If S is sufficient this is equivalent to the previous algorithm.

    Simple → popular with non-statisticians.

    ∃ many extensions and improvements:
    - How to choose S(D)
    - How to efficiently sample θ
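A sketch of why summaries help, using hypothetical 100-dimensional data where S(D) is the sample mean (here S happens to be sufficient, so nothing is lost; choosing S well is the hard part in general):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setting: D is 100 iid N(theta, 1) draws. Matching all 100
# coordinates within eps essentially never happens; matching the 1-d summary
# S(D) = mean(D) is easy.
D = rng.normal(1.0, 1.0, size=100)          # 'field data' with true theta = 1
S = D.mean()

N, eps = 50_000, 0.05
theta = rng.normal(0.0, 5.0, N)             # prior draws
X = rng.normal(theta[:, None], 1.0, size=(N, 100))   # pseudo-data sets
accepted = theta[np.abs(X.mean(axis=1) - S) <= eps]  # compare summaries, not raw data

print(len(accepted), accepted.mean())
```

Running the same scheme with ρ applied to the raw 100-dimensional vectors would accept essentially nothing at any usable ε.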


  • An integrated molecular and palaeontological analysis

    The fossil record does not constrain the primate divergence time as closely as previously believed.

    Genetic and palaeontological estimates unified; the human-chimp divergence time is pushed further back.

    Wilkinson et al. 2011, Bracken-Grissom et al. 2014.

  • Accelerating ABC: GP-ABC

    Monte Carlo methods (such as ABC) are costly and can require more simulation than is possible. However,

    most methods sample naively: they don't learn from previous simulations;

    they don't exploit known properties of the likelihood function, such as continuity;

    they sample randomly, rather than using careful design.

    Emulators are the usual approach to dealing with complex models, but emulating stochastic simulators is problematic.

    Instead of modelling the simulator output, we can model the likelihood L(θ) = π(D|θ).

    D remains fixed: we only need to learn L as a function of θ.

    It is a 1d response surface, but it can be hard to model.
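A sketch of the response-surface idea: estimate the likelihood by repeated simulation at a small design of θ values, giving noisy evaluations of L(θ) that a GP could then smooth. The model, design and kernel bandwidth below are all hypothetical stand-ins, chosen so the true log-likelihood is known for checking:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in model: X | theta ~ N(theta, 1), observed data D = 0, so
# log L(theta) = log N(0; theta, 1) is available in closed form.
D = 0.0
design = np.linspace(-3.0, 3.0, 9)    # a careful design over theta, not random draws
M = 20_000                            # simulations per design point

log_L = np.empty_like(design)
for i, th in enumerate(design):
    X = rng.normal(th, 1.0, M)        # simulate from the model at this theta
    h = 0.2                           # kernel bandwidth (hypothetical choice)
    # kernel density estimate of pi(D | theta) at the observed D
    log_L[i] = np.log(np.mean(np.exp(-0.5 * ((D - X) / h) ** 2) / (h * np.sqrt(2 * np.pi))))

truth = -0.5 * design**2 - 0.5 * np.log(2 * np.pi)
print(np.abs(log_L - truth).max())
```

D stays fixed throughout, so the nine noisy values trace out the one-dimensional response surface over θ that GP-ABC models.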


  • Iteration 24

    Left = estimate, right = truth.

    http://youtu.be/FF3KhKh6NHg

  • Climate science

    What drives the glacial-interglacial cycle?

    Eccentricity: orbital departure from a circle; controls the duration of the seasons.
    Obliquity: axial tilt; controls the amplitude of the seasonal cycle.
    Precession: variation in Earth's axis of rotation; affects the difference between seasons.

  • Model selection

    What drives the glacial-interglacial cycle?

    Which aspect of the astronomical forcing is of primary importance?

    Which models best represent the cycle?

    Most simple models of the [...] glacial cycles have at least four degrees of freedom [parameters], and some have as many as twelve. Unsurprisingly [...this is] insufficient to distinguish between the skill of the various models (Roe and Allen 1999)

    Bayesian model selection revolves around the use of Bayes factors, which are notoriously difficult to compute.

    Model selection for stochastic differential equations: 1000 observations, 3000 unknown state variables, 1000 unknown times, 17 unknown parameters, a choice of 5 different simulators.

    Simulation studies show we can accurately choose between competing models, and identify the correct forcing.


  • Age model

    Can we also quantify chronological uncertainty?

    dXt = g(Xt, θ)dt + F(t, γ)dt + ΣdW

    Yt = d + sX1,t + εt

    Plus an age model

    dH = −µs dT + σdW

    Target: π(θ, T1:N, X1:N, Mk | y1:N)

    where T1:N are the unknown times of the observations Y1:N, X1:N are the climate state variables through time, Mk is the simulation model used, and θ is the corresponding parameter.

    I.e., can we simultaneously date the stack, do climate reconstruction, fit the model, and choose between models?
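An SDE of this form can be simulated forward with the Euler-Maruyama scheme, which is the building block for the inference above. The drift g, forcing F and all parameter values below are hypothetical stand-ins, not the glacial-cycle models themselves:

```python
import numpy as np

rng = np.random.default_rng(4)

# Euler-Maruyama sketch for dX_t = g(X_t, theta) dt + F(t, gamma) dt + Sigma dW
def g(x, theta):
    return -theta * x                              # mean-reverting drift (stand-in)

def F(t, gamma):
    return gamma * np.sin(2 * np.pi * t / 41.0)    # periodic forcing (stand-in)

theta, gamma, sigma = 0.5, 1.0, 0.3
dt, n = 0.1, 5000
x = np.empty(n)
x[0] = 0.0
for i in range(1, n):
    t = i * dt
    x[i] = x[i - 1] + (g(x[i - 1], theta) + F(t, gamma)) * dt \
           + sigma * np.sqrt(dt) * rng.normal()

# Observation equation Y_t = d + s X_t + noise, mirroring the slide
d, s, sd_obs = 0.2, 1.5, 0.1
y = d + s * x + sd_obs * rng.normal(size=n)
print(x.mean(), x.std())
```

In the real problem the observation times T1:N are themselves unknown (the age model), which is what makes the joint target above so high dimensional.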


  • Simulation study results - age vs depth (trend removed)

    Dots = truth, black line = estimate, grey = 95% CI.

    [Figure: time (drift removed, kyr) against depth (m).]

  • Simulation study results - climate reconstruction

    Dots = truth, black line = estimate, grey = 95% CI.

    [Figures: reconstructed climate states X1 and X2 against depth (m).]

  • Simulation study results - parameter estimation

    [Figure: marginal posterior densities for β0, β1, β2, δ, γp, γc, γe, σ1, σ2, σobs, D, S, µs, σs, α, φ0 and c.]

    Simultaneous inference of the choice between 5 models, 17 parameters, 800 ages, 2400 climate variables, using just 800 observations.

  • Results for ODP846 - age vs depth (trend removed)Black = posterior mean, grey = 95%CI, red = Huybers 2007, blue = Lisieki and Raymo2004

    [Figure: Time (drift removed, kyr) against Depth (m)]

    Advantages: full UQ, model selection, and simultaneous parameter estimation and climate reconstruction. Ignoring uncertainty leads to incorrect conclusions.

  • Model discrepancy

    Consider the state space model:

    xt+1 = fθ(xt) + et ,    yt = g(xt) + εt

    et ∼ p(·),    εt ∼ q(·)

    How do we correct errors in f or g?

    Use a GP discrepancy model - e.g., xt+1 = fθ(xt) + δ(xt) + et
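The idea above can be sketched in a few lines: simulate a trajectory from some "true" dynamics, run a deliberately wrong simulator fθ, and fit a GP regression to the one-step residuals to learn δ(·). This is only a minimal illustration; the specific transition functions, kernel, and hyperparameters below are assumptions (loosely echoing the simulator forms shown in Figure 6.6), not the method as implemented in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (assumptions): a "true" transition with a nonlinear
# term, and a wrong simulator that omits it.
def f_true(x, u):
    return 0.5 * x + 25.0 * x / (1.0 + x**2) + 8.0 * u

def f_sim(x, u):
    return 0.5 * x + 8.0 * u  # misses the nonlinear term

# Simulate a trajectory from the true dynamics, e_t ~ N(0, q^2)
T, q = 500, 0.5
u = np.cos(1.2 * np.arange(T))
x = np.zeros(T + 1)
for t in range(T):
    x[t + 1] = f_true(x[t], u[t]) + q * rng.standard_normal()

# One-step residuals of the wrong simulator are noisy observations of delta(x_t)
X = x[:-1]
y = x[1:] - f_sim(X, u)

# GP regression with a squared-exponential kernel (posterior mean only)
def kern(a, b, ell=2.0, sf=8.0):
    return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

K = kern(X, X) + q**2 * np.eye(T)
xs = np.linspace(-8, 8, 101)
delta_hat = kern(xs, X) @ np.linalg.solve(K, y)  # learnt discrepancy
delta_true = 25.0 * xs / (1.0 + xs**2)           # the term the simulator misses
```

With a long enough trajectory the GP posterior mean recovers the missing nonlinear term over the visited state range; a full treatment would also infer the kernel hyperparameters and θ jointly.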

    Chapter 6: Gaussian Process Models of Simulator Discrepancy

    [Figure panels: x(t+1) − f(xt,ut) against x(t) for the simulators f(xt,ut) = 0.5xt + 8ut, 25xt/(1+xt²) + 8ut, 8ut, and 0.5 + 8ut, each with a GP(0,K) discrepancy]

    Figure 6.6: The learnt discrepancy (solid black line) and the true discrepancy function (red line) from using different incorrect simulators with Gaussian process discrepancy. Data are generated from Equations 6.4.1 and 6.4.2 with true parameters (q2, r2) of (0.1, 1) (top row), (1, 0.1) (middle row), and (1, 100) (bottom row), respectively.


    Technical challenge: inference using PGAS works but is expensive. A variational approach looks more promising.
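PGAS (particle Gibbs with ancestor sampling) itself is too involved to sketch here, but the particle machinery it builds on can be illustrated with a bootstrap particle filter for a simple state space model. Everything below (the transition function, noise levels, particle count) is an illustrative assumption, not the model from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative state-space model (all specific choices are assumptions):
#   x_{t+1} = 0.5 x_t + 8 u_t + e_t,  e_t ~ N(0, q^2)
#   y_t     = x_t + eps_t,            eps_t ~ N(0, r^2)
q, r, T, N = 1.0, 1.0, 100, 500

def f(x, u):
    return 0.5 * x + 8.0 * u

# Simulate synthetic data
u = np.cos(1.2 * np.arange(T))
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = f(x_true[t - 1], u[t - 1]) + q * rng.standard_normal()
y = x_true + r * rng.standard_normal(T)

def bootstrap_pf(y, u, q, r, N, rng):
    """Bootstrap particle filter: filtered means and a log-likelihood estimate."""
    x = rng.standard_normal(N)                    # initial particles
    loglik, means = 0.0, np.zeros(len(y))
    for t in range(len(y)):
        if t > 0:
            x = f(x, u[t - 1]) + q * rng.standard_normal(N)  # propagate
        logw = -0.5 * ((y[t] - x) / r) ** 2 - 0.5 * np.log(2 * np.pi * r**2)
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())            # log of average likelihood
        w /= w.sum()
        means[t] = np.sum(w * x)                  # weighted filtered mean
        x = x[rng.choice(N, size=N, p=w)]         # multinomial resampling
    return means, loglik

means, ll = bootstrap_pf(y, u, q, r, N, rng)
```

The expense of PGAS comes from rerunning a conditional version of exactly this filter at every MCMC iteration, which is what motivates the variational alternative mentioned above.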


  • Conclusions

    UQ can be vital: ignoring uncertainty can lead to incorrect conclusions, often in subtle ways.

    Computational tractability is one of the key bottlenecks: big simulation and big data.

    Methods from machine learning have the potential to help us make large advances in statistical methodology.

    Thank you for listening!
