
Bayesian Inverse Problems

Jonas Latz

Technische Universität München
Lehrstuhl für Numerische Mathematik (M2)
Fakultät für Mathematik
jonas.latz@tum.de

Guest lecture in Algorithms for Uncertainty Quantification, with Dr. Tobias Neckel and Ionut Farcas

Outline

- Motivation: Forward and Inverse Problem
- Conditional Probabilities and Bayes' Theorem
- Bayesian Inverse Problem
- Examples
- Conclusions

Motivation: Forward and Inverse Problem

Forward Problem: A Picture

[Diagram] Uncertain parameter θ ∼ µ → Model G (e.g. ODE, PDE, ...) → Quantity of interest Q(θ) = Q(G(θ)) ∼ µ_Q?


Forward Problem: A few examples

Groundwater pollution.
- G: transport equation (PDE)
- θ: permeability of the groundwater reservoir
- Q: travel time of a particle in the groundwater reservoir

Figure: Final disposal site for nuclear waste (Image: Spiegel Online)


Diabetes patient.
- G: glucose-insulin ODE for a type-2 diabetes patient
- θ: model parameters, such as the exchange rate from plasma insulin to interstitial insulin
- Q: time to inject insulin

Figure: Glucometer (Image: Bayer AG)


Geotechnical engineering.
- G: deformation model
- θ: soil properties
- Q: deformation/stability/probability of failure

Figure: Construction on Soil (Image: www.ottawaconstructionnews.com)


Distribution of the parameter θ

How do we get the distribution of θ?
Can we use data to characterise the distribution of θ?

[Diagram] Uncertain parameter θ ∼ µ, with a sketch of its density.


‘Data’?

Let θ_true be the actual parameter. We define the data y by

    y := O(G(θ_true)) + η,

where O is the observation operator, G is the model, θ_true is the actual parameter, and η is the measurement noise. The measurement noise is a random variable η ∼ N(0, Γ).
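To make the data model concrete, here is a minimal Python sketch that generates synthetic data of exactly this form. The forward map G, the two-point observation operator O, the true parameter, and the noise level are all illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def G(theta):
    """Hypothetical forward model: maps a parameter to a model output curve."""
    t = np.linspace(0.0, 1.0, 100)
    return np.sin(theta * t)          # stand-in for an ODE/PDE solution

def O(u):
    """Hypothetical observation operator: pick two point values of the output."""
    return np.array([u[25], u[75]])

theta_true = 2.0                      # the 'actual' parameter (unknown in practice)
Gamma = 0.1**2 * np.eye(2)            # noise covariance

# Data: y = O(G(theta_true)) + eta, with eta ~ N(0, Gamma)
eta = rng.multivariate_normal(np.zeros(2), Gamma)
y = O(G(theta_true)) + eta
```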


'Data'? A Picture.

[Diagram] True parameter θ_true → Model G (e.g. ODE, PDE, ...) → Observation operator O → Observational data y := O(G(θ_true)) + η; and the quantity of interest Q(θ_true) = Q(G(θ_true))?


Identify θ_true?!

Can we use the data to identify θ_true?

⇐⇒

Can we solve the equation y = O(G(θ_true)) + η?


Can we solve the equation y = O(G(θ_true)) + η?

No. The problem is ill-posed.¹

- The problem is not uniquely solvable, due to the measurement noise.
- For the given realisation of y there is probably no θ* with O(G(θ*)) = y.
- The operator O ∘ G is very complex.
- dim X ≫ dim Y, where X ∋ θ and Y ∋ y.

¹ Hadamard (1902) - Sur les problèmes aux dérivées partielles et leur signification physique, Princeton University Bulletin 13, pp. 49-52.


Summary

We want to use noisy observational data y to find θ_true, but we cannot. The uncertain parameter θ is still uncertain, even if we observe data y.

Two questions:
- How can we quantify the uncertainty in θ, considering the data y?
- How does this change the probability distribution of our quantity of interest Q?


Conditional Probabilities and Bayes’ Theorem


An Experiment

We roll a die. The sample space of this experiment is

    Ω := {1, ..., 6}.

The space of events is the power set of Ω:

    A := 2^Ω := {A : A ⊆ Ω}.

The probability measure is the uniform measure on Ω:

    P := Unif(Ω) := Σ_{ω ∈ Ω} (1/6) δ_ω.


Consider the event A := {6}. The probability of A is P(A) = 1/6.

Now, an oracle tells us, before we roll the die, whether the outcome will be even or odd:

    B := {2, 4, 6},   B^c := {1, 3, 5}.

How does the probability of A change if we know whether B or B^c occurs?

→ Conditional Probabilities


Conditional probabilities

Consider a probability space (Ω, A, P) and two events D1, D2 ∈ A, such that P(D2) > 0. The conditional probability of D1 given the event D2 is defined by

    P(D1|D2) := P(D1 and D2) / P(D2) := P(D1 ∩ D2) / P(D2).


Conditional probabilities: Moving back to the experiment.

We roll a die.

    P(A|B) = P({6}) / P({2, 4, 6}) = (1/6) / (1/2) = 1/3,

    P(A|B^c) = P(∅) / P({1, 3, 5}) = 0 / (1/2) = 0.
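As a quick sanity check, these two conditional probabilities can be computed by brute-force enumeration of the sample space; a minimal sketch:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}            # sample space of the die
A = {6}
B = {2, 4, 6}
Bc = omega - B

def prob(event):
    """Uniform measure on omega: P(E) = |E ∩ omega| / |omega|."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(D1, D2):
    """P(D1 | D2) = P(D1 ∩ D2) / P(D2), assuming P(D2) > 0."""
    return prob(D1 & D2) / prob(D2)

print(cond_prob(A, B))    # 1/3
print(cond_prob(A, Bc))   # 0
```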


Probability and Knowledge

Probability distributions can be used to model knowledge. When using a fair die, we have no knowledge whatsoever concerning the outcome:

    P({1}) = P({2}) = P({3}) = P({4}) = P({5}) = P({6}) = 1/6.


When did the French Revolution start? Rough knowledge from school: end of the 18th century, definitely not before 1750 or after 1800.

[Figure: broad prior pdf(x) over the years 1750-1800]


Here, the probability distribution is given by a probability density function (pdf), i.e.

    P(A) = ∫_A pdf(x) dx.


Conditional probability and Learning

We represent content we learn by an event B ∈ 2^Ω (i.e. B ⊆ Ω).

Learning B is a map P(·) ↦ P(·|B).


We learn that B = {2, 4, 6} occurs. Hence, we map

    [P({1}) = P({2}) = P({3}) = P({4}) = P({5}) = P({6}) = 1/6]
        ↓
    [P({1}|B) = P({3}|B) = P({5}|B) = 0;  P({2}|B) = P({4}|B) = P({6}|B) = 1/3].

But how do we do this in general?


Elementary Bayes’ Theorem

Consider a probability space (Ω, A, P) and two events D1, D2 ∈ A, such that P(D2) > 0. Then,

    P(D1|D2) = P(D2|D1) P(D1) / P(D2).

Proof: We have

    P(D1|D2) = P(D1 ∩ D2) / P(D2)   (1)   and   P(D2|D1) = P(D2 ∩ D1) / P(D1)   (2),

where (2) additionally assumes P(D1) > 0. (2) is equivalent to P(D2 ∩ D1) = P(D2|D1) P(D1), which can be substituted into (1) to get the final result.


Who is Bayes?

Figure: Bayes (Image: Terence O'Donnell, History of Life Insurance in Its Formative Years, Chicago: American Conservation Co., 1936)

- Thomas Bayes, 1701-1761
- English; Presbyterian minister, mathematician, philosopher
- Proposed a (very) special case of Bayes' Theorem
- Not much is known about him (the image above might not be him)


Who do we know Bayes’ Theorem from?

Figure: Laplace (Image: Wikipedia)

- Pierre-Simon Laplace, 1749-1827
- French; mathematician and astronomer
- Published Bayes' Theorem in 'Théorie analytique des probabilités' in 1812


Conditional probability and Learning

When did the French Revolution start?

(1) Rough knowledge from school: end of the 18th century, definitely not before 1750 or after 1800.

[Figure: broad prior pdf(x) over 1750-1800]

(2) Today on the radio: it was in the 1780s, so in the interval [1780, 1790).

[Figure: updated pdf(x), concentrated on [1780, 1790)]


(3) Reading in a textbook: it was in the middle of the year 1789.

Problem. The point in time x we are looking for is now set to a particular value x = 1789.5 + η, where η ∼ N(0, 0.0625). Hence, the event we learn is B = {x + η = 1789.5}. But P(B) = 0. Hence P(·|B) is not defined and the elementary Bayes' Theorem does not hold.


Non-elementary Conditional Probability

- It is possible to define conditional probabilities for (non-empty) events B with P(B) = 0. (rather complicated)
- Easier: consider the learning in terms of continuous random variables. (rather simple)


Conditional Densities

- We want to learn about a random variable x1 and observe another random variable x2.
- The joint distribution of x1 and x2 is given by a 2-dimensional probability density function pdf(x1, x2).
- Given pdf(x1, x2), the marginal distributions of x1 and x2 are given by

      mpdf1(x1) = ∫ pdf(x1, x2) dx2;   mpdf2(x2) = ∫ pdf(x1, x2) dx1.

- We learn the event B = {x2 = b}, for some b ∈ R. Here, the conditional distribution is given by

      cpdf1|2(x1|x2 = b) = pdf(x1, b) / mpdf2(b).
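Numerically, these marginal and conditional densities can be approximated by discretising the joint pdf on a grid; the correlated bivariate Gaussian joint below is purely an illustrative assumption.

```python
import numpy as np

# Grid for (x1, x2)
x1 = np.linspace(-4, 4, 400)
x2 = np.linspace(-4, 4, 400)
dx1 = x1[1] - x1[0]
dx2 = x2[1] - x2[0]
X1, X2 = np.meshgrid(x1, x2, indexing="ij")

# Illustrative joint pdf: correlated bivariate Gaussian, normalised on the grid
rho = 0.8
joint = np.exp(-(X1**2 - 2 * rho * X1 * X2 + X2**2) / (2 * (1 - rho**2)))
joint /= joint.sum() * dx1 * dx2

# Marginals: integrate out the other variable
mpdf1 = joint.sum(axis=1) * dx2          # mpdf1(x1) = ∫ pdf(x1, x2) dx2
mpdf2 = joint.sum(axis=0) * dx1          # mpdf2(x2) = ∫ pdf(x1, x2) dx1

# Conditional density of x1 given x2 = b: slice the joint and renormalise
b_index = np.argmin(np.abs(x2 - 1.0))    # pick b = 1.0 on the grid
cpdf_1_given_2 = joint[:, b_index] / mpdf2[b_index]
```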


Bayes’ Theorem for Conditional Densities

Similarly to the elementary Bayes' Theorem, we can give a Bayes' Theorem for densities:

    cpdf1|2(·|x2 = b) = cpdf2|1(b|x1 = ·) · mpdf1(·) / mpdf2(b),

where cpdf1|2(·|x2 = b) is the posterior, cpdf2|1(b|x1 = ·) is the (data) likelihood, mpdf1(·) is the prior, and mpdf2(b) is the evidence.

- prior: knowledge we have a priori concerning x1
- likelihood: the probability distribution of the data given x1
- posterior: knowledge we have concerning x1, knowing that x2 = b
- evidence: assesses the model assumptions
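On a grid, this density version of Bayes' Theorem is essentially a one-liner: multiply the prior by the likelihood and renormalise. The sketch below redoes the French Revolution example under assumed numbers: an illustrative prior shape supported on [1750, 1800] and the textbook observation 1789.5 with noise variance 0.0625, as in the slides.

```python
import numpy as np

# Grid over candidate years
x = np.linspace(1750, 1800, 2001)
dx = x[1] - x[0]

# Prior: broad knowledge 'end of the 18th century' (illustrative shape)
prior = np.exp(-0.5 * ((x - 1785) / 10.0)**2)
prior /= prior.sum() * dx

# Likelihood of the textbook observation b = x + eta, eta ~ N(0, 0.0625)
b = 1789.5
likelihood = np.exp(-0.5 * (b - x)**2 / 0.0625)

# Posterior = likelihood * prior / evidence
evidence = (likelihood * prior).sum() * dx
posterior = likelihood * prior / evidence

print(x[np.argmax(posterior)])   # posterior mode, close to 1789.5
```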


Laplace’s formulation

Figure: Bayes' Theorem in 'Théorie analytique des probabilités' by Pierre-Simon Laplace (1812, p. 182)

prior p, likelihood H, posterior P, integral/sum S.


Conditional probability and Learning

When did the French Revolution start?

(3) Reading in a textbook: it was in the middle of the year 1789.

[Figure: posterior pdf, sharply peaked near 1789.5]

(4) Looking it up on wikipedia.org: the actual date is 14 July 1789.


Figure: Prise de la Bastille by Jean-Pierre Louis Laurent Houël, 1789 (Image: Bibliothèque nationale de France)

Bayesian Inverse Problem

Given data y and a prior distribution µ0, the parameter θ is a random variable: θ ∼ µ0. Determine the posterior distribution µ^y, that is,

    µ^y = P(θ ∈ · | O(G(θ)) + η = y).

The problem 'find µ^y' is well-posed.


[Diagram] True parameter θ_true → Model G (e.g. ODE, PDE, ...) → Observation operator O → Observational data y := O(G(θ_true)) + η. The prior information θ ∼ µ0 and the data y yield the posterior distribution µ^y := P(θ ∈ · | O(G(θ)) + η = y); the uncertain parameter θ ∼ µ^y is then propagated through the model to the quantity of interest Q(θ) = Q(G(θ)) ∼ µ^y_Q?


Bayes’ Theorem (revisited)

    cpdf(θ | O(G(θ)) + η = y) = cpdf(y|θ) · mpdf1(θ) / mpdf2(y),

where cpdf(θ | O(G(θ)) + η = y) is the posterior, cpdf(y|θ) is the (data) likelihood, mpdf1(θ) is the prior, and mpdf2(y) is the evidence.

- prior: given by the probability measure µ0
- likelihood: O(G(θ)) − y = η ∼ N(0, Γ) ⇔ y ∼ N(O(G(θ)), Γ)
- posterior: given by the probability measure µ^y
- evidence: chosen as a normalising constant
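In code, this Gaussian likelihood is usually evaluated as a (log-)density of the misfit y − O(G(θ)); a minimal sketch, where the trivial scalar model in the usage line is an assumption for illustration:

```python
import numpy as np

def log_likelihood(theta, y, Gamma, G, O):
    """log cpdf(y | theta), up to an additive constant, for y ~ N(O(G(theta)), Gamma)."""
    misfit = y - O(G(theta))
    return -0.5 * misfit @ np.linalg.solve(Gamma, misfit)

# Illustrative usage with a trivial scalar model G(theta) = theta:
y = np.array([0.3])
Gamma = np.array([[0.1**2]])
ll = log_likelihood(0.25, y, Gamma, G=lambda t: t, O=lambda u: np.array([u]))
```

Working with the log-likelihood rather than the likelihood itself avoids numerical underflow when the misfit is large.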


How do we invert Bayesian?

Sampling based: sample from the posterior measure µ^y
- Importance Sampling
- Markov Chain Monte Carlo
- Sequential Monte Carlo / Particle Filters

Deterministic: use a deterministic quadrature rule to approximate µ^y
- Sparse Grids
- Quasi-Monte Carlo (QMC)


Sampling based methods for Bayesian Inverse Problems

Idea: generate samples from µ^y and use these samples in a Monte Carlo manner to approximate the distribution of Q(θ), where θ ∼ µ^y.

Problem: we typically can't generate i.i.d. samples of µ^y. Instead, we can generate
- weighted samples of the wrong distribution (Importance Sampling, SMC), or
- dependent samples of the right distribution (MCMC).


Importance Sampling

Importance sampling applies Bayes' Theorem directly and uses the following identity:

    E_µ^y[Q] = E_µ0[Q · cpdf(y|·)] / E_µ0[cpdf(y|·)],

where the denominator E_µ0[cpdf(y|·)] is the evidence. Hence, we can integrate w.r.t. µ^y using only integrals w.r.t. µ0. In practice: sample i.i.d. (θ_j : j = 1, ..., J) ∼ µ0 and approximate

    E_µ^y[Q] ≈ (J⁻¹ Σ_{j=1}^J Q(θ_j) cpdf(y|θ_j)) / (J⁻¹ Σ_{j=1}^J cpdf(y|θ_j)).
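A minimal self-normalised importance sampling estimator following this formula; the prior sampler, quantity of interest, and likelihood in the usage lines are illustrative assumptions (a scalar parameter with a Unif[0.1, 0.9] prior and a Gaussian likelihood around an assumed observation), not the lecture's actual problem.

```python
import numpy as np

def importance_sampling(Q, log_likelihood, prior_sampler, J=10_000, rng=None):
    """Estimate E_{mu^y}[Q] by importance sampling with the prior as proposal."""
    rng = rng or np.random.default_rng()
    thetas = prior_sampler(J, rng)                 # i.i.d. samples from mu_0
    logw = np.array([log_likelihood(t) for t in thetas])
    w = np.exp(logw - logw.max())                  # stabilised likelihood weights
    w /= w.sum()                                   # self-normalisation
    qs = np.array([Q(t) for t in thetas])
    return np.sum(w * qs)

# Illustrative usage on a toy scalar problem:
estimate = importance_sampling(
    Q=lambda t: t**2,
    log_likelihood=lambda t: -0.5 * (0.25 - t)**2 / 0.1**2,
    prior_sampler=lambda J, rng: rng.uniform(0.1, 0.9, size=J),
)
print(estimate)
```

Note that the self-normalisation step means the evidence never has to be computed explicitly: it cancels when the weights are divided by their sum.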


Deterministic Strategies

- Several deterministic methods have been proposed.
- General issue: estimating the model evidence is difficult. (This also speaks against importance sampling.)

Examples

Example 1: 1D groundwater flow with uncertain source

Consider the following partial differential equation on D = [0, 1]:

    −∇·(k∇p) = f(θ)   (on D),
    p = 0             (on ∂D),

where the diffusion coefficient k is known. The source term f(θ) contains one Gaussian-type source at position θ ∈ [0.1, 0.9]. (We solve the PDE using 48 linear finite elements; a simplified solver is sketched below.)
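As a stand-in for the lecture's finite element solver, here is a short finite-difference sketch of the same boundary value problem. The constant permeability k and the shape/width of the Gaussian-type source are assumptions for illustration, not the lecture's actual configuration.

```python
import numpy as np

def solve_pressure(theta, n=48, k=1.0, width=0.05):
    """Solve -(k p')' = f(theta) on [0, 1] with p(0) = p(1) = 0 by central finite
    differences on n intervals (loosely matching the lecture's 48 elements).

    theta : position of the Gaussian-type source (assumed shape for illustration).
    """
    x = np.linspace(0.0, 1.0, n + 1)
    h = x[1] - x[0]
    xi = x[1:-1]                                   # interior nodes
    f = np.exp(-0.5 * ((xi - theta) / width)**2)   # Gaussian-type source term

    # Tridiagonal system: -k (p_{i-1} - 2 p_i + p_{i+1}) / h^2 = f_i
    A = (k / h**2) * (2 * np.eye(n - 1)
                      - np.eye(n - 1, k=1)
                      - np.eye(n - 1, k=-1))
    p = np.zeros(n + 1)
    p[1:-1] = np.linalg.solve(A, f)                # boundary values stay zero
    return x, p

x, p = solve_pressure(theta=0.2)
```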


Example 1: (deterministic) log-permeability

[Figure: log-permeability over x ∈ [0, 1], varying roughly between −0.1 and 0.3]


Example 1: Quantity of Interest

Considering the uncertainty in f(θ), determine the distribution of the quantity of interest

    Q : [0.1, 0.9] → R,  θ ↦ p(5/12).

[Figure: true pressure p over the length x, with the quantity of interest marked at x = 5/12]


Example 1: Data

The observations are based on the observation operator O, which maps

    p ↦ [p(2/12), p(4/12), p(6/12), p(8/12)],

given θ_true = 0.2.

[Figure: true pressure curve with the data observed at the four sensor positions]


Example 1: Bayesian Setting

We assume uncorrelated Gaussian noise, with different variances:

(a) Γ = 0.8²   (b) Γ = 0.4²   (c) Γ = 0.2²   (d) Γ = 0.1²

Prior distribution: θ ∼ µ0 = Unif[0.1, 0.9].

Compare
- the prior and the different posteriors (with the different noise levels),
- the uncertainty propagation of the prior and the posteriors.

(Estimates with standard Monte Carlo/Importance Sampling using J = 10000.)


Example 1: No data (i.e. prior)

Example 1: Very high noise level Γ = 0.8²

Example 1: High noise level Γ = 0.4²

Example 1: Small noise level Γ = 0.2²

Example 1: Very small noise level Γ = 0.1²

[Figures: the parameter distribution and the resulting distribution of the quantity of interest, for the prior and for each posterior]

Example 1: Summary

- Smaller noise level ⇔ less uncertainty in the parameter ⇔ less uncertainty² in the quantity of interest.
- The unknown parameter can be estimated pretty well in this setting.
- Importance Sampling can be used in such simple settings.

² 'Less uncertainty' meaning 'smaller variance'.


Example 2: 1D groundwater flow with uncertain number and position of sources

Consider again

    −∇·(k∇p) = g(θ)   (on D),
    p = 0             (on ∂D),

where
- θ := (N, ξ1, ..., ξN), with N the number of Gaussian-type sources and ξ1, ..., ξN the positions of the sources (sorted ascendingly),
- the log-permeability is known, but with a higher spatial variability.


Example 2: (deterministic) log-permeability

[Figure: log-permeability over x ∈ [0, 1], varying roughly between −2 and 2]


Example 2: Quantity of Interest

Considering the uncertainty in g(θ), determine the distribution of the quantity of interest

    Q : θ ↦ p(1/2).

[Figure: true pressure p over the length x, with the quantity of interest marked at x = 1/2]


Example 2: Data

The observations are based on the observation operator O, which maps

    p ↦ [p(1/12), p(3/12), p(5/12), p(7/12), p(9/12), p(11/12)],

given θ_true := (4, 0.2, 0.4, 0.6, 0.8).

[Figure: true pressure curve with the data observed at the six sensor positions]


Example 2: Bayesian Setting

We assume uncorrelated Gaussian noise with variance Γ = 0.4².

Prior distribution θ ∼ µ0, where µ0 is given by the following sampling procedure (implemented in the sketch after this list):

1. Sample N ∼ Unif{1, ..., 8}.
2. Sample ξ ∼ Unif[0.1, 0.9]^N.
3. Set ξ := sort(ξ).
4. Set θ := (N, ξ1, ..., ξN).

Compare the prior and posterior and their uncertainty propagation. (Estimates with standard Monte Carlo/Importance Sampling using J = 10000.)
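The prior sampling procedure above translates directly into code; a minimal sketch:

```python
import numpy as np

def sample_prior(rng):
    """Draw theta = (N, xi_1, ..., xi_N) from the hierarchical prior mu_0."""
    N = rng.integers(1, 9)                 # 1. N ~ Unif{1, ..., 8}
    xi = rng.uniform(0.1, 0.9, size=N)     # 2. xi ~ Unif[0.1, 0.9]^N
    xi = np.sort(xi)                       # 3. sort ascendingly
    return (N, *xi)                        # 4. theta := (N, xi_1, ..., xi_N)

rng = np.random.default_rng(seed=0)
samples = [sample_prior(rng) for _ in range(100)]
```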


Example 2: Prior

Figure: Prior distribution of N (left) and 100 samples of the prior distribution of the source terms (right)


Example 2: Posterior

Figure: Posterior distribution of N (left) and 100 samples of the posterior distribution of the source terms (right)


Example 2: Quantity of Interest

Figure: Quantities of interest, where the source term is distributed according to the prior (left) and posterior (right)


Example 2: Summary

- Bayesian estimation is possible in 'complicated settings' (such as this multidimensional setting).
- Importance Sampling is not very efficient.

Conclusions

Messages to take home

+ Bayesian statistics can be used to incorporate data into an uncertain model.
+ Bayesian inverse problems are well-posed and thus a consistent approach to parameter estimation.
+ Applying the Bayesian framework is possible in many different settings, also in ones that are genuinely difficult (e.g. multidimensional parameter spaces).
- Solving Bayesian inverse problems is computationally very expensive:
  - requires many forward solves,
  - algorithmically complex.


How to learn Bayesian

Various lectures at TUM:
- Bayesian strategies for inverse problems, Prof. Koutsourelakis (Mechanical Engineering)
- various machine learning lectures in CS

Speak with Prof. Dr. Elisabeth Ullmann or Jonas Latz (both M2).

GitHub/latz-io:
- a short review on algorithms for Bayesian inverse problems
- sample code (MATLAB)
- these slides


Various books/papers:
- Moritz Allmaras et al. - Estimating Parameters in Physical Models through Bayesian Inversion: A Complete Example (2013; SIAM Rev. 55(1))
- Jun Liu - Monte Carlo Strategies in Scientific Computing (2004; Springer)
- Sharon Bertsch McGrayne - The Theory That Would Not Die (2011; Yale University Press)
- Christian Robert - The Bayesian Choice (2007; Springer)
- Andrew Stuart - Inverse Problems: A Bayesian Perspective (2010; Acta Numerica 19)


Figure: One last remark concerning conditional probability (Image: xkcd)


Jonas Latz

Input/Output: www.latz.io