
Bayesian and Least Squares fitting:

Problem:

Given data (d) and model (m) with adjustable parameters (x),

what are the best values and uncertainties for the parameters?

Given Bayes’ Theorem: prob(m|d,I) ∝ prob(d|m,I) prob(m|I)

+ Gaussian data uncertainties

+ flat priors

Then

log( prob(m|d) ) = constant – χ²_data / 2

and maximizing the log( prob(m|d) ) is equivalent to minimizing

chi-squared (i.e., least squares)

and x_fitted = x|0 + Δx, where the corrections Δx solve

r = P Δx

Weighted least-squares:

Least squares equations: r = P Δx, where P_ij = ∂m_i/∂x_j

Each datum equation: r_i = Σ_j ( P_ij Δx_j )

Divide both sides of each equation by the datum uncertainty, σ_i,

i.e., r_i → r_i / σ_i

and P_ij → P_ij / σ_i for each j

gives variance-weighted solution.
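A minimal sketch of the variance weighting above, using NumPy. The straight-line model, the data values and the function name weighted_corrections are hypothetical, chosen only for illustration:

```python
import numpy as np

def weighted_corrections(P, r, sigma):
    """Solve r = P dx after dividing each datum equation by its sigma."""
    Pw = P / sigma[:, None]      # P_ij -> P_ij / sigma_i
    rw = r / sigma               # r_i  -> r_i / sigma_i
    dx, *_ = np.linalg.lstsq(Pw, rw, rcond=None)
    return dx

# Hypothetical straight-line model: m = x1 + x2 * t, starting from x|0 = 0
t = np.array([0.0, 1.0, 2.0, 3.0])
P = np.column_stack([np.ones_like(t), t])    # partials wrt x1 and x2
sigma = np.array([0.1, 0.1, 0.1, 10.0])      # last datum is very uncertain
r = np.array([1.0, 3.0, 5.0, 20.0])          # residuals d - m at x|0
dx = weighted_corrections(P, r, sigma)
```

The first three data lie on 1 + 2t; the discrepant fourth datum has a large uncertainty, so it barely moves the solution away from (1, 2).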

Including priors in least-squares:

Least squares equations: r = d – m = P Δx,

where P_ij = ∂m_i/∂x_j

Each data equation: r_i = Σ_j ( P_ij Δx_j )

The weighted data (residuals) need not be homogeneous:

r = ( d – m ) / σ can be composed of N “normal” data and some “prior-like data”.

Possible prior-like datum: x_k = v_k ± σ_k (for kth parameter)

Then r_N+1 = ( v_k – x_k ) / σ_k

and P_N+1,j = 1/σ_k for j = k

0 for j ≠ k
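A prior-like datum is just one more row appended to the weighted equations. A short sketch, assuming P and r are already divided by the data uncertainties; add_prior_row and the numbers are hypothetical:

```python
import numpy as np

def add_prior_row(P, r, k, v_k, x_k, sigma_k):
    """Append the prior-like datum x_k = v_k +/- sigma_k to weighted P and r."""
    row = np.zeros(P.shape[1])
    row[k] = 1.0 / sigma_k                        # partial is 1/sigma_k for j = k
    P_aug = np.vstack([P, row])                   # all other entries stay 0
    r_aug = np.append(r, (v_k - x_k) / sigma_k)   # weighted prior residual
    return P_aug, r_aug

# Hypothetical case: 2 parameters, 3 already-weighted data equations
P = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
r = np.array([0.1, -0.2, 0.3])
P_aug, r_aug = add_prior_row(P, r, k=1, v_k=2.0, x_k=1.5, sigma_k=0.5)
```

Note that k is a 0-based index here, so k=1 is the second parameter.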

Non-linear models

Least squares equations: r = P Δx,

where P_ij = ∂m_i/∂x_j

has solution: Δx = (PᵀP)⁻¹ Pᵀ r

• If the partial derivatives of the model are independent of the parameters, then the first-order Taylor expansion is exact and applying the parameter corrections, Δx, gives the final answer.

Example linear problem: m = x_1 + x_2 t + x_3 t²

• If not, you have a non-linear problem and the 2nd and higher order terms in the Taylor expansion can be important until Δx → 0; so iteration is required.

Example non-linear problem: m = sin(xt)
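The iteration for the non-linear example can be sketched as follows. The data are synthetic and noise-free, and the "true" value 1.3 and starting guess 1.2 are assumptions for illustration only:

```python
import numpy as np

t = np.linspace(0.0, 5.0, 50)
x_true = 1.3                      # hypothetical "true" parameter
d = np.sin(x_true * t)            # synthetic, noise-free data

x = 1.2                           # initial guess x|0
for iteration in range(20):
    m = np.sin(x * t)
    r = d - m                                 # residuals
    P = (t * np.cos(x * t))[:, None]          # dm_i/dx as a one-column matrix
    dx = np.linalg.solve(P.T @ P, P.T @ r)    # dx = (P^T P)^-1 P^T r
    x += dx[0]
    if abs(dx[0]) < 1e-12:                    # stop once corrections vanish
        break
```

Each pass re-evaluates the partials at the updated parameter value, which is what makes the problem iterative. A starting guess far from the answer may not converge.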

How to calculate partial derivatives (Pij ):

• Analytic formulae (if the model can be expressed analytically)

• Numerical evaluations:

“wiggle” parameters one at a time:

x^w = x except for the jth parameter:

x_j^w = x_j + δx

Partial derivative of the ith datum for parameter j:

P_ij = ( m_i(x^w) – m_i(x) ) / ( x_j^w – x_j )

NB: choose δx small enough to avoid 2nd order errors,

but large enough to avoid numerical inaccuracies.

Always use 64-bit computations!
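A sketch of the "wiggle" method in 64-bit arithmetic. The model and the step size of 1e-6 are hypothetical; a suitable step depends on the problem:

```python
import numpy as np

def numerical_partials(model, x, delta):
    """Forward differences: P_ij = (m_i(x^w) - m_i(x)) / (x_j^w - x_j)."""
    x = np.asarray(x, dtype=np.float64)       # 64-bit throughout
    m0 = model(x)
    P = np.empty((m0.size, x.size))
    for j in range(x.size):
        xw = x.copy()
        xw[j] += delta[j]                     # wiggle only the jth parameter
        P[:, j] = (model(xw) - m0) / (xw[j] - x[j])
    return P

t = np.linspace(0.0, 1.0, 5)
model = lambda x: x[0] * np.sin(x[1] * t)     # hypothetical model
x = np.array([2.0, 3.0])
P = numerical_partials(model, x, delta=1e-6 * np.ones(2))

# Analytic partials of the same model, for comparison
P_exact = np.column_stack([np.sin(3.0 * t), 2.0 * t * np.cos(3.0 * t)])
```

Dividing by the actual difference xw[j] - x[j], rather than by the nominal step, absorbs the rounding that occurs when the step is added to the parameter.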

Can do very complicated modeling:

Example problem: model pulsating photosphere for Mira variables

(see Reid & Goldston 2002, ApJ, 568, 931)

Data: Observed flux, S(t,λ), at radio, IR and optical wavelengths

Model: Assume power-law temperature, T(r,t), and density, ρ(r,t);

calculate opacity sources (ionization equilibrium, H2 formation, …);

numerically integrate radiative transfer along ray-paths through atmosphere for many impact parameters and wavelengths;

parameters include T0 and ρ0 at radius r0

Even though model is complicated and not analytic, one can easily calculate partials numerically and solve for best parameter values.

Modeling Mira Variables:

Visual: Δm_v ~ 8 mag ~ factor of 1000; variable formation of TiO clouds at ~2R*

with top T ~ 1400 K

IR: seeing pulsating stellar surface

Radio: H⁻ free-free opacity at ~2R*

Iteration and parameter adjustment:




• It is often better to make parameter adjustments slowly, so for the (k+1)th iteration, set

x|k+1 = x|k + λ Δx|k

where 0 < λ < 1

• NB: this is equivalent to scaling the partial derivatives by 1/λ.

So, if one iterates enough, one only needs to get the sign of the partial derivatives correct!
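A sketch of damped iteration, applying only a fraction (lam in the code) of each correction. It reuses the sin(x t) example with hypothetical values; the fraction 0.3 and the 100 iterations are arbitrary choices:

```python
import numpy as np

t = np.linspace(0.0, 5.0, 50)
d = np.sin(1.3 * t)               # synthetic data, hypothetical true x = 1.3

def iterate(x, lam, n_iter):
    """Apply only a fraction lam of each least-squares correction."""
    for _ in range(n_iter):
        r = d - np.sin(x * t)
        P = (t * np.cos(x * t))[:, None]
        dx = np.linalg.solve(P.T @ P, P.T @ r)[0]
        x = x + lam * dx          # x|k+1 = x|k + lam * dx|k
    return x

x_damped = iterate(1.2, lam=0.3, n_iter=100)
```

Near the solution the error shrinks by roughly a factor (1 - lam) per iteration, so smaller fractions are safer but need more iterations.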

Evaluating Fits:




Always carefully examine final residuals (r)

Plot them

Look for >3σ values

Look for non-random behavior

Always look at parameter correlations

correlation coefficient: ρ_jk = D_jk / sqrt( D_jj D_kk )

where D = (PᵀP)⁻¹
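The correlation coefficients follow directly from the weighted partials. A sketch for a hypothetical straight-line model with unit data uncertainties:

```python
import numpy as np

def correlation_matrix(P):
    """rho_jk = D_jk / sqrt(D_jj D_kk), with D = (P^T P)^-1 and P weighted."""
    D = np.linalg.inv(P.T @ P)
    s = np.sqrt(np.diag(D))       # formal parameter uncertainties
    return D / np.outer(s, s)

# Hypothetical model m = x1 + x2 * t sampled at t = 0, 1, ..., 10
t = np.linspace(0.0, 10.0, 11)
P = np.column_stack([np.ones_like(t), t])
rho = correlation_matrix(P)
```

With all times positive, the intercept and slope come out strongly anti-correlated (about -0.85 here); measuring t from the mid-point of the data would remove that correlation.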

sqrt( <cos²(ωt)> )

1/sqrt( 2 ) ≈ 0.7

1/sqrt( 3 ) ≈ 0.6

1/sqrt( 4 ) = 0.5

Bayesian vs Least Squares Fitting:

Least Squares fitting:

Seeks the best parameter values and their uncertainties

Bayesian fitting:

Seeks the posterior probability distribution for parameters

Bayesian Fitting:

Bayesian: what is posterior probability distribution for parameters?

Answer: evaluate prob(m|d,I) ∝ prob(d|m,I) prob(m|I)

If data and parameter priors have Gaussian distributions,

log( prob(m|d) ) = constant – χ²_data / 2 – χ²_param_priors / 2

“Simply” evaluate for all (reasonable) parameter values. But this can be computationally challenging: eg, a modest problem with only 10 parameters, evaluated on a coarse grid of 100 values each, requires 100^10 = 10^20 model calculations!

Markov chain Monte Carlo (McMC) methods:

Instead of complete exploration of parameter space, avoid regions with low probability and wander about quasi-randomly over high probability regions:

“Monte Carlo”: random trials (like roulette wheel in Monte Carlo casinos)

“Markov chain”: (k+1)th trial parameter values are “close to” kth values

McMC using Metropolis-Hastings (M-H) algorithm:

1. Given kth model (ie, values for all parameters in the model),

generate (k+1)th model by small random changes:

x_j|k+1 = x_j|k + α g_j σ_j

where α is an “acceptance fraction” parameter, g_j is a Gaussian random number (mean=0, standard deviation=1), and σ_j is the width of the posterior probability distribution of parameter x_j

2. Evaluate the probability ratio: R = prob(m|d)|k+1 / prob(m|d)|k

3. Draw a random number, U, uniformly distributed from 0 to 1

4. If R > U, “accept” and store the (k+1)th parameter values;

else “replace” the (k+1)th values with a copy of the kth values and store them (NB: this yields many duplicate models)

Stored parameter values from the M-H algorithm give the posterior probability distribution of the parameters!
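The steps above can be sketched in a few lines. The two-parameter Gaussian target, the function name metropolis and the trial count are assumptions for illustration. The test R > U is done on logarithms, which is equivalent and avoids overflow:

```python
import numpy as np

def metropolis(log_prob, x0, sigma, alpha, n_trials, rng):
    """Random-walk Metropolis; returns stored trials and acceptance rate."""
    x = np.asarray(x0, dtype=float)
    lp = log_prob(x)
    chain = np.empty((n_trials, x.size))
    n_accept = 0
    for k in range(n_trials):
        x_new = x + alpha * rng.standard_normal(x.size) * sigma
        lp_new = log_prob(x_new)
        # R > U is tested as log R > log U to avoid overflow in exp()
        if lp_new - lp > np.log(rng.uniform()):
            x, lp = x_new, lp_new
            n_accept += 1
        chain[k] = x              # on rejection this stores a duplicate
    return chain, n_accept / n_trials

# Hypothetical target: independent Gaussians, means (1, -2), sigmas (0.5, 2)
mean, sig = np.array([1.0, -2.0]), np.array([0.5, 2.0])
log_prob = lambda x: -0.5 * np.sum(((x - mean) / sig) ** 2)
rng = np.random.default_rng(0)
chain, acc = metropolis(log_prob, mean, sig, alpha=1.0, n_trials=20000, rng=rng)
```

The mean and scatter of each column of chain should then approximate the target's means and sigmas, to within the sampling noise of a correlated chain.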

Metropolis-Hastings (M-H) details:

M-H McMC parameter adjustments: x_j|k+1 = x_j|k + α g_j σ_j

α determines the “acceptance fraction” (start near 1/√N),

g_j is a Gaussian random number (mean=0, standard deviation=1),

σ_j is the “sigma” of the posterior probability distribution of x_j

• The M-H “acceptance fraction” should be about 23% for problems with many parameters and about 50% for few parameters; iteratively adjust α to achieve this; decreasing α increases the acceptance rate

• Since one doesn’t know the parameter posterior uncertainties, σ_j, at the start, one needs to do trial solutions and iteratively adjust.

• When exploring the PDF of the parameters with the M-H algorithm, one should start with near-optimum parameter values, so discard early “burn-in” trials.

M-H McMC flow:

Enter data (d) and initial guesses for parameters (x, σ_prior, σ_posteriori)

Start “burn-in” & “acceptance-fraction” adjustment loops (eg, ~10 loops)

  start a McMC loop (with eg, ~10^5 trials)

    make new model: x_j|k+1 = x_j|k + α g_j σ_posteriori,j

    calculate (k+1)th log( prob(m|d) ) = constant – ( χ²_data + χ²_param_priors ) / 2

    calculate Metropolis ratio: R = exp( log(prob_k+1) – log(prob_k) )

    if R > U_k+1: accept and store model

    if R < U_k+1: replace with kth model and store

  end McMC loop

  estimate & update σ_posteriori and adjust α for desired acceptance fraction

End “burn-in” loops

Start “real” McMC exploration with latest parameter values and

using final σ_posteriori and α to determine parameter step sizes. Use a large number of trials (eg, ~10^6)
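A compact sketch of this flow under stated assumptions: a hypothetical two-parameter Gaussian posterior, posterior widths estimated from the 15.87% and 84.13% points of the stored trials, and the step-size scale (alpha) multiplied by the ratio of achieved to desired acceptance rate after each burn-in loop. Trial counts are far smaller than the ~10^5 and ~10^6 quoted above so that the example runs quickly:

```python
import numpy as np

def run_trials(log_prob, x, step, n_trials, rng):
    """One set of Metropolis trials; returns stored chain and acceptance rate."""
    lp = log_prob(x)
    chain = np.empty((n_trials, x.size))
    n_accept = 0
    for k in range(n_trials):
        x_new = x + step * rng.standard_normal(x.size)
        lp_new = log_prob(x_new)
        if lp_new - lp > np.log(rng.uniform()):   # same test as R > U
            x, lp = x_new, lp_new
            n_accept += 1
        chain[k] = x
    return chain, n_accept / n_trials

# Hypothetical posterior: independent Gaussians with these means and sigmas
mean, sig = np.array([1.0, -2.0]), np.array([0.5, 2.0])
log_prob = lambda x: -0.5 * np.sum(((x - mean) / sig) ** 2)

rng = np.random.default_rng(2)
x = np.zeros(2)                    # initial parameter guesses
sigma_post = np.ones(2)            # initial guesses for posterior widths
alpha, a_desired = 1.0, 0.5

for loop in range(10):             # burn-in and adjustment loops
    chain, acc = run_trials(log_prob, x, alpha * sigma_post, 2000, rng)
    x = chain[-1]                  # continue from the latest model
    lo, hi = np.percentile(chain, [15.87, 84.13], axis=0)
    sigma_post = 0.5 * (hi - lo)               # update posterior widths
    alpha *= max(acc, 0.01) / a_desired        # rescale toward desired rate

# "Real" exploration with the final step sizes
chain, acc = run_trials(log_prob, x, alpha * sigma_post, 20000, rng)
```

The floor of 0.01 on the acceptance rate is a guard added here so that a loop with no accepted trials cannot drive the step size to zero.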

Estimation of σ_posteriori:

Make histogram of trial parameter values (must cover full range)

Start bin loop

check cumulative # crossing “–1σ”

check cumulative # crossing “+1σ”

End bin loop

Estimates of (Gaussian) σ_posteriori: ≈ ½ | p_val(+1σ) – p_val(–1σ) | ≈ ¼ | p_val(+2σ) – p_val(–2σ) |

Relatively robust to non-Gaussian pdfs
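A sketch of this width estimate, using np.percentile in place of an explicit histogram and bin loop (a substitution made here for brevity). For a Gaussian, the ±1 sigma points sit at cumulative fractions of 15.87% and 84.13%, and the ±2 sigma points at 2.28% and 97.72%. The Gaussian random numbers stand in for stored M-H trial values:

```python
import numpy as np

def sigma_from_trials(values):
    """Width estimates from the +/-1 sigma and +/-2 sigma crossing points."""
    lo1, hi1 = np.percentile(values, [15.87, 84.13])   # -1, +1 sigma points
    lo2, hi2 = np.percentile(values, [2.28, 97.72])    # -2, +2 sigma points
    return 0.5 * (hi1 - lo1), 0.25 * (hi2 - lo2)

rng = np.random.default_rng(1)
trials = rng.normal(loc=5.0, scale=2.0, size=200000)   # stand-in for M-H trials
s1, s2 = sigma_from_trials(trials)
```

For a Gaussian both estimates agree; a clear difference between them is a quick indication that the distribution has non-Gaussian tails.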

Adjusting Acceptance Rate Parameter (α):

Metropolis trial acceptance rule for (k+1)th trial model:

if R > U_k+1: accept and store model

if R < U_k+1: replace with kth model and store

For nth set of M-H McMC trials, count cumulative number of accepted (Na) and replaced (Nr) models. Acceptance rate: A = Na / (Na + Nr)

For (n+1)th set of trials, set α_n+1 = ( A_n / A_desired ) α_n
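The update is a one-liner: multiply the step-size scale by the ratio of achieved to desired acceptance rate. The function name and the counts below are hypothetical:

```python
def update_alpha(alpha, n_accept, n_replace, a_desired):
    """New scale = (A_n / A_desired) * old scale, with A_n = Na / (Na + Nr)."""
    a_n = n_accept / (n_accept + n_replace)
    return alpha * a_n / a_desired

# Hypothetical set of 1000 trials with 600 accepted and a 25% target
alpha_next = update_alpha(alpha=1.0, n_accept=600, n_replace=400, a_desired=0.25)
```

Here the acceptance rate (60%) is too high, so the scale grows to 2.4 and the next set of trials takes larger steps.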

“Non-least-squares” Bayesian fitting:

Sivia gives 2 examples where the data uncertainties are not known Gaussians; hence least squares is non-optimal:

1) prob(σ) = σ0/σ² (for σ ≥ σ0; 0 otherwise)

where σ is the error on a datum, which typically is close to σ0 (the minimum error), but can occasionally be much larger.

2) prob(σ) = β δ(σ – γσ0) + (1–β) δ(σ – σ0), where β is the fraction of “bad” data whose uncertainties are γσ0

“Error tolerant” Bayesian fitting:

Sivia’s “conservative formulation”: data uncertainties are given by

prob(σ) = σ0/σ² (for σ ≥ σ0; 0 otherwise)

where σ is the error on a datum, which typically is close to σ0 (the minimum error), but can occasionally be much larger.

Marginalizing over σ gives

prob(d|m,σ0) = ∫ prob(d|m,σ) prob(σ) dσ

= ∫ ( 1 / (σ√(2π)) ) exp[ –(d–m)²/2σ² ] ( σ0/σ² ) dσ, integrated from σ0 to ∞

= ( 1 / (σ0√(2π)) ) ( ( 1 – exp(–R²/2) ) / R² ), where R = (d–m)/σ0

Thus, one maximizes Σ_i log( ( 1 – exp(–R_i²/2) ) / R_i² ),

instead of maximizing Σ_i log( exp(–R_i²/2) ) = –Σ_i R_i²/2 (ie, minimizing Σ_i R_i², least squares)
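A sketch comparing the two log-likelihoods for a constant model fitted to hypothetical data with one wild point. Constant terms are dropped, and the small-R limit (1/2 – R²/8) is used near R = 0 to avoid dividing zero by zero:

```python
import numpy as np

def log_like_conservative(d, m, sigma0):
    """Sum_i log[(1 - exp(-R_i^2/2)) / R_i^2], dropping constant terms."""
    R2 = ((d - m) / sigma0) ** 2
    safe = np.where(R2 > 1e-8, R2, 1.0)                 # avoid 0/0 at R = 0
    term = np.where(R2 > 1e-8, -np.expm1(-0.5 * safe) / safe, 0.5 - R2 / 8.0)
    return np.sum(np.log(term))

def log_like_gaussian(d, m, sigma0):
    """Sum_i of -R_i^2/2, ie, the usual least-squares objective."""
    return np.sum(-0.5 * ((d - m) / sigma0) ** 2)

d = np.array([1.0, 1.1, 0.9, 11.0])     # hypothetical data; last point is wild
m_grid = np.linspace(0.0, 12.0, 1201)   # trial values of a constant model
ll_c = [log_like_conservative(d, m, 0.1) for m in m_grid]
ll_g = [log_like_gaussian(d, m, 0.1) for m in m_grid]
m_cons = m_grid[np.argmax(ll_c)]        # error-tolerant estimate
m_gauss = m_grid[np.argmax(ll_g)]       # least-squares estimate
```

The least-squares estimate is dragged to the mean of all four points (3.5), while the error-tolerant estimate stays with the three consistent points near 1.0.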

Data PDFs:

• Gaussian pdf has sharper peak, giving more accurate parameter estimates (provided all data are good).

• Error tolerant pdf doesn’t have a large penalty for a wild point, so it will not “care much” about some wild data.

Error tolerant fitting example:

Goal: to determine motion of 100’s of maser spots

Data: maps (positions) at 12 epochs

Method: find all spots at nearly the same position over all epochs; then fit for linear motion

Problem: some “extra” spots appear near those selected to fit (eg, R>10). Too much data to plot, examine and excise by hand.

Error tolerant fitting example:

Error tolerant fitting output with no “human intervention”

The “good-and-bad” data Bayesian fitting:

Box & Tiao’s (1968) data uncertainties come in two “flavors”, given by

prob(σ) = β δ(σ – γσ0) + (1–β) δ(σ – σ0)

where β is the fraction of “bad” data (β ≤ 1), whose uncertainties are γσ0

Marginalizing over σ for Gaussian errors gives

prob(d|m,σ0) = ∫ prob(d|m,σ) prob(σ) dσ

= ( 1 / (σ0√(2π)) ) ( (β/γ) exp(–R²/2γ²) + (1–β) exp(–R²/2) ), where R = (d–m)/σ0

Thus, one maximizes constant + Σ_i log( (β/γ) exp(–R_i²/2γ²) + (1–β) exp(–R_i²/2) ),

which for no bad data (β = 0) recovers least squares.

But one must estimate 2 extra parameters: β and γ
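A sketch of the good-and-bad log-likelihood, with the constant dropped and the two terms combined in log space for numerical safety. Here beta is the bad-data fraction and gamma the factor by which bad-data uncertainties are inflated; the data values and the choices beta = 0.1, gamma = 100 are hypothetical:

```python
import numpy as np

def log_like_good_bad(d, m, sigma0, beta, gamma):
    """Sum_i log[(beta/gamma) exp(-R^2/(2 gamma^2)) + (1-beta) exp(-R^2/2)]."""
    R2 = ((d - m) / sigma0) ** 2
    # "Bad" term; with beta = 0 it vanishes, so its log is -infinity
    log_bad = np.log(beta / gamma) - 0.5 * R2 / gamma**2 if beta > 0 else -np.inf
    log_good = np.log1p(-beta) - 0.5 * R2
    return np.sum(np.logaddexp(log_bad, log_good))

d = np.array([1.0, 1.1, 0.9, 11.0])     # hypothetical data; last point is wild
ll_ls = log_like_good_bad(d, 1.0, 0.1, beta=0.0, gamma=100.0)   # least squares
ll_gb = log_like_good_bad(d, 1.0, 0.1, beta=0.1, gamma=100.0)
```

With beta = 0 the result is exactly the least-squares sum of –R²/2, dominated by the wild point; with beta > 0 the wild point is absorbed by the broad "bad" component and costs far less.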

Estimation of parameter PDFs:

Bayesian fitting result: histogram of M-H trial parameter values (PDF)

This “integrates” over all values of all other parameters and is the parameter estimate “marginalized” over all other parameters.

Parameter correlations: eg, plot all trial values of x_i versus x_j
