TRANSCRIPT
Statistical Modelling of Physical Systems: An Introduction to Grey Box Modelling
Henrik Madsen
ESI 101 Course, NREL, Colorado
Henrik Madsen 1
Outline
1. Introduction
2. Selection of modelling framework
   - Nonlinear versus linear modelling
   - Why Stochastic Differential Equations?
   - Grey Box Modelling
3. The Stochastic State Space Model
4. Model Identification, Estimation, Selection and Validation
5. Identification of model order
6. Identification of functional relations
7. Estimation - the maximum likelihood principle
   - The information matrix
   - Distribution of the ML estimator
   - Tests for individual parameters
   - Likelihood ratio tests
8. Model Validation
9. Software - CTSM-R
10. Summary
Introduction
• Various methods of advanced modelling are needed for an increasing number of complex physical, chemical and biological systems.
• For a model to describe the future evolution of the system, it must
  1. capture the inherently non-linear behaviour of the system, and
  2. provide means to accommodate noise due to approximation and measurement errors.
• This calls for methods that are capable of bridging the gap between physical and statistical modelling.
Which type of model to use?
• Simple / Complex
• Lumped / Distributed
• Linear / Non-linear
• Time-invariant / Time-varying
• Discrete / Continuous time
• Deterministic / Stochastic
• Black box / Grey box / White box
• Parametric / Non-parametric

Base the decision on:
• Purpose
• Prior knowledge
• Available data
• Tools
Nonlinear versus linear modelling
• The aim of the modelling effort may be expressed generally as follows: find a nonlinear function h such that {E_t} defined by

    h(X_t, X_{t-1}, ...) = E_t                                    (1)

  is a sequence of independent random variables.
• Suppose also that the model is causally invertible, i.e. the equation above may be 'solved' such that we may write

    X_t = h'(E_t, E_{t-1}, ...).                                  (2)
Nonlinear vs. linear model building (cont.)
• Suppose that h' is sufficiently well-behaved to be expanded in a Taylor series:

    X_t = μ + Σ_{k=0}^∞ g_k E_{t-k} + Σ_{k=0}^∞ Σ_{l=0}^∞ g_{kl} E_{t-k} E_{t-l}
          + Σ_{k=0}^∞ Σ_{l=0}^∞ Σ_{m=0}^∞ g_{klm} E_{t-k} E_{t-l} E_{t-m} + ...   (3)

• The functions

    μ = h'(0),   g_k = ∂h'/∂E_{t-k},   g_{kl} = ∂²h'/(∂E_{t-k} ∂E_{t-l}),   etc.   (4)

  define the Volterra series for the process {X_t}. The sequences g_k, g_{kl}, ... are called the kernels of the Volterra series.
Non-linear vs. linear model building (cont.)
• For linear systems

    g_{kl} = g_{klm} = g_{klmn} = ... = 0                          (5)

• Hence the system is completely characterized by either

    {g_k} : the impulse response function

  or

    H(ω) : the frequency response function
Non-linear vs. linear model building (cont.)
• In general there is no such thing as a transfer function for non-linear systems.
• However, an infinite sequence of generalized transfer functions may be defined as

    H1(ω1) = Σ_{k=0}^∞ g_k e^{-i ω1 k}

    H2(ω1, ω2) = Σ_{k=0}^∞ Σ_{l=0}^∞ g_{kl} e^{-i(ω1 k + ω2 l)}

    H3(ω1, ω2, ω3) = Σ_{k=0}^∞ Σ_{l=0}^∞ Σ_{m=0}^∞ g_{klm} e^{-i(ω1 k + ω2 l + ω3 m)}

    ...
Non-linear vs. linear model building (cont.)
Let U_t and X_t denote the input and the output of a given system.
• For linear systems it is well known that:
  L1 If the input is a single harmonic U_t = A_0 e^{i ω0 t}, then the output is a single harmonic of the same frequency, but with the amplitude scaled by |H(ω0)| and the phase shifted by arg H(ω0).
  L2 Due to the linearity, the principle of superposition is valid, and the total output is the sum of the outputs corresponding to the individual frequency components of the input. (Hence the system is completely described by knowing the response to all frequencies - that is what the transfer function supplies.)
• For non-linear systems, however, neither of the properties (L1) or (L2) holds.
  NL1 For an input with frequency ω0, the output will in general also contain components at the frequencies 2ω0, 3ω0, ... (frequency multiplication).
  NL2 For two inputs with frequencies ω0 and ω1, the output will contain components at frequencies ω0, ω1, (ω0 + ω1) and all harmonics of these frequencies (intermodulation distortion).
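The frequency-multiplication property (NL1) is easy to verify numerically. The sketch below, with an assumed static quadratic nonlinearity, passes a single harmonic through a linear and a non-linear system and reads off the DFT magnitude at ω0 and 2ω0:

```python
import cmath
import math

def dft_magnitude(x, freq, n):
    """Magnitude of the normalized DFT of x at a cyclic frequency (cycles/sample)."""
    return abs(sum(x[t] * cmath.exp(-2j * math.pi * freq * t) for t in range(n)) / n)

n = 1024
f0 = 1 / 64                                       # input frequency omega_0
u = [math.cos(2 * math.pi * f0 * t) for t in range(n)]

y_lin = [2.0 * ut for ut in u]                    # linear system: pure gain
y_nl = [ut + 0.5 * ut ** 2 for ut in u]           # nonlinear: quadratic distortion

for name, y in [("linear", y_lin), ("nonlinear", y_nl)]:
    a1 = dft_magnitude(y, f0, n)                  # component at omega_0
    a2 = dft_magnitude(y, 2 * f0, n)              # component at 2*omega_0
    print(name, round(a1, 3), round(a2, 3))       # linear: 1.0 0.0, nonlinear: 0.5 0.125
```

The linear system has no energy at 2ω0, while the quadratic term creates a component at exactly twice the input frequency, as stated in NL1.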
Why Stochastic Differential Equations? Problem Scenario
Ordinary differential equation:

    dA = -K A dt
    Y = A + E
ODE
[Figure: data and the fitted ODE solution; concentration (log scale) versus time, 0-25.]
• Autocorrelated residuals!
Problem Scenario
Stochastic differential equation:

    dA = -K A dt + dw
    Y = A + e
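The contrast between the two formulations can be reproduced with a short Euler-Maruyama simulation. This is a sketch; all numeric values (decay rate K, noise levels, step size) are assumptions for illustration, not taken from the slides:

```python
import math
import random

random.seed(1)

K, dt, n = 0.2, 0.1, 250           # assumed decay rate, step size, number of steps
sigma_w, sigma_e = 0.05, 0.02      # assumed system and measurement noise levels

A_ode, A_sde = 1.0, 1.0
y_ode, y_sde = [], []
for _ in range(n):
    # ODE: deterministic Euler step; SDE: Euler-Maruyama adds sqrt(dt)-scaled noise
    A_ode += -K * A_ode * dt
    A_sde += -K * A_sde * dt + sigma_w * math.sqrt(dt) * random.gauss(0, 1)
    y_ode.append(A_ode + sigma_e * random.gauss(0, 1))
    y_sde.append(A_sde + sigma_e * random.gauss(0, 1))

# Residuals of the deterministic ODE trajectory against the SDE-generated data:
# the ODE cannot absorb the system noise, so these residuals are autocorrelated.
res = [ys - yo for ys, yo in zip(y_sde, y_ode)]

def lag1_autocorr(r):
    m = sum(r) / len(r)
    num = sum((r[t] - m) * (r[t - 1] - m) for t in range(1, len(r)))
    den = sum((x - m) ** 2 for x in r)
    return num / den

print(round(lag1_autocorr(res), 3))
```

The lag-one autocorrelation of the ODE residuals comes out strongly positive, which is exactly the symptom shown in the figure on the next slide.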
ODE vs SDE
[Figure: data with the ODE and SDE fits; concentration (log scale) versus time, 0-25.]
• Uncorrelated residuals
• System noise
• Measurement noise
Grey box modelling of oxygen concentration - A sketch of the physical system
Grey box modelling of oxygen concentration - A white box model
Model found in the literature:

    dC/dt = (K/√h)(Cm(T) - C) + P(I)/h - R(T)/h

    P(I) = Pm E0 I / (Pm + E0 I)   (≈ βI)
    R(T) = R15 θ^(T-15)   [mg/l]
    Cm(T) = 14.54 - 0.39 T + 0.01 T²   [mg/l]

• Simple - however, a non-linear model.
• The uncertainty of the prediction does not depend on the horizon.
Model validation
The autocorrelation and partial autocorrelation functions for the residuals from the first order model.
Grey box model of oxygen concentration
The following nonlinear state space (Hidden Markov) model has been found.

The system equation:

    [dC]   [-(K/√h + Kb)  -Kc] [C]        [β/h  0] [I ]        [(K/√h) Cm(T) - R(T)/h]        [dW1(t)]
    [dL] = [ K3           -Kl] [L] dt  +  [0    γ] [Pr] dt  +  [0                    ] dt  +  [dW2(t)]

The observation equation:

    Cr = [1  0] [C  L]' + e
Model types
• White box models - the model structure is known and deterministic; uncertainty is discarded and the model tends to be overspecified.
• Black box models - data based (input/output) models; the model and its parameters have little physical significance.
• Grey box models - between white and black box models.
The grey box modelling concept
• Combines prior physical knowledge with information in data.
• The model is not completely described by physical equations, but the equations and the parameters are physically interpretable.
Why use grey box modelling?
• Prior physical knowledge can be used.
• Non-linear and non-stationary models are easily formulated.
• Missing data are easily accommodated.
• It is possible to estimate environmental variables that are not measured.
• Available physical knowledge and statistical modelling tools are combined to estimate the parameters of a rather complex dynamic system.
• The parameters contain information from the data that can be directly interpreted by the scientists.
• Fewer parameters → more power in the statistical tests.
• The physical expert and the statistician can collaborate in the model formulation.
Stochastic Differential Equations (SDEs)
• Ordinary Differential Equations (ODEs) provide a deterministic description of a system:

    dX_t = f(X_t, U_t, t) dt,   t ≥ 0,

  where f is a known function of the time t, the state X and the input U.
• To describe the deviation between the ODE and the true variation of the state, an additive noise term is introduced.
• Physical arguments for including the noise term:
  1. Modelling approximations.
  2. Unrecognized inputs.
  3. Measurements of the input are noise corrupted.
The continuous-discrete time non-linear stochastic state space model
The system equation (a set of Itô stochastic differential equations):

    dX_t = f(X_t, U_t, θ) dt + G(X_t, U_t, θ) dW_t,   X_{t0} = X_0

Notation:
    X_t ∈ R^n : state vector
    U_t ∈ R^r : known input vector
    f : drift term
    G : diffusion term
    W_t : Wiener process of dimension d with incremental covariance Q_t
    θ ∈ Θ ⊆ R^p : unknown parameter vector
The observation equation
The observations are in discrete time, functions of state, input and parameters, and are subject to noise:

    Y_{ti} = h(X_{ti}, U_{ti}, θ) + e_{ti}

Notation:
    Y_{ti} ∈ R^m : observation vector
    h : observation function
    e_{ti} : Gaussian white noise with covariance Σ_{ti}

Observations are available at the time points t_i: t_1 < ... < t_i < ... < t_N.
X_0, W_t and e_{ti} are assumed independent for all (t, t_i).
Advantages of HMMs with the dynamics described by SDEs
• Provide a decomposition of the total error into process error and measurement error.
• Facilitate the use of statistical tools for model validation.
• Provide a systematic framework for pinpointing model deficiencies (demonstrated later on).
• Covariances of the system error and the measurement error are estimated.
• SDE based estimation gives more accurate and reliable parameter estimates than ODE based estimation.
• SDEs give more correct (more accurate and realistic) predictions and simulations.
Methods for Identification, Estimation and Model Validation
• Model Identification: see the next slides.
• Parameter Estimation:
  - Maximum likelihood methods
• Model testing/selection:
  - Tests for significant parameters (typically t-tests)
  - Tests for model reductions (typically likelihood ratio tests)
  - Alternatively: information criteria
• Model Validation:
  - Test whether the estimated model describes the data.
  - Autocorrelation functions - or lag dependent functions.
  - Other classical methods ...
Identification of model order (here: the number of states)
Use the autocorrelation and partial autocorrelation functions.
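As a sketch of the idea, the sample ACF can be computed directly. The AR(1) example series below is an assumption for illustration, not from the slides; its geometrically decaying ACF is the signature of a single-state (first order) model:

```python
import random

def acf(x, max_lag):
    """Sample autocorrelation function of the series x up to max_lag."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x) / n
    out = []
    for k in range(1, max_lag + 1):
        ck = sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / n
        out.append(ck / c0)
    return out

random.seed(2)
# Simulate an AR(1) series x_t = 0.8 x_{t-1} + e_t: one state should suffice.
x = [0.0]
for _ in range(2000):
    x.append(0.8 * x[-1] + random.gauss(0, 1))

r = acf(x, 5)
print([round(v, 2) for v in r])   # roughly 0.8, 0.64, 0.51, ... (geometric decay)
```

A geometric decay in the ACF combined with a single significant PACF spike points to a first order model; slower or oscillating decay suggests more states.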
Identification of input variables and e.g. time delays
Use the pre-whitening procedure (pp. 223-226 in the Time Series Analysis book) or ridge regression (pp. 227-228 in TSA).
Identification of functional relations
Use non-parametric methods (kernels, smoothing splines, etc.) to estimate the conditional mean and the conditional variance.
• The conditional mean enters the drift term.
• The conditional variance enters the diffusion term.
Identification of Model Structure
• The diffusion term gives information for pinpointing model deficiencies.
• Assume that, based on 'large' values of the relevant diffusion term(s), we suspect a parameter r ∈ θ to be a function of the states, inputs or time.
• Then consider the extended state space model:

    dX_t = f(X_t, U_t, θ) dt + G(X_t, U_t, θ) dW_t,   X_{t0} = X_0
    dr_t = dW_t*
    Y_{ti} = h(X_{ti}, U_{ti}, θ) + e_{ti}                         (6)

  which corresponds to a random walk description of r_t.
Identification of Model Structure
• Do we observe a significant reduction of the relevant diffusion term(s)?
• In that case, calculate the smoothed state estimate r̂_{t|N} (using for instance the software tool CTSM-R - see the slides by Niamh O'Connell).
• Plot r̂_{t|N} versus the states, inputs and time.
• Identify a possible functional relationship.
• Build that functional relationship into the stochastic state space model.
• Estimate the model parameters and evaluate the improvement, e.g. using likelihood ratio tests.
The information matrix
Definition (Observed information)
The matrix

    j(θ; y) = - ∂² l(θ; y) / (∂θ ∂θᵀ)

with the elements

    j(θ; y)_{ij} = - ∂² l(θ; y) / (∂θ_i ∂θ_j)

is called the observed information corresponding to the observation y, evaluated at θ̂.

The observed information is thus equal to the Hessian (with opposite sign) of the log-likelihood function. The Hessian matrix is simply (with opposite sign) the curvature of the log-likelihood function.
Invariance property
Theorem (Invariance property)
Assume that θ̂ is a maximum likelihood estimator for θ, and let ψ = ψ(θ) denote a one-to-one mapping of Ω ⊂ R^k onto Ψ ⊂ R^k. Then the estimator ψ(θ̂) is a maximum likelihood estimator for the parameter ψ(θ).

The principle is easily generalized to the case where the mapping is not one-to-one.
Distribution of the ML estimator
Theorem (Distribution of the ML estimator)
Assume that θ̂ is consistent. Then, under some regularity conditions,

    θ̂ - θ → N(0, i(θ)⁻¹)

where i(θ) is the expected information (the information matrix).

The result can be used for inference under very general conditions. As the price for this generality, the result is only asymptotically valid.
• Asymptotically, the variance of the estimator is seen to be equal to the Cramer-Rao lower bound for any unbiased estimator.
• The practical significance of this result is that the MLE makes efficient use of the available data for large data sets.
Distribution of the ML estimator
In practice, we would use

    θ̂ ∼ N(θ, j⁻¹(θ̂))

where j(θ̂) is the observed (Fisher) information.

This means that asymptotically

    i)  E[θ̂] = θ
    ii) D[θ̂] = j⁻¹(θ̂)
Distribution of the ML estimator
• The standard error of θ̂_i is given by

    σ_θ̂i = sqrt(Var_ii[θ̂])

  where Var_ii[θ̂] is the i'th diagonal term of j⁻¹(θ̂).
• Hence an estimate of the dispersion (variance-covariance matrix) of the estimator is

    D[θ̂] = j⁻¹(θ̂)

• An estimate of the uncertainty of the individual parameter estimates is obtained by decomposing the dispersion matrix as

    D[θ̂] = σ̂_θ R σ̂_θ

  into σ̂_θ, a diagonal matrix of the standard deviations of the individual parameter estimates, and R, the corresponding correlation matrix. The value R_ij is thus the estimated correlation between θ̂_i and θ̂_j.
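The decomposition above can be sketched in a few lines. The 2x2 observed-information matrix below is an assumed illustrative value, not from the slides:

```python
import math

# Assumed 2x2 observed-information matrix j(theta-hat), for illustration only.
j = [[50.0, 10.0],
     [10.0, 40.0]]

# Dispersion matrix D = j^{-1} (2x2 inverse in closed form).
det = j[0][0] * j[1][1] - j[0][1] * j[1][0]
D = [[ j[1][1] / det, -j[0][1] / det],
     [-j[1][0] / det,  j[0][0] / det]]

# Decompose D = S R S into standard deviations S and correlation matrix R.
s = [math.sqrt(D[0][0]), math.sqrt(D[1][1])]
R = [[D[i][k] / (s[i] * s[k]) for k in range(2)] for i in range(2)]

print([round(v, 4) for v in s])   # standard errors of the two estimates
print(round(R[0][1], 4))          # estimated correlation between the estimates
```

The diagonal of R is 1 by construction, and R_12 summarizes how strongly the two parameter estimates are entangled.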
The Wald Statistic
A test of an individual parameter,

    H0: θ_i = θ_{i,0},

is given by the Wald statistic

    Z_i = (θ̂_i - θ_{i,0}) / σ̂_θ̂i

which under H0 is approximately N(0, 1)-distributed.
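A minimal sketch of the Wald test; the estimate and standard error below are hypothetical values, not from the slides:

```python
import math

def wald_z(theta_hat, theta0, se):
    """Wald statistic Z_i = (theta_hat_i - theta_i0) / se_i."""
    return (theta_hat - theta0) / se

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical test of H0: theta_i = 0 with theta_hat = 0.42 and se = 0.15.
z = wald_z(0.42, 0.0, 0.15)
p = 2.0 * (1.0 - norm_cdf(abs(z)))   # two-sided p-value under N(0, 1)
print(round(z, 2), round(p, 4))      # 2.8 0.0051
```

With |Z| = 2.8 the parameter would be judged significantly different from zero at the usual 5% level.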
Example: Quadratic approximation of the log-likelihood
Consider again the situation where the log-likelihood function is

    l(θ) = y log θ + (n - y) log(1 - θ) + const.

The score function is

    l'(θ) = y/θ - (n - y)/(1 - θ),

and the observed information is

    j(θ) = y/θ² + (n - y)/(1 - θ)².

For n = 10, y = 3 and θ̂ = 0.3 we obtain

    j(θ̂) = 47.6

The log-likelihood function and the corresponding quadratic approximation are shown in panel (a) of the figure; the approximation is poor. By increasing the sample size to n = 100 (still with θ̂ = 0.3, i.e. y = 30), the approximation is much better, as seen in panel (b).

At the point θ̂ = y/n we have

    j(θ̂) = n / (θ̂(1 - θ̂))

and we find the variance of the estimate

    Var[θ̂] = j⁻¹(θ̂) = θ̂(1 - θ̂)/n.

Figure: Quadratic approximation of the log-likelihood function for (a) n = 10, y = 3 and (b) n = 100, y = 30.
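The numbers in the binomial example can be reproduced directly:

```python
import math

def loglik(theta, n, y):
    """Binomial log-likelihood l(theta) up to an additive constant."""
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

def observed_info(theta, n, y):
    """Observed information j(theta) = y/theta^2 + (n - y)/(1 - theta)^2."""
    return y / theta ** 2 + (n - y) / (1 - theta) ** 2

n, y = 10, 3
theta_hat = y / n
print(round(observed_info(theta_hat, n, y), 1))   # 47.6 as in the example

# Variance of the estimate: Var[theta_hat] = theta_hat (1 - theta_hat) / n
print(round(theta_hat * (1 - theta_hat) / n, 3))  # 0.021
```

Scaling n and y by ten multiplies j(θ̂) by ten, which is why the quadratic approximation tightens with sample size.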
Likelihood ratio tests
• We want to know the distribution of D under H0 (model B).
• It is sometimes possible to calculate the exact distribution. This is for instance the case for the General Linear Model for Gaussian data.
• In most cases, however, we must use the following important result regarding the asymptotic behaviour.

Theorem (Wilks' likelihood ratio test)
The random variable

    D = 2(l_A(θ̂_A; Y) - l_B(θ̂_B; Y))

converges in law to a χ² random variable with f = dim(Ω_A) - dim(Ω_B) degrees of freedom, i.e.

    D → χ²(f)

under H0.
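A sketch of the test in practice; the two log-likelihood values are hypothetical, and the χ² tail probability is computed with the closed form that holds for even degrees of freedom:

```python
import math

def chi2_sf(x, f):
    """Chi-square survival function P(X > x); closed form for even f:
    P(X > x) = exp(-x/2) * sum_{k=0}^{f/2 - 1} (x/2)^k / k!"""
    assert f % 2 == 0
    return math.exp(-x / 2) * sum((x / 2) ** k / math.factorial(k)
                                  for k in range(f // 2))

# Hypothetical maximized log-likelihoods of the full model A and the reduced
# model B, which differ by two parameters.
llA, llB = -120.3, -124.1
D = 2 * (llA - llB)              # Wilks' statistic
p = chi2_sf(D, f=2)              # compare with chi^2(2)
print(round(D, 2), round(p, 4))  # 7.6 0.0224
```

With p ≈ 0.02 the reduction from model A to model B would be rejected at the 5% level, so the extra parameters are kept.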
Flexhouse layout
RC-diagram often used for illustrating linear models
[Figure: RC-network with thermal resistances Rie and Rea, temperatures Ti, Ta and Te, heat input Φh, and heat capacities Ci and Ce.]
Model A
[Figure: RC-network with thermal resistance Ria, temperatures Ti and Ta, heat input Φh, heat capacity Ci, and solar input Aw Φs.]
Aw is the effective window area.
Model Validation for Model A
Autocorrelation function and periodogram for the residuals.
[Figure: ACF of the residuals versus lag (0-35), and periodogram versus frequency (0-0.5).]
The model is seen not to be adequate.
Model E
After some steps. (Notice that e.g. the electric heating system is now included.)
Model Validation for Model E
[Figure: ACF of the residuals versus lag (0-35), and periodogram versus frequency (0-0.5).]
It is concluded that the model is adequate.
Continuous Time Stochastic Modelling (CTSM-R)
• The parameter estimation is performed using the software CTSM-R.
• The software has been developed at DTU Compute.
• Download from www.ctsm.info (see also the slides by Niamh O'Connell).
• The program also returns the uncertainty of the parameter estimates.
The estimation procedure (CTSM-R)
CTSM-R is based on
• The Extended Kalman Filter
• Approximate likelihood estimation

and provides e.g.
• Likelihood testing for nested models
• Calculation of smoothed states E[X_t | Y_T]
• Calculation of k-step predictions E[X_t | Y_{t-k}]
• Calculation of noise-free simulations E[X_t | Y_{t0}]
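For the linear first order example dA = -K A dt + dw observed as Y = A + e, the Extended Kalman Filter reduces to the ordinary Kalman filter, and the likelihood is built from the one-step prediction errors. The sketch below illustrates that mechanism; all numeric values are assumptions, and it is a toy version of what CTSM-R does, not its actual implementation:

```python
import math
import random

def neg_log_lik(K, q, r, y, dt):
    """Negative log-likelihood of observations y under the discretized model
    x_{k+1} = exp(-K dt) x_k + system noise,  y_k = x_k + e_k."""
    phi = math.exp(-K * dt)                    # exact discretization of the drift
    qd = q * (1 - phi ** 2) / (2 * K)          # discretized system noise variance
    x, P = y[0], r                             # crude initialization
    nll = 0.0
    for obs in y[1:]:
        x, P = phi * x, phi ** 2 * P + qd      # prediction step
        S = P + r                              # innovation variance
        innov = obs - x                        # one-step prediction error
        nll += 0.5 * (math.log(2 * math.pi * S) + innov ** 2 / S)
        gain = P / S                           # update step
        x += gain * innov
        P *= (1 - gain)
    return nll

# Simulate data from the true model (assumed values for illustration).
random.seed(3)
dt, K_true = 0.5, 0.4
phi = math.exp(-K_true * dt)
qd = 0.01 * (1 - phi ** 2) / (2 * K_true)
a, y = 1.0, [1.0]
for _ in range(400):
    a = phi * a + math.sqrt(qd) * random.gauss(0, 1)
    y.append(a + 0.1 * random.gauss(0, 1))

# The likelihood prefers the true decay rate over a badly wrong one.
print(neg_log_lik(0.4, 0.01, 0.01, y, dt) < neg_log_lik(4.0, 0.01, 0.01, y, dt))
```

Maximizing this likelihood over (K, q, r) is exactly the estimation step; in CTSM-R the filter is extended to nonlinear drift and observation functions.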
The continuous-discrete time stochastic state space formulation
General formulation:

    dx_t = f(x_t, u_t, θ, t) dt + σ(x_t, u_t, θ, t) dw_t
    y_k = h(x_{tk}, u_{tk}, θ, e_k, t_k),

where
• x_t is the continuous time state and y_k ∈ R^l is the discrete time observation,
• u_t ∈ R^r is the input vector,
• θ ∈ R^p is a parameter vector,
• e_k ∈ R^l is a random observation error.
The estimation procedure (CTSM) - Limitations
The most general set-up in CTSM:

    dx_t = f(x_t, u_t, θ, t) dt + σ(u_t, θ, t) dw_t
    y_k = h(x_{tk}, u_{tk}, θ, t_k) + e_k,

where
• σ ∈ R^{n×n} is a square matrix, independent of the state,
• e_k ∼ N(0, S_k(θ, u_k)) is a Gaussian random variable.
Transformation of the State Space 1
Consider the system equation

    dx_t = f(x_t, u_t, θ, t) dt + σ(x_t, u_t, θ, t) R(u_t, θ, t) dw_t,

where R(u_t, θ, t) ∈ R^{n×n} is any matrix function and σ ∈ R^{n×n} is a diagonal matrix with

    σ_ii(x_t, u_t, θ, t) = σ_i(x_{i,t}, u_t, θ, t)
Transformation of the State Space 2
Choose the transformation

    z_{i,t} = ψ_i(x_{i,t}, u_t, θ, t) = ∫ dξ / σ_i(ξ, u_t, θ, t)   evaluated at ξ = x_i;

then by Itô's lemma z_i is also an Itô process, given by

    dz_{i,t} = [ ∂ψ_i(·, t)/∂t + f_i(·)/σ_i(·) - (1/2)(∂σ_i(·)/∂x_i) Σ_{j=1}^n [R(·)]_{ij}² ] dt
               + Σ_{j=1}^n [R(·)]_{ij} dw_j,

where the diffusion term is now independent of the state z_i.
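As a worked instance of this transformation, assume a scalar state-proportional diffusion σ_i(x) = σx (an illustrative choice, not a case treated on the slide) with scalar R = 1:

```latex
z_t = \psi(x_t) = \int^{x_t} \frac{d\xi}{\sigma\,\xi} = \frac{\ln x_t}{\sigma},
\qquad
dz_t = \left( \frac{f(x_t)}{\sigma x_t} - \frac{\sigma}{2} \right) dt + dw_t .
```

Applying Itô's lemma with ψ_x = 1/(σx) and ψ_xx = -1/(σx²) reproduces the drift correction -σ/2, and the diffusion coefficient of z_t is the constant 1, independent of the state, as the general formula states.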
Summary
By applying the grey box modelling approach
• physical/prior knowledge and information in data are combined, i.e. we have bridged the gap between physical and statistical modelling.
• many statistical, mathematical and physical methods for model validation and structure modification become available.
• parameter estimates have physical significance - seldom the case for black box models.
• we obtain more accurate predictions and more realistic prediction intervals.
• we obtain realistic simulations (ODE based models do not provide a reasonable framework for simulations).
Some References
• H. Madsen: Time Series Analysis, Chapman and Hall, 392 pp., 2008.
• J.M. Morales, A.J. Conejo, H. Madsen, P. Pinson, M. Zugno: Integrating Renewables in Electricity Markets, Springer, 430 pp., 2013.
• H.Aa. Nielsen, H. Madsen: A generalization of some classical time series tools, Computational Statistics and Data Analysis, Vol. 37, pp. 13-31, 2001.
• H. Madsen, J. Holst: Estimation of Continuous-Time Models for the Heat Dynamics of a Building, Energy and Buildings, Vol. 22, pp. 67-79, 1995.
• J.N. Nielsen, H. Madsen, P.C. Young: Parameter Estimation in Stochastic Differential Equations: An Overview, Annual Reviews in Control, Vol. 24, pp. 83-94, 2000.
• N.R. Kristensen, H. Madsen, S.B. Jørgensen: A method for systematic improvement of stochastic grey-box models, Computers and Chemical Engineering, Vol. 28, pp. 1431-1449, 2004.
• N.R. Kristensen, H. Madsen, S.B. Jørgensen: Parameter estimation in stochastic grey-box models, Automatica, Vol. 40, pp. 225-237, 2004.
• J.B. Jørgensen, M.R. Kristensen, P.G. Thomsen, H. Madsen: Efficient numerical implementation of the continuous-discrete extended Kalman filter, Computers and Chemical Engineering, 2007.
• K.R. Philipsen, L.E. Christiansen, H. Hasman, H. Madsen: Modelling conjugation with stochastic differential equations, Journal of Theoretical Biology, Vol. 263, pp. 134-142, 2010.
• P. Bacher, H. Madsen: Identifying suitable models for the heat dynamics of buildings, Energy and Buildings, Vol. 43, pp. 1511-1522, 2011.