
2 Gaussian Distributions and the Heat Equation

In this chapter the Gaussian distribution is defined and its properties are explored. The chapter starts with the definition of a Gaussian distribution on the real line. In the process of exploring the properties of the Gaussian on the line, the Fourier transform and heat equation are introduced, and their relationship to the Gaussian is developed. The Gaussian distribution in multiple dimensions is defined, as are clipped and folded versions of this distribution. Some concepts from probability and statistics such as mean, variance, marginalization, and conditioning of probability densities are introduced in a concrete way using the Gaussian as the primary example. The properties of the Gaussian distribution are fundamental to understanding the concept of white noise, which is the driving process for all of the stochastic processes studied in this book.

The main things to take away from this chapter are:

• To become familiar with the Gaussian distribution and its properties, and to be comfortable in performing integrals involving multi-dimensional Gaussians;

• To become acquainted with the concepts of mean, covariance, and information-theoretic entropy;

• To understand how to marginalize and convolve probability densities, to compute conditional densities, and to fold and clip Gaussians;

• To observe that there is a relationship between Gaussian distributions and the heat equation;

• To know where to begin if presented with a diffusion equation, the symmetries of which are desired.

2.1 The Gaussian Distribution on the Real Line

2.1.1 Defining Parameters

The Gaussian distribution on the real line is any function of the form ρ(x − x₀) where

ρ(x) = c e^{−ax²}    (2.1)

and c ∈ ℝ_{>0} is related to a ∈ ℝ_{>0} by the constraint that

I ≐ ∫_{−∞}^{∞} ρ(x) dx = 1.    (2.2)

G.S. Chirikjian, Stochastic Models, Information Theory, and Lie Groups, Volume 1: Classical Results and Geometric Methods, Applied and Numerical Harmonic Analysis, DOI 10.1007/978-0-8176-4803-9_2, © Birkhäuser Boston, a part of Springer Science + Business Media, LLC 2009


This constraint, together with the fact that ρ(x) ≥ 0, makes it a probability density function (or pdf for short). That is, any non-negative function satisfying (2.2) (not only those of the form in (2.1)) is a pdf.

The Gaussian distribution is the “bell curve” so often referred to when discussing statistical quantities. It is an infinitely differentiable function. Taking the first derivative gives

dρ/dx = −2acx e^{−ax²}.

From this it is clear that ρ(x) has a critical point at x = 0, and this is its only critical point. The second derivative of ρ(x) evaluated at x = 0 is

d²ρ/dx²|_{x=0} = −2ac,

which is always negative, indicating that x = 0 is a maximum, and the maximal value that ρ(x) can attain is c. Furthermore, due to the negative sign in the exponential, the function ρ(x) decays to zero very rapidly as |x| increases. The Gaussian distribution is called unimodal because it has only one local maximum, or mode.

To determine the functional relationship between c and a that ensures that I = 1, the following trick can be used. First evaluate

I² = c² (∫_{−∞}^{∞} e^{−ax²} dx)² = c² ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−a(x²+y²)} dx dy.

Then, changing to the polar coordinates x = r cos θ and y = r sin θ, it becomes clear that

I² = c² ∫_0^{2π} ∫_0^{∞} e^{−ar²} r dr dθ.

The integral over θ reduces to 2π and the integral over r can be performed in closed form. The resulting relationship between c and a is then I² = c²π/a = 1, or

c = √(a/π).    (2.3)
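The normalization (2.3) can be checked numerically; the sketch below uses a midpoint-rule quadrature on a wide truncated interval (the helper name gaussian_integral is illustrative, not from any particular library):

```python
import math

def gaussian_integral(a, half_width=50.0, n=200_000):
    """Midpoint-rule estimate of I = integral of c*exp(-a x^2) over the line,
    with c = sqrt(a/pi) as in (2.3), truncated to [-half_width, half_width]."""
    c = math.sqrt(a / math.pi)
    h = 2 * half_width / n
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * h
        total += c * math.exp(-a * x * x) * h
    return total

# With the constant from (2.3), the integral I equals 1 for any a > 0.
for a in (0.5, 1.0, 2.0):
    assert abs(gaussian_integral(a) - 1.0) < 1e-6
```

Because the integrand is smooth and decays rapidly, the midpoint rule converges very quickly here, so the tolerance above is conservative.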

The Gaussian distribution is an even function, and for any finite positive value of a it is also a “nice” function. An even function is one for which f_e(x) = f_e(−x) and an odd function is one for which f_o(x) = −f_o(−x). Any function can be decomposed into a sum of even and odd functions as f(x) = f_e(x) + f_o(x) where

f_e(x) = (1/2)[f(x) + f(−x)]  and  f_o(x) = (1/2)[f(x) − f(−x)].

Furthermore, the product of two even functions is even, the product of two odd functions is even, and the product of one even and one odd function is odd. The integral of any well-behaved odd function over any finite interval that is symmetric around the origin is always zero. This can be seen as follows:

∫_{−b}^{b} f_o(x) dx = ∫_{−b}^{0} f_o(x) dx + ∫_{0}^{b} f_o(x) dx,

but from the definition of an odd function,

∫_{−b}^{0} f_o(x) dx = −∫_{−b}^{0} f_o(−x) dx = −∫_{0}^{b} f_o(y) dy,

and so

∫_{−b}^{b} f_o(x) dx = 0.

For an even function,

∫_{−b}^{b} f_e(x) dx = 2 ∫_{0}^{b} f_e(x) dx.

For an even function, the product x · f_e(x) must be an odd function, and since odd functions integrate to zero over any interval [−b, b], it follows that

∫_{−∞}^{∞} x f_e(x) dx = lim_{b→∞} ∫_{−b}^{b} x f_e(x) dx = 0.

This limit would exist even if the upper and lower integrands go to ±∞ at different rates because f_e(x), like the other functions in this book, is restricted to be a “nice” function in the sense defined in (1.19), and hence it must decay to zero faster than 1/x as x → ±∞. More generally, the quantity μ defined by the integral

μ ≐ ∫_{−∞}^{∞} x f(x) dx

for any probability density function, f(x), is called the mean.

From the shift-invariance property of integration of an arbitrary integrable function on the real line,¹

∫_{−∞}^{∞} f(x − x₀) dx = ∫_{−∞}^{∞} f(x) dx,

it follows that for the special case of a Gaussian distribution shifted by μ, ρ(x − μ),

∫_{−∞}^{∞} x ρ(x − μ) dx = ∫_{−∞}^{∞} (y + μ) ρ(y) dy = 0 + μ · I = μ.

The median of the Gaussian distribution is the point m for which

∫_{−∞}^{m} ρ(x) dx = ∫_{m}^{∞} ρ(x) dx.

Due to the fact that the Gaussian distribution is an even function, m = 0.

In statistics it is useful to have indicators that describe how concentrated or how spread out a distribution is. One such indicator is the variance, defined as

σ² ≐ ∫_{−∞}^{∞} x² f(x − μ) dx.    (2.4)

¹Another often-glossed-over property of integration of functions on the real line that will be useful later is invariance under inversion of the argument:

∫_{−∞}^{∞} f(−x) dx = ∫_{−∞}^{∞} f(x) dx.


The square root of the variance is called the standard deviation. Note that this is different from

s ≐ ∫_{−∞}^{∞} |x| f(x − μ) dx,    (2.5)

which is called the spread. Of course, the concepts of mean, mode, variance, and spread are not limited to the study of Gaussian distributions. They can be calculated for any pdf.

For the Gaussian distribution in (2.1) with normalization (2.3), the mean, median, variance, and spread can be calculated in the following closed form:

μ = m = 0,  σ² = 1/(2a),  and  s = 1/√(πa).    (2.6)
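The closed forms σ² = 1/(2a) and s = 1/√(πa) can be confirmed numerically with the same midpoint-rule approach (helper name moments is illustrative):

```python
import math

def moments(a, half_width=40.0, n=200_000):
    """Midpoint-rule estimates of the variance and spread of
    rho(x) = sqrt(a/pi) * exp(-a x^2), which has mean zero."""
    c = math.sqrt(a / math.pi)
    h = 2 * half_width / n
    var = spread = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * h
        w = c * math.exp(-a * x * x) * h
        var += x * x * w        # integrand of (2.4)
        spread += abs(x) * w    # integrand of (2.5)
    return var, spread

a = 2.0
var, spread = moments(a)
assert abs(var - 1 / (2 * a)) < 1e-6                     # sigma^2 = 1/(2a)
assert abs(spread - 1 / math.sqrt(math.pi * a)) < 1e-6   # s = 1/sqrt(pi a)
```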

In general, non-Gaussian pdfs can have multiple modes, the mean and median need not be at the same point, and the relationship between spread and variance need not be so simple.

Since for a Gaussian these quantities are directly related to a, the Gaussian distribution can be redefined with σ² or s incorporated into the definition. The most common choice is to use σ², in which case the Gaussian distribution with mean at μ and standard deviation σ is denoted²

ρ(x; μ, σ²) = (1/(√(2π) σ)) e^{−(x−μ)²/(2σ²)}.    (2.7)

In some instances, such as in the following subsections, it will be more convenient to write this as ρ_{(μ,σ²)}(x). Note: another common name for the Gaussian distribution is the normal distribution. Figure 2.1 shows a plot of the Gaussian distribution with μ = 0 and σ = 1 plotted over the range [−3, 3]. Most (approximately 99.7 percent) of the probability density falls on this finite interval. Changing the value of μ or σ would only shift or uniformly stretch this plot.

The integral

F(x; μ, σ²) = ∫_{−∞}^{x} ρ(ξ; μ, σ²) dξ

is called the cumulative distribution function. This function is known to have a “closed-form” solution in terms of error integrals. In the limit as σ → 0, F(x; μ, σ²) exhibits a sharp transition from a value of 0 for x < μ to a value of 1 for x > μ. When μ = 0 this is idealized with the Heaviside step function

H(x) ≐ { 1 for x > 0;  0 for x ≤ 0 }.    (2.8)
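In terms of the standard error function erf, the cumulative distribution function satisfies F(x; μ, σ²) = (1/2)(1 + erf((x − μ)/(σ√2))), a standard identity that can be checked against the limiting Heaviside behavior:

```python
import math

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """F(x; mu, sigma^2) via the error function:
    F = 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# The median sits at mu, and for small sigma the CDF approaches
# the Heaviside step function H(x - mu) of (2.8).
assert abs(gaussian_cdf(0.0) - 0.5) < 1e-12
assert gaussian_cdf(0.1, mu=0.0, sigma=1e-4) > 0.999999
assert gaussian_cdf(-0.1, mu=0.0, sigma=1e-4) < 1e-6
```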

2.1.2 The Maximum Entropy Property

The entropy of a pdf f(x) is defined by the integral [23]

S(f) = −∫_{−∞}^{∞} f(x) log f(x) dx    (2.9)

²The symbols f(x) and ρ(x) often will be used to denote generic pdfs, but when appended as ρ(x; μ, σ²), this will always denote a Gaussian.


Fig. 2.1. The Gaussian Distribution ρ(x; 0, 1) Plotted over [−3, 3]

where here log = log_e = ln. This entropy is written as S(f) rather than S(f(x)) because it is not a function of x, but rather it is a “functional” of f, since all dependence on x has been integrated out.

S is computed in closed form for the Gaussian distribution as

S(ρ_{(μ,σ²)}) = log(√(2πe) σ).    (2.10)
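The closed form (2.10) can be checked against a direct numerical evaluation of (2.9) (the helper name below is illustrative):

```python
import math

def gaussian_entropy_numeric(sigma, half_width_sigmas=12.0, n=200_000):
    """Midpoint-rule estimate of S = -int rho log(rho) dx for rho(x; 0, sigma^2),
    truncated to +/- 12 standard deviations (the tail contribution is negligible)."""
    hw = half_width_sigmas * sigma
    h = 2 * hw / n
    s = 0.0
    for i in range(n):
        x = -hw + (i + 0.5) * h
        p = math.exp(-x * x / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)
        if p > 0.0:  # guard against floating-point underflow in the far tails
            s -= p * math.log(p) * h
    return s

for sigma in (0.5, 1.0, 3.0):
    closed_form = math.log(math.sqrt(2 * math.pi * math.e) * sigma)  # eq. (2.10)
    assert abs(gaussian_entropy_numeric(sigma) - closed_form) < 1e-6
```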

Interestingly, for any given value of variance, the Gaussian distribution is the pdf with maximal entropy. This can be shown by performing the following optimization:

max_f S(f)  subject to  f(x) ≥ 0

and

∫_{−∞}^{∞} f(x) dx = 1,  ∫_{−∞}^{∞} x f(x) dx = μ,  ∫_{−∞}^{∞} (x − μ)² f(x) dx = σ².    (2.11)

To find the distribution that satisfies these conditions, Lagrange multipliers³ are introduced to enforce the constraints, and the following necessary conditions are calculated:

∂C/∂f = 0  where  C = −f log f + λ₁ f + λ₂ x f + λ₃ (x − μ)² f.

Performing the above calculation and solving for f and the λᵢ that satisfy (2.11) gives f(x) = ρ_{(μ,σ²)}(x). Note that the constraint f(x) ≥ 0 was not actively enforced in the above derivation, but the result satisfies this condition anyway.
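The step from the necessary condition to the Gaussian form can be sketched in one line (a derivation sketch, not reproduced from the text):

```latex
% Setting the variational derivative of C to zero:
\frac{\partial C}{\partial f}
  = -\log f - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2 = 0
\;\Longrightarrow\;
f(x) = \exp\!\left(\lambda_1 - 1 + \lambda_2 x + \lambda_3 (x-\mu)^2\right).
% An exponential of a quadratic in x is a scaled Gaussian; imposing the three
% constraints (2.11) forces \lambda_2 = 0, \lambda_3 = -1/(2\sigma^2), and
% e^{\lambda_1 - 1} = 1/(\sqrt{2\pi}\,\sigma), i.e., f = \rho_{(\mu,\sigma^2)}.
```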

What the above shows is that ρ_{(μ,σ²)}(x) extremizes the entropy subject to the given constraints. In other words, ρ_{(μ,σ²)} is a critical point of the functional S(f) subject to the constraints (2.11). However, this could be a minimum, maximum, or point of inflection. To show that it actually maximizes the entropy (at least in a local sense), it is possible to define a perturbed version of this pdf as

f(x) = ρ_{(μ,σ²)}(x) · [1 + ε(x)]    (2.12)

where ε(x) is arbitrary except for the fact that⁴ |ε(x)| ≪ 1 and it is defined such that f(x) satisfies (2.11). In other words,

∫_{−∞}^{∞} ρ_{(μ,σ²)}(x) ε(x) dx = ∫_{−∞}^{∞} x ρ_{(μ,σ²)}(x) ε(x) dx = ∫_{−∞}^{∞} (x − μ)² ρ_{(μ,σ²)}(x) ε(x) dx = 0.

³See Section A.11.1 for a definition.

Substituting (2.12) into (2.9) and using the Taylor series approximation log(1 + ε) ≈ ε − ε²/2,

S(f) = −∫_{−∞}^{∞} ρ_{(μ,σ²)}(x) · [1 + ε(x)] log(ρ_{(μ,σ²)}(x) · [1 + ε(x)]) dx
     = −∫_{−∞}^{∞} ρ_{(μ,σ²)}(x) · [1 + ε(x)] · [log(ρ_{(μ,σ²)}(x)) + log(1 + ε(x))] dx
     = S(ρ_{(μ,σ²)}) − F(ε²) + O(ε³)

where the functional F is always positive and the cross terms that are linear in ε all vanish due to the integral constraints on ε. This means that at least locally a Gaussian maximizes entropy. Determining the exact form of the functional F is left as an exercise.

2.1.3 The Convolution of Gaussians

The convolution of two pdfs on the real line is defined as

(f₁ ∗ f₂)(x) ≐ ∫_{−∞}^{∞} f₁(ξ) f₂(x − ξ) dξ.    (2.13)

Sometimes this is written as f₁(x) ∗ f₂(x). Note that convolution on the real line is commutative: (f₁ ∗ f₂)(x) = (f₂ ∗ f₁)(x). This is a direct consequence of the commutativity of addition: x + y = y + x.

In order for the convolution integral to exist, f₁(x) and f₂(x) must both decay to zero sufficiently fast as x → ±∞. In addition, the scope here is restricted to “nice” functions in the sense of (1.19) with D = ℝ. Therefore these functions are infinitely differentiable and have integrals of their square and absolute values that are finite. It can be shown that the convolution integral will always exist for such “nice” functions, and furthermore

fᵢ ∈ N(ℝ) ⟹ f₁ ∗ f₂ ∈ N(ℝ).

In (2.13) ξ is a dummy variable of integration, the name of which is unimportant. A geometric interpretation of (2.13) is as follows. First, the function f₂(x) is shifted along the real line in the positive direction by an amount ξ, resulting in f₂(x − ξ). Then, the function f₁ evaluated at the amount of shift, f₁(ξ), is used to weight f₂(x − ξ). Finally, all copies of the product f₁(ξ)f₂(x − ξ) are “added up” by integrating over all values of the shift. This has the effect of “smearing” f₂ over f₁.

⁴To be concrete, ε = 0.01 ≪ 1. Then ε³ = 10⁻⁶ is certainly negligible in comparison to quantities that are on the order of 1.


In the case when f₁(x) = δ(x), i.e., the Dirac delta function, which is the probability density function with all of its mass concentrated at x = 0, (δ ∗ f)(x) = f(x). This is because the only shift that the delta function allows is ξ = 0. All other shifts are weighted by a value of zero, and therefore do not contribute. While δ(x) is not a “nice” function, it is possible to approximate it with a Gaussian distribution with very small variance, ε, which is a “nice” function. The approximation of the Dirac delta function as δ(x) ≈ ρ(x; 0, ε) is deemed to be “good enough” if the integral of |ρ(x; 0, ε) ∗ f(x) − f(x)| and the integral of the square of this are both “small enough” when f(x) is a nice function.

The Gaussian distribution has the property that the convolution of two Gaussians is a Gaussian:

ρ(x; μ₁, σ₁²) ∗ ρ(x; μ₂, σ₂²) = ρ(x; μ₁ + μ₂, σ₁² + σ₂²).    (2.14)

The Dirac δ-function can be viewed as the limit

δ(x) = lim_{σ→0} ρ(x; 0, σ²).    (2.15)

It then follows from (2.14) that

ρ(x; μ₁, σ₁²) ∗ δ(x) = ρ(x; μ₁, σ₁²).
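The closure property (2.14) can be verified by evaluating the convolution integral (2.13) numerically and comparing with the single Gaussian whose mean and variance are the sums (helper names illustrative):

```python
import math

def rho(x, mu, var):
    """The Gaussian density (2.7), parameterized by mean and variance."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def convolve_at(x, mu1, var1, mu2, var2, half_width=30.0, n=60_000):
    """Midpoint-rule estimate of (rho1 * rho2)(x) as defined in (2.13)."""
    h = 2 * half_width / n
    total = 0.0
    for i in range(n):
        xi = -half_width + (i + 0.5) * h
        total += rho(xi, mu1, var1) * rho(x - xi, mu2, var2) * h
    return total

# Eq. (2.14): under convolution, means add and variances add.
mu1, var1, mu2, var2 = 0.5, 1.0, -1.0, 2.0
for x in (-2.0, 0.0, 1.5):
    direct = rho(x, mu1 + mu2, var1 + var2)
    assert abs(convolve_at(x, mu1, var1, mu2, var2) - direct) < 1e-7
```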

2.1.4 The Fourier Transform of the Gaussian Distribution

The Fourier transform of a “nice” function f ∈ N(ℝ) is defined as

[F(f)](ω) ≐ ∫_{−∞}^{∞} f(x) e^{−iωx} dx.    (2.16)

The shorthand f̂(ω) ≐ [F(f)](ω) will be used frequently. The conditions for existence and properties of the Fourier transform of functions on the real line are described in detail in [6, 11, 15]. Tools for the computation of fast sampled versions of the Fourier transform of periodic functions can be found in many books such as [7, 10, 24]. From the definition of the Fourier transform, it can be shown that

(f₁ ∗ f₂)^∧(ω) = f̂₁(ω) f̂₂(ω)    (2.17)

(i.e., the Fourier transform of the convolution is the product of Fourier transforms) and

f(x) = [F⁻¹(f̂)](x) ≐ (1/(2π)) ∫_{−∞}^{∞} f̂(ω) e^{iωx} dω.    (2.18)

This is called the inverse Fourier transform or Fourier reconstruction formula. The proof of the property (2.17) is left as an exercise, whereas (2.18) is proven below. For more details about classical Fourier analysis and its extensions, see [8] and references therein.

The fact that a function is recovered from its Fourier transform is found by first observing that it is true for the special case of g(x) = e^{−ax²} for a > 0. One way to calculate

ĝ(ω) = ∫_{−∞}^{∞} e^{−ax²} e^{−iωx} dx

is to differentiate both sides with respect to ω, which yields


dĝ/dω = −i ∫_{−∞}^{∞} x e^{−ax²} e^{−iωx} dx = (i/(2a)) ∫_{−∞}^{∞} (dg/dx) e^{−iωx} dx.

Integrating by parts, and observing that e^{−iωx} g(x) vanishes at the limits of integration, yields

dĝ/dω = −(ω/(2a)) ĝ.

The solution of this first-order ordinary differential equation is of the form

ĝ(ω) = ĝ(0) e^{−ω²/(4a)}

where

ĝ(0) = ∫_{−∞}^{∞} e^{−ax²} dx = √(π/a).

Having found the form of ĝ(ω), it is easy to see that g(x) is reconstructed from ĝ(ω) using the inversion formula (2.18) (the calculation is essentially the same as for the forward Fourier transform). Likewise, the Gaussian function

ρ_{(0,σ²)}(x) = (1/(√(2π) σ)) e^{−x²/(2σ²)}

has Fourier transform

ρ̂_{(0,σ²)}(ω) = e^{−σ²ω²/2}

and the reconstruction formula holds. As σ becomes small, ρ_{(0,σ²)}(x) becomes like δ(x). From the property that (δ ∗ f)(x) = f(x), the convolution theorem, and the above properties of Gaussian approximations to the Dirac δ-function, (2.18) immediately follows.
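The transform pair ρ̂_{(0,σ²)}(ω) = e^{−σ²ω²/2} can be checked by direct quadrature of (2.16); since ρ is even, the imaginary part of the integral vanishes and only the cosine part needs to be computed (helper name illustrative):

```python
import math

def rho_hat_numeric(omega, sigma, half_width=30.0, n=120_000):
    """Midpoint-rule estimate of the Fourier transform (2.16) of
    rho_(0, sigma^2); only the real (cosine) part survives by evenness."""
    h = 2 * half_width / n
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * h
        p = math.exp(-x * x / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)
        total += p * math.cos(omega * x) * h
    return total

sigma = 1.5
for omega in (0.0, 0.7, 2.0):
    closed_form = math.exp(-sigma ** 2 * omega ** 2 / 2)
    assert abs(rho_hat_numeric(omega, sigma) - closed_form) < 1e-7
```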

2.1.5 Diffusion Equations

A one-dimensional linear diffusion equation with constant coefficients has the form

∂u/∂t = a ∂u/∂x + b ∂²u/∂x²    (2.19)

where a ∈ ℝ is called the drift coefficient and b ∈ ℝ_{>0} is called the diffusion coefficient. When modeling diffusion phenomena in an infinite medium, the above diffusion equation for u(x, t) has initial conditions of the form u(x, 0) = f(x). The boundary conditions

u(±∞, t) = ∂u/∂x(±∞, t) = 0

are implicit in this problem, because otherwise the solutions will not be pdfs, or in the class N(ℝ).

Note that (2.19) is a special case of the Fokker–Planck equation,⁵ which will be examined in great detail in Chapter 4. When the drift coefficient is zero, the diffusion equation is called the heat equation.

⁵Also known as Kolmogorov’s forward equation.

Taking the Fourier transform of u(x, t) for each value of t (i.e., treating time as a constant for the moment and x as the independent variable) produces û(ω, t). Then applying the Fourier transform to both sides of (2.19) and the initial conditions results


in a linear first-order ordinary differential equation with t as the independent variable, together with initial conditions, for each fixed frequency ω:

dû/dt = (iaω − bω²) û  with  û(ω, 0) = f̂(ω).

The solution to this initial value problem is of the form

û(ω, t) = f̂(ω) e^{(iaω − bω²)t}.

Application of the inverse Fourier transform yields a solution. The above expression for û(ω, t) is a Gaussian with phase factor, and on inversion this becomes a shifted Gaussian:

[F⁻¹(e^{iatω} e^{−bω²t})](x) = (1/√(4πbt)) exp(−(x + at)²/(4bt)).

Using the convolution theorem in reverse then gives

u(x, t) = (1/√(4πbt)) ∫_{−∞}^{∞} f(ξ) exp(−(x + at − ξ)²/(4bt)) dξ.    (2.20)
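That the kernel in (2.20) actually solves (2.19) can be checked with finite differences: take f = δ so that u is the drifting, spreading Gaussian itself, and verify that the PDE residual is numerically zero (helper names illustrative):

```python
import math

def u(x, t, a, b):
    """The solution (2.20) with f = delta: a Gaussian centered at -a*t
    whose variance 2*b*t grows linearly in time."""
    return math.exp(-(x + a * t) ** 2 / (4 * b * t)) / math.sqrt(4 * math.pi * b * t)

def pde_residual(x, t, a, b, h=1e-4):
    """Central-difference check of u_t = a u_x + b u_xx, eq. (2.19)."""
    u_t = (u(x, t + h, a, b) - u(x, t - h, a, b)) / (2 * h)
    u_x = (u(x + h, t, a, b) - u(x - h, t, a, b)) / (2 * h)
    u_xx = (u(x + h, t, a, b) - 2 * u(x, t, a, b) + u(x - h, t, a, b)) / h ** 2
    return u_t - (a * u_x + b * u_xx)

for x in (-1.0, 0.2, 1.3):
    assert abs(pde_residual(x, t=0.5, a=0.4, b=0.8)) < 1e-5
```

Setting a = 0 reduces this to the familiar heat-kernel check for the heat equation.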

2.1.6 Stirling’s Formula

In probability theory for discrete variables, the binomial distribution is defined as

f(k; n, p) ≐ C(n, k) p^k (1 − p)^{n−k}  where  C(n, k) ≐ n!/(k!(n − k)!),  0 ≤ p ≤ 1,    (2.21)

and k = 0, 1, 2, ..., n; the values C(n, k) are called binomial coefficients. From the binomial theorem,

(a + b)^n = Σ_{k=0}^{n} C(n, k) a^k b^{n−k},

it follows that

Σ_{k=0}^{n} f(k; n, p) = (1 − p + p)^n = 1,

and from the definition in (2.21),

Σ_{k=0}^{n} k · f(k; n, p) = np · Σ_{k′=0}^{n−1} f(k′; n − 1, p) = np  where  k′ = k − 1.

The factorial n! can be approximated using the Stirling series:

n! = √(2πn) (n/e)^n (1 + 1/(12n) + 1/(288n²) + ···).

If the first term is kept, the result is Stirling’s formula:

n! ≈ √(2πn) (n/e)^n.    (2.22)

Stirling’s formula is used extensively in probability theory to establish limiting behaviors. In the current context, it can be used to show that the Gaussian distribution is the limiting distribution of the binomial distribution in the sense that [22]

lim_{n→∞} f(k; n, p)/ρ(k; np, np(1 − p)) = 1  for finite |k − np|/√(np(1 − p)).    (2.23)
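The limit (2.23) can be observed numerically. Evaluating the binomial pmf in log space (via the log-Gamma function) avoids overflow of n! and underflow of p^k for large n; helper names are illustrative:

```python
import math

def binom_pmf(k, n, p):
    """Eq. (2.21), evaluated in log space to remain stable for large n."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def rho(k, mean, var):
    """The Gaussian density with the matching mean np and variance np(1-p)."""
    return math.exp(-(k - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# The ratio in (2.23) approaches 1 as n grows, for k near np.
p = 0.3
for n in (100, 1_000, 10_000):
    k = round(n * p)  # keeps |k - np| / sqrt(np(1-p)) bounded
    ratio = binom_pmf(k, n, p) / rho(k, n * p, n * p * (1 - p))
    assert abs(ratio - 1.0) < 5.0 / math.sqrt(n)
```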


2.2 The Multivariate Gaussian Distribution

The multivariate Gaussian distribution on ℝⁿ is defined as⁶

ρ(x; μ, Σ) ≐ (1/((2π)^{n/2} |det Σ|^{1/2})) exp{−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)}.    (2.24)

This is the maximum entropy distribution subject to the constraints⁷

∫_{ℝⁿ} ρ(x; μ, Σ) dx = 1;  ∫_{ℝⁿ} x ρ(x; μ, Σ) dx = μ;  ∫_{ℝⁿ} (x − μ)(x − μ)ᵀ ρ(x; μ, Σ) dx = Σ.    (2.25)

The integral is calculated with respect to the differential volume element for ℝⁿ, denoted above as dx = dx₁dx₂···dxₙ. The above properties can be proved by changing coordinates as y = Σ^{−1/2}(x − μ), which reduces the problem to many one-dimensional integrals. The meaning of a fractional power of a matrix is reviewed in the appendix. Given a multi-dimensional coordinate transformation y = y(x) (which is written in components as yᵢ = yᵢ(x₁, ..., xₙ) for i = 1, ..., n), the following well-known integration rule (which is a restatement of (1.38) in different notation) holds:

∫_{y(D)} F(y) dy = ∫_{D} F(y(x)) |det J| dx    (2.26)

where dx = dx₁dx₂···dxₙ, dy = dy₁dy₂···dyₙ, and

J = [∂y/∂x₁, ..., ∂y/∂xₙ]

is the Jacobian matrix of the transformation and |det J| gives a measure of local volume change. D is the domain of integration in terms of the coordinates x, and y(D) is the new domain to which each point in D is mapped under the transformation y(x). In the current context, the range of integrals over x and y are both copies of ℝⁿ, i.e., D = y(D) = ℝⁿ.

⁶It is unfortunate that the notation for the one-dimensional case, ρ(x; μ, σ²), is inconsistent with the multivariate case since σ² becomes Σ (rather than Σ²), but this is the notation that is standard in the field.

⁷In Chapter 1 the notation d(x) was used to denote the volume element dx₁dx₂···dxₙ. In the expressions in this chapter, the parentheses will be dropped to reduce the amount of clutter, and dx will be used as shorthand for d(x). This will not cause trouble because x(t + dt) − x(t) does not appear in any of these calculations.
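The first constraint in (2.25), normalization, can be checked in the bivariate case by writing out xᵀΣ⁻¹x for a 2 × 2 covariance and integrating on a wide grid (a sketch with illustrative names; the midpoint rule is applied in each coordinate):

```python
import math

def rho2(x1, x2, s11, s12, s22):
    """Zero-mean bivariate Gaussian (2.24) with covariance [[s11, s12], [s12, s22]];
    the quadratic form uses the explicit 2x2 inverse."""
    det = s11 * s22 - s12 * s12
    q = (s22 * x1 * x1 - 2 * s12 * x1 * x2 + s11 * x2 * x2) / det  # x^T Sigma^{-1} x
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

s11, s12, s22 = 2.0, 0.6, 1.0
hw, n = 12.0, 600
h = 2 * hw / n
total = 0.0
for i in range(n):
    x1 = -hw + (i + 0.5) * h
    for j in range(n):
        x2 = -hw + (j + 0.5) * h
        total += rho2(x1, x2, s11, s12, s22) * h * h
assert abs(total - 1.0) < 1e-6
```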

2.2.1 Conditional and Marginal Densities

A vector x ∈ ℝⁿ can be partitioned as

x = [x₁ᵀ, x₂ᵀ]ᵀ ∈ ℝ^{n₁+n₂}

where x₁ ∈ ℝ^{n₁} and x₂ ∈ ℝ^{n₂}. The notation [x₁ᵀ, x₂ᵀ]ᵀ, which takes advantage of the fact that the “transpose of a transpose is the original,” has the benefit that it can be written on one line and included in a sentence, whereas it is difficult to do so for a column vector.

If f(x) = f([x₁ᵀ, x₂ᵀ]ᵀ) (which also will be referred to as f(x₁, x₂)) is any pdf on ℝ^{n₁+n₂}, then the marginal density f₁(x₁) is defined by integrating over all values of x₂:

f₁(x₁) = ∫_{ℝ^{n₂}} f(x₁, x₂) dx₂.

f₂(x₂) is obtained from f(x₁, x₂) in a similar way by integrating over all values of x₁.

The mean and variance of f₁(x₁) are obtained from the mean and variance of f(x) by observing that

μ₁ = ∫_{ℝ^{n₁}} x₁ f₁(x₁) dx₁
   = ∫_{ℝ^{n₁}} x₁ (∫_{ℝ^{n₂}} f(x₁, x₂) dx₂) dx₁
   = ∫_{ℝ^{n₁}} ∫_{ℝ^{n₂}} x₁ f(x₁, x₂) dx₂ dx₁

and

Σ₁₁ = ∫_{ℝ^{n₁}} (x₁ − μ₁)(x₁ − μ₁)ᵀ f₁(x₁) dx₁
    = ∫_{ℝ^{n₁}} (x₁ − μ₁)(x₁ − μ₁)ᵀ (∫_{ℝ^{n₂}} f(x₁, x₂) dx₂) dx₁
    = ∫_{ℝ^{n₁}} ∫_{ℝ^{n₂}} (x₁ − μ₁)(x₁ − μ₁)ᵀ f(x₁, x₂) dx₂ dx₁.

In other words, the mean vector and covariance matrix for the marginal density are obtained directly from those of the full density. For example, μ = [μ₁ᵀ, μ₂ᵀ]ᵀ.

Given a (multivariate) Gaussian distribution ρ(x; μ, Σ), the associated covariance matrix can be written in terms of blocks as

Σ = [Σ₁₁  Σ₁₂
     Σ₂₁  Σ₂₂]

where Σ₁₁ = Σ₁₁ᵀ, Σ₂₂ = Σ₂₂ᵀ, and Σ₂₁ = Σ₁₂ᵀ. The block Σᵢⱼ has dimensions nᵢ × nⱼ. In other words, Σᵢⱼ ∈ ℝ^{nᵢ×nⱼ} where i and j can either be 1 or 2.

The marginal density that results from integrating the Gaussian distribution ρ(x; μ, Σ) over all values of x₂ is

∫_{ℝ^{n₂}} ρ([x₁ᵀ, x₂ᵀ]ᵀ; μ, Σ) dx₂ = ρ(x₁; μ₁, Σ₁₁).    (2.27)

This should not come as a surprise, since a Gaussian is defined completely by the values of its mean and covariance.

Another operation that is important in probability and statistics is that of conditioning. Given f(x₁, x₂), the conditional density of x₁ given x₂ is

f(x₁|x₂) ≐ f(x₁, x₂)/f₂(x₂).    (2.28)


Evaluating this expression using a Gaussian gives

ρ([x₁ᵀ, x₂ᵀ]ᵀ; μ, Σ)/ρ(x₂; μ₂, Σ₂₂) = ρ(x₁; μ₁ + Σ₁₂Σ₂₂⁻¹(x₂ − μ₂), Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁).    (2.29)

The above formula follows from decomposing Σ into a product of block lower triangular, block diagonal, and block upper triangular matrices as in Appendix A.4.3. Each of these can then be inverted in closed form, resulting in explicit expressions for Σ⁻¹ in terms of the blocks of Σ.

In summary, the set of Gaussian distributions has the remarkable property that it is closed under marginalization and conditioning, and as was demonstrated previously in the 1D case, it is also closed under convolution.
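A scalar-block instance of (2.29) (n₁ = n₂ = 1, so every block is a number) can be checked pointwise: the ratio of the joint to the marginal should equal the Gaussian with the shifted mean and reduced variance. Helper names are illustrative:

```python
import math

def rho1(x, mu, var):
    """Univariate Gaussian density (2.7)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def rho2(x1, x2, mu1, mu2, s11, s12, s22):
    """Bivariate Gaussian (2.24) with mean (mu1, mu2) and 2x2 covariance blocks."""
    det = s11 * s22 - s12 * s12
    d1, d2 = x1 - mu1, x2 - mu2
    q = (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

# Eq. (2.29) with scalar blocks: conditioning on x2 gives a Gaussian with
# mean mu1 + (s12/s22)(x2 - mu2) and variance s11 - s12^2/s22.
mu1, mu2, s11, s12, s22 = 1.0, -0.5, 2.0, 0.8, 1.5
x2 = 0.3
cond_mean = mu1 + (s12 / s22) * (x2 - mu2)
cond_var = s11 - s12 * s12 / s22
for x1 in (-1.0, 0.5, 2.0):
    lhs = rho2(x1, x2, mu1, mu2, s11, s12, s22) / rho1(x2, mu2, s22)
    rhs = rho1(x1, cond_mean, cond_var)
    assert abs(lhs - rhs) < 1e-12
```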

2.2.2 Multi-Dimensional Integrals Involving Gaussians

Several integral identities involving Gaussian distributions are used throughout this book. These are stated here and proved in the following subsections.

First, it is well known that

∫_{−∞}^{∞} e^{−x²/2} dx = √(2π)  ⟹  ∫_{ℝⁿ} exp(−(1/2) xᵀx) dx = (2π)^{n/2}.    (2.30)

Here x ∈ ℝⁿ and dx = dx₁dx₂···dxₙ. Note also that

∫_{−∞}^{∞} x² e^{−x²/2} dx = √(2π).    (2.31)

These identities are used below to prove

∫_{ℝⁿ} exp(−(1/2) xᵀMx − mᵀx) dx = (2π)^{n/2} |det M|^{−1/2} exp((1/2) mᵀM⁻¹m)    (2.32)

and

∫_{ℝⁿ} xᵀGx exp(−(1/2) xᵀAx) dx = (2π)^{n/2} tr(GA⁻¹)/|det A|^{1/2}.    (2.33)

These integrals have applications in the analysis of elastic network models of proteins [9].
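In the scalar case (n = 1, M and m numbers), (2.32) reduces to √(2π/M) exp(m²/(2M)), which is easy to check by quadrature (helper name illustrative):

```python
import math

def lhs_numeric(M, m, half_width=40.0, n=200_000):
    """Midpoint-rule estimate of the left side of (2.32) for n = 1:
    the integral of exp(-M x^2 / 2 - m x) over the real line."""
    h = 2 * half_width / n
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * h
        total += math.exp(-0.5 * M * x * x - m * x) * h
    return total

# Scalar instance of (2.32): sqrt(2 pi / M) * exp(m^2 / (2 M)).
for M, m in ((1.0, 0.5), (2.5, -1.0)):
    closed_form = math.sqrt(2 * math.pi / M) * math.exp(m * m / (2 * M))
    assert abs(lhs_numeric(M, m) - closed_form) < 1e-6
```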

Proof of Equation (2.32)

Consider the integral

I = ∫_{ℝⁿ} exp(−(1/2) xᵀMx − mᵀx) dx.

Using the change of variables z = M^{1/2}x + M^{−1/2}m implies that dz = |det M|^{1/2} dx and x = M^{−1/2}(z − M^{−1/2}m), under which the exponent becomes −(1/2)zᵀz + (1/2)mᵀM⁻¹m. Therefore

I = (1/|det M|^{1/2}) ∫_{ℝⁿ} exp(−(1/2) zᵀz + (1/2) mᵀM⁻¹m) dz
  = (exp((1/2) mᵀM⁻¹m)/|det M|^{1/2}) ∫_{ℝⁿ} exp(−(1/2) zᵀz) dz.

And so, (2.32) follows from (2.30).


Proof of Equation (2.33)

It is also convenient to have closed-form solutions for integrals of the form

J = ∫_{ℝⁿ} xᵀGx exp(−(1/2) xᵀAx) dx.

Let z = A^{1/2}x. Then

J = (1/|det A|^{1/2}) ∫_{ℝⁿ} zᵀA^{−1/2}GA^{−1/2}z exp(−(1/2) zᵀz) dz.

Now let G′ = A^{−1/2}GA^{−1/2}. Then it is clear that off-diagonal terms of G′ do not contribute to this integral since odd moments of Gaussians are zero. Therefore,

J = (1/|det A|^{1/2}) ∫_{ℝⁿ} Σ_{i=1}^{n} g′ᵢᵢ zᵢ² exp(−(1/2) zᵀz) dz
  = (1/|det A|^{1/2}) Σ_{i=1}^{n} g′ᵢᵢ ∫_{−∞}^{∞} zᵢ² e^{−zᵢ²/2} dzᵢ ∫_{ℝ^{n−1}} exp(−(1/2) yᵢᵀyᵢ) dyᵢ

where yᵢ ∈ ℝ^{n−1} is the part of z ∈ ℝⁿ with the zᵢ component removed. The values of the integrals are independent of i, and

Σ_{i=1}^{n} g′ᵢᵢ = tr(G′) = tr(A^{−1/2}GA^{−1/2}) = tr(GA⁻¹),

and so, (2.33) follows.

2.3 The Volume of Spheres and Balls in Rn

The volume of the (n − 1)-dimensional hyper-sphere with unit radius, S^{n−1} ⊂ ℝⁿ, and of the open ball Bⁿ ⊂ ℝⁿ enclosed by S^{n−1} appear in a number of geometric and statistical applications. The argument used here for computing these volumes follows that given in [12]. Before proceeding, a note is in order regarding the use of the word “volume.” In the case of n = 3, the “volume” of the sphere S² is its surface area, and in the case of n = 2, the “volume” of the circle S¹ is its perimeter. In contrast, the “volume” of the ball B² is the area on the interior of a circle, and the “volume” of B³ is the classical volume in ℝ³ bounded by the sphere S². In general, the volume of an n-dimensional manifold will be an n-dimensional measurement.

Consider the isotropic Gaussian distribution on ℝⁿ with zero mean written as

ρ(x; μ = 0, Σ = σ²I) = (1/((2π)^{n/2} σⁿ)) exp(−(1/2)‖x‖²/σ²).

If x = ru where r and u = u(φ₁, φ₂, ..., φ_{n−1}) represent “hyper-spherical” coordinates, then the Jacobian determinant relates the change from Cartesian coordinates as

dx = |det[∂x/∂r, ∂x/∂φ₁, ..., ∂x/∂φ_{n−1}]| dr dφ₁···dφ_{n−1} = dV(φ) r^{n−1} dr


where dV(φ) is the volume element for the sphere S^{n−1}. The volume of S^{n−1} is then

Vol(S^{n−1}) = ∫_{S^{n−1}} dV(φ).

This can be computed directly by extending the usual spherical coordinates to higher dimensions in the natural way as

u(2) = (cos φ₁, sin φ₁)ᵀ;
u(3) = (cos φ₁ sin φ₂, sin φ₁ sin φ₂, cos φ₂)ᵀ;
u(4) = (cos φ₁ sin φ₂ sin φ₃, sin φ₁ sin φ₂ sin φ₃, cos φ₂ sin φ₃, cos φ₃)ᵀ; etc.,

computing Jacobian determinants for each case, and then integrating over the appropriate range of angles, 0 ≤ φ₁ < 2π and 0 ≤ φᵢ < π for 1 < i ≤ n − 1. Or, the volume of the unit sphere can be calculated indirectly, as is done below.

From the fact that ρ is a pdf,

1 = ∫_{ℝⁿ} ρ(x; 0, σ²I) dx
  = ∫_0^{∞} ∫_{S^{n−1}} ρ(ru; 0, σ²I) dV(u) r^{n−1} dr
  = (1/((2π)^{n/2} σⁿ)) (∫_0^{∞} exp(−r²/(2σ²)) r^{n−1} dr) Vol(S^{n−1}).

Therefore, it must be that

(1/((2π)^{n/2} σⁿ)) ∫_0^{∞} exp(−r²/(2σ²)) r^{n−1} dr = 1/Vol(S^{n−1})

for any value of σ. Letting s = r/(√2 σ), the integral on the left becomes

∫_0^{∞} exp(−r²/(2σ²)) r^{n−1} dr = 2^{n/2} σⁿ ∫_0^{∞} e^{−s²} s^{n−1} ds = (1/2) 2^{n/2} σⁿ Γ(n/2).

This can be taken as the definition of the Gamma function, or it can be viewed as the result of the change of coordinates t = s² from the more standard definition

Γ(α) = ∫_0^{∞} e^{−t} t^{α−1} dt    (2.34)

with α = n/2.

In any case, since the Gaussian pdf integrates to unity, the factors of 2^{n/2} σⁿ cancel, and it must be that (1/2) Γ(n/2) Vol(S^{n−1}) = π^{n/2}, or

Vol(S^{n−1}) = 2π^{n/2}/Γ(n/2).    (2.35)

This is the volume of a unit hyper-sphere S^{n−1} ⊂ R^n. The volume of a hyper-sphere of radius r would be r^{n−1} times this quantity. The volume of the unit ball B^n ⊂ R^n is then obtained by integrating over all of these spherical shells as

\[
\mathrm{Vol}(B^n) \,=\, \int_0^1 \mathrm{Vol}(S^{n-1})\, r^{n-1}\, dr \,=\, \frac{2\pi^{n/2}}{\Gamma\!\left(\frac{n}{2}\right)} \int_0^1 r^{n-1}\, dr.
\]


In other words,

\[
\mathrm{Vol}(B^n) \,=\, \frac{2\pi^{n/2}}{n\,\Gamma\!\left(\frac{n}{2}\right)} \,=\, \frac{\pi^{n/2}}{\Gamma\!\left(\frac{n}{2}+1\right)}. \tag{2.36}
\]

The first few values of Γ(n/2) are given in the following table:

Table 2.1. The First Few Half-Integer Values of the Γ-Function

  n    Γ(n/2)
  1    √π
  2    1
  3    √π/2
  4    1
  5    3√π/4
  6    2

Note that for integer arguments, Γ(m) = (m − 1)!. The shorthand notation

\[
\mathrm{Vol}(S^{n-1}) \,=\, O_n \quad\text{and}\quad \mathrm{Vol}(B^n) \,=\, \frac{O_n}{n} \tag{2.37}
\]

will be useful.
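Formulas (2.35)–(2.37) and Table 2.1 are easy to spot-check numerically. The following sketch (assuming Python with only the standard library; the helper names are not from the text) compares them against the familiar low-dimensional values:

```python
import math

def vol_sphere(n):
    """Vol(S^{n-1}) = 2*pi^(n/2) / Gamma(n/2), per (2.35)."""
    return 2.0 * math.pi ** (n / 2) / math.gamma(n / 2)

def vol_ball(n):
    """Vol(B^n) = Vol(S^{n-1}) / n = pi^(n/2) / Gamma(n/2 + 1), per (2.36)-(2.37)."""
    return vol_sphere(n) / n

# n = 2: the "volume" of S^1 is the circle's perimeter; of B^2, the disk's area
assert abs(vol_sphere(2) - 2 * math.pi) < 1e-12
assert abs(vol_ball(2) - math.pi) < 1e-12

# n = 3: surface area 4*pi and ball volume 4*pi/3
assert abs(vol_sphere(3) - 4 * math.pi) < 1e-12
assert abs(vol_ball(3) - 4 * math.pi / 3) < 1e-12

# half-integer Gamma values from Table 2.1
assert abs(math.gamma(0.5) - math.sqrt(math.pi)) < 1e-12
assert abs(math.gamma(2.5) - 3 * math.sqrt(math.pi) / 4) < 1e-12
```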

2.4 Clipped Gaussian Distributions

The Gaussian distribution has many interesting and useful properties. For example, it is the maximum entropy distribution for a given mean and covariance, it satisfies a diffusion equation, and, as a family of parametric distributions, it is closed under the operations of convolution and conditioning. In addition, its higher moments can be computed as closed-form integrals. It would be useful to take advantage of these properties when fitting a density to measured data on other domains such as spheres. However, a problem that immediately arises is that for compact domains, something must be done with the infinite tails of the Gaussian distribution. Two options are to wrap the tails around (resulting in a "folded" Gaussian), or to clip the tails. The folded Gaussian for the circle is discussed in Section 2.5. While this is a viable option in some cases, a more general procedure that can be used for other finite domains is clipping. In the subsections that follow, the properties of the univariate clipped Gaussian are obtained and extended to the multi-dimensional case.

2.4.1 One-Dimensional Clipped Gaussian Distributions

Suppose that we want to clip the Gaussian distribution with mean at x = 0 defined by

\[
\rho(x;\, 0, \sigma_0) \,=\, \frac{1}{\sqrt{2\pi}\,\sigma_0}\, e^{-x^2/(2\sigma_0^2)}.
\]

This is defined on the real line. By restricting it to the unit circle, which we identify with the interval [−π, π], the mass is reduced from unity to


\[
r(\sigma_0) \,\doteq\, \int_{-\pi}^{\pi} \rho(x;\, 0, \sigma_0)\, dx \,<\, 1. \tag{2.38}
\]

An exact expression for r(σ_0) can be found in terms of the error function

\[
\mathrm{erf}(x) \,\doteq\, \frac{2}{\sqrt{\pi}} \int_0^{x} e^{-t^2}\, dt. \tag{2.39}
\]

However, if kσ_0 < π for some k ≥ 3, then r(σ_0) ≈ 1 is a good approximation. The variance of a clipped Gaussian is then

\[
\sigma^2 \,=\, \frac{1}{\sqrt{2\pi}\,\sigma_0\, r(\sigma_0)} \int_{-\pi}^{\pi} x^2 e^{-x^2/(2\sigma_0^2)}\, dx
\,=\, \frac{\sigma_0^2}{\sqrt{2\pi}\, r(\sigma_0)} \int_{-\pi/\sigma_0}^{\pi/\sigma_0} y^2 e^{-y^2/2}\, dy.
\]

This can be written as

\[
\sigma^2 \,=\, \frac{\sigma_0^2}{\sqrt{2\pi}\, r(\sigma_0)} \left[\sqrt{2\pi}\, r(\sigma_0) - \frac{2\pi}{\sigma_0}\, e^{-\pi^2/(2\sigma_0^2)}\right]
\]

by using integration by parts, together with the fact that ∫_{−π/σ_0}^{π/σ_0} e^{−y²/2} dy = √(2π) r(σ_0). As σ_0 → 0, then σ → σ_0.
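The closed-form variance above can be checked against direct numerical integration. The sketch below (hypothetical helper names, standard-library Python) uses the fact that r(σ_0) = erf(π/(σ_0√2)):

```python
import math

def r_mass(s0):
    """r(sigma_0) of (2.38) in closed form: erf(pi / (sigma_0 * sqrt(2)))."""
    return math.erf(math.pi / (s0 * math.sqrt(2.0)))

def clipped_var_closed(s0):
    """The integration-by-parts expression for the clipped variance."""
    r = r_mass(s0)
    bracket = math.sqrt(2 * math.pi) * r - (2 * math.pi / s0) * math.exp(-math.pi**2 / (2 * s0**2))
    return s0**2 * bracket / (math.sqrt(2 * math.pi) * r)

def clipped_var_numeric(s0, m=200001):
    """Trapezoid-rule integration of x^2 rho(x; 0, sigma_0) / r(sigma_0) over [-pi, pi]."""
    h = 2 * math.pi / (m - 1)
    total = 0.0
    for i in range(m):
        x = -math.pi + i * h
        w = 0.5 if i in (0, m - 1) else 1.0
        total += w * x * x * math.exp(-x * x / (2 * s0 * s0))
    return total * h / (math.sqrt(2 * math.pi) * s0 * r_mass(s0))

for s0 in (0.5, 1.0, 2.0):
    assert abs(clipped_var_closed(s0) - clipped_var_numeric(s0)) < 1e-6

# for small sigma_0 essentially no mass is clipped, so sigma is close to sigma_0
assert abs(clipped_var_closed(0.5) - 0.25) < 1e-3
```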

2.4.2 Multi-Dimensional Clipped Gaussian Distributions

The integral of a multi-dimensional Gaussian distribution over the interior of an ellipsoid defined by

\[
\mathbf{x}^T \Sigma_0^{-1} \mathbf{x} \,=\, a^2
\]

can be computed in closed form (using error integrals). We can therefore clip a multi-dimensional Gaussian distribution along the boundary of such an ellipsoid and renormalize the resulting distribution so as to be a pdf. In other words, a clipped Gaussian is defined relative to a Gaussian as

\[
\rho_c(\mathbf{x}, \Sigma_0, a) \,\doteq\,
\begin{cases}
\rho(\mathbf{x}, \Sigma_0)/r(\Sigma_0, a) & \text{for } \mathbf{x}^T \Sigma_0^{-1} \mathbf{x} < a^2 \\
0 & \text{otherwise}
\end{cases} \tag{2.40}
\]

where

\[
r(\Sigma_0, a) \,\doteq\, \int_{\mathbf{x}^T \Sigma_0^{-1}\mathbf{x} < a^2} \rho(\mathbf{x}, \Sigma_0)\, d\mathbf{x}.
\]

The covariance of a clipped Gaussian is then

\[
\Sigma \,=\, \int_{\mathbf{x}^T \Sigma_0^{-1}\mathbf{x} < a^2} \mathbf{x}\mathbf{x}^T \rho_c(\mathbf{x}, \Sigma_0, a)\, d\mathbf{x}. \tag{2.41}
\]

By making the change of variables y = Σ_0^{−1/2} x, it follows that

\[
\Sigma \,=\, \Sigma_0^{1/2} \left[\int_{\mathbf{y}^T\mathbf{y} < a^2} \mathbf{y}\mathbf{y}^T \rho_c(\mathbf{y}, I, a)\, d\mathbf{y}\right] \Sigma_0^{1/2}. \tag{2.42}
\]

The above integral can be computed in closed form. This is done below for the three-dimensional case. The two-dimensional case is left as an exercise.

It will be convenient to define


\[
f_0(a) \,\doteq\, \int_0^a e^{-r^2/2}\, dr \,=\, \sqrt{\frac{\pi}{2}}\, \mathrm{erf}(a/\sqrt{2}),
\]

\[
f_1(a) \,\doteq\, \int_0^a r^2 e^{-r^2/2}\, dr \,=\, -a\, e^{-a^2/2} + f_0(a),
\]

and

\[
f_2(a) \,\doteq\, \int_0^a r^4 e^{-r^2/2}\, dr \,=\, 3 f_1(a) - a^3 e^{-a^2/2}.
\]

Then

\[
m(\Sigma_0, a) \,\doteq\, \int_{\mathbf{x}^T \Sigma_0^{-1}\mathbf{x} < a^2} \exp\left(-\tfrac{1}{2}\mathbf{x}^T \Sigma_0^{-1}\mathbf{x}\right) d\mathbf{x} \,=\, 4\pi f_1(a)\, |\Sigma_0|^{\frac{1}{2}}
\]

and

\[
r(\Sigma_0, a) \,=\, m(\Sigma_0, a)\big/\left((2\pi)^{\frac{3}{2}} |\Sigma_0|^{\frac{1}{2}}\right) \,=\, \sqrt{\frac{2}{\pi}}\, f_1(a).
\]

Using spherical coordinates,

\[
\mathbf{y} \,=\, \begin{pmatrix} r\sin\theta\cos\phi \\ r\sin\theta\sin\phi \\ r\cos\theta \end{pmatrix},
\]

the integral in (2.42) can be evaluated (using integration by parts, via the closed form for f_2(a) above) as

\[
\int_{\mathbf{y}^T\mathbf{y} < a^2} \mathbf{y}\mathbf{y}^T \rho_c(\mathbf{y}, I, a)\, d\mathbf{y}
\,=\, \int_{r=0}^{a}\int_{\phi=0}^{2\pi}\int_{\theta=0}^{\pi} \mathbf{y}\mathbf{y}^T \rho_c(\mathbf{y}, I, a)\, r^2 \sin\theta\, d\theta\, d\phi\, dr
\,=\, \frac{f_2(a)}{3 f_1(a)}\, I.
\]

Therefore (2.42) reduces to

\[
\Sigma \,=\, \frac{f_2(a)}{3\, f_1(a)}\, \Sigma_0. \tag{2.43}
\]

As a → ∞, Σ → Σ_0.
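Equation (2.43) can be spot-checked by Monte Carlo: sample from a Gaussian, discard points outside the ellipsoid, and compare the empirical variances with the predicted shrinkage factor f_2(a)/(3 f_1(a)). A sketch, with an assumed diagonal Σ_0 = diag(1, 4, 9) and a = 1.5 (neither value is from the text):

```python
import math, random

def f0(a): return math.sqrt(math.pi / 2) * math.erf(a / math.sqrt(2))
def f1(a): return -a * math.exp(-a * a / 2) + f0(a)
def f2(a): return 3 * f1(a) - a**3 * math.exp(-a * a / 2)

a = 1.5
sig = [1.0, 2.0, 3.0]               # Sigma_0 = diag(1, 4, 9): an assumed example
ratio = f2(a) / (3 * f1(a))         # predicted shrinkage factor from (2.43)

random.seed(0)
sums = [0.0, 0.0, 0.0]
count = 0
for _ in range(200000):
    g = [random.gauss(0.0, 1.0) for _ in range(3)]
    if sum(v * v for v in g) < a * a:    # keep x = Sigma_0^{1/2} g inside the ellipsoid
        count += 1
        for i in range(3):
            sums[i] += (sig[i] * g[i]) ** 2

for i in range(3):
    empirical = sums[i] / count
    assert abs(empirical - ratio * sig[i]**2) / (ratio * sig[i]**2) < 0.05
```

As a → ∞ the factor f_2(a)/(3 f_1(a)) approaches 1, consistent with Σ → Σ_0.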

2.5 Folded, or Wrapped, Gaussians

In some applications, data on the circle is given, and a corresponding concept of Gaussian distribution is needed. One approach discussed in the previous section that could be applied to this end is to "clip the tails" of a Gaussian outside of the range of values θ ∈ [−π, π] and renormalize the result in order to make it a valid pdf. In contrast, the tails can be "wrapped around" the circle as

\[
\rho_W(\theta;\, \mu, \sigma) \,\doteq\, \sum_{k=-\infty}^{\infty} \rho(\theta - 2\pi k;\, \mu, \sigma), \tag{2.44}
\]

where if μ is outside of the range [−π, π], it can be "put back in the range" by subtracting 2πN from it for some N ∈ Z until it is in range.

If σ is very small and μ = 0, only the k = 0 term in the above sum needs to be retained, and there is no distinction between the original Gaussian restricted to the range θ ∈ [−π, π], the Gaussian clipped to this range, and the folded Gaussian. But as


σ increases, so too do the values of |k| that need to be retained. As σ becomes very large, it becomes impractical to compute (2.44).

However, there is an alternative representation of the folded Gaussian that uses the fact that it is a periodic function. Recall that any 2π-periodic function, i.e., a "function on the unit circle," can be expanded in a Fourier series:

\[
f(\theta) \,=\, \frac{1}{2\pi} \sum_{n=-\infty}^{\infty} \hat{f}(n)\, e^{in\theta}
\quad\text{where}\quad
\hat{f}(n) \,=\, \int_0^{2\pi} f(\theta)\, e^{-in\theta}\, d\theta, \tag{2.45}
\]

where e^{inθ} = cos nθ + i sin nθ and i = √−1. Here the \hat{f}(n) are called the Fourier coefficients, or circular Fourier transform. These coefficients can be computed in closed form for (2.44). This leads to the Fourier series representation of the folded Gaussian distribution:

\[
\rho_W(\theta;\, \mu, \sigma) \,=\, \frac{1}{2\pi} + \frac{1}{\pi} \sum_{n=1}^{\infty} e^{-\frac{\sigma^2}{2} n^2} \cos\left(n(\theta - \mu)\right). \tag{2.46}
\]

As σ becomes large, very close approximations can be achieved with the first couple of terms in the summation in (2.46). In contrast, as σ becomes very small, using very few of the terms in the series (2.44) will produce a very good approximation when μ = 0.
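The two representations (2.44) and (2.46) can be compared numerically; suitably truncated, they agree to high precision for both small and large σ. A sketch with hypothetical function names:

```python
import math

def folded_direct(theta, mu, sigma, K=60):
    """Truncation of the wrapped sum (2.44) at |k| <= K."""
    return sum(
        math.exp(-(theta - mu - 2 * math.pi * k) ** 2 / (2 * sigma**2))
        for k in range(-K, K + 1)
    ) / (math.sqrt(2 * math.pi) * sigma)

def folded_fourier(theta, mu, sigma, N=200):
    """Truncation of the Fourier series (2.46) at n <= N."""
    s = 1.0 / (2 * math.pi)
    for n in range(1, N + 1):
        s += math.exp(-sigma**2 * n**2 / 2) * math.cos(n * (theta - mu)) / math.pi
    return s

for sigma in (0.3, 1.0, 3.0):       # small sigma favors (2.44); large sigma favors (2.46)
    for i in range(8):
        th = -math.pi + i * math.pi / 4
        assert abs(folded_direct(th, 0.5, sigma) - folded_fourier(th, 0.5, sigma)) < 1e-10
```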

The general theme that a Gaussian on a space other than the real line can be approximated well as a Gaussian restricted to a smaller domain when σ is small, or as a generalized Fourier series expansion when σ is large, will recur many times throughout this book.

Note that the above "folding" process is not restricted to Gaussian distributions; any well-behaved function, f(x), defined on the line can be wrapped around the circle. The resulting folded function, which is 2π-periodic, is related to the Fourier transform of the original non-periodic function on the real line through the Poisson summation formula [1]:

\[
\sum_{n=-\infty}^{\infty} f(\theta + 2\pi n) \,=\, \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} [\mathcal{F}(f)](k)\, e^{ik\theta}. \tag{2.47}
\]

In other words, the Fourier coefficients of the folded function are related to the Fourier transform of the original function as

\[
\hat{f}(k) \,=\, [\mathcal{F}(f)](k).
\]

2.6 The Heat Equation

In this section, the relationship between the Gaussian distribution and the heat equation (also called the diffusion equation) is developed.

Sometimes the exact solution of an equation is not as critical as knowing how its mean and covariance behave as a function of time. This is illustrated both in the one-dimensional and multi-dimensional settings in the following subsections.

2.6.1 The One-Dimensional Case

Consider the diffusion equation on the real line with time-varying diffusion and drift coefficients, k(t) and a(t):


\[
\frac{\partial f}{\partial t} \,=\, \frac{1}{2} k(t) \frac{\partial^2 f}{\partial x^2} - a(t) \frac{\partial f}{\partial x}. \tag{2.48}
\]

The initial condition is f(x, 0) = δ(x). The solution f(x, t) can be obtained in closed form, following essentially the same procedure as in Section 2.1.5, and then the mean and variance can be computed from this solution as

\[
\mu(t) \,=\, \int_{-\infty}^{\infty} x f(x, t)\, dx
\quad\text{and}\quad
\sigma^2(t) \,=\, \int_{-\infty}^{\infty} [x - \mu(t)]^2 f(x, t)\, dx. \tag{2.49}
\]

Alternatively, the mean and variance of f(x, t) can be computed directly from (2.48) without actually knowing the solution f(x, t). In fact, many properties of f(x, t) can be determined from (2.48) and the corresponding initial conditions without knowing f(x, t). For example, integrating both sides of (2.48) with respect to x yields

\[
\frac{d}{dt} \int_{-\infty}^{\infty} f(x, t)\, dx \,=\, 0.
\]

This follows because

\[
\int_{-\infty}^{\infty} \frac{\partial f}{\partial x}\, dx \,=\, f(x, t)\Big|_{x=-\infty}^{\infty}
\quad\text{and}\quad
\int_{-\infty}^{\infty} \frac{\partial^2 f}{\partial x^2}\, dx \,=\, \frac{\partial f}{\partial x}\Big|_{x=-\infty}^{\infty},
\]

and under the boundary conditions that f(x, t) and ∂f/∂x decay rapidly to zero as x → ±∞, these terms become zero. Since the initial conditions are a delta function in x, it follows that

\[
\int_{-\infty}^{\infty} f(x, t)\, dx \,=\, 1.
\]

In other words, (2.48) preserves the initial mass of the distribution over all values of time after t = 0.

To compute μ(t), multiply both sides of (2.48) by x and integrate. On the one hand,

\[
\int_{-\infty}^{\infty} x \frac{\partial f}{\partial t}\, dx \,=\, \frac{d}{dt} \int_{-\infty}^{\infty} x f(x, t)\, dx \,=\, \frac{d\mu}{dt}.
\]

On the other hand,

\[
\int_{-\infty}^{\infty} x \frac{\partial f}{\partial t}\, dx \,=\, \frac{1}{2} k(t) \int_{-\infty}^{\infty} x \frac{\partial^2 f}{\partial x^2}\, dx \,-\, a(t) \int_{-\infty}^{\infty} x \frac{\partial f}{\partial x}\, dx.
\]

Evaluating both integrals on the right side by integrating by parts and using the conditions that both f(x, t) and ∂f/∂x decay rapidly to zero as x → ±∞, it becomes clear that

\[
\frac{d\mu}{dt} \,=\, a(t) \quad\text{or}\quad \mu(t) \,=\, \int_0^t a(s)\, ds. \tag{2.50}
\]

A similar argument shows that

\[
\frac{d}{dt}(\sigma^2) \,=\, k(t) \quad\text{or}\quad \sigma^2(t) \,=\, \int_0^t k(s)\, ds. \tag{2.51}
\]
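The moment relations (2.50) and (2.51) can be illustrated without solving (2.48), by Euler–Maruyama simulation of the underlying stochastic process dx = a(t) dt + √(k(t)) dW starting from the delta at x = 0. The coefficients a(t) = sin t and k(t) = 1 + t below are assumed purely for illustration:

```python
import math, random

a = lambda t: math.sin(t)            # assumed drift coefficient a(t)
k = lambda t: 1.0 + t                # assumed diffusion coefficient k(t)

T, steps, paths = 1.0, 100, 20000
dt = T / steps
random.seed(1)
xs = [0.0] * paths                   # every path starts at the delta at x = 0
for i in range(steps):
    t = i * dt
    drift = a(t) * dt
    scale = math.sqrt(k(t) * dt)
    for p in range(paths):
        xs[p] += drift + scale * random.gauss(0.0, 1.0)

mean = sum(xs) / paths
var = sum((x - mean) ** 2 for x in xs) / paths
mu_pred = 1.0 - math.cos(T)          # integral of sin(s) over [0, T], per (2.50)
var_pred = T + T * T / 2             # integral of 1 + s over [0, T], per (2.51)
assert abs(mean - mu_pred) < 0.06
assert abs(var - var_pred) < 0.1
```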


2.6.2 The Multi-Dimensional Case

Consider the following time-varying diffusion equation without drift:

\[
\frac{\partial f}{\partial t} \,=\, \frac{1}{2} \sum_{i,j=1}^{n} D_{ij}(t) \frac{\partial^2 f}{\partial x_i\, \partial x_j}, \tag{2.52}
\]

where D_{ij}(t) = D_{ji}(t) are the time-varying diffusion coefficients. If f(x, 0) = δ(x), then integrating both sides of (2.52) over R^n and using integration by parts in x shows that the unit volume under the curve is preserved.

Multiplying both sides by x_k x_l and integrating over x ∈ R^n gives

\[
\frac{d}{dt}(\sigma_{kl}) \,=\, \frac{1}{2} \sum_{i,j=1}^{n} D_{ij}(t) \int_{\mathbb{R}^n} x_k x_l \frac{\partial^2 f}{\partial x_i\, \partial x_j}\, d\mathbf{x}. \tag{2.53}
\]

Let the integral over R^{n−1} resulting from the exclusion of the integral over x_i be denoted as

\[
\int_{\mathbf{x} - x_i} f(\mathbf{x})\, d\mathbf{x}/dx_i \,=\, \int_{x_1=-\infty}^{\infty} \cdots \int_{x_{i-1}=-\infty}^{\infty} \int_{x_{i+1}=-\infty}^{\infty} \cdots \int_{x_n=-\infty}^{\infty} f(x_1, ..., x_n)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n
\]

so that

\[
\int_{\mathbb{R}^n} f(\mathbf{x})\, d\mathbf{x} \,=\, \int_{-\infty}^{\infty} \left(\int_{\mathbf{x}-x_i} f(\mathbf{x})\, d\mathbf{x}/dx_i\right) dx_i \,=\, \int_{\mathbf{x}-x_i} \left(\int_{-\infty}^{\infty} f(\mathbf{x})\, dx_i\right) d\mathbf{x}/dx_i.
\]

An integral over n − 2 degrees of freedom denoted by the integral with subscript x − x_i − x_j follows in a similar way.

From integration by parts,

\[
\int_{\mathbb{R}^n} x_k x_l \frac{\partial^2 f}{\partial x_i\, \partial x_j}\, d\mathbf{x}
\,=\, \int_{\mathbf{x}-x_i} \left[ x_k x_l \frac{\partial f}{\partial x_j}\Big|_{x_i=-\infty}^{\infty} \,-\, \int_{-\infty}^{\infty} \frac{\partial}{\partial x_i}(x_k x_l)\, \frac{\partial f}{\partial x_j}\, dx_i \right] d\mathbf{x}/dx_i.
\]

The assumption that f(x, t) decays rapidly as ‖x‖ → ∞ for all values of t makes the first term in the brackets disappear. Using the fact that ∂x_i/∂x_j = δ_{ij}, and integrating by parts again (over x_j) reduces the above integral to

\[
\int_{\mathbb{R}^n} x_k x_l \frac{\partial^2 f}{\partial x_i\, \partial x_j}\, d\mathbf{x} \,=\, \delta_{kj}\delta_{il} + \delta_{ik}\delta_{lj}.
\]

Substituting this into (2.53) results in

\[
\frac{d}{dt}(\sigma_{kl}) \,=\, D_{kl}(t) \quad\text{or}\quad \sigma_{kl}(t) \,=\, \int_0^t D_{kl}(s)\, ds. \tag{2.54}
\]

Therefore, even without knowing the form of the time-varying pdf that solves (2.52), it is possible to obtain an exact expression for the covariance of the solution.
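The key identity used above, ∫ x_k x_l ∂²f/∂x_i∂x_j dx = δ_{kj}δ_{il} + δ_{ik}δ_{lj} for a unit-mass, rapidly decaying f, can be spot-checked by quadrature. For the 2-D standard normal the second derivatives have the closed form (x_i x_j − δ_{ij})f(x), so a simple Riemann sum on a large grid suffices (a sketch, standard-library Python):

```python
import math
from itertools import product

# Riemann sum on a large 2-D grid; for f(x) = N(0, I) the second derivatives
# have the closed form d^2 f / dx_i dx_j = (x_i x_j - delta_ij) f(x).
L, h = 8.0, 0.1
grid = [-L + m * h for m in range(int(2 * L / h) + 1)]
idx = list(product(range(2), repeat=4))
delta = lambda i, j: 1.0 if i == j else 0.0

I = {key: 0.0 for key in idx}
for x0 in grid:
    for x1 in grid:
        f = math.exp(-(x0 * x0 + x1 * x1) / 2) / (2 * math.pi)
        x = (x0, x1)
        for (kk, ll, ii, jj) in idx:
            I[(kk, ll, ii, jj)] += x[kk] * x[ll] * (x[ii] * x[jj] - delta(ii, jj)) * f * h * h

for (kk, ll, ii, jj) in idx:
    predicted = delta(kk, jj) * delta(ii, ll) + delta(ii, kk) * delta(ll, jj)
    assert abs(I[(kk, ll, ii, jj)] - predicted) < 1e-3
```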


2.6.3 The Heat Equation on the Unit Circle

The heat equation on the circle is exactly the same as the heat equation on the real line (with θ replacing x as the spatial variable). However, the topological constraint that θ = ±π represents the same point means that the long-time solution will be completely different than in the unconstrained case on the real line. Whereas the Fourier transform can be used to solve the heat equation on the line, the Fourier series expansion is used on the circle.

The result is that the solution on the line can be folded around the circle. In other words, the solution to the heat equation on the circle for constant diffusion coefficient k,

\[
\frac{\partial f}{\partial t} \,=\, \frac{1}{2} k \frac{\partial^2 f}{\partial \theta^2} \quad\text{subject to}\quad f(\theta, 0) = \delta(\theta),
\]

is (using m rather than k as the summation index, to avoid a clash with the diffusion coefficient)

\[
f(\theta, t) \,=\, \sum_{m=-\infty}^{\infty} \rho\!\left(\theta - 2\pi m;\; 0,\; (kt)^{\frac{1}{2}}\right) \,=\, \frac{1}{2\pi} + \frac{1}{\pi} \sum_{n=1}^{\infty} e^{-ktn^2/2} \cos n\theta. \tag{2.55}
\]

This is the folded Gaussian in (2.46) with σ² = kt and μ = 0.

2.7 Gaussians and Multi-Dimensional Diffusions

In the previous section, the evolution of the mean and covariance of a diffusion equation was obtained without knowing the time-varying pdf. Here, the pdf is sought.

2.7.1 The Constant Diffusion Case

Consider the diffusion equation

\[
\frac{\partial f}{\partial t} \,=\, \frac{1}{2} \sum_{i,j=1}^{n} D_{ij} \frac{\partial^2 f}{\partial x_i\, \partial x_j} \tag{2.56}
\]

subject to the initial conditions f(x, 0) = δ(x), where D = [D_{ij}] = D^T is a constant matrix of diffusion coefficients.

Since diffusion equations preserve mass (see Section 2.6.2), it follows that

\[
\int_{\mathbb{R}^n} f(\mathbf{x}, t)\, d\mathbf{x} \,=\, 1 \tag{2.57}
\]

for all values of time, t ∈ R_{>0}. Try a solution of the form

\[
f(\mathbf{x}, t) \,=\, c(t) \exp\left(-\frac{1}{2}\mathbf{x}^T A(t)\,\mathbf{x}\right) \tag{2.58}
\]

where A(t) = φ(t)A_0 and A_0 = [α_{ij}] = A_0^T. Then, from (2.57) and the formula (2.83) derived in the exercises, it follows that

\[
c(t) \,=\, \left(\frac{\phi(t)}{2\pi}\right)^{n/2} |\det A_0|^{\frac{1}{2}}.
\]


With this constraint in mind, substituting f(x, t) into (2.56) produces the following conditions on φ(t) and A_0:

\[
n\phi' \,=\, -\phi^2 \sum_{i,j=1}^{n} D_{ij}\alpha_{ij}
\]

\[
\phi'\, \mathbf{x}^T A_0 \mathbf{x} \,=\, -\phi^2 \sum_{i,j=1}^{n} D_{ij} \left(\sum_{k=1}^{n} \alpha_{ik} x_k\right)\left(\sum_{l=1}^{n} \alpha_{jl} x_l\right)
\]

where φ' = dφ/dt. Both of these conditions are satisfied if A_0 = α_0 D^{−1} and φ(t) = (α_0 t)^{−1} for some arbitrary constant α_0 ∈ R_{>0}. But since A(t) = φ(t)A_0 = t^{−1}D^{−1}, this constant does not matter.

Putting all of this together,

\[
f(\mathbf{x}, t) \,=\, \frac{1}{(2\pi t)^{n/2}\, |\det D|^{\frac{1}{2}}} \exp\left(-\frac{1}{2t}\,\mathbf{x}^T D^{-1}\mathbf{x}\right). \tag{2.59}
\]

Stated in another way, the solution to (2.56) is a time-varying Gaussian distribution with Σ(t) = tD when D is symmetric.
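That (2.59) satisfies (2.56) can be spot-checked pointwise with finite differences, here for n = 2 and an assumed sample D (not from the text):

```python
import math

D = [[2.0, 0.5], [0.5, 1.0]]         # an assumed symmetric positive definite D
detD = D[0][0] * D[1][1] - D[0][1] * D[1][0]
Dinv = [[ D[1][1] / detD, -D[0][1] / detD],
        [-D[1][0] / detD,  D[0][0] / detD]]

def f(x0, x1, t):
    """The candidate solution (2.59) for n = 2."""
    q = Dinv[0][0] * x0 * x0 + 2 * Dinv[0][1] * x0 * x1 + Dinv[1][1] * x1 * x1
    return math.exp(-q / (2 * t)) / (2 * math.pi * t * math.sqrt(detD))

h = 1e-4
for (x0, x1, t) in [(0.3, -0.2, 0.5), (1.0, 0.7, 1.2), (-0.4, 0.9, 2.0)]:
    ft  = (f(x0, x1, t + h) - f(x0, x1, t - h)) / (2 * h)
    fxx = (f(x0 + h, x1, t) - 2 * f(x0, x1, t) + f(x0 - h, x1, t)) / h**2
    fyy = (f(x0, x1 + h, t) - 2 * f(x0, x1, t) + f(x0, x1 - h, t)) / h**2
    fxy = (f(x0 + h, x1 + h, t) - f(x0 + h, x1 - h, t)
           - f(x0 - h, x1 + h, t) + f(x0 - h, x1 - h, t)) / (4 * h**2)
    rhs = 0.5 * (D[0][0] * fxx + 2 * D[0][1] * fxy + D[1][1] * fyy)
    assert abs(ft - rhs) < 1e-5      # residual of (2.56) vanishes up to discretization error
```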

2.7.2 The Time-Varying Case

Consider again (2.56), but now let D = D(t). Try a solution of the form

\[
f(\mathbf{x}, t) \,=\, (2\pi)^{-n/2}\, |\det \Sigma(t)|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\mathbf{x}^T \Sigma^{-1}(t)\,\mathbf{x}\right) \tag{2.60}
\]

where Σ(t) is a time-varying covariance matrix, the form of which is as yet undetermined. This guess is simply f(x, t) = ρ(x; 0, Σ(t)).

The derivatives with respect to x_i are evaluated as before, using the chain rule. The time derivative is evaluated as follows:

\[
\frac{\partial f}{\partial t} \,=\, (2\pi)^{-n/2}\, \frac{d\left(|\det\Sigma|^{-\frac{1}{2}}\right)}{dt} \exp\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}\right)
\,+\, (2\pi)^{-n/2}\, |\det\Sigma|^{-\frac{1}{2}}\, \frac{d}{dt}\left[\exp\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}\right)\right]
\]
\[
\,=\, -\frac{1}{2}(2\pi)^{-n/2}\, |\det\Sigma|^{-\frac{3}{2}}\, \frac{d(\det\Sigma)}{dt} \exp\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}\right)
\,-\, \frac{1}{2}(2\pi)^{-n/2}\, |\det\Sigma|^{-\frac{1}{2}} \left(\mathbf{x}^T \frac{d}{dt}[\Sigma^{-1}]\,\mathbf{x}\right) \exp\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}\right).
\]

On the other hand,

\[
\frac{1}{2}\sum_{i,j=1}^{n} D_{ij} \frac{\partial^2 f}{\partial x_i\, \partial x_j}
\,=\, \frac{1}{2}\left\{-\mathrm{tr}(D\Sigma^{-1}) + \mathbf{x}^T(\Sigma^{-1}D\Sigma^{-1})\,\mathbf{x}\right\} f(\mathbf{x}, t).
\]

Therefore, if

\[
|\det\Sigma|^{-1}\, \frac{d(\det\Sigma)}{dt} \,=\, \mathrm{tr}(D\Sigma^{-1})
\quad\text{and}\quad
\frac{d}{dt}[\Sigma^{-1}] \,=\, -\Sigma^{-1} D\, \Sigma^{-1}, \tag{2.61}
\]


then (2.56) with variable diffusion coefficients will be satisfied. Since⁸

\[
\frac{d}{dt}(\Sigma\Sigma^{-1}) \,=\, O \;\Longrightarrow\; \frac{d}{dt}[\Sigma^{-1}] \,=\, -\Sigma^{-1}\dot{\Sigma}\,\Sigma^{-1},
\]

the second equality in (2.61) will be satisfied if D = dΣ/dt. In this case the first equality in (2.61) becomes

\[
\frac{d}{dt}\log(\det\Sigma) \,=\, \mathrm{tr}(\dot{\Sigma}\Sigma^{-1}). \tag{2.62}
\]

Under what conditions will this be true?

Case 1: From Systems Theory (as reviewed in the appendix), if Σ = exp(tS_0) where S_0 = S_0^T is constant, then

\[
\det\Sigma \,=\, e^{\mathrm{tr}(tS_0)} \,=\, e^{t\,(\mathrm{tr}\, S_0)}.
\]

Therefore, in this special case

\[
\frac{d}{dt}\log(\det\Sigma) \,=\, \mathrm{tr}(S_0).
\]

Likewise, if Σ = exp(tS_0), then tr((dΣ/dt)Σ^{−1}) = tr(S_0). Therefore, it can be concluded that a sufficient condition for the Gaussian in (2.60) to be a solution to (2.56) is if a constant symmetric matrix S_0 can be found such that D(t) = S_0 exp(tS_0).

Case 2: The condition in (2.62) will be satisfied if Σ = σ(t)Σ_0 where σ(t) is a differentiable scalar function of time and Σ_0 = Σ_0^T. Substitution into (2.62) yields the condition

\[
\frac{d}{dt}\log(\sigma^n \det\Sigma_0) \,=\, \dot{\sigma}\sigma^{-1}\, \mathrm{tr}(I).
\]

Since log(a · b) = log a + log b, and d/dt log a(t) = ȧ/a, the above condition becomes

\[
\frac{1}{\sigma^n}\, n\sigma^{n-1}\dot{\sigma} \,=\, n\dot{\sigma}\sigma^{-1},
\]

which is always true. Therefore any σ(t) will work. A broader condition that encompasses both Case 1 and Case 2 is D(t) = Ṡ(t) exp S(t) where S = S^T and [S, Ṡ] ≐ SṠ − ṠS = O. Under this condition,

\[
\Sigma(t) \,=\, \int_0^t D(s)\, ds. \tag{2.63}
\]
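The determinant identities behind Case 1 can be checked on a concrete symmetric matrix. Taking S_0 = aI + bJ with J = [[0, 1], [1, 0]], so that exp(tS_0) = e^{at}(cosh(bt) I + sinh(bt) J), is an assumed example, not from the text:

```python
import math

# S0 = a*I + b*J with J = [[0, 1], [1, 0]]; then exp(t*S0) = e^{a t}(cosh(b t) I + sinh(b t) J)
a, b = 0.3, 0.7

def Sigma(t):
    e = math.exp(a * t)
    return [[e * math.cosh(b * t), e * math.sinh(b * t)],
            [e * math.sinh(b * t), e * math.cosh(b * t)]]

def logdet(t):
    S = Sigma(t)
    return math.log(S[0][0] * S[1][1] - S[0][1] * S[1][0])

t, h = 1.5, 1e-5
# det(exp(t S0)) = exp(t * tr S0), with tr S0 = 2a
assert abs(math.exp(logdet(t)) - math.exp(2 * a * t)) < 1e-9
# d/dt log det Sigma = tr S0, checked by a centered difference
assert abs((logdet(t + h) - logdet(t - h)) / (2 * h) - 2 * a) < 1e-6
```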

2.8 Symmetry Analysis of Evolution Equations

The concept of symmetry can have several meanings when applied to evolution equations.⁹ For example, the diffusion matrix in the multi-dimensional heat equation might have symmetries in it other than the primary symmetry D = D^T. That kind of symmetry is reflected in the solution of the equation. Another kind of symmetry is that the equation itself can be solved when the independent variables undergo a non-linear change of coordinates. Both of these concepts of symmetry are addressed in this section.

⁸Here O = d/dt(I) is the zero matrix.

⁹These are equations with a single partial derivative in time, and multiple partial derivatives in space. They include, but are not limited to, diffusion equations.


2.8.1 Symmetries in Parameters

Consider a drift-free diffusion in R^n with constant diffusion matrix D = D^T, and let the solution be denoted as f(x, t; D). Since the dependence on D and t always appears as their product, the solution has a continuous scale symmetry of the form

\[
f(\mathbf{x}, t; D) \,=\, f(\mathbf{x}, t/\alpha; \alpha D)
\]

for any α ∈ R_{>0}. In addition, since the solution is the Gaussian distribution in (2.59), it can be verified that

\[
f(\mathbf{x}, t; D) \,=\, \beta^{n/2} f(\sqrt{\beta}\,\mathbf{x}, t; \beta D).
\]

If D = σ²I, then any change of spatial coordinates of the form y = Qx where Q^TQ = I will preserve the solution:

\[
f(Q\mathbf{x}, t; \sigma^2 I) \,=\, f(\mathbf{x}, t; \sigma^2 I).
\]

In contrast, if n = 3 and D = diag[σ_1², σ_1², σ_3²] is the diagonal matrix with the indicated entries on the diagonal, then

\[
R_3(\theta)^T D\, R_3(\theta) \,=\, D
\quad\text{where}\quad
R_3(\theta) \,=\, \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix},
\]

and so

\[
f(R_3(\theta)\mathbf{x}, t; D) \,=\, f(\mathbf{x}, t; D).
\]
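These parameter symmetries are exact identities of the Gaussian (2.59) and can be verified pointwise; a sketch for n = 2 with an assumed D:

```python
import math

def gauss2(x0, x1, t, D):
    """f(x, t; D) of (2.59) for n = 2."""
    det = D[0][0] * D[1][1] - D[0][1] * D[1][0]
    q = (D[1][1] * x0 * x0 - 2 * D[0][1] * x0 * x1 + D[0][0] * x1 * x1) / det
    return math.exp(-q / (2 * t)) / (2 * math.pi * t * math.sqrt(det))

D = [[2.0, 0.5], [0.5, 1.0]]        # an assumed sample diffusion matrix
x0, x1, t = 0.8, -0.3, 0.6

# time/diffusivity trade-off: f(x, t; D) = f(x, t/alpha; alpha D)
alpha = 2.5
aD = [[alpha * d for d in row] for row in D]
assert abs(gauss2(x0, x1, t, D) - gauss2(x0, x1, t / alpha, aD)) < 1e-12

# spatial scaling: f(x, t; D) = beta^{n/2} f(sqrt(beta) x, t; beta D), with n = 2
beta = 1.7
bD = [[beta * d for d in row] for row in D]
s = math.sqrt(beta)
assert abs(gauss2(x0, x1, t, D) - beta * gauss2(s * x0, s * x1, t, bD)) < 1e-12

# rotational symmetry when D = sigma^2 I
iso = [[0.49, 0.0], [0.0, 0.49]]
th = 0.9
y0 = math.cos(th) * x0 - math.sin(th) * x1
y1 = math.sin(th) * x0 + math.cos(th) * x1
assert abs(gauss2(y0, y1, t, iso) - gauss2(x0, x1, t, iso)) < 1e-12
```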

These symmetries all involve simple transformations of the coordinates. Less obvious symmetries result by examining operators which, when applied to the equation of interest, leave it invariant in a sense that will be made precise.

2.8.2 Infinitesimal Symmetry Operators of the Heat Equation

Let Qf = 0 denote any partial differential equation, where Q is a differential operator in temporal and spatial variables (t, x) ∈ R_{≥0} × R^n. For example, for the heat equation on the real line where there is only one spatial variable, (t, x) becomes (t, x) and

\[
Q \,=\, \frac{\partial}{\partial t} - \frac{\partial^2}{\partial x^2}
\]

where the diffusion constant, k, is chosen to be k = 2 here for convenience.

A body of literature exists that addresses the question of how to obtain new solutions of Qf = 0 from old ones. In particular, if it is possible to find a first-order operator of the form

\[
L \,=\, T(\mathbf{x}, t)\frac{\partial}{\partial t} + \sum_{i=1}^{n} X_i(\mathbf{x}, t)\frac{\partial}{\partial x_i} + Z(\mathbf{x}, t) \tag{2.64}
\]

where T(x, t), X_i(x, t), and Z(x, t) are analytic functions such that

\[
[L, Q]f(\mathbf{x}, t) \,=\, R(\mathbf{x}, t)\, Qf
\quad\text{where}\quad
[L, Q] \,=\, LQ - QL, \tag{2.65}
\]

then f' ≐ Lf will solve Qf' = 0.


At first this might seem surprising, but since the condition in (2.65) reads LQf − QLf = RQf, and since Qf = 0, it must be that 0 = QLf = Q(Lf) = Qf'.

Following [2, 3, 4, 17, 19, 21], the infinitesimal operators that transform solutions of the heat equation into new solutions are presented below. In this case there is one spatial dimension and so

\[
L \,=\, T(x, t)\frac{\partial}{\partial t} + X(x, t)\frac{\partial}{\partial x} + Z(x, t).
\]

Some mundane calculus yields

\[
QLf \,=\, \left(\frac{\partial}{\partial t} - \frac{\partial^2}{\partial x^2}\right)\left(T(x, t)\frac{\partial f}{\partial t} + X(x, t)\frac{\partial f}{\partial x} + Z(x, t)f\right)
\]
\[
=\, \left(\frac{\partial T}{\partial t}\right)\left(\frac{\partial f}{\partial t}\right) + T\left(\frac{\partial^2 f}{\partial t^2}\right) + \left(\frac{\partial X}{\partial t}\right)\left(\frac{\partial f}{\partial x}\right) + X\left(\frac{\partial^2 f}{\partial t\,\partial x}\right) + \left(\frac{\partial Z}{\partial t}\right)f + Z\left(\frac{\partial f}{\partial t}\right)
\]
\[
\quad -\, \left(\frac{\partial^2 T}{\partial x^2}\right)\left(\frac{\partial f}{\partial t}\right) - 2\left(\frac{\partial T}{\partial x}\right)\left(\frac{\partial^2 f}{\partial t\,\partial x}\right) - T\left(\frac{\partial^3 f}{\partial t\,\partial x^2}\right) - \left(\frac{\partial^2 X}{\partial x^2}\right)\left(\frac{\partial f}{\partial x}\right) - 2\left(\frac{\partial X}{\partial x}\right)\left(\frac{\partial^2 f}{\partial x^2}\right) - X\left(\frac{\partial^3 f}{\partial x^3}\right)
\]
\[
\quad -\, \left(\frac{\partial^2 Z}{\partial x^2}\right)f - 2\left(\frac{\partial Z}{\partial x}\right)\left(\frac{\partial f}{\partial x}\right) - Z\left(\frac{\partial^2 f}{\partial x^2}\right)
\]

and

\[
LQf \,=\, \left(T(x, t)\frac{\partial}{\partial t} + X(x, t)\frac{\partial}{\partial x} + Z(x, t)\right)\left(\frac{\partial f}{\partial t} - \frac{\partial^2 f}{\partial x^2}\right)
\]
\[
=\, T\frac{\partial^2 f}{\partial t^2} - T\frac{\partial^3 f}{\partial x^2\,\partial t} + X\frac{\partial^2 f}{\partial x\,\partial t} - X\frac{\partial^3 f}{\partial x^3} + Z\frac{\partial f}{\partial t} - Z\frac{\partial^2 f}{\partial x^2}.
\]

Note that every term in LQf can also be found in QLf. Subtracting, and reorganizing the terms that result, yields

\[
[Q, L]f \,=\, \left(\frac{\partial T}{\partial t} - \frac{\partial^2 T}{\partial x^2}\right)\frac{\partial f}{\partial t}
+ \left(\frac{\partial X}{\partial t} - \frac{\partial^2 X}{\partial x^2} - 2\frac{\partial Z}{\partial x}\right)\frac{\partial f}{\partial x}
+ \left(-2\frac{\partial T}{\partial x}\right)\frac{\partial^2 f}{\partial x\,\partial t}
+ \left(-2\frac{\partial X}{\partial x}\right)\frac{\partial^2 f}{\partial x^2}
+ \left(\frac{\partial Z}{\partial t} - \frac{\partial^2 Z}{\partial x^2}\right)f.
\]

Since [Q, L] = −[L, Q], (2.65) is the same as computing [Q, L]f = −RQf where

\[
RQf \,=\, R\frac{\partial f}{\partial t} - R\frac{\partial^2 f}{\partial x^2}.
\]

Then equating the coefficients in front of each term involving f, the following five equations result:

\[
\frac{\partial T}{\partial t} - \frac{\partial^2 T}{\partial x^2} \,=\, -R \tag{2.66}
\]
\[
2\frac{\partial X}{\partial x} \,=\, -R \tag{2.67}
\]
\[
\frac{\partial X}{\partial t} - \frac{\partial^2 X}{\partial x^2} - 2\frac{\partial Z}{\partial x} \,=\, 0 \tag{2.68}
\]


\[
\frac{\partial T}{\partial x} \,=\, 0 \tag{2.69}
\]
\[
\frac{\partial Z}{\partial t} - \frac{\partial^2 Z}{\partial x^2} \,=\, 0. \tag{2.70}
\]

These equations completely determine the structure of the operator L that transforms solutions into solutions.

Starting with (2.69), the restriction T(x, t) = T(t) must be observed. Then, using this result in (2.66) means −R(x, t) = T'(t). This in turn can be substituted into (2.67) to yield

\[
X(x, t) \,=\, \frac{1}{2}T'(t)\,x + c_1(t)
\]

where c_1(t) is a yet-to-be-determined function resulting from integration over x. Substituting this into (2.68) forces the form of Z(x, t) to be

\[
Z(x, t) \,=\, \frac{1}{8}T''(t)\,x^2 + \frac{1}{2}c_1'(t)\,x + c_2(t).
\]

Substituting this into (2.70) forces

\[
T'''(t) = 0; \quad c_1''(t) = 0; \quad c_2'(t) = \frac{1}{4}T''(t).
\]

It follows that

\[
T(t) = a_0 t^2 + b_0 t + c_0; \quad c_1(t) = \alpha_0 t + \beta_0; \quad c_2(t) = \frac{1}{2}a_0 t + \gamma_0
\]

where a_0, b_0, c_0, α_0, β_0, γ_0 are all free constants. This means that any L with the following form will map solutions of the heat equation into solutions:

\[
T(x, t) \,=\, a_0 t^2 + b_0 t + c_0
\]
\[
X(x, t) \,=\, (a_0 t + b_0/2)\,x + \alpha_0 t + \beta_0
\]
\[
Z(x, t) \,=\, \frac{1}{4}a_0 x^2 + \frac{1}{2}\alpha_0 x + \frac{1}{2}a_0 t + \gamma_0.
\]

In fact, the space of all allowable L operators is a vector space with elements of the form

\[
L \,=\, a_0 L_1 + b_0 L_2 + c_0 L_3 + \alpha_0 L_4 + \beta_0 L_5 + \gamma_0 L_6
\]

where the following serves as a basis:

\[
L_1 \,=\, t^2\frac{\partial}{\partial t} + xt\frac{\partial}{\partial x} + \frac{1}{4}x^2 + \frac{1}{2}t \tag{2.71}
\]
\[
L_2 \,=\, t\frac{\partial}{\partial t} + \frac{1}{2}x\frac{\partial}{\partial x} \tag{2.72}
\]
\[
L_3 \,=\, \frac{\partial}{\partial t} \tag{2.73}
\]
\[
L_4 \,=\, t\frac{\partial}{\partial x} + \frac{1}{2}x \tag{2.74}
\]
\[
L_5 \,=\, \frac{\partial}{\partial x} \tag{2.75}
\]
\[
L_6 \,=\, 1. \tag{2.76}
\]
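As a concrete check that such an L maps solutions into solutions, the sketch below applies L_4 = t ∂/∂x + x/2 to a shifted heat kernel (a solution of f_t = f_xx in the k = 2 convention used here) and verifies by finite differences that the result again satisfies the equation. The particular kernel is an assumed example:

```python
import math

def f(x, t):
    """A shifted heat kernel: a solution of f_t = f_xx (the k = 2 convention used here)."""
    return math.exp(-(x - 1) ** 2 / (4 * (t + 1))) / math.sqrt(4 * math.pi * (t + 1))

def L4f(x, t):
    """L4 f = t f_x + (x/2) f, using the closed form f_x = -(x-1)/(2(t+1)) f."""
    return (-t * (x - 1) / (2 * (t + 1)) + x / 2) * f(x, t)

h = 1e-4
for (x, t) in [(0.0, 0.5), (1.3, 1.0), (-0.7, 2.0)]:
    gt  = (L4f(x, t + h) - L4f(x, t - h)) / (2 * h)
    gxx = (L4f(x + h, t) - 2 * L4f(x, t) + L4f(x - h, t)) / h**2
    assert abs(gt - gxx) < 1e-5      # L4 f again satisfies the heat equation
```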


In addition to being a vector space, operators of the form of L given above are also closed under the Lie bracket, [·, ·]. In other words, [L_i, L_j] for any i, j ∈ {1, ..., 6} will result in a linear combination of these same basis elements. This makes the space of all L operators that map solutions of the heat equation into solutions a Lie algebra [16]. This concept will be defined more rigorously in the appendix and in Volume 2.

2.8.3 Non-Linear Transformations of Coordinates

Consider the heat equation

\[
\frac{\partial f}{\partial t} \,=\, \frac{\partial^2 f}{\partial x^2}
\]

and assume that an f(x, t) has been obtained that satisfies this equation. For the moment, the initial conditions will be left unspecified.

The following matrices can be defined [17]:

\[
B \,=\, B(u, v, w) \,=\, \begin{pmatrix} 1 & v & 2w + uv/2 \\ 0 & 1 & u \\ 0 & 0 & 1 \end{pmatrix}
\quad\text{where}\quad u, v, w \in \mathbb{R} \tag{2.77}
\]

and

\[
A \,=\, A(\alpha, \beta, \gamma, \delta) \,=\, \begin{pmatrix} \alpha & \beta \\ \gamma & \delta \end{pmatrix}
\quad\text{where}\quad \alpha, \beta, \gamma, \delta \in \mathbb{R} \;\;\text{and}\;\; \alpha\delta - \beta\gamma = 1. \tag{2.78}
\]

It is clear that since det A = 1 by definition, the product of two such matrices also satisfies this condition: det(A_1 A_2) = det A_1 det A_2 = 1. Likewise, the form of the B matrices is preserved under matrix multiplication, and

\[
B(u, v, w)\,B(u', v', w') \,=\, B\!\left(u + u',\; v + v',\; w + w' + (vu' - uv')/4\right).
\]

These are examples of matrix Lie groups which, roughly speaking, are groups of continuous transformations, the elements of which are matrices. The group operation is matrix multiplication.

It can be shown (see Exercise 2.18) that transformations of the following form convert solutions into solutions [17]:

\[
(T_1(B)f)(x, t) \,=\, \exp\frac{1}{2}\left[b_{13} + b_{23}x + \frac{1}{2}b_{23}^2\, t\right] f(x + b_{12} + b_{23}t,\; t) \tag{2.79}
\]

and

\[
(T_2(A)f)(x, t) \,=\, \exp\left(\frac{-x^2\beta/4}{\delta + t\beta}\right)(\delta + t\beta)^{-\frac{1}{2}}\, f\!\left(\frac{x}{\delta + t\beta},\; \frac{\gamma + t\alpha}{\delta + t\beta}\right). \tag{2.80}
\]

In other words, if f(x, t) is a solution to the heat equation, then so too are f_1(x, t) = (T_1(B)f)(x, t) and f_2(x, t) = (T_2(A)f)(x, t). This means that applying these transformations twice with different permissible matrices A_i and B_i will also take solutions into solutions:

\[
f_1(x, t) \,=\, (T_1(B_2)T_1(B_1)f)(x, t) \,=\, (T_1(B_2)(T_1(B_1)f))(x, t)
\]


and

\[
f_2(x, t) \,=\, (T_2(A_2)T_2(A_1)f)(x, t) \,=\, (T_2(A_2)(T_2(A_1)f))(x, t).
\]

This gets really interesting when these definitions are combined with the closure property under multiplication of matrices of the same kind, since

\[
T_1(B_2)T_1(B_1) \,=\, T_1(B_2 B_1)
\quad\text{and}\quad
T_2(A_2)T_2(A_1) \,=\, T_2(A_2 A_1). \tag{2.81}
\]

What this means is that there are two independent sets of three-parameter transformations that can map solutions into solutions. And furthermore, these can be combined, since (T_2(A)(T_1(B)f))(x, t) and (T_1(B)(T_2(A)f))(x, t) must also be solutions.

In Volume 2 this example will be revisited as an example of a six-dimensional Lie group, where the A matrices and B matrices each independently form three-dimensional subgroups.
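The first identity in (2.81) can be verified numerically by comparing two successive applications of T_1 with a single application built from the matrix product; the heat-kernel solution and the particular B parameters below are assumed for illustration:

```python
import math

def Bmat(u, v, w):
    """The matrix (2.77)."""
    return [[1.0, v, 2 * w + u * v / 2], [0.0, 1.0, u], [0.0, 0.0, 1.0]]

def matmul3(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def T1(B, f):
    """The action (2.79), reading b12, b13, b23 off the matrix."""
    b12, b13, b23 = B[0][1], B[0][2], B[1][2]
    return lambda x, t: math.exp(0.5 * (b13 + b23 * x + 0.5 * b23**2 * t)) * f(x + b12 + b23 * t, t)

# an assumed heat-kernel solution of f_t = f_xx
f = lambda x, t: math.exp(-x * x / (4 * (t + 1))) / math.sqrt(4 * math.pi * (t + 1))

B1, B2 = Bmat(0.4, -0.9, 0.2), Bmat(-1.1, 0.5, 0.7)
twice = T1(B2, T1(B1, f))            # two successive transformations
once = T1(matmul3(B2, B1), f)        # a single transformation from the matrix product
for (x, t) in [(0.0, 0.1), (1.2, 0.8), (-0.6, 1.5)]:
    assert abs(twice(x, t) - once(x, t)) < 1e-12
```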

2.9 Chapter Summary

Many aspects of the Gaussian distribution were reviewed. These include the parametrization of multi-dimensional Gaussians by their mean and covariance, the form of marginals and conditionals of Gaussians, the properties of Gaussians under convolution, the maximum entropy property, and the relationship between Gaussians and diffusion/heat equations.¹⁰ Finally, a brief review of the theory of symmetry analysis of partial differential equations, as applied to diffusion equations, was presented. This forms the first of many links between the topic of diffusion equations and Lie groups that will be forged throughout these books.

The connection between Lie group methods and partial differential equations has a long history dating back to the 1950s [25, 26, 27, 28]. In addition to those references cited earlier in this chapter, significant progress on this topic was made through the 1970s and 1980s, including [5, 13, 14, 18, 20]. These approaches have been used for very complicated partial differential equations, such as in [21].

The next chapter will serve as a more formal introduction to probability and information theory. With the concrete example of the Gaussian distribution in mind, it should be easier to tackle these topics. Furthermore, the maximum entropy property of Gaussians, as well as their role in the central limit theorem, will justify what might appear to be a preoccupation with Gaussians in the current chapter.

2.10 Exercises

2.1. Verify (2.6) by performing the integrals in the definitions of σ² and s.

2.2. Verify (2.10).

2.3. Verify (2.14) by: (a) directly computing the convolution integral in (2.13); (b) using the convolution property of the Fourier transform (2.17).

¹⁰Note that although these equations were written in Cartesian coordinates in this chapter, it is possible to convert to polar, spherical, or other coordinates. For covariance matrices with symmetry, this can be more convenient. See [8] for a detailed discussion of different curvilinear coordinate systems.


2.4. Using the same reasoning as in Section 2.1.2, compute: (a) the maximum entropy distribution on the real line subject to the constraint that it has a specified value of the spread (rather than variance); (b) the maximum entropy distribution on the finite interval [a, b] subject to no constraints.

2.5. What is the exact expression for the functional F(ε²) in Section 2.1.2?

2.6. Prove that for any suitable f_1(x) and f_2(x), the convolution theorem (2.17) holds. Hint: a change of variables and a change in the order in which integrals are performed will be required.

2.7. Verify (2.27). Hint: Use the block decomposition in (A.70) to obtain an explicit expression for Σ^{−1}.

2.8. Verify (2.29). Hint: Use the property of the exponential function e^{a+b} = e^a e^b.

2.9. Calculate the exact form of r(σ_0) in (2.38) in terms of the error function in (2.39).

2.10. Work out the covariance matrix for the 2D clipped Gaussian in analogy with the 3D case presented in (2.43).

2.11. Following the steps in Section 2.1.5, derive the closed-form solution f(x, t) that satisfies (2.48) subject to the initial conditions f(x, 0) = δ(x).

2.12. Using (2.49), show that the mean and variance of f(x, t) computed from (2.48) in (2.50) and (2.51) are the same as computed directly from the closed-form solution of f(x, t) obtained in the previous exercise.

2.13. Verify (2.46) analytically by computing the Fourier coefficients \hat{ρ}_W(n; μ, σ) of (2.44).

2.14. Using integration by parts, prove (2.50) and (2.51).

2.15. Show that the matrices in (2.77) and (2.78) are invertible.

2.16. Find the nine basis operators {L_i} that take solutions of

\[
\frac{\partial f}{\partial t} \,=\, \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} \tag{2.82}
\]

into other solutions.

2.17. Find the thirteen basis operators {L_i} that take solutions of

\[
\frac{\partial f}{\partial t} \,=\, \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2}
\]

into other solutions.

2.18. Verify that transformations of the form in (2.79) and (2.80) will transform one solution into another. Hint: Use the chain rule.

2.19. Verify that the two equations in (2.81) hold. That is, first compute two concatenated transformations, and then compute the single transformation resulting from the matrix products, and compare.

2.20. Show that for A ∈ R^{n×n} with A = A^T > 0,

\[
\int_{\mathbb{R}^n} \exp\left(-\frac{1}{2}\mathbf{x}^T A\mathbf{x}\right) d\mathbf{x} \,=\, (2\pi)^{n/2} |\det A|^{-\frac{1}{2}}. \tag{2.83}
\]

Hint: Decompose A = QΛQ^T where Q is orthogonal and Λ is the diagonal matrix consisting of eigenvalues of A, which are all positive.

2.21. Verify (2.59) by substituting (2.58) into (2.56) and using the chain rule.

2.22. Can a 2 × 2 matrix S(t) be constructed such that Σ = exp S which does not fall into Case 1 or Case 2? If so, provide an example. If not, explain why not.

2.23. Determine conditions under which the time-dependent diffusion with drift

\[
\frac{\partial f}{\partial t} \,=\, \frac{1}{2}\sum_{k,l=1}^{n} D_{kl}(t)\frac{\partial^2 f}{\partial x_k\, \partial x_l} \,-\, \sum_{k=1}^{n} d_k(t)\frac{\partial f}{\partial x_k} \tag{2.84}
\]

will have a solution of the form

\[
f(\mathbf{x}, t) \,=\, c(t)\exp\left[-\frac{1}{2}[\mathbf{x} - \mathbf{a}(t)]^T C(t)\,[\mathbf{x} - \mathbf{a}(t)]\right]. \tag{2.85}
\]

2.24. Show that the following transformations take solutions of (2.82) into solutions [17]:

T_1(w, z, ω) f(x, t) = exp((1/2) x · w + (1/4) t‖w‖² + ω) f(x + tw + z, t)   (2.86)

where w, z ∈ R² and ω ∈ R;

T_2(A) f(x, t) = exp(−(1/4)(δ + tβ)^{−1} β‖x‖²) (δ + tβ)^{−1} f((δ + tβ)^{−1} x, (γ + tα)/(δ + tβ))   (2.87)

where A ∈ R^{2×2} with det A = 1;

T_3(θ) f(x, t) = f(R^T(θ) x, t), where R(θ) = ( cos θ  −sin θ ; sin θ  cos θ ).   (2.88)

References

1. Benedetto, J.J., Zimmermann, G., "Sampling multipliers and the Poisson summation formula," J. Fourier Anal. Appl., 3, pp. 505–523, 1997.

2. Bluman, G., Construction of Solutions to Partial Differential Equations by the Use of Transformation Groups, PhD Dissertation, Caltech, 1967.

3. Bluman, G., Cole, J., "The general similarity solution of the heat equation," J. Math. Mech., 18, pp. 1025–1042, 1969.

4. Bluman, G., Cole, J., Similarity Methods for Differential Equations, Springer-Verlag, New York, 1974.

5. Boyer, C., "The maximal kinematical invariance group for an arbitrary potential," Helv. Phys. Acta, 47, pp. 589–605, 1974.

6. Bracewell, R.N., The Fourier Transform and Its Applications, 2nd ed., McGraw-Hill, New York, 1986.

7. Burrus, C.S., Parks, T.W., DFT/FFT and Convolution Algorithms, John Wiley and Sons, New York, 1985.


8. Chirikjian, G.S., Kyatkin, A.B., Engineering Applications of Noncommutative Harmonic Analysis, CRC Press, Boca Raton, FL, 2001.

9. Chirikjian, G.S., "A methodology for determining mechanical properties of macromolecules from ensemble motion data," Trends Anal. Chem., 22, pp. 549–553, 2003.

10. Cooley, J.W., Tukey, J., "An algorithm for the machine calculation of complex Fourier series," Math. Comput., 19, pp. 297–301, 1965.

11. Fourier, J.B.J., Theorie Analytique de la Chaleur, F. Didot, Paris, 1822.

12. Gray, A., Tubes, 2nd ed., Birkhauser, Boston, 2004.

13. Kalnins, E., Miller, W., Jr., "Symmetry and separation of variables for the heat equation," Proc. Conf. on Symmetry, Similarity and Group-Theoretic Methods in Mechanics, pp. 246–261, University of Calgary, Calgary, Canada, 1974.

14. Kalnins, E., Miller, W., Jr., "Lie theory and separation of variables 4: The groups SO(2, 1) and SO(3)," J. Math. Phys., 15, pp. 1263–1274, 1974.

15. Korner, T.W., Fourier Analysis, Cambridge University Press, London, 1988 (reprinted 1993).

16. Lie, S., "Uber die Integration durch bestimmte Integrale von einer Klasse linearer partieller Differentialgleichungen," Arch. Math., 4(3), Kristiana, p. 328, 1881.

17. Miller, W., Jr., Symmetry and Separation of Variables, Encyclopedia of Mathematics and Its Applications, Vol. 4, G.-C. Rota, ed., Addison-Wesley, Reading, MA, 1974.

18. Olver, P.J., Applications of Lie Groups to Differential Equations, Springer-Verlag, New York, 1986.

19. Ovsiannikov, L.V., Group Analysis of Differential Equations, Academic Press, New York, 1982.

20. Patera, J., Winternitz, P., "A new basis for the representations of the rotation group: Lame and Heun polynomials," J. Math. Phys., 14, pp. 1130–1139, 1973.

21. Poluyanov, L.V., Aguilar, A., Gonzalez, M., Group Properties of the Acoustic Differential Equation, Taylor & Francis, London, 1995.

22. Renyi, A., Probability Theory, North-Holland, Amsterdam, 1970.

23. Shannon, C.E., "A mathematical theory of communication," Bell Syst. Tech. J., 27, pp. 379–423 and 623–656, July and October, 1948.

24. Van Loan, C., Computational Frameworks for the Fast Fourier Transform, SIAM, Philadelphia, 1992.

25. Weisner, L., "Group-theoretic origin of certain generating functions," Pacific J. Math., 5, pp. 1033–1039, 1955.

26. Weisner, L., "Generating functions for Hermite functions," Can. J. Math., 11, pp. 141–147, 1959.

27. Weisner, L., "Generating functions for Bessel functions," Can. J. Math., 11, pp. 148–155, 1959.

28. Winternitz, P., Fris, I., "Invariant expansions of relativistic amplitudes and subgroups of the proper Lorentz group," Soviet Physics JNP, 1, pp. 636–643, 1965.
