
Solving Challenging Non-linear Regression Problems by Manipulating a Gaussian Distribution

Machine Learning Summer School, Cambridge 2009

Carl Edward Rasmussen

Department of Engineering, University of Cambridge

August 30th - September 10th, 2009


The Prediction Problem

[Figure: atmospheric CO2 concentration, ppm (320-420), plotted against year (1960-2020), with a question mark over the future values to be predicted.]


Maximum likelihood, parametric model

Supervised parametric learning:

• data: x, y
• model: y = f_w(x) + ε

Gaussian likelihood:

p(y|x, w) ∝ ∏_c exp( −(y_c − f_w(x_c))² / (2σ²_noise) ).

Maximize the likelihood:

w_ML = argmax_w p(y|x, w).

Make predictions by plugging in the ML estimate:

p(y∗|x∗, w_ML)
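
As a concrete illustration (not from the slides), a minimal Octave/MATLAB sketch for a model that is linear in its parameters, f_w(x) = w(1) + w(2)x; with Gaussian noise the ML estimate is simply the least-squares solution:

x = (0:0.1:1)';                        % inputs (made-up data)
y = 1 + 2*x + 0.1*randn(size(x));      % noisy targets
Phi = [ones(size(x)), x];              % basis for f_w(x) = w(1) + w(2)*x
w_ML = Phi \ y;                        % Gaussian likelihood => least squares
xs = (0:0.05:1.5)';                    % test inputs
fs = [ones(size(xs)), xs] * w_ML;      % plug-in predictions p(y*|x*, w_ML)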


Bayesian Inference, parametric model

Supervised parametric learning:

• data: x, y
• model: y = f_w(x) + ε

Gaussian likelihood:

p(y|x, w) ∝ ∏_c exp( −(y_c − f_w(x_c))² / (2σ²_noise) ).

Parameter prior: p(w)

Posterior parameter distribution by Bayes rule p(a|b) = p(b|a)p(a)/p(b):

p(w|x, y) = p(w) p(y|x, w) / p(y|x)


Bayesian Inference, parametric model, cont.

Making predictions:

p(y∗|x∗, x, y) = ∫ p(y∗|w, x∗) p(w|x, y) dw

Marginal likelihood:

p(y|x) = ∫ p(w) p(y|x, w) dw.


The Gaussian Distribution

The Gaussian distribution is given by

p(x|µ, Σ) = N(µ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp( −½ (x − µ)ᵀ Σ⁻¹ (x − µ) ),

where µ is the mean vector and Σ the covariance matrix.


Conditionals and Marginals of a Gaussian

[Figure: a two-dimensional joint Gaussian shown with a conditional (left) and a marginal (right).]

Both the conditionals and the marginals of a joint Gaussian are again Gaussian.


Conditionals and Marginals of a Gaussian

In algebra, if x and y are jointly Gaussian

p(x, y) = N( [a; b], [A B; Bᵀ C] ),

the marginal distribution of x is

p(x, y) = N( [a; b], [A B; Bᵀ C] )   =⇒   p(x) = N(a, A),

and the conditional distribution of x given y is

p(x, y) = N( [a; b], [A B; Bᵀ C] )   =⇒   p(x|y) = N( a + BC⁻¹(y − b), A − BC⁻¹Bᵀ ),

where x and y can be scalars or vectors.
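
A minimal Octave/MATLAB sketch of the conditioning formula, with made-up numbers for a, b, A, B, C:

a = 0;  b = 1;                    % means of x and y
A = 2;  B = 0.8;  C = 1;          % joint covariance [A B; B' C]
y_obs = 1.5;                      % observed value of y
mu_cond  = a + B/C*(y_obs - b);   % conditional mean  a + B C^-1 (y - b)
var_cond = A - B/C*B';            % conditional cov   A - B C^-1 B'
% the marginal p(x) = N(a, A) is obtained by simply dropping the y block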


What is a Gaussian Process?

A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.

Informally: infinitely long vector ≃ function

Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.

A Gaussian distribution is fully specified by a mean vector, µ, and covariance matrix Σ:

f = (f₁, . . . , f_n)ᵀ ∼ N(µ, Σ),   indexes i = 1, . . . , n

A Gaussian process is fully specified by a mean function m(x) and covariance function k(x, x′):

f(x) ∼ GP( m(x), k(x, x′) ),   indexes: x


The marginalization property

Thinking of a GP as a Gaussian distribution with an infinitely long mean vector and an infinite-by-infinite covariance matrix may seem impractical. . .

. . . luckily we are saved by the marginalization property:

Recall:

p(x) = ∫ p(x, y) dy.

For Gaussians:

p(x, y) = N( [a; b], [A B; Bᵀ C] )   =⇒   p(x) = N(a, A)


Random functions from a Gaussian Process

Example one dimensional Gaussian process:

p(f(x)) ∼ GP( m(x) = 0, k(x, x′) = exp(−½ (x − x′)²) ).

To get an indication of what this distribution over functions looks like, focus on a finite subset of function values f = (f(x₁), f(x₂), . . . , f(x_n))ᵀ, for which

f ∼ N(0, Σ),

where Σ_ij = k(x_i, x_j).

Then plot the coordinates of f as a function of the corresponding x values.
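
A minimal Octave/MATLAB sketch of this recipe, using the Cholesky-based sampling from the Joint Generation slide below (the input grid, the number of draws and the small jitter term are arbitrary choices):

n = 200;
x = linspace(-5, 5, n)';                                 % finite subset of inputs
Sigma = exp(-0.5*(repmat(x,1,n) - repmat(x',n,1)).^2);   % Sigma_ij = k(x_i, x_j)
Sigma = Sigma + 1e-8*eye(n);                             % jitter for numerical stability
f = chol(Sigma)' * randn(n, 3);                          % three independent draws
plot(x, f);  xlabel('input, x');  ylabel('output, f(x)');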


Some values of the random function

[Figure: the sampled function values, output f(x), plotted against input x ∈ [−5, 5].]


Joint Generation

To generate a random sample from a D dimensional joint Gaussian with covariance matrix K and mean vector m: (in octave or matlab)

z = randn(D,1);

y = chol(K)'*z + m;

where chol(K) is the Cholesky factor R such that RᵀR = K.

Thus, the covariance of y is:

E[(y − m)(y − m)ᵀ] = E[Rᵀ z zᵀ R] = Rᵀ E[z zᵀ] R = Rᵀ I R = K.


Sequential Generation

Factorize the joint distribution

p(f₁, . . . , f_n | x₁, . . . , x_n) = ∏_{i=1}^{n} p(f_i | f_{i−1}, . . . , f₁, x_i, . . . , x₁),

and generate function values sequentially.

What do the individual terms look like? For Gaussians:

p(x, y) = N( [a; b], [A B; Bᵀ C] )   =⇒   p(x|y) = N( a + BC⁻¹(y − b), A − BC⁻¹Bᵀ )

Do try this at home!
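
A sketch of sequential generation in Octave/MATLAB, applying the Gaussian conditional above with a = b = 0 (arbitrary grid, squared exponential covariance); it is meant purely as an illustration, the joint Cholesky recipe on the previous slide is far more efficient:

n = 50;
x = linspace(-5, 5, n)';
K = exp(-0.5*(repmat(x,1,n) - repmat(x',n,1)).^2) + 1e-8*eye(n);
f = zeros(n, 1);
for i = 1:n
  if i == 1
    mu = 0;  v = K(1,1);
  else
    B  = K(i, 1:i-1);                  % cross-covariance with earlier values
    C  = K(1:i-1, 1:i-1);              % covariance of earlier values
    mu = B * (C \ f(1:i-1));           % conditional mean   B C^-1 f
    v  = K(i,i) - B * (C \ B');        % conditional var    A - B C^-1 B'
  end
  f(i) = mu + sqrt(max(v, 0)) * randn; % draw f_i from p(f_i | f_1,...,f_{i-1})
end
plot(x, f);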


[Figure: a function drawn at random from a Gaussian Process with Gaussian covariance, shown over a two-dimensional input.]


Maximum likelihood, parametric model

Supervised parametric learning:

• data: x, y
• model: y = f_w(x) + ε

Gaussian likelihood:

p(y|x, w, M_i) ∝ ∏_c exp( −(y_c − f_w(x_c))² / (2σ²_noise) ).

Maximize the likelihood:

w_ML = argmax_w p(y|x, w, M_i).

Make predictions by plugging in the ML estimate:

p(y∗|x∗, w_ML, M_i)


Bayesian Inference, parametric model

Supervised parametric learning:

• data: x, y
• model: y = f_w(x) + ε

Gaussian likelihood:

p(y|x, w, M_i) ∝ ∏_c exp( −(y_c − f_w(x_c))² / (2σ²_noise) ).

Parameter prior: p(w|M_i)

Posterior parameter distribution by Bayes rule p(a|b) = p(b|a)p(a)/p(b):

p(w|x, y, M_i) = p(w|M_i) p(y|x, w, M_i) / p(y|x, M_i)


Bayesian Inference, parametric model, cont.

Making predictions:

p(y∗|x∗, x, y, M_i) = ∫ p(y∗|w, x∗, M_i) p(w|x, y, M_i) dw

Marginal likelihood:

p(y|x, M_i) = ∫ p(w|M_i) p(y|x, w, M_i) dw.

Model probability:

p(M_i|x, y) = p(M_i) p(y|x, M_i) / p(y|x)

Problem: integrals are intractable for most interesting models!


Non-parametric Gaussian process models

In our non-parametric model, the “parameters” are the function itself!

Gaussian likelihood:

y | x, f(x), M_i ∼ N(f, σ²_noise I)

(Zero mean) Gaussian process prior:

f(x) | M_i ∼ GP( m(x) ≡ 0, k(x, x′) )

Leads to a Gaussian process posterior

f(x) | x, y, M_i ∼ GP( m_post(x) = k(x, x)[K(x, x) + σ²_noise I]⁻¹ y,
                       k_post(x, x′) = k(x, x′) − k(x, x)[K(x, x) + σ²_noise I]⁻¹ k(x, x′) ).

And a Gaussian predictive distribution:

y∗ | x∗, x, y, M_i ∼ N( k(x∗, x)ᵀ[K + σ²_noise I]⁻¹ y,
                        k(x∗, x∗) + σ²_noise − k(x∗, x)ᵀ[K + σ²_noise I]⁻¹ k(x∗, x) )


Prior and Posterior

[Figure: functions drawn from the prior and from the posterior, plotted as output f(x) against input x ∈ [−5, 5].]

Predictive distribution:

p(y∗|x∗, x, y) ∼ N( k(x∗, x)ᵀ[K + σ²_noise I]⁻¹ y,
                    k(x∗, x∗) + σ²_noise − k(x∗, x)ᵀ[K + σ²_noise I]⁻¹ k(x∗, x) )


Graphical model for Gaussian Process

[Figure: graphical model with latent variables f₁, . . . , f_n and f∗, inputs x₁, . . . , x_n and x∗, and observations y₁, . . . , y_n and y∗.]

Square nodes are observed (clamped), round nodes stochastic (free).

All pairs of latent variables are connected.

Predictions y∗ depend only on the corresponding single latent f∗.

Notice that adding a triplet x∗_m, f∗_m, y∗_m does not influence the distribution. This is guaranteed by the marginalization property of the GP.

This explains why we can make inference using a finite amount of computation!


Some interpretation

Recall our main result:

f∗ | X∗, X, y ∼ N( K(X∗, X)[K(X, X) + σ²_n I]⁻¹ y,
                   K(X∗, X∗) − K(X∗, X)[K(X, X) + σ²_n I]⁻¹ K(X, X∗) ).

The mean is linear in two ways:

µ(x∗) = k(x∗, X)[K(X, X) + σ²_n I]⁻¹ y = ∑_{c=1}^{n} β_c y_{(c)} = ∑_{c=1}^{n} α_c k(x∗, x_{(c)}).

The last form is most commonly encountered in the kernel literature.

The variance is the difference between two terms:

V(x∗) = k(x∗, x∗) − k(x∗, X)[K(X, X) + σ²_n I]⁻¹ k(X, x∗),

the first term is the prior variance, from which we subtract a (positive) term telling how much the data X has explained. Note that the variance is independent of the observed outputs y.
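
Reusing the variables from the earlier toy sketch (K, Ks, sn2, x, y), the weight view of the mean is immediate:

alpha  = (K + sn2*eye(numel(x))) \ y;   % the weights alpha_c on the kernel functions
mu_alt = Ks * alpha;                    % mu(x*) = sum_c alpha_c k(x*, x_c)
% mu_alt equals the predictive mean computed before, up to round-off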


The marginal likelihood

Log marginal likelihood:

log p(y|x, M_i) = −½ yᵀK⁻¹y − ½ log|K| − (n/2) log(2π)

is the combination of a data fit term and a complexity penalty. Occam’s Razor is automatic.

Learning in Gaussian process models involves finding

• the form of the covariance function, and
• any unknown (hyper-) parameters θ.

This can be done by optimizing the marginal likelihood:

∂ log p(y|x, θ, M_i) / ∂θ_j = ½ yᵀK⁻¹ (∂K/∂θ_j) K⁻¹ y − ½ trace( K⁻¹ ∂K/∂θ_j )
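
A sketch of these two expressions in Octave/MATLAB, for the covariance k(x, x′) = exp(−(x − x′)²/(2ℓ²)) with the length scale ℓ as the only free hyperparameter and a fixed noise variance sn2 (function and variable names are illustrative):

function [lml, dlml] = logmarglik(ell, x, y, sn2)
  n   = numel(x);
  d2  = (repmat(x,1,n) - repmat(x',n,1)).^2;   % squared distances
  K   = exp(-0.5*d2/ell^2) + sn2*eye(n);       % covariance incl. noise
  L   = chol(K, 'lower');
  alpha = L' \ (L \ y);                        % K^{-1} y
  lml = -0.5*y'*alpha - sum(log(diag(L))) - 0.5*n*log(2*pi);
  dK  = exp(-0.5*d2/ell^2) .* d2 / ell^3;      % dK/d(ell)
  Kinv = L' \ (L \ eye(n));                    % explicit inverse, fine for small n
  dlml = 0.5*alpha'*dK*alpha - 0.5*trace(Kinv*dK);
end

Any gradient-based optimizer can then be used to maximize lml over ℓ, and in the same way over the remaining hyperparameters.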


Example: Fitting the length scale parameter

Parameterized covariance function:  k(x, x′) = v² exp( −(x − x′)²/(2ℓ²) ) + σ²_n δ_{xx′}.

[Figure: data over x ∈ [−10, 10] with posterior mean functions for three length scales; legend: observations, too short, good length scale, too long.]

The mean posterior predictive function is plotted for 3 different length scales (the green curve corresponds to optimizing the marginal likelihood). Notice that an almost exact fit to the data can be achieved by reducing the length scale – but the marginal likelihood does not favour this!


Why, in principle, does Bayesian Inference work? Occam’s Razor

[Figure: the marginal likelihood P(Y|M_i) plotted over all possible data sets Y for models that are too simple, too complex, and "just right".]


An illustrative analogous example

Imagine the simple task of fitting the variance, σ², of a zero-mean Gaussian to a set of n scalar observations.

The log likelihood is  log p(y|µ, σ²) = −½ yᵀI y/σ² − ½ log|Iσ²| − (n/2) log(2π)
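
For completeness (this step is not on the slide): the data-fit term −½ yᵀy/σ² favours large σ², the complexity term −½ log|Iσ²| = −(n/2) log σ² favours small σ², and setting the derivative to zero balances the two at the familiar maximum likelihood estimate

σ²_ML = yᵀy / n,

mirroring the trade-off between data fit and complexity penalty in the GP marginal likelihood.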


From random functions to covariance functions

Consider the class of linear functions:

f(x) = ax + b,   where a ∼ N(0, α), and b ∼ N(0, β).

We can compute the mean function:

µ(x) = E[f(x)] = ∫∫ f(x) p(a) p(b) da db = ∫ a x p(a) da + ∫ b p(b) db = 0,

and covariance function:

k(x, x′) = E[(f(x) − 0)(f(x′) − 0)] = ∫∫ (ax + b)(ax′ + b) p(a) p(b) da db

         = ∫ a² x x′ p(a) da + ∫ b² p(b) db + (x + x′) ∫∫ a b p(a) p(b) da db = α x x′ + β.


From random functions to covariance functions II

Consider the class of functions (sums of squared exponentials):

f(x) = lim_{n→∞} (1/n) ∑_i γ_i exp(−(x − i/n)²),   where γ_i ∼ N(0, 1), ∀i

     = ∫_{−∞}^{∞} γ(u) exp(−(x − u)²) du,   where γ(u) ∼ N(0, 1), ∀u.

The mean function is:

µ(x) = E[f(x)] = ∫_{−∞}^{∞} exp(−(x − u)²) ∫_{−∞}^{∞} γ p(γ) dγ du = 0,

and the covariance function:

E[f(x)f(x′)] = ∫ exp( −(x − u)² − (x′ − u)² ) du

             = ∫ exp( −2(u − (x + x′)/2)² + (x + x′)²/2 − x² − x′² ) du  ∝  exp( −(x − x′)²/2 ).

Thus, the squared exponential covariance function is equivalent to regression using infinitely many Gaussian shaped basis functions placed everywhere, not just at your training points!
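
A sketch of this correspondence in Octave/MATLAB: summing Gaussian-shaped basis functions on a dense grid of centres, with independent N(0, 1) weights scaled by the square root of the grid spacing, reproduces the squared exponential covariance up to a constant (grid and test inputs are arbitrary choices):

N  = 2000;  u = linspace(-10, 10, N)';  du = u(2) - u(1);
phi = @(x) exp(-(x - u').^2);                 % basis functions centred on the grid u
x1 = 0.3;  x2 = 1.1;
% for f(x) = sum_i gamma_i * sqrt(du) * phi_i(x), gamma_i ~ N(0,1) i.i.d.:
k_finite = du * (phi(x1) * phi(x2)');         % implied cov between f(x1) and f(x2)
k_exact  = sqrt(pi/2) * exp(-(x1 - x2)^2/2);  % the integral over infinitely many centres
% k_finite approaches k_exact as the grid becomes dense and wide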


Using finitely many basis functions may be dangerous!

[Figure: data fitted over x ∈ [−10, 10], with a question mark over the predictions away from the data.]


Spline models

One dimensional minimization problem: finding the function f(x) which minimizes

∑_{i=1}^{n} (f(x_{(i)}) − y_{(i)})² + λ ∫₀¹ (f″(x))² dx,

where 0 < x_{(i)} < x_{(i+1)} < 1, ∀i = 1, . . . , n − 1, has as solution the Natural Smoothing Cubic Spline: first order polynomials when x ∈ [0; x_{(1)}] and when x ∈ [x_{(n)}; 1], and a cubic polynomial in each interval x ∈ [x_{(i)}; x_{(i+1)}], ∀i = 1, . . . , n − 1, joined to have continuous second derivatives at the knots.

The identical function is also the mean of a Gaussian process: consider the class of functions given by:

f(x) = α + βx + lim_{n→∞} (1/√n) ∑_{i=0}^{n−1} γ_i (x − i/n)₊,   where (x)₊ = x if x > 0 and 0 otherwise,

with Gaussian priors:

α ∼ N(0, ξ),  β ∼ N(0, ξ),  γ_i ∼ N(0, Γ), ∀i = 0, . . . , n − 1.


The covariance function becomes:

k(x, x′) = ξ + xx′ξ + Γ lim_{n→∞} (1/n) ∑_{i=0}^{n−1} (x − i/n)₊ (x′ − i/n)₊

         = ξ + xx′ξ + Γ ∫₀¹ (x − u)₊ (x′ − u)₊ du

         = ξ + xx′ξ + Γ ( ½ |x − x′| min(x, x′)² + ⅓ min(x, x′)³ ).

In the limit ξ → ∞ and λ = σ²_n/Γ the posterior mean becomes the natural cubic spline.

We can thus find the hyperparameters σ² and Γ (and thereby λ) by maximising the marginal likelihood in the usual way.

Defining h(x) = (1, x)ᵀ, the posterior predictions have mean and variance:

µ̄(X∗) = H(X∗)ᵀβ + K(X∗, X)[K(X, X) + σ²_n I]⁻¹ (y − H(X)ᵀβ)

Σ̄(x∗) = Σ(X∗) + R(X, X∗)ᵀ A(X)⁻¹ R(X, X∗)

β = A(X)⁻¹ H(X)[K + σ²_n I]⁻¹ y,   A(X) = H(X)[K(X, X) + σ²_n I]⁻¹ H(X)ᵀ

R(X, X∗) = H(X∗) − H(X)[K + σ²_n I]⁻¹ K(X, X∗)


Cubic Splines, Example

Although this is not the fastest way to compute splines, it offers a principled way of finding hyperparameters, and uncertainties on predictions.

Note also that although the posterior mean is smooth (piecewise cubic), posterior sample functions are not.

[Figure: spline regression example on [0, 1]; legend: observations, mean, two random samples, 95% confidence region.]


Model Selection in Practise; Hyperparameters

There are two types of task: form and parameters of the covariance function.

Typically, our prior is too weak to quantify aspects of the covariance function. We use a hierarchical model with hyperparameters. E.g., in ARD:

k(x, x′) = v₀² exp( −∑_{d=1}^{D} (x_d − x′_d)² / (2v_d²) ),   hyperparameters θ = (v₀, v₁, . . . , v_D, σ²_n).
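
A small Octave/MATLAB function for this ARD covariance (inputs stored as rows of X and Z; v is the vector of length scales v₁, . . . , v_D):

function K = covARD(X, Z, v0, v)
  n = size(X,1);  m = size(Z,1);  D = size(X,2);
  K = zeros(n, m);
  for d = 1:D
    K = K + (repmat(X(:,d),1,m) - repmat(Z(:,d)',n,1)).^2 / (2*v(d)^2);
  end
  K = v0^2 * exp(-K);
end

A large v_d means the function varies only slowly along input d, so that dimension is effectively switched off; this is what the three panels below illustrate.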

[Figure: functions drawn from ARD priors over two inputs x1, x2, for three settings: v1 = v2 = 1; v1 = v2 = 0.32; v1 = 0.32 and v2 = 1.]


Rational quadratic covariance function

The rational quadratic (RQ) covariance function:

k_RQ(r) = ( 1 + r²/(2αℓ²) )^{−α}

with α, ℓ > 0 can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic length-scales.

Using τ = ℓ⁻² and p(τ|α, β) ∝ τ^{α−1} exp(−ατ/β):

k_RQ(r) = ∫ p(τ|α, β) k_SE(r|τ) dτ ∝ ∫ τ^{α−1} exp(−ατ/β) exp(−τr²/2) dτ ∝ ( 1 + r²/(2αℓ²) )^{−α},


Rational quadratic covariance function II

[Figure: the RQ covariance plotted against input distance for α = 1/2, α = 2, and α → ∞, together with sample functions, output f(x) against input x, drawn from GPs with these covariances.]

The limit α → ∞ of the RQ covariance function is the SE.


Matérn covariance functions

Stationary covariance functions can be based on the Matérn form:

k(x, x′) = (1 / (Γ(ν) 2^{ν−1})) ( (√(2ν)/ℓ) |x − x′| )^{ν} K_ν( (√(2ν)/ℓ) |x − x′| ),

where K_ν is the modified Bessel function of the second kind of order ν, and ℓ is the characteristic length scale.

Sample functions from Matérn forms are ⌈ν⌉ − 1 times differentiable. Thus, the hyperparameter ν can control the degree of smoothness.

Special cases:

• k_{ν=1/2}(r) = exp(−r/ℓ): Laplacian covariance function, Brownian motion (Ornstein-Uhlenbeck)
• k_{ν=3/2}(r) = (1 + √3 r/ℓ) exp(−√3 r/ℓ)   (once differentiable)
• k_{ν=5/2}(r) = (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(−√5 r/ℓ)   (twice differentiable)
• k_{ν→∞}(r) = exp(−r²/(2ℓ²)): smooth (infinitely differentiable)


Matérn covariance functions II

Univariate Matérn covariance function with unit characteristic length scale and unit variance:

[Figure: left, the Matérn covariance function against input distance; right, sample functions, output f(x) against input x; for ν = 1/2, ν = 1, ν = 2, and ν → ∞.]


Periodic, smooth functions

To create a distribution over periodic functions of x, we can first map the inputs to u = (sin(x), cos(x))ᵀ, and then measure distances in the u space. Combined with the SE covariance function with characteristic length scale ℓ, we get:

k_periodic(x, x′) = exp( −2 sin²(π(x − x′)) / ℓ² )
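
A quick Octave/MATLAB sketch drawing functions from a GP with this covariance (grid, length scale and jitter are arbitrary choices):

n = 200;  x = linspace(-2, 2, n)';  ell = 1;
D = repmat(x,1,n) - repmat(x',n,1);               % pairwise differences
K = exp(-2*sin(pi*D).^2 / ell^2) + 1e-8*eye(n);   % periodic covariance plus jitter
f = chol(K)' * randn(n, 3);                       % three random periodic functions
plot(x, f);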

[Figure: three functions drawn at random; left ℓ > 1, and right ℓ < 1.]


The Prediction Problem

[Figure: the CO2 prediction problem again: concentration, ppm (320-420), against year (1960-2020), with a question mark over the future values.]


Covariance Function

The covariance function consists of several terms, parameterized by a total of 11 hyperparameters:

• long-term smooth trend (squared exponential)
  k₁(x, x′) = θ₁² exp( −(x − x′)²/θ₂² ),

• seasonal trend (quasi-periodic smooth)
  k₂(x, x′) = θ₃² exp( −2 sin²(π(x − x′))/θ₅² ) × exp( −½ (x − x′)²/θ₄² ),

• short- and medium-term anomaly (rational quadratic)
  k₃(x, x′) = θ₆² ( 1 + (x − x′)²/(2θ₈θ₇²) )^{−θ₈}

• noise (independent Gaussian, and dependent)
  k₄(x, x′) = θ₉² exp( −(x − x′)²/(2θ₁₀²) ) + θ₁₁² δ_{xx′}.

k(x, x′) = k₁(x, x′) + k₂(x, x′) + k₃(x, x′) + k₄(x, x′)

Let’s try this with the gpml software (http://www.gaussianprocess.org/gpml).
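
For reference, a direct Octave/MATLAB transcription of the four terms above, without the gpml toolbox (th is the vector of the 11 hyperparameters; x and z are scalar inputs in years):

function k = co2cov(x, z, th)
  d  = x - z;
  k1 = th(1)^2 * exp(-d^2/th(2)^2);                                    % long-term trend
  k2 = th(3)^2 * exp(-2*sin(pi*d)^2/th(5)^2) * exp(-0.5*d^2/th(4)^2);  % seasonal
  k3 = th(6)^2 * (1 + d^2/(2*th(8)*th(7)^2))^(-th(8));                 % medium-term anomaly
  k4 = th(9)^2 * exp(-d^2/(2*th(10)^2)) + th(11)^2*(d == 0);           % noise terms
  k  = k1 + k2 + k3 + k4;
end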


Long- and medium-term mean predictions

[Figure: long- and medium-term mean predictions of CO2 concentration, ppm, against year (1960-2020), with the medium-term component shown on a separate ±1 ppm scale.]


Mean Seasonal Component

[Figure: contour plot of the mean seasonal component (ppm) over month (J-D) and year (1960-2020), with contour levels from about −3.6 to 3.1 ppm.]

Seasonal component: magnitude θ₃ = 2.4 ppm, decay-time θ₄ = 90 years.

Dependent noise, magnitude θ₉ = 0.18 ppm, decay θ₁₀ = 1.6 months. Independent noise, magnitude θ₁₁ = 0.19 ppm.

Optimize or integrate out? See MacKay [3].


Binary Gaussian Process Classification

[Figure: a latent function f(x) plotted against input x, and the corresponding class probability π(x) between 0 and 1.]

The class probability is related to the latent function, f, through:

p(y = 1 | f(x)) = π(x) = Φ( f(x) ),

where Φ is a sigmoid function, such as the logistic or cumulative Gaussian. Observations are independent given f, so the likelihood is

p(y|f) = ∏_{i=1}^{n} p(y_i|f_i) = ∏_{i=1}^{n} Φ(y_i f_i).


Prior and Posterior for Classification

We use a Gaussian process prior for the latent function:

f|X, θ ∼ N(0, K)

The posterior becomes:

p(f|D, θ) = p(y|f) p(f|X, θ) / p(D|θ) = ( N(f|0, K) / p(D|θ) ) ∏_{i=1}^{m} Φ(y_i f_i),

which is non-Gaussian.

The latent value at the test point, f(x∗), is

p(f∗|D, θ, x∗) = ∫ p(f∗|f, X, θ, x∗) p(f|D, θ) df,

and the predictive class probability becomes

p(y∗|D, θ, x∗) = ∫ p(y∗|f∗) p(f∗|D, θ, x∗) df∗,

both of which are intractable to compute.


Gaussian Approximation to the Posterior

We approximate the non-Gaussian posterior by a Gaussian:

p(f|D, θ) ≃ q(f|D, θ) = N(m, A)

then q(f∗|D, θ, x∗) = N(f∗|µ∗, σ²∗), where

µ∗ = k∗ᵀ K⁻¹ m

σ²∗ = k(x∗, x∗) − k∗ᵀ (K⁻¹ − K⁻¹AK⁻¹) k∗.

Using this approximation with the cumulative Gaussian likelihood

q(y∗ = 1|D, θ, x∗) = ∫ Φ(f∗) N(f∗|µ∗, σ²∗) df∗ = Φ( µ∗ / √(1 + σ²∗) )
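
The final expression is trivial to evaluate; a two-line Octave/MATLAB sketch, given the approximate predictive mean mu_star and variance s2_star from above (names are illustrative), using erf for the cumulative Gaussian:

Phi    = @(z) 0.5*(1 + erf(z/sqrt(2)));      % standard normal CDF
p_star = Phi(mu_star / sqrt(1 + s2_star));   % approximate p(y* = 1 | D, theta, x*)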


Laplace’s method and Expectation Propagation

How do we find a good Gaussian approximation N(m, A) to the posterior?

Laplace’s method: find the Maximum A Posteriori (MAP) latent values f_MAP, and use a local (Gaussian) expansion around this point, as suggested by Williams and Barber [8].

Variational bounds: bound the likelihood by some tractable expression. A local variational bound for each likelihood term was given by Gibbs and MacKay [1]. A lower bound based on Jensen’s inequality was given by Opper and by Seeger [5].

Expectation Propagation: use an approximation of the likelihood, such that the moments of the marginals of the approximate posterior match the (approximate) moments of the posterior, Minka [4].


Expectation Propagation

[Figure: the GP graphical model with latent training values f_{x_i}, latent test values f∗, training targets y_i and test targets y∗.]

f̃_i = proj[ p(y_i|f_i) q^{\i} ] / q^{\i},   q^{\i} = p(f) ∏_{j≠i} f̃_j.


USPS Digits, 3s vs 5s

[Figure: contour plots over log lengthscale, log(l), and log magnitude, log(σf), for the USPS 3s vs 5s task.]


USPS Digits, 3s vs 5s

[Figure: two marginal distributions of the latent value f: histograms of MCMC samples compared with the Laplace and EP approximations to p(f|D).]


The Structure of the posterior

[Figure: one-dimensional illustration of the structure of the posterior, showing a distribution p(xi) over xi.]


Sparse Approximations

Recall the graphical model for a Gaussian process. Inference is expensive because the latent variables are fully connected.

[Figure: the GP graphical model with fully connected latent variables f_{x_i}, f∗ and their observations y, y∗.]

Exact inference: O(n³).

Sparse approximations: solve a smaller, sparse, approximation of the original problem.

Algorithm: Subset of data.

Are there better ways to sparsify?


Inducing Variables

Because of the marginalization property, we can introduce more latent variables without changing the distribution of the original variables.

[Figure: the GP graphical model augmented with inducing variables f_{u_1}, f_{u_2}, f_{u_3} alongside the training and test latents.]

The u = (u₁, u₂, . . .)ᵀ are called inducing variables.

The inducing variables have associated inducing inputs, s, but no associated output values.

The marginalization property ensures that

p(f, f∗) = ∫ p(f, f∗, u) du


The Central Approximations

In a unifying treatment, Quiñonero-Candela and Rasmussen [2] assume that training and test sets are conditionally independent given u.

[Figure: the GP graphical model in which the training and test latent variables interact only through the inducing variables f_{u_1}, f_{u_2}, f_{u_3}.]

Assume: p(f, f∗) ≃ q(f, f∗), where

q(f, f∗) = ∫ q(f∗|u) q(f|u) p(u) du.

The inducing variables induce the dependencies between training and test cases.

Different sparse algorithms in the literature correspond to different

• choices of the inducing inputs
• further approximations


Training and test conditionals

The exact training and test conditionals are:

p(f|u)  = N( K_{f,u} K_{u,u}⁻¹ u,  K_{f,f} − Q_{f,f} )

p(f∗|u) = N( K_{f∗,u} K_{u,u}⁻¹ u,  K_{f∗,f∗} − Q_{f∗,f∗} ),

where Q_{a,b} = K_{a,u} K_{u,u}⁻¹ K_{u,b}.

These equations are easily recognized as the usual predictive equations for GPs.

The effective prior is:

q(f, f∗) = N( 0, [ K_{f,f}  Q_{f,∗} ;  Q_{∗,f}  K_{∗,∗} ] )
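
A small Octave/MATLAB sketch of the Q matrices (the blocks Kfu and Kuu are assumed to have been built from a covariance function and a chosen set of inducing inputs; a little jitter is added for stability):

Luu = chol(Kuu + 1e-8*eye(size(Kuu,1)), 'lower');
V   = Luu \ Kfu';             % so that V'*V = Kfu * Kuu^{-1} * Kfu'
Qff = V' * V;                 % Q_{f,f} = K_{f,u} K_{u,u}^{-1} K_{u,f}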


Example: Sparse parametric Gaussian processes

Snelson and Ghahramani [6] introduced the idea of sparse GP inference based on a pseudo data set, integrating out the targets, and optimizing the inputs.

Equivalently, in the unifying scheme:

q(f|u)  = N( K_{f,u} K_{u,u}⁻¹ u,  diag[K_{f,f} − Q_{f,f}] )

q(f∗|u) = p(f∗|u).

The effective prior becomes

q_FITC(f, f∗) = N( 0, [ Q_{f,f} − diag[Q_{f,f} − K_{f,f}]   Q_{f,∗} ;  Q_{∗,f}   K_{∗,∗} ] ),

which can be computed efficiently.

The Bayesian Committee Machine [7] uses block diag instead of diag, and takes the inducing variables to be the test cases.


The FITC Factor Graph

Sparse Parametric Gaussian Process ≡ Fully Independent Training Conditional

[Figure: the FITC factor graph, in which the inducing variables f_{u_1}, f_{u_2}, f_{u_3} link the otherwise independent training and test latents.]


Conclusions

Complex non-linear inference problems can be solved by manipulating plain old Gaussian distributions.


More on Gaussian Processes

Rasmussen and Williams, Gaussian Processes for Machine Learning, MIT Press, 2006. http://www.GaussianProcess.org/gpml

Gaussian process web (code, papers, etc): http://www.GaussianProcess.org


A few references

[1] Gibbs, M. N. and MacKay, D. J. C. (2000). Variational Gaussian Process Classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464.

[2] Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6:1939–1959.

[3] MacKay, D. J. C. (1999). Comparison of Approximate Methods for Handling Hyperparameters. Neural Computation, 11(5):1035–1068.

[4] Minka, T. P. (2001). A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology.

[5] Seeger, M. (2003). Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations. PhD thesis, School of Informatics, University of Edinburgh. http://www.cs.berkeley.edu/∼mseeger.

[6] Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian Processes using Pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press.

[7] Tresp, V. (2000). A Bayesian Committee Machine. Neural Computation, 12(11):2719–2741.

[8] Williams, C. K. I. and Barber, D. (1998). Bayesian Classification with Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.
