
Gaussian Processes for Natural Language Processing

http://goo.gl/18heUk

Trevor Cohn (1), Daniel Preotiuc-Pietro (2), Neil Lawrence (2)

(1) Computing and Information Systems, (2) Department of Computer Science

ACL 2014 Tutorial, 22 June 2014 (Special thanks also to Daniel Beck)

Gaussian Processes

Brings together several key ideas in one framework:

- Bayesian
- kernelised
- non-parametric
- non-linear
- modelling uncertainty

Elegant and powerful framework, with growing popularity in machine learning and application domains.

Gaussian Processes

State of the art for regression

- exact posterior inference
- supports very complex non-linear functions
- elegant model selection

Now mature enough for use in NLP

- support for classification, ranking, etc.
- fancy kernels, e.g., text
- sparse approximations for large scale inference

Several great toolkits:

https://github.com/SheffieldML/GPy

http://www.gaussianprocess.org/gpml
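To make the regression workflow concrete, here is a minimal sketch using the GPy toolkit listed above (the data is synthetic and purely illustrative; the kernel settings are arbitrary starting points):

import numpy as np
import GPy

# Synthetic 1-D regression data (illustrative only)
X = np.random.uniform(-3.0, 3.0, (20, 1))
y = np.sin(X) + 0.05 * np.random.randn(20, 1)

# Exponentiated quadratic (RBF) kernel and exact GP regression
kernel = GPy.kern.RBF(input_dim=1, variance=1.0, lengthscale=1.0)
model = GPy.models.GPRegression(X, y, kernel)

# Model selection: optimise kernel hyperparameters and noise variance
# by maximising the (log) marginal likelihood
model.optimize()

# Posterior predictive mean and variance at new inputs
X_star = np.linspace(-3, 3, 100)[:, None]
mean, var = model.predict(X_star)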

Tutorial Scope

Covers

1. GP fundamentals (1 hour)
   - focus on regression
   - weight space vs. function space view
   - squared exponential kernel

2. NLP applications (1 hour 15)
   - sparse GPs
   - multi-output GPs
   - kernels
   - model selection

3. Further topics (30 mins)
   - classification and other likelihoods
   - unsupervised inference
   - scaling to big data

See also materials from the GP Summer/Winter Schools

http://ml.dcs.shef.ac.uk/gpss/gpws14/


Outline

Introduction

GP fundamentals

The Gaussian Density

Covariance from Basis Functions

Basis Function Representations

NLP Applications

Advanced Topics

Book

Rasmussen and Williams (2006), Gaussian Processes for Machine Learning, MIT Press.


Outline

The Gaussian Density

Covariance from Basis Functions

Basis Function Representations


The Gaussian Density

- Perhaps the most common probability density:

  p(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right) \triangleq \mathcal{N}(y \mid \mu, \sigma^2)

- The Gaussian density.

Gaussian Density

Figure: the Gaussian PDF p(h \mid \mu, \sigma^2) with \mu = 1.7 and variance \sigma^2 = 0.0225, plotted against h, height/m. Mean shown as a red line. It could represent the heights of a population of students.

Gaussian Density

\mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right)

\sigma^2 is the variance of the density and \mu is the mean.

Two Important Gaussian Properties

Sum of Gaussians

- The sum of independent Gaussian variables is also Gaussian. If

  y_i \sim \mathcal{N}(\mu_i, \sigma_i^2)

  then the sum is distributed as

  \sum_{i=1}^{n} y_i \sim \mathcal{N}\left( \sum_{i=1}^{n} \mu_i, \sum_{i=1}^{n} \sigma_i^2 \right)

(Aside: as the number of terms increases, the sum of non-Gaussian, finite-variance variables also becomes Gaussian [central limit theorem].)


Two Important Gaussian Properties

Scaling a Gaussian

- Scaling a Gaussian leads to a Gaussian. If

  y \sim \mathcal{N}(\mu, \sigma^2)

  then the scaled variable is distributed as

  wy \sim \mathcal{N}(w\mu, w^2\sigma^2)

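A quick numerical sanity check of these two properties (a minimal numpy sketch; the particular means, variances and scale factor are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
n_samples = 1_000_000

# Sum of independent Gaussians: y1 + y2 ~ N(mu1 + mu2, sigma1^2 + sigma2^2)
y1 = rng.normal(1.0, 2.0, n_samples)    # mu = 1,    sigma = 2
y2 = rng.normal(-0.5, 0.5, n_samples)   # mu = -0.5, sigma = 0.5
s = y1 + y2
print(s.mean(), s.var())                # approx 0.5 and 4.25

# Scaling: w * y ~ N(w * mu, w^2 * sigma^2)
w = 3.0
print((w * y1).mean(), (w * y1).var())  # approx 3.0 and 36.0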

Linear Function

Figure: data points and best fit line for a linear regression between x and y.

Regression Examples

- Predict a real value, y_i, given some inputs x_i.
- Predict quality of meat given spectral measurements (Tecator data).
- Radiocarbon dating, the C14 calibration curve: predict age given quantity of C14 isotope.
- Predict quality of different Go or Backgammon moves given expert rated training data.

y = mx + c

Figure: a sequence of plots of the line y = mx + c for different settings, with the intercept c and slope m annotated on the axes (y against x).

y = mx + c

point 1: x = 1, y = 3    =>  3 = m + c
point 2: x = 3, y = 1    =>  1 = 3m + c
point 3: x = 2, y = 2.5  =>  2.5 = 2m + c
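Three equations in only two unknowns: the system is overdetermined and has no exact solution. A minimal numpy sketch of the standard least-squares fit for these three points (illustrative only):

import numpy as np

# Each row is one point: columns correspond to the unknowns m and c
X = np.array([[1.0, 1.0],
              [3.0, 1.0],
              [2.0, 1.0]])
y = np.array([3.0, 1.0, 2.5])

# Least-squares solution of the overdetermined system y ~ X @ [m, c]
(m, c), residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(m, c)   # best-fit slope and intercept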

[Scanned excerpt from Laplace, A Philosophical Essay on Probabilities:]

"The day will come when, by study pursued through several ages, the things now concealed will appear with evidence; and posterity will be astonished that truths so clear had escaped us." Clairaut then undertook to submit to analysis the perturbations which the comet had experienced by the action of the two great planets, Jupiter and Saturn; after immense calculations he fixed its next passage at the perihelion toward the beginning of April, 1759, which was actually verified by observation. The regularity which astronomy shows us in the movements of the comets doubtless exists also in all phenomena.

The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.

Probability is relative, in part to this ignorance, in part to our knowledge. We know that of three or a greater number of events a single one ought to occur; but nothing induces us to believe that one of them will occur rather than the others. In this state of indecision it is impossible for us to announce their occurrence with certainty. It is, however, probable that one of these events, chosen at will, will not occur because we see several cases equally possible which exclude its occurrence, while only a single one favors it.

The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of...

y = mx + c + ε

point 1: x = 1, y = 3    =>  3 = m + c + ε_1
point 2: x = 3, y = 1    =>  1 = 3m + c + ε_2
point 3: x = 2, y = 2.5  =>  2.5 = 2m + c + ε_3

Underdetermined System

What about two unknowns and one observation?

y_1 = m x_1 + c

Can compute m given c:

m = \frac{y_1 - c}{x_1}

Underdetermined System

Can compute m given c, for example:

c = 1.75    =>  m = 1.25
c = -0.777  =>  m = 3.78
c = -4.01   =>  m = 7.01
c = -0.718  =>  m = 3.72
c = 2.45    =>  m = 0.545
c = -0.657  =>  m = 3.66
c = -3.13   =>  m = 6.13
c = -1.47   =>  m = 4.47

Underdetermined System

Can compute m given c. Assume

c \sim \mathcal{N}(0, 4),

then we find a distribution of solutions.
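A minimal numpy sketch of this idea: draw the intercept c from its prior and compute the implied slope m for the single observation (here x_1 = 1, y_1 = 3, consistent with the c => m pairs listed above):

import numpy as np

rng = np.random.default_rng(1)
x1, y1 = 1.0, 3.0            # the single observation

# Prior over the intercept: c ~ N(0, 4), i.e. standard deviation 2
c = rng.normal(0.0, 2.0, size=8)
m = (y1 - c) / x1            # each prior draw of c implies a slope m

for ci, mi in zip(c, m):
    print(f"c = {ci:6.3f}  =>  m = {mi:6.3f}")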

Probability for Under- and Overdetermined

- To deal with the overdetermined system we introduced a probability distribution for the 'variable', ε_i.
- For the underdetermined system we introduced a probability distribution for the 'parameter', c.
- This is known as a Bayesian treatment.

Multivariate Prior Distributions

- For general Bayesian inference we need multivariate priors.
- E.g. for multivariate linear regression:

  y_i = \sum_j w_j x_{i,j} + \epsilon_i = \mathbf{w}^\top \mathbf{x}_{i,:} + \epsilon_i

  (where we have dropped c for convenience), we need a prior over \mathbf{w}.

- This motivates a multivariate Gaussian density.
- We will use the multivariate Gaussian to put a prior directly on the function (a Gaussian process).

Prior Distribution

- Bayesian inference requires a prior on the parameters.
- The prior represents your belief, before you see the data, of the likely value of the parameters.
- For linear regression, consider a Gaussian prior on the intercept:

  c \sim \mathcal{N}(0, \alpha_1)

Posterior Distribution

- The posterior distribution is found by combining the prior with the likelihood.
- The posterior distribution is your belief, after you see the data, of the likely value of the parameters.
- The posterior is found through Bayes' Rule:

  p(c \mid y) = \frac{p(y \mid c)\, p(c)}{p(y)}

Bayes Update

p(c) = \mathcal{N}(c \mid 0, \alpha_1)

p(y \mid m, c, x, \sigma^2) = \mathcal{N}(y \mid mx + c, \sigma^2)

p(c \mid y, m, x, \sigma^2) = \mathcal{N}\left( c \,\Big|\, \frac{y - mx}{1 + \sigma^2/\alpha_1},\; (\sigma^{-2} + \alpha_1^{-1})^{-1} \right)

Figure: a Gaussian prior combines with a Gaussian likelihood to give a Gaussian posterior (densities plotted over c).
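A minimal numpy sketch of this single-observation update (the data point, noise level and prior variance are arbitrary illustrative choices):

import numpy as np

# One observation of y = m*x + c + eps, with the slope m treated as known here
x, y = 2.0, 2.5
m = 1.0
sigma2 = 0.5    # noise variance
alpha1 = 4.0    # prior variance: c ~ N(0, alpha1)

# Gaussian prior x Gaussian likelihood => Gaussian posterior over c
post_var = 1.0 / (1.0 / sigma2 + 1.0 / alpha1)
post_mean = (y - m * x) / (1.0 + sigma2 / alpha1)
print(post_mean, post_var)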

Stages to Derivation of the Posterior

- Multiply likelihood by prior: they are "exponentiated quadratics", so the answer is always also an exponentiated quadratic, because exp(a^2) exp(b^2) = exp(a^2 + b^2).
- Complete the square to get the resulting density in the form of a Gaussian.
- Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.

Multivariate Regression Likelihood

- Noise corrupted data point:

  y_i = \mathbf{w}^\top \mathbf{x}_{i,:} + \epsilon_i

- Multivariate regression likelihood:

  p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \mathbf{w}^\top \mathbf{x}_{i,:} \right)^2 \right)

- Now use a multivariate Gaussian prior:

  p(\mathbf{w}) = \frac{1}{(2\pi\alpha)^{p/2}} \exp\left( -\frac{1}{2\alpha} \mathbf{w}^\top \mathbf{w} \right)
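With this Gaussian likelihood and Gaussian prior, the posterior over \mathbf{w} is again Gaussian; completing the square gives covariance (X^T X / \sigma^2 + I/\alpha)^{-1} and the familiar ridge-regression-style mean. A minimal numpy sketch with synthetic data (illustrative only):

import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
sigma2, alpha = 0.1, 1.0

# Synthetic data: y = X w_true + noise
X = rng.normal(size=(n, p))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Gaussian posterior over w (complete the square in the exponent)
post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / alpha)
post_mean = post_cov @ X.T @ y / sigma2
print(post_mean)   # close to w_true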


Two Dimensional Gaussian

- Consider height, h/m, and weight, w/kg.
- Could sample height from a distribution:

  h \sim \mathcal{N}(1.7, 0.0225)

- And similarly weight:

  w \sim \mathcal{N}(75, 36)
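A minimal numpy sketch that draws joint samples under these two independent marginals, as in the scatter plots on the following slides:

import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Independent marginals: h ~ N(1.7, 0.0225), w ~ N(75, 36)
h = rng.normal(1.7, np.sqrt(0.0225), n)   # standard deviation 0.15 m
w = rng.normal(75.0, np.sqrt(36.0), n)    # standard deviation 6 kg

samples = np.column_stack([h, w])         # joint samples (h, w), independent by construction
print(samples.mean(axis=0), samples.std(axis=0))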

Height and Weight Models

Figure: Gaussian distributions for height, p(h) against h/m, and weight, p(w) against w/kg.

Sampling Two Dimensional Variables

Figure: samples of height and weight shown in the joint distribution (w/kg against h/m) alongside the marginal distributions p(h) and p(w), accumulated one sample at a time over a sequence of slides.


Independence Assumption

- This assumes height and weight are independent:

  p(h, w) = p(h)\, p(w)

- In reality they are dependent (body mass index = w / h^2).

Sampling Two Dimensional Variables

Figure: further joint samples of height and weight (w/kg against h/m) with the marginal distributions p(h) and p(w), again accumulated over a sequence of slides.

Independent Gaussians

p(w, h) = p(w)\, p(h)

p(w, h) = \frac{1}{\sqrt{2\pi\sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left( -\frac{1}{2} \left( \frac{(w - \mu_1)^2}{\sigma_1^2} + \frac{(h - \mu_2)^2}{\sigma_2^2} \right) \right)

p(w, h) = \frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \exp\left( -\frac{1}{2} \left( \begin{bmatrix} w \\ h \end{bmatrix} - \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \right)^\top \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}^{-1} \left( \begin{bmatrix} w \\ h \end{bmatrix} - \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \right) \right)

p(\mathbf{y}) = \frac{1}{2\pi |\mathbf{D}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^\top \mathbf{D}^{-1} (\mathbf{y} - \boldsymbol{\mu}) \right)

where \mathbf{y} = [w, h]^\top, \boldsymbol{\mu} = [\mu_1, \mu_2]^\top and \mathbf{D} = \mathrm{diag}(\sigma_1^2, \sigma_2^2).

Correlated Gaussian

Form a correlated density from the original by rotating the data space using a matrix \mathbf{R}:

p(\mathbf{y}) = \frac{1}{2\pi |\mathbf{D}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{R}^\top\mathbf{y} - \mathbf{R}^\top\boldsymbol{\mu})^\top \mathbf{D}^{-1} (\mathbf{R}^\top\mathbf{y} - \mathbf{R}^\top\boldsymbol{\mu}) \right)

p(\mathbf{y}) = \frac{1}{2\pi |\mathbf{D}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^\top \mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top (\mathbf{y} - \boldsymbol{\mu}) \right)

This gives an inverse covariance matrix \mathbf{C}^{-1} = \mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top, so

p(\mathbf{y}) = \frac{1}{2\pi |\mathbf{C}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^\top \mathbf{C}^{-1} (\mathbf{y} - \boldsymbol{\mu}) \right)

with covariance matrix \mathbf{C} = \mathbf{R}\mathbf{D}\mathbf{R}^\top.
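A minimal numpy sketch of this construction: start with a diagonal (independent) covariance, rotate, and check that the resulting covariance is C = R D R^T (the rotation angle and variances are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(4)

# Diagonal covariance (independent variables) and a rotation matrix R
D = np.diag([1.0, 0.2])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Correlated covariance by rotating the data space
C = R @ D @ R.T

# Sample correlated y by rotating independent samples
z = rng.multivariate_normal(mean=np.zeros(2), cov=D, size=5000)
y = z @ R.T
print(np.cov(y.T))   # close to C
print(C)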

Recall Univariate Gaussian Properties

1. The sum of Gaussian variables is also Gaussian:

   y_i \sim \mathcal{N}(\mu_i, \sigma_i^2) \quad\Rightarrow\quad \sum_{i=1}^{n} y_i \sim \mathcal{N}\left( \sum_{i=1}^{n} \mu_i, \sum_{i=1}^{n} \sigma_i^2 \right)

2. Scaling a Gaussian leads to a Gaussian:

   y \sim \mathcal{N}(\mu, \sigma^2) \quad\Rightarrow\quad wy \sim \mathcal{N}(w\mu, w^2\sigma^2)


Multivariate Consequence

- If  \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})
- and  \mathbf{y} = \mathbf{W}\mathbf{x}
- then  \mathbf{y} \sim \mathcal{N}(\mathbf{W}\boldsymbol{\mu}, \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^\top)
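A minimal numpy check of this property (random W, \mu and \Sigma, illustrative only):

import numpy as np

rng = np.random.default_rng(5)

mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)      # a valid (positive definite) covariance
W = rng.normal(size=(2, 3))

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ W.T                      # y = W x, applied to every sample

print(y.mean(axis=0), W @ mu)    # empirical mean vs. W mu
print(np.cov(y.T))               # close to W Sigma W^T
print(W @ Sigma @ W.T)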


Sampling a Function

Multi-variate Gaussians

- We will consider a Gaussian with a particular structure of covariance matrix.
- Generate a single sample from this 25 dimensional Gaussian distribution, \mathbf{f} = [f_1, f_2, \ldots, f_{25}].
- We will plot these points against their index.
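A minimal numpy sketch of exactly this: build a 25 x 25 covariance from the exponentiated quadratic kernel introduced later in this section, draw one sample, and plot it against its index (the lengthscale and jitter values are illustrative choices):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)

# 25 input locations and an exponentiated quadratic (RBF) covariance between them
x = np.linspace(-1.0, 1.0, 25)
lengthscale, alpha = 0.3, 1.0
K = alpha * np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * lengthscale ** 2))
K += 1e-8 * np.eye(25)                  # jitter for numerical stability

# One draw f ~ N(0, K), plotted against its index
f = rng.multivariate_normal(np.zeros(25), K)
plt.plot(np.arange(1, 26), f, "o-")
plt.xlabel("index i")
plt.ylabel("f_i")
plt.show()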

Gaussian Distribution Sample

Figure: a sample from a 25 dimensional Gaussian distribution. (a) A 25 dimensional correlated random variable (values f_i plotted against index i). (b) A colormap showing the correlations between dimensions (values from 0 to 1).

Restricting attention to the first two dimensions of the sample, the correlation between f_1 and f_2 is

\begin{bmatrix} 1 & 0.96587 \\ 0.96587 & 1 \end{bmatrix}

Prediction of f_2 from f_1

Figure: a single contour of the Gaussian density for (f_1, f_2), whose covariance is

\begin{bmatrix} 1 & 0.96587 \\ 0.96587 & 1 \end{bmatrix}

- The single contour of the Gaussian density represents the joint distribution, p(f_1, f_2).
- We observe that f_1 = -0.313.
- Conditional density: p(f_2 \mid f_1 = -0.313).


Prediction with Correlated Gaussians

- Prediction of f_2 from f_1 requires the conditional density.
- The conditional density is also Gaussian:

  p(f_2 \mid f_1) = \mathcal{N}\left( f_2 \,\Big|\, \frac{k_{1,2}}{k_{1,1}} f_1,\; k_{2,2} - \frac{k_{1,2}^2}{k_{1,1}} \right)

  where the covariance of the joint density is given by

  \mathbf{K} = \begin{bmatrix} k_{1,1} & k_{1,2} \\ k_{2,1} & k_{2,2} \end{bmatrix}
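A minimal numpy sketch of this conditional, using the 2 x 2 covariance and the observation f_1 = -0.313 from the surrounding slides:

import numpy as np

# Joint covariance of (f1, f2) and the observed value of f1
K = np.array([[1.0, 0.96587],
              [0.96587, 1.0]])
f1 = -0.313

# Gaussian conditional: p(f2 | f1) = N(k12/k11 * f1, k22 - k12^2 / k11)
cond_mean = K[0, 1] / K[0, 0] * f1
cond_var = K[1, 1] - K[0, 1] ** 2 / K[0, 0]
print(cond_mean, cond_var)   # approx -0.302 and 0.067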

Prediction of f_5 from f_1

Figure: a single contour of the Gaussian density for (f_1, f_5), whose covariance is

\begin{bmatrix} 1 & 0.57375 \\ 0.57375 & 1 \end{bmatrix}

- The single contour of the Gaussian density represents the joint distribution, p(f_1, f_5).
- We observe that f_1 = -0.313.
- Conditional density: p(f_5 \mid f_1 = -0.313).


Prediction with Correlated Gaussians

- Prediction of \mathbf{f}_* from \mathbf{f} requires the multivariate conditional density.
- The multivariate conditional density is also Gaussian:

  p(\mathbf{f}_* \mid \mathbf{f}) = \mathcal{N}(\mathbf{f}_* \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})

  \boldsymbol{\mu} = \mathbf{K}_{*,\mathbf{f}} \mathbf{K}_{\mathbf{f},\mathbf{f}}^{-1} \mathbf{f}

  \boldsymbol{\Sigma} = \mathbf{K}_{*,*} - \mathbf{K}_{*,\mathbf{f}} \mathbf{K}_{\mathbf{f},\mathbf{f}}^{-1} \mathbf{K}_{\mathbf{f},*}

- Here the covariance of the joint density is given by

  \mathbf{K} = \begin{bmatrix} \mathbf{K}_{\mathbf{f},\mathbf{f}} & \mathbf{K}_{\mathbf{f},*} \\ \mathbf{K}_{*,\mathbf{f}} & \mathbf{K}_{*,*} \end{bmatrix}
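Putting the pieces together, here is a minimal numpy sketch of noise-free GP prediction with these formulae, using the exponentiated quadratic kernel defined on the next slides (all specific values are illustrative):

import numpy as np

def rbf(X1, X2, alpha=1.0, lengthscale=1.0):
    """Exponentiated quadratic (RBF) kernel between two sets of 1-D inputs."""
    sq_dist = (X1[:, None] - X2[None, :]) ** 2
    return alpha * np.exp(-sq_dist / (2 * lengthscale ** 2))

# Training inputs/outputs and test inputs (noise-free, illustrative)
X = np.array([-3.0, 1.2, 1.4])
f = np.array([0.5, -0.3, -0.4])
X_star = np.linspace(-4, 4, 50)

K_ff = rbf(X, X) + 1e-8 * np.eye(len(X))   # jitter for a stable solve
K_sf = rbf(X_star, X)
K_ss = rbf(X_star, X_star)

# Conditional (predictive) mean and covariance
mu = K_sf @ np.linalg.solve(K_ff, f)
Sigma = K_ss - K_sf @ np.linalg.solve(K_ff, K_sf.T)
print(mu[:5], np.diag(Sigma)[:5])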

Covariance Functions

Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

k(\mathbf{x}, \mathbf{x}') = \alpha \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2\ell^2} \right)

- The covariance matrix is built using the inputs to the function, \mathbf{x}.
- For the example above it was based on Euclidean distance.
- The covariance function is also known as a kernel.

Covariance Functions

Where did this covariance matrix come from?

k(x_i, x_j) = \alpha \exp\left( -\frac{\|x_i - x_j\|^2}{2\ell^2} \right)

with x_1 = -3.0, x_2 = 1.20, and x_3 = 1.40, and hyperparameters \ell = 2.00 and \alpha = 1.00.

Building the matrix entry by entry:

k_{1,1} = 1.00 \times \exp\left( -\frac{(-3.0 - (-3.0))^2}{2 \times 2.00^2} \right) = 1.00

k_{2,1} = 1.00 \times \exp\left( -\frac{(1.20 - (-3.0))^2}{2 \times 2.00^2} \right) = 0.110

k_{2,2} = 1.00 \times \exp\left( -\frac{(1.20 - 1.20)^2}{2 \times 2.00^2} \right) = 1.00

k_{3,1} = 1.00 \times \exp\left( -\frac{(1.40 - (-3.0))^2}{2 \times 2.00^2} \right) = 0.0889

k_{3,2} = 1.00 \times \exp\left( -\frac{(1.40 - 1.20)^2}{2 \times 2.00^2} \right) = 0.995

k_{3,3} = 1.00 \times \exp\left( -\frac{(1.40 - 1.40)^2}{2 \times 2.00^2} \right) = 1.00

Since the kernel is symmetric (k_{i,j} = k_{j,i}), the full covariance matrix is

\mathbf{K} = \begin{bmatrix} 1.00 & 0.110 & 0.0889 \\ 0.110 & 1.00 & 0.995 \\ 0.0889 & 0.995 & 1.00 \end{bmatrix}

(1.40−1.40)2

2×2.002

)

Page 161: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear
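This construction is easy to check numerically. Below is a minimal numpy sketch (not part of the original slides); rbf_kernel is an illustrative helper name, not a toolkit function.

```python
import numpy as np

def rbf_kernel(X, Xp, alpha=1.0, lengthscale=2.0):
    """Exponentiated quadratic covariance k(x, x') = alpha * exp(-||x - x'||^2 / (2 l^2))."""
    sqdist = (X[:, None] - Xp[None, :]) ** 2   # pairwise squared distances for 1-D inputs
    return alpha * np.exp(-sqdist / (2.0 * lengthscale ** 2))

X = np.array([-3.0, 1.2, 1.4])
K = rbf_kernel(X, X, alpha=1.0, lengthscale=2.0)
print(np.round(K, 3))
# ~ [[1.    0.11  0.089]
#    [0.11  1.    0.995]
#    [0.089 0.995 1.   ]]
```

Re-running the same helper with different points or hyperparameters reproduces the remaining examples in this section.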

Covariance Functions
Where did this covariance matrix come from?

k(x_i, x_j) = α exp( −‖x_i − x_j‖² / (2ℓ²) )

Adding a fourth point: x1 = −3, x2 = 1.2, x3 = 1.4 and x4 = 2.0, with ℓ = 2.0 and α = 1.0.

The entries among x1, x2 and x3 are as in the previous example (shown here to two significant figures); the new point adds a fourth row and column:

k4,1 = 1.0 × exp(−(2.0 − −3)² / (2 × 2.0²)) = 0.044
k4,2 = 1.0 × exp(−(2.0 − 1.2)² / (2 × 2.0²)) = 0.92
k4,3 = 1.0 × exp(−(2.0 − 1.4)² / (2 × 2.0²)) = 0.96
k4,4 = 1.0 × exp(−(2.0 − 2.0)² / (2 × 2.0²)) = 1.0

K =  1.0    0.11   0.089  0.044
     0.11   1.0    1.0    0.92
     0.089  1.0    1.0    0.96
     0.044  0.92   0.96   1.0

Covariance Functions
Where did this covariance matrix come from?

k(x_i, x_j) = α exp( −‖x_i − x_j‖² / (2ℓ²) )

Changing the hyperparameters: the same inputs x1 = −3.0, x2 = 1.20 and x3 = 1.40, now with ℓ = 5.00 and α = 4.00.

k1,1 = 4.00 × exp(−(−3.0 − −3.0)² / (2 × 5.00²)) = 4.00
k2,1 = 4.00 × exp(−(1.20 − −3.0)² / (2 × 5.00²)) = 2.81
k2,2 = 4.00 × exp(−(1.20 − 1.20)² / (2 × 5.00²)) = 4.00
k3,1 = 4.00 × exp(−(1.40 − −3.0)² / (2 × 5.00²)) = 2.72
k3,2 = 4.00 × exp(−(1.40 − 1.20)² / (2 × 5.00²)) = 4.00
k3,3 = 4.00 × exp(−(1.40 − 1.40)² / (2 × 5.00²)) = 4.00

K =  4.00  2.81  2.72
     2.81  4.00  4.00
     2.72  4.00  4.00

Increasing α scales the whole matrix (the prior variance of the function values), while increasing ℓ pushes the off-diagonal entries towards the diagonal ones: nearby function values become more strongly correlated.

Outline

The Gaussian Density

Covariance from Basis Functions

Basis Function Representations


Basis Function Form

Radial basis functions commonly have the form

φ_k(x_i) = exp( −|x_i − µ_k|² / (2ℓ²) ).

- A basis function maps the data into a "feature space" in which a linear sum is a non-linear function.

Figure: A set of radial basis functions with width ℓ = 2 and location parameters µ = [−4, 0, 4]ᵀ, plotted as φ(x) against x from −8 to 8.

Basis Function Representations

- Represent a function by a linear sum over a basis,

  f(x_i,:; w) = Σ_{k=1}^{m} w_k φ_k(x_i,:),   (1)

- Here: m basis functions, φ_k(·) is the kth basis function, and

  w = [w_1, . . . , w_m]ᵀ.

- For the standard linear model: φ_k(x_i,:) = x_{i,k}.

Random Functions

Functions derived using

f(x) = Σ_{k=1}^{m} w_k φ_k(x),

where w is sampled from a Gaussian density,

w_k ∼ N(0, α).

Figure: Functions sampled using the basis set from figure 3 (the radial basis functions above), plotted as f(x) against x from −8 to 8. Each line is a separate sample, generated by a weighted sum of the basis set. The weights w are sampled from a Gaussian density with variance α = 1.
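The sampling procedure is easy to reproduce. Below is a small numpy sketch (not from the original slides); rbf_basis is a hypothetical helper, and the centres and width match the figure.

```python
import numpy as np

def rbf_basis(x, centres, lengthscale=2.0):
    """Radial basis functions phi_k(x) = exp(-(x - mu_k)^2 / (2 l^2))."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * lengthscale ** 2))

x = np.linspace(-8.0, 8.0, 200)
mu = np.array([-4.0, 0.0, 4.0])            # basis locations from the figure
Phi = rbf_basis(x, mu, lengthscale=2.0)     # shape (200, 3)

alpha = 1.0
rng = np.random.default_rng(0)
W = rng.normal(0.0, np.sqrt(alpha), size=(3, 5))  # five independent weight draws
F = Phi @ W                                 # each column of F is one random function f(x)
```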

Direct Construction of Covariance Matrix

Use matrix notation to write the function,

f(x_i; w) = Σ_{k=1}^{m} w_k φ_k(x_i),

which, computed at the training data, gives a vector

f = Φw.

- w ∼ N(0, αI).
- w and f are related only by an inner product.
- Φ ∈ ℝ^{n×m} is a design matrix (n data points, m basis functions).
- Φ is fixed and non-stochastic for a given training set.
- f is Gaussian distributed.

Expectations

- We have ⟨f⟩ = Φ⟨w⟩.
- The prior mean of w was zero, giving ⟨f⟩ = 0.
- The prior covariance of f is

  K = ⟨ffᵀ⟩ − ⟨f⟩⟨f⟩ᵀ

  ⟨ffᵀ⟩ = Φ⟨wwᵀ⟩Φᵀ,

  giving K = αΦΦᵀ.

We use ⟨·⟩ to denote expectations under prior distributions.

Covariance between Two Points

- The prior covariance between two points x_i and x_j is

  k(x_i, x_j) = α φ:(x_i)ᵀ φ:(x_j),

  or in sum notation

  k(x_i, x_j) = α Σ_{k=1}^{m} φ_k(x_i) φ_k(x_j).

- For the radial basis used this gives

  k(x_i, x_j) = α Σ_{k=1}^{m} exp( −( |x_i − µ_k|² + |x_j − µ_k|² ) / (2ℓ²) ).
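As a sanity check (a sketch, not from the slides), the covariance implied by the weight prior, K = αΦΦᵀ, can be compared with the sample covariance of many functions drawn as f = Φw:

```python
import numpy as np

# Rebuild the design matrix Phi from the basis-function sketch above.
x = np.linspace(-8.0, 8.0, 200)
mu = np.array([-4.0, 0.0, 4.0])
lengthscale, alpha = 2.0, 1.0
Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2.0 * lengthscale ** 2))

# Covariance implied by the prior w ~ N(0, alpha I): K = alpha * Phi Phi^T.
K = alpha * Phi @ Phi.T

# Empirical check: the sample covariance of many draws f = Phi w approaches K.
rng = np.random.default_rng(0)
W = rng.normal(0.0, np.sqrt(alpha), size=(Phi.shape[1], 50_000))
F = Phi @ W
print(np.max(np.abs(np.cov(F) - K)))  # small, and shrinks as the number of draws grows
```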

Constructing Covariance Functions

- The sum of two covariance functions is also a covariance function:

  k(x, x′) = k_1(x, x′) + k_2(x, x′)

- The product of two covariance functions is also a covariance function:

  k(x, x′) = k_1(x, x′) k_2(x, x′)

Multiply by Deterministic Function

- If f(x) is a Gaussian process and g(x) is a deterministic function, then h(x) = f(x)g(x) is a Gaussian process with covariance

  k_h(x, x′) = g(x) k_f(x, x′) g(x′),

  where k_h is the covariance for h(·) and k_f is the covariance for f(·).
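These closure rules are straightforward to apply in code. The sketch below (an illustration under assumed helper names, not toolkit code) combines an exponentiated quadratic kernel with a linear one and checks that the results remain valid covariances:

```python
import numpy as np

def expquad(X, Xp, alpha=1.0, lengthscale=1.0):
    sqdist = (X[:, None] - Xp[None, :]) ** 2
    return alpha * np.exp(-sqdist / (2.0 * lengthscale ** 2))

def linear(X, Xp, alpha=1.0):
    return alpha * X[:, None] * Xp[None, :]

X = np.linspace(-2.0, 2.0, 50)

K_sum  = expquad(X, X) + linear(X, X)            # sum of covariances
K_prod = expquad(X, X) * linear(X, X)            # product of covariances

g = np.sin(X)                                     # a deterministic function
K_h = g[:, None] * expquad(X, X) * g[None, :]     # k_h(x, x') = g(x) k_f(x, x') g(x')

# All three remain symmetric and positive semi-definite (up to numerical noise).
for K in (K_sum, K_prod, K_h):
    assert np.all(np.linalg.eigvalsh(K) > -1e-9)
```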

Covariance Functions

MLP Covariance Function

k(x, x′) = α asin( (w xᵀx′ + b) / ( √(w xᵀx + b + 1) · √(w x′ᵀx′ + b + 1) ) )

- Based on the infinite neural network model.
- Shown with w = 40, b = 4.

Linear Covariance Function

k(x, x′) = α xᵀx′

- Bayesian linear regression.
- Shown with α = 1.

Covariance Functions
Where did this covariance matrix come from?

Ornstein-Uhlenbeck (stationary Gauss-Markov) covariance function

k(x, x′) = α exp( −|x − x′| / (2ℓ²) )

- In one dimension, arises from a stochastic differential equation: Brownian motion in a parabolic tube.
- In higher dimensions, a Fourier filter of the form 1 / (π(1 + x²)).

Markov Process

k(t, t′) = α min(t, t′)

- The covariance matrix is built using the inputs to the function, t.

[Figure: samples drawn from the Markov-process covariance, plotted over t from 0 to 2.]

Covariance Functions
Where did this covariance matrix come from?

Matern 5/2 Covariance Function

k(x, x′) = α (1 + √5 r + (5/3) r²) exp(−√5 r),   where r = ‖x − x′‖₂ / ℓ

- Matern 5/2 is a twice differentiable covariance.
- The Matern family is constructed with Student-t filters in Fourier space.

RBF Basis Functions

k(x, x′) = α φ(x)ᵀ φ(x′),   φ_i(x) = exp( −‖x − µ_i‖₂² / ℓ² ),   µ = [−1, 0, 1]ᵀ

[Figure: samples drawn from the RBF-basis covariance, plotted over x from −3 to 3.]

Covariance Functions
Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

k(x, x′) = α exp( −‖x − x′‖₂² / (2ℓ²) )

- The covariance matrix is built using the inputs to the function, x.
- For the example above it was based on Euclidean distance.
- The covariance function is also known as a kernel.
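For concreteness, here are direct numpy implementations of a few of the covariances above (a sketch; the Ornstein-Uhlenbeck form follows the slide's parameterisation, and the helper names are illustrative):

```python
import numpy as np

def ou(X, Xp, alpha=1.0, lengthscale=1.0):
    """Ornstein-Uhlenbeck covariance, as written on the slide: alpha * exp(-|x - x'| / (2 l^2))."""
    return alpha * np.exp(-np.abs(X[:, None] - Xp[None, :]) / (2.0 * lengthscale ** 2))

def matern52(X, Xp, alpha=1.0, lengthscale=1.0):
    """Matern 5/2 covariance with r = |x - x'| / l."""
    r = np.abs(X[:, None] - Xp[None, :]) / lengthscale
    return alpha * (1.0 + np.sqrt(5.0) * r + (5.0 / 3.0) * r ** 2) * np.exp(-np.sqrt(5.0) * r)

def markov(T, Tp, alpha=1.0):
    """Markov-process (Brownian motion) covariance: alpha * min(t, t')."""
    return alpha * np.minimum(T[:, None], Tp[None, :])

# Draw a few sample functions from each prior on a grid (jitter for numerical stability).
x = np.linspace(0.01, 2.0, 100)
rng = np.random.default_rng(0)
for kern in (ou, matern52, markov):
    K = kern(x, x) + 1e-8 * np.eye(len(x))
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)  # shape (3, 100)
```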

Gaussian Process Interpolation

[Figure: f(x) plotted against x, built up over a sequence of slides as observations are added and the GP interpolates them.]

Figure: Real example: BACCO (see e.g. Oakley and O'Hagan, 2002). Interpolation through outputs from slow computer simulations (e.g. atmospheric carbon levels).

Gaussian Noise

- Gaussian noise model,

  p(y_i | f_i) = N(y_i | f_i, σ²),

  where σ² is the variance of the noise.

- Equivalent to a covariance function of the form

  k(x_i, x_j) = δ_{i,j} σ²,

  where δ_{i,j} is the Kronecker delta function.

- The additive nature of Gaussians means we can simply add this term to existing covariance matrices.
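Putting the noise model together with a covariance function gives the standard GP regression predictive equations (Rasmussen and Williams, 2006). The sketch below is illustrative (toy data, hypothetical helper names), not code from the tutorial:

```python
import numpy as np

def expquad(X, Xp, alpha=1.0, lengthscale=1.0):
    sqdist = (X[:, None] - Xp[None, :]) ** 2
    return alpha * np.exp(-sqdist / (2.0 * lengthscale ** 2))

def gp_posterior(X, y, Xstar, alpha=1.0, lengthscale=1.0, noise_var=0.1):
    """Posterior mean and variance of the latent function under GP regression with Gaussian noise
    (add noise_var to the variance for the predictive distribution of y)."""
    K = expquad(X, X, alpha, lengthscale) + noise_var * np.eye(len(X))  # k(xi, xj) + delta_ij sigma^2
    Ks = expquad(X, Xstar, alpha, lengthscale)
    Kss = expquad(Xstar, Xstar, alpha, lengthscale)
    L = np.linalg.cholesky(K)
    Kinv_y = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    v = np.linalg.solve(L, Ks)
    mean = Ks.T @ Kinv_y
    cov = Kss - v.T @ v
    return mean, np.diag(cov)

X = np.array([-1.5, -1.0, -0.5, 0.3, 1.2])
y = np.sin(3.0 * X)
Xstar = np.linspace(-2.0, 2.0, 100)
mean, var = gp_posterior(X, y, Xstar, alpha=1.0, lengthscale=0.5, noise_var=0.01)
```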

Gaussian Process Regression

[Figure: y(x) plotted against x, built up over a sequence of slides: noisy observations together with the GP posterior fit.]

Figure: Examples include WiFi localization, C14 calibration curve.

Learning Covariance Parameters
Can we determine covariance parameters from the data?

N(y | 0, K) = 1 / ( (2π)^{n/2} |K|^{1/2} ) · exp( −yᵀK⁻¹y / 2 )

Taking the log,

log N(y | 0, K) = −(1/2) log |K| − yᵀK⁻¹y / 2 − (n/2) log 2π,

and dropping the constant and negating gives the objective

E(θ) = (1/2) log |K| + yᵀK⁻¹y / 2.

The parameters are inside the covariance function (matrix):

k_{i,j} = k(x_i, x_j; θ)

Eigendecomposition of Covariance

A useful decomposition for understanding the objective function:

K = RΛ²Rᵀ,

where Λ is a diagonal matrix and RᵀR = I.

- The diagonal of Λ represents distance along the axes; R gives a rotation of these axes.
- This is a useful representation since |K| = |Λ²| = |Λ|².

Capacity control: log |K|

[Figure: the determinant visualised as the volume of an axis-aligned ellipse with radii λ_1, λ_2 (and λ_3 in three dimensions); applying the rotation R changes the orientation but not the volume.]

- For a diagonal Λ = diag(λ_1, λ_2), the determinant is |Λ| = λ_1 λ_2; in three dimensions, |Λ| = λ_1 λ_2 λ_3.
- Rotation leaves the determinant unchanged: |RΛ| = λ_1 λ_2.
- The log |K| term in E(θ) therefore grows with the eigenvalues of K, penalising covariances that spread probability mass over a large volume.

Data Fit: yᵀK⁻¹y / 2

[Figure: contours of the zero-mean Gaussian over (y_1, y_2), with principal axis lengths λ_1 and λ_2 shown for progressively different eigenvalues.]

The data-fit term measures how well the observed y sits inside this ellipse: it shrinks as the eigenvalues of K grow, so it trades off against the log |K| capacity term.

Learning Covariance Parameters
Can we determine length scales and noise levels from the data?

[Figure: left, GP fits y(x) against x for different length scales; right, the objective E(θ) as a function of the length scale ℓ on a log scale from 10⁻¹ to 10¹, built up over a sequence of slides.]

E(θ) = (1/2) log |K| + yᵀK⁻¹y / 2

Gene Expression Example

- Given expression levels in the form of a time series from Della Gatta et al. (2008).
- We want to detect whether a gene is expressed or not, so we fit a GP to each gene (Kalaitzis and Lawrence, 2011).

RESEARCH ARTICLE Open Access

A Simple Approach to Ranking DifferentiallyExpressed Gene Expression Time Courses throughGaussian Process RegressionAlfredo A Kalaitzis* and Neil D Lawrence*

Abstract

Background: The analysis of gene expression from time series underpins many biological studies. Two basic formsof analysis recur for data of this type: removing inactive (quiet) genes from the study and determining whichgenes are differentially expressed. Often these analysis stages are applied disregarding the fact that the data isdrawn from a time series. In this paper we propose a simple model for accounting for the underlying temporalnature of the data based on a Gaussian process.

Results: We review Gaussian process (GP) regression for estimating the continuous trajectories underlying in geneexpression time-series. We present a simple approach which can be used to filter quiet genes, or for the case oftime series in the form of expression ratios, quantify differential expression. We assess via ROC curves the rankingsproduced by our regression framework and compare them to a recently proposed hierarchical Bayesian model forthe analysis of gene expression time-series (BATS). We compare on both simulated and experimental data showingthat the proposed approach considerably outperforms the current state of the art.

Conclusions: Gaussian processes offer an attractive trade-off between efficiency and usability for the analysis ofmicroarray time series. The Gaussian process framework offers a natural way of handling biological replicates andmissing values and provides confidence intervals along the estimated curves of gene expression. Therefore, webelieve Gaussian processes should be a standard tool in the analysis of gene expression time series.

Kalaitzis and Lawrence, BMC Bioinformatics 2011, 12:180. http://www.biomedcentral.com/1471-2105/12/180

Page 287:

[Figure: contour plot of the Gaussian process likelihood over log10 length scale and log10 SNR.]

Contour plot of Gaussian process likelihood.

Page 288:

[Figure: likelihood contours with the corresponding GP fit y(x).]

Optima: length scale of 1.2221 and log10 SNR of 1.9654; log likelihood is -0.22317.

Page 289:

[Figure: likelihood contours with the corresponding GP fit y(x).]

Optima: length scale of 1.5162 and log10 SNR of 0.21306; log likelihood is -0.23604.

Page 290:

[Figure: likelihood contours with the corresponding GP fit y(x).]

Optima: length scale of 2.9886 and log10 SNR of -4.506; log likelihood is -2.1056.

Page 291: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Limitations of Gaussian Processes

I Inference is O(n³) due to the matrix inverse (in practice use Cholesky).

I Gaussian processes don't deal well with discontinuities (financial crises, phosphorylation, collisions, edges in images).

I Widely used exponentiated quadratic covariance (RBF) can be too smooth in practice (but there are many alternatives!!).

Page 292: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Gaussian Process Summer School

I Series of summer schools on GPs: http://ml.dcs.shef.ac.uk/gpss/

I Next edition 15th–17th September, followed by a workshop on 18th September.

I Limited to 50 students, combination of lectures and practical sessions.

I Facebook page: https://www.facebook.com/gaussianprocesssummerschool

I Videos from earlier editions here: https://www.youtube.com/user/ProfNeilLawrence

Page 293: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

References I

G. Della Gatta, M. Bansal, A. Ambesi-Impiombato, D. Antonini, C. Missero, and D. di Bernardo. Direct targets of the trp63 transcription factor revealed by a combination of gene expression profiling and reverse engineering. Genome Research, 18(6):939–948, Jun 2008.

A. A. Kalaitzis and N. D. Lawrence. A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression. BMC Bioinformatics, 12(180), 2011.

P. S. Laplace. Essai philosophique sur les probabilités. Courcier, Paris, 2nd edition, 1814. Sixth edition of 1840 translated and reprinted (1951) as A Philosophical Essay on Probabilities, New York: Dover; fifth edition of 1825 reprinted 1986 with notes by Bernard Bru, Paris: Christian Bourgois Éditeur; translated by Andrew Dale (1995) as Philosophical Essay on Probabilities, New York: Springer-Verlag.

J. Oakley and A. O'Hagan. Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika, 89(4):769–784, 2002.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

Page 294: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Outline

Introduction

GP fundamentals

NLP Applications

Sparse GPs: Characterising user impact

Multi-task learning with GPs: Machine Translation evaluation

Model selection and Kernels: Identifying temporal patterns in word frequencies

Advanced Topics

Page 295: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Outline

Introduction

GP fundamentals

NLP Applications

Sparse GPs: Characterising user impact

Multi-task learning with GPs: Machine Translation evaluation

Model selection and Kernels: Identifying temporal patterns in word frequencies

Advanced Topics

Page 296: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Case study: User impact on Twitter

Predicting and characterising user impact on Twitter

I define a user-level impact score
I use the user's text and profile information as features to predict the score
I analyse the features which best predict the score
I provide users with 'guidelines' for improving their score

Instance of a text prediction problem

I emphasis on feature analysis and interpretability (specific to social science applications)

I non-linear variation

See our paper Lampos et al. (2014), EACL.

Page 297: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Sparse GPs

Exact inference in a GP

I Memory: O(n²)
I Time: O(n³)

where n is the number of training points.

Sparse GP approximation

I Memory: O(n · m)
I Time: O(n · m²)

where m is selected at runtime, m ≪ n.

Sparse approximations are usually needed when n > 1000.

Page 298: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Sparse GPs

Many options for sparse approximations

I Based on Inducing Variables
    I Subset of Data (SoD)
    I Subset of Regressors (SoR)
    I Deterministic Training Conditional (DTC)
    I Partially Independent Training Conditional Approximations (PITC)
    I Fully Independent Training Conditional Approximations (FITC)
I Fast Matrix Vector Multiplication (MVM)
I Variational Methods

See Quinonero Candela and Rasmussen (2005) for an overview.

Page 299: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Sparse GPs

Sparse approximations

I f(x_i) are treated as latent variables
I a subset are treated exactly (|u| = m)
I the others are given a computationally cheaper treatment

I following Quinonero Candela and Rasmussen (2005), we view sparse GP approximation as 'exact inference with an approximate prior'

I modify the joint prior p(f(x), f(x∗)) to reduce the O(n³) complexity

I different methods based on the effective prior used
I the GP model is concerned only with the conditional of the outputs given the inputs

Page 300: Inducing points

Inducing points

[Figure: graphical model over latent function values f1, f2, f3, f4, f∗ and observation y1.]

Page 301: Inducing points

Inducing points

[Figure: the same graphical model with inducing variables u1 and u2 added.]

Page 302: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Inducing points

The inducing points u = (u1, ..., um) 'induce' the dependencies between train and test points. All computations are based on cross-covariances between training, test and inducing points only.

I assume that f and f∗ are conditionally independent given the inducing points u:

p(f(x∗), f(x)) ≈ q(f(x∗), f(x)) = ∫ q(f(x∗)|u) p(f(x)|u) p(u) du

Page 303: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Inducing points

I choosing the inducing points: usually equispaced, not necessarily in the training set
I a random subset of the training points can be used
I note the predictive variance will be overestimated outside the support of the inducing points
I [figure from the GPML documentation]

Page 304: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Fully Independent Training Conditional (FITC)

Exact inference in a GP:

p(f(x∗)|y) = N( Kx∗,x (Kx,x + σ²I)⁻¹ y,  Kx∗,x∗ − Kx∗,x (Kx,x + σ²I)⁻¹ Kx,x∗ )

where p(f(x), f(x∗)) = N( 0, [ Kx,x  Kx,x∗ ; Kx∗,x  Kx∗,x∗ ] )

FITC predictive distribution:

qFITC(f(x∗)|y) = N( Qx∗,x (Qx,x + diag(Kx,x − Qx,x) + σ²I)⁻¹ y,
                    Kx∗,x∗ − Qx∗,x (Qx,x + diag(Kx,x − Qx,x) + σ²I)⁻¹ Qx,x∗ )

based on a low-rank plus diagonal approximation:

qFITC(f(x), f(x∗)) = N( 0, [ Qx,x − diag(Qx,x − Kx,x)  Qx,x∗ ; Qx∗,x  Kx∗,x∗ ] )

where Qi,j ≜ Ki,u K⁻¹u,u Ku,j
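As an illustration of the FITC equations, here is a small numpy sketch (toy data; names chosen for readability, not the tutorial's own code). For clarity it solves the n × n system directly; a practical implementation would apply the matrix inversion lemma so the cost is O(nm²) rather than O(n³).

    import numpy as np

    def rbf(A, B, ell=1.0, sf2=1.0):
        # exponentiated quadratic kernel between row-vector inputs A and B
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return sf2 * np.exp(-0.5 * d2 / ell**2)

    def fitc_predict(X, y, Xs, Xu, ell=1.0, sf2=1.0, noise=0.1):
        # Q_ab = K_au K_uu^{-1} K_ub, the low-rank part of the approximation
        Kuu = rbf(Xu, Xu, ell, sf2) + 1e-6 * np.eye(len(Xu))   # jitter for stability
        Kux, Kus = rbf(Xu, X, ell, sf2), rbf(Xu, Xs, ell, sf2)
        Kuu_inv = np.linalg.inv(Kuu)
        Qxx = Kux.T @ Kuu_inv @ Kux
        Qsx = Kus.T @ Kuu_inv @ Kux
        Kss = rbf(Xs, Xs, ell, sf2)
        # A = Q_xx + diag(K_xx - Q_xx) + sigma^2 I   (diag of K_xx is sf2 for the RBF)
        A = Qxx + np.diag(sf2 - np.diag(Qxx) + noise**2)
        mean = Qsx @ np.linalg.solve(A, y)
        cov = Kss - Qsx @ np.linalg.solve(A, Qsx.T)
        return mean, np.diag(cov)

    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, (500, 1))                  # n = 500 training points
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
    Xu = np.linspace(-3, 3, 10)[:, None]              # m = 10 equispaced inducing points
    Xs = np.linspace(-3, 3, 50)[:, None]              # test inputs
    mu, var = fitc_predict(X, y, Xs, Xu)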

Page 305: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Predicting and characterising user impact

500 million Tweets a day in Twitter

I important and some not so important information
    I breaking news from media
    I friends
    I celebrity self promotion
    I marketing
    I spam

Can we automatically predict the impact of a user?

Can we automatically identify factors which influence user impact?

Page 306: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Defining user impact

Define impact as a function of network connections

I no. of followers
I no. of followees
I no. of times the account is listed by others

Impact = ln( listings · followers² / followees )

Dataset
I 38,000 UK users
I all tweets from one year
I 48 million deduplicated messages

Page 307: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

User controlled features

Only features under the user’s control (e.g. not no. of retweets)

I User features (18)
    extracted from the account profile
    aggregated text features
I Text features (100)
    user's topic distribution
    topics computed using spectral clustering on the word co-occurrence (NPMI) matrix

Page 308: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Models

Regression task

I Gaussian Process regression model
I n = 38000 · 9/10, use Sparse GPs with FITC
I Squared Exponential kernel (k-dimensional):

k(x_p, x_q) = σ_f² exp( −(1/2) (x_p − x_q)ᵀ D (x_p − x_q) ) + σ_n² δ_pq

where D ∈ R^(k×k) is a symmetric matrix.

I if D_ARD = diag(l)⁻²:

k(x_p, x_q) = σ_f² exp( −(1/2) Σ_{d=1..k} (x_pd − x_qd)² / l_d² ) + σ_n² δ_pq

Page 309: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Automatic Relevance Determination (ARD)

I k_ARD(x_p, x_q) = σ_f² exp( −(1/2) Σ_{d=1..k} (x_pd − x_qd)² / l_d² ) + σ_n² δ_pq

I SE kernel with automatic relevance determination (ARD), with the vector l denoting the characteristic length-scales of each feature

I l_d measures the distance along x_d over which outputs become uncorrelated

I 1/l_d² is proportional to how relevant a feature is: a large length-scale means the covariance becomes independent of that feature's value

I sorting by length-scales indicates which features impact the prediction the most

I tuning these parameters is done via Bayesian model selection
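A sketch of how this looks with the GPy toolkit linked at the start of the tutorial (the data here is a random stand-in for the Twitter features, and attribute names may differ slightly across GPy versions):

    import numpy as np
    import GPy  # https://github.com/SheffieldML/GPy

    n, k = 500, 20
    X = np.random.randn(n, k)          # stand-in for the user/text feature matrix
    y = np.random.randn(n, 1)          # stand-in for the impact scores

    kern = GPy.kern.RBF(input_dim=k, ARD=True)   # one length-scale per feature
    model = GPy.models.SparseGPRegression(X, y, kernel=kern, num_inducing=50)
    model.optimize()                              # type II maximum likelihood

    # small length-scale -> relevant feature; rank features by ascending length-scale
    lengthscales = np.asarray(kern.lengthscale).ravel()
    print("most relevant feature indices:", np.argsort(lengthscales)[:5])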

Page 310: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Prediction results

Experiments
I 10-fold cross validation
I using predictive mean
I baseline model is ridge regression (LIN)
I Profile features
I Text features

[Figure: Pearson correlation (0.6–0.8) of the LIN and GP models.]

Conclusions

I GPs substantially better than ridge regression
I non-linear GPs with only profile features perform better than linear methods with all features
I GPs outperform SVR
I adding topic features improves all models

Page 311: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Selected features

Feature                                                        Importance
Using default profile image                                    0.73
Total number of tweets (entire history)                        1.32
Number of unique @-mentions in tweets                          2.31
Number of tweets (in dataset)                                  3.47
Links ratio in tweets                                          3.57
T1 (Weather): mph, humidity, barometer, gust, winds            3.73
T2 (Healthcare, Housing): nursing, nurse, rn, registered,
    bedroom, clinical, #news, estate, #hospital                5.44
T3 (Politics): senate, republican, gop, police, arrested,
    voters, robbery, democrats, presidential, elections        6.07
Proportion of days with non-zero tweets                        6.96
Proportion of tweets with @-replies                            7.10

Page 312: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Feature analysis

[Figure: impact histograms for users with low (L) vs high (H) values of selected features: tweets in entire history (α11), unique @-mentions (α7), links (α9), @-replies (α8), and days with non-zero tweets (α18).]

Impact histogram for users with high (H) values of this feature as opposed to low (L). Red line is the mean impact score.

Page 313: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Feature analysis

[Figure: impact histograms for users with high usage of individual topics (τ1–τ10), including a celebrity topic (damon, potter, #tvd, harry, elena, kate, portman, pattinson, hermione, jennifer) and a politics topic (senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections).]

Impact histogram for users with high (H) values of this feature. Red line is the mean impact score.

Page 314: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Conclusions

User impact is highly predictable

I user behaviour very informative
I 'tips' for improving your impact

GP framework suitable

I non-linear modelling
I ARD feature selection
I sparse GPs allow large scale experiments
I empirical improvements over linear models & SVR

Page 315: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Outline

Introduction

GP fundamentals

NLP Applications

Sparse GPs: Characterising user impact

Multi-task learning with GPs: Machine Translation evaluation

Model selection and Kernels: Identifying temporal patterns in word frequencies

Advanced Topics

Page 316: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Case study: MT Evaluation

Evaluation of Machine Translation

I human assessment of translation quality
I many 'good' translations, no gold standard
I judgements highly subjective, biased, noisy

Instance of general NLP annotation problem

I multiply annotated data, mixing experts and novices
I slippery task definition, low agreement

See our paper Cohn and Specia (2013), ACL.

Page 317: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Multi-task learning

Multi-task learning

I form of transfer learning
I several related tasks sharing the same input data representation
I learn the types, extent of correlations

Compared to domain adaptation

I tasks need not be identical (even regression vs classification)
I no explicit 'target' domain
I several sources of variation besides domain
I no assumptions of data asymmetry

Page 318: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Multi-task learning for MT Evaluation

Modelling individual annotators

I each brings own biases
I but correlated decisions with others' annotations
I could even find clusters of common solutions

Here use multi-output GP regression

I joint inference over several translators
I learn degree of inter-task transfer
I learn per-translator noise
I incorporate task meta-data

Page 319: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Previous work on modelling MT Quality

Typically simplified into a single-task modelling problem

I learn one model from one "good" annotator
I average several annotators
I ignore variation and simply pool data

Here framed as Transfer Learning

I each individual is a separate "task"
I joint modelling of individuals and the group

Page 320: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Review: GP Regression

[Figure: graphical model with kernel hyperparameters θ, latent function f, inputs x, observations y and noise σ, with a plate over the N data points.]

f ∼ GP(0, θ)

y_i ∼ N( f(x_i), σ² )

Page 321: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Multi-task GP Regression

[Figure: the same graphical model extended with a coregionalisation matrix B and a plate over the M tasks.]

f ∼ GP(0, (B, θ))

y_im ∼ N( f_m(x_i), σ_m² )

See Alvarez et al. (2011).

Page 322: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Geostatistical origins: ‘co-kriging’

Kriging is the application of GPs in geostatistics, to predict e.g., locations of minerals from several soil samples. Co-kriging jointly models several different outputs expected to have significant correlation, e.g., lead and nickel deposits.

See Alvarez, Rosasco and Lawrence, 2012.

Page 323: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Multi-task Covariance Kernels

Represent data as (x, t, y) tuples, where t is a task identifier. Define a separable covariance kernel,

K(x, x′)_{t,t′} = B_{t,t′} k_θ(x, x′) + noise

I effectively each input is augmented with t, indexing the task of interest
I the coregionalisation matrix, B ∈ R^(M×M), weights inter-task covariance
I the data kernel k_θ takes data points x as input, e.g., exponentiated quadratic

Page 324: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Coregionalisation Kernels

Generally B can be any symmetric positive semi-definite matrix. Some interesting choices

I B = I encodes independent learning
I B = 1 encodes pooled learning
I interpolating the above
I full rank B = WWᵀ, or low rank variants

Known as the intrinsic coregionalisation model (ICM).

See Alvarez et al. (2011); Bonilla et al. (2008)

Page 325: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Stacking and Kronecker products

Response variables are a matrix

Y ∈ R^(N×M)

Represent data in 'stacked' form

X = [ x_1; x_2; …; x_N; … ; x_1; x_2; …; x_N ]   (the inputs repeated once per task)

y = [ y_11; y_21; …; y_N1; … ; y_1M; y_2M; …; y_NM ]

Kernel is a Kronecker product: K(X, X) = B ⊗ k_data(X_o, X_o)
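A minimal numpy sketch of this construction (toy sizes, illustrative only): the full multi-task covariance is the coregionalisation matrix B Kronecker-multiplied with the data kernel over the shared inputs, matching the task-major stacking above.

    import numpy as np

    def rbf(X, ell=1.0):
        d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        return np.exp(-0.5 * d2 / ell**2)

    N, M = 4, 2                       # N shared inputs, M tasks (e.g. annotators)
    Xo = np.random.randn(N, 1)
    K_data = rbf(Xo)                  # N x N data kernel

    W = np.random.randn(M, 1)
    B = W @ W.T + 0.1 * np.eye(M)     # M x M coregionalisation matrix (PSD by construction)

    K = np.kron(B, K_data)            # covariance over the task-major stacked outputs
    print(K.shape)                    # (N*M, N*M) -> (8, 8)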

Page 326: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Kronecker product

[ a b ; c d ] ⊗ K = [ aK bK ; cK dK ]

Page 327: Kronecker product

Kronecker product

[Figure: the Kronecker product illustrated pictorially.]

Page 328: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Choices for B: Independent learning


B = I

Page 329: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Choices for B: Pooled learning


B = 1

Page 330: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Choices for B: Interpolating independent and pooled learning

B = 1 + αI

Page 331: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Choices for B: Modulating independent and pooled learning II

B = 1 + diag(α)

Page 332: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Compared to Daume III (2007)

Feature augmentation approach to multi-task learning. Uses horizontal data stacking:

X = [ X(1)  X(1)  0 ;  X(2)  0  X(2) ]        y = [ y(1) ; y(2) ]

where (X(i), y(i)) are the training data for task i. This expands the feature space by a factor of M.

Equivalent to a multitask kernel

k(x, x′)_{t,t′} = (1 + δ(t, t′)) xᵀx′

K(X, X) = (1 + I) ⊗ k_linear(X, X)

⇒ A specific choice of B with a linear data kernel

Page 333: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Compared to Evgeniou et al. (2006)

In the regularisation setting, Evgeniou et al. (2006) show that the kernel

K(x, x′)_{t,t′} = (1 − λ + λM δ(t, t′)) xᵀx′

is equivalent to a linear model with regularisation term

J(Θ) = (1/M) Σ_t [ ||θ_t||² + ((1 − λ)/λ) ||θ_t − (1/M) Σ_{t′} θ_{t′}||² ]

This regularises each task's parameters θ_t towards the mean parameters over all tasks, (1/M) Σ_{t′} θ_{t′}.

A form of the interpolation method from before.

Page 334: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Linear model of coregionalisation

Consider a mixture of several components,

K(x, x′)_{t,t′} = Σ_{q=1}^{Q} B^(q)_{t,t′} k_{θ_q}(x, x′)

Includes per-component

I data kernel, parameterised by θ_q
I coregionalisation matrix, B^(q)

More flexible than ICM, which corresponds to Q = 1. Can capture multi-output correlations, e.g., as different length scales.

See Alvarez et al. (2011)

Page 335: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

ICM samples: Intrinsic Coregionalization Model

K(X, X) = wwᵀ ⊗ k(X, X)

w = [1; 5]        B = wwᵀ = [ 1 5 ; 5 25 ]

[Figure: sample functions drawn from this ICM prior; repeated on pages 336–338 with different draws.]

Page 339: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

ICM samples: Intrinsic Coregionalization Model

K(X, X) = B ⊗ k(X, X)

B = [ 1 0.5 ; 0.5 1.5 ]

[Figure: sample functions drawn from this ICM prior; repeated on pages 340–343 with different draws.]

Page 344: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Application to MT quality estimation

Case study of MT Quality Estimation: manual assessment of translation quality given source and translation texts

Human judgements are highly subjective, biased, noisy

I typing speedI experience levelsI expectations from MT

‘Quality’ can be measured many ways

I subjective scoring (1-5) for fluency, adequacy, perceived effort to correct
I post-editing effort: HTER or time taken
I binary judgements, ranking, ...

See e.g., Specia et al. (2009)

Page 345: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Experimental setup

Quality estimation data

I 2k examples of source sentence and MT output
I measuring subjective post-editing (1-5): WMT12
I post-editing time per word, in log seconds: WPTP12
I 17 dense features extracted using the Quest toolkit (Specia et al., 2013)
I using official train/test split, or random assignment

Gaussian Process models

I exponentiated quadratic data kernel (RBF)
I hyper-parameter values trained using type II MLE
I consider simple interpolation coregionalisation kernels
I include per-task noise or global tied noise

Page 346: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Results: WMT12 RMSE for 1-5 ratings

[Figure: RMSE (0.70–0.82) against the number of training examples (50–500) for the STL, MTL and Pooled models.]

Page 347: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Incorporating layers of task metadata

[Figure: nested layers of task metadata: Annotator, System, Source senTence.]

Page 348: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Results: WPTP12 RMSE post-editing time

[Figure: bar chart of RMSE for post-editing time. Models: constant Ann, Ind SVM Ann, Pooled GP, MTL GP Ann, MTL GP Sys, MTL GP senT, MTL GP A+S, MTL GP A+S+T; values range from 0.585 (best) to 0.748.]

Page 349: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Extensions and wider applications

Application to data annotation

I crowd-sourcing: dealing with noise and biases (Rogers et al., 2010; Groot et al., 2011)

I intelligent 'active' data acquisition

Joint learning of correlated phenomena in NLP

I domain adaptation
I same data annotated for many things (PTB etc)
I multi-lingual applications and language universals

Applicable with other likelihoods, e.g.,

I classification
I ordinal regression (ranking)
I structured prediction

Page 350: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Outline

Introduction

GP fundamentals

NLP Applications

Sparse GPs: Characterising user impact

Multi-task learning with GPs: Machine Translation evaluation

Model selection and Kernels: Identifying temporal patterns in word frequencies

Advanced Topics

Page 351: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Case study: Temporal patterns of words

Categorising temporal patterns of hashtags in Twitter

I collect hashtag normalised frequency time series for months

I use models learnt on past frequencies to forecast future frequencies

I identify and group similar temporal patterns
I emphasise periodicities in word frequencies

Instance of a forecasting problem

I emphasis on forecasting (extrapolation)
I different effects modelled by specific kernels

See our paper Preotiuc-Pietro and Cohn (2013), EMNLP.

Page 352: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Model selection

Although parameter free, we still need to specify for a GP:

I the kernel parameters, a.k.a. hyper-parameters θ
I the kernel definition, H_i ∈ H

Training a GP = selecting the kernel and its parameters

Can use only training data (and no validation)

Page 353: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Bayesian model selection

Marginal likelihood or Bayesian evidence:

p(y|x, θ, H_i) = ∫ p(y|X, f, H_i) p(f|θ, H_i) df

The posterior over the hyperparameters is hard to compute due to the integral in the denominator:

p(θ|y, x, H_i) = p(y|x, θ, H_i) p(θ|H_i) / ∫ p(y|x, θ, H_i) p(θ|H_i) dθ

We approximate it by maximising over the Bayesian evidence (type II maximum likelihood - ML-II)

Page 354: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Bayesian model selection

For GP regression, the negative log of the evidence (a.k.a. NLML) can be computed analytically:

− log p(y|x, θ) = (1/2) yᵀ K_y⁻¹ y + (1/2) log |K_y| + (n/2) log 2π

where K_y = K_f + σ_n² I and K_f is the covariance matrix for the latent function f
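This NLML is what drives kernel selection in the case study below. A minimal numpy sketch (toy data, not from the paper; the periodic kernel here uses the standard sin(π τ/p) parameterisation, a slightly different convention from the one on the later slides):

    import numpy as np

    def nlml(y, Ky):
        # 1/2 y^T Ky^{-1} y + 1/2 log|Ky| + n/2 log(2 pi)
        L = np.linalg.cholesky(Ky)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

    x = np.arange(100.0)
    y = np.sin(2 * np.pi * x / 24) + 0.1 * np.random.randn(100)   # toy daily cycle

    tau = x[:, None] - x[None, :]
    K_se = np.exp(-0.5 * tau**2 / 5.0**2) + 0.01 * np.eye(100)
    K_per = np.exp(-2 * np.sin(np.pi * tau / 24)**2 / 0.5**2) + 0.01 * np.eye(100)

    print("NLML, SE kernel:      ", nlml(y, K_se))
    print("NLML, periodic kernel:", nlml(y, K_per))   # lower NLML -> preferred model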

Page 355: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Bayesian model selection

The posterior for a model given the data is:

p(H_i|y, x) = p(y|x, H_i) p(H_i) / p(y|x)

Assuming the prior over models is flat:

p(H_i|y, x) ∝ p(y|x, H_i) = ∫ p(y|x, θ, H_i) p(θ|H_i) dθ

Page 356: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Bayesian model selection

Occam’s razor: ‘the simplest solution is to be preferred over amore complex one’

The evidence must normalise:

I automatic trade-off between data fit and model complexity
I complex models are penalised because they can describe many datasets
I simple models can describe only a few datasets, thus the chance of a good data fit is low
I can be thought of as the probability that a random draw of a function from the model can generate the training set

Page 357: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Identifying temporal patterns in word frequencies

Word/hashtag frequencies in Twitter

I very time dependent
I many 'live' only for hours, reflecting timely events or memes
I some hashtags are constant over time
I some experience bursts at regular time intervals
I some follow human activity cycles

Can we automatically forecast future hashtag frequencies?

Can we automatically categorise temporal patterns?

Page 358: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Twitter hashtag temporal patterns

Regression task

I Extrapolation: forecast future frequencies
I using predictive mean

Dataset

I two months of Twitter Gardenhose (10%)
I first month for training, second month for testing
I 1176 hashtags occurring in both splits
I ∼ 6.5 million tweets
I 5456 tweets/hashtag

Page 359: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Kernels

The kernel

I induces the covariance in the response between pairs of data points

I encodes the prior belief on the type of function we aim to learn

I for extrapolation, kernel choice is paramount
I different kernels are suitable for each specific category of temporal patterns: isotropic, smooth, periodic, non-stationary, etc.

Page 360: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Kernels

[Figure: #goodmorning normalised frequency over time (3/1–17/1), with the Gold series and fits from the Const, Linear, SE, Per(168) and PS(168) kernels.]

#goodmorning

         Const    Linear   SE       Per      PS
NLML     -41      -34      -176     -180     -192
NRMSE    0.213    0.214    0.262    0.119    0.107

Lower is better

Use Bayesian model selection techniques to choose between kernels

Page 361: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Kernels: Constant

kC(x, x′) = c

I constant relationship between outputs
I predictive mean is the value c
I assumes the signal is modelled by Gaussian noise centred around the value c

[Figure: sample draws with c = 0.6 and c = 0.4.]

Page 362: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Kernels: Squared exponential

k_SE(x, x′) = s² · exp( −(x − x′)² / (2l²) )

I smooth transition between neighbouring points
I best describes time series with a smooth shape, e.g. a uni-modal burst with a steady decrease
I predictive variance increases exponentially with distance

[Figure: sample draws with l=1, s=1; l=10, s=1; l=10, s=0.5.]

Page 363: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Kernels: Linear

k_Lin(x, x′) = ( |x · x′| + 1 ) / s²

I non-stationary kernel: covariance depends on the data point values, not only on their difference |t − t′|
I equivalent to Bayesian linear regression with N(0, 1) priors on the regression weights and a prior of N(0, s²) on the bias

[Figure: sample draws with s=50 and s=70.]

Page 364: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Kernels: Periodic

k_PER(x, x′) = s² · exp( −2 sin²( 2π(x − x′)/p ) / l² )

I s and l are characteristic length-scales
I p is the period (distance between consecutive peaks)
I best describes periodic patterns that oscillate smoothly between high and low values

[Figure: sample draws with l=1, p=50 and l=0.75, p=25.]

Page 365: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Kernels: Periodic Spikes

k_PS(x, x′) = cos( sin( 2π(x − x′)/p ) ) · exp( s cos( 2π(x − x′)/p ) − s )

I p is the period
I s is a shape parameter controlling the width of the spike
I best describes time series with constant low values, followed by an abrupt periodic rise

[Figure: sample draws with s=1, s=5 and s=50.]
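The five covariance functions above can be written directly in numpy. The sketch below follows the slides' formulas for 1-D time indices (the grouping of the PS exponent is my reading of the formula, and the parameter defaults are illustrative):

    import numpy as np

    def k_const(x, xp, c=0.5):
        return c * np.ones((len(x), len(xp)))

    def k_se(x, xp, s=1.0, l=10.0):
        tau = x[:, None] - xp[None, :]
        return s**2 * np.exp(-tau**2 / (2 * l**2))

    def k_lin(x, xp, s=50.0):
        return (np.abs(x[:, None] * xp[None, :]) + 1) / s**2

    def k_per(x, xp, s=1.0, l=1.0, p=168.0):
        tau = x[:, None] - xp[None, :]
        return s**2 * np.exp(-2 * np.sin(2 * np.pi * tau / p)**2 / l**2)

    def k_ps(x, xp, s=5.0, p=168.0):
        tau = x[:, None] - xp[None, :]
        return np.cos(np.sin(2 * np.pi * tau / p)) * np.exp(s * np.cos(2 * np.pi * tau / p) - s)

    x = np.arange(336.0)          # e.g. two weeks of hourly counts, weekly period p = 168
    print(k_ps(x, x).shape)       # (336, 336)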

Page 366: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Results: Examples

[Figure: example hashtag time series with their best-fitting kernels: #fyi (Const), #fail (Per), #snow (SE), #raw (PS).]

Page 367: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Results: Categories

Const          SE                    PER         PS
#funny         #2011                 #brb        #ff
#lego          #backintheday         #coffee     #followfriday
#likeaboss     #confessionhour       #facebook   #goodnight
#money         #februarywish         #facepalm   #jobs
#nbd           #haiti                #fail       #news
#nf            #makeachange          #love       #nowplaying
#notetoself    #questionsidontlike   #rock       #tgif
#priorities    #savelibraries        #running    #twitterafterdark
#social        #snow                 #xbox       #twitteroff
#true          #snowday              #youtube    #ww
49             268                   493         366

Page 368: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Results: Forecasting

[Figure: box plots comparing forecasting performance of the Lag+, GP-Lin, GP-SE, GP-PER, GP-PS and GP+ models.]

Page 369: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Application: Text classification

Task

I assign the hashtag of a given tweet based on its text

Methods

I Most frequent (MF)
I Naive Bayes model with empirical prior (NB-E)
I Naive Bayes with GP forecast as prior (NB-P)

            MF        NB-E      NB-P
Match@1     7.28%     16.04%    17.39%
Match@5     19.90%    29.51%    31.91%
Match@50    44.92%    59.17%    60.85%
MRR         0.144     0.237     0.252

Higher is better

Page 370: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

References

Alvarez, M. A., Rosasco, L., and Lawrence, N. D. (2011). Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266.

Bonilla, E., Chai, K. M., and Williams, C. (2008). Multi-task Gaussian process prediction. NIPS.

Cohn, T. and Specia, L. (2013). Modelling annotator bias with multi-task Gaussian processes: An application to machine translation quality estimation. ACL.

Daume III, H. (2007). Frustratingly easy domain adaptation. ACL.

Evgeniou, T., Micchelli, C. A., and Pontil, M. (2006). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(1):615.

Groot, P., Birlutiu, A., and Heskes, T. (2011). Learning from multiple annotators with Gaussian processes. ICANN.

Lampos, V., Aletras, N., Preotiuc-Pietro, D., and Cohn, T. (2014). Predicting and characterising user impact on Twitter. EACL.

Preotiuc-Pietro, D. and Cohn, T. (2013). A temporal model of text periodicities using Gaussian Processes. EMNLP.

Quinonero Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.

Rogers, S., Girolami, M., and Polajnar, T. (2010). Semi-parametric analysis of multi-rater data. Statistics and Computing, 20(3):317–334.

Specia, L., Shah, K., De Souza, J. G., and Cohn, T. (2013). QuEst - a translation quality estimation framework.

Specia, L., Turchi, M., Cancedda, N., Dymetman, M., and Cristianini, N. (2009). Estimating the sentence-level quality of machine translation systems. EAMT.

Page 371: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Outline

Introduction

GP fundamentals

NLP Applications

Advanced Topics

Classification

Structured prediction

Structured kernels

Page 372: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Outline

Introduction

GP fundamentals

NLP Applications

Advanced Topics

Classification

Structured prediction

Structured kernels

Page 373: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Recap: Regression

Observations, yi, are a noisy version of latent process fi,

y_i = f_i(x_i) + ε_i,  with ε_i ∼ N(0, σ²)


Page 374: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Likelihood models

Analytic solution for Gaussian likelihood aka noise

Gaussian (process) prior × Gaussian likelihood

= Gaussian posterior

But what about other likelihoods?

I Counts y ∈ N
I Classification y ∈ {C1, C2, ..., Ck}
I Ordinal regression (ranking) C1 < C2 < ... < Ck

I . . .

Page 375: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Classification

Binary classification, y_i ∈ {0, 1}. Two popular choices for the likelihood:

I Logistic sigmoid: p(y_i = 1 | f_i) = σ(f_i) = 1 / (1 + exp(−f_i))
I Probit function: p(y_i = 1 | f_i) = Φ(f_i) = ∫_{−∞}^{f_i} N(z|0, 1) dz

"Squashing" input from (−∞, ∞) into the range [0, 1]

[Figure: the sigmoid and probit squashing functions.]
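Both squashing functions are one-liners; a small illustrative sketch, using scipy for the normal CDF:

    import numpy as np
    from scipy.stats import norm

    def logistic(f):
        # p(y=1|f) = 1 / (1 + exp(-f))
        return 1.0 / (1.0 + np.exp(-f))

    def probit(f):
        # p(y=1|f) = Phi(f), the standard normal CDF
        return norm.cdf(f)

    f = np.linspace(-3, 3, 7)
    print(np.round(logistic(f), 3))
    print(np.round(probit(f), 3))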

Page 376: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Squashing function

Pass the latent function through the logistic function to obtain a probability, π(x) = p(y_i = 1 | f_i)

[Figure 3.2 from Rasmussen and Williams (2006): panel (a) shows a sample latent function f(x) drawn from a Gaussian process; panel (b) shows the result of squashing this sample through the logistic function σ(z) = (1 + exp(−z))⁻¹ to obtain the class probability π(x) = σ(f(x)).]

Figure from Rasmussen and Williams (2006)

Page 377: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Inference Challenges: for test case x∗

Distribution over latent function

p(f∗ | X, y, x∗) = ∫ p(f∗ | X, x∗, f) p(f | X, y) df,    where p(f | X, y) is the posterior

Distribution over classification output

p(y∗ = 1 | X, y, x∗) = ∫ σ(f∗) p(f∗ | X, y, x∗) df∗

Problem: the likelihood is no longer conjugate with the prior, so there is no analytic solution.

Page 378: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Approximate inference

Several inference techniques have been proposed for non-conjugate likelihoods:

I Laplace approximation (Williams and Barber, 1998)
I Expectation propagation (Minka, 2001)
I Variational inference (Gibbs and MacKay, 2000)
I MCMC (Neal, 1999)

And more, including sparse approaches for large scale application.

Page 379: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Laplace approximation

Approximate a non-Gaussian posterior by a Gaussian, centred at the mode

[Figure: a Gamma density p(y) and its Laplace (Gaussian) approximation centred at the mode.]

Figure from Rogers and Girolami (2012)

Page 380: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Laplace approximation

Log posterior

Φ(f) = log p(y|f) + log p(f|X) + const

Find the posterior mode, f̂, i.e., MAP estimation, O(n³).

Then take a second-order Taylor series expansion about the mode, and fit a Gaussian

I with mean µ = f̂
I and covariance Σ = (K⁻¹ + W)⁻¹, where W = −∇∇ log p(y|f̂)

Allows computation of the posterior and marginal likelihood, but predictions may still be intractable.
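A compact sketch of the mode-finding step for binary GP classification with the logistic likelihood (Newton iterations; written for clarity rather than numerical stability, and not the tutorial's own code). Rasmussen and Williams (2006) describe a numerically stable variant based on W^{1/2}.

    import numpy as np

    def sigmoid(f):
        return 1.0 / (1.0 + np.exp(-f))

    def laplace_mode(K, y, iters=20):
        # y in {0,1}; returns the posterior mode and the Gaussian covariance
        n = len(y)
        f = np.zeros(n)
        for _ in range(iters):
            pi = sigmoid(f)
            W = np.diag(pi * (1 - pi))             # W = -grad grad log p(y|f)
            grad = y - pi                          # grad log p(y|f)
            A = np.linalg.inv(K) + W
            f = np.linalg.solve(A, W @ f + grad)   # Newton step
        Sigma = np.linalg.inv(A)                   # (K^{-1} + W)^{-1}
        return f, Sigma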

Page 381: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Expectation propagation

Take the intractable posterior:

p(f|X, y) = (1/Z) p(f|X) ∏_{i=1}^{n} p(y_i | f_i)

Z = ∫ p(f|X) ∏_{i=1}^{n} p(y_i | f_i) df

Approximation with a fully factorised distribution

q(f|X, y) = (1/Z_EP) p(f|X) ∏_{i=1}^{n} t(f_i)

Page 382: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Expectation propagation

Approximate posterior defined as

q(f|y) = (1/Z_EP) p(f) ∏_{i=1}^{n} t(f_i)

where each component is assumed to be Gaussian

I p(y_i | f_i) ≈ t_i(f_i) = Z_i N(f_i | µ_i, σ_i²)
I p(f|X) ∼ N(f | 0, K_nn)

Results in a Gaussian formulation for q(f|y)

I allows for tractable multiplication and division with Gaussians
I and marginalisation, expectations, etc.

Page 383: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Expectation propagation

The EP algorithm aims to fit t_i(f_i) to the posterior, starting with a guess for q, then iteratively refining as follows

I minimise the KL divergence between the true posterior for f_i and the approximation, t_i

min_{t_i} KL( p(y_i | f_i) q_{−i}(f_i) || t_i(f_i) q_{−i}(f_i) )

where q_{−i}(f_i) is the cavity distribution formed by marginalising q(f) over f_j, j ≠ i, then dividing by t_i(f_i).

I key idea: only need an accurate approximation for globally feasible f_i
I match moments to update t_i, then update q(f)

Page 384: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Expectation propagation: site approximation example

[Figure: site approximation example.]

Page 385: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Expectation propagation

No proof of convergence

I but empirically works well
I often more accurate than the Laplace approximation

Formulated for many different likelihoods

I complexity O(n³), dominated by matrix inversion
I sparse EP approximations can reduce this to O(nm²)

See Minka (2001) and Rasmussen and Williams (2006) for further details.

Page 386: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Multi-class classification

Consider multi-class classification, y ∈ {C1, C2, . . . , Ck}. Draw a vector of k latent function values for each input

f = ( f¹_1, . . . , f¹_n, f²_1, . . . , f²_n, . . . , f^k_1, . . . , f^k_n )

Formulate the classification probability using the soft-max

p(y_i = c | f_i) = exp(f^c_i) / Σ_{c′} exp(f^{c′}_i)

Page 387: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Multi-class classification

Assume the k latent processes are uncorrelated, leading to the prior covariance f ∼ N(0, K) where

K = [ K_1  0    ...  0
      0    K_2  ...  0
      ...  ...  ...  ...
      0    0    ...  K_k ]

is block diagonal, of size kn × kn, with each K_j of size n × n.

Various approximation methods for inference, e.g., Laplace (Williams and Barber, 1998), EP (Kim and Ghahramani, 2006), MCMC (Neal, 1999).
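A minimal sketch of the block-diagonal prior and the soft-max likelihood (illustrative kernels and sizes, not from the tutorial materials):

    import numpy as np
    from scipy.linalg import block_diag

    def rbf(X, ell):
        d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        return np.exp(-0.5 * d2 / ell**2)

    n, k = 50, 3
    X = np.random.randn(n, 2)
    K = block_diag(*[rbf(X, ell) for ell in (0.5, 1.0, 2.0)])   # K_1, ..., K_k on the diagonal
    print(K.shape)                                              # (k*n, k*n) -> (150, 150)

    def softmax(f_i):
        # p(y_i = c | f_i) = exp(f_i^c) / sum_c' exp(f_i^{c'})
        e = np.exp(f_i - np.max(f_i))
        return e / e.sum()

    print(softmax(np.array([1.0, 0.0, -1.0])))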

Page 388: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Outline

Introduction

GP fundamentals

NLP Applications

Advanced Topics

Classification

Structured prediction

Structured kernels

Page 389: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

GPs for Structured Prediction

I GPSC (Altun et al., 2004):
    I Defines a likelihood over label sequences, p(y|x), with a latent variable over full sequences y
    I HMM-inspired kernel, combining features from each observed symbol x_i and label pairs
    I MAP inference for hidden function values, f, and a sparsification trick for tractable inference

I GPstruct (Bratieres et al., 2013):
    I Base model is a CRF:

      p(y|x, f) = exp( Σ_c f(c, x_c, y_c) ) / Σ_{y′∈Y} exp( Σ_c f(c, x_c, y′_c) )

    I Assumes that each potential f(c, x_c, y_c) is drawn from a GP
    I Bayesian inference using MCMC (Murray et al., 2010)

Page 390: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Outline

Introduction

GP fundamentals

NLP Applications

Advanced Topics

Classification

Structured prediction

Structured kernels

Page 391: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

String Kernels

k(x, x′) = Σ_{s∈Σ*} w_s φ_s(x) φ_s(x′)

I φ_s(x): counts of substring s inside x;
I 0 ≤ w_s ≤ 1: weight of substring s;
I s can also be a subsequence (containing gaps);

I s = char sequences → n-gram kernels (Lodhi et al., 2002) (useful for stems);
    k(bar, bat) = 3 (b, a, ba)
I s = word sequences → Word Sequence kernels (Cancedda et al., 2003);
    k(gas only injection, gas assisted plastic injection) = 3
I Soft matching:
    k(battle, battles) ≠ 0
    k(battle, combat) ≠ 0
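A toy sketch of the contiguous-substring (n-gram) case with uniform weights w_s = 1; the full kernels of Lodhi et al. (2002) and Cancedda et al. (2003) additionally handle gapped subsequences with decay weights.

    from collections import Counter

    def ngram_counts(s, nmin=1, nmax=2):
        # phi_s(x): counts of all character n-grams of length nmin..nmax
        return Counter(s[i:i + n] for n in range(nmin, nmax + 1)
                                   for i in range(len(s) - n + 1))

    def ngram_kernel(x, xp, nmin=1, nmax=2):
        # k(x, x') = sum_s phi_s(x) * phi_s(x')
        cx, cxp = ngram_counts(x, nmin, nmax), ngram_counts(xp, nmin, nmax)
        return sum(cx[s] * cxp[s] for s in cx.keys() & cxp.keys())

    print(ngram_kernel("bar", "bat"))   # 3: shared 'b', 'a' and 'ba', as on the slide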

Page 392: Gaussian Processes for Natural Language …danielpr/files/gptut.pdfGaussian Processes State of the art forregression I exact posterior inference I supports very complex non-linear

Tree Kernels

I Subset Tree Kernels (Collins and Duffy, 2001)

[Figure: two example parse trees, (S (NP Mary) (VP loves John)) and (S (NP Mary) (VP had dinner)); the kernel counts the tree fragments they share, e.g. (NP Mary)]

I Partial Tree Kernels (Moschitti, 2006): allows “broken” rules, useful for dependency trees;

I Soft matching can also be applied.

More on GPs + structured kernels in Daniel Beck’s SRW presentation tomorrow
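As an illustration, a simplified sketch in the spirit of the Collins and Duffy (2001) subset tree kernel, on toy trees written as nested tuples; the decay value, the example trees and the lack of normalisation are all assumptions of the sketch:

# Sketch: simplified Subset Tree Kernel (Collins and Duffy, 2001 style).
# Trees are nested tuples: (label, child, ...); leaves are plain strings.

def nodes(tree):
    # all internal (non-leaf) nodes of the tree
    if isinstance(tree, str):
        return []
    result = [tree]
    for child in tree[1:]:
        result.extend(nodes(child))
    return result

def production(node):
    # the grammar rule applied at this node, e.g. ('S', ('NP', 'VP'))
    return (node[0], tuple(c if isinstance(c, str) else c[0] for c in node[1:]))

def delta(n1, n2, lam=0.5):
    # (decayed) count of common tree fragments rooted at n1 and n2
    if production(n1) != production(n2):
        return 0.0
    if all(isinstance(c, str) for c in n1[1:]):      # pre-terminal node
        return lam
    prod = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        prod *= 1.0 + delta(c1, c2, lam)
    return prod

def tree_kernel(t1, t2, lam=0.5):
    return sum(delta(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

t1 = ('S', ('NP', 'Mary'), ('VP', ('V', 'loves'), ('NP', 'John')))
t2 = ('S', ('NP', 'Mary'), ('VP', ('V', 'had'), ('NP', 'dinner')))
print(tree_kernel(t1, t2))   # > 0: the trees share fragments such as (NP Mary)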


References I

Altun, Y., Hofmann, T., and Smola, A. J. (2004). Gaussian Process Classification for Segmenting and Annotating Sequences. In Proceedings of ICML, page 8, New York, New York, USA. ACM Press.

Bratieres, S., Quadrianto, N., and Ghahramani, Z. (2013). Bayesian Structured Prediction using Gaussian Processes. arXiv:1307.3846, pages 1–17.

Cancedda, N., Gaussier, E., Goutte, C., and Renders, J.-M. (2003). Word-Sequence Kernels. The Journal of Machine Learning Research, 3:1059–1082.

Collins, M. and Duffy, N. (2001). Convolution Kernels for Natural Language. In Advances in Neural Information Processing Systems.

Gibbs, M. N. and MacKay, D. J. (2000). Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464.

Kim, H.-C. and Ghahramani, Z. (2006). Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1948–1959.


References II

Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. (2002). Text Classification using String Kernels. The Journal of Machine Learning Research, 2:419–444.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc.

Moschitti, A. (2006). Making Tree Kernels practical for Natural Language Learning. In EACL, pages 113–120.

Murray, I., Adams, R. P., and Mackay, D. (2010). Elliptical slice sampling. In International Conference on Artificial Intelligence and Statistics, pages 541–548.

Neal, R. (1999). Regression and classification using Gaussian process priors. Bayesian Statistics, 6.

Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, MA.

Rogers, S. and Girolami, M. (2012). A First Course in Machine Learning. Chapman & Hall/CRC.


References III

Williams, C. K. and Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.