Gaussian process covariance functions
Carl Edward Rasmussen
October 20th, 2016
Key concepts
• choose covariance functions and use the marginal likelihood to
  • set hyperparameters
  • choose between different covariance functions
• covariance functions and hyperparameters can help interpret the data
• we illustrate a number of different covariance function families
• stationary covariance functions: squared exponential, rational quadratic and Matérn forms
• many existing models are special cases of Gaussian processes
  • radial basis function networks (RBF)
  • splines
  • large neural networks
• combining existing simple covariance functions into more interesting ones
Model Selection, Hyperparameters, and ARD
We need to determine both the form and parameters of the covariance function. We typically use a hierarchical model, where the parameters of the covariance are called hyperparameters. A very useful idea is to use automatic relevance determination (ARD) covariance functions for feature/variable selection, e.g.:

k(x, x') = v_0^2 \exp\left( -\sum_{d=1}^{D} \frac{(x_d - x_d')^2}{2 v_d^2} \right), \qquad \text{hyperparameters } \theta = (v_0, v_1, \ldots, v_D, \sigma_n^2).
[Figure: three surface plots of functions drawn over inputs (x1, x2) for the ARD hyperparameter settings v1 = v2 = 1, v1 = v2 = 0.32, and v1 = 0.32, v2 = 1.]
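As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the ARD covariance above; the function name k_ard and the evaluation points are our own choices:

```python
import numpy as np

def k_ard(x, xp, v0=1.0, v=(1.0, 1.0)):
    """ARD squared exponential: k(x,x') = v0^2 exp(-sum_d (x_d - x'_d)^2 / (2 v_d^2)).
    A small v_d makes the function vary quickly along dimension d; a very large
    v_d effectively switches dimension d off (it becomes irrelevant)."""
    x, xp, v = np.asarray(x, float), np.asarray(xp, float), np.asarray(v, float)
    return v0**2 * np.exp(-np.sum((x - xp)**2 / (2.0 * v**2)))

# The three hyperparameter settings from the figure above:
for v in [(1.0, 1.0), (0.32, 0.32), (0.32, 1.0)]:
    print(v, k_ard([0.0, 0.0], [0.5, 0.5], v=v))
```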
Rational quadratic covariance function
The rational quadratic (RQ) covariance function, where r = x − x':

k_{RQ}(r) = \left( 1 + \frac{r^2}{2\alpha\ell^2} \right)^{-\alpha}

with α, ℓ > 0 can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic length-scales. Using τ = ℓ^{−2} and p(τ|α, β) ∝ τ^{α−1} exp(−ατ/β):

k_{RQ}(r) = \int p(\tau|\alpha,\beta)\, k_{SE}(r|\tau)\, d\tau \propto \int \tau^{\alpha-1} \exp\left(-\frac{\alpha\tau}{\beta}\right) \exp\left(-\frac{\tau r^2}{2}\right) d\tau \propto \left( 1 + \frac{r^2}{2\alpha\ell^2} \right)^{-\alpha}.
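The scale-mixture identity is easy to check numerically. In the sketch below (our own, not from the slides), note that p(τ|α, β) ∝ τ^{α−1} exp(−ατ/β) is a Gamma density with shape α and rate α/β, and that β = ℓ^{−2} recovers the closed-form RQ:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, ell = 2.0, 1.0
beta = 1.0 / ell**2        # each tau is an inverse squared length-scale

def k_rq(r, alpha=alpha, ell=ell):
    return (1.0 + r**2 / (2.0 * alpha * ell**2)) ** (-alpha)

# p(tau | alpha, beta) is Gamma with shape alpha and rate alpha/beta (scale beta/alpha).
tau = rng.gamma(shape=alpha, scale=beta / alpha, size=200_000)

for r in (0.5, 1.0, 2.0):
    mc = np.mean(np.exp(-tau * r**2 / 2.0))   # E_tau[ k_SE(r | tau) ]
    print(f"r={r}: SE mixture = {mc:.4f}   closed-form RQ = {k_rq(r):.4f}")
```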
Rational quadratic covariance function II
[Figure: left, the RQ covariance as a function of input distance for α = 1/2, α = 2, and α→∞; right, sample functions f(x) drawn from the corresponding GPs.]
The limit α→∞ of the RQ covariance function is the SE.
Matérn covariance functions
Stationary covariance functions can be based on the Matérn form:
k(x, x') = \frac{1}{\Gamma(\nu) 2^{\nu-1}} \left[ \frac{\sqrt{2\nu}}{\ell} |x - x'| \right]^{\nu} K_\nu\left( \frac{\sqrt{2\nu}}{\ell} |x - x'| \right),
where Kν is the modified Bessel function of the second kind of order ν, and ℓ is the characteristic length scale. Sample functions from Matérn forms are ⌈ν⌉ − 1 times differentiable; thus, the hyperparameter ν can control the degree of smoothness.

Special cases:
• k_{\nu=1/2}(r) = \exp(-r/\ell): Laplacian covariance function, Brownian motion (Ornstein-Uhlenbeck)
• k_{\nu=3/2}(r) = \left(1 + \frac{\sqrt{3}r}{\ell}\right) \exp\left(-\frac{\sqrt{3}r}{\ell}\right) (once differentiable)
• k_{\nu=5/2}(r) = \left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}r}{\ell}\right) (twice differentiable)
• k_{\nu\to\infty}(r) = \exp\left(-\frac{r^2}{2\ell^2}\right): smooth (infinitely differentiable)
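A minimal sketch of the special cases listed above (our own implementation; only these half-integer ν are covered, since the general Matérn form needs the Bessel function K_ν):

```python
import numpy as np

def k_matern(r, nu, ell=1.0):
    """Matern covariance for the half-integer cases from the slide (r >= 0)."""
    r = np.abs(np.asarray(r, float))
    if nu == 0.5:                                    # Laplacian / Ornstein-Uhlenbeck
        return np.exp(-r / ell)
    if nu == 1.5:                                    # once differentiable
        s = np.sqrt(3.0) * r / ell
        return (1.0 + s) * np.exp(-s)
    if nu == 2.5:                                    # twice differentiable
        s = np.sqrt(5.0) * r / ell
        return (1.0 + s + 5.0 * r**2 / (3.0 * ell**2)) * np.exp(-s)
    if nu == np.inf:                                 # squared exponential limit
        return np.exp(-r**2 / (2.0 * ell**2))
    raise ValueError("only nu in {1/2, 3/2, 5/2, inf} implemented here")

print([float(k_matern(1.0, nu)) for nu in (0.5, 1.5, 2.5, np.inf)])
```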
Matérn covariance functions II
Univariate Matérn covariance function with unit characteristic length scale andunit variance:
[Figure: left, the Matérn covariance as a function of input distance for ν = 1/2, ν = 1, ν = 2, and ν→∞; right, sample functions f(x) drawn from the corresponding GPs.]
Periodic, smooth functions
To create a distribution over periodic functions of x, we can first map the inputs to u = (sin(x), cos(x))^⊤, and then measure distances in the u space. Combined with the SE covariance function, with characteristic length scale ℓ, we get:

k_{periodic}(x, x') = \exp\left(-2 \sin^2(\pi(x - x')) / \ell^2\right)
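The construction can be verified numerically, as in the sketch below (our own). One caveat: reproducing the stated formula, which has period 1, requires the mapping u = (sin(2πx), cos(2πx)); the mapping u = (sin(x), cos(x)) as written gives period 2π instead.

```python
import numpy as np

def k_periodic(x, xp, ell=1.0):
    # The closed form from the slide (period 1).
    return np.exp(-2.0 * np.sin(np.pi * (x - xp))**2 / ell**2)

def k_se_on_circle(x, xp, ell=1.0):
    # Map inputs onto the unit circle, then apply the SE covariance in u-space.
    u  = np.array([np.sin(2 * np.pi * x),  np.cos(2 * np.pi * x)])
    up = np.array([np.sin(2 * np.pi * xp), np.cos(2 * np.pi * xp)])
    return np.exp(-np.sum((u - up)**2) / (2.0 * ell**2))

for x, xp in [(0.1, 0.4), (0.0, 1.0), (0.3, 0.95)]:
    print(f"{k_periodic(x, xp):.6f}  {k_se_on_circle(x, xp):.6f}")  # should match
```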
[Figure: three functions drawn at random from the periodic covariance; left ℓ > 1, and right ℓ < 1.]
Spline models
One dimensional minimization problem: find the function f(x) which minimizes

\sum_{i=1}^{n} \left(f(x^{(i)}) - y^{(i)}\right)^2 + \lambda \int_0^1 (f''(x))^2 \, dx,

where 0 < x^{(i)} < x^{(i+1)} < 1, ∀i = 1, …, n − 1. The solution is the natural smoothing cubic spline: first order polynomials when x ∈ [0, x^{(1)}] and when x ∈ [x^{(n)}, 1], and a cubic polynomial in each x ∈ [x^{(i)}, x^{(i+1)}], ∀i = 1, …, n − 1, joined to have continuous second derivatives at the knots.

The same function is also the mean of a Gaussian process. Consider the class of functions given by

f(x) = \alpha + \beta x + \lim_{n\to\infty} \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} \gamma_i \left(x - \frac{i}{n}\right)_+, \qquad \text{where } (x)_+ = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}

with Gaussian priors:

\alpha \sim \mathcal{N}(0, \xi), \quad \beta \sim \mathcal{N}(0, \xi), \quad \gamma_i \sim \mathcal{N}(0, \Gamma), \quad \forall i = 0, \ldots, n-1.
The covariance function becomes:
k(x, x') = \xi + x x' \xi + \Gamma \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \left(x - \frac{i}{n}\right)_+ \left(x' - \frac{i}{n}\right)_+
         = \xi + x x' \xi + \Gamma \int_0^1 (x - u)_+ (x' - u)_+ \, du
         = \xi + x x' \xi + \Gamma \left( \frac{1}{2} |x - x'| \min(x, x')^2 + \frac{1}{3} \min(x, x')^3 \right).

In the limit ξ → ∞ and λ = σ_n²/Γ the posterior mean becomes the natural cubic spline.

We can thus find the hyperparameters σ_n² and Γ (and thereby λ) by maximising the marginal likelihood in the usual way.

Defining h(x) = (1, x)^⊤, the posterior predictions have mean and variance:

\tilde{\mu}(X_*) = H(X_*)^\top \beta + K(X, X_*)^\top [K(X, X) + \sigma_n^2 I]^{-1} (y - H(X)^\top \beta)
\tilde{\Sigma}(X_*) = \Sigma(X_*) + R(X, X_*)^\top A(X)^{-1} R(X, X_*)
\beta = A(X)^{-1} H(X) [K + \sigma_n^2 I]^{-1} y, \qquad A(X) = H(X) [K(X, X) + \sigma_n^2 I]^{-1} H(X)^\top
R(X, X_*) = H(X_*) - H(X) [K + \sigma_n^2 I]^{-1} K(X, X_*)
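As a sanity check (our own sketch, not part of the slides; ξ and Γ are set arbitrarily), the closed form of the spline covariance can be compared against a numerical evaluation of the integral it came from:

```python
import numpy as np

xi, Gamma = 1.0, 1.0  # assumed prior variances

def k_spline(x, xp, xi=xi, Gamma=Gamma):
    # Closed form: xi + x x' xi + Gamma (|x - x'| min(x,x')^2 / 2 + min(x,x')^3 / 3)
    m = min(x, xp)
    return xi + x * xp * xi + Gamma * (abs(x - xp) * m**2 / 2.0 + m**3 / 3.0)

def k_spline_numeric(x, xp, xi=xi, Gamma=Gamma):
    # Midpoint-rule approximation of xi + x x' xi + Gamma int_0^1 (x-u)_+ (x'-u)_+ du
    u = (np.arange(200_000) + 0.5) / 200_000
    integrand = np.maximum(x - u, 0.0) * np.maximum(xp - u, 0.0)
    return xi + x * xp * xi + Gamma * integrand.mean()

for x, xp in [(0.2, 0.7), (0.5, 0.5), (0.9, 0.1)]:
    print(f"{k_spline(x, xp):.6f}  {k_spline_numeric(x, xp):.6f}")  # should match
```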
Cubic Splines, Example
Although this is not the fastest way to compute splines, it offers a principled way of finding hyperparameters, and uncertainties on predictions. Note also that although the posterior mean is smooth (piecewise cubic), posterior sample functions are not.
[Figure: GP spline fit on [0, 1] showing the observations, the posterior mean, two random posterior sample functions, and the 95% confidence region.]
Feed Forward Neural Networks
[Figure: a feed forward network diagram with input units, a bias unit, a layer of tanh hidden units, and a linear output unit; the weight groups are input-to-hidden, bias-to-hidden, and output weights.]
A feed forward neural network implements the function:
f(x) = \sum_{i=1}^{H} v_i \tanh\left( \sum_j u_{ij} x_j + b_i \right)
Limits of Large Neural Networks
Sample random neural network weights from the (Gaussian) prior.
[Figure: functions drawn at random from networks with 2, 5, and 1000 hidden units, plotted as network output y against input x.]
Note: The prior on the neural network weights induces a prior over functions.
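A sketch of this experiment (our own; the prior scales on the weights are assumptions, and the 1/√H scaling of the output weights is what keeps the output variance finite as H grows):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_network(x, H, rng):
    # One draw of f(x) = sum_i v_i tanh(u_i x + b_i) with Gaussian weight priors.
    u = rng.normal(0.0, 5.0, size=H)                 # input-to-hidden weights
    b = rng.normal(0.0, 5.0, size=H)                 # hidden-unit biases
    v = rng.normal(0.0, 1.0, size=H) / np.sqrt(H)    # scaled output weights
    return np.tanh(np.outer(x, u) + b) @ v

x = np.linspace(-10.0, 10.0, 200)
for H in (2, 5, 1000):   # the three panel sizes from the figure above
    f = sample_network(x, H, rng)
    print(f"H = {H:4d}: sampled f has std {f.std():.2f} across inputs")
```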
Function drawn at random from a Neural Network covariance function
k(x, x') = \frac{2}{\pi} \arcsin\left( \frac{2 x^\top \Sigma x'}{\sqrt{(1 + 2 x^\top \Sigma x)(1 + 2 x'^\top \Sigma x')}} \right).
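A direct implementation of this covariance, as a sketch (the choice of Σ and the test inputs are ours; inputs are typically augmented with a constant 1 so that Σ also covers the bias weight):

```python
import numpy as np

def k_nn(x, xp, Sigma):
    # k(x,x') = (2/pi) arcsin( 2 x^T S x' / sqrt((1 + 2 x^T S x)(1 + 2 x'^T S x')) )
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    num = 2.0 * x @ Sigma @ xp
    den = np.sqrt((1.0 + 2.0 * x @ Sigma @ x) * (1.0 + 2.0 * xp @ Sigma @ xp))
    return (2.0 / np.pi) * np.arcsin(num / den)

Sigma = np.eye(2)                                # assumed weight-prior covariance
print(k_nn([1.0, 0.5], [1.0, -0.5], Sigma))      # first input component is the bias
```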
Composite covariance functions
We’ve seen many examples of covariance functions.
Covariance functions have to be positive definite.

One way of building covariance functions is by composing simpler ones in various ways (see the sketch below):
• sums of covariance functions: k(x, x') = k_1(x, x') + k_2(x, x')
• products: k(x, x') = k_1(x, x') × k_2(x, x')
• other combinations: g(x) k(x, x') g(x')
• etc.
The gpml toolbox supports such constructions.
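A minimal sketch of such compositions (our own, in Python rather than the MATLAB gpml toolbox):

```python
import numpy as np

# Two simple base covariance functions from earlier slides.
def k_se(x, xp, ell=1.0):
    return np.exp(-(x - xp)**2 / (2.0 * ell**2))

def k_per(x, xp, ell=1.0):
    return np.exp(-2.0 * np.sin(np.pi * (x - xp))**2 / ell**2)

# Sums and products of covariance functions are again positive definite.
def k_sum(x, xp):    return k_se(x, xp) + k_per(x, xp)
def k_prod(x, xp):   return k_se(x, xp, ell=3.0) * k_per(x, xp)   # locally periodic
def g(x):            return 1.0 + x**2                            # any fixed function
def k_scaled(x, xp): return g(x) * k_se(x, xp) * g(xp)            # g(x) k(x,x') g(x')

# Sanity check: the Gram matrix of the product kernel is (numerically) PSD.
xs = np.linspace(-2.0, 2.0, 50)
K = np.array([[k_prod(a, b) for b in xs] for a in xs])
print(np.linalg.eigvalsh(K).min() > -1e-9)   # expect True
```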