Sparse Approximations to Bayesian Gaussian Processes. Matthias Seeger, University of Edinburgh
TRANSCRIPT
Sparse Approximations to Bayesian Gaussian Processes
Matthias Seeger
University of Edinburgh
Joint Work With
Christopher Williams (Edinburgh), Neil Lawrence (Sheffield)
Builds on prior work by: Lehel Csato, Manfred Opper (Birmingham)
Overview of the Talk
Gaussian processes and approximations
Understanding sparse schemes as likelihood approximations
Fast greedy selection
Model selection
The Recipe
Goal: Probabilistic approximation to GP inference. Scaling (in principle) O(n)
Ingredients:
Gaussian approximations
m-projections (moment matching)
e-projections (mean field)
Gaussian Process Models
Given u_i: y_i independent of the rest!
[Figure: graphical model with inputs x_1, x_2, x_3, latents u_1, u_2, u_3, and outputs y_1, y_2, y_3]
Gaussian prior (dense), kernel K
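The structure on this slide (each y_i depends only on its latent u_i; the latents are jointly Gaussian with a dense kernel matrix K) can be sketched in a few lines of NumPy. The squared-exponential kernel, noise level, and sample size here are illustrative choices, not the ones from the talk:

```python
import numpy as np

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel: K(x, z) = v * exp(-||x - z||^2 / (2 l^2))
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(5, 1))       # inputs x_1, ..., x_n
K = rbf_kernel(X, X) + 1e-9 * np.eye(5)       # dense n x n prior covariance (jitter for stability)
u = rng.multivariate_normal(np.zeros(5), K)   # latents u ~ N(0, K)
y = u + 0.1 * rng.standard_normal(5)          # each y_i depends on u_i only (Gaussian noise case)
```

The dense n x n matrix K is exactly what makes exact inference scale as O(n^3) and motivates the sparse schemes below.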
Roadmap
Non-Gaussian posterior process
↓ m-projection, EP (GP approximation)
Finite Gaussian approximation, feasible fitting scheme
↓ likelihood approx. by e-projection (sparse scheme)
Sparse Gaussian approximation, leading to sparse predictor
Step 1: Infinite → Finite
Gaussian process approximation Q(u(·) | y) of posterior P(u(·) | y) by m-projection
Data constrains u = (u_1, …, u_n) only
Q determined by finite Gaussian Q(u | y) and prior GP
Optimal Gaussian Q(u | y) hard to find, not sparse
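For reference, the m-projection named on this slide is moment matching: among Gaussians, minimising the "forward" KL divergence reproduces the posterior's first two moments (a standard result; notation here is mine):

```latex
Q^{*} = \operatorname*{argmin}_{Q \,\text{Gaussian}} \mathrm{D}\!\left[ P(u \mid y) \,\middle\|\, Q(u) \right]
\quad\Longleftrightarrow\quad
\mathbb{E}_{Q^{*}}[u] = \mathbb{E}_{P}[u \mid y], \qquad
\operatorname{Cov}_{Q^{*}}[u] = \operatorname{Cov}_{P}[u \mid y].
```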
Step 2: Expectation Propagation
Behind EP: approximate variational principle (e-projections) with "weak marginalisation" (moment) constraints (m-projections)
Replace likelihood terms P(y_i | u_i) by Gaussian-like sites t_i(u_i) ∝ N(u_i | m_i, p_i^{-1})
Update: change t_i(u_i) → P(y_i | u_i), m-project to Gaussian, extract new t_i(u_i)
t_i(u_i): role of Shafer/Shenoy update factors
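A single EP site update (the m-projection step above) can be made concrete for one common non-Gaussian likelihood, the probit P(y | u) = Φ(y·u). The probit choice and the one-dimensional setting are illustrative, not from the talk; the moment-matching formulas are the standard ones for this likelihood:

```python
from math import erf, exp, pi, sqrt

def std_norm_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def std_norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def ep_probit_update(m_cav, v_cav, y):
    """One EP m-projection for a probit site P(y|u) = Phi(y*u), y in {-1,+1}:
    match mean and variance of the tilted distribution Phi(y*u) N(u; m_cav, v_cav)."""
    z = y * m_cav / sqrt(1.0 + v_cav)
    ratio = std_norm_pdf(z) / std_norm_cdf(z)                 # N(z) / Phi(z)
    m_new = m_cav + y * v_cav * ratio / sqrt(1.0 + v_cav)     # matched mean
    v_new = v_cav - v_cav**2 * ratio * (z + ratio) / (1.0 + v_cav)  # matched variance
    return m_new, v_new
```

Dividing the matched Gaussian by the cavity then yields the new site t_i(u_i).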
Likelihood Approximations
P(y | u) = P(y | u_I) → sparse approximation!
Active set I = {2,3}, d = |I| = 2
[Figure: graphical model with inputs x_1, …, x_4, latents u_1, …, u_4, and outputs y_1, …, y_4; the active set contains u_2 and u_3]
Step 3: KL-optimal Projections
If P(u | y) ∝ N(m | u, Π^{-1}) P(u), the e-projection to the I-LH-approx. family gives Q(u | y) ∝ N(m | E_P[u | u_I], Π^{-1}) P(u)  [Csato/Opper]
Here, the good news: E_P[u | u_I] = P_I^T u_I requires the small inversion K_I^{-1} only!
→ O(n^3) scaling can be circumvented
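The key computation on this slide, E_P[u | u_I] = K_{·,I} K_I^{-1} u_I, touches only the d x d active-set block of the kernel matrix. A minimal NumPy sketch (function name and jitter are my choices):

```python
import numpy as np

def conditional_mean(K, I, u_I):
    """E_P[u | u_I] = K_{.,I} K_I^{-1} u_I: only the small d x d block K_I is
    factorised, avoiding any O(n^3) operation on the full kernel matrix."""
    K_I = K[np.ix_(I, I)] + 1e-9 * np.eye(len(I))   # d x d active-set block (jittered)
    K_nI = K[:, I]                                  # n x d cross-covariance
    L = np.linalg.cholesky(K_I)                     # O(d^3) factorisation
    return K_nI @ np.linalg.solve(L.T, np.linalg.solve(L, u_I))
```

Note that for indices inside I the conditional mean reproduces u_I exactly, as it must.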
Sparse Approximation Scheme
Iterate between:
Select new i and include into I
EP updates (m-projection), followed by e-projection to I-LH-approx. family [skip EP if likelihood is Gaussian]
Exchange moves possible (unstable?)
But how to select inclusion candidates i using a fast score?
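The iteration above has the shape of a generic greedy forward-selection loop. A skeleton (the `score` callback stands in for whichever fast criterion is used; the fitting step is elided as a comment):

```python
def greedy_select(score, n, d_max):
    """Greedy forward selection skeleton: at each step include the remaining
    site index with the largest fast score, then refit the approximation."""
    I, remaining = [], set(range(n))
    while remaining and len(I) < d_max:
        i_best = max(remaining, key=lambda i: score(I, i))
        I.append(i_best)
        remaining.discard(i_best)
        # ... here: EP update (m-projection) + e-projection to the I-family ...
    return I
```

The whole point of the next slide is that `score(I, i)` must be cheap, since it is evaluated for every remaining candidate at every inclusion.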
Fast Selection Scores
Criteria like information gain D[Q_new || Q] (Q_new after i-inclusion) too expensive: u_i immediately coupled with all n sites!
Approximate criteria by removing most couplings in Q_new → O(|H| d² + 1)
[Figure: sites partitioned into H, i, and {1,…,n} \ (H ∪ i); latents u_I and u_i]
Model Selection
Gaussian likelihood (regression):
Sparse approximation Q(y) to marginal likelihood P(y) by plugging in LH approx.
Iterate between descent in −log Q(y) and re-selection of I
General case: minimise variational criterion behind EP (ADATAP)
Similar to EM, using Q(u | y) instead of posterior
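For the Gaussian-likelihood case, plugging the likelihood approximation P(y | u) = P(y | u_I) into the marginal likelihood gives Q(y) = N(y; 0, K_{·,I} K_I^{-1} K_{I,·} + σ² I_n). A dense sketch for illustration (a practical version would use the matrix-inversion lemma to keep the cost at O(n d²); names and jitter are my choices):

```python
import numpy as np

def sparse_log_evidence(K, I, y, noise_var):
    """log Q(y) with the likelihood approximation plugged in:
    Q(y) = N(y; 0, K_{.,I} K_I^{-1} K_{I,.} + noise_var * I_n)."""
    n, d = len(y), len(I)
    K_nI = K[:, I]                                    # n x d cross-covariance
    K_I = K[np.ix_(I, I)] + 1e-9 * np.eye(d)          # d x d active-set block
    C = K_nI @ np.linalg.solve(K_I, K_nI.T) + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + n * np.log(2.0 * np.pi))
```

Gradient descent on −log Q(y) with respect to kernel parameters, alternated with re-selection of I, is the model-selection loop described above.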
Related Work
Csato/Opper: Same approximation, but uses an online-like scheme of including/removing points instead of greedy forward selection
Smola/Bartlett: Restricted to regression with Gaussian noise. Expensive selection heuristic [O(n d)] → high training cost
Experiments
Regression with Gaussian noise, simplest selection score approximation (H = ∅). See paper for details.
Promising: hard, low-noise task with many irrelevant attributes. Sparse scheme matches performance of full GPR in <1/10 of the time. Methods with an isotropic kernel fail badly. Model selection essential here.
Conclusions
Sparse approximations overcome the severe scaling problem of GP methods
Greedy selection based on "active learning" criteria can yield very sparse solutions with errors close to or better than for full GPs
Sparse inference is the inner loop for model selection → fast selection scores are essential for greedy schemes
Conclusions (II)
Controllable sparsity and training time
Staying as close as possible to the "gold standard" (EP), given resource constraints → transfer of properties (error bars, model selection, embedding in other models, …)
Fast, flexible C++ implementation will be made available