Sparse Approximations to Bayesian Gaussian Processes. Matthias Seeger, University of Edinburgh
TRANSCRIPT
Sparse Approximations to Bayesian Gaussian Processes
Matthias Seeger
University of Edinburgh
Joint Work With
Christopher Williams (Edinburgh), Neil Lawrence (Sheffield)
Builds on prior work by: Lehel Csato, Manfred Opper (Birmingham)
Overview of the Talk
Gaussian processes and approximations
Understanding sparse schemes as likelihood approximations
Fast greedy selection
Model selection
The Recipe
Goal: Probabilistic approximation to GP inference. Scaling (in principle) O(n)
Ingredients:
Gaussian approximations
m-projections (moment matching)
e-projections (mean field)
Gaussian Process Models
Given u_i: y_i independent of the rest!
[Figure: graphical model with inputs x_1, x_2, x_3, latents u_1, u_2, u_3, and outputs y_1, y_2, y_3]
Gaussian prior (dense), kernel K
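The structure on this slide (each y_i depends only on its latent u_i; the latents are jointly Gaussian with a dense kernel matrix K) can be sketched in a few lines of NumPy. The squared-exponential kernel, noise level, and sample size here are illustrative choices, not the ones from the talk:

```python
import numpy as np

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel: K(x, z) = v * exp(-||x - z||^2 / (2 l^2))
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(5, 1))       # inputs x_1, ..., x_n
K = rbf_kernel(X, X) + 1e-9 * np.eye(5)       # dense n x n prior covariance (jitter for stability)
u = rng.multivariate_normal(np.zeros(5), K)   # latents u ~ N(0, K)
y = u + 0.1 * rng.standard_normal(5)          # each y_i depends on u_i only (Gaussian noise case)
```

The dense n x n matrix K is exactly what makes exact inference scale as O(n^3) and motivates the sparse schemes below.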
Roadmap
Non-Gaussian posterior process
↓ m-projection, EP (GP approximation)
Finite Gaussian approximation, feasible fitting scheme
↓ likelihood approx. by e-projection (sparse scheme)
Sparse Gaussian approximation, leading to sparse predictor
Step 1: Infinite → Finite
Gaussian process approximation Q(u(·) | y) of posterior P(u(·) | y) by m-projection
Data constrains u = (u_1, …, u_n) only
Q determined by finite Gaussian Q(u | y) and prior GP
Optimal Gaussian Q(u | y) hard to find, not sparse
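For reference, the m-projection named on this slide is moment matching: among Gaussians, minimising the "forward" KL divergence reproduces the posterior's first two moments (a standard result; notation here is mine):

```latex
Q^{*} = \operatorname*{argmin}_{Q \,\text{Gaussian}} \mathrm{D}\!\left[ P(u \mid y) \,\middle\|\, Q(u) \right]
\quad\Longleftrightarrow\quad
\mathbb{E}_{Q^{*}}[u] = \mathbb{E}_{P}[u \mid y], \qquad
\operatorname{Cov}_{Q^{*}}[u] = \operatorname{Cov}_{P}[u \mid y].
```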
Step 2: Expectation Propagation
Behind EP: approximate variational principle (e-projections) with "weak marginalisation" (moment) constraints (m-projections)
Replace likelihood terms P(y_i | u_i) by Gaussian-like sites t_i(u_i) ∝ N(u_i | m_i, p_i^{-1})
Update: change t_i(u_i) → P(y_i | u_i), m-project to Gaussian, extract new t_i(u_i)
t_i(u_i): role of Shafer/Shenoy update factors
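A single EP site update (the m-projection step above) can be made concrete for one common non-Gaussian likelihood, the probit P(y | u) = Φ(y·u). The probit choice and the one-dimensional setting are illustrative, not from the talk; the moment-matching formulas are the standard ones for this likelihood:

```python
from math import erf, exp, pi, sqrt

def std_norm_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def std_norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def ep_probit_update(m_cav, v_cav, y):
    """One EP m-projection for a probit site P(y|u) = Phi(y*u), y in {-1,+1}:
    match mean and variance of the tilted distribution Phi(y*u) N(u; m_cav, v_cav)."""
    z = y * m_cav / sqrt(1.0 + v_cav)
    ratio = std_norm_pdf(z) / std_norm_cdf(z)                 # N(z) / Phi(z)
    m_new = m_cav + y * v_cav * ratio / sqrt(1.0 + v_cav)     # matched mean
    v_new = v_cav - v_cav**2 * ratio * (z + ratio) / (1.0 + v_cav)  # matched variance
    return m_new, v_new
```

Dividing the matched Gaussian by the cavity then yields the new site t_i(u_i).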
Likelihood Approximations
P(y | u) = P(y | u_I) → sparse approximation!
Active set I = {2,3}, d = |I| = 2
[Figure: graphical model with inputs x_1, …, x_4, latents u_1, …, u_4, and outputs y_1, …, y_4; the active set contains u_2 and u_3]
Step 3: KL-optimal Projections
If P(u | y) ∝ N(m | u, Π^{-1}) P(u), the e-projection to the I-LH-approx. family gives Q(u | y) ∝ N(m | E_P[u | u_I], Π^{-1}) P(u)  [Csato/Opper]
Here, the good news: E_P[u | u_I] = P_I^T u_I requires the small inversion K_I^{-1} only!
→ O(n^3) scaling can be circumvented
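The key computation on this slide, E_P[u | u_I] = K_{·,I} K_I^{-1} u_I, touches only the d x d active-set block of the kernel matrix. A minimal NumPy sketch (function name and jitter are my choices):

```python
import numpy as np

def conditional_mean(K, I, u_I):
    """E_P[u | u_I] = K_{.,I} K_I^{-1} u_I: only the small d x d block K_I is
    factorised, avoiding any O(n^3) operation on the full kernel matrix."""
    K_I = K[np.ix_(I, I)] + 1e-9 * np.eye(len(I))   # d x d active-set block (jittered)
    K_nI = K[:, I]                                  # n x d cross-covariance
    L = np.linalg.cholesky(K_I)                     # O(d^3) factorisation
    return K_nI @ np.linalg.solve(L.T, np.linalg.solve(L, u_I))
```

Note that for indices inside I the conditional mean reproduces u_I exactly, as it must.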
Sparse Approximation Scheme
Iterate between:
Select new i and include into I
EP updates (m-projection), followed by e-projection to I-LH-approx. family [skip EP if likelihood is Gaussian]
Exchange moves possible (unstable?)
But how to select inclusion candidates i using a fast score?
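The iteration above has the shape of a generic greedy forward-selection loop. A skeleton (the `score` callback stands in for whichever fast criterion is used; the fitting step is elided as a comment):

```python
def greedy_select(score, n, d_max):
    """Greedy forward selection skeleton: at each step include the remaining
    site index with the largest fast score, then refit the approximation."""
    I, remaining = [], set(range(n))
    while remaining and len(I) < d_max:
        i_best = max(remaining, key=lambda i: score(I, i))
        I.append(i_best)
        remaining.discard(i_best)
        # ... here: EP update (m-projection) + e-projection to the I-family ...
    return I
```

The whole point of the next slide is that `score(I, i)` must be cheap, since it is evaluated for every remaining candidate at every inclusion.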
Fast Selection Scores
Criteria like information gain D[Q_new || Q] (Q_new after i-inclusion) too expensive: u_i immediately coupled with all n sites!
Approximate criteria by removing most couplings in Q_new → O(|H| d² + 1)
[Figure: sites partitioned into H, i, and {1,…,n} \ (H ∪ i); latents u_I and u_i]
Model Selection
Gaussian likelihood (regression):
Sparse approximation Q(y) to marginal likelihood P(y) by plugging in LH approx.
Iterate between descent in −log Q(y) and re-selection of I
General case: minimise variational criterion behind EP (ADATAP)
Similar to EM, using Q(u | y) instead of posterior
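For the Gaussian-likelihood case, plugging the likelihood approximation P(y | u) = P(y | u_I) into the marginal likelihood gives Q(y) = N(y; 0, K_{·,I} K_I^{-1} K_{I,·} + σ² I_n). A dense sketch for illustration (a practical version would use the matrix-inversion lemma to keep the cost at O(n d²); names and jitter are my choices):

```python
import numpy as np

def sparse_log_evidence(K, I, y, noise_var):
    """log Q(y) with the likelihood approximation plugged in:
    Q(y) = N(y; 0, K_{.,I} K_I^{-1} K_{I,.} + noise_var * I_n)."""
    n, d = len(y), len(I)
    K_nI = K[:, I]                                    # n x d cross-covariance
    K_I = K[np.ix_(I, I)] + 1e-9 * np.eye(d)          # d x d active-set block
    C = K_nI @ np.linalg.solve(K_I, K_nI.T) + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + n * np.log(2.0 * np.pi))
```

Gradient descent on −log Q(y) with respect to kernel parameters, alternated with re-selection of I, is the model-selection loop described above.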
Related Work
Csato/Opper: Same approximation, but uses an online-like scheme of including/removing points instead of greedy forward selection
Smola/Bartlett: Restricted to regression with Gaussian noise. Expensive selection heuristic [O(n d)] → high training cost
Experiments
Regression with Gaussian noise, simplest selection score approximation (H = ∅). See paper for details.
Promising: hard, low-noise task with many irrelevant attributes. Sparse scheme matches performance of full GPR in <1/10 of the time. Methods with an isotropic kernel fail badly. Model selection essential here.
Conclusions
Sparse approximations overcome the severe scaling problem of GP methods
Greedy selection based on "active learning" criteria can yield very sparse solutions with errors close to or better than for full GPs
Sparse inference is the inner loop for model selection → fast selection scores are essential for greedy schemes
Conclusions (II)
Controllable sparsity and training time
Staying as close as possible to the "gold standard" (EP), given resource constraints → transfer of properties (error bars, model selection, embedding in other models, …)
Fast, flexible C++ implementation will be made available