
Parametric Gaussian Process Regressors

Martin Jankowiak¹  Geoff Pleiss²  Jacob R. Gardner³

Abstract

The combination of inducing point methods with stochastic variational inference has enabled approximate Gaussian Process (GP) inference on large datasets. Unfortunately, the resulting predictive distributions often exhibit substantially underestimated uncertainties. Notably, in the regression case the predictive variance is typically dominated by observation noise, yielding uncertainty estimates that make little use of the input-dependent function uncertainty that makes GP priors attractive. In this work we propose two simple methods for scalable GP regression that address this issue and thus yield substantially improved predictive uncertainties. The first applies variational inference to FITC (Fully Independent Training Conditional; Snelson and Ghahramani, 2006). The second bypasses posterior approximations and instead directly targets the posterior predictive distribution. In an extensive empirical comparison with a number of alternative methods for scalable GP regression, we find that the resulting predictive distributions exhibit significantly better calibrated uncertainties and higher log likelihoods—often by as much as half a nat per datapoint.

1. Introduction

Machine learning is finding increasing use in applications where autonomous decisions are driven by predictive models. For example, machine learning can be used to steer dynamic load balancing in critical electrical systems, and autonomous vehicles use machine learning algorithms to detect and classify objects in unpredictable weather conditions and decide whether to brake. As machine learning models increasingly become deployed as components in larger decision making pipelines, it is essential that models be able to reason about uncertainty and risk. Techniques drawn from probabilistic machine learning offer the ability to deal with these challenges by offering predictive models with simple and interpretable probabilistic outputs.

¹The Broad Institute, Cambridge, MA, USA  ²Dept. of Computer Science, Cornell University, Ithaca, NY, USA  ³Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Martin Jankowiak <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Figure 1. We depict GP regressors fit to a heteroscedastic dataset using two different inference algorithms. Solid lines depict mean predictions and 2-σ uncertainty bands are in blue. In the lower panel, fit with the PPGPR approach described in Sec. 3.2, significant use is made of input-dependent function uncertainty (dark blue), while in the upper panel, fit with variational inference (see Sec. 2.3.1), the predictive uncertainty is dominated by the observation noise σ²_obs (light blue) and the kernel scale σ_k is smaller.

Recent years have seen extensive use of variational inference (Jordan et al., 1999) as a workhorse inference algorithm for a variety of probabilistic models. The popularity of variational inference has been driven by a number of different factors, including: i) its amenability to data subsampling (Hoffman et al., 2013); ii) its applicability to black-box non-conjugate models (Kingma and Welling, 2013; Rezende et al., 2014); and iii) its suitability for GPU acceleration.

The many practical successes of variational inference notwithstanding, it has long been recognized that variational inference often results in overconfident uncertainty estimates.¹ This problem can be especially acute for Gaussian Process (GP) models (Bauer et al., 2016).

¹For example see Turner and Sahani (2011) for a discussion of this point in the context of time series models.


In particular, GP regressors fit with variational inference tend to apportion most of the predictive variance to the input-independent observation noise, making little use of the input-dependent function uncertainty that makes GP priors attractive in the first place (see Fig. 4 in Sec. 5 for empirical evidence). As we explain in Sec. 3.3, this tendency can be understood as resulting from the asymmetry with which variational inference—through its reliance on Jensen's inequality—treats the various contributions to the uncertainty in output space, in particular its asymmetric treatment of the observation noise.

In this work we propose two simple solutions that correct this undesirable behavior. In the first we apply variational inference to the well known FITC (Fully Independent Training Conditional; Snelson and Ghahramani (2006)) method for sparse GP regression. In the second we directly target the predictive distribution—bypassing posterior approximations entirely—to formulate an objective function that treats the various contributions to predictive variance on an equal footing. As we show empirically in Sec. 5, the predictive distributions resulting from these two parametric approaches to GP regression exhibit better calibrated uncertainties and higher log likelihoods than those obtained with existing methods for scalable GP regression.

2. Background

This section is organized as follows. In Sec. 2.1-2.2 we review the basics of Gaussian Processes and inducing point methods. In Sec. 2.3 we review various approaches to scalable GP inference that will serve as the baselines in our experiments. In Sec. 2.4 we describe the predictive distributions that result from these methods. Finally in Sec. 2.5 we review FITC (Snelson and Ghahramani, 2006)—an approach to sparse GP regression that achieves scalability by suitably modifying the regression model—as it will serve as one of the ingredients to the approach introduced in Sec. 3.1. We also use this section to establish our notation.

2.1. Gaussian Process Regression

In probabilistic modeling Gaussian Processes offer powerful non-parametric function priors that are useful in a variety of regression and classification tasks (Rasmussen, 2003). For a given input space R^d GPs are entirely specified by a covariance function or kernel k : R^d × R^d → R and a mean function µ : R^d → R. Different choices of µ and k allow the modeler to encode prior information about the generative process. In the prototypical case of univariate regression the joint density takes the form²

$$p(\mathbf{y}, \mathbf{f} \mid X) = p(\mathbf{y} \mid \mathbf{f}, \sigma_{\rm obs}^2)\, p(\mathbf{f} \mid X, k, \mu) \qquad (1)$$

where y are the real-valued targets, f are the latent function values, X = {x_i}_{i=1}^N are the N inputs with x_i ∈ R^d, and σ²_obs is the variance of the Normal likelihood p(y|·). The marginal likelihood takes the form

$$p(\mathbf{y} \mid X) = \int \! d\mathbf{f}\; p(\mathbf{y} \mid \mathbf{f}, \sigma_{\rm obs}^2)\, p(\mathbf{f} \mid X) \qquad (2)$$

Eqn. 2 can be computed analytically, but doing so is computationally prohibitive for large datasets. This is because the cost scales as O(N³) from the terms involving K_NN^{-1} and log det K_NN in Eqn. 2, where K_NN = k(X, X). This necessitates approximate methods when N is large.

²In the following we will suppress dependence on the kernel k and the mean function µ.
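To make the O(N³) cost concrete, here is a minimal numpy sketch (our own illustration, not code from the paper) that evaluates the log marginal likelihood of Eqn. 2 for a zero-mean GP; an RBF kernel stands in for the Matern kernels used later, and the Cholesky factorization of the N × N matrix is the cubic bottleneck.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, outputscale=1.0):
    # squared-exponential kernel; a stand-in for the Matern kernels used in the paper
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return outputscale * np.exp(-0.5 * sq / lengthscale**2)

def exact_gp_log_marginal_likelihood(X, y, sigma_obs=0.1):
    # log p(y | X) in Eqn. 2 for a zero-mean GP with a Normal likelihood
    N = X.shape[0]
    K = rbf_kernel(X, X) + sigma_obs**2 * np.eye(N)
    L = np.linalg.cholesky(K)                               # O(N^3) bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # K^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                    # 0.5 * log det K
            - 0.5 * N * np.log(2.0 * np.pi))

# example usage on a tiny synthetic dataset
X = np.random.randn(100, 3)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(100)
print(exact_gp_log_marginal_likelihood(X, y))
```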

2.2. Sparse Gaussian Processes

Over the past two decades significant progress has been made in scaling Gaussian Process inference to large datasets. The key technical innovation was the development of inducing point methods (Snelson and Ghahramani, 2006; Titsias, 2009; Hensman et al., 2013), which we now review. By introducing inducing variables u that depend on variational parameters {z_m}_{m=1}^M, where M = dim(u) ≪ N and with each z_m ∈ R^d, we augment the GP prior as follows:

$$p(\mathbf{f} \mid X) \to p(\mathbf{f} \mid \mathbf{u}, X, Z)\, p(\mathbf{u} \mid Z)$$

We then appeal to Jensen's inequality and lower bound the log joint density over the targets and inducing variables:

$$\log p(\mathbf{y}, \mathbf{u} \mid X, Z) = \log \int \! d\mathbf{f}\, p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{u})\, p(\mathbf{u}) \;\ge\; \mathbb{E}_{p(\mathbf{f} \mid \mathbf{u})}\!\left[ \log p(\mathbf{y} \mid \mathbf{f}) + \log p(\mathbf{u}) \right] = \sum_{i=1}^{N} \log \mathcal{N}(y_i \mid k_i^{\rm T} K_{MM}^{-1} \mathbf{u}, \sigma_{\rm obs}^2) - \tfrac{1}{2\sigma_{\rm obs}^2} {\rm Tr}\, \widetilde{K}_{NN} + \log p(\mathbf{u}) \qquad (3)$$

where k_i = k(x_i, Z), K_MM = k(Z, Z) and K̃_NN is given by

$$\widetilde{K}_{NN} = K_{NN} - K_{NM} K_{MM}^{-1} K_{MN} \qquad (4)$$

with K_NM = K_MN^T = k(X, Z). The essential characteristics of Eqn. 3 are that: i) it replaces expensive computations involving K_NN with cheaper computations like K_MM^{-1} that scale as O(M³); and ii) it is amenable to data subsampling, since the log likelihood and trace terms factorize as sums over datapoints (y_i, x_i).
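The following is a hedged numpy sketch, again our own illustration, of the quantities entering Eqn. 3 for a fixed value of u: the cross-covariances k_i, the Cholesky factor of K_MM, the diagonal of K̃_NN from Eqn. 4, and the resulting bound. The function and variable names (sparse_log_joint_bound, kernel_diag) are ours, and kernel stands for any covariance function such as the rbf_kernel sketched above.

```python
import numpy as np

def kernel_diag(X, kernel):
    # k(x_i, x_i) for each row of X
    return np.array([kernel(x[None, :], x[None, :])[0, 0] for x in X])

def sparse_log_joint_bound(X, y, Z, u, sigma_obs, kernel, jitter=1e-6):
    # evaluates the lower bound in Eqn. 3 at a fixed value of the inducing variables u
    M = Z.shape[0]
    K_MM = kernel(Z, Z) + jitter * np.eye(M)
    K_NM = kernel(X, Z)                                   # rows are k_i^T
    L = np.linalg.cholesky(K_MM)
    A = np.linalg.solve(L, K_NM.T)                        # A = L^{-1} K_MN
    mean = A.T @ np.linalg.solve(L, u)                    # k_i^T K_MM^{-1} u for each i
    diag_Ktilde = kernel_diag(X, kernel) - np.sum(A**2, axis=0)   # diagonal of Eqn. 4
    log_lik = (-0.5 * np.sum((y - mean)**2) / sigma_obs**2
               - 0.5 * len(y) * np.log(2 * np.pi * sigma_obs**2))
    trace_term = -0.5 * np.sum(diag_Ktilde) / sigma_obs**2
    # log p(u) for u ~ N(0, K_MM)
    v = np.linalg.solve(L, u)
    log_prior = -0.5 * v @ v - np.sum(np.log(np.diag(L))) - 0.5 * M * np.log(2 * np.pi)
    return log_lik + trace_term + log_prior
```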

2.3. Approximate GP Inference

We now describe how Eqn. 3 can be used to construct a variety of algorithms for scalable GP inference. We limit ourselves to algorithms that satisfy two desiderata: i) support for data subsampling;³ and ii) the result of inference is a compact artifact that enables fast predictions at test time.⁴ The rest of this section is organized as follows. In Sec. 2.3.1 we describe SVGP (Hensman et al., 2013)—currently the most popular method for scalable GP inference. In Sec. 2.3.2 we describe how Eqn. 3 can be leveraged in the context of MAP (maximum a posteriori) inference. Then in Sec. 2.3.3 we briefly review how ideas from robust variational inference can be applied to the GP setting.

³For this reason we do not consider the method in (Titsias, 2009). Note that all the inference algorithms that we describe that make use of Eqn. 3 automatically inherit its support for data subsampling.
⁴Consequently we do not consider MCMC algorithms like the Stochastic gradient HMC algorithm explored in the context of deep gaussian processes in (Havasi et al., 2018), and which also utilizes Eqn. 3.

2.3.1. SVGP

SVGP proceeds by introducing a multivariate Normal variational distribution q(u) = N(m, S). The parameters m and S are optimized using the ELBO (evidence lower bound), which is the expectation of Eqn. 3 w.r.t. q(u) plus an entropy term H[q(u)]:

$$\mathcal{L}_{\rm svgp} = \mathbb{E}_{q(\mathbf{u})}\!\left[\log p(\mathbf{y}, \mathbf{u} \mid X, Z)\right] + \mathbb{H}[q(\mathbf{u})] = \sum_{i=1}^{N} \log \mathcal{N}(y_i \mid k_i^{\rm T} K_{MM}^{-1} \mathbf{m}, \sigma_{\rm obs}^2) - \tfrac{1}{2\sigma_{\rm obs}^2} {\rm Tr}\, \widetilde{K}_{NN} - \tfrac{1}{2\sigma_{\rm obs}^2} \sum_{i=1}^{N} k_i^{\rm T} K_{MM}^{-1} S K_{MM}^{-1} k_i - {\rm KL}(q(\mathbf{u}) \,|\, p(\mathbf{u})) \qquad (5)$$

where KL denotes the Kullback-Leibler divergence. Eqn. 5 can be rewritten more compactly as

$$\mathcal{L}_{\rm svgp} = \sum_{i=1}^{N} \Big\{ \log \mathcal{N}(y_i \mid \mu_f(\mathbf{x}_i), \sigma_{\rm obs}^2) - \frac{\sigma_f(\mathbf{x}_i)^2}{2\sigma_{\rm obs}^2} \Big\} - {\rm KL}(q(\mathbf{u}) \,|\, p(\mathbf{u})) \qquad (6)$$

where µ_f(x_i) is the predictive mean function given by

$$\mu_f(\mathbf{x}_i) = k_i^{\rm T} K_{MM}^{-1} \mathbf{m} \qquad (7)$$

and where σ_f(x_i)² ≡ Var[f_i|x_i] denotes the latent function variance

$$\sigma_f(\mathbf{x}_i)^2 = \widetilde{K}_{ii} + k_i^{\rm T} K_{MM}^{-1} S K_{MM}^{-1} k_i \qquad (8)$$

L_svgp, which depends on m, S, Z, σ_obs and the various kernel hyperparameters θ_ker, can then be maximized with gradient methods. Below we refer to the resulting GP regression method as SVGP. Note that throughout this work we consider a variant of SVGP in which the KL divergence term in Eqn. 6 is scaled by a positive constant β_reg > 0.
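Below is a small numpy sketch (ours; the authors' implementation uses GPyTorch, see footnote 12) of the quantities in Eqns. 6-8: the predictive mean, the latent function variance, the per-datapoint data fit terms, and the KL regularizer. Helper names such as svgp_predictive_moments and kl_mvn are our own.

```python
import numpy as np

def svgp_predictive_moments(X, Z, m, S, kernel, jitter=1e-6):
    # mu_f(x_i) from Eqn. 7 and sigma_f(x_i)^2 from Eqn. 8
    K_MM = kernel(Z, Z) + jitter * np.eye(Z.shape[0])
    K_NM = kernel(X, Z)
    L = np.linalg.cholesky(K_MM)
    A = np.linalg.solve(L, K_NM.T)                        # L^{-1} K_MN
    B = np.linalg.solve(L.T, A)                           # K_MM^{-1} K_MN
    mu_f = B.T @ m
    diag_K = np.array([kernel(x[None, :], x[None, :])[0, 0] for x in X])
    diag_Ktilde = diag_K - np.sum(A**2, 0)                # diagonal of Eqn. 4
    var_f = diag_Ktilde + np.einsum('mi,mn,ni->i', B, S, B)
    return mu_f, var_f

def kl_mvn(m, S, K):
    # KL( N(m, S) || N(0, K) )
    L = np.linalg.cholesky(K)
    trace_term = np.trace(np.linalg.solve(K, S))
    mahalanobis = m @ np.linalg.solve(K, m)
    logdet_K = 2 * np.sum(np.log(np.diag(L)))
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (trace_term + mahalanobis - len(m) + logdet_K - logdet_S)

def svgp_objective(X, y, Z, m, S, sigma_obs, kernel, beta_reg=1.0, jitter=1e-6):
    # Eqn. 6 with the beta-scaled KL variant described in the text
    mu_f, var_f = svgp_predictive_moments(X, Z, m, S, kernel, jitter)
    log_norm = -0.5 * np.log(2 * np.pi * sigma_obs**2) - 0.5 * (y - mu_f)**2 / sigma_obs**2
    data_terms = np.sum(log_norm - 0.5 * var_f / sigma_obs**2)
    K_MM = kernel(Z, Z) + jitter * np.eye(Z.shape[0])
    return data_terms - beta_reg * kl_mvn(m, S, K_MM)
```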

2.3.2. MAP

In contrast to SVGP, which maintains a distribution over the inducing variables u, MAP is a particle method in which we directly optimize a single point u ∈ R^M rather than a distribution over u. In particular we simply maximize Eqn. 3 evaluated at u. Note that the term log p(u) serves as a regularizer. In the following we refer to this inference procedure as MAP. To the best of our knowledge, it has not been considered before in the sparse GP literature.

2.3.3. ROBUST GAUSSIAN PROCESSES

Knoblauch et al. (2019); Knoblauch (2019) consider modifications to the typical variational objective (e.g. Eqn. 5), which consists of an expected log likelihood and a KL divergence term. In particular, they replace the expected log likelihood loss with an alternative divergence like the gamma divergence. This divergence raises the likelihood to a power

$$\log p(\mathbf{y} \mid \mathbf{f}) \to p(\mathbf{y} \mid \mathbf{f})^{\gamma - 1} \qquad (9)$$

where typically γ ∈ (1.0, 1.1).⁵ Empirically Knoblauch (2019) demonstrates that this modification can yield better performance than SVGP on regression tasks. We refer to this inference procedure as γ-Robust.

⁵See (Cichocki and Amari, 2010) for a detailed discussion of this and other divergences.

2.4. Predictive Distributions

All of the methods in Sec. 2.3 yield predictive distributions of the same form. In particular, conditioned on the inducing variables u the predictive distribution at input x* is given by

$$p(y_* \mid \mathbf{x}_*, \mathbf{u}) = \mathcal{N}(y_* \mid k_*^{\rm T} K_{MM}^{-1} \mathbf{u},\; \widetilde{K}_{**} + \sigma_{\rm obs}^2) \qquad (10)$$

Integrating out u ∼ N(m, S) then yields

$$p(y_* \mid \mathbf{x}_*) = \mathcal{N}(y_* \mid \mu_f(\mathbf{x}_*),\; \sigma_f(\mathbf{x}_*)^2 + \sigma_{\rm obs}^2) \qquad (11)$$

where µ_f(x*) is the predictive mean function in Eqn. 7 and σ_f(x*)² is the latent function variance in Eqn. 8. Note that since MAP can be viewed as a degenerate limit of SVGP, the predictive distribution for MAP can be obtained by taking the limit S → 0 in σ_f(x*)².
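As a brief illustration (ours, assuming scipy is available), the predictive density of Eqn. 11 and the MAP limit can be written as:

```python
import numpy as np
from scipy.stats import norm

def predictive_log_density(y_star, mu_f_star, var_f_star, sigma_obs):
    # Eqn. 11: p(y* | x*) = N(y* | mu_f(x*), sigma_f(x*)^2 + sigma_obs^2),
    # with mu_f and sigma_f^2 computed as in Eqn. 7 and Eqn. 8
    return norm.logpdf(y_star, loc=mu_f_star,
                       scale=np.sqrt(var_f_star + sigma_obs**2))

def map_function_variance(diag_Ktilde):
    # MAP is the degenerate limit S -> 0 of SVGP, so the S-dependent term in
    # Eqn. 8 vanishes and only the diagonal of K~_NN (Eqn. 4) remains
    return diag_Ktilde
```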

2.5. FITC

FITC (Fully Independent Training Conditional) (Snelson and Ghahramani, 2006) is a method for sparse GP regression that is formulated using a joint probability

$$p(\mathbf{y}, \mathbf{u}) = \prod_{i=1}^{N} p(y_i \mid k_i^{\rm T} K_{MM}^{-1} \mathbf{u},\; \widetilde{K}_{ii} + \sigma_{\rm obs}^2)\, p(\mathbf{u}) \qquad (12)$$

with a modified likelihood corresponding to the conditional predictive distribution in Eqn. 10 (for additional interpretations see Bauer et al. (2016)). As noted by Snelson and Ghahramani (2006) this can be viewed as a standard regression model with a particular form of parametric mean function and input-dependent observation noise. Integrating out the latent function values u results in a marginal likelihood

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0},\; K_{NM} K_{MM}^{-1} K_{MN} + {\rm diag}(\widetilde{K}_{NN}) + \sigma_{\rm obs}^2 \mathbb{1}) \qquad (13)$$

that can be used for training with O(NM² + M³) computational complexity per gradient step. Note that the structure of the covariance matrix in Eqn. 13 prevents mini-batch training, limiting FITC to moderately large datasets.

Algorithm 1: Scalable GP Regression. All of the inference algorithms we consider follow the same basic pattern and only differ in the form of the objective function, e.g. L_svgp (Eqn. 6), L_vfitc (Eqn. 15) or L_ppgpr (Eqn. 18). Similarly for all methods the predictive distribution is given by Eqn. 11. See Sec. B in the supplementary materials for a discussion of the time and space complexity of each method.
  Input: L – objective function
         optim() – gradient-based optimizer
         D = {x_i, y_i}_{i=1}^N – training data
         m, S, Z, θ_ker – initial parameters
  Output: m, S, Z, θ_ker
  while not converged do
      Choose a random mini-batch D_mb ⊂ D
      Form an unbiased estimate L(D_mb)
      Gradient step: m, S, Z, θ_ker ← optim(L(D_mb))
  end
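Algorithm 1 is a standard stochastic-gradient loop. The PyTorch sketch below is our own schematic rendering of that pattern, not the authors' GPyTorch code; objective stands for any of L_svgp, L_vfitc or L_ppgpr and is assumed to return an unbiased mini-batch estimate of the full-data objective.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(objective, params, X, y, num_epochs=400, batch_size=1000, lr=0.01):
    # generic version of Algorithm 1: minibatch stochastic gradients on a scalable GP objective
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    optim = torch.optim.Adam(params, lr=lr)               # Adam, as in Sec. C.1
    for epoch in range(num_epochs):
        for x_mb, y_mb in loader:
            optim.zero_grad()
            # unbiased estimate of -L on the mini-batch (scaled to the full dataset)
            loss = -objective(x_mb, y_mb)
            loss.backward()
            optim.step()
    return params
```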

3. Parametric Gaussian Process Regressors

Before introducing the two scalable methods for GP regression we consider in this work, we examine one of the salient characteristics of the SVGP objective Eqn. 6 referred to in the introduction. As discussed in Sec. 2.4, in SVGP the predictive variance Var[y*|x*] at an input x* has two contributions, one that is input-dependent, i.e. σ_f(x*)², and one that is input-independent, i.e. σ²_obs:

$${\rm Var}[y_* \mid \mathbf{x}_*] = \sigma_{\rm obs}^2 + \sigma_f(\mathbf{x}_*)^2 = \sigma_{\rm obs}^2 + \widetilde{K}_{**} + k_*^{\rm T} K_{MM}^{-1} S K_{MM}^{-1} k_* \qquad (14)$$

These two contributions appear asymmetrically in Eqn. 6; in particular σ_f(x_i)² does not appear in the data fit term that results from the expected log likelihood E_q(u)[log p(y_i|x_i, u)]. We expect this behavior to be undesirable, since it leads to a mismatch between the training objective and the predictive distribution used at test time.

In the following we introduce two approaches that address this asymmetry. Crucially, unlike the inference strategies outlined in Sec. 2.3, neither approach makes use of the lower-bound energy surface in Eqn. 3. As we will see, the first approach, Variational FITC, introduced in Sec. 3.1, only partially removes the asymmetry, while the second Parametric Predictive GP approach, introduced in Sec. 3.2, restores full symmetry between the training objective and the test time predictive distribution. For more discussion of this point see Sec. 3.3.

3.1. Variational FITC

Variational FITC proceeds by applying variational inference to the FITC model defined in Eqn. 12. That is, we introduce a multivariate Normal variational distribution q(u) = N(m, S) and compute the ELBO:

$$\mathcal{L}_{\rm vfitc} = \mathbb{E}_{q(\mathbf{u})}\!\left[\log p(\mathbf{y}, \mathbf{u} \mid X, Z)\right] + \mathbb{H}[q(\mathbf{u})] = \sum_{i=1}^{N} \log \mathcal{N}(y_i \mid \mu_f(\mathbf{x}_i),\; \widetilde{K}_{ii} + \sigma_{\rm obs}^2) - \tfrac{1}{2} \sum_{i=1}^{N} \frac{k_i^{\rm T} K_{MM}^{-1} S K_{MM}^{-1} k_i}{\widetilde{K}_{ii} + \sigma_{\rm obs}^2} - {\rm KL}(q(\mathbf{u}) \,|\, p(\mathbf{u})) \qquad (15)$$

The parameters m, S, Z, as well as the observation noise σ_obs and kernel hyperparameters θ_ker can then be optimized by maximizing Eqn. 15 using gradient methods (see Algorithm 1). Note that since the inducing point locations Z appear in the model, this is properly understood as a parametric model.

We note that, unlike FITC, the objective in Eqn. 15 readily supports mini-batch training, and is thus suitable for very large datasets. Below we refer to this method as VFITC.
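For concreteness, here is a minimal numpy sketch (ours) of Eqn. 15 written in terms of precomputed per-datapoint quantities; the argument names are our own, and kl_term denotes KL(q(u) | p(u)) computed e.g. as in the SVGP sketch above. Setting beta_reg = 1 recovers Eqn. 15 exactly; a tunable beta_reg matches the variant searched over in Sec. C.1.

```python
import numpy as np

def vfitc_objective(y, mu_f, diag_Ktilde, quad_S, sigma_obs, kl_term, beta_reg=1.0):
    # Eqn. 15, written in terms of precomputed per-datapoint quantities:
    #   mu_f[i]        = k_i^T K_MM^{-1} m                 (Eqn. 7)
    #   diag_Ktilde[i] = K~_ii                             (diagonal of Eqn. 4)
    #   quad_S[i]      = k_i^T K_MM^{-1} S K_MM^{-1} k_i
    var_i = diag_Ktilde + sigma_obs**2                     # per-datapoint likelihood variance
    log_lik = -0.5 * np.log(2 * np.pi * var_i) - 0.5 * (y - mu_f)**2 / var_i
    correction = -0.5 * quad_S / var_i                     # S-dependent term of Eqn. 15
    return np.sum(log_lik + correction) - beta_reg * kl_term
```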

3.2. Parametric Predictive GP Regression

As we discuss further in the next section, using the VFITC objective only partially addresses the asymmetric treatment of σ²_obs and σ_f(x*)² in Eqn. 6. We now describe our second approach to scalable GP regression, which is motivated by the goal of restoring full symmetry between the training objective and the test time predictive distribution. To accomplish this, we introduce a parametric GP regression model formulated directly in terms of the family of predictive distributions given in Eqn. 11. We then introduce an objective function based on the KL divergence between p(y|x) and p_data(y|x):

$$\mathcal{L}'_{\rm ppgpr} = -\mathbb{E}_{p_{\rm data}(\mathbf{x})}\, {\rm KL}(p_{\rm data}(y \mid \mathbf{x}) \,||\, p(y \mid \mathbf{x})) \;\to\; \mathbb{E}_{p_{\rm data}(y, \mathbf{x})}\!\left[\log p(y \mid \mathbf{x})\right] \qquad (16)$$

where p_data(y, x) is the empirical distribution over training data. In the second line we have dropped the entropy term −E_{p_data(y|x)}[log p_data(y|x)], since it is a constant w.r.t. the maximization problem.


We obtain our final objective function by adding an optional regularization term modulated by a positive constant β_reg > 0:

$$\mathcal{L}_{\rm ppgpr} = \mathbb{E}_{p_{\rm data}(y, \mathbf{x})}\!\left[\log p(y \mid \mathbf{x})\right] - \beta_{\rm reg}\, {\rm KL}(q(\mathbf{u}) \,||\, p(\mathbf{u})) \qquad (17)$$

Note that apart from the regularization term, maximizing this objective function corresponds to doing maximum likelihood estimation of a parametric model defined by p(y|x). The objective in Eqn. 17 can be expanded as

$$\mathcal{L}_{\rm ppgpr} = \sum_{i=1}^{N} \log \mathcal{N}(y_i \mid \mu_f(\mathbf{x}_i),\; \sigma_{\rm obs}^2 + \sigma_f(\mathbf{x}_i)^2) - \beta_{\rm reg}\, {\rm KL}(q(\mathbf{u}) \,||\, p(\mathbf{u})) \qquad (18)$$

where σ_f(x_i)² = Var[f_i|x_i] is the function variance defined in Eqn. 8. The parameters m, S, Z, as well as the observation noise σ_obs and kernel hyperparameters θ_ker can then be optimized by maximizing Eqn. 17 using gradient methods (see Algorithm 1). We refer to this model class as PPGPR.

When β_reg = 1 the form of the objective in Eqn. 17 can be motivated by a connection to Expectation Propagation (Minka, 2004); see Sec. G in the supplementary materials and (Li and Gal, 2017) for further discussion.⁶
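The corresponding numpy sketch (ours) of Eqn. 18 makes the symmetry explicit: the function variance now sits inside the likelihood variance, so the data fit term matches the predictive distribution of Eqn. 11. Argument names are our own.

```python
import numpy as np

def ppgpr_objective(y, mu_f, var_f, sigma_obs, kl_term, beta_reg=1.0):
    # Eqn. 18: log N(y_i | mu_f(x_i), sigma_obs^2 + sigma_f(x_i)^2) summed over the
    # data, minus the beta-scaled KL regularizer; var_f[i] = sigma_f(x_i)^2 from Eqn. 8
    total_var = sigma_obs**2 + var_f
    log_lik = -0.5 * np.log(2 * np.pi * total_var) - 0.5 * (y - mu_f)**2 / total_var
    return np.sum(log_lik) - beta_reg * kl_term
```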

3.3. Discussion

What are the implications of employing the scalable GP regressors described in Sec. 3.1 and Sec. 3.2?⁷ It is helpful to compare the objective functions in Eqn. 15 and Eqn. 17 to the SVGP objective in Eqn. 6. In Eqn. 15 we obtain a data fit term

$$\mathcal{L}_{\rm vfitc} \supset -\tfrac{1}{2}\, \frac{1}{\sigma_{\rm obs}^2 + \widetilde{K}_{ii}}\, \left| y_i - \mu_f(\mathbf{x}_i) \right|^2 \qquad (19)$$

while in Eqn. 18 we obtain a data fit term

$$\mathcal{L}_{\rm ppgpr} \supset -\tfrac{1}{2}\, \frac{1}{\sigma_{\rm obs}^2 + \sigma_f(\mathbf{x}_i)^2}\, \left| y_i - \mu_f(\mathbf{x}_i) \right|^2 \qquad (20)$$

In contrast in SVGP the corresponding data fit term is

$$\mathcal{L}_{\rm svgp} \supset -\tfrac{1}{2}\, \frac{1}{\sigma_{\rm obs}^2}\, \left| y_i - \mu_f(\mathbf{x}_i) \right|^2 \qquad (21)$$

with σ²_obs in the denominator. We reiterate that in all three approaches the predictive variance Var[y*|x*] at an input x* is given by the formula

$${\rm Var}[y_* \mid \mathbf{x}_*] = \sigma_{\rm obs}^2 + \sigma_f(\mathbf{x}_*)^2 \qquad (22)$$

where σ_f(x*)² is the latent function variance at x* (see Eqn. 8). Thus SVGP—and, more generally, variational inference for any regression model with a Normal likelihood—makes an arbitrary⁸ distinction between the observation variance σ²_obs and the latent function variance σ_f(x*)², even though both terms contribute symmetrically to the total predictive variance in Eqn. 22. Moreover, this asymmetry will be inherited by any method that makes use of Eqn. 3, e.g. the MAP procedure described in Sec. 2.3.2.

⁶We thank Thang Bui for pointing out this connection.
⁷See Sec. D in the supplementary materials for a comparison to exact GPs.
⁸Arbitrary from the point of view of output space. Depending on the particular application and structure of the model, distinctions between different contributions to the predictive variance may be of interest.

When the priority is predictive performance—the typical case in machine learning—this asymmetric treatment of the two contributions to the predictive variance is troubling. As we will see in experiments (Sec. 5), the difference between Eqn. 20 and Eqn. 21 has dramatic consequences. In particular the data fit term in SVGP does nothing to encourage function variance σ_f(x*). As a consequence for many datasets σ_f(x*) ≪ σ_obs; i.e. most of the predictive variance is explained by the input-independent observation noise.

In contrast, the data fit term in Eqn. 20 directly encourages large σ_f(x*), typically resulting in behavior opposite to that of SVGP, i.e. σ_f(x*) ≫ σ_obs. This is gratifying because—after having gone to the effort to introduce an input-dependent kernel and learn an appropriate geometry on the input space—we end up with predictive variances that make substantial use of the input-dependent kernel.

The case of Variational FITC (see Eqn. 19) is intermediate in that the objective function directly incentivizes non-negligible latent function variance through the term K̃_ii(x_i), while the S-dependent contribution to σ_f(x_i) is still treated asymmetrically (since it does not appear in the data fit term).

As argued by (Bauer et al., 2016), methods based on FITC—and by extension our Parametric Predictive GP approach—tend to underestimate the observation noise σ²_obs, while variational methods like SVGP or the closely related method introduced by (Titsias, 2009) tend to overestimate σ²_obs. While the possibility of underestimating σ²_obs is indeed a concern for our approach, our empirical results in Sec. 5 suggest that, this tendency notwithstanding, our methods yield excellent predictive performance.

3.4. Additional Variants

A number of variants to the Parametric Predictive GP approach outlined in Sec. 3.2 immediately suggest themselves. One possibility is to take the formal limit S → 0 in q(u). In this limit q(u) is a Dirac delta distribution, the function variance σ_f(x_i)² → K̃_ii, and the number of parameters is now linear in M instead of quadratic.⁹ Below we refer to this variant as PPGPR-δ.

⁹Additionally in the regularizer we make the replacement −β_reg KL(q(u)|p(u)) → β_reg log p(u).


Figure 2. We depict test negative log likelihoods (NLL) for twelve univariate regression datasets (lower is better): Pol, Elevators, Bike, Kin40K, Protein, Keggdir., Slice, Keggundir., 3Droad, Song, Buzz, and Houseelectric. Methods compared: MAP, SVGP, γ-Robust, OD-SVGP, VFITC, PPGPR, and Exact. Results are averaged over ten random train/test/validation splits. See Sec. 5.1 for details. Here and elsewhere horizontal error bars depict standard errors.

Figure 3. We depict test root mean squared errors (RMSE) for the same twelve univariate regression datasets and methods as in Fig. 2 (lower is better). Results are averaged over ten random train/test/validation splits. See Sec. 5.1 for details.

Another possibility is to restrict the covariance matrix S in q(u) to be diagonal; we refer to this 'mean field' variant as PPGPR-MF. Yet another possibility is to decouple the parametric mean function µ_f(x) and variance function σ_f(x)² used to define p(y|x) in Eqn. 17, i.e. use separate inducing point locations Z_µ and Z_σ for additional flexibility. This is conceptually similar to the decoupled approach introduced in Cheng and Boots (2017), with the difference that our parametric modeling approach is not constrained by the need to construct a well-defined variational inference problem in an infinite-dimensional RKHS. Below we refer to this decoupled approach together with a diagonal covariance S as PPGPR-MFD. We refer to the variant of PPGPR that is closest to SVGP (because it utilizes a full-rank covariance matrix S = LLᵀ) as PPGPR-Chol.
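A minimal PyTorch sketch (ours) of how the covariance S can be parameterized in the different variants; the use of softplus to keep diagonals positive is our assumption about one reasonable choice, not a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

M = 1000   # number of inducing points (M = 1000 in the experiments of Sec. 5.1)

# PPGPR-delta: the formal limit S -> 0; only the mean parameters m remain, O(M) parameters
m = torch.zeros(M, requires_grad=True)

# PPGPR-MF ("mean field"): diagonal S parameterized by an unconstrained vector, O(M) extra parameters
raw_s_diag = torch.zeros(M, requires_grad=True)
S_mf = torch.diag(F.softplus(raw_s_diag) ** 2)

# PPGPR-Chol: full-rank S = L L^T via a lower-triangular factor, O(M^2) parameters;
# the diagonal of L is made positive so that S is positive definite
raw_L = torch.eye(M, requires_grad=True)
L = torch.tril(raw_L, diagonal=-1) + torch.diag(F.softplus(torch.diagonal(raw_L)))
S_chol = L @ L.T

# PPGPR-MFD additionally uses separate inducing point locations Z_mu and Z_sigma
# for the mean function and the (diagonal-S) variance function (see Appendix A)
```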

A number of other variants are also possible. For example we might replace the KL regularizer in Eqn. 17 with another divergence, for example a Renyi divergence (Knoblauch et al., 2019). Alternatively we could use another divergence measure in Eqn. 16—e.g. the gamma divergence used in Sec. 2.3.3—to control the qualitative features of p_data(y|x) that we would like to capture in p(y|x). We leave the exploration of these and other variants to future work.

4. Related Work

The use of pseudo-inputs and inducing point methods to scale up Gaussian Process inference has spawned a large literature, especially in the context of variational inference (Csato and Opper, 2002; Seeger et al., 2003; Quinonero-Candela and Rasmussen, 2005; Snelson and Ghahramani, 2006; Titsias, 2009; Hensman et al., 2013; 2015a; Cheng and Boots, 2017). While variational inference remains the most popular inference algorithm in the scalable GP setting, researchers have also explored different variants of Expectation Propagation (Hernandez-Lobato and Hernandez-Lobato, 2016; Bui et al., 2017) as well as Stochastic gradient Hamiltonian Monte Carlo (Havasi et al., 2018) and other MCMC algorithms (Hensman et al., 2015b). For a recent review of scalable methods for GP inference we refer the reader to (Liu et al., 2018).

Our approach bears some resemblance to that described in (Raissi et al., 2019) in that, like them, we consider GP regression models that are parametric in nature. There are important differences, however. First, because Raissi et al. (2019) consider a two-step training procedure that does not benefit from a single coherent objective, they are unable to learn inducing point locations Z. Second, inconsistent treatment of latent function uncertainty between the two training steps degrades the quality of the predictive uncertainty. For these reasons we find that the approach in (Raissi et al., 2019) significantly underperforms¹⁰ all the other methods we consider and hence we do not consider it further.

¹⁰Both in terms of RMSE and log likelihood (especially the latter, with the predictive uncertainty drastically underestimated in some cases). See Sec. D for details.


Several of our objective functions can also be seen as instances of Direct Loss Minimization, which emerges from a view of approximate inference as regularized loss minimization (Sheth and Khardon, 2016; 2017; 2019).¹¹ We refer the reader to Sec. F in the supplementary materials for further discussion of this important connection.

¹¹Note, however, that this interpretation is not applicable to our best performing model class, PPGPR-MFD, which benefits from mean and variance functions that are decoupled.

Our focus on the predictive distribution also recalls (Snelson and Ghahramani, 2005), in which the authors construct parsimonious approximations to Bayesian predictive distributions. Their approach differs from the approach adopted here, since the posterior distribution is still computed (or approximated) as an intermediate step, whereas in PPGPR we completely bypass the posterior. Similarly PPGPR recalls (Gordon et al., 2018), where the authors consider training objectives that explicitly target posterior predictive distributions in the context of meta-learning.

5. Experiments

In this section we compare the empirical performance of the approaches to scalable GP regression introduced in Sec. 3 to the baseline inference strategies described in Sec. 2.3. All our models use a prior mean of zero and a Matern kernel with independent length scales for each input dimension.¹²

¹²For an implementation of our method in GPyTorch see https://git.io/JJy9b.

5.1. Univariate regression

We consider a mix of univariate regression datasets from the UCI repository (Dua and Graff, 2017), with the number of datapoints ranging from N ∼ 10⁴ to N ∼ 10⁶ and the number of input dimensions in the range dim(x) ∈ [3, 380]. Among the methods introduced in Sec. 3, we focus on Variational FITC (Sec. 3.1) and PPGPR-MFD (Sec. 3.4). In addition to the baselines reviewed in Sec. 2.3, we also compare to the orthogonal parameterization of the basis decoupling method described in Cheng and Boots (2017) and Salimbeni et al. (2018), which we refer to as OD-SVGP. For all but the two largest datasets we also compare to Exact GP inference, leveraging the conjugate gradient approach described in (Wang et al., 2019).

We summarize the results in Fig. 2-4 and Table 1 (see the supplementary materials for additional results). Both our approaches yield consistently lower negative log likelihoods (NLLs) than the baseline approaches, with PPGPR performing particularly well. Averaged across all twelve datasets, PPGPR outperforms the strongest baseline, OD-SVGP, by ∼ 0.35 nats. Interestingly on most datasets PPGPR outperforms Exact GP inference. We hypothesize that this is at least partially due to the ability of our parametric models to encode heteroscedasticity, a "bonus" feature that was already noted by Snelson and Ghahramani (2006). Perhaps surprisingly, we note that on most datasets MAP yields comparable NLLs to SVGP. Indeed averaged across all twelve datasets, SVGP only outperforms MAP by ∼ 0.05 nats.

Table 1. Average ranking of methods (lower is better). CRPS is the Continuous Ranked Probability Score, a popular calibration metric for regression (Gneiting and Raftery, 2007). Rankings are averages across datasets and splits. See Sec. 5.1 for details.

          MAP    γ-Robust   SVGP   OD-SVGP   VFITC   PPGPR
  NLL     5.47   4.76       4.33   3.08      2.20    1.17
  RMSE    4.53   4.59       2.92   2.49      4.35    2.12
  CRPS    5.55   4.12       4.21   3.20      2.90    1.02

The results for root mean squared errors (RMSE) exhibit somewhat less variability (see Fig. 3), with PPGPR performing the best and OD-SVGP performing second best among the scalable methods. In particular, in aggregate PPGPR attains the lowest RMSE (see Table 1), though it is outperformed by other methods on 2/12 datasets. We hypothesize that the RMSE performance of VFITC could be substantially improved if the variational distribution took the structured form that is implicitly used in OD-SVGP. Not surprisingly Exact GP inference yields the lowest RMSE for most datasets.

Strikingly, both VFITC and PPGPR yield predictive variances that are in a qualitatively different regime than those resulting from the scalable inference baselines. Fig. 4 depicts the fraction of the total predictive variance Var[y*|x*] due to the observation noise σ²_obs. The predictive variances from the baseline methods make relatively little use of input-dependent function uncertainty, instead relying primarily on the observation noise. By contrast the variances of our parametric GP regressors are dominated by function uncertainty. This substantiates the discussion in Sec. 3.3. Additionally, this observation explains the similar log likelihoods exhibited by SVGP and MAP: since neither method makes much use of function uncertainty, the uncertainty encoded in q(u) is of secondary importance.¹³

¹³Note that MAP can be viewed as a degenerate limit of SVGP in which q(u) is a Dirac delta function.

Finally we note that for most datasets PPGPR prefers small values of β_reg. This suggests that choosing M ≪ N is sufficient for ensuring well-regularized models and that overfitting is not much of a concern in practice.

5.2. PPGPR Ablation Study

Encouraged by the predictive performance of PPGPR-MFD in Sec. 5.1, we perform a detailed comparison of the different PPGPR variants introduced in Sec. 3.4. Our results are summarized in Table 2 (see supplementary materials for additional figures).


Figure 4. We depict the mean fraction of the predictive variance that is due to the observation noise, σ²_obs / Var[y*|x*], as measured on the test set, for each method and dataset. Results are averaged over ten random train/test/validation splits. See Sec. 5.1 for details.

Figure 5. We visualize the calibration of three probabilistic deep learning models fit to trip data in three cities. For each method we depict the empirical CDF of the z-scores z = (y*_k − µ_f(x*_k)) / σ(x*_k) computed on the test set {(x*_k, y*_k)}. The "Ideal CDF" is the Normal CDF, which corresponds to the best possible calibration for a model with a Normal predictive distribution. See Sec. 5.3 for details.

We find that as we increase the capacity of the variance function σ_f(x)² (i.e. PPGPR-δ ⇒ PPGPR-MF ⇒ PPGPR-Chol) the test NLL tends to decrease, while the test RMSE tends to increase. This makes sense since PPGPR prefers large predictive uncertainty in regions of input space where good data fit is hard to achieve. Consequently, there is less incentive to move the predictive mean function away from the prior in those regions, which can then result in higher RMSEs; this effect becomes more pronounced as the variance function becomes more flexible. In contrast, since the mean and variance functions are entirely decoupled in the case of PPGPR-MFD (apart from shared kernel hyperparameters), this model class obtains low NLLs without sacrificing performance on RMSE.

Note, finally, that PPGPR-δ and PPGPR-Chol utilize the same family of predictive distributions as MAP and SVGP, respectively, and only differ in the objective function used during training. If we compare these pairs of methods in Table 2, we find that PPGPR-δ and PPGPR-Chol yield the best log likelihoods, with PPGPR-Chol exhibiting degraded RMSE performance for the reason described in the previous paragraph. Indeed these observations motivated the introduction of PPGPR-MFD.

Table 2. Average rank of PPGPR variants (lower is better).

          MAP    SVGP   PPGPR-δ   PPGPR-MF   PPGPR-Chol   PPGPR-MFD
  NLL     5.90   5.08   3.96      2.66       1.59         1.81
  RMSE    3.64   2.36   3.65      4.16       5.38         1.82
  CRPS    5.72   4.78   3.87      2.77       2.47         1.40

5.3. Calibration in DKL Regression

In this section we demonstrate that PPGPR (introduced in Sec. 3.2) offers an effective mechanism for calibrating deep neural networks for regression. To evaluate the potential of this approach, we utilize a real-world dataset of vehicular trip durations in three large cities. In this setting, uncertainty estimation is critical for managing risk when estimating transportation costs.

We compare three methods: i) deep kernel learning (Wilson et al., 2016) using SVGP (SVGP+DKL); ii) deep kernel learning using PPGPR-MFD (PPGPR+DKL); and iii) MC-Dropout (Gal and Ghahramani, 2016), a popular method for calibrating neural networks that does not rely on Gaussian Processes.

In Fig. 5 we visualize how well each of the three methods is calibrated as compared to the best possible calibration for a model with Normal predictive distributions. Overall PPGPR+DKL performs the best, with MC-Dropout outperforming or matching SVGP+DKL. Using PPGPR has a number of additional advantages over MC-Dropout. In particular, because the predictive variances can be computed analytically for Gaussian Process models, the PPGPR+DKL model is significantly faster at test time than MC-Dropout, which requires forwarding data points through many sampled models (here 50). This is impractically slow for many applied settings, especially for large neural networks.
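A short sketch (ours, assuming scipy) of the calibration diagnostic shown in Fig. 5: z-scores of the test targets under a Normal predictive distribution, whose empirical CDF is compared against the standard Normal CDF.

```python
import numpy as np
from scipy.stats import norm

def calibration_curve(y_test, mu_pred, sigma_pred, grid=np.linspace(-3, 3, 61)):
    # z-scores z_k = (y*_k - mu_f(x*_k)) / sigma(x*_k), as in Fig. 5
    z = (y_test - mu_pred) / sigma_pred
    empirical_cdf = np.array([np.mean(z <= t) for t in grid])
    ideal_cdf = norm.cdf(grid)    # best possible calibration for a Normal predictive distribution
    return grid, empirical_cdf, ideal_cdf
```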

6. Conclusions

Gaussian Process regression with a Normal likelihood represents a peculiar case in that: i) we can give an analytic formula for the exact posterior predictive distribution; but ii) it is impractical to compute for large datasets. In this work we have argued that if our goal is high quality predictive distributions, it is sensible to bypass posterior approximations and directly target the quantity of interest. While this may be a bad strategy for an arbitrary probabilistic model, in the case of GP regression inducing point methods provide a natural family of parametric predictive distributions whose capacity can be controlled to prevent overfitting. As we have shown empirically, the resulting predictive distributions exhibit significantly better calibrated uncertainties and higher log likelihoods than those obtained with other methods, which tend to yield overconfident uncertainty estimates that make little use of the kernel.

More broadly, our empirical results suggest that the good predictive performance of approximate GP methods like SVGP may have more to do with the power of kernel methods than adherence to Bayes. The approach we have adopted here can be viewed as maximum likelihood estimation for a model with a high-powered likelihood. This likelihood leverages flexible mean and variance functions whose parametric form is motivated by inducing point methods. We suspect that a good ansatz for the variance function is of particular importance for good predictive uncertainty. We explore one possible alternative—and much more flexible—parameterization in Jankowiak et al. (2020). Here we suggest that one simple and natural variant of PPGPR would replace the mean function with a flexible neural network so that inducing points are only used to define the variance function.

ACKNOWLEDGEMENTS

MJ would like to thank Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas for engaging conversations about robust variational inference. MJ would also like to thank Felipe Petroski Such for help with Uber compute infrastructure. This work was completed while MJ and JG were at Uber AI.

References

Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse gaussian process approximations. In Advances in Neural Information Processing Systems, pages 1533–1541, 2016.
Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman. Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973–978, 2019.
Thang D Bui, Josiah Yan, and Richard E Turner. A unifying framework for gaussian process pseudo-point approximations using power expectation propagation. The Journal of Machine Learning Research, 18(1):3649–3720, 2017.
Ching-An Cheng and Byron Boots. Variational inference for gaussian process models with linear complexity. In Advances in Neural Information Processing Systems, pages 5184–5194, 2017.
Andrzej Cichocki and Shun-ichi Amari. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
Lehel Csato and Manfred Opper. Sparse on-line gaussian processes. Neural Computation, 14(3):641–668, 2002.
Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. GPyTorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Advances in Neural Information Processing Systems, pages 7576–7586, 2018.
Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard E Turner. Meta-learning probabilistic inference for prediction. arXiv preprint arXiv:1805.09921, 2018.
Marton Havasi, Jose Miguel Hernandez-Lobato, and Juan Jose Murillo-Fuentes. Inference in deep gaussian processes using stochastic gradient hamiltonian monte carlo. In Advances in Neural Information Processing Systems, pages 7506–7516, 2018.
James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational gaussian process classification. 2015a.
James Hensman, Alexander G Matthews, Maurizio Filippone, and Zoubin Ghahramani. MCMC for variationally sparse gaussian processes. In Advances in Neural Information Processing Systems, pages 1648–1656, 2015b.
Daniel Hernandez-Lobato and Jose Miguel Hernandez-Lobato. Scalable gaussian process classification via expectation propagation. In Artificial Intelligence and Statistics, pages 168–176, 2016.
Jose Miguel Hernandez-Lobato, Yingzhen Li, Mark Rowland, Daniel Hernandez-Lobato, Thang Bui, and Richard Turner. Black-box α-divergence minimization. 2016.
Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
Martin Jankowiak, Geoff Pleiss, and Jacob R Gardner. Deep sigma point processes. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, 2020.
Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Jeremias Knoblauch. Robust deep gaussian processes. arXiv preprint arXiv:1904.02303, 2019.
Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference. arXiv preprint arXiv:1904.02063, 2019.
Yingzhen Li and Yarin Gal. Dropout inference in bayesian neural networks with alpha-divergences. In Proceedings of the 34th International Conference on Machine Learning, pages 2052–2061. JMLR.org, 2017.
Yingzhen Li, Jose Miguel Hernandez-Lobato, and Richard E Turner. Stochastic expectation propagation. In Advances in Neural Information Processing Systems, pages 2323–2331, 2015.
Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai. When gaussian process meets big data: A review of scalable gps. arXiv preprint arXiv:1807.01065, 2018.
Alexander Graeme de Garis Matthews. Scalable Gaussian process inference using variational methods. PhD thesis, University of Cambridge, 2017.
Thomas Minka. Power EP. Dep. Statistics, Carnegie Mellon University, Pittsburgh, PA, Tech. Rep., 2004.
Joaquin Quinonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
Maziar Raissi, Hessam Babaee, and George Em Karniadakis. Parametric gaussian process regression for big data. Computational Mechanics, 64(2):409–416, 2019.
Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
Hugh Salimbeni, Ching-An Cheng, Byron Boots, and Marc Deisenroth. Orthogonally decoupled variational gaussian processes. In Advances in Neural Information Processing Systems, pages 8711–8720, 2018.
Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse gaussian process regression. Technical report, 2003.
Rishit Sheth and Roni Khardon. Monte carlo structured svi for two-level non-conjugate models. arXiv preprint arXiv:1612.03957, 2016.
Rishit Sheth and Roni Khardon. Excess risk bounds for the bayes risk using variational inference in latent gaussian models. In Advances in Neural Information Processing Systems, pages 5151–5161, 2017.
Rishit Sheth and Roni Khardon. Pseudo-bayesian learning via direct loss minimization with applications to sparse gaussian process models. 2019.
Edward Snelson and Zoubin Ghahramani. Compact approximations to bayesian predictive distributions. In Proceedings of the 22nd International Conference on Machine Learning, pages 840–847. ACM, 2005.
Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.
R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models, chapter 5, pages 109–130. Cambridge University Press, 2011.
Ke Alexander Wang, Geoff Pleiss, Jacob R Gardner, Stephen Tyree, Kilian Q Weinberger, and Andrew Gordon Wilson. Exact gaussian processes on a million data points. arXiv preprint arXiv:1903.08114, 2019.
Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.


A. Details on PPGPR-MFD

The regularizer we use for PPGPR-MFD is given by

$$\mathcal{L}_{\rm reg} = \beta_{\rm reg} \Big\{ \log \mathcal{N}(\mathbf{m} \mid \mathbf{0}, K^\mu) - {\rm KL}(\mathcal{N}(\mathbf{0}, S) \,|\, \mathcal{N}(\mathbf{0}, K^\sigma)) \Big\} \to -\frac{\beta_{\rm reg}}{2} \Big\{ {\rm Tr}\, S (K^\sigma)^{-1} + \mathbf{m}^{\rm T} (K^\mu)^{-1} \mathbf{m} + \log \det K^\sigma + \log \det K^\mu - \log \det S \Big\} \qquad (23)$$

with K^σ ≡ K(Z_σ, Z_σ) and K^µ ≡ K(Z_µ, Z_µ), and where we have dropped irrelevant constants. Here Z_µ and Z_σ are the decoupled inducing point locations used to define the mean function and variance function, respectively. This regularizer can be viewed as the sum of two multivariate Normal KL divergences, one of which involves a Dirac delta distribution (with divergent terms dropped).

As discussed in Sec. 3.4, the mean and variance functions for PPGPR-MFD are given by

$$\mu_f(\mathbf{x}_i) = k_i^{\mu\,{\rm T}} (K^\mu)^{-1} \mathbf{m} \qquad (24)$$

and

$$\sigma_f(\mathbf{x}_i)^2 = \widetilde{K}^\sigma_{ii} + k_i^{\sigma\,{\rm T}} (K^\sigma)^{-1} S (K^\sigma)^{-1} k_i^\sigma \qquad (25)$$

where the various kernels in Eqn. 24 and Eqn. 25 share the same hyperparameters and S is diagonal.
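A hedged numpy sketch (ours) of Eqns. 24-25 for a single input x; mfd_mean_and_variance and its argument names are our own, and S_diag holds the diagonal of the (diagonal) covariance S.

```python
import numpy as np

def mfd_mean_and_variance(x, Z_mu, Z_sigma, m, S_diag, kernel, jitter=1e-6):
    # Eqns. 24-25: decoupled mean and variance functions that share kernel hyperparameters
    K_mu = kernel(Z_mu, Z_mu) + jitter * np.eye(Z_mu.shape[0])
    K_sig = kernel(Z_sigma, Z_sigma) + jitter * np.eye(Z_sigma.shape[0])
    k_mu = kernel(x[None, :], Z_mu)[0]            # k_i^mu
    k_sig = kernel(x[None, :], Z_sigma)[0]        # k_i^sigma
    mu_f = k_mu @ np.linalg.solve(K_mu, m)                        # Eqn. 24
    b = np.linalg.solve(K_sig, k_sig)                             # (K^sigma)^{-1} k_i^sigma
    Ktilde_ii = kernel(x[None, :], x[None, :])[0, 0] - k_sig @ b
    var_f = Ktilde_ii + b @ (S_diag * b)                          # Eqn. 25 with diagonal S
    return mu_f, var_f
```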

B. Time and Space Complexity

We briefly describe the time and space complexity of the main algorithms for scalable GP regression discussed in this work. We note that our complexity analysis is similar to that of most sparse Gaussian process methods, such as SVGP (Hensman et al., 2013).

Training complexity. Training a VFITC model requires optimizing Eqn. 15, and training PPGPR models requires optimizing Eqn. 17. First, we note that the KL(q(u)|p(u)) term is the KL divergence between two multivariate Gaussians, which requires O(m³) time complexity and O(m²) space complexity. For the remaining terms, the main computational bottleneck in both of these equations is computing the term K_MM^{-1}. This is usually accomplished by computing its Cholesky factor L, which takes O(m³) computation given m inducing points. After the Cholesky factor has been computed, all matrix solves involving K_MM^{-1} require only O(m²) computation. The three main terms in both objective functions (µ_f(x_i), K̃_ii, and k_i^T K_MM^{-1} S K_MM^{-1} k_i) each require a constant number of matrix solves for each data point. All together, the total time complexity of each training iteration is therefore O(m³ + bm²), where b is the number of data points in a minibatch. The space complexity is O(m² + bm) — the size of storing K_MM, its Cholesky factor, and all k_i vectors.

Prediction complexity. The predictive distributions for VFITC and PPGPR are given by Eqn. 11. Again, the terms µ_f(x_i) and σ_f(x_i)² = K̃_ii + k_i^T K_MM^{-1} S K_MM^{-1} k_i require a constant number of matrix solves with K_MM^{-1} for each x_i. However, we can cache and re-use the Cholesky factor of K_MM for all predictive distribution computations, since (after training) we are not updating the inducing point locations or the kernel hyperparameters. Therefore, each predictive distribution has a time complexity of O(m²) after the one-time cost of computing/caching the Cholesky factor. It is also worth noting that predictive means can be computed in O(m) time by caching the vector K_MM^{-1} m. The space complexity is also O(m²) (the size of the Cholesky factor).

Note that each of our proposed variants in Section 3.4 has the same computational complexity, as each variant simply modifies the form of S, which is not the computational bottleneck. We do note that the MFD variant requires roughly double the amount of computation and storage, as we are storing/performing solves with two K_MM matrices (one for the predictive means and one for the predictive variances). Finally, we note that the whitening parameterization (see Appendix E) has the same computational/storage complexity, as it also requires computing/storing a Cholesky factor.

C. Experimental Details

We use zero mean functions and Matern kernels with independent length scales for each input dimension throughout. All models and experiments are implemented using the GPyTorch framework (Gardner et al., 2018) and the Pyro probabilistic programming language (Bingham et al., 2019).

C.1. Univariate regression

We use the Adam optimizer for optimization with an initial learning rate of ℓ = 0.01 that is progressively decimated over the course of training (Kingma and Ba, 2014). We use a mini-batch size of B = 10⁴ for the Buzz, Song, 3droad and Houseelectric datasets and B = 10³ for all other datasets. We train for 400 epochs except for the Houseelectric dataset where we train for 200 epochs. Except for the Exact results, we do 10 train/test/validation splits on all datasets (always in the proportion 15:3:2, respectively). In particular for the Exact results we do 3 train/test/validation splits on the smaller datasets and one split for the two largest (3Droad and Song). All datasets are standardized in both input and output space; thus a predictive distribution concentrated at zero yields a root mean squared error of unity. We use M = 1000 inducing points initialized with kmeans. In the case of OD-SVGP and PPGPR-MFD we use M = 1000 inducing points for the mean and M = 1000 inducing points for the (co)variance. We use the validation set to determine a small set of hyperparameters. In particular for SVGP, OD-SVGP, and VFITC we search over β_reg ∈ {0.1, 0.3, 0.5, 1.0}. For γ-Robust we search over γ ∈ {1.01, 1.03, 1.05, 1.07} (with β_reg = 1). For all PPGPR variants we search over β_reg ∈ {0.01, 0.05, 0.2, 1.0}. For MAP we fix β_reg = 1.

C.2. PPGPR Ablation

The experimental procedure for the results reported in Sec. 5.2 follows the procedure described in Sec. C.1.

C.3. DKL calibration

MC-Dropout has two hyperparameters that must be set by hand: a dropout proportion p and a prior variance inflation term τ. These were set by temporarily removing a portion of the training data as a validation set and performing a small grid search on each dataset over p ∈ [0.05, 0.1, 0.15, 0.2, 0.25] and τ ∈ [0.05, 0.1, 0.25, 0.5, 0.75, 1.0]. For PPGPR-MFD, β ∈ [0.05, 0.2, 0.5, 1.0] was chosen in a similar fashion. The datasets for each of the three cities contain 30 features that encode various aspects of a trip like origin and destination location, time of day and week, as well as various rudimentary routing features.

For all three methods, we use the same five layer fully connected neural network, with hidden representation sizes of [256, 256, 128, 128, 64] and ReLU nonlinearities. We use the Adam optimizer and use an initial learning rate of ℓ = 0.01, which we drop by a factor of 0.1 at 100 and 150 epochs. We train for 200 epochs for all three methods. For the GP methods, we use M = 1024 inducing points, initialized by randomly selecting training data points and passing them through the initial feature extractor.

D. Additional Experimental Results

In Fig. 6-9 we depict summary results for all the experiments in Sec. 5.1-5.2 in the main text. In particular in Fig. 9 we depict Continuous Ranked Probability Scores (Gneiting and Raftery, 2007). We also include a complete compilation of our results in Table 3.

The effect of β_reg. We find empirically that PPGPR is robust to the value of β_reg. On the five smallest UCI datasets, the test RMSE varies by no more than 5% (relative) and the LL by no more than 0.05 nats as we vary β_reg from 0 to 1. This is not unexpected, since we are always in the regime M ≪ N.

Comparison to Raissi et al.  We do not include a full comparison to the method in Raissi et al. (2019) because we find that it is outperformed by all the other baselines we consider. In particular, while this method can achieve middling RMSEs, on many datasets it achieves very poor log likelihoods. For example, using the authors' implementation^{14} we find a test NLL of ∼26 nats on the Elevators dataset and ∼6 nats on the Pol dataset. This performance can be traced to the incoherency of the two-objective approach adopted by this method. In particular, the logic of the derivation makes it unclear whether the noise term in the kernel should be included at test time; this ambiguity is reflected in the authors' code,^{15} where this contribution is commented out by default. While the NLL performance can be improved by including this term, the larger point is that this approach does not provide a coherent account of function uncertainty.

In-Sample Comparison to Exact GPs.  In this section, we compare the fit of an exact GP, of SVGP, and of PPGPR to samples drawn from a Gaussian process prior with a Matern kernel and a Periodic kernel. For the draw from the Matern kernel, we use a lengthscale of 0.1 and an outputscale of 1. For the periodic kernel, we use a period length of 0.2 and an outputscale of 1. For both kernels, we draw the function on the range [0, 1] and fit all three models starting from default hyperparameter initializations. For the periodic kernel case, we consider an extrapolation task. The results are presented in Fig. 10. We observe that all three methods are capable of performing the extrapolation task using the periodic kernel, and in general depict qualitatively similar results. This verifies that, in very simple cases, much of the model performance can depend on the choice of kernel rather than the particular training scheme.
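As an illustration of how such in-sample comparisons can be constructed, one can draw a "ground truth" function from a GP prior on a dense grid in [0, 1] with GPyTorch; the grid size and jitter below are assumptions, while the lengthscale, outputscale, and period length match the values quoted above.

```python
import torch
import gpytorch

# Dense grid on [0, 1] (grid size is an assumption).
x = torch.linspace(0, 1, 500)

# Matern draw: lengthscale 0.1, outputscale 1.
matern = gpytorch.kernels.ScaleKernel(gpytorch.kernels.MaternKernel())
matern.base_kernel.lengthscale = 0.1
matern.outputscale = 1.0

# Periodic draw: period length 0.2, outputscale 1.
periodic = gpytorch.kernels.ScaleKernel(gpytorch.kernels.PeriodicKernel())
periodic.base_kernel.period_length = 0.2
periodic.outputscale = 1.0

with torch.no_grad():
    for kernel in (matern, periodic):
        cov = kernel(x).add_jitter(1e-4)  # small jitter for numerical stability
        prior = gpytorch.distributions.MultivariateNormal(torch.zeros(x.size(0)), cov)
        f = prior.sample()  # one function draw; fit exact GP / SVGP / PPGPR to (x, f)
```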

E. Whitened Sparse Gaussian Process Regression

The hyperparameters and variational parameters of the models can be learned by directly optimizing the objective functions in Eqn. 3 (for MAP), in Eqn. 5 (for SVGP), in Eqn. 15 (for VFITC), and in Eqn. 17 (for PPGPR). In practice, we modify these objective functions using a transformation proposed by Matthews (2017). The “whitening transformation” is a simple change of variables

u' = \Lambda_{MM}^{-1} u,

where \Lambda_{MM} is a matrix such that \Lambda_{MM} \Lambda_{MM}^{\top} = K_{MM}. (Typically, \Lambda_{MM} is taken to be the Cholesky factor of K_{MM}.) Intuitively, this transformation is advantageous because it reduces the number of changing terms in the objective functions. In whitened coordinates the prior p(u') is constant: p(u') = \mathcal{N}(0, \Lambda_{MM}^{-1} K_{MM} \Lambda_{MM}^{-\top}) = \mathcal{N}(0, I).

14. https://github.com/maziarraissi/ParametricGP
15. See line 177 in parametric_GP.py.


[Figure 6 panels: test NLL for MAP, SVGP, γ-Robust, OD-SVGP, VFITC, PPGPR-δ, PPGPR-MF, PPGPR-Chol, PPGPR-MFD, and Exact on Pol (N=11250), Elevators (N=12449), Bike (N=13034), Kin40K (N=30000), Protein (N=34297), Keggdir. (N=36620), Slice (N=40125), Keggundir. (N=47706), 3Droad (N=326155), Song (N=386508), Buzz (N=437437), and Houseelectric (N=1536960).]

Figure 6. We compare test NLLs for the various methods explored in Sec. 5.1-5.2 in the main text (lower is better). Results are averaged over ten random train/test/validation splits. Here and throughout error bars depict standard errors.

[Figure 7 panels: test RMSE for the same methods and datasets as in Figure 6.]

Figure 7. We compare test RMSEs for the various methods explored in Sec. 5.1-5.2 in the main text (lower is better). Results are averaged over ten random train/test/validation splits.

[Figure 8 panels: σ²_obs / Var[y∗|x∗] for the same methods and datasets as in Figure 6.]

Figure 8. We compare the mean fraction of the predictive variance that is due to the observation noise for the various methods explored in Sec. 5.1-5.2 in the main text. Results are averaged over ten random train/test/validation splits.

Incorporating this transformation into Eqn. 3 gives us

\log p(y, u' | X, Z) \ge \mathbb{E}_{p(f|u')}\left[\log p(y|f)\right] + \log p(u')
= \sum_{i=1}^{n} \log \mathcal{N}\!\left(y_i \,\big|\, k_i^{\top} \Lambda_{MM}^{-1} u', \; \sigma_{\rm obs}^2\right) \;-\; \frac{1}{2\sigma_{\rm obs}^2} \operatorname{Tr} \widetilde{K}_{NN} \;-\; \frac{1}{2}\|u'\|_2^2 \;-\; \frac{M}{2}\log(2\pi).

Importantly, the prior term p(u') in the modified objective does not depend on the inducing point locations Z or on the kernel hyperparameters. In all experiments we use similarly modified objectives for better optimization. For MAP and PPGPR-δ, we directly optimize the whitened variables u'. For SVGP, VFITC, PPGPR-Chol, and PPGPR-MF, the whitened variational distribution is given as q(u') = \mathcal{N}(\Lambda_{MM}^{-1} m, \, \Lambda_{MM}^{-1} S \Lambda_{MM}^{-\top}). We parameterize the mean with a vector m' = \Lambda_{MM}^{-1} m and we parameterize the covariance with a lower triangular matrix L such that L L^{\top} = \Lambda_{MM}^{-1} S \Lambda_{MM}^{-\top}. The same transformation is applied to the OD-SVGP covariance variational parameters, referred to as the β parameters by Salimbeni et al. (2018).
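The change of variables can be summarized in a few lines of illustrative PyTorch. The helper below (a sketch under the assumption that K_MM has already been evaluated as a dense tensor; the function name and jitter value are ours, not code from our experiments) maps whitened parameters (m', L) back to the un-whitened q(u) used at prediction time.

```python
import torch

def unwhiten(K_MM, m_prime, L_prime, jitter=1e-6):
    """Map whitened variational parameters (m', L') to the un-whitened
    q(u) = N(mean, cov), where mean = Lambda m' and cov = Lambda (L' L'^T) Lambda^T,
    with Lambda the (jittered) Cholesky factor of K_MM. Illustrative sketch only."""
    M = K_MM.size(-1)
    Lambda = torch.linalg.cholesky(K_MM + jitter * torch.eye(M, dtype=K_MM.dtype))
    mean = Lambda @ m_prime                                  # m = Lambda m'
    S_prime = L_prime @ L_prime.transpose(-1, -2)            # S' = L' L'^T
    cov = Lambda @ S_prime @ Lambda.transpose(-1, -2)        # S = Lambda S' Lambda^T
    return mean, cov
```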


Table 3. A compilation of all the results from Sec. 5.1-5.2. For each metric and dataset we bold the result for the best performing scalable method. Errors are standard errors.

Columns (left to right): MAP | SVGP | γ-Robust | OD-SVGP | VFITC | PPGPR-δ | PPGPR-MF | PPGPR-Chol | PPGPR-MFD | Exact

NLL
Pol:           −0.615±0.007 | −0.651±0.005 | −0.657±0.007 | −0.723±0.006 | −0.681±0.005 | −0.755±0.009 | −0.825±0.005 | −0.866±0.005 | −1.090±0.009 | −0.817±0.001
Elevators:      0.412±0.007 |  0.401±0.007 |  0.409±0.007 |  0.448±0.010 |  0.376±0.007 |  0.386±0.008 |  0.329±0.007 |  0.324±0.007 |  0.368±0.011 |  0.398±0.013
Bike:          −0.830±0.010 | −0.807±0.007 | −0.901±0.013 | −0.824±0.009 | −0.989±0.009 | −1.019±0.008 | −1.314±0.007 | −1.402±0.004 | −1.426±0.010 | −1.750±0.014
Kin40K:        −0.333±0.002 | −0.414±0.002 | −0.423±0.002 | −0.830±0.004 | −0.658±0.001 | −0.678±0.002 | −0.770±0.002 | −0.837±0.002 | −1.284±0.005 | −0.075±0.001
Protein:        0.931±0.003 |  0.902±0.003 |  0.906±0.004 |  0.892±0.006 |  0.796±0.004 |  0.794±0.004 |  0.747±0.008 |  0.735±0.007 |  0.743±0.008 |  0.875±0.002
Keggdir.:      −1.032±0.017 | −1.045±0.017 | −1.026±0.021 | −1.057±0.018 | −1.494±0.014 | −1.520±0.013 | −1.601±0.016 | −1.667±0.028 | −1.575±0.015 | −0.870±0.052
Slice:         −0.921±0.002 | −1.267±0.003 | −1.076±0.006 | −1.673±0.013 | −1.409±0.004 | −1.442±0.004 | −1.516±0.004 | −1.713±0.004 | −1.438±0.057 | −1.240±0.004
Keggundir.:    −0.694±0.006 | −0.704±0.006 | −0.688±0.007 | −0.712±0.006 | −1.643±0.016 | −1.685±0.018 | −1.807±0.018 | −1.901±0.021 | −1.801±0.013 | −0.643±0.008
3Droad:         0.336±0.002 |  0.330±0.002 |  0.325±0.001 |  0.231±0.014 |  0.045±0.003 |  0.079±0.004 | −0.122±0.002 | −0.188±0.002 | −0.297±0.003 |  0.408
Song:           1.180±0.001 |  1.178±0.001 |  1.181±0.001 |  1.168±0.001 |  1.145±0.001 |  1.143±0.001 |  1.109±0.001 |  1.104±0.001 |  1.103±0.001 |  1.139
Buzz:           0.080±0.002 |  0.069±0.001 |  0.064±0.001 |  0.044±0.002 |  0.022±0.002 |  0.020±0.002 | −0.020±0.001 | −0.037±0.001 | −0.047±0.001 |  —
Houseelectric: −1.524±0.001 | −1.543±0.001 | −1.521±0.001 | −1.559±0.002 | −1.794±0.001 | −1.822±0.001 | −1.931±0.001 | −1.978±0.001 | −2.020±0.003 |  —

RMSE
Pol:            0.115±0.001 |  0.107±0.002 |  0.115±0.002 |  0.109±0.001 |  0.104±0.002 |  0.101±0.001 |  0.111±0.002 |  0.121±0.002 |  0.077±0.001 |  0.074±0.001
Elevators:      0.364±0.002 |  0.360±0.003 |  0.362±0.002 |  0.370±0.003 |  0.360±0.003 |  0.362±0.003 |  0.362±0.003 |  0.370±0.003 |  0.361±0.003 |  0.347±0.007
Bike:           0.086±0.002 |  0.088±0.002 |  0.089±0.002 |  0.097±0.002 |  0.083±0.002 |  0.076±0.001 |  0.094±0.002 |  0.111±0.002 |  0.060±0.001 |  0.022±0.001
Kin40K:         0.156±0.001 |  0.147±0.001 |  0.152±0.001 |  0.109±0.001 |  0.156±0.001 |  0.145±0.001 |  0.188±0.002 |  0.211±0.003 |  0.126±0.001 |  0.084±0.005
Protein:        0.610±0.002 |  0.594±0.002 |  0.598±0.002 |  0.591±0.003 |  0.607±0.002 |  0.604±0.002 |  0.609±0.002 |  0.618±0.002 |  0.569±0.002 |  0.514±0.003
Keggdir.:       0.087±0.001 |  0.086±0.001 |  0.088±0.001 |  0.085±0.001 |  0.089±0.001 |  0.090±0.001 |  0.089±0.002 |  0.090±0.001 |  0.087±0.001 |  0.090±0.004
Slice:          0.071±0.001 |  0.051±0.001 |  0.070±0.003 |  0.043±0.001 |  0.054±0.001 |  0.053±0.001 |  0.095±0.001 |  0.298±0.004 |  0.032±0.001 |  0.031±0.003
Keggundir.:     0.121±0.001 |  0.120±0.001 |  0.121±0.001 |  0.119±0.001 |  0.123±0.001 |  0.123±0.001 |  0.124±0.001 |  0.125±0.001 |  0.123±0.001 |  0.119±0.002
3Droad:         0.329±0.001 |  0.329±0.001 |  0.331±0.001 |  0.303±0.004 |  0.424±0.001 |  0.403±0.005 |  0.383±0.002 |  0.389±0.001 |  0.304±0.001 |  0.057
Song:           0.788±0.001 |  0.786±0.001 |  0.788±0.001 |  0.778±0.001 |  0.782±0.001 |  0.780±0.001 |  0.777±0.001 |  0.778±0.001 |  0.770±0.001 |  0.750
Buzz:           0.265±0.001 |  0.261±0.000 |  0.264±0.001 |  0.256±0.001 |  0.281±0.001 |  0.275±0.001 |  0.271±0.001 |  0.272±0.001 |  0.283±0.001 |  —
Houseelectric:  0.052±0.000 |  0.051±0.000 |  0.052±0.000 |  0.050±0.000 |  0.053±0.000 |  0.052±0.000 |  0.053±0.000 |  0.054±0.000 |  0.046±0.000 |  —

CRPS
Pol:            0.065±0.001 |  0.061±0.000 |  0.062±0.001 |  0.059±0.000 |  0.060±0.000 |  0.057±0.001 |  0.057±0.000 |  0.057±0.000 |  0.040±0.000 |  0.051±0.000
Elevators:      0.200±0.001 |  0.198±0.001 |  0.199±0.001 |  0.203±0.001 |  0.197±0.001 |  0.198±0.001 |  0.193±0.001 |  0.195±0.001 |  0.195±0.002 |  0.195±0.003
Bike:           0.047±0.000 |  0.049±0.000 |  0.043±0.000 |  0.051±0.001 |  0.042±0.000 |  0.040±0.000 |  0.037±0.000 |  0.037±0.001 |  0.028±0.000 |  0.019±0.000
Kin40K:         0.088±0.000 |  0.082±0.000 |  0.082±0.000 |  0.056±0.000 |  0.074±0.000 |  0.071±0.000 |  0.077±0.000 |  0.079±0.000 |  0.050±0.000 |  0.093±0.000
Protein:        0.337±0.001 |  0.326±0.001 |  0.326±0.001 |  0.317±0.001 |  0.318±0.001 |  0.317±0.001 |  0.310±0.001 |  0.310±0.001 |  0.288±0.001 |  0.293±0.001
Keggdir.:       0.038±0.000 |  0.037±0.000 |  0.036±0.000 |  0.037±0.000 |  0.033±0.000 |  0.032±0.000 |  0.031±0.000 |  0.031±0.000 |  0.031±0.000 |  0.046±0.002
Slice:          0.044±0.000 |  0.031±0.000 |  0.037±0.000 |  0.021±0.000 |  0.029±0.000 |  0.028±0.000 |  0.032±0.000 |  0.060±0.001 |  0.014±0.000 |  0.029±0.000
Keggundir.:     0.052±0.000 |  0.051±0.000 |  0.050±0.000 |  0.051±0.000 |  0.038±0.000 |  0.037±0.000 |  0.036±0.000 |  0.035±0.000 |  0.036±0.000 |  0.056±0.001
3Droad:         0.178±0.000 |  0.177±0.000 |  0.175±0.000 |  0.163±0.002 |  0.191±0.001 |  0.185±0.002 |  0.170±0.001 |  0.167±0.000 |  0.138±0.000 |  0.142
Song:           0.441±0.000 |  0.440±0.000 |  0.441±0.000 |  0.434±0.000 |  0.434±0.000 |  0.433±0.000 |  0.426±0.000 |  0.425±0.000 |  0.422±0.000 |  0.420
Buzz:           0.141±0.000 |  0.140±0.000 |  0.139±0.000 |  0.135±0.000 |  0.141±0.000 |  0.140±0.000 |  0.136±0.000 |  0.135±0.000 |  0.134±0.000 |  —
Houseelectric:  0.027±0.000 |  0.027±0.000 |  0.027±0.000 |  0.026±0.000 |  0.025±0.000 |  0.025±0.000 |  0.024±0.000 |  0.024±0.000 |  0.022±0.000 |  —

[Figure 9 panels: test CRPS for the same methods and datasets as in Figure 6.]

Figure 9. We compare test continuous ranked probability scores (CRPS) for the various methods explored in Sec. 5.1-5.2 (lower is better). Results are averaged over ten random train/test/validation splits.

As in the original work, the mean variational parameters (the γ parameters) are not re-parameterized.

We apply a similar whitening transformation to the decoupled variant of our method (PPGPR-MFD). Given inducing points Z_µ and Z_σ for the mean and variance functions, respectively, we optimize the transformed parameters

m' = \Lambda_{MM}^{(\mu)\,-1} m, \qquad S' = \Lambda_{MM}^{(\sigma)\,-1} S \Lambda_{MM}^{(\sigma)\,-\top},

where \Lambda_{MM}^{(\mu)} and \Lambda_{MM}^{(\sigma)} are the Cholesky factors of the Z_µ and Z_σ kernel matrices, respectively. S' is constrained to be a diagonal matrix because of the mean-field approximation.

F. Connection to Direct Loss Minimization

In a series of papers (Sheth and Khardon, 2016; 2017; 2019) the authors consider approximate inference from the angle of regularized loss minimization. For Bayesian models of the general form

p(y, u) = p(u) \prod_{i=1}^{N} p(y_i | u, x_i)    (26)

and (approximate posterior) distributions q(u), Sheth and Khardon investigate PAC-Bayes-like bounds for the Bayes risk

r_{\rm Bayes}[q(u)] \equiv \mathbb{E}_{p_{\rm data}(x,y)} \big[ -\log \mathbb{E}_{q(u)} \, p(y|u,x) \big]    (27)

and the Gibbs risk

r_{\rm Gibbs}[q(u)] \equiv \mathbb{E}_{p_{\rm data}(x,y)} \mathbb{E}_{q(u)} \big[ -\log p(y|u,x) \big]    (28)
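The distinction between the two risks is easy to see numerically: for a fixed q(u) and a held-out pair (x, y), the Bayes risk scores the mixture predictive density, −log E_q[p(y|u,x)], while the Gibbs risk averages the log loss over posterior samples. The snippet below is a purely illustrative Monte Carlo sketch for a toy one-parameter Gaussian model (the model, the function name, and all numerical values are assumptions, not taken from the papers discussed here).

```python
import torch

def bayes_and_gibbs_log_loss(y, x, q_mean, q_std, noise_std=0.1, num_samples=10_000):
    """Monte Carlo estimates of -log E_q[p(y|u,x)] (Bayes-risk-style log loss) and
    E_q[-log p(y|u,x)] (Gibbs-risk-style log loss) for a toy model
    p(y|u,x) = N(y | u*x, noise_std^2) with q(u) = N(q_mean, q_std^2)."""
    u = q_mean + q_std * torch.randn(num_samples)
    log_p = torch.distributions.Normal(u * x, noise_std).log_prob(torch.as_tensor(y))
    # log of the sample mean of p(y|u,x), computed stably in log space
    bayes = -(torch.logsumexp(log_p, dim=0) - torch.log(torch.tensor(float(num_samples))))
    gibbs = -log_p.mean()
    return bayes.item(), gibbs.item()

# By Jensen's inequality the Bayes log loss never exceeds the Gibbs log loss.
bayes, gibbs = bayes_and_gibbs_log_loss(y=0.7, x=1.0, q_mean=0.5, q_std=0.3)
assert bayes <= gibbs + 1e-6
```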


Figure 10. We compare exact GPs, SVGP, and PPGPR fit to functions drawn from a Gaussian process prior using a periodic kernel (top) and a Matern kernel (bottom).

In particular, under some restrictions on the family of multivariate Normal distributions q(u), they establish a bound on a smoothed log loss variant of the Bayes risk for an inference setup that corresponds to Variational FITC (Sec. 3.1).^{16} Additionally, under similar assumptions an analogous bound for an inference setup that corresponds to PPGPR-Chol is established in (Sheth and Khardon, 2019). Moreover, in (Sheth and Khardon, 2016) the authors provide some empirical evidence for the good performance of these two objective functions.

Importantly, our parametric modeling perspective allows us more flexibility than is allowed by Direct Loss Minimization as explored by Sheth and Khardon. Indeed, much of the predictive power of approximate GP regressors arguably comes from the sensible behavior of the parametric family of mean and variance functions that is induced by a particular kernel and associated inducing point locations Z; for example, the variance σ_f(x)^2 increases as x moves away from Z. In particular, the maximum likelihood approach we adopt in Sec. 3.2 allows us to consider parametric families of predictive distributions that are not of the form \mathbb{E}_{q(u)} \mathbb{E}_{p(f|u,x)}[p(y|f)] for some distribution q(u). This is in fact the case for our best performing method, PPGPR-MFD, which freely combines mean and variance functions that are parameterized by distinct inducing point locations Z_µ and Z_σ. Indeed, while we have not done so in our experiments, this perspective allows us to entirely decouple µ_f(x) and σ_f(x)^2 by introducing separate kernels for each.

16. See Corollary 12 in (Sheth and Khardon, 2017).

G. Connection to stochastic EP and the BB-α objective

Suppose we want to approximate the distribution

p(\omega) = \frac{1}{Z} \, p_0(\omega) \prod_{n=1}^{N} f_n(\omega)    (29)

where Z is an unknown normalizer. In the prototypical context of Bayesian modeling, p_0(ω) would be a prior distribution and f_n(ω) would be a likelihood factor for the nth datapoint. Expectation propagation (EP) is a broad class of algorithms that can be used to approximate distributions like that in Eqn. 29 (Minka, 2004). In the following we give a brief review of a few variants of EP and describe a connection to the Predictive objective defined in Eqn. 17 in the main text.

Li et al. (2015) propose a particular variant of EP called Stochastic EP that reduces memory requirements by a factor of N by tying (i.e. sharing) factors together. In subsequent work Hernandez-Lobato et al. (2016) present a version of Stochastic EP that is formulated in terms of an energy function, the so-called BB-α objective L_α, which is given by

\mathcal{L}_\alpha = -\frac{1}{\alpha} \sum_{n=1}^{N} \log \mathbb{E}_{q(\omega)}\left[ \left( \frac{f_n(\omega) \, p_0(\omega)^{1/N}}{q(\omega)^{1/N}} \right)^{\!\alpha} \right]    (30)

where q(ω) is a so-called cavity distribution and α ∈ [0, 1]. As shown in (Li and Gal, 2017), this can be rewritten as

\mathcal{L}_\alpha = R_\xi(q \,\|\, p_0) - \frac{1}{\alpha} \sum_{n=1}^{N} \log \mathbb{E}_{\tilde{q}}\left[ f_n(\omega)^{\alpha} \right]    (31)


where \tilde{q}(\omega) is defined by the equation

\tilde{q}(\omega) = \frac{1}{Z_{\tilde{q}}} \, q(\omega) \left( \frac{q(\omega)}{p_0(\omega)} \right)^{\frac{\alpha}{N-\alpha}}    (32)

and where \xi \equiv \frac{N}{N-\alpha} and R_\xi(q \,\|\, p_0) is a Rényi divergence.

Li and Gal (2017) then argue that, under suitable conditions, as \alpha/N \to 0 this becomes

\mathcal{L}_\alpha \to \mathrm{KL}(q \,\|\, p_0) - \frac{1}{\alpha} \sum_{n=1}^{N} \log \mathbb{E}_{q}\left[ f_n(\omega)^{\alpha} \right]    (33)

For the particular choice \alpha = 1 (so that we require N \to \infty) this then becomes

\mathcal{L}_{\alpha=1} \to \mathrm{KL}(q \,\|\, p_0) - \sum_{n=1}^{N} \log \mathbb{E}_{q}\left[ f_n(\omega) \right]    (34)

The similarity of Eqn. 34 and Eqn. 17 is now manifest.
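A small numerical check makes the correspondence concrete: the per-datapoint BB-α term −(1/α) log E_q[f_n(ω)^α] reduces at α = 1 to the Bayes-risk-style term −log E_q[f_n(ω)] appearing above, and approaches the Gibbs-style term E_q[−log f_n(ω)] as α → 0. The snippet below is an illustrative Monte Carlo sketch with a toy factor; all numerical values and names are assumptions, not code from this paper.

```python
import torch

def bb_alpha_term(log_f, alpha):
    """-(1/alpha) * log E_q[f(omega)^alpha], estimated from samples of log f(omega)."""
    n = torch.tensor(float(log_f.numel()))
    return -(torch.logsumexp(alpha * log_f, dim=0) - torch.log(n)) / alpha

# Toy factor: f(omega) = N(y | omega, 0.1^2) with omega ~ q = N(0.5, 0.3^2) and y = 0.7.
omega = 0.5 + 0.3 * torch.randn(100_000)
log_f = torch.distributions.Normal(omega, 0.1).log_prob(torch.tensor(0.7))

print(bb_alpha_term(log_f, alpha=1.0))    # equals -log E_q[f]   (Bayes-risk-style term)
print(bb_alpha_term(log_f, alpha=1e-3))   # approaches E_q[-log f] (Gibbs-risk-style / ELBO-like term)
print(-log_f.mean())                      # E_q[-log f] for comparison
```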

For further discussion of EP methods in the context of Gaussian Process inference we refer the reader to Bui et al. (2017).