TRANSCRIPT
Fitting Covariance and Multioutput Gaussian Processes
Neil D. Lawrence
GPSS, 16th September 2014
Outline
Parametric Models are a Bottleneck
Constructing Covariance
GP Limitations
Kalman Filter
Nonparametric Gaussian Processes
- We've seen how we go from parametric to non-parametric.
- The limit implies infinite dimensional w.
- Gaussian processes are generally non-parametric: combine data with covariance function to get model.
- This representation cannot be summarized by a parameter vector of a fixed size.
The Parametric Bottleneck
- Parametric models have a representation that does not respond to increasing training set size.
- Bayesian posterior distributions over parameters contain the information about the training data.
- Use Bayes' rule from training data, p(w|y, X).
- Make predictions on test data:
  p(y*|X*, y, X) = ∫ p(y*|w, X*) p(w|y, X) dw.
- w becomes a bottleneck for information about the training set to pass to the test set.
- Solution: increase m so that the bottleneck is so large that it no longer presents a problem.
- How big is big enough for m? Non-parametrics says m → ∞.
The Parametric Bottleneck
- Now no longer possible to manipulate the model through the standard parametric form.
- However, it is possible to express parametric models as GPs:
  k(x_i, x_j) = φ(x_i)^T φ(x_j).
- These are known as degenerate covariance matrices.
- Their rank is at most m; non-parametric models have full rank covariance matrices.
- Most well known is the "linear kernel", k(x_i, x_j) = x_i^T x_j.
Making Predictions
- For non-parametrics, prediction at new points f* is made by conditioning on f in the joint distribution.
- In GPs this involves combining the training data with the covariance function and the mean function.
- Parametric is a special case when conditional prediction can be summarized in a fixed number of parameters.
- Complexity of a parametric model remains fixed regardless of the size of our training data set.
- For a non-parametric model the required number of parameters grows with the size of the training data.
Covariance Functions and Mercer Kernels
- Mercer kernels and covariance functions are similar.
- The kernel perspective does not make a probabilistic interpretation of the covariance function.
- Algorithms can be simpler, but the probabilistic interpretation is crucial for kernel parameter optimization.
Constructing Covariance Functions
I Sum of two covariances is also a covariance function.
k(x, x′) = k1(x, x′) + k2(x, x′)
Constructing Covariance Functions
I Product of two covariances is also a covariance function.
k(x, x′) = k1(x, x′)k2(x, x′)
Multiply by Deterministic Function
- If f(x) is a Gaussian process.
- g(x) is a deterministic function.
- h(x) = f(x) g(x)
- Then
  k_h(x, x') = g(x) k_f(x, x') g(x')
  where k_h is the covariance for h(·) and k_f is the covariance for f(·).
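As a sketch of these closure properties, the NumPy snippet below (the base kernel forms, inputs and the deterministic g(x) are illustrative assumptions, not from the talk) builds covariance matrices by summing, multiplying, and scaling base kernels, and checks the results remain valid covariances:

```python
import numpy as np

# Illustrative base kernels: an exponentiated quadratic and a linear kernel.
def eq_kernel(X, X2, alpha=1.0, lengthscale=1.0):
    # squared Euclidean distances between rows of X and X2
    d2 = np.sum((X[:, None, :] - X2[None, :, :]) ** 2, axis=2)
    return alpha * np.exp(-d2 / (2.0 * lengthscale ** 2))

def linear_kernel(X, X2, alpha=1.0):
    return alpha * X @ X2.T

X = np.linspace(-1, 1, 5)[:, None]

# Sum and product of covariance functions are covariance functions.
K_sum = eq_kernel(X, X) + linear_kernel(X, X)
K_prod = eq_kernel(X, X) * linear_kernel(X, X)   # elementwise (Hadamard) product

# Multiplying a GP f(x) by a deterministic g(x) gives covariance
# g(x) k_f(x, x') g(x'), i.e. an outer-product scaling of the matrix.
g = np.cos(X[:, 0])   # hypothetical deterministic function
K_scaled = g[:, None] * eq_kernel(X, X) * g[None, :]

# All three remain symmetric positive semi-definite.
for K in (K_sum, K_prod, K_scaled):
    assert np.allclose(K, K.T)
    assert np.min(np.linalg.eigvalsh(K)) > -1e-10
```

The product case relies on the Schur product theorem: the elementwise product of two positive semi-definite matrices is positive semi-definite.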
Covariance Functions
MLP Covariance Function

k(x, x') = α asin( (w x^T x' + b) / ( sqrt(w x^T x + b + 1) sqrt(w x'^T x' + b + 1) ) )

- Based on infinite neural network model.
- Example parameters: w = 40, b = 4.
Covariance Functions
Linear Covariance Function

k(x, x') = α x^T x'

- Bayesian linear regression.
- Example parameter: α = 1.
Gaussian Process Interpolation

[Figure: Gaussian process interpolation, f(x) against x.]
Figure: Real example: BACCO (see e.g. Oakley and O'Hagan, 2002). Interpolation through outputs from slow computer simulations (e.g. atmospheric carbon levels).
Gaussian Noise
- Gaussian noise model,
  p(y_i | f_i) = N(y_i | f_i, σ²)
  where σ² is the variance of the noise.
- Equivalent to a covariance function of the form
  k(x_i, x_j) = δ_{i,j} σ²
  where δ_{i,j} is the Kronecker delta function.
- Additive nature of Gaussians means we can simply add this term to existing covariance matrices.
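A minimal sketch of this equivalence, assuming an illustrative exponentiated quadratic kernel and random inputs: the Kronecker delta term δ_{i,j} σ² only touches the diagonal of the covariance matrix.

```python
import numpy as np

def eq_kernel(X, lengthscale=1.0):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 1))
sigma2 = 0.1

K_f = eq_kernel(X)                   # covariance of the latent function f
K_y = K_f + sigma2 * np.eye(len(X))  # covariance of the noisy observations y

# The Gaussian noise term affects only the diagonal.
assert np.allclose(K_y - K_f, sigma2 * np.eye(len(X)))
```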
Gaussian Process Regression

[Figure: Gaussian process regression, y(x) against x.]
Figure: Examples include WiFi localization, C14 calibration curve.
General Noise Models
Graph of a GP
- Relates input variables, X, to the vector, y, through f given kernel parameters θ.
- Plate notation indicates independence of y_i | f_i.
- In general p(y_i | f_i) is non-Gaussian.
- We approximate with a Gaussian, p(y_i | f_i) ≈ N(m_i | f_i, β_i^{-1}).

Figure: The Gaussian process depicted graphically (nodes y_i, f_i, X, θ; plate over i = 1 . . . n).
Gaussian Noise

[Figure: the density p(f*|X, x*, y), the likelihood p(y* = 0.6 | f*), and the updated density p(f*|X, x*, y, y*).]
Figure: Inclusion of a data point with Gaussian noise.
Expectation Propagation
Local Moment Matching
- Easiest to consider a single previously unseen data point, y*, x*.
- Before seeing the data point, prediction of f* is a GP, q(f* | y, X).
- Update the prediction using Bayes' Rule,
  p(f* | y, y*, X, x*) = p(y* | f*) p(f* | y, X, x*) / p(y, y* | X, x*).
  This posterior is not a Gaussian process if p(y* | f*) is non-Gaussian.
Classification Noise Model
Probit Noise Model

[Figure: p(y_i | f_i) against f_i for y_i = −1 and y_i = 1.]
Figure: The probit model (classification). The plot shows p(y_i | f_i) for different values of y_i. For y_i = 1 we have
p(y_i | f_i) = φ(f_i) = ∫_{−∞}^{f_i} N(z | 0, 1) dz.
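The probit likelihood above can be sketched directly; the standard normal CDF is written here via the error function (an implementation choice, not from the talk):

```python
import numpy as np
from math import erf

# Probit noise model: p(y = 1 | f) = Phi(f), the standard normal CDF.
def probit(f):
    return 0.5 * (1.0 + erf(f / np.sqrt(2.0)))

# At f = 0 the two classes are equally likely.
assert abs(probit(0.0) - 0.5) < 1e-12
# Large positive f makes y = 1 nearly certain.
assert probit(3.0) > 0.99
# By symmetry, p(y = -1 | f) = 1 - Phi(f) = Phi(-f).
assert abs((1.0 - probit(1.5)) - probit(-1.5)) < 1e-12
```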
Expectation Propagation II
Match Moments
- Idea behind EP: approximate with a Gaussian process at this stage by matching moments.
- This is equivalent to minimizing the following KL divergence, where q(f* | y, y*, X, x*) is constrained to be a GP:
  q(f* | y, y*, X, x*) = argmin_q KL( p(f* | y, y*, X, x*) || q(f* | y, y*, X, x*) ).
- This is equivalent to setting
  ⟨f*⟩_q = ⟨f*⟩_p and ⟨f*²⟩_q = ⟨f*²⟩_p,
  where the expectations are under q(f* | y, y*, X, x*) and p(f* | y, y*, X, x*) respectively.
Expectation Propagation III
Equivalent Gaussian
- This is achieved by replacing p(y* | f*) with a Gaussian distribution:
  p(f* | y, y*, X, x*) = p(y* | f*) p(f* | y, X, x*) / p(y, y* | X, x*)
  becomes
  q(f* | y, y*, X, x*) = N(m* | f*, β_m^{-1}) p(f* | y, X, x*) / p(y, y* | X, x*).
Classification

[Figure: the density p(f*|X, x*, y), the likelihood p(y* = 1 | f*), the exact posterior p(f*|X, x*, y, y*), and the Gaussian approximation q(f*|X, x*, y).]
Figure: An EP style update with a classification noise model.
Ordinal Noise Model
Ordered Categories

[Figure: p(y_i | f_i) against f_i for y_i = −1, y_i = 0 and y_i = 1.]
Figure: The ordered categorical noise model (ordinal regression). The plot shows p(y_i | f_i) for different values of y_i. Here we have assumed three categories.
Laplace Approximation
- Equivalent Gaussian is found by making a local 2nd order Taylor approximation at the mode.
- Laplace was the first to suggest this, so it's known as the Laplace approximation.
Learning Covariance Parameters
Can we determine covariance parameters from the data?

N(y | 0, K) = 1 / ( (2π)^{n/2} |K|^{1/2} ) exp( −y^T K^{−1} y / 2 )

The parameters are inside the covariance function (matrix).
k_{i,j} = k(x_i, x_j; θ)
Learning Covariance Parameters
Can we determine covariance parameters from the data?

log N(y | 0, K) = −(1/2) log |K| − y^T K^{−1} y / 2 − (n/2) log 2π

The parameters are inside the covariance function (matrix).
k_{i,j} = k(x_i, x_j; θ)
Learning Covariance Parameters
Can we determine covariance parameters from the data?

E(θ) = (1/2) log |K| + y^T K^{−1} y / 2

The parameters are inside the covariance function (matrix).
k_{i,j} = k(x_i, x_j; θ)
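The objective E(θ) can be sketched in a few lines of NumPy, using a Cholesky factorisation for stability (as the later limitations slide recommends). The kernel form, parameterisation (θ = (α, ℓ, σ²)) and toy data are illustrative assumptions:

```python
import numpy as np

# E(theta) = 0.5 log|K| + 0.5 y^T K^{-1} y, i.e. the negative log likelihood
# up to the constant (n/2) log 2 pi.
def gp_objective(theta, X, y):
    alpha, lengthscale, sigma2 = theta
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = alpha * np.exp(-d2 / (2.0 * lengthscale ** 2)) + sigma2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    # log|K| = 2 sum(log diag(L)); the quadratic term via a triangular solve
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    Linv_y = np.linalg.solve(L, y)
    return 0.5 * logdet + 0.5 * Linv_y @ Linv_y

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)

E_mid = gp_objective((1.0, 1.0, 0.01), X, y)     # sensible length scale
E_tiny = gp_objective((1.0, 0.001, 0.01), X, y)  # degenerate length scale
assert np.isfinite(E_mid) and np.isfinite(E_tiny)
```

For this smooth toy signal the objective favours the moderate length scale over the degenerate one, illustrating the trade-off between the capacity term log |K| and the data fit term.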
Eigendecomposition of Covariance
A useful decomposition for understanding the objective function.

K = R Λ² R^T

where Λ is a diagonal matrix and R^T R = I. The diagonal of Λ represents distance along the axes; R gives a rotation of these axes.

Useful representation since |K| = |Λ²| = |Λ|².
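The determinant identity can be checked numerically; the rotation and eigenvalues below are arbitrary illustrative choices:

```python
import numpy as np

# Build K = R Lambda^2 R^T with an orthonormal R and check |K| = |Lambda|^2.
rng = np.random.default_rng(3)
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # orthonormal: R^T R = I
lam = np.array([1.0, 2.0, 0.5, 3.0])              # diagonal of Lambda

K = R @ np.diag(lam ** 2) @ R.T

# Rotation leaves the determinant unchanged, so |K| = (lam1 * ... * lam4)^2.
assert np.isclose(np.linalg.det(K), np.prod(lam) ** 2)
```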
Capacity control: log |K|
- For a diagonal matrix Λ = diag(λ1, λ2), the determinant is the product of the diagonal elements: |Λ| = λ1 λ2, the area scaling of the ellipse with axes λ1 and λ2.
- In three dimensions, Λ = diag(λ1, λ2, λ3) gives |Λ| = λ1 λ2 λ3, a volume.
- Rotation does not change the determinant: |RΛ| = λ1 λ2.
Data Fit: y^T K^{−1} y / 2

[Figure: contours of the Gaussian in the (y1, y2) plane, with principal axes of lengths λ1 and λ2 varying across panels.]
Learning Covariance Parameters
Can we determine length scales and noise levels from the data?

[Figure: left, a GP fit y(x) against x; right, the error E(θ) as a function of the length scale ℓ.]

E(θ) = (1/2) log |K| + y^T K^{−1} y / 2
Gene Expression Example
- Given expression levels in the form of a time series from Della Gatta et al. (2008).
- Want to detect if a gene is expressed or not; fit a GP to each gene (Kalaitzis and Lawrence, 2011).
[Figure: front page of Kalaitzis and Lawrence (2011), "A Simple Approach to Ranking Differentially Expressed Gene Expression Time Courses through Gaussian Process Regression", BMC Bioinformatics 12:180.]
[Figure: log10 SNR against log10 length scale.]
Contour plot of Gaussian process likelihood.
[Figure: likelihood contours and the corresponding GP fit y(x).]
Optima: length scale of 1.2221 and log10 SNR of 1.9654; log likelihood is -0.22317.
[Figure: likelihood contours and the corresponding GP fit y(x).]
Optima: length scale of 1.5162 and log10 SNR of 0.21306; log likelihood is -0.23604.
[Figure: likelihood contours and the corresponding GP fit y(x).]
Optima: length scale of 2.9886 and log10 SNR of -4.506; log likelihood is -2.1056.
Limitations of Gaussian Processes
- Inference is O(n³) due to the matrix inverse (in practice use Cholesky).
- Gaussian processes don't deal well with discontinuities (financial crises, phosphorylation, collisions, edges in images).
- The widely used exponentiated quadratic covariance (RBF) can be too smooth in practice (but there are many alternatives!).
Simple Markov Chain
- Assume a 1-d latent state, a vector over time, x = [x_1 . . . x_T].
- Markov property,
  x_i = x_{i−1} + ε_i,  ε_i ∼ N(0, α)  ⟹  x_i ∼ N(x_{i−1}, α)
- Initial state, x_0 ∼ N(0, α_0).
- If x_0 ∼ N(0, α) we have a Markov chain for the latent states.
- A Markov chain is specified by an initial distribution (Gaussian) and a transition distribution (Gaussian).
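The chain above can be simulated in a couple of lines; the seed and chain length are arbitrary:

```python
import numpy as np

# Gauss-Markov chain: x_i = x_{i-1} + eps_i with eps_i ~ N(0, alpha), x_0 = 0.
rng = np.random.default_rng(1)
alpha = 1.0
T = 10

eps = rng.normal(0.0, np.sqrt(alpha), size=T)
x = np.cumsum(eps)  # each state is the running sum of the innovations

# Each state equals the previous state plus the new innovation.
assert np.allclose(x[1:], x[:-1] + eps[1:])
```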
Gauss Markov Chain

[Figure: a sample path x against t, built step by step with x_0 = 0 and ε_i ∼ N(0, 1).]
For example: x_1 = 0.000 − 2.24 = −2.24, x_2 = −2.24 + 0.457 = −1.78, x_3 = −1.78 + 0.178 = −1.6, x_4 = −1.6 − 0.292 = −1.89, x_5 = −1.89 − 0.501 = −2.39, x_6 = −2.39 + 1.32 = −1.08, x_7 = −1.08 + 0.989 = −0.0881, x_8 = −0.0881 − 0.842 = −0.93, x_9 = −0.93 − 0.410 = −1.34.
Multivariate Gaussian Properties: Reminder

If z ∼ N(μ, C) and x = Wz + b, then x ∼ N(Wμ + b, WCW^T).
Multivariate Gaussian Properties: Reminder

Simplified: if z ∼ N(0, σ²I) and x = Wz, then x ∼ N(0, σ²WW^T).
Matrix Representation of Latent Variables

The stacked states are a linear map of the innovations, x = L1 ε, where L1 is the lower triangular matrix of ones:

[x_1; x_2; x_3; x_4; x_5] = [1 0 0 0 0; 1 1 0 0 0; 1 1 1 0 0; 1 1 1 1 0; 1 1 1 1 1] × [ε_1; ε_2; ε_3; ε_4; ε_5]

so that x_1 = ε_1, x_2 = ε_1 + ε_2, and in general x_k = ε_1 + · · · + ε_k.
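A quick numerical check of this matrix representation (the innovation values are arbitrary illustrative numbers):

```python
import numpy as np

# L1: lower triangular matrix of ones, mapping innovations to states.
T = 5
L1 = np.tril(np.ones((T, T)))

eps = np.arange(1.0, T + 1)  # illustrative innovations: 1, 2, 3, 4, 5
x = L1 @ eps

# x = L1 eps is exactly a cumulative sum: x_k = eps_1 + ... + eps_k.
assert np.allclose(x, np.cumsum(eps))
```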
Multivariate Process
- Since x is linearly related to ε, we know x is a Gaussian process.
- Trick: we only need to compute the mean and covariance of x to determine that Gaussian.
Latent Process Mean

⟨x⟩ = ⟨L1 ε⟩ = L1 ⟨ε⟩ = L1 0 = 0, since ε ∼ N(0, αI).
Latent Process Covariance

xx^T = L1 ε ε^T L1^T, so

⟨xx^T⟩ = ⟨L1 ε ε^T L1^T⟩ = L1 ⟨ε ε^T⟩ L1^T = α L1 L1^T, since ε ∼ N(0, αI).
Latent Process

x = L1 ε with ε ∼ N(0, αI)  ⟹  x ∼ N(0, α L1 L1^T)
Covariance for Latent Process II
- Make the variance dependent on the time interval.
- Assume variance grows linearly with time.
- Justification: the sum of two Gaussian distributed random variables is distributed as Gaussian with the sum of the variances.
- If the variable's movement is additive over time (as described), variance scales linearly with time.
Covariance for Latent Process II
- Given
  ε ∼ N(0, αI)  ⟹  x ∼ N(0, α L1 L1^T).
  Then
  ε ∼ N(0, ∆t αI)  ⟹  x ∼ N(0, ∆t α L1 L1^T),
  where ∆t is the time interval between observations.
Covariance for Latent Process II

ε ∼ N(0, α∆tI),  x ∼ N(0, α∆t L1 L1^T)

K = α∆t L1 L1^T,  k_{i,j} = α∆t l_{:,i}^T l_{:,j}

where l_{:,k} is a vector from the kth row of L1: the first k elements are one, the next T − k are zero.

k_{i,j} = α∆t min(i, j)

Define ∆t_i = t_i, so

k_{i,j} = α min(t_i, t_j) = k(t_i, t_j)
Covariance Functions
Where did this covariance matrix come from?

Markov Process

k(t, t') = α min(t, t')

- Covariance matrix is built using the inputs to the function, t.
Covariance Functions
Where did this covariance matrix come from?

Markov Process
Visualization of inverse covariance (precision).
- Precision matrix is sparse: only neighbours in the matrix are non-zero.
- This reflects conditional independencies in the data.
- In this case, Markov structure.
Covariance Functions
Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

k(x, x') = α exp( −‖x − x'‖²₂ / (2ℓ²) )

- Covariance matrix is built using the inputs to the function, x.
- For the example above it was based on Euclidean distance.
- The covariance function is also known as a kernel.
Covariance Functions
Where did this covariance matrix come from?

Exponentiated Quadratic
Visualization of inverse covariance (precision).
- Precision matrix is not sparse.
- Each point is dependent on all the others.
- In this case, non-Markovian.
Simple Kalman Filter I
- We have state vector X = [x_1 . . . x_q] ∈ R^{T×q} and if each state evolves independently we have
  p(X) = ∏_{i=1}^{q} p(x_{:,i}),
  p(x_{:,i}) = N(x_{:,i} | 0, K).
- We want to obtain outputs through:
  y_{i,:} = W x_{i,:}
Stacking and Kronecker Products I
- Represent with a 'stacked' system:
  p(x) = N(x | 0, I ⊗ K)
  where the stacking is placing each column of X one on top of another as
  x = [x_{:,1}; x_{:,2}; . . . ; x_{:,q}]
Kronecker Product

[a b; c d] ⊗ K = [aK bK; cK dK]
Column Stacking

[Figure: Kronecker product structure for column stacking.]
For this stacking the marginal distribution over time is given by the block diagonals.
Two Ways of Stacking

Can also stack each row of X to form a column vector:

x = [x_{1,:}; x_{2,:}; . . . ; x_{T,:}]

p(x) = N(x | 0, K ⊗ I)
Row Stacking

[Figure: Kronecker product structure for row stacking.]
For this stacking the marginal distribution over the latent dimensions is given by the block diagonals.
Observed Process
The observations are related to the latent points by a linear mapping matrix,

yi,: = Wxi,: + εi,:,   ε ∼ N(0, σ²I).
Mapping from Latent Process to Observed
[W 0 0]   [x1,:]   [Wx1,:]
[0 W 0] × [x2,:] = [Wx2,:]
[0 0 W]   [x3,:]   [Wx3,:]
Output Covariance
This leads to a covariance of the form

(I ⊗ W)(K ⊗ I)(I ⊗ W^T) + Iσ².

Using (A ⊗ B)(C ⊗ D) = AC ⊗ BD this leads to

K ⊗ WW^T + Iσ²,

or, under the column stacking,

y ∼ N(0, WW^T ⊗ K + Iσ²).
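The simplification via the mixed-product identity can be verified numerically. A small numpy check (sizes and random matrices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, q, p = 3, 2, 4  # illustrative: times, latent dims, outputs

A = rng.standard_normal((T, T))
K = A @ A.T                      # temporal covariance
W = rng.standard_normal((p, q))  # mapping matrix
sigma2 = 0.1

I_T, I_q = np.eye(T), np.eye(q)
noise = sigma2 * np.eye(T * p)

# (I ⊗ W)(K ⊗ I)(I ⊗ W^T) + σ²I: covariance of the stacked outputs.
direct = np.kron(I_T, W) @ np.kron(K, I_q) @ np.kron(I_T, W.T) + noise

# Mixed-product identity (A ⊗ B)(C ⊗ D) = AC ⊗ BD collapses this to:
simplified = np.kron(K, W @ W.T) + noise

assert np.allclose(direct, simplified)
```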
Kernels for Vector Valued Outputs: A Review

M. A. Alvarez, L. Rosasco and N. D. Lawrence. "Kernels for Vector-Valued Functions: A Review." Foundations and Trends in Machine Learning, Vol. 4, No. 3 (2011), 195–266. DOI: 10.1561/2200000036.
Kronecker Structure GPs
I This Kronecker structure leads to several published models:

(K(x, x′))d,d′ = k(x, x′)kT(d, d′),

where k takes x as input and kT takes the output index d as input.
I Can think of multiple output covariance functions as covariances with augmented input.
I Alongside x we also input the d associated with the output of interest.
Separable Covariance Functions
I Taking B = WW^T we have a matrix expression across outputs:

K(x, x′) = k(x, x′)B,

where B is a p × p symmetric and positive semi-definite matrix.
I B is called the coregionalization matrix.
I We call this class of covariance functions separable due to their product structure.
Sum of Separable Covariance Functions
I In the same spirit a more general class of kernels is given by

K(x, x′) = ∑_{j=1}^{q} kj(x, x′)Bj.

I This can also be written as

K(X,X) = ∑_{j=1}^{q} Bj ⊗ kj(X,X).

I This is like several Kalman filter-type models added together, but each one with a different set of latent functions.
I We call this class of kernels sum of separable kernels (SoS kernels).
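The sum of separable kernels is direct to form in code. A hedged numpy sketch (the `rbf` helper is ours; the B matrices and lengthscales reuse the values from the LMC samples slides):

```python
import numpy as np

def rbf(X, lengthscale):
    """Exponentiated quadratic kernel matrix on 1-D inputs (helper, ours)."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def sos_kernel(X, Bs, lengthscales):
    """Sum of separable kernels: K(X, X) = sum_j B_j ⊗ k_j(X, X)."""
    return sum(np.kron(B, rbf(X, ell)) for B, ell in zip(Bs, lengthscales))

X = np.linspace(0, 1, 5)
B1 = np.array([[1.4, 0.5], [0.5, 1.2]])
B2 = np.array([[1.0, 0.5], [0.5, 1.3]])
K = sos_kernel(X, [B1, B2], [1.0, 0.2])

# A valid joint covariance over p = 2 outputs at 5 inputs each:
# symmetric and positive semi-definite.
assert K.shape == (10, 10)
assert np.allclose(K, K.T)
assert np.min(np.linalg.eigvalsh(K)) > -1e-9
```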
Geostatistics
I Use of GPs in geostatistics is called kriging.
I These multi-output GPs pioneered in geostatistics: prediction over vector-valued output data is known as cokriging.
I The model in geostatistics is known as the linear model of coregionalization (LMC, Journel and Huijbregts (1978); Goovaerts (1997)).
I Most machine learning multitask models can be placed in the context of the LMC model.
Weighted sum of Latent Functions
I In the linear model of coregionalization (LMC) outputs are expressed as linear combinations of independent random functions.
I In the LMC, each component fd is expressed as a linear sum

fd(x) = ∑_{j=1}^{q} wd,j uj(x),

where the latent functions are independent and have covariance functions kj(x, x′).
I The processes uj(x) and uj′(x) are independent for j ≠ j′.
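The construction can be simulated directly: draw each independent latent function from its own GP prior, then mix with the weights. A numpy sketch (sizes, lengthscales, weights and jitter are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(0, 1, 50)
n = len(X)

def rbf(X, ell):
    """Exponentiated quadratic kernel matrix on 1-D inputs (helper, ours)."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell**2)

# Independent latent functions u_j, each drawn from its own GP prior.
lengthscales = [1.0, 0.2]
U = np.stack([
    rng.multivariate_normal(np.zeros(n), rbf(X, ell) + 1e-8 * np.eye(n))
    for ell in lengthscales
])  # shape (q, n)

# Each output f_d is a weighted sum of latent functions: f_d = sum_j w_dj u_j.
W = np.array([[1.0, 0.5],
              [0.5, 1.0]])  # weights w_{d,j}, illustrative
F = W @ U  # shape (p, n): one row per output

assert F.shape == (2, n)
```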
Kalman Filter Special Case
I The Kalman filter is an example of the LMC where ui(x) → xi(t).
I I.e. we’ve moved from time input to a more general input space.
I In matrix notation:
  1. Kalman filter: F = WX
  2. LMC: F = WU
where the rows of these matrices F, X, U each contain q samples from their corresponding functions at a different time (Kalman filter) or spatial location (LMC).
Intrinsic Coregionalization Model
I If one covariance is used for all latent functions (as in the Kalman filter).
I This is called the intrinsic coregionalization model (ICM, Goovaerts (1997)).
I The kernel matrix corresponding to a dataset X takes the form

K(X,X) = B ⊗ k(X,X).
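A sample from the ICM prior makes the structure concrete. With a rank-1 B = ww^T built from w = [1, 5] (the value used in the sample slides that follow), the two outputs are exact scaled copies of each other. A numpy sketch (the `rbf` helper, lengthscale and jitter are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.linspace(0, 1, 30)
n = len(X)

def rbf(X, ell=0.3):
    """Exponentiated quadratic kernel matrix on 1-D inputs (helper, ours)."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell**2)

# Rank-1 coregionalization matrix: w = [1, 5] gives B = [[1, 5], [5, 25]].
w = np.array([1.0, 5.0])
B = np.outer(w, w)

# ICM covariance over both outputs jointly, with a small jitter for stability.
K = np.kron(B, rbf(X)) + 1e-10 * np.eye(2 * n)

# One joint sample; the first n entries are output 1, the rest output 2.
f = rng.multivariate_normal(np.zeros(2 * n), K)
f1, f2 = f[:n], f[n:]

# Rank-1 B means the outputs are perfectly correlated: f2 = 5 f1.
assert np.allclose(f2, 5 * f1, atol=1e-3)
```

This perfect correlation is the sampling-level view of the autokrigeability effect discussed on the next slide.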
Autokrigeability
I If outputs are noise-free, maximum likelihood is equivalent to independent fits of B and k(x, x′) (Helterbrand and Cressie, 1994).
I In geostatistics this is known as autokrigeability (Wackernagel, 2003).
I In multitask learning it is the cancellation of inter-task transfer (Bonilla et al., 2008).
Intrinsic Coregionalization Model
K(X,X) = ww^T ⊗ k(X,X),

w = [1; 5],   B = ww^T = [1 5; 5 25].
Intrinsic Coregionalization Model
K(X,X) = B ⊗ k(X,X).
B = [1 0.5; 0.5 1.5].
LMC Samples
K(X,X) = B1 ⊗ k1(X,X) + B2 ⊗ k2(X,X)
B1 = [1.4 0.5; 0.5 1.2],   ℓ1 = 1
B2 = [1 0.5; 0.5 1.3],   ℓ2 = 0.2
LMC in Machine Learning and Statistics
I Used in machine learning for GPs for multivariate regression and in statistics for computer emulation of expensive multivariate computer codes.
I Imposes the correlation of the outputs explicitly through the set of coregionalization matrices.
I Setting B = Ip assumes outputs are conditionally independent given the parameters θ (Minka and Picard, 1997; Lawrence and Platt, 2004; Yu et al., 2005).
I More recent approaches for multiple output modeling are different versions of the linear model of coregionalization.
Semiparametric Latent Factor Model
I Coregionalization matrices are rank 1 (Teh et al., 2005). Rewrite the sum of separable kernels as

K(X,X) = ∑_{j=1}^{q} w:,j w:,j^T ⊗ kj(X,X).

I Like the Kalman filter, but each latent function has a different covariance.
I Authors suggest using an exponentiated quadratic with a characteristic length-scale for each input dimension.
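In code, the SLFM kernel is the sum-of-separable construction with each Bj forced to rank 1 via an outer product. A numpy sketch using the weight vectors from the samples slide (the `rbf` helper and lengthscales are our illustrative choices):

```python
import numpy as np

def rbf(X, ell):
    """Exponentiated quadratic kernel matrix on 1-D inputs (helper, ours)."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell**2)

X = np.linspace(0, 1, 5)

# One weight column per latent function; values from the samples slide.
w1 = np.array([0.5, 1.0])
w2 = np.array([1.0, 0.5])

# Each latent function gets its own covariance (different lengthscales).
K = (np.kron(np.outer(w1, w1), rbf(X, 1.0))
     + np.kron(np.outer(w2, w2), rbf(X, 0.2)))

# Each coregionalization matrix is rank 1, but their sum need not be.
assert np.linalg.matrix_rank(np.outer(w1, w1)) == 1
assert K.shape == (10, 10)
assert np.allclose(K, K.T)
```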
Semiparametric Latent Factor Model Samples
K(X,X) = w:,1w:,1^T ⊗ k1(X,X) + w:,2w:,2^T ⊗ k2(X,X)
w1 = [0.5; 1],   w2 = [1; 0.5].
Gaussian Processes for Multi-task, Multi-output and Multi-class
I Bonilla et al. (2008) suggest ICM for multitask learning.
I Use a PPCA form for B: similar to our Kalman filter example.
I Refer to the autokrigeability effect as the cancellation of inter-task transfer.
I Also discuss the similarities between the multi-task GP and the ICM, and its relationship to the SLFM and the LMC.
Multitask Classification
I Mostly restricted to the case where the outputs are conditionally independent given the hyperparameters φ (Minka and Picard, 1997; Williams and Barber, 1998; Lawrence and Platt, 2004; Seeger and Jordan, 2004; Yu et al., 2005; Rasmussen and Williams, 2006).
I Intrinsic coregionalization model has been used in the multiclass scenario. Skolidis and Sanguinetti (2011) use the intrinsic coregionalization model for classification, by introducing a probit noise model as the likelihood.
I Posterior distribution is no longer analytically tractable: approximate inference is required.
Computer Emulation
I A statistical model used as a surrogate for a computationally expensive computer model.
I Higdon et al. (2008) use the linear model of coregionalization to model images representing the evolution of the implosion of steel cylinders.
I Conti and O’Hagan (2009) use the ICM to model a vegetation model: the Sheffield Dynamic Global Vegetation Model (Woodward et al., 1998).
References I
E. V. Bonilla, K. M. Chai, and C. K. I. Williams. Multi-task Gaussian process prediction. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, Cambridge, MA, 2008. MIT Press.
S. Conti and A. O’Hagan. Bayesian emulation of complex multi-output and dynamic computer models. Journal of Statistical Planning and Inference, 140(3):640–651, 2009. [DOI].
G. Della Gatta, M. Bansal, A. Ambesi-Impiombato, D. Antonini, C. Missero, and D. di Bernardo. Direct targets of the trp63 transcription factor revealed by a combination of gene expression profiling and reverse engineering. Genome Research, 18(6):939–948, Jun 2008. [URL]. [DOI].
P. Goovaerts. Geostatistics For Natural Resources Evaluation. Oxford University Press, 1997. [Google Books].
J. D. Helterbrand and N. A. C. Cressie. Universal cokriging under intrinsic coregionalization. Mathematical Geology, 26(2):205–226, 1994.
D. M. Higdon, J. Gattiker, B. Williams, and M. Rightley. Computer model calibration using high dimensional output. Journal of the American Statistical Association, 103(482):570–583, 2008.
A. G. Journel and C. J. Huijbregts. Mining Geostatistics. Academic Press, London, 1978. [Google Books].
A. A. Kalaitzis and N. D. Lawrence. A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression. BMC Bioinformatics, 12(180), 2011. [DOI].
N. D. Lawrence and J. C. Platt. Learning to learn with the informative vector machine. In R. Greiner and D. Schuurmans, editors, Proceedings of the International Conference in Machine Learning, volume 21, pages 512–519. Omnipress, 2004. [PDF].
T. P. Minka and R. W. Picard. Learning how to learn is learning with point sets. Available on-line, 1997. [URL]. Revised 1999, available at http://www.stat.cmu.edu/˜minka/.
J. Oakley and A. O’Hagan. Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika, 89(4):769–784, 2002.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. [Google Books].
M. Seeger and M. I. Jordan. Sparse Gaussian Process Classification With Multiple Classes. Technical Report 661, Department of Statistics, University of California at Berkeley, 2004.
References II
G. Skolidis and G. Sanguinetti. Bayesian multitask classification with Gaussian process priors. IEEE Transactions on Neural Networks, 22(12):2011–2021, 2011.
Y. W. Teh, M. Seeger, and M. I. Jordan. Semiparametric latent factor models. In R. G. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 333–340, Barbados, 6-8 January 2005. Society for Artificial Intelligence and Statistics.
H. Wackernagel. Multivariate Geostatistics: An Introduction With Applications. Springer-Verlag, 3rd edition, 2003. [Google Books].
C. K. Williams and D. Barber. Bayesian Classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
I. Woodward, M. R. Lomas, and R. A. Betts. Vegetation-climate feedbacks in a greenhouse world. Philosophical Transactions: Biological Sciences, 353(1365):29–39, 1998.
K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 1012–1019, 2005.