7/29/2019 Machine Learning vs Statistics
http://slidepdf.com/reader/full/machine-learning-vs-statistics 1/66
Gaussian Processes for Machine Learning
Chris Williams
Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, UK
August 2007
Chris Williams ANC
Gaussian Processes for Machine Learning
Overview
1 What is machine learning?
2 Gaussian Processes for Machine Learning
3 Multi-task Learning
1. What is Machine Learning?
The goal of machine learning is to build computer systems that can adapt and learn from their experience. (Dietterich, 1999)
Machine learning usually refers to changes in systems that perform tasks associated with artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control, prediction, etc. (Nilsson, 1996)
Some reasons for adaptation:
Some tasks can be hard to define except via examples
Adaptation can improve a human-built system, or track changes over time
Goals can be autonomous machine performance, or enabling humans to learn from data (data mining)
Roots of Machine Learning
Statistical pattern recognition, adaptive control theory (EE)
Artificial Intelligence: e.g. discovering rules using decision trees, inductive logic programming
Brain models, e.g. neural networks
Psychological models
Statistics
Problems Addressed by Machine Learning
Supervised Learning
model $p(y \mid \mathbf{x})$: regression, classification, etc.
Unsupervised Learning
model $p(\mathbf{x})$: not just clustering!
Reinforcement Learning
Markov decision processes, POMDPs, planning.
[Figure: frames 1 to 6, with panels labelled Mask, Mask * Foreground, and Background (Williams and Titsias, 2004)]
Machine Learning and Statistics
[Diagram: Statistics and Machine Learning, with probabilistic (graphical) models in common]
Same models, but different problems?
Not all machine learning methods are based on probabilistic models, e.g. SVMs, non-negative matrix factorization
Some Differences
Statistics: focus on understanding data in terms of models
Statistics: interpretability, hypothesis testing
Machine Learning: greater focus on prediction
Machine Learning: focus on the analysis of learning algorithms (not just large dataset issues)
Slide from Rob Tibshirani (early 1990s)
NEURAL NETS                       STATISTICS
network                           model
weights                           parameters
learning                          fitting
generalization                    test set performance
supervised learning               regression/classification
unsupervised learning             density estimation
optimal brain damage              model selection
large grant = $100,000            large grant = $10,000
nice place to have a meeting:     nice place to have a meeting:
Snowbird, Utah, French Alps       Las Vegas in August
2. Gaussian Processes for Machine Learning
Gaussian processes
History
Regression, classification and beyond
Covariance functions/kernels
Dealing with hyperparameters
Theory
Approximations for large datasets
Gaussian Processes
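A Gaussian process is a stochastic process specified by its mean and covariance functions
Mean function
$\mu(\mathbf{x}) = E[f(\mathbf{x})]$
often we take $\mu(\mathbf{x}) \equiv 0 \;\; \forall \mathbf{x}$
Covariance function
$k(\mathbf{x}, \mathbf{x}') = E[(f(\mathbf{x}) - \mu(\mathbf{x}))(f(\mathbf{x}') - \mu(\mathbf{x}'))]$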
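A Gaussian process prior over functions can be thought of as a Gaussian prior on the coefficients $\mathbf{w} \sim N(\mathbf{0}, \Lambda)$ where
$Y(\mathbf{x}) = \sum_{i=1}^{N_F} w_i \phi_i(\mathbf{x})$
In many interesting cases, $N_F = \infty$
Can choose $\phi$'s as eigenfunctions of the kernel $k(\mathbf{x}, \mathbf{x}')$ wrt $p(\mathbf{x})$ (Mercer)
$\int k(\mathbf{x}, \mathbf{y}) \, p(\mathbf{x}) \, \phi_i(\mathbf{x}) \, d\mathbf{x} = \lambda_i \phi_i(\mathbf{y})$
(For stationary covariance functions and Lebesgue measure we get instead
$\int k(\mathbf{x} - \mathbf{x}') \, e^{-2\pi i \mathbf{s} \cdot \mathbf{x}'} \, d\mathbf{x}' = S(\mathbf{s}) \, e^{-2\pi i \mathbf{s} \cdot \mathbf{x}}$
where $S(\mathbf{s})$ is the power spectrum)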
$k(x, x') = \sigma_0^2 + \sigma_1^2 x x'$
$k(x, x') = \exp(-|x - x'|)$
$k(x, x') = \exp(-(x - x')^2)$
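The three covariance functions above can be compared by drawing functions from the corresponding zero-mean GP priors. The following is a minimal sketch, assuming NumPy; the function names, the grid of 50 inputs and the jitter value are illustrative choices, not part of the original slides.

```python
import numpy as np

def k_linear(x, xp, sigma0=1.0, sigma1=1.0):
    # k(x, x') = sigma0^2 + sigma1^2 * x * x'
    return sigma0**2 + sigma1**2 * np.outer(x, xp)

def k_exponential(x, xp):
    # k(x, x') = exp(-|x - x'|)
    return np.exp(-np.abs(x[:, None] - xp[None, :]))

def k_sq_exponential(x, xp):
    # k(x, x') = exp(-(x - x')^2)
    return np.exp(-(x[:, None] - xp[None, :]) ** 2)

def sample_prior(kernel, x, n_samples=3, jitter=1e-6, seed=0):
    # Draw zero-mean GP prior samples at the inputs x.
    # The jitter on the diagonal is an assumed numerical safeguard.
    rng = np.random.default_rng(seed)
    K = kernel(x, x) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    return (L @ rng.standard_normal((len(x), n_samples))).T

x = np.linspace(0.0, 1.0, 50)
samples = {
    "linear": sample_prior(k_linear, x),
    "exponential": sample_prior(k_exponential, x),
    "squared exponential": sample_prior(k_sq_exponential, x),
}
```

Each entry of `samples` holds three sampled functions evaluated on the grid; plotting them should show straight lines for the linear kernel, rough paths for the exponential kernel, and smooth curves for the squared exponential kernel.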
Prediction with Gaussian Processes
A non-parametric prior over functions
Although GPs can be infinite-dimensional objects, prediction from a finite dataset is $O(n^3)$
[Figure: two panels plotting f(x) against input x over the range 0 to 1]
Gaussian Process Regression
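Dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, Gaussian likelihood $p(y_i \mid f_i) = N(y_i; f_i, \sigma^2)$
$\bar{f}(\mathbf{x}) = \sum_{i=1}^n \alpha_i k(\mathbf{x}, \mathbf{x}_i)$
where
$\boldsymbol{\alpha} = (K + \sigma^2 I)^{-1} \mathbf{y}$
$\mathrm{var}(f(\mathbf{x})) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}^T(\mathbf{x}) (K + \sigma^2 I)^{-1} \mathbf{k}(\mathbf{x})$
in time $O(n^3)$, with $\mathbf{k}(\mathbf{x}) = (k(\mathbf{x}, \mathbf{x}_1), \ldots, k(\mathbf{x}, \mathbf{x}_n))^T$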
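The predictive equations above translate fairly directly into code. This is a minimal sketch, assuming NumPy, one-dimensional inputs and a squared exponential kernel with an arbitrarily chosen lengthscale; the names `sq_exp` and `gp_predict` are hypothetical. The Cholesky factorization of $K + \sigma^2 I$ is the $O(n^3)$ step.

```python
import numpy as np

def sq_exp(a, b, lengthscale=0.2):
    # Squared exponential kernel on 1-d inputs (lengthscale is an assumed value)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def gp_predict(x_train, y_train, x_test, kernel, noise_var):
    K = kernel(x_train, x_train)
    k_star = kernel(x_train, x_test)  # column j is k(x) for test point j
    # Cholesky factor of K + sigma^2 I: the O(n^3) step
    L = np.linalg.cholesky(K + noise_var * np.eye(len(x_train)))
    # alpha = (K + sigma^2 I)^{-1} y via two triangular systems
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = k_star.T @ alpha
    # var(f(x)) = k(x, x) - k(x)^T (K + sigma^2 I)^{-1} k(x)
    v = np.linalg.solve(L, k_star)
    var = np.diag(kernel(x_test, x_test)) - np.sum(v**2, axis=0)
    return mean, var

x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train)
mean, var = gp_predict(x_train, y_train, x_train, sq_exp, noise_var=1e-4)
```

With low noise the predictive mean at the training inputs stays close to the observed targets, and the predictive variance there is small.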
Some GP History
1940s: Wiener, Kolmogorov (time series)
Geostatistics (Matheron, 1973), Whittle (1963)
O’Hagan (1978); Sacks et al (Design and Analysis of Computer Experiments, 1989)
Williams and Rasmussen (1996), inspired by Neal’s (1996) construction of GPs from neural networks with an infinite number of hidden units
Regularization framework (Tikhonov and Arsenin, 1977; Poggio and Girosi, 1990); MAP rather than fully probabilistic
SVMs (Vapnik, 1995): non-probabilistic, use “kernel trick” and quadratic programming
Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Chris Williams, MIT Press, 2006
New: available online
Regression, classification and beyond
Regression with Gaussian noise: e.g. robot arm inverse dynamics (21-d input space)
Classification: binary, multiclass, e.g. handwritten digit classification
ML community tends to use approximations to deal with non-Gaussian likelihoods, cf MCMC in statistics?
MAP solution, Laplace approximation
Expectation Propagation (Minka, 2001; see also Opper and Winther, 2000)
Other likelihoods (e.g. Poisson), observations of derivatives, uncertain inputs, mixtures of GPs
Covariance functions
Covariance function is key entity, determining notion of similarity
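Squared exponential (“Gaussian”) covariance function is widely applied in ML; Matern kernel not very widely used
Polynomial kernel $k(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x} \cdot \mathbf{x}')^p$ is popular in kernel machines literature
Neural network covariance function (Williams, 1998)
$k_{NN}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \sin^{-1}\left( \frac{2 \tilde{\mathbf{x}}^T M \tilde{\mathbf{x}}'}{\sqrt{(1 + 2 \tilde{\mathbf{x}}^T M \tilde{\mathbf{x}})(1 + 2 \tilde{\mathbf{x}}'^T M \tilde{\mathbf{x}}')}} \right)$
where $\tilde{\mathbf{x}} = (1, x_1, \ldots, x_D)^T$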
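As an illustration of the neural network covariance function, here is a minimal sketch assuming NumPy. The function name `k_nn` and the choice of $M$ as the identity are assumptions made for the example; $M$ should be positive semi-definite so that the argument of the arcsine stays inside $(-1, 1)$.

```python
import numpy as np

def k_nn(X, Xp, M, sigma_f=1.0):
    # Augment each input with a leading 1: x_tilde = (1, x_1, ..., x_D)
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    Xpa = np.hstack([np.ones((Xp.shape[0], 1)), Xp])
    num = 2.0 * Xa @ M @ Xpa.T
    # 1 + 2 x_tilde^T M x_tilde for every row of each input set
    da = 1.0 + 2.0 * np.sum((Xa @ M) * Xa, axis=1)
    db = 1.0 + 2.0 * np.sum((Xpa @ M) * Xpa, axis=1)
    return sigma_f**2 * np.arcsin(num / np.sqrt(np.outer(da, db)))

X = np.array([[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]])
M = np.eye(3)  # (D + 1) x (D + 1), assumed identity for illustration
K = k_nn(X, X, M)
```

Unlike the squared exponential, this covariance is not stationary: its value depends on the inputs themselves and not only on their difference.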
String kernels: let $\phi_s(x)$ denote the number of times a substring $s$ appears in string $x$
$k(x, x') = \sum_s w_s \phi_s(x) \phi_s(x')$
(Watkins, 1999; Haussler, 1999).
Efficient methods using suffix trees to compute certain string kernels in time $|x| + |x'|$ (Leslie et al, 2003; Vishwanathan and Smola, 2003)
Extended to tree kernels (Collins and Duffy, 2002)
Fisher kernel
$\phi_\theta(x) = \nabla_\theta \log p(x \mid \theta)$
$k(x, x') = \phi_\theta(x)^T F^{-1} \phi_\theta(x')$
where $F$ is the Fisher information matrix (Jaakkola et al, 2000)
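The string kernel definition can be made concrete with a naive implementation. The sketch below is not the suffix-tree method cited above: it enumerates substrings directly and so is quadratic, not linear, in the string lengths. The weighting $w_s = \text{decay}^{|s|}$ and the cap `max_len` on substring length are assumptions chosen for illustration.

```python
from collections import Counter

def substring_counts(x, max_len):
    # phi_s(x): number of times each substring s (up to max_len) occurs in x
    counts = Counter()
    for length in range(1, max_len + 1):
        for start in range(len(x) - length + 1):
            counts[x[start:start + length]] += 1
    return counts

def string_kernel(x, xp, max_len=3, decay=0.5):
    # k(x, x') = sum_s w_s phi_s(x) phi_s(x'), with assumed w_s = decay^{|s|}
    phi_x = substring_counts(x, max_len)
    phi_xp = substring_counts(xp, max_len)
    return sum(decay ** len(s) * c * phi_xp[s]
               for s, c in phi_x.items() if s in phi_xp)

k_same = string_kernel("ab", "ab", max_len=2, decay=1.0)
k_disjoint = string_kernel("abc", "xyz")
```

With `decay=1.0` the strings "ab" and "ab" share the substrings "a", "b" and "ab" once each, so the kernel value is 3; strings with no common substring give 0.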
Automatic Relevance Determination
$k_{SE}(\mathbf{x}_p, \mathbf{x}_q) = \sigma_f^2 \exp\left( -\tfrac{1}{2} (\mathbf{x}_p - \mathbf{x}_q)^T M (\mathbf{x}_p - \mathbf{x}_q) \right)$
Isotropic: $M = \ell^{-2} I$
ARD: $M = \mathrm{diag}(\ell_1^{-2}, \ell_2^{-2}, \ldots, \ell_D^{-2})$
[Figure: two surface plots of output y against input x1 and input x2]
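A short sketch of the ARD squared exponential kernel, assuming NumPy; the function name `k_se_ard` and the example lengthscales are illustrative. A very large lengthscale on one input dimension makes the kernel, and hence the sampled functions, effectively ignore that dimension.

```python
import numpy as np

def k_se_ard(Xp, Xq, lengthscales, sigma_f=1.0):
    # M = diag(l_1^{-2}, ..., l_D^{-2})
    m_diag = 1.0 / np.asarray(lengthscales) ** 2
    diff = Xp[:, None, :] - Xq[None, :, :]
    quad = np.sum(diff**2 * m_diag, axis=-1)  # (x_p - x_q)^T M (x_p - x_q)
    return sigma_f**2 * np.exp(-0.5 * quad)

X = np.array([[0.0, 0.0], [1.0, 5.0]])
# The second dimension has a huge lengthscale, so it is effectively irrelevant
K = k_se_ard(X, X, lengthscales=[1.0, 1e6])
```

Here the two inputs differ by 5 in the second dimension, yet the covariance between them is determined almost entirely by their unit separation in the first dimension.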
Dealing with hyperparameters
Criteria for model selection
Marginal likelihood $p(\mathbf{y} \mid X, \theta)$
Estimate the generalization error: LOO-CV $\sum_{i=1}^n \log p(y_i \mid \mathbf{y}_{-i}, X, \theta)$
Bound the generalization error (e.g. PAC-Bayes)
Typically do ML-II rather than sampling of $p(\theta \mid X, \mathbf{y})$
Optimize by gradient descent (etc) on objective function
SVMs do not generally have good methods for kernel selection
How the marginal likelihood works
[Figure: log probability against characteristic lengthscale, with curves for minus complexity penalty, data fit and marginal likelihood]
$\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2} \mathbf{y}^T K_y^{-1} \mathbf{y} - \tfrac{1}{2} \log |K_y| - \tfrac{n}{2} \log 2\pi$
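The trade-off between data fit and complexity penalty can be inspected numerically by computing the terms of the log marginal likelihood separately. This is a minimal sketch assuming NumPy; the function name is hypothetical, and `K_y` is taken to be the full noisy covariance matrix (for example $K + \sigma^2 I$).

```python
import numpy as np

def log_marginal_likelihood(K_y, y):
    n = len(y)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_y^{-1} y
    data_fit = -0.5 * y @ alpha                           # -1/2 y^T K_y^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))              # -1/2 log|K_y|
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + constant, data_fit, complexity

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
K_y = B @ B.T + 5.0 * np.eye(5)  # an arbitrary positive definite matrix
y = rng.standard_normal(5)
total, data_fit, complexity = log_marginal_likelihood(K_y, y)
```

Evaluating this over a range of lengthscales and plotting `data_fit` and `complexity` against each other gives curves of the kind shown in the figure on this slide.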
Marginal Likelihood and Local Optima
[Figure: two panels plotting output y against input x over the range −5 to 5]
There can be multiple optima of the marginal likelihood
These correspond to different interpretations of the data
The Baby and the Bathwater
MacKay (2003 ch 45): In moving from neural networks to kernel machines did we throw out the baby with the bathwater? i.e. the ability to learn hidden features/representations
But consider $M = \Lambda \Lambda^T$ for $\Lambda$ being $D \times k$, for $k < D$
The $k$ columns of $\Lambda$ can identify directions in the input space with specially high relevance (Vivarelli and Williams, 1999)
Theory
Equivalent kernel (Silverman, 1984)
Consistency (Diaconis and Freedman, 1986; Choudhuri, Ghosal and Roy 2005; Choi and Schervish, 2004)
Average case learning curves
PAC-Bayesian analysis for GPs (Seeger, 2003)
$p_D\{ R_L(f_D) \le \hat{R}_L(f_D) + \mathrm{gap}(f_D, D, \delta) \} \ge 1 - \delta$
where $R_L(f_D)$ is the expected risk, and $\hat{R}_L(f_D)$ is the empirical (training) risk
Approximation Methods for Large Datasets
Fast approximate solution of the linear system
Subset of Data
Subset of Regressors
Inducing Variables
Projected Process Approximation
FITC, PITC, BCM
SPGP
Empirical Comparison
Some interesting recent uses for Gaussian Processes
Modelling transcriptional regulation using Gaussian Processes. Neil D. Lawrence, Guido Sanguinetti, Magnus Rattray (NIPS 2006)
A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo. Oliver Williams (NIPS 2006)
Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods. Yaakov Engel, Peter Szabo, Dmitry Volkinshtein (NIPS 2005)
Worst-Case Bounds for Gaussian Process Models. Sham Kakade, Matthias Seeger, Dean Foster (NIPS 2005)
Infinite Mixtures of Gaussian Process Experts. Carl Rasmussen, Zoubin Ghahramani (NIPS 2002)
3. Multi-task Learning
There are multiple (possibly) related tasks, and we wish to avoid tabula rasa learning by sharing information across tasks
E.g. Task clustering, inter-task correlations
Two cases:
With task-descriptor features $\mathbf{t}$
Without task-descriptor features, based solely on task identities
Joint work with Edwin Bonilla & Felix Agakov (AISTATS 2007) and Kian Ming Chai
Multi-task Learning using Task-specific Features
$M$ tasks, learn mapping $g_i(\mathbf{x})$, $i = 1, \ldots, M$
$\mathbf{t}_i$ is task descriptor (task-specific feature vector) for task $i$
$g_i(\mathbf{x}) = g(\mathbf{t}_i, \mathbf{x})$: potential for transfer across tasks
Our motivation is for compiler performance prediction, where there are multiple benchmark programs (=tasks), and $\mathbf{x}$ describes sequences of code transformations
Another example: predicting school pupil performance based on pupil and school features
We particularly care about the case when we have very little data from the test task; here inter-task transfer will be most important
Overview
Model setup
Related work
Experimental setup, feature representation
Results
Discussion
Task-descriptor Model
$\mathbf{z} = \begin{pmatrix} \mathbf{x} \\ \mathbf{t} \end{pmatrix}$
$k(\mathbf{z}, \mathbf{z}') = k_x(\mathbf{x}, \mathbf{x}') \, k_t(\mathbf{t}, \mathbf{t}')$
Decomposition into task similarity ($k_t$) and input similarity ($k_x$)
For the widely-used “Gaussian” kernel, this occurs naturally
Independent tasks if $k_t(\mathbf{t}_i, \mathbf{t}_j) = \delta_{ij}$
C.f. co-kriging in geostatistics (e.g. Wackernagel, 1998)
Without task-descriptors, simply parameterize $K_t$
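The product decomposition can be checked numerically. The sketch below assumes NumPy and unit-lengthscale “Gaussian” kernels; the names `sq_exp` and `k_multitask` and the toy inputs are illustrative. It shows two things: a Gaussian kernel on the stacked vector $\mathbf{z}$ factorizes into input and task parts, and when every task is observed at the same inputs the full covariance is a Kronecker product of the task and input matrices.

```python
import numpy as np

def sq_exp(A, B):
    # "Gaussian" kernel with unit lengthscale on row-vector inputs
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2)

def k_multitask(X, T, Xp, Tp):
    # k(z, z') = k_x(x, x') * k_t(t, t')
    return sq_exp(X, Xp) * sq_exp(T, Tp)

X = np.array([[0.0], [0.5], [1.0]])      # 3 inputs
T = np.array([[0.0, 1.0], [1.0, 0.0]])   # 2 task descriptors

# Every task observed at every input: tasks vary slowest, inputs fastest
X_all = np.tile(X, (2, 1))
T_all = np.repeat(T, 3, axis=0)

K_full = k_multitask(X_all, T_all, X_all, T_all)
K_kron = np.kron(sq_exp(T, T), sq_exp(X, X))
K_stacked = sq_exp(np.hstack([X_all, T_all]), np.hstack([X_all, T_all]))
```

All three matrices agree. Replacing `sq_exp(T, T)` with the identity would give a block-diagonal covariance, which is the independent-tasks case.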
Related Work
Work using task-specific features
Bakker and Heskes (2003) use neural networks. These can be tricky to train (local optima, number of hidden units etc)
Yu et al (NIPS 2006, Stochastic Relational Models for Discriminative Link Prediction)
General work on Multi-task Learning
What should be transferred?
Early work: Thrun (1996), Caruana (1997)
Minka and Picard (1999); multiple tasks share same GP hyperparameters (but are uncorrelated)
Evgeniou et al (2005): induce correlations between tasks based on a correlated prior over linear regression parameters (special case of co-kriging)
Multilevel (or hierarchical) modelling in statistics (e.g. Goldstein, 2003)
Compiler Performance Prediction
Goal: Predict speedup of a new program under a given sequence of compiler transformations
Only have a limited number of runs of the new program, but also have data from other (related?) tasks
Speedup $s$ measured as
$s(\mathbf{x}) = \frac{\mathrm{time(baseline)}}{\mathrm{time}(\mathbf{x})}$
Example Transformation
Loop unrolling
// original loop
for (i = 0; i < 100; i++)
    a[i] = b[i] + c[i];

// loop unrolled twice
for (i = 0; i < 100; i += 2) {
    a[i]   = b[i]   + c[i];
    a[i+1] = b[i+1] + c[i+1];
}
Experimental Setup
Benchmarks: 11 C programs from UTDSP
Transformations: Source-to-source using SUIF
Platform: TI C6713 board
13 transformations in sequences up to length 5, using each transformation at most once ⇒ 88214 sequences per benchmark (exhaustively enumerated)
Significant speedups can be obtained (max is 1.84)
Input Features x
Code features (C), or transformation-based representation (T)
Code features: extract features from transformed program based on knowledge of compiler experts (code size, instructions executed, parallelism)
83 features reduced to 15-d with PCA
Transformation-based representation: length-13 bit vector stating what transformations were used (“bag of characters”)
Task-specific features t
Record the speedup on a small number of canonical sequences: response-based approach
Canonical sequences selected by principal variables method (McCabe, 1984)
A variety of possible criteria can be used, e.g. maximize $|\Sigma_{S(1)}|$, minimize $\mathrm{tr}(\Sigma_{S(2) \mid S(1)})$. Use greedy selection
We don’t use all 88214 sequences to define the canonical sequences, only 2048. In our experiments we use 8 canonical variables
Could consider e.g. code features from untransformed programs, but experimentally response-based method is superior
Experiments
LOO-CV setup (leave out one task at a time)
Therefore 10 reference tasks for each prediction task; we used $n_r = 256$ examples per benchmark
Use $n_{te}$ examples from the test task ($n_{te} \ge 8$)
Assess performance using mean absolute error (MAE) on all remaining test sequences
Comparison to baseline “no transfer” method using just data from test task
Used GP regression prediction with squared exponential kernel
ARD was used, except for “no transfer” case when $n_{te} \le 64$
[Figure: MAE per benchmark (ADP, COM, EDG, FFT, FIR, HIS, IIR, LAT, LMS, LPC, SPE) and on average (AVG) for [T] COMBINED, [T] GATING, [C] COMBINED, [C] GATING, MEDIAN CANONICALS and MEDIAN TEST; one off-scale bar is labelled 0.540 (0.007)]
Results
T-combined is best overall (av MAE is 0.0576, compared to 0.1162 for median canonicals)
T-combined generally either improves performance or leaves it about the same compared to T-no-transfer-canonicals
[Figure: MAE against number of test samples included for training (8, 16, 32, 64, 128) for the HISTOGRAM and EDGE_DETECT benchmarks, comparing [T] COMBINED, [T] NO TRANSFER RANDOM, [T] NO TRANSFER CANONICALS and MEDIAN CANONICALS]
[Figure: MAE against number of test samples included for training (8, 16, 32, 64, 128) for the ADPCM benchmark and for ALL BENCHMARKS, comparing [T] COMBINED, [T] NO TRANSFER RANDOM, [T] NO TRANSFER CANONICALS and MEDIAN CANONICALS]
T-combined generally improves performance or leaves it about the same compared to the best “no transfer” scenario
Understanding Task Relatedness
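GP predictive mean is
$\bar{s}(\mathbf{z}_*) = \mathbf{k}^T(\mathbf{z}_*) (K_f \otimes K_x + \sigma^2 I)^{-1} \mathbf{s}$
Can look at $K_f$, but difficult to interpret?
Predictive mean $\bar{s}(\mathbf{z}_*) = \mathbf{h}^T(\mathbf{z}_*) \mathbf{s}$, where
$\mathbf{h}^T(\mathbf{z}_*) = (h^1_1, \ldots, h^1_{n_r}, \ldots, h^M_1, \ldots, h^M_{n_r}, h^{M+1}_1, \ldots, h^{M+1}_{n_{te}})$
Measure contribution of task $i$ on test point $\mathbf{z}_*$ by computing
$r_i(\mathbf{z}_*) = \frac{|\mathbf{h}^i(\mathbf{z}_*)|}{|\mathbf{h}(\mathbf{z}_*)|}$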
Average $r$'s over test examples
[Figure: matrix of average task contributions, with rows and columns indexed by the benchmarks adp, com, edg, fft, fir, his, iir, lat, lms, lpc, spe]
Discussion
Our focus is on the hard problem of prediction on a new task given very little data for that task
The presented method allows sharing over tasks. This should be beneficial, but note that “no transfer” method has the freedom to use different hyperparams on each task
Can learn similarity between tasks directly (unparameterized $K_t$), but this is not so easy if $n_{te}$ is very small
Note that there is no inter-task transfer in noiseless case! (autokrigeability)
General Conclusions
Key issues:
Designing/discovering covariance functions suitable for various types of data
Methods for setting/inference of hyperparameters
Dealing with large datasets
Gaussian Process Regression
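Dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, Gaussian likelihood $p(y_i \mid f_i) = N(y_i; f_i, \sigma^2)$
$\bar{f}(\mathbf{x}) = \sum_{i=1}^n \alpha_i k(\mathbf{x}, \mathbf{x}_i)$
where
$\boldsymbol{\alpha} = (K + \sigma^2 I)^{-1} \mathbf{y}$
$\mathrm{var}(f(\mathbf{x})) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}^T(\mathbf{x}) (K + \sigma^2 I)^{-1} \mathbf{k}(\mathbf{x})$
in time $O(n^3)$, with $\mathbf{k}(\mathbf{x}) = (k(\mathbf{x}, \mathbf{x}_1), \ldots, k(\mathbf{x}, \mathbf{x}_n))^T$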
Fast approximate solution of linear systems
Iterative solution of $(K + \sigma_n^2 I)\mathbf{v} = \mathbf{y}$, e.g. using Conjugate Gradients. Minimizing
$\tfrac{1}{2} \mathbf{v}^T (K + \sigma_n^2 I) \mathbf{v} - \mathbf{y}^T \mathbf{v}$.
This takes $O(kn^2)$ for $k$ iterations.
Fast approximate matrix-vector multiplication
$\sum_{i=1}^n k(\mathbf{x}_j, \mathbf{x}_i) v_i$
$k$-d tree / dual tree methods (Gray, 2004; Shen, Ng and Seeger, 2006; De Freitas et al 2006)
Improved Fast Gauss transform (Yang et al, 2005)
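The iterative scheme can be sketched as a plain conjugate gradient loop. This is a minimal illustration assuming NumPy and a dense matrix; the function name, tolerance and test matrix are assumptions. The single matrix-vector product per iteration is the $O(n^2)$ step that the fast approximate multiplication methods aim to speed up.

```python
import numpy as np

def conjugate_gradients(A, y, n_iter, tol=1e-10):
    # Solve A v = y for symmetric positive definite A, e.g. A = K + sigma_n^2 I
    v = np.zeros_like(y, dtype=float)
    r = y - A @ v          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(n_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = A @ p         # the O(n^2) matrix-vector product
        step = rs / (p @ Ap)
        v += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v

rng = np.random.default_rng(1)
B = rng.standard_normal((20, 20))
A = B @ B.T + 20.0 * np.eye(20)  # stands in for K + sigma_n^2 I
y = rng.standard_normal(20)
v = conjugate_gradients(A, y, n_iter=40)
```

Stopping after a small number of iterations gives an approximate solution; how many iterations are needed depends on the conditioning of $K + \sigma_n^2 I$.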
Subset of Data
Simply keep $m$ datapoints, discard the rest: $O(m^3)$
Can choose the subset randomly, or by a greedy selection criterion
If we are prepared to do work for each test point, can select training inputs nearby to the test point. Stein (Ann. Stat., 2002) shows that a screening effect operates for some covariance functions
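[Figure: the $n \times n$ matrix $K$, with the $m \times m$ block $K_{uu}$ and the $m \times n$ block $K_{uf}$ marked]
$\tilde{K} = K_{fu} K_{uu}^{-1} K_{uf}$
Nystrom approximation to $K$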
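A small numerical sketch of the Nystrom approximation, assuming NumPy. The kernel, the regular choice of every fifth point as the inducing subset, and the jitter added to $K_{uu}$ are all illustrative assumptions.

```python
import numpy as np

def sq_exp_1d(a, b, lengthscale=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def nystrom(K_fu, K_uu, jitter=1e-8):
    # K_tilde = K_fu K_uu^{-1} K_uf, with a small assumed jitter for stability
    m = K_uu.shape[0]
    return K_fu @ np.linalg.solve(K_uu + jitter * np.eye(m), K_fu.T)

x = np.linspace(0.0, 5.0, 30)
idx = np.arange(0, 30, 5)        # indices of the m = 6 points in the subset
K = sq_exp_1d(x, x)
K_fu = K[:, idx]
K_uu = K[np.ix_(idx, idx)]
K_tilde = nystrom(K_fu, K_uu)
```

The approximation has rank at most $m$, reproduces $K$ on the rows and columns belonging to the subset, and leaves a positive semi-definite residual $K - \tilde{K}$.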
Subset of Regressors
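Silverman (1985) showed that the mean GP predictor can be obtained from the finite-dimensional model
$f(\mathbf{x}_*) = \sum_{i=1}^n \alpha_i k(\mathbf{x}_*, \mathbf{x}_i)$
with a prior $\boldsymbol{\alpha} \sim N(\mathbf{0}, K^{-1})$
A simple approximation to this model is to consider only a subset of regressors
$f_{SR}(\mathbf{x}_*) = \sum_{i=1}^m \alpha_i k(\mathbf{x}_*, \mathbf{x}_i)$, with $\boldsymbol{\alpha}_u \sim N(\mathbf{0}, K_{uu}^{-1})$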
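$\bar{f}_{SR}(\mathbf{x}_*) = \mathbf{k}_u^T(\mathbf{x}_*) (K_{uf} K_{fu} + \sigma_n^2 K_{uu})^{-1} K_{uf} \mathbf{y}$,
$V[f_{SR}(\mathbf{x}_*)] = \sigma_n^2 \mathbf{k}_u^T(\mathbf{x}_*) (K_{uf} K_{fu} + \sigma_n^2 K_{uu})^{-1} \mathbf{k}_u(\mathbf{x}_*)$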
SoR corresponds to using a degenerate GP prior (finite rank)
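The subset of regressors predictor can be sketched as follows, assuming NumPy; the function names, kernel and data are illustrative. Only an $m \times m$ system is solved, and forming $K_{uf} K_{fu}$ costs $O(m^2 n)$.

```python
import numpy as np

def se(a, b, lengthscale=0.7):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def sor_predict(K_uu, K_uf, k_u_star, y, noise_var):
    # A = K_uf K_fu + sigma_n^2 K_uu is m x m; forming it costs O(m^2 n)
    A = K_uf @ K_uf.T + noise_var * K_uu
    mean = k_u_star.T @ np.linalg.solve(A, K_uf @ y)
    var = noise_var * np.sum(k_u_star * np.linalg.solve(A, k_u_star), axis=0)
    return mean, var

x = np.linspace(0.0, 5.0, 40)           # n = 40 training inputs
y = np.sin(x)
x_u = x[::5]                            # m = 8 regressors, a subset of the inputs
x_star = np.array([0.3, 2.2, 4.9])
mean, var = sor_predict(se(x_u, x_u), se(x_u, x), se(x_u, x_star), y, noise_var=0.1)
```

When the subset is the whole training set ($m = n$, with $K$ invertible) the SoR mean coincides with the full GP predictive mean, which gives a convenient sanity check.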
Inducing Variables
Quinonero-Candela and Rasmussen (JMLR, 2005)
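$p(\mathbf{f}_* \mid \mathbf{y}) = \frac{1}{p(\mathbf{y})} \int p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f}, \mathbf{f}_*) \, d\mathbf{f}$
Now introduce inducing variables $\mathbf{u}$
$p(\mathbf{f}, \mathbf{f}_*) = \int p(\mathbf{f}, \mathbf{f}_*, \mathbf{u}) \, d\mathbf{u} = \int p(\mathbf{f}, \mathbf{f}_* \mid \mathbf{u}) \, p(\mathbf{u}) \, d\mathbf{u}$
Approximation
$p(\mathbf{f}, \mathbf{f}_*) \simeq q(\mathbf{f}, \mathbf{f}_*) \overset{\mathrm{def}}{=} \int q(\mathbf{f} \mid \mathbf{u}) \, q(\mathbf{f}_* \mid \mathbf{u}) \, p(\mathbf{u}) \, d\mathbf{u}$
$q(\mathbf{f} \mid \mathbf{u})$: training conditional
$q(\mathbf{f}_* \mid \mathbf{u})$: test conditional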
[Figure: graphical model linking the inducing variables u to f and f*]
Inducing variables can be:
(sub)set of training points
(sub)set of test points
new x points
Projected Process Approximation (PP)
(Csato & Opper, 2002; Seeger et al, 2003; aka PLV, DTC)
Inducing variables are subset of training points
$q(\mathbf{y} \mid \mathbf{u}) = N(\mathbf{y} \mid K_{fu} K_{uu}^{-1} \mathbf{u}, \sigma_n^2 I)$
$K_{fu} K_{uu}^{-1} \mathbf{u}$ is mean prediction for $\mathbf{f}$ given $\mathbf{u}$
Predictive mean for PP is the same as SR, but variance is never smaller. SR is like PP but with deterministic $q(\mathbf{f}_* \mid \mathbf{u})$
[Figure: graphical model with u connected to each of f_1, f_2, ..., f_n and f*]
FITC, PITC and BCM
See Quinonero-Candela and Rasmussen (2005) for overview
Under PP, $q(\mathbf{f} \mid \mathbf{u}) = N(\mathbf{f} \mid K_{fu} K_{uu}^{-1} \mathbf{u}, 0)$
Instead FITC (Snelson and Ghahramani, 2005) uses individual predictive variances $\mathrm{diag}[K_{ff} - K_{fu} K_{uu}^{-1} K_{uf}]$, i.e. fully independent training conditionals
PP can make poor predictions in low noise [S Q-C M R W]
PITC uses blocks of training points to improve the approximation
BCM (Tresp, 2000) is the same approximation as PITC, except that the test points are the inducing set
Sparse GPs using Pseudo-inputs
(Snelson and Ghahramani, 2006)
FITC approximation, but inducing inputs are new points, in neither the training nor test sets
Locations of the inducing inputs are changed along with hyperparameters so as to maximize the approximate marginal likelihood
Complexity
Method     Storage   Initialization   Mean     Variance
SD         O(m^2)    O(m^3)           O(m)     O(m^2)
SR         O(mn)     O(m^2 n)         O(m)     O(m^2)
PP, FITC   O(mn)     O(m^2 n)         O(m)     O(m^2)
BCM        O(mn)     -                O(mn)    O(mn)
Empirical Comparison
Robot arm problem, 44,484 training cases in 21-d, 4,449 test cases
For SD method subset of size $m$ was chosen at random, hyperparameters set by optimizing marginal likelihood (ARD). Repeated 10 times
For SR, PP and BCM methods same subsets/hyperparameters were used (BCM: hyperparameters only)
Method  m      SMSE               MSLL                mean runtime (s)
SD      256    0.0813 ± 0.0198    -1.4291 ± 0.0558    0.8
        512    0.0532 ± 0.0046    -1.5834 ± 0.0319    2.1
        1024   0.0398 ± 0.0036    -1.7149 ± 0.0293    6.5
        2048   0.0290 ± 0.0013    -1.8611 ± 0.0204    25.0
        4096   0.0200 ± 0.0008    -2.0241 ± 0.0151    100.7
SR      256    0.0351 ± 0.0036    -1.6088 ± 0.0984    11.0
        512    0.0259 ± 0.0014    -1.8185 ± 0.0357    27.0
        1024   0.0193 ± 0.0008    -1.9728 ± 0.0207    79.5
        2048   0.0150 ± 0.0005    -2.1126 ± 0.0185    284.8
        4096   0.0110 ± 0.0004    -2.2474 ± 0.0204    927.6
PP      256    0.0351 ± 0.0036    -1.6940 ± 0.0528    17.3
        512    0.0259 ± 0.0014    -1.8423 ± 0.0286    41.4
        1024   0.0193 ± 0.0008    -1.9823 ± 0.0233    95.1
        2048   0.0150 ± 0.0005    -2.1125 ± 0.0202    354.2
        4096   0.0110 ± 0.0004    -2.2399 ± 0.0160    964.5
BCM     256    0.0314 ± 0.0046    -1.7066 ± 0.0550    506.4
        512    0.0281 ± 0.0055    -1.7807 ± 0.0820    660.5
        1024   0.0180 ± 0.0010    -2.0081 ± 0.0321    1043.2
        2048   0.0136 ± 0.0007    -2.1364 ± 0.0266    1920.7
[Figure: SMSE against time (s), with time on a log scale from 10^-1 to 10^4, for SD, SR and PP, and BCM]
[Figure: MSLL against time (s), with time on a log scale from 10^-1 to 10^4, for SD, SR, PP and BCM]
Judged on time, for this dataset SD, SR and PP are on the same trajectory, with BCM being worse
But what about greedy vs random subset selection, methods to set hyperparameters, different datasets?
In general, we must take into account training (initialization), testing and hyperparameter learning times separately [S Q-C M R W]. Balance will depend on your situation.