
Institutionen för systemteknik
Department of Electrical Engineering

Master's Thesis (Examensarbete)

Continuous Occupancy Mapping Using Gaussian Processes

Master's thesis carried out in Automatic Control
at the Institute of Technology, Linköping University

by

Emanuel Walldén Viklund, Johan Wågberg

LiTH-ISY-EX--12/4626--SE

Linköping 2012

Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden


Continuous Occupancy Mapping Using Gaussian Processes

Master's thesis carried out in Automatic Control
at the Institute of Technology, Linköping University

by

Emanuel Walldén Viklund, Johan Wågberg

LiTH-ISY-EX--12/4626--SE

Supervisors: Lionel Ott, ACFR, University of Sydney
             Simon O'Callaghan, ACFR, University of Sydney
             Karl Granström, ISY, Linköpings universitet
             Fabio Ramos, ACFR, University of Sydney

Examiner: Thomas Schön, ISY, Linköpings universitet

Linköping, 15 September 2012


Avdelning, Institution / Division, Department: Division of Automatic Control, Department of Electrical Engineering, SE-581 83 Linköping

Datum / Date: 2012-09-15

Språk / Language: Engelska / English
Rapporttyp / Report category: Examensarbete

URL för elektronisk version: http://www.ep.liu.se

ISRN: LiTH-ISY-EX--12/4626--SE

Titel / Title:
Kontinuerlig kartering med Gaussprocesser
Continuous Occupancy Mapping Using Gaussian Processes

Författare / Author: Emanuel Walldén Viklund, Johan Wågberg

Sammanfattning / Abstract:

The topic of this thesis is occupancy mapping for mobile robots, with an emphasis on a novel method for continuous occupancy mapping using Gaussian processes. In the new method, spatial correlation is accounted for in a natural way, and an a priori discretization of the area to be mapped is not necessary, as it is in most other common methods. The main contribution of this thesis is the construction of a Gaussian process library for C++, and the use of this library to implement the continuous occupancy mapping algorithm. The continuous occupancy mapping is evaluated using both simulated and real world experimental data.

The main result is that the method, in its current form, is not fit for online operation due to its computational complexity. By using approximations and ad hoc solutions, the method can be run in real time on a mobile robot, though not without losing many of its benefits.

Nyckelord / Keywords: Gaussian process, regression, classification, mapping, integral kernel


Abstract

The topic of this thesis is occupancy mapping for mobile robots, with an emphasis on a novel method for continuous occupancy mapping using Gaussian processes. In the new method, spatial correlation is accounted for in a natural way, and an a priori discretization of the area to be mapped is not necessary, as it is in most other common methods. The main contribution of this thesis is the construction of a Gaussian process library for C++, and the use of this library to implement the continuous occupancy mapping algorithm. The continuous occupancy mapping is evaluated using both simulated and real world experimental data.

The main result is that the method, in its current form, is not fit for online operation due to its computational complexity. By using approximations and ad hoc solutions, the method can be run in real time on a mobile robot, though not without losing many of its benefits.


Acknowledgments

We would like to direct our most sincere thanks to Simon O'Callaghan and Lionel Ott. Simon's theoretical knowledge, encouragement and honest interest in our work, together with Lionel's almost super-human computer skills, have helped us get to where we are today. Without them both, the thesis would not have come close to what it is today, and for this we are deeply grateful. Furthermore, we thank Karl Granström and Thomas Schön for their feedback, support and great interest in our work, and Fabio Ramos for inviting us to Sydney.


Contents

Notation

1 Introduction
  1.1 Background
  1.2 Objectives
    1.2.1 Limitations
  1.3 Contributions
  1.4 Thesis Outline

2 Gaussian Processes
  2.1 Introduction
  2.2 Regression
    2.2.1 Feature space
    2.2.2 Function space
  2.3 Classification
    2.3.1 Binary classification
  2.4 Covariance functions
    2.4.1 Hyperparameters
    2.4.2 Examples
  2.5 Integral Observations
  2.6 Non-stationary kernel
    2.6.1 Construction of the non-stationary covariance function
    2.6.2 Making predictions
    2.6.3 Learning Hyperparameters
  2.7 Multi-Class Gaussian Processes
    2.7.1 Efficient Multi-Class Classification

3 Implementation Details
  3.1 Introduction
  3.2 Layout
  3.3 Design notes
    3.3.1 Integral Kernel
    3.3.2 Threading
    3.3.3 Optimization

4 Mapping
  4.1 Introduction
    4.1.1 Mapping using Gaussian processes
  4.2 Occupancy Mapping Using Gaussian Processes
  4.3 Active sampling
    4.3.1 Examples
  4.4 Splitting and merging Gaussian processes
    4.4.1 Bayesian Committee Machine
    4.4.2 Slim Bayesian Committee Machine
  4.5 Robot Operating System

5 Experiments and Results
  5.1 gp results
    5.1.1 Classification
    5.1.2 Multi-Class Classification
  5.2 Library results
  5.3 Mapping results
    5.3.1 Simulated Data
    5.3.2 Real World Data
    5.3.3 Active sampling

6 Conclusions
  6.1 Brief summary
  6.2 Occupancy mapping
  6.3 Future work

A Multivariate Gaussian Distribution

B Covariance functions and their derivatives
  B.1 Linear kernel
  B.2 Matérn 1 kernel
  B.3 Matérn 3 kernel
  B.4 Matérn 5 kernel
  B.5 Neural Network kernel
  B.6 Polynomial kernel
  B.7 Rational Quadratic
  B.8 Squared Exponential kernel

Bibliography


Notation

Notation    Meaning
x           scalar value
x (bold)    vector
X           matrix
xᵀ          transpose
I           identity matrix
0           column vector of zeros
|X|         determinant
N(µ, Σ)     multivariate Gaussian distribution with mean vector µ and covariance matrix Σ
p(x|y)      distribution of x given y
E           expectation
V           variance
R           the set of real values
C           the set of complex values
N           the set of natural numbers


Abbreviations

Abbreviation    Meaning
3d              Three dimensional
bcm             Bayesian committee machine
gp              Gaussian process
kl              Kullback-Leibler (divergence)
loo             Leave-one-out
loo-cv          Leave-one-out cross validation
mcmc            Markov chain Monte Carlo
ros             Robot Operating System
rviz            Visualization package in ros
slam            Simultaneous localization and mapping
tbb             (Intel®) Threading Building Blocks


1 Introduction

This thesis aims to further investigate a novel approach to mapping of the immediate surroundings of an autonomous mobile robot. Generating useful maps of the environment is a fundamental problem, enabling path planning, collision avoidance and navigation, among other things. Earlier attempts at mapping a robot's environment have involved an a priori discretization of the spatial domain into independent cells. This has proven to work well in two dimensional cases, but is not easily extended to three dimensions. The assumption of independence is also clearly false. Walls, floors and roofs do not appear randomly, but as spatially correlated structures. This new approach does not discretize the spatial domain, but keeps the continuous nature of the real world, and is more easily adapted to three dimensions.

1.1 Background

The history of robots has its roots far back in ancient myths and legends from all around the world. One legend from ancient China tells the story of King Mu of Zhou and a mechanical engineer who had built an autonomous robot looking like a man. Showing off the robot's ability to sing and dance, the king was delighted, but when the robot at the end of the show started winking at the ladies, the king had had enough and ordered the immediate execution of the engineer. The engineer managed to save himself only by taking the robot apart, piece by piece. The oldest works of Western literature also bear witness to robots. In the Iliad, Homer illustrates the concept of robots by stating that the god Hephaestus created talking mechanical handmaids made out of gold in his workshop. Aristotle later referenced Homer, speculating that some day these robots could bring equality to humans by making the abolition of slavery possible.


Even modern day religion contains examples of robots. In the Old Testament according to the Jews, God creates the first man out of the dust of the earth and brings him to life. Religious men have later attempted to replicate the creation of life by forming clay into the shape of a man with the aim of bringing it to life. Numerous stories exist about these clay figures, or golems. Most famous is the story of how the chief rabbi Judah Loew ben Bezalel of Prague summoned a clay golem to protect the Jews of the city from the Holy Roman Emperor's oppression.

The way we think of robots today is often accredited to the author Isaac Asimov. The word itself was invented in 1920 by the Czech author Karel Čapek, who derived it from the Czech word robota, meaning servant, but it was Asimov who created the word robotics as the name for the field of study. Asimov wrote several science-fiction novels, but his most famous contribution to the field of robotics is perhaps the three laws of robotics. They introduced a notion of ethics in robots, and the first and also most important law states that "A robot may not injure a human being or, through inaction, allow a human being to come to harm".

Today, the interest in robots and robotics may be at its highest peak ever in history. According to The New York Times [2011], a free course in artificial intelligence, given in 2011 over the Internet by Stanford University, had over 58,000 students enrolled. Despite, or perhaps because of, the large number of people interested in robotics, there is really no consensus on what exactly a robot is. The Encyclopedia Britannica defines a robot as "any automatically operated machine that replaces human effort, though it may not resemble human beings in appearance or perform functions in a human-like manner". One thing is for sure: robots are becoming more and more common, and the techniques that enable robots to operate autonomously among human beings are coming closer every day. One can only hope that Asimov's first law is fulfilled when the robots become stronger and smarter than any human.

1.2 Objectives

The main objective of this project has been to implement the continuous occupancy mapping algorithm in the Robot Operating System (ros, Quigley et al. [2009]). Occupancy grid methods have been around for quite some time and have become popular methods for mapping. Because of the long time they have been around, many free and well functioning implementations of the algorithms exist. By implementing the new continuous occupancy mapping algorithm in ros as a proof of concept, the hope is to increase the interest in the new algorithm.

To achieve this goal, a general C++ library for Gaussian processes has been implemented and made complete enough to be useful for the implementation. Using this library, the continuous occupancy mapping algorithm was implemented and tested in ros, both for online and offline processing of data.


1.2.1 Limitations

When making a general library for others to use, a difficult decision to make is what to include in the library and what not to include. The number of possible extensions to the library is vast, and many of them would be good to have in the library.

During the thesis work, basic tools for Gaussian process regression, classification and learning have been implemented. When deciding what extra features to include, high priority was given to the functionality needed for the new occupancy mapping algorithm. A few other features not needed for the mapping algorithm have also been implemented, but the library is not yet a complete gp library.

1.3 Contributions

The main contributions of this thesis are:

• An efficient C++ library for regression, classification and estimation of hyperparameters. The library contains a wide range of covariance functions, with different optimization objectives for optimization of the hyperparameters.

• An implementation of the continuous occupancy mapping algorithm in ros, using the created C++ library. The implementation makes several approximations in order to make online operation possible.

1.4 Thesis Outline

The thesis is divided into five chapters apart from this introductory chapter. The first three describe theory and methods, while the last two present results and draw conclusions from them.

The chapter following this introduction, Chapter 2, describes the theory behind Gaussian processes and how they are used in the area of machine learning. This includes regression theory, covariance functions, classification and so on.

The third chapter describes the library that was implemented in order to perform the new occupancy mapping technique. Apart from the layout of the library, with mainly gp functionality, specific design details are also discussed. These include threading, optimization and methods for numerical approximation of integrals.

The last theory chapter, Chapter 4, takes gp theory into the world of occupancy mapping. It also discusses different techniques used when performing occupancy mapping, such as ways to reject redundant measurements (active sampling) and to merge multiple gp maps (Bayesian committee machine).

After the last theory chapter, Chapter 5 presents the results achieved from the different parts of the theory. This includes the results of occupancy mapping using both simulated and real world data.

The final chapter, Chapter 6, draws conclusions from the results and discusses future work. It also gives a brief summary of the thesis.


2 Gaussian Processes

This chapter gives an introduction to the theory of Gaussian processes (gps), and how they are used to perform regression and classification of data. The chapter begins with a simple introductory example, and then uses that example as a basis for explaining proper gp regression. This is then expanded to classification, followed by other interesting theoretical aspects of the gp framework, such as integral observations.

2.1 Introduction

A Gaussian process can be seen as a generalization of the multivariate Gaussian distribution to infinite dimensions. The n-dimensional Gaussian distribution is uniquely determined by its mean vector µ ∈ Rⁿ and covariance matrix Σ ∈ Rⁿˣⁿ. A well known property of the multivariate Gaussian distribution is that all subsets of jointly Gaussian variables are themselves jointly Gaussian. This makes it possible to define a Gaussian process as in Definition 2.1.

2.1 Definition (Gaussian process). A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Note that the definition allows for finite dimensional Gaussian processes. Every finite dimensional collection of jointly Gaussian random variables is also a Gaussian process. This, however, does not introduce anything new. The interesting thing is that infinite dimensions are allowed.

Example 2.2 will act as an introduction to Bayesian regression, in which a line is fitted to some measurements. This, as it turns out, is also a case of gp regression.


2.2 Example: Simple Regression

Fitting a linear model to measurements D = {(xᵢ, yᵢ) : i = 1, . . . , N}, where xᵢ ∈ R^D and yᵢ ∈ R, can be done using ordinary least squares methods. Here, however, a different approach is taken, by allowing prior knowledge of the slope of the line to be incorporated when estimating the actual slope. The ultimate goal is to find the function values at some query points X∗, given the measurements and the prior assumptions. Measurements are taken as illustrated in Figure 2.1.

Figure 2.1: Measurements originating from something resembling a line. (Axes: position X; measured value y.)

Stacking the target values into a column vector and the measurement locations into a matrix as

\[
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix},
\qquad
X = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_N \end{bmatrix},
\]

a general linear model can be written as f(x) = xᵀw, where w is a column vector of weights. In the one dimensional case, w is a scalar. The measurements are assumed to be subject to additive noise, y = f(X) + ε, where ε ∼ N(0, σₙ²I). Given these assumptions, the goal is to find the latent function value f(x∗) for an arbitrary test (or query) point x∗.

Assuming that the weights w are normally distributed and independent of x, w ∼ N(0, σ_w²I), f(x) is a Gaussian process with mean and covariance

\[
\begin{aligned}
\mu(f(x)) \equiv \mu(x) &= E[x^T w] = x^T \underbrace{E[w]}_{=0} = 0, && (2.1a) \\
\mathrm{Cov}[f(x), f(x')] \equiv k(x, x') &= E[f(x) f(x')^T] - \underbrace{E[f(x)]}_{=0}\,\underbrace{E[f(x')^T]}_{=0} && (2.1b) \\
&= E[x^T w w^T x'] = x^T \underbrace{E[w w^T]}_{=\sigma_w^2 I}\, x' = \sigma_w^2\, x^T x'. && (2.1c)
\end{aligned}
\]

Rasmussen and Williams [2005] illustrate how random functions can be generated from such a gp prior, as is done in Figure 2.2. Some (of the infinitely many possible) lines are generated and plotted together with the prior mean (zero) plus and minus two times the standard deviation of the prior, corresponding to a 95% confidence interval. As seen in the figure, all lines pass through the origin, as was assumed in the beginning.

Figure 2.2: Lines generated at random from the prior. The gray area is the prior mean (zero) plus and minus two times the prior's standard deviation, corresponding to a 95% confidence interval; there is a 95% chance that a randomly generated line will be inside the gray area.

Since the noise is independent of the weights, using the additive property of the variance gives the estimated mean and variance for y as

\[
\mu(y) \equiv \mu_y(x) = 0, \tag{2.2a}
\]
\[
\mathrm{Cov} \equiv k_y(x, x') = k(x, x') + \sigma_n^2 I = \sigma_w^2 x^T x' + \sigma_n^2 \delta_{x=x'}. \tag{2.2b}
\]

With the priors for all measurements, y, and the latent function values at the query points, f∗, written together as a joint distribution,

\[
\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(0,
\begin{bmatrix} \sigma_w^2 X^T X + \sigma_n^2 I & \sigma_w^2 X^T X_* \\ \sigma_w^2 X_*^T X & \sigma_w^2 X_*^T X_* \end{bmatrix}\right)
= \mathcal{N}\!\left(0,
\begin{bmatrix} K(X,X) + \sigma_n^2 I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right),
\]

the predictive distribution can be calculated simply by conditioning f∗ on the measurements:

\[
f_* \mid y, X, X_* \sim \mathcal{N}\!\Big( K(X_*,X)\big(K(X,X) + \sigma_n^2 I\big)^{-1} y,\;
K(X_*,X_*) - K(X_*,X)\big(K(X,X) + \sigma_n^2 I\big)^{-1} K(X,X_*) \Big). \tag{2.3}
\]

The function values at the test points X∗ can thus be taken as the mean values, with associated uncertainty terms. Given the measurements y as in Figure 2.1, the estimated latent function values f∗ can then be plotted as in Figure 2.3.

Figure 2.3: Estimated line and mean plus and minus two times the standard deviation of the estimate (corresponding to a 95% confidence interval), plotted together with the measured data. For correctness it must be said that the line is actually not a line; it consists only of points at positions X∗ with values f∗, with lines drawn from every point to the next.

Notably, the problem of finding f∗ was solved completely without first finding the true function f(x); the values of the weights, for example, never needed to be calculated.

2.2 Regression

The previous section demonstrated a simple example of linear regression. In this section, this is generalized to allow nonlinear regression, leading up to Gaussian process regression. The concept of kernels, together with the kernel trick, is also discussed.

2.2.1 Feature space

To expand the regression concept from Example 2.2 in Section 2.1, inference can be done directly in feature space. Also, this time a slightly different approach is used, showing that multiple paths lead to the same results.

As a starting point, the input values x from the previous example are mapped into a new space by a feature function φ : R^D → R^{D′}, such that

\[
y = \phi(x)^T w + \varepsilon, \tag{2.4}
\]

where w ∼ N(0, Σ_w) and ε ∼ N(0, σₙ²). The mapping φ(x) could be, for example, the space of the powers of x: φ(x) = [1, x, x², ···], if dim(x) = 1. The predictive probability can then be found by averaging out the weights:

\[
p(f_* \mid x_*, X, y) = \int p(f_*, w \mid x_*, X, y)\, dw = \int p(f_* \mid x_*, w)\, p(w \mid X, y)\, dw. \tag{2.5}
\]

To evaluate this integral, the posterior distribution for w needs to be found. This is done by applying the chain rule:

\[
p(w, y \mid X) = p(w \mid X)\, p(y \mid X, w) = p(y \mid X)\, p(w \mid X, y)
\;\overset{p(w \mid X) = p(w)}{\Longleftrightarrow}\;
p(w \mid X, y) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)}. \tag{2.6}
\]

Since p(y|X) is independent of w, and thus nothing but a scaling factor, that term can be neglected when finding the posterior mean and variance.

With the likelihood function

\[
p(y \mid X, w) \sim \mathcal{N}\big(\Phi(X)^T w,\; \sigma_n^2 I\big) \tag{2.7}
\]

and (2.6), the posterior distribution for the weights becomes

\[
p(w \mid X, y) \propto
\exp\!\Big(-\tfrac{1}{2\sigma_n^2}\big(y - \Phi(X)^T w\big)^T \big(y - \Phi(X)^T w\big)\Big)
\exp\!\Big(-\tfrac{1}{2} w^T \Sigma_w^{-1} w\Big)
\sim \mathcal{N}\!\Big(\tfrac{1}{\sigma_n^2} A^{-1} \Phi(X)\, y,\; A^{-1}\Big), \tag{2.8}
\]

where A = (1/σₙ²) Φ(X)Φ(X)ᵀ + Σ_w⁻¹.


By evaluating (2.5) and using the shorthands

\[
\Phi(X)^T \Sigma_w \Phi(X) = K, \tag{2.9}
\]
\[
\Phi(X)^T \Sigma_w \phi(x_*) = k_*, \tag{2.10}
\]
\[
\phi(x_*)^T \Sigma_w \phi(x_*) = k_{**}, \tag{2.11}
\]

the predictive probability distribution can finally be derived as

\[
f_* \mid x_*, X, y \sim \mathcal{N}\!\Big( k_*^T \big(K + \sigma_n^2 I\big)^{-1} y,\;
k_{**} - k_*^T \big(K + \sigma_n^2 I\big)^{-1} k_* \Big). \tag{2.12}
\]

Kernel trick

The shorthand used above,

\[
k(x_1, x_2) = \phi(x_1)^T \Sigma_w\, \phi(x_2), \tag{2.13}
\]

was not used solely by chance. With Σ_w being symmetric and positive definite (and thus possible to factor as Σ_w = L_w L_wᵀ using the Cholesky decomposition), k is an inner product with respect to Σ_w. Since all inner products produce positive definite matrices, this will be a valid covariance function (see Section 2.4 for further details). Instead of constructing a valid kernel out of some basis functions φ, it is equally possible to construct a kernel k directly, and then show that this corresponds to an inner product in some space. This is known as the kernel trick.

For example, a linear kernel can easily be written as an inner product:

\[
k(x_1, x_2) = x_1^T \Sigma\, x_2 + b^2 = \big\langle (L^T x_1,\, b),\; (L^T x_2,\, b) \big\rangle, \tag{2.14}
\]

with Σ = LLᵀ.

2.2.2 Function space

Given what has previously been discussed, it is now straightforward to move into function space. Start with the following measurement equation,

\[
y = f(X) + \varepsilon, \tag{2.15}
\]

where ε ∼ N(0, σₙ²I) and f is a Gaussian process with mean and covariance f(X) ∼ N(0, K(X, X)). Hence, y is distributed as

\[
y \sim \mathcal{N}\big(0,\; K + \sigma_n^2 I\big). \tag{2.16}
\]

The jointly Gaussian distribution can now be written directly, and the results are the same as for Example 2.2. It is useful to utilize this scenario when performing regression; instead of finding a specific function f, feature φ or weight vector w, the covariance function is introduced and used, making the others obsolete:

\[
\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(0,
\begin{bmatrix} K(X,X) + \sigma_n^2 I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right).
\]

Conditioning on the measurements and data points results in the same predictive distribution as before:

\[
f_* \mid y, X, X_* \sim \mathcal{N}\!\Big( K_*^T \big(K + \sigma_n^2 I\big)^{-1} y,\;
K_{**} - K_*^T \big(K + \sigma_n^2 I\big)^{-1} K_* \Big). \tag{2.17}
\]
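The prediction in (2.17) maps directly onto a handful of linear algebra operations. The following is a minimal, self-contained C++ sketch using the Eigen library; the function names, and the choice of a squared exponential kernel, are illustrative assumptions and do not reflect the interface of the library developed in this thesis (Chapter 3).

```cpp
#include <Eigen/Dense>
#include <cmath>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// Squared exponential kernel between two input points (illustrative choice).
double kernelSE(const VectorXd& a, const VectorXd& b, double sigma_f, double length) {
    return sigma_f * sigma_f * std::exp(-0.5 * (a - b).squaredNorm() / (length * length));
}

// Build the kernel matrix K(A, B); one input point per column of A and B.
MatrixXd kernelMatrix(const MatrixXd& A, const MatrixXd& B, double sigma_f, double length) {
    MatrixXd K(A.cols(), B.cols());
    for (int i = 0; i < A.cols(); ++i)
        for (int j = 0; j < B.cols(); ++j)
            K(i, j) = kernelSE(A.col(i), B.col(j), sigma_f, length);
    return K;
}

// GP regression prediction according to (2.17).
// X: training inputs (one column per point), y: targets, Xs: test inputs.
void gpPredict(const MatrixXd& X, const VectorXd& y, const MatrixXd& Xs,
               double sigma_f, double length, double sigma_n,
               VectorXd& mean, MatrixXd& cov) {
    MatrixXd K   = kernelMatrix(X,  X,  sigma_f, length);
    MatrixXd Ks  = kernelMatrix(X,  Xs, sigma_f, length);    // K(X, X*)
    MatrixXd Kss = kernelMatrix(Xs, Xs, sigma_f, length);     // K(X*, X*)

    // Factorize K + sigma_n^2 I once and reuse it for both the mean and the covariance.
    Eigen::LLT<MatrixXd> chol(K + sigma_n * sigma_n * MatrixXd::Identity(X.cols(), X.cols()));
    mean = Ks.transpose() * chol.solve(y);           // K*^T (K + sigma_n^2 I)^-1 y
    cov  = Kss - Ks.transpose() * chol.solve(Ks);    // K** - K*^T (K + sigma_n^2 I)^-1 K*
}
```

Using a Cholesky factorization instead of an explicit matrix inverse keeps the computation numerically stable; the O(N³) cost of this factorization is the computational bottleneck referred to in the abstract.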

2.3 Classification

Section 2.2 looked at regression problems, where the targets were real valued. Another interesting class of problems is classification, where one seeks to assign a class label to each input x∗. Classification problems can be split into two types: binary classification, which deals with only two possible classes, and the more general multi-class classification, which handles multiple classes at once. In both cases, the aim is to give an estimate of the probability of the input x∗ belonging to each class. First the case of binary classification is addressed, and then multi-class classification. The presentation closely follows the article by Nickisch and Rasmussen [2008].

2.3.1 Binary classification

In binary classification, the objective is to use a set D = {(xᵢ, yᵢ) : i = 1, . . . , N} of N training points xᵢ with corresponding class labels yᵢ ∈ {−1, 1} to predict the probability of a test point x∗ pertaining to a certain class. The values of the target labels yᵢ do not necessarily have to fulfil yᵢ = ±1, but for simplicity, this is assumed from here on.

A Gaussian process model draws functions whose range is the entire real line (−∞, ∞). The probability of a test point belonging to a certain class, however, is necessarily constrained to the unit interval [0, 1]. Placing a gp prior on the class probabilities directly is hence not possible. To overcome this, a latent function f is introduced, equipped with a gp prior. The value f(x∗) is then mapped into the unit interval and interpreted as the predicted class probability. The mapping is done by a so called sigmoid function σ : R → [0, 1]. Any sigmoid function would do, but common examples are the cumulative distribution function of a standard Gaussian random variable, Φ(t), and the logistic function, λ(t),

\[
\Phi(t) = \int_{-\infty}^{t} \mathcal{N}(\tau \mid 0, 1)\, d\tau, \qquad
\lambda(t) = \frac{1}{1 + e^{-t}}. \tag{2.18}
\]

This construction gives rise to a prior on π(x) ≜ p(y = +1 | x) = σ(f(x)). Since the class membership probabilities must normalize, Σ_y p(y|x) = 1, it also holds that p(y = +1|x) = 1 − p(y = −1|x). If the shifted sigmoid function σ(x) − 1/2 is antisymmetric, that is σ(−x) = 1 − σ(x), the likelihood can compactly be written as

\[
p(y \mid x) = \sigma\big(y \cdot f(x)\big). \tag{2.19}
\]


Given the latent function f, the class labels are assumed to be independent Bernoulli distributed random variables; that is, each point in input space can only be of either one class or the other, and the class labels are independent of other class labels given that we know the latent function f. The likelihood p(y|f) hence factorizes over data points and depends only on the values of the latent function at the given input points, f = [f₁, f₂, . . . , fₙ]ᵀ, where fᵢ = f(xᵢ),

\[
p(\mathbf{y} \mid f) = p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} p(y_i \mid f_i) = \prod_{i=1}^{n} \sigma(y_i f_i). \tag{2.20}
\]

By applying Bayes' rule, the posterior distribution over the latent values f can be expressed as

\[
p(\mathbf{f} \mid \mathbf{y}, X, \theta)
= \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X, \theta)}{\int p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X, \theta)\, d\mathbf{f}}
= \frac{\mathcal{N}(\mathbf{f} \mid 0, K)}{p(\mathbf{y} \mid X, \theta)} \prod_{i=1}^{n} \sigma(y_i f_i), \tag{2.21}
\]

where θ are the hyperparameters of the covariance function used in the gp prior on f.

As for regression, the joint prior over the training and test points is given by

\[
p(\mathbf{f}, \mathbf{f}_* \mid X, X_*, \theta)
= \mathcal{N}\!\left(\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \,\middle|\, 0,
\begin{bmatrix} K & K_* \\ K_*^T & K_{**} \end{bmatrix}\right). \tag{2.22}
\]

To predict the values of f∗, the training set latent variables are marginalized,

\[
p(\mathbf{f}_* \mid X_*, \mathbf{y}, X, \theta)
= \int p(\mathbf{f}_*, \mathbf{f} \mid X_*, \mathbf{y}, X, \theta)\, d\mathbf{f}
= \int p(\mathbf{f}_* \mid \mathbf{f}, X_*, X, \theta)\, p(\mathbf{f} \mid \mathbf{y}, X, \theta)\, d\mathbf{f}, \tag{2.23}
\]

where the joint posterior has been factored into the product of the posterior, (2.21), and the conditional prior

\[
p(\mathbf{f}_* \mid \mathbf{f}, X_*, X, \theta)
= \mathcal{N}\!\big(\mathbf{f}_* \mid K_*^T K^{-1} \mathbf{f},\; K_{**} - K_*^T K^{-1} K_*\big), \tag{2.24}
\]

which follows directly by conditioning on f in (2.22), see (A.4).

The predictive class membership probability for a single test point x∗, p(y∗ = +1 | x∗, y, X, θ), is calculated by averaging out the latent variable f∗ = f(x∗),

\[
p(y_* \mid x_*, \mathbf{y}, X, \theta)
= \int p(y_* \mid f_*)\, p(f_* \mid x_*, \mathbf{y}, X, \theta)\, df_*
= \int \sigma(y_* f_*)\, p(f_* \mid x_*, \mathbf{y}, X, \theta)\, df_*. \tag{2.25}
\]

In the case of regression, all the above distributions were Gaussian and all calculations were analytically tractable. The non-Gaussian likelihood (2.20) makes the posterior over the latent values f in (2.21) and the integral for the predictive distribution in (2.23) analytically intractable.

To perform the calculations, either Monte Carlo sampling methods or analytical approximations of the integrals can be used. Sampling methods give the correct posterior distributions if time is of no issue. However, when computation time is important, approximations have to be made in order to make the integrals easier to compute. Below, two different approximations are presented. The first approximates the posterior distribution over the latent values f in (2.21) with a Gaussian distribution using the Laplace approximation. The second approach, called probabilistic least squares classification, ignores the fact that the targets can only take two distinct values and does ordinary regression using the class labels as targets. Instead of using the non-Gaussian likelihood, the predicted mean and variance of the latent function f are propagated through a sigmoid function and interpreted as the class membership probability. This is not a strict Bayesian approach, but the computations become very fast.

Laplace Approximation

One way to overcome intractability when the likelihood is non-Gaussian is to approximate the posterior p = p(f|y, X) with a Gaussian distribution q = q(f|y, X), using Laplace's approximation method. The approach is to find the mode (i.e. the maximum a posteriori estimate, map) of p, and use this as the mean of the truly Gaussian distributed q. The closer p is to a true Gaussian distribution, the better the approximation will be. The same holds for cases with a multi-modal p; the method will find one of the modes, and approximate this one with a Gaussian. With a p consisting, for example, of multiple separated Gaussian-like distributions, all but one will be neglected, and Laplace's approximation method is thus not well suited for approximating such distributions.

The method begins by expanding the logarithm of the posterior, log p(f|y, X), to second order around its map estimate, f̂:

\[
\log p(\mathbf{f} \mid \mathbf{y}, X) \approx \log q(\mathbf{f} \mid \mathbf{y}, X)
= \log p(\hat{\mathbf{f}} \mid \mathbf{y}, X)
+ \underbrace{\nabla \log p(\mathbf{f} \mid \mathbf{y}, X)\big|_{\hat{\mathbf{f}}}}_{=0,\ \text{stationary}} (\mathbf{f} - \hat{\mathbf{f}})
+ \tfrac{1}{2} (\mathbf{f} - \hat{\mathbf{f}})^T \nabla\nabla \log p(\mathbf{f} \mid \mathbf{y}, X)\big|_{\hat{\mathbf{f}}} (\mathbf{f} - \hat{\mathbf{f}}). \tag{2.26}
\]

The first term can be neglected as it does not depend on f, and thus does not affect the map estimate. This gives the unnormalized approximated posterior distribution as

\[
q(\mathbf{f} \mid \mathbf{y}, X) \propto
\exp\!\Big(\tfrac{1}{2} (\mathbf{f} - \hat{\mathbf{f}})^T \nabla\nabla \log p(\mathbf{f} \mid \mathbf{y}, X)\big|_{\hat{\mathbf{f}}} (\mathbf{f} - \hat{\mathbf{f}})\Big)
\sim \mathcal{N}\!\Big(\hat{\mathbf{f}},\; \big(-\nabla\nabla \log p(\mathbf{f} \mid \mathbf{y}, X)\big)^{-1}\Big). \tag{2.27}
\]


To deduce the distribution in (2.27), f̂ needs to be found. It is known that ∇ log p(f|y, X) = 0 at f = f̂, because of stationarity. Finding f̂ is thus equivalent to finding the (or any, in the multi-modal case) f that fulfills this property. One common way of doing this is to use Newton's method on the gradient, where each iteration becomes

\[
\mathbf{f}_{\text{new}} = \mathbf{f}_{\text{old}}
- \big(\nabla\nabla \log p(\mathbf{f}_{\text{old}} \mid \mathbf{y}, X)\big)^{-1} \nabla \log p(\mathbf{f}_{\text{old}} \mid \mathbf{y}, X), \tag{2.28}
\]

this being repeated until some convergence requirement is fulfilled.

To perform this iteration, both the gradient and the Hessian of log p(f|y, X) need to be calculated. This can be done by observing that the posterior distribution can be rewritten using Bayes' formula as

\[
\log p(\mathbf{f} \mid \mathbf{y}, X)
= \log\!\left(\frac{p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)}\right)
= \log p(\mathbf{y} \mid \mathbf{f}) + \log p(\mathbf{f} \mid X) - \log p(\mathbf{y} \mid X) \equiv \Psi(\mathbf{f}). \tag{2.29}
\]

Noticing that p(f|X) = (2π)^{-n/2} |K|^{-1/2} exp(−½ fᵀK⁻¹f), due to this prior being Gaussian, differentiating Ψ(f) with respect to f becomes fairly trivial. Thus, the gradient and Hessian can be calculated as

\[
\nabla \Psi(\mathbf{f}) = \nabla \log p(\mathbf{y} \mid \mathbf{f}) - K^{-1}\mathbf{f}, \tag{2.30}
\]
\[
\nabla\nabla \Psi(\mathbf{f}) = \nabla\nabla \log p(\mathbf{y} \mid \mathbf{f}) - K^{-1}, \tag{2.31}
\]

and the desired expressions are now fully deduced.

The likelihood function p(y|f) is chosen to give proper target class probabilities (given a latent function value f, it returns a value between 0 and 1 that can be interpreted as a probability). Two common choices of likelihood function are the logistic function, p(yᵢ|fᵢ) = 1/(1 + exp(−yᵢfᵢ)), and the cumulative Gaussian function, p(yᵢ|fᵢ) = Φ(yᵢfᵢ).

To summarize, the calculated map estimate together with the Hessian inserted into (2.27) gives the Laplace approximation to the posterior as a Gaussian distribution

\[
q(\mathbf{f} \mid X, \mathbf{y}) = \mathcal{N}\!\Big(\hat{\mathbf{f}},\; \big(-\nabla\nabla \log p(\mathbf{y} \mid \mathbf{f}) + K^{-1}\big)^{-1}\Big). \tag{2.32}
\]

The intractable integral (2.23) now becomes tractable when using q(f|X, y) instead of p(f|X, y). Evaluating it gives the expected mean at the test point x∗,

\[
\mathbb{E}_q[f_* \mid X, \mathbf{y}, x_*]
= \mathbb{E}\!\left[\int p(f_* \mid \mathbf{f}, x_*, X)\, q(\mathbf{f} \mid \mathbf{y}, X)\, d\mathbf{f}\right]
= k_*^T K^{-1} \hat{\mathbf{f}}. \tag{2.33}
\]


The predictive variance at x∗ can correspondingly be calculated as

\[
\begin{aligned}
\mathbb{V}_q[f_* \mid X, \mathbf{y}, x_*]
&= \mathbb{E}_{p(f_* \mid X, \mathbf{y})}\big[(f_* - \mathbb{E}[f_* \mid X, x_*, \mathbf{f}])^2\big]
+ \mathbb{E}_{q(\mathbf{f} \mid X, \mathbf{y})}\big[(\mathbb{E}[f_* \mid X, x_*, \mathbf{f}] - \mathbb{E}[f_* \mid X, \mathbf{y}, x_*])^2\big] \\
&= k(x_*, x_*) - k_*^T K^{-1} k_*
+ k_*^T K^{-1} \big(K^{-1} - \nabla\nabla \log p(\mathbf{y} \mid \mathbf{f})\big)^{-1} K^{-1} k_* \\
&= k(x_*, x_*) - k_*^T \big(K - \nabla\nabla \log p(\mathbf{y} \mid \mathbf{f})^{-1}\big)^{-1} k_*,
\end{aligned} \tag{2.34}
\]

where the details of these calculations can be found in Rasmussen and Williams [2005], Section 3.4.2.

Predictions are finally made by computing

\[
\bar{\pi}_* = \int \sigma(f_*)\, q(f_* \mid X, \mathbf{y}, x_*)\, df_*, \tag{2.35}
\]

with q(f∗|X, y, x∗) having mean and variance as in (2.33) and (2.34), and σ being the chosen sigmoid function.
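To make the Newton recursion (2.28), with the gradient and Hessian from (2.30)-(2.31), concrete, the following is a minimal C++/Eigen sketch of the mode search for the logistic likelihood. It is an illustration of the equations above, not the implementation used in the library described in Chapter 3; the function name, starting point and convergence criterion are assumptions made for the example.

```cpp
#include <Eigen/Dense>
#include <cmath>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// Logistic sigmoid.
static double sigmoid(double t) { return 1.0 / (1.0 + std::exp(-t)); }

// Newton iteration (2.28) for the Laplace approximation with a logistic likelihood.
// K: prior covariance of the latent values, y: class labels in {-1, +1}.
// Returns the posterior mode f_hat; W = -grad grad log p(y|f) is returned via argument.
VectorXd laplaceMode(const MatrixXd& K, const VectorXd& y, MatrixXd& W, int maxIter = 50) {
    const int n = static_cast<int>(y.size());
    VectorXd f = VectorXd::Zero(n);                     // start at the prior mean
    W = MatrixXd::Zero(n, n);

    for (int iter = 0; iter < maxIter; ++iter) {
        // Gradient and (negative) Hessian of log p(y|f) for the logistic likelihood.
        VectorXd grad(n);
        for (int i = 0; i < n; ++i) {
            double s = sigmoid(f(i));
            grad(i) = 0.5 * (y(i) + 1.0) - s;           // d/df_i log p(y_i|f_i)
            W(i, i) = s * (1.0 - s);                    // -d^2/df_i^2 log p(y_i|f_i)
        }

        // Newton step: (K^-1 + W) delta = grad - K^-1 f  <=>  (I + K W) delta = K grad - f,
        // which avoids forming K^-1 explicitly.
        MatrixXd A = MatrixXd::Identity(n, n) + K * W;
        VectorXd delta = A.partialPivLu().solve(K * grad - f);
        f += delta;

        if (delta.norm() < 1e-8) break;                 // simple convergence check
    }
    return f;
}
```

The predictive mean and variance then follow from (2.33) and (2.34), using the returned f̂ and W.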

Probabilistic Least Squares Classification

In probabilistic least squares classification, the intractable integrals are avoided by simply ignoring the fact that the target values y can only take two different values. Instead, a Gaussian likelihood is used, just as in the regression case.

Applying ordinary regression using the target values yᵢ ∈ {−1, 1}, it is possible to choose the one of the two classes closest to the predicted value. With these particular targets, this would simply amount to calculating the predicted mean of f∗ and deciding y∗ = 1 if the mean is larger than zero. This, however, would not give the probability of the test point x∗ being of either class.

Since the mean of f∗ is potentially unbounded, predicted probabilities that lie in the unit interval [0, 1] can be achieved by a sigmoid function σ(x). To introduce more flexibility, f∗ is scaled and shifted by a linear transformation before it is propagated through the sigmoid function,

\[
\tilde{f}_* = \sigma(\alpha f_* + \beta). \tag{2.36}
\]

This adds two parameters, α and β, to the problem, which can be optimized with respect to training data. The expected value of f̃∗ is interpreted as the class membership probability.

Using the cumulative Gaussian Φ(x), see (2.18), as the sigmoid function, a closed form solution of the expectation exists:

\[
p(y_* \mid X, \mathbf{y}, x_*, \theta)
= \int \Phi\big(y_*(\alpha f_* + \beta)\big)\, \mathcal{N}(f_* \mid \mu_*, \sigma_*^2)\, df_*
= \Phi\!\left(\frac{y_*(\alpha\mu_* + \beta)}{\sqrt{1 + \alpha^2\sigma_*^2}}\right), \tag{2.37}
\]

where µ∗ and σ∗² are the predictive mean and variance at location x∗.

Optimizing parameters

The hyperparameters of the covariance function and the two parameters of the probabilistic least squares classifier can be optimized jointly by utilizing leave-one-out cross-validation, loo-cv. The predictive log probability when leaving out training case i is given by

\[
\log p(y_i \mid X, \mathbf{y}_{-i}, \theta)
= -\tfrac{1}{2}\log \sigma_i^2 - \frac{(y_i - \mu_i)^2}{2\sigma_i^2} - \tfrac{1}{2}\log 2\pi, \tag{2.38}
\]

where y₋ᵢ denotes all targets except number i, and µᵢ and σᵢ² are the predictive mean and variance computed at xᵢ when the training set is taken to be (X₋ᵢ, y₋ᵢ).

Maximizing the probability of all training cases belonging to the correct class can be done by maximizing the product of all predictive probabilities. Since it is easier to optimize a sum than a product, the objective function for loo-cv is taken as the sum of the log predictive probabilities in (2.37):

\[
L_{\mathrm{LOO}}(X, \mathbf{y}, \theta) = \sum_{i=1}^{n} \log p(y_i \mid X, \mathbf{y}_{-i}, \theta). \tag{2.39}
\]

Inspecting the equations for computing the predictive means and variances µᵢ and σᵢ², it can be shown that an equivalent way of computing them is

\[
\mu_i = y_i - \frac{\big[K^{-1}\mathbf{y}\big]_i}{\big[K^{-1}\big]_{ii}}
\qquad \text{and} \qquad
\sigma_i^2 = \frac{1}{\big[K^{-1}\big]_{ii}}, \tag{2.40}
\]

where [K]ᵢᵢ denotes element (i, i) of the matrix K and [y]ᵢ denotes the i:th element of the vector y. Differentiating both expressions with respect to a hyperparameter of the covariance function gives

\[
\frac{\partial \mu_i}{\partial \theta_j}
= \frac{\big[Z_j \boldsymbol{\alpha}\big]_i}{\big[K^{-1}\big]_{ii}}
- \frac{\alpha_i \big[Z_j K^{-1}\big]_{ii}}{\big[K^{-1}\big]_{ii}^2},
\qquad
\frac{\partial \sigma_i^2}{\partial \theta_j}
= \frac{\big[Z_j K^{-1}\big]_{ii}}{\big[K^{-1}\big]_{ii}^2}, \tag{2.41}
\]

where α = K⁻¹y, αᵢ is the i:th element of the vector α and Zⱼ = K⁻¹ ∂K/∂θⱼ.

Inserting the predicted means and variances from (2.40) into (2.37),

\[
p(y_i \mid X, \mathbf{y}_{-i}, \theta)
= \int \Phi\big(y_i(\alpha f_i + \beta)\big)\, \mathcal{N}(f_i \mid \mu_i, \sigma_i^2)\, df_i
= \Phi\!\left(\frac{y_i(\alpha\mu_i + \beta)}{\sqrt{1 + \alpha^2\sigma_i^2}}\right), \tag{2.42}
\]

gives the probability of each input point being classified as the correct class. Inserting these into equation (2.39) and taking the derivatives with respect to the hyperparameters gives

\[
\begin{aligned}
\frac{\partial L_{\mathrm{LOO}}}{\partial \theta_j}
&= \sum_{i=1}^{n} \left(
\frac{\partial \log p(y_i \mid X, \mathbf{y}_{-i}, \theta)}{\partial \mu_i} \frac{\partial \mu_i}{\partial \theta_j}
+ \frac{\partial \log p(y_i \mid X, \mathbf{y}_{-i}, \theta)}{\partial \sigma_i^2} \frac{\partial \sigma_i^2}{\partial \theta_j}
\right) \\
&= \sum_{i=1}^{n} \frac{\mathcal{N}(r_i)}{\Phi(y_i r_i)} \left(
\frac{y_i \alpha}{\sqrt{1 + \alpha^2\sigma_i^2}} \frac{\partial \mu_i}{\partial \theta_j}
- \frac{\alpha(\alpha\mu_i + \beta)}{2(1 + \alpha^2\sigma_i^2)} \frac{\partial \sigma_i^2}{\partial \theta_j}
\right),
\end{aligned} \tag{2.43}
\]


Figure 2.4: (a) Data from Example 2.3. The figure shows the input data as belonging to one of two distinct classes. By using the data as input to the probabilistic least squares classifier, the probability of an arbitrary input point belonging to one of the two classes can be predicted. (b) Predicted probability that each input point is of class 1. The classifier has been trained using the data in Figure 2.4a, and the hyperparameters of a squared exponential covariance function, together with the two parameters of the linear rescaling in the classifier, have been jointly optimized using gradient descent.

where rᵢ = (αµᵢ + β)/√(1 + α²σᵢ²), and the partial derivatives of the loo parameters, ∂µᵢ/∂θⱼ and ∂σᵢ²/∂θⱼ, are given in (2.41). The partial derivatives with respect to the parameters of the linear transformation are given by

the parameters of the linear function is given by

∂LLOO∂α

=n∑i=1

N (ri)Φ(yi ri)

yi√1 + α2σ2

i

µi − βασ2i

1 + α2σ2i

, (2.44)

∂LLOO∂β

=n∑i=1

N (ri)Φ(yi ri)

yi√1 + α2σ2

i

. (2.45)

2.3 Example: Least Squares Classification

Consider the task of classifying a one dimensional input as one of two classes. Using known classes for a few data points, as shown in Figure 2.4a, the class of an arbitrary input point can be predicted. After joint optimization of the hyperparameters of a squared exponential covariance function and the linear parameters of the classifier, the probability that each input point is of class 1 can be seen in Figure 2.4b.
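For reference, the loo quantities used in this classifier translate into only a few lines of code. The sketch below computes the leave-one-out predictive means and variances from K⁻¹ according to (2.40) and maps them through the cumulative Gaussian as in (2.42). It is a C++/Eigen sketch under the assumption that the kernel matrix K is already built and well conditioned; it is not the interface of the library described in Chapter 3.

```cpp
#include <Eigen/Dense>
#include <cmath>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// Standard normal CDF, used as the sigmoid in (2.37) and (2.42).
static double normalCdf(double t) { return 0.5 * std::erfc(-t / std::sqrt(2.0)); }

// Leave-one-out predictive means and variances according to (2.40).
void looMoments(const MatrixXd& K, const VectorXd& y, VectorXd& mu, VectorXd& var) {
    const MatrixXd Kinv = K.ldlt().solve(MatrixXd::Identity(K.rows(), K.cols()));
    const VectorXd alpha = Kinv * y;                    // K^-1 y
    mu.resize(y.size());
    var.resize(y.size());
    for (int i = 0; i < y.size(); ++i) {
        var(i) = 1.0 / Kinv(i, i);                      // sigma_i^2 = 1 / [K^-1]_ii
        mu(i)  = y(i) - alpha(i) * var(i);              // mu_i = y_i - [K^-1 y]_i / [K^-1]_ii
    }
}

// LOO predictive probability of the correct class for case i, eq. (2.42).
double looClassProbability(double yi, double mui, double vari, double a, double b) {
    return normalCdf(yi * (a * mui + b) / std::sqrt(1.0 + a * a * vari));
}
```

In practice the explicit inverse would be replaced by a factorization reused across all i, but the direct form keeps the correspondence with (2.40) visible.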


2.4 Covariance functions

In previous sections the term covariance function, or kernel, has been used, and on occasion also briefly explained. In this section a proper definition is given, and some properties are stated and exemplified. The most fundamental property of a covariance function is that it produces a covariance matrix that is positive semi-definite (see Definition 2.4). If one can prove that a function does so, it is a valid covariance function. Often they are also referred to as positive definite kernels.

2.4 Definition (Covariance function). A covariance function on a set S is a function k : S × S → C such that ∀n ∈ N, ∀x₁, . . . , xₙ ∈ S, the matrix

\[
K = \begin{bmatrix}
k(x_1, x_1) & \cdots & k(x_1, x_n) \\
\vdots & \ddots & \vdots \\
k(x_n, x_1) & \cdots & k(x_n, x_n)
\end{bmatrix} \tag{2.46}
\]

is Hermitian (symmetric if k : S × S → R) and positive semi-definite.

Some important properties can be shown to hold for covariance functions. Given two valid covariance functions k₁(x₁, x₂) and k₂(x₁, x₂), the following two covariance functions are also valid (proofs omitted):

1. k(x1, x2) = k1(x1, x2) + k2(x1, x2) (additive property).

2. k(x1, x2) = k1(x1, x2) × k2(x1, x2) (multiplicative property).

These can then be used to construct new covariance functions. A number of covariance functions can be found in Appendix B, the most common one being the squared exponential:

\[
k(x, x') = \sigma^2 \exp\!\left(-\frac{1}{2}\,\frac{(x - x')^T(x - x')}{l}\right). \tag{2.47}
\]

Two others worth mentioning are the Matérn 1,

\[
k(x, x') = \sigma^2 \exp\!\left(-\sqrt{\frac{(x - x')^T(x - x')}{l}}\right), \tag{2.48}
\]

and the Matérn 3,

\[
k(x, x') = \sigma^2 \left(1 + \sqrt{\frac{3\,(x - x')^T(x - x')}{l}}\right)
\exp\!\left(-\sqrt{\frac{3\,(x - x')^T(x - x')}{l}}\right). \tag{2.49}
\]

All of these will later be used in Section 2.4.2.
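The kernels above, and the additive property from the list earlier in this section, are straightforward to express in code. The following C++ sketch (using Eigen vectors and std::function for brevity) builds the squared exponential and Matérn 1 kernels with hyperparameters θ = (l, σ) as written in (2.47)-(2.48) and composes them by summation; it is an illustrative sketch only, not the kernel class hierarchy of the library described in Chapter 3.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <functional>
#include <vector>

using Eigen::VectorXd;

// A covariance function maps two inputs to a real number.
using Kernel = std::function<double(const VectorXd&, const VectorXd&)>;

// Squared exponential (2.47) with hyperparameters theta = (l, sigma).
Kernel squaredExponential(double l, double sigma) {
    return [=](const VectorXd& x, const VectorXd& xp) {
        return sigma * sigma * std::exp(-0.5 * (x - xp).squaredNorm() / l);
    };
}

// Matérn 1 (2.48) with hyperparameters theta = (l, sigma).
Kernel matern1(double l, double sigma) {
    return [=](const VectorXd& x, const VectorXd& xp) {
        return sigma * sigma * std::exp(-std::sqrt((x - xp).squaredNorm() / l));
    };
}

// The additive property: a sum of valid covariance functions is again valid.
Kernel sumKernel(std::vector<Kernel> parts) {
    return [parts](const VectorXd& x, const VectorXd& xp) {
        double k = 0.0;
        for (const auto& part : parts) k += part(x, xp);
        return k;
    };
}

// Example composite kernel, as used in Section 2.4.2 (hyperparameter values are placeholders):
// Kernel k = sumKernel({squaredExponential(0.5, 1.0), matern1(0.5, 1.0)});
```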

2.4.1 Hyperparameters

As can be seen in the above equations, every covariance function has two parameters, θ = (l, σ), where l is called a length-scale parameter and σ a noise parameter. Together they are referred to as hyperparameters, and as will be seen in Section 2.4.2, different choices of hyperparameters can greatly alter the behavior of the inferred function. In multiple dimensions the length-scale parameter instead becomes a matrix, which most often, though not necessarily, is taken to be diagonal.

Finding optimal parameters

For simple covariance functions with a small number of hyperparameters, it is possible to find a reasonable fit by manually changing the values and observing the inferred function. As one starts to use more complex covariance functions, like sums of such, or moves to higher dimensions, this method will most likely not suffice. Also, one needs to quantify what is actually meant by a reasonable or good fit. One way to do this is to look at the likelihood of the measurements, given the covariance function with certain hyperparameters. This likelihood should then be possible to maximize with respect to the hyperparameters in some way, resulting in optimal parameters for an optimal fit. The marginal likelihood previously encountered when performing regression fits this description, and is also analytically tractable inside the gp framework. Here, the marginal likelihood is found by marginalizing the latent function:

\[
p(\mathbf{y} \mid X, \theta)
= \int p(\mathbf{y}, \mathbf{f} \mid X, \theta)\, d\mathbf{f}
= \int \underbrace{p(\mathbf{y} \mid \mathbf{f}, X, \theta)}_{\mathcal{N}(\mathbf{f},\, \sigma_n^2 I)}
\underbrace{p(\mathbf{f} \mid X, \theta)}_{\mathcal{N}(0,\, K)}\, d\mathbf{f} \tag{2.50}
\]
\[
= \frac{1}{\sqrt{|K + \sigma_n^2 I|\,(2\pi)^N}}
\exp\!\Big(-\tfrac{1}{2}\,\mathbf{y}^T\big(K + \sigma_n^2 I\big)^{-1}\mathbf{y}\Big). \tag{2.51}
\]

In practice, the logarithm of the marginal likelihood is used, as this renders less complicated calculations:

\[
\log p(\mathbf{y} \mid X, \theta)
= -\tfrac{1}{2}\,\mathbf{y}^T\big(K + \sigma_n^2 I\big)^{-1}\mathbf{y}
- \tfrac{1}{2}\log\big|K + \sigma_n^2 I\big|
- \tfrac{N}{2}\log(2\pi). \tag{2.52}
\]

By calculating its derivatives with respect to the hyperparameters, some gradient descent method can be used for finding the optimal values. Nota bene, this function will rarely, if ever, be truly convex, and thus there are no guarantees that a local optimum is a global optimum.
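For reference, evaluating (2.52) amounts to a single Cholesky factorization, which provides both the quadratic term and the log determinant. The C++/Eigen sketch below shows one way to write this; the function name is illustrative and not taken from the thesis library. A gradient based optimizer would call this, together with the corresponding derivatives, for each candidate θ.

```cpp
#include <Eigen/Dense>
#include <cmath>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// Log marginal likelihood (2.52) for targets y and kernel matrix K = K(X, X).
double logMarginalLikelihood(const MatrixXd& K, const VectorXd& y, double sigma_n) {
    const int N = static_cast<int>(y.size());
    const double pi = 3.14159265358979323846;
    MatrixXd Ky = K + sigma_n * sigma_n * MatrixXd::Identity(N, N);

    Eigen::LLT<MatrixXd> chol(Ky);                               // Ky = L L^T
    MatrixXd L = chol.matrixL();

    double quadratic = y.dot(chol.solve(y));                     // y^T (K + sigma_n^2 I)^-1 y
    double logDet    = 2.0 * L.diagonal().array().log().sum();   // log |K + sigma_n^2 I|

    return -0.5 * quadratic - 0.5 * logDet - 0.5 * N * std::log(2.0 * pi);
}
```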

2.4.2 Examples

To illustrate the behavior of different covariance functions with different hyperparameters, a function is sampled with noisy measurements, as can be seen in Figure 2.5. The data is then inferred using, first, three different sets of hyperparameters for the squared exponential covariance function, and then three different covariance functions whose hyperparameters have been optimized using the log marginal likelihood. The covariance functions used are a squared exponential, a Matérn 1 and the sum of a squared exponential, a Matérn 1 and a Matérn 3. The result can be seen in Figure 2.6. To get some kind of measure of how well the inferred functions fit the underlying function, their respective normalized mean square errors are calculated as \(\frac{1}{N_f}\sum_{i=1}^{N_f}\|\hat{f}_i - f_i\|_2^2\) and can be found in Table 2.1.

Figure 2.5: Data sampled from the underlying function, where measurements are taken to be noisy. (Legend: true function f(x); measurements y.)

When using different hyperparameters (see Figure 2.6(a)), altering only the length-scale parameter, one can see that this parameter is correlated with the number of oscillations per length interval. This justifies the name length-scale parameter, as it affects how long or short a time it takes for the function to adjust to new values. It is also possible to see that even though one might expect a short length-scale to be a good length-scale, this is not the case, as the function can become over-fit. Even though the function goes through practically every single measured point, in between them it behaves unpleasantly. A too long length-scale, on the other hand, makes the function too slow to be able to react properly and catch sudden changes in the true function; the function becomes under-fit. A good value is one that gives a quick enough function, without being too quick. Looking at the mean square errors in Table 2.1, the parameter resulting in the best fitting function is also the one in between, l = 0.5. The importance of good hyperparameters for making a good function fit should thus be clear.

Of course, different covariance functions can also alter the behavior of the inferred functions. Looking at Figure 2.6(b), it is first of all possible to see how well the functions seem to fit the data when the hyperparameters are optimal. Second, it is clear that for the given underlying function, the squared exponential is not a good choice, as it tends to behave unnaturally where there are only few data points. The Matérn 1, on the other hand, performs well, not diverting from the underlying function in areas without measurements. In the last case, when using a sum of different covariance functions, the inferred function is smooth and quick, while still behaving nicely between measurements. Again looking at the errors in Table 2.1, the covariance function giving the best fit is actually the single Matérn 1. This might seem unintuitive, but since it is the fit against the underlying function and not the measurements, a more complex and better looking function might fall short, in spite of fitting the measurements better. The importance of choosing a proper covariance function should thus be clear.

Figure 2.6: Functions inferred from the data points using both different sets of hyperparameters and different covariance functions. (a) Functions using different length-scale parameters (0.1, 0.5 and 2). (b) Functions using three different covariance functions: squared exponential, Matérn 1, and a sum of squared exponential, Matérn 1 and Matérn 3.

Table 2.1: The normalized mean square errors (\(\frac{1}{N_f}\sum_{i=1}^{N_f}\|\hat{f}_i - f_i\|_2^2\)) between the inferred functions and the true underlying function from which samples were drawn. The second column specifies which parameters were used; "opt" means that they were found through optimization.

    Function            Parameters (θ)    Mean square error
    (a)
    SqExp               (0.1, 1)          0.18
    SqExp               (0.5, 1)          0.076
    SqExp               (2, 1)            0.13
    (b)
    SqExp               opt               0.091
    Mat1                opt               0.060
    Mat1+Mat3+SqExp     opt               0.064

2.5 Integral Observations

The covariance functions used in gps are usually defined for situations where the sought function is observed directly, with additive noise. This can, however, be extended to allow observations of the integral of the function as well, as is done in Osborne [2010]. The theory is here presented for the one dimensional input case, but it is valid also for multi dimensional inputs.

A property of the multivariate Gaussian distribution is that any set of random variables which have a multivariate Gaussian distribution over them is jointly Gaussian with any affine transformation of those variables. Let y = [y₁, y₂, . . . , yₙ]ᵀ be multivariate Gaussian distributed,

\[
p(\mathbf{y}) = \mathcal{N}(\boldsymbol{\mu}, K), \tag{2.53}
\]

and consider the scaled sum

\[
\Delta \sum_{i=1}^{n-1} y_i = \underbrace{\Delta\,[1, 1, \ldots, 1, 0]}_{\triangleq S}\; \mathbf{y} = S\mathbf{y}, \tag{2.54}
\]

Page 37: Processes - DiVA portal552702/FULLTEXT01.pdf · Emanuel Walldén Viklund, Johan Wågberg LiTH-ISY-EX--12/4626--SE ... Engelska/English Rapporttyp Report category Licentiatavhandling


with ∆ > 0. This is a linear combination of the original variables, and consequently y and Sy are jointly Gaussian distributed as

p(y, Sy) = \mathcal{N}\!\left( \begin{bmatrix} \mu \\ S\mu \end{bmatrix}, \begin{bmatrix} K & KS^{\mathsf{T}} \\ SK & SKS^{\mathsf{T}} \end{bmatrix} \right). \qquad (2.55)

By having the random variables y originate from some sensible function y( · ) such that y_i = y(x_0 + (i − 1)∆), the sum in (2.54) is a Riemann sum. Letting n tend to infinity while scaling ∆ appropriately, the value of Sy tends toward the definite Riemann integral.

A bit more formally, let f be a gp such that a vector of function values f is multivariate Gaussian distributed as

p(f ) = N (µ,K) . (2.56)

By redefining S as a vector of integral functionals

S \triangleq \left[ \int_{-\infty}^{x_1} \cdot \; \mathrm{d}x, \; \ldots, \; \int_{-\infty}^{x_N} \cdot \; \mathrm{d}x \right], \qquad (2.57)

where x = [x_1, x_2, \ldots, x_N]^{\mathsf{T}} are the locations where integral observations have been made. Exploiting the linearity of integration, one can now treat S as a normal linear transformation. The column vector of integrals

Sf \triangleq \left[ \int_{-\infty}^{x_1} f(x) \, \mathrm{d}x, \; \ldots, \; \int_{-\infty}^{x_N} f(x) \, \mathrm{d}x \right]^{\mathsf{T}} \qquad (2.58)

is then jointly Gaussian with the vector of all function values f

p(f, Sf) = \mathcal{N}\!\left( \begin{bmatrix} f \\ Sf \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu \\ S\mu \end{bmatrix}, \begin{bmatrix} K & KS^{\mathsf{T}} \\ SK & SKS^{\mathsf{T}} \end{bmatrix} \right). \qquad (2.59)

In this way, observations of the integral of the function can be treated in the same way as observations of the function itself.

The covariance matrix K is evaluated as a standard covariance matrix between points. For the covariance matrix between points and integral observations, KSᵀ, each element is calculated by letting the integral operator act on the second argument of the covariance function. One can see this as integrating the covariance between the point observation and all points up to the location of the integral observation. The covariance between a point observation at x and an integral observation at x∗ is thus calculated as

k_{(S)}(x, x_*) = \int_{-\infty}^{x_*} k(x, t) \, \mathrm{d}t. \qquad (2.60)

In a similar way, the covariance between an integral observation and a point observation is calculated by letting the integral operator act on the first argument of



the covariance function

k_{(S)}(x_*, x) = \int_{-\infty}^{x_*} k(t, x) \, \mathrm{d}t. \qquad (2.61)

The matrix SKSᵀ represents the covariance between integral observations. The covariance between two integral observations at locations x and x∗ is calculated by letting the integral operator act twice, once on each argument of the covariance function,

k_{(S,S)}(x, x_*) = \int_{-\infty}^{x} \int_{-\infty}^{x_*} k(s, t) \, \mathrm{d}t \, \mathrm{d}s. \qquad (2.62)
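As a concrete illustration, included here for exposition and not taken from the thesis, assume the squared exponential form k(x, t) = \sigma_f^2 \exp\left(-(x - t)^2 / (2\ell^2)\right). The integral in (2.60) then has a closed form in terms of the error function:

k_{(S)}(x, x_*) = \int_{-\infty}^{x_*} \sigma_f^2 \exp\!\left( -\frac{(x - t)^2}{2\ell^2} \right) \mathrm{d}t = \sigma_f^2 \, \ell \sqrt{\frac{\pi}{2}} \left[ 1 + \operatorname{erf}\!\left( \frac{x_* - x}{\sqrt{2}\,\ell} \right) \right].

For general covariance functions no such closed form exists, which is one reason numerical quadrature (Section 3.3.1) is used in the library for the related line-segment kernels.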

To further extend integral observations, O'Callaghan and Ramos [2011] introduced a way to handle observations of definite integrals over line segments by defining the covariance between two line segments as

k_{ll}(l_1, l_2) = \int_{a}^{b} \int_{c}^{d} k(l_1(u), l_2(v)) \, \mathrm{d}v \, \mathrm{d}u, \qquad (2.63)

where l_1 : [a, b] → C_1 and l_2 : [c, d] → C_2 are two arbitrary bijective parameterizations of the curves C_1 and C_2, such that l_1(a) and l_1(b) give the endpoints of C_1, and l_2(c) and l_2(d) give the endpoints of C_2.

In a similar way, the covariance between a point x and a parameterized curve l : [a, b] → C, where l(a) and l(b) give the endpoints of the curve C, is defined as

k_{l}(x, l) = \int_{a}^{b} k(x, l(u)) \, \mathrm{d}u. \qquad (2.64)

An example of inference using both point and integral observations can be seen in Figure 2.7.

2.6 Non-stationary kernel

Most covariance functions used for gp regression are stationary. That is, the covariance between two points x and x′ is a function only of the distance ||x − x′|| between them. As a consequence, a standard gp is unable to adapt to local variations in the smoothness of the function of interest. The ability to model input-dependent smoothness is essential in many fundamental problems, for example in mining and robotics. When modeling the ground of a large area, different areas may have different smoothness properties. For example, one area may be essentially flat, while another area contains large rocks or a sudden change in elevation, such as steps or walls in an indoor environment.



[Figure 2.7 appears here: true function, point observations, line observations, predicted mean and mean ± 2 std; see the caption below.]

Figure 2.7: Function inference combining point and line observations. The point observations are shown as red crosses and line observations as red lines. The line observations have been divided by their respective length and plotted as constant over the observed interval. When dividing by the length, the observation represents the mean of the function in that interval.

Here, a way to construct non-stationary covariance functions from stationary counterparts, by varying the length-scale parameters over the input space, is presented.

2.6.1 Construction of the non-stationary covariance function

One way of extending a stationary covariance function k(x_i, x_j) = k_{(S)}(||x_i − x_j||) = k_{(S)}(τ) to a non-stationary counterpart k_{(NS)}(x_i, x_j) was presented by Paciorek and Schervish [2004]. By defining the quadratic form

Q_{ij} = (x_i - x_j)^{\mathsf{T}} \left( \frac{\Sigma_i + \Sigma_j}{2} \right)^{-1} (x_i - x_j), \qquad (2.65)

where Σi and Σj are the local length-scales at xi and xj respectively, the function

k_{(NS)}(x_i, x_j) = |\Sigma_i|^{1/4} \, |\Sigma_j|^{1/4} \left| \frac{\Sigma_i + \Sigma_j}{2} \right|^{-1/2} k_{(S)}\!\left( \sqrt{Q_{ij}} \right) \qquad (2.66)

is a valid covariance function for every x_i, x_j ∈ R^d, ∀d ∈ N, if k_{(S)} is a valid covariance function on the same domain.

For the sake of brevity, from here on, only the case where the length-scale parameters are the same for all dimensions is considered. The result for the general case can be achieved in the same way. In this case the matrix Σ_i simplifies to \ell_i^2 \cdot I with \ell_i ∈ R.
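As an illustration (this derivation is an addition for exposition, assuming the isotropic parameterization Σ_i = \ell_i^2 I above), inserting the squared exponential base kernel k_{(S)}(τ) = \sigma_f^2 \exp(-\tau^2/2) into (2.65) and (2.66) for inputs in R^d gives

k_{(NS)}(x_i, x_j) = \sigma_f^2 \, (\ell_i \ell_j)^{d/2} \left( \frac{\ell_i^2 + \ell_j^2}{2} \right)^{-d/2} \exp\!\left( -\frac{\|x_i - x_j\|^2}{\ell_i^2 + \ell_j^2} \right),

which reduces to the ordinary stationary squared exponential with length-scale \ell when \ell_i = \ell_j = \ell.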

The value of the local length-scale ℓ(x) at location x can be specified as a deterministic function, but instead an independent gp prior is placed over ℓ. This



process, denoted GP_ℓ, is governed by a different covariance function with a set of hyperparameters θ_ℓ = (σ_f, σ_ℓ, σ_n). This allows local length-scales \bar{\ell} to be defined at support locations \bar{X} in the input space and interpolated with GP_ℓ to an arbitrary input location.

2.6.2 Making predictions

For making predictions using the extended model, integration over all possible length-scales is necessary to get the correct predictive distribution

p(f_* | X_*, X, y, \theta) = \iint p(f_* | X_*, X, y, \ell_*, \ell, \theta_y) \, p(\ell, \ell_* | X, X_*, \bar{\ell}, \bar{X}, \theta_\ell) \, \mathrm{d}\ell \, \mathrm{d}\ell_*.

This integral is intractable, and Paciorek and Schervish [2004] resorted to mcmc methods to approximate it. This is very slow, and to overcome it, Plagemann et al. [2008] proposed an approximation considering only the most probable length-scales. That is, making the simplification p(f_*|X_*, X, y, θ) ≈ p(f_*|X_*, \bar{\ell}_*, \bar{\ell}, X, y, θ_y), where \bar{\ell}_* and \bar{\ell} are the mean predictions of the length-scale process GP_ℓ at locations X_* and X respectively. Since the length-scale process GP_ℓ is independent in the combined regression model, making predictions with this approximation simply amounts to two standard gp predictions: first an evaluation of GP_ℓ to get the mean predictions (\bar{\ell}, \bar{\ell}_*), and then another of GP_y treating the local length-scales as fixed.

2.6.3 Learning Hyperparameters

As for normal gp regression, it is highly unlikely that the hyperparameters θ are known a priori. Instead, they should be learned using observed data. Assuming n observations y have been made at input locations X, the goal is to maximize the probability of observing y at locations X, that is,

p(y | X, \theta) = \int p(y | X, \ell, \theta_y) \, p(\ell | X, \bar{\ell}, \bar{X}, \theta_\ell) \, \mathrm{d}\ell.

This marginalization is intractable and instead, Plagemann et al. [2008] proposed to maximize the a posteriori probability of the latent length-scales

p(\ell | y, X, \theta) \propto p(y | X, \ell, \theta_y) \, p(\ell | X, \bar{\ell}, \bar{X}, \theta_\ell), \qquad (2.67)

where \bar{\ell} are the mean predictions of GP_ℓ. This entity can be maximized by employing gradient-based optimization techniques.

2.7 Multi-Class Gaussian Processes

There are situations where multiple classes need to be inferred simultaneously. This can, for example, be the case when input data from a real-world room contain information about object classes, such as computer, chair, etc. When classifying test points, the underlying functions of different object classes could also inherit properties dependent on their appearances. In other words, different object classes



might benefit from being inferred with different covariance function hyperparameters.

A common way to introduce multi-class classification is to build a separate gp for each object class. Each gp then gives the probability of its respective class label at each test point. Choosing, for example, the class with the highest probability at each test point would then give a final output.

Other ways incorporate dependencies between classes, building one large gp containing all the different class labels. The covariance matrix corresponding to this gp would then contain each class's covariance matrix (each containing all input data), together with cross-term covariance matrices, as sub-matrices.

2.7.1 Efficient Multi-Class Classification

A problem when introducing multiple classes is the increase in the amount of computation that needs to be performed. In the first case described above, with one gp for each class, each of the covariance matrices related to each gp needs to be inverted for regression. With C classes and N data points, this leads to a computational complexity of O(CN³). In the second case, with one large common covariance matrix, the complexity is even higher: O(C³N³).

To overcome this, Reid [2011] proposed a novel and more efficient way to perform multi-class classification. The method's advantage comes from the way the covariance matrix is built; all data are incorporated into the same covariance matrix, whose total size is only that of the number of input data. The same covariance function is used for all classes, and the cross-class covariances are obtained by averaging the hyperparameters.

Sorting data

To begin with, the labels and training data are sorted and stored in the following way:

y = [1, 1, \cdots, 1, \; 2, 2, \cdots, 2, \; \cdots, \; C, C, \cdots, C]^{\mathsf{T}} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_C \end{bmatrix}, \qquad (2.68)

X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_C \end{bmatrix}, \qquad (2.69)

the number of rows in X and y being the same as the number of measurements. As a new measurement is added, it is sorted into the vectors according to which class it belongs to.



Given the number of data points (or measurements), N, and the number of classes associated with these data, C, a latent function vector is constructed as

f = \big[ \underbrace{f^1_1, \cdots, f^1_N}_{= f^1}, \; \underbrace{f^2_1, \cdots, f^2_N}_{= f^2}, \; \cdots, \; \underbrace{f^C_1, \cdots, f^C_N}_{= f^C} \big]^{\mathsf{T}}, \qquad (2.70)

where

f^c_i = \begin{cases} +1, & y_i = c \\ -1, & y_i \neq c \end{cases}. \qquad (2.71)

Covariance functions and matrices

The efficiency is made possible by the way the covariance functions are formed. Taking an ordinary covariance function, it can be regarded as a function of three variables, namely two input coordinates and the hyperparameters:

k = k (x1, x2, θ) . (2.72)

The question that then arises is how to calculate the covariance when x_1 and x_2 belong to different classes, with different sets of hyperparameters. Reid solves this by simply taking the averages of the hyperparameters. The covariance between measurements of classes i and j is thus calculated as

k_{i,j} = k\!\left( x^i_1, \; x^j_2, \; \frac{\theta^i + \theta^j}{2} \right), \qquad (2.73)

where θ^i and θ^j are the hyperparameters corresponding to classes i and j respectively.

The single most important property of a proper covariance function is that it gives rise to a positive semi-definite covariance matrix. To ensure this, the properties of a diagonally dominant matrix are used, since a diagonally dominant matrix with positive and real diagonal elements is always positive semi-definite.

2.5 Definition (Diagonally dominant matrix). A matrix A with n rows and n columns, whose elements fulfill

|A_{i,i}| \geq \sum_{j=1, \, j \neq i}^{n} |A_{i,j}| \qquad (2.74)

is called diagonally dominant.

After constructing a covariance matrix K_{cf}(X, X) with the proposed covariance function (2.73), all the off-diagonal elements are summed column-wise and put as the diagonal elements in a diagonal "noise matrix". This is added to the original covariance matrix, thus constructing a new matrix K(X, X) with the necessary properties:

K(X, X) = \overbrace{ \begin{bmatrix} K_{11}(X_1, X_1) & \cdots & K_{1C}(X_1, X_C) \\ \vdots & \ddots & \vdots \\ K_{C1}(X_C, X_1) & \cdots & K_{CC}(X_C, X_C) \end{bmatrix} }^{K_{cf}(X, X)} + \varepsilon, \qquad (2.75)

\varepsilon = \operatorname{diag}\!\left( \sum_{i \neq 1} K_{cf\,i,1}, \; \sum_{i \neq 2} K_{cf\,i,2}, \; \cdots, \; \sum_{i \neq N} K_{cf\,i,N} \right). \qquad (2.76)

Regression and Classification

The predictive latent function values are inferred using the normal regression equation

f^c_* = K(X_*, X) K(X, X)^{-1} f^c. \qquad (2.77)

To deduce the probability that the test points belong to a certain class, the softmax (multinomial logistic) function is used,

\pi^c_{*i} = p(y_{*i} = c \,|\, f_*) = \frac{\exp(f^c_{*i})}{\sum_{c'} \exp(f^{c'}_{*i})}, \qquad (2.78)

where π^c_{*i} is the probability that query point i belongs to class c, and y_{*i} is the class label related to test point X_{*i}.
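As a small illustration of this classification step (a sketch under assumptions, not the library's code: the latent predictions from (2.77) are assumed to be collected in an Eigen matrix with one row per query point and one column per class), (2.78) can be evaluated as follows.

#include <Eigen/Dense>

// Illustrative softmax over latent class predictions, cf. (2.78).
// Rows index query points, columns index classes.
Eigen::MatrixXd softmaxProbabilities(const Eigen::MatrixXd& latent) {
  // Subtract the row-wise maximum before exponentiating, for numerical stability.
  Eigen::VectorXd rowMax = latent.rowwise().maxCoeff();
  Eigen::MatrixXd shifted = latent.colwise() - rowMax;
  Eigen::MatrixXd expf = shifted.array().exp().matrix();
  // Divide each row by its sum so that the class probabilities sum to one.
  return expf.rowwise().sum().cwiseInverse().asDiagonal() * expf;
}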

Training hyperparameters

It is essential to be able to train the hyperparameters of the covariance function for the most accurate regression. This is done by using the marginal likelihood function,

\log p(f^c \,|\, X, K) = -\frac{N}{2} \log\!\left( \frac{f^{c\mathsf{T}} K^{-1} f^c}{N} \right) - \frac{1}{2} \log |K| - \frac{N}{2} (\log 2\pi + 1), \qquad (2.79)

with the complete utility function as

\log p(f^{1 \cdots C} \,|\, X, K) = -\frac{N}{2} \sum_{c=1}^{C} \log\!\left( \frac{f^{c\mathsf{T}} K^{-1} f^c}{N} \right) - \frac{C}{2} \log |K| - \frac{NC}{2} (\log 2\pi + 1). \qquad (2.80)

Numerical optimization techniques can then be used to find the maximum of (2.80) w.r.t. the hyperparameters.


3 Implementation Details

A big part of the thesis work has been put into implementing a C++ library for Gaussian process regression and classification. This chapter describes the library, what functionality is implemented and how to effectively use its features.

3.1 Introduction

For generality, the Gaussian process part of the new occupancy mapping algorithm has been implemented as a stand-alone C++ library. Rather than restricting the code solely to ros, making it a library means it can easily be used also outside the world of occupancy mapping. To give flexibility, a template-based programming method is used. By passing, for example, covariance functions as template parameters, extending the library with new covariance functions is easily done without altering other parts of the library. For matrix operations, the Eigen C++ library was used (see Guennebaud et al. [2010]).
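To make the idea concrete, a new covariance function only needs to expose a call operator that evaluates the covariance between two inputs. The sketch below is purely illustrative: the class name and member layout are assumptions made here, not the library's actual interface.

#include <Eigen/Dense>
#include <cmath>

// Hypothetical covariance functor: a squared exponential with its hyperparameters stored
// as members. Any class with the same call operator could be used as the CovFunc
// template parameter.
struct SquaredExponentialCov {
  double sigma_f;  // signal standard deviation
  double l;        // length-scale
  double operator()(const Eigen::VectorXd& x1, const Eigen::VectorXd& x2) const {
    return sigma_f * sigma_f * std::exp(-(x1 - x2).squaredNorm() / (2.0 * l * l));
  }
};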

3.2 Layout

The foundation for the implementation is the basic idea of how different parts of the theory should work together, i.e., the layout. Even though different parts of the theory are implemented in slightly different ways, the main functionality revolves around a Gaussian process class, GaussianProcess. This class is always defined together with a mean function class, MeanFunc, and a covariance function class, CovFunc, as template parameters, and is used when performing most of the gp operations.

A flowchart over simple regression can be seen in Figure 3.1, illustrating how this is performed.




Figure 3.1: Flowchart illustrating how simple regression is done. The Regression class is defined with a GaussianProcess class as template parameter, which is defined with a CovFunc and a MeanFunc as template parameters. The gaussian_process object, declared as a GaussianProcess, is fed with query points and training data using its member functions set_query_points(query points) and set_training_data(training data). The inferred output at the query points is obtained from the regression object, declared as a Regression, by using its member function inference(gaussian_process).

The Regression class is defined with a GaussianProcess class as template parameter, which is, as previously mentioned, defined with a CovFunc and a MeanFunc as template parameters. To perform the regression, two objects are declared: regression and gaussian_process. The gaussian_process is fed with query points and training data using its member functions set_query_points(query points) and set_training_data(training data). The inferred output at the query points is obtained by using the regression object's member function inference(gaussian_process), which takes the previously declared gaussian_process object as parameter and outputs the inferred values.
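Written out as calling code, the flow above might look as follows. This is only an illustration of the call order: the member function names are taken from the text, but the exact signatures, the covariance functor (the SquaredExponentialCov sketch from Section 3.1) and the mean function type ZeroMean are assumptions rather than the library's actual interface.

// Hypothetical usage of the regression flow in Figure 3.1.
// X and y hold the training inputs and targets, Xq the query points.
GaussianProcess<SquaredExponentialCov, ZeroMean> gaussian_process;
gaussian_process.set_training_data(X, y);   // training data, as in the flowchart
gaussian_process.set_query_points(Xq);      // query points to infer at

Regression<GaussianProcess<SquaredExponentialCov, ZeroMean> > regression;
Eigen::VectorXd predictions = regression.inference(gaussian_process);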

3.3 Design notes

Some parts of the library deserve further explanation, which will be given in this section. Both threading and how optimization is implemented are examples of important parts of the library, though not necessarily Gaussian process related. These will be explained thoroughly, together with how the integral kernels were implemented.

3.3.1 Integral Kernel

The integral kernel was defined in (2.63) for the covariance between two line observations and in (2.64) for the covariance between a line and a point observation.



Both expressions are in general intractable, and numerical methods are necessary to evaluate them. In the library, both expressions are evaluated using the Clenshaw–Curtis quadrature proposed by Clenshaw and Curtis [1960].

The Clenshaw–Curtis quadrature approximates a definite integral by a weighted sum

I = \int_{-1}^{1} f(x) \, \mathrm{d}x \approx \sum_{k=0}^{n} w_k f(x_k), \qquad (3.1)

where n is the order of the quadrature and w_k are weights with corresponding evaluation points x_k. The evaluation points are taken to be the extrema of the Chebyshev polynomial augmented by the boundary points

x_k = \cos\!\left( \frac{k\pi}{n} \right), \qquad k = 0, 1, \ldots, n. \qquad (3.2)

The weights w_k are obtained by integrating the n-th-degree polynomial interpolating the n + 1 discrete points (x_k, f(x_k)). This leads to expressions for w_k that depend on the order n but not on the function f(x) itself. Both w_k and x_k can thus be precomputed for a given order regardless of the function f(x). The expression for w_k is given by Waldvogel [2006] as

w_k = \frac{c_k}{n} \left( 1 - \sum_{j=1}^{\lfloor n/2 \rfloor} \frac{b_j}{4j^2 - 1} \cos\!\left( \frac{2jk\pi}{n} \right) \right), \qquad k = 0, 1, \ldots, n, \qquad (3.3)

where the coefficients bj and ck are defined as

b_j = \begin{cases} 1, & j = n/2 \\ 2, & j < n/2 \end{cases} \qquad c_k = \begin{cases} 1, & k = 0 \bmod n \\ 2, & \text{otherwise}. \end{cases} \qquad (3.4)

Evaluating equation (2.64), the covariance between a line and a point observation, can be done directly by observing that

\int_{a}^{b} f(t) \, \mathrm{d}t = \frac{b-a}{2} \int_{-1}^{1} f\!\left( \frac{b-a}{2} x + \frac{a+b}{2} \right) \mathrm{d}x. \qquad (3.5)

Evaluation of (2.63), the covariance between two line observations, necessitates an approximation of double integrals. In the case of Clenshaw–Curtis quadrature, this amounts to setting

I = \int_{-1}^{1} \int_{-1}^{1} f(x, y) \, \mathrm{d}x \, \mathrm{d}y \approx \sum_{k=0}^{n} \sum_{i=0}^{n} w_k w_i f(x_k, x_i), \qquad (3.6)

where the expressions for the weights and evaluation points are the same as before. Rescaling of the evaluation points is done analogously to the single integral. Note that the number of function evaluations needed to compute the double integral is (n + 1)² as opposed to n + 1 for the single integral.
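The following sketch illustrates how the nodes and weights of (3.2)-(3.4) could be precomputed and used to approximate a single integral over an arbitrary interval via (3.5). It is an illustration under assumptions (the struct name and interface are hypothetical), not the library's implementation; the double integral in (3.6) would use two nested sums over the same nodes and weights.

#include <cmath>
#include <vector>
#include <functional>

struct ClenshawCurtis {
  std::vector<double> x, w;  // nodes and weights on [-1, 1]
  explicit ClenshawCurtis(int n) : x(n + 1), w(n + 1) {
    const double pi = 3.14159265358979323846;
    for (int k = 0; k <= n; ++k) {
      x[k] = std::cos(k * pi / n);                       // nodes, cf. (3.2)
      double c_k = (k % n == 0) ? 1.0 : 2.0;             // coefficient c_k, cf. (3.4)
      double sum = 0.0;
      for (int j = 1; j <= n / 2; ++j) {
        double b_j = (2 * j == n) ? 1.0 : 2.0;           // coefficient b_j, cf. (3.4)
        sum += b_j / (4.0 * j * j - 1.0) * std::cos(2.0 * j * k * pi / n);
      }
      w[k] = c_k / n * (1.0 - sum);                      // weights, cf. (3.3)
    }
  }
  // Approximate the integral of f over [a, b] by rescaling the nodes, cf. (3.5).
  double integrate(const std::function<double(double)>& f, double a, double b) const {
    double sum = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k)
      sum += w[k] * f(0.5 * (b - a) * x[k] + 0.5 * (a + b));
    return 0.5 * (b - a) * sum;
  }
};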



[Figure 3.2 appears here: log-log plots of the lowest quadrature order needed for a constant relative error, (a) versus the length-scale with fitted line y = −0.979x + 2.855, and (b) versus the line length with fitted line y = 0.955x + 2.228; see the caption below.]

Figure 3.2: Lowest order needed in the quadrature to maintain a constant relative error, as a function of the length-scale and of the line length, plotted on logarithmic scales. Fitting straight lines to the data gives slopes very close to negative one for the length-scale and positive one for the line length. This shows that the order of the quadrature should be scaled proportionally to the line length and to the inverse of the length-scale to keep a constant relative error.

Order Selection

The order n of the quadrature determines both the quality of the approximation and the complexity. A higher order gives better approximations but requires more function evaluations. In a stationary covariance function, the hyperparameters can often be interpreted as length-scales. This is true for the squared exponential and the Matérn family of covariance functions. To keep the relative error constant, the order of the quadrature can be adapted to the length-scale of the covariance function and the length of the line observation. Intuitively, the order should be proportional to the length of the line and inversely proportional to the length-scale. Figure 3.2 shows the minimum order needed to keep a constant relative error as a function of the length-scale in (a) and as a function of line length in (b), both plotted with logarithmic scales on both axes. The fitted straight lines make an almost perfect fit and have slopes of negative one for the length-scale and positive one for the line length, confirming the intuitive relation.

3.3.2 Threading

The performance of modern CPUs increases every year. For many decades, this improvement could be achieved by shrinking the size of the integrated circuits, packing more functionality into the processor, and by increasing the clock frequency of the processor to allow more instructions per second to be processed. This, however, comes at a cost. Circuits generate heat, and the more circuits are packed into an area, the more heat is produced. Small circuits and high clock speeds also introduce problems with data synchronization and data integrity, where current can leak between adjacent tracks in the chip without them being connected.



Modern processors are closing in on the limit of what is practically possible in terms of clock speed, and other measures must be taken to continue to increase the processing power.

Instead of increasing the clock frequency, the trend today is to construct processors with multiple cores. Simplified, a core is a logical unit in the processor that can execute the instructions of a single program at each instant of time. Many programs can then take turns and share the core among each other, making it seem as if they were executed in parallel. In contrast, when adding multiple cores to a single processor, many programs can run at the same time, or alternatively, one program can run multiple processing tasks in parallel on the different cores. Modern consumer processors typically contain at least two cores, with up to eight cores for enthusiast editions.

To take advantage of the full power of modern processors, programs have to be explicitly written to allow parallel execution. Threading is not natively supported by C++. One option is to use Threading Building Blocks (tbb) by Intel® [2012]. tbb is a template-based library that simplifies parallelism in C++ by providing parallel algorithms and containers that automatically scale to the number of cores available. It is open source and supports most common compilers, both commercial and free.

Threading in GPs

All compound matrix expressions are automatically optimized by the matrix library. Even though the matrix library does not yet support multi-threading, breaking up expressions that can be run in parallel and, for example, coding one's own parallel for loop won't necessarily improve performance: the longer the expression, the more ways the matrix library can optimize the calculations. Also, since multi-threading may be integrated in the matrix library in the future, threading is only used when calculating the integral kernel. To evaluate the performance gained from threading, benchmarks were performed on an Intel® Core™ i5-2520M processor with two physical cores running at 2.50 GHz. The processor also features Intel® Hyper-Threading technology, running two virtual threads on each core, giving a total of four parallel threads. The number of cycles needed to calculate the covariance between 6000 point-to-point, 400 line-to-point and 200 line-to-line observations can be seen in Table 3.1. The calculation time needed for the line-to-point and line-to-line observations is halved, in correspondence with the number of cores in the test machine.
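As an illustration of how such threading could look (a sketch under assumptions: the Line type, the helper line_to_line_cov and the function name are hypothetical, not the library's code), tbb::parallel_for can distribute the rows of the line-to-line covariance matrix over the available cores.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <Eigen/Dense>
#include <vector>

struct Line { Eigen::Vector2d start, end; };
// Assumed to be implemented elsewhere using the quadrature of Section 3.3.1, cf. (2.63).
double line_to_line_cov(const Line& a, const Line& b);

void fillLineCovariance(const std::vector<Line>& lines, Eigen::MatrixXd& K) {
  const std::size_t n = lines.size();
  K.resize(n, n);
  tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
                    [&](const tbb::blocked_range<std::size_t>& rows) {
    for (std::size_t i = rows.begin(); i != rows.end(); ++i)
      for (std::size_t j = 0; j <= i; ++j)
        K(i, j) = K(j, i) = line_to_line_cov(lines[i], lines[j]);  // symmetric entries
  });
}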

3.3.3 Optimization

An important task when handling Gaussian processes is the ability to learn the hyperparameters of the covariance function from data. Both ways presented in this thesis, maximizing the marginal likelihood or leave-one-out cross validation, lead to non-convex optimization problems. How severe the problem of non-convexity is depends on both the covariance function and the data.

The library uses its own internal representation of the optimization problem.



                    point-point     line-point     line-line
Without threading   54991538846     2413314984     35312107172
With threading      36437215597     1319189808     17175990349
Improvement         33.7 %          45.3 %         51.4 %

Table 3.1: Performance gain from multi-threading the integral kernel. The numbers show the clock cycles needed to compute the covariance between 6000 point-to-point, 400 line-to-point and 200 line-to-line observations, and the relative improvement. For line-to-point and line-to-line observations, the computation time is almost halved, in good correspondence with the number of physical cores in the test machine.

This gives the possibility to use any optimizer with the library. In the current configuration, two different optimizers are supported: the first is the commercially available Knitro optimizer, and the second is a variant of the simulated annealing method implemented directly in the library.

Knitro

Knitro is a commercial software package specialized in nonlinear optimization that offers three state-of-the-art optimization techniques where convexity is not a necessary condition. Two of the algorithms are interior point methods and one is an active-set method; see Byrd et al. [2006] for a detailed description. Knitro can automatically decide which method to use and also supports crossover from one algorithm to another during runtime.

Simulated Annealing

Simulated annealing is an optimization technique that has the potential to solve non-convex problems. It was introduced to handle discrete problems where an exhaustive enumeration of all possible values is intractable, but it can also be used for continuous problems. The algorithm was inspired by annealing in metallurgy, where a metal is heated and then cooled in a controlled way. This allows the atoms to arrange themselves in larger structures by forming larger crystals. The atoms try to achieve the configuration with the lowest energy, but the temperature can knock an atom loose, allowing it to move randomly until it finds a new local minimum of the energy. In the same way, simulated annealing gradually lowers a temperature. In each step a new, neighboring, solution is proposed and its energy, or cost, is evaluated. The probability of accepting the new state depends on the temperature. If the temperature is high, almost all new solutions are accepted, but as the temperature goes down, only states with lower cost will be accepted.

An overview of simulated annealing can be seen in Algorithm 1. The pseudo code contains the two functions cool and neighbour. The cooling of the temperature can be implemented as a fixed fraction of the current temperature or as a function of the remaining computation time. The function neighbour generates a new solution that is in some sense close to the current estimate. This is done by randomly choosing one of the variables and multiplying it by a positive value. The multiplicative value is decreased as the temperature lowers, so that smaller and smaller changes are made.

Algorithm 1 Simulated Annealing

procedure SimulatedAnnealing(x0, f)
    x ← x0; J ← f(x0)                              ▷ Current estimate and cost
    x̄ ← x; J̄ ← J                                   ▷ Best estimate and cost
    repeat
        T ← cool(T)                                ▷ Lower temperature
        xnew ← neighbour(x)                        ▷ Generate new proposal solution
        Jnew ← f(xnew)                             ▷ Cost of new solution
        if exp(−(Jnew − J)/T) > U(0, 1) then       ▷ Probability of change
            x ← xnew; J ← Jnew
        end if
        if Jnew < J̄ then                           ▷ New state has the lowest cost so far
            x̄ ← xnew; J̄ ← Jnew
        end if
    until convergence
    return x̄
end procedure
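A minimal C++ sketch of Algorithm 1 is given below. It is an illustration only: the geometric cooling schedule, the multiplicative neighbour step and all names are assumptions made here, not the optimizer shipped with the library.

#include <vector>
#include <random>
#include <cmath>
#include <functional>

std::vector<double> simulatedAnnealing(std::vector<double> x,
                                       const std::function<double(const std::vector<double>&)>& cost,
                                       double T, int iterations) {
  std::mt19937 rng(0);
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  double J = cost(x);
  std::vector<double> xBest = x;                       // best estimate so far
  double jBest = J;
  for (int it = 0; it < iterations; ++it) {
    T *= 0.99;                                         // cool: geometric temperature decay
    std::vector<double> xNew = x;                      // neighbour: perturb one variable
    std::size_t idx = rng() % xNew.size();
    xNew[idx] *= std::exp((uniform(rng) - 0.5) * T);   // positive multiplicative change
    double jNew = cost(xNew);
    if (std::exp(-(jNew - J) / T) > uniform(rng)) {    // accept with temperature-dependent probability
      x = xNew; J = jNew;
    }
    if (jNew < jBest) {                                // keep track of the best solution
      xBest = xNew; jBest = jNew;
    }
  }
  return xBest;
}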

To demonstrate the ability of simulated annealing to avoid local minima, Figure 3.3 shows the true function f together with samples y_i = f(x_i) + ε_i, where ε_i ∼ N(0, 0.1²). Inserting these values into a gp using the neural network covariance function and optimizing with the same initial condition using both Knitro and simulated annealing gives the results in Figure 3.4. Knitro gets stuck at a local minimum and produces a bad prediction, while simulated annealing is able to avoid the local minimum and converge to a better solution. The simulated annealing optimization was then followed by a run through Knitro to further optimize the value, but the solution was only changed slightly.



[Figure 3.3 appears here: "True function and measurements with additive noise", showing the true function f(x) and the noisy samples y; see the caption below.]

Figure 3.3: True function f(x) together with samples y_i = f(x_i) + ε_i, where ε_i ∼ N(0, 0.1²).



[Figure 3.4 appears here: two panels, "Optimized using Simulated Annealing and Knitro" (top) and "Optimized using Knitro" (bottom); see the caption below.]

Figure 3.4: The predicted mean and variance for the function f(x) after the hyperparameters have been optimized with the same initial conditions, using simulated annealing in the top graph and Knitro in the bottom graph. The variance is shown as the gray area behind the mean, representing two standard deviations. Knitro found a local minimum with final cost Jknitro = −122, while simulated annealing followed by Knitro performs better with final cost JSA = −33.


4 Mapping

This chapter describes mapping using a laser scanner. The history of mapping is touched upon briefly, followed by an introduction to mapping using Gaussian processes. The chapter also contains other topics related to mapping, like methods to make the mapping algorithm more feasible in an online setting.

4.1 Introduction

A fundamental task for a robot operating autonomously is to be aware of its surroundings. For a mobile robot to carry out complex missions in different environments, it is of great importance that it can build and maintain accurate models of its surroundings, as mentioned by, among others, Thrun and Bücken [1996] and Elfes [1989]. This requires the mapping techniques to be not only reliable, but also sufficiently fast.

One of the most commonly used methods for robotic mapping is the occupancy grid method. The environment is discretized into multiple independent grid points, which are classified as either occupied or free. Moravec and Elfes [1985] made the first major contribution to the theory of occupancy grids, when they built occupancy maps from wide-angle sonar. They used a simple but effective way of adding sensor information by calculating and updating the probabilities of occupancy in an iterative manner. Some years later, Elfes [1989] concludes that "The occupancy grid framework represents a fundamental departure from traditional approaches to robot perception and spatial reasoning".

Although intuitive and popular, the occupancy grid methods make a major simplification by assuming independence between cells. This improves real-time performance, but gives worse accuracy in areas with sparse data. In these areas,




spatial context could help give more accurate results. The great increase of computational power in modern computers has given way for computationally more demanding and more accurate methods of performing occupancy mapping.

Already Pagac et al. [1996] proposed a method for overcoming the assumption of independence between cells. The proposed evidential approach makes use of the width of the sonar's sensor beam when inferring the occupancy map. By modeling the range sensor performance, the inference is made over the whole arc of uncertainty sprung from the width of the sensor beam. There will thus be a probabilistic correlation between points. The major drawback is that with narrow sensor beams the method will give little, if any, improvement over earlier methods.

Another attempt was made by Paskin and Thrun [2005], using polygonal random fields. The polygonal random fields merge the advantages of the occupancy grids' simplicity and speed with the geometric approaches' way of representing complex environments in a compact manner. Geometric approaches do not normally argue about occupancy but rather about boundaries, and the occupancy grids do not account for the correlation between points from the spatial structures. With the probability distributions of polygonal random fields, all these problems can be managed within an existing probabilistic framework. By having a spatial correlation property, occupancy at each point will be influenced by nearby points, while also improving performance when using inaccurate sonar sensors. Paskin clearly displays this by comparing the posterior distribution of an occupancy grid with that of a polygonal random field, with sonar measurements as input data.

Aside from the world of occupancy mapping, the machine learning community has in recent years developed new techniques for supervised learning using Gaussian processes. A book on Gaussian processes for machine learning by Rasmussen and Williams [2005] has since become somewhat of a reference work. The book discusses supervised learning using Gaussian processes in general. The theory and the different parts of the learning process are discussed and exemplified, while the Gaussian processes are also compared to other models.

By applying the theory of Gaussian processes to the problem of occupancy mapping, O'Callaghan et al. [2009] presented a new way to build contextual occupancy maps using Gaussian processes. Their approach to mapping does not discretize the environment beforehand but preserves the continuous nature of reality. Using the properties of a Gaussian process, this allows them to use the underlying spatial structure when inferring over areas with sparse data. The method also produces an associated variance at each query point in the room, which could then be used when trying to find unexplored regions and optimizing the robot's search plan.

4.1.1 Mapping using Gaussian processes

The use of Gaussian processes as a technique for occupancy mapping was introduced by O'Callaghan et al. [2009]. They considered the problem of occupancy mapping as a classification problem. The Gaussian process was used to



infer the probability of an arbitrary query point being occupied, together with the variance of the estimate. By querying multiple test points evenly distributed in space, followed by simple thresholding of the probabilities, an occupancy map was constructed. For classification, a least squares classifier was used, and the hyperparameters of the covariance function and the parameters of the classifier were trained independently. First the hyperparameters were trained by maximizing the marginal likelihood, and then the parameters of the classifier were trained using leave-one-out cross validation. These methods are well described by Rasmussen and Williams [2005].

O'Callaghan et al. showed promising results, but uncertainties in the robot's location were not considered. Later on, O'Callaghan et al. [2010] extended their work by incorporating input uncertainties in the gp framework. For additive Gaussian distributed noise, an analytical expression was derived for the squared exponential covariance function. For more general covariance functions, the computations become intractable and numerical approximations must be used. The uncertainties from laser scans are not Gaussian, but were in the paper approximated using the unscented transform, as described by Julier and Uhlmann [1997].

McHutchon and Rasmussen [2011] recently presented a different method of incorporating input uncertainties. The correct way of training on input points corrupted by Gaussian noise is to consider every input point as a Gaussian distribution. This, however, leads to intractable models, and approximations must be used. In their paper, they used a linear approximation of the posterior mean function to propagate the input noise to the output. Solving the resulting equations directly leads to analytically unsolvable differential equations. To overcome this, they proposed an iterative scheme where a mean function is first computed, ignoring the input uncertainties, and the derivative of the resulting mean function is then used to add the uncertainties of the inputs.

A big problem with Gaussian processes is the need to invert the covariance matrix, which has complexity O(n³), where n is the number of data points. Smith et al. [2010] applied Gaussian processes to 3d surface reconstruction from laser scan data. In order to reduce the number of data points, they used active sampling. By evaluating the information gained by adding a new data point to the existing ones using the Kullback-Leibler divergence, points giving little new information could be rejected.

O'Callaghan and Ramos [2011] developed the continuous occupancy mapping further. Their earlier approaches discretized the continuous line segment observations into point observations. By introducing integral kernels, the whole line segment could be used as a single observation. Defining the covariance between two lines, and between a line and a point, makes the covariance matrix of the data points independent of the query points. This allows the covariance matrix to be stored and expanded when new measurements are made. An even better way is to directly store and update the Cholesky factorization of the covariance matrix, as described by Osborne [2010] and by Smith et al. [2010].



The use of integral kernels is able to speed up the algorithm, but the computation of the matrix inverse is still a major bottleneck. Even with active sampling, the covariance matrix will grow to an intractable size. O'Callaghan solves this by splitting the covariance matrix once it reaches a certain size and beginning to grow a new model with the incoming observations. However, there are better ways to do this partitioning of the data.

By applying clustering techniques to the data and making local Gaussian processes on each cluster, the sizes of the covariance matrices can be kept small. The local Gaussian processes can then be added together, using for example a mixture of Gaussian processes, Tresp [2001], or a Bayesian committee machine, Tresp [2000], to create a global occupancy map. An overview of ways to combine the results of different classifiers is given by Dietterich [2000].

One approach to clustering is to partition the data according to topology. In an indoor environment, the occupancy in one room should in general be independent of the occupancy in the next room. The performance loss from having two separate Gaussian processes for the two rooms should then be negligible. Thrun and Bücken [1996] described a method of integrating occupancy grids with topological maps. Although made for occupancy grids, the method should be adaptable to the Gaussian process paradigm by looking at the discretized output. Thrun creates a Voronoi diagram, on which critical points and lines are found that divide the room into topological regions. Studying larger partitioned areas, such as a floor with multiple rooms, this method should be able to generate interesting results. In open outdoor areas, the method might not be successful at all.

4.2 Occupancy Mapping Using Gaussian Processes

The above reasoning, together with the theory presented in Chapter 2, can now be used to perform continuous occupancy mapping with integral kernels. The starting point is the regression formulae, whose outputs are then classified using the least-squares classifier, that is,

f(x_*) \sim \mathcal{N}\!\left( \mu_*, \sigma_*^2 \right), \qquad (4.1)

where

\mu_* = k_*^{\mathsf{T}} \left( K + \sigma_n^2 I \right)^{-1} y, \qquad (4.2)

\sigma_*^2 = k_{**} - k_*^{\mathsf{T}} \left( K + \sigma_n^2 I \right)^{-1} k_*, \qquad (4.3)

with the probability of occupancy at each point as

p(y_+ | X, y, x_*) = \Phi\!\left( \frac{y_+ (\alpha \mu_* + \beta)}{\sqrt{1 + \alpha^2 \sigma_*^2}} \right). \qquad (4.4)

The input data contains both points and lines, and the covariance vectors and



matrices must be calculated as in Section 2.5. Thus, between two lines the covariance becomes

k_{II}(l_1, l_2) = \int_{a}^{b} \int_{c}^{d} k(l_1(u), l_2(v)) \, \mathrm{d}v \, \mathrm{d}u, \qquad (4.5)

and between a line and a point

k_{I}(l, x) = \int_{a}^{b} k(l(u), x) \, \mathrm{d}u. \qquad (4.6)

For most covariance functions k, these integrals are intractable. Hence, they are approximated using quadrature as in Section 3.3.1.

When running online and sequentially adding new measurements, the Cholesky factorization of K is stored and updated. This is due to the decrease in computational complexity when inverting a matrix using its Cholesky decomposition rather than the matrix itself. Given the current covariance matrix K_{1,1} and its Cholesky decomposition R_{1,1}, the new covariance matrix

\begin{bmatrix} K_{1,1} & K_{1,2} \\ K_{2,1} & K_{2,2} \end{bmatrix}

is Cholesky decomposed as

\begin{bmatrix} S_{1,1} & S_{1,2} \\ 0 & S_{2,2} \end{bmatrix},

where the entries are updated according to:

S_{1,1} = R_{1,1}, \qquad (4.7)

S_{1,2} = R_{1,1}^{\mathsf{T}} \backslash K_{1,2}, \qquad (4.8)

S_{2,2} = \operatorname{chol}\!\left( K_{2,2} - S_{1,2}^{\mathsf{T}} S_{1,2} \right). \qquad (4.9)
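A minimal Eigen sketch of this update is given below, assuming an upper-triangular factor R_{1,1} with R_{1,1}ᵀ R_{1,1} = K_{1,1}; the function name and interface are hypothetical, not the library's.

#include <Eigen/Dense>

// Grow an upper-triangular Cholesky factor R11 to the factor of the extended
// covariance matrix [K11 K12; K12^T K22], following (4.7)-(4.9).
Eigen::MatrixXd updateCholesky(const Eigen::MatrixXd& R11,
                               const Eigen::MatrixXd& K12,
                               const Eigen::MatrixXd& K22) {
  const int n = R11.rows(), m = K22.rows();
  // S12 solves R11^T S12 = K12, a triangular solve, cf. the backslash in (4.8).
  Eigen::MatrixXd S12 = R11.triangularView<Eigen::Upper>().transpose().solve(K12);
  // S22 is the Cholesky factor of the Schur complement K22 - S12^T S12, cf. (4.9).
  Eigen::MatrixXd S22 = (K22 - S12.transpose() * S12).llt().matrixU();
  Eigen::MatrixXd S = Eigen::MatrixXd::Zero(n + m, n + m);
  S.topLeftCorner(n, n) = R11;                           // unchanged block, cf. (4.7)
  S.topRightCorner(n, m) = S12;
  S.bottomRightCorner(m, m) = S22;
  return S;
}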

It is necessary at some point to optimize the covariance function's hyperparameters. This is usually done by taking a subset of the whole quantity of input data, optimizing over it, and then mapping using the whole set of data. In an off-line setting, the subset can often contain the whole data set, as fast operations are not as highly prioritized. When mapping on-line, though, the map is often partially scanned to obtain sufficient scans (but not too many; optimizing over large data sets is computationally heavy). These scans are then used to optimize the parameters before the robot is allowed to continue its mapping. Sometimes even simulated data can be used for optimization before starting the mapping procedure.

4.3 Active sampling

For each new measurement added to the gp, a new set of covariances needs to be calculated. Not only is this computationally burdensome, but it also adds to the total number of data to operate on. As the number of measurements increases, so does the computational load when performing regression, and speed will be reduced. If the speed drops too low, on-line operation will not be possible.



Though, as is noticeable, not all measurements will make a large impact on the final inferred map; similar measurements, e.g. originating from the same location with the same value, will not change the map as much as measurements far away from, or not similar to, ones already existing. Reducing or removing redundant measurements would help keep the number of stored data small and thus improve performance while doing mapping on-line. Preferably, this is also to be done as measurements are added; one could analyze all data points to do sorting and removing, but the extra calculations required to perform such a task risk negating the benefits of rejecting redundant data. The way of sorting and rejecting incoming measurements will hereby be denoted active sampling. The method implemented is described in Algorithm 2.

Algorithm 2 Algorithm for active sampling

for all incoming measurements do
    Obtain current mean μ_o and variance σ_o² at the incoming measurement's location
    Add the incoming measurement to the gp
    Obtain new mean μ_n and variance σ_n² at the incoming measurement's location
    Calculate the kl divergence D_kl
    if D_kl < Θ then
        Remove the measurement from the gp as it is considered redundant
    else
        Do nothing
    end if
end for

In the implementation, the Kullback-Leibler (kl) divergence is used as a measure of the difference between two distributions: the distribution before and after a new measurement is added. The larger the difference, the more information a new measurement can be said to possess. Smith et al. [2010] give the closed form solution for the kl divergence between two Gaussian distributions as

D_{kl}(\mathcal{N}_n \| \mathcal{N}_o) = \frac{1}{2} \left[ \log \frac{\sigma_o^2}{\sigma_n^2} + \frac{\mu_o^2 + \mu_n^2 - 2\mu_o\mu_n + \sigma_n^2}{\sigma_o^2} - 1 \right]. \qquad (4.10)

The incoming measurement will be rejected or kept based upon whether the kl divergence is more or less than a certain pre-defined threshold.
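A small sketch of the decision step in Algorithm 2 is shown below, assuming the means and variances before (μ_o, σ_o²) and after (μ_n, σ_n²) adding the measurement have already been queried from the gp; the function name and threshold parameter are hypothetical.

#include <cmath>

// Active-sampling test: compute the closed-form KL divergence of (4.10) and compare
// it to the threshold Θ; measurements below the threshold carry little new information.
bool isRedundant(double mu_o, double var_o, double mu_n, double var_n, double threshold) {
  double d_kl = 0.5 * (std::log(var_o / var_n)
                       + (mu_o * mu_o + mu_n * mu_n - 2.0 * mu_o * mu_n + var_n) / var_o
                       - 1.0);
  return d_kl < threshold;
}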

4.3.1 Examples

To illustrate active sampling, a function was sampled, as can be seen in Figure 4.1a. Two functions are then inferred using this as input data; one using active sampling and one without. The result can be seen in Figure 4.1b, where data kept after performing active sampling are plotted as red circles.

Even with a significant amount of data being neglected, the two inferred functions share the same appearance. In applications where the number of data points is critical,



[Figure 4.1 appears here: panel (a) the sampled input data, panel (b) the data kept by active sampling and the inferred functions with and without active sampling; see the caption below.]

Figure 4.1: (a) The sampled data used for exemplifying active sampling. (b) Example of active sampling. Black crosses represent input data, red circles data kept after active sampling, and the lines represent the inferred functions with (red) and without (black) active sampling.

active sampling can thus seem like a plausible solution for keeping this number low.

4.4 Splitting and merging Gaussian processes

Using various tricks, such as pre-factorization of covariance matrices, the computational complexity of the most onerous operations, like matrix inversions, can be reduced to O(n²), n being the number of measurements. Some of these operations have to be done in every time frame; with a growing number of measurements this will eventually make the system too slow for on-line operation. If the gp, instead of just growing, could be split into multiple ones, each having a limited number of measurements, the limitation of quadratically growing computational load would be avoided. After performing regression in each of the smaller gps, their output could be summed together to build the large map, smaller meaning "with fewer measurements".

The new problem thus becomes two-fold: first, one must find a way to split the gps to build smaller sub-maps and to assign the incoming measurements to the proper ones, and second, the outputs from the gps must be merged together to build the complete map. Most likely the number of smaller maps is not known in advance, so some decision rules also have to specify when and how to start building new ones.

4.4.1 Bayesian Committee Machine

One way to merge outputs from different gps is to utilize the Bayesian committee machine (bcm), as described by Tresp [2000]. The expressions for the mean and



covariance of the approximate density P become

E(f_* | X_*, D) = \operatorname{cov}(f_* | D) \sum_{i=1}^{M} \operatorname{cov}(f_* | X_*, D_i)^{-1} E(f_* | X_*, D_i), \qquad (4.11a)

\operatorname{cov}(f_* | D) = \left[ -(M - 1) K_{**}^{-1} + \sum_{i=1}^{M} \operatorname{cov}(f_* | D_i)^{-1} \right]^{-1}, \qquad (4.11b)

where D is all the measurements, D_i the measurements of the i-th gp, M the number of gps, and E(f_*|D_i) and cov(f_*|D_i) the predicted mean and covariance obtained from regression for the i-th gp. Algorithm 3 describes how the bcm is used.

Algorithm 3 Algorithm for calculating the mean and covariance using the bcm

Require: K_{**}, M = number of gps
for i = 1 → M do
    CovSum ← CovSum + cov(f_*|D_i)^{-1}
end for
C ← −(M − 1) × K_{**}^{-1} + CovSum
for i = 1 → M do
    EstSum ← EstSum + cov(f_*|D_i)^{-1} E(f_*|x_*, D_i)
end for
cov(f_*|D) ← C^{-1}
E(f_*|x_*, D) ← cov(f_*|D) × EstSum
return cov(f_*|D), E(f_*|x_*, D)
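The following Eigen sketch illustrates the fusion in Algorithm 3; the struct and function names are assumptions made for illustration and do not come from the library.

#include <vector>
#include <Eigen/Dense>

// Prediction of one local gp at the joint query points.
struct GpPrediction {
  Eigen::VectorXd mean;        // E(f*|Di)
  Eigen::MatrixXd covariance;  // cov(f*|Di)
};

// Fuse local predictions with the full BCM, cf. (4.11a)-(4.11b) and Algorithm 3.
void fuseBcm(const Eigen::MatrixXd& Kss,               // prior covariance K** of the query points
             const std::vector<GpPrediction>& locals,  // one prediction per local gp
             Eigen::VectorXd& mean, Eigen::MatrixXd& cov) {
  const int M = static_cast<int>(locals.size());
  const int n = Kss.rows();
  Eigen::MatrixXd covSum = Eigen::MatrixXd::Zero(n, n);
  Eigen::VectorXd estSum = Eigen::VectorXd::Zero(n);
  for (const GpPrediction& p : locals) {
    Eigen::MatrixXd covInv = p.covariance.inverse();
    covSum += covInv;              // sum of cov(f*|Di)^-1
    estSum += covInv * p.mean;     // sum of cov(f*|Di)^-1 E(f*|Di)
  }
  Eigen::MatrixXd C = (1.0 - M) * Kss.inverse() + covSum;  // -(M-1) K**^-1 + CovSum
  cov = C.inverse();
  mean = cov * estSum;
}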

4.4.2 Slim Bayesian Committee Machine

A problem with the Bayesian committee machine is the inversion of the covariance matrix. When the resolution of the map increases, the size of the matrix scales quadratically. To generate a map of 200 × 200 test points, the covariance between all pairs of points would have to be stored. The resulting covariance matrix would be of size 40000 × 40000. Without any optimization of storage, such as only storing the upper triangular part or considering sparsity, the memory required to store the matrix in double precision would be approximately 12 GB. Modern computers can store such a matrix, but then the matrix must also be inverted. This is in practice impossible with ordinary home computers, and even then a 200 × 200 sized map is in most cases too coarse to be useful.

To overcome the problems with the full Bayesian committee machine, the expressions can be simplified further to give a much faster calculation time at the cost of a worse approximation. Instead of considering the full covariance matrix, only the variance of each prediction is considered. This is achieved by querying each test point separately instead of jointly as in the standard formulation of the Bayesian committee machine.

A one-dimensional example of the Bayesian committee machine can be seen in Example 4.1. The example shows both the full and the slim version and compares



the results with an ordinary gp with all the data in a single gp.

4.1 Example: Bayesian Committee Machine
The Bayesian committee machine can be used to fuse results from Gaussian process regression where the different gps have unique measurements. To illustrate this, data generated by a Gaussian process have been sampled and split into two distinct partitions. The data sets can be seen in Figure 4.2. The predicted means

[Figure 4.2 appears here: the sampled data, colored according to which of the two gps (GP 1 and GP 2) it is assigned to; see the caption below.]

Figure 4.2: Data used for regression. The data is split in two parts according to the color in the graph.

and variances together with the data points can be seen in Figure 4.3. The mean is represented as a line and the variance as an interval of plus and minus two standard deviations. The variance is low in regions with data available and high outside, where the only information comes from the prior. Fusing the two data

[Figure 4.3 appears here: two panels, GP 1 and GP 2, each showing the gp's predicted mean and variance together with its data points; see the caption below.]

Figure 4.3: The predicted means and variances of the two gps together with the data points. The means are shown as lines and the variances are shown as the shaded areas behind the means, corresponding to plus and minus two standard deviations. The variances are low where data is available, but the only information in regions without data comes from the prior.

sets has been done in three ways. Firstly, all data has been used in a single gp.



This should be considered the goal for the other two methods to achieve. Secondly, the full bcm has been used, and lastly the slim bcm. The predicted means and variances can be seen in Figure 4.4. The difference between the full bcm and the single gp is negligible, with a mean square error in the mean of approximately 5.55 · 10⁻¹³. The slim bcm gives a slightly different result, as can be seen around x = 0. The mean square error in the mean is 7.19 · 10⁻⁴, which is significantly higher than for the full bcm but still usable.

[Figure 4.4 appears here: three panels, "Single GP with all data", "Full BCM" and "Slim BCM", showing the fused predicted means and variances; see the caption below.]

Figure 4.4: The predicted mean and variance after fusing the data by putting all data in a single gp to the left, using the full bcm in the middle and the slim bcm to the right. The difference between the full bcm and the single gp is negligible. The slim bcm gives a slightly different result, which can be seen visibly around x = 0, but the result is still close to the single gp.

4.5 Robot Operating System

The Robot Operating System, ros, has made it possible to implement and evaluate the mapping method in both on-line and off-line applications. ros, made by Quigley et al. [2009], is an open source software framework for robot software development. It was originally developed in 2007 at Stanford University, but since 2008 development continues primarily at Willow Garage, a robotics research institute. ros provides a common framework for people working in robotics, making it easy to share code with others. ros works on a number of different robot platforms, and many people are involved in developing drivers for more hardware and new platforms. Before ros, every team working with robots had to create their own software environment, making the exchange of software cumbersome.


5 Experiments and Results

The previous chapters have discussed theory, motivated design choices and shown illustrative examples. In this chapter, examples are turned into results illustrating the performance of the implementations made. For structural reasons the chapter is split into separate sections for each of the previous chapters, with the mapping section being the most relevant one.

5.1 GP results

The foundation for the new mapping method is the theory of Gaussian processes. In this section, results related to gp classification are presented; two different binary classification methods are compared, and the efficient multi-class classification method is evaluated.

5.1.1 Classification

Two different methods for performing binary classification have been described: probabilistic least squares classification and Laplace's approximation method. To best study their abilities, they will be compared using a fairly simple, one-dimensional test case.

Comparison between Least squares classification and Laplace approximation

Least squares classification and Laplace's approximation method are evaluated by comparing their ability to estimate a given predictive probability. In this case the predictive probability is the (normalized) sum of three Gaussian PDFs, two belonging to the same class C1, and one belonging to a second class C2:

p(y = C_1 \mid x) = \frac{p_1(x) + p_3(x)}{p_1(x) + p_2(x) + p_3(x)}, \qquad \{p_1, p_3\} \in C_1, \; \{p_2\} \in C_2.

The same number of samples are



Figure 5.1: Samples drawn from three normal distributions with means at −3, 2 and 4, and variances 0.5, 0.6 and 0.8 respectively. The first and third density functions belong to the same class C1, while the second belongs to class C2. This is visualized by letting samples from C1 have the value 1, while samples from C2 have the value 0. Also, the true predictive probability is drawn as a dotted line.

then drawn from each of the Gaussian distributions, as can be seen in Figure 5.1. There the samples are plotted together with the density functions and the true predictive probability for class C1, where p1 ∼ N(−3, 0.5), p2 ∼ N(2, 0.6) and p3 ∼ N(4, 0.8).
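For reference, the true class probability defined above is straightforward to evaluate directly from the three densities; the sketch below uses the means and variances listed in the caption of Figure 5.1, and the function names are illustrative only.

#include <cmath>

// Normal probability density function with the given mean and variance.
double normalPdf(double x, double mean, double variance) {
  const double pi = 3.14159265358979323846;
  const double d = x - mean;
  return std::exp(-0.5 * d * d / variance) / std::sqrt(2.0 * pi * variance);
}

// True predictive probability p(y = C1 | x) for the test case:
// p1 ~ N(-3, 0.5) and p3 ~ N(4, 0.8) belong to C1, p2 ~ N(2, 0.6) to C2.
double trueClassOneProbability(double x) {
  const double p1 = normalPdf(x, -3.0, 0.5);
  const double p2 = normalPdf(x,  2.0, 0.6);
  const double p3 = normalPdf(x,  4.0, 0.8);
  return (p1 + p3) / (p1 + p2 + p3);
}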

The predictive probability for class C1 is estimated using least squares classification and Laplace's approximation method, as seen in Figure 5.2. The estimates are plotted together with the true predictive probability.

Using the same covariance hyperparameters for both classification methods, it is clear that the least squares classification outperforms Laplace's approximation method almost everywhere. The Laplace method gives a smooth predictive probability, which is perhaps too smooth, deviating from the true probability especially where it shifts quickly. One disadvantage of least squares classification can be seen at the shift from p2 to p3 around input value 2.4, where its prediction is a bit too fast. Even so, this should not belittle its otherwise strong performance.

If there is a choice between the two methods, least squares classification is most probably the better choice.
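For completeness, the sketch below illustrates the kind of computation a probabilistic least squares classifier performs at a test point: the GP regression posterior is squashed through a cumulative Gaussian with a scale α and an offset β learned by cross validation, in the spirit of Rasmussen and Williams [2005]. The formula and names are given as an illustration and are not claimed to be the exact implementation used here.

#include <cmath>

// Standard normal cumulative distribution function.
double normalCdf(double z) {
  return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// Squash the GP regression posterior at a test point (mean mu, variance var)
// through a probit with learned scale alpha and offset beta to obtain the
// probability of the positive class.
double leastSquaresClassProbability(double mu, double var,
                                    double alpha, double beta) {
  return normalCdf((alpha * mu + beta) / std::sqrt(1.0 + alpha * alpha * var));
}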



Figure 5.2: The true predictive probability plotted together with probabilities approximated using least squares classification and Laplace's approximation method.

5.1.2 Multi-Class Classification

The most desired property of the efficient multi-class classification method described in Section 2.7 is its ability to infer different objects in different ways. That is, if part of a chair has been seen, and part of a tree right next to it, they should be inferred differently and related to their real world appearance. To test this property, two different object classes are used together with a ground class. The objects are rings and circles, and they are uniformly sampled with labels, except for one ring and one circle that are not sampled in the middle, as seen in Figure 5.3. The question is whether or not the method manages to, by using different covariance parameters for the different classes, classify the un-sampled area inside the circle as “circle class”, and the un-sampled area inside the ring as “ground class”.

A single squared exponential covariance function is chosen for each class. With three hyperparameters each, and thus nine parameters in total, the optimal parameters are found by locating an optimum on a nine-dimensional surface, which is difficult, not to say impossible. The optimization techniques implemented often fail, or get stuck in bad local optima.

The results seen in Figure 5.4 were thus found by alternately optimizing and manually tuning the hyperparameters. Since they clearly cannot be guaranteed to be optimal in any sense, the method's true potential still remains undetermined. The rings are clearly the most problematic, the trade-off lying between inferring enough ring in the area where the whole ring is sampled, and not inferring too



Figure 5.3: Two circles (green) and two rings (blue) are uniformly sampled with class labels, except for an area inside one circle and an area inside one ring that contain no samples. The area also contains a ground class, sampled in gray.

much ring in the un-sampled area inside the second ring. The way the different covariances are coupled also gave an unintuitive behavior when the parameters were manually tuned; changing the length-scale parameters for one class affected all classes in the whole area, and not only the surroundings of the samples from that class.

Though the example fails to say whether or not the efficient multi-class classification method is usable in the way previously mentioned, one thing is clear: if such great difficulties arise in the simplest case (only one covariance function instead of sums of such, and only three classes), how would it be possible to find good hyperparameters in other, more complex cases? Since the coupling between classes makes such an impact, training each class's covariance function separately is not a likely solution to the problem.

For a more thorough evaluation and discussion of the method, see Reid [2011].

5.2 Library results

The library has two main requirements: first of all it needs to be flexible, and second, it needs to be fast. The flexibility is of no issue per se; the C++ library does what a C++ library does, and even though the layout could have been made in a different way, no shadow should fall upon it. What is objectively testable, though, is how fast it is. By comparing the library with a well made Matlab



Figure 5.4: Classified output; green areas belong to circles while blue areas belong to rings. While the green circles are fairly well classified, the rings are clearly not. Good imagination is needed to interpret the blue areas as rings without the help of the class borders (drawn as lines).

library, some notion of how fast it is could be achieved.

The results were unambiguous and clear: on average, the implemented C++ library was outperformed by the Matlab library by a factor of six in terms of speed. Simple tasks were compared, like regression and classification, and for all tasks the results were practically the same. When it comes to pure matrix operations, Matlab is simply much faster than the matrix operation library utilized.

5.3 Mapping results

The most central part of the thesis is the method for continuous occupancy mapping using Gaussian processes. This section shows results obtained when running the algorithm on both simulated and real data. The section first describes the results when using simulated data and compares them with the ground truth. Then the results from using real world data are presented and compared with the old occupancy grid method.

For both the simulated and the real world data, the same covariance function has been used. The covariance function and its hyperparameters were chosen to be the same as used by O'Callaghan and Ramos [2011]. They used a sum of three Matérn 3 covariance functions and optimized the parameters on data from environments similar to those used here. A slight modification was made: here a sum of only two Matérn 3 functions was used to shorten the computation time. The Matérn 3 with the intermediate length-scale was removed.


Sum of       Length-scale   Variance
Matérn 3     1.1524         2.3134
Matérn 3     1.7172         1.4413

Table 5.1: Covariance function and hyperparameters used for occupancy mapping.

The same length-scale was used in both the x and y directions, and the values can be seen in Table 5.1. For the classifier, the two parameters α and β were set to 566.604 and −111.811; they were also optimized in O'Callaghan and Ramos [2011] using leave-one-out cross validation.
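To make the parameterization concrete, the sketch below evaluates a covariance of the form used here, a sum of two Matérn 3 terms (cf. (B.7) in Appendix B) with the hyperparameters of Table 5.1, under the assumptions that the length-scale enters as Σ = I/ℓ² and that the variance column holds the signal variance σ²; the function names are illustrative.

#include <cmath>
#include <Eigen/Dense>

// One Matérn 3 term: sigma2 * (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l),
// where r is the Euclidean distance between the points and l the isotropic
// length-scale (i.e. Sigma = I / l^2 in the notation of Appendix B).
double matern3(const Eigen::Vector2d& x, const Eigen::Vector2d& xp,
               double lengthScale, double signalVariance) {
  const double s = std::sqrt(3.0) * (x - xp).norm() / lengthScale;
  return signalVariance * (1.0 + s) * std::exp(-s);
}

// Covariance used for the occupancy maps: a sum of two Matérn 3 terms with
// the hyperparameters of Table 5.1.
double mappingCovariance(const Eigen::Vector2d& x, const Eigen::Vector2d& xp) {
  return matern3(x, xp, 1.1524, 2.3134) + matern3(x, xp, 1.7172, 1.4413);
}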

5.3.1 Simulated Data

The mapping algorithm takes as input the start and end points of each laser scan together with corresponding labels, −1 for free space and +1 for occupied regions. To generate the measurements, a room was created in matlab together with a trajectory for a mobile robot, as seen in Figure 5.5. 50 points were spaced equally along the trajectory where laser scan measurements were simulated. Each of the laser scans consisted of 6 laser beams in the robot's forward direction, ranging from the left of the robot to the right. This gave a total number of 600 measurements, counting the free line measurements and the points of return as separate measurements. All the measurements, together with the room and the trajectory, can be seen in Figure 5.6.
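A minimal sketch of how one such simulated scan can be turned into training data is given below: every beam contributes its free-space segment with label −1 and, if it ended in a return, its end point with label +1. The data structures and names are illustrative and not those of the actual implementation.

#include <vector>
#include <Eigen/Dense>

// One simulated laser beam from a known pose: start point, end point and
// whether the beam ended in a return or reached its maximum range.
struct Beam {
  Eigen::Vector2d start;
  Eigen::Vector2d end;
  bool hit;
};

// One training example for the mapping algorithm: a free-space line segment
// labelled -1 or an occupied end point labelled +1.
struct Measurement {
  Eigen::Vector2d start;
  Eigen::Vector2d end;   // equal to start for point measurements
  double label;
};

std::vector<Measurement> beamsToMeasurements(const std::vector<Beam>& beams) {
  std::vector<Measurement> out;
  for (const Beam& b : beams) {
    out.push_back({b.start, b.end, -1.0});   // free space along the beam
    if (b.hit) {
      out.push_back({b.end, b.end, +1.0});   // occupied point at the return
    }
  }
  return out;
}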

All the measurements were fed to the mapping algorithm, adding the measurements sequentially with active sampling enabled. The threshold for the active sampling was set to 2, which reduced the number of measurements by approximately 50 %, from the original 600 to a total of 275 beams and points. The kept measurements can be seen in Figure 5.7.
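The rejection test relies on the Kullback-Leibler divergence between two univariate Gaussian distributions. A minimal sketch is given below, assuming, in the spirit of Smith et al. [2010], that a candidate measurement is kept only when it changes the predictive distribution at its own location by more than the threshold (2 in the experiment above); the exact criterion of the implementation may differ and the names are illustrative.

#include <cmath>

// Kullback-Leibler divergence KL( N(mu0, var0) || N(mu1, var1) ) between two
// univariate Gaussian distributions.
double klDivergence(double mu0, double var0, double mu1, double var1) {
  return 0.5 * (std::log(var1 / var0)
                + (var0 + (mu0 - mu1) * (mu0 - mu1)) / var1 - 1.0);
}

// Assumed rejection rule: keep a candidate measurement only if it changes the
// predictive distribution at its location by more than the threshold.
bool keepMeasurement(double muBefore, double varBefore,
                     double muAfter, double varAfter, double threshold) {
  return klDivergence(muAfter, varAfter, muBefore, varBefore) > threshold;
}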

The resulting map is shown in Figure 5.8. Occupied areas are shown in red, free space regions in blue and uncertain areas in green. Each point in the map has been created by querying the map over an equally spaced grid with 200 points in each dimension. The points have then been classified as either occupied or free if the probability of the class label exceeded 90 %, and otherwise as uncertain. In the figure, the class label of each point represents the occupancy for the whole area of the pixel centred at that point. The outlines of the walls and obstacles in the room can be seen overlaid in black.
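The classification of the query points can be summarized by the small helper below (illustrative only): a point is called occupied or free only when the corresponding class probability exceeds the 90 % threshold, and uncertain otherwise.

// Three-way classification of a query point from the predicted probability
// of it being occupied, using a 90 % confidence threshold.
enum class CellState { Free, Occupied, Uncertain };

CellState classifyCell(double pOccupied, double threshold = 0.9) {
  if (pOccupied >= threshold) return CellState::Occupied;
  if (1.0 - pOccupied >= threshold) return CellState::Free;
  return CellState::Uncertain;
}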

As can be seen, the algorithm correctly classified most of the occupied space as occupied or uncertain. The only occupied space it really missed was the lower left circle between the two rectangles in the middle right of the figure. Studying the scans taken in Figure 5.6 reveals that no scans were taken from the circle. For free space regions, the classifier's performance was worse. Regions close to occupied points are also classified as occupied despite the free line measurements.

In addition to the classified map, the mapping algorithm also gives an estimate of



Figure 5.5: A top down view of the virtual room used to create measurements for the simulated occupancy mapping experiment, together with the trajectory of the mobile robot. The walls of the room and the boundaries of the objects in the room are shown in black and the trajectory is shown in blue.

the uncertainty in the map. Figure 5.9 shows the variance of the estimate at each point in the map. Dark red is high variance and dark blue is low variance. No measurements could be obtained from the upper left corner of the room and, as seen in the figure, this is also where the variance is the highest. Inspection of the laser beams in Figure 5.7 also shows that the measurements in the lower left part of the map are more sparse than in the rest of the map. This is also reflected in the higher variance of that part compared to parts where measurements were more densely gathered.

5.3.2 Real World Data

When generating maps with the continuous occupancy mapping algorithm described in the thesis, the location of the mobile platform is assumed to be known. To be able to use the algorithm on real world data, a map was first created using the package gmapping, which is delivered as part of the standard ros distribution. This gives a map built with the conventional occupancy grid approach. Using this map, the same data was then used off-line to localize the robot with the amcl package in ros. Using this localization together with the original laser scans, a new map was built on top of the old one.

The data was collected using a SICK laser scanner mounted on a Pioneer 3-AT mobile robot. The laser scanner is capable of giving 181 equally spaced laser beams with a 190° field of view at a rate of up to 100 Hz. This amount of data



Figure 5.6: All 600 measurements captured by the simulated mobile robot. The free space laser beams are shown in green and the points of laser returns are shown in red. Also shown in the picture is the trajectory of the robot together with the points where the simulated measurements were taken.

is far too much for the occupancy mapping algorithm to handle. Instead, only 6 of the 181 laser beams were used, and new beams were added to the map only when the robot had moved at least 25 cm from the point where the last scan was added. Also, active sampling was used with a threshold set to 2, just as with the simulated data.
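The decimation of the scan stream can be expressed as two simple gates, sketched below with illustrative names: only a fixed subset of the beams of each scan is used, and a new scan is accepted only once the robot has moved far enough from the pose of the last accepted scan.

#include <vector>
#include <Eigen/Dense>

// Pick an evenly spaced subset of the beam indices, e.g. 6 of the 181 beams.
std::vector<int> selectBeams(int numBeams, int numKept) {
  std::vector<int> indices;
  const int stride = numBeams / numKept;
  for (int i = 0; i < numKept; ++i) {
    indices.push_back(i * stride);
  }
  return indices;
}

// Accept a new scan only if the robot has moved at least minDistance
// (0.25 m above) since the last accepted scan.
bool acceptScan(const Eigen::Vector2d& position, Eigen::Vector2d* lastAccepted,
                double minDistance) {
  if ((position - *lastAccepted).norm() < minDistance) {
    return false;
  }
  *lastAccepted = position;
  return true;
}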

During the data collection, the robot was driven manually around an open plan office at the University of Sydney.

Two different continuous occupancy maps were created inside ros. The first used a higher resolution grid for the query points than the second. The higher resolution map could not be generated in real time; the data was played back slowly enough that the mapping algorithm was able to keep up. When splitting the data into different local maps, a new map was created when the robot had moved 3 meters from the center of the current map, and each map was of size 10 × 10 meters. Each local map consisted of a grid of 100 × 100 query points, giving a resolution of 10 × 10 cm per pixel. The whole map can be seen overlaid on top of the occupancy grid map in Figure 5.10, and a zoomed in view of both maps can be seen in Figure 5.11.

To build maps online, more simplifications have to be made. The most time consuming part of the high resolution map is the calculation of the covariance for the query points. By reducing the size of each local map to 25 × 25 query points,



Figure 5.7: The measurements kept after the active sampling algorithm removed the unnecessary ones. The threshold for the change in the kl divergence was set to 2, which resulted in a reduction by approximately 50 %, from 600 measurements down to 275.

but keeping the same size of 10 × 10 meters, online generation was possible. The resulting map can be seen in Figure 5.12, with a close up and comparison with the occupancy grid map in Figure 5.13. Occupied and free space regions were classified as such if the probability of the respective label was higher than 90 %.

5.3.3 Active sampling

A data set is inferred and classified using active sampling with three different rejection thresholds. The classified outputs are then compared to the classified output obtained when no active sampling is used; ideally they would look the same. Results can be seen in Figure 5.14, where an ROC curve has been used as a qualitative measure of prediction performance. The threshold levels used, together with how many points were kept for the respective thresholds, can be seen in Table 5.2. When calculating the ROC, an output predictive threshold is applied to the predictive output to decide above which level the function is said to belong to the class.
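The ROC curves in Figure 5.14(d) can be computed directly from the predictive probabilities and the known labels by sweeping the output predictive threshold; the sketch below (illustrative names, both classes assumed present in the data) records the resulting true and false positive rates.

#include <cstddef>
#include <utility>
#include <vector>

// Sweep the output predictive threshold and return (false positive rate,
// true positive rate) pairs. probs[i] is the predicted probability of the
// positive class for sample i and labels[i] is true for positive samples.
std::vector<std::pair<double, double>> rocCurve(
    const std::vector<double>& probs, const std::vector<bool>& labels,
    int numThresholds = 100) {
  std::vector<std::pair<double, double>> roc;
  for (int t = 0; t <= numThresholds; ++t) {
    const double threshold = static_cast<double>(t) / numThresholds;
    int tp = 0, fp = 0, tn = 0, fn = 0;
    for (std::size_t i = 0; i < probs.size(); ++i) {
      const bool positive = probs[i] >= threshold;
      if (positive && labels[i])        ++tp;
      else if (positive && !labels[i])  ++fp;
      else if (!positive && labels[i])  ++fn;
      else                              ++tn;
    }
    roc.push_back({double(fp) / (fp + tn), double(tp) / (tp + fn)});
  }
  return roc;
}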

It is clear that the higher the threshold, the bigger the difference between classifiers built from actively sampled and un-sampled data. Though, contrary to what might be expected, the middle threshold, with the next to smallest number of points, is clearly outperformed by all other classifiers, including the one with the smallest number of points. This is likely due to the characteristics of the underlying function. If studied more closely, the shape of classifier (b) is actually the sharpest one,


Figure 5.8: Map generated with the algorithm. Free space is shown as blue, occupied space as red and unknown points as green. On top of the generated map, the ground truth is shown in black.

              Number of data kept   Threshold value
All points    40                    -
Threshold 1   31                    0.5
Threshold 2   21                    1.1
Threshold 3   17                    1.7

Table 5.2: The total number of data points used, and how many of these were kept when using active sampling with the specified thresholds.

which can be the result of squeezing a function with a large swing. Moreover, it is also clearly shifted in comparison with the “true” output probability function. It will thus give a false positive rate almost immediately, and sustain it almost no matter which output predictive threshold is used.


Figure 5.9: Variance for the whole map. High variance is shown in dark red and low variance in dark blue. The variance is highest in the upper left corner, where no measurements were taken. The lower right part of the map also has a higher variance than most of the map. Studying Figure 5.7 shows that the beams in that part are more sparse than in the rest of the map.


Figure 5.10: The continuous occupancy map overlaid on top of the occupancy grid map. Each pixel in the continuous occupancy map represents 10 × 10 cm in the real world. Red areas are classified as occupied and blue regions as free space. Green areas are uncertain. A pixel has been classified as either occupied or free if the probability of the class label was higher than 70 %.


(a) Occupancy grid map

(b) Continuous occupancy map

Figure 5.11: A zoomed in view of the occupancy grid map and the continuous occupancy map over the same region.


Figure 5.12: Continuous occupancy map created using a lower resolution grid for the query points. This allowed the mapping algorithm to process the collected data in real time.


(a) Occupancy grid map

(b) Continuous occupancy map

Figure 5.13: Close up comparison between the continuous occupancy map and the occupancy grid map.



Figure 5.14: The predictive probability for a data set that has been sampled with three differently thresholded active samplers, together with an ROC plot for each predictive probability. (a) The predictive probability when active sampling with the lowest threshold has been used. (b) The predictive probability for the middle threshold. (c) The predictive probability for the highest threshold. (d) The ROC curves for the different classifications; the closer to the upper left corner the better, meaning that a high true positive detection rate can be achieved without falsely detecting positive output.


6 Conclusions

To summarize the thesis work, this chapter discusses the main objectives and the achieved results. A brief summary with key conclusions is also given.

6.1 Brief summary

For the new method to definitely outperform the old occupancy grid methods, it needs to be accurate, but also fast and easy to use. The accuracy and speed issues are very closely related, and it has been shown that one issue could only be solved at the expense of the other; with sufficient speed the accuracy was too low, and with high accuracy, the speed was too low. The C++ library where the gp framework is implemented can be used in any C++ project and relies mostly on other open-source projects. Consideration has been taken to allow user extensions to the library, and many example programs have been created to describe how different parts of the code are to be used.

6.2 Occupancy mapping

The main objective of the thesis has been to implement the novel method for continuous occupancy mapping in a way that was easy to distribute and for others to use. For this reason, a separation was made between the gp part of the method, which was implemented in a C++ library, and the occupancy mapping part, which was implemented in ros.

Occupancy mapping is something one often wants to do as the robot operates in real time, and it therefore has to be fast. To make it fast, numerous techniques


were used (storing the Cholesky factorization of the covariance matrix, active sampling, using multiple threads, etc.). Even so, when running online, the number of scans obtained every second is just too large for all of them to be used. Also, querying the occupancy at many points worsens the situation and reduces the speed greatly. In spite of all simplifications implemented, to actually be able to run online, the number of scans had to be severely reduced. The number of query points had to be reduced to form such a coarse grid that the usability of the map can be questioned. By doing this, the method of performing occupancy mapping using gps could finally be used in an online setting. With all the simplifications needed for online operation, many of the benefits of the continuous map are lost.

One of the benefits of the new method is that it gives a natural way to incorporate spatial correlation between measured points. This is utilized when inferring occupancy at points where no measurements have been taken, making good estimates possible in areas with sparse measurements. In occupancy grid methods this has by some been done by beam widening, i.e. pretending the beam is more widespread than it really is, so that it affects more grid points. In gps, this correlation is determined by the covariance function. Ideally one would like a covariance function that is specially tailored for the area to be mapped. Hyperparameters can be learned from data, but one still has to choose which covariance function one wishes to optimize. Seeing how a specific choice of covariance function affects the resulting map is hard. In this thesis, one set of hyperparameters was optimized on one set of data and later used for the other data sets.

As can be seen in the mapping results, in an online setting where so many scans have to be neglected when using the new method, it is clearly outperformed by the common occupancy grid method in terms of visual performance. The latter can be run with higher resolution and use all scans, whereas the former has lower resolution and most of the scans are thrown away and not used. The new method would most likely perform better than the old occupancy grid method with very sparse measurements. Modern laser scanners, however, can deliver a huge number of scans every second, making it questionable whether or not it is reasonable to use the gp method in its current form.

The speed of a gp library is of importance. As shown in Section 5.2, the library is not as fast as existing implementations for matlab. matlab wins mostly because it can handle matrix computations on multiple cores. The Eigen library used for matrix operations currently only supports a single core, and as such it will be slower on pure matrix operations such as matrix multiplication and matrix inversion. Eigen is developing fast, and future releases might very well incorporate multithreading support.

6.3 Future work

The greatest drawback of the gp mapping method is its computational complexity. Finding good simplifications to reduce the amount of computation needed is therefore an important task if one wishes to do further studies.


The computationally most expensive operations are the inversion of the data covariance matrix, the calculation of the covariance when having many test points, and the integral kernel.

Finding a new covariance function with an analytical expression for the covariance between two lines and the covariance between a line and a point would speed up both the integral kernel and the cost of having many test points.

An ideal covariance function would also be sparse. A sparse covariance function takes the value exactly zero for certain pairs of measurements and as such leaves zeros in the covariance matrix. This could be used to reduce the calculations needed for inversion of the data covariance matrix. It also seems natural to assume that the occupancy in one room should not affect the occupancy in the room next to it, unless all rooms look exactly the same.

Another interesting topic for further study is the use of other probability distributions. gps are pleasant to use as many computations become analytically tractable, as has been shown in previous chapters. Nonetheless, there is nothing that says that Gaussian distributions give the most accurate estimates when doing Bayesian inference; even a simple, one-dimensional scan does not give the probability function one might expect. To overcome this, one tries to build good covariance functions that approximate typical wall behavior, or whatever behavior one is trying to infer. Instead, it could be interesting to move away from Gaussian distributions and try to find distributions better suited for actual occupancy mapping. The sensor model also needs some revisiting; is it really plausible that the Gaussian noise term is related to the label (which is binary), and not to the location of the scan?

Finally, it is important to say that Gaussian processes are not magic. Combining their lack of magic properties with their computational complexity suggests that applying them to anything and everything is perhaps not such a good idea. Or, as a New South Welsh landlady once said about the outdoor lamp: “Use, but don’t abuse”.


A Multivariate Gaussian Distribution

The multivariate Gaussian distribution has joint probability density function

p(x \mid \mu, \Sigma) = \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right),    (A.1)

where µ ∈ R^D is the mean vector and Σ ∈ R^{D×D} is the positive definite covariance matrix.

If x and y are jointly Gaussian random vectors,

\begin{bmatrix} x \\ y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix} \right),    (A.2)

the marginal distributions of x and y are given by

x \sim \mathcal{N}(\mu_x, \Sigma_{xx}), \qquad y \sim \mathcal{N}(\mu_y, \Sigma_{yy}),    (A.3)

and the distribution of x given the value of y is given by

x \mid y \sim \mathcal{N}\left( \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y),\; \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} \right).    (A.4)

The product of two Gaussian distributions gives another (un-normalized) Gaussian distribution,

\mathcal{N}(x \mid \mu_1, \Sigma_1) \cdot \mathcal{N}(x \mid \mu_2, \Sigma_2) = \mathcal{N}(x \mid \mu, \Sigma)\, \mathcal{N}(\mu_1 \mid \mu_2, \Sigma_1 + \Sigma_2),    (A.5)

where \Sigma = (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1} and \mu = \Sigma(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2).


B Covariance functions and their derivatives

This appendix provides the equations for all covariance functions implemented in the library, together with their derivatives with respect to each hyperparameter. The hyperparameters are denoted either by a small letter σ for scalar hyperparameters or by a bold capital Σ for a matrix of hyperparameters. A single element of the matrix Σ is denoted Σ_ij for the element in the i:th row and the j:th column.

B.1 Linear kernel

The linear covariance function evaluated between two points x, x′ ∈ R^d can be written as

k(x, x') = \sigma^2 + x^\top \Sigma x',    (B.1)

where σ² is the signal variance and Σ ∈ R^{d×d} a positive definite matrix.

The derivative with respect to an element Σ_ij of Σ becomes

\frac{\partial k(x, x')}{\partial \Sigma_{ij}} = x^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} x',    (B.2)

and with respect to σ

\frac{\partial k(x, x')}{\partial \sigma} = 2\sigma.    (B.3)


B.2 Matérn 1 kernel

The Matérn 1 covariance function, also called the Laplace covariance function, evaluated between two points x, x′ ∈ R^d can be expressed as

k(x, x') = \sigma^2 \exp\left( -\sqrt{(x - x')^\top \Sigma (x - x')} \right),    (B.4)

where σ² is the signal variance and Σ ∈ R^{d×d} a positive definite matrix.

The derivative with respect to an element Σ_ij of Σ becomes

\frac{\partial k(x, x')}{\partial \Sigma_{ij}} = \sigma^2 \exp\left( -\sqrt{(x - x')^\top \Sigma (x - x')} \right) \frac{\partial}{\partial \Sigma_{ij}} \left( -\sqrt{(x - x')^\top \Sigma (x - x')} \right)

= -\frac{1}{2} \frac{(x - x')^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} (x - x')}{\sqrt{(x - x')^\top \Sigma (x - x')}} \, \sigma^2 \exp\left( -\sqrt{(x - x')^\top \Sigma (x - x')} \right),    (B.5)

and with respect to σ

\frac{\partial k(x, x')}{\partial \sigma} = 2\sigma \exp\left( -\sqrt{(x - x')^\top \Sigma (x - x')} \right).    (B.6)

B.3 Matérn 3 kernel

The Matérn 3 covariance function evaluated between two points x, x′ ∈ R^d can be written as

k(x, x') = \sigma^2 \left( 1 + \sqrt{3 (x - x')^\top \Sigma (x - x')} \right) \exp\left( -\sqrt{3 (x - x')^\top \Sigma (x - x')} \right),    (B.7)

where σ² is the signal variance and Σ ∈ R^{d×d} a positive definite matrix.


The derivative with respect to a parameter Σ_ij in Σ becomes

\frac{\partial k(x, x')}{\partial \Sigma_{ij}} = \sigma^2 \left[ \frac{\partial \sqrt{3 (x - x')^\top \Sigma (x - x')}}{\partial \Sigma_{ij}} \exp\left( -\sqrt{3 (x - x')^\top \Sigma (x - x')} \right) + \left( 1 + \sqrt{3 (x - x')^\top \Sigma (x - x')} \right) \frac{\partial}{\partial \Sigma_{ij}} \exp\left( -\sqrt{3 (x - x')^\top \Sigma (x - x')} \right) \right]

= \sigma^2 \frac{\partial \sqrt{3 (x - x')^\top \Sigma (x - x')}}{\partial \Sigma_{ij}} \left[ 1 - 1 - \sqrt{3 (x - x')^\top \Sigma (x - x')} \right] \exp\left( -\sqrt{3 (x - x')^\top \Sigma (x - x')} \right)

= -\sigma^2 \frac{3}{2} \frac{(x - x')^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} (x - x')}{\sqrt{3 (x - x')^\top \Sigma (x - x')}} \sqrt{3 (x - x')^\top \Sigma (x - x')} \exp\left( -\sqrt{3 (x - x')^\top \Sigma (x - x')} \right)

= -\frac{3}{2} \sigma^2 (x - x')^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} (x - x') \exp\left( -\sqrt{3 (x - x')^\top \Sigma (x - x')} \right).    (B.8)

The derivative with respect to σ becomes

\frac{\partial k(x, x')}{\partial \sigma} = 2\sigma \left( 1 + \sqrt{3 (x - x')^\top \Sigma (x - x')} \right) \exp\left( -\sqrt{3 (x - x')^\top \Sigma (x - x')} \right).    (B.9)


B.4 Matérn 5 kernel

The Matérn 5 covariance function evaluated between two points x, x′ ∈ R^d can be expressed as

k(x, x') = \sigma^2 \left( 1 + \sqrt{5 (x - x')^\top \Sigma (x - x')} + \frac{5}{3} (x - x')^\top \Sigma (x - x') \right) \exp\left( -\sqrt{5 (x - x')^\top \Sigma (x - x')} \right),    (B.10)

where σ² is the signal variance and Σ ∈ R^{d×d} a positive definite matrix.

The derivative with respect to an element Σ_ij in Σ becomes

\frac{\partial k(x, x')}{\partial \Sigma_{ij}} = \sigma^2 \left( \frac{\partial \sqrt{5 (x - x')^\top \Sigma (x - x')}}{\partial \Sigma_{ij}} + \frac{5}{3} (x - x')^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} (x - x') \right) \exp\left( -\sqrt{5 (x - x')^\top \Sigma (x - x')} \right)

\qquad - \sigma^2 \left( 1 + \sqrt{5 (x - x')^\top \Sigma (x - x')} + \frac{5}{3} (x - x')^\top \Sigma (x - x') \right) \frac{\partial \sqrt{5 (x - x')^\top \Sigma (x - x')}}{\partial \Sigma_{ij}} \exp\left( -\sqrt{5 (x - x')^\top \Sigma (x - x')} \right)

= -\sigma^2 \frac{5}{6} (x - x')^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} (x - x') \left( 1 + \sqrt{5 (x - x')^\top \Sigma (x - x')} \right) \exp\left( -\sqrt{5 (x - x')^\top \Sigma (x - x')} \right),    (B.11)

where the last equality uses \frac{\partial \sqrt{5 (x - x')^\top \Sigma (x - x')}}{\partial \Sigma_{ij}} = \frac{5 (x - x')^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} (x - x')}{2 \sqrt{5 (x - x')^\top \Sigma (x - x')}}.

Differentiating with respect to σ yields

\frac{\partial k(x, x')}{\partial \sigma} = 2\sigma \left( 1 + \sqrt{5 (x - x')^\top \Sigma (x - x')} + \frac{5}{3} (x - x')^\top \Sigma (x - x') \right) \exp\left( -\sqrt{5 (x - x')^\top \Sigma (x - x')} \right).    (B.12)

B.5 Neural Network kernel

The neural network covariance function evaluated between two points x, x′ ∈ R^d is given by

k(x, x') = \sigma^2 \arcsin\left( \frac{2 \tilde{x}^\top \Sigma \tilde{x}'}{\sqrt{(1 + 2 \tilde{x}^\top \Sigma \tilde{x})(1 + 2 \tilde{x}'^\top \Sigma \tilde{x}')}} \right),    (B.13)

where \tilde{x} = [1, x^\top]^\top, \tilde{x}' = [1, x'^\top]^\top, σ² is the signal variance and Σ ∈ R^{(d+1)×(d+1)} a positive definite matrix.


The derivative with respect to an element Σ_ij in Σ becomes

\frac{\partial k(x, x')}{\partial \Sigma_{ij}} = \frac{\sigma^2}{\sqrt{1 - \frac{(2\tilde{x}^\top \Sigma \tilde{x}')^2}{a}}} \, \frac{\partial}{\partial \Sigma_{ij}} \left( \frac{2\tilde{x}^\top \Sigma \tilde{x}'}{\sqrt{a}} \right)

= \frac{2\sigma^2}{\sqrt{a - (2\tilde{x}^\top \Sigma \tilde{x}')^2}} \Bigg[ \tilde{x}^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} \tilde{x}' \; -    (B.14)

\qquad\qquad - \; \frac{\tilde{x}^\top \Sigma \tilde{x}'}{a} \left( \tilde{x}^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} \tilde{x} \, (1 + 2\tilde{x}'^\top \Sigma \tilde{x}') + (1 + 2\tilde{x}^\top \Sigma \tilde{x}) \, \tilde{x}'^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} \tilde{x}' \right) \Bigg],    (B.15)

where a = (1 + 2\tilde{x}^\top \Sigma \tilde{x})(1 + 2\tilde{x}'^\top \Sigma \tilde{x}'), and

\frac{\partial k(x, x')}{\partial \sigma} = 2\sigma \arcsin\left( \frac{2\tilde{x}^\top \Sigma \tilde{x}'}{\sqrt{(1 + 2\tilde{x}^\top \Sigma \tilde{x})(1 + 2\tilde{x}'^\top \Sigma \tilde{x}')}} \right).    (B.16)

B.6 Polynomial kernel

The polynomial covariance function evaluated between two points x, x′ ∈ R^d can be written as

k(x, x') = \left( \sigma^2 + x^\top \Sigma x' \right)^p,    (B.17)

where σ is the standard deviation parameter, Σ ∈ R^{d×d} a positive definite matrix and p the degree of the polynomial. The name of the covariance function comes from the fact that it can be derived by performing Bayesian regression to fit a polynomial of degree p. Ergo, p must be a positive integer.

The derivative with respect to an element Σ_ij in Σ becomes

\frac{\partial k(x, x')}{\partial \Sigma_{ij}} = p\, x^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} x' \left( \sigma^2 + x^\top \Sigma x' \right)^{p-1},    (B.18)

and with respect to σ

\frac{\partial k(x, x')}{\partial \sigma} = 2 p \sigma \left( \sigma^2 + x^\top \Sigma x' \right)^{p-1}.    (B.19)


B.7 Rational Quadratic

The rational quadratic covariance function between points x, x′ ∈ R^d is given by

k(x, x') = \sigma^2 \left( 1 + \frac{(x - x')^\top \Sigma (x - x')}{2\alpha} \right)^{-\alpha},    (B.20)

where σ² is the signal variance, Σ ∈ R^{d×d} a positive definite matrix and α a real number.

The derivative with respect to an element Σ_ij in Σ becomes

\frac{\partial k(x, x')}{\partial \Sigma_{ij}} = -\frac{\sigma^2}{2} (x - x')^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} (x - x') \left( 1 + \frac{(x - x')^\top \Sigma (x - x')}{2\alpha} \right)^{-\alpha - 1},    (B.21)

and with respect to α

\frac{\partial k(x, x')}{\partial \alpha} = \sigma^2 \left( 1 + \frac{(x - x')^\top \Sigma (x - x')}{2\alpha} \right)^{-\alpha} \frac{\partial}{\partial \alpha} \left[ -\alpha \ln\left( 1 + \frac{(x - x')^\top \Sigma (x - x')}{2\alpha} \right) \right]

= \sigma^2 \left( 1 + \frac{(x - x')^\top \Sigma (x - x')}{2\alpha} \right)^{-\alpha} \left[ \frac{(x - x')^\top \Sigma (x - x')}{2\alpha + (x - x')^\top \Sigma (x - x')} - \ln\left( 1 + \frac{(x - x')^\top \Sigma (x - x')}{2\alpha} \right) \right].    (B.22)

Differentiating with respect to σ gives

\frac{\partial k(x, x')}{\partial \sigma} = 2\sigma \left( 1 + \frac{(x - x')^\top \Sigma (x - x')}{2\alpha} \right)^{-\alpha}.    (B.23)

B.8 Squared Exponential kernel

The squared exponential covariance function evaluated between two points x, x′ ∈ R^d is given by

k(x, x') = \sigma^2 \exp\left( -\frac{1}{2} (x - x')^\top \Sigma (x - x') \right),    (B.24)

where σ² is the signal variance and Σ ∈ R^{d×d} a positive definite matrix.

The derivative with respect to an element Σ_ij in Σ is given by

\frac{\partial k(x, x')}{\partial \Sigma_{ij}} = -\frac{1}{2} \sigma^2 (x - x')^\top \frac{\partial \Sigma}{\partial \Sigma_{ij}} (x - x') \exp\left( -\frac{1}{2} (x - x')^\top \Sigma (x - x') \right),    (B.25)

and with respect to σ

\frac{\partial k(x, x')}{\partial \sigma} = 2\sigma \exp\left( -\frac{1}{2} (x - x')^\top \Sigma (x - x') \right).    (B.26)
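As an illustration of how these expressions translate to code, a minimal sketch of (B.24) and (B.25) using Eigen is given below; this is not the interface of the library described in the thesis, and each Σ_ij is treated as an independent parameter so that ∂Σ/∂Σ_ij is a single-entry matrix.

#include <cmath>
#include <Eigen/Dense>

// Squared exponential covariance (B.24): k = sigma^2 exp(-0.5 r' Sigma r),
// with r = x - x'.
double squaredExponential(const Eigen::VectorXd& x, const Eigen::VectorXd& xp,
                          const Eigen::MatrixXd& Sigma, double sigma) {
  const Eigen::VectorXd r = x - xp;
  return sigma * sigma * std::exp(-0.5 * r.dot(Sigma * r));
}

// Derivative of (B.24) with respect to Sigma(i, j), cf. (B.25). Since
// dSigma/dSigma_ij is a single-entry matrix, r' (dSigma/dSigma_ij) r = r_i r_j.
double squaredExponentialDerivative(const Eigen::VectorXd& x,
                                    const Eigen::VectorXd& xp,
                                    const Eigen::MatrixXd& Sigma,
                                    double sigma, int i, int j) {
  const Eigen::VectorXd r = x - xp;
  return -0.5 * r(i) * r(j) * squaredExponential(x, xp, Sigma, sigma);
}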


Bibliography

Richard H. Byrd, Jorge Nocedal, and Richard A. Waltz. KNITRO: An integrated package for nonlinear optimization. In Large Scale Nonlinear Optimization, pages 35–59, 2006. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.126.425. Cited on page 36.

C. W. Clenshaw and A. R. Curtis. A method for numerical integration on an automatic computer. Numerische Mathematik, 2(1):197–205, December 1960. ISSN 0029-599X. doi: 10.1007/BF01386223. URL http://dx.doi.org/10.1007/BF01386223. Cited on page 33.

Thomas G. Dietterich. Ensemble Methods in Machine Learning. In Multiple Classifier Systems, volume 1857 of Lecture Notes in Computer Science, chapter 1, pages 1–15. Springer Berlin / Heidelberg, Berlin, Heidelberg, December 2000. ISBN 978-3-540-67704-8. doi: 10.1007/3-540-45014-9_1. URL http://dx.doi.org/10.1007/3-540-45014-9_1. Cited on page 44.

A. Elfes. Using occupancy grids for mobile robot perception and navigation. Computer, 22(6):46–57, June 1989. ISSN 0018-9162. doi: 10.1109/2.30720. URL http://dx.doi.org/10.1109/2.30720. Cited on page 41.

Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010. Cited on page 31.

Intel®. Intel Threading Building Blocks, February 2012. URL http://threadingbuildingblocks.org/. Cited on page 35.

Simon J. Julier and Jeffrey K. Uhlmann. A New Extension of the Kalman Filter to Nonlinear Systems. 3068:182–193, 1997. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.2891. Cited on page 43.

Andrew McHutchon and Carl E. Rasmussen. Gaussian Process Training with Input Noise. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1341–1349. 2011. Cited on page 43.


H. Moravec and A. Elfes. High resolution maps from wide angle sonar. In Robotics and Automation. Proceedings. 1985 IEEE International Conference on, volume 2, pages 116–121. IEEE, March 1985. doi: 10.1109/ROBOT.1985.1087316. URL http://dx.doi.org/10.1109/ROBOT.1985.1087316. Cited on page 41.

Hannes Nickisch and Carl E. Rasmussen. Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, pages 2035–2078, 2008. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.8505. Cited on page 11.

S. O'Callaghan, F. T. Ramos, and H. Durrant-Whyte. Contextual occupancy maps using Gaussian processes. In Robotics and Automation, 2009. ICRA '09. IEEE International Conference on, pages 1054–1060. IEEE, May 2009. ISBN 978-1-4244-2788-8. doi: 10.1109/ROBOT.2009.5152754. URL http://dx.doi.org/10.1109/ROBOT.2009.5152754. Cited on page 42.

S. T. O'Callaghan, F. T. Ramos, and H. Durrant-Whyte. Contextual occupancy maps incorporating sensor and location uncertainty. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 3478–3485. IEEE, May 2010. ISBN 978-1-4244-5038-1. doi: 10.1109/ROBOT.2010.5509812. URL http://dx.doi.org/10.1109/ROBOT.2010.5509812. Cited on page 43.

Simon T. O'Callaghan and Fabio T. Ramos. Continuous Occupancy Mapping with Integral Kernels. In 25th AAAI Conference on Artificial Intelligence, pages 1494–1500, 2011. URL http://dblp.uni-trier.de/db/conf/aaai/aaai2011.html#OCallaghanR11. Cited on pages 24, 43, 55, and 56.

M. Osborne. Bayesian Gaussian Processes for Sequential Prediction, Optimisation and Quadrature. PhD thesis, University of Oxford, 2010. Cited on pages 22 and 43.

Christopher J. Paciorek and Mark J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Proc. of the Conf. on Neural Information Processing Systems (NIPS), 2004. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.72.4125. Cited on pages 25 and 26.

Daniel Pagac, Eduardo M. Nebot, and Hugh Durrant-Whyte. An evidential approach to probabilistic map-building. In Leo Dorst, Michiel Lambalgen, and Frans Voorbraak, editors, Reasoning with Uncertainty in Robotics, volume 1093 of Lecture Notes in Computer Science, chapter 7, pages 164–170. Springer Berlin / Heidelberg, Berlin/Heidelberg, 1996. ISBN 3-540-61376-5. doi: 10.1007/BFb0013958. URL http://dx.doi.org/10.1007/BFb0013958. Cited on page 42.

Mark A. Paskin and Sebastian Thrun. Robotic mapping with polygonal random fields. In Faheim Bacchus and Tommi Jaakkola, editors, Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence. AUAI Press, Arlington, Virginia, July 2005. URL http://paskin.org/pubs/PaskinThrun2005.pdf. Cited on page 42.

Christian Plagemann, Kristian Kersting, and Wolfram Burgard. Nonstationary Gaussian Process Regression Using Point Estimates of Local Smoothness. In Walter Daelemans, Bart Goethals, and Katharina Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5212 of Lecture Notes in Computer Science, chapter 14, pages 204–219. Springer Berlin / Heidelberg, Berlin, Heidelberg, 2008. ISBN 978-3-540-87480-5. doi: 10.1007/978-3-540-87481-2_14. URL http://dx.doi.org/10.1007/978-3-540-87481-2_14. Cited on page 26.

Morgan Quigley, Ken Conley, Brian P. Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software, 2009. Cited on pages 2 and 50.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning series). The MIT Press, November 2005. ISBN 026218253X. URL http://www.worldcat.org/isbn/026218253X. Cited on pages 7, 15, 42, and 43.

Alistair S. Reid. Gaussian Process Models for Analysis of Remotely Sensed Geo-Spatial Data. PhD thesis, University of Sydney, 2011. Cited on pages 27 and 54.

Mike Smith, Ingmar Posner, and Paul Newman. Efficient Non-Parametric Surface Representations Using Active Sampling for Push Broom Laser Data. In Proceedings of Robotics: Science and Systems VI, June 2010. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9127. Cited on pages 43 and 46.

The New York Times. Virtual and Artificial, but 58,000 Want Course. URL http://www.nytimes.com/2011/08/16/science/16stanford.html, August 2011. Cited on page 2.

Sebastian Thrun and Arno Bücken. Integrating Grid-Based and Topological Maps for Mobile Robot Navigation. In Proceedings of the AAAI Thirteenth National Conference on Artificial Intelligence, 1996. URL http://www.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1996_8/thrun_sebastian_1996_8.pdf. Cited on pages 41 and 44.

Volker Tresp. A Bayesian Committee Machine. In Neural Computation, 2000. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.5677. Cited on pages 44 and 47.

Volker Tresp. Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems 13, pages 654–660, 2001. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.69.5990. Cited on page 44.


Jörg Waldvogel. Fast Construction of the Fejér and Clenshaw–Curtis Quadrature Rules. BIT Numerical Mathematics, 46(1):195–202, March 2006. ISSN 0006-3835. doi: 10.1007/s10543-006-0045-4. URL http://dx.doi.org/10.1007/s10543-006-0045-4. Cited on page 33.



Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Emanuel Walldén Viklund, Johan Wågberg