ADVANCED MACHINE LEARNING
Non-linear regression techniques, Part II
lasa.epfl.ch/.../Lec_IX_NonlinearRegression_Part_II.pdf


Page 1

Non-linear regression techniques, Part II

Page 2

Regression Algorithms in this Course

Support Vector Machine – Support vector regression
Relevance Vector Machine – Relevance vector regression
Boosting: random projections, random Gaussians – Gradient boosting
Random forest
Gaussian Process – Gaussian process regression
Locally weighted projected regression

Not covered – replaced by one hour to answer questions about the mini-project.

Page 3

Regression Algorithms in this Course

Random forest, Gaussian Process, Gaussian process regression

Page 4

Probabilistic Regression (PR)

PR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables $y$ and $x$ by building a linear model of the form:

$y = f(x) = w^T x, \quad w, x \in \mathbb{R}^N$

If one assumes that the observed values of $y$ differ from $f(x)$ by an additive noise $\varepsilon$ that follows a zero-mean Gaussian distribution (such an assumption amounts to putting a prior distribution over the noise), then:

$y = w^T x + \varepsilon, \quad \text{with } \varepsilon \sim N(0, \sigma^2)$

Where have we seen this before?
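As a small numerical illustration (not from the slides; the parameter values are our own choices), the model above can be simulated by drawing noisy observations around $w^T x$ and checking that the residuals recover the zero-mean Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = np.array([2.0, -1.0])   # illustrative "true" parameters
sigma = 0.1                      # noise standard deviation

# Draw M zero-mean inputs and generate y = w^T x + eps, eps ~ N(0, sigma^2)
M = 1000
X = rng.normal(size=(M, 2))
eps = rng.normal(scale=sigma, size=M)
y = X @ w_true + eps

# The residuals y - w^T x recover the zero-mean Gaussian noise
residuals = y - X @ w_true
print(residuals.mean(), residuals.std())
```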

Page 5

Probabilistic Regression

Training set of $M$ pairs of data points: $\{x^i, y^i\}_{i=1}^{M}$, gathered as $X = [x^1, \dots, x^M]$ and $y = [y^1, \dots, y^M]^T$.

Likelihood of the regressive model $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$:

$p(y \mid X, w, \sigma) \sim N(X^T w, \sigma^2 I)$

Since the data points are independently and identically distributed (i.i.d.):

$p(y \mid X, w, \sigma) = \prod_{i=1}^{M} p(y^i \mid x^i, w, \sigma) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^i - w^T x^i)^2}{2\sigma^2}\right)$

Parameters of the model: $w$, $\sigma$.
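The i.i.d. likelihood above can be evaluated directly in log form; a minimal sketch (the function and variable names are ours, with the data $X$ stored column-wise as on the slide):

```python
import numpy as np

def log_likelihood(w, sigma, X, y):
    """Log of prod_i N(y_i | w^T x_i, sigma^2) for column-wise data X (N x M)."""
    M = X.shape[1]
    r = y - X.T @ w                       # residuals y_i - w^T x_i
    return (-0.5 * np.sum(r**2) / sigma**2
            - M * np.log(np.sqrt(2 * np.pi) * sigma))

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 50))
w = np.array([1.0, 0.5])
y = X.T @ w + rng.normal(scale=0.1, size=50)

# The likelihood should be higher at the generating parameters than far from them
ll_true = log_likelihood(w, 0.1, X, y)
ll_off = log_likelihood(w + 1.0, 0.1, X, y)
print(ll_true, ll_off)
```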

Page 6

Probabilistic Regression

Training set of $M$ pairs of data points $\{x^i, y^i\}_{i=1}^{M}$; likelihood of the regressive model as before: $p(y \mid X, w, \sigma) \sim N(X^T w, \sigma^2 I)$.

Prior model on the distribution of the parameters $w$:

$p(w) = N(0, \Sigma_w) \propto \exp\!\left(-\tfrac{1}{2}\, w^T \Sigma_w^{-1} w\right)$

Hyperparameters $\Sigma_w$: given by the user. Parameters of the model: $w$, $\sigma$.

Page 7

Probabilistic Regression

Prior on $w$: $p(w) = N(0, \Sigma_w) \propto \exp\!\left(-\tfrac{1}{2}\, w^T \Sigma_w^{-1} w\right)$

Estimate the conditional distribution on $w$ given the data using Bayes' rule:

posterior = (likelihood × prior) / marginal likelihood

$p(w \mid X, y) = \dfrac{p(y \mid X, w)\, p(w)}{p(y \mid X)}$  (we drop $\sigma$ from the conditioning; it is not a variable)

The posterior distribution on $w$ is Gaussian:

$p(w \mid X, y) = N\!\left(\tfrac{1}{\sigma^2}\, A^{-1} X y,\; A^{-1}\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$
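A minimal numerical sketch of the posterior formula (our variable names; $X$ is $N \times M$ with one datapoint per column): with many low-noise points, the posterior mean concentrates near the generating parameters.

```python
import numpy as np

def posterior_w(X, y, sigma, Sigma_w):
    """Gaussian posterior N(mean, cov) on w for the linear model y = w^T x + eps."""
    A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_w)   # A = XX^T/sigma^2 + Sigma_w^{-1}
    cov = np.linalg.inv(A)
    mean = cov @ X @ y / sigma**2                     # (1/sigma^2) A^{-1} X y
    return mean, cov

rng = np.random.default_rng(2)
w_true = np.array([1.5, -0.5])
sigma = 0.05
X = rng.normal(size=(2, 200))
y = X.T @ w_true + rng.normal(scale=sigma, size=200)

mean, cov = posterior_w(X, y, sigma, np.eye(2))
print(mean)   # close to w_true for low noise and many datapoints
```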

Page 8

$p(w \mid X, y) = N\!\left(\tfrac{1}{\sigma^2}\, A^{-1} X y,\; A^{-1}\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

The posterior distribution on $w$ is Gaussian: the conditional distribution of a Gaussian distribution is also Gaussian (image from Wikipedia).

Page 9

Probabilistic Regression

The expectation over the posterior distribution gives the best estimate:

$E\{p(w \mid X, y)\} = \tfrac{1}{\sigma^2}\, A^{-1} X y, \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

This is called the maximum a posteriori (MAP) estimate of $w$.

Page 10

Probabilistic Regression

We can now compute the posterior distribution on $y$ for a query point $x$:

$p(y \mid x, X, y) = \int p(w \mid X, y)\, p(y \mid x, w)\, dw$

$p(y \mid x, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, x^T A^{-1} X y,\; x^T A^{-1} x\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

Page 11

Probabilistic Regression

$p(y \mid x, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, x^T A^{-1} X y,\; x^T A^{-1} x\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

($x$: testing point; $X$, $y$: training datapoints)

The estimate of $y$ given a test point $x$ is given by:

$\hat{y} = E\{p(y \mid x, X, y)\} = \tfrac{1}{\sigma^2}\, x^T A^{-1} X y$
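Putting the pieces together, a sketch of the predictive distribution at a test point (the function and variable names are ours; $X$ stores one datapoint per column):

```python
import numpy as np

def predict(x, X, y, sigma, Sigma_w):
    """Predictive mean and variance of p(y | x, X, y) for Bayesian linear regression."""
    A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_w)
    A_inv = np.linalg.inv(A)
    mean = x @ A_inv @ X @ y / sigma**2   # (1/sigma^2) x^T A^{-1} X y
    var = x @ A_inv @ x                   # x^T A^{-1} x
    return mean, var

rng = np.random.default_rng(3)
w_true = np.array([2.0, 1.0])
sigma = 0.1
X = rng.normal(size=(2, 300))
y = X.T @ w_true + rng.normal(scale=sigma, size=300)

x_test = np.array([0.5, -1.0])
mean, var = predict(x_test, X, y, sigma, np.eye(2))
print(mean, var)   # mean close to w_true^T x_test; var small but positive
```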

Page 12

Probabilistic Regression

The variance gives a measure of the uncertainty of the prediction:

$\operatorname{var}\{p(y \mid x, X, y)\} = x^T A^{-1} x$

$E\{p(y \mid x, X, y)\} = \tfrac{1}{\sigma^2}\, x^T A^{-1} X y, \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

Page 13

Gaussian Process Regression

How to extend the simple linear Bayesian regressive model

$y = w^T x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$

for nonlinear regression?

Page 14

Gaussian Process Regression

How to extend the simple linear Bayesian regressive model $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$ for nonlinear regression?

Page 15

Gaussian Process Regression

How to extend the simple linear Bayesian regressive model $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$ for nonlinear regression?

Distribution over functions.

Page 16

Gaussian Process Regression

How to extend the simple linear Bayesian regressive model $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$ for nonlinear regression?

Linear case: $p(y \mid x, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, x^T A^{-1} X y,\; x^T A^{-1} x\right)$, with $A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

Send $x$ through a non-linear transformation $\phi$, replacing $x$ by $\phi(x)$ and $X$ by $\Phi = \phi(X)$:

$p(y \mid x, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, \phi(x)^T A^{-1} \Phi\, y,\; \phi(x)^T A^{-1} \phi(x)\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, \Phi \Phi^T + \Sigma_w^{-1}$

Page 17

Gaussian Process Regression

Again, a Gaussian distribution:

$p(y \mid x, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, \phi(x)^T A^{-1} \Phi\, y,\; \phi(x)^T A^{-1} \phi(x)\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, \Phi \Phi^T + \Sigma_w^{-1}$

Page 18

Gaussian Process Regression

Define the kernel as the inner product in feature space:

$k(x, x') = \phi(x)^T\, \Sigma_w\, \phi(x')$

Rewriting the predictive distribution in terms of the kernel (see supplement for the steps):

$p(y \mid x, X, y) = N\!\Big(k(x, X)\,[K(X, X) + \sigma^2 I]^{-1}\, y,\;\; k(x, x) - k(x, X)\,[K(X, X) + \sigma^2 I]^{-1}\, k(X, x)\Big)$

Page 19

Gaussian Process Regression

Define the kernel as the inner product in feature space: $k(x, x') = \phi(x)^T\, \Sigma_w\, \phi(x')$

The predictive mean is then a weighted sum of kernel evaluations at the training points (see supplement for the steps):

$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i), \quad \text{with } \alpha = [K(X, X) + \sigma^2 I]^{-1}\, y$
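The kernelized predictor above can be sketched in a few lines. This is a generic GPR mean predictor with an RBF kernel; the function names and hyperparameter values are our own illustrative choices:

```python
import numpy as np

def rbf(A, B, l=0.5):
    """RBF kernel matrix k(a, b) = exp(-||a - b||^2 / l) between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / l)

def gpr_predict(x_test, X, y, sigma=0.1, l=0.5):
    """Predictive mean sum_i alpha_i k(x, x^i), alpha = (K + sigma^2 I)^{-1} y."""
    K = rbf(X, X, l)
    alpha = np.linalg.solve(K + sigma**2 * np.eye(len(X)), y)
    return rbf(x_test, X, l) @ alpha

# 1D toy problem: noisy sine
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=60)

x_test = np.array([[0.0], [1.5]])
y_hat = gpr_predict(x_test, X, y)
print(y_hat)   # predictions should track sin(x) near the data
```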

Page 20

Gaussian Process Regression

$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i), \quad \text{with } \alpha = [K(X, X) + \sigma^2 I]^{-1}\, y, \quad \alpha_i > 0$

All datapoints are used in the computation!

Page 21

Gaussian Process Regression

$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i), \quad \text{with } \alpha = [K(X, X) + \sigma^2 I]^{-1}\, y$

The kernel and its hyperparameters are given by the user. They can also be optimized by maximizing the marginal likelihood; see the class supplement.

(Figures: RBF kernel, width = 0.1; RBF kernel, width = 0.5.)

Page 22

Gaussian Process Regression

Sensitivity to the choice of kernel width $l$ (called lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential):

$k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{l}\right)$

(Figure: kernel width = 0.1.)

Page 23

Gaussian Process Regression

Sensitivity to the choice of kernel width $l$ (called lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential): $k(x, x') = \exp\!\left(-\|x - x'\|^2 / l\right)$

(Figure: kernel width = 0.5.)
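The sensitivity can be seen numerically: with a small width, the kernel correlation between nearby points decays quickly, so each prediction relies on very local data. A sketch under our parameterization $k(x, x') = \exp(-\|x - x'\|^2 / l)$:

```python
import numpy as np

def rbf_value(x, x_prime, l):
    """Gaussian (RBF) kernel with width l: exp(-||x - x'||^2 / l)."""
    return np.exp(-np.sum((x - x_prime) ** 2) / l)

x, x_prime = np.array([0.0]), np.array([0.5])

k_narrow = rbf_value(x, x_prime, l=0.1)  # width 0.1: points 0.5 apart barely correlate
k_wide = rbf_value(x, x_prime, l=0.5)    # width 0.5: same points still strongly correlated

print(k_narrow, k_wide)
```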

Page 24

Gaussian Process Regression

$\hat{y} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i), \quad \alpha = [K(X, X) + \sigma^2 I]^{-1}\, y$

$\operatorname{cov}\{p(y \mid x, X, y)\} = K(x, x) - K(x, X)\,[K(X, X) + \sigma^2 I]^{-1}\, K(X, x)$

The value of the noise $\sigma$ needs to be pre-set by hand. The larger the noise, the more uncertainty (here the noise is $\leq 1$).

(Figures: $\sigma$ = 0.05; $\sigma$ = 0.01.)

Page 25

Gaussian Process Regression

Low noise: 0.05

Page 26

Gaussian Process Regression

High noise: 0.2

Page 27

Gaussian Process Regression

$\hat{y} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i), \quad \alpha = [K(X, X) + \sigma^2 I]^{-1}\, y$

The kernel is usually a Gaussian kernel with a stationary covariance function. Non-stationary covariance functions can encapsulate local variations in the density of the datapoints.

Gibbs' non-stationary covariance function (the lengthscale is a function of $x$):

$k(x, x') = \prod_{i=1}^{N} \left(\frac{2\, l_i(x)\, l_i(x')}{l_i(x)^2 + l_i(x')^2}\right)^{1/2} \exp\!\left(-\sum_{i=1}^{N} \frac{(x_i - x'_i)^2}{l_i(x)^2 + l_i(x')^2}\right)$
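A sketch of Gibbs' covariance function for 1D inputs, with an arbitrary illustrative lengthscale function $l(x)$ (our own choice, shorter on the left, longer on the right): points the same distance apart are much less correlated in the short-lengthscale region.

```python
import numpy as np

def gibbs_kernel(x, x_prime, l):
    """Gibbs' non-stationary covariance for 1D inputs; l maps x to a lengthscale."""
    lx, lxp = l(x), l(x_prime)
    prefactor = np.sqrt(2 * lx * lxp / (lx**2 + lxp**2))
    return prefactor * np.exp(-(x - x_prime) ** 2 / (lx**2 + lxp**2))

# Illustrative lengthscale function: short on the left, long on the right
l = lambda x: 0.1 + 0.5 * (np.tanh(x) + 1)

# Correlation between points 0.3 apart, in the short- vs long-lengthscale region
k_left = gibbs_kernel(-2.0, -1.7, l)
k_right = gibbs_kernel(2.0, 2.3, l)
print(k_left, k_right)
```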

Page 28

Gaussian Process Regression

Linear model: $y = w^T x + \varepsilon$. Non-linear model: $y = w^T \phi(x) + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$.

Both models follow a zero-mean Gaussian distribution:

$E\{y\} = E\{w^T x + \varepsilon\} = E\{w\}^T x + E\{\varepsilon\} = 0$, and likewise $E\{y\} = E\{w\}^T \phi(x) + E\{\varepsilon\} = 0$

They therefore predict $y = 0$ away from the datapoints! (SVR instead predicts $y = b$ away from the datapoints; see the exercise session.)
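This zero-mean behavior can be checked numerically with a generic RBF-kernel GPR mean predictor (our parameterization and names): far from the data, all kernel values vanish and the prediction falls back to 0 even though the data has mean 1.

```python
import numpy as np

def rbf(A, B, l=0.5):
    """RBF kernel matrix exp(-||a - b||^2 / l) between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / l)

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(40, 1))
y = 1.0 + np.sin(X[:, 0])          # data with non-zero mean

alpha = np.linalg.solve(rbf(X, X) + 0.01 * np.eye(40), y)

y_near = rbf(np.array([[0.0]]), X) @ alpha   # inside the data region: follows the data
y_far = rbf(np.array([[10.0]]), X) @ alpha   # far away: kernel values vanish -> y = 0

print(y_near, y_far)
```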

Page 29

Examples of application of GPR

Kronander, K. and Billard, A. (2013). Learning Compliant Manipulation through Kinesthetic and Tactile Human-Robot Interaction. IEEE Transactions on Haptics. doi: 10.1109/TOH.2013.54.

Striking a match requires careful control of the interaction force: one must push hard enough to light the match, but not so hard as to break it.

Page 30

Examples of application of GPR

The stiffness profile is encoded as a time-varying signal using GPR. The shaded area corresponds to the striking phase. We see that the stiffness must be decreased just before entering into contact, and again during contact when the match lights up. The stiffness can increase again when the robot moves back into free space.

Page 31

Examples of application of GPR

Building a 3D model of an object from tactile information can be useful to guide manipulation when the object is no longer visible.

Page 32

Examples of application of GPR

Point on the surface: $x \in \mathbb{R}^3$. Distance to the surface: $y$.

On the surface: $y = 0$. Outside the surface: $y > 0$. Inside the object: $y < 0$.

Learn a mapping $y = f(x)$ with GPR to determine how far one is from the surface.
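A sketch of how such training targets could be assembled for a toy object, a sphere of radius 1 (a hypothetical setup with our own names); the signed distance plays the role of $y$:

```python
import numpy as np

def signed_distance_sphere(x, radius=1.0):
    """y = 0 on the surface, y > 0 outside, y < 0 inside (sphere centered at origin)."""
    return np.linalg.norm(x, axis=-1) - radius

rng = np.random.default_rng(6)
points = rng.normal(size=(5, 3))          # sample 3D points around the object

y = signed_distance_sphere(points)        # training targets for the GPR mapping y = f(x)
on_surface = signed_distance_sphere(np.array([[1.0, 0.0, 0.0]]))
center = signed_distance_sphere(np.array([0.0, 0.0, 0.0]))
print(y, on_surface, center)
```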

Page 33

Examples of application of GPR

Normal to the surface: $z$.

Learn a mapping $z = g(x)$ with GPR to determine the normal to the surface (three GPRs are needed, one for each coordinate of the vector $z$).

The distance and normal to the surface can be used in an optimization framework to determine the optimal posture of robot fingers on the object.

Page 34

Examples of application of GPR

GPR can be used to model the shape of objects. Top: 3D points sampled either from a camera or from tactile sensing. Bottom: 3D shape reconstructed by GPR. The arrows represent the predicted normals at the surface. (El Khoury, S., Li, M., and Billard, A. (2013). On the generation of a variety of grasps. Robotics and Autonomous Systems, 61(12):1335–34.)

Page 35

Regression Algorithms in this Course

Support Vector Machine – Support vector regression
Relevance Vector Machine – Relevance vector regression
Boosting: random projections, random Gaussians – Gradient boosting
Random forest
Gaussian Process – Gaussian process regression
Locally weighted projected regression

Page 36

Regression Algorithms in this Course

Relevance Vector Machine

Boosting – random Gaussians

Gradient boosting

Page 37

Gradient Boosting

1. Choose some regression technique (any we have seen so far).
2. Apply boosting to train and combine the set of estimates $\hat{f}_1, \hat{f}_2, \dots, \hat{f}_m$.

Page 38

Gradient Boosting

3. Aggregate to get the final estimate: $\hat{f} = \frac{1}{m} \sum_{i=1}^{m} \hat{f}_i$
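A minimal sketch of gradient boosting with squared loss, using one-split regression stumps as the base technique (the stump implementation and all parameter values are our own simplifications; note this variant combines the learners with a shrinkage factor at each stage rather than the simple average shown above):

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression stump to 1D inputs x and targets r."""
    best = None
    for s in np.quantile(x, np.linspace(0.1, 0.9, 9)):
        left, right = r[x <= s].mean(), r[x > s].mean()
        err = ((np.where(x <= s, left, right) - r) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left, right)
    _, s, left, right = best
    return lambda q: np.where(q <= s, left, right)

def gradient_boost(x, y, n_rounds=50, lr=0.3):
    """With squared loss, each round fits the residuals (negative gradient)."""
    fs, pred = [], np.zeros_like(y)
    for _ in range(n_rounds):
        f = fit_stump(x, y - pred)        # fit residuals y - current prediction
        pred = pred + lr * f(x)
        fs.append(f)
    return lambda q: lr * sum(f(q) for f in fs)

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x)

model = gradient_boost(x, y)
err = np.abs(model(x) - y).mean()
print(err)   # small mean absolute training error
```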

Page 39

Gradient Boosting

Typical example of a dataset with imbalanced data. The data was hand-drawn: there are more datapoints in regions where the hand slowed down. As a result, the fit is very good in these regions and less good in other regions.

(Figure: gradient boosting using squared loss; a region with a high density of datapoints is highlighted.)

Page 40

Gradient Boosting

Better results, i.e. a smoother fit, are obtained when increasing the number of functions used for the fit.

(Figure: gradient boosting using squared loss with twice as many functions.)

Page 41

Summary

We have seen a few different techniques to perform non-linear regression in machine learning.

The techniques differ in their algorithms and in their number of hyperparameters.

Some techniques (GP, RVR) provide a measure of the uncertainty of the model, which can be used to determine when inference is trustworthy.

Some techniques (ν-SVR, RVR, LWPR) are designed to be computationally cheap at retrieval (very few support vectors, few models).

Other techniques (GP) are meant to provide a very accurate estimate of the data, at the cost of retaining all datapoints for retrieval.