ADVANCED MACHINE LEARNING
Non-linear regression techniques, Part II
lasa.epfl.ch/.../Lec_IX_NonlinearRegression_Part_II.pdf


Page 1

Non-linear regression techniques, Part II

Page 2

Regression Algorithms in this Course

Support Vector Machine – Support vector regression
Relevance Vector Machine – Relevance vector regression
Boosting: random projections, random Gaussians – Gradient boosting
Random forest
Gaussian Process – Gaussian process regression
Locally weighted projected regression

Not covered – replaced by one hour to answer questions about the mini-project.

Page 3

Regression Algorithms in this Course

Random forest, Gaussian Process, Gaussian process regression

Page 4

Probabilistic Regression (PR)

PR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables $y$ and $x$ by building a linear model of the form:

$y = f(x) = w^T x, \quad w, x \in \mathbb{R}^N$

If one assumes that the observed values of $y$ differ from $f(x)$ by an additive noise $\varepsilon$ that follows a zero-mean Gaussian distribution (such an assumption amounts to putting a prior distribution over the noise), then:

$y = w^T x + \varepsilon, \quad \text{with } \varepsilon \sim N(0, \sigma^2)$

Where have we seen this before?
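As a small numerical illustration (not from the slides; the parameter values are our own choices), the model above can be simulated by drawing noisy observations around $w^T x$ and checking that the residuals recover the zero-mean Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = np.array([2.0, -1.0])   # illustrative "true" parameters
sigma = 0.1                      # noise standard deviation

# Draw M zero-mean inputs and generate y = w^T x + eps, eps ~ N(0, sigma^2)
M = 1000
X = rng.normal(size=(M, 2))
eps = rng.normal(scale=sigma, size=M)
y = X @ w_true + eps

# The residuals y - w^T x recover the zero-mean Gaussian noise
residuals = y - X @ w_true
print(residuals.mean(), residuals.std())
```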

Page 5

Probabilistic Regression

Training set of $M$ pairs of data points: $\{x^i, y^i\}_{i=1}^{M}$, gathered as $X = [x^1, \dots, x^M]$ and $y = [y^1, \dots, y^M]^T$.

Likelihood of the regressive model $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$:

$p(y \mid X, w, \sigma) \sim N(X^T w, \sigma^2 I)$

Since the data points are independently and identically distributed (i.i.d.):

$p(y \mid X, w, \sigma) = \prod_{i=1}^{M} p(y^i \mid x^i, w, \sigma) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^i - w^T x^i)^2}{2\sigma^2}\right)$

Parameters of the model: $w$, $\sigma$.
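The i.i.d. likelihood above can be evaluated directly in log form; a minimal sketch (the function and variable names are ours, with the data $X$ stored column-wise as on the slide):

```python
import numpy as np

def log_likelihood(w, sigma, X, y):
    """Log of prod_i N(y_i | w^T x_i, sigma^2) for column-wise data X (N x M)."""
    M = X.shape[1]
    r = y - X.T @ w                       # residuals y_i - w^T x_i
    return (-0.5 * np.sum(r**2) / sigma**2
            - M * np.log(np.sqrt(2 * np.pi) * sigma))

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 50))
w = np.array([1.0, 0.5])
y = X.T @ w + rng.normal(scale=0.1, size=50)

# The likelihood should be higher at the generating parameters than far from them
ll_true = log_likelihood(w, 0.1, X, y)
ll_off = log_likelihood(w + 1.0, 0.1, X, y)
print(ll_true, ll_off)
```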

Page 6

Probabilistic Regression

Training set of $M$ pairs of data points $\{x^i, y^i\}_{i=1}^{M}$; likelihood of the regressive model as before: $p(y \mid X, w, \sigma) \sim N(X^T w, \sigma^2 I)$.

Prior model on the distribution of the parameters $w$:

$p(w) = N(0, \Sigma_w) \propto \exp\!\left(-\tfrac{1}{2}\, w^T \Sigma_w^{-1} w\right)$

Hyperparameters $\Sigma_w$: given by the user. Parameters of the model: $w$, $\sigma$.

Page 7

Probabilistic Regression

Prior on $w$: $p(w) = N(0, \Sigma_w) \propto \exp\!\left(-\tfrac{1}{2}\, w^T \Sigma_w^{-1} w\right)$

Estimate the conditional distribution on $w$ given the data using Bayes' rule:

posterior = (likelihood × prior) / marginal likelihood

$p(w \mid X, y) = \dfrac{p(y \mid X, w)\, p(w)}{p(y \mid X)}$  (we drop $\sigma$ from the conditioning; it is not a variable)

The posterior distribution on $w$ is Gaussian:

$p(w \mid X, y) = N\!\left(\tfrac{1}{\sigma^2}\, A^{-1} X y,\; A^{-1}\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$
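A minimal numerical sketch of the posterior formula (our variable names; $X$ is $N \times M$ with one datapoint per column): with many low-noise points, the posterior mean concentrates near the generating parameters.

```python
import numpy as np

def posterior_w(X, y, sigma, Sigma_w):
    """Gaussian posterior N(mean, cov) on w for the linear model y = w^T x + eps."""
    A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_w)   # A = XX^T/sigma^2 + Sigma_w^{-1}
    cov = np.linalg.inv(A)
    mean = cov @ X @ y / sigma**2                     # (1/sigma^2) A^{-1} X y
    return mean, cov

rng = np.random.default_rng(2)
w_true = np.array([1.5, -0.5])
sigma = 0.05
X = rng.normal(size=(2, 200))
y = X.T @ w_true + rng.normal(scale=sigma, size=200)

mean, cov = posterior_w(X, y, sigma, np.eye(2))
print(mean)   # close to w_true for low noise and many datapoints
```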

Page 8

$p(w \mid X, y) = N\!\left(\tfrac{1}{\sigma^2}\, A^{-1} X y,\; A^{-1}\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

The posterior distribution on $w$ is Gaussian: the conditional distribution of a Gaussian distribution is also Gaussian (image from Wikipedia).

Page 9

Probabilistic Regression

The expectation over the posterior distribution gives the best estimate:

$E\{p(w \mid X, y)\} = \tfrac{1}{\sigma^2}\, A^{-1} X y, \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

This is called the maximum a posteriori (MAP) estimate of $w$.

Page 10

Probabilistic Regression

We can now compute the posterior distribution on $y$ for a query point $x$:

$p(y \mid x, X, y) = \int p(w \mid X, y)\, p(y \mid x, w)\, dw$

$p(y \mid x, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, x^T A^{-1} X y,\; x^T A^{-1} x\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

Page 11

Probabilistic Regression

$p(y \mid x, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, x^T A^{-1} X y,\; x^T A^{-1} x\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

($x$: testing point; $X$, $y$: training datapoints)

The estimate of $y$ given a test point $x$ is given by:

$\hat{y} = E\{p(y \mid x, X, y)\} = \tfrac{1}{\sigma^2}\, x^T A^{-1} X y$
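Putting the pieces together, a sketch of the predictive distribution at a test point (the function and variable names are ours; $X$ stores one datapoint per column):

```python
import numpy as np

def predict(x, X, y, sigma, Sigma_w):
    """Predictive mean and variance of p(y | x, X, y) for Bayesian linear regression."""
    A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_w)
    A_inv = np.linalg.inv(A)
    mean = x @ A_inv @ X @ y / sigma**2   # (1/sigma^2) x^T A^{-1} X y
    var = x @ A_inv @ x                   # x^T A^{-1} x
    return mean, var

rng = np.random.default_rng(3)
w_true = np.array([2.0, 1.0])
sigma = 0.1
X = rng.normal(size=(2, 300))
y = X.T @ w_true + rng.normal(scale=sigma, size=300)

x_test = np.array([0.5, -1.0])
mean, var = predict(x_test, X, y, sigma, np.eye(2))
print(mean, var)   # mean close to w_true^T x_test; var small but positive
```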

Page 12

Probabilistic Regression

The variance gives a measure of the uncertainty of the prediction:

$\operatorname{var}\{p(y \mid x, X, y)\} = x^T A^{-1} x$

$E\{p(y \mid x, X, y)\} = \tfrac{1}{\sigma^2}\, x^T A^{-1} X y, \quad \text{with } A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

Page 13

Gaussian Process Regression

How to extend the simple linear Bayesian regressive model

$y = w^T x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$

for nonlinear regression?

Page 14

Gaussian Process Regression

How to extend the simple linear Bayesian regressive model $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$ for nonlinear regression?

Page 15

Gaussian Process Regression

How to extend the simple linear Bayesian regressive model $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$ for nonlinear regression?

Distribution over functions.

Page 16

Gaussian Process Regression

How to extend the simple linear Bayesian regressive model $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$ for nonlinear regression?

Linear case: $p(y \mid x, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, x^T A^{-1} X y,\; x^T A^{-1} x\right)$, with $A = \tfrac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

Send $x$ through a non-linear transformation $\phi$, replacing $x$ by $\phi(x)$ and $X$ by $\Phi = \phi(X)$:

$p(y \mid x, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, \phi(x)^T A^{-1} \Phi\, y,\; \phi(x)^T A^{-1} \phi(x)\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, \Phi \Phi^T + \Sigma_w^{-1}$

Page 17

Gaussian Process Regression

Again, a Gaussian distribution:

$p(y \mid x, X, y) = N\!\left(\tfrac{1}{\sigma^2}\, \phi(x)^T A^{-1} \Phi\, y,\; \phi(x)^T A^{-1} \phi(x)\right), \quad \text{with } A = \tfrac{1}{\sigma^2}\, \Phi \Phi^T + \Sigma_w^{-1}$

Page 18

Gaussian Process Regression

Define the kernel as the inner product in feature space:

$k(x, x') = \phi(x)^T\, \Sigma_w\, \phi(x')$

Rewriting the predictive distribution in terms of the kernel (see supplement for the steps):

$p(y \mid x, X, y) = N\!\Big(k(x, X)\,[K(X, X) + \sigma^2 I]^{-1}\, y,\;\; k(x, x) - k(x, X)\,[K(X, X) + \sigma^2 I]^{-1}\, k(X, x)\Big)$

Page 19

Gaussian Process Regression

Define the kernel as the inner product in feature space: $k(x, x') = \phi(x)^T\, \Sigma_w\, \phi(x')$

The predictive mean is then a weighted sum of kernel evaluations at the training points (see supplement for the steps):

$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i), \quad \text{with } \alpha = [K(X, X) + \sigma^2 I]^{-1}\, y$
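The kernelized predictor above can be sketched in a few lines. This is a generic GPR mean predictor with an RBF kernel; the function names and hyperparameter values are our own illustrative choices:

```python
import numpy as np

def rbf(A, B, l=0.5):
    """RBF kernel matrix k(a, b) = exp(-||a - b||^2 / l) between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / l)

def gpr_predict(x_test, X, y, sigma=0.1, l=0.5):
    """Predictive mean sum_i alpha_i k(x, x^i), alpha = (K + sigma^2 I)^{-1} y."""
    K = rbf(X, X, l)
    alpha = np.linalg.solve(K + sigma**2 * np.eye(len(X)), y)
    return rbf(x_test, X, l) @ alpha

# 1D toy problem: noisy sine
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=60)

x_test = np.array([[0.0], [1.5]])
y_hat = gpr_predict(x_test, X, y)
print(y_hat)   # predictions should track sin(x) near the data
```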

Page 20

Gaussian Process Regression

$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i), \quad \text{with } \alpha = [K(X, X) + \sigma^2 I]^{-1}\, y, \quad \alpha_i > 0$

All datapoints are used in the computation!

Page 21

Gaussian Process Regression

$\hat{y} = E\{y \mid x, X, y\} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i), \quad \text{with } \alpha = [K(X, X) + \sigma^2 I]^{-1}\, y$

The kernel and its hyperparameters are given by the user. They can also be optimized by maximizing the marginal likelihood; see the class supplement.

(Figures: RBF kernel, width = 0.1; RBF kernel, width = 0.5.)

Page 22

Gaussian Process Regression

Sensitivity to the choice of kernel width $l$ (called lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential):

$k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{l}\right)$

(Figure: kernel width = 0.1.)

Page 23

Gaussian Process Regression

Sensitivity to the choice of kernel width $l$ (called lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential): $k(x, x') = \exp\!\left(-\|x - x'\|^2 / l\right)$

(Figure: kernel width = 0.5.)
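The sensitivity can be seen numerically: with a small width, the kernel correlation between nearby points decays quickly, so each prediction relies on very local data. A sketch under our parameterization $k(x, x') = \exp(-\|x - x'\|^2 / l)$:

```python
import numpy as np

def rbf_value(x, x_prime, l):
    """Gaussian (RBF) kernel with width l: exp(-||x - x'||^2 / l)."""
    return np.exp(-np.sum((x - x_prime) ** 2) / l)

x, x_prime = np.array([0.0]), np.array([0.5])

k_narrow = rbf_value(x, x_prime, l=0.1)  # width 0.1: points 0.5 apart barely correlate
k_wide = rbf_value(x, x_prime, l=0.5)    # width 0.5: same points still strongly correlated

print(k_narrow, k_wide)
```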

Page 24

Gaussian Process Regression

$\hat{y} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i), \quad \alpha = [K(X, X) + \sigma^2 I]^{-1}\, y$

$\operatorname{cov}\{p(y \mid x, X, y)\} = K(x, x) - K(x, X)\,[K(X, X) + \sigma^2 I]^{-1}\, K(X, x)$

The value of the noise $\sigma$ needs to be pre-set by hand. The larger the noise, the more uncertainty (here the noise is $\leq 1$).

(Figures: $\sigma$ = 0.05; $\sigma$ = 0.01.)

Page 25

Gaussian Process Regression

Low noise: 0.05

Page 26

Gaussian Process Regression

High noise: 0.2

Page 27

Gaussian Process Regression

$\hat{y} = \sum_{i=1}^{M} \alpha_i\, k(x, x^i), \quad \alpha = [K(X, X) + \sigma^2 I]^{-1}\, y$

The kernel is usually a Gaussian kernel with a stationary covariance function. Non-stationary covariance functions can encapsulate local variations in the density of the datapoints.

Gibbs' non-stationary covariance function (the lengthscale is a function of $x$):

$k(x, x') = \prod_{i=1}^{N} \left(\frac{2\, l_i(x)\, l_i(x')}{l_i(x)^2 + l_i(x')^2}\right)^{1/2} \exp\!\left(-\sum_{i=1}^{N} \frac{(x_i - x'_i)^2}{l_i(x)^2 + l_i(x')^2}\right)$
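A sketch of Gibbs' covariance function for 1D inputs, with an arbitrary illustrative lengthscale function $l(x)$ (our own choice, shorter on the left, longer on the right): points the same distance apart are much less correlated in the short-lengthscale region.

```python
import numpy as np

def gibbs_kernel(x, x_prime, l):
    """Gibbs' non-stationary covariance for 1D inputs; l maps x to a lengthscale."""
    lx, lxp = l(x), l(x_prime)
    prefactor = np.sqrt(2 * lx * lxp / (lx**2 + lxp**2))
    return prefactor * np.exp(-(x - x_prime) ** 2 / (lx**2 + lxp**2))

# Illustrative lengthscale function: short on the left, long on the right
l = lambda x: 0.1 + 0.5 * (np.tanh(x) + 1)

# Correlation between points 0.3 apart, in the short- vs long-lengthscale region
k_left = gibbs_kernel(-2.0, -1.7, l)
k_right = gibbs_kernel(2.0, 2.3, l)
print(k_left, k_right)
```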

Page 28

Gaussian Process Regression

Linear model: $y = w^T x + \varepsilon$. Non-linear model: $y = w^T \phi(x) + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$.

Both models follow a zero-mean Gaussian distribution:

$E\{y\} = E\{w^T x + \varepsilon\} = E\{w\}^T x + E\{\varepsilon\} = 0$, and likewise $E\{y\} = E\{w\}^T \phi(x) + E\{\varepsilon\} = 0$

They therefore predict $y = 0$ away from the datapoints! (SVR instead predicts $y = b$ away from the datapoints; see the exercise session.)
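This zero-mean behavior can be checked numerically with a generic RBF-kernel GPR mean predictor (our parameterization and names): far from the data, all kernel values vanish and the prediction falls back to 0 even though the data has mean 1.

```python
import numpy as np

def rbf(A, B, l=0.5):
    """RBF kernel matrix exp(-||a - b||^2 / l) between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / l)

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(40, 1))
y = 1.0 + np.sin(X[:, 0])          # data with non-zero mean

alpha = np.linalg.solve(rbf(X, X) + 0.01 * np.eye(40), y)

y_near = rbf(np.array([[0.0]]), X) @ alpha   # inside the data region: follows the data
y_far = rbf(np.array([[10.0]]), X) @ alpha   # far away: kernel values vanish -> y = 0

print(y_near, y_far)
```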

Page 29

Examples of application of GPR

Kronander, K. and Billard, A. (2013). Learning Compliant Manipulation through Kinesthetic and Tactile Human-Robot Interaction. IEEE Transactions on Haptics. doi: 10.1109/TOH.2013.54.

Striking a match requires careful control of the interaction force: one must push hard enough to light the match, but not so hard as to break it.

Page 30

Examples of application of GPR

The stiffness profile is encoded as a time-varying signal using GPR. The shaded area corresponds to the striking phase. We see that the stiffness must be decreased just before entering into contact, and again during contact when the match lights up. The stiffness can increase again when the robot moves back into free space.

Page 31

Examples of application of GPR

Building a 3D model of an object from tactile information can be useful to guide manipulation when the object is no longer visible.

Page 32

Examples of application of GPR

Point on the surface: $x \in \mathbb{R}^3$. Distance to the surface: $y$.

On the surface: $y = 0$. Outside the surface: $y > 0$. Inside the object: $y < 0$.

Learn a mapping $y = f(x)$ with GPR to determine how far one is from the surface.
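A sketch of how such training targets could be assembled for a toy object, a sphere of radius 1 (a hypothetical setup with our own names); the signed distance plays the role of $y$:

```python
import numpy as np

def signed_distance_sphere(x, radius=1.0):
    """y = 0 on the surface, y > 0 outside, y < 0 inside (sphere centered at origin)."""
    return np.linalg.norm(x, axis=-1) - radius

rng = np.random.default_rng(6)
points = rng.normal(size=(5, 3))          # sample 3D points around the object

y = signed_distance_sphere(points)        # training targets for the GPR mapping y = f(x)
on_surface = signed_distance_sphere(np.array([[1.0, 0.0, 0.0]]))
center = signed_distance_sphere(np.array([0.0, 0.0, 0.0]))
print(y, on_surface, center)
```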

Page 33

Examples of application of GPR

Normal to the surface: $z$.

Learn a mapping $z = g(x)$ with GPR to determine the normal to the surface (three GPRs are needed, one for each coordinate of the vector $z$).

The distance and normal to the surface can be used in an optimization framework to determine the optimal posture of robot fingers on the object.

Page 34

Examples of application of GPR

GPR can be used to model the shape of objects. Top: 3D points sampled either from a camera or from tactile sensing. Bottom: 3D shape reconstructed by GPR. The arrows represent the predicted normals at the surface. (El Khoury, S., Li, M., and Billard, A. (2013). On the generation of a variety of grasps. Robotics and Autonomous Systems, 61(12):1335–34.)

Page 35

Regression Algorithms in this Course

Support Vector Machine – Support vector regression
Relevance Vector Machine – Relevance vector regression
Boosting: random projections, random Gaussians – Gradient boosting
Random forest
Gaussian Process – Gaussian process regression
Locally weighted projected regression

Page 36

Regression Algorithms in this Course

Relevance Vector Machine

Boosting – random Gaussians

Gradient boosting

Page 37

Gradient Boosting

1. Choose some regression technique (any we have seen so far).
2. Apply boosting to train and combine the set of estimates $\hat{f}_1, \hat{f}_2, \dots, \hat{f}_m$.

Page 38

Gradient Boosting

3. Aggregate to get the final estimate: $\hat{f} = \frac{1}{m} \sum_{i=1}^{m} \hat{f}_i$
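A minimal sketch of gradient boosting with squared loss, using one-split regression stumps as the base technique (the stump implementation and all parameter values are our own simplifications; note this variant combines the learners with a shrinkage factor at each stage rather than the simple average shown above):

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression stump to 1D inputs x and targets r."""
    best = None
    for s in np.quantile(x, np.linspace(0.1, 0.9, 9)):
        left, right = r[x <= s].mean(), r[x > s].mean()
        err = ((np.where(x <= s, left, right) - r) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left, right)
    _, s, left, right = best
    return lambda q: np.where(q <= s, left, right)

def gradient_boost(x, y, n_rounds=50, lr=0.3):
    """With squared loss, each round fits the residuals (negative gradient)."""
    fs, pred = [], np.zeros_like(y)
    for _ in range(n_rounds):
        f = fit_stump(x, y - pred)        # fit residuals y - current prediction
        pred = pred + lr * f(x)
        fs.append(f)
    return lambda q: lr * sum(f(q) for f in fs)

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x)

model = gradient_boost(x, y)
err = np.abs(model(x) - y).mean()
print(err)   # small mean absolute training error
```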

Page 39

Gradient Boosting

Typical example of a dataset with imbalanced data. The data was hand-drawn: there are more datapoints in regions where the hand slowed down. As a result, the fit is very good in these regions and less good in other regions.

(Figure: gradient boosting using squared loss; a region with a high density of datapoints is highlighted.)

Page 40

Gradient Boosting

Better results, i.e. a smoother fit, are obtained when increasing the number of functions used for the fit.

(Figure: gradient boosting using squared loss with twice as many functions.)

Page 41

Summary

We have seen a few different techniques to perform non-linear regression in machine learning.

The techniques differ in their algorithms and in their number of hyperparameters.

Some techniques (GP, RVR) provide a measure of the uncertainty of the model, which can be used to determine when inference is trustworthy.

Some techniques (ν-SVR, RVR, LWPR) are designed to be computationally cheap at retrieval (very few support vectors, few models).

Other techniques (GP) are meant to provide a very accurate estimate of the data, at the cost of retaining all datapoints for retrieval.