
Bayesian Analysis for Machine Learning

AI Friends Seminar

Ganguk Hwang

Department of Mathematical Sciences, KAIST

January 2019

Bayesian Model for Linear Regression


The Standard Linear Model

Let D = {(x_i, y_i) | i = 1, · · · , n} be a training set of n observations, where x_i is an input vector of dimension D and y_i is a scalar output.

Let X = [x_1, · · · , x_n].


First, we will consider the standard linear regression model with Gaussian noise

f(x_i) = x_i^\top w, \qquad y_i = f(x_i) + \varepsilon_i

where \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2) and \{\varepsilon_i\}_{i=1}^n are independent.

Then, the likelihood can be computed as

p(y | X, w) = \prod_{i=1}^{n} p(y_i | x_i, w)
            = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\Big(-\frac{(y_i - x_i^\top w)^2}{2\sigma_n^2}\Big)
            = \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp\Big(-\frac{1}{2\sigma_n^2}\,|y - X^\top w|^2\Big),

that is, y | X, w \sim \mathcal{N}(X^\top w, \sigma_n^2 I).


In the Bayesian treatment, we need to specify a prior over the parameters. Suppose w \sim \mathcal{N}(0, \Sigma_p). By Bayes' rule,

p(w | y, X) = \frac{p(y | X, w)\, p(w, X)}{p(y, X)} = \frac{p(y | X, w)\, p(w)}{p(y | X)}.

Here, p(y | X) = \int p(y | X, w)\, p(w)\, dw is the normalizing constant (it is called the marginal likelihood).

Since it is just a constant, we can neglect p(y | X) when we compute the posterior p(w | X, y).


p(w | X, y) \propto p(y | X, w)\, p(w)
           \propto \exp\Big(-\frac{1}{2\sigma_n^2}(y - X^\top w)^\top (y - X^\top w)\Big) \exp\Big(-\frac{1}{2} w^\top \Sigma_p^{-1} w\Big)

Here,

\frac{1}{\sigma_n^2}(y - X^\top w)^\top (y - X^\top w) + w^\top \Sigma_p^{-1} w = w^\top A w - B w - w^\top C + D

where A = \frac{1}{\sigma_n^2} X X^\top + \Sigma_p^{-1}, \; B = \frac{1}{\sigma_n^2}(X y)^\top, \; C = \frac{1}{\sigma_n^2} X y, \; D = \frac{1}{\sigma_n^2} y^\top y.


Observe that

(w - \bar{w})^\top A (w - \bar{w}) = w^\top A w - \bar{w}^\top A w - w^\top A \bar{w} + \bar{w}^\top A \bar{w}.

Now, set \bar{w} = A^{-1} C. Then

\bar{w} = \frac{1}{\sigma_n^2} A^{-1} X y, \qquad \bar{w}^\top A = \frac{1}{\sigma_n^2}(X y)^\top = B.

By neglecting the constant term, we get

p(w | X, y) \propto \exp\Big(-\frac{1}{2}(w - \bar{w})^\top A (w - \bar{w})\Big).

In other words,

w | X, y \sim \mathcal{N}\Big(\frac{1}{\sigma_n^2} A^{-1} X y,\; A^{-1}\Big).


Gaussian Identities

Theorem 1

The product of two Gaussians gives another Gaussian:

\mathcal{N}(x | a, A)\, \mathcal{N}(x | b, B) = Z^{-1} \mathcal{N}(x | c, C)

where

c = C(A^{-1} a + B^{-1} b), \qquad C = (A^{-1} + B^{-1})^{-1}, \quad \text{and}

Z^{-1} = (2\pi)^{-D/2} |A + B|^{-1/2} \exp\Big(-\frac{1}{2}(a - b)^\top (A + B)^{-1}(a - b)\Big).

Here, Z is just the normalizing constant.
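The identity in Theorem 1 is easy to check numerically. Below is a small sketch (assuming NumPy and SciPy are available; the vectors a, b and matrices A, B are arbitrary choices) that compares both sides at a random point.

```python
# Numerical check of Theorem 1: N(x|a,A) N(x|b,B) = Z^{-1} N(x|c,C)
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = 2
a, b = rng.normal(size=D), rng.normal(size=D)
A = np.array([[2.0, 0.3], [0.3, 1.0]])    # any symmetric positive-definite matrices
B = np.array([[1.5, -0.2], [-0.2, 0.8]])

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))
Z_inv = (2 * np.pi) ** (-D / 2) * np.linalg.det(A + B) ** (-0.5) \
        * np.exp(-0.5 * (a - b) @ np.linalg.solve(A + B, a - b))

x = rng.normal(size=D)                    # check the identity at an arbitrary point
lhs = multivariate_normal.pdf(x, a, A) * multivariate_normal.pdf(x, b, B)
rhs = Z_inv * multivariate_normal.pdf(x, c, C)
print(lhs, rhs)                           # the two values agree
```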


The Standard Linear Model: Predictive distribution

Definition 1

The (posterior) predictive distribution is the distribution of possible unobserved values (test data) conditional on the observed values (training data).

To make a prediction for a test point, we use the predictive distribution. Let x_* be a test case and f_* = f(x_*) = x_*^\top w. Then, the predictive distribution for f_* at x_* is given as

f_* | x_*, X, y \sim \mathcal{N}\Big(\frac{1}{\sigma_n^2} x_*^\top A^{-1} X y,\; x_*^\top A^{-1} x_*\Big).

We use the mean of the predictive distribution as our estimator of f(x_*).
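As a concrete illustration, here is a minimal NumPy sketch of the posterior and predictive computations above; the synthetic data and the choices of sigma_n and Sigma_p are illustrative only, not taken from the slides.

```python
# Bayesian linear regression: posterior N(w_bar, A^{-1}) and predictive at x*
import numpy as np

rng = np.random.default_rng(1)
D_dim, n = 3, 50
X = rng.normal(size=(D_dim, n))               # inputs stored column-wise, X = [x_1, ..., x_n]
w_true = np.array([0.5, -1.0, 2.0])
sigma_n = 0.1                                  # illustrative noise level
y = X.T @ w_true + sigma_n * rng.normal(size=n)

Sigma_p = np.eye(D_dim)                        # prior: w ~ N(0, Sigma_p)
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
w_bar = np.linalg.solve(A, X @ y) / sigma_n**2       # posterior mean (1/sigma_n^2) A^{-1} X y
post_cov = np.linalg.inv(A)                          # posterior covariance A^{-1}

x_star = rng.normal(size=D_dim)                # a test input
f_mean = x_star @ w_bar                        # predictive mean
f_var = x_star @ post_cov @ x_star             # predictive variance x*^T A^{-1} x*
print(f_mean, f_var)
```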


The predictive distribution is also Gaussian from our previous derivation.

The predictive variance x_*^\top A^{-1} x_* is a quadratic form in the test input with the posterior covariance matrix. It implies that the predictive uncertainty grows with the magnitude of the test input.


Projections of Inputs into Feature Space

Consider a function φ : R^D → R^N which maps an input x into an N-dimensional feature space.

Let Φ(X) = [φ(x_1), · · · , φ(x_n)].

The model is now given as

f(x_i) = \phi(x_i)^\top w, \qquad y_i = f(x_i) + \varepsilon_i

with \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2) independent.

This model linearly approximates the outputs using feature functions, which form a basis of a high-dimensional space.


Why feature functions?

Figure: Feature function


Projections of Inputs into Feature Space: The predictive distribution

The analysis for this model is analogous to the standard linear model. Here, X and x_* are replaced by Φ(X) and φ(x_*). For simplicity, we write Φ = Φ(X), φ_* = φ(x_*). Then, the predictive distribution becomes

f_* | x_*, X, y \sim \mathcal{N}\Big(\frac{1}{\sigma_n^2} \phi_*^\top A^{-1} \Phi y,\; \phi_*^\top A^{-1} \phi_*\Big)

where A = \frac{1}{\sigma_n^2} \Phi \Phi^\top + \Sigma_p^{-1}.

Alternatively, we can rewrite the predictive distribution in the following way:

f_* | x_*, X, y \sim \mathcal{N}\Big(\phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} y,\; \phi_*^\top \big(\Sigma_p - \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p\big) \phi_*\Big)

where K = \Phi^\top \Sigma_p \Phi (which is related to Gaussian process regression).
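The two forms above can be checked against each other numerically. The sketch below uses a hand-picked feature map φ(x) = (1, x, x²) and illustrative data; both the weight-space and kernel-space expressions give the same predictive mean and variance.

```python
# Weight-space vs. kernel-space form of the predictive distribution
import numpy as np

def phi(x):
    return np.array([1.0, x, x**2])          # illustrative feature map

rng = np.random.default_rng(2)
x_train = rng.uniform(-3, 3, size=20)
sigma_n = 0.2
y = x_train * np.cos(x_train) + sigma_n * rng.normal(size=20)

Phi = np.stack([phi(x) for x in x_train], axis=1)   # N x n matrix [phi(x_1), ..., phi(x_n)]
Sigma_p = np.eye(3)
phi_s = phi(1.3)                                     # phi(x*) at a test point

# weight-space form
A = Phi @ Phi.T / sigma_n**2 + np.linalg.inv(Sigma_p)
mean_w = phi_s @ np.linalg.solve(A, Phi @ y) / sigma_n**2
var_w = phi_s @ np.linalg.solve(A, phi_s)

# kernel form with K = Phi^T Sigma_p Phi
K = Phi.T @ Sigma_p @ Phi
M = K + sigma_n**2 * np.eye(len(y))
mean_k = phi_s @ Sigma_p @ Phi @ np.linalg.solve(M, y)
var_k = phi_s @ (Sigma_p - Sigma_p @ Phi @ np.linalg.solve(M, Phi.T @ Sigma_p)) @ phi_s

print(mean_w, mean_k)   # identical up to rounding
print(var_w, var_k)
```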


Regression Examples

Examples - Introduction

There are two examples that use GP regression for prediction.

The first one is a noise-free case where the objective function is given as f(x) = x cos(x).

The second one is a noisy case and the objective function is not given.

The training data for the second example are the numbers of international airline passengers in the USA per month, which are commonly used to test the performance of a regression method.


Python 3.5 is used to implement the examples. We use the following libraries:

1. NumPy: it makes vector computation in Python easy.

2. Scikit-learn (sklearn) GaussianProcessRegressor and linear_model: the first performs GP regression by optimizing the log-marginal likelihood with respect to the hyper-parameters. The second is used to fit a linear regression, which is needed to predict the mean function of the test data.

There are also good tools for GP regression such as gpytorch and gpflow.


Noise-free case

In this example, the objective function is given as f(x) = x cos(x). The training data are given as {(x_1, f(x_1)), · · · , (x_n, f(x_n))}, where x_1, · · · , x_n are randomly chosen. Here, we use the squared exponential covariance function

k(x_i, x_j) = \sigma_f^2 \exp\Big(-\frac{(x_i - x_j)^2}{2\ell^2}\Big).

The following figure shows the GP prediction obtained from the randomly chosen points on y = f(x).


Figure: Noise-free Case


The red dotted line is our objective function. The blue line is our GP regressor.

Hyper-parameter tuning is done by optimizing the log-marginal likelihood function using a numerical optimization method.

The above figure is obtained from the optimized kernel

k(x_i, x_j) = 8.06^2 \exp\Big(-\frac{(x_i - x_j)^2}{2 (2.19)^2}\Big)

with the log-marginal likelihood value = −13.67.

In this case the GP regressor is very accurate.
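For reference, a sketch of this example with scikit-learn's GaussianProcessRegressor is given below; the training locations and the initial kernel hyper-parameters are illustrative choices, and the optimized kernel and log-marginal likelihood are produced by the fit.

```python
# Noise-free GP regression of f(x) = x cos(x) with a squared exponential kernel
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

def f(x):
    return x * np.cos(x)

rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(0, 10, size=15)).reshape(-1, 1)   # randomly chosen inputs
y_train = f(X_train).ravel()

# squared exponential kernel sigma_f^2 * exp(-(x - x')^2 / (2 l^2))
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train)

X_test = np.linspace(0, 10, 200).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)   # GP predictive mean and std
print(gpr.kernel_)                                 # optimized kernel
print(gpr.log_marginal_likelihood_value_)
```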


Noisy case

This example deals with real data: the numbers of international airline passengers in the USA from January 1949 to December 1960.

Here, we use the first 85 percent of the data as the training set and the remaining data as a validation set. The validation set is used to test the estimator.

Here, we use the following covariance function

k(x_i, x_j) = \sigma_1^2 \big(x_i \cdot x_j + \sigma_2^2\big) \exp\Big(-\frac{2 \sin^2\big(\pi (x_i - x_j)/\sigma_3\big)}{\sigma_4^2}\Big)

and for noisy data we use

k_y(x_i, x_j) = k(x_i, x_j) + \sigma_5^2 \delta_{ij}.
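In scikit-learn, a kernel with this structure can be built from a DotProduct (linear) term, an ExpSineSquared (periodic) term, and a WhiteKernel for the noise. The sketch below only shows the construction, with guessed initial hyper-parameters, and assumes the centred training data X_train, y_train have been prepared separately.

```python
# Linear x periodic kernel plus white noise, as a scikit-learn kernel object
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (ConstantKernel, DotProduct,
                                              ExpSineSquared, WhiteKernel)

kernel = (ConstantKernel(1.0) * DotProduct(sigma_0=1.0)
          * ExpSineSquared(length_scale=1.0, periodicity=1.0)
          + WhiteKernel(noise_level=1.0))

# X_train: month index (n x 1), y_train: centred passenger counts (see the
# moving-average centring discussed below) -- assumed to be prepared already.
# gpr = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train)
# print(gpr.kernel_, gpr.log_marginal_likelihood_value_)
```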


Figure: Noisy Case


The red line is our validation set (equivalently called the test data) and the blue line is our training data. The black line is our GP regressor.

The above figure is obtained from the optimized kernel

k_y(x_i, x_j) = (0.94)^2 \big(x_i \cdot x_j + 10^{-5}\big) \exp\Big(-\frac{2 \sin^2\big(\pi (x_i - x_j)/0.902\big)}{(0.255)^2}\Big) + 279\, \delta_{ij}

with the log-marginal likelihood value = −539.31.


Remarks

There are some remarks.

In general, the log-marginal likelihood function may not be concave with respect to its hyper-parameters. So, there may be many local optima, and gradient-based optimization methods may fall into a local optimum.

In GP regression, choosing a proper covariance function is the main issue. For both examples, we chose our covariance functions heuristically, so there may exist other covariance functions that make the GP regressor more accurate.


Recall that our GP regression is based on a GP with zero mean function. So, we need to centralize our training data. In our second example, the training data are centralized by subtracting their moving average

MA(x_i) = \frac{x_{i-[k/2]} + \cdots + x_{i+[k/2]}}{k} \quad \text{(we use } k = 9\text{)}.
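A sketch of this centring step (assuming the raw series is a 1-d NumPy array; the edge handling is one possible choice, not specified on the slides):

```python
# Centre a series by subtracting its moving average (window k = 9 as on the slide)
import numpy as np

def centre_by_moving_average(y, k=9):
    pad = k // 2
    padded = np.pad(y, pad, mode="edge")                  # extend the edges
    ma = np.convolve(padded, np.ones(k) / k, mode="valid")  # centred moving average
    return y - ma, ma

# y_centred, trend = centre_by_moving_average(y_raw)
```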

There are other centralization methods, such as subtracting the sample mean. You may get better results by choosing another window size k.

Your predictor should not be trained on any information from the validation set.


Bayesian Models for Classification


Classification

We consider classification problems.

x : data, y : the class label

When we consider p(y,x), we have two approaches by Bayes’ theorem.

The generative approach considers p(y,x) = p(x|y)p(y).

The discriminative approach focuses on modeling p(y|x) directly.


Classification Problems: The generative approach

For y = C_1, C_2, . . . , C_C, the posterior probability of each class is

p(y | x) = \frac{p(y, x)}{p(x)} = \frac{p(x | y)\, p(y)}{\sum_{c=1}^{C} p(C_c)\, p(x | C_c)}.

A simple and common choice for the class-conditional density is

p(x | C_c) = \mathcal{N}(\mu_c, \Sigma_c).

However, it is unclear whether this choice is appropriate.


Classification Problems: The discriminative approach

Basic idea

x --(GP)--> f(x) --(σ)--> σ(f(x))

For the binary case, we usually use a response function σ(z) which squashes its argument into [0, 1], guaranteeing a valid probabilistic interpretation.

The response function σ(z) can be any sigmoid function. (A sigmoid function is a monotonically increasing function mapping R to [0, 1].)


Two examples of response functions are

Linear logistic regression model:

p(C_i | x) = \lambda(w^\top x), \quad \text{where } \lambda(z) = \frac{1}{1 + \exp(-z)}

Probit regression model:

p(C_i | x) = \Phi(w^\top x), \quad \text{where } \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt.

From now on we only consider the discriminative approach.


Classification Problems: Decision Theory for Classification

Let L(c, c') be the loss incurred by making decision c' if the true class is C_c. Usually L(c, c') = 0 when c = c'. Then the expected loss is

R_L(c' | x) = \sum_{c} L(c, c')\, p(C_c | x).

The optimal class is

c^* = \operatorname*{argmin}_{c} R_L(c | x).

One common choice for the loss function is the zero-one loss, i.e.,

L(c, c') = 1 - \delta_{cc'}.
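A small numerical illustration of this decision rule (the class posteriors below are made up):

```python
# Pick the class minimizing the expected loss R_L(c' | x) = sum_c L(c, c') p(C_c | x)
import numpy as np

p_class = np.array([0.2, 0.5, 0.3])   # p(C_c | x) for three classes (illustrative)
L = 1.0 - np.eye(3)                   # zero-one loss L(c, c') = 1 - delta_{cc'}

risk = L.T @ p_class                  # expected loss for each decision c'
c_star = np.argmin(risk)              # under zero-one loss this is argmax_c p(C_c | x)
print(risk, c_star)
```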


Linear Models for Classification

x --(w ∼ N)--> w^\top x --(σ)--> σ(w^\top x)

As we did in regression, we start with a linear model (f(x) = w^\top x).

Let the labels and the likelihood be

y = ±1, \qquad p(y = 1 | x, w) = \sigma(w^\top x),

and we use the sigmoid function as the response function.

When σ(z) = λ(z), the model is called a linear logistic regression. When σ(z) = Φ(z), the model is called a linear probit regression.


For symmetric σ(z), i.e., σ(−z) = 1− σ(z),

p(y_i | x_i, w) = \sigma(y_i f_i),

where f_i = f(x_i) = w^\top x_i, because

p(y_i = 1 | x_i, w) = \sigma(w^\top x_i) = \sigma(f_i),
p(y_i = -1 | x_i, w) = 1 - \sigma(w^\top x_i) = 1 - \sigma(f_i) = \sigma(-f_i).

So we can write p(y|x,w) consistently regardless of the value of y.


Remark: The logit transformation is defined as

\mathrm{logit}(x) = \log \frac{p(y = 1 | x)}{p(y = -1 | x)}.

For a linear logistic regression, \mathrm{logit}(x) = w^\top x.


In binary classification, what we want to do is predict y conditional on x. To do this we assume the existence of a function a(x) which models the logit as a function of x. Thus

P (y = 1|x, a(x)) = σ(a(x)).

There are two approaches to complete this model.

One way is to consider a(x) as a parametrized function a(x; w) and give w a prior, which is described below.

The other approach is to model a(x) using a Gaussian process, which is omitted here.


Given a dataset D = {(x_i, y_i) | i = 1, . . . , n}, we assume that the labels are generated independently, conditional on f(x) = w^\top x. Using the Gaussian prior w \sim \mathcal{N}(0, \Sigma_p),

p(w | D, y) \propto p(y | w, D)\, p(w | D) = \prod_{i=1}^{n} \sigma(y_i w^\top x_i) \exp\Big(-\frac{1}{2} w^\top \Sigma_p^{-1} w\Big).

So

\log p(w | D, y) = \text{const} - \frac{1}{2} w^\top \Sigma_p^{-1} w + \sum_{i=1}^{n} \log \sigma(y_i w^\top x_i).


Linear Models for Classification: Maximum Likelihood

For σ(z) = λ(z), the log posterior is a concave function of w for fixed D.

Proposition 1

f(w) = -\frac{1}{2} w^\top \Sigma_p^{-1} w is concave in w.

Proposition 2

\log \lambda(y w^\top x) is concave in w, where \lambda(z) = \frac{1}{1 + \exp(-z)}.

For σ(z) = Φ(z), the log posterior is also concave.

Proposition 3

\log \Phi(y w^\top x) is concave in w.


Since the log posterior is a concave function of w, it is relatively easy to find its unique maximum. We can use Newton's method to find the maximum.
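A sketch of Newton's method for the MAP weights of the linear logistic model is given below; the synthetic data, the prior Σ_p = I, and the fixed number of iterations are illustrative choices, not the slides' settings.

```python
# Newton's method for the MAP estimate of w in linear logistic regression
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n, d = 200, 2
X = rng.normal(size=(n, d))
y = np.where(X @ np.array([1.5, -2.0]) + 0.3 * rng.normal(size=n) > 0, 1, -1)

Sigma_p_inv = np.eye(d)              # prior precision (Sigma_p = I assumed)
w = np.zeros(d)
for _ in range(20):                  # Newton iterations on the concave log posterior
    f = X @ w
    grad = X.T @ ((1 - sigmoid(y * f)) * y) - Sigma_p_inv @ w
    s = sigmoid(f) * (1 - sigmoid(f))                # lambda(f_i)(1 - lambda(f_i))
    H = -(X.T * s) @ X - Sigma_p_inv                 # Hessian of the log posterior
    w = w - np.linalg.solve(H, grad)
print(w)
```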


Linear Models for Classification: Predictions

To make predictions based on the training set D for a test point x_*, we have

p(y_* = 1 | x_*, D) = \int p(y_* = 1 | w, x_*)\, p(w | D)\, dw.
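One simple way to approximate this integral is Monte Carlo: draw samples of w from (an approximation of) the posterior and average σ(w^T x_*). The sketch below assumes a Gaussian approximation to p(w | D) with a given mean w_map and covariance S; the numbers are made up.

```python
# Monte Carlo approximation of the predictive probability p(y* = 1 | x*, D)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predictive_prob(x_star, w_map, S, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(w_map, S, size=n_samples)   # samples w ~ p(w | D) (approx.)
    return sigmoid(W @ x_star).mean()                       # average of p(y* = 1 | w, x*)

# Example with made-up posterior moments:
p = predictive_prob(np.array([1.0, -0.5]),
                    w_map=np.array([1.2, -1.8]),
                    S=0.1 * np.eye(2))
print(p)
```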


In the multi-class case, we use the softmax function

p(y = C_c | x, W) = \frac{\exp(x^\top w_c)}{\sum_{c'} \exp(x^\top w_{c'})}

where w_c is the weight vector for class c, and W = (w_1, . . . , w_C).

The corresponding log likelihood is of the form

\sum_{i=1}^{n} \sum_{c=1}^{C} \delta_{c, y_i} \Big[x_i^\top w_c - \log\Big(\sum_{c'} \exp(x_i^\top w_{c'})\Big)\Big],

which is also a concave function of W.
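As a small sketch, the log likelihood above can be written in a few lines of NumPy; the data matrix X, the integer labels y in {0, ..., C-1}, and the weight matrix W are illustrative placeholders.

```python
# Multi-class softmax log likelihood sum_i [x_i^T w_{y_i} - log sum_c' exp(x_i^T w_c')]
import numpy as np
from scipy.special import logsumexp

def softmax_log_likelihood(W, X, y):
    scores = X @ W                                   # scores[i, c] = x_i^T w_c
    return np.sum(scores[np.arange(len(y)), y] - logsumexp(scores, axis=1))
```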


In Gaussian Process Classification, we use Gaussian processes and there are two approaches:

Laplace Approximation

Expectation Propagation

For the details, please refer to

C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.


Classification Examples

Example - Introduction

There are two examples that use GP classification.

The first one is just a toy problem from our textbook, pages 60 and 61.

The second one is a multi-class classification which classifies the types of iris. The training data for the second example are the iris data set, which is commonly used to test the performance of classification methods.


Python 3.5 is also used to implement the examples. We use the following libraries:

1. NumPy: it makes vector computation in Python easy.

2. Scikit-learn (sklearn) GaussianProcessClassifier: it performs GP classification by optimizing the log-marginal likelihood with respect to the hyper-parameters. It uses the logistic response function and the Laplace approximation to handle the non-Gaussian likelihood.


A Toy Problem

15 data points are randomly chosen in the square [0, 1]^2. The 2 classes are labeled as x (−1) and o (+1). We use the squared exponential covariance function

k(x_i, x_j) = \sigma^2 \exp\Big(-\frac{\|x_i - x_j\|^2}{2\ell^2}\Big).

The following figure shows the contour plots of the predictive probability E_q[\pi(x_*) | f] and the training data points.


Figure: Classification results


Here, the test data are all points in [0, 1]^2.

The value in each contour denotes the predictive probability of class +1.

As in GP regression, hyper-parameter tuning is done by optimizing the approximated log-marginal likelihood function.

The above figure is obtained from the optimized kernel

k(x_i, x_j) = 1.64^2 \exp\Big(-\frac{\|x_i - x_j\|^2}{2 (0.116)^2}\Big)

with the log-marginal likelihood value = −10.142.
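A sketch of this toy problem with scikit-learn's GaussianProcessClassifier (which uses the logistic response function and the Laplace approximation) is given below; the random points and the labeling rule are illustrative stand-ins for the 15 hand-placed points on the slide.

```python
# Binary GP classification on 15 random points in [0, 1]^2
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(15, 2))                     # points in [0, 1]^2
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 1, 1, -1)  # illustrative +1 / -1 labels

kernel = ConstantKernel(1.0) * RBF(length_scale=0.5)          # squared exponential kernel
gpc = GaussianProcessClassifier(kernel=kernel).fit(X_train, y_train)

# predictive probability of class +1 on a grid of test points
xx, yy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.c_[xx.ravel(), yy.ravel()]
proba = gpc.predict_proba(grid)[:, 1].reshape(xx.shape)
print(gpc.kernel_, gpc.log_marginal_likelihood_value_)
```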


Classification of iris data: Introduction

This example deals with real data on the types of irises. There are 3 main types of iris: setosa, versicolor, and virginica. The iris data set is composed of 150 samples of irises with their types and the lengths and widths of their sepals and petals. The 3 classes, 'setosa', 'versicolor', and 'virginica', are labeled as red, green, and blue.

We want to classify a given iris by the lengths and widths of its sepal and petal using GP classification.


Since each sample has 4 features (the lengths and widths of the sepal and petal), it may be hard to give a graphical demonstration. So, we select 2 features: the width and length of the sepal.

Next, to test our classifier, we divide the data set into 2 parts of size 100 and 50: 100 samples are used to train our model, and 50 samples are used for validation.

We again use the squared exponential covariance function with a noise term

k_y(x_i, x_j) = \sigma_1^2 \exp\Big(-\frac{\|x_i - x_j\|^2}{2\ell^2}\Big) + \sigma_2^2 \delta_{ij}.

The following figure shows the locations of the training data and the classification results with respect to the widths and lengths of the sepals.


Figure: Classification with the widths and lengths of sepals


Here, the test data are all points in the 2d box

[min(sepal length) − 1, max(sepal length) + 1] × [min(sepal width) − 1, max(sepal width) + 1],

where the min and max are taken over the training data.

The above figure is obtained from the optimized kernel

k_y(x_i, x_j) = (8.84)^2 \exp\Big(-\frac{\|x_i - x_j\|^2}{2 (2.71)^2}\Big) + (2.66)^2 \delta_{ij}

with the log-marginal likelihood value = −31.892.
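A sketch of the sepal-feature experiment with scikit-learn is given below; the random 100/50 split is an assumption, so the fitted kernel and error rate will not exactly match the numbers above.

```python
# GP classification of the iris data using sepal length and width only
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                  # sepal length and sepal width only

X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=100, random_state=0)

# squared exponential kernel plus a white-noise term
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpc = GaussianProcessClassifier(kernel=kernel).fit(X_tr, y_tr)

print(gpc.kernel_)
print("validation error rate:", 1.0 - gpc.score(X_val, y_val))
```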


From the above figure, we can say that our GP classifier works well for classifying setosa. However, it does not work well in classifying versicolor and virginica. Actually, its average error rate on the validation set is 20.5 percent. Does this mean that our GP classification works poorly? Let's consider the following figure.


Figure: Comparison between two different feature sets


The right figure shows the locations of the training data with respect to the widths and lengths of the petals. By comparing both, we can observe that the widths and lengths of the petals are a better feature set for learning the GP classifier. By using the new features we get better results, which are provided below.


Figure: Classification with the widths and lengths of petals


The test data are all points in the 2d box

[min(petal length) − 1, max(petal length) + 1] × [min(petal width) − 1, max(petal width) + 1],

where the min and max are taken over the training data.

The above figure is obtained from the optimized kernel

k_y(x_i, x_j) = (7.77)^2 \exp\Big(-\frac{\|x_i - x_j\|^2}{2 (4.49)^2}\Big) + (5.55)^2 \delta_{ij}

with the log-marginal likelihood value = −10.560, which is much larger than in the previous case. Moreover, its average error rate on the validation set is only 4 percent.


References

The iris data: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris

K.P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.

C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
