
Bayesian Analysis for Machine Learning

AI Friends Seminar

Ganguk Hwang

Department of Mathematical Sciences, KAIST

January 2019

Bayesian Model for Linear Regression


The Standard Linear Model

Let D = {(x_i, y_i) | i = 1, · · · , n} be a training set of n observations, where x_i is an input vector of dimension D and y_i is a scalar output.

Let X = [x_1, · · · , x_n].


First, we will consider the standard linear regression model with Gaussian noise

f(x_i) = x_i^\top w, \qquad y_i = f(x_i) + \varepsilon_i

where \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2) and \{\varepsilon_i\}_{i=1}^n are independent.

Then, the likelihood can be computed as

p(y | X, w) = \prod_{i=1}^{n} p(y_i | x_i, w)
            = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\Big(-\frac{(y_i - x_i^\top w)^2}{2\sigma_n^2}\Big)
            = \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp\Big(-\frac{1}{2\sigma_n^2}\,|y - X^\top w|^2\Big),

that is, y | X, w \sim \mathcal{N}(X^\top w, \sigma_n^2 I).


In the Bayesian treatment, we need to specify a prior over the parameters. Suppose w \sim \mathcal{N}(0, \Sigma_p). By Bayes' rule,

p(w | y, X) = \frac{p(y | X, w)\, p(w, X)}{p(y, X)} = \frac{p(y | X, w)\, p(w)}{p(y | X)}.

Here, p(y | X) = \int p(y | X, w)\, p(w)\, dw is the normalizing constant (it is called the marginal likelihood).

Since it is just a constant, we can neglect p(y | X) when we compute the posterior p(w | X, y).


p(w | X, y) \propto p(y | X, w)\, p(w)
           \propto \exp\Big(-\frac{1}{2\sigma_n^2}(y - X^\top w)^\top (y - X^\top w)\Big) \exp\Big(-\frac{1}{2} w^\top \Sigma_p^{-1} w\Big)

Here,

\frac{1}{\sigma_n^2}(y - X^\top w)^\top (y - X^\top w) + w^\top \Sigma_p^{-1} w = w^\top A w - B w - w^\top C + D

where A = \frac{1}{\sigma_n^2} X X^\top + \Sigma_p^{-1}, \; B = \frac{1}{\sigma_n^2}(X y)^\top, \; C = \frac{1}{\sigma_n^2} X y, \; D = \frac{1}{\sigma_n^2} y^\top y.


Observe that

(w - \bar{w})^\top A (w - \bar{w}) = w^\top A w - \bar{w}^\top A w - w^\top A \bar{w} + \bar{w}^\top A \bar{w}.

Now, set \bar{w} = A^{-1} C. Then

\bar{w} = \frac{1}{\sigma_n^2} A^{-1} X y, \qquad \bar{w}^\top A = \frac{1}{\sigma_n^2}(X y)^\top = B.

By neglecting the constant term, we get

p(w | X, y) \propto \exp\Big(-\frac{1}{2}(w - \bar{w})^\top A (w - \bar{w})\Big).

In other words,

w | X, y \sim \mathcal{N}\Big(\frac{1}{\sigma_n^2} A^{-1} X y,\; A^{-1}\Big).


Gaussian Identities

Theorem 1

The product of two Gaussians gives another Gaussian:

\mathcal{N}(x | a, A)\, \mathcal{N}(x | b, B) = Z^{-1} \mathcal{N}(x | c, C)

where

c = C(A^{-1} a + B^{-1} b), \qquad C = (A^{-1} + B^{-1})^{-1}, \quad \text{and}

Z^{-1} = (2\pi)^{-D/2} |A + B|^{-1/2} \exp\Big(-\frac{1}{2}(a - b)^\top (A + B)^{-1}(a - b)\Big).

Here, Z is just the normalizing constant.
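The identity in Theorem 1 is easy to check numerically. Below is a small sketch (assuming NumPy and SciPy are available; the vectors a, b and matrices A, B are arbitrary choices) that compares both sides at a random point.

```python
# Numerical check of Theorem 1: N(x|a,A) N(x|b,B) = Z^{-1} N(x|c,C)
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = 2
a, b = rng.normal(size=D), rng.normal(size=D)
A = np.array([[2.0, 0.3], [0.3, 1.0]])    # any symmetric positive-definite matrices
B = np.array([[1.5, -0.2], [-0.2, 0.8]])

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))
Z_inv = (2 * np.pi) ** (-D / 2) * np.linalg.det(A + B) ** (-0.5) \
        * np.exp(-0.5 * (a - b) @ np.linalg.solve(A + B, a - b))

x = rng.normal(size=D)                    # check the identity at an arbitrary point
lhs = multivariate_normal.pdf(x, a, A) * multivariate_normal.pdf(x, b, B)
rhs = Z_inv * multivariate_normal.pdf(x, c, C)
print(lhs, rhs)                           # the two values agree
```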


The Standard Linear Model: Predictive distribution

Definition 1

The (posterior) predictive distribution is the distribution of possible unobserved values (test data) conditional on the observed values (training data).

To make a prediction for a test point, we use the predictive distribution. Let x_* be a test case and f_* = f(x_*) = x_*^\top w. Then, the predictive distribution for f_* at x_* is given as

f_* | x_*, X, y \sim \mathcal{N}\Big(\frac{1}{\sigma_n^2} x_*^\top A^{-1} X y,\; x_*^\top A^{-1} x_*\Big).

We use the mean of the predictive distribution as our estimator of f(x_*).
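As a concrete illustration, here is a minimal NumPy sketch of the posterior and predictive computations above; the synthetic data and the choices of sigma_n and Sigma_p are illustrative only, not taken from the slides.

```python
# Bayesian linear regression: posterior N(w_bar, A^{-1}) and predictive at x*
import numpy as np

rng = np.random.default_rng(1)
D_dim, n = 3, 50
X = rng.normal(size=(D_dim, n))               # inputs stored column-wise, X = [x_1, ..., x_n]
w_true = np.array([0.5, -1.0, 2.0])
sigma_n = 0.1                                  # illustrative noise level
y = X.T @ w_true + sigma_n * rng.normal(size=n)

Sigma_p = np.eye(D_dim)                        # prior: w ~ N(0, Sigma_p)
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
w_bar = np.linalg.solve(A, X @ y) / sigma_n**2       # posterior mean (1/sigma_n^2) A^{-1} X y
post_cov = np.linalg.inv(A)                          # posterior covariance A^{-1}

x_star = rng.normal(size=D_dim)                # a test input
f_mean = x_star @ w_bar                        # predictive mean
f_var = x_star @ post_cov @ x_star             # predictive variance x*^T A^{-1} x*
print(f_mean, f_var)
```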


The predictive distribution is also Gaussian from our previous derivation.

The predictive variance x_*^\top A^{-1} x_* is a quadratic form in the test input with the posterior covariance matrix. It implies that the predictive uncertainty grows with the magnitude of the test input.


Projections of Inputs into Feature Space

Consider a function φ : R^D → R^N which maps an input x into an N-dimensional feature space.

Let Φ(X) = [φ(x_1), · · · , φ(x_n)].

The model is now given as

f(x_i) = \phi(x_i)^\top w, \qquad y_i = f(x_i) + \varepsilon_i

with \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2) independent.

This model linearly approximates the outputs using feature functions, which form a basis of a high-dimensional space.


Why feature functions?

Figure: Feature function


Projections of Inputs into Feature Space: The predictive distribution

The analysis for this model is analogous to the standard linear model. Here, X and x_* are replaced by Φ(X) and φ(x_*). For simplicity, we write Φ = Φ(X), φ_* = φ(x_*). Then, the predictive distribution becomes

f_* | x_*, X, y \sim \mathcal{N}\Big(\frac{1}{\sigma_n^2} \phi_*^\top A^{-1} \Phi y,\; \phi_*^\top A^{-1} \phi_*\Big)

where A = \frac{1}{\sigma_n^2} \Phi \Phi^\top + \Sigma_p^{-1}.

Alternatively, we can rewrite the predictive distribution in the following way:

f_* | x_*, X, y \sim \mathcal{N}\Big(\phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} y,\; \phi_*^\top \big(\Sigma_p - \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p\big) \phi_*\Big)

where K = \Phi^\top \Sigma_p \Phi (which is related to Gaussian process regression).
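The two forms above can be checked against each other numerically. The sketch below uses a hand-picked feature map φ(x) = (1, x, x²) and illustrative data; both the weight-space and kernel-space expressions give the same predictive mean and variance.

```python
# Weight-space vs. kernel-space form of the predictive distribution
import numpy as np

def phi(x):
    return np.array([1.0, x, x**2])          # illustrative feature map

rng = np.random.default_rng(2)
x_train = rng.uniform(-3, 3, size=20)
sigma_n = 0.2
y = x_train * np.cos(x_train) + sigma_n * rng.normal(size=20)

Phi = np.stack([phi(x) for x in x_train], axis=1)   # N x n matrix [phi(x_1), ..., phi(x_n)]
Sigma_p = np.eye(3)
phi_s = phi(1.3)                                     # phi(x*) at a test point

# weight-space form
A = Phi @ Phi.T / sigma_n**2 + np.linalg.inv(Sigma_p)
mean_w = phi_s @ np.linalg.solve(A, Phi @ y) / sigma_n**2
var_w = phi_s @ np.linalg.solve(A, phi_s)

# kernel form with K = Phi^T Sigma_p Phi
K = Phi.T @ Sigma_p @ Phi
M = K + sigma_n**2 * np.eye(len(y))
mean_k = phi_s @ Sigma_p @ Phi @ np.linalg.solve(M, y)
var_k = phi_s @ (Sigma_p - Sigma_p @ Phi @ np.linalg.solve(M, Phi.T @ Sigma_p)) @ phi_s

print(mean_w, mean_k)   # identical up to rounding
print(var_w, var_k)
```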


Regression Examples

Examples - Introduction

There are two examples that use GP regression for prediction.

The first one is a noise-free case where the objective function is given as f(x) = x cos(x).

The second one is a noisy case and the objective function is not given.

The training data for the second example are the numbers of international airline passengers in the USA per month, which are commonly used to test the performance of a regression method.


Python 3.5 is used to implement the examples. We use the following libraries:

1. NumPy: it makes vector computation in Python easy.

2. Scikit-learn (sklearn) GaussianProcessRegressor and linear_model: the first performs GP regression by optimizing the log-marginal likelihood with respect to the hyper-parameters. The second is used to fit a linear regression, which is needed to predict the mean function of the test data.

There are also good tools for GP regression such as gpytorch and gpflow.


Noise-free case

In this example, the objective function is given as f(x) = x cos(x). The training data are given as {(x_1, f(x_1)), · · · , (x_n, f(x_n))}, where x_1, · · · , x_n are randomly chosen. Here, we use the squared exponential covariance function

k(x_i, x_j) = \sigma_f^2 \exp\Big(-\frac{(x_i - x_j)^2}{2\ell^2}\Big).

The following figure shows the GP prediction obtained from the randomly chosen points on y = f(x).


Figure: Noise-free Case


The red dotted line is our objective function. The blue line is our GP regressor.

Hyper-parameter tuning is done by optimizing the log-marginal likelihood function using a numerical optimization method.

The above figure is obtained from the optimized kernel

k(x_i, x_j) = 8.06^2 \exp\Big(-\frac{(x_i - x_j)^2}{2 (2.19)^2}\Big)

with the log-marginal likelihood value = −13.67.

In this case the GP regressor is very accurate.
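For reference, a sketch of this example with scikit-learn's GaussianProcessRegressor is given below; the training locations and the initial kernel hyper-parameters are illustrative choices, and the optimized kernel and log-marginal likelihood are produced by the fit.

```python
# Noise-free GP regression of f(x) = x cos(x) with a squared exponential kernel
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

def f(x):
    return x * np.cos(x)

rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(0, 10, size=15)).reshape(-1, 1)   # randomly chosen inputs
y_train = f(X_train).ravel()

# squared exponential kernel sigma_f^2 * exp(-(x - x')^2 / (2 l^2))
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train)

X_test = np.linspace(0, 10, 200).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)   # GP predictive mean and std
print(gpr.kernel_)                                 # optimized kernel
print(gpr.log_marginal_likelihood_value_)
```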


Noisy case

This example deals with real data: the numbers of international airline passengers in the USA from January 1949 to December 1960.

Here, we use the first 85 percent of the data as the training set and the remaining data as a validation set. The validation set is used to test the estimator.

Here, we use the following covariance function

k(x_i, x_j) = \sigma_1^2 \big(x_i \cdot x_j + \sigma_2^2\big) \exp\Big(-\frac{2 \sin^2\big(\pi (x_i - x_j)/\sigma_3\big)}{\sigma_4^2}\Big)

and for noisy data we use

k_y(x_i, x_j) = k(x_i, x_j) + \sigma_5^2 \delta_{ij}.
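In scikit-learn, a kernel with this structure can be built from a DotProduct (linear) term, an ExpSineSquared (periodic) term, and a WhiteKernel for the noise. The sketch below only shows the construction, with guessed initial hyper-parameters, and assumes the centred training data X_train, y_train have been prepared separately.

```python
# Linear x periodic kernel plus white noise, as a scikit-learn kernel object
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (ConstantKernel, DotProduct,
                                              ExpSineSquared, WhiteKernel)

kernel = (ConstantKernel(1.0) * DotProduct(sigma_0=1.0)
          * ExpSineSquared(length_scale=1.0, periodicity=1.0)
          + WhiteKernel(noise_level=1.0))

# X_train: month index (n x 1), y_train: centred passenger counts (see the
# moving-average centring discussed below) -- assumed to be prepared already.
# gpr = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train)
# print(gpr.kernel_, gpr.log_marginal_likelihood_value_)
```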


Figure: Noisy Case


The red line is our validation set (equivalently called the test data) and the blue line is our training data. The black line is our GP regressor.

The above figure is obtained from the optimized kernel

k_y(x_i, x_j) = (0.94)^2 \big(x_i \cdot x_j + 10^{-5}\big) \exp\Big(-\frac{2 \sin^2\big(\pi (x_i - x_j)/0.902\big)}{(0.255)^2}\Big) + 279\, \delta_{ij}

with the log-marginal likelihood value = −539.31.


Remarks

There are some remarks.

In general, the log-marginal likelihood function may not be concave with respect to its hyper-parameters. So, there may be many local optima, and gradient-based optimization methods may fall into a local optimum.

In GP regression, choosing a proper covariance function is the main issue. For both examples, we chose our covariance functions heuristically, so there may exist other covariance functions that make the GP regressor more accurate.


Recall that our GP regression is based on a GP with zero mean function. So, we need to centralize our training data. In our second example, the training data are centralized by subtracting their moving average

MA(x_i) = \frac{x_{i-[k/2]} + \cdots + x_{i+[k/2]}}{k} \quad \text{(we use } k = 9\text{)}.
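A sketch of this centring step (assuming the raw series is a 1-d NumPy array; the edge handling is one possible choice, not specified on the slides):

```python
# Centre a series by subtracting its moving average (window k = 9 as on the slide)
import numpy as np

def centre_by_moving_average(y, k=9):
    pad = k // 2
    padded = np.pad(y, pad, mode="edge")                  # extend the edges
    ma = np.convolve(padded, np.ones(k) / k, mode="valid")  # centred moving average
    return y - ma, ma

# y_centred, trend = centre_by_moving_average(y_raw)
```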

There are other centralization methods, such as subtracting the sample mean. You may get better results by choosing another window size k.

Your predictor should not be trained on any information from the validation set.


Bayesian Models for Classification


Classification

We consider classification problems.

x : data, y : the class label

When we consider p(y,x), we have two approaches by Bayes’ theorem.

The generative approach considers p(y,x) = p(x|y)p(y).

The discriminative approach focuses on modeling p(y|x) directly.


Classification Problems: The generative approach

For y = C_1, C_2, . . . , C_C, the posterior probability of each class is

p(y | x) = \frac{p(y, x)}{p(x)} = \frac{p(x | y)\, p(y)}{\sum_{c=1}^{C} p(C_c)\, p(x | C_c)}.

A simple and common choice for the class-conditional density is

p(x | C_c) = \mathcal{N}(\mu_c, \Sigma_c).

However, it is unclear whether this choice is appropriate.


Classification Problems: The discriminative approach

Basic idea

x --(GP)--> f(x) --(σ)--> σ(f(x))

For the binary case, we usually use a response function σ(z) which squashes its argument into [0, 1], guaranteeing a valid probabilistic interpretation.

The response function σ(z) can be any sigmoid function. (A sigmoid function is a monotonically increasing function mapping R to [0, 1].)


Two examples of response functions are

Linear logistic regression model:

p(C_i | x) = \lambda(w^\top x), \quad \text{where } \lambda(z) = \frac{1}{1 + \exp(-z)}

Probit regression model:

p(C_i | x) = \Phi(w^\top x), \quad \text{where } \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt.

From now on we only consider the discriminative approach.


Classification Problems: Decision Theory for Classification

Let L(c, c') be the loss incurred by making decision c' if the true class is C_c. Usually L(c, c') = 0 when c = c'. Then the expected loss is

R_L(c' | x) = \sum_{c} L(c, c')\, p(C_c | x).

The optimal class is

c^* = \operatorname*{argmin}_{c} R_L(c | x).

One common choice for the loss function is the zero-one loss, i.e.,

L(c, c') = 1 - \delta_{cc'}.
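A small numerical illustration of this decision rule (the class posteriors below are made up):

```python
# Pick the class minimizing the expected loss R_L(c' | x) = sum_c L(c, c') p(C_c | x)
import numpy as np

p_class = np.array([0.2, 0.5, 0.3])   # p(C_c | x) for three classes (illustrative)
L = 1.0 - np.eye(3)                   # zero-one loss L(c, c') = 1 - delta_{cc'}

risk = L.T @ p_class                  # expected loss for each decision c'
c_star = np.argmin(risk)              # under zero-one loss this is argmax_c p(C_c | x)
print(risk, c_star)
```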


Linear Models for Classification

x --(w ∼ N)--> w^\top x --(σ)--> σ(w^\top x)

As we did in regression, we start with a linear model (f(x) = w^\top x).

Let the labels and the likelihood be

y = ±1, \qquad p(y = 1 | x, w) = \sigma(w^\top x),

and we use the sigmoid function as the response function.

When σ(z) = λ(z), the model is called a linear logistic regression. When σ(z) = Φ(z), the model is called a linear probit regression.


For symmetric σ(z), i.e., σ(−z) = 1− σ(z),

p(y_i | x_i, w) = \sigma(y_i f_i),

where f_i = f(x_i) = w^\top x_i, because

p(y_i = 1 | x_i, w) = \sigma(w^\top x_i) = \sigma(f_i),
p(y_i = -1 | x_i, w) = 1 - \sigma(w^\top x_i) = 1 - \sigma(f_i) = \sigma(-f_i).

So we can write p(y|x,w) consistently regardless of the value of y.


Remark: The logit transformation is defined as

\mathrm{logit}(x) = \log \frac{p(y = 1 | x)}{p(y = -1 | x)}.

For a linear logistic regression, \mathrm{logit}(x) = w^\top x.


In binary classification, what we want to do is predict y conditional on x. To do this we assume the existence of a function a(x) which models the logit as a function of x. Thus

P (y = 1|x, a(x)) = σ(a(x)).

There are two approaches to complete this model.

One way is to consider a(x) as a parametrized function a(x; w) and give w a prior, which is described below.

The other approach is to model a(x) using a Gaussian process, which is omitted here.


Given a dataset D = {(x_i, y_i) | i = 1, . . . , n}, we assume that the labels are generated independently, conditional on f(x) = w^\top x. Using the Gaussian prior w \sim \mathcal{N}(0, \Sigma_p),

p(w | D, y) \propto p(y | w, D)\, p(w | D) = \prod_{i=1}^{n} \sigma(y_i w^\top x_i) \exp\Big(-\frac{1}{2} w^\top \Sigma_p^{-1} w\Big).

So

\log p(w | D, y) = \text{const} - \frac{1}{2} w^\top \Sigma_p^{-1} w + \sum_{i=1}^{n} \log \sigma(y_i w^\top x_i).


Linear Models for Classification: Maximum Likelihood

For σ(z) = λ(z), the log posterior is a concave function of w for fixed D.

Proposition 1

f(w) = -\frac{1}{2} w^\top \Sigma_p^{-1} w is concave in w.

Proposition 2

\log \lambda(y w^\top x) is concave in w, where \lambda(z) = \frac{1}{1 + \exp(-z)}.

For σ(z) = Φ(z), the log posterior is also concave.

Proposition 3

\log \Phi(y w^\top x) is concave in w.


Since the log posterior is a concave function of w, it is relatively easy to find its unique maximum. We can use Newton's method to find the maximum.
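A sketch of Newton's method for the MAP weights of the linear logistic model is given below; the synthetic data, the prior Σ_p = I, and the fixed number of iterations are illustrative choices, not the slides' settings.

```python
# Newton's method for the MAP estimate of w in linear logistic regression
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n, d = 200, 2
X = rng.normal(size=(n, d))
y = np.where(X @ np.array([1.5, -2.0]) + 0.3 * rng.normal(size=n) > 0, 1, -1)

Sigma_p_inv = np.eye(d)              # prior precision (Sigma_p = I assumed)
w = np.zeros(d)
for _ in range(20):                  # Newton iterations on the concave log posterior
    f = X @ w
    grad = X.T @ ((1 - sigmoid(y * f)) * y) - Sigma_p_inv @ w
    s = sigmoid(f) * (1 - sigmoid(f))                # lambda(f_i)(1 - lambda(f_i))
    H = -(X.T * s) @ X - Sigma_p_inv                 # Hessian of the log posterior
    w = w - np.linalg.solve(H, grad)
print(w)
```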


Linear Models for Classification: Predictions

To make predictions based on the training set D for a test point x_*, we have

p(y_* = 1 | x_*, D) = \int p(y_* = 1 | w, x_*)\, p(w | D)\, dw.
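One simple way to approximate this integral is Monte Carlo: draw samples of w from (an approximation of) the posterior and average σ(w^T x_*). The sketch below assumes a Gaussian approximation to p(w | D) with a given mean w_map and covariance S; the numbers are made up.

```python
# Monte Carlo approximation of the predictive probability p(y* = 1 | x*, D)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predictive_prob(x_star, w_map, S, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(w_map, S, size=n_samples)   # samples w ~ p(w | D) (approx.)
    return sigmoid(W @ x_star).mean()                       # average of p(y* = 1 | w, x*)

# Example with made-up posterior moments:
p = predictive_prob(np.array([1.0, -0.5]),
                    w_map=np.array([1.2, -1.8]),
                    S=0.1 * np.eye(2))
print(p)
```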


In the multi-class case, we use the softmax function

p(y = C_c | x, W) = \frac{\exp(x^\top w_c)}{\sum_{c'} \exp(x^\top w_{c'})}

where w_c is the weight vector for class c, and W = (w_1, . . . , w_C).

The corresponding log likelihood is of the form

\sum_{i=1}^{n} \sum_{c=1}^{C} \delta_{c, y_i} \Big[x_i^\top w_c - \log\Big(\sum_{c'} \exp(x_i^\top w_{c'})\Big)\Big],

which is also a concave function of W.
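As a small sketch, the log likelihood above can be written in a few lines of NumPy; the data matrix X, the integer labels y in {0, ..., C-1}, and the weight matrix W are illustrative placeholders.

```python
# Multi-class softmax log likelihood sum_i [x_i^T w_{y_i} - log sum_c' exp(x_i^T w_c')]
import numpy as np
from scipy.special import logsumexp

def softmax_log_likelihood(W, X, y):
    scores = X @ W                                   # scores[i, c] = x_i^T w_c
    return np.sum(scores[np.arange(len(y)), y] - logsumexp(scores, axis=1))
```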


In Gaussian Process Classification, we use Gaussian processes and there are two approaches:

Laplace Approximation

Expectation Propagation

For the details, please refer to

C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.


Classification Examples

Example - Introduction

There are two examples that use GP classification.

The first one is just a toy problem from our textbook, pages 60 and 61.

The second one is a multi-class classification which classifies the types of iris. The training data for the second example are the iris data set, which is commonly used to test the performance of classification methods.


Python 3.5 is also used to implement the examples. We use the following libraries:

1. NumPy: it makes vector computation in Python easy.

2. Scikit-learn (sklearn) GaussianProcessClassifier: it performs GP classification by optimizing the log-marginal likelihood with respect to the hyper-parameters. It uses the logistic response function and the Laplace approximation to handle the non-Gaussian likelihood.


A Toy Problem

15 data points are randomly chosen in the square [0, 1]^2. The 2 classes are labeled as x (−1) and o (+1). We use the squared exponential covariance function

k(x_i, x_j) = \sigma^2 \exp\Big(-\frac{\|x_i - x_j\|^2}{2\ell^2}\Big).

The following figure shows the contour plots of the predictive probability E_q[\pi(x_*) | f] and the training data points.


Figure: Classification results


Here, the test data are all points in [0, 1]^2.

The value in each contour denotes the predictive probability of class +1.

As in GP regression, hyper-parameter tuning is done by optimizing the approximated log-marginal likelihood function.

The above figure is obtained from the optimized kernel

k(x_i, x_j) = 1.64^2 \exp\Big(-\frac{\|x_i - x_j\|^2}{2 (0.116)^2}\Big)

with the log-marginal likelihood value = −10.142.
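A sketch of this toy problem with scikit-learn's GaussianProcessClassifier (which uses the logistic response function and the Laplace approximation) is given below; the random points and the labeling rule are illustrative stand-ins for the 15 hand-placed points on the slide.

```python
# Binary GP classification on 15 random points in [0, 1]^2
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(15, 2))                     # points in [0, 1]^2
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 1, 1, -1)  # illustrative +1 / -1 labels

kernel = ConstantKernel(1.0) * RBF(length_scale=0.5)          # squared exponential kernel
gpc = GaussianProcessClassifier(kernel=kernel).fit(X_train, y_train)

# predictive probability of class +1 on a grid of test points
xx, yy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.c_[xx.ravel(), yy.ravel()]
proba = gpc.predict_proba(grid)[:, 1].reshape(xx.shape)
print(gpc.kernel_, gpc.log_marginal_likelihood_value_)
```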


Classification of iris data: Introduction

This example deals with real data on the types of irises. There are 3 main types of iris: setosa, versicolor, and virginica. The iris data set is composed of 150 samples of irises with their types and the lengths and widths of their sepals and petals. The 3 classes, 'setosa', 'versicolor', and 'virginica', are labeled as red, green, and blue.

We want to classify a given iris by the lengths and widths of its sepal and petal using GP classification.


Since each sample has 4 features (the lengths and widths of the sepal and petal), it may be hard to give a graphical demonstration. So, we select 2 features: the width and length of the sepal.

Next, to test our classifier, we divide the data set into 2 parts of size 100 and 50: 100 samples are used to train our model, and 50 samples are used for validation.

We again use the squared exponential covariance function with a noise term

k_y(x_i, x_j) = \sigma_1^2 \exp\Big(-\frac{\|x_i - x_j\|^2}{2\ell^2}\Big) + \sigma_2^2 \delta_{ij}.

The following figure shows the locations of the training data and the classification results with respect to the widths and lengths of the sepals.


Figure: Classification with the widths and lengths of sepals


Here, the test data are all points in the 2d box

[min(sepal length) − 1, max(sepal length) + 1] × [min(sepal width) − 1, max(sepal width) + 1],

where the min and max are taken over the training data.

The above figure is obtained from the optimized kernel

k_y(x_i, x_j) = (8.84)^2 \exp\Big(-\frac{\|x_i - x_j\|^2}{2 (2.71)^2}\Big) + (2.66)^2 \delta_{ij}

with the log-marginal likelihood value = −31.892.
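A sketch of the sepal-feature experiment with scikit-learn is given below; the random 100/50 split is an assumption, so the fitted kernel and error rate will not exactly match the numbers above.

```python
# GP classification of the iris data using sepal length and width only
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                  # sepal length and sepal width only

X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=100, random_state=0)

# squared exponential kernel plus a white-noise term
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpc = GaussianProcessClassifier(kernel=kernel).fit(X_tr, y_tr)

print(gpc.kernel_)
print("validation error rate:", 1.0 - gpc.score(X_val, y_val))
```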


From the above figure, we can say that our GP classifier works well for classifying setosa. However, it does not work well in classifying versicolor and virginica. Actually, its average error rate on the validation set is 20.5 percent. Does this mean that our GP classification works poorly? Let's consider the following figure.


Figure: Comparison between two different feature sets


The right figure shows the locations of the training data with respect to the widths and lengths of the petals. By comparing both, we can observe that the widths and lengths of the petals are a better feature set for learning the GP classifier. By using the new features we get better results, which are provided below.


Figure: Classification with the widths and lengths of petals


The test data are all points in the 2d box

[min(petal length) − 1, max(petal length) + 1] × [min(petal width) − 1, max(petal width) + 1],

where the min and max are taken over the training data.

The above figure is obtained from the optimized kernel

k_y(x_i, x_j) = (7.77)^2 \exp\Big(-\frac{\|x_i - x_j\|^2}{2 (4.49)^2}\Big) + (5.55)^2 \delta_{ij}

with the log-marginal likelihood value = −10.560, which is much larger than in the previous case. Moreover, its average error rate on the validation set is only 4 percent.


References

The iris data: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris

K.P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.

C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
