
Page 1: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Machine Learning 10601, Recitation 6, Sep 30, 2009, Oznur Tastan

Page 2: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Outline

• Multivariate Gaussians
• Logistic regression

Page 3: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Multivariate Gaussians (also called the "multinormal distribution" or "multivariate normal distribution")

Univariate case: a single mean $\mu$ and variance $\sigma^2$:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Multivariate case: a vector of observations $x$, a vector of means $\mu$, and a covariance matrix $\Sigma$:

$$p(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

where $n$ is the dimension of $x$ and $|\Sigma|$ is the determinant of $\Sigma$.

Page 4: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Multivariate Gaussians

In both densities the leading factor is a normalization constant that does not depend on $x$:

Univariate case: $\frac{1}{\sqrt{2\pi}\,\sigma}$        Multivariate case: $\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}$

The quadratic form in the exponent, $(x-\mu)^T\Sigma^{-1}(x-\mu)$, is the part that depends on $x$, and it is positive ($\Sigma$ is positive definite).
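To make the formula concrete, here is a minimal Matlab sketch (not from the slides; the mean vector, covariance matrix, and evaluation point are assumed example values) that evaluates the multivariate Gaussian density directly from the expression above:

mu    = [0; 0];                    % example mean vector (assumed values)
Sigma = [2 0.5; 0.5 1];            % example covariance matrix (assumed values)
x     = [1; -1];                   % point at which to evaluate the pdf

n = length(x);                     % dimension of x
d = x - mu;
p = 1/((2*pi)^(n/2) * sqrt(det(Sigma))) * exp(-0.5 * d' * inv(Sigma) * d);

% With the Statistics Toolbox this should match mvnpdf(x', mu', Sigma).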

Page 5: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

The mean vector

$$\mu = E(x) = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{bmatrix}$$

Page 6: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Covariance of two random variables

Recall that for two random variables $x_i$, $x_j$:

$$\sigma_{ij}^2 = \mathrm{Cov}(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)] = E(x_i x_j) - E(x_i)E(x_j)$$

Page 7: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

The covariance matrix

$$\Sigma = E[(x-\mu)(x-\mu)^T] = E\left[\begin{bmatrix} x_1-\mu_1 \\ \vdots \\ x_m-\mu_m \end{bmatrix}\begin{bmatrix} x_1-\mu_1 & \cdots & x_m-\mu_m \end{bmatrix}\right] = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12}^2 & \cdots & \sigma_{1m}^2 \\ \sigma_{21}^2 & \sigma_{22}^2 & \cdots & \sigma_{2m}^2 \\ \vdots & & \ddots & \vdots \\ \sigma_{m1}^2 & \sigma_{m2}^2 & \cdots & \sigma_{mm}^2 \end{bmatrix}$$

where $T$ is the transpose operator. The diagonal entries are the variances: $\mathrm{Var}(x_m) = \mathrm{Cov}(x_m, x_m)$.
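As a quick check of this definition, the following Matlab sketch (synthetic data; the mixing matrix is an assumed example) estimates a covariance matrix from samples and compares it to the built-in cov:

X  = randn(1000, 3) * [1 0 0; 0.5 1 0; 0 0 2];   % N x m data matrix (synthetic example)
mu = mean(X);                                    % 1 x m vector of column means
Xc = X - repmat(mu, size(X,1), 1);               % subtract the mean from every row
S  = (Xc' * Xc) / (size(X,1) - 1);               % sample covariance matrix

% S should be close to cov(X); its diagonal entries are the variances.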

Page 8: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

An example: 2 variate case

Take a diagonal covariance matrix

$$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}, \qquad |\Sigma| = \sigma_1^2\sigma_2^2 \text{ (determinant)}$$

The pdf of the multivariate Gaussian is then

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\exp\left(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2}\right)$$

Page 9: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

An example: 2 variate case

Recall that in general independence implies uncorrelatedness, but uncorrelatedness does not necessarily imply independence. The multivariate Gaussian is a special case where uncorrelatedness implies independence as well.

The pdf above factorizes into two independent Gaussians, so $x_1$ and $x_2$ are independent!
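The factorization step, sketched in a little more detail (assuming the diagonal covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$ from the previous slide):

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\exp\left(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2}\right) = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_1}e^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}}}_{p(x_1)} \cdot \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_2}e^{-\frac{(x_2-\mu_2)^2}{2\sigma_2^2}}}_{p(x_2)}$$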

Page 10: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Diagonal covariance matrix

A diagonal matrix is an m x m matrix whose off-diagonal terms are zero. For the covariance matrix this means

$$\sigma_{ij}^2 = E[(x_i-\mu_i)(x_j-\mu_j)] = 0 \quad \text{for } i \ne j, \qquad \Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}$$

If all the variables are independent of each other, the covariance matrix will be diagonal. For Gaussians the reverse is also true: if the covariance matrix is diagonal, the variables are independent.

Page 11: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Gaussian Intuitions: Size of Σ

[Figure: three contour plots with μ = [0 0] and Σ = I, Σ = 0.6I, Σ = 2I, where I is the identity matrix.]

As Σ becomes larger, the Gaussian becomes more spread out.
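A small Matlab sketch (grid range and plotting choices assumed) that reproduces this intuition by plotting contours for Σ = I, 0.6I, and 2I with μ = [0 0]:

[x1, x2] = meshgrid(-4:0.1:4, -4:0.1:4);
scales = [1 0.6 2];                                % Sigma = scale * I
for k = 1:3
    s = scales(k);
    p = 1/(2*pi*s) * exp(-(x1.^2 + x2.^2)/(2*s));  % pdf for Sigma = s*I, mu = [0 0]
    subplot(1, 3, k); contour(x1, x2, p); axis square;
    title(sprintf('Sigma = %.1f I', s));
end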

Page 12: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Gaussian Intuitions: Off-diagonal

As the off-diagonal entries increase, there is more correlation between the value of x and the value of y.

Page 13: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Gaussian Intuitions: off-diagonal and diagonal

Panels 1-2: decreasing the off-diagonal entries. Panel 3: increasing the variance of one dimension on the diagonal.

Page 14: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Isocontours

Page 15: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Isocontours example

We have shown the form of the pdf above.

Now, for some constant c, let's find the isocontour: the set of points x where p(x) = c.

Page 16: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Isocontours continued

Page 17: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Isocontours continued

Defining $r_1$ and $r_2$ (the semi-axis lengths; see the derivation sketched below), the isocontour becomes the equation of an ellipse, centered on $(\mu_1, \mu_2)$ and with axis lengths $2r_1$ and $2r_2$:

$$\frac{(x_1-\mu_1)^2}{r_1^2} + \frac{(x_2-\mu_2)^2}{r_2^2} = 1$$
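A sketch of how $r_1$ and $r_2$ arise (assuming the diagonal covariance case used above): setting $p(x_1, x_2) = c$ and rearranging,

$$\frac{(x_1-\mu_1)^2}{2\sigma_1^2\log\frac{1}{2\pi\sigma_1\sigma_2 c}} + \frac{(x_2-\mu_2)^2}{2\sigma_2^2\log\frac{1}{2\pi\sigma_1\sigma_2 c}} = 1, \qquad \text{so } r_i = \sqrt{2\sigma_i^2\log\frac{1}{2\pi\sigma_1\sigma_2 c}}$$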

Page 18: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

We had started with a diagonal covariance matrix.

In the diagonal covariance matrix case the ellipses are axis-aligned.

Page 19: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Don't confuse multivariate Gaussians with mixtures of Gaussians.

Mixture of Gaussians (here with K = 3):

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the k-th component and $\pi_k$ is its mixing coefficient.
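To make the roles of the components and mixing coefficients concrete, here is a minimal Matlab sketch (all parameter values are assumed examples) that evaluates a 1-D mixture of K = 3 Gaussians on a grid:

pis    = [0.5 0.3 0.2];          % mixing coefficients (assumed), sum to 1
mus    = [-2  0   3 ];           % component means (assumed)
sigmas = [0.5 1   0.8];          % component standard deviations (assumed)

x = -6:0.05:6;
p = zeros(size(x));
for k = 1:3
    p = p + pis(k) * 1./(sqrt(2*pi)*sigmas(k)) .* exp(-(x - mus(k)).^2 / (2*sigmas(k)^2));
end
plot(x, p);                      % the mixture density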

Page 20: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Logistic regression

• Linear regression: the outcome variable Y is continuous
• Logistic regression: the outcome variable Y is binary

Page 21: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Logistic function (logit function)

$$\sigma(z) = \frac{1}{1+e^{-z}}$$

Notice that σ(z) is always bounded between 0 and 1 (a nice property): as z increases, σ(z) approaches 1, and as z decreases, σ(z) approaches 0. The term $e^{-z}$ lies in $(0, \infty)$.
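A quick Matlab sketch (plot range assumed) to see these properties of the logistic function:

z = -10:0.1:10;
sigma = 1 ./ (1 + exp(-z));      % logistic function, bounded in (0, 1)
plot(z, sigma); xlabel('z'); ylabel('\sigma(z)'); grid on;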

Page 22: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Logistic regression

Learn a function to map X values to Y, given data $(X_1, Y_1), \ldots, (X_N, Y_N)$.

The function we try to learn is P(Y|X). X can be continuous or discrete; Y is discrete (binary).

Page 23: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Logistic regression

$$P(Y=1 \mid X) = \frac{1}{1 + e^{w_0 + \sum_{i=1}^{N} w_i X_i}}$$

$$P(Y=0 \mid X) = 1 - P(Y=1 \mid X) = \frac{e^{w_0 + \sum_{i=1}^{N} w_i X_i}}{1 + e^{w_0 + \sum_{i=1}^{N} w_i X_i}}$$

Page 24: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Classification

If $\frac{P(Y=0 \mid X)}{P(Y=1 \mid X)} > 1$ holds, then Y = 0 is more probable than Y = 1 given X.

Page 25: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Classification

$$\frac{P(Y=0 \mid X)}{P(Y=1 \mid X)} = \frac{e^{w_0 + \sum_{i=1}^{N} w_i X_i}\big/\left(1 + e^{w_0 + \sum_{i=1}^{N} w_i X_i}\right)}{1\big/\left(1 + e^{w_0 + \sum_{i=1}^{N} w_i X_i}\right)} = e^{w_0 + \sum_{i=1}^{N} w_i X_i} > 1$$

Take the log of both sides. The classification rule: if

$$w_0 + \sum_{i=1}^{N} w_i X_i > 0$$

holds, classify as Y = 0.

Page 26: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Logistic regression is a linear classifier

$$w_0 + \sum_{i=1}^{N} w_i X_i > 0 \;\Rightarrow\; P(Y=0 \mid X) > P(Y=1 \mid X), \text{ classify as } Y = 0$$

$$w_0 + \sum_{i=1}^{N} w_i X_i < 0 \;\Rightarrow\; P(Y=1 \mid X) > P(Y=0 \mid X), \text{ classify as } Y = 1$$

The decision boundary is $w_0 + \sum_{i=1}^{N} w_i X_i = 0$.

Page 27: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Classification

[Figure: two plots of σ(z) = σ(w0 + w1X1) against X1.]

Left: w0 = +2, w1 = -1. Notice σ(z) is 0.5 when X1 = 2; to check that w0 = +2, evaluate the curve at X1 = 0, where it is roughly 0.1. The rule $w_0 + w_1 X_1 > 0$ becomes $2 + (-1)X_1 > 0$, i.e. $X_1 < 2$: classify as Y = 0.

Right: w0 = 0, w1 = -1. Here σ(z) is 0.5 when X1 = 0. The rule becomes $0 + (-1)X_1 > 0$, i.e. $X_1 < 0$: classify as Y = 0.

Page 28: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Estimating the parameters

Given data $(X_1, Y_1), \ldots, (X_N, Y_N)$

Objective: train the model to get the w that maximizes the conditional likelihood

$$\hat{w} = \arg\max_{w} \prod_{i=1}^{N} P(Y_i \mid X_i, w)$$
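One common way to carry out this maximization is gradient ascent on the conditional log-likelihood. The Matlab sketch below is an assumed illustration (synthetic data, assumed learning rate and iteration count), not code from the recitation; it uses the slides' convention $P(Y=1 \mid X) = 1/(1 + e^{w_0 + \sum_i w_i X_i})$:

X   = randn(100, 2);                  % N x 2 feature matrix (synthetic example)
Y   = double(X(:,1) + X(:,2) < 0);    % synthetic binary labels
Xa  = [ones(size(X,1), 1) X];         % prepend a column of ones for w0
w   = zeros(size(Xa, 2), 1);          % [w0; w1; w2]
eta = 0.01;                           % learning rate (assumed)

for iter = 1:1000
    p1   = 1 ./ (1 + exp(Xa * w));    % P(Y=1|X) for every example (slides' convention)
    grad = Xa' * (p1 - Y);            % gradient of the conditional log-likelihood
    w    = w + eta * grad;            % ascent step
end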

Page 29: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Difference between Naïve Bayes and Logistic Regression

The loss function! They optimize different functions → they obtain different solutions.

Naïve Bayes: argmax P(X|Y) P(Y)

Logistic Regression: argmax P(Y|X)

Page 30: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Naïve Bayes and Logistic Regression

• Have a look at Tom Mitchell's book chapter: http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf

It is also linked under the Sep 23 Lecture Readings.

Page 31: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

Some Matlab tips for the last question in HW3

• The logical function might be useful for dividing the data into splits. An example of logical in use (please read the Matlab help):

S = X(logical(X(:,1)==1), :)     % this also works: S = X(X(:,1)==1, :)

This subsets the portion of the X matrix where the first column has value 1 and puts it in matrix S (like Data > Filter in Excel).

• Matlab has functions for mean, std, sum, inv, log2.

• Scaling data to zero mean and unit variance: shift by the mean (subtract the mean from every element of the vector) and scale so that it has variance = 1 (divide every element of the vector by the standard deviation).

• To do that on matrices you will need the repmat function; have a look at it, otherwise the sizes of the matrices will not match. A sketch is given below.

• For elementwise multiplication use .*
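A minimal sketch of the column-wise scaling described above (assuming X is an N x d data matrix; not the HW solution, just the repmat pattern):

mu = mean(X);                    % 1 x d vector of column means
sd = std(X);                     % 1 x d vector of column standard deviations
Xs = (X - repmat(mu, size(X,1), 1)) ./ repmat(sd, size(X,1), 1);
% Each column of Xs now has zero mean and unit variance.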

Page 32: Machine Learning 10601 Recitation 6 Sep 30, 2009 Oznur Tastan

References

• http://www.stanford.edu/class/cs224s/lec/224s.09.lec10.pdf

• http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf

• Carlos Guestrin lecture notes
• Andrew Ng lecture notes