
Lecture 9: Classification, LDA

Reading: Chapter 4

STATS 202: Data mining and analysis

Jonathan Taylor, 10/12. Slide credits: Sergio Bacallado

1 / 21

Review: Main strategy in Chapter 4

Find an estimate P(Y | X). Then, given an input x_0, we predict the response as in a Bayes classifier:

\hat{y}_0 = \arg\max_y \hat{P}(Y = y \mid X = x_0).

2 / 21
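As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of this prediction rule, assuming the estimated posteriors are already available; the variable name `posterior` and the numbers are hypothetical.

    import numpy as np

    # Hypothetical estimated posteriors P(Y = y | X = x0) for classes 0, 1, 2.
    posterior = np.array([0.2, 0.5, 0.3])

    # Bayes classifier: predict the class with the largest estimated posterior.
    y0_hat = np.argmax(posterior)
    print(y0_hat)  # -> 1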

Linear Discriminant Analysis (LDA)

Instead of estimating P(Y | X), we will estimate:

1. P(X | Y): given the response, what is the distribution of the inputs?

2. P(Y): how likely is each of the categories?

Then, we use Bayes rule to obtain the estimate:

P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k)\, P(Y = k)}{\sum_j P(X = x \mid Y = j)\, P(Y = j)}

3 / 21
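To make the Bayes-rule step concrete, here is a small NumPy sketch (an illustration added to these notes, not from the slides): given hypothetical class-conditional densities f_k(x_0) and priors pi_k, the posterior is their product, renormalized to sum to one.

    import numpy as np

    # Hypothetical values of f_k(x0) = P(X = x0 | Y = k) and priors pi_k = P(Y = k).
    f_x0 = np.array([0.05, 0.20, 0.10])   # class-conditional densities at x0
    prior = np.array([0.3, 0.3, 0.4])     # priors, must sum to 1

    # Bayes rule: posterior proportional to f_k(x0) * pi_k.
    unnormalized = f_x0 * prior
    posterior = unnormalized / unnormalized.sum()
    print(posterior)           # estimated P(Y = k | X = x0) for each k
    print(posterior.argmax())  # predicted class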


Linear Discriminant Analysis (LDA)

Instead of estimating P(Y | X), we will estimate:

1. We model P(X = x | Y = k) = f_k(x) as a multivariate normal distribution:

[Figure: bivariate normal density contours for each class in the (X1, X2) plane.]

2. P(Y = k) = π_k is estimated by the fraction of training samples of class k.

4 / 21


LDA has linear decision boundaries

Suppose that:

- We know P(Y = k) = π_k exactly.

- P(X = x | Y = k) is multivariate normal with density:

f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}

μ_k: mean of the inputs for category k.
Σ: covariance matrix (common to all categories).

Then, what is the Bayes classifier?

5 / 21
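As a sanity check on the density formula, here is a short NumPy sketch (added for these notes) that evaluates f_k(x) directly and compares it with scipy.stats.multivariate_normal; the mean, covariance, and test point below are made up for illustration.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Hypothetical class mean mu_k and common covariance Sigma (p = 2).
    mu_k = np.array([1.0, -0.5])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])
    x = np.array([0.5, 0.0])

    # f_k(x) = (2*pi)^(-p/2) |Sigma|^(-1/2) exp(-0.5 (x - mu_k)^T Sigma^{-1} (x - mu_k))
    p = len(x)
    diff = x - mu_k
    quad = diff @ np.linalg.solve(Sigma, diff)
    f_k = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)

    print(f_k)
    print(multivariate_normal(mean=mu_k, cov=Sigma).pdf(x))  # should match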


LDA has linear decision boundaries

By Bayes rule, the probability of category k, given the input x is:

P(Y = k \mid X = x) = \frac{f_k(x)\, \pi_k}{P(X = x)}

The denominator does not depend on the response k, so we can write it as a constant:

P(Y = k \mid X = x) = C\, f_k(x)\, \pi_k

Now, expanding f_k(x):

P(Y = k \mid X = x) = \frac{C\, \pi_k}{(2\pi)^{p/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}

6 / 21


LDA has linear decision boundaries

P(Y = k \mid X = x) = \frac{C\, \pi_k}{(2\pi)^{p/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}

Now, let us absorb everything that does not depend on k into a constant C':

P(Y = k \mid X = x) = C'\, \pi_k\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}

and take the logarithm of both sides:

\log P(Y = k \mid X = x) = \log C' + \log \pi_k - \frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k).

The term log C' is the same for every category k, so we want to find the maximum of the remaining terms over k.

7 / 21


LDA has linear decision boundaries

Goal: maximize the following over k:

\log \pi_k - \frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)
  = \log \pi_k - \frac{1}{2}\left[ x^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} \mu_k \right] + x^T \Sigma^{-1} \mu_k
  = C'' + \log \pi_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k

We define the objective:

\delta_k(x) = \log \pi_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k

At an input x, we predict the response with the highest δ_k(x).

8 / 21
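Here is a minimal NumPy sketch (an addition to these notes) of the discriminant δ_k(x) and the resulting prediction rule, assuming the priors, class means, and common covariance are already known; all the numbers below are hypothetical.

    import numpy as np

    # Hypothetical parameters for K = 2 classes with p = 2 predictors.
    pi = np.array([0.4, 0.6])                    # priors pi_k
    mu = np.array([[0.0, 0.0], [2.0, 1.0]])      # rows are mu_k
    Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])   # common covariance
    Sigma_inv = np.linalg.inv(Sigma)

    def delta(x):
        # delta_k(x) = log pi_k - 0.5 mu_k^T Sigma^{-1} mu_k + x^T Sigma^{-1} mu_k
        return np.array([
            np.log(pi[k]) - 0.5 * mu[k] @ Sigma_inv @ mu[k] + x @ Sigma_inv @ mu[k]
            for k in range(len(pi))
        ])

    x = np.array([1.0, 0.5])
    print(delta(x))           # one discriminant score per class
    print(delta(x).argmax())  # predicted class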


LDA has linear decision boundaries

What is the decision boundary? It is the set of points at which two classes do just as well:

\delta_k(x) = \delta_\ell(x)

\log \pi_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k = \log \pi_\ell - \frac{1}{2} \mu_\ell^T \Sigma^{-1} \mu_\ell + x^T \Sigma^{-1} \mu_\ell

This is a linear equation in x.

[Figure: simulated data in the (X1, X2) plane with linear decision boundaries between the classes.]

9 / 21


Estimating π_k

\hat{\pi}_k = \frac{\#\{i : y_i = k\}}{n}

In English, the fraction of training samples of class k.

10 / 21

Estimating the parameters of fk(x)

Estimate the center of each class, μ_k:

\hat{\mu}_k = \frac{1}{\#\{i : y_i = k\}} \sum_{i : y_i = k} x_i

Estimate the common covariance matrix Σ:

- One predictor (p = 1):

\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2.

- Many predictors (p > 1): compute the vectors of deviations (x_1 - \hat{\mu}_{y_1}), (x_2 - \hat{\mu}_{y_2}), \dots, (x_n - \hat{\mu}_{y_n}) and use an unbiased estimate of their covariance matrix, \hat{\Sigma}.

11 / 21
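These estimates are easy to compute; the following NumPy sketch (added for these notes, with made-up training data) computes π̂_k, μ̂_k, and the pooled covariance Σ̂ with the unbiased 1/(n − K) normalization.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical training data: n = 100 points, p = 2 predictors, K = 2 classes.
    X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
                   rng.normal([2, 1], 1.0, size=(50, 2))])
    y = np.repeat([0, 1], 50)
    classes = np.unique(y)
    n, K = len(y), len(classes)

    pi_hat = np.array([np.mean(y == k) for k in classes])          # priors
    mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])   # class means

    # Pooled covariance: within-class scatter divided by n - K.
    deviations = X - mu_hat[y]                # x_i - mu_hat_{y_i}
    Sigma_hat = deviations.T @ deviations / (n - K)

    print(pi_hat)
    print(mu_hat)
    print(Sigma_hat)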


LDA prediction

For an input x, predict the class with the largest:

\hat{\delta}_k(x) = \log \hat{\pi}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k

The decision boundaries are defined by:

\log \hat{\pi}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k = \log \hat{\pi}_\ell - \frac{1}{2} \hat{\mu}_\ell^T \hat{\Sigma}^{-1} \hat{\mu}_\ell + x^T \hat{\Sigma}^{-1} \hat{\mu}_\ell

Solid lines in:

[Figure: simulated data in the (X1, X2) plane; the solid lines are the estimated LDA decision boundaries.]

12 / 21
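Putting the pieces together, here is a compact sketch (added to these notes, using simulated data) that fits the LDA parameters as above and predicts with argmax_k δ̂_k(x); the data-generating choices are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    # Simulated two-class training data (hypothetical).
    X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
                   rng.normal([2, 1], 1.0, size=(50, 2))])
    y = np.repeat([0, 1], 50)
    classes = np.unique(y)
    n, K = len(y), len(classes)

    # Fit: priors, class means, pooled covariance.
    pi_hat = np.array([np.mean(y == k) for k in classes])
    mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])
    dev = X - mu_hat[y]
    Sigma_inv = np.linalg.inv(dev.T @ dev / (n - K))

    # Predict: argmax_k of delta_k(x) for every row of X_new.
    def predict(X_new):
        quad = np.sum(mu_hat @ Sigma_inv * mu_hat, axis=1)   # mu_k^T Sigma^{-1} mu_k
        scores = np.log(pi_hat) - 0.5 * quad + X_new @ Sigma_inv @ mu_hat.T
        return scores.argmax(axis=1)

    print(np.mean(predict(X) == y))  # training accuracy

If scikit-learn is available, fitting sklearn.discriminant_analysis.LinearDiscriminantAnalysis on the same (X, y) should give essentially the same predicted labels, since it fits the same Gaussian model with a common covariance.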


Quadratic discriminant analysis (QDA)

The assumption that the inputs of every class have the same covariance can be quite restrictive:

[Figure: two simulated datasets in the (X1, X2) plane in which the two classes have different covariance matrices.]

13 / 21

Quadratic discriminant analysis (QDA)

In quadratic discriminant analysis, we estimate a mean μ_k and a covariance matrix Σ_k for each class separately.

Given an input, it is easy to derive an objective function:

\delta_k(x) = \log \pi_k - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2} x^T \Sigma_k^{-1} x - \frac{1}{2} \log |\Sigma_k|

This objective is now quadratic in x, and so are the decision boundaries.

14 / 21
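For contrast with the LDA discriminant, here is a brief NumPy sketch (an addition to these notes, with made-up parameters) of the QDA objective δ_k(x), which now involves a per-class covariance Σ_k and the extra quadratic and log-determinant terms.

    import numpy as np

    # Hypothetical parameters: per-class means and covariances (K = 2, p = 2).
    pi = np.array([0.5, 0.5])
    mu = np.array([[0.0, 0.0], [2.0, 1.0]])
    Sigmas = np.array([[[1.0, 0.0], [0.0, 1.0]],
                       [[2.0, 0.5], [0.5, 1.0]]])

    def delta_qda(x):
        scores = []
        for k in range(len(pi)):
            Sk_inv = np.linalg.inv(Sigmas[k])
            scores.append(np.log(pi[k])
                          - 0.5 * mu[k] @ Sk_inv @ mu[k]
                          + x @ Sk_inv @ mu[k]
                          - 0.5 * x @ Sk_inv @ x
                          - 0.5 * np.log(np.linalg.det(Sigmas[k])))
        return np.array(scores)

    x = np.array([1.0, 0.5])
    print(delta_qda(x), delta_qda(x).argmax())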


Quadratic discriminant analysis (QDA)

- Bayes boundary
- LDA
- QDA

[Figure: two simulated datasets in the (X1, X2) plane comparing the Bayes, LDA, and QDA decision boundaries.]

15 / 21

Evaluating a classification method

We have talked about the 0-1 loss:

\frac{1}{m} \sum_{i=1}^{m} \mathbf{1}(y_i \neq \hat{y}_i).

It is possible to make the wrong prediction for some classes more often than for others. The 0-1 loss doesn't tell you anything about this.

A much more informative summary of the error is a confusion matrix, a table of predicted class versus true class:

16 / 21
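A quick way to compute both quantities, sketched here with hypothetical label vectors (this code is an addition to the notes):

    import numpy as np

    # Hypothetical true labels and predictions for a binary problem.
    y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
    y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 0])

    # 0-1 loss: fraction of misclassified points.
    zero_one_loss = np.mean(y_true != y_pred)

    # Confusion matrix: rows = true class, columns = predicted class.
    K = 2
    confusion = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1

    print(zero_one_loss)
    print(confusion)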


Example. Predicting default

Used LDA to predict credit card default in a dataset of 10K people.

Predicted yes if P(default = yes | X) > 0.5.

- The error rate among people who do not default (the false positive rate) is very low.
- However, the rate of false negatives is 76%.
- It is possible that false negatives are a bigger source of concern!
- One possible solution: change the threshold.

17 / 21


Example. Predicting default

Changing the threshold to 0.2 makes it easier to classify to yes.

Predicted yes if P(default = yes | X) > 0.2.

Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.

18 / 21
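The effect of the threshold is easy to explore once posterior probabilities are in hand; the sketch below (added to these notes, with simulated probabilities rather than the Default data) computes the false positive and false negative rates at thresholds of 0.5 and 0.2.

    import numpy as np

    rng = np.random.default_rng(2)
    # Hypothetical data: 1000 people, ~3% defaulters, with noisy posterior estimates.
    y = (rng.random(1000) < 0.03).astype(int)                          # 1 = default
    p_default = np.clip(0.03 + 0.5 * y + rng.normal(0, 0.15, 1000), 0, 1)

    for threshold in (0.5, 0.2):
        y_pred = (p_default > threshold).astype(int)
        fpr = np.mean(y_pred[y == 0] == 1)   # error among non-defaulters
        fnr = np.mean(y_pred[y == 1] == 0)   # error among defaulters
        print(threshold, round(fpr, 3), round(fnr, 3))

Lowering the threshold trades a higher false positive rate for a lower false negative rate, which is the pattern described above.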


Example. Predicting default

Let's visualize the dependence of the error on the threshold:

[Figure: error rates as a function of the threshold, from 0.0 to 0.5.]

- False negative rate (error for defaulting customers)
- False positive rate (error for non-defaulting customers)
- 0-1 loss, or total error rate.

19 / 21

Example. The ROC curve

[Figure: ROC curve, plotting the true positive rate against the false positive rate.]

- Displays the performance of the method for any choice of threshold.
- The area under the curve (AUC) measures the quality of the classifier:
  - 0.5 is the AUC for a random classifier.
  - The closer the AUC is to 1, the better.

20 / 21
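Here is a hedged sketch (added to these notes) of how one might trace the ROC curve and compute the AUC for simulated scores like those in the previous example; scikit-learn's roc_curve and roc_auc_score, if available, compute the same quantities.

    import numpy as np

    rng = np.random.default_rng(2)
    # Simulated labels and scores (hypothetical, as in the threshold example).
    y = (rng.random(1000) < 0.03).astype(int)
    scores = np.clip(0.03 + 0.5 * y + rng.normal(0, 0.15, 1000), 0, 1)

    # Sweep thresholds, recording (false positive rate, true positive rate).
    thresholds = np.linspace(0, 1, 201)
    fpr = np.array([np.mean(scores[y == 0] > t) for t in thresholds])
    tpr = np.array([np.mean(scores[y == 1] > t) for t in thresholds])

    # AUC by the trapezoidal rule (fpr increases as the threshold decreases).
    auc = np.trapz(tpr[::-1], fpr[::-1])
    print(auc)  # 0.5 for a random classifier; closer to 1 is better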


Next time

- Comparison of logistic regression, LDA, QDA, and KNN classification.
- Start Chapter 5: Resampling.

21 / 21