
Lecture 9: Classification, LDA

Reading: Chapter 4

STATS 202: Data mining and analysis

Jonathan Taylor, 10/12. Slide credits: Sergio Bacallado

1 / 21

Review: Main strategy in Chapter 4

Find an estimate P(Y | X). Then, given an input x_0, we predict the response as in a Bayes classifier:

\hat{y}_0 = \arg\max_y \hat{P}(Y = y \mid X = x_0).

2 / 21
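As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of this prediction rule, assuming the estimated posteriors are already available; the variable name `posterior` and the numbers are hypothetical.

    import numpy as np

    # Hypothetical estimated posteriors P(Y = y | X = x0) for classes 0, 1, 2.
    posterior = np.array([0.2, 0.5, 0.3])

    # Bayes classifier: predict the class with the largest estimated posterior.
    y0_hat = np.argmax(posterior)
    print(y0_hat)  # -> 1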

Linear Discriminant Analysis (LDA)

Instead of estimating P(Y | X), we will estimate:

1. P(X | Y): given the response, what is the distribution of the inputs?

2. P(Y): how likely is each of the categories?

Then, we use Bayes rule to obtain the estimate:

P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k)\, P(Y = k)}{\sum_j P(X = x \mid Y = j)\, P(Y = j)}

3 / 21
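To make the Bayes-rule step concrete, here is a small NumPy sketch (an illustration added to these notes, not from the slides): given hypothetical class-conditional densities f_k(x_0) and priors pi_k, the posterior is their product, renormalized to sum to one.

    import numpy as np

    # Hypothetical values of f_k(x0) = P(X = x0 | Y = k) and priors pi_k = P(Y = k).
    f_x0 = np.array([0.05, 0.20, 0.10])   # class-conditional densities at x0
    prior = np.array([0.3, 0.3, 0.4])     # priors, must sum to 1

    # Bayes rule: posterior proportional to f_k(x0) * pi_k.
    unnormalized = f_x0 * prior
    posterior = unnormalized / unnormalized.sum()
    print(posterior)           # estimated P(Y = k | X = x0) for each k
    print(posterior.argmax())  # predicted class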


Linear Discriminant Analysis (LDA)

Instead of estimating P(Y | X), we will estimate:

1. We model P(X = x | Y = k) = f_k(x) as a multivariate normal distribution:

[Figure: bivariate normal density contours for each class in the (X1, X2) plane.]

2. P(Y = k) = π_k is estimated by the fraction of training samples of class k.

4 / 21


LDA has linear decision boundaries

Suppose that:

- We know P(Y = k) = π_k exactly.

- P(X = x | Y = k) is multivariate normal with density:

f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}

μ_k: mean of the inputs for category k.
Σ: covariance matrix (common to all categories).

Then, what is the Bayes classifier?

5 / 21
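As a sanity check on the density formula, here is a short NumPy sketch (added for these notes) that evaluates f_k(x) directly and compares it with scipy.stats.multivariate_normal; the mean, covariance, and test point below are made up for illustration.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Hypothetical class mean mu_k and common covariance Sigma (p = 2).
    mu_k = np.array([1.0, -0.5])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])
    x = np.array([0.5, 0.0])

    # f_k(x) = (2*pi)^(-p/2) |Sigma|^(-1/2) exp(-0.5 (x - mu_k)^T Sigma^{-1} (x - mu_k))
    p = len(x)
    diff = x - mu_k
    quad = diff @ np.linalg.solve(Sigma, diff)
    f_k = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)

    print(f_k)
    print(multivariate_normal(mean=mu_k, cov=Sigma).pdf(x))  # should match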


LDA has linear decision boundaries

By Bayes rule, the probability of category k, given the input x is:

P(Y = k \mid X = x) = \frac{f_k(x)\, \pi_k}{P(X = x)}

The denominator does not depend on the response k, so we can write it as a constant:

P(Y = k \mid X = x) = C\, f_k(x)\, \pi_k

Now, expanding f_k(x):

P(Y = k \mid X = x) = \frac{C\, \pi_k}{(2\pi)^{p/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}

6 / 21


LDA has linear decision boundaries

P(Y = k \mid X = x) = \frac{C\, \pi_k}{(2\pi)^{p/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}

Now, let us absorb everything that does not depend on k into a constant C':

P(Y = k \mid X = x) = C'\, \pi_k\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}

and take the logarithm of both sides:

\log P(Y = k \mid X = x) = \log C' + \log \pi_k - \frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k).

The term log C' is the same for every category k, so we want to find the maximum of the remaining terms over k.

7 / 21


LDA has linear decision boundaries

Goal: maximize the following over k:

\log \pi_k - \frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)
  = \log \pi_k - \frac{1}{2}\left[ x^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} \mu_k \right] + x^T \Sigma^{-1} \mu_k
  = C'' + \log \pi_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k

We define the objective:

\delta_k(x) = \log \pi_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k

At an input x, we predict the response with the highest δ_k(x).

8 / 21
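Here is a minimal NumPy sketch (an addition to these notes) of the discriminant δ_k(x) and the resulting prediction rule, assuming the priors, class means, and common covariance are already known; all the numbers below are hypothetical.

    import numpy as np

    # Hypothetical parameters for K = 2 classes with p = 2 predictors.
    pi = np.array([0.4, 0.6])                    # priors pi_k
    mu = np.array([[0.0, 0.0], [2.0, 1.0]])      # rows are mu_k
    Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])   # common covariance
    Sigma_inv = np.linalg.inv(Sigma)

    def delta(x):
        # delta_k(x) = log pi_k - 0.5 mu_k^T Sigma^{-1} mu_k + x^T Sigma^{-1} mu_k
        return np.array([
            np.log(pi[k]) - 0.5 * mu[k] @ Sigma_inv @ mu[k] + x @ Sigma_inv @ mu[k]
            for k in range(len(pi))
        ])

    x = np.array([1.0, 0.5])
    print(delta(x))           # one discriminant score per class
    print(delta(x).argmax())  # predicted class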


LDA has linear decision boundaries

What is the decision boundary? It is the set of points at which two classes do just as well:

\delta_k(x) = \delta_\ell(x)

\log \pi_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k = \log \pi_\ell - \frac{1}{2} \mu_\ell^T \Sigma^{-1} \mu_\ell + x^T \Sigma^{-1} \mu_\ell

This is a linear equation in x.

[Figure: simulated data in the (X1, X2) plane with linear decision boundaries between the classes.]

9 / 21


Estimating π_k

\hat{\pi}_k = \frac{\#\{i : y_i = k\}}{n}

In English, the fraction of training samples of class k.

10 / 21

Estimating the parameters of fk(x)

Estimate the center of each class, μ_k:

\hat{\mu}_k = \frac{1}{\#\{i : y_i = k\}} \sum_{i : y_i = k} x_i

Estimate the common covariance matrix Σ:

- One predictor (p = 1):

\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2.

- Many predictors (p > 1): compute the vectors of deviations (x_1 - \hat{\mu}_{y_1}), (x_2 - \hat{\mu}_{y_2}), \dots, (x_n - \hat{\mu}_{y_n}) and use an unbiased estimate of their covariance matrix, \hat{\Sigma}.

11 / 21
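These estimates are easy to compute; the following NumPy sketch (added for these notes, with made-up training data) computes π̂_k, μ̂_k, and the pooled covariance Σ̂ with the unbiased 1/(n − K) normalization.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical training data: n = 100 points, p = 2 predictors, K = 2 classes.
    X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
                   rng.normal([2, 1], 1.0, size=(50, 2))])
    y = np.repeat([0, 1], 50)
    classes = np.unique(y)
    n, K = len(y), len(classes)

    pi_hat = np.array([np.mean(y == k) for k in classes])          # priors
    mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])   # class means

    # Pooled covariance: within-class scatter divided by n - K.
    deviations = X - mu_hat[y]                # x_i - mu_hat_{y_i}
    Sigma_hat = deviations.T @ deviations / (n - K)

    print(pi_hat)
    print(mu_hat)
    print(Sigma_hat)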


LDA prediction

For an input x, predict the class with the largest:

\hat{\delta}_k(x) = \log \hat{\pi}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k

The decision boundaries are defined by:

\log \hat{\pi}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k = \log \hat{\pi}_\ell - \frac{1}{2} \hat{\mu}_\ell^T \hat{\Sigma}^{-1} \hat{\mu}_\ell + x^T \hat{\Sigma}^{-1} \hat{\mu}_\ell

Solid lines in:

[Figure: simulated data in the (X1, X2) plane; the solid lines are the estimated LDA decision boundaries.]

12 / 21
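Putting the pieces together, here is a compact sketch (added to these notes, using simulated data) that fits the LDA parameters as above and predicts with argmax_k δ̂_k(x); the data-generating choices are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    # Simulated two-class training data (hypothetical).
    X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
                   rng.normal([2, 1], 1.0, size=(50, 2))])
    y = np.repeat([0, 1], 50)
    classes = np.unique(y)
    n, K = len(y), len(classes)

    # Fit: priors, class means, pooled covariance.
    pi_hat = np.array([np.mean(y == k) for k in classes])
    mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])
    dev = X - mu_hat[y]
    Sigma_inv = np.linalg.inv(dev.T @ dev / (n - K))

    # Predict: argmax_k of delta_k(x) for every row of X_new.
    def predict(X_new):
        quad = np.sum(mu_hat @ Sigma_inv * mu_hat, axis=1)   # mu_k^T Sigma^{-1} mu_k
        scores = np.log(pi_hat) - 0.5 * quad + X_new @ Sigma_inv @ mu_hat.T
        return scores.argmax(axis=1)

    print(np.mean(predict(X) == y))  # training accuracy

If scikit-learn is available, fitting sklearn.discriminant_analysis.LinearDiscriminantAnalysis on the same (X, y) should give essentially the same predicted labels, since it fits the same Gaussian model with a common covariance.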


Quadratic discriminant analysis (QDA)

The assumption that the inputs of every class have the same covariance can be quite restrictive:

[Figure: two simulated datasets in the (X1, X2) plane in which the two classes have different covariance matrices.]

13 / 21

Quadratic discriminant analysis (QDA)

In quadratic discriminant analysis, we estimate a mean μ_k and a covariance matrix Σ_k for each class separately.

Given an input, it is easy to derive an objective function:

\delta_k(x) = \log \pi_k - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2} x^T \Sigma_k^{-1} x - \frac{1}{2} \log |\Sigma_k|

This objective is now quadratic in x, and so are the decision boundaries.

14 / 21
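For contrast with the LDA discriminant, here is a brief NumPy sketch (an addition to these notes, with made-up parameters) of the QDA objective δ_k(x), which now involves a per-class covariance Σ_k and the extra quadratic and log-determinant terms.

    import numpy as np

    # Hypothetical parameters: per-class means and covariances (K = 2, p = 2).
    pi = np.array([0.5, 0.5])
    mu = np.array([[0.0, 0.0], [2.0, 1.0]])
    Sigmas = np.array([[[1.0, 0.0], [0.0, 1.0]],
                       [[2.0, 0.5], [0.5, 1.0]]])

    def delta_qda(x):
        scores = []
        for k in range(len(pi)):
            Sk_inv = np.linalg.inv(Sigmas[k])
            scores.append(np.log(pi[k])
                          - 0.5 * mu[k] @ Sk_inv @ mu[k]
                          + x @ Sk_inv @ mu[k]
                          - 0.5 * x @ Sk_inv @ x
                          - 0.5 * np.log(np.linalg.det(Sigmas[k])))
        return np.array(scores)

    x = np.array([1.0, 0.5])
    print(delta_qda(x), delta_qda(x).argmax())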


Quadratic discriminant analysis (QDA)

- Bayes boundary
- LDA
- QDA

[Figure: two simulated datasets in the (X1, X2) plane comparing the Bayes, LDA, and QDA decision boundaries.]

15 / 21

Evaluating a classification method

We have talked about the 0-1 loss:

\frac{1}{m} \sum_{i=1}^{m} \mathbf{1}(y_i \neq \hat{y}_i).

It is possible to make the wrong prediction for some classes more often than for others. The 0-1 loss doesn't tell you anything about this.

A much more informative summary of the error is a confusion matrix, a table of predicted class versus true class:

16 / 21
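A quick way to compute both quantities, sketched here with hypothetical label vectors (this code is an addition to the notes):

    import numpy as np

    # Hypothetical true labels and predictions for a binary problem.
    y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
    y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 0])

    # 0-1 loss: fraction of misclassified points.
    zero_one_loss = np.mean(y_true != y_pred)

    # Confusion matrix: rows = true class, columns = predicted class.
    K = 2
    confusion = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1

    print(zero_one_loss)
    print(confusion)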


Example. Predicting default

Used LDA to predict credit card default in a dataset of 10K people.

Predicted yes if P(default = yes | X) > 0.5.

- The error rate among people who do not default (the false positive rate) is very low.
- However, the rate of false negatives is 76%.
- It is possible that false negatives are a bigger source of concern!
- One possible solution: change the threshold.

17 / 21


Example. Predicting default

Changing the threshold to 0.2 makes it easier to classify to yes.

Predicted yes if P(default = yes | X) > 0.2.

Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.

18 / 21
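The effect of the threshold is easy to explore once posterior probabilities are in hand; the sketch below (added to these notes, with simulated probabilities rather than the Default data) computes the false positive and false negative rates at thresholds of 0.5 and 0.2.

    import numpy as np

    rng = np.random.default_rng(2)
    # Hypothetical data: 1000 people, ~3% defaulters, with noisy posterior estimates.
    y = (rng.random(1000) < 0.03).astype(int)                          # 1 = default
    p_default = np.clip(0.03 + 0.5 * y + rng.normal(0, 0.15, 1000), 0, 1)

    for threshold in (0.5, 0.2):
        y_pred = (p_default > threshold).astype(int)
        fpr = np.mean(y_pred[y == 0] == 1)   # error among non-defaulters
        fnr = np.mean(y_pred[y == 1] == 0)   # error among defaulters
        print(threshold, round(fpr, 3), round(fnr, 3))

Lowering the threshold trades a higher false positive rate for a lower false negative rate, which is the pattern described above.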


Example. Predicting default

Let's visualize the dependence of the error on the threshold:

[Figure: error rates as a function of the threshold, from 0.0 to 0.5.]

- False negative rate (error for defaulting customers)
- False positive rate (error for non-defaulting customers)
- 0-1 loss, or total error rate.

19 / 21

Example. The ROC curve

[Figure: ROC curve, plotting the true positive rate against the false positive rate.]

- Displays the performance of the method for any choice of threshold.
- The area under the curve (AUC) measures the quality of the classifier:
  - 0.5 is the AUC for a random classifier.
  - The closer the AUC is to 1, the better.

20 / 21
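Here is a hedged sketch (added to these notes) of how one might trace the ROC curve and compute the AUC for simulated scores like those in the previous example; scikit-learn's roc_curve and roc_auc_score, if available, compute the same quantities.

    import numpy as np

    rng = np.random.default_rng(2)
    # Simulated labels and scores (hypothetical, as in the threshold example).
    y = (rng.random(1000) < 0.03).astype(int)
    scores = np.clip(0.03 + 0.5 * y + rng.normal(0, 0.15, 1000), 0, 1)

    # Sweep thresholds, recording (false positive rate, true positive rate).
    thresholds = np.linspace(0, 1, 201)
    fpr = np.array([np.mean(scores[y == 0] > t) for t in thresholds])
    tpr = np.array([np.mean(scores[y == 1] > t) for t in thresholds])

    # AUC by the trapezoidal rule (fpr increases as the threshold decreases).
    auc = np.trapz(tpr[::-1], fpr[::-1])
    print(auc)  # 0.5 for a random classifier; closer to 1 is better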


Next time

- Comparison of logistic regression, LDA, QDA, and KNN classification.
- Start Chapter 5: Resampling.

21 / 21