Lecture 8: Multiclass Classification (I)
Hao Helen Zhang
University of Arizona, MATH 574M (Fall 2020)



Multiclass Classification

General setup

Bayes rule for multiclass problems

under equal costs; under unequal costs

Linear multiclass methods

Linear regression model; linear discriminant analysis (LDA); multiple logistic regression models


Multiclass Problems

Class label Y ∈ {1, ..., K}, K ≥ 3.

The classifier f : R^d → {1, ..., K}. Examples: zip code classification, multiple-type cancer classification, vowel recognition.

The loss function

C(k, l) = cost of classifying a sample in class k as class l.

In general, C(k, k) = 0 for any k = 1, ..., K. The loss can be described as a K × K matrix. For example, for K = 3,

C =
    0 1 1
    1 0 1
    1 1 0


Terminology

Prior class probabilities (marginal distribution of Y ):

πk = P(Y = k), k = 1, · · · ,K

Conditional distribution of X given Y : assume that

gk is the conditional density of X given Y = k , k = 1, · · · ,K .

Marginal density of X (a mixture):

g(x) = ∑_{k=1}^{K} π_k g_k(x).

Posterior class probabilities (conditional dist. of Y given X):

P(Y = k | X = x) = π_k g_k(x) / g(x), k = 1, ..., K.


Bayes Rule

The optimal (Bayes) rule aims to minimize the average loss function over the population:

φ_B(x) = arg min_f E_{X,Y} C(Y, f(X)) = arg min_f E_X E_{Y|X} C(Y, f(X)),

where

E_{Y|X} C(Y, f(X)) = ∑_{k=1}^{K} I(f(X) = k) ∑_{l=1}^{K} C(l, k) P(Y = l | X).


Bayes Rule Under Equal Costs

If C(k, l) = I(k ≠ l), then R[f] = E_X E_{Y|X} I(Y ≠ f(X)) and

E_{Y|X} C(Y, f(X)) = ∑_{k=1}^{K} I(f(X) = k) ∑_{l≠k} P(Y = l | X)
                   = ∑_{k=1}^{K} I(f(X) = k) [1 − P(Y = k | X)].

Given x, the minimizer of E_{Y|X=x} C(Y, f(X)) is f(x) = k*, where

k* = arg min_{k=1,...,K} [1 − P(Y = k | x)].

The Bayes rule is

φ_B(x) = k*  if  P(Y = k* | x) = max_{k=1,...,K} P(Y = k | x),

which assigns x to the most probable class using P(Y = k | X = x).


Bayes Rule Under Unequal Costs

For the general loss function, the Bayes rule can be derived as

φ_B(x) = k*  if  k* = arg min_{k=1,...,K} ∑_{l=1}^{K} C(l, k) P(Y = l | x).

In general, there is no simple analytic form for the solution.
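As a concrete illustration (not from the original slides), the following Python sketch evaluates the Bayes rule at a single point: given a hypothetical K = 3 cost matrix C and made-up posterior probabilities, it picks the class minimizing the expected cost, and shows that under 0-1 costs the rule reduces to the most probable class.

```python
import numpy as np

# Hypothetical K = 3 cost matrix: C[l, k] = cost of classifying a class-l sample as class k.
C = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])

# Made-up posterior probabilities P(Y = l | X = x) at a single point x.
posterior = np.array([0.50, 0.30, 0.20])

# Expected cost of predicting class k: sum_l C(l, k) P(Y = l | x).
expected_cost = posterior @ C
bayes_class = int(np.argmin(expected_cost))      # Bayes rule under unequal costs

# Under equal (0-1) costs the rule reduces to the most probable class.
equal_cost_class = int(np.argmax(posterior))

print(expected_cost, bayes_class, equal_cost_class)
```

With these made-up numbers the unequal-cost rule picks a different class than the most probable one, which is exactly why the general rule cannot be reduced to "argmax posterior".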


Decision (Discriminating) Functions

For multiclass problems, we generally need to estimate multiple discriminant functions f_k(x), k = 1, ..., K.

Each fk(x) is associated with class k .

f_k(x) represents the strength of evidence that a sample (x, y) belongs to class k.

The decision rule constructed using fk ’s is

f(x) = k*, where k* = arg max_{k=1,...,K} f_k(x).

The decision boundary of the classification rule f between class k and class l is defined as

{x : f_k(x) = f_l(x)}.


Traditional Methods: Divide and Conquer

Main ideas:

(i) Decompose the multiclass classification problem into multiple binary classification problems.

(ii) Use the majority voting principle (a combined decision from the committee) to predict the label.

Common approaches: simple but effective

One-vs-rest (one-vs-all) approaches

Pairwise (one-vs-one, all-vs-all) approaches


One-vs-rest Approach

One of the simplest multiclass classifiers; commonly used in SVMs; also known as the one-vs-all (OVA) approach.

(i) Solve K different binary problems: classify "class k" versus "the rest" for k = 1, ..., K.

(ii) Assign a test sample to the class giving the largest (most positive) value of f_k(x), where f_k(x) is the solution from the kth problem.

Properties:

Very simple to implement; performs well in practice.

Not optimal (asymptotically): the decision rule is not Fisher consistent if there is no dominating class (i.e., if max_k p_k(x) < 1/2).

Read: Rifkin and Klautau (2004), "In Defense of One-vs-all Classification".
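A minimal sketch of the one-vs-rest scheme, assuming a scikit-learn-style binary classifier (LogisticRegression here; any binary scorer with a real-valued output would do). The helper names ova_fit / ova_predict and the data are illustrative, not part of the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # any binary classifier with a score would do

def ova_fit(X, y, K):
    """Fit K binary problems: 'class k' versus 'the rest', k = 0, ..., K-1."""
    return [LogisticRegression().fit(X, (y == k).astype(int)) for k in range(K)]

def ova_predict(classifiers, X):
    """Assign each sample to the class whose binary classifier gives the largest score f_k(x)."""
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return np.argmax(scores, axis=1)

# Synthetic 3-class example in R^2 (made-up data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 20)
print(ova_predict(ova_fit(X, y, K=3), X[:5]))
```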


Pairwise Approach

Also known as all-vs-all (AVA) approach

(i) Solve (K choose 2) = K(K−1)/2 different binary problems: classify "class k" versus "class j" for all pairs j ≠ k. Each classifier is called g_kj.

(ii) For prediction at a point, each classifier is queried once and issues a vote. The class with the maximum number of (weighted) votes is the winner.

Properties:

Training is efficient, since each binary problem is small.

If K is large, there are many problems to solve; for K = 10 we already need to train 45 binary classifiers.

Simple to implement; performs competitively in practice.

Read: Park and Furnkranz (2007), "Efficient Pairwise Classification".
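A matching sketch of the pairwise (one-vs-one) scheme with simple majority voting; again the base classifier and the helper names (ovo_fit, ovo_predict) are illustrative choices, not part of the slides.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression  # placeholder binary classifier

def ovo_fit(X, y, K):
    """Fit one binary classifier per pair of classes (k, j), k < j."""
    models = {}
    for k, j in combinations(range(K), 2):
        mask = (y == k) | (y == j)
        models[(k, j)] = LogisticRegression().fit(X[mask], (y[mask] == j).astype(int))
    return models

def ovo_predict(models, X, K):
    """Each pairwise classifier casts one vote; the class with the most votes wins."""
    votes = np.zeros((X.shape[0], K), dtype=int)
    for (k, j), clf in models.items():
        pred = clf.predict(X)          # 1 votes for class j, 0 votes for class k
        votes[:, j] += pred
        votes[:, k] += 1 - pred
    return np.argmax(votes, axis=1)
```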


Linear Classifier for Multiclass Problems

Assume the decision boundaries are linear. Common linear classification methods:

Multivariate linear regression methods

Linear log-odds (logit) models

Linear discriminant analysis (LDA)

Multiple logistic regression

Separating hyperplanes – explicitly model the boundaries as linear

Perceptron model, SVMs


Coding for Linear Regression

For each class k , code it using an indicator variable Zk :

Zk = 1 if Y = k;

Zk = 0 if Y ≠ k.

The response y is coded using a vector z = (z1, ..., zK). If Y = k, only the kth component of z equals 1.

The entire training sample can be coded using an n × K matrix, denoted by Z.

Call the matrix Z the indicator response matrix.
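For example, the indicator response matrix can be built in a few lines of Python (a sketch; labels are taken as 0, ..., K−1 rather than 1, ..., K):

```python
import numpy as np

def indicator_response_matrix(y, K):
    """Code labels y in {0, ..., K-1} as the n x K indicator response matrix Z."""
    Z = np.zeros((len(y), K))
    Z[np.arange(len(y)), y] = 1.0   # Z[i, k] = 1 exactly when y_i = k
    return Z

# Example: five samples, K = 3 classes (labels are illustrative).
print(indicator_response_matrix(np.array([0, 2, 1, 1, 0]), K=3))
```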


Multivariate Linear Regression

Consider the multivariate model

Z = XB + E, or Z_k = f_k ≡ X b_k + ε_k, k = 1, ..., K.

Zk is the kth column of the indicator response matrix Z .

X is the n × (d + 1) input matrix.

E is the n × K matrix of errors.

B is the (d + 1) × K matrix of parameters; b_k is the kth column of B.


Ordinary Least Squares Estimates

Minimize the residual sum of squares

RSS(B) = ∑_{k=1}^{K} ∑_{i=1}^{n} [z_ik − β_0k − ∑_{j=1}^{d} β_jk x_ij]²
       = tr[(Z − XB)^T (Z − XB)].

The LS estimate has the same form as in the univariate case:

B̂ = (X^T X)^{-1} X^T Z,
Ẑ = X (X^T X)^{-1} X^T Z ≡ X B̂.

The coefficient b_k for the kth response is the same as the univariate OLS estimate. (Multiple outputs do not affect the OLS estimates.)

For a new x, compute the fitted vector f(x) = [(1, x^T) B̂]^T and classify it as

ŷ(x) = arg max_{k=1,...,K} f_k(x).
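A small sketch of the whole procedure, assuming 0-based labels and using numpy's least-squares solver in place of the explicit (X^T X)^{-1} X^T Z formula; the function names are made up for illustration.

```python
import numpy as np

def fit_indicator_regression(X, y, K):
    """OLS fit of the indicator response matrix Z on (1, X); returns B of shape (d+1, K)."""
    n = X.shape[0]
    X1 = np.column_stack([np.ones(n), X])            # add the intercept column
    Z = np.zeros((n, K))
    Z[np.arange(n), y] = 1.0                          # indicator response matrix
    B, *_ = np.linalg.lstsq(X1, Z, rcond=None)        # numerically stable (X^T X)^{-1} X^T Z
    return B

def predict_indicator_regression(B, X):
    """Classify each row of X as argmax_k f_k(x), where f(x)^T = (1, x^T) B."""
    X1 = np.column_stack([np.ones(X.shape[0]), X])
    return np.argmax(X1 @ B, axis=1)
```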


About Multiple Linear Regression

Justification: each f_k is an estimate of the conditional expectation of Y_k, which is the conditional probability that (X, Y) belongs to class k:

E(Y_k | X = x) = Pr(Y = k | X = x).

If the intercept is in the model (a column of 1s in X), then ∑_{k=1}^{K} f_k(x) = 1 for any x.

If the linear assumption is appropriate, then f_k gives a consistent estimate of the probability for k = 1, ..., K, as the sample size n goes to infinity.


Limitations of Linear Regression

There is no guarantee that all f_k ∈ [0, 1]; some can be negative or greater than 1.

Linear structure of X's can be rigid.

Serious masking problem for K ≥ 3 due to the rigid linear structure.


[Figure 4.2 (Elements of Statistical Learning, Hastie, Tibshirani & Friedman, 2001, Chapter 4; panels: Linear Regression, Linear Discriminant Analysis; axes X1, X2): The data come from three classes in R^2 and are easily separated by linear decision boundaries. The right plot shows the boundaries found by linear discriminant analysis. The left plot shows the boundaries found by linear regression of the indicator response variables. The middle class is completely masked (never dominates).]


[Figure 4.3 (Elements of Statistical Learning, Chapter 4; panels: "Degree = 1; Error = 0.25" and "Degree = 2; Error = 0.03"): The effects of masking on linear regression in R^1 for a three-class problem. The rug plot at the base indicates the positions and class membership of each observation. The three curves in each panel are the fitted regressions to the three-class indicator variables; for example, for the red class, y_red is 1 for the red observations, and 0 for the green and blue. The fits are linear and quadratic polynomials. Above each plot is the training error rate. The Bayes error rate is 0.025 for this problem, as is the LDA error rate.]


About Masking Problems

One-dimensional example: (see the previous figure)

Project the data onto the line joining the three centroids.

There is no information in the northwest-southeast direction

The three curves are the fitted regression lines.

The left panel shows the linear regression fit: the middle class is never dominant! The right panel shows the quadratic fit: the masking problem is solved.


How to Solve Masking Problems

In the previous example with K = 3:

Use quadratic regression rather than linear regression.

Linear regression error: 0.25; quadratic regression error: 0.03; Bayes error: 0.025.

In general,

If K ≥ 3 classes are lined up, we need polynomial terms up to degree K − 1 to resolve the masking problem.

In the worst case, we need O(p^{K−1}) terms.

Though LDA is also based on linear functions, it does not suffer from masking problems (see the sketch below).
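The masking effect is easy to reproduce numerically. The sketch below (synthetic one-dimensional data with three classes lined up; all numbers are made up) fits indicator regressions with linear and with quadratic features and compares training errors:

```python
import numpy as np

# Synthetic 1-D data: three classes of equal size lined up along x (all numbers made up).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(m, 0.5, 100) for m in (-3.0, 0.0, 3.0)])
y = np.repeat([0, 1, 2], 100)
Z = np.zeros((300, 3))
Z[np.arange(300), y] = 1.0                     # indicator responses

def train_error(features):
    """OLS indicator regression on the given basis, then argmax classification."""
    B, *_ = np.linalg.lstsq(features, Z, rcond=None)
    return np.mean(np.argmax(features @ B, axis=1) != y)

ones = np.ones_like(x)
print("linear fit   :", train_error(np.column_stack([ones, x])))        # middle class masked
print("quadratic fit:", train_error(np.column_stack([ones, x, x**2])))  # masking resolved
```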


Large K , Small d Problem

Masking often occurs for large K and small d. Example: vowel data with K = 11 and d = 10. (The figure below shows the training data in two dimensions using the LDA projection.)

A difficult classification problem, as the class overlap is considerable.

The best methods achieve about 40% test error.

Method                Training Error    Test Error
Linear regression     0.48              0.67
LDA                   0.32              0.56
QDA                   0.01              0.53
Logistic regression   0.22              0.51


[Figure (Elements of Statistical Learning, Chapter 4): Linear Discriminant Analysis projection of the vowel training data, plotted as Coordinate 1 for Training Data versus Coordinate 2 for Training Data.]


Model Assumptions for Linear Discriminant Analysis

Model Setup

Let π_k be the prior probability of class k.

Let g_k(x) be the class-conditional density of X in class k.

The posterior probability

Pr(Y = k | X = x) = π_k g_k(x) / ∑_{l=1}^{K} π_l g_l(x).

Model Assumptions:

Assume each class density is multivariate Gaussian N(µk ,Σk).

Further, we assume Σ_k = Σ for all k (equal covariance required!).


Linear Discriminant Function

The log-ratio between class k and l is

log [Pr(Y = k | X = x) / Pr(Y = l | X = x)]
  = log(π_k / π_l) + log[g_k(x) / g_l(x)]
  = [x^T Σ^{-1} µ_k − (1/2) µ_k^T Σ^{-1} µ_k + log π_k] − [x^T Σ^{-1} µ_l − (1/2) µ_l^T Σ^{-1} µ_l + log π_l].

For each class k, its discriminant function is

f_k(x) = x^T Σ^{-1} µ_k − (1/2) µ_k^T Σ^{-1} µ_k + log π_k.

The quadratic term cancels due to the equal covariance matrices. The decision rule is

Ŷ(x) = arg max_{k=1,...,K} f_k(x).
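Assuming the parameters (µ_k, Σ, π_k) are known or already estimated, the discriminant functions and the decision rule can be evaluated as in this sketch (helper names are illustrative):

```python
import numpy as np

def lda_discriminants(X, means, Sigma, priors):
    """Evaluate f_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k for each class."""
    Sigma_inv = np.linalg.inv(Sigma)
    cols = []
    for mu, pi in zip(means, priors):
        a = Sigma_inv @ mu                            # linear coefficients
        b = -0.5 * mu @ Sigma_inv @ mu + np.log(pi)   # constant term
        cols.append(X @ a + b)
    return np.column_stack(cols)                       # shape (n, K)

def lda_classify(X, means, Sigma, priors):
    """Decision rule: Y_hat(x) = argmax_k f_k(x)."""
    return np.argmax(lda_discriminants(X, means, Sigma, priors), axis=1)
```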


[Figure 4.5 (Elements of Statistical Learning, Chapter 4): The left panel shows three Gaussian distributions, with the same covariance and different means. Included are the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former). On the right we see a sample of 30 drawn from each Gaussian distribution, and the fitted LDA decision boundaries.]


Interpretation of LDA

log Pr(Y = k | X = x) = −(1/2)(x − µ_k)^T Σ^{-1}(x − µ_k) + log π_k + const.

The constant term does not involve x.

If the prior probabilities are the same, LDA classifies x to the class whose centroid is closest to x, using the squared Mahalanobis distance d_Σ(x, µ_k) based on the common covariance matrix.

If d_Σ(x, µ_k) = d_Σ(x, µ_l), then the priors determine the classification.

Special Case: Σk = I for all k

f_k(x) = −(1/2) ‖x − µ_k‖² + log π_k.

only Euclidean distance is needed!


Mahalanobis Distance

The Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936.

Def: The Mahalanobis distance of x = (x_1, ..., x_d)^T from a set of points with mean µ = (µ_1, ..., µ_d)^T and covariance matrix Σ is defined as

d_Σ(x, µ) = [(x − µ)^T Σ^{-1} (x − µ)]^{1/2}.

It differs from Euclidean distance in that

it takes into account the correlations of the data set

it is scale-invariant.

It can also be defined as a dissimilarity measure between two points x and x′ from the same distribution with covariance matrix Σ:

d_Σ(x, x′) = [(x − x′)^T Σ^{-1} (x − x′)]^{1/2}.
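A direct translation of the definition into code (a sketch; the example numbers are made up):

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """d_Sigma(x, mu) = sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

# Illustrative correlated 2-D example (made-up numbers).
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
print(mahalanobis([1.0, 1.0], [0.0, 0.0], Sigma))
```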


Special Cases

If Σ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.

If Σ is diagonal, Σ = diag(σ_1², ..., σ_d²), then the resulting distance measure is called the normalized Euclidean distance:

d_Σ(x, x′) = [ ∑_{i=1}^{d} (x_i − x′_i)² / σ_i² ]^{1/2},

where σ_i is the standard deviation of the x_i and x′_i.


Normalized Distance

Assume we want to decide whether x belongs to a class. Intuitively, the decision relies on the distance of x to the class center µ: the closer x is to µ, the more likely it belongs to the class.

To decide whether a given distance is large or small, we need to know whether the points in the class are spread out over a large range or a small range. Statistically, we measure the spread by the standard deviation of the distances of the sample points to µ.

If d_Σ(x, µ) is less than one standard deviation, then it is highly probable that x belongs to the class. The further away it is, the more likely it is that x does not belong to the class. This is the concept of the normalized distance.


Motivation of Mahalanobis Distance

The drawback of the normalized distance is that the sample points are assumed to be distributed about µ in a spherical manner.

For non-spherical situations, for instance ellipsoidal, the probability of x belonging to the class depends not only on the distance from µ, but also on the direction.

In those directions where the ellipsoid has a short axis, x must be closer. In those where the axis is long, x can be further away from the center.

The ellipsoid that best represents the class's probability distribution can be estimated by the covariance matrix of the samples.


Linear Decision Boundary of LDA

Linear discriminant functions

f_k(x) = β_{1k}^T x + β_{0k},   f_j(x) = β_{1j}^T x + β_{0j}.

Decision boundary function between class k and j is

(β_{1k} − β_{1j})^T x + (β_{0k} − β_{0j}) = β_1^T x + β_0 = 0.

Since β_{1k} = Σ^{-1} µ_k and β_{1j} = Σ^{-1} µ_j, the decision boundary has the directional vector

β_1 = Σ^{-1}(µ_k − µ_j).

β_1 is generally not in the direction of µ_k − µ_j.

The discriminant direction β_1 minimizes the overlap for Gaussian data.


[Figure 4.9 (Elements of Statistical Learning, Chapter 4): Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel). The discriminant direction minimizes this overlap for Gaussian data (right panel).]


Parameter Estimation in LDA

In practice, we estimate the parameters from the training data

π_k = n_k / n, where n_k is the sample size in class k.

µ_k = ∑_{Y_i = k} x_i / n_k.

The sample covariance matrix for the kth class:

S_k = (1 / (n_k − 1)) ∑_{Y_i = k} (x_i − µ_k)(x_i − µ_k)^T.

(Unbiased) pooled sample covariance is a weighted average

Σ = ∑_{k=1}^{K} [(n_k − 1) / ∑_{l=1}^{K} (n_l − 1)] S_k
  = ∑_{k=1}^{K} ∑_{Y_i = k} (x_i − µ_k)(x_i − µ_k)^T / (n − K).
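The plug-in estimates above translate directly into code. This sketch (with 0-based labels; function name is illustrative) returns the estimated priors, class means, and pooled covariance, which can then be fed into the discriminant-function sketch given earlier:

```python
import numpy as np

def lda_estimate(X, y, K):
    """Estimate priors pi_k, class means mu_k, and the pooled covariance Sigma for LDA."""
    n, d = X.shape
    priors = np.array([np.mean(y == k) for k in range(K)])         # pi_k = n_k / n
    means = np.array([X[y == k].mean(axis=0) for k in range(K)])   # mu_k
    Sigma = np.zeros((d, d))
    for k in range(K):
        diff = X[y == k] - means[k]
        Sigma += diff.T @ diff               # sum_{Y_i = k} (x_i - mu_k)(x_i - mu_k)^T
    return priors, means, Sigma / (n - K)    # pooled (unbiased) covariance
```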


Implementation of LDA

Compute the eigen-decomposition:

Σ = U D U^T.

Sphere the data using Σ, via the transformation

X* = D^{-1/2} U^T X.

The common covariance estimate of X* is the identity matrix.

In the transformed space, classify the point to the closest transformed centroid, with the adjustment using the π_k's.
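A sketch of this sphering implementation, assuming the estimated means, pooled covariance, and priors are already available (names are illustrative):

```python
import numpy as np

def lda_sphere_classify(X, means, Sigma, priors):
    """Sphere the data with the pooled covariance, then assign each point to the
    closest transformed centroid after adjusting for the class priors."""
    D, U = np.linalg.eigh(Sigma)                 # Sigma = U diag(D) U^T
    W = U / np.sqrt(D)                           # X* = X W has (estimated) identity covariance
    Xs, Ms = X @ W, np.asarray(means) @ W        # transformed data and centroids
    # Squared distances to each transformed centroid, adjusted by -2 log(pi_k).
    d2 = ((Xs[:, None, :] - Ms[None, :, :]) ** 2).sum(axis=2) - 2.0 * np.log(priors)
    return np.argmin(d2, axis=1)
```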


Connection between LDA and Linear Regression for K = 2

Suppose we code the targets in the two classes as +1 and −1. Fit the linear model

y_i = β_0 + β_1^T x_i,  i = 1, ..., n.

Then the OLS estimates (β_0, β_1) satisfy

β_1 ∝ Σ^{-1}(µ_1 − µ_2).

β_0 is different from the LDA intercept unless n_1 = n_2.

Linear regression therefore gives a different decision rule from LDA, unless n_1 = n_2.


Multiple Logistic Models

Model the K posterior probabilities by linear functions of x. We use logit transformations:

log [Pr(Y = 1 | X = x) / Pr(Y = K | X = x)] = β_10 + β_1^T x
log [Pr(Y = 2 | X = x) / Pr(Y = K | X = x)] = β_20 + β_2^T x
    ...
log [Pr(Y = K−1 | X = x) / Pr(Y = K | X = x)] = β_(K−1)0 + β_{K−1}^T x

The choice of denominator in the odds ratios is arbitrary. The parameter vector is

θ = {β_10, β_1, ..., β_(K−1)0, β_{K−1}}.


Probability Estimates

We use the logit transformation to ensure that

the total probabilities sum to one

each estimated probability lies in [0, 1]

p_k(x) ≡ Pr(Y = k | x) = exp(β_k0 + β_k^T x) / [1 + ∑_{l=1}^{K−1} exp(β_l0 + β_l^T x)],  for k = 1, ..., K − 1,

p_K(x) ≡ Pr(Y = K | x) = 1 / [1 + ∑_{l=1}^{K−1} exp(β_l0 + β_l^T x)].

Check: ∑_{k=1}^{K} p_k(x) = 1 for any x.
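The probability formulas can be checked numerically with a short sketch (parameter shapes and the function name are illustrative; class K is the baseline and is placed in the last column):

```python
import numpy as np

def multilogit_probs(X, intercepts, coefs):
    """Posterior probabilities for the multiple logistic model (class K is the baseline).

    intercepts has shape (K-1,) and coefs has shape (K-1, d), holding beta_k0 and beta_k.
    Returns an (n, K) matrix whose rows sum to one; the last column is the baseline class.
    """
    eta = X @ coefs.T + intercepts                       # beta_k0 + beta_k^T x, shape (n, K-1)
    expeta = np.exp(eta)
    denom = 1.0 + expeta.sum(axis=1, keepdims=True)
    return np.column_stack([expeta / denom, 1.0 / denom[:, 0]])
```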


Maximum Likelihood Estimate for Logistic Models

The joint conditional log-likelihood of the y_i given the x_i is

ℓ(θ) = ∑_{i=1}^{n} log p_{y_i}(θ; x_i).

Using the Newton-Raphson method:

solve iteratively via re-weighted least squares (IRLS);

this provides consistent estimates if the model is specified correctly.
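For completeness, a self-contained sketch of the log-likelihood that is being maximized (labels coded 0, ..., K−1 with the last class as the baseline; the function name is made up); the actual maximization via Newton-Raphson / IRLS is left to standard software.

```python
import numpy as np

def multilogit_loglik(X, y, intercepts, coefs):
    """l(theta) = sum_i log p_{y_i}(theta; x_i) for the multiple logistic model."""
    eta = X @ coefs.T + intercepts                          # (n, K-1) linear predictors
    denom = 1.0 + np.exp(eta).sum(axis=1, keepdims=True)
    probs = np.column_stack([np.exp(eta) / denom, 1.0 / denom[:, 0]])
    return float(np.sum(np.log(probs[np.arange(len(y)), y])))
```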
