Linear Methods for Classification Jie Lu, Joy, Lucian {jielu+,joy+, llita+}@cs.cmu.edu


Page 1:

Linear Methods for Classification

Jie Lu, Joy, Lucian

{jielu+,joy+, llita+}@cs.cmu.edu

Page 2:

Linear Methods for Classification

• What are they? Methods that give linear decision boundaries between classes.

Linear decision boundary: $\{x : \beta_0 + \beta^T x = 0\}$ (a short illustrative sketch follows at the end of this page)

• How to define decision boundaries? Two classes of methods:

– Model discriminant functions $\delta_k(x)$ for each class as linear

– Model the boundaries between classes as linear
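
As an illustrative aside (not from the slides), a two-class linear decision rule simply checks which side of the boundary $\{x : \beta_0 + \beta^T x = 0\}$ a point falls on; the coefficients below are placeholder values.

```python
import numpy as np

# Placeholder coefficients of a linear decision boundary {x : beta0 + beta^T x = 0}
beta0 = -1.0
beta = np.array([2.0, -0.5])

def classify(x):
    """Assign class 1 if the point lies on the positive side of the boundary, else class 2."""
    return 1 if beta0 + beta @ x > 0 else 2

print(classify(np.array([1.0, 0.5])))  # -1 + 2*1 - 0.5*0.5 = 0.75 > 0, so class 1
```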

Page 3:

Two Classes of Linear Methods

• Model discriminant functions $\delta_k(x)$ for each class as linear
– Linear regression fit to the class indicator variables

– Linear discriminant analysis (LDA)

– Logistic regression (LOGREG)

• Model the boundaries between classes as linear (will be discussed next Tuesday)
– Perceptron

– Support vector classifier, non-overlapping (separable) case (SVM)

Page 4:

Model Discriminant Functions $\delta_k(x)$ for Each Class

• Model: different for linear regression fit, linear discriminant analysis, and logistic regression

• Discriminant functions $\delta_k(x)$: based on the model

• Decision boundary between classes k and l: $\{x : \delta_k(x) = \delta_l(x)\}$

• Classify to the class with the largest $\delta_k(x)$ value: $\hat{G}(x) = \arg\max_k \delta_k(x)$

Page 5:

Linear Regression Fit to the Class Indicator Variables

• Linear model for the kth indicator response variable: $f_k(x) = \beta_{k0} + \beta_k^T x$

• Decision boundary between classes k and l is the set of points $\{x : f_k(x) = f_l(x)\} = \{x : (\beta_{k0} - \beta_{l0}) + (\beta_k - \beta_l)^T x = 0\}$

• Linear discriminant function for class k: $\delta_k(x) = f_k(x)$

• Classify to the class with the largest value of its $\delta_k(x)$: $\hat{G}(x) = \arg\max_k \delta_k(x)$

• Parameter estimation

– Objective function: $\hat{B} = \arg\min_B \mathrm{RSS}(B) = \arg\min_B \sum_{i=1}^N \left\| y_i - [(1, x_i^T)\, B]^T \right\|^2$

– Estimated coefficients, a $(p+1) \times K$ matrix whose kth column is $(\beta_{k0}, \beta_k^T)^T$: $\hat{B} = (X^T X)^{-1} X^T Y$
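
A minimal NumPy sketch of this procedure (my own illustration on a made-up toy dataset, not code from the slides): build the indicator matrix Y, solve the least-squares problem for $\hat{B}$, and classify a new point by the largest fitted value.

```python
import numpy as np

# Toy data: N = 30 points in p = 2 dimensions, class labels in {0, 1, 2}
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2)) + np.repeat(np.array([[0, 0], [3, 0], [0, 3]]), 10, axis=0)
g = np.repeat([0, 1, 2], 10)

K = 3
Y = np.eye(K)[g]                           # N x K indicator response matrix
Xa = np.hstack([np.ones((len(X), 1)), X])  # augment inputs with an intercept column

# B_hat = (X^T X)^{-1} X^T Y, computed via a least-squares solve
B_hat, *_ = np.linalg.lstsq(Xa, Y, rcond=None)

def predict(x_new):
    """Fitted values f_k(x) for each class; classify to the arg max."""
    f = np.hstack([1.0, x_new]) @ B_hat
    return int(np.argmax(f))

print(predict(np.array([3.0, 0.2])))  # expected to land in class 1's region
```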

Page 6:

Linear Regression Fit to the Class Indicator Variables

• Rationale

– An estimate of conditional expectation: $f_k(x)$ estimates $E(Y_k | X = x) = \Pr(G = k | X = x)$

– An estimate of the target value

– An observation: $\sum_k \hat{f}_k(x) = 1$ for any $x$

Why? A "straightforward" verification --- see next page

(courtesy of Jian Zhang and Yan Rong)

Page 7:

Linear Regression Fit to the Class Indicator Variables

• Verification of $\sum_k \hat{f}_k(x) = 1$

We want to prove

$\sum_k \hat{f}_k(x) = (1, x^T)\,\hat{B}\,\mathbf{1}_K = 1$

which, since $\hat{B} = (X^T X)^{-1} X^T Y$ and every row of $Y$ contains a single 1 (so $Y \mathbf{1}_K = \mathbf{1}_N$), is equivalent to proving

$(1, x^T)\,(X^T X)^{-1} X^T\, \mathbf{1}_N = 1$   (Eq. 1)

Notice that

$(X^T X)^{-1} (X^T X) = I$   (Eq. 2)

Page 8:

Linear Regression Fit to the Class Indicator Variables

The augmented $X$ has a first column of all 1s, so

$X \, (1, 0, \ldots, 0)^T = \mathbf{1}_N$

From Eq. 2 we can see that

$(X^T X)^{-1} X^T X \, (1, 0, \ldots, 0)^T = (1, 0, \ldots, 0)^T$

which means that

$(X^T X)^{-1} X^T \, \mathbf{1}_N = (1, 0, \ldots, 0)^T$

Page 9:

Linear Regression Fit to the Class Indicator Variables

Eq. 1 becomes:

$(1, x^T) \, (1, 0, \ldots, 0)^T = 1$

True for any $x$.
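
A quick numerical sanity check of this property (my own sketch, not from the slides): with a full-rank augmented X, the fitted indicator values from the least-squares fit sum to 1, both on the training points and at an arbitrary new x.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
g = rng.integers(0, 3, size=20)              # arbitrary labels in {0, 1, 2}

Y = np.eye(3)[g]
Xa = np.hstack([np.ones((20, 1)), X])        # augmented X with a leading column of 1s
B_hat, *_ = np.linalg.lstsq(Xa, Y, rcond=None)

# Fitted values sum to 1 for every training point ...
print(np.allclose((Xa @ B_hat).sum(axis=1), 1.0))   # True
# ... and for any new point (already augmented with the leading 1)
x_new = np.array([1.0, 10.0, -5.0, 2.0])
print(np.isclose((x_new @ B_hat).sum(), 1.0))        # True
```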

Page 10:

Masking

• Problem

– When K ≥ 3, classes can be masked by others

– Because of the rigid nature of the regression model

Page 11:

Masking (2): Quadratic Polynomials
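
A short sketch of the fix mentioned here (my own illustration, not from the slides): keep the same indicator-matrix regression but augment the two inputs with their squares and cross-product, which yields quadratic boundaries and lets the middle of three collinear classes be predicted again.

```python
import numpy as np

def quadratic_expand(X):
    """Augment (x1, x2) with x1^2, x2^2 and x1*x2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# Three classes lying along a line: the setting where the middle class gets masked
rng = np.random.default_rng(2)
means = np.array([[-4.0, 0.0], [0.0, 0.0], [4.0, 0.0]])
X = np.vstack([rng.normal(m, 1.0, size=(30, 2)) for m in means])
g = np.repeat([0, 1, 2], 30)
Y = np.eye(3)[g]

def fit_predict(features):
    """Indicator-matrix least squares, then classify by the largest fitted value."""
    Xa = np.hstack([np.ones((len(features), 1)), features])
    B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    return np.argmax(Xa @ B, axis=1)

print("middle class recovered, linear features:   ", np.mean(fit_predict(X)[30:60] == 1))
print("middle class recovered, quadratic features:", np.mean(fit_predict(quadratic_expand(X))[30:60] == 1))
```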

Page 12:

Linear Regression Fit

• Question: P81. Let's just consider binary classification. In the machine learning course, when we transfer from regression to classification, we fit a single regression curve on samples of both classes, then we decide a threshold on the curve and finish classification. Here we use two regression curves, each for a category. Can you compare the two methods? (Fan Li)

[Figure: sketch on the x-y plane with one class marked + and the other marked -]

Page 13:

Linear Discriminant Analysis (Common Covariance Matrix $\Sigma$)

• Model the class-conditional density of X in class k as multivariate Gaussian: $f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)$

• Class posterior: $\Pr(G = k | X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^K f_l(x)\,\pi_l}$

• Decision boundary is the set of points $\{x : \Pr(G = k | X = x) = \Pr(G = l | X = x)\} = \{x : \log\frac{\Pr(G = k | X = x)}{\Pr(G = l | X = x)} = 0\}$
$= \{x : \log\frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l) = 0\}$

Page 14:

Linear Discriminant Analysis (Common $\Sigma$) con't

• Linear discriminant function for class k: $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$

• Classify to the class with the largest value of its $\delta_k(x)$: $\hat{G}(x) = \arg\max_k \delta_k(x)$

• Parameter estimation

– Objective function: $\max \sum_{i=1}^N \log \Pr(x_i, y_i) = \max \sum_{i=1}^N \left[\log \Pr(x_i | y_i) + \log \Pr(y_i)\right]$

– Estimated parameters: $\hat{\pi}_k = N_k / N$, $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$, $\hat{\Sigma} = \sum_{k=1}^K \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$
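
A compact NumPy sketch of these estimates and the resulting classification rule (my own illustration on made-up data, not code from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
means = np.array([[0.0, 0.0], [2.5, 0.0], [0.0, 2.5]])
X = np.vstack([rng.normal(m, 1.0, size=(40, 2)) for m in means])
g = np.repeat([0, 1, 2], 40)
K, (N, p) = 3, X.shape

# Parameter estimates: priors, class means, pooled covariance
pi_hat = np.array([(g == k).mean() for k in range(K)])
mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])
Sigma_hat = sum(
    (X[g == k] - mu_hat[k]).T @ (X[g == k] - mu_hat[k]) for k in range(K)
) / (N - K)
Sigma_inv = np.linalg.inv(Sigma_hat)

def delta(x):
    """Linear discriminant scores delta_k(x); classify to the arg max."""
    return np.array([
        x @ Sigma_inv @ mu_hat[k]
        - 0.5 * mu_hat[k] @ Sigma_inv @ mu_hat[k]
        + np.log(pi_hat[k])
        for k in range(K)
    ])

print(np.argmax(delta(np.array([2.5, 0.3]))))  # expected: class 1
```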

Page 15:

Logistic Regression

• Model the class posterior Pr(G = k | X = x) in terms of K-1 log-odds: $\log\frac{\Pr(G = k | X = x)}{\Pr(G = K | X = x)} = \beta_{k0} + \beta_k^T x, \quad k = 1, \ldots, K-1$

• Decision boundary is the set of points $\{x : \Pr(G = k | X = x) = \Pr(G = l | X = x)\} = \{x : \log\frac{\Pr(G = k | X = x)}{\Pr(G = l | X = x)} = 0\} = \{x : (\beta_{k0} - \beta_{l0}) + (\beta_k - \beta_l)^T x = 0\}$

• Linear discriminant function for class k: $\delta_k(x) = \beta_{k0} + \beta_k^T x$

• Classify to the class with the largest value of its $\delta_k(x)$: $\hat{G}(x) = \arg\max_k \delta_k(x)$
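
Sketch (my own, not from the slides) of how the K-1 linear log-odds translate back into class posteriors, with class K as the reference class; the coefficients are placeholders.

```python
import numpy as np

# Placeholder coefficients for K = 3 classes and p = 2 inputs:
# row k holds (beta_k0, beta_k) for k = 1, 2; class 3 is the reference
coef = np.array([[ 0.5, 1.0, -1.0],
                 [-0.2, 0.3,  0.8]])

def posteriors(x):
    """Pr(G = k | X = x) for k = 1..K, recovered from the K-1 log-odds against class K."""
    scores = coef[:, 0] + coef[:, 1:] @ x        # beta_k0 + beta_k^T x, k = 1..K-1
    expo = np.exp(scores)
    denom = 1.0 + expo.sum()
    return np.append(expo / denom, 1.0 / denom)  # last entry is the reference class K

p = posteriors(np.array([0.4, -1.2]))
print(p, p.sum())   # the K probabilities, summing to 1
```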

Page 16:

Questions

• The log odds-ratio is typically defined as log(p/(1-p)); how is this consistent with p96, where they use log(pk/pl), where k, l are different classes in K? (Ashish Venugopal)

Page 17:

Logistic Regression con’t

• Parameter estimation

– Objective function: $\max_\beta \sum_{i=1}^N \log \Pr(y_i | x_i; \beta)$

– Parameter estimation: IRLS (iteratively reweighted least squares)

In particular, for the two-class case, use the Newton-Raphson algorithm to solve the score equation (pages 98-99 for details):

$\frac{\partial}{\partial \beta} \sum_{i=1}^N \log \Pr(y_i | x_i; \beta) = \sum_{i=1}^N x_i \left(y_i - p(x_i; \beta)\right) = 0$

$\beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p) = (X^T W X)^{-1} X^T W z$

where $z = X\beta^{old} + W^{-1}(y - p)$, $p_i = p(x_i; \beta^{old})$, and $W = \mathrm{diag}\{p(x_i; \beta^{old})(1 - p(x_i; \beta^{old}))\}$
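
A bare-bones IRLS / Newton-Raphson loop for the two-class case (my own sketch on simulated data, with no safeguards such as step-size control or a proper convergence test):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])   # augmented with intercept
beta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=100) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(3)
for _ in range(10):                        # a few Newton-Raphson steps
    p = 1.0 / (1.0 + np.exp(-X @ beta))    # p(x_i; beta_old)
    W = p * (1.0 - p)                      # diagonal of the weight matrix
    z = X @ beta + (y - p) / W             # adjusted response
    # Weighted least squares step: beta_new = (X^T W X)^{-1} X^T W z
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

print(beta)   # should land in the rough vicinity of beta_true
```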

Page 18:

Logistic Regression con’t

• When it is used
– Binary responses (two classes)
– As a data analysis and inference tool to understand the role of the input variables in explaining the outcome

• Feature selection
– Find a subset of the variables that are sufficient for explaining their joint effect on the response
– One way is to repeatedly drop the least significant coefficient and refit the model until no further terms can be dropped
– Another strategy is to refit each model with one variable removed, and perform an analysis of deviance to decide which variable to exclude

• Regularization
– Maximum penalized likelihood
– Shrinking the parameters via an L1 constraint, imposing a margin constraint in the separable case

$\max_\beta \sum_{i=1}^N \log \Pr(y_i | x_i; \beta)$ subject to $\|\beta\|_1 \le C$
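
For trying the L1-shrunk fit in practice, one option (a generic usage sketch, not something from the slides; the dataset and the value of C are arbitrary) is scikit-learn's logistic regression with an L1 penalty:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))            # 10 inputs, only the first two actually matter
y = (X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(int)

# L1-penalized logistic regression; smaller C means stronger shrinkage
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print(clf.coef_)   # many coefficients shrunk to exactly zero
```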

Page 19:

Questions

• p102: Are stepwise methods the only practical way to do model selection for logistic regression (because of nonlinearity + max likelihood criteria)? (Compared to Section 3.4: what about the bias/variance tradeoff, where we could shrink coefficient estimates instead of just setting them to zero?) (Kevyn Collins-Thompson)

Page 20:

Classification by Linear Least Squares vs. LDA

• Two-class case: simple correspondence between LDA and classification by linear least squares
– The coefficient vector from least squares is proportional to the LDA direction in its classification rule (page 88)

• For more than two classes, the correspondence between regression and LDA can be established through the notion of optimal scoring (Section 12.5).

Page 21:

Questions

• On p88, paragraph 2, it says "the derivation of LDA via least squares does not use a Gaussian assumption for the features." How can this statement be made? Simply because the least squares coefficient vector is proportional to the LDA direction, how does that remove the obvious Gaussian assumptions that are made in LDA? (Ashish Venugopal)

Page 22:

LDA vs. Logistic Regression

• LDA (generative model)
– Assumes Gaussian class-conditional densities and a common covariance
– Model parameters are estimated by maximizing the full log likelihood; parameters for each class are estimated independently of the other classes; Kp + p(p+1)/2 + (K-1) parameters
– Makes use of marginal density information Pr(X)
– Easier to train, low variance, more efficient if the model is correct
– Higher asymptotic error, but converges faster

• Logistic Regression (discriminative model)
– Assumes class-conditional densities are members of the (same) exponential family distribution
– Model parameters are estimated by maximizing the conditional log likelihood, with simultaneous consideration of all other classes; (K-1)(p+1) parameters
– Ignores marginal density information Pr(X)
– Harder to train; robust to uncertainty about the data generation process
– Lower asymptotic error, but converges more slowly
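
As a quick worked illustration of these parameter counts (my own numbers, not from the slides): with K = 3 classes and p = 4 inputs, LDA estimates Kp + p(p+1)/2 + (K-1) = 12 + 10 + 2 = 24 parameters, while logistic regression estimates (K-1)(p+1) = 2 x 5 = 10.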

Page 23:

Generative vs. Discriminative Learning (Rubinstein 97)

• Example
– Generative: Linear Discriminant Analysis
– Discriminative: Logistic Regression

• Objective function
– Generative: full log likelihood $\sum_i \log p(x_i, y_i)$
– Discriminative: conditional log likelihood $\sum_i \log p(y_i | x_i)$

• Model assumptions
– Generative: class densities $p(x | y = k)$, e.g. Gaussian in LDA
– Discriminative: discriminant functions $\delta_k(x)$

• Parameter estimation
– Generative: "easy", one single sweep
– Discriminative: "hard", iterative optimization

• Advantages
– Generative: more efficient if the model is correct, borrows strength from p(x)
– Discriminative: more flexible, robust because of fewer assumptions

• Disadvantages
– Generative: bias if the model is incorrect
– Discriminative: may also be biased; ignores information in p(x)

Page 24:

Comparison between LDA and LOGREG

(Rubinstein 97)

(Error rate / standard error)

True distribution:   Highly non-Gaussian   N/A          Gaussian
LDA                  25.2 / 0.47           9.6 / 0.61   7.6 / 0.12
LOGREG               12.6 / 0.94           4.1 / 0.17   8.1 / 0.27

Page 25:

Questions

• Can you give a more detailed explanation of the difference between the two methods, linear discriminant analysis and linear logistic regression? (P. 80, book: the essential difference between them is in the way the linear function is fit to the training data.) (Yanjun Qi)

• P105, first paragraph: why does conditional likelihood need 30% more data to do as well? (Yi Zhang)

• The book says logistic regression is safer. Then it says LDA and logistic regression work very similarly even when LDA is used inappropriately, so why not use LDA? Using LDA, we have a chance to save 30% of the training data in case the assumption on the marginal distribution is true. How much inappropriateness will make LDA worse than logistic regression? (Yi Zhang)

• Figure 4.2 shows the different effects of linear regression and linear discriminant analysis on one data set. Can we have a deeper and more general understanding of when linear regression does not work well compared with linear discriminant analysis? (Yanjun Qi)

Page 26:

Questions

• On p88, paragraph 2, it says "the derivation of LDA via least squares does not use a Gaussian assumption for the features." How can this statement be made? Simply because the least squares coefficient vector is proportional to the LDA direction, how does that remove the obvious Gaussian assumptions that are made in LDA? (Ashish Venugopal)

• p91 - what does it mean to "sphere" the data with a covariance matrix? (Ashish Venugopal)

• The log odds-ratio is typically defined as log(p/(1-p)); how is this consistent with p96, where they use log(pk/pl), where k, l are different classes in K? (Ashish Venugopal)

Page 27:

Questions

• Figure 4.2 on p. 83 gives an example of masking, and in the text, the authors go on to say, "a general rule is that ... polynomial terms up to degree K - 1 might be needed to resolve them." There seems to be an implication that adding polynomial basis functions according to this rule could sometimes be detrimental. I was trying to think of a graphical representation of a case where that would occur but can't come up with one. Do you have one? (Paul Bennett)

• (p. 80) What do the decision boundaries for the logit transformation space look like in the original space? (Francisco Pereira)

• (p. 82) Why is E(Y_k|X=x) = Pr(G=k|X=x)? (Francisco Pereira)

• (p. 82) Is the target approach just "predicting a vector with all 0s except a 1 at the position of the true class"? (Francisco Pereira)

• (p. 83) Can all of this be seen as projecting the data onto a line with a given direction and then dividing that line according to the classes? (It seems so in the 2-class case, not sure in general.) (Francisco Pereira)

Page 28:

Questions

• What is the difference between logistic regression and the exponential model, in terms of definition, properties, and experimental results? (Discriminative vs. Generative) [Yan Liu]

• The question is on the indicator response matrix: as a general way to decompose multi-class classification problems into binary-class classification problems, when it is applied, how do we evaluate the results? (Error rate or something else?) There is a good method called ECOC (Error Correcting Output Coding) to reduce multi-class problems to binary-class problems; can we use it the same way as the indicator response matrix and do linear regression? [Yan Liu]

• On page 82, why is it quite straightforward to see that $\sum_k f_k(x) = 1$ for any x?

• As is said in the book (page 80), if the problem is linearly non-separable, we can expand our variable set X1, X2, ..., Xp by including their squares and cross-products and solve it. Furthermore, this approach can be used with any basis transformation. In theory, can any classification problem be solved this way? (Maybe in practice we might have problems like the "curse of dimensionality.") [Yan Liu]

Page 29:

Questions

• One important step in applying regression methods to the classification problem is to encode the class label into some coding scheme. The book only illustrates the simplest one; more complicated coding schemes include redundant codes. However, it is not necessary to encode the class label into N regions. Do you think it is possible to encode it with real numbers and actually achieve better performance? [Rong Jin]

• P. 82, book: "If we allow linear regression onto basis expansions h(X) of the inputs, this approach can lead to consistent estimates of the probabilities." I do not fully understand this sentence. [Yanjun]

• In LDA, the book tells us that it is easy to show that the coefficient vector from least squares is proportional to the LDA direction given by 4.11. Then how should we understand that this correspondence occurs for any distinct coding of the targets? [Yanjun]

• Both LDA and QDA perform well on an amazingly large and diverse set of classification tasks. But LDA assumes the data covariances are approximately equal. Then I feel this method is too restricted for the general case, right? [Yanjun]

Page 30:

Questions

• The indicator matrix Y in the first paragraph of 4.2 is a matrix of 0's and 1's, with each row having a single 1. It seems that we can extend it to multi-label data by allowing each row to have two or more 1's, and fit the model using Eq. 4.3. Has this approach been tried for multi-label classification problems? [Wei-hao]

Page 31:

References

• Rubinstein, Y. D., & Hastie, T. (1997). Discriminative vs. informative learning. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pp. 49-53.

• Jordan, M. I. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks. Technical report.

• Ng, A. Y., & Jordan, M. I. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Neural Information Processing Systems.

• p88: "QDA is generally preferred to LDA (in the quadratic space)." Why, and how do you decide which to use? (Is the main reason that QDA is more general in what it can model accurately, in not assuming a common covariance across classes?) [Kevyn]

• "By relying on the additional model assumptions, we have more information about the parameters, and hence can estimate them more efficiently (low variance)." How? [Jian]