Linear Methods for Classification
Based on Chapter 4 of Hastie, Tibshirani, and Friedman
David Madigan
Predictive Modeling

Goal: learn a mapping y = f(x; θ)
Need: 1. A model structure
2. A score function
3. An optimization strategy
Categorical y ∈ {c1,…,cm}: classification
Real-valued y: regression
Note: usually assume {c1,…,cm} are mutually exclusive and exhaustive
Probabilistic Classification
Let p(c_k) = probability that a randomly chosen object comes from class c_k.

Objects from c_k have class-conditional density p(x | c_k, θ_k) (e.g., multivariate normal).

Then, by Bayes' rule: p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k)

Bayes error rate:

p*_B = ∫ (1 − max_k p(c_k | x)) p(x) dx

•Lower bound on the error rate achievable by any classifier
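As a concrete illustration (not from the slides), the posterior and the Bayes error rate can be computed for two one-dimensional Gaussian classes; the means, variance, and priors below are illustrative assumptions.

```python
import numpy as np

mu = np.array([0.0, 2.0])     # class means (assumed)
sigma = 1.0                   # common standard deviation (assumed)
prior = np.array([0.5, 0.5])  # p(c_k)

def posterior(x):
    """p(c_k | x) ∝ p(x | c_k) p(c_k), normalized over k.

    The Gaussian normalizing constant cancels because sigma is shared."""
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) * prior
    return dens / dens.sum(axis=1, keepdims=True)

# Monte Carlo estimate of p*_B = ∫ (1 − max_k p(c_k | x)) p(x) dx
rng = np.random.default_rng(0)
k = rng.choice(2, size=200_000, p=prior)
x = rng.normal(mu[k], sigma)
bayes_err = (1 - posterior(x).max(axis=1)).mean()
print(bayes_err)  # for these parameters, roughly Phi(-1) ≈ 0.159
```

The estimate matches the closed-form answer Φ(−Δ/2) for equal-prior Gaussians separated by Δ standard deviations, which is a useful sanity check.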
[Figure: Bayes error rate about 6%]
Classifier Types
Discriminative: model p(c_k | x)
- e.g., logistic regression, CART

Generative: model p(x | c_k, θ_k)
- e.g., “Bayesian classifiers”, LDA
Regression for Binary Classification
•Can fit a linear regression model to a 0/1 response
•Predicted values are not necessarily between zero and one
[Figure: linear regression fit of y on x for a 0/1 response (data: zeroOneR.txt)]
•With p > 1, the decision boundary is linear, e.g., 0.5 = b0 + b1 x1 + b2 x2
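A minimal sketch of the point above: fit ordinary least squares to a 0/1 response and observe that fitted values escape [0, 1]. The data are synthetic (the original slide's zeroOneR.txt is not available), so the exact numbers are illustrative.

```python
import numpy as np

# Two well-separated clusters labeled 0 and 1 (synthetic stand-in data)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.5, 0.7, 50), rng.normal(1.5, 0.7, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit to the 0/1 response

y_hat = X @ beta
print(y_hat.min(), y_hat.max())               # fitted values stray outside [0, 1]

# Classify as 1 when the fitted value exceeds 0.5; the boundary solves
# 0.5 = b0 + b1 * x, which is linear in x
boundary = (0.5 - beta[0]) / beta[1]
print(boundary)
```

With p > 1 predictors, the same rule 0.5 = b0 + b1 x1 + b2 x2 defines a line (more generally a hyperplane) in predictor space.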
[Figure: scatterplot of the data in the (x1, x2) plane]
Linear Discriminant Analysis

K classes, X an n × p data matrix.

p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k)

Could model each class density as multivariate normal:

p(x \mid c_k) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)

LDA assumes Σ_k = Σ for all k. Then:

\log \frac{p(c_k \mid x)}{p(c_l \mid x)} = \log \frac{p(c_k)}{p(c_l)} - \tfrac{1}{2} (\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l)
This is linear in x.
Linear Discriminant Analysis (cont.)
It follows that the classifier should predict: \arg\max_k \delta_k(x), where

\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log p(c_k)

is the “linear discriminant function”.

If we don’t assume the Σ_k’s are identical, we get Quadratic Discriminant Analysis:

\delta_k(x) = -\tfrac{1}{2} \log |\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log p(c_k)
Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum likelihood:

\hat{\mu}_k = \sum_{i: c_i = k} x_i / N_k

\hat{p}(c_k) = N_k / N

\hat{\Sigma} = \sum_{k=1}^{K} \sum_{i: c_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)
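The estimates above can be sketched directly in NumPy. The synthetic two-class data and variable names (`mu_hat`, `Sigma_hat`, etc.) are mine, not from the slides; the formulas are the ones just stated.

```python
import numpy as np

# Synthetic two-class data (assumed for illustration)
rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], 0.5, size=(40, 2))
X1 = rng.normal([2, 1], 0.5, size=(40, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 40 + [1] * 40)

N, K = len(y), 2
mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])  # class means
prior_hat = np.array([(y == k).mean() for k in range(K)])      # N_k / N

# Pooled covariance: within-class scatter summed over classes, divided by N - K
Sigma_hat = sum(
    (X[y == k] - mu_hat[k]).T @ (X[y == k] - mu_hat[k]) for k in range(K)
) / (N - K)
Sigma_inv = np.linalg.inv(Sigma_hat)

def delta(x, k):
    """Linear discriminant function delta_k(x)."""
    return (x @ Sigma_inv @ mu_hat[k]
            - 0.5 * mu_hat[k] @ Sigma_inv @ mu_hat[k]
            + np.log(prior_hat[k]))

pred = np.array([max(range(K), key=lambda k: delta(x, k)) for x in X])
print((pred == y).mean())  # training accuracy
```

Predicting argmax_k δ_k(x) recovers the LDA rule; with well-separated classes the training accuracy should be near 1.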
[Figure: LDA vs. QDA decision boundaries]
LDA (cont.)
•Fisher’s LDA is optimal if the classes are MVN with a common covariance matrix
•Computational complexity: O(m p² n)
Logistic Regression
Note that LDA is linear in x:
\log \frac{p(c_k \mid x)}{p(c_0 \mid x)} = \log \frac{p(c_k)}{p(c_0)} - \tfrac{1}{2} (\mu_k + \mu_0)^T \Sigma^{-1} (\mu_k - \mu_0) + x^T \Sigma^{-1} (\mu_k - \mu_0) = \alpha_{k0} + \beta_k^T x
Linear logistic regression looks the same:
\log \frac{p(c_k \mid x)}{p(c_0 \mid x)} = \alpha_{k0} + \beta_k^T x
But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood [y, X], while logistic regression maximizes the conditional likelihood [y | X]. The two usually give similar predictions.
Logistic Regression MLE
For the two-class case, the log-likelihood is:

l(\beta) = \sum_{i=1}^{n} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right]

Since \log \frac{p(x; \beta)}{1 - p(x; \beta)} = \beta^T x, we have \log(1 - p(x; \beta)) = -\log(1 + \exp(\beta^T x)), so:

l(\beta) = \sum_{i=1}^{n} \left[ y_i \beta^T x_i - \log(1 + \exp(\beta^T x_i)) \right]

To maximize, need to solve the (non-linear) score equations:

\frac{dl}{d\beta} = \sum_{i=1}^{n} x_i \left( y_i - p(x_i; \beta) \right) = 0
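The score equations are usually solved by Newton-Raphson (equivalently, iteratively reweighted least squares). A minimal sketch, with synthetic data and illustrative names of my own:

```python
import numpy as np

# Synthetic logistic data: intercept plus one predictor
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))       # p(x_i; beta)
    score = X.T @ (y - p)                 # dl/dbeta
    W = p * (1 - p)                       # variance weights
    hessian = -(X * W[:, None]).T @ X     # d^2 l / dbeta dbeta^T
    step = np.linalg.solve(hessian, score)
    beta = beta - step                    # Newton update
    if np.abs(step).max() < 1e-10:        # converged
        break

print(beta)  # roughly recovers beta_true
```

At convergence the score X^T (y − p) is numerically zero, which is exactly the condition on the slide.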
Logistic Regression Modeling

South African Heart Disease Example (y = MI)

|           | Coef.  | S.E.  | Z score (Wald) |
|-----------|--------|-------|----------------|
| Intercept | -4.130 | 0.964 | -4.285         |
| sbp       |  0.006 | 0.006 |  1.023         |
| tobacco   |  0.080 | 0.026 |  3.034         |
| ldl       |  0.185 | 0.057 |  3.219         |
| famhist   |  0.939 | 0.225 |  4.178         |
| obesity   | -0.035 | 0.029 | -1.187         |
| alcohol   |  0.001 | 0.004 |  0.136         |
| age       |  0.043 | 0.010 |  4.184         |
Regularized Logistic Regression
•Ridge/LASSO logistic regression
•Successfully implemented with over 100,000 predictor variables
•Can also regularize discriminant analysis
\hat{w} = \arg\min_w \; \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + \exp(-y_i w^T x_i)\right) + \lambda \sum_j |w_j|
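A minimal sketch of the ridge (L2) variant of this objective, fit by plain gradient descent; the data, step size, and `lam` are illustrative assumptions, and labels use the {−1, +1} coding of the objective above. (The L1/LASSO version needs a proximal or coordinate-descent solver and is omitted here.)

```python
import numpy as np

# Synthetic data with a sparse-ish true weight vector (assumed)
rng = np.random.default_rng(3)
n, p = 300, 5
X = rng.normal(size=(n, p))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = np.where(rng.random(n) < 1 / (1 + np.exp(-X @ w_true)), 1.0, -1.0)

lam = 0.1       # regularization strength (illustrative)
w = np.zeros(p)
for _ in range(2000):
    margins = y * (X @ w)
    # gradient of (1/n) * sum log(1 + exp(-y_i w^T x_i)) + lam * ||w||^2
    grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + 2 * lam * w
    w = w - 0.5 * grad  # fixed step size (illustrative)

obj = np.log1p(np.exp(-y * (X @ w))).mean() + lam * w @ w
print(w, obj)
```

The penalty shrinks the coefficients toward zero, which is what makes the fit stable even with very many predictors.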
Simple Two-Class Perceptron
Define: h(x) = \sum_{j=1}^{p} w_j x_j

Classify as class 1 if h(x) > 0, class 2 otherwise.

Score function: number of misclassification errors on the training data.

For training, replace each class-2 x_j by −x_j; now every point needs h(x) > 0.

Initialize the weight vector w
Repeat one or more times:
  For each training data point x_i:
    If the point is correctly classified, do nothing
    Else w ← w + x_i

Guaranteed to converge to a separating hyperplane (if one exists)
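The algorithm above can be sketched as follows, using the slide's trick of negating class-2 points so that every point must satisfy h(x) > 0. The separable synthetic data are assumed, and I add an intercept via a constant column (a common choice the slide leaves implicit).

```python
import numpy as np

# Linearly separable synthetic data (assumed)
rng = np.random.default_rng(4)
X1 = rng.normal([2, 2], 0.5, size=(30, 2))    # class 1
X2 = rng.normal([-2, -2], 0.5, size=(30, 2))  # class 2
Z = np.vstack([X1, -X2])                      # negate class-2 points
Z = np.column_stack([np.ones(len(Z)), Z])     # constant column for an intercept

w = np.zeros(3)
for _ in range(100):                          # "repeat one or more times"
    updated = False
    for z in Z:
        if z @ w <= 0:                        # misclassified: needs h(x) > 0
            w = w + z                         # perceptron update
            updated = True
    if not updated:                           # every point satisfies h(x) > 0
        break

print(w, np.all(Z @ w > 0))
```

Since the clusters are separable, the loop terminates with a weight vector for which every (sign-adjusted) point has h(x) > 0, as the convergence guarantee promises.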
“Optimal” Hyperplane
The “optimal” hyperplane separates the two classes and maximizes the distance to the closest point from either class.
Finding this hyperplane is a convex optimization problem.
This notion plays an important role in support vector machines.
[Figure: margin geometry. The hyperplane wx + b = 0 lies between wx + b = 1 and wx + b = −1; their distances from the origin (0,0) are |1 − b| / ||w|| and |−1 − b| / ||w||, so the margin between them is 2 / ||w||.]