Linear Models for Classification
DESCRIPTION
Linear Models for Classification. Berkay Topçu. Goal: take an input vector and assign it to one of K classes (C_k, where k = 1, ..., K). Linear separation of classes. Generalized linear models.
TRANSCRIPT
Linear Models for Classification
Berkay Topçu
Linear Models for Classification
Goal: take an input vector $\mathbf{x}$ and assign it to one of $K$ classes $\mathcal{C}_k$, where $k = 1, \ldots, K$.
Linear separation of classes.
Generalized Linear Models
We wish to predict discrete class labels, or more generally class posterior probabilities $p(\mathcal{C}_k \mid \mathbf{x})$, which lie in the range $(0, 1)$.
The classification model is a linear function of the parameters:
$$y(\mathbf{x}) = f(\mathbf{w}^T \mathbf{x} + w_0)$$
Classification is performed either directly in the original input space $\mathbf{x}$, or in a fixed nonlinear transformation of the input variables using a vector of basis functions $\boldsymbol{\phi}(\mathbf{x})$.
Discriminant Functions
Linear discriminants:
$$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$$
If $y(\mathbf{x}) \geq 0$, assign $\mathbf{x}$ to class $\mathcal{C}_1$, and to class $\mathcal{C}_2$ otherwise.
The decision boundary is given by $y(\mathbf{x}) = 0$; $\mathbf{w}$ determines the orientation of the decision surface and $w_0$ determines its location.
Compact notation:
$$y(\mathbf{x}) = \tilde{\mathbf{w}}^T \tilde{\mathbf{x}}, \qquad \tilde{\mathbf{w}} = (w_0, \mathbf{w}), \quad \tilde{\mathbf{x}} = (1, \mathbf{x})$$
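As a concrete illustration, here is a minimal sketch of this two-class rule in Python; the weight vector `w` and bias `w0` are hypothetical hand-picked values, not learned parameters:

```python
import numpy as np

# Minimal sketch of the two-class linear discriminant; w and w0 are
# hypothetical values standing in for learned parameters.
w = np.array([1.0, -2.0])   # determines the orientation of the decision surface
w0 = 0.5                    # determines the location of the decision surface

def y(x):
    """y(x) = w^T x + w0; assign x to C1 if y(x) >= 0, else to C2."""
    return w @ x + w0

x = np.array([3.0, 1.0])
print("C1" if y(x) >= 0 else "C2")
```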
Multiple Classes
A K-class discriminant can be built by combining a number of two-class discriminant functions ($K > 2$):
One-versus-the-rest: separating points in one particular class $\mathcal{C}_k$ from points not in that class.
One-versus-one: $K(K-1)/2$ binary discriminant functions.
Multiple Classes
A single K-class discriminant comprising $K$ linear functions:
$$y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}$$
Assign $\mathbf{x}$ to class $\mathcal{C}_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$.
How do we learn the parameters of the linear discriminant functions?
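A minimal sketch of this decision rule, using the augmented notation $\tilde{\mathbf{x}} = (1, \mathbf{x})$ from above; the weight matrix here is random placeholder data standing in for learned parameters:

```python
import numpy as np

# Sketch of the K-class decision rule with augmented inputs x~ = (1, x).
# The (D+1) x K weight matrix is a random placeholder, not learned.
rng = np.random.default_rng(0)
W_tilde = rng.normal(size=(3, 4))        # D = 2 inputs, K = 4 classes

def classify(x):
    x_tilde = np.concatenate(([1.0], x)) # prepend the bias component
    scores = W_tilde.T @ x_tilde         # y_k(x) for k = 1, ..., K
    return int(np.argmax(scores))        # class with the largest y_k(x)

print(classify(np.array([0.2, -1.3])))
```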
Least Squares for Classification
Each class $\mathcal{C}_k$ is described by its own linear model:
$$y_k(\mathbf{x}) = \tilde{\mathbf{w}}_k^T \tilde{\mathbf{x}}, \qquad \mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T \tilde{\mathbf{x}}$$
Training data set $\{\mathbf{x}_n, \mathbf{t}_n\}$ for $n = 1, \ldots, N$, where $\mathbf{t}_n = (0, 0, \ldots, 1, \ldots, 0)^T$ is a 1-of-K target vector.
Let $\mathbf{T}$ be the matrix whose $n$th row is the vector $\mathbf{t}_n^T$, and $\tilde{\mathbf{X}}$ the matrix whose $n$th row is $\tilde{\mathbf{x}}_n^T$.
Least Squares for Classification
Minimizing the sum-of-squares error function:
$$E_D(\tilde{\mathbf{W}}) = \frac{1}{2} \mathrm{Tr}\left\{ (\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T (\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T}) \right\}$$
Solution:
$$\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{T} = \tilde{\mathbf{X}}^\dagger \mathbf{T}$$
Discriminant function:
$$\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T \tilde{\mathbf{x}} = \mathbf{T}^T (\tilde{\mathbf{X}}^\dagger)^T \tilde{\mathbf{x}}$$
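A minimal sketch of this closed-form solution on toy data, using the pseudo-inverse $\tilde{\mathbf{X}}^\dagger$ via `numpy.linalg.pinv`; the data set is made up for illustration:

```python
import numpy as np

# Sketch of least-squares classification: W~ = pinv(X~) T with 1-of-K
# targets, then argmax over the K outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # toy inputs, D = 2
labels = rng.integers(0, 3, size=100)        # toy class indices, K = 3
T = np.eye(3)[labels]                        # 1-of-K target matrix

X_tilde = np.hstack([np.ones((100, 1)), X])  # augmented inputs (1, x)
W_tilde = np.linalg.pinv(X_tilde) @ T        # pseudo-inverse solution

Y = X_tilde @ W_tilde                        # y(x) for every training point
pred = Y.argmax(axis=1)                      # assign each x to the largest output
```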
Fisher’s Linear Discriminant
Dimensionality reduction: take the D-dimensional input vector $\mathbf{x}$ and project it down to one dimension using
$$y = \mathbf{w}^T \mathbf{x}$$
and choose the projection that maximizes class separation.
Two-class problem: $N_1$ points of $\mathcal{C}_1$ and $N_2$ points of $\mathcal{C}_2$, with class means
$$\mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in \mathcal{C}_1} \mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in \mathcal{C}_2} \mathbf{x}_n$$
Fisher’s idea: seek a large separation between the projected class means, $m_2 - m_1 = \mathbf{w}^T (\mathbf{m}_2 - \mathbf{m}_1)$, together with a small variance within each class, minimizing class overlap. Maximizing the mean separation alone gives $\mathbf{w} \propto (\mathbf{m}_2 - \mathbf{m}_1)$.
Fisher’s Linear Discriminant
The Fisher criterion:
$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$$
where the between-class and within-class scatter matrices are
$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$$
$$\mathbf{S}_W = \sum_{n \in \mathcal{C}_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in \mathcal{C}_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$$
Maximizing $J(\mathbf{w})$ gives
$$\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$$
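A minimal sketch of the two-class Fisher direction on toy Gaussian blobs; the data and class assignments are made up for illustration:

```python
import numpy as np

# Sketch of the two-class Fisher direction w ∝ S_W^{-1}(m2 - m1);
# X1 and X2 hold the samples of C1 and C2 row-wise.
rng = np.random.default_rng(1)
X1 = rng.normal(loc=0.0, size=(50, 2))
X2 = rng.normal(loc=2.0, size=(50, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
w = np.linalg.solve(S_W, m2 - m1)   # Fisher direction (overall scale is irrelevant)

y1, y2 = X1 @ w, X2 @ w             # 1-D projections with maximal class separation
```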
Fisher’s Linear Discriminant
For the two-class problem, the Fisher criterion is a special case of least squares (reference: Penalized Discriminant Analysis, Hastie, Buja and Tibshirani).
For multiple classes:
$$\mathbf{y} = \mathbf{W}^T \mathbf{x}$$
$$\mathbf{S}_W = \sum_{k=1}^{K} \mathbf{S}_k, \qquad \mathbf{S}_B = \sum_{k=1}^{K} N_k (\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T$$
$$J(\mathbf{W}) = \mathrm{Tr}\left\{ (\mathbf{W} \mathbf{S}_W \mathbf{W}^T)^{-1} (\mathbf{W} \mathbf{S}_B \mathbf{W}^T) \right\}$$
The weight values are determined by the eigenvectors that correspond to the $K$ largest eigenvalues of $\mathbf{S}_W^{-1} \mathbf{S}_B$.
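A sketch of the multi-class case under the definitions above; the toy data, the number of retained directions, and the use of `numpy.linalg.eig` are illustrative choices:

```python
import numpy as np

# Sketch of the multi-class Fisher projection: keep the eigenvectors of
# S_W^{-1} S_B with the largest eigenvalues. Toy data: K = 3 classes, D = 5.
rng = np.random.default_rng(2)
Xs = [rng.normal(loc=3.0 * k, size=(40, 5)) for k in range(3)]

m = np.vstack(Xs).mean(axis=0)                    # overall mean
S_W = sum((X - X.mean(0)).T @ (X - X.mean(0)) for X in Xs)
S_B = sum(len(X) * np.outer(X.mean(0) - m, X.mean(0) - m) for X in Xs)

evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(evals.real)[::-1]              # eigenvalues, descending
W = evecs.real[:, order[:2]]                      # leading discriminant directions
Y = np.vstack(Xs) @ W                             # projected features y = W^T x
```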
The Perceptron Algorithm
The input vector $\mathbf{x}$ is transformed using a nonlinear transformation $\boldsymbol{\phi}(\mathbf{x})$:
$$y(\mathbf{x}) = f(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})), \qquad f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}$$
with targets $t \in \{+1, -1\}$.
Perceptron criterion: for all training samples we would like $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) t_n > 0$. We therefore need to minimize
$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) t_n$$
where $\mathcal{M}$ is the set of misclassified patterns.
The Perceptron Algorithm: Stochastic Gradient Descent
Cycle through the training patterns in turn. If a pattern is correctly classified, the weight vector remains unchanged; otherwise:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta \boldsymbol{\phi}(\mathbf{x}_n) t_n$$
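A minimal sketch of this training loop, assuming $\boldsymbol{\phi}(\mathbf{x}) = (1, \mathbf{x})$ and targets in $\{-1, +1\}$; the fixed epoch count is an illustrative stopping rule, since the updates only terminate on their own for separable data:

```python
import numpy as np

# Sketch of the perceptron rule: cycle through the patterns and apply
# w <- w + eta * phi(x_n) * t_n only when a pattern is misclassified.
def perceptron(X, t, eta=1.0, epochs=100):
    Phi = np.hstack([np.ones((len(X), 1)), X])   # phi(x) = (1, x)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:           # misclassified (or on the boundary)
                w = w + eta * phi_n * t_n        # stochastic gradient step on E_P
    return w
```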
Probabilistic Generative Models
These depend on simple assumptions about the distribution of the data. For two classes:
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_1) p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_1) p(\mathcal{C}_1) + p(\mathbf{x} \mid \mathcal{C}_2) p(\mathcal{C}_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$
where
$$a = \ln \frac{p(\mathbf{x} \mid \mathcal{C}_1) p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2) p(\mathcal{C}_2)}$$
The logistic sigmoid function $\sigma(a)$ maps the whole real axis to the finite interval $(0, 1)$.
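For reference, a one-line sketch of the sigmoid:

```python
import numpy as np

# The logistic sigmoid sigma(a) = 1 / (1 + exp(-a)) squashes the real
# axis into (0, 1), so its output can be read as a posterior probability.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```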
Continuous Inputs - Gaussian
Assuming the class-conditional densities are Gaussian with a shared covariance matrix $\boldsymbol{\Sigma}$:
$$p(\mathbf{x} \mid \mathcal{C}_k) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}$$
Case of two classes:
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0)$$
$$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
$$w_0 = -\frac{1}{2} \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}$$
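A minimal sketch that maps given Gaussian parameters to the posterior parameters $\mathbf{w}$ and $w_0$; the function name and argument layout are illustrative:

```python
import numpy as np

# Sketch: turn Gaussian class-conditional parameters (mu1, mu2, shared Sigma)
# and priors p1, p2 into the posterior parameters w and w0 above.
def posterior_params(mu1, mu2, Sigma, p1, p2):
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(p1 / p2))
    return w, w0                 # p(C1|x) = sigmoid(w @ x + w0)
```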
Maximum Likelihood Solution
Likelihood function:
$$p(\mathbf{t} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \left[ \pi \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) \right]^{t_n} \left[ (1 - \pi) \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) \right]^{1 - t_n}$$
Maximizing the log-likelihood gives
$$\pi = \frac{N_1}{N_1 + N_2}, \qquad \boldsymbol{\mu}_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n \mathbf{x}_n, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) \mathbf{x}_n$$
$$\boldsymbol{\Sigma} = \frac{N_1}{N} \mathbf{S}_1 + \frac{N_2}{N} \mathbf{S}_2, \qquad \mathbf{S}_k = \frac{1}{N_k} \sum_{n \in \mathcal{C}_k} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T$$
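A sketch of these closed-form estimates, assuming a binary label vector with $t_n = 1$ for $\mathcal{C}_1$ and $t_n = 0$ for $\mathcal{C}_2$:

```python
import numpy as np

# Sketch of the closed-form ML estimates; X is (N, D), t holds binary labels.
def fit_generative(X, t):
    N = len(t)
    N1 = int(t.sum())
    N2 = N - N1
    pi = N1 / N                                   # prior p(C1)
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2         # weighted average of class covariances
    return pi, mu1, mu2, Sigma
```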
Probabilistic Discriminative Models
In the probabilistic generative model, the number of parameters grows quadratically with $M$ (the number of dimensions). However, the discriminative model
$$p(\mathcal{C}_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T \boldsymbol{\phi})$$
has only $M$ adjustable parameters.
Maximum likelihood solution for logistic regression:
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
Error function: the negative log-likelihood (cross-entropy)
$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right\}$$
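A minimal sketch that minimizes this error by plain gradient descent, using the gradient $\boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$ derived on the next slide; the step size and iteration count are arbitrary illustrative choices:

```python
import numpy as np

# Sketch: gradient descent on the cross-entropy error for logistic
# regression. Phi is the N x M design matrix, t holds targets in {0, 1}.
def logistic_regression_gd(Phi, t, eta=0.1, steps=500):
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))   # y_n = sigma(w^T phi_n)
        w -= eta * Phi.T @ (y - t)           # step along -grad E(w)
    return w
```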
Iterative Reweighted Least Squares
Newton-Raphson iterative optimization, applied first to linear regression:
$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1} \nabla E(\mathbf{w})$$
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (\mathbf{w}^T \boldsymbol{\phi}_n - t_n) \boldsymbol{\phi}_n = \boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{w} - \boldsymbol{\Phi}^T \mathbf{t}$$
$$\mathbf{H} = \sum_{n=1}^{N} \boldsymbol{\phi}_n \boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T \boldsymbol{\Phi}$$
$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \left\{ \boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{w}^{(\text{old})} - \boldsymbol{\Phi}^T \mathbf{t} \right\} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$$
This is the same as the standard least-squares solution.
Iterative Reweighted Least Squares
Newton-Raphson update for the negative log-likelihood:
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n) \boldsymbol{\phi}_n = \boldsymbol{\Phi}^T (\mathbf{y} - \mathbf{t})$$
$$\mathbf{H} = \sum_{n=1}^{N} y_n (1 - y_n) \boldsymbol{\phi}_n \boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T \mathbf{R} \boldsymbol{\Phi}, \qquad R_{nn} = y_n (1 - y_n)$$
$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^T \mathbf{R} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T (\mathbf{y} - \mathbf{t}) = (\boldsymbol{\Phi}^T \mathbf{R} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{R} \mathbf{z}, \qquad \mathbf{z} = \boldsymbol{\Phi} \mathbf{w}^{(\text{old})} - \mathbf{R}^{-1} (\mathbf{y} - \mathbf{t})$$
This is a weighted least-squares problem.
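A sketch of the resulting IRLS loop; for simplicity it ignores the possibility of $R_{nn} \to 0$, which a robust implementation would guard against:

```python
import numpy as np

# Sketch of IRLS for logistic regression: each Newton step is the weighted
# least-squares solve above with R_nn = y_n (1 - y_n). Phi is (N, M),
# targets t are in {0, 1}.
def irls(Phi, t, iters=10):
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        r = y * (1.0 - y)                        # diagonal of R
        z = Phi @ w - (y - t) / r                # working response
        w = np.linalg.solve(Phi.T @ (r[:, None] * Phi), Phi.T @ (r * z))
    return w
```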
Maximum Margin Classifiers
Support vector machines for the two-class problem:
$$y(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + b, \qquad t_n \in \{-1, +1\}$$
Assuming a linearly separable data set, there exists at least one choice of the parameters satisfying $t_n y(\mathbf{x}_n) > 0$ for all $n$; we seek the one that gives the smallest generalization error.
Margin: the smallest distance between the decision boundary and any of the samples.
Support Vector Machines
Optimization of the parameters, maximizing the margin:
$$\frac{t_n y(\mathbf{x}_n)}{\|\mathbf{w}\|} = \frac{t_n (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$$
Maximizing the margin amounts to minimizing $\|\mathbf{w}\|^2$:
$$\arg\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2$$
subject to the constraints
$$t_n (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) + b) \geq 1$$
Introducing Lagrange multipliers $a_n \geq 0$:
$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) + b) - 1 \right\}$$
Support Vector Machines - Lagrange Multipliers
Minimize with respect to $\mathbf{w}$ and $b$, and maximize with respect to $\mathbf{a}$:
$$\mathbf{w} = \sum_{n=1}^{N} a_n t_n \boldsymbol{\phi}(\mathbf{x}_n), \qquad \sum_{n=1}^{N} a_n t_n = 0$$
The dual form:
$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$
subject to $a_n \geq 0$ and $\sum_{n=1}^{N} a_n t_n = 0$.
This is a quadratic programming problem: maximize a quadratic function of $\mathbf{a}$,
$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \left\{ \mathbf{c}^T \mathbf{a} - \frac{1}{2} \mathbf{a}^T \mathbf{Q} \mathbf{a} \right\},$$
subject to linear constraints. New points are classified using
$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$
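A minimal sketch of this prediction rule, assuming the multipliers $a_n$ and offset $b$ have already been obtained from the quadratic program; the RBF kernel is an illustrative choice:

```python
import numpy as np

# Sketch of the dual-form SVM prediction y(x) = sum_n a_n t_n k(x, x_n) + b.
# The multipliers a, offset b, and training set (X, t) are assumed given.
def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_predict(x, X, t, a, b, kernel=rbf_kernel):
    return sum(a_n * t_n * kernel(x, x_n)
               for a_n, t_n, x_n in zip(a, t, X)) + b
```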
Support Vector Machines
Overlapping class distributions (linearly non-separable data). Slack variable $\xi_n$: distance from the boundary,
$$\xi_n = 0 \text{ (correct side of the margin)}, \quad \text{or} \quad \xi_n = |t_n - y(\mathbf{x}_n)|$$
$$t_n y(\mathbf{x}_n) \geq 1 - \xi_n$$
To maximize the margin while penalizing points that lie on the wrong side of the margin boundary:
$$\min \; C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|\mathbf{w}\|^2$$
SVM - Overlapping Class Distributions
The Lagrangian is
$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} a_n \left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} - \sum_{n=1}^{N} \mu_n \xi_n$$
The dual form is identical to the separable case:
$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$
now subject to the box constraints $0 \leq a_n \leq C$ and $\sum_{n=1}^{N} a_n t_n = 0$. Again this represents a quadratic programming problem.
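As a usage note, off-the-shelf solvers handle this quadratic program; for example, scikit-learn's `SVC` exposes the slack penalty as its `C` argument (the toy data below is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Usage sketch: SVC solves the soft-margin quadratic program; C is the
# slack penalty, i.e. the box constraint 0 <= a_n <= C.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy, nearly separable labels

clf = SVC(C=1.0, kernel="rbf").fit(X, t)
print(clf.predict([[0.5, -0.2]]))
```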
Support Vector Machines
Relation to logistic regression: writing $z = t_n y(\mathbf{x}_n)$, the hinge loss used in the SVM and the error function of logistic regression both approximate the ideal misclassification error (MCE).
[Figure: loss functions plotted against $z$. Black: MCE, blue: hinge loss, red: logistic regression, green: squared error.]