Linear Models for Classification


Linear Models for Classification

Berkay Topçu

Linear Models for Classification

Goal: take an input vector $\mathbf{x}$ and assign it to one of K classes $C_k$, where $k = 1, \ldots, K$. Linear separation of classes.

Generalized Linear Models

We wish to predict discrete class labels, or more generally class posterior probabilities $p(C_k \mid \mathbf{x})$, which lie in the range (0, 1).

Classification model as a linear function of the parameters: $y(\mathbf{x}) = f(\mathbf{w}^T\mathbf{x} + w_0)$.

Classification is carried out directly in the original input space $\mathbf{x}$, or in a fixed nonlinear transformation of the input variables using a vector of basis functions $\boldsymbol{\phi}(\mathbf{x})$.

Discriminant Functions

Linear discriminants: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$

If $y(\mathbf{x}) \ge 0$, assign $\mathbf{x}$ to class $C_1$, and to class $C_2$ otherwise.

The decision boundary is given by $y(\mathbf{x}) = 0$; $\mathbf{w}$ determines the orientation of the decision surface and $w_0$ determines its location.

Compact notation: $y(\mathbf{x}) = \tilde{\mathbf{w}}^T\tilde{\mathbf{x}}$, where $\tilde{\mathbf{w}} = (w_0, \mathbf{w})$ and $\tilde{\mathbf{x}} = (1, \mathbf{x})$.

Multiple Classes

K-class discriminant by combining a number of two-class discriminant functions (K > 2):

One-versus-the-rest: separating points in one particular class $C_k$ from points not in that class.

One-versus-one: $K(K-1)/2$ binary discriminant functions.

Multiple Classes

A single K-class discriminant comprising K linear functions:
$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$

Assign $\mathbf{x}$ to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$.

How to learn the parameters of linear discriminant functions?

Least Squares for Classification

Each class $C_k$ is described by its own linear model
$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$, or in compact notation $\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}}$.

Training data set $\{\mathbf{x}_n, \mathbf{t}_n\}$ for $n = 1, \ldots, N$, where $\mathbf{t}_n = (0, 0, \ldots, 1, \ldots, 0)^T$ is a 1-of-K coded target vector.

Matrix $\mathbf{T}$ whose nth row is the vector $\mathbf{t}_n^T$, and matrix $\tilde{\mathbf{X}}$ whose nth row is $\tilde{\mathbf{x}}_n^T$.

Least Squares for Classification

Minimizing the sum-of-squares error function
$E_D(\tilde{\mathbf{W}}) = \tfrac{1}{2}\,\mathrm{Tr}\{(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})\}$

Solution:
$\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\mathbf{T} = \tilde{\mathbf{X}}^{\dagger}\mathbf{T}$

Discriminant function:
$\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}} = \mathbf{T}^T(\tilde{\mathbf{X}}^{\dagger})^T\tilde{\mathbf{x}}$
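
As a rough illustration of this closed-form solution, the following is a minimal NumPy sketch, assuming toy two-class data and one-hot targets (the names X, t, X_tilde are illustrative, not from the slides):

    import numpy as np

    # Minimal sketch of least-squares classification (toy data assumed).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    t = np.zeros((100, 2))           # one-hot (1-of-K) target matrix T
    t[:50, 0] = 1
    t[50:, 1] = 1

    X_tilde = np.hstack([np.ones((100, 1)), X])            # prepend bias feature -> X~
    W_tilde, *_ = np.linalg.lstsq(X_tilde, t, rcond=None)  # W~ = (X~^T X~)^-1 X~^T T

    y = X_tilde @ W_tilde            # discriminant values y(x) = W~^T x~
    pred = np.argmax(y, axis=1)      # assign each point to the class with the largest output

Here np.linalg.lstsq plays the role of the pseudo-inverse $\tilde{\mathbf{X}}^{\dagger}$.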

Fisher’s Linear Discriminant

Dimensionality reduction: take the D-dimensional input vector $\mathbf{x}$ and project it to one dimension using $y = \mathbf{w}^T\mathbf{x}$.

Choose the projection that maximizes class separation. Two-class problem: $N_1$ points of $C_1$ and $N_2$ points of $C_2$.

Fisher's idea: large separation between the projected class means, and small variance within each class, minimizing class overlap.

Class means and the separation of their projections:
$\mathbf{m}_1 = \frac{1}{N_1}\sum_{n \in C_1}\mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{n \in C_2}\mathbf{x}_n, \qquad m_2 - m_1 = \mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)$

Fisher’s Linear Discriminant

The Fisher criterion:
$J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$

with the between-class and within-class scatter matrices
$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$
$\mathbf{S}_W = \sum_{n \in C_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in C_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$

Maximizing $J(\mathbf{w})$ gives
$\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$
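
A minimal NumPy sketch of this two-class projection, assuming toy per-class sample matrices X1 and X2 (illustrative names, not from the slides):

    import numpy as np

    # Minimal sketch of the two-class Fisher discriminant direction (toy data assumed).
    rng = np.random.default_rng(0)
    X1 = rng.normal([0, 0], 1.0, (50, 2))    # samples of class C1
    X2 = rng.normal([3, 2], 1.0, (50, 2))    # samples of class C2

    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)        # w proportional to S_W^{-1} (m2 - m1)

    y1, y2 = X1 @ w, X2 @ w                  # one-dimensional projections of each class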

Fisher’s Linear Discriminant

For the two-class problem, the Fisher criterion is a special case of least squares (reference: Penalized Discriminant Analysis – Hastie, Buja and Tibshirani).

For multiple classes: $\mathbf{y} = \mathbf{W}^T\mathbf{x}$
$\mathbf{S}_W = \sum_{k=1}^{K}\mathbf{S}_k, \qquad \mathbf{S}_k = \sum_{n \in C_k}(\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^T$
$\mathbf{S}_B = \sum_{k=1}^{K} N_k(\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T$
$J(\mathbf{W}) = \mathrm{Tr}\{(\mathbf{W}\mathbf{S}_W\mathbf{W}^T)^{-1}(\mathbf{W}\mathbf{S}_B\mathbf{W}^T)\}$

The weight values are determined by the eigenvectors corresponding to the K highest eigenvalues of $\mathbf{S}_W^{-1}\mathbf{S}_B$.
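
As a rough sketch of the multi-class case, the discriminant directions can be taken from the leading eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$; the per-class sample matrices Xs below are assumed toy data:

    import numpy as np

    # Minimal sketch of multi-class Fisher discriminant directions (toy data assumed).
    rng = np.random.default_rng(0)
    Xs = [rng.normal(c, 1.0, (40, 3)) for c in ([0, 0, 0], [3, 0, 1], [0, 3, 2])]

    m = np.vstack(Xs).mean(axis=0)                    # overall mean
    S_W = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) for X in Xs)
    S_B = sum(len(X) * np.outer(X.mean(axis=0) - m, X.mean(axis=0) - m) for X in Xs)

    # Discriminant directions: leading eigenvectors of S_W^{-1} S_B.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)
    W = eigvecs[:, order[:2]].real                    # top two directions as columns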

The Perceptron Algorithm

The input vector $\mathbf{x}$ is transformed using a nonlinear transformation $\boldsymbol{\phi}(\mathbf{x})$:
$y(\mathbf{x}) = f(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})), \qquad f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}, \qquad t \in \{-1, +1\}$

Perceptron criterion: for all training samples we require
$\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\,t_n > 0$

We need to minimize
$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\,t_n$
where $\mathcal{M}$ is the set of misclassified patterns.

The Perceptron Algorithm – Stochastic Gradient Descent

Cycle through the training patterns in turn. If a pattern is correctly classified the weight vector remains unchanged; otherwise:
$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta\,\boldsymbol{\phi}_n t_n$
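
A minimal sketch of this update loop in NumPy, assuming toy separable data and the identity feature map (X, t, eta are illustrative names):

    import numpy as np

    # Minimal sketch of perceptron learning by stochastic gradient descent (toy data assumed).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    t = np.hstack([-np.ones(50), np.ones(50)])    # targets in {-1, +1}

    w = np.zeros(2)
    eta = 1.0                                     # learning rate
    for epoch in range(20):                       # cycle through the patterns in turn
        for phi_n, t_n in zip(X, t):
            if t_n * (w @ phi_n) <= 0:            # misclassified pattern
                w = w + eta * phi_n * t_n         # perceptron update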

Probabilistic Generative Models

Depend on simple assumptions about the distribution of the data.

$p(C_1 \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_1)\,p(C_1)}{p(\mathbf{x} \mid C_1)\,p(C_1) + p(\mathbf{x} \mid C_2)\,p(C_2)} = \sigma(a) = \dfrac{1}{1 + \exp(-a)}$

where
$a = \ln\dfrac{p(\mathbf{x} \mid C_1)\,p(C_1)}{p(\mathbf{x} \mid C_2)\,p(C_2)}$

Logistic sigmoid function: maps the whole real axis into a finite interval.

Continuous Inputs - Gaussian

Assuming the class-conditional densities are Gaussian with a shared covariance matrix:
$p(\mathbf{x} \mid C_k) = \dfrac{1}{(2\pi)^{D/2}}\,\dfrac{1}{|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left\{-\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_k)\right\}$

Case of two classes:
$p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$
where
$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$
$w_0 = -\dfrac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \dfrac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\dfrac{p(C_1)}{p(C_2)}$
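
A minimal NumPy sketch of this posterior, assuming the means, shared covariance, and class priors are given (all values below are illustrative toy numbers):

    import numpy as np

    # Minimal sketch: posterior p(C1 | x) from Gaussian class conditionals with a
    # shared covariance (mu1, mu2, Sigma and the priors are assumed toy values).
    mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
    p_C1, p_C2 = 0.5, 0.5

    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2 + np.log(p_C1 / p_C2)

    x = np.array([1.0, 0.5])
    posterior_C1 = 1.0 / (1.0 + np.exp(-(w @ x + w0)))   # sigma(w^T x + w0)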

Maximum Likelihood Solution

Likelihood function (with prior $p(C_1) = \pi$ and targets $t_n \in \{0, 1\}$):
$p(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N}\left[\pi\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})\right]^{t_n}\left[(1-\pi)\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\right]^{1-t_n}$

Maximizing the log-likelihood gives
$\pi = \dfrac{N_1}{N} = \dfrac{N_1}{N_1 + N_2}$
$\boldsymbol{\mu}_1 = \dfrac{1}{N_1}\sum_{n=1}^{N} t_n\mathbf{x}_n, \qquad \boldsymbol{\mu}_2 = \dfrac{1}{N_2}\sum_{n=1}^{N}(1 - t_n)\mathbf{x}_n$
$\boldsymbol{\Sigma} = \mathbf{S} = \dfrac{N_1}{N}\mathbf{S}_1 + \dfrac{N_2}{N}\mathbf{S}_2$
$\mathbf{S}_1 = \dfrac{1}{N_1}\sum_{n \in C_1}(\mathbf{x}_n - \boldsymbol{\mu}_1)(\mathbf{x}_n - \boldsymbol{\mu}_1)^T, \qquad \mathbf{S}_2 = \dfrac{1}{N_2}\sum_{n \in C_2}(\mathbf{x}_n - \boldsymbol{\mu}_2)(\mathbf{x}_n - \boldsymbol{\mu}_2)^T$
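
These estimates are straightforward to compute; a minimal sketch assuming toy data X with binary labels t (t = 1 for C1, t = 0 for C2; names are illustrative):

    import numpy as np

    # Minimal sketch: maximum likelihood estimates for the shared-covariance
    # Gaussian generative model (toy data assumed).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (40, 2))])
    t = np.hstack([np.ones(60), np.zeros(40)])    # t = 1 for C1, t = 0 for C2

    N, N1, N2 = len(t), t.sum(), (1 - t).sum()
    pi = N1 / N                                   # prior p(C1)
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2

    X1, X2 = X[t == 1], X[t == 0]
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2         # shared covariance estimate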

Probabilistic Discriminative Models

In the probabilistic generative model the number of parameters grows quadratically with M (the number of dimensions). However, the functional form
$p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T\boldsymbol{\phi})$
has only M adjustable parameters.

Maximum likelihood solution for logistic regression:
$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}(1 - y_n)^{1 - t_n}$

Energy function: the negative log-likelihood
$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N}\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\}$

Iterative Reweighted Least Squares

Newton-Raphson iterative optimization applied to linear regression:
$\mathbf{w}^{(\mathrm{new})} = \mathbf{w}^{(\mathrm{old})} - \mathbf{H}^{-1}\nabla E(\mathbf{w})$
$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(\mathbf{w}^T\boldsymbol{\phi}_n - t_n)\boldsymbol{\phi}_n = \boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} - \boldsymbol{\Phi}^T\mathbf{t}$
$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N}\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T\boldsymbol{\Phi}$
$\mathbf{w}^{(\mathrm{new})} = \mathbf{w}^{(\mathrm{old})} - (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\{\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w}^{(\mathrm{old})} - \boldsymbol{\Phi}^T\mathbf{t}\} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$

Same as the standard least-squares solution.

Iterative Reweighted Least Squares

Newton-Raphson update for the negative log-likelihood of logistic regression:
$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\boldsymbol{\phi}_n = \boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$
$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N} y_n(1 - y_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi}, \qquad R_{nn} = y_n(1 - y_n)$

$\mathbf{w}^{(\mathrm{new})} = \mathbf{w}^{(\mathrm{old})} - (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$
$= (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\{\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi}\mathbf{w}^{(\mathrm{old})} - \boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})\}$
$= (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{R}\mathbf{z}, \qquad \mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{(\mathrm{old})} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t})$

A weighted least-squares problem.
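
A minimal NumPy sketch of this Newton-Raphson (IRLS) loop, assuming a toy design matrix Phi with a bias column and binary targets t in {0, 1} (all names are illustrative):

    import numpy as np

    # Minimal sketch of IRLS for logistic regression (toy data assumed).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
    t = np.hstack([np.zeros(50), np.ones(50)])
    Phi = np.hstack([np.ones((100, 1)), X])       # N x M design matrix with bias column

    w = np.zeros(Phi.shape[1])
    for _ in range(10):                           # a few Newton-Raphson steps
        y = 1.0 / (1.0 + np.exp(-Phi @ w))        # y_n = sigma(w^T phi_n)
        R = np.diag(y * (1.0 - y))                # weighting matrix R
        H = Phi.T @ R @ Phi                       # Hessian Phi^T R Phi
        grad = Phi.T @ (y - t)                    # gradient Phi^T (y - t)
        w = w - np.linalg.solve(H, grad)          # w_new = w_old - H^{-1} grad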

Maximum Margin Classifiers

Support Vector Machines for the two-class problem:
$y(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) + b$
with training data $\mathbf{x}_1, \ldots, \mathbf{x}_N$ and targets $t_n \in \{-1, +1\}$.

Assuming a linearly separable data set, there exists at least one set of parameters satisfying $t_n y(\mathbf{x}_n) > 0$ for all $n$; we seek the one that gives the smallest generalization error.

Margin: the smallest distance between the decision boundary and any of the samples.

Support Vector Machines

Optimization of the parameters: maximizing the margin. The distance of point $\mathbf{x}_n$ from the decision surface is
$\dfrac{t_n y(\mathbf{x}_n)}{\|\mathbf{w}\|} = \dfrac{t_n(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$

Maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|^2$:
$\arg\min_{\mathbf{w}, b}\ \dfrac{1}{2}\|\mathbf{w}\|^2$
subject to the constraints
$t_n(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) + b) \ge 1$

Introduction of Lagrange multipliers $a_n \ge 0$:
$L(\mathbf{w}, b, \mathbf{a}) = \dfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n\{t_n(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) + b) - 1\}$

Support Vector Machines - Lagrange Multipliers

Minimizing with respect to $\mathbf{w}$ and $b$ and maximizing with respect to $\mathbf{a}$:
$\mathbf{w} = \sum_{n=1}^{N} a_n t_n\boldsymbol{\phi}(\mathbf{x}_n), \qquad \sum_{n=1}^{N} a_n t_n = 0$

The dual form:
$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \dfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$
subject to $a_n \ge 0$ and $\sum_{n=1}^{N} a_n t_n = 0$.

Quadratic programming problem: maximize a quadratic function of $\mathbf{a}$,
$\arg\max_{\mathbf{a}}\ \mathbf{1}^T\mathbf{a} - \tfrac{1}{2}\mathbf{a}^T\mathbf{Q}\mathbf{a}$ with $Q_{nm} = t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$,
subject to the linear constraints above.

New points are classified with
$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$
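
As a rough, non-authoritative illustration of the dual prediction formula, scikit-learn's SVC can be fitted and its decision values reproduced from the stored support vectors and dual coefficients (the toy data and the RBF kernel choice are assumptions, not from the slides):

    import numpy as np
    from sklearn.svm import SVC

    # Minimal sketch: fit a kernel SVM and reproduce y(x) = sum_n a_n t_n k(x, x_n) + b
    # from the dual coefficients (toy data assumed).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
    t = np.hstack([-np.ones(50), np.ones(50)])

    clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, t)

    def dual_decision(x):
        # dual_coef_ stores a_n * t_n for the support vectors
        k = np.exp(-0.5 * np.sum((clf.support_vectors_ - x) ** 2, axis=1))  # RBF kernel, gamma = 0.5
        return clf.dual_coef_[0] @ k + clf.intercept_[0]

    x_new = np.array([0.3, -0.2])
    print(dual_decision(x_new), clf.decision_function([x_new])[0])  # the two values should agree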

Support Vector Machines

Overlapping class distributions (linearly non-separable data). Slack variables $\xi_n \ge 0$ measure the distance from the margin boundary: $\xi_n = 0$ for points on or inside the correct margin boundary and $\xi_n = |t_n - y(\mathbf{x}_n)|$ otherwise, so the constraints become
$t_n y(\mathbf{x}_n) \ge 1 - \xi_n$

To maximize the margin while penalizing points that lie on the wrong side of the margin boundary, minimize
$C\sum_{n=1}^{N}\xi_n + \dfrac{1}{2}\|\mathbf{w}\|^2$

SVM - Overlapping Class Distributions

$L(\mathbf{w}, b, \mathbf{a}) = \dfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N} a_n\{t_n y(\mathbf{x}_n) - 1 + \xi_n\} - \sum_{n=1}^{N}\mu_n\xi_n$

The dual form is identical to the separable case:
$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \dfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$
but now subject to the box constraints $0 \le a_n \le C$ and $\sum_{n=1}^{N} a_n t_n = 0$.

Again this represents a quadratic programming problem.

Support Vector Machines

Relation to logistic regression: the hinge loss used in the SVM and the error function of logistic regression both approximate the ideal misclassification error (MCE), viewed as functions of $z_n = t_n y(\mathbf{x}_n)$.

[Figure: error functions versus z. Black: misclassification error, blue: hinge loss, red: logistic regression, green: squared error.]
