Linear Methods for Classification


1

Linear Methods for Classification

Lecture Notes for CMPUT 466/551

Nilanjan Ray

2

Linear Classification

• What is meant by linear classification?
  – The decision boundaries in the feature (input) space are linear
• Should the regions be contiguous?

[Figure: regions R1, R2, R3, R4 in the (X1, X2) plane — piecewise linear decision boundaries in 2D input space]

3

Linear Classification…

• There is a discriminant function $\delta_k(x)$ for each class $k$
• Classification rule: $R_k = \{x : k = \arg\max_j \delta_j(x)\}$
• In a higher dimensional space the decision boundaries are piecewise hyperplanar
• Remember that the 0-1 loss function led to the classification rule: $R_k = \{x : k = \arg\max_j \Pr(G = j \mid X = x)\}$
• So, $\Pr(G = k \mid X = x)$ can serve as $\delta_k(x)$

4

Linear Classification…

• All we require here is that the class boundaries $\{x : \delta_k(x) = \delta_j(x)\}$ be linear for every $(k, j)$ pair
• One can achieve this if the $\delta_k(x)$ themselves are linear, or if any monotone transform of $\delta_k(x)$ is linear
  – An example:

$$\Pr(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$

so that

$$\log\left[\frac{\Pr(G = 1 \mid X = x)}{\Pr(G = 2 \mid X = x)}\right] = \beta_0 + \beta^T x \qquad \text{(linear)}$$
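As a quick numeric check of this linearity, here is a minimal NumPy sketch; the values of $\beta_0$ and $\beta$ and the test point are illustrative, not from the slides. It confirms that the log-odds of the two posteriors reduces to $\beta_0 + \beta^T x$:

```python
import numpy as np

# Illustrative parameters (any values work): beta0 is the intercept, beta the weight vector.
beta0, beta = 0.5, np.array([1.0, -2.0])

def posteriors(x):
    """Two-class posteriors of the logistic form shown above."""
    z = beta0 + beta @ x
    p1 = np.exp(z) / (1.0 + np.exp(z))   # Pr(G=1 | X=x)
    p2 = 1.0 / (1.0 + np.exp(z))         # Pr(G=2 | X=x)
    return p1, p2

x = np.array([0.3, 1.7])
p1, p2 = posteriors(x)
# The log-odds recovers the linear function beta0 + beta^T x (up to floating point error).
print(np.log(p1 / p2), beta0 + beta @ x)
```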

5

Linear Classification as a Linear Regression

2D input space: $X = (X_1, X_2)$. Number of classes/categories $K = 3$, so the output is $Y = (Y_1, Y_2, Y_3)$. Training sample of size $N = 5$.

Indicator matrix $\mathbf{Y}$ (each row has exactly one 1, indicating the category/class) and model matrix $\mathbf{X}$:

$$\mathbf{Y} = \begin{pmatrix} y_{11} & y_{12} & y_{13}\\ y_{21} & y_{22} & y_{23}\\ y_{31} & y_{32} & y_{33}\\ y_{41} & y_{42} & y_{43}\\ y_{51} & y_{52} & y_{53} \end{pmatrix}, \qquad \mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12}\\ 1 & x_{21} & x_{22}\\ 1 & x_{31} & x_{32}\\ 1 & x_{41} & x_{42}\\ 1 & x_{51} & x_{52} \end{pmatrix}$$

Regression output:

$$\hat{Y}((x_1\; x_2)^T) = (1\; x_1\; x_2)\,(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = (\hat{Y}_1\; \hat{Y}_2\; \hat{Y}_3)$$

or, componentwise,

$$\hat{Y}_k((x_1\; x_2)^T) = \hat\beta_{0k} + \hat\beta_{1k} x_1 + \hat\beta_{2k} x_2, \qquad k = 1, 2, 3$$

Or, classification rule:

$$\hat{G}((x_1\; x_2)^T) = \arg\max_k \hat{Y}_k((x_1\; x_2)^T)$$
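A minimal sketch of this indicator-matrix regression in NumPy. The training data below are hypothetical values chosen only to match the slide's dimensions (N = 5 points in 2-D, K = 3 classes), and the `classify` helper is an illustrative name:

```python
import numpy as np

# Hypothetical training data with the slide's dimensions: N=5 points in 2-D, K=3 classes.
X_raw = np.array([[0.1, 0.2], [0.9, 0.8], [0.5, 0.4], [0.2, 0.9], [0.8, 0.1]])
g = np.array([0, 1, 2, 1, 0])                    # class labels in {0, 1, 2}

N, K = len(g), 3
X = np.hstack([np.ones((N, 1)), X_raw])          # model matrix with a leading column of ones
Y = np.zeros((N, K)); Y[np.arange(N), g] = 1     # indicator (one-hot) response matrix

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # B^ = (X^T X)^{-1} X^T Y, shape (3, K)

def classify(x1, x2):
    y_hat = np.array([1.0, x1, x2]) @ B_hat      # Y^((x1, x2)) = (1, x1, x2) B^
    return int(np.argmax(y_hat))                 # G^ = argmax_k Y^_k

print(classify(0.15, 0.25))                      # classify a new point
```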

6

The Masking

$$\hat{Y}_1 = (1\; x_1\; x_2)\,\hat\beta_1, \qquad \hat{Y}_2 = (1\; x_1\; x_2)\,\hat\beta_2, \qquad \hat{Y}_3 = (1\; x_1\; x_2)\,\hat\beta_3$$

Linear regression of the indicator matrix can lead to masking: one class is never assigned because its fitted $\hat{Y}_k$ never wins the argmax. LDA can avoid this masking.

[Figure: 2D input space and three classes, viewed along the direction that exhibits the masking.]

7

Linear Discriminant Analysis

Essentially the minimum-error Bayes' classifier:
• Assumes that the class conditional densities are (multivariate) Gaussian
• Assumes equal covariance for every class

Posterior probability (application of Bayes rule):

$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$$

where $\pi_k$ is the prior probability for class $k$ and $f_k(x)$ is the class conditional density (likelihood):

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\mathbf{\Sigma}|^{1/2}}\exp\!\left(-\tfrac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}^{-1}(x - \mu_k)\right)$$

8

LDA…

$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \tfrac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k + \tfrac{1}{2}\mu_l^T\mathbf{\Sigma}^{-1}\mu_l + x^T\mathbf{\Sigma}^{-1}(\mu_k - \mu_l) = \delta_k(x) - \delta_l(x)$$

with the linear discriminant functions $\delta_k(x) = x^T\mathbf{\Sigma}^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k + \log\pi_k$.

Classification rule: $\hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x)$

is equivalent to: $\hat{G}(x) = \arg\max_k \delta_k(x)$

The good old Bayes classifier!

9

LDA…

When are we going to use the training data?

Training data: $(x_i, g_i),\ i = 1, \dots, N$ — $N$ input–output pairs in total, $N_k$ pairs in class $k$, and $K$ classes in total.

The training data are utilized to estimate:

• Prior probabilities: $\hat\pi_k = N_k / N$
• Means: $\hat\mu_k = \frac{1}{N_k}\sum_{g_i = k} x_i$
• Covariance matrix: $\hat{\mathbf{\Sigma}} = \frac{1}{N - K}\sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T$
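A minimal NumPy sketch of these estimators, combined with the discriminant $\delta_k(x)$ from the previous slide. The function names (`lda_fit`, `lda_predict`) and the synthetic two-class data are assumptions for illustration:

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate LDA parameters from training pairs (x_i, g_i), with g_i in {0, ..., K-1}."""
    N, p = X.shape
    priors = np.array([np.mean(g == k) for k in range(K)])         # pi^_k = N_k / N
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])   # mu^_k
    Sigma = np.zeros((p, p))
    for k in range(K):                                             # pooled covariance estimate
        D = X[g == k] - means[k]
        Sigma += D.T @ D
    Sigma /= (N - K)
    return priors, means, Sigma

def lda_predict(x, priors, means, Sigma):
    Sinv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sigma^-1 mu_k - (1/2) mu_k^T Sigma^-1 mu_k + log pi_k
    deltas = [x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pi)
              for m, pi in zip(means, priors)]
    return int(np.argmax(deltas))

# Synthetic data: two Gaussian clouds sharing a covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 3], 1, (50, 2))])
g = np.repeat([0, 1], 50)
print(lda_predict(np.array([2.5, 2.5]), *lda_fit(X, g, K=2)))
```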

10

LDA: Example

LDA was able to avoid masking here

11

Quadratic Discriminant Analysis

• Relaxes the equal-covariance assumption
  – the class conditional probability densities (still multivariate Gaussians) are allowed to have different covariance matrices
• The class decision boundaries are then not linear but quadratic

$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \left(\log\pi_k - \tfrac{1}{2}(x - \mu_k)^T\mathbf{\Sigma}_k^{-1}(x - \mu_k) - \tfrac{1}{2}\log|\mathbf{\Sigma}_k|\right) - \left(\log\pi_l - \tfrac{1}{2}(x - \mu_l)^T\mathbf{\Sigma}_l^{-1}(x - \mu_l) - \tfrac{1}{2}\log|\mathbf{\Sigma}_l|\right) = \delta_k(x) - \delta_l(x)$$
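For comparison with LDA, here is a small NumPy sketch of the per-class quadratic discriminant implied by the bracketed terms above; the class parameters and the helper name `qda_delta` are made-up values for illustration:

```python
import numpy as np

def qda_delta(x, mu, Sigma, prior):
    """Quadratic discriminant delta_k(x) for one class, per the log-ratio above."""
    d = x - mu
    return (np.log(prior)
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d)

# Hypothetical two-class setup with a different covariance per class.
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

x = np.array([1.0, 1.5])
deltas = [qda_delta(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print(int(np.argmax(deltas)))   # pick the class with the largest discriminant
```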

12

QDA and Masking

Better than Linear Regression in terms of handling masking.

Usually computationally more expensive than LDA

13

Fisher’s Linear Discriminant [DHS]

From the training set we want to find a direction along which the separation between the class means is high and the overlap between the classes is small.

14

Fisher’s LD…

Projection of a vector $x$ on a unit vector $w$: $w^T x$

[Figure: geometric interpretation of the projection $w^T x$ of $x$ onto the direction $w$.]

From the training set we want to find a direction $w$ along which the separation between the projections of the class means is high and the overlap between the projections of the classes is small.

15

Fisher’s LD…

Class means:
$$m_1 = \frac{1}{N_1}\sum_{x_i \in R_1} x_i, \qquad m_2 = \frac{1}{N_2}\sum_{x_i \in R_2} x_i$$

Projected class means:
$$\tilde{m}_1 = \frac{1}{N_1}\sum_{x_i \in R_1} w^T x_i = w^T m_1, \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{x_i \in R_2} w^T x_i = w^T m_2$$

Difference between projected class means:
$$\tilde{m}_2 - \tilde{m}_1 = w^T(m_2 - m_1)$$

Scatter of the projected data (this will indicate overlap between the classes):
$$\tilde{s}_1^2 = \sum_{y_i:\, x_i \in R_1}(y_i - \tilde{m}_1)^2 = \sum_{x_i \in R_1}(w^T x_i - w^T m_1)^2 = \sum_{x_i \in R_1} w^T(x_i - m_1)(x_i - m_1)^T w = w^T S_1 w$$
$$\tilde{s}_2^2 = \sum_{y_i:\, x_i \in R_2}(y_i - \tilde{m}_2)^2 = \sum_{x_i \in R_2}(w^T x_i - w^T m_2)^2 = \sum_{x_i \in R_2} w^T(x_i - m_2)(x_i - m_2)^T w = w^T S_2 w$$

16

Fisher’s LD…

Ratio of the squared difference of the projected means to the total scatter:

$$r(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_w w} \qquad \text{(Rayleigh quotient)}$$

where

$$S_B = (m_2 - m_1)(m_2 - m_1)^T, \qquad S_w = S_1 + S_2$$

We want to maximize $r(w)$. The solution is

$$w = S_w^{-1}(m_2 - m_1)$$

17

Fisher’s LD: Classifier

So far so good. However, how do we get the classifier?

All we know at this point is that the direction $w = S_w^{-1}(m_2 - m_1)$ separates the projected data very well.

Since we know that the projected class means are well separated, we can choose the average of the two projected means as a threshold for classification.

Classification rule: $x$ is in $R_2$ if $y(x) > 0$, else $x$ is in $R_1$, where

$$y(x) = w^T x - \tfrac{1}{2}(\tilde{m}_1 + \tilde{m}_2) = (m_2 - m_1)^T S_w^{-1}\left(x - \tfrac{1}{2}(m_1 + m_2)\right)$$
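A minimal NumPy sketch of the two-class Fisher classifier just described (direction $w = S_w^{-1}(m_2 - m_1)$ plus the mid-point threshold); the function names and the synthetic data are illustrative assumptions:

```python
import numpy as np

def fisher_ld(X1, X2):
    """Two-class Fisher direction w = S_w^{-1}(m2 - m1) and the mid-point threshold."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m2 - m1)
    threshold = 0.5 * (w @ m1 + w @ m2)          # average of the projected class means
    return w, threshold

def fisher_classify(x, w, threshold):
    return 2 if w @ x - threshold > 0 else 1     # x in R2 if y(x) > 0, else R1

# Synthetic data for the two regions R1 and R2.
rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1, (40, 2))
X2 = rng.normal([4, 1], 1, (40, 2))
w, t = fisher_ld(X1, X2)
print(fisher_classify(np.array([3.0, 1.0]), w, t))
```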

18

Fisher’s LD: Multiple Classes

There are $k$ classes $C_1, \dots, C_k$, with $n_i$ elements in the $i$-th class.

Compute the means of the classes:
$$m_i = \frac{1}{n_i}\sum_{x \in C_i} x$$

Compute the grand mean:
$$m = \frac{1}{n}(n_1 m_1 + \dots + n_k m_k)$$

Compute the scatter matrices:
$$S_w = \sum_{x \in C_1}(x - m_1)(x - m_1)^T + \dots + \sum_{x \in C_k}(x - m_k)(x - m_k)^T$$
$$S_B = n_1(m_1 - m)(m_1 - m)^T + \dots + n_k(m_k - m)(m_k - m)^T$$

Maximize the Rayleigh ratio:
$$r(w) = \frac{w^T S_B w}{w^T S_w w}$$

The solution is the largest eigenvector of $S_w^{-1} S_B$.

At most $(k-1)$ eigenvalues will be non-zero. Dimensionality reduction.
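A NumPy sketch of this multi-class computation, assuming synthetic data and illustrative function names; it forms $S_w$ and $S_B$ as above, takes the eigenvectors of $S_w^{-1} S_B$, and checks that at most $k - 1$ eigenvalues are (numerically) non-zero:

```python
import numpy as np

def fisher_multiclass_directions(X, g, K):
    """Eigen-decomposition of S_w^{-1} S_B; at most K-1 eigenvalues are non-zero."""
    n, p = X.shape
    m = X.mean(axis=0)                                  # grand mean
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for k in range(K):
        Xk = X[g == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk)                   # within-class scatter
        d = (mk - m).reshape(-1, 1)
        Sb += len(Xk) * (d @ d.T)                       # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)                      # largest eigenvalues first
    return vals.real[order], vecs.real[:, order]

# Synthetic 3-class data in 4-D; only 2 eigenvalues should be clearly non-zero.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 1, (30, 4))
               for c in ([0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0])])
g = np.repeat([0, 1, 2], 30)
vals, W = fisher_multiclass_directions(X, g, K=3)
print(np.round(vals, 3))    # trailing eigenvalues near zero -> dimensionality reduction
```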

19

Fisher’s LD and LDA

They become the same when:

(1) the prior probabilities are the same

(2) there is a common covariance matrix for the class conditional densities

(3) both class conditional densities are multivariate Gaussian

Exercise: Show that Fisher’s LD classifier and LDA produce the same classification rule given the above assumptions.

Note: (1) Fisher’s LD does not assume Gaussian densities. (2) Fisher’s LD can be used for dimensionality reduction in a multiple-class scenario.

20

Logistic Regression

• The output of regression is the posterior probability i.e., Pr(output | input)

• Always ensures that the sum of output variables is 1 and each output is non-negative

• A linear classification method
• We need to know about two concepts to understand logistic regression:
  – Newton-Raphson method
  – Maximum likelihood estimation

21

Newton-Raphson Method

A technique for solving the non-linear equation $f(x) = 0$.

Taylor series:
$$f(x_{n+1}) \approx f(x_n) + (x_{n+1} - x_n)\,f'(x_n)$$

If $x_{n+1}$ is a root or very close to the root, then $f(x_{n+1}) \approx 0$.

So:
$$f(x_n) + (x_{n+1} - x_n)\,f'(x_n) \approx 0$$

After rearrangement:
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

Rule for iteration. Need an initial guess $x_0$.
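A minimal sketch of this iteration in Python; the example function (finding $\sqrt{2}$ as the root of $x^2 - 2$) is illustrative, not from the slides:

```python
def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) starting from an initial guess x0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Root of x^2 - 2 = 0, i.e. sqrt(2), starting from x0 = 1.
print(newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0))
```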

22

Newton-Raphson in Multi-dimensions

We want to solve the system of equations:
$$f_1(x_1, x_2, \dots, x_N) = 0, \quad f_2(x_1, x_2, \dots, x_N) = 0, \quad \dots, \quad f_N(x_1, x_2, \dots, x_N) = 0$$

Taylor series:
$$f_j(x + \delta x) \approx f_j(x) + \sum_{k=1}^{N}\frac{\partial f_j}{\partial x_k}\,\delta x_k, \qquad j = 1, \dots, N$$

After some rearrangement etc., the rule for iteration (need an initial guess) is:
$$\begin{pmatrix} x_1^{n+1} \\ x_2^{n+1} \\ \vdots \\ x_N^{n+1} \end{pmatrix} = \begin{pmatrix} x_1^{n} \\ x_2^{n} \\ \vdots \\ x_N^{n} \end{pmatrix} - \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_N} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_N} \\ \vdots & & & \vdots \\ \dfrac{\partial f_N}{\partial x_1} & \dfrac{\partial f_N}{\partial x_2} & \cdots & \dfrac{\partial f_N}{\partial x_N} \end{pmatrix}^{-1} \begin{pmatrix} f_1(x_1^n, x_2^n, \dots, x_N^n) \\ f_2(x_1^n, x_2^n, \dots, x_N^n) \\ \vdots \\ f_N(x_1^n, x_2^n, \dots, x_N^n) \end{pmatrix}$$

The matrix of partial derivatives is the Jacobian matrix.
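A small NumPy sketch of the multi-dimensional rule; it solves a linear system with the Jacobian rather than forming the explicit inverse. The 2×2 test system is an illustrative assumption, not the slide's example:

```python
import numpy as np

def newton_raphson_nd(F, J, x0, tol=1e-10, max_iter=100):
    """Solve F(x) = 0 by x_{n+1} = x_n - J(x_n)^{-1} F(x_n), given the Jacobian J."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(J(x), F(x))   # solve J(x) step = F(x) instead of inverting J
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Illustrative 2x2 system: x1^2 + x2^2 = 1 and x1 - x2 = 0.
F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
J = lambda x: np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])
print(newton_raphson_nd(F, J, x0=[1.0, 0.5]))   # converges to (1/sqrt(2), 1/sqrt(2))
```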

23

Newton-Raphson: Example

Solve the $2 \times 2$ non-linear system
$$f_1(x_1, x_2) = 0, \qquad f_2(x_1, x_2) = 0,$$
where the slide's specific $f_1$ and $f_2$ mix polynomial, sine, and cosine terms.

Iteration rule (need an initial guess):
$$\begin{pmatrix} x_1^{n+1} \\ x_2^{n+1} \end{pmatrix} = \begin{pmatrix} x_1^{n} \\ x_2^{n} \end{pmatrix} - \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} \end{pmatrix}^{-1}_{(x_1^n,\, x_2^n)} \begin{pmatrix} f_1(x_1^n, x_2^n) \\ f_2(x_1^n, x_2^n) \end{pmatrix}$$

24

Maximum Likelihood Parameter Estimation

Let's start with an example. We want to find the unknown parameters (mean and standard deviation) of a Gaussian pdf, given $N$ independent samples from it:

$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Samples: $x_1, \dots, x_N$

Form the likelihood function:
$$L(\mu, \sigma) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Estimate the parameters that maximize the likelihood function:
$$(\hat\mu, \hat\sigma) = \arg\max_{\mu,\,\sigma} L(\mu, \sigma)$$

Let's find out $(\hat\mu, \hat\sigma)$.
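For the Gaussian this maximization has the well-known closed form: the sample mean, and the standard deviation with a $1/N$ normalizer. A minimal NumPy sketch, using synthetic samples as an assumption:

```python
import numpy as np

# Synthetic samples x_1, ..., x_N from a Gaussian whose parameters we pretend not to know.
rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

# Maximizing L (equivalently log L) over mu and sigma gives the closed-form estimates:
mu_hat = x.mean()                                  # mu^ = (1/N) sum_i x_i
sigma_hat = np.sqrt(((x - mu_hat) ** 2).mean())    # sigma^ = sqrt((1/N) sum_i (x_i - mu^)^2)
print(mu_hat, sigma_hat)                           # close to 5.0 and 2.0
```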

25

Logistic Regression Model

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \dots, K-1$$

$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$$

The method directly models the posterior probabilities as the output of regression

Note that the class boundaries are linear

How can we show this linear nature?

What is the discriminant function for every class in this model?

26

Logistic Regression Computation

Let's fit the logistic regression model for $K = 2$, i.e., the number of classes is 2.

Training set: $(x_i, g_i),\ i = 1, \dots, N$. Code the response as $y_i = 1$ when $g_i$ is class 1 and $y_i = 0$ when $g_i$ is class 2, and let $x_i$ include the constant term 1 so that the intercept is absorbed into $\beta$.

Log-likelihood:
$$l(\beta) = \sum_{i=1}^{N}\log\Pr(G = y_i \mid X = x_i) = \sum_{i=1}^{N}\Big(y_i\log\Pr(G = 1 \mid X = x_i) + (1 - y_i)\log\Pr(G = 0 \mid X = x_i)\Big)$$
$$= \sum_{i=1}^{N}\Big(y_i\,\beta^T x_i - \log\big(1 + \exp(\beta^T x_i)\big)\Big)$$

We want to maximize the log-likelihood in order to estimate $\beta$.

27

Logistic Regression Computation…

Setting the derivative of the log-likelihood to zero gives

$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i\left(y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}\right) = 0$$

These are $(p+1)$ non-linear equations. Solve by the Newton-Raphson method:

$$\beta^{\text{new}} = \beta^{\text{old}} - \Big[\operatorname{Jacobian}\Big(\frac{\partial l(\beta)}{\partial \beta}\Big)\Big]^{-1}\bigg|_{\beta^{\text{old}}}\ \frac{\partial l(\beta)}{\partial \beta}\bigg|_{\beta^{\text{old}}}$$

Let's work out the details hidden in the above equation. In the process we'll learn a bit about vector differentiation etc.
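A minimal NumPy sketch of this Newton-Raphson fit for the two-class model. It uses the standard fact that the second-derivative matrix works out to $-\mathbf{X}^T\mathbf{W}\mathbf{X}$ with $\mathbf{W} = \operatorname{diag}\big(p_i(1 - p_i)\big)$; the synthetic data and the function name are illustrative assumptions:

```python
import numpy as np

def logistic_newton(X, y, n_iter=25):
    """Fit the two-class logistic model by Newton-Raphson.

    X already contains a leading column of ones; y is coded in {0, 1}.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # Pr(G=1 | x_i) for each i
        grad = X.T @ (y - p)                       # dl/dbeta = sum_i x_i (y_i - p_i)
        W = p * (1.0 - p)                          # diagonal of the weight matrix
        hess = -(X * W[:, None]).T @ X             # d^2 l / (dbeta dbeta^T) = -X^T W X
        beta = beta - np.linalg.solve(hess, grad)  # beta_new = beta_old - H^{-1} grad
    return beta

# Synthetic two-class data for a quick check.
rng = np.random.default_rng(4)
X_raw = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0.0, 1.0], 100)
X = np.hstack([np.ones((200, 1)), X_raw])
print(logistic_newton(X, y))
```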
