LINEAR MODELS FOR CLASSIFICATION
Chapter 4 in PRML
Yun Gu
Department of Automation
Shanghai Jiao Tong University
Shanghai, CHINA
May 12, 2014
Overview of Classification
Discriminant Functions
Prob. Generative Models
Prob. Discriminative Models
Bonus
Outline
1 Overview of Classification
2 Discriminant Functions
   From Two Classes to Multiple Classes
   Least Squares
   Fisher’s Linear Discriminant
   Find a Separating Hyperplane
3 Prob. Generative Models
   Linear Discriminant Analysis
   Quadratic Discriminant Analysis
4 Prob. Discriminative Models
   Logistic Regression
   LDA or Logistic Regression?
5 Bonus
   Multi-Class, Multi-Label, Struct-Output
Y. Gu LINEAR MODELS FOR CLASSIFICATION
Outline of this section
1 Overview of Classification
2 Discriminant Functions
3 Prob. Generative Models
4 Prob. Discriminative Models
5 Bonus
What is Classification?
Definition (Classification)
Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
Input: Feature vector x, continuous or discrete;
Output: Qualitative values (i.e. categorical or discrete values) y, where y ∈ Y, Y = {y1, y2, ..., yn};
Mapping: y = f(x) : x → y;
Supervised Learning/Semi-supervised Learning.
Example: Image Classification
Figure: An example of image classification. Input: visual features (e.g. SIFT, color moments); Output: scene category (e.g. ”Bridge”, ”Castle”); Mapping: a sparse factor representation (see ”Multi-Label Image Categorization With Sparse Factor Representation”, Sun et al., IEEE TIP 2014).
From Two Classes to Multiple Classes
Least Squares
Fisher’s Linear Discriminant
Find a Separating Hyperplane
Outline of this section
1 Overview of Classification
2 Discriminant Functions
   From Two Classes to Multiple Classes
   Least Squares
   Fisher’s Linear Discriminant
   Find a Separating Hyperplane
3 Prob. Generative Models
4 Prob. Discriminative Models
5 Bonus
Two-Class Problem
Let’s start with the simplest case: Two-class Classification, where:
Input: Feature vector x, continuous or discrete;
Output: Binary values y, where y ∈ Y, Y = {−1, +1} or Y = {0, 1};
Model: $y(x) = w^T x + w_0$; x is assigned to +1 if y(x) ≥ 0, and to −1 otherwise.
Examples: Male or Female; Toy Data.
Example of Two-Class Problem: On ”Toy” Dataset
Figure: A demonstration of classification on a ”toy” dataset via semi-supervised learning. (see ”Graph Transduction via Alternating”, Wang et al., ICML, 2008)
From Two-Class to Multiple-Class
A more complex case: Multiple-class Classification, where:

Input: Feature vector x, continuous or discrete;
Output: For an n-class problem, qualitative values C, where C ∈ C, C = {C1, C2, ..., Cn};
Model:
   Binary: One-vs-One
   Binary: One-vs-Rest
   Binary: ECOC
   Directly comprising n linear functions
From Two-Class to Multiple-Class (Cont.)
Scheme 1: Decompose an n-class problem into several two-class problems:
One-vs-Rest: Use n − 1 binary classifiers, each of which solves a two-class problem of separating points in a particular class Ck from points not in that class.
Pros: Easy and efficient;
Cons: Ambiguous regions; unbalanced samples.
One-vs-One: Use (n − 1)n/2 binary classifiers, one for every possible pair of classes. Each point is classified according to a majority vote amongst the discriminant functions.
Pros: ”Better” performance than ”One-vs-Rest”;
Cons: Ambiguous regions; too many classifiers.
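The one-vs-one voting rule above can be sketched in a few lines of Python; the pairwise classifier outputs here are hypothetical placeholders, not tied to any particular learner:

```python
from itertools import combinations

import numpy as np

def ovo_vote(pair_predictions, n_classes):
    """Majority vote over the outputs of the (n-1)n/2 pairwise classifiers."""
    votes = np.zeros(n_classes, dtype=int)
    for winner in pair_predictions.values():
        votes[winner] += 1          # each pairwise classifier casts one vote
    return int(np.argmax(votes))

# hypothetical outputs for a 3-class problem: classifier (i, j) reports
# which of the two classes it prefers for a given test point
preds = {(0, 1): 0, (0, 2): 2, (1, 2): 2}
winner = ovo_vote(preds, 3)         # class 2 collects two votes

# for n = 4 classes there is one classifier per unordered pair: (4-1)*4/2 = 6
n_classifiers = len(list(combinations(range(4), 2)))
```

Ties between classes with equal votes are the ”ambiguous regions” mentioned above; `argmax` here simply breaks them by the lowest class index.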
Problems from 1-vs-1 and 1-vs-R
Figure: Attempting to construct an n-class discriminant from a set of two-class discriminants leads to ambiguous regions, shown in green. On the left is an example involving the use of two discriminants designed to distinguish points in class Ck from points not in class Ck. On the right is an example involving three discriminant functions, each of which is used to separate a pair of classes Ck and Cj. (PRML p. 183)
From Two-Class to Multiple-Class (Cont.)
ECOC (Error-Correcting Output Codes): Given a set of n classes, the basis of the ECOC framework consists of designing a codeword for each of the classes. These codewords encode the membership information of each class for a given binary problem.
Pros: Robust;
Cons: The codebook must be constructed.
Figure: ECOC for a 4-class problem — the 4-class one-vs-rest scheme with binary outputs y1–y4, and its ECOC extension.
From Two-Class to Multiple-Class (Cont.)
A comprehensive evaluation of different schemes for reducing multi-class to two-class problems has been conducted (R. Rifkin and A. Klautau, ”In Defense of One-Vs-All Classification”, JMLR, 2004). Their main thesis is that a simple ”one-vs-all” scheme is as accurate as any other approach, assuming that the underlying binary classifiers are well-tuned regularized classifiers such as support vector machines. This thesis is interesting in that it disagrees with a large body of recent published work on multiclass classification. The authors support this position by means of a critical review of the existing literature, a substantial collection of carefully controlled experimental work, and theoretical arguments.
From Two-Class to Multiple-Class (Cont.)
Directly comprising n linear functions:

$$y_k(x) = w_k^T x + w_{k0}, \quad k = 1, 2, ..., n$$

x is assigned to class Ck if $y_k(x) \ge y_j(x)$ for all $j \ne k$.

Pros: Gives a likelihood-like score; the decision regions are convex;
Cons: Not as easy to solve; off-the-shelf binary classifiers cannot be reused.
Methods for Classification
Discriminant Functions:
Data Fitting: Least Squares Estimation;
Feature Reduction: Fisher’s Discriminant Analysis;
Find a Separating Hyperplane: Perceptron and SVM.
Probabilistic Generative Model: LDA/QDA.
Probabilistic Discriminative Model: Logistic Regression.
Least Squares For Classification
Least Squares: Similar to linear regression, we can solve the classification problem via least squares estimation.
Data: Given the training set {(x_n, t_n)}, n = 1, 2, ..., N, where x_n is the feature vector and t_n is an indicator vector: for a K-class problem, $t_{n,k} = 1$ and $t_{n,i} = 0$ ($i \ne k$) if x_n belongs to C_k.
Decision Function: For class C_k, $y_k(x) = w_k^T x + w_{k0}$; x is assigned to C_k if $y_k(x) \ge y_i(x)$ for all $i \ne k$.
Opt Problem:

$$\min_{W}\; L(W) = \frac{1}{2}\,\mathrm{Tr}\left\{(XW - T)^T (XW - T)\right\}$$

where $X = [x_1, x_2, ..., x_N]^T$ stacks the feature vectors as rows, $W = [w_1, w_2, ..., w_K]$ stacks the class weight vectors as columns, and $T = [t_1, t_2, ..., t_N]^T$ stacks the target vectors as rows.
Cons: Sensitive to outliers (singular data); poor performance when the class distributions are strongly asymmetric.
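The trace objective above has the usual normal-equation solution $W = (X^T X)^{-1} X^T T$. A minimal NumPy sketch, assuming the samples-as-rows convention and synthetic, well-separated two-class data:

```python
import numpy as np

# Two well-separated Gaussian classes in 2-D; T holds 1-of-K target rows.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
T = np.zeros((80, 2))
T[:40, 0] = 1
T[40:, 1] = 1

Xb = np.hstack([X, np.ones((80, 1))])   # append a bias column, folding w_k0 into w_k
W = np.linalg.pinv(Xb) @ T              # least-squares solution of the trace objective
pred = np.argmax(Xb @ W, axis=1)        # assign x to the class with the largest y_k(x)
acc = np.mean(pred == np.repeat([0, 1], 40))
```

On data this clean the fit is essentially perfect; adding a few far-away outliers to one class is an easy way to reproduce the sensitivity mentioned in the Cons above.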
Least Squares For Classification (Cont.)
Figure: Failure cases. Left: singular (outlier) data; Right: asymmetrically distributed data.
Fisher’s Linear Discriminant (Fisher, 1936)
Idea: Treat the classification problem as feature reduction (projection onto one dimension);
Figure : The corresponding projection based on the Fisher linear discriminant.
Fisher’s Linear Discriminant (Cont.)
Criterion: Maximize the between-class variance and minimize the within-class variance.

Data: For a 2-class problem, training set {(x_n, y_n)}, n = 1, 2, ..., N, $y_n \in \{-1, +1\}$.
Decision Function: $y = w^T x$
Opt Problem:

$$\max_{w}\; J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{w^T S_B w}{w^T S_W w}$$

where
$m_k = w^T \mathbf{m}_k$, with class means $\mathbf{m}_k = \frac{1}{N_k}\sum_{n \in C_k} x_n$;
$s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2$, with projections $y_n = w^T x_n$;
$S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$;
$S_W = \sum_k \sum_{n \in C_k} (x_n - \mathbf{m}_k)(x_n - \mathbf{m}_k)^T$.
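The maximizer of J(w) is available in closed form as $w \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$. A small NumPy sketch on assumed synthetic data (the blob parameters and the separation score J are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0, 0], [1.0, 0.3], (100, 2))   # class 1 samples
X2 = rng.normal([2, 1], [1.0, 0.3], (100, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
w = np.linalg.solve(SW, m2 - m1)                # w ∝ S_W^{-1}(m2 - m1)

# after projecting onto w, the class means should be far apart
# relative to the summed within-class spread (empirical J)
p1, p2 = X1 @ w, X2 @ w
J = (p2.mean() - p1.mean()) ** 2 / (p1.var() + p2.var())
```

Because the within-class variance differs across dimensions here, the Fisher direction tilts away from the naive mean-difference direction, which is exactly the point of the criterion.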
Fisher’s Linear Discriminant (Cont.)
Multi-Class Case (reduced to a K-dimensional projection):

Data: For a K-class problem, training set {(x_n, t_n)}, n = 1, 2, ..., N, where t_n is a K-dim indicator vector.
Decision Function: $y = W^T x$
Opt Problem:

$$\max_{W}\; J(W) = \mathrm{Tr}\{s_W^{-1} s_B\}$$

where
$s_W = \sum_{k=1}^{K} \sum_{n \in C_k} (y_n - \mu_k)(y_n - \mu_k)^T$;
$s_B = \sum_{k=1}^{K} N_k (\mu_k - \mu)(\mu_k - \mu)^T$;
$\mu_k = \frac{1}{N_k} \sum_{n \in C_k} y_n$, $\mu = \frac{1}{N} \sum_{k=1}^{K} N_k \mu_k$.
Find a Separating Hyperplane
This procedure constructs linear decision boundaries that explicitly separate the data into different classes.
Perceptron
Support Vector Machine
Figure: The orange line is the least squares solution, which misclassifies one of the training points. Two blue separating hyperplanes are found by the perceptron with different random starts.
Preliminary
The distance between the decision boundary and a point:

Decision boundary: $y(x) = w^T x + w_0 = 0$;
Point: $x = x_\perp + r\,\frac{w}{\|w\|}$;
Projection: $r = \frac{y(x)}{\|w\|}$ (positive or negative, depending on the side of the boundary).
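A quick numeric check of the signed-distance formula, using an assumed toy hyperplane:

```python
import numpy as np

# toy hyperplane 3*x1 + 4*x2 - 5 = 0, i.e. w = (3, 4), w0 = -5
w, w0 = np.array([3.0, 4.0]), -5.0
x = np.array([3.0, 4.0])

r = (w @ x + w0) / np.linalg.norm(w)    # signed distance r = y(x) / ||w||
```

Here $y(x) = 9 + 16 - 5 = 20$ and $\|w\| = 5$, so the point lies at signed distance 4 on the positive side; flipping the sign of x would flip the sign of r.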
Perceptron (Rosenblatt,1962)
The Perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary. If a response $y_i = +1$ is misclassified, then $w^T\phi(x_i) < 0$, and the opposite holds for a misclassified response with $y_i = -1$.
Problem: Binary classification;
Decision Function: $y = w^T\phi(x) + w_0$;
Opt Model:

$$\min_{w,\,\|w\|=1}\; -\sum_{i \in M} y_i\,\big(w^T\phi(x_i)\big)$$

where M indexes the set of misclassified points. This problem can be solved via the Stochastic Gradient Descent (SGD) algorithm.
Cons: The solution is not unique; the number of iterations can be very large.
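A minimal SGD perceptron, assuming the identity basis $\phi(x) = x$ and synthetic separable data; the bias $w_0$ is folded into w via an appended constant feature:

```python
import numpy as np

def perceptron_train(X, y, epochs=50, lr=1.0):
    """SGD perceptron: on each misclassified point, update w += lr * y_i * x_i."""
    Xb = np.hstack([X, np.ones((len(X), 1))])    # constant feature absorbs the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:               # misclassified (or on the boundary)
                w += lr * yi * xi
                errors += 1
        if errors == 0:                          # converged: all points separated
            break
    return w

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = np.repeat([-1.0, 1.0], 30)
w = perceptron_train(X, y)
acc = np.mean(np.sign(np.hstack([X, np.ones((60, 1))]) @ w) == y)
```

The final w depends on the initialization and the order in which points are visited, which is exactly the non-uniqueness noted in the Cons above.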
Optimal Separating Hyperplanes (Vapnik, 1996)
This approach tries to provide a unique solution to the separating-hyperplane problem by maximizing the margin between the two classes. For a binary problem with N training points, we have:
$$\max_{w, w_0}\; C \quad \text{s.t.}\quad \frac{y_i (w^T x_i + w_0)}{\|w\|} \ge C, \quad i = 1, 2, ..., N$$

Setting $\|w\| = 1/C$, the problem can be written as:

$$\min_{w, w_0}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i (w^T x_i + w_0) \ge 1, \quad i = 1, 2, ..., N$$
This is the basic form of Linear Support Vector Machine.
Linear Discriminant Analysis
Quadratic Discriminant Analysis
Outline of this section
1 Overview of Classification
2 Discriminant Functions
3 Prob. Generative Models
   Linear Discriminant Analysis
   Quadratic Discriminant Analysis
4 Prob. Discriminative Models
5 Bonus
Generative Model Review
This part is based on Bayesian posterior inference as follows:

$$P(C_k|x) = \frac{P(x|C_k)P(C_k)}{P(x)} = \frac{P(x|C_k)P(C_k)}{\sum_j P(x|C_j)P(C_j)}$$

If $P(x|C_k)$ is Gaussian, then the ”log-odds” is:

$$\log\frac{P(C_i|x)}{P(C_j|x)} = \frac{1}{2}\log\frac{|\Sigma_j|}{|\Sigma_i|} + \frac{1}{2}\left[(x-\mu_j)^T\Sigma_j^{-1}(x-\mu_j) - (x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)\right] + \log\frac{P(C_i)}{P(C_j)}$$

The task is to estimate the covariance and mean of each class.
Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix $\Sigma_k = \Sigma,\ \forall k$. So we have:

$$\log\frac{P(C_i|x)}{P(C_j|x)} = \log\frac{P(C_i)}{P(C_j)} - \frac{1}{2}(\mu_i + \mu_j)^T\Sigma^{-1}(\mu_i - \mu_j) + x^T\Sigma^{-1}(\mu_i - \mu_j)$$

an equation linear in x. For each class, we have the linear discriminant functions:

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log P(C_k)$$
Linear Discriminant Analysis (Cont.)
How to estimate the parameters? MLE.

$\hat{P}(C_k) = \frac{N_k}{N}$, where $N_k$ is the number of class-k observations;
$\hat{\mu}_k = \frac{1}{N_k}\sum_{n \in C_k} x_n$;
$\hat{\Sigma} = \frac{1}{N-K}\sum_{k=1}^{K}\sum_{n \in C_k} (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T$.
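A NumPy sketch of these estimates, combined with the linear discriminant $\delta_k(x)$ from the previous slide; the two-blob dataset is an assumed toy setup:

```python
import numpy as np

def lda_fit(X, y, n_classes):
    """MLE for LDA: class priors, class means, and the pooled covariance."""
    N, d = X.shape
    priors = np.array([(y == k).mean() for k in range(n_classes)])
    means = np.array([X[y == k].mean(axis=0) for k in range(n_classes)])
    Sigma = np.zeros((d, d))
    for k in range(n_classes):
        D = X[y == k] - means[k]
        Sigma += D.T @ D
    Sigma /= (N - n_classes)                 # pooled estimate with 1/(N-K) factor
    return priors, means, Sigma

def lda_discriminants(x, priors, means, Sigma):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log P(C_k)
    Si = np.linalg.inv(Sigma)
    return np.array([x @ Si @ m - 0.5 * m @ Si @ m + np.log(p)
                     for m, p in zip(means, priors)])

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.6, (50, 2)) for c in [(0, 0), (3, 3)]])
y = np.repeat([0, 1], 50)
priors, means, Sigma = lda_fit(X, y, 2)
pred = np.array([np.argmax(lda_discriminants(x, priors, means, Sigma)) for x in X])
acc = np.mean(pred == y)
```

Because both classes share the pooled Sigma, the implied decision boundary between the two discriminants is a straight line, matching the linearity claim above.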
Quadratic Discriminant Analysis
Getting back to the general discriminant problem, if the $\Sigma_k$ are not assumed to be equal, then the pieces quadratic in x remain. We then get the quadratic discriminant functions:

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log P(C_k)$$

The decision boundary between each pair of classes k and l is described by the quadratic equation $\{x : \delta_k(x) = \delta_l(x)\}$.
Note: Compared with LDA, QDA performs better on nonlinear problems.
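A small sketch of the quadratic discriminant $\delta_k(x)$; the two Gaussians with equal means but different covariances are an assumed toy setup chosen so that the QDA boundary is clearly non-linear (a circle), something LDA cannot represent:

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, prior):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log P(C_k)"""
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.solve(Sigma, d)
            + np.log(prior))

# two classes with the same mean but different spread -> circular boundary
mu0, S0 = np.zeros(2), np.eye(2)            # tight class 0
mu1, S1 = np.zeros(2), 4.0 * np.eye(2)      # diffuse class 1

x_near, x_far = np.array([0.5, 0.0]), np.array([3.0, 0.0])
near_class = int(qda_discriminant(x_near, mu1, S1, 0.5)
                 > qda_discriminant(x_near, mu0, S0, 0.5))
far_class = int(qda_discriminant(x_far, mu1, S1, 0.5)
                > qda_discriminant(x_far, mu0, S0, 0.5))
```

Points close to the common mean are assigned to the tight class and points far away to the diffuse class; with equal means, LDA's linear boundary could never separate these regions.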
Logistic Regression
LDA or Logistic Regression?
Outline of this section
1 Overview of Classification
2 Discriminant Functions
3 Prob. Generative Models
4 Prob. Discriminative Models
   Logistic Regression
   LDA or Logistic Regression?
5 Bonus
Logistic Regression
The Logistic Regression model arises from the desire to model the posterior probabilities of the K classes via linear functions in x, while at the same time ensuring they sum to one and remain in [0, 1]. The model has the form:

$$\log\frac{P(C_1|x)}{P(C_K|x)} = w_{10} + w_1^T x$$

$$\log\frac{P(C_2|x)}{P(C_K|x)} = w_{20} + w_2^T x$$

$$\vdots$$

$$\log\frac{P(C_{K-1}|x)}{P(C_K|x)} = w_{(K-1)0} + w_{K-1}^T x$$
Logistic Regression (Cont.)
The model is specified in terms of K − 1 log-odds. Using the logistic function $f(\sigma) = \frac{1}{1 + \exp(-\sigma)}$, we have:

$$P(C_k|x) = \frac{\exp(w_{k0} + w_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(w_{l0} + w_l^T x)}, \quad k = 1, 2, ..., K-1$$

$$P(C_K|x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(w_{l0} + w_l^T x)}$$
Computation: Maximum Likelihood Estimation;
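A minimal sketch of these posterior formulas, with hypothetical weights for a K = 3 problem; appending a zero activation for the reference class $C_K$ makes the two formulas a single softmax:

```python
import numpy as np

def posteriors(x, W, w0):
    """P(C_k|x) for k = 1..K, given the K-1 linear log-odds against class C_K."""
    a = W @ x + w0                    # activations w_k0 + w_k^T x for classes 1..K-1
    e = np.exp(np.append(a, 0.0))     # the reference class C_K has activation 0
    return e / e.sum()                # positive, and sums to 1 by construction

# hypothetical weights for a 3-class problem in 2-D
W = np.array([[1.0, 0.0], [0.0, 1.0]])
w0 = np.array([0.0, 0.0])

p_uniform = posteriors(np.array([0.0, 0.0]), W, w0)   # all log-odds 0 -> uniform
p_skewed = posteriors(np.array([2.0, 0.0]), W, w0)    # class 1 dominates
```

In practice one would subtract the maximum activation before exponentiating for numerical stability; it is omitted here to keep the correspondence with the slide's formulas direct.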
LDA or Logistic Regression?
For LDA, we find that the log-posterior odds between class k and K are linear functions of x:

$$\log\frac{P(C_k|x)}{P(C_K|x)} = \log\frac{P(C_k)}{P(C_K)} - \frac{1}{2}(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^T x$$

For Logistic Regression, we have:

$$\log\frac{P(C_k|x)}{P(C_K|x)} = w_{k0} + w_k^T x$$

It seems that the models are the same. The Logistic Regression model is more general, in that it makes fewer assumptions.
Multi-Class, Multi-Label, Struct-Output
Outline of this section
1 Overview of Classification
2 Discriminant Functions
3 Prob. Generative Models
4 Prob. Discriminative Models
5 Bonus
   Multi-Class, Multi-Label, Struct-Output
Roadmap of Machine Learning in Computer Vision
[Figure: Roadmap of machine learning in computer vision — from binary and multi-class classification (early 1990s), through multi-labeling and multi-label/multi-instance learning (2000s), to structured output today: bounding boxes, attributes, pixel-wise recognition, and event & behaviour understanding.]
Multi-Label and Multi-Instance
Figure : Examples of Multi-label Learning (Li,CIKM’13)
Structure-Output: Bounding-Box
Figure : Examples of Bounding Box (Lan,ECCV’12)
Structure-Output: Sentence
Figure : Examples of ”Sentence” (Lan, ECCV’12)
Structure-Output: Semantic Description
Figure : Examples of ”Semantic Description” (Lan, CVPR’12)
Structure-Output: Pixel-wise Annotation
Figure : Examples of ”Pixel-wise Annotation” (Ladicky, CVPR’13)
Fine-grained Categorization
Figure : Examples of ”Fine-grained Categorization” (Yao, CVPR’11)
Challenges & Future Work
Figure : Challenges of Visual Recognition
Challenges & Future Work
Learning:
Transfer Learning;
Weakly-supervised Learning;
Deep Learning.
Data:
Visual Saliency, Objectness, Descriptors;
Semantic Network.
Thank you. Q&A.