CSE489-02 & CSE589-02 Multimedia Processing Lecture 09 Pattern Classifier and Evaluation for Multimedia Applications Spring 2009 New Mexico Tech


Page 1

CSE489-02 & CSE589-02 Multimedia Processing
Lecture 09: Pattern Classifier and Evaluation for Multimedia Applications
Spring 2009
New Mexico Tech

Page 2

Basic Concepts and Definitions

                 Disease present (D+)    Disease absent (D-)     Total
Test positive    True positives (TP)     False positives (FP)    TP + FP
Test negative    False negatives (FN)    True negatives (TN)     FN + TN
Total            TP + FN                 FP + TN

Sensitivity: the proportion of patients with disease who test positive
P(T+|D+) = TP / (TP + FN)

Specificity: the proportion of patients without disease who test negative
P(T-|D-) = TN / (TN + FP)

Page 3

Sensitivity and Specificity

Sensitivity: the ability to detect "true positives"
TP / (TP + FN)

Specificity: the ability to avoid "false positives"
TN / (TN + FP)

Positive Predictive Value (PPV):
TP / (TP + FP)
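As a quick illustration (not part of the original slides), here is a minimal Python sketch of these three definitions; the function names are my own:

```python
def sensitivity(tp, fn):
    """P(T+|D+): proportion of patients with disease who test positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """P(T-|D-): proportion of patients without disease who test negative."""
    return tn / (tn + fp)

def ppv(tp, fp):
    """Positive predictive value: proportion of positive tests that are correct."""
    return tp / (tp + fp)
```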

Page 4

Accuracy

Definition: (TP + TN) / (TP + TN + FP + FN)

Range: [0 … 1]

Page 5

Matthews Correlation Coefficient

Definition: MCC = (TP·TN − FN·FP) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))

Range: [−1 … 1]
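Continuing the illustrative sketch above, accuracy and the Matthews correlation coefficient can be computed from the same four counts:

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all cases classified correctly; range [0, 1]."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0
```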

Page 6

An example:

On a test set with 20 motif-containing (m+) and 47 motif-lacking (m-) proteins, the following results were obtained:

          predicted m+    predicted m-
true m+    17 (TP)          3 (FN)
true m-     8 (FP)         39 (TN)

sensitivity = TP / (TP + FN) = 17 / (17 + 3) = 0.85
positive predictive value = TP / (TP + FP) = 17 / (17 + 8) = 0.68
specificity = TN / (TN + FP) = 39 / (39 + 8) ≈ 0.83
MCC = (17·39 − 3·8) / √((17+8)(17+3)(39+3)(39+8)) = 639 / √987000 ≈ 0.64
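Plugging the counts from this slide into the functions sketched earlier reproduces the reported numbers:

```python
tp, fn, fp, tn = 17, 3, 8, 39

print(sensitivity(tp, fn))    # 0.85
print(ppv(tp, fp))            # 0.68
print(specificity(tn, fp))    # ~0.83
print(mcc(tp, tn, fp, fn))    # ~0.64
```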

Page 7

An Example: Hypothyroidism

Hypothyroidism is the disease state in humans and other animals caused by insufficient production of thyroid hormone by the thyroid gland.

Clinical Response to Thyroxine Sodium in Clinically Hypothyroid but Biochemically Euthyroid Patients

Skinner GRB, Holmes D, Ahmad A, Davies JA, Benitez J. J Nutr Environ Med 2000;10:115-124.

Page 8

An Example: Hypothyroidism

Page 9

An Example: Hypothyroidism

Sensitivity is 18/32 = 0.56; specificity is 92/93 = 0.99

Page 10

An Example: Hypothyroidism

Sensitivity is 25/32 = 0.78; specificity is 75/93 = 0.81

Page 11

An Example: Hypothyroidism

Sensitivity is 29/32 = 0.91; specificity is 39/93 = 0.42

Page 12

An Example: Hypothyroidism

Page 13

An Example: Hypothyroidism

Page 14

A Universal Illustration

Page 15

Comparison of ROC Curves
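The slides show these curves as figures only. As a hedged sketch of how such a curve is traced, the fragment below sweeps a decision threshold over classifier scores and records (1 − specificity, sensitivity) pairs; the synthetic scores and the coarse AUC estimate are my own illustration, not data from the lecture:

```python
import numpy as np

def roc_points(scores, labels):
    """FPR/TPR pairs as the decision threshold sweeps down over the scores."""
    order = np.argsort(-scores)                                  # descending score
    labels = labels[order]
    tpr = np.cumsum(labels == 1) / max((labels == 1).sum(), 1)   # sensitivity
    fpr = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)   # 1 - specificity
    return fpr, tpr

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1, 1, 100),    # diseased cases score higher
                         rng.normal(0, 1, 100)])   # healthy cases
labels = np.concatenate([np.ones(100), np.zeros(100)])

fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(fpr, prepend=0.0) * tpr)      # coarse rectangle-rule AUC
print(f"AUC ~ {auc:.2f}")                          # ~0.76 for these scores
```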

Page 16

Page 17

The idea behind artificial neural networks

The brain of a vertebrate is (in general) capable of learning things.

Example: having seen a number of trees, a normally gifted person will be able to recognise almost all types of trees.

The idea: construct networks of artificial neurons and make them learn and generalize in a way similar to how physiological neural networks do.

Page 18

The feed-forward neural network: the training principle

[Figure: input layer, hidden layer, and output layer of neurons connected by weights; sequence input data presented at the input]

- Data presented at input
- Correct answer fixed at output
- Difference between correct and actual output used for weight adjustment (training)
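To make the training principle concrete, here is a minimal sketch (my own, not from the slides) of a one-hidden-layer feed-forward network trained by backpropagation on the XOR problem; the layer sizes and learning rate are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # data at input
t = np.array([[0], [1], [1], [0]], dtype=float)              # correct answers

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # input -> hidden weights
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)   # hidden -> output weights
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    h = sig(X @ W1 + b1)              # forward pass through the hidden layer
    y = sig(h @ W2 + b2)              # actual output
    d2 = (y - t) * y * (1 - y)        # output error drives the adjustment
    d1 = (d2 @ W2.T) * h * (1 - h)    # error propagated back to the hidden layer
    W2 -= 0.5 * (h.T @ d2); b2 -= 0.5 * d2.sum(axis=0)
    W1 -= 0.5 * (X.T @ d1); b1 -= 0.5 * d1.sum(axis=0)

print(y.round(2))  # approaches [[0], [1], [1], [0]]; more epochs may help,
                   # depending on the random starting weights
```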

Page 19

When to stop training

[Figure: error vs. epoch curves for the training set and the test set]

We want to get good generalization performance and to avoid over-fitting of the parameters to the training set (over-training).
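A hedged sketch of the stopping rule itself: train by gradient descent, watch the test-set error each epoch, and keep the weights from the epoch where it was lowest. The polynomial toy problem and the patience value are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)       # noisy target curve
Phi = np.vander(x, 12)                           # degree-11 polynomial features
tr, te = np.arange(40), np.arange(40, 60)        # training / test split

w = np.zeros(12)
best_err, best_w, patience = np.inf, w.copy(), 0
for epoch in range(20000):
    grad = Phi[tr].T @ (Phi[tr] @ w - y[tr]) / len(tr)
    w -= 0.05 * grad                             # one training step
    err = np.mean((Phi[te] @ w - y[te]) ** 2)    # generalization estimate
    if err < best_err:
        best_err, best_w, patience = err, w.copy(), 0
    else:
        patience += 1
        if patience > 500:                       # error stopped improving:
            break                                # halt before over-training
# best_w now holds the weights from the best epoch
print(f"stopped at epoch {epoch}, test MSE {best_err:.3f}")
```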

Page 20

The training set

- large enough
- contains all possible classes in approximately equal amounts
- unbiased, i.e. no particular type within a class should be overrepresented; this is important for two reasons:
  - if the training set is biased towards a particular type, so will the ANN be
  - if the training and test sets contain too similar examples, the performance will be over-estimated

In short: the training set should be representative.

Page 21

Cross-validation

N-fold cross-validation:
- divide the data set into n parts
- use n − 1 parts for training
- use 1 part for testing

e.g., split the total data set into 5 parts:
- 4 parts for training
- 1 part for testing
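A minimal sketch of the 5-fold split just described (indices only; the model-fitting step is left as a comment):

```python
import numpy as np

rng = np.random.default_rng(0)
indices = rng.permutation(100)       # 100 shuffled example indices
folds = np.array_split(indices, 5)   # split the total data set into 5 parts

for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train the classifier on train_idx, evaluate on test_idx here
    print(f"fold {i}: {len(train_idx)} training / {len(test_idx)} test examples")
```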

Page 22

Support Vector Machines (SVM)

Page 23

Linear Classifiers

f(x, w, b) = sign(w·x + b)

[Figure: 2-D data points, one marker denoting +1 and the other −1, with a candidate line w·x + b = 0 separating the regions w·x + b > 0 and w·x + b < 0]

How would you classify this data?

Page 24

Linear Classifiers

f(x, w, b) = sign(w·x + b)

[Figure: the same data with a different candidate separating line]

How would you classify this data?

Page 25

Linear Classifiers

f(x, w, b) = sign(w·x + b)

[Figure: the same data with yet another candidate separating line]

How would you classify this data?

Page 26

Linear Classifiers

f(x, w, b) = sign(w·x + b)

[Figure: several candidate separating lines]

Any of these would be fine... but which is best?

Page 27

Classifier Margin

f(x, w, b) = sign(w·x + b)

[Figure: a separating line with its margin drawn on either side; markers denote +1 and −1]

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.


Page 28

Maximum Margin

f(x, w, b) = sign(w·x + b)

[Figure: the maximum-margin separating line, with the support vectors lying on the margin]

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).

Support vectors are the datapoints that the margin pushes up against.

1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only support vectors are important; other training examples are ignorable.
3. Empirically it works very, very well.

Page 29

Maximum Margin

What we know:
w · x+ + b = +1
w · x− + b = −1
so w · (x+ − x−) = 2

[Figure: the "Predict Class = +1" zone above wx+b = +1 and the "Predict Class = −1" zone below wx+b = −1, with the separating plane wx+b = 0 between them; x+ and x− are support vectors on opposite margins]

M = margin width = (x+ − x−) · w / ||w|| = 2 / ||w||

Page 30

Linear SVM Mathematically

Goal:

1) Correctly classify all training data:
w·xi + b ≥ +1 if yi = +1
w·xi + b ≤ −1 if yi = −1
i.e. yi(w·xi + b) ≥ 1 for all i

2) Maximize the margin M = 2/||w||, which is the same as minimizing ½ wTw.

We can formulate a quadratic optimization problem and solve for w and b:

Minimize Φ(w) = ½ wTw
subject to yi(w·xi + b) ≥ 1 for all i

Page 31

Solving the Optimization Problem

Find w and b such that Φ(w) = ½ wTw is minimized and, for all (xi, yi): yi(wTxi + b) ≥ 1.

We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them. The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint of the primal problem:

Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαjyiyj xiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

Page 32

The Optimization Problem Solution

The solution has the form:
w = Σ αiyixi
b = yk − wTxk for any xk such that αk ≠ 0

Each non-zero αi indicates that the corresponding xi is a support vector.

The classifying function then has the form:
f(x) = Σ αiyi xiTx + b

Notice that it relies on an inner product between the test point x and the support vectors xi; we will return to this later. Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.
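The identity w = Σαiyixi can be checked numerically. The sketch below assumes scikit-learn is available (the deck itself lists MATLAB toolboxes instead); SVC exposes the products αiyi for the support vectors as dual_coef_:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-3, 1, (20, 2)),    # class -1 cluster
               rng.normal(+3, 1, (20, 2))])   # class +1 cluster
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin
w_rebuilt = clf.dual_coef_ @ clf.support_vectors_   # sum of alpha_i y_i x_i
print(np.allclose(w_rebuilt, clf.coef_))      # True: same w either way
print(len(clf.support_vectors_))              # only a few points matter
```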

Page 33

Dataset with noise

[Figure: +1 and −1 points that cannot be separated cleanly]

Hard margin: so far we have required that all data points be classified correctly (no training error). What if the training set is noisy?

- Solution 1: use very powerful kernels. Risk: OVERFITTING!

Page 34: CSE489-02 & CSE589-02 Multimedia Processing Lecture 09 Pattern Classifier and Evaluation for Multimedia Applications Spring 2009 New Mexico Tech

Slack variables ξi can be added to allow misclassification of difficult or noisy examples.

wx+b=1

wx+b=0

wx+b=-

1

7

11 2

Soft Margin Classification

What should our quadratic optimization criterion be?

Minimize

R

kkεC

1

.2

1ww

Page 35

Hard Margin vs. Soft Margin

The old formulation:
Find w and b such that Φ(w) = ½ wTw is minimized and, for all (xi, yi): yi(wTxi + b) ≥ 1

The new formulation, incorporating slack variables:
Find w and b such that Φ(w) = ½ wTw + C Σξi is minimized and, for all (xi, yi): yi(wTxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i

Parameter C can be viewed as a way to control overfitting.
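To see C acting as an overfitting control, this hedged sketch (again assuming scikit-learn) fits linear soft-margin SVMs with different C values on slightly overlapping clusters; smaller C buys a wider margin, 2/||w||, at the cost of more margin violations:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 1, (40, 2)),
               rng.normal(+1.5, 1, (40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2 / np.linalg.norm(clf.coef_)       # margin width 2/||w||
    print(f"C={C:>6}: margin {width:.2f}, "
          f"{clf.n_support_.sum()} support vectors")
```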

Page 36

Linear SVMs: Overview

The classifier is a separating hyperplane. The most "important" training points are the support vectors; they define the hyperplane. Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrange multipliers αi.

Both in the dual formulation of the problem and in the solution, training points appear only inside dot products:

Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαjyiyj xiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

f(x) = Σ αiyi xiTx + b

Page 37

Non-linear SVMs

Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space?

[Figure: 1-D examples along the x axis; a hard 1-D dataset becomes separable after mapping each point x to (x, x²)]

Page 38

Non-linear SVMs: Feature Spaces

General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Page 39

The "Kernel Trick"

The linear classifier relies on the dot product between vectors: K(xi, xj) = xiTxj.

If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes:
K(xi, xj) = φ(xi)Tφ(xj)

A kernel function is a function that corresponds to an inner product in some expanded feature space.

Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)².

Need to show that K(xi, xj) = φ(xi)Tφ(xj):

K(xi, xj) = (1 + xiTxj)²
= 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
= [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]T [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
= φ(xi)Tφ(xj), where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
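The algebra above can be checked numerically. This small sketch (my own) evaluates the quadratic kernel both ways for random 2-D vectors:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(xi, xj) = (1 + xi.xj)^2 in two dimensions."""
    x1, x2 = x
    r2 = np.sqrt(2)
    return np.array([1.0, x1**2, r2 * x1 * x2, x2**2, r2 * x1, r2 * x2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)
print((1 + xi @ xj) ** 2)    # kernel computed directly in the input space
print(phi(xi) @ phi(xj))     # inner product in the feature space: identical
```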

Page 40

Examples of Kernel Functions

Linear: K(xi, xj) = xiTxj

Polynomial of power p: K(xi, xj) = (1 + xiTxj)^p

Gaussian (radial-basis function network): K(xi, xj) = exp(−||xi − xj||² / (2σ²))

Sigmoid: K(xi, xj) = tanh(β0 xiTxj + β1)

Page 41

Non-linear SVMs Mathematically

Dual problem formulation:
Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαjyiyj K(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

The solution is:
f(x) = Σ αiyi K(xi, x) + b

Optimization techniques for finding the αi's remain the same!
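As a final hedged sketch (assuming scikit-learn; the deck itself points to the MATLAB toolboxes on its last slide), a Gaussian-kernel SVM separates concentric circles that no linear classifier could:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)                  # Gaussian kernel
print(f"linear kernel accuracy: {linear.score(X, y):.2f}")    # roughly chance
print(f"RBF kernel accuracy:    {rbf.score(X, y):.2f}")       # ~1.0
```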

Page 42

Nonlinear SVM - Overview

An SVM locates a separating hyperplane in the feature space and classifies points in that space. It does not need to represent the space explicitly; it simply defines a kernel function. The kernel function plays the role of the dot product in the feature space.

Page 43

SVM Applications

SVMs have been used successfully in many real-world problems:
- text (and hypertext) categorization
- image classification
- bioinformatics (protein classification, cancer classification)
- hand-written character recognition

Page 44

Useful Toolboxes

- PRtools
- STPRtool
- OSU-SVM
- SVMlight
- MATLABArsenal
- DENFIS
- …