A Review of Our Course: Classification and Regression. The Perceptron Algorithm: Primal vs. Dual Form


Page 1

A Review of Our Course: Classification and Regression

The Perceptron Algorithm: Primal vs. Dual Form

An online, mistake-driven procedure

Update the weight vector and bias when there is a misclassified point

Converges when the problem is linearly separable
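
To make the primal/dual contrast concrete, here is a minimal sketch (not from the slides, assuming labels $y_i \in \{-1, +1\}$ and a linearly separable set): the primal form updates $(w, b)$ directly, while the dual form only counts mistakes per training point and touches the data through inner products.

```python
# Illustrative sketch of the primal and dual perceptron (assumed toy setting).
import numpy as np

def perceptron_primal(X, y, epochs=100, eta=1.0):
    """Primal form: maintain (w, b) directly; update on each mistake."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:      # misclassified point
                w += eta * yi * xi          # update weight vector
                b += eta * yi               # update bias
                mistakes += 1
        if mistakes == 0:                   # converged (separable case)
            break
    return w, b

def perceptron_dual(X, y, epochs=100):
    """Dual form: keep a mistake counter alpha_i per point; w is implicitly
    sum_i alpha_i * y_i * x_i, so only inner products of the data are needed."""
    n = len(X)
    alpha = np.zeros(n)
    b = 0.0
    G = X @ X.T                             # Gram matrix of inner products
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1.0             # count the mistake on point i
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b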

Page 2

Classification Problem: 2-Category Linearly Separable Case

[Figure: the two classes $A^+$ (Benign) and $A^-$ (Malignant) separated by the plane $x'w + b = 0$, with bounding planes $x'w + b = +1$ and $x'w + b = -1$ and normal vector $w$.]

Page 3

Algebra of the Classification Problem (Linearly Separable Case)

Given $\ell$ points in the $n$-dimensional real space $R^n$, represented by an $\ell \times n$ matrix $A$.

Membership of each point $A_i$ in the classes $A^-$, $A^+$ is specified by an $\ell \times \ell$ diagonal matrix $D$: $D_{ii} = -1$ if $A_i \in A^-$ and $D_{ii} = +1$ if $A_i \in A^+$.

Separate $A^-$ and $A^+$ by two bounding planes such that:
$A_i w + b \ge +1$ for $D_{ii} = +1$; $A_i w + b \le -1$ for $D_{ii} = -1$

More succinctly: $D(Aw + eb) \ge e$, where $e = [1, 1, \ldots, 1]' \in R^\ell$.

Page 4

Robust Linear Programming (Preliminary Approach to SVM)

$\min_{w,b,\xi}\ e'\xi$
s.t. $D(Aw + eb) + \xi \ge e,\ \xi \ge 0$  (LP)

where $\xi$ is a nonnegative slack (error) vector. The term $e'\xi$, the 1-norm of the error vector, is called the training error.

For the linearly separable case, at a solution of (LP): $\xi = 0$.
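
As an illustration only, the LP above can be handed to a generic solver; the toy data and the stacking of the variables as $z = (w, b, \xi)$ below are assumptions made for the example.

```python
# Illustrative only: solving the robust LP with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

A = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # l x n point matrix
d = np.array([1.0, 1.0, -1.0, -1.0])                             # class labels (+1 / -1)
l, n = A.shape
D = np.diag(d)
e = np.ones(l)

# Objective: minimize e' * xi  (w and b have zero cost)
c = np.concatenate([np.zeros(n + 1), e])

# Constraint D(Aw + e b) + xi >= e  rewritten as  -D A w - D e b - xi <= -e
A_ub = np.hstack([-D @ A, -(D @ e).reshape(-1, 1), -np.eye(l)])
b_ub = -e

bounds = [(None, None)] * (n + 1) + [(0, None)] * l               # only xi >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, b, xi = res.x[:n], res.x[n], res.x[n + 1:]
print("training error e'xi =", res.fun)       # ~0 since the toy data are separable
```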

Page 5

Support Vector Machines: Maximizing the Margin between Bounding Planes

[Figure: classes $A^+$ and $A^-$ with bounding planes $x'w + b = +1$ and $x'w + b = -1$, normal vector $w$, and margin $\frac{2}{\|w\|_2}$ between the bounding planes.]

Page 6

Support Vector Classification (Linearly Separable Case, Primal)

The hyperplane $(w, b)$ that solves the minimization problem

$\min_{(w,b)\in R^{n+1}}\ \frac{1}{2}\|w\|_2^2$
s.t. $D(Aw + eb) \ge e$

realizes the maximal-margin hyperplane with geometric margin $\gamma = \frac{1}{\|w\|_2}$.
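
A minimal sketch of this primal problem, using cvxpy as an assumed QP solver on the same toy data as before; any quadratic programming solver would do.

```python
# Illustrative hard-margin SVM primal (toy data reused from earlier).
import numpy as np
import cvxpy as cp

A = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
n = A.shape[1]

w = cp.Variable(n)
b = cp.Variable()
constraints = [cp.multiply(d, A @ w + b) >= 1]          # D(Aw + e b) >= e
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

print("w =", w.value, " b =", b.value)
print("geometric margin 1/||w|| =", 1.0 / np.linalg.norm(w.value))
```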

Page 7

Soft Margin SVM (Nonseparable Case)

If the data are not linearly separable, the primal problem is infeasible and the dual problem is unbounded above.

Introduce a slack variable $\xi_i$ for each training point:
$y_i(w'x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0\ \ \forall i$

The inequality system is always feasible, e.g. $w = 0,\ b = 0,\ \xi = e$.

Page 8

[Figure: a nonseparable data set; points marked x and o on either side of the margin of width $\gamma$, with slack values $\xi_i$, $\xi_j$ for the points that violate their bounding plane.]

Page 9

Two Different Measures of Training Error

2-Norm Soft Margin:
$\min_{(w,b,\xi)\in R^{n+1+\ell}}\ \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\|\xi\|_2^2$
s.t. $D(Aw + eb) + \xi \ge e$

1-Norm Soft Margin:
$\min_{(w,b,\xi)\in R^{n+1+\ell}}\ \frac{1}{2}\|w\|_2^2 + C e'\xi$
s.t. $D(Aw + eb) + \xi \ge e,\ \xi \ge 0$

Page 10

Optimization Problem Formulation

Problem setting: Given functions $f,\ g_i,\ i = 1,\ldots,k$ and $h_j,\ j = 1,\ldots,m$, defined on a domain $\Omega \subseteq R^n$:

$\min_{x\in\Omega}\ f(x)$
subject to $g_i(x) \le 0\ \forall i$, $h_j(x) = 0\ \forall j$

where $f(x)$ is called the objective function and $g(x) \le 0$, $h(x) = 0$ are called constraints.
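
For illustration, this generic formulation can be instantiated with made-up $f$, $g$, $h$ and handed to scipy.optimize.minimize; note SciPy's 'ineq' convention is $fun(x) \ge 0$, so $g(x) \le 0$ is passed as $-g(x)$.

```python
# Illustrative only: generic constrained minimization with made-up f, g, h.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2) ** 2 + (x[1] + 1) ** 2       # objective
g = lambda x: x[0] + x[1] - 1                          # inequality: g(x) <= 0
h = lambda x: x[0] - 2 * x[1]                          # equality:   h(x) = 0

constraints = [
    {"type": "ineq", "fun": lambda x: -g(x)},          # SciPy wants fun(x) >= 0
    {"type": "eq", "fun": h},
]
res = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print("x* =", res.x, " f(x*) =", res.fun)
```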

Page 11

Definitions and Notation

Feasible region:
$F = \{x \in \Omega \mid g(x) \le 0,\ h(x) = 0\}$
where $g(x) = [g_1(x), \ldots, g_k(x)]'$ and $h(x) = [h_1(x), \ldots, h_m(x)]'$

A solution of the optimization problem is a point $x^* \in F$ such that there is no $x \in F$ for which $f(x) < f(x^*)$; $x^*$ is called a global minimum.

Page 12

Definitions and Notation

A point $\bar{x} \in F$ is called a local minimum of the optimization problem if $\exists\, \varepsilon > 0$ such that $f(x) \ge f(\bar{x})\ \forall x \in F$ with $\|x - \bar{x}\| < \varepsilon$.

At the solution $x^*$, an inequality constraint $g_i(x)$ is said to be active if $g_i(x^*) = 0$; otherwise it is called an inactive constraint.

$g_i(x) \le 0 \iff g_i(x) + \xi_i = 0,\ \xi_i \ge 0$, where $\xi_i$ is called the slack variable.

Page 13

Definitions and Notation

Removing an inactive constraint from an optimization problem will NOT affect the optimal solution. This is a very useful feature in SVM.

If $F = R^n$, the problem is called an unconstrained minimization problem. The SSVM formulation is in this category, and so is the least-squares problem. It is difficult to find the global minimum without a convexity assumption.

Page 14

Gradient and Hessian

Let $f : R^n \to R$ be a differentiable function. The gradient of $f$ at a point $x \in R^n$ is defined as
$\nabla f(x) = \left[\frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_n}\right] \in R^n$

If $f : R^n \to R$ is twice differentiable, the Hessian matrix of $f$ at a point $x \in R^n$ is defined as
$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \in R^{n \times n}$
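
As a quick sanity check (not part of the slides), the gradient and Hessian of a quadratic $f(x) = \frac{1}{2}x'Qx + p'x$ can be approximated by finite differences and compared with the exact values $Qx + p$ and $Q$.

```python
# Illustrative finite-difference gradient and Hessian for a made-up quadratic.
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])
p = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ Q @ x + p @ x

def num_gradient(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)          # central difference
    return g

def num_hessian(f, x, h=1e-4):
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n); ei[i] = h
        for j in range(n):
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

x = np.array([0.3, -0.7])
print(num_gradient(f, x), "vs exact", Q @ x + p)
print(num_hessian(f, x))                                # approximately Q
```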

Page 15

The Most Important Concept in Optimization (Minimization)

A point is said to be an optimal solution of an unconstrained minimization problem if there exists no descent direction.

A point is said to be an optimal solution of a constrained minimization problem if there exists no feasible descent direction. A descent direction might exist, but moving along it would leave the feasible region.

Page 16

Two Important Algorithms for the Unconstrained Minimization Problem

Steepest descent with exact line search

Newton's method
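
A small sketch of both algorithms on the quadratic from the previous example; for a quadratic, the exact line-search step has the closed form $t = g'g / (g'Qg)$ and Newton's method converges in a single step. The data are made up for illustration.

```python
# Illustrative steepest descent (exact line search) and Newton's method.
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])
p = np.array([1.0, -1.0])
grad = lambda x: Q @ x + p                       # gradient of 1/2 x'Qx + p'x

def steepest_descent(x, iters=50, tol=1e-10):
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        t = (g @ g) / (g @ Q @ g)                # exact line-search step length
        x = x - t * g
    return x

def newton(x):
    return x - np.linalg.solve(Q, grad(x))       # Hessian is the constant Q

x0 = np.array([5.0, 5.0])
print("steepest descent:", steepest_descent(x0))
print("Newton:          ", newton(x0))
print("exact solution:  ", np.linalg.solve(Q, -p))
```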

Page 17

Linear Program and Quadratic Program

An optimization problem in which the objective function and all constraints are linear functions is called a linear programming problem.

If the objective function is convex quadratic while the constraints are all linear, the problem is called a convex quadratic programming problem.

The standard SVM formulation is in this category (a convex QP); the $\|\cdot\|_1$-SVM formulation is a linear programming problem.

Page 18

Lagrangian Dual Problem

$\max_{\alpha,\beta}\ \min_{x\in\Omega}\ L(x, \alpha, \beta)$
subject to $\alpha \ge 0$

where $L(x, \alpha, \beta) = f(x) + \alpha' g(x) + \beta' h(x)$ is the Lagrangian of the problem on Page 10.

Page 19

Lagrangian Dual Problem

$\max_{\alpha,\beta}\ \min_{x\in\Omega}\ L(x, \alpha, \beta)$
subject to $\alpha \ge 0$

Equivalently:
$\max_{\alpha,\beta}\ \theta(\alpha, \beta)$
subject to $\alpha \ge 0$, where $\theta(\alpha, \beta) = \inf_{x\in\Omega} L(x, \alpha, \beta)$

Page 20

Weak Duality Theorem

Let $\bar{x} \in \Omega$ be a feasible solution of the primal problem and $(\alpha, \beta)$ a feasible solution of the dual problem. Then $f(\bar{x}) \ge \theta(\alpha, \beta)$.

Corollary: $\sup\{\theta(\alpha, \beta) \mid \alpha \ge 0\} \le \inf\{f(x) \mid g(x) \le 0,\ h(x) = 0\}$

This follows since $\theta(\alpha, \beta) = \inf_{x\in\Omega} L(x, \alpha, \beta) \le L(\bar{x}, \alpha, \beta) \le f(\bar{x})$, the last step using $g(\bar{x}) \le 0$, $h(\bar{x}) = 0$ and $\alpha \ge 0$.

Page 21

Saddle Point of the Lagrangian

Let $x^* \in \Omega$, $\alpha^* \ge 0$, $\beta^* \in R^m$ satisfy
$L(x^*, \alpha, \beta) \le L(x^*, \alpha^*, \beta^*) \le L(x, \alpha^*, \beta^*), \quad \forall\, x \in \Omega,\ \alpha \ge 0.$
Then $(x^*, \alpha^*, \beta^*)$ is called a saddle point of the Lagrangian function.

Page 22

Dual Problem of a Linear Program

Primal LP:
$\min_{x\in R^n}\ p'x$
subject to $Ax \ge b,\ x \ge 0$

Dual LP:
$\max_{\alpha\in R^m}\ b'\alpha$
subject to $A'\alpha \le p,\ \alpha \ge 0$

※ All duality theorems hold and work perfectly!
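
An illustrative check with made-up data: solve the primal and dual LPs with scipy.optimize.linprog and observe that the optimal values coincide (strong duality for feasible, bounded LPs).

```python
# Illustrative LP duality check on a small made-up problem.
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([4.0, 6.0])
p = np.array([2.0, 3.0])

# Primal: min p'x  s.t. Ax >= b, x >= 0   (linprog uses A_ub x <= b_ub)
primal = linprog(p, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 2)

# Dual: max b'alpha  s.t. A'alpha <= p, alpha >= 0  (maximize => minimize -b'alpha)
dual = linprog(-b, A_ub=A.T, b_ub=p, bounds=[(0, None)] * 2)

print("primal optimum:", primal.fun)
print("dual optimum:  ", -dual.fun)       # equal to the primal optimum
```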

Page 23

Dual Problem of a Strictly Convex Quadratic Program

Primal QP:
$\min_{x\in R^n}\ \frac{1}{2}x'Qx + p'x$
subject to $Ax \le b$

With the strict convexity assumption, we have the

Dual QP:
$\max\ -\frac{1}{2}(p' + \alpha'A)Q^{-1}(A'\alpha + p) - \alpha'b$
subject to $\alpha \ge 0$
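
The same kind of check can be done for this QP pair; the data below are made up, and cvxpy is only one possible solver. The primal minimizer can also be recovered from the dual solution as $x = -Q^{-1}(p + A'\alpha)$.

```python
# Illustrative verification that the primal and dual QPs attain the same value.
import numpy as np
import cvxpy as cp

Q = np.array([[2.0, 0.0], [0.0, 4.0]])
p = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0], [-1.0, 2.0]])
b = np.array([1.0, 2.0])
Qinv = np.linalg.inv(Q)

# Primal: min 1/2 x'Qx + p'x  s.t. Ax <= b
x = cp.Variable(2)
primal = cp.Problem(cp.Minimize(0.5 * cp.quad_form(x, Q) + p @ x), [A @ x <= b])
primal.solve()

# Dual: max -1/2 (p + A'a)' Q^{-1} (A'a + p) - a'b  s.t. a >= 0
a = cp.Variable(2)
dual = cp.Problem(cp.Maximize(-0.5 * cp.quad_form(p + A.T @ a, Qinv) - b @ a),
                  [a >= 0])
dual.solve()

print("primal optimum:", primal.value)
print("dual optimum:  ", dual.value)                       # equal (strong duality)
print("x* =", x.value, " vs -Q^{-1}(p + A'a*) =", -Qinv @ (p + A.T @ a.value))
```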


Page 25

Support Vector Classification (Linearly Separable Case, Dual Form)

The dual problem of the previous mathematical program:
$\max_{\alpha\in R^\ell}\ e'\alpha - \frac{1}{2}\alpha'DAA'D\alpha$
subject to $e'D\alpha = 0,\ \alpha \ge 0$

Applying the KKT optimality conditions, we have $w = A'D\alpha$. But where is $b$?

Don't forget the complementarity condition: $0 \le \alpha \perp D(Aw + eb) - e \ge 0$
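
A minimal sketch (reusing the earlier toy data, with cvxpy again assumed as the solver): note that $\alpha'DAA'D\alpha = \|A'D\alpha\|_2^2$, solve the dual, then recover $w = A'D\alpha$ and obtain $b$ from a point with $\alpha_i > 0$, for which complementarity forces $D_{ii}(A_i w + b) = 1$.

```python
# Illustrative dual SVM (linearly separable toy data).
import numpy as np
import cvxpy as cp

A = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
l = len(d)
D = np.diag(d)
e = np.ones(l)

alpha = cp.Variable(l)
objective = cp.Maximize(e @ alpha - 0.5 * cp.sum_squares(A.T @ (D @ alpha)))
prob = cp.Problem(objective, [d @ alpha == 0, alpha >= 0])   # e'D alpha = d'alpha
prob.solve()

a = alpha.value
w = A.T @ D @ a                          # KKT: w = A'D alpha
sv = int(np.argmax(a))                   # an index with alpha_i > 0
b = d[sv] - A[sv] @ w                    # active constraint: D_ii(A_i w + b) = 1
print("alpha =", a, "\nw =", w, " b =", b)
```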

Page 26

Dual Representation of SVM

(Key of Kernel Methods)

The hypothesis is determined by $(\alpha^*, b^*)$:
$h(x) = \mathrm{sgn}(\langle x, A'D\alpha^* \rangle + b^*) = \mathrm{sgn}\Big(\sum_{i=1}^{\ell} y_i \alpha^*_i \langle x_i, x \rangle + b^*\Big) = \mathrm{sgn}\Big(\sum_{\alpha^*_i > 0} y_i \alpha^*_i \langle x_i, x \rangle + b^*\Big)$

$w = A'D\alpha^* = \sum_{i=1}^{\ell} y_i \alpha^*_i A_i'$

Remember: $A_i' = x_i$

Page 27

Learning in Feature Space (Could Simplify the Classification Task)

Learning in a high-dimensional space could degrade generalization performance; this phenomenon is called the curse of dimensionality.

By using a kernel function, which represents the inner product of training examples in the feature space, we never need to know the nonlinear map explicitly, or even the dimensionality of the feature space.

There is no free lunch: we must deal with a huge and dense kernel matrix. A reduced kernel can avoid this difficulty.

Page 28

[Figure: points of the input space $X$ mapped by $\phi(\cdot)$ into the feature space $F$.]

Page 29

Kernel Technique: Based on Mercer's Condition (1909)

The value of the kernel function represents the inner product in the feature space.

Kernel functions merge two steps: 1. map the input data from the input space to the feature space (which might be infinite-dimensional); 2. compute the inner product in the feature space.

Page 30

Linear Machine in Feature Space

Let $\phi : X \to F$ be a nonlinear map from the input space to some feature space.

The classifier will be in the form (primal):
$f(x) = \Big(\sum_{j=1}^{?} w_j \phi_j(x)\Big) + b$

Make it in the dual form:
$f(x) = \Big(\sum_{i=1}^{\ell} \alpha_i y_i \langle \phi(x_i) \cdot \phi(x) \rangle\Big) + b$

Page 31

Kernel: Represent the Inner Product in Feature Space

Definition: A kernel is a function $K : X \times X \to R$ such that for all $x, z \in X$
$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle$, where $\phi : X \to F$.

The classifier will become:
$f(x) = \Big(\sum_{i=1}^{\ell} \alpha_i y_i K(x_i, x)\Big) + b$

Page 32

A Simple Example of a Kernel

Polynomial kernel of degree 2: $K(x, z) = \langle x, z \rangle^2$

Let $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},\ z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \in R^2$ and let the nonlinear map $\phi : R^2 \to R^3$ be defined by $\phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{bmatrix}$.

Then $\langle \phi(x), \phi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.

There are many other nonlinear maps $\psi(x)$ that satisfy the relation $\langle \psi(x), \psi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.
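
A quick numerical check of this identity (illustrative only):

```python
# Verify <phi(x), phi(z)> == <x, z>^2 for the explicit degree-2 feature map.
import numpy as np

def phi(x):
    # Explicit feature map R^2 -> R^3
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def K(x, z):
    # Polynomial kernel of degree 2
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z), "==", K(x, z))    # both equal <x, z>^2 = 1.0
```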

Page 33

Power of the Kernel Technique

Consider a nonlinear map $\phi : R^n \to R^p$ that consists of distinct features of all the monomials of degree $d$. Then $p = \binom{n+d-1}{d}$.

For example: $n = 11,\ d = 10,\ p = 92378$.

Is it necessary? We only need to know $\langle \phi(x), \phi(z) \rangle$! This can be achieved with $K(x, z) = \langle x, z \rangle^d$.

Page 34

2-Norm Soft Margin Dual Formulation

The Lagrangian for the 2-norm soft margin:
$L(w, b, \xi, \alpha) = \frac{1}{2}w'w + \frac{C}{2}\xi'\xi + \alpha'[e - D(Aw + eb) - \xi]$
where $\alpha \ge 0$.

Setting the partial derivatives with respect to the primal variables equal to zero:
$\frac{\partial L}{\partial w} = w - A'D\alpha = 0,\quad \frac{\partial L}{\partial b} = e'D\alpha = 0,\quad \frac{\partial L}{\partial \xi} = C\xi - \alpha = 0$

Page 35

Dual Maximization Problem for the 2-Norm Soft Margin

Dual:
$\max_{\alpha\in R^\ell}\ e'\alpha - \frac{1}{2}\alpha'D\left(AA' + \frac{I}{C}\right)D\alpha$
subject to $e'D\alpha = 0,\ \alpha \ge 0$

The corresponding KKT complementarity condition:
$0 \le \alpha \perp D(Aw + eb) + \xi - e \ge 0$

Use the above conditions to find $b^*$.

Page 36

Introduce the Kernel in the Dual Formulation for the 2-Norm Soft Margin

$\max_{\alpha\in R^\ell}\ e'\alpha - \frac{1}{2}\alpha'D\left(K(A, A') + \frac{I}{C}\right)D\alpha$
subject to $e'D\alpha = 0,\ \alpha \ge 0$

The feature space is implicitly defined by $K(x, z)$. Suppose $\alpha^*$ solves the QP problem; then the decision rule is defined by
$h(x) = \mathrm{sgn}(K(x, A')D\alpha^* + b^*)$

Use the KKT conditions (next page) to find $b^*$.

Page 37

Introduce the Kernel in the Dual Formulation for the 2-Norm Soft Margin

$b^*$ is chosen so that
$y_i\left[K(A_i, A')D\alpha^* + b^*\right] = 1 - \frac{\alpha^*_i}{C}$ for any $i$ with $\alpha^*_i \ne 0$

Because:
$0 \le \alpha^* \perp D(K(A, A')D\alpha^* + eb^*) + \xi^* - e \ge 0$ and $\alpha^* = C\xi^*$

Page 38

Sequential Minimal Optimization (SMO)

Deals with the equality constraint and the box constraints of the dual problem.

Works on the smallest possible working set: only two $\alpha$'s.

Finds the optimal solution by changing only the $\alpha$ values in the working set.

The subproblem solution can be written analytically, which is the best feature of SMO.

Page 39

Analytical Solution for Two Points

Suppose that we change $\alpha_1$ and $\alpha_2$. In order to keep the equality constraint, we have to change the two $\alpha$ values such that
$\alpha_1 y_1 + \alpha_2 y_2 = \alpha_1^{old} y_1 + \alpha_2^{old} y_2$

The new $\alpha$ values also have to satisfy the box constraints, so there is a further restriction on how $\alpha$ can change.

Page 40

A Restrictive Constraint on the New $\alpha$

Suppose that we change $\alpha_1$ and $\alpha_2$. Once we have $\alpha_2^{new}$, we can get $\alpha_1^{new}$.

A restrictive constraint: $U \le \alpha_2^{new} \le V$, where

$U = \max(0,\ \alpha_2^{old} - \alpha_1^{old}),\quad V = \min(C,\ C - \alpha_1^{old} + \alpha_2^{old})$ if $y_1 \ne y_2$

$U = \max(0,\ \alpha_1^{old} + \alpha_2^{old} - C),\quad V = \min(C,\ \alpha_1^{old} + \alpha_2^{old})$ if $y_1 = y_2$
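
A minimal sketch of just this clipping step (not the full SMO algorithm): given a proposed unclipped $\alpha_2$, clip it to $[U, V]$ and recover $\alpha_1$ from the equality constraint.

```python
# Illustrative SMO box clipping for a two-point working set.
def smo_clip(alpha1_old, alpha2_old, alpha2_unclipped, y1, y2, C):
    if y1 != y2:
        U = max(0.0, alpha2_old - alpha1_old)
        V = min(C, C - alpha1_old + alpha2_old)
    else:
        U = max(0.0, alpha1_old + alpha2_old - C)
        V = min(C, alpha1_old + alpha2_old)
    alpha2_new = min(max(alpha2_unclipped, U), V)          # clip to [U, V]
    # keep alpha_1*y1 + alpha_2*y2 unchanged
    alpha1_new = alpha1_old + y1 * y2 * (alpha2_old - alpha2_new)
    return alpha1_new, alpha2_new

print(smo_clip(0.2, 0.8, 1.5, y1=1, y2=-1, C=1.0))         # -> (0.4, 1.0)
```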

Page 41

$\varepsilon$-Support Vector Regression (Linear Case: $f(x) = x'w + b$)

Given the training set:
$S = \{(x_i, y_i) \mid x_i \in R^n,\ y_i \in R,\ i = 1, \ldots, m\}$
represented by an $m \times n$ matrix $A$ and a vector $y \in R^m$.

Motivated by SVM: $\|w\|_2$ should be as small as possible, and some tiny errors should be discarded.

Try to find $(w, b)$ such that $y \approx Aw + eb$, that is, $y_i \approx w'x_i + b,\ i = 1, \ldots, m$, where $e = [1, \ldots, 1]' \in R^m$.

Page 42

$\varepsilon$-Insensitive Loss Function (Tiny Errors Should Be Discarded)

The $\varepsilon$-insensitive loss function
$|y_i - f(x_i)|_\varepsilon = \max\{0,\ |y_i - f(x_i)| - \varepsilon\}$
is the loss made by the estimation function $f$ at the data point $(x_i, y_i)$:
$|\xi|_\varepsilon = \max\{0,\ |\xi| - \varepsilon\} = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise} \end{cases}$

If $\xi \in R^n$, then $|\xi|_\varepsilon \in R^n$ is defined componentwise: $(|\xi|_\varepsilon)_i = |\xi_i|_\varepsilon,\ i = 1, \ldots, n$.
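
A one-line implementation of this loss, applied componentwise to a vector of residuals (illustrative only):

```python
# Componentwise epsilon-insensitive loss on a vector of residuals.
import numpy as np

def eps_insensitive(residual, eps):
    """|r|_eps = max(0, |r| - eps), applied elementwise."""
    return np.maximum(0.0, np.abs(residual) - eps)

r = np.array([0.05, -0.3, 1.2, -0.08])
print(eps_insensitive(r, eps=0.1))     # [0.  0.2 1.1 0. ]  small errors discarded
```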

Page 43

$\varepsilon$-Insensitive Linear Regression

$f(x) = x'w + b$

[Figure: data points around the regression line with an $\varepsilon$-tube; points outside the tube incur errors $y_j - f(x_j) - \varepsilon$ and $f(x_k) - y_k - \varepsilon$.]

Find $(w, b)$ with the smallest overall error.

Page 44

Five Popular Loss Functions

Page 45

$\varepsilon$-Insensitive Loss Regression

Linear $\varepsilon$-insensitive loss function:
$L^\varepsilon(x, y, f) = |y - f(x)|_\varepsilon = \max(0,\ |y - f(x)| - \varepsilon)$
where $x \in R^n$, $y \in R$ and $f$ is a real-valued function.

Quadratic $\varepsilon$-insensitive loss function:
$L^\varepsilon_2(x, y, f) = |y - f(x)|^2_\varepsilon$

Page 46

$\varepsilon$-Insensitive Support Vector Regression Model

Motivated by SVM: $\|w\|_2$ should be as small as possible, and some tiny errors should be discarded.

$\min_{(w,b,\xi)\in R^{n+1+m}}\ \frac{1}{2}\|w\|_2^2 + C e'|\xi|_\varepsilon$

where $|\xi|_\varepsilon \in R^m$ and $(|\xi|_\varepsilon)_i = \max(0,\ |A_i w + b - y_i| - \varepsilon)$.

Page 47

Why Minimize $\|w\|_2$? Probably Approximately Correct (pac)

Consider performing linear regression for any training data distribution $D$ with
$\max_{1 \le i \le m} \|(x_i, y_i)\| \le R$, $0 < \delta < 1$ and $c > 0$; then

$\Pr_D\!\left(err(f) > \frac{c}{m}\left(\frac{\|w\|_2^2 R^2 + SSE}{\varepsilon^2}\log^2 m + \log\frac{1}{\delta}\right)\right) < \delta$

equivalently,

$\Pr_D\!\left(err(f) \le \frac{c}{m}\left(\frac{\|w\|_2^2 R^2 + SSE}{\varepsilon^2}\log^2 m + \log\frac{1}{\delta}\right)\right) \ge 1 - \delta$

Occam's razor: the simplest is the best.

Page 48

Reformulating $\varepsilon$-SVR as a Constrained Minimization Problem

$\min_{(w,b,\xi,\xi^*)\in R^{n+1+2m}}\ \frac{1}{2}w'w + C e'(\xi + \xi^*)$

subject to
$y - Aw - eb \le e\varepsilon + \xi$
$Aw + eb - y \le e\varepsilon + \xi^*$
$\xi, \xi^* \ge 0$

This is a minimization problem with $n+1+2m$ variables and $2m$ constraints, which enlarges the problem size and the computational complexity of solving it.
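
For illustration, this constrained formulation can be solved directly; the small 1-D data set below is made up and cvxpy is assumed as the solver.

```python
# Illustrative epsilon-SVR in its constrained primal form (toy 1-D data).
import numpy as np
import cvxpy as cp

A = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])     # m x n data matrix
y = np.array([0.1, 1.2, 1.9, 3.2, 3.9])               # targets
m, n = A.shape
e = np.ones(m)
C, eps = 10.0, 0.2

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(m, nonneg=True)
xis = cp.Variable(m, nonneg=True)

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * (e @ (xi + xis)))
constraints = [y - A @ w - b <= eps + xi,
               A @ w + b - y <= eps + xis]
cp.Problem(objective, constraints).solve()
print("w =", w.value, " b =", b.value)                 # roughly slope 1, intercept 0
```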

Page 49

SV Regression by Minimizing the Quadratic $\varepsilon$-Insensitive Loss

We have the following problem:
$\min_{(w,b,\xi)\in R^{n+1+\ell}}\ \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\,\big\|\,|\xi|_\varepsilon\big\|_2^2$
where $(|\xi|_\varepsilon)_i = |y_i - (w'x_i + b)|_\varepsilon$.

Page 50

Primal Formulation of SVR for the Quadratic $\varepsilon$-Insensitive Loss

$\min_{(w,b,\xi^+,\xi^-)\in R^{n+1+2\ell}}\ \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\left(\|\xi^+\|_2^2 + \|\xi^-\|_2^2\right)$

subject to
$-Aw - eb + y \le e\varepsilon + \xi^+$
$Aw + eb - y \le e\varepsilon + \xi^-$
$\xi^+, \xi^- \ge 0$

Extremely important: at the solution, $0 \le \xi^- \perp \xi^+ \ge 0$.

Page 51

Simplified Dual Formulation of SVR

$\max_{\alpha}\ y'\alpha - \varepsilon\|\alpha\|_1 - \frac{1}{2}\alpha'\left(AA' + \frac{I}{C}\right)\alpha$
subject to $e'\alpha = 0$

In the case $\varepsilon = 0$, the problem becomes least-squares linear regression with a weight-decay factor.

Page 52

Kernel in the Dual Formulation for SVR

$\max_{\alpha\in R^\ell}\ y'\alpha - \varepsilon\|\alpha\|_1 - \frac{1}{2}\alpha'\left(K(A, A') + \frac{I}{C}\right)\alpha$
subject to $e'\alpha = 0$

Suppose $\alpha^*$ solves the QP problem; then the regression function is defined by
$f(x) = K(x, A')\alpha^* + b^*$
where $b^*$ is chosen such that
$f(x_i) - y_i = -\varepsilon - \frac{\alpha^*_i}{C}$ for any $i$ with $\alpha^*_i > 0$.

Page 53

Probably Approximately Correct (pac) Learning Model

Key assumption: training and testing data are generated i.i.d. according to a fixed but unknown distribution $D$.

When we evaluate the "quality" of a hypothesis (classification function) $h \in H$, we should take the unknown distribution $D$ into account (i.e. the "average error" or "expected error" made by $h \in H$).

We call such a measure the risk functional and denote it
$err_D(h) = D\{(x, y) \in X \times \{1, -1\} \mid h(x) \ne y\}$

Page 54

Generalization Error of the pac Model

Let $S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ be a set of $\ell$ training examples chosen i.i.d. according to $D$.

Treat the generalization error $err_D(h_S)$ as a random variable depending on the random selection of $S$.

Find a bound on the tail of the distribution of the random variable $err_D(h_S)$ in the form $\varepsilon = \varepsilon(\ell, H, \delta)$, where $\varepsilon$ is a function of $\ell$, $H$ and $\delta$, and $1 - \delta$ is the confidence level of the error bound, which is given by the learner.

Page 55

Probably Approximately Correct

We assert:
$\Pr(\{err_D(h_S) > \varepsilon = \varepsilon(\ell, H, \delta)\}) < \delta$
or
$\Pr(\{err_D(h_S) \le \varepsilon = \varepsilon(\ell, H, \delta)\}) \ge 1 - \delta$

That is, with probability at least $1 - \delta$, the error made by the hypothesis $h_S$ will be less than the error bound $\varepsilon(\ell, H, \delta)$, which does not depend on the unknown distribution $D$.

Page 56

Find the Hypothesis with Minimum Expected Risk?

Let $S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\} \subset X \times \{-1, 1\}$ be the training examples chosen i.i.d. according to $D$ with probability density $p(x, y)$.

The expected misclassification error made by $h \in H$ is
$R[h] = \int_{X \times \{-1, 1\}} \frac{1}{2}\,|h(x) - y|\; dp(x, y)$

The ideal hypothesis $h^*_{opt}$ should have the smallest expected risk: $R[h^*_{opt}] \le R[h],\ \forall h \in H$.

Unrealistic!!!

Page 57

Empirical Risk Minimization (ERM)

($D$ and $p(x, y)$ are not needed.)

Replace the expected risk over $p(x, y)$ by an average over the training examples. The empirical risk is
$R_{emp}[h] = \frac{1}{\ell}\sum_{i=1}^{\ell} \frac{1}{2}\,|h(x_i) - y_i|$

Find the hypothesis $h^*_{emp}$ with the smallest empirical risk: $R_{emp}[h^*_{emp}] \le R_{emp}[h],\ \forall h \in H$.

Focusing only on the empirical risk will cause overfitting.

Page 58

Overfitting

Overfitting is a phenomenon in which the resulting function fits the training set too well but does not have good prediction performance on unseen data.

[Figure: Solid curve: $f(x) = 2x^2 - 5x + 5$. Dotted curve: a nonlinear regression that passes through all 8 points. Red dots: generated by $f(x)$ with random noise.]
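
The flavor of the figure can be reproduced with a short sketch (made-up noise level and interval): a degree-7 polynomial interpolates all 8 noisy points exactly but generalizes poorly, while a low-degree fit stays close to the true quadratic.

```python
# Illustrative overfitting demo: interpolation vs. a simple model.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2 * x ** 2 - 5 * x + 5

x = np.linspace(0.0, 3.0, 8)
y = f(x) + rng.normal(scale=0.5, size=x.shape)        # noisy samples of f

interp = np.polyfit(x, y, deg=7)                      # passes through all 8 points
quad = np.polyfit(x, y, deg=2)                        # simple model

x_test = np.linspace(0.0, 3.0, 200)
err_interp = np.mean((np.polyval(interp, x_test) - f(x_test)) ** 2)
err_quad = np.mean((np.polyval(quad, x_test) - f(x_test)) ** 2)
print("test MSE, degree 7:", err_interp)              # typically much larger
print("test MSE, degree 2:", err_quad)
```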

Page 59

Tuning Procedure

[Figure: model performance versus the tuning parameter, with the overfitting region marked.]

The final value of the parameter is the one with the maximum testing-set correctness!

Page 60

VC Confidence (The Bound between $R_{emp}[h]$ and $R[h]$)

The following inequality holds with probability $1 - \delta$:
$R[h] \le R_{emp}[h] + \sqrt{\dfrac{v\,(\log(2\ell/v) + 1) - \log(\delta/4)}{\ell}}$

where $v$ is the VC dimension of the hypothesis space.

C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2(2) (1998), pp. 121-167.
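
For illustration, the VC confidence term can be evaluated numerically; the VC dimension $v$ and confidence parameter $\delta$ below are made-up example values.

```python
# Evaluate the VC confidence term for a few sample sizes (illustrative values).
import numpy as np

def vc_confidence(l, v, delta):
    """sqrt( (v*(log(2l/v) + 1) - log(delta/4)) / l )"""
    return np.sqrt((v * (np.log(2 * l / v) + 1) - np.log(delta / 4)) / l)

v, delta = 3, 0.05                         # e.g. hyperplanes in R^2 have VC dim 3
for l in (100, 1000, 10000):
    print(l, vc_confidence(l, v, delta))   # the confidence term shrinks as l grows
```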

Page 61

Capacity (Complexity) of the Hypothesis Space $H$: VC-dimension

A given training set $S$ is shattered by $H$ if and only if for every labeling of $S$ there exists an $h \in H$ consistent with this labeling.

Example: three (linearly independent) points are shattered by hyperplanes in $R^2$.

Page 62

Shattering Points with Hyperplanes in $R^n$

Theorem: Consider some set of $m$ points in $R^n$, and choose any one of them as the origin. Then the $m$ points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.

Can you always shatter three points with a line in $R^2$?

Page 63

Definition of VC-dimension (A Capacity Measure of the Hypothesis Space $H$)

The Vapnik-Chervonenkis dimension, VC($H$), of a hypothesis space $H$ defined over the input space $X$ is the size of the largest finite subset of $X$ shattered by $H$.

If arbitrarily large finite subsets of $X$ can be shattered by $H$, then VC($H$) $\equiv \infty$.

Let $H = \{$all hyperplanes in $R^n\}$; then VC($H$) $= n + 1$.