A Review of Our Course: Classification and Regression
The Perceptron Algorithm: Primal vs. Dual Form
An on-line and mistake-driven procedure
Update the weight vector and bias when there is a misclassified point
Converges when the problem is linearly separable
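A minimal sketch of the primal update described above (the toy data and learning rate are chosen here for illustration, not taken from the course):

```python
import numpy as np

def perceptron_primal(X, y, eta=1.0, max_epochs=100):
    """On-line, mistake-driven perceptron: update (w, b) only on misclassified points."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:       # misclassified (or exactly on the boundary)
                w += eta * yi * xi           # primal update; the dual form instead stores
                b += eta * yi                # the mistake counts alpha_i made on each x_i
                mistakes += 1
        if mistakes == 0:                    # terminates when the data are linearly separable
            break
    return w, b

# toy linearly separable data, labels in {+1, -1}
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(perceptron_primal(X, y))
```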
Classification Problem: 2-Category Linearly Separable Case
[Figure: the two classes A+ (benign) and A- (malignant) separated by the bounding planes $x'w + b = +1$, $x'w + b = 0$, and $x'w + b = -1$.]
Algebra of the Classification Problem: Linearly Separable Case
Given $\ell$ points in the $n$-dimensional real space $R^n$, represented by an $\ell \times n$ matrix $A$. Membership of each point $A_i$ in the classes $A_-$, $A_+$ is specified by an $\ell \times \ell$ diagonal matrix $D$:
$D_{ii} = -1$ if $A_i \in A_-$ and $D_{ii} = +1$ if $A_i \in A_+$.
Separate $A_-$ and $A_+$ by two bounding planes such that:
$A_i w + b \ge +1$ for $D_{ii} = +1$; $A_i w + b \le -1$ for $D_{ii} = -1$.
More succinctly: $D(Aw + eb) \ge e$, where $e = [1, 1, \ldots, 1]' \in R^\ell$.
Robust Linear Programming: Preliminary Approach to SVM
Allow violations of the bounding-plane constraints with a nonnegative slack (error) vector $\xi$:
$D(Aw + eb) + \xi \ge e, \quad \xi \ge 0$
The term $e'\xi$, the 1-norm measure of the error vector, is called the training error. Minimizing it gives the linear program:
$\min_{w,b,\xi}\ e'\xi \quad \text{s.t.}\quad D(Aw + eb) + \xi \ge e,\ \xi \ge 0 \qquad \text{(LP)}$
For the linearly separable case, at the solution of (LP): $\xi = 0$.
Support Vector Machines: Maximizing the Margin between Bounding Planes
[Figure: the classes A+ and A- with the bounding planes $x'w + b = +1$ and $x'w + b = -1$; the normal vector $w$ and the margin $\frac{2}{\|w\|_2}$ between the planes.]
Support Vector Classification (Linearly Separable Case, Primal)
The hyperplane $(w, b)$ that solves the minimization problem
$\min_{(w,b) \in R^{n+1}}\ \frac{1}{2}\|w\|_2^2 \quad \text{s.t.}\ D(Aw + eb) \ge e$
realizes the maximal margin hyperplane with geometric margin $\gamma = \frac{1}{\|w\|_2}$.
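A sketch of solving this primal problem directly with the cvxpy modeling package (assumed available, together with a default QP solver); the toy data are illustrative:

```python
import numpy as np
import cvxpy as cp

# toy linearly separable data: rows of A are the points, y holds the labels
A = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
D = np.diag(y)
e = np.ones(len(y))

w = cp.Variable(2)
b = cp.Variable()
# min (1/2)||w||^2  s.t.  D(Aw + eb) >= e
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [D @ (A @ w + b * e) >= e])
prob.solve()
print("w =", w.value, ", b =", b.value)
print("geometric margin =", 1.0 / np.linalg.norm(w.value))
```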
Soft Margin SVM (Nonseparable Case)
If the data are not linearly separable, the primal problem is infeasible and the dual problem is unbounded above.
Introduce a slack variable $\xi_i$ for each training point:
$y_i(w'x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i$
The inequality system is then always feasible, e.g. $w = 0,\ b = 0,\ \xi = e$.
[Figure: nonseparable data with margin $\gamma$; points that violate their bounding plane have positive slacks $\xi_i$, $\xi_j$ measuring the size of the violation.]
Two Different Measures of Training Error
2-Norm Soft Margin:
$\min_{(w,b,\xi) \in R^{n+1+\ell}}\ \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\|\xi\|_2^2 \quad \text{s.t.}\ D(Aw + eb) + \xi \ge e$
1-Norm Soft Margin:
$\min_{(w,b,\xi) \in R^{n+1+\ell}}\ \frac{1}{2}\|w\|_2^2 + Ce'\xi \quad \text{s.t.}\ D(Aw + eb) + \xi \ge e,\ \xi \ge 0$
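The 1-norm soft margin (hinge loss on the slacks) is what a standard SVM solver optimizes; a usage sketch assuming scikit-learn is installed, with C trading off margin width against the training error $e'\xi$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two overlapping Gaussian blobs: not linearly separable
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
               rng.normal(-1.0, 1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# larger C penalizes slack more (harder margin), smaller C tolerates more violations
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("w =", clf.coef_, ", b =", clf.intercept_)
print("training accuracy:", clf.score(X, y))
```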
Optimization Problem Formulation
Problem setting: given functions $f,\ g_i,\ i = 1, \ldots, k$ and $h_j,\ j = 1, \ldots, m$, defined on a domain $\Omega \subseteq R^n$:
$\min_{x \in \Omega}\ f(x) \quad \text{subject to}\quad g_i(x) \le 0\ \forall i, \quad h_j(x) = 0\ \forall j$
where $f(x)$ is called the objective function and $g(x) \le 0$, $h(x) = 0$ are called constraints.
Definitions and Notation
Feasible region:
$\mathcal{F} = \{x \in \Omega \mid g(x) \le 0,\ h(x) = 0\}$
where $g(x) = [g_1(x), \ldots, g_k(x)]'$ and $h(x) = [h_1(x), \ldots, h_m(x)]'$.
A solution of the optimization problem is a point $x^* \in \mathcal{F}$ such that there is no $x \in \mathcal{F}$ for which $f(x) < f(x^*)$; such an $x^*$ is called a global minimum.
Definitions and Notation
A point $\bar{x} \in \mathcal{F}$ is called a local minimum of the optimization problem if $\exists\, \varepsilon > 0$ such that $f(x) \ge f(\bar{x})$ for all $x \in \mathcal{F}$ with $\|x - \bar{x}\| < \varepsilon$.
At the solution $x^*$, an inequality constraint $g_i(x)$ is said to be active if $g_i(x^*) = 0$; otherwise it is called an inactive constraint.
$g_i(x) \le 0 \Leftrightarrow g_i(x) + \xi_i = 0,\ \xi_i \ge 0$, where $\xi_i$ is called the slack variable.
Definitions and Notation
Removing an inactive constraint from an optimization problem will NOT affect the optimal solution; this is a very useful feature in SVM.
If $\mathcal{F} = R^n$, the problem is called an unconstrained minimization problem.
The SSVM formulation is in this category. Without a convexity assumption it is difficult to find the global minimum.
The least squares problem is in this category.
Gradient and Hessian
Let $f : R^n \to R$ be a differentiable function. The gradient of $f$ at a point $x \in R^n$ is defined as
$\nabla f(x) = \left[\frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_n}\right] \in R^n$
If $f : R^n \to R$ is a twice differentiable function, the Hessian matrix of $f$ at a point $x \in R^n$ is defined as
$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \in R^{n \times n}$
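To make the definitions concrete, a small numerical sketch (helper names and test data chosen here for illustration): approximate the gradient and Hessian by central differences and compare them with the analytic forms for $f(x) = \frac{1}{2}x'Qx + p'x$:

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    # central-difference approximation of the gradient
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def num_hess(f, x, h=1e-4):
    # central-difference approximation of the Hessian (column by column)
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = h
        H[:, i] = (num_grad(f, x + e) - num_grad(f, x - e)) / (2 * h)
    return H

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
p = np.array([-1.0, 4.0])
f = lambda x: 0.5 * x @ Q @ x + p @ x
x0 = np.array([0.5, -0.3])
print(num_grad(f, x0), "vs analytic", Q @ x0 + p)   # gradient of f is Qx + p
print(num_hess(f, x0), "vs analytic\n", Q)          # Hessian of f is Q
```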
The Most Important Concept in Optimization (minimization)
A point is said to be an optimal solution of an unconstrained minimization problem if there exists no descent direction.
A point is said to be an optimal solution of a constrained minimization problem if there exists no feasible descent direction.
There might exist a descent direction, but moving along it would leave the feasible region.
Two Important Algorithms for Unconstrained Minimization Problem
Steepest descent with exact line search
Newton’s method
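A minimal sketch of both methods on a strictly convex quadratic (the matrices below are illustrative); on a quadratic the exact line-search step length has a closed form, and Newton's method reaches the minimizer in one step:

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
p = np.array([-1.0, 4.0])
grad = lambda x: Q @ x + p               # gradient of f(x) = 0.5 x'Qx + p'x

def steepest_descent(x, tol=1e-8, max_iter=1000):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = (g @ g) / (g @ Q @ g)    # exact line search along -g for a quadratic
        x = x - alpha * g
    return x

def newton_step(x):
    # the Hessian is Q, so a single Newton step solves the problem
    return x - np.linalg.solve(Q, grad(x))

x0 = np.array([5.0, 5.0])
print("steepest descent :", steepest_descent(x0))
print("Newton's method  :", newton_step(x0))
print("analytic solution:", np.linalg.solve(Q, -p))
```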
Linear Program and Quadratic Program
An optimization problem in which the objective function and all constraints are linear functions is called a linear programming problem
If the objective function is convex quadratic while the constraints are all linear, then the problem is called a convex quadratic programming problem.
The standard SVM formulation is in this category.
The 1-norm ($\|\cdot\|_1$) SVM formulation is in the linear programming category.
Lagrangian Dual Problem
$\max_{\alpha,\mu}\ \min_{x \in \Omega}\ L(x, \alpha, \mu) \quad \text{subject to}\ \alpha \ge 0$
Equivalently,
$\max_{\alpha,\mu}\ \theta(\alpha, \mu) \quad \text{subject to}\ \alpha \ge 0$, where $\theta(\alpha, \mu) = \inf_{x \in \Omega} L(x, \alpha, \mu)$.
Weak Duality Theorem
Let $\bar{x} \in \Omega$ be a feasible solution of the primal problem and $(\alpha, \mu)$ a feasible solution of the dual problem. Then $f(\bar{x}) \ge \theta(\alpha, \mu)$.
Corollary: $\sup\{\theta(\alpha, \mu) \mid \alpha \ge 0\} \le \inf\{f(x) \mid g(x) \le 0,\ h(x) = 0\}$
Proof idea: $\theta(\alpha, \mu) = \inf_{x \in \Omega} L(x, \alpha, \mu) \le L(\bar{x}, \alpha, \mu)$.
Saddle Point of Lagrangian
Let $x^* \in \Omega$, $\alpha^* \ge 0$, $\mu^* \in R^m$ satisfy
$L(x^*, \alpha, \mu) \le L(x^*, \alpha^*, \mu^*) \le L(x, \alpha^*, \mu^*) \quad \forall x \in \Omega,\ \alpha \ge 0.$
Then $(x^*, \alpha^*, \mu^*)$ is called a saddle point of the Lagrangian function.
Dual Problem of Linear Program
Primal LP: $\min_{x \in R^n}\ p'x \quad \text{subject to}\ Ax \ge b,\ x \ge 0$
Dual LP: $\max_{\alpha \in R^m}\ b'\alpha \quad \text{subject to}\ A'\alpha \le p,\ \alpha \ge 0$
※ All duality theorems hold and work perfectly!
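A numerical check of this primal/dual pair (a sketch assuming SciPy's linprog is available; the data p, A, b are illustrative). For a feasible, bounded LP the two optimal values coincide:

```python
import numpy as np
from scipy.optimize import linprog

p = np.array([2.0, 3.0])
A = np.array([[1.0, 1.0], [1.0, 2.0]])
b = np.array([4.0, 6.0])

# Primal: min p'x  s.t.  Ax >= b, x >= 0   (rewrite Ax >= b as -Ax <= -b)
primal = linprog(c=p, A_ub=-A, b_ub=-b)          # default bounds already give x >= 0
# Dual:   max b'a  s.t.  A'a <= p, a >= 0   (maximize by negating the objective)
dual = linprog(c=-b, A_ub=A.T, b_ub=p)

print("primal optimum:", primal.fun)             # both print 10.0 here,
print("dual optimum:  ", -dual.fun)              # so strong duality holds
```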
Dual Problem of Strictly Convex Quadratic Program
Primal QP: $\min_{x \in R^n}\ \frac{1}{2}x'Qx + p'x \quad \text{subject to}\ Ax \le b$
With the strictly convex assumption ($Q \succ 0$), we have the
Dual QP: $\max_{\alpha}\ -\frac{1}{2}(p' + \alpha'A)Q^{-1}(A'\alpha + p) - \alpha'b \quad \text{subject to}\ \alpha \ge 0$
Support Vector Classification (Linearly Separable Case, Dual Form)
The dual problem of the previous mathematical program:
$\max_{\alpha \in R^\ell}\ e'\alpha - \frac{1}{2}\alpha'DAA'D\alpha \quad \text{subject to}\ e'D\alpha = 0,\ \alpha \ge 0.$
Applying the KKT optimality conditions, we have $w = A'D\alpha$. But where is $b$?
Don't forget the complementarity condition: $0 \le \alpha \perp D(Aw + eb) - e \ge 0$
Dual Representation of SVM
(Key of Kernel Methods)
$w = A'D\alpha^* = \sum_{i=1}^{\ell} y_i \alpha_i^* A_i'$. Remember: $A_i' = x_i$.
The hypothesis is determined by $(\alpha^*, b^*)$:
$h(x) = \text{sgn}(\langle x, A'D\alpha^* \rangle + b^*) = \text{sgn}\Big(\sum_{i=1}^{\ell} y_i \alpha_i^* \langle x_i, x \rangle + b^*\Big) = \text{sgn}\Big(\sum_{\alpha_i^* > 0} y_i \alpha_i^* \langle x_i, x \rangle + b^*\Big)$
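A usage sketch, assuming scikit-learn: for a linear kernel, SVC stores $y_i\alpha_i^*$ for the support vectors in dual_coef_, so the primal $w = A'D\alpha^*$ can be recovered directly from the dual representation:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2, 1, (30, 2)), rng.normal(-2, 1, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors only, so the sum
# w = A'D alpha reduces to a sum over the support vectors
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # True: primal w recovered from the dual
```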
Learning in Feature Space (Could Simplify the Classification Task)
Learning in a high-dimensional space could degrade generalization performance; this phenomenon is called the curse of dimensionality.
By using a kernel function, which represents the inner product of training examples in feature space, we never need to know the nonlinear map explicitly. We do not even need to know the dimensionality of the feature space.
There is no free lunch: we have to deal with a huge and dense kernel matrix.
A reduced kernel can avoid this difficulty.
[Figure: the nonlinear map $\phi$ sends points from the input space $X$ to the feature space $F$.]
The value of the kernel function represents the inner product in feature space.
Kernel functions merge two steps: (1) map the input data from input space to feature space (which might be infinite-dimensional); (2) compute the inner product in the feature space.
Kernel Technique: Based on Mercer's Condition (1909)
Linear Machine in Feature Space
Let $\phi : X \to F$ be a nonlinear map from the input space to some feature space.
The classifier will be in the form (primal):
$f(x) = \Big(\sum_j w_j \phi_j(x)\Big) + b$, where the sum ranges over the (possibly infinite) feature-space dimension.
Make it in the dual form:
$f(x) = \Big(\sum_{i=1}^{\ell} \alpha_i y_i \langle \phi(x_i) \cdot \phi(x) \rangle\Big) + b$
Kernel: Represent Inner Product in Feature Space
The classifier will become:
$f(x) = \Big(\sum_{i=1}^{\ell} \alpha_i y_i K(x_i, x)\Big) + b$
Definition: A kernel is a function $K : X \times X \to R$ such that for all $x, z \in X$, $K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle$, where $\phi : X \to F$.
A Simple Example of Kernel
Polynomial kernel of degree 2: $K(x, z) = \langle x, z \rangle^2$
Let $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},\ z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \in R^2$ and define the nonlinear map $\phi : R^2 \to R^3$ by $\phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{bmatrix}$.
Then $\langle \phi(x), \phi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.
There are many other nonlinear maps $\psi(x)$ that satisfy the relation $\langle \psi(x), \psi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.
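A quick numerical check of this example (the two test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for 2-dimensional input
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)   # inner product computed in the feature space R^3
rhs = (x @ z) ** 2      # kernel evaluated in the input space R^2
print(lhs, rhs)         # both equal 1.0 here: <x, z> = 1, squared = 1
```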
Power of the Kernel Technique
Consider a nonlinear map $\phi : R^n \to R^p$ that consists of distinct features of all the monomials of degree $d$. Then $p = \binom{n + d - 1}{d}$.
For example: $n = 11,\ d = 10,\ p = 92378$.
Is it necessary? We only need to know $\langle \phi(x), \phi(z) \rangle$! This can be achieved with $K(x, z) = \langle x, z \rangle^d$.
2-Norm Soft Margin Dual Formulation
The Lagrangian for the 2-norm soft margin:
$L(w, b, \xi, \alpha) = \frac{1}{2}w'w + \frac{C}{2}\xi'\xi + \alpha'[e - D(Aw + eb) - \xi]$, where $\alpha \ge 0$.
Setting the partial derivatives with respect to the primal variables to zero:
$\frac{\partial L}{\partial w} = w - A'D\alpha = 0, \quad \frac{\partial L}{\partial b} = e'D\alpha = 0, \quad \frac{\partial L}{\partial \xi} = C\xi - \alpha = 0$
Dual Maximization Problem for 2-Norm Soft Margin
Dual:
$\max_{\alpha \in R^\ell}\ e'\alpha - \frac{1}{2}\alpha'D\Big(AA' + \frac{1}{C}I\Big)D\alpha \quad \text{subject to}\ e'D\alpha = 0,\ \alpha \ge 0$
The corresponding KKT complementarity condition:
$0 \le \alpha \perp D(Aw + eb) + \xi - e \ge 0$
Use the above conditions to find $b^*$.
Introduce Kernel in Dual Formulation for 2-Norm Soft Margin
$\max_{\alpha \in R^\ell}\ e'\alpha - \frac{1}{2}\alpha'D\Big(K(A, A') + \frac{1}{C}I\Big)D\alpha \quad \text{subject to}\ e'D\alpha = 0,\ \alpha \ge 0$
The feature space is implicitly defined by $K(x, z)$. Suppose $\alpha^*$ solves the QP problem; then the decision rule is defined by
$h(x) = \text{sgn}(K(x, A')D\alpha^* + b^*)$
Use the KKT complementarity conditions to find $b^*$:
$b^*$ is chosen so that $y_i[K(A_i, A')D\alpha^* + b^*] = 1 - \frac{\alpha_i^*}{C}$ for any $i$ with $\alpha_i^* \ne 0$, because
$0 \le \alpha^* \perp D(K(A, A')D\alpha^* + eb^*) + \xi^* - e \ge 0$ and $\alpha^* = C\xi^*$.
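A sketch of solving this kernelized dual directly with the cvxpy modeling package (assumed available) and then recovering $b^*$ from the KKT condition above; the linear kernel and the toy data are illustrative:

```python
import numpy as np
import cvxpy as cp

# toy data: rows of A are the points x_i, labels y_i in {+1, -1}
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(+1, 1, (20, 2)), rng.normal(-1, 1, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
m, C = len(y), 1.0

K = A @ A.T                                  # linear kernel K(A, A')
M = np.linalg.cholesky(K + np.eye(m) / C)    # K + I/C is positive definite

alpha = cp.Variable(m)
Dalpha = cp.multiply(y, alpha)               # D alpha with D = diag(y)
# max e'a - (1/2) a'D(K + I/C)Da   s.t.  e'Da = 0, a >= 0
prob = cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(M.T @ Dalpha)),
                  [cp.sum(Dalpha) == 0, alpha >= 0])
prob.solve()

a = alpha.value
# b* from the KKT conditions: y_i [K(A_i, A') D a + b*] = 1 - a_i / C for a_i != 0
i = int(np.argmax(a))
b = y[i] * (1 - a[i] / C) - K[i] @ (y * a)
print("b* =", b)
```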
Sequential Minimal Optimization (SMO)
Deals with the equality constraint and the box constraints of the dual problem.
Works on the smallest possible working set (only 2 points).
Finds the optimal solution by changing only the $\alpha$ values in the working set.
The solution can be defined analytically, which is the best feature of SMO.
Analytical Solution for Two Points
Suppose that we change $\alpha_1$ and $\alpha_2$.
In order to keep the equality constraint, we have to change the two $\alpha$ values such that
$\alpha_1 y_1 + \alpha_2 y_2 = \alpha_1^{old} y_1 + \alpha_2^{old} y_2$
The new $\alpha$ values also have to satisfy the box constraints, so we get a further restriction on how $\alpha$ can change.
A Restrictive Constraint on the New $\alpha$
Suppose that we change $\alpha_1$ and $\alpha_2$. Once we have $\alpha_2^{new}$, we can get $\alpha_1^{new}$.
A restrictive constraint: $U \le \alpha_2^{new} \le V$, where
$U = \max(0,\ \alpha_2^{old} - \alpha_1^{old}),\quad V = \min(C,\ C - \alpha_1^{old} + \alpha_2^{old})$ if $y_1 \ne y_2$, and
$U = \max(0,\ \alpha_1^{old} + \alpha_2^{old} - C),\quad V = \min(C,\ \alpha_1^{old} + \alpha_2^{old})$ if $y_1 = y_2$.
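A small helper capturing these clipping bounds (the function name and test values are chosen for illustration):

```python
def smo_bounds(a1_old, a2_old, y1, y2, C):
    # feasible interval [U, V] for the updated alpha_2 under the box and equality constraints
    if y1 != y2:
        U = max(0.0, a2_old - a1_old)
        V = min(C, C - a1_old + a2_old)
    else:
        U = max(0.0, a1_old + a2_old - C)
        V = min(C, a1_old + a2_old)
    return U, V

print(smo_bounds(0.3, 0.5, 1, -1, C=1.0))   # (0.2, 1.0)  when y1 != y2
print(smo_bounds(0.3, 0.5, 1,  1, C=1.0))   # (0.0, 0.8)  when y1 == y2
```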
$\varepsilon$-Support Vector Regression (Linear Case: $f(x) = x'w + b$)
Given the training set:
$S = \{(x_i, y_i) \mid x_i \in R^n,\ y_i \in R,\ i = 1, \ldots, m\}$,
represented by an $m \times n$ matrix $A$ and a vector $y \in R^m$.
Motivated by SVM: $\|w\|_2$ should be as small as possible, and some tiny errors should be discarded.
Try to find $(w, b)$ such that $y \approx Aw + eb$, that is, $y_i \approx w'x_i + b,\ i = 1, \ldots, m$, where $e = [1, \ldots, 1]' \in R^m$.
$\varepsilon$-Insensitive Loss Function
(Tiny Error Should Be Discarded)
The $\varepsilon$-insensitive loss function:
$|y_i - f(x_i)|_\varepsilon = \max\{0,\ |y_i - f(x_i)| - \varepsilon\}$
The loss made by the estimation function $f$ at the data point $(x_i, y_i)$ is
$|\xi|_\varepsilon = \max\{0,\ |\xi| - \varepsilon\} = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise} \end{cases}$
If $\xi \in R^n$, then $|\xi|_\varepsilon \in R^n$ is defined componentwise: $(|\xi|_\varepsilon)_i = |\xi_i|_\varepsilon,\ i = 1, \ldots, n$.
[Figure: data points scattered around a regression line with an $\varepsilon$-tube of half-width $\varepsilon$; points inside the tube incur no loss.]
$\varepsilon$-Insensitive Linear Regression
$f(x) = x'w + b$
[Figure: a point $x_j$ above the tube contributes $y_j - f(x_j) - \varepsilon$ to the error; a point $x_k$ below it contributes $f(x_k) - y_k - \varepsilon$.]
Find $(w, b)$ with the smallest overall error.
Five Popular Loss Functions
$\varepsilon$-Insensitive Loss Regression
Linear $\varepsilon$-insensitive loss function:
$L^\varepsilon(x, y, f) = |y - f(x)|_\varepsilon = \max(0,\ |y - f(x)| - \varepsilon)$,
where $x \in R^n$, $y \in R$ and $f$ is a real-valued function.
Quadratic $\varepsilon$-insensitive loss function:
$L^\varepsilon_2(x, y, f) = |y - f(x)|_\varepsilon^2$
$\varepsilon$-Insensitive Support Vector Regression Model
Motivated by SVM: $\|w\|_2$ should be as small as possible, and some tiny errors should be discarded.
$\min_{(w,b,\xi) \in R^{n+1+m}}\ \frac{1}{2}\|w\|_2^2 + Ce'|\xi|_\varepsilon$
where $|\xi|_\varepsilon \in R^m$ and $(|\xi|_\varepsilon)_i = \max(0,\ |A_i w + b - y_i| - \varepsilon)$.
Why minimize $\|w\|_2$? A probably approximately correct (pac) argument:
Consider performing linear regression for any training data distribution $\mathcal{D}$ with $\max_{1 \le i \le m} \|(x_i, y_i)\| \le R$, $0 < \delta < 1$ and $c > 0$; then
$\Pr_{\mathcal{D}}\Big(\text{err}(f) > \frac{c}{m}\Big(\frac{\|w\|_2^2 R^2 + \text{SSE}}{\varepsilon^2}\log^2 m + \log\frac{1}{\delta}\Big)\Big) < \delta$,
or equivalently
$\Pr_{\mathcal{D}}\Big(\text{err}(f) \le \frac{c}{m}\Big(\frac{\|w\|_2^2 R^2 + \text{SSE}}{\varepsilon^2}\log^2 m + \log\frac{1}{\delta}\Big)\Big) \ge 1 - \delta$.
Occam's razor: the simplest is the best.
Reformulated $\varepsilon$-SVR as a Constrained Minimization Problem
$\min_{(w,b,\xi,\xi^*) \in R^{n+1+2m}}\ \frac{1}{2}w'w + Ce'(\xi + \xi^*)$
subject to
$y - Aw - eb \le e\varepsilon + \xi$
$Aw + eb - y \le e\varepsilon + \xi^*$
$\xi, \xi^* \ge 0$
This is a minimization problem with $n + 1 + 2m$ variables and $2m$ constraints; it enlarges the problem size and the computational complexity of solving the problem.
SV Regression by Minimizing the Quadratic $\varepsilon$-Insensitive Loss
We have the following problem:
$\min_{(w,b,\xi) \in R^{n+1+\ell}}\ \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\big\||\xi|_\varepsilon\big\|_2^2$
where $(|\xi|_\varepsilon)_i = |y_i - (w'x_i + b)|_\varepsilon$.
Primal Formulation of SVR for the Quadratic $\varepsilon$-Insensitive Loss
$\min_{(w,b,\xi^+,\xi^-) \in R^{n+1+2\ell}}\ \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\big(\|\xi^+\|_2^2 + \|\xi^-\|_2^2\big)$
subject to
$-Aw - eb + y \le e\varepsilon + \xi^+$
$Aw + eb - y \le e\varepsilon + \xi^-$
$\xi^+, \xi^- \ge 0$
Extremely important: at the solution, $0 \le \xi^- \perp \xi^+ \ge 0$.
Simplified Dual Formulation of SVR
$\max_{\alpha}\ y'\alpha - \varepsilon\|\alpha\|_1 - \frac{1}{2}\alpha'\Big(AA' + \frac{1}{C}I\Big)\alpha \quad \text{subject to}\ e'\alpha = 0$
In the case $\varepsilon = 0$, the problem becomes least squares linear regression with a weight decay factor.
Kernel in Dual Formulation for SVR
Suppose $\alpha^*$ solves the QP problem:
$\max_{\alpha \in R^\ell}\ y'\alpha - \varepsilon\|\alpha\|_1 - \frac{1}{2}\alpha'\Big(K(A, A') + \frac{1}{C}I\Big)\alpha \quad \text{subject to}\ e'\alpha = 0$
Then the regression function is defined by
$f(x) = K(x, A')\alpha^* + b^*$,
where $b^*$ is chosen such that $f(x_i) - y_i = -\varepsilon - \frac{\alpha_i^*}{C}$ for some $i$ with $\alpha_i^* > 0$.
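A usage sketch with scikit-learn's SVR (assumed installed), which solves this kind of kernelized $\varepsilon$-insensitive problem; the RBF kernel and the values of C and epsilon are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, (60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 60)   # noisy targets

# epsilon sets the width of the insensitive tube, C the penalty on errors outside it
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors used:", model.support_.size, "of", len(X))
print("prediction at x = 0 :", model.predict([[0.0]]))
```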
Probably Approximately Correct Learning: pac Model
Key assumption: training and testing data are generated i.i.d. according to a fixed but unknown distribution $\mathcal{D}$.
When we evaluate the "quality" of a hypothesis (classification function) $h \in H$, we should take the unknown distribution $\mathcal{D}$ into account (i.e. the "average error" or "expected error" made by $h \in H$).
We call such a measure the risk functional and denote it by
$\text{err}_{\mathcal{D}}(h) = \mathcal{D}\{(x, y) \in X \times \{1, -1\} \mid h(x) \ne y\}$
Generalization Error of pac Model
Let $S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ be a set of $\ell$ training examples chosen i.i.d. according to $\mathcal{D}$.
Treat the generalization error $\text{err}_{\mathcal{D}}(h_S)$ as a random variable depending on the random selection of $S$.
Find a bound on the tail of the distribution of the random variable $\text{err}_{\mathcal{D}}(h_S)$ in the form $\varepsilon = \varepsilon(\ell, H, \delta)$.
$\varepsilon = \varepsilon(\ell, H, \delta)$ is a function of $\ell$, $H$ and $\delta$, where $1 - \delta$ is the confidence level of the error bound, which is given by the learner.
Probably Approximately Correct
We assert:
$\Pr(\{\text{err}_{\mathcal{D}}(h_S) > \varepsilon = \varepsilon(\ell, H, \delta)\}) < \delta$
or
$\Pr(\{\text{err}_{\mathcal{D}}(h_S) \le \varepsilon = \varepsilon(\ell, H, \delta)\}) \ge 1 - \delta$
The error made by the hypothesis $h_S$ will then be less than the error bound $\varepsilon(\ell, H, \delta)$, which does not depend on the unknown distribution $\mathcal{D}$.
Find the Hypothesis with Minimum Expected Risk?
Let $S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\} \subseteq X \times \{-1, 1\}$ be the training examples chosen i.i.d. according to $\mathcal{D}$ with probability density $p(x, y)$.
The expected misclassification error made by $h \in H$ is
$R[h] = \int_{X \times \{-1, 1\}} \frac{1}{2}|h(x) - y|\, dp(x, y)$
The ideal hypothesis $h^*_{opt}$ should have the smallest expected risk: $R[h^*_{opt}] \le R[h]\ \forall h \in H$.
Unrealistic!!!
Empirical Risk Minimization (ERM)
Find the hypothesis $h^*_{emp}$ with the smallest empirical risk: $R_{emp}[h^*_{emp}] \le R_{emp}[h]\ \forall h \in H$ ($\mathcal{D}$ and $p(x, y)$ are not needed).
Replace the expected risk over $p(x, y)$ by an average over the training examples. The empirical risk:
$R_{emp}[h] = \frac{1}{\ell}\sum_{i=1}^{\ell} \frac{1}{2}|h(x_i) - y_i|$
Only focusing on empirical risk will cause overfitting
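A tiny sketch of computing the empirical risk for labels in {-1, +1} (the hypothesis and data are illustrative):

```python
import numpy as np

def empirical_risk(h, X, y):
    # R_emp[h] = (1/l) * sum_i 0.5 * |h(x_i) - y_i|, with labels y_i in {-1, +1}
    return np.mean(0.5 * np.abs(h(X) - y))

X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1, 1, -1, -1])
h = lambda X: np.sign(X[:, 0] - 1.5)   # a simple threshold hypothesis
print(empirical_risk(h, X, y))         # 0.25: one of the four points is misclassified
```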
Overfitting
Overfitting is a phenomenon in which the resulting function fits the training set too well but does not have good prediction performance on unseen data.
Solid curve: $f(x) = 2x^2 - 5x + 5$. Red dots: generated by $f(x)$ with random noise. Dotted curve: a nonlinear regression function which passes through these 8 points.
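A sketch of the same phenomenon with numpy (the seed, noise level and polynomial degrees are illustrative): an interpolating high-degree polynomial fits the 8 noisy training points almost exactly but typically has much larger error on fresh inputs than a fit from the true model class:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2 * x**2 - 5 * x + 5
x_train = np.linspace(0, 3, 8)
y_train = f(x_train) + rng.normal(0, 0.5, 8)     # 8 points with random noise

p_quad = np.polyfit(x_train, y_train, deg=2)     # same model class as the true f
p_over = np.polyfit(x_train, y_train, deg=7)     # passes (almost) exactly through all 8 points

x_test = np.linspace(0, 3, 200)
for name, p in [("degree 2", p_quad), ("degree 7", p_over)]:
    mse = np.mean((np.polyval(p, x_test) - f(x_test)) ** 2)
    print(name, "test MSE:", round(mse, 4))      # the degree-7 fit usually generalizes worse
```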
Tuning Procedure
[Figure: testing-set correctness versus the tuning parameter; past the peak the model overfits.]
The final value of the parameter is the one with the maximum testing set correctness!
VC Confidence (The Bound between $R_{emp}[h]$ and $R[h]$)
The following inequality holds with probability $1 - \delta$:
$R[h] \le R_{emp}[h] + \sqrt{\dfrac{v\,(\log(2\ell/v) + 1) - \log(\delta/4)}{\ell}}$
where $v$ is the VC dimension of $H$.
C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2(2) (1998), pp. 121-167.
Capacity (Complexity) of Hypothesis Space $H$: VC-dimension
A given training set $S$ is shattered by $H$ if and only if for every labeling of $S$ there exists $h \in H$ consistent with this labeling.
Three (linearly independent) points can be shattered by hyperplanes in $R^2$.
Shattering Points with Hyperplanes in $R^n$
Theorem: Consider some set of $m$ points in $R^n$. Choose any one of them as the origin. Then the $m$ points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.
Can you always shatter three points with a line in $R^2$?
Definition of VC-dimension (A Capacity Measure of Hypothesis Space $H$)
The Vapnik-Chervonenkis dimension, VC($H$), of a hypothesis space $H$ defined over the input space $X$ is the size of the largest finite subset of $X$ shattered by $H$.
If arbitrarily large finite subsets of $X$ can be shattered by $H$, then VC($H$) $\equiv \infty$.
Let $H = \{$all hyperplanes in $R^n\}$; then VC($H$) $= n + 1$.
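A brute-force sketch of the shattering claim for three points in general position in $R^2$ (assuming scikit-learn; a large C approximates a hard-margin separator): every one of the $2^3$ labelings is linearly separable:

```python
import itertools
import numpy as np
from sklearn.svm import SVC

# three points in general position (not collinear) in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

shattered = True
for labels in itertools.product([-1, 1], repeat=3):
    y = np.array(labels)
    if len(set(labels)) == 1:
        continue                        # a single-class labeling is trivially separable
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    if clf.score(X, y) < 1.0:
        shattered = False
print("all 8 labelings separable:", shattered)   # True, so VC dim of lines in R^2 >= 3
```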