support vector machines - university at buffalosrihari/cse574/chap7/7.1-svms.pdfmachine learning...
TRANSCRIPT
Machine Learning Srihari
SVM Discussion Overview
1. Overview of SVMs2. Margin Geometry3. SVM Optimization4. Overlapping Distributions5. Relationship to Logistic Regression6. Dealing with Multiple Classes7. SVM for Regression8. SVM and Computational Learning Theory9. Relevance Vector Machines
Machine Learning Srihari Kernel Methods• In the kernel method for regression
• Given data set {xi,ti}, i = 1..N• Gram matrix K has elements Knm= ϕ (xn)Tϕ (xm)=k(xn,xm)
• Output y predicted for x isy(x)=k(x)T(K +λIN)-1t, where k(x) has elements kn(x)=(xn,x)
• Many choices of kernel1. Gaussian: k(x,x’) = exp (-||x-x’||2/2σ2)2. Cosine similarity (BOW, TF-IDF):
3. Mercer kernel:# substrings shared
k(x
i,x
i ') =
xiTx
i '
||xi||
2||x
i '||
2
xi=[xi1,..,xiD]
k(x,x ') = w
ss∈A*∑ φ
s(x)φ
s(x ') ϕs(x) is no of times
substring s appears in x
J(w) =
12
wTφ(xn)−t
n{ }2
+λ2n=1
N
∑ wTw
J(w) =
12aTΦΦTΦΦTa−aTΦΦTt +
12
tTt +λ2aTΦΦTa
J(a) =
12aTKKa−aTKt +
12
tTt +λ2aTKa
a =(K +λIN)-1t
w = ΦTa a
n=−
1λ
wTφ(xn)−t
n{ }
Primal optimization
Dual optimization
w appearing in a eliminated using kernel
Machine Learning Srihari
Sparse Kernel Methods
• Kernel methods have a limitation • that k(xn,xm) must be evaluated for all possible
pairs (xn,xm) of training points• Which is computationally infeasible
• In sparse solutions, predictions for new inputs depend only on kernels evaluated at a subset of training data points
Machine Learning Srihari
Alternative Names for SVM• Also called Sparse kernel machines
• Prediction is based on a linear combination of a kernel evaluated at the training points
• Sparse because not all pairs of training points need be used
• Also called Maximum margin classifiers
Machine Learning Srihari
Importance of SVMs• SVM is a discriminative method• SVM brings together:
1. Computational learning theory 2. Previous methods in linear discriminants 3. Optimization theory
• Widely used for ML problems: • Classification, regression, novelty detection
Machine Learning Srihari
Ideas Used in SVM• Linearly separable case considered
• By using higher-dimensions than original feature space
• Kernel trick reduces computational overhead
• Determination of model parameters is a convex optimization problem
• Extensive use of Lagrange multipliers
k(y
j,y
k) = y
jt ⋅y
k= φ(x
j)t ⋅φ(x
k)
Machine Learning Srihari
Boundaries induced by kernels
Source: https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805
Machine Learning Srihari
2. Maximum Margin Classifiers• Begin with 2-class linear classifier
y(x)=wTϕ(x)+b• where ϕ(x) is a feature space transformation
• We will introduce a dual representation which avoids working with feature space• Training set: x1,.., xN with targets t1,..,tN , tn ε {-1,1}
• New x classified according to sign of y(x)• Assume training data is linearly separable in
feature space, • i.e., there exists at least one choice of w and b
such that tny(xn)>0 for all points
Machine Learning Srihari
Definition of margin • Perceptron guarantees a solution in finite
no. of steps, which depends on 1. Initial values for w,b2. Order in which data are presented
• If there are multiple solutions we need one with smallest generalization error• SVM approaches this with margin concept
• Defined as smallest distance between decision boundary and any of the samples
Machine Learning Srihari
Support Vector Definition
• Margin:• Perpendicular distance between between
boundary and the closest data point (left)• Maximizing margin leads to a particular
choice of decision boundary • Determined by a subset of the data points
• Which are known as support vectors (circles)
y = 1y = 0
y = �1
margin
y = 1
y = 0
y = �1
Machine Learning Srihari
Three support vectors are shown as solid dots
Support Vectors and Margin
• Support vectors are those nearest patterns at distance b from hyperplane
• SVM finds hyperplanewith maximum distance
(margin distance b)
from nearest training patterns
Machine Learning Srihari
Motivation for maximum margin• Maximum margin solution is motivated by
computational learning theory• Simple insight
• Based on a hybrid of generative and discriminative approaches• Seen next (Tong, Koller 2000)
Machine Learning Srihari
Max Margin Insight• Model class-conditionals p(x |Ci)
• Using Parzen estimator with Gaussian kernels– With common parameter σ2
• defines the Bayes classifier – Determines best hyperplane for learned density model
• As σ2 à 0 hyperplane has maximum margin– In the limit hyperplane becomes independent of points
that are not support vectors
100 normallydistributed noswith decreasing σp(x|Ci)
Different priors p(Ci)for linearlyseparable data leads toa decision boundarythat lies in the middle
P(C
1|x) =
P(x |C1)P(C
1)
P(x |C1)P(C
1)+ P(x |C
2)P(C
2)
Machine Learning Srihari
y(x) = wt(x
p+ r
w||w ||
)+ b
Distance of a point x from planeTheorem: Given a plane y(x)=wTx+b=0, perpendiculardistance from x to the plane is
|||| wwrxx p +=
r =
y(x)||w ||
px y(x) < 0
= wtx
p+w
0+ r
wtw||w ||
= y(xp)+ r
||w ||2
||w || = r ||w ||
r = y(x)
||w ||
y(x) = wt ⋅x +b = 0
y(x) > 0
Corollary: Distance of origin to plane is
r = y(0)/||w|| = b/||w||
Since y(0)=wT0+b=bthen b=0 implies thatplane passes through origin
= 0
QED x = x
p+ r
w||w ||
Proof:
Machine Learning Srihari
y(x) = 1
SVM Margin geometry
Shortest line connectingConvex hulls, whichhas length = 2/||w||
Optimal hyperplaneis orthogonal to..
y1y2y(x) = -1
y(x) = wtx = 0y2
x1
Machine Learning Srihari
Maximum Margin Solution• We interested in solutions for which tny(xn)>0 for all n• Distance of point xn to decision surface is
• Margin is closest point xn of data set• We wish to optimize w, b to maximize distance
• Maximum margin solution is found from
• Direct solution is very complex• So convert into equivalent problem
tny(x
n)
||w ||=
tn(wTφ(x
n)+b)
||w ||
argmax
w,b
1||w ||
minn
tn(wTφ(x
n)+b)⎡
⎣⎢⎤⎦⎥
⎧⎨⎪⎪
⎩⎪⎪
⎫⎬⎪⎪
⎭⎪⎪
Machine Learning Srihari
Equivalent problem• If we rescale wàκw and bàκb then the
distance from any point xn to the decision surface given by tny(xn)/||w|| is unchanged• We can use this freedom to set, for the point
closest to the surface, tn (wTϕ(xn)+b)=1• In this case all points will satisfy
tn (wTϕ(xn)+b)≥1 n=1,..,N• This is the canonical representation of the
hyperplane
Machine Learning Srihari
Canonical Representationtn (wTϕ(xn)+b)≥1 n=1,..,N
• When equality holds, constraints are active• For the remainder they are inactive• There will always be at least one active
constraint, because there will be a closest point• Once margin is maximized there will be two
active constraints
Machine Learning Srihari Optimization Problem (OP) Restated• Augmented space:
y(x) = wtx by choosing w0= b and x0=1, i.e, plane passes through origin• Canonical representation of decision hyperplane is then
tn y(xn) > 0, n = 1,.., N• Since distance of a point y to hyperplane y(x)=0 is
Maximizing the margin is equivalent to maximizing ||w||-1
Which is equivalent to minimizing ||w||2
y(x)||w ||
margin
xy(x)= wTx
So the optimization problem
is equivalent to
subject to the constraintstn (wTϕ(xn)+b)≥1 n=1,..,N
argmax
w,b
1||w ||
minn
tn(wTφ(x
n)+b)⎡
⎣⎢⎤⎦⎥
⎧⎨⎪⎪
⎩⎪⎪
⎫⎬⎪⎪
⎭⎪⎪
argmin
w,b
12
w2 The factor ½ is for later convenience
This is an example of quadratic programmingWe are trying to maximize a quadratic function subject to a set of linear inequality constraints
Machine Learning Srihari
• Can be cast as an unconstrained problem • by introducing Lagrange undetermined multipliers
with one multiplier an ≥ 0 for each constraint• The Lagrange function we wish to minimize is
L w,b,a( ) =
12
w2− a
ntn(wtφ(x
n)+b)−1⎡
⎣⎢⎤⎦⎥
n=1
N
∑
SVM Constrained Optimization
Optimize
arg minw,b
12
||w ||2
subject to constraints
tn(wtφ(x
n)+b)≥1, n = 1,.....N
where a=(a1,..,aN)T
Note the minus sign in front of the Lagrange multiplier termBecause we are minimizing wrt w and b, and maximizing wrt a
Machine Learning Srihari
• We wish to minimize
• We seek to minimize L ( ) • with respect to w and b• maximize it w.r.t . the undetermined
multipliers • Last term represents the goal of classifying the points correctly• Karush-Kuhn-Tucker construction shows that this can be recast
as a maximization problem which is computationally better
Optimization of Lagrange function
L w,b,a( ) =
12
w2− a
ntnwt(φ(x
n)+b)−1⎡
⎣⎢⎤⎦⎥
n=1
N
∑
an ≥ 0
Machine Learning Srihari
Eliminating w and b
• Setting derivatives of L(w,b,a) wrt w and bequal to zero• We obtain the following conditions
• Eliminating w and b from L(w,b,a) using these conditions then gives the dual representation of the maximum margin problem
L w,b,a( ) =
12
w2− a
ntnwt(φ(x
n)+b)−1⎡
⎣⎢⎤⎦⎥
n=1
N
∑
w = a
nn∑ t
nx
n 0 = a
nn∑ t
n
Machine Learning Srihari
• Problem is one of maximizing
• Subject to the constraints
given the training data• Here the kernel function is defined by
• Again a quadratic optimization problem • Optimize a quadratic function subject to a set
of inequality constraints
!L a( ) = a
nn=1
N
∑ −12 m=1
N
∑ ama
ntntmk(x
n,x
m)
n=1
N
∑
Dual O.P.
a
nn=1
N
∑ tn
= 0 an≥ 0, n = 1,.....,N
k(x,x ') = φ(x)t ⋅φ(x')
Machine Learning Srihari
Complexity of Primal and Dual• Quadratic programming complexity is O(M3)
• Primal OP of minimizing • over M basis functions • Dual OP of maximizing• over N data points
• For M<<N it appears disadvantageous• But it allows model to be reformulated using kernels
• So it can be applied to feature spaces whose dimensionality M exceeds no of data points N
• Kernel formulation also makes clear that k(x,x’) be positive definite, which bounds L(a) is lower-bounded
argmin
w,b
12
w2
!L a( ) = a
nn=1
N
∑ −12 m=1
N
∑ ama
ntntmk(x
n,x
m)
n=1
N
∑
Machine Learning Srihari
• To classify points using trained model:• Evaluate sign of y(x)=wTϕ(x)+b
• This can be expressed in terms of the parameters {an} and the kernel function by substituting for w using to give
Classifying new points
w = a
nn∑ t
nx
n
y(x) = a
ntnk(x,x
n)+b
n=1
N
∑
Machine Learning Srihari
Definition of Support Vectors• This OP satisfies KKT conditions
• which requires the following hold:
• Thus for every data point either an=0 or tny(xn)=1– A point with an=0 will not appear in sum
• Remaining data points are called support vectors– Because they satisfy tny(xn)=1 they correspond to
points that lie on the maximum margin hyperplanes in feature space
an≥ 0
tny(x
n)−1≥ 0
an{t
ny(x
n)−1} = 0
y(x) = a
ntnk(x,x
n)+b
n=1
N
∑
y = 1y = 0
y = �1
margin
y = 1
y = 0
y = �1
This property is central to practical applicability ofSVMs. Once it is trained a significant proportion ofData points can be discarded and only the support vectors retained
Machine Learning Srihari
Solution for b• Having solved the quadratic programming
problem and found a we can can determine value of threshold parameter b• Noting that for any support vector tny(xn)=1• Using gives
• Where S denotes the indices of support vectors• Solving for b gives
• Where NS is the total no of support vectors
y(x) = a
ntnk(x,x
n)+b
n=1
N
∑ tn
amtmk(x
n,x
m)+b
m∈S∑⎛
⎝⎜⎜⎜
⎞
⎠⎟⎟⎟⎟= 1
b =
1N
S
tn− a
mtmk(x
n,x
m)
m∈S∑
⎛
⎝⎜⎜⎜
⎞
⎠⎟⎟⎟⎟
Machine Learning Srihari
Equivalent Error Function of SVM• For comparison with alternative models
• Express the maximum margin classifier in terms of minimization of an error function:
• Where E∞(z) is zero if z≥0 and ∞ otherwise– Ensures that tn (wTϕ(xn)+b)≥1 n=1,..,N are satisfied
• Gives infinite error if point is misclassified• Zero error if it was classified correctly
• We optimize parameters to maximize margin
E∞
n=1
N
∑ y(xn)t
n−1( )+λ w
2
Machine Learning Srihari
Example of SVM Classifier • Two classes in two
dimensions• Synthetic Data• Shows contours of constant
y (x)• Obtained from SVM with
Gaussian kernel function• Decision boundary is shown• Margin boundaries are also
shown• Support vectors are shown• Shows sparsity of SVM
Machine Learning Srihari
Quadratic term
Summary of SVM Optimization ProblemsDifferentNotation here!
Machine Learning Srihari
Machine Learning Srihari
Kernel Function: key property
• If kernel function is chosen with propertyk(x,x’) = (ϕ (x)T.ϕ(x’))
then computational expense of increased dimensionality is avoided.
• Polynomial kernel k(x,x’) = (xT.x’)pcan be shown (next slide) to correspond to a mapϕ into the space spanned by all products of exactly d dimensions.
Machine Learning Srihari
A Polynomial Kernel Function
• Inner product needs computing six feature values and 3 ✕ 3 = 9 multiplications
• Kernel function k(x,y) has 2 multiplications, one add and a squaring
Suppose x = (x1,x
2) is a 2 - dimensional input vector
The feature space is 3 - dimensional :
φ(x) = x12,( 2x
1x
2,x
22)
t
Then inner product is
φ(x)Tφ(y) = x12,( 2x
1x
2,x
22)
t
y12,( 2y
1y
2,y
22) = (x
1y
1+ x
2y
2)2
Polynomial kernel function to compute the same value is
k(x,y) = (xT.y)2 = x1,( x
2)T y1,( y
2)⎛⎝⎜⎜⎜
⎞⎠⎟⎟⎟
2
= (x1y
1+ x
2y
2)2
or k(x,y) = φ(x)Tφ(y)
φ(x)Tφ(y)
same
Machine Learning Srihari
Another Polynomial kernel function
• k(x,y) = (x.y + 1)2• This one maps d = 2, p = 2 into a six- dimensional space
• Contains all the powers of x
• Inner product needs 36multiplications
• Kernel function needs 4multiplications
k(x,y) = φ(x)φ(y)where
φ(x) = x12 ,( x2
2 , 2x1, 2x2 , 2x1x2 ,1)x22
Machine Learning Srihari
Non-Linear Case• Mapping function ϕ (.) to a
sufficiently high dimension• So that data from two
categories can always be separated by a hyperplane
• Assume each pattern xkhas been transformed toϕ(xk), for k=1,.., n
• First choose the non-linear ϕfunctions
• To map the input vector to a higher dimensional feature space
• Dimensionality of space can be arbitrarily high only limited by computational resources
Machine Learning Srihari
Mapping into Higher Dimensional Feature Space
• Mapping each input point x by map
Points on 1-d line are mapped onto curve in 3-d.
• Linear separation in 3-d space is possible. Linear discriminant function in 3-d is in the form
Φ(x) =1xx 2
⎛
⎝
⎜⎜⎜⎜⎜⎜⎜⎜
⎞
⎠
⎟⎟⎟⎟⎟⎟⎟⎟⎟
y(x) = a1x
1+a
2x
2+a
3x
3
Machine Learning Srihari
Gaussian (RBF) Kernel• ϕ(x)maps x to an infinite space
• Achieves perfect separationFor sufficiently small kernel bandwidth, decision boundary will look like you just drew little circles around the points whenever they are needed to separatepositiveand negative examples.
Machine Learning Srihari
RBF Features from Taylor’s Series
Machine Learning Srihari
Pattern Transformation using Kernels
• Problem with high-dimensional mapping • Very many parameters• Polynomial of degree p over d variables leads to O(dp) variables in feature space
• Example: if d = 50 and p =2 we need a feature space of size 2500• Solution:
• Dual Optimization problem needs only inner products
• Each pattern xk transformed into
• Dimensionality of mapped space can be arbitrarily high
φ(xk)
Machine Learning Srihari
SVM for the XOR problem• XOR: binary valued features x1, x2• Not solved by linear discriminant• Function ϕ maps input x = [x1, x2] into six-D
feature space y = [1, √2x1, √2x2, √2x1x2 , x12 , x2
2]T
HyperbolasCorresponding tox1 x2 = +1
HyperplanesCorresponding to√2x1 x2 = +1
input space feature sub-space
Machine Learning Srihari
• We seek to maximize
• Subject to the constraints , i.e.,
• From problem symmetry, at the solution
SVM for XOR: maximization problem
a1−a
2+a
3−a
4= 0
0≤ak k = 1, 2, 3, 4
a
kk=1
4
∑ −12
ak
j=1
4
∑k=1
4
∑ ajt
jtkφ(x
j)φ(x
k)
a1=a3 and a2=a4
0 = a
nn∑ t
n
Machine Learning Srihari
SVM for XOR: maximization problem
• Can use iterative gradient descent• Or use analytical techniques for small problem• The solution is
a* = (1/8, 1/8, 1/8, 1/8)• Last term of Optimizn Problem implies that all four points are support vectors
(unusual and due to symmetric nature of XOR)• The final discriminant function is y(x1,x2) = x1✕ x2• Decision hyperplane is defined by y(x1,x2) = 0• Margin is given by b=1/||a|| = √2
HyperbolasCorresponding tox1 x2 = +1
HyperplanesCorresponding to_
√2x1 x2 = +1