Regression and Classification using Kernel Methods
Barnabás Póczos
University of Alberta
Oct 15, 2009
Roadmap I (Classification)
• Primal Perceptron ⇒ Dual Perceptron ⇒ Kernels (it does not aim for a large margin...)
• Primal hard SVM ⇒ Dual hard SVM ⇒ Kernels (it wants a large margin, but assumes the data is linearly separable in the feature space)
• Primal soft SVM ⇒ Dual soft SVM ⇒ Kernels (it wants a large margin, and does not assume that the data is linearly separable in the feature space)
Roadmap II (Regression)
• Ridge Regression ⇒ Dual form ⇒ Kernels
• Primal SVM for Regression ⇒ Dual SVM ⇒ Kernels
• Logistic Regression ⇒ Kernels
• Bayesian Ridge Regression in feature space (Gaussian Processes) ⇒ Dual form ⇒ Kernels
The Perceptron
The primal algorithm in the feature space
(Figure taken from R. Herbrich)
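Since the algorithm itself appears only in the figure, here is a minimal sketch of the standard primal perceptron with an explicit feature map φ; the function names and the epoch cap are my assumptions, not from the slides.

```python
import numpy as np

def primal_perceptron(X, y, phi, n_epochs=100):
    """Primal perceptron in feature space.
    X: (m, d) inputs; y: (m,) labels in {-1, +1};
    phi: feature map, computed explicitly for every point."""
    Phi = np.array([phi(x) for x in X])      # explicit feature vectors
    w = np.zeros(Phi.shape[1])               # weight vector lives in feature space
    b = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for phi_x, y_i in zip(Phi, y):
            if y_i * (w @ phi_x + b) <= 0:   # misclassified (or on the boundary)
                w += y_i * phi_x             # additive update
                b += y_i
                mistakes += 1
        if mistakes == 0:                    # converged: every point classified correctly
            break
    return w, b
```

The drawback is that φ(x) must be computed and stored explicitly, which motivates the dual form on the next slide.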
The Dual Perceptron
(Figure taken from R. Herbrich)
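In place of the figure, a sketch of the standard dual perceptron: the weight vector is never formed; only per-example mistake counts and kernel evaluations are needed, so an implicit feature map suffices. Variable names are illustrative.

```python
import numpy as np

def dual_perceptron(X, y, k, n_epochs=100):
    """Dual (kernel) perceptron. alpha[i] counts the mistakes made on x_i;
    implicitly w = sum_i alpha[i] * y[i] * phi(x_i), but only k(.,.) is used."""
    m = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix
    alpha = np.zeros(m)
    b = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for i in range(m):
            f_i = np.sum(alpha * y * K[:, i]) + b  # f(x_i) = sum_j alpha_j y_j k(x_j, x_i) + b
            if y[i] * f_i <= 0:                    # mistake: bump this point's count
                alpha[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b
```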
The SVM
Linear Classifiers
f(x, w, b) = sign(w · x + b)
How to classify this data? Each of these candidate boundaries seems fine... which is best?
(Figure: points labeled +1 and –1 with several candidate linear boundaries; taken from Andrew W. Moore)
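As a one-line illustration of the decision rule (a sketch; the helper name is mine):

```python
import numpy as np

def f(x, w, b):
    """The linear classifier from the slide: sign(w · x + b)."""
    return np.sign(w @ x + b)

w, b = np.array([1.0, -1.0]), 0.5
print(f(np.array([2.0, 0.0]), w, b))  # 1.0: the point falls on the positive side
```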
Classifier Margin
f(x, w, b) = sign(w · x + b)
The margin of a linear classifier is the width by which the boundary could be increased before hitting a datapoint.
(Figure: points labeled +1 and –1, with the margin drawn around the boundary; taken from Andrew W. Moore)
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are the datapoints that "touch" the margin.
(Figure: points labeled +1 and –1 with the maximum margin boundary and its support vectors; taken from Andrew W. Moore)
Specifying a Line and a Margin
• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = –1 }
The classifier boundary is w · x + b = 0; the "predict class = +1" zone lies beyond the plus-plane, and the "predict class = –1" zone beyond the minus-plane.
Classify as:
• +1 if w · x + b ≥ 1
• –1 if w · x + b ≤ –1
• Universe explodes if –1 < w · x + b < 1
(Figure: the three parallel lines w · x + b = 1, 0, –1; taken from Andrew W. Moore)
Computing the margin width
Given:
• w · x+ + b = +1
• w · x– + b = –1
• x+ = x– + λw
• |x+ – x–| = M (the margin width)
Then
    M = |x+ – x–| = |λw| = λ|w| = λ√(w · w).
Substituting x+ = x– + λw into w · x+ + b = 1 gives w · x– + b + λ(w · w) = 1, i.e. –1 + λ(w · w) = 1, so
    λ = 2 / (w · w)
and therefore
    M = λ√(w · w) = 2√(w · w) / (w · w) = 2 / √(w · w).
Yay! Just maximize 2/√(w · w) ≡ minimize w · w.
Wait... OMG, I forgot the data!
(Figure: the lines w · x + b = 1, 0, –1, with x+ and x– on the plus- and minus-planes; taken from Andrew W. Moore)
The Primal Hard SVM
    minimize over w, b:  ½ (w · w)
    subject to:  yi (w · xi + b) ≥ 1 for i = 1, ..., m
This is a QP problem ⇒ the solution is expressible in dual form.
The computational effort is O(n²), where n is the dimension of the feature space.
Quadratic Programming – in general
Find
    argmin over w of  c + dᵀw + ½ wᵀKw        (quadratic criterion)
subject to n additional linear inequality constraints
    a11 w1 + a12 w2 + ... + a1m wm ≤ b1
    a21 w1 + a22 w2 + ... + a2m wm ≤ b2
    ⋮
    an1 w1 + an2 w2 + ... + anm wm ≤ bn
and e additional linear equality constraints
    a(n+1)1 w1 + a(n+1)2 w2 + ... + a(n+1)m wm = b(n+1)
    a(n+2)1 w1 + a(n+2)2 w2 + ... + a(n+2)m wm = b(n+2)
    ⋮
    a(n+e)1 w1 + a(n+e)2 w2 + ... + a(n+e)m wm = b(n+e)
Note: w · x = wᵀx.
taken from Andrew W. Moore
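To make the generic form concrete, here is a minimal sketch of posing a small QP to a solver. cvxopt is my choice of solver (the slides name none), and the numbers are arbitrary.

```python
import numpy as np
from cvxopt import matrix, solvers

# Minimize (1/2) w^T K w + d^T w  subject to  w1 + w2 <= 3  and  w1 - w2 = 0.
# (The constant c does not affect the argmin, so it is dropped.)
K = matrix(np.array([[2.0, 0.0], [0.0, 2.0]]))  # quadratic term, must be PSD
d = matrix(np.array([-2.0, -5.0]))              # linear term
G = matrix(np.array([[1.0, 1.0]]))              # inequality:  G w <= h
h = matrix(np.array([3.0]))
A = matrix(np.array([[1.0, -1.0]]))             # equality:    A w = b
b = matrix(np.array([0.0]))
sol = solvers.qp(K, d, G, h, A, b)
print(np.ravel(sol['x']))                       # -> approximately [1.5, 1.5]
```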
The Dual Hard SVM
    maximize over α:  Σi αi – ½ Σi Σj αi αj yi yj (xi · xj)
    subject to:  αi ≥ 0 for all i,  Σi αi yi = 0
This is again Quadratic Programming; the computational effort is O(m²), where m is the number of training samples. (A QP-solver sketch follows the proof below.)
Lemma
Proof
Primal Lagrangian:
    L(w, b, α) = ½ (w · w) – Σi αi [yi (w · xi + b) – 1],   αi ≥ 0
The solution: setting the derivatives with respect to w and b to zero gives
    w = Σi αi yi xi,   Σi αi yi = 0.
Proof cont.
Substituting these back into L eliminates w and b and yields the dual objective above.
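As a sketch of how this dual maps onto the generic QP form above (cvxopt and the variable names are my assumptions; any QP solver would do):

```python
import numpy as np
from cvxopt import matrix, solvers

def dual_hard_svm(X, y, k):
    """Solve the dual hard-margin SVM as a QP.
    X: (m, d) data; y: (m,) labels in {-1, +1}; k: kernel function.
    Note: the hard-margin problem is only bounded if the data are separable."""
    m = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix
    Q = (y[:, None] * y[None, :]) * K                    # Q_ij = y_i y_j k(x_i, x_j)
    # cvxopt minimizes (1/2) a^T Q a + q^T a  s.t.  G a <= h,  A a = b;
    # maximizing sum_i a_i - (1/2) a^T Q a means q = -1.
    P = matrix(Q)
    q = matrix(-np.ones(m))
    G = matrix(-np.eye(m))                      # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(float))  # sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    return alpha
```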
The Problem with Hard SVM
It assumes the samples are linearly separable...
Solutions:
• Use a kernel with large expressive power (e.g. an RBF kernel with small σ) ⇒ the training samples become linearly separable in the feature space ⇒ Hard SVM can be applied ⇒ overfitting...
• Use soft margin SVM instead of Hard SVM ⇒ slack variables...
Hard SVM
The Hard SVM problem can be rewritten in terms of a loss that is zero for a correct classification and nonzero for a misclassification.
From Hard to Soft constraints
Instead of using hard constraints (which require the points to be linearly separable), we can try to solve the soft version of the problem: penalize misclassifications through that loss rather than forbidding them outright.
Problems with the l0-1 loss
It is not convex in y f(x) ⇒ it is not convex in w, either... and we like only convex functions...
Let us approximate it with convex functions (see the sketch after this list):
• piecewise linear approximation (hinge loss, l_lin)
• quadratic approximation (l_quad)
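A minimal sketch of the three losses as functions of the margin z = y f(x); the function names mirror the slides' l0-1, l_lin, and l_quad, and l_quad is taken to be the squared hinge (an assumption).

```python
import numpy as np

def l01(z):       # 0-1 loss: 1 if misclassified (z <= 0), else 0 -- not convex
    return (z <= 0).astype(float)

def l_lin(z):     # hinge loss: a piecewise linear, convex upper bound on l01
    return np.maximum(0.0, 1.0 - z)

def l_quad(z):    # quadratic (squared hinge) loss: a smooth, convex upper bound
    return np.maximum(0.0, 1.0 - z) ** 2

z = np.linspace(-2.0, 2.0, 5)
print(l01(z), l_lin(z), l_quad(z))  # both surrogates dominate the 0-1 loss
```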
Approximation of the Heaviside step function
(Figure taken from R. Herbrich)
The hinge loss approximation of l0-1:
    l_lin(y f(x)) = max(0, 1 – y f(x)),  where f(x) = w · x + b.
The Slack Variables
(Figure: the lines w · x + b = 1, 0, –1 with margin M = 2/√(w · w); the slack variables ξ2, ξ7, ξ11 measure how far individual points violate the margin; taken from Andrew W. Moore)
The Primal Soft SVM problem
Using the l_lin (hinge) loss approximation:
    minimize over w, b, ξ:  ½ (w · w) + C Σn ξn
    subject to:  yn (w · xn + b) ≥ 1 – ξn,  ξn ≥ 0 for all n
Or we can use the equivalent unconstrained form, too (attacked directly in the sketch below):
    minimize over w, b:  ½ (w · w) + C Σn max(0, 1 – yn (w · xn + b))
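A minimal sketch of minimizing the unconstrained form by subgradient descent (this Pegasos-style approach, the step size, and the epoch count are my assumptions; the slides instead solve the problem as a QP):

```python
import numpy as np

def primal_soft_svm(X, y, C=1.0, eta=0.01, n_epochs=200):
    """Subgradient descent on (1/2) w.w + C * sum_n max(0, 1 - y_n (w.x_n + b))."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        margins = y * (X @ w + b)
        active = margins < 1                 # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```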
The Dual Soft SVM (using hinge loss)
    maximize over α:  Σi αi – ½ Σi Σj αi αj yi yj k(xi, xj)
    subject to:  0 ≤ αi ≤ C for all i,  Σi αi yi = 0
The only change from the hard-margin dual is the box constraint αi ≤ C.
Proofs
As in the hard-margin case: form the primal Lagrangian (with multipliers for both the margin and the ξn ≥ 0 constraints), set the derivatives to zero, and substitute back to obtain the dual.
Classifying using SVMs with Kernels
• Choose a kernel function
• Solve the dual problem
• Classify a new point x by the sign of Σi αi yi k(xi, x) + b
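The whole recipe in a few lines, as a sketch using scikit-learn's SVC (my choice of library; the experiments below use Steve Gunn's MATLAB toolbox instead):

```python
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, (iris.target == 0).astype(int)  # one class vs. the other two

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # choose a kernel; SVC solves the dual
clf.fit(X, y)
print(clf.n_support_)      # number of support vectors per class
print(clf.predict(X[:5]))  # classify new points
```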
A few results
Kernels and Linear Classifiers
We map the data into a feature space and use linear classifiers in this feature space.
The Decision Surface
(Figure taken from R. Herbrich)
For general feature maps it can be very difficult to solve these...
Results (using Steve Gunn's SVM toolbox)
Results, Iris 2vs13, linear kernel (figure)
Results, Iris 1vs23, 2nd order kernel (two figures)
Results, Iris 1vs23, 13th order kernel (figure)
Results, Iris 1vs23, RBF kernel (two figures)
Results, Chessboard, poly kernel (several figures)
Results, Chessboard, RBF kernel (figure)
Regression
Ridge Regression
Linear regression:
    minimize over w:  Σi (yi – w · xi)²
Primal (add the regularizer λ w · w):
    minimize over w:  Σi (yi – w · xi)² + λ (w · w)
This can be solved in closed form:
    w = (XᵀX + λI)⁻¹ Xᵀ y
Dual for a given λ: ...after some calculations...
    α = (XXᵀ + λI)⁻¹ y,   w = Xᵀ α,   so f(x) = Σi αi (xi · x)
Kernel Ridge Regression Algorithm
Replace the inner products with kernel evaluations:
1. Compute the Gram matrix Kij = k(xi, xj).
2. Solve α = (K + λI)⁻¹ y.
3. Predict f(x) = Σi αi k(xi, x).
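A minimal numpy sketch of this algorithm; the RBF kernel, its bandwidth, and the toy sinc data echo the experiments below, but all the specific values are assumptions.

```python
import numpy as np

def kernel_ridge_fit(X, y, k, lam):
    """alpha = (K + lam*I)^-1 y, the closed-form dual solution above."""
    m = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * np.eye(m), y)

def kernel_ridge_predict(X_train, alpha, k, x):
    """f(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * k(xi, x) for a, xi in zip(alpha, X_train))

rbf = lambda a, b, sigma=0.5: np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = np.sinc(X.ravel())                  # sinc(x) = sin(pi x)/(pi x), as in the slides
alpha = kernel_ridge_fit(X, y, rbf, lam=1e-2)
print(kernel_ridge_predict(X, alpha, rbf, np.array([0.5])))
```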
SVM Regression
• Typical loss function (ridge regression):
    C Σn (yn – tn)² + ½ ‖w‖²
  ... a quadratic penalty whenever yn ≠ tn.
• To make the solution sparse... don't worry if the prediction is "close enough":
    Eε(y, t) = 0 if |y – t| < ε;  |y – t| – ε otherwise
  ... the ε-insensitive loss function: no penalty if the point lies inside the ε-tube.
• So minimize instead:
    C Σn Eε(yn, tn) + ½ ‖w‖²
taken from Andrew W. Moore
SVM Regression
• No penalty if yn – ε ≤ tn ≤ yn + ε
• Slack variables {ξn+, ξn–}:
    tn ≤ yn + ε + ξn+
    tn ≥ yn – ε – ξn–
• Error function:
    C Σn (ξn+ + ξn–) + ½ ‖w‖²
• ... use Lagrange multipliers {an+, an–, µn+, µn–} ≥ 0:
    min L(...) = C Σn (ξn+ + ξn–) + ½ ‖w‖² – Σn (µn+ ξn+ + µn– ξn–)
                 – Σn an+ (ε + ξn+ + yn – tn) – Σn an– (ε + ξn– – yn + tn)
taken from Andrew W. Moore
SVM Regression, con't
• Set the derivatives of
    L(...) = C Σn (ξn+ + ξn–) + ½ ‖w‖² – Σn (µn+ ξn+ + µn– ξn–)
             – Σn an+ (ε + ξn+ + yn – tn) – Σn an– (ε + ξn– – yn + tn)
  to 0 and solve for {ξn+, ξn–, µn+, µn–} ... this leaves the dual problem
    max L̃(a+, a–) = –½ Σn Σm (an+ – an–)(am+ – am–) k(xn, xm)
                     – ε Σn (an+ + an–) + Σn (an+ – an–) tn
    s.t.  0 ≤ an+ ≤ C,  0 ≤ an– ≤ C
• Prediction for a new x:
    y(x) = Σn (an+ – an–) k(xn, x)
Quadratic Program!!
taken from Andrew W. Moore
SVM Regression, con't
    y(x) = Σn (an+ – an–) k(xn, x)
• Can ignore xn unless either an+ > 0 or an– > 0
• an+ > 0 only if tn = yn + ε + ξn+, i.e., if the point is on the upper boundary of the ε-tube (ξn+ = 0) or above it (ξn+ > 0)
• an– > 0 only if tn = yn – ε – ξn–, i.e., if the point is on the lower boundary of the ε-tube (ξn– = 0) or below it (ξn– > 0)
taken from Andrew W. Moore
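A sketch of the same ε-insensitive regression using scikit-learn's SVR, which solves this QP internally (the library choice, hyper-parameter values, and toy data are mine):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
t = np.sinc(X.ravel()) + 0.05 * rng.standard_normal(100)  # noisy sinc targets

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, t)
print(len(svr.support_))     # sparse: only points on or outside the eps-tube
print(svr.predict([[0.5]]))
```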
Kernels in Logistic Regression
• Define the weights in terms of the support vectors: w = Σi αi xi, so that f(x) = Σi αi k(xi, x)
• Derive a simple gradient descent rule on the αi (sketched below)
taken from Andrew W. Moore
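A minimal sketch of such a gradient descent on the αi; the regularizer, learning rate, and iteration count are my assumptions, since the slides only state the idea.

```python
import numpy as np

def kernel_logreg(K, y, lam=1e-3, eta=0.1, n_steps=500):
    """Gradient descent on alpha for f(x_i) = (K alpha)_i.
    Minimizes sum_i log(1 + exp(-y_i f_i)) + (lam/2) alpha^T K alpha.
    K: (m, m) Gram matrix; y: labels in {-1, +1}."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(n_steps):
        f = K @ alpha
        p = 1.0 / (1.0 + np.exp(y * f))          # p_i = sigma(-y_i f_i)
        grad = -K @ (y * p) + lam * (K @ alpha)  # gradient of the regularized log-loss
        alpha -= eta * grad / m
    return alpha

# Usage: alpha = kernel_logreg(K, y); predictions = np.sign(K @ alpha)
```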
A few results
Sinc = sin(πx)/(πx), RBF kernel (a sequence of result figures)
Sinc = sin(πx)/(πx), poly kernel (a sequence of result figures)
Summary
Key SVM Ideas
• Maximize the margin between + and – examples; connects to PAC theory
• Sparse: only the support vectors contribute to the solution
• Penalize errors in the non-separable case
• Kernels map examples into a new, usually nonlinear, space and implicitly do dot products in this new space (in the "dual" form of the SVM program)
• Kernels are separate from SVMs... but they combine very nicely with SVMs
taken from Andrew W. Moore
Summary I: Advantages
• Systematic implementation through quadratic programming; very efficient implementations exist
• Excellent data-dependent generalization bounds
• Regularization is built into the cost function
• Statistical performance is independent of the dimension of the feature space
• Theoretically related to the widely studied fields of regularization theory and sparse approximation
• Fully adaptive procedures are available for determining the hyper-parameters
taken from Andrew W. Moore
Summary II: Drawbacks
• Treatment of the non-separable case is somewhat heuristic
• The number of support vectors may depend strongly on the kernel type and the hyper-parameters
• Systematic choice of kernels is difficult (prior information)... though some ideas exist
• Optimization may require clever heuristics for large problems
taken from Andrew W. Moore
What You Should Know
• Definition of a maximum margin classifier
• The sparse version: (linear) SVMs
• What QP can do for you (but you don't need to know how it does it)
• How maximum margin = a QP problem
• How to deal with noisy (non-separable) data: slack variables
• How to permit "non-linear boundaries": the kernel trick
• How SVM kernel functions permit us to pretend we're working with a zillion features
taken from Andrew W. Moore
Thanks for the Attention!
Attic
Difference between SVMs and Logistic Regression

                                         SVMs         Logistic Regression
Loss function                            Hinge loss   Log-loss
High dimensional features with kernels   Yes!         No
Solution sparse                          Often yes!   Almost always no!
Semantics of output                      "Margin"     Real probabilities