1
Introduction to Predictive Learning
Electrical and Computer Engineering
LECTURE SET 7
Support Vector Machines
2
OUTLINE
• Objectives
  - explain the motivation for SVM
  - describe basic SVM for classification & regression
  - compare SVM vs. statistical & NN methods
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion
3
MOTIVATION for SVM
• Recall 'conventional' methods:
  - model complexity ~ dimensionality
  - nonlinear methods → multiple local minima
  - hard to control complexity
• A 'good' learning method should have:
  (a) a tractable optimization formulation
  (b) tractable complexity control (1-2 parameters)
  (c) flexible nonlinear parameterization
• Properties (a), (b) hold for linear methods
• SVM approach: retain (a) and (b) of linear methods while adding flexible nonlinear parameterization (c)
4
SVM APPROACH
• Linear approximation in Z-space using a special adaptive loss function
• Complexity independent of dimensionality

  x → z = g(x) → ŷ = (w · z)
5
Motivation for Nonlinear Methods
Typical development of a nonlinear method:
1. A nonlinear learning algorithm is proposed using 'reasonable' heuristic arguments
   (reasonable ~ statistical or biological)
2. Empirical validation + improvement
3. Statistical explanation (why it really works)
Examples: statistical and neural network methods.
In contrast, SVM methods were originally proposed within the VC theoretic framework.
6
OUTLINE
• Objectives
• Motivation for margin-based loss
  - loss functions for regression
  - loss functions for classification
  - philosophical interpretation
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion
7
Main Idea
• Model complexity is controlled by a special loss function used for fitting the training data
• Such empirical loss functions may differ from the loss functions used in the learning problem setting
• These loss functions are adaptive, i.e. they can adapt their complexity to a particular data set
• Different loss functions for different learning problems (classification, regression, etc.)
• Model complexity (VC-dimension) is controlled independently of the number of features
8
Robust Loss Function for Regression
• Squared loss ~ motivated by:
  - large-sample settings
  - parametric assumptions
  - Gaussian noise
• For practical settings it is better to use linear (absolute) loss
9
Epsilon-insensitive Loss for Regression
  L_ε(y, f(x, ω)) = max(|y − f(x, ω)| − ε, 0)
• Can also control model complexity (see the sketch below)
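A minimal sketch of this loss in Python (added for illustration; assumes only numpy):

```python
# Epsilon-insensitive loss: L(y, f) = max(|y - f| - eps, 0).
# Residuals inside the eps-tube cost nothing; outside it, cost grows linearly.
import numpy as np

def eps_insensitive_loss(y, f, eps):
    return np.maximum(np.abs(y - f) - eps, 0.0)

y = np.array([1.0, 1.0, 1.0])            # targets
f = np.array([0.9, 1.4, 2.0])            # model predictions
print(eps_insensitive_loss(y, f, 0.2))   # [0.  0.2 0.8]
```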
10
Empirical Comparison
• Univariate regression: target y = x, x ∈ [0,1], with additive noise ~ N(0, 0.36)
• Squared, linear, and SVM loss (with ε = 0.6)
[Figure: red ~ target function, dotted ~ estimate using squared loss, dashed ~ linear loss, dash-dotted ~ SVM loss]
11
Empirical Comparison (cont'd)
• Univariate regression: target y = x, x ∈ [0,1], with additive noise ~ N(0, 0.36)
• Squared, linear, and SVM loss (with ε = 0.6)
• Test error (MSE) estimated for 5 independent realizations of training data (4 training samples each):

  Realization   Squared loss   Least-modulus loss   SVM loss (ε = 0.6)
  1             0.024          0.134                0.067
  2             0.128          0.075                0.063
  3             0.920          0.274                0.041
  4             0.035          0.053                0.032
  5             0.111          0.027                0.005
  Mean          0.244          0.113                0.042
  St. Dev.      0.381          0.099                0.025
12
Loss Functions for Classification
• Decision rule: D(x) = sign(f(x, ω))
• The quantity y·f(x, ω) is analogous to a residual in regression
• Common loss functions: 0/1 loss and linear loss
• Properties of a good loss function?
13
Motivation for margin-based loss (1)
• Given: linearly separable data
  How to construct a linear decision boundary?
(a) Many linear decision boundaries (all with zero errors)
14
Motivation for margin-based loss (2)
• Given: linearly separable data
  Which linear decision boundary is better?
• The model with the larger margin is more robust for future data
15
Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions ~ the same linear parameterization
• Larger margin ~ more confidence (larger falsifiability)
  Margin = 2Δ
16
Margin-based loss for classification
  L(y, f(x, ω)) = max(Δ − y·f(x, ω), 0),   slack ξ_i = max(Δ − y_i·f(x_i, ω), 0)
• SVM loss, or hinge loss ~ minimization of the slack variables (see the sketch below)
[Figure: hinge loss vs. y·f(x, ω) for classes y = +1 and y = −1]
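A matching sketch of the margin-based (hinge) loss, with the margin Δ exposed as a parameter (an illustration, not part of the original slides):

```python
# Margin-based (hinge) loss: max(delta - y*f, 0), for labels y in {-1, +1}.
import numpy as np

def hinge_loss(y, f, delta=1.0):
    # standard SVM uses delta = 1; nonzero loss = slack variable xi_i
    return np.maximum(delta - y * f, 0.0)

y = np.array([+1, +1, -1, -1])
f = np.array([2.0, 0.3, -1.5, 0.2])   # model outputs f(x, w)
print(hinge_loss(y, f))               # [0.  0.7 0.  1.2]
```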
17
Margin-based loss for classification: margin size is adapted to training data
  L(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)
[Figure: class +1 and class −1 samples, with the margin zone between them]
18
Motivation: philosophical
• Classical view: a good model explains the data + has low complexity
  Occam's razor (complexity ~ number of parameters)
• VC theory: a good model explains the data + has low VC-dimension
  VC-falsifiability (small VC-dim ~ large falsifiability),
  i.e. the goal is to find a model that can explain the training data but cannot explain other data
• The idea: falsifiability ~ empirical loss function
19
Adaptive Loss Functions
• Both goals (explanation + falsifiability) can be encoded into an empirical loss function where:
  - a (large) portion of the data has zero loss
  - the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off between the two goals is adaptively controlled → adaptive loss function
• For classification, the degree of falsifiability is ~ margin size (see below)
20
Margin-based loss for classification
  L(y, f(x, ω)) = max(Δ − y·f(x, ω), 0),   Margin = 2Δ
21
Classification: non-separable data
  L(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)
  slack variables ξ_i = Δ − y_i·f(x_i, ω) for samples inside the margin or misclassified
[Figure: non-separable data with the slack variables marked]
22
Margin-based complexity control
• A large degree of falsifiability is achieved by:
  - a large margin (classification)
  - a small epsilon (regression)
• For linear classifiers: larger margin → smaller VC-dimension
  Δ₁ > Δ₂ > Δ₃ > ...  ⇒  h₁ ≤ h₂ ≤ h₃ ≤ ...
23
Δ-margin hyperplanes
• Solutions provided by minimization of the SVM loss can be indexed by the value of the margin Δ
  SRM structure: Δ₁ > Δ₂ > Δ₃ > ...  ⇒  h₁ ≤ h₂ ≤ h₃ ≤ ...  for the VC-dimensions
• If the data samples belong to a sphere of radius R, then the VC-dimension is bounded by
  h ≤ min(R²/Δ², d) + 1
• For large-margin hyperplanes, the VC-dimension is controlled independently of the dimensionality d
24
SVM Model Complexity
• Two ways to control model complexity:
  - via the model parameterization f(x, ω), using a fixed loss function L(y, f(x, ω))
  - via an adaptive loss function L_Δ(y, f(x, ω)), using a fixed (linear) parameterization f(x, w) = (w · x) + b
• ~ Two types of SRM structures
• Margin-based loss can be motivated by Popper's falsifiability
25
Margin-based loss: summary
• Classification: falsifiability controlled by margin Δ
  L(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)
• Regression: falsifiability controlled by ε
  L_ε(y, f(x, ω)) = max(|y − f(x, ω)| − ε, 0)
• Single-class learning: falsifiability controlled by radius r
  L_r(f(x)) = max(‖x − a‖ − r, 0)
NOTE: the same interpretation/motivation of margin-based loss applies across different types of learning problems.
26
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
  - primal formulation (linearly separable case)
  - dual optimization formulation
  - soft-margin SVM formulation
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion
27
Optimal Separating Hyperplane
• Distance between the hyperplane and a sample x′: |f(x′)| / ‖w‖
• Margin Δ = 1 / ‖w‖
• Shaded points are SVs
[Figure: optimal hyperplane with margin borders and support vectors]
28
Optimization Formulation
• Given training data (x_i, y_i), i = 1, ..., n
• Find parameters (w, b) of the linear hyperplane f(x) = (w · x) + b
  that minimize 0.5 (w · w)
  under the constraints y_i ((w · x_i) + b) ≥ 1, i = 1, ..., n
• Quadratic optimization with linear constraints (see the sketch below), tractable for moderate dimensions d
• For large dimensions use the dual formulation:
  - scales better with n (rather than d)
  - uses only dot products
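A sketch of this quadratic program, assuming the cvxpy package and a toy separable data set (both are illustration choices, not from the slides):

```python
# Primal hard-margin SVM: minimize 0.5*||w||^2
# subject to y_i * (w . x_i + b) >= 1 for all i.
import cvxpy as cp
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0]])
y = np.array([-1, -1, +1, +1])        # labels must be in {-1, +1}

n, d = X.shape
w, b = cp.Variable(d), cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)   # margin = 1 / ||w||
```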
29
From Optimization Theory:
• For a given convex minimization problem with convex inequality constraints, there exists an equivalent dual maximization formulation with nonnegative Lagrange multipliers α_i ≥ 0
• Karush-Kuhn-Tucker (KKT) conditions: the Lagrange coefficients are non-zero only for samples that satisfy the original constraint y_i ((w · x_i) + b) ≥ 1 with equality
  ~ SVs have positive Lagrange coefficients α_i* > 0; for all other samples α_i = 0
30
Convex Hull Interpretation of Dual
Find the convex hull of each class; the optimal hyperplane bisects the shortest segment connecting the two hulls. The points closest to the optimal hyperplane are the support vectors.
31
Dual Optimization Formulation
• Given training data (x_i, y_i), i = 1, ..., n
• Find parameters (α_i*, b*) of an optimal hyperplane
  D(x) = sign( Σ_{i=1..n} α_i* y_i (x_i · x) + b* )
  as the solution of the maximization problem
  L(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i,j=1..n} α_i α_j y_i y_j (x_i · x_j) → max
  under the constraints Σ_{i=1..n} α_i y_i = 0,  α_i ≥ 0, i = 1, ..., n
• Note: data samples with nonzero α_i* are SVs (see the sketch below)
• The formulation requires only the inner products (x · x′)
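In practice the dual is solved by a dedicated library; a short illustration (assuming scikit-learn) that the fitted model is fully determined by its support vectors and the dual coefficients α_i·y_i:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard margin
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i  :", clf.dual_coef_)   # nonzero only for SVs
print("w =", clf.coef_, "b =", clf.intercept_)
```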
32
Support Vectors
• SVs ~ training samples with non-zero loss
• SVs are the samples that falsify the model
• The model depends only on SVs → SVs ~ a robust characterization of the data
WSJ, Feb 27, 2004: "About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters."
33
Support Vectors
• SVM test error bound:
  E[test error] ≤ E[# support vectors] / n
  → a small number of SVs ~ good generalization
• Can be explained using leave-one-out (LOO) cross-validation
• SVM generalization can be related to data compression
34
Soft-Margin SVM formulation
Minimize:
  (1/2)(w · w) + C Σ_{i=1..n} ξ_i
under the constraints
  y_i ((w · x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n
where f(x) = (w · x) + b (see the sketch below)
[Figure: hyperplane f(x) = 0 with margin borders f(x) = ±1; slack variables ξ₁ = 1 − f(x₁), ξ₂ = 1 − f(x₂), ξ₃ = 1 + f(x₃) for the three margin violators]
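The same cvxpy sketch as before, extended with slack variables for the soft margin (toy data with one overlapping sample; an illustration only):

```python
# Soft-margin primal: minimize 0.5*||w||^2 + C * sum(xi)
# subject to y_i * (w . x_i + b) >= 1 - xi_i, xi_i >= 0.
import cvxpy as cp
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0], [0.5, 1.8]])
y = np.array([-1, -1, +1, +1, -1])    # last sample overlaps class +1
C = 1.0

n, d = X.shape
w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("slacks:", np.round(xi.value, 3))   # nonzero for margin violators
```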
35
SVM Dual Formulation (soft margin)
• Given training data (x_i, y_i), i = 1, ..., n
• Find parameters (α_i*, b*) of an optimal hyperplane
  D(x) = sign( Σ_{i=1..n} α_i* y_i (x_i · x) + b* )
  as the solution of the maximization problem
  L(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i,j=1..n} α_i α_j y_i y_j (x_i · x_j) → max
  under the constraints Σ_{i=1..n} α_i y_i = 0,  0 ≤ α_i ≤ C, i = 1, ..., n
• Note: data samples with nonzero α_i* are SVs
• The formulation requires only the inner products (x · x′)
36
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion
37
Nonlinear Decision Boundary
• A fixed (linear) parameterization is too rigid
• A nonlinear (curved) decision boundary may yield a larger margin (falsifiability) and lower error
38
Nonlinear Mapping via Kernels
Nonlinear f(x, w) + margin-based loss = SVM
• Nonlinear mapping to feature (z) space: x → z = g(x), ŷ = (w · z)
• Linear in z-space ~ nonlinear in x-space
• But the dot product
  (z_i · z_j) = Σ_{k=1..m} g_k(x_i) g_k(x_j) = K(x_i, x_j)
  is a symmetric function of x_i and x_j
  → compute the dot product via the kernel analytically
39
Example of Kernel Function
• 2D input space: x = (x₁, x₂)
• Mapping to z-space (2nd-order polynomial):
  z₁ = 1,  z₂ = √2·x₁,  z₃ = √2·x₂,  z₄ = √2·x₁x₂,  z₅ = x₁²,  z₆ = x₂²
• One can show by direct substitution that for two input vectors u = (u₁, u₂) and v = (v₁, v₂), the dot product is calculated analytically:
  G(u) · G(v) = ((u · v) + 1)²
(numeric check in the sketch below)
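A quick numeric check of this identity (numpy only; the vectors u, v are arbitrary illustration values):

```python
# Verify that the explicit 2nd-order polynomial map g reproduces the
# kernel ((u . v) + 1)^2 as a plain dot product in z-space.
import numpy as np

def g(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, s * x1 * x2, x1**2, x2**2])

u = np.array([0.5, -1.0])
v = np.array([2.0, 0.3])

print(g(u) @ g(v))          # dot product in feature space: 2.89
print((u @ v + 1.0) ** 2)   # kernel in input space:        2.89
```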
40
SVM Formulation (with kernels)
• Replacing the dot product (z · z′) with a kernel K(x, x′) leads to:
• Given: training data (x_i, y_i), i = 1, ..., n, an inner-product kernel K(x, x′), and a regularization parameter C
• Find parameters (α_i*, b*) of an optimal hyperplane
  D(x) = sign( Σ_{i=1..n} α_i* y_i K(x_i, x) + b* )
  as the solution of the maximization problem
  L(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i,j=1..n} α_i α_j y_i y_j K(x_i, x_j) → max
  under the constraints Σ_{i=1..n} α_i y_i = 0,  0 ≤ α_i ≤ C/n, i = 1, ..., n
41
Examples of Kernels
A kernel K(x, x′) is a symmetric function satisfying general (Mercer's) conditions.
Examples of kernels for different mappings x → z:
• Polynomials of degree m:
  K(x, x′) = ((x · x′) + 1)^m
• RBF kernel (with width parameter γ):
  K(x, x′) = exp(−‖x − x′‖² / γ²)
• Neural networks, for given parameters (v, a):
  K(x, x′) = tanh(v (x · x′) + a)
  → automatic selection of the number of hidden units (SVs)
42
More on Kernels
• The kernel (Gram) matrix holds all the information (data + kernel):
  K(1,1) K(1,2) ... K(1,n)
  K(2,1) K(2,2) ... K(2,n)
  ...
  K(n,1) K(n,2) ... K(n,n)
  (see the sketch below)
• A kernel defines a distance in some feature space (aka kernel-induced feature space)
• The kernel parameter controls nonlinearity
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex structures (trees, sequences, sets, etc.)
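A sketch (numpy only; data and width are illustration values) of building the RBF kernel matrix:

```python
# RBF Gram matrix: K[i, j] = exp(-||x_i - x_j||^2 / width^2).
import numpy as np

def rbf_gram(X, width=1.0):
    # squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / width**2)

X = np.random.rand(5, 3)              # 5 samples, 3 features
K = rbf_gram(X, width=0.5)
print(K.shape, np.allclose(K, K.T))   # (5, 5) True  (symmetric)
```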
43
New insights provided by SVM
• Why can linear classifiers generalize?
  (1) the margin is large (relative to R)
  (2) the fraction of SVs is small
  (3) the ratio d/n is small
• SVM offers an effective way to control complexity (via margin + kernel selection), i.e. implementing (1) or (2) or both
• Requires common-sense parameter tuning
44
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
  - model selection
  - histogram of projections
  - SVM extensions and modifications
• SVM for Regression
• Summary and Discussion
45
SVM Model Selection
• The quality of SVM classifiers depends on proper tuning of the model parameters:
  - kernel type (poly, RBF, etc.)
  - kernel complexity parameter
  - regularization parameter C
• Note: the VC-dimension depends on both C and the kernel parameters
• These parameters are usually selected via cross-validation, by searching over a wide range of parameter values on the log scale (see the sketch below)
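A sketch of this search (assuming scikit-learn; the synthetic data set merely stands in for a real training set):

```python
# Grid search over C and the RBF parameter gamma, on a log scale,
# with k-fold cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=250, n_features=2, n_redundant=0,
                           n_informative=2, random_state=0)

param_grid = {"C": 10.0 ** np.arange(-1, 5),      # 0.1 ... 10000
              "gamma": 2.0 ** np.arange(-3, 4)}   # 2^-3 ... 2^3
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validation error: %.3f" % (1 - search.best_score_))
```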
46
SVM Example 1: Ripley's data
• Ripley's data set:
  - 250 training samples
  - SVM using the RBF kernel K(u, v) = exp(−γ‖u − v‖²)
  - model selection via 10-fold cross-validation
• Cross-validation error table:

  gamma    C=0.1   C=1     C=10    C=100   C=1000  C=10000
  γ=2^-3   98.4%   23.6%   18.8%   20.4%   18.4%   14.4%
  γ=2^-2   51.6%   22%     20%     20%     16%     14%
  γ=2^-1   33.2%   19.6%   18.8%   15.6%   13.6%   14.8%
  γ=2^0    28%     18%     16.4%   14%     12.8%   15.6%
  γ=2^1    20.8%   16.4%   14%     12.8%   16%     17.2%
  γ=2^2    19.2%   14.4%   13.6%   15.6%   15.6%   16%
  γ=2^3    15.6%   14%     15.6%   16.4%   18.4%   18.4%

• Optimal: C = 1,000, gamma = 2^0 = 1
  Note: there may be multiple optimal parameter values
47
Optimal SVM Model
• RBF SVM with optimal parameters C = 1,000, gamma = 1
• Test error is 9.8% (estimated using 1,000 test samples)
[Figure: training data, decision boundary, and margin borders in the (x1, x2) plane]
48
SVM Example 2: Noisy Hyperbolas
• Noisy Hyperbolas data set:
  - 100 training samples (50 per class)
  - 100 validation samples (used for parameter tuning)
• RBF SVM model vs. poly SVM model (5th degree)
• Which model is 'better'? Model interpretation?
[Figure: decision boundaries of the RBF SVM model (left) and the polynomial SVM model (right)]
49
SVM Example 3: handwritten digits
• MNIST handwritten digits (5 vs. 8) ~ high-dimensional data
  - 1,000 training samples (500 per class)
  - 1,000 validation samples (used for parameter tuning)
  - 1,866 test samples
• Each sample is a real-valued vector of size 28 × 28 = 784 (a 28-pixel by 28-pixel grey-scale image)
• RBF SVM: optimal parameter C = 1, with gamma selected on the validation set
50
How to visualize a high-dim SVM model?
• Histogram of projections for a linear SVM (see the sketch below):
  - project the training data onto the normal vector w of the SVM model
  - show a univariate histogram of the projected training samples
• On the histogram: '0' ~ decision boundary, −1/+1 ~ margins
• Similar histograms can be obtained for nonlinear SVM
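A sketch of this visualization (assuming scikit-learn and matplotlib, with synthetic data as a stand-in):

```python
# Histogram of projections of training data onto the normal direction w
# of a linear SVM; decision_function gives f(x) = w.x + b, so 0 marks the
# decision boundary and +/-1 mark the margins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

proj = clf.decision_function(X)
for label in (0, 1):
    plt.hist(proj[y == label], bins=30, alpha=0.5, label=f"class {label}")
for v in (-1, 0, 1):
    plt.axvline(v, linestyle="--", color="k")
plt.xlabel("projection f(x)")
plt.legend()
plt.show()
```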
51
Histogram of Projections for Digits Data
• Projections of the training/test data onto the normal direction of the RBF SVM decision boundary
[Figure: histogram of projections for the training data (left) and the test data (right)]
52
Practical Issues for SVM Classifiers
• Pre-processing: all inputs pre-scaled to the range [0, 1] or [−1, +1]
• Model selection (parameter tuning)
• SVM extensions:
  - multi-class problems
  - unbalanced data sets
  - unequal misclassification costs
53
SVM for multi-class problems
• Digit recognition ~ a ten-class problem:
  - estimate 10 binary classifiers (one digit vs. the rest)
• For prediction: a test input is applied to all 10 binary SVM classifiers, and the class with the largest SVM output value is selected (see the sketch below)
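A sketch of one-vs-rest prediction (assuming scikit-learn; the kernel parameters are illustration values):

```python
# Train 10 binary SVMs (one digit vs. the rest); for a test input,
# evaluate all 10 outputs f_k(x) and select the class with the largest one.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)   # 10-class digit data

models = [SVC(kernel="rbf", gamma=0.001, C=10).fit(X, np.where(y == k, 1, -1))
          for k in range(10)]

outputs = np.column_stack([m.decision_function(X[:50]) for m in models])
predictions = np.argmax(outputs, axis=1)
print("accuracy on first 50 samples:", np.mean(predictions == y[:50]))
```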
54
Unbalanced Settings and Unequal Costs
• Unbalanced settings:
  - different numbers of positive and negative samples
  - different prior probabilities for training/test data
• Different misclassification costs:
  - two types of errors, FP and FN
  - Cost(false positive) vs. Cost(false negative)
  - loss function: C_fp · P(false positive) + C_fn · P(false negative) → min
  - these 'costs' need to be specified a priori, based on application requirements
55
SVM Modifications
• The problem: how to modify the standard SVM formulation for unbalanced data + unequal costs?
• Modified formulation:
  (1/2)‖w‖² + C+ Σ_{i ∈ class +} ξ_i + C− Σ_{i ∈ class −} ξ_i → min
  where C+ ~ Cost(false negative) and C− ~ Cost(false positive)
• In practice, one needs to specify C+ and C− (see the sketch below)
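In an off-the-shelf library this corresponds to per-class penalties; a sketch assuming scikit-learn, where class_weight rescales C separately for each class (data and weights are illustration values):

```python
# Unequal misclassification costs via per-class penalties C_+ : C_- = 3 : 1;
# errors on class 1 cost three times more, shifting the boundary away from it.
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

clf = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 3.0})
clf.fit(X, y)
```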
56
Example: SVM with unequal costs
• Ripley's data set (as before), where:
  - negative samples ~ 'triangles'
  - given misclassification costs C+ : C− = 3 : 1
• Note: the boundary is shifted away from the positive samples
[Figure: training data and the shifted decision boundary in the (x1, x2) plane]
57
SVM Applications
• Handwritten digit recognition
• Face detection in unrestricted images
• Text/ document classification
• Image classification and retrieval
• …….
58
Handwritten Digit Recognition (mid-90's)
• Data set: postal (zip-code) images, segmented and cropped;
  ~7K training samples and 2K test samples
• Data encoding: 28 × 28 grey-scale pixel image
• Original motivation: compare SVM with a custom MLP network (LeNet) designed for this application
• Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per digit)
59
Digit Recognition Results
• Summary:
  - prediction accuracy better than custom NNs
  - accuracy does not depend on the kernel type
  - 100-400 support vectors per class (digit)
• More details:

  Type of kernel    No. of support vectors   Error %
  Polynomial        274                      4.0
  RBF               291                      4.1
  Neural network    254                      4.2

• ~80-90% of the SVs coincide (for different kernels)
• Reduced-set SVM (Burges, 1996): ~15 SVs per class
60
Document Classification (Joachims, 1998)
• The problem: classification of text documents in large databases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via feature selection), which relies on a good indexing scheme; this is time-consuming and costly
• Predictive learning approach (SVM): construct a classifier using all possible features (words)
• Document/text representation: individual words = input features (possibly weighted)
• SVM performance:
  - very promising (~90% accuracy vs. 80% by other classifiers)
  - most problems are linearly separable → use linear SVM
61
Image Data Mining (Chapelle et al., 1999)
• Classification of images in databases, for image indexing, etc.
• Data set: Corel photo images, 2,670 samples divided into 7 classes (airplanes, birds, fish, vehicles, etc.)
  Training data: 1,375 images; test data: 1,375 images (50%)
• MAIN ISSUE: invariant representation / data encoding
[Figure: example images from the data set]
62
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
  - SV regression formulation
  - dual optimization formulation
  - model selection
  - example: Boston Housing
• Summary and Discussion
63
General SVM Modeling Approach
1. For the linear model f(x, w) = (w · x) + b, minimize the SVM functional
   R_SVM(w, b, Z_n) = C · R_emp(w, b, Z_n) + (1/2)‖w‖²
   using the SVM loss suitable for the learning problem at hand
2. Transform (1) to the dual optimization formulation (using only dot products)
3. Use kernels to obtain a nonlinear version of (2)
Note: this approach is used for all learning problems; however, the tunable parameters of the margin-based loss differ across types of learning problems.
64
SVM Regression
• For the linear model f(x, w) = (w · x) + b, minimize the SVM functional
  R_SVM(w, b, Z_n) = C · R_emp(w, b, Z_n) + (1/2)‖w‖²
  where the empirical loss (for regression) is given by
  L_ε(y, f(x, ω)) = max(|y − f(x, ω)| − ε, 0)
• Two distinct ways to control model complexity:
  - by the value of C (with fixed epsilon)
  - by the value of epsilon (with fixed, large C)
• SVM regression tunes both epsilon and C for optimal performance (see the sketch below)
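A sketch of the two-parameter control (assuming scikit-learn; data and parameter values are illustration choices):

```python
# Both epsilon and C control SVR complexity; only samples outside the
# epsilon-tube become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))
y = X.ravel() + rng.normal(0, 0.2, size=40)   # noisy linear target

model = SVR(kernel="linear", C=10.0, epsilon=0.3).fit(X, y)
print("number of SVs:", len(model.support_))  # only tube violators
```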
65
Linear SVM regression
For the linear parameterization f(x, w) = (w · x) + b, the SVM regression functional is
  R_SVM(w, b, Z_n) = C · R_emp(w, b, Z_n) + (1/2)‖w‖² → min
where
  R_emp(w, b, Z_n) = (1/n) Σ_{i=1..n} L_ε(y_i, f(x_i, ω))
66
Direct Optimization Formulation
• Given training data (x_i, y_i), i = 1, ..., n
• Minimize
  (1/2)(w · w) + C Σ_{i=1..n} (ξ_i + ξ_i*)
  under the constraints
  y_i − (w · x_i) − b ≤ ε + ξ_i
  (w · x_i) + b − y_i ≤ ε + ξ_i*
  ξ_i ≥ 0, ξ_i* ≥ 0,  i = 1, ..., n
  (see the sketch below)
[Figure: ε-insensitive tube around the regression estimate, with slack variables ξ, ξ* for samples above/below the tube]
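A direct transcription of this program (assuming cvxpy; data and parameter values are illustration choices):

```python
# Primal SV regression: minimize 0.5*||w||^2 + C * sum(xi + xi_star)
# under the epsilon-tube constraints.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
y = X.ravel() + rng.normal(0, 0.2, size=20)

C, eps = 10.0, 0.3
n, d = X.shape
w, b = cp.Variable(d), cp.Variable()
xi = cp.Variable(n, nonneg=True)        # slack above the tube
xi_star = cp.Variable(n, nonneg=True)   # slack below the tube

residual = y - (X @ w + b)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_star))
constraints = [residual <= eps + xi, -residual <= eps + xi_star]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```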
67
Dual Formulation for SVM Regression
• Given training data (x_i, y_i), i = 1, ..., n, and the values of ε, C
• Find coefficients α_i, α_i*, i = 1, ..., n, which maximize
  L(α, α*) = Σ_{i=1..n} y_i (α_i − α_i*) − ε Σ_{i=1..n} (α_i + α_i*) − (1/2) Σ_{i,j=1..n} (α_i − α_i*)(α_j − α_j*)(x_i · x_j)
  under the constraints
  Σ_{i=1..n} α_i = Σ_{i=1..n} α_i*,  0 ≤ α_i ≤ C,  0 ≤ α_i* ≤ C
• This yields the following solution:
  f(x) = Σ_{i ∈ SV} (α_i − α_i*)(x_i · x) + b*
68
Example: RBF regression
• RBF estimate (dashed line) using ε = 0.16, C = 2000
• RBF model: f(x) = Σ_{j=1..m} w_j exp(−(x − c_j)² / (2σ²)), with m = 20 basis functions
• The SVM model uses only 5 SVs (out of the 40 points)
[Figure: training points and the SVM estimate with the ε-insensitive zone]
69
Example: decomposition of RBF model
The weighted sum of 5 RBF kernel functions gives the SVM model
[Figure: the five weighted kernel functions and their sum]
70
SVM Model Selection: General
• Setting/tuning of SVM hyper-parameters:
  - usually performed by experts
  - more recently, by non-expert practitioners
• Issues for SVM model selection:
  (1) parameters controlling the 'margin' size
  (2) kernel type and kernel complexity
• Strategies for model selection:
  - exhaustive search in the parameter space (via resampling)
  - efficient search using VC analytic bounds
  - rule-of-thumb analytic strategies (for a particular type of learning problem)
71
Model Selection: continued
• Parameters controlling the margin size:
  - for classification, parameter C
  - for regression, the value of epsilon
  - for single-class learning, the radius r
• Complexity control via the fraction of SVs (ν-SVM):
  - for classification, replace C with ν ∈ [0, 1]
  - for regression, specify the fraction of points allowed to lie outside the ε-insensitive zone
• For very sparse data (d/n ≫ 1) use linear SVM
72
Parameter Selection for SVM Regression
• Selection of parameter C:
  Recall the SVM solution f(x) = Σ_{i ∈ SV} (α_i − α_i*)(x_i · x) + b*,
  where 0 ≤ α_i ≤ C and 0 ≤ α_i* ≤ C, i = 1, ..., n; with bounded kernels (RBF) this suggests
  C ~ y_max − y_min (the range of the response values)
• Selection of ε: in general ε ~ σ (the noise level),
  but this does not reflect the dependency on sample size;
  for linear regression the estimation error behaves as σ²/n, suggesting ε ~ σ/√n
• The final prescription (see the sketch below):
  ε = 3σ √(ln n / n)
73
Effect of SVM parameters on test error
• Training data: univariate sinc target
  t(x) = sin(x)/x,  x ∈ [−10, 10]
  with additive Gaussian noise (sigma = 0.2)
  (a) small sample size 50   (b) large sample size 200
[Figure: prediction risk as a function of C/n and epsilon, for the two sample sizes]
74
SVM vs Regularization
• System imitation → SVM
• System identification → regularization
  But their risk functionals 'look similar':
  R_SVM(w, b) = C Σ_{i=1..n} L(y_i, f(x_i, ω)) + (1/2)‖w‖²
  R_reg(w, b) = Σ_{i=1..n} (y_i − f(x_i, ω))² + λ‖w‖²
• Recent claims: SVM = a special case of regularization
• These claims neglect the role of margin-based loss
75
Comparison for Classification
• Linear SVM vs. penalized LDA: the comparison is fair
• Data sets: small (20 samples per class) and large (100 samples per class)
76
Comparison results: classification
• Small sample size:
  - linear SVM yields a 0.5% - 1.1% error rate
  - penalized LDA yields a 2.8% - 3% error rate
• Large sample size:
  - linear SVM yields a 0.4% - 1.1% error rate
  - penalized LDA yields a 1.1% - 2.2% error rate
• Conclusion: margin-based complexity control is better than regularization
77
Comparison for regression
• Linear SVM vs. linear ridge regression
  Note: linear SVM has 2 parameters
• Sparse data set: 30 noisy samples, using the target function
  t(x) = 2x₁ + x₂ + 0·x₃ + 0·x₄ + 0·x₅,  x ∈ [0, 1]⁵
  corrupted with Gaussian noise, σ = 0.2
• Complexity control:
  - for RR, vary the regularization parameter λ
  - for SVM, vary the epsilon and C parameters
78
Complexity control for ridge regression
• Coefficient shrinkage for ridge regression
[Figure: coefficients w1-w5 vs. log(λ); the coefficients shrink toward zero as λ increases]
79
Complexity control for SVM
• Coefficient shrinkage for SVM:
  (a) vary C (epsilon = 0)   (b) vary epsilon (C = large)
[Figure: coefficients w1-w5 vs. log(n/C) (left) and vs. epsilon (right)]
80
Comparison: ridge regression vs SVM
• Sparse setting: n = 10, noise σ = 0.2
• Ridge regression: λ chosen by cross-validation
• SV regression: ε = 0.2, C selected by cross-validation
• Average risk (100 realizations): 0.44 (RR) vs. 0.37 (SVM)
[Figure: box plots of the risk (MSE), on a log scale, for ridge regression and SVM]
81
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion
82
Summary
• Direct approach → different formulations for different learning problems
• Margin-based loss: robust, controls complexity (falsifiability)
• SRM: a new type of structure
• Nonlinear feature selection (~ SVs): incorporated into model estimation
• Appropriate applications:
  - high-dimensional data
  - content-based / content-dependent