JBR 1

Support Vector Machines: Classification
Venables & Ripley, Section 12.5
CSU Hayward Statistics 6601
Joseph Rickert & Timothy McKusick
December 1, 2004
Support Vector Machine
What is the SVM? The SVM is a generalization of the Optimal Hyperplane Algorithm.

Why is the SVM important? It allows the use of more similarity measures than the OHA, and through the use of kernel methods it works with non-vector data.
Simple Linear Classifier
X = R^p; f(x) = w^T x + b. Each x ∈ X is classified into 2 classes labeled y ∈ {+1, −1}:
y = +1 if f(x) ≥ 0 and y = −1 if f(x) < 0
S = {(x1, y1), (x2, y2), ...}
Given S, the problem is to learn f (find w and b).
For each f, check whether all (xi, yi) are correctly classified, i.e. yi f(xi) ≥ 0.
Choose f so that the number of errors is minimized.
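The classifier above can be sketched in a few lines. The deck's own code is in R; this is an illustrative Python translation, with made-up weights and a made-up sample set, not a learned model:

```python
# Linear classifier sketch: f(x) = w^T x + b; predict y = +1 if f(x) >= 0, else -1.
# The weights w, b and the sample set S below are made up for illustration.

def f(w, x, b):
    """Decision function f(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, x, b):
    """Label +1 when f(x) >= 0, otherwise -1."""
    return 1 if f(w, x, b) >= 0 else -1

def n_errors(w, b, S):
    """Count pairs (x, y) in S that are misclassified, i.e. y * f(x) < 0."""
    return sum(1 for x, y in S if y * f(w, x, b) < 0)

S = [([2.0, 1.0], 1), ([-1.0, -2.0], -1), ([0.5, -1.5], -1)]
w, b = [1.0, 1.0], 0.0
print(classify(w, [2.0, 1.0], b))  # 1
print(n_errors(w, b, S))           # 0: all three points satisfy y * f(x) >= 0
```

Learning w and b (the hard part) is exactly what the optimization on the following slides does.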
But what if the training set is not linearly separable?
f(x) = w^T x + b defines two half-planes {x : f(x) ≥ 1} and {x : f(x) ≤ −1}.
Classify with the “hinge” loss function: c(f, x, y) = max(0, 1 − y f(x)).
Think of c(f, x, y) as the distance from the correct half-plane: if (x, y) is correctly classified with large confidence, then c(f, x, y) = 0.
(Figure: the two half-planes w^T x + b > 1 and w^T x + b < −1, with margin = 2/‖w‖.)
In terms of y f(x):
y f(x) ≥ 1: correct with large confidence
0 ≤ y f(x) < 1: correct with small confidence
y f(x) < 0: misclassified
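The hinge loss and its three regimes are easy to see in code. A minimal Python sketch (the deck uses R; `fx` stands for the value f(x)):

```python
def hinge(y, fx):
    """Hinge loss c(f, x, y) = max(0, 1 - y * f(x)), taking fx = f(x)."""
    return max(0.0, 1.0 - y * fx)

# The three regimes from the slide:
print(hinge(1, 2.0))   # 0.0 -> y*f(x) >= 1: correct with large confidence
print(hinge(1, 0.5))   # 0.5 -> 0 <= y*f(x) < 1: correct, but inside the margin
print(hinge(-1, 2.0))  # 3.0 -> y*f(x) < 0: misclassified, loss grows past 1
```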
SVMs combine the requirements of large margin and few misclassifications by solving the problem:

New formulation: minimize (1/2)‖w‖² + C Σᵢ c(f, xᵢ, yᵢ) with respect to w and b.
C is a parameter that controls the tradeoff between margin and misclassification: large C gives small margins, but more samples correctly classified with strong confidence.
Technical difficulty: the hinge loss function c(f, xᵢ, yᵢ) is not differentiable.
Even better formulation: use slack variables ξᵢ:
minimize (1/2)‖w‖² + C Σᵢ ξᵢ with respect to w, ξ and b
under the constraint ξᵢ ≥ c(f, xᵢ, yᵢ) (*)
But (*) is equivalent to, for i = 1...n:
ξᵢ ≥ 0 and ξᵢ − 1 + yᵢ(w^T xᵢ + b) ≥ 0
Solve this quadratic optimization problem with Lagrange multipliers.
Support Vectors: Lagrange Multiplier Formulation

Find the α that maximizes:
W(α) = −(1/2) Σᵢⱼ yᵢyⱼαᵢαⱼ xᵢᵀxⱼ + Σᵢ αᵢ
under the constraints:
Σᵢ yᵢαᵢ = 0 and 0 ≤ αᵢ ≤ C
The points with positive Lagrange multipliers, αᵢ > 0, are called support vectors. The set of support vectors contains all the information used by the SVM to learn a discrimination function.
(Figure labels: points with αᵢ = C, points with 0 < αᵢ < C, and points with αᵢ = 0.)
Kernel Methods: data are not represented individually, but only through a set of pairwise comparisons.

X is a set of objects (here, proteins); each object is represented by a sequence:
S = (aatcgagtcac, atggacgtct, tgcactact)
K =
1.0 0.5 0.3
0.5 1.0 0.6
0.3 0.6 1.0
Each number in the kernel matrix is a measure of the similarity or “distance” between two objects.
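Building such a matrix only needs a pairwise comparison function. A toy Python illustration: the bigram Jaccard similarity below is invented for the example and is not the measure behind the matrix on the slide:

```python
def bigrams(s):
    """Set of overlapping 2-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Toy similarity: shared bigrams over total bigrams (illustrative only)."""
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

def kernel_matrix(objects, k):
    """n objects -> n x n matrix with K[i][j] = k(objects[i], objects[j])."""
    return [[k(a, b) for b in objects] for a in objects]

seqs = ["aatcgagtcac", "atggacgtct", "tgcactact"]
K = kernel_matrix(seqs, jaccard)
# K is symmetric with 1.0 on the diagonal, like the matrix on the slide
```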
Kernels
Properties of kernels:
Kernels are measures of similarity: K(x, x′) is large when x and x′ are similar.
Kernels must be positive definite and symmetric.
For every kernel K there exist a Hilbert space F and a mapping Φ : X → F such that K(x, x′) = ⟨Φ(x), Φ(x′)⟩ for all x, x′ ∈ X. Hence all kernels can be thought of as dot products in some feature space.

Advantages of kernels:
Data of very different nature can be analyzed in a unified framework.
No matter what the objects are, n objects are always represented by an n × n matrix.
Many times it is easier to compare objects than to represent them numerically.
Complete modularity between the function that represents the data and the algorithm that analyzes the data.
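Both required properties of a kernel matrix can be checked mechanically. A sketch in Python: symmetry by direct comparison, and positive definiteness by attempting a Cholesky factorization, which succeeds exactly when the matrix is positive definite:

```python
import math

def is_symmetric(K, tol=1e-12):
    """K[i][j] must equal K[j][i] for all i, j."""
    n = len(K)
    return all(abs(K[i][j] - K[j][i]) <= tol for i in range(n) for j in range(n))

def is_positive_definite(K):
    """Try a Cholesky factorization; it succeeds iff K is positive definite."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][r] * L[j][r] for r in range(j))
            if i == j:
                d = K[i][i] - s
                if d <= 0.0:          # a non-positive pivot means not PD
                    return False
                L[i][i] = math.sqrt(d)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return True

# The kernel matrix from the previous slide passes both checks:
K = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.6], [0.3, 0.6, 1.0]]
```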
The “Kernel Trick”
Any algorithm for vector data that can be expressed in terms of dot products can be performed implicitly in the feature space associated with the kernel by replacing each dot product with the kernel representation
e.g. For some feature space F, let d(x, x′) = ‖Φ(x) − Φ(x′)‖. But
‖Φ(x) − Φ(x′)‖² = ⟨Φ(x), Φ(x)⟩ + ⟨Φ(x′), Φ(x′)⟩ − 2⟨Φ(x), Φ(x′)⟩
so d(x, x′) = (K(x, x) + K(x′, x′) − 2K(x, x′))^(1/2)
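This distance formula is directly computable from kernel evaluations alone. A small Python check: with the linear kernel k(u, v) = u·v the feature map is the identity, so the kernel distance must reduce to the ordinary Euclidean distance:

```python
import math

def kernel_distance(k, x, xp):
    """Feature-space distance via the kernel trick:
    d(x, x') = (k(x,x) + k(x',x') - 2*k(x,x'))**0.5"""
    return math.sqrt(k(x, x) + k(xp, xp) - 2.0 * k(x, xp))

def linear_kernel(u, v):
    """k(u, v) = u . v, whose feature map is the identity."""
    return sum(a * b for a, b in zip(u, v))

x, xp = [1.0, 2.0], [4.0, 6.0]
d = kernel_distance(linear_kernel, x, xp)
# For the linear kernel this is the Euclidean distance: sqrt(3^2 + 4^2) = 5
```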
Nonlinear Separation
Nonlinear kernel: when X is a vector space and the kernel is nonlinear, linear separation in the feature space F can be associated with nonlinear separation in X.
(Figure: a nonlinear boundary in X corresponds to a linear one in F.)
SVM with Kernel
Final formulation: Find the α that maximizes:
W(α) = −(1/2) Σᵢⱼ yᵢyⱼαᵢαⱼ k(xᵢ, xⱼ) + Σᵢ αᵢ
under the constraints: Σᵢ yᵢαᵢ = 0 and 0 ≤ αᵢ ≤ C
Find an index i with 0 < αᵢ < C and set:
b = yᵢ − Σⱼ yⱼαⱼ k(xᵢ, xⱼ)
The classification of a new object x ∈ X is then determined by the sign of the function
f(x) = Σᵢ yᵢαᵢ k(xᵢ, x) + b
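The final decision function f(x) has a simple shape once the αᵢ and b are known. A Python sketch with a radial kernel and made-up support vectors and multipliers (not fitted values, just the form of the sum):

```python
import math

def rbf_kernel(gamma):
    """Radial kernel k(u, v) = exp(-gamma * |u - v|^2)."""
    def k(u, v):
        return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))
    return k

def svm_decision(x, svs, ys, alphas, b, k):
    """f(x) = sum_i y_i * alpha_i * k(x_i, x) + b; the predicted label is sign(f)."""
    return sum(y * a * k(sv, x) for sv, y, a in zip(svs, ys, alphas)) + b

# Made-up support vectors and multipliers, only to show the shape of f(x):
svs = [[0.0, 0.0], [2.0, 2.0]]
ys = [-1, +1]
alphas = [1.0, 1.0]
k = rbf_kernel(0.25)

fx = svm_decision([2.0, 2.0], svs, ys, alphas, 0.0, k)
label = 1 if fx >= 0 else -1   # the query sits on the positive support vector
```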
iris data set (Anderson, 1935): 150 cases, 50 each of 3 species of iris
Example from page 48 of The e1071 Package.
First 10 lines of iris:

> iris
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
SVM ANALYSIS OF IRIS DATA
# SVM ANALYSIS OF IRIS DATA SET
# classification mode
# default with factor response:
model <- svm(Species ~ ., data = iris)
summary(model)

Call:
svm(formula = Species ~ ., data = iris)

Parameters:
  SVM-Type: C-classification
  SVM-Kernel: radial
  cost: 1
  gamma: 0.25

Number of Support Vectors: 51
( 8 22 21 )
Number of Classes: 3
Levels: setosa versicolor virginica

Note: cost is the parameter “C” in the Lagrange formulation; the radial kernel is exp(−γ|u − v|²).
Exploring the SVM Model
# test with training data
x <- subset(iris, select = -Species)
y <- Species
pred <- predict(model, x)

# Check accuracy:
table(pred, y)

# compute decision values:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4,]

            y
pred         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          2        48

     setosa/versicolor setosa/virginica versicolor/virginica
[1,]          1.196000         1.091667            0.6706543
[2,]          1.064868         1.055877            0.8482041
[3,]          1.181229         1.074370            0.6438237
[4,]          1.111282         1.052820            0.6780645
Visualize classes with MDS
# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(iris[,-5])),
     col = as.integer(iris[,5]),
     pch = c("o","+")[1:150 %in% model$index + 1])

[Figure: MDS plot of the 150 iris cases; x-axis cmdscale(dist(iris[, -5]))[,1] from −3 to 4, y-axis cmdscale(dist(iris[, -5]))[,2] from −1.0 to 1.0; support vectors plotted as “+”, other points as “o”.]

cmdscale: multidimensional scaling, or principal coordinates analysis
black: setosa, red: versicolor, green: virginica
iris split into training and test sets: the first 25 cases of each species form the training set

## SECOND SVM ANALYSIS OF IRIS DATA SET
## classification mode
# default with factor response
# Train with iris.train data
model.2 <- svm(fS.TR ~ ., data = iris.train)
# output from summary
summary(model.2)

Call:
svm(formula = fS.TR ~ ., data = iris.train)

Parameters:
  SVM-Type: C-classification
  SVM-Kernel: radial
  cost: 1
  gamma: 0.25

Number of Support Vectors: 32
( 7 13 12 )
Number of Classes: 3
Levels: setosa versicolor virginica
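The split described above (first 25 of each species to train, the rest to test) can be sketched outside R as well. A hypothetical Python stand-in where each row is just (species, index-within-species):

```python
# Stand-in for the 150-row iris data: 50 rows per species, in order.
# Rows here are (species, index-within-species); illustrative only.
iris_rows = [(sp, i) for sp in ("setosa", "versicolor", "virginica")
             for i in range(50)]

train = [r for r in iris_rows if r[1] < 25]    # first 25 of each species
test = [r for r in iris_rows if r[1] >= 25]    # remaining 25 of each species
# Both sets have 75 rows, 25 per species, so the class balance is preserved.
```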
iris test results
# test with iris.test data
x.2 <- subset(iris.test, select = -fS.TE)
y.2 <- fS.TE
pred.2 <- predict(model.2, x.2)
# Check accuracy:
table(pred.2, y.2)
# compute decision values and probabilities:
pred.2 <- predict(model.2, x.2, decision.values = TRUE)
attr(pred.2, "decision.values")[1:4,]

            y.2
pred.2       setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         25         0
  virginica       0          0        25

     setosa/versicolor setosa/virginica versicolor/virginica
[1,]          1.253378         1.086341            0.6065033
[2,]          1.000251         1.021445            0.8012664
[3,]          1.247326         1.104700            0.6068924
[4,]          1.164226         1.078913            0.6311566
iris training and test sets
[Figure: two MDS plots, training set (x-axis cmdscale(dist(iris.train[, -5]))[,1] from −2 to 4) and test set (x-axis cmdscale(dist(iris.test[, -5]))[,1] from −3 to 3), both with y-axis [,2] from −1.0 to 1.0; support vectors plotted as “+”, other points as “o”.]
Microarray Data
from Golub et al., “Molecular Classification of Cancer: Class Prediction by Gene Expression Monitoring,” Science, Vol. 286, 10/15/1999. Expression levels of predictive genes.
Rows: genes; columns: samples.
Expression levels (EL) of each gene are relative to the mean EL for that gene in the initial dataset: red if EL > mean, blue if EL < mean; the scale indicates above or below the mean.
Top panel: genes highly expressed in ALL; bottom panel: genes more highly expressed in AML.
Microarray Data Transposed: rows = samples, columns = genes

Training Data: 38 samples, 7129 × 38 matrix; ALL: 27, AML: 11
Test Data: 34 samples, 7129 × 34 matrix; ALL: 20, AML: 14

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,] -214 -153  -58   88 -295 -558  199 -176  252   206
[2,] -139 -73 -1 283 -264 -400 -330 -168 101 74
[3,] -76 -49 -307 309 -376 -650 33 -367 206 -215
[4,] -135 -114 265 12 -419 -585 158 -253 49 31
[5,] -106 -125 -76 168 -230 -284 4 -122 70 252
[6,] -138 -85 215 71 -272 -558 67 -186 87 193
[7,] -72 -144 238 55 -399 -551 131 -179 126 -20
[8,] -413 -260 7 -2 -541 -790 -275 -463 70 -169
[9,] 5 -127 106 268 -210 -535 0 -174 24 506
[10,] -88 -105 42 219 -178 -246 328 -148 177 183
[11,] -165 -155 -71 82 -163 -430 100 -109 56 350
[12,] -67 -93 84 25 -179 -323 -135 -127 -2 -66
[13,] -92 -119 -31 173 -233 -227 -49 -62 13 230
[14,] -113 -147 -118 243 -127 -398 -249 -228 -37 113
[15,] -107 -72 -126 149 -205 -284 -166 -185 1 -23
SVM ANALYSIS OF MICROARRAY DATA: classification mode

# default with factor response
y <- c(rep(0,27), rep(1,11))
fy <- factor(y, levels = 0:1)
levels(fy) <- c("ALL","AML")

# compute svm on first 3000 genes only because of memory overflow problems
model.ma <- svm(fy ~ ., data = fmat.train[,1:3000])

Call:
svm(formula = fy ~ ., data = fmat.train[, 1:3000])

Parameters:
  SVM-Type: C-classification
  SVM-Kernel: radial
  cost: 1
  gamma: 0.0003333333

Number of Support Vectors: 37
( 26 11 )
Number of Classes: 2
Levels: ALL AML
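The label construction in the R code above (27 ALL samples followed by 11 AML samples, as an ordered factor) has a direct Python counterpart, shown here only to make the class layout explicit:

```python
from collections import Counter

# Mirrors the R code: y <- c(rep(0,27), rep(1,11)); levels(fy) <- c("ALL","AML")
labels = ["ALL"] * 27 + ["AML"] * 11
counts = Counter(labels)
# 38 training samples in all: 27 ALL, then 11 AML
```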
Visualize Microarray Training Data with Multidimensional Scaling
# visualize Training Data
# (classes by color, SV by crosses)
# multidimensional scaling
pc <- cmdscale(dist(fmat.train[,1:3000]))
plot(pc, col = as.integer(fy),
     pch = c("o","+")[1:3000 %in% model.ma$index + 1],
     main = "Training Data ALL 'Black' and AML 'Red' Classes")

[Figure: “Training Data ALL 'Black' and AML 'Red' Classes”; x-axis pc[,1] from −40000 to 20000, y-axis pc[,2] from −40000 to 40000; nearly every point is a support vector, plotted as “+”.]
Check Model with Training Data; Predict Outcomes of Test Data

# check the training data
x <- fmat.train[,1:3000]
pred.train <- predict(model.ma, x)
# check accuracy:
table(pred.train, fy)

# classify the test data
y2 <- c(rep(0,20), rep(1,14))
fy2 <- factor(y2, levels = 0:1)
levels(fy2) <- c("ALL","AML")
x2 <- fmat.test[,1:3000]
pred <- predict(model.ma, x2)
# check accuracy:
table(pred, fy2)

            fy
pred.train   ALL AML
       ALL    27   0
       AML     0  11

      fy2
pred   ALL AML
  ALL   20  13
  AML    0   1

The training data are correctly classified, but the model is worthless so far: on the test set it labels 13 of the 14 AML samples as ALL.
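The test confusion table makes the "worthless" verdict quantitative. A small Python check of the accuracy implied by the table above (rows are predicted classes, columns are true classes):

```python
# The test-set confusion table from the slide (rows = predicted, cols = true):
conf = {"ALL": {"ALL": 20, "AML": 13},
        "AML": {"ALL": 0, "AML": 1}}

correct = sum(conf[c][c] for c in conf)                        # 20 + 1 = 21
total = sum(v for row in conf.values() for v in row.values())  # 34 test samples
accuracy = correct / total
# About 0.62, barely better than always predicting ALL (20/34 ~ 0.59),
# which is why the slide calls the model worthless so far.
```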
Conclusion:

The SVM appears to be a powerful classifier applicable to many different kinds of data. But:
Kernel design is a full-time job.
Selecting model parameters is far from obvious.
The math is formidable.