![Page 1: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/1.jpg)
Support Vector Machine Classification
Computation & Informatics in Biology & Medicine
Madison Retreat, November 15, 2002
Olvi L. Mangasarian
with
G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg
& Collaborators at ExonHit – Paris
Data Mining Institute
University of Wisconsin - Madison
![Page 2: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/2.jpg)
What is a Support Vector Machine?
An optimally defined surface
Linear or nonlinear in the input space
Linear in a higher dimensional feature space
Implicitly defined by a kernel function K(A,B)
![Page 3: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/3.jpg)
What are Support Vector Machines Used For?
Classification
Regression & Data Fitting
Supervised & Unsupervised Learning
![Page 4: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/4.jpg)
Principal Topics
Proximal support vector machine classification: classify by proximity to planes instead of halfspaces
Massive incremental classification: classify by retiring old data & adding new data
Knowledge-based classification: incorporate expert knowledge into a classifier
Fast Newton method classifier: finitely terminating fast algorithm for classification
Breast cancer prognosis & chemotherapy: classify patients on basis of distinct survival curves; isolate a class of patients that may benefit from chemotherapy
![Page 5: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/5.jpg)
Principal Topics
Proximal support vector machine classification
![Page 6: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/6.jpg)
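Support Vector Machines
Maximize the Margin between Bounding Planes
[Figure: point sets A+ and A- separated by the bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, with normal $w$ and margin $\frac{2}{\|w\|_2}$ between the planes.]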
![Page 7: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/7.jpg)
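Proximal Support Vector Machines
Maximize the Margin between Proximal Planes
[Figure: point sets A+ and A- clustered around the proximal planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, with normal $w$ and margin $\frac{2}{\|w\|_2}$ between the planes.]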
![Page 8: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/8.jpg)
Standard Support Vector Machine
Algebra of 2-Category Linearly Separable Case
Given m points in n-dimensional space, represented by an m-by-n matrix A.
Membership of each point $A_i$ in class +1 or -1 is specified by an m-by-m diagonal matrix D with +1 & -1 entries.
Separate by two bounding planes, $x'w = \gamma \pm 1$:
$$A_i w \ge \gamma + 1 \ \text{ for } D_{ii} = +1; \qquad A_i w \le \gamma - 1 \ \text{ for } D_{ii} = -1.$$
More succinctly:
$$D(Aw - e\gamma) \ge e,$$
where e is a vector of ones.
![Page 9: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/9.jpg)
Standard Support Vector Machine Formulation
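Margin is maximized by minimizing $\frac{1}{2}\|w,\gamma\|_2^2$.
Solve the quadratic program for some $\nu > 0$:
$$\min_{y,w,\gamma}\ \frac{\nu}{2}\|y\|_2^2 + \frac{1}{2}\|w,\gamma\|_2^2 \quad \text{s.t.}\quad D(Aw - e\gamma) + y \ge e \qquad \text{(QP)}$$
where $D_{ii} = \pm 1$ denotes A+ or A- membership.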
![Page 10: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/10.jpg)
Proximal SVM Formulation (PSVM)
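Standard SVM formulation, with the inequality constraint replaced by an equality:
$$\min_{y,w,\gamma}\ \frac{\nu}{2}\|y\|_2^2 + \frac{1}{2}\|w,\gamma\|_2^2 \quad \text{s.t.}\quad D(Aw - e\gamma) + y = e \qquad \text{(QP)}$$
This simple but critical modification changes the nature of the optimization problem tremendously!
Solving for $y$ in terms of $w$ and $\gamma$ gives:
$$\min_{w,\gamma}\ \frac{\nu}{2}\|e - D(Aw - e\gamma)\|_2^2 + \frac{1}{2}\|w,\gamma\|_2^2$$
(Regularized Least Squares or Ridge Regression)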
![Page 11: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/11.jpg)
Advantages of New Formulation
Objective function remains strongly convex.
An explicit exact solution can be written in terms of the problem data.
PSVM classifier is obtained by solving a single system of linear equations in the usually small dimensional input space.
Exact leave-one-out-correctness can be obtained in terms of problem data.
![Page 12: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/12.jpg)
Linear PSVM
We want to solve:
$$\min_{w,\gamma}\ \frac{\nu}{2}\|e - D(Aw - e\gamma)\|_2^2 + \frac{1}{2}\|w,\gamma\|_2^2$$
Setting the gradient equal to zero gives a nonsingular system of linear equations.
Solution of the system gives the desired PSVM classifier.
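As a sketch of the step from the gradient condition to the linear system, write $z = [w; \gamma]$ and $H = [A \ \ -e]$ (shorthand introduced here), so that $D(Aw - e\gamma) = DHz$. Using $D'D = I$:

```latex
\begin{aligned}
f(z) &= \frac{\nu}{2}\,\|e - DHz\|_2^2 + \frac{1}{2}\,\|z\|_2^2,\\
\nabla f(z) &= -\nu H'D\,(e - DHz) + z = 0\\
&\Longrightarrow\; \Big(\frac{I}{\nu} + H'H\Big)\, z = H'De .
\end{aligned}
```

For $\nu > 0$ the matrix $\frac{I}{\nu} + H'H$ is positive definite, which is why the system is nonsingular.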
![Page 13: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/13.jpg)
Linear PSVM Solution
Here, $H = [A \ \ -e]$:
$$\begin{bmatrix} w \\ \gamma \end{bmatrix} = \Big(\frac{I}{\nu} + H'H\Big)^{-1} H'De$$
The linear system to solve depends on $H'H$, which is of size $(n+1) \times (n+1)$.
$n$ is usually much smaller than $m$.
![Page 14: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/14.jpg)
Linear & Nonlinear PSVM MATLAB Code
    function [w, gamma] = psvm(A,d,nu)
    % PSVM: linear and nonlinear classification
    % INPUT: A, d=diag(D), nu. OUTPUT: w, gamma
    % [w, gamma] = psvm(A,d,nu);
    [m,n]=size(A); e=ones(m,1); H=[A -e];
    v=(d'*H)'                    % v=H'*D*e;
    r=(speye(n+1)/nu+H'*H)\v     % solve (I/nu+H'*H)r=v
    w=r(1:n); gamma=r(n+1);      % getting w,gamma from r
![Page 15: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/15.jpg)
Numerical experiments
One-Billion Two-Class Dataset
Synthetic dataset consisting of 1 billion points in 10-dimensional input space
Generated by NDC (Normally Distributed Clustered) dataset generator
Dataset divided into 500 blocks of 2 million points each.
Solution obtained in less than 2 hours and 26 minutes on a 400 MHz machine
About 30% of the time was spent reading data from disk.
Testing set correctness: 90.79%
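One plausible way to process such a dataset block by block is to accumulate only the small matrix $H'H$ and the vector $H'De$, since both are sums over data rows. The following is a minimal sketch under that assumption, not the code used for the experiment; the function names `solve`, `psvm_blocks` and `classify` are hypothetical, and it uses plain Python lists so that it stays self-contained.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (aug[i][k] - s) / aug[i][i]
    return x


def psvm_blocks(blocks, nu):
    """Linear PSVM from an iterable of (A_block, d_block) pairs.

    Only the (n+1)x(n+1) matrix H'H and the vector H'De are kept in memory,
    so each block can be discarded after it has been read.
    """
    HtH, Htd, k = None, None, 0
    for A, d in blocks:
        for row, label in zip(A, d):
            h = list(row) + [-1.0]          # one row of H = [A  -e]
            if HtH is None:
                k = len(h)
                HtH = [[0.0] * k for _ in range(k)]
                Htd = [0.0] * k
            for i in range(k):
                Htd[i] += label * h[i]      # accumulates H'De
                for j in range(k):
                    HtH[i][j] += h[i] * h[j]
    for i in range(k):
        HtH[i][i] += 1.0 / nu               # I/nu + H'H
    r = solve(HtH, Htd)
    return r[:-1], r[-1]                    # w, gamma


def classify(x, w, gamma):
    """Return +1 or -1 according to the sign of x'w - gamma."""
    value = sum(xi * wi for xi, wi in zip(x, w)) - gamma
    return 1 if value >= 0 else -1
```

A call such as `psvm_blocks([(A1, d1), (A2, d2)], nu)` gives the same system as stacking the blocks into one matrix, because the accumulated sums do not depend on how the rows are grouped.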
![Page 16: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/16.jpg)
Principal Topics
Knowledge-based classification (NIPS*2002)
![Page 17: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/17.jpg)
Conventional Data-Based SVM
![Page 18: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/18.jpg)
Knowledge-Based SVM via Polyhedral Knowledge Sets
![Page 19: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/19.jpg)
Incorporating Knowledge Sets Into an SVM Classifier
Suppose that the knowledge set $\{x \mid Bx \le b\}$ belongs to the class A+. Hence it must lie in the halfspace $\{x \mid x'w \ge \gamma + 1\}$.
We therefore have the implication:
$$Bx \le b \;\Rightarrow\; x'w \ge \gamma + 1$$
This implication is equivalent to a set of constraints that can be imposed on the classification problem.
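One standard way to turn the implication $Bx \le b \Rightarrow x'w \ge \gamma + 1$ into constraints is linear programming duality. Assuming the knowledge set $\{x \mid Bx \le b\}$ is nonempty, the implication holds exactly when some multiplier vector $u$ (a symbol introduced here) satisfies:

```latex
B'u + w = 0, \qquad b'u + \gamma + 1 \le 0, \qquad u \ge 0 .
```

The reasoning is that the implication says $\min \{x'w \mid Bx \le b\} \ge \gamma + 1$, and the dual of that linear program is $\max \{-b'u \mid B'u + w = 0,\ u \ge 0\}$. The resulting conditions are linear in $(w, \gamma, u)$, so they can be added to the classification problem alongside the data constraints.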
![Page 20: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/20.jpg)
Numerical Testing
The Promoter Recognition Dataset
Promoter: Short DNA sequence that precedes a gene sequence.
A promoter consists of 57 consecutive DNA nucleotides belonging to {A,G,C,T}.
Important to distinguish between promoters and nonpromoters.
This distinction identifies starting locations of genes in long uncharacterized DNA sequences.
![Page 21: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/21.jpg)
The Promoter Recognition Dataset
Numerical Representation
Simple “1 of N” mapping scheme for converting nominal attributes into a real-valued representation.
Not the most economical representation, but commonly used.
![Page 22: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/22.jpg)
The Promoter Recognition Dataset
Numerical Representation
Feature space mapped from a 57-dimensional nominal space to a real-valued 57 x 4 = 228 dimensional space.
57 nominal values → 57 x 4 = 228 binary values
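A “1 of N” mapping of this kind can be sketched as follows. The column order A, G, C, T and the function name `one_of_n` are assumptions made for illustration, since the slide's mapping table was not recoverable.

```python
NUCLEOTIDES = "AGCT"  # assumed column order, not taken from the slide


def one_of_n(sequence):
    """Map a nucleotide string to a flat list of 0/1 values, 4 per position."""
    encoded = []
    for base in sequence:
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        # Exactly one of the four slots is 1 for each position.
        encoded.extend(1 if base == b else 0 for b in NUCLEOTIDES)
    return encoded
```

A 57-nucleotide sequence therefore yields a vector of length 228 containing exactly 57 ones.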
![Page 23: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/23.jpg)
Promoter Recognition Dataset
Prior Knowledge Rules
Prior knowledge consists of the following 64 rules:
$$(R1 \lor R2 \lor R3 \lor R4) \;\land\; (R5 \lor R6 \lor R7 \lor R8) \;\land\; (R9 \lor R10 \lor R11 \lor R12) \;\Rightarrow\; \text{PROMOTER}$$
![Page 24: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/24.jpg)
Promoter Recognition Dataset
Sample Rules
$R4: (p_{-36} = T) \land (p_{-35} = T) \land (p_{-34} = G) \land (p_{-33} = A) \land (p_{-32} = C);$
$R8: (p_{-12} = T) \land (p_{-11} = A) \land (p_{-07} = T);$
$R10: (p_{-45} = A) \land (p_{-44} = A) \land (p_{-41} = A);$
where $p_j$ denotes the position of a nucleotide with respect to a meaningful reference point, starting at position $p_{-50}$ and ending at position $p_7$.
Then:
$$R4 \land R8 \land R10 \;\Rightarrow\; \text{PROMOTER}$$
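The sample rules can be checked mechanically against a 57-character sequence. This is a small sketch, assuming that position labels run from -50 to -1 and from 1 to 7 with no position 0 (so that 57 labels cover 57 nucleotides); the helper names `index_of`, `satisfies` and `is_promoter_by_sample_rules` are hypothetical.

```python
def index_of(position):
    """Map a position label in -50..-1 or 1..7 to a 0-based string index.

    Assumes there is no position 0, so the labels cover exactly 57 slots.
    """
    if -50 <= position <= -1:
        return position + 50
    if 1 <= position <= 7:
        return position + 49
    raise ValueError(f"position out of range: {position}")


# Sample rules from the slide, written as {position: required nucleotide}.
R4 = {-36: "T", -35: "T", -34: "G", -33: "A", -32: "C"}
R8 = {-12: "T", -11: "A", -7: "T"}
R10 = {-45: "A", -44: "A", -41: "A"}


def satisfies(sequence, rule):
    """True when every position named by the rule holds the required nucleotide."""
    return all(sequence[index_of(p)] == base for p, base in rule.items())


def is_promoter_by_sample_rules(sequence):
    """Check the single conjunction R4 and R8 and R10, one of the 64 rules."""
    return all(satisfies(sequence, rule) for rule in (R4, R8, R10))
```

The full prior knowledge would take the disjunction over all 64 such conjunctions; only the three rules quoted on the slide are encoded here.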
![Page 25: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/25.jpg)
The Promoter Recognition Dataset
Comparative Algorithms
KBANN: Knowledge-based artificial neural network [Shavlik et al]
BP: Standard back propagation for neural networks [Rumelhart et al]
O’Neill’s Method: Empirical method suggested by biologist O’Neill [O’Neill]
NN: Nearest neighbor with k=3 [Cost et al]
ID3: Quinlan’s decision tree builder [Quinlan]
SVM1: Standard 1-norm SVM [Bradley et al]
![Page 26: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/26.jpg)
The Promoter Recognition Dataset
Comparative Test Results
![Page 27: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/27.jpg)
Wisconsin Breast Cancer Prognosis Dataset
Description of the data
110 instances corresponding to 41 patients whose cancer had recurred and 69 patients whose cancer had not recurred
32 numerical features
The domain theory: two simple rules used by doctors:
![Page 28: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/28.jpg)
Wisconsin Breast Cancer Prognosis Dataset
Numerical Testing Results
Doctors’ rules applicable to only 32 out of 110 patients.
Only 22 of these 32 patients are classified correctly by the rules (22/110 = 20% correctness over all 110 patients).
KSVM linear classifier applicable to all patients with correctness of 66.4%.
Correctness comparable to best available results using conventional SVMs.
KSVM can get classifiers based on knowledge without using any data.
![Page 29: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/29.jpg)
Principal Topics
Fast Newton method classifier
![Page 30: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/30.jpg)
Fast Newton Algorithm for Classification
Standard quadratic programming (QP) formulation of SVM:
![Page 31: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/31.jpg)
Newton Algorithm
$$f(z) = \frac{\nu}{2}\,\|(e - D(Aw - e\gamma))_+\|_2^2 + \frac{1}{2}\,\|w,\gamma\|_2^2$$
$$z^{i+1} = z^i - \partial^2 f(z^i)^{-1}\,\nabla f(z^i)$$
Newton algorithm terminates in a finite number of steps
Termination at global minimum
Error rate decreases linearly
Can generate complex nonlinear classifiers
By using nonlinear kernels: K(x,y)
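For the linear case, the quantities in the Newton step can be written out explicitly. This is a sketch, assuming $z = [w; \gamma]$ and $H = [A \ \ -e]$ as in the PSVM slides, with $(\cdot)_+$ replacing negative components by zero and $(\cdot)_*$ denoting the componentwise step function (1 for positive entries, 0 otherwise), a symbol introduced here:

```latex
\begin{aligned}
f(z) &= \frac{\nu}{2}\,\|(e - DHz)_+\|_2^2 + \frac{1}{2}\,\|z\|_2^2,\\
\nabla f(z) &= -\nu\, H'D\,(e - DHz)_+ + z,\\
\partial^2 f(z) &= I + \nu\, H'\,\mathrm{diag}\big((e - DHz)_*\big)\,H .
\end{aligned}
```

Because $f$ is only once differentiable, $\partial^2 f$ is a generalized Hessian rather than an ordinary one. It is positive definite for $\nu > 0$, so each Newton step requires solving one $(n+1) \times (n+1)$ linear system.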
![Page 32: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/32.jpg)
Nonlinear Spiral Dataset
94 Red Dots & 94 White Dots
![Page 33: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/33.jpg)
Principal Topics
Breast cancer prognosis & chemotherapy
![Page 34: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/34.jpg)
Kaplan-Meier Curves for Overall Patients:
With & Without Chemotherapy
![Page 35: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/35.jpg)
Breast Cancer Prognosis & Chemotherapy
Good, Intermediate & Poor Patient Groupings
(6 Input Features: 5 Cytological, 1 Histological)
(Grouping: Utilizes 2 Histological Features & Chemotherapy)
![Page 36: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/36.jpg)
Kaplan-Meier Survival Curves
for Good, Intermediate & Poor Patients
82.7% Classifier Correctness via 3 SVMs
![Page 37: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/37.jpg)
Kaplan-Meier Survival Curves for Intermediate Group
Note Reversed Role of Chemotherapy
![Page 38: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/38.jpg)
Conclusion
New methods for classification
All based on rigorous mathematical foundation
Fast computational algorithms capable of classifying massive datasets
Classifiers based on both abstract prior knowledge as well as conventional datasets
Identification of breast cancer patients that can benefit from chemotherapy
![Page 39: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/39.jpg)
Future Work
Extend proposed methods to broader optimization problems
Linear & quadratic programming
Preliminary results beat state-of-the-art software
Incorporate abstract concepts into optimization problems as constraints
Develop fast online algorithms for intrusion and fraud detection
Classify the effectiveness of new drug cocktails in combating various forms of cancer
Encouraging preliminary results for breast cancer
![Page 40: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/40.jpg)
Breast Cancer Treatment Response
Joint with ExonHit (French BioTech)
35 patients treated by a drug cocktail
9 partial responders; 26 nonresponders
25 gene expression measurements made on each patient
1-Norm SVM classifier selected: 12 out of 25 genes
Combinatorially selected 6 genes out of 12
Separating plane obtained:
2.7915 T11 + 0.13436 S24 - 1.0269 U23 - 2.8108 Z23 - 1.8668 A19 - 1.5177 X05 + 2899.1 = 0
Leave-one-out error: 1 out of 35 (97.1% correctness)
![Page 41: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/41.jpg)
Detection of Alternative RNA Isoforms via DATAS
(Levels of mRNA that Correlate with Sensitivity to Chemotherapy)
[Figure: DNA with exons E1 to E5 and introns I1 to I4 is transcribed into pre-mRNA (m = messenger); alternative RNA splicing yields mRNA isoforms with and without exon E3; translation gives chemo-sensitive and chemo-resistant proteins; DATAS detects the differing exon E3.]
DATAS: Differential Analysis of Transcripts with Alternative Splicing
![Page 42: Olvi L. Mangasarian with G. M. Fung, Y.-J. Lee, J.W. Shavlik, W. H. Wolberg](https://reader030.vdocuments.us/reader030/viewer/2022032612/5681306d550346895d964e91/html5/thumbnails/42.jpg)
Talk Available
www.cs.wisc.edu/~olvi