Feature selection, SVM-based classification and application to mass spectrometry data analysis
Elena Marchiori
Department of Computer Science
Vrije Universiteit Amsterdam
Overview
• Support Vector Machines
• Variable selection
• Application in Bioinformatics
Support Vector Machines
• Advantages:
– maximize the margin between two classes in the feature space characterized by a kernel function
– are robust with respect to high input dimension
• Disadvantages:
– difficult to incorporate background knowledge
– sensitive to outliers
Linear Separators
Hyperplane Classifiers
$$w \cdot x_i + b \ge +1 \quad \text{for } y_i = +1$$
$$w \cdot x_i + b \le -1 \quad \text{for } y_i = -1$$
SVM
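• To construct the optimal hyperplane
– Minimize $\tau(w) = \frac{1}{2}\|w\|^2$
– Subject to $y_i((w \cdot x_i) + b) \ge 1, \quad i = 1, \dots, l$
• Constrained optimization problem with Lagrangian
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \left( y_i((x_i \cdot w) + b) - 1 \right)$$
$$\frac{\partial}{\partial b} L(w, b, \alpha) = 0, \qquad \frac{\partial}{\partial w} L(w, b, \alpha) = 0$$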
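SVM
– Derivatives of the Lagrangian with respect to the primal variables w and b vanish:
$$\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad w = \sum_{i=1}^{l} \alpha_i y_i x_i$$
• KKT condition
$$\alpha_i \left[ y_i((x_i \cdot w) + b) - 1 \right] = 0, \quad i = 1, \dots, l$$
• Support vectors: the $x_i$ whose $\alpha_i$ is nonzero
– Optimization problem
• Maximize
$$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
• Subject to
$$\alpha_i \ge 0, \quad i = 1, \dots, l, \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0$$
• Decision function
$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i (x \cdot x_i) + b \right)$$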
SVM: separable classes
[Figure: optimal hyper-plane, margin ρ, support vectors]
Support vectors uniquely characterize optimal hyper-plane
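A small sketch of this statement, assuming scikit-learn's `SVC` with a large C as a stand-in for the hard-margin SVM. The six training points are made up for illustration. Retraining on the support vectors alone should recover (up to solver tolerance) the same w and b as training on all points.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (hypothetical): three points per class.
X = np.array([[0.0, 0.0], [-1.0, 0.0], [0.0, -1.0],
              [2.0, 2.0], [3.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C approximates the hard-margin SVM on separable data.
full = SVC(kernel="linear", C=1e3).fit(X, y)
sv = full.support_                      # indices of the support vectors

# Retrain using the support vectors only.
reduced = SVC(kernel="linear", C=1e3).fit(X[sv], y[sv])

print("support vectors:", X[sv].tolist())
print("w, b (all data):       ", full.coef_.ravel(), full.intercept_)
print("w, b (support vectors):", reduced.coef_.ravel(), reduced.intercept_)
```

The points that are not support vectors can be dropped without moving the hyper-plane, which is what the slide title claims.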
SVM and outliers
[Figure: training set with an outlier]
Soft Margin Classification
• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
[Figure: slack variables ξj and ξk for two examples]
Weakening the constraints
Allow objects that do not strictly obey the constraints
Introduce slack variables ξi
SVC with slacks
The optimization problem changes into:
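In the standard soft-margin formulation (assumed here; w, b, ξi and C are the symbols used on the neighbouring slides, and l is the number of training examples), the problem reads:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$$
$$\text{subject to} \quad y_i((w \cdot x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l$$

Setting every ξi to zero gives back the hard-margin constraints of the separable case.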
Tradeoff parameter C
Notice that the tradeoff parameter C has to be defined beforehand.
It weighs the contributions between the training error and the structural error.
Its value is often optimized using cross-validation.
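A minimal sketch of choosing C by cross-validation, assuming scikit-learn is available. The synthetic data, the grid of candidate values and the 5-fold split are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data stands in for a real training set (assumption).
X, y = make_classification(n_samples=80, n_features=10,
                           n_informative=3, random_state=0)

# Candidate values of C on a log scale; the grid itself is arbitrary.
grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation picks the C with the best mean accuracy.
search = GridSearchCV(SVC(kernel="linear"), grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["C"]
print("selected C:", best_C)
print("mean CV accuracy:", round(search.best_score_, 3))
```

A small C tolerates more training errors in exchange for a wider margin, while a large C penalizes slack heavily.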
Influence of C
Erroneous objects can still have a (large) influence on the solution
Classifying new examples
• Once the parameters (α*, b*) are found by solving the required quadratic optimisation on the training set of points, the SVM is ready to be used for classifying new points.
• Given a new point x, its class membership is sign[f(x, α*, b*)], where
$$f(x, \alpha^*, b^*) = w^* \cdot x + b^* = \sum_{i=1}^{N} \alpha_i^* y_i \, x_i \cdot x + b^* = \sum_{i \in SV} \alpha_i^* y_i \, x_i \cdot x + b^*$$
Data enters only in the form of dot products!
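The sum over support vectors can be checked numerically. The sketch below assumes scikit-learn's `SVC`, whose `dual_coef_` attribute stores the products α_i y_i for the support vectors. The data and the query point `x_new` are made up.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters (assumption); labels are mapped to -1/+1 internally.
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([1.0, 2.0])                 # hypothetical new point

# dual_coef_ holds alpha_i * y_i, one entry per support vector.
alpha_y = clf.dual_coef_.ravel()
dots = clf.support_vectors_ @ x_new          # x_i . x for each support vector
f_manual = alpha_y @ dots + clf.intercept_[0]

# Reference value computed by the library itself.
f_sklearn = clf.decision_function(x_new.reshape(1, -1))[0]
print(f_manual, f_sklearn)
```

Only dot products between `x_new` and the support vectors are needed, which is the property the kernel trick relies on.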
Non-linear SVMs
• Datasets that are linearly separable with some noise work out great:
• But what are we going to do if the dataset is just too hard?
• How about… mapping data to a higher-dimensional space:
[Figures: one-dimensional data along the x axis; the same data plotted with axes x and x²]
Non-linear SVMs: Feature Spaces
• Map the original feature space to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
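The “Kernel Trick”
• The linear classifier relies on the inner product between vectors: $K(x_i, x_j) = x_i^T x_j$
• If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
• A kernel function is some function that corresponds to an inner product in some expanded feature space.
• Example: 2-dimensional vectors $x = [x_1 \; x_2]$; let $K(x_i, x_j) = (1 + x_i^T x_j)^2$.
Need to show that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$:
$$K(x_i, x_j) = (1 + x_i^T x_j)^2 = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$$
$$= [1 \;\; x_{i1}^2 \;\; \sqrt{2}\, x_{i1} x_{i2} \;\; x_{i2}^2 \;\; \sqrt{2}\, x_{i1} \;\; \sqrt{2}\, x_{i2}]^T \, [1 \;\; x_{j1}^2 \;\; \sqrt{2}\, x_{j1} x_{j2} \;\; x_{j2}^2 \;\; \sqrt{2}\, x_{j1} \;\; \sqrt{2}\, x_{j2}]$$
$$= \varphi(x_i)^T \varphi(x_j), \quad \text{where } \varphi(x) = [1 \;\; x_1^2 \;\; \sqrt{2}\, x_1 x_2 \;\; x_2^2 \;\; \sqrt{2}\, x_1 \;\; \sqrt{2}\, x_2]$$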
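The identity can also be checked numerically. This is a minimal sketch using NumPy; the function names `poly_kernel` and `phi` are introduced here for illustration, and the two test vectors are random.

```python
import numpy as np

def poly_kernel(xi, xj):
    """K(xi, xj) = (1 + xi^T xj)^2, computed in the 2-D input space."""
    return (1.0 + xi @ xj) ** 2

def phi(x):
    """Explicit 6-D feature map from the example above."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)

lhs = poly_kernel(xi, xj)        # kernel evaluated directly
rhs = phi(xi) @ phi(xj)          # inner product in the expanded space
print(lhs, rhs)
```

The kernel needs one 2-D dot product, while the explicit route builds two 6-D vectors first; both give the same number.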
Examples of kernels
• Example 1: 2D input space, 3D feature space
$$K(x_i, x_j) = (x_i \cdot x_j)^2, \qquad \varphi(x) = \left( x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2 \right)$$
• Example 2:
$$K(x_i, x_j) = \exp\{ -\|x_i - x_j\|^2 / 2\sigma^2 \}$$
in this case the dimension of φ is infinite
• Note: Not every function is a proper kernel. Mercer's theorem characterises proper kernels.
• To test a new input x when working with kernels
$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right)$$
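One consequence of Mercer's condition is that the Gram matrix of a proper kernel on any finite set of points has no negative eigenvalues. The sketch below, using NumPy on random points, compares the Gaussian kernel with a function chosen here as a counterexample (the negative squared distance). Such a finite check can refute a candidate kernel but cannot prove that it is proper.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix of K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def neg_sq_dist_gram(X):
    """Gram matrix of -||xi - xj||^2, used as an improper 'kernel'."""
    return -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # random points, for illustration only

# eigvalsh is appropriate because both matrices are symmetric.
min_eig_rbf = np.linalg.eigvalsh(rbf_gram(X)).min()
min_eig_bad = np.linalg.eigvalsh(neg_sq_dist_gram(X)).min()
print("smallest eigenvalue, Gaussian kernel:", min_eig_rbf)
print("smallest eigenvalue, negative squared distance:", min_eig_bad)
```

The second matrix has a zero diagonal, so its eigenvalues sum to zero and at least one of them must be negative.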
SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
• The most popular optimization algorithms for SVMs are SMO [Platt ’99] and SVMlight [Joachims ’99]; both use decomposition to hill-climb over a subset of αi’s at a time.
• Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.
Variable Selection
• Select a subset of “relevant” input variables
• Advantages:
– it is cheaper to measure fewer variables
– the resulting classifier is simpler and potentially faster
– prediction accuracy may improve by discarding irrelevant variables
– identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)
Approaches
• Wrapper
– feature selection takes into account the contribution to the performance of a given type of classifier
• Filter
– feature selection is based on an evaluation criterion for quantifying how well features (or feature subsets) discriminate the two classes
• Embedded
– feature selection is part of the training procedure of a classifier (e.g. decision trees)
SVM-RFE: wrapper
• Recursive Feature Elimination:
– Train linear SVM -> linear decision function
– Use absolute value of variable weights to rank variables
– Remove the half of the variables with the lowest rank
– Repeat above steps (train, rank, remove) on data restricted to variables not removed
• Output: subset of variables
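The train, rank, remove loop can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code behind the slides: it uses scikit-learn's `SVC` with a linear kernel, synthetic data, and a hypothetical stopping parameter `n_keep` for the size of the final subset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep, C=1.0):
    """Halve the variable set repeatedly; return indices of survivors.

    n_keep (assumed >= 1) is the number of variables to stop at.
    """
    active = np.arange(X.shape[1])            # variables still in play
    while active.size > n_keep:
        clf = SVC(kernel="linear", C=C).fit(X[:, active], y)
        scores = np.abs(clf.coef_).ravel()    # |w_i| ranks variable i
        # Keep the better half, but never go below n_keep variables.
        n_next = max(n_keep, active.size // 2)
        order = np.argsort(scores)[::-1]      # highest score first
        active = np.sort(active[order[:n_next]])
    return active

# Synthetic data: 32 variables, 4 of them informative (assumption).
X, y = make_classification(n_samples=60, n_features=32, n_informative=4,
                           n_redundant=0, random_state=0)
selected = svm_rfe(X, y, n_keep=4)
print("selected variables:", selected)
```

Retraining after every removal matters because the weights of the remaining variables change once correlated variables are gone.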
SVM-RFE
• Linear binary classifier decision function
$$f(x_1, \dots, x_N) = \sum_{i=1}^{N} w_i x_i + b$$
$$|w_i| = \text{score of variable } x_i$$
• Recursive Feature Elimination (SVM-RFE) – at each iteration:
1) eliminate threshold% of variables with lower score
2) recompute scores of remaining variables
SVM-RFE: I. Guyon et al., Machine Learning, 46, 389-422, 2002
RELIEF: filter
• Idea: relevant variables make nearest examples of the same class closer and nearest examples of opposite classes farther apart.
• Algorithm RELIEF:
1. Initialize weights of variables to zero.
2. For all examples in training set:
– find nearest example from same (hit) and opposite class (miss)
– update weight of variable by adding abs(example - miss) - abs(example - hit)
3. Rank variables using weights
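The three steps above can be sketched with NumPy. Two details are assumptions of this sketch because the slide does not fix them: nearest neighbours are found with the L1 (Manhattan) distance, and every class is assumed to contain at least two examples. The toy data is constructed so that variable 0 separates the classes and variable 1 is noise.

```python
import numpy as np

def relief(X, y):
    """Return one weight per variable; larger means more relevant."""
    n, d = X.shape
    w = np.zeros(d)                               # step 1: zero weights
    for i in range(n):                            # step 2: every example
        dist = np.abs(X - X[i]).sum(axis=1)       # L1 distance (assumed)
        dist[i] = np.inf                          # skip the example itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest, same class
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest, other class
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = np.column_stack([y.astype(float),             # variable 0: class label
                     rng.uniform(size=40)])       # variable 1: pure noise
w = relief(X, y)
ranking = np.argsort(w)[::-1]                     # step 3: rank by weight
print("weights:", w, "ranking:", ranking)
```

Because no classifier is trained, the ranking does not depend on the classifier that is applied afterwards, which is what makes RELIEF a filter method.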
Application in Bioinformatics
Biomarker detection with Mass Spectrometric data of mixed quality
What does a mass spectrometer do?
1. It measures mass better than any other technique.
2. It can give information about chemical structures.
What are mass measurements good for?
To identify, verify, and quantitate: metabolites, recombinant proteins, proteins isolated from natural sources, oligonucleotides, drug candidates, peptides, synthetic organic chemicals, polymers
Slides from University of California San Francisco
Applications of Mass Spectrometry
• Pharmaceutical analysis
– Bioavailability studies
– Drug metabolism studies, pharmacokinetics
– Characterization of potential drugs
– Drug degradation product analysis
– Screening of drug candidates
– Identifying drug targets
• Biomolecule characterization
– Proteins and peptides
– Oligonucleotides
• Environmental analysis
– Pesticides on foods
– Soil and groundwater contamination
• Forensic analysis/clinical
Summary: acquiring a mass spectrum
• Inlet: solid, liquid or vapor
• Ionization (ion source): form ions (charged molecules)
• Mass sorting / filtering (mass analyzer): sort ions by mass (m/z)
• Detection (ion detector): detect ions
• Result: mass spectrum
[Figure: example mass spectrum; x axis from 1330 to 1350, y axis from 0 to 100]
MALDI: Matrix Assisted Laser Desorption Ionization
1. Sample is mixed with matrix (X) and dried on plate.
2. Laser flash ionizes matrix molecules.
3. Sample molecules (M) are ionized by proton transfer: XH+ + M → MH+ + X.
[Figure labels: hν (laser), sample plate (+/- 20 kV), grid (0 V), MH+]
Time-of-flight (TOF) Mass Analyzer
[Figure: positive ions leave the source at voltage V and cross the drift region (flight tube) to the detector]
• Measures the time for ions to reach the detector.
• Small ions reach the detector before large ones.
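The second bullet follows from energy conservation. An ion of mass m and charge ze accelerated through a voltage V gains kinetic energy zeV = ½mv², so its flight time over a drift length L is t = L·sqrt(m / (2zeV)). The sketch below evaluates this idealized formula. The 1 m tube length is a hypothetical value, the 20 kV default is an illustrative choice, and effects such as initial velocity spread are ignored.

```python
import math

E_CHARGE = 1.602176634e-19    # elementary charge in coulombs
DALTON = 1.66053906660e-27    # atomic mass unit in kilograms

def flight_time(mass_da, charge=1, voltage=20e3, length=1.0):
    """Idealized drift time in seconds for an ion of the given mass (Da).

    voltage (V) and length (m) are illustrative values, not instrument
    specifications.
    """
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2 * charge * E_CHARGE * voltage / mass_kg)
    return length / velocity

for mass in (1_000.0, 50_000.0, 150_000.0):
    print(f"{mass:>9.0f} Da: {flight_time(mass) * 1e6:.1f} microseconds")
```

Flight time grows with the square root of m/z, so a doubly charged ion arrives earlier than the singly charged ion of the same mass.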
Slides adapted from University of California San Francisco
The mass spectrum shows the results
[Figure: MALDI TOF spectrum of IgG; x axis: mass (m/z), 50000 to 200000; y axis: relative abundance; peaks labelled MH+, (M+2H)2+ and (M+3H)3+]
Dataset
• MALDI-TOF data.
• Samples of mixed quality due to different storage times.
• Controlled molecule spiking used to generate two classes.
Profiles of one spiked sample
Comparison of ML algorithms
• Feature selection + classification:
1. RFE+SVM
2. RFE+kNN
3. RELIEF+SVM
4. RELIEF+kNN
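A sketch of how such combinations can be scored with leave-one-out cross-validation, assuming scikit-learn and synthetic data. Feature selection sits inside the pipeline so that it is redone on each training fold and never sees the held-out sample. Only the two RFE variants are shown because scikit-learn does not ship a RELIEF selector. Its `RFE` class also follows its own elimination schedule, which is not necessarily the halving schedule described earlier, and the choice of 5 retained features is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the spectra: 30 samples, 40 variables (assumption).
X, y = make_classification(n_samples=30, n_features=40, n_informative=4,
                           n_redundant=0, random_state=0)

def make_rfe():
    # Linear SVM weights drive the elimination; keep 5 variables.
    return RFE(SVC(kernel="linear"), n_features_to_select=5, step=0.5)

pipelines = {
    "RFE+SVM": Pipeline([("select", make_rfe()),
                         ("clf", SVC(kernel="linear"))]),
    "RFE+1NN": Pipeline([("select", make_rfe()),
                         ("clf", KNeighborsClassifier(n_neighbors=1))]),
}

# Each LOOCV fold refits selection and classifier on n-1 samples.
results = {name: cross_val_score(pipe, X, y, cv=LeaveOneOut()).mean()
           for name, pipe in pipelines.items()}
for name, acc in results.items():
    print(f"{name}: LOOCV accuracy = {acc:.3f}")
```

Selecting features once on the full dataset and cross-validating only the classifier would leak information from the held-out sample and inflate the accuracy estimate.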
LOOCV results
• Misclassified samples are of bad quality (longer storage time)
• The selected features do not always correspond to the m/z values of the spiked molecules
LOOCV results
• The variables selected by RELIEF correspond to the spiked peptides
• RFE is less robust than RELIEF over LOOCV runs and also selects “irrelevant” variables
RELIEF-based feature selection yields results that are more interpretable than those of RFE
BUT...
• RFE+SVM yields higher LOOCV accuracy than RELIEF+SVM
• RFE+kNN yields higher accuracy than RELIEF+kNN (perfect LOOCV classification for RFE+1NN)
RFE-based feature selection yields better predictive performance than RELIEF
Conclusion
• Better predictive performance does not necessarily correspond to stability and interpretability of results
• Open issues:
– How to measure reliability of potential biomarkers identified by feature selection algorithms?
– Is stability of feature selection algorithms more important than predictive accuracy?