Machine Learning: Basic Principles
Sparse Kernel Machines
T-61.6020 Special Course in Computer and Information Science
20070226 | Janne Argillander, [email protected]
Contents
- Introduction
- Maximum margin classifier
- Overlapping class distributions
- Multiclass classifier
- Relevance Vector Machines
- Outroduction
- Algorithms based on non-linear kernels are attractive, but because the kernel function k(xn, xm) has to be evaluated for all pairs of training points, the computational complexity is high.
- Some kind of "reduction" has to be done.
- A solution is said to be sparse when only a subset of the training data points is needed for evaluation.
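To make this concrete, here is a minimal Python sketch (not from the slides) of a kernel expansion: prediction costs one kernel evaluation per training point with a nonzero coefficient, so a sparse solution is cheap to evaluate. The RBF kernel, the toy points, and the coefficients are all illustrative assumptions.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def predict(x, points, coeffs, b=0.0, gamma=1.0):
    """Kernel expansion f(x) = sum_n a_n * k(x, x_n) + b.
    Cost is one kernel evaluation per point with a nonzero coefficient,
    so a sparse solution is cheaper to evaluate."""
    return sum(a * rbf_kernel(x, xn, gamma)
               for a, xn in zip(coeffs, points) if a != 0.0) + b

# Hypothetical training set; in a sparse solution most coefficients are zero.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
alphas = [0.7, 0.0, 0.0, -0.7]   # only two points contribute to predictions
score = predict((0.1, 0.1), train, alphas)
```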
Which boundary should we pick?
- The Support Vector Machine (SVM) approaches the boundary-selection question above through the concept of the margin.
- The margin is the smallest distance between the decision boundary and any of the training samples.
- The points that attain this smallest distance define the decision boundary and are called support vectors.
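The margin can be computed directly from its definition: the distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ||w||, and the margin is the minimum over the training points. A small sketch with a hand-picked hyperplane and toy points (both illustrative assumptions):

```python
import math

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

def margin(points, w, b):
    """Smallest distance from any training point to the boundary.
    The points attaining this minimum are the support vectors."""
    return min(distance_to_hyperplane(x, w, b) for x in points)

# Toy boundary x1 + x2 - 1 = 0, i.e. w = (1, 1), b = -1.
pts = [(0.0, 0.0), (2.0, 2.0), (0.5, 0.0)]
m = margin(pts, (1.0, 1.0), -1.0)   # attained by the point (0.5, 0.0)
```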
Properties of SVM
- An SVM does not provide posterior probabilities. Its output can be seen as the distance from the decision boundary to the test point (measured in the feature space).
  – Considering a retrieval task (most probable first), how do we sort the results?
  – Can we combine two different methods or models?
- SVMs can be used for classification, regression, and data reduction.
  – In the classification task the samples are usually annotated as positive or negative examples; SVM variants also accept examples without annotations.
- The SVM is fundamentally a two-class classifier.
  – What if we have more than two classes?
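One common remedy for the missing posterior probabilities (not covered in these slides) is Platt scaling: fit a sigmoid to the SVM decision values on held-out data, then sort retrieval results by the resulting pseudo-probability. A minimal sketch with hand-picked, purely illustrative sigmoid parameters:

```python
import math

def platt_probability(score, A=-1.0, B=0.0):
    """Map an SVM decision value to a pseudo-probability with a sigmoid
    P(y=1 | f) = 1 / (1 + exp(A*f + B)). In practice A and B are fitted
    on held-out data; the defaults here are illustrative only."""
    return 1.0 / (1.0 + math.exp(A * score + B))

# Scores far on the positive side map near 1, far negative near 0, so the
# calibrated values give a "most probable first" ordering for retrieval.
ranked = sorted([2.5, -1.0, 0.3], key=platt_probability, reverse=True)
```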
Properties of SVM (continued)
- The naive solution is slow, O(N^3), but with the help of Lagrange multipliers the task is only O(N) to O(N^2).
- Handles a non-linearly separable data set by mapping it into a feature space where it is linearly separable.
  – This space is defined by the non-linear kernel function k.
  – The dimensionality of the feature space can be high.
- The traditional SVM is sensitive to outliers.
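A classic illustration of the feature-space idea: XOR-labelled data is not linearly separable in the input plane, but adding the product feature x1·x2 (part of a degree-2 polynomial kernel's feature space) makes it separable. The data and the separating hyperplane below are illustrative assumptions:

```python
def phi(x):
    """Explicit feature map adding the product feature x1*x2, corresponding
    to part of a degree-2 polynomial kernel's feature space."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

# XOR-style labels: not linearly separable in the two-dimensional input space.
data = [((1.0, 1.0), +1), ((-1.0, -1.0), +1),
        ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]

def classify(x):
    """In the feature space the hyperplane w = (0, 0, 1), b = 0 separates
    the two classes: the sign of the product feature decides the label."""
    w = (0.0, 0.0, 1.0)
    return 1 if sum(wi * fi for wi, fi in zip(w, phi(x))) > 0 else -1
```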
Overlapping class distributions
- A penalty that increases with the distance from the margin (slack variables).
- ν-SVM: a method that can tolerate data sets with a few outliers.
- The parameters have to be estimated:
  – cross-validation over a hold-out set.
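The slack variable has a simple closed form, ξ_n = max(0, 1 − y_n(w·x_n + b)): zero for points on or beyond the correct margin, growing linearly with the violation. A sketch on a hand-picked one-dimensional boundary (an assumption for illustration):

```python
def slack(x, y, w, b):
    """Slack variable xi = max(0, 1 - y * (w.x + b)); zero for points on
    or beyond the correct margin, linear in the size of the violation."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * f)

# Toy 1-D boundary: w = (1,), b = 0, so the margins sit at x = -1 and x = +1.
beyond_margin = slack((2.0,), +1, (1.0,), 0.0)   # correct side, no penalty
inside_margin = slack((0.5,), +1, (1.0,), 0.0)   # correct side, small penalty
misclassified = slack((-1.0,), +1, (1.0,), 0.0)  # wrong side, large penalty
```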
The SVM is fundamentally a two-class classifier. Let us consider a problem with K classes.

- One-versus-rest approach
  – Training: Ck is marked as positive, the remaining K−1 classes as negative.
  – The training sets will be imbalanced (if we had symmetry, it is now lost), e.g. 10 'faces' versus 100,000 non-faces.
  – What if we have similar classes?
  – Can lead to very slow models, O(KN^2) to O(K^2 N^2).
- One-versus-one approach
  – Training: two-class SVMs on all possible pairs of classes.
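The bookkeeping of the two schemes can be sketched as follows; the class names and the pairwise outcomes are hypothetical:

```python
from itertools import combinations

def one_vs_one_pairs(classes):
    """All class pairs: one binary SVM per pair, K*(K-1)/2 in total,
    versus K models for one-versus-rest."""
    return list(combinations(classes, 2))

def one_vs_one_vote(pairwise_winners):
    """Combine one-vs-one decisions by majority vote: each pairwise
    classifier votes for the class it predicts; the most-voted class wins."""
    votes = {}
    for winner in pairwise_winners:
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)

# With K = 4 classes there are 6 pairwise classifiers, but only 4
# one-versus-rest classifiers.
pairs = one_vs_one_pairs(["a", "b", "c", "d"])
# Hypothetical outcomes of the 6 pairwise classifiers:
label = one_vs_one_vote(["a", "a", "b", "c", "a", "c"])
```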
Multiclass = problems?

(Figure: ambiguous decision regions under the one-versus-rest and one-versus-one schemes.)

HINT: In general, multiclass classification is an open problem. Solve it and become famous!
On the way to even sparser solutions

- Some limitations of the SVM:
  – The outputs of an SVM represent decisions, not probabilities.
  – Slack-variable parameters are found using a hold-out method.
  – The kernel must be positive definite.
  – The SVM is just not cool anymore?
- The Relevance Vector Machine (RVM) is derived from the SVM:
  – It shares many of the good characteristics of the SVM.
  – There is no restriction to positive definite kernels.
  – No cross-validation is needed.
  – The number of relevance vectors << the number of support vectors, so the model is faster to evaluate and sparser.
  – When training the RVM model we have to invert an M × M matrix (where M is the number of basis functions), which costs O(M^3), so RVM training takes more time.
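The O(M^3) cost comes from the matrix inversion step. As a generic illustration of why inversion is cubic (this is plain Gauss-Jordan elimination, not Tipping's actual training procedure):

```python
def mat_inverse(A):
    """Invert an M x M matrix by Gauss-Jordan elimination: for each of the
    M columns we do O(M^2) row work, hence O(M^3) in total. This cubic
    cost per update is why RVM training is slower than SVM training, even
    though the final RVM model is sparser."""
    M = len(A)
    # Augment with the identity matrix.
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(M)]
           for i, row in enumerate(A)]
    for col in range(M):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, M), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(M):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [row[M:] for row in aug]

# Tiny check: the inverse of diag(2, 4) is diag(0.5, 0.25).
inv = mat_inverse([[2.0, 0.0], [0.0, 4.0]])
```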
(Figure: ν-SVM solution versus RVM solution.)
Function approximation (RVM)
(Figure: sinc(x), and sinc(x) with added Gaussian noise.)
Tipping, M. E. (2000). The Relevance Vector Machine
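The experimental setup can be reproduced by sampling the target function with additive noise; the sample count, input range, and noise level below are assumptions for illustration:

```python
import math
import random

def sinc(x):
    """sinc(x) = sin(x) / x, with the removable singularity at 0 handled."""
    return 1.0 if x == 0.0 else math.sin(x) / x

def noisy_sinc_samples(n=50, noise_std=0.1, seed=0):
    """Draw n inputs x in [-10, 10] and targets sinc(x) + Gaussian noise,
    mimicking the setup of the sinc regression experiment."""
    rng = random.Random(seed)
    xs = [rng.uniform(-10.0, 10.0) for _ in range(n)]
    ts = [sinc(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ts

xs, ts = noisy_sinc_samples()
```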
Software packages and other things worth checking

- Command-line tools:
  – SVMlight: http://svmlight.joachims.org/
  – SVMTorch Multi III: http://www.torch.ch/ (one-versus-rest approach)
- Do it yourself:
  – LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  – Interfaces to LIBSVM: MATLAB, R, Weka, LabVIEW, ...
- http://www.kernel-machines.org/