Today’s Topics
• Review of the linear SVM with Slack Variables
• Kernels (for non-linear models)
• SVM Wrapup
• Remind me to repeat Q’s for those listening to audio
• Informal Class Poll: Favorite ML Algo? (Domingos’s on-line ‘five tribes’ talk 11/24)
Nearest Neighbors
D-trees / D-forests
Genetic Algorithms
Naïve Bayes / Bayesian Nets
Neural Networks
Support Vector Machines
Recall: Three Key SVM Concepts
• Maximize the Margin
  Don’t choose just any separating plane
• Penalize Misclassified Examples
  Use soft constraints and ‘slack’ variables
• Use the ‘Kernel Trick’ to get Non-Linearity
  Roughly like ‘hardwiring’ the input-to-HU portion of ANNs (so we only need a perceptron)
Recall: ‘Slack’ Variables
Dealing with Data that is not Linearly Separable
For each wrong example, we pay a penalty, which is the distance we’d have to move it to get on the right side of the decision boundary (ie, the separating plane)
If we deleted any/all of the non-support vectors, we’d get the same answer!
[Figure: the margin, of width 2 / ||w||2, between the separating plane and the support vectors]
Recall: The Math Program with Slack Vars

min (over w and S)   ||w||1 + μ ||S||1
such that
   w · xposi + Si ≥ +1        (one constraint per positive example)
   w · xnegj – Sj ≤ –1        (one constraint per negative example)
   Sk ≥ 0

(The dimension of w = # of input features; the dimension of S = # of training examples)
The S’s are how far we would need to move an example in order for it to be on the proper side of the decision surface
Notice we are solving the perceptron task with a complexity penalty (sum of wgts) – Hinton’s wgt decay!
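Below is a minimal sketch (not the course’s code) of how this LP could be solved with SciPy’s linprog; the names train_1norm_svm, X, y, and mu are my own, and w is split into w_plus – w_minus so that ||w||1 becomes a linear objective.

  import numpy as np
  from scipy.optimize import linprog

  def train_1norm_svm(X, y, mu=1.0):
      """X: n-by-d examples, y: +/-1 labels, mu: penalty on the slack variables."""
      n, d = X.shape
      # Decision variables z = [w_plus (d), w_minus (d), S (n)], all constrained >= 0.
      c = np.concatenate([np.ones(2 * d), mu * np.ones(n)])       # ||w||1 + mu ||S||1
      # y_i (w . x_i) + S_i >= 1, rewritten in linprog's  A_ub z <= b_ub  form.
      A_ub = np.hstack([-(y[:, None] * X), y[:, None] * X, -np.eye(n)])
      b_ub = -np.ones(n)
      res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
      w = res.x[:d] - res.x[d:2 * d]
      S = res.x[2 * d:]
      return w, S

  # Tiny usage example with made-up data (no bias term, matching the slide):
  X = np.array([[4., 2.], [-6., 3.], [-5., -1.]])
  y = np.array([1., 1., -1.])
  w, S = train_1norm_svm(X, y, mu=10.0)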
Recall: SVMs and Non-Linear Separating Surfaces
[Figure: + and – examples plotted in the original (f1, f2) space, and again after mapping to a new (g(f1, f2), h(f1, f2)) space where a line separates them]
Non-linearly map to new space
Linearly separate in new space
Result is a non-linear separator in original space
Idea #3: Finding Non-Linear Separating Surfaces via Kernels
• Map inputs into new space, eg
  – ex1 features: x1 = 5, x2 = 4                 Old Rep
  – ex1 features: (x1², x2², 2·x1·x2)            New Rep = (25, 16, 40) (sq of old rep)
• Solve linear SVM program in this new space
  – Computationally complex if many derived features
  – But a clever trick exists!
• SVM terminology (differs from other parts of ML)
  – Input space: the original features
  – Feature space: the space of derived features
Kernels
• Kernels produce non-linear separating surfaces in the original space
• Kernels are similarity functions between two examples, K(exi, exj), like in k-NN
• Sample kernels (many variants exist)
     K(exi, exj) = exi • exj                          (this is linear)
     K(exi, exj) = exp{ –||exi – exj||² / σ² }        (this is the Gaussian kernel)
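A small sketch of these two sample kernels in NumPy (the helper names are mine):

  import numpy as np

  def linear_kernel(x, z):
      return float(np.dot(x, z))                                # K(x, z) = x . z

  def gaussian_kernel(x, z, sigma=1.0):
      diff = np.asarray(x, float) - np.asarray(z, float)
      return float(np.exp(-np.dot(diff, diff) / sigma ** 2))    # exp(-||x - z||^2 / sigma^2)

  print(linear_kernel([4, 2], [-6, 3]))        # -18.0
  print(gaussian_kernel([4, 2], [4, 2]))       # 1.0: identical examples are maximally similar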
Kernels as Features
• Let the similarity between examples be the features!
• Feature j for example i is K(exi, exj)
• Models are of the form
     If ∑j αj K(exi, exj) > threshold then + else –
The α’s weight the similarities (we hope many α = 0)
So a model is determined by (a) finding some good exemplars (those with α ≠ 0; they are the support vectors) and (b) weighting the similarity to these exemplars
An instance-based learner!
Bug or feature?
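A minimal sketch of this model form, with made-up exemplars, α’s, and threshold purely for illustration:

  import numpy as np

  def gaussian_kernel(x, z, sigma=2.0):
      d = np.asarray(x, float) - np.asarray(z, float)
      return np.exp(-np.dot(d, d) / sigma ** 2)

  exemplars = np.array([[4., 2.], [-5., -1.]])   # the support vectors (the examples with alpha != 0)
  alphas    = np.array([0.7, -0.9])              # learned weights on the similarities
  threshold = 0.0

  def predict(x):
      score = sum(a * gaussian_kernel(x, e) for a, e in zip(alphas, exemplars))
      return '+' if score > threshold else '-'

  print(predict([4., 1.]))   # close to the positive exemplar, so this prints '+'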
Our Array of ‘Feature’ Values
Features: K(exi, exj)
[Table sketch: one row per example, one column per ‘feature’ K(exi, exj); entry (i, j) is the similarity between examples i and j – just an array of #’s]
Our models are linear in these features, but will be non-linear in the original features if K is a non-linear function
Notice that we can compute K(exi, exj) outside the SVM code! So we really only need code for the LINEAR SVM – it doesn’t know where the ‘rectangle’ of data came from
Concrete Example
Use the ‘squaring’ kernel to convert the following set of examples:
     K(exi, exj) = (x • z)²

Raw Data
         F1    F2    Output
  Ex1     4     2      T
  Ex2    -6     3      T
  Ex3    -5    -1      F
Derived Features
         K(exi, ex1)   K(exi, ex2)   K(exi, ex3)   Output
  Ex1                                                T
  Ex2                                                T
  Ex3                                                F
Concrete Example (w/ answers)
Use the ‘squaring’ kernel to convert the following set of examples:
     K(exi, exj) = (x • z)²

Raw Data
         F1    F2    Output
  Ex1     4     2      T
  Ex2    -6     3      T
  Ex3    -5    -1      F
Derived Features
         K(exi, ex1)   K(exi, ex2)   K(exi, ex3)   Output
  Ex1       400           324           484         T
  Ex2       324          2025           729         T
  Ex3       484           729           676         F
Probably want to divide this by 1000 to scale the derived features
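A quick sketch (mine, not the course code) that reproduces the derived-feature table above with NumPy:

  import numpy as np

  X = np.array([[4, 2], [-6, 3], [-5, -1]])   # Ex1, Ex2, Ex3 as rows (F1, F2)
  K = (X @ X.T) ** 2                          # K[i, j] = (x_i . x_j)^2, the 'squaring' kernel
  print(K)
  # [[ 400  324  484]
  #  [ 324 2025  729]
  #  [ 484  729  676]]
  print(K / 1000.0)                           # the rescaled derived features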
A Simple Example of a Kernel Creating a Non-Linear Separation
Assume K(A, B) = –distance(A, B)
[Figure: examples ex1 … ex9 shown in the original feature space and again in the kernel-produced feature space, with axes K(exi, ex1) and K(exi, ex6) (only two dimensions shown)]
Separating plane in the derived space ⇒ separating surface in the original feature space (non-linear!)
Model: if (K(exnew, ex1) > -5) then GREEN else RED
Our 1-Norm SVM with Kernels
min (over α and S)   ||α||1 + μ ||S||1
such that
   pos ex’s:   { ∑j αj K(xj, xposi) } + Si ≥ +1
   neg ex’s:   { ∑j αj K(xj, xnegk) } – Sk ≤ –1
   Sm ≥ 0
We use α instead of w to indicate we’re weighting similarities rather than ‘raw’ features
Same linear LP code can be used, simply create the K()’s externally!
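A sketch of that point: build the K()’s outside the learner and hand the resulting ‘rectangle’ to the same linear LP trainer (here I assume a function like the train_1norm_svm sketch shown earlier; all names are mine):

  import numpy as np

  def gaussian_kernel_matrix(X, sigma=5.0):
      # K[i, j] = exp(-||x_i - x_j||^2 / sigma^2)
      sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
      return np.exp(-sq_dists / sigma ** 2)

  X = np.array([[4., 2.], [-6., 3.], [-5., -1.]])
  y = np.array([1., 1., -1.])
  K = gaussian_kernel_matrix(X)                  # column j is the new feature K(., ex_j)
  # alphas, S = train_1norm_svm(K, y, mu=10.0)   # the identical LP code, just new 'features'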
The Kernel ‘Trick’
• The Linear SVM can be written (using the [not-on-final] primal-dual concept of LPs)
      min (over α)   ½ ∑i ∑j yi yj αi αj (exi • exj) – ∑i αi
• Whenever we see the dot products exi • exj, we can replace them with kernels K(exi, exj)
  – this is called the ‘kernel trick’ http://en.wikipedia.org/wiki/Kernel_trick
• This trick is not only for SVMs
  – ie, ‘kernel machines’ are a broad ML topic
  – can use ‘similarity to examples’ as features for ANY ML algo!
  – eg, run d-trees with kernelized features
Kernels and Mercer’s Theorem
K(x, y)’s that are
  – continuous
  – symmetric: K(x, y) = K(y, x)
  – positive semidefinite (the kernel matrix they create is a square Hermitian matrix whose
    eigenvalues are all non-negative; see en.wikipedia.org/wiki/Positive_semidefinite_matrix)
are equivalent to a dot product in some space:   K(x, y) = Φ(x) • Φ(y)
Note: can use any similarity function to create a new ‘feature space’ and solve with a linear SVM, but the ‘dot product in a derived space’ interpretation will be lost unless Mercer’s Theorem holds
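A small numeric illustration (my own) of the positive-semidefinite condition: a Gaussian kernel matrix on random data should have no meaningfully negative eigenvalues.

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(20, 3))                                  # 20 random examples, 3 features
  sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
  K = np.exp(-sq_dists / 2.0 ** 2)                              # Gaussian kernel, sigma = 2

  eigvals = np.linalg.eigvalsh(K)                               # K is symmetric, so eigvalsh applies
  print(eigvals.min() >= -1e-10)                                # True, consistent with Mercer's condition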
The New Space for a Sample Kernel
Let K(x, z) = (x • z)² and let # features = 2
    (x • z)² = (x1z1 + x2z2)²
             = x1x1z1z1 + x1x2z1z2 + x2x1z2z1 + x2x2z2z2
             = <x1x1, x1x2, x2x1, x2x2> • <z1z1, z1z2, z2z1, z2z2>
• Our new feature space has 4 dimensions; we’re doing a dot product in it!
• Note: if we used an exponent > 2, we’d have gotten a much larger ‘virtual’ feature space for very little cost!
Notation: <a, b, …, z> indicates a vector, with its components explicitly listed
Key point: we don’t explicitly create the expanded ‘raw’ feature space, but the result is the same as if we did
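A quick numeric check (my own) that the kernel value and the dot product in the expanded 4-dimensional space agree:

  import numpy as np

  def expand(v):
      x1, x2 = v
      return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])   # <x1x1, x1x2, x2x1, x2x2>

  x = np.array([5.0, 4.0])
  z = np.array([1.0, 3.0])
  print(np.dot(x, z) ** 2)                # 289.0, since (5*1 + 4*3)^2 = 17^2
  print(np.dot(expand(x), expand(z)))     # 289.0 as well: the same dot product, computed in the new space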
Review: Matrix Multiplication
From (code also there): http://www.cedricnugteren.nl/tutorial.php?page=2
A × B = C
   Matrix A is K by M
   Matrix B is N by K
   Matrix C is M by N
The Kernel Matrix
Let A be our usual array of data:
   one example per row, one (standard) feature per column
A’ is ‘A transpose’ (rotate around the diagonal):
   one (standard) feature per row, one example per column
The Kernel Matrix is K(A, A’)
[Figure: the examples-by-features array A times the features-by-examples array A’ gives the examples-by-examples kernel matrix K]
The Reduced SVM (Lee & Mangasarian, 2001)
• With kernels, learned models are weighted sums of similarities to some of the training examples
• Kernel matrix is size O(N²), where N = # ex’s
  – With ‘big data’, squaring can be prohibitive!
• But no reason all training examples need to be candidate ‘exemplars’
• Can randomly (or cleverly) choose a subset as candidates; its size can scale O(N) (see the sketch below)
[Figure: the examples-by-examples kernel matrix K(ei, ej); in the Reduced SVM we create (and use) only a subset of its columns]
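A sketch (mine) of the reduced idea: compute the kernel ‘features’ only against a randomly chosen subset of candidate exemplars, so the rectangle handed to the linear SVM is N-by-m rather than N-by-N:

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(1000, 10))            # pretend training set, N = 1000 examples
  m = 50                                     # number of candidate exemplars (m << N)
  candidates = X[rng.choice(len(X), size=m, replace=False)]

  sq_dists = ((X[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=2)
  K_reduced = np.exp(-sq_dists / 5.0 ** 2)   # Gaussian kernel, but only the chosen columns
  print(K_reduced.shape)                     # (1000, 50) instead of (1000, 1000)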
More on Kernels
K(x, z) = tanh(c (x • z) + d)
   Relates to the sigmoid of ANNs (here the # of HUs is determined by the # of support vectors)
How to choose a good kernel function?
  – Use a tuning set
  – Or just use the Gaussian kernel
  – Some theory exists
  – A sum of kernels is a kernel (and other ‘closure’ properties exist)
  – Don’t want the kernel matrix to be all 0’s off the diagonal, since we want to model examples as sums of other ex’s
The Richness of Kernels
• Kernels need not solely be similarities computed on numeric data!
• Nor does the ‘raw’ data need to fit in a rectangle
• Can define similarity between examples represented as
  – trees (eg, parse trees in NLP), say by counting common subtrees
  – sequences (eg, DNA sequences)
Using Gradient Descent Instead of Linear Programming
• Recall last lecture we said that perceptron training with weight decay was quite similar to SVM training
• This is still the case with kernels; ie, we create a new data set outside the perceptron code and use gradient descent
• So here we get the non-linearity provided by HUs in a ‘hard-wired’ fashion (ie, by using the kernel to non-linearly compute a new representation of the data)
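A sketch (mine, not the course code) of that alternative: kernelize the data once, outside the learner, then train a perceptron-style unit with weight decay on the kernel features by gradient descent:

  import numpy as np

  def gaussian_kernel_matrix(A, B, sigma=5.0):
      sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
      return np.exp(-sq / sigma ** 2)

  X = np.array([[4., 2.], [-6., 3.], [-5., -1.]])
  y = np.array([1., 1., -1.])
  K = gaussian_kernel_matrix(X, X)                     # the new representation, built outside the learner

  alpha = np.zeros(len(X))                             # one weight per kernel feature (ie, per example)
  eta, decay = 0.1, 0.01
  for _ in range(200):
      margins = y * (K @ alpha)
      violated = (margins < 1).astype(float)           # examples on the wrong side of the margin
      grad = -(K * y[:, None]).T @ violated            # hinge-style (perceptron-like) gradient
      alpha -= eta * (grad + decay * np.sign(alpha))   # weight decay plays the role of the ||alpha||1 penalty
  print(np.sign(K @ alpha))                            # [ 1.  1. -1.]: all three examples end up on the correct side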
SVM Wrapup
• For approx a decade, SVMs were the ‘hottest’ topic in ML (Deep NNs now are)
• Formalize nicely the task of finding a simple model with few ‘outliers’
• Use hard-wired ‘kernels’ to do the job done by HUs in ANNs
• Kernels can be used in any ML algo
– just preprocess the data to create ‘kernel’ features
– can handle data that is not a fixed-length feature vector
• Lots of good theory and empirical results