Today’s Topics
• Review of the linear SVM with Slack Variables
• Kernels (for non-linear models)
• SVM Wrapup
• Remind me to repeat Q’s for those listening to audio
• Informal Class Poll: Favorite ML Algo? (Domingos’s on-line ‘five tribes’ talk 11/24)
Nearest Neighbors
D-trees / D-forests
Genetic Algorithms
Naïve Bayes / Bayesian Nets
Neural Networks
Support Vector Machines
Recall: Three Key SVM Concepts
• Maximize the Margin
  Don’t choose just any separating plane
• Penalize Misclassified Examples
  Use soft constraints and ‘slack’ variables
• Use the ‘Kernel Trick’ to get Non-Linearity
  Roughly like ‘hardwiring’ the input-to-HU portion of ANNs (so we only need a perceptron)
Recall: ‘Slack’ Variables
Dealing with Data that is not Linearly Separable
For each wrong example, we pay a penalty, which is the distance we’d have to move it to get on the right side of the decision boundary (ie, the separating plane)
If we deleted any/all of the non-support vectors, we’d get the same answer!
[Figure: the margin, of width 2 / ||w||2, between the separating plane and the support vectors]
Recall: The Math Program with Slack Vars

min (over w and S)   ||w||1 + μ ||S||1
such that
   w · xposi + Si ≥ +1        (one constraint per positive example)
   w · xnegj – Sj ≤ –1        (one constraint per negative example)
   Sk ≥ 0

(The dimension of w = # of input features; the dimension of S = # of training examples)
The S’s are how far we would need to move an example in order for it to be on the proper side of the decision surface
Notice we are solving the perceptron task with a complexity penalty (sum of wgts) – Hinton’s wgt decay!
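Below is a minimal sketch (not the course’s code) of how this LP could be solved with SciPy’s linprog; the names train_1norm_svm, X, y, and mu are my own, and w is split into w_plus – w_minus so that ||w||1 becomes a linear objective.

  import numpy as np
  from scipy.optimize import linprog

  def train_1norm_svm(X, y, mu=1.0):
      """X: n-by-d examples, y: +/-1 labels, mu: penalty on the slack variables."""
      n, d = X.shape
      # Decision variables z = [w_plus (d), w_minus (d), S (n)], all constrained >= 0.
      c = np.concatenate([np.ones(2 * d), mu * np.ones(n)])       # ||w||1 + mu ||S||1
      # y_i (w . x_i) + S_i >= 1, rewritten in linprog's  A_ub z <= b_ub  form.
      A_ub = np.hstack([-(y[:, None] * X), y[:, None] * X, -np.eye(n)])
      b_ub = -np.ones(n)
      res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
      w = res.x[:d] - res.x[d:2 * d]
      S = res.x[2 * d:]
      return w, S

  # Tiny usage example with made-up data (no bias term, matching the slide):
  X = np.array([[4., 2.], [-6., 3.], [-5., -1.]])
  y = np.array([1., 1., -1.])
  w, S = train_1norm_svm(X, y, mu=10.0)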
Recall: SVMs and Non-Linear Separating Surfaces
[Figure: + and – examples plotted in the original (f1, f2) space, and again after mapping to a new (g(f1, f2), h(f1, f2)) space where a line separates them]
Non-linearly map to new space
Linearly separate in new space
Result is a non-linear separator in original space
Idea #3: Finding Non-Linear Separating Surfaces via Kernels
• Map inputs into new space, eg
  – ex1 features: x1 = 5, x2 = 4                 Old Rep
  – ex1 features: (x1², x2², 2·x1·x2)            New Rep = (25, 16, 40) (sq of old rep)
• Solve linear SVM program in this new space
  – Computationally complex if many derived features
  – But a clever trick exists!
• SVM terminology (differs from other parts of ML)
  – Input space: the original features
  – Feature space: the space of derived features
Kernels
• Kernels produce non-linear separating surfaces in the original space
• Kernels are similarity functions between two examples, K(exi, exj), like in k-NN
• Sample kernels (many variants exist)
     K(exi, exj) = exi • exj                          (this is linear)
     K(exi, exj) = exp{ –||exi – exj||² / σ² }        (this is the Gaussian kernel)
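A small sketch of these two sample kernels in NumPy (the helper names are mine):

  import numpy as np

  def linear_kernel(x, z):
      return float(np.dot(x, z))                                # K(x, z) = x . z

  def gaussian_kernel(x, z, sigma=1.0):
      diff = np.asarray(x, float) - np.asarray(z, float)
      return float(np.exp(-np.dot(diff, diff) / sigma ** 2))    # exp(-||x - z||^2 / sigma^2)

  print(linear_kernel([4, 2], [-6, 3]))        # -18.0
  print(gaussian_kernel([4, 2], [4, 2]))       # 1.0: identical examples are maximally similar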
Kernels as Features
• Let the similarity between examples be the features!
• Feature j for example i is K(exi, exj)
• Models are of the form
     If ∑j αj K(exi, exj) > threshold then + else –
The α’s weight the similarities (we hope many α = 0)
So a model is determined by (a) finding some good exemplars (those with α ≠ 0; they are the support vectors) and (b) weighting the similarity to these exemplars
An instance-based learner!
Bug or feature?
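A minimal sketch of this model form, with made-up exemplars, α’s, and threshold purely for illustration:

  import numpy as np

  def gaussian_kernel(x, z, sigma=2.0):
      d = np.asarray(x, float) - np.asarray(z, float)
      return np.exp(-np.dot(d, d) / sigma ** 2)

  exemplars = np.array([[4., 2.], [-5., -1.]])   # the support vectors (the examples with alpha != 0)
  alphas    = np.array([0.7, -0.9])              # learned weights on the similarities
  threshold = 0.0

  def predict(x):
      score = sum(a * gaussian_kernel(x, e) for a, e in zip(alphas, exemplars))
      return '+' if score > threshold else '-'

  print(predict([4., 1.]))   # close to the positive exemplar, so this prints '+'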
Our Array of ‘Feature’ Values
Features: K(exi, exj)
[Table sketch: one row per example, one column per ‘feature’ K(exi, exj); entry (i, j) is the similarity between examples i and j – just an array of #’s]
Our models are linear in these features, but will be non-linear in the original features if K is a non-linear function
Notice that we can compute K(exi, exj) outside the SVM code! So we really only need code for the LINEAR SVM – it doesn’t know where the ‘rectangle’ of data came from
Concrete Example
Use the ‘squaring’ kernel to convert the following set of examples:
     K(exi, exj) = (x • z)²

Raw Data
         F1    F2    Output
  Ex1     4     2      T
  Ex2    -6     3      T
  Ex3    -5    -1      F
Derived Features
         K(exi, ex1)   K(exi, ex2)   K(exi, ex3)   Output
  Ex1                                                T
  Ex2                                                T
  Ex3                                                F
Concrete Example (w/ answers)
Use the ‘squaring’ kernel to convert the following set of examples:
     K(exi, exj) = (x • z)²

Raw Data
         F1    F2    Output
  Ex1     4     2      T
  Ex2    -6     3      T
  Ex3    -5    -1      F
Derived Features
         K(exi, ex1)   K(exi, ex2)   K(exi, ex3)   Output
  Ex1       400           324           484         T
  Ex2       324          2025           729         T
  Ex3       484           729           676         F
Probably want to divide this by 1000 to scale the derived features
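A quick sketch (mine, not the course code) that reproduces the derived-feature table above with NumPy:

  import numpy as np

  X = np.array([[4, 2], [-6, 3], [-5, -1]])   # Ex1, Ex2, Ex3 as rows (F1, F2)
  K = (X @ X.T) ** 2                          # K[i, j] = (x_i . x_j)^2, the 'squaring' kernel
  print(K)
  # [[ 400  324  484]
  #  [ 324 2025  729]
  #  [ 484  729  676]]
  print(K / 1000.0)                           # the rescaled derived features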
A Simple Example of a Kernel Creating a Non-Linear Separation
Assume K(A, B) = –distance(A, B)
[Figure: examples ex1 … ex9 shown in the original feature space and again in the kernel-produced feature space, with axes K(exi, ex1) and K(exi, ex6) (only two dimensions shown)]
Separating plane in the derived space ⇒ separating surface in the original feature space (non-linear!)
Model: if (K(exnew, ex1) > -5) then GREEN else RED
Our 1-Norm SVM with Kernels
min (over α and S)   ||α||1 + μ ||S||1
such that
   pos ex’s:   { ∑j αj K(xj, xposi) } + Si ≥ +1
   neg ex’s:   { ∑j αj K(xj, xnegk) } – Sk ≤ –1
   Sm ≥ 0
We use α instead of w to indicate we’re weighting similarities rather than ‘raw’ features
Same linear LP code can be used, simply create the K()’s externally!
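A sketch of that point: build the K()’s outside the learner and hand the resulting ‘rectangle’ to the same linear LP trainer (here I assume a function like the train_1norm_svm sketch shown earlier; all names are mine):

  import numpy as np

  def gaussian_kernel_matrix(X, sigma=5.0):
      # K[i, j] = exp(-||x_i - x_j||^2 / sigma^2)
      sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
      return np.exp(-sq_dists / sigma ** 2)

  X = np.array([[4., 2.], [-6., 3.], [-5., -1.]])
  y = np.array([1., 1., -1.])
  K = gaussian_kernel_matrix(X)                  # column j is the new feature K(., ex_j)
  # alphas, S = train_1norm_svm(K, y, mu=10.0)   # the identical LP code, just new 'features'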
The Kernel ‘Trick’
• The Linear SVM can be written (using the [not-on-final] primal-dual concept of LPs)
      min (over α)   ½ ∑i ∑j yi yj αi αj (exi • exj) – ∑i αi
• Whenever we see the dot products exi • exj, we can replace them with kernels K(exi, exj)
  – this is called the ‘kernel trick’ http://en.wikipedia.org/wiki/Kernel_trick
• This trick is not only for SVMs
  – ie, ‘kernel machines’ are a broad ML topic
  – can use ‘similarity to examples’ as features for ANY ML algo!
  – eg, run d-trees with kernelized features
Kernels and Mercer’s Theorem
K(x, y)’s that are
  – continuous
  – symmetric: K(x, y) = K(y, x)
  – positive semidefinite (the kernel matrix they create is a square Hermitian matrix whose
    eigenvalues are all non-negative; see en.wikipedia.org/wiki/Positive_semidefinite_matrix)
are equivalent to a dot product in some space:   K(x, y) = Φ(x) • Φ(y)
Note: can use any similarity function to create a new ‘feature space’ and solve with a linear SVM, but the ‘dot product in a derived space’ interpretation will be lost unless Mercer’s Theorem holds
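A small numeric illustration (my own) of the positive-semidefinite condition: a Gaussian kernel matrix on random data should have no meaningfully negative eigenvalues.

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(20, 3))                                  # 20 random examples, 3 features
  sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
  K = np.exp(-sq_dists / 2.0 ** 2)                              # Gaussian kernel, sigma = 2

  eigvals = np.linalg.eigvalsh(K)                               # K is symmetric, so eigvalsh applies
  print(eigvals.min() >= -1e-10)                                # True, consistent with Mercer's condition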
The New Space for a Sample Kernel
Let K(x, z) = (x • z)² and let # features = 2
    (x • z)² = (x1z1 + x2z2)²
             = x1x1z1z1 + x1x2z1z2 + x2x1z2z1 + x2x2z2z2
             = <x1x1, x1x2, x2x1, x2x2> • <z1z1, z1z2, z2z1, z2z2>
• Our new feature space has 4 dimensions; we’re doing a dot product in it!
• Note: if we used an exponent > 2, we’d have gotten a much larger ‘virtual’ feature space for very little cost!
Notation: <a, b, …, z> indicates a vector, with its components explicitly listed
Key point: we don’t explicitly create the expanded ‘raw’ feature space, but the result is the same as if we did
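A quick numeric check (my own) that the kernel value and the dot product in the expanded 4-dimensional space agree:

  import numpy as np

  def expand(v):
      x1, x2 = v
      return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])   # <x1x1, x1x2, x2x1, x2x2>

  x = np.array([5.0, 4.0])
  z = np.array([1.0, 3.0])
  print(np.dot(x, z) ** 2)                # 289.0, since (5*1 + 4*3)^2 = 17^2
  print(np.dot(expand(x), expand(z)))     # 289.0 as well: the same dot product, computed in the new space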
Review: Matrix Multiplication
From (code also there): http://www.cedricnugteren.nl/tutorial.php?page=2
A × B = C
   Matrix A is K by M
   Matrix B is N by K
   Matrix C is M by N
The Kernel Matrix
Let A be our usual array of data:
   one example per row, one (standard) feature per column
A’ is ‘A transpose’ (rotate around the diagonal):
   one (standard) feature per row, one example per column
The Kernel Matrix is K(A, A’)
[Figure: the examples-by-features array A times the features-by-examples array A’ gives the examples-by-examples kernel matrix K]
The Reduced SVM (Lee & Mangasarian, 2001)
• With kernels, learned models are weighted sums of similarities to some of the training examples
• Kernel matrix is size O(N²), where N = # ex’s
  – With ‘big data’, squaring can be prohibitive!
• But no reason all training examples need to be candidate ‘exemplars’
• Can randomly (or cleverly) choose a subset as candidates; its size can scale O(N) (see the sketch below)
[Figure: the examples-by-examples kernel matrix K(ei, ej); in the Reduced SVM we create (and use) only a subset of its columns]
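A sketch (mine) of the reduced idea: compute the kernel ‘features’ only against a randomly chosen subset of candidate exemplars, so the rectangle handed to the linear SVM is N-by-m rather than N-by-N:

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(1000, 10))            # pretend training set, N = 1000 examples
  m = 50                                     # number of candidate exemplars (m << N)
  candidates = X[rng.choice(len(X), size=m, replace=False)]

  sq_dists = ((X[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=2)
  K_reduced = np.exp(-sq_dists / 5.0 ** 2)   # Gaussian kernel, but only the chosen columns
  print(K_reduced.shape)                     # (1000, 50) instead of (1000, 1000)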
More on Kernels
K(x, z) = tanh(c (x • z) + d)
   Relates to the sigmoid of ANNs (here the # of HUs is determined by the # of support vectors)
How to choose a good kernel function?
  – Use a tuning set
  – Or just use the Gaussian kernel
  – Some theory exists
  – A sum of kernels is a kernel (and other ‘closure’ properties exist)
  – Don’t want the kernel matrix to be all 0’s off the diagonal, since we want to model examples as sums of other ex’s
The Richness of Kernels
• Kernels need not solely be similarities computed on numeric data!
• Nor does the ‘raw’ data need to fit in a rectangle
• Can define similarity between examples represented as
  – trees (eg, parse trees in NLP), say by counting common subtrees
  – sequences (eg, DNA sequences)
Using Gradient Descent Instead of Linear Programming
• Recall last lecture we said that perceptron training with weight decay was quite similar to SVM training
• This is still the case with kernels; ie, we create a new data set outside the perceptron code and use gradient descent
• So here we get the non-linearity provided by HUs in a ‘hard-wired’ fashion (ie, by using the kernel to non-linearly compute a new representation of the data)
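A sketch (mine, not the course code) of that alternative: kernelize the data once, outside the learner, then train a perceptron-style unit with weight decay on the kernel features by gradient descent:

  import numpy as np

  def gaussian_kernel_matrix(A, B, sigma=5.0):
      sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
      return np.exp(-sq / sigma ** 2)

  X = np.array([[4., 2.], [-6., 3.], [-5., -1.]])
  y = np.array([1., 1., -1.])
  K = gaussian_kernel_matrix(X, X)                     # the new representation, built outside the learner

  alpha = np.zeros(len(X))                             # one weight per kernel feature (ie, per example)
  eta, decay = 0.1, 0.01
  for _ in range(200):
      margins = y * (K @ alpha)
      violated = (margins < 1).astype(float)           # examples on the wrong side of the margin
      grad = -(K * y[:, None]).T @ violated            # hinge-style (perceptron-like) gradient
      alpha -= eta * (grad + decay * np.sign(alpha))   # weight decay plays the role of the ||alpha||1 penalty
  print(np.sign(K @ alpha))                            # [ 1.  1. -1.]: all three examples end up on the correct side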
SVM Wrapup
• For approx a decade, SVMs were the ‘hottest’ topic in ML (Deep NNs now are)
• Formalize nicely the task of finding a simple model with few ‘outliers’
• Use hard-wired ‘kernels’ to do the job done by HUs in ANNs
• Kernels can be used in any ML algo
– just preprocess the data to create ‘kernel’ features
– can handle data that is not a fixed-length feature vector
• Lots of good theory and empirical results