
Page 1


Today’s Topics

11/17/15 CS 540 - Fall 2015 (Shavlik©), Lecture 23, Week 11

• Review of the linear SVM with Slack Variables
• Kernels (for non-linear models)
• SVM Wrapup
• Remind me to repeat Q’s for those listening to audio

• Informal Class Poll: Favorite ML Algo? (Domingos’s on-line ‘five tribes’ talk 11/24)

Nearest Neighbors

D-trees / D-forests

Genetic Algorithms

Naïve Bayes / Bayesian Nets

Neural Networks

Support Vector Machines

Page 2

Recall: Three Key SVM Concepts

• Maximize the Margin – don’t choose just any separating plane

• Penalize Misclassified Examples – use soft constraints and ‘slack’ variables

• Use the ‘Kernel Trick’ to get Non-Linearity – roughly like ‘hardwiring’ the input-to-HU portion of ANNs (so only a perceptron is needed)


Page 3

[Figure: the examples plotted in 2-D with the separating plane and its margin; the support vectors are highlighted]

Recall: ‘Slack’ Variables – Dealing with Data that is not Linearly Separable


For each wrong example, we pay a penalty, which is the distance we’d have to move it to get on the right side of the decision boundary (ie, the separating plane)

If we deleted any/all of the non-support vectors we’d get the same answer!

Margin width (shown in the figure): 2 / ||w||₂

Page 4

Recall: The Math Program with Slack Vars

min over w, S of   ||w||₁ + μ ||S||₁
such that
  w · xposi + Si ≥ +1   (for each positive example i)
  w · xnegj – Sj ≤ –1   (for each negative example j)
  Sk ≥ 0                (for every k)

(w has one component per input feature; S has one component per training example)


The S’s are how far we would need to move an example in order for it to be on the proper side of the decision surface


Notice we are solving the perceptron task with a complexity penalty (the sum of the weights) – Hinton’s weight decay!
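A minimal sketch of how this math program could be solved with an off-the-shelf LP solver (my own illustration, not code from the course; the function name, the use of SciPy, and the auxiliary variables that linearize ||w||₁ are all assumptions):

```python
# Hypothetical sketch: the 1-norm SVM with slack variables as a linear program.
import numpy as np
from scipy.optimize import linprog

def linear_svm_1norm(X, y, mu=1.0):
    """X: (n, d) array of examples; y: length-n array of +1/-1 labels."""
    n, d = X.shape
    # Decision variables stacked as [w (d), a (d), S (n)]:
    #   w = weights, a >= |w| (linearizes ||w||_1), S = slacks.
    c = np.concatenate([np.zeros(d), np.ones(d), mu * np.ones(n)])

    # y_i (w . x_i) + S_i >= 1, rewritten for linprog as -y_i (w . x_i) - S_i <= -1
    A_margin = np.hstack([-(y[:, None] * X), np.zeros((n, d)), -np.eye(n)])
    b_margin = -np.ones(n)

    # |w_j| <= a_j  becomes  w_j - a_j <= 0  and  -w_j - a_j <= 0
    A_abs = np.vstack([np.hstack([ np.eye(d), -np.eye(d), np.zeros((d, n))]),
                       np.hstack([-np.eye(d), -np.eye(d), np.zeros((d, n))])])

    res = linprog(c,
                  A_ub=np.vstack([A_margin, A_abs]),
                  b_ub=np.concatenate([b_margin, np.zeros(2 * d)]),
                  bounds=[(None, None)] * d + [(0, None)] * (d + n),
                  method="highs")
    return res.x[:d], res.x[2 * d:]   # learned weights w, slack values S
```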

Page 5

Recall: SVMs and Non-Linear Separating Surfaces

[Figure: ‘+’ and ‘–’ examples in the original (f1, f2) space on the left; the same examples after the non-linear mapping to (g(f1, f2), h(f1, f2)) on the right, where they become linearly separable]

Non-linearly map to new space

Linearly separate in the new space; the result is a non-linear separator in the original space


Page 6


Idea #3: Finding Non-Linear Separating Surfaces via Kernels

• Map inputs into a new space, eg
  – ex1 features: x1 = 5, x2 = 4 (Old Rep)
  – ex1 features: (x1², x2², 2·x1·x2) → New Rep = (25, 16, 40) (the ‘squaring’ of the old rep)
• Solve the linear SVM program in this new space
  – Computationally complex if there are many derived features
  – But a clever trick exists!

• SVM terminology (differs from other parts of ML)
  – Input space: the original features
  – Feature space: the space of derived features


Page 7


Kernels

• Kernels produce non-linear separating surfaces in the original space

• Kernels are similarity functions between two examples, K(exi, exj), like in k-NN

• Sample kernels (many variants exist)
  – K(exi, exj) = exi • exj   (the linear kernel)
  – K(exi, exj) = exp{ –||exi – exj||² / σ² }   (the Gaussian kernel)
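As a concrete (hypothetical) illustration, the two sample kernels could be written as below; the function names and the σ default are my choices, not from the lecture.

```python
import numpy as np

def linear_kernel(x, z):
    """Linear kernel: the dot product of the two examples."""
    return float(np.dot(x, z))

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel: exp(-||x - z||^2 / sigma^2), with sigma a free parameter."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / sigma ** 2))
```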


Page 8


Kernels as Features

• Let the similarity between examples be the features!

• Feature j for example i is K(exi, exj)
• Models are of the form:
    If ∑ αj K(exi, exj) > threshold then + else –

The α’s weight the similarities (we hope many α = 0)

So a model is determined by (a) finding some good exemplars (those with α ≠ 0; they are the support vectors) and (b) weighting the similarity to these exemplars

An instance-based learner!


Bug or feature?
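To make the instance-based flavor concrete, here is a small sketch (mine; the kernel argument, exemplar list, and threshold are illustrative assumptions, not specified on the slide) of how such a model would classify a new example:

```python
def predict(x_new, exemplars, alphas, kernel, threshold=0.0):
    """Classify x_new by a weighted sum of its similarities to the exemplars."""
    score = sum(a * kernel(x_new, ex) for a, ex in zip(alphas, exemplars))
    return '+' if score > threshold else '-'
```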

Page 9


Our Array of ‘Feature’ Values
[Figure: a square array with one row per example and one column per example; entry (i, j) is the feature K(exi, exj), the similarity between examples i and j]

Our models are linear in these features, but will be non-linear in the original features if K is a non-linear function

Notice that we can compute K(exi, exj) outside the SVM code! So we really only need code for the LINEAR SVM – it doesn’t know where the ‘rectangle’ of data came from


An array of #’s
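A minimal sketch (again mine, not the course’s code) of building that array of numbers outside the SVM: given any kernel function, produce the rectangle of derived features that the plain linear SVM code can then consume.

```python
import numpy as np

def kernel_matrix(X, kernel):
    """X: (n, d) array of examples. Returns the n-by-n array whose (i, j) entry is K(exi, exj)."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K
```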

Page 10


Concrete Example
Use the ‘squaring’ kernel to convert the following set of examples

K(exi, exj) = (exi • exj)²


Raw Data
      F1   F2   Output
Ex1    4    2   T
Ex2   -6    3   T
Ex3   -5   -1   F

Derived Features
      K(exi, ex1)   K(exi, ex2)   K(exi, ex3)   Output
Ex1        ?             ?             ?        T
Ex2        ?             ?             ?        T
Ex3        ?             ?             ?        F

Page 11


Concrete Example (w/ answers)

Use the ‘squaring’ kernel to convert the following set of examples

K(exi, exj) = (exi • exj)²


Raw Data
      F1   F2   Output
Ex1    4    2   T
Ex2   -6    3   T
Ex3   -5   -1   F

Derived Features
      K(exi, ex1)   K(exi, ex2)   K(exi, ex3)   Output
Ex1       400           324           484       T
Ex2       324          2025           729       T
Ex3       484           729           676       F

Probably want to divide this by 1000 to scale the derived features
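A quick numeric check of the table above (a sketch using NumPy; the helper name is mine):

```python
import numpy as np

X = np.array([[4, 2], [-6, 3], [-5, -1]])            # Ex1, Ex2, Ex3
squaring_kernel = lambda x, z: float(np.dot(x, z)) ** 2

K = np.array([[squaring_kernel(xi, xj) for xj in X] for xi in X])
print(K)          # [[ 400.  324.  484.]
                  #  [ 324. 2025.  729.]
                  #  [ 484.  729.  676.]]
print(K / 1000)   # the suggested rescaling of the derived features
```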

Page 12


A Simple Example of a Kernel Creating a Non-Linear Separation: Assume K(A,B) = - distance(A,B)


[Figure: left, the nine examples ex1 … ex9 plotted in the original feature space; right, the same examples plotted in the kernel-produced feature space (only two dimensions shown), with K(exi, ex1) on one axis and K(exi, ex6) on the other]

Separating surface in the original feature space (non-linear!)

Separating plane in the derived space
Model: if (K(exnew, ex1) > -5) then GREEN else RED

Page 13


Our 1-Norm SVM with Kernels

min over α, S of   ||α||₁ + μ ||S||₁
such that
  pos ex’s:  { ∑ αj K(xj, xposi) } + Si ≥ +1
  neg ex’s:  { ∑ αj K(xj, xnegk) } – Sk ≤ –1
  Sm ≥ 0

We use α instead of w to indicate we’re weighting similarities rather than ‘raw’ features

The same linear LP code can be used – simply create the K()’s externally!
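Continuing the earlier hypothetical sketches (the gaussian_kernel and linear_svm_1norm helpers above are my illustrations, not course code), “creating the K()’s externally” might look like this, where X is an (n, d) array of training examples and y its +1/-1 labels:

```python
# Build the kernel 'rectangle' outside the LP, then hand it to the same linear code;
# the learned "weights" are now the alphas.
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
alpha, slacks = linear_svm_1norm(K, y, mu=1.0)
```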


Page 14

The Kernel ‘Trick’
• The Linear SVM can be written (using the [not-on-final] primal-dual concept of LPs) as
      min over α:   ½ ∑i ∑j yi yj αi αj (exi • exj)  –  ∑i αi

• Whenever we see the dot product exi • exj
• We can replace it with a kernel K(exi, exj) – this is called the ‘kernel trick’ (http://en.wikipedia.org/wiki/Kernel_trick)
• This trick is not only for SVMs; ie, ‘kernel machines’ are a broad ML topic
  – we can use ‘similarity to examples’ as features for ANY ML algo!
  – eg, run d-trees with kernelized features



Page 15

Kernels and Mercer’s Theorem

K(x, y)’s that are
  – continuous
  – symmetric: K(x, y) = K(y, x)
  – positive semidefinite (they create a square Hermitian matrix whose eigenvalues are all non-negative; see en.wikipedia.org/wiki/Positive_semidefinite_matrix)
are equivalent to a dot product in some space:   K(x, y) = Φ(x) • Φ(y)

Note: can use any similarity function to create a new ‘feature space’ and solve with a linear SVM, but the ‘dot product in a derived space’ interpretation will be lost unless Mercer’s Theorem holds


Page 16


The New Space for a Sample Kernel

Let K(x, z) = (x • z)² and let # features = 2

(x • z)² = (x1z1 + x2z2)²
         = x1x1z1z1 + x1x2z1z2 + x2x1z2z1 + x2x2z2z2
         = <x1x1, x1x2, x2x1, x2x2> • <z1z1, z1z2, z2z1, z2z2>

• Our new feature space has 4 dimensions, and we’re doing a dot product in it!
• Note: if we had used an exponent > 2, we’d have gotten a much larger ‘virtual’ feature space for very little cost!

Notation: <a, b, …, z> indicates a vector, with its components explicitly listed

Key point: we don’t explicitly create the expanded ‘raw’ feature space, but the result is the same as if we did
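A tiny numeric check of that key point (my own sketch; the particular x and z values are arbitrary): compute the kernel directly in the original 2-D space and via the explicit 4-D map, and the two agree.

```python
import numpy as np

def phi(v):
    """Explicit map into the 4-D space: <v1v1, v1v2, v2v1, v2v2>."""
    v1, v2 = v
    return np.array([v1 * v1, v1 * v2, v2 * v1, v2 * v2])

x, z = np.array([5.0, 4.0]), np.array([2.0, -3.0])
print(np.dot(x, z) ** 2)         # 4.0, computed in the original 2-D space
print(np.dot(phi(x), phi(z)))    # 4.0, the same value via the 4-D feature space
```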


Page 17


Review: Matrix Multiplication


From (code also there): http://www.cedricnugteren.nl/tutorial.php?page=2

A • B = C, where
  Matrix A is K by M
  Matrix B is N by K
  Matrix C is M by N

Page 18


The Kernel Matrix

Let A be our usual array: one example per row, one (standard) feature per column

A’ is ‘A transpose’ (rotate A around its diagonal): one (standard) feature per row, one example per column

The Kernel Matrix is K(A, A’)
[Figure: A (examples × features) times A’ (features × examples) yields the square kernel matrix K (examples × examples)]


Page 19

The Reduced SVM (Lee & Mangasarian, 2001)

• With kernels, learned models are weighted sums of similarities to some of the training examples

• The kernel matrix is size O(N²), where N = # of ex’s
  – with ‘big data’, squaring can be prohibitive!

• But there is no reason all training examples need to be candidate ‘exemplars’

• Can randomly (or cleverly) choose a subset as candidates; its size can scale O(N)


[Figure: the kernel matrix K(ei, ej), one row and one column per example, with a few columns highlighted]

Create (and use) only these blue columns
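A sketch (mine) of that idea: keep all N rows of the kernel matrix but only m randomly chosen columns, so the linear program sees an N-by-m rectangle instead of N-by-N.

```python
import numpy as np

def reduced_kernel_matrix(K_full, m, rng=np.random.default_rng()):
    """Randomly pick m training examples as candidate exemplars (columns of K)."""
    n = K_full.shape[0]
    cols = rng.choice(n, size=m, replace=False)
    return K_full[:, cols], cols   # N x m matrix, plus which exemplars were kept
```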

Page 20


More on Kernels

K(x, z) = tanh(c (x • z) + d)
Relates to the sigmoid of ANN’s (here the # of HU’s is determined by the # of support vectors)
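In the same hedged style as the earlier kernel sketches (the function name and defaults are mine), the tanh kernel could be written as:

```python
import numpy as np

def tanh_kernel(x, z, c=1.0, d=0.0):
    """Tanh ('sigmoid-like') kernel: tanh(c * (x . z) + d), with c and d free parameters."""
    return float(np.tanh(c * np.dot(x, z) + d))
```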

How to choose a good kernel function?
  – Use a tuning set
  – Or just use the Gaussian kernel
  – Some theory exists
  – A sum of kernels is a kernel (and other ‘closure’ properties)
  – Don’t want the kernel matrix to be all 0’s off the diagonal, since we want to model examples as sums of other ex’s


Page 21


The Richness of Kernels

• Kernels need not solely be similarities computed on numeric data!

• Or where ‘raw’ data is in a rectangle

• Can define similarity between examples represented as
  – trees (eg, parse trees in NLP; see the image above) – count common subtrees, say
  – sequences (eg, DNA sequences)


Page 22


Using Gradient Descent Instead of Linear Programming
• Recall last lecture we said that perceptron training with weight decay was quite similar to SVM training

• This is still the case with kernels; ie, we create a new data set outside the perceptron code and use gradient descent

• So here we get the non-linearity provided by HUs in a ‘hard-wired’ fashion (ie, by using the kernel to non-linearly compute a new representation of the data)
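A rough sketch of what that might look like (my own illustration; the learning rate, decay constant, and epoch count are assumptions, not values from the lecture): the kernel matrix is built outside the learner, then an ordinary perceptron with weight decay is run on it.

```python
import numpy as np

def train_kernel_perceptron(K, y, lr=0.01, decay=1e-3, epochs=100):
    """K: (n, n) kernel matrix (row i = kernelized features of example i);
       y: length-n array of +1/-1 labels."""
    n = K.shape[0]
    alpha = np.zeros(n)                            # one weight per candidate exemplar
    for _ in range(epochs):
        for i in range(n):
            pred = 1.0 if K[i] @ alpha > 0 else -1.0
            if pred != y[i]:
                alpha += lr * y[i] * K[i]          # perceptron update on row i
            alpha -= lr * decay * np.sign(alpha)   # 1-norm style weight decay
    return alpha
```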


Page 23


SVM Wrapup

• For approx a decade, SVMs were the ‘hottest’ topic in ML (Deep NNs now are)

• Formalize nicely the task of finding a simple model with few ‘outliers’

• Use hard-wired ‘kernels’ to do the job done by HUs in ANNs

• Kernels can be used in any ML algo

– just preprocess the data to create ‘kernel’ features

– can handle examples that aren’t fixed-length feature vectors

• Lots of good theory and empirical results