
Support Vector Machines

Graphic generated with Lucent Technologies Demonstration 2-D Pattern Recognition Applet at http://svm.research.bell-labs.com/SVT/SVMsvt.html

Slide 2: David R. Musicant

Separating Line (or hyperplane)

[Figure: training points labeled Class 1 and Class -1, separated by a line]

Goal: Find the best line (or hyperplane) to separate the training data. How do we formalize this?
– In two dimensions, the equation of the line is given by: $w_1 x + w_2 y = b$
– Better notation for n dimensions: $\vec{w} \cdot \vec{x} = b$, i.e. $\sum_{i=1}^{n} w_i x_i = b$

Slide 3: David R. Musicant

Simple Classifier

The Simple Classifier:

– Points that fall on the right are classified as "1".
– Points that fall on the left are classified as "-1".

Therefore: using the training set, find a hyperplane (line) so that

[Figure: Class 1 and Class -1 points on either side of a separating line]

$\vec{w} \cdot \vec{x}_i > b$ for $i \in$ class 1
$\vec{w} \cdot \vec{x}_i < b$ for $i \in$ class $-1$

This is a perceptron! How can we improve on the perceptron?
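For concreteness, here is a minimal NumPy sketch of the perceptron rule described above; the function names and the simple mistake-driven training loop are my own additions, not from the slides.

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Find w, b with w.x > b for class +1 points and w.x < b for class -1 points."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) - b) <= 0:   # point on the wrong side (or on the line)
                w += yi * xi                     # nudge the hyperplane toward correcting it
                b -= yi
                mistakes += 1
        if mistakes == 0:                        # converged: every training point classified correctly
            break
    return w, b

def perceptron_predict(X, w, b):
    return np.where(X @ w > b, 1, -1)
```

The perceptron stops at the first hyperplane that separates the data, whatever it happens to be; the next slides ask which separating hyperplane is actually best.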

Slide 4: David R. Musicant

Finding the Best Plane

Not all planes are equal. Which of the two planes shown is better?

[Figure: two candidate separating planes for the Class 1 and Class -1 points]

Both planes accurately classify the training set. The green plane is the better choice, since it is more likely to do well on future test data. The green plane is further away from the data.

Slide 5: David R. Musicant

Separating the Planes

Construct the bounding planes:
– Draw two planes parallel to the classification plane.
– Push them as far apart as possible, until they hit data points.
– The classification plane whose bounding planes are furthest apart is the best one.

[Figure: classification plane with its two bounding planes between the Class 1 and Class -1 points]

$\vec{w} \cdot \vec{x} = b$ (classification plane)
$\vec{w} \cdot \vec{x} = b+1$ and $\vec{w} \cdot \vec{x} = b-1$ (bounding planes)

Slide 6: David R. Musicant

Recap: Finding the Best Plane

Details:
– All points in class 1 should be to the right of bounding plane 1: $\vec{w} \cdot \vec{x}_i \geq b+1$
– All points in class -1 should be to the left of bounding plane -1: $\vec{w} \cdot \vec{x}_i \leq b-1$
– Pick $y_i$ to be +1 or -1 depending on the classification. Then the above two inequalities can be written as one: $y_i(\vec{w} \cdot \vec{x}_i - b) \geq 1$
– The distance between the bounding planes should be maximized.
– The distance between the bounding planes is given by: $\dfrac{2}{\sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}} = \dfrac{2}{\|\vec{w}\|_2}$

[Figure: bounding planes with the Class 1 and Class -1 training points]
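A short derivation (standard, though not spelled out on the slide) of why the distance between the bounding planes equals $2/\|\vec{w}\|_2$: the plane $\vec{w} \cdot \vec{x} = c$ sits at signed distance $c/\|\vec{w}\|_2$ from the origin along the unit normal $\vec{w}/\|\vec{w}\|_2$, so

\[
\text{distance between } \vec{w} \cdot \vec{x} = b+1 \text{ and } \vec{w} \cdot \vec{x} = b-1
\;=\; \frac{(b+1) - (b-1)}{\|\vec{w}\|_2}
\;=\; \frac{2}{\|\vec{w}\|_2}.
\]

Maximizing this margin is therefore the same as minimizing $\|\vec{w}\|_2$, or equivalently $\tfrac{1}{2}\|\vec{w}\|^2$, which is exactly the objective on the next slide.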

Slide 7: David R. Musicant

The Optimization Problem

The previous slide can be rewritten as:

$\min_{w,\,b} \; \tfrac{1}{2}\|\vec{w}\|^2$  such that  $y_i(\vec{w} \cdot \vec{x}_i - b) \geq 1$

This is a mathematical program.

– Optimization problem subject to constraints
– More specifically, this is a quadratic program
– There are high-powered software tools for solving this kind of problem (both commercial and academic)

No special algorithms are necessary (in theory...)
– Just enter this problem and the associated data into a quadratic programming solver (like CPLEX), and let it find an answer. A rough sketch follows below.
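To make the "no special algorithms" point concrete, here is a hedged sketch that hands the hard-margin problem to a generic QP solver. It assumes the open-source cvxopt package as a stand-in for a commercial solver such as CPLEX; the helper name and the toy data are invented for illustration.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def hard_margin_svm(X, y):
    """min (1/2)||w||^2  s.t.  y_i (w.x_i - b) >= 1, with variables z = [w_1..w_n, b]."""
    m, n = X.shape
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = np.eye(n)      # quadratic term acts on w only...
    P[n, n] = 1e-8             # ...tiny term on b purely for numerical robustness
    q = np.zeros(n + 1)
    # y_i (w.x_i - b) >= 1  rewritten in the solver's form  G z <= h:
    #   -y_i x_i^T w + y_i b <= -1
    G = np.hstack([-y[:, None] * X, y[:, None]])
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]         # w, b

# Tiny separable example: class +1 near (2, 2), class -1 near (0, 0).
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b, np.sign(X @ w - b))    # the signs should match y
```

This only works when a separating plane exists, which is exactly the limitation the next slide addresses.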

Slide 8: David R. Musicant

Data Which Is Not Linearly Separable

What if a separating plane does not exist?

[Figure: overlapping Class 1 and Class -1 points; a misclassified point is marked "error"]

Find the plane that maximizes the margin and minimizes the errors on the training points.

Take the original inequality and add a slack variable $\xi_i$ to measure the error:

$\vec{w} \cdot \vec{x}_i \geq b+1 \quad\longrightarrow\quad \vec{w} \cdot \vec{x}_i + \xi_i \geq b+1$

Slide 9: David R. Musicant

The Support Vector Machine

Push the planes apart and minimize the error at the same time:

$\min_{w,\, b,\, \xi_i \geq 0} \; \tfrac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{m} \xi_i$  such that  $y_i(\vec{w} \cdot \vec{x}_i - b) + \xi_i \geq 1$

C is a positive number that is chosen to balance these two goals. This problem is called a Support Vector Machine, or SVM. The SVM is one of many techniques for doing supervised machine learning.
– Others: neural networks, decision trees, k-nearest neighbor
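As a rough illustration (not from the slides), the same soft-margin problem can be solved with an off-the-shelf SVM implementation; here scikit-learn's SVC with a linear kernel, on invented data, where C plays exactly the balancing role described above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=2.0, size=(20, 2)),     # Class  1
               rng.normal(loc=-2.0, size=(20, 2))])   # Class -1
y = np.array([1] * 20 + [-1] * 20)

# C trades off a wide margin (small C) against small training error (large C).
clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)   # learned w and intercept (scikit-learn writes w.x + b0 = 0, so b0 = -b)
print(clf.score(X, y))             # fraction of training points classified correctly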

Slide 10: David R. Musicant

Terminology

Those points that touch the bounding plane, or lie on the wrong side, are called support vectors.

If all the data points except the support vectors were removed, the solution would turn out the same.

The SVM is mathematically equivalent to force and torque equilibrium (hence the name support vectors).
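A small sketch of the "only the support vectors matter" claim, again assuming scikit-learn's SVC and synthetic data (neither is in the slides): refitting on the support vectors alone should reproduce essentially the same plane.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (30, 2)), rng.normal(-2.0, 1.0, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel='linear', C=10.0).fit(X, y)
sv = clf.support_                         # indices of the support vectors

# Refit using only the support vectors: the plane should come out essentially unchanged.
clf_sv = SVC(kernel='linear', C=10.0).fit(X[sv], y[sv])
print(clf.coef_, clf.intercept_)
print(clf_sv.coef_, clf_sv.intercept_)    # should agree with the line above up to solver tolerance
```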

Slide 11: David R. Musicant

Research: Solving Massive SVMs

The standard SVM is solved using a canned quadratic programming (QP) solver. Problem:
– Standard tools bring all the data into memory. If the dataset is bigger than memory, you are out of luck.

How do other supervised learning techniques handle data that does not fit in memory?

Why not use virtual memory? Let the operating system manage which data the QP solver is using.
– Answer: The QP solver accesses data in a random, not a sequential, fashion. The cost to page data in and out of memory is enormous.

Slide 12: David R. Musicant

What about nonlinear surfaces?

Generated with Lucent Technologies Demonstration 2-D Pattern Recognition Applet at http://svm.research.bell-labs.com/SVT/SVMsvt.html

Some datasets may not be best separated by a plane.

How can we do nonlinear separating surfaces?

Simple method: Map into a higher dimensional space, and do the same thing we have already done.

Slide 13: David R. Musicant

Finding Nonlinear Surfaces

How can we modify the algorithm to find nonlinear surfaces?
– First idea (simple and effective): map each data point into a higher-dimensional space, and find a linear fit there.

Example: Find a quadratic surface for the data

x1  x2  x3
 3   5   7
 4   6   2

Map to the quadratic coordinates $z_1 = x_1^2$, $z_2 = x_2^2$, $z_3 = x_3^2$, $z_4 = x_1 x_2$, $z_5 = x_1 x_3$, $z_6 = x_2 x_3$, $z_7 = x_1$, $z_8 = x_2$, $z_9 = x_3$:

z1  z2  z3  z4  z5  z6  z7  z8  z9
 9  25  49  15  21  35   3   5   7
16  36   4  24   8  12   4   6   2

Use the new coordinates in a regular linear SVM.

A plane in this quadratic space is equivalent to a quadratic surface in our original space.
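A minimal sketch of the quadratic feature map tabulated above; the helper name is my own addition.

```python
import numpy as np

def quadratic_features(x):
    """Map (x1, x2, x3) to (x1^2, x2^2, x3^2, x1x2, x1x3, x2x3, x1, x2, x3)."""
    x1, x2, x3 = x
    return np.array([x1**2, x2**2, x3**2, x1*x2, x1*x3, x2*x3, x1, x2, x3])

print(quadratic_features([3, 5, 7]))   # [ 9 25 49 15 21 35  3  5  7]
print(quadratic_features([4, 6, 2]))   # [16 36  4 24  8 12  4  6  2]
# Stacking these rows gives the z-coordinates that would be fed to an ordinary linear SVM.
```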

Slide 14: David R. Musicant

Problem & Solution

If the dimensionality of the space is high, there are lots of calculations:
– For a high-degree polynomial space, the number of coordinate combinations explodes.
– All these calculations must be done for all training points, and again for each testing point.
– Infinite-dimensional spaces are impossible.

Nonlinear surfaces can be used without these problems through the use of a kernel function (see the sketch below).
– Demonstration: http://svm.cs.rhul.ac.uk/pagesnew/GPat.shtml
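One way to see the kernel idea in action, assuming scikit-learn and an invented circular dataset: the polynomial kernel $k(\vec{x}, \vec{y}) = (\vec{x} \cdot \vec{y} + 1)^d$ behaves like an inner product in a space of all monomials up to degree d (with particular weights), without ever building those coordinates explicitly.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)   # circular boundary: not linearly separable

lin  = SVC(kernel='linear', C=1.0).fit(X, y)
quad = SVC(kernel='poly', degree=2, coef0=1.0, C=1.0).fit(X, y)
print(lin.score(X, y), quad.score(X, y))   # the quadratic kernel should fit this data far better
```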

Slide 15: David R. Musicant

Example: Checkerboard

Slide 16: David R. Musicant

5-Nearest Neighbor

Slide 17: David R. Musicant

Sixth degree polynomial kernel
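A hedged sketch of how one might recreate the comparison shown in the last three slides, using a synthetic 4x4 checkerboard and scikit-learn stand-ins for the 5-nearest-neighbor and degree-6 polynomial-kernel classifiers (the original figures came from the Lucent applet, not this code):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.uniform(0, 4, (2000, 2))
# 4x4 checkerboard labels: +1 on "black" squares, -1 on "white" ones.
y = np.where((np.floor(X[:, 0]) + np.floor(X[:, 1])) % 2 == 0, 1, -1)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
svm = SVC(kernel='poly', degree=6, coef0=1.0, C=10.0).fit(X, y)
print(knn.score(X, y), svm.score(X, y))   # compare the two decision rules on the training data
```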