
  • Support Vector Machines

    Computational Linguistics: Jordan Boyd-Graber, University of Maryland, September 5, 2018

    Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah


  • Roadmap

    Classification: machines labeling data for us

    Previously: naïve Bayes and logistic regression

    This time: SVMs, (another) example of a linear classifier

    Good classification accuracy

    Good theoretical properties


  • Thinking Geometrically

    Suppose you have two classes: vacations and sports

    Suppose you have four documents

    Sports

    Doc1: {ball, ball, ball, travel}

    Doc2: {ball, ball}

    Vacations

    Doc3: {travel, ball, travel}

    Doc4: {travel}

    What does this look like in vector space?


  • Put the documents in vector space

    [Figure: the four documents plotted as points in the plane; x-axis "Ball" (0–3), y-axis "Travel" (0–3)]

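    A minimal sketch of how those points are computed (plain Python; the vocabulary {ball, travel} and the four documents are from the slides):

        from collections import Counter

        docs = {
            "Doc1": ["ball", "ball", "ball", "travel"],  # sports
            "Doc2": ["ball", "ball"],                    # sports
            "Doc3": ["travel", "ball", "travel"],        # vacations
            "Doc4": ["travel"],                          # vacations
        }
        vocab = ["ball", "travel"]  # the two axes of the vector space

        for name, tokens in docs.items():
            counts = Counter(tokens)
            print(name, [counts[term] for term in vocab])
        # Doc1 [3, 1]   Doc2 [2, 0]   Doc3 [1, 2]   Doc4 [0, 1]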

  • Vector space representation of documents

    Each document is a vector, one component for each term.

    Terms are axes.

    High dimensionality: 10,000s of dimensions and more

    How can we do classification in this space?

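    A hedged sketch of the same idea at realistic scale, assuming scikit-learn is available (the slides do not name a library): a bag-of-words vectorizer maps each document to a sparse vector with one component per vocabulary term.

        from sklearn.feature_extraction.text import CountVectorizer

        corpus = [
            "ball ball ball travel",  # Doc1
            "ball ball",              # Doc2
            "travel ball travel",     # Doc3
            "travel",                 # Doc4
        ]
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(corpus)       # sparse docs-by-terms matrix
        print(vectorizer.get_feature_names_out())  # ['ball' 'travel']
        print(X.toarray())

    On a real corpus the same two lines produce vectors with tens of thousands of components, which is why sparse storage matters.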

  • Vector space classification

    As before, the training set is a set of documents, each labeled with its class.

    In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.

    Premise 1: Documents in the same class form a contiguous region.

    Premise 2: Documents from different classes don’t overlap.

    We define lines, surfaces, hypersurfaces to divide regions.


  • Classes in the vector space

    [Figure: labeled points from three classes, China, Kenya, and UK, in the vector space, plus one new unlabeled document marked ?; later builds draw separators between the classes]

    Should the document ? be assigned to China, UK, or Kenya?

    Find separators between the classes.

    Based on these separators: ? should be assigned to China.

    How do we find separators that do a good job at classifying new documents like ?? That is the main topic of today.

  • Linear Classifiers

    Linear classifiers

    Definition: A linear classifier computes a linear combination or weighted sum ∑i βixi of the feature values.

    Classification decision: ∑i βixi > β0? (β0 is our bias) . . . where β0 (the threshold) is a parameter.

    We call this the separator or decision boundary.

    We find the separator based on the training set.

    Methods for finding the separator: logistic regression, naïve Bayes, linear SVM.

    Assumption: The classes are linearly separable.

    Before, we just talked about equations. What's the geometric intuition?
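    The decision rule is small enough to write out directly; a minimal sketch (the weights below are illustrative, not learned):

        # Linear classifier: predict class c when sum_i beta_i * x_i > beta_0.
        def linear_classify(beta, beta0, x):
            score = sum(b * xi for b, xi in zip(beta, x))
            return score > beta0

        beta = [1.0, -1.0]   # illustrative: + weight on "ball", - on "travel"
        beta0 = 0.0
        print(linear_classify(beta, beta0, [3, 1]))  # Doc1 -> True  (sports)
        print(linear_classify(beta, beta0, [1, 2]))  # Doc3 -> False (vacations)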

  • Linear Classifiers

    A linear classifier in 1D

    A linear classifier in 1D is a point x described by the equation β1x1 = β0, i.e., x = β0/β1.

    Points (x1) with β1x1 ≥ β0 are in the class c.

    Points (x1) with β1x1 < β0 are in the complement class c̄.

  • Linear Classifiers

    A linear classifier in 2D

    A linear classifier in 2D is a line described by the equation β1x1 + β2x2 = β0.

    Points (x1, x2) with β1x1 + β2x2 ≥ β0 are in the class c.

    Points (x1, x2) with β1x1 + β2x2 < β0 are in the complement class c̄.

  • Linear Classifiers

    A linear classifier in 3D

    A linear classifier in 3D is a plane described by the equation β1x1 + β2x2 + β3x3 = β0.

    Points (x1, x2, x3) with β1x1 + β2x2 + β3x3 ≥ β0 are in the class c.

    Points (x1, x2, x3) with β1x1 + β2x2 + β3x3 < β0 are in the complement class c̄.

    In general, a linear classifier in M dimensions is a hyperplane ∑_{i=1}^{M} βixi = β0.

  • Linear Classifiers

    Naive Bayes and Logistic Regression as linear classifiers

    Multinomial Naive Bayes is a linear classifier (in log space) defined by:

    ∑_{i=1}^{M} βixi = β0

    where βi = log[P̂(ti|c)/P̂(ti|c̄)], xi = number of occurrences of ti in d, and β0 = −log[P̂(c)/P̂(c̄)]. Here, the index i, 1 ≤ i ≤ M, refers to terms of the vocabulary.

    Logistic regression is the same (we only put it into the logistic function to turn it into a probability).

    Takeaway

    Naïve Bayes, logistic regression, and SVMs are all linear methods. They choose their hyperplanes based on different objectives: joint likelihood (NB), conditional likelihood (LR), and the margin (SVM).
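    A hedged sketch of this construction on the toy corpus, using add-one smoothing for the P̂ estimates (an assumption; the slide does not specify the estimator):

        import math

        vocab = ["ball", "travel"]
        counts_c    = {"ball": 5, "travel": 1}  # sports docs (Doc1 + Doc2)
        counts_cbar = {"ball": 1, "travel": 3}  # vacation docs (Doc3 + Doc4)

        def smoothed(counts):
            total = sum(counts.values()) + len(vocab)  # add-one smoothing
            return {t: (counts.get(t, 0) + 1) / total for t in vocab}

        p_c, p_cbar = smoothed(counts_c), smoothed(counts_cbar)
        beta = {t: math.log(p_c[t] / p_cbar[t]) for t in vocab}  # beta_i
        beta0 = -math.log(0.5 / 0.5)  # equal class priors -> beta_0 = 0

        doc = {"ball": 2}  # Doc2
        score = sum(beta[t] * doc.get(t, 0) for t in vocab)
        print(score > beta0)  # True: Naive Bayes classifies Doc2 as sports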


  • Linear Classifiers

    Which hyperplane?

    For linearly separable training sets: there are infinitely many separating hyperplanes.

    They all separate the training set perfectly . . .

    . . . but they behave differently on test data.

    Error rates on new data are low for some, high for others.

    How do we find a low-error separator?


  • Support Vector Machines

    Support vector machines

    Machine-learning research in the last two decades has improved classifier effectiveness.

    New generation of state-of-the-art classifiers: support vector machines(SVMs), boosted decision trees, regularized logistic regression, neuralnetworks, and random forests

    Applications to IR problems, particularly text classification

    SVMs: A kind of large-margin classifier

    Vector-space-based machine-learning method aiming to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise)


  • Support Vector Machines

    Support Vector Machines

    Two-class training data

    Decision boundary → linear separator

    Criterion: being maximally far away from any data point → determines the classifier margin

    Linear separator position defined by support vectors

    [Figure: two-class training data with the maximum-margin decision hyperplane; the margin is maximized and the support vectors lie on its boundary]

  • Support Vector Machines

    Why maximize the margin?

    Points near the decision surface → uncertain classification decisions (50% either way).

    A classifier with a large margin makes no low-certainty classification decisions.

    Gives a classification safety margin with respect to slight errors in measurement or document variation.

  • Support Vector Machines

    Why maximize the margin?

    SVM classifier: large margin around the decision boundary

    Compare to a bare decision hyperplane: placing a fat separator between the classes yields a unique solution

    Decreased memory capacity

    Increased ability to correctly generalize to test data

  • Formulation

    Equation

    Equation of a hyperplane:

    w⃗ · x + b = 0  (1)

    Distance of a point xi to the hyperplane:

    |w⃗ · xi + b| / ||w⃗||  (2)

    The margin ρ is given by

    ρ ≡ min_{(x,y)∈S} |w⃗ · x + b| / ||w⃗|| = 1/||w⃗||  (3)

    This is because for any point on the marginal hyperplane, w⃗ · x + b = ±1.
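    A minimal numeric check of equations (2) and (3), assuming NumPy (the weights are illustrative):

        import numpy as np

        w = np.array([1.0, -1.0])  # illustrative weights
        b = 0.0

        def distance(x):  # equation (2): |w.x + b| / ||w||
            return abs(np.dot(w, x) + b) / np.linalg.norm(w)

        # Equation (3): with the canonical scaling w.x + b = +-1 on the
        # marginal hyperplanes, the margin is 1 / ||w||.
        print(distance(np.array([3.0, 1.0])))  # distance of Doc1
        print(1.0 / np.linalg.norm(w))         # the margin rho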

  • Formulation

    Optimization Problem

    We want to find a weight vector w⃗ and bias b that optimize

    min_{w⃗, b} ½||w⃗||²  (4)

    subject to yi(w⃗ · xi + b) ≥ 1, ∀i ∈ [1, m].
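    A hedged sketch of solving this in practice, assuming scikit-learn (a very large C makes the soft-margin objective approximate the hard-margin problem above):

        import numpy as np
        from sklearn.svm import SVC

        X = np.array([[3, 1], [2, 0], [1, 2], [0, 1]])  # the four toy docs
        y = np.array([1, 1, -1, -1])                    # sports / vacations

        clf = SVC(kernel="linear", C=1e6).fit(X, y)
        w, b = clf.coef_[0], clf.intercept_[0]
        print("w =", w, "b =", b, "margin =", 1 / np.linalg.norm(w))
        print(clf.support_vectors_)  # the points that define the separator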

  • Formulation

    Choosing what kind of classifier to use

    When building a text classifier, the first question: how much training data is currently available?

    None? Hand-write rules or use active learning.

    Very little? Naïve Bayes.

    A fair amount? SVM.

    A huge amount? Doesn't matter; use whatever works.

  • Formulation

    SVM extensions: What’s next

    Finding solutions

    Slack variables: not a perfect line

    Kernels: different geometries

    [Figure: scatter plot on the unit square, both axes running 0.0–1.0]

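    A hedged sketch of both extensions, assuming scikit-learn: a finite C introduces slack variables (the separator need not be perfect), and swapping the kernel changes the geometry of the boundary.

        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 2))
        y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # circular classes

        linear = SVC(kernel="linear", C=1.0).fit(X, y)  # slack, straight line
        rbf = SVC(kernel="rbf", C=1.0).fit(X, y)        # kernel, curved boundary
        print(linear.score(X, y))  # mediocre: not linearly separable
        print(rbf.score(X, y))     # near 1.0: the RBF kernel fits the circle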