
Nov 23rd, 2001
Copyright © 2001, 2003, Andrew W. Moore

Linear Document Classifier


Linear Classifiers

• Binary classification
  • y=+1 for positive class, y=-1 for negative class

• Vector representation for documents

[Figure: documents plotted against word axes a and b; markers denote +1 and -1.]

How would you classify this data?




Decision Boundary

[Figure: documents d1, d2, d3, d4 plotted against word axes a and b, with the line f(d) separating them.]

1. How to classify documents using f(d)?

2. How to find the line f(d)?

• w_a and w_b are the weights for words a and b


How to Classify Documents?

[Figure: documents d1, d2, d3, d4 and the line f(d), plotted against word axes a and b.]
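The formula on this slide did not survive extraction. As a minimal sketch, assume f(d) is a weighted sum of the counts of words a and b plus a bias term, and that a document is labeled +1 when f(d) > 0 and -1 otherwise. The bias term, the function names, and all numeric values below are illustrative assumptions, not taken from the slides.

```python
def f(d, w, bias=0.0):
    """Linear score of document d; d maps word -> count, w maps word -> weight."""
    return sum(w.get(word, 0.0) * count for word, count in d.items()) + bias


def classify(d, w, bias=0.0):
    """Return +1 if d lies on the positive side of the line f(d) = 0, else -1."""
    return 1 if f(d, w, bias) > 0 else -1


# Hypothetical weights for words a and b (not from the slides).
w = {"a": 1.0, "b": -1.0}
d1 = {"a": 3, "b": 1}   # more occurrences of a than b
d2 = {"a": 1, "b": 4}   # more occurrences of b than a
print(classify(d1, w), classify(d2, w))
```

With these made-up weights, documents dominated by word a fall on the positive side of the line and documents dominated by word b on the negative side.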




Perceptron Algorithm

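• Initialize the weight vector w
• Repeat
  • Receive a labeled document (d, y), where y = +1 or -1
  • Check whether document d is classified correctly: is y f(d) > 0?
    • Yes: do nothing
    • No: update $w \leftarrow w + y\,d$

[Figure: documents d1, d2, d3, d4 and the line f(d), plotted against word axes a and b.]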
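The loop above can be sketched in a few lines. This is an illustration under stated assumptions rather than the lecturer's code: it takes f(d) to be the dot product of w and d with no bias term, starts from zero weights (the slide's initial value was lost), and stops after a full pass with no mistakes. The toy documents are made up.

```python
def dot(w, d):
    """Dot product of the weight vector and a document vector."""
    return sum(wi * di for wi, di in zip(w, d))


def perceptron(examples, epochs=10):
    """Train on (d, y) pairs with y in {+1, -1}; return the weight vector w."""
    w = [0.0] * len(examples[0][0])      # assumption: start from zero weights
    for _ in range(epochs):
        mistakes = 0
        for d, y in examples:
            if y * dot(w, d) <= 0:       # misclassified, or exactly on the line
                w = [wi + y * di for wi, di in zip(w, d)]   # w <- w + y d
                mistakes += 1
        if mistakes == 0:                # a full pass with no updates: done
            break
    return w


# Toy documents as (count of a, count of b); values are invented.
data = [((3, 1), +1), ((4, 2), +1), ((1, 3), -1), ((2, 5), -1)]
w = perceptron(data)
print(w)
```

On linearly separable data such as this toy set, the loop stops once every document satisfies y f(d) > 0; on non-separable data it simply runs for the given number of epochs.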







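Geometrical Interpretation

[Figure: documents d1, d2, d3, d4 and the line f(d), plotted against word axes a and b, illustrating the update $w \leftarrow w + y\,d$.]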




Linear Classifiers

[Figure: markers denote +1 and -1; several candidate separating lines are drawn.]

Any of these would be fine, but which is best?


Classifier Margin

[Figure: markers denote +1 and -1.]

Define the margin of a linear classifier as the width by which the boundary could be widened before hitting a data point.


Maximum Margin

[Figure: markers denote +1 and -1.]

The maximum margin linear classifier is the linear classifier with the maximum margin.



This classifier is called a linear Support Vector Machine (SVM).
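To make the margin concrete, the sketch below measures the distance from a boundary w·x + b = 0 to its nearest data point. If the boundary is widened symmetrically, the width described above is twice this distance. The data points and the two candidate boundaries are invented for illustration, and the function name is an assumption.

```python
import math


def margin(w, b, points):
    """Smallest signed distance from any labeled point to the line w.x + b = 0.

    The result is negative if some point is on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in points
    )


points = [((2, 0), +1), ((3, 1), +1), ((-1, 0), -1), ((-2, 2), -1)]

# Two hypothetical boundaries that both separate the data.
m1 = margin((1, 0), 0.0, points)    # vertical line x = 0
m2 = margin((1, 0), -0.5, points)   # vertical line x = 0.5
print(m1, m2)
```

Both lines classify every point correctly, but the second leaves more room on each side, which is the sense in which a maximum margin classifier prefers it.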


Empirical Studies with Text Categorization

• 10 Categories from Reuters-21578

• For a few categories, the SVM method significantly outperforms the KNN approach

Classification accuracy

Category    KNN     SVM
earn        97.3    98.0
acq         92.0    93.6
money-fx    78.2    74.5
grain       82.2    94.6
crude       85.7    88.9
trade       77.4    75.9
interest    74.0    77.7
ship        79.2    85.6
wheat       76.6    91.8
corn        77.9    90.3




Doing multi-class classification

• SVMs can only handle two-class outputs (i.e., a categorical output variable with arity 2).

• How to handle multiple classes?
  • E.g., classify documents into three categories: sports, business, politics

• Answer: one-vs-all, learn N SVMs
  • SVM 1 learns “Output==1” vs “Output != 1”
  • SVM 2 learns “Output==2” vs “Output != 2”
  • :
  • SVM N learns “Output==N” vs “Output != N”
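The relabeling step behind one-vs-all can be sketched as follows: each of the N binary problems reuses the same documents but replaces the class label with +1 for the class of interest and -1 for everything else. The function name and the toy labels are assumptions for illustration; training the binary SVMs themselves is left out.

```python
def one_vs_all_labels(labels):
    """For each class c, build binary targets: +1 where label == c, else -1."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else -1 for y in labels] for c in classes}


labels = ["sports", "business", "politics", "sports"]
binary = one_vs_all_labels(labels)

# binary["sports"] is the training target for the "sports vs the rest" SVM.
for c, targets in binary.items():
    print(c, targets)
```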





One-vs-All

• Red vs the other classes: red(d)
• Yellow vs the other classes: yellow(d)
• Cyan vs the other classes: cyan(d)
• Given a test document d, how to decide its color?
  • Assign d to the color whose function gives the largest score
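The decision rule amounts to taking the class with the highest score. In the sketch below, the three scoring functions are hand-written stand-ins for the trained red(d), yellow(d), and cyan(d); their coefficients and the function name `predict_color` are assumptions for illustration only.

```python
def predict_color(d, scorers):
    """Assign d to the class whose scoring function returns the largest value."""
    return max(scorers, key=lambda name: scorers[name](d))


# Hypothetical linear scorers standing in for the trained one-vs-all SVMs.
scorers = {
    "red":    lambda d: 1.0 * d[0] - 0.5 * d[1],
    "yellow": lambda d: -0.5 * d[0] + 1.0 * d[1],
    "cyan":   lambda d: -1.0 * d[0] - 1.0 * d[1] + 1.0,
}

print(predict_color((2.0, 0.5), scorers))
```

Comparing raw scores assumes the N classifiers produce values on a comparable scale, which is a common simplification in one-vs-all setups.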



Suppose we’re in 1-dimension

What would SVMs do with this data?

[Figure: points of both classes on a line, with x=0 marked.]

Not a big surprise

[Figure: the same data with the resulting boundary; x=0 marked.]


Harder 1-dimensional dataset

What can be done about this?

[Figure: points of both classes on a line, with x=0 marked.]

• Expand from a one-dimensional space to a two-dimensional space

[Figure: the same data plotted with x on the horizontal axis and x² on the vertical axis; x=0 marked.]

• Kernel trick: expand the dimensionality by a kernel function
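A small sketch of the expansion, assuming a dataset where one class sits near x=0 and the other lies on both sides of it (the points and the hand-picked separating line are invented). It also shows a kernel function `k` that returns the same value as the dot product in the expanded space without building the expanded vectors, which is the idea behind the kernel trick.

```python
def phi(x):
    """Map a 1-dimensional point into two dimensions: x -> (x, x^2)."""
    return (x, x * x)


def k(x, xp):
    """Kernel equal to phi(x) . phi(xp), computed without calling phi."""
    return x * xp + (x * xp) ** 2


# Made-up data: negatives near x = 0, positives on both sides.
data = [(-3, +1), (-2, +1), (-0.5, -1), (0, -1), (1, -1), (2.5, +1)]


def f(z):
    """Hand-picked line in the expanded space: x^2 = 2.5 (not learned)."""
    return z[1] - 2.5


separable = all(y * f(phi(x)) > 0 for x, y in data)
print(separable)
```

No single threshold on x separates these points, but in the (x, x²) plane a horizontal line does.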


Nonlinear Kernel (I)


Nonlinear Kernel (II)


Software for SVM

• SVMlight (http://svmlight.joachims.org/)
• Libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
  • It is faster than SVMlight
• Sparse data representation
  • The occurrences of most words in a document are zero
  • Format: <label> <index1>:<value1> <index2>:<value2>
    • <label> is the class label, <index> is the word-id, <value> is the word occurrence
  • Example
    • D = (‘hello’: 2, ‘world’: 3), negative document
    • Word-id for ‘hello’ is 100, word-id for ‘world’ is 54
    • -1 54:3 100:2 (indices listed in ascending order, as both tools expect)
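Producing a line in this sparse format can be sketched as below. The helper name `to_sparse_line` is hypothetical; the word-ids are the ones from the example. Both SVMlight and Libsvm expect feature indices in ascending order, so the sketch sorts the pairs by word-id, and the output order may differ from the order in which the words were listed.

```python
def to_sparse_line(label, counts, word_ids):
    """Encode one document as '<label> <index>:<value> ...'.

    Words with a zero count are skipped; indices are written in ascending order.
    """
    pairs = sorted((word_ids[w], c) for w, c in counts.items() if c != 0)
    return " ".join([str(label)] + [f"{i}:{v}" for i, v in pairs])


word_ids = {"hello": 100, "world": 54}   # word-ids from the example
line = to_sparse_line(-1, {"hello": 2, "world": 3}, word_ids)
print(line)   # -1 54:3 100:2
```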