The Disputed Federalist Papers: Resolution via Support Vector Machine Feature Selection

Olvi Mangasarian, UW Madison & UCSD La Jolla

Glenn Fung, Amazon Inc., Seattle, Washington





The Federalist Papers

Written in 1787-1788 by Alexander Hamilton, John Jay and James Madison to persuade the citizens of New York to ratify the Constitution.

The papers consist of 118 short essays, 900 to 3500 words in length.

Authorship of 106 of the papers is definitely known.

Authorship of 12 of the papers has been in dispute (Madison or Hamilton). These papers are referred to as the disputed Federalist papers.

In this talk we resolve this dispute using a linear support vector machine classifier.

Outline of Talk

Support Vector Machines (SVM) Introduction

    Standard Quadratic Programming Formulation

    1-norm Linear Programming Formulation

SVM Feature Selection via Concave Minimization

    Successive Linearization Algorithm (SLA)

The Disputed Federalist Papers

    Description of the Classification Problem

    Description of Previous Work

Results

    Linear Classifier in Three Dimensions Resolves Dispute

    Classification Agrees with Previous Results

What is a Support Vector Machine?

An optimally defined surface

Typically nonlinear in the input space

Linear in a higher dimensional space

Implicitly defined by a kernel function

What are Support Vector Machines Used For?

Classification

Regression & Data Fitting

Supervised & Unsupervised Learning

(Will concentrate on classification)

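Geometry of the Classification Problem: 2-Category Linearly Separable Case

[Figure: point sets A+ and A-, separated by the plane $x'w = \gamma$ with normal $w$, lying between the bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$.]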

Algebra of the Classification Problem: 2-Category Linearly Separable Case

Given m points in n-dimensional space, represented by an m-by-n matrix A.

Membership of each point $A_i$ in class +1 or -1 is specified by an m-by-m diagonal matrix D with +1 and -1 entries.

Separate by two bounding planes, $x'w = \gamma \pm 1$:

$$A_i w \ge \gamma + 1, \quad \text{for } D_{ii} = +1;$$

$$A_i w \le \gamma - 1, \quad \text{for } D_{ii} = -1.$$

More succinctly:

$$D(Aw - e\gamma) \ge e,$$

where e is a vector of ones.
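As a small illustration of the condition $D(Aw - e\gamma) \ge e$, the sketch below checks it row by row for a candidate plane. The four 2-D points, the labels, and the plane $(w, \gamma)$ are made-up toy values, and `satisfies_bounding_planes` is a hypothetical helper name, not something from the talk.

```python
# Toy 2-D data: each row of A is a point; d holds the diagonal of D (+1 / -1).
A = [[3.0, 3.0], [4.0, 2.5], [0.0, 0.5], [1.0, 0.0]]
d = [+1, +1, -1, -1]


def satisfies_bounding_planes(A, d, w, gamma):
    """Return True when D(Aw - e*gamma) >= e holds for every row."""
    for row, label in zip(A, d):
        score = sum(a * wj for a, wj in zip(row, w)) - gamma  # A_i w - gamma
        if label * score < 1.0:  # point falls between or beyond the wrong plane
            return False
    return True


# Hypothetical candidate plane x'w = gamma.
w, gamma = [1.0, 1.0], 3.5
separated = satisfies_bounding_planes(A, d, w, gamma)
print(separated)
```

Shifting `gamma` far enough in either direction makes the check fail, which is one way to see that both bounding planes constrain the choice of $\gamma$.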

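Support Vector Machines: Maximizing the Margin between Bounding Planes

[Figure: point sets A+ and A-, bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$ with normal $w$, the margin $\frac{2}{\|w\|_2}$ between the bounding planes, and the support vectors marked.]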

Support Vector Machines: Quadratic Programming Formulation

Solve the following quadratic program:

$$\min_{y \ge 0,\; w,\; \gamma} \; \nu e'y + \tfrac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e,$$

where $\nu$ is the weight of the training error.

Maximize the margin by minimizing $\tfrac{1}{2}\|w\|_2^2$.
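To make the roles of the slack $y$ and the weight $\nu$ concrete, here is a minimal sketch that evaluates the quadratic-program objective for a fixed plane, using the smallest slack that satisfies $D(Aw - e\gamma) + y \ge e$ and $y \ge 0$. It does not solve the program; the data, the plane, and the name `qp_objective` are illustrative assumptions.

```python
import math


def qp_objective(A, d, w, gamma, nu):
    """Evaluate nu*e'y + 0.5*||w||_2^2 with the smallest feasible slack y."""
    slack = []
    for row, label in zip(A, d):
        score = sum(a * wj for a, wj in zip(row, w)) - gamma  # A_i w - gamma
        slack.append(max(0.0, 1.0 - label * score))  # y_i = max(0, 1 - D_ii * score)
    squared_norm = sum(wj * wj for wj in w)
    margin = 2.0 / math.sqrt(squared_norm)  # distance between the bounding planes
    objective = nu * sum(slack) + 0.5 * squared_norm
    return objective, slack, margin


# Toy data (made up): the last point of class -1 sits on the wrong bounding plane.
A = [[3.0, 3.0], [4.0, 2.5], [0.0, 0.5], [2.5, 2.0]]
d = [+1, +1, -1, -1]
objective, slack, margin = qp_objective(A, d, w=[1.0, 1.0], gamma=3.5, nu=1.0)
print(objective, slack, margin)
```

Only the violating point contributes slack, so a larger $\nu$ penalizes that single training error more heavily relative to the margin term.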

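Support Vector Machines: Linear Programming Formulation

Use the 1-norm instead of the 2-norm:

$$\min_{y \ge 0,\; w,\; \gamma} \; \nu e'y + \|w\|_1 \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e.$$

This is equivalent to the following linear program:

$$\min_{y \ge 0,\; w,\; \gamma,\; v} \; \nu e'y + e'v \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad v \ge w \ge -v.$$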
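The linear program above can be handed to a general LP solver once the variables are stacked into one vector. The sketch below is one possible encoding, assuming NumPy and SciPy's `linprog` are available; the stacking order $z = [w, \gamma, y, v]$, the toy data, and the name `one_norm_svm` are choices made here for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def one_norm_svm(A, d, nu):
    """Solve min nu*e'y + e'v  s.t.  D(Aw - e*gamma) + y >= e,  -v <= w <= v,  y >= 0."""
    m, n = A.shape
    D = np.diag(d)
    # Cost vector over z = [w (n), gamma (1), y (m), v (n)].
    c = np.concatenate([np.zeros(n + 1), nu * np.ones(m), np.ones(n)])
    # Margin constraints rewritten as  -D A w + d*gamma - y <= -e.
    margin_rows = np.hstack([-D @ A, d.reshape(-1, 1), -np.eye(m), np.zeros((m, n))])
    # w - v <= 0  and  -w - v <= 0  together give  -v <= w <= v.
    upper_rows = np.hstack([np.eye(n), np.zeros((n, 1 + m)), -np.eye(n)])
    lower_rows = np.hstack([-np.eye(n), np.zeros((n, 1 + m)), -np.eye(n)])
    A_ub = np.vstack([margin_rows, upper_rows, lower_rows])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * n)])
    # w and gamma are free; y and v are nonnegative.
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n], res.x[n]


# Toy, linearly separable data (made up for illustration).
A = np.array([[3.0, 3.0], [4.0, 2.5], [0.0, 0.5], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w, gamma = one_norm_svm(A, d, nu=1.0)
print(w, gamma)
```

At an optimal solution $v$ equals $|w|$ componentwise, so $e'v$ reproduces $\|w\|_1$ while keeping the objective and the constraints linear.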

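Feature Selection and SVMs

Use the step function to suppress components of the normal to the separating hyperplane:

$$\min_{y \ge 0,\; w,\; \gamma,\; v} \; \nu e'y + e'v_* \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad v \ge w \ge -v,$$

where:

$$(v_*)_i = \begin{cases} 1 & \text{if } v_i > 0, \\ 0 & \text{if } v_i = 0. \end{cases}$$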

Smooth Approximation of the Step Function

[Figure: plot of $1 - \varepsilon^{-5y}$ against $y \ge 0$; the curve starts at 0 and rises toward 1.]
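A quick numerical comparison shows how closely the smooth curve $1 - \varepsilon^{-5y}$ tracks the step function. This is a sketch only: the sample points are arbitrary, and `step`, `smooth_step`, and the parameter name `alpha` (set to 5 to match the plotted curve) are names chosen here.

```python
import math


def step(v):
    """Step function: 1 where v_i > 0, 0 where v_i == 0."""
    return [1.0 if vi > 0 else 0.0 for vi in v]


def smooth_step(v, alpha):
    """Concave exponential approximation 1 - exp(-alpha * v_i), for v_i >= 0."""
    return [1.0 - math.exp(-alpha * vi) for vi in v]


v = [0.0, 0.1, 0.5, 2.0]  # arbitrary nonnegative sample points
exact = step(v)
approx = smooth_step(v, alpha=5.0)
for vi, s, a in zip(v, exact, approx):
    print(f"v={vi:4.1f}  step={s:.0f}  smooth={a:.4f}")
```

The approximation is exact at 0 and approaches 1 quickly as the argument grows; a larger exponent sharpens the transition near 0.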

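SVM Formulation with Feature Selection

For $v \ge 0$, we use the approximation of the step vector $v_*$ by the concave exponential:

$$v_* \approx e - \varepsilon^{-\alpha v}, \quad \alpha > 0.$$

Here $\varepsilon$ is the base of natural logarithms. This leads to:

$$\min_{y \ge 0,\; w,\; \gamma,\; v} \; \nu e'y + e'(e - \varepsilon^{-\alpha v}) \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad v \ge w \ge -v.$$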

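Successive Linearization Algorithm (SLA) for Feature Selection

Choose $\nu, \alpha > 0$. Start with some $(w^0, \gamma^0, y^0, v^0)$. Having $(w^i, \gamma^i, y^i, v^i)$, determine the next iterate $(w^{i+1}, \gamma^{i+1}, y^{i+1}, v^{i+1})$ by solving the LP:

$$\min_{y \ge 0,\; w,\; \gamma,\; v} \; \nu e'y + \alpha(\varepsilon^{-\alpha v^i})'(v - v^i) \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad v \ge w \ge -v.$$

Stop when:

$$\nu e'(y^{i+1} - y^i) + \alpha(\varepsilon^{-\alpha v^i})'(v^{i+1} - v^i) = 0.$$

Proposition: The algorithm terminates in a finite number of steps (typically 5 to 7) at a stationary point.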
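The iteration can be sketched in a few lines on top of a generic LP solver. The following is a minimal illustration, not the authors' implementation: it assumes NumPy and SciPy's `linprog`, uses a feasible starting point chosen here ($w = 0$, $\gamma = 0$, $y = e$, $v = 0$), drops the constant term $-\alpha(\varepsilon^{-\alpha v^i})'v^i$ from the LP cost, replaces the exact zero test by a small tolerance, and runs on made-up data in which the first feature alone separates the classes.

```python
import numpy as np
from scipy.optimize import linprog


def sla_feature_selection(A, d, nu, alpha, max_iter=20, tol=1e-8):
    """Successive linearization of  nu*e'y + e'(e - exp(-alpha*v))  over the SVM constraints."""
    m, n = A.shape
    D = np.diag(d)
    # Constraints over z = [w, gamma, y, v]; the feasible set is the same at every step.
    A_ub = np.vstack([
        np.hstack([-D @ A, d.reshape(-1, 1), -np.eye(m), np.zeros((m, n))]),  # margins
        np.hstack([np.eye(n), np.zeros((n, 1 + m)), -np.eye(n)]),             # w <= v
        np.hstack([-np.eye(n), np.zeros((n, 1 + m)), -np.eye(n)]),            # -w <= v
    ])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * n)])
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + n)

    # Assumed feasible start (not specified in the talk).
    w, gamma, y, v = np.zeros(n), 0.0, np.ones(m), np.zeros(n)
    for iteration in range(1, max_iter + 1):
        grad_v = alpha * np.exp(-alpha * v)  # gradient of e'(e - exp(-alpha*v)) at v^i
        c = np.concatenate([np.zeros(n + 1), nu * np.ones(m), grad_v])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        if not res.success:
            raise RuntimeError(res.message)
        w_new, gamma_new = res.x[:n], res.x[n]
        y_new, v_new = res.x[n + 1:n + 1 + m], res.x[n + 1 + m:]
        # Linearized change in the objective; it is never positive.
        decrease = nu * np.sum(y_new - y) + grad_v @ (v_new - v)
        w, gamma, y, v = w_new, gamma_new, y_new, v_new
        if decrease > -tol:  # stopping test, up to a numerical tolerance
            break
    return w, gamma, iteration


# Toy data (made up): feature 0 separates the classes, the others are noise.
A = np.array([
    [2.0, 0.3, 1.0, 0.5], [2.5, 0.1, 0.2, 0.9], [3.0, 0.4, 0.7, 0.1],
    [0.0, 0.2, 0.9, 0.4], [0.5, 0.3, 0.1, 0.8], [-0.5, 0.1, 0.6, 0.2],
])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, gamma, iterations = sla_feature_selection(A, d, nu=10.0, alpha=5.0)
selected = np.flatnonzero(np.abs(w) > 1e-6)
print(selected, w, gamma, iterations)
```

Components of `w` that end up at zero correspond to suppressed features; which features survive depends on the data and on the assumed values of `nu` and `alpha`.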

The Federalist Papers

(As Described Earlier)

Written in 1787-1788 by Alexander Hamilton, John Jay and James Madison to persuade the citizens of New York to ratify the Constitution.

The papers consist of short essays, 900 to 3500 words in length.

Authorship of 12 of the papers has been in dispute (Madison or Hamilton). These papers are referred to as the disputed Federalist papers.

Previous Work

Mosteller and Wallace (1964): Using statistical inference, determined the authorship of the 12 disputed papers.

Bosch and Smith (1998): Using linear programming techniques and the evaluation of every possible combination of one, two and three features out of 70, obtained a best separating hyperplane using only three words.

Description of the data

For every paper:

Machine-readable text was created using a scanner.

Relative frequencies were computed for 70 words that Mosteller-Wallace identified as good candidates for author-attribution studies.

Each document is represented as a vector containing the 70 real numbers corresponding to the 70 word frequencies.

The dataset consists of 118 papers:

50 Madison papers

56 Hamilton papers

12 disputed papers
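Turning a document into such a frequency vector might look like the sketch below. It is illustrative only: the tokenization rule, the scaling to occurrences per 1000 words, the sample sentence, and the name `word_rates` are assumptions made here, and the three-word vocabulary stands in for the full list of 70.

```python
import re
from collections import Counter


def word_rates(text, vocabulary, per=1000):
    """Relative frequency of each vocabulary word, scaled to occurrences per `per` words."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer: runs of letters
    counts = Counter(tokens)
    total = len(tokens)
    return [per * counts[word] / total if total else 0.0 for word in vocabulary]


vocabulary = ["to", "upon", "would"]  # three function words; the talk uses 70
sample = "It would be wise to rely upon the people to decide."  # invented text
rates = word_rates(sample, vocabulary)
print(rates)
```

Stacking one such vector per paper as the rows of a matrix gives the matrix A used in the classification formulations.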

Function Words Based on Relative Frequencies

SLA Feature Selection for Classifying the Disputed Federalist Papers

Apply the successive linearization algorithm to:

Train on the 106 Federalist papers with known authors

Find a classification hyperplane that uses as few words as possible

Use the hyperplane to classify the 12 disputed papers

The parameter $\nu$ was obtained by a tuning procedure.

Hyperplane Classifier Using 3 Words

A hyperplane depending on three words was found:

$$-0.5368\,\text{to} - 24.6634\,\text{upon} - 2.9532\,\text{would} = -66.6159$$

Hamilton: > -66.6159

Madison: < -66.6159

All disputed papers ended up on the Madison side of the plane.

Results: 3d plot of resulting hyperplane

Comparison with Previous Work & Conclusion

Bosch and Smith (1998) calculated all the possible combinations of one, two and three words out of 70 to find a separating hyperplane. They solved 57,225 linear programs.

$$-0.5242\,\text{are} + 0.8895\,\text{our} + 4.9235\,\text{upon} = 4.7368$$

Our SLA algorithm for feature selection required the solution of only 6 linear programs.

Our classification of the disputed Federalist papers agrees with that of both Mosteller-Wallace and Bosch-Smith.
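The count of 57,225 linear programs is the number of ways to choose one, two or three words out of 70, which can be checked directly. The variable names below are chosen here for illustration.

```python
from math import comb

n_words = 70
# Number of subsets of size 1, 2 and 3 drawn from the 70 candidate words.
subsets = {k: comb(n_words, k) for k in (1, 2, 3)}
total_lps = sum(subsets.values())
print(subsets, total_lps)
```

Exhaustive search therefore grows roughly cubically with the number of candidate words, whereas the SLA run reported above needed 6 linear programs.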

References

K.P. Bennett & O.L. Mangasarian: Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software 1, 1992, 23-34.

P.S. Bradley & O.L. Mangasarian: Feature selection via concave minimization and support vector machines. ICML 1998: Machine Learning Proceedings of the Fifteenth International Conference, San Francisco, California, 1998, J. Shavlik, editor, pages 82-90, Morgan Kaufmann.

R.A. Bosch & J.A. Smith: Separating hyperplanes and the authorship of the disputed Federalist papers. American Mathematical Monthly 105(7), 1998, 601-608.

F. Mosteller & D.L. Wallace: Inference and disputed authorship: The Federalist. Addison-Wesley, Reading, Massachusetts, 1964.

F. Mosteller & D.L. Wallace: Applied Bayesian and classical inference: The case of the Federalist papers, Second Edition, Springer-Verlag, New York, 1984.

More on SVMs:

Olvi Mangasarian’s web page:

http://pages.cs.wisc.edu/~olvi