the disputed federalist papers : svm feature selection via concave minimization glenn fung and olvi...
TRANSCRIPT
The Disputed Federalist Papers :
SVM Feature Selection via Concave Minimization
Glenn Fung and Olvi L. Mangasarian
CSNA 2002
June 13-16, 2002
Madison, Wisconsin
Outline of Talk
Support Vector Machines (SVM) Introduction Standard Quadratic Programming Formulation
SVM Feature Selection
The Disputed Federalist Papers
Results
Classification Agrees with Previous Results
Successive Linearization Algorithm (SLA)
Description of the Classification Problem
1-norm Linear SVMs
Separating Hyperplane in Three Dimensions Only
Description of Previous Work
What is a Support Vector Machine?
An optimally defined surface Typically nonlinear in the input space Linear in a higher dimensional space Implicitly defined by a kernel function
What are Support Vector Machines Used For?
Classification Regression & Data Fitting Supervised & Unsupervised Learning
(Will concentrate on classification)
Geometry of the Classification Problem2-Category Linearly Separable Case
A+
A-
x0w = í + 1
x0w = í à 1
w
x0w = í
Algebra of the Classification Problem 2-Category Linearly Separable Case
Given m points in n dimensional space Represented by an m-by-n matrix A
More succinctly:D(Awà eí )=e;
where e is a vector of ones.
x0w = í æ1: Separate by two bounding planes,
A iw=í + 1; for D i i = + 1;
A iw5 í à 1; for D i i = à 1:
An m-by-m diagonal matrix D with +1 & -1 entries
Membership of each A i in class +1 or –1 specified by:
Support Vector MachinesMaximizing the Margin between Bounding Planes
x0w = í + 1
x0w = í à 1
A+
A-
jjwjj22
w
Support vectors
Support Vector Machines:Quadratic Programming Formulation
Solve the following quadratic program:
÷e0y+ 21kwk2
2y > 0;w; íD(Awà eí ) + y > e
min
s.t.
where is the weight of the training error ÷
Maximize the margin by minimizing21kwk2
2
Support Vector Machines: Linear Programming Formulation
Use the 1-norm instead of the 2-norm:
÷e0y+ kwk1y > 0;w; í
D(Awà eí ) + y > e
min
s.t. This is equivalent to the following linear program:
min ÷e0y+ e0vy>0;w; í ;v
D(Awà eí ) + y > es.t.
v>w> à v
Feature Selection and SVMs
Use the step function to suppress components of the normal to the separating hyperplane:
min ÷e0y+ e0vãy>0;w; í ;v
D(Awà eí ) + y > es.t.
v>w> à v
viã = 1 if vi > 00 if vi = 0
ú ûWhere:
SVM Formulation with Feature Selection
For , we use the approximation of the step vector by the concave exponential:
v>0vã
vãt eà "à ëv;ë > 0 Here is the base of natural logarithms. This leads to:
min ÷e0y+ e0(eà "à ëv)y>0;w;í ;v
D(Awà eí ) + y > es.t.
v>w> à v
"
Successive Linearization Algorithm (SLA) for Feature Selection
(w0; í 0;y0;v0) Choose . Start with some . Having , determine the next iterate by solving the LP:
min ÷e0y+ ë("à ëvi)0(và vi)y>0;w;í ;v
D(Awà eí ) + y > es.t.
v>w> à v
÷;ë > 0(wi; í i;yi;vi)
Stop when: ÷e0(yà yi) + ë("à ëvi)0(và vi) = 0
Proposition: Algorithm terminates in a finite numberof steps (typically 5 to 7) at a stationary point.
The Federalist Papers
Written in 1787-1788 by Alexander Hamilton, John Jay and James Madison to persuade the citizens of New York to ratify the constitution.
Papers consisted of short essays, 900 to 3500 words in length.
Authorship of 12 of those papers have been in dispute ( Madison or Hamilton). These papers are referred to as the disputed Federalist papers.
Previous Work
Mosteller and Wallace (1964) Using statistical inference, determined the
authorship of the 12 disputed papers.
Bosch and Smith (1998). Using linear programming techniques and the
evaluation of every possible combination of one, two and three features, obtained a separating hyperplane using only three words.
Description of the data
For every paper:Machine readable text was created using a scanner.Computed relative frequencies of 70 words, that
Mosteller-Wallace identified as good candidates for author-attribution studies.
Each document is represented as a vector containing the 70 real numbers corresponding to the 70 word frequencies.
The dataset consists of 118 papers: 50 Madison papers 56 Hamilton papers 12 disputed papers
SLA Feature Selection for Classifying the Disputed Federalist Papers
Apply the successive linearization algorithm to:Train on the 106 Federalist papers with known
authorsFind a classification hyperplane that uses as few
words as possible
Use the hyperplane to classify the 12 disputed papers
÷ The parameter was obtained by a tuning procedure.
Hyperplane Classifier Using 3 Words
A hyperplane depending on three words was found:
0.5368to+24.6634upon+2.9532would=66.6159
All disputed papers ended up on the Madison side of the plane
Comparison with Previous Work & Conclusion
Bosch and Smith (1998) calculated all the possible sets of one, two and three words to find a separating hyperplane. They solved 118,895 linear programs.
Our SLA algorithm for feature selection required the solution of only 6 linear programs.
Our classification of the disputed Federalist papers agrees with that of Mosteller-Wallace and Bosch-Smith.