Post on 22-Dec-2015
![Page 1: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/1.jpg)
Automatic Authorship Identification (Part II)
Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts,
and David D. Lewis
![Page 2: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/2.jpg)
Acknowledgements
• Support
– U.S. National Science Foundation
• DIMACS REU 2004
• Knowledge Discovery and Dissemination Program
• Disclaimer
– The views expressed in this talk are those of the authors, and not of any other individuals or organizations.
![Page 3: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/3.jpg)
Outline
I. Recap
II. New Federalist Paper Results
III. New E-mail Data Results
IV. Conclusions and Future Work
![Page 4: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/4.jpg)
The Authorship Problem
• Given:
– A piece of text with unknown author
– A list of possible authors
– A sample of their writing
• Problem:
– Can we automatically determine which person wrote the text?
![Page 5: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/5.jpg)
The Authorship Problem
• Given:
– A piece of text
– A list of possible authors
– A sample of their writing
• Problem:
– Can we automatically determine which person wrote the text?
• Approach:
– Use style markers to identify the author
![Page 6: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/6.jpg)
The Federalist Papers
• 85 Total
• 12 Disputed
![Page 7: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/7.jpg)
Previous Work: Mosteller and Wallace (1964)
• Function Words
Upon Also An
By Of On
There This To
Although Both Enough
While Whilst Always
Though Commonly Consequently
Considerable(ly) According Apt
Direction Innovation(s) Language
Vigor(ous) Kind Matter(s)
Particularly Probability Work(s)
![Page 8: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/8.jpg)
Our Previous Work: Trials with the Federalist Papers
• Wrote scripts in Perl and Python to compute
– Sentence length frequencies
– Word length frequencies
– Ratios of 3-letter words to 2-letter words
• Analyzed our data with graphing and statistics software.
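The three style markers listed on this slide can be sketched in a few lines of Python. This is an illustration only, not the scripts used for the talk: the function name `style_features`, the naive sentence split on `.`, `!` and `?`, and the definition of a word as a run of ASCII letters are all assumptions.

```python
import re
from collections import Counter


def style_features(text):
    """Return (sentence-length counts, word-length counts, 3-to-2-letter ratio).

    Assumptions: sentences end at '.', '!' or '?'; a word is a run of
    ASCII letters, so "don't" counts as two words.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)

    # How many sentences have each length (measured in words).
    sentence_lengths = Counter(
        len(re.findall(r"[A-Za-z]+", s)) for s in sentences
    )
    # How many words have each length (measured in letters).
    word_lengths = Counter(len(w) for w in words)

    two, three = word_lengths[2], word_lengths[3]
    ratio = three / two if two else None  # undefined without 2-letter words
    return sentence_lengths, word_lengths, ratio
```

Dividing each count by the total number of words (and scaling, for example per 1000 words) would turn the counts into rates that are comparable across documents of different lengths.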
![Page 9: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/9.jpg)
![Page 10: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/10.jpg)
![Page 11: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/11.jpg)
Previous Conclusions
• Not too helpful…but there is hope!
– Try more features
– Try different features
![Page 12: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/12.jpg)
![Page 13: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/13.jpg)
![Page 14: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/14.jpg)
![Page 15: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/15.jpg)
![Page 16: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/16.jpg)
![Page 17: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/17.jpg)
![Page 18: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/18.jpg)
Feature Selection
• Which features work best?
• One way to rank features:
– Make a contingency table for each feature F
– Compute abs(log(ad / bc))
– Rank the log values
          F   Not F
Madison   a   b
Hamilton  c   d
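The ranking step can be sketched as follows. The slide does not say what the cells count or how zero cells were handled, so this sketch assumes a and b are Madison's counts with and without feature F, c and d are Hamilton's, and it adds 0.5 to every cell to avoid dividing by zero. The function names and the smoothing constant are illustrative choices, not taken from the talk.

```python
import math


def abs_log_odds(a, b, c, d, smoothing=0.5):
    """abs(log(ad / bc)) for one 2x2 contingency table.

    Assumed layout: a, b = Madison with / without F;
                    c, d = Hamilton with / without F.
    The 0.5 smoothing is an assumption to keep zero cells finite.
    """
    a, b, c, d = (x + smoothing for x in (a, b, c, d))
    return abs(math.log((a * d) / (b * c)))


def rank_features(tables):
    """Sort feature names by score, most discriminating first.

    tables maps a feature name to its (a, b, c, d) tuple.
    """
    scores = {name: abs_log_odds(*cells) for name, cells in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A feature that is equally common for both authors scores near 0, while a feature used almost exclusively by one author gets a large score regardless of which author it favors.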
![Page 19: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/19.jpg)
49 Ranked Features
![Page 20: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/20.jpg)
Linear Discriminant Analysis
• A technique for classifying data
• Available in the R statistics package
• Input:
– Table of training data
– Table of test data
• Output:
– Classification of test data
![Page 21: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/21.jpg)
Linear Discriminant Analysis: example
Input training data:
   upon   2-letter  3-letter
M  0.000  206.943   194.927
M  0.000  212.915   194.665
M  0.369  202.583   190.775
M  0.000  201.891   213.712
M  0.000  236.943   206.221
H  3.015  235.176   187.940
H  2.458  226.647   201.082
H  4.955  232.432   192.793
H  2.377  232.937   186.078
H  3.788  224.116   196.338
Input test data:
   upon   2-letter  3-letter
   0.000  226.277   203.163
   0.908  205.268   181.653
   0.000  225.536   182.627
   0.000  217.273   183.053
   1.003  232.581   184.962
Output: m m m m h
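For readers who want to see what the classifier does with such tables, here is a pure-Python sketch of two-class linear discriminant analysis with a pooled covariance matrix and class-proportion priors. It is not the R `lda` function used in the talk, all function names are illustrative, and its labels for the test rows are not guaranteed to match the output shown on the slide.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_lda(class0, class1):
    """Return (weights, bias); a positive score means class1."""
    m0, m1 = mean(class0), mean(class1)
    p = len(m0)
    S = [[0.0] * p for _ in range(p)]  # pooled within-class scatter
    for rows, m in ((class0, m0), (class1, m1)):
        for x in rows:
            d = [xi - mi for xi, mi in zip(x, m)]
            for i in range(p):
                for j in range(p):
                    S[i][j] += d[i] * d[j]
    dof = len(class0) + len(class1) - 2
    S = [[v / dof for v in row] for row in S]
    w = solve(S, [b - a for a, b in zip(m0, m1)])
    mid = [(a + b) / 2 for a, b in zip(m0, m1)]
    bias = -sum(wi * ci for wi, ci in zip(w, mid))
    bias += math.log(len(class1) / len(class0))  # class-proportion priors
    return w, bias


def predict(model, x, labels=("m", "h")):
    w, bias = model
    score = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return labels[1] if score > 0 else labels[0]


# Training and test rows copied from the slide: upon, 2-letter, 3-letter.
MADISON = [[0.000, 206.943, 194.927], [0.000, 212.915, 194.665],
           [0.369, 202.583, 190.775], [0.000, 201.891, 213.712],
           [0.000, 236.943, 206.221]]
HAMILTON = [[3.015, 235.176, 187.940], [2.458, 226.647, 201.082],
            [4.955, 232.432, 192.793], [2.377, 232.937, 186.078],
            [3.788, 224.116, 196.338]]
TEST = [[0.000, 226.277, 203.163], [0.908, 205.268, 181.653],
        [0.000, 225.536, 182.627], [0.000, 217.273, 183.053],
        [1.003, 232.581, 184.962]]

model = fit_lda(MADISON, HAMILTON)
predictions = [predict(model, row) for row in TEST]
```

The decision rule is a single weighted sum of the features compared against a threshold, which is why adding or removing one feature can flip the label of a borderline document.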
![Page 22: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/22.jpg)
Some more LDA results
• 12 to Madison:
– upon, 1-letter, 2-letter
– upon, enough, there
– upon, there
• 11 to Madison:
– upon, 2-letter, 3-letter
• < 6 to Madison:
– 2-letter, 3-letter
– there, 1-letter, 2-letter
![Page 23: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/23.jpg)
Some more LDA results
Class  Output of lda            Features tested
12 M   m m m m m m m m m m m m  upon apt 9 2
12 M   m m m m m m m m m m m m  to upon 2 3
11 M   m m m m m m h m m m m m  on there 2 13
11 M   h m m m m m m m m m m m  an by 5 10
10 M   m m m m m m h m m m h m  particularly probability 3 9
8 M    m m m m m m h h h m h m  also of 1 4
8 M    m m m h m m h h m m h m  always of 1 3
7 M    h m m h m h h m h m m m  of work 5 2
6 M    m m h m m m h h m h h h  there language 1 8
5 M    m h m h h m h h h m m h  consequently direction 5 11
![Page 24: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/24.jpg)
Feature Selection Part II
• Which combinations of features are best for LDA?
• Are the features independent?
• We did some random sampling:
– Choose features a, b, c, d
– Compute x = log a + log b + log c + log d
– Compute y = log (a+b+c+d)
– Plot x versus y
![Page 25: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/25.jpg)
![Page 26: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/26.jpg)
Selecting more features
• What happens when more than 4 features are used for the lda?
• Greedy approach
– Add features one at a time from two lists
– Perform lda on all features chosen so far
• Is overfitting a problem?
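The greedy procedure can be sketched as a loop that grows the feature set and re-evaluates it after every addition. The slide does not say how the two lists were merged, so strict alternation is an assumption here, and `evaluate` is a hypothetical callback (for example, one that runs lda on the chosen features and returns how many disputed papers go to Madison).

```python
from itertools import zip_longest


def interleave(list_a, list_b):
    """Alternate items from two ranked lists, skipping repeats.

    Strict alternation is an assumption; the talk does not specify
    how features were drawn from the two lists.
    """
    seen, merged = set(), []
    for pair in zip_longest(list_a, list_b):
        for feature in pair:
            if feature is not None and feature not in seen:
                seen.add(feature)
                merged.append(feature)
    return merged


def greedy_runs(candidates, evaluate):
    """Add one feature at a time and score the growing set.

    evaluate is a caller-supplied (hypothetical) function that takes a
    tuple of feature names and returns any score.
    """
    chosen, results = [], []
    for feature in candidates:
        chosen.append(feature)
        results.append((feature, evaluate(tuple(chosen))))
    return results
```

Because every step is scored on the same papers, watching how the score changes as features accumulate gives a rough view of stability, but a held-out set would be needed to actually check for overfitting.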
![Page 27: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/27.jpg)
First few greedy iterations
6 M 6 H   h m h h m h m m h m h m  2-letter words
12 M 0 H  m m m m m m m m m m m m  upon
12 M 0 H  m m m m m m m m m m m m  1-letter words
12 M 0 H  m m m m m m m m m m m m  5-letter words
11 M 1 H  m m m m m h m m m m m m  4-letter words
12 M 0 H  m m m m m m m m m m m m  there
12 M 0 H  m m m m m m m m m m m m  enough
11 M 1 H  m m m m m m h m m m m m  whilst
12 M 0 H  m m m m m m m m m m m m  3-letter words
11 M 1 H  m m m m m m h m m m m m  15-letter words
![Page 28: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/28.jpg)
![Page 29: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/29.jpg)
![Page 30: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/30.jpg)
Listserv Data
• 70 Listserv archives
• Over 1 million e-mail messages
• Data was gathered by Andrei Anghelescu
– http://mms-02.rutgers.edu/ListServ/
![Page 31: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/31.jpg)
Our Data
• One Listserv, “CINEMA-L”
• 992 authors, 41263 messages
• We look at 3 authors
– sstone: 1077 messages
– thea70: 1253 messages
– jmiles_2: 1481 messages
![Page 32: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/32.jpg)
Frustration
![Page 33: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/33.jpg)
Feature Selection
• How do we find “good” features?
![Page 34: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/34.jpg)
More Frustration
![Page 35: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/35.jpg)
A Measure of Variance
![Page 36: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/36.jpg)
Summary of LDA Results
• Ran LDA using “I”, “is”, and “think”
• Trained on 80%, tested on 20%
• Correctly classified 122/186 documents (about 66%)
![Page 37: Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis](https://reader030.vdocuments.us/reader030/viewer/2022032523/56649d795503460f94a5c899/html5/thumbnails/37.jpg)
Future Work
• Finish our 3 author experiment
• Use more and different features
– Structural
– E-mail specific features
• Analyze the relationship among features
• Other authorship id problems
– Many authors
– Odd-man-out