style fingerprint - ecology labfaculty.cse.tamu.edu/huangrh/fall17/nlp_fp.pdf · × previous work...
Social Media Writing Style Fingerprint
Himank Yadav, Juliang Li
1. Overview
× Humans have the cognitive ability to differentiate between writing styles of various authors.
× Previous work has addressed authorship attribution for books and other long texts.
× We focus on shorter writing samples gathered from social media.
Overview
2. Motivation
× Detect social media hacking activity.
× Establish the credibility of a source.
× Identify the anonymous authors behind negative behavior (e.g., bullying).
Applications
3. Data
× Comments vs. posts bias
× VRCkid (aka Sahil) and other top redditors that he interacts with
× PRAW - Python Reddit API Wrapper
× Removed empty comments and edge cases
Collection & Cleaning
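A minimal sketch of the collection and cleaning step. It assumes the "edge cases" are Reddit's `[deleted]` and `[removed]` placeholder bodies, which the slides do not actually specify. `clean_comments`, `fetch_comment_bodies`, and the credential parameters are hypothetical names; the PRAW calls used (`praw.Reddit`, `redditor(...).comments.new`, and the `body` attribute) are part of PRAW's public API.

```python
from typing import Iterable, List, Optional

# Placeholder bodies Reddit shows for deleted/removed comments. Treating these
# as the "edge cases" is an assumption; the slides do not list them.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def clean_comments(bodies: Iterable[Optional[str]]) -> List[str]:
    """Strip whitespace and drop empty or placeholder comment bodies."""
    cleaned = []
    for body in bodies:
        text = (body or "").strip()
        if text and text not in PLACEHOLDERS:
            cleaned.append(text)
    return cleaned


def fetch_comment_bodies(username: str, client_id: str, client_secret: str,
                         user_agent: str, limit: int = 100) -> List[str]:
    """Fetch a redditor's most recent comment bodies with PRAW (needs network)."""
    import praw  # imported lazily so clean_comments works without PRAW installed

    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret,
                         user_agent=user_agent)
    return [c.body for c in reddit.redditor(username).comments.new(limit=limit)]
```

In use, the raw bodies returned by `fetch_comment_bodies` would be passed through `clean_comments` before any feature extraction.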
Sample Data
4. Method
× Count vocabulary usage
× Choose features (bag of words)
× Train and test with logistic regression
× 76% accuracy
Word Frequency
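The word-frequency pipeline above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the whitespace tokenizer, the vocabulary cap, and the gradient-descent settings are all hypothetical, and the classifier here is binary (one author vs. another), whereas the slides do not say how many authors were compared.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocabulary(texts: Sequence[str], max_size: int = 1000) -> Dict[str, int]:
    """Map the most frequent words to feature indices."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_size))}


def vectorize(text: str, vocab: Dict[str, int]) -> List[float]:
    """Bag-of-words count vector for one text."""
    vec = [0.0] * len(vocab)
    for word in tokenize(text):
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                   epochs: int = 500, lr: float = 0.5) -> Tuple[List[float], float]:
    """Binary logistic regression fitted with batch gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for features, label in zip(X, y):
            score = sum(w * f for w, f in zip(weights, features)) + bias
            error = sigmoid(score) - label
            grad_b += error
            for j, f in enumerate(features):
                grad_w[j] += error * f
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict(features: Sequence[float], weights: Sequence[float], bias: float) -> int:
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return int(sigmoid(score) >= 0.5)
```

A library implementation (for example scikit-learn's `CountVectorizer` and `LogisticRegression`) would normally replace the hand-written pieces; they are spelled out here only to keep the sketch dependency-free.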
Performance
× Use unsupervised learning to capture distinctive sentence structure.
× Tokenize words and sentences; focus on the lexical features and punctuation style of the text
× Features: word density, vocabulary diversity, and punctuation placement
× Use clustering to find natural groupings; predict using KMeans clusters (~69%)
Lexical KMeans
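A rough sketch of the lexical clustering idea, assuming three illustrative features (words per sentence, type-token ratio, punctuation marks per word) as stand-ins for the "word density, vocabulary diversity and punctuation placement" named above. The exact features and the k-means setup used in the project are not given in the slides, so `lexical_features` and `kmeans` are hypothetical; the clustering is plain Lloyd's algorithm seeded with the first k points.

```python
import re
from typing import List, Sequence, Tuple


def lexical_features(text: str) -> List[float]:
    """Words per sentence, type-token ratio, and punctuation marks per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    punctuation = re.findall(r"[,;:!?.\-]", text)
    n_words = max(len(words), 1)
    return [
        len(words) / max(len(sentences), 1),  # word density
        len(set(words)) / n_words,            # vocabulary diversity
        len(punctuation) / n_words,           # punctuation rate
    ]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Sequence[float]], k: int,
           iterations: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm, seeded with the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids
```

A new comment would then be attributed to the cluster whose centroid is nearest to its feature vector. In practice the features would likely need scaling first, since words per sentence spans a much larger range than the two ratios.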
× View the text as a sequence of characters
× Extract character n-grams from the text
× Select the top n-grams
× 77% accuracy
Character N-gram
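The extraction and selection steps above might look like the following. The n-gram length, the cut-off for "top", and the function names are assumptions; the slides give none of these values.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """All overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def top_ngrams(texts: Iterable[str], n: int = 3, top: int = 100) -> List[str]:
    """The most frequent n-grams across a collection of texts."""
    counts = Counter(g for t in texts for g in char_ngrams(t, n))
    return [g for g, _ in counts.most_common(top)]


def ngram_vector(text: str, vocabulary: Sequence[str], n: int = 3) -> List[int]:
    """Count how often each selected n-gram occurs in text."""
    counts = Counter(char_ngrams(text, n))
    return [counts[g] for g in vocabulary]
```

The resulting count vectors can be fed to any classifier; the slides do not state which one produced the reported accuracy.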
Performance
× Supervised learning combined with n-gram stylometric analysis
× Split the data into training and verification sets, compute a threshold specific to the given user, and derive the user profile by extracting n-grams.
× Divide the user's verification data into p blocks of characters, all of the same size
× For each block, calculate the percentage of its unique n-grams that also appear in the training set
× A block is considered a genuine sample of the user if that percentage is greater than the threshold specified for the user
Author Verification for Short Messages
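The block-based verification steps above can be sketched as follows. The threshold is passed in as a parameter because the slides do not say how it is computed; the n-gram length, the decision to drop a trailing partial block, and all function names are likewise assumptions.

```python
from typing import List, Set


def unique_ngrams(text: str, n: int = 3) -> Set[str]:
    """The set of distinct character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def split_into_blocks(text: str, block_size: int) -> List[str]:
    """Consecutive blocks of block_size characters; a short tail is dropped."""
    return [text[i:i + block_size]
            for i in range(0, len(text) - block_size + 1, block_size)]


def shared_percentage(block: str, profile: Set[str], n: int = 3) -> float:
    """Percentage of the block's unique n-grams that appear in the profile."""
    grams = unique_ngrams(block, n)
    if not grams:
        return 0.0
    return 100.0 * len(grams & profile) / len(grams)


def verify_blocks(training_text: str, verification_text: str, block_size: int,
                  threshold: float, n: int = 3) -> List[bool]:
    """True for each block whose shared percentage exceeds the threshold."""
    profile = unique_ngrams(training_text, n)  # the user profile
    return [shared_percentage(block, profile, n) > threshold
            for block in split_into_blocks(verification_text, block_size)]
```

For example, `verify_blocks("hello world", "hellozzzzz", 5, 50.0, n=2)` accepts the first block and rejects the second, since every bigram of "hello" is in the profile and none of "zzzzz" is.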
× Analyze syntactic information
× Authors may subconsciously reuse similar phrase structures.
× Pick the most frequently used structures as features
Parts of Speech
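One way to turn this idea into features, sketched under two assumptions: that "phrase structure" is approximated by n-grams of part-of-speech tags, and that the tags come from an external tagger (for instance NLTK's `pos_tag`) run beforehand. The slides confirm neither, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

TagNgram = Tuple[str, ...]


def pos_ngrams(tags: Sequence[str], n: int = 3) -> List[TagNgram]:
    """Overlapping n-grams over a sequence of POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def top_structures(tagged_texts: Iterable[Sequence[str]], n: int = 3,
                   top: int = 50) -> List[TagNgram]:
    """The most frequently used tag n-grams across all texts."""
    counts = Counter(g for tags in tagged_texts for g in pos_ngrams(tags, n))
    return [g for g, _ in counts.most_common(top)]


def structure_vector(tags: Sequence[str], structures: Sequence[TagNgram],
                     n: int = 3) -> List[float]:
    """Relative frequency of each selected structure in one text."""
    grams = pos_ngrams(tags, n)
    counts = Counter(grams)
    total = max(len(grams), 1)
    return [counts[s] / total for s in structures]
```

Using relative rather than raw frequencies keeps short and long comments comparable, which matters for short social media text.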
Performance
× Multi-layered classifier simulating a neural net
× 5 input nodes; the final layer classifies by majority vote over the middle-layer vector
× Focus on the strengths and weaknesses of each of the 5 classifiers
× Accuracy ~80%
Master Classifier
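The majority-vote step can be illustrated with a small sketch. It treats each base classifier as a function from text to a predicted author label, which is an assumption about the interface; how the project weighted the classifiers' strengths and weaknesses is not described, so this version gives every vote equal weight.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Most common label; ties go to the label that was voted first."""
    return Counter(votes).most_common(1)[0][0]


def master_classify(text: str, classifiers: Sequence[Callable[[str], str]]) -> str:
    """Collect one vote per base classifier and return the majority label."""
    votes = [classify(text) for classify in classifiers]
    return majority_vote(votes)
```

With five base classifiers and two candidate authors, a tie cannot occur; with more authors the first-voted label wins ties, so the order of `classifiers` matters.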
5. Future Work
× Expand features to include semantic analysis and other external non-language data points
× Explore different social media datasets
× Build tooling to measure real world success
Extension
Questions?