author gender identification from text
Post on 23-Feb-2016
1
Author Gender Identification from Text
By: El Hebri Khiari 200790830
COE 589 – Digital Forensics
Due: Tuesday 25th September 2012
2
Outline
• Introduction & Motivation
• Authorship Attribution
• Detecting Genders
• Contribution(s)
• Problem Formulation
• Data Pre-Processing
  o Reuters Newsgroup Dataset
  o Enron Email dataset
• Feature Selection & Extraction
• Classification Techniques
• Experimental Results
• Tool Results
• Conclusion
3
Introduction & Motivation
• Text most prevalent on Internet
  o Email
  o Blogs
  o Chat rooms
• Applications
  o Twitter
  o Craigslist
  o Facebook
• Statistics
  o 2008: 33.1% increase in online crime
  o October 2009: 1.69 billion Internet users
• Motivations [??]
  o Anonymity
  o Faking gender
  o “MySpace mom”
4
Introduction & Motivation cont.
• Question
  o “Given a short text document, can we identify if the author is a man or a woman?”
5
Authorship Attribution
• Features
  o Stylistic tendency → Stylometric analysis
  o Over 1000 features
  o Author’s state of mind
• Statistical methods
  o Word-length Distribution
  o Bayesian Classifier
  o Principal Component Analysis
  o Cluster Analysis
• Machine Learning
  o Decision Tree
  o Neural Networks
  o Support Vector Machine (SVM)
6
Authorship Attribution cont.
• Different problem
  o Abstraction
  o Length of messages
  o Special linguistic elements (emoticons)
  o Time constraints
7
Detecting Genders
• Socially-constructed Gender
• Fundamental questions
  o “Do men & women inherently use different classes of language styles?”
  o “What are reliable linguistic features that indicate gender?”
• Robin Lakoff (1975)
  o Lexical, syntactic & pragmatic features
  o Specialized vocabulary, expletives, etc.
8
Detecting Genders cont.
• Mary Talbot (1998)
  o Influence of social divisions
• Mulac et al. (1990), Mulac & Lundell (1994)
  o Students’ impromptu essays
  o Descriptions of photographs
  o Dyadic interactions between strangers
  o Written communication & face-to-face interaction
9
Contribution(s)
• Little work on gender identification (GI) [??]
• Propose
  o Robust Classifier
    - Based on content-free text messages
    - Internet text messages
  o Feature types
• Design
  o Set of measures
  o Classifiers & Parameter optimization
10
Problem Formulation
• Binary problem
  o Class1 if author of e is male
  o Class2 if author of e is female
• Set of features
  o Constant for same gender
  o d-dimensional vector
11
Problem Formulation cont.(1)
• Classifier
  o Learn a classifier $y = f(x)$ from a set of training examples
    $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$
  o Let $X = \{x_i,\ i = 1, 2, \ldots, N\}$, where $x_i$ is a d-dimensional vector
  o Let $Y = \{y_i,\ i = 1, 2, \ldots, N\}$, where $y_i \in \{+1, -1\}$ indicates class1 ($-1$) or class2 ($+1$)
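The formulation above can be mirrored in code as a list of (feature vector, label) pairs. This is only a minimal sketch: the helper name `split_training_set` and the type aliases are illustrative assumptions, not part of the original work.

```python
from typing import List, Sequence, Tuple

# x_i: a d-dimensional feature vector; y_i: -1 (class1, male) or +1 (class2, female)
FeatureVector = Sequence[float]
Label = int


def split_training_set(
    D: Sequence[Tuple[FeatureVector, Label]],
) -> Tuple[List[FeatureVector], List[Label]]:
    """Split D = {(x_i, y_i)} into X and Y, checking labels and dimensions.

    Hypothetical helper, written only to illustrate the notation on the slide.
    """
    X = [x for x, _ in D]
    Y = [y for _, y in D]
    if any(y not in (-1, +1) for y in Y):
        raise ValueError("labels must be -1 or +1")
    if len({len(x) for x in X}) > 1:
        raise ValueError("all feature vectors must share the same dimension d")
    return X, Y
```

A classifier f is then any function that maps one feature vector x to a label in {-1, +1}.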
12
Problem Formulation cont.(2)
13
Dataset Pre-processing
• Two extremes
  o Newsgroup messages → Reuters newsgroup dataset
  o Private Emails → Enron email dataset
14
Dataset Pre-processing cont.(1)
• Reuters newsgroup dataset
  o Stories by Reuters journalists, 1996 – 1997
  o Few hundred to thousand words
  o Discard neutral names
  o Remove unnecessary info & XML formatting
  o Limiting quotes, 0.002 per character
  o >200 and <1000 words
15
Dataset Pre-processing cont.(2)
• Enron email dataset
  o Emails made public by Federal Energy Regulatory Commission
  o Integrity problems → some emails removed
  o Invalid emails
  o Final set
    - 517,431 emails
    - 150 users, 3.5 years
    - Plain text, no attachments
  o Removed headers & reply texts
  o Removed duplicated emails
  o Removed ultra-short emails
  o > 50 and <100 words
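The duplicate-removal and length-filtering steps listed above could look roughly like the sketch below. The function names, the whitespace-based word count, and the exact-match notion of a duplicate are assumptions made for illustration; the slides do not say how these steps were implemented.

```python
from typing import Iterable, List


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def filter_messages(messages: Iterable[str], min_words: int, max_words: int) -> List[str]:
    """Drop duplicates, then keep messages with min_words < length < max_words."""
    seen = set()
    kept = []
    for text in messages:
        # Collapse whitespace so trivially different copies count as duplicates.
        normalized = " ".join(text.split())
        if normalized in seen:
            continue
        seen.add(normalized)
        if min_words < word_count(normalized) < max_words:
            kept.append(text)
    return kept
```

For the Reuters stories the bounds on the slide would correspond to `filter_messages(stories, 200, 1000)`.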
16
Feature Set Selection
• Question
  o “What are good linguistic features that indicate gender?”
• Human psychology & extensive experimentation
  o Character-based
  o Word-based
  o Syntactic
  o Structure-based
  o Function words
• Total of 545 features
17
Feature Set Selection cont.(1)
• Character-based features
  o 29 Stylometric features
  o Widely adopted in Authorship attribution
  o Examples
    - Number of white space characters
    - Number of special characters
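As an illustration of the two example features named above, the sketch below counts whitespace and special characters. Treating "special characters" as Python's `string.punctuation` set is an assumption; the slides do not define the set, and the other 27 character-based features are not reproduced here.

```python
import string
from typing import Dict


def character_features(text: str) -> Dict[str, int]:
    """Return a few character-level counts for one message (illustrative subset)."""
    return {
        "total_chars": len(text),
        "whitespace_chars": sum(ch.isspace() for ch in text),
        # Assumption: "special characters" means ASCII punctuation.
        "special_chars": sum(ch in string.punctuation for ch in text),
    }
```

In practice such raw counts are usually divided by the total character count so that long and short messages are comparable.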
18
Feature Set Selection cont.(2)
• Word-based features
  o 33 statistical metrics
    - Vocabulary richness
    - Yule’s K measure
    - Entropy measure
  o 68 psycho-linguistic features
    - Linguistic Inquiry and Word Count (LIWC)
  o Individuals benefiting from writing
    - Positive & negative emotional words
    - Cognitive words (cause, know)
    - Switch use of pronouns
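Two of the statistical metrics named above have standard definitions that are easy to sketch. Yule's K is 10^4 · (Σ m²·V_m − N) / N², where N is the number of tokens and V_m is the number of distinct words occurring exactly m times; the entropy measure is taken here to be the Shannon entropy of the word frequency distribution. The lowercase, whitespace-based tokenizer is an assumption.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Assumed tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def yules_k(tokens: List[str]) -> float:
    """Yule's K: lower values indicate a richer vocabulary."""
    n = len(tokens)
    if n == 0:
        return 0.0
    word_freq = Counter(tokens)
    # spectrum[m] = V_m, the number of distinct words that occur exactly m times
    spectrum = Counter(word_freq.values())
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def word_entropy(tokens: List[str]) -> float:
    """Shannon entropy (in bits) of the word frequency distribution."""
    n = len(tokens)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())
```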
19
20
Feature Set Selection cont.(3)
• Syntactic features
  o Sentence level
  o Regular and informal punctuation
  o Mulac (1998)
    - Women use more question marks
21
Feature Set Selection cont.(4)
• Structure-based features
  o Layout
    - Paragraph length
    - Use of greetings
  o Big influence in online documents
22
Feature Set Selection cont.(5)
• Function words
  o Ambiguous meaning
  o Grammatical relationships
  o Different set from word-based
    - Important role
  o 9 gender-linked features
• Women use emotionally-intensive & affective adjectives
• Men express ‘independence’ → First-person singular pronouns
23
24
Automatic Extraction
• Normalization
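The slide does not say which normalization was used. One common choice, shown here purely as an assumed example, is min-max scaling of each feature column to the range [0, 1] so that counts on very different scales contribute comparably to the classifier.

```python
from typing import List, Sequence


def min_max_normalize(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each feature (column) to [0, 1]; constant columns map to 0.0."""
    if not vectors:
        return []
    d = len(vectors[0])
    lows = [min(v[j] for v in vectors) for j in range(d)]
    highs = [max(v[j] for v in vectors) for j in range(d)]
    normalized = []
    for v in vectors:
        row = []
        for j in range(d):
            span = highs[j] - lows[j]
            row.append((v[j] - lows[j]) / span if span > 0 else 0.0)
        normalized.append(row)
    return normalized
```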
25
Classification Techniques
• Three classifiers
  o Bayesian-based logistic regression
  o AdaBoost Decision tree
  o Support Vector Machine (SVM)
26
Classification Techniques cont.(1)
• Bayesian-based logistic regression
  o Probability
  o Threshold set to 0.5
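The probability formula itself did not survive in this transcript. The sketch below uses the standard logistic function of a linear score and applies the 0.5 threshold from the slide. The names `weights`, `predict_probability` and `classify` are assumptions, as is the choice to assign a probability of exactly 0.5 to class +1.

```python
import math
from typing import Sequence


def predict_probability(weights: Sequence[float], x: Sequence[float]) -> float:
    """P(y = +1 | x) under a logistic model with a linear score w . x."""
    z = sum(w * xi for w, xi in zip(weights, x))
    # Numerically stable form of 1 / (1 + exp(-z)) for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(weights: Sequence[float], x: Sequence[float], threshold: float = 0.5) -> int:
    """Return +1 (class2) if the probability reaches the threshold, else -1 (class1)."""
    return 1 if predict_probability(weights, x) >= threshold else -1
```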
27
Classification Techniques cont.(2)
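  o Avoid overfitting
  o Assume each model parameter has a Normal prior
  o Mean = 0, with a per-parameter variance
  o Assume each variance follows an exponential distribution
  o Integrating out the variance transforms the prior into a Laplace distribution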
28
Classification Techniques cont.(3)
  o Assume the components of the parameter vector are independent
  o Overall prior of the parameter vector
  o Posterior density given dataset D
29
Classification Techniques cont.(4)
  o Use log posterior
  o Minimize the negative log posterior –l(·): convex function → suitable for optimization
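The equations for these slides are missing from the transcript. For reference, the standard Bayesian logistic regression model with a Laplace prior, which matches the steps listed, can be written as follows. The symbols β (parameter vector), τ_j (per-parameter variance), γ and λ (hyperparameters) are assumed notation and may differ from the original slides.

$$p(y_i \mid \beta, x_i) = \frac{1}{1 + \exp(-y_i\, \beta^{T} x_i)}, \qquad y_i \in \{+1, -1\}$$

$$p(\beta_j \mid \tau_j) = \mathcal{N}(0, \tau_j), \qquad p(\tau_j \mid \gamma) = \frac{\gamma}{2} \exp\!\left(-\frac{\gamma}{2}\, \tau_j\right)$$

$$p(\beta_j \mid \lambda) = \int_0^{\infty} p(\beta_j \mid \tau_j)\, p(\tau_j \mid \gamma)\, d\tau_j = \frac{\lambda}{2} \exp(-\lambda\, |\beta_j|), \qquad \lambda = \sqrt{\gamma}$$

$$p(\beta) = \prod_{j=1}^{d} p(\beta_j \mid \lambda)$$

$$l(\beta) = -\sum_{i=1}^{N} \ln\!\left(1 + \exp(-y_i\, \beta^{T} x_i)\right) - \lambda \sum_{j=1}^{d} |\beta_j| + \text{const}$$

Both terms of $-l(\beta)$ are convex in $\beta$ (a sum of logistic losses plus an L1 penalty), which is why the minimization is well suited to standard optimizers.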
30
Classification Techniques cont.(5)
• Decision Tree
  o Flowchart-like tree structure
  o Attribute → Internal node
  o Outcome → Branch
  o Class → Terminal node
  o High variance → Overfitting
• AdaBoost
  o Solid theoretical background
  o Simple
  o Accurate predictions
  o Proven to be successful
31
Classification Techniques cont.(6)
  o Assign equal weights to all training examples
  o Weights with distribution D_t at the t-th round
  o Generate weak learner h_t : X → Y
  o Test h_t, new weight distribution D_{t+1}
  o Repeat T times
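The boosting loop outlined above can be sketched with one-level decision trees (stumps) as the weak learners. This is the textbook discrete AdaBoost update, not necessarily the exact variant or tree depth used in the original experiments; all function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[float, int, float, int]  # (weighted error, feature index, threshold, sign)


def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    _, j, thr, sign = stump
    return sign if x[j] > thr else -sign


def train_stump(X: Sequence[Sequence[float]], y: Sequence[int], w: Sequence[float]) -> Stump:
    """Pick the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                candidate = (0.0, j, thr, sign)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(candidate, xi) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def adaboost(X, y, rounds: int) -> List[Tuple[float, Stump]]:
    n = len(X)
    w = [1.0 / n] * n                       # D_1: equal weights
    ensemble = []
    for _ in range(rounds):                 # repeat T times
        stump = train_stump(X, y, w)        # weak learner h_t
        err = max(stump[0], 1e-12)          # avoid division by zero
        if err >= 0.5:                      # no better than chance: stop
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # D_{t+1}: raise the weight of misclassified examples, then renormalize
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x) -> int:
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

The final prediction is the sign of the alpha-weighted vote, so more accurate weak learners have more influence.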
32
Classification Techniques cont.(7)
• Support Vector Machine
  o Linearly separable classes
  o Optimal
  o Linearly inseparable
33
Classification Techniques cont.(8)
  o Non-linear problem
  o Use Kernel trick
    - Linear
    - Polynomial
    - Radial basis
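The three kernels listed have standard textbook forms, sketched below. The parameter names and default values (`degree`, `coef0`, `gamma`) are assumptions for illustration; the slides do not give the settings used in the experiments.

```python
import math
from typing import Sequence


def linear_kernel(u: Sequence[float], v: Sequence[float]) -> float:
    """K(u, v) = u . v"""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u: Sequence[float], v: Sequence[float],
                      degree: int = 3, coef0: float = 1.0) -> float:
    """K(u, v) = (u . v + coef0) ** degree"""
    return (linear_kernel(u, v) + coef0) ** degree


def rbf_kernel(u: Sequence[float], v: Sequence[float], gamma: float = 0.5) -> float:
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

Each kernel acts as an inner product in some feature space, which lets the SVM fit a non-linear boundary without computing that space explicitly.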
34
Experimental Results
• Feature Extraction → Python
• Classifiers → MATLAB
• Each experiment repeated 10 times
35
Experimental Results cont.(1)
• SVM outperforms (76.75% & 82.23%)
• Sharp improvements in AdaBoost
• Small changes in Bayesian Logistic Regression
36
Experimental Results cont.(2)
• Impact of parameters
  o >50 words
  o >100 words
  o >200 words
37
Experimental Results cont.(3)
• Significance of feature sets
  o >100 words
  o One feature at a time
38
Experimental Results cont.(4)
• Optimization
  o 5% Feature size reduction: 157 out of 545
  o Extraction time reduced from 3.77 to 1.35 seconds
  o 3.03% drop in accuracy
39
Tool Results
• male 64.46%
• male 75.83%
• male 59.89%
• neutral 96.98% ??
• male 58.31%
• male 72.60%
• male 63.30%
• male 57.57%
• male 73.89%
• male 59.07%
• Actual Results: 5 male out of 10
40
Conclusion
• Differences do exist between genders
• SVM outperforms
• Significant features [??]
  o Word-based features
  o Function words
  o Structural features
• Increase data set → better accuracy