JURGITA KAPOČIŪTĖ-DZIKIENĖ Authorship Attribution Approaches for the Lithuanian Language Using Internet Comments

Post on 12-Feb-2020


Page 1: Authorship Attribution Approaches for the Lithuanian ... · Authorship Attribution (AA) AA –a task of identifying who from a set of candidate authors is an actual author of a given

JURGITA KAPOČIŪTĖ-DZIKIENĖ

Authorship Attribution Approaches for the Lithuanian Language Using Internet Comments

Page 2

Authorship Attribution (AA)

AA – the task of identifying which of a set of candidate authors is the actual author of a given anonymous text

Based on the “human stylome” notion (Van Halteren, 2005)

AA is one of the oldest problems and remains highly topical today

Page 3

Applications

Forensics

Security

Electronic commerce

Literary applications

Timeline labels: PAST, NOW

Page 4

Related works (1/2)

Research is carried out with:

e-mails, web forum messages, online chats, Internet blogs, tweets

/short texts, non-normative language/

hundreds (Luyckx and Daelemans, 2008) or thousands of authors (Koppel et al., 2011; Narayanan et al., 2012)

/larger candidate author sets/

Page 5

Related works (2/2)

The concept of idiolect was first discussed in 1971 (Pikčilingis, 1971)

Many descriptive linguistic works (e.g., Žalkauskaitė, 2012; Venčkauskas et al., 2015)

AA research with:

Reference | Texts | Authors | Methods
(Kapočiūtė-Dzikienė et al., 2015) | Parliamentary transcripts, forum posts | 100 | Machine learning
(Kapočiūtė-Dzikienė et al., 2015a) | Fiction texts | 3; 5; 10; 20; 50; 100 | Machine learning
(Kapočiūtė-Dzikienė et al., 2015b) | Internet comments | 1,000 | Similarity-based

Page 6

The corpus*

Internet comments from www.delfi.lt (topics “In Lithuania” and “Abroad”), January 2015 – August 2015

To obtain a clean corpus, an author was excluded if:

the same pseudonym was used from different IP addresses (dynamic IP problem)

different pseudonyms were used under the same IP address

Replies, meta-information and characters other than Lithuanian letters, punctuation marks and digits were filtered out

Texts shorter than 30 symbols (excluding whitespace) were excluded

The texts contain out-of-vocabulary words, missing diacritics, etc.

/spoken non-normative Lithuanian language/

* At: http://dangus.vdu.lt/~jkd/wp-content/uploads/2015/04/INT_KOMENTARAI_INDV2.7z
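The cleaning rules on this slide could be sketched roughly as follows. This is only an illustration, not the author's pipeline: the `(pseudonym, ip, text)` record layout, the function names and the exact punctuation set are assumptions, and reply and meta-information removal is not shown.

```python
import re
from collections import defaultdict

# Lithuanian alphabet (32 letters); the punctuation set below is an assumption.
LT_LETTERS = "aąbcčdeęėfghiįyjklmnoprsštuųūvzž"
PUNCTUATION = ".,;:!?()'\"-"
DISALLOWED = re.compile(
    "[^" + re.escape(LT_LETTERS + LT_LETTERS.upper() + "0123456789" + PUNCTUATION) + r"\s]"
)

def clean_text(text):
    """Drop every character that is not a Lithuanian letter, digit, punctuation or whitespace."""
    return DISALLOWED.sub("", text)

def clean_authors(comments, min_symbols=30):
    """comments: iterable of (pseudonym, ip, text) records (hypothetical layout).

    Returns {pseudonym: [cleaned texts]} for authors who pass both identity checks.
    """
    comments = list(comments)
    ips_by_name = defaultdict(set)
    names_by_ip = defaultdict(set)
    for name, ip, _ in comments:
        ips_by_name[name].add(ip)
        names_by_ip[ip].add(name)

    kept = defaultdict(list)
    for name, ip, text in comments:
        if len(ips_by_name[name]) > 1:   # same pseudonym, different IP addresses
            continue
        if len(names_by_ip[ip]) > 1:     # different pseudonyms, same IP address
            continue
        cleaned = clean_text(text)
        if len("".join(cleaned.split())) < min_symbols:  # length without whitespace
            continue
        kept[name].append(cleaned)
    return dict(kept)
```

Both identity checks are computed over the whole collection before any text is kept, so an author is dropped entirely as soon as one conflicting record exists.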

Page 7

Corpus statistics

Number of authors | 10 | 100 | 1,000
Number of texts | 14,443 | 63,131 | 155,078
Number of tokens (words/digits) | 289,462 | 1,511,823 | 4,068,231
Avg. text length in tokens | 20.042 | 23.947 | 26.233
Random baseline | 0.001 | 0.002 | 0.003
Majority baseline | 0.018 | 0.018 | 0.018
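The slide does not define the two baselines. A minimal sketch under commonly used definitions, which are assumptions here: the majority baseline is the share of the most frequent author, and the random baseline is the sum of squared author proportions. The function name and toy labels are illustrative only.

```python
from collections import Counter

def baselines(labels):
    """Return (random_baseline, majority_baseline) for a list of author labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = [count / total for count in counts.values()]
    majority = max(shares)                          # always guess the largest class
    random_baseline = sum(s * s for s in shares)    # guess according to class priors
    return random_baseline, majority

# Toy example: three authors with 5, 3 and 2 texts.
rnd, maj = baselines(["a"] * 5 + ["b"] * 3 + ["c"] * 2)
```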

Page 8

Main research directions

Methods:

Machine Learning – ML (Naïve Bayes – NB; Support Vector Machine – SVM)

Similarity-Based – SB (cosine similarity measure)

Feature types:

lex – lexical

lem – lemmas, obtained with Lemuoklis (Zinkevičius, 2000)

chr4 – word-level character tetra-grams
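As an illustration of the chr4 feature type, word-level character tetra-grams can be read as 4-character windows that never cross a word boundary. The sketch below assumes whitespace tokenization, lowercasing, and keeping words shorter than four characters whole; none of these details are stated on the slide.

```python
def word_char_ngrams(text, n=4):
    """Character n-grams extracted inside each whitespace-separated word."""
    grams = []
    for word in text.lower().split():
        if len(word) < n:
            grams.append(word)  # assumption: short words are kept as a single feature
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

features = word_char_ngrams("Labas rytas")
```

Because the windows stay inside a word, such features capture stems and endings, which may explain why they tolerate missing diacritics and out-of-vocabulary words better than lemmas.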

Dimensionality reduction techniques:

Entire feature set

Feature ranking (using chi-squared) and TopN (N = 30 thousand)

Random feature set – RFS: N features drawn at random in each of K iterations (K = 20), as proposed by Koppel et al. (2011)
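A rough sketch of how the similarity-based method with random feature sets might work. It assumes each author is represented by a feature-count profile, each of the K iterations votes for the author with the highest cosine similarity on a random subset of N features, and the author with the most votes wins. The voting step and all names below are assumptions, not details given on the slide.

```python
import math
import random
from collections import Counter

def cosine(a, b, features):
    """Cosine similarity of two count dictionaries restricted to the given features."""
    dot = sum(a.get(f, 0) * b.get(f, 0) for f in features)
    norm_a = math.sqrt(sum(a.get(f, 0) ** 2 for f in features))
    norm_b = math.sqrt(sum(b.get(f, 0) ** 2 for f in features))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rfs_attribute(unknown, profiles, n_features, k=20, seed=0):
    """Vote over k random feature subsets; return the author with the most votes."""
    rng = random.Random(seed)
    vocabulary = sorted(set(unknown).union(*profiles.values()))
    votes = Counter()
    for _ in range(k):
        subset = rng.sample(vocabulary, min(n_features, len(vocabulary)))
        best = max(profiles, key=lambda author: cosine(unknown, profiles[author], subset))
        votes[best] += 1
    return votes.most_common(1)[0][0]

# Toy profiles built from character 4-gram counts.
profiles = {
    "A": Counter({"laba": 5, "abas": 5, "ryta": 1}),
    "B": Counter({"vaka": 5, "akar": 5, "ryta": 1}),
}
author = rfs_attribute(Counter({"laba": 2, "abas": 2}), profiles, n_features=4)
```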

Page 9

Experimental setup

Stratified dataset

80%/20% for training/testing, respectively

Accuracy and f-score values were calculated

McNemar’s test (McNemar, 1947) for statistical significance (α = 0.05)
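The setup above could be sketched as follows. The split is stratified per author at 80%/20%, and McNemar's test compares two classifiers on the same test items. The continuity-corrected chi-squared form of the test is an assumption, since the slide does not say which variant was used, and the function names are illustrative.

```python
import math
import random
from collections import defaultdict

def stratified_split(labels, train_share=0.8, seed=0):
    """Return (train_indices, test_indices), keeping each author's share in both parts."""
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for index, author in enumerate(labels):
        by_author[author].append(index)
    train, test = [], []
    for indices in by_author.values():
        rng.shuffle(indices)
        cut = int(round(train_share * len(indices)))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

def mcnemar(gold, pred_a, pred_b):
    """McNemar's test with continuity correction; returns (statistic, p_value)."""
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(1 for g, x, y in zip(gold, pred_a, pred_b) if x == g and y != g)
    c = sum(1 for g, x, y in zip(gold, pred_a, pred_b) if x != g and y == g)
    if b + c == 0:
        return 0.0, 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

A difference between two classifiers would be called significant when the returned p-value falls below α = 0.05.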

Page 10

10 candidate authors*

[Figure: bar chart of accuracy and f-score (vertical axis from 0.0 to 0.8) for the lex, lem and chr4 feature types, grouped by method: NBM, SVM, SB-TopN, SB-RFS]

Bar labels in extraction order, first series: 0.611, 0.608, 0.692, 0.582, 0.698, 0.710, 0.726, 0.691, 0.758, 0.675, 0.744, 0.744, 0.226, 0.302, 0.208, 0.217, 0.175, 0.249, 0.366, 0.217, 0.184

Bar labels in extraction order, second series: 0.590, 0.590, 0.561, 0.566, 0.681, 0.694, 0.709, 0.672, 0.677, 0.658, 0.730, 0.728, 0.211, 0.274, 0.186, 0.192, 0.164, 0.243, 0.333, 0.180, 0.170

* Accuracy and f-score in white and gray columns, respectively.

The first two columns – results on the entire feature set; the second two – on N=30,000 features

The dashed line indicates the higher of the random and majority baselines

differences are statistically significant

Page 11

100 candidate authors

[Figure: bar chart of accuracy and f-score (vertical axis from 0.0 to 0.5) for the lex, lem and chr4 feature types, grouped by method: NBM, SVM, SB-TopN, SB-RFS]

Bar labels in extraction order, first series: 0.213, 0.185, 0.192, 0.177, 0.258, 0.274, 0.435, 0.331, 0.413, 0.346, 0.467, 0.425, 0.083, 0.104, 0.094, 0.054, 0.098, 0.042, 0.060, 0.058, 0.042

Bar labels in extraction order, second series: 0.147, 0.121, 0.128, 0.114, 0.184, 0.199, 0.414, 0.313, 0.397, 0.329, 0.440, 0.395, 0.070, 0.095, 0.080, 0.039, 0.098, 0.034, 0.044, 0.044, 0.034

differences are statistically significant

Page 12

1,000 candidate authors

[Figure: bar chart of accuracy and f-score (vertical axis from 0.0 to 0.3) for the lex, lem and chr4 feature types, grouped by method: NBM, SVM, SB-TopN, SB-RFS]

Bar labels in extraction order, first series: 0.103, 0.092, 0.075, 0.061, 0.097, 0.084, 0.239, 0.191, 0.235, 0.194, 0.267, 0.228, 0.034, 0.024, 0.025, 0.019, 0.039, 0.026, 0.028, 0.017, 0.025

Bar labels in extraction order, second series: 0.025, 0.024, 0.015, 0.022, 0.023, 0.020, 0.154, 0.137, 0.156, 0.143, 0.180, 0.185, 0.029, 0.020, 0.021, 0.016, 0.032, 0.022, 0.024, 0.015, 0.021

differences are not statistically significant

Page 13

Summary of results

Not all results exceed the random and majority baselines:

The lemmatizer is not adapted to cope with non-normative language

ML > SB

chr4 > lex > lem (on the larger author sets)

Entire feature set > RFS (N = 30,000; K = 20) > TopN (N = 30,000)

Page 14

Conclusions and future work

Conclusions:

The first comparative AA results (in terms of methods, feature types and dimensionality reduction techniques) for the Lithuanian language using:

Internet comments

a large author set (i.e., 1,000 candidate authors)

The best results (especially on the larger author sets) are obtained with ML + chr4

Future work:

Error analysis and improvements

A larger number of candidate authors

Different types of non-normative texts (e.g., social network data)

Page 15

Researchers who also worked on this topic

Ligita Šarkutė (KUT)

Andrius Utka (VMU)

Robertas Damaševičius (KUT)

Algimantas Venčkauskas (KUT)

Lithuanian Cybercrime Centre of Excellence for Training, Research and Education (HOME/2013/ISEC/AG/INT/400005176)

Automatic extraction of style applied to individual authors and groups of authors (ASTRA) (LIT-8-69)

http://dangus.vdu.lt/~jkd/eng/

Page 16

Thank you for your attention!

jurgita.k.dz@gmail.com