Multi-Domain Alias Matching Using Machine Learning

Multi-Domain Alias Matching Using Machine Learning. Amendra Shrestha (1), Lisa Kaati (1), Michael Ashcroft (1), Fredrik Johansson (2). (1) Uppsala University, (2) Swedish Defence Research Agency (FOI). September 5, 2016

Upload: amendra-shrestha

Post on 21-Mar-2017


TRANSCRIPT

Page 1: Multi-Domain Alias Matching Using Machine Learning

Multi-Domain Alias Matching Using Machine Learning

Amendra Shrestha (1), Lisa Kaati (1), Michael Ashcroft (1), Fredrik Johansson (2)

(1) Uppsala University

(2) Swedish Defence Research Agency (FOI)

September 5, 2016

Page 2: Multi-Domain Alias Matching Using Machine Learning

Outline Introduction Methodology Experiments & Results Summary

1 Introduction

• Multiple aliases

• Online anonymity

2 Methodology

3 Experiments & Results

4 Summary

Page 3: Multi-Domain Alias Matching Using Machine Learning

Multiple aliases

Page 4: Multi-Domain Alias Matching Using Machine Learning

Multiple aliases

Page 5: Multi-Domain Alias Matching Using Machine Learning

Multiple aliases

Possible reasons for multiple aliases

• Banned by administrator

• Banned for inactivity

• Lost trust of the group

• Developed bad personal relationships

• To support their own arguments

• Privacy reasons

Page 6: Multi-Domain Alias Matching Using Machine Learning

Online anonymity

People are often open about who they are online, but sometimes they want to remain anonymous.

• Spreading terrorism propaganda

• Performing “hate crimes” online

• Participating in political debates

• Acting as whistleblowers

• Protesting against totalitarian regimes

Page 7: Multi-Domain Alias Matching Using Machine Learning

Online anonymity

Obtaining online anonymity

• Creation of anonymous user accounts (potentially in combination with the use of Tor or internet cafés).

Page 8: Multi-Domain Alias Matching Using Machine Learning

Online anonymity

Example of manual author identification

• Theodore John “Ted” Kaczynski (“the Unabomber”)

• Bombing campaign against people involved with modern technology

• Killing 3 people and injuring 23 others

• Author of the manifesto “Industrial Society and Its Future”

• Stopped after his brother recognized the writing style

Page 9: Multi-Domain Alias Matching Using Machine Learning

Page 10: Multi-Domain Alias Matching Using Machine Learning

Attacking online anonymity

Can the user be identified anyway?

• Stylometric profiling (S)

• Time-based profiling (T)

• Emotion-based profiling (E)

Page 11: Multi-Domain Alias Matching Using Machine Learning

Attacking online anonymity

• Social media: electronic posts (blog posts, tweets, forum posts, etc.)

• Extracted from the posts: textual content and metadata (e.g. publishing times)

• Fingerprints of known users are created from this data

• The fingerprint of an anonymous user is compared with the fingerprints of known users
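The fingerprint comparison on this slide can be illustrated with a minimal sketch. It assumes a fingerprint is a plain numeric vector and uses cosine similarity as a stand-in comparison; the talk itself relies on trained classifiers, and the user names and values below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty fingerprint matches nothing
    return dot / (norm_a * norm_b)


def best_match(anonymous, known):
    """Return (user, score) for the known fingerprint closest to `anonymous`."""
    scored = ((user, cosine_similarity(anonymous, fp)) for user, fp in known.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical fingerprints built from posts of known users.
known = {"alice": [0.6, 0.3, 0.1], "bob": [0.1, 0.2, 0.7]}
# Hypothetical fingerprint of the anonymous user.
anonymous = [0.5, 0.4, 0.1]

user, score = best_match(anonymous, known)
print(user, round(score, 3))
```

A ranking like this only proposes the closest candidate; deciding whether the match is close enough is a separate thresholding or classification step.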

Page 12: Multi-Domain Alias Matching Using Machine Learning


Stylometric techniques

• Statistical analysis of writing style
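As a rough illustration of what a statistical analysis of writing style can look like, the sketch below computes a few common stylometric features. The feature choice (a small function-word list, punctuation rates, average word length) is an assumption made for the example and is not the feature set used in the talk.

```python
import re
from collections import Counter

# Illustrative subsets only; real stylometric feature sets are much larger.
FUNCTION_WORDS = ("the", "and", "of", "to", "in")
PUNCTUATION = ".,!?;:"


def stylometric_features(text):
    """Map a text to a dict of simple, length-normalised style features."""
    words = re.findall(r"[a-z']+", text.lower())
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    n_chars = max(len(text), 1)
    word_counts = Counter(words)

    features = {}
    for word in FUNCTION_WORDS:
        features["fw_" + word] = word_counts[word] / n_words
    for mark in PUNCTUATION:
        features["punct_" + mark] = text.count(mark) / n_chars
    features["avg_word_len"] = sum(len(w) for w in words) / n_words
    return features


sample = "The cat sat on the mat, and the dog barked!"
feats = stylometric_features(sample)
print(feats["fw_the"], feats["avg_word_len"])
```

Normalising by text length keeps the features comparable between authors who have written very different amounts.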

Page 13: Multi-Domain Alias Matching Using Machine Learning

Time profile

• Hour of Day: Hour1, Hour2, ..., Hour24

• Period of Day: MidNight, EarlyMorning, Morning, MidDay, Evening, Night

• Month: Jan, Feb, ..., Dec

• Day: Sunday, Monday, ..., Saturday

• Type of Day: WeekDay, WeekEnd
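The time profile listed above could be computed roughly as follows. This is a sketch under assumptions: the six periods of the day are taken to be equal four-hour blocks starting at midnight (the slide does not give the boundaries), and each feature is the fraction of posts falling into the bin.

```python
from datetime import datetime


def time_profile(timestamps):
    """Return a 51-value vector: 24 hours, 6 periods, 12 months, 7 days, 2 day types."""
    hour = [0] * 24
    period = [0] * 6      # MidNight, EarlyMorning, Morning, MidDay, Evening, Night
    month = [0] * 12
    day = [0] * 7         # Sunday .. Saturday, as ordered on the slide
    day_type = [0] * 2    # WeekDay, WeekEnd

    for ts in timestamps:
        hour[ts.hour] += 1
        period[ts.hour // 4] += 1           # assumption: 4-hour blocks from 00:00
        month[ts.month - 1] += 1
        dow = (ts.weekday() + 1) % 7        # datetime uses Monday=0; shift to Sunday=0
        day[dow] += 1
        day_type[1 if dow in (0, 6) else 0] += 1

    n = len(timestamps) or 1
    return [count / n for block in (hour, period, month, day, day_type) for count in block]


# Hypothetical posting times for one user.
posts = [
    datetime(2016, 9, 5, 9, 30),   # Monday morning
    datetime(2016, 9, 5, 21, 0),   # Monday night
    datetime(2016, 9, 10, 9, 5),   # Saturday morning
]
profile = time_profile(posts)
print(len(profile), profile[9])
```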

Page 14: Multi-Domain Alias Matching Using Machine Learning


Examples of time profiles

Analysis of forum posts (boards.ie) suggests that the time profiles of authors are often quite stable over time.

Page 15: Multi-Domain Alias Matching Using Machine Learning

Emotions and Twitter-specific features

Page 16: Multi-Domain Alias Matching Using Machine Learning

Datasets

• Discussion board: top-1000 posters (DB-All) and posts limited to 60 (DB-60)

• Twitter: top-1000 tweeps (TW-All) and tweets limited to 60 (TW-60)

• Blog: 1414 distinct bloggers, of whom 260 have at least 2 blogs

Page 17: Multi-Domain Alias Matching Using Machine Learning


Performance of Classifiers

• Classifiers : AdaBoost, SVM, Naive Bayes

• Used all features (S + T + E)

• Experiments done on 4 different datasets
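For readers unfamiliar with AdaBoost, here is a compact sketch of the discrete variant with decision stumps as weak learners. It assumes alias matching is framed as binary classification of alias pairs, where each row holds feature differences between two aliases (+1 = same author, -1 = different authors). The slides do not spell out this framing, and the data is made up.

```python
import math


def train_stump(X, y, w):
    """Find the (error, feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                pred = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity


def adaboost(X, y, rounds=5):
    """Return a list of (alpha, stump) pairs forming the boosted ensemble."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # clamp to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified rows, decrease the others.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical alias pairs: small differences suggest the same author.
X = [[0.05, 0.10], [0.10, 0.05], [0.08, 0.12],
     [0.60, 0.70], [0.55, 0.40], [0.70, 0.65]]
y = [1, 1, 1, -1, -1, -1]

model = adaboost(X, y)
print([predict(model, row) for row in X])
```

In practice a library implementation would be used instead of this hand-written version; the sketch only shows the reweighting idea behind the algorithm.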

Page 18: Multi-Domain Alias Matching Using Machine Learning

Performance of Different Datasets and Features

• Classifier: AdaBoost

• Experiments on combination of features

• T, S, (S + T), (S + E), (S + T + E)
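How the feature groups are joined is not stated on the slide. A minimal sketch, assuming each group is a numeric vector and a combination such as S + T is a plain concatenation; the values are placeholders.

```python
def combine(blocks, selection):
    """Concatenate the selected feature blocks in the given order."""
    vector = []
    for name in selection:
        vector.extend(blocks[name])
    return vector


# Placeholder feature blocks for one alias: stylometric, time, emotion.
blocks = {"S": [0.3, 0.1], "T": [0.5, 0.5, 0.0], "E": [0.2]}

# The five combinations listed on the slide.
COMBINATIONS = [("T",), ("S",), ("S", "T"), ("S", "E"), ("S", "T", "E")]

vectors = {"+".join(combo): combine(blocks, combo) for combo in COMBINATIONS}
for name, vector in vectors.items():
    print(name, vector)
```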

Page 19: Multi-Domain Alias Matching Using Machine Learning

Cross-classification

• Data used: discussion forum and Twitter

• Classifier: AdaBoost

• Not worse than models trained on single dataset
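The cross-classification protocol, training on one domain and testing on another, can be sketched as below. A one-feature threshold rule stands in for AdaBoost to keep the example short, and all distances and labels are invented.

```python
def fit_threshold(pairs):
    """Learn a cut-off on one distance feature: midpoint between the class means."""
    same = [d for d, label in pairs if label == 1]
    diff = [d for d, label in pairs if label == 0]
    return (sum(same) / len(same) + sum(diff) / len(diff)) / 2


def accuracy(threshold, pairs):
    """Fraction of pairs where 'distance <= threshold' agrees with the label."""
    correct = sum((d <= threshold) == (label == 1) for d, label in pairs)
    return correct / len(pairs)


# (distance between two aliases, 1 = same author / 0 = different authors)
forum_pairs = [(0.10, 1), (0.15, 1), (0.20, 1), (0.60, 0), (0.70, 0), (0.80, 0)]
twitter_pairs = [(0.12, 1), (0.30, 1), (0.55, 0), (0.90, 0)]

threshold = fit_threshold(forum_pairs)          # train on the forum domain
cross_acc = accuracy(threshold, twitter_pairs)  # evaluate on the Twitter domain
print(round(threshold, 3), cross_acc)
```

Comparing this cross-domain score with the score of a model trained and tested on Twitter alone is what supports a claim like the one on the slide.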

Page 20: Multi-Domain Alias Matching Using Machine Learning

Evaluation on blog data

• Non-synthetic data

• Precision : 0.966

• Recall : 0.567

• Accuracy : 0.929
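For reference, the three metrics are derived from the confusion counts as follows. The counts in this sketch are invented for illustration and are not the counts behind the figures reported above.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute the three metrics from true/false positive/negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Invented counts: "positive" means a pair of aliases predicted to share an author.
precision, recall, accuracy = precision_recall_accuracy(tp=8, fp=2, fn=4, tn=86)
print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```

High precision with lower recall, as in the blog evaluation, means that proposed links are rarely wrong but a sizeable share of true alias pairs goes undetected.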

Page 21: Multi-Domain Alias Matching Using Machine Learning

Summary

• Techniques to identify multiple aliases

• AdaBoost outperforms SVM and Naive Bayes

• Combination of stylometric and time-based features yields better results

• Can be used for real-world linkage of user accounts

Page 22: Multi-Domain Alias Matching Using Machine Learning

Thank You