multi-domain alias matching using machine learning

Multi-Domain Alias Matching Using Machine Learning

Amendra Shrestha¹, Lisa Kaati¹, Michael Ashcroft¹, Fredrik Johansson²

¹ Uppsala University

² Swedish Defence Research Agency (FOI)

September 5, 2016

Outline

1 Introduction

• Multiple aliases

• Online anonymity

2 Methodology

3 Experiments & Results

4 Summary



Multiple aliases


Possible reasons for multiple aliases

• Banned by administrator

• Banned for inactivity

• Lost trust of the group

• Developed bad personal relationships

• To support their own arguments

• Privacy reasons



Online anonymity


People are often open about who they are online, but sometimes they want to remain anonymous.

• Spreading terrorism propaganda

• Performing "hate crimes" online

• Participating in political debates

• Acting as whistleblowers

• Protesting against totalitarian regimes


Obtaining online anonymity

• Creation of anonymous user accounts (potentially in combination with use of Tor or internet cafes).


Example of manual author identification

• Theodore John "Ted" Kaczynski ("the Unabomber")

• Bombing campaign against people involved with modern technology

• Killed 3 people and injured 23 others

• Wrote the manifesto "Industrial Society and Its Future"

• Stopped after his brother recognized the writing style




Attacking online anonymity

Can the user be identified anyway?

• Stylometric profiling (S)

• Time-based profiling (T)

• Emotion-based profiling (E)



Attacking online anonymity

• Electronic posts (blog posts, tweets, forum posts, etc.)

• Textual content

• Metadata (e.g. publishing times)

• Fingerprint of the anonymous user, created from these posts

• Fingerprints of known users, created from social media

• Comparison of the anonymous user's fingerprint with the fingerprints of known users
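The comparison of fingerprints can be illustrated with a minimal sketch. It assumes that a fingerprint is a plain numeric feature vector and that cosine similarity is the comparison measure; neither is stated on the slide, and the function names and example values are hypothetical. The experiments in this deck use trained classifiers, so this only illustrates the idea of comparing an anonymous fingerprint against known ones.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero fingerprint matches nothing
    return dot / (norm_a * norm_b)

def rank_candidates(anonymous_fp, known_fps):
    """Return (user, similarity) pairs, most similar first."""
    scores = [(user, cosine_similarity(anonymous_fp, fp))
              for user, fp in known_fps.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical fingerprints, invented purely for illustration.
known = {"user_a": [0.9, 0.1, 0.0], "user_b": [0.1, 0.2, 0.7]}
anonymous = [0.8, 0.2, 0.0]
ranking = rank_candidates(anonymous, known)
print(ranking[0][0])  # the best-matching known user
```

In practice a decision threshold, or a classifier as in the experiments, would decide whether the best match is close enough to count as the same author.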


Stylometric techniques

• Statistical analysis of writing style
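As a rough illustration of what such a statistical analysis can look like, the sketch below computes a few simple style statistics for one text. The chosen features (function-word frequencies, average word length, punctuation rates) and the tiny word list are assumptions for illustration, not the feature set used in this work.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list; real stylometric
# feature sets are much larger.
FUNCTION_WORDS = ("the", "and", "of", "to", "in", "that", "but")

def stylometric_features(text):
    """Return a dict of simple writing-style statistics for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    features = {}
    # Relative frequency of each function word.
    for word in FUNCTION_WORDS:
        features["freq_" + word] = counts[word] / total if total else 0.0
    # Average word length in characters.
    features["avg_word_length"] = (
        sum(len(w) for w in words) / total if total else 0.0
    )
    # Punctuation marks per character of text.
    n_chars = len(text)
    for mark in ".,!?":
        features["punct_" + mark] = text.count(mark) / n_chars if n_chars else 0.0
    return features

sample = "The cat sat on the mat, and the dog barked."
features = stylometric_features(sample)
print(features["freq_the"], features["avg_word_length"])
```

Because such statistics depend little on topic, they can stay comparable across posts by the same author on different subjects.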



Time profile


• Hour of Day: Hour1, Hour2, . . . , Hour24

• Period of Day: MidNight, EarlyMorning, Morning, MidDay, Evening, Night

• Month: Jan, Feb, . . . , Dec

• Day: Sunday, Monday, . . . , Saturday

• Type of Day: WeekDay, WeekEnd
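The features listed above could be computed from posting timestamps roughly as follows. This is a sketch under stated assumptions: each feature is the fraction of a user's posts falling in that bin, Hour1 covers 00:00 to 00:59, and the six periods of the day are equal four-hour blocks starting at midnight. The slide does not give the actual bin boundaries or normalization, and `time_profile` is a hypothetical name.

```python
from datetime import datetime

PERIODS = ("MidNight", "EarlyMorning", "Morning", "MidDay", "Evening", "Night")
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
DAYS = ("Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday")

def time_profile(timestamps):
    """Map a list of datetimes to relative posting frequencies per bin."""
    names = ["Hour%d" % h for h in range(1, 25)]
    names += list(PERIODS) + list(MONTHS) + list(DAYS) + ["WeekDay", "WeekEnd"]
    profile = dict.fromkeys(names, 0.0)
    if not timestamps:
        return profile
    step = 1.0 / len(timestamps)
    for ts in timestamps:
        profile["Hour%d" % (ts.hour + 1)] += step   # assumed: Hour1 = 00:00-00:59
        profile[PERIODS[ts.hour // 4]] += step      # assumed: four-hour blocks
        profile[MONTHS[ts.month - 1]] += step
        # datetime.weekday() returns Monday=0 ... Sunday=6; shift so Sunday=0.
        profile[DAYS[(ts.weekday() + 1) % 7]] += step
        profile["WeekEnd" if ts.weekday() >= 5 else "WeekDay"] += step
    return profile

# Three invented posting times: a Monday, a Saturday and a Sunday.
posts = [datetime(2016, 9, 5, 9, 30),
         datetime(2016, 9, 10, 22, 0),
         datetime(2016, 9, 11, 9, 15)]
profile = time_profile(posts)
print(round(profile["Morning"], 3), round(profile["WeekEnd"], 3))
```

Using relative frequencies rather than raw counts keeps profiles comparable between users with very different numbers of posts.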


Examples of time profiles

Analysis of forum posts (boards.ie) suggests that the time profiles of authors are often quite stable over time.



Emotions and Twitter-specific features



Datasets


• Discussion Board : Top-1000 posters (DB-All) & posts limited to 60 (DB-60)

• Twitter : Top-1000 tweeps (TW-All) & tweets limited to 60 (TW-60)

• Blog : 1414 distinct bloggers, of whom 260 have at least 2 blogs


Performance of Classifiers

• Classifiers : AdaBoost, SVM, Naive Bayes

• Used all features (S + T + E)

• Experiments done on 4 different datasets



Performance of Different Datasets and Features

• Classifier : AdaBoost

• Experiments on combinations of features

• T, S, (S + T), (S + E), (S + T + E)



Cross-classification

• Data used : discussion forum & Twitter

• Classifier : AdaBoost

• Performs no worse than models trained on a single dataset



Evaluation on blog data

• Non-synthetic data

• Precision : 0.966

• Recall : 0.567

• Accuracy : 0.929
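For reference, the three measures above are computed from the four outcomes of a binary decision. The sketch below assumes that a "positive" is a pair of accounts judged to belong to the same user; the counts are invented for illustration and are not the counts behind the reported figures.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts.

    tp: true matches reported as matches
    fp: non-matches reported as matches
    fn: true matches that were missed
    tn: non-matches correctly rejected
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy

# Hypothetical counts, NOT taken from the blog experiment.
precision, recall, accuracy = classification_metrics(tp=8, fp=2, fn=4, tn=86)
print(precision, recall, accuracy)
```

Read this way, a precision of 0.966 together with a recall of 0.567 means that reported matches are almost always correct, while a substantial share of true matches goes undetected.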



Summary

• Techniques to identify multiple aliases

• AdaBoost outperforms SVM and Naive Bayes

• Combination of stylometric and time-based features yields better results

• Can be used for real-world linkage of user accounts


Thank You
