unsupervised content-based characterization and anomaly ... · video games, etc.), making the...

10
Unsupervised Content-Based Characterization and Anomaly Detection of Online Community Dynamics Danelle Shah, Michael Hurley, Jessamyn Liu, Matthew Daggett MIT Lincoln Laboratory, Lexington, MA {danelle.shah, hurley, jessamyn.liu, daggett}@ll.mit.edu Abstract The structure and behavior of human networks have been investigated and quantitatively modeled by modern social scientists for decades, however the scope of these efforts is often constrained by the labor-intensive curation processes that are required to collect, organize, and analyze network data. The surge in online social media in recent years provides a new source of dynamic, semi-structured data of digital human networks, many of which embody attributes of real-world networks. In this paper we leverage the Reddit social media platform to study social communities whose dynamics indicate they may have experienced a disturbance event. We describe an unsupervised approach to analyzing natural language content for quantifying community similarity, monitoring temporal changes, and detecting anomalies indicative of disturbance events. We demonstrate how this method is able to detect anomalies in a spectrum of Reddit communities and discuss its applicability to unsupervised event detection for a broader class of social media use cases. 1. Introduction This research is part of a larger effort to explore how human networks respond to disturbances that impact the community’s membership, function, or interests. In particular, we are interested in observing network dynamics which may signal unrest, vulnerability, or redefinition in response to a disruptive event. While large-scale events such as natural disasters or political upheaval may be trivial to identify, the detection of localized events impacting specific sub-communities is more challenging. We also seek to study when and why similar networks behave differently or dissimilar This material is based upon work supported by the Department of Defense under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense. networks behave similarly in response to the same event. To these ends, it is necessary to: 1) characterize a community’s membership, function, or interests over time; 2) quantify intra- and inter-community similarity; 3) automatically detect events that cause a community disturbance; and 4) model the short- and long-term network effects of an observed disturbance. A community is defined as “a social, religious, occupational, or other group sharing common characteristics or interests and perceived or perceiving itself as distinct in some respect from the larger society within which it exists” [35]. The active association or interaction of individuals within a community to achieve some end (e.g., exchange information, execute a task) is referred to as a network. 1 Communities are often hierarchical and overlapping, and an individual may belong to an arbitrary number of communities. Social media provides a rich, observable data source of dynamic social networks, and it has been extensively studied to model both online and offline human social behavior. Data generated by online social networks have been used to detect significant events in real-time [7, 31], track the spread and sentiment of information [4, 29, 32], monitor population health and safety [13, 33, 34], etc. Reddit (www.reddit.com) is one such social media platform, generating large volumes of content shared by self-organizing, topically-oriented communities. In contrast to the significant body of work dedicated to the definition and detection of communities from individuals and connections [12], the structure of Reddit allows communities to be treated as atomic units without the need to infer membership. In this paper we describe an unsupervised approach to modeling the content shared within Reddit communities in order to measure inter-community similarity, track volatility over time, and automatically detect anomalous activity corresponding to real-world events. In Section 2, we review related research in 1 This distinction is subtle. In the context of this work we assume a 1:1 mapping between the (observable) network and the corresponding (sub)community it represents, and use these terms interchangeably. Proceedings of the 52nd Hawaii International Conference on System Sciences | 2019 URI: hps://hdl.handle.net/10125/59665 ISBN: 978-0-9981331-2-6 (CC BY-NC-ND 4.0) Page 2264

Upload: others

Post on 18-Mar-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Unsupervised Content-Based Characterization and Anomaly ... · video games, etc.), making the characterization of this post content a critical component of understanding the community

Unsupervised Content-Based Characterizationand Anomaly Detection of Online Community Dynamics

Danelle Shah, Michael Hurley, Jessamyn Liu, Matthew DaggettMIT Lincoln Laboratory, Lexington, MA

{danelle.shah, hurley, jessamyn.liu, daggett}@ll.mit.edu

Abstract

The structure and behavior of human networks havebeen investigated and quantitatively modeled by modernsocial scientists for decades, however the scope ofthese efforts is often constrained by the labor-intensivecuration processes that are required to collect, organize,and analyze network data. The surge in online socialmedia in recent years provides a new source of dynamic,semi-structured data of digital human networks, manyof which embody attributes of real-world networks. Inthis paper we leverage the Reddit social media platformto study social communities whose dynamics indicatethey may have experienced a disturbance event. Wedescribe an unsupervised approach to analyzing naturallanguage content for quantifying community similarity,monitoring temporal changes, and detecting anomaliesindicative of disturbance events. We demonstrate howthis method is able to detect anomalies in a spectrumof Reddit communities and discuss its applicability tounsupervised event detection for a broader class ofsocial media use cases.

1. Introduction

This research is part of a larger effort to explore howhuman networks respond to disturbances that impactthe community’s membership, function, or interests.In particular, we are interested in observing networkdynamics which may signal unrest, vulnerability, orredefinition in response to a disruptive event. Whilelarge-scale events such as natural disasters or politicalupheaval may be trivial to identify, the detection oflocalized events impacting specific sub-communities ismore challenging. We also seek to study when andwhy similar networks behave differently or dissimilar

This material is based upon work supported by the Departmentof Defense under Air Force Contract No. FA8721-05-C-0002and/or FA8702-15-D-0001. Opinions, interpretations, conclusions andrecommendations are those of the authors and are not necessarilyendorsed by the Department of Defense.

networks behave similarly in response to the same event.To these ends, it is necessary to: 1) characterize acommunity’s membership, function, or interests overtime; 2) quantify intra- and inter-community similarity;3) automatically detect events that cause a communitydisturbance; and 4) model the short- and long-termnetwork effects of an observed disturbance.

A community is defined as “a social, religious,occupational, or other group sharing commoncharacteristics or interests and perceived or perceivingitself as distinct in some respect from the larger societywithin which it exists” [35]. The active associationor interaction of individuals within a community toachieve some end (e.g., exchange information, executea task) is referred to as a network.1 Communities areoften hierarchical and overlapping, and an individualmay belong to an arbitrary number of communities.

Social media provides a rich, observable data sourceof dynamic social networks, and it has been extensivelystudied to model both online and offline human socialbehavior. Data generated by online social networkshave been used to detect significant events in real-time[7, 31], track the spread and sentiment of information[4, 29, 32], monitor population health and safety [13,33, 34], etc. Reddit (www.reddit.com) is one suchsocial media platform, generating large volumes ofcontent shared by self-organizing, topically-orientedcommunities. In contrast to the significant body of workdedicated to the definition and detection of communitiesfrom individuals and connections [12], the structure ofReddit allows communities to be treated as atomic unitswithout the need to infer membership.

In this paper we describe an unsupervised approachto modeling the content shared within Redditcommunities in order to measure inter-communitysimilarity, track volatility over time, and automaticallydetect anomalous activity corresponding to real-worldevents. In Section 2, we review related research in

1This distinction is subtle. In the context of this work we assume a1:1 mapping between the (observable) network and the corresponding(sub)community it represents, and use these terms interchangeably.

Proceedings of the 52nd Hawaii International Conference on System Sciences | 2019

URI: https://hdl.handle.net/10125/59665ISBN: 978-0-9981331-2-6(CC BY-NC-ND 4.0)

Page 2264

Page 2: Unsupervised Content-Based Characterization and Anomaly ... · video games, etc.), making the characterization of this post content a critical component of understanding the community

the area of event detection in social media. In Section3, we describe the Reddit dataset used in this workand introduce our model for characterizing subredditdynamics using natural language processing techniques.Specifically, subreddit post activity is representedin a low-dimensional embedding space, enablingefficient calculation of intra- and inter-communitycontent similarity. We describe an unsupervisedanomaly detection method in Section 4 and present twocase studies where detected community disturbancescorrelated with real-world events. We conclude with adiscussion of opportunities and future work.

2. Background and Related Work

2.1. Reddit

Reddit is a prominent social media platform2

launched in 2005 to capture what’s new and popularon the Internet via sharing of hyperlinks to interestingwebsites. As the user base grew, a community structureorganically developed through the use of interest-centricgroups called subreddits. Users post stories, links,and media relevant to a subreddit’s predefined topic,and engage in community-oriented discussions throughcomments. Due to the encoded structure of userinteractions via posts and threads, the heterogeneity ofcontent (text, images, videos, hyperlinks), open andauthentic discussion afforded by relative anonymity andminimal censorship, and the observability enabled byits Application Programming Interface (API), Redditprovides a unique opportunity to analyze complex inter-and intra-community dynamics.

While we focus our analysis on Reddit in this paper,the approach presented here can be trivially applied toany online social network in which communities arecentralized and explicitly defined. The extension ofthese techniques to platforms in which users interact ina distributed fashion (e.g., Twitter) or to data of physical(offline) networks remains a future research goal.

2.2. Event Detection in Social Media

Methods for detecting events in online social mediahas been an active research area since the popularadoption of the internet in the 1990s. A number ofexcellent surveys have been recently published whichprovide an overview of approaches used to detectanomalies and behaviors in online social media (e.g.[18, 39, 40]). Using the nomenclature of [18], thetechnique described in this paper uses an unsupervisedapproach to dynamic, contextual anomaly detection.

2As of this writing, Reddit is ranked 5 most visited website in theUnited States and 16 in the world [2].

Twitter has been leveraged extensively for detectingevents in social media using a variety of methodsincluding analysis of social network structures as wellas message volume, timing, and content. Applicationsinclude detection of events in sports [7, 28, 42],public emergencies and disease outbreaks [3, 34] andnatural disasters [8, 33]. Several recent works performgeneral event detection and trend spotting on Twitterby detecting anomalous or “bursty” phrases [6, 23, 36].However, Twitter presents many challenges due to theshort, unstructured, usually poorly formatted messagesthat users post, and lacks explicit community structurecorresponding to user membership or content relevancy.

Although Reddit has been used as a data source foronline social media analysis, relatively little research hasbeen conducted on event detection in Reddit. Instead,the research has more often focused on communityanalysis [15, 16], topic discovery [13, 37] or onsummarizing posts and comments [22]. Most analyseshave focused on event summarization as opposed toevent detection. Event summarization is concerned withcollecting and compressing streams of text into a concisesummary statement. Event detection aims to identifysignificant changes in the text streams [7, 24].

Hamilton, et al. [14] studied the extent thatsentiment varies over time within the content ofspecific communities on Reddit. They combineddomain-specific word embeddings and a labelpropagation framework to induce sentiment lexiconsusing a small numbers of seed words from Redditdata with a corpus of 150 years of English words.Lyngbaek’s thesis [22] describes an application calledSPORK that identifies and summarizes topics in Redditcomment threads. It uses natural language processingbased upon term frequency-inverse document frequency(TF-IDF), cosine similarity, and clustering algorithms.Lyngbaek uses a hierarchical clustering algorithm tocombine similar comments based upon word-frequencyvectors. In this paper we use a similar technique,however rather than clustering individual commentsinto summarization topics, we model entire subredditcommunities over finite time spans in order to detectsignificant changes in response to real-world events.

3. Modeling Community Content

We seek to model user behavior on Reddit asa means to characterize communities over time andto detect anomalies in community activity caused byconsequential events. In this section we describe acontent-based activity model, taking advantage of recentadvances in natural language processing to characterizethe text in subreddit posts. The motivations for this

Page 2265

Page 3: Unsupervised Content-Based Characterization and Anomaly ... · video games, etc.), making the characterization of this post content a critical component of understanding the community

model are threefold: quantitatively measure subredditsimilarity; monitor temporal changes in subredditactivity; and automatically detect anomalies caused byreal world events. Note that many of the results inthis paper highlight examples that we expect readersmay recognize, however the objective is to automaticallydetect events affecting any community of interest.

3.1. Dataset

The data used in this paper includes over 118 millionposts created by over 8 million unique users onReddit between March 2016 and September 2017,collected from the archive of Reddit made publiclyavailable on Google BigQuery,3 omitting known botsand default subreddits. Bots are computer programswhich automatically generate posts and comments; 988bots (190 active) were identified by manually inspectingabnormally-active users (> 3000 comments per month)and by scraping /r/botWatcher for recently-mentionedbots. Default subreddits (those to which all users areautomatically subscribed upon account creation) arelinked at the top of every Reddit page and are typicallyamong the most active subreddits; 49 default subreddits(as of March 2016) were omitted from this analysis.

3.2. Subreddit Activity Model

The content of a subreddit captures the interests,thoughts, questions, complaints, discussions,arguments, and general musings of the members of thatcommunity. This content consists of user-contributedposts (and subsequent comments) including articles,videos, and links to other websites, as well as freeformtext. By design, posts within each subreddit are typicallyorganized around a predefined theme (sports, politics,video games, etc.), making the characterization of thispost content a critical component of understanding thecommunity and detecting change. While posts maycontain heterogeneous media, each post is accompaniedby a short descriptive title, the text of which is usedexclusively to characterize subreddit content in thiswork.4 Specifically, a subreddit’s content over a finitetime span is represented as a weighted average wordembedding of the corresponding post titles.

Word embeddings are a class of natural languageprocessing techniques in which individual words arerepresented by a vector in a predefined vector space,the dimension of which is significantly smaller than

3https://bigquery.cloud.google.com/dataset/fh-bigquery4Although comments typically significantly outnumber posts on

any given subreddit, preliminary analysis showed that post titles aloneprovided a more robust representation, likely because posts are richerin content and less noisy than comments.

the size of the vocabulary. The term word embeddingswas introduced by [5], but it was the release ofword2vec [25], a toolkit for training and implementingword embeddings, that greatly expanded the use of thetechnique. Here, an embedding known as GloVe (GlobalVectors for Word Representation) was used. The GloVealgorithm was created by [30] and builds on word2vecto explicitly encode meaning as vector offsets. Forexample, models created using GloVe should accuratelyrepresent semantic relationships such as the analogy“king is to queen as man is to woman,” which wouldbe encoded in the vector space by the vector equation:king −man+ woman = queen.

A GloVe word embedding was trained on the corpusof over 118M Reddit posts, which included over 1.2Munique words. (Words that appeared less than 5 timesin the corpus were omitted.) Note that while one mayuse pre-trained word embeddings, training on in-domaindata is desirable to capture the vocabulary, jargon, andsymbols used on Reddit, as well as foreign language use.

The technique of representing a bag of words (suchas a sentence or document) as an average of constituentword embeddings has been demonstrated as a successfuland efficient approach for many applications, such asdetecting semantic similarity of sentences, identifyingplagiarism, classifying sentiment from social media,and others [10, 17, 27, 41]. In this work, aweighted average word embedding (i.e., a weekly5

subreddit content vector) is calculated by averagingthe GloVe embedding vectors for all the words in asubreddit-week, weighted by a term frequency-inversedocument frequency (TF-IDF) value:

wTF−IDF = (1 + log ft,d) · (1 + log (N/nt)) (1)

The term frequency ft,d is given by the number of timesa term t occurs in document d, and the inverse documentfrequency is given by the ratio of the total numberof documents N and nt, the number of documentscontaining the term t. In this work, a document is thecollection of posts from a given subreddit for a givenweek. This TF-IDF formulation assigns more weight towords that are unique and less to words that are usedfrequently across many subreddits.

Anecdotally, the weighted average word embeddingfor subreddit content captures relationships betweensubreddits in a similar way as the raw GloVe embeddingcaptures relationships between words, e.g.:

/r/Patriots− /r/boston + /r/Atlanta = /r/falcons/r/iphone− /r/apple + /r/google = /r/Android

5For the results presented in this paper, a time span of oneweek was used to smooth out day-to-day volatility, particularlyon smaller subreddits with low post activity. Furthermore, manysubreddits exhibit a natural weekly periodicity of posting behavior, forexample due to higher volumes of activity on weekends and repeatingoccurrences such as television shows and sporting events.

Page 2266

Page 4: Unsupervised Content-Based Characterization and Anomaly ... · video games, etc.), making the characterization of this post content a critical component of understanding the community

3.3. Community Similarity

Although shortcomings of word vectors and the useof word analogies for evaluating vector-space semanticmodels have been documented [11, 21], this relativelysimple and inexpensive calculation performed wellin grouping subreddits into topically-similar clusters.To quantify community similarity, the cosine distancebetween the weighted average word embeddings v forsubreddits j and k is used:

dj,k = 1− cos (θ) = 1− vj · vk||vj ||2||vk||2

(2)

The cosine distance of average embeddings is a simpleyet effective metric for calculating semantic similaritybetween documents [27, 41].

Figure 1 illustrates the hierarchical clustering of thecontent embeddings for the 1000 most active subredditson the week of September 9-15, 2017. (This is providedfor visual inspection only; no analysis on communityclusters is subsequently performed.) Here, one cansee large clusters of similar subreddits, as well as thepresence of individual subreddits that do not clusterwell. Most of these outliers are foreign-languagesubreddits (e.g. /r/france in French, /r/newsokuvipin Japanese) or have unique vocabulary or postingconventions (e.g. in /r/ShadowBan nearly all post titlestake the form of “Am I Shadowbanned?”). Figure 2zooms in on a few of these clusters. Figure 2a showsa large cluster of hobby or personal interest subreddits,with subclusters related to food, fashion, nature, pets,and art. A cluster of sports-focused subreddits is shownin Figure 2b, where baseball, football, boxing/wrestling,and “misc” sports subreddits form distinct subclusters.

While these clusters appear reasonable uponqualitative examination, we can quantify the efficacy ofthe subreddit embedding similarity metric by using it topredict whether one subreddit explicitly links to anotherin its community description. For example, at the timeof this writing, the subreddit /r/boston links to threeother Boston-related subreddits: /r/bostontenants,/r/bostonhousing, and /r/BostonJobs.

Table 1 lists the top ten false positives, or the tensubreddit pairs with smallest cosine distance (greatestcosine similarity) that do not link to one another.Upon inspection, all ten pairs are objectively similarsubreddits with respect to content. Table 2 lists the topten false negatives, or the ten subreddit pairs with largestcosine distance that do link to one another. In this case,nearly all pairs consist of two subreddits in differentlanguages. The pair /r/sweden and /r/cringe strangelyappears in this list, but in fact the subreddit /r/swedencontains a link to /r/Pinsamt which is described as

Top 10 False PositivesSubreddit 1 Subreddit 2askmen askwomenanswers nostupidquestionsgametrade steamgameswapcrazyideas nostupidquestionsconservative politiccrappydesign mildlyinfuriatinggrandtheftautov gtavindia indianewsanswers crazyideasoffmychest self

Table 1: Non-linked subreddits with highest similarity.

Top 10 False NegativesSource Targeteurope greecesuomi russianewsokuvip newsokursuomi denmarksweden cringeeurope swedeneurope denmarkeurope romaniasuomi swedeneurope portugal

Table 2: Linked subreddits with lowest similarity.

“Like /r/cringe, only a lot more Swedish.” Figure 3shows the receiver operating characteristic (ROC) curvefor all subreddit pairs over 52 weeks. The average areaunder the ROC curve (AUC) is 0.85.

4. Anomaly Detection

In this section, we leverage the previously-describedcontent model to automatically detect anomalies causedby community disturbances. We consider a disturbanceto be an event that has a real or perceived impact on acommunity’s membership, function, or interests. Themotivation for anomaly detection on Reddit is twofold:to leverage the Internet as a sensor for timely detectionof real-world events; and to generate a rich dataset ofobserved community perturbations to facilitate robustanalysis of disturbance effects.

4.1. Community Change and Event Detection

Unexpected, controversial, or disruptive events oftenresult in a flurry of discussion about a new topic onrelevant affected communities on Reddit. These events

Page 2267

Page 5: Unsupervised Content-Based Characterization and Anomaly ... · video games, etc.), making the characterization of this post content a critical component of understanding the community

2mei

rl4m

eirl

anim

e_irl

me_

irlm

eirl

King

dom

Hear

tsFa

llout

Mod

sCr

usad

erKi

ngs

eu4

AskH

istor

ians

AskS

cienc

eFict

ion

world

build

ing

War

ham

mer

40k civ

tota

lwar

Xcom

Fallo

utsk

yrim

skyr

imm

ods

Min

ecra

ftfe

edth

ebea

stm

inec

rafts

ugge

stio

nsNo

Man

sSky

TheG

ame

bind

ingo

fisaa

cRi

mW

orld fo4

Terra

riaKe

rbal

Spac

ePro

gram

Citie

sSky

lines

fact

orio

cust

omhe

arth

ston

eel

ders

crol

lsleg

ends

hear

thst

onec

ircle

jerk

Town

ofSa

lem

gam

eBr

awlh

alla

Battl

eRite

para

gon

lear

ndot

a2va

ingl

oryg

ame

Com

petit

iveo

verw

atch

Over

watc

hUni

vers

ityhe

roes

ofth

esto

rmPa

ladi

nsSm

iteDn

Ddn

dnex

tPa

thfin

der_

RPG

witc

her

bloo

dbor

neda

rkso

uls

dark

soul

s3AR

Kone

play

ark

elde

rscr

ollso

nlin

ewo

rldof

pvp

Cruc

ible

Play

book

Neve

rwin

ter

wowe

cono

my

War

fram

epa

thof

exile

wow

SWGa

laxy

OfHe

roes

aven

gers

acad

emyg

ame

Cont

estO

fCha

mpi

ons

RotM

GM

aple

stor

yffx

ivbl

ackd

eser

tonl

ine

blad

eand

soul

Guild

wars

2th

ediv

ision

2007

scap

eru

nesc

ape

dead

byda

ylig

htpl

ayru

stba

ttlef

ield

_one tf2

spik

esED

Hm

agicT

CGpt

cgo

yugi

ohDB

ZDok

kanB

attle

Scho

olId

olFe

stiv

alNa

ruto

Blaz

ing

OneP

iece

TCPu

zzle

AndD

rago

nssu

mm

oner

swar

firee

mbl

emM

onst

erHu

nter

gran

dord

erFF

Reco

rdKe

eper

Blea

chBr

aveS

ouls

brav

efro

ntie

rFF

Brav

eExv

ius

Mob

iusF

FGi

ftofG

ames

Rand

om_A

cts_

Of_A

maz

onRa

ndom

Acts

ofCa

rds

shar

ditk

eepi

tCr

ackW

atch vita

Vita

Pira

cyCa

llOfD

uty

CODZ

ombi

esBa

ttlef

ield

halo

Gear

sOfW

arbl

acko

ps3

Infin

itewa

rfare

battl

efie

ld_4

titan

fall

csgo

CoDC

ompe

titiv

est

arcr

aft

Kapp

asm

ashb

ros

FGC

Stre

etFi

ghte

rgt

aonl

ine

Gran

dThe

ftAut

oVGT

AVGa

min

gStu

nts

Lets

Play

Vide

osPr

omot

eGam

ingV

ideo

sYo

uTub

eGam

ers

GetM

oreV

iews

YTvi

deog

ames

roos

terte

eth

Smal

lYTC

hann

elYo

uTub

e_st

artu

psPS

VRoc

ulus

Vive

disc

orda

ppTw

itch

tipof

myj

oyst

ick PS4

xbox

one

Gam

esSt

eam

gam

ings

ugge

stio

nsAn

droi

dGam

ing

pcga

min

g3D

Sni

nten

dopo

kem

onpo

kem

ongo

TheS

ilphR

oad

Rive

nmai

nsCL

Gsu

mm

oner

scho

olle

ague

ofle

gend

sLe

ague

OfVi

deos

Gam

erPa

ls lfgDe

stin

yShe

rpa

fifac

lubs

Mad

denM

obile

Foru

ms

MLB

TheS

how

Mad

denU

ltim

ateT

eam

NHLH

UTFi

faCa

reer

sM

adde

nFI

FANB

A2k

Roya

leRe

crui

tCl

ashO

fCla

nsCl

ashR

oyal

ehe

arth

ston

ePi

ctur

eGam

eDe

stin

yThe

Gam

eOv

erwa

tch

Glob

alOf

fens

ive

DotA

2Ra

inbo

w6Ro

cket

Leag

ueDi

epio

Party

Links

mcs

erve

rsBt

cFor

umBi

tcoi

nCr

ypto

Curre

ncy

ethe

reum bt

cbt

cnew

sfee

dm

ath

Hom

ewor

kHel

pch

eata

tmat

hhom

ewor

kle

arnm

ath

exce

lbl

ende

rga

mem

aker

gam

edev

Unity

3DM

achi

neLe

arni

ngle

arnp

rogr

amm

ing

lear

npyt

hon

java

scrip

tpr

ogra

mm

ing

webd

evDe

sign

Inte

riorD

esig

nfo

rhire

Wor

dpre

ssSE

Owe

b_de

sign

HITs

Wor

thTu

rkin

gFor

Sam

pleS

izeed

ucat

ion

Engi

neer

ingS

tude

nts

Teac

hers

Appl

ying

ToCo

llege

colle

geAc

coun

ting

csca

reer

ques

tions

jobs

deal

sla

wle

gala

dvice

busin

ess

SAte

chne

wste

chno

logy

finan

ceRe

alEs

tate

build

apc

pcm

aste

rrace

build

apcf

orm

eSu

gges

tALa

ptop

Amd

nvid

iabu

ildap

csal

esM

onito

rsja

ilbre

akap

ple

ipho

neTh

eCol

orIsO

rang

eAp

pleW

atch

tmob

ilePl

eXra

spbe

rry_p

iho

mel

ablin

uxsy

sadm

inco

mpu

ters

softw

are

Surfa

cete

chsu

ppor

tW

indo

ws10

Tech

nolo

gy_

andr

oida

pps

goog

leap

pleh

elp

Gala

xyS7

wind

owsp

hone

onep

lus

Goog

lePi

xel

Nexu

s6P

Pick

AnAn

droi

dFor

Me

Andr

oid

Andr

oidQ

uest

ions

Cele

bsge

ntle

man

bone

rsW

rest

leW

ithTh

ePlo

tcu

mslu

tsNS

FW_G

IFns

fwha

rdco

rego

newi

ldcu

rvy

Gone

wild

Plug

Amat

eur

Real

Girls milf ass

Bust

yPet

iteLe

galC

olle

geGi

rlsns

fwre

ddtu

beho

twife

_cuc

kold

hous

tonr

4rxr

aySe

xsel

lsus

edpa

ntie

spo

rnID

tipof

myp

enis

gone

wild

audi

oSi

ssie

sla

dybo

ners

gwM

assiv

eCoc

kpe

nis

gone

wild

Gone

wild

s_Se

cret

wife

shar

ing

rela

tions

hip_

advi

cere

latio

nshi

psam

iugl

yRa

tem

eNo

Fap

askt

rans

gend

er sex

datin

g_ad

vice

AskM

enAs

kWom

enra

isedb

ynar

cissis

tsde

pres

sion

teen

ager

sof

fmyc

hest

Suici

deW

atch

Advi

ceCa

sual

Conv

ersa

tion

Five

Year

sAgo

OnRe

ddit

know

your

shit

Shitt

yLife

ProT

ips

Phot

osho

pReq

uest

trees

AskO

uija

what

isthi

sthi

ngHO

Tand

TREN

DING

NoSt

upid

Ques

tions

BigB

roth

ercir

cleje

rkAM

Aca

sual

iam

adr

unk

stop

drin

king

opia

tes

leav

esst

opsm

okin

gPu

rple

PillD

ebat

eIn

cels

MGT

OWJU

STNO

MIL

Baby

Bum

psbe

yond

theb

ump

child

free

Pare

ntin

gpr

oED

ExNo

Cont

act

conf

essio

nTr

ollX

Chro

mos

omes ftm

brea

king

mom

Fore

verA

lone self

Tind

erac

tual

lesb

ians

askg

aybr

osOk

Cupi

das

ktrp

sedu

ctio

nRo

ast_

Me

Roas

tMe

4cha

nJo

nTro

nre

actio

ngifs

Nude

lete

dank

mem

esm

emes

mad

lads

best

ofia

mve

rysm

art

OutO

fThe

Loop

Crin

geAn

arch

yra

ntho

ward

ster

nop

iean

dant

hony

h3h3

prod

uctio

nsPl

anet

Dola

npy

rocy

nica

lra

repu

pper

syo

utub

ehai

kube

auty

Dent

istry

Heal

thhe

alth

care

Med

itatio

nAD

HDAn

xiet

yAs

kDoc

spt

sdbi

pola

rm

enta

lhea

lthkr

atom

Drug

sLS

DSk

inca

reAd

dict

ion

keto

Fitn

ess

lose

itas

mr

nove

ltran

slatio

nspo

litics

The_

Dona

ldIce

_Pos

eido

nJo

noBo

ndTw

itter

ricka

ndm

orty

asoi

afga

meo

fthro

nes

tipof

myt

ongu

etra

nsla

tor

anal

ogM

yWal

lpap

erCl

ubito

okap

ictur

eLa

rgeI

mag

eswa

llpap

erwa

llpap

ers

fake

idDa

nkNa

tion

Dark

NetM

arke

tsDa

rkNe

tMar

kets

Noot

ropi

csph

arm

acyr

evie

wssa

ndbo

xtes

tGa

laxy

Note

7el

ectro

nics

Prod

uctT

estin

gSo

urce

ryM

arke

tgr

ams

DNM

Supe

rlist

Dark

NetM

arke

tsNo

obs

Drea

m_M

arke

tDa

rkNe

tDea

lsAl

phaB

ayM

arke

tDr

eam

Mar

ket

Drea

mM

arke

tDar

knet

Dark

net_

Mar

kets

Cash

orTr

ade

face

valu

etick

ets

airs

oftm

arke

tTo

chka

Free

Mar

ket

astro

logy

clean

indi

aPi

nayS

cand

alru

ssia

Turk

eyGa

ryJo

hnso

nHu

rrica

neM

atth

ewne

wsg

Daisy

Ridl

eyAm

azon

_Giv

eawa

ysIm

ages

ifyou

likeb

lank

mak

eupe

xcha

nge

Guita

rgu

itarp

edal

she

adph

ones

synt

hesiz

ers

mak

ingh

ipho

ped

mpr

oduc

tion

WeA

reTh

eMus

icMak

ers

vapo

rent

sel

ectro

nic_

cigar

ette

Vapi

ngGu

npla

Nerf

airs

oft

ar15

guns

Wat

ches

foun

tain

pens

knife

club

Coffe

eHo

meb

rewi

ngwi

cked

_edg

eAs

ianB

eaut

ycig

ars

Silv

erbu

gsM

echa

nica

lKey

boar

ds3D

prin

ting

Hom

eIm

prov

emen

two

odwo

rkin

gBa

king

Food

Porn

vega

nCo

okin

gsh

ittyf

oodp

orn

supr

emec

loth

ing

Snea

kers

fash

ion

mal

efas

hion

advi

cest

reet

wear

Aqua

rium

sFi

shin

gsp

ider

swh

atst

hisb

ugm

ycol

ogy

shro

oms

what

sthi

spla

ntsu

ccul

ents

gard

enin

gm

icrog

rowe

rydo

gsPe

tsca

tsdo

gpict

ures

Rabb

itsta

ttoos

draw

ing

redd

itget

sdra

wncr

oche

tRe

dditL

aque

rista

sha

llowe

enwe

ddin

gpla

nnin

gM

akeu

pAdd

ictio

nbe

ards

mal

ehai

radv

iceRW

BYm

ylitt

lepo

nyzo

otop

iaaw

wnim

eRe

_Zer

oIn

ktob

erPi

xelA

rtco

spla

yia

teac

rayo

nru

le34

fiven

ight

satfr

eddy

sSt

ardu

stCr

usad

ers

gam

egru

mps

Unde

rtale

Defe

nder

sM

arve

lm

arve

lstud

ios

whow

ould

win

DC_C

inem

atic

com

icboo

ksDC

com

icsfu

nkop

opAc

tionF

igur

esle

gost

artre

kSt

arW

ars

Star

War

sBat

tlefro

ntm

anga

hom

estu

ckNa

ruto

arro

wFl

ashT

Vda

ngan

ronp

adb

zst

even

univ

erse

anim

eOn

ePie

cetra

nce

Hous

eTe

chno

Vapo

rwav

eTh

isIsO

urM

usic

EDM

elec

troni

cmus

icM

onst

erca

tfu

ture

beat

stra

phi

phop

head

sin

dieh

eads

poph

eads

Fran

kOce

anKa

nye

Nam

eTha

tSon

gsp

otify

csgo

gam

blin

gpo

ker

Eve

War

thun

der

Wor

ldof

Tank

sW

orld

OfW

arsh

ips

sto

XWin

gTM

GEl

iteDa

nger

ous

star

citize

nka

hoot

Inst

agra

mPi

racy

chur

ning

avia

tion

flyin

gst

arbu

cks

walm

art

AirF

orce

uwat

erlo

oos

ugam

eAp

ocal

ypse

Risin

gTo

onto

wn2b

2tun

turn

edDi

epio

Maf

ia3

PvZG

arde

nWar

fare

swto

rpa

yday

theh

eist

Plan

etsid

eHx

HBat

tleAl

lSta

rsM

SLGa

me

cust

omm

agic

duel

yst

AceA

ttorn

eyGi

IvaS

unne

rTa

leso

fLin

kno

man

shig

hSt

arde

wVal

ley

tapp

edou

tyo

kaiw

atch

met

alge

arso

lidM

orta

lKom

bat

drun

kenp

easa

nts

Scen

esFr

omAH

atcr

inge

that

Happ

ened

week

endg

unni

tco

pypa

sta

im14

andt

hisis

deep

noco

ntex

tBl

ackP

eopl

eTwi

tter

dadj

okes

ImGo

ingT

oHel

lFor

This

Fello

wKid

stra

shy

Shitt

y_Ca

r_M

ods

briti

shpr

oble

ms

Crap

pyDe

sign

mild

lyin

furia

ting

quot

esRo

orh

shitt

yask

scie

nce

answ

ers

Craz

yIde

asW

TFm

asha

ble

fark

save

dyou

aclic

kco

mics

webc

omics

furry

MLP

Loun

geHu

ntin

gna

ture

ismet

alHa

ram

bepe

rfect

loop

sBa

ndna

mes

oddl

ysat

isfyi

ngin

tere

stin

gasf

uck

woah

dude

notin

tere

stin

gDe

epIn

toYo

uTub

eGi

fSou

ndUn

expe

cted

gofu

ndm

efin

dare

ddit

Help

MeF

ind

phot

ocrit

ique

Film

mak

ers

phot

ogra

phy

unkn

ownv

ideo

sNe

wTub

ers

yout

ube

anim

atio

nho

rror

ente

rtain

men

tco

med

yfu

nnyv

ideo

sM

etal

kpop

Best

OfSt

ream

ingV

ideo

boni

ver

Met

alco

reM

usict

hem

etim

era

dioh

ead

gree

nday

viny

lru

paul

sdra

grac

esu

rviv

orvl

ogW

altD

isney

Wor

ldAm

erica

nHor

rorS

tory

Dund

erM

ifflin

sout

hpar

kAn

imal

Cros

sing

nost

algi

aTh

eSim

pson

sGa

min

gcirc

leje

rkTw

oBes

tFrie

ndsP

lay

NYCC

MrR

obot

west

world

netfl

ixSt

rang

erTh

ings

harry

potte

rwr

iting

Fant

asy

sugg

estm

eabo

okFi

nalF

anta

syki

ckst

arte

rbo

ardg

ames rpg

form

ula1

Mec

hani

cAdv

ice Jeep

suba

ruRo

adca

mfo

rza

NASC

ARte

slam

otor

sBM

WAu

tos

cars

bcsn

ews

Aust

inAt

lant

aho

usto

nPo

rtlan

dSe

attle

WA

LosA

ngel

esch

icago ny

cCa

lgar

yot

tawa

toro

nto

vanc

ouve

rlo

ndon

mel

bour

ne bjj

body

build

ing

gopr

oM

ultic

opte

rdi

scgo

lfgo

lfru

nnin

gbi

cycli

ngm

otor

cycle

sTu

berS

imHa

lifax

News

prom

ote

Busin

essA

dvice

Team

Info

grap

hics

mar

ketin

gAd

vice

Anim

als

Entre

pren

eur

OhGi

rlIam

InTr

oubl

ePr

intin

gSho

pte

chEn

gadg

etTe

chfe

edar

chite

ctur

eHo

meD

ecor

atin

gSo

cial_P

sych

olog

yEc

onom

icsec

onom

yin

vest

ing

walls

treet

bets

Post

Wor

ldPo

wers

envi

ronm

ent

citra

lEv

eryt

hing

Scie

nce

Philip

pine

sha

rdrig

htTh

e_Eu

rope

Islam

Unve

iled

Chris

tians

Awak

e2NW

OW

hite

Righ

tspr

ogun

Prot

ectA

ndSe

rve

singa

pore

sout

hafri

cam

iscof

fbea

teu

rope

anna

tiona

lism

jillst

ein

colla

pse

nytim

esAn

ythi

ngGo

esNe

wsin

then

ews

pale

olib

erta

rian

Anar

cho_

Capi

talis

mso

cialis

mm

etac

anad

aLa

teSt

ageC

apita

lism

Liber

taria

nPo

litica

lDisc

ussio

nAs

kThe

_Don

ald

AskT

rum

pSup

porte

rsPo

litica

lHum

orPO

LITI

Cne

w_rig

htCo

nser

vativ

esOn

lyPo

litica

lVid

eoFU

LLCO

MM

UNIS

MOb

ject

ivism lg

btM

ensR

ight

sex

jwex

mus

limAg

ains

tHat

eSub

redd

itsDr

ama

Tum

blrIn

Actio

nKo

taku

InAc

tion

Sarg

onof

Akka

dwo

rldGl

ance

islam

paki

stan

bakc

hodi

indi

anew

sfo

rwar

dsfro

mgr

andm

ach

ange

myv

iew

exm

orm

onCa

thol

icism

athe

ismCh

ristia

nity

hilla

rycli

nton

Hilla

ryFo

rPris

onW

ayOf

TheB

ern

Polit

ical_R

evol

utio

nde

moc

rats

Koss

acks

_for

_San

ders

Clim

ateC

haos

indi

atra

vel

Cons

erva

tive

Enou

ghTr

umpS

pam

unce

nsor

edne

wswo

rldpo

litics

cons

pira

cyTh

eCol

orIsR

edun

rem

ovab

leeu

rope

ukpo

litics

unite

dkin

gdom

bost

onca

nada

Cana

daPo

litics

irela

ndau

stra

liane

wzea

land

Guns

AreC

ool

Brea

king

News

24hr

TheC

olor

IsBlu

eCC

TVCa

mer

asUK

NoFi

lterN

ews

nprs

torie

sPo

litica

lVid

eos

russ

iawa

rinuk

rain

esy

rianc

ivilw

arGe

osim

vexi

llolo

gyLu

cky3

03Fa

ntas

y_Fo

otba

llfo

otba

llFa

ntas

yPL

socc

erLiv

erpo

olFC

Gunn

ers

redd

evils

WW

EGam

esSC

Jerk

Squa

redC

ircle

WW

EBo

xing

MM

ATo

ront

oblu

ejay

sor

iole

sre

dsox

SFGi

ants

base

ball

CHIC

ubs

NewY

orkM

ets

Crick

etfa

ntas

ybba

llfa

ntas

yhoc

key

hock

ey MLS

rugb

yuni

on AFL nrl

spor

ts_u

ndel

ete

tenn

isea

gles

49er

sm

inne

sota

viki

ngs

Patri

ots

buffa

lobi

llsco

wboy

sDe

nver

Bron

cos

oakl

andr

aide

rsCh

arge

rsfa

lcons nba

warri

ors

CFB

fant

asyf

ootb

all

nfl

dark

netm

arke

tde

Denm

ark

swed

enth

enet

herla

nds

franc

eRo

man

iaca

talu

nya

italy

bras

ilpo

rtuga

lpo

dem

osar

gent

ina

mex

icone

wsok

uvip

news

okun

omor

alne

wsok

urSu

omi

gree

cepo

litot

aal

inity

kpics

Shad

owBa

nRo

cket

Leag

ueEx

chan

gecs

gotra

deGl

obal

Offe

nsiv

eTra

deDe

stin

yLFG

Fire

team

sAp

pNan

afri

ends

afar

ipo

kem

ontra

des

Casu

alPo

kem

onTr

ades

Poke

mon

give

away

ACTr

ade

beer

trade

DBZD

okka

nMar

ketp

lace

mec

hmar

ket

Gam

eSal

eec

igcla

ssifi

eds

funk

oswa

pgi

ftcar

dexc

hang

eha

rdwa

resw

appu

mpa

rum

Dota

2Tra

deGa

meT

rade

indi

egam

eswa

pSt

eam

Gam

eSwa

pSt

arcit

izen_

trade

ssn

eake

rmar

ket

Fash

ionR

eps

Reps

neak

ers

mad

denm

obile

buys

ell

Stea

mAc

coun

tsFo

rSal

ere

mov

albo

tFr

eeSt

uffN

YCbo

rrow

give

away

ssw

eeps

take

sUH

CMat

ches

Recr

uitC

SRo

cket

Leag

ueFr

iend

sOv

erwa

tchL

FTTe

amRe

dditT

eam

sFF

XIVR

ECRU

ITM

ENT

wowg

uild

sTh

eVoi

ceOf

Sere

nAd

vent

ures

OfFr

iend

scla

ncar

niva

lth

ebrit

ishel

ites

Rand

omAc

tsOf

Blow

Job

Rand

omAc

tsOf

Muf

fDiv

esn

apch

atdi

rtyr4

rKi

kpal

sr4

rFu

rryKi

kPal

sex

xxch

ange

NSFW

_KIK

dirty

kikp

als

Dirty

Snap

chat

Role

play

kik

Agep

layP

enPa

lsdi

rtype

npal

sal

le15

min

uten

ever

y15m

insh

itpos

tbot

5000

hobbiespersonal electronics sports

] ] ]

cities

]

games

]

Figure 1: Hierarchical clustering of subreddit embeddings using cosine distance.

(a) Hobbies and Interests Subreddits

Baki

ngFo

odPo

rnve

gan

Cook

ing

shitt

yfoo

dpor

nsu

prem

eclo

thin

gSn

eake

rsfa

shio

nm

alef

ashi

onad

vice

stre

etwe

arAq

uariu

ms

Fish

ing

spid

ers

what

sthi

sbug

myc

olog

ysh

room

swh

atst

hisp

lant

succ

ulen

tsga

rden

ing

micr

ogro

wery

dogs

Pets

cats

dogp

ictur

esRa

bbits

tatto

osdr

awin

gre

dditg

etsd

rawn

croc

het

Redd

itLaq

ueris

tas

hallo

ween

wedd

ingp

lann

ing

Mak

eupA

ddict

ion

bear

dsm

aleh

aira

dvice

(b) Sports Subreddits

WW

EGam

esSC

Jerk

Squa

redC

ircle

WW

EBo

xing

MM

ATo

ront

oblu

ejay

sor

iole

sre

dsox

SFGi

ants

base

ball

CHIC

ubs

NewY

orkM

ets

Crick

etfa

ntas

ybba

llfa

ntas

yhoc

key

hock

ey MLS

rugb

yuni

on AFL nrl

spor

ts_u

ndel

ete

tenn

isea

gles

49er

sm

inne

sota

viki

ngs

Patri

ots

buffa

lobi

llsco

wboy

sDe

nver

Bron

cos

oakl

andr

aide

rsCh

arge

rsfa

lcons nba

warri

ors

CFB

fant

asyf

ootb

all

nfl

Figure 2: Closeups of two clusters in the subreddit embedding dendrogram from Fig. 1

Figure 3: ROC for cosine similarity predicting explicitlinks between subreddits, averaged over 52 weeks.

may therefore be detectable by anomalous content,anomalous activity volume, or both.

Just as the cosine distance of weighted average wordembeddings was used to measure similarity betweensubreddits (Eqn. 2), this metric can also quantifycontent change from time tk−1 to time tk for asingle subreddit. The amount of change considered“anomalous” for any particular subreddit will dependon its expected (baseline) activity. The content modeltakes a weighted average over all words in the posttitles; popular, broad-topic subreddits will tend tohave more stable content week-to-week than smaller,niche subreddits. To model the typical change for asubreddit, the observed week-to-week cosine distancescan be approximated by a beta distribution6 with supportfrom 0 (no change, i.e., vk = vk−1) to 1 (whenvk = −vk−1). By fitting the expected contentchange distribution, one can estimate the probability ofobserving a week-to-week change greater than or equalto dk by evaluating P (d ≥ dk) = 1− F (dk), whereF is the distribution’s cumulative distribution function(CDF). In the examples presented, a threshold of 0.02is used to detect anomalies; this threshold is a tuningparameter and corresponds to content changes that are< 2% likely to occur in the data.

6This model was used due to preliminary analysis indicatingthat week-to-week cosine distances for most subreddits could bewell-approximated by a beta distribution.

Page 2268

Page 6: Unsupervised Content-Based Characterization and Anomaly ... · video games, etc.), making the characterization of this post content a critical component of understanding the community

Calculating a subreddit content vector for anyarbitrary time span is computationally inexpensive(the specific complexity depends on the scope ofthe IDF term in Eqn. 1), therefore content changecan be monitored at multiple levels of granularity(day-to-day, week-to-week, etc.). Similarly, theexpected change distribution and anomaly calculationis trivially inexpensive to calculate and may be done inan online or rolling fashion. However in practice, oneneeds enough observations (in terms of total numberof time intervals) to sufficiently estimate the contentchange distribution. In the examples given in this paperthe distribution is assumed to be static over all time; insome cases this may not accurately reflect permanent orseasonal changes to subreddit content.

4.2. Anomaly Detection Exemplars

In this section, content models are used to detectanomalies within two distinct community groups.The first case study examines several football-relatedsubreddits to illustrate the efficacy of the presentedtechniques on an example with which the reader maybe familiar. The second looks at subreddits dedicatedto various dark net markets (primarily illicit marketswhich operate via anonymous communication networkssuch as Tor or I2P) and the anomalies detected onReddit corresponding to significant disruptions to theseanonymous, encrypted websites.

4.2.1. Football. Professional football is consideredthe most popular American sport, so it is not surprisingthat a significant amount of football-related content canbe found on Reddit. Discussion about the NationalFootball League (NFL) can be found on /r/nfl; thoseinterested in NCAA college football can head to /r/CFB.Subreddits dedicated to individual teams abound, e.g./r/LSUFootball, /r/ClemsonTigers and /r/Patriots.

Figure 4 plots the week-to-week content changeand post volume for three football-related communities.Detected anomalies reliably correspond to eventssuch as the Super Bowl, draft dates, and dramaticor surprising incidents (anomalies were inspectedmanually and annotations provided for the reader’sconvenience). We note several observations:

• Events do not affect all communities, evenclosely-related ones, in the same way. Although allthree subreddits are close in the content embeddingspace (see Fig. 2b), they exhibit anomalous behaviorin response to different events specific to theirdomain. For example, National Signing Day (Feb.1, 2017) refers to the day new recruits can committo their future schools and is coincident with the

largest content-based anomaly on /r/CFB, likelycaused by users discussing the incoming class of newpromising football stars. This event, however, causesno observable change in either /r/nfl or /r/Patriots.

• Events detected due to anomalous content do notalways correlate with spikes in activity volume,and vice-versa. Content-based anomalies oftencorrespond to a surprising or dramatic occurrence,causing the existing community to discuss a newtopic. Changes in activity volume correspondto events which would affect the community’spopularity (either increase or decrease), such as thestart of a new season. Interestingly, Super Bowl LI(Feb. 5, 2017) caused a dramatic increase in postactivity on /r/Patriots but not a significant change incontent, likely because the Super Bowl temporarilyattracted more people to the subreddit, but the topicsof discussion remained largely the same.

• While the framework described in this paper cannotexplain the exact cause of an anomaly, word cloudscan often provide quick insight into the underlyingevent(s). For example, word clouds corresponding tothe three content anomalies observed on /r/Patriotsare plotted and described in Figure 5.

• Even for subreddits that correspond to a publicand well-defined community such as /r/Patriots, theevents detected are not always present in other formsof traditional media. For example, 342 news articlescontaining the word “patriots” were published inthe sports section of the New York Times betweenMarch 2016-September 2017. Most articles werenot relevant to the Patriots football team, and onlya small number (25 articles over nearly 19 months)correspond to articles covering non-game relatednews. These non-game articles represent the typeof news stories which would be most likely toshift the content of /r/Patriots away from typical,game-related discussions, but only 1 out of 25 (“F.B.I.Recovers Tom Brady’s Missing Super Bowl Jerseysin Mexico”) aligns with the events detected byour method. Other representative non-game newsarticles include headlines such as “Aaron Hernandez’sMurder Conviction Is Nullified”, “Six New EnglandPatriots Say They Will Skip a White House Visit”,and “Babe Parilli Dies at 87”. The events detected onReddit and articles published by the New York Timesmay point to differences between topics deemedrelevant by self-identified community members andthose recognized by an external editorial staff,illustrating one benefit of using social media asa sensor for detecting community disturbances inaddition to more traditional sources.

Page 2269

Page 7: Unsupervised Content-Based Characterization and Anomaly ... · video games, etc.), making the characterization of this post content a critical component of understanding the community

0.000

0.005

0.010

0.015

0.020

0.025

0.030

0.035

0.040CFBnflPatriots

0

500

1000

1500

2000

2500

3000

Cos

ine

Dis

tanc

eN

umbe

r of P

osts

CFBnflPatriots

2016-042016-06

2016-082016-10

2016-122017-02

2017-042017-06

2017-08

Patriots trade Jamie Collins to Cleveland Browns (Oct. 31, 2016)

Donald Trump claims endorsement by Patriots (Nov. 7, 2016)

National Signing Day (Feb. 1, 2017) Tom Brady's stolen Super Bowl

jersey found (Mar. 20, 2017)

NFL Draft (Apr. 27-29, 2017)Super Bowl LI

(Feb. 5, 2017)

o*o*

o*

o*

o* o*

Figure 4: (top) Week-to-week change for three football-related subreddits (more positive indicates greater change).Detected anomalies are indicated by red dots and annotated with relevant events that coincide with the anomalies;(bottom) Subreddit post volume is plotted for comparison.

(a) Week of Oct. 31–Nov. 6, 2016; OnOct. 31, 2016, the Patriots unexpectedlytraded top linebacker Jamie Collins to theCleveland Browns [20].

(b) Week of Nov. 7–13, 2016; On Nov.7, 2016, Donald Trump gave a speechclaiming Patriots coach Bill Belichick andquarterback Tom Brady supported him forpresident [9].

(c) Week of March 20-26, 2017; On March20, 2016, Tom Brady’s stolen Super Bowljersey was reported to have been found [1].

Figure 5: Word clouds corresponding to detected events on /r/Patriots. Word sizes are proportional to frequency [26].

4.2.2. Dark Net Markets. In July 2017, AlphaBay,an online dark net marketplace, was taken down by lawenforcement authorities. With over 240,000 members,AlphaBay was the largest market of its kind andreportedly hosted $600,000 to $800,000 in transactionsdaily. Its takedown, along with the coordinated seizureof Hansa (the second largest dark net market) two weekslater, scattered vendors and buyers to other dark netmarkets and caused significant disruption to dark netmarket communities both on and off Reddit.

Prior to its shutdown, AlphaBay users andadministrators used the subreddit /r/AlphaBayMarketto post reviews and updates on site operations. After

its final outage on July 4, 2017, Reddit users speculatedthat the outage was an exit scam7 and exchangedconspiracy theories on /r/AlphaBayMarket and on/r/DarkNetMarkets. Federal authorities did notconfirm their involvement in the takedown until July20, and new users flooded to /r/AlphaBayMarket inthe intervening weeks as vendors and buyers tried todetermine status of the network.

Vendors and buyers also sought an alternative darknet market on which to continue their transactions. Newlistings tripled on Hansa and DreamMarket, another

7In 2015, administrators of Evolution, another major dark netmarket, shut down the site and took millions of dollars of users’ moneyin what was known as an exit scam.

Page 2270

Page 8: Unsupervised Content-Based Characterization and Anomaly ... · video games, etc.), making the characterization of this post content a critical component of understanding the community

0.02

0.04

0.06

0.08

0.10

0.12

0.14C

osin

e D

ista

nce

2016-042016-06

2016-082016-10

2016-122017-02

2017-042017-06

2017-080

200

400

600

800

1000

Num

ber o

f Pos

ts

0.002

0.004

0.006

0.008

0.010AlphaBayMarketDarkNetMarkets

AlphaBayMarketDarkNetMarkets

*o *o *o *oDisclosure of major vulnerabilities on AlphaBay and Hansa markets

(Jan. 23, 2017)

AlphaBay final outage (July 4, 2017)

Hansa shut down & Law Enforcement involvement confirmed (July 20, 2017)

AlphaBay final outage (July 4, 2017)

Hansa shut down & Law Enforcement involvement confirmed (July 20, 2017)

Abandoned subreddit exhibits new baseline o* o*

Figure 6: (top) Week-to-week change for two largest Dark Web market subreddits (more positive indicates greaterchange); (bottom) Subreddit post volume is plotted for comparison.

(a) Week of June 26–July 2, 2017; Typicaldiscussion of marketplace activity, withlarge emphasis on AlphaBay

(b) Week of July 3–July 9, 2017; Redditorsdiscuss the outage of AlphaBay (July4, 2017) and debate alternative markets,Hansa and Dream Market.

(c) Week of July 17–July 23, 2017;Hansa marketplace is shut down (July 20,2017), and discussion of AlphaBay hassignificantly decreased.

Figure 7: Word Clouds for subreddit /r/DarkNetMarkets before, during, and after the AlphaBay and Hansa dark netmarkets were shut down in July 2017.

dark net market competitor, when AlphaBay went down[38]. Once it became clear that Hansa was also shutdown, DreamMarket became the largest dark net market,although others, such as Trade Route, Tochka, and WallStreet Market, saw similar boosts in traffic [19].

Figure 6 (top) shows anomalous change in/r/DarkNetMarkets content in the weeks of July 3(when AlphaBay had its final outage) and July 17 (whenthe federal takedown of AlphaBay and Hansa wasformally announced). Word clouds give a sense of howthe content of posts on /r/DarkNetMarkets shiftedfrom from AlphaBay to Hansa to DreamMarket as newsof the takedown unfolds (see Figure 7). This anomalouscontent change detected in /r/AlphaBayMarket may

be attributed to a significant reduction in posts afterAugust 2017 (/r/AlphaBayMarket had an average of92 posts per week from Jan.-June 2017 and an averageof 35 posts per week from Aug.-Sep. 2017).

Figure 6 (bottom) shows a large spike in postactivity on /r/AlphaBayMarket on the week ofJuly 3, reflecting the large increase in posts on/r/AlphaBayMarket in the confusion following thesite’s outage (total posts on /r/AlphaBayMarketjumped from less than 80 posts on the week of June26 to over 950 on the week of July 3). A similar spikeon /r/DarkNetMarkets for the week of July 17 likelyreflects the increase in posts on /r/DarkNetMarketsregarding the announced government takedown of both

Page 2271

Page 9: Unsupervised Content-Based Characterization and Anomaly ... · video games, etc.), making the characterization of this post content a critical component of understanding the community

AlphaBay and Hansa; we speculate that posts movedfrom /r/AlphaBayMarket to /r/DarkNetMarkets as itbecame clear that AlphaBay was completely defunct(total posts on /r/DarkNetMarkets jumped from 570posts on the week of July 10 to over 940 on the weekof July 17, whereas total posts for the same weeks on/r/AlphaBayMarket decreased from ∼720 to ∼380).

The results presented here represent two specificcommunity groups, however these analyses demonstratethe ability to quantitatively characterize subredditcommunities and automatically detect disturbances in awide range of communities in an unsupervised manner.Moreover, these disturbances appear to correspondto events that are unexpected and noteworthy to themembers of the community itself, rather than thosehighlighted by traditional external news sources.

5. Future Work

The methods we have presented show promisein detecting changes to both common and nichecommunities of interest, a challenge relevant to manyacademic and commercial fields. It is expected thatmost of the techniques applied here to Reddit canbe generalized to other data sources such as Twitter,although additional work would be required to definecommunity membership and analyze shorter and lessstructured messages. Verification and analysis of thedifferences between Reddit and Twitter (and othersources) are subjects for future research.

The analysis and explanation of the causes ofdetected changes in social media is also an area ofinterest, and the overall efficacy of these approacheslikely depends on the ability to correlate andcontextualize anomaly detections with auxiliary sourcesof information. Future work includes automaticallygenerating an interpretation of detected anomalies,detecting events by correlating weak changes acrossmultiple communities, and studying the long term inter-and intra-community effects caused by detected events.A more rigorous utility study must also be performedcorrelating detected anomalies with real-world events inorder to quantify the robustness and reliability of theevent detection across a range of communities.

References

[1] David Agren. “Tom Brady’s stolen Super Bowl jerseyfound in Mexico”. In: The Guardian (Mar. 2017).

[2] Alexa. https://www.alexa.com/siteinfo/reddit.com.[3] Eiji Aramaki, Sachiko Maskawa, and Mizuki Morita.

“Twitter catches the flu: detecting influenza epidemicsusing Twitter”. In: Conference on Empirical Methodsin Natural Language Processing. 2011, pp. 1568–1576.

[4] Ahmer Arif et al. “How Information Snowballs:Exploring the Role of Exposure in OnlineRumor Propagation”. In: ACM Conference onComputer-Supported Cooperative Work & SocialComputing. 2016, pp. 466–477.

[5] Yoshua Bengio et al. “A neural probabilistic languagemodel”. In: Journal of Machine Learning Research 3(Feb. 2003), pp. 1137–1155.

[6] James Benhardus and Jugal Kalita. “Streaming trenddetection in Twitter”. In: Intl. Journal of Web BasedCommunities 9.1 (2013), pp. 122–139.

[7] Deepayan Chakrabarti and Kunal Punera. “EventSummarization Using Tweets”. In: ICWSM 11 (2011),pp. 66–73.

[8] Andrew Crooks et al. “#Earthquake: Twitter as adistributed sensor system”. In: Transactions in GIS 17.1(2013), pp. 124–147.

[9] Jeremy Diamond. “Trump says Patriots’ Tom Brady,Bill Belichick supporting him”. In: CNN (Nov. 2016).

[10] Jeremy Ferrero et al. “Using Word Embeddingfor Cross-Language Plagiarism Detection”. In:15th Conference of the European Chapter of theAssociation for Computational Linguistics. Vol. 2.2017, pp. 415–421.

[11] Gregory Finley, Stephanie Farmer, and SergueiPakhomov. “What Analogies Reveal about WordVectors and their Compositionality”. In: 6th JointConference on Lexical and Computational Semantics.2017, pp. 1–11.

[12] Santo Fortunato and Darko Hric. “Communitydetection in networks: A user guide”. In: PhysicsReports 659 (2016), pp. 1–44.

[13] Reilly Grant et al. “Discovery of Informal Topicsfrom Post Traumatic Stress Disorder Forums”. In: Intl.Conference on Data Mining Workshops. IEEE. 2017,pp. 452–461.

[14] William L Hamilton et al. “Inducing Domain-SpecificSentiment Lexicons from Unlabeled Corpora”. In: 2016Conference on Empirical Methods in Natural LanguageProcessing. 2016, pp. 595–605.

[15] Jack Hessel, Chenhao Tan, and Lillian Lee. “Science,AskScience, and BadScience: On the Coexistenceof Highly Related Communities.” In: ICWSM. 2016,pp. 171–180.

[16] Jack Hessel et al. “What do Vegans do in their SpareTime? Latent Interest Detection in Multi-CommunityNetworks”. In: arXiv:1511.03371 (2015).

[17] Edilson Anselmo Corrza Jsnior, Vanessa QueirozMarinho, and Leandro Borges dos Santos. “NILC-USPat SemEval-2017 Task 4: A Multi-view Ensemble forTwitter Sentiment Analysis”. In: Proc. of the 11th Intl.Workshop on Semantic Evaluation. 2017, pp. 611–615.

[18] Ravneet Kaur and Sarbjeet Singh. “A survey of datamining and social network analysis based anomalydetection techniques”. In: Egyptian Informatics Journal17.2 (2016), pp. 199–216.

[19] Leo Kelion. “Dark web markets boom after AlphaBayand Hansa busts”. In: BBC (Aug. 2017).

[20] Emmett Knowlton. “The Patriots shocked the NFLworld by trading one of their best defenders to the worstteam in the NFL”. In: Business Insider (Oct. 2016).

[21] Tal Linzen. “Issues in evaluating semantic spaces usingword analogies”. In: (2016), pp. 13–18.

[22] Steffen Lyngbaek. “SPORK: A SummarizationPipeline for Online Repositories of Knowledge”.MA thesis. CA Polytechnic State University, 2013.

Page 2272

Page 10: Unsupervised Content-Based Characterization and Anomaly ... · video games, etc.), making the characterization of this post content a critical component of understanding the community

[23] Carlos Martin, David Corney, and Ayse Goker. “Miningnewsworthy topics from social media”. In: Advances inSocial Media Analysis. Springer, 2015, pp. 21–43.

[24] Sara Melvin et al. “Event Detection and SummarizationUsing Phrase Network”. In: Joint European Conferenceon Machine Learning and Knowledge Discovery inDatabases. Springer. 2017, pp. 89–101.

[25] Tomas Mikolov et al. “Efficient estimation of wordrepresentations in vector space”. In: arXiv:1301.3781(2013).

[26] Andreas Mueller et al. word cloud: 1.2.1.https://doi.org/10.5281/zenodo.49907. Apr. 2016.

[27] El Moatez Billah Nagoudi, Jeremy Ferrero, andDidier Schwab. “LIM-LIG at SemEval-2017 Task1: Enhancing the Semantic Similarity for ArabicSentences with Vectors Weighting”. In: Intl. Workshopon Semantic Evaluations. 2017, pp. 125–129.

[28] Jeffrey Nichols, Jalal Mahmud, and Clemens Drews.“Summarizing sporting events using Twitter”. In: ACMIntl. Conference on Intelligent User Interfaces. 2012,pp. 189–198.

[29] Sen Pei et al. “Searching for Superspreaders ofinformation in real-world social media”. In: ScientificReports 4 (2014), p. 5547.

[30] Jeffrey Pennington, Richard Socher, andChristopher Manning. “GloVe: Global vectors forword representation”. In: Conference on EmpiricalMethods in Natural Language Processing. 2014,pp. 1532–1543.

[31] Swit Phuvipadawat and Tsuyoshi Murata. “Breakingnews detection and tracking in Twitter”. In: WebIntelligence and Intelligent Agent Technology. Vol. 3.2010, pp. 120–123.

[32] Jinshan Qi et al. “Discrete time information diffusion inonline social networks: micro and macro perspectives”.In: Scientific Reports 8.1 (2018), p. 11872.

[33] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo.“Earthquake shakes Twitter users: real-time eventdetection by social sensors”. In: 19th Intl. Conferenceon World Wide Web. ACM. 2010, pp. 851–860.

[34] Alessio Signorini, Alberto Maria Segre, and Philip MPolgreen. “The use of Twitter to track levels of diseaseactivity and public concern in the US during theinfluenza A H1N1 pandemic”. In: PloS one 6.5 (2011).

[35] The New Dictionary of Cultural Literacy, Third Edition.Houghton Mifflin Harcourt Publishing Company, 2005.

[36] Soroush Vosoughi and Deb Roy. “A Semi-AutomaticMethod for Efficient Detection of Stories on SocialMedia.” In: ICWSM. 2016, pp. 707–710.

[37] Tim Weninger, Xihao Avi Zhu, and Jiawei Han. “Anexploration of discussion threads in social news sites:A case study of the Reddit community”. In: Advancesin Social Networks Analysis and Mining. IEEE. 2013,pp. 579–583.

[38] Tom Winter. “AlphaBay, Hansa Shut, but Drug DealersFlock to Dark Web DreamMarket”. In: NBC (July2017).

[39] William H Woodall et al. “An overview and perspectiveon social network monitoring”. In: IISE Transactions49.3 (2017), pp. 354–365.

[40] Rose Yu et al. “A survey on social media anomalydetection”. In: ACM SIGKDD Explorations Newsletter18.1 (2016), pp. 1–14.

[41] Jiang Zhao, Man Lan, and Jun Feng Tian. “ECNU:using traditional similarity measurements and wordembedding for semantic textual similarity estimation”.In: Intl. Workshop on Semantic Evaluation. 2015,pp. 117–122.

[42] Siqi Zhao et al. “Sportsense: Real-time detection ofNFL game events from Twitter”. In: arXiv:1205.3212(2012).

Page 2273