unsupervised content-based characterization and anomaly ... · video games, etc.), making the...
TRANSCRIPT
Unsupervised Content-Based Characterizationand Anomaly Detection of Online Community Dynamics
Danelle Shah, Michael Hurley, Jessamyn Liu, Matthew DaggettMIT Lincoln Laboratory, Lexington, MA
{danelle.shah, hurley, jessamyn.liu, daggett}@ll.mit.edu
Abstract
The structure and behavior of human networks havebeen investigated and quantitatively modeled by modernsocial scientists for decades, however the scope ofthese efforts is often constrained by the labor-intensivecuration processes that are required to collect, organize,and analyze network data. The surge in online socialmedia in recent years provides a new source of dynamic,semi-structured data of digital human networks, manyof which embody attributes of real-world networks. Inthis paper we leverage the Reddit social media platformto study social communities whose dynamics indicatethey may have experienced a disturbance event. Wedescribe an unsupervised approach to analyzing naturallanguage content for quantifying community similarity,monitoring temporal changes, and detecting anomaliesindicative of disturbance events. We demonstrate howthis method is able to detect anomalies in a spectrumof Reddit communities and discuss its applicability tounsupervised event detection for a broader class ofsocial media use cases.
1. Introduction
This research is part of a larger effort to explore howhuman networks respond to disturbances that impactthe community’s membership, function, or interests.In particular, we are interested in observing networkdynamics which may signal unrest, vulnerability, orredefinition in response to a disruptive event. Whilelarge-scale events such as natural disasters or politicalupheaval may be trivial to identify, the detection oflocalized events impacting specific sub-communities ismore challenging. We also seek to study when andwhy similar networks behave differently or dissimilar
This material is based upon work supported by the Departmentof Defense under Air Force Contract No. FA8721-05-C-0002and/or FA8702-15-D-0001. Opinions, interpretations, conclusions andrecommendations are those of the authors and are not necessarilyendorsed by the Department of Defense.
networks behave similarly in response to the same event.To these ends, it is necessary to: 1) characterize acommunity’s membership, function, or interests overtime; 2) quantify intra- and inter-community similarity;3) automatically detect events that cause a communitydisturbance; and 4) model the short- and long-termnetwork effects of an observed disturbance.
A community is defined as “a social, religious,occupational, or other group sharing commoncharacteristics or interests and perceived or perceivingitself as distinct in some respect from the larger societywithin which it exists” [35]. The active associationor interaction of individuals within a community toachieve some end (e.g., exchange information, executea task) is referred to as a network.1 Communities areoften hierarchical and overlapping, and an individualmay belong to an arbitrary number of communities.
Social media provides a rich, observable data sourceof dynamic social networks, and it has been extensivelystudied to model both online and offline human socialbehavior. Data generated by online social networkshave been used to detect significant events in real-time[7, 31], track the spread and sentiment of information[4, 29, 32], monitor population health and safety [13,33, 34], etc. Reddit (www.reddit.com) is one suchsocial media platform, generating large volumes ofcontent shared by self-organizing, topically-orientedcommunities. In contrast to the significant body of workdedicated to the definition and detection of communitiesfrom individuals and connections [12], the structure ofReddit allows communities to be treated as atomic unitswithout the need to infer membership.
In this paper we describe an unsupervised approachto modeling the content shared within Redditcommunities in order to measure inter-communitysimilarity, track volatility over time, and automaticallydetect anomalous activity corresponding to real-worldevents. In Section 2, we review related research in
1This distinction is subtle. In the context of this work we assume a1:1 mapping between the (observable) network and the corresponding(sub)community it represents, and use these terms interchangeably.
Proceedings of the 52nd Hawaii International Conference on System Sciences | 2019
URI: https://hdl.handle.net/10125/59665ISBN: 978-0-9981331-2-6(CC BY-NC-ND 4.0)
Page 2264
the area of event detection in social media. In Section3, we describe the Reddit dataset used in this workand introduce our model for characterizing subredditdynamics using natural language processing techniques.Specifically, subreddit post activity is representedin a low-dimensional embedding space, enablingefficient calculation of intra- and inter-communitycontent similarity. We describe an unsupervisedanomaly detection method in Section 4 and present twocase studies where detected community disturbancescorrelated with real-world events. We conclude with adiscussion of opportunities and future work.
2. Background and Related Work
2.1. Reddit
Reddit is a prominent social media platform2
launched in 2005 to capture what’s new and popularon the Internet via sharing of hyperlinks to interestingwebsites. As the user base grew, a community structureorganically developed through the use of interest-centricgroups called subreddits. Users post stories, links,and media relevant to a subreddit’s predefined topic,and engage in community-oriented discussions throughcomments. Due to the encoded structure of userinteractions via posts and threads, the heterogeneity ofcontent (text, images, videos, hyperlinks), open andauthentic discussion afforded by relative anonymity andminimal censorship, and the observability enabled byits Application Programming Interface (API), Redditprovides a unique opportunity to analyze complex inter-and intra-community dynamics.
While we focus our analysis on Reddit in this paper,the approach presented here can be trivially applied toany online social network in which communities arecentralized and explicitly defined. The extension ofthese techniques to platforms in which users interact ina distributed fashion (e.g., Twitter) or to data of physical(offline) networks remains a future research goal.
2.2. Event Detection in Social Media
Methods for detecting events in online social mediahas been an active research area since the popularadoption of the internet in the 1990s. A number ofexcellent surveys have been recently published whichprovide an overview of approaches used to detectanomalies and behaviors in online social media (e.g.[18, 39, 40]). Using the nomenclature of [18], thetechnique described in this paper uses an unsupervisedapproach to dynamic, contextual anomaly detection.
2As of this writing, Reddit is ranked 5 most visited website in theUnited States and 16 in the world [2].
Twitter has been leveraged extensively for detectingevents in social media using a variety of methodsincluding analysis of social network structures as wellas message volume, timing, and content. Applicationsinclude detection of events in sports [7, 28, 42],public emergencies and disease outbreaks [3, 34] andnatural disasters [8, 33]. Several recent works performgeneral event detection and trend spotting on Twitterby detecting anomalous or “bursty” phrases [6, 23, 36].However, Twitter presents many challenges due to theshort, unstructured, usually poorly formatted messagesthat users post, and lacks explicit community structurecorresponding to user membership or content relevancy.
Although Reddit has been used as a data source foronline social media analysis, relatively little research hasbeen conducted on event detection in Reddit. Instead,the research has more often focused on communityanalysis [15, 16], topic discovery [13, 37] or onsummarizing posts and comments [22]. Most analyseshave focused on event summarization as opposed toevent detection. Event summarization is concerned withcollecting and compressing streams of text into a concisesummary statement. Event detection aims to identifysignificant changes in the text streams [7, 24].
Hamilton, et al. [14] studied the extent thatsentiment varies over time within the content ofspecific communities on Reddit. They combineddomain-specific word embeddings and a labelpropagation framework to induce sentiment lexiconsusing a small numbers of seed words from Redditdata with a corpus of 150 years of English words.Lyngbaek’s thesis [22] describes an application calledSPORK that identifies and summarizes topics in Redditcomment threads. It uses natural language processingbased upon term frequency-inverse document frequency(TF-IDF), cosine similarity, and clustering algorithms.Lyngbaek uses a hierarchical clustering algorithm tocombine similar comments based upon word-frequencyvectors. In this paper we use a similar technique,however rather than clustering individual commentsinto summarization topics, we model entire subredditcommunities over finite time spans in order to detectsignificant changes in response to real-world events.
3. Modeling Community Content
We seek to model user behavior on Reddit asa means to characterize communities over time andto detect anomalies in community activity caused byconsequential events. In this section we describe acontent-based activity model, taking advantage of recentadvances in natural language processing to characterizethe text in subreddit posts. The motivations for this
Page 2265
model are threefold: quantitatively measure subredditsimilarity; monitor temporal changes in subredditactivity; and automatically detect anomalies caused byreal world events. Note that many of the results inthis paper highlight examples that we expect readersmay recognize, however the objective is to automaticallydetect events affecting any community of interest.
3.1. Dataset
The data used in this paper includes over 118 millionposts created by over 8 million unique users onReddit between March 2016 and September 2017,collected from the archive of Reddit made publiclyavailable on Google BigQuery,3 omitting known botsand default subreddits. Bots are computer programswhich automatically generate posts and comments; 988bots (190 active) were identified by manually inspectingabnormally-active users (> 3000 comments per month)and by scraping /r/botWatcher for recently-mentionedbots. Default subreddits (those to which all users areautomatically subscribed upon account creation) arelinked at the top of every Reddit page and are typicallyamong the most active subreddits; 49 default subreddits(as of March 2016) were omitted from this analysis.
3.2. Subreddit Activity Model
The content of a subreddit captures the interests,thoughts, questions, complaints, discussions,arguments, and general musings of the members of thatcommunity. This content consists of user-contributedposts (and subsequent comments) including articles,videos, and links to other websites, as well as freeformtext. By design, posts within each subreddit are typicallyorganized around a predefined theme (sports, politics,video games, etc.), making the characterization of thispost content a critical component of understanding thecommunity and detecting change. While posts maycontain heterogeneous media, each post is accompaniedby a short descriptive title, the text of which is usedexclusively to characterize subreddit content in thiswork.4 Specifically, a subreddit’s content over a finitetime span is represented as a weighted average wordembedding of the corresponding post titles.
Word embeddings are a class of natural languageprocessing techniques in which individual words arerepresented by a vector in a predefined vector space,the dimension of which is significantly smaller than
3https://bigquery.cloud.google.com/dataset/fh-bigquery4Although comments typically significantly outnumber posts on
any given subreddit, preliminary analysis showed that post titles aloneprovided a more robust representation, likely because posts are richerin content and less noisy than comments.
the size of the vocabulary. The term word embeddingswas introduced by [5], but it was the release ofword2vec [25], a toolkit for training and implementingword embeddings, that greatly expanded the use of thetechnique. Here, an embedding known as GloVe (GlobalVectors for Word Representation) was used. The GloVealgorithm was created by [30] and builds on word2vecto explicitly encode meaning as vector offsets. Forexample, models created using GloVe should accuratelyrepresent semantic relationships such as the analogy“king is to queen as man is to woman,” which wouldbe encoded in the vector space by the vector equation:king −man+ woman = queen.
A GloVe word embedding was trained on the corpusof over 118M Reddit posts, which included over 1.2Munique words. (Words that appeared less than 5 timesin the corpus were omitted.) Note that while one mayuse pre-trained word embeddings, training on in-domaindata is desirable to capture the vocabulary, jargon, andsymbols used on Reddit, as well as foreign language use.
The technique of representing a bag of words (suchas a sentence or document) as an average of constituentword embeddings has been demonstrated as a successfuland efficient approach for many applications, such asdetecting semantic similarity of sentences, identifyingplagiarism, classifying sentiment from social media,and others [10, 17, 27, 41]. In this work, aweighted average word embedding (i.e., a weekly5
subreddit content vector) is calculated by averagingthe GloVe embedding vectors for all the words in asubreddit-week, weighted by a term frequency-inversedocument frequency (TF-IDF) value:
wTF−IDF = (1 + log ft,d) · (1 + log (N/nt)) (1)
The term frequency ft,d is given by the number of timesa term t occurs in document d, and the inverse documentfrequency is given by the ratio of the total numberof documents N and nt, the number of documentscontaining the term t. In this work, a document is thecollection of posts from a given subreddit for a givenweek. This TF-IDF formulation assigns more weight towords that are unique and less to words that are usedfrequently across many subreddits.
Anecdotally, the weighted average word embeddingfor subreddit content captures relationships betweensubreddits in a similar way as the raw GloVe embeddingcaptures relationships between words, e.g.:
/r/Patriots− /r/boston + /r/Atlanta = /r/falcons/r/iphone− /r/apple + /r/google = /r/Android
5For the results presented in this paper, a time span of oneweek was used to smooth out day-to-day volatility, particularlyon smaller subreddits with low post activity. Furthermore, manysubreddits exhibit a natural weekly periodicity of posting behavior, forexample due to higher volumes of activity on weekends and repeatingoccurrences such as television shows and sporting events.
Page 2266
3.3. Community Similarity
Although shortcomings of word vectors and the useof word analogies for evaluating vector-space semanticmodels have been documented [11, 21], this relativelysimple and inexpensive calculation performed wellin grouping subreddits into topically-similar clusters.To quantify community similarity, the cosine distancebetween the weighted average word embeddings v forsubreddits j and k is used:
dj,k = 1− cos (θ) = 1− vj · vk||vj ||2||vk||2
(2)
The cosine distance of average embeddings is a simpleyet effective metric for calculating semantic similaritybetween documents [27, 41].
Figure 1 illustrates the hierarchical clustering of thecontent embeddings for the 1000 most active subredditson the week of September 9-15, 2017. (This is providedfor visual inspection only; no analysis on communityclusters is subsequently performed.) Here, one cansee large clusters of similar subreddits, as well as thepresence of individual subreddits that do not clusterwell. Most of these outliers are foreign-languagesubreddits (e.g. /r/france in French, /r/newsokuvipin Japanese) or have unique vocabulary or postingconventions (e.g. in /r/ShadowBan nearly all post titlestake the form of “Am I Shadowbanned?”). Figure 2zooms in on a few of these clusters. Figure 2a showsa large cluster of hobby or personal interest subreddits,with subclusters related to food, fashion, nature, pets,and art. A cluster of sports-focused subreddits is shownin Figure 2b, where baseball, football, boxing/wrestling,and “misc” sports subreddits form distinct subclusters.
While these clusters appear reasonable uponqualitative examination, we can quantify the efficacy ofthe subreddit embedding similarity metric by using it topredict whether one subreddit explicitly links to anotherin its community description. For example, at the timeof this writing, the subreddit /r/boston links to threeother Boston-related subreddits: /r/bostontenants,/r/bostonhousing, and /r/BostonJobs.
Table 1 lists the top ten false positives, or the tensubreddit pairs with smallest cosine distance (greatestcosine similarity) that do not link to one another.Upon inspection, all ten pairs are objectively similarsubreddits with respect to content. Table 2 lists the topten false negatives, or the ten subreddit pairs with largestcosine distance that do link to one another. In this case,nearly all pairs consist of two subreddits in differentlanguages. The pair /r/sweden and /r/cringe strangelyappears in this list, but in fact the subreddit /r/swedencontains a link to /r/Pinsamt which is described as
Top 10 False PositivesSubreddit 1 Subreddit 2askmen askwomenanswers nostupidquestionsgametrade steamgameswapcrazyideas nostupidquestionsconservative politiccrappydesign mildlyinfuriatinggrandtheftautov gtavindia indianewsanswers crazyideasoffmychest self
Table 1: Non-linked subreddits with highest similarity.
Top 10 False NegativesSource Targeteurope greecesuomi russianewsokuvip newsokursuomi denmarksweden cringeeurope swedeneurope denmarkeurope romaniasuomi swedeneurope portugal
Table 2: Linked subreddits with lowest similarity.
“Like /r/cringe, only a lot more Swedish.” Figure 3shows the receiver operating characteristic (ROC) curvefor all subreddit pairs over 52 weeks. The average areaunder the ROC curve (AUC) is 0.85.
4. Anomaly Detection
In this section, we leverage the previously-describedcontent model to automatically detect anomalies causedby community disturbances. We consider a disturbanceto be an event that has a real or perceived impact on acommunity’s membership, function, or interests. Themotivation for anomaly detection on Reddit is twofold:to leverage the Internet as a sensor for timely detectionof real-world events; and to generate a rich dataset ofobserved community perturbations to facilitate robustanalysis of disturbance effects.
4.1. Community Change and Event Detection
Unexpected, controversial, or disruptive events oftenresult in a flurry of discussion about a new topic onrelevant affected communities on Reddit. These events
Page 2267
2mei
rl4m
eirl
anim
e_irl
me_
irlm
eirl
King
dom
Hear
tsFa
llout
Mod
sCr
usad
erKi
ngs
eu4
AskH
istor
ians
AskS
cienc
eFict
ion
world
build
ing
War
ham
mer
40k civ
tota
lwar
Xcom
Fallo
utsk
yrim
skyr
imm
ods
Min
ecra
ftfe
edth
ebea
stm
inec
rafts
ugge
stio
nsNo
Man
sSky
TheG
ame
bind
ingo
fisaa
cRi
mW
orld fo4
Terra
riaKe
rbal
Spac
ePro
gram
Citie
sSky
lines
fact
orio
cust
omhe
arth
ston
eel
ders
crol
lsleg
ends
hear
thst
onec
ircle
jerk
Town
ofSa
lem
gam
eBr
awlh
alla
Battl
eRite
para
gon
lear
ndot
a2va
ingl
oryg
ame
Com
petit
iveo
verw
atch
Over
watc
hUni
vers
ityhe
roes
ofth
esto
rmPa
ladi
nsSm
iteDn
Ddn
dnex
tPa
thfin
der_
RPG
witc
her
bloo
dbor
neda
rkso
uls
dark
soul
s3AR
Kone
play
ark
elde
rscr
ollso
nlin
ewo
rldof
pvp
Cruc
ible
Play
book
Neve
rwin
ter
wowe
cono
my
War
fram
epa
thof
exile
wow
SWGa
laxy
OfHe
roes
aven
gers
acad
emyg
ame
Cont
estO
fCha
mpi
ons
RotM
GM
aple
stor
yffx
ivbl
ackd
eser
tonl
ine
blad
eand
soul
Guild
wars
2th
ediv
ision
2007
scap
eru
nesc
ape
dead
byda
ylig
htpl
ayru
stba
ttlef
ield
_one tf2
spik
esED
Hm
agicT
CGpt
cgo
yugi
ohDB
ZDok
kanB
attle
Scho
olId
olFe
stiv
alNa
ruto
Blaz
ing
OneP
iece
TCPu
zzle
AndD
rago
nssu
mm
oner
swar
firee
mbl
emM
onst
erHu
nter
gran
dord
erFF
Reco
rdKe
eper
Blea
chBr
aveS
ouls
brav
efro
ntie
rFF
Brav
eExv
ius
Mob
iusF
FGi
ftofG
ames
Rand
om_A
cts_
Of_A
maz
onRa
ndom
Acts
ofCa
rds
shar
ditk
eepi
tCr
ackW
atch vita
Vita
Pira
cyCa
llOfD
uty
CODZ
ombi
esBa
ttlef
ield
halo
Gear
sOfW
arbl
acko
ps3
Infin
itewa
rfare
battl
efie
ld_4
titan
fall
csgo
CoDC
ompe
titiv
est
arcr
aft
Kapp
asm
ashb
ros
FGC
Stre
etFi
ghte
rgt
aonl
ine
Gran
dThe
ftAut
oVGT
AVGa
min
gStu
nts
Lets
Play
Vide
osPr
omot
eGam
ingV
ideo
sYo
uTub
eGam
ers
GetM
oreV
iews
YTvi
deog
ames
roos
terte
eth
Smal
lYTC
hann
elYo
uTub
e_st
artu
psPS
VRoc
ulus
Vive
disc
orda
ppTw
itch
tipof
myj
oyst
ick PS4
xbox
one
Gam
esSt
eam
gam
ings
ugge
stio
nsAn
droi
dGam
ing
pcga
min
g3D
Sni
nten
dopo
kem
onpo
kem
ongo
TheS
ilphR
oad
Rive
nmai
nsCL
Gsu
mm
oner
scho
olle
ague
ofle
gend
sLe
ague
OfVi
deos
Gam
erPa
ls lfgDe
stin
yShe
rpa
fifac
lubs
Mad
denM
obile
Foru
ms
MLB
TheS
how
Mad
denU
ltim
ateT
eam
NHLH
UTFi
faCa
reer
sM
adde
nFI
FANB
A2k
Roya
leRe
crui
tCl
ashO
fCla
nsCl
ashR
oyal
ehe
arth
ston
ePi
ctur
eGam
eDe
stin
yThe
Gam
eOv
erwa
tch
Glob
alOf
fens
ive
DotA
2Ra
inbo
w6Ro
cket
Leag
ueDi
epio
Party
Links
mcs
erve
rsBt
cFor
umBi
tcoi
nCr
ypto
Curre
ncy
ethe
reum bt
cbt
cnew
sfee
dm
ath
Hom
ewor
kHel
pch
eata
tmat
hhom
ewor
kle
arnm
ath
exce
lbl
ende
rga
mem
aker
gam
edev
Unity
3DM
achi
neLe
arni
ngle
arnp
rogr
amm
ing
lear
npyt
hon
java
scrip
tpr
ogra
mm
ing
webd
evDe
sign
Inte
riorD
esig
nfo
rhire
Wor
dpre
ssSE
Owe
b_de
sign
HITs
Wor
thTu
rkin
gFor
Sam
pleS
izeed
ucat
ion
Engi
neer
ingS
tude
nts
Teac
hers
Appl
ying
ToCo
llege
colle
geAc
coun
ting
csca
reer
ques
tions
jobs
deal
sla
wle
gala
dvice
busin
ess
SAte
chne
wste
chno
logy
finan
ceRe
alEs
tate
build
apc
pcm
aste
rrace
build
apcf
orm
eSu
gges
tALa
ptop
Amd
nvid
iabu
ildap
csal
esM
onito
rsja
ilbre
akap
ple
ipho
neTh
eCol
orIsO
rang
eAp
pleW
atch
tmob
ilePl
eXra
spbe
rry_p
iho
mel
ablin
uxsy
sadm
inco
mpu
ters
softw
are
Surfa
cete
chsu
ppor
tW
indo
ws10
Tech
nolo
gy_
andr
oida
pps
goog
leap
pleh
elp
Gala
xyS7
wind
owsp
hone
onep
lus
Goog
lePi
xel
Nexu
s6P
Pick
AnAn
droi
dFor
Me
Andr
oid
Andr
oidQ
uest
ions
Cele
bsge
ntle
man
bone
rsW
rest
leW
ithTh
ePlo
tcu
mslu
tsNS
FW_G
IFns
fwha
rdco
rego
newi
ldcu
rvy
Gone
wild
Plug
Amat
eur
Real
Girls milf ass
Bust
yPet
iteLe
galC
olle
geGi
rlsns
fwre
ddtu
beho
twife
_cuc
kold
hous
tonr
4rxr
aySe
xsel
lsus
edpa
ntie
spo
rnID
tipof
myp
enis
gone
wild
audi
oSi
ssie
sla
dybo
ners
gwM
assiv
eCoc
kpe
nis
gone
wild
Gone
wild
s_Se
cret
wife
shar
ing
rela
tions
hip_
advi
cere
latio
nshi
psam
iugl
yRa
tem
eNo
Fap
askt
rans
gend
er sex
datin
g_ad
vice
AskM
enAs
kWom
enra
isedb
ynar
cissis
tsde
pres
sion
teen
ager
sof
fmyc
hest
Suici
deW
atch
Advi
ceCa
sual
Conv
ersa
tion
Five
Year
sAgo
OnRe
ddit
know
your
shit
Shitt
yLife
ProT
ips
Phot
osho
pReq
uest
trees
AskO
uija
what
isthi
sthi
ngHO
Tand
TREN
DING
NoSt
upid
Ques
tions
BigB
roth
ercir
cleje
rkAM
Aca
sual
iam
adr
unk
stop
drin
king
opia
tes
leav
esst
opsm
okin
gPu
rple
PillD
ebat
eIn
cels
MGT
OWJU
STNO
MIL
Baby
Bum
psbe
yond
theb
ump
child
free
Pare
ntin
gpr
oED
ExNo
Cont
act
conf
essio
nTr
ollX
Chro
mos
omes ftm
brea
king
mom
Fore
verA
lone self
Tind
erac
tual
lesb
ians
askg
aybr
osOk
Cupi
das
ktrp
sedu
ctio
nRo
ast_
Me
Roas
tMe
4cha
nJo
nTro
nre
actio
ngifs
Nude
lete
dank
mem
esm
emes
mad
lads
best
ofia
mve
rysm
art
OutO
fThe
Loop
Crin
geAn
arch
yra
ntho
ward
ster
nop
iean
dant
hony
h3h3
prod
uctio
nsPl
anet
Dola
npy
rocy
nica
lra
repu
pper
syo
utub
ehai
kube
auty
Dent
istry
Heal
thhe
alth
care
Med
itatio
nAD
HDAn
xiet
yAs
kDoc
spt
sdbi
pola
rm
enta
lhea
lthkr
atom
Drug
sLS
DSk
inca
reAd
dict
ion
keto
Fitn
ess
lose
itas
mr
nove
ltran
slatio
nspo
litics
The_
Dona
ldIce
_Pos
eido
nJo
noBo
ndTw
itter
ricka
ndm
orty
asoi
afga
meo
fthro
nes
tipof
myt
ongu
etra
nsla
tor
anal
ogM
yWal
lpap
erCl
ubito
okap
ictur
eLa
rgeI
mag
eswa
llpap
erwa
llpap
ers
fake
idDa
nkNa
tion
Dark
NetM
arke
tsDa
rkNe
tMar
kets
Noot
ropi
csph
arm
acyr
evie
wssa
ndbo
xtes
tGa
laxy
Note
7el
ectro
nics
Prod
uctT
estin
gSo
urce
ryM
arke
tgr
ams
DNM
Supe
rlist
Dark
NetM
arke
tsNo
obs
Drea
m_M
arke
tDa
rkNe
tDea
lsAl
phaB
ayM
arke
tDr
eam
Mar
ket
Drea
mM
arke
tDar
knet
Dark
net_
Mar
kets
Cash
orTr
ade
face
valu
etick
ets
airs
oftm
arke
tTo
chka
Free
Mar
ket
astro
logy
clean
indi
aPi
nayS
cand
alru
ssia
Turk
eyGa
ryJo
hnso
nHu
rrica
neM
atth
ewne
wsg
Daisy
Ridl
eyAm
azon
_Giv
eawa
ysIm
ages
ifyou
likeb
lank
mak
eupe
xcha
nge
Guita
rgu
itarp
edal
she
adph
ones
synt
hesiz
ers
mak
ingh
ipho
ped
mpr
oduc
tion
WeA
reTh
eMus
icMak
ers
vapo
rent
sel
ectro
nic_
cigar
ette
Vapi
ngGu
npla
Nerf
airs
oft
ar15
guns
Wat
ches
foun
tain
pens
knife
club
Coffe
eHo
meb
rewi
ngwi
cked
_edg
eAs
ianB
eaut
ycig
ars
Silv
erbu
gsM
echa
nica
lKey
boar
ds3D
prin
ting
Hom
eIm
prov
emen
two
odwo
rkin
gBa
king
Food
Porn
vega
nCo
okin
gsh
ittyf
oodp
orn
supr
emec
loth
ing
Snea
kers
fash
ion
mal
efas
hion
advi
cest
reet
wear
Aqua
rium
sFi
shin
gsp
ider
swh
atst
hisb
ugm
ycol
ogy
shro
oms
what
sthi
spla
ntsu
ccul
ents
gard
enin
gm
icrog
rowe
rydo
gsPe
tsca
tsdo
gpict
ures
Rabb
itsta
ttoos
draw
ing
redd
itget
sdra
wncr
oche
tRe
dditL
aque
rista
sha
llowe
enwe
ddin
gpla
nnin
gM
akeu
pAdd
ictio
nbe
ards
mal
ehai
radv
iceRW
BYm
ylitt
lepo
nyzo
otop
iaaw
wnim
eRe
_Zer
oIn
ktob
erPi
xelA
rtco
spla
yia
teac
rayo
nru
le34
fiven
ight
satfr
eddy
sSt
ardu
stCr
usad
ers
gam
egru
mps
Unde
rtale
Defe
nder
sM
arve
lm
arve
lstud
ios
whow
ould
win
DC_C
inem
atic
com
icboo
ksDC
com
icsfu
nkop
opAc
tionF
igur
esle
gost
artre
kSt
arW
ars
Star
War
sBat
tlefro
ntm
anga
hom
estu
ckNa
ruto
arro
wFl
ashT
Vda
ngan
ronp
adb
zst
even
univ
erse
anim
eOn
ePie
cetra
nce
Hous
eTe
chno
Vapo
rwav
eTh
isIsO
urM
usic
EDM
elec
troni
cmus
icM
onst
erca
tfu
ture
beat
stra
phi
phop
head
sin
dieh
eads
poph
eads
Fran
kOce
anKa
nye
Nam
eTha
tSon
gsp
otify
csgo
gam
blin
gpo
ker
Eve
War
thun
der
Wor
ldof
Tank
sW
orld
OfW
arsh
ips
sto
XWin
gTM
GEl
iteDa
nger
ous
star
citize
nka
hoot
Inst
agra
mPi
racy
chur
ning
avia
tion
flyin
gst
arbu
cks
walm
art
AirF
orce
uwat
erlo
oos
ugam
eAp
ocal
ypse
Risin
gTo
onto
wn2b
2tun
turn
edDi
epio
Maf
ia3
PvZG
arde
nWar
fare
swto
rpa
yday
theh
eist
Plan
etsid
eHx
HBat
tleAl
lSta
rsM
SLGa
me
cust
omm
agic
duel
yst
AceA
ttorn
eyGi
IvaS
unne
rTa
leso
fLin
kno
man
shig
hSt
arde
wVal
ley
tapp
edou
tyo
kaiw
atch
met
alge
arso
lidM
orta
lKom
bat
drun
kenp
easa
nts
Scen
esFr
omAH
atcr
inge
that
Happ
ened
week
endg
unni
tco
pypa
sta
im14
andt
hisis
deep
noco
ntex
tBl
ackP
eopl
eTwi
tter
dadj
okes
ImGo
ingT
oHel
lFor
This
Fello
wKid
stra
shy
Shitt
y_Ca
r_M
ods
briti
shpr
oble
ms
Crap
pyDe
sign
mild
lyin
furia
ting
quot
esRo
orh
shitt
yask
scie
nce
answ
ers
Craz
yIde
asW
TFm
asha
ble
fark
save
dyou
aclic
kco
mics
webc
omics
furry
MLP
Loun
geHu
ntin
gna
ture
ismet
alHa
ram
bepe
rfect
loop
sBa
ndna
mes
oddl
ysat
isfyi
ngin
tere
stin
gasf
uck
woah
dude
notin
tere
stin
gDe
epIn
toYo
uTub
eGi
fSou
ndUn
expe
cted
gofu
ndm
efin
dare
ddit
Help
MeF
ind
phot
ocrit
ique
Film
mak
ers
phot
ogra
phy
unkn
ownv
ideo
sNe
wTub
ers
yout
ube
anim
atio
nho
rror
ente
rtain
men
tco
med
yfu
nnyv
ideo
sM
etal
kpop
Best
OfSt
ream
ingV
ideo
boni
ver
Met
alco
reM
usict
hem
etim
era
dioh
ead
gree
nday
viny
lru
paul
sdra
grac
esu
rviv
orvl
ogW
altD
isney
Wor
ldAm
erica
nHor
rorS
tory
Dund
erM
ifflin
sout
hpar
kAn
imal
Cros
sing
nost
algi
aTh
eSim
pson
sGa
min
gcirc
leje
rkTw
oBes
tFrie
ndsP
lay
NYCC
MrR
obot
west
world
netfl
ixSt
rang
erTh
ings
harry
potte
rwr
iting
Fant
asy
sugg
estm
eabo
okFi
nalF
anta
syki
ckst
arte
rbo
ardg
ames rpg
form
ula1
Mec
hani
cAdv
ice Jeep
suba
ruRo
adca
mfo
rza
NASC
ARte
slam
otor
sBM
WAu
tos
cars
bcsn
ews
Aust
inAt
lant
aho
usto
nPo
rtlan
dSe
attle
WA
LosA
ngel
esch
icago ny
cCa
lgar
yot
tawa
toro
nto
vanc
ouve
rlo
ndon
mel
bour
ne bjj
body
build
ing
gopr
oM
ultic
opte
rdi
scgo
lfgo
lfru
nnin
gbi
cycli
ngm
otor
cycle
sTu
berS
imHa
lifax
News
prom
ote
Busin
essA
dvice
Team
Info
grap
hics
mar
ketin
gAd
vice
Anim
als
Entre
pren
eur
OhGi
rlIam
InTr
oubl
ePr
intin
gSho
pte
chEn
gadg
etTe
chfe
edar
chite
ctur
eHo
meD
ecor
atin
gSo
cial_P
sych
olog
yEc
onom
icsec
onom
yin
vest
ing
walls
treet
bets
Post
Wor
ldPo
wers
envi
ronm
ent
citra
lEv
eryt
hing
Scie
nce
Philip
pine
sha
rdrig
htTh
e_Eu
rope
Islam
Unve
iled
Chris
tians
Awak
e2NW
OW
hite
Righ
tspr
ogun
Prot
ectA
ndSe
rve
singa
pore
sout
hafri
cam
iscof
fbea
teu
rope
anna
tiona
lism
jillst
ein
colla
pse
nytim
esAn
ythi
ngGo
esNe
wsin
then
ews
pale
olib
erta
rian
Anar
cho_
Capi
talis
mso
cialis
mm
etac
anad
aLa
teSt
ageC
apita
lism
Liber
taria
nPo
litica
lDisc
ussio
nAs
kThe
_Don
ald
AskT
rum
pSup
porte
rsPo
litica
lHum
orPO
LITI
Cne
w_rig
htCo
nser
vativ
esOn
lyPo
litica
lVid
eoFU
LLCO
MM
UNIS
MOb
ject
ivism lg
btM
ensR
ight
sex
jwex
mus
limAg
ains
tHat
eSub
redd
itsDr
ama
Tum
blrIn
Actio
nKo
taku
InAc
tion
Sarg
onof
Akka
dwo
rldGl
ance
islam
paki
stan
bakc
hodi
indi
anew
sfo
rwar
dsfro
mgr
andm
ach
ange
myv
iew
exm
orm
onCa
thol
icism
athe
ismCh
ristia
nity
hilla
rycli
nton
Hilla
ryFo
rPris
onW
ayOf
TheB
ern
Polit
ical_R
evol
utio
nde
moc
rats
Koss
acks
_for
_San
ders
Clim
ateC
haos
indi
atra
vel
Cons
erva
tive
Enou
ghTr
umpS
pam
unce
nsor
edne
wswo
rldpo
litics
cons
pira
cyTh
eCol
orIsR
edun
rem
ovab
leeu
rope
ukpo
litics
unite
dkin
gdom
bost
onca
nada
Cana
daPo
litics
irela
ndau
stra
liane
wzea
land
Guns
AreC
ool
Brea
king
News
24hr
TheC
olor
IsBlu
eCC
TVCa
mer
asUK
NoFi
lterN
ews
nprs
torie
sPo
litica
lVid
eos
russ
iawa
rinuk
rain
esy
rianc
ivilw
arGe
osim
vexi
llolo
gyLu
cky3
03Fa
ntas
y_Fo
otba
llfo
otba
llFa
ntas
yPL
socc
erLiv
erpo
olFC
Gunn
ers
redd
evils
WW
EGam
esSC
Jerk
Squa
redC
ircle
WW
EBo
MM
ATo
ront
oblu
ejay
sor
iole
sre
dsox
SFGi
ants
base
ball
CHIC
ubs
NewY
orkM
ets
Crick
etfa
ntas
ybba
llfa
ntas
yhoc
key
hock
ey MLS
rugb
yuni
on AFL nrl
spor
ts_u
ndel
ete
tenn
isea
gles
49er
sm
inne
sota
viki
ngs
Patri
ots
buffa
lobi
llsco
wboy
sDe
nver
Bron
cos
oakl
andr
aide
rsCh
arge
rsfa
lcons nba
warri
ors
CFB
fant
asyf
ootb
all
nfl
dark
netm
arke
tde
Denm
ark
swed
enth
enet
herla
nds
franc
eRo
man
iaca
talu
nya
italy
bras
ilpo
rtuga
lpo
dem
osar
gent
ina
mex
icone
wsok
uvip
news
okun
omor
alne
wsok
urSu
omi
gree
cepo
litot
aal
inity
kpics
Shad
owBa
nRo
cket
Leag
ueEx
chan
gecs
gotra
deGl
obal
Offe
nsiv
eTra
deDe
stin
yLFG
Fire
team
sAp
pNan
afri
ends
afar
ipo
kem
ontra
des
Casu
alPo
kem
onTr
ades
Poke
mon
give
away
ACTr
ade
beer
trade
DBZD
okka
nMar
ketp
lace
mec
hmar
ket
Gam
eSal
eec
igcla
ssifi
eds
funk
oswa
pgi
ftcar
dexc
hang
eha
rdwa
resw
appu
mpa
rum
Dota
2Tra
deGa
meT
rade
indi
egam
eswa
pSt
eam
Gam
eSwa
pSt
arcit
izen_
trade
ssn
eake
rmar
ket
Fash
ionR
eps
Reps
neak
ers
mad
denm
obile
buys
ell
Stea
mAc
coun
tsFo
rSal
ere
mov
albo
tFr
eeSt
uffN
YCbo
rrow
give
away
ssw
eeps
take
sUH
CMat
ches
Recr
uitC
SRo
cket
Leag
ueFr
iend
sOv
erwa
tchL
FTTe
amRe
dditT
eam
sFF
XIVR
ECRU
ITM
ENT
wowg
uild
sTh
eVoi
ceOf
Sere
nAd
vent
ures
OfFr
iend
scla
ncar
niva
lth
ebrit
ishel
ites
Rand
omAc
tsOf
Blow
Job
Rand
omAc
tsOf
Muf
fDiv
esn
apch
atdi
rtyr4
rKi
kpal
sr4
rFu
rryKi
kPal
sex
xxch
ange
NSFW
_KIK
dirty
kikp
als
Dirty
Snap
chat
Role
play
kik
Agep
layP
enPa
lsdi
rtype
npal
sal
le15
min
uten
ever
y15m
insh
itpos
tbot
5000
hobbiespersonal electronics sports
] ] ]
cities
]
games
]
Figure 1: Hierarchical clustering of subreddit embeddings using cosine distance.
(a) Hobbies and Interests Subreddits
Baki
ngFo
odPo
rnve
gan
Cook
ing
shitt
yfoo
dpor
nsu
prem
eclo
thin
gSn
eake
rsfa
shio
nm
alef
ashi
onad
vice
stre
etwe
arAq
uariu
ms
Fish
ing
spid
ers
what
sthi
sbug
myc
olog
ysh
room
swh
atst
hisp
lant
succ
ulen
tsga
rden
ing
micr
ogro
wery
dogs
Pets
cats
dogp
ictur
esRa
bbits
tatto
osdr
awin
gre
dditg
etsd
rawn
croc
het
Redd
itLaq
ueris
tas
hallo
ween
wedd
ingp
lann
ing
Mak
eupA
ddict
ion
bear
dsm
aleh
aira
dvice
(b) Sports Subreddits
WW
EGam
esSC
Jerk
Squa
redC
ircle
WW
EBo
MM
ATo
ront
oblu
ejay
sor
iole
sre
dsox
SFGi
ants
base
ball
CHIC
ubs
NewY
orkM
ets
Crick
etfa
ntas
ybba
llfa
ntas
yhoc
key
hock
ey MLS
rugb
yuni
on AFL nrl
spor
ts_u
ndel
ete
tenn
isea
gles
49er
sm
inne
sota
viki
ngs
Patri
ots
buffa
lobi
llsco
wboy
sDe
nver
Bron
cos
oakl
andr
aide
rsCh
arge
rsfa
lcons nba
warri
ors
CFB
fant
asyf
ootb
all
nfl
Figure 2: Closeups of two clusters in the subreddit embedding dendrogram from Fig. 1
Figure 3: ROC for cosine similarity predicting explicitlinks between subreddits, averaged over 52 weeks.
may therefore be detectable by anomalous content,anomalous activity volume, or both.
Just as the cosine distance of weighted average wordembeddings was used to measure similarity betweensubreddits (Eqn. 2), this metric can also quantifycontent change from time tk−1 to time tk for asingle subreddit. The amount of change considered“anomalous” for any particular subreddit will dependon its expected (baseline) activity. The content modeltakes a weighted average over all words in the posttitles; popular, broad-topic subreddits will tend tohave more stable content week-to-week than smaller,niche subreddits. To model the typical change for asubreddit, the observed week-to-week cosine distancescan be approximated by a beta distribution6 with supportfrom 0 (no change, i.e., vk = vk−1) to 1 (whenvk = −vk−1). By fitting the expected contentchange distribution, one can estimate the probability ofobserving a week-to-week change greater than or equalto dk by evaluating P (d ≥ dk) = 1− F (dk), whereF is the distribution’s cumulative distribution function(CDF). In the examples presented, a threshold of 0.02is used to detect anomalies; this threshold is a tuningparameter and corresponds to content changes that are< 2% likely to occur in the data.
6This model was used due to preliminary analysis indicatingthat week-to-week cosine distances for most subreddits could bewell-approximated by a beta distribution.
Page 2268
Calculating a subreddit content vector for anyarbitrary time span is computationally inexpensive(the specific complexity depends on the scope ofthe IDF term in Eqn. 1), therefore content changecan be monitored at multiple levels of granularity(day-to-day, week-to-week, etc.). Similarly, theexpected change distribution and anomaly calculationis trivially inexpensive to calculate and may be done inan online or rolling fashion. However in practice, oneneeds enough observations (in terms of total numberof time intervals) to sufficiently estimate the contentchange distribution. In the examples given in this paperthe distribution is assumed to be static over all time; insome cases this may not accurately reflect permanent orseasonal changes to subreddit content.
4.2. Anomaly Detection Exemplars
In this section, content models are used to detectanomalies within two distinct community groups.The first case study examines several football-relatedsubreddits to illustrate the efficacy of the presentedtechniques on an example with which the reader maybe familiar. The second looks at subreddits dedicatedto various dark net markets (primarily illicit marketswhich operate via anonymous communication networkssuch as Tor or I2P) and the anomalies detected onReddit corresponding to significant disruptions to theseanonymous, encrypted websites.
4.2.1. Football. Professional football is consideredthe most popular American sport, so it is not surprisingthat a significant amount of football-related content canbe found on Reddit. Discussion about the NationalFootball League (NFL) can be found on /r/nfl; thoseinterested in NCAA college football can head to /r/CFB.Subreddits dedicated to individual teams abound, e.g./r/LSUFootball, /r/ClemsonTigers and /r/Patriots.
Figure 4 plots the week-to-week content changeand post volume for three football-related communities.Detected anomalies reliably correspond to eventssuch as the Super Bowl, draft dates, and dramaticor surprising incidents (anomalies were inspectedmanually and annotations provided for the reader’sconvenience). We note several observations:
• Events do not affect all communities, evenclosely-related ones, in the same way. Although allthree subreddits are close in the content embeddingspace (see Fig. 2b), they exhibit anomalous behaviorin response to different events specific to theirdomain. For example, National Signing Day (Feb.1, 2017) refers to the day new recruits can committo their future schools and is coincident with the
largest content-based anomaly on /r/CFB, likelycaused by users discussing the incoming class of newpromising football stars. This event, however, causesno observable change in either /r/nfl or /r/Patriots.
• Events detected due to anomalous content do notalways correlate with spikes in activity volume,and vice-versa. Content-based anomalies oftencorrespond to a surprising or dramatic occurrence,causing the existing community to discuss a newtopic. Changes in activity volume correspondto events which would affect the community’spopularity (either increase or decrease), such as thestart of a new season. Interestingly, Super Bowl LI(Feb. 5, 2017) caused a dramatic increase in postactivity on /r/Patriots but not a significant change incontent, likely because the Super Bowl temporarilyattracted more people to the subreddit, but the topicsof discussion remained largely the same.
• While the framework described in this paper cannotexplain the exact cause of an anomaly, word cloudscan often provide quick insight into the underlyingevent(s). For example, word clouds corresponding tothe three content anomalies observed on /r/Patriotsare plotted and described in Figure 5.
• Even for subreddits that correspond to a publicand well-defined community such as /r/Patriots, theevents detected are not always present in other formsof traditional media. For example, 342 news articlescontaining the word “patriots” were published inthe sports section of the New York Times betweenMarch 2016-September 2017. Most articles werenot relevant to the Patriots football team, and onlya small number (25 articles over nearly 19 months)correspond to articles covering non-game relatednews. These non-game articles represent the typeof news stories which would be most likely toshift the content of /r/Patriots away from typical,game-related discussions, but only 1 out of 25 (“F.B.I.Recovers Tom Brady’s Missing Super Bowl Jerseysin Mexico”) aligns with the events detected byour method. Other representative non-game newsarticles include headlines such as “Aaron Hernandez’sMurder Conviction Is Nullified”, “Six New EnglandPatriots Say They Will Skip a White House Visit”,and “Babe Parilli Dies at 87”. The events detected onReddit and articles published by the New York Timesmay point to differences between topics deemedrelevant by self-identified community members andthose recognized by an external editorial staff,illustrating one benefit of using social media asa sensor for detecting community disturbances inaddition to more traditional sources.
Page 2269
0.000
0.005
0.010
0.015
0.020
0.025
0.030
0.035
0.040CFBnflPatriots
0
500
1000
1500
2000
2500
3000
Cos
ine
Dis
tanc
eN
umbe
r of P
osts
CFBnflPatriots
2016-042016-06
2016-082016-10
2016-122017-02
2017-042017-06
2017-08
Patriots trade Jamie Collins to Cleveland Browns (Oct. 31, 2016)
Donald Trump claims endorsement by Patriots (Nov. 7, 2016)
National Signing Day (Feb. 1, 2017) Tom Brady's stolen Super Bowl
jersey found (Mar. 20, 2017)
NFL Draft (Apr. 27-29, 2017)Super Bowl LI
(Feb. 5, 2017)
o*o*
o*
o*
o* o*
Figure 4: (top) Week-to-week change for three football-related subreddits (more positive indicates greater change).Detected anomalies are indicated by red dots and annotated with relevant events that coincide with the anomalies;(bottom) Subreddit post volume is plotted for comparison.
(a) Week of Oct. 31–Nov. 6, 2016; OnOct. 31, 2016, the Patriots unexpectedlytraded top linebacker Jamie Collins to theCleveland Browns [20].
(b) Week of Nov. 7–13, 2016; On Nov.7, 2016, Donald Trump gave a speechclaiming Patriots coach Bill Belichick andquarterback Tom Brady supported him forpresident [9].
(c) Week of March 20-26, 2017; On March20, 2016, Tom Brady’s stolen Super Bowljersey was reported to have been found [1].
Figure 5: Word clouds corresponding to detected events on /r/Patriots. Word sizes are proportional to frequency [26].
4.2.2. Dark Net Markets. In July 2017, AlphaBay,an online dark net marketplace, was taken down by lawenforcement authorities. With over 240,000 members,AlphaBay was the largest market of its kind andreportedly hosted $600,000 to $800,000 in transactionsdaily. Its takedown, along with the coordinated seizureof Hansa (the second largest dark net market) two weekslater, scattered vendors and buyers to other dark netmarkets and caused significant disruption to dark netmarket communities both on and off Reddit.
Prior to its shutdown, AlphaBay users andadministrators used the subreddit /r/AlphaBayMarketto post reviews and updates on site operations. After
its final outage on July 4, 2017, Reddit users speculatedthat the outage was an exit scam7 and exchangedconspiracy theories on /r/AlphaBayMarket and on/r/DarkNetMarkets. Federal authorities did notconfirm their involvement in the takedown until July20, and new users flooded to /r/AlphaBayMarket inthe intervening weeks as vendors and buyers tried todetermine status of the network.
Vendors and buyers also sought an alternative darknet market on which to continue their transactions. Newlistings tripled on Hansa and DreamMarket, another
7In 2015, administrators of Evolution, another major dark netmarket, shut down the site and took millions of dollars of users’ moneyin what was known as an exit scam.
Page 2270
0.02
0.04
0.06
0.08
0.10
0.12
0.14C
osin
e D
ista
nce
2016-042016-06
2016-082016-10
2016-122017-02
2017-042017-06
2017-080
200
400
600
800
1000
Num
ber o
f Pos
ts
0.002
0.004
0.006
0.008
0.010AlphaBayMarketDarkNetMarkets
AlphaBayMarketDarkNetMarkets
*o *o *o *oDisclosure of major vulnerabilities on AlphaBay and Hansa markets
(Jan. 23, 2017)
AlphaBay final outage (July 4, 2017)
Hansa shut down & Law Enforcement involvement confirmed (July 20, 2017)
AlphaBay final outage (July 4, 2017)
Hansa shut down & Law Enforcement involvement confirmed (July 20, 2017)
Abandoned subreddit exhibits new baseline o* o*
Figure 6: (top) Week-to-week change for two largest Dark Web market subreddits (more positive indicates greaterchange); (bottom) Subreddit post volume is plotted for comparison.
(a) Week of June 26–July 2, 2017; Typicaldiscussion of marketplace activity, withlarge emphasis on AlphaBay
(b) Week of July 3–July 9, 2017; Redditorsdiscuss the outage of AlphaBay (July4, 2017) and debate alternative markets,Hansa and Dream Market.
(c) Week of July 17–July 23, 2017;Hansa marketplace is shut down (July 20,2017), and discussion of AlphaBay hassignificantly decreased.
Figure 7: Word Clouds for subreddit /r/DarkNetMarkets before, during, and after the AlphaBay and Hansa dark netmarkets were shut down in July 2017.
dark net market competitor, when AlphaBay went down[38]. Once it became clear that Hansa was also shutdown, DreamMarket became the largest dark net market,although others, such as Trade Route, Tochka, and WallStreet Market, saw similar boosts in traffic [19].
Figure 6 (top) shows anomalous change in/r/DarkNetMarkets content in the weeks of July 3(when AlphaBay had its final outage) and July 17 (whenthe federal takedown of AlphaBay and Hansa wasformally announced). Word clouds give a sense of howthe content of posts on /r/DarkNetMarkets shiftedfrom from AlphaBay to Hansa to DreamMarket as newsof the takedown unfolds (see Figure 7). This anomalouscontent change detected in /r/AlphaBayMarket may
be attributed to a significant reduction in posts afterAugust 2017 (/r/AlphaBayMarket had an average of92 posts per week from Jan.-June 2017 and an averageof 35 posts per week from Aug.-Sep. 2017).
Figure 6 (bottom) shows a large spike in postactivity on /r/AlphaBayMarket on the week ofJuly 3, reflecting the large increase in posts on/r/AlphaBayMarket in the confusion following thesite’s outage (total posts on /r/AlphaBayMarketjumped from less than 80 posts on the week of June26 to over 950 on the week of July 3). A similar spikeon /r/DarkNetMarkets for the week of July 17 likelyreflects the increase in posts on /r/DarkNetMarketsregarding the announced government takedown of both
Page 2271
AlphaBay and Hansa; we speculate that posts movedfrom /r/AlphaBayMarket to /r/DarkNetMarkets as itbecame clear that AlphaBay was completely defunct(total posts on /r/DarkNetMarkets jumped from 570posts on the week of July 10 to over 940 on the weekof July 17, whereas total posts for the same weeks on/r/AlphaBayMarket decreased from ∼720 to ∼380).
The results presented here represent two specificcommunity groups, however these analyses demonstratethe ability to quantitatively characterize subredditcommunities and automatically detect disturbances in awide range of communities in an unsupervised manner.Moreover, these disturbances appear to correspondto events that are unexpected and noteworthy to themembers of the community itself, rather than thosehighlighted by traditional external news sources.
5. Future Work
The methods we have presented show promisein detecting changes to both common and nichecommunities of interest, a challenge relevant to manyacademic and commercial fields. It is expected thatmost of the techniques applied here to Reddit canbe generalized to other data sources such as Twitter,although additional work would be required to definecommunity membership and analyze shorter and lessstructured messages. Verification and analysis of thedifferences between Reddit and Twitter (and othersources) are subjects for future research.
The analysis and explanation of the causes ofdetected changes in social media is also an area ofinterest, and the overall efficacy of these approacheslikely depends on the ability to correlate andcontextualize anomaly detections with auxiliary sourcesof information. Future work includes automaticallygenerating an interpretation of detected anomalies,detecting events by correlating weak changes acrossmultiple communities, and studying the long term inter-and intra-community effects caused by detected events.A more rigorous utility study must also be performedcorrelating detected anomalies with real-world events inorder to quantify the robustness and reliability of theevent detection across a range of communities.
References
[1] David Agren. “Tom Brady’s stolen Super Bowl jerseyfound in Mexico”. In: The Guardian (Mar. 2017).
[2] Alexa. https://www.alexa.com/siteinfo/reddit.com.[3] Eiji Aramaki, Sachiko Maskawa, and Mizuki Morita.
“Twitter catches the flu: detecting influenza epidemicsusing Twitter”. In: Conference on Empirical Methodsin Natural Language Processing. 2011, pp. 1568–1576.
[4] Ahmer Arif et al. “How Information Snowballs:Exploring the Role of Exposure in OnlineRumor Propagation”. In: ACM Conference onComputer-Supported Cooperative Work & SocialComputing. 2016, pp. 466–477.
[5] Yoshua Bengio et al. “A neural probabilistic languagemodel”. In: Journal of Machine Learning Research 3(Feb. 2003), pp. 1137–1155.
[6] James Benhardus and Jugal Kalita. “Streaming trenddetection in Twitter”. In: Intl. Journal of Web BasedCommunities 9.1 (2013), pp. 122–139.
[7] Deepayan Chakrabarti and Kunal Punera. “EventSummarization Using Tweets”. In: ICWSM 11 (2011),pp. 66–73.
[8] Andrew Crooks et al. “#Earthquake: Twitter as adistributed sensor system”. In: Transactions in GIS 17.1(2013), pp. 124–147.
[9] Jeremy Diamond. “Trump says Patriots’ Tom Brady,Bill Belichick supporting him”. In: CNN (Nov. 2016).
[10] Jeremy Ferrero et al. “Using Word Embeddingfor Cross-Language Plagiarism Detection”. In:15th Conference of the European Chapter of theAssociation for Computational Linguistics. Vol. 2.2017, pp. 415–421.
[11] Gregory Finley, Stephanie Farmer, and SergueiPakhomov. “What Analogies Reveal about WordVectors and their Compositionality”. In: 6th JointConference on Lexical and Computational Semantics.2017, pp. 1–11.
[12] Santo Fortunato and Darko Hric. “Communitydetection in networks: A user guide”. In: PhysicsReports 659 (2016), pp. 1–44.
[13] Reilly Grant et al. “Discovery of Informal Topicsfrom Post Traumatic Stress Disorder Forums”. In: Intl.Conference on Data Mining Workshops. IEEE. 2017,pp. 452–461.
[14] William L Hamilton et al. “Inducing Domain-SpecificSentiment Lexicons from Unlabeled Corpora”. In: 2016Conference on Empirical Methods in Natural LanguageProcessing. 2016, pp. 595–605.
[15] Jack Hessel, Chenhao Tan, and Lillian Lee. “Science,AskScience, and BadScience: On the Coexistenceof Highly Related Communities.” In: ICWSM. 2016,pp. 171–180.
[16] Jack Hessel et al. “What do Vegans do in their SpareTime? Latent Interest Detection in Multi-CommunityNetworks”. In: arXiv:1511.03371 (2015).
[17] Edilson Anselmo Corrza Jsnior, Vanessa QueirozMarinho, and Leandro Borges dos Santos. “NILC-USPat SemEval-2017 Task 4: A Multi-view Ensemble forTwitter Sentiment Analysis”. In: Proc. of the 11th Intl.Workshop on Semantic Evaluation. 2017, pp. 611–615.
[18] Ravneet Kaur and Sarbjeet Singh. “A survey of datamining and social network analysis based anomalydetection techniques”. In: Egyptian Informatics Journal17.2 (2016), pp. 199–216.
[19] Leo Kelion. “Dark web markets boom after AlphaBayand Hansa busts”. In: BBC (Aug. 2017).
[20] Emmett Knowlton. “The Patriots shocked the NFLworld by trading one of their best defenders to the worstteam in the NFL”. In: Business Insider (Oct. 2016).
[21] Tal Linzen. “Issues in evaluating semantic spaces usingword analogies”. In: (2016), pp. 13–18.
[22] Steffen Lyngbaek. “SPORK: A SummarizationPipeline for Online Repositories of Knowledge”.MA thesis. CA Polytechnic State University, 2013.
Page 2272
[23] Carlos Martin, David Corney, and Ayse Goker. “Miningnewsworthy topics from social media”. In: Advances inSocial Media Analysis. Springer, 2015, pp. 21–43.
[24] Sara Melvin et al. “Event Detection and SummarizationUsing Phrase Network”. In: Joint European Conferenceon Machine Learning and Knowledge Discovery inDatabases. Springer. 2017, pp. 89–101.
[25] Tomas Mikolov et al. “Efficient estimation of wordrepresentations in vector space”. In: arXiv:1301.3781(2013).
[26] Andreas Mueller et al. word cloud: 1.2.1.https://doi.org/10.5281/zenodo.49907. Apr. 2016.
[27] El Moatez Billah Nagoudi, Jeremy Ferrero, andDidier Schwab. “LIM-LIG at SemEval-2017 Task1: Enhancing the Semantic Similarity for ArabicSentences with Vectors Weighting”. In: Intl. Workshopon Semantic Evaluations. 2017, pp. 125–129.
[28] Jeffrey Nichols, Jalal Mahmud, and Clemens Drews.“Summarizing sporting events using Twitter”. In: ACMIntl. Conference on Intelligent User Interfaces. 2012,pp. 189–198.
[29] Sen Pei et al. “Searching for Superspreaders ofinformation in real-world social media”. In: ScientificReports 4 (2014), p. 5547.
[30] Jeffrey Pennington, Richard Socher, andChristopher Manning. “GloVe: Global vectors forword representation”. In: Conference on EmpiricalMethods in Natural Language Processing. 2014,pp. 1532–1543.
[31] Swit Phuvipadawat and Tsuyoshi Murata. “Breakingnews detection and tracking in Twitter”. In: WebIntelligence and Intelligent Agent Technology. Vol. 3.2010, pp. 120–123.
[32] Jinshan Qi et al. “Discrete time information diffusion inonline social networks: micro and macro perspectives”.In: Scientific Reports 8.1 (2018), p. 11872.
[33] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo.“Earthquake shakes Twitter users: real-time eventdetection by social sensors”. In: 19th Intl. Conferenceon World Wide Web. ACM. 2010, pp. 851–860.
[34] Alessio Signorini, Alberto Maria Segre, and Philip MPolgreen. “The use of Twitter to track levels of diseaseactivity and public concern in the US during theinfluenza A H1N1 pandemic”. In: PloS one 6.5 (2011).
[35] The New Dictionary of Cultural Literacy, Third Edition.Houghton Mifflin Harcourt Publishing Company, 2005.
[36] Soroush Vosoughi and Deb Roy. “A Semi-AutomaticMethod for Efficient Detection of Stories on SocialMedia.” In: ICWSM. 2016, pp. 707–710.
[37] Tim Weninger, Xihao Avi Zhu, and Jiawei Han. “Anexploration of discussion threads in social news sites:A case study of the Reddit community”. In: Advancesin Social Networks Analysis and Mining. IEEE. 2013,pp. 579–583.
[38] Tom Winter. “AlphaBay, Hansa Shut, but Drug DealersFlock to Dark Web DreamMarket”. In: NBC (July2017).
[39] William H Woodall et al. “An overview and perspectiveon social network monitoring”. In: IISE Transactions49.3 (2017), pp. 354–365.
[40] Rose Yu et al. “A survey on social media anomalydetection”. In: ACM SIGKDD Explorations Newsletter18.1 (2016), pp. 1–14.
[41] Jiang Zhao, Man Lan, and Jun Feng Tian. “ECNU:using traditional similarity measurements and wordembedding for semantic textual similarity estimation”.In: Intl. Workshop on Semantic Evaluation. 2015,pp. 117–122.
[42] Siqi Zhao et al. “Sportsense: Real-time detection ofNFL game events from Twitter”. In: arXiv:1205.3212(2012).
Page 2273