Social Networks and Surveillance: Evaluating Suspicion by Association
Ryan P. Layfield, Dr. Bhavani Thuraisingham,
Dr. Latifur Khan, Dr. Murat Kantarcioglu
The University of Texas at Dallas
{layfield, bxt043000, lkhan, muratk}@utdallas.edu
Overview
Introduction
►Our Goal
►System Design
►Social Networks
►Threat Detection
►Correlation Analysis
The Experiment
►Setup
►Current Results
►Issues
►Future Work
Introduction
Automated message surveillance is essential to communication monitoring
►Widespread use of electronic communication
►Exponential data growth
►Impossible to sift through all ‘by hand’
Going beyond basic surveillance
►Identifying groups rather than individuals
►Monitoring conversations rather than messages
Our Goal
Design new techniques and apply existing algorithms to…
►Create a machine-understandable model of existing social networks
►Identify abnormal conversations and behavior
►Monitor a given communications system in real-time
►Continuously learn and adapt to a dynamic environment
System Design
Three major components:
►Social Network Modeler
►Initial Activity Detector
►Correlated Activity Investigator
Social Networks
Individuals engaged in suspicious or undesirable behavior rarely act alone
We can infer that those associated with a person positively identified as suspicious have a high probability of being either:
►Accomplices (participants in suspicious activity)
►Witnesses (observers of suspicious activity)
Making these assumptions, we create a context of association between users of a communication network
Social Networks
Within our model:
► Every node is a unique user
► Every message creates or strengthens a link between nodes
Over time, the network changes
► Frequent communication leads to stronger links
► Intermittent messaging implies weakening social ties
The strength of a link indicates how strongly the individuals are associated
From this data, we can theoretically identify
► Hubs
► Groups
► Liaisons
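The model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the unit increment per message, and the multiplicative `decay` applied once per time step are all hypothetical choices, since the slides do not say how links are strengthened or weakened. Summed link strength is used as a rough hub indicator.

```python
from collections import defaultdict


class SocialNetwork:
    """Weighted graph: nodes are users, link weights grow with messages."""

    def __init__(self, decay=0.9):
        # decay is a hypothetical per-step multiplier, not taken from the slides
        self.decay = decay
        self.links = defaultdict(float)  # frozenset({a, b}) -> strength

    def record_message(self, sender, recipients):
        """Create or strengthen the link between sender and each recipient."""
        for recipient in recipients:
            if recipient != sender:
                self.links[frozenset((sender, recipient))] += 1.0

    def step(self):
        """Weaken every link once per time step, so quiet ties fade."""
        for pair in self.links:
            self.links[pair] *= self.decay

    def strength(self, a, b):
        return self.links.get(frozenset((a, b)), 0.0)

    def hub_scores(self):
        """Sum of link strengths per user; high totals suggest hubs."""
        totals = defaultdict(float)
        for pair, weight in self.links.items():
            for user in pair:
                totals[user] += weight
        return dict(totals)


net = SocialNetwork(decay=0.5)
net.record_message("alice", ["bob", "carol"])
net.step()                                  # both links weaken
net.record_message("alice", ["bob"])        # alice-bob is reinforced
```

After these calls the alice-bob link is stronger than the alice-carol link, and alice has the highest hub score. Detecting groups and liaisons would need a clustering step that is not shown here.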
Threat Detection
Every message sent is scrutinized in the interest of identifying suspicious communication
►Keyword analysis
►Prior context (i.e. previous message content)
When a detection algorithm yields a strong result, a token is created
►The token is created at the origin and passed to the recipient(s)
►Existing tokens, if any, are cloned instead
The result is a web that potentially reflects the dissemination of suspicious information
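A minimal sketch of the token mechanism follows. The keyword list, the `Token` fields, and the `process_message` helper are hypothetical names introduced for illustration, and the keyword check is only a stand-in for whatever detection algorithm the system actually uses.

```python
from dataclasses import dataclass, replace
from itertools import count

SUSPICIOUS_KEYWORDS = {"payload", "detonate"}  # hypothetical watch list
_token_ids = count(1)


@dataclass
class Token:
    token_id: int
    origin: str   # user at whom the token was first created
    message: str  # text that triggered the detection


def is_suspicious(text):
    """Stand-in detector: flag any message containing a watched keyword."""
    return bool(SUSPICIOUS_KEYWORDS & set(text.lower().split()))


def process_message(holdings, sender, recipients, text):
    """Update holdings (user -> list of Token) for one sent message."""
    if not is_suspicious(text):
        return
    tokens = holdings.setdefault(sender, [])
    if not tokens:
        # No token at the sender yet: create one at the origin.
        tokens.append(Token(next(_token_ids), sender, text))
    # Pass a clone of every token the sender holds to each recipient.
    for recipient in recipients:
        held = holdings.setdefault(recipient, [])
        for token in list(tokens):
            if token not in held:
                held.append(replace(token))


holdings = {}
process_message(holdings, "alice", ["bob"], "move the payload tonight")
process_message(holdings, "bob", ["carol"], "the payload is ready")
process_message(holdings, "dave", ["erin"], "lunch at noon")
```

Here the token created at alice reaches carol through bob as a clone, while the unrelated message from dave leaves no trace. The resulting `holdings` map is the "web" of dissemination.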
Correlation Analysis
Future messages with similar suspicious topics are not always identifiable with the same ‘initial’ techniques
►Quick replies
►Pronoun use
►Assumption that recipient is aware of topic
If a token is present at the sender when a message is sent:
►The message associated with the token and the new message are analyzed
►If analysis yields a strong match, the token is further cloned and passed to the recipient
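The follow-up check can be sketched as below. Jaccard word overlap is swapped in as a simple stand-in for the correlation analysis, and the `threshold` value is a hypothetical choice; neither comes from the slides.

```python
def jaccard(text_a, text_b):
    """Word-set overlap in [0, 1]; a stand-in for the real analysis."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def propagate(holdings, sender, recipients, text, threshold=0.25):
    """holdings maps user -> list of (token_id, token_text) pairs.

    Returns the (token_id, recipient) pairs that were passed on.
    """
    passed = []
    for token_id, token_text in list(holdings.get(sender, [])):
        # Compare the message behind the token with the new message.
        if jaccard(token_text, text) >= threshold:
            for recipient in recipients:
                held = holdings.setdefault(recipient, [])
                if (token_id, token_text) not in held:
                    held.append((token_id, token_text))
                    passed.append((token_id, recipient))
    return passed


holdings = {"bob": [(1, "move the payload to the dock tonight")]}
reply = "is it at the dock tonight"   # no keyword, but overlaps the token text
passed = propagate(holdings, "bob", ["carol"], reply)
```

The reply contains no flagged keyword, yet it shares enough words with the tokenized message for the token to follow it to carol. A message from a sender holding no token is never analyzed this way.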
The Experiment
A set of rare words shared between two or more messages is a candidate for keyword analysis, but such words are not always easily sifted from ‘noise’
Noise within text-based messages comes in a variety of forms
► Misspelled words
► Unusual word choice
► Incompatible variations of the same language (e.g. British vs. American English)
► Unexpected language
However, we do not want to eliminate potential keywords
► Document names
► Terminology specific to a subject
► ‘Buzz’ words
The Experiment
We proposed an experiment that attempts to eliminate false positives due to noisy data while strengthening and expanding our correlation techniques
Setup
Tools► Running word ‘rank’ database
► Implementation of word set theory infrastructure
► JAMA Matrix Library (Singular Value Decomposition)
Our Approach► Apply SVD noise filtering based on 100 messages
► Analyze word frequency correlation between current message and prior suspicious messages
► Generate a score based on the results
Setup
Construct a matrix based on the last 100 messages
$c_{i,j} = \mathrm{count}(w_i, m_j), \quad w_i \in W$
►Columns: messages $M_1, M_2, \ldots, M_t$
►Rows: words in $W$, ordered from more common to less common
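A minimal sketch of building such a word-by-message count matrix, assuming whitespace tokenization and lowercase folding (the slides do not describe the tokenizer). The function name and the alphabetical tie-break are illustrative choices.

```python
from collections import Counter


def build_count_matrix(messages, window=100):
    """Return (words, matrix) with matrix[i][j] = count(words[i], messages[j])."""
    recent = messages[-window:]                      # last `window` messages
    counts = [Counter(m.lower().split()) for m in recent]
    totals = Counter()
    for c in counts:
        totals.update(c)
    # Rows ordered from more common to less common; ties broken alphabetically.
    words = sorted(totals, key=lambda w: (-totals[w], w))
    matrix = [[c[w] for c in counts] for w in words]
    return words, matrix


words, matrix = build_count_matrix(
    ["the drop is set", "the drop the plan", "plan b"]
)
```

Each row of `matrix` holds one word's counts across the messages, so the first row belongs to the most frequent word in the window.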
Setup
Decompose and rebuild
$A = U \Sigma V^T$
Eliminate ‘weak’ singular values
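Written out, "decompose and rebuild" corresponds to the standard truncated SVD. The cutoff $k$ is an assumption here, since the slides do not say how a ‘weak’ singular value is defined:

```latex
A = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i \, u_i v_i^T,
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

\tilde{A} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^T, \qquad k < r
```

Here $u_i$ and $v_i$ are the columns of $U$ and $V$, and $\tilde{A}$ is the rebuilt matrix with the $r - k$ smallest singular values set to zero.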
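Setup
$$\mathrm{score}(w_i) = \frac{\mathrm{count}(w_i, m_j)\,\mathrm{count}(w_i, m_k)}{\mathrm{rank}(w_i)}$$
►$\mathrm{score}(w_i)$: ‘raw’ total score for word $w_i$
►Counts: pulled from messages $j$ and $k$
►$\mathrm{rank}(w_i)$: pulled from the ‘running’ word database
$$\sum_{w_i \in W_j \cap W_k} \mathrm{score}(w_i)$$
►Counts only the intersection of words
►Compared against a predefined fixed threshold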
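The scoring step can be sketched as follows. Several details are assumptions rather than facts from the slides: the two counts are multiplied, a larger rank means a more common word, and a message pair is flagged when the total meets or exceeds the threshold. The rank values and threshold below are invented for illustration.

```python
from collections import Counter


def correlation_score(message_j, message_k, rank):
    """Sum score(w) over the words shared by both messages.

    Assumed form: score(w) = count(w, m_j) * count(w, m_k) / rank(w).
    """
    counts_j = Counter(message_j.lower().split())
    counts_k = Counter(message_k.lower().split())
    shared = counts_j.keys() & counts_k.keys()   # intersection of word sets
    return sum(counts_j[w] * counts_k[w] / rank[w] for w in shared)


# Hypothetical running-rank values: larger means more common.
rank = {"the": 8.0, "at": 4.0, "dock": 0.5}
THRESHOLD = 1.0  # hypothetical fixed threshold

total = correlation_score("meet at the dock", "leave the dock at dawn", rank)
flagged = total >= THRESHOLD
```

In this example the uncommon shared word contributes far more to the total than the two common ones, which is the intended effect of dividing by rank.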
Current Results
Method is not currently accurate
Large fluctuations
►Correlation easily swayed by plethora of common words
►Uncommon words not given enough weight
Current Results
[Pie chart: Accuracy of Results over 900 Messages. Legend: True Positives, False Positives, True Negatives, False Negatives. Slice labels as extracted: 3%, 12%, 59%, 26%.]
1000 messages evaluated, first 100 used to seed word ranks.
Issues
Word frequencies fluctuate wildly during the beginning of the experiment (0.0 – 10.0+)
Extreme cost for current construction methods and computation
Filtering context limited to recent global history
Affected by large bodies of text
Future Work
Tap potential of existing matrix for further analysis
Adaptive filtering feedback algorithms
Speed improvements to accommodate real-time streams
Flexible communication platform monitoring
Addition of pipe architecture for modular threat detection and correlation