Streaming Models and Algorithms for Communication and Information Networks
Brian Thompson (joint work with James Abello)
Outline
Introduction and Motivation
A Streaming Model
Our Approach
Algorithms
Experimental Results
Conclusions and Future Work
Problem Description
Data: A network (G; T)
G = (V, E) is a graph
T is a set of time-stamped events corresponding to nodes or edges in G
Goals:
Identify recent correlated activity
Measure influence between entities
Challenges:
Scalability: networks may be very large, limited space
Efficiency: high data rate, time-sensitive information
Variability: entities have different temporal dynamics
Related Work
Time-evolving graph model: sequence of “snapshots” (t = 1, t = 2, t = 3, t = 4)
Time series analysis
[Figure: IP Traffic (MB Per Hour), hourly from 12:00 AM to 11:00 PM]
Cascade model: set of seed nodes; information (product, news, virus) propagates through network
Data Model
G is a graph
T is a set of time-stamped events corresponding to nodes or edges in G
Source | Recipient | Content | Timestamp
Alice | (public) | “Fire at 2nd & Main!” | Tuesday, 9:25am
Bob | Cheng | (private message) | Tuesday, 9:27am
Cheng | (public) | “RT @Alice Fire ...” | Tuesday, 9:28am
[Figure: graph on nodes Alice, Bob, Cheng, Devika, Elina]
Data Model (Node-centric)
[Figure: nodes Alice, Bob, Cheng, Devika, Elina]
Data Model (Edge-centric)
[Figure: nodes Alice, Bob, Cheng, Devika, Elina]
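The node-centric and edge-centric views can be illustrated with a minimal Python sketch. The `Event` type, the use of `None` for a public post, and timestamps expressed as minutes since midnight are assumptions made for illustration; the slides do not specify a data layout.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Event(NamedTuple):
    source: str
    recipient: Optional[str]  # None stands for a public post (assumption)
    timestamp: float          # minutes since midnight (arbitrary unit for this sketch)


def node_centric(events):
    """Group event timestamps by the node that generated them."""
    streams = defaultdict(list)
    for e in sorted(events, key=lambda e: e.timestamp):
        streams[e.source].append(e.timestamp)
    return dict(streams)


def edge_centric(events):
    """Group event timestamps by (source, recipient) edge; public posts are skipped."""
    streams = defaultdict(list)
    for e in sorted(events, key=lambda e: e.timestamp):
        if e.recipient is not None:
            streams[(e.source, e.recipient)].append(e.timestamp)
    return dict(streams)


# The three example events from the table on the Data Model slide.
events = [
    Event("Alice", None, 9 * 60 + 25),
    Event("Bob", "Cheng", 9 * 60 + 27),
    Event("Cheng", None, 9 * 60 + 28),
]
```

Each resulting list of timestamps is one discrete-event sequence, attached either to a node or to an edge.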
Renewal Theory
A renewal process $\Phi$ is a continuous-time counting process in which the holding times between consecutive events are sampled independently from a common positive distribution.
Let $S_1, S_2, \ldots$ be samples from that distribution, and consider a sequence of events $t_1, t_2, \ldots$ separated by those holding times.
We call the $S_i$ inter-arrival times, and refer to the sequence $t_1, t_2, \ldots$ as the discrete-event sequence for $\Phi$.
[Figure: timeline starting at 0 with events t1 ... t5; the holding time S3 is marked]
Renewal Theory
The age of a renewal process $\Phi$ at time $t$ is the amount of time elapsed since the last event:
$$\mathrm{Age}_\Phi(t) = \begin{cases} t - \max\{t_i : t_i < t\} & \text{if } t \ge t_1 \\ \infty & \text{otherwise} \end{cases}$$
[Figure: timeline starting at 0 with events t1 ... t5 and the age at time t marked]
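As a small illustration, the age can be computed from a sorted list of event times with a binary search. This is a sketch, not the authors' code; it follows the strict inequality $t_i < t$ in the definition, so it returns infinity whenever no event lies strictly before $t$.

```python
import bisect
import math


def age(event_times, t):
    """Time elapsed at t since the last event strictly before t.

    event_times must be sorted in increasing order.
    Returns math.inf when no event precedes t.
    """
    i = bisect.bisect_left(event_times, t)  # number of events with t_i < t
    if i == 0:
        return math.inf
    return t - event_times[i - 1]
```

For example, with events at times 1, 4 and 6, `age([1.0, 4.0, 6.0], 5.0)` is the time since the event at 4.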
The REWARDS Model
REneWal theory Approach for Real-time Data Streams
We model a stream of communication data from a node or across an edge as a renewal process.
Given a stream of time-stamped events, we estimate the parameters of the renewal process for each node or edge based on the inter-arrival times.
[Figure: discrete-event sequence t1 ... t5 and the corresponding inter-arrival time distribution on the range xmin to xmax]
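One way to maintain per-node or per-edge statistics with constant work per event is sketched below. The slides do not say which parameters are estimated or how, so the class name `RenewalStats` and the tracked quantities (count, mean, smallest and largest inter-arrival time) are assumptions chosen to match the xmin and xmax labels in the figure.

```python
import math


class RenewalStats:
    """Running summary of inter-arrival times for one node or edge (hypothetical helper)."""

    def __init__(self):
        self.last_time = None   # timestamp of the most recent event
        self.count = 0          # number of inter-arrival times seen
        self.total = 0.0        # sum of inter-arrival times
        self.xmin = math.inf    # smallest inter-arrival time
        self.xmax = 0.0         # largest inter-arrival time

    def observe(self, t):
        """Record an event at time t; O(1) work per event."""
        if self.last_time is not None:
            gap = t - self.last_time
            self.count += 1
            self.total += gap
            self.xmin = min(self.xmin, gap)
            self.xmax = max(self.xmax, gap)
        self.last_time = t

    @property
    def mean(self):
        """Mean inter-arrival time, or NaN before two events have been seen."""
        return self.total / self.count if self.count else math.nan
```

Usage: create one `RenewalStats` per node or edge and call `observe(t)` for each time-stamped event in arrival order.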
Recency
Goal: highlight recent activity
Key idea: more recent = more relevant
Challenge: The most frequent communicators will always seem “recent”, overshadowing others’ behavior.
We call this time-scale bias.
[Figure: event timelines for users alice1337 and bob_iz_kewl, from 8:00 am through 10:00 am and 12:00 pm to now]
Recency
We can overcome time-scale bias by using the REWARDS Model.
We first derive the limit distribution of the $\mathrm{Age}_\Phi$ function:
$$F^{\mathrm{Age}*}_\Phi(\tau) = \lim_{t \to \infty} \Pr\left(\mathrm{Age}_\Phi(t) \le \tau\right)$$
We define the recency of $\Phi$ at time $t$ to be:
$$\mathrm{Rec}_\Phi(t) = 1 - F^{\mathrm{Age}*}_\Phi\left(\mathrm{Age}_\Phi(t)\right)$$
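The slides do not show the derivation of the limit distribution. A standard renewal-theory result, assuming a non-lattice inter-arrival distribution with CDF $F$ and finite mean $\mu$ (notation introduced here, not taken from the slides), is $F^{\mathrm{Age}*}_\Phi(\tau) = \frac{1}{\mu}\int_0^\tau (1 - F(x))\,dx$. Plugging in the empirical distribution of observed inter-arrival times $S_1, \ldots, S_n$ gives $\sum_i \min(S_i, \tau) / \sum_i S_i$. The sketch below uses that plug-in estimate; it is one possible estimator, not necessarily the one used by the authors.

```python
def limiting_age_cdf(iats, tau):
    """Plug-in estimate of the limiting age CDF from observed inter-arrival times."""
    if tau <= 0:
        return 0.0
    return sum(min(s, tau) for s in iats) / sum(iats)


def recency(iats, current_age):
    """Recency = 1 - limiting age CDF evaluated at the current age."""
    return 1.0 - limiting_age_cdf(iats, current_age)
```

A frequent communicator with inter-arrival times of 1 and an age of 0.5 gets the same recency as an infrequent one with inter-arrival times of 10 and an age of 5, which is the normalization the recency function is meant to provide.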
Recency
$\mathrm{Rec}_\Phi$ is a decreasing function on every interval between consecutive events. It also satisfies the uniformity property: for any renewal process $\Phi$, the limit distribution of $\mathrm{Rec}_\Phi(t)$ is Uniform(0,1).
Recency effectively normalizes the age of a process relative to its own temporal dynamics, making our approach robust to differences in time scale between networks or between entities within the same network.
[Figure: Recency of Edge <3,22> in Bluetooth Dataset]
Delay
Goal: measure influence of entity A on entity B
Key idea: study pairwise (A,B)-gaps
Challenge: More frequent communicators will tend to always have shorter “gaps”.
Another example of time-scale bias.
[Figure: event timelines for users alice1337 and bob_iz_kewl, from 8:00 am through 10:00 am and 12:00 pm to now]
Delay
Given renewal processes $\Phi$ and $\Psi$, we say an ordered pair of events, one from $\Phi$ and one from $\Psi$, is adjacent if [...] and [...]. We refer to the elapsed time between the two events as the pairwise gap. We denote by $\mathrm{Gap}_{\Phi,\Psi}(t)$ the most recent such gap at time $t$.
If $\Phi$ and $\Psi$ are independent processes, then we can derive the limit distribution $F^{\mathrm{Gap}*}_{\Phi,\Psi}$ of pairwise gaps between consecutive event pairs.
We define the $(\Phi,\Psi)$-delay at time $t$ to be:
$$\mathrm{Del}_{\Phi,\Psi}(t) = 1 - F^{\mathrm{Gap}*}_{\Phi,\Psi}\left(\mathrm{Gap}_{\Phi,\Psi}(t)\right)$$
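The exact adjacency conditions are not recoverable from this transcript. The sketch below therefore assumes one plausible reading: an event of $\Phi$ and a later event of $\Psi$ are adjacent when no other event of either process falls between them. Under that assumption, pairwise gaps can be extracted by merging the two event sequences. The function names are hypothetical, and the limit distribution of gaps needed for the delay itself is not reproduced here.

```python
def pairwise_gaps(phi_times, psi_times):
    """Return (time_of_psi_event, gap) for each assumed-adjacent (phi, psi) pair.

    Assumption: a pair is adjacent when a phi event is immediately followed,
    in the merged timeline, by a psi event at a strictly later time.
    """
    merged = sorted([(t, 0) for t in phi_times] + [(t, 1) for t in psi_times])
    gaps = []
    for (t_prev, src_prev), (t_next, src_next) in zip(merged, merged[1:]):
        if src_prev == 0 and src_next == 1 and t_prev < t_next:
            gaps.append((t_next, t_next - t_prev))
    return gaps


def most_recent_gap(phi_times, psi_times, t):
    """Most recent pairwise gap completed at or before time t, or None if there is none."""
    observed = [g for (when, g) in pairwise_gaps(phi_times, psi_times) if when <= t]
    return observed[-1] if observed else None
```

With events of $\Phi$ at times 1, 5, 6 and events of $\Psi$ at times 2, 3, 8, this reading yields a gap of 1 (completed at time 2) and a gap of 2 (completed at time 8).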
Delay
$\mathrm{Del}_{\Phi,\Psi}$ is a constant function on every interval between consecutive adjacent event pairs, and also satisfies the uniformity property: for any pair of independent renewal processes $\Phi$ and $\Psi$, the limit distribution of $\mathrm{Del}_{\Phi,\Psi}(t)$ is Uniform(0,1).
By comparing an observed gap to the theoretical joint distribution of inter-arrival times for $\Phi$ and $\Psi$, delay effectively normalizes the gap relative to the temporal dynamics of $\Phi$ and $\Psi$ individually.
As with the recency function, this makes our approach robust to differences in time scale between networks or between entities within the same network.
Divergence
Based on the Kolmogorov-Smirnov statistic, which compares the empirical distribution function (EDF) $F_n(x)$ to a hypothetical CDF $F(x)$:
$$KS(F_n \,\|\, F) = \sup_x \left( F_n(x) - F(x) \right)$$
Recency divergence compares recency values for a set of nodes or edges to the CDF for Uniform(0,1)
Delay divergence compares delay values for a set of edges, or for all (A,B)-gaps, to the CDF for Uniform(0,1)
[Figure: EDF Fn(x) plotted against CDF F(x) on the interval 0 to 1; KS = 0.32]
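A minimal sketch of the statistic against Uniform(0,1), where $F(x) = x$, follows. It implements the one-sided form exactly as written on the slide (no absolute value); whether the authors additionally scale or symmetrize the statistic is not stated, so no scaling is applied. The function name is hypothetical.

```python
def ks_divergence_uniform(values):
    """One-sided KS statistic sup_x (F_n(x) - x) for values in [0, 1].

    F_n is the right-continuous empirical distribution function, so the
    supremum is reached at a sample point (or equals 0 at x = 0).
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    # At the i-th smallest value (1-based), F_n equals i / n.
    best = max((i + 1) / n - x for i, x in enumerate(xs))
    return max(best, 0.0)
```

Values concentrated near 0 push the EDF above the diagonal and produce a large statistic; a two-sided variant would take the absolute value of the difference instead.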
Streaming Node-Centric Algorithm
• Goal: Flag times at which a node exhibits anomalous activity (indicated by an unusually high concentration of recent outgoing communication)
• Approach: Since the recency function is decreasing between consecutive communications, measure the recency divergence at a node only at times when new activity occurs
The MCD Algorithm
• Goal: Identify subgraphs with correlated behavior
• Recency divergence to find recent anomalous activity
• Delay divergence to identify spheres of influence
Challenge: How do we overcome the combinatorial explosion?
The MCD Algorithm
Maximal Component Divergence Algorithm
1. Calculate edge weights using recency or delay function
2. Gradually decrease the threshold, updating components and divergence values as necessary
3. Output: Disjoint components with max divergence
[Figure: example graph on vertices V1 to V5 with edge weights 0.9, 0.75, 0.7, 0.5, 0.3, 0.1, and the resulting component tree annotated with divergence values]
θ | Component | Div(C)
0.9 | {V1,V2} | 2.908
0.75 | {V1,V2,V3} | 2.723
0.7 | {V1,V2,V3} | 6.132
0.5 | {V4,V5} | 1.143
0.3 | {V1,V2,V3,V4,V5} | 2.380
0.1 | {V1,V2,V3,V4,V5} | 1.882
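The three steps can be sketched with a union-find structure that adds edges in order of decreasing weight. Several parts are assumptions: the slides do not define Div(C) precisely, so the scoring function is passed in as a parameter; the final selection is done greedily by descending score; and the edge endpoints in the example are chosen for illustration so that the components match the table. This unoptimized version does not achieve the O(m log m) bound quoted later.

```python
def mcd(edges, div):
    """Sketch of a threshold sweep with greedy selection of disjoint components.

    edges: list of (u, v, weight); div: callable scoring a list of edge weights.
    Returns a list of (score, theta, frozenset_of_vertices).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    members = {}     # root -> set of vertices in the component
    weights = {}     # root -> weights of edges inside the component
    candidates = []  # one candidate component per threshold value

    # Lowering the threshold corresponds to adding edges by decreasing weight.
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        for r in (ru, rv):
            members.setdefault(r, {r})
            weights.setdefault(r, [])
        if ru != rv:
            if len(members[ru]) < len(members[rv]):
                ru, rv = rv, ru  # merge the smaller component into the larger
            parent[rv] = ru
            members[ru] |= members.pop(rv)
            weights[ru] += weights.pop(rv)
        weights[ru].append(w)
        candidates.append((div(weights[ru]), w, frozenset(members[ru])))

    # Greedy choice (an assumption): best-scoring components that do not overlap.
    chosen, used = [], set()
    for score, theta, comp in sorted(candidates, key=lambda c: -c[0]):
        if used.isdisjoint(comp):
            chosen.append((score, theta, comp))
            used |= comp
    return chosen


def toy_div(ws):
    """Placeholder score (not the authors' Div): total edge weight in excess of 0.5."""
    return sum(w - 0.5 for w in ws)


# Hypothetical endpoints; only the weights come from the slide.
example_edges = [
    ("V1", "V2", 0.9), ("V2", "V3", 0.75), ("V1", "V3", 0.7),
    ("V4", "V5", 0.5), ("V3", "V4", 0.3), ("V2", "V5", 0.1),
]
```

With the placeholder score, `mcd(example_edges, toy_div)` selects {V1,V2,V3} at θ = 0.7 and {V4,V5} at θ = 0.5, the same two components that carry the largest disjoint divergence values in the table.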
Maximal Component Divergence Algorithm
Sample Output
MCD | θ | #V(C) | E-frac | %E(C) | %E(G)
14.57 | 0.07 | 54 | 53/212 | 0.25 | 0.08
12.84 | 0.08 | 32 | 31/88 | 0.35 | 0.08
3.70 | 0.10 | 6 | 5/7 | 0.71 | 0.10
2.97 | 0.18 | 5 | 4/4 | 1.00 | 0.14
1.91 | 0.05 | 7 | 6/41 | 0.15 | 0.04
Robustness to Time Scale
• Simulation: R-MAT model, 128 vertices, average degree 16
• Inter-arrival times (IATs) for edge activity sampled from Bounded Pareto distributions, rate parameter between 10 minutes and 1 week
• Every 5 days, a randomly selected node has anomalous activity at 10x its normal rate
• Conclusion: While it takes longer for anomalous activity to be recognized at nodes with lower rates, the magnitude of the peak seems to be independent of activity rate but highly correlated with degree
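For readers who want to reproduce a similar simulation, inter-arrival times can be drawn from a Bounded Pareto distribution by inverting its CDF. The parameter names `low`, `high` and `alpha`, and the choice to express times in minutes with bounds of 10 minutes and 1 week, are assumptions; the slides do not state how the distribution was parameterized.

```python
import random


def sample_bounded_pareto(low, high, alpha, rng=random):
    """Draw one sample from a Bounded Pareto(low, high, alpha) by inverse-CDF sampling."""
    u = rng.random()
    ha, la = high ** alpha, low ** alpha
    return (-(u * ha - u * la - ha) / (ha * la)) ** (-1.0 / alpha)


def simulate_event_times(n_events, low, high, alpha, seed=0):
    """Generate event times whose inter-arrival times are Bounded Pareto samples."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_events):
        t += sample_bounded_pareto(low, high, alpha, rng)
        times.append(t)
    return times


# Example: 1000 events with inter-arrival times between 10 minutes and 1 week (in minutes).
times = simulate_event_times(1000, 10.0, 7 * 24 * 60.0, 1.5, seed=1)
```

An anomalous period at 10x the normal rate could then be simulated by dividing the sampled inter-arrival times by 10 over the chosen window.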
Accuracy and Precision
• Simulation: star network, 100 trials with only normal activity and 100 trials including a period of anomalous activity
• ROC curves show accuracy and precision for several methods for distinguishing between the two scenarios
• Conclusion: Especially when variability is introduced, our approach outperforms the WtdDeg and Z-Score metrics
Detection Latency
• Data: Enron corpus, 1k nodes, 2k edges, 4k timestamps
• Compare our approach with the GraphScope algorithm
• Conclusion: The two algorithms seem to identify similar times of anomalous activity, but our approach based on the REWARDS model has a shorter response time
Anomaly Detection in IP Traffic
• Data: LBNL network trace, > 9 million timestamps during one hour on December 15, 2004
• Compare our approach with total network volume and with “scanning activity” labeled by LBNL analysts
• Three of the four times of highest divergence correspond to labeled scanning activity
• The peak in scanning activity at 12:07pm is primarily due to an increase in DNS and NBNS lookups
• The peak at 12:26pm was not flagged by the analysts since the sequence of IP addresses was not monotonic
Complexity Analysis
Dataset: Twitter messages, Nov. 2008 – Oct. 2009 (263k nodes, 308k edges, 1.1 million timestamps)
Updates: O(1) per communication
MCD Algorithm: O(m log m), where m = # of edges; can be approximated in effectively O(m) time
[Figure: Runtime for MCD Algorithm; x-axis: number of live edges (0 to 60,000); y-axis: runtime in milliseconds (0 to 2000)]
Future Work
Incorporate duration of communication and other node or edge attributes into our model
Make use of geographical and textual content
Use gap divergence to infer links, compare to approach of Gomez-Rodriguez et al.
Develop streaming algorithm to identify emerging trends
Acknowledgements
Part of this work was conducted at Lawrence Livermore National Laboratory, under the guidance of Tina Eliassi-Rad.
This project is partially supported by a DHS Career Development Grant, under the auspices of CCICADA, a DHS Center of Excellence.
Questions?