Streaming Models and Algorithms for Communication and Information Networks

Brian Thompson (joint work with James Abello)


Post on 31-Dec-2015


Page 1: Streaming Models and Algorithms for Communication and Information Networks

Streaming Models and Algorithms for Communication and Information Networks

Brian Thompson (joint work with James Abello)

Page 2: Streaming Models and Algorithms for Communication and Information Networks

Outline

Introduction and Motivation
A Streaming Model
Our Approach
Algorithms
Experimental Results
Conclusions and Future Work

Page 3: Streaming Models and Algorithms for Communication and Information Networks

Outline

Introduction and Motivation
A Streaming Model
Our Approach
Algorithms
Experimental Results
Conclusions and Future Work

Page 4: Streaming Models and Algorithms for Communication and Information Networks

Problem Description

Data: A network (G; T)
• G = (V, E) is a graph
• T is a set of time-stamped events corresponding to nodes or edges in G

Goals:
• Identify recent correlated activity
• Measure influence between entities

Challenges:
• Scalability – networks may be very large, limited space
• Efficiency – high data rate, time-sensitive information
• Variability – entities have different temporal dynamics

Page 5: Streaming Models and Algorithms for Communication and Information Networks

Related Work

Time-evolving graph model: a sequence of “snapshots” (t = 1, t = 2, t = 3, t = 4)

Time series analysis

[Figure: IP Traffic (MB Per Hour), plotted hourly from 12:00 AM to 11:00 PM]

Page 6: Streaming Models and Algorithms for Communication and Information Networks

Related Work

Cascade model – set of seed nodes, information (product, news, virus) propagates through network

Page 7: Streaming Models and Algorithms for Communication and Information Networks

Outline

Introduction and Motivation
A Streaming Model
Our Approach
Algorithms
Experimental Results
Conclusions and Future Work

Page 8: Streaming Models and Algorithms for Communication and Information Networks

Data Model

G is a graph

T is a set of time-stamped events corresponding to nodes or edges in G

Source   Recipient   Content                  Timestamp
Alice    (public)    “Fire at 2nd & Main!”    Tuesday, 9:25am
Bob      Cheng       (private message)        Tuesday, 9:27am
Cheng    (public)    “RT @Alice Fire ...”     Tuesday, 9:28am

[Figure: graph on nodes Alice, Bob, Cheng, Devika, Elina]

Page 9: Streaming Models and Algorithms for Communication and Information Networks

Data Model (Node-centric)

[Figure: node-centric view of Alice, Bob, Cheng, Devika, Elina]

Page 10: Streaming Models and Algorithms for Communication and Information Networks

Data Model (Edge-centric)

[Figure: edge-centric view of Alice, Bob, Cheng, Devika, Elina]

Page 11: Streaming Models and Algorithms for Communication and Information Networks

Renewal Theory

A renewal process $\Phi$ is a continuous-time stochastic process in which state transitions occur with holding times sampled independently from a positive distribution.

Let $S_1, S_2, \dots$ be samples from that distribution, and consider a sequence of events $t_1, t_2, \dots$ corresponding to those holding times.

We call the $S_i$ inter-arrival times, and refer to the sequence $t_1, t_2, \dots$ as the discrete-event sequence for $\Phi$.

[Figure: timeline starting at 0 with events $t_1, \dots, t_5$ and the inter-arrival time $S_3$ labeled]

Page 12: Streaming Models and Algorithms for Communication and Information Networks

Renewal Theory

The age of a renewal process $\Phi$ at time $t$ is the amount of time elapsed since the last event:

$$\mathrm{Age}_\Phi(t) = \begin{cases} t - \max\{t_i : t_i < t\} & \text{if } t \ge t_1 \\ \infty & \text{otherwise} \end{cases}$$

[Figure: timeline starting at 0 with events $t_1, \dots, t_5$ and a query time $t$, with $\mathrm{Age}_\Phi(t)$ marked]
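The age definition can be illustrated with a short Python sketch. It assumes the event times are kept in a sorted list; the function name `age` and the list representation are illustrative choices, not part of the slides. Following the strict inequality in the definition, an event at exactly time t does not count as "the last event".

```python
from bisect import bisect_left
import math


def age(event_times, t):
    """Time elapsed at t since the latest event strictly before t.

    event_times must be sorted in increasing order. Returns infinity
    when no event precedes t (the "otherwise" branch of the definition).
    """
    k = bisect_left(event_times, t)  # number of events with t_i < t
    if k == 0:
        return math.inf
    return t - event_times[k - 1]
```

For example, with events at times 1, 3 and 7, `age([1, 3, 7], 5)` looks back to the event at time 3.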

Page 13: Streaming Models and Algorithms for Communication and Information Networks

The REWARDS Model: REneWal theory Approach for Real-time Data Streams

We model a stream of communication data from a node or across an edge as a renewal process.

[Figure: discrete-event sequence $t_1, \dots, t_5$ and the corresponding inter-arrival time distribution, with xmin and xmax marked]

Page 14: Streaming Models and Algorithms for Communication and Information Networks

The REWARDS Model: REneWal theory Approach for Real-time Data Streams

Given a stream of time-stamped events, we estimate the parameters of the renewal process for each node or edge based on the inter-arrival times.

[Figure: discrete-event sequence $t_1, \dots, t_5$ and the corresponding inter-arrival time distribution, with xmin and xmax marked]
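One way to picture per-entity estimation from a stream is the sketch below. It keeps constant-size state per node or edge and updates it once per event. The slides do not say which parameters are estimated or how, so the tracked quantities (count, mean, smallest and largest inter-arrival time) and the names `IatState` and `ingest` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterable, Optional, Tuple
import math


@dataclass
class IatState:
    """Running inter-arrival-time summary for one node or edge (hypothetical)."""
    last_time: Optional[float] = None
    count: int = 0          # number of inter-arrival times seen so far
    total: float = 0.0      # sum of inter-arrival times
    x_min: float = math.inf
    x_max: float = 0.0

    def update(self, t: float) -> None:
        """Fold one new event timestamp into the summary in O(1) time."""
        if self.last_time is not None:
            gap = t - self.last_time
            self.count += 1
            self.total += gap
            self.x_min = min(self.x_min, gap)
            self.x_max = max(self.x_max, gap)
        self.last_time = t

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else math.nan


def ingest(stream: Iterable[Tuple[Hashable, float]]) -> Dict[Hashable, IatState]:
    """Consume (entity, timestamp) pairs, assumed time-ordered per entity."""
    states: Dict[Hashable, IatState] = {}
    for key, t in stream:
        states.setdefault(key, IatState()).update(t)
    return states
```

Each entity here is any hashable key, so the same code serves node-centric keys such as `"Alice"` and edge-centric keys such as `("Bob", "Cheng")`.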

Page 15: Streaming Models and Algorithms for Communication and Information Networks

Outline

Introduction and Motivation
A Streaming Model
Our Approach
Algorithms
Experimental Results
Conclusions and Future Work

Page 16: Streaming Models and Algorithms for Communication and Information Networks

Recency

Goal: highlight recent activity
Key idea: more recent = more relevant

Challenge: The most frequent communicators will always seem “recent”, overshadowing others’ behavior.

We call this time-scale bias.

[Figure: event timelines for users alice1337 and bob_iz_kewl, from 8:00 am through 12:00 pm to NOW]

Page 17: Streaming Models and Algorithms for Communication and Information Networks

Recency

We can overcome time-scale bias by using the REWARDS Model.

We first derive the limit distribution of the $\mathrm{Age}_\Phi$ function:

$$F^{\mathrm{Age}*}_\Phi(\tau) = \lim_{t \to \infty} \Pr\big(\mathrm{Age}_\Phi(t) \le \tau\big)$$

We define the recency of $\Phi$ at time $t$ to be:

$$\mathrm{Rec}_\Phi(t) = 1 - F^{\mathrm{Age}*}_\Phi\big(\mathrm{Age}_\Phi(t)\big)$$
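A minimal sketch of how recency could be evaluated from data follows. It relies on a standard renewal-theory result: for inter-arrival CDF $F$ with finite mean $\mu$ (non-lattice case), the limiting age distribution is $\frac{1}{\mu}\int_0^\tau (1 - F(s))\,ds$. Plugging in the empirical distribution of observed inter-arrival times $S_1, \dots, S_n$ turns this into $\sum_i \min(S_i, \tau) / \sum_i S_i$. The plug-in estimator and the function names are assumptions for illustration; the slides do not state how the authors estimate the limit distribution.

```python
import math


def limiting_age_cdf(iats, tau):
    """Empirical plug-in for the limiting age CDF at tau.

    iats: observed inter-arrival times (positive numbers, non-empty).
    Uses sum(min(S_i, tau)) / sum(S_i), the integral of the empirical
    survival function up to tau divided by the empirical mean.
    """
    tau = max(tau, 0.0)
    return sum(min(s, tau) for s in iats) / sum(iats)


def recency(iats, current_age):
    """Recency = 1 - limiting age CDF evaluated at the current age."""
    return 1.0 - limiting_age_cdf(iats, current_age)
```

With this estimator an entity that has just communicated (age 0) has recency 1, and one whose current age exceeds every observed inter-arrival time, including an infinite age, has recency 0.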

Page 18: Streaming Models and Algorithms for Communication and Information Networks

Recency

$\mathrm{Rec}_\Phi$ is a decreasing function on every interval between consecutive events. It also satisfies the uniformity property: for any renewal process $\Phi$, the limit distribution of $\mathrm{Rec}_\Phi(t)$ is Uniform(0,1).

Recency effectively normalizes the age of a process relative to its own temporal dynamics, making our approach robust to differences in time scale between networks or between entities within the same network.

[Figure: Recency of Edge <3,22> in Bluetooth Dataset]

Page 19: Streaming Models and Algorithms for Communication and Information Networks

Delay

Goal: measure influence of entity A on entity B
Key idea: study pairwise (A,B)-gaps

Challenge: More frequent communicators will tend to always have shorter “gaps”.

Another example of time-scale bias.

[Figure: event timelines for users alice1337 and bob_iz_kewl, from 8:00 am through 12:00 pm to NOW]

Page 20: Streaming Models and Algorithms for Communication and Information Networks

Delay

Given renewal processes $\Phi$ and $\Psi$, we say an ordered pair of events (one from $\Phi$, one from $\Psi$) is adjacent if … and … . We refer to the elapsed time between them as the pairwise gap. We denote by $\mathrm{Gap}_{\Phi,\Psi}(t)$ the most recent such gap at time $t$.

If $\Phi$ and $\Psi$ are independent processes, then we can derive the limit distribution $F^{\mathrm{Gap}*}_{\Phi,\Psi}$ of pairwise gaps between consecutive event pairs.

We define the $(\Phi,\Psi)$-delay at time $t$ to be:

$$\mathrm{Del}_{\Phi,\Psi}(t) = 1 - F^{\mathrm{Gap}*}_{\Phi,\Psi}\big(\mathrm{Gap}_{\Phi,\Psi}(t)\big)$$
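The sketch below shows one way to extract pairwise gaps from two event streams. The exact adjacency conditions are not legible in this transcript, so the rule used here is an assumption: an event a of the first stream and an event b of the second are treated as adjacent when a < b and no event of either stream lies between them in the merged timeline. The function name `pairwise_gaps` is likewise hypothetical.

```python
def pairwise_gaps(phi_times, psi_times):
    """Gaps b - a over adjacent pairs (a from phi, b from psi).

    Assumed adjacency rule: a < b, and a and b are neighbours in the
    merged, time-sorted sequence of both streams.
    """
    # Tag each timestamp with its stream: 0 for phi, 1 for psi.
    merged = sorted([(t, 0) for t in phi_times] + [(t, 1) for t in psi_times])
    gaps = []
    for (t0, src0), (t1, src1) in zip(merged, merged[1:]):
        if src0 == 0 and src1 == 1 and t0 < t1:
            gaps.append(t1 - t0)
    return gaps
```

Under this rule the most recent gap at time t is simply the last element produced from the events observed up to t; turning it into a delay value still requires the limit distribution of gaps, which this sketch does not attempt to derive.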

Page 21: Streaming Models and Algorithms for Communication and Information Networks

Delay

$\mathrm{Del}_{\Phi,\Psi}$ is a constant function on every interval between consecutive adjacent event pairs, and also satisfies the uniformity property: for any pair of independent renewal processes $\Phi$ and $\Psi$, the limit distribution of $\mathrm{Del}_{\Phi,\Psi}(t)$ is Uniform(0,1).

By comparing an observed gap to the theoretical joint distribution of inter-arrival times for $\Phi$ and $\Psi$, delay effectively normalizes the gap relative to the temporal dynamics of $\Phi$ and $\Psi$ individually.

Similarly to the recency function, this makes our approach robust to differences in time scale between networks or between entities within the same network.

Page 22: Streaming Models and Algorithms for Communication and Information Networks

Outline

Introduction and Motivation
A Streaming Model
Our Approach
Algorithms
Experimental Results
Conclusions and Future Work

Page 23: Streaming Models and Algorithms for Communication and Information Networks

Divergence

Based on the Kolmogorov-Smirnov statistic, which compares the empirical distribution function (EDF) $F_n(x)$ to a hypothesized CDF $F(x)$:

$$\mathrm{KS}(F_n \,\|\, F) = \sup_x \big(F_n(x) - F(x)\big)$$

Recency divergence compares recency values for a set of nodes or edges to the CDF for Uniform(0,1).

Delay divergence compares delay values for a set of edges, or for all (A,B)-gaps, to the CDF for Uniform(0,1).

[Figure: EDF $F_n(x)$ and CDF $F(x)$ plotted on [0, 1], with KS = 0.32]
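A small sketch of the comparison against Uniform(0,1) follows. Because the uniform CDF on [0, 1] is $F(x) = x$, the supremum only needs to be checked at the sorted sample points. The function computes both the signed form $\sup(F_n - F)$ as written on the slide and the standard two-sided form $\sup|F_n - F|$; which variant the authors use in practice is not stated, so the `two_sided` switch and the name `ks_uniform` are assumptions.

```python
def ks_uniform(values, two_sided=True):
    """KS distance between the EDF of values (all in [0, 1]) and Uniform(0,1)."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    # Largest amount by which the EDF rises above the line F(x) = x.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    if not two_sided:
        return d_plus
    # Largest amount by which the EDF falls below the line F(x) = x.
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)
```

Note that in this sketch values bunched near 1, such as `[0.9, 0.95, 1.0]`, give a large two-sided distance but a signed distance of 0, so the choice of variant matters when the values of interest are the high ones.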

Page 24: Streaming Models and Algorithms for Communication and Information Networks

Streaming Node-Centric Algorithm

• Goal: Flag times at which a node exhibits anomalous activity (indicated by an unusually high concentration of recent outgoing communication)

• Approach: Since the recency function is decreasing between consecutive communications, measure the recency divergence at a node only at times at which new activity occurs

Page 25: Streaming Models and Algorithms for Communication and Information Networks

The MCD Algorithm (Maximal Component Divergence)

• Goal: Identify subgraphs with correlated behavior

• Recency divergence to find recent anomalous activity

• Delay divergence to identify spheres of influence

Challenge: How do we overcome the combinatorial explosion?

Page 26: Streaming Models and Algorithms for Communication and Information Networks

The MCD Algorithm (Maximal Component Divergence)

1. Calculate edge weights using recency or delay function

2. Gradually decrease the threshold, updating components and divergence values as necessary

3. Output: Disjoint components with max divergence

θ      Component            Div(C)
0.9    {V1,V2}              2.908
0.75   {V1,V2,V3}           2.723
0.7    {V1,V2,V3}           6.132
0.5    {V4,V5}              1.143
0.3    {V1,V2,V3,V4,V5}     2.380
0.1    {V1,V2,V3,V4,V5}     1.882

[Figure: example graph on V1 to V5 with edge weights 0.9, 0.75, 0.7, 0.5, 0.3, 0.1, annotated with component divergence values 2.9, 2.7, 6.1, 1.1, 2.4]
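The three steps can be sketched with a union-find structure: edges are added in order of decreasing weight (equivalent to lowering the threshold θ), the component touched by each new edge is re-scored, and a disjoint set of high-scoring components is picked at the end. Several parts are assumptions rather than the authors' method: the slides do not define Div(C), so `scaled_ks` (√n times the two-sided KS distance of a component's edge weights from Uniform(0,1)) is only a stand-in, and the greedy selection of disjoint components is one plausible reading of step 3. The names `mcd_sketch` and `scaled_ks` are hypothetical.

```python
import math
from typing import Callable, FrozenSet, List, Sequence, Tuple

Edge = Tuple[str, str, float]  # (u, v, weight); weight is a recency or delay value


def scaled_ks(weights: Sequence[float]) -> float:
    """Stand-in divergence (assumption): sqrt(n) * two-sided KS vs Uniform(0,1)."""
    xs = sorted(weights)
    n = len(xs)
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    return math.sqrt(n) * d


def mcd_sketch(
    edges: Sequence[Edge],
    divergence: Callable[[Sequence[float]], float] = scaled_ks,
) -> List[Tuple[float, float, FrozenSet[str]]]:
    """Return (divergence, theta, component) triples for disjoint components."""
    parent, nodes, weights = {}, {}, {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    candidates = []
    # Step 2: lowering the threshold == adding edges by decreasing weight.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        for x in (u, v):
            if x not in parent:
                parent[x], nodes[x], weights[x] = x, {x}, []
        ru, rv = find(u), find(v)
        if ru != rv:
            if len(nodes[ru]) < len(nodes[rv]):
                ru, rv = rv, ru  # merge the smaller component into the larger
            parent[rv] = ru
            nodes[ru] |= nodes.pop(rv)
            weights[ru] += weights.pop(rv)
        weights[ru].append(w)
        candidates.append((divergence(weights[ru]), w, frozenset(nodes[ru])))

    # Step 3 (assumed greedy rule): highest divergence first, skip overlaps.
    chosen, used = [], set()
    for div, theta, comp in sorted(candidates, key=lambda c: c[0], reverse=True):
        if used.isdisjoint(comp):
            chosen.append((div, theta, comp))
            used |= comp
    return chosen
```

This version recomputes the divergence from scratch after every edge, which keeps the code short but is slower than an implementation that updates divergence values incrementally.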

Page 27: Streaming Models and Algorithms for Communication and Information Networks

Sample Output

MCD     θ      #V(C)   E-frac   %E(C)   %E(G)
14.57   0.07   54      53/212   0.25    0.08
12.84   0.08   32      31/88    0.35    0.08
3.70    0.10   6       5/7      0.71    0.10
2.97    0.18   5       4/4      1.00    0.14
1.91    0.05   7       6/41     0.15    0.04

Page 28: Streaming Models and Algorithms for Communication and Information Networks

Outline

Introduction and Motivation
A Streaming Model
Our Approach
Algorithms
Experimental Results
Conclusions and Future Work

Page 29: Streaming Models and Algorithms for Communication and Information Networks

Robustness to Time Scale

• Simulation: R-MAT model, 128 vertices, avg. degree 16

• Inter-arrival times (IATs) for edge activity sampled from Bounded Pareto distributions, rate parameter between 10 mins. and 1 week

• Every 5 days, a randomly selected node has anomalous activity at 10x its normal rate

Page 30: Streaming Models and Algorithms for Communication and Information Networks

Robustness to Time Scale


Page 31: Streaming Models and Algorithms for Communication and Information Networks

Robustness to Time Scale


• Conclusion: While it takes longer for anomalous activity to be recognized at nodes with lower rates, the magnitude of the peak seems to be independent of activity rate but highly correlated with degree

Page 32: Streaming Models and Algorithms for Communication and Information Networks

Accuracy and Precision

• Simulation: star network, 100 trials with only normal activity and 100 trials including a period of anomalous activity

• ROC curves show accuracy and precision for several methods for distinguishing between the two scenarios

• Conclusion: Especially when variability is introduced, our approach outperforms the WtdDeg and Z-Score metrics

Page 33: Streaming Models and Algorithms for Communication and Information Networks

Detection Latency


• Data: Enron corpus, 1k nodes, 2k edges, 4k timestamps

• Compare our approach with GraphScope Algorithm

• Conclusion: The two algorithms seem to identify similar times of anomalous activity, but our approach based on the REWARDS model has shorter response time

Page 34: Streaming Models and Algorithms for Communication and Information Networks

Anomaly Detection in IP Traffic


• Data: LBNL network trace, > 9 million timestamps during one hour on December 15, 2004

• Compare our approach with total network volume and with “scanning activity” labeled by LBNL analysts

Page 35: Streaming Models and Algorithms for Communication and Information Networks

Anomaly Detection in IP Traffic

• Three of the four times of highest divergence correspond to labeled scanning activity

• The peak in scanning activity at 12:07pm is primarily due to an increase in DNS and NBNS lookups

• The peak at 12:26pm was not flagged by the analysts since the sequence of IP addresses was not monotonic

Page 36: Streaming Models and Algorithms for Communication and Information Networks

Complexity Analysis

Dataset: Twitter messages, Nov. 2008 – Oct. 2009 (263k nodes, 308k edges, 1.1 million timestamps)

Updates: O(1) per communication

MCD Algorithm: O(m log m), where m = # of edges; can be approximated in effectively O(m) time

[Figure: Runtime for MCD Algorithm; x-axis: number of live edges (0 to 60,000); y-axis: runtime in milliseconds (0 to 2000)]

Page 37: Streaming Models and Algorithms for Communication and Information Networks

Outline

Introduction and Motivation
A Streaming Model
Our Approach
Algorithms
Experimental Results
Conclusions and Future Work

Page 38: Streaming Models and Algorithms for Communication and Information Networks

Future Work

Incorporate duration of communication and other node or edge attributes into our model

Make use of geographical and textual content

Use gap divergence to infer links, compare to approach of Gomez-Rodriguez et al.

Develop streaming algorithm to identify emerging trends

Page 39: Streaming Models and Algorithms for Communication and Information Networks

Acknowledgements

Part of this work was conducted at Lawrence Livermore National Laboratory, under the guidance of Tina Eliassi-Rad.

This project is partially supported by a DHS Career Development Grant, under the auspices of CCICADA, a DHS Center of Excellence.


Page 40: Streaming Models and Algorithms for Communication and Information Networks

Questions?
