
Post on 22-Dec-2015



Massive Streaming Data Analytics: A Case Study with Clustering Coefficients

David Ediger, Karl Jiang, Jason Riedy, David A. Bader

Georgia Institute of Technology, Atlanta, GA USA


STINGER Data Structure

• Spatio-temporal Interaction Networks and Graphs (STING) Extensible Representation

• General-purpose data structure for dynamic graphs

• Efficient edge insertion/deletion (updates) with concurrent readers (analysis)


STINGER Data Structure

• Array of linked lists, which may have empty slots (from deleting edges)

• Additional stored info not in paper

• Efficient updates

• Concurrent reads (no locking)
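The "array of linked lists with empty slots" layout can be illustrated with a small single-threaded sketch. This is not the STINGER implementation: the names `EdgeBlock`, `BlockedAdjacency`, `BLOCK_SIZE`, and the `EMPTY` marker are hypothetical, vertex IDs are assumed to be non-negative integers, and the lock-free concurrent reads mentioned above are not modeled.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

EMPTY = -1       # assumed marker for a slot freed by a deletion
BLOCK_SIZE = 4   # edges per block; an arbitrary illustrative value


@dataclass
class EdgeBlock:
    """One linked-list node holding a fixed number of edge slots."""
    slots: List[int] = field(default_factory=lambda: [EMPTY] * BLOCK_SIZE)
    next: Optional["EdgeBlock"] = None


class BlockedAdjacency:
    """Array indexed by vertex; each entry heads a linked list of edge blocks."""

    def __init__(self, num_vertices: int) -> None:
        self.heads: List[Optional[EdgeBlock]] = [None] * num_vertices

    def _blocks(self, u: int) -> Iterator[EdgeBlock]:
        block = self.heads[u]
        while block is not None:
            yield block
            block = block.next

    def has_edge(self, u: int, v: int) -> bool:
        return any(v in block.slots for block in self._blocks(u))

    def insert(self, u: int, v: int) -> bool:
        """Store edge u->v, reusing an empty slot if one exists."""
        if self.has_edge(u, v):
            return False
        for block in self._blocks(u):
            for i, dst in enumerate(block.slots):
                if dst == EMPTY:
                    block.slots[i] = v
                    return True
        # No free slot anywhere: prepend a fresh block to the list.
        new_block = EdgeBlock(next=self.heads[u])
        new_block.slots[0] = v
        self.heads[u] = new_block
        return True

    def delete(self, u: int, v: int) -> bool:
        """Remove edge u->v by marking its slot empty; blocks are never freed."""
        for block in self._blocks(u):
            for i, dst in enumerate(block.slots):
                if dst == v:
                    block.slots[i] = EMPTY
                    return True
        return False

    def neighbors(self, u: int) -> List[int]:
        return [d for block in self._blocks(u) for d in block.slots if d != EMPTY]
```

Deleting an edge only clears its slot, so a later insertion can fill the hole without allocating a new block; that is the sense in which the lists "may have empty slots".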


Assumptions for parallelism

• Single streaming source for inserts/deletes

• Changes are scattered widely

– Batches are sufficiently independent

• Analysis kernels have small range

– Graph change only requires access to local portions and affects small portion of output


Assumptions (continued)


Case Study: Updating Clustering Coefficients

• Clustering coefficients measure density of closed triangles:

• One way of determining if a graph is a small-world graph
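The formula on this slide did not survive transcription, so the sketch below assumes the common textbook definition of the local clustering coefficient, C(v) = 2·T(v) / (d(v)·(d(v) − 1)), where T(v) is the number of triangles through v and d(v) is its degree; the slide's own normalization may differ by a constant factor. The class `TriangleTracker` and its methods are hypothetical names. It shows one plausible way to keep triangle counts current under edge updates: inserting or deleting edge (u, v) only changes the counts of u, v, and their common neighbors.

```python
from collections import defaultdict


class TriangleTracker:
    """Maintains per-vertex triangle counts for an undirected simple graph."""

    def __init__(self) -> None:
        self.adj = defaultdict(set)        # vertex -> set of neighbors
        self.triangles = defaultdict(int)  # vertex -> triangles through it

    def insert_edge(self, u, v) -> None:
        if u == v or v in self.adj[u]:
            return  # ignore self-loops and duplicate edges
        # Every common neighbor w closes a new triangle (u, v, w).
        common = self.adj[u] & self.adj[v]
        for w in common:
            self.triangles[w] += 1
        self.triangles[u] += len(common)
        self.triangles[v] += len(common)
        self.adj[u].add(v)
        self.adj[v].add(u)

    def delete_edge(self, u, v) -> None:
        if v not in self.adj[u]:
            return
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Every common neighbor w loses the triangle (u, v, w).
        common = self.adj[u] & self.adj[v]
        for w in common:
            self.triangles[w] -= 1
        self.triangles[u] -= len(common)
        self.triangles[v] -= len(common)

    def coefficient(self, v) -> float:
        d = len(self.adj[v])
        if d < 2:
            return 0.0  # convention assumed here for degree < 2
        return 2.0 * self.triangles[v] / (d * (d - 1))
```

Each update touches only the two endpoints' neighbor sets, which matches the earlier assumption that a graph change needs access to local portions of the graph and affects a small portion of the output.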


Bloom filter

• Consider an edge list represented as a bit array (1 bit per edge) => O(n) storage space

• Bloom filter is a bit array with an arbitrary, smaller number of bits

• A hash function maps a vertex to a specific bit

• Small number of bits == high collision rate

• To reduce false-positives, use k independent hash functions to set multiple bits
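The bullets above can be sketched as a minimal Bloom filter over vertex IDs. The class name `BloomFilter` and its parameters are hypothetical, and the k hash functions are approximated by salting a single BLAKE2b hash with a seed, which is an assumption of this sketch rather than the method used in the talk.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Bit array with num_hashes salted hash functions per inserted vertex."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, vertex: int) -> Iterator[int]:
        # One bit position per seed; seeds stand in for independent hashes.
        for seed in range(self.num_hashes):
            data = f"{seed}:{vertex}".encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, vertex: int) -> None:
        for pos in self._positions(vertex):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, vertex: int) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(vertex)
        )
```

A membership test can return a false positive when other vertices happen to have set all k bits, but it never returns a false negative for a vertex that was added.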


Bloom filter


Testbed

• Massively multi-threaded Cray XMT

– 64 Threadstorm processors

  • Each running at 500 MHz

  • Each has 128 hardware streams maintaining a thread context

  • Context switches occur every cycle

– 512 GiB globally addressable shared memory (holds 2 billion vertices and 17 billion edges)

• Synthetic data

– 16 million vertices, ~500 million edges
