massive streaming data analytics: a case study with clustering coefficients david ediger karl jiang...
Post on 22-Dec-2015
Massive Streaming Data Analytics: A Case Study with Clustering Coefficients
David Ediger, Karl Jiang, Jason Riedy, David A. Bader
Georgia Institute of Technology
Atlanta, GA USA
1
STINGER Data Structure
• Spatio-temporal Interaction Networks and Graphs (STING) Extensible Representation
• General-purpose data structure for dynamic graphs
• Efficient edge insertion/deletion (updates) with concurrent readers (analysis)
2
STINGER Data Structure
• Array of linked lists, which may have empty slots (from deleting edges)
• Additional stored info not in paper
• Efficient updates
• Concurrent reads (no locking)
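The layout described above can be sketched in a few lines. This is a minimal, sequential illustration only, not the STINGER implementation: the names `EdgeBlock`, `BlockedGraph`, `BLOCK_SIZE`, and `EMPTY` are hypothetical, the block size is arbitrary, and the lock-free concurrency the slides mention is not modeled. It shows one list of fixed-size edge blocks per vertex, with deletions leaving empty slots that later insertions reuse.

```python
BLOCK_SIZE = 4   # hypothetical number of edge slots per block
EMPTY = -1       # marker for a slot freed by a deletion (or never used)


class EdgeBlock:
    """Fixed-size block of destination vertices, chained into a linked list."""

    def __init__(self):
        self.slots = [EMPTY] * BLOCK_SIZE
        self.next = None


class BlockedGraph:
    """Array indexed by source vertex; each entry heads a list of edge blocks."""

    def __init__(self, num_vertices):
        self.heads = [None] * num_vertices

    def insert(self, u, v):
        """Insert directed edge u -> v. Returns False if it already exists."""
        free = None
        block = self.heads[u]
        while block is not None:
            for i, dst in enumerate(block.slots):
                if dst == v:
                    return False
                if dst == EMPTY and free is None:
                    free = (block, i)      # remember the first reusable slot
            block = block.next
        if free is None:                   # no empty slot: prepend a new block
            new_block = EdgeBlock()
            new_block.next = self.heads[u]
            self.heads[u] = new_block
            free = (new_block, 0)
        free[0].slots[free[1]] = v
        return True

    def delete(self, u, v):
        """Delete edge u -> v by emptying its slot. Returns False if absent."""
        block = self.heads[u]
        while block is not None:
            for i, dst in enumerate(block.slots):
                if dst == v:
                    block.slots[i] = EMPTY
                    return True
            block = block.next
        return False

    def neighbors(self, u):
        """Yield the destinations of u, skipping empty slots."""
        block = self.heads[u]
        while block is not None:
            for dst in block.slots:
                if dst != EMPTY:
                    yield dst
            block = block.next
```

Deleting an edge never moves other edges, which is why a reader walking the list only has to skip empty slots.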
3
Assumptions for parallelism
• Single streaming source for inserts/deletes
• Changes are scattered widely
  – Batches are sufficiently independent
• Analysis kernels have small range
  – A graph change requires access only to local portions of the graph and affects only a small portion of the output
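The small-range assumption can be illustrated with per-vertex triangle counts. The sketch below assumes an undirected graph held as a dict of neighbor sets; the names `adj`, `tri`, and `insert_edge` are illustrative and do not come from the slides. Inserting an edge {u, v} reads only the two neighbor sets and changes triangle counts only at u, v, and their common neighbors.

```python
def insert_edge(adj, tri, u, v):
    """Insert undirected edge {u, v} and update triangle counts in place.

    adj maps each vertex to its set of neighbors; tri maps each vertex to
    the number of triangles it belongs to. Returns the set of vertices
    whose triangle count changed.
    """
    adj.setdefault(u, set())
    adj.setdefault(v, set())
    tri.setdefault(u, 0)
    tri.setdefault(v, 0)
    if u == v or v in adj[u]:
        return set()                 # self-loop or duplicate: nothing changes

    # Every common neighbor w closes exactly one new triangle (u, v, w).
    common = adj[u] & adj[v]
    adj[u].add(v)
    adj[v].add(u)
    for w in common:
        tri[w] += 1
    tri[u] += len(common)
    tri[v] += len(common)
    return ({u, v} | common) if common else set()
```

Vertices outside the returned set keep their old counts, so two updates whose affected sets do not overlap can be applied independently.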
4
Assumptions (continued)
5
Case Study: Updating Clustering Coefficients
• Clustering coefficients measure density of closed triangles:
• One way of determining if a graph is a small-world graph
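As a concrete reference point, here is a minimal sketch that assumes the standard local definition: for a vertex v with degree d(v) that belongs to T(v) triangles, C(v) = 2 T(v) / (d(v) (d(v) - 1)), taken as 0 when d(v) < 2. The function name `local_clustering` and the dict-of-sets graph representation are illustrative choices, not taken from the slides.

```python
from itertools import combinations


def local_clustering(adj):
    """Return {vertex: local clustering coefficient} for an undirected graph.

    adj maps each vertex to the set of its neighbors and must be symmetric.
    """
    coeffs = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            coeffs[v] = 0.0          # no neighbor pairs, so no triangles possible
            continue
        # Count neighbor pairs (a, b) that are themselves connected.
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs[v] = 2.0 * closed / (d * (d - 1))
    return coeffs
```

For a triangle 0-1-2 with an extra pendant vertex 3 attached to 0, vertices 1 and 2 have coefficient 1, vertex 0 has 1/3 (one closed pair out of three), and vertex 3 has 0.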
6
Bloom filter
• Consider an edge list represented as a bit array (1 bit per edge) => O(n) storage space
• Bloom filter is a bit array with an arbitrary, smaller number of bits
• A hash function maps a vertex to a specific bit
• Small number of bits == high collision rate
• To reduce false positives, use k independent hash functions to set multiple bits
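The bullets above describe a generic Bloom filter, which can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the class name `BloomFilter` and its parameters are hypothetical, and the k hash functions are simulated by salting SHA-256 with the hash index rather than by truly independent functions.

```python
import hashlib


class BloomFilter:
    """Bit array of num_bits bits; each item sets num_hashes bits."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A membership test can return a false positive when other items happen to set all k bits, but it never returns a false negative, so a "no" answer can be trusted without consulting the full edge list.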
7
Bloom filter
8
Testbed
• Massively multi-threaded Cray XMT
  – 64 Threadstorm processors
    • Each running at 500 MHz
    • Each has 128 hardware streams maintaining a thread context
    • Context switches occur every cycle
    • 512 GiB globally addressable shared memory
      – (holds 2 billion vertices and 17 billion edges)
• Synthetic data
  – 16 million vertices, ~500 million edges
9