
Page 1:

Opportunities and Challenges in Massive Data-Intensive Computing

David A. Bader

Page 2:

OPPORTUNITIES

David A. Bader 2

Page 3:

Opportunities

• Application-oriented Opportunities:
– High performance computing for massive graphs
– Streaming analytics
– Information Visualization techniques for massive graphs
– Heterogeneous systems: methodologies for combining the use of the Cloud and Manycore for high-performance computing
– Energy-efficient high-performance computing

David A. Bader 3

Page 4:

Opportunity 1: High performance computing for massive graphs

• Traditional HPC has focused primarily on solving large problems from chemistry, physics, and mechanics, using dense linear algebra.

• HPC faces new challenges to deal with:
– time-varying interactions among entities, and
– massive-scale graph abstractions where the vertices represent the nouns or entities and the edges represent their observed interactions.

• Few parallel computers run well on these problems because
– they often lack the locality required to get high performance from distributed-memory cache-based supercomputers.

• Case study: Massively threaded architectures are shown to run several orders of magnitude faster than the fastest supercomputers on these types of problems!

A focused research agenda is needed to design algorithms that scale on these new platforms.

David A. Bader 4

Page 5:

Opportunity 2: Streaming analytics

• While our high performance computers have delivered a sustained petaflop, they have done so using the same antiquated batch-processing style, where a program and a static data set are scheduled to compute in the next available slot.

• Today, data is overwhelming in volume and rate, and we struggle to keep up with these streams.

Fundamental computer science research is needed in the design of streaming architectures, and in data structures and algorithms that can compute important analytics while sitting in the middle of these torrential flows.

David A. Bader 5

Page 6:

Opportunity 3: Information Visualization techniques for massive graphs

• Information Visualization today
– addresses traditional scientific computing (fluid flow, molecular dynamics), or
– when handling discrete data, scales to only hundreds of vertices at best.

However, there is a strong need for visualization in the data sciences, so that analysts can gain understanding from data sets with millions to billions of interacting non-planar discrete entities.

– Applications include: data mining, intelligence, situational awareness

David A. Bader 6

NNDB Mapper of George Washington

Twitter social network using Large Graph Layout

Source: Akshay Java, from ebiquity group

Page 7:

Opportunity 4: Heterogeneous Systems: Methodologies for combining the use of the Cloud and Manycore for high-performance computing

• Today, there is a dichotomy between using clouds (e.g. Hadoop, map-reduce) for massive data storage, filtering, and summarization, and massively parallel/multithreaded systems for data-intensive computation.

We must develop methodologies for employing these complementary systems for solving grand challenges in data analysis.

David A. Bader 7

Steve Mills, SVP of IBM Software (left), and Dr. John Kelly, SVP of IBM Research, view Stream Computing technology

Page 8:

Opportunity 5: Energy-efficient high-performance computing

• The main constraint on our ability to compute has changed
– from the availability of compute resources
– to the ability to power and cool our systems within budget.

Holistic research is needed that can permeate from the architecture and systems up to the applications and data centers, whereby energy use is a first-class object that can be optimized at all levels.

David A. Bader 8

Microsoft’s Chicago Million-Server Data Center

Page 9:

MOTIVATION

David A. Bader 9

Page 10:

Exascale Streaming Data Analytics: Real-world challenges

All involve analyzing massive streaming complex networks:

• Health care: disease spread, detection and prevention of epidemics/pandemics (e.g. SARS, avian flu, H1N1 “swine” flu)

• Massive social networks: understanding communities, intentions, population dynamics, pandemic spread, transportation and evacuation

• Intelligence: business analytics, anomaly detection, security, knowledge discovery from massive data sets

• Systems biology: understanding complex life systems, drug design, microbial research, unraveling the mysteries of the HIV virus; understanding life and disease

• Electric power grid: communication, transportation, energy, water, food supply

• Modeling and simulation: performing full-scale economic-social-political simulations

David A. Bader 10

[Chart: Facebook active users (millions), Dec-04 through Dec-09. Exponential growth: more than 900 million active users.]

Sample queries:
• Allegiance switching: identify entities that switch communities.
• Community structure: identify the genesis and dissipation of communities.
• Phase change: identify significant change in the network structure.

Requires predicting / influencing change in real time at scale.

Ex: discovered minimal changes in an O(billions)-size complex network that could hide or reveal top influencers in the community.

Page 11:

Current Example Data Rates

• Financial:
– NYSE processes 1.5 TB daily, maintains 8 PB.

• Social:
– Facebook adds >100k users, 55M “status” updates, and 80M photos daily; more than 750M active users with an average of 130 “friend” connections each.
– Foursquare, a new service, reports 1.2M location check-ins per week.

• Scientific:
– MEDLINE adds from 1 to 140 publications a day.

Shared features: All data is rich, irregularly connected to other data. All is a mix of “good” and “bad” data... and much real data may be missing or inconsistent.

David A. Bader 11

Page 12:

Ubiquitous High Performance Computing (UHPC)

Goal: develop highly parallel, security-enabled, power-efficient processing systems, supporting ease of programming, with resilient execution through all failure modes and intrusion attacks.

Program Objectives:
• One PFLOPS, single cabinet, including self-contained cooling
• 50 GFLOPS/W (equivalent to 20 pJ/FLOP)
• Total cabinet power budget 57 kW, including processing resources, storage, and cooling
• Security embedded at all system levels
• Parallel, efficient execution models
• Highly programmable parallel systems
• Scalable systems, from terascale to petascale

Architectural Drivers:
• Energy efficiency
• Security and dependability
• Programmability

Echelon: Extreme-scale Compute Hierarchies with Efficient Locality-Optimized Nodes

“NVIDIA-Led Team Receives $25 Million Contract From DARPA to Develop High-Performance GPU Computing Systems” -MarketWatch

David A. Bader (CSE), Echelon Leadership Team

Page 13:

Center for Adaptive Supercomputing Software for MultiThreaded Architectures (CASS-MT)

• Launched July 2008
• Led by Pacific Northwest National Lab
– Georgia Tech, Sandia, WA State, Delaware

• The newest breed of supercomputers have hardware set up not just for speed, but also to better tackle large networks of seemingly random data. A multi-institutional group of researchers has been awarded over $14 million to develop software for these supercomputers. Applications include anywhere complex webs of information can be found: from internet security and power-grid stability to complex biological networks.

David A. Bader 13

Page 14:

Example: Mining Twitter for Social Good

David A. Bader 14

ICPP 2010

Image credit: bioethicsinstitute.org

Page 15:

Massive Data Analytics: Protecting our Nation

Public Health
• CDC / nation-scale surveillance of public health

• Cancer genomics and drug design
– computed Betweenness Centrality of the Human Proteome

[Plot: Human Genome core protein interactions, degree vs. betweenness centrality (log-log; degree 1 to 100, betweenness centrality 1e-7 to 1e+0). Highlighted: ENSG00000145332.2, a Kelch-like protein implicated in breast cancer.]

US High Voltage Transmission Grid (>150,000 miles of line)

David A. Bader 15

Page 16:

Network Analysis for Intelligence and Surveillance

• [Krebs ’04] Post 9/11 Terrorist Network Analysis from public domain information

• Plot masterminds correctly identified from interaction patterns: centrality

• A global view of entities is often more insightful

• Detect anomalous activities by exact/approximate graph matching

Image Source: http://www.orgnet.com/hijackers.html

Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47

David A. Bader 16

Page 17:

Graphs are pervasive in large-scale data analysis

• Sources of massive data: petascale simulations, experimental devices, the Internet, scientific applications.

• New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.

Astrophysics
Problem: outlier detection.
Challenges: massive data sets, temporal variations.
Graph problems: clustering, matching.

Bioinformatics
Problem: identifying drug target proteins.
Challenges: data heterogeneity, quality.
Graph problems: centrality, clustering.

Social Informatics
Problem: discover emergent communities, model the spread of information.
Challenges: new analytics routines, uncertainty in data.
Graph problems: clustering, shortest paths, flows.

Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg; (2,3) www.visualComplexity.com

David A. Bader 17

Page 18:

Flywheel has driven HPC into a corner

David A. Bader 18

For decades, HPC has been in a vicious cycle of enabling applications that run well on HPC systems.

Real-World Applications (with natural parallelism)

Data-intensive computing has natural concurrency; yet HPC architectures are designed to achieve high floating-point rates by exploiting spatial and temporal localities.

• For the first time in decades, manycore and multithreaded systems let us rethink architecture.

These are today’s MPI applications!

Page 19:

A Word of Caution

• Don’t let programming paradigms limit your imagination of what’s possible!

• In the HPC community,
Scientific Computing ≡ Message-Passing Programming (MPI)
– Some grand challenge applications do very well and have led to numerous scientific insights
– Other applications are impossible to solve efficiently

• For the Big Data community, Hadoop/Map-Reduce:
– Great for some statistical analyses
– Problematic for deep insights

David A. Bader 19

Page 20:

Big Data platforms

David A. Bader 20

[Diagram: Big Data platforms arranged by data placement. External disk: traditional database; Hadoop / Map-Reduce cloud or cluster. Internal memory: large shared memory. Compared on performance, energy, and ease of programming.]

Page 21:

David A. Bader 21

Architectural Requirements for Efficient Graph Analysis (Challenges)

• Runtime is dominated by latency
– Random accesses to global address space
– Perhaps many at once

• Essentially no computation to hide memory costs

• Access pattern is data-dependent
– Prefetching unlikely to help
– Usually only want a small part of a cache line

• Potentially abysmal locality at all levels of the memory hierarchy

Page 22:

David A. Bader 22

Architectural Requirements for Efficient Graph Analysis (Desired Features)

• A large memory capacity
• Low latency / high bandwidth
– For small messages!
• Latency tolerant
• Light-weight synchronization mechanisms
• Global address space
– No graph partitioning required
– Avoid memory-consuming profusion of ghost nodes
– No local/global numbering conversions

Page 23:

Graph500 Benchmark, www.graph500.org

Defining a new set of benchmarks to guide the design of hardware architectures and software systems intended to support such applications, and to help procurements. Graph algorithms are a core part of many analytics workloads. Executive Committee: D.A. Bader, R. Murphy, M. Snir, A. Lumsdaine

• Five Business Area Data Sets:

• Cybersecurity
– 15 billion log entries/day (for large enterprises)
– Full data scan with end-to-end join required

• Medical Informatics
– 50M patient records, 20-200 records/patient, billions of individuals
– Entity resolution important

• Social Networks
– Example: Facebook, Twitter
– Nearly unbounded data set size

• Data Enrichment
– Easily PB of data
– Example: Maritime Domain Awareness
• Hundreds of millions of transponders
• Tens of thousands of cargo ships
• Tens of millions of pieces of bulk cargo
• May involve additional data (images, etc.)

• Symbolic Networks
– Example: the human brain
– 25B neurons
– 7,000+ connections/neuron

David A. Bader 23

Page 24:

Breadth-first search comparison: Graph500

• Graph500 benchmark: BFS on a synthetic graph of a social network
• 2^Size vertices and 16 × 2^Size edges
• Processing rate: traversed edges per second (TEPS)
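As a concrete illustration (a minimal sketch, not the official Graph500 reference code), the problem size and the TEPS metric follow directly from the Size parameter used in the table below:

```python
# Minimal sketch (not the official Graph500 code): problem size and TEPS.

def graph500_size(size: int, edge_factor: int = 16):
    """A Size-S Graph500 instance has 2^S vertices and 16 * 2^S edges."""
    n_vertices = 2 ** size
    n_edges = edge_factor * n_vertices
    return n_vertices, n_edges

def teps(edges_traversed: int, seconds: float) -> float:
    """Processing rate: traversed edges per second."""
    return edges_traversed / seconds

# The Size-32 entry below has ~4.3 billion vertices and ~68.7 billion edges.
v, e = graph500_size(32)
print(f"Size 32: {v:,} vertices, {e:,} edges")
```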

David A. Bader 24

Rank  Machine                                                                          Owner                                     Size  TEPS
1     NNSA/SC Blue Gene/Q Prototype II (4096 nodes / 65,536 cores)                     NNSA and IBM Research, T.J. Watson        32    254.3 B
2     Hopper XE6 (1800 nodes / 43,200 cores)                                           LBNL                                      37    113.3 B
3     Lomonosov (4096 nodes / 32,768 cores)                                            Moscow State University                  37    103.2 B
17    mirasol (quad-socket Intel E7-8870, 2.4 GHz, 10 cores / 20 threads per socket)   Intel/Georgia Tech (UCB implementation)  28    5.1 B
20    Dingus (Convey HC1ex, 4 FPGA)                                                    SNL                                       28    1.7 B
35    Matterhorn (64-proc XMT2)                                                        CSCS                                      31    884.6 M

Page 25:

The Cray XMT

• Tolerates latency by massive multithreading
– Hardware support for 128 threads on each processor
– Globally hashed address space
– No data cache
– Single-cycle context switch
– Multiple outstanding memory requests

• Support for fine-grained, word-level synchronization
– Full/empty bit associated with every memory word

• Flexibly supports dynamic load balancing

• GraphCT currently tested on a 512-processor XMT: 64K threads
– 4 TB of globally shared memory
– XMT2 supports 512 processors and 64 TB of globally shared memory

David A. Bader 25

Image Source: cray.com

Page 26:

Intel Xeon E7-8870-based platforms

• Architecture supports up to 2 TB of memory, commercially available.
• Tolerates some latency by hyperthreading.
– Hardware: 2 threads/core, 10 cores/socket, four sockets.
– Fast cores (2.4 GHz), fast memory (1,066 MHz).
– Not so many outstanding memory requests (60/socket), but large caches (30 MB L3 per socket).
• Good system support
– Transparent hugepages reduce TLB costs.
– Fast, user-level locking.
– Widely available tuned primitives.
• Also high-productivity graph analysis!

David A. Bader 26

Image: Intel(tm) press kit

Page 27:

Graph Analysis Performance: Multithreaded (Cray XMT) vs. Cache-based Multicore

• SSCA#2 network, SCALE 24 (16.77 million vertices and 134.21 million edges)

David A. Bader 27

[Chart: Betweenness TEPS rate (millions of edges per second, 0 to 180) vs. number of processors/cores (1 to 16), for the Cray XMT, Sun UltraSparc T2, and a 2.0 GHz quad-core Xeon.]

Page 28:

Massive Streaming Graph Analytics

David A. Bader 28

(A, B, t1, poke)
(A, C, t2, msg)
(A, D, t3, view wall)
(A, D, t4, post)
(B, A, t2, poke)
(B, A, t3, view wall)
(B, A, t4, msg)

Analysts
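Each event above is a temporal, typed edge. A minimal sketch (the four-field tuple format is taken from the slide; the container is illustrative) of ingesting such a stream into per-vertex adjacency logs:

```python
# Minimal sketch: ingest a stream of temporal, typed edges like those above.
from collections import defaultdict

adjacency = defaultdict(list)  # vertex -> [(neighbor, time, action), ...]

def ingest(src, dst, t, action):
    adjacency[src].append((dst, t, action))

stream = [
    ("A", "B", 1, "poke"), ("A", "C", 2, "msg"),
    ("A", "D", 3, "view wall"), ("A", "D", 4, "post"),
    ("B", "A", 2, "poke"), ("B", "A", 3, "view wall"),
    ("B", "A", 4, "msg"),
]
for edge in stream:
    ingest(*edge)

print(adjacency["A"])  # A's outgoing interactions, in arrival order
```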

Page 29:

Streaming Graphs

• STINGER: A Data Structure for Graphs with Streaming Updates
– A light-weight data structure that supports efficient iteration and efficient updates.

• Experiments with Streaming Updates to Clustering Coefficients
– Working with bulk updates, can handle almost 200k updates per second.

• Tracking Connected Components
• Tracking Betweenness Centrality

David A. Bader 29

Page 30:

STING Extensible Representation (STINGER)

An enhanced representation for dynamic graphs, developed in consultation with David A. Bader, Johnathan Berry, Adam Amos-Binks, Daniel Chavarría-Miranda, Charles Hastings, Kamesh Madduri, and Steven C. Poulos.

Design goals:
• Be useful for the entire “large graph” community
• Portable semantics and high-level optimizations across multiple platforms & frameworks (XMT C, MTGL, etc.)
• Permit good performance: no single structure is optimal for all.
• Assume globally addressable memory access
• Support multiple, parallel readers and a single writer

Operations:
• Insert/update & delete both vertices & edges
• Aging-off: remove old edges (by timestamp)
• Serialization to support checkpointing, etc.

David A. Bader 30

Page 31:

STING Extensible Representation

• Semi-dense edge list blocks with free space
• Compactly stores timestamps, types, weights
• Maps from application IDs to storage IDs
• Deletion by negating IDs, with separate compaction
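A toy sketch of these ideas (illustrative names, not STINGER's actual API): semi-dense edge blocks per vertex, parallel arrays for neighbor IDs, weights, and timestamps, and deletion by negating the stored neighbor ID with compaction deferred:

```python
# Toy sketch of the block layout above (illustrative, not STINGER's real API).
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # edges per block; real implementations tune this

@dataclass
class EdgeBlock:
    etype: int
    neighbors: list = field(default_factory=list)   # negative value marks a deleted slot
    weights: list = field(default_factory=list)
    timestamps: list = field(default_factory=list)

    def has_space(self):
        return len(self.neighbors) < BLOCK_SIZE

@dataclass
class Vertex:
    blocks: list = field(default_factory=list)

def insert_edge(v, etype, dst, weight, t):
    for b in v.blocks:                    # find a same-type block with free space
        if b.etype == etype and b.has_space():
            break
    else:
        b = EdgeBlock(etype)
        v.blocks.append(b)
    b.neighbors.append(dst)
    b.weights.append(weight)
    b.timestamps.append(t)

def delete_edge(v, dst):
    for b in v.blocks:
        for i, nbr in enumerate(b.neighbors):
            if nbr == dst:
                b.neighbors[i] = -nbr - 1  # negate the ID; compaction runs separately
                return
```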

David A. Bader 31

Page 32:

Goal: Streaming Graph Analytics

• O(billion) vertices, O(trillion) edges, 1M updates/sec

• Challenge: maintain analytics, update quickly
– Connected components
– Agglomerative clustering

• Problem:
– Irregular execution
– Bandwidth- and latency-bound
– Irregular memory access
– Low-to-no reuse

David A. Bader 32

Page 33:

Case Study 1: Clustering Coefficients in a Streaming Graph

• Roughly, the ratio of actual triangles to possible triangles around a vertex.

• Defined in terms of triplets.
– i-j-v is a closed triplet (triangle).
– m-v-n is an open triplet.

• Clustering coefficient = # closed triplets / # all triplets
– Locally, count those around v.
– Globally, count across the entire graph. Multiple counting cancels (3/3 = 1).
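A minimal sketch of the local definition above, over undirected adjacency sets (illustrative, not GraphCT code):

```python
# Minimal sketch: local clustering coefficient from undirected adjacency sets.
def local_clustering(adj, v):
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    # Each triangle through v is counted once from each of its two endpoints.
    triangles = sum(len(adj[u] & nbrs) for u in nbrs) // 2
    return triangles / (d * (d - 1) / 2)   # closed / all triplets around v

# The slide's example: i-j-v is closed; m-v-n is open.
adj = {"i": {"j", "v"}, "j": {"i", "v"}, "v": {"i", "j", "m", "n"},
       "m": {"v"}, "n": {"v"}}
print(local_clustering(adj, "v"))  # 1 triangle out of 6 triplets around v
```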

David A. Bader 33

Page 34:

Updating clustering coefficients in a streaming graph

• Monitoring clustering coefficients could identify anomalies, find forming communities, etc.

• Computations stay local. A change to edge <u, v> affects only vertices u, v, and their neighbors.

• Need a fast method for updating the triangle counts and degrees when an edge is inserted or deleted.
– Dynamic data structure for edges & degrees: STINGER
– Rapid triangle count update algorithms: exact and approximate

“Massive Streaming Data Analytics: A Case Study with Clustering Coefficients.” David Ediger, Karl Jiang, E. Jason Riedy, and David A. Bader. MTAAP 2010, Atlanta, GA, April 2010.

[Diagram: inserting edge <u, v> with two common neighbors adds +2 to the triangle counts at u and at v, and +1 at each common neighbor.]
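A minimal sketch of the exact update rule from the diagram (undirected adjacency sets; the approximate variant from the paper is not shown):

```python
# Minimal sketch: exact triangle-count update on inserting edge <u, v>.
# adj: vertex -> neighbor set; tri: vertex -> triangle count (e.g. defaultdict(int)).
def insert_and_update(adj, tri, u, v):
    common = adj[u] & adj[v]   # neighbors that close new triangles
    adj[u].add(v)
    adj[v].add(u)
    tri[u] += len(common)      # the "+2" in the diagram: two common neighbors
    tri[v] += len(common)
    for w in common:
        tri[w] += 1            # the "+1" at each common neighbor
# Deletion is symmetric: remove the edge and subtract the same counts.
```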

David A. Bader 34

Page 35:

Updating clustering coefficients in a streaming graph

• Using R-MAT as a graph and edge stream generator (see the sketch below).
– Mix of insertions and deletions

• Result summary for single actions
– Exact: from 8 to 618 actions/second
– Approximate: from 11 to 640 actions/second

• Alternative: batch changes
– Lose some temporal resolution within the batch
– Median update rates (updates/sec) for batches of size B:

Algorithm   B = 1   B = 1000   B = 4000
Exact          90     25,100     50,100
Approx.        60     83,700    193,300

• STINGER overhead is minimal; most time is spent in the metric computation.
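For reference, a minimal sketch of an R-MAT edge generator in its standard recursive-quadrant form (the parameter values shown are common defaults, not necessarily those used in these experiments):

```python
# Minimal sketch: R-MAT edge generation by recursive quadrant choice.
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, d=0.05):
    """One edge of a 2^scale-vertex R-MAT graph."""
    u = v = 0
    for _ in range(scale):
        r = random.random()
        u <<= 1
        v <<= 1
        if r < a:
            pass                  # upper-left quadrant
        elif r < a + b:
            v |= 1                # upper-right
        elif r < a + b + c:
            u |= 1                # lower-left
        else:
            u |= 1; v |= 1        # lower-right
    return u, v

stream = [rmat_edge(10) for _ in range(16 * 2 ** 10)]  # Scale-10, edge factor 16
```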

David A. Bader 35

Page 36:

Case study 2: Tracking connected components in a streaming graph

• Why? Interested in:
– small components over long periods of time
– communities that split off from the main group

• Global property
– Track the <vertex, component> mapping
– Undirected, unweighted graphs

• Scale-free network
– most changes are within one large component

• Inserting a new edge (see the sketch after this list):
– If the endpoints are differently labeled, re-label the smaller component

• Deleting an edge:
– Must prove that the deletion does not cleave a component
– Otherwise recompute!
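A minimal sketch of the insertion rule above (assumes a tracked <vertex, component> map plus component sizes; the linear relabeling scan is for clarity, not speed):

```python
# Minimal sketch: relabel the smaller component on edge insertion.
def insert_edge_cc(comp, size, u, v):
    """comp: vertex -> component label; size: label -> component size."""
    cu, cv = comp[u], comp[v]
    if cu == cv:
        return                       # same component: nothing to relabel
    small, large = (cu, cv) if size[cu] <= size[cv] else (cv, cu)
    for x, c in list(comp.items()):  # linear scan for clarity, not speed
        if c == small:
            comp[x] = large
    size[large] += size[small]
    del size[small]
```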

David A. Bader 36

Page 37:

Tracking Connected Components: Performance on 32 of 128p Cray XMT

David A. Bader 37

[Charts: update rates for |V| = 1M, |E| = 8M and for |V| = 16M, |E| = 135M]

• Insertions-only improvement: 10x-20x over recomputation
• Greater parallelism within a batch yields higher performance
• Trade-off between time resolution and update speed

“Tracking Structure of Streaming Social Networks.” David Ediger, Jason Riedy, David A. Bader, and Henning Meyerhenke. MTAAP 2011, Anchorage, AK, May 2011.

Page 38:

Tracking Connected Components: Faster deletions

• Finding triangles rules out 90% of deletions as “safe”
• Introduce a spanning tree to easily track deletions

• A deletion is “safe” if (see the sketch below):
– the edge is not in the spanning tree
– the edge endpoints have a common neighbor
– a neighbor can still reach the root of the tree

[Diagram: spanning-tree edges vs. non-tree edges, root vertex in black; a deleted edge is marked X.]

David A. Bader 38

• This method rules out 99.7% of deletions in an R-MAT graph with 2M vertices and 32M edges
– Often faster than triangle finding
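A minimal sketch of the three “safe” tests (parent-pointer spanning tree; all names are illustrative):

```python
# Minimal sketch of the three "safe deletion" tests (all names illustrative).
def deletion_is_safe(adj, tree_parent, root, u, v):
    """adj: vertex -> neighbor set (after removing u-v);
    tree_parent: vertex -> spanning-tree parent (root maps to None)."""
    # Test 1: the edge is not a spanning-tree edge.
    if tree_parent.get(u) != v and tree_parent.get(v) != u:
        return True
    # Test 2: the endpoints still have a common neighbor (a triangle).
    if adj[u] & adj[v]:
        return True
    # Test 3: a neighbor of the severed endpoint can still reach the root
    # along parent pointers without crossing the deleted edge.
    child = u if tree_parent.get(u) == v else v
    for w in adj[child]:
        x = w
        while x is not None:
            if x == root:
                return True
            parent = tree_parent.get(x)
            if {x, parent} == {u, v}:  # would cross the deleted edge
                break
            x = parent
    return False  # possibly cleaved: fall back to recomputation
```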

Page 39:

Case Study 3: Tracking Betweenness Centrality (BC) in a streaming graph

• Key metric in social network analysis [Freeman ’77, Goh ’02, Newman ’03, Brandes ’03]

σ_st: number of shortest paths between vertices s and t
σ_st(v): number of shortest paths between s and t passing through v

BC(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}

• Streaming Updates
– Reuse previously computed traversals
• Reduce computational requirements
– Avoid redundant computations
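For reference, a compact sketch of the static Brandes-style computation whose per-source traversals the streaming approach reuses; this is the standard static algorithm, not the streaming update itself:

```python
# Compact sketch of static betweenness centrality (Brandes-style) for an
# unweighted graph given as adjacency lists; for undirected graphs, halve
# the resulting totals.
from collections import deque

def betweenness(adj):
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: shortest-path counts (sigma) and predecessor lists.
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```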

David A. Bader 39

Page 40:

Updating BC for Edge Insertions in a Streaming Graph

• Currently, memory requirements limit the problem size

• Speedup measured: Speedup = T_{static} / T_{insertion}

Collaboration network        Vertices   Edges     Speedup
Astrophysics                 18,772     198,080   148x
Condensed matter             23,133     93,468    91x
General relativity           5,242      14,490    40x
High energy physics          120,008    118,505   108x
High energy physics theory   9,877      51,970    36x

David A. Bader 40

Page 41:

Updating BC in a Streaming Graph: Experimental Results for Insertions-only

• 3.4 GHz Intel Core i7-2600K (16 GB RAM)

[Chart: R-MAT graph speedup (0 to 160) vs. edge factor (average edge degree, 8 to 32), for graphs of Scale 10 through Scale 14.]

David A. Bader 41

Page 42:

CHALLENGES

David A. Bader 42

Page 43:

Hierarchy of Interesting Analytics

• Extend single-shot graph queries to include time.
– Are there s-t paths between time T1 and T2?
– What are the important vertices at time T?

• Use persistent queries to monitor properties.
– Does the path between s and t shorten drastically?
– Is some vertex suddenly very central?

• Extend persistent queries to fully dynamic properties.
– Does a small community stay independent rather than merge with larger groups?
– When does a vertex jump between communities?

New types of queries, new challenges...

David A. Bader 43

Page 44:

Graph Analytics for Social Networks

• Are there new graph techniques? Do they parallelize?
– Can the computational systems (algorithms, machines) handle massive networks with millions to billions of individuals?
– Can the techniques tolerate noisy data, massive data, streaming data, etc.?

• Communities may overlap, exhibit different properties and sizes, and be driven by different models
– Detect communities (static or emerging)
– Identify important individuals
– Detect anomalous behavior
– Given a community, find a representative member of the community
– Given a set of individuals, find the best community that includes them

David A. Bader 44

Page 45:

Open Questions for Massive Data Analytic Apps

• How do we diagnose the health of streaming systems?
• Are there new analytics for massive spatio-temporal interaction networks and graphs (STING)?
• Do current methods scale up from thousands to millions and billions?
• How do I model massive, streaming data streams?
• Are algorithms resilient to noisy data?
• How do I visualize a STING with O(1M) entities? O(1B)? O(100B)? With a scale-free, power-law distribution of vertex degrees and diameter = 6...
• Can accelerators aid in processing streaming graph data?
• How do we leverage the benefits of multiple architectures (e.g. map-reduce clouds and massively multithreaded architectures) in a single platform?

David A. Bader 45

Page 46:

Collaborators and Acknowledgments

• Jason Riedy, Research Scientist (Georgia Tech)
• Graduate students (Georgia Tech):
– David Ediger
– Oded Green
– Rob McColl
– Pushkar Pande
– Anita Zakrzewska
– Karl Jiang
• Bader PhD graduates:
– Seunghwa Kang (Pacific Northwest National Lab)
– Kamesh Madduri (Penn State)
– Guojing Cong (IBM TJ Watson Research Center)
• John Feo and Daniel Chavarría-Miranda (Pacific Northwest Nat’l Lab)

David A. Bader 46

Page 47:

Bader, Related Recent Publications (2005-2008)

• D.A. Bader, G. Cong, and J. Feo, “On the Architectural Requirements for Efficient Execution of Graph Algorithms,” The 34th International Conference on Parallel Processing (ICPP 2005), pp. 547-556, Georg Sverdrups House, University of Oslo, Norway, June 14-17, 2005.

• D.A. Bader and K. Madduri, “Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors,” The 12th International Conference on High Performance Computing (HiPC 2005), D.A. Bader et al., (eds.), Springer-Verlag LNCS 3769, 465-476, Goa, India, December 2005.

• D.A. Bader and K. Madduri, “Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2,” The 35th International Conference on Parallel Processing (ICPP 2006), Columbus, OH, August 14-18, 2006.

• D.A. Bader and K. Madduri, “Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks,” The 35th International Conference on Parallel Processing (ICPP 2006), Columbus, OH, August 14-18, 2006.

• K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “Parallel Shortest Path Algorithms for Solving Large-Scale Instances,” 9th DIMACS Implementation Challenge -- The Shortest Path Problem, DIMACS Center, Rutgers University, Piscataway, NJ, November 13-14, 2006.

• K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

• J.R. Crobak, J.W. Berry, K. Madduri, and D.A. Bader, “Advanced Shortest Path Algorithms on a Massively-Multithreaded Architecture,” First Workshop on Multithreaded Architectures and Applications (MTAAP), Long Beach, CA, March 30, 2007.

• D.A. Bader and K. Madduri, “High-Performance Combinatorial Techniques for Analyzing Massive Dynamic Interaction Networks,” DIMACS Workshop on Computational Methods for Dynamic Interaction Networks, DIMACS Center, Rutgers University, Piscataway, NJ, September 24-25, 2007.

• D.A. Bader, S. Kintali, K. Madduri, and M. Mihail, “Approximating Betweenness Centrality,” The 5th Workshop on Algorithms and Models for the Web-Graph (WAW2007), San Diego, CA, December 11-12, 2007.

• David A. Bader, Kamesh Madduri, Guojing Cong, and John Feo, “Design of Multithreaded Algorithms for Combinatorial Problems,” in S. Rajasekaran and J. Reif, editors, Handbook of Parallel Computing: Models, Algorithms, and Applications, CRC Press, Chapter 31, 2007.

• Kamesh Madduri, David A. Bader, Jonathan W. Berry, Joseph R. Crobak, and Bruce A. Hendrickson, “Multithreaded Algorithms for Processing Massive Graphs,” in D.A. Bader, editor, Petascale Computing: Algorithms and Applications, Chapman & Hall / CRC Press, Chapter 12, 2007.

• D.A. Bader and K. Madduri, “SNAP, Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the exploration of large-scale networks,” 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, FL, April 14-18, 2008.

David A. Bader 47

Page 48:

Bader, Related Recent Publications (2009-2010)

• S. Kang, D.A. Bader, “An Efficient Transactional Memory Algorithm for Computing Minimum Spanning Forest of Sparse Graphs,” 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Raleigh, NC, February 2009.

• Karl Jiang, David Ediger, and David A. Bader. “Generalizing k-Betweenness Centrality Using Short Paths and a Parallel Multithreaded Implementation.” The 38th International Conference on Parallel Processing (ICPP), Vienna, Austria, September 2009.

• Kamesh Madduri, David Ediger, Karl Jiang, David A. Bader, Daniel Chavarría-Miranda. “A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets.” 3rd Workshop on Multithreaded Architectures and Applications (MTAAP), Rome, Italy, May 2009.

• David A. Bader, et al. “STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation.” 2009.

• David Ediger, Karl Jiang, E. Jason Riedy, and David A. Bader. “Massive Streaming Data Analytics: A Case Study with Clustering Coefficients,” Fourth Workshop in Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 2010.

• Seunghwa Kang, David A. Bader. “Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System,” Fourth Workshop in Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 2010.

• David Ediger, Karl Jiang, Jason Riedy, David A. Bader, Courtney Corley, Rob Farber and William N. Reynolds. “Massive Social Network Analysis: Mining Twitter for Social Good,” The 39th International Conference on Parallel Processing (ICPP 2010), San Diego, CA, September 2010.

• Virat Agarwal, Fabrizio Petrini, Davide Pasetto and David A. Bader. “Scalable Graph Exploration on Multicore Processors,” The 22nd IEEE and ACM Supercomputing Conference (SC10), New Orleans, LA, November 2010.

David A. Bader 48

Page 49:

Acknowledgment of Support

David A. Bader 49