TRANSCRIPT
Parallel Programming for Graph AnalysisDavid Ediger, Jason Riedy, Rob McColl, David A. Bader
Tutorial Schedule
• 08:30am – 08:45am: Welcome & Introduction
• 08:45am – 10:00am: Part 1: Network Analysis
• 10:00am – 10:30am: Coffee Break
• 10:30am – 12:00pm: Part 2: Static Parallel Algorithms
• 12:00pm – 01:30pm: Lunch
• 01:30pm – 03:00pm: Part 3: Dynamic Parallel Algorithms
• 03:00pm – 03:30pm: Coffee Break
• 03:30pm – 05:00pm: Part 4: STINGER
Parallel Programming for Graph Analysis 2
Tutorial Outline
• Introduction
  – Speakers
• 1: Network Analysis
  – Introduction to Graph Theory
  – Opportunities & Challenges
  – Parallel, Multicore & Multithreaded Architectures
  – Q&A
• 2: Static Parallel Algorithms
  – Graph Traversal / Graph500
  – Social Networking Algorithms
  – Graph Software Libraries
  – Q&A
• 3: Dynamic Parallel Algorithms
  – Data Structures for Streaming Data
  – Clustering Coefficients
  – Connected Components
  – Anomaly Detection
  – Q&A
• 4: GraphCT & STINGER
– Using GraphCT, app & lib.
– Using STINGER
  • Creation, Insertion, & Deletion
• Graph Traversal
• Connected Components
• Included Utilities
– Q&A
Parallel Programming for Graph Analysis 3
E. Jason Riedy
• Research Scientist II, Georgia Institute of Technology
  – School of Computational Science & Engineering
• Ph.D. 2010, University of California, Berkeley
  – Still likes Fortran...
• Contact Info:
  – College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
  – [email protected]
  – http://www.cc.gatech.edu/~jriedy
Parallel Programming for Graph Analysis 4
David Ediger
• Ph.D. Student, Georgia Tech
  – Advisor: Prof. David A. Bader
• B.S. 2008, The George Washington University
• M.S. 2010, Georgia Tech
Parallel Programming for Graph Analysis 5
Currently working on the development of parallel graph algorithms and software: GraphCT and STINGER.
Outline
• Introduction
• Network Analysis
  – Introduction to Graph Applications
  – Opportunities & Challenges
  – Parallel, Multicore & Multithreaded Architectures
• Static Parallel Algorithms
• Dynamic Parallel Algorithms
• GraphCT & STINGER
Parallel Programming for Graph Analysis 6
MOTIVATION
Parallel Programming for Graph Analysis 7
Exascale Analytics: Real-world challenges
• Health care – disease spread, detection and prevention of epidemics/pandemics (e.g., SARS, avian flu, H1N1 “swine” flu, …)
• Massive social networks – energy conservation requires social change, modeling pandemic spread, transportation and evacuation
• Intelligence – business analytics, anomaly detection, security, knowledge discovery from massive data sets
• Systems Biology – understanding complex life systems, drug design, microbial research, unraveling the mysteries of HIV; understanding life, disease, and evolution
• Electric Power Grid – communication, transportation, energy, water, food supply
• Modeling and Simulation – performing full-scale economic-social-political simulations
Requires dynamic Spatio-Temporal Interaction Networks and Graphs (STING)
Parallel Programming for Graph Analysis 8
Driving Forces in Social Network Analysis
• Facebook has more than 850 million active users
• Note the graph is changing as well as growing.
• What are this graph's properties? How do they change?
• Traditional graph partitioning often fails:
  – Topology: the interaction graph is low-diameter and has no good separators
  – Irregularity: communities are not uniform in size
  – Overlap: individuals are members of one or more communities
• Sample queries:
  – Allegiance switching: identify entities that switch communities
  – Community structure: identify the genesis and dissipation of communities
  – Phase change: identify significant changes in the network structure
Parallel Programming for Graph Analysis 9
Friendship Networks
• This image is a visualization of LiveJournal's friendship network. The network consists of 4.8 million people (vertices) connected by 69 million relationships (edges). Credit: Yifan Hu (AT&T)
Parallel Programming for Graph Analysis 10
Center for Adaptive Supercomputing Software (CASS-MT)
• CASS-MT, launched July 2008
• Pacific Northwest National Laboratory
  – Georgia Tech, Sandia, WA State, Delaware
• The newest breed of supercomputers have hardware set up not just for speed, but also to better tackle large networks of seemingly random data. And now, a multi-institutional group of researchers has been awarded over $14 million to develop software for these supercomputers. Applications include anywhere complex webs of information can be found: from internet security and power grid stability to complex biological networks.
David A. Bader 11
CASS-MT TASK 7: Analysis of Massive Social Networks
David A. Bader 12
Objective: To design software for the analysis of massive-scale spatio-temporal interaction networks using multithreaded architectures such as the Cray XMT. The Center launched in July 2008 and is led by Pacific Northwest National Laboratory.
Description: We are designing and implementing advanced, scalable algorithms for static and dynamic graph analysis, including generalized k-betweenness centrality and dynamic clustering coefficients.
Highlights: On a 64-processor Cray XMT, k-betweenness centrality scales nearly linearly (58.4x) on a graph with 16M vertices and 134M edges. Initial streaming clustering coefficients handle around 200k updates/sec on a similarly sized graph.
Our research is focusing on temporal analysis, answering questions about changes in global properties (e.g. diameter) as well as local structures (communities, paths).
Image Courtesy of Cray, Inc.
David A. Bader (CASS-MT Task 7 Lead), David Ediger, Jason Riedy, Henning Meyerhenke
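The highlights above mention dynamic (streaming) clustering coefficients; as background, here is a minimal sketch of the static local clustering coefficient that the streaming version maintains incrementally. The toy adjacency sets are illustrative assumptions, not data from the project.

```python
# Local clustering coefficient: fraction of a vertex's neighbor pairs that
# are themselves connected (i.e., triangles through the vertex).
# The streaming variant updates these triangle counts per edge change.

def local_clustering(adj, v):
    """adj: dict vertex -> set of neighbors (undirected). Returns a value in [0, 1]."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    # Count edges among neighbors; each is found twice (once from each endpoint).
    links = sum(1 for u in nbrs for w in adj[u] if w in nbrs) // 2
    return 2.0 * links / (d * (d - 1))

# Toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c0 = local_clustering(adj, 0)  # neighbors 1 and 2 are connected
```

On a streaming graph, inserting or deleting an edge only affects the coefficients of the two endpoints and their common neighbors, which is what makes the per-update cost small.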
Ubiquitous High Performance Computing (UHPC)
Goal: develop highly parallel, security-enabled, power-efficient processing systems, supporting ease of programming, with resilient execution through all failure modes and intrusion attacks
Program Objectives:
One PFLOPS, single cabinet including self-contained cooling
50 GFLOPS/W (equivalent to 20 pJ/FLOP)
Total cabinet power budget 57 kW, including processing resources, storage and cooling
Security embedded at all system levels
Parallel, efficient execution models
Highly programmable parallel systems
Scalable systems – from terascale to petascale
Architectural Drivers:
Energy Efficiency
Security and Dependability
Programmability
Echelon: Extreme-scale Compute Hierarchies with Efficient Locality-Optimized Nodes
“NVIDIA-Led Team Receives $25 Million Contract From DARPA to Develop High-Performance GPU Computing Systems” -MarketWatch
David A. Bader (CSE), Echelon Leadership Team
13
Parallel Programming for Graph Analysis
And more...
• Spatio-Temporal Interaction Networks and Graphs Software for Intel platforms
  – Intel Project on Parallel Algorithms in Non-Numeric Computing, multi-year project
  – Collaborators: Mattson (Intel), Gilbert (UCSD), Blelloch (CMU), Lumsdaine (Indiana)
• Graph500: http://graph500.org
  – Shooting for reproducible benchmarks for massive graph-structured data
14
Parallel Programming for Graph Analysis
Example: Mining Twitter for Social Good
15
ICPP 2010
Image credit: bioethicsinstitute.org
• CDC / Nation-scale surveillance of public health
• Cancer genomics and drug design
  – Computed betweenness centrality of the human proteome
Massive Data Analytics: Protecting our Nation
US High Voltage Transmission Grid (>150,000 miles of line)
Public Health
Parallel Programming for Graph Analysis 16
ENSG00000145332: Kelch-like protein implicated in breast cancer
Routing in transportation networks
Road networks, point-to-point shortest paths: from 15 seconds (naïve) down to 10 microseconds
H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”, Science 27, 2007.
17Parallel Programming for Graph Analysis
Internet and the WWW
• The world-wide web can be represented as a directed graph
  – Web search and crawl: traversal
  – Link analysis, ranking: PageRank and HITS
  – Document classification and clustering
• Internet topologies (router networks) are naturally modeled as graphs
18Parallel Programming for Graph Analysis
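Link-analysis ranking, as mentioned above, can be sketched in a few lines of Python. The damping factor, iteration count, and toy web graph below are illustrative assumptions, not details from the slides.

```python
# Minimal PageRank power iteration on a small directed link graph.

def pagerank(edges, n, d=0.85, iters=50):
    """edges: list of (src, dst) directed links; n: number of pages."""
    out_deg = [0] * n
    for s, _ in edges:
        out_deg[s] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        for s, t in edges:
            new[t] += d * rank[s] / out_deg[s]
        # Dangling pages (no out-links) spread their rank uniformly.
        dangling = sum(rank[v] for v in range(n) if out_deg[v] == 0)
        rank = [r + d * dangling / n for r in new]
    return rank

# Page 0 links to 1 and 2; both link back to 0, so 0 accumulates the most rank.
r = pagerank([(0, 1), (0, 2), (1, 0), (2, 0)], 3)
```

Real web-scale ranking uses the same iteration over a sparse matrix-vector product, which is one reason graph analysis and sparse linear algebra keep converging.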
Scientific Computing
• Reorderings for sparse solvers
  – Fill-reducing orderings: partitioning, traversals, eigenvectors
  – Heavy diagonal to reduce pivoting (matching)
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization
  – Matroids, graph colorings, spanning trees
• Preconditioning
  – Incomplete factorizations
  – Partitioning for domain decomposition
  – Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  – Support theory: spanning trees & graph embedding techniques
B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”, http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf
19Parallel Programming for Graph Analysis
Graphs are pervasive in large-scale data analysis
• Sources of massive data: petascale simulations, experimental devices, the Internet, scientific applications.
• New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.
Astrophysics – Problem: outlier detection. Challenges: massive datasets, temporal variations. Graph problems: clustering, matching.
Bioinformatics – Problem: identifying drug target proteins. Challenges: data heterogeneity, quality. Graph problems: centrality, clustering.
Social Informatics – Problem: discover emergent communities, model spread of information. Challenges: new analytics routines, uncertainty in data. Graph problems: clustering, shortest paths, flows.
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com
20Parallel Programming for Graph Analysis
Data Analysis and Graph Algorithms in Systems Biology
• Study of the interactions between various components in a biological system
• Graph-theoretic formulations are pervasive:
  – Predicting new interactions: modeling
  – Functional annotation of novel proteins: matching, clustering
  – Identifying metabolic pathways: paths, clustering
  – Identifying new protein complexes: clustering, centrality
Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722–1736, 2003.
21Parallel Programming for Graph Analysis
Image Source: Nexus (Facebook application)
Graph-theoretic problems in social networks
– Community identification: clustering
– Targeted advertising: centrality
– Information spreading: modeling
22Parallel Programming for Graph Analysis
Network Analysis for Intelligence and Surveillance
• [Krebs ’04] Post 9/11 Terrorist Network Analysis from public domain information
• Plot masterminds correctly identified from interaction patterns: centrality
• A global view of entities is often more insightful
• Detect anomalous activities by exact/approximate graph matching / motifs
Image Source: http://www.orgnet.com/hijackers.html
Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47
23Parallel Programming for Graph Analysis
Characterizing Graph-theoretic computations
Input: graph abstraction.
Problem: find paths, clusters, partitions, matchings, patterns, orderings.

Factors that influence the choice of algorithm:
• graph sparsity (m/n ratio)
• static/dynamic nature
• weighted/unweighted, weight distribution
• vertex degree distribution
• directed/undirected
• simple/multi/hyper graph
• problem size
• granularity of computation at nodes/edges
• domain-specific characteristics

Graph algorithms:
• traversal
• shortest path algorithms
• flow algorithms
• spanning tree algorithms
• topological sort, …

Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.
24Parallel Programming for Graph Analysis
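Several of the factors listed above (sparsity, degree distribution) can be computed directly from an edge list. A minimal Python sketch on an assumed toy graph:

```python
# Characterize an undirected graph given as an edge list: m/n sparsity ratio,
# maximum degree, and the degree histogram (degree -> # of vertices).
from collections import Counter

def characterize(edges, n):
    """edges: list of (u, v) undirected edges; n: vertex count."""
    m = len(edges)
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    degrees = [deg[v] for v in range(n)]
    return {
        "sparsity_m_over_n": m / n,
        "max_degree": max(degrees),
        "degree_histogram": Counter(degrees),
    }

# Toy graph: a star on vertex 0 plus one extra edge.
stats = characterize([(0, 1), (0, 2), (0, 3), (1, 2)], 4)
```

For real informatics networks these same statistics (especially the degree histogram's heavy tail) are what drive the choice of data structure and load-balancing strategy.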
Massive data analytics in informatics networks
• Graphs arising in Informatics are very different from topologies in scientific computing.
• We need new data representations and parallel algorithms that exploit topological characteristics of informatics networks.
Emerging applications: dynamic, high-dimensional data
Static networks, Euclidean topologies
25Parallel Programming for Graph Analysis
What we’d like to infer from information networks
• What are the degree distributions, clustering coefficients, diameters, etc.?
– Heavy-tailed, small-world, expander, geometry+rewiring, local-global decompositions, ...
• How do networks grow, evolve, respond to perturbations, etc.?– Preferential attachment, copying, HOT, shrinking diameters, ..
• Are there natural clusters, communities, partitions, etc.?– Concept-based clusters, link-based clusters, density-based clusters, ...
• How do dynamic processes – search, diffusion, etc. – behave on networks?
– Decentralized search, undirected diffusion, cascading epidemics, ...
• How best to do learning, e.g., classification, regression, ranking, etc.?– Information retrieval, machine learning, ...
Slide credit: Michael Mahoney, Stanford
26Parallel Programming for Graph Analysis
CHALLENGES
Parallel Programming for Graph Analysis 27
Hierarchy of Interesting Graph Analytics
Extend single-shot graph queries to include time.
Are there s-t paths between time T1 and T2?
What are the important vertices at time T?
Use persistent queries to monitor properties. Does the path between s and t shorten drastically?
Is some vertex suddenly very central?
Extend persistent queries to fully dynamic properties.
Does a small community stay independent rather than merge with larger groups?
When does a vertex jump between communities?
New types of queries, new challenges...
28Parallel Programming for Graph Analysis
Graph Analytics for Social Networks
• Are there new graph techniques? Do they parallelize? Can the computational systems (algorithms, machines) handle massive networks with millions to billions of individuals? Can the techniques tolerate noisy data, massive data, streaming data, etc.?
• Communities may overlap, exhibit different properties and sizes, and be driven by different models
  – Detect communities (static or emerging)
  – Identify important individuals
  – Detect anomalous behavior
  – Given a community, find a representative member of the community
  – Given a set of individuals, find the best community that includes them
Parallel Programming for Graph Analysis 29
Open Questions for Massive Analytic Applications
• How do we diagnose the health of streaming systems?
• Are there new analytics for massive spatio-temporal interaction networks and graphs (STING)?
• Do current methods scale up from thousands to millions and billions?
• How do I model massive, streaming data?
• Are algorithms resilient to noisy data?
• How do I visualize a STING with O(1M) entities? O(1B)? O(100B)? With a scale-free power-law distribution of vertex degrees and diameter = 6?
• Can accelerators aid in processing streaming graph data?
• How do we leverage the benefits of multiple architectures (e.g., map-reduce clouds and massively multithreaded architectures) in a single platform?
Parallel Programming for Graph Analysis 30
Computational research challenges
• How do we solve massive graph problems (statistical analysis, dynamics, community detection, learning) with current algorithms/systems/languages/programming models/libraries?
• How do we get analytics routines to scale for massive datasets?
  – We need new algorithms that exploit graph topology & modern computer architectures
• What are the right abstractions for designing portable, high-performance graph algorithms?
– PRAM-like, BSP-like, MapReduce, matrices/tensors?
• Application-to-architecture mapping for current systems?
  – Multicore servers, commodity clusters, supercomputers, clouds
• What can we do with accelerators?– GPUs, Cell processor, Netezza and Azul data warehousing appliances
• What are the languages & programming models we should be using for high productivity/performance?
– C/C++ & threading, PGAS languages, MPI, Pig/Hive/Sawzall, …
31Parallel Programming for Graph Analysis
ARCHITECTURE
Parallel Programming for Graph Analysis 32
Flywheel has driven HPC into a corner
Parallel Programming for Graph Analysis 33
For decades, HPC has been on a vicious cycle of enabling applications that run well on HPC systems.
Real-World Applications(with natural parallelism)
Graphs have natural concurrency, yet HPC architectures are designed to achieve high floating-point rates by exploiting spatial and temporal locality.
• For the first time in decades, manycore and multithreaded designs let us rethink architecture.
These are today’s MPI applications!
34
Looking Ahead: Grand Challenge in Graph Analytics
• Driving real-world applications are not just traditional HPC:
  – health care, transportation, energy, proteomics, security, data sciences, informatics, …
• FLOPS are “free”; data is the challenge!
• Fundamental abstractions are more irregular and complex than sparse/dense matrices
  – Very sparse, very high-dimensional data
  – For example, there are few (if any) efficient distributed-memory parallel implementations of even the simplest algorithm for sparse, arbitrary graphs!
• We must integrate across algorithms, programming models, and architectures, to address the growing requirements of grand challenge applications
Parallel Programming for Graph Analysis
• Ideally, we want to do 1 (or more) memory op/clock cycle/processor.
• However, we have to deal with:
  – Caches
    • Reduce latency by storing data close to the processor
  – Vectors
    • Amortize latency by fetching N words at a time
  – Parallelism
    • Hide latency by switching tasks
    • Little’s law: concurrency = bandwidth * latency
Hiding memory latency
35Parallel Programming for Graph Analysis
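Little's law above can be made concrete with a worked instance. The bandwidth and latency numbers below are illustrative assumptions, not measurements from any machine in this tutorial:

```python
# Little's law applied to the memory system:
#   concurrency (outstanding requests) = bandwidth * latency.
# If we want to sustain the full bandwidth, we must keep this many
# memory operations in flight at once.

bandwidth_words_per_ns = 10.0   # assumed sustained bandwidth, 64-bit words/ns
latency_ns = 100.0              # assumed round-trip memory latency

concurrency = bandwidth_words_per_ns * latency_ns  # words in flight needed
```

A thousand in-flight references is far beyond what a handful of cache-miss buffers can supply, which is exactly why latency-tolerant designs rely on many hardware threads instead.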
• Tolerates latency by extreme multithreading
  – Each processor supports 128 hardware threads
  – Context switch in a single clock tick
  – “No” cache or local memory
  – Context switch on memory request
  – Multiple outstanding loads (configurable, often 180)
• Remote memory requests do not stall processors
  – Other streams work while the request gets fulfilled
• Light-weight, word-level synchronization– Minimizes access conflicts
• Hashed global shared memory– 64-byte granularity, minimizes hotspots
• High-productivity graph analysis!
Cray XMT Operation
36Parallel Programming for Graph Analysis
Image: Swiss NationalSupercomputer Centre
XMT ThreadStorm Processor (logical view)
[Diagram: programs running in parallel (serial code, subproblem A, subproblem B) decompose into concurrent threads of computation (i = 1 … n); these map onto the processor's 128 hardware streams, feeding the instruction-ready pool and the pipeline of executing instructions, with unused streams idle.]
Slide Credit: Cray, Inc.
37Parallel Programming for Graph Analysis
XMT ThreadStorm System (logical view)
[Diagram: the same programs and concurrent threads of computation, now multithreaded across multiple processors of the system.]
Slide Credit: Cray, Inc.
38Parallel Programming for Graph Analysis
Scale to larger problems by adding more nodes. Thousands possible... 0.5 TiB per 16 processors (XMT2).
What is less important on XMT:
• Placing data near computation
• Modifying shared data
• Accessing data in order
• Using indirection or linked data structures
• Partitioning the program into independent, balanced computations
• Using adaptive or dynamic computations
• Minimizing synchronization operations
Many costs are amortized with memory access!
39Parallel Programming for Graph Analysis
• Architecture supports up to 2 TiB, commercially available.
• Tolerates some latency by hyperthreading.
– Hardware: 2 threads / core, 10 cores / socket, four sockets.
– Fast cores (2.4 GHz), fast memory (1,066 MHz).
– Not so many outstanding memory requests (60/socket), but large caches (30 MiB L3 per socket).
• Good system support
– Transparent hugepages reduces TLB costs.
– Fast, user-level locking. (HLE may be better...)
– Widely available tuned primitives
• Also high-productivity graph analysis!
Intel E7-8870-based platforms
40Parallel Programming for Graph Analysis
Image: Intel(tm) press kit
[Plot: performance rate (million traversed edges/s), 0–60 on the y-axis, vs. problem size (log2 # of vertices), 10–26 on the x-axis.]
Serial Performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8MB L3 cache)
Input: Synthetic R-MAT graphs (# of edges m = 8n)
The problems: #1. The locality challenge: “Large memory footprint, low spatial and temporal locality impede performance”
~5X drop in performance between the “no LLC misses” and “O(m) LLC misses” regimes
41Parallel Programming for Graph Analysis
LLC: Last-Level Cache
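The O(m) LLC misses come from data-dependent neighbor accesses: in the standard compressed sparse row (CSR) layout, the offset array is read contiguously but the targets send the traversal to arbitrary vertex-indexed locations. A minimal CSR construction in Python, on an assumed toy edge list:

```python
# Build a CSR adjacency (offsets, targets) from a directed edge list.
# Neighbors of v are targets[offsets[v]:offsets[v+1]]; reading them is
# sequential, but following them to per-vertex state is the random,
# data-dependent access pattern that defeats caches on large graphs.

def to_csr(edges, n):
    """edges: list of (src, dst); n: vertex count. Returns (offsets, targets)."""
    counts = [0] * (n + 1)
    for s, _ in edges:
        counts[s + 1] += 1
    for i in range(n):            # prefix sums -> row offsets
        counts[i + 1] += counts[i]
    offsets = list(counts)
    targets = [0] * len(edges)
    fill = list(offsets)          # next free slot per source vertex
    for s, t in edges:
        targets[fill[s]] = t
        fill[s] += 1
    return offsets, targets

offsets, targets = to_csr([(0, 1), (0, 2), (1, 2)], 3)
```

Once the graph exceeds the last-level cache, nearly every one of those target lookups is a miss, which matches the roughly 5X performance drop described above.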
• Graph topology assumptions in classical algorithms do not match real-world datasets
• Parallelization strategies at loggerheads with techniques for enhancing memory locality
• Classical “work-efficient” graph algorithms may not fully exploit new architectural features
  – Increasing complexity of the memory hierarchy (x86), DMA support (Cell), wide SIMD, floating-point-centric cores (GPUs)
• Tuning implementations to minimize parallel overhead is non-trivial
  – Shared memory: minimizing overhead of locks, barriers
  – Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication with computation
The problems: #2. The parallel scaling challenge “Classical parallel graph algorithms perform poorly on current parallel systems”
42Parallel Programming for Graph Analysis
Parallel Programming for Graph Analysis 43
The Big Picture
[Diagram: an analyst makes queries; a “window” is extracted from massive databases via a high-latency query; fast graph queries then run against the graph, which resides in the memory of a supercomputer.]
Graph Extraction
• Filtering massive (petabytes to exabytes) raw data:
  – HTML documents → a link graph
  – E-mail archives → a collaboration network
  – Research papers → a citation network
  – A large network → a subnetwork
<li><a href="http://www.ieee.org/fellow/">IEEE Fellow</a>
<li> <a href="http://www.nsf.gov/">National Science Foundation</a> CAREER Award. <li>Georgia Tech College of Computing 2007 Dean's Award.<li>2008 IBM Shared University Research Award <li>2006 IBM Faculty Fellowship Award<li>2008 NVIDIA Professor Partnership Award <li>2006-7 Microsoft Research award <li>UNM Regents' Lecturer. <li><a href="http://www.computer.org/">IEEE Computer Society</a>
<a href="http://www.computer.org/chapter/DVP/">Distinguished Speaker</a>
…
44Parallel Programming for Graph Analysis
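Turning raw HTML like the fragment above into link-graph edges can be sketched in a few lines. The regex-based parse and the source URL below are illustrative simplifications (a real extractor would use a proper HTML parser and URL normalization):

```python
# Extract (source_page, target) edges from each href in an HTML document.
import re

def extract_links(page_url, html):
    """Return directed link-graph edges from page_url to each href target."""
    hrefs = re.findall(r'href="([^"]+)"', html)
    return [(page_url, h) for h in hrefs]

html = '<li><a href="http://www.ieee.org/fellow/">IEEE Fellow</a>'
edges = extract_links("http://example.org/bio", html)
```

Run over a crawl, the union of these edge lists is exactly the link graph that the traversal and ranking algorithms from Part 2 then consume.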
Graph Analysis
• Irregular traversal of large graph data (multi-billion vertices, up to a trillion edges).
45Parallel Programming for Graph Analysis
Parallel Programming for Graph Analysis 46
Informatics Graphs are Even Tougher
• Very different from graphs in scientific computing!
  – Graphs can be enormous
  – Power-law distribution of the number of neighbors
  – Small-world property: no long paths
  – Very limited locality, not partitionable
  – Highly unstructured
  – Edges and vertices have types
• Experience in scientific computing applications provides only limited insight.
Six degrees of Kevin Bacon
Source: Seokhee Hong
46
David A. Bader 47
Architectural Requirements for Efficient Graph Analysis (Challenges)
• Runtime is dominated by latency– Random accesses to global address space– Perhaps many at once
• Essentially no computation to hide memory costs
• Access pattern is data dependent– Prefetching unlikely to help– Usually only want small part of cache line
• Potentially abysmal locality at all levels of memory hierarchy
47
David A. Bader 48
Architectural Requirements for Efficient Graph Analysis (Desired Features)
• A large memory capacity for random access
  – For example, 1 trillion edges can fit into 8 TB (instead of 8 PB) of main memory
• Low latency / high bandwidth
  – For small messages!
• Latency tolerant
• Light-weight synchronization mechanisms
• Global address space
  – No graph partitioning required
  – Avoid memory-consuming profusion of ghost nodes
  – No local/global numbering conversions
48
Parallel Programming for Graph Analysis
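The 8 TB figure above is a simple back-of-the-envelope computation; the 8-bytes-per-edge cost assumed here (one 64-bit word per stored edge entry) is an assumption chosen to be consistent with that figure:

```python
# A trillion edges at one 64-bit word each fits in 8 TB of RAM.

edges = 10**12
bytes_per_edge = 8  # assumed: one 64-bit word per edge entry

ram_needed_tb = edges * bytes_per_edge / 10**12  # terabytes
```
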
Breadth-first search comparison: Graph500
• Graph500 benchmark: BFS on an artificial graph
• 2^Size vertices, 16 * 2^Size edges
• TEPS: traversed edges per second
49
Rank | Machine | Owner | Size | TEPS
1 | NNSA/SC Blue Gene/Q Prototype II (4096 nodes / 65,536 cores) | NNSA and IBM Research, T.J. Watson | 32 | 254.3 B
2 | Hopper XE6 (1800 nodes / 43,200 cores) | LBNL | 37 | 113.3 B
3 | Lomonosov (4096 nodes / 32,768 cores) | Moscow State University | 37 | 103.2 B
17 | mirasol (quad-socket Intel E7-8870, 2.4 GHz, 10 cores / 20 threads per socket) | Intel/Georgia Tech (UCB implementation) | 28 | 5.1 B
20 | Dingus (Convey HC1ex, 4 FPGA) | SNL | 28 | 1.7 B
35 | Matterhorn (64-processor XMT2) | CSCS | 31 | 884.6 M
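A minimal sketch of the benchmark's kernel: a breadth-first search from a root, followed by the TEPS computation (traversed edges / elapsed time). The toy graph is an assumption for illustration; the real benchmark runs tuned parallel BFS over 2^Size-vertex synthetic graphs.

```python
# BFS over an adjacency-list graph, counting every edge examination,
# then TEPS = edges traversed per second of kernel time.
import time
from collections import deque

def bfs(adj, root):
    """adj: dict vertex -> list of neighbors. Returns (parent map, edges touched)."""
    parent = {root: root}
    touched = 0
    q = deque([root])
    while q:
        v = q.popleft()
        for w in adj[v]:
            touched += 1
            if w not in parent:
                parent[w] = v
                q.append(w)
    return parent, touched

# Toy undirected graph stored as directed adjacency in both directions.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
t0 = time.perf_counter()
parent, touched = bfs(adj, 0)
teps = touched / (time.perf_counter() - t0)
```

Measuring traversed edges rather than FLOPS is the point of the benchmark: it rewards exactly the memory-system behavior the previous slides argue is the bottleneck.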
Graph Analysis Performance: Multithreaded (Cray XMT) vs. Cache-based Multicore
• SSCA#2 network, SCALE 24 (16.77 million vertices and 134.21 million edges)
David A. Bader 50
Q&A
Parallel Programming for Graph Analysis 51