Parallel Graph Algorithms
Aydın Buluç ([email protected])
Lawrence Berkeley National Laboratory
CS267, Spring 2012 (April 10, 2012)
Some slides from: Kamesh Madduri, John Gilbert, Daniel Delling
Graph Preliminaries
Define: Graph G = (V,E), a set of vertices and a set of edges between vertices.
• n = |V| (number of vertices)
• m = |E| (number of edges)
• D = diameter (max # of hops between any pair of vertices)
• Edges can be directed or undirected, weighted or not.
• They can even have attributes (i.e., semantic graphs).
• A sequence of edges <u1,u2>, <u2,u3>, …, <un-1,un> is a path from u1 to un. Its length is the sum of its weights.
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
  A. Graph traversals: Breadth-first search
  B. Shortest Paths: Delta-stepping, PHAST
  C. Ranking: Betweenness centrality, PageRank
  D. Maximal Independent Sets: Luby's algorithm
  E. Strongly Connected Components
• Wrap-up: challenges on current systems
Routing in transportation networks
Point-to-point shortest paths on road networks: 15 seconds (naïve) → 10 microseconds (with preprocessing)
H. Bast et al., "Fast Routing in Road Networks with Transit Nodes", Science, April 27, 2007.
Internet and the WWW
• The world-wide web can be represented as a directed graph
  – Web search and crawl: traversal
  – Link analysis, ranking: PageRank and HITS
  – Document classification and clustering
• Internet topologies (router networks) are naturally modeled as graphs
Scientific Computing
• Reorderings for sparse solvers
  – Fill-reducing orderings: partitioning, eigenvectors
  – Heavy diagonal to reduce pivoting (matching)
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization
  – Graph colorings, spanning trees
• Preconditioning
  – Incomplete factorizations
  – Partitioning for domain decomposition
  – Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  – Support theory: spanning trees & graph embedding techniques
[Figure: filled graph G+(A) (chordal) on vertices 1-10]
B. Hendrickson, "Graphs and HPC: Lessons for Future Architectures", http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf
Image source: Yifan Hu, "A gallery of large graphs"
Large-scale data analysis
• Graph abstractions are very useful to analyze complex data sets.
• Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
• Challenges: data size, heterogeneity, uncertainty, data quality
• Astrophysics: massive datasets, temporal variations
• Bioinformatics: data quality, heterogeneity
• Social informatics: new analytics challenges, data uncertainty
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com
Graph-theoretic problems in social networks
– Targeted advertising: clustering and centrality
– Studying the spread of information
Image source: Nexus (Facebook application)
Network Analysis for Intelligence and Surveillance
• [Krebs '04] Post-9/11 terrorist network analysis from public-domain information
• Plot masterminds correctly identified from interaction patterns: centrality
• A global view of entities is often more insightful
• Detect anomalous activities by exact/approximate subgraph isomorphism.
Image sources: http://www.orgnet.com/hijackers.html; T. Coffman, S. Greenblatt, S. Marcus, "Graph-based technologies for intelligence analysis", CACM 47(3), March 2004, pp. 45-47.
Research in Parallel Graph Algorithms
The design space spans three interacting axes:
• Application areas: social network analysis (find central entities, community detection, network dynamics), WWW (marketing, social search), computational biology (gene regulation, metabolic pathways, genomics), scientific computing (graph partitioning, matching, coloring), engineering (VLSI CAD, route planning)
• Methods/problems (graph algorithms): traversal, shortest paths, connectivity, max flow, …
• Architectures: GPUs, FPGAs; x86 multicore servers; massively multithreaded architectures (Cray XMT, Sun Niagara); multicore clusters (NERSC Hopper); clouds (Amazon EC2)
Cross-cutting challenges: data size, problem complexity
Characterizing graph-theoretic computations
Input data → Problem: find *** (paths, clusters, partitions, matchings, patterns, orderings) → Graph kernel (traversal, shortest-path algorithms, flow algorithms, spanning tree algorithms, topological sort, …)
Factors that influence the choice of algorithm:
• graph sparsity (m/n ratio)
• static/dynamic nature
• weighted/unweighted, weight distribution
• vertex degree distribution
• directed/undirected
• simple/multi/hyper graph
• problem size
• granularity of computation at nodes/edges
• domain-specific characteristics
Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
  A. Graph traversals: Breadth-first search
  B. Shortest Paths: Delta-stepping, PHAST
  C. Ranking: Betweenness centrality, PageRank
  D. Maximal Independent Sets: Luby's algorithm
  E. Strongly Connected Components
• Wrap-up: challenges on current systems
The PRAM model
• Many PRAM graph algorithms in the 1980s.
• Idealized parallel shared-memory system model
• Unbounded number of synchronous processors; no synchronization or communication cost; no parallel overhead
• Variants: EREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write)
• Measuring performance: space and time complexity; total number of operations (work)
PRAM Pros and Cons
• Pros
  – Simple and clean semantics.
  – The majority of theoretical parallel algorithms are designed using the PRAM model.
  – Independent of the communication network topology.
• Cons
  – Not realistic: too powerful a communication model.
  – Communication costs are ignored.
  – Synchronized processors.
  – No local memory.
  – Big-O notation is often misleading.
Building blocks of classical PRAM graph algorithms
• Prefix sums
• Symmetry breaking
• Pointer jumping
• List ranking
• Euler tours
• Vertex collapse
• Tree contraction
[Some covered in the "Tricks with Trees" lecture]
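The prefix-sum primitive at the top of this list is worth seeing concretely. Below is a minimal serial sketch of the classic Blelloch up-sweep/down-sweep scan; the pair updates inside each round touch disjoint indices, which is what makes it an O(log n)-span, O(n)-work PRAM building block. The function name and the power-of-two restriction are illustrative choices, not from the lecture.

```python
# Blelloch-style exclusive prefix sum. Each while-round performs independent
# pair updates, so on a PRAM every round is one parallel step:
# O(log n) span, O(n) work. This sketch runs the rounds serially.
def exclusive_scan(values):
    n = len(values)
    assert n and (n & (n - 1)) == 0, "sketch assumes power-of-two length"
    a = list(values)
    # Up-sweep (reduce): build partial sums in a binary-tree pattern.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):  # independent pairs
            a[i] += a[i - stride]
        stride *= 2
    # Down-sweep: distribute prefixes back down the tree.
    a[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            a[i - stride], a[i] = a[i], a[i] + a[i - stride]
        stride //= 2
    return a

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```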
Work / Span Model
• tp = execution time on p processors
• t1 = work
• t∞ = span (also called critical-path length or computational depth)
WORK LAW: tp ≥ t1/p
SPAN LAW: tp ≥ t∞
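To make the two laws concrete, here is a tiny sketch (with made-up numbers for t1 and t∞, not measurements from the lecture) that evaluates the lower bound they jointly impose on tp:

```python
# Work/span laws in action: t1 = total work, tinf = span (critical path).
# Both laws bound tp from below, so tp >= max(t1/p, tinf).
def lower_bound(t1, tinf, p):
    return max(t1 / p, tinf)

t1, tinf = 1000.0, 10.0      # parallelism = t1/tinf = 100
for p in (1, 10, 100, 1000):
    print(p, lower_bound(t1, tinf, p))
# Beyond p = parallelism (100 here), the span law dominates: adding
# processors cannot push tp below tinf.
```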
Data structures: graph representation
Static case
• Dense graphs (m = Θ(n²)): adjacency matrix commonly used.
• Sparse graphs: adjacency lists, compressed sparse matrices
Dynamic case
• Representation depends on the common-case query
• Edge insertions or deletions? Vertex insertions or deletions? Edge weight updates?
• Graph update rate
• Queries: connectivity, paths, flow, etc.
• Optimizing for locality is a key design consideration.
Graph representations
Compressed sparse rows (CSR) = cache-efficient adjacency lists.
Example (4 vertices; directed, weighted edges 1→1 (12), 1→3 (26), 3→2 (19), 4→2 (14), 4→4 (7)):
• Index into adjacency array (row pointers in CSR): 1 3 3 4 6
• Adjacencies (column ids in CSR): 1 3 2 2 4
• Weights (numerical values in CSR): 12 26 19 14 7
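The same CSR arrays can be sketched in a few lines of code; 0-based indexing is used here instead of the slide's 1-based indexing, so the pointer and id arrays are shifted accordingly:

```python
# Minimal CSR adjacency structure for the 4-vertex example above (0-based).
# Edges: 0->0 (12), 0->2 (26), 2->1 (19), 3->1 (14), 3->3 (7).
row_ptr = [0, 2, 2, 3, 5]        # index into adjacency array, length n+1
col_ids = [0, 2, 1, 1, 3]        # adjacencies, length m
weights = [12, 26, 19, 14, 7]    # numerical values, length m

def neighbors(u):
    # All out-edges of u sit contiguously: cache-friendly, unlike linked lists.
    return [(col_ids[k], weights[k]) for k in range(row_ptr[u], row_ptr[u + 1])]

print(neighbors(0))  # [(0, 12), (2, 26)]
print(neighbors(1))  # []  (vertex 1 has no out-edges)
print(neighbors(3))  # [(1, 14), (3, 7)]
```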
Distributed graph representations
• Each processor stores the entire graph ("full replication")
• Each processor stores n/p vertices and all adjacencies out of these vertices ("1D partitioning")
• How to create these p vertex partitions?
  – Graph partitioning algorithms: recursively optimize for conductance (edge cut / size of smaller partition)
  – Randomly shuffling the vertex identifiers ensures that the edge count per processor is roughly the same
2D checkerboard distribution
• Consider a logical 2D processor grid (pr × pc = p) and the matrix representation of the graph
• Assign each processor a sub-matrix (i.e., the edges within the sub-matrix)
[Figure: 9 vertices, 9 processors, 3×3 processor grid; the sparse adjacency matrix is flattened into a per-processor local graph representation]
High-performance graph algorithms
• Implementations are typically array-based for performance (e.g., CSR representation).
• Concurrency: minimize synchronization (span).
• Where is the data? Find the distribution that minimizes inter-processor communication.
• Memory access patterns
  – Regularize them (spatial locality)
  – Minimize DRAM traffic (temporal locality)
• Work-efficiency
  – Is (parallel time) × (# of processors) = (serial work)?
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
  A. Graph traversals: Breadth-first search
  B. Shortest Paths: Delta-stepping, PHAST
  C. Ranking: Betweenness centrality, PageRank
  D. Maximal Independent Sets: Luby's algorithm
  E. Strongly Connected Components
• Wrap-up: challenges on current systems
Graph traversal: Depth-first search (DFS)
[Figure: example graph with a source vertex and preorder vertex numbers 1-9]
Parallelizing DFS is a bad idea: span(DFS) = O(n)
J.H. Reif, "Depth-first search is inherently sequential", Inform. Process. Lett. 20 (1985) 229-234.
Graph traversal: Breadth-first search (BFS)
[Figure: example graph; input: source vertex; output: distance from the source vertex for every vertex (levels 1-4)]
Memory requirements (# of machine words):
• Sparse graph representation: m + n
• Stack of visited vertices: n
• Distance array: n
Parallel BFS Strategies
1. Expand the current frontier (level-synchronous approach, suited for low-diameter graphs)
   • O(D) parallel steps
   • Adjacencies of all vertices in the current frontier are visited in parallel
2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs)
   • Path-limited searches from "super vertices"
   • APSP between "super vertices"
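A minimal serial sketch of strategy 1 (level-synchronous BFS); the loops over the frontier are the part a parallel implementation would execute concurrently, with the marked race resolved atomically. Names and the toy graph are illustrative:

```python
# Level-synchronous BFS: each while-iteration is one parallel step in which
# all adjacencies of the current frontier could be explored concurrently;
# here they are explored serially for clarity.
def bfs_levels(adj, source):
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:            # these iterations are independent ...
            for v in adj[u]:          # ... so a parallel BFS visits them at once
                if v not in dist:     # this check-and-set must be atomic in parallel
                    dist[v] = level
                    next_frontier.append(v)
        frontier = next_frontier      # implicit barrier between levels
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(bfs_levels(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```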
A deeper dive into the "level synchronous" strategy
Locality: where are the random accesses originating from?
1. Ordering of vertices in the "current frontier" array, i.e., accesses to the adjacency indexing array; cumulative accesses O(n).
2. Ordering of the adjacency list of each vertex; cumulative O(m).
3. Sifting through adjacencies to check whether visited or not; cumulative accesses O(m).
[Figure: example graph with vertex labels 0, 11, 26, 31, 44, 53, 63, 74, 84, 93]
1. Access pattern: idx array: 53, 31, 74, 26
2,3. Access pattern: d array: 0, 84, 0, 84, 93, 44, 63, 0, 0, 11
Performance Observations
[Figures: for the Youtube and Flickr social networks, the percentage of total # of vertices in the frontier array and of total execution time, per BFS phase (1-15); annotations mark graph expansion and edge filtering]
Improving locality: Vertex relabeling
• Well-studied problem, slight differences in problem formulations
  – Linear algebra: sparse matrix column reordering to reduce bandwidth, reveal dense blocks.
  – Databases/data mining: reordering bitmap indices for better compression; permuting vertices of WWW snapshots, online social networks for compression
• NP-hard problem, several known heuristics
  – We require fast, linear-work approaches
  – Existing ones: BFS- or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploiting overlap in adjacency lists, dimensionality reduction
[Figure: sparse matrix nonzero patterns before and after relabeling]
Improving locality: Optimizations
• Recall: potential O(m) non-contiguous memory references in edge traversal (to check if a vertex is visited).
  – e.g., access order: 53, 31, 31, 26, 74, 84, 0, …
• Objective: reduce TLB misses and private cache misses; exploit shared cache.
• Optimizations:
  1. Sort the adjacency list of each vertex: helps order memory accesses, reduces TLB misses.
  2. Permute vertex labels: enhances spatial locality.
  3. Cache-blocked edge visits: exploit temporal locality.
Improving locality: Cache blocking
• Instead of processing the adjacencies of each vertex serially, exploit the sorted adjacency list structure with blocked accesses
• Requires multiple passes through the frontier array; tuning for optimal block size
• Note: frontier array size may be O(n)
[Figure: linear processing vs. the cache-blocked approach over the frontier and adjacency (d) arrays, with metadata denoting the blocking pattern]
• Tune to L2 cache size
• Process high-degree vertices separately
Parallel performance (Orkut graph)
[Figure: performance rate (million traversed edges/s, 0-1000) vs. number of threads (1, 2, 4, 8) for cache-blocked BFS and the baseline]
• Parallel speedup: 4.9
• Speedup over baseline: 2.9
• Execution time: 0.28 seconds (8 threads)
Graph: 3.07 million vertices, 220 million edges. Single socket of Intel Xeon 5560 (Core i7).
Graph 500 "Search" Benchmark (graph500.org)
• BFS (from a single vertex) on a static, undirected R-MAT network with average vertex degree 16.
• Evaluation criterion: highest performance rate (edges traversed per second) achieved on a system.
• Reference MPI and shared-memory implementations provided.
• The NERSC Hopper system is ranked #2 on the current list (Nov 2011): over 100 billion edges traversed per second.
Breadth-first search in the language of linear algebra
[Figure: directed graph on vertices 1-7 and its sparse adjacency matrix transpose AT (rows = "to", columns = "from")]
[Figure: multiplying AT by the frontier indicator vector X discovers the next frontier; a parents vector records discovered vertices]
Replace scalar operations:
• multiply → select
• add → minimum
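A toy sketch of this semiring view, using a dense matrix for clarity (the function and encoding are illustrative, not the lecture's code): "multiply" becomes "select a parent candidate" and "add" becomes "take the minimum label".

```python
# BFS as repeated matrix-vector products over a (select, min) semiring.
# AT[v][u] = 1 iff there is an edge u->v.
def bfs_parents(AT, source):
    n = len(AT)
    parents = [None] * n
    parents[source] = source              # root is its own parent
    x = {source}                          # current frontier
    while x:
        y = {}
        for v in range(n):
            if parents[v] is None:
                cands = [u for u in x if AT[v][u]]   # multiply -> select
                if cands:
                    y[v] = min(cands)                # add -> minimum
        for v, p in y.items():
            parents[v] = p
        x = set(y)                        # discovered vertices form next frontier
    return parents

# Edges: 0->1, 0->2, 1->3, 2->3. Vertex 3 is reached from both 1 and 2;
# the minimum label (1) wins as its parent.
AT = [[0, 0, 0, 0],
      [1, 0, 0, 0],
      [1, 0, 0, 0],
      [0, 1, 1, 0]]
print(bfs_parents(AT, 0))  # [0, 0, 0, 1]
```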
[Figure: second iteration of AT·X; among multiple frontier vertices reaching the same vertex, the one with the minimum label is selected as parent]
[Figure: third iteration of AT·X; the parents vector grows as new vertices are discovered]
[Figure: final iteration of AT·X; the traversal terminates]
1D Parallel BFS algorithm
[Figure: AT and the frontier vector x; rows of AT are partitioned across processors]
ALGORITHM:
1. Find owners of the current frontier's adjacency [computation]
2. Exchange adjacencies via all-to-all [communication]
3. Update distances/parents for unvisited vertices [computation]
2D Parallel BFS algorithm
[Figure: AT and the frontier vector x; AT is distributed over a 2D processor grid]
ALGORITHM:
1. Gather vertices in processor column [communication]
2. Find owners of the current frontier's adjacency [computation]
3. Exchange adjacencies in processor row [communication]
4. Update distances/parents for unvisited vertices [computation]
BFS Strong Scaling
[Figures: communication time (seconds) and GTEPS vs. number of cores (5040, 10008, 20000, 40000) for 1D Flat MPI, 2D Flat MPI, 1D Hybrid, and 2D Hybrid]
• NERSC Hopper (Cray XE6, Gemini interconnect, AMD Magny-Cours)
• Hybrid: in-node 6-way OpenMP multithreading
• Graph500 (R-MAT): 4 billion vertices and 64 billion edges
A. Buluç, K. Madduri. Parallel breadth-first search on distributed memory systems. Proc. Supercomputing, 2011.
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
  A. Graph traversals: Breadth-first search
  B. Shortest Paths: Delta-stepping, PHAST
  C. Ranking: Betweenness centrality, PageRank
  D. Maximal Independent Sets: Luby's algorithm
  E. Strongly Connected Components
• Wrap-up: challenges on current systems
Parallel Single-source Shortest Paths (SSSP) algorithms
• Famous serial algorithms:
  – Bellman-Ford: label correcting; works on any graph
  – Dijkstra: label setting; requires nonnegative edge weights
• No known PRAM algorithm that runs in sub-linear time and O(m + n log n) work
• Ullman-Yannakakis randomized approach
• Meyer and Sanders, ∆-stepping algorithm
• Distributed-memory implementations based on graph partitioning
• Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.
∆-stepping algorithm
• Label-correcting algorithm: can relax edges from unsettled vertices also
• An "approximate bucket implementation of Dijkstra"
• For random edge weights in [0,1], runs in expected O(n + m + d·L) sequential time, where d is the maximum degree and L is the maximum distance from the source to any node
• Vertices are ordered using buckets of width ∆
• Each bucket may be processed in parallel
• ∆ < min w(e): degenerates into Dijkstra
• ∆ > max w(e): degenerates into Bellman-Ford
U. Meyer and P. Sanders, "∆-stepping: a parallelizable shortest path algorithm", Journal of Algorithms 49 (2003).
∆-stepping algorithm: illustration
∆ = 0.1 (say)
[Figure: example graph with 7 vertices (0-6) and edge weights 0.01, 0.02, 0.05, 0.07, 0.13, 0.15, 0.18, 0.23, 0.56; the d array (initially all ∞) and the buckets are updated phase by phase until d = 0, .03, .01, .06, .16, .29, .62]
Initialization: insert s into a bucket, d(s) = 0.
One parallel phase, while (bucket is non-empty):
  i) Inspect light (w < ∆) edges
  ii) Construct a set of "requests" (R)
  iii) Clear the current bucket
  iv) Remember deleted vertices (S)
  v) Relax request pairs in R
Then relax heavy request pairs (from S) and go on to the next bucket.
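The phases above can be sketched serially; the request set R built in each phase is exactly what a parallel implementation would relax concurrently. Function names and the toy graph are illustrative, not from the lecture:

```python
import math

# Serial sketch of Delta-stepping: buckets of width delta; light edges
# (w < delta) are relaxed phase by phase inside a bucket, heavy edges
# once per settled bucket.
def delta_stepping(adj, source, delta):
    # adj: {u: [(v, w), ...]}
    d = {u: math.inf for u in adj}
    buckets = {}
    def relax(v, x):
        if x < d[v]:
            if d[v] != math.inf:                       # move v out of old bucket
                buckets.get(int(d[v] / delta), set()).discard(v)
            d[v] = x
            buckets.setdefault(int(x / delta), set()).add(v)
    relax(source, 0.0)
    while buckets:
        i = min(buckets)
        settled = set()                                # the set S
        while buckets.get(i):
            frontier = buckets.pop(i)
            settled |= frontier
            requests = [(v, d[u] + w)                  # request pairs R
                        for u in frontier for v, w in adj[u] if w < delta]
            for v, x in requests:                      # parallel in a real impl.
                relax(v, x)
        for u in settled:                              # heavy edges, once
            for v, w in adj[u]:
                if w >= delta:
                    relax(v, d[u] + w)
        buckets = {k: s for k, s in buckets.items() if s}
    return d

adj = {0: [(1, 0.05), (2, 0.6)], 1: [(2, 0.07)], 2: []}
print({v: round(x, 2) for v, x in delta_stepping(adj, 0, 0.1).items()})
# {0: 0.0, 1: 0.05, 2: 0.12}
```

Here the light path 0→1→2 (0.12) correctly beats the heavy edge 0→2 (0.6), because vertex 2's tentative distance set by the heavy relaxation is later improved within bucket processing.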
No. of phases (machine-independent performance count)
[Figure: number of phases (log scale, 10 to 1,000,000) for graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE; low-diameter families need few phases, high-diameter families many]
Too many phases in high-diameter graphs: level-synchronous breadth-first search has the same problem.
Average shortest path weight for various graph families
~2^20 vertices, 2^22 edges, directed graph, edge weights normalized to [0,1]
[Figure: average shortest path weight (log scale, 0.01 to 100,000) per graph family]
Complexity grows with L, the maximum distance (shortest path weight).
PHAST – hardware-accelerated shortest path trees
Preprocessing: build a contraction hierarchy
• Order nodes by importance (highway dimension)
• Process in order
• Add shortcuts to preserve distances
• Assign levels (ca. 150 in road networks)
• ~75% increase in number of edges (for road networks)
D. Delling, A. V. Goldberg, A. Nowatzyk, and R. F. Werneck, "PHAST: Hardware-Accelerated Shortest Path Trees", IPDPS, IEEE Computer Society, 2011.
PHAST – hardware-accelerated shortest path trees
One-to-all search from source s:
• Run forward search from s
• Only follow edges to more important nodes
• Set distance labels d of reached nodes
PHAST – hardware-accelerated shortest path trees
From top-down:
• Process all nodes u in reverse level order
• Check incoming arcs (v,u) with lev(v) > lev(u)
• Set d(u) = min{d(u), d(v) + w(v,u)}
PHAST – hardware-accelerated shortest path trees
From top-down:
• Linear sweep without a priority queue
• Reorder nodes, arcs, and distance labels by level
• Accesses to the d array become contiguous (no jumps)
• Parallelize with SIMD (SSE instructions) and/or GPU
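The reverse-level sweep can be sketched directly from the bullets above; the contraction-hierarchy structure (levels, upward-search labels) is assumed given, and the toy instance is illustrative:

```python
import math

# Sketch of the PHAST downward sweep: after the forward upward search sets
# tentative labels d, nodes are processed in reverse level order and pulled
# over incoming arcs from higher levels. With nodes pre-sorted by level this
# is a linear, queue-free sweep (and SIMD/GPU-friendly).
def downward_sweep(order, incoming, d):
    # order: nodes in reverse level order (highest level first)
    # incoming[u]: [(v, w), ...] arcs with lev(v) > lev(u)
    for u in order:
        for v, w in incoming[u]:
            d[u] = min(d[u], d[v] + w)   # d(u) = min{d(u), d(v)+w(v,u)}
    return d

# Toy instance: levels 2 > 1 > 0; the upward search already settled
# d[2] = 0 (source side) and d[1] = 3.
d = {2: 0, 1: 3, 0: math.inf}
incoming = {2: [], 1: [(2, 3)], 0: [(2, 7), (1, 2)]}
print(downward_sweep([2, 1, 0], incoming, d))  # {2: 0, 1: 3, 0: 5}
```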
PHAST – performance comparison
[Figure/table: CPU and GPU implementations on Europe/USA road network inputs, with edge weights as estimated travel times and as physical distances]
PHAST – hardware accelerated shortest path trees
• Specifically designed for road networks• Fill-in can be much higher for other graphs (Hint: think about sparse Gaussian Elimination)
• Road networks are (almost) planar.
• Planar graphs have separators.
• Good separators lead to orderings with minimal fill.
R. J. Lipton and R. E. Tarjan. "A separator theorem for planar graphs". SIAM Journal on Applied Mathematics 36(2): 177–189, 1979.
A. George. "Nested dissection of a regular finite element mesh". SIAM Journal on Numerical Analysis 10(2): 345–363, 1973.
![Page 65: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/65.jpg)
• Applications• Designing parallel graph algorithms• Case studies:
A. Graph traversals: Breadth-first searchB. Shortest Paths: Delta-stepping, PHASTC. Ranking: Betweenness centrality, PageRankD. Maximal Independent Sets: Luby’s algorithmE. Strongly Connected Components
• Wrap-up: challenges on current systems
Lecture Outline
![Page 66: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/66.jpg)
Betweenness Centrality (BC)
• Centrality: quantitative measure to capture the importance of a vertex/edge in a graph
– degree, closeness, eigenvalue, betweenness
• Betweenness Centrality:

BC(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st

(σ_st: number of shortest paths between s and t; σ_st(v): number of those passing through v)
• Applied to several real-world networks:
– Social interactions
– WWW
– Epidemiology
– Systems biology
![Page 67: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/67.jpg)
Algorithms for Computing Betweenness
• All-pairs shortest paths approach: compute the length and number of shortest paths between all s-t pairs (O(n³) time), then sum up the fractional dependency values (O(n²) space).
• Brandes' algorithm (2001): augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space on unweighted graphs.
δ(v): dependency of the source on v; σ(w): number of shortest paths from the source to w.
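Brandes' algorithm for an unweighted graph can be sketched in Python: a BFS per source that counts shortest paths and records predecessors, followed by dependency accumulation in reverse BFS order. A minimal sequential sketch, not the parallel version discussed later:

```python
from collections import deque

def brandes_bc(adj, n):
    """Brandes' betweenness centrality, unweighted graph (sketch).
    adj: adjacency lists; n: number of vertices."""
    bc = [0.0] * n
    for s in range(n):
        # BFS from s: sigma counts shortest paths, pred records predecessors.
        sigma = [0] * n
        sigma[s] = 1
        dist = [-1] * n
        dist[s] = 0
        pred = [[] for _ in range(n)]
        stack = []                      # vertices in order of discovery
        q = deque([s])
        while q:
            u = q.popleft()
            stack.append(u)
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    pred[v].append(u)
        # Dependency accumulation in reverse BFS order:
        # delta(v) += sigma(v)/sigma(w) * (1 + delta(w)) for each w with v in pred(w)
        delta = [0.0] * n
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

On an undirected graph each pair (s,t) is counted in both directions, so scores can be halved if unordered pairs are intended.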
![Page 68: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/68.jpg)
Parallel BC algorithms for unweighted graphs
• High-level idea: Level-synchronous parallel breadth-first search augmented to compute centrality scores
• Two steps (both implemented as BFS)– traversal and path counting– dependency accumulation
Shared-memory parallel algorithms for betweenness centrality:
• Exact algorithm: O(mn) work, O(m+n) space, O(nD + nm/p) time.
• Improved with lower synchronization overhead and fewer non-contiguous memory references.
K. Madduri, D. Ediger, K. Jiang, D.A. Bader, and D. Chavarria-Miranda. A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. MTAAP 2009
D.A. Bader and K. Madduri. Parallel algorithms for evaluating centrality indices in real-world networks, ICPP’08
![Page 69: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/69.jpg)
Parallel BC Algorithm Illustration
[Figure: example graph with vertices 0–9]
![Page 70: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/70.jpg)
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distanceand path counts.
[Figure: example graph; traversal begins at the source vertex]
![Page 71: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/71.jpg)
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distanceand path counts.
[Figure: after the first level — distance (D), path count, stack (S), and predecessor (P) values set for the source's neighbors]
![Page 72: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/72.jpg)
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distanceand path counts.
[Figure: frontier at level 2 — D, path counts, and predecessor multisets updated for newly visited vertices]
Level-synchronous approach: The adjacencies of all vertices in the current frontier can be visited in parallel
![Page 73: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/73.jpg)
Parallel BC Algorithm Illustration
1. Traversal step: at the end, we have all reachable vertices,their corresponding predecessor multisets, and D values.
[Figure: final state — all reachable vertices with their D values, path counts, and predecessor multisets P]
![Page 74: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/74.jpg)
Graph traversal step locality analysis
for all vertices u at level d in parallel do
    for all adjacencies v of u in parallel do
        dv = D[v];
        if (dv < 0)                              // v visited for the first time
            vis = fetch_and_add(&Visited[v], 1);
            if (vis == 0)
                D[v] = d + 1;
                pS[count++] = v;                 // add v to local thread stack
                fetch_and_add(&sigma[v], sigma[u]);
                fetch_and_add(&Scount[u], 1);
        if (dv == d + 1)
            fetch_and_add(&sigma[v], sigma[u]);
            fetch_and_add(&Scount[u], 1);
All the vertices are in a contiguous block (stack)
All the adjacencies of a vertex are stored compactly (graph rep.)
Store to S[u]
Non-contiguous memory accesses: D[v], Visited[v], sigma[v]
Better cache utilization likely if D[v], Visited[v], sigma[v] are stored contiguously
![Page 75: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/75.jpg)
Parallel BC Algorithm Illustration
2. Accumulation step: Pop vertices from stack, update dependence scores.
[Figure: stack S, predecessor multisets P, and Delta values during accumulation]
δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) · (1 + δ(w))
![Page 76: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/76.jpg)
Parallel BC Algorithm Illustration
2. Accumulation step: Can also be done in a level-synchronous manner.
[Figure: level-synchronous accumulation on the same example]
![Page 77: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/77.jpg)
Distributed memory BC algorithm
[Figure: multiple-source BFS expressed as a masked sparse matrix-matrix product, (AᵀX) .* ¬B]
Work-efficient parallel breadth-first search via parallel sparse matrix-matrix multiplication over semirings
Encapsulates three levels of parallelism:
1. columns(B): multiple BFS searches in parallel
2. columns(Aᵀ) + rows(B): parallel over frontier vertices in each BFS
3. rows(Aᵀ): parallel over incident edges of each frontier vertex
A. Buluç, J.R. Gilbert. The Combinatorial BLAS: Design, implementation, and applications. The International Journal of High Performance Computing Applications, 2011.
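The semiring view of a single BFS can be sketched in pure Python: the frontier is a sparse vector, one step is y = Aᵀx over a boolean-like semiring, and the parent array acts as the "visited" mask. A toy sketch (names and data layout are illustrative, not the Combinatorial BLAS API):

```python
def bfs_spmv(adj, n, s):
    """BFS as repeated masked sparse matrix-vector products (sketch).
    adj: adjacency lists; n: number of vertices; s: source."""
    parent = [-1] * n
    parent[s] = s
    x = {s}                          # current frontier (sparse vector)
    levels = [x.copy()]
    while x:
        y = {}                       # y = A^T x, masked by unvisited vertices
        for u in x:
            for v in adj[u]:
                if parent[v] == -1 and v not in y:
                    y[v] = u         # remember the discovering parent
        for v, u in y.items():
            parent[v] = u
        x = set(y)
        if x:
            levels.append(x.copy())
    return parent, levels
```

Running several sources at once turns x into a sparse matrix and the step into a sparse matrix-matrix product, which is where the third level of parallelism comes from.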
![Page 78: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/78.jpg)
Web graph: PageRank (Google) [Brin, Page]
• Markov process: follow a random link most of the time; otherwise, go to any page at random.
• Importance = stationary distribution of the Markov process.
• Transition matrix is p*A + (1-p)*ones(size(A)), scaled so each column sums to 1.
• Importance of page i is the i-th entry in the principal eigenvector of the transition matrix.
• But the matrix is 1,000,000,000,000 by 1,000,000,000,000.
An important page is one that many important pages point to.
![Page 79: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/79.jpg)
A PageRank Matrix
Importance ranking of web pages
Stationary distribution of a Markov chain
Power method: sparse matrix-vector multiplication and vector arithmetic
Matlab*P page ranking demo (from SC’03) on a web crawl of mit.edu (170,000 pages)
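The power method amounts to repeated sparse matrix-vector multiplication with the transition matrix, never forming it explicitly. A minimal sketch (damping factor and iteration count are illustrative choices; dangling pages spread their rank uniformly):

```python
def pagerank(out_links, n, p=0.85, iters=50):
    """PageRank by the power method (sketch).
    out_links[i] lists the pages that page i links to."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - p) / n] * n            # teleport term
        for i, links in enumerate(out_links):
            if links:
                share = p * rank[i] / len(links)
                for j in links:
                    new[j] += share
            else:                            # dangling page: uniform spread
                for j in range(n):
                    new[j] += p * rank[i] / n
        rank = new
    return rank
```

Each iteration touches every edge once, so the cost per iteration is O(m) plus O(n) vector arithmetic.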
![Page 80: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/80.jpg)
• Applications• Designing parallel graph algorithms• Case studies:
A. Graph traversals: Breadth-first searchB. Shortest Paths: Delta-stepping, PHASTC. Ranking: Betweenness centrality, PageRankD. Maximal Independent Sets: Luby’s algorithmE. Strongly Connected Components
• Wrap-up: challenges on current systems
Lecture Outline
![Page 81: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/81.jpg)
Maximal Independent Set
[Figure: example graph with vertices 1–8]
• Graph with vertices V = {1,2,…,n}
• A set S of vertices is independent if no two vertices in S are neighbors.
• An independent set S is maximal if it is impossible to add another vertex and stay independent
• An independent set S is maximum if no other independent set has more vertices
• Finding a maximum independent set is intractably difficult (NP-hard)
• Finding a maximal independent set is easy, at least on one processor.
The set of red vertices S = {4, 5} is independent and is maximal, but not maximum.
![Page 82: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/82.jpg)
Sequential Maximal Independent Set Algorithm
[Figure: example graph with vertices 1–8]
1. S = empty set;
2. for vertex v = 1 to n {
3. if (v has no neighbor in S) {
4. add v to S
5. }
6. }
S = { }
![Page 83: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/83.jpg)
[Figure: example graph with vertices 1–8]
1. S = empty set;
2. for vertex v = 1 to n {
3. if (v has no neighbor in S) {
4. add v to S
5. }
6. }
S = { 1 }
Sequential Maximal Independent Set Algorithm
![Page 84: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/84.jpg)
[Figure: example graph with vertices 1–8]
1. S = empty set;
2. for vertex v = 1 to n {
3. if (v has no neighbor in S) {
4. add v to S
5. }
6. }
S = { 1, 5 }
Sequential Maximal Independent Set Algorithm
![Page 85: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/85.jpg)
[Figure: example graph with vertices 1–8]
1. S = empty set;
2. for vertex v = 1 to n {
3. if (v has no neighbor in S) {
4. add v to S
5. }
6. }
S = { 1, 5, 6 }
work ~ O(n), but span ~O(n)
Sequential Maximal Independent Set Algorithm
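The sequential loop on the slides translates directly to Python. A minimal sketch of the greedy scan (O(n+m) work overall, but inherently sequential because each decision depends on all earlier ones):

```python
def sequential_mis(adj, n):
    """Greedy maximal independent set (sketch): scan vertices in order,
    adding each vertex that has no neighbor already in S."""
    in_s = [False] * n
    for v in range(n):
        if not any(in_s[u] for u in adj[v]):
            in_s[v] = True
    return [v for v in range(n) if in_s[v]]
```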
![Page 86: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/86.jpg)
Parallel, Randomized MIS Algorithm
[Figure: example graph with vertices 1–8]
1. S = empty set; C = V;
2. while C is not empty {
3. label each v in C with a random r(v);
4. for all v in C in parallel {
5. if r(v) < min( r(neighbors of v) ) {
6. move v from C to S;
7. remove neighbors of v from C;
8. }
9. }
10. }
S = { }C = { 1, 2, 3, 4, 5, 6, 7, 8 }
M. Luby. "A Simple Parallel Algorithm for the Maximal Independent Set Problem". SIAM Journal on Computing 15 (4): 1036–1053, 1986
![Page 87: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/87.jpg)
[Figure: example graph with vertices 1–8]
1. S = empty set; C = V;
2. while C is not empty {
3. label each v in C with a random r(v);
4. for all v in C in parallel {
5. if r(v) < min( r(neighbors of v) ) {
6. move v from C to S;
7. remove neighbors of v from C;
8. }
9. }
10. }
S = { }C = { 1, 2, 3, 4, 5, 6, 7, 8 }
Parallel, Randomized MIS Algorithm
![Page 88: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/88.jpg)
[Figure: example graph with vertices 1–8]
1. S = empty set; C = V;
2. while C is not empty {
3. label each v in C with a random r(v);
4. for all v in C in parallel {
5. if r(v) < min( r(neighbors of v) ) {
6. move v from C to S;
7. remove neighbors of v from C;
8. }
9. }
10. }
S = { }C = { 1, 2, 3, 4, 5, 6, 7, 8 }
[Figure: random labels r(v) drawn for the vertices, e.g. 2.6, 4.1, 5.9, 3.1, 1.2, 5.8, 9.3, 9.7]
Parallel, Randomized MIS Algorithm
![Page 89: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/89.jpg)
[Figure: example graph with vertices 1–8]
1. S = empty set; C = V;
2. while C is not empty {
3. label each v in C with a random r(v);
4. for all v in C in parallel {
5. if r(v) < min( r(neighbors of v) ) {
6. move v from C to S;
7. remove neighbors of v from C;
8. }
9. }
10. }
S = { 1, 5 }C = { 6, 8 }
[Figure: local minima join S; their neighbors leave C]
Parallel, Randomized MIS Algorithm
![Page 90: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/90.jpg)
[Figure: example graph with vertices 1–8]
1. S = empty set; C = V;
2. while C is not empty {
3. label each v in C with a random r(v);
4. for all v in C in parallel {
5. if r(v) < min( r(neighbors of v) ) {
6. move v from C to S;
7. remove neighbors of v from C;
8. }
9. }
10. }
S = { 1, 5 }C = { 6, 8 }
[Figure: new random labels drawn for the remaining candidates]
Parallel, Randomized MIS Algorithm
![Page 91: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/91.jpg)
[Figure: example graph with vertices 1–8]
1. S = empty set; C = V;
2. while C is not empty {
3. label each v in C with a random r(v);
4. for all v in C in parallel {
5. if r(v) < min( r(neighbors of v) ) {
6. move v from C to S;
7. remove neighbors of v from C;
8. }
9. }
10. }
S = { 1, 5, 8 }C = { }
[Figure: the remaining local minimum joins S; C becomes empty]
Parallel, Randomized MIS Algorithm
![Page 92: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/92.jpg)
[Figure: example graph with vertices 1–8]
1. S = empty set; C = V;
2. while C is not empty {
3. label each v in C with a random r(v);
4. for all v in C in parallel {
5. if r(v) < min( r(neighbors of v) ) {
6. move v from C to S;
7. remove neighbors of v from C;
8. }
9. }
10. }
Theorem: This algorithm “very probably” finishes within O(log n) rounds.
work ~ O(n log n), but span ~O(log n)
Parallel, Randomized MIS Algorithm
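The randomized rounds above can be sketched in Python. Sequential code, but each round's inner loop is exactly the part Luby's algorithm runs in parallel; the random seed is an illustrative choice:

```python
import random

def luby_mis(adj, n, seed=0):
    """Luby's randomized MIS (sketch): each round, every candidate draws
    a random label; local minima join S and knock their neighbors out of C."""
    rng = random.Random(seed)
    S, C = set(), set(range(n))
    while C:
        r = {v: rng.random() for v in C}
        # These membership tests are independent -> parallel in Luby's algorithm.
        winners = {v for v in C
                   if all(r[v] < r[u] for u in adj[v] if u in C)}
        S |= winners
        for v in winners:
            C.discard(v)
            for u in adj[v]:
                C.discard(u)
    return S
```

Termination is guaranteed because the candidate with the globally smallest label always wins its neighborhood, so C shrinks every round; the O(log n) bound on the expected number of rounds is the content of Luby's theorem.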
![Page 93: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/93.jpg)
• Applications• Designing parallel graph algorithms• Case studies:
A. Graph traversals: Breadth-first searchB. Shortest Paths: Delta-stepping, PHASTC. Ranking: Betweenness centrality, PageRankD. Maximal Independent Sets: Luby’s algorithmE. Strongly Connected Components
• Wrap-up: challenges on current systems
Lecture Outline
![Page 94: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/94.jpg)
[Figure: sparse matrix symmetrically permuted to block triangular form, with diagonal blocks {1}, {5}, {2,4,7}, {3,6}]
Strongly connected components
• Symmetric permutation to block triangular form
• Find P in linear time by depth-first search [Tarjan]
[Figure: the corresponding directed graph on vertices 1–7]
![Page 95: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/95.jpg)
Strongly connected components of directed graph
• Sequential: use depth-first search (Tarjan); work=O(m+n) for m=|E|, n=|V|.
• DFS seems to be inherently sequential.
• Parallel: divide-and-conquer and BFS (Fleischer et al.); worst-case span O(n) but good in practice on many graphs.
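The divide-and-conquer idea can be sketched as forward-backward (FW-BW) reachability: a pivot's SCC is the intersection of its forward and backward reachable sets, and the three leftover vertex sets are independent subproblems (the source of the parallelism; this sketch recurses sequentially with simple DFS reachability):

```python
def fwbw_scc(vertices, adj, radj):
    """Divide-and-conquer SCC in the style of Fleischer et al. (sketch).
    adj/radj: dicts mapping each vertex to its out-/in-neighbor sets."""
    def reach(src, nbrs, within):
        seen, stack = {src}, [src]
        while stack:
            u = stack.pop()
            for v in nbrs[u]:
                if v in within and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen
    if not vertices:
        return []
    pivot = next(iter(vertices))
    fw = reach(pivot, adj, vertices)    # forward reachable set
    bw = reach(pivot, radj, vertices)   # backward reachable set
    scc = fw & bw                       # the pivot's SCC
    comps = [scc]
    # The three remaining sets contain no SCC spanning two of them,
    # so they can be processed independently (in parallel).
    for part in (fw - scc, bw - scc, vertices - fw - bw):
        comps += fwbw_scc(part, adj, radj)
    return comps
```

The worst case degenerates to O(n) span (e.g. a long chain of singleton SCCs), matching the slide's caveat, but on many real graphs the pivot's SCC is large and the recursion is shallow.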
![Page 96: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/96.jpg)
Strongly Connected Components
![Page 97: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/97.jpg)
• Applications• Designing parallel graph algorithms• Case studies:
A. Graph traversals: Breadth-first searchB. Shortest Paths: Delta-stepping, PHASTC. Ranking: Betweenness centrality, PageRankD. Maximal Independent Sets: Luby’s algorithmE. Strongly Connected Components
• Wrap-up: challenges on current systems
Lecture Outline
![Page 98: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/98.jpg)
[Figure: performance rate (million traversed edges/s, 0–60) vs. problem size (log2 # of vertices, 10–26)]
Serial Performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8MB L3 cache)
Input: Synthetic R-MAT graphs (# of edges m = 8n)
The locality challenge
“Large memory footprint, low spatial and temporal locality impede performance”
~ 5X drop in performance
No Last-level Cache (LLC) misses
O(m) LLC misses
![Page 99: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/99.jpg)
• Graph topology assumptions in classical algorithms do not match real-world datasets
• Parallelization strategies at loggerheads with techniques for enhancing memory locality
• Classical “work-efficient” graph algorithms may not fully exploit new architectural features
– increasing complexity of the memory hierarchy, processor heterogeneity, wide SIMD.
• Tuning an implementation to minimize parallel overhead is non-trivial
– Shared memory: minimizing overhead of locks, barriers.
– Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication with computation.
The parallel scaling challenge
“Classical parallel graph algorithms perform poorly on current parallel systems”
![Page 100: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/100.jpg)
Designing parallel algorithms for large sparse graph analysis
[Figure: work performed (O(n), O(n log n), O(n²)) vs. problem size (10⁴ to 10¹²+ vertices/edges, up to Peta+); “RandomAccess”-like vs. “Stream”-like regimes; strategies: improve locality, data reduction/compression, faster methods]
System requirements: high (on-chip memory, DRAM, network, IO) bandwidth.
Solution: efficiently utilize the available memory bandwidth.
Algorithmic innovation to avoid corner cases.
![Page 101: Parallel Graph Algorithms](https://reader034.vdocuments.us/reader034/viewer/2022052308/56816760550346895ddc366f/html5/thumbnails/101.jpg)
• The best algorithm depends on the input.
• Communication costs (and hence data distribution) are crucial for distributed memory.
• Locality is crucial for in-node performance.
• Graph traversal (BFS) is a fundamental building block for more complex algorithms.
• The best parallel algorithm can be significantly different from the best serial algorithm.
Conclusions