Graph500 and Green Graph500 benchmarks on SGI UV 2000 @ SGI User Group SC14
TRANSCRIPT
Graph500 and Green Graph500 benchmarks on SGI UV 2000
Yuichiro Yasui & Katsuki Fujisawa, Kyushu University and JST CREST
SGI User Group conference, Nov. 17, 2014
Outline
1. Graph processing for large-scale networks
2. Graph500 & Green Graph500 benchmarks
3. Our NUMA-optimized BFS algorithm
4. Numerical results on SGI UV 2000
Graph processing for large-scale networks
• Large-scale graphs arise in various fields:
  – US road network: 24 million vertices & 58 million edges
  – Twitter follow-ship network: 61.6 million vertices & 1.47 billion edges
  – Neuronal network (Human Brain Project): 89 billion vertices & 100 trillion edges
  – Cyber-security: 15 billion log entries / day
• Application fields: transportation, social networks, cyber-security, bioinformatics
• Goal: fast and scalable graph processing by using HPC
Graph analysis and the important kernel BFS
• The cycle of graph analysis for understanding real networks:
  Step 1: constructing a graph from application-field data
  Step 2: graph processing
  Step 3: understanding the results (relationships)
• Typical graph-processing kernels:
  – concurrent search (breadth-first search)
  – optimization (single-source shortest path)
  – edge-oriented (maximal independent set)
Breadth-first search (BFS)
• One of the most important and fundamental graph kernels
• Many algorithms and applications are based on it (e.g., max-flow and centrality)
• Low arithmetic intensity & irregular memory accesses
• Inputs: a graph and a source vertex
• Outputs: the distance (level) and a predecessor for each vertex reachable from the source
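As a concrete reference for the inputs and outputs just listed, the per-vertex levels and predecessors can be computed with a plain queue-based BFS. This is a minimal serial sketch (the algorithms presented later are parallel); the function name is ours.

```python
from collections import deque

def bfs(adj, source):
    """Return (level, pred) for every vertex reachable from source.

    adj: dict mapping vertex -> iterable of neighbor vertices.
    """
    level = {source: 0}
    pred = {source: source}   # Graph500 convention: the source is its own parent
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:            # unvisited
                level[w] = level[v] + 1
                pred[w] = v
                queue.append(w)
    return level, pred
```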
Twitter follow-ship network (Twitter2009)
• #Users (#vertices): 41,652,230
• Follow-ships (#edges): 2,405,026,092
• This network excludes unconnected users

BFS result from user 21,804,357 — our algorithm computes this BFS in only 60 ms:

| Lv.   | #users     | ratio (%) | percentile (%) |
|-------|------------|-----------|----------------|
| 0     | 1          | 0.00      | 0.00           |
| 1     | 7          | 0.00      | 0.00           |
| 2     | 6,188      | 0.01      | 0.01           |
| 3     | 510,515    | 1.23      | 1.24           |
| 4     | 29,526,508 | 70.89     | 72.13          |
| 5     | 11,314,238 | 27.16     | 99.29          |
| 6     | 282,456    | 0.68      | 99.97          |
| 7     | 11,536     | 0.03      | 100.00         |
| 8     | 673        | 0.00      | 100.00         |
| 9     | 68         | 0.00      | 100.00         |
| 10    | 19         | 0.00      | 100.00         |
| 11    | 10         | 0.00      | 100.00         |
| 12    | 5          | 0.00      | 100.00         |
| 13    | 2          | 0.00      | 100.00         |
| 14    | 2          | 0.00      | 100.00         |
| 15    | 2          | 0.00      | 100.00         |
| Total | 41,652,230 | 100.00    | -              |

Almost all users are within six hops of the source — "the six degrees of separation."
Betweenness centrality (BC)
• Definition:

  C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}

  where \sigma_{st} is the number of shortest (s, t)-paths and \sigma_{st}(v) is the number of shortest (s, t)-paths passing through vertex v.
• BC measures important vertices and edges without using coordinates; a high-score vertex/edge is an important place (e.g., a highway or a bridge)
• Example: Osaka road network, 13,076 vertices and 40,528 edges
• BC requires the all-to-all shortest paths:
  – one BFS => one-to-all
  – <#vertices> BFS runs => all-to-all
  => 13,076 BFS computations for the Osaka road network
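The all-to-all BFS computation behind BC can be organized so that each source's shortest-path counts σ and dependency accumulations are done in one forward/backward sweep, as in Brandes' algorithm. A minimal serial sketch for unweighted graphs (the slides do not specify the implementation; note that for an undirected graph each (s, t) pair is counted in both directions, so scores come out doubled relative to the unordered-pair definition):

```python
from collections import deque

def betweenness(adj):
    """Brandes-style betweenness centrality for an unweighted graph."""
    V = list(adj)
    CB = {v: 0.0 for v in V}
    for s in V:
        # forward sweep: BFS counting shortest paths sigma from s
        S = []                              # vertices by non-decreasing distance
        P = {v: [] for v in V}              # shortest-path predecessors
        sigma = {v: 0 for v in V}; sigma[s] = 1
        dist = {v: -1 for v in V}; dist[s] = 0
        Q = deque([s])
        while Q:
            v = Q.popleft(); S.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    Q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    P[w].append(v)
        # backward sweep: accumulate dependencies
        delta = {v: 0.0 for v in V}
        while S:
            w = S.pop()
            for v in P[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
            if w != s:
                CB[w] += delta[w]
    return CB
```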
Graph500 benchmark (www.graph500.org)
• Measures the performance of irregular memory accesses
• Score: TEPS (# of traversed edges per second) in a BFS
• Input parameters for the problem size: SCALE & edgefactor (= 16)
• Benchmark flow:
  1. Generation: generates a synthetic scale-free Kronecker graph with 2^SCALE vertices and 2^SCALE × edgefactor edges by applying SCALE recursive Kronecker products (G1, G2, G3, G4, ...)
  2. Construction: builds the graph data structure
  3. BFS × 64: runs 64 timed BFS iterations, each followed by validation
• Results: BFS time, traversed edges, and the TEPS ratio for each iteration; the reported score is the median TEPS
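The recursive Kronecker generation can be sketched as repeated quadrant selection: each of the SCALE recursion levels picks one quadrant of the 2×2 initiator matrix and appends one bit to each edge endpoint. A simplified sketch; the initiator probabilities (A, B, C, D) = (0.57, 0.19, 0.19, 0.05) are the ones used by the Graph500 reference generator, but this version omits the specification's vertex permutation and edge randomization:

```python
import random

# Graph500 Kronecker initiator probabilities; D = 1 - A - B - C = 0.05
A, B, C = 0.57, 0.19, 0.19

def kronecker_edges(scale, edgefactor, seed=42):
    """Generate 2**scale * edgefactor edges of a Kronecker (R-MAT-like) graph."""
    rng = random.Random(seed)
    n = 2 ** scale
    edges = []
    for _ in range(n * edgefactor):
        u = v = 0
        for _ in range(scale):          # one quadrant choice per recursion level
            u <<= 1; v <<= 1
            r = rng.random()
            if r < A:
                pass                    # top-left quadrant: both bits stay 0
            elif r < A + B:
                v |= 1                  # top-right
            elif r < A + B + C:
                u |= 1                  # bottom-left
            else:
                u |= 1; v |= 1          # bottom-right
        edges.append((u, v))
    return edges
```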
Green Graph500 benchmark (http://green.graph500.org)
• Measures power efficiency using the TEPS/W score
• Same flow as Graph500 (generation, construction, 64 timed BFS iterations with validation, median TEPS), but the power consumption (watts) is measured during the BFS phase and the score is TEPS/W
• We have results on various systems such as the SGI UV series, Xeon servers, and Android devices
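The two scores reduce to simple ratios: Graph500 reports the median, over the 64 timed BFS runs, of traversed edges divided by BFS time, and Green Graph500 divides that median by the power draw during the measured phase. A small sketch of the arithmetic (function names are ours):

```python
from statistics import median

def teps(traversed_edges, bfs_seconds):
    """TEPS ratio for one BFS run: traversed edges per second."""
    return traversed_edges / bfs_seconds

def graph500_score(edge_counts, times):
    """Median TEPS over the timed BFS iterations (64 in the benchmark)."""
    return median(teps(m, t) for m, t in zip(edge_counts, times))

def green_graph500_score(median_teps, watts):
    """Green Graph500 metric: TEPS per watt."""
    return median_teps / watts
```

For example, the UV 2000 result later in this talk (131 GTEPS at 10.53 kW) corresponds to roughly 12.4 MTEPS/W.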
Target networks
[Figure: real networks and Graph500 problem classes plotted by # of vertices (log2 n) vs. # of edges (log2 m), over the range 2^15 to 2^45: USA-road-d.NY.gr, USA-road-d.LKS.gr, USA-road-d.USA.gr (US road network), cit-Patents, soc-LiveJournal1, twitter-rv (Twitter2009), the Human Brain Project network (89 B vertices, 100 T edges), and the Graph500 classes Toy, Mini, Small, Medium, Large, and Huge; the 1-billion and 1-trillion marks are indicated on both axes.]
Target networks on a smartphone
[Same figure, annotated: a smartphone (4 cores) handles Graph500 SCALE 20.]
Target networks on a single server
[Same figure, annotated: a 4-way Intel Xeon single server (64 cores) handles Graph500 SCALE 29.]
Target networks on UV 2000
[Same figure, annotated: a UV 2000 (1 rack, 640 cores) handles Graph500 SCALE 32.]
Target networks on a supercomputer
[Same figure, annotated: BlueGene/Q (64K nodes) and the K computer (64K nodes) handle Graph500 SCALE 40.]
Problem and our motivation
• Can UV 2000 obtain high performance without MPI? => YES
• Thread vs. MPI parallelism by # of cores:
  – Single server (up to ~32 cores): Thread > MPI
  – UV 2000 (640–1,280 cores): Thread vs. MPI? (our question)
  – K computer (512K cores): Thread << MPI
• Our approach: exploiting the algorithm on NUMA and cc-NUMA systems
  – Automatic processor-topology detection
  – Affinity configurations for running threads and allocating memory
  – Partitioning the adjacency matrix and binding each partition to a NUMA node (CPU socket + local RAM)
Level-synchronized parallel BFS (top-down) • Starts from the source vertex and executes the following two phases for each level
3) BFS iterations (timed): This step iterates the timed BFS-phase and the untimed verify-phase 64 times. The BFS-phase executes the BFS for each source, and the verify-phase confirms the output of the BFS.

This benchmark is based on the TEPS ratio, which is computed for a given graph and the BFS output. Submissions against this benchmark must report five TEPS ratios: the minimum, first quartile, median, third quartile, and maximum.

III. PARALLEL BFS ALGORITHM

A. Level-synchronized Parallel BFS

We assume that the input of a BFS is a graph G = (V, E) consisting of a set of vertices V and a set of edges E. The connections of G are contained as pairs (v, w), where v, w ∈ V. The set of edges E corresponds to a set of adjacency lists A, where an adjacency list A(v) contains the outgoing edges (v, w) ∈ E for each vertex v ∈ V. A BFS explores the edges spanning all other vertices v ∈ V \ {s} from the source vertex s ∈ V in a given graph G and outputs the predecessor map π, which maps each vertex v to its parent. When the predecessor map π(v) points to only one parent for each vertex v, it represents a tree with the root vertex s. However, some applications, such as betweenness centrality [1], require all of the parents at the previous level (one hop closer to the source) for each vertex; in that case the output predecessor map is represented as a directed acyclic graph (DAG). In this paper, we focus on the Graph500 benchmark, and assume that the BFS output is a predecessor map that represents a tree.

Algorithm 1 is a fundamental parallel algorithm for a BFS. It requires synchronization at each level, i.e., at each number of hops away from the source; we call this the level-synchronized parallel BFS [6]. Each traversal explores all outgoing edges of the current frontier, which is the set of vertices discovered at this level, and finds their neighbors, which is the set of unvisited vertices at the next level. We can describe this algorithm using a frontier queue QF and a neighbor queue QN, because unvisited vertices w are appended to the neighbor queue QN for each frontier-queue vertex v ∈ QF in parallel with exclusive control at each level (Algorithm 1, lines 7–12), as follows:

    QN ← { w ∈ A(v) | w ∉ visited, v ∈ QF }.    (1)

Algorithm 1: Level-synchronized parallel BFS.
  Input    : G = (V, A): unweighted directed graph; s: source vertex.
  Variables: QF: frontier queue; QN: neighbor queue; visited: vertices already visited.
  Output   : π(v): predecessor map of the BFS tree.
   1  π(v) ← −1, ∀v ∈ V
   2  π(s) ← s
   3  visited ← {s}
   4  QF ← {s}
   5  QN ← ∅
   6  while QF ≠ ∅ do
   7      for v ∈ QF in parallel do
   8          for w ∈ A(v) do
   9              if w ∉ visited atomic then
  10                  π(w) ← v
  11                  visited ← visited ∪ {w}
  12                  QN ← QN ∪ {w}
  13      QF ← QN
  14      QN ← ∅

B. Hybrid BFS Algorithm of Beamer et al.

The main runtime bottleneck of the level-synchronized parallel BFS (Algorithm 1) is the exploration of all outgoing edges of the current frontier (lines 7–12). Beamer et al. [8], [9] proposed a hybrid BFS algorithm (Algorithm 2) that reduces the number of edges explored. This algorithm combines two different traversal kernels: top-down and bottom-up. Like the level-synchronized parallel BFS, the top-down kernel traverses the neighbors of the frontier. Conversely, the bottom-up kernel finds the frontier from vertices in the candidate neighbors. In other words, a top-down method finds the children from the parent, whereas a bottom-up method finds the parent from the children. For a large frontier, the bottom-up approach reduces the number of edges explored, because this traversal kernel terminates once a single parent is found (Algorithm 2, lines 16–21).

Table III lists the number of edges explored at each level using the top-down, bottom-up, and combined hybrid (oracle) approaches. For the top-down kernel, the frontier size mF at low and high levels is much less than that at mid-levels. In the case of the bottom-up method, the frontier size mB starts equal to the number of edges m = |E| in the given graph G = (V, E) and decreases as the level increases. Bottom-up kernels treat all unvisited vertices as candidate neighbors QN, because it is difficult to determine their exact number prior to traversal, as shown in line 15 of Algorithm 2. This lazy estimation of candidate neighbors increases the number of edges traversed for a small frontier. Hence, considering the size of the frontier and the number of neighbors, the hybrid (oracle) combines the top-down and bottom-up approaches. This loop generally executes only once, so the algorithm is suitable for a small-world graph that has a large frontier, such as a Kronecker graph or an R-MAT graph. Table III shows that the total number of edges traversed by the hybrid (oracle) is only 3% of that in the case of the top-down kernel.

We now explain how to determine a traversal policy (Table II(a)) for the top-down and bottom-up kernels.
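The hybrid policy above can be sketched in a few lines: at each level, compare the work a top-down step would do (the edges out of the frontier, mF) against the work remaining for a bottom-up step, and pick the cheaper kernel. This sketch uses only the mF > mU/α test from Beamer et al. (α = 14 is the default they report; their full heuristic also uses a second parameter β for switching back), and it recomputes the edge counts naively instead of maintaining them incrementally:

```python
def hybrid_bfs(adj_out, adj_in, source, alpha=14):
    """Direction-optimizing BFS sketch after Beamer et al. (SC12)."""
    visited = {source}
    pred = {source: source}
    frontier = {source}
    while frontier:
        m_f = sum(len(adj_out[v]) for v in frontier)                 # frontier out-edges
        m_u = sum(len(adj_out[v]) for v in adj_out if v not in visited)
        if m_f > m_u / alpha:
            # bottom-up: each unvisited vertex scans incoming edges for a parent
            nxt = set()
            for w in adj_in:
                if w in visited:
                    continue
                for v in adj_in[w]:
                    if v in frontier:
                        pred[w] = v
                        nxt.add(w)
                        break                                        # first parent suffices
        else:
            # top-down: frontier vertices claim their unvisited neighbors
            nxt = set()
            for v in frontier:
                for w in adj_out[v]:
                    if w not in visited and w not in nxt:
                        pred[w] = v
                        nxt.add(w)
        visited |= nxt
        frontier = nxt
    return pred
```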
[Diagram: at each level, the Traversal phase finds the neighbors QN (unvisited adjacent vertices, the candidates of neighbors) from the current frontier QF, and the Swap phase then exchanges the frontier QF and the neighbors QN for the next level (level k → level k+1).]
Observing data accesses in the top-down and bottom-up kernels
• Data writes in the top-down kernel (v → w):

  Input: directed graph G = (V, AF), queue QF
  Data : queue QN, visited, tree π(v)
  QN ← ∅
  for v ∈ QF in parallel do
      for w ∈ AF(v) do
          if w ∉ visited atomic then
              π(w) ← v
              visited ← visited ∪ {w}
              QN ← QN ∪ {w}
  QF ← QN

• Data writes in the bottom-up kernel (w → v):

  Input: directed graph G = (V, AB), queue QF
  Data : queue QN, visited, tree π(v)
  QN ← ∅
  for w ∈ V \ visited in parallel do
      for v ∈ AB(w) do
          if v ∈ QF then
              π(w) ← v
              visited ← visited ∪ {w}
              QN ← QN ∪ {w}
              break
  QF ← QN

• Both kernels write only to the per-w variables π(w) and visited (v is merely read as a vertex index)
Direction-optimizing BFS
• Top-down algorithm: efficient for a small frontier; uses outgoing edges to expand the level-k frontier into the level-(k+1) neighbors
• Bottom-up algorithm: efficient for a large frontier; uses incoming edges to find parents in the level-k frontier for the level-(k+1) candidates
• Top-down finds the unvisited neighbors of the current frontier via outgoing edges; bottom-up checks the candidates of neighbors against the current frontier via incoming edges, skipping unnecessary edge traversals
• At each level, one of top-down or bottom-up is chosen [Beamer2012 @ SC2012]
• For a Kronecker graph with SCALE 26, this hybrid BFS greatly reduces the number of unnecessary edge traversals
Number of traversed edges per level, top-down vs. bottom-up (Kronecker graph, |V| = 2^26, |E| = 2^30; the bottom-up count starts near |E|):

| Level | Top-down      | Bottom-up     | Hybrid     |
|-------|---------------|---------------|------------|
| 0     | 2             | 2,103,840,895 | 2          |
| 1     | 66,206        | 1,766,587,029 | 66,206     |
| 2     | 346,918,235   | 52,677,691    | 52,677,691 |
| 3     | 1,727,195,615 | 12,820,854    | 12,820,854 |
| 4     | 29,557,400    | 103,184       | 103,184    |
| 5     | 82,357        | 21,467        | 21,467     |
| 6     | 221           | 21,240        | 227        |
| Total | 2,103,820,036 | 3,936,072,360 | 65,689,631 |
| Ratio | 100.00%       | 187.09%       | 3.12%      |

The hybrid chooses top-down while the frontier is small (the first levels) and bottom-up for the large mid-level frontiers.
NUMA-optimized direction-optimizing BFS: our results
System configuration: 4-way Intel Xeon server (4 sockets, 32 or 40 cores, 256 GB or 512 GB RAM)
• NUMA-aware computation manages memory accesses on the NUMA system; each NUMA node contains a CPU socket and its local memory

| Year / venue | Algorithm                           | GTEPS | Speedup |
|--------------|-------------------------------------|-------|---------|
| 2011         | Reference                           | 0.087 | ×1      |
| SC10         | NUMA-aware                          | 0.8   | ×9      |
| SC12         | Dir. Opt.                           | 5     | ×58     |
| BigData13    | NUMA-Opt.                           | 11    | ×125    |
| ISC14        | NUMA-Opt. + Deg.-aware              | 29    | ×334    |
| G500, ISC14  | NUMA-Opt. + Deg.-aware + Vtx. Sort  | 42    | ×489    |
NUMA architecture
• Example: 4-way Intel Xeon E5-4640 (Sandy Bridge-EP): 4 sockets × 8 physical cores per socket × 2 threads per core = 64 threads
• Each NUMA node = one CPU socket (16 logical cores; 8-core Xeon E5-4640 with per-core L2 caches and a shared L3 cache) + local RAM
• Memory access to local RAM is fast; access to remote RAM is slow
• NUMA-aware (optimized) computation reduces and avoids memory accesses to remote RAM
• The adjacency matrix is partitioned into parts 0–3 and each part is bound to a NUMA node
Flow of affinities using ULIBC
ULIBC: Ubiquity Library for Intelligently Binding Cores (our library) – provides APIs for utilizing the processor topology easily.
1. Detects the entire topology, e.g.:
   CPU 0: P0, P4, P8, P12
   CPU 1: P1, P5, P9, P13
   CPU 2: P2, P6, P10, P14
   CPU 3: P3, P7, P11, P15
2. Detects the online (available) topology — e.g., a job manager (PBS) or `numactl --cpunodebind=1,2` restricts the process to CPU 1 (P1, P5, P9, P13) and CPU 2 (P2, P6, P10, P14), while the other CPUs are used by other processes.
3. Constructs the ULIBC affinity, e.g. ULIBC_set_affinity_policy(7, SCATTER_MAPPING, THREAD_TO_CORE) for 7 threads with a scatter-type mapping; each thread is bound to one logical core:
   NUMA 0: threads 0(P1), 2(P5), 4(P9), 6(P13)
   NUMA 1: threads 1(P2), 3(P6), 5(P10)
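The scatter mapping in step 3 distributes consecutive thread IDs round-robin over the online NUMA nodes (thread 0 → node 0, thread 1 → node 1, thread 2 → node 0, ...). ULIBC itself is a C library that detects the topology automatically; the following is only an illustrative re-implementation of the placement rule, with the online topology passed in by hand:

```python
def scatter_mapping(num_threads, online_cores):
    """thread id -> (numa_node, logical core), SCATTER_MAPPING style.

    online_cores: NUMA node id -> list of available logical cores,
    e.g. what remains after `numactl --cpunodebind=1,2`.
    """
    total = sum(len(cores) for cores in online_cores.values())
    assert num_threads <= total, "more threads than online cores"
    nodes = sorted(online_cores)
    iters = {n: iter(online_cores[n]) for n in nodes}
    mapping, i = [], 0
    while len(mapping) < num_threads:
        node = nodes[i % len(nodes)]       # round-robin over NUMA nodes
        i += 1
        try:
            mapping.append((node, next(iters[node])))
        except StopIteration:
            continue                       # this node is out of cores; try the next
    return mapping
```

With the online topology from the slide (NUMA 0 = {P1, P5, P9, P13}, NUMA 1 = {P2, P6, P10}) and 7 threads, this reproduces the thread placement shown in step 3.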
NUMA-optimized BFS
• 1-D column-wise partitioning of the adjacency matrix into parts 0–3, each bound to one NUMA node (an 8-core Xeon E5-4640 socket with its local RAM)
• Each NUMA node holds its part of the adjacency matrix, its local neighbors, and a duplicated copy of the frontier
• Inner-NUMA-node: edge traversal on local RAM only; each NUMA node searches for unvisited vertices from the duplicated frontier
• Inter-NUMA-node: all-gathering of the next frontier, constructing the duplicated frontiers from the partial neighbors
• In short: local traversal and all-to-all communication at each level
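One level of the scheme above can be sketched serially: every partition traverses only its node-local edges against the shared (duplicated) frontier, and the per-partition neighbor sets are then gathered into the next frontier. In the real implementation each partition is a team of threads pinned to one NUMA node; the function and variable names here are ours.

```python
def partitioned_level_step(parts, frontier, visited, pred):
    """One level of the 1-D column-wise partitioned BFS.

    parts: list of per-NUMA-node adjacency dicts; part k stores only the
    edges (v, w) whose target w belongs to node k, so each traversal
    touches node-local RAM only.  The frontier is duplicated on every
    node; afterwards the partial neighbor sets are all-gathered.
    """
    partials = []
    for local_adj in parts:                   # conceptually parallel, one per node
        local_next = set()
        for v in frontier:
            for w in local_adj.get(v, ()):    # node-local edges only
                if w not in visited and w not in local_next:
                    pred[w] = v
                    local_next.add(w)
        partials.append(local_next)
    nxt = set().union(*partials)              # "all-gather" of the next frontier
    visited |= nxt
    return nxt
```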
Degree-aware + NUMA-opt. + Dir. Opt. BFS
Two further degree-aware optimizations on the same 4-way Intel Xeon system:
1. Deleting isolated vertices: vertices with no edges can never be visited, so they are removed before the search
2. Sorting each adjacency list A(v) by degree, so the bottom-up scan tends to terminate earlier
These raise our results to 29 GTEPS (+Deg.-aware, ×334) and 42 GTEPS (+Deg.-aware +Vtx. Sort, ×489).
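The two preprocessing steps can be sketched as one pass over the adjacency structure. The slides do not state the sort direction; descending neighbor degree is assumed here, on the rationale that the bottom-up kernel then tends to find a (likely high-degree, early-visited) parent sooner and break out of the scan:

```python
def degree_aware_preprocess(adj):
    """Degree-aware preprocessing sketch:
    1. drop isolated vertices (degree 0) so they never enter the search;
    2. sort each adjacency list by descending neighbor degree (assumed
       direction) so a bottom-up scan can break early on a parent.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    return {
        v: sorted(nbrs, key=lambda w: degree[w], reverse=True)
        for v, nbrs in adj.items()
        if degree[v] > 0                      # 1. delete isolated vertices
    }
```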
TEPS and TEPS/W on a single server
• Strong scaling at SCALE 27 on the 4-way Sandy Bridge-EP Xeon, from 1 to 64 threads:
  – Graph500 metric: 29.03 GTEPS at 64 threads, a ×27.9 relative improvement
  – Green Graph500 metric: 45.43 MTEPS/W at 64 threads, a ×12.6 relative improvement
SGI UV 2000 system
• Shared-memory supercomputer
  – Handles a large memory space using thread parallelism
  – Programmed in C/C++ with OpenMP/Pthreads (no MPI communication)
  – cc-NUMA architecture based on Intel Xeon
• ISM (The Institute of Statistical Mathematics, Japan's national research institute for statistical science) has two full-spec UV 2000 systems (#1 and #2)
  – 4 UV 2000 racks in total
  – Up to 2,560 cores and 64 TB of memory
• ISM, SGI, and our group collaborate on Graph500
  – Achieves the fastest single-node entry in the current list
SGI UV 2000 configuration
• UV 2000 has a complex hardware topology — socket, node, cube, inner-rack, and inter-rack — connected by NUMAlink (6.7 GB/s):
  – Node = 2 sockets = 2 NUMA nodes (20 cores, 512 GB)
  – Cube = 8 nodes = 16 NUMA nodes (160 cores, 4 TB)
  – Rack = 32 nodes = 64 NUMA nodes (640 cores, 16 TB)
• We used NUMA-based flat parallelization
  – Each NUMA node contains one Xeon CPU E5-2470 v2 and 256 GB of RAM
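The building-block counts above compose multiplicatively; a quick check of the slide's numbers (assuming 10 cores per E5-2470 v2 socket, i.e., 640 cores / 64 NUMA nodes per rack):

```python
CORES_PER_NUMA_NODE = 10        # assumed: one 10-core Xeon E5-2470 v2 per NUMA node
RAM_GB_PER_NUMA_NODE = 256

def unit_totals(numa_nodes):
    """(cores, RAM in GB) for a unit built from `numa_nodes` NUMA nodes."""
    return numa_nodes * CORES_PER_NUMA_NODE, numa_nodes * RAM_GB_PER_NUMA_NODE

node = unit_totals(2)           # Node  =  2 NUMA nodes -> 20 cores, 512 GB
cube = unit_totals(16)          # Cube  = 16 NUMA nodes -> 160 cores, 4 TB
rack = unit_totals(64)          # Rack  = 64 NUMA nodes -> 640 cores, 16 TB
```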
Weak scaling on UV 2000 (June 2014)
[Chart: GTEPS (0–200) vs. SCALE 26–34, with ℓ = #sockets doubling from 1 to 256; scaling holds through inner-rack communication and continues across inter-rack communication.]
• Fastest single node in the Graph500 June list: 131 GTEPS
• Most power-efficient commercial supercomputer: 12.481 MTEPS/W = 131 GTEPS / 10.53 kW
The Graph500 list in June 2014 (http://www.graph500.org)
• Measures performance using TEPS (# of traversed edges per second) in graph traversal such as BFS
• The top entries are distributed-memory systems (including the fastest multi-node entry); our shared-memory entries are the fastest single-node and the fastest single-server
The Green Graph500 list in June 2014 (http://green.graph500.org)
• Measures power efficiency using TEPS/W
• Small Data category (< SCALE 29), which includes Android devices such as the SONY Xperia Z1 SO-01F:
  – George Washington University's Colonial is ranked No. 1 with 445.92 MTEPS/W on SCALE 20 (third Green Graph 500 list, published at the International Supercomputing Conference, June 23, 2014)
• Big Data category (> SCALE 30):
  – Ours: Kyushu University's GraphCREST-SandybridgeEP-2.4GHz (the 4-way Xeon server) is ranked No. 1 with 59.12 MTEPS/W on SCALE 30 (same list)
  – Our UV 2000 entry and TSUBAME-KFC also appear in this category
Weak scaling on UV 2000 (June 2014 → Nov. 2014)
[Chart: GTEPS (0–200) vs. SCALE 26–34 with ℓ = #sockets from 1 to 256, covering inner-rack and inter-rack communication.]
• June 2014: 131 GTEPS, fastest single node in the Graph500 June list
• New result (Nov. 2014): 174 GTEPS using two racks
Conclusion
• UV 2000 with NUMA-based thread parallelization (SGI UV 2000: 64 CPUs and 16 TB of RAM per rack)
  – Scalable for irregular-memory-access computation
• Graph500 / Green Graph500 on UV 2000
  – 131 GTEPS with 640 threads: the fastest single-node entry
  – The most power-efficient commercial supercomputer
  – New result: 174 GTEPS for SCALE 33 with 1,280 threads
• ULIBC will be available at https://bitbucket.org/yuichiro_yasui/ulibc