Exploiting the Potential of Modern Supercomputers Through High-Level Language Abstractions
Exploit Hierarchical and Irregular Parallelism in UPC
Li Chen, State Key Laboratory of Computer Architecture, Institute of Computing Technology, CAS
Exploit Hierarchical and Irregular Parallelism in UPC-H
Motivations: why use UPC?
Exploit the tiered network of Dawning 6000
– GASNet support for the HPP architecture
Exploit hierarchical data parallelism for regular applications
Shared work list for irregular applications
Deep Memory Hierarchies in Modern Computing Platforms
Many-core accelerators
Traditional multicore processors
Harpertown Dunnington
Intra-node parallelism should be well exploited
HPP Interconnect of Dawning 6000
Traditional cluster vs. the HPP architecture of Dawning 6000
– Discrete CPUs: application CPUs and OS CPUs
– Hypernode: discrete OS, single system image (SSI)
– Discrete interconnects: data interconnect, OS interconnect, global synchronization
Three-tier network
– PE: cache coherence across 2 CPUs
– HPP: 4 nodes
– IB between hypernodes
Global address space, through the HPP controller
Mapping Hierarchical Parallelism to Modern Supercomputers
Hybrid programming models, MPI+X
– MPI+OpenMP/TBB (+OpenACC/OpenCL)
– MPI+StarPU
– MPI+UPC
Sequoia
– Explicitly tune data layout and data transfer (parallel memory hierarchy)
– Recursive task tree, static mapping of tasks
HTA
– Data type for hierarchically tiled arrays (multi-level tiling)
– Parallel operators: map parallelism statically
X10
– Combines HTA with Sequoia ideas
– Abstraction of memory hierarchies: hierarchical place tree (Habanero-Java)
– Nested task parallelism; task mapping deferred until launch time
Challenges in Efficient Parallel Graph Processing
Data-driven computations
– Parallelism cannot be exploited statically
– Computation partitioning is not suitable
Unstructured problems
– Unstructured, highly irregular data structures
– Data partitioning is not suitable and may lead to load imbalance
Poor locality
– Data access patterns have little locality
High data-access-to-computation ratio
– Algorithms explore the structure, not computation
– Runtime is dominated by waiting for memory fetches
[Figure: expressing the parallelism is low level and tedious; execution is memory-latency and communication dominated. Borrowed from Andrew Lumsdaine, Douglas Gregor]
Why Unified Parallel C?
UPC: a parallel extension to ISO C99
– A dialect of PGAS (Partitioned Global Address Space) languages
Important UPC features
– Global address space: a thread may directly read/write remote data
– Partitioned: data is designated as local or global; affinity
– Two kinds of memory consistency (strict and relaxed)
UPC performance benefits over MPI
– Permits data sharing; better memory utilization
Looking toward future many-core chips and exascale systems
– Better bandwidth and latency using one-sided messages (GASNet)
– No less scalable than MPI (up to 128K threads)
Why use UPC?
– Captures the non-uniform memory access behavior of modern computers
– Programmability very close to shared-memory programming
The Status of the UPC Community
Portability and usability
– Many UPC implementations and tools: Berkeley UPC, Cray UPC, HP UPC, GCC-based Intrepid UPC, and MTU UPC
– Performance tools: the GASP interface and Parallel Performance Wizard (PPW)
– Debuggability: TotalView
– Interoperability with pthreads/MPI(/OpenMP)
Directions UPC is developing in
– Hierarchical parallelism and asynchronous execution
– Tasking mechanisms: scalable work stealing; hierarchical tasking libraries; places, async~finish; asynchronous remote methods
– Nested parallelism
– Instant teams: data-centric collectives
– Irregular benchmarks: UTS, MADNESS, GAP
– Interoperability: support for hybrid programming with OpenMP and other languages; more convenient support for writing libraries
What is UPC-H?
Developed by the compiler group of ICT
– H: heterogeneous, hierarchical
Based on the Berkeley UPC compiler; features added by ICT:
– Support for HW features of Dawning-series computers: the HPP interconnect; load/store in the physical global address space
– Hierarchical data distribution and parallelism: Godson-T (many-core processor), GPU clusters, the D6000 computer, and x86 clusters
– SWL (shared work list) support for graph algorithms
– Communication optimizations: software cache, message vectorization
– Runtime system for heterogeneous platforms: data management
UPC-H Support for HPP Architecture
Lack of a BCL Conduit in the UPC System
GASNet Extended API
GASNet Core API
Conduits: InfiniBand, inter-process shared memory (PSHM), HPP BCL
GASNet: networking for global address space languages
BCL: the low-level communication layer of HPP
Implementation of the BCL-conduit
Initialization of the tiered network
– Construct the topology of the tiered network
– Set up reliable datagram service through QP virtualization
– Initialize internal data structures such as send buffers
Finalization of communication
Network selection in the GASNet core API
– PSHM, HPP, IB
Flow control of messages
Implementation of Active Messages
– Short message: NAP
– Medium message: NAP
– Long message: RDMA + NAP
– RDMA Put/Get: RDMA + NAP
[Figure: two-tiered topology vs. three-tiered topology]
BCL Conduit: latency of short messages
[Figures: latency of short messages, intra-HPP and inter-HPP]
BCL Conduit, Bandwidth of Med. Messages (intra HPP)
BCL Conduit, Bandwidth of Med. Messages (inter HPP)
BCL Conduit, Latency of Med. Messages (intra HPP)
BCL Conduit, Latency of Med. Messages (inter HPP)
BCL Conduit, Latency of Barriers
[Figures: net latency of barriers, intra-HPP and inter-HPP]
Summary and Ongoing Work of UPC-H Targeting Dawning 6000
Summary
– The UPC-H compiler now supports the HPP architecture and benefits from the 3-tier network
Ongoing work
– Optimize the DMA registration strategy
– Evaluate HPP-supported barriers and collectives
– Full-scale evaluation
Hierarchical Data Parallelism: UPC-H Support for Regular Applications
UPC-H (UPC-Hierarchical/Heterogeneous) Execution Model
Standard UPC is SPMD-style and has flat parallelism
UPC-H extension
– Mixes SPMD with fork-join: at the fork point of a upc_forall, each UPC thread spawns implicit threads or implicit thread subgroups, which join at the end of the upc_forall
– Two approaches to express hierarchical parallelism:
  Implicit threads (or gtasks), organized in thread subgroups implicitly specified by the data distribution
  Explicit low-level gtasks
Multi-level Data Distribution
A data distribution defines an implicit thread tree:

  shared [32][32], [4][4], [1][1] float A[128][128];

[Figure: the UPC program and its UPC threads; the 128x128 array is cut into 16 upc-tiles of 32x32, one per UPC thread; each upc-tile into 64 subgroup-tiles of 4x4, one per subgroup; each subgroup-tile into 16 thread-tiles of 1x1, one per logical implicit thread]
UPC-H: Mapping Forall Loop to the Implicit Thread Tree
Leverage an existing language construct, upc_forall
– Independent loops
– Pointer-to-shared or integer-type affinity expression
Loop iterations are mapped to the implicit thread tree, and from there to the CUDA thread tree, according to the 3-level data distribution and the machine configuration:

  shared [32][32], [4][4], [1][1] float A[128][128];
  ... ...
  upc_forall(i=0; i<128; i++; continue)
    upc_forall(j=6; j<129; j++; &A[i][j-1])
      ... body ...

=> Thread topology: <THREADS, 64, 16>
UPC-H Codes for nbody
  shared [1024],[128],[1] point P[4096];
  shared [1024][1024] point tempf[4096][4096];
  for (int time = 0; time < 1000; time++) {
    upc_forall(int i = 0; i < N; i++; &P[i])
      for (int j = 0; j < N; j++)
        if (j != i) {
          distance = (float)sqrt((P[i].x-P[j].x)*(P[i].x-P[j].x) +
                                 (P[i].y-P[j].y)*(P[i].y-P[j].y));
          if (distance != 0) {
            magnitude = (G*m[i]*m[j])/(distance*distance + C*C);
            ... ...
            tempf[i][j].x = magnitude*direction.x/distance;
            tempf[i][j].y = magnitude*direction.y/distance;
          }
        }
    upc_forall(int i = 0; ... ...)
      ... ...
  }
Overview of the Compiling Support
Based on Berkeley UPC compiler v2.8.0
Compiler analysis
– Multi-dimensional and multi-level data distribution
– Affinity-aware multi-level tiling: upc tiling, subgroup tiling, thread tiling; memory tiling for scratchpad memory
– Communication optimization: message vectorization, loop peeling, static communication scheduling
– Data layout optimizations for GPUs: shared-memory optimization; finding better data layouts for memory coalescing (array transpose and structure splitting)
– Code generation: CUDA, hierarchical parallelism
Affinity-aware Multi-level Loop Tiling (Example)

  shared [32][32], [4][4], [1][1] float A[128][128];
  ... ...
  upc_forall(i=6; i<128; i++; continue)
    upc_forall(j=0; j<128; j++; &A[i-1][j])
      ... F[i][j] ...

Step 1: iteration-space transformation, to make the affinity expression consistent with the data space

  upc_forall(i=5; i<127; i++; continue)
    upc_forall(j=0; j<128; j++; &A[i][j])
      ... F[i+1][j] ...              // after the transformation

Step 2: three-level tiling (here effectively two levels)

  for (iu=0; iu<128; iu=iu+32)
    for (ju=0; ju<128; ju=ju+32)                 // upc thread affinity
      if (has_affinity(MYTHREAD, &A[iu][ju])) {
        // for the exposed region
        ... dsm_read ... F[iu+1 : min(128,iu+32)][ju : min(127,ju+31)]
        for (ib=iu; ib<min(128, iu+32); ib=ib+4)
          for (jb=ju; jb<min(128, ju+32); jb=jb+4)
            for (i=ib; i<min(128, ib+4); i=i+1)
              for (j=jb; j<min(128, jb+4); j=j+1)
                if (i>=5 && i<127)               // sink guards here!
                  ... F[i+1][j] ...;
      } // of upc thread affinity

Step 3: spawn fine-grained threads ... ...
Memory Optimizations for CUDA
What data should be placed in shared memory?
– A 0-1 bin-packing (knapsack) problem over the shared memory's capacity
– The profit: reuse degree integrated with the coalescing attribute
  – inter-thread reuse and intra-thread reuse; inter-thread reuse is preferred
  – average reuse degree for merged regions
– The cost: the volume of the referenced array region
– Compute the profit and cost for each reference
What is the optimal data layout in the GPU's global memory?
– Coalescing attributes of array references: only the contiguity constraints of coalescing are considered
– Legality analysis
– Cost model and amortization analysis
Overview of the Runtime Support
Multi-dimensional data distribution support
Gtask support on multicore platforms
– Workload scheduling, synchronization, topology-aware mapping and binding
DSM system for unified memory management
– GPU heap management
– Block-based memory consistency
– Inter-UPC message generation and data shuffling: data shuffling generates data tiles with halos
Data transformations for GPUs
– Dynamic data layout transformations for global-memory coalescing, demand driven
– Demand-driven data transfer between CPU and GPU
Unified Memory Management
Demand-driven data transfer
– Only on the local data space; no software caching of remote data
– Consistency is maintained at the boundary between CPU code and GPU code
Demand-driven data layout transformation
– Redundant data transformations are removed
– An extra field records the current layout of each data-tile copy
Benchmarks for the GPU Cluster
| Application | Description | Original language | Application field | Source |
|---|---|---|---|---|
| Nbody | n-body simulation | CUDA+MPI | Scientific computing | CUDA campus programming contest 2009 |
| LBM | Lattice Boltzmann method in computational fluid dynamics | C | Scientific computing | SPEC CPU 2006 |
| CP | Coulombic potential | CUDA | Scientific computing | UIUC Parboil benchmark |
| MRI-FHD | Magnetic resonance imaging, FHD | CUDA | Medical image analysis | UIUC Parboil benchmark |
| MRI-Q | Magnetic resonance imaging, Q | CUDA | Medical image analysis | UIUC Parboil benchmark |
| TPACF | Two-point angular correlation function | CUDA | Scientific computing | UIUC Parboil benchmark |
UPC-H Performance on GPU Cluster
Use a 4-node CUDA cluster with 1000M Ethernet. Each node has
– CPUs: 2 dual-core AMD Opteron 880
– GPU: NVIDIA GeForce 9800 GX2
Compilers: nvcc (2.2) -O3, GCC (3.4.6) -O3
[Figures: one-node speedups over serial execution for nbody and lbm, and four-node speedups (log2) for nbody, mri-fhd, mri-q, tpacf, and cp, comparing base DSM, memory coalescing, SM reuse, and manual CUDA(/MPI) versions]
Performance: 72%, on average
UPC-H Performance on Godson-T
The average speedup of the SPM optimization is 2.30; that of double buffering is 2.55
[Figure: speedups]
UPC-H Performance on Multi-core Cluster
Hardware and software
– Xeon X7550 (8 cores) × 8 = 64 cores/node, 40Gb InfiniBand, ibv conduit, mvapich2-1.4
Benchmarks
– NPB: CG, FT
– nbody, MM, Cannon MM
Results
– NPB performance: UPC-H reaches 90% of UPC+OMP
– Cannon MM can leverage optimal data sharing and communication coalescing, and can express complicated hierarchical data parallelism that is hard to express in UPC+OpenMP
[Figures: UPC-H/UPC+OMP performance ratio (0 to 1.2) for CG-B, CG-C, FT-B, FT-C, and nbody at thread-team sizes 1, 2, 4, 8; UPC-H/UPC speedups (up to 9x) for Cannon MM with matrix sizes 1024-8192 on 4 to 64 total threads]
SWL, UPC-H Support for Graph Algorithms
Introduction
Graph
– A flexible abstraction for describing relationships between discrete objects
– The basis of exploration-based applications (genomics, astrophysics, social network analysis, machine learning)
Graph search algorithms
– An important technique for analyzing the vertices and edges of a graph
– Breadth-first search (BFS) is widely used and is the basis of many others (CC, SSSP, best-first search, A*)
– Kernel of the Graph500 benchmarks
Challenges in Efficient Parallel Graph Processing
Data-driven computations
– Parallelism cannot be exploited statically
– Computation partitioning is not suitable
Unstructured problems
– Unstructured, highly irregular data structures
– Data partitioning is not suitable and may lead to load imbalance
Poor locality
– Data access patterns have little locality
High data-access-to-computation ratio
– Algorithms explore the structure, not computation
– Runtime is dominated by waiting for memory fetches
[Figure: the same challenges as before, now answered by a user-directed, automatically optimized, global-view, high-level approach. Borrowed from Andrew Lumsdaine, Douglas Gregor]
Tedious Optimizations of BFS (graph algorithm)
Optimizing BFS on clusters:

| Perf. problem | Goal | Technique |
|---|---|---|
| Memory access | Leverage non-blocking caches | Multithreading |
| Synchronization | Reduce the overhead of shared-data protection | Atomic operations, not locks |
| Synchronization | Address the scalability problem of collective operations | Multithreading + hierarchical collectives |
| Communication | Avoid small messages, which waste network bandwidth | Message vectorization |
| Communication | Hide the overhead of communication | Asynchronous operations |
| Communication | Reduce the number of messages | Multithreading |
Data-Centric Parallelism Abstraction for Irregular Applications
Amorphous data parallelism (Keshav Pingali)
– Active elements (activities)
– Neighborhood
– Ordering
Def: given a set of active nodes and an ordering on active nodes, amorphous data-parallelism is the parallelism that arises from simultaneously processing active nodes, subject to neighborhood and ordering constraints
Exploiting such parallelism: a work list
– Keeps track of active elements and their ordering: unordered-set and ordered-set iterators
– Conflicts among concurrent operations: support for speculative execution, as in the Galois system
Design Principles of SWL
Programmability
– Global-view programming
– High-level language abstraction
Flexibility
– User control over data locality (constructing/executing)
– Customizable construction and behavior of work items
Lightweight speculative execution
– Triggered by user hints, not purely automatic
– Lightweight conflict detection; locks are too costly
SWL Extension in UPC-H
1) Specify a work list
2) User-defined work constructor
3) Two iterators over the work list: a blocking one and a non-blocking one
4) Two kinds of work-item dispatcher
5) User-assisted speculation: upc_spec_get(), upc_spec_put()
Optimization details hidden from the user: message coalescing, queue management, asynchronous communication, speculative execution, etc.
Level Synchronized BFS in SWL, Code Example
In UPC-H on clusters (the corresponding Galois version for shared-memory machines is shown alongside on the slide):

  Work_t usr_add(Msg_t msg) {
    Work_t res_work;
    if (!TEST_VISITED(msg.tgt)) {
      pred[msg.tgt] = msg.src;
      SET_VISITED(msg.tgt);
      res_work = msg.tgt;
    } else
      res_work = NULL;
    return res_work;
  }

  while (1) {
    int any_set = 0;
    upc_worklist_foreach(Work_t rcv : list1) {
      size_t ei = g.rowstarts[VERTEX_LOCAL(rcv)];
      size_t ei_end = g.rowstarts[VERTEX_LOCAL(rcv) + 1];
      for (; ei < ei_end; ++ei) {
        long w = g.column[ei];
        if (w == rcv) continue;
        Msg_t msg;
        msg.tgt = w;
        msg.src = rcv;
        upc_worklist_add(list2, &pred[w], usr_add(msg));
        any_set = 1;
      } // for each row
    } // foreach
    bupc_all_reduce_allI(.....);
    if (final_set[MYTHREAD] == 0) break;
    upc_worklist_exchange(list1, list2);
  } // while
Asynchronous BFS in SWL, Code Example
Asynchronous implementation on shared-memory machines (Galois) vs. in UPC-H on clusters [code panels shown on the slide]
Implementation of SWL
Execution model
– SPMD
– SPMD + multithreading
Master/slave
– State transitions: executing; idle; termination detection; exit
Work dispatching
– AM-based, distributed
– Coalescing of work items and asynchronous transfer
– Mutual exclusion on the SWL and the work-item buffers
User-Assisted Speculative Execution
User API
upc_spec_get:
– get the data ownership
– data transfer; get the shadow copy
– conflict checking and rollback
– enter the critical region
upc_cmt_put:
– release the data ownership
– commit the computation
Compiler
– Identify speculative hints: upc_spec_get/put
– Fine-grained atomic protection: full/empty bits
Runtime system
– Two modes: speculative and non-speculative
– Rollback of data and computation
SPMD Execution, on a Shared-Memory Machine and a Cluster
On the shared-memory machine (Intel Xeon X7550 @ 2.00GHz × 8; scale=20, edgefactor=16), UPC gets very close to OpenMP
On the cluster ((Intel Xeon E5450 @ 3.00GHz × 2) × 64 nodes; scale=20, edgefactor=16), UPC is better than MPI:
1) it saves one copy for each work item
2) frequent polling raises the network throughput
SPMD+MT, on X86 Cluster
[Figure: performance for varying numbers of pthreads per UPC thread]
SWL SYNC BFS; scale=24, edgefactor=16
On D6000, Strong Scaling of SWL SYNC BFS
ICT Loongson-3A V0.5 FPU [email protected], *2
1) The MPI conduit has large overhead
2) The tiered network behaves better when more intra-HPP communication happens
[Figure: strong scaling of SWL_SYNC_BFS; TEPS from 0 to 1.4e7 for MPI_Simple(MPI), UPC_SWL(IBV), UPC_SWL(IBV+BCL), and UPC_SWL(MPI) on configurations H1N4P1(4), H4N1P1(1), H4N2P1(2), H4N4P1(4), H4N4P2(4), H4N4P2(8)]
Scale=24, EdgeFactor=16
Summary and Future Work on SWL
Summary
– Put forward the Shared Work List (SWL) in UPC to tackle amorphous data parallelism
– Using SWL, BFS can achieve better performance and scalability than MPI at certain scales and runtime configurations
– Realizes tedious optimizations with less user effort
Future work
– Realize and evaluate the speculative execution support (Delaunay triangulation refinement)
– Add a dynamic scheduler to the SWL iterators
– Evaluate more graph algorithms
Acknowledgement
Shenglin Tang
Shixiong Xu
Xingjing Lu
Zheng Wu
Lei Liu
Chengpeng Li
Zheng Jing
THANKS
Workload Distribution of a upc_forall

  shared [32][32], [4][4], [1][1] float A[128][128];
  ... ...
  upc_forall(i=0; i<128; i++; continue)
    upc_forall(j=5; j<128; j++; &A[i][j])
      ... body ...

THREADS = 16
[Figure: the iteration space (i = 0..127, j = 5..127) is distributed over the 4x4 grid of UPC threads 0-15; each thread's 32x32 upc-tile is further distributed over the 64 subgroups of its grid and their implicit threads, with multiple edges where one subgroup receives several iteration ranges]
Leverage load/store support within HPP