Re-designing Communication and Work Distribution in Scientific Applications for Extreme-scale Heterogeneous Systems
Project Team:
Karen Tomko (PI), Ohio Supercomputer Center
Dhabaleswar K. Panda (Co-PI), Ohio State University
Khaled Hamidouche, Ohio State University
Hari Subramoni, Ohio State University
Jithin Jose, Ohio State University
Raghunathan Raja Chandrasekar, Ohio State University
Rong Shi, Ohio State University
Akshay Venkatesh, Ohio State University
Jie Zhang, Ohio State University
Blue Waters Symposium - 2014
Drivers of Modern HPC System Architectures
• Multi-core processors are ubiquitous
• Modern interconnects have high-performance features such as RDMA and support for collectives
• Accelerators/coprocessors are becoming common in high-end systems
• Pushing the envelope for Exascale computing
[Figure: key hardware trends – Multi-core Processors; High Performance Interconnects; Accelerators/Coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip). Example systems: Tianhe-2 (#1), Titan (#2), Stampede (#6), Blue Waters.]
Challenges for Communication Runtimes
• Complex architecture
  – Within a node: accelerators connected via PCIe, NUMA shared memory
  – Interconnect feature and topology considerations
• Scaling
  – Current algorithms are developed and tested with 100s to 1000s of processes
  – Few systems on which to run with 10,000s to 100,000s of processes
Parallel Programming Models Overview
[Figure: three abstract machine models – Shared Memory Model (SHMEM, OpenMP): processes P1–P3 over one shared memory; Distributed Memory Model (MPI, Message Passing Interface): processes P1–P3 each with private memory; Partitioned Global Address Space (PGAS) (Global Arrays, UPC, OpenSHMEM, ...): private memories presented as a logical shared memory.]
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• Many-core models – OpenMP, OpenACC, CUDA
Key Questions
• How do MPI collectives perform at extreme scales?
• How well do the CraySHMEM and UPC PGAS collective communications scale?
• Can both the CPU and GPU resources be leveraged effectively in a hybrid node system?
MPI on Blue Waters
• Domain applications such as weather forecasting, earthquake simulations, and many more have a real requirement for large throughput capability
• MPI is the dominant programming model for distributed-memory systems
• MPI jobs on the order of 1K processes are becoming common
• MPI jobs on the order of 1M processes are at the current maximum scale
• Blue Waters is one of the first systems on which the performance of MPI jobs can be tested at very large scale
Blue Waters MPI Collective Performance
• Point-to-point operations and collective operations determine the performance of MPI programs
• Performance of point-to-point operations involves
  – Efficient utilization of the underlying interconnection hardware
  – Design of high-performance protocols
• Performance of collectives additionally involves
  – Design of efficient algorithms
• We evaluate the performance of common collectives (a minimal timing-loop sketch follows this list):
  – MPI_Bcast
  – MPI_Reduce
  – MPI_Allgather
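The collective latencies on the following slides were gathered with OSU-microbenchmark-style timing loops. As a minimal sketch of such a loop (illustrative only, not the exact OSU Micro-Benchmarks code; the iteration count, message sizes, and root-side-only timing are assumptions):

/* bcast_latency.c - minimal sketch of an OSU-style MPI_Bcast latency loop. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (size_t size = 1; size <= (1 << 20); size *= 2) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);              /* start all ranks together */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Bcast(buf, (int)size, MPI_CHAR, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0)                            /* report average latency in us */
            printf("%zu bytes: %.2f us\n", size, (t1 - t0) * 1e6 / iters);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}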
Performance of MPI_Bcast (64 – 512 Processes)
[Figure: MPI_Bcast latency (us) vs. message size for 64, 128, 256, and 512 processes; three panels covering the 1 B–1 KB, 1 KB–32 KB, and 64 KB–1 MB message ranges.]
• Latency is flat in the 1-byte to 32-byte range and then starts climbing, regardless of process count
• Latency of broadcast more than doubles in the short-message range going from 128 processes to 256 processes, which is undesirable
Performance of MPI_Bcast (1K – 8K Processes)
[Figure: MPI_Bcast latency (us) vs. message size for 1K, 2K, 4K, and 8K processes; three panels covering the 1 B–512 B, 1 KB–16 KB, and 64 KB–1 MB message ranges.]
• For process counts over 1K, there is a spike in latency around the 256-byte range, where the available bandwidth starts getting stressed
Performance of MPI_Bcast (16K – 128K Processes)
[Figure: MPI_Bcast latency (us) vs. message size for 16K, 32K, 64K, and 128K processes; three panels covering the 1 B–512 B, 1 KB–32 KB, and 64 KB–1 MB message ranges.]
• Unlike the 64–8K process counts, there is variability – a possible traffic effect
• The spike in the 8 KB message range is indicative of an algorithm selection problem
Performance of MPI_Reduce (64 – 512 Processes)
[Figure: MPI_Reduce latency (us) vs. message size for 64, 128, 256, and 512 processes; two panels covering the 1 B–512 B and 1 KB–256 KB message ranges.]
• Reduce is hardware accelerated, and the latency is similar regardless of process count
• There does seem to be a limitation with hardware acceleration around the 128 KB message range
Performance of MPI_Reduce (1K – 8K Processes)
[Figure: MPI_Reduce latency (us) vs. message size for 1K, 2K, 4K, and 8K processes; two panels covering the 1 B–512 B and 1 KB–256 KB message ranges.]
• Trends are similar to those at smaller process counts
Performance of MPI_Reduce (16K – 128K Processes)
• Notable increase in latency for 128K processes in the short message range
[Figure: MPI_Reduce latency (us) vs. message size for 16K, 32K, 64K, and 128K processes; two panels covering the 1 B–512 B and 1 KB–256 KB message ranges.]
Scalability of MPI_Bcast and MPI_Reduce
• Scalability is normalized to the 64-process job case
• MPI_Reduce is highly scalable
• MPI_Bcast is not as scalable
[Figure: MPI_Bcast and MPI_Reduce scalability – latency normalized to the 64-process case vs. message size (1 B–512 B) for 64, 4K, and 128K processes.]
Performance of MPI_Allgather (128K Processes)
[Figure: 128K-process MPI_Allgather latency (us, x1000) vs. message size (1 B–4 KB).]
• Allgather is equivalent to all processes performing broadcasts
• The bandwidth of the interconnection network is tested
• Traditionally, O(log N) algorithms are applicable to short-message allgathers
• The above graph raises an alarm about latency growth for large-scale dense collectives
Observations on MPI Collective Performance
• Performance of latency-sensitive operations such as Reduce is competitive in the operational range with increasing scale
• Congestion effects and cross-job traffic are likely to play a role in the performance of collectives as job sizes get larger (as seen in the 128K-process jobs)
• Performance of dense collectives like Allgather suffers from bandwidth limitations =>
  – Applications should perform such collectives in smaller communicators or use the non-blocking variants of the collectives (see the sketch below)
  – Better algorithms need to be devised to overcome bandwidth limitations
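As a rough illustration of the first recommendation (the group size of 512 ranks and the buffer layout are assumptions, not part of this study), a dense allgather can be confined to a smaller communicator and overlapped with independent work through its non-blocking variant:

/* Sketch: restrict a dense collective to a sub-communicator and overlap it
 * with computation using the non-blocking variant (MPI-3). Illustrative only. */
#include <mpi.h>
#include <stdlib.h>

void allgather_in_subgroup(const double *local, int count)
{
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    MPI_Comm sub;                                   /* e.g., groups of 512 ranks */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank / 512, world_rank, &sub);

    int sub_size;
    MPI_Comm_size(sub, &sub_size);
    double *gathered = malloc((size_t)count * sub_size * sizeof(double));

    MPI_Request req;
    MPI_Iallgather(local, count, MPI_DOUBLE,
                   gathered, count, MPI_DOUBLE, sub, &req);

    /* ... independent computation can proceed here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(gathered);
    MPI_Comm_free(&sub);
}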
Key Questions
• How do MPI collectives perform at extreme scales?
• How well do the CraySHMEM and UPC PGAS collective communications scale?
• Can both the CPU and GPU resources be leveraged effectively in a hybrid node system?
PGAS (UPC/SHMEM) on Blue Waters
• Partitioned Global Address Space (PGAS) programming models are getting more traction
  – Shared-memory abstraction over distributed nodes
  – Global view of data and one-sided communication calls
  – Provide improved productivity
  – Can express irregular communication patterns easily
• Unified Parallel C (UPC) – a language-based PGAS model
• SHMEM – a library-based model
• Blue Waters provides a good platform to evaluate the performance of UPC/SHMEM jobs at scale
Blue Waters UPC Performance Evaluations
• Point-to-point operations and collective operations determine the performance of UPC programs
• Used Cray UPC and the OSU UPC Microbenchmarks for the evaluations
• Point-to-point operations evaluated (see the sketch below):
  – upc_memput
  – upc_memget
• Collective operations evaluated:
  – upc_barrier
  – upc_broadcast
  – upc_reduce
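For context, a minimal sketch of a upc_memput latency loop in the spirit of these microbenchmarks is shown below (UPC is a C dialect; this is not the OSU benchmark code, and the buffer layout, iteration count, timing via gettimeofday, and the two-thread assumption are all illustrative):

/* upc_memput latency sketch - requires at least 2 UPC threads. */
#include <upc.h>
#include <stdio.h>
#include <sys/time.h>

#define MAX_SIZE (32 * 1024)
#define ITERS    1000

/* One MAX_SIZE-byte block per thread; thread 0 puts into thread 1's block. */
shared [MAX_SIZE] char dst[MAX_SIZE * THREADS];
char src[MAX_SIZE];

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    for (size_t size = 1; size <= MAX_SIZE; size *= 2) {
        upc_barrier;
        if (MYTHREAD == 0) {
            double t0 = now_us();
            for (int i = 0; i < ITERS; i++) {
                upc_memput(&dst[MAX_SIZE], src, size); /* put into thread 1's block */
                upc_fence;                             /* ensure completion */
            }
            printf("%zu bytes: %.2f us\n", size, (now_us() - t0) / ITERS);
        }
        upc_barrier;
    }
    return 0;
}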
UPC Put/Get Performance
[Figure: UPC memput and memget latency (us) vs. message size (1 B–32 KB), intra-node and inter-node.]
• Latency is flat in the 1-byte to 512-byte range and then starts climbing
  – Latency of UPC Put (intra-/inter-node) for a 4-byte message: 0.13/2.34 us
  – Latency of UPC Get (intra-/inter-node) for a 4-byte message: 0.07/1.17 us
• The higher cost of the Put operation might be due to the extra synchronization operation (upc_fence) needed to ensure completion
UPC Barrier Performance
• Barrier operation latency at 32,768 processes – 186 us
• The scalability graph shows latency normalized to that at 1,024 processes
• Linear scalability observed for smaller system sizes
[Figure: UPC barrier latency (us) and latency normalized to 1K processes vs. system size (1,024–32,768 processes).]
UPC Broadcast Performance
• Broadcast latency for a 4-byte message at 32,768 processes – 13 us
• Variation in latencies observed beyond 8,192 processes, and the variation increases with scale
• Broadcast latency does not scale linearly with increasing system size
[Figure: UPC broadcast latency (us) and latency normalized to 2K processes vs. message size (1 B–256 KB) for 2K, 8K, 16K, and 32K processes.]
UPC Reduce Performance
• Reduce latency for a 4-byte message at 32,768 processes – 5.4 us
• Linear scalability observed in the small-message range
• Variation in operation latency observed as the system size increases
[Figure: UPC reduce latency (us) and latency normalized to 2K processes vs. message size (1 B–256 KB) for 2K, 8K, 16K, and 32K processes.]
Blue Waters CraySHMEM Performance Evaluations
• Point-to-point operations and collective operations determine the performance of SHMEM programs
• Used the CraySHMEM library and the OSU OpenSHMEM Microbenchmarks for the evaluations
• Point-to-point operations evaluated (see the sketch below):
  – shmem_put
  – shmem_get
• Collective operations evaluated:
  – shmem_barrier
  – shmem_broadcast
  – shmem_reduce
  – shmem_collect
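Similarly, a minimal sketch of a put latency loop against the portable OpenSHMEM-style API is shown below (not the OSU benchmark code; shmem_init/shmem_finalize are used here, the iteration count and gettimeofday timing are illustrative, and at least two PEs are assumed):

/* shmem_putmem latency sketch - run with at least 2 PEs. */
#include <shmem.h>
#include <stdio.h>
#include <sys/time.h>

#define MAX_SIZE (32 * 1024)
#define ITERS    1000

static char dst[MAX_SIZE];   /* static/global arrays are symmetric on every PE */
static char src[MAX_SIZE];

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    for (size_t size = 1; size <= MAX_SIZE; size *= 2) {
        shmem_barrier_all();
        if (me == 0) {
            double t0 = now_us();
            for (int i = 0; i < ITERS; i++) {
                shmem_putmem(dst, src, size, 1);   /* put into PE 1's copy of dst */
                shmem_quiet();                     /* wait for remote completion */
            }
            printf("%zu bytes: %.2f us\n", size, (now_us() - t0) / ITERS);
        }
        shmem_barrier_all();
    }
    shmem_finalize();
    return 0;
}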
CraySHMEM Put/Get Performance
• Latency is flat in the 1-byte to 512-byte range and then starts climbing after 1 KB
  – Latency of a 4-byte Put operation (intra-/inter-node) – 0.12/1.04 us
  – Latency of a 4-byte Get operation (intra-/inter-node) – 0.05/1.41 us
• Significantly higher latency observed for the Get operation as the message size increases
  – Get latency for a 512 KB message – 763 us
[Figure: SHMEM put and get latency (us) vs. message size (1 B–32 KB), intra-node and inter-node.]
CraySHMEM Barrier Performance
• Barrier latency at 16,384 processes – 138.64 us
• Latencies similar to those of the UPC barrier
• Shows good scalability trends with increasing system size
[Figure: SHMEM barrier latency (us) and latency normalized to 2K processes vs. system size (2,048–16,384 processes).]
CraySHMEM Broadcast Performance
• Latency is flat in the 1-byte to 512-byte range and then starts climbing, regardless of process count
• Broadcast latency for a 4-byte message at 16,384 processes – 72.3 us
• Variation in latencies observed with increasing system size
[Figure: SHMEM broadcast latency (us) and latency normalized to 2K processes vs. message size (4 B–256 KB) for 2K, 4K, 8K, and 16K processes.]
CraySHMEM Reduce Performance
• Latency for a 4-byte message at 16K processes – 210 us
• Scalability analysis shows good scalability trends even at higher system sizes
• Latencies are smaller than for the UPC reduce operation – UPC collective operations incur extra synchronization operations
[Figure: SHMEM reduce latency (us) and latency normalized to 2K processes vs. message size (4 B–256 KB) for 2K, 4K, 8K, and 16K processes.]
CraySHMEM Collect Performance
• Latency of a 4-byte collect (all-gather) operation at 16K processes – 319.3 ms
• Scalability analysis shows the collect operation scales well
[Figure: SHMEM collect latency (ms) and latency normalized to 2K processes vs. message size (4 B–32 KB) for 2K, 4K, 8K, and 16K processes.]
Key Questions
• How do MPI collectives perform at extreme scales?
• How well do the CraySHMEM and UPC PGAS collective communications scale?
• Can both the CPU and GPU resources be leveraged effectively in a hybrid node system?
Current Execution of HPL on Heterogeneous GPU Clusters
• HPL (High Performance Linpack) is the benchmark for ranking supercomputers in the Top500 list
• Current HPL support for GPU clusters
  – Heterogeneity inside a node (CPU+GPU)
  – Homogeneity across nodes
• Current HPL execution on heterogeneous GPU clusters
  – Only CPU nodes (using all the CPU cores)
  – Only GPU nodes (using CPU+GPU on only the GPU nodes)
  – If the CPU/GPU node ratio is high => report the "Only CPU" runs
• Hybrid HPL support for heterogeneous systems
  – Heterogeneity inside a node (CPU+GPU)
  – Heterogeneity across nodes (nodes with and without GPUs)
R. Shi, S. Potluri, K. Hamidouche, X. Lu, K. Tomko and D. K. Panda, "A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters," IEEE Cluster (Cluster '13), Best Student Paper Award
Two-Level Workload Partitioning: Inter-node
• Inter-node static partitioning
  – Original design: uniform distribution, bottleneck on CPU nodes
  – New design: identical block size, schedules more MPI processes on GPU nodes
  – MPI_GPU = ACTUAL_PEAK_GPU / ACTUAL_PEAK_CPU + β, chosen such that NUM_CPU_CORES mod MPI_GPU = 0 (evenly split the cores; see the sketch below)
[Figure: original partitioning (2x2 process grid) vs. MPI process-scaling based partitioning (2x3 process grid) over GPU nodes G1, G2 and CPU nodes C1, C2.]
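A small sketch of the process-scaling heuristic above (the function name, the rounding/adjustment step, and the example peak numbers are illustrative; β is a tunable correction term):

/* Choose how many MPI processes to place on each GPU node so the CPU cores
 * divide evenly among them. Illustrative only; not the Hybrid-HPL source. */
#include <math.h>
#include <stdio.h>

int processes_per_gpu_node(double actual_peak_gpu, double actual_peak_cpu,
                           double beta, int num_cpu_cores)
{
    /* MPI_GPU = ACTUAL_PEAK_GPU / ACTUAL_PEAK_CPU + beta */
    int mpi_gpu = (int)round(actual_peak_gpu / actual_peak_cpu + beta);

    /* Enforce NUM_CPU_CORES mod MPI_GPU == 0 so the cores split evenly. */
    while (mpi_gpu > 1 && num_cpu_cores % mpi_gpu != 0)
        mpi_gpu--;
    return mpi_gpu;
}

int main(void)
{
    /* Hypothetical per-node peaks (Gflops) and core count, for illustration only. */
    printf("%d MPI processes per GPU node\n",
           processes_per_gpu_node(1300.0, 300.0, 0.0, 16));
    return 0;
}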
Two-Level Workload Partitioning: Intra-node
• Intra-node dynamic partitioning
• MPI-to-device mapping
  – Original design: 1:1
  – New design: M:N (M > N), where N = number of GPUs per node and M = number of MPI processes
• Initial split-ratio tuning: alpha = GPU_LEN / (GPU_LEN + CPU_LEN) (see the sketch below)
  – Fewer CPU cores per MPI process
  – Overhead caused by scheduling multiple MPI processes on GPU nodes
[Figure: intra-node workload split – blocks A, B1, B2, C1, C2 divided between a GPU portion (GPU_LEN) and a CPU portion (CPU_LEN).]
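A tiny sketch of the initial intra-node split ratio (the reading of GPU_LEN/CPU_LEN as the GPU and CPU portions of the update, and all numbers, are assumptions for illustration):

/* Initial split ratio: alpha = GPU_LEN / (GPU_LEN + CPU_LEN); an alpha fraction
 * of the columns is offloaded to the GPU, the rest stays on the CPU cores. */
#include <stdio.h>

int main(void)
{
    double gpu_len = 8.0, cpu_len = 2.0;            /* hypothetical measurements */
    double alpha   = gpu_len / (gpu_len + cpu_len); /* initial split ratio */

    int total_cols = 1024;                          /* columns in the current update */
    int gpu_cols   = (int)(alpha * total_cols);     /* GPU portion */
    int cpu_cols   = total_cols - gpu_cols;         /* CPU portion */

    printf("alpha = %.2f -> GPU %d cols, CPU %d cols\n", alpha, gpu_cols, cpu_cols);
    return 0;
}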
Performance Tuning of Single CPU Node and GPU Node
• Netlib-CPU: standard HPL version from Netlib (UTK)
• Hybrid-CPU: hybrid HPL version with OpenMP support
• NVIDIA-GPU: NVIDIA's HPL version
• The OpenBLAS math library is used
[Figure: peak performance (Gflops) on a single CPU/GPU node vs. problem size N (10,000–60,000) for Netlib-CPU, Hybrid-CPU, and NVIDIA-GPU.]
Peak Performance Scaling of Pure CPU/GPU Nodes
• Measures the peak performance of either pure CPU nodes or pure GPU nodes (1, 2, 4, 8, 16 nodes)
[Figure: performance scaling (Gflops) vs. number of CPU/GPU nodes (1–16) for Netlib-CPU, Hybrid-CPU, and NVIDIA-GPU.]
Strong and Weak Scalability of Hybrid CPU+GPU Nodes
• Uses Hybrid-HPL to measure scalability with 4 GPU nodes + (4, 8, 12, 16) CPU nodes
• Launches 1 MPI process per CPU node and 1, 2, or 4 MPI processes per GPU node
• Strong scalability: fixed problem size N for each combination of CPUs+GPUs (e.g., N = 100,000 for 4 GPUs + 4 CPUs)
• Weak scalability: fixed memory usage (~40%) on GPU nodes for all cases
[Figure: strong and weak scalability – peak performance (Gflops) vs. number of CPU nodes (4–16) for 1, 2, and 4 MPI processes per GPU node.]
Peak Performance of Hybrid CPU Nodes + GPU Nodes
• Measures the peak performance of 64 CPU nodes and 16 GPU nodes
• Launches 1 MPI process per CPU node and 4 MPI processes per GPU node
Node Configuration     Peak Performance (Gflops)
16 GPUs                6,480
64 CPUs                13,210
16 GPUs + 64 CPUs      14,520
• Peak performance efficiency (Hybrid-HPL) = peak perf. of hybrid nodes / (peak perf. of CPUs + peak perf. of GPUs), e.g. 14,520 / (6,480 + 13,210) = 73.7%
Conclusion
• The Blue Waters system provides unique opportunities
  – Communication studies at large scale
  – A hybrid system with XE6 and XK7 nodes
• MPI collectives study on up to 128K processes
  – Latency-sensitive collectives such as Reduce perform well
  – Bandwidth limitations impact dense collectives such as Allgather
• UPC and SHMEM communication studies on up to 32K and 16K cores, respectively
  – UPC and SHMEM point-to-point performance is good
  – Some collectives (UPC Scatter, SHMEM Broadcast) scale well; for others (SHMEM collect) we observed high latencies
Conclusion (continued)
• Hybrid HPL
  – Peak single CPU node performance: 202 Gflops
  – Peak GPU node performance: 670 Gflops
  – Performance efficiency of hybrid HPL compared to the sum of pure CPU and GPU nodes: above 70% with 16 GPU nodes and 64 CPU nodes
• Contact us:
  Dhabaleswar K. (DK) Panda, The Ohio State University
  E-mail: [email protected]
  http://www.cse.ohio-state.edu/~panda
  Karen Tomko, Ohio Supercomputer Center
  E-mail: [email protected]
  http://www.osc.edu/~ktomko