DISP: Optimizations Towards Scalable MPI Startup
TRANSCRIPT
Huansong Fu†, Swaroop Pophale*, Manjunath Gorentla Venkata*, Weikuan Yu†
†Florida State University, *Oak Ridge National Laboratory
Outline
• Background and motivation
  – Issues with MPI startup
  – Cost analysis
• Design of DISP
  – Delayed initialization
  – Module sharing
  – Prediction-based topology setup
• Experiments
• Conclusion
S-3
Increasing Scale of HPC Systems
• The scale of High Performance Computing (HPC) systems is increasing rapidly.

Table: Top500 List (Nov 2016)
Rank | Name      | Cores      | Rpeak (TFlop/s) | Rank in Nov 2015
1    | Sunway    | 10,649,600 | 93,014.6        | N/A
2    | Tianhe-2  | 3,120,000  | 33,862.7        | 1
3    | Titan     | 560,640    | 17,590.0        | 2
4    | Sequoia   | 1,572,864  | 17,173.2        | 3
5    | Cori      | 622,336    | 14,014.7        | N/A
6    | Oakforest | 556,104    | 13,554.6        | N/A
S-4
MPI Startup at Scale
• For the last 20 years, the Message Passing Interface (MPI) has been the de facto parallel programming system on HPC systems.
• However, MPI startup has serious performance issues at scale.

[Figure: time (ms) vs. number of processes (2 to 1024) for MPI_Init and for 1000 iterations of MPI_Reduce. MPI_Init grows much faster, with annotated gaps of 10x and 24x at larger scales.]
S-5
MPI Startup Breakdown
• The initialization of the communicator object and the collective module is particularly non-scalable.

Startup Phase:
– Get backend framework ready: opal_init, orte_init, ...
– Init OMPI, including the global communicator and collective module (global comm init, coll init): ompi_mpi_init, ompi_comm_init, mca_coll_select, ...
– Initialize subsequent communicators (sub-comm init): ompi_comm_split, ompi_comm_create, ompi_comm_dup, ...

Work Phase:
– Run collectives with communicators: barrier, bcast, reduce, ...
S-6
Issues with Comm & Coll Init
• There can be many communicators for various uses.
  – For every communicator, an identical communicator object must be created on every participating process.
  – A communicator object contains its basic info (rank, size, c_id, ...) and a collective module that orchestrates collective communication.
• Thus, the computation time and memory consumption grow linearly with the number of processes.

[Figure: Processes 0, 1, and 2 each hold communicator objects for Comm A, B, and C, each with its own collective module: 9 comm objects & collective modules in total.]
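The 9-object count above generalizes: with no sharing, every process builds its own communicator object and collective module for every communicator it participates in, so the job-wide count is simply the product. A trivial model (not from the slides, just to make the growth concrete):

```python
# Duplication model: one communicator object (and collective module)
# per (process, communicator) pair when nothing is shared.
def total_objects(nprocs: int, ncomms: int) -> int:
    return nprocs * ncomms

# The slide's example: 3 processes x 3 communicators = 9 objects.
assert total_objects(3, 3) == 9
# At scale the system-wide cost grows linearly with process count:
assert total_objects(1024, 3) == 3072
```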
S-7
Multi-Level Collective
• Cheetah is a popular framework that provides a suite of fast collectives. Its module for OpenMPI is called Multi-Level (ML).
• ML has a hierarchical structure. A process in one group communicates with members of other groups through a higher-level root.
• Every process needs to set up a global topology in order to communicate.
S-8
Cost Analysis of MPI_Init
• We study the time and memory cost of MPI_Init using OpenMPI's default collective module (i.e., Tuned) and ML.
• The two show different performance behaviors.
  – Generally, the init of ML performs worse than the init of Tuned.
  – The init of ML scales particularly badly in terms of time.

[Figure: Cost of MPI_Init using ML. Time (s) and memory consumption (MB) vs. number of processes; time reaches 16 s and memory reaches 24 MB at the largest scale.]
S-9
Cost Breakdown of ML Init
• The topology setup conducts many all-to-all collective communications across all participating processes in the corresponding communicator.
• The time to finish this inter-process communication can occupy most of ML's init. It is the essential cause of ML init's poor scalability.

[Figure: share of total ML init cost for 2–4096 processes, split into inter-process communication vs. other costs. The communication share grows from 44–64% at small scales to 85–93% at 128 processes and beyond.]
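Why the communication share dominates at scale can be seen from a back-of-the-envelope message-count model (an illustrative sketch, not taken from the slides): in an all-to-all exchange among P processes, the number of point-to-point messages grows quadratically with P.

```python
# Rough message-count model for an all-to-all exchange among P processes:
# each process sends one message to each of the other P - 1 processes.
def all_to_all_messages(p: int) -> int:
    return p * (p - 1)

for p in (128, 1024, 4096):
    print(p, all_to_all_messages(p))   # quadratic growth in P
```

Doubling the process count roughly quadruples the message count, which matches the figure's trend of communication consuming an ever-larger fraction of ML init.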
S-10
Related Works & Our Solution
• Previous studies have recognized the performance issue of communicator initialization. However, most have not identified and addressed the non-scalable initialization of the collective module, especially ML.
• We propose a hybrid solution: Delayed Initialization with Sharing and Prediction (DISP).
  1. Delayed initialization
  2. Module sharing
  3. Prediction-based topology setup
S-11
Delayed Initialization
• Delay the initialization of a communicator until it is actually used. Instead of a full-fledged communicator, we create a shadow communicator that only contains its basic info.
  – It removes the cost of unused modules.
  – Delayed initialization also facilitates module sharing between successive identical communicators, which removes the initialization cost of identical modules.

[Figure: timeline comparing the old process (global comm init and sub-comm init up front, some of it unused) with the new process (shallow init at startup, then on-demand init and module sharing during the collectives).]
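The delayed-initialization idea can be sketched as a lazy-construction pattern (hypothetical names, not OpenMPI's actual structures): the shadow communicator stores only its basic info, and the expensive collective module is built on the first collective call.

```python
class ShadowComm:
    """Sketch of a shadow communicator: holds only basic info; the
    (expensive) collective module is built lazily, on first use."""
    def __init__(self, c_id: int, rank: int, size: int):
        self.c_id, self.rank, self.size = c_id, rank, size
        self._coll = None        # collective module: not built yet
        self.init_count = 0      # how many times the module was built

    def _coll_module(self):
        if self._coll is None:   # on-demand init happens here
            self.init_count += 1
            self._coll = {"topology": f"built-for-size-{self.size}"}
        return self._coll

    def barrier(self):
        return self._coll_module()  # a real module would now communicate

comm = ShadowComm(c_id=1, rank=0, size=1024)
assert comm.init_count == 0      # creation is cheap: nothing built
comm.barrier()
comm.barrier()
assert comm.init_count == 1      # module built exactly once, on first use
```

A communicator that is never used for a collective never pays the module's initialization cost at all.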
S-12
Module Sharing
• Temporal sharing: a collective module is shared between identical communicators.
• Spatial sharing: a collective module is shared between MPI processes on the same node. Only the root process on that node initializes the module.

[Figure: on Node 1, identical communicators comm A and comm B within Process 0 share one collective module (temporal sharing), and the module is also shared with Process 1 on the same node (spatial sharing).]
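Temporal sharing can be sketched as a cache keyed by the communicator's membership (an illustrative model with hypothetical names, not OpenMPI's actual code): identical communicators look up and reuse one already-initialized collective module.

```python
class ModuleCache:
    """Sketch of temporal sharing: identical communicators (same member
    ranks) reuse a single initialized collective module."""
    def __init__(self):
        self._cache = {}
        self.builds = 0          # how many modules were actually built

    def get(self, member_ranks):
        key = tuple(sorted(member_ranks))
        if key not in self._cache:   # first communicator with this
            self.builds += 1         # membership pays the init cost
            self._cache[key] = {"group": key}
        return self._cache[key]

cache = ModuleCache()
m1 = cache.get([0, 1, 2, 3])   # comm A: builds the module
m2 = cache.get([3, 2, 1, 0])   # comm B, identical membership: reuses it
assert m1 is m2 and cache.builds == 1
```

Spatial sharing follows the same reuse idea across processes on one node, e.g. via a module placed in shared memory by the node's root process.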
S-13
Prediction-based Topology Setup
• Based on system specifics, every process predicts the topology without exchanging information with others.
• Our prediction algorithm can compute the following information:
  ① Highest and lowest hierarchy level;
  ② Ranks of all participating processes;
  ③ All group lists that contain the ranks of the members;
  ④ Routing table of how a process can be reached from another one.

[Figure: a three-level hierarchy (Level 0–2) annotated with ranks, groups, and levels.]
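The flavor of such a prediction can be shown with a fixed-radix hierarchy (purely illustrative; DISP's actual algorithm accounts for real system specifics): with group size k, each process derives its group, the group members, and the group root at every level from its own rank alone, with no message exchange.

```python
# Illustrative topology prediction for a fixed-radix (k-ary) hierarchy.
# Each level's groups and roots follow from pure arithmetic on the rank,
# so no inter-process communication is needed to build the topology.
def predict(rank: int, nprocs: int, k: int):
    levels = []
    r, span = rank, 1                        # span: ranks per level-slot
    while span < nprocs:
        group = r // k                       # which group at this level
        members = [(group * k + i) * span    # lowest rank in each slot
                   for i in range(k)
                   if (group * k + i) * span < nprocs]
        levels.append({"group": group, "members": members,
                       "root": members[0]})  # group root: lowest member
        r, span = group, span * k            # roots form the next level
    return levels

topo = predict(rank=5, nprocs=8, k=2)
# Rank 5's level-0 group is {4, 5} with root 4; the top level's root is 0.
assert topo[0]["members"] == [4, 5]
assert topo[-1]["root"] == 0
```

Because every process can run the same computation for any other rank, group lists and routing between arbitrary ranks are also derivable locally, which is what removes the all-to-all exchanges from ML's setup.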
S-14
Experimental Setup
• Testbed: all experiments are conducted on Titan's Cray XK6 machines.
  – 16-core AMD Opteron 6200 series processor.
  – 32 GB of DDR3 memory.
  – Connected through a Gemini interconnect.
  – 600 TB total storage.
• Software: OpenMPI 1.8.8 and Cheetah 1.0.0.
• Benchmark: NAS Parallel Benchmarks v3.3 (customized) and MVAPICH MPI benchmark suite v4.4.1.
S-15
Overall Improvement
• The real improvement is the difference between DISP's improvement to the startup phase and its delay to the work phase.
• DISP improves ML by a bigger factor than Tuned because of ML's longer initialization cost.

[Fig. 1: Improvement vs. Delay. Startup improvement and work delay (ms) for Tuned and ML across the NPB benchmarks bt, cg, ep, is, ft, lu, mg, and sp; the gap between the two bars is the real improvement.]
S-16
Memory Savings
• Delayed initialization saves the memory of unused communicators, and module sharing saves that of reusable collective modules.
• Actual savings depend on the ratio between the size of the collective module and the size of the communicator object.

[Fig. 1: For Tuned. Fig. 2: For ML. Memory consumption (MB) vs. number of processes, Orig vs. DISP. Average savings: 8.6% for Tuned and 85.7% for ML.]
S-17
Benefit of Prediction-based Setup
• By speeding up the initialization of the collective module, topology prediction significantly reduces the cost of MPI initialization calls.

[Fig. 1: MPI_Init() time (ms) vs. number of processes, Orig vs. DISP. Fig. 2: MPI_Comm_split and MPI_Comm_create time (ms), Orig vs. DISP; annotated improvements of 70.0%, 63.8%, and 74.9%.]
S-18
Conclusion
• Issues with the communicator and collective module can significantly diminish MPI's scalability to thousands or more processes. We have examined this impact in terms of time and memory cost.
• By prudently delaying the initialization and sharing reusable collective modules, we can efficiently reduce both the time and memory cost.
• The costly topology setup of the multi-level collective module can be well mitigated by a prediction-based approach without affecting the collective module's functionality.
S-19
Acknowledgment
S-20
Thank You and Questions?