DISP: Optimizations Towards Scalable MPI Startup
TRANSCRIPT
Huansong Fu†, Swaroop Pophale*, Manjunath Gorentla Venkata*, Weikuan Yu†
†Florida State University, *Oak Ridge National Laboratory
Outline
• Background and motivation
  – Issues with MPI startup
  – Cost analysis
• Design of DISP
  – Delayed initialization
  – Module sharing
  – Prediction-based topology setup
• Experiments
• Conclusion
S-3
Increasing Scale of HPC Systems
• The scale of High Performance Computing (HPC) systems is increasing rapidly.

Table: Top500 List (Nov 2016)
Rank | Name      | Cores      | Rpeak (TFlop/s) | Rank in Nov 2015
1    | Sunway    | 10,649,600 | 93,014.6        | N/A
2    | Tianhe-2  | 3,120,000  | 33,862.7        | 1
3    | Titan     | 560,640    | 17,590.0        | 2
4    | Sequoia   | 1,572,864  | 17,173.2        | 3
5    | Cori      | 622,336    | 14,014.7        | N/A
6    | Oakforest | 556,104    | 13,554.6        | N/A
S-4
MPI Startup at Scale
• For the last 20 years, the Message Passing Interface (MPI) has been the de facto parallel programming system on HPC systems.
• However, MPI startup has serious performance issues at scale.

[Figure: time (ms) vs. number of processes (2 to 1024) for MPI_Init and for 1000 iterations of MPI_Reduce. MPI_Init grows much faster, with annotated gaps of 10x and 24x at larger scales.]
S-5
MPI Startup Breakdown
• The initialization of the communicator object and the collective module is particularly non-scalable.

Startup Phase:
– Get backend framework ready: opal_init, orte_init, ...
– Init OMPI, including the global communicator and collective module (global comm init, coll init): ompi_mpi_init, ompi_comm_init, mca_coll_select, ...
– Initialize subsequent communicators (sub-comm init): ompi_comm_split, ompi_comm_create, ompi_comm_dup, ...

Work Phase:
– Run collectives with communicators: barrier, bcast, reduce, ...
S-6
Issues with Comm & Coll Init
• There can be many communicators for various uses.
  – For every communicator, an identical communicator object must be created on every participating process.
  – A communicator object contains its basic info (rank, size, c_id, ...) and a collective module that orchestrates collective communication.
• Thus, the computation time and memory consumption grow linearly with the number of processes.

[Figure: Processes 0, 1, and 2 each hold communicator objects for Comm A, B, and C, each with its own collective module: 9 comm objects & collective modules in total.]
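The 9-object count above generalizes: with no sharing, every process builds its own communicator object and collective module for every communicator it participates in, so the job-wide count is simply the product. A trivial model (not from the slides, just to make the growth concrete):

```python
# Duplication model: one communicator object (and collective module)
# per (process, communicator) pair when nothing is shared.
def total_objects(nprocs: int, ncomms: int) -> int:
    return nprocs * ncomms

# The slide's example: 3 processes x 3 communicators = 9 objects.
assert total_objects(3, 3) == 9
# At scale the system-wide cost grows linearly with process count:
assert total_objects(1024, 3) == 3072
```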
S-7
Multi-Level Collective
• Cheetah is a popular framework that provides a suite of fast collectives. Its module for OpenMPI is called Multi-Level (ML).
• ML has a hierarchical structure. A process in one group communicates with members of other groups through a higher-level root.
• Every process needs to set up a global topology in order to communicate.
S-8
Cost Analysis of MPI_Init
• We study the time and memory cost of MPI_Init using OpenMPI's default collective module (i.e., Tuned) and ML.
• The two show different performance behaviors.
  – Generally, the init of ML performs worse than the init of Tuned.
  – The init of ML scales particularly badly in terms of time.

[Figure: Cost of MPI_Init using ML. Time (s) and memory consumption (MB) vs. number of processes; time reaches 16 s and memory reaches 24 MB at the largest scale.]
S-9
Cost Breakdown of ML Init
• The topology setup conducts many all-to-all collective communications across all participating processes in the corresponding communicator.
• The time to finish this inter-process communication can occupy most of ML's init. It is the essential cause of ML init's poor scalability.

[Figure: share of total ML init cost for 2–4096 processes, split into inter-process communication vs. other costs. The communication share grows from 44–64% at small scales to 85–93% at 128 processes and beyond.]
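Why the communication share dominates at scale can be seen from a back-of-the-envelope message-count model (an illustrative sketch, not taken from the slides): in an all-to-all exchange among P processes, the number of point-to-point messages grows quadratically with P.

```python
# Rough message-count model for an all-to-all exchange among P processes:
# each process sends one message to each of the other P - 1 processes.
def all_to_all_messages(p: int) -> int:
    return p * (p - 1)

for p in (128, 1024, 4096):
    print(p, all_to_all_messages(p))   # quadratic growth in P
```

Doubling the process count roughly quadruples the message count, which matches the figure's trend of communication consuming an ever-larger fraction of ML init.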
S-10
Related Works & Our Solution
• Previous studies have recognized the performance issue of communicator initialization. However, most have not identified and addressed the non-scalable initialization of the collective module, especially ML.
• We propose a hybrid solution: Delayed Initialization with Sharing and Prediction (DISP).
  1. Delayed initialization
  2. Module sharing
  3. Prediction-based topology setup
S-11
Delayed Initialization
• Delay the initialization of a communicator until it is actually used. Instead of a full-fledged communicator, we create a shadow communicator that only contains its basic info.
  – It removes the cost of unused modules.
  – Delayed initialization also facilitates module sharing between successive identical communicators, which removes the initialization cost of identical modules.

[Figure: timeline comparing the old process (global comm init and sub-comm init up front, some of it unused) with the new process (shallow init at startup, then on-demand init and module sharing during the collectives).]
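The delayed-initialization idea can be sketched as a lazy-construction pattern (hypothetical names, not OpenMPI's actual structures): the shadow communicator stores only its basic info, and the expensive collective module is built on the first collective call.

```python
class ShadowComm:
    """Sketch of a shadow communicator: holds only basic info; the
    (expensive) collective module is built lazily, on first use."""
    def __init__(self, c_id: int, rank: int, size: int):
        self.c_id, self.rank, self.size = c_id, rank, size
        self._coll = None        # collective module: not built yet
        self.init_count = 0      # how many times the module was built

    def _coll_module(self):
        if self._coll is None:   # on-demand init happens here
            self.init_count += 1
            self._coll = {"topology": f"built-for-size-{self.size}"}
        return self._coll

    def barrier(self):
        return self._coll_module()  # a real module would now communicate

comm = ShadowComm(c_id=1, rank=0, size=1024)
assert comm.init_count == 0      # creation is cheap: nothing built
comm.barrier()
comm.barrier()
assert comm.init_count == 1      # module built exactly once, on first use
```

A communicator that is never used for a collective never pays the module's initialization cost at all.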
S-12
Module Sharing
• Temporal sharing: a collective module is shared between identical communicators.
• Spatial sharing: a collective module is shared between MPI processes on the same node. Only the root process on that node initializes the module.

[Figure: on Node 1, identical communicators comm A and comm B within Process 0 share one collective module (temporal sharing), and the module is also shared with Process 1 on the same node (spatial sharing).]
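Temporal sharing can be sketched as a cache keyed by the communicator's membership (an illustrative model with hypothetical names, not OpenMPI's actual code): identical communicators look up and reuse one already-initialized collective module.

```python
class ModuleCache:
    """Sketch of temporal sharing: identical communicators (same member
    ranks) reuse a single initialized collective module."""
    def __init__(self):
        self._cache = {}
        self.builds = 0          # how many modules were actually built

    def get(self, member_ranks):
        key = tuple(sorted(member_ranks))
        if key not in self._cache:   # first communicator with this
            self.builds += 1         # membership pays the init cost
            self._cache[key] = {"group": key}
        return self._cache[key]

cache = ModuleCache()
m1 = cache.get([0, 1, 2, 3])   # comm A: builds the module
m2 = cache.get([3, 2, 1, 0])   # comm B, identical membership: reuses it
assert m1 is m2 and cache.builds == 1
```

Spatial sharing follows the same reuse idea across processes on one node, e.g. via a module placed in shared memory by the node's root process.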
S-13
Prediction-based Topology Setup
• Based on system specifics, every process predicts the topology without exchanging information with others.
• Our prediction algorithm can compute the following information:
  ① Highest and lowest hierarchy level;
  ② Ranks of all participating processes;
  ③ All group lists that contain the ranks of the members;
  ④ Routing table of how a process can be reached from another one.

[Figure: a three-level hierarchy (Level 0–2) annotated with ranks, groups, and levels.]
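The flavor of such a prediction can be shown with a fixed-radix hierarchy (purely illustrative; DISP's actual algorithm accounts for real system specifics): with group size k, each process derives its group, the group members, and the group root at every level from its own rank alone, with no message exchange.

```python
# Illustrative topology prediction for a fixed-radix (k-ary) hierarchy.
# Each level's groups and roots follow from pure arithmetic on the rank,
# so no inter-process communication is needed to build the topology.
def predict(rank: int, nprocs: int, k: int):
    levels = []
    r, span = rank, 1                        # span: ranks per level-slot
    while span < nprocs:
        group = r // k                       # which group at this level
        members = [(group * k + i) * span    # lowest rank in each slot
                   for i in range(k)
                   if (group * k + i) * span < nprocs]
        levels.append({"group": group, "members": members,
                       "root": members[0]})  # group root: lowest member
        r, span = group, span * k            # roots form the next level
    return levels

topo = predict(rank=5, nprocs=8, k=2)
# Rank 5's level-0 group is {4, 5} with root 4; the top level's root is 0.
assert topo[0]["members"] == [4, 5]
assert topo[-1]["root"] == 0
```

Because every process can run the same computation for any other rank, group lists and routing between arbitrary ranks are also derivable locally, which is what removes the all-to-all exchanges from ML's setup.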
S-14
Experimental Setup
• Testbed: all experiments are conducted on Titan's Cray XK6 machines.
  – 16-core AMD Opteron 6200 series processor.
  – 32 GB of DDR3 memory.
  – Connected through a Gemini interconnect.
  – 600 TB total storage.
• Software: OpenMPI 1.8.8 and Cheetah 1.0.0.
• Benchmark: NAS Parallel Benchmarks v3.3 (customized) and MVAPICH MPI benchmark suite v4.4.1.
S-15
Overall Improvement
• The real improvement is the difference between DISP's improvement to the startup phase and its delay to the work phase.
• DISP improves ML by a bigger factor than Tuned because of ML's longer initialization cost.

[Fig. 1: Improvement vs. Delay. Startup improvement and work delay (ms) for Tuned and ML across the NPB benchmarks bt, cg, ep, is, ft, lu, mg, and sp; the gap between the two bars is the real improvement.]
S-16
Memory Savings
• Delayed initialization saves the memory of unused communicators, and module sharing saves that of reusable collective modules.
• Actual savings depend on the ratio between the size of the collective module and the size of the communicator object.

[Fig. 1: For Tuned. Fig. 2: For ML. Memory consumption (MB) vs. number of processes, Orig vs. DISP. Average savings: 8.6% for Tuned and 85.7% for ML.]
S-17
Benefit of Prediction-based Setup
• By speeding up the initialization of the collective module, topology prediction significantly reduces the cost of MPI initialization calls.

[Fig. 1: MPI_Init() time (ms) vs. number of processes, Orig vs. DISP. Fig. 2: MPI_Comm_split and MPI_Comm_create time (ms), Orig vs. DISP; annotated improvements of 70.0%, 63.8%, and 74.9%.]
S-18
Conclusion
• Issues with the communicator and collective module can significantly diminish MPI's scalability to thousands or more processes. We have examined this impact in terms of time and memory cost.
• By prudently delaying the initialization and sharing reusable collective modules, we can efficiently reduce both the time and memory cost.
• The costly topology setup of the multi-level collective module can be well mitigated by a prediction-based approach without affecting the collective module's functionality.
S-19
Acknowledgment
S-20
Thank You and Questions?