Dealing with the Scale Problem

Presented by the Innovative Computing Laboratory MPI Team


Page 1: Dealing with the Scale Problem

Presented by

Dealing with the Scale Problem

Innovative Computing Laboratory, MPI Team

Page 2: Dealing with the Scale Problem

Dongarra_KOJAK_SC07 / Bosilca_OpenMPI_SC07

Page 3: Dealing with the Scale Problem


Runtime scalability

[Figure: nodes 0–10 arranged in a circular space]

Undirected graph G := (V, E), |V| = n (any size). Node i ∈ {0, 1, 2, …, n−1} has links to a set of nodes U = {i±1, i±2, …, i±2^k | 2^k ≤ n} in a circular space, i.e., U = {(i+1) mod n, (i+2) mod n, …, (i+2^k) mod n | 2^k ≤ n} ∪ {(n+i−1) mod n, (n+i−2) mod n, …, (n+i−2^k) mod n | 2^k ≤ n}.
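The neighbor set above can be sketched in a few lines of Python (an illustrative helper, not code from the slides; `bmg_neighbors` is a hypothetical name):

```python
def bmg_neighbors(i, n):
    """Neighbors of node i in an n-node binomial graph:
    links at offsets ±1, ±2, ±4, ..., ±2**k with 2**k <= n,
    taken modulo n on the ring."""
    nbrs, d = set(), 1
    while d <= n:
        nbrs.add((i + d) % n)
        nbrs.add((i - d) % n)
        d *= 2
    nbrs.discard(i)  # an offset of exactly n wraps back to i itself
    return nbrs
```

Because every link at offset +d from node i is an offset −d link from the other endpoint, the resulting graph is undirected, as the definition requires.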

[Figure: binomial graph on nodes 0–11 with all links merged]

Merging all of these links creates a binomial graph; it embeds a binomial tree rooted at each node of the graph.

Broadcast from any node in Log2(n) steps

Binomial graph

Page 4: Dealing with the Scale Problem


Degree = number of neighbors = number of connections = resource consumption

Node-connectivity (κ) = link-connectivity (λ) = δmin: the graph is optimally connected.

Diameter: O(⌈log2(n)/2⌉)

Average distance ≈ log2(n)/3

Binomial graph properties

Runtime scalability

Page 5: Dealing with the Scale Problem


Broadcast: optimal in number of steps, log2(n) (a binomial tree rooted at each node).

Allgather: Bruck algorithm, log2(n) steps. At step s: node i sends its data to node i−2^s and receives data from node i+2^s.
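The step count of the Bruck schedule can be checked with a small simulation (a sketch only; the real algorithm also handles data placement and the partial final exchange for non-power-of-two sizes):

```python
def bruck_allgather(n):
    """Simulate the Bruck allgather schedule on n nodes.
    At each step (offset 2**s), node i sends everything it currently
    holds to node (i - 2**s) % n and receives from (i + 2**s) % n."""
    data = [{i} for i in range(n)]   # each node starts with its own block
    offset, steps = 1, 0
    while offset < n:
        incoming = [data[(i + offset) % n] for i in range(n)]
        data = [data[i] | incoming[i] for i in range(n)]
        offset *= 2
        steps += 1
    return data, steps
```

After ⌈log2(n)⌉ steps every node holds every block, which is the optimal step count the slide claims.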

Routing cost

Runtime scalability

Because of the counter-clockwise links the routing problem is NP-complete.

Good approximations exist, with a max overhead of 1 hop.
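One simple approximation is greedy routing: always hop to the neighbor closest to the destination in circular distance. This is an illustrative heuristic under that assumption, not the specific approximation algorithm the slide refers to; `bmg_neighbors` is repeated here so the sketch is self-contained:

```python
def bmg_neighbors(i, n):
    """Binomial-graph neighbors of node i: offsets ±2**j with 2**j <= n."""
    nbrs, d = set(), 1
    while d <= n:
        nbrs.add((i + d) % n)
        nbrs.add((i - d) % n)
        d *= 2
    nbrs.discard(i)
    return nbrs

def circ_dist(a, b, n):
    """Circular distance between ring positions a and b."""
    d = (b - a) % n
    return min(d, n - d)

def bmg_route(src, dst, n):
    """Greedy routing sketch: hop to the neighbor with the smallest
    circular distance to dst. The ±1 neighbor always makes progress,
    so the distance strictly decreases and the walk terminates."""
    path, cur = [src], src
    while cur != dst:
        cur = min(bmg_neighbors(cur, n), key=lambda v: circ_dist(v, dst, n))
        path.append(cur)
    return path
```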

Page 6: Dealing with the Scale Problem


[Figure: self-healing binomial graph on nodes 0–9]

Self-Healing BMG

[Table: number of added nodes and number of new connections vs. number of nodes]

Runtime scalability

Dynamic environments

Page 7: Dealing with the Scale Problem


Optimization process and run-time selection process

We use performance models, graphical encoding, and statistical learning techniques to build platform-specific, efficient, and fast runtime decision functions.
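A runtime decision function of this kind can be as simple as a lookup keyed on communicator size and message size. The sketch below is purely illustrative: the algorithm names and thresholds are hypothetical placeholders, whereas real decision functions are fitted per platform from measurements and models:

```python
def select_bcast_algorithm(comm_size, msg_size):
    """Illustrative runtime decision function for broadcast.
    Thresholds here are made-up placeholders; in practice they are
    derived from platform-specific benchmarking and statistical
    learning, as described in the text."""
    if comm_size <= 8:
        return "linear"            # tiny communicators: flat send loop
    if msg_size < 2048:
        return "binomial-tree"     # small messages: minimize steps
    if msg_size < 1 << 20:
        return "split-binary-tree" # medium messages
    return "pipelined-ring"        # large messages: maximize bandwidth
```

The point of encoding the decision as a plain function is that it is cheap to evaluate on every collective call, and it can be regenerated (or replaced by hand, as in the POP example later) without touching the collective implementations themselves.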

Collective communications

Page 8: Dealing with the Scale Problem


Collective communications

Model prediction

[Figure: model predictions, panels (A)–(D)]

Page 9: Dealing with the Scale Problem


Model prediction

Collective communications

[Figure: model predictions, panels (A)–(D)]

Page 10: Dealing with the Scale Problem


64 Opterons with 1 Gb/s TCP

Tuning broadcast

Collective communications

Page 11: Dealing with the Scale Problem


Tuning broadcast

Collective communications

Page 12: Dealing with the Scale Problem


Application tuning: Parallel Ocean Program (POP) on a Cray XT4.

POP is dominated by MPI_Allreduce on 3 doubles. By default, Open MPI selects recursive doubling, similar to Cray MPI (based on MPICH), but Cray MPI has better latency; i.e., POP using Open MPI is 10% slower on 256 processes.

We profile the system for this specific collective, determine that "reduce + bcast" is faster, and replace the decision function. The new POP performance is about 5% faster than Cray MPI.
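The two allreduce strategies compute the same result with different communication patterns, which is why they can be swapped by a decision function. A minimal simulation of both (illustrative only; real implementations exchange messages rather than indexing a shared list):

```python
def recursive_doubling_allreduce(values):
    """Simulate recursive-doubling allreduce (sum) for a power-of-two
    process count: at each step, process i exchanges its partial sum
    with partner i XOR offset, doubling the offset every step."""
    p = len(values)
    assert p & (p - 1) == 0, "power-of-two process count assumed"
    vals, offset = list(values), 1
    while offset < p:
        vals = [vals[i] + vals[i ^ offset] for i in range(p)]
        offset *= 2
    return vals

def allreduce_reduce_bcast(values):
    """The alternative chosen for POP: reduce to a root, then broadcast."""
    total = sum(values)            # reduce phase
    return [total] * len(values)   # broadcast phase
```

For a payload as small as 3 doubles, the relative cost is dominated by latency, not bandwidth, which is why the profiled "reduce + bcast" combination can win despite recursive doubling being the textbook choice.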

Collective communications

Page 13: Dealing with the Scale Problem


Fault tolerance

Four available processors: P1, P2, P3, P4.

Add a fifth processor Pc and perform a checkpoint (Allreduce): Pc = P1 + P2 + P3 + P4.

Ready to continue.

Failure: processor P2 is lost.

Ready for recovery.

Recover the processor/data: P2 = Pc − P1 − P3 − P4.

Diskless checkpointing

Page 14: Dealing with the Scale Problem


Diskless checkpointing

How to checkpoint? Either floating-point arithmetic or binary arithmetic will work. If checkpoints are performed in floating-point arithmetic, then we can exploit the linearity of the mathematical relations on the object to maintain the checksums.

How to support multiple failures? The Reed-Solomon algorithm: supporting p failures requires p additional processors (resources).
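The single-failure case from the diagram can be sketched directly from the linearity argument (an illustrative sketch assuming a floating-point sum checksum; the Reed-Solomon generalization for p failures is not shown):

```python
def checkpoint(process_states):
    """Diskless checkpoint: the checksum processor Pc stores the
    element-wise sum of all process states, using the linearity of
    floating-point addition over the checkpointed object."""
    return [sum(col) for col in zip(*process_states)]

def recover_one(checksum, surviving_states):
    """Recover a single failed process by subtracting the survivors
    from the checksum: P_failed = Pc - sum(survivors)."""
    lost = list(checksum)
    for state in surviving_states:
        lost = [c - x for c, x in zip(lost, state)]
    return lost
```

Note that with floating-point arithmetic the recovered values can differ from the originals by rounding error; binary (XOR) arithmetic avoids this at the cost of losing the algebraic structure.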

Fault tolerance

Page 15: Dealing with the Scale Problem


Performance of PCG with different MPI libraries

For checkpointing, we generate one checkpoint every 2,000 iterations.

Fault tolerance

PCG: fault-tolerant CG on 64 dual-processor AMD64 nodes connected using GigE.

Page 16: Dealing with the Scale Problem


[Figure: PCG checkpoint overhead in seconds]

Fault tolerance

Page 17: Dealing with the Scale Problem



Fault tolerance

Detailing event types to avoid intrusiveness

Page 18: Dealing with the Scale Problem


Fault tolerance

Interposition in Open MPI

We want to avoid tainting the base code with #ifdef FT_BLALA.

The Vampire PML loads a new class of MCA components. Vprotocols provide the entire FT protocol (only pessimistic for now). You can use the ability to define subframeworks in your components!

Keep using the optimized low level and zero-copy devices (BTL) for communication.

Unchanged message scheduling logic.

Page 19: Dealing with the Scale Problem


[Figure: performance overhead of improved pessimistic message logging (NetPIPE, Myrinet 2000). Open MPI-V without and with sender-based logging, plotted as a percentage of plain Open MPI performance (75%–105%) over message sizes from 1 byte to 8 MB.]


Fault tolerance

Performance overhead

Myrinet 2G (MX 1.1.5), dual Opteron 146, 2 GB RAM, Linux 2.6.18, gcc/gfortran 4.1.2, NPB 3.2, NetPIPE.

Only two application kernels show non-deterministic events (MG, LU).

Page 20: Dealing with the Scale Problem


Interactive debugging

[Figure: development cycle (design and code, testing, release)]

Fault tolerance

Debugging applications

The usual scenario involves cyclic debugging: the programmer designs a testing suite; the testing suite shows a bug; the programmer runs the application in a debugger (such as gdb) to understand and fix the bug; then the programmer runs the testing suite again.

Page 21: Dealing with the Scale Problem


Fault tolerance


Interposition in Open MPI: events are stored (asynchronously on disk) during the initial run. Keep using the optimized low-level and zero-copy devices (BTL) for communication. Unchanged message scheduling logic. We expect low impact on application behavior.

Page 22: Dealing with the Scale Problem


Fault tolerance


Performance overhead: Myrinet 2G (MX 1.0.3), dual Opteron 146, 2 GB RAM, Linux 2.6.18, gcc/gfortran 4.1.2, NPB 3.2, NetPIPE. 2% overhead on bare latency, no overhead on bandwidth. Only two application kernels show non-deterministic events. Receiver-based logging has more overhead; moreover, it incurs a large amount of logged data (350 MB on a simple ping-pong), so it is enabled on demand.

Page 23: Dealing with the Scale Problem



Log size per process on NAS Parallel Benchmarks (kB)

Fault tolerance

Log size on NAS parallel benchmarks

Among the 7 NAS kernels, only 2 generate non-deterministic events.

Log size does not correlate with the number of processes: the more scalable the application, the more scalable the log mechanism. Only 287 KB of log per process for LU.C.1024 (200 MB memory footprint per process).

Page 24: Dealing with the Scale Problem


Contact

Dr. George Bosilca
Innovative Computing Laboratory
Electrical Engineering and Computer Science Department
University of Tennessee
[email protected]