
Page 1: Dealing with the scale problem

Dealing with the scale problem

Innovative Computing Laboratory

MPI Team

Page 2: Dealing with the scale problem
Page 3: Dealing with the scale problem

Runtime Scalability
Collective Communications
Fault Tolerance

Page 4: Dealing with the scale problem

Runtime Scalability

Binomial Graph (BMG)

[Figure: example BMG built on a 12-node ring (nodes 0-11)]

Undirected graph G := (V, E), |V| = n (any size). Node i ∈ {0, 1, 2, …, n-1} has links to a set of nodes U = { i±1, i±2, …, i±2^k | 2^k ≤ n } in a circular space, i.e. U = { (i+1) mod n, (i+2) mod n, …, (i+2^k) mod n | 2^k ≤ n } ∪ { (n+i-1) mod n, (n+i-2) mod n, …, (n+i-2^k) mod n | 2^k ≤ n }

[Figure: the same 12-node ring with the links of all binomial trees merged]

Merging all the links creates a binomial broadcast tree rooted at each node of the graph; broadcast from any node completes in log2(n) steps
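As an illustration (an editor's sketch, not code from the slides), the following C program enumerates the neighbor set U of a node i in a BMG of n nodes, following the definition above; the case 2^k = n is skipped because it would be a self-link.

```c
#include <stdio.h>

/* Print the BMG neighbor set U of node i in a ring of n nodes:
 * every node at clockwise and counter-clockwise distance 1, 2, 4, ... (powers of two). */
void bmg_neighbors(int i, int n)
{
    int seen[64] = {0};                        /* dedup; assumes n <= 64 for this sketch */
    printf("U(%d) = {", i);
    for (int p = 1; p < n; p <<= 1) {          /* p = 2^k; p = n would be a self-link */
        int cw  = (i + p) % n;                 /* clockwise link:  (i + 2^k) mod n     */
        int ccw = (i - p + n) % n;             /* counter-clockwise: (n + i - 2^k) mod n */
        if (!seen[cw])  { seen[cw]  = 1; printf(" %d", cw);  }
        if (!seen[ccw]) { seen[ccw] = 1; printf(" %d", ccw); }
    }
    printf(" }\n");
}

int main(void)
{
    bmg_neighbors(0, 12);   /* example: the 12-node BMG of the slide's figure */
    return 0;
}
```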

Page 5: Dealing with the scale problem

Runtime Scalability

BMG properties

Degree δ = number of neighbors = number of connections = resource consumption

Node-connectivity (κ), link-connectivity (λ): optimally connected, i.e. κ = λ = δmin

Diameter: O(⌈log2(n)/2⌉)

Average distance: ≈ log2(n)/3
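To make these properties concrete, here is a small C sketch (an editor's illustration with an arbitrary example size, not from the slides). It builds the BMG adjacency, runs a BFS from node 0 (the BMG is a circulant graph, so node 0's eccentricity equals the diameter), and reports eccentricity and average distance next to ⌈log2(n)/2⌉ and log2(n)/3 for comparison.

```c
#include <stdio.h>
#include <math.h>

#define N 128                 /* example size; any n works */

int adj[N][N];                /* adjacency matrix of the BMG */
int dist[N];                  /* BFS distances from node 0   */
int fifo[N];

int main(void)
{
    /* build links: i <-> (i +/- 2^k) mod n for 2^k < n */
    for (int i = 0; i < N; i++)
        for (int p = 1; p < N; p <<= 1) {
            adj[i][(i + p) % N] = adj[(i + p) % N][i] = 1;
            adj[i][(i - p + N) % N] = adj[(i - p + N) % N][i] = 1;
        }

    /* BFS from node 0 */
    for (int i = 0; i < N; i++) dist[i] = -1;
    int head = 0, tail = 0;
    dist[0] = 0;
    fifo[tail++] = 0;
    while (head < tail) {
        int u = fifo[head++];
        for (int v = 0; v < N; v++)
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                fifo[tail++] = v;
            }
    }

    int ecc = 0;
    double sum = 0.0;
    for (int i = 1; i < N; i++) {
        if (dist[i] > ecc) ecc = dist[i];
        sum += dist[i];
    }
    printf("n = %d\n", N);
    printf("eccentricity of node 0 (= diameter, by symmetry): %d  vs ceil(log2(n)/2) = %d\n",
           ecc, (int)ceil(log2((double)N) / 2.0));
    printf("average distance: %.2f  vs log2(n)/3 = %.2f\n",
           sum / (N - 1), log2((double)N) / 3.0);
    return 0;
}
```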

Page 6: Dealing with the scale problem

Routing Cost

• Because of the counter-clockwise links, the routing problem is NP-complete

• Good approximations exist, with a maximum overhead of 1 hop

o Broadcast: optimal in number of steps, log2(n) (binomial tree from each node)

o Allgather: Bruck algorithm, log2(n) steps (see the sketch below)

o At step s:

o Node i sends data to node i-2^s

o Node i receives data from node i+2^s

Runtime Scalability
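A minimal C sketch of the step pattern described above (an editor's illustration of the Bruck-style exchange schedule, not the slides' code): it prints, for every step s, which partner each node sends to and receives from.

```c
#include <stdio.h>

/* Bruck-style allgather step pattern on n nodes:
 * at step s, node i sends to (i - 2^s) mod n and receives from (i + 2^s) mod n. */
int main(void)
{
    int n = 12;                                   /* example node count */
    for (int s = 0, p = 1; p < n; s++, p <<= 1) {
        printf("step %d:\n", s);
        for (int i = 0; i < n; i++)
            printf("  node %2d sends to %2d, receives from %2d\n",
                   i, (i - p + n) % n, (i + p) % n);
    }
    return 0;
}
```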

Page 7: Dealing with the scale problem

Dynamic Environments


Self-Healing BMG

[Chart: # of nodes vs. # of added nodes and # of new connections]

Runtime Scalability

Page 8: Dealing with the scale problem

Runtime Scalability
Collective Communications
Fault Tolerance

Page 9: Dealing with the scale problem

Collective Communications

Optimization process

• Run-time selection process

• We use performance models, graphical encoding, and statistical learning techniques to build platform-specific, efficient and fast runtime decision functions
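As a rough illustration of what such a runtime decision function looks like, here is a hand-written stand-in (an editor's sketch: the enum names and thresholds are invented for illustration, whereas the real decision functions are generated from measured data).

```c
#include <stddef.h>

/* Hypothetical broadcast algorithms a decision function might choose among. */
typedef enum { BCAST_BINOMIAL, BCAST_SPLIT_BINARY, BCAST_PIPELINE } bcast_alg_t;

/* Hand-written stand-in for a generated decision function:
 * map (communicator size, message size) to a broadcast algorithm.
 * The thresholds below are made up for illustration only. */
bcast_alg_t choose_bcast(int comm_size, size_t msg_size)
{
    if (msg_size <= 8192)            /* small messages: latency-bound */
        return BCAST_BINOMIAL;
    if (comm_size <= 16)             /* few processes, large messages */
        return BCAST_PIPELINE;
    return BCAST_SPLIT_BINARY;       /* many processes, large messages */
}
```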

Page 10: Dealing with the scale problem

Model prediction

Collective Communications

Page 11: Dealing with the scale problem

Model prediction

Collective Communications

Page 12: Dealing with the scale problem

Tuning broadcast

64 Opterons with 1 Gb/s TCP

Collective Communications

Page 13: Dealing with the scale problem

Collective Communications

Tuning broadcast

Page 14: Dealing with the scale problem

Application Tuning

• Parallel Ocean Program (POP) on a Cray XT4

• Dominated by MPI_Allreduce of 3 doubles

• By default, Open MPI selects recursive doubling

• similar to Cray MPI (based on MPICH)

• Cray MPI has better latency

• i.e. POP using Open MPI is 10% slower on 256 processes

• Profile the system for this specific collective and determine that “reduce + bcast” is faster

• Replace the decision function

• New POP performance is about 5% faster than Cray MPI

Collective Communications
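For reference, the replacement amounts to an MPI_Reduce followed by an MPI_Bcast. Below is a minimal sketch of an allreduce of 3 doubles expressed that way (an editor's illustration of the idea, not the actual Open MPI decision-function code).

```c
#include <mpi.h>

/* Allreduce of 3 doubles expressed as reduce + bcast, the combination
 * measured to be faster than recursive doubling for this POP workload. */
static void allreduce_3doubles(double local[3], double global[3], MPI_Comm comm)
{
    MPI_Reduce(local, global, 3, MPI_DOUBLE, MPI_SUM, 0, comm);  /* accumulate on rank 0 */
    MPI_Bcast(global, 3, MPI_DOUBLE, 0, comm);                   /* redistribute the result */
}
```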

Page 15: Dealing with the scale problem

Runtime Scalability
Collective Communications
Fault Tolerance
  Fault Tolerant Linear Algebra
  Uncoordinated checkpoint
  Sequential debugging of parallel applications

Page 16: Dealing with the scale problem

Diskless Checkpointing

P1 P2 P3 P4: 4 available processors

Fault Tolerance

Page 17: Dealing with the scale problem

Diskless Checkpointing

P1 P2 P3 P4: 4 available processors

Add a fifth processor and perform a checkpoint (Allreduce): P5 = P1 + P2 + P3 + P4

Fault Tolerance

Page 18: Dealing with the scale problem

Diskless Checkpointing

P1 P2 P3 P4: 4 available processors

Add a fifth processor and perform a checkpoint (Allreduce): P5 = P1 + P2 + P3 + P4

Ready to continue: P1 P2 P3 P4, with P5 holding the checkpoint

Fault Tolerance

Page 19: Dealing with the scale problem

Diskless Checkpointing

P1 P2 P3 P4: 4 available processors

Add a fifth processor and perform a checkpoint (Allreduce): P5 = P1 + P2 + P3 + P4

Ready to continue: P1 P2 P3 P4, with P5 holding the checkpoint

…

Failure: one of the processors is lost

Fault Tolerance

Page 20: Dealing with the scale problem

Fault Tolerance

Diskless Checkpointing

P1 P2 P3 P4: 4 available processors

Add a fifth processor and perform a checkpoint (Allreduce): Pc = P1 + P2 + P3 + P4

Ready to continue: P1 P2 P3 P4, with Pc holding the checkpoint

…

Failure: P2 is lost

Ready for recovery: P1 P3 P4 and Pc remain

Recover the processor/data: P2 = Pc - P1 - P3 - P4

Page 21: Dealing with the scale problem

Diskless Checkpointing

• How to checkpoint?

• Either floating-point arithmetic or binary arithmetic will work

• If checkpoints are performed in floating-point arithmetic, then we can exploit the linearity of the mathematical relations on the object to maintain the checksums (sketched below)

• How to support multiple failures?

• Reed-Solomon algorithm

• Supporting p failures requires p additional processors (resources)

Fault Tolerance
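A minimal sketch of the floating-point checksum idea shown on the previous slides (an editor's illustration with made-up data, not the slides' code): the checkpoint processor holds Pc = P1 + P2 + P3 + P4, and a lost dataset is recovered by subtracting the surviving ones from Pc.

```c
#include <stdio.h>

#define NPROC 4   /* application processors */
#define LEN   4   /* words of data per processor (tiny, for illustration) */

int main(void)
{
    /* made-up local data held by P1..P4 */
    double P[NPROC][LEN] = {
        {1.0, 2.0, 3.0, 4.0},
        {5.0, 6.0, 7.0, 8.0},
        {0.5, 1.5, 2.5, 3.5},
        {9.0, 8.0, 7.0, 6.0},
    };
    double Pc[LEN] = {0};      /* checksum held by the checkpoint processor */

    /* checkpoint (what the Allreduce computes): Pc = P1 + P2 + P3 + P4 */
    for (int i = 0; i < NPROC; i++)
        for (int j = 0; j < LEN; j++)
            Pc[j] += P[i][j];

    /* simulate the failure of P2 (index 1), then recover its data:
     * P2 = Pc - P1 - P3 - P4 */
    int lost = 1;
    double recovered[LEN];
    for (int j = 0; j < LEN; j++) {
        recovered[j] = Pc[j];
        for (int i = 0; i < NPROC; i++)
            if (i != lost)
                recovered[j] -= P[i][j];
    }

    for (int j = 0; j < LEN; j++)
        printf("recovered[%d] = %g (original %g)\n", j, recovered[j], P[lost][j]);
    return 0;
}
```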

Page 22: Dealing with the scale problem

PCG

• Fault-tolerant CG

• 64 x 2 AMD64 processors connected using GigE

Performance of PCG with different MPI libraries

For checkpointing, we generate one checkpoint every 2000 iterations

Fault Tolerance

Page 23: Dealing with the scale problem

PCG

Checkpoint overhead in seconds

Fault Tolerance

Page 24: Dealing with the scale problem

Fault Tolerance

Uncoordinated checkpoint

Fault Tolerance

Page 25: Dealing with the scale problem

Detailing event types to avoid intrusiveness

• Promiscuous receptions (i.e. ANY_SOURCE)

• Non-blocking delivery tests (i.e. WaitSome, Test, TestAny, Iprobe, …)


Fault Tolerance
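To illustrate why these events matter, here is a minimal C sketch (an editor's illustration; record_event is a hypothetical hook standing in for the message-logging protocol): the only thing that must be logged for a wildcard receive is its outcome, i.e. which source and tag actually matched.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical logging hook standing in for the pessimistic logger. */
static void record_event(int source, int tag)
{
    fprintf(stderr, "logged match: source=%d tag=%d\n", source, tag);
}

/* A promiscuous (MPI_ANY_SOURCE) receive is non-deterministic: two runs may
 * match messages from different senders. Logging the actual match outcome
 * (status.MPI_SOURCE, status.MPI_TAG) is enough to replay it deterministically. */
static void logged_any_source_recv(void *buf, int count, MPI_Datatype type,
                                   int tag, MPI_Comm comm)
{
    MPI_Status status;
    MPI_Recv(buf, count, type, MPI_ANY_SOURCE, tag, comm, &status);
    record_event(status.MPI_SOURCE, status.MPI_TAG);
}
```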

Page 26: Dealing with the scale problem

Interposition in Open MPI

• We want to avoid tainting the base code with #ifdef FT_BLALA

• The Vampire PML loads a new class of MCA components

• Vprotocols provide the entire FT protocol (only pessimistic for now)

• You too can use the ability to define subframeworks in your components! :)

• Keep using the optimized low-level and zero-copy devices (BTL) for communication

• Unchanged message scheduling logic

Fault Tolerance

Page 27: Dealing with the scale problem

Performance Overhead

• Myrinet 2G (mx 1.1.5) - Opteron 146x2 - 2 GB RAM - Linux 2.6.18 - gcc/gfortran 4.1.2 - NPB3.2 - NetPIPE

• Only two application kernels show non-deterministic events (MG, LU)

[Chart: Performance Overhead of Improved Pessimistic Message Logging, NetPIPE on Myrinet 2000; % of plain Open MPI performance vs. message size (bytes); series: Open MPI-V no Sender-Based, Open MPI-V with Sender-Based]


Fault Tolerance

Page 28: Dealing with the scale problem

Fault Tolerance

Sequential debugging of parallel applications

Fault Tolerance

Page 29: Dealing with the scale problem

Debugging Applications

• The usual scenario involves

• The programmer designs a test suite

• The test suite exposes a bug

• The programmer runs the application in a debugger (such as gdb) until the bug is understood and fixed

• The programmer runs the test suite again

• Cyclic debugging

[Diagram: the cyclic debugging loop between design and code, testing, and interactive debugging, ending in release]

Fault Tolerance

Page 30: Dealing with the scale problem

Fault Tolerance

Interposition in Open MPI

• Events are stored (asynchronously, on disk) during the initial run

• Keep using the optimized low-level and zero-copy devices (BTL) for communication

• Unchanged message scheduling logic

• We expect low impact on application behavior

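Complementing the logging sketch given earlier, here is a replay-side sketch (again an editor's illustration; next_logged_event is a hypothetical helper): during re-execution the logged outcome is used to turn the wildcard receive back into a deterministic, named receive.

```c
#include <mpi.h>

/* Hypothetical helper standing in for reading the event log written during
 * the initial run; here it just hands back a hard-coded example outcome. */
static void next_logged_event(int *source, int *tag)
{
    *source = 3;    /* in a real replay these values come from the stored log */
    *tag    = 42;
}

/* During replay the wildcard receive is replaced by a named receive, so the
 * application observes exactly the same message matching as in the logged run. */
static void replayed_any_source_recv(void *buf, int count, MPI_Datatype type,
                                     MPI_Comm comm)
{
    int source, tag;
    next_logged_event(&source, &tag);
    MPI_Recv(buf, count, type, source, tag, comm, MPI_STATUS_IGNORE);
}
```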

Page 31: Dealing with the scale problem

Fault Tolerance

Performance Overhead

• Myrinet 2G (mx 1.0.3) - Opteron 146x2 - 2 GB RAM - Linux 2.6.18 - gcc/gfortran 4.1.2 - NPB3.2 - NetPIPE

• 2% overhead on bare latency, no overhead on bandwidth

• Only two application kernels show non-deterministic events

• Receiver-based logging has more overhead; moreover, it logs a large amount of data (350 MB on a simple ping-pong), which is why it is beneficial that it is only enabled on demand


Page 32: Dealing with the scale problem

Log Size on NAS Parallel Benchmarks

o Among the 7 NAS kernels, only 2 generate non-deterministic events

o Log size does not correlate with the number of processes

o The more scalable the application, the more scalable the log mechanism

o Only 287 KB of log per process for LU.C.1024 (200 MB memory footprint per process)


[Chart: log size per process on NAS Parallel Benchmarks (kB)]

Fault Tolerance