Supercomputer Design Through Simulation

Cray User Group (CUG) Meeting, Lugano, Switzerland

Rolf Riesen, [email protected]

Sandia National Laboratories

May 9, 2006




Talk Overview

1 Goal

2 Design

3 Usage

4 Validation

5 Experiments

6 Comparison to Other Work

7 Future Work

8 Summary




Goal

Simulate a supercomputer (e.g., Red Storm) using federated discrete event simulators

With enough fidelity to make future purchase and design decisions concerning things like:

CPU choice

Memory size and speed

Network interface

Topology

Application behavior

Research directions

etc.

Created initial prototype with promising attributes

This talk describes the simulator

Collective results on Thursday




Node (Application)

Hybrid simulator:

App runs regularly and uses MPI to exchange data

Each MPI send and receive generates an event to the network simulator

The simulator generates receive events that are matched by clients

An algorithm determines when and how to update virtual time on each node

Use MPI wrappers and the MPI profiling interface

The current network simulator uses a simple model:

∆ = s/B + L

where ∆ is the network delay, s the message size, B the network bandwidth, and L the network latency
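As a concrete reading of the model, a minimal C sketch; the function name and units are mine, not the simulator's:

```c
/* Simple network model from the slide: delay = s/B + L.
 * Units are up to the caller; here s in bytes, B in bytes per
 * second, L in seconds. */
static double network_delay(double s, double B, double L)
{
    return s / B + L;
}
```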




MPI Wrapper Library

int MPI_Send(void *data, int len, MPI_Datatype dt, int dest,
             int tag, MPI_Comm comm)
{
    tx = get_vtime();

    // Send the MPI message
    rc = PMPI_Send(data, len, dt, dest, tag, comm);

    // Send event to simulator
    event_send(tx, len, dt, dest, tag);

    return rc;
}


MPI Wrapper Library

int MPI_Recv(void *data, int len, MPI_Datatype dt, int src,
             int tag, MPI_Comm comm, MPI_Status *stat)
{
    t1 = get_vtime();

    // Receive the MPI message
    rc = PMPI_Recv(data, len, dt, src, tag, comm, stat);

    // Wait for the matching event
    event_wait(&tx, &∆, stat->MPI_TAG, stat->MPI_SOURCE);

    if (tx + ∆ > t1)
        t3 = tx + ∆;
    else
        t3 = t1;

    set_vtime(t3);    // Adjust virtual time
    return rc;
}


MPI Wrapper Library

[Figure: Event traffic for collectives]


MPI Communicators

The simulator framework sets up a communicator for application nodes only

MPI_COMM_WORLD covers application and simulator

Wrappers swap MPI_COMM_WORLD with the internal communicator when the application calls MPI

The application never sees the real MPI_COMM_WORLD
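A sketch of the swap, with stand-in types in place of MPI_Comm and MPI_COMM_WORLD from <mpi.h> so the idea is visible on its own; the names are illustrative, not from the framework:

```c
/* Stand-ins for illustration; the real wrapper uses MPI_Comm,
 * MPI_COMM_WORLD, and the communicator the framework created. */
typedef int Comm;
enum { COMM_WORLD = 0, APP_COMM = 1 };

/* Every wrapper routes application calls away from the real
 * world communicator, which also contains the simulator ranks. */
static Comm swap_comm(Comm comm)
{
    return (comm == COMM_WORLD) ? APP_COMM : comm;
}
```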


Virtual Time

if (tx + ∆ > t1)
    t3 = tx + ∆;
else
    t3 = t1;

set_vtime(t3);

If the message was sent earlier than we started looking for it, we have to assume it was already here

Just “erase” the time we spent actually receiving it

If the message arrived after we started waiting for it, use the virtual send time + ∆ to set the local virtual clock
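The rule condenses to a max(); a standalone C rendering, with assumed names:

```c
/* t1: virtual time when the receive started; tx: virtual send
 * time; d: simulated network delay.  The new local virtual time
 * is the later of "message was already here" and "message
 * arrived while we waited". */
static double vtime_after_recv(double tx, double d, double t1)
{
    double arrival = tx + d;
    return (arrival > t1) ? arrival : t1;
}
```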




Linking

Currently need to rename main() to main_node()

Should not be necessary when we use MPI_Init()

In Fortran, program has to be changed to subroutine main_node

No other changes to the application are necessary!
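What the rename looks like in practice; a hedged sketch (the framework's actual entry-point handling may differ):

```c
/* The application's former main(), renamed; its body is
 * untouched.  The framework links in its own main(), which
 * sets up the simulator and then calls main_node(). */
int main_node(int argc, char **argv)
{
    (void)argc;
    (void)argv;
    /* ... unchanged application code ... */
    return 0;
}
```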




Usage

Two steps:

Create point-to-point model

Create collective model

Measure the two-node latency curve and write a function to model it

Measure all-to-all performance and write model
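A hypothetical fitted function of the kind step one produces; the two-regime shape and all constants below are invented for illustration, not measured Red Storm values:

```c
/* Two-regime point-to-point model: a short-message latency and a
 * higher long-message cost (e.g., a protocol switch), plus
 * bytes/bandwidth.  All constants are assumptions. */
static double ptp_time(double bytes)
{
    const double L_short = 5.0e-6;    /* seconds */
    const double L_long  = 20.0e-6;   /* seconds */
    const double B       = 2.0e9;     /* bytes per second */

    double latency = (bytes < 8192.0) ? L_short : L_long;
    return latency + bytes / B;
}
```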


Usage

[Figure: Bandwidth on Red Storm, Apr. 14. Model vs. three runs (13:54:33, 14:03:01, 14:16:29); bandwidth in MB/s vs. message size, 32 kB to 192 kB]


Usage

[Figure: Time vs. number of ints exchanged (16 k to 128 k) for 4, 16, and 64 nodes, with model; times up to 0.09 ms]




Validation

[Figure: NAS Class A run times in seconds, real vs. simulation, for BT, CG, EP, FT, IS, LU, MG, and SP on 16 and 64 nodes]




Communication Patterns

MG (class B) message density distribution

[Figure: number of messages by source node and destination node, nodes 0 to 60]


Communication Patterns

BT (class A) data density distribution

[Figure: megabytes transferred by source node and destination node, nodes 0 to 14]


Varying Bandwidth and Latency

The simulator can change bandwidth and latency independently

This can be used to evaluate application performance under varying network characteristics

→ predict impact of a new network
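A sketch of such a sweep over the simple delay model; the scale-factor framing is my own, not the simulator's interface:

```c
/* Evaluate the delay model with bandwidth and latency scaled
 * independently, as the simulator permits. */
static double delay(double s, double B, double L)
{
    return s / B + L;
}

static double scaled_delay(double s, double B, double L,
                           double b_factor, double l_factor)
{
    return delay(s, B * b_factor, L * l_factor);
}
```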




Zero-Cost Collectives

Putting collectives into the NIC, building a specialized NIC, or optimizing them is interesting

How much application performance can be gained is not clear

The simulator can assign ∆ = 0 to collectives and leave point-to-point alone

See talk on Thursday
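One way to express the experiment; the is_collective flag is an assumption about how events might be marked, not the framework's actual representation:

```c
/* Delay assignment for the zero-cost-collectives experiment:
 * collective events get delta = 0, point-to-point events keep
 * the normal s/B + L model. */
static double event_delay(int is_collective, double s, double B, double L)
{
    return is_collective ? 0.0 : s / B + L;
}
```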


Intrusion-Free MPI Traces

So far we have gathered only limited amounts of data

The simulator can gather, and save to disk, large amounts of data

Without changing the application's virtual time
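A sketch of what one logged trace line could look like; the field set and format are assumptions, not the framework's actual trace format:

```c
#include <stdio.h>

/* Write one trace record per event: virtual send time, source
 * and destination ranks, tag, and message length in bytes.
 * Returns the formatted length, as snprintf does. */
static int format_trace(char *buf, size_t n, double vtime,
                        int src, int dest, int tag, int len)
{
    return snprintf(buf, n, "%.6f %d %d %d %d",
                    vtime, src, dest, tag, len);
}
```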




Comparison to Other Work

This approach seems to be new

Combines low-intrusion measurement research with discrete event simulation

Needs more validation, but seems to be very accurate

Opens up many different and simple ways of evaluating applications and research directions


Comparison to Other Work

No instrumentation code inserted into the app

Renaming main() (program in Fortran) is the only change to the app

No disturbance of the (virtual) runtime of the app, independent of the amount of data collected

No extra memory needed on compute nodes to store trace data

Language independent (Fortran, Fortran 90 with MPI-2,and C)




Future Work

Continuing Work

Need to incorporate more accurate network model

This will allow simulation of congestion, and evaluation of topology choices, node allocation, etc.

Move below MPI into the NIC for more fine-grained simulation

Incorporate non-network simulators, e.g., CPU and NIC simulators




Summary

Novel tool to collect MPI data

Language independent

Only linking with the application is needed

Virtual runtime of application is not changed

Lots of future possibilities


End

Questions?