
Supercomputing 2005 1

Cross-Platform Performance Prediction Using Partial Execution

Leo T. Yang

Xiaosong Ma*

Frank Mueller

Department of Computer Science

Center for High Performance Simulations (CHiPS)

North Carolina State University

(* Joint Faculty with Oak Ridge National Laboratory)

Supercomputing 2005 2

Presentation Roadmap

Introduction
Model and approach
Performance results
Conclusion and future work

Supercomputing 2005 3

Cross-Platform Performance Prediction

Users face a wide selection of machines
Need cross-platform performance prediction to:
  Choose a platform to use or purchase
  Estimate resource usage
  Estimate job wall time
Machines and applications both grow larger and more complex
  Modeling- and simulation-based approaches become harder and more expensive
Performance data is not reused in performance prediction

Supercomputing 2005 4

Observation-based Performance Prediction

Observe cross-platform behavior
  Treat applications and platforms as black boxes
  Avoid case-by-case model building
  Cover the entire application: computation, communication, I/O
  Convenient with third-party libraries
Performance translation
  Observation: existence of a "reference platform"
  Goal: cross-platform meta-predictor
  Approach: based on relative performance

(Illustration: a full run takes T = 20 hrs on the reference platform; the run time T = ? hrs on the target platform is to be predicted.)

Supercomputing 2005 5

Presentation Roadmap

Introduction
Model and approach
Performance results
Conclusion and future work

Supercomputing 2005 6

Main Idea: Utilizing Partial Execution

Observation: the majority of scientific applications are iteration-based
  Highly repetitive behavior: phases -> timesteps
Execute small partial executions
  Low-cost "test drives"
  Simple API (indicate the number of timesteps k); quit after k timesteps

(Diagram: Full-1 and Partial-1 runs on the reference system, Partial-2 on the target system; with relative performance = 0.6, the Full-2 run time on the target is predicted.)
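A worked reading of this diagram (my own arithmetic, not stated on the slide): taking relative performance = 0.6 as the target-to-reference ratio of per-timestep time, and borrowing the 20-hour reference run from the earlier illustration,

    Full-2 (predicted) ≈ 0.6 × 20 hrs = 12 hrs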

Supercomputing 2005 7

Application Model

Execution of parallel simulations modeled as the regular expression I(C*[W])*F
  I: one-time initialization phase
  C: computation phase
  W: optional I/O phase
  F: one-time finalization phase
Different phases likely have different cross-platform relative performance
Major challenges
  Avoid the impact of initially unstable performance
  Predict the correct mixture of C and W phases
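As an illustration of this model, a minimal sketch in C of a simulation whose structure matches I(C*[W])*F (all function names here are hypothetical stand-ins, not taken from the actual benchmark codes):

    /* Hypothetical phase functions; names are illustrative stubs. */
    void initialize(void)       { /* read input, allocate data, ...    */ }
    void compute_timestep(void) { /* computation and communication     */ }
    void write_output(void)     { /* periodic I/O                      */ }
    void finalize(void)         { /* write final results, clean up     */ }

    /* Skeleton matching the regular expression I(C*[W])*F */
    void simulate(int nsteps, int io_interval)
    {
        initialize();                        /* I: one-time initialization */
        for (int t = 1; t <= nsteps; t++) {
            compute_timestep();              /* C: computation phase       */
            if (t % io_interval == 0)
                write_output();              /* W: optional periodic I/O   */
        }
        finalize();                          /* F: one-time finalization   */
    }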

Supercomputing 2005 8

Partial Execution

Terminate applications prematurely
API
  init_timestep(): optional, useful with a large setup phase
  begin_timestep() / end_timestep(maxsteps)
    "begin" and "end" calls bracket the C or CW phase
    Execution is terminated after maxsteps timesteps
Easy-to-use interface: 2-3 lines of code inserted into the source (see the sketch below)
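A sketch of instrumenting a typical timestep loop with this API. The call names come from the slide; the prototypes, argument types, and surrounding application code are my assumptions:

    /* Partial-execution API named on the slide (prototypes are assumed). */
    void init_timestep(void);
    void begin_timestep(void);
    void end_timestep(int maxsteps);

    /* Hypothetical application phases, as in the earlier sketch. */
    void initialize(void);
    void compute_timestep(void);
    void write_output(void);
    void finalize(void);

    /* Timestep loop instrumented for partial execution. */
    void simulate_partial(int nsteps, int io_interval, int maxsteps)
    {
        initialize();
        init_timestep();                /* optional: useful with a large setup phase   */
        for (int t = 1; t <= nsteps; t++) {
            begin_timestep();           /* opens the C (or CW) phase of this timestep  */
            compute_timestep();
            if (t % io_interval == 0)
                write_output();
            end_timestep(maxsteps);     /* closes the phase; quits after maxsteps steps */
        }
        finalize();
    }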

Supercomputing 2005 9

Base Prediction Model

Given a reference platform and a target platform:
  Perform 1 or more partial executions
  Compute the average per-timestep execution time on both platforms
  Compute the relative performance
  Compute the overall execution time estimate for the target platform
Prediction performance measured as the predicted-to-actual ratio
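The slide lists the steps without explicit formulas; a minimal sketch of one plausible reading of the base model, in C (function and parameter names are mine):

    /* Average per-timestep times over k partial-execution timesteps on both
       platforms, form the relative performance, and scale the known full-run
       time on the reference platform to estimate the target run time. */
    double predict_target_time(const double *ref_step_times,
                               const double *tgt_step_times,
                               int k, double ref_full_time)
    {
        double ref_avg = 0.0, tgt_avg = 0.0;
        for (int i = 0; i < k; i++) {
            ref_avg += ref_step_times[i];
            tgt_avg += tgt_step_times[i];
        }
        ref_avg /= k;
        tgt_avg /= k;

        double relative_performance = tgt_avg / ref_avg;  /* target vs. reference  */
        return relative_performance * ref_full_time;      /* estimated target time */
    }

Prediction performance, the predicted-to-actual ratio above, would then be this estimate divided by the measured full-run time on the target.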

Supercomputing 2005 10

Refined Prediction Model

Problem 1: initial performance fluctuations
  Variance due to cache warm-up, etc.
  May span dozens of timesteps
Problem 2: periodic I/O phases
  I/O frequency often configurable and determined at run time
Unified solution: monitor per-timestep performance variance at runtime
  Identify anomalies and repeated patterns
  Filter out early, unstable timestep measurements
    Consider only later results once performance stabilizes
    Combine early timestep overheads into the initialization cost
  Compute sliding-window averages of per-timestep overheads
    Use multiples of the observed pattern length as the window size (see the sketch below)
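A sketch of the sliding-window part of this refinement, in C; the parameter names and the exact windowing policy are my guesses, not the authors' implementation:

    /* Average the most recent `window` per-timestep times, after skipping the
       first `warmup` unstable timesteps; `window` is chosen as a multiple of
       the observed repeating pattern length (e.g., the I/O period). */
    double sliding_window_avg(const double *step_times, int n,
                              int warmup, int window)
    {
        int start = n - window;
        if (start < warmup)
            start = warmup;              /* filter out early, unstable timesteps */

        double sum = 0.0;
        int count = 0;
        for (int i = start; i < n; i++) {
            sum += step_times[i];
            count++;
        }
        return count > 0 ? sum / count : 0.0;
    }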

Supercomputing 2005 11

Presentation Roadmap

Introduction
Model and approach
Performance results
Conclusion and future work

Supercomputing 2005 12

Proof-of-concept experiments

Questions
  Is relative performance observed in a very short early period indicative of overall relative performance?
  Can we reuse partial execution data in predicting executions with different configurations?
Experiment settings
  Large-scale codes: 2 ASCI Purple benchmarks (sphot and sPPM), a fusion code (Gyro), a rocket simulation (GENx)
  Full runs take >5 hours
  10 supercomputers at SDSC, NCSA, ORNL, LLNL, UIUC, NCSU, and NERSC
  7 architectures (SP3, SP4, Altix, Cray X1, and 3 clusters: G5, Xeon, Itanium)

Supercomputing 2005 13

Base Model Accuracy (Sphot)

(Chart: prediction accuracy vs. number of timesteps, 1-25, with cumulative and per-step series for Datastar-655, Henry2, and RAM; accuracy stays within roughly 0.96-1.01.)

High accuracy with very short partial execution

Supercomputing 2005 14

Refined Model (sPPM, Ram->Henry2)

(Chart: prediction accuracy vs. timestep, 26-51, comparing the cumulative estimate with the sliding-window estimate; accuracy stays within roughly 0.92-1.04.)

• Issues:
  • Ram: initialization variance
  • Henry2: I/O on 1 in 10 timesteps
  (chart values normalized)
• Smarter algorithms
  • Initialization filter
  • Sliding window
  • Handle anomalies and periodic I/O

Supercomputing 2005 15

Application with Variable Problem Size

GENx rocket simulation (CSAR, UIUC), Turing -> Frost

(Chart: prediction accuracy and relative performance vs. timestep, 0-1500; accuracy ranges roughly 0.8-1.1, relative performance roughly 0.5-4.)

Limited accuracy w/ variable timesteps

Supercomputing 2005 16

Reusing Partial Execution Data

(Charts: Gyro relative performance on Phoenix, Ram, Seaborg, and TeraGrid. Left: vs. problem size (1: B1-std, 2: B2-cy, 3: B3-gtc). Right: vs. number of processors, up to ~600. Average prediction errors: 12.1%-25.8% and 5.6%-37.9% for the two experiments.)

Scientists often repeat runs with different configurations
  Number of processors
  Input size and data content
  Computation tasks
Results from the Gyro fusion simulation on 5 platforms

Supercomputing 2005 17

Presentation Roadmap

Introduction
Model and approach
Performance results
Conclusion and future work

Supercomputing 2005 18

Conclusion

Empirical performance prediction works!
  Real-world production codes
  Multiple parallel platforms
  Highly accurate predictions
  Limitations with variable problem sizes and input-size/processor scaling
Observation-based prediction
  Simple
  Portable
  Low cost (few timesteps)

(Illustration: example run times across platforms, T = 20 hrs, 2 hrs, 10 hrs, and 1 hr.)

Supercomputing 2005 19

Related Work

Parallel program performance prediction
  Application-specific analytical models
  Compiler/instrumentation tools
  Simulation-based predictions
Cross-platform performance studies
  Mostly examine multiple platforms individually
Grid job schedulers
  Do not offer cross-platform performance translation

Supercomputing 2005 20

Ongoing and Future Work

Evaluate with AMR applications
Automated partial execution
  Automatic computation phase identification
  Binary rewriting to avoid source code modification
Extend to non-dedicated systems
  For job schedulers