
Supercomputing 2005 1

Cross-Platform Performance Prediction Using Partial Execution

Leo T. Yang

Xiaosong Ma*

Frank Mueller

Department of Computer Science

Center for High Performance Simulations (CHiPS)

North Carolina State University

(* Joint Faculty with Oak Ridge National Laboratory)

Supercomputing 2005 2

Presentation Roadmap

Introduction
Model and approach
Performance results
Conclusion and future work

Supercomputing 2005 3

Cross-Platform Performance Prediction

Users face a wide selection of machines
Need cross-platform performance prediction to:
  Choose a platform to use or purchase
  Estimate resource usage
  Estimate job wall time
Machines and applications both grow larger and more complex
  Modeling- and simulation-based approaches become harder and more expensive
Performance data is not reused in performance prediction

Supercomputing 2005 4

Observation-based Performance Prediction

Observe cross-platform behavior
  Treat applications and platforms as black boxes
  Avoid case-by-case model building
  Cover the entire application: computation, communication, I/O
  Convenient with third-party libraries
Performance translation
  Observation: existence of a "reference platform"
  Goal: cross-platform meta-predictor
  Approach: based on relative performance

(Illustration: a full run takes T = 20 hrs on the reference platform; the run time T = ? hrs on the target platform is to be predicted.)

Supercomputing 2005 5

Presentation Roadmap

Introduction
Model and approach
Performance results
Conclusion and future work

Supercomputing 2005 6

Main Idea: Utilizing Partial Execution

Observation: the majority of scientific applications are iteration-based
  Highly repetitive behavior: phases -> timesteps
Execute small partial executions
  Low-cost "test drives"
  Simple API (indicate the number of timesteps k); quit after k timesteps

(Diagram: Full-1 and Partial-1 runs on the reference system, Partial-2 on the target system; with relative performance = 0.6, the Full-2 run time on the target is predicted.)
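A worked reading of this diagram (my own arithmetic, not stated on the slide): taking relative performance = 0.6 as the target-to-reference ratio of per-timestep time, and borrowing the 20-hour reference run from the earlier illustration,

    Full-2 (predicted) ≈ 0.6 × 20 hrs = 12 hrs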

Supercomputing 2005 7

Application Model

Execution of parallel simulations modeled as the regular expression I(C*[W])*F
  I: one-time initialization phase
  C: computation phase
  W: optional I/O phase
  F: one-time finalization phase
Different phases likely have different cross-platform relative performance
Major challenges
  Avoid the impact of initially unstable performance
  Predict the correct mixture of C and W phases
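As an illustration of this model, a minimal sketch in C of a simulation whose structure matches I(C*[W])*F (all function names here are hypothetical stand-ins, not taken from the actual benchmark codes):

    /* Hypothetical phase functions; names are illustrative stubs. */
    void initialize(void)       { /* read input, allocate data, ...    */ }
    void compute_timestep(void) { /* computation and communication     */ }
    void write_output(void)     { /* periodic I/O                      */ }
    void finalize(void)         { /* write final results, clean up     */ }

    /* Skeleton matching the regular expression I(C*[W])*F */
    void simulate(int nsteps, int io_interval)
    {
        initialize();                        /* I: one-time initialization */
        for (int t = 1; t <= nsteps; t++) {
            compute_timestep();              /* C: computation phase       */
            if (t % io_interval == 0)
                write_output();              /* W: optional periodic I/O   */
        }
        finalize();                          /* F: one-time finalization   */
    }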

Supercomputing 2005 8

Partial Execution

Terminate applications prematurely
API
  init_timestep(): optional, useful with a large setup phase
  begin_timestep() / end_timestep(maxsteps)
    "begin" and "end" calls bracket the C or CW phase
    Execution is terminated after maxsteps timesteps
Easy-to-use interface: 2-3 lines of code inserted into the source (see the sketch below)
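A sketch of instrumenting a typical timestep loop with this API. The call names come from the slide; the prototypes, argument types, and surrounding application code are my assumptions:

    /* Partial-execution API named on the slide (prototypes are assumed). */
    void init_timestep(void);
    void begin_timestep(void);
    void end_timestep(int maxsteps);

    /* Hypothetical application phases, as in the earlier sketch. */
    void initialize(void);
    void compute_timestep(void);
    void write_output(void);
    void finalize(void);

    /* Timestep loop instrumented for partial execution. */
    void simulate_partial(int nsteps, int io_interval, int maxsteps)
    {
        initialize();
        init_timestep();                /* optional: useful with a large setup phase   */
        for (int t = 1; t <= nsteps; t++) {
            begin_timestep();           /* opens the C (or CW) phase of this timestep  */
            compute_timestep();
            if (t % io_interval == 0)
                write_output();
            end_timestep(maxsteps);     /* closes the phase; quits after maxsteps steps */
        }
        finalize();
    }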

Supercomputing 2005 9

Base Prediction Model

Given a reference platform and a target platform:
  Perform 1 or more partial executions
  Compute the average per-timestep execution time on both platforms
  Compute the relative performance
  Compute the overall execution time estimate for the target platform
Prediction performance measured as the predicted-to-actual ratio
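The slide lists the steps without explicit formulas; a minimal sketch of one plausible reading of the base model, in C (function and parameter names are mine):

    /* Average per-timestep times over k partial-execution timesteps on both
       platforms, form the relative performance, and scale the known full-run
       time on the reference platform to estimate the target run time. */
    double predict_target_time(const double *ref_step_times,
                               const double *tgt_step_times,
                               int k, double ref_full_time)
    {
        double ref_avg = 0.0, tgt_avg = 0.0;
        for (int i = 0; i < k; i++) {
            ref_avg += ref_step_times[i];
            tgt_avg += tgt_step_times[i];
        }
        ref_avg /= k;
        tgt_avg /= k;

        double relative_performance = tgt_avg / ref_avg;  /* target vs. reference  */
        return relative_performance * ref_full_time;      /* estimated target time */
    }

Prediction performance, the predicted-to-actual ratio above, would then be this estimate divided by the measured full-run time on the target.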

Supercomputing 2005 10

Refined Prediction Model

Problem 1: initial performance fluctuations
  Variance due to cache warm-up, etc.
  May span dozens of timesteps
Problem 2: periodic I/O phases
  I/O frequency often configurable and determined at run time
Unified solution: monitor per-timestep performance variance at runtime
  Identify anomalies and repeated patterns
  Filter out early, unstable timestep measurements
    Consider only later results once performance stabilizes
    Combine early timestep overheads into the initialization cost
  Compute sliding-window averages of per-timestep overheads
    Use multiples of the observed pattern length as the window size (see the sketch below)
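A sketch of the sliding-window part of this refinement, in C; the parameter names and the exact windowing policy are my guesses, not the authors' implementation:

    /* Average the most recent `window` per-timestep times, after skipping the
       first `warmup` unstable timesteps; `window` is chosen as a multiple of
       the observed repeating pattern length (e.g., the I/O period). */
    double sliding_window_avg(const double *step_times, int n,
                              int warmup, int window)
    {
        int start = n - window;
        if (start < warmup)
            start = warmup;              /* filter out early, unstable timesteps */

        double sum = 0.0;
        int count = 0;
        for (int i = start; i < n; i++) {
            sum += step_times[i];
            count++;
        }
        return count > 0 ? sum / count : 0.0;
    }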

Supercomputing 2005 11

Presentation Roadmap

Introduction
Model and approach
Performance results
Conclusion and future work

Supercomputing 2005 12

Proof-of-concept experiments

Questions
  Is relative performance observed in a very short early period indicative of overall relative performance?
  Can we reuse partial execution data in predicting executions with different configurations?
Experiment settings
  Large-scale codes: 2 ASCI Purple benchmarks (sphot and sPPM), a fusion code (Gyro), a rocket simulation (GENx)
  Full runs take >5 hours
  10 supercomputers at SDSC, NCSA, ORNL, LLNL, UIUC, NCSU, and NERSC
  7 architectures (SP3, SP4, Altix, Cray X1, and 3 clusters: G5, Xeon, Itanium)

Supercomputing 2005 13

Base Model Accuracy (Sphot)

(Chart: prediction accuracy vs. number of timesteps, 1-25, with cumulative and per-step series for Datastar-655, Henry2, and RAM; accuracy stays within roughly 0.96-1.01.)

High accuracy with very short partial execution

Supercomputing 2005 14

Refined Model (sPPM, Ram->Henry2)

(Chart: prediction accuracy vs. timestep, 26-51, comparing the cumulative estimate with the sliding-window estimate; accuracy stays within roughly 0.92-1.04.)

• Issues:
  • Ram: initialization variance
  • Henry2: I/O on 1 in 10 timesteps
  (chart values normalized)
• Smarter algorithms
  • Initialization filter
  • Sliding window
  • Handle anomalies and periodic I/O

Supercomputing 2005 15

Application with Variable Problem Size

GENx rocket simulation (CSAR, UIUC), Turing -> Frost

(Chart: prediction accuracy and relative performance vs. timestep, 0-1500; accuracy ranges roughly 0.8-1.1, relative performance roughly 0.5-4.)

Limited accuracy w/ variable timesteps

Supercomputing 2005 16

Reusing Partial Execution Data

(Charts: Gyro relative performance on Phoenix, Ram, Seaborg, and TeraGrid. Left: vs. problem size (1: B1-std, 2: B2-cy, 3: B3-gtc). Right: vs. number of processors, up to ~600. Average prediction errors: 12.1%-25.8% and 5.6%-37.9% for the two experiments.)

Scientists often repeat runs with different configurations
  Number of processors
  Input size and data content
  Computation tasks
Results from the Gyro fusion simulation on 5 platforms

Supercomputing 2005 17

Presentation Roadmap

Introduction
Model and approach
Performance results
Conclusion and future work

Supercomputing 2005 18

Conclusion

Empirical performance prediction works!
  Real-world production codes
  Multiple parallel platforms
  Highly accurate predictions
  Limitations with variable problem sizes and input-size/processor scaling
Observation-based prediction
  Simple
  Portable
  Low cost (few timesteps)

(Illustration: example run times across platforms, T = 20 hrs, 2 hrs, 10 hrs, and 1 hr.)

Supercomputing 2005 19

Related Work

Parallel program performance prediction
  Application-specific analytical models
  Compiler/instrumentation tools
  Simulation-based predictions
Cross-platform performance studies
  Mostly examine multiple platforms individually
Grid job schedulers
  Do not offer cross-platform performance translation

Supercomputing 2005 20

Ongoing and Future Work

Evaluate with AMR applications
Automated partial execution
  Automatic computation phase identification
  Binary rewriting to avoid source code modification
Extend to non-dedicated systems
  For job schedulers