TRANSCRIPT
Supercomputing 2005 1
Cross-Platform Performance Prediction Using Partial Execution
Leo T. Yang
Xiaosong Ma*
Frank Mueller
Department of Computer Science
Center for High Performance Simulations (CHiPS)
North Carolina State University
(* Joint Faculty with Oak Ridge National Laboratory)
Supercomputing 2005 2
Presentation Roadmap
- Introduction
- Model and approach
- Performance results
- Conclusion and future work
Supercomputing 2005 3
Cross-Platform Performance Prediction
- Users face a wide selection of machines
- Cross-platform performance prediction is needed to:
  - Choose a platform to use or purchase
  - Estimate resource usage
  - Estimate job wall time
- Machines and applications both grow larger and more complex
  - Modeling- and simulation-based approaches become harder and more expensive
- Performance data is not reused in performance prediction
Supercomputing 2005 4
Observation-based Performance Prediction
- Observe cross-platform behavior
  - Treat applications and platforms as black boxes
  - Avoid case-by-case model building
  - Cover the entire application: computation, communication, I/O
  - Convenient with third-party libraries
- Performance translation
  - Observation: existence of a "reference platform"
  - Goal: a cross-platform meta-predictor
  - Approach: based on relative performance
[Figure: a job takes T = 20 hrs on the reference platform; T = ? hrs on the target platform]
Supercomputing 2005 5
Presentation Roadmap
- Introduction
- Model and approach
- Performance results
- Conclusion and future work
Supercomputing 2005 6
Main Idea: Utilizing Partial Execution
- Observation: the majority of scientific applications are iteration-based
  - Highly repetitive behavior: phases -> timesteps
- Execute small partial executions
  - Low-cost "test drives"
  - Simple API (indicate the number of timesteps k); quit after k timesteps
[Figure: Full-1 and Partial-1 run on the reference system, Partial-2 on the target system; the observed relative performance of 0.6 is used to predict Full-2 on the target system]
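To make the figure concrete (the arithmetic below is mine; the slide supplies only the 20-hr Full-1 run and the 0.6 ratio, and the direction of the ratio is an assumption): if the partial runs show the target system taking 0.6 times as long per timestep as the reference system, the predicted Full-2 time is 0.6 × 20 hrs = 12 hrs.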
Supercomputing 2005 7
Application Model
- Execution of parallel simulations is modeled as the regular expression I(C*[W])*F
  - I: one-time initialization phase
  - C: computation phase
  - W: optional I/O phase
  - F: one-time finalization phase
- Different phases likely have different cross-platform relative performance
- Major challenges:
  - Avoid the impact of initially unstable performance
  - Predict the correct mixture of C and W phases (illustrated below)
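As an illustration (my example, not from the slides): a simulation that writes output after every second computation step unfolds as I C C W C C W ... F, so the per-timestep time a short observation captures must reflect the same C-to-W mixture that the full run will have.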
Supercomputing 2005 8
Partial Execution
- Terminate applications prematurely
- API (usage sketch below):
  - init_timestep(): optional, useful with a large setup phase
  - begin_timestep()
  - end_timestep(maxsteps)
- The "begin" and "end" calls bracket a C or CW phase
- Execution is terminated after maxsteps timesteps
- Easy-to-use interface: 2-3 lines of code inserted into the source code
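A minimal usage sketch in C, assuming a hypothetical host application (sim_init, sim_step, sim_finalize, and NSTEPS are illustrative stand-ins; only init_timestep, begin_timestep, and end_timestep are the API from this slide):

    /* Hypothetical application hooks; only the three timestep calls
     * belong to the partial-execution API described in the talk. */
    void sim_init(void);
    void sim_step(int step);
    void sim_finalize(void);
    void init_timestep(void);
    void begin_timestep(void);
    void end_timestep(int maxsteps);

    #define NSTEPS 1000

    int main(void)
    {
        sim_init();            /* I phase: one-time setup */
        init_timestep();       /* optional: marks the end of a large setup */

        for (int step = 0; step < NSTEPS; step++) {
            begin_timestep();  /* opens a C (or CW) phase */
            sim_step(step);    /* computation, possibly with periodic I/O */
            end_timestep(10);  /* partial run: terminates after 10 timesteps */
        }

        sim_finalize();        /* F phase: reached only in full runs */
        return 0;
    }

Presumably a full production run keeps the same instrumentation in place and simply passes a maxsteps value at or beyond the configured number of timesteps.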
Supercomputing 2005 9
Base Prediction Model
- Given a reference platform and a target platform:
  1. Perform one or more partial executions
  2. Compute the average execution time per timestep on both platforms
  3. Compute the relative performance
  4. Compute the overall execution time estimate for the target platform
- Prediction performance is measured as the predicted-to-actual ratio (see the sketch below)
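In code form, the base model might look like the following C sketch (function and variable names are illustrative, not from the talk):

    /* Base prediction model: an illustrative sketch, not the authors' code. */
    double relative_performance(double avg_step_ref, double avg_step_tgt)
    {
        /* Relative performance as the ratio of average per-timestep
         * execution times, target over reference. */
        return avg_step_tgt / avg_step_ref;
    }

    double predict_full_time(double full_time_ref, double rel_perf)
    {
        /* Translate a full-run time on the reference platform
         * into an estimate for the target platform. */
        return full_time_ref * rel_perf;
    }

    double prediction_accuracy(double predicted, double actual)
    {
        /* Prediction performance: the predicted-to-actual ratio,
         * where values near 1.0 mean an accurate prediction. */
        return predicted / actual;
    }

The accuracy metric here is the one the results slides plot: a value of 1.0 means the predicted and actual times coincide.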
Supercomputing 2005 10
Refined Prediction Model
- Problem 1: initial performance fluctuations
  - Variance due to cache warm-up, etc.; may span dozens of timesteps
- Problem 2: periodic I/O phases
  - I/O frequency is often configurable and determined at run time
- Unified solution: monitor per-timestep performance variance at runtime
  - Identify anomalies and repeated patterns
  - Filter out early, unstable timestep measurements; consider only later results once performance stabilizes
  - Fold early timestep overheads into the initialization cost
  - Compute sliding-window averages of per-timestep overheads, using multiples of the observed pattern length as the window size (sketch below)
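A sketch of the sliding-window averaging in C; the warm-up cutoff, pattern length, and window multiple are assumed constants here, whereas the approach described above detects them at runtime:

    /* Refined model sketch: average per-timestep time over a sliding
     * window sized as a multiple of the observed repetition pattern.
     * SKIP, PATTERN, and WINDOW_MULT are illustrative assumptions. */
    #define SKIP        10   /* early, unstable timesteps to filter out */
    #define PATTERN     10   /* e.g., one I/O phase every 10 timesteps  */
    #define WINDOW_MULT  2   /* window size in units of the pattern     */

    double stable_step_time(const double *step_times, int n)
    {
        int win = WINDOW_MULT * PATTERN;   /* spans whole C/CW repetitions */
        if (n < SKIP + win)
            return -1.0;                   /* performance not yet stable  */

        double sum = 0.0;
        for (int i = n - win; i < n; i++)  /* latest full window only     */
            sum += step_times[i];
        return sum / win;                  /* includes periodic I/O share */
    }

Because the window spans whole repetitions of the observed pattern, the average reflects the correct mixture of C and W phases; the timesteps before the cutoff can be folded into the one-time initialization cost instead.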
Supercomputing 2005 11
Presentation Roadmap
- Introduction
- Model and approach
- Performance results
- Conclusion and future work
Supercomputing 2005 12
Proof-of-Concept Experiments
- Questions:
  - Is relative performance observed in a very short early period indicative of overall relative performance?
  - Can we reuse partial-execution data to predict executions with different configurations?
- Experiment settings:
  - Large-scale codes: two ASCI Purple benchmarks (sphot and sPPM), a fusion code (Gyro), and a rocket simulation (GENx); full runs take >5 hours
  - 10 supercomputers at SDSC, NCSA, ORNL, LLNL, UIUC, NCSU, and NERSC
  - 7 architectures: SP3, SP4, Altix, Cray X1, and 3 clusters (G5, Xeon, Itanium)
Supercomputing 2005 13
Base Model Accuracy (Sphot)
[Chart: prediction accuracy (y-axis, 0.96-1.01) vs. timesteps (x-axis, 1-25); series: Datastar-655, Henry2, and Ram, each shown as cumulative and per-step accuracy]
High accuracy with very short partial execution
Supercomputing 2005 14
Refined Model (sPPM, Ram->Henry2)
[Chart: prediction accuracy (y-axis, 0.92-1.04) vs. timesteps (x-axis, 26-51); series: cumulative vs. sliding window, normalized]
- Issues:
  - Ram: initialization variance
  - Henry2: I/O in 1 of every 10 steps
- Smarter algorithms:
  - Initialization filter
  - Sliding window to handle anomalies and periodic I/O
Supercomputing 2005 15
Application with Variable Problem Size
GENx rocket simulation (CSAR, UIUC), Turing->Frost
[Chart: accuracy (left y-axis, 0.8-1.1) and relative performance (right y-axis, 0.5-4) vs. timesteps (x-axis, 0-1500)]
Limited accuracy with variable timesteps
Supercomputing 2005 16
Reusing Partial Execution Data
[Chart 1: relative performance (0.0-2.5) vs. problem size (1: B1-std, 2: B2-cy, 3: B3-gtc); platforms: Phoenix, Ram, Seaborg, TeraGrid; avg. error: 12.1% - 25.8%]
[Chart 2: relative performance (0.0-3.0) vs. number of processors (0-600); platforms: Phoenix, Ram, Seaborg, TeraGrid; avg. error: 5.6% - 37.9%]
- Scientists often repeat runs with different configurations:
  - Number of processors
  - Input size and data content
  - Computation tasks
- Results are from the Gyro fusion simulation on 5 platforms
Supercomputing 2005 17
Presentation Roadmap
- Introduction
- Model and approach
- Performance results
- Conclusion and future work
Supercomputing 2005 18
Conclusion
- Empirical performance prediction works!
  - Real-world production codes
  - Multiple parallel platforms
  - Highly accurate predictions
- Limitations with:
  - Variable problem sizes
  - Input-size/processor scaling
- Observation-based prediction: simple, portable, low cost (a few timesteps)
[Figure: a T = 20 hrs run on the reference platform translated to predicted times of T = 2 hrs, T = 10 hrs, and T = 1 hr on three target platforms]
Supercomputing 2005 19
Related Work
- Parallel program performance prediction: application-specific analytical models, compiler/instrumentation tools, simulation-based predictions
- Cross-platform performance studies: mostly examine multiple platforms individually
- Grid job schedulers: do not offer cross-platform performance translation