The Case of the Missing Supercomputer Performance
Achieving Optimal Performance on the 8192 Processors of ASCI Q
Fabrizio Petrini, Darren Kerbyson, Scott Pakin (Los Alamos National Lab)
Presented by Jiahua He
10/03/05 2
Skeleton of the Story
• Machine: ASCI Q (second on the Top500 list)
  – 2048 Alpha SMP nodes with 4 proc per node
  – Interconnected with a Quadrics QsNet network
• Application: SAGE
  – Compressible Eulerian hydrodynamics program
  – 150,000 lines of Fortran MPI code
• Beginning: a serious but previously undetected problem
• Techniques:
  – Measurement to determine real performance
  – Analytical model to predict expected performance
  – Microbenchmarks to identify the problem source
  – Simulator to examine "what if" scenarios
• Result: a factor of 2 improvement in app performance
Steps
• Performance expectation
  – Use an analytical model to determine the performance that SAGE ought to see on ASCI Q
  – Measure the real performance of SAGE
• Problem source
  – If the measured performance is lower than expected, use custom microbenchmarks to identify the source of the discrepancy
• Problem elimination
  – Use the simulator to try different remedies
  – Eliminate the cause of the problem
• Remeasurement
  – Remeasure, and repeat from step 2 if measured and expected performance still do not match
Step 1: Performance expectation
Performance Expectation
• Model (Darren Kerbyson et al., SC01)
  – Validated on many large-scale systems, including all ASCI systems
  – Typical prediction error of less than 10%
• Terms
  – QA: first 4096-processor segment
  – QB: second 4096-processor segment
• Weak scaling: fix the per-node problem size and scale the # of proc
MYSTERY #1
SAGE performs significantly worse on ASCI Q than was predicted by our performance model.
Different # of proc
• Is the model accurate?
• n-proc: using n processors per node
• The only significant difference occurs with 4-proc
  – Gives confidence in the model
  – Narrows the problem down to the 4-proc configuration
• 3-proc outperforms 4-proc when using more than 256 nodes
• 2-proc outperforms 4-proc when using more than 512 nodes
Perf Variability
• A constant amount of work in each cycle should take a constant amount of time
• Measured cycle times vary from 0.7s to 3.0s
• A factor of 4 in variability
Breakdown of Cycle Time
• cycle = computation + local boundary exchange + collective communication
• Local boundary exchanges (get, put)
  – Plateau above 500 proc
  – Match model prediction
• Collective communications (allreduce, reduction, broadcast)
  – Increase with # of proc
  – Constant number and payload size in allreduce operations
  – Difference between allreduce and reduction/broadcast: the difference in frequency of occurrence
Summary
• Observations
  – Significant difference between expected and observed performance
  – Only with 4-proc
  – High variability
  – Source of the performance deficit: collective operations, especially allreduce
• Deduction
  – Improve the performance of allreduce, especially when using four processors per node
Step 2: Problem source
Investigating allreduce
• allreduce latency
  – 4-proc: 3ms
  – Others: less than 0.3ms
• Synthetic parallel benchmark
  – Alternately computes for 0, 1 or 5 ms and then performs either an allreduce or a barrier
• Ideal scalable system
  – Logarithmic growth with # of nodes
  – Insensitivity to computational granularity
• Result: not scalable
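As a rough illustration of the "ideal scalable system" above, the sketch below assumes a tree-structured collective whose latency grows by one fixed step each time the node count doubles. The function name and the 5 us per-level cost are hypothetical, chosen only for illustration; they are neither measurements from ASCI Q nor a description of the Quadrics implementation.

```python
def ideal_collective_latency_us(num_nodes: int, per_level_us: float) -> float:
    """Latency of an idealized tree-based collective over num_nodes nodes.

    Assumes ceil(log2(num_nodes)) tree levels, each costing per_level_us.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    levels = (num_nodes - 1).bit_length()  # equals ceil(log2(num_nodes))
    return levels * per_level_us


# Hypothetical per-level cost; doubling the node count adds one level.
PER_LEVEL_US = 5.0
for nodes in (64, 128, 256, 512, 1024):
    print(nodes, ideal_collective_latency_us(nodes, PER_LEVEL_US))
```

Under this assumption, going from 512 to 1024 nodes adds only one more level of latency, which is the kind of growth the measured allreduce failed to show.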
Optimizing allreduce
• Optimization
  – Changed from always polling to blocking after a limited time (100us, determined empirically)
  – Improved latency by a factor of 7
• Expectation
  – At 4096 proc, SAGE spends 51% of its time in allreduce, so a 78% performance gain is expected
• Measurement result
  – Only a marginal improvement in application performance
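The 78% figure is consistent with an Amdahl's-law style estimate: 51% of the run time becomes 7 times faster while the remaining 49% is unchanged. A minimal sketch of that arithmetic follows; the function name `expected_speedup` is illustrative and does not come from the paper.

```python
def expected_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the run time is accelerated by `factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


speedup = expected_speedup(fraction=0.51, factor=7.0)
gain_percent = (speedup - 1.0) * 100.0
print(f"{speedup:.2f}x overall, about {gain_percent:.0f}% gain")
```

The estimate only holds if the 51% is really time spent executing allreduce. If much of it is time spent waiting for delayed processes, a faster allreduce does not recover it.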
MYSTERY #2
Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce seven times faster leads to a negligible performance improvement.
Analyzing Noise
• The cause is in neither MPI nor the network, which leaves the node
• Periodic system activities (noise)
  – Need a spare proc (Fig. 3, 6)
  – Blocking in allreduce
• Benchmark
  – Synthetic computation per proc that takes 1000s without noise
  – Max slowdown: only 2.5%
• Refined benchmark
  – 1 million iterations per proc, each taking 1ms without noise
  – Matches the pattern of LANL codes
  – Similar result
MYSTERY #3
Although the “noise” hypothesis could explain SAGE’s suboptimal performance, microbenchmarks of per-processor noise indicate that at most 2.5% of performance is being lost to noise.
Node Aggregation
• Exposes structure in what appears to be uncorrelated noise on a per-proc basis
• Important observation
  – Regular pattern across nodes
  – Each cluster (32 nodes) contains noisier nodes
• Zoom into a cluster
  – Node 0: cluster manager
  – Node 1: quorum node
  – Node 31: RMS cluster monitor
FINDING #1
Analyzing noise on a per-node basis instead of a per-processor basis reveals a regular structure across nodes.
Noise Events
Sources of Noise
• Kernel
  – Distributed heartbeat generated at kernel level
  – Lightweight: hundreds of microseconds (us)
  – High frequency: one every 125ms
• RMS daemons
  – Quadrics Resource Management System
  – One every 30s
• TruCluster daemons
  – HP cluster management software
  – About one every 100s
Step 3: Problem elimination
Coscheduling
• Application: fine-grained, bulk-synchronous
• A delay in one process slows down the whole app
• With a large # of proc, there is at least one slow process per iteration
• Coscheduling: pay the penalty only once
• The authors developed a prototype, but give no details or results
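The "at least one slow process per iteration" point can be made concrete with a small probability sketch. It assumes that each node is delayed independently with the same probability in a given iteration; the value 1/125 (a 1ms iteration against one noise event every 125ms) is an illustrative assumption, not a figure from the paper.

```python
def prob_iteration_delayed(p_node: float, num_nodes: int) -> float:
    """Probability that at least one of num_nodes nodes is delayed in an iteration.

    Assumes every node is delayed independently with probability p_node.
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be between 0 and 1")
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return 1.0 - (1.0 - p_node) ** num_nodes


# Illustrative assumption: 1ms iterations, one noise event every 125ms per node.
P_NODE = 1.0 / 125.0
for nodes in (1, 32, 1024):
    print(nodes, round(prob_iteration_delayed(P_NODE, nodes), 4))
```

With uncoordinated noise, almost every iteration on a large machine contains a delayed node. If the noise were perfectly coscheduled, all nodes would be delayed in the same iterations, so only a fraction `P_NODE` of the iterations would pay the penalty.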
Discrete-event Simulator
• Why a simulator?
  – Time on ASCI Q is scarce
  – Configuration changes are not always practical
• Event = <F, L, E, P>
  – F: frequency of the event
  – L: average duration
  – E: distribution
  – P: placement
• Simulated workload: barriers + 1ms computations
• Validated for measured events (top two curves)
• Predicts the performance gain of removing noise
  – Node 0, 1 or 31: marginal improvement (15%)
  – Kernel noise on all nodes: dramatic improvement
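The idea of the simulator can be sketched in a few lines. This is not the authors' simulator: it is a simplified toy in which the distribution E is replaced by a fixed duration, an event delays a node by its full duration if it starts during that node's compute window, and events that fire while nodes wait at the barrier are ignored. The names `NoiseEvent`, `count_events` and `simulate`, and the 0.2ms duration, are hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class NoiseEvent:
    period_ms: float        # F: one event every period_ms on each affected node
    duration_ms: float      # L: duration of each event
    nodes: Tuple[int, ...]  # P: ids of the nodes on which the event occurs
    # E (distribution) is simplified away: every event lasts exactly duration_ms.


def count_events(start: float, end: float, period: float, phase: float) -> int:
    """Number of events at times phase + k*period (k >= 0) that fall in [start, end)."""
    k_lo = max(0, math.ceil((start - phase) / period))
    k_hi = max(0, math.ceil((end - phase) / period))
    return k_hi - k_lo


def simulate(num_nodes: int, iterations: int, work_ms: float,
             events: Sequence[NoiseEvent], seed: int = 0) -> float:
    """Total run time (ms) of a loop of work_ms computations separated by barriers."""
    rng = random.Random(seed)
    # Random per-node phases model noise that is not coordinated across nodes.
    phases = [[rng.uniform(0.0, ev.period_ms) for _ in ev.nodes] for ev in events]
    now = 0.0
    for _ in range(iterations):
        delay = [0.0] * num_nodes
        for ev, ev_phases in zip(events, phases):
            for node, phase in zip(ev.nodes, ev_phases):
                hits = count_events(now, now + work_ms, ev.period_ms, phase)
                delay[node] += hits * ev.duration_ms
        # The barrier makes every node wait for the slowest one.
        now += work_ms + max(delay)
    return now


# Hypothetical configuration: kernel-like noise on every node, 1ms computations.
NODES, ITERS, WORK_MS = 64, 1000, 1.0
kernel_like = NoiseEvent(period_ms=125.0, duration_ms=0.2, nodes=tuple(range(NODES)))
baseline = simulate(NODES, ITERS, WORK_MS, [])
noisy = simulate(NODES, ITERS, WORK_MS, [kernel_like])
print(f"slowdown with kernel-like noise: {noisy / baseline:.3f}")
```

Changing `nodes`, `period_ms` and `duration_ms` gives a cheap way to ask "what if" questions, such as removing noise from a few nodes versus all of them, without using machine time.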
FINDING #2
On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes.
Eliminating Noise
• Infeasible to remove all the noise
  – Two TruCluster heartbeats at kernel level
  – Removing them would require substantial kernel modifications
• Optimizations
  – Removed ten daemons from all nodes
  – Increased RMS interval from 30s to 60s
  – Moved several TruCluster daemons from nodes 1 and 2 to node 0
• Microbenchmarks
  – Barriers + computations (0, 1 or 5ms)
• Improvements
  – 2.2 to 13 times faster
Step 4: Remeasurement
Optimized SAGE Performance
• Old curves (top two curves)
• New curves
  – 4-proc, but w/o nodes 0 & 31
  – Jan-27-03: 1024-node segment (only up to 3716 proc)
  – May-01-03: full-sized ASCI Q (up to 7680 proc)
  – May-01-03 (min): minimum time over 50 cycles
• Results
  – Jan-27-03 and May-01-03: much improved
  – May-01-03 (min): closely matches expected performance, so further optimizations are possible
Summary
• Different configurations tested before and after noise removal
• Total processing rate
  – (# usable proc) * (cells per proc) / (cycle time)
  – Fixed 13,500 cells per proc
  – Varied # of usable proc
• Best observed (???) processing rate is only 15% below model expectation
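The processing-rate formula above translates directly into code. In this sketch the 13,500 cells per proc and the 7680 proc come from the slides, while the 1.0s cycle time is a hypothetical value used only to exercise the function.

```python
def processing_rate(usable_procs: int, cells_per_proc: int, cycle_time_s: float) -> float:
    """Total processing rate in cells per second."""
    if cycle_time_s <= 0.0:
        raise ValueError("cycle_time_s must be positive")
    return usable_procs * cells_per_proc / cycle_time_s


CELLS_PER_PROC = 13_500   # fixed per-proc problem size from the slide
HYPOTHETICAL_CYCLE_S = 1.0  # assumed cycle time, not a measured value

rate = processing_rate(7680, CELLS_PER_PROC, HYPOTHETICAL_CYCLE_S)
print(f"{rate:,.0f} cells/s")
```

Because the rate counts only usable processors, a configuration that gives up some processors to system activity can still come out ahead if its cycle time drops enough.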
FINDING #3
We were able to double SAGE’s performance by removing noise caused by several types of dæmons, confining dæmons to the cluster manager, and removing the cluster manager and the RMS cluster monitor from each cluster’s compute pool.
Discussion
• The computational granularity of the app determines which type of noise dominates
• Load-balanced, coarse-grained app (e.g. LINPACK):
  – Long noise dominates
  – Short noise becomes coscheduled
• Medium-grained app (e.g. SAGE):
  – Medium noise dominates
• Fine-grained app (e.g. deterministic Sn-transport):
  – Short noise dominates
  – The frequency of long noise is low
FINDING #4
Substantial performance loss occurs when an application resonates with system noise: high-frequency, fine-grained noise affects only fine-grained applications; low-frequency, coarse-grained noise affects only coarse-grained applications.
Conclusion
• Described a figurative journey to improve the performance of a sizable hydrodynamics app, SAGE, on the world's second-fastest supercomputer, ASCI Q
• Methodologies
  – The first to determine how fast an app could potentially run
  – Developed a methodology to analyze artifacts that degrade app performance yet are not part of the app
  – Doubled the performance of SAGE w/o modifying a single line of code
• Notions
  – Noise and resonance
  – Applicable to other systems and other apps
More discussions
• What do they mean by "best observed" in Table 3? The processing rate of regular 4-proc using 7680 proc (120.6) is still lower than that of 3-proc with only 6144 proc.
• The analytical model is constructed manually (Darren Kerbyson et al., SC01), which is enormously labor intensive.
Thanks! Any questions?
The Case of the Missing Supercomputer Performance
(SC 2003)