
The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q

Fabrizio Petrini, Darren Kerbyson, Scott Pakin (Los Alamos National Lab)

Presented by Jiahua He

10/03/05 2

Skeleton of the Story
• Machine: ASCI Q (second on the Top500 list)
– 2048 Alpha SMP nodes with 4 processors per node
– Interconnected with the Quadrics QsNet network
• Application: SAGE
– Compressible Eulerian hydrodynamics program
– 150,000 lines of Fortran MPI code
• Beginning: a serious but previously undetected problem
• Techniques:
– Measurement to determine real performance
– Analytical model to predict expected performance
– Microbenchmarks to identify the problem source
– Simulator to examine "what if" scenarios
• Result: a factor of 2 improvement in application performance


Steps
• Performance expectation
– Use the analytical model to determine the performance that SAGE ought to see on ASCI Q
– Measure the real performance of SAGE
• Problem source
– If the measured performance is lower than expected, use custom microbenchmarks to identify the source of the discrepancy
• Problem elimination
– Use the simulator to try different remedies
– Eliminate the cause of the problem
• Remeasurement
– Remeasure, and repeat from step 2 if measured and expected performance still do not match


Step 1
• Performance expectation
– Use the analytical model to determine the performance that SAGE ought to see on ASCI Q






Performance Expectation
• Model (Darren Kerbyson et al. SC01)
– Validated on many large-scale systems, including all ASCI systems
– Typical prediction error of less than 10%
• Terms
– QA: first 4096-processor segment
– QB: second 4096-processor segment
• Weak scaling: fix the per-node problem size and scale the # of processors


MYSTERY #1

SAGE performs significantly worse on ASCI Q than was predicted by our performance model.


Different # of proc
• Is the model accurate?
• n-proc: using n processors per node
• The only significant difference occurs with 4-proc
– Gives confidence in the model
– Confines the problem to the 4-proc configuration
• 3-proc outperforms 4-proc when using more than 256 nodes
• 2-proc outperforms 4-proc when using more than 512 nodes


Perf Variability
• Constant amount of work in each cycle, so each cycle should take a constant amount of time
• Measured cycle times vary from 0.7 s to 3.0 s
• A factor of 4 in variability


Breakdown of Cycle Time
• cycle = computation + local boundary exchange + collective communication
• Local boundary exchanges (get, put)
– Plateau above 500 processors
– Match the model prediction
• Collective communications (allreduce, reduction, broadcast)
– Increase with the # of processors
– Constant number and payload size in allreduce operations
– Difference between allreduce and reduction/broadcast: the difference in frequency of occurrence


Summary
• Observations
– Significant difference between expected performance and observed performance
– Only with 4-proc
– High variability
– Source of the performance deficit: collective operations, especially allreduce
• Deduction
– Improve the performance of allreduce, especially when using four processors per node


Investigating allreduce
• allreduce latency
– 4-proc: 3 ms
– Others: less than 0.3 ms
• Synthetic parallel benchmark
– Alternately computes for 0, 1 or 5 ms, then performs either an allreduce or a barrier
• Ideal scalable system
– Logarithmic growth with the # of nodes
– Insensitivity to computational granularity
• Result: not scalable


Optimizing allreduce
• Optimization
– Replace always polling with blocking after a limited polling time (100 us, determined empirically)
– Improves latency by a factor of 7
• Expectation
– At 4096 processors, SAGE spends 51% of its time in allreduce, so a 78% performance gain is expected
• Measurement result
– Only a marginal improvement in application performance
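The switch from pure polling to polling followed by blocking can be illustrated with a minimal Python sketch. This is not the Quadrics or MPI implementation: `wait_spin_then_block` and `spin_limit_s` are hypothetical names, and a `threading.Event` stands in for the arrival of the allreduce result. The waiter polls for at most 100 us and then blocks, which releases the processor to other work.

```python
import threading
import time


def wait_spin_then_block(done: threading.Event, spin_limit_s: float = 100e-6) -> str:
    """Poll `done` for up to spin_limit_s seconds, then fall back to blocking.

    Returns "polled" if the event was seen while polling, "blocked" otherwise.
    """
    deadline = time.perf_counter() + spin_limit_s
    while True:
        if done.is_set():              # cheap check, keeps latency low
            return "polled"
        if time.perf_counter() >= deadline:
            break                      # polling budget exhausted
    done.wait()                        # block: the processor is now free for other work
    return "blocked"


# Illustrative use: the result arrives after 50 ms, far beyond the polling budget.
done = threading.Event()
threading.Timer(0.05, done.set).start()
print(wait_spin_then_block(done))
```

The 100 us default mirrors the empirically determined limit on the slide; any other value is a tuning assumption.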


MYSTERY #2

Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce seven times faster leads to a negligible performance improvement.
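The 78% expectation is consistent with Amdahl-style arithmetic: if a fraction f of the run time is made s times faster, the overall speedup is 1 / ((1 - f) + f / s). A quick check with f = 0.51 and s = 7 from the slides (the function name is hypothetical):

```python
def expected_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the run time becomes `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


speedup = expected_speedup(0.51, 7.0)      # 51% of time in allreduce, 7x faster
gain_percent = (speedup - 1.0) * 100.0
print(f"{speedup:.2f}x speedup, {gain_percent:.0f}% gain")  # prints "1.78x speedup, 78% gain"
```

That the measured gain was far smaller suggests the time attributed to allreduce was not being spent in the allreduce operation itself.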


Analyzing Noise
• Neither MPI nor the network is the cause, so the problem lies in the node
• Periodic system activities (noise)
– Need a spare processor (Fig. 3, 6)
– Blocking in allreduce
• Benchmark
– Synthetic computation that takes 1000 s per processor in the absence of noise
– Max slowdown: only 2.5%
• Refined benchmark
– 1 million iterations per processor, each taking 1 ms in the absence of noise
– Matches the pattern of LANL codes
– Similar result


MYSTERY #3

Although the “noise” hypothesis could explain SAGE’s suboptimal performance, microbenchmarks of per-processor noise indicate that at most 2.5% of performance is being lost to noise.


Node Aggregation
• Expose structure in what appears to be uncorrelated noise on a per-proc basis
• Important observation
– Regular pattern across nodes
– Each cluster (32 nodes) contains some noisier nodes
• Zoom into a cluster
– Node 0: cluster manager
– Node 1: quorum node
– Node 31: RMS cluster monitor


FINDING #1

Analyzing noise on a per-node basis instead of a per-processor basis reveals a regular structure across nodes.


Noise Events


Sources of Noise
• Kernel
– Distributed heartbeat generated at kernel level
– Lightweight: hundreds of microseconds (us)
– High frequency: one every 125 ms
• RMS daemons
– Quadrics Resource Management System
– One every 30 s
• TruCluster daemons
– HP cluster management software
– One about every 100 s


Coscheduling
• Application: fine-grained, bulk-synchronous
• A delay in one process slows down the whole app
• With a large # of processors, there is at least one slow process per iteration
• Coscheduling: pay the penalty only once
• Developed a prototype, but no details or results
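The claim that a large processor count means at least one slow process per iteration can be made concrete with a small probability sketch. It assumes, purely for illustration, that each process is hit by noise independently with probability p in a given iteration; the function name and the 1 ms iteration length are assumptions, while the 125 ms noise period comes from the kernel heartbeat figure in these slides.

```python
def prob_iteration_delayed(p_single: float, num_procs: int) -> float:
    """Probability that at least one of num_procs processes is hit by noise
    in one iteration, assuming independent hits with probability p_single."""
    return 1.0 - (1.0 - p_single) ** num_procs


# Assumption: 1 ms iterations and one noise event every 125 ms per process.
p = 0.001 / 0.125
for n in (1, 32, 1024, 4096):
    print(n, prob_iteration_delayed(p, n))
```

Under these assumptions a single process is rarely disturbed, yet with thousands of processes almost every iteration contains a delayed process, and the barrier makes everyone wait for it. Coscheduling the noise makes the hits coincide, so the delay is paid once rather than in nearly every iteration.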


Discrete-event Simulator
• Why a simulator?
– Time on ASCI Q is scarce
– Configuration changes are not always practical
• Event = <F, L, E, P>
– F: frequency of the event
– L: average duration
– E: distribution
– P: placement
• Barriers + 1 ms computations
• Validated for measured events (top two curves)
• Predicts the performance gain of removing noise
– Node 0, 1 or 31: marginal improvement (15%)
– Kernel noise on all nodes: dramatic improvement


FINDING #2

On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes.
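A toy version of the barrier-plus-computation simulation can be sketched in a few lines of Python. This is not the authors' simulator: `NoiseEvent` and `simulate` are hypothetical names, the distribution E is simplified to strictly periodic events with a random per-node phase, noise that fires while a node waits at the barrier is treated as harmless, and the durations in the example are assumed values rather than ASCI Q measurements.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseEvent:
    period_us: int     # F: one event every period_us microseconds
    duration_us: int   # L: processor time taken by each event
    nodes: frozenset   # P: ids of the nodes on which the event occurs


def simulate(iterations, compute_us, events, seed=0):
    """Total time (us) of `iterations` steps of computation followed by a barrier."""
    for ev in events:
        if ev.duration_us >= ev.period_us:
            raise ValueError("noise duration must be shorter than its period")
    rng = random.Random(seed)
    slots = [(ev, node) for ev in events for node in sorted(ev.nodes)]
    fires = [rng.randrange(ev.period_us) for ev, _ in slots]  # next firing times
    now = 0
    for _ in range(iterations):
        finish = {}                        # node -> time it reaches the barrier
        for i, (ev, node) in enumerate(slots):
            fire = fires[i]
            while fire < now:              # fired while the node sat idle at the barrier
                fire += ev.period_us
            end = finish.get(node, now + compute_us)
            while fire < end:              # fired during computation: delays this node
                end += ev.duration_us
                fire += ev.period_us
            finish[node] = end
            fires[i] = fire
        # The barrier completes when the slowest node arrives.
        now = max(finish.values(), default=now + compute_us)
    return now


# Illustrative comparison on 128 nodes with 1 ms computations (assumed durations).
all_nodes = frozenset(range(128))
short_everywhere = [NoiseEvent(125_000, 200, all_nodes)]
long_on_few = [NoiseEvent(30_000_000, 5_000, frozenset({0, 1, 31}))]
for name, evs in (("no noise", []), ("short, all nodes", short_everywhere),
                  ("long, few nodes", long_on_few)):
    print(name, simulate(1000, 1000, evs))
```

Varying the period, duration and placement of the events is the kind of "what if" experiment the simulator slide describes; the numbers printed here depend entirely on the assumed parameters.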


Eliminating Noise
• Infeasible to remove all the noise
– Two TruCluster heartbeats at kernel level
– Removing them would require substantial kernel modifications
• Optimizations
– Removed ten daemons from all nodes
– Increased the RMS interval from 30 s to 60 s
– Moved several TruCluster daemons from nodes 1 and 2 to node 0
• Microbenchmarks
– Barriers + computations (0, 1 or 5 ms)
• Improvements
– 2.2 to 13 times faster


Optimized SAGE Performance
• Old curves (top two curves)
• New curves
– 4-proc, but w/o nodes 0 & 31
– Jan-27-03: 1024-node segment (only up to 3716 proc)
– May-01-03: full-sized ASCI Q (up to 7680 proc)
– May-01-03 (min): minimum time over 50 cycles
• Results
– Jan-27-03 and May-01-03: much improved
– May-01-03 (min): closely matches expected performance, which leaves room for further optimizations


Summary
• Different configurations tested prior to and after noise removal
• Total processing rate
– (# usable proc) * (cells per proc) / (cycle time)
– Fixed 13,500 cells per proc
– Varied # of usable proc
• Best observed (???) processing rate is only 15% below model expectation
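The processing-rate formula on this slide translates directly into code. In the sketch below the function name is hypothetical and the 1.0 s cycle time is an assumed value chosen only to show the arithmetic; 13,500 cells per processor and 7680 usable processors come from the slides.

```python
def processing_rate(usable_procs: int, cells_per_proc: int, cycle_time_s: float) -> float:
    """Total processing rate in cells per second."""
    if cycle_time_s <= 0:
        raise ValueError("cycle time must be positive")
    return usable_procs * cells_per_proc / cycle_time_s


# Assumed cycle time of 1.0 s, for illustration only.
print(processing_rate(7680, 13_500, 1.0))  # prints 103680000.0
```

Because the cells per processor are fixed, a configuration with fewer usable processors can still reach a higher rate if its cycle time is sufficiently shorter.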


FINDING #3

We were able to double SAGE’s performance by removing noise caused by several types of dæmons, confining dæmons to the cluster manager, and removing the cluster manager and the RMS cluster monitor from each cluster’s compute pool.


Discussion
• The computational granularity of the app determines which type of noise matters
• Load-balanced, coarse-grained app (e.g. LINPACK):
– Long noise dominates
– Short noise becomes coscheduled
• Medium-grained app (e.g. SAGE):
– Medium noise dominates
• Fine-grained app (e.g. deterministic Sn-transport):
– Short noise dominates
– The frequency of long noise is low


FINDING #4

Substantial performance loss occurs when an application resonates with system noise: high-frequency, fine-grained noise affects only fine-grained applications; low-frequency, coarse-grained noise affects only coarse-grained applications.


Conclusion
• Described a figurative journey to improve the performance of a sizable hydrodynamics app, SAGE, on the world's second-fastest supercomputer, ASCI Q
• Methodologies
– The first to determine how fast an app could potentially run
– Developed a methodology to analyze artifacts that degrade app performance yet are not part of the app
– Doubled the performance of SAGE w/o modifying a single line of code
• Notions
– Noise and resonance
– Applicable to other systems and other apps


More discussions
• What do they mean by "best observed" in Table 3? The processing rate of regular 4-proc using 7680 proc (120.6) is still lower than that of 3-proc with only 6144 proc.

• The analytical model is constructed manually (Darren Kerbyson et al. SC01). It is enormously labor intensive.

Thanks! Any questions?

The Case of the Missing Supercomputer Performance

(SC 2003)
