The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8192 Processors of ASCI Q. Fabrizio Petrini, Darren Kerbyson, Scott Pakin (Los Alamos National Lab). Presented by Jiahua He.


Page 1: The Case of the Missing Supercomputer Performancecseweb.ucsd.edu/groups/csag/html/teaching/cse294/... · 2005-10-03 · below model expectation FINDING #3 We were able to double SAGE’s

The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8192 Processors of ASCI Q

Fabrizio Petrini, Darren Kerbyson, Scott Pakin (Los Alamos National Lab)

Presented by Jiahua He


Skeleton of the Story

• Machine: ASCI Q (second on the Top500 list)
  – 2048 Alpha SMP nodes with 4 processors per node
  – Interconnected with the Quadrics QsNet network
• Application: SAGE
  – Compressible Eulerian hydrodynamics program
  – 150,000 lines of Fortran MPI code
• Beginning: a serious but previously undetected problem
• Techniques:
  – Measurement to determine real performance
  – Analytical model to predict expected performance
  – Microbenchmarks to identify the problem source
  – Simulator to examine "what if" scenarios
• Result: a factor of 2 improvement in app performance


Steps

• Performance expectation
  – Use the analytical model to determine the performance that SAGE ought to see on ASCI Q
  – Measure the real performance of SAGE
• Problem source
  – If the measured performance is lower than expected, use custom microbenchmarks to identify the source of the discrepancy
• Problem elimination
  – Use the simulator to try different countermeasures
  – Eliminate the cause of the problem
• Remeasurement
  – Remeasure, and repeat from step 2 if measurement and model still do not match


Step 1: Performance expectation


Performance Expectation

• Model (Darren Kerbyson et al., SC01)
  – Validated on many large-scale systems, including all ASCI systems
  – Typical prediction error of less than 10%
• Terms
  – QA: first 4096-processor segment
  – QB: second 4096-processor segment
• Weak scaling: fix the per-node problem size and scale the # of proc


MYSTERY #1

SAGE performs significantly worse on ASCI Q than was predicted by our performance model.


Different # of proc

• Is the model accurate?
• n-proc: using n processors per node
• The only significant difference occurs with 4-proc
  – Gives confidence in the model
  – Confines the problem to the 4-proc case
• 3-proc outperforms 4-proc when using more than 256 nodes
• 2-proc outperforms 4-proc when using more than 512 nodes


Perf Variability

• A constant amount of work in each cycle should take a constant amount of time
• Measured cycle time varies from 0.7 s to 3.0 s
• A factor of 4 in variability


Breakdown of Cycle Time

• cycle = computation + local boundary exchange + collective communication
• Local boundary exchanges (get, put)
  – Plateau above 500 proc
  – Match model prediction
• Collective communications (allreduce, reduction, broadcast)
  – Increase with # of proc
  – Constant number and payload size in allreduce operations
  – Difference between allreduce and reduction/broadcast: the difference in frequency of occurrence


Summary

• Observations
  – Significant difference between expected and observed performance
  – Only with 4-proc
  – High variability
  – Source of the performance deficit: collective operations, especially allreduce
• Deduction
  – Improve the performance of allreduce, especially when using four processors per node


Step 2: Problem source


Investigating allreduce

• allreduce latency
  – 4-proc: 3 ms
  – Others: less than 0.3 ms
• Synthetic parallel benchmark
  – Alternately computes for either 0, 1 or 5 ms, then performs either an allreduce or a barrier
• Ideal scalable system
  – Logarithmic growth with # of nodes
  – Insensitivity to computational granularity
• Result: not scalable


Optimizing allreduce

• Optimization
  – Instead of always polling, block after a limited polling time (100 µs, determined empirically)
  – Improves latency by a factor of 7
• Expectation
  – At 4096 proc, SAGE spends 51% of its time in allreduce, so a 78% performance gain is expected
• Measurement result
  – Only a marginal improvement in application performance
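A minimal sketch of the poll-then-block idea, assuming the slide means that a waiting process spins for a bounded time and then gives up the CPU. Python's `threading.Event` stands in for "the message has arrived"; the function name and this event-based setup are illustrative assumptions, not the actual MPI or Quadrics implementation.

```python
import threading
import time

def wait_poll_then_block(done: threading.Event, poll_for_s: float = 100e-6) -> str:
    """Busy-poll `done` for at most `poll_for_s` seconds, then block on it.

    Returns "polled" if completion was seen while spinning, "blocked" otherwise.
    The 100 µs default mirrors the empirically chosen limit on the slide.
    """
    deadline = time.perf_counter() + poll_for_s
    while True:
        if done.is_set():                      # cheap check, keeps latency low
            return "polled"
        if time.perf_counter() >= deadline:    # spinning budget used up
            break
    done.wait()                                # yield the CPU until signalled
    return "blocked"

# Usage: an already-completed operation is caught by the polling phase,
# a slow one (50 ms here, hypothetical) falls through to the blocking wait.
fast = threading.Event()
fast.set()
print(wait_poll_then_block(fast))

slow = threading.Event()
threading.Timer(0.05, slow.set).start()
print(wait_poll_then_block(slow))
```

Blocking frees the processor for other work on the node while the process waits, which is why the polling budget matters when all four processors are in use.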


MYSTERY #2

Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce seven times faster leads to a negligible performance improvement.
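The 78% expectation follows from an Amdahl-style calculation: if 51% of the run time becomes 7 times faster and the rest is unchanged, the new run time is 0.49 + 0.51/7 of the old one. A short check, where the function name is only illustrative:

```python
def expected_speedup(fraction: float, factor: float) -> float:
    """Whole-application speedup when `fraction` of the run time is made
    `factor` times faster and the remainder is left unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Figures from the slide: 51% of the time in allreduce, allreduce 7x faster.
gain = expected_speedup(0.51, 7.0) - 1.0
print(f"expected gain: {gain:.0%}")  # prints "expected gain: 78%"
```

The mystery is that this calculation assumes the time attributed to allreduce is really spent doing allreduce work, rather than waiting for delayed processes.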


Analyzing Noise

• Neither MPI nor the network is at fault, which leaves the node
• Periodic system activities (noise)
  – Need a spare proc (Fig. 3, 6)
  – Blocking in allreduce
• Benchmark
  – Synthetic computation that takes 1000 s per proc in the absence of noise
  – Max slowdown: only 2.5%
• Refined benchmark
  – 1 million iterations per proc, each taking 1 ms in the absence of noise
  – Matches the pattern of LANL codes
  – Similar result


MYSTERY #3

Although the “noise” hypothesis could explain SAGE’s suboptimal performance, microbenchmarks of per-processor noise indicate that at most 2.5% of performance is being lost to noise.
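A single-process sketch of this kind of per-processor noise measurement: time many identical work quanta and compare the total against a noiseless estimate. The `spin` kernel, the iteration count and the use of the fastest quantum as the noiseless baseline are assumptions for illustration; the original benchmark used 1 ms quanta on every processor of ASCI Q.

```python
import time

def spin(work: int) -> int:
    """A fixed amount of CPU work (hypothetical stand-in for the compute quantum)."""
    x = 0
    for i in range(work):
        x += i
    return x

def measure_noise(iterations: int, work: int):
    """Time `iterations` identical quanta.

    Returns the per-iteration times and the slowdown of the whole run relative
    to a noiseless run, estimated as iterations * fastest observed quantum.
    """
    times = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        spin(work)
        times.append(time.perf_counter() - t0)
    ideal = iterations * min(times)
    return times, sum(times) / ideal - 1.0

times, slowdown = measure_noise(iterations=1000, work=10_000)
print(f"slowdown attributed to noise: {slowdown:.1%}")
```

Measured this way, each processor reports only its own lost time, which is why the result can look harmless even when the application as a whole is badly affected.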


Node Aggregation

• Exposes structure in what appears to be uncorrelated noise on a per-proc basis
• Important observation
  – Regular pattern across nodes
  – Each cluster (32 nodes) contains some noisier nodes
• Zoom into a cluster
  – Node 0: cluster manager
  – Node 1: quorum node
  – Node 31: RMS cluster monitor


FINDING #1

Analyzing noise on a per-node basis instead of a per-processor basis reveals a regular structure across nodes.
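The aggregation step itself is simple: sum the noise seen by the processors of each node. In the sketch below the sample values are made up purely for illustration, and it is assumed that processors are numbered consecutively within a node.

```python
def aggregate_per_node(per_proc_noise, procs_per_node=4):
    """Sum per-processor noise measurements over each node's processors.

    Assumes processors are listed node by node (an assumption, not from the slides).
    """
    if len(per_proc_noise) % procs_per_node:
        raise ValueError("length must be a multiple of procs_per_node")
    return [
        sum(per_proc_noise[i:i + procs_per_node])
        for i in range(0, len(per_proc_noise), procs_per_node)
    ]

# Fabricated per-processor noise (ms) for 4 nodes x 4 processors.
# No single processor stands out, but two nodes clearly do once summed.
per_proc = [2, 1, 2, 1,   0, 1, 1, 0,   1, 0, 0, 1,   1, 2, 1, 2]
print(aggregate_per_node(per_proc))  # → [6, 2, 2, 6]
```

A daemon can run on any processor of its node, so its cost is spread thinly across the per-processor view and only becomes visible per node.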


Noise Events


Sources of Noise

• Kernel
  – Distributed heartbeat generated at kernel level
  – Lightweight: hundreds of microseconds (µs)
  – High frequency: one every 125 ms
• RMS daemons
  – Quadrics Resource Management System
  – One every 30 s
• TruCluster daemons
  – HP cluster management software
  – One about every 100 s


Step 3: Problem elimination


Coscheduling

• Application: fine-grained, bulk-synchronous
• A delay in one process slows down the whole app
• With a large # of proc, at least one process is likely to be slow in each iteration
• Coscheduling: pay the penalty only once
• A prototype was developed, but no details or results are given
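The "at least one slow process" argument can be made quantitative under a simplifying assumption that is not stated on the slide: noise hits each process independently with the same probability per iteration. The probability value below is hypothetical.

```python
def p_any_delayed(p_one: float, n_procs: int) -> float:
    """Probability that at least one of `n_procs` processes is delayed in an
    iteration, assuming independent delays with per-process probability `p_one`."""
    return 1.0 - (1.0 - p_one) ** n_procs

# Hypothetical 0.1% chance that a given process is hit in a given iteration.
for n in (4, 64, 1024, 4096):
    print(n, round(p_any_delayed(0.001, n), 3))
```

A delay that is rare for any one process becomes close to certain for the whole machine, and in a bulk-synchronous code every process then waits for the slowest one. Coscheduling the noise would make the delays coincide, so that each is paid once instead of once per affected process.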


Discrete-event Simulator

• Why a simulator?
  – Time on ASCI Q is scarce
  – Configuration changes are not always practical
• Event = <F, L, E, P>
  – F: frequency of the event
  – L: average duration
  – E: distribution; P: placement
• Barriers + 1 ms computations
• Validated for measured events (top two curves)
• Predicts the performance gain of removing noise
  – Node 0, 1 or 31: marginal improvement (15%)
  – Kernel noise on all nodes: dramatic improvement


FINDING #2

On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes.
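A toy version of such a simulation, to show why this can happen. This is not the authors' simulator: noise is modeled as strictly periodic with a fixed duration (so the distribution E is degenerate), phases are evenly staggered across the noisy nodes, and the 250 µs and 5 ms durations are hypothetical. Only the 125 ms and 30 s periods, the 1 ms computation and the nodes 0, 1 and 31 placement come from the slides. All times are integer microseconds.

```python
def finish_time(start, work, period, duration, phase):
    """Time at which a node finishes `work` µs of computation begun at `start`,
    when a noise event of `duration` µs preempts it every `period` µs."""
    k = (start - phase - duration) // period + 1   # first event ending after start
    t, remaining = start, work
    while True:
        ev_start = phase + k * period
        if ev_start >= t + remaining:              # work completes before next event
            return t + remaining
        if ev_start > t:
            remaining -= ev_start - t              # work done before being preempted
        t = ev_start + duration                    # resume once the event is over
        k += 1

def slowdown(iterations, work, period, duration, noisy_nodes):
    """Relative slowdown of `iterations` rounds of (compute, barrier)."""
    phases = [j * period // len(noisy_nodes) for j in range(len(noisy_nodes))]
    t = 0
    for _ in range(iterations):
        # The barrier releases when the slowest node is done; quiet nodes take t + work.
        t = max([t + work] + [finish_time(t, work, period, duration, p) for p in phases])
    return t / (iterations * work) - 1.0

all_nodes = list(range(256))
few_nodes = [n for n in all_nodes if n % 32 in (0, 1, 31)]

short_frequent = slowdown(500, 1_000, 125_000, 250, all_nodes)      # kernel-like noise
long_rare = slowdown(500, 1_000, 30_000_000, 5_000, few_nodes)      # daemon-like noise
print(f"short, frequent, all nodes: {short_frequent:.1%}")
print(f"long, rare, few nodes:      {long_rare:.1%}")
```

With these made-up parameters the short, frequent noise costs far more, because some node is hit in nearly every 1 ms iteration, while the long events are too rare to matter over the same run. The outcome depends on the chosen durations and run length.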


Eliminating Noise

• Infeasible to remove all the noise
  – Two TruCluster heartbeats at kernel level
  – Removing them would require substantial kernel modifications
• Optimizations
  – Removed ten daemons from all nodes
  – Increased the RMS interval from 30 s to 60 s
  – Moved several TruCluster daemons from nodes 1 and 2 to node 0
• Microbenchmarks
  – Barriers + computations (0, 1 or 5 ms)
• Improvements
  – 2.2 to 13 times faster


Step 4: Remeasurement


Optimized SAGE Performance

• Old curves (top two curves)
• New curves
  – 4-proc, but w/o nodes 0 & 31
  – Jan-27-03: 1024-node segment (only up to 3716 proc)
  – May-01-03: full-sized ASCI Q (up to 7680 proc)
  – May-01-03 (min): minimum time over 50 cycles
• Results
  – Jan-27-03 and May-01-03: much improved
  – May-01-03 (min): closely matches expected performance, which suggests room for further optimizations


Summary

• Different configurations tested before and after noise removal
• Total processing rate
  – (# usable proc) * (cells per proc) / (cycle time)
  – Fixed 13,500 cells per proc
  – Varied # of usable proc
• Best observed (???) processing rate is only 15% below model expectation
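The processing-rate metric written out as a function. The 13,500 cells per processor and the 7680-processor count come from the slides; the 1.0 s cycle time is a hypothetical value chosen only to make the arithmetic visible.

```python
def processing_rate(usable_procs: int, cells_per_proc: int, cycle_time_s: float) -> float:
    """Total processing rate in cells per second:
    (# usable proc) * (cells per proc) / (cycle time)."""
    return usable_procs * cells_per_proc / cycle_time_s

# 7680 usable processors, 13,500 cells each, hypothetical 1.0 s cycle time.
print(processing_rate(7680, 13_500, 1.0))  # → 103680000.0
```

The metric rewards both using more processors and shortening the cycle, which is why a configuration that leaves some processors or nodes idle can still come out ahead if it avoids enough noise.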


FINDING #3

We were able to double SAGE’s performance by removing noise caused by several types of dæmons, confining dæmons to the cluster manager, and removing the cluster manager and the RMS cluster monitor from each cluster’s compute pool.


Discussion

• The computational granularity of the app determines which type of noise matters
• Load-balanced, coarse-grained app (e.g. LINPACK):
  – Long noise dominates
  – Short noise becomes coscheduled
• Medium-grained app (e.g. SAGE):
  – Medium noise dominates
• Fine-grained app (e.g. deterministic Sn-transport):
  – Short noise dominates
  – The frequency of long noise is low


FINDING #4

Substantial performance loss occurs when an application resonates with system noise: high-frequency, fine-grained noise affects only fine-grained applications; low-frequency, coarse-grained noise affects only coarse-grained applications.


Conclusion

• Described a figurative journey to improve the performance of a sizable hydrodynamics app, SAGE, on the world's second-fastest supercomputer, ASCI Q
• Methodologies
  – The first to determine how fast an app could potentially run
  – Developed a methodology to analyze artifacts that degrade app performance yet are not part of the app
  – Doubled the performance of SAGE w/o modifying a single line of code
• Notions
  – Noise and resonance
  – Applicable to other systems and other apps


More discussions

• What do they mean by "best observed" in Table 3? The processing rate of regular 4-proc using 7680 proc (120.6) is still lower than that of 3-proc with only 6144 proc.
• The analytical model is constructed manually (Darren Kerbyson et al., SC01). It is enormously labor intensive.


Thanks! Any questions?

The Case of the Missing Supercomputer Performance

(SC 2003)