mobs-5 :: june 21, 2009 fiesta: a sample-balanced multi-program workload methodology andrew hilton,...

MoBS-5 :: June 21, 2009

FIESTA: A Sample-Balanced Multi-Program Workload Methodology
Andrew Hilton, Neeraj Eswaran, Amir Roth
University of Pennsylvania
{adhilton, neeraj,

[ 2 ] Overview
* Multi-program workloads
  * Samples from independent programs
  * Executed concurrently to evaluate SMT, CMP, scheduling, etc.
* How to choose samples?
  * Fixed-workload: choose samples first
    * Load imbalance problem
  * Variable-workload: multi-program execution defines samples
    * Other (more serious) problems
* Our work
  * Distinguish sample imbalance (bad) from schedule imbalance (ok)
  * Propose FIESTA: sample-balanced fixed-workload methodology

[ 3 ] Traditional Fixed-Workload
* Single-program workload x N
  * X insns (e.g., 5M/sample) from each program [ElMoursi03, Eyerman07]
* Workload composition is fixed across experiments
+ Direct comparisons between experiments
* Load imbalance: time spent executing only slowest programs
[Figure: timelines for Experiment 1 and Experiment 2, each with A: 5M and B: 5M]

[ 4 ] Load Imbalance
* If significant
  * Not representative of real, continuous multi-program execution
  * Deflates (multi-program / single-program) speedup
    * SMT-speedup: (T_1 + T_2) / T_{1+2}
    * T_1 = 1, T_2 = 2, T_{1+2} = 2: SMT-speedup = (1 + 2) / 2 = 1.5, i.e., 50%
* Not really used (anymore) because of this
[Figure: timeline with A: 5M and B: 5M]

[ 5 ] 2-way SMT
* Fixed
  * 250M insns from each program
  * 13% SMT speedup
  * 51% load imbalance

[ 6 ] Variable-Workload
* Multi-program execution defines workload
  * Execute all programs until some condition (e.g., total insns = 10M)
  * Normalize to single-program region defined by this execution
  * SMT-speedup metric used for this normalization
* Eliminates load imbalance (by construction)
[Figure: timeline with A: 3M and B: 7M]

[ 7 ] Variable-Workload variations
* Many variations of "execute all programs until":
  * X total instructions committed [Kumar03, Luo01, Tune04]
  * X instructions committed by one program [Cazorla04]
  * X instructions committed by every program [Raasch03, Yeh05]
  * X execution cycles have elapsed [Snavely00]
  * All programs fairly represented [Vera07, Ramirez07]
* All basically the same
  * Have same fundamental problems
  * Total of X instructions used in this talk/paper

[ 8 ] 2-way SMT
* Fixed
  * 250M insns from each program
  * 13% SMT speedup
  * 51% load imbalance
* Variable
  * 500M insns total
  * 0% imbalance (by construction)
  * 35% SMT speedup
* What is the real speedup? 13%? 35%? Something else?

[ 9 ] Variable-Workload: Danger!
* Results from different experiments not directly comparable
  * Different workload in each
* Skews workload to over-estimate throughput
  * Over-samples fast programs
* Skews workload to over-estimate speedup
  * Over-samples programs that slow down less due to contention
  * Fairness attempts to account for this [Gabr06]
  * How to synthesize SMT-speedup and fairness into real speedup?
[Figure: Experiment 1 with A: 3M and B: 7M; Experiment 2 with A: 4M and B: 6M]

[ 10 ] Fixing Fixed Workload?
* Many problems with variable workload methodologies
  * Incomparable experiments
  * Over-estimations of throughputs and speedups
  * Tells you what you want to hear
* Can we revive fixed-workload?
  * Load imbalance only significant problem
  * Very difficult to eliminate completely
  * But complete balance may not even be what we want

[ 11 ] Deconstructing Load Imbalance
* Fixed-workload runs experience two forms of imbalance
* Sample imbalance: different standalone runtimes
  * Artifact of finite experiments
  * Should be eliminated
  * Easy: choose samples with same standalone runtimes
* Schedule imbalance: asymmetric (unfair) contention
  * Characteristic of concurrent execution
  * Should be preserved, measured

[ 12 ] FIESTA
* FIESTA: Fixed-Instruction with Equal STAndalone runtimes
  * Run single-programs for C cycles, record insn count
  * Build fixed workloads from time-balanced samples
+ Eliminates sample imbalance
+ Remaining imbalance is schedule imbalance
  * Programs represented according to standalone performance
  * Corresponds to fair continuous multi-programming
[Figure: timelines with A: 5M and B: 7M, annotated "schedule imbalance"]

[ 13 ] 2-way SMT Reprise
* Fixed
  * 250M insns from each program
  * 13% speedup, 51% imbalance
* Variable
  * 500M insns total
  * 35% speedup, 0% imbalance
* FIESTA
  * 250M cycles from each program
  * 28% speedup, 21% imbalance
* Fixed has 30% sample imbalance

[ 14 ] The Rest of Our Methodology
* Processor configurations
  * 4-way superscalar, dynamically scheduled, 17-stage pipeline
  * 64KByte, 4-way I/D$, 2MByte, 8-way L2, 8 8-entry stream buffers
  * 400 cycle main memory, 16 outstanding misses
  * Up to 4 threads, ICOUNT, issue queue & stream buffers capped
* Eight SPEC2K benchmarks
  * ILP (mesa, vortex), branch (gcc, perl)
  * Memory latency (equake, mcf), memory bandwidth (art, swim)
* Workloads
  * 50 samples per benchmark, periodic starting points for samples
  * 28 2-thread workloads, 70 4-thread workloads

[ 15 ] Two Multi-Program Studies
* Same-architecture study: ICOUNT vs. Round-Robin
  * FIESTA is perfect for this!
  * All experiments share single-program baseline
  * FIESTA workload is sample-balanced (by construction) in all runs
* Cross-architecture study: SMT vs. RaT
  * Different experiments have different single-program baselines
  * No single FIESTA workload is sample-balanced in all runs
  * FIESTA not perfect but much better than anything else

[ 16 ] ICOUNT vs. Round-Robin
* SMT-speedup
  * Variable uniformly higher
* ICOUNT advantage
  * Variable, FIESTA agree
  * ICOUNT by 7%
* Workload composition
  * Danger of Variable
  * Workloads differ by 10%

[ 17 ] Cross-Architecture Studies
* Example: SMT vs. RaT (Runahead Threads) [Ramirez08]
  * SMT baseline is ROB, RaT baseline is Runahead (RA) [Mutlu03]
  * ROB workload sample-unbalanced on RaT, vice versa
* Well, cross-architecture sample imbalance is not as bad as you might think
  * FIESTA can be used to provide tight bound in these cases

[ 18 ] Cross-Architecture Sample Imbalance
* Sample imbalance
  * Fixed: 30%
  * FIESTA: 0%
  * FIESTA-RA: 2%
  * FIESTA-2K-D$: 1%
  * FIESTA-2wide: 9%
    * 30% lower IPC
* Surprisingly small
  * Single change typically affects all programs in same direction
    * Both programs accelerate by 2X? Imbalance still 0%
  * Architecture changes typically smaller in magnitude (1.13X) than a priori program performance differences (215X)

[ 19 ] SMT vs. RaT
* First: RA only 5% faster than ROB
  * No RA/SMT synergy, some overlap
* Variable: RaT by 11% (unlikely)
  * Over-samples RA-happy programs
* Fixed: RaT by 6% (maybe?)
  * RA fixes sample imbalance
  * Exposes existing MT speedups
* FIESTA: 14% (confirms intuition)
* Upshot: any FIESTA workload better than Fixed or Variable
  * Known direction of error from architectural change
  * Use both FIESTA workloads: tight range of results

[ 20 ] Other Issues (Future Work)
* Representativeness of individual programs
  * Being time-based, FIESTA will over-sample fast regions
  * Potential solution: time-based SimPoint [Perelman03]
    * Find representative sample that runs for C cycles
* Multi-threaded applications
  * Should work
  * FIESTA will ignore inter-thread imbalance, consider entire program

[ 21 ] Conclusions
* Prevailing multi-program studies use variable workloads
  * Introduced to avoid load imbalance problems of fixed workloads
  * Have their own more subtle (and sinister) problems
    * Direct comparisons impossible (but made repeatedly anyway)
    * Tells you what you want to hear
    * Fairness can't account for this
* FIESTA: sample-balanced fixed multi-program workloads
  * Eliminates sample-imbalance artifacts (different standalone runtimes)
  * Preserves schedule-imbalance characteristics (unfair contention)
+ Direct comparisons using any metric, unskewed results
+ Time-based but works for cross-architecture studies
* Spread the word!

[ 22 ]

[ 23 ] Contention: The Key Measure?
* Multi-program speedups, proxy for contention
  * No contention? 100% speedup (2 programs)
* Fixed
  * Sample imbalance reduces speedups without contention
* Variable
  * Allows asymmetric contention to disappear, without affecting speedup
* FIESTA
  * Same architecture? Speedups correspond exactly to contention
  * Different architecture? Very small sample imbalance
  * Speedup/contention relation closer than anything else
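
The SMT-speedup arithmetic on slide 4 can be checked with a short Python sketch. The `smt_speedup` function follows the slide's formula directly. The `load_imbalance` function is an assumption: the slides describe load imbalance only as "time spent executing only slowest programs", so the sketch takes it to be the fraction of the concurrent run during which at least one program has already finished. The function names and the finish times passed to `load_imbalance` are hypothetical.

```python
def smt_speedup(standalone_times, multi_time):
    """Slide 4 metric: sum of standalone runtimes over the multi-program runtime."""
    return sum(standalone_times) / multi_time


def load_imbalance(finish_times):
    """Assumed definition (not spelled out in the slides): fraction of the
    concurrent run during which at least one program has already finished."""
    return 1.0 - min(finish_times) / max(finish_times)


# Slide 4 example: T_1 = 1, T_2 = 2, T_{1+2} = 2.
speedup = smt_speedup([1.0, 2.0], 2.0)
print(f"SMT-speedup = {speedup:.2f}, i.e. {(speedup - 1) * 100:.0f}%")

# Hypothetical finish times for a two-program run: A ends at t=1, B at t=2.
print(f"load imbalance = {load_imbalance([1.0, 2.0]):.0%}")
```

With these inputs the metric is 1.5, which is the 50% figure on the slide: program B alone accounts for the whole concurrent runtime, so adding A contributes only its own standalone time to the numerator.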

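The sample-construction recipe on slide 12 (run each program standalone for C cycles, record its instruction count, then build fixed workloads from those time-balanced samples) might be sketched as follows. This is a minimal illustration, not the authors' tooling: the IPC values are made up, `record_insn_counts` stands in for a real standalone simulation, and enumerating all unordered benchmark combinations is an assumption, made because it reproduces the 28 two-thread and 70 four-thread workload counts on slide 14.

```python
from itertools import combinations

# Hypothetical standalone IPCs for the eight SPEC2K benchmarks on slide 14.
# A real study would measure instruction counts in a simulator instead.
STANDALONE_IPC = {
    "mesa": 1.8, "vortex": 1.6, "gcc": 1.1, "perl": 1.2,
    "equake": 0.6, "mcf": 0.2, "art": 0.4, "swim": 0.5,
}


def record_insn_counts(standalone_ipc, cycles):
    """Stand-in for 'run single-programs for C cycles, record insn count'."""
    return {name: round(ipc * cycles) for name, ipc in standalone_ipc.items()}


def build_workloads(insn_counts, n_threads):
    """Fixed workloads: every unordered combination of n_threads benchmarks,
    each program contributing the instructions it commits in C standalone cycles."""
    return [
        {name: insn_counts[name] for name in combo}
        for combo in combinations(sorted(insn_counts), n_threads)
    ]


C = 250_000_000  # cycles per sample, as in the FIESTA row of slide 13
insn_counts = record_insn_counts(STANDALONE_IPC, C)
two_thread = build_workloads(insn_counts, 2)
four_thread = build_workloads(insn_counts, 4)
print(len(two_thread), len(four_thread))
```

Because every sample has the same standalone runtime by construction, any imbalance observed when one of these workloads runs concurrently is schedule imbalance rather than sample imbalance.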