
  • Folklore Confirmed: Compiling for Speed = Compiling for Energy

    Tomofumi Yuki (INRIA, Rennes) and Sanjay Rajopadhye (Colorado State University)


  • Exa-Scale Computing

    - Reach 10^18 FLOP/s by the year 2020
    - Energy is the key challenge
      - Roadrunner (1 PFLOP/s): 2 MW
      - K (10 PFLOP/s): 12 MW
      - Exa-scale (1000 PFLOP/s): 100s of MW?
    - Need 10-100x energy efficiency improvements
    - What can we do as compiler designers?


  • Energy = Power × Time

    - Most compilers cannot touch power
      - Going as fast as possible is then energy-optimal
    - Also called the “race-to-sleep” strategy
    - Dynamic Voltage and Frequency Scaling (DVFS)
      - One knob available to compilers
      - Controls voltage/frequency at run-time
      - Higher voltage, higher frequency
      - Higher voltage, higher power consumption

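    A minimal sketch (Python; the numbers are toy values, not from the slides) of why “race-to-sleep” can win when part of the platform's power cannot be scaled down: finishing quickly and then idling is compared against stretching the same work over the whole interval at a lower frequency.

      # Toy comparison of "race-to-sleep" vs. running slower for the whole
      # interval. All numbers are illustrative, not measurements.

      def energy_race_to_sleep(work_s, cpu_w, base_w, idle_w, window_s):
          """Run at full speed for work_s seconds, then idle for the rest."""
          busy = work_s * (cpu_w + base_w)
          idle = (window_s - work_s) * idle_w
          return busy + idle

      def energy_slow_down(work_s, cpu_w, base_w, window_s):
          """Stretch the same work over the whole window at a lower frequency;
          assume CPU power drops quadratically with the slowdown factor."""
          slowdown = window_s / work_s          # >= 1
          scaled_cpu_w = cpu_w / slowdown**2    # optimistic quadratic saving
          return window_s * (scaled_cpu_w + base_w)

      if __name__ == "__main__":
          # 10 s of work in a 20 s window; 50 W CPU, 50 W constant system
          # power, 10 W in deep sleep.
          fast = energy_race_to_sleep(10, cpu_w=50, base_w=50, idle_w=10, window_s=20)
          slow = energy_slow_down(10, cpu_w=50, base_w=50, window_s=20)
          print(f"race-to-sleep: {fast:.0f} J   slow-down: {slow:.0f} J")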

  • Can you slow down for better energy efficiency?

    - Yes, in theory
      - Voltage scaling:
        - Linear decrease in speed (frequency)
        - Quadratic decrease in power consumption
      - Hence, going slower is better for energy
    - No, in practice
      - System power dominates
      - Savings in the CPU are cancelled by other components
        - CPU dynamic power is around 30%


  • Our Paper

    - Analysis based on a high-level energy model
      - Emphasis on the power breakdown
      - Find when “race-to-sleep” is the best
      - Survey the power breakdown of recent machines
    - Goal: confirm that sophisticated use of DVFS by compilers is not likely to help much
      - e.g., analysis/transformation to find/expose a “sweet spot” for trading speed with energy


  • Outline

    - Proposed Model (No Equations!)
      - Power Breakdown
      - Ratio of Powers
      - When “race-to-sleep” works
    - Survey of Machines
    - DVFS for Memory
    - Conclusion


  • Power Breakdown

    - Dynamic (Pd): consumed when bits flip
      - Quadratic savings as voltage scales
    - Static (Ps): leaked while current is flowing
      - Linear savings as voltage scales
    - Constant (Pc): everything else
      - e.g., memory, motherboard, disk, network card, power supply, cooling, …
      - Little or no effect from voltage scaling

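    A minimal sketch of this breakdown, assuming the scaling behaviour stated above (quadratic for Pd, linear for Ps, none for Pc); the wattages are placeholders, not measurements from the paper.

      # How each component of the power breakdown responds when voltage
      # (and hence frequency) is scaled to a fraction s of nominal.

      def scaled_components(s, p_dynamic, p_static, p_constant):
          assert 0.0 < s <= 1.0
          return p_dynamic * s**2, p_static * s, p_constant

      if __name__ == "__main__":
          pd, ps, pc = 40.0, 40.0, 80.0   # placeholder watts
          d, st, c = scaled_components(0.8, pd, ps, pc)
          print(f"at s = 0.8: Pd {pd:.0f} -> {d:.1f} W, "
                f"Ps {ps:.0f} -> {st:.1f} W, Pc {pc:.0f} -> {c:.1f} W")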

  • Influence on Execution Time

    - Voltage and frequency are linearly related
      - Slope is less than 1
      - i.e., scaling voltage by half, frequency drops by less than half
    - Simplifying assumptions
      - Frequency changes directly influence execution time
      - Scale frequency by x, time becomes 1/x
      - Fully flexible (continuous) scaling
        - Small set of discrete states in practice

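    Putting the breakdown and the timing assumption together gives the minimal energy model sketched below; this is one reading of the slides' simplifications (time scales as 1/s, power as Pd·s^2 + Ps·s + Pc), not the paper's exact formulation.

      # Energy = Power x Time under the simplifying assumptions above:
      # scaling voltage/frequency by s multiplies execution time by 1/s.

      def energy(s, p_dynamic, p_static, p_constant, time_at_full_speed=1.0):
          power = p_dynamic * s**2 + p_static * s + p_constant
          time = time_at_full_speed / s
          return power * time   # = T * (Pd*s + Ps + Pc/s)

      if __name__ == "__main__":
          pd, ps, pc = 40.0, 40.0, 80.0   # placeholder watts
          for s in (1.0, 0.8, 0.6):
              print(f"s = {s:.1f}: energy = {energy(s, pd, ps, pc):6.1f} J")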

  • Ratio is the Key

    How energy moves when we scale down depends on the ratio Pd : Ps : Pc:

    Case                             Power  Time  Energy  Conclusion
    Case 1: Dynamic (Pd) dominates   ↓↓↓    ↑     ↓↓      Slower the better
    Case 2: Static (Ps) dominates    ↓      ↑     →       No harm, but no gain
    Case 3: Constant (Pc) dominates  →      ↑     ↑       Faster the better
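
    Under the same simplified model (a sketch of one reading of the slides, not the paper's exact equations), the three cases can be checked numerically: slow down slightly and see which way the energy moves for each dominance pattern.

      # Check the three cases numerically under the simplified model:
      # slow down slightly from full speed and see how energy responds
      # when each component dominates.

      def energy(s, pd, ps, pc, t_full=1.0):
          return (pd * s**2 + ps * s + pc) * t_full / s

      if __name__ == "__main__":
          cases = {                       # placeholder Pd, Ps, Pc in watts
              "Case 1 (dynamic dominates)":  (90.0, 5.0, 5.0),
              "Case 2 (static dominates)":   (2.0, 96.0, 2.0),
              "Case 3 (constant dominates)": (5.0, 5.0, 90.0),
          }
          for name, (pd, ps, pc) in cases.items():
              change = energy(0.9, pd, ps, pc) / energy(1.0, pd, ps, pc) - 1.0
              if abs(change) < 0.005:
                  trend = "roughly flat"
              else:
                  trend = "down" if change < 0 else "up"
              print(f"{name}: energy {trend} ({change:+.1%}) at s = 0.9")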

  • When do we have Case 3?

    - Static power is now more than dynamic power
      - Power gating doesn't help when computing
    - Assume Pd = Ps
      - 50% of CPU power is due to leakage
      - Roughly matches 45nm technology
      - Further shrink = even more leakage
    - The borderline is when Pd = Ps = Pc
      - We have Case 3 when Pc is larger than Pd = Ps

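    A quick check under the simplified model sketched earlier (one reading of the slides, not the paper's exact derivation): energy per run is E(s) = T·(Pd·s + Ps + Pc/s), so dE/ds = T·(Pd − Pc/s^2), whose sign at full speed (s = 1) is that of Pd − Pc, and which stays non-positive for every s ≤ 1 whenever Pc ≥ Pd. With the assumption Pd = Ps, running at full speed therefore minimizes energy exactly when Pc is at least Pd = Ps, which matches the borderline stated above.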

  • Extensions to The Model

    - Impact on execution time
      - May not be directly proportional to frequency
      - Shifts the borderline in favor of DVFS
      - Larger Ps and/or Pc required for Case 3
    - Parallelism
      - No influence on the result
      - CPU power is even less significant than in the 1-core case
        - Power budget for a chip is shared (multi-core)
        - Network cost is added (distributed)


  • Outline

    - Survey of Machines
      - Pc in Current Machines
      - Desktop and Servers
      - Cray Supercomputers
    - DVFS for Memory
    - Conclusion


  • Do we have Case 3?

    - Survey of machines and the significance of Pc
    - Based on:
      - Published power budgets (TDP)
      - Published power measurements
      - Not on detailed/individual measurements
    - Conservative assumptions
      - Use an upper bound for CPU power
      - Use a lower bound for constant powers
      - Assume high PSU efficiency


  • Pc in Current Machines

    - Sources of constant power
      - Stand-by memory (1 W/GB)
        - Memory cannot go idle while the CPU is working
      - Power supply unit (10-20% loss)
        - Transforming AC to DC
      - Motherboard (6 W)
      - Cooling fan (10-15 W)
        - Fully active when the CPU is working
    - Desktop processor TDP ranges from 40-90 W
      - Up to 130 W for large core counts (8 or 16)


  • Server and Desktop Machines

    - Methodology
      - Compute a lower bound of Pc
      - Does it exceed 33% of total system power?
      - Then Case 3 holds even if the rest were all consumed by the processor
    - System load
      - Desktop: compute-intensive benchmarks
      - Server: server workloads (not as compute-intensive)

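    A sketch of this check (Python), plugging in the per-component figures from the previous slide; the machine used in the example (total power, memory size) is hypothetical, not one of the surveyed systems.

      # Lower-bound the constant power Pc of a machine and test whether it
      # exceeds one third of the measured total power under load; if so,
      # Case 3 holds even if all remaining power went to the processor.

      def constant_power_lower_bound(mem_gb, total_w, psu_loss_frac=0.10,
                                     motherboard_w=6.0, fan_w=10.0):
          memory_w = 1.0 * mem_gb          # ~1 W per GB of stand-by DRAM
          psu_w = psu_loss_frac * total_w  # AC-to-DC conversion loss
          return memory_w + psu_w + motherboard_w + fan_w

      if __name__ == "__main__":
          # Hypothetical server: 300 W under load, 64 GB of memory.
          total_w, mem_gb = 300.0, 64
          pc = constant_power_lower_bound(mem_gb, total_w)
          print(f"Pc >= {pc:.0f} W ({pc / total_w:.0%} of total); "
                f"Case 3: {pc / total_w > 1/3}")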

  • Desktop and Server Machines

    [Figure: “Constant Power Trends in Recent Processors”: four panels plotting constant power / total power under load (0.0 to 0.6) against year (2007 to 2012), for desktop (individual data points), server (individual data points), desktop (mean), and server (mean).]


  • Cray Supercomputers

    - Methodology
      - Let Pd + Ps be the sum of the processors' TDPs
      - Let Pc be the sum of:
        - PSU loss (5%)
        - Cooling (10%)
        - Memory (1 W/GB)
      - Check if Pc exceeds Pd = Ps
      - Two cases for memory configuration (min/max)

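    A sketch of the same bookkeeping for a Cray-style system (Python); the node count, TDP, and memory sizes below are placeholders, not the actual XT5/XT6/XE6 configurations, and PSU loss and cooling are charged here as fractions of the CPU-plus-memory power, which may differ from the paper's accounting.

      # Split each processor's TDP evenly into dynamic and static power
      # (Pd = Ps), then check whether the constant power Pc (PSU loss +
      # cooling + stand-by memory) exceeds that static share.

      def cray_case3(n_nodes, cpu_tdp_w, mem_gb_per_node,
                     psu_loss=0.05, cooling=0.10):
          pd = ps = 0.5 * cpu_tdp_w * n_nodes          # Pd = Ps assumption
          memory_w = 1.0 * mem_gb_per_node * n_nodes   # ~1 W per GB
          pc = memory_w + (psu_loss + cooling) * (pd + ps + memory_w)
          return pc, ps, pc > ps

      if __name__ == "__main__":
          for mem in (16, 64):     # placeholder min/max memory per node
              pc, ps, case3 = cray_case3(n_nodes=100, cpu_tdp_w=115.0,
                                         mem_gb_per_node=mem)
              print(f"{mem:>2} GB/node: Pc = {pc:6.0f} W, "
                    f"Pd = Ps = {ps:6.0f} W, Case 3: {case3}")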

  • Cray Supercomputers

    [Figure: stacked bars showing the power breakdown (0% to 100%) of XT5 (min), XT5 (max), XT6 (min), XT6 (max), XE6 (min), and XE6 (max) configurations, split into CPU-dynamic, CPU-static, Memory, PSU+Cooling, and Other.]


  • Outline

    - DVFS for Memory
      - Changes to the model
      - Influence on “race-to-sleep”
    - Conclusion


  • DVFS for Memory (from TR version)

    - Still in the research stage (since ~2010)
      - Same principle applied to memory
      - Quadratic component in power w.r.t. voltage
      - 25% quadratic, 75% linear
    - The model can be adapted:
      - Pd becomes Pq (dynamic → quadratic)
      - Ps becomes Pl (static → linear)
    - The same story, but with Pq : Pl : Pc


  • Influence on “race-to-sleep”

    - Methodology
      - Move memory power from Pc to Pq and Pl
        - 25% to Pq and 75% to Pl
    - Pc becomes 15% of total power for server/Cray
      - “race-to-sleep” may not be the best anymore
    - Pc remains around 30% for desktop
    - Vary the Pq : Pl ratio to find when “race-to-sleep” is the winner again
      - Leakage is expected to keep increasing

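    A sketch of the re-split described above (Python): memory power is taken out of Pc, with 25% of it treated as quadratically scaling (Pq) and 75% as linearly scaling (Pl); the starting breakdown is illustrative, not a measured one.

      # Move memory power out of the constant component: 25% of it joins
      # the quadratically scaling share (Pq), 75% the linearly scaling
      # share (Pl), and only the rest stays constant (Pc).

      def resplit_with_memory_dvfs(p_dynamic, p_static, p_memory, p_other):
          pq = p_dynamic + 0.25 * p_memory
          pl = p_static + 0.75 * p_memory
          pc = p_other
          return pq, pl, pc

      if __name__ == "__main__":
          # Illustrative server-like breakdown, as fractions of total power.
          pq, pl, pc = resplit_with_memory_dvfs(p_dynamic=0.30, p_static=0.30,
                                                p_memory=0.25, p_other=0.15)
          print(f"Pq = {pq:.2f}, Pl = {pl:.2f}, Pc = {pc:.2f}")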

  • When “Race to Sleep” is optimal

    - When the derivative of energy w.r.t. scaling is > 0


    [Figure: “When should we ‘Race to Sleep’”: dE/dF at the highest frequency plotted against the linearly scaling fraction Pl / (Pq + Pl) (0.0 to 0.8), with curves for Server/Cray and Desktop and the borderline dE/dF > 0 marked.]
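
    Under the simplified model reconstructed earlier (and only under it; the paper's exact derivative and sign convention may differ), energy per run is proportional to Pq·s + Pl + Pc/s, so slowing down slightly from the highest frequency pays off exactly when Pq exceeds Pc. The sketch below sweeps the linearly scaling fraction Pl / (Pq + Pl), as in the figure, for the constant-power shares quoted on the previous slides.

      # Sweep the linearly scaling fraction x = Pl / (Pq + Pl) for a fixed
      # constant-power share and report whether a slight slowdown from the
      # highest frequency reduces energy (E(s) ~ Pq*s + Pl + Pc/s, s = 1).

      def slowing_down_helps(x, pc_share):
          pq = (1.0 - pc_share) * (1.0 - x)   # quadratically scaling share
          return pq > pc_share                # sign of dE/ds at s = 1

      if __name__ == "__main__":
          for label, pc_share in (("server/Cray", 0.15), ("desktop", 0.30)):
              for x in (0.2, 0.4, 0.6, 0.8):
                  verdict = ("slowing down helps" if slowing_down_helps(x, pc_share)
                             else "race-to-sleep wins")
                  print(f"{label:11s}  Pl/(Pq+Pl) = {x:.1f}: {verdict}")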

  • Outline

    - Conclusion


  • Summary and Conclusion

    - Diminishing returns of DVFS
      - Main reason is leakage power
      - Confirmed by a high-level energy model
      - “race-to-sleep” seems to be the way to go
      - Memory DVFS won't change the big picture
    - Compilers can continue to focus on speed
      - No significant gain in energy efficiency from sacrificing speed


  • Balancing Computation and I/O

    - DVFS can improve energy efficiency
      - when speed is not sacrificed
    - Bring the program to a compute-I/O balanced state
      - If it's memory-bound, slow down the CPU
      - If it's compute-bound, slow down the memory
    - Still maximizing hardware utilization
      - but by lowering the hardware capability
    - Current hardware (e.g., Intel Turbo Boost) and/or the OS already do this for the processor

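    A sketch of the balancing idea above (a hypothetical heuristic, not the mechanism described in the paper): compare the program's compute-to-memory ratio against the machine's balance point and slow down whichever side would otherwise wait.

      # Hypothetical balancing heuristic: slow down whichever side is waiting.
      # "intensity" is the program's operations per byte of memory traffic;
      # "machine_balance" is the ratio the hardware sustains at nominal speed.

      def choose_knob(intensity, machine_balance):
          if intensity < machine_balance:
              return "memory-bound: lower the CPU frequency"
          if intensity > machine_balance:
              return "compute-bound: lower the memory frequency"
          return "balanced: keep both at nominal frequency"

      if __name__ == "__main__":
          print(choose_knob(intensity=0.5, machine_balance=4.0))   # stream-like
          print(choose_knob(intensity=32.0, machine_balance=4.0))  # matmul-like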

  • Thank you!
