Power Consumption in Silicon Chips (IRISA)
Olivier Sentieys
IRISA/INRIA, Cairn team, University of Rennes 1
[email protected]
Power Consumption in Silicon Chips
• Chips, logic gates and transistors
Intel’s Xeon Chip
[Figure: logic gate with inputs A and B, output S, driving a load capacitance Ci.]
$P_i = \alpha_i \cdot f_i \cdot C_i \cdot V_{dd}^2 + I_{leak_i} \cdot V_{dd}$

$P = \sum_i \left[ \alpha_i \cdot f_i \cdot C_i \cdot V_{dd}^2 + I_{leak_i} \cdot V_{dd} \right]$
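The summation above can be evaluated directly once the per-block parameters are known. Below is a minimal sketch in Python; all activity, capacitance, frequency and leakage values are made-up placeholders chosen only to illustrate the formula, not measured figures.

```python
# Total power: P = sum_i (alpha_i * f_i * C_i * Vdd^2 + Ileak_i * Vdd)
# All numbers are illustrative assumptions.

VDD = 1.0  # supply voltage (V)

# (activity alpha_i, clock frequency f_i in Hz, switched capacitance C_i in F, leakage Ileak_i in A)
blocks = [
    (0.15, 1e9, 2e-12, 5e-6),   # e.g. a datapath block
    (0.05, 1e9, 8e-12, 2e-5),   # e.g. caches (low activity, high leakage)
    (0.30, 1e9, 1e-12, 1e-6),   # e.g. clock tree and control
]

p_dyn = sum(a * f * c * VDD**2 for a, f, c, il in blocks)
p_leak = sum(il * VDD for a, f, c, il in blocks)

print(f"dynamic: {p_dyn*1e3:.2f} mW, leakage: {p_leak*1e3:.2f} mW, "
      f"total: {(p_dyn + p_leak)*1e3:.2f} mW")
```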
Outline
• Power and energy consumption basics
• Power consumption in processors
• Multicore: power and utilization walls
• Energy advantages of hardware accelerators
• Playing with accuracy for reducing energy
• Towards heterogeneous manycores
  – 3D stacking
  – Optical interconnect
The Basic Element: the Transistor
• Transistor as a switch
• Vg > Vt: NMOS on – resistance RDS
• Vg < Vt: NMOS off – leakage Ioff
[Figure: NMOS symbol (gate G, source S, drain D) and its Id vs. Vg characteristic; Vt is the threshold voltage, Ioff the off-state leakage current.]
• Gate: capacitance CG
• Switch: resistance RDS

$R_{DS} \propto \dfrac{1}{V_{dd} - V_t}$
Transistors
• Bulk CMOS
• FD-SOI CMOS – STMicroelectronics – Vt tuning
[Figure: bulk MOSFET cross-section (gate length L, width W, oxide thickness tox, N/P diffusions, channel, source/gate/drain) next to the FDSOI/UTBB structure with a thin silicon film on a buried oxide. UTBB-SOI: Ultra-Thin Body (FD) and BOX SOI. Source: STMicroelectronics Technology R&D, February 2012.]
Planar FDSOI Transistor Advantages (STMicroelectronics)
• Total dielectric isolation
  – Lower S/D capacitances
  – Lower S/D leakage
  – Latch-up immunity
• Ultra-thin body (TSi ≈ 1/3 LG)
  – Excellent short-channel immunity (low SCE, DIBL)
  – No channel doping, no pocket implant
  – Improved VT variation
• Ultra-thin BOX option
  – Back-bias control
• Ground-plane implantation
  – VT adjustment
[Figure: thin silicon channel.]
Transistors
• Intel FinFET: transistors go 3D
Gates
• Delay of a gate
• Dynamic power
• Leakage power
[Figure: 2-input NAND gate (inputs A, B, output S) driving a load capacitance Ci.]
Truth table:
A B | S
0 0 | 1
0 1 | 1
1 0 | 1
1 1 | 0
$P_{stat_i} = N \cdot I_{off} \cdot V_{dd}$

$\text{Delay} \propto R_{DS} \cdot C_i \propto \dfrac{\text{relative fan-out}}{V_{dd} - V_t}$

$P_{dyn_i} = \alpha_i \cdot f_{clk} \cdot C_i \cdot V_{dd}^2$
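As a rough numerical sketch of these three relations (not a calibrated gate model: the delay constant, fan-out, capacitance, device count and Ioff below are assumptions), one can compare a gate at two supply voltages:

```python
# Delay ~ RDS*Ci ~ fan-out/(Vdd - Vt);  Pdyn = alpha*fclk*Ci*Vdd^2;  Pstat = N*Ioff*Vdd
# Constants are arbitrary placeholders chosen only to show the trends.

VT = 0.3          # threshold voltage (V)
K_DELAY = 1e-10   # lumps RDS*Ci into one constant (s.V)
CI = 2e-15        # switched load capacitance (F)
ALPHA = 0.2       # switching activity
FCLK = 1e9        # clock frequency (Hz)
N = 1_000_000     # device count
IOFF = 1e-9       # leakage per device (A)

def gate_metrics(vdd, fanout=4):
    delay = K_DELAY * fanout / (vdd - VT)   # seconds
    p_dyn = ALPHA * FCLK * CI * vdd**2      # watts
    p_stat = N * IOFF * vdd                 # watts
    return delay, p_dyn, p_stat

for vdd in (1.2, 0.9):
    d, pd, ps = gate_metrics(vdd)
    print(f"Vdd={vdd:.1f} V: delay={d*1e12:.0f} ps, Pdyn={pd*1e6:.2f} uW, Pstat={ps*1e3:.2f} mW")
```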
Activity
• Activity αi is the probability of a 0→1 transition at the output of a gate
• Example: AND gate
  – PS = P(S=1) = PA·PB
  – αi = PS·(1 − PS)
• Activity propagation
AND truth table:
A B | S
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1
[Figure: AND gate with PA = PB = 1/2, giving PS = 1/4 and α = 3/16; a second example propagates probabilities 1/2, 1/4, 1/8 through a two-gate network with inputs A, B, C, intermediate signal X and output S.]
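A small sketch of the activity computation for this example, assuming zero-delay gates and independent, equiprobable inputs (exactly the simplifications questioned on the next slide):

```python
def p_and(pa, pb):
    """Probability that a 2-input AND gate outputs 1 (independent inputs)."""
    return pa * pb

def activity(p1):
    """Probability of a 0 -> 1 transition: alpha = P(1) * (1 - P(1))."""
    return p1 * (1.0 - p1)

# Single AND gate with PA = PB = 1/2: PS = 1/4, alpha = 3/16
ps = p_and(0.5, 0.5)
print(ps, activity(ps))        # 0.25 0.1875

# Propagation through a second AND gate: X = A.B, S = X.C
px = p_and(0.5, 0.5)           # 1/4
ps2 = p_and(px, 0.5)           # 1/8
print(ps2, activity(ps2))      # 0.125 0.109375
```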
Propagating Activity is not So Simple
• Conditional probabilities
• Glitches: gate delay
  – Significant in arithmetic
[Figure: two gate networks with signal probabilities 1/2, 1/4, 1/8 and 1/2, 1/4, 1/4, illustrating the effect of conditional probabilities; waveforms (input patterns 110 and 101) showing a glitch caused by unequal gate delays.]
Dynamic Power vs. Performance
• Decreasing Vdd reduces power but increases delay
[Figure: relative delay td and relative dynamic power Pdyn as functions of the supply voltage Vdd (0.8 V to 2.4 V): delay rises steeply as Vdd approaches Vt, while Pdyn drops quadratically.]

$P_{dyn_i} = \alpha_i \cdot f_{clk} \cdot C_i \cdot V_{dd}^2$

$\text{Delay} \propto \dfrac{1}{V_{dd} - V_t}$
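A quick numerical sketch of the trade-off in the plot, normalised to Vdd = 2.4 V (the threshold voltage below is an assumption):

```python
# Delay ~ 1/(Vdd - Vt) and Pdyn ~ Vdd^2, with alpha, fclk and Ci held constant.
VT = 0.5     # assumed threshold voltage (V)
VREF = 2.4   # reference supply used for normalisation (V)

def rel_delay(vdd):
    return (VREF - VT) / (vdd - VT)

def rel_pdyn(vdd):
    return (vdd / VREF) ** 2

for vdd in (2.4, 2.0, 1.6, 1.2, 0.8):
    print(f"Vdd={vdd:.1f} V: delay x{rel_delay(vdd):4.1f}, Pdyn x{rel_pdyn(vdd):.2f}")
```

Dropping Vdd from 2.4 V to 0.8 V cuts dynamic power by roughly 9x but stretches the delay by about 6x, which is the shape of the two curves in the figure.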
Power Dissipation and Circuit Delay
[Figure: surface plots of power (of the order of 10^-4 W) and delay (of the order of 10^-10 s) as two circuit parameters, labelled A and B, are varied.]
Leakage vs. Performance
• High performance
• Low leakage
[Figure: Id vs. Vg curves for a low-Vt device (high performance, high Ioff) and a high-Vt device (low leakage, low Ioff).]

$\text{Delay} \propto \dfrac{1}{V_{dd} - V_t}$

$P_{stat_i} = N \cdot I_{off} \cdot V_{dd}$

Ioff is:
• exponential in the inverse of Vt
• exponential in temperature
• linear in device count
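The three Ioff dependencies can be sketched with a textbook subthreshold-leakage expression, Ioff ≈ I0·exp(−Vt/(n·kT/q)). The pre-factor I0 and slope factor n below are assumptions, not values from the slides.

```python
import math

K_OVER_Q = 8.617e-5   # Boltzmann constant over electron charge (V/K)
I0 = 1e-6             # assumed leakage pre-factor per device (A)
N_SLOPE = 1.5         # assumed subthreshold slope factor

def ioff(vt, temp_k):
    """Subthreshold leakage per device: exponential in -Vt and in temperature."""
    return I0 * math.exp(-vt / (N_SLOPE * K_OVER_Q * temp_k))

def p_static(n_devices, vt, temp_k, vdd):
    """Pstat = N * Ioff * Vdd: linear in device count."""
    return n_devices * ioff(vt, temp_k) * vdd

for vt in (0.25, 0.35, 0.45):
    for t in (300, 360):
        print(f"Vt={vt:.2f} V, T={t} K: Pstat = {p_static(1e8, vt, t, 1.0):.2e} W")
```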
Minimum Energy per Operation
• Putting it all together
[Source: Intel]
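Putting it all together numerically: lowering Vdd shrinks the dynamic energy per operation quadratically, but each operation then takes longer, so the leakage current is integrated over a longer time and the static energy per operation grows. The sum therefore has a minimum at some intermediate Vdd. A minimal sketch (every constant below is an assumption chosen only to make the U-shaped curve visible):

```python
# E(Vdd) = Edyn + Estat, with Edyn = C*Vdd^2 and Estat = Ileak * Vdd * Top,
# where the operation time Top ~ 1/(Vdd - Vt) grows as Vdd is lowered.
C = 1e-12     # switched capacitance per operation (F), assumed
ILEAK = 1e-4  # total leakage current (A), assumed
K_T = 1e-9    # delay constant (s.V), assumed
VT = 0.3      # threshold voltage (V), assumed

def energy_per_op(vdd):
    e_dyn = C * vdd**2
    e_stat = ILEAK * vdd * (K_T / (vdd - VT))
    return e_dyn + e_stat

vdds = [0.40 + 0.05 * i for i in range(17)]   # 0.40 V .. 1.20 V
for v in vdds:
    print(f"Vdd={v:.2f} V: E = {energy_per_op(v)*1e12:.2f} pJ")
print("minimum near Vdd =", min(vdds, key=energy_per_op))
```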
On-Chip Interconnect?
• Gate delay decreases but… wire delay increases
• Crossing the chip takes 5-10 clock cycles
• Also affected by noise…
• Metal layers to reduce wire delay
• Repeaters
• Towards networks-on-chip
Conclusion: Power in CMOS
• Dynamic power
  – 40-70% today
  – Decreasing relatively
  – DVFS becomes more and more difficult
• Leakage power
  – 20-50% today
  – Increasing rapidly with the number of transistors and Vdd/Vt scaling
  – Critical for memory
$P = \text{energy per operation} \times \text{rate} + \text{static power}$

$P = \sum_i \left[ \alpha_i \cdot f_i \cdot C_i \cdot V_{dd}^2 + I_{leak_i} \cdot V_{dd} \right]$
Outline
• Power and energy consumption basics
• Power consumption in processors
• Multicore: power and utilization walls
• Energy advantages of hardware accelerators
• Playing with accuracy for reducing energy
• Towards heterogeneous manycores
  – 3D stacking
  – Optical interconnect
Power Consumption in Processors
• A typical (yet simple) processor pipeline
[Figure: the classic five-stage DLX pipeline (IF, ID, EX, MEM, WB) with IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers, reproduced from Hennessy & Patterson, Chapter 3 (Figure 3.4): pipeline registers convey values and control information from one stage to the next, and the PC is written in the IF stage.]
Energy Cost in a Processor
MIPS processor: 91 pJ/instruction
[Figure: per-instruction energy breakdown – I-cache 23%, Fetch/Decode 19%, Reg. File 14%, Datapath 38%, D-cache 6%.]
Energy Cost in a Processor
• Fetching operands costs more than computing
[Figure: energy per operation in 28 nm CMOS – 64-bit DP operation: 20 pJ; 256-bit access to an 8 kB SRAM: 50 pJ; 256-bit on-chip buses: 26 pJ, 256 pJ, up to 1 nJ depending on distance; efficient off-chip link: 500 pJ; DRAM Rd/Wr: 16 nJ. Source: Dally, IPDPS’11.]
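To put these numbers in perspective, here is a small sketch that expresses each energy figure from the slide as a multiple of one 64-bit DP operation. The values are copied from the slide; the distance labels on the three bus figures are an assumption about which number goes with which bus length.

```python
# Energy per event in 28 nm CMOS (pJ), as quoted on the slide (Dally, IPDPS'11).
energy_pj = {
    "64-bit DP operation":               20,
    "8 kB SRAM, 256-bit access":         50,
    "256-bit bus (short, assumed)":      26,
    "256-bit bus (long, assumed)":       256,
    "256-bit bus (cross-chip, assumed)": 1000,
    "efficient off-chip link":           500,
    "DRAM read/write":                   16000,
}

op = energy_pj["64-bit DP operation"]
for name, e in energy_pj.items():
    print(f"{name:35s} {e:7.0f} pJ = {e / op:6.1f}x a DP operation")
```

Even under these assumptions, fetching a word from DRAM costs on the order of 800 times more energy than the arithmetic performed on it, which is the point of the slide.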
Energy Cost in a Memory
• An L2 cache contains 4 million SRAM cells
• Rows/columns of 2000 cells
[Figure: memory array organization – a storage-cell array addressed by a row decoder driving the word lines and a column decoder with sense amplifiers/drivers on the bit lines, the address bits A0…AL-1 being split between the two decoders, with an M-bit input/output.]
The Energy Cost of Data Movement
• Future processors: up to 3 Tera-op/sec
• At minimum, this requires 64 b × 9 Tera-operands to be moved each second
• If operands move 1 mm on average (10% of die size), then
  – 0.1 pJ/bit × 576 Tbits/s
  – consumes 58 Watts!
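The arithmetic behind the 576 Tbits/s and 58 W figures, reproduced as a quick check (the assumption of three 64-bit operands per operation comes from the 3 Tera-op/s to 9 Tera-operand/s step on the slide):

```python
ops_per_s = 3e12            # 3 Tera-op/s target
operands_per_op = 3         # 9 Tera-operands/s on the slide implies ~3 operands per op
bits_per_operand = 64

bits_per_s = ops_per_s * operands_per_op * bits_per_operand   # 5.76e14 = 576 Tbit/s
energy_per_bit = 0.1e-12                                      # 0.1 pJ/bit for ~1 mm of wire

p_1mm = bits_per_s * energy_per_bit        # ~58 W if operands travel ~1 mm on average
p_01mm = p_1mm * 0.1                       # ~6 W if movement stays within ~0.1 mm
print(f"{bits_per_s/1e12:.0f} Tbit/s -> {p_1mm:.1f} W at 1 mm, {p_01mm:.1f} W at 0.1 mm")
```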
Excerpt from the article shown on the slide (Communications of the ACM, May 2011, Vol. 54, No. 5):

"…tion has been slowed by continued improvement in microprocessor single-thread performance. Developers of software applications had little incentive to customize for accelerators that might be available on only a fraction of the machines in the field and for which the performance advantage might soon be overtaken by advances in the traditional microprocessor. With slowing improvement in single-thread performance, this landscape has changed significantly, and for many applications, accelerators may be the only path toward increased performance or energy efficiency (see Table 4). But such software customization is difficult, especially for large programs (see the sidebar 'Decline of 90/10 Optimization, Rise of 10x10 Optimization').

Orchestrating data movement: memory hierarchies and interconnects. In future microprocessors, the energy expended for data movement will have a critical effect on achievable performance. Every nano-joule of energy used to move data up and down the memory hierarchy, as well as to synchronize across and move data between processors, takes away from the limited budget, reducing the energy available for the actual computation. In this context, efficient memory hierarchies are critical, as the energy to retrieve data from a local register or cache is far less than the energy to go to DRAM or to secondary storage. In addition, data must be moved between processing units efficiently, and task placement and scheduling must be optimized against an interconnection network with high locality. Here, we examine energy and power associated with data movement on the processor die.

Today's processor performance is on the order of 100 Giga-op/sec, and a 30x increase over the next 10 years would increase this performance to 3 Tera-op/sec. At minimum, this boost requires 9 Tera-operands, or 64 b × 9 Tera-operands (or 576 Tera-bits), to be moved each second from registers or memory to arithmetic logic, consuming energy.

Figure 11(a) outlines typical wire delay and energy consumed in moving one bit of data on the die. If the operands move on average 1 mm (10% of die size), then at the rate of 0.1 pJ/bit, the 576 Tera-bits/sec of movement consumes almost 58 watts, with hardly any energy budget left for computation. If most operands are kept local to the execution units (such as in register files) and the data movement is far less than 1 mm, say on the order of only 0.1 mm, then the power consumption is only around 6 watts, allowing ample energy budget for the computation.

Cores in a many-core system are typically connected through a network-on-a-chip to move data around the cores. Here, we examine the ef…"
[Figure 12 (from the article): hybrid switching for a network-on-chip – cores within a cluster connected by a bus, with clusters connected either by a second-level bus (hierarchy of busses) or by a second-level router-based network (hierarchy of networks).]
Table 5 (from the article). Data movement challenges, trends, directions.

Challenge: Parallelism
  Near-term: Increased parallelism
  Long-term: Heterogeneous parallelism and customization; hardware/runtime placement, migration, adaptation for locality and load balance

Challenge: Data movement/locality
  Near-term: More complex, more exposed hierarchies; new abstractions for control over movement and "snooping"
  Long-term: New memory abstractions and mechanisms for efficient vertical data-locality management with low programming effort and energy

Challenge: Resilience
  Near-term: More aggressive energy reduction, compensated by recovery for resilience
  Long-term: Radical new memory technologies (new physics) and resilience techniques

Challenge: Energy-proportional communication
  Near-term: Fine-grain power management in packet fabrics
  Long-term: Exploitation of wide data, slow clock, and circuit-based techniques

Challenge: Reduced energy
  Near-term: Low-energy address translation
  Long-term: Efficient multi-level naming and memory-hierarchy management
[Figure 11 (from the article): on-die interconnect delay and energy at 45 nm – (a) wire delay (ps) and wire energy (pJ), measured and extrapolated, versus on-die interconnect length (mm); (b) on-die network energy per bit (pJ/bit) across technology nodes from 0.18 µm down to 8 nm.]
Reducing Power
• Power gating, multi-Vt
• Clock gating
• Vdd scaling
  – Parallelism, pipelining
• Activity reduction
  – Pre-computation, correlation, encoding
• Glitch power reduction
[Figure: power gating – a sleep-controlled switch cell between Vdd and a virtual Vdd rail supplying the logic cell.]
Dynamic Power Management
• Dynamic Voltage and Frequency Scaling (DVFS)
• Reduce speed (clock frequency) and Vdd depending on processor activity
[Figure: processor speed vs. time, before and after DVFS. Before: run fast then sit idle, E = C·VH² + Eidle. After: run just fast enough at a lower voltage, E = C·VL².]
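A minimal sketch of the energy comparison in the figure: finishing early at a high voltage and then idling costs more than running just fast enough at a reduced voltage. The effective capacitance, voltages, idle power and slack time below are assumed values, chosen only to illustrate the comparison.

```python
# E_before = C*VH^2 + E_idle   vs.   E_after = C*VL^2   (per task, as in the figure)
C_EFF = 1e-3     # total switched capacitance over the task (F), assumed
V_HIGH = 1.2     # voltage when running fast, then idling (V)
V_LOW = 0.9      # voltage when running just in time (V)
P_IDLE = 0.2     # idle power (W), assumed
T_IDLE = 2e-3    # idle time in the "before" case (s), assumed

e_before = C_EFF * V_HIGH**2 + P_IDLE * T_IDLE
e_after = C_EFF * V_LOW**2
print(f"before (fast, then idle): {e_before*1e3:.2f} mJ")
print(f"after  (DVFS, slower)   : {e_after*1e3:.2f} mJ")
```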
Outline
• Power and energy consumption basics
• Power consumption in processors
• Multicore: power and utilization walls
• Energy advantages of hardware accelerators
• Playing with accuracy for reducing energy
• Towards heterogeneous manycores
  – Can 3D stacking help?
  – Optical interconnect
The “Power Wall”
~25 W/cm²: a hard limit
[Source: C. Batten, Cornell]
…and the “Multicore Era”
• Increasing performance by increasing the number of cores
[Figure: examples of multicore and manycore processors. Source: C. Batten, Cornell (ECE 5970 course).]
Shared-Memory Multiprocessor
• Processors communicate through a shared address space by memory reads/writes
• Hardware-managed, implicitly-addressed, coherent caches
• Bandwidth depends on
  – Cache size, associativity
  – Replacement policy, coherence protocol
  – Application requirements
[Figure: four processors, each with private L1 and L2 caches, connected through an interconnect to a shared L3 and off-chip DRAM banks.]
IBM Power8
• 12 cores (SMT8)
• 2013 (2015)
• 22 nm, 6.5 cm²
• Caches
  – 512 KB SRAM L2 per core
  – 96 MB eDRAM shared L3
  – Up to 128 MB eDRAM L4 (off-chip)
IBM Power8
• Across the 12-core chip
  – 4 TB/sec L2 bandwidth
  – 3 TB/sec L3 bandwidth
• 230 GB/s sustained external memory bandwidth
(GB/sec shown assuming 4 GHz; product frequency will vary based on model type)
[Figure: per-core bandwidths between core, L2 and L3 (64-256 GB/s links). Source: IBM, 2013.]
Distributed Memory
• Separate address space for each processor
• Processors communicate via message passing
• Software-managed, explicitly-addressed, local memories
• Sometimes also distributed shared memory
[Figure: mesh of nodes, each containing a processor (P), a local memory (M) and a router (R).]
Intel’s 80-Core Terascale Processor
• 80 cores (2 FMACs each)
• 1.6 SP TFLOPS @ 5 GHz
• 320 GB/s bisection router bandwidth
[Die photo and chip details: 21.72 mm × 12.64 mm die, I/O areas, PLL, TAP, single tile 1.5 mm × 2.0 mm.]
• Basic statistics:
  – 65 nm CMOS process
  – 100 million transistors in 275 mm²
  – 8×10 tiles, 3 mm²/tile
  – Mesochronous clock
  – 1.6 SP TFLOPS @ 5 GHz and 1.2 V
  – 320 GB/s bisection bandwidth
  – Variable voltage and multiple sleep states for explicit power management
• Routers write directly into memory: any core can write into the memory of any other core with low latency (2 cycles)
Moving to Multicore
• 1 core @ 2 GHz @ 1.2 V @ 1 W
• 1 core @ 1 GHz @ 0.8 V @ 0.25 W
• 2 cores @ 1 GHz @ 0.8 V @ 0.5 W
• But… twice the area (and it is not so simple)
• Advanced technology nodes?
[Figure: one core at 2 GHz, 1.2 V, 1 W; one core at 1 GHz, 0.8 V, 0.22 W; two cores at 1 GHz.]
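The numbers on this slide follow from P ∝ f·C·Vdd², with two cores doubling the switched capacitance. A sketch normalised to the single-core design point (the 1 GHz / 0.8 V operating point is taken from the slide; the model ignores leakage and assumes the workload parallelises perfectly across the two cores):

```python
# P ~ n_cores * f * Vdd^2, normalised so that 1 core @ 2 GHz @ 1.2 V = 1 W.
def power(n_cores, f_ghz, vdd, p_ref=1.0, f_ref=2.0, v_ref=1.2):
    return p_ref * n_cores * (f_ghz / f_ref) * (vdd / v_ref) ** 2

print(f"1 core  @ 2 GHz, 1.2 V: {power(1, 2.0, 1.2):.2f} W")   # 1.00 W, throughput 1x
print(f"1 core  @ 1 GHz, 0.8 V: {power(1, 1.0, 0.8):.2f} W")   # 0.22 W, throughput 0.5x
print(f"2 cores @ 1 GHz, 0.8 V: {power(2, 1.0, 0.8):.2f} W")   # 0.44 W, throughput ~1x
```

The model reproduces the 0.22 W label in the figure (the bullets round it to 0.25 W) and gives about 0.44 W for the dual-core version, close to the 0.5 W quoted on the slide, at roughly the original throughput but twice the area.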
Technology Scaling
28 nm 20 nm 14 nm
[Figure: planar FDSOI transistor (same STMicroelectronics slide as "Planar FDSOI Transistor Advantages" above).]
Classical (Dennard's) scaling:
  Device count: S²
  Device frequency: S
  Capacitance, Vdd: 1/S
  Device power: 1/S²
  Utilization: 1
[Figure: Core i at 100 W @ f, shrunk by a factor S at the next node.]
End of Dennard's Scaling
• Energy efficiency is not scaling along with integration capacity
Leakage-limited scaling:
  Device count: S²
  Device frequency: S
  Device power (capacitance): 1/S
  Device power (Vdd): ~1
  Utilization: 1/S²
[Figure: Core i at 100 W @ f vs. the shrunk Core i at the next node ([email protected], w/o leakage).]
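The two scaling tables can be contrasted numerically. The sketch below assumes a linear shrink of S = 1.4 per generation and a fixed chip-level power budget; "utilization" is then the fraction of the (ever larger) device population that can be active at once, which stays at 1 under Dennard scaling but collapses as 1/S² once Vdd stops scaling.

```python
S = 1.4            # linear shrink factor per technology generation
GENERATIONS = 4

for gen in range(GENERATIONS + 1):
    s = S ** gen
    devices = s ** 2                 # device count grows as S^2
    freq = s                         # device frequency grows as S
    cap = 1 / s                      # device capacitance shrinks as 1/S
    dennard_dev_power = cap * (1 / s) ** 2 * freq   # Vdd scales as 1/S -> 1/S^2 per device
    leakage_dev_power = cap * 1.0 ** 2 * freq       # Vdd ~ constant    -> ~1 per device
    util_dennard = min(1.0, 1.0 / (devices * dennard_dev_power))   # = 1
    util_leakage = min(1.0, 1.0 / (devices * leakage_dev_power))   # = 1/S^2
    print(f"gen {gen}: devices x{devices:5.1f}, "
          f"utilization (Dennard) {util_dennard:.2f}, "
          f"utilization (leakage-limited) {util_leakage:.2f}")
```

After four generations the chip holds about 15x more devices, but under leakage-limited scaling only about 7% of them can run at full speed within the original power budget, which is the utilization wall mentioned in the outline.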