Power Consumption in Silicon Chips (IRISA)
Olivier Sentieys
IRISA/INRIA, Cairn team, University of Rennes 1
[email protected]
Power Consumption in Silicon Chips
• Chips, logic gates and transistors
Intel’s Xeon Chip
[Figure: logic gate with inputs A and B, output S, driving a load capacitance Ci.]
$P_i = \alpha_i \cdot f_i \cdot C_i \cdot V_{dd}^2 + I_{leak_i} \cdot V_{dd}$

$P = \sum_i \left[ \alpha_i \cdot f_i \cdot C_i \cdot V_{dd}^2 + I_{leak_i} \cdot V_{dd} \right]$
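The summation above can be evaluated directly once the per-block parameters are known. Below is a minimal sketch in Python; all activity, capacitance, frequency and leakage values are made-up placeholders chosen only to illustrate the formula, not measured figures.

```python
# Total power: P = sum_i (alpha_i * f_i * C_i * Vdd^2 + Ileak_i * Vdd)
# All numbers are illustrative assumptions.

VDD = 1.0  # supply voltage (V)

# (activity alpha_i, clock frequency f_i in Hz, switched capacitance C_i in F, leakage Ileak_i in A)
blocks = [
    (0.15, 1e9, 2e-12, 5e-6),   # e.g. a datapath block
    (0.05, 1e9, 8e-12, 2e-5),   # e.g. caches (low activity, high leakage)
    (0.30, 1e9, 1e-12, 1e-6),   # e.g. clock tree and control
]

p_dyn = sum(a * f * c * VDD**2 for a, f, c, il in blocks)
p_leak = sum(il * VDD for a, f, c, il in blocks)

print(f"dynamic: {p_dyn*1e3:.2f} mW, leakage: {p_leak*1e3:.2f} mW, "
      f"total: {(p_dyn + p_leak)*1e3:.2f} mW")
```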
Outline
• Power and energy consumption basics
• Power consumption in processors
• Multicore: power and utilization walls
• Energy advantages of hardware accelerators
• Playing with accuracy for reducing energy
• Towards heterogeneous manycores
  – 3D stacking
  – Optical interconnect
The Basic Element: the Transistor
• Transistor as a switch
• Vg > Vt: NMOS on – resistance RDS
• Vg < Vt: NMOS off – leakage Ioff
[Figure: NMOS symbol (gate G, source S, drain D) and its Id vs. Vg characteristic; Vt is the threshold voltage, Ioff the off-state leakage current.]
• Gate: capacitance CG
• Switch: resistance RDS

$R_{DS} \propto \dfrac{1}{V_{dd} - V_t}$
Transistors
• Bulk CMOS
• FD-SOI CMOS – STMicroelectronics – Vt tuning
[Figure: bulk MOSFET cross-section (gate length L, width W, oxide thickness tox, N/P diffusions, channel, source/gate/drain) next to the FDSOI/UTBB structure with a thin silicon film on a buried oxide. UTBB-SOI: Ultra-Thin Body (FD) and BOX SOI. Source: STMicroelectronics Technology R&D, February 2012.]
Planar FDSOI Transistor Advantages (STMicroelectronics)
• Total dielectric isolation
  – Lower S/D capacitances
  – Lower S/D leakage
  – Latch-up immunity
• Ultra-thin body (TSi ≈ 1/3 LG)
  – Excellent short-channel immunity (low SCE, DIBL)
  – No channel doping, no pocket implant
  – Improved VT variation
• Ultra-thin BOX option
  – Back-bias control
• Ground-plane implantation
  – VT adjustment
[Figure: thin silicon channel.]
Transistors
• Intel FinFET: transistors go 3D
Gates
• Delay of a gate
• Dynamic power
• Leakage power
[Figure: 2-input NAND gate (inputs A, B, output S) driving a load capacitance Ci.]
Truth table:
A B | S
0 0 | 1
0 1 | 1
1 0 | 1
1 1 | 0
$P_{stat_i} = N \cdot I_{off} \cdot V_{dd}$

$\text{Delay} \propto R_{DS} \cdot C_i \propto \dfrac{\text{relative fan-out}}{V_{dd} - V_t}$

$P_{dyn_i} = \alpha_i \cdot f_{clk} \cdot C_i \cdot V_{dd}^2$
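As a rough numerical sketch of these three relations (not a calibrated gate model: the delay constant, fan-out, capacitance, device count and Ioff below are assumptions), one can compare a gate at two supply voltages:

```python
# Delay ~ RDS*Ci ~ fan-out/(Vdd - Vt);  Pdyn = alpha*fclk*Ci*Vdd^2;  Pstat = N*Ioff*Vdd
# Constants are arbitrary placeholders chosen only to show the trends.

VT = 0.3          # threshold voltage (V)
K_DELAY = 1e-10   # lumps RDS*Ci into one constant (s.V)
CI = 2e-15        # switched load capacitance (F)
ALPHA = 0.2       # switching activity
FCLK = 1e9        # clock frequency (Hz)
N = 1_000_000     # device count
IOFF = 1e-9       # leakage per device (A)

def gate_metrics(vdd, fanout=4):
    delay = K_DELAY * fanout / (vdd - VT)   # seconds
    p_dyn = ALPHA * FCLK * CI * vdd**2      # watts
    p_stat = N * IOFF * vdd                 # watts
    return delay, p_dyn, p_stat

for vdd in (1.2, 0.9):
    d, pd, ps = gate_metrics(vdd)
    print(f"Vdd={vdd:.1f} V: delay={d*1e12:.0f} ps, Pdyn={pd*1e6:.2f} uW, Pstat={ps*1e3:.2f} mW")
```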
Activity
• Activity αi is the probability of a 0→1 transition at the output of a gate
• Example: AND gate
  – PS = P(S=1) = PA·PB
  – αi = PS·(1 − PS)
• Activity propagation
AND truth table:
A B | S
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1
[Figure: AND gate with PA = PB = 1/2, giving PS = 1/4 and α = 3/16; a second example propagates probabilities 1/2, 1/4, 1/8 through a two-gate network with inputs A, B, C, intermediate signal X and output S.]
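A small sketch of the activity computation for this example, assuming zero-delay gates and independent, equiprobable inputs (exactly the simplifications questioned on the next slide):

```python
def p_and(pa, pb):
    """Probability that a 2-input AND gate outputs 1 (independent inputs)."""
    return pa * pb

def activity(p1):
    """Probability of a 0 -> 1 transition: alpha = P(1) * (1 - P(1))."""
    return p1 * (1.0 - p1)

# Single AND gate with PA = PB = 1/2: PS = 1/4, alpha = 3/16
ps = p_and(0.5, 0.5)
print(ps, activity(ps))        # 0.25 0.1875

# Propagation through a second AND gate: X = A.B, S = X.C
px = p_and(0.5, 0.5)           # 1/4
ps2 = p_and(px, 0.5)           # 1/8
print(ps2, activity(ps2))      # 0.125 0.109375
```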
Propagating Activity is not So Simple
• Conditional probabilities
• Glitches: gate delay
  – Significant in arithmetic
[Figure: two gate networks with signal probabilities 1/2, 1/4, 1/8 and 1/2, 1/4, 1/4, illustrating the effect of conditional probabilities; waveforms (input patterns 110 and 101) showing a glitch caused by unequal gate delays.]
Dynamic Power vs. Performance
• Decreasing Vdd reduces power but increases delay
[Figure: relative delay td and relative dynamic power Pdyn as functions of the supply voltage Vdd (0.8 V to 2.4 V): delay rises steeply as Vdd approaches Vt, while Pdyn drops quadratically.]

$P_{dyn_i} = \alpha_i \cdot f_{clk} \cdot C_i \cdot V_{dd}^2$

$\text{Delay} \propto \dfrac{1}{V_{dd} - V_t}$
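A quick numerical sketch of the trade-off in the plot, normalised to Vdd = 2.4 V (the threshold voltage below is an assumption):

```python
# Delay ~ 1/(Vdd - Vt) and Pdyn ~ Vdd^2, with alpha, fclk and Ci held constant.
VT = 0.5     # assumed threshold voltage (V)
VREF = 2.4   # reference supply used for normalisation (V)

def rel_delay(vdd):
    return (VREF - VT) / (vdd - VT)

def rel_pdyn(vdd):
    return (vdd / VREF) ** 2

for vdd in (2.4, 2.0, 1.6, 1.2, 0.8):
    print(f"Vdd={vdd:.1f} V: delay x{rel_delay(vdd):4.1f}, Pdyn x{rel_pdyn(vdd):.2f}")
```

Dropping Vdd from 2.4 V to 0.8 V cuts dynamic power by roughly 9x but stretches the delay by about 6x, which is the shape of the two curves in the figure.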
Power Dissipation and Circuit Delay
[Figure: surface plots of power (of the order of 10^-4 W) and delay (of the order of 10^-10 s) as two circuit parameters, labelled A and B, are varied.]
Leakage vs. Performance
• High performance
• Low leakage
[Figure: Id vs. Vg curves for a low-Vt device (high performance, high Ioff) and a high-Vt device (low leakage, low Ioff).]

$\text{Delay} \propto \dfrac{1}{V_{dd} - V_t}$

$P_{stat_i} = N \cdot I_{off} \cdot V_{dd}$

Ioff is:
• exponential in the inverse of Vt
• exponential in temperature
• linear in device count
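The three Ioff dependencies can be sketched with a textbook subthreshold-leakage expression, Ioff ≈ I0·exp(−Vt/(n·kT/q)). The pre-factor I0 and slope factor n below are assumptions, not values from the slides.

```python
import math

K_OVER_Q = 8.617e-5   # Boltzmann constant over electron charge (V/K)
I0 = 1e-6             # assumed leakage pre-factor per device (A)
N_SLOPE = 1.5         # assumed subthreshold slope factor

def ioff(vt, temp_k):
    """Subthreshold leakage per device: exponential in -Vt and in temperature."""
    return I0 * math.exp(-vt / (N_SLOPE * K_OVER_Q * temp_k))

def p_static(n_devices, vt, temp_k, vdd):
    """Pstat = N * Ioff * Vdd: linear in device count."""
    return n_devices * ioff(vt, temp_k) * vdd

for vt in (0.25, 0.35, 0.45):
    for t in (300, 360):
        print(f"Vt={vt:.2f} V, T={t} K: Pstat = {p_static(1e8, vt, t, 1.0):.2e} W")
```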
Minimum Energy per Operation
• Putting it all together
[Source: Intel]
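Putting it all together numerically: lowering Vdd shrinks the dynamic energy per operation quadratically, but each operation then takes longer, so the leakage current is integrated over a longer time and the static energy per operation grows. The sum therefore has a minimum at some intermediate Vdd. A minimal sketch (every constant below is an assumption chosen only to make the U-shaped curve visible):

```python
# E(Vdd) = Edyn + Estat, with Edyn = C*Vdd^2 and Estat = Ileak * Vdd * Top,
# where the operation time Top ~ 1/(Vdd - Vt) grows as Vdd is lowered.
C = 1e-12     # switched capacitance per operation (F), assumed
ILEAK = 1e-4  # total leakage current (A), assumed
K_T = 1e-9    # delay constant (s.V), assumed
VT = 0.3      # threshold voltage (V), assumed

def energy_per_op(vdd):
    e_dyn = C * vdd**2
    e_stat = ILEAK * vdd * (K_T / (vdd - VT))
    return e_dyn + e_stat

vdds = [0.40 + 0.05 * i for i in range(17)]   # 0.40 V .. 1.20 V
for v in vdds:
    print(f"Vdd={v:.2f} V: E = {energy_per_op(v)*1e12:.2f} pJ")
print("minimum near Vdd =", min(vdds, key=energy_per_op))
```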
On-Chip Interconnect?
• Gate delay decreases but… wire delay increases
• Crossing the chip takes 5-10 clock cycles
• Also affected by noise…
• Metal layers to reduce wire delay
• Repeaters
• Towards networks-on-chip
Conclusion: Power in CMOS
• Dynamic power
  – 40-70% today
  – Decreasing relatively
  – DVFS becomes more and more difficult
• Leakage power
  – 20-50% today
  – Increasing rapidly with the number of transistors and Vdd/Vt scaling
  – Critical for memory
$P = \text{energy per operation} \times \text{rate} + \text{static power}$

$P = \sum_i \left[ \alpha_i \cdot f_i \cdot C_i \cdot V_{dd}^2 + I_{leak_i} \cdot V_{dd} \right]$
Outline
• Power and energy consumption basics
• Power consumption in processors
• Multicore: power and utilization walls
• Energy advantages of hardware accelerators
• Playing with accuracy for reducing energy
• Towards heterogeneous manycores
  – 3D stacking
  – Optical interconnect
Power Consumption in Processors
• A typical (yet simple) processor pipeline
[Figure: the classic five-stage DLX pipeline (IF, ID, EX, MEM, WB) with IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers, reproduced from Hennessy & Patterson, Chapter 3 (Figure 3.4): pipeline registers convey values and control information from one stage to the next, and the PC is written in the IF stage.]
Energy Cost in a Processor
MIPS processor: 91 pJ/instruction
[Figure: per-instruction energy breakdown – I-cache 23%, Fetch/Decode 19%, Reg. File 14%, Datapath 38%, D-cache 6%.]
Energy Cost in a Processor
• Fetching operands costs more than computing
[Figure: energy per operation in 28 nm CMOS – 64-bit DP operation: 20 pJ; 256-bit access to an 8 kB SRAM: 50 pJ; 256-bit on-chip buses: 26 pJ, 256 pJ, up to 1 nJ depending on distance; efficient off-chip link: 500 pJ; DRAM Rd/Wr: 16 nJ. Source: Dally, IPDPS’11.]
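To put these numbers in perspective, here is a small sketch that expresses each energy figure from the slide as a multiple of one 64-bit DP operation. The values are copied from the slide; the distance labels on the three bus figures are an assumption about which number goes with which bus length.

```python
# Energy per event in 28 nm CMOS (pJ), as quoted on the slide (Dally, IPDPS'11).
energy_pj = {
    "64-bit DP operation":               20,
    "8 kB SRAM, 256-bit access":         50,
    "256-bit bus (short, assumed)":      26,
    "256-bit bus (long, assumed)":       256,
    "256-bit bus (cross-chip, assumed)": 1000,
    "efficient off-chip link":           500,
    "DRAM read/write":                   16000,
}

op = energy_pj["64-bit DP operation"]
for name, e in energy_pj.items():
    print(f"{name:35s} {e:7.0f} pJ = {e / op:6.1f}x a DP operation")
```

Even under these assumptions, fetching a word from DRAM costs on the order of 800 times more energy than the arithmetic performed on it, which is the point of the slide.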
Energy Cost in a Memory
• An L2 cache contains 4 million SRAM cells
• Rows/columns of 2000 cells
[Figure: memory array organization – a storage-cell array addressed by a row decoder driving the word lines and a column decoder with sense amplifiers/drivers on the bit lines, the address bits A0…AL-1 being split between the two decoders, with an M-bit input/output.]
The Energy Cost of Data Movement
• Future processors: up to 3 Tera-op/sec
• At minimum, this requires 64 b × 9 Tera-operands to be moved each second
• If operands move 1 mm on average (10% of die size), then
  – 0.1 pJ/bit × 576 Tbits/s
  – consumes 58 Watts!
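The arithmetic behind the 576 Tbits/s and 58 W figures, reproduced as a quick check (the assumption of three 64-bit operands per operation comes from the 3 Tera-op/s to 9 Tera-operand/s step on the slide):

```python
ops_per_s = 3e12            # 3 Tera-op/s target
operands_per_op = 3         # 9 Tera-operands/s on the slide implies ~3 operands per op
bits_per_operand = 64

bits_per_s = ops_per_s * operands_per_op * bits_per_operand   # 5.76e14 = 576 Tbit/s
energy_per_bit = 0.1e-12                                      # 0.1 pJ/bit for ~1 mm of wire

p_1mm = bits_per_s * energy_per_bit        # ~58 W if operands travel ~1 mm on average
p_01mm = p_1mm * 0.1                       # ~6 W if movement stays within ~0.1 mm
print(f"{bits_per_s/1e12:.0f} Tbit/s -> {p_1mm:.1f} W at 1 mm, {p_01mm:.1f} W at 0.1 mm")
```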
Excerpt from the article shown on the slide (Communications of the ACM, May 2011, Vol. 54, No. 5):

"…tion has been slowed by continued improvement in microprocessor single-thread performance. Developers of software applications had little incentive to customize for accelerators that might be available on only a fraction of the machines in the field and for which the performance advantage might soon be overtaken by advances in the traditional microprocessor. With slowing improvement in single-thread performance, this landscape has changed significantly, and for many applications, accelerators may be the only path toward increased performance or energy efficiency (see Table 4). But such software customization is difficult, especially for large programs (see the sidebar 'Decline of 90/10 Optimization, Rise of 10x10 Optimization').

Orchestrating data movement: memory hierarchies and interconnects. In future microprocessors, the energy expended for data movement will have a critical effect on achievable performance. Every nano-joule of energy used to move data up and down the memory hierarchy, as well as to synchronize across and move data between processors, takes away from the limited budget, reducing the energy available for the actual computation. In this context, efficient memory hierarchies are critical, as the energy to retrieve data from a local register or cache is far less than the energy to go to DRAM or to secondary storage. In addition, data must be moved between processing units efficiently, and task placement and scheduling must be optimized against an interconnection network with high locality. Here, we examine energy and power associated with data movement on the processor die.

Today's processor performance is on the order of 100 Giga-op/sec, and a 30x increase over the next 10 years would increase this performance to 3 Tera-op/sec. At minimum, this boost requires 9 Tera-operands, or 64 b × 9 Tera-operands (or 576 Tera-bits), to be moved each second from registers or memory to arithmetic logic, consuming energy.

Figure 11(a) outlines typical wire delay and energy consumed in moving one bit of data on the die. If the operands move on average 1 mm (10% of die size), then at the rate of 0.1 pJ/bit, the 576 Tera-bits/sec of movement consumes almost 58 watts, with hardly any energy budget left for computation. If most operands are kept local to the execution units (such as in register files) and the data movement is far less than 1 mm, say on the order of only 0.1 mm, then the power consumption is only around 6 watts, allowing ample energy budget for the computation.

Cores in a many-core system are typically connected through a network-on-a-chip to move data around the cores. Here, we examine the ef…"
[Figure 12 (from the article): hybrid switching for a network-on-chip – cores within a cluster connected by a bus, with clusters connected either by a second-level bus (hierarchy of busses) or by a second-level router-based network (hierarchy of networks).]
Table 5 (from the article). Data movement challenges, trends, directions.

Challenge: Parallelism
  Near-term: Increased parallelism
  Long-term: Heterogeneous parallelism and customization; hardware/runtime placement, migration, adaptation for locality and load balance

Challenge: Data movement/locality
  Near-term: More complex, more exposed hierarchies; new abstractions for control over movement and "snooping"
  Long-term: New memory abstractions and mechanisms for efficient vertical data-locality management with low programming effort and energy

Challenge: Resilience
  Near-term: More aggressive energy reduction, compensated by recovery for resilience
  Long-term: Radical new memory technologies (new physics) and resilience techniques

Challenge: Energy-proportional communication
  Near-term: Fine-grain power management in packet fabrics
  Long-term: Exploitation of wide data, slow clock, and circuit-based techniques

Challenge: Reduced energy
  Near-term: Low-energy address translation
  Long-term: Efficient multi-level naming and memory-hierarchy management
[Figure 11 (from the article): on-die interconnect delay and energy at 45 nm – (a) wire delay (ps) and wire energy (pJ), measured and extrapolated, versus on-die interconnect length (mm); (b) on-die network energy per bit (pJ/bit) across technology nodes from 0.18 µm down to 8 nm.]
Reducing Power
• Power gating, multi-Vt
• Clock gating
• Vdd scaling
  – Parallelism, pipelining
• Activity reduction
  – Pre-computation, correlation, encoding
• Glitch power reduction
[Figure: power gating – a sleep-controlled switch cell between Vdd and a virtual Vdd rail supplying the logic cell.]
Dynamic Power Management
• Dynamic Voltage and Frequency Scaling (DVFS)
• Reduce speed (clock frequency) and Vdd depending on processor activity
[Figure: processor speed vs. time, before and after DVFS. Before: run fast then sit idle, E = C·VH² + Eidle. After: run just fast enough at a lower voltage, E = C·VL².]
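A minimal sketch of the energy comparison in the figure: finishing early at a high voltage and then idling costs more than running just fast enough at a reduced voltage. The effective capacitance, voltages, idle power and slack time below are assumed values, chosen only to illustrate the comparison.

```python
# E_before = C*VH^2 + E_idle   vs.   E_after = C*VL^2   (per task, as in the figure)
C_EFF = 1e-3     # total switched capacitance over the task (F), assumed
V_HIGH = 1.2     # voltage when running fast, then idling (V)
V_LOW = 0.9      # voltage when running just in time (V)
P_IDLE = 0.2     # idle power (W), assumed
T_IDLE = 2e-3    # idle time in the "before" case (s), assumed

e_before = C_EFF * V_HIGH**2 + P_IDLE * T_IDLE
e_after = C_EFF * V_LOW**2
print(f"before (fast, then idle): {e_before*1e3:.2f} mJ")
print(f"after  (DVFS, slower)   : {e_after*1e3:.2f} mJ")
```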
Outline
• Power and energy consumption basics
• Power consumption in processors
• Multicore: power and utilization walls
• Energy advantages of hardware accelerators
• Playing with accuracy for reducing energy
• Towards heterogeneous manycores
  – Can 3D stacking help?
  – Optical interconnect
The “Power Wall”
~25 W/cm²: a hard limit
[Source: C. Batten, Cornell]
…and the “Multicore Era”
• Increasing performance by increasing the number of cores
[Figure: examples of multicore and manycore processors. Source: C. Batten, Cornell (ECE 5970 course).]
Shared-Memory Multiprocessor
• Processors communicate through a shared address space by memory reads/writes
• Hardware-managed, implicitly-addressed, coherent caches
• Bandwidth depends on
  – Cache size, associativity
  – Replacement policy, coherence protocol
  – Application requirements
[Figure: four processors, each with private L1 and L2 caches, connected through an interconnect to a shared L3 and off-chip DRAM banks.]
IBM Power8
• 12 cores (SMT8)
• 2013 (2015)
• 22 nm, 6.5 cm²
• Caches
  – 512 KB SRAM L2 per core
  – 96 MB eDRAM shared L3
  – Up to 128 MB eDRAM L4 (off-chip)
IBM Power8
• Across the 12-core chip
  – 4 TB/sec L2 bandwidth
  – 3 TB/sec L3 bandwidth
• 230 GB/s sustained external memory bandwidth
(GB/sec shown assuming 4 GHz; product frequency will vary based on model type)
[Figure: per-core bandwidths between core, L2 and L3 (64-256 GB/s links). Source: IBM, 2013.]
Distributed Memory
• Separate address space for each processor
• Processors communicate via message passing
• Software-managed, explicitly-addressed, local memories
• Sometimes also distributed shared memory
[Figure: mesh of nodes, each containing a processor (P), a local memory (M) and a router (R).]
Intel’s 80-Core Terascale Processor
• 80 cores (2 FMACs each)
• 1.6 SP TFLOPS @ 5 GHz
• 320 GB/s bisection router bandwidth
[Die photo and chip details: 21.72 mm × 12.64 mm die, I/O areas, PLL, TAP, single tile 1.5 mm × 2.0 mm.]
• Basic statistics:
  – 65 nm CMOS process
  – 100 million transistors in 275 mm²
  – 8×10 tiles, 3 mm²/tile
  – Mesochronous clock
  – 1.6 SP TFLOPS @ 5 GHz and 1.2 V
  – 320 GB/s bisection bandwidth
  – Variable voltage and multiple sleep states for explicit power management
• Routers write directly into memory: any core can write into the memory of any other core with low latency (2 cycles)
Moving to Multicore
• 1 core @ 2 GHz @ 1.2 V @ 1 W
• 1 core @ 1 GHz @ 0.8 V @ 0.25 W
• 2 cores @ 1 GHz @ 0.8 V @ 0.5 W
• But… twice the area (and it is not so simple)
• Advanced technology nodes?
[Figure: one core at 2 GHz, 1.2 V, 1 W; one core at 1 GHz, 0.8 V, 0.22 W; two cores at 1 GHz.]
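The numbers on this slide follow from P ∝ f·C·Vdd², with two cores doubling the switched capacitance. A sketch normalised to the single-core design point (the 1 GHz / 0.8 V operating point is taken from the slide; the model ignores leakage and assumes the workload parallelises perfectly across the two cores):

```python
# P ~ n_cores * f * Vdd^2, normalised so that 1 core @ 2 GHz @ 1.2 V = 1 W.
def power(n_cores, f_ghz, vdd, p_ref=1.0, f_ref=2.0, v_ref=1.2):
    return p_ref * n_cores * (f_ghz / f_ref) * (vdd / v_ref) ** 2

print(f"1 core  @ 2 GHz, 1.2 V: {power(1, 2.0, 1.2):.2f} W")   # 1.00 W, throughput 1x
print(f"1 core  @ 1 GHz, 0.8 V: {power(1, 1.0, 0.8):.2f} W")   # 0.22 W, throughput 0.5x
print(f"2 cores @ 1 GHz, 0.8 V: {power(2, 1.0, 0.8):.2f} W")   # 0.44 W, throughput ~1x
```

The model reproduces the 0.22 W label in the figure (the bullets round it to 0.25 W) and gives about 0.44 W for the dual-core version, close to the 0.5 W quoted on the slide, at roughly the original throughput but twice the area.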
Technology Scaling
28 nm 20 nm 14 nm
[Figure: planar FDSOI transistor (same STMicroelectronics slide as "Planar FDSOI Transistor Advantages" above).]
Classical (Dennard's) scaling:
  Device count: S²
  Device frequency: S
  Capacitance, Vdd: 1/S
  Device power: 1/S²
  Utilization: 1
[Figure: Core i at 100 W @ f, shrunk by a factor S at the next node.]
End of Dennard's Scaling
• Energy efficiency is not scaling along with integration capacity
Leakage-limited scaling:
  Device count: S²
  Device frequency: S
  Device power (capacitance): 1/S
  Device power (Vdd): ~1
  Utilization: 1/S²
[Figure: Core i at 100 W @ f vs. the shrunk Core i at the next node ([email protected], w/o leakage).]
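The two scaling tables can be contrasted numerically. The sketch below assumes a linear shrink of S = 1.4 per generation and a fixed chip-level power budget; "utilization" is then the fraction of the (ever larger) device population that can be active at once, which stays at 1 under Dennard scaling but collapses as 1/S² once Vdd stops scaling.

```python
S = 1.4            # linear shrink factor per technology generation
GENERATIONS = 4

for gen in range(GENERATIONS + 1):
    s = S ** gen
    devices = s ** 2                 # device count grows as S^2
    freq = s                         # device frequency grows as S
    cap = 1 / s                      # device capacitance shrinks as 1/S
    dennard_dev_power = cap * (1 / s) ** 2 * freq   # Vdd scales as 1/S -> 1/S^2 per device
    leakage_dev_power = cap * 1.0 ** 2 * freq       # Vdd ~ constant    -> ~1 per device
    util_dennard = min(1.0, 1.0 / (devices * dennard_dev_power))   # = 1
    util_leakage = min(1.0, 1.0 / (devices * leakage_dev_power))   # = 1/S^2
    print(f"gen {gen}: devices x{devices:5.1f}, "
          f"utilization (Dennard) {util_dennard:.2f}, "
          f"utilization (leakage-limited) {util_leakage:.2f}")
```

After four generations the chip holds about 15x more devices, but under leakage-limited scaling only about 7% of them can run at full speed within the original power budget, which is the utilization wall mentioned in the outline.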