lecture 6 clocked elements - stanford university · 2007-04-23 · m horowitz ee 371 lecture 6 26...
TRANSCRIPT
EE 371 Lecture 6M Horowitz 1
Lecture 6
Clocked Elements
Computer Systems LaboratoryStanford University
Copyright © 2006 Mark Horowitz, Ron HoSome material taken from lecture notes by Vladimir Stojanovic and Ken Mai
EE 371 Lecture 6M Horowitz 2
Overview
• Readings (For next lecture on clocking)– Gronowski Alpha clocking paper– Restle Clock grid paper– Harris Variations paper
• Today’s topics– Latches and flops overview– Power and timing metrics– High-performance design and low-energy design– Examples
EE 371 Lecture 6M Horowitz 3
Why Are Clocked Elements Important?
• A graph from the first lecture, showing cycle time in FO4
• A clock cycle contains less and less time each generation
Cycle in FO4
10
100
85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 00 01 02 03 04 05
intel 386
intel 486
intel pentium
intel pentium 2
intel pentium 3
intel pentium 4
intel i tanium
Alpha 21064
Alpha 21164
Alpha 21264
Spar c
Super Spar c
Spar c64
Mips
HP PA
Power PC
AMD K6
AMD K7
AMD x86-64
EE 371 Lecture 6M Horowitz 4
We Need Faster Clocked Elements
• Note that the previous graph invites us to extrapolate blindly– “In just a few years, clock frequencies will go to infinity!!”– Reality says clock cycle times will level out
• There is some disagreement on the actual limits– “6-8 FO4 per cycle is optimum” (Hrishikesh, ISCA ’02)– “Yes, but including overhead, more like 18” (Srinivasan, ISM ’02)
• Today: 16 FO4/cycle in Pentium4; 11 FO4/cycle in Sony Cell– If a flop has an overhead of 3FO4, this is 15%-30% overhead– We can fill less of the cycle time with “work”
• Making a faster flop helps performance significantly
EE 371 Lecture 6M Horowitz 5
We Need Lower Power Clocked Elements
• Recall that chip power density is climbing up– “Rocket nozzle!” “Surface of the sun!” “Oh, my!”– Because power is C*V2f and V is not scaling anymore (V != 10Lgate)
• Part of this power is for clock distribution– Clocking requires 70% of core power in Power4 (2001)– Clocking requires 25% of core power in dual-core Itanium (2005)
• Almost all of this power is in driving the latches/flops at the ends– Only around 20-25% of clock power is in clock transmission– Saving power at the ends is better than saving power in the wires
• Making a more efficient flop helps power significantly
EE 371 Lecture 6M Horowitz 6
Timing Overhead, Illustrated
• A standard circuit view of a flop system
– Flop timing overhead is the “data-to-output(Q)” delay• Tdata-to-Q = Tsetup + Tclk-to-Q = Tsu + Tcq
• Next look at some basic latch and flop designs and metrics– But first, an analogy…
LogicD Q
Clk
D Q
Clk
Tcycle
Tcq TsuTlogic
EE 371 Lecture 6M Horowitz 7
An Analogy Of Timing
• Clocked datapaths are like streets with traffic lights– Cars moving down the street are data
• Some cars speed, some cars drive normally, some crawl• Principal rule: cannot have two cars collide into each other
– As a car (data) comes to a light (flop or latch)• It passes through if the light is green and waits if it’s red
• A clocked system has a master controller for the lights– All the lights turn red/green at the same time (as in Manhattan)
• In some systems, the lights are green very briefly (flops)• In other systems, the lights stay green half the time (latches)
– Satisfy principal rule: ensure all cars only pass one block per light
• The point of traffic lights is to slow down fast cars
EE 371 Lecture 6M Horowitz 8
The Analogy’s Timing Failures
• A traffic light (flop or latch) can fail the principal rule in two ways– A car can be too slow to make it to the next block: Max path
• The car hadn’t reached the intersection when the light turned red– A car can race through more than one block: Min path
• The intersection didn’t stay clear just when the light went red
• Failures often arise because traffic light timing is not perfect– Differences between traffic lights (skew), or local variations (jitter)
• Fixing these failures– Max paths are fixable: Slow down the master controller – Min paths are not: Rebuild the street (or the light control wires)!
• Cost of 3 months fab time and $1M in mask costs
EE 371 Lecture 6M Horowitz 9
By The Way
• Do you really need traffic lights?– No, if you can guarantee that all cars travel at the same rate– Wave-pipelining ensures all datapaths have the same delay
• Do all the lights need to switch at the same time?– No, and long blocks might benefit from different light timing– Intentional clock skew on chips helps for timing problems
• Do you really need a master controller for all the lights?– No, if you have cars (drivers) negotiate between themselves– Asynchronous circuits handshake between data items
EE 371 Lecture 6M Horowitz 10
A Note On Terminology
• Avoid saying a latch or flop is “open” or “closed”– There is ambiguity in these terms– Water flows if you “open a valve”– Current stops if you “open a switch”
• The common wording today is a little awkward, but clear– A timing element transfers data when it is transparent– It does not transfer data when it is opaque
• Other choices include “blocking” and “non-blocking”– The oddest I have ever seen was “Permissive” and “Prohibitive”– But don’t use that…
EE 371 Lecture 6M Horowitz 11
Latches and Flops
• Latch has soft timing: transparent when clock=1
• Flop has hard timing: “transparent” when clock 0 1
D Q
Clk
transparent opaqueClk
D
Q
D Q
sample edge
Clk
D
Q
EE 371 Lecture 6M Horowitz 12
Latch Timing
• Setup and hold times are defined relative to the clock fall– Setup time: how long before the clock fall must the data arrive– Hold time: how long after the clock fall must the data not change
• Delay depends on arrival time of data relative to clock rise– On early data arrival, delay = Tcq
– On late data arrival, delay = Tdq
D Q
Clk
transparent opaqueClk
D
Q
D
Q
Early
Late
su hold
EE 371 Lecture 6M Horowitz 13
Flop Timing
• Setup and hold times are defined relative to the clock rise– Setup time: how long before the clock rise must the data arrive– Hold time: how long after the clock rise must the data not change
• Delay is always Tcq, as long as data hits the setup constraint
Clk
D
Q
su hold
D Q
EE 371 Lecture 6M Horowitz 14
Latch and Flop Timing
• Softness of latch timing edges allows time borrowing– Nominally a latch expects its data when the latch goes transparent– But the latch will accommodate late data
• Until the data runs into the falling edge of the clock (going opaque)– Time-borrowing works backwards (“slack forwarding”) and forwards
• Flops generally do not allow time borrowing
• For some latches and flops, setup time is negative– The data can change just after the latch goes opaque– The latch can still see that data change
EE 371 Lecture 6M Horowitz 15
Flop Delay
• What’s the delay through a flop?– Data must get to the flop by a setup time Tsu
– Data must traverse the flop and take up Tcq
• So the time available to do logic is what’s left– Tlogic = Tcycle – Tsu – Tcq – Tskew
LogicD Q D Q
Tcycle
Tcq TsuTlogic
EE 371 Lecture 6M Horowitz 16
0
50
100
150
200
250
300
350
-200 -150 -100 -50 0 50 100 150 200Data-Clk [ps]
Clk
-Out
put [
ps]
Setup Hold
Flop Clk-to-Q Delay Depends On Data-to-Clk
Sampling Window
Source: Stojanovic
EE 371 Lecture 6M Horowitz 17
Examine The Setup Half Of That Graph
Data-to-clock delay
Dat
a-to
-out
put d
elay
Variable cq regionConstant cq region Failure region
Tcq
Tdq
optimum
45o
EE 371 Lecture 6M Horowitz 18
What About Power?
• To fairly compare power of various flops or latches, include– External power to drive the clock input– External power to drive the data input– Internal power required to switch interior nodes– Internal power required to drive output load
Q
CLK
D
Qb
VDDVDD
VDD
PD
PCLK PINT
PLOAD
D
CLK
EE 371 Lecture 6M Horowitz 19
Simplest CMOS Latch
• Basic transparent high latch (Figure 11.2) is simply a passgate
– Very simple and compact– Stores data dynamically – subject to leakage and noise problems
clk
clk_b
dataq_b
EE 371 Lecture 6M Horowitz 20
Bad Things Can Happen To This Latch
• Various modes of failure, including
Source: Chandrakasan Ch. 11
EE 371 Lecture 6M Horowitz 21
Buffered Transparent High Latch
• Avoid input noise with a local input inverter
– Still have problems with the storage node
clk
clk_b
dataq
EE 371 Lecture 6M Horowitz 22
Jam Latch
• Make the storage node static
– Feedback inverter is very weak and loses in a fight– Burns power during the fight until the latch flips– Beware mixed process corners (SF, FS) for overpowering latch!
clk
clk_b
dataq
weak
EE 371 Lecture 6M Horowitz 23
Tristate Latch
• Prevent the fight by shutting off the feedback device
– Count on constant input drive during latch transparency– Note that input inverter+passgate can be a tristate as well
clk
clk_b
data
clk_b
clk
q
EE 371 Lecture 6M Horowitz 24
Robust CMOS Latch
• Don’t take the output from the storage node directly
– This is a good, safe latch design– Not the fastest in the world (can trade-off speed for DANGER)– Setup and hold times are close to 0
clk
clk_b
data
clk_b
clk
q
EE 371 Lecture 6M Horowitz 25
Flip-Flops Come In Several Flavors
• You can make a flip-flop out of two back-to-back latches
• You can make it out of a edge-triggered element, plus SR latch
D QClk
D QClk
clk_b
M-SFlop
EdgeFlop
R Q
clk
SPulseGen
EE 371 Lecture 6M Horowitz 26
Flip-Flops Come In Several Flavors, con’t
• You can also make a flip-flop out of a pulsed (glitch) latch
• All three flavors are (almost) functionally indistinguishable – Although they may react differently to clock skew– Will look at this in a later lecture
D QClk
clk
GlitchLatch
pulsegen
EE 371 Lecture 6M Horowitz 27
Flavor 1: Simplest Flop Is Master-Slave
• World’s first LSI calculator chip (Tokyo Shibaura Electric)– Real “old-school” but it’s a master-slave– All tristates instead of passgates
Source: Suzuki, JSSC 1973
D Q
Clk1
Clk
Clk
Clk1
Clk
Clk1
Clk
QMClk
Clk1
Clk1
Clk
EE 371 Lecture 6M Horowitz 28
Flavor 1: Another Master-Slave
• PowerPC 603 flop – Note this is a negative edge-triggered flop (clks are reversed)– Faster than C2MOS, but at worse input noise (no tristate)– Master clk usually generated from slave (why not vice versa?)
Vdd Vdd
Clk
QClk Clkb
Clkb
D
Source: Gerosa, JSSC 1994
EE 371 Lecture 6M Horowitz 29
Flavor 1: Yet Another Master-Slave (With Scan)
• IBM Power4 latch with scan-ability (very common today)
Source: Warnock, IBM JRD, 2002
EE 371 Lecture 6M Horowitz 30
Flavor 2: True Edge-Triggered Flops
• Dec 21264 Alpha flop (Madden & Bowhill, ’90, Matsui ’94)– A sense-amp (pulse generator) followed by a capturing RS latch– Negative setup time available
Clk
D D
RS
QQD=1
D=0
pulse
Pulsegenerator
CapturingLatch
Why?
delays not equal Source: Madden & Bowhill, JSSC, 1990
EE 371 Lecture 6M Horowitz 31
(Flavor 2) Improved “Strong-Arm Flop”
• Faster S-R stage – Sized for improved switching – Bigger drivers and smaller keepers
• Symmetric Q and Q_b delays
• Reduce the hysterisis of the latch– Slam the node high/low strongly– Then in precharge, hold it weakly– Makes it faster, but at what cost?
• Coupling immunity• Yet another application of NFL
– “No free lunch”
Source: Nikolic & Stojanovic, ISSCC ’99
EE 371 Lecture 6M Horowitz 32
Flavor 3: Glitch Latch or Pulse Latch
• Just like a standard latch, only with a pulsed (glitched) clock– Looks and smells like a flip-flop– Pulse gives you negative setup time and a soft edge (latch-like)
• Generating the pulsed clock can be tricky over process corners– A “clock chopper” or “single-shot”– Do this locally at the pulse latch for each latch or small group
• Distributing this pulse is dangerous: pulses disappear– So often combine this functionality into the latch itself
clk pulse
EE 371 Lecture 6M Horowitz 33
(Flavor 3) Glitch Latch
• AMD K6 latch, called a “Hybrid Latch Flip-Flop” (i.e., glitch latch)
Vdd
D
Clk
Q
Q
D=1
D=0
signal atnode X Second
Stage LatchPulse
Generator
D=1
D=0
Source: Partovi ISSCC ’96
EE 371 Lecture 6M Horowitz 34
(Flavor 3) Abstraction of the HLFF
D=1
D=0
signal atnode X
SecondStage Latch
PulseGenerator
D=1
Clk
D
D=0
Enable
Q
EE 371 Lecture 6M Horowitz 35
(Flavor 3) A “Semi-Dynamic” Flip-Flop
• Sun UltrasparcIII pulsed latch
– Dynamic pulse generator, unlike the NAND3 in the HLFF– Beware the 1-1 glitch
Q
Clk
Clk1
S
I
D
Clk
Clk Source: Klass VLSI ’98
EE 371 Lecture 6M Horowitz 36
Delay Comparisons
• Driving a load of 14 inverters, in a 0.18μm technology
Min D-Q Delay Comparison
0.00.51.01.52.02.53.03.54.04.55.0
MSL C2MOS HLFF SDFF SAFF M-SAFF
Del
ay [F
O4]
Pulse latches are faster
Source: Stojanovic
EE 371 Lecture 6M Horowitz 37
Energy Comparisons
• Driving a load of 14 inverters, in a 0.18μm technology
Energy breakdown (50% activity)
0
20
40
60
80
100
120
MSL C2MOS HLFF SDFF SAFF M-SAFF
Ener
gy [f
J]
Ext. clock Ext. data Int. clockInternal non-clk
Latches are lower energy
Source: Stojanovic
EE 371 Lecture 6M Horowitz 38
What About Imperfect Clocks?
• Clocks do not always arrive “on time”– Clocks to different clocked elements arrive at different times: SKEW– One clock will arrive at different times from cycle to cycle: JITTER
• Clock performance is a function of the distribution grid (later)
clk1
clk2
tskewt+jit
t-jit
EE 371 Lecture 6M Horowitz 39
Do Clocked Elements Have Transparency?
200
220
240
260
280
300
-30 -20 -10 0 10 20 30 40 50 60
Clk arrival time [ps]
D-Q
del
ay [p
s]
tCU
DDQm
DDQM
NominalClk
Clk
• Change in D-Q delay < clock uncertainty– The clocked element absorbs some of the clock uncertainty
• There is a range of clk arrivals for which D-Q is constant– For that range the clocked element looks like a combinational gate
EE 371 Lecture 6M Horowitz 40
“Clock Uncertainty Absorption”
• Helps to mitigate the effects of clock skew (HLFF shown below)
• More skew-tolerant circuits will be discussed in a later lecture