clocked storage elements - stanford university · 2004-04-20 · note: power is always consumed in...
TRANSCRIPT
EE371 Lecture 7
Clocked storage elements
Vladimir Stojanovic
Material in this presentation is adapted from “Digital System Clocking: High-Performance and Low Power Aspects”, V.G.Oklobdzija, V.M.Stojanovic, D.M.Markovic, N.M.Nedovic, © 2003 J.Wiley & IEEE
EE371 Lecture 7 2
Outline
Latch and Flip-Flop
Timing and Power Metrics
High-Performance Issues
Low-Energy Issues
State-of-the-art circuits
Microprocessor Examples
EE371 Lecture 7 3
Recent Interest in Flip-Flops
Trends in high-performance systems
» Higher clock frequency
» More transistors on chip
Consequences
» Increased flip-flop overhead relative to cycle time
– Cycle time 10 - 20 FO4 delays, flop overhead 2 - 4 FO4
» Difficult to control both edges of the clock
» Higher impact of clock skew
» Higher crosstalk and substrate coupling
» Higher power consumption
– expensive packages and cooling systems
– limit in performance
» Clock burns up to 40%, flops up to 20% of total power
EE371 Lecture 7 4
Why are clocked storage elements important?
Cycle time ~ 12-20 FO4 delays
Flip-flop overhead 2-3 FO4 (20% of cycle time!!!)
[Figure: two flip-flops (D, Q, Clk) separated by N stages of logic; the cycle time T spans TClk-Q, TLogic and TSetup]
TD-Q = TClk-Q + TSetup
EE371 Lecture 7 5
Latch and Flip-Flop
[Figure: block symbols and Clock, Data and Q waveforms for a flip-flop (F-F) and a latch]
Latch is “transparent” (clock-level sensitive)
Flip-flop: after the transition of the clock, data change does not affect the output (clock-edge sensitive)
Flip-Flop – “hard” edge clocking
Latch – “soft” edge clocking
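The level-sensitive versus edge-sensitive distinction can be sketched with an idealized behavioral model. This is only an illustration: the function name `simulate`, the zero-delay assumption and the sample waveforms are hypothetical and not taken from the lecture.

```python
def simulate(clock, data):
    """Return (latch_q, ff_q) traces for sampled clock/data waveforms.

    Both elements are idealized (zero delay) and start with output 0.
    """
    latch_q, ff_q = [], []
    latch_state = ff_state = 0
    prev_clk = 0
    for clk, d in zip(clock, data):
        # Flip-flop: sample D only on the 0 -> 1 clock transition.
        if prev_clk == 0 and clk == 1:
            ff_state = d
        # Latch: transparent while the clock is high, holds while it is low.
        if clk == 1:
            latch_state = d
        latch_q.append(latch_state)
        ff_q.append(ff_state)
        prev_clk = clk
    return latch_q, ff_q


# Hypothetical waveforms: data toggles while the clock is high.
clock = [0, 1, 1, 1, 0, 0, 1, 1]
data = [1, 1, 0, 1, 0, 0, 0, 1]
latch_q, ff_q = simulate(clock, data)
print("latch:", latch_q)
print("f-f:  ", ff_q)
```

The latch output follows every data change while the clock is high, whereas the flip-flop output changes only at the rising edges.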
EE371 Lecture 7 6
How can one recognize the difference without knowing what is inside the “black-box” ?
Flip-Flop and Master-Slave Latch
Operational behavior appears the same…
[Figure: a flip-flop (F-F) and a master-slave latch (M-S Latch: master L1 clocked by Clk1, slave L2 clocked by Clk2) with their Clock, Data and Q waveforms]
EE371 Lecture 7 7
Flip-Flop and M-S Latch: Structural Difference
[Figure: Flip-Flop: a clocked pulse generator drives the S and R inputs of a pulse-capturing (slave) latch that receives no clock. M-S Latch: a master latch (L1) clocked by Φ1 is followed by a slave latch (L2) clocked by Φ2.]
EE371 Lecture 7 8
T-G Master-Slave Latch
PowerPC 603 (Gerosa, JSSC 12/94)
[Figure: transmission-gate master-slave latch schematic with nodes D, Q, Clk, Clkb and Vdd]
EE371 Lecture 7 9
Flip-Flop Example 1: SAFF
SAFF DEC Alpha 21264 (Madden & Bowhill, 1990, Matsui 1994)
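(Sense-Amplifier-Based Flip-Flop)
[Figure: SAFF schematic: a clocked pulse generator with input D and its complement drives the S and R inputs of the capturing latch; pulse waveforms shown for D=1 and D=0]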
EE371 Lecture 7 10
Flip-Flop Example 2: HLFF
(Hybrid Latch Flip-Flop)
[Figure: HLFF schematic (inputs D and Clk, output Q): a pulse generator followed by a second-stage latch; waveforms of the signal at node X shown for D=1 and D=0]
EE371 Lecture 7 11
Logic Diagram of HLFF
[Figure: HLFF logic diagram: a pulse generator (inputs Clk and D, Enable signal) followed by a second-stage latch driving Q; signal at node X shown for D=1 and D=0]
EE371 Lecture 7 12
Performance Metrics: Timing and Power
EE371 Lecture 7 13
Timing Parameters in Latches
Setup and Hold Times are defined relative to the closing edge of the clock signal
[Figure: latch timing waveforms (Clk, D1/Q1, D2/Q2) marking setup time U, hold time H, clock width tCW and the delays tCQ and tDQ. (a) Early data arrival; (b) Late data arrival]
EE371 Lecture 7 14
Timing Parameters in Flip-Flops
Setup and Hold Times are defined relative to the rising edge of the clock
[Figure: flip-flop timing waveforms (Clk, D, Q) marking setup time U, hold time H, clock width tCW, Clk-Q delay tCQ and the sampling window]
EE371 Lecture 7 15
Data-to-Output Delay
Sum of setup time and Clk-Q delay is the only true measure of performance w.r.t. system speed
T = TClk-Q + TLogic + TSetup + Tskew
[Figure: two flip-flops (D, Q, Clk) separated by N stages of logic; the cycle time T spans TClk-Q, TLogic and TSetup]
TD-Q = TClk-Q + TSetup
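As a worked example of the cycle-time relation, the sketch below adds up the terms and reports the share of the period taken by the storage element. The function names and the FO4 values are hypothetical, picked only to fall inside the ranges quoted earlier in the lecture.

```python
def cycle_time(t_clk_q, t_logic, t_setup, t_skew=0.0):
    """Minimum clock period: T = TClk-Q + TLogic + TSetup + Tskew."""
    return t_clk_q + t_logic + t_setup + t_skew


def cse_overhead(t_clk_q, t_setup, period):
    """Fraction of the period spent in the storage element (TD-Q / T)."""
    return (t_clk_q + t_setup) / period


# Hypothetical values in FO4 delays, chosen only for illustration.
T = cycle_time(t_clk_q=2.0, t_logic=12.0, t_setup=1.0)
overhead = cse_overhead(2.0, 1.0, T)
print(f"T = {T} FO4, CSE overhead = {overhead:.0%}")
```

With these assumed numbers a 3 FO4 D-Q delay takes one fifth of a 15 FO4 cycle, in line with the roughly 20% overhead quoted earlier.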
EE371 Lecture 7 16
Clk-Q Delay is a function of D-Clk
[Figure: Clk-Output delay [ps] (0 to 350) versus Data-Clk [ps] (-200 to 200), with the Setup and Hold regions and the Sampling Window marked]
EE371 Lecture 7 17
Setup Time vs. Data-to-Output (D-Q) Delay
[Figure: Data-to-Output delay (D-Q and Clk-Q curves) versus Data-to-Clock delay, from early to late data arrival, showing the Constant Clk-Q Region, the Variable Clk-Q Region and the Failure Region; a 45° line marks the optimum setup time Uopt at the minimum D-Q delay DDQm]
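The optimum setup point can be located numerically once a Clk-Q curve is available. The sketch below assumes a table of hypothetical (setup offset, Clk-Q delay) samples, where the offset U is how far the data arrives ahead of the clock edge; the numbers are invented for illustration and are not measurements from the lecture.

```python
# Hypothetical samples of (setup offset U [ps], Clk-Q delay [ps]).
# Clk-Q is constant for early data and grows as data arrives later.
CLK_Q_CURVE = [
    (100, 150), (80, 150), (60, 152), (40, 158),
    (20, 172), (10, 190), (0, 230),
]


def optimum_setup(curve):
    """Return (U_opt, D_DQm): the offset that minimizes D-Q = U + Clk-Q(U)."""
    u_opt, clk_q = min(curve, key=lambda point: point[0] + point[1])
    return u_opt, u_opt + clk_q


u_opt, d_dq_min = optimum_setup(CLK_Q_CURVE)
print(f"U_opt = {u_opt} ps, minimum D-Q = {d_dq_min} ps")
```

Moving the data later than U_opt raises Clk-Q faster than it lowers the setup offset, which is the 45° condition on the curve.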
EE371 Lecture 7 18
Power Consumption
Power related to a CSE can be divided into:
» Input power
– Data power (PD)
– Clock power (PCLK)
» Internal power (PINT)
– Depends on data activity and glitching activity
» Load power (PLOAD)
– Can be merged into PINT
[Figure: CSE with inputs D and CLK and outputs Q and Qb, showing the supplies that deliver PD, PCLK, PINT and PLOAD]
$P_{tot} = P_{internal\&load} + \sum_{inputs(D,CLK)} P_{driver}$
EE371 Lecture 7 19
High-Performance Issues
EE371 Lecture 7 20
Clock Skew and Jitter
[Figure: Ref_Clock at the driver (tDRVCLK) and the Received Clock (tRCVCLK) over one period T, showing the skew tskew and the jitter bounds tjit− and tjit+]
EE371 Lecture 7 21
The Idea of Clock Uncertainty Absorption
Change in D-Q delay is much smaller than the clock uncertainty (CSE absorbs a part of the uncertainty)
[Figure: D-Q delay [ps] (200 to 300) versus Clk arrival time [ps] (-30 to 60) around the nominal Clk; across the clock uncertainty tCU the D-Q delay varies only between DDQm and DDQM]
EE371 Lecture 7 22
Clock Uncertainty Absorption
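[Figure: D, Clk and Q waveforms for early, nominal (TNominal=0) and late clock arrival within the clock uncertainty tCU, showing the early, nominal and late DD-Clk and the resulting D-Q delay between DDQm and the worst-case DDQM]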
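One way to put a number on the absorption is to compare the spread in D-Q delay with the clock uncertainty that caused it. The coefficient below (absorbed fraction = 1 - (DDQM - DDQm) / tCU) is an assumed definition used for illustration, and the picosecond values are hypothetical.

```python
def absorption(t_cu, d_dq_min, d_dq_max):
    """Fraction of the clock uncertainty t_cu that does not show up in D-Q.

    Assumed definition: 1 - (DDQM - DDQm) / tCU.
    """
    if t_cu <= 0:
        raise ValueError("clock uncertainty must be positive")
    delay_spread = d_dq_max - d_dq_min
    return 1.0 - delay_spread / t_cu


# Hypothetical values in ps: 30 ps of uncertainty moves D-Q by only 6 ps.
alpha = absorption(t_cu=30.0, d_dq_min=200.0, d_dq_max=206.0)
print(f"absorbed fraction = {alpha:.2f}")
```

A value near 1 means the storage element hides almost all of the uncertainty from the cycle time; a value of 0 means the D-Q delay moves one-for-one with the clock.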
EE371 Lecture 7 23
Example: HLFF
[Partovi et al, ISSCC’96]
EE371 Lecture 7 24
State-of-the-Art CSEs in CMOS Technology
EE371 Lecture 7 25
Requirements in the Flip-Flop Design
Small Clk-Output delay, Narrow sampling window
Low power
Small clock load
High driving capability (increased levels of parallelism)
» Typical flip-flop load in a 0.18µm CMOS ranges from 50fF to over 200fF, with typical values of 100-150fF in critical paths
Integration of logic into the flop
Multiplexed or clock scan
Crosstalk insensitivity - dynamic/high impedance nodes are affected
EE371 Lecture 7 26
Low-Energy Designs: Master-Slave Latch Examples
[Figure: two schematics clocked by Clk and Clk1. Master-Slave Latch (MSL) [G.Gerosa et al, JSSC 1994], with internal nodes SM, QM and SS; C2MOS Latch [Suzuki et al, JSSC 1973]]
Feedback for pseudo-static operation
State node SS protection in PPC (decoupled Q)
EE371 Lecture 7 27
Master-Slave Latches
Positive setup times
Two clock phases:
» distributed globally
» generated locally
Small penalty in delay for incorporating MUX
Some circuit tricks needed to reduce the overall delay
EE371 Lecture 7 28
High-Performance Designs: Flip-Flop Examples 1/2
[Figure: two schematics. HLFF (Hybrid Latch-Flip-Flop) [Partovi et al, JSSC 1996]; SDFF (Semi-Dynamic Flip-Flop) [F.Klass et al, JSSC 1998]]
Pulse-generating first stage (precharge-evaluate)
Keepers for pseudo-static operation
Output load decoupled from internal nodes
EE371 Lecture 7 29
High-Performance Designs: Flip-Flop Examples 2/2
SAFF (Sense-Amplifier-Based Flip-Flop)
Fully-differential circuit
First stage sense-amp can take reduced-swing inputs
2nd stage is capturing latch
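delay to Q and !Q not equal
[Matsui et al, 1994]
[Figure: SAFF schematic with inputs Clk, D and complementary D; the first stage drives the S and R inputs of the capturing latch]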
EE371 Lecture 7 30
SAFF with Improved S-R Latch
The first stage is the unchanged sense amplifier
Second stage is sized to provide maximum switching speed
Driver transistors are large
Keeper transistors are small and disengaged during transitions
[Nikolic & Stojanovic ISSCC ’99]
EE371 Lecture 7 31
Flip-Flops
First stage is a pulse generator
» generates a pulse (glitch) on a rising edge of the clock
Second stage is a latch
» captures the pulse generated in the first stage
Pulse generation potentially results in a negative setup time and soft-edge property
Must check for hold time violations
Note: power is always consumed in the clocked pulse generator
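The two-stage structure can be sketched behaviorally: a pulse is formed from the clock and a delayed copy of it, and a latch is transparent only during that pulse. The model below is a simplified illustration with hypothetical names (`pulsed_flop`, `pulse_width`) and sample-based timing; it is not one of the circuits shown in the lecture.

```python
def pulsed_flop(clock, data, pulse_width=2):
    """Behavioral pulse-generator + capturing-latch model.

    pulse = clk AND NOT(clk delayed by pulse_width samples); the output
    latch is transparent only while the pulse is high.
    """
    q = 0
    pulse_trace, q_trace = [], []
    for i, (clk, d) in enumerate(zip(clock, data)):
        delayed = clock[i - pulse_width] if i >= pulse_width else 0
        pulse = clk & (1 - delayed)
        if pulse:
            q = d  # captured anywhere inside the pulse (soft edge)
        pulse_trace.append(pulse)
        q_trace.append(q)
    return pulse_trace, q_trace


# Hypothetical waveforms: data rises one sample AFTER the clock edge.
clock = [0, 0, 1, 1, 1, 1, 0, 0]
data = [0, 0, 0, 1, 1, 0, 0, 0]
pulse_trace, q_trace = pulsed_flop(clock, data)
print("pulse:", pulse_trace)
print("Q:    ", q_trace)
```

Data arriving one sample after the rising edge is still captured, which is the negative setup time; the flip side is that data must stay stable until the pulse ends, hence the hold-time check.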
EE371 Lecture 7 32
Delay Comparison: M-S Latches and Flip-Flops
0.18um, high load (14 min inverters)
[Figure: bar chart, Min D-Q Delay Comparison; Delay [FO4] (0.0 to 5.0) for MSL, C2MOS, HLFF, SDFF, SAFF and M-SAFF]
Flip-Flops are faster
EE371 Lecture 7 33
Energy Comparison: M-S Latches and Flip-Flops
0.18um, high load (14 min inverters)
[Figure: stacked bar chart, Energy breakdown (50% activity); Energy [fJ] (0 to 120) for MSL, C2MOS, HLFF, SDFF, SAFF and M-SAFF, split into Ext. clock, Ext. data, Int. clock and Internal non-clk]
Latches are lower energy
EE371 Lecture 7 34
Summary
CSE topology depends on target application
» Master-Slave Latches for low-energy
» Flip-Flops & Pulsed latches for high-performance
Delay is critical in high-speed systems, although minimizing Clk energy is of increasing importance
Methods for reducing Clk energy
» Clock gating (more effective in high-performance than in low-energy designs)
» Reduced-swing clocking
» Dual-edge clocking
EE371 Lecture 7 35
Microprocessor Examples
EE371 Lecture 7 36
Sun UltraSPARC-III
[Klass, 1998]
Basic Flip-Flop (SDFF)
[Figure: SDFF schematic with inputs Clk and D, delayed clock Clk1, internal node S, a NAND gate, transistors MN1-MN5 and MP1-MP2, inverters Inv1-Inv6, and complementary outputs Q]
EE371 Lecture 7 37
Sun UltraSPARC-III Dynamic Flip-Flops
[Figure: two dynamic flip-flop schematics, Single-Ended and Differential, each with inputs Clk and D, internal node S and output Q]
EE371 Lecture 7 38
Sun UltraSPARC-III
2-input XOR
Flip-Flops with Logic Embedding
EE371 Lecture 7 39
Latches used in 21064 Alpha
[Gronowski et al, 1998]
21064 modified TSPC Latches
[Figure: two modified TSPC latch schematics with inputs D and Clk, internal node X and output Q]
EE371 Lecture 7 40
Logic Embedding in 21064/21164 Alpha
[Figure: latches with embedded logic. 1 level of logic (21064 Alpha), with two data inputs; 2 levels of logic (21164 Alpha), with four data inputs]
EE371 Lecture 7 41
Flip-Flop used in 21264 Alpha
Differential
Clk
D
EE371 Lecture 7 42
IBM Processors: Level-Sensitive Scan Design (LSSD)
[Eichelberger 1983]
Hazard-free level-sensitive polarity-hold latch
[Figure: latch schematic with inputs Data, +Clock and -Clock and output Out]
EE371 Lecture 7 43
The PowerPC 603 Master-Slave Latch
[Gerosa et al, 1994]
[Figure: master-slave latch schematic with inputs Din and SCANin, clocks C1, C2 and ACLK, and output Dout]
EE371 Lecture 7 44
IBM Power4™ processor
[Warnock et al, 2002]
Scannable Split Latch with LSSD Capability
[Figure: split latch (L1, L2) with Data_in, Scan_in, clocks c1, c2 and scan_clk, an attached logic gate, and Data_out]
EE371 Lecture 7 45
Low-Energy Issues:
Clock & Clocked elements burn more than 60% of the processor power
EE371 Lecture 7 46
Low-Swing Clocking: Clock Driver Re-design
[Figure: half-swing clock driver schematic (capacitances Cn1, Cp1, Cn2, Cp2, CA and CB; levels VDD, H-VDD and GND) and waveforms of the clocks CPT, CPB, CNT and CNB relative to Clk]
50% power reduction with half-swing clock (minus some penalty in clock drivers)
[H. Kojima, JSSC, April 1995]
EE371 Lecture 7 47
Low-Swing Clocking: CSE Re-design
[Figure: low-swing clock driver options (a), (b) and (c), producing Clk(VDD-nVth), Clk(VDD-Vth) and Clk(VDD-Low), and a CSE with Vwell > VDD]
E(a),(b) ~ VDD(VDD-Vth)
E(c) ~ (VDD-Vth)^2
PMOS does not fully turn off
[H. Kawaguchi and T. Sakurai, JSSC, May 1998]
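The two energy proportionalities can be compared against a full-swing clock with a few lines of arithmetic. The sketch assumes the clock energy scales as supply voltage times swing; the function name and the VDD and Vth values are hypothetical.

```python
def relative_clock_energy(vdd, v_swing, low_supply):
    """Clock energy relative to a full-swing clock (proportional to VDD^2).

    low_supply=False: charge is drawn from VDD, so E ~ VDD * Vswing
                      (cases (a) and (b)).
    low_supply=True:  charge is drawn from a supply equal to the swing,
                      so E ~ Vswing^2 (case (c)).
    """
    supply = v_swing if low_supply else vdd
    return supply * v_swing / (vdd * vdd)


# Hypothetical technology values, for illustration only.
VDD, VTH = 1.8, 0.45
swing = VDD - VTH
e_ab = relative_clock_energy(VDD, swing, low_supply=False)
e_c = relative_clock_energy(VDD, swing, low_supply=True)
print(f"E(a),(b) = {e_ab:.2f} of full swing, E(c) = {e_c:.2f} of full swing")
```

Under these assumptions the saving is linear in the swing when charge still comes from VDD and quadratic only when a separate low supply is used.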
EE371 Lecture 7 48
Low-Swing Clocking: N-only CSEs
[Figure: master-slave latch schematic with nodes D, SM, QM, SS and Q and transistors N1-N4, clocked by Clk]
N-only clocked transistors, M-S Latch Example (N1 and N2 improve pull-up on SM)
[D.Markovic, J.Tschanz, V.De, 2001, patent pending]
EE371 Lecture 7 49
Clock Gating: Global Clock Gating
Used to save clocking energy when data activity is low
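[Figure: (a) Time-mux (no gating!): a 2:1 mux controlled by Load selects either In or the fed-back Q as the D input of REG, which is clocked by Clk every cycle; (b) Global Clk Gating: REG takes In directly and its clock is Clk gated by EN]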
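The difference between the two schemes can be sketched by counting how many clock edges actually reach the register. This is a behavioral illustration only; the function name `run_register` and the input sequences are hypothetical.

```python
def run_register(inputs, loads, gated):
    """Return (q_trace, clock_edges_seen_by_register).

    gated=False: time-mux scheme, the register is clocked every cycle and
                 a mux recirculates Q when load is 0.
    gated=True:  clock gating, the register only receives a clock edge
                 in cycles where load (EN) is 1.
    """
    q, clock_edges, trace = 0, 0, []
    for d, load in zip(inputs, loads):
        if gated:
            if load:
                clock_edges += 1
                q = d
        else:
            clock_edges += 1
            q = d if load else q
        trace.append(q)
    return trace, clock_edges


# Hypothetical activity: new data is loaded in 2 of 6 cycles.
inputs = [5, 7, 7, 9, 9, 9]
loads = [1, 0, 0, 1, 0, 0]
mux_trace, mux_edges = run_register(inputs, loads, gated=False)
gated_trace, gated_edges = run_register(inputs, loads, gated=True)
print("time-mux:", mux_trace, "edges:", mux_edges)
print("gated:   ", gated_trace, "edges:", gated_edges)
```

Both schemes produce the same register contents, but the gated register sees a clock edge only in the cycles where it loads, which is where the clocking energy is saved when data activity is low.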
EE371 Lecture 7 50
Clock Gating: Local Clock Gating
Used to save clocking energy when data activity is low
[Figure: flip-flop (D, Q, internal node QM) with a Data-Transition Look-Ahead circuit and a Clock Control Pulse Generator that produces the internal clock pulse CP from Clk]
[M. Nogawa and Y. Ohtomo, JSSC, May 1998]
EE371 Lecture 7 51
Dual-Edge Triggering: Latch-Mux
Used to save clocking energy regardless of data activity!
[Figure: Concept: two latches clocked on opposite phases of Clk feed a 2:1 mux selected by Clk; Circuit Example: transistor-level implementation]
[R.P. Llopis and M. Sachdev, ISLPED Aug. 1996]
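The latch-mux concept can be sketched behaviorally: one latch is transparent while the clock is high, the other while it is low, and the mux always selects the latch that is currently holding. The model below is an idealized illustration with a hypothetical function name and integer data tokens.

```python
def dual_edge_latch_mux(clock, data):
    """Idealized latch-mux dual-edge-triggered element; returns the Q trace."""
    latch_hi = latch_lo = 0  # transparent when clk is 1 / when clk is 0
    q_trace = []
    for clk, d in zip(clock, data):
        if clk:
            latch_hi = d
        else:
            latch_lo = d
        # The mux picks the opaque latch, i.e. the value held at the last edge.
        q_trace.append(latch_lo if clk else latch_hi)
    return q_trace


# Hypothetical stimulus: the clock toggles every sample, data are tokens.
clock = [0, 1, 0, 1, 0, 1]
data = [1, 2, 3, 4, 5, 6]
q_trace = dual_edge_latch_mux(clock, data)
print("Q:", q_trace)
```

A new value appears at the output after every clock transition, rising or falling, so the same data rate is reached with half the clock frequency.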
EE371 Lecture 7 52
Dual-Edge Triggering: Pulsed-Latch
[Figure: Concept: a Single-Edge pulsed latch with one pulse generator on Clk versus a Dual-Edge pulsed latch with pulse generators for both clock edges; Circuit Example (a), (b) using the delayed clocks Clk1 and Clk2]
EE371 Lecture 7 53
Dual-Edge Triggered Flip-Flop
Pulse-generating latches trigger capturing latch
[Figure: Concept: two first-stage pulse-generating latches (PG Latch X and PG Latch Y) drive the S and R inputs of a capturing latch (CL); Circuit Example with nodes SX and SY and delayed clocks Clk1 and Clk2]
[N.Nedovic, V.G.Oklobdzija, ESSCIRC 2002]
EE371 Lecture 7 54
Design goals
Apply
» Small clock load
» Short direct path
» Reduced node swing
» Low-power feedback
» Pulsed design
» Optimization of both Master and Slave latch
Avoid
» Positive setup time
» Sensitivity to clock slope and skew
» Dynamic (floating) nodes
» Dynamic Master latch
Conduct Energy - Delay optimizations
Take into account all sources of power dissipation
ALWAYS use Clk-Q + setup time for max delay
For more details on storage elements check prof. Oklobdzija’s ISSCC’02 workshop:
http://www.ece.ucdavis.edu/acsel under Presentations
EE371 Lecture 7 55
What to Expect in the Future?
Incorporating logic into the CSE
Absorbing clock skew
Pipeline boundaries will start to blur – pulsed latches
Latch-less domino style clocking, signals used to clock
Synchronous design only in a limited domain
Asynchronous communication between synchronous domains