A 240ps 64b Carry-Lookahead Adder in 90nm CMOS
Faezeh Montazeri
[email protected] (ece.ut.ac.ir)
Advanced VLSI Course Presentation
University of Tehran
December 2006
Based on:
A 240ps 64b Carry-Lookahead Adder in 90nm CMOS
Sean Kao, Radu Zlatanovici, Borivoje Nikolić
University of California, Berkeley
2
What Is an Optimal Adder?
Optimal adder:
• Minimum delay for given energy
• Minimum energy for given delay
[Figure: 64-bit adders on IEEE Xplore, 1995-2005. Normalized energy [r.u.] vs. normalized delay [90nm 1V FO4]; points grouped by technology: 500 nm, 350 nm, 250 nm, 180 nm, 130 nm, 90 nm.]
[1]
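The two optimality conditions above describe the same set of designs: those that no other design beats in both delay and energy. A minimal sketch of that filter follows. The design names and numbers are invented for illustration and are not taken from the slide's plot.

```python
from typing import List, Tuple

Design = Tuple[str, float, float]  # (name, delay, energy); units are arbitrary


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep the designs that no other design beats in both delay and energy."""
    front = []
    best_energy = float("inf")
    # Sweep from fastest to slowest; a slower design is worth keeping
    # only if it uses strictly less energy than every faster one.
    for name, delay, energy in sorted(designs, key=lambda d: (d[1], d[2])):
        if energy < best_energy:
            front.append((name, delay, energy))
            best_energy = energy
    return front


# Hypothetical design points, invented for illustration only.
designs = [
    ("A", 8.0, 40.0),
    ("B", 9.0, 25.0),
    ("C", 9.5, 30.0),   # slower than B and uses more energy
    ("D", 12.0, 12.0),
    ("E", 13.0, 15.0),  # slower than D and uses more energy
]
front = pareto_front(designs)
front_names = [name for name, _, _ in front]
print(front_names)
```

Every point on the resulting front is both a minimum-delay design for its energy and a minimum-energy design for its delay.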
3
This Work
Multi-issue 64-bit microprocessor environment:
• Optimize a set of representative 64-bit adders in the energy-delay space
• Analyze the design tradeoffs
• Implement the optimal adder in 1.0V 90nm GP CMOS
4
Outline
• Energy – delay optimization
• Design tradeoffs for 64-bit adders
• Test chip implementation
• Measured results
• Summary
5
Energy-Delay Optimization
• Goal: obtain the energy-delay optimal adder
• CAD tool: optimize custom digital circuits in the energy-delay space [3]
[Figure: energy vs. delay curves for a static CLA adder and a domino CLA adder.]
[1]
6
Circuit Optimization Framework
[Block diagram: an optimization core made of an optimizer (Matlab) and a static timer (C++), which exchange design variables and delay/energy values. Inputs: models, netlist, optimization goal, variables. Output: optimal design.]
[1]
7
Adder Optimization Setup
Minimize DELAY subject to maximum ENERGY
Constraints:
• CL = 27 fF
• CIN ≤ 27 fF
• tSLOPE ≤ 100 ps
[Block diagram: inputs A, B, Cin; generate subtree and propagate subtree producing G64; carry; sum precompute producing S0 and S1; MUX producing SUM. The legend distinguishes the critical path from the non-critical path.]
[1]
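To make "minimize delay subject to maximum energy" concrete, here is a toy sketch. It is not the optimizer from [1] or [3]. It sizes the second stage of a two-inverter chain by grid search, using a logical-effort delay model without parasitics and total switched capacitance as a stand-in for energy. The function names, the unit input capacitance, and the load of 64 units are all assumptions chosen for illustration.

```python
from typing import Optional, Tuple


def chain_delay(x2: float, c_in: float, c_load: float) -> float:
    """Delay of a 2-inverter chain in units of tau (parasitics ignored)."""
    return x2 / c_in + c_load / x2


def chain_energy(x2: float, c_in: float, c_load: float) -> float:
    """Total switched capacitance, used here as a proxy for energy."""
    return c_in + x2 + c_load


def min_delay_under_energy(
    e_max: float, c_in: float = 1.0, c_load: float = 64.0, steps: int = 2000
) -> Optional[Tuple[float, float]]:
    """Grid-search the second-stage size x2 for minimum delay with energy <= e_max.

    Returns (x2, delay), or None when no grid point meets the energy cap.
    """
    best = None
    for k in range(1, steps + 1):
        x2 = k * c_load / steps  # candidate second-stage input capacitance
        if chain_energy(x2, c_in, c_load) > e_max:
            continue  # violates the energy constraint
        d = chain_delay(x2, c_in, c_load)
        if best is None or d < best[1]:
            best = (x2, d)
    return best


loose = min_delay_under_energy(100.0)  # cap does not bind
tight = min_delay_under_energy(69.0)   # cap forces a smaller second stage
print(loose, tight)
```

Sweeping `e_max` and recording the delay at each step traces out an energy-delay tradeoff curve for this toy circuit.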
8
CLA: Full Tree Comparison
Radix-2:
• 6 stages
• Moderate branching
Radix-4:
• 3 stages
• Larger branching
Radix-4 closer to optimum number of stages
[Figure: energy [pJ] vs. delay [FO4] for R2 CLA and R4 CLA.]
[1]
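The stage counts above follow from the tree depth needed to span 64 bits, which is log base radix of 64. A short sketch of that count, where the function name is an illustrative choice:

```python
def carry_tree_stages(bits: int, radix: int) -> int:
    """Smallest number of radix-`radix` merge stages whose span covers `bits` bits."""
    stages, span = 0, 1
    while span < bits:
        span *= radix  # each stage merges `radix` groups from the stage before
        stages += 1
    return stages


r2_stages = carry_tree_stages(64, 2)
r4_stages = carry_tree_stages(64, 4)
print(r2_stages, r4_stages)
```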
9
CLA vs. Ling
Conventional CLA
• Higher stack in first stage
• Simple sum precompute
$G_{3:0} = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$
$S_i = a_i \oplus b_i \oplus G_{i-1}$
Ling CLA
• Lower stack in first stage
• Complex sum precompute
• Higher speed
$H_{3:0} = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0$
$S_i = \overline{H_{i-1}}\,(a_i \oplus b_i) + H_{i-1}\,(a_i \oplus b_i \oplus (a_{i-1} + b_{i-1}))$
[Figure: energy [pJ] vs. delay [FO4] for R2 Ling, R2 CLA, R4 Ling, R4 CLA.]
[1]
[2]
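The relation between the conventional group generate G and Ling's pseudo-carry H can be checked exhaustively for a 4-bit group. The sketch below assumes the usual definitions g = a AND b, p = a XOR b, t = a OR b, and a carry-in of zero; the helper names are illustrative. It checks that G equals the true carry-out, that G = t3 · H, and that selecting between the two precomputed sums with H gives the correct sum bit.

```python
from itertools import product

N = 4  # width of the group being checked


def bit(x: int, i: int) -> int:
    return (x >> i) & 1


def ling_matches_reference(a: int, b: int) -> bool:
    g = [bit(a, i) & bit(b, i) for i in range(N)]  # generate
    p = [bit(a, i) ^ bit(b, i) for i in range(N)]  # propagate (XOR)
    t = [bit(a, i) | bit(b, i) for i in range(N)]  # transmit (OR)
    total = a + b

    # Conventional group generate and Ling pseudo-carry for bits 3..0.
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    H = g[3] | g[2] | (t[2] & g[1]) | (t[2] & t[1] & g[0])
    ok = G == bit(total, N) and G == (t[3] & H)

    # Reference ripple carries: carry_out[j] is the carry out of bit j.
    carry_out, c = [], 0
    for j in range(N):
        c = g[j] | (p[j] & c)
        carry_out.append(c)

    # Ling sum: H_{i-1} selects between two precomputed values of S_i.
    for i in range(1, N):
        c_below = carry_out[i - 2] if i >= 2 else 0
        h = g[i - 1] | c_below                      # pseudo-carry H_{i-1}
        s = p[i] if h == 0 else p[i] ^ t[i - 1]     # 2:1 select
        ok = ok and s == bit(total, i)
    return ok


all_ok = all(ling_matches_reference(a, b) for a, b in product(range(16), repeat=2))
print(all_ok)
```

The H expression has no term with more than three inputs, while the G expression has a four-input term; that is the "lower stack in first stage" listed above, paid for by the extra select logic in the sum.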
10
Full vs. Sparse Comparison
[Figure: energy [pJ] vs. delay [FO4] for Ling CLA adders: R2 FULL, R4 FULL. Tree diagrams labeled FULL and SP2.]
[1]
11
Full vs. Sparse Comparison
[Figure: energy [pJ] vs. delay [FO4] for Ling CLA adders: R2 FULL, R2 SP2, R4 FULL, R4 SP2. Tree diagrams labeled FULL and SP2.]
      SP2
R2    +
R4    +
[1]
12
Full vs. Sparse Comparison
Sparseness benefits adders with large carry trees
[Figure: energy [pJ] vs. delay [FO4] for Ling CLA adders: R2 FULL, R2 SP2, R2 SP4, R4 FULL, R4 SP2, R4 SP4. Tree diagrams labeled FULL and SP4.]
      SP2   SP4
R2    +     +
R4    +     -
[1]
13
Optimal Adder
• Ling’s equations
• Radix-4 sparse-2
• Domino carry tree
• Static sum-precompute
• Delay of fastest adder: 7.3 FO4
[Figure: energy [pJ] vs. delay [FO4] for R2 FULL, R2 SP2, R2 SP4, R4 FULL, R4 SP2, R4 SP4.]
[1]
14
Radix-4 Sparse-2 Carry Tree
• Computes every other Ling pseudo-carry: H0, H2, H4 …
• Each output selects two sums
[Diagram: carry tree spanning inputs (A0, B0) to (A63, B63) plus Cin, with outputs s0 to s63 and Cout. Stages: G/T, H4/I4, H16/I16, H64, SUMSEL. Legend: G/T gates, H gate, H/I gates, SUMSEL MUX.]
[1]
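A behavioral sketch of the sparse-2 idea: the tree supplies a carry only at every other bit position, and each of those carries selects between two precomputed 2-bit sums. For simplicity this model uses true carries rather than Ling pseudo-carries, and it computes each carry with ordinary integer arithmetic instead of a parallel tree. It illustrates the selection scheme only, not the circuit in [1].

```python
import random

WIDTH = 64
MASK = (1 << WIDTH) - 1


def sparse2_add(a: int, b: int, cin: int = 0) -> int:
    """64-bit add where one carry per 2-bit group selects a precomputed sum."""
    result = 0
    for lo in range(0, WIDTH, 2):
        # Carry into bit `lo`. A real carry tree would produce this in parallel.
        low_mask = (1 << lo) - 1
        carry = ((a & low_mask) + (b & low_mask) + cin) >> lo
        pa, pb = (a >> lo) & 3, (b >> lo) & 3
        s0 = (pa + pb) & 3       # precomputed sum assuming carry-in 0
        s1 = (pa + pb + 1) & 3   # precomputed sum assuming carry-in 1
        result |= (s1 if carry else s0) << lo  # sum-select MUX
    return result


random.seed(0)
samples = [(random.getrandbits(WIDTH), random.getrandbits(WIDTH), random.getrandbits(1))
           for _ in range(1000)]
all_match = all(sparse2_add(a, b, c) == ((a + b + c) & MASK) for a, b, c in samples)
print(all_match)
```

Only 32 carries are needed for 64 sum bits, which is why the carry tree shrinks; the cost is the duplicated sum precompute and the select MUX.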
15
Adder Core Block Diagram
• Critical paths implemented in clock-delayed domino
• Non-critical paths implemented in static
• At-speed BIST
[Block diagram. Blocks: scan chain (scan_in) supplying inputs, TG, H4/I4, H16/I16, H64, sum precompute (S0, S1, precomputed sums), sum select MUX (sum), clock generator (pc1, pc2, pc3, pc4, psel), buffer, comparator, MUX, output FF, output scan chain. Signals H64 and H64' are marked. Legend: footed domino, footless domino, static CMOS, hard edge.]
[1]
16
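Timing Diagram
• 20 ps margin on all edges; adjustable hard edges
• Delay spread places precharge in critical path
[Timing diagram: clock phases pc1, pc2, pc3, pc4, psel and signals H64, H64' over one TCYCLE, with the hard edge marked.]
Duty cycle, in the order listed on the slide:
pc1    24%
pc2    43%
pc3    53%
pc4    53%
psel   45%
[1]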
17
Layout Floorplan
• Bitslice height: 24 metal tracks
• Aligned clock lines
• Sum precompute occupies space freed by sparse carry tree
[Floorplan diagram: bitslices 24 tracks high, with clock lines pc1, pc2, pc3, pc4, psel. Cell labels: TG, H4, I4, H16, I16, H64, J0, J1, K1, XOR2, SUM SELECT. Legend: every bitslice, sparse-2 carry tree, sparse-2 sum precompute.]
[1]
18
90 nm Test Chip
• 90 nm GP 7M 1P
• SVT transistors
• VDD = 1V
• 8 adder cores + test circuitry
• Core 1: this work
• Cores 2-8: Supply noise measurements and supply grid experiments [4].
• Adder core size: 417 x 75 μm²
[Die photo: 1.7 mm x 1.6 mm chip showing adder core 1, cores 2 to 8, TEST IN, TEST OUT, and CK GEN.]
[1]
19
[1]
20
Chip Packaging
Chip-on-board:
• Bond wires 60% shorter
• Cleaner supply: 10 ps shorter delays
[Images labeled: Advance Program, Digest.]
[1]
21
Measured Results: Delay
CHIP-ON-BOARD:
• VDD = 1 V
– Average: 240 ps
– Fastest: 226 ps
• VDD = 1.3 V
– Average: 180 ps
Davg = 7.5 FO4
[1]
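The FO4 figure can be tied back to the picosecond measurements with simple arithmetic. This sketch assumes that Davg = 7.5 FO4 refers to the 240 ps average at VDD = 1 V, which the slide implies but does not state outright; the variable names are illustrative.

```python
avg_delay_ps = 240.0     # measured average at VDD = 1 V
avg_delay_fo4 = 7.5      # Davg from the slide
fastest_delay_ps = 226.0 # fastest measured chip at VDD = 1 V

# Implied FO4 inverter delay under the assumption stated above.
fo4_ps = avg_delay_ps / avg_delay_fo4

# The fastest measured chip expressed in the same FO4 units.
fastest_fo4 = fastest_delay_ps / fo4_ps

print(fo4_ps, fastest_fo4)
```

Under that assumption one FO4 is 32 ps, and the fastest chip comes in just above 7 FO4.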
22
Measured Results: Power
VDD = 1V: Pmax = 260 mW
VDD = 1.3V: Pmax = 606 mW
[Chart: power breakdown by adder core, clock generator, BIST, and leakage.]
[1]
23
Conclusion
• 90 nm GP 7M 1P
• SVT transistors
• VDD = 1V
• 8 adder cores + test circuitry
• Adder core size: 417 x 75 μm²
24
Summary
• Ling radix-4 sparse-2 domino carry tree
• 90nm GP CMOS: 240ps, 260mW @1V
[Figure: 64-bit adders on IEEE Xplore, 1995-2005. Normalized energy [r.u.] vs. normalized delay [90nm 1V FO4]; points grouped by technology (500 nm, 350 nm, 250 nm, 180 nm, 130 nm, 90 nm), plus this work.]
[1]
25
References
• [1] S. Kao, R. Zlatanovici, B. Nikolić, “A 240ps 64-bit Carry-Lookahead Adder in 90nm CMOS,” ISSCC 2006, Feb. 2006.
• [2] H. Ling, “High Speed Binary Adder,” IBM J. R&D, vol. 25, no. 3, pp. 156-166, May 1981.
• [3] R. Zlatanovici, B. Nikolić, “Power-Performance Optimization for Custom Digital Circuits,” Proc. PATMOS, pp. 404-414, Sept. 2005.
• [4] V. Abramzon, E. Alon, M. Horowitz, Stanford University.