VLSI Arithmetic: Adders & Multipliers

Prof. Vojin G. Oklobdzija

University of California

http://www.ece.ucdavis.edu/acsel

Oklobdzija 2004 Computer Arithmetic 2

Introduction

• Digital Computer Arithmetic belongs to Computer Architecture; however, it is also an aspect of logic design.

• The objective of Computer Arithmetic is to develop appropriate algorithms that utilize the available hardware in the most efficient way.

• Ultimately, speed, power and chip area are the most often used measures, making a strong link between the algorithms and technology of implementation.


Basic Operations

• Addition

• Multiplication

• Multiply-Add

• Division

• Evaluation of Functions

• Multi-Media

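Addition of Binary Numbers

Full Adder. The full adder is the fundamental building block of most arithmetic circuits.

[Figure: full-adder block with inputs a_i, b_i, C_in and outputs s_i, C_out]

The sum and carry outputs are described as:

$c_{i+1} = \bar{a}_i b_i c_i + a_i \bar{b}_i c_i + a_i b_i \bar{c}_i + a_i b_i c_i = a_i b_i + a_i c_i + b_i c_i$

$s_i = \bar{a}_i \bar{b}_i c_i + \bar{a}_i b_i \bar{c}_i + a_i \bar{b}_i \bar{c}_i + a_i b_i c_i$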


Addition of Binary Numbers

Inputs        Outputs
ci  ai  bi    si  ci+1
0   0   0     0   0
0   0   1     1   0      Propagate
0   1   0     1   0      Propagate
0   1   1     0   1      Generate
1   0   0     1   0
1   0   1     0   1      Propagate
1   1   0     0   1      Propagate
1   1   1     1   1      Generate


Full-Adder Implementation

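Full-adder operation is defined by the equations:

$s_i = \bar{a}_i \bar{b}_i c_i + \bar{a}_i b_i \bar{c}_i + a_i \bar{b}_i \bar{c}_i + a_i b_i c_i = a_i \oplus b_i \oplus c_i = p_i \oplus c_i$

$c_{i+1} = \bar{a}_i b_i c_i + a_i \bar{b}_i c_i + a_i b_i = g_i + p_i c_i$

Carry-propagate $p_i$ and carry-generate $g_i$:

$p_i = a_i \oplus b_i$

$g_i = a_i \cdot b_i$

A one-bit adder could be implemented as shown.

[Figure: one-bit adder with inputs a_i, b_i, c_in and outputs s_i, c_out]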


High-Speed Addition

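$s_i = p_i \oplus c_i$

$c_{i+1} = g_i + p_i c_i$

$p_i = a_i \oplus b_i$

$g_i = a_i \cdot b_i$

A one-bit adder could be implemented more efficiently with a multiplexer, because the MUX is faster.

[Figure: multiplexer-based one-bit adder with inputs a_i, b_i, c_in, a 0/1 multiplexer on the carry path, and outputs s_i, c_out]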
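As a quick check of the propagate/generate form, the sketch below compares it with plain integer addition for all eight input combinations. The function names are illustrative, and the multiplexer wiring (select on p, pass a when p = 0) is an assumption about the figure, not something the slide states.

```python
from itertools import product


def full_adder_reference(a, b, c):
    """Sum and carry obtained from ordinary integer addition."""
    total = a + b + c
    return total & 1, total >> 1


def full_adder_pg(a, b, c):
    """Sum and carry from s = p XOR c and c_out = g OR (p AND c)."""
    p = a ^ b
    g = a & b
    return p ^ c, g | (p & c)


def full_adder_mux(a, b, c):
    """Carry path as a 2:1 multiplexer selected by p (assumed wiring)."""
    p = a ^ b
    # When p == 0 the inputs are equal, so the carry-out equals a (which is g).
    carry = c if p else a
    return p ^ c, carry


# Exhaustive comparison over the truth table.
for a, b, c in product((0, 1), repeat=3):
    assert full_adder_pg(a, b, c) == full_adder_reference(a, b, c)
    assert full_adder_mux(a, b, c) == full_adder_reference(a, b, c)
```

Both variants agree with the truth table; the multiplexer version simply makes explicit that the carry-in is only forwarded when p = 1.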


The Ripple-Carry Adder

[Figure: four full adders (FA) chained from bit 0 to bit 3; inputs A0..A3, B0..B3 and carry-in Ci,0; outputs S0..S3; carries Co,0 (= Ci,1), Co,1, Co,2, Co,3]

Worst-case delay is linear in the number of bits:

$t_{adder} = (N - 1)\,t_{carry} + t_{sum}$

$t_d = O(N)$

Goal: Make the fastest possible carry path circuit

From Rabaey
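A minimal Python sketch of the ripple-carry scheme follows; it models the bit-by-bit carry hand-off and the linear delay expression. The function names and the unit default delays are assumptions made for illustration.

```python
def ripple_carry_add(a, b, n_bits, carry_in=0):
    """Add two n_bits-wide unsigned integers one bit at a time.

    Returns (sum, carry_out). The carry is handed from bit i to bit i+1,
    which is why the worst-case delay grows linearly with n_bits.
    """
    carry = carry_in
    result = 0
    for i in range(n_bits):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        p_i = a_i ^ b_i                 # propagate
        g_i = a_i & b_i                 # generate
        result |= (p_i ^ carry) << i    # s_i = p_i XOR c_i
        carry = g_i | (p_i & carry)     # c_{i+1} = g_i + p_i c_i
    return result, carry


def ripple_delay(n_bits, t_carry=1.0, t_sum=1.0):
    """Worst-case delay (N - 1) * t_carry + t_sum (unit delays assumed)."""
    return (n_bits - 1) * t_carry + t_sum
```

For example, `ripple_carry_add(0b1011, 0b0110, 4)` yields a 4-bit sum of 1 with a carry-out of 1, since 11 + 6 = 17.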


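Inversion Property

[Figure: two full-adder (FA) blocks, one with true inputs and outputs, the other with all inputs and outputs inverted]

$\bar{S}(A, B, C_i) = S(\bar{A}, \bar{B}, \bar{C}_i)$

$\bar{C}_o(A, B, C_i) = C_o(\bar{A}, \bar{B}, \bar{C}_i)$

From Rabaey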
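The inversion property can be confirmed exhaustively. The short sketch below does so with an illustrative helper, `inversion_property_holds`, which is not part of the original slides.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry_out) of a one-bit full adder."""
    total = a + b + c
    return total & 1, total >> 1


def inversion_property_holds():
    """Check that inverting all inputs inverts both outputs."""
    for a, b, c in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c)
        s_inv, c_out_inv = full_adder(1 - a, 1 - b, 1 - c)
        if s_inv != 1 - s or c_out_inv != 1 - c_out:
            return False
    return True
```

This is what allows alternate cells in the carry chain to work on inverted signals without extra inverters.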


Minimize Critical Path by Reducing Inverting Stages

[Figure: 4-bit ripple-carry adder built from FA’ cells, alternating even cells and odd cells; inputs A0..A3, B0..B3, Ci,0; outputs S0..S3; carries Co,0 to Co,3]

Exploit Inversion Property

Note: need 2 different types of cells

From Rabaey


Ripple Carry Adder

Carry-Chain of an RCA implemented using multiplexer from the standard cell library:

[Figure: multiplexer carry chain for bits i, i+1, i+2 with inputs a, b, carries c_in, c_i, c_i+1, c_out, and sums s_i, s_i+1, s_i+2; the critical path is marked]

Oklobdzija, ISCAS’88


Manchester Carry-Chain Realization of the Carry Path

• Simple and very popular scheme for implementation of carry signal path

[Figure: Manchester carry chain between Carry in and Carry out, with a propagate device, a generate device, and a predischarge & kill device, supplied from Vdd]

Original design: T. Kilburn, D. B. G. Edwards, D. Aspinall, "Parallel Addition in Digital Computers: A New Fast "Carry" Circuit", Proceedings of IEE, Vol. 106, pt. B, p. 464, September 1959.


Manchester Carry Chain (CMOS)

[Figure: CMOS Manchester carry chain starting at Ci,0, with propagate inputs P0..P4, generate inputs G0..G4, and VDD]

Kilburn, et al, IEE Proc, 1959.

• Implement P with pass-transistors

• Implement G with pull-up, kill (delete) with pull-down

• Use dynamic logic to reduce the complexity and speed up


Pass-Transistor Realization in DPL

[Figure: DPL full adder. Sum path: XOR/XNOR, multiplexer, and buffer stages producing S. Carry path: AND/NAND and OR/NOR gates, multiplexer, and buffer stages producing CO. Inputs A, B, C and their complements; supply VCC]


Carry-Skip Adder

MacSorley, Proc IRE 1/61

Lehman, Burla, IRE Trans on Comp, 12/61

Carry-Skip Adder

[Figure: top, four full adders (FA) in a ripple chain with propagate/generate signals and carries Ci,0, Co,0 .. Co,3; bottom, the same chain with a bypass multiplexer at the output that selects between Co,3 and Ci,0 under control of BP]

BP = P0·P1·P2·P3

Idea: If (P0 and P1 and P2 and P3 = 1) then Co,3 = Ci,0, else “kill” or “generate”.

Bypass

From Rabaey


Carry-Skip Adder: N bits, k bits/group, r = N/k groups

[Figure: r ripple-carry groups of k bits each between C_in and C_out, with group propagate signals P_0 .. P_{r-1} driving AND-OR skip gates]

Critical path delay = 2(k-1) + (N/k - 2)

Carry-Skip Adder

$t_d = 2(k - 1)\,t_{RCA} + (N/k - 2)\,t_{SKIP}$

[Figure: delay t_p versus N for the ripple adder and the bypass adder; crossover marked at 4..8]
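To see how the group size k trades ripple delay against skip delay, the sketch below evaluates the delay expression for every group size that divides N evenly. Unit values for t_RCA and t_SKIP and the restriction to at least two equal groups are simplifying assumptions; the function names are illustrative.

```python
def carry_skip_delay(n_bits, k, t_rca=1, t_skip=1):
    """Delay 2(k-1)*t_rca + (N/k - 2)*t_skip for equal groups of k bits."""
    groups = n_bits // k
    return 2 * (k - 1) * t_rca + (groups - 2) * t_skip


def best_group_size(n_bits):
    """Group size with the smallest delay among even divisions of n_bits."""
    # Only sizes that give at least two equal groups are considered.
    candidates = [k for k in range(1, n_bits // 2 + 1) if n_bits % k == 0]
    return min(candidates, key=lambda k: carry_skip_delay(n_bits, k))
```

With these unit delays, N = 32 gives a best group size of 4 and a delay of 12, the figure quoted for the constant-group carry-skip adder in the variable-block comparison.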


Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

[Figure: carry chain between C_in and C_out with groups G_0 .. G_m and group propagate signals P_0 .. P_m; the carry signal path ripples inside a group and skips over groups; block sizes 1, 3, 4, 5, 6, 5, 4, 3, 1; longest path = 9]

Any-point-to-any-point delay = 9 as compared to 12 for CSKA
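The comparison of 9 versus 12 can be reproduced with a simple unit-delay model. In the sketch below, the delay model (one unit per ripple step and one unit per skipped block) and the block sizes 1, 3, 4, 5, 6, 5, 4, 3, 1 are assumptions: the sizes are read from the damaged figure and chosen because they sum to 32.

```python
def worst_case_delay(block_sizes):
    """Longest carry path in unit delays under an assumed simple model.

    A carry ripples to the end of its source block, skips every block in
    between (one unit per skip), then ripples to the top bit of the
    destination block.
    """
    worst = max(size - 1 for size in block_sizes)  # paths inside one block
    for i, src in enumerate(block_sizes):
        for j in range(i + 1, len(block_sizes)):
            skipped = j - i - 1
            path = (src - 1) + skipped + (block_sizes[j] - 1)
            worst = max(worst, path)
    return worst


FIXED_BLOCKS = [4] * 8                          # constant-size carry-skip adder
VARIABLE_BLOCKS = [1, 3, 4, 5, 6, 5, 4, 3, 1]   # assumed sizes, summing to 32
```

Under this model the fixed blocks give 12 and the variable blocks give 9, because small blocks at both ends shorten the ripple segments of the longest paths.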


Carry-chain block size determination for a 32-bit Variable Block Adder

(Oklobdzija, Barnes: IBM 1985)


Delay Calculation for Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

[Figure: carry chain from Ci,0 through P0 .. P3 with generate inputs G0 .. G3 and bypass (BP) to Co,3]

Delay model:

Variable Group Length

Oklobdzija, Barnes, Arith’85

[Equation: delay t_d expressed in terms of N and the constants c_1, c_2, c_3]

Variable Block Lengths

• No closed form solution for delay

• It is a dynamic programming problem


Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

[Figure: delay (0 to 16) versus size N (4 to 60) for VBA, CLA, and multi-level VBA]

VLSI Arithmetic: Lecture 4




Review

Lecture 3





Delay Comparison: Variable Block Adder

[Figure: delay (0 to 16) versus size N (4 to 60) for VBA, CLA, and multi-level VBA; annotations: VBA square-root dependency, log dependency]


Circuit Issues

• Adder speed cannot be estimated based on:
– logic gates in the critical path
– number of transistors in the path
– logic levels in the path

• Estimating adder speed is much more complex, and many of the “fast” schemes may be misleading.


Fan-Out Dependency


Fan-In Dependency

This looks like “Logical Effort”

(1985)


Carry-Lookahead Adder (Weinberger and Smith, 1958)

Ref: A. Weinberger and J. L. Smith, “A Logic for High-Speed Addition”, National Bureau of Standards, Circ. 591, p.3-12, 1958.

ARITH-13: Presenting Achievement Award to Arnold Weinberger of IBM (who invented CLA adder in 1958)


CLA Definitions: One-bit adder

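$s_i = p_i \oplus c_i$

$c_{i+1} = g_i + p_i c_i$

$p_i = a_i \oplus b_i$

$g_i = a_i \cdot b_i$

[Figure: multiplexer-based one-bit adder with inputs a_i, b_i, c_in and outputs s_i, c_out]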


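CLA Definitions: 4-bit Adder

[Figure: four one-bit cells for bits i to i+3, each taking a, b and a carry C and producing g, p; carries C_i, C_{i+1}, C_{i+2}, C_{i+3}, C_{i+4}]

$c_{i+1} = \bar{a}_i b_i c_i + a_i \bar{b}_i c_i + a_i b_i = g_i + p_i c_i$

$c_{i+2} = g_{i+1} + p_{i+1} c_{i+1} = g_{i+1} + p_{i+1}(g_i + p_i c_i)$

$c_{i+2} = g_{i+1} + p_{i+1} g_i + p_{i+1} p_i c_i$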


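Carry-Lookahead Adder: 4-bits

[Figure: four one-bit cells for bits i to i+3, each taking a, b and a carry C and producing g, p; carries C_i, C_{i+1}, C_{i+2}, C_{i+3}, C_{i+4}]

$c_{i+3} = g_{i+2} + p_{i+2} c_{i+2} = g_{i+2} + p_{i+2}(g_{i+1} + p_{i+1} g_i + p_{i+1} p_i c_i)$

$c_{i+3} = g_{i+2} + p_{i+2} g_{i+1} + p_{i+2} p_{i+1} g_i + p_{i+2} p_{i+1} p_i c_i$

$c_{i+4} = g_{i+3} + p_{i+3} c_{i+3} = g_{i+3} + p_{i+3}(g_{i+2} + p_{i+2} g_{i+1} + p_{i+2} p_{i+1} g_i + p_{i+2} p_{i+1} p_i c_i)$

$c_{i+4} = \underbrace{g_{i+3} + p_{i+3} g_{i+2} + p_{i+3} p_{i+2} g_{i+1} + p_{i+3} p_{i+2} p_{i+1} g_i}_{G_j} + \underbrace{p_{i+3} p_{i+2} p_{i+1} p_i}_{P_j}\, c_i$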


Carry-Lookahead Adder

$G_j = g_{i+3} + p_{i+3} g_{i+2} + p_{i+3} p_{i+2} g_{i+1} + p_{i+3} p_{i+2} p_{i+1} g_i$

$P_j = p_{i+3} p_{i+2} p_{i+1} p_i$

$c_{4(j+1)} = G_j + P_j c_{4j}$

One gate delay to calculate p, g

One to calculate P and two for G

Three gate delays to calculate C4(j+1)

Compare that to 8 in RCA!

[Figure: P, G group: four one-bit cells for bits i to i+3 producing g, p; group outputs G_j, P_j; carry-in C4j; internal carries C4j+1, C4j+2, C4j+3; carry-out C4(j+1)]
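The group equations can be exercised in software. The following sketch evaluates one 4-bit lookahead group directly from the expanded expressions, with no carry rippling between bits. The function name and return layout are illustrative choices, not taken from the slides.

```python
def cla_group(a, b, carry_in):
    """4-bit carry-lookahead group for 4-bit unsigned integers a and b.

    Returns (sum, carry_out, group_generate, group_propagate).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]      # g_i = a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]    # p_i = a_i XOR b_i
    c0 = carry_in
    # Each carry is written out in terms of g, p and the group carry-in only.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    group_p = p[3] & p[2] & p[1] & p[0]
    c4 = group_g | (group_p & c0)
    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c4, group_g, group_p
```

Calling `cla_group(9, 7, 0)` returns a sum of 0 with carry-out 1, matching 9 + 7 = 16.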


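Carry-Lookahead Adder (Weinberger and Smith)

$G^*_j = G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i$

$P^*_j = P_{i+3} P_{i+2} P_{i+1} P_i$

$c_{4(j+1)} = G^*_k + P^*_k\, c_{4j}$

[Figure: second-level lookahead block combining G_j, P_j to G_{j+3}, P_{j+3} into G*, P*; carry-in C4j; carries C4j+1, C4j+2, C4j+3; carry-out C4(j+1)]

Additional two gate delays

C16 will take a total of 5 vs. 32 for RCA!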


32-bit Carry Lookahead Adder

[Figure: 32-bit carry-lookahead adder with carries C4, C8, C12, C16, C20, C24, C28 between C_in and C_out, built from:
– individual adders generating g_i, p_i, and sum S_i
– carry-lookahead blocks of 4 bits generating G_i, P_i, and C_in for the adders
– carry-lookahead super-blocks of 4-bit blocks generating G*_i, P*_i, and C_in for the 4-bit blocks
– a group producing the final carry C_out and C16]

Critical path delay = 1Δ (for g_i, p_i) + 2×2Δ (for G, P) + 3×2Δ (for C_in) + 1 XOR-Δ (for Sum) = approx. 12Δ of delay


Carry-Lookahead Adder (Weinberger and Smith: original derivation, 1958)

Carry-Lookahead Adder (Weinberger and Smith: original derivation)

Carry-Lookahead Adder (Weinberger and Smith): please notice the similarity with Parallel-Prefix Adders!

Motorola: CLA Implementation Example

A. Naini, D. Bearden and W. Anderson, “A 4.5nS 96b CMOS Adder Design”,

Proceedings of the IEEE Custom Integrated Circuits Conference, May 3-6, 1992.


Critical path in Motorola's 64-bit CLA

Critical path: A, B - G0 - G3:0 - G15:0 - G47:0 - C48 - C60 - C63 - S63

[Figure: 64-bit CLA built from PG blocks and a carry block. Bit-level G0..G63 and P0..P63 are combined into 4-bit group signals (G3:0, G7:4, G11:8, G15:12, ..., G63:60), then into 16-bit signals (G15:0, G31:16, G47:32, G63:48), then into G31:0, G47:0, and G63:0. The carry block produces C0, C4, C8, C12, C16, C32, C48, C52, C56, C60 .. C64. Delays annotated along the path: 1.05nS, 1.7nS, 2.0nS, 2.35nS, 2.7nS, 3.75nS, 4.8nS]


Motorola's 64-bit CLA

conventional PG Block

carry ripples locally

5 transistors in the path

no better situation here!

Basically, this is MCC performance with Carry-Skip. One should not expect any better results than VBA.

Modified PG Block

Intermediate propagate signals Pi:0 are generated to speed up C3

still, the critical path resembles MCC

[Figure: delays annotated: 1.8nS, 2.2nS, 2.9nS, 3.2nS, 3.55nS, 3.9nS]


Delay Optimized CLA

B. Lee, V. G. Oklobdzija, Journal of VLSI Signal Processing, Vol.3, No.4, October 1991

Delay Optimized CLA: Lee-Oklobdzija ’91

(a.) Fixed groups and levels

(b.) variable-sized groups, fixed levels

(c.) variable-sized groups and fixed levels

(d.) variable-sized groups and levels


Two-Levels of Logic Implementation of the Carry Block


Two-Levels of Logic Implementation of the Carry-Lookahead Block


Three-Levels of Logic Implementation of the Carry Block (restricted fan-in)


Three-Levels of Logic Implementation of the Carry Lookahead (restricted fan-in)


Delay Optimized CLA: Lee-Oklobdzija ‘91

Delay: Two-level BCLA Delay: Three-level BCLA


Delay Optimized CLA: Lee-Oklobdzija ‘91

(a.) 2-level BCLA =8.5nS (b.) 3-level BCLA =8.9nS

Ling’s Adder

Huey Ling, “High-Speed Binary Adder”

IBM Journal of Research and Development, Vol.25, No.3, 1981.

Used in: IBM 3033, IBM S370/168, Amdahl V6, HP etc.


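Ling’s Derivations

ai  bi  pi  gi  ti
0   0   0   0   0
0   1   1   0   1
1   0   1   0   1
1   1   0   1   1

[Figure: one-bit adder cell with inputs a_i, b_i, c_i and outputs s_i, c_{i+1}]

$g_i = a_i b_i$

$C_{i+1} = g_i + p_i C_i$

define: $H_{i+1} = C_{i+1} + C_i$

$g_i$ implies $C_{i+1}$, which implies $H_{i+1}$, thus: $g_i = g_i H_{i+1}$

$p_i H_{i+1} = p_i C_{i+1} + p_i C_i$

$p_i C_{i+1} + p_i C_i = p_i g_i + p_i p_i C_i + p_i C_i = p_i C_i$ (because $p_i g_i = 0$)

$p_i H_{i+1} = p_i C_i$

$C_{i+1} = g_i + p_i C_i = g_i H_{i+1} + p_i C_i$

$C_{i+1} = g_i H_{i+1} + p_i H_{i+1} = t_i H_{i+1}$

$C_{i+1} = t_i H_{i+1}$

From $H_{i+1} = C_{i+1} + C_i$ and $C_{i+1} = g_i + p_i C_i$:

$H_{i+1} = C_{i+1} + C_i = g_i + p_i C_i + C_i = g_i + C_i$

$H_{i+1} = g_i + t_{i-1} H_i$ because $C_i = t_{i-1} H_i$ (fundamental expansion)

Now we need to derive the Sum equation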


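Ling Adder

Ling, IBM J. Res. Dev, 5/81

Variation of CLA:

$p_i = a_i \oplus b_i$

$g_i = a_i b_i$

$C_{i+1} = g_i + p_i C_i$

$S_i = p_i \oplus C_i$

Ling’s equations:

$t_i = a_i + b_i$

$g_i = a_i b_i$

$H_{i+1} = g_i + t_{i-1} H_i$

$S_i = t_i \oplus H_{i+1} + g_i t_{i-1} H_i$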
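A bit-serial Python sketch of Ling's recurrence is given below as a sanity check that the H-based sum equation produces the same result as ordinary addition. The boundary conditions (H_0 = carry-in and t_{-1} = 1) are assumptions chosen so that C_0 = t_{-1} H_0 equals the carry-in; the function name is illustrative.

```python
def ling_add(a, b, n_bits, carry_in=0):
    """Add two n_bits-wide unsigned integers using Ling's pseudo-carry H."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n_bits)]     # g_i = a_i AND b_i
    t = [((a >> i) | (b >> i)) & 1 for i in range(n_bits)]   # t_i = a_i OR b_i
    h = carry_in   # H_0 (assumed boundary condition)
    t_prev = 1     # t_{-1} = 1 (assumed boundary condition)
    total = 0
    for i in range(n_bits):
        carry_i = t_prev & h                       # C_i = t_{i-1} H_i
        h_next = g[i] | carry_i                    # H_{i+1} = g_i + t_{i-1} H_i
        s_i = (t[i] ^ h_next) | (g[i] & carry_i)   # S_i = (t_i XOR H_{i+1}) + g_i t_{i-1} H_i
        total |= s_i << i
        h, t_prev = h_next, t[i]
    carry_out = t_prev & h                         # C_n = t_{n-1} H_n
    return total, carry_out
```

The loop is serial only for clarity; in hardware the point of H is that its lookahead expansion has fewer terms than the conventional carry.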


Ling Adder

Variation of CLA:

$C_{i+1} = g_i + g_i C_i + p_i C_i = g_i + (g_i + p_i) C_i$

$C_{i+1} = g_i + t_i C_i$

Ling’s equation:

$H_{i+1} = g_i + t_{i-1} H_i$

see: Doran, IEEE Trans on Comp. Vol 37, No.9 Sept. 1988.

Ling uses a different transfer function. Four of those functions have the desired properties (Ling’s is one of them).


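Ling Adder

Conventional:

$C_4 = g_3 + t_3 g_2 + t_3 t_2 g_1 + t_3 t_2 t_1 g_0 + t_3 t_2 t_1 t_0 C_{in}$

Fan-in of 5

Ling:

$H_4 = g_3 + t_2 g_2 + t_2 t_1 g_1 + t_2 t_1 t_0 g_0 + t_2 t_1 t_0 t_{-1} C_{in}$

$H_4 = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0 + t_2 t_1 t_0 C_{in}$

Fan-in of 4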
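The two expansions can be checked against each other by brute force. The sketch below evaluates the conventional C4 and the simplified H4 for every 4-bit operand pair and confirms that H4 = C4 + C3 and C4 = t3 H4. The helper names are illustrative, and t_{-1} = 1 is assumed, as in the simplified form.

```python
def conventional_c4(g, t, c_in):
    """C4 from the conventional expansion (widest product term has 5 inputs)."""
    return (g[3] | (t[3] & g[2]) | (t[3] & t[2] & g[1])
            | (t[3] & t[2] & t[1] & g[0])
            | (t[3] & t[2] & t[1] & t[0] & c_in))


def ling_h4(g, t, c_in):
    """H4 from Ling's simplified expansion (widest product term has 4 inputs)."""
    return (g[3] | g[2] | (t[2] & g[1]) | (t[2] & t[1] & g[0])
            | (t[2] & t[1] & t[0] & c_in))


def expansions_agree():
    """Compare both expansions with carries taken from integer addition."""
    for a in range(16):
        for b in range(16):
            for c_in in (0, 1):
                g = [(a >> i) & (b >> i) & 1 for i in range(4)]
                t = [((a >> i) | (b >> i)) & 1 for i in range(4)]
                c3 = ((a & 7) + (b & 7) + c_in) >> 3   # carry into bit 3
                c4 = (a + b + c_in) >> 4               # carry out of bit 3
                h4 = ling_h4(g, t, c_in)
                if conventional_c4(g, t, c_in) != c4:
                    return False
                if h4 != (c4 | c3) or (t[3] & h4) != c4:
                    return False
    return True
```

The saving is visible in the code: the widest AND in `ling_h4` has one input fewer than the widest AND in `conventional_c4`.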


Advantages of Ling’s Adder

• Uniform loading in fan-in and fan-out

• H16 contains 8 terms as compared to G16 that contains 15.

• H16 can be implemented with one level of logic (in ECL), while G16 can not.

(Ling’s adder takes full advantage of wired-OR, of special importance when ECL technology is used)

Review

Lecture 4


Advantages of Ling’s Adder

• Uniform loading in fan-in and fan-out

• H16 contains 8 terms as compared to G16 that contains 15.

• H16 can be implemented with one level of logic (in ECL), while G16 cannot (with 8-way wire-OR).

(Ling’s adder takes full advantage of wired-OR, of special importance when ECL technology is used; his IBM limitation was a fan-in of 4 and a wire-OR of 8.)


Ling: Weinberger Notes


Advantage of Ling’s Adder

• 32-bit adder used in: IBM 3033, IBM S370/Model 168, Amdahl V6.

• Implements 32-bit addition in 3 levels of logic

• Implements 32-bit AGEN: B+Index+Disp in 4 levels of logic (rather than 6)

• 5 levels of logic for 64-bit adder used in HP processor


Implementation of Ling’s Adder in CMOS

(S. Naffziger, “A Subnanosecond 64-b Adder”, ISSCC ’96)

S. Naffziger, ISSCC ’96

$H_4 = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0$

$C_i = t_{i-1} H_i$

$C_{16} = p_{15} H_{16} = p_{15}\,(g_{15} + g_{11} + t_{11} g_7 + t_{11} t_7 g_0)$


Ling Adder Critical Path


Ling Adder: Circuits

[Figure: dynamic circuit schematics for the Ling adder: clocked (CK) gates producing G3, G4, and P4 from inputs A0..A3, B0..B3; an LC gate with inputs P1, P2, G0, G1, G2; dual-rail sum selection (SumH, SumL) controlled by LCH, LCL and C0H, C0L, C1H, C1L]


LCS4 – Critical G Path

[Figure: critical G path from in1 through G3, G4/P4 ((k,p) or (g,p)), C15, and C47 to sums S48 .. S63; further carries C31; spans labelled 4b, 12b, 16b, and 32b]

LCS4 – Logical Effort Delay

Prefix-4 Ling/Conditional-Sum (Dynamic - Long Carry Path)

Stages           Branch   LE     Parasitic
dg3# (dg3)       4.0      0.98   2.97
g4 (NAND2)       2.0      1.11   1.84
C15# (GG4)       1.0      1.01   1.80
C15 (INV)        1.0      1.00   1.00
C47# (LC)        3.0      1.03   3.32
C47 (INV)        1.0      1.00   1.00
C47#b (INV)      1.0      1.00   1.00
C47b (INV)       1.0      1.00   1.00
S63# (SUM)       16.0     0.86   1.36
S63 (INV)        1.0      1.00   1.00

Total Branch             3.84E+02
Total LE                 9.73E-01
Path Effort              3.74E+02
fo, opt                  1.81
Effort Delay (ps)        66
Parasitic Delay (ps)     70
Total Delay (ps)         136
Total Delay (FO4)        7.2
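The path totals follow from the per-stage numbers by the usual logical-effort products. The sketch below recomputes them in Python; the stage list is copied from the slide's stage data, while the function names and the use of the N-th root for the optimal stage effort are the standard logical-effort method rather than anything stated on the slide.

```python
import math

# (stage, branching effort, logical effort), copied from the slide's stage list
STAGES = [
    ("dg3#", 4.0, 0.98),
    ("g4", 2.0, 1.11),
    ("C15#", 1.0, 1.01),
    ("C15", 1.0, 1.00),
    ("C47#", 3.0, 1.03),
    ("C47", 1.0, 1.00),
    ("C47#b", 1.0, 1.00),
    ("C47b", 1.0, 1.00),
    ("S63#", 16.0, 0.86),
    ("S63", 1.0, 1.00),
]


def path_effort(stages):
    """Return (total branching, total logical effort, their product)."""
    total_branch = math.prod(branch for _, branch, _ in stages)
    total_le = math.prod(le for _, _, le in stages)
    return total_branch, total_le, total_branch * total_le


def optimal_stage_effort(stages):
    """Per-stage effort that minimises delay: the N-th root of the path effort."""
    return path_effort(stages)[2] ** (1.0 / len(stages))
```

Running it gives a total branching of 384, a total logical effort of about 0.973, a path effort of about 374, and an optimal stage effort of about 1.81.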


Results:

• 0.5 µm technology

• Speed: 0.930 nS

• Nominal process, 80 °C, V = 3.3 V

See: S. Naffziger, “A Subnanosecond 64-b Adder”, ISSCC ’96

Prefix Adders and Parallel Prefix Adders


from: Ercegovac-Lang


Prefix Adders

The following recurrence operation is defined:

$(g, p) \circ (g', p') = (g + p g',\; p p')$

such that:

$(G_i, P_i) = (g_0, p_0)$ for $i = 0$

$(G_i, P_i) = (g_i, p_i) \circ (G_{i-1}, P_{i-1})$ for $1 \le i \le n$

$c_{i+1} = G_i$ for $i = 0, 1, \ldots, n$

$c_1 = g_0 + p_0 c_{in}$, with $(g_{-1}, p_{-1}) = (c_{in}, c_{in})$

This operation is associative, but not commutative. It can also span a range of bits (overlapping and adjacent).


Parallel Prefix Adders: variety of possibilities

from: Ercegovac-Lang

Pyramid Adder:

M. Lehman, “A Comparative Study of Propagation Speed-up Circuits in Binary Arithmetic Units”, IFIP Congress, Munich, Germany, 1962.


Hybrid BK-KS Adder


Parallel Prefix Adders: S. Knowles 1999

operation is associative: h>i≥j≥k

operation is idempotent: h>i≥j≥k

produces carry: cin=0


Parallel Prefix Adders: Ladner-Fischer

Exploits associativity, but not idempotency. Produces minimal logical depth.

Two wires at each level. Uniform fan-in of two. Large fan-out (of 16; n/2); large capacitive loading combined with the long wires (in the last stages).

Parallel Prefix Adders: Ladner-Fischer (16,8,4,2,1)

Parallel Prefix Adders: Kogge-Stone

Exploits idempotency to limit the fan-out to 1. Dramatic increase in wires. The wire span remains the same as in Ladner-Fischer.

Buffers needed in both cases: K-S, L-F
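For illustration, a Kogge-Stone style prefix network can be modelled level by level. In the sketch below every position combines with the position 2^d below it at level d, so each node feeds a bounded number of successors. The function name and the software representation are assumptions; buffering and wiring are not modelled.

```python
def kogge_stone_carries(a, b, n_bits):
    """Return ([c_1, ..., c_n], levels) for carry-in 0.

    At level d every position i >= 2**d combines its (g, p) pair with the
    pair 2**d positions below it, so log2(n_bits) levels suffice when
    n_bits is a power of two.
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(n_bits)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n_bits)]
    distance = 1
    levels = 0
    while distance < n_bits:
        new_g, new_p = g[:], p[:]
        for i in range(distance, n_bits):
            # (g, p)_i o (g, p)_{i - distance}
            new_g[i] = g[i] | (p[i] & g[i - distance])
            new_p[i] = p[i] & p[i - distance]
        g, p = new_g, new_p
        distance *= 2
        levels += 1
    return g, levels
```

For a 16-bit adder the loop runs four levels, which matches the log(bits) depth quoted for Kogge-Stone.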


Kogge-Stone Adder


Parallel Prefix Adders: Brent-Kung

• Set the fan-out to one

• Avoids explosion of wires (as in K-S)

• Makes no sense in CMOS:
– fan-out = 1 limit is arbitrary and extreme
– much of the capacitive load is due to wire (anyway)

• It is more efficient to insert buffers in L-F than to use B-K scheme


Brent-Kung Adder


Parallel Prefix Adders: Han-Carlson

• Is a hybrid synthesis of L-F and K-S

• Trades increase in logic depth for a reduction in fan-out:
– effectively a higher-radix variant of K-S.
– others do it similarly by serializing the prefix computation at the higher fan-out nodes.

• Others similarly trade the logical depth for reduction of fan-out and wire.


Parallel Prefix Adders: variety of possibilities

from: Knowles

bounded by L-F and K-S at ends


Parallel Prefix Adders: variety of possibilities

Knowles 1999

Following rules are used:

• Lateral wires at the jth level span $2^j$ bits

• Lateral fan-out at the jth level is a power of 2 up to $2^j$

• Lateral fan-out at the jth level cannot exceed that at the (j+1)th level.



• The number of minimal depth graphs of this type is given in:

• at 4-bits there is only K-S and L-F, afterwards there are several new possibilities.


Parallel Prefix Adders: variety of possibilities

Example of a new 32-bit adder [4,4,2,2,1]

Knowles 1999

• Delay is given in terms of FO4 inverter delay: w.c. (nominal case is 40-50% faster)

• K-S is the fastest

• K-S adders are wire limited (requiring 80% more area)

• The difference is less than 15% between examined schemes



Conclusion

• Irregular, hybrid schemes are possible

• The speed-up of 15% is achieved at the cost of large wiring, hence area and power

• Circuits close in speed to K-S are available at significantly lower wiring cost

Review

Lecture 5



Two Parallel Prefix Adder Structures

[Figure: two 16-bit prefix graphs, each with levels (G1,P1) to (G4,P4) producing carries C1 .. C15 and Cout]

Kogge-Stone:

• log(bits) carry stages

• Extra Wiring

Han-Carlson:

• log(bits) + 1 carry stages

• Reduced Wiring and Gates




Possibilities for Further Research

• The logical depth is important (Knowles was right)

• The fan-out is less important than fan-in (Knowles was wrong):
– It is possible to examine a variety of topologies with restricted and varied fan-in.

• Driving strength and Logical Effort rules were overlooked or at least neglected:
– It is possible to create a number of topologies taking LE rules into account.
– It is further possible to combine the rules with compound domino implementation, taking advantage of two different rules governing “dynamic” and “static”.

• It is still possible to produce a better adder!


Other Types of Adders

Conditional Sum Adder

J. Sklansky, “Conditional-Sum Addition Logic”, IRE Transactions on Electronic Computers, EC-9, p.226-231, 1960.

Conditional Sum Adder

Carry-Select Adder

O. J. Bedrij, “Carry-Select Adder”, IRE Transactions on Electronic Computers, June 1962, p.340-34


Carry-Select Sum Adder



Carry-Select Adder

Addition under assumption of Cin=0 and Cin =1.
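A small behavioural sketch of the carry-select idea follows: each block is added twice, once for each assumed carry-in, and the real carry then picks one result. The function name, the equal block widths, and the requirement that the block width divide the word width are assumptions made for brevity.

```python
def carry_select_add(a, b, n_bits, block_bits):
    """Carry-select addition of two n_bits-wide unsigned integers.

    n_bits is assumed to be a multiple of block_bits. Returns (sum, carry_out).
    """
    mask = (1 << block_bits) - 1
    carry = 0
    total = 0
    for low in range(0, n_bits, block_bits):
        a_blk = (a >> low) & mask
        b_blk = (b >> low) & mask
        sum_if_0 = a_blk + b_blk        # block result assuming Cin = 0
        sum_if_1 = a_blk + b_blk + 1    # block result assuming Cin = 1
        chosen = sum_if_1 if carry else sum_if_0   # multiplexer on the real carry
        total |= (chosen & mask) << low
        carry = chosen >> block_bits
    return total, carry
```

In hardware both block sums are computed in parallel, so only the multiplexer delay is added per block once the incoming carry is known.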


Carry Select Adder: combining two 32-b VBAs in select mode

Delay = VBA32 + MUX


Carry-Select Adder

O.J. Bedrij, IBM Poughkeepsie, 1962
