Fast Multiplication Without Carry Propagate
8/18/2019 Fast Mltiplication With Out Carry Propagate
IEEE TRANSACTIONS ON COMPUTERS, VOL. 39, NO. 11, NOVEMBER 1990 1385
TABLE IV: TESTS FOR F AND FA,
TABLE V: TESTS FOR R N ,
The test sets grow linearly with the input size of the counter for the
following reason. The number of full adders $N_{FA}$ in a generalized
counter is related to the number of counter inputs $CI$ as
$N_{FA} \approx CI - \lceil \log_2 N \rceil$, where
$CI = \sum_{j=0}^{k-1} i_j$ and $N = \sum_{j=0}^{k-1} i_j 2^j$. In
the above, the second term is smaller than the first. Since at most 32
tests are needed to test each full adder, the test size $T \le 32\,CI$
and the hypothesis follows. It can be shown that for a
$C(CI : \log_2(CI + 1))$ balanced [4] counter, where $CI$ is of the form
$2^n - 1$ and $n \ge 2$,
$T = 33\,CI/2 - 5 \log_2(CI + 1) + 5$ for $CI > 3$.
These are referred to as Construction 1 Counters in [7].

Example 2: We now present a test set for the reduction tree R N ,
under the assumption that R N ~ has already been tested and determined
to be fault-free under the multiple fault assumption. In order to test
the full adders of R N , , we apply the tests corresponding to $E_0(e)$
and $E_1(e)$, which are tests 1-4 and 5-8, respectively, in Table IV.
Next, the tests corresponding to $E_0(c)$ and $E_1(c)$ are applied.
These are tests 9-12 and 13-16 in Table IV. If these tests “pass,” the
tests in Table V are applied. Tests 17-20 correspond to the experiment
$EXHT_0(e, c)$, tests 21-24 to $EXHT_1(e, c)$, tests 25-28 to
$EXHT_2(e, c)$, tests 29-32 to $EXHT_3(e, c)$, tests 33-36 to
$EXHT_4(e, c)$, tests 37-40 to $EXHT_5(e, c)$, tests 41-44 to
$EXHT_6(e, c)$, and tests 45-48 to $EXHT_7(e, c)$.

Some of the tests in Tables IV and V are applied more than once and can
be eliminated. Tests 17-20 of $EXHT_0(e, c)$ are also applied by
$E_0(a) \cup E_1(a)$, $E_0(b) \cup E_1(b)$, as are tests 25, 27 of
$EXHT_2(e, c)$ and tests 33, 34 of $EXHT_4(e, c)$. These tests need not
be reapplied. In order to test $FA_d$ we need only apply all possible
input combinations to its inputs. These tests are contained in those
listed in Table V. The C(5,5: 4) counter is testable in 61 tests.
CONCLUSIONS
In this paper, we have discussed the problem of multiple faulty
cells in generalized counters for
two
different cases, one in which
the class of circuits is general but some restrictions are placed on
the faults allowed and another in which the class of circuits is re-
stricted but the fault model is more general.
A
theory for the detection
of multiple faulty cells in generalized counters has been developed.
Based on this theory, two schemes for generating test sets that de-
tect multiple faults in generalized counters have been presented. It
is hoped that the techniques presented in this paper will open new
avenues to similar test generation problems for circuits with complex
interconnection structures.
REFERENCES
[1] J. A. Abraham and D. D. Gajski, “Design of testable structures defined
by simple loops,” IEEE Trans. Comput., vol. C-30, no. 11, pp. 875-883,
Nov. 1981.
[2] A. Chatterjee and J. A. Abraham, “On the C-testability of generalized
counters,” IEEE Trans. Comput.-Aided Design, vol. CAD-6, no. 5,
pp. 713-726, Sept. 1987.
[3] A. Vergis, “Linear-testable counters for multiple faults,” in Proc. Int.
Conf. Comput. Aided Design, Nov. 1987, pp. 156-159.
[4] S. C. Seth and K. L. Kodandapani, “Diagnosis of faults in linear tree
networks,” IEEE Trans. Comput., vol. C-26, pp. 29-33, Jan. 1977.
[5] W. T. Cheng, “Testing and error detection in iterative logic arrays,”
Ph.D. dissertation, 1985.
[6] K. Hwang, Computer Arithmetic, Principles, Architecture and Design.
New York: Wiley, 1979.
[7] A. Chatterjee and J. A. Abraham, “C-Testability of generalized tree
structures with applications to Wallace trees and other circuits,” in
Proc. Int. Conf. Comput.-Aided Design, Nov. 1986, pp. 288-291.
Fast Multiplication Without Carry-Propagate Addition
Miloš D. Ercegovac and Tomás Lang
Abstract-Conventional schemes for fast multiplication accumulate
the partial products in redundant form (carry-save or signed-digit) and
convert the result to conventional representation in the last step. This
step requires a carry-propagate adder, which is comparatively slow and
occupies a significant area of the chip in a VLSI implementation. In this
paper, we report a multiplication scheme (LRCF: left-to-right,
carry-free) that does not require this carry-propagate step. The LRCF
scheme performs the multiplication most-significant bit first and
produces a conventional sign-and-magnitude product (most significant n
bits) by means of an on-the-fly conversion. The resulting implementation
is fast and regular and is very well suited for VLSI. The LRCF scheme for
general radix r and a radix-4 signed-digit implementation are presented.

Index Terms-Carry-save multiplier, digital arithmetic, left-to-right
multiplier, multiplication, signed-digit multiplier, signed-digit
representation, VLSI implementation.

Manuscript received October 20, 1987; revised March 1, 1989.
The authors are with the Department of Computer Science, University
of California, Los Angeles, CA 90024.
IEEE Log Number 9035139.
0018-9340/90/1100-1385$01.00 © 1990 IEEE
I. INTRODUCTION
All common schemes for multiplication, sequential as well as
combinational, require a carry-propagate addition in the final step
(stage) to obtain the 2n-bit product [1]. In sequential right-to-left
n-bit multipliers, the least significant half of the product is obtained
without the use of a carry-propagate adder (CPA). The most significant
half, however, requires an n-bit CPA to complete the operation.
Similarly, in multipliers using linear arrays of carry-save adders, the
least significant half is generated during the reduction process. To
obtain the most significant half, an n-bit CPA is used. In the case of
tree-type multipliers, a CPA of more than n bits is required. For
example, a Dadda or a Wallace multiplier with s reduction stages requires
an adder of 2n bits. Such a carry-propagate adder is comparatively slow
and occupies a significant area of the chip in a VLSI implementation [2].
In this paper, we report a novel multiplication scheme that produces
the most-significant half of the product without the carry-propagate
step. The basic characteristics of the proposed LRCF (left-to-right,
carry-free) scheme are:
1) The recurrence uses the digits of the multiplier from most to
least significant (left-to-right multiplication) [3]. The multiplier can
be recoded into a suitable radix-r representation to reduce the number
of steps [4].
2) The accumulated partial products are decomposed into two
parts: the most significant part and the least significant part.
3) To produce the product, the most significant portion of the
accumulated partial products is converted to conventional form using
a variation of the on-the-fly algorithm presented in [5], without the
need of carry-propagate addition.
The LRCF scheme can be used both for sequential and combinational
implementations. We concentrate here on the combinational case, since
it provides the greatest speed advantage. The resulting implementation
is fast and regular and is very well suited for VLSI implementation.
II. THE LRCF MULTIPLICATION ALGORITHM
Consider multiplication of normalized fractions in the sign-and-magnitude
representation. Let X be the radix-2 representation of the normalized
fractional magnitude x, such that

$$x = \sum_{i=1}^{n} X_i 2^{-i}, \qquad X_i \in \{0, 1\}$$

and let Y be the recoded radix-r representation of the normalized
fractional magnitude y, such that

$$y = \sum_{i=0}^{n/q} Y_i r^{-i}, \qquad Y_i \in \{-r/2, \ldots, r/2\} \quad \text{(minimally redundant)}$$

where, for simplicity, $r = 2^q$.
The LRCF multiplication algorithm is a recurrence that produces
a sequence of two accumulated partial products (w and p) as follows:

$$w[j] = r\,(\mathrm{fraction}(w[j-1] + x Y_j)), \qquad j = 0, \ldots, n/q \tag{2.2}$$

$$Z_j = \mathrm{integer}(w[j-1] + x Y_j) \tag{2.3}$$

and

$$p[j] = p[j-1] + Z_j r^{-j}. \tag{2.4}$$

The initial values are w[-1] = p[-1] = 0. The algorithm uses
the digits of the multiplier from most significant to least significant
[3], unlike conventional multiplication schemes, which use the digits
from least significant to most significant.
To show that the LRCF algorithm performs multiplication, observe
that the sum of partial products after k steps satisfies

$$p[k] + w[k] \times r^{-k-1} = x \times \sum_{i=0}^{k} Y_i \times r^{-i}. \tag{2.5}$$

Consequently, after n/q steps we obtain

$$p[n/q] + w[n/q] \times r^{-n/q-1} = x \times \sum_{i=0}^{n/q} Y_i \times r^{-i} = x y. \tag{2.6}$$

That is, p[n/q] is the most significant part of the product while
w[n/q] is the least significant part.
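The recurrence (2.2)-(2.4) and the identity (2.6) can be checked numerically with a minimal sketch. It uses exact rational arithmetic rather than a redundant adder, and it assumes that integer(.) is the floor and fraction(.) is the remainder; the paper's redundant adders may split the sum differently, but any split with integer + fraction equal to the sum preserves the identity. The function name `lrcf_recurrence` and the operand values are hypothetical.

```python
from fractions import Fraction
from math import floor

def lrcf_recurrence(x, Y, r):
    """Iterate (2.2)-(2.4) exactly; Y[j] is the recoded multiplier digit Y_j."""
    w = Fraction(0)                      # w[-1] = 0
    p = Fraction(0)                      # p[-1] = 0
    for j, y_digit in enumerate(Y):
        s = w + x * y_digit              # w[j-1] + x * Y_j
        z = floor(s)                     # Z_j (assumption: integer part = floor)
        w = r * (s - z)                  # w[j] = r * fraction(...)
        p += Fraction(z, r ** j)         # p[j] = p[j-1] + Z_j * r^-j
    return p, w

# Hypothetical operands: x = 107/128 and radix-4 digits Y_0..Y_4.
x = Fraction(107, 128)
Y = [1, -1, 2, 0, -2]
y = sum(Fraction(d, 4 ** i) for i, d in enumerate(Y))
p, w = lrcf_recurrence(x, Y, 4)
k = len(Y) - 1
# Identity (2.6): p[k] + w[k] * r^-(k+1) equals x * y.
assert p + w * Fraction(1, 4 ** (k + 1)) == x * y
```

Here p plays the role of the most significant part and w, scaled by r^-(k+1), the residual least significant part.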
A block diagram of one step of the recurrence is shown in Fig. 1.
A fast implementation requires the following.
1) Use of a redundant adder (either carry-save or signed-digit
[7]) to produce w[j]. This results in a carry-free addition.
2) Addition by concatenation to produce p[j]. That is,

$$p[j] = \mathrm{concat}(p[j-1], P_j). \tag{2.7}$$

Since the maximum value of $Z_j$ in (2.3) is, in general, larger than
r, to perform this concatenation it is necessary to recode $Z_j$ and
$Z_{j+1}$ into $P_j$ in the range $[-(r-1), (r-1)]$. That is,

$$P_j = F(Z_j, Z_{j+1}), \qquad P_j \in [-(r-1), (r-1)]. \tag{2.8}$$

The details of the recoding depend on the range of $Z_j$, which is
dependent on the type of redundant adder used to produce w[j].
3) The on-the-fly conversion of the resulting signed-digit
representation of p[j] into a conventional representation M[j]. An
algorithm to perform this conversion is given in [5]. In contrast with
the traditional approach that performs the conversion by subtracting
the negative part from the positive part (and therefore uses a
carry-propagate adder), this on-the-fly scheme forms two numbers a[j]
and b[j] such that a[j] is the converted number up to digit j and b[j]
is $a[j] - r^{-j}$. When a new digit $P_j$ is produced, if it is
positive or zero it is concatenated to A[j-1] (digit-vector representing
a[j-1]), while if it is negative A[j] is obtained by concatenating
$r - |P_j|$ to B[j-1]. That is,

$$A[j] = \begin{cases} \mathrm{concat}(A[j-1], P_j) & \text{if } P_j \ge 0 \\ \mathrm{concat}(B[j-1], (r - |P_j|)) & \text{if } P_j < 0 \end{cases}$$

$$B[j] = \begin{cases} \mathrm{concat}(A[j-1], (P_j - 1)) & \text{if } P_j > 0 \\ \mathrm{concat}(B[j-1], ((r - 1) - |P_j|)) & \text{if } P_j \le 0 \end{cases} \tag{2.9}$$

with the initial condition

$$A[-1] = B[-1] = 0. \tag{2.10}$$

After the last step, the most significant half of the product in
conventional representation is

$$M[n/q] = A[n/q].$$
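The two-vector conversion in (2.9)-(2.10) can be sketched in a few lines. This is an illustrative model, not the hardware of [5]: digit vectors are Python lists, the zero initial condition is modeled as two empty lists, and the represented value is assumed to be nonnegative (its first nonzero digit is positive). The names `on_the_fly` and `digits_value` and the sample digit string are hypothetical.

```python
def on_the_fly(P, r):
    """Convert signed digits P (most significant first) to conventional digits.

    A holds the converted prefix a[j]; B holds a[j] - r^-j, as in (2.9).
    """
    A, B = [], []                        # A[-1] = B[-1] = 0, as empty vectors
    for d in P:
        if d >= 0:
            new_A = A + [d]              # concat(A[j-1], P_j)
        else:
            new_A = B + [r - abs(d)]     # concat(B[j-1], r - |P_j|)
        if d > 0:
            new_B = A + [d - 1]          # concat(A[j-1], P_j - 1)
        else:
            new_B = B + [(r - 1) - abs(d)]   # concat(B[j-1], (r-1) - |P_j|)
        A, B = new_A, new_B
    return A

def digits_value(digits, r):
    """Integer value of a digit vector read as a radix-r numeral."""
    v = 0
    for d in digits:
        v = v * r + d
    return v

P = [2, 0, 0, 3, 0, 0, -2, -1]           # radix-4 signed digits, illustration only
M = on_the_fly(P, 4)
assert M == [2, 0, 0, 2, 3, 3, 1, 3]
assert digits_value(M, 4) == digits_value(P, 4)
```

Each step only appends one digit to a copy of A or B, so no carry ever ripples through the digits already produced.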
The step of Fig. 1 can be used to implement either a sequential
multiplier or a combinational one. The basic scheme for the
combinational case is illustrated in Fig. 2 together with the
corresponding conventional right-to-left approach. The fundamental
difference is shown in the timing diagram of Fig. 2(c): the two phases
of multiplication (generation of partial products and formation of
the final result in a conventional representation) are performed
concurrently in the LRCF scheme and one after the other in conventional
schemes.

Fig. 1. One step of the LRCF multiplication scheme.

Since the result of the multiplication corresponds to the
most-significant half of the product, an error is made with respect to
the correct product. Several rounding schemes can be used to bound this
error [8]. In the LRCF scheme, since the least-significant half of the
product is left in redundant form, the error is somewhat larger than
in the conventional right-to-left scheme. The actual range of the error
depends on the type of redundant adder used, and several solutions
are discussed in [6].
III. RADIX-4 IMPLEMENTATION WITH SIGNED-DIGIT ADDITION
The presented combinational implementation of the LRCF scheme
for a radix-4 multiplication unit uses a linear array of signed-digit
adders [7] for the computation of w. An implementation using carry-save
adders is discussed in [6], [15]. For radix 4 the recurrences are

$$w[j] = 4\,(\mathrm{fraction}(w[j-1] + x Y_j)), \quad j = 0, \ldots, n/2, \quad w[-1] = 0, \quad Y_j \in \{-2, -1, 0, 1, 2\} \tag{3.1}$$

$$Z_j = \mathrm{integer}(w[j-1] + x Y_j) \tag{3.2}$$

and

$$p[j] = p[j-1] + Z_j 4^{-j} = \mathrm{concat}(p[j-1], P_j), \qquad p[-1] = 0 \tag{3.3}$$

where $P_j$ is obtained from $Z_j$ and $Z_{j+1}$ by signed-digit addition [7]
as described next.
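The recurrences above consume multiplier digits $Y_j \in \{-2, \ldots, 2\}$. One common way to obtain such digits from a binary fraction is the standard radix-4 (modified Booth) recoding, sketched below. The text does not spell out the recoder, so this particular rule, the function name `recode_radix4`, and the sample bit pattern are assumptions for illustration.

```python
from fractions import Fraction

def recode_radix4(bits):
    """Recode y = sum(y_i * 2^-i), i = 1..n (n even), into digits Y_0..Y_{n/2}.

    Uses Y_j = -2*y_{2j-1} + y_{2j} + y_{2j+1}, with y_i = 0 outside 1..n,
    so that y = sum(Y_j * 4^-j) and every Y_j lies in {-2, ..., 2}.
    """
    n = len(bits)
    if n % 2:
        raise ValueError("expected an even number of bits")

    def y(i):                            # bit y_i, zero outside the operand
        return bits[i - 1] if 1 <= i <= n else 0

    return [-2 * y(2 * j - 1) + y(2 * j) + y(2 * j + 1)
            for j in range(n // 2 + 1)]

bits = [1, 1, 0, 1, 0, 1, 1, 1]          # hypothetical operand 0.11010111 (binary)
Y = recode_radix4(bits)
assert Y == [1, -1, 1, 2, -1]
assert sum(Fraction(d, 4 ** j) for j, d in enumerate(Y)) == Fraction(215, 256)
```

Each digit depends on only three adjacent bits, so all digits can be formed in parallel.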
To determine the range of $Z_j$ observe that

$$Z_j = \mathrm{integer}(w[j-1] + x Y_j) = \mathrm{integer}(w[j-1]) + \mathrm{integer}(x Y_j) + t_0 \tag{3.4}$$

where $t_0 \in \{-1, 0, 1\}$ is a transfer digit resulting from the
signed-digit addition of the fractions of w[j-1] and $x Y_j$. Moreover,
the most significant digit of the fraction of the signed-digit sum

$$w[j-2] + x Y_{j-1} \tag{3.5}$$

is in the range {-3, ..., 3}. Therefore,

$$\mathrm{integer}(w[j-1]) = \mathrm{integer}(4 \times \mathrm{fraction}(w[j-2] + x Y_{j-1})) \tag{3.6}$$

is in the range {-3, ..., 3}. Since |x| < 1 is a fraction in
conventional form and the recoded multiplier digits $Y_j$ are in the
range {-2, ..., 2}, integer($x Y_j$) is in the range {-1, 0, 1}.
Consequently, the range of $Z_j$ is [-5, 5] as illustrated in Fig. 3.

Fig. 2. (a) LRCF combinational multiplier. (b) Conventional combinational
multiplier. (c) Timing comparison between LRCF and conventional
multiplication schemes.

Fig. 3. Range of Z digits.
The implementation consists of three parts as follows:
1) A linear array of signed-digit adders to compute the w's. This
array includes the recoder of Y and the selection of the multiples of
the multiplicand x.
2) A linear array of modules TS to compute the partial products
p[j]'s [Fig. 4(b)] from left to right. In the partial product
computation (3.3), a direct addition of $Z_j$ would require carry
propagation. To perform these additions without carry propagation,
$Z_j$ (with range [-5, 5]) is recoded into $t_{j-1}$ and $s_j$ such that

$$Z_j = 4 t_{j-1} + s_j \tag{3.7}$$

with

$$|t_{j-1}| \le 1 \quad \text{and} \quad |s_j| \le 2. \tag{3.8}$$

A possible recoding of $Z_j$ is

Z_j       -5  -4  -3  -2  -1   0   1   2   3   4   5
t_{j-1}   -1  -1  -1   0   0   0   0   0   1   1   1
s_j       -1   0   1  -2  -1   0   1   2  -1   0   1

Then $P_j = t_j + s_j$ is the jth product digit. Because of the ranges
(3.8), $P_j \in \{-3, \ldots, 3\}$. Since

$$p[j] = \sum_{i=0}^{j} P_i 4^{-i}, \qquad P_i \in \{-3, \ldots, 3\} \tag{3.9}$$

the partial products p[j]'s are computed by concatenation

$$p[j] = \mathrm{concat}(p[j-1], P_j) \tag{3.10}$$

thus avoiding carry propagation.
3) The signed-digit representation of the product p[n/2] is
converted to conventional representation M[n/2] using a combinational
variation of the on-the-fly algorithm [5], [10], as shown in Fig. 4(b).
Instead of using the two conditional forms A and B, described in
Section II, we keep only A together with control signals $D_k[j]$
associated with each $A_k$ (digit of A), to determine whether the final
digit $M_k$ is $A_k$ or $(A_k - 1) \bmod 4$. The meaning of the control
signals is given in Table I.

Fig. 4. (a) Linear array of signed-digit adders for computing w's and
Z's. (b) Implementation of result generation: a segment.

TABLE I
CONVERSION CONTROL SIGNALS
D_k[j]   Decision at level j about value of product digit M_k[n/2]
u        undecided
n        decided: no change, M_k[n/2] = A_k
d        decided: decrement, M_k[n/2] = (A_k - 1) mod 4
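The recoding of $Z_j$ into a transfer digit and a sum digit, followed by the digit-wise addition $P_j = t_j + s_j$, can be sketched as follows. The threshold rule (transfer when $|Z_j| \ge 3$) is one possible recoding that satisfies (3.7) and (3.8); the function names `recode_z` and `product_digits` and the sample values are hypothetical. The returned list starts one position to the left of $Z_0$, to hold the transfer generated by $Z_0$.

```python
from fractions import Fraction

def recode_z(z):
    """Split z in [-5, 5] into (t, s) with z == 4*t + s, |t| <= 1, |s| <= 2."""
    if not -5 <= z <= 5:
        raise ValueError("Z digit out of range")
    t = 1 if z >= 3 else (-1 if z <= -3 else 0)
    return t, z - 4 * t

def product_digits(zs):
    """Return digits for positions -1, 0, ..., len(zs)-1.

    Position j receives s_j plus the transfer t_j generated by Z_{j+1},
    so each digit depends on two neighbouring Z values only.
    """
    pairs = [recode_z(z) for z in zs]
    transfers = [t for t, _ in pairs] + [0]   # transfers[k] is t_{k-1}
    sums = [s for _, s in pairs]
    return [transfers[0]] + [transfers[j + 1] + sums[j]
                             for j in range(len(zs))]

zs = [0, 5, -3, 2, -5, 4]                     # hypothetical Z_0..Z_5
P = product_digits(zs)
assert all(-3 <= d <= 3 for d in P)
# Same value before and after recoding (P[0] sits at weight 4^1).
before = sum(Fraction(z, 4 ** j) for j, z in enumerate(zs))
after = sum(d * Fraction(4) ** (1 - i) for i, d in enumerate(P))
assert before == after
```

Because a transfer moves exactly one position and is absorbed there, the digits can be concatenated as in (3.10) without any carry chain.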
A high-level description of the conversion process is as follows.
Initially,

$$D_j[j] = u. \tag{3.12}$$

For $k \le j$,

$$D_k[j+1] = \begin{cases} u & \text{if } D_k[j] = u \text{ and } P_{j+1} = 0 \\ n & \text{if } D_k[j] = n, \text{ or } D_k[j] = u \text{ and } P_{j+1} > 0 \\ d & \text{if } D_k[j] = d, \text{ or } D_k[j] = u \text{ and } P_{j+1} < 0 \end{cases} \tag{3.13}$$

Consequently, $D_k[j+1]$ depends on $D_k[j]$ and on the signals

$$\mathrm{zero}(P_j) = \begin{cases} 1 & \text{if } P_j = 0 \\ 0 & \text{otherwise} \end{cases} \qquad \mathrm{sign}(P_j) = \begin{cases} 1 & \text{if } P_j < 0 \\ 0 & \text{otherwise.} \end{cases} \tag{3.14}$$

The on-the-fly conversion of $P = 2\,0\,0\,3\,0\,0\,\bar{2}\,\bar{1}$ into
M = 20023313 is illustrated in Table II.

TABLE II
EXAMPLE OF ON-THE-FLY CONVERSION

Consequently, the conversion part is composed of
1) the modules A that generate $A_j$ according to (3.11),
2) the modules D that update $D_k[j]$ according to (3.13), and
3) the modules DEC that decrement $A_k$ (modulo 4) if $D_k[n/2] = d$.
Bit-Level Implementation and Comparison to Conventional
Schemes: The modules and their connections are indicated in Fig.
4(a) and (b). Their bit-level implementation is discussed in [6]. The
LRCF and a conventional scheme are similar regarding the following
parts: the binary-to-radix-4 multiplier recoder, the selection of
multiples of the multiplicand (±2x, ±1x, 0x), and the array
of redundant adders for the accumulation of partial products. These
adders are composed of signed-digit adder modules [12]-[14] which,
according to [12], are similar in area and delay to conventional full
adders. The principal difference is in producing the final product in
conventional representation from a sum of partial products in redundant
form: the LRCF scheme uses an on-the-fly converter, while in a
conventional scheme a CPA is required.
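The control-signal variant of the conversion (Table I) can be modeled with a short sketch. Two assumptions are made here because the source does not show them legibly: each digit of A is taken to be $A_k = P_k \bmod 4$, and an undecided digit is settled by the next nonzero product digit (negative means decrement, positive means no change). A digit still undecided at the end is left unchanged. The function name `convert_with_control_signals` is hypothetical.

```python
def convert_with_control_signals(P, r=4):
    """Return (M, D): conventional digits and the final control signals.

    D[k] is 'u' (undecided), 'n' (no change) or 'd' (decrement), as in Table I.
    """
    A = [d % r for d in P]               # assumed form of A_k: P_k mod r
    D = []
    for d in P:
        if d != 0:
            # A nonzero digit settles every earlier digit that is still
            # undecided: negative -> decrement, positive -> no change.
            settled = 'd' if d < 0 else 'n'
            D = [settled if s == 'u' else s for s in D]
        D.append('u')                    # the new digit starts undecided
    M = [(a - 1) % r if s == 'd' else a for a, s in zip(A, D)]
    return M, D

M, D = convert_with_control_signals([2, 0, 0, 3, 0, 0, -2, -1])
assert M == [2, 0, 0, 2, 3, 3, 1, 3]
assert D == ['n', 'n', 'n', 'd', 'd', 'd', 'd', 'u']
```

Only one digit vector is stored; the decision for each digit is a three-state signal updated from the zero and sign of later digits, and the decrement itself is a digit-local operation.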
Since a product of n bits is to be computed, those digits of the
array that do not influence the result can be eliminated. For a radix-4
recoding of the multiplier, the first half of the array is of full
precision and from then on, the number of radix-4 adders decreases by
one per level. A similar reduction in size of the adder array can be
achieved in a conventional right-to-left multiplier, as done for example
in the Cray X-MP processor [9], in which case the error in
multiplication is similar to that produced by the LRCF scheme.
Delay of the Scheme and Comparison to Conventional
Schemes: The delay of the generation of the product is composed
of the following:
1) Recoding and forming the multiples of the multiplicand.
2) Delay to obtain the last partial product w[n/2] in the
signed-digit adder array. This corresponds to (n/2) - 1 signed-digit
adders.
3) Delay to produce the last zero($P_j$) and sign($P_j$) signals.
4) Delay to determine the value of the last D's.
5) Delay of digit decrementing.
In comparison, in a conventional right-to-left multiplier the delay
corresponds to 1) and 2) above plus a carry-propagate addition.
Consequently, the scheme presented here is faster by the difference
in delay between the carry-propagate adder and the sum of the delays
3) to 5) above. Since the CPA delay is at best $O(\log_2 n)$ logic
levels and steps 3)-5) can be implemented in a couple of levels, this
difference is significant, especially for larger operand precision. To
reduce the total delay, the last step of the LRCF scheme can be
optimized for speed.
To illustrate the implementation we show in Fig. 5 an example of
an 8 × 8 bit multiplication. The additions are performed in radix-4
signed-digit.
Fig. 5. Example of LRCF multiplication.
IV. SUMMARY
The reported multiplication scheme (LRCF) eliminates the need
for a carry-propagate adder. The scheme performs the multiplication
most-significant digit first and produces a conventional
sign-and-magnitude product (most significant half) by means of an
on-the-fly conversion, performed concurrently with the generation of
accumulated (redundant) partial products. The scheme is presented
for general radix r and a radix-4 signed-digit implementation is
described. We estimate that, for a multiplier of 64 bits, the scheme we
described produces a reduction of about ten gate levels with respect
to a conventional scheme using a carry-look-ahead adder. The speed
can be improved by increasing the radix. In [6], we present a radix-16
implementation in which odd and even partial products [11] are
computed concurrently.
REFERENCES
[1] K. Hwang, Computer Arithmetic. New York: Wiley, 1978.
[2] M. Uya, K. Kaneko, and J. Yasui, “A CMOS floating-point multiplier,”
IEEE J. Solid-State Circuits, vol. SC-19, no. 5, pp. 697-701, Oct. 1984.
[3] A. Avizienis, “On a flexible implementation of digital computer
arithmetic,” in Information Processing 1962, C. M. Popplewell, Ed.
New York: North Holland, 1963, pp. 664-670.
[4] A. D. Booth, “A signed binary multiplication technique,” Quart. J.
Mech. Appl. Math., vol. 4, part 2, pp. 236-240, 1951.
[5] M. D. Ercegovac and T. Lang, “On-the-fly conversion of redundant
into conventional representations,” IEEE Trans. Comput., vol. C-36,
no. 7, pp. 895-897, July 1987.
[6] M. D. Ercegovac and T. Lang, “Fast multiplication without
carry-propagate addition,” UCLA Comput. Sci. Dep. Rep., 1986.
[7] A. Avizienis, “Signed-digit number representation for fast parallel
arithmetic,” IEEE Trans. Electron. Comput., vol. EC-10, pp. 389-400,
Sept. 1961.
[8] J. T. Coonen, “An implementation guide to a proposed standard for
floating-point arithmetic,” IEEE Comput. Mag., pp. 68-79, Jan. 1980.
[9] Anon., “Cray X-MP Computer Systems,” Four-Processor Mainframe
Reference Manual, HR-0097, Cray Research, Inc., 1985.
[10] M. D. Ercegovac and T. Lang, “Alternative on-the-fly conversion
of redundant into conventional representations,” UCLA Comput. Sci.
Dep. Rep. CSD-860027, Nov. 1986.
[11] J. Iwamura et al., “A 16-bit CMOS/SOS multiplier-accumulator,” in
Proc. ICCC82, 1982, pp. 151-154.
[12] S. Kuninobu et al., “Design of high-speed MOS multiplier and divider
using redundant binary representation,” in Proc. 8th Symp. Comput.
Arithmet., 1987, pp. 80-86.
[13] Y. Harata et al., “High-speed multiplier using a redundant binary
adder tree,” in Proc. 1984 IEEE Int. Conf. Comput. Design, 1984,
pp. 165-170.
[14] J. E. Robertson, “A systematic approach to the design of structures
for arithmetic,” in Proc. 5th Symp. Comput. Arithmet., 1981.
[15] M. D. Ercegovac and T. Lang, “Radix-4 multiplication without
carry-propagate addition,” in Proc. IEEE Int. Conf. Comput. Design:
VLSI Comput. Processors, Oct. 5-8, 1987, pp. 654-658.
Fast, Deterministic Routing, on Hypercubes, Using Small Buffers
Bradley C. Kuszmaul
Abstract-We propose a deterministic routing scheme for a communications
network based on the k-dimensional hypercube. We present two
formulations of the scheme. The first formulation delivers messages in
$O(k^2)$ bit times using $O(k)$ bits of buffer space at each node in the
hypercube. The second formulation assumes that there are several batches
of messages to be delivered, and makes certain assumptions about the
cost of sending messages along the various dimensions of the cube. In
this case, the latency for delivery time is still $O(k^2)$ bit times, but
the throughput is increased to one set of messages every $O(k)$ bit
times. For the first formulation, we restrict ourselves to routings which
are subsets of permutations (i.e., every node sends at most one message
and receives at most one message). The second formulation indicates a
way to perform routings which are subsets of H-permutations (i.e., every
node sends at most H messages and receives at most H messages).
Index Terms-Buffers, complexity theory, deterministic routing,
hypercubes, interconnection networks, parallel processing, routing.
I. INTRODUCTION
Several routing schemes based on the hypercube have been proposed
[7], [5], [15], [17], [12]. We discuss hypercubes with k dimensions
and $2^k = N$ vertices (which we call nodes). A nondeterministic
$O(k^2)$ bit time algorithm with $O(k^2)$ bits of storage at each node
is described in [17]. In this paper, we describe a deterministic
$O(k^2)$ bit time algorithm with $O(k)$ bits of storage at each node. We
go on to describe an alternative deterministic algorithm, based on a
slightly modified network, with $O(k^2)$ bit time latency for messages
traveling through the network, $O(k)$ throughput (i.e., one message
every $O(k)$ bit times), and $O(k^2)$ bits of storage at each node.
When describing hypercube networks we define a node to be a
vertex on the hypercube. When describing multiprocessor computer
systems, we define a processor to be the hardware which sends and
receives messages. In some computer systems (e.g., the connection
machine [9]), the processors are associated with the nodes of the
hypercube routing network.
In general, we assume that messages are at least $O(k)$ bits long
(because, for example, it should be possible to transmit a node address
in a message). This gives a lower bound for routing of $\Omega(k)$ bit
times.

Manuscript received October 6, 1987; revised January 28, 1988.
The author is with the Massachusetts Institute of Technology, Cambridge,
MA 02139.
IEEE Log Number 9035138.
0018-9340/90/1100-1390$01.00 © 1990 IEEE