Fast Multiplication Without Carry-Propagate Addition


IEEE TRANSACTIONS ON COMPUTERS, VOL. 39, NO. 11, NOVEMBER 1990

1385

TABLE IV: TESTS FOR F AND FA,

TABLE V: TESTS FOR R N ,

The test sets grow linearly with the input size of the counter for the following reason. The number of full adders $N_{FA}$ in a generalized counter is related to the number of counter inputs $CI$ as $N_{FA} = CI - \lceil \log_2 N \rceil$, where $CI = \sum_{j=0}^{k-1} i_j$ and $N = \sum_{j=0}^{k-1} i_j 2^j$. In the above, the second term is smaller than the first. Since at most 32 tests are needed to test each full adder, the test size $T \le 32\,CI$ and the hypothesis follows. It can be shown that for a $C(CI : \log_2(CI + 1))$ balanced [4] counter, where $CI$ is of the form $2^n - 1$ and $n \ge 2$,

T

=

33C I/2 5 log, (CI

+

1)

+

5

for CI

>

3.

Example

2:

We now present a test set for the reduction tree R N ,

These are referred to as Construction 1 Counters in [7].

under the assumption that R N ~as already been tested and deter-

mined to be fault-free under the multiple fault assumption. In order

to test the full adders of R N , , we apply the tests corresponding

to $E_0(e)$ and $E_1(e)$, which are the tests 1-4 and 5-8, respectively, in Table IV. Next, the tests corresponding to $E_0(c)$ and $E_1(c)$ are applied. These are the tests 9-12 and 13-16 in Table IV. If these tests “pass,” the tests in Table V are applied. Tests 17-20 correspond to the experiment $EXHT_0(e, c)$, tests 21-24 to $EXHT_1(e, c)$, tests 25-28 to $EXHT_2(e, c)$, tests 29-32 to $EXHT_3(e, c)$, tests 33-36 to $EXHT_4(e, c)$, tests 37-40 to $EXHT_5(e, c)$, tests 41-44 to $EXHT_6(e, c)$, and tests 45-48 to $EXHT_7(e, c)$.

Some of the tests in Tables IV and V are applied more than once and can be eliminated. Tests 17-20 of $EXHT_0(e, c)$ are also applied by $E_0(a) \cup E_1(a)$, $E_0(b) \cup E_1(b)$, as are tests 25, 27 of $EXHT_2(e, c)$ and tests 33, 34 of $EXHT_4(e, c)$. These tests need not be reapplied. In order to test FAd we need only apply all possible input combinations to its inputs. These tests are contained in those listed in Table V. The $C(5,5 : 4)$ counter is testable in 61 tests.

C. Conclusions

In this paper, we have discussed the problem of multiple faulty cells in generalized counters for two different cases, one in which the class of circuits is general but some restrictions are placed on the faults allowed and another in which the class of circuits is restricted but the fault model is more general. A theory for the detection of multiple faulty cells in generalized counters has been developed. Based on this theory, two schemes for generating test sets that detect multiple faults in generalized counters have been presented. It is hoped that the techniques presented in this paper will open new avenues to similar test generation problems for circuits with complex interconnection structures.

REFERENCES

[1] J. A. Abraham and D. D. Gajski, “Design of testable structures defined by simple loops,” IEEE Trans. Comput., vol. C-30, no. 11, pp. 875-883, Nov. 1981.

[2] A. Chatterjee and J. A. Abraham, “On the C-testability of generalized counters,” IEEE Trans. Comput.-Aided Design, vol. CAD-6, no. 5, pp. 713-726, Sept. 1987.

[3] A. Vergis, “Linear-testable counters for multiple faults,” in Proc. Int. Conf. Comput. Aided Design, Nov. 1987, pp. 156-159.

[4] S. C. Seth and K. L. Kodandapani, “Diagnosis of faults in linear tree networks,” IEEE Trans. Comput., vol. C-26, pp. 29-33, Jan. 1977.

[5] W. T. Cheng, “Testing and error detection in iterative logic arrays,” Ph.D. dissertation, 1985.

[6] K. Hwang, Computer Arithmetic, Principles, Architecture and Design. New York: Wiley, 1979.

[7] A. Chatterjee and J. A. Abraham, “C-Testability of generalized tree structures with applications to Wallace trees and other circuits,” in Proc. Int. Conf. Comput.-Aided Design, Nov. 1986, pp. 288-291.

Fast Multiplication Without Carry-Propagate Addition

Miloš D. Ercegovac and Tomás Lang

Abstract-Conventional schemes for fast multiplication accumulate the partial products in redundant form (carry-save or signed-digit) and convert the result to conventional representation in the last step. This step requires a carry-propagate adder which is comparatively slow and occupies a significant area of the chip in a VLSI implementation. In this paper, we report a multiplication scheme (LRCF: left-to-right, carry-free) that does not require this carry-propagate step. The LRCF scheme performs the multiplication most-significant bit first and produces a conventional sign-and-magnitude product (most significant n bits) by means of an on-the-fly conversion. The resulting implementation is fast and regular and is very well suited for VLSI. The LRCF scheme for general radix r and a radix-4 signed-digit implementation are presented.

Index Terms-Carry-save multiplier, digital arithmetic, left-to-right multiplier, multiplication, signed-digit multiplier, signed-digit representation, VLSI implementation.

Manuscript received October 20, 1987; revised March 1, 1989.

The authors are with the Department of Computer Science, the University of California, Los Angeles, CA 90024.

IEEE Log Number 9035139.

0018-9340/90/1100-1385$01.00 © 1990 IEEE

I. INTRODUCTION

All common schemes for multiplication, sequential as well as combinational, require a carry-propagate addition in the final step (stage) to obtain the 2n-bit product [1]. In sequential right-to-left n-bit multipliers, the least significant half of the product is obtained without the use of a carry-propagate adder (CPA). The most significant half, however, requires an n-bit CPA to complete the operation. Similarly, in multipliers using linear arrays of carry-save adders, the least significant half is generated during the reduction process. To obtain the most significant half, an n-bit CPA is used. In the case of tree-type multipliers, a CPA of more than n bits is required. For example, a Dadda or a Wallace multiplier with s reduction stages requires an adder of 2n bits. Such a carry-propagate adder is comparatively slow and occupies a significant area of the chip in a VLSI implementation [2].

In this paper, we report a novel multiplication scheme that produces the most-significant half of the product without the carry-propagate step. The basic characteristics of the proposed LRCF (left-to-right, carry-free) scheme are:

1) The recurrence uses the digits of the multiplier from most to least significant (left-to-right multiplication) [3]. The multiplier can be recoded into a suitable radix-r representation to reduce the number of steps [4].

2) The accumulated partial products are decomposed into two parts: the most significant part and the least significant part.

3) To produce the product, the most significant portion of the accumulated partial products is converted to conventional form using a variation of the on-the-fly algorithm presented in [5], without the need of carry-propagate addition.

The LRCF scheme can be used both for sequential and combinational implementations. We concentrate here on the combinational case, since it provides the greatest speed advantage. The resulting implementation is fast and regular and is very well suited for VLSI implementation.

II. THE LRCF MULTIPLICATION ALGORITHM

Consider multiplication of normalized fractions in the sign-and-magnitude representation. Let $X$ be the radix-2 representation of the normalized fractional magnitude $x$, such that

$$x = \sum_{i=1}^{n} X_i 2^{-i}, \qquad X_i \in \{0, 1\}$$

and let $Y$ be the recoded radix-$r$ representation of the normalized fractional magnitude $y$, such that

$$y = \sum_{i=0}^{n/q} Y_i r^{-i}, \qquad Y_i \in \{-r/2, \ldots, r/2\} \quad \text{(minimally redundant)}$$

where, for simplicity, $r = 2^q$.

The LRCF multiplication algorithm is a recurrence that produces a sequence of two accumulated partial products ($w$ and $p$) as follows:

$$w[j] = r\,(\mathrm{fraction}(w[j-1] + xY_j)), \qquad j = 0, \ldots, n/q \tag{2.2}$$

$$Z_j = \mathrm{integer}(w[j-1] + xY_j) \tag{2.3}$$

and

$$p[j] = p[j-1] + Z_j r^{-j}. \tag{2.4}$$

The initial values are $w[-1] = p[-1] = 0$. The algorithm uses the digits of the multiplier from most significant to least significant [3], unlike conventional multiplication schemes, which use the digits from least significant to most significant.

To show that the LRCF algorithm performs multiplication, observe that the sum of partial products after $k$ steps satisfies

$$p[k] + w[k] \times r^{-k-1} = \sum_{i=0}^{k} x \times Y_i \times r^{-i}. \tag{2.5}$$

Consequently, after $n/q$ steps we obtain

$$p[n/q] + w[n/q] \times r^{-n/q-1} = \sum_{i=0}^{n/q} x \times Y_i \times r^{-i} = xy. \tag{2.6}$$

That is, $p[n/q]$ is the most significant part of the product while $w[n/q]$ is the least significant part.
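The recurrence (2.2)-(2.4) and the invariant (2.5) can be checked numerically. The following is a minimal sketch, not the paper's hardware scheme: it uses exact rational arithmetic instead of a redundant adder, takes `integer` as the floor and `fraction` as the remainder (an assumption; any split with value = integer + fraction satisfies (2.5)), and uses a simple hypothetical recoder `recode` to obtain minimally redundant digits. The names `recode` and `lrcf` are illustrative only.

```python
from fractions import Fraction
from math import floor


def recode(y_num, m, r):
    """Return digits Y_0..Y_m in {-r/2..r/2} with y_num / r**m == sum(Y_i * r**-i).

    Hypothetical recoder used only for this sketch (least significant digit first,
    folding digits above r/2 into the negative range with a carry).
    """
    digits = []
    for _ in range(m):
        d = y_num % r
        if d > r // 2:
            d -= r
        digits.append(d)
        y_num = (y_num - d) // r
    digits.append(y_num)  # Y_0, the digit of weight r**0
    return digits[::-1]


def lrcf(x, Y, r):
    """Run (2.2)-(2.4) with exact arithmetic; return a list of (Z_j, w[j], p[j])."""
    w = Fraction(0)
    p = Fraction(0)
    trace = []
    for j, Yj in enumerate(Y):
        v = w + x * Yj
        Z = floor(v)             # integer part, (2.3)
        w = r * (v - Z)          # shifted fraction part, (2.2)
        p += Fraction(Z, r**j)   # accumulate with weight r**-j, (2.4)
        trace.append((Z, w, p))
    return trace


# Operand 0.11010111 in binary, multiplied by itself in radix 4.
x = Fraction(215, 256)
Y = recode(215, m=4, r=4)
for k, (Z, w, p) in enumerate(lrcf(x, Y, 4)):
    lhs = p + w / 4 ** (k + 1)
    rhs = sum(x * Y[i] / 4**i for i in range(k + 1))
    assert lhs == rhs  # invariant (2.5) holds at every step
```

After the last step the same identity gives (2.6): the accumulated `p` plus the scaled residual `w` equals the exact product.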

A block diagram of one step of the recurrence is shown in Fig. 1. A fast implementation requires the following.

1) Use of a redundant adder (either carry-save or signed-digit [7]) to produce $w[j]$. This results in a carry-free addition.

2) Addition by concatenation to produce $p[j]$. That is,

$$p[j] = \mathrm{concat}(p[j-1], P_j). \tag{2.7}$$

Since the maximum value of $Z_j$ in (2.3) is, in general, larger than $r$, to perform this concatenation it is necessary to recode $Z_j$ and $Z_{j+1}$ into $P_j$ in the range $[-(r-1), (r-1)]$. That is,

$$P_j = F(Z_j, Z_{j+1}), \qquad P_j \in [-(r-1), (r-1)]. \tag{2.8}$$

The details of the recoding depend on the range of $Z_j$, which is dependent on the type of redundant adder used to produce $w[j]$.

3) The on-the-fly conversion of the resulting signed-digit representation of $p[j]$ into a conventional representation $M[j]$. An algorithm to perform this conversion is given in [5]. In contrast with the traditional approach that performs the conversion by subtracting the negative part from the positive part (and therefore uses a carry-propagate adder), this on-the-fly scheme forms two numbers $a[j]$ and $b[j]$ such that $a[j]$ is the converted number up to digit $j$ and $b[j]$ is $a[j] - r^{-j}$. When a new digit $P_j$ is produced, if it is positive or zero it is concatenated to $A[j-1]$ (digit-vector representing $a[j-1]$), while if it is negative $A[j]$ is obtained by concatenating $r - |P_j|$ to $B[j-1]$. That is,

$$A[j] = \begin{cases} \mathrm{concat}(A[j-1], P_j) & \text{if } P_j \ge 0 \\ \mathrm{concat}(B[j-1], (r - |P_j|)) & \text{if } P_j < 0 \end{cases}$$

$$B[j] = \begin{cases} \mathrm{concat}(A[j-1], (P_j - 1)) & \text{if } P_j > 0 \\ \mathrm{concat}(B[j-1], ((r - 1) - |P_j|)) & \text{if } P_j \le 0 \end{cases} \tag{2.9}$$

with the initial condition

$$A[-1] = B[-1] = 0. \tag{2.10}$$

After the last step, the most significant half of the product in conventional representation is

$$M[n/q] = A[n/q].$$
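The update rule (2.9) can be illustrated in a few lines. This is a sketch under stated assumptions: digit vectors are Python lists with the most significant digit first, the empty list stands in for the initial condition (2.10), and the represented value is assumed nonnegative (its first nonzero digit is positive). The function names and the radix-4 digit string in the usage line are illustrative choices, not taken from the paper.

```python
def on_the_fly(P, r):
    """Convert signed digits P (most significant first) to digits in 0..r-1.

    A is the converted prefix; B is the same prefix minus one unit in its
    last place. Each new digit is appended to A or to B, so no carry or
    borrow ever propagates through the digits already produced.
    """
    A, B = [], []  # empty prefixes play the role of A[-1] = B[-1] = 0
    for d in P:
        new_A = A + [d] if d >= 0 else B + [r - abs(d)]
        new_B = A + [d - 1] if d > 0 else B + [(r - 1) - abs(d)]
        A, B = new_A, new_B
    return A


def value(digits, r):
    """Integer value of a digit vector (Horner evaluation, signed digits allowed)."""
    v = 0
    for d in digits:
        v = v * r + d
    return v


P = [2, 0, 0, 3, 0, 0, -3, 3]  # illustrative radix-4 signed-digit string
M = on_the_fly(P, 4)           # [2, 0, 0, 2, 3, 3, 1, 3]
assert value(M, 4) == value(P, 4)
```

The design point is that both candidates are kept until the sign of the next digit is known, which trades a second register for the carry-propagate subtraction.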

The step of Fig. 1 can be used to implement either a sequential multiplier or a combinational one. The basic scheme for the combinational case is illustrated in Fig. 2 together with the corresponding conventional right-to-left approach. The fundamental difference is shown in the timing diagram of Fig. 2(c): the two phases of multiplication (generation of partial products and formation of the final result in a conventional representation) are performed concurrently in the LRCF scheme and one after the other in conventional schemes.

Fig. 1. One step of the LRCF multiplication scheme.

Since the result of the multiplication corresponds to the most-significant half of the product, an error is made with respect to the correct product. Several rounding schemes can be used to bound this error [8]. In the LRCF scheme, since the least-significant half of the product is left in redundant form, the error is somewhat larger than in the conventional right-to-left scheme. The actual range of the error depends on the type of redundant adder used, and several solutions are discussed in [6].

III. RADIX-4 IMPLEMENTATION WITH SIGNED-DIGIT ADDITION

The presented combinational implementation of the LRCF scheme for a radix-4 multiplication unit uses a linear array of signed-digit adders [7] for the computation of $w$. An implementation using carry-save adders is discussed in [6], [15]. For radix 4 the recurrences are

$$w[j] = 4\,(\mathrm{fraction}(w[j-1] + xY_j)), \qquad j = 0, \ldots, n/2, \quad w[-1] = 0, \quad Y_j \in \{-2, -1, 0, 1, 2\} \tag{3.1}$$

$$Z_j = \mathrm{integer}(w[j-1] + xY_j) \tag{3.2}$$

and

$$p[j] = p[j-1] + Z_j 4^{-j} = \mathrm{concat}(p[j-1], P_j), \qquad p[-1] = 0 \tag{3.3}$$

where $P_j$ is obtained from $Z_j$ and $Z_{j+1}$ by signed-digit addition [7] as described next.

To determine the range of $Z_j$ observe that

$$Z_j = \mathrm{integer}(w[j-1] + xY_j) = \mathrm{integer}(w[j-1]) + \mathrm{integer}(xY_j) + t_0 \tag{3.4}$$

where $t_0 \in \{-1, 0, 1\}$ is a transfer digit resulting from the signed-digit addition of the fractions of $w[j-1]$ and $xY_j$. Moreover, the most significant digit of the fraction of the signed-digit sum

$$w[j-2] + xY_{j-1} \tag{3.5}$$

is in the range $\{-3, \ldots, 3\}$. Therefore,

$$\mathrm{integer}(w[j-1]) = \mathrm{integer}(4 \times \mathrm{fraction}(w[j-2] + xY_{j-1})) \tag{3.6}$$

is in the range $\{-3, \ldots, 3\}$. Since $|x| < 1$ is a fraction in conventional form and the recoded multiplier digits $Y_j$ are in the range $\{-2, \ldots, 2\}$, $\mathrm{integer}(xY_j)$ is in the range $\{-1, 0, 1\}$. Consequently, the range of $Z_j$ is $[-5, 5]$ as illustrated in Fig. 3.

Fig. 2. (a) LRCF combinational multiplier. (b) Conventional combinational multiplier. (c) Timing comparison between LRCF and conventional multiplication schemes.

Fig. 3. Range of Z digits.
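The bound can be written out as a single worked inequality. This is only a restatement of the three ranges given for the terms of (3.4), combined with the triangle inequality:

```latex
|Z_j| \;\le\; |\mathrm{integer}(w[j-1])| + |\mathrm{integer}(xY_j)| + |t_0|
      \;\le\; 3 + 1 + 1 \;=\; 5
```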

The implementation consists of three parts as follows:

1) A linear array of signed-digit adders to compute the $w$'s. This array includes the recoder of $Y$ and the selection of the multiples of the multiplicand.

Fig. 4. (a) Linear array of signed-digit adders for computing $w$'s and $Z$'s. (b) Implementation of result generation: a segment.

2) A linear array of modules TS to compute the partial products $p[j]$'s [Fig. 4(b)] from left to right. In the partial product computation (3.3), a direct addition of $Z$ would require carry propagation. To perform these additions without carry propagation, $Z_j$ (with range $[-5, 5]$) is recoded into $t_{j-1}$ and $s_j$ such that

$$Z_j = 4t_{j-1} + s_j \tag{3.7}$$

with

$$|t_{j-1}| \le 1 \quad \text{and} \quad |s_j| \le 2. \tag{3.8}$$

A possible recoding of $Z_j$ is

Z_j       -5  -4  -3  -2  -1   0   1   2   3   4   5
t_{j-1}   -1  -1  -1   0   0   0   0   0   1   1   1
s_j       -1   0   1  -2  -1   0   1   2  -1   0   1

Then $P_j = t_j + s_j$ is the $j$th product digit. Because of the ranges (3.8), $P_j \in \{-3, \ldots, 3\}$. Since

$$p[j] = \sum_{i=0}^{j} P_i 4^{-i}, \qquad P_i \in \{-3, \ldots, 3\} \tag{3.9}$$

the partial products $p[j]$'s are computed by concatenation

$$p[j] = \mathrm{concat}(p[j-1], P_j) \tag{3.10}$$

thus avoiding carry propagation.

3) The signed-digit representation of the product $p[n/2]$ is converted to conventional representation $M[n/2]$ using a combinational variation of the on-the-fly algorithm [5], [10], as shown in Fig. 4(b). Instead of using the two conditional forms $A$ and $B$, described in Section II, we keep only $A$ together with control signals $D_k[j]$ associated with each $A_k$ (digit of $A$), to determine whether the final digit $M_i$ is $A_i$ or $(A_i - 1) \bmod 4$. The meaning of the control signals is given in Table I.

TABLE I
CONVERSION CONTROL SIGNALS

D_k[j]   Decision at level j about value of product digit M_k[n/2]
u        undecided
n        decided: no change (M_k[n/2] = A_k)
d        decided: decrement (M_k[n/2] = (A_k - 1) mod 4)

TABLE II
EXAMPLE OF ON-THE-FLY CONVERSION
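The recoding step (3.7)-(3.10) of part 2) can be sketched as follows. The lookup table is one valid choice satisfying (3.7) and (3.8), and the sketch assumes that $Z_0$ generates no transfer, so that no digit is needed to the left of $P_0$. The names `RECODE`, `product_digits`, and `value` are hypothetical.

```python
# One valid recoding of Z in [-5, 5] as Z = 4*t + s with |t| <= 1 and |s| <= 2.
RECODE = {
    z: (t, z - 4 * t)
    for z, t in [(-5, -1), (-4, -1), (-3, -1), (-2, 0), (-1, 0), (0, 0),
                 (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]
}


def product_digits(Z):
    """Return P_j = t_j + s_j, where Z_j = 4*t_{j-1} + s_j.

    The transfer produced by Z_{j+1} lands in position j, so each product
    digit depends only on two neighbouring Z digits.
    """
    t_out = [RECODE[z][0] for z in Z]  # t_out[j] is t_{j-1}, produced by Z_j
    s = [RECODE[z][1] for z in Z]
    if t_out[0] != 0:
        raise ValueError("Z_0 must not generate a transfer in this sketch")
    return [
        s[j] + (t_out[j + 1] if j + 1 < len(Z) else 0)
        for j in range(len(Z))
    ]


def value(digits, r=4):
    """Integer value of a digit vector, most significant digit first."""
    v = 0
    for d in digits:
        v = v * r + d
    return v


Z = [1, 5, -3, 4, -5, 2]
P = product_digits(Z)            # [2, 0, 2, -1, -1, 2]
assert value(P) == value(Z)      # same number, now with digits in -3..3
assert all(abs(d) <= 3 for d in P)
```

Because $|t| \le 1$ and $|s| \le 2$, the sum of a transfer and an interim digit never leaves the range $\{-3, \ldots, 3\}$, which is what makes the concatenation in (3.10) possible.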

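A high-level description of the conversion process is as follows. Initially,

$$D_j[j] = u. \tag{3.12}$$

For $k < j$,

$$D_k[j+1] = \begin{cases} u & \text{if } D_k[j] = u \text{ and } P_{j+1} = 0 \\ \;\vdots \end{cases} \tag{3.13}$$

Consequently, $D_k[j+1]$ depends on $D_k[j]$ and on the signals

$$\mathrm{zero}(P_j) = \begin{cases} 1 & \text{if } P_j = 0 \\ 0 & \text{otherwise} \end{cases} \qquad \mathrm{sign}(P_j) = \begin{cases} 1 & \text{if } P_j < 0 \\ 0 & \text{otherwise.} \end{cases} \tag{3.14}$$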
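A small sketch of this control-signal variant follows. Since only part of the update rule is legible here, the rule used below is an assumed reading: a digit stays undecided while the following digits are zero, becomes `d` when the next nonzero digit is negative, becomes `n` when it is positive, and undecided digits at the end are left unchanged. Taking $A_j = P_j \bmod 4$ is likewise an assumption, as is the function name.

```python
def convert_with_control(P, r=4):
    """Convert signed digits P using only A and per-digit states 'u', 'n', 'd'.

    Assumed reading of the control signals: 'u' undecided, 'n' no change,
    'd' decrement modulo r. The value is assumed nonnegative.
    """
    A, D = [], []
    for digit in P:
        if digit != 0:
            # sign(P_j) settles every digit that is still undecided
            state = 'd' if digit < 0 else 'n'
            D = [state if s == 'u' else s for s in D]
        # zero(P_j) = 1 leaves all states as they are
        A.append(digit % r)
        D.append('u')
    # digits still undecided after the last step are not changed
    return [(a - 1) % r if s == 'd' else a for a, s in zip(A, D)]


P = [2, 0, 0, 3, 0, 0, -3, 3]       # illustrative radix-4 signed-digit string
M = convert_with_control(P)         # [2, 0, 0, 2, 3, 3, 1, 3]
```

Compared with keeping both A and B, this form stores one digit vector plus a short state per digit, and the decrement is applied once at the end.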

The on-the-fly conversion of P = 2003 00B into M = 20023313 is illustrated in Table II.

Consequently, the conversion part is composed of

1) the modules A that generate $A_j$ according to (3.11),

2) the modules D that update $D_k[j]$ according to (3.13), and

3) the modules DEC that decrement $A_k$ (modulo 4) if $D_k[n/2] = d$.

Bit-Level Implementation and Comparison to Conventional Schemes: The modules and their connections are indicated in Fig. 4(a) and (b). Their bit-level implementation is discussed in [6]. The LRCF and a conventional scheme are similar regarding the following parts: the binary-to-radix-4 multiplier recoder, the selection of multiples of the multiplicand ($\pm 2x$, $\pm x$, $0$), and the array of redundant adders for the accumulation of partial products. These adders are composed of signed-digit adder modules [12]-[14] which, according to [12], are similar in area and delay to conventional full adders. The principal difference is in producing the final product in conventional representation from a sum of partial products in redundant form: the LRCF scheme uses an on-the-fly converter while in a conventional scheme a CPA is required.

Since a product of n bits is to be computed, those digits of the array that do not influence the result can be eliminated. For a radix-4 recoding of the multiplier, the first half of the array is of full precision and from then on, the number of radix-4 adders decreases by one per level. A similar reduction in size of the adder array can be achieved in a conventional right-to-left multiplier, as done for example in the Cray X-MP processor [9], in which case the error in multiplication is similar to that produced by the LRCF scheme.

Delay of the Scheme and Comparison to Conventional Schemes: The delay of the generation of the product is composed of the following:

1) Recoding and forming the multiples of the multiplicand.

2) Delay to obtain the last partial product w[n/2] in the signed-

digit adder array. This corresponds to ( n / 2 ) 1 signed-digit adders.

3) Delay to produce the last $\mathrm{zero}(P_j)$ and $\mathrm{sign}(P_j)$ signals.

4) Delay to determine the value of the last $D$'s.

5) Delay of digit decrementing.

In comparison, in a conventional right-to-left multiplier the delay corresponds to 1) and 2) above plus a carry-propagate addition. Consequently, the scheme presented here is faster by the difference in delay between the carry-propagate adder and the sum of the delays 3) to 5) above. Since the CPA delay is at best $O(\log_2 n)$ logic levels and steps 3)-5) can be implemented in a couple of levels, this difference is significant, especially for larger operand precision. To reduce the total delay, the last step of the LRCF scheme can be optimized for speed.

To illustrate the implementation we show in Fig. 5 an example of an 8 × 8 bit multiplication. The additions are performed in radix-4 signed-digit.

Fig. 5. Example of LRCF multiplication, with $x = y = 0.11010111_2 = 0.3113_4$.

IV. SUMMARY

The reported multiplication scheme (LRCF) eliminates the need for a carry-propagate adder. The scheme performs the multiplication most-significant digit first and produces a conventional sign-and-magnitude product (most significant half) by means of an on-the-fly conversion, performed concurrently with the generation of accumulated (redundant) partial products. The scheme is presented for general radix r and a radix-4 signed-digit implementation is described. We estimate that, for a multiplier of 64 bits, the scheme we described produces a reduction of about ten gate levels with respect to a conventional scheme using a carry-look-ahead adder. The speed can be improved by increasing the radix. In [6], we present a radix-16 implementation in which odd and even partial products [11] are computed concurrently.

REFERENCES

[1] K. Hwang, Computer Arithmetic. New York: Wiley, 1978.

[2] M. Uya, K. Kaneko, and J. Yasui, “A CMOS floating-point multiplier,” IEEE J. Solid-State Circuits, vol. SC-19, no. 5, pp. 697-701, Oct. 1984.

[3] A. Avizienis, “On a flexible implementation of digital computer arithmetic,” in Information Processing 1962, C. M. Popplewell, Ed. New York: North Holland, 1963, pp. 664-670.

[4] A. D. Booth, “A signed binary multiplication technique,” Quart. J. Mech. Appl. Math., vol. 4, part 2, pp. 236-240, 1951.

[5] M. D. Ercegovac and T. Lang, “On-the-fly conversion of redundant into conventional representations,” IEEE Trans. Comput., vol. C-36, no. 7, pp. 895-897, July 1987.

[6] M. D. Ercegovac and T. Lang, “Fast multiplication without carry-propagate addition,” UCLA Comput. Sci. Dep. Rep., 1986.

[7] A. Avizienis, “Signed-digit number representation for fast parallel arithmetic,” IEEE Trans. Electron. Comput., vol. EC-10, pp. 389-400, Sept. 1961.

[8] J. T. Coonen, “An implementation guide to a proposed standard for floating-point arithmetic,” IEEE Comput. Mag., pp. 68-79, Jan. 1980.

[9] Anon., “Cray X-MP Computer Systems,” Four-Processor Mainframe Reference Manual, HR-0097, Cray Research, Inc., 1985.

[10] M. D. Ercegovac and T. Lang, “Alternative on-the-fly conversion of redundant into conventional representations,” UCLA Comput. Sci. Dep. Rep. CSD-860027, Nov. 1986.

[11] J. Iwamura et al., “A 16-bit CMOS/SOS multiplier-accumulator,” in Proc. ICCC82, 1982, pp. 151-154.

[12] S. Kuninobu et al., “Design of high-speed MOS multiplier and divider using redundant binary representation,” in Proc. 8th Symp. Comput. Arithmet., 1987, pp. 80-86.

[13] Y. Harata et al., “High-speed multiplier using a redundant binary adder tree,” in Proc. 1984 IEEE Int. Conf. Comput. Design, 1984, pp. 165-170.

[14] J. E. Robertson, “A systematic approach to the design of structures for arithmetic,” in Proc. 5th Symp. Comput. Arithmet., 1981.

[15] M. D. Ercegovac and T. Lang, “Radix-4 multiplication without carry-propagate addition,” in Proc. IEEE Int. Conf. Comput. Design: VLSI Comput. Processors, Oct. 5-8, 1987, pp. 654-658.

Fast, Deterministic Routing, on Hypercubes, Using Small Buffers

Bradley C. Kuszmaul

Abstract-We propose a deterministic routing scheme for a communications network based on the k-dimensional hypercube. We present two formulations of the scheme. The first formulation delivers messages in $O(k^2)$ bit times using $O(k)$ bits of buffer space at each node in the hypercube. The second formulation assumes that there are several batches of messages to be delivered, and makes certain assumptions about the cost of sending messages along the various dimensions of the cube. In this case, the latency for delivery time is still $O(k^2)$ bit times, but the throughput is increased to one set of messages every $O(k)$ bit times. For the first formulation, we restrict ourselves to routings which are subsets of permutations (i.e., every node sends at most one message and receives at most one message). The second formulation indicates a way to perform routings which are subsets of H-permutations (i.e., every node sends at most H messages and receives at most H messages).

Index Terms-Buffers, complexity theory, deterministic routing, hypercubes, interconnection networks, parallel processing, routing.

I. INTRODUCTION

Several routing schemes based on the hypercube have been proposed [7], [5], [15], [17], [12]. We discuss hypercubes with $k$ dimensions and $2^k = N$ vertices (which we call nodes). A nondeterministic $O(k^2)$ bit time algorithm with $O(k^2)$ bits of storage at each node is described in [17]. In this paper, we describe a deterministic $O(k^2)$ bit time algorithm with $O(k)$ bits of storage at each node. We go on to describe an alternative deterministic algorithm, based on a slightly modified network, with $O(k^2)$ bit time latency for messages traveling through the network, $O(k)$ throughput (i.e., one message every $O(k)$ bit times), and $O(k^2)$ bits of storage at each node.

When describing hypercube networks we define a node to be a vertex on the hypercube. When describing multiprocessor computer systems, we define a processor to be the hardware which sends and receives messages. In some computer systems (e.g., the connection machine [9]), the processors are associated with the nodes of the hypercube routing network.

In general, we assume that messages are at least $O(k)$ bits long (because, for example, it should be possible to transmit a node address in a message). This gives a lower bound for routing of $\Omega(k)$ bit times.

Manuscript received October 6, 1987; revised January 28, 1988.

The author is with the Massachusetts Institute of Technology, Cambridge, MA 02139.

IEEE Log Number 9035138.

0018-9340/90/1100-1390$01.00 © 1990 IEEE
