Fast Multiplication Without Carry-Propagate Addition

Upload: hari

Post on 07-Jul-2018


IEEE TRANSACTIONS ON COMPUTERS, VOL. 39, NO. 11, NOVEMBER 1990, p. 1385

    TABLE IV

    TESTS

    FOR F AND FA,

    TABLE V

    TESTS

    OR

    R N ,

    The test sets grow linearly with the input size of the counter for the

    following reason. The number of full adders NFAn a generalized

    counter is related to the number of counter inputs CI as N F ~

    CI [log, N’], where CI

    =

    C;Igk-’ij and N = x;z;-‘i,2J. In

    the above, the second term is smaller than the first. Since at most 32

tests are needed to test each full adder, the test size $T \le 32\,CI$ and the

    hypothesis follows. It can be shown that for a C(C1: log, (CI + 1))

    b al an ce d [ 4] ~ o u n t e r , ~here CI is of the form 2” 1 , and

    n

    2 2 ,

    T

    =

    33C I/2 5 log, (CI

    +

    1)

    +

    5

    for CI

    >

    3.

    Example

    2:

    We now present a test set for the reduction tree R N ,

    These are referred to as Construction 1 Counters in [7].

    under the assumption that R N ~as already been tested and deter-

    mined to be fault-free under the multiple fault assumption. In order

    to test the full adders of R N , , we apply the tests corresponding

    to E o( e ) and

    E l ( e )

    which are the tests 1-4 and 5-8, res pectively

    in Table

    IV.

    Next, the tests corresponding to

    Eo c)

    and E l ( c ) are

applied. These are the tests 9-12 and 13-16 in Table IV. If these tests "pass," the tests in Table V are applied. Tests 17-20 correspond to the experiment $EXHT_0(e, c)$, tests 21-24 to $EXHT_1(e, c)$, tests 25-28 to $EXHT_2(e, c)$, tests 29-32 to $EXHT_3(e, c)$, tests 33-36 to $EXHT_4(e, c)$, tests 37-40 to $EXHT_5(e, c)$, tests 41-44 to $EXHT_6(e, c)$, and tests 45-48 to $EXHT_7(e, c)$.

Some of the tests in Tables IV and V are applied more than once and can be eliminated. Tests 17-20 of $EXHT_0(e, c)$ are also applied by $E_0(a) \cup E_1(a)$, $E_0(b) \cup E_1(b)$, as are tests 25, 27 of $EXHT_2(e, c)$ and tests 33, 34 of $EXHT_4(e, c)$. These tests need not be reapplied. In order to test FAd we need only apply all possible input combinations to its inputs. These tests are contained in those listed in Table V. The $C(5,5 : 4)$ counter is testable in 61 tests.

C. Conclusions

In this paper, we have discussed the problem of multiple faulty cells in generalized counters for two different cases, one in which the class of circuits is general but some restrictions are placed on the faults allowed and another in which the class of circuits is restricted but the fault model is more general. A theory for the detection of multiple faulty cells in generalized counters has been developed. Based on this theory, two schemes for generating test sets that detect multiple faults in generalized counters have been presented. It is hoped that the techniques presented in this paper will open new avenues to similar test generation problems for circuits with complex interconnection structures.

REFERENCES

[1] J. A. Abraham and D. D. Gajski, "Design of testable structures defined by simple loops," IEEE Trans. Comput., vol. C-30, no. 11, pp. 875-883, Nov. 1981.

[2] A. Chatterjee and J. A. Abraham, "On the C-testability of generalized counters," IEEE Trans. Comput.-Aided Design, vol. CAD-6, no. 5, pp. 713-726, Sept. 1987.

[3] A. Vergis, "Linear-testable counters for multiple faults," in Proc. Int. Conf. Comput. Aided Design, Nov. 1987, pp. 156-159.

[4] S. C. Seth and K. L. Kodandapani, "Diagnosis of faults in linear tree networks," IEEE Trans. Comput., vol. C-26, pp. 29-33, Jan. 1977.

[5] W. T. Cheng, "Testing and error detection in iterative logic arrays," Ph.D. dissertation, 1985.

[6] K. Hwang, Computer Arithmetic, Principles, Architecture and Design. New York: Wiley, 1979.

[7] A. Chatterjee and J. A. Abraham, "C-Testability of generalized tree structures with applications to Wallace trees and other circuits," in Proc. Int. Conf. Comput.-Aided Design, Nov. 1986, pp. 288-291.

Fast Multiplication Without Carry-Propagate Addition

Miloš D. Ercegovac and Tomas Lang

Abstract- Conventional schemes for fast multiplication accumulate the partial products in redundant form (carry-save or signed-digit) and convert the result to conventional representation in the last step. This step requires a carry-propagate adder which is comparatively slow and occupies a significant area of the chip in a VLSI implementation. In this paper, we report a multiplication scheme (LRCF: left-to-right, carry-free) that does not require this carry-propagate step. The LRCF scheme performs the multiplication most-significant bit first and produces a conventional sign-and-magnitude product (most significant n bits) by means of an on-the-fly conversion. The resulting implementation is fast and regular and is very well suited for VLSI. The LRCF scheme for general radix r and a radix-4 signed-digit implementation are presented.

Index Terms- Carry-save multiplier, digital arithmetic, left-to-right multiplier, multiplication, signed-digit multiplier, signed-digit representation, VLSI implementation.

Manuscript received October 20, 1987; revised March 1, 1989.
The authors are with the Department of Computer Science, the University of California, Los Angeles, CA 90024.
IEEE Log Number 9035139.
0018-9340/90/1100-1385 $01.00 © 1990 IEEE

I. INTRODUCTION

All common schemes for multiplication, sequential as well as combinational, require a carry-propagate addition in the final step (stage) to obtain the 2n-bit product [1]. In sequential right-to-left n-bit multipliers, the least significant half of the product is obtained without the use of a carry-propagate adder (CPA). The most significant half, however, requires an n-bit CPA to complete the operation. Similarly, in multipliers using linear arrays of carry-save adders, the least significant half is generated during the reduction process. To obtain the most significant half, an n-bit CPA is used. In the case of tree-type multipliers, a CPA of more than n bits is required. For example, a Dadda or a Wallace multiplier with s reduction stages requires an adder of 2n bits. Such a carry-propagate adder is comparatively slow and occupies a significant area of the chip in a VLSI implementation [2].

In this paper, we report a novel multiplication scheme that produces the most-significant half of the product without the carry-propagate step. The basic characteristics of the proposed LRCF (left-to-right, carry-free) scheme are:

1) The recurrence uses the digits of the multiplier from most to least significant (left-to-right multiplication) [3]. The multiplier can be recoded into a suitable radix-r representation to reduce the number of steps [4].

2) The accumulated partial products are decomposed into two parts: the most significant part and the least significant part.

3) To produce the product, the most significant portion of the accumulated partial products is converted to conventional form using a variation of the on-the-fly algorithm presented in [5], without the need of carry-propagate addition.

The LRCF scheme can be used both for sequential and combinational implementations. We concentrate here on the combinational case, since it provides the greatest speed advantage. The resulting implementation is fast and regular and is very well suited for VLSI.

II. THE LRCF MULTIPLICATION ALGORITHM

Consider multiplication of normalized fractions in the sign-and-magnitude representation. Let $X$ be the radix-2 representation of the normalized fractional magnitude $x$, such that

$$x = \sum_{i=1}^{n} x_i 2^{-i}, \qquad x_i \in \{0, 1\}$$

and let $Y$ be the recoded radix-$r$ representation of the normalized fractional magnitude $y$, such that

$$y = \sum_{i=0}^{n/q} Y_i r^{-i}, \qquad Y_i \in \{-r/2, \ldots, r/2\} \quad \text{(minimally redundant)}$$

where, for simplicity, $r = 2^q$.

The LRCF multiplication algorithm is a recurrence that produces a sequence of two accumulated partial products ($w$ and $p$) as follows:

$$w[j] = r\,(\mathrm{fraction}(w[j-1] + xY_j)), \qquad j = 0, \ldots, n/q \tag{2.2}$$

$$Z_j = \mathrm{integer}(w[j-1] + xY_j) \tag{2.3}$$

and

$$p[j] = p[j-1] + Z_j r^{-j}. \tag{2.4}$$

The initial values are $w[-1] = p[-1] = 0$. The algorithm uses the digits of the multiplier from most significant to least significant [3], unlike conventional multiplication schemes, which use the digits from least significant to most significant.

To show that the LRCF algorithm performs multiplication, observe that the sum of partial products after $k$ steps satisfies

$$p[k] + w[k] \times r^{-k-1} = x \times \sum_{i=0}^{k} Y_i \times r^{-i}. \tag{2.5}$$

Consequently, after $n/q$ steps we obtain

$$p[n/q] + w[n/q] \times r^{-n/q-1} = x \times \sum_{i=0}^{n/q} Y_i \times r^{-i} = xy. \tag{2.6}$$

That is, $p[n/q]$ is the most significant part of the product while $w[n/q]$ is the least significant part.
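As a quick sanity check of the recurrence (2.2)-(2.4) and the identity (2.6), the following Python sketch evaluates the recurrence with exact rational arithmetic for r = 4. It is only an illustration under stated assumptions: the helper names `recode_radix4` and `lrcf_recurrence` are hypothetical, the multiplier is recoded with standard radix-4 Booth recoding (one possible recoding into {-2, ..., 2}, not necessarily the recoder used by the authors), the operand values are arbitrary, and integer(.) and fraction(.) are modeled with floor on exact non-redundant values, whereas the scheme in the text relies on a redundant adder.

```python
from fractions import Fraction
import math


def recode_radix4(bits):
    """Radix-4 Booth recoding of the fraction 0.y1 y2 ... yn (n even).

    Returns digits Y_0..Y_{n/2}, each in {-2, ..., 2}, such that
    y = sum(Y_i * 4**-i). Assumption: standard Booth recoding.
    """
    if len(bits) % 2:
        raise ValueError("expected an even number of bits")
    b = list(bits) + [0]          # b[i] holds bit y_{i+1}; pad with y_{n+1} = 0
    digits = [b[0]]               # Y_0 = y_1
    for j in range(1, len(bits) // 2 + 1):
        # Y_j = -2*y_{2j-1} + y_{2j} + y_{2j+1}
        digits.append(-2 * b[2 * j - 2] + b[2 * j - 1] + b[2 * j])
    return digits


def lrcf_recurrence(x, Y, r):
    """Run (2.2)-(2.4) exactly; returns (p, w, Z) after the last step."""
    w = Fraction(0)               # w[-1] = 0
    p = Fraction(0)               # p[-1] = 0
    Z = []
    for j, Yj in enumerate(Y):
        s = w + x * Yj            # w[j-1] + x*Y_j
        Zj = math.floor(s)        # integer part (floor is a modeling assumption)
        w = r * (s - Zj)          # (2.2): r * fraction(...)
        p += Fraction(Zj, r ** j) # (2.4): p[j] = p[j-1] + Z_j * r**-j
        Z.append(Zj)
    return p, w, Z


# Hypothetical operands: x = y = 0.11010111 in binary = 215/256.
x = Fraction(215, 256)
Y = recode_radix4([1, 1, 0, 1, 0, 1, 1, 1])
y = sum(Fraction(d, 4 ** i) for i, d in enumerate(Y))
p, w, Z = lrcf_recurrence(x, Y, 4)
k = len(Y) - 1                    # index of the last step, n/q

# Identity (2.6): p[k] + w[k] * r**-(k+1) equals the full product x*y.
assert p + w * Fraction(1, 4 ** (k + 1)) == x * y
print("Y =", Y, "Z =", Z, "p =", p, "w =", w)
```

For these operands the sketch produces a digit Z_j equal to 5, which exceeds r - 1 = 3 and illustrates why the accumulated digits cannot simply be concatenated without a further recoding step.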

A block diagram of one step of the recurrence is shown in Fig. 1. A fast implementation requires the following.

1) Use of a redundant adder (either carry-save or signed-digit [7]) to produce $w[j]$. This results in a carry-free addition.

2) Addition by concatenation to produce $p[j]$. That is,

$$p[j] = \mathrm{concat}(p[j-1], P_j). \tag{2.7}$$

Since the maximum value of $Z_j$ in (2.3) is, in general, larger than $r$, to perform this concatenation it is necessary to recode $Z_j$ and $Z_{j+1}$ into $P_j$ in the range $[-(r-1), (r-1)]$. That is,

$$P_j = F(Z_j, Z_{j+1}), \qquad P_j \in [-(r-1), (r-1)]. \tag{2.8}$$

The details of the recoding depend on the range of $Z_j$, which is dependent on the type of redundant adder used to produce $w[j]$.

3) The on-the-fly conversion of the resulting signed-digit representation of $p[j]$ into a conventional representation $M[j]$. An algorithm to perform this conversion is given in [5]. In contrast with the traditional approach that performs the conversion by subtracting the negative part from the positive part (and therefore uses a carry-propagate adder), this on-the-fly scheme forms two numbers $a[j]$ and $b[j]$ such that $a[j]$ is the converted number up to digit $j$ and $b[j]$ is $a[j] - r^{-j}$. When a new digit $P_j$ is produced, if it is positive or zero it is concatenated to $A[j-1]$ (digit-vector representing $a[j-1]$), while if it is negative $A[j]$ is obtained by concatenating $r - |P_j|$ to $B[j-1]$. That is,

$$A[j] = \begin{cases} \mathrm{concat}(A[j-1],\ P_j) & \text{if } P_j \ge 0 \\ \mathrm{concat}(B[j-1],\ (r - |P_j|)) & \text{if } P_j < 0 \end{cases}$$

$$B[j] = \begin{cases} \mathrm{concat}(A[j-1],\ (P_j - 1)) & \text{if } P_j > 0 \\ \mathrm{concat}(B[j-1],\ ((r - 1) - |P_j|)) & \text{if } P_j \le 0 \end{cases} \tag{2.9}$$

with the initial condition

$$A[-1] = B[-1] = 0. \tag{2.10}$$

After the last step, the most significant half of the product in conventional representation is

$$M[n/q] = A[n/q].$$
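The two-vector update of (2.9) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware: the function name `on_the_fly_ab` is hypothetical, digit vectors are modeled as Python lists (empty lists stand in for the initial condition (2.10)), the digit string is an arbitrary test vector, and the represented value is assumed to be nonnegative so that the final A vector is a valid conventional number.

```python
def on_the_fly_ab(P, r):
    """Convert signed digits P (most significant first, |P_j| <= r-1)
    into conventional radix-r digits using the A/B update of (2.9).

    Assumption: the value represented by P is nonnegative.
    """
    A, B = [], []                      # stand-ins for A[-1] = B[-1] = 0
    for d in P:
        # New A: append d to A if d >= 0, else append r-|d| to B.
        new_A = A + [d] if d >= 0 else B + [r - abs(d)]
        # New B (= A minus one unit in the last place):
        # append d-1 to A if d > 0, else append (r-1)-|d| to B.
        new_B = A + [d - 1] if d > 0 else B + [(r - 1) - abs(d)]
        A, B = new_A, new_B
    return A


def value(digits, r):
    """Integer value of a digit list, most significant digit first."""
    v = 0
    for d in digits:
        v = v * r + d
    return v


# Hypothetical radix-4 signed-digit string containing one negative digit.
P = [2, 0, 0, 3, 0, 0, -3, 3]
M = on_the_fly_ab(P, 4)
assert value(M, 4) == value(P, 4)      # same value, conventional digits
assert all(0 <= d <= 3 for d in M)
print(M)                               # [2, 0, 0, 2, 3, 3, 1, 3]
```

Each step only appends one digit to a copy of A or B, so no carry or borrow ever ripples through the digits already produced; the cost is that both candidate vectors must be kept.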

The step of Fig. 1 can be used to implement either a sequential multiplier or a combinational one. The basic scheme for the combinational case is illustrated in Fig. 2 together with the corresponding conventional right-to-left approach. The fundamental difference is shown in the timing diagram of Fig. 2(c): the two phases of multiplication (generation of partial products and formation of the final result in a conventional representation) are performed concurrently in the LRCF scheme and one after the other in conventional schemes.

Fig. 1. One step of the LRCF multiplication scheme.

Since the result of the multiplication corresponds to the most-significant half of the product, an error is made with respect to the correct product. Several rounding schemes can be used to bound this error [8]. In the LRCF scheme, since the least-significant half of the product is left in redundant form, the error is somewhat larger than in the conventional right-to-left scheme. The actual range of the error depends on the type of redundant adder used, and several solutions are discussed in [6].

III. RADIX-4 IMPLEMENTATION WITH SIGNED-DIGIT ADDITION

The presented combinational implementation of the LRCF scheme for a radix-4 multiplication unit uses a linear array of signed-digit adders [7] for the computation of $w$. An implementation using carry-save adders is discussed in [6], [15]. For radix 4 the recurrences are

$$w[j] = 4\,(\mathrm{fraction}(w[j-1] + xY_j)), \qquad j = 0, \ldots, n/2, \quad w[-1] = 0, \quad Y_j \in \{-2, -1, 0, 1, 2\} \tag{3.1}$$

$$Z_j = \mathrm{integer}(w[j-1] + xY_j) \tag{3.2}$$

and

$$p[j] = p[j-1] + Z_j 4^{-j} = \mathrm{concat}(p[j-1], P_j), \qquad p[-1] = 0 \tag{3.3}$$

where $P_j$ is obtained from $Z_j$ and $Z_{j+1}$ by signed-digit addition [7] as described next.

To determine the range of $Z_j$ observe that

$$Z_j = \mathrm{integer}(w[j-1] + xY_j) = \mathrm{integer}(w[j-1]) + \mathrm{integer}(xY_j) + t_0 \tag{3.4}$$

where $t_0 \in \{-1, 0, 1\}$ is a transfer digit resulting from the signed-digit addition of the fractions of $w[j-1]$ and $xY_j$. Moreover, the most significant digit of the fraction of the signed-digit sum

$$w[j-2] + xY_{j-1} \tag{3.5}$$

is in the range $\{-3, \ldots, 3\}$. Therefore,

$$\mathrm{integer}(w[j-1]) = \mathrm{integer}(4 \times \mathrm{fraction}(w[j-2] + xY_{j-1})) \tag{3.6}$$

is in the range $\{-3, \ldots, 3\}$. Since $|x| < 1$ is a fraction in conventional form and the recoded multiplier digits $Y_j$'s are in the range $\{-2, \ldots, 2\}$, $\mathrm{integer}(xY_j)$ is in the range $\{-1, 0, 1\}$. Consequently, the range of $Z_j$ is $[-5, 5]$ as illustrated in Fig. 3.

Fig. 2. (a) LRCF combinational multiplier. (b) Conventional combinational multiplier. (c) Timing comparison between LRCF and conventional multiplication schemes.

Fig. 3. Range of Z digits.

The implementation consists of three parts as follows:

1) A linear array of signed-digit adders to compute the $w$'s. This array includes the recoder of $Y$ and the selection of the multiples of the multiplicand.

2) A linear array of modules TS to compute the partial products $p[j]$'s [Fig. 4(b)] from left to right. In the partial product computation (3.3), a direct addition of $Z$ would require carry propagation. To perform these additions without carry propagation, $Z_j$ (with range $[-5, 5]$) is recoded into $t_{j-1}$ and $s_j$ such that

$$Z_j = 4t_{j-1} + s_j \tag{3.7}$$

with

$$|t_{j-1}| \le 1 \quad \text{and} \quad |s_j| \le 2. \tag{3.8}$$

A possible recoding of $Z_j$ is

    Z_j       -5  -4  -3  -2  -1   0   1   2   3   4   5
    t_{j-1}   -1  -1  -1   0   0   0   0   0   1   1   1
    s_j       -1   0   1  -2  -1   0   1   2  -1   0   1

Then $P_j = t_j + s_j$ is the $j$th product digit. Because of the ranges (3.8), $P_j \in \{-3, \ldots, 3\}$. Since

$$p[j] = \sum_{i=0}^{j} P_i 4^{-i}, \qquad P_i \in \{-3, \ldots, 3\} \tag{3.9}$$

the partial products $p[j]$'s are computed by concatenation

$$p[j] = \mathrm{concat}(p[j-1], P_j) \tag{3.10}$$

thus avoiding carry propagation.

3) The signed-digit representation of the product $p[n/2]$ is converted to conventional representation $M[n/2]$ using a combinational variation of the on-the-fly algorithm [5], [10], as shown in Fig. 4(b). Instead of using the two conditional forms $A$ and $B$, described in Section II, we keep only $A$ together with control signals $D_k[j]$ associated with each $A_k$ (digit of $A$), to determine whether the final digit $M_k$ is $A_k$ or $(A_k - 1) \bmod 4$. The meaning of the control signals is given in Table I.

Fig. 4. (a) Linear array of signed-digit adders for computing $w$'s and $Z$'s. (b) Implementation of result generation: a segment.

TABLE I
CONVERSION CONTROL SIGNALS

    D_k[j]   Decision at level j about value of product digit M_k[n/2]
    u        undecided
    n        decided: no change (M_k[n/2] = A_k)
    d        decided: decrement (M_k[n/2] = (A_k - 1) mod 4)

TABLE II
EXAMPLE OF ON-THE-FLY CONVERSION

    k         1   2   3   4   5   6   7   8
    P_k       2   0   0   3   0   0  -3   3
    A_k       2   0   0   3   0   0   1   3
    D_k[1]    u
    D_k[2]    u   u
    D_k[3]    u   u   u
    D_k[4]    n   n   n   u
    D_k[5]    n   n   n   u   u
    D_k[6]    n   n   n   u   u   u
    D_k[7]    n   n   n   d   d   d   u
    D_k[8]    n   n   n   d   d   d   n   u
    M_k       2   0   0   2   3   3   1   3
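The recoding step of part 2), equations (3.7)-(3.10), can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the TS module itself: the names `recode_z` and `product_digits` are hypothetical, the threshold rule (transfer of magnitude 1 whenever |Z_j| >= 3) is one recoding that satisfies (3.7) and (3.8), and the sample digit list is an arbitrary test vector.

```python
from fractions import Fraction


def recode_z(z):
    """Split Z_j in [-5, 5] into (t, s) with Z_j = 4*t + s,
    |t| <= 1 and |s| <= 2, as required by (3.7) and (3.8)."""
    if not -5 <= z <= 5:
        raise ValueError("Z_j out of range")
    t = 1 if z >= 3 else (-1 if z <= -3 else 0)
    return t, z - 4 * t


def product_digits(Z):
    """Form P_j = t_j + s_j, where t_j is the transfer produced by Z_{j+1}.

    Returns (t_out, P): t_out is the transfer out of the leading position
    (t_{-1}), and P is the list of product digits, each in {-3, ..., 3}.
    """
    pairs = [recode_z(z) for z in Z]           # pairs[j] = (t_{j-1}, s_j)
    P = []
    for j, (_, s) in enumerate(pairs):
        t_next = pairs[j + 1][0] if j + 1 < len(pairs) else 0
        P.append(s + t_next)
    return pairs[0][0], P


# Every value in the range satisfies (3.7) and (3.8).
for z in range(-5, 6):
    t, s = recode_z(z)
    assert z == 4 * t + s and abs(t) <= 1 and abs(s) <= 2

# Hypothetical accumulated digits; the digit 5 cannot be concatenated directly.
Z = [0, 2, 2, 5, 0]
t_out, P = product_digits(Z)
assert all(-3 <= d <= 3 for d in P)

# The recoded digits represent the same value as sum(Z_j * 4**-j).
lhs = sum(Fraction(z, 4 ** j) for j, z in enumerate(Z))
rhs = 4 * t_out + sum(Fraction(d, 4 ** j) for j, d in enumerate(P))
assert lhs == rhs
print(P)                                       # [0, 2, 3, 1, 0]
```

Because each product digit depends only on Z_j and Z_{j+1}, the transfer never travels more than one position, which is what allows the digits to be concatenated without carry propagation.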

A high-level description of the conversion process is

$$A_j = \begin{cases} P_j & \text{if } P_j \ge 0 \\ 4 - |P_j| & \text{if } P_j < 0 \end{cases} \tag{3.11}$$

Initially,

$$D_j[j] = u. \tag{3.12}$$

For $k < j$,

$$D_k[j] = \begin{cases} u & \text{if } D_k[j-1] = u \text{ and } P_j = 0 \\ n & \text{if } D_k[j-1] = u \text{ and } P_j > 0 \\ d & \text{if } D_k[j-1] = u \text{ and } P_j < 0 \\ D_k[j-1] & \text{otherwise.} \end{cases} \tag{3.13}$$

Consequently, $D_k[j+1]$ depends on $D_k[j]$ and on the zero and sign signals of the incoming digit, defined as

$$\mathrm{zero}(P_j) = \begin{cases} 1 & \text{if } P_j = 0 \\ 0 & \text{otherwise} \end{cases} \qquad \mathrm{sign}(P_j) = \begin{cases} 1 & \text{if } P_j < 0 \\ 0 & \text{otherwise.} \end{cases} \tag{3.14}$$

The on-the-fly conversion of $P = 2\,0\,0\,3\,0\,0\,\bar{3}\,3$ into $M = 20023313$ is illustrated in Table II.

Consequently, the conversion part is composed of

1) the modules A that generate $A_j$ according to (3.11),

2) the modules D that update $D_k[j]$ according to (3.13), and

3) the modules DEC that decrement $A_k$ (modulo 4) if $D_k[n/2] = d$.

Bit-Level Implementation and Comparison to Conventional Schemes: The modules and their connections are indicated in Fig. 4(a) and (b). Their bit-level implementation is discussed in [6]. The LRCF and a conventional scheme are similar regarding the following parts: the binary-to-radix-4 multiplier recoder, the selection of multiples of the multiplicand ($\pm 2x$, $\pm x$, $0$), and the array of redundant adders for the accumulation of partial products. These adders are composed of signed-digit adder modules [12]-[14] which, according to [12], are similar in area and delay to conventional full adders. The principal difference is in producing the final product in conventional representation from a sum of partial products in redundant form: the LRCF scheme uses an on-the-fly converter while in a conventional scheme a CPA is required.
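The single-vector conversion with control signals (modules A, D, and DEC) can be sketched as follows. This is an illustrative software model under assumptions, not the combinational circuit of Fig. 4(b): the function name `convert_with_control` is hypothetical, the update rule encodes the reading of Table I in which an undecided digit becomes "n" when a later positive digit arrives and "d" when a later negative digit arrives, a digit still undecided at the end is left unchanged, and the represented value is assumed to be nonnegative.

```python
def convert_with_control(P, r=4):
    """Convert signed digits P (most significant first) to conventional
    radix-r digits, keeping only the vector A plus one control signal
    per digit: 'u' (undecided), 'n' (no change), 'd' (decrement).

    Returns (A, D, M). Assumption: the value of P is nonnegative.
    """
    A, D = [], []
    for digit in P:
        # Modules D: an undecided earlier digit is resolved by the first
        # nonzero digit that follows it; a zero leaves it undecided.
        for k in range(len(D)):
            if D[k] == "u" and digit > 0:
                D[k] = "n"
            elif D[k] == "u" and digit < 0:
                D[k] = "d"
        # Modules A: store the digit, or its radix complement if negative.
        A.append(digit if digit >= 0 else r - abs(digit))
        D.append("u")                  # the newest digit starts undecided
    # Modules DEC: decrement modulo r where the decision is 'd'.
    M = [(a - 1) % r if s == "d" else a for a, s in zip(A, D)]
    return A, D, M


# Hypothetical radix-4 test vector with one negative digit.
A, D, M = convert_with_control([2, 0, 0, 3, 0, 0, -3, 3])
print(A)   # [2, 0, 0, 3, 0, 0, 1, 3]
print(D)   # ['n', 'n', 'n', 'd', 'd', 'd', 'n', 'u']
print(M)   # [2, 0, 0, 2, 3, 3, 1, 3]
```

Compared with keeping both A and B, this variant stores one digit vector and a small control state per digit, at the cost of a final modulo-r decrement on the digits marked "d".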

Since a product of n bits is to be computed, those digits of the array that do not influence the result can be eliminated. For a radix-4 recoding of the multiplier, the first half of the array is of full precision and from then on, the number of radix-4 adders decreases by one per level. A similar reduction in size of the adder array can be achieved in a conventional right-to-left multiplier, as done for example in the Cray X-MP processor [9], in which case the error in multiplication is similar to that produced by the LRCF scheme.

Delay of the Scheme and Comparison to Conventional Schemes: The delay of the generation of the product is composed of the following:

1) Recoding and forming the multiples of the multiplicand.

2) Delay to obtain the last partial product $w[n/2]$ in the signed-digit adder array. This corresponds to $(n/2) + 1$ signed-digit adders.

3) Delay to produce the last zero($P_j$) and sign($P_j$) signals.

4) Delay to determine the value of the last $D$'s.

5) Delay of digit decrementing.

In comparison, in a conventional right-to-left multiplier the delay corresponds to 1) and 2) above plus a carry-propagate addition. Consequently, the scheme presented here is faster by the difference in delay between the carry-propagate adder and the sum of the delays 3) to 5) above. Since the CPA delay is at best $O(\log_2 n)$ logic levels and steps 3)-5) can be implemented in a couple of levels, this difference is significant, especially for larger operand precision. To reduce the total delay, the last step of the LRCF scheme can be optimized for speed.

To illustrate the implementation we show in Fig. 5 an example of an 8 x 8 bit multiplication. The additions are performed in radix-4 signed-digit.

Fig. 5. Example of LRCF multiplication.

IV. SUMMARY

The reported multiplication scheme (LRCF) eliminates the need for a carry-propagate adder. The scheme performs the multiplication most-significant digit first and produces a conventional sign-and-magnitude product (most significant half) by means of an on-the-fly conversion, performed concurrently with the generation of accumulated (redundant) partial products. The scheme is presented for general radix r and a radix-4 signed-digit implementation is described. We estimate that, for a multiplier of 64 bits, the scheme we described produces a reduction of about ten gate levels with respect to a conventional scheme using a carry-look-ahead adder. The speed can be improved by increasing the radix. In [6], we present a radix-16 implementation in which odd and even partial products [11] are computed concurrently.

REFERENCES

[1] K. Hwang, Computer Arithmetic. New York: Wiley, 1978.

[2] M. Uya, K. Kaneko, and J. Yasui, "A CMOS floating-point multiplier," IEEE J. Solid-State Circuits, vol. SC-19, no. 5, pp. 697-701, Oct. 1984.

[3] A. Avizienis, "On a flexible implementation of digital computer arithmetic," in Information Processing 1962, C. M. Popplewell, Ed. New York: North Holland, 1963, pp. 664-670.

[4] A. D. Booth, "A signed binary multiplication technique," Quart. J. Mech. Appl. Math., vol. 4, part 2, pp. 236-240, 1951.

[5] M. D. Ercegovac and T. Lang, "On-the-fly conversion of redundant into conventional representations," IEEE Trans. Comput., vol. C-36, no. 7, pp. 895-897, July 1987.

[6] M. D. Ercegovac and T. Lang, "Fast multiplication without carry-propagate addition," UCLA Comput. Sci. Dep. Rep., 1986.

[7] A. Avizienis, "Signed-digit number representation for fast parallel arithmetic," IEEE Trans. Electron. Comput., vol. EC-10, pp. 389-400, Sept. 1961.

[8] J. T. Coonen, "An implementation guide to a proposed standard for floating-point arithmetic," IEEE Comput. Mag., pp. 68-79, Jan. 1980.

[9] Anon., "Cray X-MP Computer Systems," Four-Processor Mainframe Reference Manual, HR-0097, Cray Research, Inc., 1985.

[10] M. D. Ercegovac and T. Lang, "Alternative on-the-fly conversion of redundant into conventional representations," UCLA Comput. Sci. Dep. Rep. CSD-860027, Nov. 1986.

[11] J. Iwamura et al., "A 16-bit CMOS/SOS multiplier-accumulator," in Proc. ICCC82, 1982, pp. 151-154.

[12] S. Kuninobu et al., "Design of high-speed MOS multiplier and divider using redundant binary representation," in Proc. 8th Symp. Comput. Arithmet., 1987, pp. 80-86.

[13] Y. Harata et al., "High-speed multiplier using a redundant binary adder tree," in Proc. 1984 IEEE Int. Conf. Comput. Design, 1984, pp. 165-170.

[14] J. E. Robertson, "A systematic approach to the design of structures for arithmetic," in Proc. 5th Symp. Comput. Arithmet., 1981.

[15] M. D. Ercegovac and T. Lang, "Radix-4 multiplication without carry-propagate addition," in Proc. IEEE Int. Conf. Comput. Design: VLSI Comput. Processors, Oct. 5-8, 1987, pp. 654-658.

Fast, Deterministic Routing, on Hypercubes, Using Small Buffers

Bradley C. Kuszmaul

Abstract- We propose a deterministic routing scheme for a communications network based on the k-dimensional hypercube. We present two formulations of the scheme. The first formulation delivers messages in $O(k^2)$ bit times using $O(k)$ bits of buffer space at each node in the hypercube. The second formulation assumes that there are several batches of messages to be delivered, and makes certain assumptions about the cost of sending messages along the various dimensions of the cube. In this case, the latency for delivery time is still $O(k^2)$ bit times, but the throughput is increased to one set of messages every $O(k)$ bit times. For the first formulation, we restrict ourselves to routings which are subsets of permutations (i.e., every node sends at most one message and receives at most one message). The second formulation indicates a way to perform routings which are subsets of H-permutations (i.e., every node sends at most H messages and receives at most H messages).

Index Terms- Buffers, complexity theory, deterministic routing, hypercubes, interconnection networks, parallel processing, routing.

    I. INTRODUCTION

Several routing schemes based on the hypercube have been proposed [7], [5], [15], [17], [12]. We discuss hypercubes with $k$ dimensions and $2^k = N$ vertices (which we call nodes). A nondeterministic $O(k^2)$ bit time algorithm with $O(k^2)$ bits of storage at each node is described in [17]. In this paper, we describe a deterministic $O(k^2)$ bit time algorithm with $O(k)$ bits of storage at each node. We go on to describe an alternative deterministic algorithm, based on a slightly modified network, with $O(k^2)$ bit time latency for messages traveling through the network, $O(k)$ throughput (i.e., one message every $O(k)$ bit times), and $O(k^2)$ bits of storage at each node.

When describing hypercube networks we define a node to be a vertex on the hypercube. When describing multiprocessor computer systems, we define a processor to be the hardware which sends and receives messages. In some computer systems (e.g., the connection machine [9]), the processors are associated with the nodes of the hypercube routing network.

In general, we assume that messages are at least $O(k)$ bits long (because, for example, it should be possible to transmit a node address in a message). This gives a lower bound for routing of $O(k)$ bit times.

Manuscript received October 6, 1987; revised January 28, 1988.
The author is with the Massachusetts Institute of Technology, Cambridge, MA 02139.
IEEE Log Number 9035138.
0018-9340/90/1100-1390 $01.00 © 1990 IEEE