performance and implementatio of 4x4n switching …hj/conferences/38.pdf · 2014. 2. 4. · robert...

PERFORMANCE AND IMPLEMENTATION OF 4x4 SWITCHING NODES IN AN INTERCONNECTION NETWORK FOR PASM

Robert J. McMillen, George B. Adams III, and Howard Jay Siegel
School of Electrical Engineering, Purdue University
West Lafayette, IN 47907

Abstract

Design issues for the multistage Generalized Cube network are discussed in this paper. An analysis of the merits of 2-input/2-output interchange boxes versus 4-input/4-output crossbars for interconnection network implementation is made. The cost and performance of each network for the two switching node alternatives are examined. Discussion of the suitability of each approach for VLSI implementation is included. It is shown that in a packet switching environment, 4x4 crossbars outperform, and are less expensive to implement than, the four interchange boxes they replace.

I. INTRODUCTION

The choice of interconnection network is a central issue in the design of large-scale, multimicroprocessor-based distributed and parallel systems. The Ballistic Missile Defense (BMD) Agency is designing a test bed for evaluating such systems as they may apply to BMD tasks [8]. PASM is a multimicroprocessor system being designed at Purdue University for a variety of image processing and pattern recognition problems [16]. In both cases a highly flexible network is needed for communication among processors and memories.

The Generalized Cube network has a cube-type topology and is constructed from 2-input/2-output crossbars or interchange boxes [17]. A more general form of interchange box is an a-input/a-output (a x a) switching node. A relative of the Generalized Cube network can be constructed from a x a switching nodes using cube-type connections between stages. Many papers in the literature discuss using larger than 2x2 interchange boxes for implementing multistage cube-type networks [2, 7, 10, 11, 12, 18]. In the following, design options for 4x4 switching nodes are considered. The performances of two designs are evaluated and their implementation in discrete logic (e.g., TTL) and VLSI is considered. It will be shown that a 4x4 crossbar performs better and costs less than four 2x2 crossbars in a packet switching environment.

The logical structure of the Generalized Cube network is defined in Section II to provide a framework for discussing modifications. In Section III, the performance of two network implementations is compared. Implementation considerations are presented in Section IV. For further details of all this material see [14].

Figure 1: (a) Generalized Cube topology for N=8. (b) Four states of an interchange box.

This work was supported by the Ballistic Missile Defense Agency under grant number DASG60-80-C-0022 and the Air Force Office of Scientific Research, Air Force Systems Command, USAF, under AFOSR-78-3581. The United States Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. The views, opinions, and/or findings contained in this report are those of the author(s) and should not be construed as an official Department of the Army position, policy, or decision, unless so designated by other official documentation.

II. DEFINITIONS

A partitionable SIMD/MIMD system is a parallel processing system which can be structured as one or more independent SIMD and/or MIMD machines [4] of varying sizes. PASM is a partitionable SIMD/MIMD system for image processing and pattern recognition [16]. The BMD testbed should have the flexibility to perform as a partitionable SIMD/MIMD machine. The cube network described here can function efficiently in such an environment.

The Generalized Cube network (Fig. 1) is a multistage cube-type network topology which was introduced in [17]. It has been shown that this topology is equivalent to that used by the omega [7], indirect binary n-cube [11], STARAN [1], and SW-banyan (F=S=2) [6] networks [17, 20]. An N input/output Generalized Cube topology has $m = \log_2 N$ stages, where each stage consists of a set of N lines connected to N/2 interchange boxes. Each interchange box is a 2-input/2-output device. The labels of the input/output (I/O) lines entering the upper and lower inputs of an interchange box are used as the labels for the upper and lower outputs, respectively. The labels are the integers from 0 to N-1. Each interchange box can be set to one of four states as shown in Fig. 1. The connections in this network are based on the cube interconnection functions [13]. Stage i of the Generalized Cube topology pairs I/O lines that differ only in the i-th bit position.
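The pairing rule for stage i can be illustrated with a short Python sketch. The function names are hypothetical, and two details are assumptions rather than statements from the text: that the line with a 0 in bit i enters the upper input of its box, and that stages are listed from m-1 down to 0.

```python
def stage_pairs(n_lines: int, stage: int) -> list[tuple[int, int]]:
    """Return the (upper, lower) line labels that share a box in one stage.

    Two lines share an interchange box when their labels differ only in
    bit position `stage`. Assumption: the label with a 0 in that bit is
    the upper input.
    """
    bit = 1 << stage
    return [(line, line | bit) for line in range(n_lines) if not line & bit]


def generalized_cube(n_lines: int) -> dict[int, list[tuple[int, int]]]:
    """Build the box pairings for every stage of an N-line topology."""
    m = n_lines.bit_length() - 1
    if n_lines < 2 or n_lines != 1 << m:
        raise ValueError("N must be a power of two and at least 2")
    # Assumption: stages are listed from m-1 down to 0.
    return {stage: stage_pairs(n_lines, stage) for stage in range(m - 1, -1, -1)}


if __name__ == "__main__":
    for stage, boxes in generalized_cube(8).items():
        print(f"stage {stage}: {boxes}")
```

For N=8 this gives three stages of four boxes each, which matches the m = log2 N stages and N/2 boxes per stage described above.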

The name cube network will be used to refer to the network consisting of the Generalized Cube topology and four-state interchange boxes. Each interchange box will be controlled independently through the use of routing tags [7, 15].

0190-3918/81/0000/0229$00.75 © 1981 IEEE

It is assumed that processors and memories are paired to form processing elements (PEs). The network is configured such that PE i is connected to input i and output i, $0 \le i < N$. The packet switching mode, in which packets move from stage to stage in the network as paths between stages become available, is assumed. Packets do not require that their entire path be established prior to entering the network. A packet consists of a routing tag and a number of data items. Packet switching in multistage networks has been discussed in [3, 19].

The primary goal here is to investigate the cost-effectiveness of constructing multistage cube networks from 4x4 crossbars versus 2x2 crossbars (interchange boxes). Since a single 2x2 interchange box is not functionally comparable to a 4x4 crossbar (i.e., it can only handle two items at a time instead of four), the 4x4 crossbar is compared with a 4x4 composition of four 2x2 interchange boxes. This configuration is called a composite node and is shown in Fig. 2. A network constructed from properly connected (to be specified later) composite nodes is identical to a cube network constructed from 2x2 interchange boxes. The external connections of the crossbar (Fig. 3) are identical to those of the composite node, so it can be directly substituted for a 4x4 composite node.

Many options for the implementation of 2x2 interchange boxes were discussed in [9]. To avoid repetition, one of the configurations discussed in that paper will be assumed here. It is assumed that packet switching is implemented and that an entire packet is transferred between adjacent stages during one network clock cycle. Furthermore, the size of each input queue in a switching node is assumed to be an integral multiple of the packet size. The packet size is thus not restricted to be any particular number of words.

III. PERFORMANCE ANALYSIS

The 4x4 crossbar node and composite node will be compared in their performance at both a local and global level. On a local level, blocking within a node is examined. On the global level, the permuting ability of two networks constructed from the respective 4x4 switching nodes is compared.

Consider the local level. Let level 1 of a composite node be the two interchange boxes connected to the inputs of the node and level 2 be the two interchange boxes connected to the outputs. The composite node can perform 16 permutation connections (each of its four boxes set to either straight or exchange) and the crossbar node can perform all 4! = 24 possible permutation connections.
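The two counts can be reproduced by brute force. The sketch below is illustrative only: it models the composite node as a two-stage cube network on four lines, with level 1 pairing lines (0,2) and (1,3) and level 2 pairing lines (0,1) and (2,3). That wiring is an assumption, chosen to agree with the connection rule given later for 4x4 nodes, and the function name is hypothetical.

```python
from itertools import permutations, product

# Assumed wiring: each entry is the pair of line labels sharing one 2x2 box.
BOXES = ((0, 2), (1, 3),   # level 1: lines differing in bit 1
         (0, 1), (2, 3))   # level 2: lines differing in bit 0


def composite_permutation(*exchange: bool) -> tuple[int, ...]:
    """Return perm where perm[i] is the output reached by input i.

    `exchange` holds one flag per box in BOXES order:
    True = exchange, False = straight.
    """
    carried = [0, 1, 2, 3]  # carried[line] = input label now on that line
    for (a, b), swap in zip(BOXES, exchange):
        if swap:
            carried[a], carried[b] = carried[b], carried[a]
    perm = [0] * 4
    for out, src in enumerate(carried):
        perm[src] = out
    return tuple(perm)


composite = {composite_permutation(*s) for s in product((False, True), repeat=4)}
crossbar = set(permutations(range(4)))

if __name__ == "__main__":
    print(len(composite), len(crossbar), composite < crossbar)
```

Under this model every one of the 16 box settings yields a different permutation, and all of them are among the 24 the crossbar can realize.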

For those permutations where there is no conflict in either node, the messages traverse the composite node in twice the time required by those in the crossbar node due to the two levels of interchange boxes. When conflicts occur in the crossbar node, the delay due to waiting diminishes the speedup achieved.

Consider situations where there are conflicts in a switch node. For this analysis it is assumed that the destination of any message is a uniformly distributed random variable. Also, it is assumed that each message has only one destination (i.e., no broadcasting). Both the composite node and the 4x4 crossbar node have four inputs and four outputs, so there are $4^4 = 256$ distinct patterns in which messages may need to be routed through the boxes. Since the destinations are assumed to be random and uniformly distributed, the distinct data patterns of routing are all equally likely. Assuming four simultaneous inputs is somewhat of a worst case, since in MIMD mode this would be considered heavy loading and in SIMD mode destinations are not random but structured and chosen to avoid conflicts. The node is assumed initially empty.

Figure 2: A 4x4 composite node constructed from four 2x2 interchange boxes.

Figure 3: A 4x4 crossbar node.

Consider the 4x4 crossbar node. Let r be the maximum number of messages desiring any given output of the 4x4 crossbar node. The total time required for all four messages to pass through the node is r. P(r=1) = 24/256, P(r=2) = 180/256, P(r=3) = 48/256, and P(r=4) = 4/256. The expected time to pass all four messages through the crossbar node is given by:

$$\sum_{i=1}^{4} i \, P(r=i) = 2.125 \text{ network clock cycles.}$$

That is, given that four messages arrive at an empty crossbar node simultaneously, on the average it will take 2.125 network clock cycles for the node to empty.
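The distribution of r can be checked by enumerating all 256 destination patterns. This is a minimal sketch under the stated assumptions (four simultaneous messages, independent uniformly distributed destinations, an initially empty node); the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def crossbar_time_distribution(ports: int = 4) -> dict[int, Fraction]:
    """Return P(r = k) for a ports x ports crossbar with all inputs busy.

    r is the largest number of messages requesting the same output, which
    is also the number of cycles needed to empty the node.
    """
    counts: Counter[int] = Counter()
    for dests in product(range(ports), repeat=ports):
        counts[max(Counter(dests).values())] += 1
    total = ports ** ports
    return {r: Fraction(c, total) for r, c in sorted(counts.items())}


dist = crossbar_time_distribution()
expected = sum(r * p for r, p in dist.items())

if __name__ == "__main__":
    for r, p in dist.items():
        print(f"P(r={r}) = {p}")
    print("expected cycles =", float(expected))
```

The enumeration gives 24, 180, 48 and 4 patterns for r = 1, 2, 3 and 4, and an expected time of 17/8 = 2.125 cycles.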

Now consider the composite node. The following notation will be used in the ensuing equations, where i=1 or 2:

P(iU) = P(no conflict level i, upper box) = 1/2;
P(iL) = P(no conflict level i, lower box) = 1/2;
P(iX) = 1/2, where X = U or L; and
P(i) = P(no conflict in level i) = 1/4.

Now consider the probabilities of different amounts of time, t, to pass four input messages through the composite node. The minimum time possible is 2 network clock cycles because there are two levels.
P(t=2) = P(1U) P(1L) P(2U) P(2L) = 1/16.

For a total time of 3 network clock cycles there are 5 cases to consider. First assume no conflicts occur in level 1.
P(t=3, case 1) = P(1) (1-P(2)) = 3/16.

Next, assume exactly one level 1 interchange box has a conflict.
P(t=3, case 2) = [(1-P(1U)) P(1L) + P(1U) (1-P(1L))] P(2X) = 1/4.

For case 3, there is one conflict at each level, but the maximum delay is 3 cycles.
P(t=3, case 3) = [(1-P(1U)) P(1L) + P(1U) (1-P(1L))] (1-P(2X)) (1/2) P(2X) = 1/16.
The first factor is the probability that exactly one box at level 1 has a conflict. The next factor is the probability that the first message from the level 1 box which had a conflict, call this message M, also has a conflict at level 2. The (1/2) is the probability that M will be chosen to pass through the level 2 box first. The last factor is the probability that the two delayed messages do not conflict.

Case 4 assumes that there is a conflict in both level 1 boxes and that both level 2 boxes receive messages (this happens half the time there are two conflicts in level 1).
P(t=3, case 4) = (1/2) (1-P(1U)) (1-P(1L)) = 1/8.

Finally, assume conflict in both level 1 boxes but only one level 2 box receives messages and there is no conflict for either pair that passes through:
P(t=3, case 5) = (1/2) (1-P(1U)) (1-P(1L)) P(2X) P(2X) = 1/32.
The probability that all messages pass through the composite node in 3 network clock cycles is the sum of the five cases:
P(t=3) = 3/16 + 1/4 + 1/16 + 1/8 + 1/32 = 21/32.

For a time of 4, there are four cases to consider. The first case is where there is one conflict at each level. There are two ways to obtain a time of 4 from this situation: (1) the delayed message enters a non-empty queue in level 2 and (2) the delayed message enters an empty queue but conflicts with the other remaining message:
P(t=4, case 1) = [(1-P(1U)) P(1L) + P(1U) (1-P(1L))] [(1/2) (1-P(2X)) + (1/2) (1-P(2X)) (1-P(2X))] = 3/16.

Now assume conflict in both level 1 boxes and that only one level 2 box receives messages (this happens half the time there are two conflicts in level 1). Given this occurs, there are three ways (cases 2, 3, and 4) a time of 4 occurs. In case 2, the first two messages reaching the box in level 2 conflict, but there are no subsequent conflicts:
P(t=4, case 2) = (1/2) (1-P(1U)) (1-P(1L)) (1-P(2X)) P(2X) = 1/32.

In case 3, the first pair of messages do not conflict but the second pair do:
P(t=4, case 3) = (1/2) (1-P(1U)) (1-P(1L)) P(2X) (1-P(2X)) = 1/32.

In case 4, the first and second pair of messages conflict. When the second pair conflicts, one queue will contain two messages. For a time of 4 the queue with two items must be selected to resolve the second conflict and a third conflict must not occur.
P(t=4, case 4) = (1/2) (1-P(1U)) (1-P(1L)) (1-P(2X)) (1-P(2X)) (1/2) P(2X) = 1/128.

The probability of a time of 4 is:
P(t=4) = 3/16 + 1/32 + 1/32 + 1/128 = 33/128.

The time of 5 happens when either of the two conditions of case 4 for a time of 4 is not met.
P(t=5) = (1/2) (1-P(1U)) (1-P(1L)) (1-P(2X)) [(1/2) (1-P(2X)) + (1/2) (1-P(2X)) (1-P(2X))] = 3/128.

The expected time for all four messages to pass through the composite node is:

$$\sum_{t=2}^{5} t \, P(t) = 2 \cdot \tfrac{1}{16} + 3 \cdot \tfrac{21}{32} + 4 \cdot \tfrac{33}{128} + 5 \cdot \tfrac{3}{128} = \tfrac{415}{128} \approx 3.24 \text{ network clock cycles.}$$

This time is 53% longer than the 2.125 network clock cycles expected with the crossbar node.

Consider the global level. To construct a network from m/2 stages of N/4 4x4 switching nodes, assume all connection lines in the network are labeled in base 4 and that the stages are numbered (m/2)-1, ..., 1, 0 (from input to output). At stage i, the four input lines to a node are those that differ only in the i-th position of their base 4 representation. The line with a 0 in the i-th position connects to the top input, 2 to the next input, 1 to the next input, and 3 to the bottom input. The output lines of the 4x4 switching nodes have the same labels as the input lines, but in increasing order, i.e., the top output line label has a 0 in the i-th position, next 1, next 2, and the bottom 3. When composite nodes are used, making connections in the above manner creates a cube network. When crossbar nodes are used, a network is created whose capabilities are a superset [...]

A composite node network consists of Nm/2 interchange boxes, allowing $2^{Nm/2}$ permutations. Assuming m is even, a 4x4 crossbar node network consists of Nm/8 nodes, permitting $(4!)^{Nm/8}$ permutations. If m is odd and one stage is constructed by 4x4 crossbar nodes limited to act as a 2x2 [...]

IV. IMPLEMENTATION

To control the network, the destination tags defined in [7] are used. Let the destination address D be represented in binary as $d_{m-1} \cdots d_1 d_0$. A switching node in stage i examines bits $d_{2i+1}$ and $d_{2i}$. For the composite node, the first level interchange boxes examine only bit $d_{2i+1}$ and the second level interchange boxes examine only bit $d_{2i}$. If the bit examined is 0, the upper output link of the interchange box is selected and if the bit is 1, the lower link is selected. For the crossbar node, both bits are examined simultaneously. Together they are considered a single base four digit which corresponds to one of the outputs labeled 0 through 3.
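The destination-tag rule can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes m is even and that the composite node's output index is twice the level 1 bit plus the level 2 bit, i.e. that the upper link of a level 1 box leads to the level 2 box serving outputs 0 and 1.

```python
def crossbar_output(dest: int, stage: int) -> int:
    """Output (0..3) chosen by a crossbar node: base-4 digit `stage` of dest."""
    return (dest >> (2 * stage)) & 0b11


def composite_output(dest: int, stage: int) -> int:
    """Output chosen by a composite node, one bit per level.

    Level 1 boxes examine bit 2*stage+1 and level 2 boxes examine bit
    2*stage; 0 selects the upper link and 1 the lower link.
    """
    first_level = (dest >> (2 * stage + 1)) & 1
    second_level = (dest >> (2 * stage)) & 1
    return 2 * first_level + second_level


def route(dest: int, m: int) -> list[int]:
    """List the output selected at each stage, from stage m/2-1 down to 0."""
    if m % 2:
        raise ValueError("this sketch assumes an even number of tag bits")
    return [crossbar_output(dest, stage) for stage in range(m // 2 - 1, -1, -1)]


if __name__ == "__main__":
    print(route(0b100111, 6))
```

For a 6-bit destination 100111, the three stages select outputs 2, 1 and 3, the base-4 digits of the address, and both node types pick the same output at every stage.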

To add a broadcast capability, an m-bit broadcast mask is appended [15]. Let the mask B be represented in binary as $b_{m-1} \cdots b_1 b_0$. A switching node in stage i now examines $b_{2i+1}$, $b_{2i}$, $d_{2i+1}$, and $d_{2i}$. For the composite node, first level interchange boxes examine bits with index 2i+1 and second level boxes examine bits with index 2i. If the broadcast mask bit is 0, the destination tag bit is interpreted as before. If the mask bit is 1, the destination bit is ignored and both output links of the interchange box are selected. For the crossbar node the four bits are all examined simultaneously. They are interpreted so as to establish the same connections as those that would be obtained in the composite node. Five kinds of broadcasts are defined for either type of 4x4 switching node.
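The mask rule can be turned into a small enumeration of the output sets a single 4x4 node can select. The sketch is illustrative: the function names are hypothetical, and it assumes the same output numbering as the destination-tag rule (twice the level 1 choice plus the level 2 choice) and that both level 2 boxes see the same mask and tag bits.

```python
def select(mask_bit: int, dest_bit: int) -> set[int]:
    """Links chosen by one 2x2 box: both if the mask bit is 1, else dest_bit."""
    return {0, 1} if mask_bit else {dest_bit}


def node_outputs(dest: int, mask: int, stage: int) -> frozenset[int]:
    """Set of outputs (0..3) reached at a 4x4 node in the given stage."""
    hi, lo = 2 * stage + 1, 2 * stage
    first_level = select((mask >> hi) & 1, (dest >> hi) & 1)
    second_level = select((mask >> lo) & 1, (dest >> lo) & 1)
    return frozenset(2 * a + b for a in first_level for b in second_level)


# Every (destination, mask) combination seen by a node in stage 0.
patterns = {node_outputs(d, b, 0) for d in range(4) for b in range(4)}
broadcasts = {p for p in patterns if len(p) > 1}

if __name__ == "__main__":
    for p in sorted(broadcasts, key=lambda s: (len(s), sorted(s))):
        print(sorted(p))
```

Under these assumptions the node can reach five distinct multi-output sets, {0,1}, {2,3}, {0,2}, {1,3} and {0,1,2,3}, which agrees with the count of five broadcasts stated above if that count refers to distinct output sets.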

Hardware Without Broadcast Capability

For simplicity, designs for the composite node and the crossbar node initially will be developed assuming no broadcast capability. Then, those portions of the designs affected by inclusion of a broadcast capability will be modified and compared.

In the following analysis, hardware complexity is measured in terms of logic gate count and chip count. The gate counts are used as a first approximation to compare VLSI implementations. Designs using this technology must also consider wiring complexity [5]. The chip counts are used to compare discrete logic (e.g., TTL) implementations, assuming standard gate-per-chip packaging.

Examining Figs. 2 and 3, the first difference noted is that the crossbar node requires half as many queues as the composite node. Depending on the actual queue size, a considerable savings in logic may be realized in the implementation of the crossbar node. To compare multiplexer requirements, typical implementations of 2-to-1 and 4-to-1 multiplexers were examined [14]. Eight 2-to-1 multiplexers require 20% more gates (regardless of path width) than four 4-to-1 multiplexers. The chip counts are equal. Since the number of external connections for data and control lines is the same for both designs, any buffering/signal conditioning logic will be comparable. In a VLSI design, this implies identical pin counts.

Thus far the crossbar node appears to be the better choice. It is, however, decidedly more complicated to arbitrate the requests of four packets simultaneously (as opposed to two) while assuring each packet equal access to each output link on the average. To determine whether one 4x4 control unit is actually more complex than four 2x2 control units, the functional components of the control units are considered.

The control unit of a 2x2 interchange box contains two sets of queue control logic, input request arbitration (IRA) logic, output request arbitration (ORA) logic, and timing. The control unit for a 4x4 crossbar node contains four sets of queue control logic. The remaining components are the functional equivalents of those for the 2x2 interchange box. The most obvious difference between the two designs is that four 2x2 control units contain twice as many sets of queue control logic as one 4x4 control unit.

One set of queue control logic contains two registers which store pointers, one to the front and one to the back of its associated queue. If the queue is Q words long, $\log_2 Q$ bits are required for each register.

The IRA logic is quite simple. If a request is made for the i-th input (i=0,1 for the 2x2; i=0,1,2,3 for the 4x4), it will be granted if the i-th queue is not full. Once again, four 2x2 control units require twice as much IRA logic as one 4x4 control unit.

The timing logic is identical in both cases. Three clock phases are generated. A request/grant/transfer protocol is implemented (see [9]).

None of the logic discussed thus far is affected by the inclusion of a broadcast capability. Thus, its analysis is equally applicable to the next subsection, which includes broadcast capabilities.

The most important and by far the most complex component of the control unit is the ORA logic. It is responsible for examining the routing tag bits and generating signals to set the multiplexers and make requests. It must also examine the grant signals and generate control signals for the "increment front pointer" input of each set of queue control logic. The complexity of this logic arises from arbitrating conflicting requests for access to the output ports.

To compare the ORA logic, equations are derived for all its output signals as a function of the tag bits and grant signals [14]. The total (NAND) gate count for four sets of 2x2 control unit logic is 104 gates. This corresponds to 24 chips. The control unit for the 4x4 crossbar node requires 124 gates. There is a 19% increase in the number of gates required by the crossbar node. In a discrete logic design, the chip count is 32. This is a 33% increase over the 24 chips required in the composite node.

The excess in ORA logic can be compensated for, since a 4x4 crossbar node requires half the queue control and IRA logic of a 4x4 composite node. From the equations derived, 20 extra gates or eight extra chips are required for the 4x4 crossbar ORA logic. Assuming one of the eight sets of queue control and IRA logic in a composite node will require more than 5 gates or 2 chips, the 4x4 crossbar node is actually less expensive to build. Despite the higher wiring complexity of the 4x4 crossbar node, the total design effort is comparable to that required by the 4x4 composite node.

Hardware With Broadcast Capability

Adding a broadcast capability requires the ORA logic to examine the broadcast mask bits in addition to the routing tag bits. The revised equations for the 2x2 control unit require 39 gates, which multiplied by 4 is 156. This is equivalent to 48 chips. A broadcasting capability costs 52 gates or 24 chips beyond the requirements for a 4x4 composite node without it. More details can be found in [14].

The circuitry needed to add the same broadcast capability to 4x4 crossbar nodes as was added to the composite nodes requires 233 gates, a 49% increase over the 156 required for the composite node. The chip count is 74, a 54% increase over 48. In this case it is likely that one of the eight sets of queue control and IRA logic will require more than 20 gates or 7 chips. If not, the savings in queue gates will compensate for the difference. Again the crossbar node is less expensive than a composite node where both have the same broadcast capability.
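The percentages and break-even thresholds quoted in this section follow from the gate and chip totals by simple arithmetic. The sketch below recomputes them; the function and table names are hypothetical, and the figure of four saved sets assumes the composite node has eight sets of queue control and IRA logic against four in the crossbar node.

```python
SAVED_SETS = 4  # assumption: 8 sets in the composite node, 4 in the crossbar

# (composite total, crossbar total) for the ORA-related control logic.
TOTALS = {
    ("no broadcast", "gates"): (104, 124),
    ("no broadcast", "chips"): (24, 32),
    ("broadcast", "gates"): (156, 233),
    ("broadcast", "chips"): (48, 74),
}


def overhead(composite: int, crossbar: int) -> tuple[int, int, float]:
    """Return (extra units, percent increase, break-even cost per saved set)."""
    extra = crossbar - composite
    return extra, round(100 * extra / composite), extra / SAVED_SETS


if __name__ == "__main__":
    for (variant, unit), (comp, xbar) in TOTALS.items():
        extra, pct, per_set = overhead(comp, xbar)
        print(f"{variant:12s} {unit}: +{extra} ({pct}%), "
              f"break-even {per_set} {unit} per saved set")
```

The crossbar node comes out cheaper whenever one set of queue control and IRA logic costs more than the break-even value: 5 gates or 2 chips without broadcast, and 19.25 gates or 6.5 chips with broadcast, which the text rounds to 20 gates and 7 chips.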

V. CONCLUSIONS

At a local level, the crossbar node is always faster at passing four messages that arrive simultaneously than the composite node. If the connection requests do not conflict in the composite, the crossbar is twice as fast. When the connection requests of the messages form a permutation which the composite node cannot pass without conflict, it takes 3 times longer for all messages to exit the composite node. Assuming each message chooses each output with equal probability, on the average it takes approximately 53% more time for all messages to pass through the composite node than through the crossbar node.
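The 53% figure can be re-derived from the case probabilities given in Section III. The sketch below only sums those stated probabilities with exact fractions; it does not simulate the node, and the variable names are hypothetical.

```python
from fractions import Fraction as F

# P(t) for the composite node, built from the case probabilities in Section III.
composite = {
    2: F(1, 16),
    3: F(3, 16) + F(1, 4) + F(1, 16) + F(1, 8) + F(1, 32),
    4: F(3, 16) + F(1, 32) + F(1, 32) + F(1, 128),
    5: F(3, 128),
}

expected_composite = sum(t * p for t, p in composite.items())
expected_crossbar = F(17, 8)  # 2.125 cycles, from the crossbar analysis
slowdown = expected_composite / expected_crossbar - 1

if __name__ == "__main__":
    print("probabilities sum to", sum(composite.values()))
    print("composite:", expected_composite, "=", float(expected_composite))
    print(f"extra time: {float(slowdown):.1%}")
```

The four probabilities sum to 1, the composite node's expected time is 415/128 (about 3.24 cycles), and the ratio to 2.125 cycles gives the roughly 53% penalty quoted above.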

The ORA logic is the only logic requiring more hardware in a crossbar node than in a composite node. Otherwise, a crossbar node requires half as much queue control and IRA logic, and half as many queues. The multiplexer logic is less than or comparable to that needed by the composite node. The net result is that when packet switching is implemented, the 4x4 crossbar node requires less hardware and significantly outperforms a composite node.

If circuit switching is implemented, no queues or their associated control logic are required. In this case, the crossbar node does contain more hardware. However, it offers a significant improvement in connectivity/permuting ability. If the switching nodes are implemented as VLSI chips, since both nodes require the same number of pins, the gate/pin ratio is improved with a crossbar implementation. Only in the case where circuit switching is implemented in discrete logic is further consideration required. Without a broadcast capability (which is less important in a circuit switching environment), there is only a small difference in the chip count.

In summary, implementations of cube-type networks using 2x2 and 4x4 crossbars were compared. It was shown that for packet switching the 4x4 crossbar is a more cost-effective approach.


REFERENCES

[1] K. Batcher, "The flip network in STARAN," 1976 Int. Conf. Parallel Processing, pp. 65-71, Aug. 1976.

[2] L. Ciminiera, A. Serra, "Modular interconnection networks with asynchronous control," 14th Hawaii Int. Conf. System Sciences, pp. 210-218, Jan. 1981.

[3] D. Dias, J. Jump, "Packet communication in multistage shuffle-exchange networks," 1980 Int. Conf. Parallel Processing, pp. 327-328, Aug. 1980.

[4] M. Flynn, "Very high-speed computing systems," Proc. IEEE, Vol. 54, pp. 1901-1909, Dec. 1966.

[5] M. Franklin, "VLSI performance comparison of banyan and crossbar communications networks," Workshop on Interconnection Networks for Parallel and Distributed Processing, pp. 20-18, Apr. 1980.

[6] G. Goke, G. J. Lipovski, "Banyan networks for partitioning multiprocessor systems," 1st Symp. Comp. Arch., pp. 21-28, Dec. 1973.

[7] D. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comp., Vol. C-24, pp. 1145-1155, Dec. 1975.

[8] W. McDonald, J. Williams, "The advanced data processing test bed," Compsac, pp. 346-351, Mar. 1978.

[9] R. J. McMillen, H. J. Siegel, "The hybrid cube network," Distributed Data Acquisition, Computing and Control Symp., pp. 11-22, Dec. 1980.

[10] J. Patel, "Processor-memory interconnections for multiprocessors," 6th Symp. Comp. Arch., pp. 168-177, Apr. 1979.

[11] M. Pease, "The indirect binary n-cube microprocessor array," IEEE Trans. Comp., Vol. C-26, pp. 458-473, May 1977.

[12] U. Premkumar, et al., "Design and implementation of the banyan interconnection network in TRAC," NCC, pp. 643-653, June 1980.

[13] H. J. Siegel, "A model of SIMD machines and a comparison of various interconnection networks," IEEE Trans. Comp., Vol. C-28, pp. 907-917, Dec. 1979.

[14] H. J. Siegel, et al., Parallel/Distributed Multimicroprocessor Systems for Ballistic Missile Defense, Purdue, EE School, TR-EE 81-12, June 1981.

[15] H. J. Siegel, R. J. McMillen, "The cube network as a distributed processing test bed switch," 2nd Int. Conf. Distributed Computing Systems, pp. 377-387, Apr. 1981.

[16] H. J. Siegel, et al., "PASM: A partitionable SIMD/MIMD system for image processing and pattern recognition," IEEE Trans. Comp., to appear.

[17] H. J. Siegel, S. D. Smith, "Study of multistage SIMD interconnection networks," 5th Symp. Comp. Arch., pp. 223-229, Apr. 1978.

[18] S. D. Smith, "LSI design considerations for multistage interconnection networks for parallel processing systems," 14th Hawaii Int. Conf. System Sciences, pp. 219-227, Jan. 1981.

[19] A. Tripathi, G. J. Lipovski, "Packet switching banyan networks," 6th Symp. Comp. Arch., pp. 160-167, Apr. 1979.

[20] C. Wu, T. Feng, "On a class of multistage interconnection networks," IEEE Trans. Comp., Vol. C-29, pp. 694-702, Aug. 1980.
