[ieee 2007 ieee international symposium on signal processing and information technology - giza,...



Page 1: [IEEE 2007 IEEE International Symposium on Signal Processing and Information Technology - Giza, Egypt (2007.12.15-2007.12.18)] 2007 IEEE International Symposium on Signal Processing

2007 IEEE International Symposium on Signal Processing and Information Technology

Design of a 2D Mesh-Torus Router for Network on Chip

Yahia SALAH Mohamed ATRI Rached TOURKI

Laboratory of Electronics and Microelectronics, Faculty of Sciences Monastir

Monastir 5000, Tunisia

yahia Ir [email protected] Rached.tourki(fsm.ru.tn

Abstract - New system-on-chip (SoC) designs allow one to build heterogeneous systems with several functional units, distributed memories, and interconnections on the same chip. In order to achieve more reuse, flexibility, and performance, bus-based interconnections are no longer sufficient, and Network-on-Chip concepts have emerged. This paper presents the design of a scalable packet-based router allowing data transfer and dynamically managing several communications in parallel. The designed router, described in VHDL at RTL level, was simulated for 2D-mesh and 2D-torus topologies of size (2x2), (3x3) and then (4x4). The design methodology is based on VHDL as a description language and on simulation and synthesis tools.

Keywords: Network on chip, 2D-mesh, 2D-torus, router (switch), VHDL, simulation, synthesis, placement and routing.

I. INTRODUCTION

A NoC is the communication medium in a system-on-chip (SoC) environment. Traditionally, on-chip communication has been conducted via dedicated point-to-point links or a shared medium such as a bus. Buses are not scalable with respect to speed and power, and they quickly become the bottleneck of a system. As described in [1], SoC design is based on the reuse of intellectual property (IP) cores. These cores can be analog/digital cores, memory blocks, DSP cores, and also new technologies. Cores do not make up SoCs alone; an SoC must also include an interconnection architecture and interfaces to peripheral devices [2]. The interconnection architecture includes physical interfaces and communication mechanisms, which allow communication between SoC components.

This paper presents the study and design of a NoC with two different topologies: 2D-mesh and 2D-torus. The basic element of the network is a switch that makes communication between the IP cores possible.

II. STATE OF THE ART IN NOC

In any system on chip, communication management becomes the bottleneck limiting overall performance. As a new SoC paradigm with a communication-centric design style, the Network-on-Chip (NoC) [2] was proposed to meet the distinctive challenges of providing functionally correct, reliable operation of interacting SoC components. A few on-chip micro-network proposals for SoC integration can be found in the literature. Several NoC interconnects have been proposed recently [1-9]. Current studies are mainly interested in the problems of topology, buffering strategy, switching mode, contention resolution, network flow control, and traffic processing, to satisfy the new paradigm of SoCs.

A. Network topology

The authors of [3] and [4] proposed mesh-based interconnect architectures. These architectures consist of an m*n mesh of switches interconnecting computational resources (IPs) placed along with the switches. This topology is predominant in the literature because of its easy implementation, the simplicity of the XY routing strategy, and network scalability. The short distance between switches eliminates a certain number of electrical phenomena in comparison with buses. On-chip networks will likely use networks with lower dimensionality to keep wire lengths short [1]. In [5], Dally et al. proposed the use of a torus interconnect architecture to reduce the network diameter. A variation of the torus architecture that eliminates the long wraparound wires, called a folded torus, is described in [6].

A problem of torus and mesh topologies is the associated network latency. To overcome this problem, other NoC alternatives have been proposed. The fat tree topology has the following advantages: its diameter (maximum number of links between two subscribers) remains reasonable (2*log4(n), where n is the number of layers of the network), and the topology is scalable and uses a small number of routers for a given number of subscribers. It has a natural hierarchical structure, which can be useful in embedded systems. In [1], the authors described an interconnect architecture based on a butterfly fat tree for a networked SoC, as well as the design of the required switches and addressing mechanisms. However, the scalability of the currently proposed solutions, including the honeycomb structure [1] and the butterfly fat tree structure, is either insufficient or unexplored.

B. Buffering strategy

The buffering strategy determines the location of buffers inside the router. We distinguish output and input queuing. Output queuing has the best performance among the buffering strategies. However, the interconnection makes the router wire-dominated and expensive even for a small number of inputs. In input queuing, the queues are at the input of the router. A scheduler determines at which times which queues are connected to which output ports, such that no contention occurs [10].

Another significant parameter is the dimension of the queue, which directly influences packet latency and switch area. It is very important to minimize the amount of buffering space. The buffer size and location are the main aspects discussed. In [7] the
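The claim in Section II.A that a torus reduces the network diameter relative to a mesh can be checked numerically for the sizes simulated in this paper. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and the hop count assumes minimal routing with one hop per inter-switch link.

```python
from itertools import product


def hop_distance(src, dst, rows, cols, torus=False):
    """Minimal number of inter-switch links between two (x, y) switches."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    if torus:
        # Wraparound links let a packet take the shorter way around each ring.
        dx = min(dx, cols - dx)
        dy = min(dy, rows - dy)
    return dx + dy


def diameter(rows, cols, torus=False):
    """Largest minimal hop distance over all pairs of switches."""
    nodes = list(product(range(cols), range(rows)))
    return max(hop_distance(a, b, rows, cols, torus) for a in nodes for b in nodes)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"{n}x{n}: mesh diameter={diameter(n, n)}, "
              f"torus diameter={diameter(n, n, torus=True)}")
```

Under these assumptions the 2x2 case shows no gain from the wraparound links, while the 3x3 and 4x4 cases do.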

978-1-4244-1835-0/07/$25.00 ©2007 IEEE


authors present an algorithm which sizes the (input) buffers in a mesh-type NoC on the basis of the traffic characteristics of a given application. For three audio/video benchmarks it was shown how such intelligent buffer allocation resulted in about 85% savings in buffering resources, in comparison with uniform buffer sizes, without any reduction in performance.

C. Switching mode, contention resolution and network flow control

We identify three important issues in the design of the router network architecture: the switching mode, contention resolution, and network flow control.

The switching mode of a network specifies how data and control are related. We distinguish circuit switching and packet switching. In circuit switching, data and control are separated. The control is provided to the network to set up a connection. This results in a circuit over which all subsequent data of the connection is transported. In packet switching, data is divided into packets, and every packet is composed of a control part, the header, and a data part, the payload. Network routers inspect, and possibly modify, the headers of incoming packets to switch each packet to the appropriate output port.

When a router attempts to send multiple data items over the same link at the same time, contention is said to occur. As only one data item can be sent over a link at any point in time, a selection among the contending data must be made; this process is called contention resolution [10]. In circuit switching, contention resolution takes place at set-up, at the granularity of connections, so that data sent over different connections do not conflict. Thus, there is no contention during data transport, and time-related guarantees can be given. In packet switching, contention resolution takes place at the granularity of individual packets. Because packet arrival cannot be predicted, contention cannot be avoided. It is resolved dynamically by scheduling, in which data items are sent in turn.

Network flow control, also called routing mode, addresses the limited amount of buffering in routers and data acceptance between routers. Many recent multi-computer networks use cut-through or wormhole routing, a technique that reduces message latency by pipelining transmission over the channels along a message's route [10].

In wormhole routing, each packet is further split into a set of units named flits (flow control units), the smallest unit over which flow control is performed. The flit size varies between the packet size and the channel width according to the topology, architecture and protocol of the NoC. Each flit consists of control information and data. Wormhole routing reduces the store-and-forward latency of packet transportation. Only the header flit needs to be routed. If it goes through a switch successfully, the other flits simply follow it without any further routing. If the header flit is blocked, the other flits stop "in place". Wormhole routing requires the least buffering (it buffers flits instead of packets) and also allows low-latency communication. However, it is more sensitive to deadlock and generally results in lower link utilization than virtual cut-through routing.

D. Arbitration and flit size

The router's internal policies for queuing, arbitration, and flow control [...] complexity by coordinating resource sharing amongst competing [...] invokes an arbitration policy, such as round-robin or a priority-based scheme, to select the winner. The arbitration policy may differentiate between application traffic classes to assign priority to more urgent packets. Closely tied to both queuing and arbitration is flow control, which affects latency and throughput by limiting the rate at which packets travel through the network [11].

To avoid alignment problems, the block size (B words) is a multiple of the flit size (F words, B = tF), with t being constant. We prefer a small t to decrease the store-and-forward delay and reduce the buffer size for guaranteed traffic, and a small F for fine-grained switching and better statistical multiplexing. The router architecture contains a data path and a control path. The data path maximises throughput for high link utilisation, and the control path maximises the rate of scheduling and switching. They can be designed and optimised independently [10].

E. Quality of service (QoS)

The quality of service [7] [9] offered by the network has lately received a lot of attention and can be classified in two main modes: best effort (no QoS) and guaranteed traffic (QoS), which can be implemented either statically, using virtual circuits fixed at design time, or dynamically, using handshake protocols at runtime.

Other NoC parameters are also discussed, such as packet length, the network interface VCI, and power consumption [8].

III. THE DESIGN APPROACH

The design methodology used (Figure 1) is based on VHDL as a description language at RTL level. This description is simulated until the expected behaviour is obtained. The simulation tool uses, in addition to the VHDL description, a test file containing the system input stimuli in order to visualize the output behaviour. This HDL (Hardware Description Language) description is synthesized using an existing synthesis tool, allowing passage to the next abstraction level until reaching the integration or implementation in an FPGA. Simulation of the description is required before synthesis at every abstraction level.

Automatic synthesis (Synopsys Design Compiler) has been applied to produce a standard cell netlist. The layout was carried out in a standard cell fashion with a semi-automatic tool environment based on the Cadence tools. A clock tree synthesis approach was followed in order to minimize the clock skew. The clock buffer tree was generated so as to equalize all the different clock branches and sub-branches. To meet all the timing constraints, in-place optimization with SDF back-annotation was used.

Fig. 1. Design Flow

IV. THE ROUTER BLOCK DESIGN UNITS

A. Network structure model

A Network-on-Chip is composed of a set of routers and point-to-point links interconnecting the routers in a structured way. Each router has a set of ports that are used to connect it with its neighbouring routers and with the processing cores of the system (i.e. processors, DSPs, memories, and others). The ports used to connect the processing cores are named local ports or terminals. A pair of parallel opposite channels composes each interconnecting link.
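The wormhole scheme of Section II.C, in which a packet is split into flits and only the header flit is routed, can be illustrated with a short sketch. This is an illustration under stated assumptions rather than the router's actual data format: the function name `to_flits`, the "head"/"body"/"tail" tags, and the representation of words as list items are all hypothetical.

```python
def to_flits(words, flit_size):
    """Split a packet (a list of words) into tagged flits of flit_size words.

    The first flit is tagged "head" (the only one that needs routing),
    the last one "tail", and the others "body". A packet that fits in a
    single flit yields one "head" flit.
    """
    if flit_size <= 0:
        raise ValueError("flit_size must be positive")
    chunks = [words[i:i + flit_size] for i in range(0, len(words), flit_size)]
    flits = []
    for index, chunk in enumerate(chunks):
        if index == 0:
            kind = "head"
        elif index == len(chunks) - 1:
            kind = "tail"
        else:
            kind = "body"
        flits.append((kind, chunk))
    return flits


if __name__ == "__main__":
    packet = list(range(32))          # a 32-word packet, as used in this paper
    flits = to_flits(packet, flit_size=1)
    print(len(flits), flits[0], flits[-1])
```

With a flit size of one word, a 32-word packet becomes 32 flits, so a router only has to buffer a few flits at a time rather than the whole packet.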


A NoC (Figure 2) can be described by its topology and by the strategies used for routing, flow control, switching, arbitration and buffering. The network topology is the arrangement of nodes and channels into a graph. Routing determines how a message chooses a path in this graph, while flow control deals with the allocation of channels and buffers to a message as it traverses this path. Switching is the mechanism that removes data from an input channel and places it on an output channel, while arbitration is responsible for scheduling the use of channels and buffers by the messages. Finally, buffering defines the approach used to store messages while the router arbitration circuits cannot schedule them.

Fig. 2. Flow Chart

These routers are addressable for communication among processors. The network uses a deterministic routing algorithm in the form of a lookup table inside each router for routing to the neighbouring node. Although flow control is not supported, a deterministic routing approach significantly reduces hardware complexity and overhead. Reconfiguration of the network topology and placement of processing units only requires a modification of the routing table. We can reconfigure the internal buffer size of each router and, in this way, trade area for speed. It also allows for more efficient use of routing resources, which is an important design factor in embedded system design.

Routing is performed over fixed shortest paths, employing a symmetric X-Y discipline whereby each packet is routed first in the "X" direction and then along the perpendicular dimension, or vice versa. This scheme leads to a simple, cost-effective router implementation. Network traffic is thus distributed non-uniformly over the mesh links, but each link's bandwidth is adjusted to its expected load, achieving an approximately equal level of link utilization across the chip.

Links are uni-directional and each link is composed of 32-bit wide channels (one for each direction). In each channel 32 bits are reserved for data. Each router has five ports named L (Local), N (North), S (South), E (East) and W (West). Each port has two channels, one for packets incoming into the router and another for packets outgoing from the router.

Internal channels are 16 bits wide. The flow control bits and the data words are stored in the queues of the routers. Other signals are used to control the internal flow inside the routers.

In a wormhole-switched network, each packet is split into a set of units named flits (flow control units). A flit is the smallest unit over which flow control is performed. It can be as long as a packet or as short as the channel width. In the modeled NoC, a flit is equal to the internal channel width (16 bits), because the flow control signals are not taken into account. Buffering is implemented by means of input queues. Each input channel in the router has a 5-flit FIFO buffer to store packets while they cannot be forwarded to an output channel.

Routers connect to up to five links, designed for planar interconnect to four torus/mesh neighbours and to one chip module. Each link comprises an input port and an output port. The router forwards packets from input ports to output ports. Data is received in flits. Every arriving flit is first stored in an input buffer. On the first flit of a packet, the router determines the output port for which that packet is destined. The router then schedules the transmission of each flit on the appropriate output port.

Small buffers are allocated to each service level, capable of storing only a few flits. The routing algorithm is invoked when the first flit of a packet is received. The algorithm uses a simple routing function. For instance, relative routing is employed for XY routing. Routing information per service level and per input port is stored in the Current Routing Table until the tail flit of the packet is received, processed and delivered.

When a flit is forwarded from an input to an output port, one buffer becomes available and a buffer-credit is sent back to the previous router on separate out-of-band wires.

Each output port of a router is connected to an input port of the next router via a communication link. The output port maintains the number of available flit slots per service level in the buffer of the next input port. These numbers are stored in the Next Buffer State table. The number is decremented upon transmitting a flit and incremented upon receiving a buffer-credit from the next router. When space is available, the output port schedules transmission of the flits that are buffered at the input ports and waiting for transmission through that output port, as detailed below.

We now turn to the mechanics of flit transfer inside the router. Flits are buffered at the input ports, awaiting transmission by the output ports. Flit routing is resolved upon arrival of the first flit of a packet, and the output port number is stored in the Current Routing Table for the pending flit, per input port and per service level. Each output port schedules transmission of the flits according to the availability of buffers in the next router, the priority (namely the service level) of the pending flits, and the round-robin ordering of flits within the same service level.

A. Router Architecture

The main objective of an on-chip switch is to provide correct packet transfer between different IP cores. The proposed NoC supposes that each switch has bidirectional ports connected to the neighbouring switches and to the IP core. NoC characteristics are chosen in order to facilitate implementation. Recent packet-switching trends show that wormhole switching is the solution of choice for NoCs.
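The buffer-credit mechanism described in this section (a counter per output port that is decremented when a flit is sent and incremented when a buffer-credit returns) can be sketched as follows. This is a simplified model under assumptions: the class and method names are hypothetical, a single service level is modelled, and the default of 5 slots simply mirrors the 5-flit input FIFO mentioned in the text.

```python
class OutputPort:
    """Tracks free flit slots in the next router's input buffer (one service level)."""

    def __init__(self, next_buffer_slots=5):
        self.capacity = next_buffer_slots
        self.credits = next_buffer_slots   # plays the role of a Next Buffer State entry

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        """Consume one credit; returns False if the next buffer is full."""
        if not self.can_send():
            return False
        self.credits -= 1
        return True

    def receive_credit(self):
        """Called when the next router forwards a flit and frees one slot."""
        if self.credits >= self.capacity:
            raise RuntimeError("credit received while all slots are already free")
        self.credits += 1


if __name__ == "__main__":
    port = OutputPort()
    sent = [port.send_flit() for _ in range(6)]
    print(sent)                 # the sixth attempt is refused: no credit left
    port.receive_credit()
    print(port.send_flit())     # one slot was freed, so sending works again
```

The sender never overruns the downstream FIFO, because a flit leaves only when a free slot is known to exist.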

Fig. 4. The structure of the proposed encoder


This scheme divides packets into fixed-length flow control units (flits), with I/O buffers storing only a few flits. Thus, unlike most other schemes, this design style minimizes the buffer space in the switches, and the switches used in a wormhole technique can be small and compact. The data packet size used is 32 words, with a byte used to achieve routing. We started with a 2D-mesh, then transformed it to a 2D-torus topology. The routing algorithm compares the actual switch address (xL, yL) to the target switch address (xT, yT) of the packet, stored in the header flit.

The exchange of data among the processors and storage cores is becoming an increasingly difficult task because of growing system size and non-scalable global wire delay. To cope with these issues, we divide the end-to-end communication medium into multiple pipelined stages. In NoC architectures (Figure 4), the inter-switch wire segments, along with the switch blocks, constitute a highly pipelined communication medium characterized by link pipelining, deeply pipelined switches, and latency-insensitive component design. Switch design also depends on the routing scheme adopted. The two broad categories of routing are deterministic and adaptive. Deterministic routing algorithms always provide the same path between a given source and destination pair. Adaptive routing algorithms use information about network traffic to avoid the congested part of the network. In a deterministic routing scheme (used in this work), switches can be fast and compact.

The four internal buses (VC) can avoid deadlocks in a two-dimensional Torus/Mesh network. In an output port, the bus selector decides which input bus will supply data to the header decoder by applying a priority algorithm. It consists of a priority multiplexer unit, four VCs and a 16-bit register, as shown in Figure 4.

* Flow control

Flow control deals with the allocation of channels and buffers to a packet as it travels along a path through the network. A resource collision occurs when a packet cannot be processed. Whether the packet is dropped, blocked, buffered, or rerouted through another channel depends on the flow control policy. A good flow control policy should avoid channel congestion while reducing the network latency.

The allocation of channels and their associated buffers to packets can be viewed from two perspectives. The routing algorithm determines which output channel is selected for a packet arriving on a given input channel. Therefore, routing can be referred to as the output selection policy. Since an outgoing channel can be requested by packets arriving on many different input channels, an input selection policy is needed to determine which packet may use the output channel. Possible input selection policies include round robin, fixed channel priority, and first come first served. The input selection policy affects the fairness of routing algorithms.

* Resource arbitration

An arbiter of a particular unit router (hE, hW, hN, hS, hL) receives requests from the header decoders of the other four unit routers and assigns an available output port to one of the requesting header decoders using a centric (or round robin) arbitration mechanism.

* Physical datalinks

[...] are 32 bits wide for data [...] system by a pair of parallel opposite channels. Inside the switch, the bus width is fixed to 16 bits.

Packets are 32 words long (i.e., 64 bytes), including the header flits, which represent the number of data words in the packet, control of errors, and the source and destination addresses, i.e., the (XL, YL) and (XT, YT) offsets in a 2D Torus/Mesh.

* The routing decision unit

The routing decision unit associated with each input queue selects the output port, taking into account the information in the packet header, the available local output ports and the status of the neighbour's buses (VC). This status is known through external signals which are processed by the arbiter. If the selected output port and buses are available, the routing decision unit updates the header flits so that the offset is decremented, and sends them first. This decrement operation is carried out simultaneously for the x and y offsets, and in parallel with the unit arbitration. However, only one of the modified offsets will be forwarded as the first flit, according to the selected output.

The routing decision needs five cycles, as both header flits are examined.

V. EXPERIMENTS

The developed switch can establish only one connection at a time. However, a single switch can simultaneously handle up to five connections, as illustrated by the simulation results presented in Figure 5.

For example, three packets coming respectively from port 0 (East), port 1 (West) and port 4 (Local) can be forwarded to the same output port (for example port 3 (South)). The switch can simultaneously be requested to establish one of three connections, as shown in Figure 6.

Fig. 5. Establishment of five simultaneously active connections

Fig. 6: Simulation of a packet transmission from switch 00 to switch 32 in 2D Torus/Mesh topology
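The comparison of the local address (xL, yL) with the target address (xT, yT), resolved X first and then Y, can be sketched in a few lines. This is a minimal illustration and not the VHDL of the designed router: it assumes a plain mesh without wraparound links, it assumes that North corresponds to increasing y, and it reads a two-digit switch name such as 32 as x=3, y=2; the function names are hypothetical.

```python
def xy_output_port(local, target):
    """Pick the output port for a header flit at switch `local` heading to `target`.

    The X offset is resolved first, then the Y offset; "L" delivers the
    packet to the local core once both offsets are zero.
    """
    (xl, yl), (xt, yt) = local, target
    if xt > xl:
        return "E"
    if xt < xl:
        return "W"
    if yt > yl:
        return "N"   # assumption: North means increasing y
    if yt < yl:
        return "S"
    return "L"


def xy_path(source, target):
    """List of switches visited from source to target, both included."""
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    path = [source]
    current = source
    while True:
        port = xy_output_port(current, target)
        if port == "L":
            return path
        dx, dy = step[port]
        current = (current[0] + dx, current[1] + dy)
        path.append(current)


if __name__ == "__main__":
    print(xy_path((0, 0), (3, 2)))
```

Under these assumptions a packet from switch 00 to switch 32 crosses switches 30 and 31 on its way, which matches the switches whose interfaces appear in the simulation of Figure 6.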


Arbitration is used to grant access to an output port when one or more input ports simultaneously require a connection. A static arbitration scheme is used. The arbitration algorithm is executed using a fixed priority (by set of priorities). The priority of a port is a function of the requests of the other ports having a granted routing request. The request processing requires five clock cycles.

The packet transmission in a 4x4 instance of this NoC was validated by functional simulation. Figure 6 illustrates the transmission of this packet from switch 00 to switch 32. In fact, the input and output interface behaviours of switches 30 and 31 are shown in the simulation results.
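The static, fixed-priority arbitration described here can be illustrated with a very small sketch. The priority order used below is an assumption chosen for the example, since the text does not state the order actually implemented, and the names `PRIORITY` and `arbitrate` are hypothetical.

```python
# Assumed priority order, highest first. The paper does not give the real order.
PRIORITY = ("E", "W", "N", "S", "L")


def arbitrate(requests):
    """Grant one output port to the highest-priority requesting input port.

    `requests` is the set of input ports asking for the same output port.
    Returns the winning port name, or None when nobody is requesting.
    """
    for port in PRIORITY:
        if port in requests:
            return port
    return None


if __name__ == "__main__":
    # East, West and Local all want the South output, as in the example
    # given with Figure 6; only one of them is granted at a time.
    pending = {"E", "W", "L"}
    while pending:
        winner = arbitrate(pending)
        print("granted:", winner)
        pending.discard(winner)
```

A fixed order is cheap to implement in hardware but can starve low-priority ports under sustained load, which is the usual trade-off against round robin.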

B. FPGA implementation

A switch infrastructure prototype implemented in an FPGA occupies 87% of the device area and can be clocked at 127 MHz. To investigate the feasibility of the concept presented here, we have implemented a switch on a Virtex II V500fg456 FPGA. Table 1 presents the results and the percentages of used resources.

Table 1. Results and % of the switch synthesis on FPGA

Resources     Used    Accessible    % of use
IOs            182       264          68%
FGs           5382      6144          87%
CLB Slices    2691      3072          87%
Dffs          3064      6936          44%

C. ASIC implementation

We have designed the entire switch using the Synopsys V-2004.06-SP1 synthesis tool. Then, we mapped our design onto a 0.35 µm (4 metal layers) technology from AMS. Although the results obtained for area and time are estimated by Synopsys in a pessimistic way, they provide us with values very close to the physical domain.

Fig. 7: Latency for m=1 and m=4, load=N.(N-1) (curves for D = 2, D = 3 and D = 4; horizontal axis: number of cores (N))

Table 2. Characteristics of the router modules

Module               Area (mm2)    Critical path (ns)    Power (mW)
Buffer (6 flits)       0.0187            -                   -
Control logic          0.0641           0.5                  -
Global port East       0.3648           0.3518               -
Global port West       0.3648           0.3518               -
Global port North      0.3650           0.3519               -
Global port South      0.3648           0.3518               -
Global port Local      0.3568           0.351                -
TOTAL                  1.8803           0.63               486.8319

Table 2 shows both delay and area for each of the switch blocks. As the design is pipelined, the block on the critical path with the highest delay determines the maximum clock frequency of the device. In this particular case it is the control logic unit (routing and arbitration units) which, because of its arbitration complexity, imposes a cycle time of 0.5 ns. For the global port block (reception and emission modules) we obtain a cycle time of 0.35 ns, taking into account the impact that queue depth has on its management time.

Fig. 8: Variation of NoC surface according to the number of switches of the topology (horizontal axis: number of cores (N))

D. Latency and performance

We fixed the packet size at 32 flits for packets crossing the network from the source to the target. The flit processing time is two clock cycles. The operating frequency is 127 MHz, allowing a theoretical peak performance of:

127 MHz / 2 * 5 ports * 32 bits ≈ 10 Gbits/s

We observe that latency varies neither with the size of the buffers nor with the size of the packets, but only with the number of intermediate switches.

The topology of the network depends on the manner of connecting the switches to one another with bi-directional ports. In a mesh network, for example, the number of ports per switch depends on its position in the network, as explained previously. The designed router, described in VHDL at RTL level, was simulated for the 2D-mesh and 2D-torus topologies of size (2x2), (3x3) and then (4x4).

The basic latency of a packet-switched message is:

$$\text{latency} = \left( \sum_{i=0}^{n} R_i \right) + P \times 4$$

where n is the number of switches on the communication path (source and target included), R_i is the time required by the routing algorithm at each switch (at least 6 clock cycles), and P is the packet size. P is multiplied by 2 because each word requires 2 clock cycles to be sent from one port to the other within the same switch. This number is multiplied by 4 because each word requires 4 clock cycles to be sent from one switch to the next.
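The two formulas of this subsection can be evaluated numerically with a minimal sketch. The function names and the three-switch example path are hypothetical, chosen only for illustration. The latency function takes the P × 4 term of the formula at face value and sums one routing time per switch on the path.

```python
def peak_throughput_bits_per_s(clock_hz, cycles_per_flit, ports, flit_bits):
    """Theoretical peak: one flit per port every `cycles_per_flit` cycles."""
    return clock_hz / cycles_per_flit * ports * flit_bits


def packet_latency_cycles(routing_cycles_per_switch, packet_words, cycles_per_word=4):
    """Basic latency: sum of the routing times R_i plus P * cycles_per_word."""
    return sum(routing_cycles_per_switch) + packet_words * cycles_per_word


# Values quoted in the text: 127 MHz, 2 cycles per flit, 5 ports, 32-bit flits.
peak = peak_throughput_bits_per_s(127e6, 2, 5, 32)
print(f"peak throughput: {peak / 1e9:.2f} Gbit/s")

# Illustrative path (assumption): 3 switches, each needing the minimum of
# 6 routing cycles, carrying a 32-word packet.
latency = packet_latency_cycles([6, 6, 6], 32)
print(f"latency: {latency} cycles, {latency / 127e6 * 1e6:.3f} us at 127 MHz")
```

With the figures quoted in the text, the peak evaluates to 10.16 Gbit/s, which the paper rounds to 10 Gbits/s.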


Figure 7 shows the effect of the number of IP cores (N) on the latency of NoC-based systems, without taking into account the impact of die size and number of cores on the clock cycle. Depending on the dimension of the NoC and its topology, it gives the total latency to deliver N(N-1) messages in both networks. Each message has P=1+2m 32-bit words on the NoC, accounting for the packet header and data. Each graph includes several curves for different communication locality patterns. The smaller the parameter D is, the greater the locality. For instance, by setting D equal to 2, we consider that all the messages are exchanged among cores separated by two routers and a single inter-router link.

Figure 7 shows that for short messages and at the maximum workload, the NoC is effective when the locality is high (D small) or when the system grows (N increases). Even at a smaller workload, if the message size is increased, the parallelism and the pipelining of the NoC produce a performance enhancement.

The time spent to deliver packets grows linearly with the number of intermediate switches (Figure 8). This happens because each switch spends some clock cycles executing the arbitration and switching algorithms.

For buffers of a small number of flits, the delivery time does not vary. If the buffer is too small, the switch cannot receive new flits until the destination port is chosen. Therefore, the minimum buffer size must be equal to the number of write operations that can be executed during the execution of the arbitration and switching algorithms. In the case of a NoC (4x4), these algorithms use 6 clock cycles and each write operation takes 2 clock cycles. The minimum size of the buffer is then 6 flits.

E. Placement and route

Cadence SoC Encounter integrates the Synthesis/Placement and Route (SP&R) solution with new functions and the technology of Silicon Perspective Corporation (SPC). It provides a complete hierarchical RTL-to-GDSII solution. We chose SoC Encounter as the hierarchical placement, routing and implementation tool, allowing us to decrease the design time of our SoC application.

The complete communication infrastructure is built in the 2D-Torus/Mesh (3x3) topology by interconnecting 9 routers with 9 modules (different MACs). These modules are responsible for data computing. We present here the integration of 9 MACs in the context of a NoC. This combination creates a SoC-type application (Figure 10). The placement and the routing of the system (resulting circuit) are presented in Figure 10-a and Figure 10-b respectively. The design of the 2D-Torus/Mesh (3x3) topology occupies 11% of the device area. If large components are placed on the device, they will cover a large set of routers, thus reducing the total area used by the routers. The router has also been implemented as described previously.

Fig. 9: Representation of the SoC 3x3 application in the case of a 2D-Torus/Mesh 3x3 network topology

Fig. 10: The physical representation of SoC 3x3 (a: Placed, b: Routed)

Each router occupies less than 0.9% of the system device area and has a latency of 6 clock cycles at a frequency of 127 MHz. Because the maximum number of routers to traverse is 5, two components running on the device at a frequency of 100 MHz can send and receive packets without delay in their execution, if the network is free.

VI. CONCLUSIONS

In this paper, we focused on torus and mesh topologies, which seem to dominate the literature on networks on chip. We studied several key points such as memorization, routing algorithm and arbitration. A VHDL-based methodology is used to implement a switch for data transfer from an IP source to an IP destination.

REFERENCES

[1] S. Kumar, et al., "A Network on Chip Architecture and Design Methodology", in IEEE Computer Society Annual Symposium on VLSI (ISVLSI'02), Apr. 2002, pp. 105-112.
[2] L. Benini and G. De Micheli, "Networks on chips: a new SoC paradigm", IEEE Computer, pp. 70-78, Jan. 2002.
[3] S. Murali and G. De Micheli, "Bandwidth-Constrained Mapping of Cores onto NoC Architectures", Design, Automation and Test in Europe Conference and Exhibition, Vol. 2, pp. 896-901, 2004.
[4] D. Ching, P. Schaumont and I. Verbauwhede, "Integrated Modeling and Generation of a Reconfigurable Network-on-Chip", 18th International Parallel and Distributed Processing Symposium, pp. 139-145, 2004.
[5] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M. Millberg, and D. Lindqvist, "Network on chip: an architecture for billion transistor era", in Proc. IEEE NorChip Conf., 2000, pp. 3531-3538.
[6] G. De Micheli, "Reliable communication in systems on chip", in Proc. DAC, 2004, pp. 77.
[7] Jingcao Hu and R. Marculescu, "Application-specific buffer space allocation for networks-on-chip router design", Proceedings of the 2004 IEEE/ACM International Conference on Computer-Aided Design, pp. 354-361, November 07-11, 2004.
[8] J. Henkel, W. Wolf, and S. Chakradhar, "On-chip networks: a scalable, communication-centric embedded system design paradigm", in Proc. 17th Int'l Conf. VLSI Design, 2004, pp. 845-851.
[9] D. Bertozzi, et al., "NoC Synthesis Flow for Customized Domain Specific Multiprocessor Systems-on-Chip", IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 2, February 2005.
[10] E. Rijpkema, K. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage and E. Waterlander, "Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip", IEE Proc.-Comput. Digit. Tech., Vol. 150, No. 5, September 2003, pp. 294-302.
[...] "... Flexible and Extensible Simulator for Evaluating Multicomputer Networks", IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 1, January 1997, pp. 25-40.
