[ieee 2013 ieee postgraduate research in microelectronics and electronics asia (primeasia) -...

A Provably Good Performance Centric NoC Topology

Tuhin Subhra Das, Prasun Ghosal

Department of Information Technology

Bengal Engineering and Science University, Shibpur

Howrah 711103, WB, India

[email protected], [email protected]

Abstract—As chip density increases rapidly with every process generation, Network-on-Chip (NoC) has become the prevalent architecture for SoC, MPSoC, and large-scale CMP (Chip Multi-Processor) based designs. Researchers have proposed diverse NoC solutions to meet the growing on-chip communication requirements. Here, the underlying network interconnection architecture (topology), the router design, and the routing policy all play an important role in improving overall system performance. In this work, a 2D hybrid mesh-based star topology is proposed with the objective of providing a system with low latency, low channel contention, and higher throughput. The observed experimental results show a maximum latency benefit of 62% and a throughput increase of 48% for the proposed topology compared to a simple 2D mesh, at the cost of additional area overhead.

Keywords—NoC topology; routing; throughput; latency; load balancing; performance

I. INTRODUCTION

As silicon process technology evolves at an accelerated pace, design reuse and design automation technology are now seen as the major technical barriers to future progress, and the productivity gap between silicon process and design technology is widening rapidly over time. System-on-a-chip [1] design, with its advanced IC process technology, has reduced this design productivity gap to some extent. However, incremental changes to current IC design methodologies are not sufficient to enable the full potential of system-on-chip (SoC) integration. Here, NoC [2] offers an alternative solution by providing a platform-based design (PBD) methodology. This PBD methodology, which builds on intellectual property (IP) blocks, facilitates a hierarchical design methodology starting at the system level. It also provides a clear separation between the architecture design phase and the function design phase. It allows reuse not only of components but also of the system.

Today, a NoC with 100 cores [3] already exists, and a prototype with 1000 or more cores has also been proposed recently [4]. Here, the key challenge is to provide a massively parallel distributed communication environment. The underlying network topology plays an important role in improving system performance, as it defines the communication infrastructure between any two routers on the chip. The key issues are modelling, design technology, routing, flow control, and deadlock prevention. A properly balanced network offers a large or moderate bisection width [5], whereas other important metrics, viz. network diameter [5] [6] and node degree [5], are expected to be smaller. A hybrid locally mesh globally star (HLMGS) 2D NoC topology is proposed in this paper with the objective of providing a balanced network with low network latency [6] and higher throughput [6]. The channel contention [7] problem is also reduced by providing more alternative paths.

978-1-4799-2751-7/13/$31.00 2013 IEEE

The overall organization of the paper is as follows. Section II describes the background and related work in this area to motivate the present problem. Section III gives the details of the proposed hybrid architecture. Section IV describes the proposed routing algorithm. Deadlock freeness of the proposed algorithm is presented in Section V. Experimental results are reported in Section VI, and Section VII concludes the paper.

II. BACKGROUND AND MOTIVATION

In NoC, the topology provides the basic interconnection architecture among the routers. Topologies are broadly categorized into two classes, viz. regular and irregular. A regular topology follows a symmetric pattern throughout its whole structure, whereas an irregular topology is derived by mixing different topology structures in a hybrid, hierarchical, or asymmetric fashion. A regular grid-based 2D mesh is a very popular NoC topology because of its architectural simplicity and its support for a high level of parallelism. But it has a very long network diameter [5], which leads to higher network latency [5]. A hierarchical star [8] topology, by contrast, offers a very short diameter but is not well suited to designing parallel architectures because of its smaller bisection width [5]. In some recently proposed topologies, researchers have followed hybridization or hierarchy-based techniques to design enhanced topologies. For example, a star-type 2D mesh [9] is proposed by combining a simple 2D mesh and the hierarchical star [8] topology. Similarly, L2STAR [10] and the multi-level mesh [11] follow a level-wise hierarchical architecture. The objective is to design a low-latency parallel architecture. But most of them suffer from either a higher node degree or channel contention as the network size increases. For example, a multi-level mesh offers low latency, but its node degree increases almost linearly with the network size. Topologies like the star-type 2D mesh [9], L2STAR [10], and SD2D [12] limit the maximum node degree, but they only support a second-level hierarchical routing policy. So making a scalable, reliable architecture with low latency and high throughput [6] is a challenge for the designer. In this paper, we try to overcome these limitations through our proposed work.

2013 IEEE Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics (PrimeAsia)

Fig. 1: Proposed LMGS topology when N is 4

III. PROPOSED HYBRID LMGS TOPOLOGY

This paper presents a hybrid locally mesh globally star (HLMGS) NoC interconnection architecture (see Fig. 1). The proposed topology facilitates both long-distance and short-distance traffic by using two different types of connections at different levels. The mesh carries short-distance local traffic, whereas the star is used for long-distance traffic. In the mesh, a connection is established between two routers at the same level, whereas in the hierarchical star a connection is established between routers at two different levels. Although the extra links and routers require some additional area, they distribute traffic more evenly throughout the network. This helps balance the network load and improves performance with respect to issues such as congestion control and channel contention [7] by providing more alternative paths.

Some important parameters of the proposed architecture of size M × M are as follows, where $M = 2^m$ for m = 2, 3, 4, ..., n (m and n are positive integers):

Bisection width = M + 4

Maximum node degree of non-leaf router = 7

Maximum node degree of leaf router = 9, when N = 4

Maximum node degree of leaf router = 6, when N = 1

Maximum number of IP cores connected to a network = M × M × N

Here, N represents the number of IP (Intellectual Property) cores connected to each leaf-level router.
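The parameters listed above can be tabulated with a small Python sketch. It assumes that M is a power of two and that every higher level halves the side length of the level below until a single root router remains. That assumption reproduces the 21 switches quoted for the 4 × 4 case in the area comparison, but it is not confirmed by the text for larger sizes. The function name and dictionary keys are illustrative only.

```python
def hlmgs_parameters(m: int, cores_per_leaf: int = 1) -> dict:
    """Summarize an M x M HLMGS network, taking M = 2**m with m >= 2."""
    if m < 2:
        raise ValueError("the paper defines the topology for m >= 2")
    side = 2 ** m

    # Assumption (not stated explicitly in the paper): each higher level
    # halves the side length, ending with a single 1 x 1 root level.
    routers_per_level = []
    s = side
    while s >= 1:
        routers_per_level.append(s * s)
        s //= 2

    return {
        "side": side,
        "routers_per_level": routers_per_level,
        "total_routers": sum(routers_per_level),
        "bisection_width": side + 4,                     # M + 4, from the text
        "max_ip_cores": side * side * cores_per_leaf,    # M x M x N, from the text
    }


if __name__ == "__main__":
    print(hlmgs_parameters(2, cores_per_leaf=4))
```

For m = 2 and N = 4 the sketch reports three levels with 16, 4, and 1 routers, a bisection width of 8, and at most 64 IP cores.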

IV. PROPOSED ROUTING ALGORITHM

In the proposed routing algorithm, two variables, Xdiff and Ydiff, measure the difference between the source and destination node coordinates; both are always positive integers. A packet follows global routing in either of the following two situations.

(i) If the difference between the current and destination nodes exceeds a predefined threshold value (4 here), the packet follows global routing.

(ii) If the current node position matches the target destination node but the target destination node is not the original destination node, the packet also follows global routing.

In global routing, when a packet switches from one level to another, the target destination point is also shifted one level upwards or downwards, though the actual destination node remains the same. At the beginning of the algorithm, the current node is initialized to the source node, and the target destination node is initialized to the original destination node. The target destination node is shifted to a new position at every stage of global routing. Each router node position is represented by the notation (L, X, Y), where L is the level of the router and X and Y are the row and column numbers of that node. The current node position is represented by (Lc, Xcurr, Ycurr), the original destination node position by (Ld, Xdest, Ydest), and the target destination node position by (L'd, X'dest, Y'dest).

(a) Leaf Router (b) Non leaf Router

Fig. 2: Router Architecture when N is 1

The variable abbreviations used in the pseudo code are given in Table I.

A. Proposed Routing

The pseudo code of the proposed routing scheme is given in Algorithm 1.

B. Global Star Routing

The general pseudo code of the global star routing scheme is given in Algorithm 2.



Notation   Description
Xdest      X co-ordinate of actual destination node.
Ydest      Y co-ordinate of actual destination node.
X'dest     X co-ordinate of target destination node, initialized to the value of Xdest.
Y'dest     Y co-ordinate of target destination node, initialized to the value of Ydest.
Xcurr      X co-ordinate of current node.
Ycurr      Y co-ordinate of current node.
Lc         Level of current node, initialized to 0.
Ld         Level of destination node, initialized to 0.
L'd        Level of target destination node, initialized to 0.
Xdiff      Difference between Xdest and Xcurr.
Ydiff      Difference between Ydest and Ycurr.

TABLE I: Table of notations

if (current node == actual destination node) then
    packet has reached the destination
else
    if (current node != target destination node) then
        if (Xdiff > 4 or Ydiff > 4 or (Xdiff + Ydiff) > 4) then
            follow global star routing
        else
            follow local XY routing
        end
    else
        follow global star routing
    end
end

Algorithm 1: Proposed routing
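The decision structure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that the differences are taken between the current node and the target destination node at the same level, and that the threshold test means "any of Xdiff, Ydiff, or their sum exceeds 4". The names `Node` and `routing_decision` and the returned labels are hypothetical.

```python
from collections import namedtuple

# (level, x, y) position of a router, following the paper's (L, X, Y) notation.
Node = namedtuple("Node", ["level", "x", "y"])

THRESHOLD = 4  # distance threshold quoted in the text


def routing_decision(current, target, destination, threshold=THRESHOLD):
    """Return 'deliver', 'local_xy', or 'global_star' for one routing step."""
    if current == destination:
        return "deliver"
    if current != target:
        # Assumption: differences are measured against the target destination.
        x_diff = abs(target.x - current.x)
        y_diff = abs(target.y - current.y)
        if x_diff > threshold or y_diff > threshold or x_diff + y_diff > threshold:
            return "global_star"
        return "local_xy"
    # Current node equals the target but not the actual destination:
    # the packet is at an intermediate level and continues globally.
    return "global_star"


if __name__ == "__main__":
    src = Node(0, 0, 0)
    print(routing_decision(src, Node(0, 1, 2), Node(0, 1, 2)))
    print(routing_decision(src, Node(0, 3, 3), Node(0, 3, 3)))
```

In a 4 × 4 mesh, a corner-to-corner packet has Xdiff + Ydiff = 6, so under this reading it is handed to global star routing, while a destination three hops away stays on local XY routing.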

During global routing, when the target destination node is shifted one level up, the following steps are executed:

Lc = Lc + 1;
L'd = L'd + 1;
Xcurr = Xcurr / 2;
Ycurr = Ycurr / 2;
X'dest = X'dest / 2;
Y'dest = Y'dest / 2;

When it is shifted one level down, the following steps are executed:

L'd = L'd − 1;
If L'd != 0 then X'dest = Xdest / 2^(Lc − 1); Else X'dest = Xdest;
If L'd != 0 then Y'dest = Ydest / 2^(Lc − 1); Else Y'dest = Ydest;
Lc = L'd;
Xcurr = X'dest;
Ycurr = Y'dest;

if (current node != target destination node) then
    shift target destination node and next current node one level up;
    route packet towards the new current node;
else
    shift target destination and next current node one level down;
    route packet to new target destination node;
end

Algorithm 2: Global routing
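The level-shift bookkeeping used by global routing can be traced with the following Python sketch. It follows the update steps given above, with two assumptions: divisions are integer divisions, and the divisor in the downward step is read as 2^(Lc − 1) so that it mirrors the halving in the upward step. The class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RouteState:
    """Coordinates and levels tracked while a packet is routed globally."""
    x_curr: int
    y_curr: int
    x_dest: int
    y_dest: int
    l_curr: int = 0      # Lc
    l_target: int = 0    # L'd
    x_target: int = 0    # X'dest, set to Xdest in __post_init__
    y_target: int = 0    # Y'dest, set to Ydest in __post_init__

    def __post_init__(self):
        self.x_target = self.x_dest
        self.y_target = self.y_dest

    def shift_up(self):
        """Move current and target nodes one level up (coordinates halve)."""
        self.l_curr += 1
        self.l_target += 1
        self.x_curr //= 2
        self.y_curr //= 2
        self.x_target //= 2
        self.y_target //= 2

    def shift_down(self):
        """Move one level down and recompute the target from the real destination."""
        self.l_target -= 1
        if self.l_target != 0:
            scale = 2 ** (self.l_curr - 1)  # assumed reading of the divisor
            self.x_target = self.x_dest // scale
            self.y_target = self.y_dest // scale
        else:
            self.x_target = self.x_dest
            self.y_target = self.y_dest
        self.l_curr = self.l_target
        self.x_curr = self.x_target
        self.y_curr = self.y_target


if __name__ == "__main__":
    state = RouteState(x_curr=0, y_curr=0, x_dest=3, y_dest=3)
    for step in (state.shift_up, state.shift_up, state.shift_down, state.shift_down):
        step()
        print(state)
```

Tracing a packet from (0, 0) to (3, 3) in a 4 × 4 network, two upward shifts bring both nodes to the single level-2 router, and two downward shifts then land on the level-1 router (1, 1) and finally on the destination (3, 3).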

Fig. 3: An abstract view of 4 × 4 hybrid LMGS topology

V. DEADLOCK FREENESS

A deadlock may appear in NoC routing if the routing scheme generates a circular waiting path, as depicted in [13][10], where two different packets or flits wait for each other in a cyclic way for an indefinite time. To avoid this kind of deadlock, researchers have adopted two different policies. One is to restrict packet movement so that no circular waiting path can form. The other is to split each physical channel into several logical virtual channels [14][15]. We have chosen the first, as adding virtual channels is not free of cost and needs additional space, as described in detail in [16]. Simple mesh and star routing are always deadlock free, and the proposed locally mesh globally star routing does not generate any such circular path either, because at a given level a packet follows either local mesh routing or global star routing. The proposed routing is therefore able to avoid this kind of undesirable situation.
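The notion of a circular waiting path can be made concrete with a channel dependency graph, a standard analysis technique that is not taken from this paper. The Python sketch below covers only the local mesh part: it builds the dependencies between directed channels of a k × k mesh and checks them for cycles, once with the XY restriction (no turn from a Y channel to an X channel) and once with all turns allowed. It does not model the star links or prove anything about the full HLMGS routing; all function names are illustrative.

```python
from itertools import product


def mesh_channels(k):
    """Directed channels (u, v) between neighbouring routers of a k x k mesh."""
    channels = []
    for x, y in product(range(k), repeat=2):
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < k and 0 <= ny < k:
                channels.append(((x, y), (nx, ny)))
    return channels


def dependency_graph(k, xy_only):
    """Edge c1 -> c2 when a packet may hold channel c1 while requesting c2."""
    channels = mesh_channels(k)
    graph = {c: [] for c in channels}
    for c1 in channels:
        a, b = c1
        for c2 in channels:
            if c2[0] != b or c2[1] == a:        # must continue from b, no U-turn
                continue
            first_is_y = a[0] == b[0]           # c1 moves along the Y axis
            second_is_x = c2[0][1] == c2[1][1]  # c2 moves along the X axis
            if xy_only and first_is_y and second_is_x:
                continue                        # XY routing never turns Y -> X
            graph[c1].append(c2)
    return graph


def has_cycle(graph):
    """Iterative three-colour depth-first search for a directed cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {c: WHITE for c in graph}
    for start in graph:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            nxt = next(successors, None)
            if nxt is None:
                colour[node] = BLACK
                stack.pop()
            elif colour[nxt] == GREY:
                return True
            elif colour[nxt] == WHITE:
                colour[nxt] = GREY
                stack.append((nxt, iter(graph[nxt])))
    return False


if __name__ == "__main__":
    print("XY routing has a cycle:", has_cycle(dependency_graph(4, xy_only=True)))
    print("Unrestricted turns have a cycle:", has_cycle(dependency_graph(4, xy_only=False)))
```

The restricted graph is acyclic, which is the usual argument for why XY routing needs no virtual channels, whereas allowing every turn produces a cycle around any square of four routers.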

VI. EXPERIMENTAL RESULTS AND DISCUSSION

For the experimental evaluation of the proposed work, we selected NS-2 [17], as suggested in [6][18][19]. It is an object-oriented, discrete-event network simulator implemented in C++ and OTcl, and it is well suited to simulating NoCs at a higher abstraction level. The simulator provides a convenient user interface, the Network Animator (NAM), which helps us visualize the network operation in real time by tracking the data flow (see Fig. 4). Important performance-centric parameters, viz. latency, throughput, packet drop rate, etc., can be calculated easily from the output trace file. The simulator also lets us observe system performance under different network loads. The important parameters used in our experiments are listed in Table II.
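As an illustration of how such metrics can be derived from a trace file, the Python sketch below summarizes a list of trace lines. It assumes the standard NS-2 wired trace format (event, time, from node, to node, packet type, size, flags, flow id, source address, destination address, sequence number, packet id) and is not the post-processing script used for the reported results. The function name, the returned keys, and the metric definitions (latency from first enqueue to arrival at the destination node, loss as dropped over sent packets) are assumptions.

```python
def summarize_trace(lines, duration_s):
    """Compute average latency, throughput, and loss rate from NS-2 trace lines."""
    sent_at = {}          # packet id -> time of its first enqueue ('+') event
    latencies = []
    received_bytes = 0
    dropped = 0

    for line in lines:
        f = line.split()
        if len(f) < 12:
            continue      # skip malformed or non-packet lines
        event, time, to_node = f[0], float(f[1]), f[3]
        size = int(f[5])
        dst_node = f[9].split(".")[0]   # address has the form node.port
        pkt_id = f[11]

        if event == "+" and pkt_id not in sent_at:
            sent_at[pkt_id] = time
        elif event == "d":
            dropped += 1
        elif event == "r" and to_node == dst_node and pkt_id in sent_at:
            latencies.append(time - sent_at[pkt_id])
            received_bytes += size

    sent = len(sent_at)
    return {
        "avg_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
        "throughput_bps": received_bytes * 8 / duration_s,
        "loss_rate": dropped / sent if sent else 0.0,
    }


if __name__ == "__main__":
    sample = [
        "+ 0.00 0 1 exp 8 ------- 0 0.0 2.0 0 0",
        "r 0.10 0 1 exp 8 ------- 0 0.0 2.0 0 0",
        "+ 0.10 1 2 exp 8 ------- 0 0.0 2.0 0 0",
        "r 0.20 1 2 exp 8 ------- 0 0.0 2.0 0 0",
        "+ 0.05 0 1 exp 8 ------- 0 0.0 2.0 1 1",
        "d 0.05 0 1 exp 8 ------- 0 0.0 2.0 1 1",
    ]
    print(summarize_trace(sample, duration_s=1.0))
```

Running the same summary over the traces of two topologies at the same load gives the figures from which a relative latency benefit or throughput increase can be computed.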

For the experiments, we created four different types of 4 × 4 topology through Tcl scripts. Routers and PE cores (i.e. resources) are represented by square and circular nodes respectively, and they are connected by duplex links according to the proposed and the other compared topologies (see Fig. 4). Each router at level 0 is connected to its neighbouring routers with a maximum channel bandwidth of 1 Mb. A router at a higher level (i.e. other than level 0, a non-leaf router) is connected to another higher-level router with a maximum channel bandwidth of 2 Mb. Thus, a 4 × 4 proposed topology (HLMGS) requires a total of eight 2 Mb channels. Four channels connect the four level-1 routers in a mesh, and another four connect these four level-1 routers to the level-2 router in a star orientation (see Fig. 3). Each leaf router (i.e. router at level 0) is connected to a single resource (i.e. IP core) with a channel bandwidth of 1 Mb. UDP is selected as the communication protocol, as it provides non-guaranteed datagram delivery. Each source node is attached to a UDP agent, and each sink is attached to a null agent. Each source (i.e. UDP agent) is attached to an exponential traffic generator. The traffic on and off periods are set to 2 milliseconds and 0.1 milliseconds respectively. Each node uses a DropTail queue whose maximum size is set to 8, as suggested in [19]. The link delay for short channels (i.e. level-0 connections) and long channels (i.e. other than level-0 connections) is set to 0.1 and 0.15 milliseconds respectively. A communication scenario is defined by selecting some traffic source-sink pairs at random, and the simulation runs for 15 seconds. Perl scripts are used to retrieve the required information from the trace file, which is then used to analyse network performance. Important performance-centric parameters such as network latency, throughput, and packet loss rate are observed for the different topologies under varying network load.
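The channel counts in this setup can be checked with a short Python sketch. It assumes that every leaf router has exactly one uplink to a level-1 router (suggested by the leaf node degree of 6 when N = 1) and that the links to IP cores are not counted as router-to-router links. The function and key names are illustrative.

```python
def mesh_links(side: int) -> int:
    """Number of bidirectional links in a side x side 2D mesh."""
    return 2 * side * (side - 1)


def hlmgs_4x4_link_summary() -> dict:
    """Count router-to-router links of the 4 x 4 HLMGS setup by category."""
    links = {
        "level0_mesh": mesh_links(4),   # 1 Mb links between leaf routers
        "leaf_to_level1": 16,           # assumption: one uplink per leaf router
        "level1_mesh": mesh_links(2),   # 2 Mb links between level-1 routers
        "level1_to_level2": 4,          # 2 Mb star links to the level-2 router
    }
    total = sum(links.values())
    fast = links["level1_mesh"] + links["level1_to_level2"]
    return {"links": links, "total": total, "fast": fast, "fast_share": fast / total}


if __name__ == "__main__":
    summary = hlmgs_4x4_link_summary()
    print(summary["total"], summary["fast"], f"{summary['fast_share']:.1%}")
```

Under these assumptions the eight 2 Mb channels make up 8 of 48 router-to-router links, roughly one sixth, which is in line with the figure of about 16% of the total links quoted in the latency discussion.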

Fig. 4: Snapshot of NS-2 Network Animator for 4 × 4 proposed topology

A latency benefit of 49% compared to L2STAR [10] and SD2D [12], and of 62% compared to a simple mesh, has been observed for the proposed hybrid topology (see Fig. 5), simply by doubling the channel bandwidth of only 16% of the total links of the whole network. These simulated results may differ from the analytical results, because packet delay is influenced much more heavily by channel contention than by link delay. Link delay alone is observable only in the ideal situation, i.e. when no packet is lost during transmission. Packet delay for the proposed topology (i.e. the HLMGS topology) reaches its threshold value at a comparatively higher load than for the others.

TABLE II: NS-2 Simulation Parameter Details

Parameter                         Value
Maximum channel bandwidth         1 Mb - 2 Mb
Link delay                        0.1 - 0.15 ms
Topology                          LMGS, L2STAR, SD2D, Mesh
Buffer size                       8 (unit of packets)
Queue type                        Drop Tail
Packet size                       8 bytes
Packet injection rate             500K - 1000K
Simulation duration               15 sec
Connection type                   UDP
Traffic type                      Exponential
Traffic burst time (on period)    2 ms
Traffic idle time (off period)    0.1 ms

Fig. 5: Packet latency as a function of increasing network load for 4 × 4 sized topology

Fig. 6: Maximum throughput as a function of increasing network load for 4 × 4 sized topology

Fig. 7: Packet loss rate as a function of increasing network load for 4 × 4 sized topology

An increase of 25% and 27% in maximum throughput compared to SD2D and L2STAR respectively, and an increase of 48% compared to a simple 2D mesh, has been observed for the proposed topology (see Fig. 6). The packet loss rate is also negligible or very low for the proposed topology (as shown in Fig. 7). A low packet drop rate also signifies low channel contention, which in turn indicates that traffic is properly distributed throughout the network. The required area has been compared using the method proposed by S. Suboh et al. in [20], where the average required area (Av) is calculated as follows:

$$A_v = N_s \left(R_s + a_s d_g S_f B_s\right) + N_c A_c + a_l N_l L_l \qquad (1)$$

The number of switches (Ns) is taken as 16 for the mesh and 21 for HLMGS, and the average node degree is taken as 4 for the mesh and 5.4 for the proposed HLMGS. The link length is 1 unit for leaf-level connections and 2 units for non-leaf connections. Taking the other variable values as proposed in [20] and varying the number of IP cores connected to each router (N) from 1 to 4, an area overhead of 24% down to 12% has been observed for the proposed topology over the 2D mesh. The required area overhead thus decreases as the number of IP cores increases.
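Equation (1) and the relative overhead can be evaluated with a few lines of Python. The sketch below is a direct transcription of the formula. Only Ns (number of switches) is defined in the text; the other argument names simply mirror the symbols of Eq. (1), and their values must be taken from [20]. The numbers in the usage example are hypothetical placeholders, not the values behind the reported 24% to 12% overhead.

```python
def average_area(n_s, r_s, a_s, d_g, s_f, b_s, n_c, a_c, a_l, n_l, l_l):
    """Evaluate Eq. (1): Av = Ns(Rs + as*dg*Sf*Bs) + Nc*Ac + al*Nl*Ll."""
    return n_s * (r_s + a_s * d_g * s_f * b_s) + n_c * a_c + a_l * n_l * l_l


def relative_overhead(area_new, area_base):
    """Fractional area increase of a topology over a baseline topology."""
    return (area_new - area_base) / area_base


if __name__ == "__main__":
    # Hypothetical placeholder values, chosen only to exercise the formula.
    base = average_area(n_s=16, r_s=1.0, a_s=0.1, d_g=4.0, s_f=1.0, b_s=8.0,
                        n_c=16, a_c=5.0, a_l=0.05, n_l=24, l_l=1.0)
    new = average_area(n_s=21, r_s=1.0, a_s=0.1, d_g=5.4, s_f=1.0, b_s=8.0,
                       n_c=16, a_c=5.0, a_l=0.05, n_l=48, l_l=1.0)
    print(f"overhead: {relative_overhead(new, base):.1%}")
```

Because the core term NcAc grows with the number of IP cores while the switch and link terms stay fixed, the relative overhead shrinks as N increases, which matches the trend described above.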

VII. CONCLUSION

Although the experimental results are quite convincing, much remains to be explored in future. For example, the results were observed under static routing and under the assumption of a single resource connected to each leaf router. How the topology performs under dynamic routing, and what channel bandwidth is required as the number of resources connected to each leaf router increases, need to be addressed in future work. Results under different kinds of traffic patterns also remain to be observed. Nevertheless, the observed experimental results show the superiority of the proposed topology over the other compared topologies with respect to performance-centric measures such as throughput, latency, load balancing, and packet loss rate, at the cost of additional area overhead.

REFERENCES

[1] J. Nurmi, “Network-on-Chip: A New Paradigm for System-on-Chip Design,” in International Symposium on System-on-Chip, pp. 2–6, 2005.

[2] V. Rantala, T. Lehtonen, and J. Plosila, “Network on Chip Routing Algorithms,” tech. rep., TUCS Technical Reports 779, Turku Centre for Computer Science, 2006.

[3] “Tilera Announces the world’s first 100-core processor.” Online, October 2009. Available: http://goo.gl/K9c85.

[4] U. of Glasgow, “Scientists Squeeze More Than 1,000 cores on to Computer Chip.” Online. Available: http://goo.gl/KdBbW.

[5] S. Kundu, R. P. Dasari, S. Chattopadhyay, and K. Manna, “Mesh-of-Tree Based Scalable Network-on-Chip Architecture,” in IEEE Region 10 Colloquium and the Third ICIIS, 2008.

[6] J. Chen, P. Gillard, and C. Li, “Performance evaluation of three Network-on-Chip (NoC) architectures (Invited),” in 1st IEEE International Conference on Communications in China (ICCC), pp. 91–96, 2012.

[7] C. J. Glass and L. M. Ni, “The Turn Model for Adaptive Routing,” Journal of the Association for Computing Machinery, vol. 41, pp. 874–902, September 1994.

[8] Z. Song, G. Ma, and D. Song, “Hierarchical Star: An Optimal NoC Topology for High-Performance SoC Design,” in International Multi-symposiums on Computer and Computational Sciences (IMSCCS ’08), pp. 158–163, 2008.

[9] K.-J. Chen, C.-H. Peng, and F. Lai, “Star-type architecture with low transmission latency for a 2D mesh NOC,” in IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 919–922, 2010.

[10] P. Ghosal and T. S. Das, “L2STAR: A Star Type level-2 2D Mesh architecture for NoC,” in Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics (PrimeAsia), pp. 155–159, 2012.

[11] M. Saneei, A. Afzali-Kusha, and Z. Navabi, “Low-Latency Multi-Level Mesh Topology for NoCs,” in The 18th International Conference on Microelectronics (ICM), pp. 36–39, 2006.

[12] P. Ghosal and T. S. Das, “Network-on-chip routing using Structural Diametrical 2D mesh architecture,” in Third International Conference on Emerging Applications of Information Technology (EAIT), pp. 471–474, 2012.

[13] Y. Fukushima, M. Fukushi, I. E. Yairi, and T. Hattori, “A Hardware-Oriented Fault-Tolerant Routing Algorithm for Irregular 2D-Mesh Network-on-Chip without Virtual Channels,” in IEEE 25th International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT), pp. 52–59, 2010.

[14] Y. M. Boura and C. R. Das, “Fault-tolerant routing in mesh networks,” in International Conference on Parallel Processing, pp. I.106–I.109, 1995.

[15] D. H. Linder and J. C. Harden, “An adaptive and fault-tolerant wormhole routing strategies for k-ary n-cubes,” IEEE Transactions on Computer, vol. 40, pp. 2–12, 1991.

[16] A. A. Chien, “A cost and speed model for k-ary n-cube wormhole routers,” in Hot Interconnects 93, 1993.

[17] “The Network Simulator-NS-2.” Online. Available: http://www.isi.edu/nsnam/ns/.

[18] M. Ali, M. Welzl, A. Adnan, and F. Nadeem, “Using the NS-2 Network Simulator for Evaluating Network on Chips (NoC),” in International Conference on Emerging Technologies (ICET ’06), pp. 506–512, 2006.

[19] Y.-R. Sun, S. Kumar, and A. Jain, “Simulation and Evaluation for a network on chip architecture using NS-2,” in 20th NORCHIP conference, 2002.

[20] S. Suboh, M. Bakhouya, J. Gaber, and T. El-Ghazawi, “Analytical modeling and evaluation of network-on-chip architectures,” in International Conference on High Performance Computing and Simulation (HPCS), pp. 615–622, 2010.


