
Memory Access Aware Mapping for Networks-on-Chip

    Xi Jin∗, Nan Guan∗†, Qingxu Deng∗, Wang Yi∗†

∗ Institute of Computer Software, Northeastern University, China

† Department of Information Technology, Uppsala University, Sweden

    Abstract

Networks-on-Chip (NoC) has been introduced to offer high on-chip communication bandwidth for large-scale multi-core systems. However, the communication bandwidth between NoC chips and off-chip memories is relatively low, which seriously limits the overall system performance. Optimizing the off-chip memory communication efficiency is therefore a crucial issue in the NoC system design flow. In this paper, we present a memory access aware mapping algorithm for NoC, which exploits SDRAM access parallelization in order to offer higher off-chip memory communication efficiency and, eventually, higher overall system performance. To the best of our knowledge, this is the first work to consider off-chip memory communication efficiency in application mapping on NoC. Experimental results show that, compared with classical NoC mapping algorithms, our algorithm can significantly improve the memory utilization and overall system throughput (on average a 60% improvement).

    I. Introduction

More and more cores are being integrated on a single die to offer high-computing-capacity, low-power-consumption processors. Conventional point-to-point and bus-based communication mechanisms cannot sustain the rapidly increasing on-chip communication. Networks-on-Chip (NoC), which connects cores together by networks and offers packet-switched communication among cores, provides very high on-chip communication bandwidth [5], [2]. It has been widely accepted that the NoC paradigm will be the default design choice for future large-scale multi-core processors.

In contrast with the high on-chip communication bandwidth, the bandwidth between NoC chips and off-chip memory systems is still relatively low, and is usually the bottleneck of the overall system performance. SDRAM is commonly used as the off-chip memory because of its large storage density and high access speed. SDRAM has a “3D” structure: a memory chip consists of several banks, and each bank is a grid of columns and rows. Accesses to different banks of an SDRAM can be served simultaneously, so exploiting this access parallelization is key to improving the communication efficiency of the SDRAM off-chip memory. Previous work [13] has used specific routers to better organize the SDRAM accesses and improve the memory communication efficiency. However, this approach cannot be generally applied to common NoC systems as it requires extra hardware support. Even if the NoC hardware is customized to include such hardware support, the system design flexibility is still limited to a certain extent. For example, the routers in [13] need to be modified if the designer replaces the SDRAM with a new one with a different number of banks. Instead of the hardware-based approach, we are interested in software-based approaches that can be applied to common NoC architectures and provide larger design flexibility.

Due to the network-like architecture of NoC, a critical step in system design on NoC is application mapping, i.e., deciding how to allocate each component of the whole application to NoC nodes (a processing element in the SoC is called a node). The mapping problem for NoC has been intensively studied to optimize the on-chip communication efficiency. However, to the best of our knowledge, no previous work on NoC application mapping algorithms has considered the issue of off-chip memory communication efficiency.

In this paper, we present a mapping algorithm which optimizes not only the on-chip communication, but also the off-chip communication efficiency. Our algorithm does not require any special hardware support; it only assumes round-robin routers with source-based routing functionality, which are very commonly used in NoC [7], [3], [24].

The main idea of our algorithm is to organize the memory accesses in such a way that the round-robin router alternately sends the accesses to different banks, so that the accesses can be served in parallel as much as possible. Experiments show that our algorithm can significantly improve the memory access efficiency, and eventually the overall system performance, compared with classical NoC mapping algorithms [10], [19]. Note that our algorithm focuses on optimizing the access parallelization to SDRAM and is orthogonal to the access locality of an application, so it can be used with both the open-page and close-page [12] modes of SDRAM.

The rest of this paper is organized as follows. Section II introduces the related work. Section III reviews the SDRAM architecture. Section IV describes the problem model. Section V introduces our proposed algorithm. Section VI presents the experimental results. The conclusion and future work are presented in Section VII.

II. Related Work

There have been a number of works on increasing memory communication efficiency by providing sophisticated memory controllers. In [20], Mutlu et al. design a parallelism-aware batch scheduler (PAR-BS) that maximizes bank parallelism. In [17], Macian et al. present a manager architecture that provides secure access to shared resources (memory, communication channels, CPUs) and QoS guarantees. In [15], Lee et al. present a multilayer, quality-aware memory controller for multimedia platform SoCs; the goal of this memory controller is to provide not only high DRAM utilization but also QoS guarantees. In [9], an SDRAM controller IP is presented which supports access preemption and reordering to optimize bandwidth and average latency. Different from the above works, our approach is software-based and does not require special memory controller support.

Application mapping, as one of the central problems in the NoC system design flow, has been intensively studied by the research community. Here we provide a review of previous works. In [10], [11], Hu et al. propose a branch and bound algorithm to map a given set of IP cores onto a regular NoC architecture such that the total communication energy is minimized under specified performance constraints. Murali et al. in [19] present a fast mapping algorithm that maps the cores onto a mesh NoC architecture under bandwidth constraints, minimizing the average communication delay. A two-step genetic algorithm for mapping is presented in [16]: in the first step, the proposed algorithm maps the vertices of the task graph to available cores so that the overall execution time of the task graph is minimized; in the second step, the IP cores are mapped onto a fixed NoC chip. In [1], a heuristic algorithm based on a multi-objective genetic algorithm is proposed to optimize performance and energy. In [25], Zhou et al. propose a mapping and routing technique that optimizes the energy consumption and worst link load. An algorithm that works not only with homogeneous cores on regular mesh architectures but also with heterogeneous cores on irregular mesh or custom architectures is proposed in [14]. In [18], several different applications or use-cases are considered during the NoC design process, and a method is presented to efficiently map the applications onto the NoC architecture, satisfying the design constraints of each individual use-case. In [8], Hansson et al. propose a unified algorithm, called Unified MApping, Routing and Slot allocation (UMARS), that couples mapping, path selection and time-slot allocation using a single consistent objective. In [22], Shen et al. propose a fast and efficient binomial mapping and optimization algorithm (BMAP) that provides more economical network component mapping; their experimental results show that communication cost and average hop count are reduced. In [4], Chou et al. analyze the impact of network contention and propose a contention-aware application mapping problem which aims at minimizing the network contention and reducing packet latency. None of these previous works considers off-chip memory communication efficiency.

    III. SDRAM Background

SDRAM has a “3D” architecture as shown in Figure 1. An SDRAM chip includes several independent banks, such that memory accesses to different banks can be serviced in parallel. Each bank contains a two-dimensional structure of rows and columns. When a row is accessed, the entire row is transferred into the row buffer of this bank. Then a column access is performed in the row buffer. After the column access completes, the row buffer is written back to the bank memory by a bank precharge, ready for the next row activation. So an SDRAM access typically consists of three commands: row activation, column access and bank precharge; the bank precharge and row activation usually take more time than the column access.

Clearly the access order has a great effect on the communication efficiency of SDRAM. Researchers have proposed memory access scheduling to optimize memory performance [21]. Memory access scheduling can help in three aspects:

1) Gather the accesses to the same row and serve them consecutively, to reduce the row activation overhead.

2) Group write and read accesses to reduce the overhead of switching the transmission direction on the bidirectional data pins.

Figure 1. SDRAM architecture

Table I. Access parameters of the example

access   bank   row   column
A0       0      1     0
A1       0      0     1
A2       1      0     0
A3       1      1     0

3) Let the accesses to different banks be served simultaneously, to exploit the inter-bank parallelization.

The first aspect heavily depends on the locality of the memory accesses: it can be very helpful for accesses with good locality, but is much less effective for accesses with poor locality. For example, access locality in multi-core systems is usually weaker than in single-core systems, since different cores access different memory rows simultaneously.

The second aspect also depends on the characteristics of the accesses. The data pins are bidirectional and require several cycles to switch from read (write) to write (read). So grouping consecutive reads together and consecutive writes together reduces the switching overhead.

The third aspect is orthogonal to the first and second aspects. The example in Table I and Figure 2 illustrates its effect. Table I lists the bank, row and column number of the four accesses. For simplicity, we assume bank precharge, row activation and column access all require 2 cycles. In Figure 2, “B”, “R” and “C” denote bank precharge, row activation and column access respectively. For example, “B0” denotes the precharge for bank 0, and “R1” denotes the activation of row 1. In Figure 2 (a), the four memory accesses are performed in order. When A0 is performed, A1 can only wait until A0 is completed, since they go to the same bank but different rows. We call this situation bank conflict. A2 can be pipelined with A1 since they go to different banks, i.e., they use different row buffers. We call this situation bank interleaving.

Figure 2. The sequence of the four memory accesses without (a) and with (b) access optimization.

In Figure 2 (b), we change the access order to A0, A2, A1, A3. While A0 and A1 are performed, A2 and A3 are performed independently. From the figure we see that it takes 15 cycles to finish all the memory accesses in the original order, but only 11 cycles in the optimized order.
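To make the bank-conflict versus bank-interleaving trade-off concrete, the following sketch evaluates the two access orders of Table I under a deliberately simplified timing model of our own (not the cycle-accurate model behind Figure 2, so the absolute cycle counts need not match the figure); the point is only that the interleaved order finishes earlier.

```python
# A minimal sketch of the bank-interleaving effect of Table I and Figure 2.
# Simplifying assumptions: precharge, activation and column access each take
# 2 cycles (as in the example), accesses to the same bank are serialized, and
# the column accesses of all banks share the data pins in issue order.

PRECHARGE = ACTIVATE = COLUMN = 2   # cycles, taken from the example

def finish_time(order, bank_of):
    """order: access names in issue order; bank_of: access -> bank id."""
    bank_free = {}   # cycle at which each bank becomes idle again
    bus_free = 0     # cycle at which the shared data pins become idle
    for acc in order:
        b = bank_of[acc]
        start = bank_free.get(b, 0)                    # wait for the bank
        col_start = max(start + PRECHARGE + ACTIVATE,  # wait for the data pins
                        bus_free)
        done = col_start + COLUMN
        bank_free[b] = done
        bus_free = done
    return max(bank_free.values())

bank_of = {"A0": 0, "A1": 0, "A2": 1, "A3": 1}         # from Table I
print(finish_time(["A0", "A1", "A2", "A3"], bank_of))  # original order
print(finish_time(["A0", "A2", "A1", "A3"], bank_of))  # interleaved order: fewer cycles
```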

In this paper, we focus on exploring the third aspect, i.e., the parallelism of accesses to different SDRAM banks. For applications with different locality features, we may choose the open-page or close-page SDRAM mode to maximize the efficiency of the first aspect. In other words, the approach in this work can be used with both the open-page and close-page SDRAM modes.

    IV. Problem Model

We consider an application consisting of several tasks. Each task executes exclusively on a unique node, i.e., we aim at a one-to-one mapping between tasks and nodes.

We assume there are two types of communication in the system:

• On-chip direct communication between two tasks.
• Communication between a task and a memory block on the off-chip SDRAM.

A memory block is a part of the memory area; its size is less than or equal to that of the whole memory in the system. The communication workload of the application is characterized by G = 〈T, M, D, E, w〉:

Figure 3. An example of the application communication workload characterization

• T = {t0, t1, . . . } is the set of tasks.
• M = {m0, m1, . . . } is the set of memory blocks.
• D : T × T describes the direct communication. Each element di,j in D represents the on-chip communication between tasks ti and tj.
• E : T × M describes the communication to the off-chip memory. Each element ei,j in E represents the communication between task ti and memory block mj. Note that a task can communicate with several memory blocks, and a memory block can also communicate with several tasks (to model memory shared among several tasks); however, communication between two memory blocks is not allowed.
• w : D ∪ E → N is the communication workload function, which maps each element in D or E to a natural number representing the workload (in MB/s) of the task-task or task-block communication.

The communication workload characterization G can be viewed as an undirected graph with two types of vertices and two types of edges. For example, Figure 3 has three tasks and four memory blocks. The on-chip direct communication workload between t0 and t1 is 4. t0 and t1 share a memory block m1, and each of them has a communication workload of 50 to m1. t0 also has a private memory block m0, and the communication workload between t0 and m0 is 80.
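As an illustration, the part of the Figure 3 example that is spelled out above can be written down directly as the tuple G; the remaining edges of the figure are omitted since they are not given in the text.

```python
# A sketch of the workload characterization G = <T, M, D, E, w> for the edges
# of Figure 3 mentioned in the text (the other edges of the figure are omitted).

T = {"t0", "t1", "t2"}                           # tasks
M = {"m0", "m1", "m2", "m3"}                     # memory blocks
D = {("t0", "t1")}                               # on-chip direct communication
E = {("t0", "m0"), ("t0", "m1"), ("t1", "m1")}   # task-to-block communication

w = {                                            # workload in MB/s
    ("t0", "t1"): 4,                             # direct communication t0 <-> t1
    ("t0", "m0"): 80,                            # t0's private block m0
    ("t0", "m1"): 50,                            # t0 and t1 share block m1
    ("t1", "m1"): 50,
}
```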

We assume a 2D-mesh NoC, which is characterized by A = 〈N, nsnk〉:

• N is the node matrix, where each nx,y ∈ N is the node in the xth row and yth column of the 2D-mesh structure.
• nsnk ∈ N is a special node to which the off-chip memory is connected. Note that we assume there is one sink node in the system for simplicity; however, our approach can easily be extended to the case with multiple sink nodes.

We use L ⊆ N × N to denote the set of links. The links are implied by the 2D-mesh structure: each pair of nodes that are in the same row (column) and have adjacent column (row) numbers are connected by a link.

Figure 4. A 3 × 3 NoC

For example, the link li = 〈n0,1, n0,2〉 represents the link between nodes n0,1 and n0,2. For each link lx we know its bandwidth BW(lx).

We define all the links connected to the sink node nsnk to be tunnels, the set of which is denoted by α = {α0, α1, · · · }, and define all the nodes (except the sink node) connected by a tunnel to be repeaters, the set of which is denoted by γ = {γ0, γ1, · · · }. For example, Figure 4 shows a 3 × 3 NoC in which node n1,1 is the sink, and the four links connected to it are tunnels. Correspondingly, n0,1, n1,0, n1,2 and n2,1 are repeaters. Architecturally, tunnels and repeaters are the same as ordinary links and nodes; they differ only in how they are treated by our algorithm.
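For a concrete picture, the following sketch enumerates the tunnels and repeaters induced by a given sink position; the (row, column) encoding of nodes is an illustrative choice of our own.

```python
# A sketch of identifying tunnels and repeaters on a rows x cols 2D mesh,
# given the sink coordinate. Tunnels are the links incident to the sink;
# repeaters are the nodes at the other end of those links.

def tunnels_and_repeaters(rows, cols, sink):
    x, y = sink
    repeaters = [(nx, ny)
                 for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                 if 0 <= nx < rows and 0 <= ny < cols]
    tunnels = [(r, sink) for r in repeaters]
    return tunnels, repeaters

# For the 3 x 3 NoC of Figure 4 with sink n1,1 this yields the four repeaters
# n0,1, n2,1, n1,0 and n1,2, and the four tunnels connecting them to the sink.
print(tunnels_and_repeaters(3, 3, (1, 1)))
```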

We use P = (L \ α)∗ to denote the set of paths. Each path pq on the NoC is an ordered sequence of links {l1, l2, · · · } such that

• each pair of consecutive links lx and lx+1 in the path shares a node;
• no node appears more than once in a path, which means a path does not contain a cycle.

Note that a path does not include tunnels; the reason for this will be explained in Section V-B.

    V. Memory Access Aware Mapping

    A. Overall Strategy

The application mapping problem on NoC is in general intractable (NP-hard) [10]. Existing mapping algorithms rely on various heuristics to non-exhaustively search the design space and obtain suboptimal solutions. A straightforward way of taking the off-chip communication into the mapping problem would be to use the existing algorithms and add new criteria describing the off-chip communication to their optimization objectives and constraints. However, this approach does not work well. On one hand, the existing algorithms are optimized for particular objectives, which may not be suitable when the communication to the off-chip memory is taken into account.

Figure 5. Sending the accesses to the same bank (a) via different tunnels or (b) via the same tunnel

On the other hand, the new problem considering the memory communication has a much larger design space than before; without good exploration guidance, the heuristics can only cover a very small part of the whole design space within reasonable search time. In the following, we introduce the overall strategy of our proposed Memory-Aware Mapping algorithm (MA-MAP for short), which allows us to significantly narrow down the effective design space.

We first consider an example: an SDRAM memory with 4 banks is accessed via a router connected to 4 tunnels, and a memory access sequence {000111222333} arrives (each number denotes the destination bank of an access).

The router connected to the SDRAM picks accesses from the different links in round-robin order. If the system is designed such that the accesses to a bank are all sent via different tunnels, as shown in Figure 5-(a), then the resulting access order is:

    (start) 0 0 0 1 1 1 2 2 2 3 3 3 (end).

In this case, the SDRAM has to sequentially serve the three accesses to bank 0, then sequentially serve the accesses to bank 1, and so on. The memory access parallelism of this design is very low.

If instead we design the system such that the accesses to one bank are all sent via the same tunnel (i.e., via the same repeater), as shown in Figure 5-(b), then the accesses end up in the following order:

    (start) 0 1 2 3 0 1 2 3 0 1 2 3 (end)

Since accesses to different banks can be served in parallel, this is clearly the most efficient way to send these accesses to the memory.
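The effect of the round-robin arbitration can be reproduced with a small sketch; the per-tunnel queue contents used for case (a) are one possible assignment consistent with the example, chosen by us for illustration.

```python
# A sketch of the round-robin merge at the router next to the SDRAM (Figure 5).
# The router repeatedly takes one access from each non-empty tunnel in turn;
# each access is labelled with its destination bank.

from collections import deque

def round_robin_merge(tunnel_queues):
    queues = [deque(q) for q in tunnel_queues]
    merged = []
    while any(queues):
        for q in queues:
            if q:
                merged.append(q.popleft())
    return "".join(merged)

# Figure 5-(b): all accesses to one bank arrive via the same tunnel.
same_bank_per_tunnel = [list("000"), list("111"), list("222"), list("333")]
# Figure 5-(a): accesses to one bank spread over different tunnels
# (one assignment that reproduces the order in the text).
spread_over_tunnels = [list("012"), list("013"), list("023"), list("123")]

print(round_robin_merge(same_bank_per_tunnel))  # 012301230123 - banks interleaved
print(round_robin_merge(spread_over_tunnels))   # 000111222333 - banks served back-to-back
```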

The main strategy of our approach is to enforce that all the accesses to the same bank are sent via the same tunnel, in order to maximize the bank-interleaving parallelism. In other words, each memory bank is bound to a particular tunnel. Note that we allow more than one memory bank to be bound to the same tunnel in case the number of memory banks is larger than the number of tunnels. We further assume that all the memory banks are symmetric and that the bank-to-tunnel binding is fixed, i.e., as long as we know the destination memory bank of an access, it is clear which tunnel it will go through to reach the sink.

In the following we introduce how our mapping algorithm MA-MAP works in detail. In Section V-B we first present the formalization of the mapping problem (on the premise that all the accesses to the same bank are sent via the same tunnel); then in Section V-C we introduce our proposed algorithm MA-MAP to solve the formalized optimization problem.

B. Optimization Objective and Constraints

The major task of our algorithm is to find solutions for the following three mapping functions:

• ϕ : M → γ, mapping each memory block mi ∈ M to a repeater γh ∈ γ. Since each memory bank corresponds to one tunnel, and thereby to one repeater, this is equivalent to mapping each memory block mi to a memory bank.
• π : T → N , mapping each task tj ∈ T to a NoC node nj ∈ N .
• ρ : N × N → P , routing the communication between two NoC nodes.

The optimization objective is to minimize the total cost of both the on-chip direct communication and the communication to the off-chip memory:

Minimize: Cd + Cm    (1)

Cd, the total cost due to on-chip direct communication, is defined as

Cd = Σ∀di,j∈D w(di,j) × |ρ(π(ti), π(tj))|    (2)

in which π(ti) and π(tj) represent the nodes where ti and tj are mapped to, respectively; ρ(π(ti), π(tj)) is the path of the communication between these two nodes, and |ρ(π(ti), π(tj))| is the number of links along this path.

Cm, the total cost due to the communication to the off-chip memory, is defined as

Cm = Σ∀ei,j∈E w(ei,j) × (|ρ(π(ti), ϕ(mj))| + 1)    (3)

in which π(ti) is the node where ti is mapped to, ϕ(mj) is the repeater to which memory block mj is bound, and ρ(π(ti), ϕ(mj)) is the path of the communication between the task node and the repeater; |ρ(π(ti), ϕ(mj))| is the number of links along this path. We add 1 to the path length since there is one more hop, between the repeater and the sink node, for the communication to the memory.
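A direct transcription of the objective (1)-(3) into code can later serve as the fitness function of the genetic algorithm in Section V-C; the dictionary-based encoding of π, ϕ and ρ below is our own illustrative choice.

```python
# A sketch of the cost function Cd + Cm of equations (1)-(3). pi maps tasks to
# nodes, phi maps memory blocks to repeaters, and rho maps a (source, target)
# pair to its path, given as a list of links; D, E and w are as in Section IV.

def total_cost(D, E, w, pi, phi, rho):
    # Cd: on-chip direct communication cost, equation (2)
    Cd = sum(w[(ti, tj)] * len(rho[(pi[ti], pi[tj])]) for (ti, tj) in D)
    # Cm: off-chip memory communication cost, equation (3); the "+ 1" is the
    # extra hop over the tunnel between the repeater and the sink node
    Cm = sum(w[(ti, mj)] * (len(rho[(pi[ti], phi[mj])]) + 1) for (ti, mj) in E)
    return Cd + Cm
```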

The mapping should respect the following constraints:

• Task Mapping Constraint: Each node can host at most one task:

∀ti ≠ tj ∈ T : π(ti) ≠ π(tj)    (4)

• Link Load Constraint: The total workload on each link (except tunnels) cannot exceed the link bandwidth:

∀lx ∈ L : BW(lx) ≥ WLd(lx) + WLm(lx)    (5)

BW(lx) is the bandwidth of link lx (a known parameter). WLd(lx) and WLm(lx) are the total workloads on link lx due to direct communication and memory communication respectively, defined as

WLd(lx) = Σ∀di,j∈D w(di,j) × f(lx, ρ(π(ti), π(tj)))

WLm(lx) = Σ∀ei,j∈E w(ei,j) × f(lx, ρ(π(ti), ϕ(mj)))

where f(lx, pq) examines whether link lx is in the path pq:

f(lx, pq) = 1 if lx ∈ pq, and 0 otherwise.

• Tunnel Load Constraint: The total workload on each tunnel cannot exceed the bandwidth of the tunnel. As defined in Section IV, a path does not include tunnels. In other words, for each memory communication ei,j, the path ρ(π(ti), ϕ(mj)) only includes the part from the node hosting ti to the repeater bound to mj, but not the last hop from the repeater to the sink node. So the Link Load Constraint has no effect on the workload on the tunnels. Therefore the following constraint is added to limit the total workload on tunnels:

∀αy ∈ α : BW(αy) ≥ WLm(αy)    (6)

Similarly, BW(αy) is the bandwidth of tunnel αy (a known parameter), and WLm(αy) is the total workload on tunnel αy, defined as

WLm(αy) = Σ∀ei,j∈E w(ei,j) × g(αy, ϕ(mj))

where g(αy, γz) examines whether tunnel αy is connected to repeater γz:

g(αy, γz) = 1 if αy connects to γz, and 0 otherwise.
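The three constraints can be checked mechanically for a candidate solution; the sketch below assumes the same dictionary encoding as before, plus bandwidth tables for links and tunnels and a map from each repeater to its tunnel (all illustrative names).

```python
# A sketch of the feasibility checks for constraints (4), (5) and (6).
# link_bw / tunnel_bw map each link / tunnel to its bandwidth BW; tunnel_of
# maps each repeater to the tunnel connecting it to the sink node.

def feasible(D, E, w, pi, phi, rho, link_bw, tunnel_bw, tunnel_of):
    # (4) Task Mapping Constraint: at most one task per node
    if len(set(pi.values())) != len(pi):
        return False

    # (5) Link Load Constraint: direct + memory traffic on every ordinary link
    load = {lx: 0 for lx in link_bw}
    for (ti, tj) in D:
        for lx in rho[(pi[ti], pi[tj])]:
            load[lx] += w[(ti, tj)]
    for (ti, mj) in E:
        for lx in rho[(pi[ti], phi[mj])]:
            load[lx] += w[(ti, mj)]
    if any(load[lx] > link_bw[lx] for lx in link_bw):
        return False

    # (6) Tunnel Load Constraint: all memory traffic bound behind each tunnel
    tload = {ay: 0 for ay in tunnel_bw}
    for (ti, mj) in E:
        tload[tunnel_of[phi[mj]]] += w[(ti, mj)]
    return all(tload[ay] <= tunnel_bw[ay] for ay in tunnel_bw)
```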

    C. Mapping Algorithm

Now we present our algorithm MA-MAP to solve the optimization problem introduced in Section V-B. Recall that our target is to find solutions for the three mapping functions ϕ, π and ρ. The overall structure of our algorithm is shown in Algorithm 1. The algorithm first solves the block-to-repeater mapping (function ϕ); it then uses a genetic algorithm to explore the solution space of the task-to-node mapping (function π). The routing problem ρ is solved within the genetic algorithm framework, to choose the routing for each candidate task-to-node mapping.

1: Solving ϕ: binding each block to a repeater
// Genetic algorithm, to explore the solution space of π (mapping tasks to nodes)
2: Randomly choose the initial population
3: for a certain number of iterations do
4:     Fixing the chromosomes violating the Task Mapping Constraint (4)
5:     Routing for each chromosome (solving ρ for each chromosome)
6:     Calculating the fitness function Cd + Cm for each chromosome
7:     Selection
8:     Crossover and Mutation
9: end for

Algorithm 1: The pseudo-code of MA-MAP.

1. ϕ: Binding Blocks to Repeaters

Recall that the main idea of our approach is to send the accesses to the same memory bank via the same repeater, by which we make the round-robin router automatically send the accesses to different banks in turn, in order to exploit the access parallelization. However, this alone is not enough to obtain good access parallelism: consider the situation in which the workload sent via one repeater is very high and the workloads sent via the other repeaters are all very low; then most of the time the router will still only send accesses from the same repeater, which leads to poor parallelism. So one can see that another important condition is that the workload sent via different repeaters should be as balanced as possible.

In order to have a balanced workload division over the repeaters, we first partition the whole application into |γ| partitions (|γ| is the number of repeaters) such that the memory communication workloads included in different partitions are as equal as possible. At the same time, the Tunnel Load Constraint (6) should be respected, so that the total workload of each partition does not exceed the tunnel bandwidth. Such a dividing problem is

actually the bin-packing problem, which is NP-hard in the strong sense. Many heuristics have been proposed for the bin-packing problem. In our work, we choose the well-known decreasing-size worst-fit algorithm to solve the workload dividing problem:

• For each block mj, we calculate the total workload of the communication to this block:

δ(mj) = Σ∀ei,j w(ei,j)

i.e., the sum over all tasks ti that communicate with mj.
• We sort all the blocks in decreasing order of δ(mj) and assign them to partitions in that order.
• At each step, the partition with the least assigned total δ(mj) is chosen to receive the currently selected block, as sketched below.
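A sketch of this decreasing-size worst-fit partitioning is given below; for brevity it balances the per-partition workload only and omits the explicit check against the tunnel bandwidth mentioned above.

```python
# A sketch of the decreasing-size worst-fit division of memory blocks into
# |gamma| partitions, balancing the total memory communication workload.

def partition_blocks(M, E, w, num_partitions):
    # delta(mj): total communication workload of block mj
    delta = {mj: sum(w[(ti, mk)] for (ti, mk) in E if mk == mj) for mj in M}
    partitions = [[] for _ in range(num_partitions)]
    load = [0] * num_partitions
    # largest blocks first, each into the currently least-loaded partition
    for mj in sorted(M, key=lambda m: delta[m], reverse=True):
        k = load.index(min(load))
        partitions[k].append(mj)
        load[k] += delta[mj]
    return partitions, load
```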

Figure 6 shows an example in which an application including 8 tasks and 16 memory blocks is divided into 4 partitions.

The next step is to bind each partition to a repeater. In general, the sink node may not be at the center of the NoC, so the capabilities of different repeaters to reach other nodes are not symmetric. For example, on a 3 × 4 NoC, if the sink node is n1,1, then the nodes n0,1, n1,0, n1,2 and n2,1 are repeaters. One can see that n1,2 is in a better position than the others, since it has a better chance of reaching other nodes over a shorter distance. Therefore we should consider binding the partition related to more tasks to n1,2.

We define the following concepts to formally specify how our algorithm works based on the above observation. We first define the number of related tasks of a partition as the number of tasks that communicate with at least one of the memory blocks in this partition. For example, with the partitioning in Figure 6, the number of related tasks of Partition 3 is 5. We define the total reachable distance of a repeater as the sum of the distances between this repeater and all other nodes (excluding the sink node) in the system. Note that a smaller total reachable distance indicates a better location. For example, on a 3 × 4 NoC, the total reachable distance of the repeater n1,0 is 1 + 1 + 2 + 2 + 2 + 3 + 3 + 3 + 4 + 4 = 25.

Our algorithm binds a partition with a larger number of related tasks to a repeater with a smaller total reachable distance. For example, Figure 7 shows how the resulting partitions of Figure 6 are bound to the four repeaters on a 3 × 4 NoC.
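This binding rule can be sketched as a simple sort-and-zip; Manhattan distance on the mesh is used here as a stand-in for the hop distance (it ignores any detour around the sink node, so it is only an approximation of the distances used above).

```python
# A sketch of binding partitions to repeaters: partitions with more related
# tasks go to repeaters with smaller total reachable distance.

def bind_partitions(partitions, E, repeaters, nodes, sink):
    def related_tasks(part):
        blocks = set(part)
        return len({ti for (ti, mj) in E if mj in blocks})

    def total_reach(r):
        # sum of Manhattan distances to all other nodes, excluding the sink
        return sum(abs(r[0] - n[0]) + abs(r[1] - n[1])
                   for n in nodes if n != r and n != sink)

    by_tasks = sorted(partitions, key=related_tasks, reverse=True)
    by_reach = sorted(repeaters, key=total_reach)
    return dict(zip(by_reach, by_tasks))    # repeater -> partition
```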

2. ρ: Routing

As introduced at the beginning of Section V-C, the routing is embedded into the genetic algorithm framework. In the following, we introduce how to solve the ρ function given a particular task-to-node mapping (i.e., how to route the paths for each chromosome of the genetic algorithm).

Figure 6. Dividing an application into four partitions

Figure 7. The mapping between the partitions and tunnels

In our problem, since the last hop of any memory communication is via a tunnel, the routing actually consists of two steps: the first step is to route all the hops except the last hop of each communication flow; the second step is to add the last hop of each communication flow that passes through a tunnel. A naive approach is to solve the first step by standard 2D-mesh routing algorithms and then add the last hops on top. However, this approach does not work, since we have to reserve the tunnels for all the last hops. If the routing algorithm in the first step can arbitrarily use the tunnels, it will destroy our plan of using the sink node to automatically send accesses to different banks in turn. Even worse, this would cause deadlock even if we use deadlock-free routing algorithms, since the last hop is out of the routing algorithm's control.

To solve this problem, we exclude all the tunnels from the first-step routing (this explains why, in Section IV, paths are defined so that tunnels are not included). The result of excluding the tunnels from the routing is that, from the routing point of view, the NoC topology is no longer strictly a 2D mesh.

We use the algorithm in [6] as our routing algorithm and enforce the Link Load Constraint (5) introduced in Section V-B. This algorithm is a modification of the negative-first routing algorithm that solves routing on a 2D mesh containing faulty nodes. To use this algorithm in our problem, we simply let the sink node be a faulty node, and all the tunnels are then automatically excluded. The whole path from the source node to the sink node is contained in the packet header, since we assume routers with source-based routing functionality (see Section I).
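We do not reproduce the fault-tolerant routing algorithm of [6] here; as a much simpler stand-in for illustration, the sketch below routes a single flow by breadth-first search on the mesh with the sink node removed, which likewise keeps the resulting path off the tunnels (link-load bookkeeping is omitted).

```python
# A simplified stand-in for the first-step routing: shortest path by BFS on a
# rows x cols mesh in which the sink node is treated as faulty (removed), so
# the returned path never uses a tunnel. Nodes are (row, column) pairs.

from collections import deque

def route_avoiding_sink(src, dst, rows, cols, sink):
    def neighbors(node):
        x, y = node
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if 0 <= nx < rows and 0 <= ny < cols and (nx, ny) != sink:
                yield (nx, ny)

    prev, frontier = {src: None}, deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            break
        for nxt in neighbors(node):
            if nxt not in prev:
                prev[nxt] = node
                frontier.append(nxt)
    if dst not in prev:
        return None                      # unreachable once the sink is removed
    path, node = [], dst
    while prev[node] is not None:        # rebuild the path as a list of links
        path.append((prev[node], node))
        node = prev[node]
    return list(reversed(path))
```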

3. π: Mapping Tasks to NoC Nodes

Our algorithm uses a classical genetic algorithm to explore the solution space of π. Each chromosome represents a design plan of the task mapping to NoC nodes, which respects the three constraints introduced in Section V-B. The length of a chromosome is the number of tasks, and the value of the ith gene in a chromosome represents the node hosting task ti.

At the beginning, we randomly choose a certain number (decided by the system designer) of chromosomes as the initial population. At each iteration, we first fix the chromosomes violating the Task Mapping Constraint (4), i.e., those in which multiple tasks are mapped to the same node: we scan the genes of a chromosome, and if we see a NoC node that has already appeared in this chromosome, we replace it by an arbitrary node that has not yet appeared in this chromosome. Then we invoke the routing algorithm introduced above for each chromosome, and calculate the fitness function of each. Next we select the same number of chromosomes as in the initial population, namely those with the smallest total cost Cd + Cm. Then we perform the crossover (two-point crossover [23] is used) and mutation operations, and enter the next iteration. This procedure terminates after a certain number of iterations pre-defined by the system designer. At the end, the best individual in the last generation of the population is our final result.
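The repair step described above can be sketched as follows, using an illustrative encoding in which a chromosome is simply a list whose i-th entry is the node hosting task ti.

```python
# A sketch of repairing a chromosome that violates the Task Mapping
# Constraint (4): every node that already hosts a task earlier in the
# chromosome is replaced by an arbitrary node that does not appear yet
# (assuming, as the one-to-one mapping requires, at least as many nodes as tasks).

import random

def repair(chromosome, all_nodes):
    unused = [n for n in all_nodes if n not in set(chromosome)]
    random.shuffle(unused)               # "arbitrary" replacement nodes
    seen, fixed = set(), []
    for node in chromosome:
        if node in seen:                 # duplicate: node already hosts a task
            node = unused.pop()
        seen.add(node)
        fixed.append(node)
    return fixed
```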

VI. Experimental Results

We compare MA-MAP with two classical algorithms: the partial branch-and-bound algorithm (PBB) [10] and NMAP [19]. PBB efficiently explores the search tree representing the whole search space and maps the IPs to the tiles so that the total communication energy consumption is minimized under the bandwidth constraint. NMAP is a fast heuristic mapping algorithm that optimizes bandwidth and packet latency. Neither of these two classical algorithms considers memory communication efficiency. Our experiments show how much benefit one can get by taking the memory communication into account in the mapping.

We use two metrics in the comparison: memory utilization and system throughput. Memory utilization is defined as the number of accesses served by the memory per time unit. System throughput is defined as the number of times the application can complete per time unit.

Figure 8. The memory utilization (a) and system throughput (b) comparison of PBB, NMAP and MA-MAP.

The experiments are conducted with our NoC system simulator, which also supports high-level simulation of SDRAM. We configure the hardware platform as a 2D-mesh NoC processor with size from 4 × 4 to 10 × 10, with both the open-page and close-page SDRAM modes. The memory controller we use has a classical architecture, described on page 498 of [12]. We assume there are 4 banks in our system.

Figure 8 shows the experimental results under the optimal buffer size. Figure 8-(a) shows the memory utilization comparison with the close-page SDRAM mode. The figure shows the normalized memory utilizations of PBB, NMAP and MA-MAP for different NoC sizes. We can see that with our algorithm MA-MAP the memory utilization improvement over PBB and NMAP is significant (on average 60%). We have also conducted the same experiments with the open-page SDRAM mode and obtained similar improvements (figures omitted here). Figure 8-(b) shows how the memory utilization improvement translates into overall system performance.

Figure 9. The communication cost comparison of PBB, NMAP and MA-MAP

Figure 10. The memory utilization comparison with different buffer sizes.

(The figure shows the result under the close-page SDRAM mode; the result with the open-page SDRAM mode is similar.) From the above results we can see that considering the memory access optimization in the mapping algorithm can significantly improve the memory utilization and overall system throughput (on average a 60% improvement in our experiments).

Figure 9 shows the comparison of the communication cost, which is defined as Cd + Cm (see Section V). The purpose of this experiment is to show the effect of considering the memory access efficiency on the mapping optimality from a pure topology point of view. The experiment shows that our algorithm MA-MAP only leads to a slightly larger communication cost (the average loss is 5.3% and 9.7% compared to PBB and NMAP, respectively). This is the price we pay for considering the memory access efficiency in the mapping. However, since our algorithm significantly improves the memory utilization, the overall system performance is much higher with MA-MAP.

Figure 10 shows the memory utilization comparison of the three algorithms with different sizes of the bank request buffer (also called the bank queue), with the close-page SDRAM mode (the result with the open-page SDRAM mode is similar). Each value is the average of the memory utilization over the 4 × 4 to 10 × 10 NoCs. From Figure 10 we can see that our algorithm is more efficient with memory controllers with smaller buffers, which may be preferred in systems with limited cost and power budgets.

VII. Conclusion and Future Work

While NoC processors provide very high on-chip communication bandwidth, the bandwidth between the NoC chip and the off-chip memory is still relatively low, so optimizing the off-chip memory communication efficiency is very important for the overall NoC system performance. In this paper, we proposed a software-based approach to this problem by integrating the memory efficiency issue into the NoC mapping. We introduced a memory access aware mapping algorithm, MA-MAP, which exploits SDRAM access parallelization in order to achieve higher off-chip memory communication efficiency and, eventually, higher overall system performance. Experimental results showed that, compared with classical NoC mapping algorithms, our algorithm can significantly improve the memory utilization and overall system throughput (on average a 60% improvement) while introducing only a small communication cost overhead. As future work, we will evaluate the performance of our proposed approach on realistic applications and benchmarks. In this paper, we focused on 2D-mesh NoC; we would like to study memory access aware mapping algorithms for other topologies in future work.

    Acknowledgment

This work is partially supported by the NSF of China under Grant No. 60973017 and the Fundamental Research Funds for the Central Universities under Grants N100604011 and N100204001.

    References

[1] G. Ascia, V. Catania, and M. Palesi. Multi-objective mapping for mesh-based NoC architectures. In International Conference on Hardware/Software Codesign and System Synthesis, pages 182–187, 2004.

[2] L. Benini and G. D. Micheli. Networks on chips: a new SoC paradigm. IEEE Computer, 35(1):70–78, January 2002.

[3] D. Bertozzi and L. Benini. Xpipes: A network-on-chip architecture for gigascale systems-on-chip. IEEE Circuits and Systems Magazine, 4(2):18–31, 2004.

[4] C. Chou and R. Marculescu. Contention-aware application mapping for network-on-chip communication architectures. In IEEE International Conference on Computer Design (ICCD), pages 164–169, 2008.

[5] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proceedings of the Design Automation Conference, pages 648–649, June 2002.

[6] C. J. Glass and L. M. Ni. Fault-tolerant wormhole routing in meshes without virtual channels. IEEE Transactions on Parallel and Distributed Systems, 7(6):620–636, 1996.

[7] K. Goossens, J. Dielissen, and A. Radulescu. Æthereal network on chip: concepts, architectures, and implementations. IEEE Design & Test of Computers, 22(5):414–421, 2005.

[8] K. Goossens, A. Radulescu, and A. Hansson. A unified approach to constrained mapping and routing on network-on-chip architectures. pages 75–80, 2005.

[9] S. Heithecker and R. Ernst. Traffic shaping for an FPGA based SDRAM controller with complex QoS requirements. In Proceedings of the 42nd Design Automation Conference, pages 575–578. ACM, 2005.

[10] J. Hu and R. Marculescu. Energy-aware mapping for tile-based NoC architectures under performance constraints. In Proceedings of the Asia and South Pacific Design Automation Conference, pages 233–239, 2003.

[11] J. Hu and R. Marculescu. Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures. In Design, Automation and Test in Europe Conference and Exhibition, pages 688–693, 2003.

[12] B. Jacob, S. Ng, and D. Wang. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann, 2007.

[13] W. Jang and D. Z. Pan. An SDRAM-aware router for networks-on-chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(10):1572–1585, 2009.

[14] W. Jang and D. Z. Pan. A3MAP: Architecture-aware analytic mapping for networks-on-chip. In Proceedings of the 15th Asia and South Pacific Design Automation Conference, pages 523–528, 2010.

[15] K. Lee, T. Lin, and C. Jen. An efficient quality-aware memory controller for multimedia platform SoC. IEEE Transactions on Circuits and Systems for Video Technology, 15(5):620–633, 2005.

[16] T. Lei and S. Kumar. A two-step genetic algorithm for mapping task graphs to a network on chip architecture. In Proceedings of the Euromicro Symposium on Digital System Design, pages 180–187, 2003.

[17] C. Macian, S. Dharmapurikar, and J. Lockwood. Beyond performance: secure and fair memory management for multiple systems on a chip. In IEEE International Conference on Field-Programmable Technology (FPT), pages 348–351, 2003.

[18] S. Murali, M. Coenen, A. Radulescu, and K. Goossens. Mapping and configuration methods for multi-use-case networks on chips. In Proceedings of the 11th Asia and South Pacific Design Automation Conference, pages 146–151, 2006.

[19] S. Murali and G. D. Micheli. Bandwidth-constrained mapping of cores onto NoC architectures. In Design, Automation and Test in Europe Conference and Exhibition, volume 2, pages 896–901, 2004.

[20] O. Mutlu and T. Moscibroda. Parallelism-aware batch scheduling: Enabling high-performance and fair shared memory controllers. IEEE Micro, 29(1):22–32, 2009.

[21] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory access scheduling. In Proceedings of the 27th International Symposium on Computer Architecture, pages 128–138, 2000.

[22] W. Shen, C. Chao, Y. Lien, and A. Wu. A new binomial mapping and optimization algorithm for reduced-complexity mesh-based on-chip network. 2007.

[23] M. Srinivas and L. M. Patnaik. Genetic algorithms: A survey. Computer, 27(6):17–26, 1994.

[24] C. A. Zeferino, M. E. Kreutz, and A. A. Susin. RASoC: A router soft-core for networks-on-chip. 3:189–203, February 2004.

[25] W. Zhou, Y. Zhang, and Z. Mao. Link-load balance aware mapping and routing for NoC. WSEAS Transactions on Circuits and Systems, 6(11):583–591, 2007.