
TRITA-ICT-EX-2009:157

Master Thesis in Electronic System Design

Design and Implementation of AXI-based Network-on-Chip Systems for Flow Regulation

Jiayi Zhang

September 2009

Supervisor: Dr. Zhonghai Lu

Examiner: Dr. Zhonghai Lu and Prof. Axel Jantsch


Abstract

In Network-on-Chip (NoC) design, controlling Quality-of-Service is crucial for building predictable systems. In this project, we design and implement an AXI-based system on the Nostrum NoC, which features a 2D mesh topology and deflection routing. The main components we add to the Nostrum NoC are master and slave interfaces. The master interface performs packetization, queuing and multiplexing. The slave interface performs de-packetization, queuing, de-multiplexing and, in particular, reordering of transfers. We also build master and slave modules to serve as traffic generators and sinks. One particular feature of the master module is that it can regulate traffic burstiness while generating traffic. All these models are implemented in VHDL at the register-transfer level (RTL). The interface protocol of the masters and slaves is AXI from ARM.

With the above components, we designed experiments to show the effect of traffic regulation. Our results show that traffic with higher burstiness results in larger transfer delay and larger backlog. We conclude that transfer delay and backlog can be controlled to some degree in a best-effort network by regulating the traffic burstiness.


Acknowledgements

I would like to thank my examiners, Professor Axel Jantsch and Dr. Zhonghai Lu, for giving me the opportunity to work on my master thesis in their research group. I want to thank Dr. Zhonghai Lu for his forgiveness and continuous support, and for his patient guidance, from which I have learnt the fundamentals of how to do research. I would also like to thank my parents in China. Without them I would have been able neither to attend the master program at KTH nor to finish this thesis.


Abbreviations

NoC Network on Chip

AXI Advanced eXtensible Interface

SOC System On Chip

IC Integrated Circuit

QoS Quality of Service

API Application Programming Interface

OEM Original Equipment Manufacturer

IDE Integrated Development Environment

RTOS Real-Time Operating System

ARM Advanced RISC Machines

RISC Reduced Instruction Set Computer

MI Master Interface

SI Slave Interface


TABLE OF CONTENTS

1 Introduction
    1.1 Background
    1.2 Project Overview
    1.3 Thesis Structure
2 NoC System Description
    2.1 System Overview
    2.2 AXI Master and Slave
    2.3 Master Interface (MI)
        2.3.1 Write Channel Interface
        2.3.2 Read Channel Interface
        2.3.3 Master side Mux
        2.3.4 Master to NOC FIFO
        2.3.5 NOC to Master FIFO
    2.4 Slave Interface (SI)
        2.4.1 Slave side write interface
        2.4.2 Slave side read channel interface
    2.5 Flow Regulation
    2.6 System Integration
3 Experiments
    3.1 Experimental Setup
        3.1.1 Purpose
        3.1.2 Infrastructure of the Experiment
        3.1.3 Evaluation Statistics
    3.2 4 Masters 2 Slaves (4M2S)
        3.2.1 Experiment results of ρ=0.2
        3.2.2 Experiment results of ρ=0.5
        3.2.3 Comparison between ρ=0.2 and ρ=0.5
    3.3 8 Masters 4 Slaves (8M4S)
        3.3.1 Experiment results of ρ=0.2
        3.3.2 Experiment results of ρ=0.5
        3.3.3 Comparison between ρ=0.2 and ρ=0.5
    3.4 Comparison between 4M2S and 8M4S
        3.4.1 Comparison of ρ=0.2
        3.4.2 Comparison of ρ=0.5
4 Conclusion
    4.1 Summary
    4.2 Future Work
REFERENCES


1 Introduction

1.1 Background

Network-on-Chip (NoC) has been proposed to address the scalability challenges of buses in clock frequency, bandwidth and power consumption. Research on NoC started around the year 2000 and it has been an active research area ever since.

The NoC group at KTH has proposed the Nostrum NoC. The Nostrum NoC is a packet-switched dropless network, which features a 2D mesh topology and deflection routing. As an example, a 2 x 2 network is shown in Figure 1. The routers do not have buffer queues. When contention for a shared link occurs, the router deflects the packets that lose the link arbitration to unfavored links.

Figure 1. An example of Nostrum NoC (Source: The Nostrum homepage)

Quality-of-Service (QoS) has been a main concern for on-chip network

research because routing packets may bring non-deterministic behavior, thus

uncertainties in delay and jitter. For real time applications which require

guarantees under even worst-case conditions, this is not acceptable. At KTH,

the concept of flow regulation has been proposed to control QoS by traffic

shaping. This concept is based on the network calculus theory. It has been

demonstrated in [3] that the flow regulation can be applied to reduce delay and

backlog for a NoC with guaranteed services.


1.2 Project Overview

This project is set up to investigate the effect of flow regulation on Nostrum NoCs. To realize this, we build a NoC-based system, which consists of masters, slaves, master interfaces (MIs), slave interfaces (SIs) and the Nostrum NoC. The MIs and SIs connect the masters and slaves to the Nostrum NoC, respectively. The Nostrum router is already available as a VHDL model at RTL. The masters and slaves are used mainly for testing purposes. We have focused on three main tasks:

1. Design and implement the MI and SI

2. Design and implement masters and slaves

3. Experiment on the effect of flow regulation

Since a typical master/slave IP has a standard interface, such as AXI or OCP, we have chosen to use AXI in our project. This means that the MI and SI are specific to the AXI protocol.

1.3 Thesis Structure

Chapter 1 Introduction: This chapter briefly describes the project, including its background, objectives and design tasks.

Chapter 2 NoC System: This chapter describes the NoC system constructed in the project. The system consists of masters, slaves, and master and slave interfaces connected to the Nostrum network.

Chapter 3 Experiments: This chapter describes the experiments for investigating the effect of flow regulation on the Nostrum network and analyzes the results.

Chapter 4 Conclusion: This chapter summarizes the project work and points out a few directions for extending it.


2 NoC System Description

This chapter first describes the whole NoC system, and then details the

hardware modules.

2.1 System Overview

Figure 2 depicts the NoC system.

Figure 2. Nostrum overview [1]

The Nostrum network uses few buffers, but it does not guarantee in-order delivery. This means that a sequence of packets sent in the order P1, P2 … PN may be received in a different order, such as P2, P1, P3, P8 … PN. However, a slave module, typically a memory controller plus a memory, is not aware of the packet sequence. This implies that the packets sent to the slave module must be re-ordered into the right sequence, P1, P2 … PN, before being transmitted to the slave.

2.2 AXI Master and Slave

AXI is the next-generation high-performance interconnect interface from ARM. It features five parallel channels: three write channels, AW, W and B, and two read channels, AR and R. AW carries the write address, W the write data and B the write acknowledgement; AR carries the read address and R the read data.

The protocol works in a request-response fashion. A write starts with AW, followed by W, from an AXI master, and finishes with a response B from an AXI slave. A read starts with an AR request from an AXI master and finishes with read data R from an AXI slave.

Figure 3 shows how a read transaction is performed. The master initiates the read transaction by sending the address and control information through the read address channel. When the slave receives this information, it processes the request and returns the corresponding read data via the read data channel. The last signal is asserted together with the final read data transfer to indicate the end of the read transaction.

Figure 3. Channel architecture of reads

Figure 4 shows how a write transaction is performed. The master initiates the write transaction by sending the address and control information through the write address channel. The master then sends the data through the write data channel and asserts the last signal with the final transfer to indicate the end of the data. When the slave receives the address information, it processes the request and waits for the data. After the slave has received all the data, it sends a write response through the write response channel. The master processes the response from the slave and finalizes the write transaction.

Figure 4. Channel architecture of writes

2.3 Master Interface (MI)

The MI performs packetization, multiplexing and queuing. It consists of the master side write channel interface, the master side read channel interface, the master side mux, the master to NOC FIFO and the NOC to master FIFO. Figure 5 shows the structure of the master interface.

[Block diagram with the blocks MASTER, WRITE CHANNEL INTERFACE (write address, write data, write response), READ CHANNEL INTERFACE (read address, read data), MUX, two FIFOs and ROUTER]

Figure 5. Structure of Master Interface

2.3.1 Write Channel Interface

[Block diagram with the blocks WRITE ADDRESS INF, WRITE DATA INF, WRITE RESPONSE INF, two PACKET UNITs, one DE-PACKET UNIT and the WRITE TRANSACTION TABLE; signals WA SIGS, WD SIGS, WR SIGS, WA REQ, WD REQ and FIFO IN]

Figure 6. Structure of Master Write Channel Interface

The write channel interface module interacts with the AXI master's write channels and packetizes the write requests into network packets. The structure of the master write channel interface is shown in Figure 6, and Figure 8 shows its program flow. Both the write address request and the write data request are initiated by valid signals. When the AXI master asserts the corresponding valid signal, the write channel interface answers with the ready signal. For a write address request, it first creates an entry in the write transaction table to record the characteristics of the request. The structure of the write transaction table is depicted in Figure 7.

VALID ID LEN SIZE BURST CTR

Figure 7. Structure of Write transaction table

The VALID field indicates whether the entry is occupied. The ID field records the transaction ID. The LEN field records the length of the transaction, the SIZE field records its transfer size and the BURST field records its burst type. The CTR field is the key field for maintaining in-order transfer over the NOC. It records the number of the current transfer within the transaction. When we pack a transfer into a network packet, we include this field. When the slave side receives the packet, it puts the packet into the location in the reordering buffer that corresponds to its CTR number.

When the master asserts the write data request signal, the interface searches the write transaction table for the transaction record with the same ID. It then packs the write data request into a network packet together with the CTR field, which indicates the position of the transfer in the transaction. After packing the write request into a packet, the interface asserts a request signal to the mux. The mux resolves the contention and sends the packets to the FIFO.

When the write channel interface receives a write response packet from the NOC to master FIFO, it restores the response from the packet. Since the write response is the acknowledgement from the slave, we remove the entry with the same ID from the write transaction table to finish the transaction.
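To illustrate how the CTR field tags each transfer, the following is a minimal Python sketch of the write transaction table. It is a behavioral illustration only, not the VHDL implementation. The names WriteTransactionTable, on_write_address, on_write_data and on_write_response, the dictionary-based packet layout and the table capacity are assumptions made for this example; only the fields of Figure 7 come from the design. The VALID bit is modeled by the presence of an entry in the dictionary.

```python
from dataclasses import dataclass


@dataclass
class WriteEntry:
    """One row of the write transaction table (fields as in Figure 7)."""
    txn_id: int
    length: int   # LEN: number of transfers in the transaction
    size: int     # SIZE
    burst: str    # BURST
    ctr: int = 0  # CTR: number of the next data transfer


class WriteTransactionTable:
    """Hypothetical behavioral model; presence in `entries` plays the role of VALID."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}

    def on_write_address(self, txn_id, length, size, burst):
        # A full table or a duplicate ID means the request has to be stalled.
        if txn_id in self.entries or len(self.entries) >= self.capacity:
            return None
        self.entries[txn_id] = WriteEntry(txn_id, length, size, burst)
        return {"type": "WA", "id": txn_id, "len": length, "size": size, "burst": burst}

    def on_write_data(self, txn_id, data):
        # Look up the entry by ID and stamp the packet with the current CTR.
        entry = self.entries[txn_id]
        packet = {"type": "WD", "id": txn_id, "ctr": entry.ctr, "data": data}
        entry.ctr += 1
        return packet

    def on_write_response(self, txn_id):
        # The response from the slave finishes the transaction.
        del self.entries[txn_id]


table = WriteTransactionTable(capacity=2)
table.on_write_address(txn_id=7, length=4, size=2, burst="INCR")
packets = [table.on_write_data(7, word) for word in ("d0", "d1", "d2", "d3")]
print([p["ctr"] for p in packets])
table.on_write_response(7)
```

Because every data packet carries its own CTR value, the slave side can restore the order regardless of the route each packet takes through the network.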

Figure 8. Pseudo code for the Master Write Channel Interface

if awvalid = '1' then
    put write address request to table;
    packet write address request;
    assert request to mux;
    wait for ack from mux;
end if;
if wvalid = '1' then
    search the table;
    packet write data request;
    assert request to mux;
    wait for ack from mux;
end if;
if fifo_in_valid and fifo_in_ready then
    de-packet write response;
    remove corresponding entry;
    assert bvalid to master;
    wait for bready from master;
end if;

2.3.2 Read Channel Interface

[Block diagram with the blocks READ ADDRESS INF, READ DATA INF, PACKET UNIT, DE-PACKET UNIT and READ TRANSACTION REORDER BUFFER; signals RA SIGS, RD SIGS, WA REQ and FIFO IN]

Figure 9. Structure of Master Read Channel Interface

Figure 9 and Figure 10 show the structure and the flow of the read channel interface. When it receives the ARVALID signal, it creates an entry for this transaction and allocates space in the reorder buffer. The read address request then arrives at the slave, and the slave answers the request with the corresponding data. The read data transfers travel through the network and reach the master side. When receiving a packet from the NOC to master FIFO, the interface first unpacks the data and then searches the read transaction table to put the data into the right place in the reorder buffer. The next action for the read interface is to find valid output data to feed the master. Valid data means data that is next in the order of its transaction. Because the transfers can reach the master in any order, the interface sometimes has to wait for the first transfer of a transaction to arrive although it already holds all the other transfers in the reorder buffer.

Figure 10. Pseudo code for the Master Read Interface

if arvalid = '1' then
    put read address request to table;
    packet read address request;
    assert request to mux;
    wait for ack from mux;
end if;
if fifo_in_valid then
    de-packet read data;
    search rid in the table;
    put data in reorder buffer;
end if;
if rready = '1' then
    search the reorder buffer for valid output;
    if rlast = '1' then
        clear the entry;
    end if;
end if;
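The in-order release described above can be sketched in a few lines of Python. This is an illustrative model under our own assumptions, not the RTL: the class name ReorderBuffer and its methods put and pop_ready are hypothetical, and one buffer instance stands for the space allocated to a single read transaction.

```python
class ReorderBuffer:
    """Holds the transfers of one transaction and releases them in CTR order."""

    def __init__(self, length):
        self.slots = [None] * length     # one slot per transfer of the burst
        self.filled = [False] * length
        self.next_out = 0                # CTR of the next transfer owed to the master

    def put(self, ctr, data):
        # The CTR carried in the packet selects the slot.
        self.slots[ctr] = data
        self.filled[ctr] = True

    def pop_ready(self):
        """Return every transfer that can be delivered in order right now."""
        out = []
        while self.next_out < len(self.slots) and self.filled[self.next_out]:
            out.append(self.slots[self.next_out])
            self.next_out += 1
        return out


rb = ReorderBuffer(length=4)
for ctr in (1, 2, 3):                    # transfers 1..3 overtake transfer 0
    rb.put(ctr, f"d{ctr}")
blocked = rb.pop_ready()                 # nothing can be released yet
rb.put(0, "d0")
released = rb.pop_ready()                # now the whole burst is released in order
print(blocked, released)
```

The example reproduces the waiting case from the text: three of four transfers are present, yet none can be delivered until the first one arrives.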

2.3.3 Master side Mux

The master side mux resolves the contention that arises when write address, write data and read address requests occur at the same time. Since we have only one output port to the network, simultaneous requests have to be injected into the network one by one. The master side mux serves the requests with fixed priority. Considering that address requests should arrive at the slave side first, we give the address requests higher priority than the data requests.
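As a small runnable illustration of fixed-priority arbitration, the following Python sketch picks one pending request per cycle. The function name arbitrate, the request labels and the dictionary of pending requests are assumptions made for this example; the priority order (write address, read address, write data) follows the mux pseudo code of the design.

```python
# Fixed priority, highest first: address requests before data requests.
PRIORITY = ("wa", "ra", "wd")


def arbitrate(pending, fifo_full):
    """Return the kind of request granted this cycle, or None if nothing is granted."""
    if fifo_full:
        return None                  # no grant while the FIFO cannot accept a packet
    for kind in PRIORITY:
        if kind in pending:
            return kind
    return None


pending = {"wd": "write data packet", "ra": "read address packet"}
first = arbitrate(pending, fifo_full=False)
del pending[first]
second = arbitrate(pending, fifo_full=False)
print(first, second)
```

A fixed priority is simple to implement, but a steady stream of address requests can delay data requests; the bounded number of outstanding transactions limits how long this can last.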

Figure 11. Pseudo code for the Master side Mux

if fifo not full then
    if wa request then
        out to fifo = wa packet;
    else if ra request then
        out to fifo = ra packet;
    else if wd request then
        out to fifo = wd packet;
    end if;
end if;

2.3.4 Master to NOC FIFO

We implement the master to NOC FIFO as a circular buffer. We expect this to reduce the toggling rate of the transistors and thus to consume less energy than a shift register. The size of the FIFO can be determined when we set up the platform. The FIFO only deals with three types of outgoing packets: write address requests, write data requests and read address requests. Packets of other types are discarded directly. Also note that an AXI transaction is finalized by receiving the response from the slave side. So if we have configured the master component for 2 outstanding write and 2 outstanding read transactions, it will generate at most 4 transactions before the slave responds. In this case, if the FIFO accepts all the packets and cannot send them to the network, the maximum number of items in the FIFO is the total number of transfers of 4 transactions without response. For instance, if the maximum burstiness is 16, the FIFO has to buffer 4 address requests for read and write together and 16 x 2 = 32 write data transfers, which is 36 packets in total. This is the maximum occupation of the FIFO, and we can guarantee that it will not be exceeded.

Figure 12. Pseudo code for the Master to NoC FIFO

if mux valid then
    if not full then
        fifo[write_pointer] = input;
        item_count ++;
        if write_pointer = depth - 1 then
            write_pointer = 0;
        else
            write_pointer ++;
        end if;
    end if;
end if;
if network ready then
    if not empty then
        output = fifo[read_pointer];
        item_count --;
        if read_pointer = depth - 1 then
            read_pointer = 0;
        else
            read_pointer ++;
        end if;
    end if;
end if;

2.3.5 NOC to Master FIFO

The NOC to master FIFO only deals with two kinds of packets: read data packets and write response packets. It first reads one item from the FIFO buffer. If its type is write response, the FIFO dispatches it to the write channel interface. If its type is read data, the FIFO dispatches it to the read channel interface. There can be background traffic packets in the network; if the FIFO receives such packets, it discards them directly.

In this case the FIFO size can also be decided by the configuration of the system. If the master can generate 2 write and 2 read outstanding transactions, the maximum occupation of the FIFO is 2 write responses and 16 x 2 = 32 read data transfers, which is 34 in all.
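The two worst-case occupancy figures can be reproduced with a short calculation. The Python sketch below is only a restatement of the arithmetic in the text; the function and parameter names are assumptions made for this example.

```python
def master_to_noc_depth(n_write, n_read, max_burst):
    """Worst case: one address packet per outstanding transaction
    plus max_burst data packets per outstanding write."""
    return (n_write + n_read) + n_write * max_burst


def noc_to_master_depth(n_write, n_read, max_burst):
    """Worst case: one response per outstanding write
    plus max_burst data packets per outstanding read."""
    return n_write + n_read * max_burst


print(master_to_noc_depth(n_write=2, n_read=2, max_burst=16))  # 36
print(noc_to_master_depth(n_write=2, n_read=2, max_burst=16))  # 34
```

Expressing the bound as a function of the outstanding-transaction limits makes it easy to re-size the FIFOs when the master configuration changes.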

Figure 13. Pseudo code for NoC to Master FIFO

if resource in then
    if not full then
        fifo[write_pointer] = input;
        item_count ++;
        if write_pointer = depth - 1 then
            write_pointer = 0;
        else
            write_pointer ++;
        end if;
    end if;
end if;
if not empty then
    temp = fifo[read_pointer];
    item_count --;
    if read_pointer = depth - 1 then
        read_pointer = 0;
    else
        read_pointer ++;
    end if;
    if temp.type = read_data then
        request to read channel interface;
    else if temp.type = write_response then
        request to write channel interface;
    else
        discard the packet;
    end if;
end if;

2.4 Slave Interface (SI)

The SI performs de-packetization, re-ordering, de-multiplexing and queuing. The slave interface is similar to the master interface. The slave side mux, the slave to NOC FIFO and the NOC to slave FIFO work in the same way as their master side counterparts. However, the slave side write and read interfaces are different. Figure 14 shows the structure of the slave interface.

[Block diagram with the blocks SLAVE, WRITE CHANNEL INTERFACE (write address, write data, write response), READ CHANNEL INTERFACE (read address, read data), MUX, two FIFOs and ROUTER]

Figure 14. Structure of Slave Interface

2.4.1 Slave side write interface

[Block diagram with the blocks WRITE ADDRESS INF, WRITE DATA INF, WRITE RESPONSE INF, two DE-PACKET UNITs, one PACKET UNIT and the WRITE TRANSACTION REORDER BUFFER; signals WA SIGS, WD SIGS, WR SIGS, WR REQ and two FIFO IN inputs]

Figure 15. Structure of Slave side Write Channel Interface

The slave side write interface behaves like the master side read interface. It unpacks the data packets from the FIFO and then manages the write transaction reorder buffer. Figure 15 and Figure 16 show the structure and flow of the slave side write channel interface.

Figure 16. Pseudo code for Slave side Write Channel Interface

if fifo_wa_in then
    if new transaction then
        create new entry in the table;
    else
        fill the header;
    end if;
    assert awvalid;
    wait for awready;
end if;
if fifo_wd_in then
    if new transaction then
        create new entry in the table;
    else
        fill the reorder buffer;
    end if;
end if;
if wready then
    search the table for valid output;
    assert wvalid;
end if;

2.4.2 Slave side read channel interface

[Block diagram with the blocks READ ADDRESS INF, READ DATA INF, DE-PACKET UNIT, PACKET UNIT and READ TRANSACTION TABLE; signals FIFO IN, RD REQ, RA SIGS and RD SIGS]

Figure 17. Structure of Slave side Read Channel Interface

Figure 18. Pseudo code for Slave side Read Channel Interface

if fifo_ra_in then
    if new transaction then
        create new entry in the table;
    else
        fill the header;
    end if;
    assert arvalid;
    wait for arready;
end if;
if rvalid then
    search the table;
    pack counter to the request;
    request to mux;
    wait for ack;
end if;

The main problem for the SIs is to maintain in-order transmission to the slave. The reordering mechanism is table-based: the reordering is done by filling a table. The principle is shown in Figure 19 and Figure 20. The reordering buffer has 2 parts, the header table and the data array. When a write transaction arrives, the slave side fills the header table to record the transaction. The MST_POS field records the position of the master in the NoC, so that the slave side can send data back according to this information. The DATA_INDEX field links the header to the data array. The DATA_VALID field indicates whether the address request has been sent to the slave, because AXI requires that the address is sent to the slave ahead of the data transfers.

VALID ID MST_POS DATA_VALID DATA_INDEX NXT

Figure 19. Structure of the Reorder Header Table

VALID WRITE DATA

Figure 20. Structure of the Reorder Data Table

2.5 Flow Regulation

Flows from a master to a slave are regulated according to the concept of the regulation spectrum, which gives the upper and lower limits of regulation. We use the regulation parameters σ (burstiness) and ρ (rate) to define the characteristics of each flow. We give an example to show the two limits. Figure 21 shows a flow without regulation. If the flow is not regulated, it can generate bursts of any length. Here we can see that the flow generates 8 consecutive transfers and then waits for another 32 cycles before transferring again. In a network, this can suddenly increase the traffic and may induce a high rate of congestion.

Figure 21. Flow without Regulation [4]

Figure 22 depicts the flow with regulation. After regulation, the 8 transfers are evenly distributed along the t axis. In our design, the flow is regulated by the MI. The MI controls the AXI ready signals to let the master's request proceed when it has valid tokens and to stall it when it does not.
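One common way to realize such token-based regulation is a token bucket, sketched below in Python. This is an assumption made for illustration, not a description of the VHDL regulator: the class name TokenBucket, the method tick, the rule that the bucket holds at most σ tokens and gains ρ tokens per cycle, and the choice to start with a full bucket are all ours. The ready signal is granted only when at least one whole token is available.

```python
from fractions import Fraction


class TokenBucket:
    """Hypothetical (sigma, rho) regulator gating the AXI ready signal."""

    def __init__(self, sigma, rho):
        self.sigma = Fraction(sigma)     # bucket depth: largest tolerated burst
        self.rho = Fraction(rho)         # tokens added per cycle: long-term rate
        self.tokens = Fraction(sigma)    # assumption: the bucket starts full

    def tick(self, valid):
        """Advance one cycle; return True if a valid request is granted ready."""
        ready = bool(valid) and self.tokens >= 1
        if ready:
            self.tokens -= 1             # one token is spent per transfer
        self.tokens = min(self.sigma, self.tokens + self.rho)
        return ready


# A master that always has data, regulated with sigma = 1 and rho = 1/5:
regulator = TokenBucket(sigma=1, rho=Fraction(1, 5))
grants = [cycle for cycle in range(40) if regulator.tick(valid=True)]
print(grants)  # one transfer every 5 cycles: 0, 5, 10, ..., 35
```

With σ = 1 the 8 transfers of a 40-cycle window are spread evenly, as in Figure 22. A larger σ lets the master send a burst back to back before the rate limit ρ takes effect.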

Figure 22. Flow with Regulation [4]

2.6 System Integration

Figure 23 demonstrates the basic configuration of the experiment platform. The master component is connected to the network router through the master interface, while the slave is connected through the slave interface. The master and slave interfaces function as both the RNI and the NI.

The master components use some non-synthesizable features of VHDL, such as file operations and the real data type, so the master cannot be synthesized. The master and slave interfaces, however, are designed to be synthesized to real hardware, and the NoC infrastructure is fully synthesizable with very good results. With proper IP cores serving as master and slave components, we are able to implement the whole system on an FPGA board or even on chip.

[Block diagram: a MASTER with its master interface and a SLAVE with its slave interface, each consisting of a WRITE CHANNEL INTERFACE (write address, write data, write response), a READ CHANNEL INTERFACE (read address, read data), a MUX and two FIFOs, attached to ROUTERs]

Figure 23. Demonstration of 2x2 Mesh Network with 1M1S

3 Experiments

This chapter reports experiments and results.

3.1 Experiment Setup

3.1.1 Purpose

The purpose of the experiments is to investigate the effect of flow regulation on the Nostrum NoC. In order to examine the network behavior closely, we use simple experiments.

The two groups of experiments are designed as follows:

1. 4 Masters and 2 Slaves

2. 8 Masters and 4 Slaves

For each group of experiments, we inject write transactions at different rates and burstiness. We investigate the delay of transfers and the backlog.

3.1.2 Infrastructure of the Experiment

All of the experiments are based on a 4x4 mesh network with deflection routing enabled. With a 4x4 network, we can explore the influence of deflection routing and keep the simulation time reasonable. Figure 24 shows the topology of the network. According to Erland [5], network traffic has the most influence on the inner part of the mesh network, so we place the master components in the central part of the network. We index the nodes in the network by their row and column numbers, both starting from 1. The upper left node is indexed as (1,1) and the lower right node is (4,4).

R R R R
R R R R
R R R R
R R R R

Figure 24. Basic Structure of 4x4 Mesh Network

The first group of experiments is based on the 4x4 mesh network with 4 masters and 2 slaves. Figure 25 depicts the distribution of the 4 masters and 2 slaves in the 4x4 mesh network. To make the effect of network traffic more significant, we make the traffic flows of the 4 masters pass through the bisection of the mesh network. For instance, the master at (2,2) accesses the slave at (4,4). The colored arrows show the possible traffic flows of the masters. However, since this is a deflection routing network, packets can travel through the network along any possible route. Note that only 6 out of 16 nodes in the network generate traffic, so the network is lightly loaded. Most packets will travel to their destination along a shortest path.
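For reference, the length of a shortest path in the mesh follows directly from the (row, column) indices. The Python sketch below is a small illustration; the function name min_hops and the constant MESH are assumptions made for this example, and the effect of deflections, which can only lengthen a route, is not modeled.

```python
MESH = 4  # the experiments use a 4x4 mesh, indexed from (1,1) to (4,4)

# All node indices, row by row, from the upper left to the lower right corner.
nodes = [(row, col) for row in range(1, MESH + 1) for col in range(1, MESH + 1)]


def min_hops(src, dst):
    """Shortest-path length between two nodes: the Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


# The example flow from the text: master at (2,2), slave at (4,4).
print(min_hops((2, 2), (4, 4)))  # 4
```

The hop counts measured in the experiments can be compared against this lower bound to see how much detour the deflection routing introduces.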

[Diagram: 4x4 mesh of routers (R) with 4 masters (MST) and 2 slaves (SLV) attached; colored arrows mark the possible traffic flows]

Figure 25. Flow Demonstration of 4x4 Mesh Network with 4M2S

The second group of experiments is based on the 4x4 mesh network with 8 masters and 4 slaves. The distribution is depicted in Figure 26. We place the master components along row 2 and row 3. As we can see from Figure 26, the traffic of all the masters passes through the bisection of the mesh network. In this distribution, 12 out of 16 nodes generate traffic, which puts higher pressure on the network. We expect to see effects introduced by the deflection routing.

[Diagram: 4x4 mesh of routers (R) with 8 masters (MST) and 4 slaves (SLV) attached]

Figure 26. Flow Demonstration with 8M4S

3.1.3 Evaluation Statistics

We ran the experiment platform with 2 groups of regulation parameters. The first group consists of 5 different burstiness levels with ρ=0.2: 1/5, 2/10, 4/20, 8/40 and 16/80. The second group consists of 5 different burstiness levels with ρ=0.5: 1/2, 2/4, 4/8, 8/16 and 16/32. The burstiness within each group increases in this order, which is expected to put different pressure on the network and on the FIFO at the slave side.
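The settings within a group share the same rate and differ only in burst length, which the following Python sketch makes explicit. Reading a setting m/n as m back-to-back transfers within a window of n cycles is our interpretation, based on the example of 8 consecutive transfers followed by 32 idle cycles; the names rate and window_pattern are assumptions made for this example.

```python
from fractions import Fraction

# Regulation settings m/n of the two groups, as listed in the text.
GROUP_RHO_02 = [(1, 5), (2, 10), (4, 20), (8, 40), (16, 80)]
GROUP_RHO_05 = [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32)]


def rate(m, n):
    """Long-term rate rho of a setting m/n, kept exact as a fraction."""
    return Fraction(m, n)


def window_pattern(m, n):
    """Assumed traffic shape: m transfers back to back, then n - m idle cycles."""
    return [1] * m + [0] * (n - m)


for m, n in GROUP_RHO_02 + GROUP_RHO_05:
    idle = window_pattern(m, n).count(0)
    print(f"{m}/{n}: rho = {float(rate(m, n))}, burst = {m}, idle cycles = {idle}")
```

Holding ρ constant while growing m isolates the effect of burstiness from the effect of load, which is what the two experiment groups are designed to show.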

During the experiments, we measure the following data: the maximum number of items in the slave side FIFO, the cycles each transfer takes, the cycles each transfer spends in the FIFO, the hops each transfer takes to travel through the network, and the cycles each transaction takes.

3.2 4 Masters 2 Slaves (4M2S)


3.2.1 Experiment results of ρ = 0.2

This group of experiments is performed with different combinations of m and n such that ρ is 0.2.

Figure 27. Histogram of FIFO occupation in 4M2S ρ=0.2

Figure 28. Maximum FIFO occupation in 4M2S ρ=0.2

Figure 27 shows the histogram of FIFO occupation for ρ=0.2 with different burstiness, and Figure 28 shows the maximum FIFO occupation for ρ=0.2 with different burstiness. From these 2 figures we can see that higher burstiness leads to higher FIFO occupation due to the contention at the slave side.

Figure 29. Mean transfer cycles in 4M2S ρ=0.2

Figure 30. Maximum transfer cycles in 4M2S ρ=0.2

Figure 30 depicts the maximum number of cycles each transfer takes to go through the network, while Figure 29 depicts the mean. Each bar in these figures consists of 3 parts: the total interface delay, the FIFO delay and the network delay (labeled Hopcount in the plots). Higher burstiness induces more congestion in the slave side FIFO as well as more packet congestion in the network. We can see that all three parts increase as the burstiness increases.

Figure 31. Mean transaction cycles in 4M2S ρ=0.2

Figure 32. Maximum transaction cycles in 4M2S ρ=0.2

Figure 31 and Figure 32 show the mean and maximum cycles every AXI transaction takes to finish. Since higher burstiness shortens the waiting period between 2 consecutive transfers, and AXI transactions are pipelined, the higher the burstiness, the fewer cycles it takes to finish one transaction.

3.2.2 Experiment results of ρ = 0.5

This group of experiments is performed with different combination of m and

n which the result of ρ is 0.5.



[Plots for Figures 33 and 34: histograms of slave-side FIFO occupation for ρ = 1/2, 2/4, 4/8, 8/16 and 16/32, and 4M2S ρ=0.5 maximum FIFO items per setting.]

Figure 33 shows the histogram of slave-side FIFO occupation and Figure 34 shows the slave-side maximum FIFO occupation. Because the slave can process only one transfer every 2 cycles, ρ=0.5 is the limit of the slave's processing capability. In this case, different levels of burstiness induce almost the same FIFO occupation. Note that every AXI transaction is finalized by a response, so a master with a fixed outstanding capability cannot generate an unbounded number of outstanding transactions. Once it reaches its maximum number of outstanding transactions, the master must wait for a response from the slave before generating a new transaction. The maximum FIFO occupation therefore cannot exceed a bound that depends on the burstiness and on the maximum outstanding transaction capability.
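A coarse version of this bound follows from counting the transfers that can be in flight at all. The sketch below is a back-of-the-envelope calculation under assumed parameters, not the bound derived from the platform. The function name and the example values are hypothetical.

```python
def max_in_flight_transfers(masters, max_outstanding, transfers_per_transaction):
    """Upper bound on transfers that can exist in the system at once.

    Assumption: every master stops issuing once `max_outstanding`
    transactions are waiting for their responses, so it never has more
    than max_outstanding * transfers_per_transaction transfers in flight.
    """
    return masters * max_outstanding * transfers_per_transaction


# Example values chosen for illustration only.
bound = max_in_flight_transfers(masters=2, max_outstanding=1,
                                transfers_per_transaction=16)
print(bound)
```

No FIFO can hold more transfers than are in flight, so the occupancy stays finite even when the offered rate reaches the slave's limit. The actual maximum is usually lower because some transfers are still in the network or in the interfaces.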



Figure 35 and Figure 36 depict the mean and maximum number of cycles it takes to finish one transfer. We can see that higher burstiness brings higher maximum transfer cycles, longer FIFO occupation and a larger hop count.

[Plots for Figures 35 and 36: 4M2S ρ=0.5 mean transfer cycles and a second stacked bar chart, for the settings 1/2, 2/4, 4/8, 8/16 and 16/32; each bar stacks interface delay, FIFO delay and hop count.]




Figure 37 and Figure 38 depict the mean and maximum number of cycles it takes to finish one AXI transaction with ρ=0.5. Since the generation rate of the transfers reaches the limit of the slave's processing capability, even lower burstiness does not introduce a much longer waiting period.

[Plots: 4M2S mean and maximum transaction cycles for the settings 1/2, 2/4, 4/8, 8/16 and 16/32.]

3.2.3 Comparison between ρ= 0.2 and ρ=0.5

Figure 39. Comparison of mean delay in 4M2S

Figure 40. Comparison of maximum delay in 4M2S

Figure 39 and Figure 40 show the mean and maximum transfer delay for both ρ=0.2 and ρ=0.5. ρ=0.2 gives better results in both the mean and the maximum case; at higher burstiness, however, the results for ρ=0.2 are almost equal to those for ρ=0.5.

[Plots for Figures 39 and 40: 4M2S mean and maximum delay, ρ=0.2 versus ρ=0.5, for the settings 1/5, 1/2, 2/10, 2/4, 4/20, 4/8, 8/40, 8/16, 16/80 and 16/32; each bar stacks interface delay, FIFO delay and hop count.]

Figure 41. Comparison of mean transfer delay in 4M2S

Figure 42. Comparison of maximum transfer delay in 4M2S

Figure 41 and Figure 42 depict the individual comparison of transfer delay.

We can see that ρ=0.5 leads to much higher transfer delay when

burstiness is low.

[Plots for Figures 41 and 42: 4M2S mean and maximum transfer delay over the five burstiness settings, with one series for ρ=0.2 and one for ρ=0.5.]

Figure 43. Comparison of mean FIFO delay in 4M2S

Figure 44. Comparison of maximum FIFO delay in 4M2S

Figure 43 and Figure 44 depict the individual comparison of FIFO delay. We can see that ρ=0.5 induces a much higher FIFO delay when burstiness is low. In some cases the FIFO delay is up to 40% larger than with ρ=0.2.

[Plots for Figures 43 and 44: 4M2S mean and maximum FIFO delay over the five burstiness settings, with one series for ρ=0.2 and one for ρ=0.5.]

Figure 45. Comparison of mean hop count in 4M2S

Figure 46. Comparison of maximum hop count in 4M2S

Figure 45 and Figure 46 show the individual comparison of network delay. The network delay is influenced by the level of network traffic: higher burstiness induces more congestion in the network. However, in the highest-burstiness case, the waiting period is longer than the time a packet takes to travel through the network, so both values of ρ give the same result.
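The argument about the waiting period can be made concrete with a short check. The sketch assumes that a setting b/p leaves p - b idle cycles between two bursts of the same master, and it uses a made-up traversal time of 10 cycles. Both helper names and the traversal value are hypothetical.

```python
def idle_gap(burst, period):
    """Idle cycles between the end of one burst and the start of the next."""
    return period - burst


def bursts_overlap_in_network(burst, period, traversal_cycles):
    """True if packets of one burst may still be travelling when the next
    burst of the same master starts (toy criterion)."""
    return idle_gap(burst, period) < traversal_cycles


TRAVERSAL = 10  # assumed network traversal time in cycles, for illustration

for label, (b, p) in {"16/80": (16, 80), "16/32": (16, 32),
                      "1/5": (1, 5), "1/2": (1, 2)}.items():
    print(label, idle_gap(b, p), bursts_overlap_in_network(b, p, TRAVERSAL))
```

For a burst of 16 the gap exceeds the assumed traversal time at both rates, so consecutive bursts of one master do not meet in the network and the network delay is determined by the burst itself. For a burst of 1 the gap is shorter than the traversal time, and the rate does make a difference.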

[Plots for Figures 45 and 46: 4M2S mean and maximum hop count over the five burstiness settings, with one series for ρ=0.2 and one for ρ=0.5.]

3.3 8 Masters 4 Slaves (8M4S)

The 8M4S platform is used to investigate the behavior of regulation under high network traffic. Each master generates 16-transfer AXI transactions with different regulation parameters.

3.3.1 Experiment results of ρ= 0.2

This group of experiments is performed with different combinations of m and n for which ρ is 0.2.


[Plot for Figure 47: histograms of FIFO occupation for ρ = 1/5, 2/10, 4/20, 8/40 and 16/80.]


Figure 47 shows the histogram of FIFO occupation and Figure 48 depicts the maximum FIFO occupation. Higher burstiness induces more congestion on the slave side, which leads to higher FIFO occupation.


[Plots: 8M4S ρ=0.2 maximum FIFO items, followed by a stacked bar chart (interface delay, FIFO delay, hop count), each for the settings 1/5, 2/10, 4/20, 8/40 and 16/80.]


Figure 49 depicts the maximum number of cycles each transfer spends going through the network, while Figure 50 depicts the mean. Each bar in these figures consists of three parts: the interface delay, the FIFO delay and the network delay. Higher burstiness induces more congestion in the slave-side FIFO as well as more packet congestion in the network. We can see that all three parts increase with increasing burstiness.

[Plots: a stacked bar chart (interface delay, FIFO delay, hop count) for the settings 1/5, 2/10, 4/20, 8/40 and 16/80, followed by an untitled bar chart over the five settings.]




Figure 51 and Figure 52 depict the mean and maximum number of cycles it takes to finish one AXI transaction. Because of the waiting period introduced by regulation, higher burstiness needs less time to finish one transaction.

3.3.2 Experiment result of ρ=0.5

This group of experiments is performed with different combinations of m and n for which ρ is 0.5.




Figure 53 is the histogram of FIFO occupation and Figure 54 shows the maximum FIFO occupation. We notice that, because ρ=0.5 is a high regulation rate, the combination m=2 and n=1 reaches the maximum FIFO occupation more often, while the other combinations reach it only a few times.

[Plots for Figures 53 and 54: histograms of FIFO occupation for ρ = 1/2, 2/4, 4/8, 8/16 and 16/32, and 8M4S ρ=0.5 maximum FIFO items per setting.]



Figure 55 and Figure 56 depict the mean and maximum transfer cycles. As with ρ=0.2, higher burstiness introduces congestion and thus leads to more interface delay, higher FIFO delay and a larger hop count.

[Plots: two stacked bar charts (interface delay, FIFO delay, hop count) for the settings 1/2, 2/4, 4/8, 8/16 and 16/32.]



Figure 57 and Figure 58 show the mean and maximum cycles it takes

to finish one transaction. We notice that the combination m=2 and n=1

is different from others. It always leads to maximum transaction cycles.

This is because it keeps the slave side under the maximum pressure.
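The pressure on a slave can be estimated by comparing the offered load with the service capacity. The sketch below is an open-loop estimate under stated assumptions: the slave serves one transfer every 2 cycles, as described for the 4M2S experiments, and `masters_per_slave` is a free parameter because the mapping of masters to slaves is not restated here. The function name is hypothetical.

```python
from fractions import Fraction


def slave_utilization(masters_per_slave, rho, service_interval=2):
    """Offered load divided by service capacity for one slave.

    rho is the regulated rate of each master in transfers per cycle.
    A value of 1 or more means the slave is saturated in this open-loop
    estimate. In the real system the limit on outstanding AXI
    transactions throttles the masters before the backlog can grow
    without bound.
    """
    offered = masters_per_slave * Fraction(rho)
    capacity = Fraction(1, service_interval)
    return offered / capacity


print(slave_utilization(1, Fraction(1, 5)))   # lightly loaded
print(slave_utilization(1, Fraction(1, 2)))   # exactly at the limit
print(slave_utilization(2, Fraction(1, 2)))   # beyond the limit
```

At ρ=0.5 even a single master uses the full capacity of its slave, so the slave has no idle cycles in which to drain a backlog, which is consistent with the observation above.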

[Plots for Figures 57 and 58: 8M4S ρ=0.5 mean and maximum transaction cycles for the settings 1/2, 2/4, 4/8, 8/16 and 16/32.]

3.3.3 Comparison between ρ=0.2 and ρ=0.5

Figure 59. Comparison of mean delay in 8M4S

Figure 60. Comparison of Maximum delay in 8M4S

Figure 59 and Figure 60 depict the mean and maximum transfer delay. We can see that ρ=0.2 gives better results; however, when burstiness is high, ρ=0.2 and ρ=0.5 give almost the same result.

[Plots for Figures 59 and 60: 8M4S mean delay and maximum transfer cycles, ρ=0.2 versus ρ=0.5, for the settings 1/5, 1/2, 2/10, 2/4, 4/20, 4/8, 8/40, 8/16, 16/80 and 16/32; each bar stacks interface delay, FIFO delay and hop count.]

Figure 61. Comparison of Mean transaction delay in 8M4S

Figure 62. Comparison of Maximum transaction delay in 8M4S

Figure 61 and Figure 62 show the individual comparison of the mean and maximum transfer delay. We notice that the combination m=2 and n=1 gives a very high mean transfer delay, even higher than the highest-burstiness case. We consider regulation with this combination to have almost no effect.

[Plots for Figures 61 and 62: 8M4S mean and maximum transfer delay over the five burstiness settings, with one series for ρ=0.2 and one for ρ=0.5.]

Figure 63. Comparison of mean FIFO delay in 8M4S

Figure 64. Comparison of maximum FIFO delay in 8M4S

Figure 63 and Figure 64 show the individual comparison of mean and maximum FIFO delay. From them we can see that ρ=0.5 induces higher congestion and makes each transfer stay in the FIFO for a longer time.

[Plots for Figures 63 and 64: 8M4S mean and maximum FIFO delay over the five burstiness settings, with one series for ρ=0.2 and one for ρ=0.5.]

Figure 65. Comparison of mean hop count in 8M4S

Figure 66. Comparison of maximum hop count in 8M4S

Figure 65 and Figure 66 show the individual comparison of network delay. The network delay is influenced by the level of network traffic: higher burstiness induces more congestion in the network. However, in the highest-burstiness case, the waiting period is longer than the time a packet takes to travel through the network, so both values of ρ give the same result.

[Plots for Figures 65 and 66: 8M4S mean and maximum hop count over the five burstiness settings, with one series for ρ=0.2 and one for ρ=0.5.]

3.4 Comparison between 4M2S and 8M4S

Here we compare the two network configurations. Since the network traffic pressure differs between 4M2S and 8M4S, we analyze the effect introduced by the network traffic.

3.4.1 Comparison of ρ=0.2

Figure 67. Comparison of mean transfer delay between 4M2S and 8M4S

ρ=0.2

[Plot for Figure 67: 4M2S versus 8M4S at ρ=0.2 for the settings 1/5, 2/10, 4/20, 8/40 and 16/80; each bar stacks interface delay, FIFO delay and hop count.]

Figure 68. Comparison of maximum transfer delay between 4M2S and 8M4S

ρ=0.2

Figure 67 and Figure 68 give an overall view of the mean and maximum transfer delay. We can see that lower network traffic is beneficial: higher network traffic increases the time the packets need to travel through the network. We analyze the individual comparisons to see which part introduces the difference.

Figure 69. Comparison of mean transfer cycles between 4M2S and 8M4S

ρ=0.2

[Plots: 4M2S versus 8M4S at ρ=0.2 as stacked bars (interface delay, FIFO delay, hop count) for the settings 1/5, 2/10, 4/20, 8/40 and 16/80, and ρ=0.2 mean transfer delay with one series for 4M2S and one for 8M4S.]

Figure 70. Comparison of maximum transfer cycles between 4M2S and 8M4S

ρ=0.2

Figure 69 and Figure 70 show the individual comparison of mean and

maximum transfer delay, where we can see that higher network traffic

will affect the transfer delay.

Figure 71. Comparison of mean FIFO delay between 4M2S and 8M4S ρ=0.2

[Plots: ρ=0.2 maximum transfer delay and ρ=0.2 mean FIFO delay for the settings 1/5, 2/10, 4/20, 8/40 and 16/80, each with one series for 4M2S and one for 8M4S.]

Figure 72. Comparison of Maximum FIFO delay between 4M2S and 8M4S

ρ=0.2

Figure 71 and Figure 72 depict the comparison of mean and maximum FIFO delay. The 8M4S platform has a higher FIFO delay because the transfers in the FIFO have to wait for others to arrive. The FIFO delay is thus also influenced by the network traffic.

Figure 73. Comparison of mean hop count between 4M2S and 8M4S ρ=0.2

[Plots: ρ=0.2 maximum FIFO delay and ρ=0.2 mean hop count for the settings 1/5, 2/10, 4/20, 8/40 and 16/80, each with one series for 4M2S and one for 8M4S.]

Figure 74. Comparison of maximum hop count between 4M2S and 8M4S

ρ=0.2

Figure 73 and Figure 74 show the comparison of mean and maximum

network delay. This comparison should reflect the influence of the

network directly. From the mean network delay diagram we can see

that 8M4S takes 1 or 2 cycles more to travel through the network.
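The extra hops under heavier traffic are what one would expect from deflection routing on a 2D mesh. The sketch below is a property of the mesh geometry, not a model of the Nostrum router: the shortest path between two nodes is their Manhattan distance, and every hop that moves a packet away from its destination has to be compensated by one more hop back. The coordinates and function names are hypothetical.

```python
def min_hops(src, dst):
    """Shortest path length between two mesh nodes given as (x, y)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hops_with_deflections(src, dst, deflections):
    """Total hops if the packet takes `deflections` unproductive hops.

    Assumption: each deflection moves the packet one hop further from
    its destination, which costs that hop plus one hop to return.
    """
    return min_hops(src, dst) + 2 * deflections


src, dst = (0, 0), (2, 3)
print(min_hops(src, dst))
print(hops_with_deflections(src, dst, 1))
```

In this picture a mean increase of 1 or 2 in the hop count corresponds to roughly one deflection every one or two packets, on average.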

3.4.2 Comparison of ρ=0.5

Figure 75. Comparison of mean transfer delay between 4M2S and 8M4S

ρ=0.5

[Plots: ρ=0.2 maximum hop count for the settings 1/5, 2/10, 4/20, 8/40 and 16/80 with one series for 4M2S and one for 8M4S, and 4M2S versus 8M4S at ρ=0.5 as stacked bars (interface delay, FIFO delay, hop count) for the settings 1/2, 2/4, 4/8, 8/16 and 16/32.]

Figure 76. Comparison of maximum transfer delay between 4M2S and 8M4S

ρ=0.5

Figure 75 and Figure 76 depict the comparison of mean and maximum

transfer delay.

Figure 77. Comparison of mean transfer cycles between 4M2S and 8M4S

ρ=0.5

[Plots: 4M2S versus 8M4S at ρ=0.5 as stacked bars (interface delay, FIFO delay, hop count) for the settings 1/2, 2/4, 4/8, 8/16 and 16/32, and ρ=0.5 mean transfer delay with one series for 4M2S and one for 8M4S.]

Figure 78. Comparison of maximum transfer cycles between 4M2S and 8M4S

ρ=0.5

Figure 77 and Figure 78 show the comparison of mean and maximum cycles it

takes for each transfer.

Figure 79. Comparison of mean FIFO delay between 4M2S and 8M4S ρ=0.5

[Plots: ρ=0.5 maximum transfer delay and ρ=0.5 mean FIFO delay for the settings 1/2, 2/4, 4/8, 8/16 and 16/32, each with one series for 4M2S and one for 8M4S.]

Figure 80. Comparison of maximum FIFO delay between 4M2S and 8M4S

ρ=0.5

Figure 79 and Figure 80 show the comparison of mean and maximum FIFO delay. We can see that there is not much difference between the two network configurations.

Figure 81. Comparison of mean hop count between 4M2S and 8M4S ρ=0.5

[Plots: ρ=0.5 maximum FIFO delay and ρ=0.5 mean hop count for the settings 1/2, 2/4, 4/8, 8/16 and 16/32, each with one series for 4M2S and one for 8M4S.]

Figure 82. Comparison of maximum hop count between 4M2S and 8M4S

ρ=0.5

Figure 81 and Figure 82 show the comparison of mean and maximum

network delay. This comparison should reflect the influence of the

network directly. From the mean network delay diagram we can see

that 8M4S takes 1 or 2 cycles more to travel through the network.

[Plot: ρ=0.5 maximum hop count for the settings 1/2, 2/4, 4/8, 8/16 and 16/32, with one series for 4M2S and one for 8M4S.]

4 Conclusion

4.1 Summary

This report describes the construction of a Nostrum network based system.

We reuse the existing Nostrum router, and build interfaces to wrap the

Nostrum network. The two interfaces which have been designed and

implemented are master interface and slave interface. In particular, the two

interfaces realize an industrial interconnect protocol, the AXI protocol.

After constructing the platform, we perform experiments of flow regulation on

the platform. With simple but illustrative experiments, we can look into the

effect of flow regulation on reducing delay and backlog under various traffic

scenarios.

4.2 Future Work

As the first step, we have evaluated flow regulation using synthetic traffic

flows. In the future, we shall use traffic streams from real applications. This

requires integrating real IP modules (masters and slaves) into the NoC

system.

As can be observed from the experimental results, regulation has a clear impact on system performance. In general, increasing the regulation strength results in lower transfer delays. However, we also observe exceptions in some cases. This complicated phenomenon is partially due to deflection routing, which is adaptive and non-deterministic, but an in-depth investigation is necessary. We are also aware that, for NoC systems, regulation is better orchestrated globally, since regulation on individual streams results in interference whose impact needs to be investigated further from a global perspective.


