Zhiliang Qian1, Da-Cheng Juan2, Paul Bogdan3, Chi-Ying Tsui1,
Diana Marculescu2 and Radu Marculescu2
1The Hong Kong University of Science and Technology, Hong Kong 2Carnegie Mellon University, Pittsburgh, U.S.A
3University of Southern California, Los Angeles, U.S.A
A Comprehensive and Accurate Latency Model for
Network-on-Chip Performance Analysis
IEEE/ACM ASP-DAC’14, 22 Jan., 2014, Singapore
Outline
Introduction
NoC Modeling for Performance Analysis
NoC end-to-end delay calculation
Link dependency analysis
GE-type traffic modeling
Wormhole router based NoC latency model
Experimental results
Simulation setup
Evaluation under synthetic traffic patterns
Evaluation under realistic benchmarks
Conclusion
Network-on-Chips (NoCs)
With technology scaling down, more and more components
can be integrated on a single chip.
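[Figure: evolution of on-chip communication with technology scaling, from isolated cells and functional blocks (FB) to IP cores connected by a NoC]
Technology node      1.0 um                        0.25 um        <0.05 um
Interconnect         Point-to-point interconnect   Shared bus     Communication network
Total wire length    <100 cm                       <100 meters    >1 km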
An efficient way to manage the communication among on-chip
resources plays a key role in future system design.
NoC design space exploration
A large design space needs to be explored for an optimal design
Task mapping, allocation, buffer sizing, routing algorithm etc.
Accurate and fast performance evaluation is required during the exploration
-> analytical performance evaluation model
[Figure: NoC design space exploration flow. Inputs: the NoC platform (topology, router design) and the application (traffic pattern, injection rate), illustrated by a video decoder task graph. Inner loop: task scheduling/mapping, core mapping, routing allocation and traffic analysis, followed by performance evaluation with the router model and performance feedback. Outer loop: simulation/prototyping for detailed performance evaluation.]
Introduction: queuing-theory-based analytical model
Queuing-theory-based delay estimation
Customer (packet) arrival process
System (server) service process
Number of servers
Service discipline (FCFS, Round-robin etc.)
System time and waiting time
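[Figure: a queue feeding a server. A packet A first spends the waiting time in the queue and then the service time in the server; system time = waiting time + service time.]
Bernoulli injection process: the probability density of the inter-arrival time $x_t$ is $f(x_t) = \lambda e^{-\lambda x_t}$, with $\lambda = 1/T$.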
Queuing-theory-based NoC latency model
Prior art and motivation of this work
                     Previous NoC analytical models                                        This work
NoC latency model    [VLSI 2007]   [TCAD'12, ICCAD'09]   [TVLSI'13]       [NoCs'11]
Traffic model for the application
  Queue              M/M/1         M/G/1/K               G/G/1            M/M/m/K       G/G/1/K
  Arrival            Poisson       Poisson               General          Poisson       General
  Service            Markov        General               General          Markov        General
NoC architecture modeled
  Buffer             Small         K packets             B flits          Small         B flits
  PB ratio (1)       m (≫ 1)       < 1                   arbitrary        m (≫ 1)       arbitrary
  Arbitration        Round robin   Round robin           Fixed priority   Round robin   Round robin
(1) PB ratio is defined as the ratio of the average packet size (m flits) to the buffer depth (B flits).
Input to the NoC latency model
The application has been scheduled and mapped onto the NoC.
A deterministic routing algorithm is used to avoid deadlock.
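[Figure: inputs to the latency model. The application task graph (T1 to T11) goes through task scheduling and allocation, mapping and floorplanning onto tiles, and routing algorithm design, which yields the routing paths (Path 1 to Path 3) between tiles such as Tile A and Tile B.]
For each flow $f$, the input to the latency model is:
$P_f$: the set of links in the routing path
$(\lambda_f, C_f)$: the GE-type traffic model
Performance metric: average latency $L = \left(\sum_{f \in F} \lambda_f \times L_f\right) / \sum_{f \in F} \lambda_f$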
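The average-latency metric is a rate-weighted mean of the per-flow latencies. A minimal Python sketch, assuming each flow is given as a hypothetical `(rate, latency)` pair (the function name and the numbers are illustrative, not from the slides):

```python
def average_latency(flows):
    """Rate-weighted average latency over all flows.

    flows: iterable of (lambda_f, L_f) pairs, one per flow f in F
           (hypothetical input layout chosen for this sketch).
    """
    flows = list(flows)
    total_rate = sum(rate for rate, _ in flows)
    if total_rate <= 0:
        raise ValueError("total injection rate must be positive")
    # L = sum(lambda_f * L_f) / sum(lambda_f)
    return sum(rate * latency for rate, latency in flows) / total_rate


# Illustrative values only: two flows with different rates and latencies.
example_flows = [(0.01, 30.0), (0.03, 50.0)]
print(average_latency(example_flows))
```

Flows with a higher injection rate contribute proportionally more to the reported average.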
NoC end-to-end delay calculation
The end-to-end latency $L_{s,d}$ of a specific flow $f_{s,d}$
consists of three parts: $L_{s,d} = v_s + \eta_{s,d} + h_{s,d}$
The queuing time at the source: $v_s$
The packet transfer time in the path: $\eta_{s,d} = m + 1 + \sum_{i=1}^{d_f} \eta_{l_i^f}$
The path acquisition time: $h_{s,d} = \sum_{i=1}^{d_f} h_{l_i^f}$
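[Figure: (a) a 3 × 3 mesh with nodes 0 to 8; (b) the routing path of flow $f_{0,8}$, with source queuing time $v_0$ at the local port; (c) the links $l_1^f$ to $l_5^f$ on the path of the flow; (d) routers R1 and R2 connected by link $l_2$, showing the path acquisition time $h_{l_2}$, the flit transfer time $\eta_{l_2}$ and the queue $q_{l_2}$ for the current flow and a contention flow.]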
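The three-part decomposition of the flow latency can be sketched in Python as follows. This is only an illustration under the assumption that the transfer time is $m + 1$ plus the sum of the per-link flit transfer times; the function name, argument layout and numbers are hypothetical.

```python
def flow_latency(v_s, m, eta_links, h_links):
    """End-to-end latency of one flow: L = v_s + eta_sd + h_sd.

    v_s       : queuing time at the source
    m         : packet length in flits
    eta_links : flit transfer time of each link on the path
    h_links   : path acquisition time of each link on the path
    """
    if len(eta_links) != len(h_links):
        raise ValueError("need one eta and one h value per link")
    transfer = m + 1 + sum(eta_links)   # eta_sd
    acquisition = sum(h_links)          # h_sd
    return v_s + transfer + acquisition


# Illustrative values only: a path with five links.
print(flow_latency(v_s=2.0, m=4, eta_links=[1.0] * 5, h_links=[0.5] * 5))
```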
Link dependency graph
Building the channel (link) dependency graph
[Figure: left, the core communication graph of the application (cores such as C2, C3, C4, C6 and C7); middle, the NoC mesh of nodes P0 to P8 with directed links named L0_1, L1_0, L1_2, L2_1, ..., L7_8, L8_7; right, the channel dependency graph (CDG) of the application communication, whose nodes are links such as L1_4, L4_5, L5_2 and L4_7. Legend: routing path, traffic model, edge in the dependency graph.]
Link dependency analysis
A topological sort is applied to the obtained CDG to
find the order in which the queuing delays of the links are analyzed.
[Figure: a sample channel dependency graph. Each link node carries three parameters, $\eta$, $h$ and $s$; the source links also carry the source queuing time $v$.]
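The ordering step can be illustrated with a small topological sort in Python. This is a sketch, not the authors' implementation: it assumes the CDG is stored as a dictionary mapping each link to the links a packet may request next, and it assumes that a link can only be analyzed once all of its downstream links have been analyzed, so links with no successors come first. The link names in the example are hypothetical.

```python
from collections import deque


def analysis_order(cdg):
    """Return the links so that every downstream link precedes its upstream links.

    cdg: dict mapping a link name to the links requested after it
         (edge u -> v: a packet holding u next requests v).
    """
    nodes = set(cdg)
    for successors in cdg.values():
        nodes.update(successors)
    # pending[u] = number of downstream links of u not yet analyzed
    pending = {n: len(set(cdg.get(n, ()))) for n in nodes}
    upstream = {n: [] for n in nodes}
    for u, successors in cdg.items():
        for v in set(successors):
            upstream[v].append(u)
    ready = deque(sorted(n for n in nodes if pending[n] == 0))
    order = []
    while ready:
        link = ready.popleft()
        order.append(link)
        for u in sorted(upstream[link]):
            pending[u] -= 1
            if pending[u] == 0:
                ready.append(u)
    if len(order) != len(nodes):
        raise ValueError("the dependency graph contains a cycle")
    return order


# Hypothetical CDG: two flows that merge on link L4_7.
example_cdg = {"L0_1": ["L1_4"], "L1_4": ["L4_7"], "L3_4": ["L4_7"], "L4_7": ["L7_8"]}
print(analysis_order(example_cdg))
```

A cycle in the graph makes the sort fail, which is consistent with the earlier requirement of a deterministic, deadlock-free routing algorithm.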
Modeling the bursty traffic input
Generalized exponential (GE) distribution
[Figure: GE packet generator. From the packet generating point, the exponential branch is taken with probability $\tau$ and the direct branch with probability $1-\tau$.]
[Figure: packet injection rate (packets/cycle) versus time index for Poisson injection (CV=1), GE-type injection (CV=3) and GE-type injection (CV=4).]
GE-type traffic modeling
The GE-type cumulative distribution function (cdf) of the inter-arrival
time is: $F(t) = P(X \le t) = 1 - \tau e^{-\tau \lambda t}, \; t \ge 0$
where the parameter $\tau = \frac{2}{1 + C^2}$ and $C^2$ is the squared coefficient of variation.
In this work, we use the GE distribution to model the traffic input of
each flow, which is characterized by two parameters:
$\lambda$: the average packet arrival rate (packets/cycle)
$C$: the coefficient of variation of this traffic flow, i.e., $C = \sigma / (1/\lambda) = \lambda \sigma$, where $\sigma$ is the
standard deviation of the packet inter-arrival times and $1/\lambda$ is their mean.
Accordingly, the GE/G/1/K queuing model is used to analyze
the channel waiting time while taking the traffic burstiness into account.
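The GE inter-arrival model can be checked numerically. The Python sketch below is an illustration under the usual two-branch interpretation of the GE distribution (zero inter-arrival time with probability $1-\tau$, exponential with rate $\tau\lambda$ with probability $\tau$), which requires $C^2 \ge 1$; the function names, seed and parameter values are arbitrary choices for this example.

```python
import math
import random


def ge_tau(scv):
    """tau = 2 / (1 + C^2), where scv is the squared coefficient of variation."""
    return 2.0 / (1.0 + scv)


def ge_cdf(t, lam, scv):
    """F(t) = 1 - tau * exp(-tau * lam * t) for t >= 0."""
    if t < 0:
        return 0.0
    tau = ge_tau(scv)
    return 1.0 - tau * math.exp(-tau * lam * t)


def ge_sample(lam, scv, rng):
    """Draw one inter-arrival time from the two-branch GE generator."""
    if scv < 1.0:
        raise ValueError("the two-branch sampler assumes C^2 >= 1")
    tau = ge_tau(scv)
    if rng.random() >= tau:
        return 0.0                       # direct branch: packet arrives in a batch
    return rng.expovariate(tau * lam)    # exponential branch


rng = random.Random(2014)
lam, scv = 0.02, 9.0                     # CV = 3, as in one of the plotted cases
samples = [ge_sample(lam, scv, rng) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print("mean * lambda:", mean * lam, " estimated C^2:", var / mean ** 2)
```

The sample mean should be close to $1/\lambda$ and the estimated squared coefficient of variation close to $C^2$, which is what makes $(\lambda, C)$ sufficient to describe each flow.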
Flit transfer time 𝜼 calculation
The flit transfer time 𝜼 of link 𝒍𝒂𝒃 is defined as the time taken
for the header flit after being granted link access to reach the
buffer front in link 𝒍𝒂𝒃
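[Figure: a packet of $m$ flits crossing the crossbar onto the current link $l_i$ in the presence of a contention flow; $\eta_{l_i}$ is marked on the buffer of the link.]
Mean flit rate arriving at the buffer: $\lambda_{l_{ab}} = m \times \lambda^{packet}_{l_{ab}} = m \times \sum_{f \in F_{l_{ab}}} \lambda_f$
Mean time to serve a flit in this queuing system is the weighted average of the service
times of all flows $f$ passing through the link $l_{ab}$:
$s^{flit}_{l_{ab}} = \dfrac{\sum_{\forall f \in F_{l_{ab}}} \lambda_f \times \left( \frac{h_{l_{i+1}^f}}{m} + 1 \right)}{\sum_{\forall f \in F_{l_{ab}}} \lambda_f}$
where the term in parentheses is the mean service time for flow $f$.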
Illustration of path acquisition time
Service time in a wormhole NoC, used to obtain $h$:
[Figure: link contention followed by a finite-size buffer, modeled as an M/G/1/K+v queue; the waiting time W = H + Q is followed by the service time S.]
Waiting time W includes: 1) the link contention time H and 2) the time Q for
the header flit to reach the buffer head.
Service time S is bounded by the time at which the header reaches the node up to which
the accumulated buffer space can hold the whole packet (worm).
[Figure: a packet of $m$ flits on the current link $l_i$ with a contention flow; points A to D mark the header position on successive links, with per-link delays $\eta_{l_i}$, $\eta_{l_{i+1}} + h_{l_{i+1}}$, $\eta_{l_{i+2}} + h_{l_{i+2}}$ and the delay on $l_{i+3}$.]
Path acquisition time 𝒉 calculation
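Number of effective subsequent links of link $l$ with respect to the path $p$:
$\Lambda_p(l) = \begin{cases} m/B & \text{if } r(p,l) > m/B \\ r(p,l) & \text{otherwise} \end{cases}$
where $r(p,l)$ returns the number of remaining hops from link $l$ towards the
destination of path $p$.
The service time of link $l$ with respect to the path $p$ is [1]:
$s_{l_i}^f = \begin{cases} \dfrac{m (m + x_{l_i}^f) + 2 x_{l_i}^f m}{m + 2 x_{l_i}^f} & \text{if } x_{l_i}^f < m \\ \dfrac{m (m + x_{l_i}^f) + 2 (x_{l_i}^f)^2}{m + 2 x_{l_i}^f} & \text{otherwise} \end{cases}$
The channel service time of link $l_{ab}$:
$\bar{s}_{l_{ab}} = \sum_{\forall f \in F_{l_{ab}}} (\lambda_f \times s_{l_i}^f) \, / \sum_{\forall f \in F_{l_{ab}}} \lambda_f$
The squared coefficient of variation of the channel service time:
$C^2_{s_{l_{ab}}} = \dfrac{\overline{s^2_{l_{ab}}}}{\bar{s}_{l_{ab}}^2} - 1 = \left( \dfrac{\sum_{\forall f \in F_{l_{ab}}} \lambda_f \times (s_{l_i}^f)^2}{\sum_{\forall f \in F_{l_{ab}}} \lambda_f} \right) / \, \bar{s}_{l_{ab}}^2 - 1$
[1] P.-C. Hu, L. Kleinrock, An analytical model for wormhole routing with finite size input buffers,
15th International Teletraffic Congress, 1998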
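A short Python sketch of the per-flow service time and of the rate-weighted channel statistics may help to read these formulas. It assumes the piecewise expression is read with numerators $m(m+x) + 2xm$ and $m(m+x) + 2x^2$ over the common denominator $m + 2x$; the function names and the `(rate, service_time)` input layout are hypothetical.

```python
def link_service_time(m, x):
    """Piecewise service time of a link for one flow (m flits, quantity x)."""
    if x < m:
        numerator = m * (m + x) + 2 * x * m
    else:
        numerator = m * (m + x) + 2 * x ** 2
    return numerator / (m + 2 * x)


def channel_service_stats(flows):
    """Rate-weighted mean and squared coefficient of variation of the service time.

    flows: list of (lambda_f, s_f) pairs for the flows crossing the link.
    """
    total_rate = sum(rate for rate, _ in flows)
    mean = sum(rate * s for rate, s in flows) / total_rate
    second_moment = sum(rate * s * s for rate, s in flows) / total_rate
    scv = second_moment / mean ** 2 - 1.0
    return mean, scv


# Illustrative values only: two flows sharing one link.
flows = [(0.01, link_service_time(4, 0.0)), (0.01, link_service_time(4, 4.0))]
print(channel_service_stats(flows))
```

Under this reading the two branches agree at $x = m$, so the service time is continuous in $x$, and it reduces to $m$ when $x = 0$.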
GE/G/1/K queue based 𝒉 calculation
Diffusion approximation for the steady-state probability distribution $P_n$ of the
M/G/1/K queue with arrival rate $\lambda$ and service rate $\mu$:
$P_n = \begin{cases} c \times p'_n & (0 \le n \le K) \\ 1 - \dfrac{1 - c \left(1 - \frac{\lambda}{\mu}\right)}{\frac{\lambda}{\mu}} & (n = K + 1) \end{cases}$
where the normalization constant is $c = \left(1 - \frac{\lambda}{\mu} \left(1 - \sum_{j=0}^{K} p'_j\right)\right)^{-1}$ and $p'_n$ is the steady-state
probability of the M/G/1/∞ queue [2].
Applying Little's formula to obtain the waiting time: $h'_l = \left(\sum_{i=1}^{K+1} i \times P_i\right) / \lambda$
The arrival traffic burstiness is taken into account in the GE/G/1/K model by refining the results of the
M/G/1/K queue:
$h_{l_{ab}} = \dfrac{C^2_{s_{l_{ab}}} + C^2_{a_{l_{ab}}}}{1 + C^2_{s_{l_{ab}}}} \times h'_{l_{ab}}$
[2] M.C. Lai, et al., An accurate and efficient performance analysis approach based on queuing
model for Network on Chip. In Proceedings of ICCAD, 2009
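The chain from infinite-queue probabilities to the refined path acquisition time can be sketched in Python. This is an illustration only: the slides take $p'_n$ from the M/G/1/∞ results of [2], whereas the example below plugs in the M/M/1 probabilities $(1-\rho)\rho^n$ as a stand-in so that the output can be checked against the known M/M/1 finite-queue result. All function names are hypothetical.

```python
def finite_queue_probs(p_inf, rho):
    """Return [P_0, ..., P_{K+1}] from the infinite-queue probabilities.

    p_inf: [p'_0, ..., p'_K] of the corresponding infinite queue
    rho  : lambda / mu
    """
    c = 1.0 / (1.0 - rho * (1.0 - sum(p_inf)))         # normalization constant
    probs = [c * p for p in p_inf]                      # 0 <= n <= K
    probs.append(1.0 - (1.0 - c * (1.0 - rho)) / rho)   # n = K + 1
    return probs


def little_waiting_time(probs, lam):
    """h' = sum_{i=1}^{K+1} i * P_i / lambda."""
    return sum(i * p for i, p in enumerate(probs)) / lam


def ge_refine(h_prime, scv_arrival, scv_service):
    """h = (C_s^2 + C_a^2) / (1 + C_s^2) * h'."""
    return (scv_service + scv_arrival) / (1.0 + scv_service) * h_prime


# Stand-in example: M/M/1 probabilities, rho = 0.5, K = 3 (illustrative values).
lam, mu, K = 0.05, 0.1, 3
rho = lam / mu
p_inf = [(1.0 - rho) * rho ** n for n in range(K + 1)]
probs = finite_queue_probs(p_inf, rho)
h_prime = little_waiting_time(probs, lam)
print(probs, h_prime, ge_refine(h_prime, scv_arrival=9.0, scv_service=1.0))
```

With Poisson arrivals ($C_a^2 = 1$) the refinement factor equals one, so the GE/G/1/K result falls back to the M/G/1/K result.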
Source queuing time 𝒗𝒔
The source queue is modeled as a GE/G/1/∞ system:
$v_s = \dfrac{\bar{s}_{l_s}}{2} \left( 1 + \dfrac{C_a^2 + \lambda_a \times \frac{(\bar{s}_{l_s} - m)^2}{\bar{s}_{l_s}}}{1 - \lambda_a \times \bar{s}_{l_s}} \right) - \bar{s}_{l_s}$
where the arrival process is characterized by $(\lambda_a, C_a^2)$ in the GE-type traffic
model and the service time at the source is represented as $\bar{s}_{l_s}$.
[Figure: open-loop source queuing measurement setup. An application traffic generation trace drives the packet source (input timing), packets wait in an infinite source buffer (output timing) and then enter the network on chip; traffic terminals act as source and sink.]
Proposed NoC latency analysis flow
foreach $l_{ab} \in G$
    $(\lambda_{l_{ab}}, C^2_{a_{l_{ab}}})$ = traffic_model($F_{l_{ab}}$)
    $\eta_{l_{ab}}$ = calculate_transfer_time($\lambda_{l_{ab}}$, $s^{flit}_{l_{ab}}$, $m$, $B$)
    foreach $f \in F_{l_{ab}}$ and $l_i^f = l_{ab}$
        $s_{l_i}^f$ = calculate_link_service_time( )
    end
    $(\bar{s}_{l_{ab}}, C^2_{s_{l_{ab}}})$ = service_time( )
    if $a \ne b$    // the links between the routers
        $h_{l_{ab}}$ = GE_G_1_K_queue($\lambda_{l_{ab}}$, $C^2_{a_{l_{ab}}}$, $\bar{s}_{l_{ab}}$, $C^2_{s_{l_{ab}}}$, $K$)
    else    // the link is the source link
        $v_a$ = GE_G_1_queue($\lambda_{l_{ab}}$, $C^2_{a_{l_{ab}}}$, $\bar{s}_{l_{ab}}$, $C^2_{s_{l_{ab}}}$)
    endif
end
foreach $f \in F$
    $L_{s,d}$ = calculate_flow_latency( )
end
Steps of the flow:
Link dependency analysis to obtain the link order $G$
For each link $l_{ab}$ in $G$:
    Calculate the flit transfer time $\eta$
    Calculate the link service time $s$
    Compute the path acquisition time $h$
    Calculate the source queuing time $v$
Form the latency for each flow in the application
Simulation setup
The proposed analytical latency model is implemented in
MATLAB, and its accuracy is compared against the BookSim simulator.
Each router takes two cycles to route a flit, and the link traversal
stage takes one additional cycle.
Different buffer depth (𝑩 flits) and packet length (𝒎 flits) combinations are evaluated.
Both synthetic and real applications are adopted:
Random and shuffle traffic on 8 × 8 and 12 × 12 meshes
MMS (Multimedia system)
DVOPD (Video object plane decoder)
MPEG4 (MPEG decoder)
SPECweb99 applications
Evaluation under random traffic patterns
The proposed latency model works for a variety of buffer depth
and packet size combinations.
For random traffic, errors of about 5.2%-9.9% are introduced in
predicting the network saturation point.
[Figure: latency (cycles) versus packet injection rate (packets/cycle), simulation versus proposed model, for (B=3, m=14), (B=4, m=9) and (B=9, m=4). Panels: 8 × 8 mesh, random traffic; 12 × 12 mesh, random traffic.]
Evaluation under shuffle traffic patterns
For traffic patterns such as shuffle, a slightly larger error
(10.8%-13%) is introduced due to the uneven traffic arrival
rates across the channels.
Overall, the analytical model achieves a 70X speedup over
simulation for both traffic patterns.
[Figure: latency (cycles) versus packet injection rate (packets/cycle), simulation versus proposed model, for (B=3, m=14), (B=4, m=9) and (B=9, m=4). Panels: 8 × 8 mesh, shuffle traffic; 12 × 12 mesh, shuffle traffic.]
Evaluation under burst and real traffic
Comparison of Poisson and GE-type traffic injection:
[Figure: packet injection rate (packets/cycle) versus time index for Poisson injection (CV=1), GE-type injection (CV=3) and GE-type injection (CV=4).]
[Figure: latency (cycles) versus packet injection rate (packets/cycle), simulation versus proposed model, for Poisson injection (CV=1) and GE-type injection (CV=3, CV=4).]
Evaluation under real application traces:
[Figure: end-to-end delay (cycles) per flow index in the DVOPD application, simulation versus proposed model.]
[Figure: end-to-end delay (cycles) for DVOPD, VOPD, MMS, MPEG4, Apache, Ocean, Oracle, DVB2 and Sparse, simulation versus proposed model.]
Conclusion
In this work, we propose a new NoC latency model which
generalizes the previous work by modeling:
The arrival traffic burstiness
The general service time distribution
The finite buffer depth and arbitrary packet length combinations
A link dependency analysis technique is proposed to
determine the order of applying queuing analysis
The accuracy of the model is demonstrated using both the
synthetic traffic and real applications.
A 70X speedup over simulation is achieved with at most 13%
error in the proposed analytical model, which benefits the NoC
synthesis process.
Thank you!!
Q&A