Cost Effective Centralized Adaptive Routing for Networks on Chip
Ran Manevich, Technion
![Page 1: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/1.jpg)
May 2, 2011 1
A Cost Effective Centralized Adaptive Routing for Networks on Chip
Ran Manevich*, Israel Cidon*, Avinoam Kolodny*, Isask’har (Zigi) Walter* and Shmuel Wimer#
*Technion – Israel Institute of Technology
#Bar-Ilan University
QNoC Research Group
![Page 2: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/2.jpg)
Networks-on-Chip (NoCs)
![Page 3: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/3.jpg)
Global traffic information is essential to make the right decision!
![Page 4: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/4.jpg)
Adaptive Routing in NoCs – Local vs. Global Information
2D Mesh NoC
Legend: Low Congestion, Medium Congestion, High Congestion
A packet routed from the upper left to the bottom right corner utilizing local congestion information.
The same packet routed using global information.
Callout: "I CAN MAKE IT!!!"
Figure labels: Source, Destination
![Page 5: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/5.jpg)
Route Selection - ATDOR
ATDOR - Adaptive Toggle Dimension Ordered Routing
Keep it simple! Centralized selection:
Routing tables in sources. One bit per destination: XY or YX.
The option with the less congested bottleneck link is preferred.
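The selection described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: routers are addressed by (x, y) mesh coordinates, per-link loads are given as a dictionary, and ties keep XY. The names `dor_links`, `bottleneck` and `select_bit` are hypothetical and do not come from the slides.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]    # (x, y) router coordinates (assumed convention)
Link = Tuple[Node, Node]  # directed link between two adjacent routers


def dor_links(src: Node, dst: Node, x_first: bool) -> List[Link]:
    """Links used by dimension-ordered routing: XY if x_first, otherwise YX."""
    links: List[Link] = []
    cur = src
    for axis in ((0, 1) if x_first else (1, 0)):
        while cur[axis] != dst[axis]:
            step = 1 if dst[axis] > cur[axis] else -1
            nxt = (cur[0] + step, cur[1]) if axis == 0 else (cur[0], cur[1] + step)
            links.append((cur, nxt))
            cur = nxt
    return links


def bottleneck(links: List[Link], load: Dict[Link, float]) -> float:
    """Load of the most loaded link on a path (0.0 for unknown links)."""
    return max((load.get(link, 0.0) for link in links), default=0.0)


def select_bit(src: Node, dst: Node, load: Dict[Link, float]) -> int:
    """Routing-table bit for one destination: 0 = XY, 1 = YX.

    The route with the less congested bottleneck link wins; a tie keeps XY
    (tie-breaking is an assumption of this sketch).
    """
    xy = bottleneck(dor_links(src, dst, True), load)
    yx = bottleneck(dor_links(src, dst, False), load)
    return 1 if yx < xy else 0


# One congested link on the XY path pushes the flow to YX.
load_map = {((0, 0), (1, 0)): 0.9}
print(select_bit((0, 0), (2, 2), load_map))
```

Storing a single bit per destination keeps the source routing table small, which is the point of restricting the choice to XY or YX.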
![Page 6: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/6.jpg)
ATDOR Illustration 1
Five identical flows, 100 MB/s each.
Links modeled as M/M/1 queues. Delay of a single link:
$$D_{LINK} = \frac{\text{Traffic}}{\text{Capacity} - \text{Traffic}}$$
Link capacity is 210 MB/s.
Initial routing - XY
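A small Python sketch of the link model, assuming the M/M/1 expression on this slide has the form Traffic / (Capacity - Traffic). The function name `link_delay` is illustrative. The value is the raw result of that expression, so it is not expected to reproduce the nanosecond figures shown on later slides, which depend on parameters not given here.

```python
import math


def link_delay(traffic: float, capacity: float) -> float:
    """Traffic / (Capacity - Traffic); infinite once the link is saturated."""
    if traffic >= capacity:
        return math.inf
    return traffic / (capacity - traffic)


CAPACITY_MBPS = 210.0  # link capacity from the slide
FLOW_MBPS = 100.0      # each of the five identical flows

# Two flows still fit on one link (200 < 210); three flows saturate it.
for flows_on_link in (1, 2, 3):
    print(flows_on_link, link_delay(flows_on_link * FLOW_MBPS, CAPACITY_MBPS))
```

With these numbers a link shared by three flows is overloaded, which is why the average delay in the illustration starts at ∞ under the initial XY routing.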
![Page 7: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/7.jpg)
Centralized Routing – How?
• Option 1 – Continuous calculation of optimal routing for the active sessions:
Achievable load balancing
Speed and computation complexity
System complexity
![Page 8: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/8.jpg)
Centralized Routing – How?
• Option 2 – Iterative serial selection based on traffic load measurements between XY and YX for all source-destination pairs:
Achievable load balancing
Speed and computation complexity
System complexity
![Page 9: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/9.jpg)
ATDOR Illustration 1
Average Delay: ∞
Step 1, re-routed flow: 1->15
Step 2, re-routed flow: 2->8
Average Delay: 37 ns
Step 3, re-routed flow: 2->15
Average Delay: 22 ns
![Page 10: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/10.jpg)
What did we just see?
For each flow we:
1. Calculated the better route.
2. Updated the routing table of the source.
3. Waited for the update to take effect and measured the global traffic load.
Performing steps 1-3 for each flow is slow and not scalable.
Steps 2 and 3 are unified for all destinations of a single source:
Achievable load balancing
Speed and computation complexity
Scalability
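The batching of steps 2 and 3 per source can be sketched as below. This is a hedged illustration, not the authors' controller: `choose_route`, `write_table` and `measure_load` are hypothetical callbacks standing in for the route selection, the routing-table update and the global load measurement.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Flow = Tuple[Hashable, Hashable]  # (source, destination)


def control_iteration(
    flows: Iterable[Flow],
    choose_route: Callable[[Hashable, Hashable, object], int],
    write_table: Callable[[Hashable, Dict[Hashable, int]], None],
    measure_load: Callable[[], object],
) -> object:
    """One pass over all sources, with steps 2 and 3 done once per source."""
    by_source: Dict[Hashable, List[Hashable]] = {}
    for src, dst in flows:
        by_source.setdefault(src, []).append(dst)

    load = measure_load()
    for src, dsts in by_source.items():
        # Step 1 for every destination of this source, on the same load snapshot.
        new_bits = {dst: choose_route(src, dst, load) for dst in dsts}
        write_table(src, new_bits)  # step 2, once per source
        load = measure_load()       # step 3, once per source
    return load


# Tiny dry run with stub callbacks that only record what happens.
writes = []
control_iteration(
    [(1, 15), (2, 8), (2, 15), (4, 15)],
    choose_route=lambda s, d, load: 0,
    write_table=lambda s, bits: writes.append((s, bits)),
    measure_load=lambda: {},
)
print(writes)
```

For the four flows in the dry run, the table is written three times (once per source) instead of four times (once per flow).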
![Page 11: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/11.jpg)
Back to illustration 1
Average Delay: ∞
Step 1, re-routed flow: 1->15
Average Delay: 22 ns
Step 2, re-routed flows: 2->8, 2->15
Step 3, re-routed flow: 4->15
Average Delay: 22 ns
Step 4, re-routed flow: 1->15
Average Delay: 22 ns
Step 5, re-routed flows: 2->8, 2->15
Average Delay: ∞
![Page 12: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/12.jpg)
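Problem #1
Changing the routing may increase congestion and cause fluctuations.
Solution: Change the routing only if the alternative is better by the margin α, 0 < α < 1:
If Current Route = XY:
$$\text{NextRoute} = \begin{cases} YX & \text{if } \mathrm{MAX}[\text{Load}_{YX}] \le \alpha \cdot \mathrm{MAX}[\text{Load}_{XY}] \\ XY & \text{if } \mathrm{MAX}[\text{Load}_{YX}] > \alpha \cdot \mathrm{MAX}[\text{Load}_{XY}] \end{cases}$$
Else if Current Route = YX:
$$\text{NextRoute} = \begin{cases} XY & \text{if } \mathrm{MAX}[\text{Load}_{XY}] \le \alpha \cdot \mathrm{MAX}[\text{Load}_{YX}] \\ YX & \text{if } \mathrm{MAX}[\text{Load}_{XY}] > \alpha \cdot \mathrm{MAX}[\text{Load}_{YX}] \end{cases}$$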
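The margin rule can be written as a short Python function. This is a sketch under one assumption: the comparison that switches routes is taken as "less than or equal", the complement of the ">" case that keeps the current route. The name `next_route` and its parameters are illustrative.

```python
def next_route(current: str, max_load_xy: float, max_load_yx: float, alpha: float) -> str:
    """Toggle between XY and YX only if the alternative's bottleneck load
    is at most alpha times the current route's bottleneck load."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    if current == "XY":
        return "YX" if max_load_yx <= alpha * max_load_xy else "XY"
    if current == "YX":
        return "XY" if max_load_xy <= alpha * max_load_yx else "YX"
    raise ValueError("current must be 'XY' or 'YX'")


# The alternative is better, but not by the required margin: keep XY.
print(next_route("XY", max_load_xy=1.0, max_load_yx=0.9, alpha=0.75))
# The alternative is better by more than the margin: switch to YX.
print(next_route("XY", max_load_xy=1.0, max_load_yx=0.7, alpha=0.75))
```

A smaller α demands a larger improvement before a flow is moved, which damps oscillation at the price of leaving some imbalance uncorrected.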
![Page 13: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/13.jpg)
ATDOR Illustration 2
Average Delay: ∞
Step 1, re-routed flows: 1->14, 1->15, 1->16
Average Delay: ∞
Step 2, re-routed flows: 1->14, 1->15, 1->16
Step 3, re-routed flows: 1->14, 1->15, 1->16
![Page 14: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/14.jpg)
Problem #2
Coupling among flows sharing the same source.
Solution: Re-routing counters $C_{I,J}$ count routing changes of flows from source I to destination J ($F_{I,J}$). When $C_{I,J}$ reaches a limit $L_{I,J}$, the routing of $F_{I,J}$ is locked. A possible definition of the limits $L_{I,J}$:
$$L_{I,J} = (I + J) \bmod 3$$
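A minimal Python sketch of the re-routing counters, taking the limit as (I + J) mod 3. That reading is an assumption, but it matches the "changes left" values 0, 1 and 2 shown for flows 1->14, 1->15 and 1->16 in the illustration that follows. The class and method names are hypothetical.

```python
from typing import Dict, Tuple


def reroute_limit(i: int, j: int) -> int:
    """Limit L[I,J] on routing changes for the flow from source i to destination j."""
    return (i + j) % 3


class RerouteCounters:
    """Counters C[I,J]; a flow is locked once its counter reaches its limit."""

    def __init__(self) -> None:
        self.count: Dict[Tuple[int, int], int] = {}

    def changes_left(self, i: int, j: int) -> int:
        return reroute_limit(i, j) - self.count.get((i, j), 0)

    def try_reroute(self, i: int, j: int) -> bool:
        """Record one routing change if the flow is not locked yet."""
        if self.changes_left(i, j) <= 0:
            return False
        self.count[(i, j)] = self.count.get((i, j), 0) + 1
        return True


counters = RerouteCounters()
for dst in (14, 15, 16):
    print(f"1->{dst}: changes left = {counters.changes_left(1, dst)}")
```

Because flows from the same source get different limits, they stop toggling at different times, which breaks the lock-step behavior caused by the coupling.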
![Page 15: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/15.jpg)
Back to illustration 2
$$L_{I,J} = (I + J) \bmod 3$$
R. Changes Left: 1->16: 2, 1->15: 1, 1->14: 0
Average Delay: ∞
R. Changes Left: 1->16: 1, 1->15: 0, 1->14: 0
Average Delay: 73 ns
R. Changes Left: 1->16: 0, 1->15: 0, 1->14: 0
Average Delay: 22 ns
![Page 16: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/16.jpg)
Bring it all together
$$L_{I,J} = (I + J) \bmod 3$$
R. Changes Left: 1->15: 1, 2->8: 1, 2->15: 2, 4->15: 1
Average Delay: ∞
R. Changes Left: 1->15: 0, 2->8: 1, 2->15: 2, 4->15: 1
R. Changes Left: 1->15: 0, 2->8: 0, 2->15: 1, 4->15: 1
Average Delay: 22 ns
R. Changes Left: 1->15: 0, 2->8: 0, 2->15: 1, 4->15: 0
Average Delay: 22 ns
Average Delay: 14 ns
R. Changes Left: 1->15: 0, 2->8: 0, 2->15: 0, 4->15: 0
![Page 17: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/17.jpg)
Centralized Adaptive Routing for NoCs - Architecture
Traffic load measurements aggregation into Traffic Load Maps.
Routing control.
Local traffic load measurements inside the routers.
![Page 18: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/18.jpg)
Load Measurements Aggregation
An illustration of aggregation of load values in a 4X4 2D mesh.
A congestion value is written to each traffic load map every clock cycle.
![Page 19: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/19.jpg)
ATDOR – Route Selection Circuit
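• Combinatorial pipelined implementation.
• Result every ATDOR clock cycle.
• Maximally loaded links of the two alternatives are compared. Next route, with 0 < α < 1:
If Current Route = XY:
$$\text{NextRoute} = \begin{cases} YX & \text{if } \mathrm{MAX}[\text{Load}_{YX}] \le \alpha \cdot \mathrm{MAX}[\text{Load}_{XY}] \\ XY & \text{if } \mathrm{MAX}[\text{Load}_{YX}] > \alpha \cdot \mathrm{MAX}[\text{Load}_{XY}] \end{cases}$$
Else if Current Route = YX:
$$\text{NextRoute} = \begin{cases} XY & \text{if } \mathrm{MAX}[\text{Load}_{XY}] \le \alpha \cdot \mathrm{MAX}[\text{Load}_{YX}] \\ YX & \text{if } \mathrm{MAX}[\text{Load}_{XY}] > \alpha \cdot \mathrm{MAX}[\text{Load}_{YX}] \end{cases}$$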
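The slides do not show how the circuit evaluates the comparison. One hardware-friendly possibility, sketched here in Python purely as an assumption, applies when α has the form 1 - 2^-k (for example 3/4 with k = 2, or 15/16 with k = 4): the product with α reduces to a shift and a subtraction on integer load values. The function names are hypothetical, and "less than or equal" is again assumed as the toggle condition.

```python
def should_toggle(load_alt: int, load_cur: int, k: int) -> bool:
    """True when load_alt <= (1 - 2**-k) * load_cur.

    Both sides are scaled by 2**k so the test is exact in integer
    arithmetic and needs only shifts and one subtraction.
    """
    return (load_alt << k) <= (load_cur << k) - load_cur


def next_route_fixed_point(current: str, max_xy: int, max_yx: int, k: int) -> str:
    """Margin rule with alpha = 1 - 2**-k on integer bottleneck loads."""
    if current == "XY":
        return "YX" if should_toggle(max_yx, max_xy, k) else "XY"
    if current == "YX":
        return "XY" if should_toggle(max_xy, max_yx, k) else "YX"
    raise ValueError("current must be 'XY' or 'YX'")


# alpha = 15/16 (k = 4): 15 <= 15/16 * 16 holds, 16 <= 15/16 * 17 does not.
print(next_route_fixed_point("XY", max_xy=16, max_yx=15, k=4))
print(next_route_fixed_point("XY", max_xy=17, max_yx=16, k=4))
```

Avoiding a multiplier keeps such a comparison cheap enough to fit in a single pipeline stage, though whether the actual design does this is not stated in the slides.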
![Page 20: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/20.jpg)
Hardware Requirements
The whole mechanism was implemented on an xc5vlx50t Virtex-5 FPGA.
Area was estimated for a 45 nm technology node.
Per-router hardware overheads in % for a NoC with typical-size (50 KGates) virtual channel routers.
![Page 21: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/21.jpg)
Average Packet Delay – Uniform Traffic
• Average delay vs. average link load normalized to link capacity. 8X8 2D Mesh. Uniform traffic pattern.
![Page 22: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/22.jpg)
Average Packet Delay – Transpose Traffic
• Average delay vs. average link load normalized to link capacity. 8X8 2D Mesh. Transpose traffic pattern.
![Page 23: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/23.jpg)
Average Packet Delay – Hotspot Traffic
• Average delay vs. average link load normalized to link capacity. 8X8 2D Mesh. Traffic pattern with 4 hotspots.
![Page 24: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/24.jpg)
Control Iteration Duration
• Number of re-routed flows vs. time.
• 8X8 2D Mesh, ATDOR clock of 100 MHz.
Plots shown for α = 15/16 and α = 3/4.
![Page 25: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/25.jpg)
CMP DNUCA - Architecture
• 8X8 CMP DNUCA (Dynamic Non Uniform Cache Array) with 8 CPUs and 56 cache banks:
![Page 26: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/26.jpg)
CMP DNUCA – Saturation Throughput
• Saturation throughput - Splash 2 and Parsec benchmarks on 8X8 CMP DNUCA with 8 CPUs and 56 cache banks:
![Page 27: Cost Effective centralized adpative routing for networks on chip](https://reader035.vdocuments.us/reader035/viewer/2022070302/548c3054b47959ae538b47b9/html5/thumbnails/27.jpg)
Conclusions
• Centralized adaptive routing is feasible for NoCs.
• ATDOR: Centralized selection between XY and YX for each source-destination pair.
• Hardware overhead: <4% of a typical 8X8 NoC.
• Average saturation throughput improvement:

| | Vs. RCA | Vs. O1TURN |
|---|---|---|
| Synthetic Patterns | 12.1% | 19.3% |
| Splash 2 and Parsec Benchmarks | 12.8% | 22.8% |