hybrid network on chip (hnoc): local buses with a global mesh
TRANSCRIPT
June 13, 2010 Slide: 1
University of New MexicoDepartment of Electrical and Computer Engineering
* Department of Computer ScienceAlbuquerque, NM 87131
Hybrid Network on Chip (HNoC):Local Buses with a Global Mesh Architecture
Payman Zarkesh-Ha, George B. P. Bezerra*, Stephanie Forrest*, and Melanie Moses*
June 13, 2010 Slide: 2
1. Overview of Hybrid NoC Architectures
2. Proposed HNoC: Local Buses with Global Mesh
3. HNoC Throughput Assessment using Rent’s Rule
4. HNoC Energy Assessment using Rent’s Rule
5. Simulation Results
6. Discussions
Outline
June 13, 2010 Slide: 3
Some Network on a Chip Configurations
Mesh Concentrated Mesh Concentrated Torus
June 13, 2010 Slide: 4
Existing Hybrid NoC Configuration Chip 1
V. Rantala et. al. [DAC ‘08]
Hybrid Mesh-Ring Topology
Mesh is used for local network
Ring is used for global network
Improves throughput and reduces delay
June 13, 2010 Slide: 5
A. Shacham et. al. [NOCS ‘07]
Hybrid Photonic NoC
Electronic network carries While the electronic network carries small-size control (and data) packets
The photonic network transfers large-size data messages
Can support energy-efficient high-bandwidth data transfers in 3D chips
Existing Hybrid NoC Configuration Chip 2
June 13, 2010 Slide: 6
Y. Wang et. al. [ICCAD ‘07]
Hybrid Wireless-Wired NoC Topology
RF interconnect acts as an information ”highway” enabling fast data transport across longer distances on the chip
Can potentially reduce overall communication power and latency
Existing Hybrid NoC Configuration Chip 3
June 13, 2010 Slide: 7
L. Carloni et. al. [NOCS ‘09]
Hybrid 3D NoC-Bus Topology
NoC on each stack
Dedicated bus between stacks
Provides performance and area benefits
Good for two planes, but not efficient for multiple stacks
Existing Hybrid NoC Configuration Chip 4
June 13, 2010 Slide: 8
Proposed Hybrid Network on a Chip (HNoC)
Mesh NOCMesh NOC
Conventional NOC Hybrid NOC (HNOC)
Mesh NOCLocal Bus
June 13, 2010 Slide: 9
Proposed Hybrid Network on a Chip (HNoC)
Standard NoC topology (e.g., mesh) is used for packet-based global interconnections and local buses for nearest-neighbor communications
HNoC uses local buses to transmit data directly to the nearest neighbors in a parallel fashion, which eliminates the need for serializer, router, and deserializer
Since the local bus interconnects are short, they inherently exhibit lower loss and therefore can provide higher bandwidth and consume less power.
June 13, 2010 Slide: 10
How to use Rent’s Rule to Evaluate HNoC?
pkNT =k and p are empirical constants
T = # of IO’s
A System ofN gates
B = CommunicationBandwidth
A Block ofN nodes
pbNB =b and p are empirical constants
June 13, 2010 Slide: 11
Probability of Communication
Probability of communication represents how probable is a block to communicate with another block within a certain distance
Similar to wire length distribution model, communication probability distribution (CPD) can be derived using Rent’s Rule
June 13, 2010 Slide: 12
Communication Probability Distribution (CPD)
( )[ ] ( )[ ] ( )[ ] ( )[ ]{ }pppp ddddddddd
dfdCPD 111111)()( ++−++−−−+Γ
=
( )
( ) ( )
−≤≤−+−−+−
<≤−+−=
22,1432112
32
3
1,163
23)(
2223
223
NdNNNNdNdd
NdNdNdd
df
CPD is the communication probability of distance d in an N×N multiprocessor system, p is the Rent’s exponent, and Γ is the normalization coefficient.
June 13, 2010 Slide: 13
Example: Probability of Communication
7.8E-01
1.4E-01
5.0E-022.0E-02 7.1E-03 2.1E-03 5.5E-04 9.5E-05
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 5 6 7 8
Array size: 5x5 Rent exponent: p=0.6
Distance (unit processor pitch)
Com
mun
icat
ion
Prop
erty
Dis
trib
utio
n (C
PD)
June 13, 2010 Slide: 14
Observation from CPD
7.8E-01
1.4E-01
5.0E-022.0E-02 7.1E-03 2.1E-03 5.5E-04 9.5E-05
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 5 6 7 8
Distance (unit processor)
Prob
abili
ty o
f Com
mun
icat
ion
• In an standard NOC architecture of 5x5 processor array, about 78% of communications are with the nearest neighbors!
• If the nearest neighbor communication can be lifted from the network traffic, it can tremendously improve the NOC bandwidth utilization and its throughput
June 13, 2010 Slide: 15
Observation from CPD
• The proposed HNoC uses a “local bus” for the nearest neighbor communications and the global NOC for the longer communication only
• Therefore, the system will consist of two hybrid networks; 1) local bus (like in systolic arrays), and 2) global mesh NOC
• This is similar to the concept of “local wires” and “global wires” in standard VLSI designs
• Assuming that the throughput is still limited by the global NOC, the hybrid NOC can potentially improve throughput by a factor of 4.6X in a system of 5x5 processors
June 13, 2010 Slide: 16
CPD for Various Rent’s Exponents
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
1 2 3 4 5 6 7 8
p=0.1 p=0.5 p=0.9
Array size: 5x5 Rent exponent: p=0.1 to 0.9
Distance (unit processor pitch)
Com
mun
icat
ion
Prop
erty
Dis
trib
utio
n (C
PD)
June 13, 2010 Slide: 17
Maximum Throughput Improvement in HNoC
7.2 X
5.4 X4.6 X 4.2 X 3.9 X 3.7 X 3.5 X 3.4 X
18.5 X
0 X
2 X
4 X
6 X
8 X
10 X
12 X
14 X
16 X
18 X
20 X
2x2 3x3 4x4 5x5 6x6 7x7 8x8 9x9 10x10
Processor Array Size
Thro
uput
Impr
ovem
ent
Assumed that p = 0.6
( )111
CPDNOCHNOC
Throughput
Throughput
−≈
June 13, 2010 Slide: 18
3.8 X
2.8 X2.4 X 2.2 X 2.0 X 1.9 X 1.8 X 1.8 X
9.7 X
0 X
1 X
2 X
3 X
4 X
5 X
6 X
7 X
8 X
9 X
10 X
2x2 3x3 4x4 5x5 6x6 7x7 8x8 9x9 10x10
Processor Array Size
Ene
rgy
Red
uctio
nMaximum Energy Improvement in HNoC
Assumed that p = 0.6
( )
( )∑
∑−
=
−
=
⋅
⋅≈ 22
2
22
1N
d
N
d
Energy
Energy
dCPDd
dCPDd
HNOCNOC
June 13, 2010 Slide: 19
Throughput Simulation Results using Orion
System Parameters ValuesNumber of Cores 64Die Size 1cm x 1cmTechnology Node 45 nmClock Frequency 1 GHzFlit Size 64 bitsPacket Size 5 flitsRent's Exponent, p 0.60
0
5
10
15
20
25
30
35
0 0.2 0.4 0.6 0.8
Injection Rate [Packet/Cycle]
Thro
ughp
ut [P
acke
t/Cyc
le]
2.6X
Mesh
HNOC
June 13, 2010 Slide: 20
Energy Simulation Results using Orion
System Parameters ValuesNumber of Cores 64Die Size 1cm x 1cmTechnology Node 45 nmClock Frequency 1 GHzFlit Size 64 bitsPacket Size 5 flitsRent's Exponent, p 0.60
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0 0.2 0.4 0.6 0.8
Injection Rate [Packet/Cycle]
Ener
gy p
er P
acke
t [uJ
/Pak
et]
÷1.8
Mesh
HNOC
June 13, 2010 Slide: 21
Discussions
Similar to EDA tools, compilers must take locality into account to achieve traffic localization with optimized program mapping and task assignment
The proposed HNoC architecture can significantly improve the energy usage and performance of the system by directing the local communications through the low-latency, high-bandwidth, and low-power local buses and leaving the global communications to the standard NoC topology
In practice, however, achieving this locality may be challenging. The compiler needs to be able to map the program such that the neighboring threads are mapped onto neighboring network cores. Moreover, runtime re-mapping and significant data movement maybe required, when the local buses in HNoC may not be able to provide significant support
However, on average HNoC will indirectly support long-distance communication by removing the local communication traffic from the mesh NoC, leaving the mesh NoC fully dedicated to long-distance traffic