![Page 1: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/1.jpg)
Throughput-Effective On-Chip Networks for Manycore Accelerators
Ali Bakhoda, John Kim¹ and Tor M. Aamodt
¹ KAIST, Korea
![Page 2: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/2.jpg)
2
Manycore Accelerators and NoC
• Manycore accelerators
– Prevalent example: high-end GPUs
– 10s of thousands of threads running at the same time
– Bulk Synchronous Parallel programming style
– 3 of the top 5 supercomputers (based on the Nov. 2010 Top500 list)
– Primary goal: higher application-level throughput
• NoC in accelerators
– Needs a different perspective from CPUs
– Not very well studied in this context
![Page 3: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/3.jpg)
3
The Need for Throughput-Effective NoCs
[Figure: (Chip Area)⁻¹ [1/mm²] (y-axis, 0.0012 to 0.0020) versus Average Throughput [IPC] (x-axis, 190 to 310), with contour lines of constant throughput per area from 0.30 to 0.55 IPC/mm². The Ideal NoC design point is marked; arrows indicate the directions of less area and higher throughput.]
Throughput-Effective design: Improves application level performance per unit chip area
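The metric behind the IPC/mm² contours can be sketched in a few lines. This is a minimal illustration, assuming throughput is measured as average IPC and area as total chip area in mm²; the function name and the two design points are hypothetical and are not results from the talk.

```python
def throughput_per_area(ipc: float, chip_area_mm2: float) -> float:
    """Application-level throughput per unit chip area, in IPC/mm^2."""
    if chip_area_mm2 <= 0:
        raise ValueError("chip area must be positive")
    return ipc / chip_area_mm2


# Hypothetical design points: name -> (average IPC, chip area in mm^2).
designs = {
    "design_a": (230.0, 575.0),
    "design_b": (250.0, 500.0),
}

# A design is more throughput-effective if it delivers more IPC per mm^2.
scores = {name: throughput_per_area(ipc, area) for name, (ipc, area) in designs.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

On the plot, a design moves to a higher contour either by raising IPC at the same area or by shrinking area at the same IPC.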
![Page 4: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/4.jpg)
4
Contributions
• Study impact of NoC on application level performance
– Traditional improvements (router latency reduction): minimal impact on application level performance
– Increasing channel width: high performance gain + high area cost
• Consider application level throughput per unit area of NoC
• Throughput correlated with injection rate of few nodes
– Many-to-few-to-many traffic pattern
• Propose Throughput-Effective NoC design
– Checkerboard network
– Multi-port router structure
![Page 5: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/5.jpg)
5
Outline
• Introduction
• Baseline architecture
• NoC properties in accelerators
• Throughput-Effective NoC design
• Experimental results
• Conclusion
![Page 6: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/6.jpg)
6
Accelerator Overview
[Figure: accelerator block diagram. Compute cores connect through the Network-On-Chip to memory-controller nodes (MC+L2), each attached to GDDR memory. Labels in the compute-core detail: DispatchQueue, WaitingQueue, MemMiss.]
![Page 7: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/7.jpg)
7
Baseline Network
• Mesh with MCs at periphery of the chip
– Similar to Tilera’s TILE64 or Intel’s 80-core Teraflops chip
– Simple and scalable
• Dimension Order Routing
• Virtual Channel Flow Control
• 4-cycle routers
[Figure: accelerator block diagram with compute cores, Network-On-Chip, MC+L2 nodes and GDDR memory.]
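Dimension-order routing on the baseline mesh can be sketched as follows. This is a minimal illustration that assumes (x, y) node coordinates and routes fully in X before turning into Y; the function name is hypothetical and the sketch ignores virtual channels and router pipeline timing.

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst, X dimension first."""
    x, y = src
    path = [(x, y)]
    # Phase 1: travel along the X dimension until the destination column.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Phase 2: travel along the Y dimension until the destination row.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

Because every packet makes at most one turn, and always from X into Y, the route is deterministic and minimal.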
![Page 8: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/8.jpg)
8
Finding a Balanced Design
[Figure: Application Level Throughput and Application Level Throughput/Area (y-axis ticks 0.50 to 1.00) versus the bandwidth limit of an ideal interconnect, expressed as a fraction of off-chip DRAM bandwidth (x-axis ticks 0.2 to 1.6). The bisection bandwidth of the baseline mesh is marked.]
![Page 9: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/9.jpg)
9
Gap between Balanced Mesh and Ideal NoC
[Figure: (Chip Area)⁻¹ [1/mm²] (0.0012 to 0.0020) versus Average Throughput [IPC] (190 to 310), with contour lines from 0.30 to 0.55 IPC/mm². Design points shown: Balanced Mesh and Ideal NoC; arrows indicate less area and higher throughput.]
![Page 11: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/11.jpg)
11
NoC properties in ManyCore Accelerators
• Router latency has minimal impact on application level throughput
– Aggressive 1-cycle routers instead of 4-cycle routers
– Only 2.3% application level speedup
• Channel bandwidth is very important
– 27% speedup by doubling BW
– But quadratic area increase
[Figure: bar chart of HM speedup (y-axis ticks 0% and 20%) for 1-cycle routers and 2x BW.]
![Page 12: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/12.jpg)
12
2x Channel Bandwidth
[Figure: (Chip Area)⁻¹ [1/mm²] (0.0012 to 0.0020) versus Average Throughput [IPC] (190 to 310), with contour lines from 0.30 to 0.55 IPC/mm². Design points shown: Balanced Mesh, 2x BW and Ideal NoC; arrows indicate less area and higher throughput.]
![Page 13: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/13.jpg)
13
Many-to-Few-to-Many Traffic Pattern
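[Figure: cores C0, C1, C2, ..., Cn send requests through the request network to memory controllers MC0, MC1, ..., MCm, which send replies through the reply network back to the cores. The MC injection bandwidth is highlighted.]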
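A back-of-the-envelope model shows why MC injection bandwidth becomes the bottleneck under many-to-few-to-many traffic. This is a sketch under stated assumptions: requests spread uniformly over the MCs, every request produces one reply, and replies are larger than requests. The core count, MC count, rate and flit sizes below are hypothetical values chosen for illustration, not figures from the talk.

```python
def mc_loads(num_cores, num_mcs, req_rate, req_flits, reply_flits):
    """Return (ejection load, injection load) per MC in flits/cycle.

    req_rate is the request rate of one core in requests/cycle.
    Assumes uniform distribution of requests and one reply per request.
    """
    reqs_per_mc = num_cores * req_rate / num_mcs
    return reqs_per_mc * req_flits, reqs_per_mc * reply_flits


# Hypothetical configuration: 32 cores, 8 MCs, 1-flit read requests,
# 4-flit read replies, 0.05 requests per cycle per core.
eject, inject = mc_loads(32, 8, 0.05, 1, 4)
print(f"per-MC ejection {eject:.2f} flits/cycle, injection {inject:.2f} flits/cycle")
```

In this example each core injects 0.05 flits/cycle while each MC must inject 0.8 flits/cycle, so a single injection port at an MC router saturates long before any core does.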
![Page 15: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/15.jpg)
15
Throughput-Effective Network design
Throughput-Effective
• Reduce Area
– Checkerboard Routing
– Channel Slicing
• Increase Performance
– Checkerboard Placement
– Multi-Port routers at MCs
![Page 16: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/16.jpg)
16
Checkerboard Routing: Half-Routers
• Half-Routers
– No turns allowed at half-routers
– Limited connectivity
– Saves ~50% of router crossbar area
• Full-Routers: normal routers with complete connectivity
• Use Half-Routers at every other node
[Figure: half-router connectivity, showing the Injection, Ejection, North, South, East and West ports; mesh diagram with legend: Half Router, Full Router.]
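The half-router restriction can be written as a connectivity predicate. This is a sketch under assumptions not spelled out on the slide: ports are named after the side of the router they sit on, a half-router keeps straight-through paths plus injection to and ejection from every direction, and a full-router connects every input to every output. Counting connections is only a rough proxy for crossbar area.

```python
DIRECTIONS = ("N", "S", "E", "W")
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}
INPUTS = DIRECTIONS + ("INJ",)   # four neighbor inputs plus local injection
OUTPUTS = DIRECTIONS + ("EJ",)   # four neighbor outputs plus local ejection


def full_router_allows(in_port, out_port):
    """Complete connectivity: any input may reach any output."""
    return in_port in INPUTS and out_port in OUTPUTS


def half_router_allows(in_port, out_port):
    """No turns: traffic goes straight, is injected, or is ejected."""
    if in_port == "INJ":
        return out_port in DIRECTIONS
    if out_port == "EJ":
        return in_port in DIRECTIONS
    # A flit entering on one side may only leave on the opposite side.
    return in_port in DIRECTIONS and out_port == OPPOSITE[in_port]


full = sum(full_router_allows(i, o) for i in INPUTS for o in OUTPUTS)
half = sum(half_router_allows(i, o) for i in INPUTS for o in OUTPUTS)
print(f"full-router connections: {full}, half-router connections: {half}")
```

Under these assumptions the half-router keeps roughly half of the connections of a full-router, which is in line with the crossbar saving quoted above.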
![Page 17: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/17.jpg)
17
Solution: Routing Restriction (1)
• Routing from a full-router to a half-router that is:
– An odd number of columns away
– Not in the same row
• Solution: use YX routing instead of XY routing in this case
[Figure: checkerboard mesh diagram; legend: Half Router, Full Router.]
![Page 18: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/18.jpg)
18
Solution: Routing Restriction (2)
• Routing from a half-router to a half-router that is:
– An even number of columns away
– Not in the same row
• Solution: needs two turns
– (1) To an intermediate full-router using YX
– (2) To the destination using XY
• Requires an extra VC to avoid deadlock
[Figure: checkerboard mesh diagram; legend: Half Router, Full Router.]
![Page 19: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/19.jpg)
19
Routing Restriction (3)
• Routing between full-routers that are an odd number of columns away
• We avoid this case by using a different MC placement (see Placement of MCs)
[Figure: checkerboard mesh diagram; legend: Half Router, Full Router.]
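The three routing restrictions can be combined into one route-selection sketch. The assumptions here are illustrative: nodes have (x, y) coordinates, a node is a half-router when x + y is odd, and the intermediate full-router for the two-turn case is taken in the row next to the source. The function names are hypothetical, and the extra virtual channel needed for deadlock avoidance is not modeled.

```python
def is_half(x, y):
    # Assumed checkerboard convention: odd coordinate sum means half-router.
    return (x + y) % 2 == 1


def _line(a, b):
    """Nodes from a (exclusive) to b (inclusive); a and b share a row or column."""
    (ax, ay), (bx, by) = a, b
    hops = []
    while (ax, ay) != (bx, by):
        if ax != bx:
            ax += 1 if bx > ax else -1
        else:
            ay += 1 if by > ay else -1
        hops.append((ax, ay))
    return hops


def route(src, dst):
    """Minimal route that never turns at a half-router."""
    (sx, sy), (dx, dy) = src, dst
    if sx == dx or sy == dy:
        return [src] + _line(src, dst)            # straight line, no turn
    xy_turn, yx_turn = (dx, sy), (sx, dy)
    if not is_half(*xy_turn):                      # default: XY routing
        return [src] + _line(src, xy_turn) + _line(xy_turn, dst)
    if not is_half(*yx_turn):                      # restriction (1): use YX
        return [src] + _line(src, yx_turn) + _line(yx_turn, dst)
    if is_half(*src) and is_half(*dst):            # restriction (2): two turns
        iy = sy + (1 if dy > sy else -1)           # adjacent row, full-router column-wise
        t1, t2 = (sx, iy), (dx, iy)
        return [src] + _line(src, t1) + _line(t1, t2) + _line(t2, dst)
    # Restriction (3): full-router to full-router, odd number of columns away.
    raise ValueError("no turn-legal minimal route under these rules")


def turn_nodes(path):
    """Nodes at which the path changes direction."""
    return [b for a, b, c in zip(path, path[1:], path[2:])
            if (b[0] - a[0], b[1] - a[1]) != (c[0] - b[0], c[1] - b[1])]


print(route((0, 0), (1, 2)))   # full-router to half-router, falls back to YX
print(route((1, 0), (3, 2)))   # half-router to half-router, two turns
```

With this parity convention, the only source and destination pairs that raise an error are full-router pairs an odd number of columns apart, which matches the case the talk avoids through MC placement.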
![Page 21: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/21.jpg)
21
Placement of MCs
• Exploit Many-to-Few: place the MCs at Half-Router nodes
• Half-Routers can communicate with all nodes with no penalty
• Common case for BSP: compute cores communicate with MCs, not each other
[CMP-MSI’08] “Extending the Scalability of Single Chip Stream Processors with On-chip Caches”, Bakhoda et al.
[ISCA’09] “Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs”, Abts et al.
[Figure: mesh layout with legend: Half Router, Compute Core Router, Memory Controller Router.]
![Page 23: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/23.jpg)
23
Multi-port routers at MCs
• Reduce the bottleneck at the few nodes
• Increase terminal BW of the few nodes
– Increase the injection ports of MC routers
– Minimal area overhead (~1% in total NoC area)
– Speedups of up to 25%
![Page 26: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/26.jpg)
26
Methodology
• Compute simulation: GPGPU-Sim (2.2.1b)
• NoC simulation: Booksim-2
– Integrated into GPGPU-Sim as network simulator
• Area estimations: Orion 2.0
• Benchmarks: 24 CUDA applications including the Rodinia benchmarks
![Page 27: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/27.jpg)
27
Results
• Combination of
– Checkerboard routing and placement
– Channel Slicing
– Multi-port routers at MCs
• Overall HM speedup of 17% across 24 benchmarks over balanced baseline
• Total NoC area reduction of 43%
[Figure: per-benchmark speedup (y-axis -20% to 80%) for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD and HM. Benchmarks are grouped as Low Speedup/Low Traffic, Low Speedup/High Traffic and High Speedup/High Traffic.]
![Page 28: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/28.jpg)
28
Throughput-Effective NoC
[Figure: (Chip Area)⁻¹ [1/mm²] (0.0012 to 0.0020) versus Average Throughput [IPC] (190 to 310), with contour lines from 0.30 to 0.55 IPC/mm². Design points shown: Balanced Mesh, 2x BW, Thr. Eff. and Ideal NoC; arrows indicate less area and higher throughput.]
![Page 29: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/29.jpg)
29
Summary
• Throughput-Effective design: consider system level performance impact + area impact of NoC
• Observations
– NoC BW is more important than latency in accelerators
– Many-to-Few-to-Many traffic pattern
• Throughput-Effective NoC for accelerators
– Checkerboard
– Multi-port MC routers
– Channel slicing
![Page 30: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/30.jpg)
Thank you
![Page 31: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/31.jpg)
31
Backups…
![Page 32: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/32.jpg)
32
Channel Slicing – Double networks
• Divide the single network into two physical networks
– Each new network: half the bisection BW of the original network
– Overall bisection BW: constant
• Saves area
– Quadratic dependency of crossbar area on channel BW
• Increases serialization latency
– But compute accelerators are not sensitive to latency
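The area and latency trade-off of channel slicing follows directly from the quadratic crossbar model. This is a minimal sketch that assumes crossbar area is proportional to the square of the channel width and that a packet is serialized into width-sized flits; the 16-byte channel and 64-byte packet are hypothetical values, not parameters from the talk.

```python
import math


def crossbar_area(width_bytes, k=1.0):
    """Relative crossbar area, assuming area grows with the square of channel width."""
    return k * width_bytes ** 2


def serialization_cycles(packet_bytes, width_bytes):
    """Cycles needed to push one packet through a channel of the given width."""
    return math.ceil(packet_bytes / width_bytes)


WIDTH = 16      # hypothetical channel width of the single network, in bytes
PACKET = 64     # hypothetical packet size, in bytes

single_area = crossbar_area(WIDTH)
sliced_area = 2 * crossbar_area(WIDTH // 2)   # two networks, each half as wide

print("area ratio sliced/single:", sliced_area / single_area)
print("serialization cycles single:", serialization_cycles(PACKET, WIDTH))
print("serialization cycles sliced:", serialization_cycles(PACKET, WIDTH // 2))
```

Under this model the two sliced crossbars together take half the area of the original crossbar at the same total bisection bandwidth, while each packet takes twice as many cycles to serialize.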
![Page 33: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/33.jpg)
33
Results
• Memory Controller placement
– HM speedup of 13% over balanced baseline design
[Figure: mesh layout with legend: Compute Core Router, Memory Controller Router. Per-benchmark speedup (y-axis -20% to 80%) for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD and HM.]
![Page 34: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/34.jpg)
34
Results
• Checkerboard routing
– Less than 1% performance loss compared to DOR with same resources
– Reduces total router area by 14.2%
[Figure: mesh layout with legend: Half Router, Compute Core Router, Memory Controller Router. Per-benchmark relative performance (y-axis 70% to 120%) for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD and HM.]
![Page 35: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/35.jpg)
35
Results
• Channel slicing
– Average change in performance < 1%
– NoC area reduction of 37%
[Figure: mesh layout with legend: Half Router, Compute Core Router, Memory Controller Router. Per-benchmark speedup (y-axis -7% to 14%) for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD and HM.]
![Page 36: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/36.jpg)
36
Top 5 systems
TOP 5 Systems - 11/2010
1. Tianhe-1A - NUDT TH MPP, X5670 2.93 GHz 6C, Nvidia GPU, FT-1000 8C
2. Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz
3. Nebulae - Dawning TC3600 Blade, Intel X5650, Nvidia Tesla C2050 GPU
4. TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows
5. Hopper - Cray XE6 12-core 2.1 GHz
![Page 37: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/37.jpg)
37
Alternative MC placement example
![Page 38: Throughput-Effective On-Chip Networks for Manycore Accelerators](https://reader036.vdocuments.us/reader036/viewer/2022062315/568162be550346895dd349c2/html5/thumbnails/38.jpg)
38
Many-to-Few-to-Many Traffic Pattern
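[Figure: cores C0, C1, C2, ..., Cn send requests through the request network to memory controllers MC0, MC1, ..., MCm, which send replies through the reply network back to the cores. Labeled bandwidths: core output bandwidth, MC input bandwidth, MC output bandwidth, core input bandwidth.]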