interconnection network for tightly coupled accelerators
TRANSCRIPT
Center for Computational Sciences, Univ. of Tsukuba
Interconnection Network for Tightly Coupled Accelerators
Architecture
Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato
Center for Computational Sciences University of Tsukuba, Japan
Aug 22, 2013 Hot Interconnects 21 1
Center for Computational Sciences, Univ. of Tsukuba
What is “Tightly Coupled Accelerators (TCA)” ? Concept: n Direct connection between accelerators (GPUs) over the
nodes n Eliminate extra memory copies to the host n Reduce latency, improve strong scaling with small data size for scientific applications
n Using PCIe as a communication link between accelerators over the nodes n PCIe just performs packet transfer and direct device P2P communication is available.
n PEACH2: PCI Express Adaptive Communication Hub ver. 2 n In order to configure TCA, each node is connected to other nodes through PEACH2 chip.
Aug 22, 2013 Hot Interconnects 21 2
Center for Computational Sciences, Univ. of Tsukuba
Design policy of PEACH2 n Implement by FPGA with four PCIe Gen.2 IPs
n Altera Stratix IV GX n Prototyping, flexible enhancement
n Sufficient communication bandwidth n PCI Express Gen2 x8 for each port n Sophisticated DMA controller
n Chaining DMA n Latency reduction
n Hardwired logic n Low-overhead routing mechanism
n Efficient address mapping in PCIe address area using unused bits n Simple comparator for decision of output port
It is not only a proof-of-concept implementation, but it will also be available for product-run in GPU cluster.
Aug 22, 2013 Hot Interconnects 21 3
Center for Computational Sciences, Univ. of Tsukuba
TCA node structure example
n PEACH2 can access all GPUs n NVIDIA Kepler architecture + CUDA 5.0 “GPUDirect Support for RDMA”
n Performance over QPI is quite bad. => support only for GPU0, GPU1
n Connect among 3 nodes using PEACH2
Aug 22, 2013 Hot Interconnects 21
CPU (Xeon
E5)
CPU (Xeon
E5) QPI
PCIe
GPU0
GPU2
GPU3
IB HCA
PEACH2
GPU1
G2 x8 G2
x16 G2 x16
G3 x8
G2 x16
G2 x16
G2 x8
G2 x8
G2 x8
Single PCIe address
GPU: NVIDIA K20, K20X (Kepler architecture)
4
Center for Computational Sciences, Univ. of Tsukuba
Overview of PEACH2 chip n Fully compatible with PCIe
Gen2 spec. n Root and EndPoint must be
paired according to PCIe spec. n Port N: connected to the host
and GPUs n Port E and W: form the ring
topology n Port S: connected to the
other ring n Selectable between Root and Endpoint
n Write only except Port N n Instead, “Proxy write” on remote
node realizes pseudo-read.
CPU & GPU side (Endpoint)
To PEACH2 (Root Complex /
Endpoint)
To PE
AC
H2
(Endpoint)
To PE
AC
H2
(Root C
omplex)
NIOS (CPU)
Memory
DMAC
Routing function
Port N
Port E
Port W
Port S
Aug 22, 2013 Hot Interconnects 21 5
Center for Computational Sciences, Univ. of Tsukuba
Communication by PEACH2 n PIO
n CPU can store the data to remote node directly using mmap.
n DMA n Chaining mode
n DMA requests are prepared as the DMA descriptors chained in the host memory.
n Multiple DMA transactions are operated automatically according to the DMA descriptors by hardware.
n Register mode n Up to 16 requests n No overhead to transfer descriptors from host
n Block stride transfer function
Aug 22, 2013 Hot Interconnects 21 6
Descriptor0
Descriptor1
Descriptor2
Descriptor3
Descriptor (n-1)
…
Source Destination
Length Flags
Next
Center for Computational Sciences, Univ. of Tsukuba
PEACH2 board (Production version for HA-PACS/TCA)
n PCI Express Gen2 x8 peripheral board n Compatible with PCIe Spec.
Aug 22, 2013 Hot Interconnects 21 7
Top View Side View
Center for Computational Sciences, Univ. of Tsukuba
PEACH2 board (Production version for HA-PACS/TCA)
Aug 22, 2013 Hot Interconnects 21 8
Main board + sub board Most part operates at 250 MHz
(PCIe Gen2 logic runs at 250MHz)
PCI Express x8 card edge Power supply
for various voltage
DDR3- SDRAM
FPGA (Altera Stratix IV
530GX)
PCIe x16 cable connecter
PCIe x8 cable connecter
Center for Computational Sciences, Univ. of Tsukuba
Performance Evaluation n Environment: 8node GPU cluster
(TCAMINI) n CPU: Intel Xeon-E5 (SandyBridge EP) 2.6GHz x2socket
n MB: SuperMicro X9DRG-QF n Memory: DDR3 128GB n OS: CentOS 6.3 (kernel 2.6.32-279.22.1.el6.x86_64)
n GPU: NVIDIA K20, GDDR5 5GB x1 n CUDA: 5.0, NVIDIA-Linux-x86_64-310.32
n PEACH2 board: Altera Stratix IV 530GX n MPI: MVAPICH2 1.9 with IB FDR10
Aug 22, 2013 Hot Interconnects 21 9
Center for Computational Sciences, Univ. of Tsukuba
Evaluation items
n Ping-pong performance between nodes n Latency and bandwidth n Written as application n Comparison with MVAPICH2 1.9 (with CUDA support) for GPU-GPU communication
n In order to access GPU memory by the other device, “GPU Direct support for RDMA” in CUDA5 API is used. n Special driver named “TCA
p2p driver” to enable memory mapping is developed.
n “PEACH2 driver” to control the board is also developed.
Aug 22, 2013 Hot Interconnects 21 10
Center for Computational Sciences, Univ. of Tsukuba
Ping-pong Latency
Minimum Latency (nearest neighbor comm.) n PIO: CPU to CPU: 0.9us n DMA:CPU to CPU: 1.9us
GPU to GPU: 2.3us (cf. MVAPICH2 1.9: 19 usec)
Aug 22, 2013 Hot Interconnects 21 11
0
1
2
3
4
5
6
7
8
8 64 512 4096 32768
Late
ncy
(use
c)
Data Size (bytes)
PIO, Good Performance (<64B)
PIO
DMA (CPU)
DMA (GPU)
Center for Computational Sciences, Univ. of Tsukuba
Ping-pong Latency
Minimum Latency (nearest neighbor comm.) n PIO: CPU to CPU: 0.9us n DMA:CPU to CPU: 1.9us
GPU to GPU: 2.3us (cf. MVAPICH2 1.9: 19 usec) Forwarding overhead n 200~300 nsec
Aug 22, 2013 Hot Interconnects 21 12
0
1
2
3
4
5
6
7
8
8 64 512 4096 32768
Late
ncy
(use
c)
Data Size (bytes)
DMA Direct
DMA 1 hop
DMA 2 hop
DMA 3 hop
DMA (CPU)
Center for Computational Sciences, Univ. of Tsukuba
Ping-pong Bandwidth n Max. 3.5 GByte/sec
n 95% of theoretical peak n Converge to the same peak
with various hop counts
n GPU to GPU DMA is saturated by up to 880MByte/sec. Ø PCIe switch embedded in
SandyBridge doesn’t have enough resources for PCIe device read.
n In IvyBridge, the performance is expected to be improved…
n Performance will be expected as same as CPU to CPU case.
Aug 22, 2013 Hot Interconnects 21 13
0
500
1000
1500
2000
2500
3000
3500
4000
8 256 8192 262144
Ban
dwid
th (M
Byt
es/s
ec)
Data Size (bytes)
DMA Direct
DMA 1 hop
DMA 2 hop
DMA 3 hop
GPU Direct
MVA2 GPU
Max Payload Size = 256byte Theoretical peak: 4Gbyte/sec × 256 / (256 + 16 + 2 + 4 + 1 + 1) = 3.66 Gbyte/s
3.5 Gbyte/s
880Mbyte/s
Center for Computational Sciences, Univ. of Tsukuba
Programming for TCA cluster n Data transfer to remote GPU within TCA can
be treated like local GPU. n In particular, suitable for stencil computation
n Good performance at nearest neighbor communication due to direct network
n Chaining DMA can bundle data transfers for every “Halo” planes
n XY-plane: contiguous array n XZ-plane: block stride n YZ-plane: stride
n In each iteration, DMA descriptors can be reused and only a DMA kick operation is needed
=> Improve strong scaling with small data size
Aug 22, 2013 Hot Interconnects 21 14
Bundle to 1 DMA
Center for Computational Sciences, Univ. of Tsukuba
Related Work n Non Transparent Bridge (NTB)
n NTB appends the bridge function to a downstream port of the PCI-E switch.
n Inflexible, the host must recognize during the BIOS scan n It is not defined in the standard of PCI-E and is incompatible with the vendors.
n APEnet+ (Italy) n GPU direct copy using Fermi GPU,different protocol from TCA is used.
n Latency between GPUs is around 5us? n Original 3-D Torus network, QSFP+ cable
n MVAPICH2 + GPUDirect n CUDA5, Kepler n Latency between GPUs is reported as 4.5us at Jun., but currently not released (4Q of 2013?)
Aug 22, 2013 Hot Interconnects 21 15
Center for Computational Sciences, Univ. of Tsukuba
Summary n TCA: Tightly Coupled Accelerators
n TCA enables direct communication among accelerators as an element technology becomes a basic technology for next gen’s accelerated computing in exa-scale era.
n PEACH2 board: Implementation for realizing TCA using PCIe technology n Bandwidth: max. 3.5 Gbyte/sec between CPUs (over 95% of theoretical
peak) Min. Latency: 0.9 us (PIO), 1.9 us (DMA between CPUs), 2.3 us (DMA between GPUs)
n GPU-GPU communication over the nodes can be demonstrated with 8 node cluster.
n By the ping-pong program, PEACH2 can achieve lower latency than existing technology, such as MVAPICH2 in small data size.
n HA-PACS/TCA with 64 nodes will be installed on the end of Oct. 2013. n Actual proof system of TCA architecture with 4 GPUs per each node n Development of the HPC application using TCA, and production-run
n Performance comparison between TCA and InfiniBand n Hybrid communication scheme using TCA and InfiniBand will be investigated.
Aug 22, 2013 Hot Interconnects 21 16