CCS HPC Seminar: T2K
Overview of T2K-Tsukuba
• Began operation in June 2008
• 95 TFLOPS peak performance, 800 TB of storage
• Procured under the "T2K Open Supercomputer Alliance", jointly with:
– T2K-Todai (Univ. of Tokyo): 140 TFLOPS
– T2K-Kyoto: 61 TFLOPS + SMP nodes
T2K-Tsukuba system configuration
648 nodes (quad-core x 4 sockets / node), AMD Opteron 8000-series "Barcelona" CPUs: 2.3 GHz x 4 FLOP/cycle x 4 cores x 4 sockets = 147.2 GFLOPS / node, 95.4 TFLOPS / system, 20.7 TB memory / system
800 TB usable (1 PB physical) on RAID-6 for fault tolerance; Lustre cluster file system served over Infiniband x 2, with Meta-Data Server and File Servers
70 racks
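The peak figures above follow directly from the clock rate and the core counts; a quick check with shell arithmetic (bc assumed available):

# per-node peak: 2.3 GHz x 4 FLOP/cycle x 4 cores x 4 sockets
echo '2.3 * 4 * 4 * 4' | bc                # 147.2 GFLOPS / node
# system peak: 648 nodes
echo '147.2 * 648 / 1000' | bc -l          # 95.39 TFLOPS, i.e. ~95.4
# memory: 32 GB / node across 648 nodes
echo '32 * 648 / 1000' | bc -l             # 20.7 TB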
• Multi-core & multi-socket
– AMD quad-core Opteron 8000 (Barcelona)
– 4 sockets / node, 147.2 GFLOPS / node
– Opteron is a NUMA architecture: 40 GB/s aggregate memory bandwidth / node
• Multi-rail interconnect
– 4xDDR Infiniband (Mellanox ConnectX) x 4 / node
– Quad-rail Infiniband: 8 GB/s / node
• MPI
– MVAPICH (modified by Appro) supports the multi-rail configuration
– MPI traffic can be spread across the rails in the multi-rail setup
– Infiniband ConnectX x 4 (each on PCI-Express x8 lanes)
T2K-Tsukuba software environment
• OS: Red Hat Enterprise Linux v.5 WS (Linux kernel 2.6)
• Languages: F90, C, C++, Java
• Compilers: PGI (Portland Group), Intel
• MPI: MVAPICH, modified by Appro
• Libraries: IMSL, ACML, SCALAPACK
• Tools: PGPROF, PAPI
Parallelizing programs on T2K
• Two programming models: OpenMP and MPI
• A node has 16 cores (4 sockets x 4 cores), so up to 16-way parallelism within a node
• OpenMP: up to 16 threads on the 16 cores, parallelized with OpenMP directives
• MPI: message passing with MPI (Message Passing Interface), within and across nodes
• Hybrid: OpenMP within a node (or socket) combined with MPI between nodes; launch sketches follow below
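A rough sketch of how the three models are launched (./a.out stands for any compiled program; OMP_NUM_THREADS is the standard OpenMP thread-count variable):

# pure OpenMP within one node: 1 process, 16 threads
OMP_NUM_THREADS=16 ./a.out
# flat MPI: 16 single-threaded MPI processes per node
mpirun_rsh -np 16 ./a.out
# hybrid: 4 MPI processes x 4 OpenMP threads (one process per socket)
mpirun_rsh -np 4 OMP_NUM_THREADS=4 ./a.out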
• OpenMP on the T2K node
– Each CPU core has its own L1 and L2 caches
– The L3 cache is shared by the 4 cores of a socket
– Memory is attached per socket (NUMA)
• Running 16 threads spans all 4 sockets, so threads cross NUMA boundaries
– An alternative is 4 MPI processes per node (one per socket) with 4 threads each
• The cache layout can be confirmed from Linux sysfs, as sketched below
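A minimal sysfs check of which CPUs share each cache level (standard Linux paths; shared_cpu_list needs a reasonably recent 2.6 kernel, older ones expose only shared_cpu_map):

for idx in /sys/devices/system/cpu/cpu0/cache/index*; do
  echo "L$(cat $idx/level) $(cat $idx/type): CPUs $(cat $idx/shared_cpu_list)"
done
# expected on Barcelona: L1/L2 entries list only cpu0,
# while the L3 entry lists all 4 cores of socket 0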
[Figure: hybrid execution model: multi-threaded execution within each node, MPI communication between nodes]
[Figure: T2K-Tsukuba node block diagram. Four Opteron sockets linked by HyperTransport (8 GB/s full-duplex between sockets, 4 GB/s full-duplex to I/O); each socket has dual-channel registered DDR2 memory, 2 GB 667 MHz DDR2 DIMM x 4 per socket. NVIDIA nForce 3050/3600 bridge chips provide PCI-Express x16/x8, PCI-X, SAS, and USB; 2 x 2 Mellanox MHGH28-XTC ConnectX HCAs (4X DDR 20 Gb/s, 1.2 µs MPI latency) attach via PCI-Express x8.]
MPI communication over multi-rail Infiniband
• Possible rail-to-process mappings per node:
– 4 rail / 1 MPI process (x 16 thread)
– 4 rail / 4 MPI process (x 4 thread)
– 1 rail / 1 MPI process (x 4 thread)
– 4 rail / 16 MPI process (x 1 thread)
• Multi-rail Infiniband also provides:
– fail-over when one rail fails
– rails are managed via Infiniband Subnet Management
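These mappings can be expressed at launch time; a sketch using mpirun_rsh with MV2_NUM_HCAS (MVAPICH2's HCA-count variable, introduced at the end of this section) and OMP_NUM_THREADS:

# 4 rails / 1 MPI process (x 16 threads): one process drives all four HCAs
mpirun_rsh -np 1 MV2_NUM_HCAS=4 OMP_NUM_THREADS=16 ./a.out
# 4 rails / 4 MPI processes (x 4 threads): every process uses all rails
mpirun_rsh -np 4 MV2_NUM_HCAS=4 OMP_NUM_THREADS=4 ./a.out
# 1 rail / 1 MPI process (x 4 threads): one HCA per process
mpirun_rsh -np 4 MV2_NUM_HCAS=1 OMP_NUM_THREADS=4 ./a.out
# 4 rails / 16 MPI processes (x 1 thread)
mpirun_rsh -np 16 MV2_NUM_HCAS=4 ./a.out
# (which HCA a single-rail process gets depends on the MVAPICH2 settings)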
[Figure: node block diagram as above, with all four Infiniband links driven by a single MPI process running 16 threads]
[Figure: node block diagram as above, with one Infiniband link per MPI process, four 4-thread processes per node]
[Figure: socket-to-HCA mapping in the T2K node: four sockets (sock0: CPU0-3, sock1: CPU4-7, sock2: CPU8-11, sock3: CPU12-15), each with 8 GB of local memory (Mem0 to Mem3); four Infiniband HCAs mlx4_0 to mlx4_3 attached through the MCP55 and IO55 bridge chips and PCI-Express]
Checking the CPU and memory (NUMA) configuration
• numactl --hardware shows the NUMA topology:
$ numactl --hardware
available: 4 nodes (0-3)
node 0 size: 8062 MB
node 0 free: 112 MB
node 1 size: 8080 MB
node 1 free: 327 MB
node 2 size: 8080 MB
node 2 free: 274 MB
node 3 size: 8080 MB
node 3 free: 354 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
• numactl --show displays the current binding policy:
$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3
Binding CPUs and memory with numactl
• --interleave=nodes (e.g. all, 0-2)
– interleave memory allocations round-robin across the given nodes
– useful for large arrays shared by threads on several sockets
• --preferred=node
– allocate on the given node when possible, fall back to other nodes otherwise
• --physcpubind=cpus: run only on the listed physical CPUs (cores)
• --cpunodebind=nodes: run only on the CPUs of the given NUMA nodes
• --membind=nodes
– allocate only from the given nodes; allocation fails when they are exhausted
• --localalloc: always allocate on the node the thread is currently running on
Examples follow after the usage line below.
usage: numactl [--interleave=nodes] [--preferred=node] [--physcpubind=cpus] [--cpunodebind=nodes] [--membind=nodes] [--localalloc] command args ...
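Example invocations of these options (./a.out stands for any program):

# interleave pages across all four sockets, e.g. for one large shared array
numactl --interleave=all ./a.out
# run on socket 2's cores and take memory from socket 2 only
numactl --cpunodebind=2 --membind=2 ./a.out
# pin to cores 0-3 and allocate wherever the thread runs
numactl --physcpubind=0-3 --localalloc ./a.out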
Compiling MPI programs
• Select the MPI environment with modules:
– % module load pgi mvapich2/pgi # PGI
– % module load intel mvapich2/intel # Intel
– % module load mvapich2/gnu # GCC
• Compile with the MPI wrapper commands:
– mpicc, mpiCC, mpif77, mpif90
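Putting the modules and wrappers together, a minimal build-and-run sketch (source file names are illustrative):

% module load pgi mvapich2/pgi     # select PGI compilers + MVAPICH2
% mpif90 -o prog prog.f90          # wrapper adds MPI include/link flags
% mpicc  -o prog prog.c            # likewise for C
% mpirun_rsh -np 16 ./prog         # launch 16 MPI processes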
Running 16 MPI processes per node:
mpirun_rsh -np 16 numactl --localalloc ./a.out
# numactl --localalloc keeps each rank's memory on the socket it runs on

Running 4 MPI processes per node (one per socket): each rank runs a wrapper
script (called numa_bind.sh here; the name is illustrative) that binds it to one socket:
SOCKET=$(( $MPIRUN_RANK % 4 ))          # MPIRUN_RANK is set per rank by mpirun_rsh
numactl --cpunodebind=$SOCKET --localalloc ./a.out
The job script then launches the wrapper:
/opt/sge/mpi/make_hostfile.sh 4 > ${JOB_ID}_host
mpirun_rsh -np 4 -hostfile ${JOB_ID}_host ./numa_bind.sh
rm ${JOB_ID}_host
# numactl --cpunodebind pins each rank's CPUs and memory to a single socket
Multi-rail MPI with MVAPICH2
• Each T2K-Tsukuba node has 4 Infiniband rails
• MVAPICH2 handles the multi-rail communication
• The MV2_NUM_HCAS environment variable sets how many HCAs each process uses
– mpirun_rsh … MV2_NUM_HCAS=4 ./a.out (valid values: 1 to 4)
• On T2K-Tsukuba all 4 rails can be used; a combined example follows below
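For example, the rail setting combines with the per-socket binding wrapper from the previous slide (numa_bind.sh as sketched there):

/opt/sge/mpi/make_hostfile.sh 4 > ${JOB_ID}_host
mpirun_rsh -np 4 -hostfile ${JOB_ID}_host MV2_NUM_HCAS=4 OMP_NUM_THREADS=4 ./numa_bind.sh
rm ${JOB_ID}_host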