Altix 4700
ccNUMA Architecture
• Distributed Memory - Shared address space
Altix HLRB II – Phase 2
• 19 partitions with 9728 cores
• Each partition with 256 Itanium dual-core processors, i.e., 512 cores
– Clock rate 1.6 GHz
– 4 Flops per cycle per core
– 12.8 GFlop/s per processor (6.4 GFlop/s per core)
• 13 high-bandwidth partitions
– Blades with 1 processor (2 cores) and 4 GB memory
– Frontside bus 533 MHz (8.5 GB/s)
• 6 high-density partitions
– Blades with 2 processors (4 cores) and 4 GB memory
– Same memory bandwidth
• Peak performance: 62.3 TFlop/s (6.4 GFlop/s per core)
• Memory: 39 TB
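• Consistency check of these figures: 1.6 GHz × 4 Flops/cycle = 6.4 GFlop/s per core; 19 partitions × 512 cores = 9728 cores; 9728 × 6.4 GFlop/s ≈ 62.3 TFlop/s peak.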
Memory Hierarchy
• L1D: 16 KB, 1 cycle latency, 25.6 GB/s bandwidth, cache line size 64 bytes
• L2D: 256 KB, 6 cycles, 51 GB/s, cache line size 128 bytes
• L3: 9 MB, 14 cycles, 51 GB/s, cache line size 128 bytes
Interconnect
• NUMAlink 4
– 2 links per blade
– Each link 2 × 3.2 GB/s bandwidth
– MPI latency 1–5 µs
Disks
• Direct-attached disks (temporary large files): 600 TB, 40 GB/s bandwidth
• Network-attached disks (home directories): 60 TB, 800 MB/s bandwidth
Environment
• Footprint: 24 m × 12 m
• Weight: 103 metric tons
• Electrical power: ~1 MW
NUMAlink Building Block
[Diagram: NUMAlink building block – four compute blades plus one I/O blade attached to two NUMAlink 4 level-1 routers; the I/O blade connects to a SAN switch via PCI/FC and to 10 GE; one building block holds 8 cores with high-bandwidth blades or 16 cores with high-density blades]
Blades and Rack
Interconnection in a Partition
Interconnection of Partitions
• Gray squares: 1 partition with 512 cores (L: login, B: batch)
• Lines: 2 NUMAlink 4 planes with 16 cables, each cable 2 × 3.2 GB/s
Interactive Partition
• Login cores: 32 for compile & test
• Interactive batch jobs
– 476 cores, managed by PBS
– daytime interactive usage
– small-scale and nighttime batch processing
– single partition only
• High-density blades: 4 cores share one blade's memory
[Diagram: core assignment in the interactive partition – 4 OS cores, 32 login cores (12 + 16 + 4), remaining batch cores in blocks of 12 and 16]
18 Batch Partitions
• Batch jobs
– 510 (508) cores, managed by PBS
– large-scale parallel jobs
– single- or multi-partition jobs
• 5 partitions with high-density blades
• 13 partitions with high-bandwidth blades
[Diagram: core assignment in a batch partition – 4 OS cores, one block of 6 (12) cores, remaining cores in blocks of 8 (16); numbers in parentheses refer to high-density blades]
Bandwidth
[Plot: bandwidth (MB/s, axis 0–3000) versus message size (1 byte to 1 MB) for intranode and internode MPI transfers]
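Curves like these are typically obtained with a ping-pong benchmark; the following is a minimal sketch (an assumed setup, not the benchmark behind the plot) that measures MPI bandwidth between ranks 0 and 1 for message sizes from 1 byte to 1 MB:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal ping-pong bandwidth sketch: rank 0 sends a message of
 * 'size' bytes to rank 1 and waits for the echo; bandwidth is
 * 2*size*iters / elapsed time. Placing the two ranks on the same
 * blade measures intranode, on different blades internode.       */
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    for (long size = 1; size <= 1 << 20; size *= 2) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = MPI_Wtime() - t0;
        if (rank == 0)
            printf("%8ld bytes: %8.1f MB/s\n", size,
                   2.0 * size * iters / t / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}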
Coherence Implementation
• SHUB2 supports up to 8192 SHUBs (32768 cores)
• Coherence domain up to 1024 SHUBs (4096 cores)
– SGI term: "sharing mode"
– Directory with one bit per SHUB
– Multiple shared copies are supported
• Accesses to other coherence domains
– SGI term: "exclusive sharing mode"
– Always translated into exclusive accesses
– Only a single copy is supported
– Directory stores the address of the SHUB (13 bits)
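Purely as an illustration (this is not the actual SHUB2 directory layout), the two modes can be pictured as two directory-entry shapes: a presence-bit vector for SHUBs inside the coherence domain, and a 13-bit owner pointer for lines accessed from other domains:

#include <stdint.h>

/* Illustrative sketch, NOT the real SHUB2 directory format:
 * inside a coherence domain the directory tracks one presence bit
 * per SHUB (up to 1024 SHUBs), so multiple shared copies are
 * possible; across coherence domains only a single exclusive copy
 * exists, so a 13-bit SHUB id (up to 8192 SHUBs) is stored.       */
typedef struct {
    uint64_t sharers[1024 / 64];   /* sharing mode: bit vector, one bit per SHUB */
} dir_sharing_mode_t;

typedef struct {
    unsigned owner_shub : 13;      /* exclusive sharing mode: id of the single owner SHUB */
} dir_exclusive_mode_t;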
SHMEM Latency Model for Altix
• SHMEM get latency is the sum of:
– 80 nsec for the function call
– 260 nsec for memory latency
– 340 nsec for the first hop
– 60 nsec per hop
– 20 nsec per meter of NUMAlink cable
• Example: 64-processor system with at most 4 hops and at most 4 m of NUMAlink cable
– Total SHMEM get latency: 80 + 260 + 340 + 60×4 + 20×4 = 1000 nsec
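A direct transcription of this model into code (an illustrative sketch; the constants are the ones from the slide):

#include <stdio.h>

/* SHMEM get latency model: fixed costs for the function call,
 * memory access and first hop, plus per-hop and per-meter-of-cable
 * terms. Returns nanoseconds.                                      */
static double shmem_get_latency_ns(int hops, double cable_meters)
{
    return 80.0                 /* function call                */
         + 260.0                /* memory latency               */
         + 340.0                /* first hop                    */
         + 60.0 * hops          /* per hop                      */
         + 20.0 * cable_meters; /* per meter of NUMAlink cable  */
}

int main(void)
{
    /* 64-processor example from the slide: 4 hops, 4 m of cable. */
    printf("%.0f nsec\n", shmem_get_latency_ns(4, 4.0));  /* prints 1000 nsec */
    return 0;
}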
Parallel Programming Models
• Intra-host (one Linux image, 512 cores): OpenMP, Pthreads, MPI, SHMEM, global segments
• Intra-coherency domain (4096 cores) and across the entire machine: MPI, SHMEM, global segments
[Diagram: Altix system composed of multiple Linux images grouped into coherency domains]
Barrier Synchronization
• Frequent in OpenMP, SHMEM, and MPI one-sided operations (MPI_Win_fence)
• Tree-based implementation using multiple fetch-op variables to minimize contention on the SHUB
• Uncached loads are used to reduce NUMAlink traffic (see the sketch after the diagram below)
[Diagram: a fetch-op variable resides in one hub (SHUB); the local CPUs and, via the router, remote CPUs operate on it]
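The SHUB's fetch-op registers are hardware-specific, but the underlying idea can be sketched in portable C11 atomics: a sense-reversing barrier built on an atomic fetch-and-add counter. This is a flat sketch under that assumption; the production implementation is tree-based, with one fetch-op variable per subtree.

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

/* Flat sense-reversing barrier on one atomic counter (the role the
 * fetch-op variable plays on the SHUB). A tree implementation would
 * use one such counter per subtree to spread the contention.        */
#define NTHREADS 8

static atomic_int count = 0;
static atomic_int sense = 0;

static void barrier(int *local_sense)
{
    *local_sense = !*local_sense;                 /* flip the phase */
    if (atomic_fetch_add(&count, 1) == NTHREADS - 1) {
        atomic_store(&count, 0);                  /* last thread resets the counter */
        atomic_store(&sense, *local_sense);       /* and releases the waiting threads */
    } else {
        while (atomic_load(&sense) != *local_sense)
            ;                                     /* spin; on the Altix this load is uncached */
    }
}

static void *worker(void *arg)
{
    int local_sense = 0;
    for (int i = 0; i < 3; i++) {
        barrier(&local_sense);
        printf("thread %ld passed barrier %d\n", (long)arg, i);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}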
Programming Models
• OpenMP on a Linux image
• MPI
• SHMEM
• Shared segments (System V and Global Shared Memory)
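Examples for SHMEM and GSM follow below; for OpenMP, a minimal generic sketch (not Altix-specific code) of a parallel loop within one Linux image would look like this:

#include <omp.h>
#include <stdio.h>

/* Minimal OpenMP example: loop iterations are distributed over the
 * threads of one Linux image; on a ccNUMA system physical page
 * placement typically follows the first-touch policy.              */
int main(void)
{
    double a[1000];

    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * i;

    printf("a[999] = %f (up to %d threads)\n", a[999], omp_get_max_threads());
    return 0;
}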
SHMEM
• Can be used in MPI programs where all processes execute the same code
• Enables access within and across partitions
• Works on static data and symmetric heap data (allocated with shmalloc or shpalloc)
• Info: man intro_shmem
Example
#include <stdio.h>
#include <mpi.h>
#include <mpp/shmem.h>

int main(int argc, char **argv)
{
    long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    static long target[10];            /* static data is symmetric */
    int myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        /* put 10 elements into target on PE 1 */
        shmem_long_put(target, source, 10, 1);
    }
    shmem_barrier_all();               /* sync sender and receiver */

    if (myrank == 1)
        printf("target[0] on PE %d is %ld\n", myrank, target[0]);

    MPI_Finalize();
    return 0;
}
Global Shared Memory Programming
• Allocation of a shared memory segment via the collective GSM_Alloc
• Similar to memory-mapped files or System V shared segments, but those are limited to a single OS instance
• A GSM segment can be distributed across partitions
– GSM_ROUNDROBIN: pages are distributed round-robin across processes
– GSM_SINGLERANK: places all pages near a single process
– GSM_CUSTOM_ROUNDROBIN: each process specifies how many pages should be placed in its memory
• Data structures can be placed in this memory segment and accessed from all processes with normal load and store instructions.
Example
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <mpi_gsm.h>

#define ARRAY_LEN 1024                 /* example size */

int main(int argc, char **argv)
{
    int rank, i, rc;
    int *shared_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int placement = GSM_ROUNDROBIN;
    int flags = 0;
    size_t size = ARRAY_LEN * sizeof(int);
    rc = GSM_Alloc(size, placement, flags, MPI_COMM_WORLD, &shared_buf);
    /* rc: return code of the collective allocation (checking omitted) */

    /* Have one rank initialize the shared memory region */
    if (rank == 0) {
        for (i = 0; i < ARRAY_LEN; i++)
            shared_buf[i] = i;
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* Have every rank verify it can read from the shared memory */
    for (i = 0; i < ARRAY_LEN; i++) {
        if (shared_buf[i] != i) {
            printf("ERROR!! element %d = %d\n", i, shared_buf[i]);
            printf("Rank %d - FAILED shared memory test.\n", rank);
            exit(1);
        }
    }

    MPI_Finalize();
    return 0;
}
Summary
• Altix 4700 is a ccNUMA system with >60 TFlop/s peak performance
• MPI messages are sent with a two-copy or single-copy protocol
• Hierarchical coherence implementation
– intranode
– within a coherence domain
– across coherence domains
• Programming models: OpenMP, MPI, SHMEM, GSM
The Compute Cube of LRZ
[Diagram: building layout of the LRZ compute cube – column-free high-performance computer floor (Höchstleistungsrechner, säulenfrei), air conditioning (Klima), electrical supply (Elektro), server/network (Server/Netz), archive/backup (Archiv/Backup), re-cooling units (Rückkühlwerke), access bridge (Zugangsbrücke)]