IBM Blue Gene/Q Overview
DESCRIPTION
Presentation overview of IBM's Blue Gene/Q supercomputer.
TRANSCRIPT
© 2009 IBM Corporation
IBM Blue Gene/Q Overview - PRACE Winter School
February 6-10, 2012
Pascal, [email protected]
Blue Gene Roadmap

[Roadmap chart: performance vs. year, 2004-2020]
• Blue Gene/L: PPC 440 @ 700 MHz, 596+ TF
• Blue Gene/P: PPC 450 @ 850 MHz, 1+ PF
• Blue Gene/Q: in progress, 20+ PF
Goals (BG/Q):
• Lay the groundwork for ExaFlop & usability
• Address many of the power efficiency, reliability and technology challenges

Goals (roadmap):
• Three orders of magnitude performance in 10 years
• Push the state of the art in power efficiency, scalability, & reliability
• Enable unprecedented application capability
• Exploit new technologies: PCM, photonics, 3DP
Understanding Blue Gene

• BG/P - BG/Q architecture and hardware overview
  – IO node
  – Processor
  – Network: 5D torus
• BG/Q software and programming model
  – BG/Q innovative features: atomics, wake-up unit, TM, TLS
  – Partitioning
  – Programming model
  – BG/Q kernel overview
  – User environment
Blue Gene Evolution

• BG/L (5.7 TF/rack) - 130 nm ASIC (1999-2004 GA)
  – Embedded 440 core, dual-core system-on-chip
  – Memory: 0.5/1 GB/node
  – Biggest installed system (LLNL): 104 racks, 212,992 cores, 596 TF/s, 210 MF/W
• BG/P (13.9 TF/rack) - 90 nm ASIC (2004-2007 GA)
  – Embedded 450 core, quad-core SOC, DMA
  – Memory: 2/4 GB/node
  – Biggest installed system (Jülich): 72 racks, 294,912 cores, 1 PF/s, 357 MF/W
  – SMP support, OpenMP, MPI
• BG/Q (209 TF/rack) - 45 nm ASIC (2007-2012 GA)
  – A2 core, 16-core/64-thread SOC
  – 16 GB/node
  – Biggest installed system (LLNL): 96 racks, 1,572,864 cores, 20 PF/s, 2 GF/W
  – Speculative execution, sophisticated L1 prefetch, transactional memory, fast thread handoff, compute + IO systems
Blue Gene/P packaging hierarchy

• Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM
• Compute card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB as of 6/30/08)
• Node card: 32 compute cards (32 chips, 4x4x2), 0-1 IO cards, 435 GF/s, 64 (128) GB
• Rack: 32 node cards, 13.9 TF/s, 2 (4) TB, cabled 8x8x16
• System: 72 racks (72x32x32), 1 PF/s, 144 (288) TB
Blue Gene/P ASIC

4 cores, SMP, 8 MB shared L3, 0.85 GHz, 1 thread/core, 4 GB memory
Blue Gene/P System in a Complete Configuration

• Compute nodes dedicated to running user applications, and almost nothing else - simple compute node kernel (CNK)
• I/O nodes run Linux and provide a more complete range of OS services - files, sockets, process launch, signaling, debugging, and termination
• Service node performs system management services (e.g., heartbeating, monitoring errors) - transparent to application software

[System diagram: BG/P compute nodes (1024 per rack) and I/O nodes (8 to 64 per rack) connect through a federated 10-Gigabit Ethernet switch to front-end nodes, the service node, WAN, visualization, archive, and GPFS file servers; control network is 1 GigE (4 per rack) plus 10 GigE per I/O node (1 to 72 racks)]
Blue Gene/Q packaging hierarchy

1. Chip: 16 processor cores
2. Single-chip module
3. Compute card: one chip module, 16 GB DDR3 memory
4. Node card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots
6. Rack: 2 midplanes
7. System: 20 PF/s

• Sustained single-node performance: 10x P, 20x L
• MF/Watt: 6x P, 10x L (~2 GF/W target)
Blue Gene/Q Node-Board Assembly

[Photo: BG/Q node board with 32 compute nodes - redundant, hot-pluggable power-supply assemblies; water hoses; fiber-optic ribbons (36x, 12 fibers each); compute cards with one node each (32x); 48-fiber connectors]
BG/Q External I/O

• I/O network to/from compute rack
  – 2 links (4 GB/s in, 4 GB/s out) feed an I/O PCIe port
  – Every node card has up to 4 ports (8 links)
  – Typical configurations: 8 ports (32 GB/s/rack), 16 ports (64 GB/s/rack), 32 ports (128 GB/s/rack)
  – Extreme configuration: 128 ports (512 GB/s/rack)
• I/O drawers
  – 8 I/O nodes/drawer with 8 ports (16 links) to the compute rack
  – 8 PCIe Gen2 x8 slots (32 GB/s aggregate)
  – 4 I/O drawers per compute rack
  – Optional installation of I/O drawers in external racks for extreme bandwidth configurations
I/O Blocks
• I/O nodes are also combined into blocks
  – All I/O drawers can be grouped into a single block for administrative convenience
  – In normal operation the I/O block(s) remain booted while compute blocks are reconfigured and rebooted
  – I/O blocks do not need to be rebooted, except to resolve fatal errors from I/O nodes
  – A rationale for having multiple I/O node partitions would be experimentation with different Linux ION kernels
• Blocks can be created via genIOblock
• Locations of I/O enclosures can be:
  – Qxx-Iy (in an I/O rack, y is 0-B)
  – Rxx-Iy (in a compute rack, y is C-F)
BG/Q I/O Architecture

• BlueGene classic I/O with GPFS clients on the logical I/O nodes - similar to BG/L and BG/P
• Uses an InfiniBand switch
• Uses DDN RAID controllers and file servers
• BG/Q I/O nodes are not shared between compute partitions
  – I/O nodes bridge data from function-shipped I/O calls to the parallel file system client
• Components balanced to allow a specified minimum compute partition size to saturate the entire storage array I/O bandwidth

[Diagram: BG/Q compute racks → (PCIe) BG/Q I/O → (IB) switch → (IB) RAID storage & file servers]
Booting Partitions

• Compute partitions are booted and "attach" to an already-booted I/O partition that contains the attached I/O nodes
• An I/O block cannot be freed while attached compute blocks are still booted
• A compute partition must have at least one direct link to an I/O node that is not marked in error
• When the compute partition is booted, at least one I/O node in the I/O partition must have an outbound link and not be marked in error
  – For large blocks, there must be at least one attached I/O node per midplane
IO Node: BG/P vs. BG/Q

BG/Q                                                     | BG/P
---------------------------------------------------------|-------------------------------------------
Each image is 5 GB in size                               | Image was a few hundred megabytes in size
Integrated health monitoring system                      | No health monitoring
Supports PCIe-based 10 Gb Ethernet, InfiniBand,          | Only supported Ethernet
and combo Eth/IB cards                                   |
Independent I/O or LN block associated with one          | Part of the compute block
or more compute blocks                                   |
Designed to be booted infrequently                       | Rebooted frequently
Per-node persistent file spaces                          | No persistent storage space
Hybrid read-only NFS root with in-memory (tmpfs)         | Ramdisk-based root filesystem
read/write file spaces                                   |
RPM-based installation, customizable before and          | Installed via a static tar file
after installation                                       |
Supports IONs and Log-In Nodes (LNs)                     | Only supported IONs
Fully featured RHEL 6.x-based Linux distro               | Minimal MCP-based Linux distro
Blue Gene/Q Processor
BQC processor core (A2 core)

• Simple core, designed for excellent power efficiency and small footprint
• Embedded 64-bit PowerPC compliant
• 4 SMT threads typically achieve high utilization of shared resources
• Design point is 1.6 GHz @ 0.74 V
• AXU port allows for the unique BGQ-style floating point unit
• One AXU (FPU) and one other instruction issued per cycle
• In-order execution
Multithreading

• Four threads issuing to two pipelines
  – Impact of memory access latency is reduced
• Issue
  – Up to two instructions issued per cycle: one integer/load/store/control and one FPU
  – At most one instruction issued per thread per cycle
• Flush
  – The pipeline is not stalled on a conflict; instead, instructions of the conflicting thread are invalidated and the thread is restarted at the conflicting instruction
  – Guarantees progress of the other threads
Blue Gene/Q Compute Chip

• 16+1 core SMP; each core 4-way hardware threaded
• Transactional memory and thread-level speculation
• Quad floating point unit on each core; 204.8 GF peak per node
• Frequency target of 1.6 GHz
• 563 GB/s bisection bandwidth to shared L2 (Blue Gene/L at LLNL has 700 GB/s for the whole system)
• 32 MB shared L2 cache (16 x 2 MB slices behind a full crossbar switch)
• 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3), 2 channels each with chip-kill protection
• Dual external DDR3 controllers
• Network: 10 intra-rack and inter-rack interprocessor links, each at 2.0 GB/s (5-D torus)
• One I/O link at 2.0 GB/s (to the I/O subsystem)
• 8-16 GB memory/node
• ~60 W max DD1 chip power
• Note: chip I/O shares function with PCI-Express

[Chip diagram: 17 PPC+FPU cores, each with an L1 prefetcher (PFL1), connected through a full crossbar switch to 16 x 2 MB L2 slices, dual DDR-3 controllers with external DDR3, DMA, and test logic]
Scalability Enhancements: the 17th Core

• RAS event handling and interrupt off-load
  – Reduces OS noise and jitter
  – Core-to-core interrupts when necessary
• CIO client interface
  – Asynchronous I/O completion hand-off
  – Responsive CIO application control client
• Application agents: privileged application processing
  – Messaging assist, e.g., MPI pacing thread
  – Performance and trace helpers
Blue Gene/Q Network: 5D Torus

BG/Q Networks
• Torus network
  – 5-D torus in compute nodes
  – 2 GB/s bidirectional bandwidth on all (10+1) links; 5-D nearest-neighbor exchange measured at ~1.75 GB/s per link
  – Both collective and barrier networks are embedded in this 5-D torus network
  – Virtual cut-through (VCT) routing
  – Floating point addition support in the collective network
• Compute-rack-to-compute-rack bisection bandwidth (46x BG/L, 19x BG/P)
  – 20.1 PF: bisection is 2x16x16x12x2 (bidirectional) x 2 (torus, not mesh) x 2 GB/s link bandwidth = 49.152 TB/s
  – 26.8 PF: bisection is 2x16x16x16x4 x 2 GB/s = 65.536 TB/s
  – BG/L at LLNL is 0.7 TB/s
• I/O network to/from compute rack
  – 2 links (4 GB/s in, 4 GB/s out) feed an I/O PCIe port
  – Every Q32 node card has up to 8 I/O links or 4 ports
  – Every rack has up to 32x8 = 256 links or 128 ports
• I/O rack
  – 8 I/O nodes/drawer; each node has 2 links from the compute rack and 1 PCIe port to the outside world
  – 12 drawers/rack
  – 96 I/O nodes, or 96x4 (PCIe) = 384 GB/s ≈ 3 Tb/s
Network Performance

• All-to-all: 97% of peak
• Bisection: > 93% of peak
• Nearest-neighbor: 98% of peak
• Collective: FP reductions at 94.6% of peak
• No performance problems identified in network logic
BGQ Changes from BGL/BGP

• New node
  – New voltage-scaled processing core (A2) with 4-way SMT
  – New SIMD floating point unit (8 flop/clock) with alignment support
  – New "intelligent" prefetcher
  – 17th processor core for system functions
  – Speculative multithreading and transactional memory support with 32 MB of speculative state
  – Hardware mechanisms to help with multithreading ("fetch & op", "wake on pin")
  – Dual SDRAM-DDR3 memory controllers with up to 16 GB/node
• New network
  – 5-D torus in compute nodes
  – 2 GB/s bidirectional bandwidth on all (10+1) links
  – Bisection bandwidth of 65 TB/s (26 PF/s) / 49 TB/s (20 PF/s); BG/L at LLNL is 0.7 TB/s
  – Collective and barrier networks embedded in the 5-D torus network
  – Floating point addition support in the collective network
  – 11th port for auto-routing to the I/O fabric
• I/O system
  – I/O nodes in separate drawers/racks with a private 3-D (or 4-D) torus
  – PCI-Express Gen 2 on every node with a full-sized PCI slot
  – Two I/O configurations (one traditional, one conceptual)
BG/Q Software & Programming Model
BG/Q Software

Property                                | BG/Q
----------------------------------------|------------------------------------------------------------
Overall Philosophy                      | Scalability: scale infinitely, more functionality added
Openness                                | Almost all open
Kernel                                  |
  Compute Node OS                       | CNK, Linux, Red Hat
  I/O Node OS                           | SMP Linux with SMT, Red Hat
  Linking                               | Static or dynamic
  Library/threading                     | glibc/pthreads
  System call interface                 | Linux/POSIX system calls
Programming Model                       |
  Programming models                    | MPI, OpenMP, UPC, ARMCI, Global Arrays, Charm++
  Low-level general messaging           | DCMF, generic parallel program runtimes, wake-up unit
  Hybrid                                | 1-64 processes, 64-1 threads
  Shared memory                         | Yes
Control                                 |
  Run mode                              | HPC, generalized sub-partitioning, HA with job cont
  Scheduling                            | Generic and real-time API
Generalizing and Research Initiatives   |
  OS                                    | Linux, ZeptoOS, Plan 9
  Tools                                 | HPC/S Toolkit, dyninst, valgrind
  Commercial                            | Kittyhawk, Cloud, SLAcc, ASF
  Financial                             | Kittyhawk, InfoSphere Streams
Blue Gene/Q System Architecture

[System diagram: compute nodes (C-Node 0..n, running CNK with app processes) connect over the torus/collective network to I/O nodes (running Linux with ciod and fs client), which reach file servers and front-end nodes over a 10 Gb QDR functional network; the service node (LoadLeveler, MMCS, DB2, system console) manages the machine over a 1 Gb control Ethernet, reaching node FPGAs via JTAG; optical links connect compute and I/O nodes]
BG/Q innovations will help programmers cope with an exploding number of hardware threads

• Exploiting a large number of threads is a challenge for all future architectures; this is a key component of the BGQ research.
• Novel hardware and software are used in BGQ to:
  a) Reduce the overhead of handing off work to high numbers of threads, as used in OpenMP and messaging, through hardware support for atomic operations and fast wake-up of cores
  b) Provide a multiversioning cache that helps along several dimensions, such as performance, ease of use and RAS
  c) Provide an aggressive FPU allowing higher single-thread performance for some applications: most will get a modest bump (10-25%), some a big one (approaching 300%)
  d) Provide "perfect" prefetching for repeated memory reference patterns in arbitrarily long code segments, which also helps achieve higher single-thread performance for some applications
BG/Q Node Features

• Atomic operations
  – Pipelined at L2 rather than retried as in commodity processors
  – Low latency even under high contention
  – Faster OpenMP work hand-off; lowers messaging latency
• Wake-up unit
  – Allows SMT threads to "sleep" while waiting for an event
  – Faster OpenMP work hand-off; lowers messaging latency
• Multiversioning cache
  – Transactional memory eliminates the need for locks
  – Speculative execution allows OpenMP threading for sections with data dependencies
[Chart: "Barrier speed using different synchronizing hardware" - processor cycles (0-16,000) vs. number of threads (1-100) for three variants: atomic without invalidates, atomic with invalidates, and lwarx/stwcx - showing the improvement in atomics under contention]

[Diagram: "Thread flow using transactions" - a single MPI task forks into user-defined parallelism; each thread runs a user-defined transaction (start/end); a hardware-detected dependency conflict triggers rollback; threads rejoin at the parallelization-completion synchronization point]
NewTech: L2 Atomics

• The L2 implements atomic operations on every 64-bit word in memory
  – Allows specific memory operations to be performed atomically:
    • Load: Increment, Decrement, Clear (plus variations)
    • Store: Twin, Add, OR, XOR, Max (plus variations)
  – The L2 infers the operation when specific shadow physical addresses are read/written
• CNK exposes these L2 atomic physical addresses
  – Applications must pre-register the location of an L2 atomic before access, otherwise they segfault
  – A fast barrier implementation using L2 atomics is available
• CNK also uses L2 atomics internally
  – Fast performance counters (store with add 1)
  – Locking
Atomic Performance Results

[Chart (repeated): "Barrier speed using different synchronizing hardware" - processor cycles vs. number of threads for atomic without invalidates, atomic with invalidates, and lwarx/stwcx]
NewTech: Wakeup Unit

• Used in conjunction with the PowerPC wait instruction
  – When a hardware thread is in a wait state, it stops executing and the other hardware threads benefit from the additional available cycles
• Sends a wakeup signal to a hardware thread
• Configurable wakeup conditions:
  – Wakeup address compare
  – Messaging unit activity
  – Interrupt sources
• Allows active threads to better utilize core resources
  – Reduces wasted core cycles in polling and spin loops
• The kernel provides application interfaces to utilize the wakeup unit
• Thread guard pages
  – CNK uses the wakeup unit to provide memory protection between stack and heap
  – Detects violations, but cannot prevent them
NewTech: Transactional Memory

• A performance optimization for critical regions:
  1. Software threads enter "transactional memory" mode
     – Memory accesses are tracked; writes are not visible outside the thread until committed
  2. The calculation is performed without locking
  3. Hardware automatically detects memory contention conflicts between threads
     – If there is a conflict: the TM hardware detects it, the kernel decides whether to roll back the transaction or let the thread continue, and on rollback the compiler runtime decides whether to serialize or retry
     – If there are no conflicts, threads commit their memory
• Threads can commit out of order
• XL compiler only
• This is a DD2-only feature
Locks and Deadlock
// WITH LOCKS
void move(T s, T d, Obj key){
LOCK(s);
LOCK(d);
tmp = s.remove(key);
d.insert(key, tmp);
UNLOCK(d);
UNLOCK(s);
}
Thread 0: move(a, b, key1);
Thread 1: move(b, a, key2);
→ DEADLOCK! (the two calls acquire the locks on s and d in opposite orders)
Moreover:
• Coarse-grain locking limits concurrency
• Fine-grain locking is difficult

→ Transactional Memory
Transactional Memory (TM)

• The programmer says: "I want this atomic"
• The TM system "makes it so"
• Avoids deadlock
• Gives fine-grain-locking concurrency

void move(T s, T d, Obj key){
  atomic {
    tmp = s.remove(key);
    d.insert(key, tmp);
  }
}

• Correct
• Fast

(Slide credit: Wisconsin Multifacet Project, 2/22/2010)
BlueGene/Q Transactional Memory Mode

• User programming model:
  – The user defines the parallel work to be done
  – The user explicitly defines the start and end of transactions within the parallel work that are to be treated as atomic
• Compiler implications:
  – Interpret user program annotations to spawn multiple threads
  – Interpret the annotation for the start of a transaction and save state to memory on entry, to enable rollback
  – At the end of a transaction, test for successful completion and optionally branch back to the rollback pointer
• Hardware implications:
  – Transactional memory support required to detect transaction failure and roll back
  – L1 cache visibility for L1 hits as well as misses, allowing ultra-low overhead to enter a transaction

[Diagram (repeated): thread flow using transactions - single MPI task, user-defined parallelism, user-defined transaction start/end, hardware-detected dependency conflict with rollback, synchronization at parallelization completion]
NewTech: Speculative Execution

• Similar to transactional memory, except with ordered thread commit and a different usage model
• Leverages existing OpenMP parallelization
  – However, the compiler does not need to guarantee that there is no array overlap
  – Should allow the compiler to do a much better job of auto-parallelizing
• Total work is subdivided into work units without locking
• If work units collide in memory:
  – The SE hardware detects the collision
  – The kernel rolls back the transaction
  – The runtime decides whether to retry or serialize
• XL compiler only
• This is a DD2-only feature
TM and TLS

• User control (Standard OpenMP): the user must manage the memory contentions.

    #pragma omp parallel for ...
    for (int i = 0; i < N; i++) {
      i1 = id[i][1]; i2 = id[i][2];
      y[i1] = y[i1] + ...; y[i2] = y[i2] + ...;
    }

• System control (Transactional Memory; IBM beta software available): transactions allow operations to go through, but check for memory violations and unroll them if necessary.
  – TM ensures atomicity and isolation
  – An abstract construct that hides its details from the programmer

    #pragma omp parallel for ...
    for (int i = 0; i < N; i++) {
      i1 = id[i][1]; i2 = id[i][2];
      #pragma tm_atomic
      { y[i1] = y[i1] + ...; y[i2] = y[i2] + ...; }
    }

• System control (Thread Level Speculation; research): same memory-ordered access as a single thread. If the nonspeculative thread modifies a memory address already touched by a speculative thread, the system detects a conflict and reschedules the corresponding threads. A language extension at the compilation stage.

    #pragma ...
    for (int i = 0; i < N; i++) {
      i1 = id[i][1]; i2 = id[i][2];
      y[i1] = y[i1] + ...; y[i2] = y[i2] + ...;
    }
BG/Q Node Features (Continued)

• Quad FPU (QPX)
  – 4-wide double-precision SIMD FPU
  – 2-wide complex SIMD
  – Supports a multitude of alignments
  – Allows higher single-thread performance for some applications
  – 4R/2W register file: 32 x 256 bits per thread
  – 32-byte (256-bit) datapath to/from the L1 cache
• "Perfect" prefetching
  – Augments traditional stream prefetching
  – Used for repeated memory reference patterns in arbitrarily long code segments
  – Records the pattern on the first iteration of a loop; plays it back for subsequent iterations
  – Tolerates missing or extra cache misses
  – Achieves higher single-thread performance for some applications
[Diagram: "QPX: Quad FPU" - four MAD units (MAD0-MAD3), each with a register file, plus permute and load units; a 256-bit datapath connects to the A2 core's 64-bit interface]

[Diagram: list-based "perfect" prefetching - L1 miss addresses are recorded into a list; on replay, the list addresses drive the prefetched addresses, with tolerance for missing or extra cache misses]
Auto-SIMDization Support

• Improved SIMDization from BG/L/P to BG/Q
  – Moved from the low-level optimizer to the high-level optimizer, thanks to its better loop optimization structure and alignment analysis
• To enable SIMDization (assuming the appropriate -qarch=qp and -qtune=qp options):
  – -O3 (compile time, limited hot and SIMD optimizations)
  – Increase the optimization level:
    • -O4 (compile time, limited-scope analysis, SIMDization)
    • -O5 (link time, pointer analysis, whole-program analysis, and SIMD instructions)
  – -qhot=level=0 (enables SIMD by default)
• To disable SIMDization
  – For the whole program: add -qhot=nosimd to the previous command-line options
  – For a particular loop: #pragma nosimd (C) or !IBM* NOSIMD (Fortran)
• Tune your programs
  – Help the compiler with extra information (directives/pragmas)
  – Check SIMD instruction generation in the object code listing (-qsource -qlist)
  – Use compiler feedback (-qreport -qhot) to guide you
NewTech: L1 Perfect Prefetcher

• The L1p has two different prefetch algorithms:
  – Stream prefetcher: similar to the prefetcher algorithms used on BG/L and BG/P
  – "Perfect" (list) prefetcher
• Perfect prefetcher:
  – First iteration through the code: the BQC records the sequence of memory load accesses, and the sequence is stored in DDR memory
  – Subsequent iterations: the BQC loads the sequence and tracks where the code is in it; the prefetcher attempts to prefetch memory before it is needed
• The kernel provides access routines to set up and configure the stream and perfect prefetchers
Blue Gene/Q Partitioning

Blue Gene Compute Blocks

• BG/P
  – Partitions are denoted by the letters XYZT: XYZ for the 3 dimensions and T for the core (0-3)
• BG/Q
  – The 5 dimensions are denoted by the letters A, B, C, D, and E, with T for the core (0-15)
  – The last dimension, E, is always 2 and is contained entirely within a midplane
  – For any compute block, compute nodes (as well as midplanes for large blocks) are combined in 4 dimensions, so only 4 dimensions need to be considered
Partitions

• Compute partitions are generally grouped into two categories:
  – Occupying one or more complete midplanes (large blocks)
  – Occupying less than a midplane (small blocks)
• Large blocks:
  – Always a multiple of 512 nodes
  – Can be a torus in all 5 dimensions
• Small blocks:
  – Occupy one or more node cards on a midplane
  – Always a multiple of 32 nodes (32, 64, 128, 256)
  – Cannot be a torus in all 5 dimensions
Compute Partitions - Large Blocks

• The diagram shows a 4x4x4x3 (A,B,C,D) machine - what Sequoia is currently planned to be: 192 midplanes
• Each blue box is a midplane; this is an abstract view, as the midplanes are not physically organized this way
• A midplane is 4x4x4x4x2 (512) nodes
• The Sequoia full system is 16x16x16x12x2 nodes
• Using A, B, C, D, E to denote the 5 dimensions: the last dimension (E) is fixed at 2 and does not factor into the partitioning details
Node Board: 32 nodes, 2x2x2x2x2

[Diagram: node-board torus layout - axes A, B, C, D, and E (the twin direction, always 2); nodes 00-31 arranged with example coordinates (0,0,0,0,0), (0,0,0,1,0), (0,0,0,0,1)]
Partitions: Mesh versus Torus

Previously, on BG/P, a whole block was either torus or mesh; now torus vs. mesh is decided on a per-dimension basis, with partial torus for sub-midplane blocks and partial torus for multi-midplane blocks.

# Node Cards        | # Nodes | Dimensions | Torus (ABCDE)
--------------------|---------|------------|--------------
1                   | 32      | 2x2x2x2x2  | 00001
2 (adjacent pairs)  | 64      | 2x2x4x2x2  | 00101
4 (quadrants)       | 128     | 2x2x4x4x2  | 00111
8 (halves)          | 256     | 4x2x4x4x2  | 10111
5D Torus Wiring in a Midplane: (A,B,C,D,E)

[Diagram: side view of a midplane - nodeboards N00-N15, with arrows showing how the A, B, C and D dimensions span across pairs of nodeboards]

• Each nodeboard is 2x2x2x2x2
• The arrows show how dimensions A, B, C, D span across nodeboards
• Dimension E does not extend across nodeboards
• The nodeboards combine to form a 4x4x4x4x2 torus
• Note that nodeboards are paired in dimensions A, B, C and D, as indicated by the arrows

The 5 dimensions are denoted by the letters A, B, C, D, and E. The last dimension, E, is always 2 and is contained entirely within a midplane.
Partitioning a Midplane into Blocks
© 2009 IBM Corporation49
I/O Requirements
© 2009 IBM Corporation50
Sub-block jobs
© 2009 IBM Corporation51
Sub-block Jobs 1/2

• BG/L and BG/P mpirun
  – Minimum job size: 1 pset (an I/O node and its associated compute nodes)
  – e.g., mpirun -np 4 would waste the remaining 28 compute nodes on a node board
• Single-node HTC jobs: single-node jobs without MPI
• Sub-block jobs
  – New in BG/Q
  – Allow multiple multi-node jobs to share one or more I/O nodes
  – Rectangular shapes smaller than the block
  – Restricted to a midplane (512 nodes) and smaller shapes
  – MPI fully supported (some collectives may not be optimized)

[Diagram: one single BG/Q partition]
Sub-block Jobs 2/2

• Can subdivide a block into a smaller torus
  – Allows small jobs to run on a block while maintaining the I/O requirement
• Specify a corner and a shape on runjob, for example:
    runjob --exe hello --block R00-M0 --corner R00-M0-N04-J00 --shape 1x2x2x1x2 --ranks-per-node 16
• MPI is limited by hardware from sending messages outside this sub-block
  – System software (the kernel) can send outside, allowing I/O to function
  – Note that I/O will be routed through sub-blocks (i.e., some noise is introduced)
• High-throughput computing
  – For the extreme-scale single-task HTC case, a job can be targeted at a single core on a node
  – i.e., 16 independent jobs can run on a single BG/Q compute node
Programmability

• Standards-based programming environment
  – Linux development environment
  – Familiar GNU toolchain with glibc, pthreads, gdb
  – XL compilers providing C, C++, Fortran with OpenMP
  – TotalView debugger
• Message passing
  – Optimized MPICH2 providing MPI 2.2
  – Intermediate- and low-level message libraries available, documented, and open source
  – GA/ARMCI, Berkeley UPC, etc., ported to this optimized layer
• Compute Node Kernel (CNK) eliminates OS noise
  – File I/O offloaded to I/O nodes running full Linux
  – glibc environment with few restrictions for scaling
• Flexible and fast job control
  – MPMD and sub-block jobs now supported
Toolchain and Tools

• BGQ GNU toolchain
  – gcc is currently at 4.4.4 (will be updated again before ship)
  – glibc is 2.12.2 (optimized QPX memset/memcpy)
  – binutils is at 2.21.1
  – gdb is 7.1, with QPX registers
  – gmon/gprof thread support: profiling can be turned on/off on a per-thread basis
• Python
  – Running both Python 2.6 and 3.1.1
  – NumPy, pynamic, UMT all working
  – Python is now an RPM
• The Toronto compiler test harness is running on BGQ LNs
CNK Overview

• Compute Node Kernel (CNK): binary compatible with Linux system calls
  – Leverages Linux runtime environments and tools
• Up to 64 processes (MPI tasks) per node; SPMD and MIMD support
• Multi-threading: optimized runtimes
  – Native POSIX Thread Library (NPTL)
  – OpenMP via XL and GNU compilers
  – Thread-Level Speculation (TLS)
• System Programming Interfaces (SPI)
  – Networks and DMA, global interrupts
  – Synchronization, locking, sleep/wake
  – Performance counters (UPC)
• Application environment: MPI and OpenMP (XL, GNU), Transactional Memory (TM), Speculative Multi-Threading (TLS), shared and persistent memory, scripting environments (Python), dynamic linking, demand loading
• Firmware: boot, configuration, kernel load; control system interface; common RAS event handling for CNK and Linux
Parallel Active Message Interface (PAMI)

• The message layer core has C++ message classes and other utilities to program the different network devices
• Supports many programming paradigms

[Stack diagram: applications (MPICH, Converse/Charm++, ARMCI, Global Arrays, UPC*, other paradigms*) sit on the PAMI API (C) via high-level and low-level APIs; beneath it, the message layer core (C++) implements point-to-point and collective protocols over SPI and the device layer (DMA, collective, GI, and shmem devices), down to the network hardware (DMA, collective network, global interrupt network)]
Blue Gene/Q Job Submission

• BG/L & BG/P had several job submission interfaces:
  – mpirun
  – submit
  – submit_job
  – mpiexec
• BG/Q has a single interface for job submission: runjob
Job Submission: the runjob Command

• runjob is somewhat backwards compatible with mpirun
• Supported args: exe, arg, env, exp_env, env_all, np, partition, strace, label, mode
• Unsupported args: shape, psets_per_bp, connect, reboot, nofree, noallocate, free, boot_options
• Primary differences
  – runjob does not boot or free partitions; it only runs the job
  – Booting/rebooting/freeing is achieved by a scheduler, console, or script
  – The partition arg is always required
• Sub-partition jobs require a new location arg
  – Ranges: 0-7
  – Lists: 1,2,4,9
  – Combinations: 0-7,16,17,18,19
  – Specific processor: 0.1 (for single-node jobs only)
Ranks per Node

• BG/L mpirun
  – Supported SMP and co-processor modes: either 1 or 2 ranks per node
• BG/P mpirun
  – Supported SMP, dual, and virtual node modes: 1, 2, or 4 ranks per node
• BG/Q runjob
  – Supports 1, 2, 4, 8, 16, 32, or 64 ranks per node
  – The parameter is --ranks-per-node rather than --mode
Execution Modes in BG/Q per Node
Job Launch

• Single interface for job submission, runjob, replacing:
  – mpirun
  – submit
  – submit_job
  – mpiexec
• 'Scheduler' behavior removed from mpirun: runjob cannot create blocks
• Gateway for debuggers
• runjob on the LNs controls jobs via the SN
• Paths shown between LNs, SN and IONs are InfiniBand or 10 Gbps Ethernet
• Note the new daemons on the IONs
BG/Q MPI Implementation

• MPI-2.1 standard (http://www.mpi-forum.org/docs/docs.html)
• BG/Q MPI execution command: runjob
• To support the Blue Gene/Q hardware, the following additions and modifications were made to the MPICH2 software architecture:
  – A Blue Gene/Q driver that implements the MPICH2 abstract device interface (ADI)
  – Optimized versions of the Cartesian functions (MPI_Dims_create(), MPI_Cart_create(), MPI_Cart_map())
  – MPIX functions providing hardware-specific MPI extensions
5-Dimensional Torus Network
� The 5-dimensional Torus network provides point-to-point and collective communication facilities.
� Point-to-point messaging: the route from a sender to a receiver on a torus network can take one of two possible paths:
– Deterministic routing
• Packets from a sender to a receiver always travel along the same path.
• Advantage: low latency, maintained without additional logic.
• Disadvantage: this technique can create network hot spots when several point-to-point communications overlap.
– Adaptive routing
• This technique generates a more balanced network load but introduces a latency penalty.
� Selecting deterministic or adaptive routing depends on the protocol used for communication – 4 in BG/Q: Immediate Message, MPI short, MPI eager, and MPI rendezvous
� Environment variables can be used to customize MPI communications (cf. the IBM BG/Q Redbook)
Blue Gene/Q extra MPI communicators
� int MPIX_Cart_comm_create(MPI_Comm *cart_comm)
– This function creates a six-dimensional (6D) Cartesian communicator that mimics the exact hardware on which it is run. The A, B, C, D, and E dimensions match those of the block hardware, while the T dimension is equivalent to the ranks-per-node argument to runjob.
� Changing class-route usage at runtime
– int MPIX_Comm_update(MPI_Comm comm, int optimize)
� Determining hardware properties
– int MPIX_Init_hw(MPIX_Hardware_t *hw)
– int MPIX_Torus_ndims(int *numdimensions)
– int MPIX_Rank2torus(int rank, int *coords)
– int MPIX_Torus2rank(int *coords, int *rank)
Blue Gene/Q Kernel Overview
Processes
� Similarities to BGP
– Number of tasks per node fixed at job start
– No fork/exec support
– Supports statically and dynamically linked processes
– -np supported
� Plus:
– 64-bit processes
– Support for 1, 2, 4, 8, 16, 32, or 64 processes per node
• Numeric “names” for process configurations (i.e., not smp, dual, quad, octo, vnm, etc.)
– Processes use 16 cores
– The 17th core on the BQC chip is reserved for:
• Application agents
• Kernel networking
• RAS offloading
– Sub-block jobs
� Minus:
– No support for 32-bit processes
Threads
� Similarities to BGP– POSIX NPTL threading support
• E.g., libpthread
� Plus
– Thread affinity and thread migration
– Thread priorities
• Support both pthread priority and A2 hardware thread priority
– Full scheduling support for A2 hardware threads
– Multiple software threads per hardware thread is now the default
– CommThreads have extended priority range compared to normal threads
– Performance features
• HWThreads in the scheduler without pending work are put into a hardware wait state
– No cycles consumed from other hwthreads on the core
• Snoop scheduler providing user-state fast access to:
– Number of software threads that exist on the current hardware thread
– Number of runnable software threads on the current hardware thread
Memory
� Similarities to BGP
– Application text segment is shared
– Shared memory
– Memory Protection and guard pages
� Plus
– 64-bit virtual addresses
• Supports up to 64GB of physical memory*
• No TLB misses
• Up to 4 processes per core
– Fixed 16 MB memory footprint for CNK; the remainder of physical memory goes to applications
– Memory protection for primordial dynamically-linked text segment
– Memory aliases for long-running TM/SE
– Globally readable memory
– L2 atomics
System Calls
� Similarities to BGP
– Many common syscalls on Linux work on BG/Q.
– Linux syscalls that worked on BGP should work on BGQ
� Plus
– Support glibc 2.12.2
– Real-time signals support
– Low overhead syscalls
• Only essential registers are saved and restored
– Pluggable File Systems
• Allows CNK to support multiple file system behaviors and types
Different simulator platforms have different capabilities
• File systems:
Function shipped
Standard I/O
Ramdisk
Shared Memory
I/O Services
� Similarities to BGP
– Function shipping system calls to ionode
– Support NFS, GPFS, Lustre and PVFS2 filesystems
� Plus
– PowerPC64 Linux running on 17 cores
– Supports 8192:1 compute task to ionode ratio
• Only 1 ioproxy per compute node
• Significant internal changes from BGP
– Standard communications protocol
• OFED verbs
• Using Torus DMA hardware for performance
– Network over Torus
• E.g., tools can now communicate between IONodes via torus
– Uses an “off-the-shelf” InfiniBand driver from Mellanox
– ELF images are now pulled from the I/O nodes rather than pushed
Debugging
� Similarities to BGP
– GDB
– Totalview
– Coreprocessor
– Dump_all, dump_memory
– Lightweight and binary corefiles
� Plus
– Tools interface (CDTI)
• Allows customers to write custom debug and monitoring tools
– Support for versioned memory (TM/SE)
– Fast breakpoint and watchpoint support
– Asynchronous thread control
• Allow selected threads to run while others are being debugged
BG/Q Application Environment
Compiling MPI programs on Blue Gene/Q
There are six versions of the libraries and the scripts:
– gcc: GNU compiler with fine-grained locking in MPICH – error checking
– gcc.legacy: GNU with a coarse-grained lock – slightly better latency for single-threaded code
– xl: PAMI compiled with GNU – fine-grained lock
– xl.legacy: PAMI compiled with GNU – coarse-grained lock
– xl.ndebug: xl with error checking and asserts off
⇒ lower latency but less debug info
– xl.legacy.ndebug: xl.legacy with error checking and asserts off
Control/monitoring
� Provides the Blue Gene Web Services
– Getting data (blocks, jobs, hardware, envs, etc.)
– Create blocks, delete blocks, run diags, etc.
� A web server
� Runs under BGMaster
– Should run as the special bgws user for security
� New for BG/Q
– BG/P had the Navigator server (Tomcat)
– BG/L used Tomcat
Blue Gene Navigator
Debugging – Batch scheduler
� Debugging
– Integrated Tools
• GDB
• Core Files + addr2line
• coreprocessor
• Compiler Options
• Traceback functions, memory size kernel, signal or exit trap, …
– Supported Commercial Software
• Totalview
• DDT (Allinea) ?
� Batch scheduler
– IBM LoadLeveler
– SLURM
– LSF?
Coreprocessor GUI
Performance Analysis
� Profiling
– GNU profiling, vprof with command line or GUI
� IBM HPC Toolkit, IBM mpitrace library
� Major Open-Source Tools
– Scalasca
– TAU– mpiP http://mpip.sourceforge.net
– …
IBM mpi trace library – HPC Toolkit
� MPI timing summary
� Communication and elapsed times
� Heap memory used
� MPI basic information: # calls, message sizes, # hops
� Call stack for every MPI function call
� Identification of source and destination torus coordinates for point-to-point messages
� Unix-based profiling
� BG/Q Hardware-counters
� Event-tracing
Thanks, Questions?