TRANSCRIPT
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
X-Caliber & XGC System Software Research & Development
Kyle Wheeler, SOS16
Thursday, March 15, 2012
Context
• X-Caliber: our DARPA UHPC effort
• XGC: Extreme-Scale Computing Grand Challenge LDRD
 – (Laboratory Directed Research and Development)
• Consistent challenge: figure out what exascale system software looks like
 – Collaborate with the level above and the level below
 – Leverage technology trends
 – Rethink the application space (what will be important in a decade?)
 – Be metric-focused!
 – Picojoules, picojoules, picojoules ... and time too!
X-Caliber Software Team
• Sandia
 – Brian Barrett, Kyle Wheeler, Dylan Stark
• Indiana University
 – Thomas Sterling
• Louisiana State University
 – Hartmut Kaiser, Chirag Dekate
• University of Illinois
 – William Gropp, Marc Snir
• USC/ISI
 – Pedro Diniz
XGC Software Team
• Sandia
 – Brian Barrett, Ron Brightwell, Kevin Pedretti, Dylan Stark, and Kyle Wheeler
• University of Illinois
 – Vikram Adve, Bill Gropp, Marc Snir
• RENCI
 – Allan Porterfield
XGC Thrust Areas
(1) Baseline - what happens if we do nothing?
(2) Microsystems - Key Data Movement Enabling Technologies
(3) Architecture - Coping with Concurrency and Data Movement
[Figure 3.2: EMP High-Level Architecture. Multiple vaults, each comprising banks across stacked DRAM layers 1 through N plus a logic layer containing a memory controller (MC), VAU, and DAU; the vaults connect through an on-chip network (topology, type, etc. TBD) to a memory network interface (SerDes).]
cores in the cluster and has several options in how the cores are used. First, the cores can be used to operate on completely independent data. Second, the cores can be ganged together to operate in lockstep, allowing multiple identical operations to proceed with much lower synchronization overhead. Third, tasks can be allocated in a producer/consumer model using hardware mailboxes to stream data through the cores. This flexibility allows the processing to adapt to the requirements of different applications. The functionality of the hardware thread manager is a key research component of X-Caliber.
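To make the third mode concrete, here is a minimal sketch of a two-stage pipeline in C. The mailbox operations are hypothetical placeholders (the text defines no mailbox API); the point is the structure: the hardware mailboxes provide blocking send/receive, so the pipeline needs no explicit locks or condition variables.

/* Sketch only: mbox_send()/mbox_recv() are hypothetical intrinsics standing
 * in for whatever ISA support the thread manager would expose. */
#include <stddef.h>
#include <stdint.h>

extern void     mbox_send(int mbox, uint64_t value); /* blocks when full  */
extern uint64_t mbox_recv(int mbox);                 /* blocks when empty */

#define DONE UINT64_MAX /* assumed in-band end-of-stream sentinel */

/* Producer core: filters its input and streams survivors to the consumer. */
void producer(const uint64_t *in, size_t n, int consumer_mbox) {
    for (size_t i = 0; i < n; i++)
        if (in[i] != 0)
            mbox_send(consumer_mbox, in[i]);
    mbox_send(consumer_mbox, DONE);
}

/* Consumer core: reduces whatever arrives in its mailbox. */
uint64_t consumer(int my_mbox) {
    uint64_t sum = 0, v;
    while ((v = mbox_recv(my_mbox)) != DONE)
        sum += v;
    return sum;
}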
[Figure 8: Notional core cluster for the compute-intensive processor. Two cores, each with an L1 cache, a 4-wide FP vector unit, a multithreaded register file (threads 0 through n), and mailboxes, sharing local memory (cache and/or scratchpad) under a thread manager connected to the NoC.]
It is anticipated that the performance of the CIP will need to be artificially limited in order to meet the thermal and power requirements for the module. Because of this, the normal operating frequency will be 1.5 GHz, but the maximum frequency will be closer to 2.5 GHz. X-Caliber will take advantage of this by using aggressive thermal monitoring and management to enable two forms of sprint modes. The first form, which can be sustained indefinitely, is to pair cores and allow one of those cores to run at 2.5 GHz as long as the other is turned off. This maintains a constant peak power output, but may result in thermal hotspots. To alleviate this, we will investigate ways to ping-pong compute between the paired cores. In this way, each core would operate for a time before moving thread state to the paired core, which would then continue computation. This mode would allow applications which cannot take advantage of the available parallelism to achieve higher performance. This mode could be specified by the compiler, and/or automatically enabled when only a small number of threads are present.
The second form of sprinting would allow any core to accelerate to 2.5 GHz for very short periods of time. This would increase instantaneous power draw and would rely on thermal inertia to keep from overheating. The thermal state would be closely monitored and sprint mode would be turned off as thermal limits were met. This sprint state is useful for moving through Amdahl regions of code, which could be marked by the compiler. The extreme of this mode is to sprint whenever allowed by the thermal state.
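As a sketch of how the second sprint form might look as a policy loop: the sensor and frequency-control calls below are hypothetical placeholders (the real mechanism would live in hardware or firmware), and the thermal threshold is an assumption, not a figure from the proposal. Only the 1.5 GHz and 2.5 GHz frequencies come from the text.

/* Sketch of the second sprint form as a software policy loop. read_temp_mC()
 * and set_core_freq_mhz() are hypothetical placeholders; T_LIMIT_MC is an
 * assumed threshold. */
extern int  read_temp_mC(int core);               /* placeholder sensor read */
extern void set_core_freq_mhz(int core, int mhz); /* placeholder DVFS knob   */

enum { F_NOMINAL = 1500, F_SPRINT = 2500 }; /* MHz, from the text           */
enum { T_LIMIT_MC = 85000 };                /* 85 C, assumed thermal limit  */

/* Entering a compiler-marked Amdahl (serial) region: sprint if the thermal
 * state allows, relying on thermal inertia for the short burst. */
void enter_amdahl_region(int core) {
    if (read_temp_mC(core) < T_LIMIT_MC)
        set_core_freq_mhz(core, F_SPRINT);
}

/* Periodic thermal monitor: drop back to nominal as the limit is met. */
void thermal_monitor_tick(int core) {
    if (read_temp_mC(core) >= T_LIMIT_MC)
        set_core_freq_mhz(core, F_NOMINAL);
}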
2.4.1.3 Network The network is built from a single integrated component referred to as Merlin. For energy efficiency, Merlin consists of an integrated network interface controller (NIC) and a 21-port router. The Merlin component benefits from the advances in 3D stacking and silicon photonics, and consists of one or more logic layers coupled to a photonics carrier using 3D integration. The relatively small number of ports in the router allows a more energy-efficient implementation compared to a separate, larger router, by reducing the on-chip interconnect lengths (and thus power). Each module contains two Merlin components, which serves to increase interconnect bandwidth.
(4) System Software - enabling a new model of computation
(5) Application Drivers
Safety and Security
Reentry, Circuitry, Graph, Stream
Hardware Challenges
• Exponential increase in node-level parallelism
• Lower memory capacity per core
 – Weak scaling will be insufficient
• Significantly lower network-to-memory bandwidth ratios
• Need for system software to have finer control of hardware resources
Fig. 1.2 shows the top-level diagram of the ELVIS core.
[Figure 1.2: ELVIS Core Structure. An ELVIS core supports 16 threads across units including the Trap Unit (TU), Floating Point Unit (FPU), Integer Execution Unit (IEU), Inbound Thread Unit (ITU), Runnable Thread Unit (RTU), Deferred Thread Unit (DTU), Data Cache Unit (DCU), Instruction Fetch Unit (IFU), Pipe Unit (PU), Register Files 0 and 1, Thread Assignment Unit, Incoming Transaction Unit, and I/O Unit, with inputs from the Service Processor (SP) and DFRU and outputs to the CCX. Bus 1: FORK/SPAWN request from I/O. Bus 2: thread is assigned to an available thread slot. Bus 3: thread is deferred and placed in the DTUQ. Bus 4: a FORK/SPAWN command enqueuing a thread assignment request in the RTUQ.]
[Chart: Memory Capacity Trend, bytes per flop (peak) for systems from 1996 through 2012 on a log scale from 0.01 to 10.00; systems are getting less memory rich. Source slide: DOE Arch Workshop, Aug. 2, 2011.]
Application Challenges
[Plot: Benchmark Suite Mean Temporal vs. Spatial Locality, comparing traditional (FP) Sandia applications (what we traditionally care about), emerging (integer) Sandia applications, and informatics applications (what industry cares about) against SPEC FP, SPEC Int, RandomAccess, LINPACK, and STREAM. From: Murphy and Kogge, "On The Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications," IEEE Trans. on Computers, July 2007.]
• Huge variety in programming models and runtimes emerging
 – Evolutionary BSP-originated applications
 – Revolutionary programming models
  • Everyone's got one, and they're all the best
  • 101+ actively developed parallel programming languages
 – Lots of new application varieties
 – Flexibility key
• Multiple optimization points
 – Time to solution
 – Energy to solution
 – Money to solution
 – Total system efficiency
Foundational Knowledge
• Distributed systems scaling determined by:
 – Ability to move data
 – Synchronization
• Lightweight system software WORKS
 – ASCI Red, ASC Red Storm, BG/{L,P,Q}
 – Low perturbation of applications
• Synchronization costs
 – Local and remote
 – Explicit and implicit
Research Questions
• How will threads evolve to be more lightweight and match hardware semantics?
 – What will hardware threading semantics be?
• What synchronization primitives are necessary for highly asynchronous applications? (see the sketch after this list)
 – Free, Fast, Infinite
• What memory consistency models are necessary?
 – ... or even useful?
• What communication primitives are necessary for evolving applications?
 – Probably not six-function MPI
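One way to ground the synchronization question: the sketch below (mine, not from the talk) contrasts an explicit mutex-protected update with a C11 atomic increment. "Free, Fast, Infinite" asks which end of this spectrum hardware can push primitives toward.

/* Sketch (not from the talk): the same logical update expressed with an
 * explicit lock vs. a single atomic read-modify-write.
 * Compile with -std=c11 -pthread. */
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter_locked = 0;
static atomic_long counter_atomic;

/* Explicit and relatively heavyweight: lock-line traffic, and possible
 * kernel involvement under contention. */
void incr_with_mutex(void) {
    pthread_mutex_lock(&lock);
    counter_locked++;
    pthread_mutex_unlock(&lock);
}

/* Implicit and lightweight: one hardware RMW instruction, closer to the
 * "free, fast, infinite" ideal for highly asynchronous applications. */
void incr_with_atomic(void) {
    atomic_fetch_add(&counter_atomic, 1);
}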
Necessity is the Mother of Invention
• Need insight into:
 – Trade-offs between different data/work movement strategies
 – Cost of synchronization/protection mechanisms with real applications
 – How much automaticity/adaptivity is necessary in large-scale applications?
• Research is slowed by lack of an experimental platform
• Use both clusters and simulation as foundational experimental platforms!
• Combine Kitten, Portals, and Qthreads to build a multi-node multi-threaded runtime for experimentation (SPR)
Scalable Parallel Runtime (SPR)
• Qthreads: Lightweight threading interface
 – Scalable, lightweight scheduling on NUMA platforms
 – Supports a variety of synchronization mechanisms, including Full/Empty bits and atomic operations
 – Potential for direct hardware mapping
• Portals 4: Lightweight networking API
 – Semantics for supporting both one-sided and tagged message passing
 – Small set of primitives, allows offload from main CPU
 – Supports direct hardware mapping
• Kitten: Lightweight OS kernel
 – Builds on lessons from ASCI Red, Cplant, Red Storm
 – Utilizes scalable parts of Linux environment
 – Primarily supports direct hardware mapping
Kitten Lightweight Kernel
• Simple compute node OS
 – Tool for OS+runtime research
 – Looks like Linux to applications and tools
• Current R&D
 – Job launch via OpenMPI ORTE / mpirun
 – Support for Intel MIC, Arthur cluster at Sandia
 – System-call forwarding
 – Low-overhead task migration
Round-Trip Task Migration Time
(task on core A migrates to core B, then back to A)

Operating System | Round-Trip Time
Linux 2.6.35.7   | 4435 ns
Kitten 1.3       | 2630 ns

Core-switching performance between two cores in the same Intel X5570 2.93 GHz processor. Kitten achieves a speedup of 1.7 compared to Linux, due to its simpler implementation.
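The published numbers come from the authors' own harness; as a hedged illustration of how such a round-trip migration measurement could be structured on Linux using only standard interfaces:

/* Illustrative only: one way to time task migration round trips on Linux
 * with sched_setaffinity(); this is not the harness behind the numbers
 * above. Build with: cc -O2 -o migrate migrate.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

static void pin_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof set, &set); /* move this task to 'cpu' */
}

int main(void) {
    enum { ITERS = 10000 };
    struct timespec t0, t1;
    pin_to(0); /* start on core A */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        pin_to(1); /* core A -> core B */
        pin_to(0); /* core B -> core A */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip: %.0f ns\n", ns / ITERS);
    return 0;
}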
Kitten LWK supports running native applications alongside guest OSes. Weak-scaling performance of the Catamount guest OS is within 5% of the native Catamount OS at 4096 nodes.
[Chart: execution time (seconds, 0 to 400) vs. number of nodes (64 to 4096) for native Catamount and for Catamount guests with nested and shadow paging.]
Portals4 Lightweight Comm.
• Simple low-level communication layer
 – Tool for communication+runtime research
 – Thread-safe by design
 – Supports legacy and next-gen applications and tools
 – Common substrate to allow efficient use and sharing of resources among higher-level protocols
• Current R&D
 – Shared InfiniBand and SMP multi-core progress engines
 – Efficient blocking/waiting mechanisms
Much of the potential complexity gets out of the way for basic operations.
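As an illustration of that claim, here is a minimal initiator-side one-sided put in the LE/CT ("UPC-like") style, written against the Portals 4 specification as I read it. Error handling is elided and the target is assumed to have already posted a list entry at portal table index 0 covering the destination buffer; treat this as a sketch to check against portals4.h, not verified code.

/* Sketch of an initiator-side one-sided put over Portals 4 (LE/CT style).
 * Error checks elided; assumes the target posted an LE at portal table
 * index 0 covering the destination. Verify signatures against portals4.h. */
#include <portals4.h>

void put_once(ptl_process_t target, void *buf, ptl_size_t len) {
    ptl_handle_ni_t ni;
    ptl_handle_ct_t ct;
    ptl_handle_md_t mdh;
    ptl_md_t        md;
    ptl_ct_event_t  ev;

    PtlInit();
    PtlNIInit(PTL_IFACE_DEFAULT, PTL_NI_NO_MATCHING | PTL_NI_PHYSICAL,
              PTL_PID_ANY, NULL, NULL, &ni);

    PtlCTAlloc(ni, &ct);                /* counting event: cheap completion */
    md.start     = buf;
    md.length    = len;
    md.options   = PTL_MD_EVENT_CT_ACK; /* count acks instead of full events */
    md.eq_handle = PTL_EQ_NONE;         /* no event queue needed here */
    md.ct_handle = ct;
    PtlMDBind(ni, &md, &mdh);

    PtlPut(mdh, 0, len, PTL_ACK_REQ, target, /* pt_index */ 0,
           /* match bits (ignored, non-matching NI) */ 0,
           /* remote offset */ 0, /* user_ptr */ NULL, /* hdr_data */ 0);

    PtlCTWait(ct, 1, &ev);              /* block until the put is acked */
}

Counting events are a big part of why the complexity "gets out of the way": completion is a counter bump rather than a full event-queue entry.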
[Chart: 6-peer shared-memory message rate, messages/sec (0 to 8 million) vs. message size in bytes (8 to 128), for MPICH2, MPICH2 ThreadSafe, Portals4 ME/EQ (MPI-like), and Portals4 LE/CT (UPC-like).]
Message rates for small messages match MPICH2 performance under MPI-like conditions, and can even beat it for UPC-like conditions.
Qthreads Lightweight Threading
• Simple task-based runtime
 – Tool for programming model research
 – Supports both OpenMP-like models and more complex Chapel-like models
 – Presents simplified model of system to the application
 – High-performance scheduler
• Current Qthreads R&D
 – Task team and eureka support
 – Efficient, flexible collective operations
 – Remote task launch
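A minimal spawn-and-sync sketch, patterned on the Qthreads examples (verify exact signatures against the headers in your checkout): each task's return value fills a full/empty-typed word, and qthread_readFF blocks the parent until that word is full. This is the same FEB mechanism listed on the SPR slide above.

/* Minimal Qthreads spawn/sync sketch, patterned on the library's examples;
 * check <qthread/qthread.h> in your version for exact signatures. */
#include <qthread/qthread.h>
#include <stdint.h>
#include <stdio.h>

static aligned_t greet(void *arg) {
    printf("hello from task %lu\n", (unsigned long)(uintptr_t)arg);
    return 0; /* return value fills the FEB word passed to qthread_fork */
}

int main(void) {
    aligned_t rets[4];
    qthread_initialize();
    for (uintptr_t i = 0; i < 4; i++)
        qthread_fork(greet, (void *)i, &rets[i]); /* fork empties rets[i] */
    for (int i = 0; i < 4; i++)
        qthread_readFF(NULL, &rets[i]); /* block until rets[i] is full */
    return 0;
}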
[Diagram: two shepherds (Shep 0, Shep 1), each with workers and a work queue; a thief worker steals from a benefactor along the task migration path, with tasks flowing in and out.]
The high-performance "sherwood" work-stealing scheduler effectively balances cache efficiency with load balancing.
!"
#!"
$!"
%!"
&!"
'!!"
'#!"
'$!"
'%!"
!('"
'"
'!"
'!!"
!" #$" $&" )#" *%" '#!" '$$" '%&" '*#"
+,--./
,"
01-2/3
45"678
-"9:-2:;"
!"#$%&
'()*&+,-$$.%&'$./012#3&45678&9:;;<&
<==" >?@A-B.:" <=="+,--./," >?@A-B.:"+,--./,"
[Chart: Unbalanced Tree Search benchmark, execution time (seconds, log scale 0.1 to 100) vs. cores (1 to 32), for Qthreads, Intel TBB, Intel OpenMP, GCC OpenMP, and Cilk.]
More scalable, and more performant OpenMP runtime than GCC!
Competitive load-balancing scheduler (flexibility is the overhead).
Runtime Architecture / Experimental Platform
[Stack diagram: programming models and libraries (SHMEM, MPI, UPC, GA, Chapel, MAESTRO, OpenMP, Intel TBB/Cilk++, Habañero-C, among others) layered atop the Scalable Parallel Runtime (SPR), which runs on Portals4 and the Kitten OS.]
Runtime Architecture / Experimental Platform
[Stack diagram: the same programming models and libraries, now layered atop a Scalable Programming Interface (SPI) sitting between them and the Scalable Parallel Runtime (SPR), which runs on Portals4 and the Kitten OS.]
The Lime in the Coconut
• Research slowed by lack of applications
 – Apps need programming environment vision
 – ... and an API, if possible
• Experiment-driven SPI (Scalable Programming Interface) design points:
 – Environmental description (local vs. global topology)
 – Naming needs (GIDs vs. handles vs. ?)
 – How much detail is necessary from the application to specify performant data/work movement?
 – How much detail from the runtime is necessary to enable specification of performant data/work movement?
 – What synchronization semantics are needed and/or useful? (Futures vs. mutexes vs. FEBs vs. ?)
• Use both experimental results and application programming effort to guide API development
Current Status
• Download today!
 – Kitten: http://code.google.com/p/kitten/
 – Portals4: http://code.google.com/p/portals4/
 – Qthreads: http://code.google.com/p/qthreads/
• Stacked components work
 – Portals4 on Kitten (with InfiniBand)
 – Qthreads on Kitten
 – Qthreads on Portals4
• Multinode threading environment
 – Remote spawn/sync
 – Multinode UTS, without work-stealing
Thank You!