TRANSCRIPT
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
X-Caliber & XGC System Software Research & Development
Kyle Wheeler, SOS16
Thursday, March 15, 2012
Context
• X-Caliber: our DARPA UHPC effort
• XGC: Extreme-Scale Computing Grand Challenge LDRD
 – (Laboratory Directed Research and Development)
• Consistent challenge: figure out what exascale system software looks like
 – Collaborate with the level above and the level below
 – Leverage technology trends
 – Rethink the application space (what will be important in a decade?)
 – Be metric-focused!
 – Picojoules, picojoules, picojoules ... and time too!
X-Caliber Software Team
• Sandia
 – Brian Barrett, Kyle Wheeler, Dylan Stark
• Indiana University
 – Thomas Sterling
• Louisiana State University
 – Hartmut Kaiser, Chirag Dekate
• University of Illinois
 – William Gropp, Marc Snir
• USC/ISI
 – Pedro Diniz
XGC Software Team
• Sandia
 – Brian Barrett, Ron Brightwell, Kevin Pedretti, Dylan Stark, and Kyle Wheeler
• University of Illinois
 – Vikram Adve, Bill Gropp, Marc Snir
• RENCI
 – Allan Porterfield
XGC Thrust Areas
(1) Baseline - what happens if we do nothing?
(2) Microsystems - Key Data Movement Enabling Technologies
(3) Architecture - Coping with Concurrency and Data Movement
[Figure 3.2: EMP High-Level Architecture. Multiple vaults, each comprising banks across stacked DRAM layers 1 through N plus a logic layer containing a memory controller (MC), VAU, and DAU; the vaults connect through an on-chip network (topology, type, etc. TBD) to a memory network interface (SerDes).]
cores in the cluster and has several options in how the cores are used. First, the cores can be used to operate on completely independent data. Second, the cores can be ganged together to operate in lockstep, allowing multiple identical operations to proceed with much lower synchronization overhead. Third, tasks can be allocated in a producer/consumer model using hardware mailboxes to stream data through the cores. This flexibility allows the processing to adapt to the requirements of different applications. The functionality of the hardware thread manager is a key research component of X-Caliber.
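To make the third mode concrete, here is a minimal sketch of a two-stage pipeline in C. The mailbox operations are hypothetical placeholders (the text defines no mailbox API); the point is the structure: the hardware mailboxes provide blocking send/receive, so the pipeline needs no explicit locks or condition variables.

/* Sketch only: mbox_send()/mbox_recv() are hypothetical intrinsics standing
 * in for whatever ISA support the thread manager would expose. */
#include <stddef.h>
#include <stdint.h>

extern void     mbox_send(int mbox, uint64_t value); /* blocks when full  */
extern uint64_t mbox_recv(int mbox);                 /* blocks when empty */

#define DONE UINT64_MAX /* assumed in-band end-of-stream sentinel */

/* Producer core: filters its input and streams survivors to the consumer. */
void producer(const uint64_t *in, size_t n, int consumer_mbox) {
    for (size_t i = 0; i < n; i++)
        if (in[i] != 0)
            mbox_send(consumer_mbox, in[i]);
    mbox_send(consumer_mbox, DONE);
}

/* Consumer core: reduces whatever arrives in its mailbox. */
uint64_t consumer(int my_mbox) {
    uint64_t sum = 0, v;
    while ((v = mbox_recv(my_mbox)) != DONE)
        sum += v;
    return sum;
}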
[Figure 8: Notional core cluster for the compute-intensive processor. Two cores, each with an L1 cache, a 4-wide FP vector unit, a multithreaded register file (threads 0 through n), and mailboxes, sharing local memory (cache and/or scratchpad) under a thread manager connected to the NoC.]
It is anticipated that the performance of the CIP will need to be artificially limited in order to meet the thermal and power requirements for the module. Because of this, the normal operating frequency will be 1.5 GHz, but the maximum frequency will be closer to 2.5 GHz. X-Caliber will take advantage of this by using aggressive thermal monitoring and management to enable two forms of sprint modes. The first form, which can be sustained indefinitely, is to pair cores and allow one of those cores to run at 2.5 GHz as long as the other is turned off. This maintains a constant peak power output, but may result in thermal hotspots. To alleviate this, we will investigate ways to ping-pong compute between the paired cores. In this way, each core would operate for a time before moving thread state to the paired core, which would then continue computation. This mode would allow applications which cannot take advantage of the available parallelism to achieve higher performance. This mode could be specified by the compiler, and/or automatically enabled when only a small number of threads are present.
The second form of sprinting would allow any core to accelerate to 2.5 GHz for very short periods of time. This would increase instantaneous power draw and would rely on thermal inertia to keep from overheating. The thermal state would be closely monitored and sprint mode would be turned off as thermal limits were met. This sprint state is useful for moving through Amdahl regions of code, which could be marked by the compiler. The extreme of this mode is to sprint whenever allowed by the thermal state.
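As a sketch of how the second sprint form might look as a policy loop: the sensor and frequency-control calls below are hypothetical placeholders (the real mechanism would live in hardware or firmware), and the thermal threshold is an assumption, not a figure from the proposal. Only the 1.5 GHz and 2.5 GHz frequencies come from the text.

/* Sketch of the second sprint form as a software policy loop. read_temp_mC()
 * and set_core_freq_mhz() are hypothetical placeholders; T_LIMIT_MC is an
 * assumed threshold. */
extern int  read_temp_mC(int core);               /* placeholder sensor read */
extern void set_core_freq_mhz(int core, int mhz); /* placeholder DVFS knob   */

enum { F_NOMINAL = 1500, F_SPRINT = 2500 }; /* MHz, from the text           */
enum { T_LIMIT_MC = 85000 };                /* 85 C, assumed thermal limit  */

/* Entering a compiler-marked Amdahl (serial) region: sprint if the thermal
 * state allows, relying on thermal inertia for the short burst. */
void enter_amdahl_region(int core) {
    if (read_temp_mC(core) < T_LIMIT_MC)
        set_core_freq_mhz(core, F_SPRINT);
}

/* Periodic thermal monitor: drop back to nominal as the limit is met. */
void thermal_monitor_tick(int core) {
    if (read_temp_mC(core) >= T_LIMIT_MC)
        set_core_freq_mhz(core, F_NOMINAL);
}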
2.4.1.3 Network The network is built from a single integrated component referred to as Merlin. For energy efficiency, Merlin consists of an integrated network interface controller (NIC) and a 21-port router. The Merlin component benefits from the advances in 3D stacking and silicon photonics, and consists of one or more logic layers coupled to a photonics carrier using 3D integration. The relatively small number of ports in the router allows a more energy-efficient implementation compared to a separate, larger router, by reducing the on-chip interconnect lengths (and thus power). Each module contains two Merlin components, which serves to increase interconnect bandwidth.
(4) System Software - enabling a new model of computation
(5) Application Drivers
Safety and Security
Reentry, Circuitry, Graph, Stream
Hardware Challenges
• Exponential increase in node-level parallelism
• Lower memory capacity per core
 – Weak scaling will be insufficient
• Significantly lower network-to-memory bandwidth ratios
• Need for system software to have finer control of hardware resources
Fig. 1.2 shows the top-level diagram of the ELVIS core.
[Figure 1.2: ELVIS Core Structure. An ELVIS core supports 16 threads across units including the Trap Unit (TU), Floating Point Unit (FPU), Integer Execution Unit (IEU), Inbound Thread Unit (ITU), Runnable Thread Unit (RTU), Deferred Thread Unit (DTU), Data Cache Unit (DCU), Instruction Fetch Unit (IFU), Pipe Unit (PU), Register Files 0 and 1, Thread Assignment Unit, Incoming Transaction Unit, and I/O Unit, with inputs from the Service Processor (SP) and DFRU and outputs to the CCX. Bus 1: FORK/SPAWN request from I/O. Bus 2: thread is assigned to an available thread slot. Bus 3: thread is deferred and placed in the DTUQ. Bus 4: a FORK/SPAWN command enqueuing a thread assignment request in the RTUQ.]
[Chart: Memory Capacity Trend, bytes per flop (peak) for systems from 1996 through 2012 on a log scale from 0.01 to 10.00; systems are getting less memory rich. Source slide: DOE Arch Workshop, Aug. 2, 2011.]
Application Challenges
[Plot: Benchmark Suite Mean Temporal vs. Spatial Locality, comparing traditional (FP) Sandia applications (what we traditionally care about), emerging (integer) Sandia applications, and informatics applications (what industry cares about) against SPEC FP, SPEC Int, RandomAccess, LINPACK, and STREAM. From: Murphy and Kogge, "On The Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications," IEEE Trans. on Computers, July 2007.]
• Huge variety in programming models and runtimes emerging
 – Evolutionary BSP-originated applications
 – Revolutionary programming models
  • Everyone's got one, and they're all the best
  • 101+ actively developed parallel programming languages
 – Lots of new application varieties
 – Flexibility key
• Multiple optimization points
 – Time to solution
 – Energy to solution
 – Money to solution
 – Total system efficiency
Foundational Knowledge
• Distributed systems scaling determined by:
 – Ability to move data
 – Synchronization
• Lightweight system software WORKS
 – ASCI Red, ASC Red Storm, BG/{L,P,Q}
 – Low perturbation of applications
• Synchronization costs
 – Local and remote
 – Explicit and implicit
Research Questions
• How will threads evolve to be more lightweight and match hardware semantics?
 – What will hardware threading semantics be?
• What synchronization primitives are necessary for highly asynchronous applications? (see the sketch after this list)
 – Free, Fast, Infinite
• What memory consistency models are necessary?
 – ... or even useful?
• What communication primitives are necessary for evolving applications?
 – Probably not six-function MPI
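One way to ground the synchronization question: the sketch below (mine, not from the talk) contrasts an explicit mutex-protected update with a C11 atomic increment. "Free, Fast, Infinite" asks which end of this spectrum hardware can push primitives toward.

/* Sketch (not from the talk): the same logical update expressed with an
 * explicit lock vs. a single atomic read-modify-write.
 * Compile with -std=c11 -pthread. */
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter_locked = 0;
static atomic_long counter_atomic;

/* Explicit and relatively heavyweight: lock-line traffic, and possible
 * kernel involvement under contention. */
void incr_with_mutex(void) {
    pthread_mutex_lock(&lock);
    counter_locked++;
    pthread_mutex_unlock(&lock);
}

/* Implicit and lightweight: one hardware RMW instruction, closer to the
 * "free, fast, infinite" ideal for highly asynchronous applications. */
void incr_with_atomic(void) {
    atomic_fetch_add(&counter_atomic, 1);
}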
Necessity is the Mother of Invention
• Need insight into:
 – Trade-offs between different data/work movement strategies
 – Cost of synchronization/protection mechanisms with real applications
 – How much automaticity/adaptivity is necessary in large-scale applications?
• Research is slowed by lack of an experimental platform
• Use both clusters and simulation as foundational experimental platforms!
• Combine Kitten, Portals, and Qthreads to build a multi-node multi-threaded runtime for experimentation (SPR)
Scalable Parallel Runtime (SPR)
• Qthreads: Lightweight threading interface
 – Scalable, lightweight scheduling on NUMA platforms
 – Supports a variety of synchronization mechanisms, including Full/Empty bits and atomic operations
 – Potential for direct hardware mapping
• Portals 4: Lightweight networking API
 – Semantics for supporting both one-sided and tagged message passing
 – Small set of primitives, allows offload from main CPU
 – Supports direct hardware mapping
• Kitten: Lightweight OS kernel
 – Builds on lessons from ASCI Red, Cplant, Red Storm
 – Utilizes scalable parts of Linux environment
 – Primarily supports direct hardware mapping
Kitten Lightweight Kernel
• Simple compute node OS
 – Tool for OS+runtime research
 – Looks like Linux to applications and tools
• Current R&D
 – Job launch via OpenMPI ORTE / mpirun
 – Support for Intel MIC, Arthur cluster at Sandia
 – System-call forwarding
 – Low-overhead task migration
Round-Trip Task Migration Time
(task on core A migrates to core B, then back to A)

Operating System | Round-Trip Time
Linux 2.6.35.7   | 4435 ns
Kitten 1.3       | 2630 ns

Core-switching performance between two cores in the same Intel X5570 2.93 GHz processor. Kitten achieves a speedup of 1.7 compared to Linux, due to its simpler implementation.
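The published numbers come from the authors' own harness; as a hedged illustration of how such a round-trip migration measurement could be structured on Linux using only standard interfaces:

/* Illustrative only: one way to time task migration round trips on Linux
 * with sched_setaffinity(); this is not the harness behind the numbers
 * above. Build with: cc -O2 -o migrate migrate.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

static void pin_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof set, &set); /* move this task to 'cpu' */
}

int main(void) {
    enum { ITERS = 10000 };
    struct timespec t0, t1;
    pin_to(0); /* start on core A */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        pin_to(1); /* core A -> core B */
        pin_to(0); /* core B -> core A */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip: %.0f ns\n", ns / ITERS);
    return 0;
}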
Kitten LWK supports running native applications alongside guest OSes. Weak-scaling performance of the Catamount guest OS is within 5% of the native Catamount OS at 4096 nodes.
[Chart: execution time (seconds, 0 to 400) vs. number of nodes (64 to 4096) for native Catamount and for Catamount guests with nested and shadow paging.]
Portals4 Lightweight Comm.
• Simple low-level communication layer
 – Tool for communication+runtime research
 – Thread-safe by design
 – Supports legacy and next-gen applications and tools
 – Common substrate to allow efficient use and sharing of resources among higher-level protocols
• Current R&D
 – Shared InfiniBand and SMP multi-core progress engines
 – Efficient blocking/waiting mechanisms
Much of the potential complexity gets out of the way for basic operations.
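As an illustration of that claim, here is a minimal initiator-side one-sided put in the LE/CT ("UPC-like") style, written against the Portals 4 specification as I read it. Error handling is elided and the target is assumed to have already posted a list entry at portal table index 0 covering the destination buffer; treat this as a sketch to check against portals4.h, not verified code.

/* Sketch of an initiator-side one-sided put over Portals 4 (LE/CT style).
 * Error checks elided; assumes the target posted an LE at portal table
 * index 0 covering the destination. Verify signatures against portals4.h. */
#include <portals4.h>

void put_once(ptl_process_t target, void *buf, ptl_size_t len) {
    ptl_handle_ni_t ni;
    ptl_handle_ct_t ct;
    ptl_handle_md_t mdh;
    ptl_md_t        md;
    ptl_ct_event_t  ev;

    PtlInit();
    PtlNIInit(PTL_IFACE_DEFAULT, PTL_NI_NO_MATCHING | PTL_NI_PHYSICAL,
              PTL_PID_ANY, NULL, NULL, &ni);

    PtlCTAlloc(ni, &ct);                /* counting event: cheap completion */
    md.start     = buf;
    md.length    = len;
    md.options   = PTL_MD_EVENT_CT_ACK; /* count acks instead of full events */
    md.eq_handle = PTL_EQ_NONE;         /* no event queue needed here */
    md.ct_handle = ct;
    PtlMDBind(ni, &md, &mdh);

    PtlPut(mdh, 0, len, PTL_ACK_REQ, target, /* pt_index */ 0,
           /* match bits (ignored, non-matching NI) */ 0,
           /* remote offset */ 0, /* user_ptr */ NULL, /* hdr_data */ 0);

    PtlCTWait(ct, 1, &ev);              /* block until the put is acked */
}

Counting events are a big part of why the complexity "gets out of the way": completion is a counter bump rather than a full event-queue entry.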
[Chart: 6-peer shared-memory message rate, messages/sec (0 to 8 million) vs. message size in bytes (8 to 128), for MPICH2, MPICH2 ThreadSafe, Portals4 ME/EQ (MPI-like), and Portals4 LE/CT (UPC-like).]
Message rates for small messages match MPICH2 performance under MPI-like conditions, and can even beat it for UPC-like conditions.
Qthreads Lightweight Threading
• Simple task-based runtime
 – Tool for programming model research
 – Supports both OpenMP-like models and more complex Chapel-like models
 – Presents simplified model of system to the application
 – High-performance scheduler
• Current Qthreads R&D
 – Task team and eureka support
 – Efficient, flexible collective operations
 – Remote task launch
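A minimal spawn-and-sync sketch, patterned on the Qthreads examples (verify exact signatures against the headers in your checkout): each task's return value fills a full/empty-typed word, and qthread_readFF blocks the parent until that word is full. This is the same FEB mechanism listed on the SPR slide above.

/* Minimal Qthreads spawn/sync sketch, patterned on the library's examples;
 * check <qthread/qthread.h> in your version for exact signatures. */
#include <qthread/qthread.h>
#include <stdint.h>
#include <stdio.h>

static aligned_t greet(void *arg) {
    printf("hello from task %lu\n", (unsigned long)(uintptr_t)arg);
    return 0; /* return value fills the FEB word passed to qthread_fork */
}

int main(void) {
    aligned_t rets[4];
    qthread_initialize();
    for (uintptr_t i = 0; i < 4; i++)
        qthread_fork(greet, (void *)i, &rets[i]); /* fork empties rets[i] */
    for (int i = 0; i < 4; i++)
        qthread_readFF(NULL, &rets[i]); /* block until rets[i] is full */
    return 0;
}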
[Diagram: two shepherds (Shep 0, Shep 1), each with workers and a work queue; a thief worker steals from a benefactor along the task migration path, with tasks flowing in and out.]
The high-performance "sherwood" work-stealing scheduler effectively balances cache efficiency with load balancing.
!"
#!"
$!"
%!"
&!"
'!!"
'#!"
'$!"
'%!"
!('"
'"
'!"
'!!"
!" #$" $&" )#" *%" '#!" '$$" '%&" '*#"
+,--./
,"
01-2/3
45"678
-"9:-2:;"
!"#$%&
'()*&+,-$$.%&'$./012#3&45678&9:;;<&
<==" >?@A-B.:" <=="+,--./," >?@A-B.:"+,--./,"
[Chart: Unbalanced Tree Search benchmark, execution time (seconds, log scale 0.1 to 100) vs. cores (1 to 32), for Qthreads, Intel TBB, Intel OpenMP, GCC OpenMP, and Cilk.]
More scalable, and more performant OpenMP runtime than GCC!
Competitive load-balancing scheduler (flexibility is the overhead).
Runtime Architecture / Experimental Platform
[Stack diagram: programming models and libraries (SHMEM, MPI, UPC, GA, Chapel, MAESTRO, OpenMP, Intel TBB/Cilk++, Habañero-C, among others) layered atop the Scalable Parallel Runtime (SPR), which runs on Portals4 and the Kitten OS.]
Runtime Architecture / Experimental Platform
[Stack diagram: the same programming models and libraries, now layered atop a Scalable Programming Interface (SPI) sitting between them and the Scalable Parallel Runtime (SPR), which runs on Portals4 and the Kitten OS.]
The Lime in the Coconut
• Research slowed by lack of applications
 – Apps need programming environment vision
 – ... and an API, if possible
• Experiment-driven SPI (Scalable Programming Interface) design points:
 – Environmental description (local vs. global topology)
 – Naming needs (GIDs vs. handles vs. ?)
 – How much detail is necessary from the application to specify performant data/work movement?
 – How much detail from the runtime is necessary to enable specification of performant data/work movement?
 – What synchronization semantics are needed and/or useful? (Futures vs. mutexes vs. FEBs vs. ?)
• Use both experimental results and application programming effort to guide API development
Current Status
• Download today!
 – Kitten: http://code.google.com/p/kitten/
 – Portals4: http://code.google.com/p/portals4/
 – Qthreads: http://code.google.com/p/qthreads/
• Stacked components work
 – Portals4 on Kitten (with InfiniBand)
 – Qthreads on Kitten
 – Qthreads on Portals4
• Multinode threading environment
 – Remote spawn/sync
 – Multinode UTS, without work-stealing
Thank You!