

The High Performance Open Community Runtime: Explorations on Asynchronous Many Task Runtime Systems

Joshua Landwehr, Joshua Suetterlein, Andres Marquez, Joseph Manzano, Kevin J. Barker and Guang R. Gao

Testbeds

- Constance: 300 nodes (24-core Haswell), FDR InfiniBand.
- Cori: 1630 nodes (32-core Haswell), Cray Aries interconnect.
- Edison: 5576 nodes (24-core Ivy Bridge), Cray Aries interconnect.
- Apps: the Cholesky, Smith-Waterman (SW) and Ray Tracing (RT) kernels and the XSBench (XS) mini-app.

Testcases

- Cholesky & SW plateau due to starvation; RT & XS achieve almost linear speedup.
- Max relative speedups: Cholesky 156x, SW 52x, RT 323x and XS 348x.

Overview

- Exploit variance through the software stack by leveraging novel runtimes

Agility: The power of moving quickly and easily; nimbleness.

Efficient use of available resources

Adapting to changing conditions

- Asynchronous Many Task Runtime system (AMT RTS); a minimal sketch of the idea follows this list

Work decomposed into smaller units

More powerful scheduler and data layouts

More available parallelism

Higher runtime overhead and cost.

Examples: High Performance ParalleX (HPX), Legion, OCR
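
As a concrete illustration of the bullets above, here is a minimal, generic C sketch of the core AMT idea (not tied to HPX, Legion or OCR; all names are hypothetical): work is split into small tasks, each carrying a dependence count, and a task becomes runnable only once all of its inputs have been satisfied.

```c
#include <stdio.h>

#define MAX_SUCC 4

typedef struct task {
    const char  *name;
    int          deps_remaining;   /* unsatisfied incoming dependences      */
    struct task *succ[MAX_SUCC];   /* tasks that consume this task's output */
    int          num_succ;
} task_t;

/* Run a task, then satisfy its successors; any successor whose dependence
 * count reaches zero runs in turn.  A real runtime would push ready tasks
 * onto per-worker queues instead of recursing.                             */
static void run(task_t *t) {
    printf("running %s\n", t->name);
    for (int i = 0; i < t->num_succ; i++)
        if (--t->succ[i]->deps_remaining == 0)
            run(t->succ[i]);
}

int main(void) {
    /* A tiny diamond DAG: a -> {b, c} -> d */
    task_t d = { .name = "d", .deps_remaining = 2 };
    task_t b = { .name = "b", .deps_remaining = 1, .succ = { &d }, .num_succ = 1 };
    task_t c = { .name = "c", .deps_remaining = 1, .succ = { &d }, .num_succ = 1 };
    task_t a = { .name = "a", .deps_remaining = 0, .succ = { &b, &c }, .num_succ = 2 };
    run(&a);   /* a has no dependences, so it starts the whole graph */
    return 0;
}
```

A real runtime replaces the recursive call with per-worker ready queues and adds data placement decisions, which is where both the extra scheduling power and the extra runtime overhead come from.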

Performance Open Community Runtime

- High Performance version of OCR. Computation is expressed as DAGs using abstractions for computation (Event Driven Tasks), data (Data Blocks) and synchronization (Events), all globally addressable.

- Supports version 0.99a of the OCR standard.

- Unique features: advanced locality hints; resource management at scale; advanced memory control, both intra- and inter-node; introspective capabilities; network backends (RSOCKETS, MPI & TCP/IP).

Key Feature: A System Wide OoO Engine

- Requests can arrive at their destination in any order

- The Global Unique Identifier Table (GUID Table)

Address resolution for OCR primitives across the system

Statically decide “owner” nodes for Global ID ranges

Keep track of users of objects and the most up-to-date copy

One such table per node in the system.

- If an object has not been created yet, the runtime creates a placeholder for it in the GUID table and queues any outstanding requests (see the sketch below)

- Once the object is created, requests are serviced in the order in which they were received

- Permits greater overlap of computation and communication

- Building block for higher-level abstractions such as futures
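
A minimal C sketch of the mechanism described above (illustrative only, not the P-OCR implementation; the types, sizes and function names are hypothetical): GUID ranges are statically mapped to owner nodes, requests that arrive before their target object exists are parked on a placeholder entry, and object creation drains them in arrival order.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NUM_NODES       4
#define GUIDS_PER_NODE  1024   /* each node owns a static range of GUIDs   */
#define MAX_PENDING     16     /* bounds/locking omitted in this sketch    */

typedef struct request { int op; int src_node; } request_t;

typedef struct guid_entry {
    uint64_t  guid;
    void     *object;                 /* NULL => placeholder, not created yet */
    request_t pending[MAX_PENDING];   /* requests queued in arrival order     */
    int       num_pending;
} guid_entry_t;

/* Static ownership: the GUID range decides which node resolves it. */
static int owner_of(uint64_t guid) { return (int)((guid / GUIDS_PER_NODE) % NUM_NODES); }

static void service(guid_entry_t *e, request_t r) {
    printf("GUID %llu: servicing op %d from node %d\n",
           (unsigned long long)e->guid, r.op, r.src_node);
}

/* A request may arrive before the object exists: park it on the placeholder. */
static void handle_request(guid_entry_t *e, request_t r) {
    if (e->object == NULL)
        e->pending[e->num_pending++] = r;   /* defer, preserving arrival order */
    else
        service(e, r);
}

/* Object creation drains the deferred requests in the order they arrived. */
static void handle_create(guid_entry_t *e, void *obj) {
    e->object = obj;
    for (int i = 0; i < e->num_pending; i++) service(e, e->pending[i]);
    e->num_pending = 0;
}

int main(void) {
    guid_entry_t e = { .guid = 2048, .object = NULL, .num_pending = 0 };
    printf("GUID %llu is owned by node %d\n",
           (unsigned long long)e.guid, owner_of(e.guid));
    handle_request(&e, (request_t){ .op = 1, .src_node = 3 });  /* arrives early */
    handle_request(&e, (request_t){ .op = 2, .src_node = 1 });  /* arrives early */
    handle_create(&e, malloc(64));  /* creation releases the queued requests    */
    return 0;
}
```

Because the requester never has to wait for the object to exist, communication overlaps with computation, and the same defer-until-ready pattern is what higher-level abstractions such as futures can be built on.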

Key Feature: Consistency at Scale

- OCR Default Memory Model: entry consistency

Fine-grained locking of data structures

Acquire and release semantics

- Cache DAG Consistency (CDAG)

Invalidate the block at the moment of acquisition & delay all signaling (i.e., defer side effects) until the invalidation is confirmed.

The ordering of write requests is enforced by the DAG dependencies (i.e., a happens-before relationship).

Competing write requests to a given Data Block are undefined:

• “There exists a scenario in which the lifetimes of at least two EDTs that acquire a data block in write mode may overlap.”

Advantages: reduction in coherency operations & overlap of protocol actions with computation (a minimal sketch follows).
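
A minimal C sketch of the CDAG idea (conceptual only; this is not the runtime's protocol code and the names are hypothetical): a write acquire sends invalidations but lets the EDT start computing immediately, while the EDT's outgoing signals are deferred until every invalidation is acknowledged.

```c
#include <stdio.h>

#define MAX_SIGNALS 8

typedef struct datablock {
    int remote_copies;          /* copies cached on other nodes             */
    int acks_outstanding;       /* invalidation acks still in flight        */
    int deferred[MAX_SIGNALS];  /* successor slots to satisfy later         */
    int num_deferred;
} datablock_t;

/* Write acquire: send invalidations, but let the EDT run right away. */
static void acquire_write(datablock_t *db) {
    db->acks_outstanding = db->remote_copies;   /* one ack per remote copy  */
    db->remote_copies = 0;
    printf("acquired for write; %d invalidations in flight\n", db->acks_outstanding);
}

/* Side effects (signals to successor EDTs) are deferred, not dropped. */
static void signal_successor(datablock_t *db, int slot) {
    if (db->acks_outstanding > 0)
        db->deferred[db->num_deferred++] = slot;  /* defer until invalidation done */
    else
        printf("satisfy successor slot %d\n", slot);
}

/* The last ack releases the deferred signals; ordering of the writes
 * themselves is already enforced by the happens-before edges of the DAG. */
static void invalidation_ack(datablock_t *db) {
    if (--db->acks_outstanding == 0) {
        for (int i = 0; i < db->num_deferred; i++)
            printf("satisfy successor slot %d (deferred)\n", db->deferred[i]);
        db->num_deferred = 0;
    }
}

int main(void) {
    datablock_t db = { .remote_copies = 2 };
    acquire_write(&db);
    signal_successor(&db, 0);  /* EDT finished; its signal is deferred      */
    invalidation_ack(&db);
    invalidation_ack(&db);     /* last ack releases the deferred signal     */
    return 0;
}
```

Note that write-after-write ordering never relies on the protocol itself: it comes from the happens-before edges of the task DAG, which is why competing writers to the same Data Block are left undefined.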

Figures: Example 1 OCR Computational Graph; OCR Model Network Traffic Ops for Example 1; CDAG Network Traffic Ops for Example 1.

Figure: Strong Scaling Results in Constance.

Key Feature: Introspective Framework

- Low-overhead data acquisition framework

- Runtime events are instrumented for frequency and timing

- Threshold-based, event-driven framework (see the sketch after this list)

Local windowed events

Global waterfall collection after a threshold is reached

- Around 7% overhead

- Visualization component: heatmaps for each runtime component and attribution

- Facilitates efforts in introspection, reflection & modeling
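
A minimal C sketch of the threshold-driven, windowed counting described above (hypothetical; this is not the P-OCR instrumentation API): runtime events are counted cheaply in a local window, and crossing a threshold flushes the window toward the global collection.

```c
#include <stdio.h>
#include <time.h>

typedef enum { EV_EDT_EXEC, EV_EDT_SIGNAL, EV_DB_ALLOC, EV_COUNT } event_id_t;

#define FLUSH_THRESHOLD 100   /* events per local window before a flush */

typedef struct window {
    unsigned long counts[EV_COUNT];   /* frequency per runtime event      */
    double        start_time;         /* time attribution for the window  */
    unsigned long total;
} window_t;

static double now_sec(void) { return (double)clock() / CLOCKS_PER_SEC; }

/* Stand-in for shipping a window to the global collection point. */
static void flush_window(window_t *w) {
    printf("window @ %.3fs: exec=%lu signal=%lu db_alloc=%lu\n",
           w->start_time, w->counts[EV_EDT_EXEC],
           w->counts[EV_EDT_SIGNAL], w->counts[EV_DB_ALLOC]);
    for (int i = 0; i < EV_COUNT; i++) w->counts[i] = 0;
    w->total = 0;
    w->start_time = now_sec();
}

/* Instrumentation hook: a cheap counter bump; flush only at the threshold. */
static void record_event(window_t *w, event_id_t ev) {
    w->counts[ev]++;
    if (++w->total >= FLUSH_THRESHOLD) flush_window(w);
}

int main(void) {
    window_t w = { .start_time = 0.0 };
    w.start_time = now_sec();
    for (int i = 0; i < 350; i++)          /* simulate a stream of runtime events */
        record_event(&w, (event_id_t)(i % EV_COUNT));
    flush_window(&w);                       /* drain the partial final window     */
    return 0;
}
```

Keeping the common path to a counter increment is how such a framework keeps instrumentation cost low (around 7% overhead in the poster's measurements).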

The Cholesky Kernel Study

Figure: Strong and Weak Scaling Results in NERSC Edison and Cori.

- Plateau due to starvation; degradation due to application fan-in and fan-out (see the sketch below).

- Max Cholesky speedup on Cori: 133x.
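
For context on the fan-in/fan-out remark, here is a sketch of the standard right-looking tiled Cholesky loop nest whose calls form the task DAG (illustrative only; the kernels are stubbed out and this is not the poster's actual EDT decomposition).

```c
#include <stdio.h>

#define T 4   /* number of tile rows/columns */

/* Stub kernels: a real run would call the BLAS/LAPACK tile operations.
 * The printed tile indices show where the dependence edges come from.   */
static void POTRF(int k)               { printf("POTRF A(%d,%d)\n", k, k); }
static void TRSM (int k, int i)        { printf("TRSM  A(%d,%d) uses A(%d,%d)\n", i, k, k, k); }
static void SYRK (int k, int i)        { printf("SYRK  A(%d,%d) uses A(%d,%d)\n", i, i, i, k); }
static void GEMM (int k, int i, int j) { printf("GEMM  A(%d,%d) uses A(%d,%d), A(%d,%d)\n",
                                                i, j, i, k, j, k); }

int main(void) {
    /* Each call below becomes one task; its tile arguments define the
     * dependences.  The single POTRF of iteration k fans out to all TRSMs
     * of its column, and the trailing updates fan back in before the
     * POTRF of iteration k+1 can start.                                   */
    for (int k = 0; k < T; k++) {
        POTRF(k);                                   /* factor diagonal tile  */
        for (int i = k + 1; i < T; i++)
            TRSM(k, i);                             /* solve column panel    */
        for (int i = k + 1; i < T; i++) {
            SYRK(k, i);                             /* update diagonal tiles */
            for (int j = k + 1; j < i; j++)
                GEMM(k, i, j);                      /* update trailing tiles */
        }
    }
    return 0;
}
```

Each iteration funnels through a single POTRF and the trailing updates fan back in to it, so as the per-node tile count shrinks at scale there is not enough independent work left, which shows up as the starvation plateau reported above.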

Memory Model Impact on Scalability

CDAG achieves almost linear scalability when compared against the OCR default memory model.

Memory model effects on Cholesky, compared against the OCR default memory model and an optimized version of it.

RTS Centric Cholesky Characterization on 128 Nodes

1. EDT throughput: tracks with the computational phases.

2. EDT signal throughput: shows cleaning behavior after a computational phase, per node.

3. Data Block (DB) memory allocation: DB allocation tracks with the computational phases from (1).
