costin iancu lawrence berkeley national laboratory · •productivity = performance without pain +...

Costin Iancu

Lawrence Berkeley National Laboratory

WPSE 2009

• Unified Parallel C
  – SPMD programming model, shared memory space abstraction
  – Communication is either implicit or explicit – one-sided
  – Memory model: relaxed and strict
• Ubiquitous UPC implementation
  – Compiler based on the Open64 framework
  – Source-to-source translation
  – GASNet communication libraries
    - PUT/GET primitives
    - Vector/Index/Strided (VIS) primitives
    - Synchronization, collective operations

• Provide integration across all levels of the software stack
• Mechanisms for finer grained control over system resources
• Application level resource usage policies
• Language and compiler support

[Figure: software stack. UPC code → compiler → compiler-generated C code → UPC runtime system → GASNet communication system → network hardware]

Emphasize production quality development tools

• Productivity = performance without pain + portability
• Provide support for application adaptation (load balance, comm/comp overlap, scheduling, synchronization)
• Challenges: scale, heterogeneity, convergence of shared and distributed memory optimizations
• Broad spectrum of approaches (distributed / shared memory)
  - Fine grained communication optimizations (PACT’05)
  - Automatic non-blocking communication (PACT’05, ICS’07)
  - Performance models for loop nest optimizations (PPoPP’07, ICS’08, PACT’08)
  - Applications (IPDPS’05, SC’07, PPoPP’08, IPDPS’09)

Adoption: >7 years concerted effort, DOE support and encouragement, one big government user

• One of the highest scaling FFT (NAS) results to date (~2 Tflops)
• Communication is aggressively overlapped with computation
• UPC vs MPI – 10%-70% faster; one-sided is more effective


• Best performance of “primitive” operations
  – Select best implementation available for “primitive” operations (put/get, sync)
  – Provide efficient implementations for library “abstractions” (collectives)
• Optimizations
  – Single node performance
  – Mechanisms to efficiently map application to hardware/OS
  – Program transformations – minimize processor “idle” waiting

Runtime Adaptation

• Multi-level optimizations (distributed and shared memory)
• Compile time, static optimizations are not sufficient
• Adaptation = runtime
  – Program Description
  – Performance Models vs Autotuning
  – Parameter Estimation/Classification
    - Instantaneous vs Asymptotic
    - Guided vs Automatic
    - Offline vs Online
  – Feedback Loop
  – Static topology mapping vs dynamic

[Figure: compile time transformations and runtime mechanisms. Compile-time labels: communication oblivious transformations; communication aware analysis; message vectorization; message strip-mining; data redistribution; estimation of performance parameters; description + code templates. Runtime labels: performance database; performance models; memory manager (cache); estimate params; analyze comm requirements; estimate load; instantiate comm plan; eliminate redundant comm & reshape; code generation. Annotations: (categorical), (numerical).]

• Describe program behavior, lightweight representation (Paek - LMAD perfect nests)
  - Easily extended for symbolic analysis
  - RT-LMAD similar to SSA - irregular loops
• Decouple serial transformations from communication transformations
  - Serial transformations - cache parameters (static/conservative)
  - Communication transformations - network parameters (dynamic)
• No performance loss when decoupling optimizations
  - Coarse grained characteristics
  - Blocking for cache and network at different scales
  - Compute and communication bound are categories
  - Multithreading
  - No global communication scheduling (intrinsic computation)
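As a hedged illustration of blocking for the network, the sketch below strip-mines one bulk transfer into fixed-size strips, so that computation on one strip could overlap with the transfer of the next. The slides give no implementation: the name `strip_mine` and the strip size are assumptions, and in practice the strip size would come from network parameters estimated at run time.

```python
from typing import Iterator, Tuple


def strip_mine(n_elems: int, strip: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover n_elems in strips.

    Hypothetical helper: 'strip' stands in for a network-dependent
    parameter chosen at run time.
    """
    if strip <= 0:
        raise ValueError("strip must be positive")
    for offset in range(0, n_elems, strip):
        yield offset, min(strip, n_elems - offset)


# Example: a 10-element transfer split into strips of 4 elements.
plan = list(strip_mine(10, 4))
```

The cache blocking factor for the serial loop nest can be chosen independently of `strip`, which is the decoupling the bullets above describe.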

COMMUNICATION OPTIMIZATIONS

• Domain Decomposition and Scheduling for Code Generation
• Efficient High Level Communication Primitives (collectives, p2p)
• Application level performance determining factors:
  – Computation
  – Spatial - topology (point-to-point, one-from-many, many-from-one, many-to-many)
  – Temporal - schedule (burst, peer order)
• System level performance determining factors:
  – Multiple available implementations
  – Resource constraints (issue queue, TLB footprint)
  – Interaction with OS (mapping, scheduling)

Adaptation: offline search, easy to evaluate heuristics, lightweight analysis

[Figure: overhead or inverse bandwidth as a function of load. Annotated regions: models (asymptotic); optimizations (instantaneous); flow control, fairness.]

Throttling load is desirable for performance (> 2X)
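One way to picture throttling is a bounded window of outstanding non-blocking operations. The sketch below is a simulation only: `issue_throttled`, `max_in_flight`, and the integer operation IDs are assumptions for illustration, and no real communication library is called.

```python
from collections import deque
from typing import List


def issue_throttled(num_ops: int, max_in_flight: int) -> List[int]:
    """Simulate issuing num_ops operations with a bounded window.

    Returns the number of operations in flight right after each issue.
    Completion is modeled by retiring the oldest outstanding operation.
    """
    if max_in_flight <= 0:
        raise ValueError("max_in_flight must be positive")
    in_flight = deque()
    depth_after_issue = []
    for op in range(num_ops):
        if len(in_flight) == max_in_flight:
            in_flight.popleft()   # wait for the oldest operation
        in_flight.append(op)      # issue the next operation
        depth_after_issue.append(len(in_flight))
    return depth_after_issue


depths = issue_throttled(num_ops=6, max_in_flight=3)
```

A suitable window size is system dependent and would have to be found empirically.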

• Deployed systems are under-provisioned, unfair, noisy
  Two processors saturate the network, four processors overwhelm it (Underwood et al, SC’07)
• Performance is unpredictable and unreproducible
• Simple models can’t capture variation

[Figure: InfiniBand Bandwidth Repartition for 128 Procs Across Bisection. x-axis: Size (bytes), 10 to 10,000,000; y-axis: Bandwidth (KB/s)]

Quantitative or Qualitative?

• Previous approaches measure asymptotic values, optimizations need instantaneous values
• Existing “time accurate” performance models do not account well for system scale OR wide SMP nodes
• Qualitative models: which is faster, not how fast! (PPoPP’07, ICS’08)
  Not time accurate, understand errors and model robustness, allow for imprecision/noise
• Spatiotemporal exploration of network performance:
  - Short and large time scales – account for variability and system noise
  - Small and large system scales – SMP node, full system
• Preserve Ordering
  – Sample implementation space, transformation specific
  – Be pessimistic – determine the worst case
  – Track derivatives, not absolute values
• Analytical performance models (strip-mining transformations, PPoPP’07) > 90% efficiency
• Multiprotocol implementation of vector operations (ICS’08, PACT’08)
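To make "which is faster, not how fast" concrete, the sketch below scores a model by how often it ranks pairs of code variants the same way as the measurements do, ignoring absolute error. The function name, the variant names, and all numbers are made up for illustration; this is not the evaluation method of the cited papers.

```python
from itertools import combinations
from typing import Dict


def ordering_agreement(predicted: Dict[str, float],
                       measured: Dict[str, float]) -> float:
    """Fraction of variant pairs that the model ranks like the measurement."""
    pairs = list(combinations(sorted(predicted), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (predicted[a] < predicted[b]) == (measured[a] < measured[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Made-up numbers: predictions are off by roughly 2x in absolute terms,
# yet every pairwise ordering matches the measurement.
predicted = {"v1": 5.0, "v2": 8.0, "v3": 20.0}
measured = {"v1": 11.0, "v2": 15.0, "v3": 41.0}
score = ordering_agreement(predicted, measured)
```

A model that is not time accurate can still score perfectly here, which is all an optimizer needs in order to choose between variants.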

TUNING OF VECTOR OPERATIONS

• Vector Operations – copy disjoint memory regions in one logical step (scatter/gather)
• Often used in applications: boundary data in finite difference, particle-mesh, sparse matrices, MPI Derived Data Types
• Well supported:
  • Native: Elan, InfiniBand, IBM LAPI/DCMF
  • Third party comm libraries: GASNet, ARMCI, MPI
  • “Frameworks”: UPC, Titanium, CAF, GA, LAPI
• Interfaces: strided, indexed
• Previous studies show the need for a multi-protocol approach
• Implementations:
  – Blocking – no overlap (BLOCK)
  – Pipelining – flow control and fairness are problems (PIPE)
  – Packing – flow control and attentiveness are problems (VIS)

// Blocking gets (BLOCK)
foreach(S)
    start_time()
    for(iters)
        foreach(N)
            get(S)
    end_time()

// Non-blocking gets followed by a sync (PIPE)
foreach(S)
    start_time()
    for(iters)
        foreach(N)
            get_nb(S)
        sync_all
    end_time()

// Packed vector get (VIS)
foreach(S)
    start_time()
    for(iters)
        foreach(N)
            vector_get(N,S)
    end_time()
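The loop structure above can be sketched as a small timing harness. This is a hedged stand-in: `run_benchmark` and `fake_transfer` are hypothetical names, and the transfer only copies local memory so that the example runs anywhere. A real micro-benchmark would call the communication layer instead.

```python
import time
from typing import Callable, Dict, Iterable


def run_benchmark(transfer: Callable[[int, int], None],
                  sizes: Iterable[int],
                  n_msgs: int,
                  iters: int) -> Dict[int, float]:
    """Return average seconds per iteration for each message size S.

    'transfer(n_msgs, size)' is a placeholder for one of the three
    variants (blocking gets, non-blocking gets plus sync, vector get).
    """
    results = {}
    for size in sizes:                          # foreach(S)
        start = time.perf_counter()             # start_time()
        for _ in range(iters):                  # for(iters)
            transfer(n_msgs, size)              # foreach(N) get / get_nb / vector_get
        elapsed = time.perf_counter() - start   # end_time()
        results[size] = elapsed / iters
    return results


def fake_transfer(n_msgs: int, size: int) -> None:
    """Stand-in that only copies local memory; no network is involved."""
    buf = bytearray(size)
    for _ in range(n_msgs):
        bytes(buf)


timings = run_benchmark(fake_transfer, sizes=[8, 64, 512], n_msgs=4, iters=10)
```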

• Protocols: Blocking, Non-Blocking, Packing (AM based)
• Empirical approach based on optimization space exploration
  - Transfer structure (N, S)
  - Application characteristics: active processors, communication topology, system size, instantaneous load
• For each setting – Which implementation is faster?
• Fast, lightweight decision mechanism – prune parameter space
• Strategy: best OR worst case scenario?
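A minimal sketch of such a decision mechanism, assuming the offline exploration has already produced timings per setting: record the fastest protocol for each explored setting, then do a table lookup at run time. The setting tuple, the load categories, the fallback to BLOCK, and all numbers are assumptions for illustration.

```python
from typing import Dict, Tuple

Setting = Tuple[int, int, str]   # (N messages, S bytes, load category)


def build_table(timings: Dict[Setting, Dict[str, float]]) -> Dict[Setting, str]:
    """For each explored setting, record the protocol with the lowest time."""
    return {s: min(t, key=t.get) for s, t in timings.items()}


def choose(table: Dict[Setting, str], setting: Setting,
           default: str = "BLOCK") -> str:
    """Constant-time lookup at run time; fall back for unexplored settings."""
    return table.get(setting, default)


# Made-up offline measurements (arbitrary time units).
offline = {
    (16, 64, "low"):  {"BLOCK": 9.0, "PIPE": 4.0, "VIS": 6.0},
    (16, 64, "high"): {"BLOCK": 9.5, "PIPE": 12.0, "VIS": 7.0},
}
table = build_table(offline)
```

Pruning the parameter space then amounts to keeping only the settings whose winner differs from that of their neighbors.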


Best algorithm determined by SMP arity and load

Resource constraints determine algorithm change

[Figure: protocol comparison panels, VIS vs BLOCK and VIS vs PIPE; axis data not recoverable]

See PACT’08 paper for details

• Changing system size or topology does not cause protocol changes
• Magnitude of performance differences is lowered (40x – 20x)
• Accuracy > 90%, less than 2x performance loss
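The two figures of merit in the last bullet can be computed as sketched below, under the assumption that accuracy means the fraction of settings where the selected protocol is the fastest one, and performance loss means the worst ratio of chosen time to best time. These definitions and the data are illustrative assumptions; the PACT’08 paper defines the actual metrics.

```python
from typing import Dict, Tuple


def evaluate(choices: Dict[str, str],
             measured: Dict[str, Dict[str, float]]) -> Tuple[float, float]:
    """Return (accuracy, worst slowdown) of a selector.

    accuracy: fraction of settings where the chosen protocol is the fastest.
    worst slowdown: max over settings of time(chosen) / time(fastest).
    """
    hits = 0
    worst = 1.0
    for setting, chosen in choices.items():
        times = measured[setting]
        best = min(times.values())
        if times[chosen] == best:
            hits += 1
        worst = max(worst, times[chosen] / best)
    return hits / len(choices), worst


# Made-up measurements for four settings.
measured = {
    "a": {"BLOCK": 4.0, "VIS": 2.0},
    "b": {"BLOCK": 3.0, "VIS": 6.0},
    "c": {"BLOCK": 5.0, "VIS": 4.0},
    "d": {"BLOCK": 1.0, "VIS": 8.0},
}
choices = {"a": "VIS", "b": "BLOCK", "c": "BLOCK", "d": "BLOCK"}
accuracy, slowdown = evaluate(choices, measured)
```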


[Figure: two panels, “BLOCK Inter/Poll” and “VIS Inter/Poll”. Axes: NMSG (2 to 1024) and Size (Dbls, 1 to 1024). Legend bins: 0-1 and 1-2 for BLOCK; 0-1 through 4-5 for VIS.]

• Polling vs Interrupts
• Different event notification mechanisms required for different protocols (event inter-arrival rate)
• Categorical choice

> 5X performance difference

Bassi – Power5/Federation


Pessimistic (max) predictors obtained under high load work best.

Our micro-benchmarks and models are always concerned about worst case performance.
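A pessimistic (max) predictor can be sketched as follows: for each protocol, keep the worst time observed under high load, then pick the protocol whose worst case is smallest. The sample values and function names are assumptions for illustration only.

```python
from typing import Dict, List


def max_predictor(samples: List[float]) -> float:
    """Pessimistic predictor: the worst (largest) time observed."""
    return max(samples)


def pick_protocol(high_load_samples: Dict[str, List[float]]) -> str:
    """Choose the protocol with the smallest worst-case time."""
    return min(high_load_samples,
               key=lambda p: max_predictor(high_load_samples[p]))


# Made-up timings collected while the node is heavily loaded.
samples = {
    "PIPE": [3.0, 3.5, 12.0],   # fast on average, bad worst case
    "VIS":  [5.0, 5.5, 6.0],    # slower on average, stable
}
chosen = pick_protocol(samples)
```

An average-based predictor would pick PIPE on this data; the max predictor picks VIS because it guards against the worst case.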

• UPC compiler, GASNet communication layer
  – 2 x 2068 x 2.6 GHz Opteron, Cray (BigBen)
  – 2 x 320 x 2.2 GHz Opteron, InfiniBand 4x cluster (Jacquard)
  – 8 x 111 x 1.9 GHz Power5, Federation (Bassi)
  – 16 x 3936 x 1.9 GHz Barcelona, InfiniBand (Ranger)
• NAS Parallel Benchmarks - manual optimizations vs compiler optimized
  – MG: point-to-point Put, dynamic granularity across one run
  – SP: point-to-point VIS Put, “static”
  – BT: point-to-point VIS Put/Get, “static”
• Node load (category) is the determining performance factor for wide SMPs
• Categories can be further refined into numerical values, e.g. instantaneous load estimation

Workload: 22% improvement

Load estimation?

[Figure: IBM p575, performance compared to the VIS implementation (higher is better). Series: VIS, PIPE, BLOCK, ADAPTIVE. x-axis: BT and SP at 16-A, 64-A, 16-B, 64-B, 144-C; MG at 16-A, 64-A, 16-B, 64-B, 128-C, 256-C. y-axis: 0 to 2.5.]

• Communication optimizations: qualitative models, worst case performance, offline/guided exploration
• First order performance determining factors are system dependent, number of correlations tends to be constant, large ranges
  - Strip-mining optimizations: Fat-tree and Torus
  - Vector optimizations: thin nodes and wide nodes
• Instantaneous behavior important, can be coarsely categorized (#pragma)
• Runtime Analysis feasible: algorithms O(n*log n) transfers, O(enest) faster than RTT
• Decoupling transformations (comm/comp) works – no whole program analysis
• SPMD performance can be enhanced by RT/OS mechanisms
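The slides do not spell out which runtime algorithms are meant. As one plausible example of an O(n log n) analysis over n transfers, the sketch below sorts transfer descriptors and merges overlapping or adjacent regions, which removes redundant communication. The (offset, length) representation and the name `coalesce` are assumptions, not the authors' implementation.

```python
from typing import List, Tuple


def coalesce(transfers: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or adjacent (offset, length) regions.

    Sorting dominates the cost, so the pass is O(n log n) in the
    number of transfers.
    """
    merged: List[Tuple[int, int]] = []
    for off, length in sorted(transfers):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            end = max(last_off + last_len, off + length)
            merged[-1] = (last_off, end - last_off)
        else:
            merged.append((off, length))
    return merged


regions = coalesce([(8, 4), (0, 4), (2, 4), (12, 2), (20, 1)])
```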

Thank You!

• Large number of network performance models (LogGP variants) - measurement methodology and validation on applications (asymptotic values)
  – Su et al (SC’05)
  – Cameron et al (IEEE ToC’07)
• Implementations:
  – Tipparaju et al (IPDPS’04) – InfiniBand
  – Nieplocha et al (HPCA’04) – Quadrics
  – Santhanaraman et al (PVM/MPI ’04) – InfiniBand
• PGAS compilers
  – CAF: message vectorization
  – Titanium: array copy operations, inspector-executor

[Figure: component framework diagram. Recoverable labels: Micro-Benchmarks; Processing System Characterization; Knowledge & Experience Base; OS & Runtime System; Configuration File & System Model; Back-end Processor Specific Compiler; Source-to-Source Code Transformations; Language Extensions & Libraries; C/C++; Fortran; UPC/CAF/Chapel; OpenMP; HPC Languages; Autotuning; Optimization; Learning & Reasoning; Automated Task Recognition; Program Analysis; Source Code Generation; Architecture Models; Network Models; DOD Application Code; Application Codes; Optimized Parallel Executable; Component Framework.]

Ideal Development Environment

J. Demmel, M. Hall, C. Iancu, D. Quinlan, K. Yelick…

• All protocols chosen across the whole workload and systems
• Two types of systems:
  – IBM – N-N estimators – static estimators are enough
  – Sun – P-N, P-HN, P-P – heuristics to change predictors with scale or use instantaneous load estimation

Overall improved scalability and performance


Improvement: 22% workload, 3x speedup max

Load estimation?

[Figure: NAS Application Benchmarks, Infiniband Cluster. Performance relative to the UNOPTIMIZED implementation; y-axis 0 to 2, with two off-scale bars labeled 2.96 and 2.15. Benchmarks: MG, SP, BT, CG, IS, FT, FT-NLE, at class-processor configurations from A-4 to C-128. Legend: HAND, OPT.]

Improvement: 22% workload, 3x speedup max (Sun: 2.5% workload, 15% speedup)

Iancu, Yelick

Instantaneous load estimation required for these results

(SMP load, comm topology, comm distance)
