Communication Support for Global Address Space Languages
Kathy Yelick, Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove,
Parry Husbands, Costin Iancu, Mike Welcome
NERSC/LBNL, U.C. Berkeley, and Concordia U.
Outline
• What is a Global Address Space Language?
  – Programming advantages
  – Potential performance advantage
• Application example
• Possible optimizations
• LogP model
• Cost on current networks
Two Programming Models
• Shared memory
  + Programming is easier
    • Can build large shared data structures
  – Machines don't scale
    • Typically, SMPs < 16 processors, DSM < 128 processors
  – Performance is hard to predict and control
• Message passing
  + Machines are easier to build and scale from commodity parts
  + Programmer has control over performance
  – Programming is harder
    • Distributed data structures exist only in the programmer's mind
    • Tedious packing/unpacking of irregular data structures
    • Losing programmers with each machine generation
Global Address-Space Languages
• Unified Parallel C (UPC)
  – Extension of C with distributed arrays
  – UPC efforts
    • IDA: T3E implementation based on old gcc
    • NERSC: Open64 implementation + generic runtime
    • GMU (documentation) and UMD (benchmarking)
    • Compaq (Alpha cluster and C+MPI compiler, with MTU)
    • Cray, Sun, and HP (implementations)
    • Intrepid (SGI compiler and T3E compiler)
• Titanium (Berkeley)
  – Extension of Java without the JVM
  – Compiler available from http://titanium.cs.berkeley.edu
  – Runs on most machines (shared, distributed, and hybrid)
  – Some experience calling libraries in other languages
• CAF (Rice and U. Minnesota)
Global Address Space Programming
• An intermediate point between message passing and shared memory
• A program consists of a collection of processes
  – Fixed at program startup time, like MPI
• Local and shared data, as in the shared memory model
  – But shared data is partitioned over the local processes
  – Remote data stays remote on distributed memory machines
  – Processes communicate by reads and writes to shared variables
• Examples are UPC, Titanium, CAF, Split-C
• Note: these are not data-parallel languages
  – The compiler does not have to map an n-way loop onto p processors
UPC Pointers
• Pointers may point to shared or private variables
  – Same syntax for use, just add a qualifier:
        shared int *sp;  int *lp;
  – sp is a pointer to an integer residing in the shared memory space
  – sp is called a shared pointer (somewhat sloppy)
• Private pointers are faster -- aliasing is common
[Figure: global address space diagram – a shared region holding x: 3 above the per-thread private regions; each thread's sp: points into the shared space while its lp: points into its own private space]
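The slide's declarations can be placed in a small UPC fragment to show how the two pointer kinds behave; only sp, lp, and x come from the slide, everything else is illustrative:

    #include <upc.h>

    shared int x = 3;            /* shared scalar, affinity to thread 0 */

    int main(void) {
        shared int *sp = &x;     /* pointer-to-shared: may reference remote data */
        int t = *sp;             /* this dereference may become a remote read */

        int *lp = &t;            /* private pointer: an ordinary local load/store */
        *lp = t + MYTHREAD;
        return 0;
    }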
Shared Arrays in UPC
• Shared array elements are spread across the threads:
        shared int x[THREADS]      /* One element per thread */
        shared int y[3][THREADS]   /* 3 elements per thread */
        shared int z[3*THREADS]    /* 3 elements per thread, cyclic */
• In the pictures below
  – Assume THREADS = 4
  – Elements with affinity to processor 0 are marked
[Figure: layouts for THREADS = 4 – x has one element per thread; y is blocked (really a 2D array); z is cyclic]
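As a minimal sketch of how these layouts are used, the loop below (the upc_forall and the initialization are illustrative additions, not from the slide) lets each thread touch only the elements of z it has affinity to:

    #include <upc.h>

    shared int x[THREADS];        /* one element per thread */
    shared int y[3][THREADS];     /* 3 elements per thread */
    shared int z[3*THREADS];      /* 3 elements per thread, cyclic */

    int main(void) {
        int i;
        upc_forall (i = 0; i < 3*THREADS; i++; &z[i])
            z[i] = MYTHREAD;      /* executed by the thread that owns z[i] */
        upc_barrier;
        return 0;
    }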
Example Problem
• Relaxation on a mesh (structured or not)
  – Also known as sparse matrix-vector multiply
  – [Figure: mesh with color indicating the owner processor]
• Implementation strategies
  – Read values across edges, either local or remote (sketch below)
  – Prefetch remote values
  – Remote processor writes values (into a ghost region)
  – Remote processor packs values and ships them as a block
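A sketch of the first strategy, reading values across edges through the global address space. The CSR-style data layout and all names here are assumptions for illustration, not taken from the talk:

    #include <upc.h>

    shared double val[100*THREADS];    /* vertex values, spread across threads */

    /* Relax the locally owned rows: reads of val[col[e]] whose affinity is on
     * another thread become remote reads; everything else stays local. */
    void relax_local_rows(int nrows, const int *rowstart, const int *col,
                          const double *a, double *out) {
        for (int v = 0; v < nrows; v++) {
            double sum = 0.0;
            for (int e = rowstart[v]; e < rowstart[v+1]; e++)
                sum += a[e] * val[col[e]];
            out[v] = sum;
        }
    }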
Communication Requirements
• One-sided communication
  – The origin can read or write the memory of a target node with no explicit interaction by the target
• Low latency for small messages
• Hide latency with non-blocking accesses (UPC "relaxed"); low software overhead
  – Overlap communication with communication
  – Overlap communication with computation
• Support for bulk, scatter/gather, and collective operations (as in MPI)
• Portability to a number of architectures
Performance Advantage of Global Address Space Languages
• Sparse matrix-vector multiplication on a T3E
[Chart: Mflops (0-250) vs. processors (1-32) for UPC + Prefetch, MPI (Aztec), UPC Bulk, and UPC Small]
• The UPC model with remote reads is fastest
  • Small messages (1 word)
  • Hand-coded prefetching
• Thanks to Bob Lucas
• Explanations
  • MPI on the T3E isn't very good
  • Remote read/write is fundamentally faster than two-sided message passing
Optimization Opportunities
• Introducing non-blocking communication
  – Currently hand-optimized in the Titanium code generator
  – Small-message versions of algorithms on the IBM SP
[Chart: speedup of non-blocking vs. blocking (axis 0-2) for GUPS, sparse matvec, and Laplacian]
How Hard is the Compiler Problem?
• Split-C, UPC, and Titanium experience
  – Small effort
  – Relied on lightweight communication
• Distinguish between
  – Single thread/process analysis
  – Global, cross-thread analysis
    • Two-sided communication, gets-to-puts, strong consistency semantics with a non-blocking implementation
• Support for application-level optimization is key
  – Bulk communication, scatter-gather, etc.
Portable Runtime Support
• Developing a runtime layer that can be easily ported and tuned to multiple architectures
[Layer diagram]
  – UPCNet: global pointers (an opaque type with a rich set of pointer operations), memory management, job startup, etc.
  – GASNet Extended API: supports put, get, locks, barrier, bulk, scatter/gather
  – GASNet Core API: small interface based on "Active Messages"
  – Annotations: generic support for UPC, CAF, Titanium; the core alone is sufficient for a functional implementation; parts of the full GASNet can be implemented directly on the network
Portable Runtime Support
• Full runtime designed to be used by multiple compilers
  – NERSC compiler based on Open64
  – Intrepid compiler based on gcc
• Communication layer designed to run on multiple machines
  – Hardware shared memory (direct load/store)
  – IBM SP (LAPI)
  – Myrinet 2K (GM)
  – Quadrics (Elan3)
  – Dolphin
  – VIA and InfiniBand, in anticipation of future networks
  – MPI for portability
• Use communication micro-benchmarks to choose optimizations
Core API – Active Messages
• Super-lightweight RPC
  – Unordered, reliable delivery with "user"-provided handlers
• Request/reply messages
  – 3 sizes: small (<= 32 bytes), medium (<= 512 bytes), large (DMA)
• Very general - provides extensibility
  – Available for implementing compiler-specific operations
  – Scatter-gather or strided memory access, remote allocation, ...
• Already implemented on a number of interconnects
  – MPI, LAPI, UDP/Ethernet, VIA, Myrinet, and others
• Allows a number of message-servicing paradigms
  – Interrupts, main-thread polling, NIC-thread polling, or some combination
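A rough sketch of the request/reply pattern this layer provides. It uses GASNet-1-style names (gasnet_AMRequestShort1, gasnet_AMReplyShort1, gasnet_handlerentry_t, GASNET_BLOCKUNTIL); since the spec was still a draft at the time of this talk, treat the exact signatures, handler indices, and segment arguments below as assumptions:

    #include <gasnet.h>

    #define HIDX_PING 201                /* handler indices chosen arbitrarily */
    #define HIDX_PONG 202

    static volatile int got_reply = 0;

    /* Request handler: runs on the target node, replies to the origin. */
    static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
        gasnet_AMReplyShort1(token, HIDX_PONG, arg + 1);
    }

    /* Reply handler: runs back on the origin node. */
    static void pong_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
        (void)token; (void)arg;
        got_reply = 1;
    }

    static gasnet_handlerentry_t htable[] = {
        { HIDX_PING, (void (*)())ping_handler },
        { HIDX_PONG, (void (*)())pong_handler },
    };

    int main(int argc, char **argv) {
        gasnet_init(&argc, &argv);
        gasnet_attach(htable, 2, GASNET_PAGESIZE, 0);

        if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
            gasnet_AMRequestShort1(1, HIDX_PING, 41);  /* small request to node 1 */
            GASNET_BLOCKUNTIL(got_reply);              /* poll until the reply handler has run */
        }
        gasnet_exit(0);
        return 0;
    }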
Extended API – Remote Memory Operations
• Want an orthogonal, expressive, high-performance interface
  – Scalars and bulk contiguous data
  – Blocking and non-blocking (returns a handle)
  – Also a non-blocking form where the handle is implicit
• Non-blocking synchronization
  – Sync on a particular operation (using a handle)
  – Sync on a list of handles (some or all)
  – Sync on all pending reads, writes, or both (for implicit handles)
  – Allow polling (trysync) or blocking (waitsync)
• Misc. characteristics
  – Gets specify a destination memory address (register-memory operations also provided)
  – Remote addresses expressed as (node id, virtual address)
  – Loopback is supported
  – Handles need not be explicitly freed
  – Knows nothing about local UPC threads, but is thread-safe on platforms with POSIX threads
Extended API – Remote Memory
• API for remote gets/puts:
        void   get    (void *dest, int node, void *src, int numbytes)
        handle get_nb (void *dest, int node, void *src, int numbytes)
        void   get_nbi(void *dest, int node, void *src, int numbytes)
        void   put    (int node, void *dest, void *src, int numbytes)
        handle put_nb (int node, void *dest, void *src, int numbytes)
        void   put_nbi(int node, void *dest, void *src, int numbytes)
• "nb" = non-blocking with explicit handle
• "nbi" = non-blocking with implicit handle
• Also have "value" forms for register transfers
• Recognize and optimize common sizes with macros
• Extensibility of the core API allows easily adding other, more complicated access patterns (scatter/gather, strided, etc.)
Extended API – Remote Memory
• API for get/put synchronization
• Non-blocking ops with explicit handles:
        int  try_syncnb (handle)
        void wait_syncnb(handle)
        int  try_syncnb_some (handle *, int numhandles)
        void wait_syncnb_some(handle *, int numhandles)
        int  try_syncnb_all  (handle *, int numhandles)
        void wait_syncnb_all (handle *, int numhandles)
• Non-blocking ops with implicit handles:
        int  try_syncnbi_gets()
        void wait_syncnbi_gets()
        int  try_syncnbi_puts()
        void wait_syncnbi_puts()
        int  try_syncnbi_all()   // gets & puts
        void wait_syncnbi_all()
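Putting the two slides together, a typical non-blocking use looks like the sketch below. The call names are exactly those on the slides; the handle type, the node/address arguments, and the do_local_work/consume helpers are placeholders:

    double local_buf[1024];

    void fetch_and_compute(int remote_node, void *remote_src) {
        /* start the transfer, then overlap it with local work */
        handle h = get_nb(local_buf, remote_node, remote_src, sizeof(local_buf));

        do_local_work();          /* hypothetical computation, overlapped with the get */

        if (!try_syncnb(h))       /* poll once ... */
            wait_syncnb(h);       /* ... then block until the data has landed */

        consume(local_buf);       /* safe to read the destination buffer now */
    }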
Extended API – Other Operations
• Basic job control
  – Init, exit
  – Job layout queries - get node rank and node count
  – Common user interface for job startup
• Synchronization
  – Named split-phase barrier (wait & notify)
  – Locking support
    • Core API provides "handler-safe" locks for implementing upc_locks
    • May also provide atomic compare&swap or fetch&increment
• Collective communication
  – Broadcast, exchange, reductions, scans?
• Other
  – Performance monitoring (counters)
  – Debugging support?
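A small sketch of the named split-phase barrier idea: enter the barrier, do work that does not depend on it, then complete it. The gasnet_barrier_notify/gasnet_barrier_wait names follow later GASNet conventions and are an assumption here, as is the pack_next_messages helper:

    #include <gasnet.h>

    void end_of_phase(int phase_id) {
        gasnet_barrier_notify(phase_id, 0);  /* non-blocking: announce arrival */
        pack_next_messages();                /* hypothetical work overlapped with the barrier */
        gasnet_barrier_wait(phase_id, 0);    /* block until all nodes have notified */
    }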
Software Overhead
• Overhead: the cost that cannot be hidden with overlap
  – Shown here for 8-byte messages (put or send)
  – Compare to 1.5 usec for the CM5 using Active Messages
[Bar chart: software overhead in usec (0-12) for 8-byte messages on T3E/MPI, T3E/Shmem, IBM/MPI, IBM/LAPI, Compaq/MPI, Compaq/Put, Compaq/Get, M2K/MPI, M2K/GM, Dolphin/MPI, and Giganet/VIPL]
Small Message Bandwidth
• If overhead fills all the time, there is no potential for overlapping computation
[Bar chart: overhead and inverse node bandwidth in usec (0-20) for 8-byte messages, same set of networks]
Latency (Including Overhead)
[Bar chart: end-to-end latency in usec (0-50), split into 1-way ping latency and overhead, for IBM/LAPI, IBM/MPI, Compaq/Put, Compaq/Get, Compaq/MPI, Dolphin/MPI, M2K/GM, M2K/MPI, Giganet/VIPL, and SysKonnect]
Large Message Bandwidth
[Line chart: bandwidth in MB/sec (log scale, 0.01-1000) vs. message size (1 byte to 131072 bytes) for IBM/MPI, IBM/LAPI, Compaq/MPI, Compaq/Put, M2K/MPI, M2K/GM, Dolphin/MPI, Giganet/VIPL, and SysKonnect]
What to Take Away
• Opportunity to influence vendors to expose lighter-weight communication
  – Overhead is most important
  – Then gap (inverse bandwidth)
  – Then latency
• Global address space languages
  – Easier first implementation
  – Incremental performance tuning
• Proposal for GASNet
  – Two layers: full interface + core
End of Slides
Performance Characteristics
• The LogP model is useful for understanding small-message performance and overlap
  – L: latency across the network
  – o: overhead (sending and receiving busy time)
  – g: gap between messages (1/rate)
  – P: number of processors
[Figure: two processor/memory nodes connected by a network, annotated with the send and receive overheads (o_s, o_r), the latency L, and the gap g]
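Under these definitions, the standard LogP reading (not spelled out on the slide) gives the end-to-end time of one small message and of a pipelined sequence of n messages roughly as

    T_one = o_s + L + o_r
    T_n  ~= o_s + (n - 1) * max(o, g) + L + o_r

so a large o both adds to latency and consumes CPU time that could otherwise be overlapped, which is why the measurements that follow report overhead and gap separately.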
Questions
• Why Active Messages at the bottom?
  – Changing the PC is the minimum work
• What about machines with sophisticated NICs?
  – Handled by direct implementation of the full API
• Why not MPI-2 one-sided?
  – Designed for the application level
  – Too much synchronization required for a runtime
• Why not ARMCI?
  – Similar goals, but not designed for small (non-blocking) messages
Implications for Communication
• Fast small-message read/write simplifies programming
• Non-blocking read/write may be introduced by the programmer or the compiler
  – UPC has "relaxed" to indicate that an access need not happen immediately
• Bulk and scatter/gather support will be useful (as in MPI)
• Non-blocking versions may also be useful
Overview of NERSC Effort
Three components:
1) Compilers
  – IBM SP platform and PC clusters are the main targets
  – Portable compiler infrastructure (UPC -> C)
  – Optimization of communication and global pointers
2) Runtime systems for multiple compilers
  – Allow use by other languages (Titanium and CAF)
  – And in other UPC compilers
  – Performance evaluation
3) Applications and benchmarks
  – Currently looking at the NAS parallel benchmarks
  – Evaluating language and compilers
  – Plan to do a larger application next year
NERSC UPC Compiler
• Compiler being developed by Costin Iancu
  – Based on the Open64 compiler for C
    • Originally developed at SGI
    • Has an IA64 backend with some ongoing development
    • Software available on SourceForge
  – Can be used as a C-to-C translator
    • Can either generate code before most optimizations
    • Or after, but this is known to be buggy right now
• Status
  – Parses and type-checks UPC
  – Finishing code generation for the UPC -> C translator
  – Code generation for SMPs underway
Compiler Optimizations
• Based on lessons learned from
  – Titanium: UPC in Java
  – Split-C: one of the UPC predecessors
• Optimizations
  – Pointer optimizations:
    • Optimization of phase-less pointers
    • Turn global pointers into local ones
  – Overlap
    • Split-phase
    • Merge "synchs" at barrier (see the sketch below)
  – Aggregation
[Chart: speedup (axis 1.0-1.8) from the split-phase and synch-merge optimizations; Split-C data on the CM-5]
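In terms of the extended-API calls named earlier, the split-phase and "merge synchs" transformations turn a pair of blocking reads into something like the sketch below (the surrounding function, the node/address arguments, and the do_independent_work/use_values helpers are illustrative):

    void overlapped_reads(int node1, void *pa, int node2, void *pb) {
        double a, b;
        handle h1 = get_nb(&a, node1, pa, sizeof a);   /* was: get(&a, node1, pa, sizeof a); */
        handle h2 = get_nb(&b, node2, pb, sizeof b);   /* was: get(&b, node2, pb, sizeof b); */

        do_independent_work();                         /* overlapped with both transfers */

        handle hs[2] = { h1, h2 };
        wait_syncnb_all(hs, 2);                        /* the two syncs merged into one */
        use_values(a, b);
    }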
Possible Optimizations
• Use of lightweight communication
• Converting reads to writes (or the reverse)
• Overlapping communication with communication
• Overlapping communication with computation
• Aggregating small messages into larger ones
MPI vs. LAPI on the IBM SP
• LAPI is generally faster than MPI
• Non-blocking (relaxed) is faster than blocking
[Line chart: inverse throughput in usec (0-100) vs. message size (1-10000 bytes) for IBM/MPI blocking, IBM/MPI non-blocking, IBM/LAPI blocking, and IBM/LAPI non-blocking]
Overlapping Computation: IBM SP
• Nearly all software overhead - no computation overlap
  – Recall: 36 usec blocking, 12 usec non-blocking
[Line chart: time per step in usec vs. time spent in computation (0.3-52.4 usec), one curve per queue depth (1, 2, 3, 4, 8, 12, 16)]
Conclusions for IBM SP
• LAPI is better than MPI
• Reads and writes have roughly the same cost
• Overlapping communication with communication (pipelining) is important
• Overlapping communication with computation
  – Important if there is no communication overlap
  – Minimal value if >= 2 messages are overlapped
• Large messages are still much more efficient
• Generally noisy data: hard to control
Other Machines
• Observations:
  – Low latency reveals the programming advantage
  – The T3E is still much better than the other networks
[Bar chart: usec (0-120), sum of overlapped vs. sum of blocking time, for Millennium (MPICH/M2K), MVICH/Giganet, MVICH/M-VIA/Syskonnect, Myrinet (GM), Quadrics (MPI, ORNL), Quadrics (Shmem, ORNL), Quadrics (UPC, ORNL), SP (MPI), T3E (MPI, NERSC), T3E UPC (NERSC), VIPL/Giganet/VIA, and VIPL/M-VIA/SysKonnect]
Future Plans
• This month
  – Draft of the runtime spec
  – Draft of the GASNet spec
• This year
  – Initial runtime implementation on shared memory
  – Runtime implementation on distributed memory (M2K, SP)
  – NERSC compiler release 1.0b for the IBM SP
• Next year
  – Compiler release for PC clusters
  – Development of a CLUMP compiler
  – Begin a large application effort
  – More GASNet implementations
  – Advanced analysis and optimizations
Read/Write Behavior
• Negligible difference between blocking read and write performance
[Line chart: IBM SP blocking read/write using LAPI - usec (0-100) vs. message size (1-10000 bytes), with the write and read curves nearly identical]
Overlapping Communication
• Effects of pipelined communication are significant
  – 8 overlapped messages are sufficient to saturate the NI
[Line chart: inverse throughput in usec (0-100) vs. message size (1-8298 bytes), one curve per queue depth (1, 2, 3, 4, 8, 12, 16, 24, 32)]
Overlapping Computation
• Same experiment, but fix the total amount of computation
[Line chart: amortized time per step in usec (60-95) vs. compute time between messages (0.3-52.4 usec), one curve per queue depth (1, 2, 3, 4, 8, 12, 16)]
SPMV on Compaq/Quadrics
• Seeing 15 usec latency for small messages
• Data for 1 thread per node
[Chart: sparse matrix-vector multiply on the Compaq - Mflops (0-200) vs. processors (1-32) for MPI (Aztec) and UPC Small]
Optimization Strategy
• Optimization of communication is key to making UPC more usable
• Two problems:
  – Analysis of code to determine which optimizations are legal
  – Use of performance models to select transformations that improve performance
• Focus on the second problem here
Runtime Status
• Characterizing network performance
  – Low latency (low overhead) -> programmability
• Specification of a portable runtime
  – Communication layer (UPC, Titanium, Co-Array Fortran)
    • Built on a small "core" layer; interoperability is a major concern
  – Full runtime has memory management, job startup, etc.
[Bar chart: usec (0-120) for blocking vs. overlapped communication]
What is UPC?
• UPC is an explicitly parallel language
  – Global address space; can read/write remote memory
  – Programmer control over layout and scheduling
  – Derived from Split-C, AC, PCP
• Why a new language?
  – Easier to use than MPI, especially for programs with complicated data structures
  – Possibly faster on some machines, but the current goal is comparable performance
[Figure: processors p0, p1, p2 sharing a partitioned global address space]
Background
• UPC efforts elsewhere
  – IDA: T3E implementation based on old gcc
  – GMU (documentation) and UMD (benchmarking)
  – Compaq (Alpha cluster and C+MPI compiler, with MTU)
  – Cray, Sun, and HP (implementations)
  – Intrepid (SGI compiler and T3E compiler)
• UPC book:
  – T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
• Three components of the NERSC effort
  1) Compilers (SP and PC clusters) + optimization (DOE/UPC)
  2) Runtime systems for multiple compilers (DOE/Pmodels + NSA)
  3) Applications and benchmarks (DOE/UPC)
Overlapping Computation on Quadrics
• 8-byte non-blocking put on Compaq/Quadrics
[Line chart: time per step in usec (roughly 1.25-2.45) vs. compute time between messages (0-1.22 usec), one curve per queue depth (1-10)]