partitioned global address space languages kathy yelick lawrence berkeley national laboratory

1PGAS Languages Kathy Yelick

Partitioned Global Address Space Languages

Kathy YelickLawrence Berkeley National Laboratory

and UC Berkeley

Joint work with

The Titanium Group: S. Graham, P. Hilfinger, P. Colella, D. Bonachea, K. Datta, E. Givelberg, A. Kamil, N. Mai, A. Solar, J. Su, T. Wen

The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome

Kathy Yelick, 2

The 3 P’s of Parallel Computing

• Productivity• Global address space supports construction of complex

shared data structures• High level constructs (e.g., multidimensional arrays)

simplify programming

• Performance• PGAS Languages are Faster than two-sided MPI• Some surprising hints on performance tuning

• Portability• These languages are nearly ubiquitous

Kathy Yelick, 3

Partitioned Global Address Space• Global address space: any thread/process may

directly read/write data allocated by another• Partitioned: data is designated as local (near) or

global (possibly far); programmer controls layout

Glo

bal

ad

dre

ss s

pac

e

x: 1y:

l: l: l:

g: g: g:

x: 5y:

x: 7y: 0

p0 p1 pn

By default: • Object heaps

are shared• Program

stacks are private

• 3 Current languages: UPC, CAF, and Titanium • Emphasis in this talk on UPC & Titanium (based on Java)

Kathy Yelick, 4

PGAS Language Overview

• Many common concepts, although specifics differ• Consistent with base language

• Both private and shared data• int x[10]; and shared int y[10];

• Support for distributed data structures• Distributed arrays; local and global pointers/references

• One-sided shared-memory communication • Simple assignment statements: x[i] = y[i]; or t = *p; • Bulk operations: memcpy in UPC, array ops in Titanium and CAF

• Synchronization• Global barriers, locks, memory fences

• Collective Communication, IO libraries, etc.

Kathy Yelick, 5

Example: Titanium Arrays• Ti Arrays created using Domains; indexed using Points:

double [3d] gridA = new double [[0,0,0]:[10,10,10]];

• Eliminates some loop bound errors using foreach foreach (p in gridA.domain())

gridA[p] = gridA[p]*c + gridB[p];

• Rich domain calculus allow for slicing, subarray, transpose and other operations without data copies

• Array copy operations automatically work on intersection data[neighborPos].copy(mydata);

mydata data[neighorPos]

“restrict”-ed (non-ghost) cells

ghost cells

intersection (copied area)

Kathy Yelick, 6

Productivity: Line Count Comparison

• Comparison of NAS Parallel Benchmarks• UPC version has modest programming effort relative to C• Titanium even more compact, especially for MG, which uses multi-d

arrays• Caveat: Titanium FT has user-defined Complex type and cross-language

support used to call FFTW for serial 1D FFTs

UPC results from Tarek El-Gazhawi et al; CAF from Chamberlain et al;

Titanium joint with Kaushik Datta & Dan Bonachea

0

500

1000

1500

2000

NPB-CG NPB-EP NPB-FT NPB-IS NPB-MG

Lin

es

of

co

de

Fortran

C

MPI+F

CAF

UPC

Titanium

Kathy Yelick, 7

Case Study 1: Block-Structured AMR• Adaptive Mesh Refinement

(AMR) is challenging• Irregular data accesses and

control from boundaries• Mixed global/local view is useful

AMR Titanium work by Tong Wen and Philip Colella

Titanium AMR benchmarks available

Kathy Yelick, 8

AMR in TitaniumC++/Fortran/MPI AMR

• Chombo package from LBNL• Bulk-synchronous comm:

• Pack boundary data between procs

Titanium AMR• Entirely in Titanium• Finer-grained communication

• No explicit pack/unpack code• Automated in runtime system

Code Size in Lines

C++/Fortran/MPI Titanium

AMR data Structures 35000 2000

AMR operations 6500 1200

Elliptic PDE solver 4200* 1500

10X reduction in lines of code!

* Somewhat more functionality in PDE part of Chombo code

Elliptic PDE solver running time (secs)

PDE Solver Time (secs) C++/Fortran/MPI Titanium

Serial 57 53

Parallel (28 procs) 113 126

Comparable running time

Work by Tong Wen and Philip Colella; Communication optimizations joint with Jimmy Su

Kathy Yelick, 9

Immersed Boundary Simulation in Titanium• Modeling elastic structures in an

incompressible fluid.• Blood flow in the heart, blood clotting,

inner ear, embryo growth, and many more

• Complicated parallelization• Particle/Mesh method• “Particles” connected into materials

Joint work with Ed Givelberg, Armando Solar-Lezama

Code Size in Lines

Fortran Titanium

8000 4000

Time per timestep

0

20

40

60

80

100

1 2 4 8 16 32 64128

# procs

time

(sec

s)

Pow3/SP 256 3̂

Pow3/SP 512 3̂P4/Myr 512 2̂x256

Kathy Yelick, 10


• Productivity• Global address space supports complex shared structures• High level constructs simplify programming

• Performance• PGAS Languages are Faster than two-sided MPI

• Better match to most HPC networks

• Some surprising hints on performance tuning• Send early and often is sometimes best


Kathy Yelick, 11

PGAS Languages: High PerformanceStrategy for acceptance of a new language• Make it run faster than anything else

Keys to high performance• Parallelism:

• Scaling the number of processors

• Maximize single node performance• Generate friendly code or use tuned libraries (BLAS, FFTW, etc.)

• Avoid (unnecessary) communication cost• Latency, bandwidth, overhead• Berkeley UPC and Titanium use GASNet communication layer

• Avoid unnecessary delays due to dependencies• Load balance; Pipeline algorithmic dependencies

Kathy Yelick, 12

One-Sided vs Two-Sided

• A one-sided put/get message can be handled directly by a network interface with RDMA support• Avoid interrupting the CPU or storing data from CPU (preposts)

• A two-sided messages needs to be matched with a receive to identify memory address to put data• Offloaded to Network Interface in networks like Quadrics• Need to download match tables to interface (from host)

address

message id

data payload

data payload

one-sided put message

two-sided message

network

interface

memory

host

CPU

Kathy Yelick, 13

Performance Advantage of One-Sided Communication: GASNet vs MPI

0

100

200

300

400

500

600

700

800

900

10 100 1,000 10,000 100,000 1,000,000 10,000,000

Size (bytes)

Ba

nd

wid

th (

MB

/s)

GASNet put (nonblock)"

MPI Flood

Relative BW (GASNet/MPI)

1.0

1.2

1.4

1.6

1.8

2.0

2.2

2.4

10 1000 100000 10000000Si z e (bytes )

• Opteron/InfiniBand (Jacquard at NERSC):• GASNet’s vapi-conduit and OSU MPI 0.9.5 MVAPICH

• Half power point (N ½ ) differs by one order of magnitudeJoint work with Paul Hargrove and Dan Bonachea

(up

is

go

od

)

Kathy Yelick, 14

GASNet: Portability and High-Performance(d

ow

n i

s g

oo

d)

GASNet better for latency across machines

8-byte Roundtrip Latency

14.6

6.6

22.1

9.6

6.6

4.5

9.5

18.5

24.2

13.5

17.8

8.3

0

5

10

15

20

25

Elan3/Alpha Elan4/IA64 Myrinet/x86 IB/G5 IB/Opteron SP/Fed

Ro

un

dtr

ip L

ate

nc

y (

us

ec

)

MPI ping-pong

GASNet put+sync

Joint work with UPC Group; GASNet design by Dan Bonachea

Kathy Yelick, 15

(up

is

go

od

)

GASNet at least as high (comparable) for large messages

Flood Bandwidth for 2MB messages

1504

630

244

857225

610

1490799255

858 228795

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%


Pe

rce

nt

HW

pe

ak

(B

W in

MB

)

MPI GASNet

GASNet: Portability and High-Performance


Kathy Yelick, 16

(up

is

go

od

)

GASNet excels at mid-range sizes: important for overlap

GASNet: Portability and High-PerformanceFlood Bandwidth for 4KB messages

547

420

190

702

152

252

750

714231

763223

679

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%


Pe

rce

nt

HW

pe

ak

MPI

GASNet


Kathy Yelick, 17

Case Study 2: NAS FT• Performance of Exchange (Alltoall) is critical

• 1D FFTs in each dimension, 3 phases• Transpose after first 2 for locality• Bisection bandwidth-limited

• Problem as #procs grows

• Three approaches:• Exchange:

• wait for 2nd dim FFTs to finish, send 1 message per processor pair

• Slab: • wait for chunk of rows destined for 1

proc, send when ready

• Pencil: • send each row as it completes

Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Kathy Yelick, 18

Overlapping Communication• Goal: make use of “all the wires all the time”

• Schedule communication to avoid network backup

• Trade-off: overhead vs. overlap• Exchange has fewest messages, less message overhead• Slabs and pencils have more overlap; pencils the most

• Example: Class D problem on 256 Processors


Exchange (all data at once) 512 Kbytes

Slabs (contiguous rows that go to 1 processor)

64 Kbytes

Pencils (single row) 16 Kbytes

Kathy Yelick, 19

NAS FT Variants Performance Summary

• Slab is always best for MPI; small message cost too high• Pencil is always best for UPC; more overlap

0

200

400

600

800

1000

Myrinet 64

InfiniBand 256Elan3 256

Elan3 512Elan4 256

Elan4 512

MF

lops

per

Thr

ead

Best MFlop rates for all NAS FT Benchmark versions

Best NAS Fortran/MPIBest MPIBest UPC


.5 Tflops

Kathy Yelick, 20

Case Study 2: LU Factorization• Direct methods have complicated dependencies

• Especially with pivoting (unpredictable communication)• Especially for sparse matrices (dependence graph with holes)

• LU Factorization in UPC• Use overlap ideas and multithreading to mask latency• Multithreaded: UPC threads + user threads + threaded BLAS

• Panel factorization: Including pivoting• Update to a block of U• Trailing submatrix updates

• Status:• Dense LU done: HPL-compliant • Sparse version underway

Joint work with Parry Husbands

Kathy Yelick, 21

UPC HPL Performance

• Comparison to ScaLAPACK on an Altix, a 2 x 4 process grid• ScaLAPACK (block size 64) 25.25 GFlop/s (tried several block sizes)• UPC LU (block size 256) - 33.60 GFlop/s, (block size 64) - 26.47 GFlop/s

• n = 32000 on a 4x4 process grid• ScaLAPACK - 43.34 GFlop/s (block size = 64) • UPC - 70.26 Gflop/s (block size = 200)

X1 Linpack Performance

0

200

400

600

800

1000

1200

1400

60 X1/64 X1/128

GF

lop

/s

MPI/HPL

UPC

Opteron Cluster Linpack

Performance

0

50

100

150

200

Opt/64

GF

lop

/s

MPI/HPL

UPC

Altix Linpack Performance

0

20

40

60

80

100

120

140

160

Alt/32

GF

lop

/s

MPI/HPL

UPC

•MPI HPL numbers from HPCC database

•Large scaling: • 2.2 TFlops on 512p,

• 4.4 TFlops on 1024p (Thunder)

Joint work with Parry Husbands

Kathy Yelick, 22


• Productivity• Global address space supports complex shared structures• High level constructs simplify programming

• Performance• PGAS Languages are Faster than two-sided MPI• Some surprising hints on performance tuning


• Source-to-source translators are key• Combined with portable communication layer• Specialized compilers are useful in some cases

Kathy Yelick, 23

Portability of Titanium and UPC• Titanium and the Berkeley UPC translator use a similar model

• Source-to-source translator (generate ISO C)• Runtime layer implements global pointers, etc• Common communication layer (GASNet)

• Both run on most PCs, SMPs, clusters & supercomputers• Support Operating Systems:

• Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, Cygwin, MacOSX, Unicos, SuperUX• UPC translator somewhat less portable: we provide a http-based compile server

• Supported CPUs: • x86, Itanium, Alpha, Sparc, PowerPC, PA-RISC, Opteron

• GASNet communication:• Myrinet GM, Quadrics Elan, Mellanox Infiniband VAPI, IBM LAPI, Cray X1, SGI Altix,

Cray/SGI SHMEM, and (for portability) MPI and UDP• Specific supercomputer platforms:

• HP AlphaServer, Cray X1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000• Underway: Cray XT3, BG/L (both run over MPI)

• Can be mixed with MPI, C/C++, Fortran

Also used by gcc/upc

Joint work with Titanium and UPC groups

Kathy Yelick, 24

Portability of PGAS LanguagesOther compilers also exist for PGAS Languages• UPC

• Gcc/UPC by Intrepid: runs on GASNet• HP UPC for AlphaServers, clusters, …• MTU UPC uses HP compiler on MPI (source to source)• Cray UPC

• Co-Array Fortran:• Cray CAF Compiler: X1, X1E• Rice CAF Compiler (on ARMCI or GASNet), John Mellor-Crummey

• Source to source • Processors: Pentium, Itanium2, Alpha, MIPS• Networks: Myrinet, Quadrics, Altix, Origin, Ethernet • OS: Linux32 RedHat, IRIS, Tru64

NB: source-to-source requires cooperation by backend compilers

Kathy Yelick, 25

Summary• PGAS languages offer performance advantages

• Good match to RDMA support in networks• Smaller messages may be faster:

• make better use of network: postpone bisection bandwidth pain• can also prevent cache thrashing for packing

• PGAS languages offer productivity advantage• Order of magnitude in line counts for grid-based code in Titanium• Push decisions about packing/not into runtime for portability

(advantage of language with translator vs. library approach)

• Source-to-source translation• The way to ubiquity• Complement highly tuned machine-specific compilers

26PGAS Languages Kathy Yelick

End of Slides

Kathy Yelick, 27

Productizing BUPC• Recent Berkeley UPC release

• Support full 1.2 language spec• Supports collectives (tuning ongoing); memory model compliance• Supports UPC I/O (naïve reference implementation)

• Large effort in quality assurance and robustness• Test suite: 600+ tests run nightly on 20+ platform configs

• Tests correct compilation & execution of UPC and GASNet• >30,000 UPC compilations and >20,000 UPC test runs per night• Online reporting of results & hookup with bug database

• Test suite infrastructure extended to support any UPC compiler• now running nightly with GCC/UPC + UPCR• also support HP-UPC, Cray UPC, …

• Online bug reporting database• Over >1100 reports since Jan 03• > 90% fixed (excl. enhancement requests)

Kathy Yelick, 30

Benchmarking• Next few UPC and MPI application benchmarks

use the following systems• Myrinet: Myrinet 2000 PCI64B, P4-Xeon 2.2GHz• InfiniBand: IB Mellanox Cougar 4X HCA, Opteron 2.2GHz• Elan3: Quadrics QsNet1, Alpha 1GHz• Elan4: Quadrics QsNet2, Itanium2 1.4GHz

Kathy Yelick, 31

PGAS Languages: Key to High PerformanceOne way to gain acceptance of a new language• Make it run faster than anything elseKeys to high performance• Parallelism:

• Scaling the number of processors

• Maximize single node performance• Generate friendly code or use tuned libraries (BLAS, FFTW, etc.)

• Avoid (unnecessary) communication cost• Latency, bandwidth, overhead

• Avoid unnecessary delays due to dependencies• Load balance• Pipeline algorithmic dependencies

Kathy Yelick, 32

Hardware Latency

• Network latency is not expected to improve significantly• Overlapping communication automatically (Chen)• Overlapping manually in the UPC applications (Husbands, Welcome,

Bell, Nishtala)• Language support for overlap (Bonachea)

Kathy Yelick, 33

Effective Latency

Communication wait time from other factors • Algorithmic dependencies

• Use finer-grained parallelism, pipeline tasks (Husbands)

• Communication bandwidth bottleneck• Message time is: Latency + 1/Bandwidth * Size• Too much aggregation hurts: wait for bandwidth term• De-aggregation optimization: automatic (Iancu);

• Bisection bandwidth bottlenecks• Spread communication throughout the computation (Bell)

Kathy Yelick, 34

Fine-grained UPC vs. Bulk-Synch MPI

• How to waste money on supercomputers• Pack all communication into single message (spend

memory bandwidth)• Save all communication until the last one is ready (add

effective latency)• Send all at once (spend bisection bandwidth)

• Or, to use what you have efficiently:• Avoid long wait times: send early and often• Use “all the wires, all the time”• This requires having low overhead!

Kathy Yelick, 35

What You Won’t Hear Much About• Compiler/runtime/gasnet bug fixes, performance

tuning, testing, …• >13,000 e-mail messages regarding cvs checkins

• Nightly regression testing • 25 platforms, 3 compilers (head, opt-branch, gcc-upc),

• Bug reporting• 1177 bug reports, 1027 fixed

• Release scheduled for later this summer• Beta is available• Process significantly streamlined

Kathy Yelick, 36

Take-Home Messages• Titanium offers tremendous gains in productivity

• High level domain-specific array abstractions• Titanium is being used for real applications

• Not just toy problems• Titanium and UPC are both highly portable

• Run on essentially any machine• Rigorously tested and supported

• PGAS Languages are Faster than two-sided MPI• Better match to most HPC networks

• Berkeley UPC and Titanium benchmarks• Designed from scratch with one-side PGAS model • Focus on 2 scalability challenges: AMR and Sparse LU

Kathy Yelick, 37

Titanium Background

• Based on Java, a cleaner C++• Classes, automatic memory management, etc.• Compiled to C and then machine code, no JVM

• Same parallelism model at UPC and CAF• SPMD parallelism• Dynamic Java threads are not supported

• Optimizing compiler• Analyzes global synchronization• Optimizes pointers, communication, memory

Kathy Yelick, 38

Do these Features Yield Productivity?MG Line Count Comparison

60 58

552

203

9236

0

100

200

300

400

500

600

700

800

900

Fortran w/ MPI Titanium

Language

Pro

duct

ive

Line

s of

Cod

e

Computation

Communication

Declarations

CG Line Count Comparison

1435

27

28

99 44

0

20

40

60

80

100

120

140

160

Fortran w/ MPI Titanium

Language

Pro

duct

ive

Line

s of

Cod

e

Computation

Communication

Declarations

MG Class D Speedup - Opteron/IB

0

20

40

60

80

100

120

140

0 50 100 150

Processors

Sp

eed

up

(O

ver

Bes

t 32

P

roc

Cas

e)

Linear Speedup

Fortran w/MPI

Titanium

CG Class D Speedup - G5/IB

050

100150200

250300350400

0 50 100 150 200 250 300

Processors

Sp

eed

up

(O

ver

Bes

t 64

P

roc

Cas

e)

Linear Speedup

Fortran w/MPI

Titanium

Joint work with Kaushik Datta, Dan Bonachea

Kathy Yelick, 39

GASNet/X1 Performance

• GASNet/X1 improves small message performance over shmem and MPI • Leverages global pointers on X1• Highlights advantage of languages vs. library approach

1 2 4 8 16 32 64 128 256 512 1024 20480

1

2

3

4

5

6

7

8

9

10

11

12

13

14ShmemGASNet

MPI

Message Size (bytes)

Put

per

mess

ag

e g

ap (

mic

rose

conds)

RMW Scalar Vector bcopy()

1 2 4 8 16 32 64 128 256 512 1024 20480

1

2

3

4

5

6

7

8

9

10

11

12

13

14Shmem

GASNet

Message Size (bytes)G

et

Late

ncy

(m

icro

seco

nd

s)

RMW Scalar Vector bcopy()

single word put single word get

Joint work with Christian Bell, Wei Chen and Dan Bonachea

Kathy Yelick, 40

High Level Optimizations in Titanium

Average and maximum speedup of the Titanium version relative to the Aztec version on 1 to 16 processors

Itanium/Myrinet Speedup Comparison

1

1.1

1.2

1.3

1.4

1.5

1.6

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

matrix number

spee

dup

average speedup maximum speedup

• Irregular communication can be expensive• “Best” strategy differs by data size/distribution and machine

parameters • E.g., packing, sending bounding boxes, fine-grained are

• Use of runtime optimizations

• Inspector-executor• Performance on

Sparse MatVec Mult• Results: best strategy

differs within the machine on a single matrix (~ 50% better)

Speedup relative to MPI code (Aztec library)

Joint work with Jimmy Su

Kathy Yelick, 41

Source to Source Strategy

• Source-to-source translation strategy• Tremendous portability advantage• Still can perform significant optimizations

• Relies on high quality back-end compilers and some coaxing in code generation Livermore Loops on Cray X1 (single Node)

0

0.5

1

1.5

2

2.5

3

3.5

4

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Per

form

ance

Rat

io C

/ U

PC

48x• Use of “restrict” pointers in C • Understand Multi-D array

indexing (Intel/Itanium issue)• Support for pragmas like

IVDEP• Robust vectorizators (X1,

SSE, NEC,…)

• On machines with integrated shared memory hardware need access to shared memory operations

Joint work with Jimmy Su

partitioned global address space languages kathy yelick lawrence berkeley national laboratory

Documents