TRANSCRIPT
UPC Project Annual Review
Kathy Yelick, Associate Laboratory Director
Acting NERSC Director, Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley
Agenda
Hierarchical PGAS Programming
Design of Programming Models: What to Hide?
• Hidden from most programmers:
  – Limited DRAM: OS virtual memory
  – Limited registers: compilers
  – Network topology: MPI collectives
  – Separate address spaces: hidden by UPC, not MPI
• Exposed:
  – Processor count: not hidden by MPI/UPC
  – Partitioned memory space on GPU nodes
• “Performance programmers” may want to optimize for these hardware features
Challenges from Machines
[Figure: petascale vs. exascale systems]
Challenge #1: Scale / Topology
• Trend: exascale = ~10x more nodes than petascale
  – Increased pressure on / cost of the network
  – Reduced (relative) bisection bandwidth
  – Topology is more visible in performance
• Solutions:
  – For common communication patterns (e.g., collectives), hide in libraries
  – Expose topology to programmers
  – Note the job placement / utilization trade-off
• Conclusion:
  – Need to optimize bisection-sensitive communication
Challenge #2: Addressing
• Trend? integration of network / processor
  – DMA yes; atomics maybe; invocation no (deadlock)
  – Especially useful for unexpected communication
• Trade-offs well understood [Active Messages]
  – Changing the program counter is fast / easy
  – But resource management is key
• Conclusion: we should demand this support
[Figure: a one-sided put message carries an address and data payload and can be deposited directly into memory by the network interface; a two-sided send message carries a message id and data payload and must be matched by the processor]
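To make the invocation trade-off concrete, below is a minimal active-message-style sketch; am_request and the handler signature are hypothetical illustrations, not GASNet's (or any real library's) API.

    /* Hypothetical active-message sketch; am_request is illustrative,
       not a real API.  The handler runs on the target when the message
       arrives -- changing the program counter is the fast, easy part. */
    #include <stddef.h>
    #include <stdint.h>

    typedef void (*am_handler_t)(void *payload, size_t nbytes);

    /* hypothetical: send payload to node dest and run handler h there */
    void am_request(int dest, am_handler_t h, void *payload, size_t nbytes);

    /* Example handler for an unexpected remote increment.  The hard part
       is resource management: bounding the buffers and replies pending
       handlers may consume, or invocation can deadlock. */
    static void incr_handler(void *payload, size_t nbytes) {
        (void)nbytes;
        uint64_t *counter = *(uint64_t **)payload; /* target-local address */
        (*counter)++;
    }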
Challenge #3: Node Hierarchy
• Trend: new levels of software-exposed memory hierarchy
  1) Scratchpad memory
  2) NVRAM (a level between DRAM and disk)
  3) Co-processor split memory (CM5, RR, GPUs, …)
• We can automatically manage these:
  – Hardware: caches, coherent for #3
  – But it may not be good for power/performance
• Can make this explicit to software
  – Some software successes: compilers, autotuners, …
  – Cache-oblivious vs. cache-aware algorithms
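As a sketch of what making the hierarchy explicit looks like, consider staging tiles into scratchpad by DMA instead of relying on a hardware cache; dma_get and dma_wait here are hypothetical stand-ins for a Cell-style local-store API.

    /* Hypothetical software-managed scratchpad: the program, not a
       cache, decides what data to stage and when. */
    #include <stddef.h>

    #define TILE 256
    static double tile[TILE];  /* assume this buffer lives in scratchpad */

    void dma_get(void *dst, const void *src, size_t nbytes); /* hypothetical */
    void dma_wait(void);                                     /* hypothetical */

    double sum_array(const double *a /* in DRAM */, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i += TILE) {  /* assumes TILE divides n */
            dma_get(tile, a + i, TILE * sizeof(double));
            dma_wait();                /* double-buffering would hide this */
            for (int j = 0; j < TILE; j++)
                s += tile[j];
        }
        return s;
    }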
Challenge #4: Heterogeneity
• Many kinds of heterogeneity:
  – Functional heterogeneity
    • Node heterogeneity is standard
    • Core heterogeneity is emerging
  – Performance heterogeneity
    • We can fold functional heterogeneity into this
• Approach: better compiler / system support
[Figure: heterogeneity at the system, board, and chip levels]
Challenge #5: Resilience
• Resiliency challenges
  – Chance of component failure grows with system size
• Need realistic analysis
  – Which failures, how often, …
• Approach:
  – Component failures should not cause system-wide outages
  – Or application-wide outages
  – Detection by hardware
  – Isolation in software (at all levels)
Challenge #6: Load Balancing
• Where does the mapping of the graph to a particular number of processors happen?
  – The compiler: NESL, ZPL
  – The runtime system: Cilk
  – The programmer: MPI, UPC
• Irregular task graphs → irregular runtime systems
  – Unpredictability → dynamic runtime
• Regular task graphs (statically domain decomposed)
  – Should we pay for a dynamic runtime?
  – Depends: Charm++, OpenMP, X10, Chapel
Challenges of Scheduling in a Distributed Environment
• Theoretical and practical problem: memory deadlock
  – Not enough memory for all tasks at once
  – (Each update needs two temporary blocks, a green and a blue, to run)
  – If updates are scheduled too soon, you will run out of memory
  – Allocate memory in increasing order of factorization:
    • Don't skip any!
  – Thread blocks until enough memory is available
[Figure: task graph of the factorization; some edges omitted]
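A minimal sketch (mine, not the slides') of that blocking discipline using POSIX threads: temporary blocks are granted in a fixed global order, and a thread blocks until enough memory is free, so the lowest-numbered waiting task can always complete.

    /* Sketch: deadlock-free memory acquisition for scheduled updates. */
    #include <pthread.h>

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  freed = PTHREAD_COND_INITIALIZER;
    static int free_blocks = 8;   /* assumed pool size */
    static int next_task   = 0;   /* grants go in increasing task order */

    void acquire_blocks(int task_id, int nblocks) {
        pthread_mutex_lock(&lock);
        /* wait for our turn (don't skip any!) and for enough memory */
        while (task_id != next_task || free_blocks < nblocks)
            pthread_cond_wait(&freed, &lock);
        free_blocks -= nblocks;
        next_task++;
        pthread_cond_broadcast(&freed);
        pthread_mutex_unlock(&lock);
    }

    void release_blocks(int nblocks) {
        pthread_mutex_lock(&lock);
        free_blocks += nblocks;
        pthread_cond_broadcast(&freed);
        pthread_mutex_unlock(&lock);
    }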
Challenge #7: Resource Management
• The problem for putting high-level programming models into practice:
  – Dataflow
  – Remote function invocation
  – Distributed shared memory (with caching)
  – Cache-oblivious algorithms
  – Dynamic load balancing
• More research is needed to understand trade-offs in an exascale environment
Challenge #8: Avoiding Communication
• Start with good algorithms
• Sparse iterative (Krylov subspace) methods
• Can we lower data movement costs?
  – Take k steps with one matrix read from DRAM and one communication phase
    • Serial: O(1) moves of data vs. O(k)
    • Parallel: O(log p) messages vs. O(k log p)
• Even dense matrix multiply can avoid communication
• Can the programming model help?
  – Bandwidth avoidance (lower volume)
  – Latency avoidance (fewer events)
  – Latency hiding (up to 2x)
  – Synchronization avoidance (remove barriers)
Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin.
See Solomonik and Demmel for matmul.
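For reference, the sketch below shows the quantity the k-step kernel computes, written as a naive serial loop that sweeps A once per step; the communication-avoiding kernel reorganizes this loop nest (blocking rows and redundantly computing boundary entries) so each block of A is read from DRAM once per k steps rather than k times.

    /* Naive sketch: build the Krylov basis V = [x, Ax, ..., A^k x]
       with A sparse in CSR format.  V holds k+1 rows of length n. */
    void matrix_powers(int n, const int *rowptr, const int *col,
                       const double *val, const double *x,
                       int k, double *V /* size (k+1)*n */) {
        for (int i = 0; i < n; i++)
            V[i] = x[i];                        /* row 0 is x itself */
        for (int j = 1; j <= k; j++) {
            const double *prev = V + (j - 1) * n;
            double *next = V + j * n;
            for (int i = 0; i < n; i++) {       /* one full sweep over A */
                double sum = 0.0;
                for (int p = rowptr[i]; p < rowptr[i + 1]; p++)
                    sum += val[p] * prev[col[p]];
                next[i] = sum;
            }
        }
    }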
The UPC Experience
• Ecosystem:
  – Users with a need (fine-grained random access)
  – Machines with RDMA (not hardware GAS)
  – Common runtime
  – Commercial and free software
  – Center procurements
  – Sustained, many-year funding
Timeline:
• 1991: Active Messages are fast
• 1992: First Split-C (compiler GAS)
• 1992: First AC (accelerators + split memory)
• 1993: Split-C funding (DOE)
• 1995: Titanium funding (DARPA)
• 1997: First UPC meeting; "best of" AC, Split-C, PCP
• 2001: First UPC funding
• 2001: gcc-upc at Intrepid
• 2002: GASNet spec
• 2003(?): Berkeley compiler release
• 2006: UPC in NERSC procurement
• 2010: Hybrid MPI/UPC
• Other GASNet-based languages
Evolution vs. Revolution
• Revolutionary investigations can impact evolutionary approaches
  – GASNet/ARMCI → MPI one-sided
  – PGAS → OpenMP
  – Titanium → UPC
• Revolutions require broad support
  – And real incentives for change
• Languages are useful for clarifying ideas
  – No hand-waving in compilers
Risk
• High risk, high reward
  – Some projects should fail
• Whose risk?
  – MPI+OpenMP:
    • Failure of models to perform
    • Failure to recruit new applications
  – MPI+X (where X ≠ OpenMP):
    • Failure to converge on a portable model
  – New universal language:
    • Failure of applications to move
    • Failure of implementations to perform
• Risk, cost, and schedule are related
What is Exascale?
• Ensure DOE has machines to solve mission / science problems in 2020
• Ensure DOE has hw/sw technology from which to build machines in future
How to move the users?
Where are the users?
• Fortran with flat MPI
• Linda on IP sockets
• And adding Python
Then:
• Add OpenMP
• Add PGAS
• Add CUDA
Users in 2020:
• Add new features
• Add new languages
• Replace some code
Interoperability is essential
New ideas from basic research
Conclusions
• Solve the problems that must be solved
  – Locality (how many levels are necessary?)
  – Heterogeneity
  – Vertical communication management (horizontal is solved by MPI or PGAS)
  – Fault resilience, maybe
• Look at the 800-cabinet K machine
• Dynamic resource management
  – Definitely for irregular problems
  – Maybe for regular ones on "irregular" machines
  – Resource management for dynamic distributed runtimes
Irregular vs. Regular Parallelism
• Computations with regular task graphs can be automatically virtualized / scheduled
  – By a compiler or runtime system
• Fork/join graphs (no out-of-band dependencies) can be scheduled
  – By a runtime system (e.g., Cilk)
  – A greedy scheduler (stealing or pushing) is optimal in time
  – Stealing is optimal in space (but slower to load balance)
• General DAGs are more complicated
  – Either preemption or user awareness is needed
• Conclusion: If your computation is not regular, the runtime system should be dynamic, i.e., virtualize the processors
What to Virtualize?
• Register count: hide
  – Compilers prove this, but need autotuners
• Separate address spaces: hide
  – PGAS proves this; needs to be shown for GPUs
• Performance-partitioned memory: expose
  – For distributed memory it needs to be exposed
• Number of cores: depends
  – On shared memory, we can virtualize
  – On distributed memory, mostly not
  – Open question for non-SPMD PGAS
Keeping the Users with Us
• Demise of flat MPI
• Demise of OpenMP "as we know it"
• Game programmers recruited
Two Parts of a Programming Model
• The parallel control model
  – Data parallel (single thread of control), dynamic threads, or single program multiple data (SPMD)
• The model for sharing / communication
  – Shared memory (load/store) or message passing (send/receive)
  – Synchronization is implied for message passing, not for shared memory
• Irregularity in task costs (e.g., search)
• Irregularity in task dependences
• Irregularity in task locality (communication)
Evolution of Abstract Machine Models
[Figure: evolution of abstract machine models — SMP (processors sharing one memory), MPI distributed memory (each processor with its own local memory), PGAS (per-processor local memory plus a partitioned shared space), and hierarchical PGAS (HPGAS: adds a chip-level shared tier; details still open)]
PGAS for Local Store Memory
• How does local store memory affect PGAS?
  – No casts from pointer-to-shared to pointer-to-private (this is a big change!)
  – Assignments and bulk operations work
  – Could add an additional notion of local for faster pointers, and distinguish private from local
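For concreteness, a small standard-UPC fragment (my sketch) showing the cast that local-store memory would forbid: today, a pointer-to-shared with local affinity can be cast to an ordinary C pointer; with a local store, data would have to be moved explicitly instead.

    #include <upc.h>

    shared int data[THREADS];  /* one element with affinity to each thread */

    void example(void) {
        shared int *sp = &data[MYTHREAD];
        int *lp = (int *)sp;   /* legal UPC today: shared-to-private cast */
        *lp = 42;              /* fast private access to local shared data */

        /* With local-store memory the cast would be disallowed, but
           assignments and bulk operations still work: */
        int tmp;
        upc_memget(&tmp, sp, sizeof(int));
    }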
Hierarchical PGAS Model
• A global address space for hierarchical machines may have multiple kinds of pointers
• These can be encoded by programmers in the type system or hidden, e.g., all global or only local/global
• This partitioning is about pointer span, not control / parallelism
[Figure: pointers A, B, C, D with span 1 (core local), span 2 (chip local), level 3 (node local), and level 4 (global world)]
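One hypothetical C rendering of such pointers (not Titanium's actual representation): a fat pointer that records its span, so the compiler or runtime can choose the cheapest access path.

    /* Hypothetical fat pointer carrying its span. */
    typedef enum {
        SPAN_CORE   = 1,   /* core local   */
        SPAN_CHIP   = 2,   /* chip local   */
        SPAN_NODE   = 3,   /* node local   */
        SPAN_GLOBAL = 4    /* global world */
    } span_t;

    typedef struct {
        span_t span;    /* widest scope this pointer may reference */
        int    owner;   /* owning thread for spans wider than core */
        void  *addr;    /* address within the owner's memory       */
    } hier_ptr_t;

    /* A SPAN_CORE pointer dereferences as a plain load/store; wider
       spans may need on-chip routing or network RDMA, so narrowing a
       pointer's static span makes it both smaller and faster to use. */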
Type Systems Help: Hierarchical Pointer Analysis
• Demonstrated in a Java dialect called Titanium
• Statically determines what data each pointer can refer to
• Allocation sites approximated as abstract locations (alocs)
• All explicit and implicit program variables have points-to sets
  – The set of alocs that the variable may reference
• Abstract locations have span
  – Thread-local locations are created by the local thread
  – Process-local locations reside in the same memory space
  – Global locations can be anywhere
Pointer Analysis: Communication
• Communication produces a version of the source abstract location with greater span

    class Foo { Object z; }

    static void bar() {
     L1: Foo a = new Foo();
         Foo b = broadcast a from 0;
         Foo c = a;
     L2: a.z = new Object();
         Ti.barrier();
         Object d = b.z;
    }

Points-to sets (alocs 1 and 2; t = thread-local, g = global):
    a → 1t
    b → 1g
    c → 1t
    1t.z → 2t
    d → 2g
Race Detection Results (Amir Kamil)
[Figure: static races detected (logarithmic scale) per benchmark (amr, gas, ft, cg, mg) for the sharing, concur, feasible, feasible+AA1, and feasible+AA3 analyses; reported races fall from tens of thousands to tens as the analyses are layered]
Data Locality Detection Results
[Figure: percent of accesses determined to be process-local (higher is better) per benchmark (amr, gas, ft, cg, mg) for LQI, PA2, and PA3; PA2 = two-level hierarchy, PA3 = three-level hierarchy]
Current Programming Model for GPU Clusters
MPI/UPC + CUDA/OpenCL
[Figure: a cluster of nodes, each with two CPUs and a GPU; MPI/UPC spans the nodes while CUDA/OpenCL programs each GPU]
UPC + CUDA works today.
Hybrid Partitioned Global Address Space
[Figure: four processors, each with local segments on host and GPU memory, plus shared segments placed on host or GPU memory]
• Each thread has only two shared segments, which can be either in host memory or in GPU memory, but not both.
• Decouples the memory model from execution models, and therefore supports various execution models.
• Caveat: the type system, and therefore the interfaces, blow up with different parts of the address space.
Data Transfer Example
• MPI+CUDA

    /* Process 1: stage through a host buffer, then send */
    send_buffer = malloc(nbytes);
    cudaMemcpy(send_buffer, src, nbytes, cudaMemcpyDeviceToHost);
    MPI_Send(send_buffer, nbytes, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
    free(send_buffer);

    /* Process 2: receive, then copy to the GPU */
    recv_buffer = malloc(nbytes);
    MPI_Recv(recv_buffer, nbytes, MPI_BYTE, source, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    cudaMemcpy(dst, recv_buffer, nbytes, cudaMemcpyHostToDevice);
    free(recv_buffer);

• UPC with GPU extensions

    /* Thread 1: copy nbytes from src on GPU 1 to dst on GPU 2 */
    upc_memcpy(dst, src, nbytes);

    /* Thread 2: no operation required */

• Advantages of PGAS on GPU clusters:
  – No explicit buffer management by the user
  – Facilitates end-to-end optimizations such as data transfer pipelining
  – One-sided communication maps well to DMA transfers for GPUs
  – Concise code
Berkeley UPC Compiler
[Figure: the Berkeley UPC software stack — UPC code → UPC compiler → compiler-generated C code → UPC runtime system → GASNet communication system → network hardware. The generated C is platform-independent, the runtime is network-independent, and GASNet is language- and compiler-independent. The runtime is used by bupc and gcc-upc; GASNet is also used by Cray UPC, CAF, Chapel, Titanium, and others]
Irregular Applications
• UPC originally for "irregular" applications
  – Many recent performance results are on "regular" ones (FFTs, NPBs, etc.); those also do well
• Does it really handle irregular ones? Which?
  – Irregular in data accesses:
    • Irregular in space (sparse matrices, AMR, etc.): the global address space helps; needs compiler or language support for scatter/gather
    • Irregular in time (hash table lookup, etc.): for reads, UPC handles this well; for writes you need atomic operations (see the sketch after this list)
  – Irregular computational patterns:
    • High-level independent tasks (ocean, atm, land, etc.): need teams
    • Non-bulk-synchronous: use event-driven execution
    • Not statically load balanced (even with graph partitioning, etc.): need a global task queue
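To illustrate the write case: a sketch of a remote tally update, where atomic_fetchadd_remote stands in for whatever atomic operation the UPC implementation provides (the name and signature here are hypothetical).

    #include <upc.h>

    #define BINS 1024
    shared long counts[BINS];   /* distributed tally table */

    /* hypothetical: atomically add delta to a remote shared location */
    void atomic_fetchadd_remote(shared long *target, long delta);

    void tally(int bin) {
        /* counts[bin] += 1 is a remote read followed by a remote write:
           two threads can interleave these and lose updates. */
        atomic_fetchadd_remote(&counts[bin], 1);
    }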
Distributed Tasking API for UPC

    // allocate a distributed task queue
    distq_t *all_distq_alloc();

    // enqueue a task into the distributed queue
    int distq_enqueue(distq_t *, distq_task_t *);

    // dequeue a task from the local task queue;
    // returns NULL if a task is not readily available
    distq_task_t *distq_dequeue(distq_t *);

    // test whether the queue is globally empty
    int distq_isempty(distq_t *);

    // free distributed task queue memory
    int distq_free(shared distq_t *);

[Figure: enqueue and dequeue operate on queues spanning shared and private memory; internals are hidden from the user, except that dequeue operations may fail and provide a hint to steal]
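A sketch (not from the slides) of how a worker might drive this queue; the run/arg fields of distq_task_t are assumptions, since the API above leaves the task type opaque.

    // Hypothetical worker loop over the tasking API above.
    typedef struct distq_task {
        void (*run)(void *);   // assumed: task body
        void *arg;             // assumed: task argument
    } distq_task_t;

    void worker(distq_t *q) {
        while (!distq_isempty(q)) {
            distq_task_t *t = distq_dequeue(q);  // local queue first
            if (t == NULL)
                continue;  // not ready: treat as a hint to steal / retry
            t->run(t->arg);  // a task may distq_enqueue() new work
        }
    }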
Machine Structure
• Provide a method to query machine structure at runtime

    Team T = Ti.defaultTeam();

[Figure: eight threads (0-7); the default team hierarchy is {0,1,2,3,4,5,6,7}, splitting into {0,1,2,3} and {4,5,6,7}, which split further into {0,1}, {2,3}, {4,5}, and {6,7}]
Regular Team Structures
• Provide library functions to simplify building common team hierarchies

    Team T = new Team();
    T.splitTeam(3);

[Figure: threads 0-11 split into three teams: {0,1,2,3}, {4,5,6,7}, {8,9,10,11}]
Task Hierarchy
• Thread teams may execute distinct tasks
  – e.g., forward FFT, multiplication/reduction, and reverse FFT in a parallel convolution
• A new partition construct enables this:

    partition(T) {
      { forward_fft(); }
      { multiply_reduce(); }
      { reverse_fft(); }
    }
Hierarchical Control Model
• Basic questions about the parallelism model
  – Parallel all the time with fixed parallelism (SPMD)
  – Dynamic parallelism: serial/parallel (loop annotations)
• Even with SPMD we need hierarchy
  – Collectives are a key feature and need to work on subsets
  – Programs need divide and conquer (e.g., reduce on rows)
  – Programs may use different algorithms depending on machine costs
• Teams may provide sufficient functionality
  – Create teams based on sets of thread IDs
  – Get a default team structure (hierarchy) from the runtime

    T = get_row_teams(…);
    split(T) { row_reduce(); }

    T2 = node_teams();
    split(T2) { if (mythread == 0) compute_local(); }
Hierarchical Programmers
• 2 types of programmers → 2 layers
• Efficiency layer (10% of today's programmers)
  – Expert programmers build frameworks & libraries, hypervisors, …
  – "Bare metal" efficiency possible at the efficiency layer
• Productivity layer (90% of today's programmers)
  – Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
  – Frameworks & libraries are composed to form app frameworks
• Effective composition techniques allow the efficiency programmers to be highly leveraged
  → Create a language for Composition and Coordination (C&C)
Productivity Models
• A productivity programming model will hide some of the machine structure
  – Dynamic parallelism model for the application
    • Chapel-like data and task parallelism
    • Sequoia-like divide-and-conquer parallelism
  – Fewer levels of hierarchy
• Open question: for regular computations, do we need dynamic scheduling for performance heterogeneity due to fault recovery, power management, etc. (or is SPMD sufficient)?
• Open question: align the control / parallelism hierarchy with the memory / data hierarchy
Response of PGAS to Challenges
• Small memory per core
  – Ability to directly access another core's memory
• Lack of UMA memory on chip
  – Partitioned address space
• Massive concurrency
  – Good match for independent parallel cores
  – Not for data parallelism
• Heterogeneity
  – Need to relax strict SPMD with at least teams
• Application generality
  – Add atomics so remote writes work (not just reads)
NITRD Projects Addressed Programmer Productivity of Irregular Problems
Message Passing Programming
• Divide up the domain into pieces
• Compute one piece and exchange
• PVM, MPI, and many libraries
Global Address Space Programming
• Each thread starts computing
• Grab whatever / whenever
• UPC, CAF, X10, Chapel, Fortress, Titanium
Computing Performance Improvements will be Harder than Ever
Moore's Law continues, but power limits performance growth. Parallelism is used instead.
[Figure: transistors (thousands) and frequency (MHz), 1970-2010, log scale; transistor counts continue growing exponentially while clock frequency flattens]
Energy Efficient Computing is Key to Performance Growth
[Figure: projected system power, 2005-2020, comparing "usual scaling" against the exascale goal]
• At $1M per MW, energy costs are substantial
  – 1 petaflop in 2010 used 3 MW
  – 1 exaflop in 2018 would use 130 MW with "Moore's Law" scaling
• This problem doesn't change if we were to build 1000 1-petaflop machines instead of one exaflop machine. It affects every university department cluster and cloud data center.
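Worked through with the slide's own numbers: at $1M per MW (per year, as this figure is usually quoted), the 3 MW petaflop system costs about $3M per year in energy, while a 130 MW exaflop system would cost about $130M per year.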
New Processor Designs are Needed to Save Energy
• Server processors are designed for performance
• Embedded and graphics processors use simple low-power cores → good performance/Watt
• New processor architectures and software are needed for future HPC systems
• Cell phone processor: 0.1 Watt, 4 Gflop/s
• Server processor: 100 Watts, 50 Gflop/s
Need to Redo Software and Algorithms
[Figure: energy per operation in picojoules (log scale, 1-10000), now: on-chip operations cost far less than memory accesses, which cost far less than messages]
Stop Communicating!
• Communication is expensive:
  – 10-100x in time/energy
  – Even to memory
• Computation is almost free!
[Figure: energy cost per operation rises from arithmetic through register, cache, memory, and remote accesses to "far away"]
• Software and algorithms
  – Software will control data movement to avoid waste
  – Algorithms should minimize data movement
  – Exascale science involves more complex interactions