TRANSCRIPT

Page 1:

UPC Project Annual Review

Kathy Yelick
Associate Laboratory Director, Acting NERSC Director
Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley

Page 2:

Agenda

Page 3:

Hierarchical PGAS Programming

Kathy Yelick
Associate Laboratory Director, Acting NERSC Director
Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley

Page 4:

Design of Programming Models: What to Hide?

• Hidden from most programmers:
  – Limited DRAM: OS virtual memory
  – Limited registers: compilers
  – Network topology: MPI collectives
  – Separate address spaces: hidden by UPC, not MPI

• Exposed:
  – Processor count: not hidden by MPI/UPC
  – Partitioned memory space on GPU nodes

• “Performance programmers” may want to optimize for these hardware features
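As a concrete illustration of the hidden/exposed split above, here is a minimal UPC sketch (not taken from the slides; the counts array and summing loop are illustrative):

  #include <upc.h>
  #include <stdio.h>

  /* Separate address spaces are hidden behind the shared array: reading
   * counts[t] looks the same whether the element is local or remote.
   * The processor count (THREADS) and thread id (MYTHREAD) stay exposed
   * and are used directly for work decomposition. */
  shared int counts[THREADS];

  int main(void) {
      counts[MYTHREAD] = MYTHREAD;   /* each thread writes its own element */
      upc_barrier;
      if (MYTHREAD == 0) {
          int total = 0;
          for (int t = 0; t < THREADS; t++)
              total += counts[t];    /* possibly remote reads, no explicit messages */
          printf("sum over %d threads = %d\n", THREADS, total);
      }
      return 0;
  }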

Page 5:

Challenges from Machines

[Figure: Petascale vs. Exascale systems]

Page 6:

Challenge #1: Scale / Topology

• Trend: Exascale = ~10x more nodes than Petascale
  – Increased pressure on / cost of the network
  – Reduced (relative) bisection bandwidth
  – Topology is more visible in performance

• Solutions:
  – For common communication patterns (e.g., collectives), hide in libraries
  – Expose to programmers
  – Note the job placement / utilization trade-off

• Conclusion: need to optimize bisection-sensitive communication


Page 7:

Challenge #2: Addressing

• Trend? Integration of network / processor
  – DMA: yes; atomics: maybe; invocation: no (deadlock)
  – Especially useful for unexpected communication

• Trade-offs well understood [Active Messages]
  – Changing the program counter is fast / easy
  – But resource management is key

• Conclusion: we should demand this support

[Figure: a one-sided put message carries the destination address plus the data payload; a two-sided send message carries a message id plus the data payload and must be matched on the receiving side, involving the network interface, memory, and processor]
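A hedged sketch of the contrast the figure illustrates (assumed example; the buffer names and peer index are illustrative, and the MPI counterpart appears only in comments):

  #include <upc.h>

  #define N 1024

  /* One block of N doubles with affinity to each thread. */
  shared [N] double dst_buf[THREADS * N];
  double src_buf[N];

  void one_sided_put(int peer) {
      /* One-sided put: the message carries the destination address along with
       * the payload, so the data lands in peer's shared segment without
       * involving peer's processor. */
      upc_memput(&dst_buf[peer * N], src_buf, N * sizeof(double));
  }

  /* Two-sided equivalent, for contrast (MPI, in comments only):
   *   sender:   MPI_Send(src_buf, N, MPI_DOUBLE, peer, tag, comm);
   *   receiver: MPI_Recv(buf,     N, MPI_DOUBLE, src,  tag, comm, &status);
   * The message carries only a tag/id; the receiving processor must post a
   * matching receive to tell the network interface where the payload goes. */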

Page 8:

Challenge #3: Node Hierarchy

• Trend: new levels of software-exposed memory hierarchy
  1) Scratchpad memory
  2) NVRAM (a level between DRAM and disk)
  3) Co-processor split memory (CM5, RR, GPUs, …)

• We can automatically manage these:
  – Hardware: caches, coherent for #3
  – But it may not be good for power/performance

• Can make this explicit to software
  – Some software successes: compilers, autotuners, …
  – Cache-oblivious vs. cache-aware algorithms (contrasted in the sketch below)
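To make the last bullet concrete, here is a generic C sketch (not from the slides) contrasting a cache-aware and a cache-oblivious matrix transpose; the tile size B and the base-case cutoff are assumed tuning values:

  #include <stddef.h>

  #define B 64  /* machine-dependent tile size a cache-aware code must pick */

  /* Cache-aware: explicit blocking with a fixed tile size. */
  void transpose_blocked(size_t n, const double *a, double *t) {
      for (size_t ii = 0; ii < n; ii += B)
          for (size_t jj = 0; jj < n; jj += B)
              for (size_t i = ii; i < ii + B && i < n; i++)
                  for (size_t j = jj; j < jj + B && j < n; j++)
                      t[j * n + i] = a[i * n + j];
  }

  /* Cache-oblivious: recursive halving adapts to every cache level with no
   * machine-dependent parameter. */
  static void transpose_rec(size_t n, const double *a, double *t,
                            size_t r0, size_t r1, size_t c0, size_t c1) {
      if ((r1 - r0) * (c1 - c0) <= 32 * 32) {        /* small base case */
          for (size_t i = r0; i < r1; i++)
              for (size_t j = c0; j < c1; j++)
                  t[j * n + i] = a[i * n + j];
      } else if (r1 - r0 >= c1 - c0) {
          size_t rm = (r0 + r1) / 2;                  /* split rows */
          transpose_rec(n, a, t, r0, rm, c0, c1);
          transpose_rec(n, a, t, rm, r1, c0, c1);
      } else {
          size_t cm = (c0 + c1) / 2;                  /* split columns */
          transpose_rec(n, a, t, r0, r1, c0, cm);
          transpose_rec(n, a, t, r0, r1, cm, c1);
      }
  }

  void transpose_oblivious(size_t n, const double *a, double *t) {
      transpose_rec(n, a, t, 0, n, 0, n);
  }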


Page 9:

Challenge #4: Heterogeneity

• Many kinds of heterogeneity:
  – Functional heterogeneity
    • Node heterogeneity is standard
    • Core heterogeneity is emerging
  – Performance heterogeneity
    • Functional heterogeneity can be folded into this

• Approach: better compiler / system support

[Figure: heterogeneity at the system, board, and chip levels]

Page 10:

Challenge #5: Resilience

• Resiliency challenges
  – Chance of component failure grows with system size

• Need realistic analysis
  – Which failures, how often, …

• Approach:
  – Component failures should not cause system-wide outages
  – Or application-wide outages
  – Detection by hardware
  – Isolation in software (at all levels)

Page 11:

Challenge #6: Load Balancing

• Where does the mapping of the task graph onto a particular number of processors happen?
  – The compiler: NESL, ZPL
  – The runtime system: Cilk
  – The programmer: MPI, UPC

• Irregular task graphs → irregular runtime systems
  – Unpredictability → dynamic runtime

• Regular task graphs (statically domain decomposed)
  – Should we pay for a dynamic runtime?
  – Depends: Charm++, OpenMP, X10, Chapel

Page 12:

Challenges of Scheduling in a Distributed Environment

• Theoretical and practical problem: memory deadlock
  – Not enough memory for all tasks at once
  – (Each update needs two temporary blocks, a green and a blue, to run.)
  – If updates are scheduled too soon, you will run out of memory
  – Allocate memory in increasing order of factorization:
    • Don't skip any!
  – A thread blocks until enough memory is available

[Figure: task dependency graph for the factorization; some edges omitted]

Page 13:

Challenge #7: Resource Management

• The problem for putting high-level programming models into practice:
  – Dataflow
  – Remote function invocation
  – Distributed shared memory (with caching)
  – Cache-oblivious algorithms
  – Dynamic load balancing

• More research is needed to understand trade-offs in an exascale environment


Page 14:

Challenge #8: Avoiding Communication

• Start with good algorithms
• Sparse iterative (Krylov subspace) methods
• Can we lower data movement costs? (sketched below)
  – Take k steps with one matrix read from DRAM and one communication phase
    • Serial: O(1) moves of data vs. O(k)
    • Parallel: O(log p) messages vs. O(k log p)
• Even dense matrix multiply can avoid communication
• Can the programming model help?
  – Bandwidth avoidance (lower volume)
  – Latency avoidance (fewer events)
  – Latency hiding (up to 2x)
  – Synchronization avoidance (remove barriers)
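A brief sketch of the k-step idea above, in the standard communication-avoiding Krylov formulation (the notation is mine, not from the slides): instead of k separate sparse matrix-vector products, each needing a read of A and a neighbor exchange, form the whole basis at once.

  % Matrix powers kernel: with ghost regions of depth k, one read of A and
  % one communication phase produce the full k-step Krylov basis.
  \[
    K_k(A,x) \;=\; \bigl[\, x,\; Ax,\; A^{2}x,\; \dots,\; A^{k}x \,\bigr],
    \qquad
    \underbrace{O(1)}_{\text{reads of } A} \text{ vs. } O(k),
    \qquad
    \underbrace{O(\log p)}_{\text{messages}} \text{ vs. } O(k\log p).
  \]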

Joint work with Jim Demmel, Mark Hoemmen, and Marghoob Mohiyuddin

See Solomonik and Demmel for matmul

Page 15:

The UPC Experience

• Ecosystem:
  – Users with a need (fine-grained random access)
  – Machines with RDMA (not hardware GAS)
  – Common runtime
  – Commercial and free software
  – Center procurements
  – Sustained, many-year funding

Timeline:
  1991: Active Messages are fast
  1992: First Split-C (compiler GAS)
  1992: First AC (accelerators + split memory)
  1993: Split-C funding (DOE)
  1995: Titanium funding (DARPA)
  1997: First UPC meeting; “best of” AC, Split-C, PCP
  2001: First UPC funding
  2001: gcc-upc at Intrepid
  2002: GASNet spec
  2003?: Berkeley compiler release
  2006: UPC in NERSC procurement
  2010: Hybrid MPI/UPC
  Other GASNet-based languages

Page 16:

Evolution vs. Revolution

• Revolutionary investigations can impact evolutionary approaches
  – GASNet/ARMCI → MPI one-sided
  – PGAS → OpenMP
  – Titanium → UPC

• Revolutions require broad support
  – And real incentives for change

• Languages are useful for clarifying ideas
  – No hand-waving in compilers


Page 17:

Risk

• High risk, high reward
  – Some projects should fail

• Whose risk?
  – MPI+OpenMP:
    • Failure of models to perform
    • Failure to recruit new applications
  – MPI+X (where X ≠ OpenMP):
    • Failure to converge on a portable model
  – New universal language:
    • Failure of applications to move
    • Failure of implementations to perform

• Risk, cost, and schedule are related


Page 18:

What is Exascale?

• Ensure DOE has machines to solve mission / science problems in 2020

• Ensure DOE has hw/sw technology from which to build machines in future

How to move the users?

Where are the users?
  • Fortran with flat MPI
  • Linda on IP sockets
  • And adding Python

To move them:
  • Add OpenMP
  • Add PGAS
  • Add CUDA

Users in 2020:
  • Add new features
  • Add new languages
  • Replace some code

Interoperability is essential

New ideas from basic research


Page 20:

Conclusions

• Solve the problems that must be solved
  – Locality (how many levels are necessary?)
  – Heterogeneity
  – Vertical communication management
    • Horizontal is solved by MPI (or PGAS)
  – Fault resilience, maybe
    • Look at the 800-cabinet K machine
  – Dynamic resource management
    • Definitely for irregular problems
    • Maybe for regular ones on “irregular” machines
  – Resource management for dynamic distributed runtimes


Page 21:

Irregular vs. Regular Parallelism

• Computations with regular task graphs can be automatically virtualized / scheduled
  – By a compiler or runtime system
• Fork/join graphs (no out-of-band dependencies) can be scheduled
  – By a runtime system (e.g., Cilk); see the sketch below
  – A greedy scheduler (stealing or pushing) is optimal in time
  – Stealing is optimal in space (but slower to load balance)
• General DAGs are more complicated
  – Either preemption or user awareness is needed
• Conclusion: if your computation is not regular, the runtime system should be dynamic, i.e., virtualize the processors
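A minimal fork/join example in Cilk Plus (C with cilk_spawn/cilk_sync), shown only to illustrate the kind of task graph a work-stealing runtime schedules; the fib example is mine, not from the slides:

  #include <cilk/cilk.h>

  /* All dependencies are expressed by spawn/sync nesting, which is what lets
   * a greedy work-stealing scheduler stay within the classic bound
   * T_P <= T_1/P + T_infinity. */
  long fib(long n) {
      if (n < 2) return n;
      long a = cilk_spawn fib(n - 1);  /* child strand; an idle worker may steal it */
      long b = fib(n - 2);             /* parent keeps working on its own deque */
      cilk_sync;                       /* join point: no out-of-band dependencies */
      return a + b;
  }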

Page 22:

What to Virtualize?

• Register count: hide
  – Compilers prove this, but need autotuners
• Separate address spaces: hide
  – PGAS proves this; needs to be shown for GPUs
• Performance-partitioned memory: expose
  – For distributed memory it needs to be exposed
• Number of cores: depends
  – On shared memory, we can virtualize
  – On distributed memory, mostly not
  – Open question for non-SPMD PGAS


Page 23:


Keeping the Users with Us

– Demise of flat MPI
– Demise of OpenMP “as we know it”
– Game programmers recruited

Page 24:

Two Parts of a Programming Model

• The parallel control model:
  – data parallel (single thread of control)
  – dynamic threads
  – single program multiple data (SPMD)

• The model for sharing/communication:
  – shared memory: load / store
  – message passing: send / receive
  (implied synchronization for message passing, not for shared memory)

Page 25:

Irregularity in Task Costs


Page 26:

Irregularity in Task Dependences

Page 27:

Irregularity in Task Locality (Communication)

Page 28:

Evolution of Abstract Machine Models

[Figure: evolution of abstract machine models: SMP (all processors share one memory); MPI distributed memory (each processor has its own local memory); PGAS (each processor has local memory plus a partition of a shared address space); HPGAS (hierarchical PGAS, adding chip-level shared memory, with open questions marked “?”)]

Page 29:

PGAS for Local Store Memory

• How does local store memory affect PGAS?
  – No casts from pointer-to-shared to pointer-to-private (this is a big change! see the sketch below)
  – Assignments and bulk operations work
  – Could add an additional notion of local for faster pointers, and distinguish private from local

[Figure: HPGAS model with per-processor local memory, chip-level shared memory, and a global shared level]
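A small UPC sketch of the first bullet above (assumed example): the cast that standard UPC allows, and that a local-store variant would forbid, with bulk operations as the remaining path.

  #include <upc.h>

  shared int A[THREADS];

  void example(void) {
      /* Standard UPC idiom: cast a pointer-to-shared with local affinity to a
       * private pointer for fast local access. */
      int *plocal = (int *)&A[MYTHREAD];
      *plocal = 42;

      /* With software-managed local-store memory, that cast could no longer be
       * allowed (shared data need not be addressable by private pointers).
       * Assignments and bulk operations such as upc_memget still work: */
      int tmp;
      upc_memget(&tmp, &A[MYTHREAD], sizeof(int));
  }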

Page 30:

Hierarchical PGAS Model

• A global address space for hierarchical machines may have multiple kinds of pointers
• These can be encoded by programmers in the type system or hidden, e.g., all global or only local/global (see the hypothetical sketch below)
• This partitioning is about pointer span, not control / parallelism

[Figure: pointers A, B, C, D with span 1 (core local), span 2 (chip local), span 3 (node local), and span 4 (global world)]
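A hypothetical sketch of how pointer span could surface in a UPC-like type system; the bracketed qualifiers are invented for illustration, and only the private/shared distinction exists in UPC today:

  /* Today's UPC distinguishes only two spans: */
  int        *p1;   /* span 1: private, usable only by this thread/core       */
  shared int *p4;   /* span 4: may reference any location in the global space */

  /* A hierarchical PGAS might add intermediate spans, e.g. (invented syntax):
   *   shared [chip] int *p2;    span 2: anywhere on this chip
   *   shared [node] int *p3;    span 3: anywhere on this node
   * Narrower spans permit cheaper representations and faster dereferences. */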

Page 31:

Type Systems Help: Hierarchical Pointer Analysis

• Demonstrated in a Java dialect called Titanium
• Statically determines what data each pointer can refer to
• Allocation sites approximated as abstract locations (alocs)
• All explicit and implicit program variables have points-to sets
  – The set of alocs that the variable may reference
• Abstract locations have span
  – Thread-local locations are created by the local thread
  – Process-local locations reside in the same memory space
  – Global locations can be anywhere

Page 32:

Pointer Analysis: Communication

• Communication produces a version of the source abstract locations with greater span

class Foo { Object z; }

static void bar() {
L1:  Foo a = new Foo();
     Foo b = broadcast a from 0;
     Foo c = a;
L2:  a.z = new Object();
     Ti.barrier();
     Object d = b.z;
}

Points-to sets (alocs 1 and 2):
  a    → 1t
  b    → 1g
  c    → 1t
  1t.z → 2t
  d    → 2g

Page 33:

Race Detection Results (Amir Kamil)

[Figure: static races detected (logarithmic scale) for the amr, gas, ft, cg, and mg benchmarks under the sharing, concur, feasible, feasible+AA1, and feasible+AA3 analyses]

Page 34:

Data Locality Detection Results

[Figure: percent of accesses determined to be process-local for the amr, gas, ft, cg, and mg benchmarks, comparing LQI, PA2, and PA3; PA2 = two-level hierarchy, PA3 = three-level hierarchy]

Page 35:

Current Programming Model for GPU Clusters

MPI/UPC + CUDA/OpenCL

[Figure: cluster nodes, each with CPUs and a GPU; MPI/UPC is used between nodes and CUDA/OpenCL within each GPU]

UPC + CUDA works today

Page 36:

Hybrid Partitioned Global Address Space

[Figure: hybrid PGAS layout across four processors, showing local and shared segments on host memory and on GPU memory]

• Each thread has only two shared segments, which can be either in host memory or in GPU memory, but not both.
• Decouples the memory model from execution models; it therefore supports various execution models.
• Caveat: the type system, and therefore the interfaces, blow up with different parts of the address space.

Page 37:

Data Transfer Example

• MPI+CUDA:

  Process 1:
    send_buffer = malloc(nbytes);
    cudaMemcpy(send_buffer, src);
    MPI_Send(send_buffer);
    free(send_buffer);

  Process 2:
    recv_buffer = malloc(nbytes);
    MPI_Recv(recv_buffer);
    cudaMemcpy(dst, recv_buffer);
    free(recv_buffer);

• UPC with GPU extensions:

  Thread 1:
    // copy nbytes of data from src on GPU 1 to dst on GPU 2
    upc_memcpy(dst, src, nbytes);

  Thread 2:
    // no operation required

Advantages of PGAS on GPU clusters:
  – No explicit buffer management by the user
  – Facilitates end-to-end optimizations such as data transfer pipelining
  – One-sided communication maps well to DMA transfers for GPUs
  – Concise code

Page 38:

Berkeley UPC Compiler

[Figure: Berkeley UPC software stack: UPC code → UPC compiler → compiler-generated C code → UPC runtime system → GASNet communication system → network hardware. Side labels: platform-independent, network-independent, language-independent, compiler-independent. The UPC runtime is used by bupc and gcc-upc; GASNet is used by Cray UPC, CAF, Chapel, Titanium, and others.]

Page 39:

Irregular Applications

• UPC was originally for “irregular” applications
  – Many recent performance results are on “regular” ones (FFTs, NPBs, etc.); those also do well
• Does it really handle irregular ones? Which?
  – Irregular in data accesses:
    • Irregular in space (sparse matrices, AMR, etc.): global address space helps; needs compiler or language support for scatter/gather
    • Irregular in time (hash table lookup, etc.): for reads, UPC handles this well; for writes you need atomic operations (see the sketch below)
  – Irregular computational patterns:
    • High-level independent tasks (ocean, atm, land, etc.): need teams
    • Non bulk-synchronous: use event-driven execution
    • Not statically load balanced (even with graph partitioning, etc.): need a global task queue
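A minimal UPC sketch of the “irregular in time” case (assumed example; the table size and hash are illustrative): a read of an arbitrary element is just an expression, while a concurrent remote insert would additionally need atomic operations.

  #include <upc.h>

  #define SLOTS_PER_THREAD 4096
  shared long table[THREADS * SLOTS_PER_THREAD];

  /* Lookup: the owning thread is picked by the hash, so the read may be local
   * or remote, but the code is the same either way. */
  long lookup(unsigned long key) {
      return table[key % (THREADS * SLOTS_PER_THREAD)];
  }

  /* A write-side operation (e.g., an insert or increment on a remote slot)
   * would race with other threads; that is where atomics are needed. */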


Page 40:

Distributed Tasking API for UPC

// allocate a distributed task queue
distq_t * all_distq_alloc();

// enqueue a task into the distributed queue
int distq_enqueue(distq_t *, distq_task_t *);

// dequeue a task from the local task queue
// returns NULL if a task is not readily available
distq_task_t * distq_dequeue(distq_t *);

// test whether the queue is globally empty
int distq_isempty(distq_t *);

// free distributed task queue memory
int distq_free(shared distq_t *);

[Figure: each thread's queue has a shared portion and a private portion; tasks are enqueued to and dequeued from the local queue]

Internals are hidden from the user, except that dequeue operations may fail and provide a hint to steal.
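A possible worker loop over this API (a sketch under the declarations above; run_task and the task representation are hypothetical):

  #include <stddef.h>

  /* Hypothetical task runner: executes one task and may enqueue children
   * via distq_enqueue(). */
  void run_task(distq_t *q, distq_task_t *t);

  void worker(distq_t *q) {
      while (!distq_isempty(q)) {              /* global emptiness test */
          distq_task_t *t = distq_dequeue(q);  /* local dequeue */
          if (t == NULL) {
              /* No task readily available locally; per the API, the failed
               * dequeue provides a hint so stealing can happen internally.
               * Simply retry until the queue is globally empty. */
              continue;
          }
          run_task(q, t);
      }
  }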

Page 41:

Machine Structure

• Provide a method to query machine structure at runtime

Team T = Ti.defaultTeam();

[Figure: default team hierarchy for 8 threads: {0,1,2,3,4,5,6,7} splits into {0,1,2,3} and {4,5,6,7}, which split into {0,1}, {2,3} and {4,5}, {6,7}]

Page 42:

Regular Team Structures

• Provide library functions to simplify building common team hierarchies

Team T = new Team();
T.splitTeam(3);

[Figure: a team of threads 0–11 split into three teams: {0,1,2,3}, {4,5,6,7}, {8,9,10,11}]

Page 43:

Task Hierarchy

• Thread teams may execute distinct tasks
  – e.g., forward FFT, multiplication/reduction, and reverse FFT in a parallel convolution
• A new partition construct enables this:

partition(T) {
  { forward_fft(); }
  { multiply_reduce(); }
  { reverse_fft(); }
}

Page 44:

Hierarchical Control Model

• Basic questions about the parallelism model
  – Parallel all the time with fixed parallelism (SPMD)
  – Dynamic parallelism: serial/parallel (loop annotations)
• Even with SPMD we need hierarchy
  – Collectives are a key feature and need to work on subsets
  – Programs need divide and conquer (e.g., reduce on rows)
  – Programs may use different algorithms depending on machine costs
• Teams may provide sufficient functionality
  – Create teams based on sets of thread IDs
  – Get a default team structure (hierarchy) from the runtime

T = get_row_teams(…);          T2 = node_teams();
split(T) {                     split(T2) {
  row_reduce();                  if (mythread == 0)
}                                  compute_local(); }

Page 45:

Hierarchical Programmers

• 2 types of programmers → 2 layers
• Efficiency layer (10% of today's programmers)
  – Expert programmers build frameworks & libraries, hypervisors, …
  – “Bare metal” efficiency possible at the efficiency layer
• Productivity layer (90% of today's programmers)
  – Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
  – Frameworks & libraries are composed to form app frameworks
• Effective composition techniques allow the efficiency programmers to be highly leveraged
• Create a language for Composition and Coordination (C&C)

Page 46:

Productivity Models

• The productivity programming model will hide some of the machine structure
  – Dynamic parallelism model for the application
    • Chapel-like data and task parallelism
    • Sequoia-like divide-and-conquer parallelism
  – Fewer levels of hierarchy
• Open question: for regular computations, do we need dynamic scheduling for performance heterogeneity due to fault recovery, power management, etc. (or is SPMD sufficient)?
• Open question: align the control/parallelism hierarchy with the memory/data hierarchy

Page 47:

Response of PGAS to Challenges

• Small memory per core
  – Ability to directly access another core's memory
• Lack of UMA memory on chip
  – Partitioned address space
• Massive concurrency
  – Good match for independent parallel cores
  – Not for data parallelism
• Heterogeneity
  – Need to relax strict SPMD with at least teams
• Application generality
  – Add atomics so remote writes work (not just reads)


Page 48:

NITRD Projects Addressed Programmer Productivity of Irregular Problems

Message Passing Programming:
  – Divide the domain up into pieces
  – Compute one piece and exchange
  – PVM, MPI, and many libraries

Global Address Space Programming:
  – Each thread starts computing
  – Grab whatever data you need, whenever you need it
  – UPC, CAF, X10, Chapel, Fortress, Titanium

Page 49:

Computing Performance Improvements will be Harder than Ever

[Figure: transistor counts (thousands) and clock frequency (MHz), 1970–2010, logarithmic scale]

Moore's Law continues, but power limits performance growth. Parallelism is used instead.

Page 50:

Energy Efficient Computing is Key to Performance Growth

[Figure: projected system power, 2005–2020: usual scaling vs. the exascale goal]

At $1M per MW, energy costs are substantial:
  • 1 petaflop in 2010 used 3 MW
  • 1 exaflop in 2018 would use 130 MW with “Moore's Law” scaling

This problem doesn't change if we were to build 1000 1-petaflop machines instead of one exaflop machine. It affects every university department cluster and cloud data center.

Page 51:

New Processor Designs are Needed to Save Energy

• Server processors are designed for performance
• Embedded and graphics processors use simple low-power cores → good performance/Watt
• New processor architectures and software are needed for future HPC systems

[Figure: cell phone processor (0.1 Watt, 4 Gflop/s) vs. server processor (100 Watts, 50 Gflop/s)]

Page 52:

Need to Redo Software and Algorithms

[Figure: energy per operation in picojoules (log scale, 1–10,000), now and projected, for arithmetic, register, cache, on-chip, memory, remote, and far-away (inter-node) accesses and messages]

Stop Communicating!

• Communication is expensive:
  – 10-100x in time/energy
  – Even to memory
• Computation is almost free!


• Software and algorithms
  – Software will control data movement to avoid waste
  – Algorithms should minimize data movement
  – Exascale science involves more complex interactions
