TRANSCRIPT
UPC Project Annual Review
Kathy Yelick, Associate Laboratory Director
Acting NERSC Director, Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley
Agenda
Hierarchical PGAS Programming
Design of Programming Models: What to Hide?
• Hidden from most programmers:
  – Limited DRAM: OS virtual memory
  – Limited registers: compilers
  – Network topology: MPI collectives
  – Separate address spaces: hidden by UPC, not MPI
• Exposed:
  – Processor count: not hidden by MPI/UPC
  – Partitioned memory space on GPU nodes
• “Performance programmers” may want to optimize for these hardware features
Challenges from Machines
[Figure: petascale vs. exascale systems]
Challenge #1: Scale / Topology
• Trend: exascale = ~10x more nodes than petascale
  – Increased pressure on / cost of the network
  – Reduced (relative) bisection bandwidth
  – Topology is more visible in performance
• Solutions:
  – For common communication patterns (e.g., collectives), hide in libraries
  – Expose topology to programmers
  – Note the job placement / utilization trade-off
• Conclusion:
  – Need to optimize bisection-sensitive communication
Challenge #2: Addressing
• Trend? integration of network / processor
  – DMA yes; atomics maybe; invocation no (deadlock)
  – Especially useful for unexpected communication
• Trade-offs well understood [Active Messages]
  – Changing the program counter is fast / easy
  – But resource management is key
• Conclusion: we should demand this support
[Figure: a one-sided put message carries an address and data payload and can be deposited directly into memory by the network interface; a two-sided send message carries a message id and data payload and must be matched by the processor]
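To make the invocation trade-off concrete, below is a minimal active-message-style sketch; am_request and the handler signature are hypothetical illustrations, not GASNet's (or any real library's) API.

    /* Hypothetical active-message sketch; am_request is illustrative,
       not a real API.  The handler runs on the target when the message
       arrives -- changing the program counter is the fast, easy part. */
    #include <stddef.h>
    #include <stdint.h>

    typedef void (*am_handler_t)(void *payload, size_t nbytes);

    /* hypothetical: send payload to node dest and run handler h there */
    void am_request(int dest, am_handler_t h, void *payload, size_t nbytes);

    /* Example handler for an unexpected remote increment.  The hard part
       is resource management: bounding the buffers and replies pending
       handlers may consume, or invocation can deadlock. */
    static void incr_handler(void *payload, size_t nbytes) {
        (void)nbytes;
        uint64_t *counter = *(uint64_t **)payload; /* target-local address */
        (*counter)++;
    }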
Challenge #3: Node Hierarchy
• Trend: new levels of software-exposed memory hierarchy
  1) Scratchpad memory
  2) NVRAM (a level between DRAM and disk)
  3) Co-processor split memory (CM5, RR, GPUs, …)
• We can automatically manage these:
  – Hardware: caches, coherent for #3
  – But it may not be good for power/performance
• Can make this explicit to software
  – Some software successes: compilers, autotuners, …
  – Cache-oblivious vs. cache-aware algorithms
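As a sketch of what making the hierarchy explicit looks like, consider staging tiles into scratchpad by DMA instead of relying on a hardware cache; dma_get and dma_wait here are hypothetical stand-ins for a Cell-style local-store API.

    /* Hypothetical software-managed scratchpad: the program, not a
       cache, decides what data to stage and when. */
    #include <stddef.h>

    #define TILE 256
    static double tile[TILE];  /* assume this buffer lives in scratchpad */

    void dma_get(void *dst, const void *src, size_t nbytes); /* hypothetical */
    void dma_wait(void);                                     /* hypothetical */

    double sum_array(const double *a /* in DRAM */, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i += TILE) {  /* assumes TILE divides n */
            dma_get(tile, a + i, TILE * sizeof(double));
            dma_wait();                /* double-buffering would hide this */
            for (int j = 0; j < TILE; j++)
                s += tile[j];
        }
        return s;
    }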
Challenge #4: Heterogeneity
• Many kinds of heterogeneity:
  – Functional heterogeneity
    • Node heterogeneity is standard
    • Core heterogeneity is emerging
  – Performance heterogeneity
    • We can fold functional heterogeneity into this
• Approach: better compiler / system support
[Figure: heterogeneity at the system, board, and chip levels]
Challenge #5: Resilience
• Resiliency challenges
  – Chance of component failure grows with system size
• Need realistic analysis
  – Which failures, how often, …
• Approach:
  – Component failures should not cause system-wide outages
  – Or application-wide outages
  – Detection by hardware
  – Isolation in software (at all levels)
Challenge #6: Load Balancing
• Where does the mapping of the graph to a particular number of processors happen?
  – The compiler: NESL, ZPL
  – The runtime system: Cilk
  – The programmer: MPI, UPC
• Irregular task graphs → irregular runtime systems
  – Unpredictability → dynamic runtime
• Regular task graphs (statically domain decomposed)
  – Should we pay for a dynamic runtime?
  – Depends: Charm++, OpenMP, X10, Chapel
Challenges of Scheduling in a Distributed Environment
• Theoretical and practical problem: memory deadlock
  – Not enough memory for all tasks at once
  – (Each update needs two temporary blocks, a green and a blue, to run)
  – If updates are scheduled too soon, you will run out of memory
  – Allocate memory in increasing order of factorization:
    • Don't skip any!
  – Thread blocks until enough memory is available
[Figure: task graph of the factorization; some edges omitted]
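A minimal sketch (mine, not the slides') of that blocking discipline using POSIX threads: temporary blocks are granted in a fixed global order, and a thread blocks until enough memory is free, so the lowest-numbered waiting task can always complete.

    /* Sketch: deadlock-free memory acquisition for scheduled updates. */
    #include <pthread.h>

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  freed = PTHREAD_COND_INITIALIZER;
    static int free_blocks = 8;   /* assumed pool size */
    static int next_task   = 0;   /* grants go in increasing task order */

    void acquire_blocks(int task_id, int nblocks) {
        pthread_mutex_lock(&lock);
        /* wait for our turn (don't skip any!) and for enough memory */
        while (task_id != next_task || free_blocks < nblocks)
            pthread_cond_wait(&freed, &lock);
        free_blocks -= nblocks;
        next_task++;
        pthread_cond_broadcast(&freed);
        pthread_mutex_unlock(&lock);
    }

    void release_blocks(int nblocks) {
        pthread_mutex_lock(&lock);
        free_blocks += nblocks;
        pthread_cond_broadcast(&freed);
        pthread_mutex_unlock(&lock);
    }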
Challenge #7: Resource Management
• The problem for putting high-level programming models into practice:
  – Dataflow
  – Remote function invocation
  – Distributed shared memory (with caching)
  – Cache-oblivious algorithms
  – Dynamic load balancing
• More research is needed to understand trade-offs in an exascale environment
Challenge #8: Avoiding Communication
• Start with good algorithms
• Sparse iterative (Krylov subspace) methods
• Can we lower data movement costs?
  – Take k steps with one matrix read from DRAM and one communication phase
    • Serial: O(1) moves of data vs. O(k)
    • Parallel: O(log p) messages vs. O(k log p)
• Even dense matrix multiply can avoid communication
• Can the programming model help?
  – Bandwidth avoidance (lower volume)
  – Latency avoidance (fewer events)
  – Latency hiding (up to 2x)
  – Synchronization avoidance (remove barriers)
Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin.
See Solomonik and Demmel for matmul.
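For reference, the sketch below shows the quantity the k-step kernel computes, written as a naive serial loop that sweeps A once per step; the communication-avoiding kernel reorganizes this loop nest (blocking rows and redundantly computing boundary entries) so each block of A is read from DRAM once per k steps rather than k times.

    /* Naive sketch: build the Krylov basis V = [x, Ax, ..., A^k x]
       with A sparse in CSR format.  V holds k+1 rows of length n. */
    void matrix_powers(int n, const int *rowptr, const int *col,
                       const double *val, const double *x,
                       int k, double *V /* size (k+1)*n */) {
        for (int i = 0; i < n; i++)
            V[i] = x[i];                        /* row 0 is x itself */
        for (int j = 1; j <= k; j++) {
            const double *prev = V + (j - 1) * n;
            double *next = V + j * n;
            for (int i = 0; i < n; i++) {       /* one full sweep over A */
                double sum = 0.0;
                for (int p = rowptr[i]; p < rowptr[i + 1]; p++)
                    sum += val[p] * prev[col[p]];
                next[i] = sum;
            }
        }
    }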
The UPC Experience
• Ecosystem:
  – Users with a need (fine-grained random access)
  – Machines with RDMA (not hardware GAS)
  – Common runtime
  – Commercial and free software
  – Center procurements
  – Sustained, many-year funding
Timeline:
• 1991: Active Messages are fast
• 1992: First Split-C (compiler GAS)
• 1992: First AC (accelerators + split memory)
• 1993: Split-C funding (DOE)
• 1995: Titanium funding (DARPA)
• 1997: First UPC meeting; "best of" AC, Split-C, PCP
• 2001: First UPC funding
• 2001: gcc-upc at Intrepid
• 2002: GASNet spec
• 2003(?): Berkeley compiler release
• 2006: UPC in NERSC procurement
• 2010: Hybrid MPI/UPC
• Other GASNet-based languages
Evolution vs. Revolution
• Revolutionary investigations can impact evolutionary approaches
  – GASNet/ARMCI → MPI one-sided
  – PGAS → OpenMP
  – Titanium → UPC
• Revolutions require broad support
  – And real incentives for change
• Languages are useful for clarifying ideas
  – No hand-waving in compilers
Risk
• High risk, high reward
  – Some projects should fail
• Whose risk?
  – MPI+OpenMP:
    • Failure of models to perform
    • Failure to recruit new applications
  – MPI+X (where X ≠ OpenMP):
    • Failure to converge on a portable model
  – New universal language:
    • Failure of applications to move
    • Failure of implementations to perform
• Risk, cost, and schedule are related
What is Exascale?
• Ensure DOE has machines to solve mission / science problems in 2020
• Ensure DOE has hw/sw technology from which to build machines in future
How to move the users?
Where are the users?
• Fortran with flat MPI
• Linda on IP sockets
• And adding Python
Then:
• Add OpenMP
• Add PGAS
• Add CUDA
Users in 2020:
• Add new features
• Add new languages
• Replace some code
Interoperability is essential
New ideas from basic research
Conclusions
• Solve the problems that must be solved
  – Locality (how many levels are necessary?)
  – Heterogeneity
  – Vertical communication management (horizontal is solved by MPI or PGAS)
  – Fault resilience, maybe
• Look at the 800-cabinet K machine
• Dynamic resource management
  – Definitely for irregular problems
  – Maybe for regular ones on "irregular" machines
  – Resource management for dynamic distributed runtimes
Irregular vs. Regular Parallelism
• Computations with regular task graphs can be automatically virtualized / scheduled
  – By a compiler or runtime system
• Fork/join graphs (no out-of-band dependencies) can be scheduled
  – By a runtime system (e.g., Cilk)
  – A greedy scheduler (stealing or pushing) is optimal in time
  – Stealing is optimal in space (but slower to load balance)
• General DAGs are more complicated
  – Either preemption or user awareness is needed
• Conclusion: If your computation is not regular, the runtime system should be dynamic, i.e., virtualize the processors
What to Virtualize?
• Register count: hide
  – Compilers prove this, but need autotuners
• Separate address spaces: hide
  – PGAS proves this; needs to be shown for GPUs
• Performance-partitioned memory: expose
  – For distributed memory it needs to be exposed
• Number of cores: depends
  – On shared memory, we can virtualize
  – On distributed memory, mostly not
  – Open question for non-SPMD PGAS
Keeping the Users with Us
• Demise of flat MPI
• Demise of OpenMP "as we know it"
• Game programmers recruited
Two Parts of a Programming Model
• The parallel control model
  – Data parallel (single thread of control), dynamic threads, or single program multiple data (SPMD)
• The model for sharing / communication
  – Shared memory (load/store) or message passing (send/receive)
  – Synchronization is implied for message passing, not for shared memory
• Irregularity in task costs (e.g., search)
• Irregularity in task dependences
• Irregularity in task locality (communication)
Evolution of Abstract Machine Models
[Figure: evolution of abstract machine models — SMP (processors sharing one memory), MPI distributed memory (each processor with its own local memory), PGAS (per-processor local memory plus a partitioned shared space), and hierarchical PGAS (HPGAS: adds a chip-level shared tier; details still open)]
PGAS for Local Store Memory
• How does local store memory affect PGAS?
  – No casts from pointer-to-shared to pointer-to-private (this is a big change!)
  – Assignments and bulk operations work
  – Could add an additional notion of local for faster pointers, and distinguish private from local
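For concreteness, a small standard-UPC fragment (my sketch) showing the cast that local-store memory would forbid: today, a pointer-to-shared with local affinity can be cast to an ordinary C pointer; with a local store, data would have to be moved explicitly instead.

    #include <upc.h>

    shared int data[THREADS];  /* one element with affinity to each thread */

    void example(void) {
        shared int *sp = &data[MYTHREAD];
        int *lp = (int *)sp;   /* legal UPC today: shared-to-private cast */
        *lp = 42;              /* fast private access to local shared data */

        /* With local-store memory the cast would be disallowed, but
           assignments and bulk operations still work: */
        int tmp;
        upc_memget(&tmp, sp, sizeof(int));
    }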
Hierarchical PGAS Model
• A global address space for hierarchical machines may have multiple kinds of pointers
• These can be encoded by programmers in the type system or hidden, e.g., all global or only local/global
• This partitioning is about pointer span, not control / parallelism
[Figure: pointers A, B, C, D with span 1 (core local), span 2 (chip local), level 3 (node local), and level 4 (global world)]
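One hypothetical C rendering of such pointers (not Titanium's actual representation): a fat pointer that records its span, so the compiler or runtime can choose the cheapest access path.

    /* Hypothetical fat pointer carrying its span. */
    typedef enum {
        SPAN_CORE   = 1,   /* core local   */
        SPAN_CHIP   = 2,   /* chip local   */
        SPAN_NODE   = 3,   /* node local   */
        SPAN_GLOBAL = 4    /* global world */
    } span_t;

    typedef struct {
        span_t span;    /* widest scope this pointer may reference */
        int    owner;   /* owning thread for spans wider than core */
        void  *addr;    /* address within the owner's memory       */
    } hier_ptr_t;

    /* A SPAN_CORE pointer dereferences as a plain load/store; wider
       spans may need on-chip routing or network RDMA, so narrowing a
       pointer's static span makes it both smaller and faster to use. */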
Type Systems Help: Hierarchical Pointer Analysis
• Demonstrated in a Java dialect called Titanium
• Statically determines what data each pointer can refer to
• Allocation sites approximated as abstract locations (alocs)
• All explicit and implicit program variables have points-to sets
  – The set of alocs that the variable may reference
• Abstract locations have span
  – Thread-local locations are created by the local thread
  – Process-local locations reside in the same memory space
  – Global locations can be anywhere
Pointer Analysis: Communication
• Communication produces a version of the source abstract location with greater span

    class Foo { Object z; }

    static void bar() {
     L1: Foo a = new Foo();
         Foo b = broadcast a from 0;
         Foo c = a;
     L2: a.z = new Object();
         Ti.barrier();
         Object d = b.z;
    }

Points-to sets (alocs 1 and 2; t = thread-local, g = global):
    a → 1t
    b → 1g
    c → 1t
    1t.z → 2t
    d → 2g
Race Detection Results (Amir Kamil)
[Figure: static races detected (logarithmic scale) per benchmark (amr, gas, ft, cg, mg) for the sharing, concur, feasible, feasible+AA1, and feasible+AA3 analyses; reported races fall from tens of thousands to tens as the analyses are layered]
Data Locality Detection Results
[Figure: percent of accesses determined to be process-local (higher is better) per benchmark (amr, gas, ft, cg, mg) for LQI, PA2, and PA3; PA2 = two-level hierarchy, PA3 = three-level hierarchy]
Current Programming Model for GPU Clusters
MPI/UPC + CUDA/OpenCL
[Figure: a cluster of nodes, each with two CPUs and a GPU; MPI/UPC spans the nodes while CUDA/OpenCL programs each GPU]
UPC + CUDA works today.
Hybrid Partitioned Global Address Space
[Figure: four processors, each with local segments on host and GPU memory, plus shared segments placed on host or GPU memory]
• Each thread has only two shared segments, which can be either in host memory or in GPU memory, but not both.
• Decouples the memory model from execution models, and therefore supports various execution models.
• Caveat: the type system, and therefore the interfaces, blow up with different parts of the address space.
Data Transfer Example
• MPI+CUDA

    /* Process 1: stage through a host buffer, then send */
    send_buffer = malloc(nbytes);
    cudaMemcpy(send_buffer, src, nbytes, cudaMemcpyDeviceToHost);
    MPI_Send(send_buffer, nbytes, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
    free(send_buffer);

    /* Process 2: receive, then copy to the GPU */
    recv_buffer = malloc(nbytes);
    MPI_Recv(recv_buffer, nbytes, MPI_BYTE, source, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    cudaMemcpy(dst, recv_buffer, nbytes, cudaMemcpyHostToDevice);
    free(recv_buffer);

• UPC with GPU extensions

    /* Thread 1: copy nbytes from src on GPU 1 to dst on GPU 2 */
    upc_memcpy(dst, src, nbytes);

    /* Thread 2: no operation required */

• Advantages of PGAS on GPU clusters:
  – No explicit buffer management by the user
  – Facilitates end-to-end optimizations such as data transfer pipelining
  – One-sided communication maps well to DMA transfers for GPUs
  – Concise code
Berkeley UPC Compiler
[Figure: the Berkeley UPC software stack — UPC code → UPC compiler → compiler-generated C code → UPC runtime system → GASNet communication system → network hardware. The generated C is platform-independent, the runtime is network-independent, and GASNet is language- and compiler-independent. The runtime is used by bupc and gcc-upc; GASNet is also used by Cray UPC, CAF, Chapel, Titanium, and others]
Irregular Applications
• UPC originally for "irregular" applications
  – Many recent performance results are on "regular" ones (FFTs, NPBs, etc.); those also do well
• Does it really handle irregular ones? Which?
  – Irregular in data accesses:
    • Irregular in space (sparse matrices, AMR, etc.): the global address space helps; needs compiler or language support for scatter/gather
    • Irregular in time (hash table lookup, etc.): for reads, UPC handles this well; for writes you need atomic operations (see the sketch after this list)
  – Irregular computational patterns:
    • High-level independent tasks (ocean, atm, land, etc.): need teams
    • Non-bulk-synchronous: use event-driven execution
    • Not statically load balanced (even with graph partitioning, etc.): need a global task queue
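To illustrate the write case: a sketch of a remote tally update, where atomic_fetchadd_remote stands in for whatever atomic operation the UPC implementation provides (the name and signature here are hypothetical).

    #include <upc.h>

    #define BINS 1024
    shared long counts[BINS];   /* distributed tally table */

    /* hypothetical: atomically add delta to a remote shared location */
    void atomic_fetchadd_remote(shared long *target, long delta);

    void tally(int bin) {
        /* counts[bin] += 1 is a remote read followed by a remote write:
           two threads can interleave these and lose updates. */
        atomic_fetchadd_remote(&counts[bin], 1);
    }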
Distributed Tasking API for UPC

    // allocate a distributed task queue
    distq_t *all_distq_alloc();

    // enqueue a task into the distributed queue
    int distq_enqueue(distq_t *, distq_task_t *);

    // dequeue a task from the local task queue;
    // returns NULL if a task is not readily available
    distq_task_t *distq_dequeue(distq_t *);

    // test whether the queue is globally empty
    int distq_isempty(distq_t *);

    // free distributed task queue memory
    int distq_free(shared distq_t *);

[Figure: enqueue and dequeue operate on queues spanning shared and private memory; internals are hidden from the user, except that dequeue operations may fail and provide a hint to steal]
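A sketch (not from the slides) of how a worker might drive this queue; the run/arg fields of distq_task_t are assumptions, since the API above leaves the task type opaque.

    // Hypothetical worker loop over the tasking API above.
    typedef struct distq_task {
        void (*run)(void *);   // assumed: task body
        void *arg;             // assumed: task argument
    } distq_task_t;

    void worker(distq_t *q) {
        while (!distq_isempty(q)) {
            distq_task_t *t = distq_dequeue(q);  // local queue first
            if (t == NULL)
                continue;  // not ready: treat as a hint to steal / retry
            t->run(t->arg);  // a task may distq_enqueue() new work
        }
    }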
Machine Structure
• Provide a method to query machine structure at runtime

    Team T = Ti.defaultTeam();

[Figure: eight threads (0-7); the default team hierarchy is {0,1,2,3,4,5,6,7}, splitting into {0,1,2,3} and {4,5,6,7}, which split further into {0,1}, {2,3}, {4,5}, and {6,7}]
Regular Team Structures
• Provide library functions to simplify building common team hierarchies

    Team T = new Team();
    T.splitTeam(3);

[Figure: threads 0-11 split into three teams: {0,1,2,3}, {4,5,6,7}, {8,9,10,11}]
Task Hierarchy
• Thread teams may execute distinct tasks
  – e.g., forward FFT, multiplication/reduction, and reverse FFT in a parallel convolution
• A new partition construct enables this:

    partition(T) {
      { forward_fft(); }
      { multiply_reduce(); }
      { reverse_fft(); }
    }
Hierarchical Control Model
• Basic questions about the parallelism model
  – Parallel all the time with fixed parallelism (SPMD)
  – Dynamic parallelism: serial/parallel (loop annotations)
• Even with SPMD we need hierarchy
  – Collectives are a key feature and need to work on subsets
  – Programs need divide and conquer (e.g., reduce on rows)
  – Programs may use different algorithms depending on machine costs
• Teams may provide sufficient functionality
  – Create teams based on sets of thread IDs
  – Get a default team structure (hierarchy) from the runtime

    T = get_row_teams(…);
    split(T) { row_reduce(); }

    T2 = node_teams();
    split(T2) { if (mythread == 0) compute_local(); }
Hierarchical Programmers
• 2 types of programmers → 2 layers
• Efficiency layer (10% of today's programmers)
  – Expert programmers build frameworks & libraries, hypervisors, …
  – "Bare metal" efficiency possible at the efficiency layer
• Productivity layer (90% of today's programmers)
  – Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
  – Frameworks & libraries are composed to form app frameworks
• Effective composition techniques allow the efficiency programmers to be highly leveraged
  → Create a language for Composition and Coordination (C&C)
Productivity Models
• A productivity programming model will hide some of the machine structure
  – Dynamic parallelism model for the application
    • Chapel-like data and task parallelism
    • Sequoia-like divide-and-conquer parallelism
  – Fewer levels of hierarchy
• Open question: for regular computations, do we need dynamic scheduling for performance heterogeneity due to fault recovery, power management, etc. (or is SPMD sufficient)?
• Open question: align the control / parallelism hierarchy with the memory / data hierarchy
Response of PGAS to Challenges
• Small memory per core
  – Ability to directly access another core's memory
• Lack of UMA memory on chip
  – Partitioned address space
• Massive concurrency
  – Good match for independent parallel cores
  – Not for data parallelism
• Heterogeneity
  – Need to relax strict SPMD with at least teams
• Application generality
  – Add atomics so remote writes work (not just reads)
NITRD Projects Addressed Programmer Productivity of Irregular Problems
Message Passing Programming
• Divide up the domain into pieces
• Compute one piece and exchange
• PVM, MPI, and many libraries
Global Address Space Programming
• Each thread starts computing
• Grab whatever / whenever
• UPC, CAF, X10, Chapel, Fortress, Titanium
Computing Performance Improvements will be Harder than Ever
Moore's Law continues, but power limits performance growth. Parallelism is used instead.
[Figure: transistors (thousands) and frequency (MHz), 1970-2010, log scale; transistor counts continue growing exponentially while clock frequency flattens]
Energy Efficient Computing is Key to Performance Growth
[Figure: projected system power, 2005-2020, comparing "usual scaling" against the exascale goal]
• At $1M per MW, energy costs are substantial
  – 1 petaflop in 2010 used 3 MW
  – 1 exaflop in 2018 would use 130 MW with "Moore's Law" scaling
• This problem doesn't change if we were to build 1000 1-petaflop machines instead of one exaflop machine. It affects every university department cluster and cloud data center.
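Worked through with the slide's own numbers: at $1M per MW (per year, as this figure is usually quoted), the 3 MW petaflop system costs about $3M per year in energy, while a 130 MW exaflop system would cost about $130M per year.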
New Processor Designs are Needed to Save Energy
• Server processors are designed for performance
• Embedded and graphics processors use simple low-power cores → good performance/Watt
• New processor architectures and software are needed for future HPC systems
• Cell phone processor: 0.1 Watt, 4 Gflop/s
• Server processor: 100 Watts, 50 Gflop/s
Need to Redo Software and Algorithms
[Figure: energy per operation in picojoules (log scale, 1-10000), now: on-chip operations cost far less than memory accesses, which cost far less than messages]
Stop Communicating!
• Communication is expensive:
  – 10-100x in time/energy
  – Even to memory
• Computation is almost free!
[Figure: energy cost per operation rises from arithmetic through register, cache, memory, and remote accesses to "far away"]
• Software and algorithms
  – Software will control data movement to avoid waste
  – Algorithms should minimize data movement
  – Exascale science involves more complex interactions