Application Challenges for Sustained Petascale
William Gropp (wgropp)
TRANSCRIPT
Performance, then Productivity
• Note the “then” – not “instead of”
  – For “easier” problems, it is correct to invert these
• For the very hardest problems, we must focus on getting the best performance possible
  – Rely on other approaches to manage the complexity of the codes
  – Performance can be understood and engineered (note I did not say predicted)
• We need to start now, to get practice
  – “Vector” instructions, GPUs, extreme-scale networks
  – Because exascale platforms will be even more complex and harder to use effectively
A “Bottom Up” Look at the Problem
• Focus on the features of the Cray XE/XK system as a model of petascale and trans-petascale systems
  – Heterogeneous at multiple levels:
    • Node functional units, node types
  – Network and network bandwidth may (but only may) be more representative of future directions
• A major challenge is to handle details in a way that is portable and efficient (OpenACC ( www.openacc-standard.org/ ) may not solve all of our problems … )
Node: Getting the Most Performance out of a NUMA SMP
• Process/thread mapping to chip/core
  – What is the quantitative model that permits reasoning about and automating this process?
  – Describing a “mapping” (almost) assumes a static mapping. How does more dynamic behavior fit with current techniques?
• Efficient use of node/chip-to-memory bandwidth
  – Prefetch, memory hierarchy optimizations, tuning for dynamic data patterns
• Efficient use of core/node computational resources
  – 8/16 core split; vector instructions
• Both algorithm and programming model implications
  – The algorithm can reflect the execution model; realizing it requires a programming model
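The mapping question above can be made concrete with a small sketch. This compares two common static thread-to-core placements on a hypothetical 2-socket NUMA node; the socket and core counts are assumptions for illustration, not a description of any specific system.

```python
# Two static thread->socket placements on an assumed 2-socket,
# 8-cores-per-socket NUMA node. Neither is "right" without a
# quantitative model of the application's memory behavior.

SOCKETS = 2
CORES_PER_SOCKET = 8

def compact_map(nthreads):
    """Fill socket 0 first: favors shared-cache locality."""
    return [t // CORES_PER_SOCKET for t in range(nthreads)]

def scatter_map(nthreads):
    """Round-robin across sockets: favors aggregate memory bandwidth."""
    return [t % SOCKETS for t in range(nthreads)]

# With 8 threads, compact uses one socket's memory controllers,
# scatter uses both.
print(compact_map(8))  # [0, 0, 0, 0, 0, 0, 0, 0]
print(scatter_map(8))  # [0, 1, 0, 1, 0, 1, 0, 1]
```

The point of the slide is that choosing between these (and dynamic variants) should follow from a model, not folklore.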
Many Nodes
• Process mapping to nodes
  – Topology-sensitive mapping
  – Quantitative reasoning about mapping
  – What changes if work is dynamic?
  – Relationship to multi-component applications
• The application is heterogeneous – how does that impact mapping, both initially and over time? What are the right software models?
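A toy instance of “quantitative reasoning about mapping”: count how many 4-neighbor halo links cross node boundaries under two placements of a process grid onto nodes. The grid and node sizes below are assumptions chosen to make the arithmetic small.

```python
# Count off-node halo links for an assumed 8x8 process grid on nodes
# of 4 processes each, under row-major vs blocked placement.

P = 8            # P x P process grid
NODE_SIZE = 4    # processes per node

def offnode_links(node_of):
    links = 0
    for i in range(P):
        for j in range(P):
            for di, dj in ((1, 0), (0, 1)):   # count each link once
                ni, nj = i + di, j + dj
                if ni < P and nj < P and node_of(i, j) != node_of(ni, nj):
                    links += 1
    return links

# Row-major: consecutive ranks (rank = i*P + j) share a node.
rowmajor = lambda i, j: (i * P + j) // NODE_SIZE
# Blocked: each 2x2 sub-block of the grid shares a node.
blocked = lambda i, j: (i // 2) * (P // 2) + (j // 2)

print(offnode_links(rowmajor))  # 64
print(offnode_links(blocked))   # 48
```

The blocked placement moves fewer messages off-node; a real mapping tool would weigh link counts against hop distances and contention on the actual topology.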
Discovering Performance Opportunities
• Let’s look at a single process sending to its neighbors. We expect the rate to be roughly twice that for the halo exchange (since this test only sends, rather than sending and receiving).
System    | 4 Neighbors | 4 Neighbors, Periodic | 8 Neighbors | 8 Neighbors, Periodic
BG/L      |   488       |   490                 |   389       |   389
BG/L, VN  |   294       |   294                 |   239       |   239
BG/P      |  1139       |  1136                 |   892       |   892
BG/P, VN  |   468       |   468                 |   600       |   601
XT3       |  1005       |  1007                 |  1053       |  1045
XT4       |  1634       |  1620                 |  1773       |  1770
XT4 SN    |  1701       |  1701                 |  1811       |  1808
Discovering Performance Opportunities
• Ratios of a single sender to all processes sending (in rate)
• Expect a factor of roughly 2 (since processes must also receive)
System    | 4 Neighbors | 4 Neighbors, Periodic | 8 Neighbors | 8 Neighbors, Periodic
BG/L      |  2.24       |                       |  2.01       |
BG/L, VN  |  1.46       |                       |  1.81       |
BG/P      |  3.8        |                       |  2.2        |
BG/P, VN  |  2.6        |                       |  5.5        |
XT3       |  7.5        |  8.1                  |  9.08       |  9.41
XT4       | 10.7        | 10.7                  | 13.0        | 13.7
XT4 SN    |  5.47       |  5.56                 |  6.73       |  7.06
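The two tables combine into an estimate of the per-process rate when all processes send: rate_all ≈ rate_single / ratio. The sketch below uses the XT4 non-periodic entries from the tables above; the point is that ratios far above the expected factor of ~2 signal contention, i.e., a performance opportunity.

```python
# Derived arithmetic only: per-process all-senders rate estimated from
# the single-sender rates and the measured ratios (XT4, non-periodic).

xt4_single = {"4 neighbors": 1634, "8 neighbors": 1773}
xt4_ratio  = {"4 neighbors": 10.7, "8 neighbors": 13.0}

for case in xt4_single:
    rate_all = xt4_single[case] / xt4_ratio[case]
    print(case, round(rate_all, 1))
# 4 neighbors: ~152.7; 8 neighbors: ~136.4 -- far below half the
# single-sender rate, so the network, not the endpoint, is the limit.
```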
Interconnect
• Need more general approaches for avoiding contention
  – Recent example: shuffle data in collectives to reduce contention (with Paul Sack, PPoPP 2012)
• Overlap communication and computation
• Exploit one-sided programming models
• Avoid alltoall in algorithms
  – Global FFTs move too much data for the value received
  – Need a better understanding of the accuracy requirements of the application
A “Top Down” Look At The Problem
• Consider the application and the mapping of the problem (not just the current algorithm) to current and future hardware
• One example: the use of FFTs for DNS simulations or for particle-mesh Ewald is just one of many possible choices, and other choices may provide sufficient accuracy at lower cost on large-scale platforms
  – Data motion is costly, not floating-point operations
Need for Adaptivity
• Uniform meshes are rarely optimal
  – More work than necessary
  – Note that minimizing floating-point operations will not minimize running time; a perfect irregular mesh is also not optimal
• Once adaptive meshing/model approximations are used, need to address load balance and avoid the use of synchronizing operations
  – No barriers
  – Nothing that looks like a barrier (MPI_Allreduce)
    • See MPI_Iallreduce, likely to appear in MPI-3
  – Care with operations that are weakly synchronizing
    • e.g., neighbor communication (it synchronizes, just not as tightly); using MPI_Send synchronizes
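The MPI_Iallreduce pattern mentioned above (start the reduction, overlap independent work, then complete it) can be sketched with a background thread as a shared-memory stand-in for the nonblocking collective. The real code would use MPI_Iallreduce plus MPI_Wait; everything here is illustrative only.

```python
# Shared-memory analogue of the nonblocking-allreduce pattern:
# start the global sum, do independent local work, then wait.

from concurrent.futures import ThreadPoolExecutor

def iallreduce_sum(contributions):
    """Pretend-nonblocking global sum: returns a future, not a value.
    (Sketch only: the executor is deliberately left running.)"""
    pool = ThreadPoolExecutor(max_workers=1)
    return pool.submit(sum, contributions)

local_values = [1.0, 2.0, 3.0, 4.0]               # one "contribution" per rank
future = iallreduce_sum(local_values)              # start the reduction...

independent_work = [v * v for v in local_values]   # ...overlap local work...

global_sum = future.result()                       # ...then complete (MPI_Wait)
print(global_sum)  # 10.0
```

In a CG solver, for example, the dot-product reduction can be started early and completed only where its value is actually needed, removing the barrier-like stall.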
Sharing an SMP
• Having many cores available makes everyone think that they can use them to solve other problems (“no one would use all of them all of the time”)
• However, compute-bound scientific calculations are often written as if all compute resources are owned by the application
• Such static scheduling leads to performance loss
• Pure dynamic scheduling adds overhead, but is better
• Careful mixed strategies are even better
• Recent results give 10–16% performance improvements on large, scalable systems
• Thanks to Vivek Kale
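A sketch of the mixed strategy mentioned above: give each thread a guaranteed static share of the loop iterations, and leave the remainder in a shared queue to absorb imbalance. The 0.5 static fraction is an assumption for illustration, not the tuned value behind the cited results.

```python
# Mixed static/dynamic loop scheduling: a static block per thread
# plus a shared dynamic queue for the leftover iterations.

def mixed_schedule(n_iters, n_threads, static_fraction=0.5):
    n_static = int(n_iters * static_fraction)
    chunk = n_static // n_threads
    static_part = [list(range(t * chunk, (t + 1) * chunk))
                   for t in range(n_threads)]
    # Everything not statically assigned is claimed dynamically at run time.
    dynamic_queue = list(range(chunk * n_threads, n_iters))
    return static_part, dynamic_queue

static_part, dynamic_queue = mixed_schedule(100, 4)
# Each of 4 threads owns 12 iterations statically; 52 remain for
# dynamic claiming when some cores are slowed by other work.
print(len(static_part[0]), len(dynamic_queue))  # 12 52
```

The static part keeps scheduling overhead and data-locality loss low; the dynamic tail tolerates cores being shared with other work.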
Need for Aggregation
• Functional units are cheap
  – Small amount of area, relatively small amount of power
  – Memory motion is expensive
  – Easy to arrange many floating-point units, in different patterns:
    • Classic vectors (“classic” Cray, NEC SX)
    • Commodity vectors (2 or 4 elements)
    • Streams
    • GPUs
  – All have different requirements on both the algorithms (e.g., work with full vectors) and the programming (e.g., satisfy alignment rules)
  – Compilers will be able to help but will not solve the problem
• Need better ways to generate fast and maintainable code
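“Work with full vectors” often comes down to padding: rounding array lengths up to a multiple of the vector width so the compute loop never needs a scalar remainder. The width of 4 below mirrors the commodity 4-element vectors mentioned above; the helper names are illustrative.

```python
# Pad array lengths to a multiple of an assumed vector width so
# every loop iteration is a full vector operation.

VECTOR_WIDTH = 4

def padded_length(n):
    """Round n up to the next multiple of VECTOR_WIDTH."""
    return ((n + VECTOR_WIDTH - 1) // VECTOR_WIDTH) * VECTOR_WIDTH

def pad(data, fill=0.0):
    """Extend data with a neutral fill value up to the padded length."""
    return data + [fill] * (padded_length(len(data)) - len(data))

x = pad([1.0, 2.0, 3.0, 4.0, 5.0])
# 5 elements round up to 8: two full vector operations, no remainder loop.
print(len(x))  # 8
```

The same idea shows up as alignment rules: allocations start on a vector-width boundary so vector loads never straddle it.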
Need for Appropriate Data Structures
• The choice of data structure strongly affects the ability of the system to provide good performance
  – The key is to work with the hardware provided for improving memory system performance, rather than using it as a crutch
  – This choice often requires a large-scale view of the problem and is not susceptible to typical autotuning approaches
• Refactoring tools may help existing applications
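One concrete instance of this choice is array-of-structures versus structure-of-arrays: SoA keeps each field contiguous, which is what prefetchers and vector loads want. The field names below are illustrative, not from the slides.

```python
# Array-of-structures (AoS) vs structure-of-arrays (SoA) sketch.

aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]   # AoS: fields interleaved

def to_soa(particles):
    """SoA: each field becomes one contiguous stream."""
    return {"x": [p["x"] for p in particles],
            "y": [p["y"] for p in particles]}

soa = to_soa(aos)
# Summing one field now walks a single contiguous array, which the
# memory system can stream and prefetch.
print(sum(soa["x"]))  # 4.0
```

This is exactly the kind of whole-program layout decision that local autotuning cannot discover, because it changes every loop that touches the data.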
Effective Sparse Matrix-Vector Implementation
• We have modified S-CSR and S-BCSR to match the requirements for vectorization
• We can use OSKI to optimize “within the loops”
• Need a corresponding approach for x86 and GPUs, and a method to hide the details
[Figure: SpMV on BlueBiou. Y-axis: Performance Ratio (0.00–2.00). Series: stream_un2, BLK12-VSX, S-CSR-2, S-CSR-4, S-CSR-2-VSX, S-CSR-4-VSX]
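For reference, the baseline these formats restructure is the plain CSR sparse matrix-vector multiply. The sketch below is the textbook loop, not the modified S-CSR/S-BCSR formats from the slide; its indirect access through col_idx is precisely what makes vectorization and prefetching hard.

```python
# Plain compressed-sparse-row (CSR) matrix-vector multiply: y = A*x.

def spmv_csr(row_ptr, col_idx, values, x):
    """row_ptr[i]..row_ptr[i+1] indexes the nonzeros of row i."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect load x[col_idx[k]]: the memory-system challenge.
            y[i] += values[k] * x[col_idx[k]]
    return y

# 2x2 example: A = [[2, 0], [1, 3]]
row_ptr = [0, 1, 3]
col_idx = [0, 0, 1]
values  = [2.0, 1.0, 3.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0]))  # [2.0, 4.0]
```

Blocked and streamed variants trade extra stored zeros for regular, vectorizable inner loops, which is the trade the chart above is measuring.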
Memory Locality and Multiphysics
• Vertically integrated (all modules within the same “locality domain”)
  – Not horizontally in processor blocks
  – Adapt for load balance
• Challenges
  – Minimize memory motion
  – Work within limited memory
  – Likely approach: interleave components in regions (nodes, if nodes have 1000s of cores)
Locality Domains
• In hardware, memory is in a hierarchy: core, memory stick, chip, node, module, rack, …
• Algorithms and implementations need to respect this hierarchy
Summary
• The new Blue Waters is a good test bed for extreme scale
  – Heterogeneous at all levels
  – Algorithms need to be more flexible and dynamic
  – Programming models need to be more flexible with details but with a realistic execution model
  – Applications need to reconsider choice of algorithms to match changing costs
  – Quantification of performance is one way to tie it all together
Thanks
• Torsten Hoefler
  – Performance modeling lead, Blue Waters; MPI datatypes
• David Padua, Maria Garzaran, Saeed Maleki
  – Compiler vectorization
• Dahai Guo
  – Streamed format exploiting prefetch
• Vivek Kale
  – SMP work partitioning
• Hormozd Gahvari
  – AMG application modeling
• Marc Snir and William Kramer
  – Performance model advocates
• Abhinav Bhatele
  – Process/node mapping
• Elena Caraba
  – Nonblocking Allreduce in CG
• Van Bui
  – Performance model-based evaluation of programming models
• Paul Sack
  – Collectives in the presence of contention
• Funding provided by:
  – Blue Waters project (State of Illinois and the University of Illinois)
  – Department of Energy, Office of Science
  – National Science Foundation