Dec. 19, 2005 HPC Productivity 1
Productivity in High Performance Computing
Overview
• Perspective
• Basic Principles
• Historical and Emerging HPC
• HPC Development Paradigm – Requirements
• HPC Development Paradigm – Concepts
• HPC Development Environment – An Example
• Connection to Other Research
• Research Issues
Perspective - Personal
• 49 years of programming
• 48 years of “HPC” programming
• 25 years of parallel/distributed/grid
programming
• Software tools and applications
PERSPECTIVE – Past Research
Transition from Serial to Vector to Parallel to
Distributed Architectures
1. Transition to Vector Processors – the promise and the reality
2. Programming systems for parallel architectures: 1980-1995
   Shared Memory – Distributed Memory
   Adaptations/extensions of serial languages
3. Programming systems for distributed architectures: 1995-2005
   Grid Programming Systems
Productivity
• “Cost of goal attainment”
• Cost = Σ (resources) – people and physical
• Goals (examples):
– Initial use of system
– Completion of problem instance
– N years of use
Productivity Principle #1
“Our ability to reason is constrained by the language in
which we reason”
Therefore programming systems should facilitate
reasoning about the issues of concern.
HPC has a plethora of different concerns
Challenge - Bring all these concerns into a unified context
Productivity Principle #2
“Automation of program composition”
The components from which programs are composed
must support automated composition.
Components must be meaningful in the context of an
application.
Challenge – Representation which enables automated
composition of programs.
Productivity Principle #3
“Design, implementation and adaptation should be a
unified evolutionary process.”
Design evaluation and system execution should be a
unified process.
Challenge – executable representation spanning
multiple levels of abstraction.
Challenge – Unification of design evaluation and
system execution.
Historical HPC
• Users – Small cadre of dedicated professional users combining discipline expertise with programming skills.
• Applications – Narrow family of applications, large PDE system solvers or signal analysis, static structure, visualization based analysis.
• Platforms – Specialized vector/parallel “supercomputer” systems – Closed set of resources -Stable over periods of hours or days.
• Algorithms – Static algorithms but multi-domain physical systems
• Goal – Solve largest possible problems within resource constraints.
Conventional Practice in Application Family
Development
– Comprehensive package of functional modules
– Common data structures.
– Many paths through system structure
– Users choose parameters to select execution paths
– Program is coded before performance is evaluated
Why Current Practice Needs Improvement
• Optimization and adaptation of parallel programs is effort intensive
– Different execution environments
– Different problem instances
• Direct modification of complete application is effort intensive
• Maintenance and evolution of parallel programs is a complex task
• Code structure is often sub-optimal for a given case and/or execution environment
Status of Conventional HPC
• Islands of excellence – application families in well-characterized domains and users of libraries for communication and interaction management.
• Productivity (by some metrics) little changed for two decades
• Complexity of the algorithms used and application system complexity have grown dramatically.
Emerging HPC Platforms
1. Broadly available commodity clusters of multi-
core processors.
(Lack of standard configurations)
2. Enormous specialized cluster architectures, e.g., Blue Gene
3. Grids – heterogeneous, unreliable and constantly
changing platforms
Each has different properties but really large
clusters and grids are beginning to have similar
characteristics.
Emerging Application Characteristics
• Multiple domains
• Complex adaptive algorithms
• Complex, possibly dynamic coordination/interaction structures
• Data intensive as well as computation intensive
• Interfaced to online data sources
• Integration of automated content analysis
• Require management of uncertainty
Status of Productivity for Mainstream Systems
Application/platform characteristics:
Serial with mostly straight-line interactions
Standard platforms
Productivity varies dramatically with domain:
Commonly used, well-supported domains (GUIs, RDBs, etc.) – factors of 10 over a decade or so
Specialized application domains – nearly unchanged since the 1970s
Application systems span multiple domains
Current Mainstream Programming Systems
(Why C/C++/Fortran are not suitable for HPC.)
• Assume serial execution
• Parallelism is deviation from normal behavior
• Representation of parallelism is ad hoc
• Locality is only implicitly addressed
• Don’t support automated composition
• Minimal coordination and interaction semantics
• Extension mechanisms have complex semantics
• Design is not really addressed and performance is not considered
Basis for Productivity Improvements
Broadly applicable domain analyses
Libraries implementing the domain analyses
Compositional tools (Language specific)
Cheap, resource-rich, uniform platforms – fast
turnaround
Abstraction – use of specification-level languages
Automation – Code generators from specifications
Design and validation/verification methods and
tools
Productivity Research in Mainstream Systems
Component-oriented development
Software architectures
Specification languages and code generators
Aspects/Features
Barriers to Productivity for HPC
(Things we can’t do anything about.)
Obvious Barriers
Market size
Few HPC-specialized tools
Heterogeneous, sparsely available platforms
Cultural Barriers
Parochialism and ignorance by all parties
Out-of-date education programs
Us versus them
Code first culture
HPC – CS Disconnect
Scalable Parallelism
Micro-and macro-locality
Increasing complexity of applications
Multiple application domains
Adaptive algorithms
Increasing complexity/diversity of execution platforms
Multi-level locality – cache to network scales
Multi-scale parallelism
Barriers to Productivity in HPC
(Things we can do something about.)
Current programming systems are a lousy basis for reasoning
about HPC
Current programming systems don’t support automated
composition of systems from components.
Absence of HPC-specific design and development methods,
processes and tools
Available programming systems don’t address HPC
requirements and concerns
Capabilities for Productivity in HPC
Automation of composition of programs
Self-describing components: components which expose enough semantic information – about the services they provide, the services they require, and their properties and behaviors – to enable a compiler to select a component on the basis of those services, properties and behaviors.
Capabilities for Productivity in HPC
Design and development methods, processes and tools which address HPC issues, e.g., performance
Design methods which incorporate design-to-performance and evaluation of performance at design time, including impacts of execution environments and problem instances
Tools for verification and validation, including assessing performance at component and total-system levels
Capabilities for Productivity in HPC
Unification of design-time, compile-time and runtime composition (adaptation)
Unification of composition among abstract and concrete
components – Design time evaluation
Unification of compile-time and runtime composition
Support for measuring and monitoring of execution
behavior
Support for intelligent analysis of execution behavior
Support for component/algorithm replacement
Capabilities for Productivity in HPC
Specification of dynamic, complex coordination and
interactions among components
• Make coordination/interaction a first-class concept in the programming system.
• Allow interactions to depend on the state of a component.
Capabilities for Productivity in HPC
Uncertainty management, adaptivity and fault-tolerance
Explicit representation of component state
Language support for measurement and monitoring
Language support for state analysis
Runtime support for runtime component replacement
Programming systems which address HPC issues
Language extensibility
Support for customization including syntax extensions and
execution environment specifications – Annotation language?
(Anyone have ideas on this?)
Programming systems which address HPC issues
Explicit representation of hierarchical locality
Configurations of data, processes and threads should be
explicitly specifiable to virtual machines.
Mapping of abstract machines to realized machines should be
represented.
(I have not thought through this one.)
Demonstration Implementation of Concepts
Problem Domain
Development of families of applications which are to be run
on (possibly multiple) large scale dynamic parallel and
distributed execution environments. A family of applications
is a set of programs for solution of a set of related
computational problems. Each instance should be efficient
for a specific case on a specific execution environment. It is
assumed that the programs may utilize adaptive algorithms.
Assumptions
The functionality from which many instances of an
application family can be composed can be
implemented as a reasonable set of well-specified
components
A parameterized coordination structure
(=dependence graph in terms of components) for the
program family is known at design time.
Goals
Order of magnitude productivity enhancement for application families
– Develop parallel programs from sequential components
– Reuse components
– Enable development of program families from multiple versions of components
– Automatic composition of parallel programs from components
– Enable design time evaluation of performance
– Incorporate adaptation and uncertainty management into the programming system
Conceptual Elements for Enhancing Productivity
• Self-describing components
• Coordination/interaction/composition interface specification language
• Programming model
• Automated composition of parallel/distributed programs from components
• Framework for unification of different semantic domains
• Unification of compile time and run time composition, enabling runtime adaptation on a component level
• Unification of abstract (simulated execution) and concrete execution (for performance modeling)
Demonstration Implementation – P-COM2
Description of Compositional Compiler - LCPC 2003
Case study on adaptation – ICCS 2005
Case study on evolutionary development – WOSP 2005
Case study of the benefits of componentization of the
Sweep3D benchmark – Compframe 2005 (submitted
to Concurrency and Computation)
Role-based Programming Model – Proc. Workshop on
Roles
http://www.cs.utexas.edu/users/pcom
Self-Describing Components
Functionality + composition/coordination/abstraction interface
[Diagram] A sequential computation (abstract or concrete) with:
– Provides interface: profile, state machine, protocol
– Requires interface: selector, transaction, protocol
Functionality: computation, measurement/monitoring, analysis
“Component” is recursive.
State machines capture enabling conditions and preconditions/postconditions.
2D FFT Example
• Steps for 2D FFT computation
– Partition given matrix row-wise
– Apply 1D FFT to each row of the partition
– Combine the partitions and transpose the matrix
– Partition transposed matrix row-wise
– Apply 1D FFT to each row of the partition
– Combine the partitions and transpose the matrix
– Transposed matrix is the 2D FFT of the original matrix
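The steps above can be sketched directly. This is a minimal pure-Python illustration (a recursive radix-2 Cooley-Tukey 1D FFT applied per row, with transposes), not the componentized P-COM2 implementation; matrix dimensions are assumed to be powers of two:

```python
import cmath

def fft1d(x):
    # Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + tw[k] for k in range(n // 2)] +
            [even[k] - tw[k] for k in range(n // 2)])

def transpose(m):
    return [list(row) for row in zip(*m)]

def fft2d(matrix):
    # Steps from the slide: 1D FFT per row, transpose,
    # 1D FFT per row again, transpose back.
    step1 = transpose([fft1d(row) for row in matrix])
    return transpose([fft1d(row) for row in step1])
```

In the componentized version, the row partitioning, per-row FFTs, and gather/transpose steps each become separate components (Distribute, FFT_Row, Gather_Transpose), as the interface specifications on the next slides show.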
selector:
string domain == "matrix";
string function == "distribute";
string element_type == "complex";
bool distribute_by_row == true;
transaction:
int distribute(out mat2 grid_re,out mat2 grid_im, out int n,
out int m, out int p);
protocol: dataflow;
profile:
string domain = "matrix";
string function = "distribute";
string element_type = "complex";
bool distribute_by_row = true;
transaction:
int distribute(in mat2 grid_re,in mat2 grid_im, in int n,
in int m, in int p);
protocol: dataflow;
2D FFT Example (Cont’d)
Above: the requires interface of Initialize (selector) and the provides interface of Distribute (profile).
{selector:
string domain == "fft";
string input == "matrix";
string element_type == "complex";
string algorithm == "Cooley-Tukey";
bool apply_per_row == true;
transaction:
int fft_row(out mat2 out_grid_re[],out mat2
out_grid_im[], out int n/p, out int m);
protocol: dataflow;
}index [ p ]
profile:
string domain = "fft";
string input = "matrix";
string element_type = "complex";
string algorithm = "Cooley-Tukey";
bool apply_per_row = true;
type = "concrete";
transaction :
int fft_row(in mat2 grid_re,in mat2 grid_im,in int n,
in int m);
protocol: dataflow;
2D FFT Example (Cont’d)
Above: the requires interface (partial) of Distribute (selector) and the provides interface of FFT_Row (profile).
selector:
string domain == "matrix";
string function == "gather";
string element_type == "complex";
bool combine_by_row == true;
bool transpose == true;
transaction:
int gather_transpose(out mat2 out_grid_re,out mat2
out_grid_im, out int me);
protocol: dataflow;
profile:
string domain = "matrix";
string function = "gather";
string element_type = "complex";
bool combine_by_row = true;
bool transpose = true;
transaction:
int get_no_of_p(in int n, in int m, in int p,in int state);
int gather_transpose(in mat2 grid_re,in mat2 grid_im,
in int inst);
protocol: dataflow;
2D FFT Example (Cont’d)
Above: the requires interface of FFT_Row (selector) and the provides interface of Gather_Transpose (profile).
2D FFT Example (Cont’d)
selector:
string domain == "matrix";
string function == "distribute";
string element_type == "complex";
bool distribute_by_row == true;
transaction:
%{ exec_no == 1 && gathered == p }%
int distribute(out mat2 out_grid_re,out mat2 out_grid_im, out int m, out int n*p,
out int p);
protocol: dataflow;
Above: the requires interface (partial) of Gather_Transpose (selector).
Capabilities Based on Self-Describing Components
• Compiler implementing recursive associative composition of components
• Compiler generation of parallelism at component level
• Runtime adaptation combining monitoring, analysis and composition
• Unified concrete and abstract execution (design-time performance evaluation)
• Framework for unification of concerns
Automated Composition Process
• Matching of Requires and Provides interfaces
• Matching starts from the selector of the start component
• Applied recursively to each matched component
• Output is a generalized dynamic data flow graph as defined in CODE (Newton ’92)
• Data flow graph is compiled to a parallel program for a specific architecture
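A rough sketch of this matching process, under the simplifying assumptions that a selector is a set of attribute equalities and the first matching library component is taken (the data layout and names are illustrative, not the actual compiler's):

```python
def matches(selector, profile):
    # A candidate's provides-profile must satisfy every selector attribute.
    return all(profile.get(k) == v for k, v in selector.items())

def compose(start, library):
    """Recursively resolve each component's requires-selectors against the
    provides-profiles in the library, yielding data flow graph edges."""
    edges, worklist, seen = [], [start], {start["name"]}
    while worklist:
        comp = worklist.pop()
        for selector in comp.get("requires", []):
            for cand in library:
                if matches(selector, cand["provides"]):
                    edges.append((comp["name"], cand["name"]))
                    if cand["name"] not in seen:
                        seen.add(cand["name"])
                        worklist.append(cand)
                    break  # take the first match
    return edges
```

Starting from an Initialize component whose selector names Distribute, and a Distribute whose selector names FFT_Row, this yields the chain Initialize → Distribute → FFT_Row as data flow edges.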
Language Framework Concept
“Our ability to reason is constrained by the language in which we reason”
Separation of concerns
Framework for unification of multiple representations
Language Framework Concept – Multiple Representations
Concern – Representation:
Locality mapping – specification language
Computation – C/C++/Fortran
Coordination/interaction, composition and abstraction – P-COM2 coordination/interaction specification language
Measurement and monitoring – API
Analysis and fault-tolerance – rule-based programming
Framework Concept – Multiple Tools
Composers, Weavers, Analyzers, Execution Engines
Composer – automates composition to meet specified system properties.
Weaver – source-to-source merges of different layers, if necessary.
Analyzer – static analysis, abstract/interpreted models of code, model checkers.
Execution engine – debuggers, simulated execution, direct execution, adaptive control.
Unification of Compile Time/Run Time Composition
Provides and Requires can be modified at runtime.
Requires/Provides match implemented in runtime system
Monitoring and adaptation components included in
composition
When preconditions/postconditions for a component are
not met, a requires interface for a predecessor component
is modified to require a different component.
Component is replaced using OS dynamic loader
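One hedged sketch of such a runtime re-match; the dynamic-loading step is only indicated in a comment, and all names and data shapes here are illustrative, not the actual runtime system's:

```python
def matches(selector, profile):
    return all(profile.get(k) == v for k, v in selector.items())

def adapt(predecessor, failed_selector, new_selector, library):
    """When a component's preconditions/postconditions fail, rewrite the
    predecessor's requires interface and re-resolve it against the library;
    the replacement would then be brought in via the OS dynamic loader."""
    reqs = predecessor["requires"]
    reqs[reqs.index(failed_selector)] = new_selector
    for cand in library:
        if matches(new_selector, cand["provides"]):
            return cand  # e.g. loaded with dlopen()/ctypes.CDLL in a real system
    return None
```

For example, a driver whose requires interface named a dense solver can be re-pointed at a sparse solver when monitoring shows the dense variant's conditions are not met.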
Component-Oriented Evolutionary Development
Do domain analysis (ontology) – define components,
attributes and coordination/interaction structure.
Create execution environment parameterized performance
models for implementations of components with complete
implementation of coordination/interaction behavior.
Compose program instances for target execution
environments and execute via unified execution engine.
Performance Evaluation - If all components are performance
models, then you have evaluated a performance model.
Evolution to Concrete – Replace abstract components by
concrete components. Model and concrete components can be
included in a single composition
Implementation of Unified Execution Engine
Runtime system which combines parallel/distributed simulation
with direct execution.
Based on coordination structure (data/control flow graph)
traversal.
Time management by generalized Lamport clocks at each
component (node in graph)
If a component is abstract it generates its own execution time
for the Lamport clock computation.
If a component is concrete, the execution time is measured.
Communication is also either modeled or concrete.
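A toy version of such an engine, under stated assumptions: abstract components carry a modeled execution time, concrete ones are timed with a wall clock, and the graph format and names are illustrative rather than the actual runtime system:

```python
import time

def topo_order(nodes, edges):
    # Kahn's algorithm over the coordination (data/control flow) graph.
    indeg = {n: 0 for n in nodes}
    for _, dst in edges:
        indeg[dst] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for src, dst in edges:
            if src == n:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    ready.append(dst)
    return order

def execute(nodes, edges):
    """nodes: name -> {"run": callable or None, "model": modeled seconds}.
    Returns the generalized Lamport clock value at each component."""
    preds = {n: [s for s, d in edges if d == n] for n in nodes}
    clock = {}
    for n in topo_order(nodes, edges):
        start = max((clock[p] for p in preds[n]), default=0.0)
        spec = nodes[n]
        if spec.get("run") is not None:   # concrete: measure the execution
            t0 = time.perf_counter()
            spec["run"]()
            elapsed = time.perf_counter() - t0
        else:                             # abstract: the model supplies the time
            elapsed = spec["model"]
        clock[n] = start + elapsed        # Lamport clock: max of predecessors + own time
    return clock
```

Because abstract and concrete nodes are handled uniformly, a composition mixing performance models with real implementations executes under the same engine, which is what unifies design-time evaluation with direct execution.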
Case Study – Optimization of Sweep3D
What is Sweep3D?
• Three-dimensional particle transport problem.
• ASCI Benchmark for high performance parallel architectures.
• Parallel wavefront computation via domain decomposition
Data Grid: 10x10x10
Processor Grid: 2x2x10
Data Flow Graph with Sweep3D Components
[Figure 1: data flow graph of the Sweep3D code. Nodes include start, read_input, allocate, initialize, octant, source, compute_flux (with kplane_block and angle_block), snd_outflows, rcv_inflows, flux_err, gather_data, print_results and stop; the scattering operator forms the “inner iterations” and the streaming operator the “sweep routine”.]
Productivity and Performance
Experiments
• Performance of Component-based code
• Adaptation to Execution Environment
– Memory System Optimizations
– Communication System Optimizations
– Communication/Memory Trade-off
Improved Serial and Parallel Performance
• Componentized code is faster on a single processor and gets better speedup in parallel execution.
[Chart: execution time in seconds vs. number of processors (up to ~30) for the original and componentized Sweep3D codes; problem size 100x100x100.]
Efficiency and Isoefficiency
• Efficiency of the original code declines as processors are added.
• For the componentized code, approximately fixed efficiency can be maintained by increasing the problem size as the number of processors increases.
[Chart: isoefficiency analysis – efficiency (0 to 1) vs. processors/problem size (1 to 8) for the original and componentized Sweep3D codes.]
Communication/Memory Trade-off
• Alternative implementations where invariant data is either kept as local state in each component or communicated among components.

Number of processors                       1      2       4      16     20     25
Runtime with invariants as comm. msgs.   164.9  88.751  45.65  13.82  12.34  12.38
Runtime with invariants as state         164.9  77.15   33.16  11.17  11.1   10.12
Synchronous Versus Asynchronous Communication
Table 6: Performance comparisons on a fixed problem size (100x100x100)

Number of processors     1      2       4      9      16     20
Synchronous comm.      164.9  88.751  45.65  22.79  13.82  12.34
Asynchronous comm.     164.9  80.11   36.45  16.27  13.24  11.63
Sweep3D Summary
• Sweep3D benchmark was mapped to components and
dozens of instances of the code realized.
• Productivity Enhanced - Adaptation and
optimizations in minutes or hours, not days or weeks
• Performance Enhanced - Component replacement for
optimizations for execution environments and
problem cases
• X10 Version of Sweep3D
Related Research
DARPA High Productivity Program
Software Engineering:
Component-oriented development
Software architectures
Grid Programming Systems – Automate, ICENI, etc.
Autonomic Computing
Agent-based systems
Role-Based Systems
Commercial IDEs – J2EE, JavaBeans, .NET, etc.
NOTE: All of software development is based on a few
simple principles. Different research communities use the
same ideas but give them different names and target different
problem domains.
Future Research
Unaddressed issues:
• Explicit parallelism within primitive components
• Locality management beyond components
• Multiple versions of components
• Use of software architectures in instance design
• Fault-tolerance except by replication
• Verification/validation of coordination behaviors by model checking