TRANSCRIPT
March 10, 2011 · Aachen, Germany
System Software for Exascale Platforms
Barney Maccabe
Computer Science and Mathematics Division
Oak Ridge National Laboratory
Friday, March 11, 2011
Exascale Systems Research Activities at ORNL
•Exascale Research Kickoff Meeting
 • San Diego, California; March 7-10
 • http://exascaleresearch.labworks.org/ascr2011
•Scott Klasky
•Jeff Vetter
•David Bernholdt
•Stephen Scott
In-situ Data Reduction and Analysis for Extreme Scale Science
Objectives
§ Create a robust I/O staging framework for in situ analysis and reduction of extreme-scale application data
§ Create and extend programming models for end scientists to utilize in situ I/O stream processing
§ Create a toolkit of I/O modules, such as FastBit indexing, that scientists can easily utilize
Approach
§ Leverage existing multi-year efforts in software tools for scientists: ADIOS for I/O abstraction in code, DataTap from GT for the staging area, FastBit from LBNL for indexing, Parallel R for analysis
§ Provide run-time parameterization of ADIOS methods through a control layer
§ Introduce a PGAS-like interface for staging-area processing to simplify use for scientists
Scenario of in situ I/O Pipeline analysis and visualization for fusion simulation data
Scott Klasky, ORNL; Podhorszki, Samatova, Wolf, ORNL; Shoshani,
Wu, LBNL; Schwan, GT
Impact
§ ADIOS is already used by many extreme-scale science codes
§ Extensions to ADIOS will be immediately inherited
§ Provide future-proofing for scalable I/O needs
§ Initial evaluation systems already deployed for codes running at over 100k cores (fusion, combustion)
FWP #ERKJU60
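The staged-I/O idea above can be sketched in plain Python. This is an illustration only, not the ADIOS or DataTap API; the class and function names are invented for the example. A simulation hands each output buffer to a staging area, where reduction/indexing plugins run in situ before anything reaches disk.

```python
# Illustrative sketch of in situ staging: analysis plugins run on data
# in the staging area before it is written out. All names are invented.

class StagingArea:
    def __init__(self, plugins):
        self.plugins = plugins      # in situ analysis/reduction modules
        self.outputs = []           # stand-in for the parallel file system

    def stage(self, name, data):
        summary = {}
        for plugin in self.plugins:
            summary.update(plugin(name, data))
        # Only the reduced summary (plus, in a real system, the raw or
        # compressed data) continues down the I/O pipeline.
        self.outputs.append((name, summary))
        return summary

def minmax_index(name, data):
    """A FastBit-flavored stand-in: compute a tiny index/summary."""
    return {"min": min(data), "max": max(data), "count": len(data)}

# A "simulation timestep" produces data; staging reduces it in situ.
staging = StagingArea(plugins=[minmax_index])
summary = staging.stage("temperature", [270.0, 301.5, 288.2])
print(summary)   # {'min': 270.0, 'max': 301.5, 'count': 3}
```

The point of the design is the plugin list: scientists compose indexing, reduction, and analysis modules without touching the simulation's I/O calls.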
Blackcomb: Hardware-Software Co-design for Non-Volatile Memory in Exascale Systems
Objectives
§ Propose new distributed computer architectures that address the resilience, energy, and performance requirements of future DOE exascale systems:
 § replace mechanical-disk-based data stores with energy-efficient non-volatile memories;
 § place low-power compute cores close to the data store;
 § reduce the number of levels in the memory hierarchy.
§ Evaluate the impact of the proposed architectures on the performance of critical DOE applications.
Jeffrey Vetter, ORNL; Robert Schreiber, HP Labs;
Trevor Mudge, University of Michigan; Yuan Xie, Penn State University
FWP #ERKJU59
A comparison of various memory technologies
Approach
§ Identify and evaluate the most promising non-volatile memory (NVM) technologies.
§ Explore assembly of NVM technologies into a storage and memory stack.
§ Propose an exascale HPC system architecture that builds on our new memory architecture.
§ Build the abstractions and interfaces that allow software to exploit NVM to its best advantage.
§ Characterize key DOE applications and investigate how they can benefit from these new technologies.
Impact
§ Address the energy scalability of future exascale systems; NV memories have zero standby power.
§ Increase system reliability; MRAM/PCRAM are resilient to soft errors.
§ Develop new programming models that exploit NVMs to improve the fault tolerance of applications.
Vancouver: A Software Stack for Productive Heterogeneous Exascale Computing
Objectives
§ Enhance programmer productivity for the exascale
§ Increase code-development ROI by enhancing code portability
§ Decrease barriers to entry with new programming models
§ Create next-generation tools to understand the performance behavior of an exascale machine
Approach
§ Programming tools: GAS programming model; analysis, inspection, transformation
§ Software libraries: autotuning
§ Runtime systems: scheduling
§ Performance tools
§ Impact on DOE applications
The proposed Maestro runtime simplifies programming heterogeneous systems by unifying OpenCL task queues into a single high-level queue.
Impact
§ Reduced application development time
§ Ease of porting applications to heterogeneous systems
§ Increased utilization of hardware resources and code portability
ERKJU44
Jeffrey Vetter, ORNL; Wen-Mei Hwu, UIUC;
Allen Malony, University of Oregon; Rich Vuduc, Georgia Tech
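The Maestro idea of presenting one high-level queue that fans work out to per-device queues can be sketched as a toy model. The scheduling policy and all names here are assumptions for illustration, not Maestro's actual design or the OpenCL API:

```python
# Toy model of a unified task queue over heterogeneous devices.
# Tasks go to one queue; a scheduler dispatches each task to the
# least-loaded device queue (stand-ins for OpenCL command queues).

class Device:
    def __init__(self, name):
        self.name = name
        self.queue = []          # stand-in for an OpenCL command queue

class UnifiedQueue:
    def __init__(self, devices):
        self.devices = devices

    def submit(self, task):
        # Simple load-balancing policy: shortest queue wins.
        target = min(self.devices, key=lambda d: len(d.queue))
        target.queue.append(task)
        return target.name

devices = [Device("cpu"), Device("gpu0"), Device("gpu1")]
uq = UnifiedQueue(devices)
placements = [uq.submit(f"kernel{i}") for i in range(6)]
print(placements)   # each device ends up with two tasks
```

The application sees only `submit`; which device runs each kernel becomes a runtime policy decision, which is the productivity claim the slide makes.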
COMPOSE-HPC: Software Composition for Extreme Scale Computational Science and Engineering
Objectives
§ Develop a flexible, extensible toolkit to help software developers address various kinds of software composition challenges
§ Provide examples of how the toolkit can be applied to specific composition problems
§ Develop new approaches to facilitate composition of parallelism (threads and processes)
Approach
§ Develop the Knot Nimble Orchestration Toolkit (KNOT), consisting of three components:
 § An annotation parsing facility (PAUL) to interpret guiding annotations embedded as comments in user source code
 § A transformation facility (ROTE) to apply source-to-source transformations to the code, based on annotations and other inputs
 § A code generation facility (BRAID) capable of manipulating compiler-like intermediate representations, transforming, optimizing, and generating source code based on the results
§ Develop examples of using KNOT to address different composition problems: language interoperability, contract enforcement, automatic performance instrumentation, data marshalling for GPUs
§ Build on this infrastructure to facilitate the composition of parallelism:
 § Allow codes to express their preferred threaded execution model and call modules with different threading models
 § Simplify the expression and exploitation of MPMD parallelism
David E. Bernholdt, ORNL; Tom Epperly, LLNL;
Manoj Krishnan, PNNL; Matt Sottile, Galois;
Rob Armstrong, SNL
Impact
§ Practical tools to help software developers produce higher-quality code that works effectively in modern HPC environments
§ Bring the capabilities of code transformation and code generation to bear on the challenges of software composition
ERKJU68
A schematic of the KNOT tool chain
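The annotation-driven flow can be illustrated with a minimal parser in the spirit of PAUL. The `%EXAMPLE` marker, the comment syntax, and the dictionary output are invented for this sketch; the real PAUL syntax may differ. The key property shown is that annotations ride along in ordinary comments, so tools that know nothing about KNOT still see plain source:

```python
import re

# Minimal sketch of comment-embedded annotations (PAUL-like).
# An annotation lives in a normal C comment, e.g.:
#   /* %EXAMPLE contract: require(n > 0) */
# so compilers that know nothing about KNOT still accept the file.
ANNOT = re.compile(r"/\*\s*%EXAMPLE\s+(\w+):\s*(.*?)\s*\*/")

def extract_annotations(source):
    """Return (kind, payload) pairs found in comment annotations."""
    return ANNOT.findall(source)

code = """
/* %EXAMPLE contract: require(n > 0) */
int fib(int n) { return n < 2 ? n : fib(n-1) + fib(n-2); }
/* %EXAMPLE instrument: time(fib) */
"""
print(extract_annotations(code))
# [('contract', 'require(n > 0)'), ('instrument', 'time(fib)')]
```

In the full tool chain, a transformation stage (ROTE's role) would act on these pairs to rewrite or instrument the source.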
Enabling Exascale Hardware and Software Design Through Scalable System Virtualization
Objectives
§ Provide a novel solution for testing at scale
§ Ease the transition to production by supporting scaling of legacy system software
§ Enable advanced research:
 § Architecture research toward exascale
 § New parallel programming models
 § System software research
 § Resilience
Approach
§ Extend the Kitten/Palacios prototype:
 § Support for modern hardware
 § Port to HPC operating systems
 § Integration of system management tools
§ Design & implement new capabilities:
 § Integration with a micro-architectural simulator
 § Binary translation for the emulation of new hardware
 § Fault injection
Kitten/Palacios Architecture: Virtualization for Exascale
Impact
§ Provide a test-bed solution for exascale:
 § Vertical profiling
 § Fault injection
§ Provide a platform for exascale research:
 § System architecture research
 § Programming languages research
 § System software research
 § Resilience
ERKJU70
Stephen L. Scott, ORNL; Patrick Bridges, UNM;
Peter Dinda, NWU; Kevin Pedretti, SNL
Operating Systems Research: Collaboration
•OS meeting held in Phoenix, Arizona, January 19, 2011
•Participants
 • Pete Beckman (ANL)
• Ron Brightwell (SNL)
• Kamil Iskra (ANL)
• Larry Kaplan (Cray)
• Barney Maccabe (ORNL)
• Ron Minnich (SNL)
• Marc Snir (UIUC)
• Bob Wisniewski (IBM)
•Management of protection and capabilities
 • isolation is the key challenge
•First-level handling of “rule breaking”
 • interrupts, traps, exceptions, and faults
•Common mechanisms and services:
 • Resource management mechanisms – not necessarily policies
 • Instrumentation/introspection
 • Reliable and scalable communication and control protocols
 • External communication
 • …
What is Operating Systems Research?
•Concurrency
 • O(1B) threads
 • O(1k–10k) threads per node
•Heterogeneity
 • Different types of cores
 • Non-coherent shared memory
 • Deeper memory hierarchies
•Energy constraints and power management
•Evolving balance between compute, memory, and communication
•More frequent failures
•More complex applications
 • Dynamic, data-dependent algorithms; multiscale, multiphysics
•More complex software
 • Python, C++, Fortran, OpenMP, MPI, libraries, frameworks; run-time adaptation
•Legacy compatibility and toolchain support (TCP/IP…)
•Added HW functionality
 • E.g., better HW support for protection
Driving Forces
Increasing importance of effectively managing increasingly complex resources
•Increasing complexity
 • Memory hierarchy becoming deeper
 • New technologies, e.g., SSD
 • Variety of computing resources
•Increasing importance
 • Parallelism increasing in cores/node and number of nodes
 • Scaling efficiency is even more critical
 • Strong scaling will become more important
 • Power management will be a real consideration
 • Resilience in the presence of component failures must be addressed
•Focus on tool developers, runtime/middleware/library writers, and subsystem developers (I/O, viz), not application developers
•Sustainability is a key consideration
 • Needs to be accepted by vendors and the community
•Flexibility to handle change
 • x86/PPC today; ARM tomorrow? (think about revolution)
•Co-design opportunities:
 • Customers: tools, libraries, programming models, subsystems (I/O, viz)
 • Suppliers: HW architecture
Overall Considerations
iOS-4 as an example of co-design
•Multitasking has the potential to
 • drain battery life and
 • interfere with the foreground application
•iOS 4 introduced multitasking
 • specialized services generally available:
  • complete tasks (downloads)
  • continue where you left off (games)
  • receive push and local notifications
 • general services specially available:
  • listen to music
  • run your GPS
  • get VOIP calls
“We looked at 1000s of apps. This is what they needed.”
“Hardware and software made for each other.”
Nominal Architecture
• Enclave: transitory parallel job or persistent parallel subsystem (e.g., parallel file system, external gateways, monitoring subsystem)
• Little/no time sharing of resources – enclaves are non-interfering (performance, power management)
 • But different enclaves could reside on different cores of the same node
• Specialized resources (performance, power)
• Global/enclave-level/node-level OSR (implemented via node-level software) – more functionality than today (resilience, power, complex applications…)
• Enclaves could be hierarchical
[Diagram: the System contains Enclaves; each Enclave contains Nodes]
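The nominal architecture above, a system partitioned into possibly hierarchical, non-interfering enclaves that each own nodes, maps naturally onto a small tree structure. The enclave names and node counts below are invented for illustration:

```python
# Toy model of the enclave hierarchy: a system partitioned into
# non-interfering enclaves (jobs or persistent subsystems), each of
# which owns nodes and may contain child enclaves.

class Enclave:
    def __init__(self, name, nodes=0, children=()):
        self.name = name
        self.nodes = nodes              # nodes owned directly
        self.children = list(children)  # hierarchical sub-enclaves

    def total_nodes(self):
        return self.nodes + sum(c.total_nodes() for c in self.children)

system = Enclave("system", children=[
    Enclave("pfs", nodes=16),                  # persistent parallel file system
    Enclave("job42", nodes=1024, children=[
        Enclave("job42/viz", nodes=8),         # hierarchical sub-enclave
    ]),
])
print(system.total_nodes())   # 1048
```

Resource-management decisions (power budgets, failure notification scopes) would then be made per subtree rather than globally.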
•Both evolutionary and revolutionary approaches are expected to be necessary.
•Evolutionary: low-risk research/improvement of existing technology areas, based on past/current experience and utilizing known techniques
 • Might get us to the “gen 1” system
•Revolutionary: higher-risk research in new areas, requiring more innovative solutions
 • Expected to be required for a successful “gen 2” system
•Note: either approach is meant to result in production-quality software.
Evolutionary vs Revolutionary
•Traditional vendor-supplied hypervisor/kernel on each node for setup and “legacy services”
•Run-time services provided either by libraries (e.g., thread scheduling) or by the kernel
 • Modified Linux or LWK
•Added kernel functionality:
 • Contiguous memory, large pages, improved scheduling, I/O forwarding
•All global functions (resource manager, control infrastructure for tools, …) built atop standard distributed computing protocols (TCP/IP, sockets, rsh, Kerberos…)
Evolutionary Track
•Vendor kernel/hypervisor SW used for node bootstrapping
 • Limited to a minority of cores (the system cores)
 • May include a traditional Linux kernel for legacy services
•Lightweight runtime system with limited functionality runs on the majority of cores (the compute cores)
 • Synchronization, communication, lightweight thread scheduling
 • Offloads complex operations to the system cores
•Global services can depend on new HW capabilities:
 • New communication protocols (RDMA, global communication)
 • More sophisticated protection HW (more rings, capability-carrying packets, privileged RMI, etc.)
•Global services can be layered atop global communication services
Revolutionary Track
•Pub-sub for failure events
 • Failures can be reflected back to the enclave RT
 • Out-of-band monitoring infrastructure has more than one consumer!
 • Includes passive monitoring and active monitoring (e.g., heartbeats)
•Pub-sub for performance sensors (including access to HW sensors – energy, temperature) and instantiation and control of SW sensors
•Reliable & scalable command-and-control infrastructure
 • Resource manager, debugger, run-time…
•Parallel coupling and communication protocols between enclaves (workflows, I/O, multi-module codes)
Example of Global Communication Services
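A pub-sub failure-event service of the kind listed above can be sketched in a few lines. This is a toy, single-process model with invented names; the real infrastructure would be distributed, reliable, and scalable:

```python
# Toy pub-sub bus for failure events: the out-of-band monitoring
# system publishes; multiple consumers (enclave runtime, resource
# manager, logger) each receive every event on the topic.

class FailureBus:
    def __init__(self):
        self.subscribers = {}            # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, event):
        for cb in self.subscribers.get(topic, []):
            cb(event)

bus = FailureBus()
seen_by_rt, seen_by_rm = [], []
bus.subscribe("node.failure", seen_by_rt.append)   # enclave run-time
bus.subscribe("node.failure", seen_by_rm.append)   # resource manager
bus.publish("node.failure", {"node": 1017, "cause": "heartbeat timeout"})
print(len(seen_by_rt), len(seen_by_rm))   # 1 1
```

The "more than one consumer" point on the slide is exactly what a topic-based bus buys: the resource manager and the enclave runtime react to the same event independently.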
•High cost of synchronization/on-node communication
 • Inefficient support for producer-consumer synchronization and collective synchronization
•Move toward NUMA requires careful mapping
 • Affinity to CPU and memory
•Heterogeneous cores
•Co-design:
 • Synchronization HW
 • Virtualization (in the sense of hiding a fixed amount of HW)
 • Same or different ISAs on heterogeneous cores?
Node Concurrency Challenges
•Evolutionary
 • Improved synchronization/IPC
 • Advanced affinity
•Revolutionary
 • Tight coupling with new HW mechanisms for
  • synchronization
  • event/message-driven thread scheduling, e.g., “active messages” (user-level interrupt table?)
 • Support for heterogeneous cores
 • Support for asymmetric kernels
Node Concurrency Technologies
•Inappropriate support for large working sets
 • NUMA
 • Non-coherent shared memory
 • Integration of NV-RAM in the memory hierarchy
  • Support for “out of core” codes, e.g., explicit migration to NV-RAM?
 • Leveraging the 64-bit address space
 • Integration of remote memory
Memory Challenges & Opportunities
•Evolutionary
 • NUMA-localized memory management
 • Management of coherence domains
 • Support for contiguous memory regions
  • if required by hardware
 • Very large pages
  • 2 MB “huge” pages are tiny
  • transparently accessible to user space
Memory Technologies
•Revolutionary
 • Incorporating NV-RAM
  • turn DRAM into an L4 cache?
  • make NV-RAM a swap space for paging?
 • Get rid of paging? Go back to base-and-bounds?
•Exploratory
 • Opportunities of a very large address space
  • map memory from all nodes?
 • Non-traditional memory management
  • non-contiguous heap
  • memory regions with virtual = physical addressing
Memory Technologies (cont’d)
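The "map memory from all nodes" question is worth a quick sanity check: a 64-bit address space is large enough to hold the aggregate memory of even a very large machine. The node count and per-node memory below are assumed round numbers, not figures from the talk:

```python
# Could one 64-bit address space map the memory of every node?
address_space = 2 ** 64                 # bytes addressable with 64 bits
nodes = 100_000                         # assumed machine size
per_node = 2 ** 40                      # assumed 1 TiB of memory per node

total = nodes * per_node                # aggregate machine memory
print(address_space // total)           # 167: the global memory fits,
                                        # with two orders of magnitude
                                        # of address space to spare
```

So the barrier to a globally mapped address space is protection, coherence, and OS support, not the width of a 64-bit pointer.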
•Has different time scales
 • Microseconds: uncorrectable data errors, node loss, etc.
 • Months: want to revive an app and rerun for some reason
•Today we do the same for all ranges:
 • Global checkpoint/restart
•Works because uptimes are so good (> 1 week)
•Because of the potential (energy) cost of low MTBF, we need mechanisms for alternative approaches
Resilience Challenges
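The urgency is easy to quantify: with independent node failures, system MTBF shrinks roughly linearly with node count. A back-of-the-envelope sketch, where the 5-year node MTBF is an assumed round number, not a measured figure:

```python
# Back-of-the-envelope: system MTBF ~ node MTBF / node count,
# assuming independent, exponentially distributed node failures.

node_mtbf_hours = 5 * 8760        # assume a 5-year node MTBF
for nodes in (1_000, 100_000):
    system_mtbf = node_mtbf_hours / nodes
    print(f"{nodes} nodes -> system MTBF {system_mtbf * 60:.0f} minutes")
# 100,000 nodes give a system MTBF of roughly 26 minutes -- far too
# short for global checkpoint/restart cycles that take comparable time.
```

This is why the week-long uptimes that make global checkpoint/restart work today cannot be expected to survive the jump in component count.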
•Checkpointing at various granularities
 • Thread/node/enclave
•Various levels of storage persistence
 • more/less information dispersal; more/less redundancy
 • atop NV-RAM and/or disk
 • Persistence hierarchy
•Logging services
•Various levels of communication reliability
•Scalable & reliable pub-sub infrastructure for failure events (HW, timeouts, run-time-defined checks…)
•Scalable & reliable coordination protocols
Mechanisms for Resilience
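The "various granularities, various persistence levels" idea can be sketched as a small checkpoint policy. All level names, costs, and intervals below are invented for illustration; real multi-level checkpointing systems choose levels dynamically from failure statistics:

```python
# Toy multi-level checkpoint policy: cheap, less durable levels are
# used often; expensive, durable levels rarely. Names are illustrative.

LEVELS = {
    # level: (relative cost, what a checkpoint at this level survives)
    "nvram_local":  (1,  "process failure"),
    "partner_node": (5,  "single node failure"),
    "parallel_fs":  (50, "full system outage"),
}

def checkpoint_level(step, partner_every=10, pfs_every=100):
    """Pick the persistence level for a given timestep."""
    if step % pfs_every == 0:
        return "parallel_fs"
    if step % partner_every == 0:
        return "partner_node"
    return "nvram_local"

schedule = [checkpoint_level(s) for s in range(1, 101)]
print(schedule.count("nvram_local"),      # 90
      schedule.count("partner_node"),     # 9
      schedule.count("parallel_fs"))      # 1
```

Most checkpoints stay in fast local NV-RAM; only a small fraction pay the cost of the parallel file system, which is the point of a persistence hierarchy.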
•Evolutionary:
 • Checkpointing run-time (with user/compiler/runtime help)
 • Differentiated communication reliability levels
 • Global services built atop TCP/IP
•Revolutionary:
 • Integration of in-band and out-of-band error detection and signaling
 • General pub-sub infrastructure for failure events
 • Differentiated storage reliability services
Resilience Technologies
Power Technologies
•Goal: promote power to a first-class resource
 • alongside CPU and memory
•Need a power allocation infrastructure (at the enclave and node level)
•Need power sensor information reflected to different resource management layers
•Need new HW capabilities
 • low-overhead suspend/resume of cores (ACPI too expensive)
  • Measured in cycles, not seconds
 • Notification of power profile changes
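Treating power as a first-class resource implies an allocation path, just as for CPU or memory. A toy enclave-level power-budget split, where the wattage, enclave names, and weights are all invented numbers:

```python
# Toy power-allocation sketch: a system budget is split among enclaves
# by weight, and each enclave splits its share across its nodes --
# mirroring how CPU and memory are already allocatable resources.

def allocate_power(system_watts, enclave_shares):
    """Split a system power budget by per-enclave weights."""
    total = sum(enclave_shares.values())
    return {name: system_watts * share / total
            for name, share in enclave_shares.items()}

budget = allocate_power(20_000_000,                # assumed 20 MW budget
                        {"job42": 8, "pfs": 1, "viz": 1})
per_node = budget["job42"] / 100_000               # job42 spans 100k nodes
print(round(budget["job42"]), round(per_node, 1))  # 16000000 160.0
```

The sensor feedback the slide calls for would close the loop: measured draw per enclave feeds back into the next allocation round.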
Barney’s pet peeve: Linux (we’re avoiding the hard conversation)
Linux is the dominant OS on the Top 500
•Lots of things happening in the late 1990s
•Extreme Linux Forum
•Linux Cluster Institute
•LANL Linux clusters
•Sandia Cplant
•Los Lobos and the original Roadrunner at UNM
Linux has been a great compute node OS for clusters
•Why Linux?
 • Beowulf clusters: many people could build a supercomputer
 • Cplant, Extreme Linux Forum, Linux Cluster Institute
•At least it wasn’t NT!
 • Desktop-oriented, not network-oriented
 • Unfamiliar/unsupported programming environment (limited tools)
•The goals of Linux and the needs of Exascale applications are divergent
•We need a real revolution
Linux and the key Exascale challenges
•Memory
 • Unlikely to explore (or support) HPC uses of new technologies, e.g., SSD
 • Refused to support big physical memory areas
 • ZeptoOS removes the standard memory management
•Application resilience
 • only critical for extreme-scale platforms
 • Why should Linux support resilience?
•Concurrency
 • Linux won’t dedicate a node to a single application
•Power issues may be supported
We’ll likely repeat the “OS Bypass” example
Enabling the Revolution
•The OS community needs to define an OS API for HPC
 • ZeptoOS and CNL don’t support most of the Linux API
  • did they throw away the same parts?
 • Linux is largely a starting point – what’s the destination?
 • Probably won’t include goofy calls like fork and exec
•This API needs to
 • Support HPC tools and runtime environments
 • Run easily on Linux clusters
  • Linux won’t go away, and tool/runtime developers need to have the HPC OS API everywhere
 • Support a virtualization layer that is capable of running an operating system of your choice, including Linux
 • Support application migration
 • Emphasize mechanism, not policy
Palacios Virtual Machine Monitor
•Collaboration between UNM, Northwestern, SNL, and ORNL
•Palacios handles all of the process/guest OS activities
•Current efforts focused on HPC communication layers
[Diagram: Palacios runs atop a host OS (Kitten or Linux) on bare hardware; guest OSes (Linux or Kitten) run HPC apps inside it]
Thanks