System-Directed Resilience for Exascale Platforms
LDRD Proposal 09-0016
Ron Oldfield (PI) 1423
Ron Brightwell 1423
Jim Laros 1422
Kevin Pedretti 1423
Rolf Riesen 1423
System-Directed Resilience for Exascale Platforms (09-0016)
Ron Oldfield (1423), Neil Pundit (1423), FY09-11, Total $1500 Costs
Problem
Current apps cannot survive a node failure
Proposed Solution
Application-transparent resilience to node failures
Approach
Design/develop system software to support:
• Application quiescence
• Efficient state management
• Automatic fault recovery
Significance of Results
• Represents a fundamental change in the way HPC systems support resilience.
• Significant impact on performance: less defensive I/O overhead for checkpoints.
• Higher levels of reliability.
• Improved productivity: developers worry less about resilience and focus more on core science.
R&D Goals & Milestones
• Investigate and develop new methods for quiescence that don’t hinder other apps.
• Identify critical application state and develop efficient methods to manage state.
• Identify system software requirements for
  – dynamic node allocation,
  – network/OS virtualization, and
  – MPI node recovery.
Relationship to Other Work
Scalability and efficient resource utilization, particularly memory and storage, are key issues for this effort.
Our team has R&D experience in:
• Scalable system software (LWK, Portals, LWFS)
• Smart memory management techniques (Smartmap)
• RAS systems
All efforts developed “lightweight” approaches that are both resource-efficient and scalable.
Resilience Challenges for Exascale
• Current application characteristics
  – Require large fractions of systems
  – Long running
  – Resource-constrained compute nodes
  – Cannot survive component failure
• Current options for fault tolerance
  – Application-directed checkpoints
  – System-directed checkpoints
  – System-directed incremental checkpoints
  – Checkpoint in memory
  – Others: virtualization, redundant computation, …
• We propose to develop system software resilient to node failure
  – Support for application quiescence,
  – Efficient (diskless) state management,
  – Fast methods for fault recovery.
Application Quiescence
Goal: Develop methods to suspend application activity without hindering progress of other applications
• Requires
  – Methods for accurate and efficient fault detection
  – Mechanisms and interfaces for conveying node state to shared services (e.g., need a functional RAS system)
• Approach
  – Integrated system software for cooperation among shared services and applications
    • Network layer: deal with messages in transit
    • File system: isolate and suspend in-progress I/O operations
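One way to picture the network-layer requirement (dealing with messages in transit) is a counting-based drain: stop accepting new sends, then deliver everything already in flight until the sent and delivered counts match. The sketch below is a toy, single-process illustration under that assumption. The `Channel` class and `quiesce` function are hypothetical names invented here; they are not part of Portals, MPI, or the proposed system.

```python
from collections import deque


class Channel:
    """Toy network channel (hypothetical): messages wait in a queue until delivered."""

    def __init__(self):
        self.in_transit = deque()
        self.sent = 0
        self.delivered = 0
        self.accepting = True

    def send(self, msg):
        """Queue a message; refuse it if the channel is being quiesced."""
        if not self.accepting:
            return False
        self.in_transit.append(msg)
        self.sent += 1
        return True

    def deliver_one(self):
        """Deliver the oldest in-flight message, or return None if there is none."""
        if self.in_transit:
            self.delivered += 1
            return self.in_transit.popleft()
        return None


def quiesce(channel):
    """Block new sends, then drain until every sent message has been delivered."""
    channel.accepting = False
    drained = []
    while channel.delivered < channel.sent:
        drained.append(channel.deliver_one())
    return drained


if __name__ == "__main__":
    ch = Channel()
    ch.send("a")
    ch.send("b")
    ch.deliver_one()
    print(quiesce(ch), ch.send("c"))
```

In a real system the counters would be distributed across nodes and the drain would have to avoid stalling traffic that belongs to other applications, which is the hard part the slide points at.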
State Management
Goal: Efficient methods for extracting and managing state
Approach
• Identify critical state
  – Characterize memory usage
  – Investigate resource-efficient methods for logging modified memory
  – App guidance to identify unnecessary data (e.g., ghost cells, cache)
• System guidance for when to extract state
• Explore diskless methods to manage state
• Explore state compression to reduce resource requirements
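The ideas of logging only modified memory, keeping state off disk, and compressing it can be combined in a small sketch. The example below is an illustration only, not the project's method: it assumes a fixed-size state buffer, detects modified blocks by comparing SHA-256 digests (a real system would more likely use page-protection or dirty bits), and keeps the compressed delta in memory. The names `block_digests`, `incremental_checkpoint`, `apply_delta`, and the 4096-byte `BLOCK_SIZE` are all hypothetical.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen to resemble a memory page


def block_digests(state, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of the state buffer."""
    return [hashlib.sha256(state[i:i + block_size]).digest()
            for i in range(0, len(state), block_size)]


def incremental_checkpoint(state, prev_digests, block_size=BLOCK_SIZE):
    """Return (delta, digests).

    delta maps block index -> zlib-compressed contents, for only those blocks
    whose digest differs from the previous checkpoint.
    """
    digests = block_digests(state, block_size)
    delta = {}
    for idx, digest in enumerate(digests):
        if idx >= len(prev_digests) or prev_digests[idx] != digest:
            start = idx * block_size
            delta[idx] = zlib.compress(state[start:start + block_size])
    return delta, digests


def apply_delta(base, delta, block_size=BLOCK_SIZE):
    """Rebuild the newer state from an older copy plus an in-memory delta."""
    buf = bytearray(base)
    for idx, packed in delta.items():
        block = zlib.decompress(packed)
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)


if __name__ == "__main__":
    old = bytes(4 * BLOCK_SIZE)
    new = bytearray(old)
    new[2 * BLOCK_SIZE + 7] = 0xFF  # modify a single byte in block 2
    delta, _ = incremental_checkpoint(bytes(new), block_digests(old))
    print(sorted(delta), apply_delta(old, delta) == bytes(new))
```

Application guidance (for example, excluding ghost cells) would shrink the buffer passed in; that step is left out here.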
Fault Recovery
Goal: Dynamically recover a failed node without restarting the whole application
Approach
• Explore changes to system software to support dynamic node allocation (for swap of failed node).
• Develop network virtualization to abstract physical node ID from software.
• Develop efficient methods for state recovery
  – Investigate roll-back, roll-forward techniques
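Abstracting the physical node ID from software can be thought of as an indirection table: the application addresses stable logical ranks, and system software remaps a rank to a spare node after a failure. The following is a minimal sketch under that assumption. The `NodeMap` class, its method names, and the notion of a pre-allocated spare pool are hypothetical and only illustrate the idea.

```python
class NodeMap:
    """Hypothetical logical-rank -> physical-node table with a pool of spares."""

    def __init__(self, physical_ids, spares):
        # Logical ranks are 0..N-1 and stay fixed for the life of the job.
        self._table = dict(enumerate(physical_ids))
        self._spares = list(spares)

    def resolve(self, logical_rank):
        """Translate the rank an application uses into the current physical node."""
        return self._table[logical_rank]

    def replace_failed(self, failed_physical_id):
        """Swap a failed node for a spare; return the logical rank that moved."""
        for rank, phys in self._table.items():
            if phys == failed_physical_id:
                if not self._spares:
                    raise RuntimeError("no spare nodes available")
                self._table[rank] = self._spares.pop(0)
                return rank
        raise KeyError(failed_physical_id)


if __name__ == "__main__":
    nodes = NodeMap(physical_ids=[100, 101, 102], spares=[200])
    moved = nodes.replace_failed(101)
    print(moved, nodes.resolve(moved))
```

Because peers keep addressing the same logical rank, only the mapping changes; restoring the replaced node's state (roll-back or roll-forward) is a separate step not shown here.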
Summary
• Recovering from independent node failures is a critical issue for exascale systems
• We address that problem through modifications to system software
  – Support for application quiescence,
  – Efficient (diskless) state management,
  – Fast methods for fault recovery.
Our approach represents a fundamental change in how systems support resilience
Reviewer Questions
• Programmatic
  – Firm commitments from team if LDRD goes forward?
  – Why is funding flat for FY10 and FY11?
• Technical
  – Is the assertion that “checkpoint overhead will exceed 50% beyond 100K nodes” too modest?
  – Why use the term “components” instead of cores or processors?
• Technical/Programmatic
  – Can the project really address all of the proposed work?
  – With 10-11 technical topics, have we identified all the technical risks?
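To reason about how checkpoint overhead grows with node count, a common first-order model can help. The sketch below is not the analysis behind the quoted assertion. It assumes independent node failures (so system MTBF is node MTBF divided by node count), uses Young's approximation for the checkpoint interval, tau = sqrt(2 * delta * M), and estimates the lost fraction of time as delta/tau + tau/(2*M). The node MTBF and checkpoint time in the usage section are hypothetical placeholders, not measurements.

```python
import math


def system_mtbf(node_mtbf_hours, num_nodes):
    """System MTBF under the assumption of independent node failures."""
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_hours, mtbf_hours):
    """Young's approximation for the compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


def overhead_fraction(checkpoint_hours, mtbf_hours):
    """First-order overhead: checkpoint cost per interval plus expected rework.

    Only meaningful when checkpoint_hours is much smaller than mtbf_hours;
    values near or above 1 mean the model is outside its valid range.
    """
    tau = young_interval(checkpoint_hours, mtbf_hours)
    return checkpoint_hours / tau + tau / (2.0 * mtbf_hours)


if __name__ == "__main__":
    NODE_MTBF_HOURS = 5 * 365 * 24  # hypothetical: 5-year MTBF per node
    CHECKPOINT_HOURS = 0.25         # hypothetical: 15 minutes per checkpoint
    for n in (1_000, 10_000, 100_000):
        m = system_mtbf(NODE_MTBF_HOURS, n)
        print(n, round(overhead_fraction(CHECKPOINT_HOURS, m), 3))
```

At the optimal interval the two terms are equal, so the estimate reduces to sqrt(2 * delta / M): overhead grows with the square root of node count when the checkpoint time is held fixed.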