resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfapril...
TRANSCRIPT
![Page 1: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/1.jpg)
Presented by
Resiliency for high-performance computing
Christian EngelmannComputer Science and Mathematics Division
Oak Ridge National Laboratory, Oak Ridge, TN, USA
![Page 2: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/2.jpg)
April 10, 2008 2/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Outline
• Resilience deficiencies of high-performance computing architectures
• Research and development goals and projects overview
• Symmetric active/active redundancy for head and service nodes
• Job pause approach for reactive compute node fault tolerance
• Pre-emptive migration for proactive compute node fault tolerance
• Reliability analysis and modeling for fault prediction and anticipation
• Evaluation of compute node fault tolerance policies using simulation
• Vision of a holistic resiliency framework concept
• Conclusion: Achievements, ongoing work and future plans
![Page 3: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/3.jpg)
April 10, 2008 3/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Traditional Beowulf cluster computing architecture
• Single head node manages entire HPC system
• System-wide services are provided by head node:− Job & resource management, networked file system, …
• Local services are provided by compute nodes− Message passing (MPI, PVM), …
Single Point of Failure/Single Point of Control
Single Points of Failure
![Page 4: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/4.jpg)
April 10, 2008 4/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Recent trend toward massively parallel processing (MPP) architectures
• Single head node and additional service nodes manage the entire HPC system
• System-wide services are provided by head node and are offloaded to service nodes, e.g., networked file system
• Local services are provided by service nodes and compute nodes, e.g., message passing
Single Points of FailureSingle Points of FailureSingle Points of Control
![Page 5: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/5.jpg)
April 10, 2008 5/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Recent trend toward partitioned MPP architectures
• Single head node manages entire HPC system
• Service nodes manage and support compute nodes belonging to their partitions
Single Points of FailureSingle Point of Failure/Single Point of Control
Single Points of Failure/Single Point of Control for Partition
Single Points of Failure for System
![Page 6: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/6.jpg)
April 10, 2008 6/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Availability Measured by the Nineshttp://info.nccs.gov/resources - HPC system status at Oak Ridge National Laboratory
9’s Availability Downtime/Year Examples1 90.0% 36 days, 12 hours Personal Computers2 99.0% 87 hours, 36 min Entry Level Business3 99.9% 8 hours, 45.6 min ISPs, Mainstream Business4 99.99% 52 min, 33.6 sec Data Centers5 99.999% 5 min, 15.4 sec Banking, Medical6 99.9999% 31.5 seconds Military Defense
• Enterprise-class hardware + Stable Linux kernel = 5+
• Substandard hardware + Good high availability package = 2-3
• Today’s supercomputers = 1-2
• My desktop = 1-2
![Page 7: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/7.jpg)
April 10, 2008 7/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Typical Failure Causes in HPC Systems
• Hardware failures due to wear/age of:− Hard drives, memory modules, network cards, and processors
• Software failures due to bugs in:− Operating system, middleware, applications
• Stress-related failures that exceed design specifications in:− Hardware, e.g. due to excessive heat radiation− Software, e.g. due to (unintentional) denial of service
• Radiation-induced soft errors (bit flips due to electromagnetic interference, heat radiation, natural neutron radiation) in:− Memory modules, network cards, and processors
Different scale requires different solutions:− Compute nodes (up to 150,000)− Front-end, service, and I/O nodes (1 to 150)
![Page 8: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/8.jpg)
April 10, 2008 8/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Outline
• Resilience deficiencies of high-performance computing architectures
• Research and development goals and projects overview
• Symmetric active/active redundancy for head and service nodes
• Job pause approach for reactive compute node fault tolerance
• Pre-emptive migration for proactive compute node fault tolerance
• Reliability analysis and modeling for fault prediction and anticipation
• Evaluation of compute node fault tolerance policies using simulation
• Vision of a holistic resiliency framework concept
• Conclusion: Achievements, ongoing work and future plans
![Page 9: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/9.jpg)
April 10, 2008 9/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Research and development goals
• Efficient redundancy strategies for head and service nodes in HPC systems to provide high availability as well as high performance of critical infrastructure services
• Reactive fault tolerance for HPC compute nodes utilizing the job pause approach as well as checkpoint interval and placement adaptation to actual and predicted system health threats
• Proactive fault tolerance using system-level virtualization in HPC environments for pre-emptive migration of computation away from compute nodes that are about to fail
• Reliability analysis for identifying pre-fault indicators, predicting failures, and modeling and monitoring of individual component and overall HPC system reliability
• Holistic fault tolerance technology through combination of adaptive proactive and reactive fault tolerance mechanisms in conjunction with system health monitoring and reliability analysis
![Page 10: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/10.jpg)
April 10, 2008 10/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
• Previous effort: 2004-2007
• Addresses the challenges for operating and runtime systems to run large applications efficiently on large-scale supercomputers
• Part of the Forum to Address Scalable Technology for Runtime andOperating Systems (FAST-OS)
• MOLAR is a collaborative research effort (www.fastos.org/molar):
![Page 11: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/11.jpg)
April 10, 2008 11/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond
• Current effort: 2008-2011
• Addresses the fault resilience challenges for operating and runtime systems of next-generation large-scale supercomputers
• Part of the Forum to Address Scalable Technology for Runtime andOperating Systems (FAST-OS)
• RAS HPC is a collaborative research effort (www.fastos.org/ras):
RAS
RAS
![Page 12: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/12.jpg)
April 10, 2008 12/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Outline
• Resilience deficiencies of high-performance computing architectures
• Research and development goals and projects overview
• Symmetric active/active redundancy for head and service nodes
• Job pause approach for reactive compute node fault tolerance
• Pre-emptive migration for proactive compute node fault tolerance
• Reliability analysis and modeling for fault prediction and anticipation
• Evaluation of compute node fault tolerance policies using simulation
• Vision of a holistic resiliency framework concept
• Conclusion: Achievements, ongoing work and future plans
![Page 13: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/13.jpg)
April 10, 2008 13/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
• Many active head nodes
• Work load distribution
• Symmetric replicationbetween head nodes
• Continuous service
• Always up to date
• No fail-over necessary
• No restore-over necessary
• Virtual synchrony model
Complex algorithms
Prototypes for PBS Torque and PVFS metadata server
Symmetric active/active redundancy for head and service nodes
![Page 14: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/14.jpg)
April 10, 2008 14/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Symmetric active/active replication
InputReplication
VirtuallySynchronousProcessing
OutputUnification
1m 30s99.9997%3
1h 45m99.97%2
5d 4h 21m98.58%1Est. annual downtimeAvailabilityNodes
1m 30s99.9997%3
1h 45m99.97%2
5d 4h 21m98.58%1Est. annual downtimeAvailabilityNodes
![Page 15: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/15.jpg)
April 10, 2008 15/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Symmetric active/active Parallel Virtual File System metadata service
![Page 16: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/16.jpg)
April 10, 2008 16/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Outline
• Resilience deficiencies of high-performance computing architectures
• Research and development goals and projects overview
• Symmetric active/active redundancy for head and service nodes
• Job pause approach for reactive compute node fault tolerance
• Pre-emptive migration for proactive compute node fault tolerance
• Reliability analysis and modeling for fault prediction and anticipation
• Evaluation of compute node fault tolerance policies using simulation
• Vision of a holistic resiliency framework concept
• Conclusion: Achievements, ongoing work and future plans
![Page 17: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/17.jpg)
April 10, 2008 17/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Reactive vs. proactive fault tolerance for compute nodes
• Reactive fault tolerance:
• Proactive fault tolerance:
Ideal solution: Matching
• State saving during failure-free operation• State recovery after failure• Assured quality of service, but limited
scalability
• System health monitoring and online reliability modeling
• Failure anticipation and prevention through prediction and reconfiguration before failure
• Highly scalable, but not all failures can be anticipated
combination of both
![Page 18: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/18.jpg)
April 10, 2008 18/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Enhanced reactive fault tolerance with LAM/MPI+BLCR job pause mechanism
• Operational nodes: Pause− BLCR reuses existing processes− LAM/MPI reuses existing
connections− Restore partial process state
from checkpoint
• Failed nodes: Migrate− Restart process on new node
from checkpoint− Reconnect with paused
processes
Scalable MPI membership management for low overhead
Efficient, transparent, and automatic failure recovery
![Page 19: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/19.jpg)
April 10, 2008 19/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
• Application registers threaded callback spawns callback thread
• Thread blocks in kernel
• Pause utility calls ioctl(), unblocks callback thread
• All threads complete callbacks and enter kernel
New: All threads restore part of their states
• Run regular application code from restored state
New job pause mechanism in BLCR
![Page 20: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/20.jpg)
April 10, 2008 20/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
LAM/MPI+BLCR job pause performance
0123456789
10
BT CG EP FT LU MG SP
Seco
nds
Job Pause and Migrate LAM Reboot Job Restart
• 3.4% overhead over job restart, but− No LAM reboot overhead− Transparent continuation of execution
• No re-queue penalty
• Less staging overhead
![Page 21: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/21.jpg)
April 10, 2008 21/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Outline
• Resilience deficiencies of high-performance computing architectures
• Research and development goals and projects overview
• Symmetric active/active redundancy for head and service nodes
• Job pause approach for reactive compute node fault tolerance
• Pre-emptive migration for proactive compute node fault tolerance
• Reliability analysis and modeling for fault prediction and anticipation
• Evaluation of compute node fault tolerance policies using simulation
• Vision of a holistic resiliency framework concept
• Conclusion: Achievements, ongoing work and future plans
![Page 22: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/22.jpg)
April 10, 2008 22/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
• Stand-by Xen host (spare node without guest VM)
• Deteriorating health:− Migrate guest VM to spare
node
• New host generates unsolicited ARP reply− Indicates that guest VM has
moved− ARP tells peers to resend to
new host
Novel fault tolerance scheme that acts before a failure impacts a system
Proactive fault tolerance using Xenvirtualization
Xen VMMXen VMM
GangliaGanglia
Privileged VMPrivileged VM
Xen VMMXen VMM
Privileged VMPrivileged VM
GangliaGanglia
PFTDaemon
PFTDaemon
Guest VMGuest VM
MPITaskMPITask
PFTDaemon
PFTDaemon
Migrate
Xen VMMXen VMM
Privileged VMPrivileged VM Guest VMGuest VM
MPITaskMPITaskGangliaGanglia
PFTDaemon
PFTDaemon
Xen VMMXen VMM
Privileged VMPrivileged VM Guest VMGuest VM
MPITaskMPITaskGangliaGanglia
PFTDaemon
PFTDaemon
H/w BMC H/w BMC
H/w BMC H/w BMC
![Page 23: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/23.jpg)
April 10, 2008 23/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Proactive fault tolerance (PFT) daemon
• Runs in privileged domain (host)
• Initialization− Read safe threshold from config file− Init connection with IPMI controller− Obtain/filter set of available sensors
• Health monitoring− Read sensors from IPMI controller− Periodically sample data − Trigger load balancing if exceeding
sensor threshold
• VM migration− Select target based on load− Invoke Xen live migration for VM
InitializeInitialize
Health MonitorHealth Monitor
IPMIController
IPMIController
Threshold Breach?
Threshold Breach?
Load BalanceLoad BalanceGangliaGanglia
N
Y
PFT Daemon
Raise Alarm / Maintenance of
the System
![Page 24: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/24.jpg)
April 10, 2008 24/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
• Single node failure: 0.5-5% additional cost over total wall clock time
• Double node failure: 2-8% additional cost over total wall clock time
VM migration performance impact
0
50
100
150
200
250
300
BT CG EP LU SP
Seco
nds
W /o Migration1 Migration2 Migration
0
50
100
150
200
250
300
350
400
450
500
BT CG EP LU SP
Seco
nds
W /o Migration 1 Migration
![Page 25: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/25.jpg)
April 10, 2008 25/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Outline
• Resilience deficiencies of high-performance computing architectures
• Research and development goals and projects overview
• Symmetric active/active redundancy for head and service nodes
• Job pause approach for reactive compute node fault tolerance
• Pre-emptive migration for proactive compute node fault tolerance
• Reliability analysis and modeling for fault prediction and anticipation
• Evaluation of compute node fault tolerance policies using simulation
• Vision of a holistic resiliency framework concept
• Conclusion: Achievements, ongoing work and future plans
![Page 26: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/26.jpg)
April 10, 2008 26/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
HPC reliability analysis and modeling for prediction and anticipation
• Programming paradigm and system scale impact reliability
• Reliability analysis:− Estimate mean time to failure (MTTF)− Obtain failure distribution: Exponential, Weibull, Gamma, ...
• Feedback into fault tolerance schemes for adaptation
System reliability (MTTF) for k-of-n AND Survivability (k=n) Parallel Execution model
0100200300400500600700800
10 50 100 500 1000 2000 5000
Number of Participating Nodes
Tot
al s
yste
m M
TT
F (h
rs) Node MTTF 1000 hrs
Node MTTF 3000 hrsNode MTTF 5000 hrsNode MTTF 7000 hrs
![Page 27: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/27.jpg)
April 10, 2008 27/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Outline
• Resilience deficiencies of high-performance computing architectures
• Research and development goals and projects overview
• Symmetric active/active redundancy for head and service nodes
• Job pause approach for reactive compute node fault tolerance
• Pre-emptive migration for proactive compute node fault tolerance
• Reliability analysis and modeling for fault prediction and anticipation
• Evaluation of compute node fault tolerance policies using simulation
• Vision of a holistic resiliency framework concept
• Conclusion: Achievements, ongoing work and future plans
![Page 28: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/28.jpg)
April 10, 2008 28/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Simulation framework for HPC fault tolerance policies
• Evaluation of fault tolerance policies− Reactive only− Proactive only− Reactive/proactive combination
• Evaluation of fault tolerance parameters− Checkpoint interval− Prediction accuracy
• Event-based simulation framework using actual HPC system logs
• Customizable simulated environment− Number of active and spare nodes− Checkpoint and migration overheads
![Page 29: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/29.jpg)
April 10, 2008 29/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Combination of proactive and reactive fault tolerance: Simulation example 1
10 30 50 70 90 2 4 8 16 32 64 128
256
0
10
20
30
40
50
60
70
80
90
100
% Execution overhead
% Prediction accuracy Checkpoint interval (hours)
Execution overhead for various checkpoint intervals and different prediction accuracy
90-100
80-90
70-80
60-70
50-60
40-50
30-40
20-30
10-20
0-10
• Best: Prediction accuracy >60% and checkpoint interval 16-32 hours
• Better than only proactive or only reactive• Results for higher prediction accuracies
and very low checkpoint intervals are worse than only proactive or only reactive
Number of processes 125
Active nodes / Spare nodes 125 / 12
Checkpoint overhead 50 min/Checkpoint
Migration overhead 1 min/Migration
Simulation based on ASCI White system logs(nodes 1 – 125 and 500-512)
![Page 30: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/30.jpg)
April 10, 2008 30/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Combination of proactive and reactive fault tolerance: Simulation example 2
10 30 50 70
90 2 4 8 16 32 64 128
256
0
10
20
30
40
50
60
70
80
90
100
% Execution overhead
% Prediction accuracy
Checkpoint interval (hours)
Execution overhead for various checkpoint intervals and different prediction accuracy
90-10080-9070-8060-7050-6040-5030-4020-3010-200-10
• Best: Accuracy >60%, interval 16-64h• 70% and 32 hours:
− 8% gain over reactive only− 24% gain in over proactive only
• 80% and 32 hours:− 10% gain over reactive only− 3% loss over proactive only
Number of processes 125
Active nodes / Spare nodes 125 / 12
Checkpoint overhead 50 min/Checkpoint
Migration overhead 1 min/Migration
Simulation based on ASCI White system logs(nodes 126 – 250 and 500-512)
![Page 31: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/31.jpg)
April 10, 2008 31/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Outline
• Resilience deficiencies of high-performance computing architectures
• Research and development goals and projects overview
• Symmetric active/active redundancy for head and service nodes
• Job pause approach for reactive compute node fault tolerance
• Pre-emptive migration for proactive compute node fault tolerance
• Reliability analysis and modeling for fault prediction and anticipation
• Evaluation of compute node fault tolerance policies using simulation
• Vision of a holistic resiliency framework concept
• Conclusion: Achievements, ongoing work and future plans
![Page 32: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/32.jpg)
April 10, 2008 32/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
A holistic resiliency framework concept for high-performance computing
RAS
RAS
RAS
RAS
![Page 33: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/33.jpg)
April 10, 2008 33/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Example 1: Automatic Runtime Environment Checkpoint/Restart
![Page 34: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/34.jpg)
April 10, 2008 34/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Example 2: Automatic Virtual Machine Checkpoint/Restart/Migration
![Page 35: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/35.jpg)
April 10, 2008 35/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Example 3: Automatic Application-Level Checkpoint/Restart
![Page 36: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/36.jpg)
April 10, 2008 36/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Example 4: Automatic OS-Level Checkpoint/Restart/Migration
![Page 37: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/37.jpg)
April 10, 2008 37/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Outline
• Resilience deficiencies of high-performance computing architectures
• Research and development goals and projects overview
• Symmetric active/active redundancy for head and service nodes
• Job pause approach for reactive compute node fault tolerance
• Pre-emptive migration for proactive compute node fault tolerance
• Reliability analysis and modeling for fault prediction and anticipation
• Evaluation of compute node fault tolerance policies using simulation
• Vision of a holistic resiliency framework concept
• Conclusion: Achievements, ongoing work and future plans
![Page 38: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/38.jpg)
April 10, 2008 38/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Achievements
• Developed efficient redundancy strategies for critical infrastructure services in HPC systems
• Added the job pause approach for HPC compute nodes to enhance reactive fault tolerance
• Implemented threshold-based preemptive migration using system-level virtualization for proactive fault tolerance
• Investigated mean-time to failure characteristics of HPC systems using reliability analysis
• Simulated and evaluated various fault tolerance strategies usingactual failure data from HPC system logs− Reactive fault tolerance only− Proactive fault tolerance only− Holistic fault tolerance through reactive/proactive combination
![Page 39: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/39.jpg)
April 10, 2008 39/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Ongoing work and future plans
• Finishing development on threshold-based process-level preemptive migration for proactive fault tolerance
• Engaging in prediction-based preemptive migration using for proactive fault tolerance
• Investigating failure mode, failure detection, and failure distribution characteristics of HPC systems using reliability analysis
• Plan to further enhance reactive fault tolerance with checkpoint placement adaptation to actual and predicted system health threats
• Plan to develop a holistic fault tolerance framework that offers the mix-and-match approach for various fault resilience strategies
![Page 40: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/40.jpg)
April 10, 2008 40/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Acknowledgements
• Investigators at Oak Ridge National Laboratory:− Stephen L. Scott, Christian Engelmann, Hong H. Ong, Geoffroy R. Vallée, Thomas
Naughton, Anand Tikotekar, George Ostrouchov
• Investigators at Louisiana Tech University:− Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun
• Investigators at North Carolina State University:− Frank Mueller, Chao Wang, Arun Nagarajan, Jyothish Varma
• Investigators at Tennessee Technological University:− Xubin (Ben) He, Li Ou, Xin Chen
• Funding sources:− U.S. Department of Energy, U.S. National Science Foundation, Oak Ridge National
Laboratory, Tennessee Technological University
![Page 41: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge](https://reader033.vdocuments.us/reader033/viewer/2022042007/5e707396c2328412203d6ab4/html5/thumbnails/41.jpg)
April 10, 2008 41/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory
Contacts
• Christian EngelmannSystem Research TeamComputer Science Research GroupComputer Science and Mathematics DivisionOak Ridge National Laboratory
• Stephen L. ScottSystem Research TeamComputer Science Research GroupComputer Science and Mathematics DivisionOak Ridge National Laboratory
• www.fastos.org/ras | www.fastos.org/molar
• www.csm.ornl.gov/srt
41 Scott_RAS_0714