resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfapril...

41
Presented by Resiliency for high-performance computing Christian Engelmann Computer Science and Mathematics Division Oak Ridge National Laboratory, Oak Ridge, TN, USA

Upload: others

Post on 14-Mar-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

Presented by

Resiliency for high-performance computing

Christian EngelmannComputer Science and Mathematics Division

Oak Ridge National Laboratory, Oak Ridge, TN, USA

Page 2: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 2/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Outline

• Resilience deficiencies of high-performance computing architectures

• Research and development goals and projects overview

• Symmetric active/active redundancy for head and service nodes

• Job pause approach for reactive compute node fault tolerance

• Pre-emptive migration for proactive compute node fault tolerance

• Reliability analysis and modeling for fault prediction and anticipation

• Evaluation of compute node fault tolerance policies using simulation

• Vision of a holistic resiliency framework concept

• Conclusion: Achievements, ongoing work and future plans

Page 3: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 3/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Traditional Beowulf cluster computing architecture

• Single head node manages entire HPC system

• System-wide services are provided by head node:− Job & resource management, networked file system, …

• Local services are provided by compute nodes− Message passing (MPI, PVM), …

Single Point of Failure/Single Point of Control

Single Points of Failure

Page 4: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 4/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Recent trend toward massively parallel processing (MPP) architectures

• Single head node and additional service nodes manage the entire HPC system

• System-wide services are provided by head node and are offloaded to service nodes, e.g., networked file system

• Local services are provided by service nodes and compute nodes, e.g., message passing

Single Points of FailureSingle Points of FailureSingle Points of Control

Page 5: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 5/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Recent trend toward partitioned MPP architectures

• Single head node manages entire HPC system

• Service nodes manage and support compute nodes belonging to their partitions

Single Points of FailureSingle Point of Failure/Single Point of Control

Single Points of Failure/Single Point of Control for Partition

Single Points of Failure for System

Page 6: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 6/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Availability Measured by the Nineshttp://info.nccs.gov/resources - HPC system status at Oak Ridge National Laboratory

9’s Availability Downtime/Year Examples1 90.0% 36 days, 12 hours Personal Computers2 99.0% 87 hours, 36 min Entry Level Business3 99.9% 8 hours, 45.6 min ISPs, Mainstream Business4 99.99% 52 min, 33.6 sec Data Centers5 99.999% 5 min, 15.4 sec Banking, Medical6 99.9999% 31.5 seconds Military Defense

• Enterprise-class hardware + Stable Linux kernel = 5+

• Substandard hardware + Good high availability package = 2-3

• Today’s supercomputers = 1-2

• My desktop = 1-2

Page 7: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 7/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Typical Failure Causes in HPC Systems

• Hardware failures due to wear/age of:− Hard drives, memory modules, network cards, and processors

• Software failures due to bugs in:− Operating system, middleware, applications

• Stress-related failures that exceed design specifications in:− Hardware, e.g. due to excessive heat radiation− Software, e.g. due to (unintentional) denial of service

• Radiation-induced soft errors (bit flips due to electromagnetic interference, heat radiation, natural neutron radiation) in:− Memory modules, network cards, and processors

Different scale requires different solutions:− Compute nodes (up to 150,000)− Front-end, service, and I/O nodes (1 to 150)

Page 8: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 8/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Outline

• Resilience deficiencies of high-performance computing architectures

• Research and development goals and projects overview

• Symmetric active/active redundancy for head and service nodes

• Job pause approach for reactive compute node fault tolerance

• Pre-emptive migration for proactive compute node fault tolerance

• Reliability analysis and modeling for fault prediction and anticipation

• Evaluation of compute node fault tolerance policies using simulation

• Vision of a holistic resiliency framework concept

• Conclusion: Achievements, ongoing work and future plans

Page 9: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 9/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Research and development goals

• Efficient redundancy strategies for head and service nodes in HPC systems to provide high availability as well as high performance of critical infrastructure services

• Reactive fault tolerance for HPC compute nodes utilizing the job pause approach as well as checkpoint interval and placement adaptation to actual and predicted system health threats

• Proactive fault tolerance using system-level virtualization in HPC environments for pre-emptive migration of computation away from compute nodes that are about to fail

• Reliability analysis for identifying pre-fault indicators, predicting failures, and modeling and monitoring of individual component and overall HPC system reliability

• Holistic fault tolerance technology through combination of adaptive proactive and reactive fault tolerance mechanisms in conjunction with system health monitoring and reliability analysis

Page 10: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 10/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

MOLAR: Adaptive runtime support for high-end computing operating and runtime systems

• Previous effort: 2004-2007

• Addresses the challenges for operating and runtime systems to run large applications efficiently on large-scale supercomputers

• Part of the Forum to Address Scalable Technology for Runtime andOperating Systems (FAST-OS)

• MOLAR is a collaborative research effort (www.fastos.org/molar):

Page 11: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 11/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond

• Current effort: 2008-2011

• Addresses the fault resilience challenges for operating and runtime systems of next-generation large-scale supercomputers

• Part of the Forum to Address Scalable Technology for Runtime andOperating Systems (FAST-OS)

• RAS HPC is a collaborative research effort (www.fastos.org/ras):

RAS

RAS

Page 12: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 12/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Outline

• Resilience deficiencies of high-performance computing architectures

• Research and development goals and projects overview

• Symmetric active/active redundancy for head and service nodes

• Job pause approach for reactive compute node fault tolerance

• Pre-emptive migration for proactive compute node fault tolerance

• Reliability analysis and modeling for fault prediction and anticipation

• Evaluation of compute node fault tolerance policies using simulation

• Vision of a holistic resiliency framework concept

• Conclusion: Achievements, ongoing work and future plans

Page 13: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 13/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

• Many active head nodes

• Work load distribution

• Symmetric replicationbetween head nodes

• Continuous service

• Always up to date

• No fail-over necessary

• No restore-over necessary

• Virtual synchrony model

Complex algorithms

Prototypes for PBS Torque and PVFS metadata server

Symmetric active/active redundancy for head and service nodes

Page 14: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 14/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Symmetric active/active replication

InputReplication

VirtuallySynchronousProcessing

OutputUnification

1m 30s99.9997%3

1h 45m99.97%2

5d 4h 21m98.58%1Est. annual downtimeAvailabilityNodes

1m 30s99.9997%3

1h 45m99.97%2

5d 4h 21m98.58%1Est. annual downtimeAvailabilityNodes

Page 15: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 15/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Symmetric active/active Parallel Virtual File System metadata service

Page 16: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 16/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Outline

• Resilience deficiencies of high-performance computing architectures

• Research and development goals and projects overview

• Symmetric active/active redundancy for head and service nodes

• Job pause approach for reactive compute node fault tolerance

• Pre-emptive migration for proactive compute node fault tolerance

• Reliability analysis and modeling for fault prediction and anticipation

• Evaluation of compute node fault tolerance policies using simulation

• Vision of a holistic resiliency framework concept

• Conclusion: Achievements, ongoing work and future plans

Page 17: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 17/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Reactive vs. proactive fault tolerance for compute nodes

• Reactive fault tolerance:

• Proactive fault tolerance:

Ideal solution: Matching

• State saving during failure-free operation• State recovery after failure• Assured quality of service, but limited

scalability

• System health monitoring and online reliability modeling

• Failure anticipation and prevention through prediction and reconfiguration before failure

• Highly scalable, but not all failures can be anticipated

combination of both

Page 18: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 18/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Enhanced reactive fault tolerance with LAM/MPI+BLCR job pause mechanism

• Operational nodes: Pause− BLCR reuses existing processes− LAM/MPI reuses existing

connections− Restore partial process state

from checkpoint

• Failed nodes: Migrate− Restart process on new node

from checkpoint− Reconnect with paused

processes

Scalable MPI membership management for low overhead

Efficient, transparent, and automatic failure recovery

Page 19: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 19/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

• Application registers threaded callback spawns callback thread

• Thread blocks in kernel

• Pause utility calls ioctl(), unblocks callback thread

• All threads complete callbacks and enter kernel

New: All threads restore part of their states

• Run regular application code from restored state

New job pause mechanism in BLCR

Page 20: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 20/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

LAM/MPI+BLCR job pause performance

0123456789

10

BT CG EP FT LU MG SP

Seco

nds

Job Pause and Migrate LAM Reboot Job Restart

• 3.4% overhead over job restart, but− No LAM reboot overhead− Transparent continuation of execution

• No re-queue penalty

• Less staging overhead

Page 21: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 21/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Outline

• Resilience deficiencies of high-performance computing architectures

• Research and development goals and projects overview

• Symmetric active/active redundancy for head and service nodes

• Job pause approach for reactive compute node fault tolerance

• Pre-emptive migration for proactive compute node fault tolerance

• Reliability analysis and modeling for fault prediction and anticipation

• Evaluation of compute node fault tolerance policies using simulation

• Vision of a holistic resiliency framework concept

• Conclusion: Achievements, ongoing work and future plans

Page 22: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 22/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

• Stand-by Xen host (spare node without guest VM)

• Deteriorating health:− Migrate guest VM to spare

node

• New host generates unsolicited ARP reply− Indicates that guest VM has

moved− ARP tells peers to resend to

new host

Novel fault tolerance scheme that acts before a failure impacts a system

Proactive fault tolerance using Xenvirtualization

Xen VMMXen VMM

GangliaGanglia

Privileged VMPrivileged VM

Xen VMMXen VMM

Privileged VMPrivileged VM

GangliaGanglia

PFTDaemon

PFTDaemon

Guest VMGuest VM

MPITaskMPITask

PFTDaemon

PFTDaemon

Migrate

Xen VMMXen VMM

Privileged VMPrivileged VM Guest VMGuest VM

MPITaskMPITaskGangliaGanglia

PFTDaemon

PFTDaemon

Xen VMMXen VMM

Privileged VMPrivileged VM Guest VMGuest VM

MPITaskMPITaskGangliaGanglia

PFTDaemon

PFTDaemon

H/w BMC H/w BMC

H/w BMC H/w BMC

Page 23: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 23/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Proactive fault tolerance (PFT) daemon

• Runs in privileged domain (host)

• Initialization− Read safe threshold from config file− Init connection with IPMI controller− Obtain/filter set of available sensors

• Health monitoring− Read sensors from IPMI controller− Periodically sample data − Trigger load balancing if exceeding

sensor threshold

• VM migration− Select target based on load− Invoke Xen live migration for VM

InitializeInitialize

Health MonitorHealth Monitor

IPMIController

IPMIController

Threshold Breach?

Threshold Breach?

Load BalanceLoad BalanceGangliaGanglia

N

Y

PFT Daemon

Raise Alarm / Maintenance of

the System

Page 24: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 24/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

• Single node failure: 0.5-5% additional cost over total wall clock time

• Double node failure: 2-8% additional cost over total wall clock time

VM migration performance impact

0

50

100

150

200

250

300

BT CG EP LU SP

Seco

nds

W /o Migration1 Migration2 Migration

0

50

100

150

200

250

300

350

400

450

500

BT CG EP LU SP

Seco

nds

W /o Migration 1 Migration

Page 25: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 25/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Outline

• Resilience deficiencies of high-performance computing architectures

• Research and development goals and projects overview

• Symmetric active/active redundancy for head and service nodes

• Job pause approach for reactive compute node fault tolerance

• Pre-emptive migration for proactive compute node fault tolerance

• Reliability analysis and modeling for fault prediction and anticipation

• Evaluation of compute node fault tolerance policies using simulation

• Vision of a holistic resiliency framework concept

• Conclusion: Achievements, ongoing work and future plans

Page 26: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 26/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

HPC reliability analysis and modeling for prediction and anticipation

• Programming paradigm and system scale impact reliability

• Reliability analysis:− Estimate mean time to failure (MTTF)− Obtain failure distribution: Exponential, Weibull, Gamma, ...

• Feedback into fault tolerance schemes for adaptation

System reliability (MTTF) for k-of-n AND Survivability (k=n) Parallel Execution model

0100200300400500600700800

10 50 100 500 1000 2000 5000

Number of Participating Nodes

Tot

al s

yste

m M

TT

F (h

rs) Node MTTF 1000 hrs

Node MTTF 3000 hrsNode MTTF 5000 hrsNode MTTF 7000 hrs

Page 27: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 27/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Outline

• Resilience deficiencies of high-performance computing architectures

• Research and development goals and projects overview

• Symmetric active/active redundancy for head and service nodes

• Job pause approach for reactive compute node fault tolerance

• Pre-emptive migration for proactive compute node fault tolerance

• Reliability analysis and modeling for fault prediction and anticipation

• Evaluation of compute node fault tolerance policies using simulation

• Vision of a holistic resiliency framework concept

• Conclusion: Achievements, ongoing work and future plans

Page 28: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 28/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Simulation framework for HPC fault tolerance policies

• Evaluation of fault tolerance policies− Reactive only− Proactive only− Reactive/proactive combination

• Evaluation of fault tolerance parameters− Checkpoint interval− Prediction accuracy

• Event-based simulation framework using actual HPC system logs

• Customizable simulated environment− Number of active and spare nodes− Checkpoint and migration overheads

Page 29: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 29/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Combination of proactive and reactive fault tolerance: Simulation example 1

10 30 50 70 90 2 4 8 16 32 64 128

256

0

10

20

30

40

50

60

70

80

90

100

% Execution overhead

% Prediction accuracy Checkpoint interval (hours)

Execution overhead for various checkpoint intervals and different prediction accuracy

90-100

80-90

70-80

60-70

50-60

40-50

30-40

20-30

10-20

0-10

• Best: Prediction accuracy >60% and checkpoint interval 16-32 hours

• Better than only proactive or only reactive• Results for higher prediction accuracies

and very low checkpoint intervals are worse than only proactive or only reactive

Number of processes 125

Active nodes / Spare nodes 125 / 12

Checkpoint overhead 50 min/Checkpoint

Migration overhead 1 min/Migration

Simulation based on ASCI White system logs(nodes 1 – 125 and 500-512)

Page 30: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 30/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Combination of proactive and reactive fault tolerance: Simulation example 2

10 30 50 70

90 2 4 8 16 32 64 128

256

0

10

20

30

40

50

60

70

80

90

100

% Execution overhead

% Prediction accuracy

Checkpoint interval (hours)

Execution overhead for various checkpoint intervals and different prediction accuracy

90-10080-9070-8060-7050-6040-5030-4020-3010-200-10

• Best: Accuracy >60%, interval 16-64h• 70% and 32 hours:

− 8% gain over reactive only− 24% gain in over proactive only

• 80% and 32 hours:− 10% gain over reactive only− 3% loss over proactive only

Number of processes 125

Active nodes / Spare nodes 125 / 12

Checkpoint overhead 50 min/Checkpoint

Migration overhead 1 min/Migration

Simulation based on ASCI White system logs(nodes 126 – 250 and 500-512)

Page 31: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 31/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Outline

• Resilience deficiencies of high-performance computing architectures

• Research and development goals and projects overview

• Symmetric active/active redundancy for head and service nodes

• Job pause approach for reactive compute node fault tolerance

• Pre-emptive migration for proactive compute node fault tolerance

• Reliability analysis and modeling for fault prediction and anticipation

• Evaluation of compute node fault tolerance policies using simulation

• Vision of a holistic resiliency framework concept

• Conclusion: Achievements, ongoing work and future plans

Page 32: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 32/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

A holistic resiliency framework concept for high-performance computing

RAS

RAS

RAS

RAS

Page 33: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 33/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Example 1: Automatic Runtime Environment Checkpoint/Restart

Page 34: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 34/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Example 2: Automatic Virtual Machine Checkpoint/Restart/Migration

Page 35: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 35/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Example 3: Automatic Application-Level Checkpoint/Restart

Page 36: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 36/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Example 4: Automatic OS-Level Checkpoint/Restart/Migration

Page 37: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 37/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Outline

• Resilience deficiencies of high-performance computing architectures

• Research and development goals and projects overview

• Symmetric active/active redundancy for head and service nodes

• Job pause approach for reactive compute node fault tolerance

• Pre-emptive migration for proactive compute node fault tolerance

• Reliability analysis and modeling for fault prediction and anticipation

• Evaluation of compute node fault tolerance policies using simulation

• Vision of a holistic resiliency framework concept

• Conclusion: Achievements, ongoing work and future plans

Page 38: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 38/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Achievements

• Developed efficient redundancy strategies for critical infrastructure services in HPC systems

• Added the job pause approach for HPC compute nodes to enhance reactive fault tolerance

• Implemented threshold-based preemptive migration using system-level virtualization for proactive fault tolerance

• Investigated mean-time to failure characteristics of HPC systems using reliability analysis

• Simulated and evaluated various fault tolerance strategies usingactual failure data from HPC system logs− Reactive fault tolerance only− Proactive fault tolerance only− Holistic fault tolerance through reactive/proactive combination

Page 39: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 39/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Ongoing work and future plans

• Finishing development on threshold-based process-level preemptive migration for proactive fault tolerance

• Engaging in prediction-based preemptive migration using for proactive fault tolerance

• Investigating failure mode, failure detection, and failure distribution characteristics of HPC systems using reliability analysis

• Plan to further enhance reactive fault tolerance with checkpoint placement adaptation to actual and predicted system health threats

• Plan to develop a holistic fault tolerance framework that offers the mix-and-match approach for various fault resilience strategies

Page 40: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 40/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Acknowledgements

• Investigators at Oak Ridge National Laboratory:− Stephen L. Scott, Christian Engelmann, Hong H. Ong, Geoffroy R. Vallée, Thomas

Naughton, Anand Tikotekar, George Ostrouchov

• Investigators at Louisiana Tech University:− Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun

• Investigators at North Carolina State University:− Frank Mueller, Chao Wang, Arun Nagarajan, Jyothish Varma

• Investigators at Tennessee Technological University:− Xubin (Ben) He, Li Ou, Xin Chen

• Funding sources:− U.S. Department of Energy, U.S. National Science Foundation, Oak Ridge National

Laboratory, Tennessee Technological University

Page 41: Resiliency for high-performance computingengelman/publications/engelmann08resiliency.ppt.pdfApril 10, 2008 Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge

April 10, 2008 41/41Resiliency for High-Performance Computing – Christian Engelmann, Oak Ridge National Laboratory

Contacts

• Christian EngelmannSystem Research TeamComputer Science Research GroupComputer Science and Mathematics DivisionOak Ridge National Laboratory

[email protected]

• Stephen L. ScottSystem Research TeamComputer Science Research GroupComputer Science and Mathematics DivisionOak Ridge National Laboratory

[email protected]

• www.fastos.org/ras | www.fastos.org/molar

• www.csm.ornl.gov/srt

41 Scott_RAS_0714