Hardware-Integrated Approaches to Failure Advanced Warning Ralph H. Castain, Ph.D. Los Alamos National Laboratory

Upload: leon-gunton

Post on 14-Dec-2015


Page 1

Hardware-Integrated Approaches to Failure Advanced Warning

Ralph H. Castain, Ph.D.
Los Alamos National Laboratory

Page 2

Outline

• A little history and perspective
  - What do we mean by “resilient”?
  - Traditional vs embedded approach
  - DARPA “built-in-test” program
• Cisco resilient router project
  - Brief overview of project
  - Our approach and partnership with OpenMPI
• Open Cluster Manager (OpenCM)

Page 3

Motivation

• Head of new business unit for integrated diagnostics and control
• World’s largest customer
  - If a system fails, the customer will search out the root cause
  - If it is your system, you pay the cost of the lost batch!
  - Rough cost per failure: $10M
  - Rough value of system: $200k

Page 4

Resiliency

• Fault
  - Events that hinder the correct operation of a process
  - May not actually be a “failure” of a component, but can cause system-level failure or performance degradation below the specified level
  - Effect may be immediate or some time in the future
  - Usually rare
  - May not have many data examples
• Fault prediction
  - Estimate the probability of an incipient fault within some time period in the future
• Fault tolerance (reactive, static)
  - Ability to recover from a fault
• Robustness (metric)
  - How much the system can absorb without catastrophic consequences
• Resilience (proactive, dynamic)
  - Dynamically configure the system to minimize the impact of potential faults

Page 5

Traditional Approach to Faults: The “Bathtub”

[Figure: bathtub curve. Labels: “Infant Mortality”, “‘Floor’ Region”, “MTBF”, “Defined Lifetime?”, “B”]

Page 6

What’s Wrong With That?

• Infant mortality
  - Resolved by extensive burn-in: costly
• Where to define “lifetime”?
  - A: Units decommissioned with considerable unused life
  - B: High probability of failures in advance of the defined lifetime
  - MTBF: ~50% of units fail before reaching it
• Bathtub floor does not sit at “zero”
  - Still significant probability of failure
• Can’t reliably estimate system lifetime due to multi-component degradation
  - Component-component interactions not reflected in individual component lifetime statistics
• Failures can be costly
  - Operational impact
  - Replacement costs
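
The point that the bathtub floor is not at zero can be made concrete with a small calculation. The sketch below is illustrative only: it assumes components fail independently with the same per-interval probability, which is exactly the simplification the slide warns about (component interactions are ignored). The function name and the numbers are hypothetical and do not come from the talk.

```python
def prob_any_failure(p_component: float, n_components: int) -> float:
    """Probability that at least one of n components fails in an interval.

    Assumes independent failures with identical probability p_component
    (an illustrative simplification, not a claim from the talk).
    """
    if not 0.0 <= p_component <= 1.0:
        raise ValueError("p_component must be in [0, 1]")
    if n_components < 0:
        raise ValueError("n_components must be non-negative")
    # P(at least one failure) = 1 - P(no component fails)
    return 1.0 - (1.0 - p_component) ** n_components


if __name__ == "__main__":
    # Hypothetical numbers: a 0.1% floor rate across 1000 components.
    print(f"{prob_any_failure(0.001, 1000):.3f}")
```

With these made-up inputs, a per-component floor of 0.1% already gives roughly a 63% chance that something in a 1000-component system fails in the interval, which is why a "low" floor can still matter at system scale.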

Page 7

DARPA BIT Program

• Multi-year program in the 1990s
  - Focus on electronic, mechanical failures
  - Create a “resilient war fighting” capability
  - Enable better maintenance support of increasingly complex systems
• Objectives
  - Push-button “good box/bad box” readout
      • Eliminate diagnostic “carts”, “toolboxes”, …
  - Pre-emptive switch from failing systems
  - “Okay for mission” test
      • Reduce probability of failures during mission

Page 8

Results Encouraging

• Vibration signatures
  - Impending bearing failures
      • Fans, axles, transmissions
• Thermal patterns
  - Mechanical failures
      • Existence of hot spots
      • Patterns revealed root causes, better prediction
  - Electronic failures
      • Patterns across boards, surface of chips
• Electrical frequency composition
  - Breakdowns in power transistors, other devices
  - IC internal wire connection degradation

Page 9

General Conclusions

• Exploit access to internals
  - Investigate optimal location, number of sensors
  - Embed intelligence, communications capability
• Integrate data from all available sources
  - Engineering design tests
  - Reliability life tests
  - Production qualification tests
• Utilize learning algorithms to improve performance
  - Both embedded, post process
  - Seed with expert knowledge

Page 10

Objective

[Figure: probability of failure (vertical axis, 0 to 0.8) versus time interval (horizontal axis, 1 to 10)]

Page 11

Motivation

• Head of new business unit for integrated diagnostics and control
• World’s largest customer
  - If a system fails, the customer will search out the root cause
  - If it is your system, you pay the cost of the lost batch!
  - Rough cost per failure: $10M
  - Rough value of system: $200k

Page 12

Questions

• Can we develop technologies that would…
  - Warn of impending failure
      • Provide time to reconfigure, respond
      • Allow switch to backup systems for continuous operation
      • Provide an opportunity to pace ourselves
  - “Stretch” life of system
  - With minimal overhead
      • Cannot significantly impact performance
• How would we use them?

Page 13

Direct Detection

[Figure: block diagram. Labels: “Spectral Filter”, “ADC”, “PZT”, “Temp”, “Voltage”, “Current”, “FDDP Analyzer”, “Good Box/Bad Box”, “Problem Diagnosis”, “Fault Prediction”]

Page 14

Integrate All Factors

Page 15

Results (generalized)

• Prediction
  - Better than 97% of faults predicted within specified response time (hours)
  - Less than 5% “bad” prediction rate
• Diagnosis
  - Better than 80% correct localization
• Detection (good/bad box)
  - Better than 99% correct identification
  - Less than 5% false positive rate

Page 16

Outline

• A little history and perspective
  - What do we mean by “resilient”?
  - Traditional vs embedded approach
  - DARPA “built-in-test” program
• Cisco resilient router project
  - Brief overview of project
  - Our approach and partnership with OpenMPI
• Open Cluster Manager (OpenCM)

Page 17

Problem Statements

1) Internet traffic and interconnect requirements are growing faster than the power available from silicon and software.

2) One approach is to build a larger, more distributed system.

3) The result is increased requirements on system software in terms of:

a) High availability across a multi-component system

b) Coherent view of intra-component messaging

c) Fast convergence among components during change

d) Distributed failover and effective sharing of load

e) SW/HW maintenance without service impact

© 2006 Cisco Systems, Inc. All rights reserved.

Page 18

Problem Drivers

[Figure: chart with a logarithmic axis (1 to 10000) and series labeled “System BW”, “MHz-gate/mW”, “Mbps/W”, and “System Power”, annotated “Shortfall!”]

Technology is falling behind the demand curve

Shortfall is overcome by architectural innovation and by trading off performance, functionality, programmability, physical size/density

Very hard to sustain long-term

Page 19

Hardware Details

Product example

• Largest routing system available today
  - Each linecard chassis: 1.28 Tbps, 13.6 kW
  - Switch fabric chassis: 8 kW

Page 20

Hardware Details

Product example

• Maximum HW configuration: 92 Tbps
  - Switching capacity across millions of interfaces
  - 48 x LC chassis + 8 x fabric chassis
  - => System messaging across all control CPUs to manage switch fabric and interface control

Page 21

Software Details

System Software Requirements

1) Turn on once with remote access thereafter

2) Non-stop == max 20 events/day lasting < 200 ms each

3) Hitless SW upgrades and downgrades

4) Upgrade/downgrade SW components across delta versions

5) Field patchable

6) Beta test new features in situ

7) Extensive trace facilities: on routes, tunnels, subscribers, …

8) Configuration

9) Clear APIs; minimize application awareness

10) Extensive remote capabilities for fault management, software maintenance and software installations
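
The "non-stop" requirement (at most 20 events per day, each under 200 ms) implies an availability bound that is easy to work out. The sketch below is plain arithmetic on the two numbers from the slide; the function names are hypothetical, and treating every event as a full-length outage is a worst-case assumption rather than something the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def worst_case_downtime_s(events_per_day: int, max_event_ms: float) -> float:
    """Upper bound on daily outage time if every event lasts the maximum."""
    return events_per_day * max_event_ms / 1000.0


def min_availability(events_per_day: int, max_event_ms: float) -> float:
    """Lower bound on the fraction of the day the system is in service."""
    downtime = worst_case_downtime_s(events_per_day, max_event_ms)
    return 1.0 - downtime / SECONDS_PER_DAY


if __name__ == "__main__":
    print(worst_case_downtime_s(20, 200), "seconds per day at most")
    print(f"{min_availability(20, 200):.7f} minimum availability")
```

Under that worst-case reading, 20 events of 200 ms add up to 4 seconds of interruption per day, so the requirement corresponds to better than 99.995% availability.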

Page 22

Our Approach: Use OpenRTE

• Set up new frameworks
  - Sensor: monitor hardware, software
  - FDDP: use sensor inputs to compute sliding window of probabilities
• Contribute back to OpenMPI
  - Proprietary modules as binary plug-ins
• Write new cluster manager
  - Exploit new capabilities
  - Create as non-centralized application
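
To illustrate the sliding-window idea behind the FDDP framework, here is a minimal sketch that keeps the most recent sensor samples and extrapolates when a reading would cross a limit. It is not the talk's algorithm: the deck names a B-spline trend fit for FDDP, and a simple least-squares line is swapped in here to keep the example dependency-free. The class name `TrendWindow`, its methods, and the threshold semantics are all hypothetical, and the step from a trend to a failure probability is not specified in the talk, so the sketch stops at the trend.

```python
from collections import deque
from typing import Optional, Tuple


class TrendWindow:
    """Hypothetical sliding window over (time, value) sensor samples."""

    def __init__(self, size: int) -> None:
        if size < 2:
            raise ValueError("window needs at least two samples")
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=size)

    def add(self, t: float, value: float) -> None:
        self.samples.append((t, value))

    def slope_intercept(self) -> Optional[Tuple[float, float]]:
        """Least-squares line through the window, or None if undefined."""
        n = len(self.samples)
        if n < 2:
            return None
        mean_t = sum(t for t, _ in self.samples) / n
        mean_v = sum(v for _, v in self.samples) / n
        var_t = sum((t - mean_t) ** 2 for t, _ in self.samples)
        if var_t == 0.0:
            return None
        cov = sum((t - mean_t) * (v - mean_v) for t, v in self.samples)
        slope = cov / var_t
        return slope, mean_v - slope * mean_t

    def time_to_threshold(self, threshold: float) -> Optional[float]:
        """Extrapolated time from the latest sample until the fitted line
        reaches threshold; None if the trend is flat or falling."""
        fit = self.slope_intercept()
        if fit is None:
            return None
        slope, intercept = fit
        if slope <= 0.0:
            return None
        latest_t = self.samples[-1][0]
        crossing_t = (threshold - intercept) / slope
        return max(0.0, crossing_t - latest_t)
```

A caller would feed this from a temperature or vibration sensor and treat a short `time_to_threshold` as a cue to warn the cluster manager. The window size and threshold would need tuning per sensor.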

Page 23

ORTE Extensions

• Software sensors
  - Memory footprint, CPU utilization (upper and lower), output file size
• Hardware sensors
  - Temperature, vibration
• FDDP
  - B-spline trend fit
• Resilient mapper
  - Fault groups
      • Nodes with common failure mode
      • Node can belong to multiple fault groups
  - Map replicas across fault groups
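
The fault-group idea can be sketched as a placement problem: put each replica on a node that shares as few fault groups as possible with the replicas already placed. The code below is one possible greedy heuristic under that reading, not the mapper actually implemented in ORTE. The function name, the data layout (node name mapped to a set of fault-group names), and the tie-break by node name are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def map_replicas(node_groups: Dict[str, Set[str]], n_replicas: int) -> List[str]:
    """Greedily choose one node per replica, spreading across fault groups.

    node_groups maps a node name to the fault groups it belongs to
    (a node may belong to several). Hypothetical layout for illustration.
    """
    if n_replicas > len(node_groups):
        raise ValueError("not enough nodes for one replica per node")
    chosen: List[str] = []
    used: Dict[str, int] = {}  # fault group -> replicas already placed in it

    def overlap(node: str) -> int:
        # How many placed replicas share a fault group with this node.
        return sum(used.get(group, 0) for group in node_groups[node])

    for _ in range(n_replicas):
        # Sorted order makes ties deterministic (first name wins).
        candidates = [n for n in sorted(node_groups) if n not in chosen]
        best = min(candidates, key=overlap)
        chosen.append(best)
        for group in node_groups[best]:
            used[group] = used.get(group, 0) + 1
    return chosen


if __name__ == "__main__":
    nodes = {
        "n1": {"rack1", "psuA"},
        "n2": {"rack1", "psuB"},
        "n3": {"rack2", "psuA"},
        "n4": {"rack2", "psuB"},
    }
    print(map_replicas(nodes, 2))
```

In the example, the second replica lands on the only node that shares neither a rack nor a power supply with the first, so no single fault group takes out both copies.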

Page 24

Cluster Manager

• Orted auto-starts upon node power-up
  - Auto-detect and connect to CM
• CM launches specified number of replicas of each application
  - Resilient mapper => minimize single points of failure
• Applications auto-wireup
  - Plug-and-play inspired approach
  - Application decides which input to declare “leader”

Page 25

Application Failure

• Orted detects (or predicts) failure and notifies CM
• CM utilizes resilient mapper to determine location of replacement
  - Future extension: probability of failure modes to help drive fault group selection
  - New replica is launched, does auto-wireup
• Connected applications
  - Loss of communication from “leader”
  - Independently select new “leader”
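
For connected applications to pick a new “leader” independently and still agree, each one needs a deterministic rule applied to the same observations. The talk leaves the rule to the application, so the sketch below is only one assumed possibility: choose the lowest-named replica that has been heard from within a timeout. The function name, the `last_heard` table, and the timeout are hypothetical.

```python
from typing import Dict, Optional


def select_leader(last_heard: Dict[str, float], now: float,
                  timeout: float) -> Optional[str]:
    """Return the lowest-named replica heard from within timeout seconds.

    last_heard maps replica name -> time of its most recent message.
    Returns None when no replica is currently considered alive.
    """
    alive = [name for name, t in last_heard.items() if now - t <= timeout]
    return min(alive) if alive else None


if __name__ == "__main__":
    heard = {"replica-0": 10.0, "replica-1": 14.5, "replica-2": 14.8}
    # replica-0 has been silent for 5 s, so it is skipped.
    print(select_leader(heard, now=15.0, timeout=1.0))
```

Consumers only converge on the same leader if they see roughly the same set of live replicas. A brief disagreement during the timeout window is possible and would need to be tolerated by the application.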

Page 26

Outline

• A little history and perspective
  - What do we mean by “resilient”?
  - Traditional vs embedded approach
  - DARPA “built-in-test” program
• Cisco resilient router project
  - Brief overview of project
  - Our approach and partnership with OpenMPI
• Open Cluster Manager (OpenCM)

Page 27

OpenCM

• Transition Cisco work to open source
• Broaden mission
  - Extend to HPC, other embedded operations
  - Manage any collection of nodes
  - Resilient operation with hooks
      • MPI
      • Other application layers
• Released under the OpenMPI license
  - BSD-like, open use

Page 28

Concluding Remarks

http://www.open-mpi.org/