Resiliency – an Emerging “ility”

Dr. Richard Mayer

President, Speed of Light Consulting LLC

July 14, 2011


DOE Smart Grid Report 2009

3.6 Operates Resiliently to Disturbances, Attacks, and Natural Disasters

“Resiliency” refers to the ability of a system to react to events such that problematic consequences are isolated with minimal impact to the remaining system, and the overall system is restored to normal operation as soon as practical. These self-healing actions result in reduced interruption of service to consumers and help service providers more effectively manage the delivery infrastructure. Resiliency includes protection against all hazards, whether accidental or malicious, and needs to span natural disasters, deliberate attack, equipment failures, and human error. A smart grid inherently addresses security from the outset as a requirement for all the elements, and ensures an integrated and balanced approach across the system.

From the point of view of the Nation’s national security, this characteristic is arguably the most important. Resiliency in the face of adverse conditions or aggression, particularly high-consequence events, underlies all aspects of a smart grid and cuts across the other characteristics. Resiliency is embedded in operational culture: policy, procedures, and vigilance. It is embodied through effective risk management, with thorough understanding and management.


Resilience Engineering

http://www.resilience-engineering.org/

The term Resilience Engineering represents a new way of thinking about safety. Whereas conventional risk management approaches are based on hindsight…, Resilience Engineering looks for ways …to create processes that are robust yet flexible, to monitor and revise risk models, and to use resources proactively in the face of disruptions or ongoing production and economic pressures. (scalable, evolvable systems)


Examples of resilient design:
• Self-annealing, autonomous networks
• Uploadable spacecraft software
• TSAT communication system had provision for mobile ground control centers with reduced capability, in case fixed control center was attacked.


Is it resilience or resiliency?
• Dictionary.com says they are the same.
• Resilient = able to absorb/deal with a deformation/disruption and return to a previous/functioning state.
• Resilience = property of an object/system of being resilient
• Resiliency = abstract property of being resilient, e.g., as a subject for study. Probably not needed; resilience would suffice.


Scott Jackson, Architecting Resilient Systems: Accident Avoidance, Survival and Recovery from Disruptions.

INCOSE Webinar December 16, 2009

Definition from Westrum:
1. The ability to anticipate a disruption and prevent something bad from happening – avoidance
2. The ability to prevent something from getting worse – survival
3. The ability to recover from something bad once it has happened – recovery
Note: According to Westrum, a system only needs two of these three abilities to be resilient.

Operational resiliency has three basic descriptive properties (Caralli et al. 2006):
1. ability to change (adapt, expand, conform, contort) when a force is enacted,
2. ability to perform adequately or minimally while the force is in effect,
3. ability to return to a predefined, expected normal state whenever the force relents or is rendered ineffective. [I would add: “or to be returned” by outside intervention]
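Westrum's two-of-three note above can be read as a simple predicate. The sketch below is only an illustration of that reading; the class name, field names, and the `required` parameter are hypothetical and do not come from Westrum or Jackson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WestrumAbilities:
    """Hypothetical record of which of Westrum's three abilities a system has."""
    avoidance: bool  # anticipate a disruption and prevent something bad
    survival: bool   # prevent something from getting worse
    recovery: bool   # recover once something bad has happened


def is_resilient(abilities: WestrumAbilities, required: int = 2) -> bool:
    """Return True when at least `required` of the three abilities are present."""
    present = sum((abilities.avoidance, abilities.survival, abilities.recovery))
    return present >= required


# A system that cannot avoid a disruption but can survive it and recover.
no_avoidance = WestrumAbilities(avoidance=False, survival=True, recovery=True)
# A system that can only recover after the fact.
recovery_only = WestrumAbilities(avoidance=False, survival=False, recovery=True)

print(is_resilient(no_avoidance))
print(is_resilient(recovery_only))
```

Treating each ability as a yes/no flag is a simplification; in practice each would be a matter of degree, as the accident comparison later in the deck suggests.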


Distinguish resilience from prevention

• Semantics: compare to elasticity. To a mechanical engineer (but not to ordinary people) a rubber band and an I-beam are both elastic; they retain their original shape after the force is removed.

• I would say that the rubber band is resilient to a point, while the I-beam is insusceptible to a point, then not resilient at all.

• Resilience implies some disruption in performance, and some degree of recovery.


Relationship to other “ilities”: Reliability

Reliability vs. resilience: Neither mutually exclusive, nor identical. Reliability is the probability of achieving specified performance over a specified period.

DAG: Reliability requirements address mission reliability and logistics reliability. …Both … typically include resilience to some component failures or even attack by having built-in redundancy and maintenance.

Reliability is calculated on a specific design. Resilience includes going outside the specific design to restore function by external intervention. The difference could become blurred if you include, for example, software patches to a spacecraft to implement a workaround after a failure. If the workaround was anticipated in the maintenance plan, this could be reliability; if it was developed only in response to an incident, I would call it resilience and not reliability. So a resiliency requirement might be to have modular and upgradeable software. Thus open-system architecture is an enabler of resilience.

No single-point failures: a common requirement for spacecraft (always some exceptions). A factor in reliability. Also a factor in enabling resilience.
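As a rough illustration of "reliability is calculated on a specific design," the sketch below computes mission reliability for a single unit and for a redundant pair. It assumes a constant failure rate (exponential model) and independent, identical units; neither assumption, nor the function names and numbers, comes from the slides.

```python
import math


def mission_reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability that one unit survives the mission, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def parallel_reliability(unit_reliability: float, n_units: int) -> float:
    """Probability that at least one of n independent, identical units survives."""
    return 1.0 - (1.0 - unit_reliability) ** n_units


# Hypothetical numbers: 1e-5 failures per hour over a 10,000-hour mission.
single = mission_reliability(1e-5, 10_000)
dual = parallel_reliability(single, 2)

print(f"single unit:    {single:.4f}")  # 0.9048
print(f"redundant pair: {dual:.4f}")    # 0.9909
```

Removing the single-point failure raises the computed reliability of the design. What the calculation does not capture is the resilience case described above: a fix developed only after the incident, from outside the original design.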


Some Accidents: Instances of (Non)Resilience?

Event | NY Power Grid 9-11 | LA Metrolink 111 | Challenger
Disruption | A: Input | B: Systemic | B: Systemic
Capacity | Excellent | Deficient | OK
Flexibility | Excellent | Deficient | Deficient
Tolerance | Good | Deficient | OK
Information Flow | Good | Deficient | Deficient
Collaboration/Decision Making | Excellent | Deficient | Deficient

Power restoration on 9/11: systems worked together. Enablers were there: distributed generators, emergency generators, communications, C&C. The resilience was not in any one system.

Challenger: a resilience problem? If you include program management and NASA “culture” as the SoS of building and operating the shuttle, then perhaps it is.

Metrolink 111: Systemic deficiencies: factors 1 and 2 of Westrum: avoidance and damage control. Rail network was deficient. Train or traffic control was deficient. There should have been prevention in the Metrolink train. A system of systems. The engineer is a “system” separate from the train.


Resilience is Primarily for System of Systems

• Resiliency can apply to a single system, but it is much more relevant to a system of systems (SoS).

– SoS can have much more resiliency than individual systems acting alone.

– Role of a system: to enable or contribute to SoS resiliency.

– Self-healing is part of resiliency. But SoS intervention is probably a bigger part.

• Most disaster responses of the type discussed by Jackson involve a system of systems working or not working in cooperation.

• Deming on Quality applies to resilience: management thinks 80% of failures are labor’s fault, and labor thinks 80% are management’s fault, & labor is right: management has the ability to do something to prevent 80% of the failures.

• Easier to change the environment than to change human nature


Heuristics (Collected by Jackson)

• Capacity Heuristics:
– Absorption/margin, functional redundancy, physical redundancy
– Context spanning: System should be designed to both worst case and most likely scenarios (Madni)

• Flexibility Heuristics:
– Self-reorganization (Woods), organizational flexibility
– Human backup/in-the-loop/in control
– Predictability, simplicity/complexity avoidance, loose coupling
– Rule from Deming, Quality: If you must change the human or change the environment, change the environment.

[Slide callouts: Modularity; Human vs. automated pendulum; “$$” (shown twice)]


Heuristics (Continued)

• Tolerance Heuristics:
– Graceful degradation, drift correction, mobility, prevention, deterrence

• Inter-Element Collaboration Heuristics:
– The human operator should be informed (Billings)
– Maximize knowledge between nodes (Billings)
– Intent awareness: knowledge of the others’ intent and back up each other... (Madni and Billings)
– No inter-element impediments to collaboration. (Jackson)


Implementing Resilience: ABB Paper on a Smarter Grid

Note the Resilience Enablers

Characteristic | Current Grid | Smart Grid
Communications | None or one-way; typically not real-time | Two-way, real-time
Customer interaction | Limited | Extensive
Metering | Electromechanical | Digital (enabling real-time pricing and net metering)
Operation | Manual equipment checks, maintenance | Remote monitoring, predictive, time-based maintenance
Generation | Centralized | Centralized and distributed
Power flow control | Limited | Comprehensive, automated
Reliability | Prone to failures and cascading outages; essentially reactive | Automated, pro-active protection; prevents outages before they start
Restoration following disturbance | Manual | Self-healing


Jackson: Some Conclusions About Disruptions

• Failures may occur when all the components of a system function as designed; these are called systemic failures. The system “failed” because it encountered a situation not envisioned in development and therefore not elaborated in a requirement. (See Context-spanning heuristic)

• The large number of possible interactions among the elements of a system causes the probability of an accident to be much larger than the individual failures would imply. ( sum of individual failures)
– Example: sneak paths in software or hardware that only get triggered in rare situations, or in failure situations, or after unplanned disruptions. Are they failures?
– Practically impossible to ferret out all SW “bugs” in today’s systems. Resiliency might mean the capability to restart or go to a safe mode if the SW gets lost (as opposed to finding all the bugs). SW reliability reqt?
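The "restart or go to a safe mode" idea in the last point can be sketched as a small supervisor loop. This is a minimal illustration under stated assumptions, not flight-software practice: the function names, the restart limit, and the use of an exception to signal "the SW gets lost" are all hypothetical.

```python
from typing import Callable


def run_with_safe_mode(
    step: Callable[[], None],
    enter_safe_mode: Callable[[], None],
    max_restarts: int = 2,
) -> str:
    """Run `step`; restart it after a failure, and fall back to safe mode
    once the initial attempt and all restarts have failed."""
    for _attempt in range(max_restarts + 1):
        try:
            step()
            return "nominal"
        except Exception:
            # The fault is not diagnosed here; the supervisor simply retries.
            continue
    enter_safe_mode()
    return "safe_mode"


def make_flaky_step(failures: int) -> Callable[[], None]:
    """Hypothetical task that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def step() -> None:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("software lost its state")

    return step


safe_mode_log = []
print(run_with_safe_mode(make_flaky_step(1), lambda: safe_mode_log.append("safe")))
print(run_with_safe_mode(make_flaky_step(5), lambda: safe_mode_log.append("safe")))
```

The supervisor never tries to find the bug. It only bounds the consequence of an unknown fault, which is the distinction the slide draws between resilience and finding all the bugs.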


Comments on Jackson’s Conclusions

• Causes of catastrophes are beyond the domains of traditional reliability, safety and other reductionist approaches.
– Not entirely. What do we mean by “the cause”? Cause of disruption (terrorist attack)? Causes of damage (not designed to withstand an intense fire, reactor not designed to survive power loss) are not “beyond the domains.”

• My take: Addressing the response may be beyond traditional domains. Resiliency is largely beyond the domains of the traditional approaches, because response most likely must take place at the SoS level.

• Disruptions (e.g., human error) are not the cause of catastrophes; they simply initiate them. This is semantics. The initiation is a cause, but not the only cause, especially of the extent of the damage and the inability to respond. [Train accident in LA. No fail-safe measures in either train or in the rail system.]

• A system can be architected to create resilience. I would say: a system can be architected to enable and achieve a measure of resilience. A System of Systems can be empowered to achieve resilience of its systems.

• The primary aspects of resilience are adaptability, risk and culture. Risk and culture are not properties of a system. Risk assessment can drive creation of resilience.

• Future work includes: (1) validation of heuristics, (2) development of metrics, (3) others. I would nominate:
1. Determining and requiring enablers
2. Analyzing resilience as a requirement in SoS architecture and operations.


What to Do with Resilience?

1. Resilience is enabled by system requirements: status data, modularity, mobility, intrusion detection, etc., and achieved by SoS architecture and operations.

2. Is resilience something to be addressed in its own right, or is it adequately enabled under existing “ilities”? How do you develop system/element requirements to make them resilient?

3. How much “extra capacity” can we afford to achieve the objectives? What is the value or cost/benefit equation when lives are at stake?

4. An approach to flush out enablers: Scenario study. If the unthinkable happens…
– What should “we” (the country, company, …) have in place to deal with it?
– We want to minimize the loss of life.
– We want all emergency responders to be able to communicate.
– We want to restore service (power grid, transportation …)
– We want no permanent loss of our data.


Resources
• INCOSE Resilient Systems Working Group http://www.incose.org/practice/techactivities/wg/rswg/
• http://www.resilience-engineering.org/
• Scott Jackson. Architecting Resilient Systems: Accident Avoidance and Survival and Recovery from Disruptions. Wiley. 2009
• Cognitive Technologies Laboratory http://www.ctlab.org/
• Ashgate Publishing's Resilience Engineering Perspectives Series: Remaining Sensitive to the Possibility of Failure, which compiles papers from the November 2007 2nd Symposium on Resilience Engineering in Juan-les-Pins, France. Resilience Engineering: Concepts and Precepts.
• The 3rd International Symposium on Resilience Engineering
• Resilience Engineering: Concepts and Precepts. By Erik Hollnagel, David D. Woods, Nancy Leveson
• CERT – Resiliency Management http://www.cert.org/resiliency/ (emphasis on Information Technology)
• Center for Resilience at the Ohio State University http://www.resilience.osu.edu/CFR-site/aboutus.htm
