Risk analysis: managing uncertainty
GOAL: be prepared for whatever happens
Risk analysis should be done for ALL PHASES of a project:
--planning phase
--development phase
--the product itself
Identify risks: What could you have done during the planning stage to manage each of these “risks”?
How likely is it (what is the probability) that each one will occur?
How likely is it (what is the probability) that more than one will occur?
What actions will best manage the risk if it occurs?
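The two probability questions above can be made concrete with a small calculation. This is a minimal sketch that assumes the risks are statistically independent, which the notes do not state and which is often not true on a real project. The function names and the example probabilities are hypothetical.

```python
def prob_none(probs):
    """Probability that no risk occurs, assuming independent risks."""
    result = 1.0
    for p in probs:
        result *= (1.0 - p)
    return result


def prob_exactly_one(probs):
    """Probability that exactly one risk occurs, assuming independent risks."""
    total = 0.0
    for i, p in enumerate(probs):
        term = p
        for j, q in enumerate(probs):
            if j != i:
                term *= (1.0 - q)
        total += term
    return total


def prob_more_than_one(probs):
    """Probability that two or more risks occur, assuming independent risks."""
    return 1.0 - prob_none(probs) - prob_exactly_one(probs)


# Hypothetical per-risk probabilities for three risks.
example = [0.10, 0.20, 0.05]
print(round(prob_more_than_one(example), 3))
```

Even when each individual risk is unlikely, the chance that several occur together is not zero, which is why the question is worth asking during planning.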
During planning, a Risk Table can be generated:
Risks | Type* | Probability | Impact | Plan (Pointer)
System not available | | | |
Hardware failure | | | |
Color printer unavailable | | | |
Personnel absent (one meeting) | | | |
Personnel unavailable (several meetings) | | | |
Personnel have left project | | | |
*Type: Performance (product won’t meet requirements); Cost (budget overruns); Support (project can’t be maintained as planned); Schedule (project will fall behind)
Probability: of this risk occurring
Impact: e.g., catastrophic, critical, marginal, negligible
risk management—identify, plan for risks
The table is then sorted by probability and impact, and a “cutoff line” is defined. Every risk above this line must be managed (with a management plan pointed to in the last column).
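The sort-and-cutoff step can be sketched in a few lines. The notes do not say how probability and impact are combined when sorting, so this sketch assumes probability is the primary key (highest first) and impact severity breaks ties. The risk names come from the table in these notes, but the types, probabilities, impacts, and the cutoff position are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass

# Impact categories from the notes, most severe first.
IMPACT_ORDER = ["catastrophic", "critical", "marginal", "negligible"]


@dataclass
class Risk:
    name: str
    risk_type: str      # Performance, Cost, Support, or Schedule
    probability: float  # estimated probability of occurrence, 0.0 to 1.0
    impact: str         # one of IMPACT_ORDER
    plan: str = ""      # pointer to the management plan, if any


def sort_risk_table(risks):
    """Sort by probability (highest first), then by impact (most severe first)."""
    return sorted(
        risks,
        key=lambda r: (-r.probability, IMPACT_ORDER.index(r.impact)),
    )


def managed_risks(sorted_risks, cutoff):
    """Return the rows above the cutoff line; these need a management plan."""
    return sorted_risks[:cutoff]


# Hypothetical entries; all values other than the names are made up.
table = sort_risk_table([
    Risk("Hardware failure", "Schedule", 0.10, "critical"),
    Risk("Personnel absent (one meeting)", "Schedule", 0.60, "negligible"),
    Risk("Personnel have left project", "Schedule", 0.10, "catastrophic"),
    Risk("Color printer unavailable", "Support", 0.30, "marginal"),
])

for risk in managed_risks(table, cutoff=3):
    print(risk.name, risk.probability, risk.impact)
```

Where the cutoff line goes is a project decision; here it is simply a row count passed in by the caller.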
Useful reference: Embedded Systems Programming, Nov. 2000 (examples): http://www.embedded.com/2000/0011/0011feat1.htm
Additional interesting reference: H. Petroski, To Engineer is Human: The Role of Failure in Successful Design, Vintage, 1992.
Important concepts for embedded systems:
Risk = Probability of failure * Severity
Increased risk means decreased safety
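The risk formula can be worked through numerically. The notes give impact as categories rather than numbers, so this sketch assumes a hypothetical 1 to 4 severity scale. Both the scale and the example probabilities are illustrative assumptions.

```python
# Hypothetical numeric scale for the impact categories used in these notes.
SEVERITY = {"negligible": 1, "marginal": 2, "critical": 3, "catastrophic": 4}


def risk_score(probability, severity):
    """Risk = probability of failure * severity (severity on the assumed scale)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    return probability * SEVERITY[severity]


# A rare catastrophic failure versus a frequent marginal one.
rare_catastrophic = risk_score(0.01, "catastrophic")
frequent_marginal = risk_score(0.20, "marginal")
print(rare_catastrophic < frequent_marginal)
```

On this assumed scale the frequent marginal failure scores higher than the rare catastrophic one. That shows how sensitive the ranking is to the severity weights, so they need to be chosen deliberately.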
Possible causes of safety failures:
--incorrect or incomplete specification
--bad design
--improper implementation
--faulty component
--improper use
RELIABILITY: “what is the probability of failure?”
Some ways to determine reliability:
--product performs consistently as expected
--MTBF (mean time between failures) is long
--system behavior is DETERMINISTIC
--system responds or FAILS GRACEFULLY to out-of-bounds or unexpected conditions and recovers if possible
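MTBF can be estimated from field data and turned into a reliability figure. This sketch assumes a constant failure rate (the exponential model), under which the probability of running for time t without failure is exp(-t / MTBF). The notes do not state that assumption, and the function names and numbers are hypothetical.

```python
import math


def mtbf(total_operating_hours, failure_count):
    """Mean time between failures: total operating time / number of failures."""
    if failure_count <= 0:
        raise ValueError("need at least one observed failure to estimate MTBF")
    return total_operating_hours / failure_count


def reliability(mission_hours, mtbf_hours):
    """Probability of no failure during the mission, assuming a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)


# Hypothetical data: 4 failures observed over 10,000 operating hours.
estimated_mtbf = mtbf(10_000.0, 4)
print(estimated_mtbf)
print(round(reliability(100.0, estimated_mtbf), 4))
```

Under this model, a mission as long as the MTBF itself completes without failure only about 37% of the time. "MTBF is long" therefore has to be judged relative to the intended operating period.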
Definitions:
Fault: incorrect or unacceptable state or condition
Fault duration and frequency determine classification:
--transient: from an unexpected external condition (“soft”)
--intermittent: unstable hardware or marginal design
  periodic / aperiodic
--permanent: failed component, e.g. (“hard”)
Error: static, inherent characteristic of system
Failure: dynamic, occurs at specific time
Possible fault consequences:
--inappropriate action
--timing: event occurs too early or too late
--sequence of events incorrect
--quantity: wrong amount of energy or substance used
Achieving reliability:
safe design
fault detection
fault management
fault tolerance: system recovers, fault not detected (e.g., packet transfers)
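The packet-transfer example can be sketched as a checksum plus retry: a corrupted packet is caught internally and resent, so the caller never sees the fault. This is one possible illustration, not a specific protocol. The `FlakyChannel` class, the function names, and the retry limit are hypothetical; `zlib.crc32` is the standard-library CRC-32.

```python
import zlib


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_packet(packet: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    if len(packet) < 4:
        return None
    payload, crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None


class FlakyChannel:
    """Hypothetical channel that corrupts its first `bad_sends` transmissions."""

    def __init__(self, bad_sends: int):
        self.bad_sends = bad_sends
        self.sends = 0

    def transfer(self, packet: bytes) -> bytes:
        self.sends += 1
        if self.sends <= self.bad_sends:
            corrupted = bytearray(packet)
            corrupted[0] ^= 0xFF  # flip every bit of the first byte
            return bytes(corrupted)
        return packet


def reliable_transfer(channel, payload: bytes, max_attempts: int = 3) -> bytes:
    """Resend until the CRC checks out; fail only after max_attempts tries."""
    packet = make_packet(payload)
    for _ in range(max_attempts):
        received = check_packet(channel.transfer(packet))
        if received is not None:
            return received
    raise IOError("transfer failed after retries")


channel = FlakyChannel(bad_sends=2)
print(reliable_transfer(channel, b"hello"), channel.sends)
```

If the retries run out, the fault becomes a visible failure. At that point fault management, rather than fault tolerance, has to take over.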
Definition of reliability for an embedded system: the probability that a failure is detected by the user is less than a specified threshold
Examples—section 8.5—read these carefully!
Ariane 5 rocket: register overflow, a 64-bit word was assigned to a 16-bit register in a reused subsystem
Mars Pathfinder mission (1997): lower-priority tasks were allowed to hog resources, so higher-priority tasks could not execute (priority inversion)
2004 Mars mission: file management problems
Many more examples in articles at embedded.com, including some information on current Toyota problems
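The Ariane 5 entry above is an instance of a general hazard: storing a wide value in a narrow one. The sketch below illustrates that hazard and two defensive alternatives. It is not a reconstruction of the flight software, and the function names and test value are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Mimic a raw 16-bit store: keep the low 16 bits and reinterpret as signed."""
    raw = int(value) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw


def to_int16_checked(value: float) -> int:
    """Refuse the conversion when the value does not fit in 16 bits."""
    n = int(value)
    if n < INT16_MIN or n > INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return n


def to_int16_saturating(value: float) -> int:
    """Clamp to the nearest representable 16-bit value instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


reading = 40000.0  # hypothetical value larger than the 16-bit range
print(to_int16_unchecked(reading))   # wraps to a wrong, negative number
print(to_int16_saturating(reading))  # clamps to the largest 16-bit value
```

Whether to raise an error or saturate depends on what the rest of the system can safely do with each outcome. That question should be asked again whenever a subsystem is reused under new operating conditions.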
How do we define safety?
One criterion:
“single point”: failure of a single component will not lead to unsafe condition
“common-mode failure”: failure of multiple components due to a single failure event will not lead to an unsafe condition
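One common way to meet the single-point criterion is redundancy with voting. The sketch below is a 2-of-3 voter over redundant sensor readings. The function name, the tolerance, and the readings are hypothetical, and the notes do not prescribe this technique.

```python
def majority_vote(a, b, c, tolerance):
    """Return a value agreed on by at least two of three readings, else None.

    Two readings agree when they differ by no more than `tolerance`.
    The result is the average of the first agreeing pair found.
    """
    for x, y in ((a, b), (a, c), (b, c)):
        if abs(x - y) <= tolerance:
            return (x + y) / 2.0
    return None  # no two readings agree: caller should enter a safe state


# One of three hypothetical sensors has failed and reads far too high.
print(majority_vote(100.0, 100.4, 250.0, tolerance=0.5))
```

A voter like this masks the failure of any one sensor, but not a common-mode failure. If all three sensors share a power supply or the same design flaw, a single event can make two of them agree on a wrong value.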
Safety must be considered THROUGHOUT the project
fig_08_00
Embedded system design—project components
Development process (“waterfall model”): Analysis-->Design-->Implement-->Test-->Maintain
Alternative process models: Need risk analysis AT EACH INCREMENT
(A=analysis, D=design, I=implement, T=test, M=maintenance)
Basic waterfall model: A-->D-->I-->T-->M
Prototyping: A-->D-->I-->T-->M
Incremental: A-->D-->I-->T-->M-->A-->D-->I-->T--> ……-->M
Component based: A-->D-->Library-->Integrate-->T-->M I
Specifications:
Identify hazards
Calculate risk
Define safety measures
The specification document should include the safety standards and guidelines with which the system complies,
e.g.: Underwriters Laboratories, FCC, FDA, FAA, AEC, NASA, ISO, etc.
fig_08_03
Example:
Good for debugging stage, allows “controlled crash”
Not robust enough for final code
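The figure itself is not available in these notes. One technique that fits the two remarks above is an assertion, and the sketch below assumes that reading. It is not taken from the figure, and the function names and the duty-cycle example are hypothetical.

```python
def set_duty_cycle_debug(percent):
    """Debug-stage version: stop immediately ("controlled crash") on bad input."""
    assert 0 <= percent <= 100, "duty cycle out of range"
    return percent / 100.0


def set_duty_cycle_robust(percent):
    """Final-code version: clamp bad input and report that clamping happened."""
    clamped = percent < 0 or percent > 100
    if percent < 0:
        percent = 0
    elif percent > 100:
        percent = 100
    return percent / 100.0, clamped


print(set_duty_cycle_robust(150))
```

In Python, assertions are removed when the interpreter runs with `-O`, so the debug version provides no protection in an optimized build. The robust version keeps its check and lets the caller decide how to react.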