Developing Safety Critical Software: Fact and Fiction, John A McDermid
Post on 20-Dec-2015
Developing Safety Critical Software: Fact and Fiction
John A McDermid
Overview
  Fact – costs and distributions
  Fiction – get the requirements right
  Fiction – get the functionality right
  Fiction – abstraction is the solution
  Fiction – safety critical code must be “bug free”
  Some key messages
Part 1
  Fact – costs and distributions
  Fiction – get the requirements right
Costs and Distributions
  Examples of industrial experience
    – Specific example
    – Some more general observations
  Example covers
    – Cost by phase
    – Where errors are introduced
    – Where errors are detected
  and their relationships
Process Phases
  From System Specification, via Software Engineering, to System Integration
[Pie chart: Effort/Cost by Phase]
  System Specification: 25%
  System Integration: 17%
  Low Level Software Test: 17%
  Software Implementation: 10%
  Reviews and Inspections: 8%
  Management: 8%
  Software Integration Test: 7%
  Software Design: 3%
  Other Software: 3%
  Software Static Analysis: 1%
  Hardware/Software Integration: 1%
Error Introduction
[Chart: errors raised against User Requirements, System Requirements, Document Traceability, Hardware and Software, classified as FE, Min FE or No FE]
  FE = Functional Effect; Min FE is typically a data change
Finding Requirements Errors
[Chart: errors raised (FE, Min FE, No FE) against the phases on the pie chart, through to System Validation]
  Requirements testing tends to find requirements errors
  Chart annotation: “Errors introduced here…”
Result - High Development Cost
[Chart: errors raised by phase for requirement errors with functional effect (FE), minor functional effect (Min FE) and no functional effect (No FE)]
  Chart annotation: “Errors introduced here… are not found until here”
Result - High Development Cost
[Chart: errors raised by phase for requirement errors with functional effect (FE), minor functional effect (Min FE) and no functional effect (No FE), after following a safety critical development process]
  Chart annotation: “Errors introduced here… are not found until here”
Software and Money
  Typical productivity
    – 5 Lines of Code (LoC) per person day
      1 kLoC per person year
    – Requirements to end of module test
  Typical avionics “box”
    – 100 kLoC
    – 100 person years of effort
    – Circa £10M for software, so £500M on a modern aircraft?
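The figures on this slide are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only the slide's numbers; the derived quantities (working days per year, cost per person year, cost per line, and the number of “boxes” implied by £500M) are inferences for illustration, not figures stated by the author.

```python
# Figures quoted on the slide.
LOC_PER_PERSON_DAY = 5
LOC_PER_PERSON_YEAR = 1_000
BOX_SIZE_LOC = 100_000
BOX_COST_GBP = 10_000_000
AIRCRAFT_COST_GBP = 500_000_000


def derived_figures() -> dict:
    """Derive the quantities the slide implies but does not state."""
    person_years = BOX_SIZE_LOC / LOC_PER_PERSON_YEAR
    return {
        "working_days_per_year": LOC_PER_PERSON_YEAR / LOC_PER_PERSON_DAY,
        "person_years_per_box": person_years,
        "cost_per_person_year_gbp": BOX_COST_GBP / person_years,
        "cost_per_loc_gbp": BOX_COST_GBP / BOX_SIZE_LOC,
        "boxes_per_aircraft": AIRCRAFT_COST_GBP / BOX_COST_GBP,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:,.0f}")
```

On these numbers, a delivered line of safety critical code costs on the order of £100, and £500M corresponds to roughly fifty boxes of the “typical” size.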
US Aircraft Software Dependence
[Chart: % of functions performed by software (0 to 90) against year (1960, 1964, 1970, 1975, 1982, 1990, 2000) for the F-4, A-7, F-111, F-15, F-16, B-2 and F-22]
  Source: DoD Defense Science Board Task Force on Defense Software, November 2000
Increasing Dependence
  Software often determinant of function
  Software operates autonomously
    – Without opportunity for human intervention, e.g. Mercedes Brake Assist
  Software affected by other changes
    – e.g. new weapons fit on EuroFighter
  Software has high levels of authority
    – Inappropriate centre of gravity (CofG) control in fuel system can reduce fatigue life of wings
Growing Dependency
  Problem is growing
    – Now about a third of aircraft development costs
    – Increasing proportion of car development
      Around 25% of capital cost of new cars in electronics
    – Problem made more visible by rate of improvements in tools for “mainstream” software development
Growth of Airborne Software
[Chart: code size in kLoC (log scale, 1 to 100,000) against in-service date, 1980 to 2014]
  Approx £1.5B at current productivity and costs
The Problem - Size matters
[Chart: project size in function points (0 to 12,000) against probability of software project being cancelled (5% to 50%)]
  Source: Capers Jones, Becoming Best In Class, Software Productivity Research, 1995 briefing
  1 function point = 80 SLOC of Ada
  1 function point = 128 SLOC of C
Is Software Safety an Issue?
  Software has a good track record
    – A few high profile accidents
      Therac 25
      Ariane 501
      Cali (strictly data not software)
    – Analysis of 1,100 “computer related deaths”
      Only 34 attributed to software
  Chinook - Mull of Kintyre
    – Was this caused by FADEC software?
But Don’t be Complacent
  Many instances of “pilot error” are system assisted
  Software failures typically leave no trace
  Increasing software complexity and authority
  Can’t measure software safety (no agreement)
  Unreliability of commercial software
  Cost of safety critical software
Summary
  Safety critical software a growing issue
    – Software-based systems are dominant source of product differentiation
    – Starting to become a major cost driver
    – Starting to become the drive (drag) on product development
      Can’t cancel, have to keep on spending!!!
    – Not major contributor to fatal accidents
      Although many incidents
Requirements Fiction
  Fiction stated
    – Get the requirements right, and the development will be easy
  Facts
    – Getting requirements right is difficult
    – Requirements are biggest source of errors
    – Requirements change
    – Errors occur at organisational boundaries
Embedded Systems
  Computer system embedded in larger engineering system
  Requirements come from
    – “Flow down” from system
    – Design decisions (commitments)
    – Safety and reliability analyses
      Derived safety requirements (DSRs)
    – Fault management/accommodation
      As much as 80% for control applications
Almost Everything on One Picture
  NB Based on Parnas’ four variable model
Almost Everything on One Picture
[Diagram labels: Platform; System; S1, S2, S3, A1; IN; OUT; Control Interface; Control System & Software; SOFTREQ]
  REQ = restriction on NAT
    – Control loops, high level modes, end to end response times, etc.
  Physical decomposition of system, to sensors and actuators plus controller
  SOFTREQ specifies what control software must do
  REQ = IN ∘ SOFTREQ ∘ OUT
Almost Everything on One Picture
[Diagram labels: Platform; System; S1, S2, S3, A1; IN; OUT; Control Interface; Control I/F; HAL; Control System & Software; I/P; SPEC; O/P]
  REQ = restriction on NAT
    – Control loops, high level modes, end to end response times, etc.
  Input Fn, including signal validation
  Output Fn, including loop closing
  Redefinition of SOFTREQ allowing for digitisation noise, sensor management, actuator dynamics
  Functional decomposition of software. Mapping of control functions to generic architecture.
  SOFTREQ = I/P ∘ SPEC ∘ O/P
Almost Everything on One Picture
[Diagram labels: Platform; System; S1, S2, S3, A1; IN; OUT; Control Interface; Control I/F; HAL; Control System & Software; I/P; SPEC; O/P; Application; Data Selection; FMAA; Controller Structure]
  REQ = restriction on NAT
    – Control loops, high level modes, end to end response times, etc.
  Input Fn, including signal validation
  Output Fn, including loop closing
  Redefinition of SOFTREQ allowing for digitisation noise, sensor management, actuator dynamics, data selection
  Physical decomposition of controller. Defines FMAA structure.
Types of Layer
  Some layers have design meaning
    – Abstraction from computing hardware
      Time in ms from reference, or ...
        – Not interrupts or bit patterns from clock hardware
    – The “System” HAL
      “Raw” sensed values, e.g. pressure in psia
        – Not bit patterns from analogue to digital converters
    – FMAA to Application
      Validated values of platform properties
    – May also have computational meaning
      e.g. call to HAL forces scheduling action
Commitments
  Development proceeds via a series of commitments
    – A design decision which can only be revoked at significant cost
    – Often associated with architectural decision or choice of component
      Use of triplex redundancy, choice of pump, power supply, etc.
    – Commitments can be functional or physical
      Most common to make physical commitments
Derived Requirements
  Commitments introduce derived requirements (DRs)
    – Choice of pump gives DRs for control algorithm, iteration rate, also requirements for initialisation, etc.
    – Also get derived safety requirements (DSRs), e.g. detection and management of sensor failure for safety
System Level Requirements
  Allocated requirements
    – System level requirements which come from platform
    – May be (slight) modification due to design commitments, e.g.
      Platform – control engine thrust to within ± 0.5% of demanded
      System – control EPR or N1 to within ± 0.5% of demanded
Stakeholder Requirements
  Direct requirements from stakeholders, e.g.
    – The radar shall be able to detect targets travelling up to Mach 2.5 at 200 nautical miles, with 98% probability
    – In principle allocated from platform
      In practice often stated in system terms
    – Need to distinguish legitimate requirements from “solutioneering”
      Legitimacy depends on the stakeholder, e.g. CESG and cryptos
Requirements Types
  Main requirements types
    – Invariants, e.g.
      Forward and reverse thrust will not be commanded at the same time
    – Functional – transform inputs to outputs, e.g.
      Thrust demand from thrust-lever resolver angle
    – Event response – action on event, e.g.
      Activate ATP on passing signal at danger
    – Non-functional (NFR) – constraints, e.g.
      Timing, resource usage, availability
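An invariant is the easiest of these types to state as an executable check. The sketch below encodes the thrust example as a predicate over a command record; the `ThrustCommand` type, its field names and the trace-checking helper are hypothetical names introduced for illustration, not part of the talk.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ThrustCommand:
    """One commanded state (hypothetical record for illustration)."""
    forward: bool
    reverse: bool


def thrust_invariant_holds(cmd: ThrustCommand) -> bool:
    """Forward and reverse thrust are never commanded at the same time."""
    return not (cmd.forward and cmd.reverse)


def trace_satisfies_invariant(trace: Iterable[ThrustCommand]) -> bool:
    """An invariant must hold in every state of a trace, not just at the end."""
    return all(thrust_invariant_holds(cmd) for cmd in trace)
```

The other types do not reduce to a single-state predicate: a functional requirement relates inputs to outputs, and an event response relates a state to a later one.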
Changes to Types
  Note requirements types can change – NFR to functional
    – System – achieve < 10^-5 per hour unsafe failures
    – Software – detect failure modes x, y and z of the pressure sensor P30 with 99% coverage, and mitigate by …
  Requirements notations/methods must be able to reflect requirements types
Requirements Challenges
  Even if systems requirements are clear, software requirements
    – Must deal with quantisation (sensors)
    – Must deal with temporal constraints (iteration rates, jitter)
    – Must deal with failures
  Systems requirements often tricky
    – Open-loop control under failure
    – Incomplete understanding of physics
Requirements Errors
  Project data suggests
    – Typically more than 70% of errors found post unit test are requirements errors
    – F-22 (and other data sets) put requirements errors at 85%
    – Finding errors drives change
      The later they are found, the greater the cost
      Some data, e.g. F-22, write 3 LoC for every one delivered
The Certainty of Change
  Change mainly due to requirements errors
    – high cost due to reverification in presence of dependencies
[Chart: cumulative % change (0 to 300) by module]
  Chart annotations: “The majority of modules are stable”; “20%”; “May verify all code 3 times!”
Requirements and Organisations
  Requirements errors are often based on misinterpretations (it’s obvious that …)
    – Thus errors (more likely to) happen at organisational/cultural boundaries
      Systems to software, safety to software …
    – Study at NASA by Robyn Lutz
      85% of requirements errors arose at organisational boundaries
Summary
  Getting requirements right is a major challenge
    – Software is deeply embedded
      Discretisation, timing etc. an issue
    – Physics not always understood
  Requirements (genuinely) change
    – Notion that can get requirements right is simplistic
      Notion of “correct by construction” optimistic
Part 2
  Fiction – get the functionality right
  Fiction – abstraction is the solution
  Fiction – safety critical code must be “bug free”
  Some key messages
Functionality Fiction
  Fiction stated
    – Get the functionality right, and the rest is easy
  Facts
    – Functionality doesn’t drive design
      Non-Functional Requirements (NFRs) are critical
      Functionality isn’t independent of NFRs
    – Fault management is a major aspect of complexity
Functionality and Design
  Functionality
    – System functions allocated to software
    – Elements of REQ which end up in SOFTREQ
      NB, most of them
    – At software level, requirements have to allow for properties of sensors, etc.
  Consider an aero engine example
Engine Pressure Block
Engine Pressure Sensor
  Aero engine measures P0
    – Atmospheric pressure
    – A key input to fuel control, etc.
  Example input P0_Sens
    – Byte from A/D converter
    – Resolution – 1 bit = 0.055 psia
    – Base = 2, 0 = low (high value ≈ 16)
    – Update rate = 50 ms
Pressure Sensing Example
  Simple requirement
    – Provide validated P0 value to other functions and aircraft
  Output data item
    – P0_Val
      16 bits
      Resolution – 1 bit = 0.00025 psia
      Base = 0, 0 = low (high value ≈ 16.4)
Example Requirements
  Simple functional requirement
    – RS1: P0_Val shall be provided within 0.03 bar of sensed value
    – R1: P0_Val = P0_Sens [± 0.03] (software level)
    – Note: simple algorithm
      P0_Val = (P0_Sens * 0.055 + 2)/0.00025
      P0_Sens = 0 → P0_Val = 8000 = 0001 1111 0100 0000 binary
      P0_Sens = 1111 1111 = 16.025 → P0_Val = 64100 = 1111 1010 0110 0100 binary
    – Does R1 meet RS1? Does the algorithm meet R1?
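The scaling algorithm can be checked in a few lines. Because 0.055 / 0.00025 = 220 and 2 / 0.00025 = 8000, the conversion is exact in integer arithmetic. The sketch below relies on that observation; the function and constant names are assumptions for illustration.

```python
COUNTS_PER_INPUT_BIT = 220   # 0.055 psia per input bit / 0.00025 psia per output bit
BASE_COUNTS = 8000           # input base of 2 psia / 0.00025 psia per output bit
OUTPUT_RESOLUTION_PSIA = 0.00025


def p0_val_from_sens(p0_sens: int) -> int:
    """Rescale the raw A/D byte P0_Sens to the 16-bit P0_Val count."""
    if not 0 <= p0_sens <= 0xFF:
        raise ValueError("P0_Sens is a single byte (0..255)")
    return p0_sens * COUNTS_PER_INPUT_BIT + BASE_COUNTS


def counts_to_psia(p0_val: int) -> float:
    """Interpret a P0_Val count as a pressure in psia (base 0)."""
    return p0_val * OUTPUT_RESOLUTION_PSIA


if __name__ == "__main__":
    for raw in (0, 0xFF):
        val = p0_val_from_sens(raw)
        print(raw, val, round(counts_to_psia(val), 5))
```

The rescaling itself adds no error, so the slide's two questions turn on other things: the input can only resolve steps of 0.055 psia, and RS1 states its tolerance in bar while R1 gives a bare ± 0.03 alongside values in psia.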
A Non-Functional Requirement
  Assume duplex sensors
    – P0_Sens1 and P0_Sens2
  System level
    – RS2: no single point of failure shall lead to loss of function (assume P0_Val is covered by this requirement)
      This will be a safety or availability requirement
      NB in practice may be different sensors wired to different channels, and cross channel comms
Software Level NFR
  Software level
    – R2: If | P0_Sens1 - P0_Sens2 | < 0.06
          then P0_Val = (P0_Sens1 + P0_Sens2)/2
          else P0_Val = 0
    – Is R2 a valid requirement?
      In other words, have we stated the right thing?
    – Does R2 satisfy RS2?
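R2 is small enough to write out directly, which makes the slide's two questions easier to explore. This is a minimal sketch: the function name and the use of floating-point pressure values (rather than scaled counts) are assumptions for illustration.

```python
AGREEMENT_THRESHOLD = 0.06  # from R2; twice the 0.03 tolerance in R1


def r2_validated_p0(p0_sens1: float, p0_sens2: float) -> float:
    """R2: average the two sensors when they agree, otherwise output 0."""
    if abs(p0_sens1 - p0_sens2) < AGREEMENT_THRESHOLD:
        return (p0_sens1 + p0_sens2) / 2
    return 0.0


if __name__ == "__main__":
    print(r2_validated_p0(14.70, 14.72))  # sensors agree
    print(r2_validated_p0(14.70, 2.00))   # sensors disagree
```

Two properties of this logic bear on the questions asked. With a base of 0, the fallback output 0 is itself a representable pressure, so a consumer cannot distinguish it from a reading by value alone. And a single failed sensor is enough to produce the fallback, which seems hard to reconcile with RS2.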
Temporal Requirements
  Timing is often an important system property
    – It may be a safety property, e.g. sequencing in weapons release
  System level
    – RS3: validated pressure value shall never lag sensed value by more than 100 ms
      NB not uncommon to ensure quality of control
Software Level Timing
  Software level requirement, assuming scheduling on 50 ms cycles
    – R3: P0_Val(t) = P0_Sens(t-2) [± 0.03]
    – If t is quantised in units of 50 ms, representing cycles
    – Is R3 a valid requirement?
    – Does R3 satisfy RS3?
      NB need data on processor timing to validate
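R3 describes a fixed two-cycle delay between sensing and output. The sketch below models that delay with a short buffer, stepping once per 50 ms cycle; the class name and the choice to return `None` until two cycles of history exist are assumptions for illustration.

```python
from collections import deque
from typing import Optional

CYCLE_MS = 50      # scheduling period assumed on the slide
DELAY_CYCLES = 2   # R3: P0_Val(t) = P0_Sens(t-2)


class TwoCycleDelay:
    """Outputs the sensed value from DELAY_CYCLES cycles earlier."""

    def __init__(self) -> None:
        # Holds the current sample plus the DELAY_CYCLES samples before it.
        self._history: deque = deque(maxlen=DELAY_CYCLES + 1)

    def step(self, p0_sens: float) -> Optional[float]:
        """Call once per cycle with the newly sensed value."""
        self._history.append(p0_sens)
        if len(self._history) <= DELAY_CYCLES:
            return None  # not enough history yet
        return self._history[0]


NOMINAL_LAG_MS = DELAY_CYCLES * CYCLE_MS
```

The nominal lag is exactly the 100 ms that RS3 allows, so any additional latency in sampling or execution within a cycle would use up margin that is not there, which is presumably why the slide notes that processor timing data is needed.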
Timing and Safety
  Software level
    – R4: If | P0_Sens1(t) - P0_Sens2(t) | < 0.06
          then P0_Val(t+1) = (P0_Sens1(t) + P0_Sens2(t))/2
          else if | P0_Sens1(t) - P0_Sens1(t-1) | < | P0_Sens2(t) - P0_Sens2(t-1) |
          then P0_Val(t+1) = P0_Sens1(t)
          else P0_Val(t+1) = P0_Sens2(t)
    – What does R4 respond to (can you think of an RS4)?
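R4 can be transcribed as a pure function of the current and previous samples from each sensor. This is a sketch of the rule as written, with hypothetical parameter names; it does not model the one-cycle output delay beyond returning the value intended for cycle t+1.

```python
AGREEMENT_THRESHOLD = 0.06  # from R4


def r4_next_p0(s1_now: float, s1_prev: float,
               s2_now: float, s2_prev: float) -> float:
    """Return P0_Val(t+1) from the sensor samples at t and t-1, following R4."""
    if abs(s1_now - s2_now) < AGREEMENT_THRESHOLD:
        # Sensors agree: use their mean.
        return (s1_now + s2_now) / 2
    # Sensors disagree: prefer the one that changed least since last cycle.
    if abs(s1_now - s1_prev) < abs(s2_now - s2_prev):
        return s1_now
    return s2_now


if __name__ == "__main__":
    print(r4_next_p0(14.70, 14.70, 9.00, 14.71))   # sensor 2 jumps: use sensor 1
    print(r4_next_p0(14.70, 14.70, 13.00, 13.50))  # sensor 1 unchanged: use sensor 1
```

Unlike R2, this rule always produces a value from one of the sensors. One consequence worth noting when validating it: a sensor frozen at a stale value shows zero change, so the second test will prefer it over a sensor that is correctly tracking a changing pressure.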
Requirements Validation
  Is R4 a valid requirement?
    – Is R4 “safe” in the system context (assume that misleading values of P0 could lead to a hazard, e.g. a thrust roll-back on take off)
  Does R4 satisfy RS3?
  Does R4 satisfy RS2?
  Does R4 satisfy RS1?
Real Requirements
  Example still somewhat simplistic
    – Need to store sensor state, i.e. knowledge of what has failed
  Typically timing, safety, etc. drive the detailed design
    – Aspects of requirements, e.g. error bands, depend on timing of code
    – Requirements involve trade-offs between, say, safety and availability
Requirements and Architecture
  NFRs also drive the architecture
    – Failure rate 10^-6 per hour
      Probably just duplex (especially if fail stop)
      Functions for cross comms and channel change
    – Failure rate 10^-9 per hour
      Probably triplex or quadruplex
      Changes in redundancy management
  NB change in failure rate affects low level functions
Quantification
  The “system level” functionality is in the minority
    – Typically over half is fault management
    – EuroFighter example
      FCS: 1/3 MLoC
      Control laws: 18 kLoC
  Note, very hard to validate
    – 777 flight incident in Australia due to error in fault management, and software change
Boeing 777 Incident near Perth
  Problem caused by Air Data Inertial Reference Unit (ADIRU)
    – Software contained a latent fault which was revealed by a change
  Timeline
    – June 2001: accelerometer #5 fails with erroneous high output values, ADIRU discards output values
    – Power cycle on ADIRU occurs each occasion aircraft electrical system is restarted
    – Aug 2005: accelerometer #6 fails, latent software error allows use of previously failed accelerometer #5
Summary
  Functionality is important
    – But not the primary driver of design
  Key drivers of design
    – Safety and availability
      Turns into fault management at software level
    – Timing behaviour
  Functionality not independent of NFRs
    – Requirements change to reflect NFRs
Abstraction Fiction
  Fiction stated
    – Careful use of abstraction will address problems of requirements etc.
  Fact
    – Most forms of abstraction don’t work in embedded control systems
      State abstraction is of some use
  The devil is in the detail
Data Abstraction
  Most data is simple
    – Boolean, integer, floating point
    – Complex data structures are rare
      May exist in a maintenance subsystem (e.g. records of fault events)
    – Systems engineers work in low-level terms, e.g. pressures, temperatures, etc.
      Hence requirements are in these terms
Control Models are Low Level
Looseness
  A key objective is to ensure that requirements are complete
    – Specify behaviour under all conditions
    – Normal behaviour (everything working)
    – Fault conditions
      Single faults, and combinations
    – Impossible conditions
      So design is robust against incompletely understood requirements/environment
Despatch Requirements
  Can despatch (use) system “carrying” failures
    – Despatch analysis based on Markov model
    – Evaluate probability of being in non-despatchable state, e.g. only one failure from hazard
    – Link between safety/availability process and software design
Fault Management Logic
  Fault-accommodation requirements may use four valued logic
    – Working (w), undetected (u), detected (d), and confirmed (c)
    – Table illustrates “logical and” ([.])
    – Used for analysis
    .   w   u   d   c
    w   w   u   d   c
    u   u   u   d   c
    d   d   d   d   c
    c   c   c   c   c
Example Implementation
    .   w   d   c
    w   w   d   c
    d   d   d   c
    c   c   c   c
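Read as a table, the “logical and” always yields the worse of its two operands under the ordering w < u < d < c. Under that reading the operator reduces to a maximum over an ordered enumeration. The sketch below is one way to express this, with hypothetical type and function names; it is an interpretation of the tables, not the author's implementation.

```python
from enum import IntEnum


class Health(IntEnum):
    """Fault states ordered from best to worst."""
    WORKING = 0
    UNDETECTED = 1
    DETECTED = 2
    CONFIRMED = 3


def health_and(a: Health, b: Health) -> Health:
    """Four-valued 'logical and': the combined state is the worse of the two."""
    return max(a, b)


# The implementation table drops UNDETECTED, since running software cannot
# observe a fault it has not detected. The remaining states are closed
# under health_and, so the same function covers the three-valued table.
IMPLEMENTATION_STATES = (Health.WORKING, Health.DETECTED, Health.CONFIRMED)
```

For example, `health_and(Health.UNDETECTED, Health.DETECTED)` gives `Health.DETECTED`, matching the u/d entry of the four-valued table.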
State Abstraction
  Some state abstraction is possible
    – Mainly low-level state to operational modes
  Aero engine control
    – Want to produce thrust proportional to demand (thrust lever angle in cockpit)
    – Can’t measure thrust directly
    – Can use various “surrogates” for thrust
      Work with best value, but reversionary models
Thrust Control
  Engine pressure ratio (EPR) – between atmosphere & the exhaust pressures
    – Best approximation to thrust
    – Depends on P0
      Low level state modelling “health” of P0 sensor
    – If P0 fails, revert to use N1 (fan speed)
    – Have control modes
      EPR, N1, etc. which abstract away from details of sensor fault state
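The state abstraction described here amounts to a mapping from low-level sensor health to a control mode. A minimal sketch follows. It assumes, purely for illustration, that P0 counts as available while at least one P0 sensor is in the working state; the real reversion logic is not given in the talk, and all names are hypothetical.

```python
from enum import Enum
from typing import Iterable


class SensorHealth(Enum):
    WORKING = "w"
    DETECTED = "d"
    CONFIRMED = "c"


class ControlMode(Enum):
    EPR = "EPR"  # engine pressure ratio, the preferred surrogate for thrust
    N1 = "N1"    # fan speed, the reversionary surrogate


def select_control_mode(p0_sensors: Iterable[SensorHealth]) -> ControlMode:
    """Abstract the P0 sensor fault states into a single control mode.

    Assumption for illustration: P0 is usable if any sensor is WORKING.
    """
    p0_available = any(h is SensorHealth.WORKING for h in p0_sensors)
    return ControlMode.EPR if p0_available else ControlMode.N1
```

Code above this function only needs to know the mode, not which sensor failed or how, which is the sense in which the modes abstract away from sensor fault state.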
Summary
Opportunity for abstraction much more limited than in “IT” systems
– Hinders many classical approaches
Abstraction is of some value
– Mainly state abstraction, relating low-level state information, e.g. sensor “health” to system level control modes
NB formal refinement, a la B, is helped by this, as little data refinement
Overview
Fact – costs and distributions
Fiction – get the requirements right
Fiction – get the functionality right
Fiction – abstraction is the solution
Fiction – safety critical code must be “bug free”
Some key messages
“Bug Free” Fiction
Fiction stated
– Safety critical code must be “bug free”
Facts
– It is hard to correlate fault density and failure rate
– <1 fault per kLoC is pretty good!
– Being “bug free” is unrealistic, and there is a need to “sentence” faults
Close to Fault Free?
DO 178A Level 1 software (engine controller) – now would be DAL A
– Natural language specifications and macro-assembler
– Over 20,000,000 hours without hazardous failure
– But on version 192 (last time I knew)
    Changes “trims” to reflect hardware properties
Pretty Buggy
DO 178B Level A software (aircraft system)
– Natural language, control diagrams and high level language
– 118 “bugs” found in first 18 months, 20% critical
– Flight incidents but no accidents
– Informally “less safe” than the other example, but still flying, still no accidents
Fault Density
So far as one can get data
– <1 flaw per kLoC for SC is pretty good
– Commercial much worse, may be as high as 30 faults per kLoC
– Some “extreme” cases
    Space Shuttle – 0.1 per kLoC
    Praxis system – 0.04 per kLoC
– But will a hazardous situation arise?
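To put these densities side by side, the sketch below scales each one to a program of the same size. The 100 kLoC program size is a hypothetical figure chosen only for illustration, and the `expected_faults` helper is likewise an assumption. The densities are the ones quoted on the slide, with “<1” and “as high as 30” treated as upper bounds.

```python
# Fault densities quoted on the slide, in faults per kLoC.
DENSITIES = {
    "safety critical (upper bound of 'pretty good')": 1.0,
    "commercial (upper bound)": 30.0,
    "Space Shuttle": 0.1,
    "Praxis system": 0.04,
}

HYPOTHETICAL_SIZE_KLOC = 100  # assumed program size, not from the slides


def expected_faults(size_kloc: float, faults_per_kloc: float) -> float:
    """Faults implied by a given density at a given program size."""
    return size_kloc * faults_per_kloc


for name, density in DENSITIES.items():
    faults = expected_faults(HYPOTHETICAL_SIZE_KLOC, density)
    print(f"{name}: about {faults:g} faults in {HYPOTHETICAL_SIZE_KLOC} kLoC")
```

The count says nothing about whether any of those faults can lead to a hazardous situation, which is the question the slide ends on.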
Faults and Failures
Why doesn’t software “crash” more often?
– Paths miss “bugs” as don’t get critical data
– Testing “cleans up” common paths
– Also “subtle faults” which don’t cause a crash
NB IBM OS
– 1/3 of failures were “3,000 year events”
[Figure: program execution space, showing program paths and bugs]
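The point that execution paths can miss a bug until a critical data value arrives can be sketched with a deliberately faulty routine. Everything here is hypothetical: the `scaled_reading` function, the 12-bit input range and the choice of which values count as the “common path” are assumptions made for illustration.

```python
def scaled_reading(raw: int) -> float:
    """Hypothetical routine with a data-dependent fault.

    It fails only when raw == 4095, the top of an assumed 12-bit range.
    """
    return 1000.0 / (4095 - raw)


def failing_inputs(inputs):
    """Return the inputs on which scaled_reading fails."""
    bad = []
    for raw in inputs:
        try:
            scaled_reading(raw)
        except ZeroDivisionError:
            bad.append(raw)
    return bad


common_path = range(0, 4000)  # values exercised in routine testing (assumed)
whole_space = range(0, 4096)  # the full assumed input range

print(len(failing_inputs(common_path)), "failures on the common path")
print(len(failing_inputs(whole_space)), "failure over the whole input space")
```

Testing confined to the common path reports the routine as clean, yet the fault is still present and waits for the one input value that reaches it.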
Commercial Software
Examples of data dependent faults?
– Loss of availability is acceptable
– Most SCS have to operate through faults
    Can’t “fail stop” – even reactor protection software needs to run circa 24 hours for heat removal
Pictures © 3BP.com
Retrospective Analysis
Retrospective analysis of US civil product for UK military use
– Analysis of over 500kLoC, in several languages
– Found 23 faults per kLoC, 3% safety critical
– Vast majority not safety critical
    NB most of the 3% related to assumptions, i.e. were requirements issues
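The figures on this slide imply absolute numbers that are worth working out. The sketch below does the arithmetic, treating 500 kLoC as a lower bound (the slide says “over 500kLoC”) and reading “3% safety critical” as 3% of the faults found. The variable names are hypothetical.

```python
KLOC_ANALYSED = 500          # lower bound: the slide says "over 500kLoC"
FAULTS_PER_KLOC = 23         # fault density found
SAFETY_CRITICAL_PERCENT = 3  # share of faults judged safety critical

# Integer arithmetic keeps the results exact.
total_faults = KLOC_ANALYSED * FAULTS_PER_KLOC
safety_critical_faults = total_faults * SAFETY_CRITICAL_PERCENT // 100
other_faults = total_faults - safety_critical_faults

print(f"at least {total_faults} faults in total")
print(f"of which about {safety_critical_faults} safety critical")
print(f"and about {other_faults} not safety critical")
```

So even at 3%, the analysis would have turned up a few hundred safety critical faults, most of them traceable to assumptions according to the slide.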
Find and Fix
If a fault is found it may not be fixed
– First it will be “sentenced”
    If not critical, it probably won’t be fixed
– Potentially critical faults will be analysed
    Can it give rise to a problem in practice?
    If decide not to change, document reasons
– Note: changes may bring (unknown) faults
    e.g. Boeing 777 near Perth
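The sentencing steps listed above can be sketched as a small decision function. This is an illustration of the flow on the slide, not a real sentencing procedure: the `Fault` record, its boolean fields and the outcome strings are hypothetical. The net-benefit test reflects the slide's note that a change may itself bring unknown faults.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    """Hypothetical record of what is known about a fault when sentenced."""
    ident: str
    potentially_critical: bool
    can_cause_problem_in_practice: bool = False
    change_has_net_benefit: bool = False  # benefit weighed against change risk


def sentence(fault: Fault) -> str:
    """Decide what to do with a fault, following the steps on the slide."""
    if not fault.potentially_critical:
        # Not critical: it probably won't be fixed.
        return "leave"
    # Potentially critical: analysed before any change is made.
    if fault.can_cause_problem_in_practice and fault.change_has_net_benefit:
        return "fix"
    # Decided not to change: the reasons must be documented.
    return "leave and document reasons"


print(sentence(Fault("F-001", potentially_critical=False)))
print(sentence(Fault("F-002", True, True, True)))
print(sentence(Fault("F-003", True, True, False)))
```

The third case is the uncomfortable one: a fault that could cause a problem in practice is left in place because changing the software is judged the greater risk.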
Perils of Change
[Figure: module dependency diagram]
Summary
Probably no safety critical software is fault free
– Less than 1 fault per kLoC is good
– Hard to correlate fault density with failure rate (especially unsafe failures)
In practice
– Sentence faults, and change if net benefit
Need to show presence of faults
– To decide if need to remove them
Overview
Fact – costs and distributions
Fiction – get the requirements right
Fiction – get the functionality right
Fiction – abstraction is the solution
Fiction – safety critical code must be “bug free”
Some key messages
Summary of the Summaries
Safety critical software
– Has a good track record
– Increased dependency, complexity, etc. mean that this may not continue
Much of the difficulty is in requirements
– Partly a systems engineering issue
– Many of the problems arise from errors in communication
– Classical CS approaches limited utility
Research Directions (1)
Advances may come at architecture
– Improve notations to work at architecture and implement via code generation
– Develop approaches, e.g. good interfaces, product lines, to ease change
– Focus on V&V, recognising that the aim is fault-finding
AADL an interesting development
Research Directions (2)
Advances may come at requirements
– Work with systems engineering notations
    Improve to address issues needed for software design and assessment, NB PFS
    Produce better ways of mapping to architecture
    Try to find ways of modularising, to bound impact of change, e.g. contracts
– Focus on V&V, e.g. simulation
Developments of Parnas/Jackson ideas?
Research Directions (3)
Work on automation, especially for V&V
– Design remains creative
– V&V is 50% of life-cycle cost, and can be automated
– Examples include
    Auto-generation of test data and test oracles
    Model-checking consistency/completeness
The best way to apply “classical” CS?
Coda
Safety critical software research
– Always “playing catch up”
– Aspirations for applications growing fast
To be successful
– Focus on “right problems”, i.e. where the difficulties arise in practice
– If possible work with industry – to try to provide solutions to their problems