Socio-Technical Systems Failure (LSCITS EngD 2012)

Post on 13-Jan-2015

DESCRIPTION

Discusses socio-technical issues in systems failure

TRANSCRIPT

Human Failure, LSCITS, EngD course in Socio-technical Systems, 2012. Slide 1

Systems failure – a socio-technical perspective

Complex software systems

• Multi-purpose. Organisational systems that support different functions within an organisation

• System of systems. Usually distributed and normally constructed by integrating existing systems/components/services

• Unlimited. Not subject to limitations derived from the laws of physics (so, no natural constraints on their size)

• Data intensive. System data orders of magnitude larger than code; long-lifetime data

• Dynamic. Changing quickly in response to changes in the business environment

Systems of systems

• Operational independence

• Managerial independence

• Multiple stakeholder viewpoints

• Evolutionary development

• Emergent behaviour

• Geographic distribution

Complex system realities

• There is no definitive specification of what the system should ‘do’ and it is practically impossible to create such a specification

• The complexity of the system is such that it is not ‘understandable’ as a whole

• It is likely that, at all times, some parts of the system will not be fully operational

• Actors responsible for different parts of the system are likely to have conflicting goals

System failure

System dependability model

• System fault: a system characteristic that can (but need not) lead to a system error

• System error: an erroneous system state that can (but need not) lead to a system failure

• System failure: externally-observed, unexpected and undesirable system behaviour
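The fault → error → failure chain can be sketched in code. A hypothetical Python example (the function, its names and the averaging bug are invented for illustration): a latent fault produces an erroneous state only on some inputs, and the erroneous state becomes a failure only when an observer sees the wrong behaviour.

```python
# Hypothetical illustration of the fault -> error -> failure chain.
# The FAULT: average() divides by a hard-coded 10 instead of len(values).
def average(values):
    return sum(values) / 10  # FAULT: should be len(values)

# With exactly 10 values the fault is not triggered: the state is correct.
ok = average([1] * 10)          # 1.0 - no error, no failure

# With 5 values the internal state is erroneous (0.5 instead of 1.0)...
erroneous = average([1] * 5)    # ERROR: an erroneous system state

# ...and it only becomes a FAILURE when the result is externally observed
# as unexpected, undesirable behaviour.
def report(avg):
    return f"average load: {avg}"

print(report(erroneous))        # externally observed -> a failure
```

The fault is always present; the error occurs only for inputs that trigger it; the failure occurs only when the error propagates to an observer.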

A hospital system

• A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit.

• It is assumed that the hospital has a number of empty beds and this changes over time. The variable B reflects the number of empty beds known to the system.

• Sometimes the system reports the actual number of empty beds; sometimes it reports fewer beds than are actually available.

• In circumstances where the system reports an incorrect number of available beds, is this a failure?

What is failure?

• Technical, engineering view: a failure is ‘a deviation from a specification’.

• An oracle can examine a specification, observe a system’s behaviour and detect failures.

• Failure is an absolute: the system has either failed or it hasn't
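This oracle view can be contrasted in a minimal sketch with the user judgement reported in the bed management study (all function names are hypothetical; the behaviour encoded is the study's finding that users cared about whether patients could be admitted, not about the exact number):

```python
# Engineering view: failure = any deviation from the specification.
# Hypothetical spec: the system shall report the actual number of empty beds.
def oracle(reported, actual):
    return reported != actual   # True -> deviation -> 'failure'

# User judgement (as reported in the study): what matters is whether
# patients can be admitted, not the exact number of beds.
def user_judgement(reported, actual):
    return (reported > 0) != (actual > 0)

# The system under-reports: 3 beds reported, 5 actually empty.
print(oracle(3, 5))           # True  - a failure by the specification
print(user_judgement(3, 5))   # False - not a failure to the users
```

The same behaviour is a failure to the oracle and not to the users, which is the point of the slides that follow.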

Bed management system

• No system users (0% in the study) considered the system's incorrect reporting of the number of available beds to be a failure.

• Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital.

• When the hospital was very busy (available beds = 0), then people understood that it was practically impossible for the system to be accurate.

• They used other methods to find out whether or not a bed was available for an incoming patient.

Failure is a judgement

• Specifications are a gross simplification of reality for complex systems.

• Users don’t read and don’t care about specifications

• Whether or not system behaviour should be considered to be a failure depends on the observer's judgement

• This judgement depends on:

– The observer's expectations

– The observer’s knowledge and experience

– The observer’s role

– The observer’s context or situation

– The observer’s authority

Failures are inevitable

• Technical reasons

– When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood

– Failures can often be considered to be failures in data rather than failures in behaviour

• Socio-technical reasons

– Changing contexts of use mean that the judgement of what constitutes a failure changes as the effectiveness of the system in supporting work changes

– Different stakeholders will interpret the same behaviour in different ways because of different interpretations of 'the problem'

Conflict inevitability

• Impossible to establish a set of requirements where stakeholder conflicts are all resolved

• Therefore, successful operation of a system for one set of stakeholders will inevitably mean ‘failure’ for another set of stakeholders

• Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at some time by a stakeholder group.

Normal failures

• ‘Failures’ are not just catastrophic events but normal, everyday system behaviour that disrupts normal work and means that people have to spend more time on a task than necessary

• A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required to carry out some task, in response to some inappropriate or unexpected system behaviour

• This extra work constitutes the cost of recovery from system failure

The Swiss Cheese model

Failure trajectories

• Failures rarely have a single cause. Generally, they arise because several events occur simultaneously

– Loss of data in a critical system

• User mistypes command and instructs data to be deleted

• System does not check and ask for confirmation of destructive action

• No backup of data available

• A failure trajectory is a sequence of undesirable events that coincide in time, usually initiated by some human action. It represents a failure in the defensive layers in the system
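The data-loss trajectory above can be sketched as a check that loss only occurs when every defensive layer has failed at once (a hypothetical illustration; the event strings are invented labels for the three events listed):

```python
# Hypothetical sketch of a failure trajectory: data is lost only when
# holes in every defensive layer line up at the same time.
def trajectory_fails(events):
    layers = {
        "user":   "mistyped delete command" in events,  # initiating human action
        "system": "no confirmation prompt" in events,   # missing engineered defence
        "backup": "no backup available" in events,      # missing last-resort defence
    }
    # The trajectory completes only if *all* defensive layers have failed.
    return all(layers.values())

# Any single intact layer blocks the trajectory:
print(trajectory_fails({"mistyped delete command",
                        "no confirmation prompt"}))    # False - backup saves the data

# All three events coincide -> loss of data:
print(trajectory_fails({"mistyped delete command",
                        "no confirmation prompt",
                        "no backup available"}))       # True - system failure
```

This is the Swiss cheese picture in miniature: removing any one hole, at any layer, is enough to stop this particular trajectory.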

Vulnerabilities and defences

• Vulnerabilities

– Faults in the (socio-technical) system which, if triggered by a human or technical error, can lead to system failure

– e.g. missing check on input validity

• Defences

– System features that avoid, tolerate or recover from human error

– e.g. type checking that disallows allocation of incorrect types of value

• When an adverse event happens, the key question is not ‘whose fault was it?’ but ‘why did the system defences fail?’
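The two example defences named above can be sketched together in one hypothetical function (the bed-count setting is an invented example, continuing the hospital system from earlier slides): a type check rejects values of the wrong type, and an input-validity check rejects values that are well-typed but invalid.

```python
# Hypothetical sketch of two defences against human error:
# a type-checking defence and an input-validity defence.
def set_empty_beds(current, new_value):
    if not isinstance(new_value, int):   # type check: disallow wrong-type values
        raise TypeError("bed count must be an integer")
    if new_value < 0:                    # input-validity check
        raise ValueError("bed count cannot be negative")
    return new_value

print(set_empty_beds(3, 5))       # valid update -> 5

try:
    set_empty_beds(3, "lots")     # human slip caught by the type check
except TypeError as e:
    print("defence triggered:", e)
```

Each check closes one vulnerability; removing either re-opens a hole that a human or technical error could fall through.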

Reason’s Swiss Cheese Model

Active failures

• Active failures

– Active failures are the unsafe acts committed by people who are in direct contact with the system, or failures in the system technology.

– Active failures have a direct and usually short-lived effect on the integrity of the defences.

• Latent conditions

– Fundamental vulnerabilities in one or more layers of the socio-technical system, such as system faults, system and process misfit, alarm overload, inadequate maintenance, etc.

– Latent conditions may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity.

Defensive layers

• Complex IT systems should have many defensive layers:

– some are engineered: alarms, physical barriers, automatic shutdowns;

– others rely on people: surgeons, anaesthetists, pilots, control room operators;

– and others depend on procedures and administrative controls.

• In an ideal world, each defensive layer would be intact.

• In reality, they are more like slices of Swiss cheese, having many holes; although unlike in the cheese, these holes are continually opening, shutting, and shifting their location.

Dynamic vulnerabilities

• While some vulnerabilities are static (e.g. programming errors), others are dynamic and depend on the context where the system is used.

• For example:

– vulnerabilities may be related to human actions whose performance is dependent on workload, state of mind, etc. An operator may be distracted and forget to check something

– vulnerabilities may depend on configuration: checks may depend on particular programs being up and running, so if program A is running in a system a check may be made, but if program B is running the check is not made
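The configuration-dependent case can be sketched as follows (a hypothetical illustration; the component and command names are invented): the same invalid input is caught in one configuration and slips through in another, so the vulnerability exists only sometimes.

```python
# Hypothetical sketch of a configuration-dependent vulnerability:
# the validity check only runs when a 'validator' component is part
# of the running configuration.
def process(command, running_components):
    if "validator" in running_components:      # check made only in this config
        if command not in {"add", "remove", "query"}:
            raise ValueError(f"unknown command: {command}")
    return f"executed {command}"

# With the validator running, the mistyped command is caught...
try:
    process("delte", {"validator", "db"})
except ValueError as e:
    print("rejected:", e)

# ...but in a configuration without it, the same slip goes unchecked.
print(process("delte", {"db"}))   # vulnerability: no check is made
```

Whether the defence exists at all depends on the runtime configuration, which is why such vulnerabilities are dynamic rather than static.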

Recovering from failure

Coping with failure

• People are good at coping with unexpected situations when things go wrong.

– They can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things.

– People can prioritise and focus on the essence of a problem

Recovery strategies

• Local knowledge

– Who to call; who knows what; where things are

• Process reconfiguration

– Doing things in a different way from that defined in the ‘standard’ process

– Work-arounds, breaking the rules (safe violations)

• Redundancy and diversity

– Maintaining copies of information in different forms from that maintained in a software system

– Informal information annotation

– Using multiple communication channels

• Trust

– Relying on others to cope

Design for recovery

• Holistic systems engineering

– Software systems design has to be seen as part of a wider process of socio-technical systems engineering

• We cannot build ‘correct’ systems

– We must therefore design systems to allow the broader socio-technical systems to recognise, diagnose and recover from failures

• Extend current systems to support recovery

• Develop recovery support systems as an integral part of systems of systems

Recovery strategy

• Designing for recovery is a holistic approach to system design and not (just) the identification of ‘recovery requirements’

• Should support the natural ability of people and organisations to cope with problems

– Ensure that system design decisions do not increase the amount of recovery work required

– Make system design decisions that make it easier to recover from problems (i.e. reduce extra work required)

• Earlier recognition of problems

• Visibility to make hypotheses easier to formulate

• Flexibility to support recovery actions

Key points

• Failures are inevitable in complex systems because multiple stakeholders see these systems in different ways and because there is no single manager of these systems

• Failures are a judgement, not an absolute; they depend on the system observer

• The Swiss cheese model is a failure model based on active failures (trigger events) and latent conditions (system vulnerabilities).

• People have developed strategies for coping with failure and systems should not be designed to make coping more difficult.
