2005-10-13icalepcs 2005, geneve, switzerland dependability considerations in distributed control...

2005-10-13 ICALEPCS 2005, Geneve, Switzerland

Dependability Considerations in Distributed Control Systems

Klemen Žagar, Cosylab


Dependability

• A dependable system is one which the users may trust.

• Examples of dependable distributed systems:
  – The Internet
  – Power distribution grid
  – Water supply

• Dependability is a very general term. Among others, it covers:
  – Availability: it is there when needed.
  – Reliability: it can work autonomously for a long period of time.
  – Maintainability: easily fixed when broken.
  – Safety: will not harm other equipment or personnel.
  – Security: unauthorized, possibly malicious, users cannot gain control.


Motivation

• Nodes of a distributed system are like dominos
  – The domino effect: one falls, all may go down
  – May happen often, and takes a long time to rebuild

• Thus, fault tolerance is important:
  – Improved mean-time-to-failure of the system as a whole
  – Lower mean-time-to-repair
  – Improved availability
  – Reduced maintenance effort

• Fault tolerance in distributed control systems?


Research Objectives

• Dependable Distributed Systems (DeDiSys) research project with the European Union.

• What are the most frequent causes of faults in distributed control systems?

• What mitigation mechanisms are available?

• How to improve availability by trading it against constraint consistency?
  – What is constraint consistency in control systems?


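Reliability

• Reliability, R(t), is the probability that a system will perform as specified for a given period of time.
  – Typically exponential: $R(t) = e^{-\lambda t}$, where $\lambda$ is the constant failure rate
  – Alternative measure is the mean time to failure (MTTF/MTBF): $\mathrm{MTTF} = 1/\lambda$

[Figure: Reliability of the Microsoft Windows 95 operating system. Axes: R(t) versus t, with marks at R = 1 and t = 49.7 days.]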
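The exponential model can be tried out numerically. The sketch below assumes a constant failure rate (called `failure_rate` here, a name not used on the slides) and a made-up value of one failure per 10,000 hours; it is only meant to show how R(t) and the MTTF relate under that assumption.

```python
import math


def reliability(t_hours: float, failure_rate: float) -> float:
    """Probability of running t_hours without failure, assuming a constant failure rate."""
    return math.exp(-failure_rate * t_hours)


def mttf(failure_rate: float) -> float:
    """Mean time to failure of the exponential model (the reciprocal of the rate)."""
    return 1.0 / failure_rate


# Hypothetical figure, not taken from the talk: one failure per 10,000 hours.
rate = 1.0 / 10_000.0

r_start = reliability(0.0, rate)            # a fresh system has not failed yet
r_at_mttf = reliability(mttf(rate), rate)   # equals exp(-1), roughly 37%

print(f"R(0) = {r_start:.3f}")
print(f"R(MTTF) = {r_at_mttf:.3f}")
```

Under this model only about 37% of systems are still running once the MTTF has elapsed, so the MTTF should not be read as a guaranteed lifetime.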


Reliability of Composed Systems

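• Weakest link: the reliability of a coupled composed system is less than the reliability of its least reliable constituent (for independent failures): $R_{\text{coupled}} = \prod_i R_i \le \min_i R_i$

• Redundancy: the reliability of a redundant subsystem is greater than the reliability of its most reliable constituent (for independent failures): $R_{\text{redundant}} = 1 - \prod_i (1 - R_i) \ge \max_i R_i$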
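Both statements can be checked with a few lines of arithmetic. The sketch below assumes that constituents fail independently (an assumption the slide does not state) and uses invented reliabilities; the function names are illustrative only.

```python
from math import prod


def coupled_reliability(parts):
    """Every constituent must work, so the individual reliabilities multiply."""
    return prod(parts)


def redundant_reliability(parts):
    """One working constituent is enough, so the failure probabilities multiply."""
    return 1.0 - prod(1.0 - r for r in parts)


# Hypothetical constituent reliabilities, for illustration only.
parts = [0.99, 0.95, 0.90]

coupled = coupled_reliability(parts)        # 0.99 * 0.95 * 0.90 = 0.84645
redundant = redundant_reliability(parts)    # 1 - 0.01 * 0.05 * 0.10 = 0.99995

print(f"coupled:   {coupled:.5f} (least reliable part: {min(parts)})")
print(f"redundant: {redundant:.5f} (most reliable part: {max(parts)})")
```

With these numbers the coupled system is worse than its weakest part (0.90), while the redundant arrangement is better than its best part (0.99).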


Maintainability and Availability

• Maintainability: how long it takes to repair a system after a failure.
  – The measure is mean time to repair (MTTR)

• Availability: percentage of time the system is actually available during periods when it should be available.
  – Directly experienced by users!
  – Expressed in percent. In marketing, also as a number of nines (e.g., 99.999% availability means unavailable for about 5 min/year).

• Example: a gas station (working hours 6AM to 10PM, 16 hours)
  – Ran out of gas at 10AM (2h)
  – Pump malfunction at 2PM (2h)
  – Availability: 12h/16h = 75%

[Figure: timeline from 12AM to 10PM, open from 6AM, with two 2h outages marked.]
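The gas station arithmetic and the "number of nines" conversion can be reproduced as follows. This is a small sketch; the function names and the 365.25-day year are assumptions made for the example.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumed year length for the conversion


def availability(scheduled_hours, outage_hours):
    """Fraction of the scheduled time during which the service was usable."""
    return (scheduled_hours - sum(outage_hours)) / scheduled_hours


def downtime_minutes_per_year(availability_fraction):
    """Yearly downtime implied by an availability figure, for round-the-clock service."""
    return (1.0 - availability_fraction) * MINUTES_PER_YEAR


# Gas station: open 16 hours, two outages of 2 hours each.
gas_station = availability(16.0, [2.0, 2.0])

# "Five nines" expressed as minutes of downtime per year.
five_nines = downtime_minutes_per_year(0.99999)

print(f"gas station availability: {gas_station:.0%}")
print(f"99.999% availability: {five_nines:.2f} min/year of downtime")
```

Only outages inside the scheduled hours count: the station being closed overnight does not reduce its availability.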


Research Methodology

• Research in the context of the DeDiSys project
• Collection of requirements from
  – DeDiSys project’s interest group members
  – Cosylab’s customers (e.g., ANKA, SLS, ...)
• Identification of scenarios
  – ALMA Common Software (ACS)
  – EPICS
  – Geographical Information Systems
• Definition of the architecture for a fault-tolerant naming service (FTNS)


Faults in Distributed Systems

Node failures
• A host crashes or a process dies
• Volatile state is lost

Link failures
• A network link is broken
• Results in two or more partitions
• Difficult to distinguish from a host crash

Consequences
• Affected services are lost
• Dependent systems malfunction
• User interface doesn’t show actual status

[Figure: Clients 1 to 6 connected to three server copies, with a crashed server and a severed network link. Copy 1: active, available. Copy 2: crashed. Copy 3: active, inconsistent, available.]


Improving Hardware MTTF

• Reduce the number of mechanical parts:
  – Solid-state storage instead of hard disks
  – Passive cooling of power supplies and CPUs (no fans)
• High-quality or redundant power supplies
• Replication:
  – network links
  – CPU boards
• Remote reset (e.g., via power cycling)


Improving Software MTTF

• Ensure that overflows of variables that constantly increase (handle IDs, timers, counters, ...) are properly handled.

• Ensure all resources are properly released when no longer needed (memory leaks, …)
  – Use a managed platform (Java, .NET)
  – Use auto-pointers (C++)

• Avoid using heap storage on a per-transaction basis (may result in memory fragmentation); e.g., use free-lists

• Restart a process in a controllable fashion (rejuvenation)
• Isolate processes through inter-process communication
• Recovery:
  – Recover state after a crash
  – Effective for host and process crashes
  – Automated repair
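The first point, overflow of constantly increasing variables, is the kind of defect behind uptime limits such as 49.7 days: an unsigned 32-bit millisecond counter wraps after 2^32 ms, which is about 49.7 days. The sketch below shows one common way to compare readings of such a counter safely, using modular arithmetic. The 32-bit width and all names are assumptions made for illustration, not code from the talk.

```python
TICK_BITS = 32                  # assumed counter width
TICK_MODULUS = 1 << TICK_BITS   # the counter wraps to 0 after this many ticks


def elapsed_ticks(start: int, now: int) -> int:
    """Ticks between two readings of a wrapping counter.

    Correct as long as fewer than TICK_MODULUS ticks really passed.
    """
    return (now - start) % TICK_MODULUS


def has_expired(start: int, now: int, timeout: int) -> bool:
    """Timeout test that stays correct across a wraparound."""
    return elapsed_ticks(start, now) >= timeout


# A timer started 100 ticks before the wrap and read 50 ticks after it.
start = TICK_MODULUS - 100
now = 50

naive_expired = now >= start + 150            # never true: 'now' restarted from 0
safe_expired = has_expired(start, now, 150)   # true: 150 ticks have passed

wrap_days = TICK_MODULUS / 1000 / 86_400      # millisecond ticks to days

print(naive_expired, safe_expired, round(wrap_days, 1))
```

The naive comparison silently stops firing after the wrap, which is why such bugs tend to surface only after weeks of uninterrupted operation.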


Decreasing MTTR

• Foresee failures during design
  – “The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.”
  – Douglas Adams: Mostly Harmless

• Provide good diagnostics
  – Alarms
    • Detailed description of where and when an error occurred
  – Logs
  – State-dump at failures
    • ADC buffers after a beam dump
    • Status of synchronization primitives
    • Memory dump

• Automated fail-over
  – In combination with redundancy
  – Passive replica must have up-to-date state of the primary copy
  – Fault detection (network ping, analog signal, …)
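The three ingredients of automated fail-over (redundancy, an up-to-date passive replica, and fault detection) can be sketched in a few lines. Everything here is hypothetical: the class and function names, the heartbeat-timeout detector, and the injected clock are assumptions chosen to keep the example self-contained, not a description of any particular control system.

```python
import time


class HeartbeatMonitor:
    """Suspects the primary when no heartbeat arrived within timeout_s seconds."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def heartbeat(self):
        self.last_seen = self.clock()

    def primary_suspected(self):
        return self.clock() - self.last_seen > self.timeout_s


class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.active = False


def replicate(primary, backup):
    """Keep the passive replica's state in step with the primary copy."""
    backup.state = dict(primary.state)


def fail_over(monitor, primary, backup):
    """Promote the backup once the primary is suspected; return the active replica."""
    if monitor.primary_suspected() and not backup.active:
        primary.active = False
        backup.active = True
    return backup if backup.active else primary


# A fake clock makes the example deterministic.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
primary, backup = Replica("primary"), Replica("backup")
primary.active = True
primary.state["setpoint"] = 42
replicate(primary, backup)

fake_now[0] = 2.0
before = fail_over(monitor, primary, backup)   # still within the timeout

fake_now[0] = 5.0                              # no heartbeat for 5 s
after = fail_over(monitor, primary, backup)

print(before.name, after.name, after.state)
```

A timeout can only suspect a failure. A severed link looks the same as a crashed host from the monitor's side, so a real deployment also has to make sure that two replicas do not both end up active.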


Consistency/Availability Trade-Off

[Figure: spectrum between Consistency and Availability, with application domains placed along it: Finance, Banking; Access control, Corporate databases; Control systems, Air-traffic control, Fly-by-wire, Drive-by-wire.]


Constraint Consistency in Control Systems

• Constraints: rules that one or more objects must satisfy, for example:
  – If and only if serverChannel.monitors.contains(client) then client.isSubscribedTo(serverChannel)
  – serverChannel.value == clientChannel.value
  – server.getFromDatabase(‘x’) == database.get(‘x’)
  – If client.referencesComponent(component) then component.isReferencedBy(client)
• Can some constraints be temporarily relaxed in the presence of faults?
• If so, how to reconcile the system into a consistent state when faults are removed?
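One way to make these questions concrete is to treat each constraint as a named predicate over the system state and to mark which ones may be violated while a fault is present. The sketch below is an illustration only: the `Constraint` type, the `relaxable` flag, the choice of which constraint is relaxable, and the dictionary-based state are all assumptions, not the DeDiSys design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Constraint:
    name: str
    check: Callable[[Dict], bool]
    relaxable: bool  # assumption: may be violated temporarily while a fault is present


def violated(constraints: List[Constraint], state: Dict) -> List[Constraint]:
    """Constraints that do not hold in the given state."""
    return [c for c in constraints if not c.check(state)]


def operation_allowed(constraints: List[Constraint], state: Dict, degraded: bool) -> bool:
    """In degraded mode, violations of relaxable constraints are tolerated."""
    return all(degraded and c.relaxable for c in violated(constraints, state))


constraints = [
    Constraint("client value mirrors server value",
               lambda s: s["server_value"] == s["client_value"],
               relaxable=True),
    Constraint("subscription is symmetric",
               lambda s: (s["client"] in s["monitors"]) == s["client_subscribed"],
               relaxable=False),
]

# During a partition the client still holds a stale value.
state = {"server_value": 5.0, "client_value": 4.0,
         "client": "c1", "monitors": {"c1"}, "client_subscribed": True}

strict = operation_allowed(constraints, state, degraded=False)   # blocked
relaxed = operation_allowed(constraints, state, degraded=True)   # tolerated

# Reconciliation once the fault is removed: refresh the stale value.
state["client_value"] = state["server_value"]
remaining = violated(constraints, state)

print(strict, relaxed, [c.name for c in remaining])
```

Blocking operations on any violation favours consistency, while tolerating relaxable violations favours availability at the cost of a reconciliation step later.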


Future Work

• DeDiSys:
  – Design and implementation (due: January 2007)
  – Validation (due: June 2007)

• Possible inclusion of research findings in control system infrastructures:
  – ACS (e.g., replication of the manager and components)
  – EPICS (e.g., V4 fault-tolerance efforts of the EPICS community)

• Inclusion in products:
  – The microIOC platform
  – Servers for Geographical Information Systems
  – Other high-availability products (telecommunications, automotive)
• Know-how for consulting and development services


Conclusion

• Distributed systems are inherently fragile

• Fault tolerance is difficult to program
  – Should be addressed by infrastructure/middleware, but frequently isn’t

• Comments/questions/contributions: [email protected]
