2005-10-13 ICALEPCS 2005, Geneve, Switzerland
Dependability Considerations in Distributed Control Systems
Klemen Žagar, Cosylab
Dependability
• A dependable system is one which the users may trust.
• Examples of dependable distributed systems:
  – The Internet
  – Power distribution grid
  – Water supply
• Dependability is a very general term. Among others, it covers:
  – Availability: it is there when needed.
  – Reliability: it can work autonomously for a long period of time.
  – Maintainability: it is easily fixed when broken.
  – Safety: it will not harm other equipment or personnel.
  – Security: unauthorized, possibly malicious, users cannot gain control.
Motivation
• Nodes of a distributed system are like dominoes:
  – The domino effect: one falls, all may go down
  – May happen often, and takes a long time to rebuild
• Thus, fault tolerance is important:
  – Improved mean-time-to-failure of the system as a whole
  – Lower mean-time-to-repair
  – Result: improved availability and reduced maintenance effort
• Fault tolerance in distributed control systems?
Research Objectives
• Dependable Distributed Systems (DeDiSys) research project with the European Union.
• What are the most frequent causes of faults in distributed control systems?
• What mitigation mechanisms are available?
• How to improve availability by trading it against constraint consistency?
  – What is constraint consistency in control systems?
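Reliability
• Reliability, $R(t)$, is the probability that a system will perform as specified for a given period of time $t$.
  – Typically exponential: $R(t) = e^{-\lambda t}$, where $\lambda$ is the failure rate
  – Alternative measure is the mean time to failure (MTTF/MTBF): $\mathrm{MTTF} = \int_0^\infty R(t)\,dt = 1/\lambda$
[Figure: reliability of the Microsoft Windows 95 operating system. $R(t)$ plotted against $t$; the values 1 and 49.7 days are marked on the axes.]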
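The exponential model can be checked numerically. What follows is a minimal sketch, assuming a constant failure rate equal to 1/MTTF; the function name `reliability` and the 1000-hour MTTF are illustrative and not taken from the slides. The last two lines assume that the 49.7 days on the plot correspond to a 32-bit millisecond counter wrapping around, which the slide itself does not state.

```python
import math

def reliability(t_hours, mttf_hours):
    """Probability of surviving t_hours under the exponential model."""
    failure_rate = 1.0 / mttf_hours          # lambda = 1 / MTTF
    return math.exp(-failure_rate * t_hours)

# At t = MTTF only about 37% of systems are still expected to be running.
r_at_mttf = reliability(1000.0, 1000.0)

# Assumption: a 32-bit counter of milliseconds overflows after 2**32 ms.
MS_PER_DAY = 24 * 60 * 60 * 1000
wrap_days = 2**32 / MS_PER_DAY               # roughly 49.7 days
```

Under this model the MTTF is not a guaranteed lifetime: a system is more likely than not to have failed by the time it reaches its MTTF.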
Reliability of Composed Systems
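• Weakest link: reliability of a coupled composed system is less than the reliability of its least reliable constituent (assuming independent failures): $R = \prod_i R_i \le \min_i R_i$
• Redundancy: reliability of a redundant subsystem is greater than the reliability of its most reliable constituent (assuming independent failures): $R = 1 - \prod_i (1 - R_i) \ge \max_i R_i$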
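Both rules can be illustrated with the standard series and parallel formulas. This is a sketch under the assumption that constituents fail independently; the function names and the sample reliabilities 0.9 and 0.8 are made up for illustration.

```python
from math import prod

def series_reliability(parts):
    """Coupled system: works only if every constituent works."""
    return prod(parts)

def parallel_reliability(parts):
    """Redundant subsystem: fails only if every constituent fails."""
    return 1.0 - prod(1.0 - r for r in parts)

parts = [0.9, 0.8]
coupled = series_reliability(parts)      # 0.72, below the weakest part (0.8)
redundant = parallel_reliability(parts)  # 0.98, above the best part (0.9)
```

Adding a constituent in series can only lower the result, while adding one in parallel can only raise it.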
Maintainability and Availability
• Maintainability: how long it takes to repair a system after a failure.
  – The measure is mean time to repair (MTTR)
• Availability: percentage of time the system is actually available during periods when it should be available.
  – Directly experienced by users!
  – Expressed in percent. In marketing, also with number of nines (e.g., 99.999% availability means unavailable for about 5 min/year).
• Example: a gas station (working hours 6AM to 10PM, 16 hours)
  – Ran out of gas at 10AM (2h)
  – Pump malfunction at 2PM (2h)
  – Availability: 12h/16h = 75%
[Figure: timeline from 12AM to 10PM with opening at 6AM and two 2h outages marked]
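The gas station arithmetic and the "number of nines" conversion can be written out as a short sketch. The function names are illustrative, and the conversion assumes a 365-day year of round-the-clock scheduled operation.

```python
def availability(scheduled_hours, outage_hours):
    """Fraction of the scheduled time during which the system was usable."""
    return (scheduled_hours - sum(outage_hours)) / scheduled_hours

def downtime_minutes_per_year(availability_fraction):
    """Yearly downtime implied by an availability figure (365-day year)."""
    minutes_per_year = 365 * 24 * 60
    return (1.0 - availability_fraction) * minutes_per_year

gas_station = availability(16, [2, 2])              # 0.75
five_nines = downtime_minutes_per_year(0.99999)     # a little over 5 minutes
```

Only outages inside the scheduled window count: a failure at 11PM, when the station is closed anyway, would not reduce its availability.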
Research Methodology
• Research in the context of the DeDiSys project
• Collection of requirements from
  – DeDiSys project’s interest group members
  – Cosylab’s customers (e.g., ANKA, SLS, ...)
• Identification of scenarios
  – ALMA Common Software (ACS)
  – EPICS
  – Geographical Information Systems
• Definition of the architecture for a fault-tolerant naming service (FTNS)
Faults in Distributed Systems
Node failures
• A host crashes or a process dies
• Volatile state is lost
Link failures
• A network link is broken
• Results in two or more partitions
• Difficult to distinguish from a host crash
Consequences
• Affected services are lost
• Dependent systems malfunction
• User interface doesn’t show actual status
[Figure: Clients 1 to 6 connected to three copies of a server. Labels: Crashed Server; Severed Network Link; Copy 1: active, available; Copy 2: crashed; Copy 3: active, inconsistent, available.]
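How a single broken link splits a system into partitions can be sketched as a connected-components search. The topology below is hypothetical (it only borrows node names from the figure labels), and `partitions` is an illustrative helper, not part of any control system framework.

```python
from collections import deque

def partitions(nodes, links):
    """Group nodes into partitions (connected components) over undirected links."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:                      # breadth-first walk of one partition
            node = queue.popleft()
            group.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        result.append(group)
    return result

# Hypothetical topology: two clients, two server copies in a chain.
nodes = ["client1", "client2", "copy1", "copy3"]
links = [("client1", "copy1"), ("copy1", "copy3"), ("copy3", "client2")]
healthy = partitions(nodes, links)
severed = partitions(nodes, [l for l in links if l != ("copy1", "copy3")])
```

From inside one partition, the nodes of the other simply stop answering, which is exactly what a crash of those nodes would look like.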
Improving Hardware MTTF
• Reduce the number of mechanical parts:
  – Solid-state storage instead of hard disks
  – Passive cooling of power supplies and CPUs (no fans)
• High-quality or redundant power supplies
• Replication:
  – network links
  – CPU boards
• Remote reset (e.g., via power cycling)
Improving Software MTTF
• Ensure that overflows of variables that constantly increase (handle IDs, timers, counters, ...) are properly handled.
• Ensure all resources are properly released when no longer needed (memory leaks, …)
  – Use a managed platform (Java, .NET)
  – Use auto-pointers (C++)
• Avoid using heap storage on a per-transaction basis (may result in memory fragmentation); e.g., use free-lists
• Restart a process in a controllable fashion (rejuvenation)
• Isolate processes through inter-process communication
• Recovery:
  – Recover state after a crash
  – Effective for host and process crashes
  – Automated repair
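The first point, handling overflow of ever-increasing counters, can be illustrated with wraparound-safe arithmetic on a 32-bit tick counter. This is a sketch only: Python integers do not overflow, so the 32-bit width is emulated with a mask, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def ticks_elapsed(start, now):
    """Elapsed ticks between two 32-bit readings, correct across one wraparound."""
    return (now - start) & MASK32

def deadline_passed(now, deadline):
    """True if 'now' is at or after 'deadline', comparing modulo 2**32.

    The difference is interpreted as a signed 32-bit value, so the result is
    valid as long as the two readings are less than 2**31 ticks apart.
    """
    return ((now - deadline) & MASK32) < 0x80000000

# The counter wrapped between the two readings: 0xFFFFFFF0 -> 0x00000010.
elapsed = ticks_elapsed(0xFFFFFFF0, 0x00000010)   # 32 ticks, not a huge negative
```

A naive `now >= deadline` comparison gives the wrong answer as soon as the counter wraps, which is how a long-running system can fail after weeks of flawless operation.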
Decreasing MTTR
• Foresee failures during design
  – "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."
  – Douglas Adams: Mostly Harmless
• Provide good diagnostics
  – Alarms
    • Detailed description of where and when an error occurred
  – Logs
  – State-dump at failures
    • ADC buffers after a beam dump
    • Status of synchronization primitives
    • Memory dump
• Automated fail-over
  – In combination with redundancy
  – Passive replica must have up-to-date state of the primary copy
  – Fault detection (network ping, analog signal, …)
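Fault detection combined with automated fail-over can be sketched as a heartbeat monitor with a timeout. The class, function, and node names below are hypothetical, and timestamps are passed in explicitly so that the example stays deterministic; a real deployment would also have to keep the passive replica's state up to date, which this sketch does not cover.

```python
class HeartbeatMonitor:
    """Suspects a node when no heartbeat arrived within timeout_s seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def suspected(self, node, now):
        last = self.last_seen.get(node)
        return last is None or now - last > self.timeout_s

def choose_active(monitor, primary, replicas, now):
    """Keep the primary while it looks alive, else fail over to a live replica."""
    if not monitor.suspected(primary, now):
        return primary
    for replica in replicas:
        if not monitor.suspected(replica, now):
            return replica
    return None  # nothing reachable: raise an alarm instead of guessing

monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.heartbeat("primary", now=0.0)
monitor.heartbeat("replica", now=0.0)
monitor.heartbeat("replica", now=4.0)    # the primary has gone silent
before = choose_active(monitor, "primary", ["replica"], now=2.0)
after = choose_active(monitor, "primary", ["replica"], now=5.0)
```

A timeout can only raise suspicion: a silent primary may be crashed or merely unreachable, so the timeout value trades detection speed against false fail-overs.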
Consistency/Availability Trade-Off
[Figure: spectrum between Consistency and Availability. Labels: Finance, Banking; Access control, Corporate databases; Control systems, Air-traffic control, Fly-by-wire, Drive-by-wire.]
Constraint Consistency in Control Systems
• Constraints: rules that one or more objects must satisfy, for example:
  – If and only if serverChannel.monitors.contains(client) then client.isSubscribedTo(serverChannel)
  – serverChannel.value == clientChannel.value
  – server.getFromDatabase('x') == database.get('x')
  – If client.referencesComponent(component) then component.isReferencedBy(client)
• Can some constraints be temporarily relaxed in the presence of faults?
• If so, how to reconcile the system into a consistent state when faults are removed?
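One way to picture relaxable constraints is to attach a flag to each constraint and to tolerate violations of flagged constraints only while a fault is present. This is a minimal sketch, not the DeDiSys design: the `Channel` and `Constraint` classes, the `relaxable` flag, and the choice of the server as the authoritative side during reconciliation are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Channel:
    value: float

@dataclass
class Constraint:
    name: str
    holds: Callable[[], bool]
    relaxable: bool  # may be violated temporarily while a fault is present

def blocking_violations(constraints: List[Constraint], fault_present: bool) -> List[str]:
    """Names of violated constraints that are not tolerated in the current mode."""
    return [c.name for c in constraints
            if not c.holds() and not (fault_present and c.relaxable)]

server_channel = Channel(value=1.0)
client_channel = Channel(value=1.0)
constraints = [
    Constraint("serverChannel.value == clientChannel.value",
               lambda: server_channel.value == client_channel.value,
               relaxable=True),
]

# Link failure: the server value changes, the client keeps a stale copy.
server_channel.value = 2.0
during_fault = blocking_violations(constraints, fault_present=True)
after_fault = blocking_violations(constraints, fault_present=False)

# Reconciliation (assumed policy: the server copy is authoritative).
client_channel.value = server_channel.value
reconciled = blocking_violations(constraints, fault_present=False)
```

While the fault lasts the client stays available with a stale value; once the fault is removed the violation becomes blocking until reconciliation restores the constraint.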
Future Work
• DeDiSys:
  – Design and implementation (due: January 2007)
  – Validation (due: June 2007)
• Possible inclusion of research findings in control system infrastructures:
  – ACS (e.g., replication of the manager and components)
  – EPICS (e.g., V4 fault-tolerance efforts of the EPICS community)
• Inclusion in products:
  – The microIOC platform
  – Servers for Geographical Information Systems
  – Other high-availability products (telecommunications, automotive)
• Know-how for consulting and development services
Conclusion
• Distributed systems are inherently fragile
• Fault tolerance is difficult to program
  – Should be addressed by infrastructure/middleware, but frequently isn’t
• Comments/questions/contributions: [email protected]