
2005-10-13 ICALEPCS 2005, Geneve, Switzerland

Dependability Considerations in Distributed Control Systems

Klemen Žagar, Cosylab


Dependability

• A dependable system is one that its users can trust.

• Examples of dependable distributed systems:
  – The Internet
  – The power distribution grid
  – The water supply

• Dependability is a very general term. Among others, it covers:
  – Availability: the system is there when needed.
  – Reliability: it can work autonomously for a long period of time.
  – Maintainability: it is easily fixed when broken.
  – Safety: it will not harm other equipment or personnel.
  – Security: unauthorized, possibly malicious, users cannot gain control.


Motivation

• Nodes of a distributed system are like dominoes:
  – The domino effect: when one falls, all may go down.
  – This may happen often, and rebuilding takes a long time.

• Thus, fault tolerance is important:
  – Improved mean time to failure (MTTF) of the system as a whole
  – Lower mean time to repair (MTTR)
  – Improved availability
  – Reduced maintenance effort

• How can fault tolerance be achieved in distributed control systems?


Research Objectives

• Dependable Distributed Systems (DeDiSys): a research project supported by the European Union.

• What are the most frequent causes of faults in distributed control systems?

• What mitigation mechanisms are available?

• How can availability be improved by trading it against constraint consistency?
  – What is constraint consistency in control systems?


Reliability

• Reliability, R(t), is the probability that a system will perform as specified for a given period of time t.
  – Typically exponential: R(t) = e^{-\lambda t}, where \lambda is the failure rate.
  – An alternative measure is the mean time to failure (MTTF; MTBF for repairable systems):

    \mathrm{MTTF} = \int_0^\infty R(t)\,dt = 1/\lambda

[Figure: reliability curve R(t) decaying from 1 over time t, annotated at 49.7 days. Caption: Reliability of the Microsoft Windows 95 operating system, whose 32-bit millisecond tick counter overflows after 49.7 days.]


Reliability of Composed Systems

• Weakest link: the reliability of a coupled (series) composed system is less than the reliability of its least reliable constituent; assuming independent failures:

  R_{\mathrm{series}}(t) = \prod_i R_i(t) \le \min_i R_i(t)

• Redundancy: the reliability of a redundant (parallel) subsystem is greater than the reliability of its most reliable constituent:

  R_{\mathrm{parallel}}(t) = 1 - \prod_i (1 - R_i(t)) \ge \max_i R_i(t)
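A short worked illustration, assuming independent failures (the numbers are chosen for illustration and are not from the talk): for two constituents with R_1 = 0.9 and R_2 = 0.8 at some fixed time,

  R_{\mathrm{series}} = 0.9 \times 0.8 = 0.72 \le \min(0.9, 0.8)
  R_{\mathrm{parallel}} = 1 - (1 - 0.9)(1 - 0.8) = 1 - 0.02 = 0.98 \ge \max(0.9, 0.8)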


Maintainability and Availability

• Maintainability: how long it takes to repair a system after a failure.
  – The measure is the mean time to repair (MTTR).

• Availability: the percentage of time the system is actually available during periods when it should be available.
  – Directly experienced by users!
  – Expressed in percent; in marketing, also as a number of nines (e.g., 99.999% availability means roughly 5 minutes of unavailability per year).

• Example: a gas station (working hours 6AM to 10PM, i.e. 16 hours):
  – Ran out of gas at 10AM (2h outage)
  – Pump malfunction at 2PM (2h outage)
  – Availability: 12h/16h = 75%
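The slide leaves the relation implicit, but the standard steady-state formula ties availability to the two preceding measures, which is why both improving MTTF and lowering MTTR (the Motivation slide) raise availability:

  A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}

For example, MTTF = 1000 h and MTTR = 1 h give A = 1000/1001, about 99.9%.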


Research Methodology

• Research is carried out in the context of the DeDiSys project.
• Collection of requirements from:
  – Members of the DeDiSys project's interest group
  – Cosylab's customers (e.g., ANKA, SLS, ...)
• Identification of scenarios:
  – ALMA Common Software (ACS)
  – EPICS
  – Geographical Information Systems
• Definition of the architecture for a fault-tolerant naming service (FTNS)


Faults in Distributed Systems

• Node failures:
  – A host crashes or a process dies.
  – Volatile state is lost.

• Link failures:
  – A network link is broken.
  – Results in two or more partitions.
  – Difficult to distinguish from a host crash.

• Consequences:
  – Affected services are lost.
  – Dependent systems malfunction.
  – The user interface doesn't show the actual status.

[Diagram: six clients and three copies of a server. Copy 1 is active and available; Copy 2 has crashed; Copy 3 is active and available but inconsistent, cut off by a severed network link that partitions the system.]


Improving Hardware MTTF

• Reduce the number of mechanical parts:
  – Solid-state storage instead of hard disks
  – Passive cooling of power supplies and CPUs (no fans)
• High-quality or redundant power supplies
• Replication:
  – Network links
  – CPU boards
• Remote reset (e.g., via power cycling)


Improving Software MTTF

• Ensure that overflows of variables that constantly increase (handle IDs, timers, counters, ...) are properly handled (see the sketch after this list).

• Ensure all resources are properly released when no longer needed (memory leaks, ...):
  – Use a managed platform (Java, .NET).
  – Use auto-pointers (C++).

• Avoid using heap storage on a per-transaction basis (it may cause memory fragmentation); e.g., use free lists.
• Restart processes in a controllable fashion (rejuvenation).
• Isolate processes through inter-process communication.
• Recovery:
  – Recover state after a crash.
  – Effective for host and process crashes.
  – Automated repair.
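To illustrate the first two bullets, here is a minimal C++ sketch (names are hypothetical, not from the talk): an overflow-safe comparison of a free-running tick counter, and RAII ("auto-pointer") release of a resource.

#include <cstdint>
#include <iostream>
#include <memory>

// Wrap-around-safe comparison of a free-running 32-bit tick counter.
// Unsigned subtraction is well-defined modulo 2^32, so the elapsed time
// is computed correctly even after the counter overflows (cf. the
// 49.7-day overflow of the Windows 95 millisecond counter).
bool elapsed_at_least(std::uint32_t now, std::uint32_t start,
                      std::uint32_t interval) {
    return static_cast<std::uint32_t>(now - start) >= interval;
}

// RAII: the resource is released when its owner goes out of scope,
// even on early returns or exceptions, so it cannot leak.
struct Connection {
    ~Connection() { std::cout << "connection closed\n"; }
};

void handle_request() {
    auto conn = std::make_unique<Connection>();  // modern "auto-pointer"
    // ... use *conn; no explicit delete needed ...
}

int main() {
    // Near the 2^32 wrap: start just below the overflow, now just past it.
    std::uint32_t start = 0xFFFFFF00u;
    std::uint32_t now = start + 0x200u;  // wraps around modulo 2^32
    std::cout << std::boolalpha
              << elapsed_at_least(now, start, 0x200u) << "\n";  // true
    handle_request();
}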


Decreasing MTTR

• Foresee failures during design:
  – "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair." (Douglas Adams, Mostly Harmless)

• Provide good diagnostics:
  – Alarms: a detailed description of where and when an error occurred
  – Logs
  – State dump at failures:
    • ADC buffers after a beam dump
    • Status of synchronization primitives
    • Memory dump

• Automated fail-over (see the heartbeat sketch below):
  – In combination with redundancy
  – The passive replica must hold the up-to-date state of the primary copy
  – Fault detection (network ping, analog signal, ...)
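A minimal sketch of the fault-detection side of automated fail-over, assuming a simple heartbeat protocol (names and timeouts are illustrative, not from the talk): the backup suspects the primary, and takes over, when no heartbeat arrives within a timeout. Note that, as the link-failure slide warned, a timeout alone cannot distinguish a crashed primary from a severed link.

#include <chrono>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;

// Failure detector for a primary/backup pair: the primary is suspected
// dead if no heartbeat has arrived within `timeout`. Silence looks the
// same for a host crash and a network partition.
class HeartbeatMonitor {
public:
    explicit HeartbeatMonitor(std::chrono::milliseconds timeout)
        : timeout_(timeout), last_(Clock::now()) {}

    void on_heartbeat() { last_ = Clock::now(); }  // called per message

    bool primary_suspected() const {
        return Clock::now() - last_ > timeout_;
    }

private:
    std::chrono::milliseconds timeout_;
    Clock::time_point last_;
};

int main() {
    HeartbeatMonitor monitor(std::chrono::milliseconds(100));
    monitor.on_heartbeat();  // a heartbeat is received
    std::this_thread::sleep_for(std::chrono::milliseconds(150));
    if (monitor.primary_suspected()) {
        // Fail-over: promote the passive replica, which must already
        // hold the primary's up-to-date state.
        std::cout << "primary silent; promoting backup\n";
    }
}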


Consistency/Availability Trade-Off

[Diagram: a spectrum from consistency to availability. Favoring consistency: finance, banking, access control, corporate databases. Favoring availability: control systems, air-traffic control, fly-by-wire, drive-by-wire.]


Constraint Consistency in Control Systems

• Constraints: rules that one or more objects must satisfy, for example:
  – If and only if serverChannel.monitors.contains(client), then client.isSubscribedTo(serverChannel)
  – serverChannel.value == clientChannel.value
  – server.getFromDatabase('x') == database.get('x')
  – If client.referencesComponent(component), then component.isReferencedBy(client)

• Can some constraints be temporarily relaxed in the presence of faults?

• If so, how can the system be reconciled into a consistent state once the faults are removed? (A sketch of such a relaxable constraint check follows below.)
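A minimal sketch, with hypothetical names, of how the first listed constraint might be checked and temporarily relaxed: in normal operation a violation rejects the operation, while during a fault it is only recorded for later reconciliation, keeping the system available.

#include <iostream>
#include <set>
#include <string>

// Hypothetical model of the subscription constraint from the slide:
// a client is subscribed to a channel iff the channel's monitor list
// contains that client.
struct Channel {
    std::set<std::string> monitors;
};

struct Client {
    std::set<std::string> subscriptions;
    bool isSubscribedTo(const std::string& ch) const {
        return subscriptions.count(ch) != 0;
    }
};

// True if the iff-constraint holds for this client/channel pair.
bool constraint_holds(const Client& c, const std::string& clientId,
                      const Channel& ch, const std::string& chName) {
    return (ch.monitors.count(clientId) != 0) == c.isSubscribedTo(chName);
}

int main() {
    bool fault_present = true;  // e.g., a network partition was detected
    Client client;              // subscribed on the client side...
    client.subscriptions.insert("temperature");
    Channel channel;            // ...but the server-side entry was lost in a crash

    if (!constraint_holds(client, "client1", channel, "temperature")) {
        if (fault_present) {
            // Relaxed mode: log the violation and reconcile once the
            // fault is removed, trading consistency for availability.
            std::cout << "constraint violated; queued for reconciliation\n";
        } else {
            std::cout << "constraint violated; rejecting operation\n";
        }
    }
}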


Future Work

• DeDiSys:
  – Design and implementation (due: January 2007)
  – Validation (due: June 2007)

• Possible inclusion of research findings in control system infrastructures:
  – ACS (e.g., replication of the manager and components)
  – EPICS (e.g., the V4 fault-tolerance efforts of the EPICS community)

• Inclusion in products:
  – The microIOC platform
  – Servers for Geographical Information Systems
  – Other high-availability products (telecommunications, automotive)

• Know-how for consulting and development services


Conclusion

• Distributed systems are inherently fragile

• Fault tolerance is difficult to program.
  – It should be addressed by the infrastructure/middleware, but frequently isn't.

• Comments/questions/contributions: [email protected]