phd course: performance & reliability analysis of ip-based communication networks

Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

PhD Course: Performance & Reliability Analysis of IP-based Communication NetworksHenrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen

• Day 1 Basics & Simple Queueing Models(HPS)

• Day 2 Traffic Measurements and Traffic Models (MC)

• Day 3 Advanced Queueing Models & Stochastic Control (HPS)

• Day 4 Network Models and Application (HS)

• Day 5 Simulation Techniques (HPS, SA)

• Day 6 Reliability Aspects (HPS)

Organized by HP Schwefel & H Schiøler

Hans Peter Schwefel


Introduction• System availablility/reliablility/dependability

= ”ability of system to provide ’correct’ (specified) functionality”

– for communication networks, e.g.• Connectivity• Services (network services, applications)

• Properties when applied to complex systems:– Specification of ’correct’ functionality difficult (as well as verification and testing)– Stateful systems: individual faults lead to subsequent dependent faults (significance of

fault detection)– Dependability is in general load dependent; faults influence performance

strong mutual relation between performance and dependability aspects (’performability’)

• Analysis of dependability aspects via– Analytic models: Frequently independence and Markovian assumptions

– Simulation models: Relevance of rare-event techniques

– Experimental Settings (Prototype/Test systems)

Hans Peter Schwefel


Main Application Areas• Safety critical systems

– Air-plane control systems– Vehicular electronic systems (e.g. ABS)

• Satellites/Space missions

• Business critical systems– Electronic stock markets– Electronic ordering/booking systems

• High-Performance/Parallel Computing

• Distributed Systems: Telecommunication systems– Traditional PSTNs: Availability 99.94% (end-to-end)– Internet: below 99%

Hans Peter Schwefel


Motivation: Failure Types in Communication Networks

WANs PSTNs

Source: Hal Stern, Evan Marcus

Hans Peter Schwefel


Economic Aspects of System Failures

Hans Peter Schwefel


Content 1. Basic concepts• Fault, failure, fault-tolerance, redundancy types• Availability, reliability, dependability• Life-times, hazard rates

2. Mathematical models• Availability models• Life-time models• Repair-Models• Performability Models

3. Availability and Network Management• Rolling upgrade, planned downtime, etc.

4. High-Availability for Hosts/Servers • Cluster solution, distributed approach

5. Network Availability • SCTP, HSRP, VRRP

Note: this slide-set contains only the outline of the lecture; many formulas and the mathematical derivations will be presented on the black-board!

Today: Basic concepts & Mathematical models

March 23: Architectures & Protocols

Hans Peter Schwefel


Basic concepts (I)• Fault/defect/error: component behavior deviates from desired behavior

– Transient vs. Persistent faults

– Possible causes:• Design errors• Aging/wear-out• Mis-use• Maintenance (planned down-times)

• System Failure– Observed system behavior deviates from specification, defined by– Failure criteria (time/functionality)

• Note: Faults are a necessary condition for failures but not sufficient– faul-tolerance, ’robust’ failure criteria

• Redundancy

– Active (functional) vs passive (non-functional, stand-by)– Structural (components) vs. Information (e.g. additional header fields) vs. Temporal (retries)

Hans Peter Schwefel


Basic concepts (II): Metrics• Availability (’accessibility’) A(t)

– System operational at time t– System accepts a service request at time t

• Reliability R(t1,t2)– System operational within time interval [t1,t2] provided it was operational (’accessible’) at time t1

– System serves a request given that it had accepted it at time t1

• Dependability D(t1,t2)

– System available at t1 and operational within time interval [t1,t2]

– Successfully accepts and serves a service request

• Systems without repair– Life-times LT– Hazard/failure rates

• Systems with repair– Down-times, up-times– Distribution/Mean of time to failure– Distribution/Mean of time to repair

Hans Peter Schwefel


Content 1. Basic concepts• Fault, failure, fault-tolerance, redundancy types• Availability, reliability, dependability• Life-times, hazard rates

2. Mathematical models• Availability models• Life-time models• Repair-Models• Performability Models

3. Availability and Network Management• Rolling upgrade, planned downtime, etc.

4. High-Availability for Hosts/Servers • Cluster solution, distributed approach

5. Network Availability • SCTP, HSRP, VRRP

Today: Basic concepts & Mathematical models

March 24: Architectures & Protocols

Hans Peter Schwefel


Availability Models (I)• System description

– System consists of N components

– Components states Z1,...,ZN availability state vector

• Binary: Zi in {0,1}

• Multi-state or continuous

– System availability: function of component availability Z=f(Z1,...,ZN)

• Disjunct normal form (DNF) of binary availability function• Monotonicity, coherence (no irrelevant components)

• Availability graph (binary case)

– Graph: Start node S, End-node E, Path for each ’summand’ in DNF, node for each ’factor’– Representation of system availability function– Simplification of graph through factorization

Hans Peter Schwefel


Availability Models (II)

Common Patterns (with independence assumptions!)

• Serial structures– All components need to be available for system availability

– A=Ai

– Boolscher Term and Availability Graph:

• Parallel structures– At least one component has to be available for system availability

– A=1-(1-Ai)

– Boolscher Term and Availability Graph:

• k-out-of-n systems– At least k out of n identical components have to be available for system availability– Use-cases: load-balancing/throughput, failure detection by majority votes

– A=ni=k { binomial(n,i) A1

i (1-A1)n-i}

– Special cases: Parallel structure (k=1) and serial structure (k=n)

• General case: A=...

Hans Peter Schwefel


Availability Models (III): Examples

Examples

• Triple Modular Redundancy (TMR)– 3 functionally identical components, result by majority vote

2-out-of-3 case

• Efficient use of redundancy in serial structure?– Replication of serial structure– Replicate each component– Combinations

when neglecting components necessary for• Failure detection• Node Elimination• Majority votes

Hans Peter Schwefel


Lifetime models (I)• States time-dependent Zi(t) and Z(t) description as stochastic process

• Time LT=mint{Z(t)=0} is called life-time of system– Compare to first passage times in queueing models

• Reliability: R(t)=Pr(Z(t)=1) [for systems without repair]– Serial structure: R(t)=Prod(Ri(t)) [independence!]

• Example: exponential distributed LTs E[LT]=1/(sum i)

– Parallel structure: R(t)= 1- Prod(1-Ri(t))

• Example: exponential distributed LTs E[LT]=1/ sum(1/i) (law of diminishing return)

• Hazard/failure Rate: (t)=lim0 1/ Pr(’failure occurs in interval [t;t+[ | system operational at time t’)– Constant Hazard rate exponential life-time

Hans Peter Schwefel


Lifetime models (II)• Typical hazard rates

– HW: bath-tub functions– Exponential LT– Power-tailed LT

• Examples: – Exponentially distributed lifetimes of components

Hazard Rate of serial structure equals sum of individual rates

– TMR– R(t), E(LT), condition for RTMR(t)>R(t)

• Computation of R(t) in Markovian Models– Via Matrix Exponential distributions

Hans Peter Schwefel


Repair models • Description of component failure and repair processes

– Repair time distribution with mean MTTR– Failure Time distribution with mean MTTF

Common assumption for modeling: Exponential repair time and exponential time to failure

• Availability: A(t)=Pr(’system operational at time t’)– lim A(t) = MTTF/(MTTR+MTTF)

• Reliability: R(t,t+)=Pr(’system operational in [t ,t+[ | operational at time t’)• Dependability: D(t,t+)= Pr(’system operational in [t ,t+[’)

– D(t,t+)= R(t,t+) A(t) (definition of conditional probabilities)

• Example: steady-state and transient behavior for– Single back-up component (passive stand-by): stationary probabilities, transient proces

• Redundant unit fails only when in operation• Redundant unit can also fail in stand-by

– TMR

• Matrix Exponential distribution for Time to failur and time to repair

Hans Peter Schwefel


Performability models• Strong relation between system performance and dependability

– Load/traffic dependent failure probabilities (e.g. overload scenarios!)

– Failures lead to reduced throughput

– Additional delay when redundant components have to become active

– Failure detection and recovery cause additional traffic

• Extensions of traditional queueing networks (Jackson, BCMP, etc.)– Introduction of hazard rates with components

• Spontaneous faults: indep. of load• Load-induced faults

– Increase of traffic as function of hazard rates (failure detection, etc.)

– Service rates as function of hazard rates

Iterative solutions

Hans Peter Schwefel


Exercises:1. Availability Models: A service is implemented on two redundant servers in the following

network topology. The service is availability if at least one of the servers is up and reachable.

a) Draw the availability graph for the service availability.

b) Write down the DNF of the service availability function

c) Compute the service availability for the following values• Availability of the wireless link: 85%

• Availability of the wired links: 99%

• Server Availability (HW+SW): 96%

2. Performability Modles: A redundant network service in passive stand-by is found to be adequately described by an M/M/1 queueing model with request arrival rate (Traffic) and service rate . In case of a failure (with rate ), the second server takes over until the failed server is repaired (repair rate ). If both servers are failed, requests will be queued until service starts again. Draw the Markov state diagram including state transition rates for this model. How will the transition rates change, if the second server is active and utilized with perfect load-balancing?

Hans Peter Schwefel


References/Acknowledgments• Heyman, Sobel (ed.), ‘Stochastic Models’,

Chapt. 13 ‘Reliability and Maintainability’ (Shaked, Shanthikumar), North-Holland,1990• Stern, Marcus, ‘Blueprints for High Availability: Designing Resilient Distributed Systems’.

Wiley, 2000.• E. Jessen, ‘Dependability and Fault-tolerance of computing systems’, lecture notes (German),

TU Munich, 1996.• K. S. Trivedi, ‘Probability and Statistics with Reliability, Queueing, and Computer Science

Applications’, 1982.

phd course: performance & reliability analysis of ip-based communication networks

Documents