phd course: performance & reliability analysis of ip-based communication networks
Post on 22-Jan-2016
26 Views
Preview:
DESCRIPTION
TRANSCRIPT
Page 1Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
PhD Course: Performance & Reliability Analysis of IP-based Communication NetworksHenrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen
• Day 1 Basics & Simple Queueing Models(HPS)
• Day 2 Traffic Measurements and Traffic Models (MC)
• Day 3 Advanced Queueing Models & Stochastic Control (HPS)
• Day 4 Network Models and Application (HS)
• Day 5 Simulation Techniques (HPS, SA)
• Day 6 Reliability Aspects (HPS)
Organized by HP Schwefel & H Schiøler
Page 2Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Introduction• System availablility/reliablility/dependability
= ”ability of system to provide ’correct’ (specified) functionality”
– for communication networks, e.g.• Connectivity• Services (network services, applications)
• Properties when applied to complex systems:– Specification of ’correct’ functionality difficult (as well as verification and testing)– Stateful systems: individual faults lead to subsequent dependent faults (significance of
fault detection)– Dependability is in general load dependent; faults influence performance
strong mutual relation between performance and dependability aspects (’performability’)
• Analysis of dependability aspects via– Analytic models: Frequently independence and Markovian assumptions
– Simulation models: Relevance of rare-event techniques
– Experimental Settings (Prototype/Test systems)
Page 3Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Main Application Areas• Safety critical systems
– Air-plane control systems– Vehicular electronic systems (e.g. ABS)
• Satellites/Space missions
• Business critical systems– Electronic stock markets– Electronic ordering/booking systems
• High-Performance/Parallel Computing
• Distributed Systems: Telecommunication systems– Traditional PSTNs: Availability 99.94% (end-to-end)– Internet: below 99%
Page 4Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Motivation: Failure Types in Communication Networks
WANs PSTNs
Source: Hal Stern, Evan Marcus
Page 5Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Economic Aspects of System Failures
Page 6Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Content 1. Basic concepts• Fault, failure, fault-tolerance, redundancy types• Availability, reliability, dependability• Life-times, hazard rates
2. Mathematical models• Availability models• Life-time models• Repair-Models• Performability Models
3. Availability and Network Management• Rolling upgrade, planned downtime, etc.
4. High-Availability for Hosts/Servers • Cluster solution, distributed approach
5. Network Availability • SCTP, HSRP, VRRP
Note: this slide-set contains only the outline of the lecture; many formulas and the mathematical derivations will be presented on the black-board!
Today: Basic concepts & Mathematical models
March 23: Architectures & Protocols
Page 7Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Basic concepts (I)• Fault/defect/error: component behavior deviates from desired behavior
– Transient vs. Persistent faults
– Possible causes:• Design errors• Aging/wear-out• Mis-use• Maintenance (planned down-times)
• System Failure– Observed system behavior deviates from specification, defined by– Failure criteria (time/functionality)
• Note: Faults are a necessary condition for failures but not sufficient– faul-tolerance, ’robust’ failure criteria
• Redundancy
– Active (functional) vs passive (non-functional, stand-by)– Structural (components) vs. Information (e.g. additional header fields) vs. Temporal (retries)
Page 8Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Basic concepts (II): Metrics• Availability (’accessibility’) A(t)
– System operational at time t– System accepts a service request at time t
• Reliability R(t1,t2)– System operational within time interval [t1,t2] provided it was operational (’accessible’) at time t1
– System serves a request given that it had accepted it at time t1
• Dependability D(t1,t2)
– System available at t1 and operational within time interval [t1,t2]
– Successfully accepts and serves a service request
• Systems without repair– Life-times LT– Hazard/failure rates
• Systems with repair– Down-times, up-times– Distribution/Mean of time to failure– Distribution/Mean of time to repair
Page 9Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Content 1. Basic concepts• Fault, failure, fault-tolerance, redundancy types• Availability, reliability, dependability• Life-times, hazard rates
2. Mathematical models• Availability models• Life-time models• Repair-Models• Performability Models
3. Availability and Network Management• Rolling upgrade, planned downtime, etc.
4. High-Availability for Hosts/Servers • Cluster solution, distributed approach
5. Network Availability • SCTP, HSRP, VRRP
Today: Basic concepts & Mathematical models
March 24: Architectures & Protocols
Page 10Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Availability Models (I)• System description
– System consists of N components
– Components states Z1,...,ZN availability state vector
• Binary: Zi in {0,1}
• Multi-state or continuous
– System availability: function of component availability Z=f(Z1,...,ZN)
• Disjunct normal form (DNF) of binary availability function• Monotonicity, coherence (no irrelevant components)
• Availability graph (binary case)
– Graph: Start node S, End-node E, Path for each ’summand’ in DNF, node for each ’factor’– Representation of system availability function– Simplification of graph through factorization
Page 11Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Availability Models (II)
Common Patterns (with independence assumptions!)
• Serial structures– All components need to be available for system availability
– A=Ai
– Boolscher Term and Availability Graph:
• Parallel structures– At least one component has to be available for system availability
– A=1-(1-Ai)
– Boolscher Term and Availability Graph:
• k-out-of-n systems– At least k out of n identical components have to be available for system availability– Use-cases: load-balancing/throughput, failure detection by majority votes
– A=ni=k { binomial(n,i) A1
i (1-A1)n-i}
– Special cases: Parallel structure (k=1) and serial structure (k=n)
• General case: A=...
Page 12Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Availability Models (III): Examples
Examples
• Triple Modular Redundancy (TMR)– 3 functionally identical components, result by majority vote
2-out-of-3 case
• Efficient use of redundancy in serial structure?– Replication of serial structure– Replicate each component– Combinations
when neglecting components necessary for• Failure detection• Node Elimination• Majority votes
Page 13Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Lifetime models (I)• States time-dependent Zi(t) and Z(t) description as stochastic process
• Time LT=mint{Z(t)=0} is called life-time of system– Compare to first passage times in queueing models
• Reliability: R(t)=Pr(Z(t)=1) [for systems without repair]– Serial structure: R(t)=Prod(Ri(t)) [independence!]
• Example: exponential distributed LTs E[LT]=1/(sum i)
– Parallel structure: R(t)= 1- Prod(1-Ri(t))
• Example: exponential distributed LTs E[LT]=1/ sum(1/i) (law of diminishing return)
• Hazard/failure Rate: (t)=lim0 1/ Pr(’failure occurs in interval [t;t+[ | system operational at time t’)– Constant Hazard rate exponential life-time
Page 14Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Lifetime models (II)• Typical hazard rates
– HW: bath-tub functions– Exponential LT– Power-tailed LT
• Examples: – Exponentially distributed lifetimes of components
Hazard Rate of serial structure equals sum of individual rates
– TMR– R(t), E(LT), condition for RTMR(t)>R(t)
• Computation of R(t) in Markovian Models– Via Matrix Exponential distributions
Page 15Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Repair models • Description of component failure and repair processes
– Repair time distribution with mean MTTR– Failure Time distribution with mean MTTF
Common assumption for modeling: Exponential repair time and exponential time to failure
• Availability: A(t)=Pr(’system operational at time t’)– lim A(t) = MTTF/(MTTR+MTTF)
• Reliability: R(t,t+)=Pr(’system operational in [t ,t+[ | operational at time t’)• Dependability: D(t,t+)= Pr(’system operational in [t ,t+[’)
– D(t,t+)= R(t,t+) A(t) (definition of conditional probabilities)
• Example: steady-state and transient behavior for– Single back-up component (passive stand-by): stationary probabilities, transient proces
• Redundant unit fails only when in operation• Redundant unit can also fail in stand-by
– TMR
• Matrix Exponential distribution for Time to failur and time to repair
Page 16Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Performability models• Strong relation between system performance and dependability
– Load/traffic dependent failure probabilities (e.g. overload scenarios!)
– Failures lead to reduced throughput
– Additional delay when redundant components have to become active
– Failure detection and recovery cause additional traffic
• Extensions of traditional queueing networks (Jackson, BCMP, etc.)– Introduction of hazard rates with components
• Spontaneous faults: indep. of load• Load-induced faults
– Increase of traffic as function of hazard rates (failure detection, etc.)
– Service rates as function of hazard rates
Iterative solutions
Page 17Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
Exercises:1. Availability Models: A service is implemented on two redundant servers in the following
network topology. The service is availability if at least one of the servers is up and reachable.
a) Draw the availability graph for the service availability.
b) Write down the DNF of the service availability function
c) Compute the service availability for the following values• Availability of the wireless link: 85%
• Availability of the wired links: 99%
• Server Availability (HW+SW): 96%
2. Performability Modles: A redundant network service in passive stand-by is found to be adequately described by an M/M/1 queueing model with request arrival rate (Traffic) and service rate . In case of a failure (with rate ), the second server takes over until the failed server is repaired (repair rate ). If both servers are failed, requests will be queued until service starts again. Draw the Markov state diagram including state transition rates for this model. How will the transition rates change, if the second server is active and utilized with perfect load-balancing?
Page 18Hans Peter Schwefel
PHD Course: Performance & Reliability Analysis, Day 6, Spring04
References/Acknowledgments• Heyman, Sobel (ed.), ‘Stochastic Models’,
Chapt. 13 ‘Reliability and Maintainability’ (Shaked, Shanthikumar), North-Holland,1990• Stern, Marcus, ‘Blueprints for High Availability: Designing Resilient Distributed Systems’.
Wiley, 2000.• E. Jessen, ‘Dependability and Fault-tolerance of computing systems’, lecture notes (German),
TU Munich, 1996.• K. S. Trivedi, ‘Probability and Statistics with Reliability, Queueing, and Computer Science
Applications’, 1982.
top related