phd course: performance & reliability analysis of ip-based communication networks

18
Page 1 Hans Peter Schwefel PHD Course: Performance & Reliability Analysis, Day 6, Spring04 PhD Course: Performance & Reliability Analysis of IP-based Communication Networks Henrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen Day 1 Basics & Simple Queueing Models(HPS) Day 2 Traffic Measurements and Traffic Models (MC) Day 3 Advanced Queueing Models & Stochastic Control (HPS) Day 4 Network Models and Application (HS) Day 5 Simulation Techniques (HPS, SA) Day 6 Reliability Aspects (HPS) Organized by HP Schwefel & H Schiøler

Upload: hidi

Post on 22-Jan-2016

26 views

Category:

Documents


0 download

DESCRIPTION

PhD Course: Performance & Reliability Analysis of IP-based Communication Networks. Henrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen. Day 1 Basics & Simple Queueing Models(HPS) Day 2 Traffic Measurements and Traffic Models (MC) - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 1Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

PhD Course: Performance & Reliability Analysis of IP-based Communication NetworksHenrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen

• Day 1 Basics & Simple Queueing Models(HPS)

• Day 2 Traffic Measurements and Traffic Models (MC)

• Day 3 Advanced Queueing Models & Stochastic Control (HPS)

• Day 4 Network Models and Application (HS)

• Day 5 Simulation Techniques (HPS, SA)

• Day 6 Reliability Aspects (HPS)

Organized by HP Schwefel & H Schiøler

Page 2: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 2Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Introduction• System availablility/reliablility/dependability

= ”ability of system to provide ’correct’ (specified) functionality”

– for communication networks, e.g.• Connectivity• Services (network services, applications)

• Properties when applied to complex systems:– Specification of ’correct’ functionality difficult (as well as verification and testing)– Stateful systems: individual faults lead to subsequent dependent faults (significance of

fault detection)– Dependability is in general load dependent; faults influence performance

strong mutual relation between performance and dependability aspects (’performability’)

• Analysis of dependability aspects via– Analytic models: Frequently independence and Markovian assumptions

– Simulation models: Relevance of rare-event techniques

– Experimental Settings (Prototype/Test systems)

Page 3: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 3Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Main Application Areas• Safety critical systems

– Air-plane control systems– Vehicular electronic systems (e.g. ABS)

• Satellites/Space missions

• Business critical systems– Electronic stock markets– Electronic ordering/booking systems

• High-Performance/Parallel Computing

• Distributed Systems: Telecommunication systems– Traditional PSTNs: Availability 99.94% (end-to-end)– Internet: below 99%

Page 4: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 4Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Motivation: Failure Types in Communication Networks

WANs PSTNs

Source: Hal Stern, Evan Marcus

Page 5: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 5Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Economic Aspects of System Failures

Page 6: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 6Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Content 1. Basic concepts• Fault, failure, fault-tolerance, redundancy types• Availability, reliability, dependability• Life-times, hazard rates

2. Mathematical models• Availability models• Life-time models• Repair-Models• Performability Models

3. Availability and Network Management• Rolling upgrade, planned downtime, etc.

4. High-Availability for Hosts/Servers • Cluster solution, distributed approach

5. Network Availability • SCTP, HSRP, VRRP

Note: this slide-set contains only the outline of the lecture; many formulas and the mathematical derivations will be presented on the black-board!

Today: Basic concepts & Mathematical models

March 23: Architectures & Protocols

Page 7: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 7Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Basic concepts (I)• Fault/defect/error: component behavior deviates from desired behavior

– Transient vs. Persistent faults

– Possible causes:• Design errors• Aging/wear-out• Mis-use• Maintenance (planned down-times)

• System Failure– Observed system behavior deviates from specification, defined by– Failure criteria (time/functionality)

• Note: Faults are a necessary condition for failures but not sufficient– faul-tolerance, ’robust’ failure criteria

• Redundancy

– Active (functional) vs passive (non-functional, stand-by)– Structural (components) vs. Information (e.g. additional header fields) vs. Temporal (retries)

Page 8: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 8Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Basic concepts (II): Metrics• Availability (’accessibility’) A(t)

– System operational at time t– System accepts a service request at time t

• Reliability R(t1,t2)– System operational within time interval [t1,t2] provided it was operational (’accessible’) at time t1

– System serves a request given that it had accepted it at time t1

• Dependability D(t1,t2)

– System available at t1 and operational within time interval [t1,t2]

– Successfully accepts and serves a service request

• Systems without repair– Life-times LT– Hazard/failure rates

• Systems with repair– Down-times, up-times– Distribution/Mean of time to failure– Distribution/Mean of time to repair

Page 9: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 9Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Content 1. Basic concepts• Fault, failure, fault-tolerance, redundancy types• Availability, reliability, dependability• Life-times, hazard rates

2. Mathematical models• Availability models• Life-time models• Repair-Models• Performability Models

3. Availability and Network Management• Rolling upgrade, planned downtime, etc.

4. High-Availability for Hosts/Servers • Cluster solution, distributed approach

5. Network Availability • SCTP, HSRP, VRRP

Today: Basic concepts & Mathematical models

March 24: Architectures & Protocols

Page 10: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 10Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Availability Models (I)• System description

– System consists of N components

– Components states Z1,...,ZN availability state vector

• Binary: Zi in {0,1}

• Multi-state or continuous

– System availability: function of component availability Z=f(Z1,...,ZN)

• Disjunct normal form (DNF) of binary availability function• Monotonicity, coherence (no irrelevant components)

• Availability graph (binary case)

– Graph: Start node S, End-node E, Path for each ’summand’ in DNF, node for each ’factor’– Representation of system availability function– Simplification of graph through factorization

Page 11: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 11Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Availability Models (II)

Common Patterns (with independence assumptions!)

• Serial structures– All components need to be available for system availability

– A=Ai

– Boolscher Term and Availability Graph:

• Parallel structures– At least one component has to be available for system availability

– A=1-(1-Ai)

– Boolscher Term and Availability Graph:

• k-out-of-n systems– At least k out of n identical components have to be available for system availability– Use-cases: load-balancing/throughput, failure detection by majority votes

– A=ni=k { binomial(n,i) A1

i (1-A1)n-i}

– Special cases: Parallel structure (k=1) and serial structure (k=n)

• General case: A=...

Page 12: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 12Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Availability Models (III): Examples

Examples

• Triple Modular Redundancy (TMR)– 3 functionally identical components, result by majority vote

2-out-of-3 case

• Efficient use of redundancy in serial structure?– Replication of serial structure– Replicate each component– Combinations

when neglecting components necessary for• Failure detection• Node Elimination• Majority votes

Page 13: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 13Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Lifetime models (I)• States time-dependent Zi(t) and Z(t) description as stochastic process

• Time LT=mint{Z(t)=0} is called life-time of system– Compare to first passage times in queueing models

• Reliability: R(t)=Pr(Z(t)=1) [for systems without repair]– Serial structure: R(t)=Prod(Ri(t)) [independence!]

• Example: exponential distributed LTs E[LT]=1/(sum i)

– Parallel structure: R(t)= 1- Prod(1-Ri(t))

• Example: exponential distributed LTs E[LT]=1/ sum(1/i) (law of diminishing return)

• Hazard/failure Rate: (t)=lim0 1/ Pr(’failure occurs in interval [t;t+[ | system operational at time t’)– Constant Hazard rate exponential life-time

Page 14: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 14Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Lifetime models (II)• Typical hazard rates

– HW: bath-tub functions– Exponential LT– Power-tailed LT

• Examples: – Exponentially distributed lifetimes of components

Hazard Rate of serial structure equals sum of individual rates

– TMR– R(t), E(LT), condition for RTMR(t)>R(t)

• Computation of R(t) in Markovian Models– Via Matrix Exponential distributions

Page 15: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 15Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Repair models • Description of component failure and repair processes

– Repair time distribution with mean MTTR– Failure Time distribution with mean MTTF

Common assumption for modeling: Exponential repair time and exponential time to failure

• Availability: A(t)=Pr(’system operational at time t’)– lim A(t) = MTTF/(MTTR+MTTF)

• Reliability: R(t,t+)=Pr(’system operational in [t ,t+[ | operational at time t’)• Dependability: D(t,t+)= Pr(’system operational in [t ,t+[’)

– D(t,t+)= R(t,t+) A(t) (definition of conditional probabilities)

• Example: steady-state and transient behavior for– Single back-up component (passive stand-by): stationary probabilities, transient proces

• Redundant unit fails only when in operation• Redundant unit can also fail in stand-by

– TMR

• Matrix Exponential distribution for Time to failur and time to repair

Page 16: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 16Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Performability models• Strong relation between system performance and dependability

– Load/traffic dependent failure probabilities (e.g. overload scenarios!)

– Failures lead to reduced throughput

– Additional delay when redundant components have to become active

– Failure detection and recovery cause additional traffic

• Extensions of traditional queueing networks (Jackson, BCMP, etc.)– Introduction of hazard rates with components

• Spontaneous faults: indep. of load• Load-induced faults

– Increase of traffic as function of hazard rates (failure detection, etc.)

– Service rates as function of hazard rates

Iterative solutions

Page 17: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 17Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

Exercises:1. Availability Models: A service is implemented on two redundant servers in the following

network topology. The service is availability if at least one of the servers is up and reachable.

a) Draw the availability graph for the service availability.

b) Write down the DNF of the service availability function

c) Compute the service availability for the following values• Availability of the wireless link: 85%

• Availability of the wired links: 99%

• Server Availability (HW+SW): 96%

2. Performability Modles: A redundant network service in passive stand-by is found to be adequately described by an M/M/1 queueing model with request arrival rate (Traffic) and service rate . In case of a failure (with rate ), the second server takes over until the failed server is repaired (repair rate ). If both servers are failed, requests will be queued until service starts again. Draw the Markov state diagram including state transition rates for this model. How will the transition rates change, if the second server is active and utilized with perfect load-balancing?

Page 18: PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

Page 18Hans Peter Schwefel

PHD Course: Performance & Reliability Analysis, Day 6, Spring04

References/Acknowledgments• Heyman, Sobel (ed.), ‘Stochastic Models’,

Chapt. 13 ‘Reliability and Maintainability’ (Shaked, Shanthikumar), North-Holland,1990• Stern, Marcus, ‘Blueprints for High Availability: Designing Resilient Distributed Systems’.

Wiley, 2000.• E. Jessen, ‘Dependability and Fault-tolerance of computing systems’, lecture notes (German),

TU Munich, 1996.• K. S. Trivedi, ‘Probability and Statistics with Reliability, Queueing, and Computer Science

Applications’, 1982.