
Page 1:

Copyright 2016 Andrew Snow All Rights Reserved

Tutorial: “Statistical Methods for System Dependability: Reliability, Availability, Maintainability and Resiliency”

NexComm 2016, International Association of Research and Industrial Academia (IARIA)

Drs. Andy Snow* and Gary Weckman** ([email protected] & [email protected])

*School of Information & Telecommunication Systems
**Department of Industrial & Systems Engineering

Ohio University

Page 2:

Outline

A. ICT Infrastructure Risk

B. Examples of ICT Network Infrastructure

C. RAM-R: Reliability, Availability, Maintainability and Resiliency

D. Protection Level Assessment & Forecasting

Page 3:

Outline

A. ICT Infrastructure Risk

B. Examples of ICT Network Infrastructure

C. RAM-R: Reliability, Availability, Maintainability and Resiliency

D. Protection Level Assessment & Forecasting

Page 4:

A. Infrastructure Risk

• Human Perceptions of Risk

• Threats (natural and manmade)

• Vulnerabilities

• Faults Taxonomy

• Service Outages

• Single Points of Failure

• Over-Concentration

• Risk as f(Severity, Likelihood)

• Protection through fault prevention, tolerance, removal, and forecasting

• Best Practices

Page 5:

Human Perceptions of Risk

• Perceptions of “Rare Events”

• Users Demand Dependable Systems

• Dependable Systems are Expensive

Page 6:

Some Fun with Probability

• Which is more likely?
1. Winning the “Big Lotto”

2. Getting hit by lightning

3. Being eviscerated/evaporated by a large asteroid over an 80-year lifetime

Page 7:

Some Fun with Probability

• Pick one:
1. Winning the “Big Lotto”

2. Getting hit by lightning

• The chances are about the same

• One you have to pay for – the other is free

Page 8:

Some Fun with Probability

3. Being eviscerated/evaporated by a large asteroid over an 80-year lifetime

– Based upon empirical data, the chances are about 1 in a million*

*A. Snow and D. Straub, “Collateral damage from anticipated or real disasters: skewed perceptions of system and business continuity risk?”,

Page 9:

Probability and People

• It is human nature that we perceive “good” events to be more likely and “bad” events to be less likely

• Until a bad event happens, that is

Page 10:

We Expect Dependability Attributes from our Critical Infrastructure

• Reliability

• Maintainability

• Availability

• Resiliency1

• Data Confidentiality

• Data Integrity

1This perspective replaces “Safety” with “Resiliency”. Attributes were first suggested in A. Avizienis, et al., “Basic Concepts & Taxonomy of Dependable & Secure Computing”, IEEE Transactions on Dependable & Secure Computing, 2004.

Page 11:

We Expect Dependability from our Critical Infrastructure

• Reliability

– We expect our systems to fail very infrequently

• Maintainability

– When systems do fail, we expect very quick recovery

• Availability

– Knowing systems occasionally fail and take finite time to fix, we still expect services to be ready for use when we need them
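These three expectations are tied together by the standard steady-state availability relation A = MTBF / (MTBF + MTTR). A minimal sketch; the MTBF and MTTR values are hypothetical, for illustration only:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the long-run fraction of time a
    service is up, given mean time between failures (MTBF) and
    mean time to repair (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# hypothetical system: fails about once per 10,000 h, takes 4 h to repair
print(round(availability(10_000, 4), 6))
```

Reliability drives MTBF up; maintainability drives MTTR down; availability is the combined result.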

Page 12:

We Expect Dependability from our Critical Infrastructure (Continued)

• Resiliency
– We expect our infrastructure not to fail cataclysmically
– When major disturbances occur, we still expect organizational missions and critical societal services to be served

• Data Confidentiality
– We expect data to be accessed only by those who are authorized

• Data Integrity
– We expect data to be deleted or modified only by those who are authorized

Page 13:

Are our Expectations Reasonable?

• Our expectations for dependable ICT systems are high

• So is the cost

• If you demand high dependability…

Page 14:

Don’t Forget Your Wallet

(Chart: cost as a function of dependability)

Page 15:

Focus is often on More Reliable and Maintainable Components

• How to make things more reliable
– Avoid single points of failure (e.g. over-concentration to achieve economies of scale?)
– Diversity
• Redundant in-line equipment spares
• Redundant transmission paths
• Redundant power sources

• How to make things more maintainable
– Minimize fault detection, isolation, repair/replacement, and test time
– Spares, test equipment, alarms, staffing levels, training, best practices, transportation, minimize travel time

• What it takes: lots of capital and operational costs

Page 16:

Paradox

• We are fickle

• When ICT works, no one wants to spend $$ for unlikely events

• When an unlikely event occurs
– We wish we had spent more

• Our perceptions of risk before and after catastrophes are key to societal behavior when it comes to ICT dependability

Page 17:

But Things Go Wrong!

• Central Office facility in Louisiana
– Generators at ground level outside building

– Rectifiers and batteries installed in the basement

– Flat land 20 miles from coast, a few feet above sea level

– Hurricane at high tide results in flood

– Commercial AC lost, generators inundated, basement flooded

– Facility loses power, communications down

– Fault tolerant architecture defeated by improper deployment


Page 18:

Fukushima Nuclear Accident

• Nuclear reactor cooling design required AC power

• Power Redundancy

– Two sources of commercial power

– Backup generators

– Contingency plan if generators fail? Fly in portable generators

• Risks?

– Power plant on coast a few meters above sea-level

– Tsunamis: a 5 meter wall


Page 19:

Fukushima Nuclear Accident (Continued)

• Design vulnerabilities?

– Nuclear plant requires AC Power for cooling

– Tsunami wall 5 meters high, in a country where in the last 100 years numerous > 5 meter tsunamis occurred

– Remarkably, backup generators at ground level (not on roofs!)

• Where do tsunamis come from?

– Ocean floor earthquakes

• What can a severe land-based earthquake do?

– Make man-made things fall, such as AC power lines

Page 20:

Sequence of Events: Fukushima Nuclear Accident

1. Large land-based and ocean-floor earthquake
– AC transmission lines fall
– Ten meter tsunami hits Fukushima

2. Backup generators
– Start up successfully, then
– Flooded by tsunami coming over wall

3. Portable generators
– Flown in
– Junction box vault flooded

4. Nuclear reactors overheat, go critical, and explode

For 40 years, people walked by AC generators at ground level and a 5 meter tsunami wall!


Page 21:

Assessing Risk is Difficult

• Severity

– Economic impact

– Geographic impact

– Safety impact

• Likelihood

– Vulnerabilities

– Means and Capabilities

– Motivations

Page 22:

9-11 Effect: Geographic Dispersal of Human and ICT Assets

Page 23:

Pre 9-11 IT Redundancy

Primary Facility + Backup Facility (each with cost C; total cost ≈ 2C)

Scenario   Single IT Facility Reliability   Redundant IT Facility Reliability
1          0.90                             0.9900
2          0.95                             0.9975
3          0.99                             0.9999
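The right-hand column follows the parallel-reliability formula for n independent facilities with perfect switchover (the key assumptions on the next slide): R_redundant = 1 - (1 - R)^n. A quick sketch:

```python
def redundant_reliability(r: float, n: int = 2) -> float:
    """Reliability of n independent facilities in parallel: the
    system is down only if every facility fails simultaneously
    (assumes independent failures and perfect switchover)."""
    return 1 - (1 - r) ** n

for r in (0.90, 0.95, 0.99):
    print(f"single {r:.2f} -> redundant {redundant_reliability(r):.4f}")
```

Note the pattern: doubling the cost does not double the reliability, it squares the unreliability.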

Page 24:

Key Assumptions

1. Failures are independent

2. Switchover capability is perfect

Primary Facility

Backup Facility

Page 25:

9-11: Some Organizations Violated These Assumptions

1. Failures not independent
• Primary in WTC1
• Backup in WTC1 or WTC2

2. Switchover capability disrupted
• People injured or killed in WTC expected to staff backup facility elsewhere
• Transportation and access problems

Primary Facility

Backup Facility

Page 26:

Post 9-11 IT Redundancy Perspectives

• No concentrations of people or systems to one large site

• Geographically dispersed human and IT infrastructure

• Geographic dispersal requires highly dependable networks

• Architecture possible with cloud computing !!

(Diagrams: workload dispersed across Sites 1…N, each provisioned with spare capacity so the surviving sites can absorb a failed site's load)

Page 27:

Geographic Dispersal

• A. Snow, D. Straub, R. Baskerville, C. Stucke, “The survivability principle: IT-enabled dispersal of organizational capital”, in Enterprise Information Systems Assurance and System Security: Managerial and Technical Issues, Chapter 11, Idea Group Publishing, Hershey, PA, 2006.

• Cloud computing enables such approaches!

Page 28:

Assessing Risk is Difficult

• Severity

– Safety impact

– Economic impact

– Geographic impact

• Likelihood

– Vulnerabilities

– Means and Capabilities

– Motivations

Page 29:

Infrastructure Protection and Risk

• Outages

• Severity

• Likelihood

• Fault Prevention, Tolerance, Removal andForecasting

Page 30:

Infrastructure Protection and Risk

• Outages

• Severity

• Likelihood

• Fault Prevention, Tolerance, Removal andForecasting

RISK

Page 31:

Risk/Likelihood

(2×2 matrix: severity of service outage vs. likelihood of service outage)

Quadrant I: High Severity, High Chance
Quadrant II: High Severity, Low Chance
Quadrant III: Low Severity, Low Chance
Quadrant IV: Low Severity, High Chance
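The quadrant assignment can be captured as a tiny classifier; the numbering below follows the order in which the quadrants are listed on the slide and is an assumption on my part:

```python
def risk_quadrant(severity_high: bool, likelihood_high: bool) -> str:
    """Map a (severity, likelihood) judgment onto the 2x2 risk matrix.
    Assumed numbering: I = high/high, II = high/low,
    III = low/low, IV = low/high."""
    if severity_high:
        return "I" if likelihood_high else "II"
    return "IV" if likelihood_high else "III"

print(risk_quadrant(True, False))   # high severity, low chance
```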

Page 32:

Risk

(The same 2×2 matrix: severity of service outage vs. likelihood of service outage)

Quadrant I: High Severity, High Chance
Quadrant II: High Severity, Low Chance
Quadrant III: Low Severity, Low Chance
Quadrant IV: Low Severity, High Chance

??

Page 33:

Vulnerabilities and Threats

• Vulnerability is a weakness or a state of susceptibility which opens up the infrastructure to a possible outage due to attack or circumstance.

• The cause of a triggered vulnerability, or error state, is a system fault.

• The potential for a vulnerability to be exploited or triggered into a disruptive event is a threat.

• Vulnerabilities, or faults, can be exploited intentionally or triggered unintentionally.

Page 34:

Proactive Fault Management

• Fault Prevention: using design, implementation, and operations rules such as standards and industry best practices

• Fault Tolerance: techniques wherein equipment/process failures do not result in service outages, because of fast switchover to redundant equipment/processes

• Fault Removal: identifying faults introduced during design, implementation, or operations and taking remediation action

• Fault Forecasting: monitoring system fault behavior from a quantitative and qualitative perspective and assessing the impact on service continuity

Page 35:

Threats and Vulnerabilities

• Natural Threats
– Water damage
– Fire damage
– Wind damage
– Power loss
– Earthquake damage
– Volcanic eruption damage

• Human Threats
– Introducing or triggering vulnerabilities
– Exploiting vulnerabilities (hackers/crackers, malware introduction)
– Physical vandalism
– Terrorism and acts of war

• Fault Taxonomy

Page 36:

Vulnerability or Fault Taxonomy

Boundary: Internal / External
Phase: Developmental / Operational
Dimension: Equipment / Facility
Phenomenon: Natural / Human
Objective: Malicious / Non-Malicious
Intent: Deliberate / Non-Deliberate
Capability: Accidental / Incompetence
Persistence: Permanent / Transient

Page 37:

Reference

• A. Avizienis, et al., “Basic Concepts & Taxonomy of Dependable & Secure Computing”, IEEE Transactions on Dependable & Secure Computing, 2004.

Page 38:

Case Study – Danger Index

• Snow, Weckman & Hoag, “Understanding Danger to Critical Telecom Infrastructure: A Risky Business”, International Conference on Networks 2009 (ICN09), IEEE Communications Society Press, March 2009.

Page 39:

Danger

• Malicious acts aimed directly against humans, or indirectly at their critical infrastructures, are a real and present danger

• However, most compromises to ICT critical infrastructure are accidental and non-malicious

• How can we quantify the danger?
– Not easily

Page 40:

September 11, 2001

• A large telecommunications outage resulted from the collapse of the World Trade Center towers
– Over 4,000,000 data circuits disrupted

– Over 400,000 local switch lines out

• Pathology of the event
– Towers collapsed

– Some physical damage to adjacent TCOM building

– Water pipes burst, and in turn disrupted TCOM facility power and power backup facilities

• What was the a priori probability of such an event and ensuing sequence?
– P = Pr{Successful hijack} × Pr{Building collapse} × Pr{Water damage}

– Infinitesimal??
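Under an independence assumption, the a priori probability of the whole sequence is just the product of the stage probabilities, which is why it comes out infinitesimal. A sketch; the magnitudes below are purely illustrative, not estimates:

```python
from functools import reduce
from operator import mul

def joint_probability(probs):
    """Probability that a chain of independent events all occur:
    the product of the individual probabilities."""
    return reduce(mul, probs, 1.0)

# hypothetical stage probabilities: hijack, collapse, water damage
print(joint_probability([1e-4, 1e-3, 1e-2]))
```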

Page 41:

Probabilities

• Risk assessments requiring “probabilities” have little utility for rare events

• Why? We can’t rationally assess the probability

• Such probabilistic analysis attempts may also diminish focus on the root cause of the outage, and may detract from remediating vulnerabilities

• In the 9-11 case the issue was one of TCOM “over-concentration”, or creation of a large single point of failure (SPF)

Page 42:

Typical TCOM Power

(Diagram: commercial AC and a backup generator feed rectifiers; the rectifiers supply a DC distribution panel, backed by batteries, with alarms monitoring the chain)

Page 43:

PCS Architecture

Page 44:

B. Telecommunications Infrastructure

• Wireline architecture and vulnerabilities

• Wireless architecture and vulnerabilities

• Cable architecture and vulnerabilities

Page 45:

Outline

A. ICT Infrastructure Risk

B. Examples of ICT Network Infrastructure

C. RAM-R: Reliability, Availability, Maintainability and Resiliency

D. Protection Level Assessment & Forecasting

Page 46:

Public Switched Telephone Network

• Architecture

• Local and Tandem Switching

• Transmission

• Signaling & SS7

• Power

• Vulnerabilities

Page 47:

PSTN End to End Connections

Page 48:

Switching Infrastructure Dispersal/Concentration

Retrieved from Wikipedia November 7, 2007. http://en.wikipedia.org/wiki/Image:Central_Office_Locations.png

Page 49:

US Growth in Fiber

Page 50:

Transmission Vulnerabilities

• Fiber cuts with non-protected transmission systems

• Fiber over Bridges

• Fiber transmission failures inside carrier facilities

• Digital Cross Connect Systems

• Local Loop Cable Failures

Page 51:

Transmission Vulnerabilities

• Fiber cuts with non-protected transmission systems:
– No backup path/circuits deployed
– Often done for economic reasons
– In urban areas where duct space is at a premium
– In rural areas where large distances are involved

• Fiber over Bridges:
– Fiber is vulnerable when it traverses bridges to overcome physical obstacles such as water or canyons

– There have been reported instances of fires damaging cables at these points

Page 52:

Transmission Vulnerabilities

• Fiber transmission failures inside carrier facilities:
– Studies have demonstrated that the majority of fiber transmission problems actually occur inside carrier facilities

– Caused by installation and maintenance activities

• Digital Cross Connect Systems:
– Although protected by hot-standby equipment, DACSs have failed, taking down primary and alternate transmission paths

– These devices represent large-impact SPFs

• Local Loop Cable Failures:
– In some instances, construction has severed multipair cable, or cable sheaths have become flooded

– Require long-duration splicing or replacement

Page 53:

Proper SONET Ring Operation

(Diagram: the ring tolerates a node failure and a fiber cut; legend: shared marking means same fiber, cable, duct, or conduit)

Page 54:

Improper Operation of SONET Rings

(Diagrams; legend: shared marking means same fiber, cable, duct, or conduit)

– Improper Maintenance: node’s previous failure, and subsequent fiber cut prior to spare on hand

– Improper Deployment: “collapsed” or “folded” ring sharing same path or conduit

Page 55:

Outside Plant Vulnerable Near Central Offices

Page 56:

SS7 Architecture

(Diagram: SSPs connected to mated STP pairs, which connect to SCPs)

Legend:
– A, B, C, or F: transmission link
– SSP: Signaling Service Point (local or tandem switch)
– STP: Signal Transfer Point (packet switch router)
– SCP: Service Control Point

Page 57:

SS7 Vulnerabilities

• Lack of A-link path diversity: links share a portion or a complete path

• Lack of A-link transmission facility diversity: A-links share the same high speed digital circuit

• Lack of A-link power diversity: A-links are separate transmission facilities, but share the same power circuit

• Lack of timing redundancy: A-links are digital circuits that require external timing. This should be accomplished by redundant timing sources.

• Commingling SS7 link transmission with voice trunks and/or alarm circuits: it is not always possible to allocate trunks, alarms, and A-links to separate transmission facilities.

Page 58:

SS7 A-Links

(Diagrams: in proper deployment, Switch 1 and Switch 2 reach the STP pair over physically diverse paths, so a single cut leaves SS7 signaling intact; in improper deployment, both A-links share the same fiber, cable, duct, or conduit, so a single cut isolates the switch)

Page 59:

SS7 A-Links

(Diagrams: in improper deployment, both A-links from the switch ride the same DS3 mux, fiber-optic transceiver, and fiber cable, so a single cut takes down both; in proper deployment, each A-link has its own DS3 mux, transceiver, and fiber cable)

Page 60:

SS7 A-Links

(Diagram: A-links on separate DS3 muxes, transceivers, and fiber cables 1 and 2, but both fed from one DC power source through a single fuse, a shared power single point of failure)

Page 61:

Power Architecture & Vulnerabilities

• Redundant Power

– Commercial AC

– AC Generator

– Batteries

Page 62:

Inoperative Alarms

• Loss of commercial power
• Damaged generator
• Untested or inoperable alarms prior to loss and damage
• Batteries deplete

(Diagram: the typical TCOM power chain, commercial AC and backup generator, rectifiers, DC distribution panel, and battery backup, with its alarms inoperable)

Page 63:

Economy of Scale Over-Concentration Vulnerabilities

(Diagrams: a distributed topology with SW1, SW2, and SW3 at separate sites vs. all three switches concentrated in one building; local loops, fiber pair gain, and trunks to tandem shown)

Page 64:

Wireless Personal Communication Systems

• Architecture

• Mobile Switching Center

• Base Station Controllers

• Base Stations

• Inter-Component Transmission

• Vulnerabilities

Page 65:

PCS Architecture


PCS Component Failure Impact: Wireless Infrastructure Building Block (WIB)

Components                  Users Potentially Affected
Database                    100,000
Mobile Switching Center     100,000
Base Station Controller     20,000
Links between MSC and BSC   20,000
Base Station                2,000
Links between BSC and BS    2,000


Outages at Different Times of Day Impact Different Numbers of People


Concurrent Outages are a Challenge for Network Operators

[Figure: Ten MSCs homing on an anchor switch and a PSTN gateway]


Episodic Outage Events

• Episodes are defined as events when either:
– A single outage occurs, or
– Multiple concurrent outages are ongoing


Distribution of Multi-outage Episodes over One Year

No. Outages      Number of WIBs (users served)
in Episode    2 (200K)  4 (400K)  6 (600K)  8 (800K)  10 (1 M)
    1            105       191       254       304       342
    2              4        18        38        54        77
    3              0         2         7        14        21
    4              0         0         1         3         5
    5              0         0         0         0         1
    6              0         0         0         0         0
    7              0         0         0         0         0

Andy Snow, Yachuan Chen, and Gary Weckman, "The Impact of Multi-Outage Episodes on Large-Scale Wireless Voice Networks," The International Journal on Networks and Services, vol. 5, no. 3&4, pp. 174–188, 2012, IARIA.
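The episode bookkeeping behind a table like this can be sketched by merging overlapping outage intervals into episodes; the intervals below are hypothetical, not data from the study:

```python
# Hypothetical outage intervals (start_hr, end_hr) observed on one network.
# An episode is a maximal period during which at least one outage is ongoing;
# overlapping intervals are merged, counting the outages that fall in each.
outages = [(0, 4), (3, 6), (10, 12), (11, 15), (14, 16), (30, 31)]

episodes = []  # each entry is [episode_start, episode_end, outage_count]
for start, end in sorted(outages):
    if episodes and start <= episodes[-1][1]:
        episodes[-1][1] = max(episodes[-1][1], end)  # extend the ongoing episode
        episodes[-1][2] += 1                         # one more concurrent outage
    else:
        episodes.append([start, end, 1])             # a new episode begins

print(episodes)  # [[0, 6, 2], [10, 16, 3], [30, 31, 1]]
```

Counting episodes whose outage count is 2 or more gives the multi-outage rows of the table.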


Outline

A. ICT Infrastructure Risk

B. ICT Network Infrastructure

C. RAM-R: Reliability, Availability,Maintainability and Resiliency

D. Protection Level Assessment &Forecasting


RAMS

• Reliability – f( MTTF )

• Maintainability – f( MTTR )

• Availability – f( MTTF, MTTR)

• Resiliency – f( MTTF, MTTR, Severity )

• Resiliency Metrics and Thresholds

• User vs. System Administrator Perspectives


Reliability

• Reliability is the chance equipment or a service will operate as intended in its environment for a specified period of time.

• A function of the mean time to failure (MTTF) of the equipment or service.

• Reliability deals with:

– "How often can we expect this equipment/service to not fail", or,

– "What is the expected lifetime of the equipment/service"?


Mean Time To Failure (MTTF)

• How do we get it?

– If the equipment/service has been fielded, the MTTF is the arithmetic mean of the observed times to fail.

– If it is not yet fielded, it is the predicted lifetime.

• There is a very simple way to calculate the reliability if arrivals are a Poisson process (i.i.d. and exponentially distributed inter-failure times):

R = e^(−λt), where λ = 1/MTTF

• R is the reliability, or the chance the service/component will be operational for time t. λ is known as the failure rate, the reciprocal of the MTTF.

• If λ is constant, the exponentially distributed arrival assumption is conservative.


Reliability Example

• What is the chance a switch with an MTTF of 5 years will operate without failure for 5 years? 1 year? 1 week?

R_5yr = e^(−t/MTTF) = e^(−5/5) = e^(−1) = 0.368
R_1yr = e^(−t/MTTF) = e^(−1/5) = e^(−0.2) = 0.818
R_1wk = e^(−t/MTTF) = e^(−(1/52)/5) = e^(−0.00385) = 0.996
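These calculations can be checked in a few lines of Python (a sketch of the exponential model above; times are in years):

```python
import math

def reliability(t, mttf):
    """R(t) = exp(-t/MTTF): chance of no failure through time t
    under a Poisson failure process."""
    return math.exp(-t / mttf)

mttf = 5.0  # years, as in the switch example
print(round(reliability(5, mttf), 3))       # 0.368
print(round(reliability(1, mttf), 3))       # 0.819 (the slide rounds down to 0.818)
print(round(reliability(1 / 52, mttf), 3))  # 0.996
```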


Reliability Curves

[Figure: Reliability versus time (0–5 years) for MTTF values of 1/2, 1, 2, 3, 4, and 5 years]


Maintainability

• Equipment or service maintainability is the chance a piece of failed equipment will be fixed/replaced in its environment by a specified period of time.

• It is a function of the mean time to repair (MTTR), the reciprocal of the "service rate" μ, and for exponential repair (a conservative assumption):

M = 1 − e^(−μt), where μ = 1/MTTR

• Basically, maintainability deals with:
– "How fast can we expect to repair/replace this equipment", or
– The "expected repair time".

• The restore time includes the total elapsed time to realize there is an outage, isolate the fault, travel to the site, repair and test the service/component, and put the service/component back into service.


Maintainability Example

• A DS3 digital circuit has an MTTR of 12 minutes. What is the chance the DS3 will be recovered for use in 1 minute?

M_1min = 1 − e^(−t/MTTR) = 1 − e^(−1/12) = 1 − e^(−0.0833) = 0.08
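A matching sketch for the repair side, assuming exponential repair as above:

```python
import math

def maintainability(t, mttr):
    """M(t) = 1 - exp(-t/MTTR): chance a failed item is restored within time t."""
    return 1 - math.exp(-t / mttr)

# DS3 example: MTTR = 12 minutes, recovery window = 1 minute
print(round(maintainability(1, 12), 2))  # 0.08
```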


Availability

• Availability is an attribute of either a service or a piece of equipment. Availability has two definitions:

– The chance the equipment or service is "UP" when needed (instantaneous availability), and

– The fraction of time the equipment or service is "UP" over a time interval (interval or average availability).

• Interval availability is the most commonly encountered.

• Unavailability is the fraction of time the service is "DOWN" over a time interval: U = 1 − A


Availability (Continued)

• Over some time interval, availability can be retrospectively calculated from the total uptime experienced over the interval:

A = UPTIME / TIME_INTERVAL

• Availability can also be calculated prospectively from the MTTF and MTTR of the equipment or service:

A = MTTF / (MTTF + MTTR)

• So availability is a measure of how often an item/service fails and, when it does fail, how long it takes to fix.

• In an availability profile, the time between failures equals the time to failure plus the time to repair/restore, leading to:

MTBF = MTTF + MTTR


Availability Example

• A telecommunications service has an MTTF of 620 hours and an MTTR of 30 minutes.
– What is the availability of the service?
– How many hours per quarter can we expect the service to be down?

A = MTTF / (MTTF + MTTR) = 620 / 620.5 = 0.99919
U = 1 − A = 0.00081
Down_Time = 0.00081 × 24 hrs/day × 30 days/month × 3 months ≈ 1.74 hours per quarter
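The same arithmetic in Python; the 24 × 30 × 3 quarter approximation follows the slide:

```python
mttf_hr = 620.0   # hours
mttr_hr = 0.5     # 30 minutes

A = mttf_hr / (mttf_hr + mttr_hr)   # interval availability
U = 1 - A                           # unavailability
quarter_hr = 24 * 30 * 3            # hours per quarter, as approximated on the slide

print(round(A, 5))               # 0.99919
print(round(U * quarter_hr, 2))  # 1.74 hours of expected downtime per quarter
```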


Availability Curves

[Figure: Availability (0.999–1.0) versus MTTF (0–5 years) for MTTR values of 10 minutes, 1 hour, 2 hours, and 4 hours]


Resiliency

• There are shortcomings with assessing a large ICT infrastructure from RAM perspectives alone.

• First, the infrastructure often offers many different services over wide geographic areas.

• Second, large ICT infrastructures are rarely completely "up" or "down".

• They are often "partially down" or "mostly up".

• It is rare for an infrastructure serving hundreds of thousands or millions of users not to have some small portion of subscribers out at any one time.

• Resiliency describes the degree to which the ICT system can serve users when experiencing service outages.


Outage Profiles

[Figure: Percent of users served versus time for two outages, each characterized by a severity SV and a duration D]

SV = severity of outage; D = duration of outage
Outage 1: failure and complete recovery, e.g. a switch failure
Outage 2: failure and graceful recovery, e.g. a fiber cut with rerouting


Resiliency Thresholds

• One way to measure resiliency is to set a severity threshold and observe the fraction of time the infrastructure is in a resilient state.

• Why set a threshold? At any instant in an ICT system there are bound to be a small number of users without service.

• Resiliency deficits are not small-event phenomena.

• We can define resiliency S as the fraction of time the infrastructure is in a resilient state, where MTTRD is the mean time to a resiliency deficit (RD) and MTRRD is the mean time to restore the resiliency deficit:

S = MTTRD / (MTTRD + MTRRD)

[Figure: Severity versus time with a survivability threshold; time spent above the threshold (outages SV1/D1 and SV2/D2) is non-survival, time below it is survival]
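A minimal sketch of the threshold idea; the outage log, observation window, and threshold are hypothetical, and outages are assumed not to overlap:

```python
# Each outage: (start_hr, end_hr, fraction_of_users_without_service)
outages = [(10, 12, 0.30), (100, 100.5, 0.05), (500, 503, 0.60)]
T = 1000.0        # observation window, hours
threshold = 0.25  # severity above this counts as a resiliency deficit

# Total time spent above the severity threshold
deficit_time = sum(end - start for start, end, sev in outages if sev > threshold)
S = 1 - deficit_time / T  # fraction of time in a resilient state
print(S)  # 0.995
```

The small 0.05-severity outage never breaches the threshold, so it does not count against resiliency, which is exactly why a threshold is set.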


Severity

• The measure of severity can be expressed in a number of ways, some of which are:
– Percentage or fraction of users potentially or actually affected
– Number of users potentially or actually affected
– Percentage or fraction of offered or actual demand served
– Offered or actual demand served

• The distinction between "potentially" and "actually" affected is important.

• If a 100,000-line switch were to fail and be out from 3:30 to 4:00 am, there are 100,000 users potentially affected. However, if only 5% of the lines are in use at that time of the morning, 5,000 users are actually affected.
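The slide's arithmetic as a quick check (the 5% occupancy figure is the slide's assumption):

```python
# Worked example from the slide: a switch outage from 3:30 to 4:00 am.
lines = 100_000    # lines homed on the failed switch
occupancy = 0.05   # fraction of lines in use at that hour (assumed)

potentially_affected = lines
actually_affected = int(lines * occupancy)
print(potentially_affected, actually_affected)  # 100000 5000
```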


Outages at Different Times of Day Impact Different Numbers of People


User vs. System Administrator Perspectives

• User Perspective – High End-to-End Reliability and Availability

– Focus is individual

• SysAdmin Perspective – High System Availability and Survivability

– Focus is on large outages and large customers


Minimizing Severity of Outages

• It is not always possible to completely avoid failures that lead to outages. Proactive steps can be taken to minimize their size and duration:
– Avoiding single points of failure that can affect large numbers of users,
– Having recovery assets optimally deployed to minimize the duration of outages.

• This can be accomplished by:
– Ensuring there is not too much over-concentration of assets in single buildings or complexes
– Properly deploying and operating fault-tolerant ICT architectures
• Equipment/power fault tolerance
• Physically and logically diverse transmission systems/paths
– Ensuring there is adequate trained staff and dispersal of maintenance capabilities and assets


9-11 TCOM Collateral Damage

• The telecommunications facility adjacent to the World Trade Center towers is an example of over-concentration:

– 4,000,000 data circuits originated, terminated, or passed through that facility, which experienced catastrophic failure with the onset of water/structural damage.

– Such "mega-SPFs" ought to be avoided. If they cannot be, significant contingency plans/capabilities should exist.


Examples of Statistical Reliability Analysis


Superposed Point Processes

[Figure: Several individual point processes (equipment, causes, services, etc.) merged onto a single time line as a superposed process]
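Superposing is just a time-ordered merge of the component event streams; the failure epochs below are made up for illustration:

```python
import heapq

# Hypothetical failure epochs (hours) for three individual point processes
equipment = [12.0, 250.0, 610.0]
causes = [90.0, 400.0]
services = [5.0, 480.0, 900.0]

# The superposed process is the time-ordered merge of all events
superposed = list(heapq.merge(equipment, causes, services))
print(superposed)  # [5.0, 12.0, 90.0, 250.0, 400.0, 480.0, 610.0, 900.0]
```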


Classification of Failure/Outage Event Processes



Hypothetical HPP

[Figure: Cumulative event count versus time (0–10 hours) for a hypothetical HPP: no trend, constant reliability]


Hypothetical NHPP

[Figure: Cumulative event count versus time for hypothetical NHPPs, including one whose rising rate of occurrence indicates reliability deterioration]


Process Classification (HPP-RP-NHPP)

[Figure: Classification flowchart. Data undergoes trend testing; a significant trend yields an NHPP. Otherwise a dependence test separates a BPP (or other process) from a renewal process (RP); an RP whose inter-event times fit an exponential distribution is an HPP, and otherwise RP distributions such as Weibull, Gamma, or Lognormal are fitted]
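The decision flow can be sketched as a small function (one assumed reading of the flowchart, not the authors' code):

```python
def classify(trend_significant, gaps_independent, gaps_exponential):
    """One assumed reading of the HPP-RP-NHPP classification flowchart."""
    if trend_significant:
        return "NHPP"  # a significant trend rules out stationarity
    if not gaps_independent:
        return "BPP or other"  # dependent inter-event times
    if gaps_exponential:
        return "HPP"  # renewal process with exponential gaps
    return "RP: fit Weibull, Gamma, or Lognormal"

print(classify(False, True, True))  # HPP
```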


Hypothesis Testing

• Time Series of Event Trends (Event = Outage)

– Laplace Test

– Lewis-Robinson Test

– Military Handbook Test

• Time Event Dependence

– Correlation tests



Hypothesis Testing

• LAPLACE

– Null H0: There is no trend

– Alternative Ha: The process is a NHPP

• Lewis-Robinson

– Null H0: The process is a RP

– Alternative Ha: The process is a NHPP

• Military Handbook Test

– Null H0: The process is a HPP

– Alternative Ha: The process is a NHPP

• Dependence Test

– Null H0: The process is classified as a BPP or other

– Alternative Ha: The process is classified as an RP



Proposed Methods for Reliability

• Visual trend assessment

• Analytic tests of trends (Laplace, Lewis-Robinson, MilHbk tests)

• Independence of events (significance of the first autocorrelation coefficient)

• Uniform distribution over time tests (Chi-Squared)



Analytical Tests for Sub-Processes

• LAPLACE
– T = 14 years; n = count

• LEWIS-ROBINSON TEST
– U_LR = U_L / CV
– CV = coefficient of variation of the inter-event times (standard deviation divided by the mean)

• MILITARY HANDBOOK TEST
– χ²_2n = 2 Σ ln(T/t_i)
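These statistics can be sketched in Python. The Laplace statistic below uses its standard time-truncated form, and the event times are hypothetical; none of this is the tutorial's own code or data:

```python
import math

def laplace_u(times, T):
    """Laplace statistic for n event times in (0, T] (time-truncated form)."""
    n = len(times)
    return (sum(times) / n - T / 2) / (T * math.sqrt(1 / (12 * n)))

def lewis_robinson_u(times, T):
    """Laplace statistic divided by the CV (std/mean) of the inter-event gaps."""
    gaps = [b - a for a, b in zip([0] + list(times[:-1]), times)]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / (len(gaps) - 1)
    return laplace_u(times, T) / (math.sqrt(var) / mean)

def milhbk_chi2(times, T):
    """Military Handbook statistic: chi-square with 2n degrees of freedom."""
    return 2 * sum(math.log(T / t) for t in times)

times = [50, 120, 210, 330, 480, 700]      # hypothetical outage times, hours
print(round(laplace_u(times, 1000), 2))    # -1.57 -> no significant trend at 5%
print(round(milhbk_chi2(times, 1000), 2))  # 17.75
```

The Laplace value would be compared against the ±1.96 critical values on the next slide, and the MilHbk value against chi-square percentiles with 2n degrees of freedom.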



Hypothesis Testing U-Score Critical Values

U < −1.96: Reliability growth
−1.96 ≤ U ≤ +1.96: No trend
U > +1.96: Reliability deterioration



MilHbk Trend Test Hypothesis (Chi-Square Percentile Values)

Degradation: 0–5%
No trend: 5–95%
Improvement: 95–100%



Reliability Growth Sample Result

[Figure: Cumulative failure count and ROCOF per month for failures in the size bin 16,000 < size ≤ 64,000. UL: -20.2612, ULR: -13.9816, MHK: 100%]


Reliability Deterioration Sample Result

[Figure: Cumulative failure count and ROCOF per month for failures with duration > 240 minutes. UL: 28.1397, ULR: 14.2292, MHK: 0%]


Constant Reliability Sample Result

[Figure: Cumulative failure count and ROCOF per month for failures in the duration bin 60 < duration ≤ 120 minutes. UL: 5.1407, ULR: 3.7035, MHK: 22.68%]


Outline

A. ICT Infrastructure Risk

B. ICT Network Infrastructure

C. RAM-R: Reliability, Availability,Maintainability and Resiliency

D. Protection Level Assessment &Forecasting


Empirical CIP Assessment

• Industry Best Practices & Standards

• Reviewing Disaster Recovery Plans for Rational Reactive/Proactive Balance

• Outage Data Collection and Analysis


Industry Best Practices & Standards

• Industry best practices deal with architecture, design, installation, operations, and maintenance activities.

• Deviations from best practices should never be accidental, as an inadvertent or unknown deviation represents a latent vulnerability that can be triggered or exploited.

• In the U.S., wireline best practices were initially developed as a Network Reliability & Interoperability Council (NRIC) initiative. [1]

• The scope of best practices has been expanded to cover the major network types, and there are over 700 common best practices.

• A website can be queried by network type, industry role, and keyword:

[1] NRIC is a federal advisory council to the Federal Communications Commission, which has been continuously re-chartered since 1992.


NRIC Industry Best Practices

• Network Type

– Cable

– Internet/Data

– Satellite

– Wireless

– Wireline

• Industry Role

– Service Provider

– Network Operator

– Equipment Supplier

– Property Manager

– Government

www.fcc.gov/nors/outage/bestpractice/BestPractice.cfm


Prevention vs. Reaction

• Preventing outages requires both capital and operational expenses.
– Capital expenditures for such items as backup AC generators, batteries, redundant transmission paths, etc. can be very large.
– Capital expenses to remove some vulnerabilities might be cost prohibitive, wherein the risk is deemed acceptable.

• Users might not be aware that the service provider has a vulnerability that they do not plan to remediate.

• The regulator and the service provider might have significant disagreements as to what is an acceptable risk.
– For instance, duct space in metropolitan areas might present significant constraints to offering true path diversity of fiber cables.
– Or rural local switches might present considerable challenges for designers to offer two separate SS7 access links.


Prevention vs. Reaction

• Disaster recovery plans are geared toward reacting to outages rather than preventing them.
– It is very important not to overlook the importance of fault removal plans.

• There must be an adequate balance between:
– Preventing outages and reacting to outages once they have occurred.
– This is a delicate economic equilibrium point which service providers struggle with.
– Customers should be aware of this balance and the competing perspectives.



Outage Data Collection and Analysis

• Outage data is the bellwether of infrastructure vulnerability.

• The faults which manifest themselves because of vulnerabilities are an indicator of the reliability and survivability of critical ICT infrastructure.

• It is important to track reliability and survivability in order to assess whether the protection level is increasing, constant, or decreasing.

• Root Cause Analysis (RCA) is instrumental in improvements:
– Trigger
– Direct
– Root


Assessment Case Studies

• Case 1: Wireless Survivability InfrastructureImprovement Assessment with ANN

• Case 2: Chances of Violating SLA by MonteCarlo Simulation

• Case 3: TCOM Power Outage Assessment byPoisson Regression & RCA

• Case 4: SS7 Outages Assessment by PoissonRegression & RCA


Case 1: Wireless Survivability InfrastructureImprovement Assessment with ANN

“Evaluating Network Survivability Using Artificial Neural Networks” by Gary R. Weckman, Andrew P. Snow, and Preeti Rastogi

“Assessing Dependability of Wireless Networks Using Neural Networks” by A. Snow, P. Rastogi, and G. Weckman


Introduction

• Critical infrastructures such as network systems must exhibit resiliency in the face of major network disturbances.

• This research uses computer simulation and artificial intelligence to introduce a new approach to assessing network survivability:
– A discrete time event simulation is used to model survivability
– The simulation results are in turn used to train an artificial neural network (NN)

• Survivability is defined over a timeframe of interest in two ways:
– Fraction of network user demand capable of being satisfied
– Number of outages experienced by the wireless network exceeding a particular threshold


Wireless Infrastructure Block (WIB)

• MSC: Mobile Switching Center
• PSTN: Public Switched Telephone Network
• SS7: Signaling System 7


WIB Characteristics

Components                  Quantity in Each WIB
Database                    1
Mobile Switching Center     1
Base Station Controller     5
Links between MSC and BSC   5
Base Station                50
Links between BSC and BS    50

Components                  Customers Affected
Database                    100,000
Mobile Switching Center     100,000
Base Station Controller     20,000
Links between MSC and BSC   20,000
Base Station                2,000
Links between BSC and BS    2,000


Reliability and Maintainability Growth, Constancy, and Deterioration Scenarios


Simulation and Neural Network Model


Biological Analogy

• Brain Neuron

• Artificial neuron

• Set of processing elements (PEs) and connections (weights) with adjustable strengths

[Figure: Inputs X1–X5 weighted by w1…wn feeding f(net); the network is organized into input, hidden, and output layers]


ANN Model

[Figure: Twelve ANN inputs (MTTF and MTR for the database, BS, BSC, MSC, BS–BSC links, and BSC–MSC links) mapped to a single output: reportable outages]


Research Methodology


Simulation vs. Neural Network Outputs for FCC-Reportable Outages

[Figure: Neural network output plotted against simulation output for reportable outages (0–40), and both outputs plotted per exemplar, showing close agreement]


Sensitivity Analysis

[Figure: sensitivity about the mean of survivability and of FCC-reportable outages with respect to each input: the MTTF and MTR of the BS, BSC, BSC-BS, MSC, MSC-BSC, and DB components]


COMPOUNDED IMPACT ON GROWTH AND DETERIORATION

Years   Compounded Growth (%)   Compounded Deterioration (%)
1       10                      10
2       21                      19
3       33.1                    27.1
4       46.4                    34.4
5       61.1                    40.9

[Figure: FCC-reportable outages (0–35) over years 1–5 for the nine reliability/maintainability scenarios RG/MD, RD/MG, RD/MD, RG/MC, RG/MG, RD/MC, RC/MD, RC/MG, and RC/MC]

[Figure: survivability (0.999600–0.999900) over years 1–5 for the same nine scenarios]


Conclusions (Continued)

• Reliability and/or maintainability:
– Deterioration below nominal values affects wireless network dependability more than growth
– Growth beyond nominal values does not improve survivability performance much
– Cost/performance ratio plays an important role in deciding R/M improvement strategies

• Scenario RG/MG gives the lowest values for FCC-reportable outages, lost line hours, and WIB downtime (high survivability)
– Cost is high for marginal survivability improvement

• Scenario RD/MD indicates massive decreases in survivability
– Fighting deterioration is more important than achieving growth


Conclusions

• For FCC-reportable outages and survivability, reliability deterioration below nominal values cannot be compensated by maintainability growth, whereas maintainability deterioration can be compensated by reliability growth

• Benefits of an ANN model:
– A wireless carrier can find the expected number of threshold exceedances for a given set of component MTTF and MTTR values
– Sensitivity analysis tells us the most important components


Conclusions

• Results indicate neural networks can be used to examine a wide range of reliability, maintainability, and traffic scenarios to investigate wireless network survivability, availability, and the number of FCC-reportable outages

• Not only is the NN a more efficient modeling method for studying these issues, but additional insights can be readily observed

• Limitations of the study:
– Covers only one wireless infrastructure building block (WIB); does not include the entire wireless network integrated with the PSTN
– Modeled for the 3G+ generation; however, the topology/hierarchy has similarities with 4G
– Optimization is completed without a cost function, so economic considerations are not entertained


Case 2: Chances of Violating an SLA by Monte Carlo Simulation

• Snow, A. and Weckman, G., What are the chances of violating an availability SLA?, International Conference on Networking 2008 (ICN08), April 2008.

• Gupta, V., Probability of SLA Violation for Semi-Markov Availability, Masters Thesis, Ohio University, March 2009.


What’s an SLA?

• Contractual agreement between a service provider and a customer buying a service

• Agreement stipulates some minimum QoS requirement
– Latency, throughput, availability, …

• Can have incentives or disincentives:
– Partial payback of service fees for not meeting QoS objectives in the agreement


Who Cares About Availability?

• End users of systems/services

• Providers of systems/services

• When a system/service is not available, customers could suffer:
– Inconvenience
– Lost revenue/profit
– Decreased safety


Availability Distribution

• Availability is a function of MTTF and MTTR

• MTTF is the arithmetic mean of TTFs, which are random variables

• MTTR is the arithmetic mean of TTRs, which are random variables

• As availability is a function of MTTF and MTTR, its distribution is complex


What is the problem with a mean?

• As availability is made up of means, it too is a mean

• The "Holy Grail" for availability is often:
– "Five nines", or
– 0.99999 = 99.999%
– Power system, T/E-3 digital link, etc.

• What is the problem with a mean?


More than One Way to Meet an Interval Availability Goal of 5-Nines

• For a given availability goal, many combinations of MTTF & MTTR produce the same availability

• However the …

AVAILABILITY   MTTF (Yr)   MTTR (Min)
0.99999        0.5         2.63
0.99999        1           5.26
0.99999        2           10.51
0.99999        3           15.77
0.99999        4           21.02
0.99999        5           26.28
0.99999        6           31.54
0.99999        7           36.79
0.99999        8           42.05
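The entries above all follow from the steady-state relation A = MTTF/(MTTF + MTTR); solving for repair time gives MTTR = MTTF·(1 − A)/A. A minimal sketch that reproduces the table (assuming the 365-day year the table's values imply):

```python
def mttr_for_goal(mttf_years: float, availability: float) -> float:
    """Repair time (minutes) that yields the target steady-state
    availability A = MTTF / (MTTF + MTTR) for a given MTTF."""
    mttf_min = mttf_years * 365 * 24 * 60  # MTTF in minutes (365-day year)
    return mttf_min * (1.0 - availability) / availability

# Every (MTTF, MTTR) pair below yields the same 5-nines availability.
for mttf in (0.5, 1, 2, 4, 8):
    print(f"MTTF = {mttf:>4} yr -> MTTR = {mttr_for_goal(mttf, 0.99999):6.2f} min")
```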


What we investigated

• Markov availability

– Exponential arrival of failures and independence of failures (HPP)

– Exponential repair time

• Semi-Markov availability

– Exponential arrival of failures and independence of failures (HPP)

– Nonexponential repair

• Used lognormal distribution (short and long tail)


Research Methodology

Gupta, V., Probability of SLA Violation for Semi-Markov Availability, Masters Thesis, Ohio University, March 2009.


Monte Carlo Simulation Methodology

• Take a sample from a Poisson distribution to see how many failures occur in 1 year

• For each failure, determine the TTR by taking a sample from EITHER the exponential or the lognormal repair distribution

• Calculate availability as (Uptime)/(1-Year Interval)

[Flowchart: Select MTTF → Find MTTR given A = 0.99999 and the specified MTTF → Determine how many failures occur in the 1-year interval → For each failure, determine the TTR → Calculate availability for the year using the TTRs → Repeat 10,000 times]
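The simulation procedure above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the thesis code: it samples a Poisson failure count for the year, draws a repair time per failure from an exponential or lognormal distribution with the same mean MTTR, and counts how often the year's availability falls below the 5-nines goal. The lognormal shape parameter `sigma` is an assumption chosen only for illustration.

```python
import math
import random

def pr_sla_violation(mttf_yr=4.0, goal=0.99999, repair="exp",
                     runs=20_000, seed=1):
    """Monte Carlo estimate of Pr{interval availability < goal}
    over a 1-year interval."""
    rng = random.Random(seed)
    year_min = 365 * 24 * 60
    mttr_min = mttf_yr * year_min * (1 - goal) / goal  # MTTR matching the goal
    lam = 1.0 / mttf_yr                                # failures per year
    # Lognormal parameterized so its mean equals mttr_min (sigma sets the tail;
    # sigma = 1 is an illustrative assumption).
    sigma = 1.0
    mu = math.log(mttr_min) - sigma**2 / 2
    violations = 0
    for _ in range(runs):
        # Poisson sample by inversion (stdlib random has no Poisson sampler).
        n, p, thresh = 0, 1.0, math.exp(-lam)
        while True:
            p *= rng.random()
            if p <= thresh:
                break
            n += 1
        if repair == "exp":
            down = sum(rng.expovariate(1 / mttr_min) for _ in range(n))
        else:
            down = sum(rng.lognormvariate(mu, sigma) for _ in range(n))
        if (year_min - down) / year_min < goal:
            violations += 1
    return violations / runs

print(pr_sla_violation(repair="exp"))  # roughly 0.17-0.22, per the slides
```

For MTTF = 4 years and a 1-year interval this lands in the 17–22% range quoted in the conclusions, and the exponential and lognormal variants differ only modestly.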


Cumulative Distribution Function
MTTF = 4 Yr; MTTR = 21.02 min; TI = 1 Yr

[Figure: CDF of interval availability from 0.999925 to 1.000000 for the long-tail, short-tail, and exponential repair distributions]


Conditional Cumulative Distribution Function
MTTF = 4 Yr; MTTR = 21.02 min; TI = 1 Yr

[Figure: CDF of interval availability, conditioned on at least one failure, from 0.999925 to 1.000000 for the long-tail, short-tail, and exponential repair distributions]


Cumulative Distribution Function
MTTF = 4 Yr; MTTR = 20.6 min; TI = 1 Yr; A = 0.999999

[Figure: CDF of interval availability from 0.999925 to 1.000000 for the long-tail and short-tail repair distributions]


Conditional Cumulative Distribution Function
MTTF = 4 Yr; MTTR = 20.6 min; TI = 1 Yr; A = 0.999999

[Figure: conditional CDF of interval availability from 0.999925 to 1.000000 for the long-tail and short-tail repair distributions]


Long Tail Lognormal Distribution
TI = 1 Yr; A = 0.99999

MTTF (Yr)           0.25      0.5       1         2         4
A = 1               1.81%     13.49%    36.18%    60.65%    77.59%
A < 0.99999         98.19%    86.43%    63.44%    37.91%    20.24%
0.99999 ≤ A < 1     0.00%     0.08%     0.38%     1.44%     2.17%


Conditional Long Tail Lognormal Distribution
TI = 1 Yr; A = 0.999999

MTTF (Yr)           0.25       0.5       1         2         4
A < 0.99999         100.00%    99.91%    99.40%    96.34%    90.32%
0.99999 ≤ A < 1     0.00%      0.09%     0.60%     3.66%     9.68%


Some Conclusions

• Pr{SLA violation} for 5-nines is fairly insensitive to the long-tail and short-tail distributions studied
– Largest difference found due to distribution: about 5%
– Exponential repair distribution is a pretty safe assumption
• High-reliability scenarios depend upon no failures in the interval to meet a 5-nines SLA
– If there is a failure in the interval, the SLA is missed the majority of the time
• The shorter the interval, the less chance of violating a 5-nines SLA, e.g. for MTTF 4 years:
– Interval ¼ year: Pr{SLA violation} about 5%
– Interval ½ year: Pr{SLA violation} about 9–12%
– Interval 1 year: Pr{SLA violation} about 17–22%


Some Conclusions (Continued)

• Availability engineering margin
– Engineered availability of 6-nines to meet a 5-nines objective
– For the cases investigated, drives Pr{SLA violation} to 2% or less
– Essentially removes the distribution tail as a Pr{SLA violation} factor
– Even if there is a failure, maintenance ensures the 5-nines objective is met almost all the time

• When someone is selling/buying an availability SLA, it is good to know:
– The availability engineering margin
– How much the service provider is depending upon no failures¹
– Actual MTTR statistics

¹ Based upon statistics anonymously passed to the author, recovery time for a DS3 circuit was reported to be about 3.5 hours


Case 3: TCOM Power Outage Assessment by Poisson Regression & RCA

• "Modeling Telecommunication Outages Due To Power Loss", by Andrew P. Snow, Gary R. Weckman, and Kavitha Chayanam

• "Power Related Network Outages: Impact, Triggering Events, And Root Causes", by A. Snow, K. Chayanam, G. Weckman, and P. Campbell


Introduction

• Management must include the ability to monitor the AC and DC power capabilities necessary to run the network
• In large-scale networks, communication facility power is often triply redundant
• In spite of significant redundancy, loss of power to communications equipment affects millions of telecommunications subscribers per year
• This is an empirical study of 150 large-scale telecommunications outages reported by carriers to the Federal Communications Commission, occurring in the US over an 8-year period
– Data includes the date/time of each outage, allowing time-series reliability analysis


Overview

• Reasons for loss of power to communications equipment
• This study analyzes this special class of telecommunications outages over an 8-year period and is based on information found in outage reports to the FCC
– The outages involve the failure of redundant power systems
– Sequential events lead to complete power failure
• During the 8-year study period there were 1,557 FCC-reportable outages. This study considers:
– The 150 outages in which the service disruption was caused by loss of power to communications equipment, referred to as 'power outages'


Power Wiring Diagram

[Figure: power wiring diagram showing the AC circuit (Com AC and G feeding the ACTS and CB into the rectifiers) and the DC circuit (batteries and CB/F feeding the load L)]

Com AC: Commercial AC; G: Generator; ACTS: AC Transfer Switch; CB: Main Circuit Breaker
Rec: Rectifiers; B: Batteries; CB/F: DC Ckt Breakers/Fuses; L: Communication Load


METHODOLOGY

• A nonhomogeneous Poisson process (NHPP) is often suggested as an appropriate model for a system whose failure rate varies over time
– In the early years of development, the term "learning curve" was used to explain the model's concepts, rather than "reliability growth"; J. T. Duane presented his initial findings as a "Learning Curve Approach to Reliability Monitoring"
– Duane (1964) first introduced the power law model for decreasing failure point processes

• In addition to the power law, another technique for modeling reliability growth is breakpoint analysis
– Breakpoint reliability processes have previously shown up in large-scale telecommunications networks


Power Outage Count per Quarter for an Eight-Year Study Period

Quarter        Count | Quarter         Count | Quarter         Count | Quarter         Count
1 (1st Q 96)   5     | 9 (1st Q 98)    2     | 17 (1st Q 00)   8     | 25 (1st Q 02)   5
2 (2nd Q 96)   7     | 10 (2nd Q 98)   11    | 18 (2nd Q 00)   4     | 26 (2nd Q 02)   2
3 (3rd Q 96)   5     | 11 (3rd Q 98)   5     | 19 (3rd Q 00)   6     | 27 (3rd Q 02)   1
4 (4th Q 96)   2     | 12 (4th Q 98)   4     | 20 (4th Q 00)   9     | 28 (4th Q 02)   0
5 (1st Q 97)   5     | 13 (1st Q 99)   1     | 21 (1st Q 01)   3     | 29 (1st Q 03)   0
6 (2nd Q 97)   11    | 14 (2nd Q 99)   8     | 22 (2nd Q 01)   5     | 30 (2nd Q 03)   2
7 (3rd Q 97)   6     | 15 (3rd Q 99)   7     | 23 (3rd Q 01)   9     | 31 (3rd Q 03)   10
8 (4th Q 97)   2     | 16 (4th Q 99)   4     | 24 (4th Q 01)   1     | 32 (4th Q 03)   0
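The Laplace trend test applied later in this case can be run directly on these grouped counts. A sketch in stdlib Python, assuming each outage occurred at its quarter's midpoint (a common approximation for grouped data); a strongly negative U indicates a significantly decreasing rate, i.e. reliability growth:

```python
import math

# Outage counts for quarters 1-32, from the table above
counts = [5, 7, 5, 2, 5, 11, 6, 2, 2, 11, 5, 4, 1, 8, 7, 4,
          8, 4, 6, 9, 3, 5, 9, 1, 5, 2, 1, 0, 0, 2, 10, 0]

def laplace_u(counts):
    """Laplace trend statistic for event times grouped into equal bins.
    U ~ N(0,1) under a homogeneous Poisson process; U < 0 suggests a
    decreasing intensity (reliability growth)."""
    T = len(counts)                       # observation window, in quarters
    n = sum(counts)
    # Place each event at its quarter's midpoint.
    t_sum = sum(c * (q + 0.5) for q, c in enumerate(counts))
    return (t_sum / n - T / 2) / (T * math.sqrt(1 / (12 * n)))

u = laplace_u(counts)
print(f"n = {sum(counts)}, U = {u:.2f}")  # well below -2: significant growth
```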


Power Outage Cumulative Quarterly Count

[Figure: cumulative quarterly outage count (0–150) over quarters 1–32]


Power Law Model

• The power law model is also called the Weibull reliability growth model (Ascher and Feingold, 1984)

• A commonly used infinite-failure model, which shows monotonic increase or decay in events

• This process is an NHPP
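As a rough illustration of the power law idea, the mean cumulative function N(t) ≈ a·t^b can be fitted to grouped cumulative counts by least squares in log-log space; b < 1 indicates a decaying event rate, b > 1 deterioration. This sketch uses the quarterly counts from the table above and is only an illustrative fit, not the estimation method used in the study:

```python
import math

# Outage counts for quarters 1-32
counts = [5, 7, 5, 2, 5, 11, 6, 2, 2, 11, 5, 4, 1, 8, 7, 4,
          8, 4, 6, 9, 3, 5, 9, 1, 5, 2, 1, 0, 0, 2, 10, 0]

def fit_power_law(counts):
    """Least-squares fit of log N(t) = log a + b log t, where N(t) is the
    cumulative count at the end of quarter t = 1..len(counts)."""
    cum, xs, ys = 0, [], []
    for t, c in enumerate(counts, start=1):
        cum += c
        xs.append(math.log(t))
        ys.append(math.log(cum))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

a, b = fit_power_law(counts)
print(f"N(t) ~ {a:.2f} * t^{b:.2f}")
```

Here b comes out close to 1, i.e. nearly constant rate overall, consistent with the study's finding that the power law fits this data poorly and that the growth was not monotonic.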


Piecewise Linear Model

[Figure, left: cumulative outages Λ(t) = ∫ λ(τ) dτ vs. time t — a piecewise linear model with a breakpoint at t0]

[Figure, right: outage intensity λ(t) vs. time t — a Poisson model, piecewise constant with a jump point at t0]
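A minimal version of the jump-point comparison: fit one constant Poisson rate to the whole series, and two constant rates split at a candidate breakpoint (quarter 23, i.e. Q3 2001, the 9-11 quarter), then compare log-likelihoods. This is an illustrative sketch, not the fitting procedure from the study:

```python
import math

# Outage counts for quarters 1-32
counts = [5, 7, 5, 2, 5, 11, 6, 2, 2, 11, 5, 4, 1, 8, 7, 4,
          8, 4, 6, 9, 3, 5, 9, 1, 5, 2, 1, 0, 0, 2, 10, 0]

def poisson_ll(counts, rate):
    """Poisson log-likelihood of per-quarter counts at a constant rate."""
    return sum(k * math.log(rate) - rate - math.lgamma(k + 1) for k in counts)

def jump_point_lrt(counts, bp):
    """Likelihood-ratio statistic (1 df) for a rate change after quarter bp
    versus a single constant rate over the whole period."""
    pre, post = counts[:bp], counts[bp:]
    ll_split = (poisson_ll(pre, sum(pre) / len(pre)) +
                poisson_ll(post, sum(post) / len(post)))
    ll_pooled = poisson_ll(counts, sum(counts) / len(counts))
    return 2 * (ll_split - ll_pooled)

bp = 23  # Q3 2001
pre_rate = sum(counts[:bp]) / bp
post_rate = sum(counts[bp:]) / (len(counts) - bp)
print(f"rate before: {pre_rate:.2f}/qtr, after: {post_rate:.2f}/qtr, "
      f"LRT = {jump_point_lrt(counts, bp):.1f}")
```

The pre-9-11 rate (about 5.6 outages/quarter) is more than twice the post-9-11 rate (about 2.3/quarter), and the likelihood-ratio statistic is far beyond the 1-df chi-square critical value, supporting the jump-point model over a single constant rate.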


Comparison of Power Law Model and Cumulative Outage Data

[Figure: cumulative count of power outages (0–150) over years 0–8, with the power law model overlaid]


Comparison of Piecewise Linear Model and Cumulative Outage Count

[Figure: cumulative quarterly count (0–150) over quarters 1–32, with the piecewise linear model overlaid]


Comparison of Jump Point Model to Quarterly Outage Count

[Figure: outages per quarter (0–12) over quarters 1–32, with the jump point model overlaid]


CONCLUSIONS

• Little evidence of a seasonal effect
– Not unusual, as not every commercial power outage results in a telecommunications power outage, because of backup power sources (generator and batteries)
– Hazards that take down commercial power occur throughout the year

• The Laplace trend test indicated strong statistical evidence of reliability growth
– Reliability growth was not monotonic, as evidenced by a poor fit to the power law model
– Evidence for continuous improvement was lacking

• Evidence for reliability growth occurring after 9-11 is strong
– The piecewise linear model with a rate-change jump point is the best reliability growth model found
– It clearly indicates two distinct processes with constant reliability, with improvement after the 9-11 attack


CONCLUSIONS

• It appears that 9-11 was episodic, with telecommunications carrier management and engineers focusing more closely on the reliability of critical infrastructures

• At this point, it is not known what proportions of this improvement are due to improved engineering, operational, or maintenance processes
– The abrupt improvement is highly suggestive of operational and maintenance efforts
– Perhaps 9-11 served as a wake-up call for service providers when it comes to business and service continuity? Time will tell.


OUTAGE CAUSES

• Trigger cause
– The event that initiates the sequence that finally resulted in the outage

• Direct cause
– The final event in the sequence of events that leads to the outage

• Root cause
– Gives insight into why the outage occurred, and how to avoid such outages in the future
– Found via a technique called Root Cause Analysis (RCA) [14]


Reliability Diagram with Failure Sequence

[Figure: reliability block diagram from Com AC/Generator through the ACTS, CB, rectifiers and batteries, DC distribution panel CB/F, and power circuits in the telecom equipment, with alarms; failure sequence numbered 1–3]


Root Cause Analyses: sample outages

Example 1: A lightning strike resulted in a commercial AC power surge, causing the rectifier AC circuit breakers to trip open. This means that AC from either the primary or backup source cannot be converted to DC. As a consequence, the batteries must supply power until the rectifiers are manually switched back on line. The alarm system does not work properly, and the NOC is not notified of the problem. After some time the batteries are exhausted, the communications equipment loses power, and an outage occurs.

• Trigger cause: Lightning strike.
• Direct cause: Battery depletion.
• Root cause: Maintenance -- failure to test the alarm system.


Root Cause Analyses: sample outages

Example 2: Torrential rains and flooding due to a tropical storm in Houston cause a commercial AC power failure. The generators in the communication complexes are supplied with fuel from supply pumps located in the basement of the building. Due to the flooding, water entered the basement, causing supply pump failure. Hence, the generators ran out of fuel, and the facility went on battery power. After some time, the batteries stopped supplying power to the equipment, thus resulting in an outage.

– Trigger cause: Storms (flooding).
– Direct cause: Battery depletion.
– Root cause: Engineering failure (the fuel pump system was placed in the basement in an area prone to flooding).


Root Cause Analyses: sample outages

Example 3: A wrench dropped by a maintenance worker landed on an exposed DC power bus, which shorted out. Exposed power buses should be covered before maintenance activity starts. Maintenance personnel error can be reduced by providing sufficient training to personnel.

– Trigger cause: Dropping a tool.
– Direct cause: DC short circuit.
– Root cause: Human error.


Impact of Outages Studied (Trigger and Root Causes)

Impact Category   Lost Customer Hours (LCH), Thousands   Number of Outages
Low               LCH < 250                              89
Medium            250 ≤ LCH < 1,000                      30
High              LCH ≥ 1,000                            31

Trigger Cause       Total Outages   Low Impact   Medium Impact   High Impact
Natural Disasters   14%             8%           16%             29%
Power Surges        18%             23%          10%             13%
Comm. AC Loss       38%             39%          37%             35%
Human Errors        30%             30%          37%             23%
Total               100%            100%         100%            100%

Root Cause       Total Outages   Low Impact   Medium Impact   High Impact
Engn. Error      2%              4%           3%              35%
Install. Error   23%             27%          27%             10%
Opns. Error      33%             37%          33%             23%
Maint. Error     27%             26%          37%             23%
Unforeseen       5%              6%           0.0%            10%
Total            100%            100%         100%            100%


Failing Power Component Associated with Root Cause

Component          Total Outages   Low Impact   Med. Impact   High Impact
Rectifiers         14%             9%           20%           23%
Batteries          13%             9%           23%           16%
Generators         18%             16%          13%           29%
AC Cir. Breakers   20%             23%          17%           16%
Comm. Equip.       12%             15%          10%           7%
DC Fuse/CB         10%             13%          8%            6%
Comm. AC           2%              3%           0%            0%
AC Trans Switch    3%              3%           3%            0%
Alarm Systems      7%              9%           3%            3%
Environ. Systems   1%              0%           3%            0%
Total              100%            100%         100%          100%


CONCLUSIONS

• The trigger, root cause, and equipment most associated with the root cause have been examined by outage impact for telecommunications power outages over an eight-year period

• This analysis has provided insights into these outages, and should be of interest to carriers, regulators, and Homeland Security

• There are two aspects of these results:
– Proactive
• Carrier industry adoption of NRIC Best Practices can go a long way toward preventing such outages
• Following best practices could have prevented 75% of the outages; for the other 25%, this could not be determined from the reports
– Reactive
• An emphasis on rectifier and generator recovery (e.g. spare parts, training, etc.) can help, as over half of high-impact outages are due to problems with these components


Case 4: SS7 Outages Assessment by Poisson Regression & RCA

"A Pre and Post 9-11 Analysis of SS7 Outages in the Public Switched Telephone Network" by Garima Bajaj, Andrew P. Snow, and Gary Weckman


Reliability Poisson Model for all FCC Large-Scale Reportable Outages over 10 Years

[Figure: jump point model vs. outage data, 1993–2005 (0–200 outages per year), with a constant rate before 9-11 and a lower constant rate after 9-11]


Outline

A. Telecom & Network Infrastructure Risk

B. Telecommunications Infrastructure

C. RAMS: Reliability, Availability, Maintainability and Survivability

D. Protection Level Assessment & Forecasting


Thank You!!