location of the theme gray location of the theme blue zte proprietary on reliability of cots...

25
ZTE Proprietary On Reliability of COTS Hardware Dr. Li Mo Chief Architect, CTO Group

Upload: douglas-brian-higgins

Post on 21-Dec-2015

222 views

Category:

Documents


1 download

TRANSCRIPT

ZTE Proprietary

On Reliability of COTS Hardware

Dr. Li Mo

Chief Architect, CTO Group

Agenda

ZTE Proprietary

• Differences between “Telecom Hardware” and COTS

Hardware

• Analysis Framework

• Enhancing Application Reliability via Backups –

Theoretical• With one backup (1+1)

• With Different Types of Data Centers (type 1 – type 4)

• Remarks

© ZTE Corporation. All rights reserved.Page 3

ZTE Proprietary

Comparing “Telecom Hardware” and COTS (Commercial of the Shelf) Hardware

COTS Hardware

• May have smaller “mean time between failure” (MTBF)

• Relative smaller “mean time to repair” (MTTR)

• COTS procedures for software upgrade, patching, and maintenance contribute more to “scheduled down time”

• Different grade of reliability for data centers

• Strong fault detection and fault isolation capabilities at hardware level

• Well established traditions on software upgrade, patching, and maintenance

• Reliably Central Office assumed

Telecom Hardware

© ZTE Corporation. All rights reserved.Page 4

ZTE Proprietary

New Item to Consider for COTS – Site Down Time

COTS Hardware

• Site downtime (scheduled, non-scheduled) with varying duration and varying intervals

COTS Hardware

• May have smaller “mean time between failure” (MTBF)

• Relative smaller “mean time to repair” (MTTR)

• COTS procedures for software upgrade, patching, and maintenance contribute more to “scheduled down time”

• Different grade of reliability for data centers

Type 1 99.671%

Type 2 99.741%

Type 3 99.982%

Type 4 99.995%Type

s of

DC

Relia

bilit

y

© ZTE Corporation. All rights reserved.Page 5

ZTE Proprietary

Common Mechanisms to Improve Reliability – Application Level Backup

Master Server

Slave Server

(Backup Server)

Data Synchronization Benefits of Application Level Backup:

• Against failures• Facilitate maintenance and upgrade

Various Failures• Hardwar Failure• Hypervisor/OS Failure• Application Failure

Agenda

ZTE Proprietary

• Differences between “Telecom Hardware” and COTS

Hardware

• Analysis Framework

• Enhancing Application Reliability via Backups –

Theoretical• With one backup (1+1)

• With Different Types of Data Centers (type 1 – type 4)

• Remarks

© ZTE Corporation. All rights reserved.Page 7

ZTE Proprietary

Availability: P1

Availability: P2

COTS Server

VMVMVMVMVM

COTS Server

VMVMVMVMVM

vSwitch 1

vSwitch 2

System availability P1 * P2

© ZTE Corporation. All rights reserved.Page 8

ZTE Proprietary

Introducing COTS – Focusing on Server Part of Reliability

Providing 5 9s reliability with 4 9s availability per networking equipment

Is it possible to have servers with 4 9s (or even 3 9s) availability to provide overall 5 9s reliability?

© ZTE Corporation. All rights reserved.Page 9

ZTE Proprietary

Various Backup Schemes

One Backup

Master Server

Slave Server

Data Synchronization

Other Apps

Other Apps

Data Synchronization

Data Synchronization

Two Backups

Master Server

Slave Server

(Backup 1)

Slave Server

(Backup 2)

Other Apps

Other Apps

Other Apps

Data Synchronization

Load Sharing

Master Server

1

Master Server

N

N Backup Servers

Other Apps

Other Apps

Other Apps

Data SynchronizationMaster

Server 1

Slave Server 2

Other Applicat

ions

Master Server 2

Slave Server 1

Other Applicati

ons1:1 Load sharing

1:1, Same as One Backup Same as One Backup

Marginally Better

http://wwwen.zte.com.cn/endata/magazine/ztecommunications/2014/3/

Agenda

ZTE Proprietary

• Differences between “Telecom Hardware” and COTS

Hardware

• Analysis Framework

• Enhancing Application Reliability via Backups –

Theoretical• With one backup (1+1)

• With Different Types of Data Centers (type 1 – type 4)

• Remarks

© ZTE Corporation. All rights reserved.Page 11

ZTE Proprietary

A Simple Example – Markov State Transition Model

Legend:

MServer is goodM Server is badCurrent MasterM

λ

μ

M MS0 S11-λ 1- μ

Chapman – Kolmogorov Equation

orThe Result

© ZTE Corporation. All rights reserved.Page 12

ZTE Proprietary

S1

S 2

S1

S1

S 2

S2

S1

S 2

S0

S1

S 2

S3

μμ

μ

(1-λ)λ(1-s) (1-λ)λ

λλ

λ2+λ(1- λ)sLegend:

S

Server is goodS

Server is bad

Current MasterS

1+1 System – Markov State Transition Model

Inverse of Server MTBF

Inverse of Server MTTRSilent Error Probability

© ZTE Corporation. All rights reserved.Page 13

ZTE Proprietary

Solving the Global Balancing Equation for Getting overall System Availability (1+1)

© ZTE Corporation. All rights reserved.Page 14

ZTE Proprietary

1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10 10.5 11 11.5 12 12.5 13 13.5 14 14.5 15

-12

-10

-8

-6

-4

-2

0

1/100 1/1000 1/10000 1/100000

μ

System Unavailable Probability for various MTTR and server MTBF when in LOG scale

MTTR=12 min MTTR=6 min

© ZTE Corporation. All rights reserved.Page 15

ZTE Proprietary

Differences in Availability between Theoretical Data and Simulation for 1+1 Backup Case

MTTR=6 minutes

Not Much Difference

© ZTE Corporation. All rights reserved.Page 16

ZTE Proprietary

Improvement of 1+2 (dual backup) v.s. 1+1 (single backup)

Defining the “percentage” of improvement as

Improvement Deteriorates Fast with Silent Error

© ZTE Corporation. All rights reserved.Page 17

ZTE Proprietary

Revertive Maintenance

Data SynchronizationMaster

Server

Other Apps

Slave Server

Other Apps

Other Apps

DC A DC B

Master Server

Slave Server(Backup)

Other Apps

Other Apps

Other Apps

DC BIn Maintenance

DC A

After MaintenanceStart Reverting when No Fault

Before MaintenanceStart Maintenance when No Fault

© ZTE Corporation. All rights reserved.Page 18

ZTE Proprietary

The Impact of Site Maintenance is Negligible (Revertive)

Agenda

ZTE Proprietary

• Differences between “Telecom Hardware” and COTS

Hardware

• Analysis Framework

• Enhancing Application Reliability via Backups –

Theoretical• With one backup (1+1)

• With Different Types of Data Centers (type 1 – type 4)

• Remarks

© ZTE Corporation. All rights reserved.Page 20

ZTE Proprietary

Four Types of Data Centers (ANSI/TIA-942)

Type 1Single non-redundant distribution path serving the IT equipment. Non-redundant capacity components. Basic site infrastructure with expected availability of 99.671%.

Type 2 Meets or exceeds all type 1 requirements. Redundant site infrastructure capacity components with expected availability of 99.741%.

Type 3

Meets or exceeds all type 2 requirements. Multiple independent distribution paths serving the IT equipment. All IT equipment must be dual-powered and fully compatible with the topology of a site's architecture. Concurrently maintainable site infrastructure with expected availability of 99.982%.

Type 4

Meets or exceeds all type 3 requirements. All cooling equipment is independently dual-powered, including chillers and heating, ventilating and air-conditioning (HVAC) systems Fault-tolerant site infrastructure with electrical power storage and distribution facilities with expected availability of 99.995%

© ZTE Corporation. All rights reserved.Page 21

ZTE Proprietary

1+1 System with Site Error – Markov State Transition Model (Revertive)

SS

S0

SS

S2

μ

-(1-λ)λs+2(1-λ)λ

SS

S1

μ

λ

λ2 +λ(1- λ)s

S6S

S

SS

S3

SS

S5

μ

-(1-λ)λs+2(1-λ)λ

SS

S4

μ

λ

λ2 +λ(1- λ)s

η(1-η)

γ

2η(1-η)

η

η2

2η(1-η)η(1-η)

γ

γ

γ

Both Data Center Work

Only One Data Center Works

© ZTE Corporation. All rights reserved.Page 22

ZTE Proprietary

Service Impact Error Probability for Various Data Centers

1x10-5

1/(DC MTBF)

Agenda

ZTE Proprietary

• Differences between “Telecom Hardware” and COTS

Hardware

• Analysis Framework

• Enhancing Application Reliability via Backups –

Theoretical• With one backup (1+1)

• With Different Types of Data Centers (type 1 – type 4)

• Remarks

© ZTE Corporation. All rights reserved.Page 24

ZTE Proprietary

Remarks

COTS Hardware

• May have smaller “mean time between failure” (MTBF)

• Relative smaller “mean time to repair” (MTTR)

• COTS procedures for software upgrade, patching, and maintenance contribute more to “scheduled down time”

• Different grade of reliability for data centers

• It is possible to provide 5 9 availability with COTS hardware with application level backup

• The Impact of MTTR is not significant if it is reasonably small (e.g. less than 10 minutes) for typical hardware MTBF

• The impact of data center scheduled maintenance is negligible

• 5 9 availability with can only be achieved via type 3 and type 4 data centers

[email protected]

© ZTE Corporation. All rights reserved.

Thanks!

Bring Network Closer