Reliability and Redundancy: How to Minimize Downtime and Maximize Service When System Components Fail

Roy McClellan and Dave Chapman
March 11, 2010



Overview

• “reliability” - equipment and network designs that are tolerant to failures and faults
– System should continue to operate with all specified features if any single system controller and/or critical device fails
– First level of backup should not reduce the operational capabilities of P25 digital trunked system
• This presentation will address how different P25 digital network elements impact reliability
– RFSS Core and Sites
– IP Network
– Transport Network

P25 Network Overview

[Diagram. Recoverable labels: P25 RFSS Core (File Servers, RFSS System Controllers, ISSI Media Gateways); IP Consoles & Control Rooms; Gateways; Other P25 RF Sub-Systems; Other Vendor P25 RF Sub-Systems; Legacy RF Sub-Systems; P25 Single Site Cell (Trunked Site Controller, P25 Base Stations); P25 Simulcast Cell (Trunked Simulcast Site Controller, P25 Base Stations).]

P25 Network Reliability – System Controllers

[Diagram: the P25 Network Overview figure, repeated under this title.]

P25 Network Reliability – Base Stations

[Diagram: the P25 Network Overview figure, repeated under this title.]

P25 Network Reliability - Site Network

[Diagram. The only recoverable label is "Optical Switch".]

P25 Network Reliability – Site Network Redundancy

• Components are cross-connected as shown in earlier diagram
– At sites with T1 microwave connections, if a path has multiple T1s, the T1s are distributed among the two routers
– Where newer “IP based” microwave is used, each router uses part of the bandwidth
– Devices with two Ethernet interfaces are cross-connected to two switches
• Dynamic routing protocols manage the path
– In event of disruption, a redundant path around the failure is dynamically selected
– Normal traffic routing resumes when the failed component becomes available
– Gateway Load Balancing Protocol (GLBP) utilized for balancing traffic to and from routers
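The path selection described above can be sketched in Python. This is a minimal illustration, not the routing implementation used in the system: the topology, the device names (`dev`, `sw1`, `rtr1`, `core`, and so on) and the breadth-first search standing in for a dynamic routing protocol are all assumptions made for the example.

```python
from collections import deque

# Hypothetical cross-connected site: one device with two Ethernet
# interfaces, two switches, two routers, and a core. Names are illustrative.
LINKS = [
    ("dev", "sw1"), ("dev", "sw2"),
    ("sw1", "rtr1"), ("sw1", "rtr2"),
    ("sw2", "rtr1"), ("sw2", "rtr2"),
    ("rtr1", "core"), ("rtr2", "core"),
]


def find_path(src, dst, failed=frozenset()):
    """Return a shortest path from src to dst that avoids failed nodes, or None."""
    if src in failed or dst in failed:
        return None
    # Build the adjacency map from links whose endpoints are both in service.
    neighbors = {}
    for a, b in LINKS:
        if a in failed or b in failed:
            continue
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


normal = find_path("dev", "core")
rerouted = find_path("dev", "core", failed={"sw1", "rtr1"})
restored = find_path("dev", "core")  # failed components back in service
```

With one switch and one router out of service, the search still finds a path through the remaining pair. Once the failure set is empty again, the original path is selected, mirroring the "normal traffic routing resumes" bullet.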

P25 Network Reliability – Trunked Network

[Diagram comparing conventional and trunked operation. Recoverable labels: Queue, Active, Inactive, Available Channels, Trunking Control.]

Trunked network is inherently more reliable

P25 Network Reliability – Sub-system Independence

[Diagram: the P25 Network Overview figure, repeated under this title.]

P25 Network Reliability – Failsoft Mode

• Subscribers will go into channel hunting mode and will operate in conventional mode if the radio fails to find a control channel
• Communication with primary dispatch centers supporting Failsoft interface is available

[Diagram: the P25 Network Overview figure, repeated under this title.]
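The Failsoft fallback described on this slide can be sketched as a small decision function. The scan list, the channel names and the preprogrammed fallback channel are hypothetical; actual subscriber behavior depends on the radio and its programming.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class RadioState:
    mode: str                 # "trunked" or "failsoft-conventional"
    channel: Optional[str]    # channel the radio settles on


def hunt(control_candidates: Iterable[str],
         decodable: Set[str],
         failsoft_channel: str) -> RadioState:
    """Scan candidate control channels; fall back to conventional operation.

    control_candidates: channels the radio is programmed to scan (assumed).
    decodable: channels on which a control channel is currently received.
    failsoft_channel: preprogrammed conventional fallback (assumed).
    """
    for ch in control_candidates:
        if ch in decodable:
            return RadioState("trunked", ch)
    # No control channel found anywhere: operate in conventional mode.
    return RadioState("failsoft-conventional", failsoft_channel)


scan_list = ["CCH-1", "CCH-2", "CCH-3"]
healthy = hunt(scan_list, {"CCH-2"}, "FS-1")
outage = hunt(scan_list, set(), "FS-1")
```

In the healthy case the radio stays trunked on the first control channel it can decode. In the outage case it exhausts the scan list and settles on the conventional fallback.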

P25 Network Reliability – Simulcast Reliability

[Diagram. Recoverable labels: Trunked Site Controller; P25 Simulcast Cell; three groups of P25 Base Stations marked Satellite, Master, Satellite; channels CCH, CA, CB.]

If the failed RF repeater is not a master, coverage on one channel degrades because that repeater is lost

P25 Network Reliability – Simulcast Reliability

[Diagram. Recoverable labels: Trunked Site Controller; P25 Simulcast Cell; three groups of P25 Base Stations marked Satellite, Master, Satellite; channels CCH, CA, CB.]

Trunked Site Controller re-locates the master RF repeater to a second master-capable site after a specified period of time

P25 Network Reliability – Simulcast Reliability

[Diagram. Recoverable labels: Trunked Site Controller; P25 Simulcast Cell; three groups of P25 Base Stations marked Satellite, Master, Satellite; channels CCH, CA, CB.]

Site controller re-assigns a new control channel on a channel having full RF coverage
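The three simulcast failure cases on these slides can be combined into one sketch. The data model (`Channel`, `handle_failure`, the site names A/B/C) is hypothetical, and the "specified period of time" before the master is relocated is not modeled. This illustrates the decision logic only and is not the site controller's algorithm.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Channel:
    name: str
    master_site: str
    master_capable: List[str]          # sites able to host the master repeater
    working: Dict[str, bool] = field(default_factory=dict)  # site -> repeater up?

    def full_coverage(self) -> bool:
        return all(self.working.values())


def handle_failure(channels: Dict[str, Channel], control: str,
                   failed_channel: str, failed_site: str):
    """Apply one repeater failure; return (control channel name, actions)."""
    ch = channels[failed_channel]
    ch.working[failed_site] = False
    actions = []
    if failed_site == ch.master_site:
        # Master lost: relocate to another master-capable site that is still up.
        spare = [s for s in ch.master_capable
                 if s != failed_site and ch.working.get(s)]
        if spare:
            ch.master_site = spare[0]
            actions.append(f"{ch.name}: master relocated to {spare[0]}")
    else:
        # Satellite lost: the channel keeps working with reduced coverage.
        actions.append(f"{ch.name}: coverage degraded at {failed_site}")
    if failed_channel == control:
        # Move the control channel to a channel that still has full coverage.
        for cand in channels.values():
            if cand.full_coverage():
                control = cand.name
                actions.append(f"control channel moved to {cand.name}")
                break
    return control, actions


sites = ["A", "B", "C"]
channels = {
    n: Channel(n, master_site="B", master_capable=["B", "C"],
               working={s: True for s in sites})
    for n in ("CCH", "CA", "CB")
}
control, log = handle_failure(channels, "CCH", "CCH", "A")
control2, log2 = handle_failure(channels, control, "CB", "B")
```

The first call loses a satellite repeater on the control channel, so the control channel moves to a channel with full coverage. The second call loses a master repeater on a traffic channel, so the master role moves to the other master-capable site.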

Transport Network Reliability – Network Design

• Rely on link layer techniques such as SONET & Resilient Packet Rings to protect against link failures
• Failure detection triggers routing reconvergence/restoration
• Restoration techniques can be enhanced to bypass the failed equipment before routing convergence (protection techniques)
– MPLS
– Sub-second convergence successfully achieved in unicast and multicast IP networks

SONET Network Diagram

[Diagram. Legend: Optical Switch; IP Path; SONET (Fiber or Microwave).]

SONET Network Diagram

[Diagram showing Automatic Loopback. Legend: Optical Switch; IP Path; SONET (Fiber or Microwave).]

Transport Network Reliability - SONET

• Ring topology - two sets of fiber strands are used, one for sending and receiving and the other as the spare set
– If a fiber cut occurs, the switch on either side of the break re-routes the traffic in the other direction using the backup ring
– Re-route happens at the physical layer, and no end devices are aware of the issue or need to take corrective action
– Failover takes <50 msec, with no noticeable impact to voice traffic
– Even IP routes do not typically update
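The ring behavior above can be sketched as follows. This is a simplified model with hypothetical node names: it shows only the resulting direction of travel around the ring, not the loopback performed by the switches on either side of the break, and it has no notion of timing.

```python
def ring_route(nodes, src, dst, cut=None):
    """Return the node sequence from src to dst on a ring.

    nodes: ring order, e.g. ["A", "B", "C", "D"]; the last node links to the first.
    cut:   optional pair of adjacent nodes whose span is broken.
    Traffic follows the working (forward) direction unless that direction
    crosses the cut, in which case it travels the other way around the ring.
    """
    n = len(nodes)
    i, j = nodes.index(src), nodes.index(dst)

    def walk(step):
        # Step around the ring from src until dst is reached.
        path, k = [nodes[i]], i
        while k != j:
            k = (k + step) % n
            path.append(nodes[k])
        return path

    def crosses(path):
        # True if any hop on the path uses the broken span.
        if cut is None:
            return False
        broken = frozenset(cut)
        return any(frozenset(pair) == broken for pair in zip(path, path[1:]))

    working = walk(+1)
    if not crosses(working):
        return working
    protect = walk(-1)
    return None if crosses(protect) else protect


ring = ["A", "B", "C", "D"]
before_cut = ring_route(ring, "A", "C")
after_cut = ring_route(ring, "A", "C", cut=("B", "C"))
```

The end nodes A and C are the same before and after the cut; only the spans in between change. This matches the slide's point that end devices need not notice the failure.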

Transport Network Reliability - MPLS

[Diagram. The slide carries the label "Altera Corporation".]

1a. Existing routing protocols establish the reachability of the destination networks
1b. Label distribution protocol (LDP) establishes label-to-destination network mappings
2. Ingress edge label switching router (LSR) receives a packet, performs layer-3 value-added services, and labels the packets
3. LSR switches the packet using label swapping
4. Egress edge LSR removes the label and delivers the packet
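Steps 2 through 4 above can be sketched in a few lines. The label values, router names and packet fields are hypothetical. In a real network the tables would be populated by the routing protocols and LDP from steps 1a and 1b, whereas here they are hard-coded.

```python
# Hypothetical label tables for a three-router path: ingress -> core -> egress.
FEC_TO_LABEL = {"10.1.0.0/16": 100}             # ingress: destination network -> label
SWAP_TABLE = {"core": {100: (200, "egress")}}   # LSR: in-label -> (out-label, next hop)


def ingress(packet):
    """Step 2: classify the packet by destination network and attach a label."""
    return {**packet, "label": FEC_TO_LABEL[packet["dest_net"]]}, "core"


def transit(router, packet):
    """Step 3: swap the label, using only the incoming label as the lookup key."""
    out_label, next_hop = SWAP_TABLE[router][packet["label"]]
    return {**packet, "label": out_label}, next_hop


def egress(packet):
    """Step 4: remove the label and deliver the original packet."""
    return {k: v for k, v in packet.items() if k != "label"}


pkt = {"dest_net": "10.1.0.0/16", "payload": "voice frame"}
labeled, hop = ingress(pkt)
swapped, hop2 = transit(hop, labeled)
delivered = egress(swapped)
```

Only the ingress router looks at the destination network. The transit router forwards on the label alone, which is what makes the switching step cheap.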

Transport Network Reliability – MPLS

• Multi Protocol Label Switching (MPLS) optimizes the traffic flow between critical network resources
• MPLS offers robust recovery framework that goes beyond protection rings of SONET/SDH
• MPLS meets the requirements of real-time applications with recovery times of less than 50 ms (comparable to SONET rings)
– One-to-one local protection
• MPLS-Traffic Engineering (TE) maintains separate backup paths for each Label Switch Path (LSP)
– Many-to-one local protection
• MPLS-TE maintains single backup path to protect a set of primary LSPs traversing the network
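The difference between the two protection styles above comes down to how many backup paths must be maintained. A minimal sketch, with hypothetical LSP and backup names, and no modeling of how the backups are signaled:

```python
def one_to_one(primary_lsps):
    """Assign a separate backup path to every primary LSP."""
    return {lsp: f"backup-for-{lsp}" for lsp in primary_lsps}


def many_to_one(primary_lsps, shared_backup="shared-backup"):
    """Assign one shared backup path to the whole set of primary LSPs."""
    return {lsp: shared_backup for lsp in primary_lsps}


lsps = ["lsp-1", "lsp-2", "lsp-3"]
dedicated = one_to_one(lsps)
shared = many_to_one(lsps)
# Number of distinct backup paths each scheme has to maintain.
backup_count = (len(set(dedicated.values())), len(set(shared.values())))
```

For three primary LSPs, one-to-one protection keeps three backup paths while many-to-one keeps a single shared one.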

Transport Network Reliability - Microwave

• Failure areas to watch for
• Where to build in redundancy
– Failure of redundant MW links will result in sites falling back to localized site trunking mode
– Every user within the coverage areas of affected sites will operate in the trunking mode with other users operating within the same cells
• How to configure non-redundant elements to maximize reliability
– Frequency diversity
– Path diversity
– Microwave Monitored Hot StandBy

Summary

• From our discussion today, you should understand that a distributed system that makes use of…
– Redundant components, Failsoft mode, sub-system independence and other recovery mechanisms
– Coupled with the inherent reliability of a trunked network
– A ring topology for rapid recovery from transport disruptions
– Along with the fault tolerance, adaptive routing, and disaster recovery capabilities of IP
• Results in a radio network that can survive multiple failures and continue to provide a level of communication to its users

Transport Network Reliability - MPLS

[Diagram. Recoverable labels: VPN A, VPN B, MPLS Backbone, Backhaul Network.]