white paper: building robust signaling networks – meeting the challenges of the rising signaling...

MEETING THE CHALLENGES OF THE RISING SIGNALING STORM

Distributed signaling network robustness that follows the concept of three protection lines provides

operators with a robust and scalable network architecture beyond the capabilities of existing

overload protection mechanisms at node level. Applying this concept enables mobile networks to

handle the growth in signaling without unnecessary over-dimensioning, and ensures service delivery

to consumers even in cases of heavy signaling load, node failures and malicious activities that lead

to signaling storms.

Ericsson White Paper | Uen 284 23-3268 | July 2015

Building Robust Signaling Networks

Introduction

Today’s consumers increasingly expect high availability from communication and data services.

In this environment, network failure scenarios can trigger a massive amount of signaling – a

signaling storm – caused by automatic reconnection requests from multiple connected devices.

Robust and scalable network solutions are therefore required to optimize operator revenue and

maximize the consumer experience. This white paper discusses best practices, and provides

recommendations for building highly scalable and robust signaling networks.

A robust and distributed signaling network provided by the concept of three protection lines

will be introduced as the recommended network architecture. This concept provides a scalable

and robust signaling network beyond the capabilities of existing network protection mechanisms

at node level. The principles presented in this paper are valid for Signaling System #7-based and

Diameter-based signaling. However, the paper will focus solely on Diameter signaling, which is

an important control protocol for LTE networks and IMS.

Challenges for the signaling network

Current developments in telecommunication technologies and markets stress the importance of

having flexible and robust congestion control mechanisms in order to maximize performance and

service availability.

The traditional approach of protecting individual nodes has been around since the introduction

of GSM – overload in the network is addressed with dedicated protection mechanisms in overloaded

nodes. Standardization bodies promoted these protection mechanisms, which became widely

adopted in the industry. These mechanisms served their purpose successfully, until they were

confronted with an increased complexity of mobile networks and new usage scenarios not considered

in earlier specifications.

Nowadays, a constellation of different network access technologies, such as 2G, 3G, LTE, Wi-Fi

and fixed, coexist to provide seamless access to voice and data services. The huge penetration of

smartphones has dramatically increased data consumption and bandwidth requirements, and

smartphone subscriptions will more than double from now until 2020 [1]. The emergence of the

Internet of Things (IoT) means networks must simultaneously face up to new usage scenarios, which

in some cases drastically increase signaling demands. In addition, these connected devices can

be a source of signaling storms in cases of simultaneous connection requests after network

disturbances.

The continuous modernization of operator networks, with increasing centralization of resources

in higher-capacity systems, implies that signaling storms have a bigger impact on the networks, as

incidents on centralized resources are likely to affect a larger number of users. Affected users will

perform reattempts, thereby initiating a

snowball effect that could overload the

signaling network multiple times.

As Figure 1 summarizes, signaling

networks therefore face significant

challenges. As a result, operators are

demanding new overload protection

mechanisms to enable them to provide

the generally accepted five nines (99.999

percent), or even greater, network availability.

The journey that the industry has

initiated toward cloud computing, with

the transformation of the current network

nodes into virtualized network functions

(VNFs), seems at first glance to be a

handy solution to cope with signaling

storms in the network. The reality,

however, is quite different, as this

approach may generate an illusion of

infinite resources and lead to

underestimation of the importance of

overload protection mechanisms. Scale-out

mechanisms in the cloud will provide

increased flexibility to handle steady

growth, but they would not be fast

enough to cope with sudden signaling

peaks, which escalate more quickly than

the network functions can handle.

[Figure 1: Challenges for the signaling network. Nodes shown: WLAN, fixed and 2G/3G/LTE access, SGSN-MME, PGW, P-CSCF, DSC, HSS, OCS and PCRF. Callouts: network complexity (increasing network complexity requires a higher amount of signaling); data traffic growth (growth in data traffic implies more signaling traffic); centralization of resources (node outages affect more subscribers); multiple access technologies (subscribers move instantly between access technologies); Internet of Things (more connected devices impose new traffic patterns upon the signaling network); smartphone subscriptions (the number of smartphone subscriptions is constantly increasing).]

Under these circumstances, it is clear that the protection strategies that have been standardized

and widely adopted in mobile networks are no longer sufficient. High-traffic peaks and network

failure scenarios can result in massive signaling storms, leading to lengthy outages of network

services.

In its Annual Incident Reports 2012, ENISA (the European Union Agency for Network and

Information Security) reports that there were 79 major telecom outages in 2012 [2]. “System failures”

were the root cause of 75 percent of these incidents. Each overload-related incident affected an

average of around 9.4 million user connections. One of the biggest effects of network outages and

service degradation is the increase in the rate of subscriber churn.

At the same time, operators spend USD 15 billion a year to overcome network outages and service

degradations [3]. On average, operators spend 1.5 percent of their annual revenues on dealing with

these issues. Some even estimate this figure to be as high as 5 percent.

One strategy to mitigate overload problems is to over-dimension the network for the peak signaling

load. This adds complexity and implies

higher opex and capex, leading to an

additional financial burden that puts an

operator in a less competitive position.

On the other hand, over-dimensioning is a

feasible way to absorb signaling

peaks that are two to three times above

the average traffic load. But when a

signaling storm occurs, the load can

easily increase 10 times above the

average, stretching the need for

over-dimensioning to unrealistic limits. In

addition, over-dimensioning the network

does not eliminate the risk of a reduction

in the overall signaling capacity. A typical

node behavior during overload is depicted

in Figure 2. Up to the engineered capacity,

the message throughput is in line with the

offered traffic, and it still increases slightly

when overload is reached. In cases of

massive overload, however, the

throughput drops heavily with a further

increase in the offered load.

Another strategy to address overload

is to blindly reject signaling messages.

The problem with this strategy is that the

throughput of successful services

delivered to the consumer is heavily

reduced, even if only a small percentage

of messages are rejected, as depicted in

Figure 3. For instance, a successful VoLTE

call requires the successful processing of

roughly 20 signaling sequences. If even

one message in these sequences is rejected,

the call will be discarded.
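
The effect can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that each of the roughly 20 signaling sequences is rejected independently and with the same probability. That is a simplification introduced here for illustration, not a measured model, and the function name is hypothetical.

```python
def service_success_rate(signaling_success_rate: float,
                         sequences_per_service: int = 20) -> float:
    """Probability that every signaling sequence of one service succeeds.

    Assumption of this sketch: rejections are independent and equally likely
    for every sequence, so the individual success rates simply multiply.
    """
    if not 0.0 <= signaling_success_rate <= 1.0:
        raise ValueError("signaling_success_rate must be between 0 and 1")
    return signaling_success_rate ** sequences_per_service


# Compare a few signaling success rates for a service needing 20 sequences.
for rate in (1.00, 0.99, 0.95, 0.90):
    print(f"signaling {rate:.0%} -> service {service_success_rate(rate):.1%}")
```

Under these assumptions, a 5 percent rejection rate on the signaling level leaves only about 36 percent of the calls successful, which is the shape of the curve in Figure 3.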

The recommended strategy is to aim

for a robust and scalable network

architecture. The chosen network

architecture and protection mechanisms

need to be capable of handling the growth

in signaling without unnecessary over-dimensioning,

as well as ensuring service

delivery to consumers during cases of

heavy signaling load, node failures and

malicious activities that lead to signaling

storms.

[Figure 2: Processed throughput in relation to offered load. Axes: offered load (normal operation up to the engineered capacity, then overload) versus processed throughput. Curve shown: with standard node-level overload control.]

[Figure 3: Success rate of the signaling traffic in relation to the consumer service delivery. Axes: success rate of the signaling traffic versus success rate of the consumer service delivery, each up to 100%. A 100% success rate of the signaling traffic corresponds to a 100% success rate of the consumer service delivery. A small reduction in the signaling success rate leads to a huge reduction in consumer service delivery, and the effect grows with the number of messages needed per service delivery.]

End-to-end strategy and principles to achieve robust signaling networks

The objectives for a robust signaling network are:

> to reduce the network impact of smartphone and device signaling

> to maximize throughput in cases of overload

> to recover quickly from overload and failure scenarios

> to keep the signaling network maintainable

> to keep the signaling network scalable.

A robust and distributed signaling network should be based on the following end-to-end strategy

and principles:

> a careful network architecture that provides the basis for a robust and scalable signaling network

> optimization of the signaling traffic to minimize the amount of signaling needed to manage the network services

> a distributed and coordinated overload protection mechanism across several network elements

to maximize the throughput in peak load scenarios.

NETWORK ARCHITECTURE

The signaling network architecture should favor simplicity, using only as much infrastructure

and as many features as are needed to keep the network manageable.

The signaling network should be divided into manageable smaller components. Modular

network design enables operators to isolate problems within a module, while the rest of the

network continues to function. This means fewer users are affected and the overall uptime of the

network is increased.

The basic protection against physical failure of the transport plane is redundancy: redundant

components and more than one physical path through the transport network from the source

client to the destination client. QoS can be used to prevent control-plane failure.

Best performance is achieved in overload or failure situations when the overload protection is

distributed across several network elements. Each network element on its own should be

redundant. The availability of the network elements can be further enhanced when redundant

node types are deployed in different geographical places.

Scalability should be supported on node and network level with the aim of:

> managing the growth in signaling traffic efficiently

> being flexible to extend an established network configuration

> having a long-term strategy to resolve overload cases.

OPTIMIZATION

One aim of optimization is to minimize signaling and thereby reduce the network impact

of smartphone and device signaling. Recommended ways to achieve this are:

> to reduce paging traffic for non-time-critical traffic by paging the last known location first,

before the paging request is extended to other parts of the network

> to limit the effect of shorter LTE idle timers in the user equipment (UE), which increase the

number of connection setup requests, by performing authentication only at every 10th or 20th

connection setup

> to drop or reject excess traffic from misbehaving UE

> to drop or reject traffic from malicious attacks.
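
The authentication recommendation in the list above can be sketched as a simple per-UE counter. This is a minimal illustration, not an MME implementation: the class name, the UE identifier strings and the choice to always authenticate the first setup seen from a UE are assumptions made here.

```python
class SelectiveAuthenticator:
    """Decide per UE whether a connection setup triggers authentication.

    Hypothetical helper: the first setup seen from a UE is authenticated,
    and after that only every `interval`-th setup (e.g. every 10th or 20th).
    """

    def __init__(self, interval: int = 10) -> None:
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self.interval = interval
        self._setups_since_auth = {}  # UE id -> setups since last authentication

    def should_authenticate(self, ue_id: str) -> bool:
        count = self._setups_since_auth.get(ue_id)
        # Unknown UE, or the interval has elapsed: authenticate and restart.
        if count is None or count + 1 >= self.interval:
            self._setups_since_auth[ue_id] = 0
            return True
        self._setups_since_auth[ue_id] = count + 1
        return False


auth = SelectiveAuthenticator(interval=10)
decisions = [auth.should_authenticate("ue-001") for _ in range(21)]
# Positions of the setups that were authenticated.
print([i for i, authenticate in enumerate(decisions) if authenticate])
```

With an interval of 10, nine out of ten connection setups skip the authentication exchange, which removes the corresponding transactions toward the HSS.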

A second aspect of optimization is to distribute the load evenly in the network. One recommendation

is to build an appropriate structure with pooled network resources for easy capacity expansion

and even load distribution. This allows for a much more efficient use of network resources.
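
One simple way to keep the load even across a pool is to hand each new transaction to the member with the lowest relative utilization. The sketch below illustrates that idea only. The node names, the capacity figures and the use of in-flight transactions as the load measure are assumptions, not taken from a product.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PoolMember:
    """One node in a pool; names and capacities are hypothetical."""
    name: str
    capacity: int        # assumed number of transactions the node can hold in flight
    in_flight: int = 0   # transactions currently assigned to the node

    @property
    def utilization(self) -> float:
        return self.in_flight / self.capacity


def pick_member(pool: Sequence[PoolMember]) -> PoolMember:
    """Return the member with the lowest relative utilization.

    Ties go to the member listed first, because min() keeps the first minimum.
    """
    if not pool:
        raise ValueError("pool is empty")
    return min(pool, key=lambda member: member.utilization)


def assign(pool: Sequence[PoolMember]) -> PoolMember:
    """Assign one new transaction to the least-utilized member."""
    member = pick_member(pool)
    member.in_flight += 1
    return member


pool = [PoolMember("mme-1", 100), PoolMember("mme-2", 100), PoolMember("mme-3", 200)]
for _ in range(200):
    assign(pool)
print({member.name: member.in_flight for member in pool})
```

In this sketch the larger node ends up with twice as many transactions as each smaller node, so the relative load stays even. Adding a member to the list is all that is needed to expand capacity.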

DISTRIBUTED AND COORDINATED OVERLOAD CONTROL

Each network element should provide a working overload control mechanism to prevent its own

resources from being overloaded. This is also an essential function for pooled resources, as there

would otherwise be a risk that a peak in the signaling traffic could bring down one pooled device

after the other.

At network level, overload should be handled as close to the overload source as possible to

minimize recovery time. Propagation of a signaling peak through the network must be prevented

at all costs in order to avoid a service outage on a larger scale.

An example of how a signaling peak propagates is described below.

After a major network outage, which could be the result of a transport network or Mobility

Management Entity (MME) failure, a large number of UEs will discover the network and try to

reattach to it. This multitude of reattach attempts causes a signaling storm, which will affect large

parts of the network. Typical signaling scenarios under such circumstances are:

> The MME completes the authentication process, updates the location in the Home

Subscriber Server (HSS), and reestablishes the bearers.

> The serving gateway must also recreate the bearers.

> The packet data network gateway (PDN-GW) recreates the bearers and reestablishes

sessions to, for example, the policy and charging rules function (PCRF).

> New full IMS registration is needed for VoLTE UE.

> The HSS needs to provide authentication information to the MME and register the location

(additional transactions are needed for IMS registration for VoLTE subscribers).

During overload, the signaling throughput can be optimized by intelligent traffic prioritization.

Ways of achieving intelligent traffic prioritization are given below.

> Adding some application logic to the traffic management function of dedicated nodes in the

network can enable the nodes to determine whether a signaling message belongs to a new

subscriber transaction or an ongoing subscriber transaction. In cases of overload, the

message triggering a new subscriber transaction will be rejected in favor of the message

related to an ongoing subscriber transaction. This will optimize the throughput at the

application level.

> Signaling traffic in overload situations can be prioritized by importance, with

emergency calls and priority services handled ahead of delay-tolerant access, such as energy

meters.

> An overload protection system should be adaptive to the current situation and allow higher

throughput when the overload eases, and throttle more of the traffic as the overload gets

worse. Fixed-rate throttling limits cannot meet the requirements of different situations: in

some overload scenarios, the network can handle more traffic, and in others much less.

> The throughput of the system under overload can be optimized by the concept of

throughput elasticity, where the latency is allowed to increase. However, it is important that

the maximum available latency budget is never exceeded on an end-to-end level.
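
The first two prioritization rules in the list above can be sketched as a small admission function. The priority ordering, the message fields and the per-interval capacity parameter are assumptions of this sketch. A real node would derive them from its own application logic and from the current load.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Sequence, Tuple


class Priority(IntEnum):
    """Lower value means more important; the ordering is an assumption."""
    EMERGENCY = 0       # emergency calls and priority services
    ONGOING = 1         # message of a subscriber transaction already in progress
    NEW = 2             # message that would start a new subscriber transaction
    DELAY_TOLERANT = 3  # new transaction from delay-tolerant access, e.g. a meter


@dataclass(frozen=True)
class Message:
    subscriber: str
    starts_transaction: bool
    emergency: bool = False
    delay_tolerant: bool = False


def classify(msg: Message) -> Priority:
    if msg.emergency:
        return Priority.EMERGENCY
    if not msg.starts_transaction:
        return Priority.ONGOING
    return Priority.DELAY_TOLERANT if msg.delay_tolerant else Priority.NEW


def admit(messages: Sequence[Message],
          capacity: int) -> Tuple[List[Message], List[Message]]:
    """Split one interval's messages into (admitted, rejected).

    `capacity` is how many messages can still be processed in this interval;
    an adaptive system would recompute it as the overload eases or worsens.
    """
    capacity = max(0, capacity)
    ranked = sorted(messages, key=classify)  # stable: arrival order kept per class
    return ranked[:capacity], ranked[capacity:]


burst = [
    Message("meter-7", starts_transaction=True, delay_tolerant=True),
    Message("alice", starts_transaction=True),
    Message("bob", starts_transaction=False),
    Message("carol", starts_transaction=True, emergency=True),
]
admitted, rejected = admit(burst, capacity=2)
print([m.subscriber for m in admitted], [m.subscriber for m in rejected])
```

Rejecting the new transactions first means that the capacity which remains is spent on completing services that have already consumed signaling resources.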

Distributed signaling network robustness – the concept of three protection lines

The principles to achieve a robust

signaling network are best represented

in the network architecture depicted in

Figure 4.

A robust and distributed signaling

network should follow the concept of

three protection lines to protect the

operator’s service offering from being

affected by a signaling storm.

The first protection line comprises the

components that act as entry points to

the core network for smartphone and

device signaling. Examples are the Serving

GPRS Support Node – Mobility

Management Entity (SGSN-MME), the

proxy call session control function

(P-CSCF) and the PDN-GW.

The second line of protection consists

of the nodes providing routing

capabilities for the signaling traffic. It is

typically represented by a Diameter

Signaling Controller (DSC) or a Signaling

Transfer Point.

The third line of protection is

represented by the end systems hosting

the application data and logic. User data

management systems such as the HSS

are assigned to this third line of

protection.

The three lines of protection provide:

> distributed network architecture to

allow for an end-to-end overload protection solution in distributed layers

> the ability to compensate in the next, higher protection line for failures, misconfiguration

or misoperation in the preceding protection line

> maximized signaling throughput during overload conditions

> scalable and maintainable network architecture

> efficient use of network resources by distributing the signaling load evenly in the network

> optimized signaling procedures to minimize the signaling traffic.

[Figure 4: Network picture of distributed signaling network robustness – the concept of three protection lines. Nodes shown: SGSN-MME, P-CSCF and PGW (1st line), DSC (2nd line), HSS, OCS and PCRF (3rd line). Callouts: distributed overload protection (end-to-end overload protection solution; if one layer fails, another can take over; backup always available), scalability and maintainability (scalable and maintainable network architecture), optimize signaling (minimize signaling traffic; maximize throughput under overload conditions; use network resources efficiently; distribute load evenly).]

FIRST PROTECTION LINE

The SGSN-MME, as part of the Evolved Packet Core (EPC), is an example of the first line of

protection.

> Representing the entry point to the core network, the EPC is closest to potential overload

sources outside of the core network. The most efficient way to minimize the recovery time is

to apply optimization and overload protection of the signaling traffic in the first protection line.

> Nodes in the EPC aim to optimize the signaling. One strategy is to perform smart and adaptive

paging to minimize the number of paging requests. A second strategy is to limit excessive

signaling from specific UEs.

> Nodes in the EPC should have a proven overload protection function that shields the nodes

from failure and ensures high throughput during extreme overload. This is achieved by proper

prioritization of services and subscribers, and distinguishing between initial traffic for a

subscriber and subsequent signaling.

SECOND PROTECTION LINE

An example of the second line of protection is the DSC hosting the function of a Diameter Agent

as specified in RFC 6733 (Diameter Base Protocol).

> A Diameter Agent simplifies the network architecture and reduces the number of connections

that need to be maintained in the network. The DSC acts as a centralized signaling router in

the network.

> The Diameter Agent itself should provide a proven overload protection mechanism and be

able to cope with offered traffic several times higher than the dimensioned capacity of the

node in cases where the first protection line cannot sufficiently limit a traffic peak.

> The Diameter Agent should provide load balancing capabilities toward other interfacing

Diameter peers. This ensures optimal usage of network elements and reduces load peaks on

dedicated Diameter servers such as the PCRF, the online charging system, and the HSS,

which typically reside in the third protection line.

    > The load balancing function should consider the varying capacity of the interfacing

    Diameter peers.

    > The load balancing function could steer the preference for which Diameter peers are

    primarily used as the routing target, and thereby opt for local servers over remote servers.

    > Nevertheless, the load balancing function could be adjusted dynamically based on the

    actual traffic sent, which would actively reduce traffic peaks that would otherwise be sent

    to already overloaded Diameter peers.

> The Diameter Agent should provide traffic shaping capabilities for interfacing Diameter servers

or clients. This actively prevents the propagation of signaling storms in the network.

    > By traffic shaping outgoing Diameter traffic, the Diameter Agent prevents overloading

    interfacing servers. It thus offers further protection to peers against potential signaling

    bursts from nodes inside the network or from roaming partners.

    > By traffic shaping incoming Diameter traffic, the Diameter Agent prevents Diameter clients

    from abusively using network resources beyond operator-determined limits. A prime

    example is a restarting MME node that could flood the HSS servers, degrading

    the services of all other MME nodes.

> Typically, a Diameter Agent has very limited knowledge of the semantics of Diameter

applications. To prioritize traffic in an intelligent way, the Diameter Agent could be configured

with semantically relevant data, so that in congestion situations, messages of ongoing sessions

can be treated with higher priority than messages related to new sessions. This adds the

capability to perform application-aware traffic management in the Diameter Agent.

> The Diameter Agent is also in control of all Diameter interfaces in a network. An overload

indication on one interface can lead to a reduction of traffic on another interface. For example,

when a charging server becomes overloaded, the DSC can throttle initial attach

messages. This reduces the number of new subscribers attaching, following the principle of

addressing overload as close to its source as possible.
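
Traffic shaping of the kind described in the list above is commonly built on a token bucket. The sketch below shows that general technique with one bucket per Diameter peer. It is not a description of how any particular DSC implements shaping, and the rate, the burst size and the peer name are hypothetical operator settings.

```python
class TokenBucket:
    """Token-bucket shaper for the traffic of one Diameter peer (illustrative)."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate_per_s   # sustained messages per second allowed
        self.burst = burst       # largest burst admitted at once
        self.tokens = burst      # start with a full bucket
        self.last = now          # time of the previous refill, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a message arriving at time `now` may be forwarded."""
        # Refill for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: reject or queue, as the operator decides


# One shaper per client peer; limits are assumed operator-determined values.
shapers = {"mme-restarting": TokenBucket(rate_per_s=10, burst=5)}

# A restarting MME sends 50 requests at once; only the burst allowance passes.
forwarded = sum(shapers["mme-restarting"].allow(now=0.0) for _ in range(50))
print(f"forwarded {forwarded} of 50 requests toward the HSS")
```

The same class can be applied in the outgoing direction by keeping one bucket per server, so that a burst from inside the network or from a roaming partner is smoothed before it reaches the HSS, the PCRF or the charging system.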

THIRD PROTECTION LINE

User data management systems and, in particular, user databases sit at the end of the signaling

chain. During a network overload scenario, user databases will therefore naturally be under

pressure, likely becoming the first overloaded component in the network. On the other hand,

user databases are the heart of telecommunications networks in the sense that they are vital to

deliver consumer services, and a failure or degradation of the user database performance might

compromise the complete network. In the user data management space, 3GPP standardized a


data-layered architecture named User Data Convergence [4]. In this architecture, traditional

network databases in the core network – such as the home location register (HLR), the HSS,

authentication, authorization and accounting (AAA), and the policy controller – are split into

Application Front Ends (AFEs), which handle the business logic, and a user data repository (UDR),

which takes the role of the user database storing the user data.

In order to minimize the severity and duration of overload incidents, user data management

systems should:

> secure their availability and user data integrity at all times, no matter how severe the overload

incident might be. For that purpose, overload protection functionalities are a must for both

AFEs and UDRs.

> maximize end-to-end useful throughput during overload. Different complementary strategies

can be applied, such as:

    > ensuring throughput elasticity during overload with adequate latency tradeoff

    > applying intelligent traffic throttling through cooperative load regulation in the user

    data management systems (Front End (FE) and UDR).

As shown in Figure 2, a

system will typically experience

throughput degradation with increasing

levels of overload, and databases are

no exception. User data

management systems require a more

intelligent throttling of excess traffic

during overload to ensure that

resources from the UDR are fully

utilized to process useful end-to-end

traffic. The cornerstone of this

mechanism is the cooperation between

the UDR and the AFEs in the network,

such as the HLR, the HSS and AAA.

The UDR should constantly monitor

its resource utilization levels, such as

the response time and length of the

buffers. As soon as any of the

resources reaches its limit, the situation

is reported back to the AFEs as an

overload indication.

Based on the overload indication sent

from the UDR, the AFEs should throttle

the traffic according to the following

principles:

> The most important signaling messages are not throttled.

> Ongoing operations that involve several messages to the UDR are prioritized over new

operations. Typically, a Mobile Application Part (MAP)/Diameter operation received in the AFE

involves several messages sent to the UDR.

The throttling level should be continuously adjusted based on a dynamic and real-time feedback

loop. This ensures the UDR always performs to its maximum capacity and avoids the throughput

degradation caused by devoting resources to rejecting the excess traffic.
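
The feedback loop can be sketched as a small controller on the AFE side. The paper does not specify a control law, so the sketch swaps in a common one: back off multiplicatively on an overload indication and recover additively otherwise. The class name, the parameter values and the rule that only new operations are throttled are assumptions for illustration.

```python
class AdaptiveThrottle:
    """AFE-side throttle driven by overload indications from the UDR (sketch)."""

    def __init__(self, decrease: float = 0.5, increase: float = 0.05,
                 floor: float = 0.05) -> None:
        self.decrease = decrease    # multiplicative back-off on an overload indication
        self.increase = increase    # additive recovery while no overload is reported
        self.floor = floor          # new operations are never throttled below this share
        self.admit_fraction = 1.0   # share of new operations currently admitted
        self._credit = 0.0          # accumulator that spreads admissions evenly

    def on_feedback(self, udr_overloaded: bool) -> float:
        """Adjust the throttling level from one UDR feedback report."""
        if udr_overloaded:
            self.admit_fraction = max(self.floor, self.admit_fraction * self.decrease)
        else:
            self.admit_fraction = min(1.0, self.admit_fraction + self.increase)
        return self.admit_fraction

    def admit(self, new_operation: bool, essential: bool = False) -> bool:
        """Decide whether the AFE forwards the messages of one operation."""
        # The most important messages and ongoing operations are never throttled.
        if essential or not new_operation:
            return True
        self._credit += self.admit_fraction
        if self._credit >= 1.0:
            self._credit -= 1.0
            return True
        return False


throttle = AdaptiveThrottle()
throttle.on_feedback(udr_overloaded=True)  # UDR reports full buffers
accepted = sum(throttle.admit(new_operation=True) for _ in range(10))
print(f"admit fraction {throttle.admit_fraction}, accepted {accepted} of 10 new operations")
```

Because the excess traffic is turned away at the AFE, the UDR does not spend its own resources on rejecting it, and the admitted share rises again step by step once the overload indications stop.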

Figure 5 shows the expected behavior of a user data management system with cooperative

load regulation between the AFEs and the UDR, and with throughput elasticity during overload,

compared with a standard system.

[Figure 5: Overload performance behavior of a user data management system (AFE+UDR) with cooperative load regulation. Axes: offered load (MAP, Diameter), from normal operation up to the engineered capacity and then into overload, versus processed throughput (MAP, Diameter). Curves: with AFE-UDR intelligent throttling, and with standard node-level UDR overload control. Marked levels: maximum throughput, engineered capacity, dimensioned capacity.]

Conclusion

Operators are confronted with increasing complexity in mobile networks. At the same time,

signaling traffic is continuously increasing, driven by the growth of smartphone traffic and new

usage scenarios presented by the IoT.

The performance and availability of the signaling network are essential for service delivery to

the customer. Existing overload protection mechanisms that only focus on dedicated overloaded

nodes cannot prevent larger-scale outages. Any outages in the signaling network will lead to

service interruptions, causing financial losses and increasing the risk of subscriber churn.

A robust and distributed signaling network that follows the concept of three protection lines

provides an end-to-end overload protection solution that fulfills the objectives of a robust signaling

network. A robust and scalable network architecture, optimization of the signaling traffic, and a

distributed and coordinated overload protection mechanism boost the availability of the signaling

network beyond today’s measures.

References

[1] Ericsson, February 2015, Ericsson Mobility Report, Mobile World Congress Edition, available at:

http://www.ericsson.com/res/docs/2015/ericsson-mobility-report-feb-2015-interim.pdf

[2] ENISA (European Union Agency for Network and Information Security), August 2013, Annual

Incident Reports 2012, available at:

http://www.enisa.europa.eu/activities/Resilience-and-CIIP/Incidents-reporting/annual-reports/annual-incident-reports-2012-1/annual-incident-reports-2012/at_download/fullReport

[3] Heavy Reading, October 2013, Mobile Network Outages & Service Degradations: A Heavy

Reading Survey Analysis, available at:

http://www.heavyreading.com/details.asp?sku_id=3103&skuitem_itemid=1524&promo_code=&aff_code=&next_url=%2Flist.asp%3Fpage_type%3Dall_reports

[4] 3GPP TS 23.335, accessed June 2014, User Data Convergence (UDC); Technical realization and

information flows; Stage 2, available at:

http://www.3gpp.org/DynaReport/23335.htm

Glossary

AAA authentication, authorization and accounting

AFE Application Front End

DSC Diameter Signaling Controller

ENISA European Union Agency for Network and Information Security

EPC Evolved Packet Core

FE Front End

HLR home location register

HSS Home Subscriber Server

IMS IP Multimedia Subsystem

IoT Internet of Things

MAP Mobile Application Part

MME Mobility Management Entity

OCS Online Charging System

PCRF policy and charging rules function

P-CSCF proxy call session control function

PDN-GW packet data network gateway

SGSN Serving GPRS Support Node

Signaling network Part of a telecommunications network carrying signaling traffic

UDR user data repository

UE user equipment

VNF virtualized network function

© 2015 Ericsson AB – All rights reserved
