white paper: building robust signaling networks – meeting the challenges of the rising signaling...
MEETING THE CHALLENGES OF THE RISING SIGNALING STORM
Distributed signaling network robustness that follows the concept of three protection lines provides
operators with a robust and scalable network architecture beyond the capabilities of existing
overload protection mechanisms at node level. Applying this concept enables mobile networks to
handle the growth in signaling without unnecessary over-dimensioning, and ensures service delivery
to consumers even in cases of heavy signaling load, node failures and malicious activities that lead
to signaling storms.
Ericsson White Paper
Uen 284 23-3268 | July 2015
Building Robust Signaling Networks
BUILDING ROBUST SIGNALING NETWORKS • INTRODUCTION 2
Introduction
Today’s consumers increasingly expect high availability from communication and data services.
In this environment, network failure scenarios can trigger a massive amount of signaling – a
signaling storm – caused by automatic reconnection requests from multiple connected devices.
Robust and scalable network solutions are therefore required to optimize operator revenue and
maximize the consumer experience. This white paper discusses best practices, and provides
recommendations for building highly scalable and robust signaling networks.
This paper introduces a robust and distributed signaling network, built on the concept of three
protection lines, as the recommended network architecture. This concept provides a scalable
and robust signaling network beyond the capabilities of existing network protection mechanisms
at node level. The principles presented in this paper are valid for Signaling System #7-based and
Diameter-based signaling. However, the paper will focus solely on Diameter signaling, which is
an important control protocol for LTE networks and IMS.
Challenges for the signaling network
Current developments in telecommunication technologies and markets stress the importance of
having flexible and robust congestion control mechanisms in order to maximize performance and
service availability.
The traditional approach of protecting individual nodes has been around since the introduction
of GSM – overload in the network is addressed with dedicated protection mechanisms in overloaded
nodes. Standardization bodies promoted these protection mechanisms, which became widely
adopted in the industry. These mechanisms served their purpose successfully, until they were
confronted with an increased complexity of mobile networks and new usage scenarios not considered
in earlier specifications.
Nowadays, a constellation of different network access technologies, such as 2G, 3G, LTE, Wi-Fi
and fixed, coexist to provide seamless access to voice and data services. The huge penetration of
smartphones has dramatically increased data consumption and bandwidth requirements, and
smartphone subscriptions will more than double from now until 2020 [1]. The emergence of the
Internet of Things (IoT) means networks must simultaneously face up to new usage scenarios, which
in some cases drastically increase signaling demands. In addition, these connected devices can
be a source of signaling storms in cases of simultaneous connection requests after network
disturbances.
The continuous modernization of operator networks, with increasing centralization of resources
in higher-capacity systems, implies that signaling storms have a bigger impact on the networks, as
incidents on centralized resources are likely to affect a larger number of users. Affected users will
perform reattempts, thereby initiating a
snowball effect that could overload the
signaling network multiple times.
As Figure 1 summarizes, signaling
networks therefore face significant
challenges. As a result, operators are
demanding new overload protection
mechanisms to enable them to provide
the generally accepted five nines (99.999
percent) of network availability, or even
more.
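To put the five-nines target in perspective, the arithmetic below converts an availability
figure into the downtime it permits per year. It is a minimal sketch; the function name and the
365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(availability: float) -> float:
    """Return the yearly downtime (in minutes) permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = allowed_downtime_minutes(0.99999)
print(f"Five nines allows about {five_nines:.2f} minutes of downtime per year")
```

In other words, a five-nines network may be unavailable for only a few minutes in a whole year,
which leaves very little room for a prolonged signaling storm.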
The journey that the industry has
initiated toward cloud computing, with
the transformation of the current network
nodes into virtualized network functions
(VNFs), seems at first glance to be a
handy solution to cope with signaling
storms in the network. The reality,
however, is quite different, as this
approach may generate an illusion of
infinite resources and lead to
underestimation of the importance of
overload protection mechanisms. Scale-
out mechanisms in the cloud will provide
increased flexibility to handle steady
growth, although these would not be fast
enough to cope with sudden signaling
peaks, which will escalate more quickly
than the network functions can cope with.
Figure 1: Challenges for the signaling network.
(Figure labels: network complexity: increasing network complexity requires a higher amount of
signaling; data traffic growth: growth in data traffic implies more signaling traffic;
centralization of resources: node outages affect more subscribers; multiple access technologies:
subscribers move instantly between access technologies; Internet of Things: more connected
devices impose new traffic patterns upon the signaling network; smartphone subscriptions: the
number of smartphone subscriptions is constantly increasing. Nodes shown: WLAN, 2G/3G/LTE and
fixed access, P-CSCF, SGSN-MME, PGW, DSC, HSS, OCS, PCRF.)
Under these circumstances, it is clear that the protection strategies that have been standardized
and widely adopted in mobile networks are no longer sufficient. High-traffic peaks and network
failure scenarios can result in massive signaling storms, leading to lengthy outages of network
services.
In its Annual Incident Reports 2012, ENISA (the European Union Agency for Network and
Information Security) reports that there were 79 major telecom outages in 2012 [2]. “System failures”
were the root cause of 75 percent of these incidents. Each overload-related incident affected an
average of around 9.4 million user connections. One of the biggest effects of network outages and
service degradation is the increase in the rate of subscriber churn.
At the same time, operators spend USD 15 billion a year to overcome network outages and service
degradations [3]. On average, operators spend 1.5 percent of their annual revenues on dealing with
these issues. Some even estimate this figure to be as high as 5 percent.
One strategy to mitigate overload problems is to over-dimension the network for the peak signaling
load. This adds complexity and implies
higher opex and capex, leading to an
additional financial burden that puts an
operator in a less competitive position.
On the other hand, it seems to be a
feasible strategy to cover signaling
peaks that are two to three times above
the average traffic load. But when a
signaling storm occurs, the load can
easily increase 10 times above the
average, stretching the need for over-
dimensioning to unrealistic limits. In
addition, over-dimensioning the network
does not eliminate the risk of a reduction
in the overall signaling capacity. A typical
node behavior during overload is depicted
in Figure 2. Up to the engineered capacity,
the message throughput is in line with the
offered traffic, and it still increases slightly
when overload is reached. In cases of
massive overload, however, the
throughput drops heavily with a further
increase in the offered load.
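The collapse under massive overload can be illustrated with a deliberately simple model. It
assumes (this is not taken from the figure) that a node has a fixed processing budget, that
handling a message costs one unit of work, and that rejecting a message still costs a fraction
of a unit. All names and numbers are hypothetical, and unlike the curve in Figure 2 this toy
model does not show the slight increase just above the engineered capacity.

```python
def processed_throughput(offered: float, capacity: float, reject_cost: float) -> float:
    """Messages per second completed by a node in a toy overload model.

    capacity    -- work units per second the node can spend (1 unit = 1 processed message)
    reject_cost -- work units spent on each rejected message (0 < reject_cost < 1)
    """
    if offered <= capacity:
        return offered  # normal operation: everything offered is processed
    # Overload: processed + reject_cost * (offered - processed) = capacity
    processed = (capacity - reject_cost * offered) / (1.0 - reject_cost)
    return max(processed, 0.0)


for load in (500, 1000, 2000, 5000, 10000):
    done = processed_throughput(load, capacity=1000, reject_cost=0.1)
    print(f"offered {load:>5} msg/s -> processed {done:7.1f} msg/s")
```

In this model the work spent on rejecting excess messages eats into the budget for useful work,
so the processed throughput falls as the offered load keeps rising.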
Another strategy to address overload
is to blindly reject signaling messages.
The problem with this strategy is that the
throughput of successful services
delivered to the consumer is heavily
reduced, even if only a small percentage
of messages are rejected, as depicted in
Figure 3. For instance, a successful VoLTE
call requires the successful processing of
roughly 20 signaling sequences. If only
one message in these sequences is rejected,
the call will fail.
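The effect can be estimated with a short calculation. As a simplifying assumption (not made
explicit in the paper), each of the n messages needed for one service delivery is rejected
independently with the same probability, so the service succeeds with probability p^n, where p
is the per-message success rate. The function name is illustrative.

```python
def service_success_rate(message_success_rate: float, messages_per_service: int) -> float:
    """Probability that every message of one service delivery is accepted,
    assuming independent and identical per-message rejection."""
    return message_success_rate ** messages_per_service


for p in (1.00, 0.99, 0.95, 0.90):
    rate = service_success_rate(p, messages_per_service=20)
    print(f"signaling success {p:.0%} -> service success {rate:.1%}")
```

Under these assumptions, rejecting 5 percent of the messages lets only about 36 percent of
20-message services complete, which is the behavior Figure 3 describes.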
The recommended strategy is to aim
for a robust and scalable network
architecture. The chosen network
architecture and protection mechanisms
need to be capable of handling the growth
in signaling without unnecessary over-
dimensioning, as well as ensuring service
delivery to consumers during cases of
heavy signaling load, node failures and
malicious activities that lead to signaling
storms.
Figure 2: Processed throughput in relation to offered load.
(Figure labels: horizontal axis: offered load, from normal operation into overload beyond the
engineered capacity; vertical axis: processed throughput, with the engineered capacity marked;
curve: with standard node-level overload control.)
Figure 3: Success rate of the signaling traffic in relation to the consumer service delivery.
(Figure labels: axes: success rate of the signaling traffic and success rate of the consumer
service delivery, each up to 100%; a 100% success rate of the signaling traffic corresponds to a
100% success rate of the consumer service delivery; a small reduction in signaling success rate
leads to a huge reduction in consumer service delivery; the effect grows with the number of
messages needed per service delivery.)
End-to-end strategy and principles to achieve robust signaling networks
The objectives for a robust signaling network are:
> to reduce the network impact of smartphone and device signaling
> to maximize throughput in cases of overload
> to recover quickly from overload and failure scenarios
> to keep the signaling network maintainable
> to keep the signaling network scalable.
A robust and distributed signaling network should be based on the following end-to-end strategy
and principles:
> a careful network architecture that provides the basis for a robust and scalable signaling network
> optimization of the signaling traffic to minimize the amount of signaling needed to manage the
network services
> a distributed and coordinated overload protection mechanism across several network elements
to maximize the throughput in peak load scenarios.
NETWORK ARCHITECTURE
The signaling network architecture should be based on simplicity: use just enough
infrastructure and features to keep the network a manageable entity.
The signaling network should be divided into manageable smaller components. Modular
network design enables operators to isolate problems within a module, while the rest of the
network continues to function. This means fewer users are affected and the overall uptime of the
network is increased.
The basic mechanism for preventing physical failure of the transport plane is redundancy:
redundant components and more than one physical path through the transport network from the
source client to the destination client. QoS can be used to prevent control-plane failure.
Best performance is achieved in overload or failure situations when the overload protection is
distributed across several network elements. Each network element on its own should be
redundant. The availability of the network elements can be further enhanced when redundant
node types are deployed in different geographical places.
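The benefit of redundancy can be estimated as follows, assuming (as an idealization) that the
redundant nodes fail independently, which is more plausible when they are deployed at different
sites. The function name and the sample figures are illustrative only.

```python
def redundant_availability(node_availability: float, replicas: int) -> float:
    """Availability of a group that works as long as at least one replica is up,
    assuming independent failures."""
    return 1.0 - (1.0 - node_availability) ** replicas


single = 0.999  # hypothetical availability of one node
pair = redundant_availability(single, replicas=2)
print(f"one node: {single:.3%}, redundant pair: {pair:.4%}")
```

Correlated failures, such as a shared power or transport fault, break the independence
assumption, which is one reason geographical separation matters.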
Scalability should be supported on node and network level with the aim of:
> managing the growth in signaling traffic efficiently
> being flexible to extend an established network configuration
> having a long-term strategy to resolve overload cases.
OPTIMIZATION
One aim of optimization is to minimize the signaling, and by this to reduce the network impact
of smartphone and device signaling. Recommended ways to achieve this are:
> to reduce paging traffic by starting paging in last known location for non-time-critical traffic
before the paging request is extended to other parts of the network
> to limit the effect of a decreased LTE idle timer in the user equipment (UE), which increases
the number of connection setup requests, by performing authentication only at every 10th or
20th connection setup
> to drop or reject excess traffic from misbehaving UE
> to drop or reject traffic from malicious attacks.
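The selective authentication idea in the list above can be sketched as a per-subscriber counter.
This is only an illustration: the class name, the interval parameter and the in-memory
dictionary are hypothetical, and a real node would apply further rules for when authentication
is mandatory.

```python
from collections import defaultdict


class SelectiveAuthenticator:
    """Decide whether a connection setup needs a full authentication run."""

    def __init__(self, interval: int = 10) -> None:
        self.interval = interval          # authenticate every `interval`-th setup
        self._setups = defaultdict(int)   # connection setups seen per subscriber

    def needs_authentication(self, subscriber_id: str) -> bool:
        count = self._setups[subscriber_id]
        self._setups[subscriber_id] = count + 1
        # The first setup (count 0) and every interval-th setup after it authenticate.
        return count % self.interval == 0


auth = SelectiveAuthenticator(interval=10)
decisions = [auth.needs_authentication("imsi-001") for _ in range(30)]
print(sum(decisions), "authentications for", len(decisions), "connection setups")
```

With an interval of 10, the authentication signaling toward the subscriber database drops to
roughly a tenth of what per-setup authentication would generate.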
A second aspect of optimization is to distribute the load evenly in the network. One recommendation
is to build an appropriate structure with pooled network resources for easy capacity expansion
and even load distribution. This allows for a much more efficient use of network resources.
DISTRIBUTED AND COORDINATED OVERLOAD CONTROL
Each network element should provide a working overload control mechanism to prevent its own
resources from being overloaded. This is also an essential function for pooled resources, as there
would otherwise be a risk that a peak in the signaling traffic could bring down one pooled device
after the other.
At network level, overload should be handled as close to the overload source as possible to
minimize recovery time. Propagation of the signaling peak through the network must be prevented
at all costs in order to avoid a service outage on a larger scale.
An example of a propagation of a signaling peak is described below.
After a major network outage, which could be the result of a transport network or Mobility
Management Entity (MME) failure, a large number of UEs will discover the network and try to
reattach to it. This multitude of reattach attempts causes a signaling storm, which will affect large
parts of the network. Typical signaling scenarios under such circumstances are:
> The MME completes the authentication process, updates the location in the Home
Subscriber Server (HSS), and reestablishes the bearers.
> The serving gateway must also recreate the bearers.
> The packet data network gateway (PDN-GW) recreates the bearers and reestablishes
sessions to, for example, the policy and charging rules function (PCRF).
> New full IMS registration is needed for VoLTE UE.
> The HSS needs to provide authentication information to the MME and register the location
(additional transactions are needed for IMS registration for VoLTE subscribers).
During overload, the signaling throughput can be optimized by intelligent traffic prioritization.
Ways of achieving intelligent traffic prioritization are given below.
> Adding some application logic to the traffic management function of dedicated nodes in the
network can enable the nodes to determine whether a signaling message belongs to a new
subscriber transaction or an ongoing subscriber transaction. In cases of overload, the
message triggering a new subscriber transaction will be rejected in favor of the message
related to an ongoing subscriber transaction. This will optimize the throughput at the
application level.
> Signaling traffic in overload situations can be prioritized based on importance, such as
emergency calls and priority services, ahead of delay tolerant access, such as energy
meters.
> An overload protection system should be adaptive to the current situation and allow higher
throughput when the overload eases, and throttle more of the traffic as the overload gets
worse. Using fixed rate throttling limits will not fulfill requirements for different situations. In
some overload scenarios, the network can handle more traffic and in others much less.
> The throughput of the system under overload can be optimized by the concept of
throughput elasticity, where the latency is allowed to increase. However, it is important that
the maximum available latency budget is never exceeded on an end-to-end level.
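The first two prioritization rules above can be sketched as an admission decision. The priority
classes, the load thresholds and the message fields are assumptions made for illustration; the
paper does not prescribe specific values.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    EMERGENCY = 0        # emergency calls and priority services
    NORMAL = 1
    DELAY_TOLERANT = 2   # for example, energy meters


@dataclass
class Message:
    session_id: str
    priority: Priority


class AdmissionControl:
    def __init__(self) -> None:
        self.ongoing = set()  # sessions with a transaction already in progress

    def admit(self, msg: Message, load: float) -> bool:
        """load is the node utilization; 1.0 means engineered capacity (hypothetical scale)."""
        if msg.session_id in self.ongoing:
            return True                   # ongoing transactions are favored
        if msg.priority is not Priority.EMERGENCY:
            if load >= 1.0:
                return False              # overload: reject new transactions
            if load >= 0.8 and msg.priority is Priority.DELAY_TOLERANT:
                return False              # nearing overload: shed delay-tolerant access first
        self.ongoing.add(msg.session_id)
        return True


ac = AdmissionControl()
ac.admit(Message("s1", Priority.NORMAL), load=0.5)          # accepted; s1 is now ongoing
print(ac.admit(Message("s1", Priority.NORMAL), load=1.2))   # message of an ongoing transaction
print(ac.admit(Message("s2", Priority.NORMAL), load=1.2))   # new transaction during overload
```

A fixed threshold is used here only to keep the sketch short; as the third rule notes, a real
mechanism should adapt its throttling to the current situation.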
Distributed signaling network robustness – the concept of three protection lines
The principles to achieve a robust
signaling network are best represented
in the network architecture depicted in
Figure 4.
A robust and distributed signaling
network should follow the concept of
three protection lines to protect the
operator’s service offering from being
affected by a signaling storm.
The first protection line comprises the
components that act as entry points to
the core network for smartphone and
device signaling. Examples are the Serving
GPRS Support Node – Mobility
Management Entity (SGSN-MME), the
proxy call session control function
(P-CSCF) and the PDN-GW.
The second line of protection consists
of the nodes providing routing
capabilities for the signaling traffic. It is
typically represented by a Diameter
Signaling Controller (DSC) or a Signaling
Transfer Point.
The third line of protection is
represented by the end systems hosting
the application data and logic. User data
management systems such as the HSS
are assigned to this third line of
protection.
The three lines of protection provide:
> distributed network architecture to
allow for an end-to-end overload protection solution in distributed layers
> the ability to compensate, in the next, higher protection line, for failures, misconfiguration
or misoperation in one protection line
> maximized signaling throughput during overload conditions
> scalable and maintainable network architecture
> efficient use of network resources by distributing the signaling load evenly in the network
> optimized signaling procedures to minimize the signaling traffic.
Figure 4: Network picture of distributed signaling network robustness – the concept of three
protection lines.
(Figure labels: 1st line: P-CSCF, SGSN-MME, PGW; 2nd line: DSC; 3rd line: HSS, OCS, PCRF.
Themes: distributed overload protection, scalability and maintainability, optimize signaling.
Annotations: end-to-end overload protection solution; scalable and maintainable network
architecture; if one layer fails, another can take over; backup always available; minimize
signaling traffic; maximize throughput under overload conditions; use network resources
efficiently; distribute load evenly.)
FIRST PROTECTION LINE
The SGSN-MME, being part of the Evolved Packet Core (EPC), is an example of the first line of
protection.
> Representing the entry point to the core network, the EPC is closest to potential overload
sources outside of the core network. The most efficient way to minimize the recovery time is
to apply optimization and overload protection of the signaling traffic in the first protection line.
> Nodes in the EPC aim to optimize the signaling. One strategy is to perform smart and adaptive
paging to minimize the number of paging requests. A second strategy is to limit excessive
signaling from dedicated UE.
> Nodes in the EPC should have a proven overload protection function that shields the nodes
from failure and ensures high throughput during extreme overload. This is achieved by proper
prioritization of services and subscribers, and distinguishing between initial traffic for a
subscriber and subsequent signaling.
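One common way to limit excessive signaling from a single UE is a token bucket per device. The
paper does not name a specific algorithm, so this is a swapped-in, well-known technique; the
rate, the burst size and the class name are hypothetical. Time is passed in explicitly so that
the example is deterministic.

```python
class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # timestamp of the previous request, in seconds

    def allow(self, now: float) -> bool:
        # Refill according to the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False          # excess request: drop or reject


bucket = TokenBucket(rate=1.0, burst=3.0)   # in practice, one bucket per UE
burst_results = [bucket.allow(now=0.0) for _ in range(5)]
print(burst_results)          # the first three requests pass, the remaining two are rejected
print(bucket.allow(now=2.0))  # two seconds later, tokens have been refilled
```

A well-behaved UE never notices the limiter, while a misbehaving one is held to the configured
average rate before its signaling reaches the rest of the core network.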
SECOND PROTECTION LINE
An example of the second line of protection is the DSC hosting the function of a Diameter Agent
as specified in RFC 6733 Diameter Base Protocol.
> A Diameter Agent simplifies the network architecture and reduces the number of connections
that need to be maintained in the network. The DSC acts as a centralized signaling router in
the network.
> The Diameter Agent itself should provide a proven overload protection mechanism and be
able to cope with offered traffic of several times the dimensioned capacity of the node
in cases where the first protection line cannot sufficiently limit a traffic peak.
> The Diameter Agent should provide load balancing capabilities toward other interfacing
Diameter peers. This ensures optimal usage of network elements and reduces load peaks on
dedicated Diameter servers such as the PCRF, the online charging system, and the HSS,
which typically reside in the third protection line.
> The load balancing function should consider the varying capacity of the interfacing
Diameter peers.
> The load balancing function could steer the preference for which Diameter peers are
primarily used as the routing target, and by this opt for local servers over remote servers.
> Nevertheless, the load balancing function could be adjusted dynamically based on the
actual traffic sent, which would actively reduce traffic peaks that would otherwise be sent
to already overloaded Diameter peers.
> The Diameter Agent should provide traffic shaping capabilities for interfacing Diameter servers
or clients. This actively prevents the propagation of signaling storms in the network.
> By traffic shaping outgoing Diameter traffic, the Diameter Agent prevents overloading
interfacing servers. It thus offers further protection to peers against potential signaling
bursts from nodes inside the network or from roaming partners.
> By traffic shaping incoming Diameter traffic, the Diameter Agent prevents Diameter clients
from abusively using network resources beyond operator-determined limits. One of the
paramount examples is a restarting MME node that could flood the HSS servers, causing
deterioration to the services of all other MME nodes.
> Typically, a Diameter Agent has very limited knowledge of the semantics of Diameter
applications. To prioritize traffic in an intelligent way, the Diameter Agent could be configured
with semantically relevant data, so that in congestion situations, messages of ongoing sessions
can be treated with higher priority than messages related to new sessions. This adds the
capability to perform application-aware traffic management in the Diameter Agent.
> The Diameter Agent is also in control of all Diameter interfaces in a network. An overload
indication on one interface can lead to a reduction of traffic on another interface. In cases
where a charging server becomes overloaded, the DSC can initiate throttling on initial attach
messages. This reduces the number of new subscribers following the principle of addressing
overload protection as close to the source of overload as possible.
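The load balancing behavior described above (capacity awareness, preference for local servers
and avoidance of overloaded peers) could look like the sketch below. It uses weighted random
selection, which is one possible choice rather than the method of any particular DSC; the peer
names, the capacities and the `overloaded` flag are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    capacity: int             # relative capacity, used as the selection weight
    local: bool = True        # local servers are preferred over remote ones
    overloaded: bool = False  # set from an overload indication


def pick_peer(peers, rng=random):
    """Choose a routing target among peers that are not overloaded, local ones first."""
    healthy = [p for p in peers if not p.overloaded]
    if not healthy:
        return None           # nothing to route to: the request must be rejected
    candidates = [p for p in healthy if p.local] or healthy
    weights = [p.capacity for p in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


peers = [
    Peer("hss-local-1", capacity=300),
    Peer("hss-local-2", capacity=100),
    Peer("hss-remote-1", capacity=400, local=False),
]
rng = random.Random(7)        # seeded so that repeated runs give the same result
counts = {p.name: 0 for p in peers}
for _ in range(10_000):
    counts[pick_peer(peers, rng).name] += 1
print(counts)  # the remote peer stays unused; the local peers share traffic roughly 3:1
```

If both local peers report overload, the same function falls back to the remote peer, so a
traffic peak is steered away from servers that are already struggling.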
THIRD PROTECTION LINE
User data management systems and, in particular, user databases sit at the end of the signaling
chain. During a network overload scenario, user databases will therefore naturally be under
pressure, likely becoming the first overloaded component in the network. On the other hand,
user databases are the heart of telecommunications networks in the sense that they are vital to
deliver consumer services, and a failure or degradation of the user database performance might
compromise the complete network. In the user data management space, 3GPP standardized a
data-layered architecture named User Data Convergence [4]. In this architecture, traditional
network databases in the core network – such as the home location register (HLR), the HSS,
authentication, authorization and accounting (AAA), and the policy controller – are split into
Application Front Ends (AFEs), which handle the business logic, and a user data repository (UDR),
which takes the role of the user database storing the user data.
In order to minimize the severity and duration of overload incidents, user data management
systems should:
> secure their availability and user data integrity at all times, no matter how severe the overload
incident might be. For that purpose, overload protection functionalities are a must for both
AFEs and UDRs.
> maximize end-to-end useful throughput during overload. Different complementary strategies
can be applied, such as:
> ensuring throughput elasticity during overload with adequate latency tradeoff
> intelligent traffic throttling: cooperative load regulation in the user data management
systems (Front End (FE) and UDR).
As previously shown in Figure 2, a
system will typically experience
throughput degradation with increasing
levels of overload, and databases are
no exception. User data
management systems require a more
intelligent throttling of excess traffic
during overload to ensure that
resources from the UDR are fully
utilized to process useful end-to-end
traffic. The cornerstone of this
mechanism is the cooperation between
the AFEs in the network (such as the
HLR, the HSS and AAA) and the UDR.
The UDR should constantly monitor
the resource utilization levels, such as
the response time and length of the
buffers. As soon as some of the
resources reach their limit, the situation
is reported back to the AFE as an
overload indication.
Based on the overload indication sent
from the UDR, the AFEs should throttle
the traffic according to the following
principles:
> The most important signaling messages are not throttled.
> Ongoing operations that involve several messages to the UDR are prioritized over new
operations. Typically, a Mobile Application Part (MAP)/Diameter operation received in the AFE
involves several messages sent to the UDR.
The throttling level should be continuously adjusted based on a dynamic and real-time feedback
loop. This ensures the UDR always performs to its maximum capacity and avoids the throughput
degradation caused by devoting resources to rejecting the excess traffic.
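A feedback loop of this kind can be sketched with an additive-increase,
multiplicative-decrease rule. That rule is an assumption chosen for illustration (the paper does
not specify the control law), as are the step sizes and the class name.

```python
class AdaptiveThrottle:
    """AFE-side admission fraction driven by overload indications from the UDR."""

    def __init__(self, increase: float = 0.05, decrease: float = 0.5,
                 floor: float = 0.05) -> None:
        self.admit_fraction = 1.0  # share of new operations passed on to the UDR
        self.increase = increase
        self.decrease = decrease
        self.floor = floor         # never throttle new operations completely

    def update(self, udr_overloaded: bool) -> float:
        if udr_overloaded:
            # Back off quickly while the UDR reports overload.
            self.admit_fraction = max(self.floor, self.admit_fraction * self.decrease)
        else:
            # Probe for more throughput once the overload eases.
            self.admit_fraction = min(1.0, self.admit_fraction + self.increase)
        return self.admit_fraction


throttle = AdaptiveThrottle()
for indication in (True, True, False, False, False):
    print(round(throttle.update(indication), 2))
```

In line with the throttling principles above, the fraction would apply only to new operations;
the most important messages and ongoing operations would bypass it.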
Figure 5 shows the expected behavior of a user data management system with cooperative
load regulation between the AFEs and the UDR, and with throughput elasticity during overload,
compared with a standard system.
Figure 5: Overload performance behavior of a user data management system (AFE+UDR) with
cooperative load regulation.
(Figure labels: horizontal axis: offered load (MAP, Diameter), from normal operation into
overload beyond the engineered capacity; vertical axis: processed throughput (MAP, Diameter),
with the engineered or dimensioned capacity and the maximum throughput marked; curves: with
AFE-UDR intelligent throttling, and with standard node-level UDR overload control.)
Conclusion
Operators are confronted with increasing complexity in mobile networks. At the same time,
signaling traffic is continuously increasing, driven by the growth of smartphone traffic and new
usage scenarios presented by the IoT.
The performance and availability of the signaling network are essential for service delivery to
the customer. Existing overload protection mechanisms that only focus on dedicated overloaded
nodes cannot prevent larger-scale outages. Any outages in the signaling network will lead to
service interruptions, causing financial losses and increasing the risk of subscriber churn.
A robust and distributed signaling network that follows the concept of three protection lines
provides an end-to-end overload protection solution that fulfills the objectives of a robust signaling
network. A robust and scalable network architecture, optimization of the signaling traffic, and a
distributed and coordinated overload protection mechanism boost the availability of the signaling
network beyond today’s measures.
References
[1] Ericsson, February 2015, Ericsson Mobility Report, Mobile World Congress Edition, available at:
http://www.ericsson.com/res/docs/2015/ericsson-mobility-report-feb-2015-interim.pdf
[2] ENISA (European Union Agency for Network and Information Security), August 2013, Annual
Incident Reports 2012, available at:
http://www.enisa.europa.eu/activities/Resilience-and-CIIP/Incidents-reporting/annual-reports/annual-incident-reports-2012-1/annual-incident-reports-2012/at_download/fullReport
[3] Heavy Reading, October 2013, Mobile Network Outages & Service Degradations: A Heavy
Reading Survey Analysis, available at:
http://www.heavyreading.com/details.asp?sku_id=3103&skuitem_itemid=1524&promo_code=&aff_code=&next_url=%2Flist.asp%3Fpage_type%3Dall_reports
[4] 3GPP TS 23.335, accessed June 2014, User Data Convergence (UDC); Technical realization and
information flows; Stage 2, available at:
http://www.3gpp.org/DynaReport/23335.htm
GLOSSARY
AAA authentication, authorization and accounting
AFE Application Front End
DSC Diameter Signaling Controller
ENISA European Union Agency for Network and Information Security
EPC Evolved Packet Core
FE Front End
HLR home location register
HSS Home Subscriber Server
IMS IP Multimedia Subsystem
IoT Internet of Things
MAP Mobile Application Part
MME Mobility Management Entity
OCS Online Charging System
PCRF policy and charging rules function
P-CSCF proxy call session control function
PDN-GW packet data network gateway
SGSN Serving GPRS Support Node
Signaling Network Part of a telecommunications network carrying signaling traffic
UDR user data repository
UE user equipment
VNF virtualized network function
© 2015 Ericsson AB – All rights reserved