

Page 1: [IEEE 2005 9th IFIP/IEEE International Symposium on Integrated Network Management, 2005. IM 2005. - Nice, France (15-19 May 2005)] 2005 9th IFIP/IEEE International Symposium on Integrated

Scalable Fault Management for Mobile Networks Beyond 3G

Giorgio Nunzi, NEC Europe Ltd., Network Laboratories, Heidelberg, Germany, [email protected]

Juergen Quittek, NEC Europe Ltd., Network Laboratories, Heidelberg, Germany, [email protected]

Marcus Brunner, NEC Europe Ltd., Network Laboratories, Heidelberg, Germany, [email protected]

Abstract

The development of future releases of third generation and fourth generation cellular networks shows clear indications that today's ATM-based Radio Access Network (RAN) will be replaced by an IP-based RAN, particularly to reduce cost compared to today's RANs. Managing an IP-based RAN with SNMP is a challenge, particularly concerning scalability. Existing solutions for delegating management functions to managed nodes can achieve the required scalability, but they come with a large overhead in resource consumption and complexity. This paper presents a lean solution for health checking of managed nodes that achieves this goal by restricting itself to this purpose. The CHECK MIB module serves for remote configuration of health check actions to be performed at a managed node and for retrieving health check results. Resource consumption is very low compared to existing, more general technologies. Health checks can be specified easily by setting a very small number of objects, and the implementation of the MIB module is lightweight.

Keywords

SNMP, Network Management, Fault Management

1. Introduction

Third Generation (3G) cellular networks are already operational, but their architecture is still under further development and standardization at the 3rd Generation Partnership Project (3GPP). The current 3G Radio Access Network (RAN) architecture uses ATM as the network protocol between the base stations (NodeBs) and Radio Network Controllers (RNCs). This infrastructure is considered too costly. In particular, it is assumed that the NodeBs and the ATM network have high potential for cost savings.

Therefore, 3GPP participants propose introducing an IP-based RAN with low-cost NodeBs in future releases of the 3GPP standard. Several 3G equipment manufacturers have already started developing prototypes. Research and development work for 4G cellular networks is also usually based on an "all-IP" vision.

Cost savings in NodeB development, manufacturing and operation also affect the management system. For current NodeBs, the Common Management Information Protocol (CMIP) is typically used. Compared to SNMP, CMIP has higher functionality and is better suited to fulfill the requirements for a RAN management protocol, but it is also more costly on both the agent and the manager side. Therefore, consistent with the migration from ATM to IP and as another cost-saving step, a migration to the Simple Network Management Protocol (SNMP) is considered for IP-based RAN management.

The "all-IP" vision promises to adopt Internet protocols in telecommunication architectures. Here, the IP management framework faces new challenges. Compared to typical managed IP networks, the requirements for cellular network management are higher, because a NodeB failure or a network failure immediately impacts revenue. Therefore, fault detection in particular is a crucial task when providing cellular network services, but also a very challenging one.

0-7803-9087-3/05/$20.00 ©2005 IEEE

For fault detection, a huge number of managed nodes (the NodeBs) need to be monitored (up to several thousand, with 50-200 controlled by a single RNC). The number of managed objects to be checked per managed node and the required frequency of checks are also relatively high. Additionally, resources are expensive, so resource requirements need to be low.

Figure 1 illustrates that several aspects of NodeB configuration need to be checked regularly, including but not limited to interface availability, radio channel frequency assignment and quality, and the status of current connections.

Several approaches have been proposed that address the scalability of IP network management using methods for delegating management functions to mid-level managers or to managed nodes. However, most of them were designed for a more general purpose and cover much wider ranges of applications. This general applicability comes with heavy-weight implementations and complex application procedures that are not in line with our requirement for a low-cost NodeB.

This paper presents an approach that is more problem-specific and restricted to health checking of managed nodes. It uses an optimized problem-specific MIB module that is very light-weight but still meets all requirements. It is based on the idea of delegating check operations to the agents to be performed locally. The MIB module serves for configuring health checks as well as for retrieving health check results.

The remainder of this paper is structured as follows. Section 2 states the problem, highlighting the difficulties of network management in highly populated networks, and Section 3 discusses solutions based on existing IETF standards. Section 4 presents the CHECK MIB module design and explains its usage and implementation. In Sections 5 and 6 we provide an evaluation of the proposed solution: first we address considerations based on a model of the operations and then we analyze our experimental results.

2. Problem Statement

When a management station monitors a network to discover node failures, it periodically reads managed objects from the nodes that describe the operational status. The values retrieved are then analyzed by the application following some rules of inference, and if the analysis identifies a failure, an appropriate action is triggered, such as logging a message, notifying a human operator, or initiating an automatic repair action.

Figure 1: Fault management in an IP-based Radio Access Network (RAN).

Figure 2: Fault Management with the polling paradigm of SNMP

In most cases, the values retrieved are compared against the healthy ranges of their objects, and the analysis reduces to simple comparisons of object values against health limits. A basic check, for instance, is verifying the activity of network interfaces by comparing the object IF-MIB::ifOperStatus with the value up(1).

If we use SNMP for performing complex fault management tasks, for example as described in Figure 1, it is useful to group the get operations aimed at monitoring a specific aspect of the managed nodes. In this sense, we define a check as the collection of operations that monitor a set of objects, verify that their values are in healthy ranges and create a report of any failures found. The design of a check and the grouping of the monitored objects is left to the expertise of the network administrator.

This method of monitoring objects is referred to as polling. Figure 2 sketches a single cycle of a fault management activity, where a set of checks is executed periodically. Each check indicates the set of objects to be retrieved by specifying their Object Identifiers (OIDs). OIDs are grouped into SNMP Protocol Data Units (PDUs) that are transmitted as payloads of UDP packets. SNMP requests built in this way are sent individually to every monitored node.

Figure 2 shows that when the checks are resolved at the management station, they generate a storm of atomic requests that the management station feeds into the network to detect failures. Two aspects are of major concern in dealing with such a volume of SNMP requests, as described in [8]:
• Network overhead. The traffic generated by the SNMP operations between the management station and the managed nodes reduces the bandwidth available for user applications.
• Latency. The latency of a check is the time elapsed from the first request sent by the management station to a managed node until all replies from the managed nodes have been received by the management station and the comparisons have been completed.

The complexity of fault management tasks concerning network overhead and latency is mainly determined by three aspects:
• The number of monitored nodes, ranging from tens in a Local Area Network to hundreds in carriers' domains.
• The number of objects per host, depending on the kind of check performed. A basic and widely deployed check is monitoring the status of the interfaces of a managed node. Depending on the service being provided, several sets of objects may be monitored at the same time.
• The frequency of checks. For accurate health monitoring, a high frequency is desirable, but usually it is limited by the scalability of the management system.

The scenario we want to address here consists of highly populated networks, where several checks must be performed at every managed node. Our goal is to limit the bandwidth consumption and latency of SNMP management, in order to achieve high scalability while still supporting a high frequency of check cycles. We propose executing the checks locally at the managed nodes. The obvious advantage is a reduction of the traffic generated. What is unclear is the resulting resource consumption at the managed nodes. We analyze and measure both effects in Sections 5 and 6.

3. Related Work

There is a lot of related work in the general area of distributed network management. Overviews are given, for example, in [1][2][3]. For the specific problem of remote health checking of managed nodes with SNMP in IP-based networks, there is a set of IETF standards, which are discussed in this section.

The Script MIB [4] allows a manager to download arbitrary programs to the managed node and read the results of program execution. To solve the fault management problem stated in the previous section, we can code small programs that perform local checks at the managed nodes and read their results. However, performing this task with the Script MIB is not simple, because of the complexity of this flexible environment. The required steps include coding all comparisons into programs, downloading them to the managed node and controlling their execution. A change in the check operations, for example the modification of one of several thresholds, requires identifying the corresponding line of the program, modifying it, re-compiling it (required in some cases), downloading the modified program to the managed node (required in some cases), stopping the execution of the original program and starting the modified one.

Implementing the Script MIB is also complex. It is very powerful concerning the handling of programs, passing arguments to them, receiving intermediate results and controlling their execution. In addition, it must provide a runtime engine for program execution. This is achieved by a rather complex and costly implementation. The rich capabilities of the Script MIB also raise several security issues. Since the programs to be executed can potentially execute arbitrary code, strict access control to the Script MIB is required.

The Expression MIB [5] allows simple expressions to be computed on the values of managed objects at the managed node. Instead of supporting arbitrary programs, the Expression MIB offers a very restricted language for operations on managed objects. For example, the percentage line utilization per second can be expressed as an operation on four objects of the mib-2 module. The drawback of this flexibility is the complexity of defining the expressions. Each value in the expression is treated as a variable and must be correctly defined in a table of variables. Implementing the Expression MIB is also complex on both sides, the management station and the managed node.

Most of the monitoring operations performed by management applications reduce to simple comparisons between values and do not require the computational capabilities offered by the Expression MIB. The Event MIB [6] allows an agent to monitor managed objects at a device and to report failures when certain conditions on their values are met. In this case two different actions can be taken: either setting the value of an object that reports failures to other MIB modules, or creating and sending a notification.

4. Health Check MIB

Facing the drawbacks of existing solutions, we decided to design a new MIB module, the CHECK MIB, that is tailored for health checking. The design goals beyond scalable health checking included a small footprint, low implementation cost and easy use of the module.

Operations on the CHECK MIB include the configuration of checks, the retrieval of check results and, after a check has failed, the retrieval of the OIDs for which the comparison operation failed. A check is defined through a set of operations, called rules. Each rule monitors an object and verifies that its value is contained in a healthy region. When a check is performed, each of its rules is executed, and if at least one rule fails (i.e. the monitored object contains a value that is considered to indicate a fault), then the entire check fails. The rules can perform only a limited set of operations, including basic comparison operations (≤, ≠, =, etc.). We argued in Section 2 that these cases cover most check operations, and we think that this limitation does not significantly impact the applicability to fault management tasks. We also considered a delta operation for monitoring increments of counter objects, but to allow lean implementations we defined this as an optional feature.

--Objects
  +-Capabilities
  |  +-MinInterval
  |  +-MaxChecks
  |  +-MaxRules
  +-CheckControl
  |  +-AdminStatus
  |  +-OperStatus
  |
  +-ResultTable
  |  +-ResultName (index)
  |  +-ResultSeverity
  |  +-ResultTime
  |  +-ResultInterval
  |  +-ResultSeverityThreshold
  |  +-ResultStorageType
  |  +-ResultRowStatus
  |
  +-RuleTable
  |  +-ResultName (index)
  |  +-RuleName (index)
  |  +-RuleOid
  |  +-RuleValue
  |  +-RuleOperation
  |  +-RuleSeverity
  |  +-RuleRowStatus
  +-FailureTable
     +-ResultName (index)
     +-FailureSeverity (index)
     +-ResultName (index)
     +-FailureOid

Figure 3: The Health Check MIB OID tree (prefix check omitted for all objects)

A rule can monitor every type of object supported by the Structure of Management Information version 2 (SMIv2, [7]), including integers, OIDs, and strings. The healthy region is defined in a string object that encodes a value of the monitored object type: for instance, when monitoring an Integer32 value it contains four bytes.

The result of a check is defined as an integer, called severity. Higher severity values are associated with more critical failures. With this feature the management station can associate different actions with different severities. For example, warning messages can be generated for severities less than 50 and error reports for higher severities. When configuring a check, the management system assigns a severity value to each individual rule. If a check fails, then the severity of the check is the highest severity of all failed rules. The managed node can generate a notification for a failed check. In this case the notification is sent when the severity is greater than a configurable threshold value.
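The severity rules described above can be sketched in a few lines of Python. This is only an illustration, not the authors' C implementation: the names `Rule`, `check_severity` and `should_notify` are hypothetical, and rule outcomes are passed in as already evaluated booleans. The text describes the threshold both as a value the severity must exceed and as the minimum severity that triggers a notification; the sketch assumes the inclusive reading, which matches the example thresholds of 1 and 50 used later in this section.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    severity: int  # severity the manager assigned to this rule
    passed: bool   # outcome of the rule's comparison (assumed already evaluated)


def check_severity(rules):
    """Return 0 if every rule passed, else the highest severity among failed rules."""
    return max((r.severity for r in rules if not r.passed), default=0)


def should_notify(severity, threshold):
    """Assumption: the threshold is inclusive (the minimum severity that triggers)."""
    return severity > 0 and severity >= threshold


# Hypothetical outcome of the "interfaces" check: the error-counter rule failed.
rules = [
    Rule("eth0Status", severity=100, passed=True),
    Rule("eth0InErrors", severity=50, passed=False),
]
result = check_severity(rules)
notify = should_notify(result, threshold=1)
```

A manager could then map `result` to an action, for instance a warning below 50 and an error report otherwise, as suggested in the text.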

A check can be performed on-demand or on-schedule. In the former case, the result is computed at each attempt to read its severity value, and the manager receives the response only after the complete check has been performed. In the latter case, the check is performed automatically at regular intervals and the manager retrieves the result computed in the most recent run. Note that when counter objects are monitored, the related check should be performed on-schedule, so that the delta values are computed at regular intervals.
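To make the remark about counters concrete, the following sketch shows how a delta between two scheduled samples might be computed. It is an assumption-laden illustration: the function names are hypothetical, the handling of a single wrap-around of a 32-bit counter is our addition and is not described in the paper, and the pass condition (increment strictly below the configured limit) follows the wording of the traffic example in Section 4.1.

```python
COUNTER32_MODULUS = 2 ** 32  # SMIv2 Counter32 objects wrap at 2^32


def counter_delta(previous, current):
    """Increment between two samples, tolerating one wrap-around (assumption)."""
    return (current - previous) % COUNTER32_MODULUS


def delta_rule_passes(previous, current, limit):
    """Assumed semantics: the rule passes while the increment stays below the limit."""
    return counter_delta(previous, current) < limit


# Two hypothetical samples of ifOutOctets taken one interval apart.
increment = counter_delta(1_000_000, 1_030_000)
ok = delta_rule_passes(1_000_000, 1_030_000, limit=50_000)
```

Because the delta depends on the time between samples, a fixed interval is what makes the configured limit meaningful as a rate.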

The tree of the MIB is illustrated in Figure 3. The following sub-sections describe each sub-tree.

checkCapabilities This group describes the limits of the MIB implementation: the minimum interval for performing checks on-schedule and the maximum number of checks and rules supported.

checkControl This group allows the computation of all checks to be disabled immediately. This can be useful to temporarily disable all checks, for instance when the managed node is under maintenance.

checkResultTable This table contains the list of checks performed by the agent. The manager stores a new check by creating a new row in the table, using a descriptive name as index. Two objects can be used to configure the behavior of the check. The first, checkResultInterval, specifies how often the check is performed (0 selects on-demand execution; other values specify the interval for on-schedule execution). The second, checkResultThreshold, indicates whether a notification should be sent in case of failure (this threshold is the minimum severity that triggers the notification). The result of a check is stored in the object checkResultSeverity. A value of 0 indicates that the check passed, while higher values report failures of higher gravity. checkResultTime contains the time at which the check was last performed.

SNMP table: CHECK-MIB::checkResultTable

index         Severity  Size  checkResultTime  checkResultInterval  Threshold  StorageType  RowStatus
"traffic"     0         0     0:0:01:50.46     1000 centi-seconds   50         volatile     active
"interfaces"  0         0     0:0:01:54.00     0 centi-seconds      1          volatile     active

SNMP table: CHECK-MIB::checkRuleTable

index                        checkRuleOid    RuleValue  Operation  Severity
"traffic"."eth0Out"          ifOutOctets.2   0:0:c3:50  delta      20
"traffic"."eth0Out"          ifOutOctets.2   0:1:86:a0  delta      50
"interfaces"."eth0Status"    ifOperStatus.2  0:0:0:1    equal      100
"interfaces"."eth0InErrors"  ifInErrors.2    0:0:0:0    equal      50

Figure 4: The checkResultTable and checkRuleTable of the agents.

checkRuleTable This table contains the operations (rules) that are performed by the agent to build the result of a check. Each row is indexed by the name of the check to which the rule belongs and a name for the rule. Each rule is configured to read a local object specified by checkRuleOid and to compare its value against a health value, specified by checkRuleValue; the comparison operation to be performed is stored in checkRuleOperation. The checkRuleSeverity object specifies the severity of a failure of this rule. If the comparison evaluates to true, then the rule passes; otherwise the rule fails and the severity of the check has at least the value of this rule's severity.

checkFailureTable This table is populated with the objects that caused a failure.

4.1 Usage

Using the CHECK MIB starts with configuring the checks at the managed nodes. For each check the manager performs the following steps:
1. It creates a new row in the checkResultTable. A name for the check must be given as index of the row.
2. It creates a new row in the checkRuleTable.
3. It configures the comparison of the rule with the objects RuleOid and RuleOperation.
4. It configures the severity for the rule just created and activates the row.
5. It repeats steps 2-4 for each further comparison operation included in the check.
6. If needed, it configures the frequency at which the check is performed and the threshold for sending a notification.
7. Finally, it activates the row of the check. At this point the check is configured at the managed node.

Figure 4 shows two simple checks configured at a managed node. The first check, called interfaces, verifies the operational status of the interfaces: it is performed on-demand (the interval is set to 0) and a notification is sent for every failure (the threshold is set to 1). It includes two rules: one to verify that the value of the object ifOperStatus is equal to up(1) and another to verify that the value of the object ifInErrors is 0. The second check, called traffic, supervises the traffic volume: it is performed regularly every 10 seconds (the interval is set to 1000 centiseconds) and a notification is sent only for a high severity (the threshold is set to 50). It includes two rules: the first one checks that increments stay below 50,000 bytes (the gauge value is encoded in hexadecimal as 00:00:c3:50) and the second one that they stay below 100,000 bytes (hexadecimal value 00:01:86:a0).
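The hexadecimal rule values quoted above can be reproduced with a short sketch. The helper names `encode_uint32` and `as_colon_hex` are hypothetical, and the sketch assumes that the value is stored as four bytes in big-endian order, which is consistent with the two hexadecimal strings given in the text.

```python
def encode_uint32(value):
    """Encode a non-negative integer as four big-endian bytes (assumed layout)."""
    if not 0 <= value < 2 ** 32:
        raise ValueError("value does not fit into 32 bits")
    return value.to_bytes(4, "big")


def as_colon_hex(raw):
    """Render a byte string in colon-separated hexadecimal notation."""
    return ":".join(f"{byte:02x}" for byte in raw)


low_limit = as_colon_hex(encode_uint32(50_000))
high_limit = as_colon_hex(encode_uint32(100_000))
print(low_limit)   # 00:00:c3:50
print(high_limit)  # 00:01:86:a0
```

A manager would write such a byte string into the rule's value object when configuring the rule; how that write is issued depends on the SNMP library in use and is outside this sketch.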

Figure 5 shows the operations performed by a management station monitoring a network through the CHECK MIB. A check is directly translated into the OID of the result value to be retrieved from the hosts, for instance checkResultSeverity."interfaces". The OIDs generated are certainly fewer than those shown in Figure 2, because the same number of objects can be monitored with a small set of checks, so the OIDs can be transported in a single PDU without being split across several messages. Moreover, the results of the checks are instances of the same leaf, checkResultSeverity, so the manager can use a single get-bulk request, setting the MaxRepetitions parameter of the request to the number of checks to be retrieved.

Figure 5: Fault Management using the CHECK MIB.

With the CHECK MIB approach, the monitoring activity is distributed from the manager to the managed nodes, with two advantages. The first is a reduction in the number of objects accessed by the manager: to learn the outcome of a check from the CHECK MIB, only the result of the check needs to be read. The CHECK MIB thus reduces the traffic overhead by a factor equal to the number of objects aggregated into a check. The second is that the execution of checks is delegated to the managed nodes. Together, reduced network traffic and delegated execution significantly increase the scalability of fault management applications. These effects are analyzed and discussed in detail in Sections 5 and 6.

4.2 Implementation

We implemented the CHECK MIB using the Net-SNMP package [9]. The MIB group checkCapabilities contains only read-only values, and the group checkControl adds a control statement to the execution of checks. The three tables are implemented in three separate C modules that contain only the internal representation of the tables and the handlers that interact with the Net-SNMP core module. Internally, each row of a table is represented as a struct, and the table is represented as a linked list. The operations for performing checks are coded in a separate module, the checkEngine. The main function is called perform_check, which, given the name of a check, performs the check, updating the values of checkResultSeverity and checkResultTime. For this operation, the function iterates through the list of rules included in the check, calling the function perform_rule for each one. This function returns the severity of the rule performed, so that in each iteration the result of the check can be updated with the maximum severity of all the rules. As the last step, the value of checkResultTime is updated and, if the result's severity is greater than the value of checkResultSeverityThreshold, a notification is sent.

Figure 6: The implementation of the agent: modules (left side) and the main algorithm of the engine (right side).

The function perform_rule locally retrieves the value of the monitored object and performs the operation specified in the object checkRuleOperation, using the value of checkRuleValue as the healthy boundary. If the comparison does not hold, the function returns the value of checkRuleSeverity. In our implementation we used a further level of abstraction in the perform_rule function, with lower-level functions handling the different object types: for example, we delegated the operations on INTEGER objects to the function perform_rule_int, and so on. Even if this hierarchy is not strictly needed, it helped considerably in covering all the SMIv2 types step by step.
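The per-type dispatch described above might look as follows when sketched in Python rather than C. Everything here is illustrative: the decoder table, the operation names other than "equal", and the function signature are assumptions, and the locally read value is passed in as an argument instead of being fetched from an agent.

```python
import operator

# Comparison operations; only "equal" appears in the paper's example tables,
# the other names are hypothetical.
OPERATIONS = {
    "equal": operator.eq,
    "notEqual": operator.ne,
    "lessOrEqual": operator.le,
    "greaterOrEqual": operator.ge,
}

# One decoder per object type, playing the role of the per-type helper functions.
# Integers are decoded as unsigned big-endian values (assumption).
DECODERS = {
    "integer": lambda raw: int.from_bytes(raw, "big"),
    "string": lambda raw: raw.decode("ascii"),
}


def perform_rule(local_value, value_type, raw_limit, operation, severity):
    """Return 0 if the comparison holds, otherwise the severity of the rule."""
    limit = DECODERS[value_type](raw_limit)
    passed = OPERATIONS[operation](local_value, limit)
    return 0 if passed else severity


# ifOperStatus is up(1): the rule passes. With down(2) it fails with severity 100.
up_result = perform_rule(1, "integer", bytes([0, 0, 0, 1]), "equal", 100)
down_result = perform_rule(2, "integer", bytes([0, 0, 0, 1]), "equal", 100)
```

Keeping the decoders in a table means that supporting a further SMIv2 type only requires adding one entry, which mirrors the step-by-step coverage mentioned in the text.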

Figure 6 also shows that the execution of a check can be triggered in two different ways. A first trigger comes from the checkResultTable when the manager reads one instance of the object checkResultSeverity (check on-demand); a second one comes from the Net-SNMP core to perform the check automatically (check on-schedule). The activation of one trigger or the other is controlled by the object checkResultInterval.

We used a linked list of structs to represent the tables internally. The memory consumption of the CHECK MIB is strictly related to the data model sketched in Figure 3. In fact, the struct representing a Result contains the fields illustrated in Figure 3: the field name is an array of 32 bytes, while the remaining fields, even if they are defined with different SMI types, are each carried in a 32-bit integer. The struct representing a rule contains the following fields: the two indexes are two arrays of 32 bytes; the OID is an array of 32-bit integers whose length depends on the OID monitored; the Value is a byte array whose length depends on the object type of the OID monitored (e.g. it is a 4-byte array for monitoring an integer value); the other fields are simple 32-bit integers.

Table 1 shows examples of memory occupation for some check configurations. For simplicity we used an OID with 10 sub-identifiers (as required for the well-known ifOperStatus object) and considered two cases: monitoring integer values only (which have the same size as IPAddress values) and monitoring 32-byte strings (which have the same size as IPv6Address values). Table 1 also takes into account the pointer in each element of the table that implements the linked list (a further 4 bytes per pointer on a 32-bit architecture), but we neglected the overhead introduced by the C compiler in representing a struct: this overhead is not related to the CHECK MIB implementation and may differ strongly depending on the implementation of the SNMP agent (e.g. the overhead in a Java-based implementation might be larger).
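The values in Table 1 follow a simple linear pattern, which the sketch below reproduces. The per-row constants (56 bytes per check row, 120 or 132 bytes per rule row) are not stated in the paper; they were obtained here by fitting the rows of Table 1 and should be read as an empirical summary of the table, not as a derivation from the struct layout. The function and constant names are hypothetical.

```python
# Per-row sizes in bytes, fitted to the rows of Table 1 (assumption, see lead-in).
CHECK_ROW_BYTES = 56
RULE_ROW_BYTES = {"integer": 120, "string32": 132}


def table_memory(num_checks, rules_per_check, value_kind):
    """Estimated memory of the CHECK MIB tables for a uniform configuration."""
    per_check = CHECK_ROW_BYTES + rules_per_check * RULE_ROW_BYTES[value_kind]
    return num_checks * per_check


single = table_memory(1, 1, "integer")        # first row of Table 1
dense = table_memory(10, 10, "string32")      # last row of Table 1
```

With these constants a configuration of 5 checks with 20 integer rules each comes to 12280 bytes, matching the corresponding row of the table.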

To explore how a network can be monitored with the CHECK MIB, we also implemented a small management application that handles the tables of the CHECK MIB and interprets the severity returned by the agents. Figure 7 presents a snapshot of the GUI of the manager. The left side contains a tree showing the nodes of the network and the names of the checks configured on them. The right side is used to configure the checks by accessing the tables of the CHECK MIB. The snapshot also highlights the semantics of the severity: each row of the checks is marked in green or red to show which checks passed and which failed. Moreover, the nodes in the tree on the left side are marked with red circles if at least one check failed on that node. The purpose of this color-based representation is to give administrators an immediate overview of the status of the network. In a more advanced Network Management System, the severity carried in the result of a check could be used to display different alarm symbols.

Table 1 Sample values of memory consumption for the tables of the CHECK MIB.

Configuration              Memory footprint of tables (bytes)
#checks  #OIDs per check   Integer / IPAddress   String 32 chars / IPv6Address
1        1                 176                   188
1        10                1256                  1376
1        50                6056                  6656
1        100               12056                 13256
5        2                 1480                  1600
5        10                6280                  6880
5        20                12280                 13480
10       1                 1760                  1880
10       5                 6560                  7160
10       10                12560                 13760

Our management application also allowed us to verify the usability of the CHECK MIB in a dense IP network with a small trick. On our small testbed we executed several instances of the SNMP agent on different UDP ports on the same machine, ending up with 60 "virtual" monitored nodes in our management application. The GUI supports replicating entire checks in one step from one managed node to another. In this setup the manager application worked correctly and successfully retrieved the health status from all the nodes. On the agent side, we observed an increase in the time needed to perform all the local checks.

5. Evaluation

This section introduces a model for latency and traffic volume of health check operations. The model is used for a quantitative comparison between the polling paradigm and the CHECK MIB approach. We model the traffic generated by a manager and the resources consumed on the agents in terms of the number of checks and of monitored objects. The model is verified by experimental measurements described in Section 6.

5.1 Traffic model

In polling mode (as shown in Figure 2) the traffic T exchanged between a management station and a single managed node is given by the sum of the traffic generated by all individual check operations:

$$T\{\mathrm{Polling}\} = \sum_{\mathrm{Checks}} T\{\mathrm{Check}\} \tag{1}$$

Figure 7: The Graphical User Interface for the CHECK MIB.


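Table 2 Example calculations for generated traffic in bytes.

| Checks | #ObjCheck | #PDUCheck | #ObjTot | Polling Treq | Polling Tresp | Polling Ttot | CHECK MIB Treq | CHECK MIB Tresp | CHECK MIB Ttot |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 83 | 85 | 168 | 80 | 93 | 173 |
| 1 | 10 | 1 | 10 | 263 | 283 | 546 | 80 | 93 | 173 |
| 1 | 50 | 2 | 50 | 1126 | 1226 | 2352 | 80 | 93 | 173 |
| 1 | 100 | 4 | 100 | 2252 | 2452 | 4704 | 80 | 93 | 173 |
| 5 | 2 | 1 | 10 | 515 | 535 | 1050 | 80 | 213 | 293 |
| 5 | 10 | 1 | 50 | 1315 | 1415 | 2730 | 80 | 213 | 293 |
| 5 | 20 | 1 | 100 | 2315 | 2515 | 4830 | 80 | 213 | 293 |
| 10 | 1 | 1 | 10 | 830 | 850 | 1680 | 80 | 363 | 443 |
| 10 | 5 | 1 | 50 | 1630 | 1730 | 3360 | 80 | 363 | 443 |
| 10 | 10 | 1 | 100 | 2630 | 2830 | 5460 | 80 | 363 | 443 |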

For each check a set of objects needs to be retrieved. This requires one or more get-request messages from the management station to the managed node and one or more get-reply messages sent in the other direction. For both message types, the objects are grouped into PDUs for transmission:

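$$T\{\mathrm{Polling}\} = \sum_{\mathrm{Checks}} T\Big\{ \sum_{\mathrm{Obj}_{\mathrm{Check}}} \mathrm{Object} \Big\} = \sum_{\mathrm{Checks}} \sum_{\mathrm{PDUs}_{\mathrm{Check}}} T\{\mathrm{PDU}\} \tag{2}$$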

The number #PDUCheck of SNMP PDUs per check depends on the number of contained varbinds (each containing one object) that fit into a single PDU. A varbind contains the full OID in a get-request message and additionally the value of an object in a get-reply message. Because the size of varbinds varies with the size of OID and value, the number of varbinds per PDU cannot be predicted. Therefore we assume a fixed number #ObjPDU of objects that can be carried inside a PDU in order to derive:

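$$\#\mathrm{PDU}_{\mathrm{Check}} = \left\lceil \frac{\#\mathrm{Obj}_{\mathrm{Check}}}{\#\mathrm{Obj}_{\mathrm{PDU}}} \right\rceil \tag{3}$$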

Each PDU is transmitted as payload of a UDP datagram in an IP packet. Beyond the varbinds, the IP packet contains an SNMP header (in the PDU), a UDP header and an IP header. For our model we assume the length HDR of the chain of all headers to be constant. For the varying length of the varbinds we assume an average length VARB.

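$$T\{\mathrm{Polling}\} = \sum_{\mathrm{Checks}} \Big( \mathrm{HDR} \cdot \#\mathrm{PDU}_{\mathrm{Check}} + \sum_{\mathrm{Objs}_{\mathrm{Check}}} \mathrm{VARB} \Big) \tag{4}$$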

For traffic Treq from the management station to the managed node, the length VARB is the average length MOID of object OIDs. For traffic Tresp in the reverse direction, VARB is longer by the average length MVALUE of object values: VARB = MOID + MVALUE. The total traffic Ttot is the sum of Treq and Tresp.

Table 2 shows example calculations for the generated traffic using the following parameters: MOID = 20; MVALUE = 2; HDR = 63. The value for MOID is the OID size of object ifOperStatus.1 and the value for HDR was taken from a packet dump.
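As an illustration, equations (3) and (4) can be evaluated with the parameters above. The sketch below is a minimal reading of the model, not the authors' tooling; it assumes #ObjPDU = 27, the value reported for the testbed in Section 6, which agrees with the #PDUCheck column of Table 2. The function name `polling_traffic` is hypothetical.

```python
import math

HDR = 63          # combined IP + UDP + SNMP header length (from the text)
MOID = 20         # average OID length, ifOperStatus.1 (from the text)
MVALUE = 2        # average value length (from the text)
OBJ_PER_PDU = 27  # assumption: value reported for the testbed in Section 6


def polling_traffic(checks: int, obj_per_check: int) -> tuple:
    """Return (Treq, Tresp, Ttot) in bytes for polling mode.

    All checks are assumed to monitor the same number of objects.
    """
    pdus_per_check = math.ceil(obj_per_check / OBJ_PER_PDU)  # equation (3)
    # Equation (4): requests carry OIDs only, responses carry OID plus value.
    t_req = checks * (HDR * pdus_per_check + obj_per_check * MOID)
    t_resp = checks * (HDR * pdus_per_check + obj_per_check * (MOID + MVALUE))
    return t_req, t_resp, t_req + t_resp


if __name__ == "__main__":
    print(polling_traffic(1, 1))    # (83, 85, 168)
    print(polling_traffic(1, 100))  # (2252, 2452, 4704)
    print(polling_traffic(10, 10))  # (2630, 2830, 5460)
```

The printed tuples match the polling columns of Table 2 for the corresponding rows.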

When the CHECK MIB is used, we assume that checks usually pass with positive results, so it is sufficient to get the (positive) total result of all object value comparisons. The management station then needs to retrieve only a single varbind for each check, containing the check result that has been computed locally at the managed node. The management station can configure all checks in the CHECK MIB results table such that the results of all checks can be requested with a single get-bulk request. The get-bulk request contains only a single varbind:

$$T_{\mathrm{req}}\{\mathrm{CheckMib}\} = T\{\text{get bulk}\} = \mathrm{HDR} + \mathrm{MOID} \tag{5}$$

$$T_{\mathrm{resp}}\{\mathrm{CheckMib}\} = \mathrm{HDR} \cdot \left\lceil \frac{\#\mathrm{Check}}{\#\mathrm{Obj}_{\mathrm{PDU}}} \right\rceil + \#\mathrm{Check} \cdot (\mathrm{MOID} + \mathrm{MVALUE}) \tag{6}$$

MOID is the length of the OID of object ResultSeverity plus its index, and MVALUE = 2. For most practical applications all check results can be transmitted in a single PDU, such that:

$$\left\lceil \frac{\#\mathrm{Check}}{\#\mathrm{Obj}_{\mathrm{PDU}}} \right\rceil = 1 \tag{7}$$

Table 2 also shows the generated traffic Treq, Tresp and Ttot for the CHECK MIB.

A comparison of the traffic values in Table 2 highlights the different behavior of the two approaches. In polling mode, the bandwidth consumed increases linearly with the number of objects monitored. When retrieving 100 values the traffic generated is higher than 4.7 KBytes. With the CHECK MIB, instead, the traffic is independent of this parameter and depends only on the number of checks configured at the managed node.
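The CHECK MIB columns of Table 2 can be reproduced from equations (5) and (6) in the same way. The text does not give the OID lengths used here, so the sketch below infers them from the table: 17 bytes for the OID in the get-bulk request (Treq = 80 = 63 + 17) and 28 bytes for ResultSeverity plus its index in the response (30 bytes per varbind including MVALUE = 2). Both values, #ObjPDU = 27 and the function name are assumptions.

```python
import math

HDR = 63          # combined header length (from the text)
MVALUE = 2        # length of a check result value (from the text)
REQ_OID = 17      # assumption inferred from Table 2: Treq = 80 = HDR + 17
RESULT_OID = 28   # assumption inferred from Table 2: 30 bytes per result varbind
OBJ_PER_PDU = 27  # assumption: value reported for the testbed in Section 6


def check_mib_traffic(checks: int) -> tuple:
    """Return (Treq, Tresp, Ttot) in bytes when using the CHECK MIB."""
    t_req = HDR + REQ_OID                                   # equation (5)
    pdus = math.ceil(checks / OBJ_PER_PDU)                  # equals 1 in equation (7)
    t_resp = HDR * pdus + checks * (RESULT_OID + MVALUE)    # equation (6)
    return t_req, t_resp, t_req + t_resp


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, check_mib_traffic(n))
```

The result depends only on the number of checks, which is why the CHECK MIB columns repeat for every row with the same check count.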

5.2 Latency model

For modeling latency we focus on resource consumption at the managed node. To describe the different operations performed with the two approaches, we isolated the main steps that the managed node performs when handling the requests of a polling application and of an application using the CHECK MIB.

Figure 8 shows the steps performed by the managed node for the polling approach. For every get-request message, the managed node receives an IP packet that is parsed by the UDP/IP stack. Then the SNMP message must be processed by the SNMP application and pass authentication. For each contained object, access control is performed and the request is passed to the appropriate module in order to retrieve the value and to return it. The consumed resources R can be aggregated per packet and per object. The effort per packet is R{PKT} = R{IP+UDP} + R{SNMP} and the effort per object is R{Obj} = R{MOD} + R{VAR}. The total resources for polling can then be expressed as:

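$$R\{\mathrm{Polling}\} = \sum_{\mathrm{Checks}} \left( \#\mathrm{PDU}_{\mathrm{Check}} \cdot R\{\mathrm{PKT}\} + \#\mathrm{Obj}_{\mathrm{Check}} \cdot R\{\mathrm{Obj}\} \right) \tag{8}$$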

For the CHECK MIB, Figure 9 shows the steps performed at the managed node. There are two differences to the polling approach: packet processing is reduced, but additional resources R{COMP} are required for comparing object values in the checkEngine:

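$$R\{\mathrm{CheckMib}\} = \left\lceil \frac{\#\mathrm{Check}}{\#\mathrm{Obj}_{\mathrm{PDU}}} \right\rceil \cdot R\{\mathrm{PKT}\} + \sum_{\mathrm{Checks}} \#\mathrm{Obj}_{\mathrm{Check}} \cdot \left( R\{\mathrm{Obj}\} + R\{\mathrm{COMP}\} \right) \tag{9}$$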

Figure 8: Managed node operation for polling.


Figure 9: Managed node operation with the CHECK MIB.

Considering latency, we have to distinguish two cases: on-demand checks and scheduled checks. On-demand checks are started when requests arrive, while scheduled checks are performed asynchronously and the management station reads the result of the last scheduled check when it reads the check result. Consumed resources are the same in both cases, but in the scheduled case only some of them contribute to the latency:

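$$R\{\mathrm{CheckMib}\} = \left\lceil \frac{\#\mathrm{Check}}{\#\mathrm{Obj}_{\mathrm{PDU}}} \right\rceil \cdot R\{\mathrm{PKT}\} + \sum_{\mathrm{Checks}} R\{\mathrm{Obj}\} \tag{10}$$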

Considering equation (7), the latency for a typical fault management application including several scheduled checks is that of a single SNMP request-reply cycle.
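To make the structure of equations (8) to (10) concrete, the following sketch evaluates them for arbitrary unit costs. The cost values used in the usage example are purely hypothetical placeholders and are not measurements; the class and function names are likewise illustrative, and all checks are assumed to monitor the same number of objects.

```python
import math
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical unit costs in arbitrary units."""
    pkt: float   # R{PKT}: effort per packet
    obj: float   # R{Obj}: effort per object
    comp: float  # R{COMP}: effort per comparison in the checkEngine


def r_polling(checks: int, obj_per_check: int, c: Costs, obj_per_pdu: int = 27) -> float:
    """Equation (8): resources consumed at the managed node in polling mode."""
    pdus_per_check = math.ceil(obj_per_check / obj_per_pdu)
    return checks * (pdus_per_check * c.pkt + obj_per_check * c.obj)


def r_check_mib_on_demand(checks: int, obj_per_check: int, c: Costs, obj_per_pdu: int = 27) -> float:
    """Equation (9): resources for on-demand checks with the CHECK MIB."""
    pdus = math.ceil(checks / obj_per_pdu)
    return pdus * c.pkt + checks * obj_per_check * (c.obj + c.comp)


def r_check_mib_scheduled(checks: int, c: Costs, obj_per_pdu: int = 27) -> float:
    """Equation (10): the part of the resources that contributes to latency
    when the checks have already been executed by the scheduler."""
    pdus = math.ceil(checks / obj_per_pdu)
    return pdus * c.pkt + checks * c.obj


if __name__ == "__main__":
    costs = Costs(pkt=10.0, obj=1.0, comp=2.0)  # placeholder values
    print(r_polling(10, 10, costs))
    print(r_check_mib_on_demand(10, 10, costs))
    print(r_check_mib_scheduled(10, costs))
```

Note that the scheduled case does not take the number of objects per check as an input at all, which reflects the statement that its latency is that of a single request-reply cycle.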

6. Measurements

To validate the model described in the previous section, we conducted experiments in a test network, measuring traffic and latencies. The test network consisted of a management station, a managed node and a traffic measurement probe. A single managed node is sufficient for our purposes, because the conclusions for a large network can be extrapolated. The stations are connected by a 100 Mbit/s Ethernet hub. The Maximum Transmission Unit (MTU) on this medium is 1500 bytes, and for our experiments the maximum number of objects per PDU without fragmentation was #ObjPDU = 27.

We used the tcpdump utility for dumping SNMP packets, which we analyzed afterwards. With the GUI we configured the same set of checks in polling mode and with the CHECK MIB. We measured the traffic generated and the time needed to perform all the checks. When using the CHECK MIB, the time is the difference between the timestamp of the get-bulk message and that of the corresponding response; in polling mode, the time is the interval between the first get-request PDU and the last response of a cycle. We conducted different tests, varying the number of checks and the number of monitored objects per check. The checked objects contained interface status information at OID ifOperStatus.1. The measured values shown in Table 3 and Table 4 are all mean values of measurements over 200 cycles.
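The timing rule described above can be sketched as follows. The example assumes that the packet dump has already been reduced to per-cycle lists of (timestamp, direction) records; this record format and the function names are hypothetical and do not correspond to tcpdump output or to the authors' analysis scripts.

```python
from statistics import mean


def cycle_latency(packets):
    """Latency of one cycle in seconds.

    `packets` is a list of (timestamp_seconds, direction) tuples, where
    direction is "req" for a request and "resp" for a response. The latency
    is the interval between the first request and the last response, which
    covers both a polling cycle and a single get-bulk exchange.
    """
    first_request = min(t for t, d in packets if d == "req")
    last_response = max(t for t, d in packets if d == "resp")
    return last_response - first_request


def mean_latency(cycles):
    """Mean latency over all recorded cycles."""
    return mean(cycle_latency(c) for c in cycles)


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    polling_cycle = [(0.000, "req"), (0.004, "resp"), (0.005, "req"), (0.010, "resp")]
    bulk_cycle = [(1.000, "req"), (1.020, "resp")]
    print(mean_latency([polling_cycle, bulk_cycle]))
```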

6.1 Traffic measurements

The traffic measurements in Table 3 show that in polling mode the traffic increases almost linearly with the number of objects retrieved. The absolute values we obtained are also interesting: hundreds of kilobytes for monitoring tens of objects is certainly a considerable bandwidth consumption. For larger networks the values increase linearly with the number of managed nodes. We observed a small difference when a fixed number of objects is controlled with different sets of checks: for instance, controlling 100 objects with 5 checks required 782 KB, while with 10 checks the traffic increased to 892 KB. The difference is caused by a slightly higher number of PDUs when more checks are used, which increases the overhead of packet and SNMP headers.

As expected, the generated traffic is much lower when the CHECK MIB is used. The traffic no longer depends on the number of objects but only on the number of checks to be performed. The values are independent of the number of objects per check because only one object is accessed per check. For a total of 100 objects to be checked in 1 to 10 different checks, the generated traffic is less than 5% of the generated traffic in polling mode.
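The "Saved" column of Table 3 follows directly from the two traffic columns. The short sketch below recomputes it for a few rows; it assumes that the dot in the table values is a thousands separator (so that 782.000 means 782000 bytes, in line with the 782 KB quoted in the text). The function name is hypothetical.

```python
def saved_percent(polling_bytes: float, check_mib_bytes: float) -> float:
    """Share of the polling traffic that is avoided with the CHECK MIB, in percent."""
    return 100.0 * (1.0 - check_mib_bytes / polling_bytes)


if __name__ == "__main__":
    # (polling, CHECK MIB) pairs taken from Table 3, dots read as thousands separators.
    rows = [(142832, 25300), (782000, 25300), (285852, 38200), (892000, 38200)]
    for polling, check_mib in rows:
        print(polling, check_mib, round(saved_percent(polling, check_mib), 1))
```

For the two rows with 100 objects, the remaining traffic is about 3.2% and 4.3% of the polling traffic, which agrees with the "less than 5%" statement above.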

6.2 Latency measurements

Latency measurements showed a linear increase of latency for polling mode and for the on-demand checks with the CHECK MIB. Except when checking only a single object, the latency is larger when using the CHECK MIB than in polling mode. This effect is caused by the resources required for interpreting the rule table and performing the object comparisons in the checkEngine.

We had expected that the differences between polling mode and CHECK MIB that we identified in the model described in Section 5 would compensate each other, but the increase of latency due to the executed check operations dominated the decrease due to reduced data exchange with the management station.

The observed effect may be caused by our implementation, which uses regular snmp-get operations for retrieving values of local objects. An optimized implementation that accesses the managed objects more directly might be much more efficient, but it needs to be realized such that SNMP access control is not circumvented.

The latency is very short and constant for scheduled checks with the CHECK MIB. For the range of numbers of checks in our experiment, the result of all checks can be reported in a single SNMP packet. The values are already computed and just need to be encapsulated into varbinds. The resources needed for this step per object are negligible compared to the resources required per PDU.

Scheduled checks fully exploit the advantages of the CHECK MIB. The latency of a health check becomes very short from the point of view of the management station. This can be used for significantly increasing the scalability of management systems.

Table 3 Traffic measurements in bytes.

| #Checks | ObjTotal | #ObjCheck | Polling | Check MIB | Saved |
|---|---|---|---|---|---|
| 5 | 5 | 1 | 142.832 | 25.300 | 82.3% |
| 5 | 10 | 2 | 156.200 | 25.300 | 83.8% |
| 5 | 20 | 4 | 156.199 | 25.300 | 83.8% |
| 5 | 50 | 10 | 156.200 | 25.300 | 83.8% |
| 5 | 80 | 16 | 345.800 | 25.300 | 92.7% |
| 5 | 100 | 20 | 782.000 | 25.300 | 96.8% |
| 10 | 10 | 1 | 285.852 | 38.200 | 86.6% |
| 10 | 20 | 2 | 352.000 | 38.200 | 89.1% |
| 10 | 50 | 5 | 550.000 | 38.200 | 93.1% |
| 10 | 80 | 8 | 760.000 | 38.200 | 95.0% |
| 10 | 100 | 10 | 892.000 | 38.200 | 95.7% |

Table 4 Latency measurements in seconds.

| #Checks | ObjTotal | #ObjCheck | Polling | On-demand | Increased | Scheduled | Saved |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0.0066 | 0.0116 | 76% | 0.007 | -6% |
| 1 | 5 | 5 | 0.0112 | 0.2340 | 108% | 0.007 | 37% |
| 1 | 10 | 10 | 0.0153 | 0.0379 | 147% | 0.007 | 54% |
| 1 | 20 | 20 | 0.0249 | 0.0698 | 180% | 0.007 | 72% |
| 1 | 50 | 50 | 0.0501 | 0.1678 | 200% | 0.007 | 86% |
| 1 | 80 | 80 | 0.0773 | 0.2322 | 235% | 0.007 | 90% |
| 1 | 100 | 100 | 0.0938 | 0.3239 | 245% | 0.007 | 92% |
| 10 | 10 | 1 | 0.0178 | 0.0413 | 132% | 0.007 | 60% |
| 10 | 20 | 2 | 0.0282 | 0.0734 | 159% | 0.007 | 75% |
| 10 | 50 | 5 | 0.0586 | 0.1710 | 191% | 0.007 | 88% |
| 10 | 80 | 8 | 0.0816 | 0.2689 | 201% | 0.007 | 91% |
| 10 | 100 | 10 | 0.1024 | 0.3316 | 230% | 0.007 | 93% |

7. Conclusions

The migration to "All-IP" cellular networks imposes new requirements on network management. Some of these requirements cannot be satisfied in a traditional way with the basic Internet management framework. For managing a large number of base stations (NodeB's) with a large number of managed objects to be monitored per managed node, advanced management methods are required.

There are standardized MIB modules available that can solve this problem in general. These are very flexible and powerful, but as a consequence the complexity of use and implementation is high, as is the resource consumption.

This paper describes a solution using a customized MIB module, the CHECK MIB, designed for distributing health check operations to checked managed nodes. The CHECK MIB allows delegating sets of simple comparison operations by setting a very small number of objects. Analysis and measurements showed that the CHECK MIB achieves the expected improvement in scalability with a low-cost and lean implementation and with very little resource consumption.

The CHECK MIB aggregates the results of several compare operations into a single managed object that represents the result of the entire check. This reduces network traffic significantly. The latency of a check can also be reduced significantly if the managed nodes perform local checks regularly and the management station only collects the results without needing to wait for check execution.

We completed our study of the CHECK MIB approach with usability tests in a test network and developed a Graphical User Interface that allows controlling all features and serves as an example for the integration into a full-fledged Network Management System.

References[1] J.P.Martin-Flatin, S. Znaty, J.P. Hubaux, A survey of distributed enterprise network and sys-

tems management paradigms, Journal of Network and Systems Management, 7(1), March1999.

[2] J. Schoenwaelder, J. Quittek, C. Kappler, Building Distributed Management Applications withthe IETF Script MIB, IEEE Journal on Selected Areas in Communications, May 2000, pp.702-714.

[3] J. Quittek, M. Brunner, Applying and Evaluating Active Technologies in Distributed Manage-ment, Journal of Network and Systems Management: Vol. 11, No. 2, 2003.

[4] D. Levi and J. Schoenwaelder, Definitions of Managed Objects for the Delegation of Manage-ment Scripts, IETF RFC 3165, August 2001.

[5] B. Stewart, Distributed Management Expression MIB, IETF RFC2982, October 2000.[6] B. Stewart, Event MIB, IETF RFC 2981, October 2000.[7] K. McCloghrie, D. Perkins, J. Schoenwaelder, J. Case, M. Rose and S. Waldbusser, ”Structure

of Management Information Version 2 (SMIv2)”, IETF RFC 2578, April 1999.[8] R. Sprenkels, Bulk Transfers of MIB Data, Simple Times, Volume 7, Number 1, March, 1999.[9] The NET-SNMP Project, http://www.net-snmp.org/.

570 Session Twelve Fault Management