
Data Mining telecommunications network data

for fault management and development testing

R. Sterritt, K. Adamson, C.M. Shapcott, E.P. Curran
Faculty of Informatics, University of Ulster, Northern Ireland.

Abstract

Applying Data Mining and Knowledge Discovery to complex industrial problems is an increasing trend. The authors have been involved in researching and applying data mining to fault management and development testing of high-speed telecommunications systems. This paper discusses the strategies undertaken for data mining these applications using Telecommunications Management Network (TMN) data.

1 Introduction

This paper discusses strategies for data mining Telecommunications Management Network (TMN) data for both fault management and development testing purposes. The authors' collaborative experiences with NITEC (Northern Ireland Telecommunications Engineering Centre), an R&D lab of Nortel Networks, in this area are discussed. The research started in 1993 with an emphasis on simulation, which evolved into data mining in 1996.

First the telecommunications domain is described, followed by an overview of previous research programmes. Next, the data mining strategies for both fault management and development testing are discussed. Lastly, an evaluation and future plans are presented in the conclusion.

1.1 Telecommunications systems

Within the telecommunications industry the Synchronous Digital Hierarchy (SDH) is an international standard for broadband networks, offering increased bandwidth and sophisticated services (ITU [1]). This increased sophistication allows traditional voice, video on demand, ISDN data transfer and video conferencing to use the one network more efficiently and effectively. It is the leading infrastructure solution for internet backbone communications.

Data Mining II, C.A. Brebbia & N.F.F. Ebecken (Editors) © 2000 WIT Press, www.witpress.com, ISBN 1-85312-821-X

The management of this level of sophistication becomes more difficult, particularly when a fault in the network occurs (ITU [2]). The SDH multiplexers themselves and other network elements (NEs) have built-in recovery methods, and the behaviour of these NEs is highly specified by the ITU (formerly CCITT), ANSI and ETSI, and as such is deterministic.

When a fault or multiple faults do occur, the operator is presented with events which represent the symptoms. Yet due to the nature of networks it is not as simple as: when x arrives followed by y, then z has occurred. Y may arrive beforehand or not at all. When taking into account that other network components up or downstream of the fault also detect problems and issue alarm events, the symptoms grow, and fault identification becomes a difficult task. This occurs to such an extent that the behaviour of the multiplexer network has been described as effectively non-deterministic (Bouloutas [3]).

1.2 Previous work

The original aim of the research was to simulate the multiplexers on a parallel architecture to facilitate large-scale tests. The hypothesis was that, in industry, simulation is a cost-effective and rapid method for testing and designing new equipment and systems. Advances in the fields of mathematical modelling and parallel processing meant that it was then possible to address problems that had previously been unrepresentable due to constraints imposed by problem complexity and computational requirements.

The initial project, NEST (1993 [4]), focused on investigating how to simulate in real-time (emulate) an STM-1 multiplexer and network manager on a parallel processing environment. The NETEXTRACT project (1995 [5]) was then established to use evidential reasoning techniques (to ensure a tolerance for uncertainty) to validate and verify the emulation from its output data, to secure as far as possible a correct model (Adamson 1994 [6], Moore 1996 [7]). The GARNET project (1996 [8]) was set up to build on the NEST investigations: to develop a multi-processor implementation of the network elements to achieve real-time speeds, and to develop an intelligent graphical user interface to map the desired network topology to the parallel environment (Sterritt 1998-1 [9]).

The simulation approach had an inherent problem in that it required acquiring expert knowledge about a system that was itself under development - a moving target. Therefore this work, not unlike many other similar projects, suffered from the Knowledge Acquisition (KA) bottleneck. By the time the knowledge was acquired from the experts and the development of the parallel simulation/emulation models had taken place, the information would be out of date. The actual products being simulated had moved on to newer versions and development cycles.

The same advances in the fields of mathematical modelling and parallel processing also meant that it was now possible to search (mine) large amounts of data for hidden and useful patterns. The Knowledge Discovery (KD) approach initially avoids the KA bottleneck in that it mines the actual data (not the mind of the expert) and as such can keep abreast of developing versions, as long as the data can be gathered and pre-processed. Yet to achieve true KD, and not just perform Data Mining (DM), requires expert interpretation and evaluation. Therefore the two approaches both depend on the expert, but at different ends of their respective processes. The mined results offer the advantage that there are solid findings to work with.

The NETEXTRACT project had in reality always been about knowledge discovery, since it was extracting cause and effect networks from data - scale would be the new difference. Instead of extraction from all modelled systems data, the emphasis would shift to extraction from real network management data from the testing environment (Sterritt 1997-1 [10], 1997-2 [11], 1997-3 [12]).

Under the second year of the GARNET project, automated testing (Sterritt 2000 [13]) was developed in NITEC, and as such the KD architecture from the NETEXTRACT project was further refined to provide an assurance level for the auto tests (Sterritt 1998-2 [14], 1998-3 [15]).

1.3 Data Mining telecommunications network data - the two faces

Two distinct applications of data mining TMN data have evolved from the authors' research with NITEC:
(1) Fault Management;
(2) Manual and (from 1997) Automated Testing.

These have common ground in that both can involve mining TMN data, yet with a different emphasis. In fault management the desire is to correlate the events to such a degree as to facilitate prediction of the actual fault. In testing, the same mined correlations can be used to validate a set of test results, or mining can be used to spot anomalies in the test data.
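As an illustration of the correlation mining just described, a minimal sketch counts how often pairs of alarm types occur close together in time. The (timestamp, alarm_type) event format and the window value are illustrative assumptions, not the projects' actual data representation:

```python
from collections import Counter

def mine_pair_correlations(events, window=5.0):
    """Count co-occurrences of distinct alarm types arriving within
    `window` seconds of one another (events: (timestamp, alarm_type))."""
    events = sorted(events)
    pair_counts = Counter()
    for i, (t_i, alarm_i) in enumerate(events):
        for t_j, alarm_j in events[i + 1:]:
            if t_j - t_i > window:
                break  # events are time-ordered, so no later event matches either
            if alarm_i != alarm_j:
                pair_counts[tuple(sorted((alarm_i, alarm_j)))] += 1
    return pair_counts

events = [(0.0, "Comms fail"), (0.4, "Qecc-Comms_fail"),
          (0.9, "RS-LOS"), (30.0, "TU-AIS")]
counts = mine_pair_correlations(events)
```

Pairs with high counts relative to the individual alarm frequencies become candidates for correlation rules.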

As a product nears release, the data from a test environment becomes more relevant to fault management (apart from the events caused by test configuration), since it will start to match the final behaviour that will be prevalent in an operational network.

2. Data Mining

Data mining deals with the discovery of hidden knowledge, unexpected patterns and new rules from large databases. It is now generally considered the discovery stage in a much larger process (Fayyad [16], Uthurusamy [17]) - knowledge discovery in databases (KDD). Adriaans [18] presents a comprehensive introduction to data mining and KDD, particularly all the stages: data selection, cleaning, enrichment, coding, data mining, and reporting.
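Those pre-mining stages can be sketched as a chain of small transformations. The record fields and the severity lookup below are illustrative assumptions, not the schema used in the projects:

```python
def select(raw):
    """Data selection: keep only alarm events (hypothetical record format)."""
    return [e for e in raw if e.get("kind") == "alarm"]

def clean(events):
    """Cleaning: drop records with a missing timestamp."""
    return [e for e in events if e.get("time") is not None]

def enrich(events):
    """Enrichment: attach a severity level (illustrative lookup table)."""
    severity = {"Comms fail": "critical", "TU-AIS": "minor"}
    return [dict(e, severity=severity.get(e["type"], "major")) for e in events]

def encode(events):
    """Coding: map the categorical severity onto an integer scale."""
    levels = {"minor": 0, "major": 1, "critical": 2}
    return [dict(e, sev_code=levels[e["severity"]]) for e in events]

def kdd_prepare(raw):
    """The pre-mining KDD stages chained in order; the mining and
    reporting stages would consume the records this returns."""
    return encode(enrich(clean(select(raw))))

raw = [
    {"kind": "alarm", "type": "Comms fail", "time": 1.0},
    {"kind": "login", "user": "op1", "time": 2.0},
    {"kind": "alarm", "type": "TU-AIS", "time": None},
]
prepared = kdd_prepare(raw)
```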


2.1 Data Mining for TMN fault management

2.1.1 Overview
Global telecommunication systems are built with extensive redundancy and complex management systems to ensure robustness. Fault identification and management of this complexity is an open research issue with which data mining can greatly assist.

2.1.2 Faults, Events and Masking
A fault is a malfunction that has occurred either in the hardware or software on the network. This can be due to an external force, for example a digger cutting through the fibre cable, or an internal fault such as a card failure.

An event is an occurrence on the network. Those that relate to the management of the network are recorded by the Element Controller (EC; historically referred to as the Element Manager - EM). In older releases a recorded event equated to an alarm. This is no longer the case; other examples of events are user logins and user actions such as a protection switch.

There are numerous types of alarm events that may be generated within a Network Element (NE), typically around 100 types. An example of a critical alarm is a 'Comms fail' alarm. An alarm exists for a time period; thus under normal circumstances an alarm present event will be accompanied by an alarm clear event.

Each alarm type is assigned a severity level of Critical, Major or Minor by the network management system, depending on the severity of the fault indicated by the alarm type. For example, the alarm type 'Comms fail' has a critical severity level, while other alarms such as 'Tributary Unit Alarm Indication Signal' (TU-AIS) have a minor severity level.

The occurrence of a fault can cause numerous alarm events to be raised from an individual NE; this means that the alarms are often inter-related (and thus the desire to correlate). A fault may also trigger numerous similar and different alarms (and indeed alarm types) in different NEs up or downstream on the network. For example the Comms fail alarm, raised by the management system if it cannot maintain a communications channel to the indicated NE, may cause other alarms such as RS-LOS, RS-LOF, Qecc-Comms_fail, MS-EXC or even laser alarms, depending on the fault and configuration. The Qecc-Comms_fail alarm indicates that the NE cannot communicate via the Embedded Control Channel (ECC) of the indicated STM-N card with the neighbouring NE.

Alarms can be generated exponentially in different NEs throughout the network due to certain fault conditions; the larger the network, the greater the number of alarms that will be generated. It is therefore essential for the NEs to provide some correlation of the different alarms that are generated, so that the EC is not flooded with alarms and only those with high priorities are transmitted.


This is handled in three sequential transformations: alarm monitoring, alarm filtering and alarm masking. These mean that if the raw state of an alarm instance changes, an alarm event is not necessarily generated.

Alarm monitoring takes the raw state of an alarm and produces a 'monitored' state. Alarm monitoring is enabled/disabled on a per alarm instance basis. If monitoring is enabled, the monitored state is the same as the raw state; if disabled, the monitored state is clear.

Alarm filtering is also enabled/disabled on a per alarm instance basis. An alarm may exist in any one of three states (Present, Intermittent or Clear), depending on how long the alarm is raised for. Assigning these states, by checking for the presence of an alarm within certain 'filtering' periods, constitutes alarm filtering.

Alarm masking is designed to prevent the unnecessary reporting of alarms. A masked alarm is inhibited from generating reports if an instance of its superior alarm is active and fits the 'masking' periods. A 'masking hierarchy' determines the priority of each alarm type. Alarm masking is also enabled/disabled on a per alarm instance basis.

If an alarm changes state at any time, the network management system must be informed. The combination of alarm monitoring, masking and filtering makes alarm handling within the NEs quite complex.
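The three transformations can be sketched as follows. The state names follow the text, but the filtering period value, the function signatures and the masking-hierarchy representation are illustrative assumptions:

```python
def monitored_state(raw_state, monitoring_enabled):
    """Alarm monitoring: the monitored state mirrors the raw state when
    monitoring is enabled for the alarm instance, otherwise it is clear."""
    return raw_state if monitoring_enabled else "clear"

def filtered_state(seconds_raised, present_period=2.5):
    """Alarm filtering: assign Present/Intermittent/Clear from how long
    the alarm has been raised (the period value is illustrative)."""
    if seconds_raised >= present_period:
        return "present"
    if seconds_raised > 0:
        return "intermittent"
    return "clear"

def is_masked(alarm_type, active_alarms, masking_hierarchy):
    """Alarm masking: an alarm is inhibited from reporting when any of
    its superiors in the masking hierarchy is currently active."""
    superiors = masking_hierarchy.get(alarm_type, ())
    return any(sup in active_alarms for sup in superiors)
```

Chaining the three shows why a raw state change need not produce an event: the change can be absorbed at any stage.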

The simple example of inter-connecting alarms above, and the transformations, should have illustrated that fault determination is not a straightforward process. The combinations of possible alarm events, and the times at which they are received at the EC, are numerous. Added to this complexity is the fact that individual alarms can be configured in different states, such as 'Masking Disabled' or 'Masking Enabled', and the network in different states, such as '1+1 protection' or 'unprotected'.

2.1.3 Event correlation
At the heart of alarm event correlation is the determination of the cause. The alarms represent the symptoms and as such, in the global scheme, are not of general interest once the failure is determined [19]. There are two real-world concerns: (1) the sheer volume of alarm event traffic when a fault occurs; (2) determining the cause, not the symptoms.

The types of correlation described previously meet criterion (1), which is vital. They focus on reducing the volume of alarms, but do not necessarily meet criterion (2), determining the actual cause - this is left to the operator to determine from the reduced set of higher-priority alarms.

Ideally, a single technique would tackle both these concerns. Artificial Intelligence (A.I.) and Data Mining offer that potential, and this has been and remains an active area of research to assist fault management.

2.1.4 Event correlation - the Bayesian network way
The authors' research [7] does deal with both criteria (the volume of alarms, and the cause not the symptoms) using probabilistic reasoning techniques [10]. The cause and effect graph can be considered a complex form of alarm correlation. The alarms are connected by edges that indicate the probabilistic strength of correlation. Yet the cause and effect network can contain more than just alarms as variables - actual faults can be included as variables.

Data Mining is used to produce the probabilistic network by correlating offline alarm event data; the cause is then deduced from live alarm events using this probabilistic network.

2.1.5 Data Mining the Bayesian Network - Induction
In this case, as in many others, the structure of the graphical model (the Bayesian net) is not known in advance, but there is a database of information concerning the frequencies of occurrence of combinations of different variable values (the alarms). The problem is then one of induction: to induce the structure from the data. Heckerman gives a good description of the problem [20][21]. There has been much work in this area, including that of Cooper and Herskovits [22]. Unfortunately the general problem is NP-hard [23]. For a given number of variables there is a very large number of potential graphical structures that could be induced. To determine the best structure, in theory one should fit the data to each possible graphical structure, score the structure, and then select the structure with the best score. Consequently, once the number of variables reaches a reasonable size, algorithms for learning networks from data are usually heuristic.
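The scale of that search space is easy to demonstrate: Robinson's recurrence counts the labelled directed acyclic graphs on n nodes, and the count grows super-exponentially. A short sketch (the recurrence is standard; casting it as Python is ours):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Robinson's recurrence for the number of labelled DAGs on n nodes."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

# Even 10 variables give over 4e18 candidate structures; with around
# 100 alarm types, exhaustive scoring is hopeless, hence heuristic search.
```

A typical heuristic is greedy hill-climbing: start from an empty graph and repeatedly apply the single edge addition, deletion or reversal that most improves the score.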

2.1.6 Data Mining for additional simple rules
In practice, when it comes to learning the cause and effect graph, the volume of event traffic and the correlation of alarms can be reduced by simple first-stage correlation (generally pattern matchers). The expert system approach (in this case deduction from the probabilistic network) can then handle the remaining, more complex problems, taking advantage of the much reduced and enriched stream of events.

As such, the authors have now designed and developed a simple first-stage event correlator. Rules for the system can be written from results mined with tools such as Clementine, for example: when a Comms fail alarm occurs, it is likely that a Qecc-Comms_fail alarm will be injected into the network.
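A first-stage correlator of this kind can be sketched as a pattern matcher that suppresses consequent alarms arriving shortly after their mined antecedent. The rule table, event format and window are illustrative assumptions, not the system's actual rules:

```python
RULES = {
    # mined antecedent -> consequent alarm types it typically triggers
    # (this table is illustrative, not the project's mined rule set)
    "Comms fail": {"Qecc-Comms_fail", "RS-LOS", "RS-LOF"},
}

def correlate(events, window=5.0):
    """Suppress consequent alarms arriving within `window` seconds of a
    matching antecedent; events are (timestamp, alarm_type) tuples."""
    kept, suppressed = [], []
    for t, alarm in sorted(events):
        antecedent = next((ka for kt, ka in kept
                           if alarm in RULES.get(ka, ()) and t - kt <= window),
                          None)
        if antecedent is None:
            kept.append((t, alarm))
        else:
            suppressed.append((t, alarm))
    return kept, suppressed

kept, suppressed = correlate([(0.0, "Comms fail"),
                              (0.3, "Qecc-Comms_fail"),
                              (1.0, "TU-AIS")])
```

The expert system stage would then reason over the much smaller `kept` stream.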

It is envisaged that these additional rules could potentially be adapted to extend the existing correlation system in an element manager.

2.2 Data Mining for development testing

2.2.1 Overview
Within NITEC, high-capacity broadband transmission and switching equipment is designed and developed. This complex mix of hardware, software and firmware must conform to international standards to facilitate heterogeneous global networks.

During the development cycle for each release of a product, a significant proportion of the time is taken up with testing, commonly estimated at 60%. As the product becomes larger and more complex, the ability to comprehensively test and verify its operation within the decreasing timeframe to market becomes increasingly difficult. Automation offers the potential to decrease this overhead.

Traditional manual testing of telecommunications equipment was expensive in terms of time spent, costs involved, and even the de-motivation of specialised engineers due to the repetitive tasks. Automation offered a competitive advantage in terms of reduced cost, reduced time to market, enhanced quality, and the 'freeing-up' of specialised engineers for further investigating and solving problem areas discovered during testing.

The disadvantage of automation is that the experimental approach to testing is lost. A test script will not spot anomalous behaviour that an engineer would have.

Automation offers a rich data trail which can then be utilised to compensate for the loss of live experimentation by the engineer. Hidden in that data should be indications that any anomalies have occurred.

2.2.2 Mining for Test Assurance
The initial mining that took place under the NETEXTRACT project had limited results because, in manual testing, the user actions were not recorded. Automated test scripts use a command interface to inject 'user action' events into the network. These commands can be used to model potential faults in the network, for example disabling a card to model a card failure. These commands are now also available to mine in conjunction with the resultant alarm data.

Since the data resulting from these commands is the TMN alarm event data discussed previously, the same mining approaches can be utilised in this testing environment to provide assurance in place of live test engineer experimentation.

Each execution of an individual test leaves behind a statistical 'footprint' which can be presented graphically, i.e. as a Bayesian network, to assist in classifying a pass/fail.

Rules can be defined for specific test environment behaviour amongst the alarm events, for example a TU-AIS alarm being raised on 15 ports instead of 1 port due to the daisy-chain test environment configuration.

Mining can also be applied to hidden behaviour across a test script's results from different executions over a period of time, with the aim of finding any anomalies that may indicate a fail.

2.2.3 Footprints
The assumption is that the footprints can be utilised for a wider-based identification (classification) of a pass or a fail of an individual test. Where a sufficiently large number of pass and fail tests is available, it should be possible to use classification techniques, such as a neural network, to generate a pass/fail classifier for automated testing. Yet a drawback of many classification techniques, including neural networks, is that they do not provide any explanation of the decision.
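With labelled footprints available, even a crude classifier illustrates the idea. This nearest-neighbour sketch over alarm-type counts is a stand-in for the neural network mentioned above; the footprint representation is an assumption:

```python
from collections import Counter

def footprint(alarm_events):
    """A test run's statistical 'footprint': a count of each alarm type
    raised during the run (assumed (timestamp, alarm_type) records)."""
    return Counter(alarm_type for _, alarm_type in alarm_events)

def distance(fp_a, fp_b):
    """L1 distance between two footprints over all alarm types seen."""
    return sum(abs(fp_a[k] - fp_b[k]) for k in set(fp_a) | set(fp_b))

def classify(fp, labelled):
    """Label a new footprint with the verdict of its nearest labelled one."""
    _, verdict = min(labelled, key=lambda lv: distance(fp, lv[0]))
    return verdict

labelled = [(Counter({"TU-AIS": 1}), "pass"),
            (Counter({"TU-AIS": 15, "Comms fail": 1}), "fail")]
run = footprint([(0.1, "TU-AIS"), (0.2, "TU-AIS")])
```

Like a neural network, this gives a verdict without an explanation, which motivates the probabilistic-network footprint discussed next.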


Inducing a probabilistic network from the data provides a much more visual footprint. Probabilistic networks, in which relationships between variables can be represented by the existence of links between them, have an intuitive appeal. They are easy to "read" if represented graphically and can summarise fairly complex relationships succinctly.

The approach offers great promise. From initial experimentation it would appear that the nets can be used as a classification technique. They cover all events that have occurred during a test and therefore provide the means to make up for the lack of a test engineer at the scene monitoring for anomalous activity.

3. Conclusion

3.1 Evaluation

Over the years this research has produced a useful study of applying different data mining techniques to several problems in the telecommunications domain. Since 1997, several data sets, usually of a month's duration, have been gathered each year from the environments to assist in this work. The most promising areas are fault management and identification, and test assurance in the equipment's R&D lifecycle.

3.2 Future work

Under the new JIGSAW project, as part of a data warehouse strategy, databases are being established to store all the necessary data from this point forward. This will enable the data mining applications discussed here to be used on a day-to-day basis, for instance as part of a decision support system with data mining at its core (Schuster [24]).

Acknowledgements

We are greatly indebted to our industrial collaborators at the Northern Ireland Telecommunications Engineering Centre (NITEC), Nortel Networks, who have supported our research for many years. We would also like to thank the EU (Stride programme 1993-95), EPSRC (AIKMS programme 1995-97) and IRTU (Start programme 1996-99) for funding this work.

References

[1] ITU, Types and General Characteristics of SDH Multiplexing Equipment, ITU-T (previously CCITT) Recommendation G.782, 1990.

[2] ITU, SDH Management, ITU-T (previously CCITT) Recommendation G.784, 1990.

[3] Bouloutas, A.T., Calo, S. and Finkel, A., Alarm Correlation and Fault Identification in Communication Networks, IEEE Transactions on Communications, Vol. 42, No. 2/3/4, Feb/Mar/Apr 1994.

[4] EU/STRIDE 'NEST' Project; collaborators: University of Ulster, Queen's University of Belfast and Northern Telecom, 1993-95.

[5] EPSRC & DTI/AIKMS 'NetExtract' Project; collaborators: University of Ulster, Nortel (Northern Telecom) and Transtech Parallel Systems, 1995-1997.

[6] Adamson, K., A Knowledge Based Approach to Real Time Systems Modelling, Proc. of the 12th Int. IASTED Conf. on Applied Informatics, pp. 1-3, 1994.

[7] Moore, P., Shao, I., Adamson, K., Hull, M.E.C., Bell, D.A., Shapcott, M., An Architecture for Modelling Non-Deterministic Systems Using Bayesian Belief Networks, Proc. of the 14th Int. IASTED Conf. on Applied Informatics, pp. 254-257, 1996.

[8] IRTU/START 'GARNET' Project; collaborators: University of Ulster and Nortel, 1996-1999.

[9] Sterritt, R., Curran, E.P., Adamson, K., Towards a Graphical and Real-Time Network Simulation Toolset, eds. Adey, R.A., Rzevski, G., Nolan, P., Applications of Artificial Intelligence in Engineering XIII, CMP: Southampton, CD-ROM pp. 210-227, 1998.

[10] Sterritt, R., Daly, M., Adamson, K., Shapcott, M., Bell, D.A., McErlean, F., NETEXTRACT: An Architecture for the Extraction of Cause and Effect Networks from Complex Systems, Proc. of the 15th Int. IASTED Conf. on Applied Informatics, pp. 55-57, 1997.

[11] Sterritt, R., Adamson, K., Shapcott, M., Bell, D.A., McErlean, F., Using A.I. for the Analysis of Complex Systems, Proc. of the Int. IASTED Conf. on AI and Soft Computing, pp. 105-108, 1997.

[12] Sterritt, R., Adamson, K., Shapcott, M., Wells, N., Bell, D.A., Lui, W., P-CAEGA: A Parallel Genetic Algorithm for Cause and Effect Networks, Proc. Int. IASTED Conf. on AI and Soft Computing, pp. 105-108, 1997.

[13] Sterritt, R., Shapcott, C.M., Adamson, K., Curran, E.P., Calvert, W., Johnson, R., Designing and Implementing an Automated Testing Approach for the Development of High Speed Telecommunication Equipment, accepted for the 18th Int. IASTED Conf. on Applied Informatics, 2000.

[14] Sterritt, R., Adamson, K., Shapcott, C.M., Curran, E.P., Adapting an Architecture for Knowledge Discovery in Complex Telecommunication Systems for Testing Assurance, Proc. NIMES 98 Conf. on Complex Systems, Intelligent Systems and Interfaces, pp. 37-39, 1998.

[15] Sterritt, R., Curran, E.P., Adamson, K., Shapcott, C.M., Application of AI for Automated Testing in Complex Telecommunication Systems, Proc. EXPERSYS 98, 10th Int. Conf. on Artificial Intelligence Applications, pp. 97-102, 1998.

[16] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., From Data Mining to Knowledge Discovery: An Overview, Advances in Knowledge Discovery & Data Mining, AAAI Press & The MIT Press: California, pp. 1-34, 1996.

[17] Uthurusamy, R., "From Data Mining to Knowledge Discovery: Current Challenges and Future Directions", Advances in Knowledge Discovery & Data Mining, AAAI Press & The MIT Press: California, pp. 561-569, 1996.

[18] Adriaans, P., Zantinge, D., Data Mining, Addison-Wesley: England, 1996.

[19] Harrison, K., "A Novel Approach to Event Correlation", HP Intelligent Networked Computing Lab, HP Labs, Bristol, HP-94-68, July 1994, pp. 1-10.

[20] Heckerman, D., "Bayesian Networks for Data Mining", Data Mining and Knowledge Discovery, 1, pp. 79-119, 1997.

[21] Heckerman, D., "Bayesian Networks for Knowledge Discovery", eds. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, pp. 273-305, 1996.

[22] Cooper, G.F. and Herskovits, E., "A Bayesian Method for the Induction of Probabilistic Networks from Data", Machine Learning, 9, pp. 309-347, 1992.

[23] Chickering, D.M. and Heckerman, D., "Learning Bayesian Networks is NP-hard", Technical Report MSR-TR-94-17, Microsoft Research, Microsoft Corporation, 1994.

[24] Schuster, A., Sterritt, R., Adamson, K., Curran, E.P., Shapcott, C.M., Towards a Decision Support System for Automated Testing of Complex Telecommunication Networks, submitted for publication at the IEEE Int. Conf. on Systems, Man and Cybernetics, 2000.