1 AI Approaches to Network Fault Management Andrew Learn 29 Nov 2001



AI Approaches to Network Fault Management

Andrew Learn

29 Nov 2001


Outline

• Fault Management Process

• AI Approaches
  – Expert Systems
  – Neural Networks
  – Case-based Reasoning


Network Faults

• Hardware
  – Wear and tear
  – Cut cables
  – Improper installation

• Software
  – Incorrect design
  – Bugs
  – Incorrect data (e.g. routing tables)


Fault Management Process

1. Collect alarms

2. Filter and correlate alarms

3. Diagnose faults

4. Restoration and repair

5. Evaluate effectiveness


1. Collect Alarms

• Types of alarms
  – Physical: Failure in communication
    • e.g. loss of signal, CRC failure
  – Logical: Statistical values exceed threshold
    • e.g. number of packets dropped

• Communication with components
  – Control protocol: Simple Network Management Protocol (SNMP)
  – Data format: Management Information Base (MIB-II, 1990) has ~170 manageable objects


• Sample MIB Entry

    ipInReceives OBJECT-TYPE
        SYNTAX  Counter
        ACCESS  read-only
        STATUS  mandatory
        DESCRIPTION
            "The total number of input datagrams received from
            interfaces, including those received in error."
        ::= { ip 3 }

• Sample SNMP “get” call

    snmpget netdev-kbox.cc.cmu.edu public system.sysUpTime.0

    Name: system.sysUpTime.0
    Timeticks: (2270351) 6:18:23
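The Timeticks value in the reply counts hundredths of a second, so the number in parentheses and the 6:18:23 next to it are the same quantity. A minimal Python sketch of that conversion follows; the function name is illustrative and not part of any SNMP tool.

```python
def timeticks_to_hms(ticks: int) -> str:
    """Convert SNMP TimeTicks (hundredths of a second) to h:mm:ss."""
    total_seconds = ticks // 100              # drop the fractional hundredths
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


print(timeticks_to_hms(2270351))  # prints 6:18:23
```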


2. Filter and Correlate Alarms

• Filter
  – Eliminate redundant alarms
  – Suppress noncritical alarms
  – Inhibit low-priority alarms in presence of high-priority alarms

• Correlate
  – Analyze and interpret multiple alarms to assign new meaning (derived alarm)
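The filter and correlate steps can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Alarm` fields, the numeric priority scale, the `min_priority` floor, and the single correlation rule (two or more sources losing signal suggests a shared link). A real system would use its own alarm model and many such rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str      # component that raised the alarm
    kind: str        # e.g. "loss_of_signal"
    priority: int    # higher number = more urgent (assumed scale)


def filter_alarms(alarms, min_priority=2):
    """Apply the three filtering steps: dedupe, suppress, inhibit."""
    # 1. Eliminate redundant alarms (identical source/kind/priority).
    unique = list(dict.fromkeys(alarms))
    # 2. Suppress noncritical alarms below the assumed priority floor.
    critical = [a for a in unique if a.priority >= min_priority]
    # 3. Inhibit lower-priority alarms from a source that also has a
    #    higher-priority alarm outstanding.
    top = {}
    for a in critical:
        top[a.source] = max(top.get(a.source, a.priority), a.priority)
    return [a for a in critical if a.priority == top[a.source]]


def correlate(alarms):
    """Derive one new alarm when several sources report loss of signal."""
    lost = sorted(a.source for a in alarms if a.kind == "loss_of_signal")
    if len(lost) >= 2:
        return [Alarm("+".join(lost), "shared_link_suspected", 5)]
    return list(alarms)


raw = [
    Alarm("BS-1", "loss_of_signal", 4),
    Alarm("BS-1", "loss_of_signal", 4),   # duplicate, eliminated
    Alarm("BS-1", "crc_failure", 3),      # inhibited by the priority-4 alarm
    Alarm("BS-2", "loss_of_signal", 4),
    Alarm("BS-2", "fan_warning", 1),      # noncritical, suppressed
]
kept = filter_alarms(raw)
derived = correlate(kept)
print(kept)
print(derived)
```

Five raw alarms reduce to two after filtering, and correlation replaces those two with a single derived alarm.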


3. Diagnose Faults

• May require additional tests/diagnostics on circuits or components
  – Automated or manual

• Analyze all info from alarms, tests, performance monitoring

• Identify smallest system module that needs to be repaired or replaced


4. Restoration and Repair

• Restoration: Continue service in presence of fault
  – Switch over to spares
  – Reroute around trouble spot
  – Restore software or data from backup

• Repair
  – Replace parts
  – Repair cables
  – Debug software

• Retest to verify fault is eliminated


5. Evaluate Effectiveness

• Questions to answer:
  – How often do faults occur?
  – How many faults affect service?
  – How long is service interrupted?
  – How long to repair?

• Provides assessment of:
  – Performance of fault management system
  – Reliability of equipment
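The four questions map onto simple statistics over a fault log. The Python sketch below assumes a hypothetical record layout (`outage_minutes`, `repair_minutes`) and made-up sample values; it is meant only to show how the answers could be computed.

```python
from dataclasses import dataclass


@dataclass
class FaultRecord:
    outage_minutes: float   # time service was interrupted (0 if none)
    repair_minutes: float   # time from detection to verified repair


def evaluate(records, period_days):
    """Answer the four questions for one reporting period."""
    affecting = [r for r in records if r.outage_minutes > 0]
    n = len(records)
    return {
        # How often do faults occur?
        "faults_per_day": n / period_days,
        # How many faults affect service?
        "service_affecting_share": len(affecting) / n if n else 0.0,
        # How long is service interrupted?
        "mean_outage_minutes": (
            sum(r.outage_minutes for r in affecting) / len(affecting)
            if affecting else 0.0
        ),
        # How long to repair?
        "mean_repair_minutes": (
            sum(r.repair_minutes for r in records) / n if n else 0.0
        ),
    }


# Illustrative log: four faults over two days, two of them service-affecting.
log = [FaultRecord(0, 30), FaultRecord(20, 90),
       FaultRecord(40, 60), FaultRecord(0, 20)]
report = evaluate(log, period_days=2)
print(report)
```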


AI Approaches to Fault Management

• Well-developed approach:
  – Expert systems

• New approaches:
  – Neural networks
  – Case-based reasoning
  – Other


Why AI?

• Need for intelligence
  – Data analysis
  – Pattern recognition
  – Clustering and categorization
  – Problem solving

• Need for automation
  – Manual analysis/solution takes time
  – Limited manpower
  – Limited expertise


Well-developed approach: Expert Systems

• Expert systems = Rule base + Working memory

• Three parts to rules:
  1. Context trigger (when should rule be considered)
  2. Condition (if X ...)
  3. Conclusion (... then Y)

• Used since the 1980s by major telecom companies
  – Bell: Automated Cable Expertise (ACE) system
  – GTE: Central Office Maintenance Printout Analysis & Suggestion System (COMPASS)
  – AT&T: Network Management Expert System (NEMESYS)
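The rule base plus working memory structure, with its three-part rules, can be sketched as a tiny forward-chaining loop in Python. This is not how ACE, COMPASS, or NEMESYS were built; the `Rule` class, the fact names, and the two sample rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    context: str                         # context trigger
    condition: Callable[[dict], bool]    # "if X"
    conclusion: tuple                    # "then Y": (key, value) to assert


def run_rules(rules, memory, context):
    """Forward-chain until no rule adds a new fact to working memory."""
    fired = []
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # Skip rules outside the active context or already fired.
            if rule.context != context or rule.name in fired:
                continue
            if rule.condition(memory):
                key, value = rule.conclusion
                memory[key] = value      # conclusion becomes a new fact
                fired.append(rule.name)
                changed = True
    return fired


rules = [
    Rule("isolate-link", "diagnosis",
         lambda m: m.get("link_suspect") is True,
         ("action", "test microwave link")),
    Rule("suspect-link", "diagnosis",
         lambda m: bool(m.get("bs1_down") and m.get("bs2_down")),
         ("link_suspect", True)),
]
memory = {"bs1_down": True, "bs2_down": True}
fired = run_rules(rules, memory, "diagnosis")
print(fired, memory["action"])
```

The second rule fires first and writes `link_suspect` into working memory, which is what allows the first rule to fire on the next pass. A situation that no rule anticipates simply produces no conclusion.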


Need for New Approaches

• Weaknesses of expert systems
  – Brittle in unforeseen situations
  – Cannot learn from experience
  – Hard to maintain (adding/deleting/modifying rules)
  – Knowledge acquisition bottleneck
  – Can’t handle incomplete or probabilistic data

• Factors driving new approach
  – Rapidly changing technology
  – Dynamic network topology
  – Network complexity
  – Competition, demand for QoS


Neural Nets

• Structure: input, hidden, output layers

• Training
  – Supervised: Input pattern & desired output
  – Unsupervised: Clustering of similar inputs

[Figure: network diagram with Input, Hidden, and Output layers connected by weights]
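The input, hidden, and output structure can be sketched as a forward pass in plain Python. The sigmoid activation, the layer sizes, and the hand-picked weights are assumptions; training would adjust the weights, and that step is not shown.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum of inputs, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate an input pattern through the hidden and output layers."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Placeholder weights chosen by hand; a trained network would learn them.
hidden_w = [[1.0, -1.0, 0.5], [-0.5, 1.0, 1.0]]   # 3 inputs -> 2 hidden units
hidden_b = [0.0, 0.0]
output_w = [[1.0, -1.0]]                          # 2 hidden -> 1 output unit
output_b = [0.0]

# Each input could stand for the presence (1) or absence (0) of an alarm.
outputs = forward([1, 0, 1], hidden_w, hidden_b, output_w, output_b)
print(outputs)
```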


Neural Nets

• Advantages
  – Pattern matching & generalization
  – Fast & efficient
  – Trainable
  – Handles incomplete, ambiguous data

• Disadvantages
  – Black box
  – Lack of training data


Neural Net Example

• Example: Alarm correlation in cell phone networks (Univ of Hannover, Germany)

[Figure: cell phone network showing mobile units, base stations (BS1, BS2), microwave links, the base station controller (BSC), switching centers, and the maintenance center (MC)]


Neural Net Example

[Figure: neural network relating alarms (BSC alarms, BS-1 alarms, BS-2 alarms, ...) to an initial cause (ML-1 fault, ML-2 fault, ...)]

• Test results:
  – 94 alarms
  – 99.76% correct classification with up to 25% noise


Case-Based Reasoning

• Case-based reasoning = matching previous examples
  – Case library: Set of previous faults, diagnoses, solutions
  – Usually based on “trouble ticket” help-desk databases

• Design considerations:
  – What are key attributes of a case?
  – What attributes will be used to index & access a case?


Case-Based Reasoning

• Advantages
  – Easier knowledge acquisition than expert systems
  – Can learn by adding new cases
  – Doesn’t require extensive maintenance

• Disadvantages
  – Requires time-consuming user interaction
  – No help for first-time problems


Case-Based Reasoning Example

Case 134

Problem Type: Performance

Description: High error rate in comm between POA-SP & DF

No access: Intermittent

Retrieval: Case 103 [Similarity = 0.69]

Description: 64kb line from VendorX drops big datagrams.

Additional Info requested: Is there loss of big datagrams in ping test? (Result: Yes)

Cause: Link 34 inside Bldg 207 was defective

Solution: Vendor replaced cabling.
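Retrieval by similarity, as in the example above, can be sketched as weighted attribute matching. The slide does not say how its 0.69 score was computed, so this is only one plausible scheme; the attribute names, weights, case IDs, and case contents below are all hypothetical.

```python
def similarity(query, case, weights):
    """Weighted share of attributes on which query and case agree."""
    total = sum(weights.values())
    matched = sum(w for attr, w in weights.items()
                  if query.get(attr) == case.get(attr))
    return matched / total


def retrieve(query, library, weights):
    """Return (case_id, score) of the most similar stored case."""
    scored = [(cid, similarity(query, case, weights))
              for cid, case in library.items()]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical index attributes, weights, and case library.
weights = {"problem_type": 2.0, "access": 1.0, "line_type": 1.0}
library = {
    201: {"problem_type": "Connectivity", "access": "None",
          "line_type": "64kb"},
    202: {"problem_type": "Performance", "access": "Intermittent",
          "line_type": "64kb"},
}
query = {"problem_type": "Performance", "access": "Intermittent",
         "line_type": "T1"}

best = retrieve(query, library, weights)
print(best)  # prints (202, 0.75)
```

The choice of index attributes and their weights is exactly the design consideration the slides raise: it determines which past trouble tickets are retrieved for a new problem.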


Summary of 3 AI Methods

• Expert systems
  – If / then rules
  – Well-developed technology
  – Brittle, hard to maintain

• Neural networks
  – Output = weighted transform of inputs
  – Fast pattern matching, robust to noise
  – Black box, lack of training data

• Case-based systems
  – Trouble-ticket retrieval
  – Easy to build, maintain
  – Slower diagnosis, takes time to build


Other Approaches

• Bayesian networks
  – Model statistical probabilities and dependence of faults

• Mobile intelligent agents
  – Independent software agents cooperate to collect info, suggest solutions
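For the Bayesian network approach, the smallest possible case is two nodes, a fault that may cause an alarm. The Python sketch below applies Bayes' rule to that case; the three probabilities are assumed values chosen for illustration, not measurements.

```python
def posterior_fault(prior, p_alarm_given_fault, p_alarm_given_ok):
    """P(fault | alarm) for a two-node network: fault -> alarm."""
    # Total probability of seeing the alarm, with or without the fault.
    p_alarm = p_alarm_given_fault * prior + p_alarm_given_ok * (1.0 - prior)
    return p_alarm_given_fault * prior / p_alarm


# Assumed numbers: faults are rare, the alarm is sensitive but has
# a 5% false-alarm rate.
p = posterior_fault(prior=0.01, p_alarm_given_fault=0.9, p_alarm_given_ok=0.05)
print(round(p, 3))  # prints 0.154
```

Even with a sensitive alarm, a rare fault remains fairly unlikely after a single alarm, which is why such models combine evidence from several dependent alarms.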


Future Trends

• Proactive fault detection
  – Recognizing trouble signs and taking corrective action before service degrades

• Hybrid systems
  – Multiple AI methods integrated