
Connect. Communicate. Collaborate

GÉANT2 monitoring

Otto Kreiter, DANTE
Navneet Daga, DANTE

LHC Monitoring Workshop, Munich, 19.07.2006

Agenda

• Extraction of monitoring information from the GÉANT2 network
• External applications developed by DANTE for JRA-4
• Demonstration of a home-grown weather-map
• Conclusion

Network Element Manager (NM)

• All network elements (NEs) communicate with the NM separately
• The NM's task is to configure and monitor each NE, one by one
• It is not service aware: it has no knowledge of the intra-domain e2e path status

Regional Network Manager (RM)

[Diagram labels: Topology, Services, Correlation, "User" interface]

How we export data

[Diagram labels: Alarms (twice), Perf. Meas., Rem. Inv.]

Status via alarms

[Diagram labels: Alarms (twice), SNMPTrapD, Monitoring station]

Alarm content

• From the NM:
  – Information about interfaces and associated signal status, SDH timing problems
  – NE and ILA status
• From the RM:
  – Information related to services
  – Information related to paths, trails and physical connections at all layers

One hop case: NMS vs JRA-4

[Diagram labels: Path gen_mil_CERN; OCH trail; Phys link (twice); Domain link; P. ID link (twice); BOL-CERN-LHC-001]

Multiple hop case: NMS vs JRA-4

[Diagram labels: Path gen_mil_CERN; OCH trail (twice); Phys link (three times); Domain link; P. ID link (twice); CERN-SARA-LHC-001]

Alarm processing

• SNMP traps from the Alcatel IOO module
• Alcatel Enterprise v1/v2c MIB
• SNMP traps received by a Linux station
  – snmptrapd picks up all alarms
  – For each trap a bash script is called, which performs:
    • Analysis
    • Selection
    • Action
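As a rough illustration of the analysis step, the sketch below parses the text that snmptrapd pipes to a handler on standard input: the sender's hostname, then a transport address line, then one "OID value" pair per line. It assumes the handler is registered through snmptrapd's `traphandle` directive, which the slides do not state. The original handler is a bash script, so this Python version is only an equivalent sketch; `parse_trap` and the sample trap content are invented for the example.

```python
from typing import Dict, Iterable, Tuple


def parse_trap(lines: Iterable[str]) -> Tuple[str, str, Dict[str, str]]:
    """Parse the text snmptrapd writes to a trap handler's standard input."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) < 2:
        raise ValueError("expected a hostname line and a transport line")
    hostname, transport = rows[0], rows[1]
    varbinds: Dict[str, str] = {}
    for row in rows[2:]:
        # The first token is the OID; the rest of the line is its value.
        oid, _, value = row.partition(" ")
        varbinds[oid] = value.strip().strip('"')
    return hostname, transport, varbinds


# Invented sample input; real traps carry OIDs from the Alcatel MIB.
sample = [
    "nms.example.org\n",
    "UDP: [192.0.2.10]:162->[192.0.2.20]:162\n",
    'friendlyName "gen_mil_CERN"\n',
    "perceivedSeverity critical\n",
]
host, transport, fields = parse_trap(sample)
```

In a real handler the same function could be fed `sys.stdin` instead of the sample list.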

Alarm type & information

Alarm Raise:
– friendlyName
– probableCause
– perceivedSeverity
– currentAlarmId
– eventTime
– acknowledgementStatus
– additionalInformation
– eventType
– snmpTrapAddress

Alarm Clear:
– friendlyName
– probableCause
– currentAlarmId
– eventTime
– snmpTrapAddress

Used alarm information

Alarm Raise:
– friendlyName
– probableCause
– perceivedSeverity
– currentAlarmId
– eventTime
– acknowledgementStatus
– additionalInformation
– eventType
– snmpTrapAddress

Alarm Clear:
– friendlyName
– probableCause
– currentAlarmId
– eventTime
– snmpTrapAddress

Alarm analyzer process

[Flowchart]
• SNMP trap received
• snmpTrapAddress must be registered
• Check for type of alarm
  – Raise: additional info (path, clientpath, ochtrail, omstrail, physicallink) → recordAlarm → call external program
  – Clear: alarmID → read recordAlarm → call external program → delete recordAl
• Record all traps
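The flow on this slide can be sketched as a small state holder. This is a minimal illustration, not the original bash script: the `kind` field, the class and method names, the "down"/"up" status strings, and the point at which traps are logged are all assumptions, and `notify` stands in for "call external program".

```python
from typing import Callable, Dict, List, Set

# Additional-info values named on the slide.
LINK_TYPES = {"path", "clientpath", "ochtrail", "omstrail", "physicallink"}


class AlarmAnalyzer:
    def __init__(self, registered_sources: Set[str],
                 notify: Callable[[str, str], None]) -> None:
        self.registered_sources = registered_sources
        self.notify = notify                 # stand-in for the external program
        self.recorded: Dict[str, str] = {}   # currentAlarmId -> friendlyName
        self.trap_log: List[dict] = []       # "record all traps"

    def handle(self, trap: dict) -> bool:
        """Process one trap; return True if the external program was called."""
        self.trap_log.append(trap)
        if trap.get("snmpTrapAddress") not in self.registered_sources:
            return False
        alarm_id = trap["currentAlarmId"]
        if trap["kind"] == "raise":
            if trap.get("additionalInformation") not in LINK_TYPES:
                return False
            self.recorded[alarm_id] = trap["friendlyName"]
            self.notify(trap["friendlyName"], "down")
            return True
        if trap["kind"] == "clear":
            # A clear is matched to its raise through the recorded alarm ID.
            name = self.recorded.pop(alarm_id, None)
            if name is None:
                return False
            self.notify(name, "up")
            return True
        return False


calls: List[tuple] = []
analyzer = AlarmAnalyzer({"192.0.2.10"}, lambda name, status: calls.append((name, status)))
analyzer.handle({"kind": "raise", "snmpTrapAddress": "192.0.2.10",
                 "currentAlarmId": "42", "friendlyName": "gen_mil_CERN",
                 "additionalInformation": "ochtrail"})
analyzer.handle({"kind": "clear", "snmpTrapAddress": "192.0.2.10",
                 "currentAlarmId": "42", "friendlyName": "gen_mil_CERN"})
```

After the raise and the matching clear, the recorded-alarm table is empty again and the external program has been called once for each event.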

Alarm analyzer

• Called every time a trap is received
• Written in bash
• Each trap is analyzed separately; if a new trap arrives in the meantime, it waits in the queue (snmptrapd)
  – Possible problem: if an external program gets stuck, the script hangs and the alarms remain unprocessed in the queue
• Must maintain state
  – SNMP traps may get lost, so a program needs to check from time to time whether the monitoring station is in sync with the NMS
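The periodic sync check mentioned above amounts to comparing two sets of alarm IDs. The sketch below assumes the set of currently active alarms can be obtained from the NMS by some means the slides do not describe; `reconcile` and the sample IDs are hypothetical.

```python
from typing import Set, Tuple


def reconcile(local: Set[str], nms_active: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Compare locally recorded alarm IDs with the alarms active on the NMS."""
    missed_raises = nms_active - local   # active on the NMS, never seen locally
    missed_clears = local - nms_active   # still recorded locally, gone on the NMS
    return missed_raises, missed_clears


missed_raises, missed_clears = reconcile({"41", "42"}, {"42", "43"})
```

Here alarm 43 would need to be recorded locally and alarm 41 cleared to bring the monitoring station back in sync.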

External applications

• JRA-4 monitoring (XML file generation)
• perfSONAR DB feeder
• Project weather-map: LHC

JRA-4 monitoring (XML file generation)

E2E data transformation

• Prototype applications developed in Java:
  – E2EXMLWriter
  – XMLGenerator
• E2EXMLWriter takes in a template XML and produces an XML file containing live e2e path status information conforming to the JRA-4 e2e data model
  – Triggered by a script listening to SNMP alarms
  – Parameters passed:
    • Trail ID
    • Status
• XMLGenerator produces the template XML that E2EXMLWriter uses to export the domain's e2e information

Design of E2EXMLWriter

• Relies on 2 configuration files to produce live XML status information:
  – Properties file (links.properties)
    • Contains key = value entries
    • Each key is one e2e path name
    • The value of each key is a CSV list of the trails that form the path
    • Currently manually maintained
  – Alarm register
    • A simple CSV file
    • Application maintained
    • An "alarm raise" registers the associated path
    • An "alarm clear" de-registers the associated path
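The two configuration files can be illustrated with a short sketch. The real E2EXMLWriter is written in Java, so this Python version only mirrors the described behaviour: the function names, the trail names, and the "raise"/"clear" event strings are hypothetical, and the register is kept in memory here rather than in a CSV file.

```python
from typing import Dict, List, Set


def parse_links(text: str) -> Dict[str, List[str]]:
    """Parse links.properties-style text: path name = trail1, trail2, ..."""
    paths: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        paths[key.strip()] = [t.strip() for t in value.split(",") if t.strip()]
    return paths


def update_register(register: Set[str], paths: Dict[str, List[str]],
                    trail_id: str, event: str) -> Set[str]:
    """Register or de-register every path that contains the given trail."""
    affected = {name for name, trails in paths.items() if trail_id in trails}
    if event == "raise":
        return register | affected
    if event == "clear":
        return register - affected
    raise ValueError(f"unknown event: {event}")


# Path names come from the slides; the trail names are invented.
links = parse_links("CERN-SARA-LHC-001 = trail_a, trail_b\nBOL-CERN-LHC-001 = trail_c\n")
register = update_register(set(), links, "trail_b", "raise")
after_clear = update_register(register, links, "trail_b", "clear")
```

As described on the slide, a clear de-registers the path outright; this sketch does not track whether another trail of the same path is still in alarm.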

Design (contd.)

• The application sets every path's default status to UP, with admin state NORMALOPERATION
• Only the paths "registered" in the alarm-register CSV file are set to DOWN, with admin state MAINTENANCE
• The status DEGRADED is not implemented at the moment
• Other admin states are not implemented at the moment
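The status rule above reduces to a lookup against the alarm register. A minimal sketch follows; the function name `path_states` is hypothetical, and the path names are taken from the earlier slides.

```python
from typing import Dict, Iterable, Set, Tuple


def path_states(all_paths: Iterable[str],
                registered: Set[str]) -> Dict[str, Tuple[str, str]]:
    """Map each path to (status, admin state) using the alarm register."""
    return {
        name: ("DOWN", "MAINTENANCE") if name in registered
        else ("UP", "NORMALOPERATION")
        for name in all_paths
    }


states = path_states(["BOL-CERN-LHC-001", "CERN-SARA-LHC-001"],
                     {"CERN-SARA-LHC-001"})
```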

Design of XMLGenerator

• Relies on 3 configuration files:
  – Properties file (init.properties)
    • Contains a key = value entry
    • Key = DOMAIN
    • Value = <domain_name>
    • Enables on-the-fly domain name configuration
  – Config file (config.csv)
    • A simple CSV file
    • Contains node-link-node information
  – A sample XML file containing "pieces of XML" to be replicated for each node and link in the final output "template XML"
• All configuration files are currently manually maintained
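To make the node-link-node idea concrete, the sketch below builds a template document from CSV rows and a domain name. The element and attribute names (`domain`, `node`, `link`, `start`, `end`) are invented for illustration and are not the JRA-4 e2e data model; the domain value and node names are likewise made up, and the sample-XML input used by the real XMLGenerator is left out.

```python
import csv
import io
import xml.etree.ElementTree as ET


def build_template(domain: str, config_csv: str) -> ET.Element:
    """Build a template tree with one element per node and per link."""
    root = ET.Element("domain", name=domain)
    seen = set()
    for row in csv.reader(io.StringIO(config_csv)):
        if not row:
            continue
        node_a, link, node_b = (field.strip() for field in row)
        for node in (node_a, node_b):
            if node not in seen:          # emit each node only once
                seen.add(node)
                ET.SubElement(root, "node", name=node)
        ET.SubElement(root, "link", name=link, start=node_a, end=node_b)
    return root


template = build_template("GEANT2", "GEN,gen_mil,MIL\nMIL,mil_bol,BOL\n")
xml_text = ET.tostring(template, encoding="unicode")
```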

Data provision

• Currently, the final XML containing live e2e path status information is written to a URL for export
  – http://unix.dante.org.uk/~otto/jra4-cbf.xml
• Later, possibly integration with the perfSONAR framework

perfSONAR feeder

• Enters data in the perfSONAR MA
• Takes as input:
  – Type of logical link: trunk, trail, physical link or path
  – Name: friendlyName
  – Time: the time when the event occurred
  – Status: UP/DOWN
  – Alarm ID
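The feeder's input can be pictured as a small validated record. This is a sketch only: the class name `LinkEvent`, the field names, and the exact spelling of the allowed values are assumptions, and nothing here talks to a perfSONAR MA.

```python
from dataclasses import dataclass
from datetime import datetime

LINK_TYPES = ("trunk", "trail", "physical link", "path")
STATUSES = ("UP", "DOWN")


@dataclass(frozen=True)
class LinkEvent:
    link_type: str     # type of logical link
    name: str          # friendlyName from the alarm
    time: datetime     # when the event occurred
    status: str        # UP or DOWN
    alarm_id: str

    def __post_init__(self) -> None:
        # Reject values outside the sets listed on the slide.
        if self.link_type not in LINK_TYPES:
            raise ValueError(f"unknown link type: {self.link_type}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")


# Invented sample values.
event = LinkEvent("path", "gen_mil_CERN", datetime(2006, 7, 19, 10, 0), "DOWN", "42")
```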

LHC weather-map live demonstration

1. CERN user-side down

2. CERN user-side up

3. GEN-MIL Lambda down

4. GARR user-side down

5. Back-to-back interconnection in DE broken

6. AMS-FRA lambda down

7. DE interconnection up

8. AMS-FRA lambda up

9. GARR user-side up

10. GEN-MIL lambda up

Conclusion

• Status monitoring via alarms is in an advanced phase and well understood.
  – Once the characteristics of the equipment, alarms and faults were understood, the development was easy.
• The alarm collector can be reused by NRENs using Alcatel equipment.
• XMLGenerator and the perfSONAR feeder are not bound to specific equipment.

Questions ?


Backup

CERN user side down

Lambda CH-IT down

Lambda and user failure in IT

Lambda + POP interconnect failure

Multiple Lambda, user and POP interconnect failure