
Post on 04-Jan-2016


Connect. Communicate. Collaborate

Hades – Going Operational

Roland Karch, RRZE FAU Erlangen-Nürnberg

JRA1 Montpellier Meeting, October 2006

Hades Implementation Status List

• IPv6 Measurements (Up and running in more than half of the JRA1 locations)

• Multicast Measurements (Implementation)
• Alerts
– Packet Loss Maps (Implemented, Deployed for X-WiN)
– SNMP Traps (Server needs to be set up)
– Generic Web Interface (Evaluation)
• Maintenance
– To be integrated into one interface with Alerts

IPv6 Measurements

• Running in:
– Amsterdam (SURFnet)
– Athens (GRNET)
– Ljubljana (ARNES)
– Paris (RENATER) (currently offline)
– Prague (CESNET)
– Sofia (ISTF)
– Zagreb (CARNET)

• Do you have a JRA1 Hades measurement box and an IPv6-capable network but are not on the list? Contact us!


Hades weather map (GEANT/NRENs, Geographically)


Hades weather maps (Abstract, domain specific)

Alerts – Packet Loss Maps

• One map to show observed packet loss on all Hades-monitored links
• Colour coding on links to show short and long outages
• Currently still in development, not yet available in the European context
• Maps for other metrics are under consideration, but details about those metrics are yet to be determined (see statistical analysis)

Alerts – SNMP traps

• Problem with data on measurement archive: age between 0 and 90 minutes
• To ensure up-to-date information for alerts, solutions are either:
– Increase frequency of data polling (causing management network overhead and load on the measurement point and archive)
– Do analysis on the measurement point in real time (CPU load on the measurement point only, but problem of how to deliver decentralized alerts)
• Solution: Decentralized analysis, and SNMP traps for alerting
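The decentralized approach can be illustrated with a minimal Python sketch. It is not the Hades implementation: the class name `LossAlerter`, the window size, the loss threshold, and the `send_alert` callback (a stand-in for whatever actually emits the SNMP trap) are all assumptions made for illustration. The idea shown is that the measurement point evaluates loss locally and only reports when the state changes.

```python
from collections import deque
from typing import Callable, Deque


class LossAlerter:
    """Tracks packet loss over a sliding window and reports state changes only."""

    def __init__(self, send_alert: Callable[[str], None],
                 window: int = 100, threshold: float = 0.05) -> None:
        self.send_alert = send_alert    # hypothetical stand-in for an SNMP trap sender
        self.window = window            # assumed number of packets per window
        self.threshold = threshold      # assumed loss ratio that counts as a problem
        self.results: Deque[bool] = deque(maxlen=window)  # True means "packet lost"
        self.alerting = False

    def record(self, lost: bool) -> None:
        """Feed one measurement result; emit an alert only on a state change."""
        self.results.append(lost)
        if len(self.results) < self.window:
            return                      # wait until the window is full
        loss = sum(self.results) / len(self.results)
        if loss >= self.threshold and not self.alerting:
            self.alerting = True
            self.send_alert(f"loss raised: {loss:.0%}")
        elif loss < self.threshold and self.alerting:
            self.alerting = False
            self.send_alert(f"loss cleared: {loss:.0%}")
```

Because nothing is sent while the state is unchanged, a sketch like this generates management traffic only when a problem starts or ends.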

Alerts – SNMP traps

• Multiple potential use cases for traps
– Central visualization to subscribe to all alerts in order to create a powerful map and/or alert list with history
– NOCs might subscribe for their uplinks/sensitive paths to important locations (typically already running SNMP-capable monitoring facilities)

Alerts – SNMP traps

• Benefits
– Only causes network traffic when necessary
– Real time data for analysis available on the measurement point
– SNMP MP usable?
• Drawbacks
– SNMP very often filtered into user networks (web visualisation as intermediate server might solve that)
– Won’t alert when the reporting path is affected by the network problem itself

Alerts – Statistics

• Higher level of statistical analysis for measurement data might help to determine a „connection footprint“ and show changes in it due to routing changes.
• Possible numbers to play with:
– Line inherent delay (minimal delay that catches all, or a high percentile of all measurement packets)
– Regular IPDV (blurry zone in a plot, delta between line inherent delay and maximum of 90 percent of the measurements)
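One possible reading of these two numbers, sketched in Python: the line inherent delay is taken as the minimum observed one-way delay, and the regular IPDV as the distance from that minimum to the 90th percentile. The function names, the nearest-rank percentile method, and the choice of the plain minimum are assumptions, not something the slides specify.

```python
import math
from typing import Sequence, Tuple


def percentile(sorted_values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile (0 < p <= 100) of an ascending sequence."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def delay_footprint(delays_us: Sequence[float]) -> Tuple[float, float]:
    """Return (line inherent delay, regular IPDV) for one-way delays in µs."""
    ordered = sorted(delays_us)
    inherent = ordered[0]                               # assumed: plain minimum
    regular_ipdv = percentile(ordered, 90) - inherent   # width of the "blurry zone"
    return inherent, regular_ipdv
```

A change in either value between two measurement periods would then be a candidate indicator for a routing change.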

Alerts – Statistics – Key values

• 11.4 ms minimal delay subtracted: „Network intrinsic delay“
• 1 µs gap: timestamp precision
• Lower boundary: timer precision

[Figure: One-way delay. x-axis: Time / h (00:00 to 04:00); y-axis: Delta Delay / µs (0 to 40)]

Alerts – Statistics – Pathfinders

• First packet in every group of 5: ~7 µs longer delay
• Most probable reason: Receiver process has to be loaded into the CPU cache before processing the first packet

[Figure: Pathfinder packets. x-axis: Delta OWD / µs (0 to 24); y-axis: N (0 to 1800); series: Pathfinder, No pathfinder]
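The pathfinder effect can be checked with a few lines of Python. This is a sketch under stated assumptions: the delays are listed in sending order, the list starts at a group boundary, and the function name `pathfinder_offset` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def pathfinder_offset(delays_us: Sequence[float], group_size: int = 5) -> float:
    """Mean extra delay of the first packet in each group versus the others."""
    # Assumes delays are in sending order and the list starts at a group boundary.
    first = [d for i, d in enumerate(delays_us) if i % group_size == 0]
    rest = [d for i, d in enumerate(delays_us) if i % group_size != 0]
    return mean(first) - mean(rest)
```

On real data, a result of about 7 µs would match the observation above; excluding the pathfinder packets before further analysis would narrow the delay distribution accordingly.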

Alerts – Statistics – Path fingerprint

• Comparison of paths on different networks (hardware, lines, configuration differs)
• Both: small OWD, narrow distribution of delay
• Path 2: longer distribution tail
• Path 1: reordering!

[Figure: Delay on two network paths. x-axis: Delta Delay / µs (0 to 80); y-axis: N (0 to 800); series: Path 1, Path 2]
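Reordering such as that seen on Path 1 can also be counted directly from the sequence numbers of arriving packets. The following Python sketch is one simple way to do that, not necessarily the method used for the plot: a packet counts as reordered if it arrives after a packet with a higher sequence number. The function name is hypothetical.

```python
from typing import Iterable, Optional


def count_reordered(arrival_seq: Iterable[int]) -> int:
    """Count packets that arrive after a packet with a higher sequence number."""
    highest: Optional[int] = None   # highest sequence number seen so far
    reordered = 0
    for seq in arrival_seq:
        if highest is not None and seq < highest:
            reordered += 1          # arrived late relative to a later packet
        else:
            highest = seq
    return reordered
```

For the arrival order 1, 2, 4, 3, 5, 7, 6 this counts packets 3 and 6 as reordered.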

Maintenance

• Most important part of „going operational“
• Current status:
– Daily check via the web visualization of which measurement lines are down (up to 24 hours delay)
– Scripts run to catch most anomalies (clock status, old data)
– perfSONAR MAs are monitored externally (ISTF)
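As an illustration of the "old data" check, a script could compare the newest timestamp of each measurement line with the current time. This Python sketch is not one of the actual Hades scripts: the function name, the input format, and the 90-minute default (borrowed from the archive data age mentioned earlier) are assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_lines(last_seen: Dict[str, datetime], now: datetime,
                max_age: timedelta = timedelta(minutes=90)) -> List[str]:
    """Return the names of measurement lines whose newest data is too old."""
    # last_seen maps a (hypothetical) line name to its newest data timestamp.
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age)
```

The returned list could feed a daily report or a monitoring front end.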

Maintenance

• Evaluation of Nagios [1]
• Could serve as a common platform for alert and maintenance visualization
• Provides a front end for both SNMP and scripted surveillance

[1] http://nagios.org/

Maintenance

• Goals
– Highest possible level of automation
– Fixing of simple problems either fully automated (e.g. restarting measurements) or via scripts that can be triggered on the web server
– Transparency for users


Questions / Discussion / Want to contact us?

• Website: http://www.win-labor.dfn.de/
• Email: win-labor@dfn.de
