University Campus Network Monitoring in Everyday Life
Tomáš Podermański, [email protected]
Brno University of Technology, CESNET z.s.p.o
Brno University of Technology
• http://www.vutbr.cz
• One of the largest universities in the Czech Republic
• Founded in 1899; the 110th anniversary will be celebrated this year
• 20,000 students and 2,000 employees
• 9 faculties
• 6 other organisational units
• Student dormitories for 6,000 students
VUT FIT, Božetechova 2
VUT Koleje, Mánesova 12
AV VFU, Palackého 1/3
MU CESNET, Botanická 68a
VUT Koleje, Kounicova 46/48
VUT Rektorát, Antonínská 1
VUT, Gorkého 13
VUT FaVU, Údolní 19
VUT FEKT, Údolní 53
MU, Vinařská 5
VUT FaVU, Rybářská 13
AV ČR, Rybářská 13
VUT FA, Poříčí 5
VUT FAST, Veveří 95
AV ČR UFM
VUT, Kounicova 67a
MZLU, Tauferova
VUT FEKT, Technická 8
VUT Koleje, Kolejní 2
VUT FP, FEKT, Kolejní 4
VUT FCH, FEKT, Purkyňova 118
AV ČR UPT
VUT Koleje, Purk.
VUT TI, Technická 4
VUT FSI, Technická 2
Physical Layer
• 24 sites connected to each other
• Each site is connected from at least two directions (by separate cables)
• Over 100 km of optical cables
• Most of the cables are owned by the university
IPv4 layer
• The network core is based on Hewlett Packard devices
• OSPF-based routing
• PIM-SM and PIM-DM are used for multicast
• Most of the traffic is transported through this network
IPv6 layer
• IPv6 functionality on HP devices is available as a beta release
• Temporary solution based on 3Com devices or PC routers running XORP
• A dedicated IPv6 switch/router sits alongside the main IPv4 switch/router
• VLANs are used for connections between IPv6 routers
• A temporary low-cost solution until the main devices have full IPv6 support
Basic monitoring, active vs. passive
• Active monitoring
– We send probe data and get a response
– A probe of the device, network, etc.
• Passive monitoring
– An observer of the device, network, etc.
Components in a Monitoring System
Agent and protocol
• SNMP agent
• Get, Set, Walk, Traps
• NetFlow, sFlow, IPFIX probe
• Accumulated statistics
• For many systems, a specialized protocol based on the main system
• Role of a cache on the agent
• Active monitoring
• We use an appropriate protocol or data depending on the monitored service
• Proxy service (a view from another point)
Manager & Frontend
• The manager collects and processes data from agents
• Store and archive in a datastore
• SQL, RRD, …
• User interface
• Web, application
• Reports, SLA, …
• Configuration
• Historical view
• System of alerts
• Email, SMS, phone call
• The most popular systems
• Zabbix, Nagios, OpenView, nfsen/nfdump, flow-tools, rrdtool, mrtg, cacti, munin, …
Quiz
What causes the most trouble in IT?
– Power supply of systems
• Overloaded circuits
• Unmanaged UPS
• Messy electrical installations
• An improper power supply can be a booby trap
– Cooling systems
• Absence of preventive monitoring
• Frozen units
• Jammed by foliage
• …
LAYER 0,1
Physical infrastructure
Power Supply with 1 + 1 Redundancy
[Diagram: UPS I, UPS II, PDU I, PDU II and an ATS, fed by 2x 16A]
Monitored values shown in the diagram:
• Load, voltage
• Load, voltage on source 1, voltage on source 2, selected source
• Load, input voltage, output voltage, battery status
Power System with 1 + 1 Redundancy
[Diagram: UPS and ATS, fed by 2x 16A]
Monitored values shown in the diagram:
• Load, current, voltage on source 1, voltage on source 2, selected source
• Load, current, input voltage, output voltage, battery status
Failure sequence shown in the diagram:
• Overloaded circuit: tripped circuit breaker
• In a few minutes the UPS is low
• When the power comes back again...
• Second circuit is overloaded: tripped circuit breaker
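The failure sequence above suggests a simple preventive check: with 1 + 1 redundancy, either circuit has to be able to carry the whole load alone. The following is a minimal Python sketch under that assumption. The function name is hypothetical, the readings are assumed to come from whatever the ATS or UPS already reports, and the 16 A default is simply the breaker rating shown on the slide.

```python
def survives_failover(current_a: float, current_b: float,
                      breaker_limit: float = 16.0) -> bool:
    """Return True if one circuit could carry the combined load of both.

    current_a, current_b: measured current on each circuit, in amperes.
    breaker_limit: rating of a single circuit breaker (assumed 16 A).
    """
    return (current_a + current_b) <= breaker_limit


def failover_headroom(current_a: float, current_b: float,
                      breaker_limit: float = 16.0) -> float:
    """Amperes left on the surviving circuit after a failover (negative = overload)."""
    return breaker_limit - (current_a + current_b)
```

Two circuits loaded at 9 A each look healthy individually, yet `survives_failover(9.0, 9.0)` is false: after one breaker trips, the other would have to carry 18 A.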
Cooling Systems
• In many cases the cooling system is part of the building.
• The majority of cooling systems are difficult to monitor.
• Some devices support monitoring, but it costs a lot of money.
– In many cases monitoring is more expensive than the cooling device.
– There is no standard interface (RS485 with a closed protocol).
– Some devices have a binary output which indicates both error and running state (via relay)
• Possible conversion to SNMP
• Another, and the easiest, solution: monitoring the temperature in the communication room.
• Thermometer with an SNMP output.
[Diagram labels: monitoring system, unit status/SNMP, temperature/SNMP, LonWorks]
Monitoring in Data Center Rooms
• More complex electrical installation
• Having a UPS and an ATS in every rack is inefficient
• Devices with 3-phase power
• Circuits are divided into 3 groups (direct, genset, UPS)
• More detailed information about the electricity distribution is very useful.
• It is necessary to monitor whether the phases are balanced
– The genset could break down
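One way to turn "are the phases balanced" into a number is the maximum deviation of any phase current from the mean, expressed as a percentage of the mean. The sketch below assumes three per-phase current readings are already available. The function name and the 10 % alert threshold are illustrative assumptions, not values from the presentation.

```python
def phase_imbalance_percent(currents):
    """Maximum deviation from the mean phase current, as a percentage of the mean."""
    if len(currents) != 3:
        raise ValueError("expected exactly three phase currents")
    mean = sum(currents) / 3
    if mean == 0:
        return 0.0  # no load at all: treat as balanced
    return max(abs(c - mean) for c in currents) / mean * 100


def is_unbalanced(currents, threshold_percent=10.0):
    """True if the imbalance exceeds the (assumed) alert threshold."""
    return phase_imbalance_percent(currents) > threshold_percent
```

With readings of 12 A, 10 A and 8 A the mean is 10 A and the largest deviation is 2 A, so the imbalance is 20 %.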
Power in Data Center Rooms
[Diagram labels: main power, genset, UPS, ATS, bypass, HVAC, devices in racks, current (A) and voltage (V) measurement points; charts: temperature in datacenter]
Server Monitoring
• Hardware
– Manufacturers' software support is required (Dell OpenManage, HP InsightControl, …)
– Chassis temperature
– Fan condition
– Power status
• Operating system
– CPU, load, memory, utilization, processes
• Disk subsystem
– External disk array with its own management port
– RAID status
– Disk condition (S.M.A.R.T.)
[Diagram labels: monitoring system, SNMP, IPMI, other]
Network Device Monitoring
• Hardware
– Chassis temperature
– Fan condition
– Power status
• State of the operating system
– CPU
– Load
– Memory
[Diagram labels: monitoring system, SNMP]
[Picture: front panel of a ProCurve Switch 4208vl-72GS (J9030A) with ProCurve 24p Gig-T vl Modules (J8768A) and a ProCurve Gig-T/SFP vl Module (J9033A)]
Network Connection – L1 Monitoring
• Port status
– Link UP/DOWN
– Speed
– Errors on interfaces
– Traffic on interfaces
• Remote device status
– LLDP + data from MIB
– Remote interface, remote device, …
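Traffic on an interface is normally derived from two successive readings of an octet counter. The sketch below shows the arithmetic only, in Python, and assumes the readings were already fetched (for example from the IF-MIB `ifInOctets` counter, which is a 32-bit counter and therefore wraps). The function names are illustrative, and the code assumes at most one wrap between two polls.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Difference between two readings of a wrapping counter (at most one wrap)."""
    return (current - previous) % modulus


def utilization_percent(prev_octets, curr_octets, interval_s, speed_bps):
    """Average link utilization over the polling interval, in percent."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return bits / (interval_s * speed_bps) * 100
```

On fast links a 32-bit counter can wrap more than once between polls, which this calculation cannot detect; polling more often or using the 64-bit `ifHCInOctets` counter avoids that.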
LAYER 2
Link
Network Connection – L2 Monitoring
• L2 monitoring
– An L2 ping could be very useful
– We have to use information obtained from other layers (L1, L3)
– Unfortunately, there is no simple way to check connectivity on a single VLAN
– One option is to obtain some information from the MIB, but it's not sufficient
• STP/MSTP information, root bridge
• VLANs on interfaces
Network Connection – L3 monitoring
• L3 monitoring
– ICMP and ping are still the most important
– The problem is how to monitor broken paths (the routing protocol usually masks any problem)
• Check the routing protocol state
• ICMP using source routing
– Flow-based monitoring
– Multicast monitoring
[Diagram labels: 147.229.6.2, 147.229.6.1, Data]
Network Connection – L3 monitoring
• L3 monitoring
– Checking that a router has the proper neighbor
– OSPF-MIB (RFC 4750)
• ospfNbrRtrId
– VRRP-MIB (RFC 2787)
• vrrpOperAdminState, vrrpOperState, vrrpOperMasterIpAddr
[Diagram labels: DR, BDR, Master, Backup]
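Checking for the proper neighbor comes down to comparing the router IDs a device reports (for example the values of `ospfNbrRtrId`) with the ones it should have. The following Python sketch assumes the observed IDs have already been collected by some SNMP walk. The function name and the documentation-range addresses are hypothetical.

```python
def check_neighbors(expected, observed):
    """Compare expected and observed OSPF neighbor router IDs.

    Returns the neighbors that are missing and those that are unexpected.
    """
    expected, observed = set(expected), set(observed)
    return {
        "missing": sorted(expected - observed),
        "unexpected": sorted(observed - expected),
    }


# Hypothetical router IDs, for illustration only.
result = check_neighbors(
    expected=["192.0.2.1", "192.0.2.2"],
    observed=["192.0.2.1", "192.0.2.9"],
)
```

An alert would fire whenever either list is non-empty, which catches a lost adjacency even when traffic has silently moved to another path.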
Multicast Monitoring
• Quite a demanding task
– For each stream the <S,G> path has to be created
– A continuously received and transmitted stream does not necessarily reveal a problem on the RP
– Almost impossible to monitor the local infrastructure
• The only known tool: Multicast Beacon
– Written in Perl
– Dead project
• Last release 2006
• No VLAN support and no support for multiple interfaces on a single host
• Homepage unavailable
• Own solution: mcwatch
Multicast Agents
Data is periodically sent to a server
[Diagram, Multicast Beacon: VLAN, multicast agent, POSIX socket, application]
[Diagram, mcwatch: VLAN, multicast agent, POSIX socket, application]
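A common way for test agents to quantify multicast delivery is to number the test packets and let each receiver report which sequence numbers arrived. The Python sketch below shows only that reporting step. It is a generic illustration and not a description of how Multicast Beacon or mcwatch works internally; the function name and the report fields are assumptions.

```python
def loss_report(received_seq, first, last):
    """Summarise delivery of packets numbered first..last (inclusive)."""
    expected = last - first + 1
    if expected <= 0:
        raise ValueError("last must not be smaller than first")
    unique = {s for s in received_seq if first <= s <= last}
    lost = expected - len(unique)
    return {
        "expected": expected,
        "received": len(unique),
        "lost": lost,
        "loss_percent": lost / expected * 100,
        # packets that arrived more than once
        "duplicates": len(received_seq) - len(set(received_seq)),
    }
```

A report like this is small enough to be sent periodically from every agent to a central server, where per-VLAN results can be compared.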
NetFlow Monitoring
• Two NetFlow probes observe both external connectivity lines
• The NetFlow probes are connected directly to the optical fiber via a TAP
• Wire-speed accelerated probes (FlowMon)
[Diagram labels: CESNET PoP, CRS-1/16, university network, 10G Ethernet]
Flow Processing
[Diagram labels: nfcapd, datastore, SQL (aggregated), all administrators, backbone administrator]
Data are stored on a storage server
– Data are kept for 30 days
– Analysis of security incidents, statistical purposes
– The big issue: how to select useful data and provide them to the people who need them
– Security matters
– Full data are accessible only to a small, trusted group of administrators
– For other IT staff (faculty administrators, IT managers), summarised data are accessible via a web interface
• Data are processed by common open source tools:
– nfdump
– A lot of trouble, but we don't have any better solution
– We are trying to optimise the current implementations
– Several theses on this topic are in progress
• Commercial tools: the situation is no better
– Usually plenty of nice charts and statistics
– But performance is often terrible (sampling is required)
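The split between full data for a few administrators and summarised data for everyone else can be pictured as an aggregation step over the stored flows. The Python sketch below uses a hypothetical record layout of (source address, destination address, bytes) and documentation-range addresses. It does not parse real nfdump output, and the function name is an assumption.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Total bytes per source address, largest first."""
    totals = Counter()
    for src, _dst, nbytes in flows:
        totals[src] += nbytes
    return totals.most_common(n)


# Hypothetical flow records: (source, destination, bytes).
flows = [
    ("192.0.2.1", "198.51.100.1", 500),
    ("192.0.2.2", "198.51.100.1", 1500),
    ("192.0.2.1", "198.51.100.2", 700),
]
summary = top_talkers(flows)
```

Only the summary would be exposed through a web interface, while the individual flow records stay with the restricted group.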
LAYER 4-7
Transport, application and the others
Layer 7
• Many custom plugins
– Eduroam/RADIUS monitoring
– DNS
– Database status
– Backup server status
– …
• Data are collected and made available to administrators at different levels
– Eduroam/RADIUS logs
– Mail logs (DNSBL, spam classification, statistics)
– WiFi/VPN connections
– …
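A custom plugin of this kind usually wraps one service-specific probe and maps the outcome to a status. The Python sketch below is an assumption about the general shape of such a plugin, not the authors' code. It uses the widely adopted Nagios-style status codes (0 OK, 1 WARNING, 2 CRITICAL), and the function name and thresholds are illustrative.

```python
import time

OK, WARNING, CRITICAL = 0, 1, 2


def check_service(probe, warn_s=0.5, crit_s=2.0):
    """Run probe() and map its outcome and duration to a status and a message."""
    start = time.monotonic()
    try:
        probe()
    except Exception as exc:  # any failure of the probe is critical
        return CRITICAL, f"CRITICAL - probe failed: {exc}"
    elapsed = time.monotonic() - start
    if elapsed >= crit_s:
        return CRITICAL, f"CRITICAL - response took {elapsed:.3f}s"
    if elapsed >= warn_s:
        return WARNING, f"WARNING - response took {elapsed:.3f}s"
    return OK, f"OK - response took {elapsed:.3f}s"
```

A DNS check could pass a probe such as `lambda: socket.getaddrinfo("www.vutbr.cz", 80)`, while a RADIUS or database check would supply its own probe; keeping the probe separate also makes the status logic easy to test without a network.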
Components in the Monitoring System
[Diagram labels: SNMP, Zabbix, Spinel, radius, icmp, mysql, snmp, xmon, netflow, maillogs, radiuslogs, incidents, wifilogs, honeypots, …, aggflow; zabbix, xwho, xhis, NetIs, nfdump]
Monitoring: Layers & Technology
• Physical: power, cooling systems, temperature; servers and disk arrays; network devices
• Link: port statistics, link status, number of errors; LLDP neighbour
• Internet: ICMP tests using the source routing option; OSPF, VRRP peers; multicast traffic monitoring
• Application: Radius, DNS; other services
[Side labels: zabbix, xwho, xhis, NetIs, nfdump; SNMP, zabbix, NetFlow, radius, ICMP, ICMPv6, Spinel, …]
Current Problems
• SNMP protocol
– No alternative
– Many bugs in various implementations
• Absence of an L2 testing tool
• NetFlow
– We have plenty of data, but nobody knows how to process it effectively
– In some cases more detailed information is required than flow data provide
• IPv6 brings some new problems and challenges