University Campus Network Monitoring in Everyday Life
Tomáš Podermański, tpoder@cis.vutbr.cz
Brno University of Technology, CESNET z.s.p.o.

Post on 30-Jan-2016

Page 1: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Brno University of Technology
CESNET z.s.p.o.

University Campus Network Monitoring in Everyday Life

Tomáš Podermański, tpoder@cis.vutbr.cz

Page 2: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Brno University of Technology

• http://www.vutbr.cz
• One of the largest universities in the Czech Republic
• Founded in 1899; the 110th anniversary will be celebrated this year
• 20,000 students and 2,000 employees
• 9 faculties
• 6 other organisational units
• Student dormitories for 6,000 students

Page 3: Tomáš Podermański ,  tpoder@cis.vutbr.cz

[Map: connected locations]
VUT FIT, Božetechova 2
VUT Koleje, Mánesova 12
AV VFU, Palackého 1/3
MU CESNET, Botanická 68a
VUT Koleje, Kounicova 46/48
VUT Rektorát, Antonínská 1
VUT, Gorkého 13
VUT FaVU, Údolní 19
VUT FEKT, Údolní 53
MU, Vinařská 5
VUT FaVU, Rybářská 13
AV ČR, Rybářská 13
VUT FA, Poříčí 5
VUT FAST, Veveří 95
AV ČR UFM
VUT, Kounicova 67a
MZLU, Tauferova
VUT FEKT, Technická 8
VUT Koleje, Kolejní 2
VUT FP, FEKT, Kolejní 4
VUT FCH, FEKT, Purkyňova 118
AV ČR UPT
VUT Koleje, Purk.
VUT TI, Technická 4
VUT FSI, Technická 2

Page 4: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Physical layer
• 24 locations connected to each other
• Each location is connected from at least two directions (by separate cables)
• Over 100 km of optical cables
• Most of the cables are the property of the university

IPv4 layer
• The network cores are based on Hewlett Packard devices
• OSPF-based routing
• PIM SM and DM are used for multicast
• Most of the traffic is transported through this network

IPv6 layer
• IPv6 functionality on HP devices is available as a beta release
• Temporary solution based on 3Com devices or PC routers with XORP
• A dedicated IPv6 switch/router sits alongside the main IPv4 switch/router
• VLANs are used for connections between IPv6 routers
• Temporary low-cost solution until the main devices have full IPv6 support

Page 5: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Basic monitoring, active vs. passive

• Active monitoring
– We send probe data and get a response
– A probe of the device, network, etc.
• Passive monitoring
– An observer of the device, network, etc.
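As a minimal sketch of the active approach, the probe below opens a TCP connection to a service and reports whether a response arrived and how long it took. The function names `tcp_probe` and `demo` are illustrative, not taken from the talk, and a TCP handshake is only one possible kind of probe.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Actively probe a TCP service; return (is_up, seconds_elapsed)."""
    started = time.monotonic()
    try:
        # The probe is a TCP handshake; the response is the accepted connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - started
    except OSError:
        # Refused, unreachable, or timed out: the service counts as down.
        return False, time.monotonic() - started


def demo():
    """Probe a throwaway local listener, then the same port after it is closed."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    up_result = tcp_probe("127.0.0.1", port)
    listener.close()
    down_result = tcp_probe("127.0.0.1", port)
    return up_result[0], down_result[0]
```

A passive observer, by contrast, would send nothing and only read counters or traffic that already exists.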

Page 6: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Components in a Monitoring System

Page 7: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Components in monitoring system

Agent and protocol
• SNMP agent
• Get, Set, Walk, Traps
• NetFlow, sFlow, IPFIX probe
• Accumulated statistics
• For many systems, a specialized protocol based on the main system
• Role of a cache on the agent
• Active monitoring
• We use an appropriate protocol or data depending on the monitored service
• Proxy service (view from another point)

Page 8: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Components in Monitoring System

Manager & Frontend
• Manager collects and processes data from agents
• Store and archive in a datastore
• SQL, RRD, …
• User interface
• Web, application
• Reports, SLA, …
• Configuration
• Historical view
• System of alerts
• Email, SMS, phone call
• The most popular systems
• Zabbix, Nagios, OpenView, nfsen/dump, flowtools, rrdtool, mrtg, cacti, munin, …
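The idea behind a round-robin datastore can be sketched in a few lines: raw samples are consolidated into averages, and a fixed number of slots is kept so the archive never grows. This is only an illustration of the principle; the class name `RoundRobinArchive` and its parameters are assumptions, and it is not the rrdtool file format or API.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size archive: every `step` raw samples are averaged into one slot."""

    def __init__(self, slots, step):
        self.step = step
        self.pending = []                 # raw samples not yet consolidated
        self.slots = deque(maxlen=slots)  # the oldest slot is dropped automatically

    def update(self, value):
        """Add one raw sample and consolidate when a full step is collected."""
        self.pending.append(value)
        if len(self.pending) == self.step:
            self.slots.append(sum(self.pending) / self.step)
            self.pending = []

    def history(self):
        """Return the consolidated values, oldest first."""
        return list(self.slots)


archive = RoundRobinArchive(slots=3, step=2)
for sample in [10, 20, 30, 50, 5, 15, 100, 200]:
    archive.update(sample)
```

With three slots and a step of two, the eight samples above produce four averages, and the oldest one is discarded.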

Page 9: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Quiz

What causes the most trouble in IT?

– Power supply of systems
• Overloaded circuits
• Unmanaged UPS
• Messy electrical installations
• An improper power supply could be a booby trap

– Cooling systems
• Absence of preventive monitoring
• Frozen units
• Jammed by foliage
• …

Page 10: Tomáš Podermański ,  tpoder@cis.vutbr.cz

LAYER 0,1

Physical infrastructure

Page 11: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Power Supply with 1 + 1 Redundancy

[Diagram components: UPS I, UPS II, PDU I, PDU II, ATS; supply 2x 16A]

Page 12: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Power Supply with 1 + 1 Redundancy

[Diagram components: UPS I, UPS II, PDU I, PDU II, ATS; supply 2x 16A]

Monitored values shown in the diagram:
• Load, voltage
• Load, voltage on source 1, voltage on source 2, selected source
• Load, input voltage, output voltage, battery status

Page 13: Tomáš Podermański ,  tpoder@cis.vutbr.cz

power system with 1 + 1 redundancy

[Diagram components: UPS, ATS; supply 2x 16A]

Page 14: Tomáš Podermański ,  tpoder@cis.vutbr.cz

power system with 1 + 1 redundancy

[Diagram components: UPS, ATS; supply 2x 16A]

Monitored values shown in the diagram:
• Load, current, voltage on source 1, voltage on source 2, selected source
• Load, current, input voltage, output voltage, battery status

Page 15: Tomáš Podermański ,  tpoder@cis.vutbr.cz

power system with 1 + 1 redundancy

[Diagram components: UPS, ATS; supply 2x 16A]

• Overloaded circuit, tripped circuit breaker

Page 16: Tomáš Podermański ,  tpoder@cis.vutbr.cz

power system with 1 + 1 redundancy

[Diagram components: UPS, ATS; supply 2x 16A]

• In a few minutes the UPS battery is low
• When the power comes back...
• The second circuit is overloaded, tripped circuit breaker
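The failure sequence on these slides suggests a simple check on the monitored load: redundancy only helps if the surviving circuit can carry the combined load. The sketch below assumes that the whole load of a tripped circuit moves to the other one and uses the 16 A rating from the slides; the function name and the sample currents are made up for illustration.

```python
BREAKER_RATING_A = 16.0  # from the "2x 16A" label on the slides


def failover_report(load_a, load_b, rating=BREAKER_RATING_A):
    """Check two redundant circuits now and after either one fails."""
    combined = load_a + load_b
    return {
        "circuit_a_ok": load_a <= rating,
        "circuit_b_ok": load_b <= rating,
        # Assumption: if one circuit trips, its whole load moves to the survivor.
        "survives_failover": combined <= rating,
        "headroom_after_failover": rating - combined,
    }


# Hypothetical readings in amperes: each circuit looks fine on its own.
report = failover_report(load_a=9.5, load_b=8.0)
```

Both circuits are within their rating here, yet the combined 17.5 A would trip the second breaker after a failover, which is the situation the slides describe.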

Page 17: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Cooling Systems

• In many cases a cooling system is part of the building.
• The majority of cooling systems are difficult to monitor.
• Some devices have monitoring support, but it costs a lot of money.
– In many cases monitoring is more expensive than the cooling device.
– There is no standard interface (RS485 with a closed protocol).
– Some devices have a binary output which indicates both error and running state (via relay)
• Possible conversion to SNMP
• Another, and the easiest, solution: monitoring the temperature in the communication room.
• Thermometer with an SNMP output.

[Diagram labels: Monitoring system, Unit status/SNMP, Temperature/SNMP, LonWorks]
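Room temperature monitoring usually comes down to a threshold check. A minimal sketch, assuming readings in degrees Celsius and made-up limits of 30 and 27 degrees: the alarm uses hysteresis so that a reading hovering around the limit does not generate a stream of alerts. The class name and thresholds are illustrative only.

```python
class TemperatureAlarm:
    """Raise an alarm at `high` or above; clear it only at `low` or below."""

    def __init__(self, high=30.0, low=27.0):
        self.high = high
        self.low = low
        self.active = False

    def update(self, celsius):
        """Feed one reading and return whether the alarm is currently active."""
        if not self.active and celsius >= self.high:
            self.active = True
        elif self.active and celsius <= self.low:
            self.active = False
        return self.active


alarm = TemperatureAlarm()
# Hypothetical sequence of readings from the communication room thermometer.
states = [alarm.update(t) for t in [25.0, 31.0, 29.0, 28.0, 26.5, 29.5]]
```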

Page 18: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Monitoring in Data Center Rooms

• More complex electrical installation
• Having a UPS and ATS in every rack is inefficient
• Devices with 3-phase power
• Circuits are divided into 3 groups (direct, genset, UPS)
• More detailed information about the electricity distribution is very useful.
• It is necessary to monitor whether phases are balanced
– Genset could break down
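One way to turn "are the phases balanced" into a number is to take the largest deviation from the mean phase current as a percentage of that mean. The sketch below uses this definition and a 10 % limit as assumptions; neither the formula nor the limit comes from the talk.

```python
def phase_imbalance_percent(currents):
    """Largest deviation from the mean phase current, as a percentage of the mean."""
    if len(currents) != 3:
        raise ValueError("expected one current per phase (L1, L2, L3)")
    mean = sum(currents) / 3
    if mean == 0:
        return 0.0  # no load at all counts as balanced
    return max(abs(c - mean) for c in currents) / mean * 100


def is_balanced(currents, limit_percent=10.0):
    """Compare the imbalance against an assumed acceptance limit."""
    return phase_imbalance_percent(currents) <= limit_percent
```

For example, currents of 12 A, 9 A and 9 A have a mean of 10 A and a largest deviation of 2 A, so the imbalance is about 20 %.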

Page 19: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Power in Data Center Rooms

[Diagram: Main power, Genset, ATS, UPS, Bypass, HVAC, Devices in racks; measurement points marked A and V]

Page 20: Tomáš Podermański ,  tpoder@cis.vutbr.cz
Page 21: Tomáš Podermański ,  tpoder@cis.vutbr.cz

temperature in datacenter

Page 22: Tomáš Podermański ,  tpoder@cis.vutbr.cz

temperature in datacenter

Page 23: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Server Monitoring

• Hardware
– Manufacturers' software support is required (Dell OpenManage, HP InsightControl, …)
– Chassis temperature
– Fan condition
– Power status
• Operating system
– CPU, load, memory, utilization, processes
• Disk subsystem
– External disk array with its own management port
– RAID status
– Disk condition (S.M.A.R.T.)

[Diagram labels: Monitoring system, SNMP, IPMI, Other]

Page 24: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Network Device Monitoring

• Hardware
– Chassis temperature
– Fan condition
– Power status
• State of the operating system
– CPU
– Load
– Memory

[Diagram labels: Monitoring system, SNMP]

[Image: front panel of a ProCurve Switch 4208vl-72GS (J9030A) with vl modules J9033A and J8768A]

Page 25: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Network Connection – L1 Monitoring

• Port status
– Link UP/DOWN
– Speed
– Errors on interfaces
– Traffic on interfaces
• Remote device status
– LLDP + data from MIB
– Remote interface, remote device, …
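Traffic and error figures are normally read as ever-increasing counters, so the monitoring side has to convert two samples into a rate and cope with the counter wrapping around. A minimal sketch, assuming 32-bit octet counters such as `ifInOctets` from the IF-MIB (use `bits=64` for the high-capacity `ifHCInOctets` variant); the function names are illustrative.

```python
def counter_delta(previous, current, bits=32):
    """Difference between two readings of a wrapping counter."""
    modulus = 1 << bits
    # Modular arithmetic gives the right answer across a single wrap-around.
    return (current - previous) % modulus


def bits_per_second(prev_octets, curr_octets, interval_s, bits=32):
    """Average throughput between two octet-counter samples."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return counter_delta(prev_octets, curr_octets, bits) * 8 / interval_s
```

The calculation is only valid if the counter wrapped at most once between the samples, which is why fast links need either short polling intervals or 64-bit counters.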

[Image: two ProCurve Switch 4208vl-72GS (J9030A) front panels with vl modules J9033A and J8768A]

Page 26: Tomáš Podermański ,  tpoder@cis.vutbr.cz

LAYER 2

Link

Page 27: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Network Connection – L2 Monitoring

• L2 monitoring
– An L2 ping could be very useful
– We have to use information obtained from other layers (L1, L3)
– Unfortunately, there is no simple way to check connectivity on a single VLAN
– One option is to obtain some information from the MIB, but it's not sufficient
• STP/MSTP information, root bridge
• VLANs on interfaces

Page 28: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Network Connection – L3 monitoring

• L3 monitoring
– ICMP and ping are still the most important
– The problem is how to monitor broken paths (the routing protocol usually masks the problem by routing around it)
• Check of the routing protocol state
• ICMP using source routing
– Flow-based monitoring
– Multicast monitoring

[Diagram labels: 147.229.6.2, 147.229.6.1, Data]

Page 29: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Network Connection – L3 monitoring

• L3 monitoring
– Checking that a router has the proper neighbor
– OSPF-MIB, RFC 4750
• ospfNbrRtrId
– VRRP-MIB, RFC 2787
• vrrpOperAdminState, vrrpOperState, vrrpOperMasterIpAddr

[Diagram labels: DR, BDR, Master, Backup]
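The neighbor check itself is a set comparison once the router IDs are known. The sketch below assumes the observed IDs have already been collected, for instance by walking `ospfNbrRtrId` over SNMP (that step is omitted here); the function name and the router IDs are hypothetical.

```python
def check_neighbors(expected, observed):
    """Compare expected OSPF neighbor router IDs with those actually seen."""
    expected, observed = set(expected), set(observed)
    return {
        "missing": sorted(expected - observed),     # adjacency lost
        "unexpected": sorted(observed - expected),  # unknown router on the segment
        "ok": expected == observed,
    }


# Hypothetical router IDs; `observed` would come from an SNMP walk of ospfNbrRtrId.
result = check_neighbors(
    expected=["192.0.2.1", "192.0.2.2"],
    observed=["192.0.2.2", "192.0.2.9"],
)
```

The same pattern would apply to VRRP, comparing the address reported in `vrrpOperMasterIpAddr` with the router that is supposed to be master.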

Page 30: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Multicast Monitoring

• Quite a demanding task
– For each stream the <S,G> path has to be created
– A continuously received and transmitted stream does not necessarily reveal a problem on the RP
– Almost impossible to monitor local infrastructure
• The only known tool: Multicast Beacon
– Written in Perl
– Dead project
• Last release 2006
• No VLAN support and no support for multiple interfaces on a single host
• Homepage unavailable
• Own solution: mcwatch
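Beacon-style tools typically have each agent send small numbered packets to a multicast group and let the receivers infer loss from gaps in the numbering. The sketch below shows only that bookkeeping, with an assumed payload layout; it is not the packet format or logic of Multicast Beacon or mcwatch, and the socket handling is left out.

```python
import struct

BEACON_FORMAT = "!IQ"  # assumed layout: 32-bit sequence number, 64-bit send time in ms


def build_beacon(sequence, sent_ms):
    """Encode one beacon payload in network byte order."""
    return struct.pack(BEACON_FORMAT, sequence, sent_ms)


def parse_beacon(payload):
    """Decode a payload into (sequence, sent_ms)."""
    return struct.unpack(BEACON_FORMAT, payload)


class LossTracker:
    """Per-sender loss estimate based on gaps in sequence numbers."""

    def __init__(self):
        self.first = None
        self.highest = None
        self.received = 0

    def record(self, sequence):
        """Account for one received beacon (duplicates are not filtered)."""
        if self.first is None:
            self.first = sequence
            self.highest = sequence
        self.highest = max(self.highest, sequence)
        self.received += 1

    def loss_ratio(self):
        """Fraction of beacons missing between the first and highest sequence seen."""
        if self.first is None:
            return 0.0
        expected = self.highest - self.first + 1
        return 1 - self.received / expected


tracker = LossTracker()
for seq in [1, 2, 4, 5, 8]:
    sequence, _ = parse_beacon(build_beacon(seq, 0))
    tracker.record(sequence)
```

Five of the eight expected beacons arrive in this example, so the tracker reports a loss ratio of 0.375.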

Page 31: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Multicast Agents

Data is periodically sent to a server

Page 32: Tomáš Podermański ,  tpoder@cis.vutbr.cz
Page 33: Tomáš Podermański ,  tpoder@cis.vutbr.cz

[Diagram labels: VLAN, Multicast Agent, POSIX SOCKET, APPLICATION, Multicast Beacon]

Page 34: Tomáš Podermański ,  tpoder@cis.vutbr.cz

[Diagram labels: VLAN, Multicast Agent, POSIX SOCKET, APPLICATION, mcwatch]

Page 35: Tomáš Podermański ,  tpoder@cis.vutbr.cz

NetFlow Monitoring

• Two NetFlow probes monitor both external connectivity lines
• NetFlow probes are connected directly to the optical fiber via a TAP
• Wire-speed accelerated probes (FlowMon)

[Diagram labels: CESNET PoP, CRS-1/16, University network, 10G Ethernet]

Page 36: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Flow Processing

• Two NetFlow probes monitor both external connectivity lines
• NetFlow probes are connected directly to the optical fiber via a TAP
• Wire-speed accelerated probes (FlowMon)

[Diagram labels: Nfcapd, Datastore, SQL aggregated, All administrators, Backbone administrator]

Page 37: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Flow Processing

• Data are stored on a storage server
– Data are kept for 30 days
– Analysis of security incidents, statistical purposes
– The big issue: how to select useful data and provide them to the people who need them
– Security matters
– Full data are accessible only to a small, trusted group of administrators
– For other IT staff (faculty administrators, IT managers), summarised data are accessible via a web interface

• Data are processed by common open source tools:
– nfdump
– A lot of trouble, but we don't have any better solution
– We are trying to optimise the current implementations
– Several theses on this topic are in progress

• Commercial tools: the situation is no better
– Usually plenty of nice charts and statistics
– But performance is often terrible (sampling is required)
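The split between full data and summarised data can be sketched as two small steps: prune records older than the retention period, then aggregate what remains into per-source totals for the wider audience. The record layout, function names and sample values below are assumptions made for illustration; this is not how nfdump stores or aggregates flows.

```python
from collections import defaultdict
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # "Data are kept for 30 days"


def prune(flows, now):
    """Drop flow records older than the retention period."""
    return [f for f in flows if now - f["start"] <= RETENTION]


def summarise_by_source(flows):
    """Aggregate full flow records into per-source totals for the summary view."""
    totals = defaultdict(lambda: {"flows": 0, "bytes": 0})
    for f in flows:
        entry = totals[f["src"]]
        entry["flows"] += 1
        entry["bytes"] += f["bytes"]
    return dict(totals)


# Made-up sample records; only start time, source address and byte count are modelled.
now = datetime(2009, 6, 1)
flows = [
    {"start": datetime(2009, 5, 31), "src": "192.0.2.10", "bytes": 1200},
    {"start": datetime(2009, 5, 30), "src": "192.0.2.10", "bytes": 800},
    {"start": datetime(2009, 5, 20), "src": "192.0.2.20", "bytes": 500},
    {"start": datetime(2009, 4, 1), "src": "192.0.2.30", "bytes": 9999},  # too old
]
summary = summarise_by_source(prune(flows, now))
```

Only the summary would be exposed through the web interface, while the pruned full records stay with the small group of trusted administrators.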

Page 38: Tomáš Podermański ,  tpoder@cis.vutbr.cz

LAYER 4-7

Transport, application and the others

Page 39: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Layer 7

• Many of our own plugins
– Eduroam/RADIUS monitoring
– DNS
– Database status
– Backup server status
– …

• Collected data are available to administrators at different levels
– Eduroam/RADIUS logs
– Mail logs (DNSBL, spam classification, statistics)
– WiFi/VPN connections
– …

Page 40: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Components in the Monitoring System

[Diagram labels: SNMP, Zabbix, Spinel; radius, icmp, mysql, snmp, xmon, netflow, maillogs, radiuslogs, incidents, wifilogs, honeypots, aggflow; zabbix, xwho, xhis, NetIs, nfdump]

Page 41: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Monitoring : Layers & Technology

Physical: power, cooling systems, temperature; servers and disk arrays; network devices
Link: port statistics, link status, number of errors; LLDP neighbour
Internet: ICMP tests using the source routing option; OSPF, VRRP peers; multicast traffic monitoring
Application: RADIUS, DNS; other services

[Side labels: zabbix, xwho, xhis, NetIs, nfdump; SNMP, zabbix, NetFlow, radius, ICMP, ICMPv6, Spinel]

Page 42: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Current problems

• SNMP protocol
– No alternative
– Many bugs in various implementations

• Absence of an L2 testing tool

• NetFlow
– We have plenty of data, but nobody knows how to process it effectively
– In some cases more detailed information is required than flow data provide

• IPv6 brings some new problems and challenges

Page 55: Tomáš Podermański ,  tpoder@cis.vutbr.cz

Brno University of Technology
CESNET z.s.p.o.

University Campus Network Monitoring in Everyday Life

Tomáš Podermański, tpoder@cis.vutbr.cz