
EGEE-III INFSO-RI-222667

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE and gLite are registered trademarks

Improving ENOC’s support for CODs

COD-18, Abingdon, UK

Guillaume Cessieux (CNRS, IN2P3-CC / EGEE SA2)

2008-12-03


GCX

Outline

• ENOC and COD interactions

• Status of work around network trouble tickets

• DownCollector
– Assessment
– Review of last 12 months

• Proposal to handle DownCollector’s troubles
– Processes
– Tools’ improvements

COD18 2008-12-03 2


EGEE Network Operating Centre

• ENOC
– Aiming to provide support for:
  • Sites
  • ROCs
  • CODs
– Hard to get feedback and requirements from SA1
  “Two different worlds”...
– Now real-life background with better vision

• ~0.5 FTE in EGI, main changes MUST happen before
– Drop unnecessary things, focus on the useful ones
– Network support has a wider role than the ENOC in EGI


Current status with COD

• Only DownCollector now seems to be used by CODs [ https://ccenoc.in2p3.fr/DownCollector/ ]
– Very efficient integration in COD’s dashboard

• SA2 wants to know how to better serve CODs around network support
– Regarding processes: balance between wait-and-see and over-engineering
– Regarding tools and integration: DownCollector, other tools, CIC dashboard, alarms…

• Use background to sketch wise, realistic and useful processes and tools


Around network trouble tickets (1/2)

• Currently TTdrawlight [ https://ccenoc.in2p3.fr/TTdrawlight/ ]
– Repository of network trouble tickets
– Not accurate enough and hard to use efficiently

• Network trouble tickets are not a panacea
– « Главным образом сеть вниз. Будет вверх скоро » (~ « Main router is down. Will be up soon. »)
  Targeted at a local community
– But often the only operational information available…

• Strong privacy issues around sharing network trouble tickets
– No filtering of sensitive information delivered (school, military…)
– Fear of comparison and competition
– Knowledge database of network trouble tickets compromised?


Around network trouble tickets (2/2)

• 19 NRENs currently send their tickets to the ENOC
– EGEE relies on networks from ~50 NRENs + GÉANT2
  We cover ~80% of European Grid sites
– 2800 e-mails for 900 tickets/month
– Really hard to deal with the meaning of tickets (location, duration...)

• Standardisation of network TT?
– Can enable painless, accurate and automatic management of TT
– Strong advances in this domain but hard to promote to NRENs

• Situation to be sorted out between NRENs & SA2
– Solve centralisation, accuracy and exposure of TT
– Then tools will easily follow


Around network monitoring

• Connectivity addressed with DownCollector
– But not performance

• Hard to get information on end-to-end performance
– Requires going into the details of network paths and devices
  300 certified sites, 50 NRENs... Inhomogeneous domains
– Network is shared, so it should be monitored once and not at project level
  Slowly converging toward perfSONAR, not yet mature

• EGEE network troubleshooting tool upcoming
– Lightweight package from SA2
– Prototype around January 2009


DownCollector (1/3)

• Now a key tool reporting TCP listening of Grid nodes

• 2-minute accuracy, ~2600 nodes polled
– Often the first to detect some failures

• GOCDB scheduled downtimes are managed
– Troubles not reported for sites in scheduled downtime
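
As a rough illustration of what "reporting TCP listening" with scheduled-downtime masking could look like, here is a minimal Python sketch. It is not DownCollector's actual code: the function names, the `(site, host, port)` tuple layout and the 5-second timeout are assumptions made for the example.

```python
import socket


def tcp_listening(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, unresolvable: all count as "not listening".
        return False


def poll(nodes, scheduled_downtime_sites):
    """Probe every node once, skipping sites currently in scheduled downtime.

    nodes: iterable of (site, host, port) tuples (hypothetical layout).
    scheduled_downtime_sites: set of site names to mask, e.g. taken from GOCDB.
    Returns {(site, host, port): bool} for the sites that were probed.
    """
    results = {}
    for site, host, port in nodes:
        if site in scheduled_downtime_sites:
            continue  # nothing is reported for sites in scheduled downtime
        results[(site, host, port)] = tcp_listening(host, port)
    return results
```

Running `poll` every two minutes over the node list would give the 2-minute accuracy mentioned on the slide.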


DownCollector (2/3)

• A trouble = all Grid hosts of a site unreached
– To avoid measuring host availability

• Network checkpoint = border router
– Demarcation point for ENOC’s responsibility
– Checked during trouble

• Three kinds of troubles
1. OFF-SITE: Network checkpoint NOT reached
   Fault in: WAN, MAN, NREN, GÉANT2, ISP...
2. ON-SITE: Network checkpoint reached
   Fault in: LAN, power, software...
3. UNKNOWN: No clear and reliable checkpoint, but site in trouble
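
The three-way classification can be written down as a small decision function. This is only a sketch of the rules stated on the slide; the names `Trouble` and `classify`, and the use of `None` for "no usable checkpoint", are assumptions for illustration.

```python
from enum import Enum


class Trouble(Enum):
    NONE = "no trouble"
    OFF_SITE = "OFF-SITE"
    ON_SITE = "ON-SITE"
    UNKNOWN = "UNKNOWN"


def classify(hosts_reached, checkpoint_reached):
    """Classify the state of one site.

    hosts_reached: list of booleans, one per Grid host of the site.
    checkpoint_reached: True or False for the border router, or None when
    the site has no clear and reliable checkpoint (assumed encoding).
    """
    # A trouble exists only when ALL Grid hosts of the site are unreached.
    if any(hosts_reached):
        return Trouble.NONE
    if checkpoint_reached is None:
        return Trouble.UNKNOWN
    # Checkpoint reached: the fault is behind it (LAN, power, software...).
    # Checkpoint not reached: the fault is in front of it (WAN, NREN, GÉANT2...).
    return Trouble.ON_SITE if checkpoint_reached else Trouble.OFF_SITE
```

For example, `classify([False, False], True)` yields `Trouble.ON_SITE`, while a single reachable host is enough to return `Trouble.NONE`.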

[Diagram labels: NREN X, GÉANT2, checkpoint, OFF-SITE, ON-SITE]

DownCollector (3/3)

• Is it trustworthy or biased?
– Is a failure reported from the ENOC a failure for the entire infrastructure?
  For ON-SITE troubles: ~YES
– What about French sites reached without using GÉANT2? Remote probes?
– 2 instances of DownCollector? ~NO

[Diagram labels: ENOC, RENATER, GÉANT2, NREN X, French site, Foreign site 1, Foreign site 2, Router A, Router B, Checkpoint for site 1]


Troubles detected by DownCollector

• 54% of detected problems are ON-SITE

• Scope
– 300 certified sites
– Last 12 months

Number of troubles per month:

            Min   Max   Avg   % Avg
OFF-SITE    157   354   248   28%
ON-SITE     273   615   467   54%
UNKNOWN      59   167   106   12%

[Chart: number of troubles. Annotation: troubles are not concentrated on a few sites!]


Troubles’ durations

Last 12 months troubles’ dispatching:

Noticed resolution time    OFF-SITE   ON-SITE
<= 5 min                   36.08%     42.71%
> 5 min and <= 30 min      45.04%     36.36%
> 30 min and <= 1h          6.48%      6.37%
> 1h and <= 4h              8.17%      7.67%
> 4h and <= 12h             3.14%      3.32%
> 12h and <= 1d             0.79%      1.93%
> 1d                        0.31%      1.65%

• 80% solved within 30 min: Pareto’s law

• The others
– OFF-SITE: avg 45 troubles/month
– ON-SITE: avg 85 troubles/month
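
The "80% solved within 30 min" figure can be checked by adding the first two buckets of each column. The short Python sketch below only redoes that arithmetic with the percentages from the table; the variable and function names are illustrative.

```python
# Share of troubles per resolution-time bucket, in percent, copied from the table.
# Bucket order: <=5 min, 5-30 min, 30 min-1h, 1h-4h, 4h-12h, 12h-1d, >1d.
OFF_SITE = [36.08, 45.04, 6.48, 8.17, 3.14, 0.79, 0.31]
ON_SITE = [42.71, 36.36, 6.37, 7.67, 3.32, 1.93, 1.65]


def share_within(shares, n_buckets):
    """Cumulative share (percent) of troubles resolved within the first n buckets."""
    return round(sum(shares[:n_buckets]), 2)


print(share_within(OFF_SITE, 2))  # 81.12 -> OFF-SITE troubles solved within 30 min
print(share_within(ON_SITE, 2))   # 79.07 -> ON-SITE troubles solved within 30 min
```

Both columns land close to 80%, which is what the slide summarises as Pareto's law.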


Yearly sum of downtimes per sites

• N.B.: unscheduled downtime
• Best: 4 minutes down
• Worst: 64 days (PPS…)

85% of sites have < 4d of downtime/year = 98.90% reachability/year

Chart annotations:
– 46 sites
– Last 12 months total downtime for site 46: 4d OFF-SITE, 17d ON-SITE
– 164 sites have less than 1d of downtime during the last 12 months
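
The 98.90% reachability figure corresponds to 4 days of downtime out of a year. A minimal check, assuming a 365-day year (the slide does not state the period length explicitly):

```python
def reachability(downtime_days, period_days=365):
    """Reachability in percent for a given total downtime over the period."""
    return 100.0 * (1.0 - downtime_days / period_days)


print(f"{reachability(4):.2f}%")   # 98.90%
print(f"{reachability(64):.2f}%")  # 82.47%, the worst case of 64 days
```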


First assessment

• Networks are quite reliable
– Few long outages on resilient transit networks
– ON-SITE troubles are significant
– 30 minutes seems a wise threshold
– DownCollector seems reliable and trustworthy enough

• Automatic management of network TT currently not reliable

• Currently few interactions between SA2 and CODs

• This was discussed with pole1 for improvements
– Thanks to them for their feedback; results follow


Proposal for troubles handling

• Map troubles handling around the three kinds of problem from DownCollector


ON-SITE | OFF-SITE | UNKNOWN

Create alarms in COD dashboard from DownCollector

Alarm hierarchy and masking

GGUS tickets created by CODs to sites after 30 minutes

Not ENOC’s responsibility

Currently not managed

ENOC’s responsibility

Allow flagging particular outages to focus on them (cf. next slide)

Threshold 30 minutes?

Only inform

Try to reduce the number of unknown troubles
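
One possible reading of this proposal, expressed as a Python sketch. The slide's column layout does not survive in this transcript, so the pairing of actions with trouble kinds is partly an assumption: ON-SITE with COD tickets after 30 minutes and OFF-SITE with ENOC follow-up match the surrounding slides, while attaching "only inform" to UNKNOWN is a guess. The function name and action strings are hypothetical.

```python
from datetime import timedelta

# Proposed threshold from the slides; still marked as an open question there.
THRESHOLD = timedelta(minutes=30)


def handle(kind, duration):
    """Return the list of actions for a trouble of the given kind and age.

    kind: "ON-SITE", "OFF-SITE" or "UNKNOWN" (as reported by DownCollector).
    duration: how long the trouble has been open, as a timedelta.
    """
    # Every trouble becomes an alarm in the COD dashboard.
    actions = ["raise alarm in COD dashboard"]
    if kind == "ON-SITE":
        # Not ENOC's responsibility: COD tickets the site once the threshold is passed.
        if duration >= THRESHOLD:
            actions.append("COD opens GGUS ticket to site")
    elif kind == "OFF-SITE":
        # ENOC's responsibility: resolution is devolved to the NREN or GÉANT2.
        actions.append("ENOC follows up with NREN/GÉANT2")
    elif kind == "UNKNOWN":
        # Assumption: unknown troubles are only shown, with no ticket created.
        actions.append("inform only")
    else:
        raise ValueError(f"unexpected trouble kind: {kind!r}")
    return actions
```

With this reading, a 10-minute ON-SITE trouble only raises an alarm, while the same trouble at 30 minutes also triggers a GGUS ticket.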


OFF-SITE troubles handling

• ENOC’s responsibility: devolving trouble resolution to NRENs/GÉANT2

• Targeted key information: expected end date
– Hard to get…

• Enable marking of particular outages
– Maybe then automatically create a ticket in ENOC’s helpdesk (GGUS) to exchange information with COD

[Illustration labels: « ENOC please follow that », #GGUS-41, #GGUS-42, #GGUS-43]

Proposal for tools (1/2)

• ENOC to work with sites to improve some network checkpoints
– Reduce number of unknown troubles (~12%, ~106/month)
– 351 sites in database: 32 (9%) without usable checkpoint
  [ https://ccenoc.in2p3.fr/DownCollector/?v=list_headnodes ]

• ENOC’s bar in COD dashboard

[Mock-up of ENOC’s bar. Legend: Trouble, OFF-SITE, ON-SITE, UNKNOWN. Time label: Now-5h]

Proposal for tools (2/2)

• Notification from DownCollector to site admins for long-standing outages (15 or 30 minutes?)
– Integration into Nagios not sufficient?
– Existing DownCollector feature: Subscribe to troubles
  [ https://ccenoc.in2p3.fr/DownCollector/?v=subscription ]
  Released with EGEE broadcast on 2008-07-16
  34 sites, 26 distinct e-mail addresses currently registered
  Noticed problem: e-mails not reaching disconnected sites…
  No threshold implemented yet

[Screenshot annotation: 1.5 - select threshold]

Actions list for tools

• ENOC
1. DownCollector
   Improve checkpoints
   Add threshold to subscribe feature?
2. Allow flagging important network outages and study a scheme to exchange information about them (GGUS, ENOC’s helpdesk...)
3. Provide ENOC’s bar

• CIC portal
1. Manage network alarms & alarm masking
2. Integrate ENOC’s bar


Conclusion

• It’s really going ahead

• Some implementation details to sort out
– Scalability, regionalisation
– Right now or waiting for your next model (alarm DB, R-COD etc.)?
– CIC portal & ENOC: priorities, manpower and roadmap

• Other ideas, feedback etc. always welcome
– Help design the network support you need


Questions?
