
LHC-OPN 2008, Madrid, 10-11th March. Bruno Hoeft, Aurelie Reymund

GridKa – DE-KIT procedures

Bruno Hoeft, LHC-OPN Meeting

10. – 11. 03. 08

LHC-OPN hardware at DE-KIT (GridKa)

A fully redundant border router setup is in place (resilience).

- Two border routers, Cisco Catalyst 6509, with 2 supervisor engines WS-SUP720-3B (IOS s72033_rp-IPSERVICESK9_WAN-VM, Version 12.2(33)SXF9)
  - line cards WS-X6704-10GE, equipped with single mode transceivers XENPAK-10GB-SR

- DFN: 2 Huawei DWDM systems
  - one DWDM provides the light colour (wavelength) from DE-KIT (GridKa) to CERN and SARA (direction north from Karlsruhe)
  - the second DWDM provides the light colour (wavelength) from DE-KIT (GridKa) to IN2P3 and CNAF (direction south from Karlsruhe)

The direction from Karlsruhe to CERN is north because the DANTE peering to DFN for the DFN/Dante link DE-KIT (GridKa) – CERN is located in Frankfurt.

DE-KIT LHC-OPN links

R-inet-gis-I

Interface (Layer 2)  VLAN  IP (Layer 3)        Link name (DFN)        Description
Te 7/2               10    192.16.166.34/30    GE10/HUA0674_FRA_FZK   (Frankfurt/Dante -> Geneva) CERN (fra-gen_LHC_CERN-DFN_06006)
Te 1/1               751   192.16.166.105/30   GE10/HUA0778_FZK_MUE   Muenster/Surfnet -> Amsterdam/SARA (DFN/Surfnet CBF)

R-inet-gis-II

Interface (Layer 2)  VLAN  IP (Layer 3)        Link name (DFN)        Description
Te 3/2               752   192.16.166.109/30   GE10/HUA1106_FZK_KEH   (Kehl) IN2P3 (DFN/RENATER CBF)
Te 2/2               750   192.16.166.101/30   GE10/HUA0673_BAS_FZK   (Milano) Bologna INFN (CNAF) (DFN/Switch/GARR CBF)
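Each link address in the table sits in a /30 subnet, which leaves exactly two usable host addresses: the local interface and its neighbour. The following minimal sketch derives the neighbour address arithmetically from the local one. The slides do not list the far-end addresses, so the derived values are a consequence of the /30 mask only, and the names LINKS and peer_address are hypothetical.

```python
import ipaddress

# Local interface addresses copied from the link table; the dictionary
# keys are just labels chosen for this sketch.
LINKS = {
    "CERN": "192.16.166.34/30",
    "SARA": "192.16.166.105/30",
    "IN2P3": "192.16.166.109/30",
    "CNAF": "192.16.166.101/30",
}


def peer_address(local_cidr: str) -> ipaddress.IPv4Address:
    """Return the other usable host address of a /30 point-to-point subnet."""
    iface = ipaddress.ip_interface(local_cidr)
    hosts = list(iface.network.hosts())  # a /30 has exactly two usable hosts
    if len(hosts) != 2 or iface.ip not in hosts:
        raise ValueError(f"{local_cidr} is not a host address in a /30 subnet")
    return next(h for h in hosts if h != iface.ip)


for site, cidr in LINKS.items():
    print(site, cidr, "->", peer_address(cidr))
```

A neighbour address computed this way is only a starting point for a reachability check; the actual far-end configuration has to be confirmed with the remote site.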

Operative service levels

Three service level entities:
- First level support: GGUS (5*8)
- General FZK network support: 5*8, plus an automated incident broadcast (SMS) 24*7. Telematis, an external company, covers the on-call incident broadcast support outside working hours.
- Expert support: 5*8, plus experts on call

• The combination of the three operative service levels provides 24*7 LHC-OPN support. This will match the requirements specified by the LHC experiments in their CDR.
• All operators will be granted fully transparent access to the DE-KIT (GridKa) wiki knowledge base, the DE-KIT (GridKa) log analyser facility and monitoring system, as well as the LHC-OPN monitoring systems:
  - DE-KIT (GridKa) local:
    - DE-KIT (GridKa) general monitoring site [http://www.gridka.de/monitoring/main.html]: cacti, netflow, ganglia, nagios, log analyser
    - iepm [http://192.108.45.161/iepm-bw.fzk.de/LHC-ATLAS.slac_wan_bw_tests.html#node1.uchicago.edu]
  - LHC-OPN central monitoring pages:
    - BGP
    - ENOC monitoring page
    - Dante E2ECU monitoring page
• Several DE-KIT (GridKa) local information sites are restricted to local access only.
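How the three levels combine into 24*7 coverage can be sketched as a simple lookup by time of day. The slides do not define the 5*8 window, so the sketch assumes Monday to Friday, 09:00 to 17:00 local time; that window and the names in_working_hours and responders are assumptions for illustration only.

```python
from datetime import datetime
from typing import List

# Assumption: "5*8" is read as Monday-Friday, 09:00-17:00 local time.
WORK_START_HOUR = 9
WORK_END_HOUR = 17


def in_working_hours(when: datetime) -> bool:
    """True if 'when' falls inside the assumed 5*8 window."""
    is_weekday = when.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and WORK_START_HOUR <= when.hour < WORK_END_HOUR


def responders(when: datetime) -> List[str]:
    """Return who covers an LHC-OPN incident raised at the given time."""
    if in_working_hours(when):
        return [
            "GGUS first level support",
            "General FZK network support",
            "Expert support",
        ]
    # Outside working hours: automated SMS broadcast handled by the
    # external on-call service, plus the experts on call.
    return [
        "Automated incident broadcast (SMS), on-call support by Telematis",
        "Experts on call",
    ]
```

Every point in time returns a non-empty list, which is the 24*7 property the slide describes.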

Incident origination:

- DE-KIT (GridKa) monitoring (LogMonitoring/PortMonitoring)
  - DE-KIT (GridKa) monitoring tools trigger an incident by automated email/SMS (e.g. router port up/down, flapping, BGP changes…), or router operators raise it
  - operation at DE-KIT (GridKa) will open a GGUS (or LCU) ticket
  - GGUS (or LCU) will control the ticket
  - the mainly involved tier-1 site (DE-KIT (GridKa)) will operate the ticket until it is solved or closed
  - appropriate partner(s) affected by the incident will be included in the ticket

- GGUS/LCU:
  - GGUS/LCU ticket initiated by a HEP user, a distant NOC/Tier-0/1 or an NREN
  - GGUS/LCU submits the ticket to the appropriate site (DE-KIT (GridKa))
  - the ticket will still be controlled by GGUS (/LCU), and DE-KIT (GridKa) will take over the operative part

- LIPCU (LCU)/E2ECU:
  - no difference to a GGUS/LCU ticket

- Information by a site:
  - request to open a GGUS/LCU ticket
  - however, appropriate actions will be taken immediately to solve the issue

- Maintenance/changes at DE-KIT (GridKa) / EGEE broadcast:
  - a GGUS (and/or LCU) ticket will be opened and announced in GOC; this should inform all LHC-OPN sites via EGEE broadcast as well as through GOC (each EGEE broadcast should have a corresponding ticket)

Incident and ticket handling

- the ticket of an incident is handled and controlled by either GGUS, LCU, or E2ECU
- operation of certain actions is transferred to the affected/corresponding location, such as a tier-1 centre (DE-KIT (GridKa)) or an NREN
- the management still resides with the ticket owner (GGUS, LCU/LIPCU, E2ECU)
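The split between the ticket owner, who keeps control, and the site that carries out the operative work can be sketched as a small data model. This is an illustration only: the Ticket class, its fields and its methods are hypothetical and do not correspond to the GGUS, LCU or E2ECU ticket systems.

```python
from dataclasses import dataclass, field
from typing import List

# Ticket owners named on the slide (LIPCU is treated as LCU here).
TICKET_OWNERS = {"GGUS", "LCU", "E2ECU"}


@dataclass
class Ticket:
    summary: str
    owner: str                      # controls the ticket for its whole life
    operator: str = ""              # site doing the operative work
    partners: List[str] = field(default_factory=list)
    state: str = "open"

    def __post_init__(self) -> None:
        if self.owner not in TICKET_OWNERS:
            raise ValueError(f"unknown ticket owner: {self.owner}")

    def assign_operator(self, site: str) -> None:
        """Hand the operative part to a site; ownership does not change."""
        self.operator = site

    def include_partner(self, partner: str) -> None:
        """Add a partner affected by the incident to the ticket."""
        if partner not in self.partners:
            self.partners.append(partner)

    def close(self) -> None:
        self.state = "closed"


ticket = Ticket("router port flapping", owner="GGUS")
ticket.assign_operator("DE-KIT (GridKa)")
ticket.include_partner("CERN")
```

Keeping owner and operator as separate fields mirrors the rule that management stays with the ticket owner even while a tier-1 centre or an NREN works on the incident.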

Operation of an Incident (1)

Layer-1 incident (an issue on layer 1 means that there is no light on the path)

- No light (Descr.: there is a light cut somewhere on the path)
  Actions:
  - check the router / transceiver / hardware / cable / logs
  - evaluate the impact (backup path available)
  - contact DFN and Di-Data as well as T0/T1
  - send an EGEE broadcast if there is no backup path (depending on estimated duration and impact) and escalate to experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG (Network Group)
  - External: DFN, Di-Data, T0/T1 network responsible, NREN / Dante
  Monitoring, e.g.: http://stats.geant2.net/e2emon/mon/G2_E2E_index_PROD.html

- Local hardware failure (Descr.: a hardware element seems to be deficient on the local network)
  Actions:
  - check the router / transceiver / hardware / cable / logs
  - evaluate the impact (backup path available)
  - contact T0/T1
  - send an EGEE broadcast if there is no backup path (depending on estimated duration and impact) and escalate to experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG
  - External: DFN, Di-Data, T0/T1 network responsible, NREN / Dante

- Remote hardware failure (Descr.: a hardware element seems to be deficient on the remote network)
  Actions:
  - check the router / transceiver / hardware / cable / logs
  - evaluate the impact (backup path available)
  - if nothing suspicious is detected, contact T0/T1
  - send an EGEE broadcast if there is no backup path (depending on estimated duration and impact) and escalate to experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG
  - External: DFN, Di-Data, T0/T1 network responsible, NREN / Dante
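The three layer-1 cases share most steps and differ mainly in whom to contact. A minimal sketch of that checklist follows. It treats the broadcast and the escalation as one step that applies only when no backup path exists, which is one reading of the slide, and it leaves out the weighting by estimated duration and impact because the slides give no thresholds. The names CONTACT_STEP and layer1_actions are hypothetical.

```python
from typing import List

# Contact step per layer-1 incident type, as listed on the slide.
CONTACT_STEP = {
    "no_light": "contact DFN and Di-Data as well as T0/T1",
    "local_hardware_failure": "contact T0/T1",
    "remote_hardware_failure": "if nothing suspicious is detected, contact T0/T1",
}


def layer1_actions(kind: str, backup_path_available: bool) -> List[str]:
    """Return the ordered action list for a layer-1 incident."""
    if kind not in CONTACT_STEP:
        raise ValueError(f"unknown layer-1 incident type: {kind}")
    actions = [
        "check the router / transceiver / hardware / cable / logs",
        "evaluate the impact (backup path available)",
        CONTACT_STEP[kind],
    ]
    # Assumption: broadcast and escalation are both tied to the
    # absence of a backup path.
    if not backup_path_available:
        actions.append("send an EGEE broadcast and escalate to experts")
    actions.append("report the incident and its solution in the documentation")
    return actions


for step in layer1_actions("no_light", backup_path_available=False):
    print("-", step)
```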

Operation of an Incident (2)

Layer-2 incident (the light on the path is maintained, but there is no connectivity to the neighbour)

- No MAC (Descr.: the MAC entry from the neighbour's network is missing)
  Actions:
  - check router configuration
  - evaluate the impact
  - contact T0/T1
  - send an EGEE broadcast if there is no backup path (depending on estimated duration and impact), escalate to experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG
  - External: T0/T1 network responsible

Operation of an Incident (3)

Layer-3 incident (a routing issue on layer 3: the light on the path is maintained, but the neighbour is not reachable)

- Routing issue: no route to neighbour (Descr.: the T1 centre cannot reach the neighbour)
  Actions:
  - check router configuration / routing / ACLs
  - evaluate the impact
  - contact T0/T1
  - send an EGEE broadcast if there is no backup path (depending on estimated duration and impact), escalate to experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG
  - External: T0/T1 network responsible

- BGP issue: no announcement from neighbour (Descr.: the BGP table shows)
  Actions:
  - check router configuration / routing / ACLs
  - evaluate the impact
  - contact T0/T1
  - send an EGEE broadcast if there is no backup path (depending on estimated duration and impact), escalate to experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG
  - External: T0/T1 network responsible

- BGP issue: no routes advertised to neighbour (Descr.: the local BGP does not advertise the network(s) correctly to the neighbour)
  Actions:
  - check router configuration / routing / ACLs
  - evaluate the impact
  - contact T0/T1
  - send an EGEE broadcast if there is no backup path (depending on estimated duration and impact), escalate to experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG
  - External: T0/T1 network responsible
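The layer-1, layer-2 and layer-3 descriptions imply a bottom-up triage order: first light on the path, then the neighbour's MAC entry, then routing and BGP. A minimal sketch of that order follows. The function name and its three boolean inputs are hypothetical; how each observation is obtained from the router or the monitoring tools is outside the sketch.

```python
def classify_incident(
    light_on_path: bool,
    neighbour_mac_seen: bool,
    routing_ok: bool,
) -> str:
    """Map three observations to the incident layer used in the procedures.

    The checks run bottom-up, because a fault on a lower layer makes the
    observations on the layers above it meaningless.
    """
    if not light_on_path:
        return "layer-1"  # no light: light cut or hardware failure
    if not neighbour_mac_seen:
        return "layer-2"  # light present, but no MAC entry from the neighbour
    if not routing_ok:
        return "layer-3"  # no route to neighbour, or a BGP issue
    return "no incident detected"


print(classify_incident(light_on_path=True, neighbour_mac_seen=False, routing_ok=False))
```

Here routing_ok stands for all three layer-3 cases at once (no route, no announcement received, no routes advertised); distinguishing them requires inspecting the routing and BGP tables.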

Maintenance window

The light path and/or the connectivity / reachability can be affected.
Descr.: the T1 centre plans maintenance on the network infrastructure.

Actions:
- send an EGEE broadcast
- contact T0/T1, NREN, Dante

Involved groups:
- Internal: GIS / NG / Security
- External: T0/T1 network responsible, NREN (DFN) / Dante

Configuration / Infrastructure change

- Configuration change (the light path and/or the connectivity / reachability can be affected. Descr.: the T1 centre makes a change to the network configuration)
  Actions:
  - send an EGEE broadcast
  - contact T0/T1, NREN, Dante
  Involved groups:
  - Internal: GIS / NG / Security
  - External: T0/T1 network responsible, NREN (DFN) / Dante

- Infrastructure change (the light path and/or the connectivity / reachability can be affected. Descr.: the T1 centre plans a change in the network infrastructure/topology)
  Actions:
  - send an EGEE broadcast
  - contact T0/T1, NREN, Dante
  Involved groups:
  - Internal: GIS / NG / Security
  - External: T0/T1 network responsible, NREN (DFN) / Dante

- General remarks:
  - All actions involving the LHC-OPN shall, as far as they can be planned, be announced 3 days in advance where possible (ticket, GOC, EGEE broadcast).
  - Changes to the infrastructure (e.g. routing, reorganisation of router ports) shall be discussed with the affected site, CERN and the coordination unit (LCU/LIPCU).
  - The configuration of the DE-KIT (GridKa) installation will be documented, and all changes will be included in the documentation.
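The 3-day announcement rule for planned actions can be checked mechanically. The sketch below reads "3 days" as 72 hours between the announcement and the start of the change; whether calendar or working days are meant is not stated in the slides, so that reading is an assumption, as are the names ADVANCE_NOTICE and announced_in_time.

```python
from datetime import datetime, timedelta

# Assumption: "3 days in advance" is read as at least 72 hours.
ADVANCE_NOTICE = timedelta(days=3)


def announced_in_time(announced_at: datetime, change_at: datetime) -> bool:
    """True if a planned change was announced at least 3 days ahead."""
    return change_at - announced_at >= ADVANCE_NOTICE


# A change on 10 March 09:00 announced on 7 March 09:00 meets the rule.
print(announced_in_time(datetime(2008, 3, 7, 9, 0), datetime(2008, 3, 10, 9, 0)))
```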