FP6−2004−Infrastructures−6-SSA-026409
www.eu-eela.org
E-infrastructure shared between Europe and Latin America
EGEE Operation Procedures
Alexandre Duarte
CERN IT-GD-OPS
Lost Island, First EELA ROC-on-Duty Tutorial, 29.11.2006
COD
• COD is Operator on Duty
• global LCG/EGEE GRID monitoring
• 1 (2) ROCs responsible for the whole GRID operations at a time
– 12 ROCs involved
– weekly rotation
• weekly WLCG-OSG-EGEE Operations meeting
– ROCs, Tier1, experiments
– all sites invited
COD Procedures
• https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures
• Look at monitoring tools
– SAM, gstat, Certificate Monitoring pages
• Open tickets using COD Dashboard
• Escalate expired tickets
• Process site responses (update tickets accordingly)
• End of duty: hand-over notes
COD Dashboard
• summary of necessary monitoring information + tools for ticket processing
• tickets linked to GGUS
• GOCDB information
• SAM + gstat results
• ticket creation and management tool
• tools for related e-mail
Escalation Procedure
• defines the steps to be taken during the lifetime of a ticket
• available on CIC Operations Portal
– (https://edms.cern.ch/document/701575)
• distinction between sites depending on the amount of resources
Escalation Steps
1. ticket creation
2. first mail (to: site + ROC)
3. second mail (to: site + ROC)
4. suspension from the GRID
• before 4.:
a) mail to ROC
b) weekly operations meeting call
c) mail to OMC for validation
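The four steps and the checks required before suspension can be sketched as a small state machine. This is only an illustration of the list above, not part of any COD tool: the names `Step`, `PRE_SUSPENSION_ACTIONS` and `next_step` are hypothetical.

```python
from enum import IntEnum


class Step(IntEnum):
    """Escalation steps, numbered as in the list above."""
    TICKET_CREATION = 1
    FIRST_MAIL = 2    # to: site + ROC
    SECOND_MAIL = 3   # to: site + ROC
    SUSPENSION = 4    # suspension from the GRID


# Actions the list requires before step 4 (items a, b, c).
PRE_SUSPENSION_ACTIONS = (
    "mail to ROC",
    "weekly operations meeting call",
    "mail to OMC for validation",
)


def next_step(current, actions_done=()):
    """Return the step that follows `current`.

    Moving to SUSPENSION is refused until every pre-suspension
    action has been recorded in `actions_done`.
    """
    if current == Step.SUSPENSION:
        raise ValueError("suspension is the last escalation step")
    following = Step(current + 1)
    if following == Step.SUSPENSION:
        missing = [a for a in PRE_SUSPENSION_ACTIONS if a not in actions_done]
        if missing:
            raise ValueError("before suspension: " + ", ".join(missing))
    return following
```

For example, `next_step(Step.SECOND_MAIL)` raises an error until all three actions are passed in `actions_done`.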
Escalation Procedure
• site categories
– low: CPU < 20
– normal: 20 < CPU < 100
– high: 100 < CPU
• between steps 2.-3. and 3.-4.
– low + normal: 3 days
– high: 1 day
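A minimal sketch of how the categories and delays above could be computed. Two points are assumptions, since the slide does not settle them: sites with exactly 20 or 100 CPUs are placed in the higher category, and "days" are taken as calendar days. The function names are hypothetical.

```python
from datetime import date, timedelta

# Delay between escalation steps 2.-3. and 3.-4., per site category.
ESCALATION_DELAY_DAYS = {"low": 3, "normal": 3, "high": 1}


def site_category(cpus):
    """Classify a site by its number of CPUs.

    The thresholds are strict inequalities in the source, so the
    handling of exactly 20 and exactly 100 CPUs is an assumption.
    """
    if cpus < 20:
        return "low"
    if cpus < 100:
        return "normal"
    return "high"


def next_deadline(last_mail, cpus):
    """Date on which the next escalation step becomes due.

    Calendar days are assumed; the source does not say whether
    working days are meant.
    """
    delay = ESCALATION_DELAY_DAYS[site_category(cpus)]
    return last_mail + timedelta(days=delay)
```

A site with 500 CPUs mailed on 29.11.2006 would be due for the next step one day later, while a site with 10 CPUs would have three days.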
Escalation Procedure
[Flowchart. Boxes: Create ticket, Extend deadline, Escalate, Suspend site, Close ticket. Decisions: when deadline reached, "Problem solved?", "last escalation?". Edge labels: yes, no, no, site responds, mail, mail.]
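The flowchart's arrows are not recoverable from the labels alone, so the sketch below is one plausible reading of them, not the documented procedure: when a deadline is reached, a solved problem closes the ticket, a site response extends the deadline, and otherwise the ticket is escalated or, at the last escalation, the site is suspended. The function name and the ordering of the checks are assumptions.

```python
def on_deadline(problem_solved, site_responded, last_escalation):
    """Decide what to do with a ticket whose deadline has been reached.

    Assumed reading of the flowchart labels; the order of the
    checks is an interpretation, not taken from the source.
    """
    if problem_solved:
        return "close ticket"
    if site_responded:
        return "extend deadline"
    if last_escalation:
        return "suspend site"
    return "escalate"
```

Under this reading a site is suspended only when the problem is unsolved, the site has not responded, and no further escalation step remains.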
What a site should do
• Look at the monitoring tools (SAM)
– try to notice & fix failures before the CODs
• COD notification about a failure
– fix it ASAP
• Scheduled downtime
– announce it in advance
– announce when it's finished
• problems → contact the ROC
– best way: create a ticket
• question → ask the ROC