LCG WLCG Operations
John Gordon, CCLRC
GridPP18, Glasgow, 21 March 2007
LCG WLCG = 3 Grids
EGEE + OSG + NDGF
Would like it to be one seamless grid, but not yet
High-level tasks like Simulation Production can be split into 3 parts and farmed out
Interoperability has some successes in job submission and information publishing
For us, WLCG Operations = EGEE Operations
Many parts to the infrastructure – concentrate here on the Production Service
How does it relate to you?
What action can you take?
LCG The EGEE Infrastructure
Test-beds & Services:
  Certification Testbeds (SA3)
  Pre-production Service
  Production Service
Support Structures:
  Operations Coordination Centre
  Regional Operations Centres
  Global Grid User Support
  EGEE Network Operations Centre (SA2)
  Operational Security Coordination Team
Security & Policy Groups:
  Operations Advisory Group (+NA4)
  Joint Security Policy Group
  EuGridPMA (& IGTF)
  Grid Security Vulnerability Group
Infrastructure:
• Physical test-beds & services
• Support organisations & procedures
• Policy groups
LCG Middleware Release
Technical Coordination Group
Agrees the contents and priorities for what goes into the integration and testing process
Not all desired new components or updates may make the next distribution
Depends on priorities and urgency for other pieces
Moving away from big-bang releases to component upgrades
Concept of a baseline release and then updates and patches
New baseline when significant changes (dependencies, …)
LCG Certification
Extensive certification test-bed:
  Close to 100 machines involved
  Main test-bed at CERN, test-beds for specific tasks at SA3 partner sites
  Emulate the deployment environments, or at least the main ones …
Certification testing:
  Installation and configuration
  Component (service) functionality
  System testing (trying to emulate real workloads and stress testing)
  Beginning to use virtualization to simplify the testing environment
Deployment into the pre-production system:
  Final step of certification – validation by real sites
  Validation by applications, which also allows apps to be prepared for new versions
Mostly hidden from you, but a lot of effort goes into it.
LCG Operations
Operations Meetings
Weekly reports
GGUS
TPM, COD
Accounting
Monitoring
LCG Grid management: structure
Operations Coordination Centre (OCC)
  management, oversight of all operational and support activities
Regional Operations Centres (ROC)
  providing the core of the support infrastructure, each supporting a number of resource centres within its region
  Grid Operator on Duty
Resource centres
  providing resources (computing, storage, network, etc.)
Grid User Support (GGUS)
  At FZK, coordination and management of user support, single point of contact for users
LCG Grid monitoring
The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources
[Diagram: Regional Operations Centres, each with its Resource Centres; OSCT; Grid Operator on Duty (COD). Label: Monitoring shows a problem]
LCG Grid Operator on Duty
Role:
  Watch the problems detected by the grid monitoring tools
  Problem diagnosis
  Report these problems (GGUS tickets)
  Follow and escalate them if needed (well-defined procedure)
  Provide help, propose solutions
  Build and maintain a central knowledge database (wiki)
Who?
  10 ROC teams working in pairs (one lead and one backup) on a weekly rotation
LCG Grid monitoring tools
Tools used by the Grid Operator on Duty team to detect problems
Distributed responsibility
CIC portal
  single entry point
  Integrated view of monitoring tools
Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
Grid Operations Centre Core Database (GOCDB)
GIIS monitor (Gstat)
GOC certificate lifetime
GOC job monitor
Others
LCG COD Tickets
Don’t ignore them!
If problems seem to fix themselves (BDII load) then keep some stats (tickets/interventions) and report to Jeremy/Philippa
Don’t just fix problems
  Report trends, repeat problems, solutions
The problem at your site is often a symptom of an underlying problem
Middleware, deployment, configuration, documentation.
Your intervention might help to fix them
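One way to keep the statistics suggested on this slide is a simple tally of tickets per problem type, noting whether each one needed a manual intervention. This is a minimal sketch only: the record layout and the problem labels are hypothetical, not a GGUS export format.

```python
from collections import Counter

# Hypothetical ticket log kept by a site admin:
# (problem label, needed a manual intervention?)
tickets = [
    ("BDII load", False),
    ("BDII load", False),
    ("SRM timeout", True),
    ("BDII load", True),
]

def ticket_stats(log):
    """Return (tickets per problem, interventions per problem)."""
    raised = Counter(problem for problem, _ in log)
    intervened = Counter(problem for problem, by_hand in log if by_hand)
    return raised, intervened

def repeat_problems(log, threshold=2):
    """Problems seen at least `threshold` times, most frequent first."""
    raised, _ = ticket_stats(log)
    return [p for p, n in raised.most_common() if n >= threshold]

raised, intervened = ticket_stats(tickets)
for problem in raised:
    print(f"{problem}: {raised[problem]} tickets, {intervened[problem]} interventions")
```

A problem with many tickets but few interventions is a candidate for the "fixes itself" category worth reporting as a trend.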
LCG SAM Availability Algorithm
CE = OR of your CEs
SE = OR of your SEs
Up if CE.AND.SE.AND.BDII.AND.SRM
If Down Then Down until next Up
Availability = % of time Up
Reliability = % of time Up excluding Scheduled Downtime
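The algorithm above can be sketched in a few lines of Python. This is an illustration, not the SAM implementation: the `Sample` record, its field names and the example data are hypothetical. It assumes each test result holds until the next one, and it reads "excluding Scheduled Downtime" as dividing time Up by the time not covered by scheduled downtime.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Sample:
    """One round of test results for a site (hypothetical record layout)."""
    hour: float                   # time of the test, hours from start of period
    ce: List[bool]                # one result per CE
    se: List[bool]                # one result per SE
    bdii: bool
    srm: bool
    scheduled_down: bool = False  # site had declared a scheduled downtime

def site_up(s: Sample) -> bool:
    # CE = OR of your CEs, SE = OR of your SEs, Up if CE and SE and BDII and SRM
    return any(s.ce) and any(s.se) and s.bdii and s.srm

def availability_and_reliability(samples: List[Sample],
                                 period_end: float) -> Tuple[float, float]:
    """Return (availability %, reliability %) from the first sample to period_end."""
    total = period_end - samples[0].hour
    up = scheduled = 0.0
    next_hours = [s.hour for s in samples[1:]] + [period_end]
    for sample, next_hour in zip(samples, next_hours):
        span = next_hour - sample.hour  # state holds until the next test result
        if site_up(sample):
            up += span
        elif sample.scheduled_down:
            scheduled += span
    availability = 100.0 * up / total
    unscheduled = total - scheduled
    reliability = 100.0 * up / unscheduled if unscheduled > 0 else 0.0
    return availability, reliability

samples = [
    Sample(0, ce=[True, False], se=[True], bdii=True, srm=True),
    Sample(6, ce=[True], se=[True], bdii=False, srm=True),
    Sample(12, ce=[False], se=[True], bdii=True, srm=True, scheduled_down=True),
    Sample(18, ce=[True], se=[True], bdii=True, srm=True),
]
availability, reliability = availability_and_reliability(samples, period_end=24)
print(f"availability {availability:.1f}%, reliability {reliability:.1f}%")
```

In the example, one failed CE out of two does not take the site down, but a single failed BDII test does.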
LCG What to do?
SAM Monitoring will be used to judge your site in many ways
  MoU, user satisfaction, Operations
Get used to it!
Complaining about the middleware doesn’t work
  Continue to raise tickets and operations reports
  Look for workarounds
Look at SAM failures for long-term fixes.
If you can’t reduce the number of problems, reduce their effect
  Automation, alarms
Many other tools
  Nagios?
Work on your problems but also work as a team.
LCG Accounting
Each Tier1 submits manual report of:
  CPU time, wall-clock time, disk, tape
  Allocated and used
  Per LHC VO
Aggregated into a monthly report
  Which accumulates through the year
Compared with MoU and installed capacity
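The aggregation described on this slide amounts to a running total per VO compared against a pledge. A minimal sketch, assuming hypothetical usage figures and pledges in arbitrary units (none of these numbers come from a real report):

```python
from itertools import accumulate

# Hypothetical monthly CPU usage per LHC VO (arbitrary units), January to March
monthly_used = {"atlas": [100, 120, 90], "cms": [80, 0, 60]}
# Hypothetical MoU pledge per month, in the same units
mou_pledge = {"atlas": 110, "cms": 70}

def year_to_date(monthly):
    """Running total per VO, as the monthly report accumulates through the year."""
    return {vo: list(accumulate(values)) for vo, values in monthly.items()}

def fraction_of_pledge(monthly, pledge):
    """Used resources as a fraction of the pledge over the months reported so far."""
    return {vo: sum(values) / (pledge[vo] * len(values))
            for vo, values in monthly.items()}

totals = year_to_date(monthly_used)
shares = fraction_of_pledge(monthly_used, mou_pledge)
for vo, share in shares.items():
    print(f"{vo}: {totals[vo][-1]} used to date, {share:.0%} of pledge")
```

The same comparison could be made against installed capacity by swapping in a different denominator.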
LCG Automated Accounting
This report is being automated
  From March the results will be taken from APEL
  Overlap with manual report for 3 months
Storage Accounting too (Greg’s talk)
Once automatic, easy to extend to Tier2s
  Be warned!
LCG What to do
Study APEL for your site
  Look for gaps in data
  Check SI2K values published
  Compare with local records
  Check Storage Accounts
If you are not being used by VOs, investigate
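Two of the checks listed on this slide, looking for gaps and comparing with local records, are easy to script once the published figures are in hand. The sketch below assumes a hypothetical layout, a dictionary of daily CPU hours keyed by date; it does not use any real APEL interface, and the 5% tolerance is an arbitrary choice.

```python
from datetime import date, timedelta

def missing_days(published, start, end):
    """Dates in [start, end] with no published accounting record."""
    n_days = (end - start).days + 1
    all_days = (start + timedelta(days=n) for n in range(n_days))
    return [d for d in all_days if d not in published]

def mismatches(published, local, tolerance=0.05):
    """Dates where published and local figures differ by more than `tolerance` (relative)."""
    bad = []
    for day, local_hours in sorted(local.items()):
        pub = published.get(day)
        if pub is None:
            continue  # a gap, already reported by missing_days
        if abs(pub - local_hours) > tolerance * max(local_hours, 1e-9):
            bad.append(day)
    return bad

# Hypothetical daily CPU hours: published centrally vs. the site's own batch logs
published = {date(2007, 3, 1): 240.0, date(2007, 3, 2): 250.0, date(2007, 3, 4): 100.0}
local = {date(2007, 3, 1): 242.0, date(2007, 3, 2): 251.0,
         date(2007, 3, 3): 230.0, date(2007, 3, 4): 245.0}

gaps = missing_days(published, date(2007, 3, 1), date(2007, 3, 4))
diffs = mismatches(published, local)
print("gaps:", gaps)
print("mismatches:", diffs)
```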
LCG Summary
Act on trouble tickets
Work on improving your SAM figures
Check your accounting
LCG Message
Site view may be from the bottom up
  We are motivated to put constituent parts in place and run them well
WLCG view is from the top down.
  From up there they see the Tier1s clearly and are driving them
  They’ll spot you soon, so be prepared.
  Learn from the Tier1
GridPP has been a success in delivering to LHC …
  but the pressure will increase over 2007
Keep up the good work!