[email protected] lcg wlcg operations john gordon, cclrc gridpp18 glasgow 21 march 2007

LCG

WLCG Operations

John Gordon, CCLRC

GridPP18, Glasgow

21 March 2007

3 Grids

EGEE

OSG

Nordugrid

WLCG = 3 Grids

• EGEE + OSG + NDGF
• Would like it to be one seamless grid, but not yet
• High-level tasks like Simulation Production can be split into 3 parts and farmed out
• Interoperability has some successes in job submission and information publishing
• For us, WLCG Operations = EGEE Operations
• Many parts to the infrastructure; concentrate here on the Production Service
• How does it relate to you?
• What action can you take?

The EGEE Infrastructure

Test-beds & Services
• Certification Testbeds (SA3)
• Pre-production Service
• Production Service

Support Structures
• Operations Coordination Centre
• Regional Operations Centres
• Global Grid User Support
• EGEE Network Operations Centre (SA2)
• Operational Security Coordination Team

Security & Policy Groups
• Operations Advisory Group (+NA4)
• Joint Security Policy Group
• EuGridPMA (& IGTF)
• Grid Security Vulnerability Group

Infrastructure:
• Physical test-beds & services
• Support organisations & procedures
• Policy groups

Middleware Release

• Technical Coordination Group
  • Agrees the contents and priorities for what goes into the integration and testing process
  • Not all desired new components or updates may make the next distribution
  • Depends on priorities and urgency for other pieces
• Moving away from big-bang releases to component upgrades
  • Concept of a baseline release and then updates and patches
  • New baseline when significant changes (dependencies, …)

Certification

• Extensive certification test-bed
  • Close to 100 machines involved
  • Main test-bed at CERN, test-beds for specific tasks at SA3 partner sites
• Emulate the deployment environments
  • Or at least the main ones …
• Certification testing
  • Installation and configuration
  • Component (service) functionality
  • System testing (trying to emulate real workloads and stress testing)
  • Beginning to use virtualization to simplify the testing environment
• Deployment into the pre-production system
  • Final step of certification: validation by real sites
  • Validation by applications, which also allows apps to be prepared for new versions
• Mostly hidden from you, but a lot of effort goes into it.

Operations

• Operations Meetings
• Weekly reports
• GGUS
• TPM, COD
• Accounting
• Monitoring

Grid management: structure

• Operations Coordination Centre (OCC)
  • Management, oversight of all operational and support activities
• Regional Operations Centres (ROC)
  • Providing the core of the support infrastructure, each supporting a number of resource centres within its region
• Grid Operator on Duty
• Resource centres
  • Providing resources (computing, storage, network, etc.)
• Grid User Support (GGUS)
  • At FZK: coordination and management of user support, single point of contact for users

Grid monitoring

The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources.

[Diagram: the OSCT, the Grid Operator on Duty (COD), Regional Operations Centres and their Resource Centres. Label: "Monitoring shows a problem".]

Grid Operator on Duty

Role:
• Watch the problems detected by the grid monitoring tools
• Problem diagnosis
• Report these problems (GGUS tickets)
• Follow and escalate them if needed (well-defined procedure)
• Provide help, propose solutions
• Build and maintain a central knowledge database (WIKI)

Who?
• 10 ROC teams working in pairs (one lead and one backup) on a weekly rotation

Grid monitoring tools

• Tools used by the Grid Operator on Duty team to detect problems
• Distributed responsibility
• CIC portal
  • Single entry point
  • Integrated view of monitoring tools
• Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
• Grid Operations Centre Core Database (GOCDB)
• GIIS monitor (Gstat)
• GOC certificate lifetime
• GOC job monitor
• Others

COD Tickets

• Don’t ignore them!
• If problems seem to fix themselves (BDII load) then keep some stats (tickets/interventions) and report to Jeremy/Philippa
• Don’t just fix problems
  • Report trends, repeat problems, solutions
• The problem at your site is often a symptom of an underlying problem
  • Middleware, deployment, configuration, documentation
  • Your intervention might help to fix them

SAM Availability Algorithm

• CE = OR of your CEs
• SE = OR of your SEs
• Up if CE.AND.SE.AND.BDII.AND.SRM
• If Down, then Down until next Up
• Availability = % of time Up
• Reliability = % of time Up excluding Scheduled Downtime
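
The algorithm above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SAM's actual implementation: the `Sample` layout, the function names and the use of hours as the time unit are all hypothetical, each status is assumed to hold until the next sample, and "excluding Scheduled Downtime" is read as removing scheduled windows from both the Up time and the total.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Sample:
    """One round of test results for a site (hypothetical layout)."""
    time: float        # hours since the start of the reporting period
    ce: List[bool]     # one result per CE at the site
    se: List[bool]     # one result per SE at the site
    bdii: bool
    srm: bool


def site_up(s: Sample) -> bool:
    # CE = OR of your CEs, SE = OR of your SEs,
    # Up if CE.AND.SE.AND.BDII.AND.SRM
    return any(s.ce) and any(s.se) and s.bdii and s.srm


def overlap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length of the intersection of intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def availability_and_reliability(
    samples: Sequence[Sample],
    period_end: float,
    scheduled: Sequence[Tuple[float, float]] = (),
) -> Tuple[float, float]:
    """Return (availability %, reliability %) from the first sample to period_end.

    `scheduled` holds (start, end) scheduled-downtime windows, assumed
    not to overlap each other.
    """
    ordered = sorted(samples, key=lambda s: s.time)
    start = ordered[0].time
    ends = [s.time for s in ordered[1:]] + [period_end]
    up = 0.0          # time Up
    up_unsched = 0.0  # time Up outside scheduled downtime
    for cur, end in zip(ordered, ends):
        if site_up(cur):  # status is held until the next sample
            length = end - cur.time
            in_sched = sum(overlap(cur.time, end, a, b) for a, b in scheduled)
            up += length
            up_unsched += length - in_sched
    total = period_end - start
    sched = sum(overlap(start, period_end, a, b) for a, b in scheduled)
    availability = 100.0 * up / total
    reliability = 100.0 * up_unsched / (total - sched) if total > sched else 0.0
    return availability, reliability


# Illustrative data only: three test rounds over a 24-hour period.
samples = [
    Sample(0.0, ce=[True, False], se=[True], bdii=True, srm=True),   # Up
    Sample(6.0, ce=[True, True], se=[True], bdii=False, srm=True),   # Down: BDII failed
    Sample(12.0, ce=[False, True], se=[True], bdii=True, srm=True),  # Up again
]
availability, reliability = availability_and_reliability(
    samples, period_end=24.0, scheduled=[(6.0, 9.0)]
)
print(f"availability={availability:.1f}% reliability={reliability:.1f}%")
```

Note how one failing CE does not bring the site down (OR), while a single failing BDII does (AND), and how declaring part of the outage as scheduled raises reliability but not availability.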

What to do?

• SAM Monitoring will be used to judge your site in many ways
  • MoU, user satisfaction, Operations
• Get used to it!
• Complaining about the middleware doesn’t work
  • Continue to raise tickets and operations reports
  • Look for workarounds
• Look at SAM failures for long-term fixes.
• If you can’t reduce the number of problems, reduce their effect
  • Automation, alarms
• Many other tools
  • Nagios?
• Work on your problems but also work as a team.

Accounting

• Each Tier1 submits a manual report of:
  • CPU time, wall-clock time, disk, tape
  • Allocated and used
  • Per LHC VO
• Aggregated into a monthly report
  • Which accumulates through the year
• Compared with MoU and installed capacity
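
As a rough illustration of how a monthly report accumulates through the year and is compared with the MoU, here is a minimal Python sketch. The function name, the data layout and the numbers are hypothetical and do not reflect the format of the actual report; the same running total could be kept per VO and per resource (CPU, disk, tape).

```python
from itertools import accumulate
from typing import List, Optional, Sequence, Tuple


def cumulative_report(
    monthly_used: Sequence[float],
    monthly_pledged: Sequence[float],
) -> List[Tuple[float, float, Optional[float]]]:
    """Running totals of used and pledged resource, plus the used/pledged ratio.

    Returns one (used to date, pledged to date, ratio) tuple per month;
    the ratio is None while nothing has been pledged.
    """
    used_to_date = accumulate(monthly_used)
    pledged_to_date = accumulate(monthly_pledged)
    return [
        (used, pledged, used / pledged if pledged else None)
        for used, pledged in zip(used_to_date, pledged_to_date)
    ]


# Illustrative numbers only, for one VO and one resource.
report = cumulative_report([100.0, 150.0, 50.0], [200.0, 200.0, 200.0])
for month, (used, pledged, fraction) in enumerate(report, start=1):
    print(month, used, pledged, fraction)
```

A single good month lifts the year-to-date ratio only partly, which is why the cumulative figure is a steadier measure than any one monthly value.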

Automated Accounting

• This report is being automated
• From March the results will be taken from APEL
  • Overlap with manual report for 3 months
• Storage Accounting too (Greg’s talk)
• Once automatic, easy to extend to Tier2s
  • Be warned!

What to do

• Study APEL for your site
  • Look for gaps in data
  • Check SI2K values published
  • Compare with local records
  • Check Storage Accounts
• If you are not being used by VOs, investigate
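
Two of these checks, looking for gaps and comparing with local records, are easy to script once you have daily figures to hand. The sketch below assumes you have already extracted per-day CPU hours for your site into plain dictionaries keyed by date; it does not use any APEL interface, and the function names, the 5% tolerance and the sample data are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List, Tuple


def missing_days(published_days: Iterable[date], first: date, last: date) -> List[date]:
    """Days in [first, last] for which no accounting record was published."""
    have = set(published_days)
    span = (last - first).days + 1
    all_days = (first + timedelta(days=i) for i in range(span))
    return [day for day in all_days if day not in have]


def discrepancies(
    published: Dict[date, float],
    local: Dict[date, float],
    tolerance: float = 0.05,
) -> List[Tuple[date, float, float]]:
    """Days where published and local CPU hours differ by more than `tolerance`.

    The difference is taken relative to the local figure; days missing from
    either side are skipped here and left to missing_days().
    """
    flagged = []
    for day in sorted(set(published) & set(local)):
        pub, loc = published[day], local[day]
        reference = max(abs(loc), 1e-9)  # avoid dividing by zero on idle days
        if abs(pub - loc) / reference > tolerance:
            flagged.append((day, pub, loc))
    return flagged


# Illustrative data only.
published = {date(2007, 3, 1): 100.0, date(2007, 3, 2): 80.0, date(2007, 3, 4): 120.0}
local = {
    date(2007, 3, 1): 100.0,
    date(2007, 3, 2): 100.0,
    date(2007, 3, 3): 90.0,
    date(2007, 3, 4): 121.0,
}
gaps = missing_days(published, date(2007, 3, 1), date(2007, 3, 4))
mismatches = discrepancies(published, local)
print("gaps:", gaps)
print("mismatches:", mismatches)
```

Running something like this on a schedule and raising an alarm on any gap fits the earlier advice about automation.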

Summary

• Act on trouble tickets
• Work on improving your SAM figures
• Check your accounting

Message

• Site view may be from the bottom up
  • We are motivated to put constituent parts in place and run them well
• WLCG view is from the top down.
  • From up there they see the Tier1s clearly and are driving them
  • They’ll spot you soon, so be prepared.
  • Learn from the Tier1
• GridPP has been a success in delivering to LHC
  • … but the pressure will increase over 2007
• Keep up the good work!
