Deployment Summary GridPP12 Jeremy Coles J.Coles@rl.ac.uk 1st February 2005


Page 1

Deployment Summary

GridPP12

Jeremy Coles, J.Coles@rl.ac.uk

1st February 2005

Page 2

Contents

• LCG operations workshop
• EGEE structures
• Operations model
• Current status
• Support
• Planning
• Metrics
• Some of the recurring issues at GridPP12
• Future activities

Page 3

Some operational issues

• Slow response from sites (central perception)
  – Upgrades, response to problems, etc.
  – Problems reported daily – some problems last for weeks
• Lack of staff available to fix problems – all on vacation, …
• Misconfigurations (units, gridmap-file builds, user profiles, pools, …)
• Lack of configuration management – problems that are fixed reappear
• Lack of fabric management
  – Is it GDA's responsibility to provide solutions to these problems?
• Lack of understanding (training?) – admins reformat disks of the SE, …
• Firewall issues – coordination between grid admins and firewall maintainers
• PBS problems – are we seeing the scaling limits of PBS?
• People not reading documentation …

These issues were the background to the workshop. (A minimal sketch of one such configuration check follows this list.)
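Several of the misconfigurations listed above (gridmap-file builds, user profiles, pool accounts) can be caught with a local sanity check before a user job ever hits them. Below is a minimal, hypothetical Python sketch that verifies every account referenced in the grid-mapfile actually exists on the node; the file path and the pool-account naming convention are assumptions, not something specified on these slides.

```python
#!/usr/bin/env python
"""Hypothetical sanity check: every account referenced in the grid-mapfile
should exist locally (for pool entries, at least the first pool account)."""
import pwd
import re
import sys

MAPFILE = "/etc/grid-security/grid-mapfile"   # assumed default location

LINE_RE = re.compile(r'^"(?P<dn>[^"]+)"\s+(?P<acct>\S+)')

def mapped_accounts(path):
    """Yield (DN, local account) pairs from a grid-mapfile."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            match = LINE_RE.match(line)
            if not match:
                continue
            acct = match.group("acct")
            if acct.startswith("."):
                # A leading dot marks a pool-account prefix; assume pool
                # accounts are named <prefix>001, <prefix>002, ... and test the first.
                acct = acct[1:] + "001"
            yield match.group("dn"), acct

def account_exists(name):
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False

def main():
    missing = [(dn, acct) for dn, acct in mapped_accounts(MAPFILE)
               if not account_exists(acct)]
    for dn, acct in missing:
        print("WARNING: %s maps to unknown account %s" % (dn, acct))
    sys.exit(1 if missing else 0)

if __name__ == "__main__":
    main()
```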

Page 4

LCG Workshop Nov 2004

• Operational Security
  – Incident handling process
  – Variance in site support availability
  – Reporting channels
  – Service challenges
• Operational Support
  – Workflow for operations & security actions
  – What tools are needed to implement the model
  – "24x7" global support: sharing the operational load (CIC-on-duty)
  – Communications (news)
  – Problem tracking system
  – Defining responsibilities: problem follow-up, deployment of new releases
  – Interface to user support

The LCG (EGEE) discussion covered a superset of the topics discussed at GridPP11.

Page 5

LCG Workshop Nov 2004

• Fabric Management
  – System installations (tools, interfacing tools with each other)
  – Batch/scheduling systems (OpenPBS/Torque, MAUI, fair-share)
  – Fabric monitoring
  – Software installation
  – Representation of site status (load) in the Information System
• Software Management
  – Operations on and for VOs (add/remove/service discovery)
  – Fault tolerance, operations on running services (stops, upgrades, restarts)
  – Link to developers
  – What level of intrusion can be tolerated on the WNs (farm nodes), e.g. application (experiment) software installation
  – Removing/(re-adding) sites with (fixed) troubles
  – Multiple views in the information system (maintenance)

Page 6

GDB

LCG Grid Deployment Board
• One representative from each country (with a Regional Centre) involved in LCG, and one representative from each experiment
• Chairman changes annually
• Meets in person once per month

What it does!
• Explores issues of global concern to the LCG community
• Makes decisions on deployment, operations and planning for LCG
• Provides mechanisms for resource forecasting

How?
• By calling upon experts to present the latest information on specific topics
• By creating and overseeing working groups to tackle important areas
  – Currently three groups: Security, Networking and Quattor

Who is involved in UKI?
• UK representative: John Gordon
• Security group coordinator: Dave Kelsey
• GDB secretary: Jeremy Coles

Page 7

Proposed escalation procedure

• Unstable and badly configured sites cause a big problem, so:
  – Unstable sites that have frequent problems will appear on a list of bad sites
  – Sites that do not respond to problem reports (including not upgrading middleware versions):
    • Will be removed from the information systems and maps
    • Will have to be re-certified to get back in
    • Will be reported to the GDB (LCG) or PMB (EGEE) representative as non-responsive

(A rough sketch of how such a list could be generated automatically follows.)
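As an illustration only, the "bad sites" list in the proposed procedure could be derived mechanically from the daily certification-test results. The input format and thresholds below are assumptions for the sketch, not the actual GOC tooling.

```python
"""Illustrative sketch: build a 'bad sites' escalation list from daily
certification-test results (assumed format: {site: [pass/fail per day]})."""

# Assumed thresholds, not from the slides
UNSTABLE_AFTER = 3      # consecutive failed days -> appears on the bad-sites list
REMOVE_AFTER = 7        # consecutive failed days -> candidate for removal from the IS

def consecutive_failures(history):
    """Count failed days at the end of a site's pass/fail history."""
    count = 0
    for passed in reversed(history):
        if passed:
            break
        count += 1
    return count

def escalate(results):
    """Classify sites according to the proposed escalation procedure."""
    bad, remove = [], []
    for site, history in sorted(results.items()):
        failed = consecutive_failures(history)
        if failed >= REMOVE_AFTER:
            remove.append(site)   # remove from information system/maps, re-certify later
        elif failed >= UNSTABLE_AFTER:
            bad.append(site)      # flag on the list of unstable sites
    return bad, remove

if __name__ == "__main__":
    # Toy example: the last seven daily test outcomes per site
    sample = {
        "SITE-A": [True, True, True, False, False, False, False],
        "SITE-B": [False] * 7,
        "SITE-C": [True] * 7,
    }
    bad, remove = escalate(sample)
    print("Unstable (bad-sites list):", bad)
    print("Escalate to GDB/PMB and remove from IS:", remove)
```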

Page 8

ROCs

Regional Operations Centres (ROCs)
• Part of the EGEE SA1 activity (http://egee-sa1.web.cern.ch/egee%2Dsa1/)
• The regions are CERN, France, Italy, UK & Ireland, Germany & Switzerland, Northern Europe, South West Europe, South East Europe, Central Europe and Russia

What they do
• Coordinate regional efforts in all activities (support, operations representation, security)
• Take up operations and deployment issues at cross-project meetings
• Provide a forum for agreeing the work needed – pre-production service

How?
• Set up ROC structures within the region
• Create common groups to work on areas like pre-production services and helpdesk interfaces
• Meet fortnightly via telephone (http://agenda.cern.ch/displayLevel.php?fid=339) to discuss regional issues and problems

Who is involved for the UK?
• General: John Gordon
• Support: Andy Richards
• Security: Romain Wartel

EGEE Background

Page 9

CICs

Core Infrastructure Centres (CICs)
• The CICs cover more than one region and deal with operations issues. There are currently 4 CICs: France, Italy, UK & Ireland and CERN
• Coordinated by the Operations Management Centre team at CERN
• Meet weekly via telephone (http://agenda.cern.ch/displayLevel.php?fid=258)
• Each CIC is "on-duty" for 1 week in 4

What they do!
• Operational and performance monitoring
• Troubleshooting and following up identified problems
• Operate general grid services (e.g. VO-related services)
• Provide information via the CIC portal http://cic.in2p3.fr/

How?
• Review monitoring data such as gstat and the daily test results
• Enter identified problems into Savannah (moving to the GGUS portal soon)
• Follow up problems using email and telephone contacts
• Troubleshoot using experts, the Wiki, etc.

Who is involved in UKI?
• Steve Traylen & Philippa Strange

EGEE Background

Page 10

CIC portal

http://cic.in2p3.fr/

Page 11

LCG-2/EGEE Operations

• Regional Operations Centres (9)
  – Act as front-line support for user and operations issues
  – Provide local knowledge and adaptations
• User Support Centre (GGUS)
  – At FZK – provides a single point of contact (service desk)
• Core Infrastructure Centres (4)
  – CICs build on the LCG GOC at RAL
  – Also run essential infrastructure services
  – Provide support for other (non-LHC) applications
  – Provide 2nd-level support to ROCs
• Coordination:
  – At CERN (Operations Management Centre) and CIC for HEP
• Taipei provides an operations centre and a 2nd instance of GGUS – a start on building round-the-clock coverage
• Discussions with Grid3/OSG on how to collaborate on operations support
  – Share coverage?

Page 12

(New) Operations Model

• The Operations Centre role rotates through the CICs
  – CIC on duty for one week
  – Procedures and tasks are currently defined
    • First operations manual is available (living document): tools, frequency of checks, escalation procedures, hand-over procedures
• CIC-on-duty website:
  – Problems are tracked with a tracking tool
    • Now central in Savannah
    • Migration to GGUS (Remedy) with links to the ROCs' problem-tracking tools
    • Problems can be added at GGUS or ROC level
  – CICs monitor the service, spot and track problems
    • Interact with sites on short-term problems (service restarts, etc.)
    • Interact with ROCs on longer, non-trivial problems
    • All communication with a site is visible to the ROC
    • Build FAQs
  – ROCs support
    • Installation, first certification
    • Resolving complex problems

Page 13

Operations Model

[Diagram: the OMC coordinates the CICs; the CICs work with the ROCs, which in turn support the Resource Centres (RCs); links also exist to other Grids.]

Page 14

How does support map onto this?

[Diagram: the same OMC/CIC/ROC/RC structure, with a ROC helpdesk in each region; problems flow through Savannah and GGUS.]

Page 15

How does user support map onto this?

[Diagram: the same OMC/CIC/ROC-helpdesk/RC structure, with the VOs (VO1, VO2, VO3) feeding into GGUS and the ROC helpdesks alongside Savannah.]

Page 16

How does user support map onto this?

[Diagram: as on the previous slide, with the VOs, ROC helpdesks, Savannah and GGUS.]

We need to work out a better model for this in the UK.

Page 17

Site updates

No | Site | GIIS Host | Sanity | GridPP11 | GridPP12
1 | BHAM-LCG2 | epcf36.ph.bham.ac.uk | ok | LCG-2_2_0 | LCG-2_2_0
2 | BitLab-LCG2 | dgc-grid-35.brunel.ac.uk | ok | LCG-2_2_0 | LCG-2_2_0
3 | CAVENDISH-LCG2 | farm012.hep.phy.cam.ac.uk | warn | LCG-2_1_1 | LCG-2_2_0
4 | IC-LCG2 | gw39.hep.ph.ic.ac.uk | warn | LCG-2_2_0 | LCG-2_3_0
5 | Lancs-LCG2 | lunegw.lancs.ac.uk | ok | LCG-2_1_1 | LCG-2_3_0
6 | LivHEP-LCG2 | hepgrid2.ph.liv.ac.uk | warn | LCG-2_1_1 | LCG-2_2_0
7 | ManHEP-LCG2 | bohr0001.tier2.hep.man.ac.uk | ok | LCG-2_1_1 | LCG-2_3_0
8 | OXFORD-01-LCG2 | t2ce01.physics.ox.ac.uk | ok | LCG-2_1_1 | LCG-2_3_0
9 | QMUL-eScience | ce01.ph.qmul.ac.uk | ok | LCG-2_1_0 | LCG-2_1_0
10 | RAL-LCG2 | lcgce02.gridpp.rl.ac.uk | warn | LCG-2_1_1 | LCG-2_2_0
11 | RALPP-LCG | heplnx131.pp.rl.ac.uk | ok | LCG-2_2_0 | LCG-2_3_0
12 | RHUL-LCG2 | ce1.pp.rhul.ac.uk | warn | LCG-2_1_1 | LCG-2_2_0
13 | ScotGRID-Edinburgh | glenlivet.epcc.ed.ac.uk | ok | LCG-2_0_0 | LCG-2_2_0
14 | scotgrid-gla | ce1-gla.scotgrid.ac.uk | warn | LCG-2_2_0 | LCG-2_2_0
15 | SHEFFIELD-LCG2 | ce.gridpp.shef.ac.uk | warn | LCG-2_1_1 | LCG-2_3_0
16 | UCL-CCC | ce-a.ccc.ucl.ac.uk | ok | LCG-2_2_0 | LCG-2_2_0
17 | UCL-HEP | pc31.hep.ucl.ac.uk | warn | LCG-2_1_1 | LCG-2_2_0
18 | Durham | | | | LCG-2_2_0

Most sites have stated an intention to move to SL3 and LCG 2.3 over the next few weeks

Page 18

Monitoring progress

http://goc.grid-support.ac.uk/gridsite/monitoring/

Produced:
• Certification tests
• GPPMon maps
• RSS feeds

Can we:
• Have a single view?
• Integrate network info?

(A small sketch of aggregating the RSS feeds into one view is given below.)
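One simple route to a "single view" is to aggregate the per-site RSS feeds the GOC monitoring pages already produce. A hedged Python sketch follows; the feed URLs are placeholders rather than real GOC endpoints, and the third-party feedparser library is an assumed dependency.

```python
"""Sketch of a 'single view' built from the monitoring RSS feeds.
The feed URLs below are placeholders, not real GOC endpoints."""
import feedparser  # third-party dependency, assumed available

FEEDS = [
    "http://goc.grid-support.ac.uk/gridsite/monitoring/example-feed-1.xml",  # placeholder
    "http://goc.grid-support.ac.uk/gridsite/monitoring/example-feed-2.xml",  # placeholder
]

def collect(feeds):
    """Merge all feed entries into one list, newest first."""
    entries = []
    for url in feeds:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            entries.append((entry.get("published", ""),
                            entry.get("title", ""),
                            entry.get("link", "")))
    return sorted(entries, reverse=True)

if __name__ == "__main__":
    for published, title, link in collect(FEEDS):
        print(published, title, link)
```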

Page 19

Today’s functional test results

The 06:05 daily tests record, per site: software version, CA RPMs version, BrokerInfo, R-GMA client, CSH test, BDII, LDAP (RM), and the data-management chain (copy-and-register WN -> default SE, copy default SE -> WN, replicate default SE to castorgrid, 3rd-party replicate castorgrid to default SE, 3rd-party copy castorgrid to WN, delete replica from default SE, GFAL infosys, lcg-cr -> default SE, lcg-cp default SE -> WN, lcg-rep default SE -> castorgrid, lcg-rep castorgrid to default SE, lcg-cp castorgrid to WN, lcg-del from default SE).

Results for UKI sites (test date 01/02/2005 06:05), condensed from the page:

Site | CE | Software version | BDII used | Result
BHAM-LCG2 | epcf36.ph.bham.ac.uk | LCG-2_2_0 | ldap://lxn1189.cern.ch:2170 | all tests OK
BITLab-LCG | dgc-grid-35.brunel.ac.uk | LCG-2_2_0 | | all tests ??
CAVENDISH-LCG2 | serv03.hep.phy.cam.ac.uk | LCG-2_2_0 | | all tests ??
csTCDie | gridgate.cs.tcd.ie | LCG-2_2_0 | ldap://cagraidsvr19.cs.tcd.ie:2170 | core tests OK; data-management tests FAILED apart from one (GFAL infosys)
Durham | helmsley.dur.scotgrid.ac.uk | LCG-2_2_0 | | all tests ??
IC-LCG2 | gw39.hep.ph.ic.ac.uk | LCG-2_3_0 | | all tests ??
Lancs-LCG2 | lunegw.lancs.ac.uk | LCG-2_3_0 | ldap://lxn1189.cern.ch:2170 | all tests OK
LivHEP-LCG2 | hepgrid2.ph.liv.ac.uk | (FAILED) | | all tests ??
ManHEP-LCG2 | bohr0001.tier2.hep.man.ac.uk | LCG-2_3_0 | ldap://lcgbdii02.gridpp.rl.ac.uk:2170 | all tests OK
OXFORD-01-LCG2 | t2ce01.physics.ox.ac.uk | LCG-2_3_0 | ldap://lcgbdii02.gridpp.rl.ac.uk:2170 | all tests OK
QMUL-eScience | ce01.ph.qmul.ac.uk | LCG-2_1_0 | ldap://lxn1189.cern.ch:2170 | most tests OK; lcg-* tests n/a on this older release
RAL-LCG2 | lcgce02.gridpp.rl.ac.uk | LCG-2_2_0 | ldap://lcgbdii02.gridpp.rl.ac.uk:2170 | most tests OK; several ??
RALPP-LCG | heplnx201.pp.rl.ac.uk | LCG-2_3_0 | ldap://lcgbdii02.gridpp.rl.ac.uk:2170 | all tests OK
RHUL-LCG2 | ce1.pp.rhul.ac.uk | (FAILED) | | all tests ??
ScotGRID-Edinburgh | glenlivet.epcc.ed.ac.uk | LCG-2_2_0 | ldap://lcgbdii02.gridpp.rl.ac.uk:2170 | all tests OK
scotgrid-gla | ce1-gla.scotgrid.ac.uk | (FAILED) | | all tests ??
SHEFFIELD-LCG2 | lcgce0.shef.ac.uk | LCG-2_3_0 | ldap://lcgbdii02.gridpp.rl.ac.uk:2170 | one core test FAILED; all data-management tests OK
UCL-CCC | ce-a.ccc.ucl.ac.uk | LCG-2_2_0 | ldap://lxn1189.cern.ch:2170 | all tests OK

Colour key on the original page: OK; job list match failed; critical tests failed; test job still waiting for execution; job submission failed (Job Manager); scheduled downtime.

• The tests show similar patterns across EGEE as a whole
• How can the tests be made more usable by those who can react? (A rough sketch of replaying the data-management sequence by hand is given below.)
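The data-management columns above correspond to standard replica-manager and lcg-utils operations (lcg-cr, lcg-cp, lcg-rep, lcg-del). A site wanting to reproduce a failure before the next 06:05 run could replay roughly the same sequence; in the sketch below the SE hostnames, the LFN and the exact option spellings are assumptions that should be checked against the locally installed lcg_util version.

```python
"""Rough replay of the daily data-management tests using the lcg-utils
command-line tools via subprocess. SE names, the LFN and exact option
spellings are assumptions to be checked against the local installation."""
import subprocess
import tempfile

VO = "dteam"
DEFAULT_SE = "se.example.ac.uk"        # placeholder for the site's default SE
REMOTE_SE = "castorgrid.cern.ch"       # placeholder for the CASTOR SE used by the tests
LFN = "lfn:/grid/dteam/site-selftest"  # placeholder logical file name

def run(cmd):
    """Run one command, echoing it and reporting OK/FAILED like the test page."""
    print(" ".join(cmd))
    rc = subprocess.call(cmd)
    print("OK" if rc == 0 else "FAILED (rc=%d)" % rc)
    return rc == 0

if __name__ == "__main__":
    src = tempfile.NamedTemporaryFile(delete=False)
    src.write(b"functional test payload\n")
    src.close()

    steps = [
        # copy-and-register from the WN to the default SE
        ["lcg-cr", "--vo", VO, "-d", DEFAULT_SE, "-l", LFN, "file:" + src.name],
        # copy back from the default SE to the WN
        ["lcg-cp", "--vo", VO, LFN, "file:/tmp/selftest-copy"],
        # replicate to the remote SE, then delete all replicas again
        ["lcg-rep", "--vo", VO, "-d", REMOTE_SE, LFN],
        ["lcg-del", "--vo", VO, "-a", LFN],
    ]
    for cmd in steps:
        if not run(cmd):
            break
```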

Page 20

Accounting progress

http://goc.grid-support.ac.uk/gridsite/accounting/

Well done:
• Imperial College
• Manchester
• Oxford
• RAL Tier-1
• RAL PPD
• Edinburgh
• Glasgow
• UCL-CCC
• Durham

What next?
• More sites!!
• Provide older data
• Analyse & use

ALL sites need to keep their log files; details are in the accounting page FAQ. (A rough sketch of extracting per-VO CPU hours from such logs follows.)
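Keeping the batch logs matters because the accounting numbers (and the "CPU hours per VO" metric later in this talk) are derived from them. Below is a rough, hypothetical sketch of summing per-group CPU time from PBS/Torque accounting records; the log location and the record layout assumed here should be checked against the real server_priv/accounting files.

```python
"""Hypothetical sketch: sum CPU time per unix group (roughly per VO) from
PBS/Torque accounting records. The assumed record layout is
date time;type;jobid;key=value key=value ... and should be verified locally."""
import glob
from collections import defaultdict

LOG_GLOB = "/var/spool/pbs/server_priv/accounting/*"   # assumed default location

def hms_to_hours(value):
    """Convert an HH:MM:SS string to hours."""
    h, m, s = (int(x) for x in value.split(":"))
    return h + m / 60.0 + s / 3600.0

def cpu_hours_per_group(paths):
    totals = defaultdict(float)
    for path in paths:
        with open(path) as handle:
            for line in handle:
                parts = line.strip().split(";")
                if len(parts) < 4 or parts[1] != "E":   # only completed-job records
                    continue
                fields = dict(kv.split("=", 1) for kv in parts[3].split() if "=" in kv)
                group = fields.get("group", "unknown")
                cput = fields.get("resources_used.cput")
                if cput:
                    totals[group] += hms_to_hours(cput)
    return totals

if __name__ == "__main__":
    for group, hours in sorted(cpu_hours_per_group(glob.glob(LOG_GLOB)).items()):
        print("%-12s %8.1f CPU hours" % (group, hours))
```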

Page 21

Ganglia

Well done:
• Manchester
• Edinburgh
• Lancaster
• QMUL
• Sheffield
• Bristol
• Oxford
• Liverpool

What next?
• We need all sites
• Review against MoUs
• Use data for warnings? (a rough sketch is given below)

http://www.gridpp.ac.uk/ganglia/
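Turning the Ganglia data into warnings can be as simple as reading the XML that gmond serves on its default TCP port and flagging overloaded hosts. The host name and threshold in this sketch are illustrative assumptions.

```python
"""Sketch of turning Ganglia data into warnings: connect to a gmond (which
dumps its XML state on TCP port 8649) and flag hosts whose one-minute load
exceeds a threshold. Host name and threshold are assumptions."""
import socket
import xml.etree.ElementTree as ET

GMOND_HOST = "ganglia.example.ac.uk"   # placeholder gmond/gmetad host
GMOND_PORT = 8649                      # default gmond XML port
LOAD_WARN = 10.0                       # arbitrary example threshold

def fetch_xml(host, port):
    """Read the full XML dump that gmond sends on connect."""
    data = b""
    with socket.create_connection((host, port), timeout=10) as sock:
        while True:
            chunk = sock.recv(8192)
            if not chunk:
                break
            data += chunk
    return data

def warn_on_load(xml_bytes):
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "load_one" and float(metric.get("VAL")) > LOAD_WARN:
                print("WARNING: %s load_one=%s" % (host.get("NAME"), metric.get("VAL")))

if __name__ == "__main__":
    warn_on_load(fetch_xml(GMOND_HOST, GMOND_PORT))
```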

Page 22

Status of planning

Page 23

Status of planning

We have developed a high-level plan for deployment. The deliverables form part of the GridPP2 project map. Each area has consequences for Tiers 1, 2 and 3, for example in:

• Service challenges
• Data challenges
• Networking
• Security
• Resource provision
• Core services
• MoU commitments
• Functionality
• Accounting
• Scheduling of use
• Support
• …

The plan is still evolving and there is a lot of work here!

Page 24

What metrics and why?

• Number of sites in production – a simple count based on GOCDB information?
• Number of registered users – a count of certificates issued?
• Number of active users
• Number of supported VOs
• Percentage of available resources utilised
• Peak number of concurrent jobs – measured by Gstat for grid jobs
• Average number of concurrent jobs – measured by Gstat for grid jobs
• Number of jobs not terminated by themselves or the batch system
• Accumulated site downtime per week (scheduled and unscheduled)
• Total CPUs deployed
• CPUs available
• Storage available and used
• CPU hours per VO
• UK relative contribution to experiments

This is the list shared before. It is the subject of the DTEAM discussion 16:00-18:00 today: what is actually useful now? (A toy computation of a few of these metrics is sketched below.)
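Whichever subset the DTEAM discussion settles on, most of these metrics reduce to simple aggregations over data we already collect (GOCDB, Gstat samples, accounting). A small illustrative sketch, with made-up sample data, just to show how little machinery is needed:

```python
"""Illustrative computation of a few proposed metrics from made-up
Gstat-style samples: peak/average concurrent jobs and the percentage
of available CPUs utilised."""

# Hypothetical hourly samples: (running jobs, CPUs available)
samples = [(420, 900), (510, 900), (640, 880), (580, 880), (700, 900)]

running = [r for r, _ in samples]
cpus = [c for _, c in samples]

peak_jobs = max(running)
avg_jobs = sum(running) / float(len(running))
utilisation = 100.0 * sum(running) / sum(cpus)

print("Peak concurrent jobs:    %d" % peak_jobs)
print("Average concurrent jobs: %.1f" % avg_jobs)
print("Resource utilisation:    %.1f%%" % utilisation)
```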

Page 25

LHCb DC feedback

LCG Job Submission Summary Table

Status | Jobs (k) | % of submitted | % of remaining
Submitted | 211 | 100.0% |
Cancelled | 26 | 12.2% |
Remaining | 185 | 87.8% | 100.0%
Aborted (not run) | 37 | 17.6% | 20.1%
Running | 148 | 70.0% | 79.7%
Aborted (run) | 34 | 16.2% | 18.5%
Done | 113 | 53.8% | 61.2%
Retrieved | 113 | 53.8% | 61.2%

LCG Efficiency: 61%

… but note Tony Cass's comments earlier about improving performance
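For reference, the quoted efficiency is just the ratio of done to remaining jobs; a quick check (the counts are in thousands, so small rounding differences against the table percentages are expected):

```python
# Quick check of the quoted LCG efficiency from the LHCb summary table
submitted, cancelled, done = 211, 26, 113
remaining = submitted - cancelled                                            # 185
print("Efficiency = done/remaining = %.0f%%" % (100.0 * done / remaining))   # ~61%
```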

Page 26

D0 MC performance

CE | Success | Failed
bohr0001.tier2.hep.man.ac.uk | 237 | 3
cclcgceli01.in2p3.fr | - | 14
grid-ce.physik.uni-wuppertal.de | - | -
gridkap01.fzk.de | 2564 | 19
golias25.farm.particle.cz | 198 | 15
heplnx131.pp.rl.ac.uk | 246 | 4
lcgce02.gridpp.rl.ac.uk | 293 | 10
mu6.matrix.sara.nl | 397 | 7
tbn18.nikhef.nl | 154 | 2
Total | 4089 | 74

Efficiency: 98%. Is this "much less than production quality"?

[Slide callouts: success rates of 98.8%, 98.4% and 96.7%, which correspond to the UK CEs bohr0001, heplnx131 and lcgce02 respectively.]

Page 27

D0 MC performance

LCG Efficiency: 99%. We need to be careful about what we mean!

Status | Jobs | Comment
Aborted | 35 | LCG error, e.g. file not found
Cancelled | 21 | Done by us for various reasons
Cleared | 5 | Done by us, enough events
Running | 10 | D0 software error: infinite loop
Scheduled | 3 | Can be OK, CZ disk crash
Total | 74 | Really 35 LCG errors
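The two efficiency figures on these slides come from counting different things as failures; a quick check using the totals above:

```python
# The two D0 efficiency figures count different things as failures
success = 4089
all_problem_jobs = 74        # everything in the error table
lcg_errors = 35              # only genuine LCG failures (the aborted jobs)

print("Raw efficiency:      %.1f%%" % (100.0 * success / (success + all_problem_jobs)))  # ~98.2%
print("LCG-only efficiency: %.1f%%" % (100.0 * success / (success + lcg_errors)))        # ~99.2%
```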

Page 28

GridPP12 Deployment issues
• Ability to plan (service challenges, networking, resources)
• Responsiveness of sites
• Security
• gLite, gLite, gLite
(These all concern what is now a "production" service.)

Concept behind the "pre-production" service:
• New middleware (gLite, …) can be demonstrated and validated before being deployed in production
• Understand the migration strategy to 2nd-generation middleware
• Use the existing production service as the baseline comparison

Page 29

GridPP12 Deployment issues
• Ability to plan (service challenges, networking, resources)
• Responsiveness of sites
• Security
• gLite, gLite, gLite
• Tier-2s operating as real Tier-2s
• Use of Tier-2s (experiment models)
• Metrics ("get fit" plan)
• Use of Tier-2 SEs
• SRM = Storage Really Matters!
• Engagement with experiments
• On-demand tests and other tools
• Support
• Communications
(These all concern what is now a "production" service.)

Page 30

Deployment web-pages

WORK IN PROGRESS

Page 31

Summary

• LCG workshop was useful. Some progress but not enough answers. Roadmaps proposed.

• EGEE has a deployment structure and GridPP deployment works within the UKI ROC/CIC

• We need to unravel the support problems and introduce something that works well for the UK

• Sites are responding to requests but sometimes slowly. Better communications are needed.

• We still have significant planning challenges to overcome (LCG SC1 failed, there is no clear gLite migration strategy – gLite could require a step back in deployment terms – and the experiment computing models have implications)

• By the next GridPP meeting we must be reporting on carefully defined metrics

• THANK YOU to everyone involved. Please remember - we need your feedback to improve the deployment mechanisms and GridPP service.