Tier-1 Overview Andrew Sansum 21 November 2007


Tier-1 Overview

Andrew Sansum, 21 November 2007


Overview of Presentations

• Morning Presentations
  – Overview (Me)
    • Not really overview: at request of Tony, mainly MoU commitments
  – CASTOR (Bonny)
    • Storing the data and getting it to tape
  – Grid Infrastructure (Derek Ross)
    • Grid Services
    • dCache future
    • Grid Only Access
  – Fabric Talk (Martin Bly)
    • Procurements
    • Hardware infrastructure (inc Local Network)
    • Operation
• Afternoon Presentations
  – Neil (RAL benefits)
  – Site Networking (Robin Tasker)
  – Machine Rooms (Graham Robinson)


What I’ll Cover

• Mainly going to cover MoU commitments
  – Response Times
  – Reliability
  – On-Call
  – Disaster planning

• Also cover staffing


GRIDPP2 Team Organisation

• Grid Services, Grid/exp Support: Ross, Condurache, Hodges, Klein (PPS), Vacancy
• Fabric (H/W and OS): Bly, Wheeler, Vacancy, Thorne, White (OS support), Adams (HW support)
• CASTOR SW/Robot: Corney (GL), Strong (Service Manager), Folkes (HW Manager), deWitt, Jensen, Kruk, Ketley, Jackson (CASE), Prosser (Contractor) (Nominally 5.5 FTE)
• Machine Room operations (1.5 FTE)
• Networking Support (0.5 FTE)
• Database Support (0.5 FTE) (Brown)
• Project Management (Sansum/Gordon/(Kelsey)) (1.5 FTE)


Staff Evolution to GRIDPP3

• Level
  – GRIDPP2 (13.5 GRIDPP + 3.0 e-Science)
  – GRIDPP3 (17.0 GRIDPP + 3.4 e-Science)
• Main changes
  – Hardware repair effort 1->2 FTE
  – New incident response team (2 FTE)
  – Extra CASTOR effort (0.5 FTE) (but this is already effort that has been working on CASTOR unreported)
  – Small changes elsewhere
• Main problem
  – We have injected 2 FTE of effort temporarily into CASTOR. The long-term GRIDPP3 plan funds less effort than current experience suggests we need.
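As a quick check on the staffing figures above, the arithmetic can be sketched in a few lines of Python. The numbers are copied from the slide; the variable names and the grouping of the changes are illustrative assumptions, not part of the original plan.

```python
# Staffing levels in FTE, copied from the slide.
gridpp2 = {"GRIDPP": 13.5, "e-Science": 3.0}
gridpp3 = {"GRIDPP": 17.0, "e-Science": 3.4}

# "Main changes" listed on the slide, expressed as FTE increases.
main_changes = {
    "hardware repair effort (1 -> 2 FTE)": 1.0,
    "new incident response team": 2.0,
    "extra CASTOR effort": 0.5,
}

total2 = sum(gridpp2.values())
total3 = sum(gridpp3.values())
growth = round(total3 - total2, 1)       # overall increase in FTE
listed = sum(main_changes.values())      # increase explained by the listed changes
remainder = round(growth - listed, 1)    # left over for "small changes elsewhere"

print(f"GRIDPP2 total: {total2:.1f} FTE")
print(f"GRIDPP3 total: {total3:.1f} FTE")
print(f"Growth: {growth:.1f} FTE, of which {listed:.1f} FTE is listed "
      f"and {remainder:.1f} FTE falls under small changes elsewhere")
```

On these figures the total rises from 16.5 to 20.4 FTE, and the three listed changes account for 3.5 of the 3.9 FTE increase.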


WLCG/GRIDPP MoU Expectations

Maximum delay in responding to operational problems:
  (a) Service interruption
  (b) Degradation of the capacity of the service by more than 50%
  (c) Degradation of the capacity of the service by more than 20%
Average availability measured on an annual basis:
  (d) During accelerator operation
  (e) At all other times

Service                                              (a)       (b)       (c)       (d)  (e)
Acceptance of data from the Tier-0                   12 hours  12 hours  24 hours  99%  n/a
Networking service to the Tier-0 during
  accelerator operation                              12 hours  24 hours  48 hours  98%  n/a
Data-intensive analysis services, including
  networking to Tier-0, Tier-1 centres               24 hours  48 hours  48 hours  98%  98%
All other services – prime service hours[1]          2 hours   2 hours   4 hours   98%  98%
All other services – other times                     24 hours  48 hours  48 hours  97%  97%

[1] Prime service hours are 08:00-18:00 during the working week of the centre, except public holidays.


Response Time

• Time to acknowledge fault ticket
• 12-48 hour response time outside prime shift
• On-call system should easily cover this, provided it is possible to automatically classify problem tickets by level of service required.
• Cover during prime shift is more challenging (2-4 hours) but is already a routine task for Admin on Duty
• To hit availability target must be much faster (2 hours or less)
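The automatic classification of tickets by level of service could look something like the following minimal sketch. The response times are copied from the WLCG/GRIDPP MoU expectations table; the service and severity keys, the function names, and the simplification of ignoring public holidays are all assumptions made for illustration, not a description of the actual Tier-1 ticketing system.

```python
from datetime import datetime, timedelta

# Maximum response delay in hours, copied from the MoU expectations table.
# Tuple order: (service interruption, degradation > 50%, degradation > 20%).
# The dictionary keys are hypothetical labels, not official service names.
MOU_RESPONSE_HOURS = {
    "tier0_data_acceptance": (12, 12, 24),
    "tier0_networking": (12, 24, 48),
    "data_intensive_analysis": (24, 48, 48),
    "other_prime_hours": (2, 2, 4),
    "other_off_hours": (24, 48, 48),
}
SEVERITY_INDEX = {"interruption": 0, "degraded_50": 1, "degraded_20": 2}


def in_prime_hours(when):
    """Prime service hours: 08:00-18:00, Monday to Friday.

    Public holidays are ignored in this sketch.
    """
    return when.weekday() < 5 and 8 <= when.hour < 18


def response_deadline(service, severity, opened):
    """Return the latest time by which the ticket must be acknowledged."""
    if service == "other":
        # "All other services" has different targets in and out of prime hours.
        service = "other_prime_hours" if in_prime_hours(opened) else "other_off_hours"
    hours = MOU_RESPONSE_HOURS[service][SEVERITY_INDEX[severity]]
    return opened + timedelta(hours=hours)


# A weekday-morning ticket and a weekend ticket for the same kind of fault.
weekday = datetime(2007, 11, 21, 10, 0)   # Wednesday
weekend = datetime(2007, 11, 24, 10, 0)   # Saturday
print(response_deadline("other", "interruption", weekday))
print(response_deadline("other", "interruption", weekend))
```

The weekday ticket must be acknowledged within 2 hours, while the weekend one has 24 hours, which is the gap an on-call rota would need to cover.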


Reliability

• Have made good progress in last 12 months
  – Prioritised issues affecting SAM test failures.
  – Introduced “issue tracking” and weekly reviews of outstanding issues.
  – Introduced resilience into trouble spots (but more still to do)
  – Moved services to appropriate capacity hardware, separated services, etc.
  – Introduced new team role: “Admin on Duty”. Monitoring farm operation, ticket progression, EGEE broadcast info.
• Best Tier-1 averaged over last 3 months (other than CERN).


RAL-LCG2 Availability


MoU Commitments (Availability)

• Really reliability (availability while scheduled up)
• Still tough – 97-99% service availability will be hard (1% is just 87 hours per year).
  – OPN reliability predicted to be 98% without resilience, site SJ5 connection is much better (Robin will discuss).
  – Most faults (75%) will fall outside normal working hours
  – Software components still changing (e.g. CASTOR upgrades, WMS) etc.
  – Many faults in 2008 will be “new”, only emerging as WLCG ramps up to full load.
  – Emergent faults can take a long time to diagnose and fix (days)
• To improve on current availability will need to:
  – Improve automation
  – Speed up manual recovery process
  – Improve monitoring further
  – Provide on-call
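To put the availability targets in concrete terms, the yearly downtime budget can be worked out as below. This is a rough sketch that assumes the service is scheduled up for the whole of a 365-day year and, for the second calculation, that faults arrive uniformly in time; neither assumption comes from the slides, and the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24   # 8760, ignoring leap years
HOURS_PER_WEEK = 7 * 24     # 168


def downtime_budget_hours(availability_percent):
    """Hours per year a service may be down and still meet the target."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


for target in (97, 98, 99):
    print(f"{target}% availability allows "
          f"{downtime_budget_hours(target):.1f} hours of downtime per year")

# Share of the week outside MoU prime service hours (08:00-18:00, Mon-Fri).
# If faults were spread evenly over time, this share of them would occur
# when no prime-shift cover is present.
prime_hours_per_week = 10 * 5
outside_fraction = 1 - prime_hours_per_week / HOURS_PER_WEEK
print(f"{outside_fraction:.0%} of hours fall outside prime service hours")
```

A 99% target leaves 87.6 hours per year, which matches the "1% is just 87 hours" figure above. Under the uniform-arrival assumption about 70% of faults would occur outside prime service hours, which is in the same range as the 75% quoted for normal working hours.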


On-Call

• On-Call will be essential in order to meet response and availability targets.

• On-Call project now running (Matt Hodges), target is to have on-call operational by March 2008.

• Automation/recovery/monitoring all important parts of on-call system. Avoid callouts by avoiding problems.

• May be possible to have some weekend on-call cover before March for some components.

• On-call will continue to evolve after March as we learn from experience.


Disaster Planning (I)

• Extreme end of availability problem. Risk analysis exists, but aging and not fully developed.

• Highest Impact risks:
  – Extended environment problem in machine room
    • Fire
    • Flood
    • Power Failure
    • Cooling failure
  – Extended network failure
  – Major data loss through loss of CASTOR metadata
  – Major security incident (site or Tier-1)


Disaster Planning (II)

• Some disaster plan components exist
  – Disaster plan for machine room. Assuming equipment is undamaged, relocate and endeavour to sustain functions but at much reduced capacity.
  – Datastore (ADS) disaster recovery plan developed and tested
  – Network plan exists
  – Individual Tier-1 systems have documented recovery processes and fire-safe backups or can be instanced from kickstart server. Not all of these are simple, nor are all fully tested.
• Key Missing Components
  – National/Global services (RGMA/FTS/BDII/LFC/…). Address by distributing elsewhere. Probably feasible and is necessary – 6 months.
  – CASTOR: all our data holdings depend on integrity of catalogue. Recovery from first principles not tested. Is flagged as a priority area but must be balanced against need to make CASTOR work.
  – Second, independent Tier-1 build infrastructure to allow us to rebuild Tier-1 at new physical location. Would allow us to address major issues such as fire. Major project – priority?


Conclusions

• Made a lot of progress in many areas this year. Availability improving, hardware reliable, CASTOR working quite well and upgrades on-track.

• Main challenges for 2008 (data taking)
  – Large hardware installations and almost immediate next procurement
  – CASTOR at full load
  – On-call and general MoU processes