Tier-1 – Final preparations for data

Andrew Sansum, 9th September 2009

Themes (last 9 months)

• Improve planning
• Recruitment
• Re-engineer production and operations processes
• Enhance resilience
• Test it works (STEP09)
• Move to R89
• “Test” new Disaster Management System
• Final preparations for data taking

10 April 2023 Tier-1 Status

The Plan

[Gantt chart, April to November. Rows: prepare for STEP; STEP; prepare for R89; R89 migration; CASTOR hardware resilience; new hardware; CASTOR upgrade (SRM + nameserver); SL5 upgrade; LFC/FTS/3D; test disaster management system; update, freeze and contingency periods; prepare for data taking.]

Recruitment complete

• Recruitment has been tough (but good team in place now)
  – Initially STFC freeze
  – Later, hard to recruit

[Chart annotation: STFC freeze]

Meeting Experiment Needs

• VO survey carried out in April
  – Based on a series of qualitative and quantitative questions
  – Very helpful and considered feedback from most significant VOs
• Generally very positive. Key findings:
  – Communication between Tier-1 and VOs generally working well
  – Production team have made a big difference
  – Meeting commitments/expectations of LHC VOs
  – VOs not always clear on Tier-1 priorities (since tried to address this by liaison meeting)
  – Non-LHC VOs particularly commented that, although support was good, Tier-1 did not always deliver service on agreed timescales (unfortunately intentional, reflecting priorities; expectations management?)
  – Documentation poor (need to work on this still)

Production Team/Production ops

• Daytime team of 3 staff (Gareth Smith, John Kelly, Tiju Idiculla)
  – Handle operation exceptions (NAGIOS alerts/pager callouts)
  – Track tickets
  – Monitor routine metrics, loads, network rates
  – Ensure operational status is communicated to VOs
  – Represent Tier-1 to WLCG daily operations
  – Oversee downtime planning, agree near term downtime plan
  – Oversee progression of Service Incident reports
  – (Re-)engineer operational processes
• Night-time/weekend team of 5 staff on-call at any time (2 hour response):
  – Primary on-call (triage and fix easy faults)
  – Secondary on-call: CASTOR, Grid on-call, Fabric, Database


Callout rate

• Big improvement over 2009 – recent deterioration owing to recent development activity and major incidents


Process Improvement


• Service is complex
• Frequent routine interventions, e.g.:
  – Add disk servers to class
  – Take disk servers offline
• Mistakes occur if not engineered out.
• Work in progress but critical if we are to meet high expectations

CASTOR (I)

• Process of gradual improvement, tracking down causes of individual transfer failures. Improving processes (e.g. disk server intervention status)
  – Applied ORACLE patch to fix the Big ID bug
  – Series of CASTOR minor version upgrades to 2.1.7-27. These have predominantly included bug-fixes, including one workaround to prevent the ORACLE Crosstalk bug from reoccurring
  – Reconfiguration of internal LSF scheduler to improve stability and scalability (move from NFS to HTTP)
  – Tuning changes
• ORACLE migration to new hardware (two EMC RAID arrays) which provides additional resilience, improved performance and better maintenance.
• SRM upgrades to version 2.7.15

CASTOR: Downtime (2008-2009)

[Chart: CASTOR downtime over 2008-2009, with the 2.1.7 upgrade and the R89 move annotated.]

CASTOR (III): Plans

• September
  – Nameserver upgraded to 2.1.8
  – SRM upgrade to version 2.8
  – CIP upgrade to version 2 (in progress)
• 2009Q4
  – Optimizing the ORACLE database
  – Additional resilience
  – Disaster recovery testing


STEP09: Operations Overview

• Generally very smooth operation:
  – Most service systems relatively unloaded, plenty of spare capacity
  – Calm atmosphere.
    • Daytime “production team” monitored service
    • Only one callout
    • Most of the team even took two days out off site for department meeting!
  – Very good liaison with VOs and good idea what was going on.
    • In regular informal contact with UK representatives
  – Some problems with CASTOR tape migration (3 days) on ATLAS instance but all handled satisfactorily and fixed. Did not visibly impact experiments.
• Robot broke down for several hours (stuck handbot led to all drives de-configured in CASTOR). Caught up quickly.
• Very useful exercise: learned a lot, but very reassuring
  – More at: http://www.gridpp.rl.ac.uk/blog/category/step09/

STEP09: Batch Service

• Farm typically running > 2000 jobs. By 9th June at equilibrium: (ATLAS 42%, CMS 18%, ALICE 3%, LHCb 20%)
• Problem 1: ATLAS job submission exceeded 32K files on CE
  – See hole on 9th. We thought ATLAS had paused, so it took time to spot.
• Problem 2: Fair shares not honoured as aggressive ALICE submission beat ATLAS to job starts.
  – Need more ATLAS jobs in queue faster. Manually cap ALICE. Fixed by 9th June. See decrease in (red) ALICE work.
• Problem 3: Occupancy initially poor (90%). Short on memory (2GB/core but ATLAS jobs needed 3GB vmem). Gradually increased MAUI over-commit on memory to 50%. Occupancy --> 98%.
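The memory arithmetic behind Problem 3 can be written out as a short calculation. This is only an illustrative sketch: the function names are hypothetical, and it assumes the over-commit fraction simply scales the per-core memory the scheduler is willing to hand out, which may not match MAUI's internal accounting.

```python
def schedulable_memory_gb(physical_gb_per_core: float, overcommit_fraction: float) -> float:
    """Memory per core the scheduler may allocate (assumed: physical * (1 + over-commit))."""
    return physical_gb_per_core * (1.0 + overcommit_fraction)


def job_fits(job_vmem_gb: float, physical_gb_per_core: float, overcommit_fraction: float) -> bool:
    """True if one job per core fits within the schedulable memory."""
    return job_vmem_gb <= schedulable_memory_gb(physical_gb_per_core, overcommit_fraction)


# Figures from the slide: 2GB/core of physical memory, ATLAS jobs needing 3GB vmem.
for overcommit in (0.0, 0.25, 0.5):
    limit = schedulable_memory_gb(2.0, overcommit)
    print(f"over-commit {overcommit:.0%}: limit {limit:.2f} GB/core, "
          f"3GB job fits: {job_fits(3.0, 2.0, overcommit)}")
```

Under this assumption a 3GB job only fits on every core once the over-commit reaches 50%, which is consistent with occupancy improving as the setting was raised.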

STEP09: Network

– Batch Farm drawing approx 3Gb/s from CASTOR during reprocessing. Peaked at 30Gb/s for CMS reprocessing without lazy download.

– Total OPN traffic. Inbound 3.5Gb/s, outbound 1Gb/s

– RAL->Tier-2 outbound rate average 1.5Gb/s but 6Gb/s spikes!

STEP09: Tape

• Tape system worked well. Sustained 4Gb/s during peak load on 13 drives (ATLAS+CMS), 15 drives with LHCb. We played with a mix of dedicated and shared drives (4 ATLAS, 4 CMS, 2 LHCb, 5 shared).
  – Typical average rate of 35MB/s per drive (1 day average)
  – Lower than we would like (looking for nearer 45MB/s)
  – On CMS instance, modified write policy gave > 60MB/s but reads more challenging to optimise.
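As a rough cross-check of the tape figures, the per-drive rates can be converted into an aggregate rate. This is a sketch only: the helper name is hypothetical, it assumes every drive streams at the same rate simultaneously, and it uses decimal units (1 MB = 8 Mb, 1 Gb = 1000 Mb). The quoted 4Gb/s is a peak figure while 35MB/s is a one-day average, so exact agreement is not expected.

```python
def aggregate_gbit_per_s(drives: int, mb_per_s_per_drive: float) -> float:
    """Aggregate tape rate in Gb/s, assuming all drives stream at the same rate."""
    return drives * mb_per_s_per_drive * 8 / 1000


# Drive counts and per-drive rates taken from the slide.
for drives in (13, 15):
    for rate in (35, 45):
        total = aggregate_gbit_per_s(drives, rate)
        print(f"{drives} drives at {rate} MB/s each: {total:.2f} Gb/s")
```

With these assumptions, 13 to 15 drives at 35MB/s give roughly 3.6 to 4.2 Gb/s, the same order as the sustained rate reported above.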

R89: Migration

• Migration planning started early 2008 (building early 2006)

• Detailed equipment documentation together with a requirements document was sent to vendors during September 2008

• Workshop hosted during November. Vendors committed to 3 racks (each) per day (we believe 5-6 was feasible)

• Orders placed at the end of November to move 77 racks of equipment (and robot) to an agreed schedule (T1=43 racks).

• Started 22nd July and ended 6th August
• Completed to schedule

R89 Migration

• 43 racks moved


[Timeline chart, Wed 17 to Mon 6. Milestones: drain WMS; drain CEs; batch workers start; drain FTS; critical services; CASTOR core + disk; batch workers complete; disk complete; CASTOR restarting; restart.]

Disasters: Swine Flu

• First to test new disaster management system
• Easy to handle: trivial to generate a contingency plan based on existing template.
• Situation regularly assessed. Tier-1 response initially running ahead of RAL site planning.
• Reached level 2 in DMS with assessment meetings every 2 weeks. Work mainly on remote working and communication strategy

• Now downgraded to level 1 until significant rise in case frequency

• Expect to dust off again before Christmas


Disasters: Air-conditioning(I)


• Two cooling failures in 3 days
• Monday day: both chiller systems shut down, restarted quickly
• Tuesday: one chiller shut down and failed over to second chiller
• Wednesday night: both chillers shut down, could not restart
• After third event decided not to restart Tier-1

[Chart: cold aisle and hot aisle temperatures (axis 15 to 45), annotated with shutdown, room reaches equilibrium, and chiller restart.]

Disasters: Air-conditioning (II)

• Initial post mortem started after first (daytime) event
  – Thermal monitoring, callout and automated shutdown in R89 not fully implemented/working correctly
  – Urgent remedial work underway
• Second, night-time incident raised further concerns
  – Tier-1 called out and rapidly escalated
  – But automated shutdown still in test mode
  – Forced to do manual shutdown
  – Operations thermal callout failed to work as required
  – Site security did not escalate BMS alarm (not expected alarm)
  – Escalation to building services very slow (owing to R89 being still under warranty/acceptance)
  – Chillers could not be restarted
  – No explanation of cause of outage
• Concluded we would not restart Tier-1 until issues resolved


Disasters: Air-conditioning (III)

• Critical Services continued to run:
  – Separate, redundant cooling system in UPS room.
  – Tape robotics and CASTOR core OK too (low temperature room)
• By Friday:
  – Tier-1 response at disaster level 3 (meeting held with VOs and PMB)
  – Building services believed that cooling was stable and fault could not recur.
  – All necessary automation, callout and escalation processes in place
  – Nevertheless Tier-1 team not prepared to run hardware unattended over the weekend.
• On Monday:
  – Full service restart
  – Plan to baby-sit service during Mon/Tue evening

• Forensics and post-mortem continued


Disasters: Air-conditioning (IV)

• Monday 10th incident believed to be caused by a planned reboot of the Building Management System (BMS)
  – Caused pumps to stop
  – Low pressure caused chiller valves to close
  – BMS returned but system deadlocked
• Tuesday 11th: single chiller trip followed by failover
  – Logs do not allow diagnosis.
• Wednesday 12th: BMS detected overpressure in cooling system and triggered shutdown
  – Probably true overpressure (1.9 bar)
  – Setting (1.7 bar) considered to be too low
  – Now raised to 2.5 bar and only calls out
  – System tested to 6 bar.
  – Investigations continue
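The change to the overpressure setting can be illustrated with a small sketch. The names below are hypothetical and this is not the BMS logic itself; it assumes the alarm trips when pressure strictly exceeds the threshold, and it encodes only what the slide states: the old 1.7 bar setting triggered a shutdown, while the new 2.5 bar setting only calls out.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    CALL_OUT = "call out"
    SHUT_DOWN = "shut down"


def bms_action(pressure_bar: float, threshold_bar: float, shutdown_on_trip: bool) -> Action:
    """Decide the response to a pressure reading (assumed: trip when reading > threshold)."""
    if pressure_bar <= threshold_bar:
        return Action.NONE
    return Action.SHUT_DOWN if shutdown_on_trip else Action.CALL_OUT


# Settings as described on the slide; the dictionary layout is illustrative.
OLD_SETTING = {"threshold_bar": 1.7, "shutdown_on_trip": True}
NEW_SETTING = {"threshold_bar": 2.5, "shutdown_on_trip": False}

for reading in (1.5, 1.9, 2.6):
    old = bms_action(reading, **OLD_SETTING)
    new = bms_action(reading, **NEW_SETTING)
    print(f"{reading} bar: old setting -> {old.value}, new setting -> {new.value}")
```

Under these assumptions, the 1.9 bar reading seen on the Wednesday would shut the system down with the old setting but cause no action with the new one.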

Disasters: Water Leak

• Water found dripping on tape robot!!!!!!
• “I don’t believe this is happening” moment
• Should not be able to happen as no planned water supplies above machine room.
• “Fortunately” Tier-1 already shut down, so turn off robot too.
• STK engineer investigates and concludes that damage is mainly superficial splash damage, drive heads not contaminated, tapes (60 splashed) probably OK.
• Indication that it had been occurring occasionally for several weeks


Disasters: Water leak

• Cause: condensation from 1st floor cooling system
  – Incorrect damper setting (air intake) led to excess condensation
  – Condensation collected in “drip tray” and pumped
  – Tray too small and pump inadequate
  – Water overflowed tray and tracked along floor to hole
• Remedy
  – Place umbrella over robot
  – Chillers switched off
  – 1st floor inspected daily!
  – Planning underway to re-engineer drip trays, pumps, alarms, etc.
  – Monitor tape error rate


Procurements

• Disk, CPU and robotics procurements delayed from January/February delivery dates
  – New SL8500 tape robot entirely for GRIDPP, 2PB of disk (24 drive units; 50% Areca/WD, 50% 3Ware/Seagate), CPU capacity
• Eventually delivered in May, but entangled in R89 migration
  – New robot in production in July
  – CPU completed acceptance test and deploying into SL5
  – One Lot of disk (1PB) ready for deployment
  – Second Lot failed acceptance (many drive ejects)
• Positive aspects of acceptance failure
  – Two Lot risk avoidance strategy worked
  – Vendor 1 week load test failed to find fault
  – Our 28 day acceptance caught fault before kit reached production

LFC, FTS and 3D

• Now complete
• Upgrade back end RAID arrays and Oracle servers
  – Replace elderly RAID arrays with pair of new EMC RAID arrays
  – Better support (we hope)
  – Better performance
• Move to ORACLE RAC for LFC/FTS (increased resilience)
• Separate ATLAS LFC from general LFC
• Upgrade 3D servers and move to new RAID arrays
• Work commenced on testing replication of LFC for disaster contingency


Quattor – Story so Far

• Began work in earnest in June 2009

• Set up Quattor Working Group instance to manage deployment and configuration of new hardware.

– leverages strong QWG support for gLite

• Have SL5 torque/maui server under Quattor control

• Are (as of today) deploying 220+ new WNs in SL5 batch service

• Significant work to get up and running. New way of working.

• Have uncovered and helped fix a number of bugs and issues in the process

Quattor – Next Steps

• As we move existing WNs to SL5 (need 75% of our capacity in SL5) we will quattorise them

• Move CEs and other grid service nodes to Quattor

• Gradually migrate non-grid services to Quattor control

• AQUILON
  – Database backend to Quattor developed by Morgan Stanley
    • Improves scalability and manageability (MS are managing >15,000 nodes)
  – Will first deploy at RAL
  – Then plan to make Aquilon usable by other grid sites as well

Dashboard

• Available at http://www.gridpp.rl.ac.uk/status
• Constantly evolving
  – Components can be added/updated/removed
• Present components
  – SAM Tests
    • Latest test results for critical services
    • Locally cached for 10 minutes to reduce load
  – Downtimes
    • Ongoing and upcoming downtimes pulled from GOCDB
    • Red colour for OUTAGE and yellow for AT_RISK
  – Notices
    • Latest information on Tier 1 operations
    • Only Tier 1 staff can post
  – Ganglia plots of key components from the Tier1 farm
• Feedback welcome
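Two of the dashboard behaviours described for this slide, caching SAM test results locally for 10 minutes and colouring downtimes by severity, can be sketched as below. This is a minimal illustration under stated assumptions, not the dashboard's actual code: `TTLCache`, `downtime_colour` and the `fetch` callable are hypothetical names, and the real SAM and GOCDB queries are not shown.

```python
import time
from typing import Any, Callable, Optional

CACHE_TTL_SECONDS = 10 * 60  # "locally cached for 10 minutes"


class TTLCache:
    """Cache the result of an expensive fetch and refresh it once the TTL has expired."""

    def __init__(self, fetch: Callable[[], Any], ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        # Fetch on first use, or when the cached copy is at least ttl seconds old.
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


# Colour scheme from the slide: red for OUTAGE, yellow for AT_RISK.
SEVERITY_COLOURS = {"OUTAGE": "red", "AT_RISK": "yellow"}


def downtime_colour(severity: str) -> str:
    """Map a downtime severity to its display colour; unknown severities raise KeyError."""
    return SEVERITY_COLOURS[severity]
```

A page handler would wrap the remote query once, for example `sam_results = TTLCache(fetch_sam_results)` with a site-specific `fetch_sam_results`, and call `sam_results.get()` on every request so that the remote service is queried at most once per 10 minutes.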

SL5 Migration (I)

• Next week, 14th - 18th September!
• LHC only (for now), but all VOs affected
• New batch service: lcgbatch01
• Quattorised torque/maui server
• Quattorised worker nodes
• New LCG-CEs (6-8) for LHC VOs; old LHC CEs (3-5) being retired, other CEs reconfigured
• Same queue configuration
• Use submit filter script on CEs to add SLX property requirement as required
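The submit filter idea mentioned for this slide can be sketched as follows. Torque's qsub can pass a job script through a site-supplied filter, which reads the script on standard input and writes the possibly modified script to standard output. Everything else here is an assumption for illustration: the property name `sl5`, the choice to leave jobs alone if they already carry a `nodes=` request, and the function names are hypothetical and not the filter used at RAL.

```python
import io
from typing import List, TextIO


def add_node_property(lines: List[str], prop: str) -> List[str]:
    """Insert a '#PBS -l nodes=1:<prop>' directive unless the script already requests nodes."""
    if any(line.startswith("#PBS -l nodes=") for line in lines):
        return list(lines)  # assumed policy: respect an explicit request
    directive = f"#PBS -l nodes=1:{prop}\n"
    # Directives must precede the first command, so place it just after any shebang line.
    insert_at = 1 if lines and lines[0].startswith("#!") else 0
    return lines[:insert_at] + [directive] + lines[insert_at:]


def run_filter(src: TextIO, dst: TextIO, prop: str) -> None:
    """Filter a job script from src to dst, adding the node property requirement."""
    dst.writelines(add_node_property(src.readlines(), prop))


# Demonstration on an in-memory script rather than real standard input.
demo_out = io.StringIO()
run_filter(io.StringIO("#!/bin/sh\necho hello\n"), demo_out, "sl5")
print(demo_out.getvalue(), end="")
```

In a deployed filter the last three lines would be replaced by a call such as `run_filter(sys.stdin, sys.stdout, "sl5")`, with the property chosen per CE.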

SL5 Migration (II)

• CPU08 going straight into SL5 now (~1800 job slots)
• All 64-bit capable existing WNs will be reinstalled eventually
• Non-LHC VOs will get new CE for migration after dust settles
• No plan to retire SL4 WNs completely yet

October Freeze

• No planned upgrades beyond September except possibly network upgrade.

• Recognise that some change will have to take place
• Need to put in place lightweight change control process
  – Allow changes where benefit outweighs risk
• Expect increased stability as downtimes reduce
• Apply pressure once more to reduce low grade failures.


Conclusion

• Recent staff additions have had a huge impact on quality of service we operate.

• Tier-1 development plan for 2009 nearly complete.

• Positive feedback from STEP09 that service meets requirements.

• Still a few major items (like SL5) to get through (fingers crossed).

• Probably still some R89 surprises in pipeline.

• Looking forward to start of data taking

