Tier-1 – Final preparations for data
Andrew Sansum, 9th September 2009
Themes (last 9 months)
• Improve planning
• Recruitment
• Re-engineer production and operations processes
• Enhance resilience
• Test it works (STEP09)
• Move to R89
• “Test” new Disaster Management System
• Final preparations for data taking
10 April 2023 Tier-1 Status
The Plan
[Figure: Gantt chart, April to November. Rows: STEP, R89 migration, CASTOR hardware resilience, new hardware, CASTOR upgrade, SRM + nameserver, SL5 upgrade, test disaster management system, LFC/FTS/3D. Phases: prepare for STEP, prepare for R89, prepare for data taking. Freeze, update and contingency periods are marked.]
Recruitment complete
• Recruitment has been tough (but good team in place now)
– Initially STFC freeze
– Later, hard to recruit
[Figure annotation: STFC freeze]
Meeting Experiment Needs
• VO survey carried out in April
– Based on a series of qualitative and quantitative questions
– Very helpful and considered feedback from most significant VOs
• Generally very positive. Key findings:
– Communication between Tier-1 and VOs generally working well
– Production team have made a big difference
– Meeting commitments/expectations of LHC VOs
– VOs not always clear on Tier-1 priorities (have since tried to address this through a liaison meeting)
– Non-LHC VOs particularly commented that, although support was good, Tier-1 did not always deliver service on agreed timescales (unfortunately intentional, reflecting priorities; expectations management?)
– Documentation poor (still need to work on this)
Production Team/Production ops
• Daytime team of 3 staff (Gareth Smith, John Kelly, Tiju Idiculla)
– Handle operation exceptions (NAGIOS alerts/pager callouts)
– Track tickets
– Monitor routine metrics, loads, network rates
– Ensure operational status is communicated to VOs
– Represent Tier-1 to WLCG daily operations
– Oversee downtime planning, agree near term downtime plan
– Oversee progression of Service Incident reports
– (Re-)engineer operational processes
• Night-time/weekend team of 5 staff on-call at any time (2 hour response):
– Primary on-call (triage and fix easy faults)
– Secondary on-call: CASTOR, Grid on-call, Fabric, Database
Callout rate
• Big improvement over 2009 – recent deterioration owing to recent development activity and major incidents
Process Improvement
• Service is complex
• Frequent routine interventions, e.g.:
– Add disk servers to class
– Take disk servers offline
• Mistakes occur if not engineered out
• Work in progress, but critical if we are to meet high expectations
CASTOR (I)
• Process of gradual improvement, tracking down causes of individual transfer failures. Improving processes (e.g. disk server intervention status)
– Applied ORACLE patch to fix the Big ID bug
– Series of CASTOR minor version upgrades to 2.1.7-27. These have predominantly included bug fixes, including one workaround to prevent the ORACLE crosstalk bug from recurring
– Reconfiguration of internal LSF scheduler to improve stability and scalability (move from NFS to HTTP)
– Tuning changes
• ORACLE migration to new hardware (two EMC RAID arrays), which provides additional resilience, improved performance and better maintenance
• SRM upgrades to version 2.7.15
CASTOR: Downtime (2008-2009)
[Figure: CASTOR downtime, 2008-2009, with the 2.1.7 upgrade and R89 marked]
CASTOR (III): Plans
• September
– Nameserver upgraded to 2.1.8
– SRM upgrade to version 2.8
– CIP upgrade to version 2 (in progress)
• 2009 Q4
– Optimising the ORACLE database
– Additional resilience
– Disaster recovery testing
STEP09: Operations Overview
• Generally very smooth operation:
– Most service systems relatively unloaded, with plenty of spare capacity
– Calm atmosphere
    • Daytime “production team” monitored service
    • Only one callout
    • Most of the team even took two days out off site for department meeting!
– Very good liaison with VOs and a good idea of what was going on
    • In regular informal contact with UK representatives
– Some problems with CASTOR tape migration (3 days) on ATLAS instance, but all handled satisfactorily and fixed. Did not visibly impact experiments.
    • Robot broke down for several hours (a stuck handbot led to all drives being de-configured in CASTOR). Caught up quickly.
• Very useful exercise: learned a lot, but very reassuring
– More at: http://www.gridpp.rl.ac.uk/blog/category/step09/
STEP09: Batch Service
• Farm typically running > 2000 jobs. By 9th June at equilibrium: (ATLAS 42%, CMS 18%, ALICE 3%, LHCb 20%)
• Problem 1: ATLAS job submission exceeded 32K files on CE
– See hole on 9th. We thought ATLAS had paused, so it took time to spot.
• Problem 2: Fair shares not honoured, as aggressive ALICE submission beat ATLAS to job starts.
– Need more ATLAS jobs in queue faster. Manually cap ALICE. Fixed by 9th June. See decrease in (red) ALICE work.
• Problem 3: Occupancy initially poor (initially 90%). Short on memory (2GB/core, but ATLAS jobs needed 3GB vmem). Gradually increased MAUI over-commit on memory to 50%. Occupancy --> 98%.
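The arithmetic behind Problem 3 can be checked with a minimal sketch. It assumes the over-commit behaves as a simple multiplier on the 2GB/core of physical memory; the function names are illustrative only and this is not the actual MAUI configuration.

```python
def effective_memory_per_core_gb(physical_gb: float, overcommit_fraction: float) -> float:
    """Memory per core the scheduler will hand out, assuming a simple multiplier."""
    return physical_gb * (1.0 + overcommit_fraction)


def job_fits(job_vmem_gb: float, physical_gb: float, overcommit_fraction: float) -> bool:
    """True if a job's virtual memory request fits within the over-committed limit."""
    return job_vmem_gb <= effective_memory_per_core_gb(physical_gb, overcommit_fraction)


PHYSICAL_GB = 2.0    # from the slide: 2GB/core
ATLAS_VMEM_GB = 3.0  # from the slide: ATLAS jobs needed 3GB vmem

# Step the over-commit up gradually, as described on the slide.
for fraction in (0.0, 0.25, 0.5):
    limit = effective_memory_per_core_gb(PHYSICAL_GB, fraction)
    fits = job_fits(ATLAS_VMEM_GB, PHYSICAL_GB, fraction)
    print(f"over-commit {fraction:.0%}: limit {limit:.2f}GB/core, ATLAS job fits: {fits}")
```

Under this assumption, 50% is the smallest over-commit at which a 3GB job fits on a 2GB core, which is consistent with the occupancy recovering only once that setting was reached.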
STEP09: Network
– Batch farm drawing approx. 3Gb/s from CASTOR during reprocessing. Peaked at 30Gb/s for CMS reprocessing without lazy download.
– Total OPN traffic: inbound 3.5Gb/s, outbound 1Gb/s
– RAL->Tier-2 outbound rate averaged 1.5Gb/s, but with 6Gb/s spikes!
STEP09: Tape
• Tape system worked well. Sustained 4Gb/s during peak load on 13 drives (ATLAS+CMS), 15 drives with LHCb. We played with a mix of dedicated and shared drives (4 ATLAS, 4 CMS, 2 LHCb, 5 shared).
– Typical average rate of 35MB/s per drive (1 day average)
– Lower than we would like (looking for nearer 45MB/s)
– On CMS instance, modified write policy gave > 60MB/s, but reads more challenging to optimise.
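To relate the aggregate and per-drive figures, here is a small unit-conversion sketch. It assumes decimal units (1Gb/s = 1000Mb/s, 8 bits per byte), and the helper names are illustrative. Note that the 4Gb/s figure is a peak while the 35MB/s figure is a 1 day average, so the two are not directly comparable.

```python
import math


def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert Gb/s to MB/s using decimal units and 8 bits per byte."""
    return gbit_per_s * 1000.0 / 8.0


def per_drive_rate_mb_s(total_gbit_per_s: float, drives: int) -> float:
    """Average per-drive rate if the aggregate is spread evenly over the drives."""
    return gbit_to_mbyte_per_s(total_gbit_per_s) / drives


def drives_needed(total_gbit_per_s: float, drive_rate_mb_s: float) -> int:
    """Whole number of drives needed to carry the aggregate at a given per-drive rate."""
    return math.ceil(gbit_to_mbyte_per_s(total_gbit_per_s) / drive_rate_mb_s)


PEAK_GBIT = 4.0  # from the slide: sustained 4Gb/s at peak

print(f"13 drives: {per_drive_rate_mb_s(PEAK_GBIT, 13):.1f} MB/s per drive")
print(f"15 drives: {per_drive_rate_mb_s(PEAK_GBIT, 15):.1f} MB/s per drive")
print(f"drives needed at 35MB/s: {drives_needed(PEAK_GBIT, 35.0)}")
print(f"drives needed at 45MB/s: {drives_needed(PEAK_GBIT, 45.0)}")
```

On these assumptions, reaching the 45MB/s target would carry the same peak load on fewer drives than the 35MB/s average allows.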
R89: Migration
• Migration planning started early 2008 (building early 2006)
• Detailed equipment documentation together with a requirements document was sent to vendors during September 2008
• Workshop hosted during November. Vendors committed to 3 racks (each) per day (we believe 5-6 was feasible)
• Orders placed at the end of November to move 77 racks of equipment (and robot) to an agreed schedule (T1=43 racks).
• Started 22nd July and ended 6th August
• Completed to schedule
R89 Migration
• 43 racks moved
[Figure: migration timeline, Wed 17 to Mon 6. Milestones in order: drain WMS; drain CEs; batch workers start; drain FTS; critical services; CASTOR core + disk; batch workers complete; disk complete; CASTOR restarting; restart.]
Disasters: Swine Flu
• First to test new disaster management system
• Easy to handle: trivial to generate a contingency plan based on existing template.
• Situation regularly assessed. Tier-1 response initially running ahead of RAL site planning.
• Reached level 2 in DMS with assessment meetings every 2 weeks. Work mainly on remote working and communication strategy
• Now downgraded to level 1 until significant rise in case frequency
• Expect to dust off again before Christmas
Disasters: Air-conditioning(I)
• Two cooling failures in 3 days
[Figure: cold aisle and hot aisle temperatures (axis 15 to 45), annotated with "shutdown", "chiller restart" and "room reaches equilibrium"]
• Monday day: both chiller systems shut down, restarted quickly
• Tuesday: one chiller shut down and failed over to second chiller
• Wednesday night: both chillers shut down and could not be restarted
• After third event, decided not to restart Tier-1
Disasters: Air-conditioning (II)
• Initial post mortem started after first (daytime) event
– Thermal monitoring, callout and automated shutdown in R89 not fully implemented/working correctly
– Urgent remedial work underway
• Second, night-time incident raised further concerns
– Tier-1 called out and rapidly escalated
– But automated shutdown still in test mode
– Forced to do manual shutdown
– Operations thermal callout failed to work as required
– Site security did not escalate BMS alarm (not an expected alarm)
– Escalation to building services very slow (owing to R89 still being under warranty/acceptance)
– Chillers could not be restarted
– No explanation of cause of outage
• Concluded we would not restart Tier-1 until issues resolved
Disasters: Air-conditioning (III)
• Critical services continued to run:
– Separate, redundant cooling system in UPS room
– Tape robotics and CASTOR core OK too (low temperature room)
• By Friday:
– Tier-1 response at disaster level 3 (meeting held with VOs and PMB)
– Building services believed that cooling was stable and fault could not recur
– All necessary automation, callout and escalation processes in place
– Nevertheless, Tier-1 team not prepared to run hardware unattended over the weekend
• On Monday:
– Full service restart
– Plan to baby-sit service during Mon/Tue evening
• Forensics and post-mortem continued
Disasters: Air-conditioning (IV)
• Monday 10th: incident believed to be caused by a planned reboot of the Building Management System (BMS)
– Caused pumps to stop
– Low pressure caused chiller valves to close
– BMS returned but system deadlocked
• Tuesday 11th: single chiller trip followed by failover
– Logs do not allow diagnosis
• Wednesday 12th: BMS detected overpressure in cooling system and triggered shutdown
– Probably true overpressure (1.9 bar)
– Setting (1.7 bar) considered to be too low
– Now raised to 2.5 bar and only calls out
– System tested to 6 bar
– Investigations continue
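The change in overpressure handling can be illustrated with a minimal sketch. The thresholds are the ones quoted on the slide; the strict "greater than" comparison, the function names and the `Action` type are assumptions for illustration and do not represent the actual BMS logic.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    CALLOUT = "callout"
    SHUTDOWN = "shutdown"


OLD_TRIP_BAR = 1.7     # original setting, considered too low
NEW_CALLOUT_BAR = 2.5  # raised setting, which now only calls out
TESTED_TO_BAR = 6.0    # pressure the system was tested to


def old_policy(pressure_bar: float) -> Action:
    """Original behaviour: exceeding the setting triggers a shutdown."""
    return Action.SHUTDOWN if pressure_bar > OLD_TRIP_BAR else Action.NONE


def new_policy(pressure_bar: float) -> Action:
    """Revised behaviour: exceeding the raised setting only calls out."""
    return Action.CALLOUT if pressure_bar > NEW_CALLOUT_BAR else Action.NONE


observed = 1.9  # pressure seen on Wednesday 12th
print(f"{observed} bar, old policy: {old_policy(observed).value}")
print(f"{observed} bar, new policy: {new_policy(observed).value}")
```

With these settings, the 1.9 bar reading that shut the system down would no longer trigger any action, and the new threshold remains well below the 6 bar test pressure.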
Disasters: Water Leak
• Water found dripping on tape robot!!!!!!
• “I don’t believe this is happening” moment
• Should not be able to happen, as no planned water supplies above machine room.
• “Fortunately” Tier-1 already shut down, so turned off robot too.
• STK engineer investigated and concluded that damage is mainly superficial splash damage, drive heads not contaminated, tapes (60 splashed) probably OK.
• Indication that it had been occurring occasionally for several weeks
Disasters: Water leak
• Cause: condensation from 1st floor cooling system
– Incorrect damper setting (air intake) led to excess condensation
– Condensation collected in “drip tray” and pumped
– Tray too small and pump inadequate
– Water overflowed tray and tracked along floor to hole
• Remedy
– Place umbrella over robot
– Chillers switched off
– 1st floor inspected daily!
– Planning underway to re-engineer drip trays/pumps, alarms, etc.
– Monitor tape error rate
Procurements
• Disk, CPU and robotics procurements delayed from January/February delivery dates
– New SL8500 tape robot entirely for GRIDPP, 2PB of disk (24 drive units: 50% Areca/WD, 50% 3Ware/Seagate), CPU capacity
• Eventually delivered in May, but entangled in R89 migration
– New robot in production in July
– CPU completed acceptance test and deploying into SL5
– One lot of disk (1PB) ready for deployment
– Second lot failed acceptance (many drive ejects)
• Positive aspects of acceptance failure
– Two-lot risk avoidance strategy worked
– Vendor 1 week load test failed to find fault
– Our 28 day acceptance caught fault before kit reached production
LFC, FTS and 3D
• Now complete
• Upgrade back end RAID arrays and Oracle servers
– Replace elderly RAID arrays with pair of new EMC RAID arrays
– Better support (we hope)
– Better performance
• Move to ORACLE RAC for LFC/FTS (increased resilience)
• Separate ATLAS LFC from general LFC
• Upgrade 3D servers and move to new RAID arrays
• Work commenced on testing replication of LFC for disaster contingency
Quattor – Story so Far
• Began work in earnest in June 2009
• Set up Quattor Working Group instance to manage deployment and configuration of new hardware.
– leverages strong QWG support for gLite
• Have SL5 torque/maui server under Quattor control
• Are (as of today) deploying 220+ new WNs in SL5 batch service
• Significant work to get up and running. New way of working.
• Have uncovered and helped fix a number of bugs and issues in the process
Quattor – Next Steps
• As we move existing WNs to SL5 (need 75% of our capacity in SL5) we will quattorise them
• Move CEs and other grid service nodes to Quattor
• Gradually migrate non-grid services to Quattor control
• AQUILON
– Database backend to Quattor developed by Morgan Stanley
    • Improves scalability and manageability (MS are managing >15,000 nodes)
– Will first deploy at RAL
– Then plan to make Aquilon usable by other grid sites as well
Dashboard
• Available at http://www.gridpp.rl.ac.uk/status
• Constantly evolving
– Components can be added/updated/removed
• Present components
– SAM Tests
    • Latest test results for critical services
    • Locally cached for 10 minutes to reduce load
– Downtimes
    • Ongoing and upcoming downtimes pulled from GOCDB
    • Red colour for OUTAGE and yellow for AT_RISK
– Notices
    • Latest information on Tier 1 operations
    • Only Tier 1 staff can post
– Ganglia plots of key components from the Tier1 farm
• Feedback welcome
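Two of the dashboard behaviours, the 10 minute local cache for SAM test results and the downtime colour coding, can be sketched as follows. This is an illustrative sketch only, not the dashboard's actual code: `TTLCache`, the `fetch` callable and `DOWNTIME_COLOURS` are hypothetical names, and the real data sources (SAM, GOCDB) are not queried here.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Cache a single fetched value and refresh it once it is older than ttl_seconds."""

    def __init__(self, fetch: Callable[[], Any], ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # hypothetical callable that retrieves fresh results
        self._ttl = ttl_seconds    # 600 seconds = the 10 minutes quoted on the slide
        self._clock = clock        # injectable clock, which makes testing easy
        self._value: Any = None
        self._stamp: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._ttl:
            self._value = self._fetch()
            self._stamp = now
        return self._value


# Colour coding for downtime severity, as described on the slide.
DOWNTIME_COLOURS = {"OUTAGE": "red", "AT_RISK": "yellow"}


def fake_fetch() -> dict:
    """Stand-in for a real query of test results."""
    return {"service": "example", "status": "ok"}


cache = TTLCache(fake_fetch)
print(cache.get())                  # first call fetches
print(cache.get())                  # second call is served from the cache
print(DOWNTIME_COLOURS["AT_RISK"])
```

Caching in this way means that repeated page loads within the 10 minute window do not each trigger a new query, which is the load reduction the slide refers to.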
SL5 Migration (I)
• Next week - 14th - 18th September!
• LHC only (for now), but all VOs affected
• New batch service - lcgbatch01
• Quattorised torque/maui server
• Quattorised worker nodes
• New LCG-CEs (6-8) for LHC VOs; old LHC CEs (3-5) being retired, other CEs reconfigured
• Same queue configuration
• Use submit filter script on CEs to add SLX property requirement as required
SL5 Migration (II)
• CPU08 going straight into SL5 now (~1800 job slots)
• All 64-bit capable existing WNs will be reinstalled eventually
• Non-LHC VOs will get new CE for migration after dust settles
• No plan to retire SL4 WNs completely yet
October Freeze
• No planned upgrades beyond September, except possibly network upgrade.
• Recognise that some change will have to take place
• Need to put in place lightweight change control process
– Allow changes where benefit outweighs risk
• Expect increased stability as downtimes reduce
• Apply pressure once more to reduce low grade failures.
Conclusion
• Recent staff additions have had a huge impact on quality of service we operate.
• Tier-1 development plan for 2009 nearly complete.
• Positive feedback from STEP09 that service meets requirements.
• Still a few major items (like SL5) to get through (fingers crossed).
• Probably still some R89 surprises in pipeline.
• Looking forward to start of data taking