Tier1 View: Resilience – Status, plans, and best practice
Martin Bly, RAL Tier1 Fabric Manager
GridPP22 – UCL - 2 April 2009
Overview
“How to make critical services at the T1 bulletproof”
Resilience - Why?
• Services and system components fail – <insert_expletive_of_your_choice> happens!
• You don’t want your services to be brought down by a failure
  – MoU commitments are quite taxing to meet even without failures
  – You can’t hide from auntie SAM…
• Better to deal with problems without the pressure to restart services
  – Fewer mistakes
• Even better to avoid the problems in the first place
• So: design the service implementation so that it *will* survive failures of whatever nature
Approaches to resilience
• Hardware
  – Use hardware that can survive component failure
• Software
  – Use software that can survive problems on hardware
  – Use software designed for distributed operation
  – Use software that has inbuilt resilience
• Location
  – Locate hosts such that a service can survive failure at a host location
Hardware
Resilient hardware will help your services survive common failure modes and keep them operating until you can replace the failed component and make the service resilient again.
Storage
• Most common is RAID, as used in storage arrays
• Single (RAID5) or double (RAID6) disk failures do not take out the storage array
  – Use of hot spares allows automatic rebuilds to maintain the resilience
• RAID1 for system disks in servers – in the event of a single disk failure the server carries on
  – RAID1 with a hot spare can be used for super-critical systems – automatic rebuild maintains the resilience
• Works with software RAID as well as with hardware RAID controllers (see the sketch below)
  – If you set the BIOS up for hot-swap capability…
• Failed disks can be replaced without taking the service down
  – If you have hot-swap caddies
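Where software RAID is in use, a degraded array is visible to the operating system and can be caught by a simple periodic check. Below is a minimal sketch, assuming Linux md RAID and the usual /proc/mdstat member-status notation (“[UU]” healthy, “[U_]” degraded); the path and exit-code convention are illustrative, not a Tier1 tool.

#!/usr/bin/env python
"""Sketch: flag Linux software-RAID (md) arrays running degraded."""
import re
import sys

def degraded_arrays(mdstat="/proc/mdstat"):
    """Return the names of md arrays with at least one missing member."""
    bad, current = [], None
    with open(mdstat) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            # Status lines look like: "1953382400 blocks [2/2] [UU]"
            status = re.search(r"\[([U_]+)\]\s*$", line.strip())
            if status and current and "_" in status.group(1):
                bad.append(current)
    return bad

if __name__ == "__main__":
    failed = degraded_arrays()
    if failed:
        print("WARNING: degraded md arrays: %s" % ", ".join(failed))
        sys.exit(1)
    print("OK: all md arrays have all members")
    sys.exit(0)

A check like this only helps if it is wired into the monitoring/call-out system, so that the disk swap and rebuild happen early in the failure cycle.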
Memory
• ECC helps systems detect and correct single-bit errors (and detect multi-bit errors) in RAM – can help prevent data corruption
• If the ECC correction rate begins to rise, the RAM may be failing, may need reseating, may be subject to interference, or may be slipping out of tolerance (a monitoring sketch follows below)
• Higher-end kit can stop using ‘bad’ RAM – worth it if not interrupting the service justifies the (high) cost
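On Linux the corrected/uncorrected ECC counters are exposed by the EDAC driver, so a rising correction rate can be spotted before a module fails outright. This is a minimal sketch assuming the EDAC sysfs layout under /sys/devices/system/edac/mc/; the warning threshold is an arbitrary illustration.

#!/usr/bin/env python
"""Sketch: watch ECC error counters exported by the Linux EDAC driver."""
import glob
import sys

CE_WARN_THRESHOLD = 100  # corrected-error count that triggers a warning (illustrative)

def read_count(path):
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    worst = 0
    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        ce = read_count(mc + "/ce_count")   # corrected (single-bit) errors
        ue = read_count(mc + "/ue_count")   # uncorrected errors
        print("%s: corrected=%d uncorrected=%d" % (mc, ce, ue))
        if ue > 0:
            worst = max(worst, 2)           # uncorrected errors: critical
        elif ce > CE_WARN_THRESHOLD:
            worst = max(worst, 1)           # rising correction rate: warning
    sys.exit(worst)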
Power Supply
• Redundant PSU configurations
  – N+1 redundancy: at least one more PSU in a server than is needed to make it work. If one fails, the server keeps running and the failed unit can be replaced without taking the server down
• Multiple power feeds
  – For an N+1 redundant PSU configuration, one can feed each PSU from a different PDU. If one PDU fails (and they do), or a fuse blows (and they definitely do!), the other PSU is still powered and the service can continue
• UPS for systems where loss of power is a problem
  – Bridges blips, brownouts and short interruptions; smoothed feed, harmonic reduction
  – Permanent or time-limited – how much power must it provide and for how long must it continue? (see the shutdown sketch below)
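For a time-limited UPS the useful trick is an automated clean shutdown before the battery runs out. The sketch below assumes the UPS is managed by Network UPS Tools (NUT) and readable via the upsc command; the UPS name, charge threshold and shutdown command are illustrative assumptions, not the Tier1 configuration.

#!/usr/bin/env python
"""Sketch: shut a host down cleanly when a time-limited UPS runs low."""
import subprocess

UPS = "ups1@localhost"   # hypothetical NUT UPS name
MIN_CHARGE = 20          # percent of battery left before shutting down (illustrative)

def upsc_vars(ups):
    """Return the 'key: value' variables reported by `upsc` as a dict."""
    out = subprocess.check_output(["upsc", ups]).decode()
    pairs = (line.split(":", 1) for line in out.splitlines() if ":" in line)
    return dict((k.strip(), v.strip()) for k, v in pairs)

if __name__ == "__main__":
    info = upsc_vars(UPS)
    on_battery = "OB" in info.get("ups.status", "").split()
    charge = float(info.get("battery.charge", "100"))
    if on_battery and charge < MIN_CHARGE:
        # Stop cleanly while there is still power left for the disks to flush.
        subprocess.call(["shutdown", "-h", "+1", "UPS battery low"])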
Interconnects
• Networking
  – Two or more network ports bonded can provide resilience if the cables are routed to different switches or via different routes – increases performance too (see the sketch below)
  – Bonded links in fibre installations can provide resilience against transceiver failure or fibre cuts
  – ‘Stacked’ switches with bi-directional stacking capability
    • If one cable fails, data goes the other way
    • If one unit fails, data can still reach the units on the other side
  – Fail-over links in the site infrastructure and in national / international long-haul links – fibre cuts happen with depressing regularity
• Fibre-channel
  – Multi-port FC HBAs and array controllers can be set up to provide two independent routes from servers to storage devices, with multi-path and failover support keeping the data flowing
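A bonded link only gives resilience while all of its member ports are actually up, so the slave state is worth checking routinely. A minimal sketch, assuming the Linux bonding driver and its status file under /proc/net/bonding/; the interface name bond0 is illustrative.

#!/usr/bin/env python
"""Sketch: check that every slave of a bonded network interface is up."""
import sys

def bond_slave_status(bond="bond0"):
    """Return {slave_interface: mii_status} for a Linux bonding device."""
    slaves, current = {}, None
    with open("/proc/net/bonding/%s" % bond) as f:
        for line in f:
            if line.startswith("Slave Interface:"):
                current = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and current:
                slaves[current] = line.split(":", 1)[1].strip()
    return slaves

if __name__ == "__main__":
    status = bond_slave_status()
    down = [s for s, st in sorted(status.items()) if st != "up"]
    if down:
        print("WARNING: bond slaves down: %s" % ", ".join(down))
        sys.exit(1)
    print("OK: %d bond slaves up" % len(status))
    sys.exit(0)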
Software
Software services should be designed to be resilient and to be provided by multiple hosts and at distributed locations.
“This is the Grid – it’s distributed. If the services aren’t distributable, <expletive> rewrite them.” – anon
Monitoring
• If it can be monitored…
• Look for and restart failed service daemons (see the sketch after this list)
• Look for signatures of impending problems to predict component failure
• Idle disks hide their faults
  – Regular low-level verification runs to push sick drives over the edge
  – Replace early in the failure cycle
    • So it doesn’t fail during a rebuild…
• Increased error rates on network links from failing line cards, transceivers or cable/fibre degradation
  – If you have redundant links, you can replace the faulty one and keep the service going
• Call-out system for problems that impact services
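Restarting a failed daemon automatically is the simplest form of this. Below is a minimal sketch of such a watchdog, assuming a SysV-style init script and the pgrep utility; the daemon name is purely hypothetical, and a real restarter should also log and alert so that silent flapping gets noticed.

#!/usr/bin/env python
"""Sketch: restart a service daemon if it is no longer running."""
import subprocess
import sys

DAEMON = "examplesrv"                     # hypothetical daemon name
INIT_SCRIPT = "/etc/init.d/%s" % DAEMON   # assumes a SysV init script

def is_running(name):
    """True if at least one process with exactly this name exists."""
    return subprocess.call(["pgrep", "-x", name],
                           stdout=subprocess.DEVNULL) == 0

if __name__ == "__main__":
    if is_running(DAEMON):
        sys.exit(0)
    print("%s not running - attempting restart" % DAEMON)
    rc = subprocess.call([INIT_SCRIPT, "start"])
    sys.exit(0 if rc == 0 else 2)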
Multiple hosts
• Services can be provided by more than one host if the application supports it
  – Share the load and increase performance
  – If one host fails, the rest provide the service
  – Use DNS round-robin to ‘randomly’ select a host, using a service alias with a short TTL (see the sketch below)
  – Take broken host(s) out of the active DNS
  – Avoid single points of failure
• Multiple hosts can be located…
  – … in different rooms
  – … in different buildings
  – … at different sites
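To see how this behaves from a client’s point of view, one can resolve the alias and probe each address behind it. A minimal sketch; the alias name and port are illustrative, and a real check would feed its results into whatever process removes broken hosts from the DNS.

#!/usr/bin/env python
"""Sketch: probe every host behind a DNS round-robin service alias."""
import socket

ALIAS, PORT = "service.example.org", 2170   # hypothetical alias and port

def hosts_behind(alias):
    """Return the IPv4 addresses the alias currently resolves to."""
    infos = socket.getaddrinfo(alias, None, socket.AF_INET)
    return sorted({info[4][0] for info in infos})

def responds(ip, port, timeout=3.0):
    """True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        socket.create_connection((ip, port), timeout).close()
        return True
    except OSError:
        return False

if __name__ == "__main__":
    for ip in hosts_behind(ALIAS):
        state = "OK" if responds(ip, PORT) else "FAILED - candidate for removal from DNS"
        print("%s %s" % (ip, state))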
Tier1
Resilience steps at the Tier1…
Hardware at the Tier1
• Most of the hardware techniques are used at the Tier1
• Bulk storage uses RAID1/5/6, ECC RAM, N+1 PSUs, multiple power feeds, and regular verifies of arrays (scrubbing)
• Service nodes use RAID1 and ECC RAM, some with N+1 PSUs
• Databases: RAID1/10/5/6, ECC RAM, N+1 PSUs, dual FC links, multiple power feeds
• Networking: redundant off-site link to SJ5
  – working on redundancy (failover/backup) for the OPN link to CERN
• UPS (in the new building)
  – 24/7 UPS for critical services / database racks
  – Short-lived UPS for storage systems to allow clean shutdown
CASTOR Service
[Diagram: a single CASTOR instance (e.g. CMS) – srm and ns front ends, stager, LSF master, LSF licence server, rmmaster and the shared CASTOR core, with an Oracle RAC (Pluto) backed by an FC array (Neptune). In general (all for CMS), the stager/LSF master and rmmaster hosts have mirrored disks.]
3D Services + LHCB LFC
[Diagram: 3D Oracle RAC on an FC array; the LHCb LFC is a read-only replica on a single host, with fast kickstart and failover to CERN.]
FTS and General LFC
[Diagram: FTS and general LFC – 5 FTS web front ends in DNS round-robin; 1 channel/VO agent host (RAID1), hot spare soon; LFC front end behind DNS round-robin; Oracle back end on a RAID10 SAN. Oracle is currently 2 independent servers; work is active (in progress, running late) to deploy a 3-server RAC. The LFC is currently a single host; a second host is planned for mid-September.]
CE and Fabric
[Diagram: CEs ce02, ce03, ce04, ce05 feeding torque/maui – 3 doublets, one each for ATLAS, CMS and LHCb, each CE with mirrored disks; NIS for DN-to-account mapping (mirrored disks); shared /home file system on hardware RAID.]
CE/SRM instances
WMS and LB
• Now:
  – lcgwms01 – LHC
  – lcgwms02 – everyone
  – lcgwms03 – non-LHC
• Developments:
  – lcgwms01 – LHC
  – lcgwms02 – LHC
  – lcgwms03 – non-LHC
• All WMS use both LB systems
[Diagram: WMS triplet, LB doublet]
Other Tier1 Services
• UK-BDII:
  – DNS R-R triplet of simple hosts
  – Copes with the load, provides resilience
  – Easy kickstart for rapid instancing
• RGMA registry:
  – single host, RAID disks, easy kickstart
• MONbox:
  – single host, RAID disks, easy kickstart
• VO boxes:
  – several single hosts, easy kickstart
• Site BDII:
  – DNS R-R doublet of simple hosts (same as UK-BDII)
• PROXY:
  – doublet of simple hosts, easy kickstart
• GOCDB:
  – internal failover with an alternative database (Oracle), plus external failover to another web front end in Germany and a mirrored database in Italy; the latter is still being tested
• APEL:
  – has a warm standby; new hardware is being bought
Tier1 Monitoring
• Catch problems early with Nagios where possible (or at least catch problems before anyone else notices)
  – load alarms
  – file systems near to full (see the sketch below)
  – certificates close to expiry
  – failed drives
• Some ganglia/cacti capacity-planning reviews (but ad hoc) looking for long-term trends. The Service Operations team is making a difference.
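Checks like these are typically short Nagios plugins returning OK/WARNING/CRITICAL exit codes. A minimal sketch of the “file system near to full” case; the mount points and thresholds are illustrative, and a real plugin would take them as options.

#!/usr/bin/env python
"""Sketch: Nagios-style check for file systems that are nearly full."""
import os
import sys

MOUNTS = ["/", "/home", "/var"]   # hypothetical mount points to watch
WARN, CRIT = 90, 95               # percent used (illustrative thresholds)

def percent_used(path):
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    avail = st.f_bavail * st.f_frsize
    return 100.0 * (total - avail) / total if total else 0.0

if __name__ == "__main__":
    worst, parts = 0, []
    for mnt in MOUNTS:
        used = percent_used(mnt)
        parts.append("%s=%.0f%%" % (mnt, used))
        if used >= CRIT:
            worst = max(worst, 2)
        elif used >= WARN:
            worst = max(worst, 1)
    print(["OK", "WARNING", "CRITICAL"][worst] + ": " + " ".join(parts))
    sys.exit(worst)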
Tier1 Backups
• Critical hosts are all backed up to the tape store
• Tape details are written to the central loggers (see the sketch below)
  – So we can find which tape numbers to restore if the host is toast
• Speedy restores to toasted systems
• Verify and exercise backups…
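The point of writing tape details to the central loggers is that the information survives the loss of the host itself. A minimal sketch of sending such a record over syslog; the logger hostname and record format are illustrative assumptions.

#!/usr/bin/env python
"""Sketch: record which tapes a backup used on a central syslog server."""
import logging
import logging.handlers

def tape_logger(loghost="loghost.example.org"):
    """Build a logger that forwards records to a remote syslog server (UDP 514)."""
    logger = logging.getLogger("backup.tapes")
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(loghost, 514))
    handler.setFormatter(logging.Formatter("backup: %(message)s"))
    logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    log = tape_logger()
    # Hypothetical record: which host, which backup set, which tapes it went to.
    log.info("host=examplehost set=full-2009-04-02 tapes=T01234,T01235")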
Tier1 On-call
• A good driver for service improvement
• Continuous improvement process, with a weekly review of night-time incidents
• The review drives:
  – Auto-restarters (team still not 100% keen)
  – Improved monitoring (more plugins)
  – Better response documentation
  – Changes to processes
• Also runs during the daytime
• Gradually, routine operations will become more and more the responsibility of the service intervention team
• The CASTOR team carries out a “weekly” detailed review of all incidents (looking at how to avoid them happening again); this will be generalised to the whole Tier-1
Tier1 People
• Several teams, with some degree of expertise sharing within each team
  – Fabric, Grid/Support, CASTOR, Databases
  – This has been pretty successful and we are reasonably confident we can handle tractable problems without the specialist present
• As far as is reasonable/practicable, we seek to ensure leave is scheduled to maintain expert cover – not always possible
• On-call is also spreading expertise in critical services (e.g. even the Facility Manager knows how to restart the CASTOR request handler!)
• Able to call upon RAL Tier-2 staff (or others in GridPP or elsewhere) in case of a complete lack of expertise. We have done this occasionally, and should probably be prepared to do it more often
Off Site services
• A few critical services are candidates for off-site replication; others, such as the BDIIs and the LHCb LFC, are already federated
• Possible candidates: FTS and the general LFC (possibly RGMA)
  – Both essential to GridPP
  – LFC is based on Oracle
    • Streaming technology already deployed and tested elsewhere (3D)
  – RAL could operate these remotely, but the existing configuration is very expensive (£40K of hardware plus Oracle licences). Failover to new DNS names would also need to be site resilient (not trivial). May be worth exploring with nearby sites or Daresbury
Questions
To Andrew, please…!