Tier1 View: Resilience – Status, plans, and best practice
Martin Bly, RAL Tier1 Fabric Manager
GridPP22 – UCL - 2 April 2009
Overview
“How to make critical services at the T1 bulletproof”
Resilience - Why?
• Services and system components fail – <insert_expletive_of_your_choice> happens!
• You don’t want your services to be brought down by a failure
  – MoU commitments are quite taxing to meet even without failures
  – You can’t hide from auntie SAM…
• Better to deal with problems without the pressure to restart services
  – Fewer mistakes
• Even better to avoid the problems in the first place
• So: design the service implementation so that it *will* survive failures of whatever nature
Approaches to resilience
• Hardware
  – Use hardware that can survive component failure
• Software
  – Use software that can survive problems on hardware
  – Use software designed for distributed operation
  – Use software that has inbuilt resilience
• Location
  – Locate hosts such that a service can survive failure at a host location
Hardware
Resilient hardware will help your services survive common failure modes and keep them operating until you can replace the failed component and make the service resilient again.
Storage
• Most common is RAID, as used in storage arrays
• Single (RAID5) or double (RAID6) disk failures do not take out the storage array
  – Use of hot spares allows automatic rebuilds to maintain the resilience
• RAID1 for system disks in servers – in the event of a single disk failure the server carries on
  – RAID1 with a hot spare can be used for super-critical systems – automatic rebuild maintains the resilience
• Works with software RAID as well as with hardware RAID controllers (see the sketch below)
  – If you set the BIOS up for hot-swap capability…
• Failed disks can be replaced without taking the service down
  – If you have hot-swap caddies
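Where software RAID is in use, a degraded array is visible to the operating system and can be caught by a simple periodic check. Below is a minimal sketch, assuming Linux md RAID and the usual /proc/mdstat member-status notation (“[UU]” healthy, “[U_]” degraded); the path and exit-code convention are illustrative, not a Tier1 tool.

#!/usr/bin/env python
"""Sketch: flag Linux software-RAID (md) arrays running degraded."""
import re
import sys

def degraded_arrays(mdstat="/proc/mdstat"):
    """Return the names of md arrays with at least one missing member."""
    bad, current = [], None
    with open(mdstat) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            # Status lines look like: "1953382400 blocks [2/2] [UU]"
            status = re.search(r"\[([U_]+)\]\s*$", line.strip())
            if status and current and "_" in status.group(1):
                bad.append(current)
    return bad

if __name__ == "__main__":
    failed = degraded_arrays()
    if failed:
        print("WARNING: degraded md arrays: %s" % ", ".join(failed))
        sys.exit(1)
    print("OK: all md arrays have all members")
    sys.exit(0)

A check like this only helps if it is wired into the monitoring/call-out system, so that the disk swap and rebuild happen early in the failure cycle.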
Memory
• ECC helps systems detect and correct single-bit errors (and detect multi-bit errors) in RAM – can help prevent data corruption
• If the ECC correction rate begins to rise, the RAM may be failing, may need reseating, may be subject to interference, or may be slipping out of tolerance (a monitoring sketch follows below)
• Higher-end kit can stop using ‘bad’ RAM – worth it if not interrupting the service justifies the (high) cost
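On Linux the corrected/uncorrected ECC counters are exposed by the EDAC driver, so a rising correction rate can be spotted before a module fails outright. This is a minimal sketch assuming the EDAC sysfs layout under /sys/devices/system/edac/mc/; the warning threshold is an arbitrary illustration.

#!/usr/bin/env python
"""Sketch: watch ECC error counters exported by the Linux EDAC driver."""
import glob
import sys

CE_WARN_THRESHOLD = 100  # corrected-error count that triggers a warning (illustrative)

def read_count(path):
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    worst = 0
    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        ce = read_count(mc + "/ce_count")   # corrected (single-bit) errors
        ue = read_count(mc + "/ue_count")   # uncorrected errors
        print("%s: corrected=%d uncorrected=%d" % (mc, ce, ue))
        if ue > 0:
            worst = max(worst, 2)           # uncorrected errors: critical
        elif ce > CE_WARN_THRESHOLD:
            worst = max(worst, 1)           # rising correction rate: warning
    sys.exit(worst)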
Power Supply
• Redundant PSU configurations
  – N+1 redundancy: at least one more PSU in a server than is needed to make it work. If one fails, the server keeps running and the failed unit can be replaced without taking the server down
• Multiple power feeds
  – For an N+1 redundant PSU configuration, one can feed each PSU from a different PDU. If one PDU fails (and they do), or a fuse blows (and they definitely do!), the other PSU is still powered and the service can continue
• UPS for systems where loss of power is a problem
  – Bridges blips, brownouts and short interruptions; smoothed feed, harmonic reduction
  – Permanent or time-limited – how much power must it provide and for how long must it continue? (see the shutdown sketch below)
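For a time-limited UPS the useful trick is an automated clean shutdown before the battery runs out. The sketch below assumes the UPS is managed by Network UPS Tools (NUT) and readable via the upsc command; the UPS name, charge threshold and shutdown command are illustrative assumptions, not the Tier1 configuration.

#!/usr/bin/env python
"""Sketch: shut a host down cleanly when a time-limited UPS runs low."""
import subprocess

UPS = "ups1@localhost"   # hypothetical NUT UPS name
MIN_CHARGE = 20          # percent of battery left before shutting down (illustrative)

def upsc_vars(ups):
    """Return the 'key: value' variables reported by `upsc` as a dict."""
    out = subprocess.check_output(["upsc", ups]).decode()
    pairs = (line.split(":", 1) for line in out.splitlines() if ":" in line)
    return dict((k.strip(), v.strip()) for k, v in pairs)

if __name__ == "__main__":
    info = upsc_vars(UPS)
    on_battery = "OB" in info.get("ups.status", "").split()
    charge = float(info.get("battery.charge", "100"))
    if on_battery and charge < MIN_CHARGE:
        # Stop cleanly while there is still power left for the disks to flush.
        subprocess.call(["shutdown", "-h", "+1", "UPS battery low"])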
Interconnects
• Networking
  – Two or more network ports bonded can provide resilience if the cables are routed to different switches or via different routes – increases performance too (see the sketch below)
  – Bonded links in fibre installations can provide resilience against transceiver failure or fibre cuts
  – ‘Stacked’ switches with bi-directional stacking capability
    • If one cable fails, data goes the other way
    • If one unit fails, data can still reach the units on the other side
  – Fail-over links in the site infrastructure and in national / international long-haul links – fibre cuts happen with depressing regularity
• Fibre-channel
  – Multi-port FC HBAs and array controllers can be set up to provide two independent routes from servers to storage devices, with multi-path and failover support keeping the data flowing
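A bonded link only gives resilience while all of its member ports are actually up, so the slave state is worth checking routinely. A minimal sketch, assuming the Linux bonding driver and its status file under /proc/net/bonding/; the interface name bond0 is illustrative.

#!/usr/bin/env python
"""Sketch: check that every slave of a bonded network interface is up."""
import sys

def bond_slave_status(bond="bond0"):
    """Return {slave_interface: mii_status} for a Linux bonding device."""
    slaves, current = {}, None
    with open("/proc/net/bonding/%s" % bond) as f:
        for line in f:
            if line.startswith("Slave Interface:"):
                current = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and current:
                slaves[current] = line.split(":", 1)[1].strip()
    return slaves

if __name__ == "__main__":
    status = bond_slave_status()
    down = [s for s, st in sorted(status.items()) if st != "up"]
    if down:
        print("WARNING: bond slaves down: %s" % ", ".join(down))
        sys.exit(1)
    print("OK: %d bond slaves up" % len(status))
    sys.exit(0)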
Software
Software services should be designed to be resilient and to be provided by multiple hosts and at distributed locations.
“This is the Grid – it’s distributed. If the services aren’t distributable, <expletive> rewrite them.” – anon
Monitoring
• If it can be monitored…
• Look for and restart failed service daemons (see the sketch after this list)
• Look for signatures of impending problems to predict component failure
• Idle disks hide their faults
  – Regular low-level verification runs to push sick drives over the edge
  – Replace early in the failure cycle
    • So it doesn’t fail during a rebuild…
• Increased error rates on network links from failing line cards, transceivers or cable/fibre degradation
  – If you have redundant links, you can replace the faulty one and keep the service going
• Call-out system for problems that impact services
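Restarting a failed daemon automatically is the simplest form of this. Below is a minimal sketch of such a watchdog, assuming a SysV-style init script and the pgrep utility; the daemon name is purely hypothetical, and a real restarter should also log and alert so that silent flapping gets noticed.

#!/usr/bin/env python
"""Sketch: restart a service daemon if it is no longer running."""
import subprocess
import sys

DAEMON = "examplesrv"                     # hypothetical daemon name
INIT_SCRIPT = "/etc/init.d/%s" % DAEMON   # assumes a SysV init script

def is_running(name):
    """True if at least one process with exactly this name exists."""
    return subprocess.call(["pgrep", "-x", name],
                           stdout=subprocess.DEVNULL) == 0

if __name__ == "__main__":
    if is_running(DAEMON):
        sys.exit(0)
    print("%s not running - attempting restart" % DAEMON)
    rc = subprocess.call([INIT_SCRIPT, "start"])
    sys.exit(0 if rc == 0 else 2)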
Multiple hosts
• Services can be provided by more than one host if the application supports it
  – Share the load and increase performance
  – If one host fails, the rest provide the service
  – Use DNS round-robin to ‘randomly’ select a host, using a service alias with a short TTL (see the sketch below)
  – Take broken host(s) out of the active DNS
  – Avoid single points of failure
• Multiple hosts can be located…
  – … in different rooms
  – … in different buildings
  – … at different sites
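To see how this behaves from a client’s point of view, one can resolve the alias and probe each address behind it. A minimal sketch; the alias name and port are illustrative, and a real check would feed its results into whatever process removes broken hosts from the DNS.

#!/usr/bin/env python
"""Sketch: probe every host behind a DNS round-robin service alias."""
import socket

ALIAS, PORT = "service.example.org", 2170   # hypothetical alias and port

def hosts_behind(alias):
    """Return the IPv4 addresses the alias currently resolves to."""
    infos = socket.getaddrinfo(alias, None, socket.AF_INET)
    return sorted({info[4][0] for info in infos})

def responds(ip, port, timeout=3.0):
    """True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        socket.create_connection((ip, port), timeout).close()
        return True
    except OSError:
        return False

if __name__ == "__main__":
    for ip in hosts_behind(ALIAS):
        state = "OK" if responds(ip, PORT) else "FAILED - candidate for removal from DNS"
        print("%s %s" % (ip, state))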
Tier1
Resilience steps at the Tier1…
Hardware at the Tier1
• Most of the hardware techniques are used at the Tier1
• Bulk storage uses RAID1/5/6, ECC RAM, N+1 PSUs, multiple power feeds, and regular verifies of arrays (scrubbing)
• Service nodes use RAID1 and ECC RAM, some with N+1 PSUs
• Databases: RAID1/10/5/6, ECC RAM, N+1 PSUs, dual FC links, multiple power feeds
• Networking: redundant off-site link to SJ5
  – working on redundancy (failover/backup) for the OPN link to CERN
• UPS (in the new building)
  – 24/7 UPS for critical services / database racks
  – Short-lived UPS for storage systems to allow clean shutdown
CASTOR Service
[Diagram: a single CASTOR instance (e.g. CMS) – srm and ns front ends, stager, LSF master, LSF licence server, rmmaster and the shared CASTOR core, with an Oracle RAC (Pluto) backed by an FC array (Neptune). In general (all for CMS), the stager/LSF master and rmmaster hosts have mirrored disks.]
3D Services + LHCB LFC
[Diagram: 3D Oracle RAC on an FC array; the LHCb LFC is a read-only replica on a single host, with fast kickstart and failover to CERN.]
FTS and General LFC
[Diagram: FTS and general LFC – 5 FTS web front ends in DNS round-robin; 1 channel/VO agent host (RAID1), hot spare soon; LFC front end behind DNS round-robin; Oracle back end on a RAID10 SAN. Oracle is currently 2 independent servers; work is active (in progress, running late) to deploy a 3-server RAC. The LFC is currently a single host; a second host is planned for mid-September.]
CE and Fabric
[Diagram: CEs ce02, ce03, ce04, ce05 feeding torque/maui – 3 doublets, one each for ATLAS, CMS and LHCb, each CE with mirrored disks; NIS for DN-to-account mapping (mirrored disks); shared /home file system on hardware RAID.]
CE/SRM instances
WMS and LB
• Now:
  – lcgwms01 – LHC
  – lcgwms02 – everyone
  – lcgwms03 – non-LHC
• Developments:
  – lcgwms01 – LHC
  – lcgwms02 – LHC
  – lcgwms03 – non-LHC
• All WMS use both LB systems
[Diagram: WMS triplet, LB doublet]
Other Tier1 Services
• UK-BDII:
  – DNS R-R triplet of simple hosts
  – Copes with the load, provides resilience
  – Easy kickstart for rapid instancing
• RGMA registry:
  – single host, RAID disks, easy kickstart
• MONbox:
  – single host, RAID disks, easy kickstart
• VO boxes:
  – several single hosts, easy kickstart
• Site BDII:
  – DNS R-R doublet of simple hosts (same as UK-BDII)
• PROXY:
  – doublet of simple hosts, easy kickstart
• GOCDB:
  – internal failover with an alternative database (Oracle), plus external failover to another web front end in Germany and a mirrored database in Italy; the latter is still being tested
• APEL:
  – has a warm standby; new hardware is being bought
Tier1 Monitoring
• Catch problems early with Nagios where possible (or at least catch problems before anyone else notices)
  – load alarms
  – file systems near to full (see the sketch below)
  – certificates close to expiry
  – failed drives
• Some ganglia/cacti capacity-planning reviews (but ad hoc) looking for long-term trends. The Service Operations team is making a difference.
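Checks like these are typically short Nagios plugins returning OK/WARNING/CRITICAL exit codes. A minimal sketch of the “file system near to full” case; the mount points and thresholds are illustrative, and a real plugin would take them as options.

#!/usr/bin/env python
"""Sketch: Nagios-style check for file systems that are nearly full."""
import os
import sys

MOUNTS = ["/", "/home", "/var"]   # hypothetical mount points to watch
WARN, CRIT = 90, 95               # percent used (illustrative thresholds)

def percent_used(path):
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    avail = st.f_bavail * st.f_frsize
    return 100.0 * (total - avail) / total if total else 0.0

if __name__ == "__main__":
    worst, parts = 0, []
    for mnt in MOUNTS:
        used = percent_used(mnt)
        parts.append("%s=%.0f%%" % (mnt, used))
        if used >= CRIT:
            worst = max(worst, 2)
        elif used >= WARN:
            worst = max(worst, 1)
    print(["OK", "WARNING", "CRITICAL"][worst] + ": " + " ".join(parts))
    sys.exit(worst)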
Tier1 Backups
• Critical hosts are all backed up to the tape store
• Tape details are written to the central loggers (see the sketch below)
  – So we can find which tape numbers to restore if the host is toast
• Speedy restores to toasted systems
• Verify and exercise backups…
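The point of writing tape details to the central loggers is that the information survives the loss of the host itself. A minimal sketch of sending such a record over syslog; the logger hostname and record format are illustrative assumptions.

#!/usr/bin/env python
"""Sketch: record which tapes a backup used on a central syslog server."""
import logging
import logging.handlers

def tape_logger(loghost="loghost.example.org"):
    """Build a logger that forwards records to a remote syslog server (UDP 514)."""
    logger = logging.getLogger("backup.tapes")
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(loghost, 514))
    handler.setFormatter(logging.Formatter("backup: %(message)s"))
    logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    log = tape_logger()
    # Hypothetical record: which host, which backup set, which tapes it went to.
    log.info("host=examplehost set=full-2009-04-02 tapes=T01234,T01235")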
Tier1 On-call
• A good driver for service improvement
• Continuous improvement process, with a weekly review of night-time incidents
• The review drives:
  – Auto-restarters (team still not 100% keen)
  – Improved monitoring (more plugins)
  – Better response documentation
  – Changes to processes
• Also runs during the daytime
• Gradually, routine operations will become more and more the responsibility of the service intervention team
• The CASTOR team carries out a “weekly” detailed review of all incidents (looking at how to avoid them happening again); this will be generalised to the whole Tier-1
Tier1 People
• Several teams, with some degree of expertise sharing within each team
  – Fabric, Grid/Support, CASTOR, Databases
  – This has been pretty successful and we are reasonably confident we can handle tractable problems without the specialist present
• As far as is reasonable/practicable, we seek to ensure leave is scheduled to maintain expert cover – not always possible
• On-call is also spreading expertise in critical services (e.g. even the Facility Manager knows how to restart the CASTOR request handler!)
• Able to call upon RAL Tier-2 staff (or others in GridPP or elsewhere) in case of a complete lack of expertise. We have done this occasionally, and should probably be prepared to do it more often
Off Site services
• A few critical services are candidates for off-site replication; others, such as the BDIIs and the LHCb LFC, are already federated
• Possible candidates: FTS and the general LFC (possibly RGMA)
  – Both essential to GridPP
  – LFC is based on Oracle
    • Streaming technology already deployed and tested elsewhere (3D)
  – RAL could operate these remotely, but the existing configuration is very expensive (£40K of hardware plus Oracle licences). Failover to new DNS names would also need to be site resilient (not trivial). May be worth exploring with nearby sites or Daresbury
Questions
To Andrew, please…!