Large Farm 'Real Life Problems' and their Solutions


Page 1: Large Farm 'Real Life Problems' and their Solutions

20.10.2004

Large Farm 'Real Life Problems' and their Solutions

Thorsten Kleinwort, CERN IT/FIO

HEPiX II/2004, BNL

Page 2: Large Farm 'Real Life Problems' and their Solutions

20 October 2004 Thorsten Kleinwort

IT/FIO/FS

Outline

Farms at the CERN CC:
• The Tools Framework
• The Working Teams
• Real Life Use Cases
• Collaborations
• Summary
• Useful Links

Page 3: Large Farm 'Real Life Problems' and their Solutions

The Tools Framework

• ELFms
• Quattor:
  • Installation (Kickstart + SWREP)
  • Configuration (CDB + NCM)
  • Management (SPMA + NCM)
• Lemon:
  • Monitoring
  • Batch system statistics
• LEAF:
  • State management (SMS)
  • Hardware management (HMS)

[Figure: ELFms = QUATTOR + LEMON + LEAF]

Page 4: Large Farm 'Real Life Problems' and their Solutions

The Tools Framework (cont’d)

• The evolution of the ELFms tools is described in various previous presentations:
  • HEPiX II/2003 (Vancouver):
    • ‘The new Fabric Management Tools in Production at CERN’
  • HEPiX I/2004 (Edinburgh):
    • ‘ELFms, status, deployment’ by German Cancio
    • ‘Lemon Web Monitoring’ by Miroslav Siket
  • CHEP 2004 (Interlaken):
    • ‘Current Status of Fabric Management at CERN’ by German Cancio
  • This HEPiX:
    • ‘Experience in the use of quattor tool suite outside CERN’

=> Progress has been made, improvements are ongoing, and Quattor is increasingly used outside CERN

Page 5: Large Farm 'Real Life Problems' and their Solutions

Tools (cont’d)

• Other tools [interfacing CDB]:
  • Script: PrepareInstall.pl:
    • Does all necessary steps to prepare a machine install
    • Can run with a list of hosts (for mass installs)
    • Gets all the necessary information from CDB
    • Creates a kickstart file for each node
  • Local script: maintenance:
    • Script to run down a node:
      • Drains batch nodes
      • Warns users on interactive nodes
    • Can execute a configurable script at the end, e.g. reboot
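The PrepareInstall.pl steps listed above (take a list of hosts, look up each host in CDB, write one kickstart file per node) can be sketched roughly as follows. This is a minimal Python illustration, not the actual Perl script: the `FAKE_CDB` dictionary, its field names, the host names, the install URL, and the tiny template are all hypothetical stand-ins, and a real kickstart file needs many more directives.

```python
from pathlib import Path
from string import Template
import tempfile

# Hypothetical stand-in for CDB: host name -> configuration record.
FAKE_CDB = {
    "lxb0001": {"cluster": "lxbatch", "os": "slc3", "disk": "hda"},
    "lxb0002": {"cluster": "lxbatch", "os": "slc3", "disk": "sda"},
}

# Deliberately tiny kickstart-like template (assumed layout and URL).
KS_TEMPLATE = Template(
    "# kickstart for $host (cluster $cluster)\n"
    "url --url=http://install.example.org/$os\n"
    "network --bootproto=dhcp --hostname=$host\n"
    "clearpart --all --drives=$disk\n"
    "%packages\n"
    "@base\n"
)


def prepare_install(hosts, outdir, cdb=FAKE_CDB):
    """Write one kickstart file per known host; return (written, unknown)."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written, unknown = [], []
    for host in hosts:
        record = cdb.get(host)
        if record is None:
            # Host is missing from the configuration database: skip and report.
            unknown.append(host)
            continue
        path = outdir / f"{host}.ks"
        path.write_text(KS_TEMPLATE.substitute(host=host, **record))
        written.append(path)
    return written, unknown


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files, missing = prepare_install(["lxb0001", "lxb0002", "lxb9999"], tmp)
        print([p.name for p in files], missing)
```

Keeping all per-host data in one lookup is what makes the mass-install case cheap: the same function handles one host or a whole delivery, and hosts missing from the database are reported instead of being installed with guessed values.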

Page 6: Large Farm 'Real Life Problems' and their Solutions

Tools (cont’d)

• Automated Fabric [LEAF]:
  • State Management System SMS:
    • Other CDB changes are done by SMS:
      • Change OS/Cluster
    • Systems have state:
      • ‘production’ or ‘standby’
  • Hardware Management System HMS:
    • Workflow to track hardware changes [interfaces CDB]:
      • New machine arrival
      • Machine moves
      • Machine interventions (Vendor calls), retirements
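As an illustration of the SMS idea (every node has a state, and OS or cluster changes go through one tool that updates the configuration database), here is a minimal Python sketch. It is not the SMS implementation: the `Node` and `StateManager` names, the example cluster and OS names, and the rule that a node must be in ‘standby’ before it is reassigned are assumptions made for the example. Only the two state names come from the slide.

```python
from dataclasses import dataclass

VALID_STATES = {"production", "standby"}  # the two states named on the slide


@dataclass
class Node:
    name: str
    cluster: str
    os: str
    state: str = "standby"


class StateManager:
    """Hypothetical SMS-like front end: every change goes through one place."""

    def __init__(self):
        self.nodes = {}    # stand-in for the node records kept in CDB
        self.history = []  # audit trail of (node, field, old value, new value)

    def add(self, node):
        self.nodes[node.name] = node

    def _change(self, name, field, new):
        node = self.nodes[name]
        old = getattr(node, field)
        if old != new:
            setattr(node, field, new)
            self.history.append((name, field, old, new))

    def set_state(self, name, state):
        if state not in VALID_STATES:
            raise ValueError(f"unknown state: {state!r}")
        self._change(name, "state", state)

    def move(self, name, cluster=None, os=None):
        """Change cluster and/or OS (assumed rule: only while in standby)."""
        if self.nodes[name].state != "standby":
            raise RuntimeError(f"{name} must be in standby before it is changed")
        if cluster is not None:
            self._change(name, "cluster", cluster)
        if os is not None:
            self._change(name, "os", os)


if __name__ == "__main__":
    sms = StateManager()
    sms.add(Node("lxb0001", cluster="lxbatch", os="slc3", state="production"))
    sms.set_state("lxb0001", "standby")
    sms.move("lxb0001", cluster="lxplus")
    sms.set_state("lxb0001", "production")
    print(sms.history)
```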

Page 7: Large Farm 'Real Life Problems' and their Solutions

The Working Teams

• Operator:
  • 24/7
  • Alarm display
  • Following procedures:
    • Acting on alarms
    • Open Remedy tickets
    • Email/phone notification
    • Machine reboots
• SysAdmins:
  • New team
  • Now 7 staff, more to hire
  • Running more and more services in the CC
  • Doing most of the install and maintenance work on farm PCs
  • Following up h/w failures ‘Vendor calls’
• Service Manager:
  • Farm/Cluster resource planning
  • Writing/improving the procedures/tools
  • Following up on new problems
• “Customers”:
  • Other groups/teams in CERN-IT, like:
    • DB (ORACLE)
    • GD (LCG)
    • GM (EGEE)
    • Experiments (Data Challenges)
  • Changing requirements

Page 8: Large Farm 'Real Life Problems' and their Solutions

Another Management Tool

• Remedy:
  • The problem tracking tool in CERN IT
  • Used in different workflows, e.g. by:
    • The Operator to open tickets following up on alarms
    • The Service Managers to ask for machine interventions
    • The SysAdmins to follow up on problems/general issues
  • HMS is implemented as a Remedy Workflow as well
  • Recently started to get statistics on hardware failures

Page 9: Large Farm 'Real Life Problems' and their Solutions

Real Life Use Cases

• Kernel upgrade (on LXBATCH, ~1500 hosts):
  1. Put the new software into the repository (SWREP, precaching)
  2. Put the new kernel RPM on the nodes: SPMA, with multi-package option (old kernel is still running!)
  3. Configure the new kernel version for the cluster in CDB, and run the GRUB NCM component for configuring the node
  4. Drain the nodes by disabling new batch jobs (maintenance)

Page 10: Large Farm 'Real Life Problems' and their Solutions

Real Life Use Cases

• Kernel upgrade (cont’d):
  5. Node reboots when it is drained (could be at any time)
  6. New machine comes up with new kernel, and goes back into production immediately
• Least downtime for each node. Capacity is always available:
  • First reboot instantaneous, last one can be several days later
  • Everything runs automatically, some cleanup has to be done for a few machines (don’t shut down or h/w failure on startup) => caught by the monitoring/alarm
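Steps 4 to 6 amount to a rolling upgrade: every node stops accepting new jobs, reboots as soon as its own running jobs have finished, and returns to production on the new kernel. The following Python simulation is a sketch of that behaviour only. The `BatchNode` class, the hour-by-hour loop, and the job runtimes are invented for illustration and do not model LSF or the maintenance script.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatchNode:
    name: str
    hours_of_work_left: int          # remaining runtime of jobs already on the node
    kernel: str = "old"
    accepting_jobs: bool = True
    rebooted_at: Optional[int] = None


def rolling_upgrade(nodes, new_kernel="new"):
    """Simulate steps 4-6 hour by hour; return {node name: hour of reboot}."""
    # Step 4: drain, i.e. stop dispatching new jobs to every node.
    for node in nodes:
        node.accepting_jobs = False
    hour = 0
    while any(n.rebooted_at is None for n in nodes):
        for node in nodes:
            if node.rebooted_at is not None:
                continue  # already upgraded and back in production
            if node.hours_of_work_left <= 0:
                # Step 5: the node is drained, so it reboots right away.
                node.kernel = new_kernel
                node.rebooted_at = hour
                # Step 6: back into production on the new kernel.
                node.accepting_jobs = True
            else:
                node.hours_of_work_left -= 1
        hour += 1
    return {n.name: n.rebooted_at for n in nodes}


if __name__ == "__main__":
    farm = [BatchNode("idle", 0), BatchNode("short", 5), BatchNode("long", 72)]
    print(rolling_upgrade(farm))
```

In this toy run the idle node reboots in hour 0 while the node with 72 hours of work reboots three days later, which mirrors the "first reboot instantaneous, last one several days later" observation: no node is ever taken away from a running job, and upgraded nodes take new work while the rest are still draining.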

Page 11: Large Farm 'Real Life Problems' and their Solutions

Real Life Use Cases (cont’d)

• Configure batch resources (LSF):
  • LSF resources are defined, depending on availability, power and cluster of machines
  • Resources are defined in CDB
  • Configured on the node using NCM
  • The master file is generated from CDB2SQL in a cron job every day (reconfig takes several minutes)
  • Consistency of client/master due to CDB
  • Resource assignments are done in CDB on (sub-)cluster level (template structure)
  • Reassignments of (sub-)clusters in CDB are done with SMS tools
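The idea of generating the master file from an SQL view of the configuration, with resources attached to (sub-)clusters rather than to single hosts, can be sketched as below. This is a hedged Python illustration using an in-memory SQLite database. The table names, columns, host names, resource names, and the `host: resources` output layout are all hypothetical. It reproduces neither the CDB2SQL schema nor the LSF configuration file syntax.

```python
import sqlite3


def demo_database():
    """Build a tiny in-memory database with an assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE hosts (hostname TEXT, cluster TEXT, subcluster TEXT);
        CREATE TABLE cluster_resources (cluster TEXT, subcluster TEXT, resource TEXT);
        INSERT INTO hosts VALUES ('lxb0001', 'lxbatch', 'cms');
        INSERT INTO hosts VALUES ('lxb0002', 'lxbatch', 'atlas');
        INSERT INTO cluster_resources VALUES ('lxbatch', 'cms', 'cms_prod');
        INSERT INTO cluster_resources VALUES ('lxbatch', 'cms', 'fast_cpu');
        INSERT INTO cluster_resources VALUES ('lxbatch', 'atlas', 'atlas_prod');
        """
    )
    return conn


def build_master_lines(conn):
    """Return one 'host: res1 res2' line per host, sorted by host name."""
    rows = conn.execute(
        """
        SELECT h.hostname, r.resource
        FROM hosts AS h
        JOIN cluster_resources AS r
          ON r.cluster = h.cluster AND r.subcluster = h.subcluster
        ORDER BY h.hostname, r.resource
        """
    )
    per_host = {}
    for hostname, resource in rows:
        per_host.setdefault(hostname, []).append(resource)
    return [f"{host}: {' '.join(res)}" for host, res in per_host.items()]


if __name__ == "__main__":
    conn = demo_database()
    print(build_master_lines(conn))
    # Reassigning a host to another sub-cluster changes its resources
    # the next time the file is regenerated.
    conn.execute("UPDATE hosts SET subcluster = 'atlas' WHERE hostname = 'lxb0001'")
    print(build_master_lines(conn))
```

Because both the per-node configuration and the master file are derived from the same records, a sub-cluster reassignment only has to be made once, which is the consistency argument made on the slide.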

Page 12: Large Farm 'Real Life Problems' and their Solutions

Real Life Use Cases (cont’d)

• Emptying the Computer Centre
  • For the refurbishment of the CERN Computer Centre all machines had to be moved, either from one side to the other, or downstairs (vault)
  • ~ 2000 machines had to be moved
  • Taking the opportunity to add machines to CDB
    • As quattor and non-quattor nodes
  • Batch machines were moved in ‘racks=44 nodes’:
    • HMS was used to steer the moves
    • SMS/maintenance to shut down the machines
    • Rename/PrepareInstall to bring machines back

Page 14: Large Farm 'Real Life Problems' and their Solutions

Real Life Use Cases (cont’d)

• New h/w arrival => mass installation
  • New machines (~400) arrive at CERN (in bunches of 50 – 100)
  • Racks have to be prepared:
    • Network equipment
    • Power supply
    • (Console service)
  • Plan machine membership (cluster)
  • Put machine into CDB:
    • h/w type
    • Cluster type/OS

Page 15: Large Farm 'Real Life Problems' and their Solutions

Real Life Use Cases

• New h/w arrival (cont’d)
  • Physical machine installation (HMS):
    • New DNS entry
    • OS installation: PrepareInstall
    • Installation by the SysAdmin
    • Burn-in test (h/w test, several days to weeks)
    • Follow up on h/w problems with Vendor
    • Add the machines to the alarm display (SURE)
  • Put machines into production
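With deliveries of 50 to 100 machines at a time, it helps to see at a glance which step each machine is waiting for. The sketch below shows one way such a checklist could be tracked in Python. It is not how HMS or Remedy implement the workflow: the `InstallTracker` class is hypothetical, the step names are condensed from the list above, and enforcing a strict order is an assumption.

```python
# Steps condensed from the slide; enforcing this exact order is an assumption.
INSTALL_STEPS = (
    "dns_entry",
    "os_installation",
    "burn_in_test",
    "alarm_display",
    "production",
)


class InstallTracker:
    """Hypothetical per-host checklist for a batch of newly delivered machines."""

    def __init__(self, hosts):
        # host name -> number of steps already completed
        self.completed = {host: 0 for host in hosts}

    def next_step(self, host):
        """Return the next open step for a host, or None when it is finished."""
        index = self.completed[host]
        return INSTALL_STEPS[index] if index < len(INSTALL_STEPS) else None

    def complete(self, host, step):
        """Mark a step as done; refuse steps that are out of order."""
        expected = self.next_step(host)
        if step != expected:
            raise ValueError(f"{host}: expected {expected!r}, got {step!r}")
        self.completed[host] += 1

    def summary(self):
        """Count hosts per next open step ('done' for finished hosts)."""
        counts = {}
        for host in self.completed:
            key = self.next_step(host) or "done"
            counts[key] = counts.get(key, 0) + 1
        return counts


if __name__ == "__main__":
    tracker = InstallTracker(["lxb1001", "lxb1002", "lxb1003"])
    tracker.complete("lxb1001", "dns_entry")
    tracker.complete("lxb1001", "os_installation")
    print(tracker.summary())
```

Refusing out-of-order steps means a machine cannot be put into production before its burn-in test and alarm display entry are recorded, which is the kind of guarantee a tracked workflow is meant to give.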

Page 17: Large Farm 'Real Life Problems' and their Solutions

Collaborations

• External ‘Customers’:
  • EGEE, LCG, and other groups at CERN are now using Quattor managed machines:
    • They benefit from standard, manageable, and reproducible machine setups
    • They are able/should learn to do modifications themselves
• External sites using Quattor:
  • IN2P3, NIKHEF, UAM Madrid,… are discussing or already using Quattor => see Rafael’s talk
• This helps to enhance the tools:
  • Service nodes (for LCG-2)
  • Having a wider usage
  • Generalizing components

Page 18: Large Farm 'Real Life Problems' and their Solutions

Summary

• ELFms is deployed in production at CERN
  • Established technology – from Prototype to Production
  • Though enhancements are ongoing
  • Fundamental part of our infrastructure
  • Merged with our existing environment
• Quattor and Lemon are generic software
  • Used by others inside/outside CERN
  • Hopefully a fruitful collaboration in the future

Page 19: Large Farm 'Real Life Problems' and their Solutions

Useful Links:

• ELFms: http://cern.ch/elfms
• Quattor: http://quattor.org/
• Lemon: http://cern.ch/lemon
• LEAF: http://cern.ch/leaf
• Previous presentations:
  • HEPiX II/2003 (Vancouver): http://www.triumf.ca/hepix2003
    • ‘The new Fabric Management Tools in Production at CERN’
  • HEPiX I/2004 (Edinburgh): http://www.nesc.ac.uk/esi/events/291/
    • ‘ELFms, status, deployment’ by German Cancio
    • ‘Lemon Web Monitoring’ by Miroslav Siket
  • CHEP 2004 (Interlaken): http://chep2004.web.cern.ch/chep2004/
    • ‘Current Status of Fabric Management at CERN’ by German Cancio

Page 20: Large Farm 'Real Life Problems' and their Solutions

Questions?