The Status of the Prague Computing Farm

4.12.2008, Prague
Jan Švec, Institute of Physics AS CR


Page 1: The Status of the Prague Computing Farm

4.12.2008, Prague

Jan Švec, Institute of Physics AS CR

Page 2: The Status of the Prague Computing Farm

History and main users
Hardware description
Networking
Software and jobs distribution
Management
Monitoring
User support

Page 3: The Status of the Prague Computing Farm

HEP:
LHC - ATLAS, ALICE
Tevatron - D0
RHIC - STAR
H1, Calice

Astrophysics: Auger

Solid State Physics

Others (HEP section of the Institute of Physics)

Active participation in GRID projects since 2001 (EDG, EGEE, EGEE II, EGEE III (running), EGI) in collaboration with CESNET

Czech Tier-2 site, connected to 2 Tier-1 sites (Forschungszentrum Karlsruhe, Germany; ASGC, Taiwan) and 5 Tier-3 sites (MFF Troja, FJFI ČVUT, ÚJF Řež, ÚJF Bulovka, UTEF)

Page 4: The Status of the Prague Computing Farm

First computers bought in 2001 (2 racks), placed in the main building: insufficient cooling, small space, small UPS, inconvenient access (2nd floor)

New server room opened in 2004:
server room and adjacent office
18 racks
200 kVA UPS
350 kVA diesel generator
2 cooling units (water cooling planned for 2009)
automatic fire suppression system (Argonite gas)
good access

Page 5: The Status of the Prague Computing Farm

35x dual PIII 1.13 GHz
67x dual Xeon 3.06 GHz
5x dual Xeon 2.8 GHz (redundant components, key services)
3x dual Opteron 1.6 GHz (file servers)
36x bl35p - dual Opteron 275, 280
6x bl20p - dual Xeon 5160
8x bl460 - dual Xeon 5160
12x bl465 - dual Opteron 2220

Page 6: The Status of the Prague Computing Farm

HP Netserver 12 - 1 TB SCSI
Easy Stor - 10 TB ATA
Easy Stor - 30 TB SATA
Promise VTrak M610p - 13 TB SATA
HP EVA 6100 - 28 TB FATA (SATA over FC)
Overland Ultamus - 144 TB SATA, DPM pool
Overland Ultamus - 12 TB, 4 Gb Fibre Channel, tape library cache
100 TB tape library - LTO-4 (expandable to 400 TB)

Page 7: The Status of the Prague Computing Farm

SGI Altix ICE 8200

512 cores, Intel Xeon 2.5 GHz
1 GB RAM per core
diskless nodes
external SAS disk array, 7.2 TB
InfiniBand 4x (20 Gbps)
SUSE Linux Enterprise Server
Torque + Maui
SGI ProPack

Page 8: The Status of the Prague Computing Farm

IBM iDataPlex

672 cores, Intel Xeon 2.83 GHz
2 GB RAM per core
local SAS disks, 300 GB
Scientific Linux CERN 4
Torque + Maui

First iDataPlex installation in Europe

Page 9: The Status of the Prague Computing Farm

Scientific Linux (CERN) 4 and 5; SUSE Linux (SLES 10, openSUSE 11)
32-bit; 64-bit testing in progress
Job management - PBSPro 9.x, Torque with Maui scheduler
○ fair share used for scheduling
○ CPU-time and walltime multipliers
Legato NetWorker - tape backup (user homes, configuration)
gLite grid middleware (CE, SE, UI, MON box, site BDII, …)

Job submission (a minimal submission sketch follows this list):
Local - “prak” interface (for experiments without grid support)
○ no special requirements
GRID - UI interface (ATLAS, ALICE, Auger)
○ X.509 certificate, signed by a GRID certification authority (CESNET, CERN)
Interface hosts merging in progress
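
As a rough illustration of the local route, here is a minimal sketch of handing a batch job to Torque with qsub; the script name job.sh and the queue name short are hypothetical placeholders, not the site's actual configuration.

import subprocess

# Minimal sketch: submit a batch script to Torque's qsub.
# "job.sh" and the queue "short" are hypothetical placeholders.
result = subprocess.run(
    ["qsub", "-q", "short", "job.sh"],
    capture_output=True, text=True, check=True,
)
# On success qsub prints the new job identifier (e.g. "12345.golias").
print("submitted:", result.stdout.strip())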

Page 10: The Status of the Prague Computing Farm

installation using PXE + kickstart
system automatically updates from SLC repositories
gLite middleware configured with YAIM (integrated into cfengine)
local site changes managed using cfengine

Page 11: The Status of the Prague Computing Farm

Manual administration is tedious and error prone
Configuration is scattered among several places
○ Kickstart’s postinstall vs. existing nodes
Ad-hoc changes, no revisions
○ Communication among sysadmins
Machines temporarily offline
○ Conflicting changes
Issues when reinstalling
○ Stuff went missing
Too much work

Cfengine to the rescue!
Managing hundreds of boxes from a central place
Change tracking with Subversion
Describe the end result, not the process (see the sketch below)
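
"Describe the end result" means a policy states the desired state and the agent converges the machine toward it on every run. The slides do not include policy source, so the sketch below illustrates the convergence idea in plain Python rather than in cfengine's own syntax; the file path and the line are hypothetical.

from pathlib import Path

def ensure_line(path: str, line: str) -> None:
    # Converge a file toward a desired state: append `line` only if it is
    # missing. Running this any number of times gives the same end result,
    # which is the property cfengine policies rely on.
    f = Path(path)
    text = f.read_text() if f.exists() else ""
    if line not in text.splitlines():
        if text and not text.endswith("\n"):
            text += "\n"
        f.write_text(text + line + "\n")

# Hypothetical example: make sure an NFS mount entry exists, no matter
# how many times the agent runs.
ensure_line("/tmp/fstab.test", "nfs1:/home  /home  nfs  defaults  0 0")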

Page 12: The Status of the Prague Computing Farm

Cfengine architecture:
a central server running cfservd
○ serves policy files and other data to nodes
cfagent on each node
○ performs the real changes
cfexecd (cron)
○ a thin wrapper around cfagent

Configuration is in one place and can be easily managed by SCMs

Shooting yourself in the foot is easy:
any change can have a huge impact - think before making changes!
interactions among rules might have unexpected results => TESTING IS IMPORTANT!

Page 13: The Status of the Prague Computing Farm

Flexible monitoring is crucial for reliability
Nagios, Munin (graphical representation of the time development of events), PBS graphs, MRTG, CEF monitoring, …

Why Nagios:
de facto standard in monitoring
open source
easy to write new sensors
static configuration is not a problem
lots of addons
○ Nuvola - a better look for Nagios
○ NagiosReport (developed locally, summarizes problems at the site)
○ NagiosGrapher (generates graphs from Nagios outputs)
○ …

Plugins (a minimal plugin sketch follows this list):
default plugins (part of the Nagios installation) - ping, disk, procs, load, swap, ldap, …
SRCE plugins (developed by E. Imamagic) - cert, dpm, gridftp, srm, …
locally developed - hpacucli, ups, jobs, gstat
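
The source of the locally developed sensors is not shown in the slides, but the contract a Nagios plugin follows is standard: print one status line (text after “|” is performance data) and exit 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. A minimal sketch with hypothetical load thresholds:

#!/usr/bin/env python3
import os
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_load(warn=4.0, crit=8.0):
    # Hypothetical sensor: compare the 1-minute load average to thresholds.
    try:
        load1 = os.getloadavg()[0]
    except OSError:
        print("LOAD UNKNOWN - cannot read load average")
        return UNKNOWN
    status = "OK" if load1 < warn else ("WARNING" if load1 < crit else "CRITICAL")
    # One status line; the part after "|" is perfdata for graphing tools.
    print("LOAD %s - load1=%.2f|load1=%.2f" % (status, load1, load1))
    return {"OK": OK, "WARNING": WARNING, "CRITICAL": CRITICAL}[status]

if __name__ == "__main__":
    sys.exit(check_load())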

Page 14: The Status of the Prague Computing Farm

Example NagiosReport output:

Nagios summary generated at 11/28/2008 00:10:02 in 1.54553 seconds.
======================================================================
Hosts in trouble: golias123, golias131
Hosts in downtime (not monitored): golias01, golias02, golias38
======================================================================

golias123:
==========
CFAGENT: CHECK_NRPE: Socket timeout after 30 seconds.

golias131:
==========
CFAGENT: CHECK_NRPE: Socket timeout after 30 seconds.

downtimes:
==========
golias01: Down for tests of cfengine installation.
golias02: Down for golias02-golias199 synchronization tests.
golias38: Disk failure.

Page 19: The Status of the Prague Computing Farm

User support

Mailing list [email protected] (primarily used for news, announcements of downtimes, etc.)

wiki pages with documentation (user and admin sections)

RT (Request Tracker) system, developed by Best Practical Solutions, operated by CESNET in cooperation with FZU; used by users for communication with administrators

1. User sends mail to [email protected] (see the sketch after this list).
2. The RT system creates a ticket with a unique number.
3. All administrators are notified by RT that a new ticket was created.
4. Administrators can discuss the problem with the user (using the reply function) or with each other (using the comment function). All communication is saved as part of the ticket.
5. Each week RT automatically reminds administrators of all open tickets.
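
Since RT opens tickets from plain email, the first step can be scripted. Below is a minimal sketch using Python's smtplib; the addresses and the SMTP host are placeholders, not the site's real values.

import smtplib
from email.message import EmailMessage

# Sketch: open an RT ticket by mailing the tracker's queue address.
# All addresses and the SMTP host are hypothetical placeholders.
msg = EmailMessage()
msg["From"] = "user@example.org"
msg["To"] = "rt@example.org"          # hypothetical RT queue address
msg["Subject"] = "Job stuck in the queue"
msg.set_content("My job has been queued since yesterday. Could you have a look?")

with smtplib.SMTP("smtp.example.org") as smtp:
    smtp.send_message(msg)

# RT replies with the new ticket number; keeping that number in the subject
# of any follow-up mail appends the message to the same ticket.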
