Martin Bly
RAL Tier1/A
RAL Tier1/A Site Report
HEPiX-HEPNT
Vancouver, October 2003
Contents
• GRID Stuff – clusters and interfaces
• Hardware and utilisation
• Software and utilities
Layout
EDG Status
• EDG 2.0.x deployed on production test-bed since early September. Provides:
– EDG RGMA info catalogue
– RLS for lhcb, biom, eo, wpsix, tutor and Babar
• EDG 2.1 deployed on dev test-bed. VOMS integration work underway. May be found useful by small GRIDPP experiments (e.g. NA48, MICE and MINOS)
• EDG 2.0 gatekeeper provides gateway into main CSF production farm. Provides access for some of Babar and ATLAS work. Being prepared for forthcoming D0 production via SAMGrid
• Along with IN2P3, CSFUI provides main UI for EDG
• Many WP3 and WP5 mini test-beds
• Further GRID integration into production farm via LCG – not EDG
LCG Integration
• LCG-0 mini test-bed deployed March
• LCG-1 test-bed deployed in July
• LCG 1 upgraded to LCG1-1_0_1 in August/September. Consists of:
– Lcgwest regional GIIS
– RB, CE, SE, UI, BDII, PROXY, 5*WN
• WN = 2*1GHz/1GB RAM, SE = 540GB
• Soon need to make important decisions about how much hardware to deploy into LCG – driven by what the Experiment Board want.
• Issues:
– Installation and configuration still difficult for non-experts.
– Documentation still thin in many places.
– Support often very helpful but answers not always forthcoming for some problems.
– Not everything works – all of the time.
• Beginning to discuss internally how to interoperate with production farm.
SRB Service for CMS
• SDSC Storage Resource Broker
• SRB MCAT for whole CMS production. Consists of enterprise class ORACLE servers and “thin” MCAT ORACLE client.
• SRB interface into Datastore
• SRB enabled disk server to handle data imports.
• SRB clients on disk servers for data moving
• Needed some work to deploy
• Very good support from the SDSC developers
• ADS interface integrated into main SRB source
• Considerable learning experience for Datastore team (and CMS)!
P4 Xeon Experiences
• Disappointing performance with gcc
– Hope for 2.66P4/1.4P3=1.5
– See 1.2 - 1.3
• Can obtain more by exploiting hyper-threading but Linux CPU scheduling causes difficulties (ping-pong effects)
• Performance better with Intel Compiler
• Efforts to run ‘O(1)’ scheduler unsuccessful
• CPU accounting now depends on number of jobs running.
• Beginning to look closely at Opteron solutions.
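To put the gcc shortfall above in perspective, the following is a minimal Python sketch that expresses the observed P4/P3 ratio (1.2 to 1.3) as a fraction of the hoped-for ratio (1.5). The function name is hypothetical; the only inputs are the figures quoted on the slide.

```python
def fraction_of_target(observed: float, hoped: float) -> float:
    """Return the observed speed-up as a fraction of the hoped-for speed-up."""
    if hoped <= 0:
        raise ValueError("hoped-for speed-up must be positive")
    return observed / hoped


HOPED = 1.5            # hoped-for P4/P3 performance ratio (from the slide)
OBSERVED = (1.2, 1.3)  # observed range with gcc (from the slide)

for ratio in OBSERVED:
    share = fraction_of_target(ratio, HOPED)
    print(f"observed ratio {ratio:.1f} is {share:.0%} of the hoped-for {HOPED:.1f}")
```

On these figures the P4 nodes deliver roughly four fifths of the expected gain.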
Datastore Upgrade
• STK 9310 robot, 6000 slots
– IBM 3590 drives being phased out (10GB, 10MB/sec)
– STK 9940B drives in production (200GB, 30MB/sec)
• 4 IBM 610+ servers with two FC connections and Gbit networking on PCI-X
– 9940 drives FC connected via 2 switches for redundancy
– SCSI RAID 5 disk with hot spare for 1.2TB cache space
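As a rough illustration of what the drive figures above imply, the sketch below estimates how long it takes to stream one full cartridge. It reads the bracketed figures as cartridge capacity and streaming rate, and it assumes decimal units (1GB = 1000MB) and a sustained, uninterrupted transfer; the function name is hypothetical.

```python
def fill_time_hours(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to stream one full cartridge at the quoted drive rate.

    Assumes 1 GB = 1000 MB and a sustained transfer with no stops or rewinds.
    """
    if rate_mb_per_s <= 0:
        raise ValueError("transfer rate must be positive")
    seconds = capacity_gb * 1000.0 / rate_mb_per_s
    return seconds / 3600.0


# (capacity in GB, rate in MB/s), taken from the slide
DRIVES = {
    "IBM 3590": (10, 10),
    "STK 9940B": (200, 30),
}

for name, (capacity_gb, rate) in DRIVES.items():
    print(f"{name}: {fill_time_hours(capacity_gb, rate):.2f} h per cartridge")
```

Under these assumptions a 9940B cartridge holds twenty times as much data but takes under seven times as long to fill.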
[Diagram: eight 9940B drives (1 to 8) in the STK 9310 “Powder Horn” robot, FC connected through Switch_1 and Switch_2 to four RS6000 servers (adapters fsc0 and fsc1, tape devices rmt1 to rmt8). Each server has 1.2TB of disk and is attached to the Gbit network.]
Operating Systems
• Redhat 6.2 closed end of August (Babar build-box)
• Redhat 7.2
– Babar 7.2 service migrated to Redhat 7.3 during October.
– Residual ‘bulk’ batch service closing soon.
– Three front-ends for Babar.
• Redhat 7.3
– Service now main workhorse for LHC experiments and Babar batch work.
– ‘Bulk’ service opening soon.
– Three front-ends.
– LCG-1
• Need to start looking at what to do next (Fedora, Debian, RH-ES/AS, …)!
• Need to deploy Redhat Advanced Server
Next Procurement
• Based on the experiments’ expected demand profile (as best they can estimate).
• Exact numbers still being finalised, but about:
– 250 dual processor CPU nodes
– 70TB available disk
– 100TB tape
[Chart: CPU Requirements (KSI2K). Vertical axis 0 to 1000. Legend: UKQCD, Other, D0, Alice, LHCb, Atlas, CMS, BaBar, GPP-only, 90%, Capacity.]
[Chart: GridPP Disk Requirements (TB). Vertical axis 0 to 160. Legend: LCG, Others, UKQCD, D0, Alice, LHCb, Atlas, CMS, BaBar, 90%, Capacity.]
New Helpdesk
• Need to deploy new helpdesk (had Remedy). Wanted:
– Web based.
– Free open source.
– Multiple queues and personalities.
• Looked at Bugzilla, OTRS and RequestTracker.
• Finally selected RequestTracker.
• http://helpdesk.gridpp.rl.ac.uk/
• Available for other Tier 2 sites and other GRIDPP projects if needed.
YUMIT: RPM Monitoring
• Hundreds of nodes on the farm. Need to make sure RPMs are up to date.
• Wanted light-weight solution until full fabric management tools are deployed.
• Package written by Steve Traylen:
– Yum installed on all systems
– Nightly comparison with YUM database uploaded to MySQL server.
– Simple web based display utility in Perl
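The nightly comparison described above could look roughly like the following Python sketch. It is not the YUMIT code: the function names, table layout, host name and package data are all hypothetical, an in-memory SQLite database stands in for the MySQL server, and a package is flagged whenever its installed version string differs from the repository's (RPM's real version ordering is not implemented).

```python
import sqlite3


def outdated(installed, available):
    """Return sorted (name, installed_version, available_version) tuples for
    packages whose installed version string differs from the repository's."""
    return sorted(
        (name, version, available[name])
        for name, version in installed.items()
        if name in available and available[name] != version
    )


def upload(conn, host, rows):
    """Replace this host's rows in the (hypothetical) stale_rpms table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stale_rpms "
        "(host TEXT, name TEXT, installed TEXT, available TEXT)"
    )
    conn.execute("DELETE FROM stale_rpms WHERE host = ?", (host,))
    conn.executemany(
        "INSERT INTO stale_rpms VALUES (?, ?, ?, ?)",
        [(host, *row) for row in rows],
    )
    conn.commit()


# Made-up package data for one node; a real run would query the node and
# the yum repository instead.
installed = {"example-lib": "1.0-1", "example-tool": "2.3-4"}
available = {"example-lib": "1.0-2", "example-tool": "2.3-4"}

conn = sqlite3.connect(":memory:")  # stand-in for the central MySQL server
rows = outdated(installed, available)
upload(conn, "node001", rows)
print(conn.execute("SELECT * FROM stale_rpms").fetchall())
```

Keeping one set of rows per host means a web display only needs a simple query per node or per package.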
Exception Monitoring: Nagios
• Already have an exception handling system (CERN’s SURE coupled with the commercial Automate).
• Looking at alternatives – no firm plans yet but currently looking at NAGIOS: http://www.nagios.org/
Summary: Outstanding Issues
• Many new developments and new services deployed this year.
• We have to run many distinct services. For example, FERMI Linux, RH 7.2/7.3, EDG testbeds, LCG, CMS DC03, SRB etc.
• Waiting to hear when the experiments want LCG in volume.
• The Pentium 4 processor is performing poorly.
• Redhat’s changing policy is a major concern