TRANSCRIPT
Tier1 Status Report
Andrew Sansum, GridPP15
12 January 2006
Overview
• Usage (make the Oversight Committee happy)
• Service Challenge plans
• Robot upgrade and CASTOR plans
• Other hardware upgrades
• SPEC benchmark (if time)
Utilisation Concerns
• July: Oversight Committee concerned that the Tier-1 was under-utilised – procurements delayed
• Team perception was that the farm was quite well occupied and demand increasing
• Needed to:
– Understand the mismatch between CPU use and occupancy
– Maximise the capacity available (minimise downtime etc.)
– Address reliability issues – buy more mirror disks, investigate Linux HA etc., move to a limited on-call system ...
– Maximise the number of running experiments
January-July
[Stacked bar chart: CPU use (KSI2K * CPU months) per month, Jan–Dec, broken down by experiment (Alice, Atlas, Babar, CDF, CMS, DZERO, LHCB, UKQCD, Minos, H1, Zeus, SNO, Other). Nominal capacity: 796 KSI2K]
Efficiency Monitoring
• Work by Matt Hodges: http://wiki.gridpp.ac.uk/wiki/RAL_Tier1_PBS_Efficiencies
• Automated post-processing of PBS logfiles each month, measuring JobCPUEfficiency = [CPU time] / [Elapsed time] (sketched below)
• Important tool to improve farm throughput
• Need to address low-efficiency jobs – they occupy job slots that could be used by other work.
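A minimal sketch of the efficiency calculation described above. The function names and record layout are illustrative assumptions – real use would parse the PBS accounting logfiles:

    # Illustrative sketch of JobCPUEfficiency = CPU time / elapsed time.
    # Job records are hypothetical (cpu_seconds, elapsed_seconds) pairs.

    def job_cpu_efficiency(cpu_seconds, elapsed_seconds):
        """Efficiency of a single job."""
        if elapsed_seconds <= 0:
            return 0.0
        return cpu_seconds / elapsed_seconds

    def vo_efficiency(jobs):
        """Aggregate efficiency over a list of (cpu, elapsed) records."""
        total_cpu = sum(cpu for cpu, _ in jobs)
        total_elapsed = sum(elapsed for _, elapsed in jobs)
        return job_cpu_efficiency(total_cpu, total_elapsed)

    # Example: one CPU-bound job and one job stuck waiting on I/O
    print(vo_efficiency([(3500, 3600), (100, 3600)]))  # 0.5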
Overview
Experiment September October November December
ATLAS 0.88 0.92 0.81 0.93
BaBar 0.94 0.87 0.90 0.90
BioMed 0.97 0.97 0.95 0.55
CMS 0.51 0.63 0.28 0.67
DTeam 0.00 0.03 0.06 0.47
D0 0.74 0.89 0.92 0.92
H1 0.82 0.90 0.87 0.87
LHCb 0.98 0.95 0.91 0.92
MINOS 0.97 0.87 0.26 0.68
Pheno 0.97 0.70
SNO 0.98 0.97 0.98 0.86
Theory 0.99 0.99 0.99 0.91
Zeus 0.97 0.94 0.98 0.98
Others 0.98 0.98 0.97 0.98
Babar – December 2005
• Stuck jobs clocking up wallclock time but no CPU – probably system staff paused the Babar job queue to fix a disk server
• Typical analysis jobs
Minos – December 2005
• NFS server overload
CMS – November 2005
• dCache server overload – 3Gb/s I/O to dCache
Outstanding Scheduling Problems
• How to schedule large-memory jobs. A large-memory job can occupy two job slots on an older 1GB-memory system (see the sketch after this list).
– Is it better to always run whatever work you can, even at the expense of possible future job starts?
– Or is it better to limit large-memory job starts and keep small-memory systems free for small-memory work that might not turn up?
• No inter-VO scheduling at present (e.g. give 40% to LHC)
• No intra-VO scheduling at present (e.g. give production 80%)
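Purely as an illustration of the trade-off in the first bullet (not the actual MAUI configuration – the node model and policy names are invented):

    # Toy model: a 1GB node has 2 job slots and a large-memory job
    # consumes both of them.

    def can_start(job_mem_gb, free_slots, policy):
        large = job_mem_gb > 0.5           # "large" relative to 0.5GB per slot
        slots_needed = 2 if large else 1
        if free_slots < slots_needed:
            return False
        if policy == "greedy":             # always run whatever fits now
            return True
        if policy == "reserve":            # keep headroom for small jobs
            return not large or free_slots > 2
        raise ValueError(policy)

    # greedy maximises occupancy now; reserve protects slots for
    # small-memory work that may (or may not) turn up later.
    print(can_start(2.0, 2, "greedy"))   # True
    print(can_start(2.0, 2, "reserve"))  # False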
XMASS Availability
• Usually the farm runs unattended over Christmas. A major failure early in the period can severely impact December CPU delivery [e.g. the power failure last XMASS]
• Martin Bly and Steve Traylen worked several days of overtime over the period to fix faults.
• Fantastic availability and good demand led to one of the most productive Christmases ever.
XMASS Running
[Charts of farm activity over the Christmas period]
Have we succeeded?
• Significant improvement for second half of 2005.
2005 Occupancy
[Charts: 2005 farm occupancy overview]
Scheduler Monitoring
• Work by Matt Hodges: http://wiki.gridpp.ac.uk/wiki/RAL_Tier1_PBS_Scheduling
• Now we have heavy demand!!! – need to monitor the MAUI scheduler.
• Put MAUI scheduling data into ganglia.
• Gain insight into scheduling – help experiments understand why their jobs don't start ...
• Not an exact science – the UB over-allocates CPU shares – we simply use this data to calculate target shares and schedule over a relatively short period of time (9 days, decaying by 0.7 per day); a sketch of the calculation follows below.
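A minimal sketch of the decayed-usage bookkeeping described in the last bullet, in the spirit of MAUI's fair-share settings. The window (9 days) and decay factor (0.7) are from the slide; the data layout and function names are assumptions:

    # Decayed fair-share usage over a 9-day window, 0.7 decay per day.

    DECAY = 0.7
    WINDOW = 9

    def decayed_usage(daily_usage):
        """daily_usage[0] is today, daily_usage[1] yesterday, and so on."""
        return sum(u * DECAY**age for age, u in enumerate(daily_usage[:WINDOW]))

    def share_vs_target(usage_by_vo, target_by_vo):
        """Ratio of each VO's decayed share of the farm to its target share."""
        decayed = {vo: decayed_usage(u) for vo, u in usage_by_vo.items()}
        total = sum(decayed.values()) or 1.0
        return {vo: (decayed[vo] / total) / target_by_vo[vo]
                for vo in usage_by_vo}

    # A ratio above 1 means a VO is over target, so the scheduler can
    # deprioritise its queued jobs relative to under-target VOs.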
Target Shares
[Charts: January target shares as implemented, actual shares achieved, and usage/target ratios]
Service Challenges
• An ever-expanding project!
• 12 months ago:
– SC2 throughput, March 2005, disk/disk
– SC3 throughput, July 2005, disk/tape
– SC4 throughput, April 2006, tape/tape
• 6 months ago:
– As above, but now add service phases and Tier-2s (a significant part of Matt and Catalin's work: VO boxes, FTS, LFC, etc.)
Service Challenges
• 2 months ago – as above, but add:
– SC3 throughput test 2 (16th January, 1 week, disk @ 100MB/s)
– Change the April 2006 test to disk/disk (from tape) @ 150MB/s. Motivation: several Tier-1 tape SRMs are late.
– Add a July 2006 test – tape/tape @ 150MB/s. RAL will use this as an opportunity to use CASTOR.
– Add Tier-1 to Tier-2 throughput tests.
• 1 month ago:
– As above, but now a request for an early February test to the tape robot @ 50MB/s
– RAL unable to take part – too many commitments. May review closer to the date – depends on our work schedule. Other sites are in a similar position.
Robot Upgrade
• Old robot:
– 6,000 slots
– Early-1990s hardware – but still pretty good
– 1 robot arm
– Supports the most recent generation of drives, but end of line
– Still operational, but drives will migrate shortly and it will close
• New robot (funded by CCLRC): STK SL8500
– 10,000 slots
– Life expectancy of at least 2 drive generations
– Up to 8 mini-robots mounting tapes – faster and more resilient
– T10K drives/media in FY06: faster drives and bigger tapes
CASTOR Plans
• The ADS software (which drives the tape robot) is home-grown, old and hard to support. Many limiting factors. Not viable for LCG operation.
• The only financially and functionally viable alternative to ADS found was CASTOR.
• Following negotiation with CERN, RAL is now a member of the CASTOR project (contributing the SRM) – see Jens's talk.
• It makes no sense to operate both dCache and CASTOR. Two SRMs = double trouble**2.
CASTOR Plans
• Carry out the migration from dCache to CASTOR2 over 12 months (a CASTOR1 test system was deployed in autumn).
• Tension:
– Deploy CASTOR as late as possible to allow proper build/testing
– Deploy CASTOR as soon as possible to allow experiments to test
• CASTOR2 test system scheduled for end of January
• Will be looking for trial users in May to throughput-test CASTOR
• CASTOR will be used in the SC4 tape throughput test in July.
• Provided we are satisfied, the CASTOR SRM will be offered as a production service in September, for both disk and tape.
dCache Plans
• dCache is currently our production SRM to both disk and tape.
• dCache will remain a production SRM to disk (and part of the tape capacity):
– Until CASTOR is proven to work
– And experiments have had adequate time to migrate data from dCache
• dCache phase-out (probably) in 1H07
• dCache may provide a plan B if CASTOR deployment is late – not desirable, for all kinds of reasons.
Disk and CPU
• Hardware delivery scheduled for late February (evaluation nearly complete):
– Modest CPU upgrade (200-266 KSI2K) – modest demand
– Spend more on disk (up to 135TB of additional capacity)
– CPU online early June; disk in July
– Disk moving from external RAID arrays to internal PCI-based RAID in order to reduce cost
• Probably a second round of purchases early in FY06. Size and timing to be agreed at the Tier-1 Board. Capacity available by September.
Other Hardware
• Oracle hardware
– See Gordon's talk for details
– Mini storage area network to meet Oracle requirements:
• 1 Fibre Channel RAID array
• 4 server hosts
• SAN switch (QLogic SANbox 5200 stackable switch)
• Delivery in February
• Upgrade 10 systems to mirrored disks for critical services.
Network Plans
• Currently 2*1Gbit to CERN via UKLIGHT; 1*1Gbit to Lancaster; 2*1Gbit to SJ4 production.
• Upgrade to 4*1Gbit to CERN at end of January (for SC4)
• Upgrade the site edge (lightpath) router to 10Gbit at end of February
• Attach the Tier-1 to the edge at 10Gbit, via a 10Gbit uplink from a Nortel 5530 switch (a £5K switch, stackable with our existing 5510 commodity units) (March)
• Attach the Tier-1 to CERN at 10Gbit early in the SJ5 rollout (early summer).
Machine Rooms
• Extensive planning out to 2010 and beyond to identify growth constraints
• Major power and cooling work (an additional 400KW) in A5 Lower in 2004, funded by CCLRC e-Science, to accommodate the growth of the Tier-1 and HPC systems. Sufficient to cool kit arriving up to mid-2006.
• Further cooling expansion has just started (>400KW) to meet the profile out to the 2008 hardware delivery for the Tier-1
• Investigating building a new machine room for 2009+ hardware installation.
T1/A and SPEC
• Work by George Prassas (hardware support)
• Motivation:
– Investigate whether the batch scaling factors used by T1/A were accurate
– Investigate whether our performance/scaling mirrored the published results
– Help form a view about CPUs for future purchases
SPEC CPU2000 - Metrics
• SPECint2000 / SPECint_base2000
– Geometric mean of 12 normalised ratios (one for each app) when compiled with aggressive/conservative compiler options
• SPECint_rate2000 / SPECint_rate_base2000
– Same as above, but for 12 normalised throughput ratios
• The same applies for CFP2000
Warning: Maths!
• If α1, α2, α3, ..., αn are real numbers, we define their geometric mean as (computed in the sketch below):

(α1 * α2 * α3 * ... * αn)^(1/n) = exp( (1/n) * Σ ln(αi) )
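For concreteness, a small sketch of computing a SPECint-style score as the geometric mean of per-benchmark ratios; the 12 ratio values below are invented for illustration:

    import math

    def geometric_mean(values):
        """(v1 * v2 * ... * vn)^(1/n), computed via logs for stability."""
        return math.exp(sum(math.log(v) for v in values) / len(values))

    # 12 hypothetical per-benchmark SPECratios (one per CINT2000 app);
    # the reported SPECint2000 score is their geometric mean.
    ratios = [820, 910, 750, 880, 1030, 690, 960, 810, 780, 990, 840, 900]
    print(round(geometric_mean(ratios)))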
Results – SPECint Scaling
[Bar chart: scaling factors (%) for the P3 1.4, P4 2.6, P4 2.8, P4 3.2 and Opteron 2GHz systems, comparing T1/A measurements against published figures]
Results – SPECint Performance
[Bar chart of SPECint2000 scores, T1/A measured vs published:]

System     T1/A   Published
P3 1.4     442    664
P4 2.6     663    1033
P4 2.8     722    1068
P4 3.2     887    1420
Opt 2GHz   888    1317
Scaling Comparison
[Bar chart: scaling (%) across P3 1.4, P4 2.6, P4 2.8, Opt 2GHz and P4 3.2, comparing LHCb SPEC, T1/A SPEC, published figures, and Metadise built with g77 and ifc]
Conclusions
• We have come an immense distance in 1 year
• LHC service challenge work is ever expanding
• Major effort to increase utilisation
• Have (for planning purposes) been living in 2006 for most of 2005. Now we have arrived.
• CASTOR deployment will be a considerable challenge.
• Hardening the service will be an important part of the work for 1H2006.
• Very little time now left and lots to do.