Tier1 Status Report
Andrew Sansum, GRIDPP15
12 January 2006
Overview
• Usage (make the Oversight Committee happy)
• Service Challenge plans
• Robot upgrade and CASTOR plans
• Other hardware upgrades
• SPEC benchmark (if time)
Utilisation Concerns
• July: the Oversight Committee was concerned that the Tier-1 was under-utilised – procurements delayed
• Team perception was that the farm was quite well occupied and demand was increasing
• Needed to:
– Understand the mismatch between CPU use and occupancy
– Maximise the capacity available (minimise downtime etc.)
– Address reliability issues: buy more mirrored disks, investigate Linux-HA etc., move to a limited on-call system
– Maximise the number of running experiments
January-July
[Chart: CPU use (KSI2K × CPU-months) per month, Jan–Dec, stacked by experiment: ALICE, ATLAS, BaBar, CDF, CMS, DZERO, LHCb, UKQCD, MINOS, H1, ZEUS, SNO and Other. Nominal capacity: 796 KSI2K.]
Efficiency Monitoring
• Work by Matt Hodges: http://wiki.gridpp.ac.uk/wiki/RAL_Tier1_PBS_Efficiencies
• Automated post-processing of the PBS log files each month, measuring JobCPUEfficiency = [CPU time] / [elapsed time]
• An important tool for improving farm throughput
• Need to address low-efficiency jobs – they occupy job slots that could be used by other work
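As a sketch, the per-job efficiency calculation can be automated along these lines. The log format below is an assumed PBS/Torque-style accounting "E" (job end) record, and the job IDs and times are invented for illustration:

```python
import re

# Hypothetical PBS accounting "E" records; real logs carry many more fields.
SAMPLE_LOG = [
    "01/12/2005 10:00:00;E;1001.tier1;user=babar001 resources_used.cput=08:00:00 resources_used.walltime=10:00:00",
    "01/12/2005 11:00:00;E;1002.tier1;user=minos002 resources_used.cput=00:30:00 resources_used.walltime=05:00:00",
]

def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def job_efficiency(record: str) -> float:
    """JobCPUEfficiency = [CPU time] / [elapsed (wallclock) time]."""
    cput = re.search(r"resources_used\.cput=(\d+:\d+:\d+)", record).group(1)
    wall = re.search(r"resources_used\.walltime=(\d+:\d+:\d+)", record).group(1)
    return to_seconds(cput) / to_seconds(wall)

for rec in SAMPLE_LOG:
    jobid = rec.split(";")[2]
    print(f"{jobid}: {job_efficiency(rec):.2f}")
```

A stuck job like the BaBar example later in this talk would show up here with near-zero efficiency: large walltime, almost no CPU.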
| Experiment | September | October | November | December |
|------------|-----------|---------|----------|----------|
| ATLAS      | 0.88      | 0.92    | 0.81     | 0.93     |
| BaBar      | 0.94      | 0.87    | 0.90     | 0.90     |
| BioMed     | 0.97      | 0.97    | 0.95     | 0.55     |
| CMS        | 0.51      | 0.63    | 0.28     | 0.67     |
| DTeam      | 0.00      | 0.03    | 0.06     | 0.47     |
| D0         | 0.74      | 0.89    | 0.92     | 0.92     |
| H1         | 0.82      | 0.90    | 0.87     | 0.87     |
| LHCb       | 0.98      | 0.95    | 0.91     | 0.92     |
| MINOS      | 0.97      | 0.87    | 0.26     | 0.68     |
| Pheno      | 0.97      | 0.70    |          |          |
| SNO        | 0.98      | 0.97    | 0.98     | 0.86     |
| Theory     | 0.99      | 0.99    | 0.99     | 0.91     |
| Zeus       | 0.97      | 0.94    | 0.98     | 0.98     |
| Others     | 0.98      | 0.98    | 0.97     | 0.98     |
BaBar, December 2005
Stuck jobs clocking up wallclock time but no CPU. Probably system staff paused the BaBar job queue to fix a disk server.
Typical analysis jobs
MINOS, December 2005
NFS server overload
CMS, November 2005
dCache server overload – 3 Gb/s I/O to dCache
Outstanding Scheduling Problems
• How to schedule large-memory jobs? A large-memory job can occupy two job slots on an older 1GB-memory system.
– Is it better to always run whatever work you can, even at the expense of possible future job starts?
– Or is it better to limit large-memory job starts and keep small-memory systems free for small-memory work that might not turn up?
• No inter-VO scheduling at present (e.g. give 40% to LHC)
• No intra-VO scheduling at present (e.g. give production 80%)
XMASS Availability
• Usually the farm runs unattended over Christmas. A major failure early in the period can severely impact December CPU delivery [e.g. the power failure last XMASS]
• Martin Bly and Steve Traylen worked several days' overtime over the period to fix faults
• Fantastic availability and good demand led to one of the most productive Christmases ever
XMASS Running
Have we succeeded?
• Significant improvement in the second half of 2005.
2005 Occupancy
Scheduler Monitoring
• Work by Matt Hodges: http://wiki.gridpp.ac.uk/wiki/RAL_Tier1_PBS_Scheduling
• Now we have heavy demand! Need to monitor the MAUI scheduler.
• Put MAUI scheduling data into Ganglia
• Gain insight into scheduling – help experiments understand why their jobs don't start
• Not an exact science – UB over-allocates CPU shares; we use this data simply to calculate target shares and schedule over a relatively short period of time (9 days, decaying by 0.7 per day)
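The decaying 9-day window described above can be sketched as follows. The per-VO daily usage figures are hypothetical, not real Tier-1 numbers:

```python
# Sketch of a decaying fairshare window: usage over the last 9 days,
# weighted by 0.7 per day of age, as described on this slide.
DECAY, DEPTH = 0.7, 9

def effective_usage(daily_usage):
    """daily_usage[0] is today, daily_usage[1] yesterday, and so on."""
    return sum(u * DECAY**age for age, u in enumerate(daily_usage[:DEPTH]))

def actual_shares(per_vo_daily):
    """Normalise each VO's decayed usage into a fraction of the total."""
    eff = {vo: effective_usage(days) for vo, days in per_vo_daily.items()}
    total = sum(eff.values())
    return {vo: e / total for vo, e in eff.items()}

# Hypothetical KSI2K-hours per day for two VOs, most recent day first:
usage = {
    "lhcb":  [120, 110, 130, 90, 100, 80, 95, 70, 60],
    "babar": [60, 80, 50, 70, 40, 55, 65, 45, 50],
}
shares = actual_shares(usage)
print({vo: round(s, 3) for vo, s in shares.items()})
```

Recent days dominate: a VO that drained its queue a week ago quickly stops counting against its share.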
Target Shares
January target shares implemented
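As an illustration only, a MAUI fairshare fragment consistent with the 9-day, 0.7-per-day window described on the previous slide might look like this. The group names and target percentages are hypothetical, not RAL's actual January shares:

```
# Hypothetical maui.cfg fairshare fragment (illustrative values only)
FSPOLICY        DEDICATEDPS     # account usage as dedicated processor-seconds
FSDEPTH         9               # keep 9 fairshare intervals
FSINTERVAL      24:00:00        # one interval per day
FSDECAY         0.70            # weight each older day by 0.7

GROUPCFG[lhcb]  FSTARGET=25     # target share in percent
GROUPCFG[atlas] FSTARGET=25
GROUPCFG[babar] FSTARGET=15
```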
Actual Shares
Usage/Target
Service Challenges
• An ever-expanding project!
• 12 months ago:
– SC2 throughput, March 2005, disk/disk
– SC3 throughput, July 2005, disk/tape
– SC4 throughput, April 2006, tape/tape
• 6 months ago:
– As above, but now add service phases and Tier-2s (a significant part of Matt's and Catalin's work: VO boxes, FTS, LFC, etc.)
Service Challenges
• 2 months ago – as above, but add:
– SC3 throughput test 2 (16th January, 1 week, disk @ 100MB/s)
– Change the April 2006 test to disk/disk (from tape) @ 150MB/s. Motivation: several Tier-1 tape SRMs are late
– Add a July 2006 test – tape/tape @ 150MB/s. RAL will use this as an opportunity to use CASTOR
– Add Tier-1 to Tier-2 throughput tests
• 1 month ago:
– As above, but now a request for an early-February test to the tape robot @ 50MB/s
– RAL unable to take part – too many commitments. May review close to the date, depending on our work schedule. Other sites are in a similar position.
Robot Upgrade
• Old robot
– 6000 slots
– Early-1990s hardware, but still pretty good
– 1 robot arm
– Supports the most recent generation of drives, but end of line
– Still operational, but drives will be migrated shortly and it will close
• New robot (funded by CCLRC): STK SDL8530
– 10,000 slots
– Life expectancy of at least 2 drive generations
– Up to 8 mini-robots mounting tapes – faster, more resilient
– T10K drives/media in FY06: faster drives and bigger tapes
CASTOR Plans
• The ADS software (which drives the tape robot) is home-grown, old and hard to support. Many limiting factors. Not viable for LCG operation.
• The only financially and functionally viable alternative to ADS found was CASTOR
• Following negotiation with CERN, RAL is now a member of the CASTOR project (contributing the SRM) – see Jens's talk
• It makes no sense to operate both dCache and CASTOR. Two SRMs = double trouble**2
CASTOR PLANS
• Carry out the migration from dCache to CASTOR2 over 12 months (a CASTOR1 test system was deployed in the autumn)
• Tension:
– Deploy CASTOR as late as possible to allow proper build and testing
– Deploy CASTOR as soon as possible to allow experiments to test
• CASTOR2 test system scheduled for the end of January
• Will be looking for trial users in May to throughput-test CASTOR
• CASTOR will be used in the SC4 tape throughput test in July
• Provided we are satisfied, the CASTOR SRM will be offered as a production service in September for both disk and tape
dCache PLANS
• dCache is currently our production SRM to both disk and tape
• dCache will remain a production SRM to disk (and part of the tape capacity) until:
– CASTOR is proven to work
– Experiments have had adequate time to migrate data from dCache
• dCache phase-out (probably) in 1H07
• dCache may provide a plan B if CASTOR deployment is late – not desirable, for all kinds of reasons
Disk and CPU
• Hardware delivery scheduled for late February (evaluation nearly complete)
– Modest CPU upgrade (200–266 KSI2K) – modest demand
– Spend more on disk (up to 135TB of additional capacity)
– CPU online early June; disk in July
– Disk moving from external RAID arrays to internal PCI-based RAID in order to reduce cost
• Probably a second round of purchases early in FY06. Size and timing to be agreed at the Tier-1 Board. Capacity available by September.
Other Hardware
• Oracle hardware
– See Gordon's talk for details
– Mini storage area network to meet Oracle's requirements:
• 1 Fibre Channel RAID array
• 4 server hosts
• SAN switch (QLogic SANbox 5200 stackable switch)
• Delivery in February
• Upgrade 10 systems to mirrored disks for critical services
Network Plans
• Currently 2×1Gbit to CERN via UKLIGHT, 1×1Gbit to Lancaster, 2×1Gbit to SJ4 production
• Upgrade to 4×1Gbit to CERN at the end of January (for SC4)
• Upgrade the site edge (lightpath) router to 10Gbit at the end of February
• Attach the Tier-1 at 10Gbit to the edge, via a 10Gbit uplink from a Nortel 5530 switch (a £5K switch stackable with our existing 5510 commodity units) (March)
• Attach the Tier-1 to CERN at 10Gbit early in the SJ5 rollout (early summer)
Machine Rooms
• Extensive planning out to 2010 and beyond to identify growth constraints
• Major power and cooling work (an additional 400kW) in A5 Lower in 2004, funded by CCLRC e-Science, to accommodate growth of the Tier-1 and HPC systems. Sufficient to cool kit up to mid-2006.
• Further cooling expansion just started (>400kW) to meet the profile out to the 2008 hardware delivery for the Tier-1
• Investigating building a new machine room for 2009+ hardware installation
T1/A and SPEC
• Work by George Prassas (hardware support)
• Motivation:
– Investigate whether the batch scaling factors used by T1/A are accurate
– See whether our performance and scaling mirror the published results
– Help form a view about CPUs for future purchases
SPEC CPU2000 - Metrics
• SPECint2000 / SPECint_base2000
– Geometric mean of 12 normalised ratios (one per application) when compiled with aggressive/conservative compiler options
• SPECint_rate2000 / SPECint_rate_base2000
– As above, but for 12 normalised throughput ratios
• The same applies for CFP2000
Warning: Maths!
• If α1, α2, α3, …, αn are real numbers, we define their geometric mean as:

  (α1 · α2 · α3 · … · αn)^(1/n)
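For example, a minimal sketch of how SPEC combines its normalised ratios (the ratios below are invented, not real benchmark results):

```python
from math import prod

def geometric_mean(ratios):
    """(a1*a2*...*an)**(1/n). SPEC combines its 12 per-benchmark
    normalised ratios this way rather than with an arithmetic mean,
    so no single benchmark can dominate the composite score."""
    return prod(ratios) ** (1.0 / len(ratios))

# Four hypothetical normalised ratios:
print(geometric_mean([4.0, 9.0, 4.0, 9.0]))  # ~6.0 (arithmetic mean: 6.5)
```

Note the geometric mean never exceeds the arithmetic mean; an outlier score pulls it up far less.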
Results – SPECint Scaling Factors

[Bar chart: SPECint scaling factors (%) for P3 1.4GHz, P4 2.6GHz, P4 2.8GHz, P4 3.2GHz and Opteron 2GHz, comparing T1/A-measured against published figures.]
Results – SPECint Performance

[Bar chart: SPECint2000 for P3 1.4GHz, P4 2.6GHz, P4 2.8GHz, P4 3.2GHz and Opteron 2GHz. T1/A measured: 442, 663, 722, 887, 888; published: 664, 1033, 1068, 1420, 1317.]
Scaling Comparison
[Chart: scaling comparison (%) across P3 1.4GHz, P4 2.6GHz, P4 2.8GHz, Opteron 2GHz and P4 3.2GHz for LHCb, SPEC T1/A, SPEC published, Metadise g77 and Metadise ifc.]
Conclusions
• We have come an immense distance in 1 year
• LHC Service Challenge work is ever-expanding
• Major effort made to increase utilisation
• Have (for planning purposes) been living in 2006 for most of 2005. Now we have arrived.
• CASTOR deployment will be a considerable challenge
• Hardening the service will be an important part of the work for 1H2006
• Very little time now left and lots to do