TRANSCRIPT
Tier1 Status Report
Andrew Sansum, GridPP15
12 January 2006
Overview
• Usage (make the Oversight Committee happy)
• Service Challenge plans
• Robot upgrade and CASTOR plans
• Other hardware upgrades
• SPEC benchmark (if time)
Utilisation Concerns
• July: Oversight Committee concerned that the Tier-1 was under-utilised – procurements delayed
• Team perception was that the farm was quite well occupied and demand increasing
• Needed to:
– Understand the mismatch between CPU use and occupancy
– Maximise the capacity available (minimise downtime etc.)
– Address reliability issues – buy more mirror disks, investigate Linux HA etc., move to a limited on-call system ...
– Maximise the number of running experiments
January-July
[Stacked bar chart: CPU use (KSI2K * CPU months) per month, Jan–Dec, broken down by experiment (Alice, Atlas, Babar, CDF, CMS, DZERO, LHCB, UKQCD, Minos, H1, Zeus, SNO, Other). Nominal capacity: 796 KSI2K]
Efficiency Monitoring
• Work by Matt Hodges: http://wiki.gridpp.ac.uk/wiki/RAL_Tier1_PBS_Efficiencies
• Automated post-processing of PBS logfiles each month, measuring JobCPUEfficiency = [CPU time] / [Elapsed time] (sketched below)
• Important tool to improve farm throughput
• Need to address low-efficiency jobs – they occupy job slots that could be used by other work.
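A minimal sketch of the efficiency calculation described above. The function names and record layout are illustrative assumptions – real use would parse the PBS accounting logfiles:

    # Illustrative sketch of JobCPUEfficiency = CPU time / elapsed time.
    # Job records are hypothetical (cpu_seconds, elapsed_seconds) pairs.

    def job_cpu_efficiency(cpu_seconds, elapsed_seconds):
        """Efficiency of a single job."""
        if elapsed_seconds <= 0:
            return 0.0
        return cpu_seconds / elapsed_seconds

    def vo_efficiency(jobs):
        """Aggregate efficiency over a list of (cpu, elapsed) records."""
        total_cpu = sum(cpu for cpu, _ in jobs)
        total_elapsed = sum(elapsed for _, elapsed in jobs)
        return job_cpu_efficiency(total_cpu, total_elapsed)

    # Example: one CPU-bound job and one job stuck waiting on I/O
    print(vo_efficiency([(3500, 3600), (100, 3600)]))  # 0.5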
Overview
Experiment September October November December
ATLAS 0.88 0.92 0.81 0.93
BaBar 0.94 0.87 0.90 0.90
BioMed 0.97 0.97 0.95 0.55
CMS 0.51 0.63 0.28 0.67
DTeam 0.00 0.03 0.06 0.47
D0 0.74 0.89 0.92 0.92
H1 0.82 0.90 0.87 0.87
LHCb 0.98 0.95 0.91 0.92
MINOS 0.97 0.87 0.26 0.68
Pheno 0.97 0.70
SNO 0.98 0.97 0.98 0.86
Theory 0.99 0.99 0.99 0.91
Zeus 0.97 0.94 0.98 0.98
Others 0.98 0.98 0.97 0.98
Babar – December 2005
• Stuck jobs clocking up wallclock time but no CPU – probably system staff paused the Babar job queue to fix a disk server
• Typical analysis jobs
Minos – December 2005
• NFS server overload
CMS – November 2005
• dCache server overload – 3Gb/s I/O to dCache
Outstanding Scheduling Problems
• How to schedule large-memory jobs. A large-memory job can occupy two job slots on an older 1GB-memory system (see the sketch after this list).
– Is it better to always run whatever work you can, even at the expense of possible future job starts?
– Or is it better to limit large-memory job starts and keep small-memory systems free for small-memory work that might not turn up?
• No inter-VO scheduling at present (e.g. give 40% to LHC)
• No intra-VO scheduling at present (e.g. give production 80%)
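Purely as an illustration of the trade-off in the first bullet (not the actual MAUI configuration – the node model and policy names are invented):

    # Toy model: a 1GB node has 2 job slots and a large-memory job
    # consumes both of them.

    def can_start(job_mem_gb, free_slots, policy):
        large = job_mem_gb > 0.5           # "large" relative to 0.5GB per slot
        slots_needed = 2 if large else 1
        if free_slots < slots_needed:
            return False
        if policy == "greedy":             # always run whatever fits now
            return True
        if policy == "reserve":            # keep headroom for small jobs
            return not large or free_slots > 2
        raise ValueError(policy)

    # greedy maximises occupancy now; reserve protects slots for
    # small-memory work that may (or may not) turn up later.
    print(can_start(2.0, 2, "greedy"))   # True
    print(can_start(2.0, 2, "reserve"))  # False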
XMASS Availability
• Usually the farm runs unattended over Christmas. A major failure early in the period can severely impact December CPU delivery [e.g. the power failure last XMASS]
• Martin Bly and Steve Traylen worked several days of overtime over the period to fix faults.
• Fantastic availability and good demand led to one of the most productive Christmases ever.
XMASS Running
[Charts of farm activity over the Christmas period]
Have we succeeded?
• Significant improvement for second half of 2005.
2005 Occupancy
[Charts: 2005 farm occupancy overview]
Scheduler Monitoring
• Work by Matt Hodges: http://wiki.gridpp.ac.uk/wiki/RAL_Tier1_PBS_Scheduling
• Now we have heavy demand!!! – need to monitor the MAUI scheduler.
• Put MAUI scheduling data into ganglia.
• Gain insight into scheduling – help experiments understand why their jobs don't start ...
• Not an exact science – the UB over-allocates CPU shares – we simply use this data to calculate target shares and schedule over a relatively short period of time (9 days, decaying by 0.7 per day); a sketch of the calculation follows below.
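A minimal sketch of the decayed-usage bookkeeping described in the last bullet, in the spirit of MAUI's fair-share settings. The window (9 days) and decay factor (0.7) are from the slide; the data layout and function names are assumptions:

    # Decayed fair-share usage over a 9-day window, 0.7 decay per day.

    DECAY = 0.7
    WINDOW = 9

    def decayed_usage(daily_usage):
        """daily_usage[0] is today, daily_usage[1] yesterday, and so on."""
        return sum(u * DECAY**age for age, u in enumerate(daily_usage[:WINDOW]))

    def share_vs_target(usage_by_vo, target_by_vo):
        """Ratio of each VO's decayed share of the farm to its target share."""
        decayed = {vo: decayed_usage(u) for vo, u in usage_by_vo.items()}
        total = sum(decayed.values()) or 1.0
        return {vo: (decayed[vo] / total) / target_by_vo[vo]
                for vo in usage_by_vo}

    # A ratio above 1 means a VO is over target, so the scheduler can
    # deprioritise its queued jobs relative to under-target VOs.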
Target Shares
[Charts: January target shares as implemented, actual shares achieved, and usage/target ratios]
Service Challenges
• An ever-expanding project!
• 12 months ago:
– SC2 throughput, March 2005, disk/disk
– SC3 throughput, July 2005, disk/tape
– SC4 throughput, April 2006, tape/tape
• 6 months ago:
– As above, but now add service phases and Tier-2s (a significant part of Matt and Catalin's work: VO boxes, FTS, LFC, etc.)
Service Challenges
• 2 months ago – as above, but add:
– SC3 throughput test 2 (16th January, 1 week, disk @ 100MB/s)
– Change the April 2006 test to disk/disk (from tape) @ 150MB/s. Motivation: several Tier-1 tape SRMs are late.
– Add a July 2006 test – tape/tape @ 150MB/s. RAL will use this as an opportunity to use CASTOR.
– Add Tier-1 to Tier-2 throughput tests.
• 1 month ago:
– As above, but now a request for an early February test to the tape robot @ 50MB/s
– RAL unable to take part – too many commitments. May review closer to the date – depends on our work schedule. Other sites are in a similar position.
Robot Upgrade
• Old robot:
– 6,000 slots
– Early-1990s hardware – but still pretty good
– 1 robot arm
– Supports the most recent generation of drives, but end of line
– Still operational, but drives will migrate shortly and it will close
• New robot (funded by CCLRC): STK SL8500
– 10,000 slots
– Life expectancy of at least 2 drive generations
– Up to 8 mini-robots mounting tapes – faster and more resilient
– T10K drives/media in FY06: faster drives and bigger tapes
CASTOR Plans
• The ADS software (which drives the tape robot) is home-grown, old and hard to support. Many limiting factors. Not viable for LCG operation.
• The only financially and functionally viable alternative to ADS found was CASTOR.
• Following negotiation with CERN, RAL is now a member of the CASTOR project (contributing the SRM) – see Jens's talk.
• It makes no sense to operate both dCache and CASTOR. Two SRMs = double trouble**2.
CASTOR Plans
• Carry out the migration from dCache to CASTOR2 over 12 months (a CASTOR1 test system was deployed in autumn).
• Tension:
– Deploy CASTOR as late as possible to allow proper build/testing
– Deploy CASTOR as soon as possible to allow experiments to test
• CASTOR2 test system scheduled for end of January
• Will be looking for trial users in May to throughput-test CASTOR
• CASTOR will be used in the SC4 tape throughput test in July.
• Provided we are satisfied, the CASTOR SRM will be offered as a production service in September, for both disk and tape.
dCache Plans
• dCache is currently our production SRM to both disk and tape.
• dCache will remain a production SRM to disk (and part of the tape capacity):
– Until CASTOR is proven to work
– And experiments have had adequate time to migrate data from dCache
• dCache phase-out (probably) in 1H07
• dCache may provide a plan B if CASTOR deployment is late – not desirable, for all kinds of reasons.
Disk and CPU
• Hardware delivery scheduled for late February (evaluation nearly complete):
– Modest CPU upgrade (200-266 KSI2K) – modest demand
– Spend more on disk (up to 135TB of additional capacity)
– CPU online early June; disk in July
– Disk moving from external RAID arrays to internal PCI-based RAID in order to reduce cost
• Probably a second round of purchases early in FY06. Size and timing to be agreed at the Tier-1 Board. Capacity available by September.
Other Hardware
• Oracle hardware
– See Gordon's talk for details
– Mini storage area network to meet Oracle requirements:
• 1 Fibre Channel RAID array
• 4 server hosts
• SAN switch (QLogic SANbox 5200 stackable switch)
• Delivery in February
• Upgrade 10 systems to mirrored disks for critical services.
Network Plans
• Currently 2*1Gbit to CERN via UKLIGHT; 1*1Gbit to Lancaster; 2*1Gbit to SJ4 production.
• Upgrade to 4*1Gbit to CERN at end of January (for SC4)
• Upgrade the site edge (lightpath) router to 10Gbit at end of February
• Attach the Tier-1 to the edge at 10Gbit, via a 10Gbit uplink from a Nortel 5530 switch (a £5K switch, stackable with our existing 5510 commodity units) (March)
• Attach the Tier-1 to CERN at 10Gbit early in the SJ5 rollout (early summer).
Machine Rooms
• Extensive planning out to 2010 and beyond to identify growth constraints
• Major power and cooling work (an additional 400KW) in A5 Lower in 2004, funded by CCLRC e-Science, to accommodate the growth of the Tier-1 and HPC systems. Sufficient to cool kit arriving up to mid-2006.
• Further cooling expansion has just started (>400KW) to meet the profile out to the 2008 hardware delivery for the Tier-1
• Investigating building a new machine room for 2009+ hardware installation.
T1/A and SPEC
• Work by George Prassas (hardware support)
• Motivation:
– Investigate whether the batch scaling factors used by T1/A were accurate
– Investigate whether our performance/scaling mirrored the published results
– Help form a view about CPUs for future purchases
SPEC CPU2000 - Metrics
• SPECint2000 / SPECint_base2000
– Geometric mean of 12 normalised ratios (one for each app) when compiled with aggressive/conservative compiler options
• SPECint_rate2000 / SPECint_rate_base2000
– Same as above, but for 12 normalised throughput ratios
• The same applies for CFP2000
Warning: Maths!
• If α1, α2, α3, ..., αn are real numbers, we define their geometric mean as (computed in the sketch below):

(α1 * α2 * α3 * ... * αn)^(1/n) = exp( (1/n) * Σ ln(αi) )
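For concreteness, a small sketch of computing a SPECint-style score as the geometric mean of per-benchmark ratios; the 12 ratio values below are invented for illustration:

    import math

    def geometric_mean(values):
        """(v1 * v2 * ... * vn)^(1/n), computed via logs for stability."""
        return math.exp(sum(math.log(v) for v in values) / len(values))

    # 12 hypothetical per-benchmark SPECratios (one per CINT2000 app);
    # the reported SPECint2000 score is their geometric mean.
    ratios = [820, 910, 750, 880, 1030, 690, 960, 810, 780, 990, 840, 900]
    print(round(geometric_mean(ratios)))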
Results – SPECint Scaling
[Bar chart: scaling factors (%) for the P3 1.4, P4 2.6, P4 2.8, P4 3.2 and Opteron 2GHz systems, comparing T1/A measurements against published figures]
Results – SPECint Performance
[Bar chart of SPECint2000 scores, T1/A measured vs published:]

System     T1/A   Published
P3 1.4     442    664
P4 2.6     663    1033
P4 2.8     722    1068
P4 3.2     887    1420
Opt 2GHz   888    1317
Scaling Comparison
[Bar chart: scaling (%) across P3 1.4, P4 2.6, P4 2.8, Opt 2GHz and P4 3.2, comparing LHCb SPEC, T1/A SPEC, published figures, and Metadise built with g77 and ifc]
Conclusions
• We have come an immense distance in 1 year
• LHC service challenge work is ever expanding
• Major effort to increase utilisation
• Have (for planning purposes) been living in 2006 for most of 2005. Now we have arrived.
• CASTOR deployment will be a considerable challenge.
• Hardening the service will be an important part of the work for 1H2006.
• Very little time now left and lots to do.