
Page 1: Tier1 Status Report

Martin Bly
RAL, 27/28 April 2005

Page 2: Topics

• Hardware
• Atlas DataStore
• Networking
• Batch services
• Storage
• Service Challenges
• Security

Page 3: Hardware

• Approximately 550 CPU nodes
  – ~980 processors deployed in batch
  – Remainder are service nodes, servers, etc.

• 220TB disk space: ~60 servers, ~120 arrays

• Decommissioning
  – Majority of the P3/600MHz systems decommissioned Jan 05
  – P3/1GHz systems to be decommissioned in July/Aug 05, after commissioning of the Year 4 procurement
  – Babar SUN systems decommissioned by end Feb 05
  – CDF IBM systems decommissioned and sent to Oxford, Liverpool, Glasgow and London

• Next procurement
  – 64-bit AMD or Intel CPU nodes – power, cooling considerations
  – Dual cores possibly too new
  – Infortrend arrays / SATA disks / SCSI connect

• Future
  – Evaluate new disk technologies, dual-core CPUs, etc.

Page 4: Atlas DataStore

• Evaluating new disk systems for staging cache
  – FC-attached SATA arrays
  – Additional 4TB/server, 16TB total
  – Existing IBM/AIX servers

• Tape drives
  – Two additional 9940B drives, FC attached
  – 1 for ADS, 1 for a test CASTOR installation

• Developments
  – Evaluating a test CASTOR installation
  – Stress testing ADS components to prepare for the Service Challenges
  – Planning for a new robot
  – Considering the next generation of tape drives
  – SC4 (2006) requires a step up in cache performance
  – Ancillary network rationalised

Page 5: Networking

• Planned upgrades to the Tier1 production network
  – Started November 04
  – Based on Nortel 5510-48T 'stacks' for large groups of CPU and disk server nodes (up to 8 units/stack, 384 ports)
  – High-speed backbone inter-unit interconnect (40Gb/s bidirectional) within stacks
  – Multiple 1Gb/s uplinks aggregated to form the backbone
    • currently 2 x 1Gb/s, max 4 x 1Gb/s
  – Upgrade to 10Gb/s uplinks and head node as costs fall
  – Uplink configuration with links to separate units within each stack and the head switch will provide resilience (a back-of-envelope sketch follows this list)
  – Ancillary links (APCs, disk arrays) on a separate network

• Connected to UKLight for SC2 (cf. Service Challenges, later)
  – 2 x 1Gb/s links aggregated from the Tier1
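The slide only names the topology, so here is a back-of-envelope model of why the uplink placement matters (our sketch, not a configuration of the Nortel kit): because the aggregated uplinks sit on separate units within a stack, losing one unit degrades the backbone connection rather than severing it.

```python
# Toy model of an aggregated stack uplink. Assumes each uplink is
# attached to a distinct stack unit, as described on the slide, so a
# failed unit costs exactly one uplink. Numbers are from the slide;
# the model itself is an illustration, not vendor behaviour.

def uplink_capacity_gbps(n_links: int, failed_units: int = 0,
                         link_gbps: float = 1.0) -> float:
    """Aggregated uplink capacity when links sit on distinct units."""
    surviving = max(n_links - failed_units, 0)
    return surviving * link_gbps

print(uplink_capacity_gbps(2))                  # today: 2 Gb/s
print(uplink_capacity_gbps(4))                  # maximum: 4 Gb/s
print(uplink_capacity_gbps(4, failed_units=1))  # one unit down: 3 Gb/s
```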

Page 6: Batch Services

• Worker node configuration based on traditional-style batch workers with the LCG configuration on top
  – Running SL 3.0.3 with LCG 2_4_0
  – Provisioning by PXE/Kickstart
  – YUM/Yumit, Yaim, Sure, Nagios, Ganglia…

• All rack-mounted workers are dual purpose, accessed via a single batch system PBS server (Torque).

• The scheduler (Maui) allocates resources for LCG, Babar and the other experiments using fair-share allocations from the User Board (a toy sketch of the idea follows this list).

• Jobs are able to spill into allocations for other experiments, and from one 'side' to the other, when spare capacity is available, to make best use of the capacity.

• Some issues with jobs that use excess memory (memory leaks) not being killed by Maui or Torque – under investigation.
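The slides name the tools (Torque, Maui) but not their configuration. The toy sketch below illustrates only the fair-share idea: experiments below their User Board target get priority, while an over-target experiment can still spill into spare capacity. The targets, usage figures and priority formula are invented for illustration and are not Maui's actual internals.

```python
# Toy fair-share scheduler: rank experiments by how far their recent
# usage sits below an agreed target share, but never leave capacity
# idle. All figures below are invented for illustration.

# Hypothetical User Board targets (fraction of the farm).
TARGETS = {"lcg": 0.50, "babar": 0.30, "other": 0.20}

def priority(experiment: str, recent_usage: dict) -> float:
    """Higher priority the further an experiment is below its target."""
    total = sum(recent_usage.values()) or 1.0
    used_share = recent_usage.get(experiment, 0.0) / total
    return TARGETS[experiment] - used_share

def pick_next(queues: dict, recent_usage: dict) -> str | None:
    """Choose which experiment's queued job runs next.

    Only experiments with queued work are candidates, so an over-target
    experiment still runs ('spills over') when it alone has jobs,
    making best use of spare capacity.
    """
    candidates = [e for e, jobs in queues.items() if jobs]
    if not candidates:
        return None
    return max(candidates, key=lambda e: priority(e, recent_usage))

# Example: babar is under target and has work, so it goes first.
usage = {"lcg": 600.0, "babar": 100.0, "other": 100.0}   # CPU-hours
queues = {"lcg": ["job1"], "babar": ["job2"], "other": []}
print(pick_next(queues, usage))   # -> 'babar'
```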

Page 7: Service Systems

• Service systems migrated to SL 3
  – Mail hub, NIS servers, UIs
  – Babar UIs configured as a DNS triplet

• NFS / data servers
  – Customised RH7.n
    • Driver issues
    • NFS performance of SL 3 uninspiring compared with 7.n
  – dCache systems at SL 3

• LCG service nodes at SL 3, LCG-2_4_0
  – Need to migrate to LCG-2_4_0 or lose work

Page 8: Storage

• Moving from NFS to SRMs for data access
  – dCache successfully deployed in production
    • Used by CMS, ATLAS…
    • See talk by Derek Ross
  – Xrootd deployed in production
    • Used by Babar
    • Two 'redirector' systems handle requests (a minimal sketch follows this list)
      – Selected by DNS pair
      – Hand off requests to the appropriate server
      – Reduces NFS load on the disk servers
    • Load issues with the Objectivity server
      – Two additional servers being commissioned

• Project to look at SL 4 for servers
  – 2.6 kernel, journaling file systems: ext3, XFS
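As a rough illustration of what a redirector does (an invented sketch, not the real xrootd protocol or API): the front-end machine holds no data itself, it simply points each client at a suitable disk server, which is how the scheme takes the NFS-style load off individual servers. The catalogue, server names and load figures below are hypothetical.

```python
# Toy 'redirector': answer each request with the name of a data server
# that holds the file, picking the least-loaded one. Illustration only;
# not the xrootd wire protocol.

# Hypothetical catalogue: which disk servers hold which files.
LOCATIONS = {
    "/babar/run42/file.root": ["disk-serv-01", "disk-serv-07"],
}

# Hypothetical per-server load (e.g. open connections).
LOAD = {"disk-serv-01": 12, "disk-serv-07": 3}

def redirect(path: str) -> str:
    """Return the data server a client should contact for `path`.

    The redirector never touches the data itself; it only hands the
    request off, spreading client load across the disk servers.
    """
    servers = LOCATIONS.get(path)
    if not servers:
        raise FileNotFoundError(path)
    return min(servers, key=lambda s: LOAD.get(s, 0))

print(redirect("/babar/run42/file.root"))   # -> 'disk-serv-07'
```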

Page 9: Service Challenges I

• The Service Challenges are a program of infrastructure trials designed to test the LCG fabric at increasing levels of stress/capacity in the run-up to LHC operation.

• SC2 – March/April 05:
  – Aim: T0 -> T1s aggregate of >500MB/s sustained for 2 weeks
  – 2Gb/s link via UKLight to CERN
  – RAL sustained 80MB/s for two weeks to a dedicated (non-production) dCache
    • 11/13 gridftp servers
    • Limited by issues with the network
  – Internal testing reached 3.5Gb/s (~400MB/s) aggregate, disk to disk (see the unit arithmetic after this list)
  – Aggregate to the 7 participating sites: ~650MB/s

• SC3 – July 05 – the Tier1 expects:
  – CERN -> RAL at 150MB/s sustained for 1 month
  – T2s -> RAL (and RAL -> T2s?) at a yet-to-be-defined rate
    • Lancaster, Imperial…
    • Some on UKLight, some via SJ4
  – Production phase Sept-Dec 05
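The challenge figures mix network units (Gb/s) and storage units (MB/s). A small helper, assuming 1 Gb/s = 125 MB/s and ignoring protocol overheads, makes the arithmetic explicit; the rates are from the slide, the function names are ours.

```python
# Unit arithmetic for the Service Challenge targets quoted above.

def gbps_to_mbs(gbps: float) -> float:
    """Convert gigabits/s to megabytes/s (8 bits per byte)."""
    return gbps * 1000 / 8

def volume_tb(rate_mbs: float, days: float) -> float:
    """Total data moved at a sustained rate, in terabytes."""
    return rate_mbs * 86400 * days / 1e6

print(gbps_to_mbs(3.5))     # ~437 MB/s raw, ~400 MB/s after overheads
print(volume_tb(500, 14))   # SC2 aggregate target: ~605 TB in two weeks
print(volume_tb(150, 30))   # SC3 CERN -> RAL: ~389 TB in a month
```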

Page 10: Service Challenges II

• SC4 – April 06
  – CERN-RAL T0-T1 expected at 220MB/s sustained for one month
  – RAL expects T2-T1 traffic at N x 100MB/s simultaneously

• June 06 – Sept 06: production phase

• Longer term:
  – There is some as-yet-undefined T1 -> T1 capacity needed. This could add 50 to 100MB/s.
  – CMS production will require 800MB/s combined and sustained from the batch workers to the storage systems within the Tier1.
  – At some point there will be a sustained double-rate test – 440MB/s T0-T1 plus whatever is then needed for T2-T1.

• It is clear that the Tier1 will be able to keep a significant part of a 10Gb/s link busy continuously, probably from late 2006 (a rough check follows this list).
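A rough check of that closing claim, using the rates quoted above; the helper and the choice of example T2 load are ours.

```python
# What fraction of a 10Gb/s link do the quoted 2006 rates occupy?

def link_fraction(rate_mbs: float, link_gbps: float = 10.0) -> float:
    """Fraction of a link consumed by a sustained MB/s rate."""
    return (rate_mbs * 8 / 1000) / link_gbps

# Double-rate T0-T1 test alone:
print(link_fraction(440))             # ~0.35 of a 10Gb/s link
# Add, say, two simultaneous 100MB/s T2 streams (example assumption):
print(link_fraction(440 + 2 * 100))   # ~0.51
```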

Page 11: Security

• The Badguys™ are out there
  – Users are vulnerable to losing authentication data anywhere
    • Still some less-than-ideal practices
  – All local privilege escalation exploits must be treated as high-priority must-fixes
  – Continuing program of locking down and hardening exposed services and systems
  – You can only ever be more secure
    • See talk by Romain Wartel