Operation of CASTOR at RAL
Tier1 Review November 2007
Bonny Strong
History
• Jan 2005: Castor1 installed at RAL for evaluation
• Jan 2006: Castor2 first available to external institutes; installation begun at RAL
• Aug 2006: Castor2 running after resolving problems for deployment outside CERN, version 2.1.0
• Sep 2006: CSA06 ran successfully
• Mar 2007: Upgrade to version 2.1.2
– Major problems and instability causing frequent meltdowns
• Sep 2007: Deployed separate instances per VO, CASTOR version 2.1.3
– Much better stability
Production Architecture
[Architecture diagram: separate stager instances for CMS, Atlas, LHCb, and a combined Repack/small-user instance. Each instance runs its own stager, DLF, and LSF, backed by Oracle stager and DLF databases, with its own tape servers and disk pool (pools shown: 1 diskserver/9 TB, 22 diskservers/133 TB, 7 diskservers/48 TB, 20 diskservers/144 TB). Shared services: two NameServers + vmgr with an Oracle NS+vmgr database, repack with an Oracle repack database.]
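The per-VO split above can be pictured as a client-side routing table: each VO contacts its own stager instance, while the name server and vmgr are shared. A minimal sketch (all hostnames here are hypothetical placeholders, not the actual RAL endpoints):

```python
# Illustrative sketch of the per-VO CASTOR layout described above.
# Hostnames are hypothetical placeholders, not real RAL hosts.
INSTANCES = {
    "cms":   "cms-stager.example.ac.uk",
    "atlas": "atlas-stager.example.ac.uk",
    "lhcb":  "lhcb-stager.example.ac.uk",
    "small": "small-stager.example.ac.uk",  # combined repack/small-user instance
}

# Services shared by every instance (one logical endpoint in this sketch).
SHARED = {"nameserver": "ns.example.ac.uk", "vmgr": "ns.example.ac.uk"}

def stager_for(vo: str) -> str:
    """Return the stager host a client of the given VO should contact."""
    try:
        return INSTANCES[vo]
    except KeyError:
        raise ValueError(f"no dedicated stager instance for VO {vo!r}")
```

The point of the layout is isolation: a meltdown of one VO's stager, LSF, or Oracle database no longer takes down the other experiments, which is what drove the much better stability seen after Sep 2007.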
Test Architecture
[Architecture diagram: Development and Preproduction instances, each with stager, DLF, and LSF (DLF and LSF combined on one node in one instance), one diskserver of variable capacity, a tape server, and Oracle stager and DLF databases. Shared services: NameServer + vmgr with an Oracle NS+vmgr database, repack with an Oracle repack database.]
Certification Testbed
[Architecture diagram: a single instance with stager, DLF, and LSF, one diskserver of variable capacity, a tape server, and Oracle stager and DLF databases. Shared services: NameServer + vmgr with an Oracle NS+vmgr database, repack with an Oracle repack database.]
Operational Management
• Change management
• System manager on duty
• Helpdesk
• Monitoring: Nagios, Ganglia, CASTOR-specific
• Team:
– Bonny Strong – service manager
– Shaun de Witt – developer
– Tim Folkes (about 50%) – tape operations
– Chris Kruk – LSF manager, diskservers, sys admin
– Cheney Ketley (50%) – sys admin, LSF backup
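The CASTOR-specific monitoring mentioned above can be built from small Nagios-style probes. A minimal sketch of one such probe, a TCP reachability check following the standard Nagios exit-code convention (0 = OK, 2 = CRITICAL); the host and port in the usage comment are illustrative, not the actual RAL daemon endpoints:

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

def check_tcp(host: str, port: int, timeout: float = 5.0):
    """Nagios-style TCP probe: returns (exit_code, status_message)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK: {host}:{port} is accepting connections"
    except OSError as exc:
        return CRITICAL, f"CRITICAL: {host}:{port} unreachable ({exc})"

# Usage (hypothetical host/port):
#   code, message = check_tcp("cms-stager.example.ac.uk", 9002)
#   print(message); raise SystemExit(code)
```

A real deployment would wrap checks like this for each stager, DLF, and LSF daemon and register them in the Nagios configuration alongside the generic host checks.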
Working with VOs
• Weekly meeting with all VOs to discuss issues and plans
• Meetings individually with VOs to model data flow and plan CASTOR configuration
Atlas Data Flow Model
[Data flow diagram: RAW data arrives from T0 into D0T1 storage; strip input (ESD1/AODm1/TAG) is exchanged with partner T1s into D1T1; AODm2/TAG and ESD2/AODm2/TAG outputs go to other T1s and to T2s; simulated RAW and stripped simulation data (ESD/AODm/TAG) come in from T2s; the local farm processes RAW. Storage classes shown: D0T1, D1T0, D1T1, D0T0.]
Key Improvements Planned Over Next 6 Months
• Resilience
– Oracle clusters (RAC) with Dataguard DB replication
– Redundant stagers for each VO
– Encouraging development for additional redundancy
• Monitoring improvements
• Development of administrative tools
• Deployment and configuration management procedures
• Disaster recovery documentation and testing
SRMv2
• In production at RAL by 1 Dec 2007
• Separate endpoints for each VO
• Front-end clusters for redundancy
• Will run in parallel with SRMv1 until VOs approve v1 decommissioning
Major Problems and Issues
• Software reliability
• Heavy operational cost
• CERN-specific development
• Repack delayed
• Lack of administrative tools
• Performance to tape
• Staffing for 24/7 coverage
Working with CERN
• External institutes conference call every 2 weeks to review development progress and operational issues
• Twice yearly face-to-face meetings of external institutes
• Once monthly deployment conference call to plan development priorities
• Management-level meetings over the last year to address problems of CASTOR for Tier1s:
– Improved release procedures and planning
– More involvement of Tier1s in development planning
– Improved testing with development of certification testbed and test suite at RAL
Conclusions
• Has not been a smooth road
• Have taken, or plan, significant steps to overcome problems
• Major concerns for 2008:
– 24/7 operation
– Improving tape performance
• Expect system reliability to be much better in 2008 than 2007