Dave Newbold, University of Bristol 24/6/2003
CMS MC production tools
  A lot of work in this area recently!
    Context: PCP03 (100TB+) just started
    Short-term development team ~10 people; core deployment team ~10 people? (incl. UK)
  New generation of tools
    Based upon existing distributed toolset: IMPALA, BOSS, RefDB
    Evolution draws from experience gained in DC02
    Not explicitly designed for use on LCG testbed, but intended to operate on Grid later (experience from CMS EDG stress test, etc.)
  New umbrella project: OCTOPUS
    Covers all CMS distributed production and Grid tools
      • “Overtly Contrived Toolkit of Previously Unrelated Stuff”?
      • “Oh Crap: Time to Operate Production Uber-Software”
    Formal support system / bug tracking now in place (via Savannah)
    Our worldwide Octopus has more than eight arms…

The problems to solve
  The nature of CMS production:
    Highly distributed (~30 sites)
      • Some sites have MUCH more resource (kit, people) than others
    We produce ‘useful data’, so DQM is very important
    The application chain is somewhat complex
      • Different event types require different processing chains
    High-lumi background simulation presents some special problems
  Some key issues:
    Communication (~fortnightly VRVS meetings, very useful)
    Documentation, support for installation and use of tools
    Adaptability of production system to local conditions (now easier)
    Real-time data and metadata validation
    Data storage and migration between sites (data is NOT bunged off to CERN)
    ‘Hotspots’ in distributed computing system (CERN + RAL, FNAL)

Core user-side toolset
  McRunjob: generic Python local production framework
    Originally a D0 tool; D0 and CMS versions almost merged
    ‘Glues together’ the various stages of a production chain in a consistent and generic way; handles job setup and input / output tracking
    CMS-specific classes are provided to configure our applications
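To make the ‘glue’ idea concrete, here is a minimal sketch of a chain that runs its stages in order and records what each one consumed and produced. It is not the McRunjob API: the `Stage` and `Chain` classes, the two stage functions and the file suffixes are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    """One step of a production chain (hypothetical, not a McRunjob class)."""
    name: str
    run: Callable[[Dict[str, str]], Dict[str, str]]


@dataclass
class Chain:
    """Runs stages in order and logs each stage's inputs and outputs."""
    stages: List[Stage]
    log: List[dict] = field(default_factory=list)

    def execute(self, params: Dict[str, str]) -> Dict[str, str]:
        data = dict(params)
        for stage in self.stages:
            inputs = dict(data)           # snapshot of what the stage sees
            outputs = stage.run(inputs)   # the stage returns only its new outputs
            self.log.append({"stage": stage.name, "inputs": inputs, "outputs": outputs})
            data.update(outputs)          # outputs become inputs of later stages
        return data


# Illustrative stages: the file names and suffixes are made up.
def generate(p: Dict[str, str]) -> Dict[str, str]:
    return {"gen_file": f"{p['dataset']}_{p['run']}.gen"}


def simulate(p: Dict[str, str]) -> Dict[str, str]:
    return {"sim_file": p["gen_file"].replace(".gen", ".sim")}


chain = Chain([Stage("generation", generate), Stage("simulation", simulate)])
result = chain.execute({"dataset": "example", "run": "1"})
```

Because each stage only sees a dictionary of parameters and returns a dictionary of outputs, stages can be reordered or swapped without the chain itself knowing anything about the applications.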
  ImpalaLite: CMS-specific modules in McRunjob
    Core functionality from IMPALA, handling job preparation
    Interfaces to the global CMS bookkeeping database (RefDB), data validation, job submission
  BOSS: local job submission and tracking
    Provides a uniform interface to the various batch systems (PBS, LSF, BQS, MOP, etc.)
    Based on MySQL job tracking database
    BODE is a web-based front end for local job management
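A rough sketch of what ‘uniform interface plus tracking database’ can look like: one adapter class per batch system, and a small job table that records every submission. This is not BOSS code. The class names and table layout are assumptions, an in-memory SQLite database stands in for MySQL, and the sketch only builds the submit command (`qsub` for PBS, `bsub` for LSF) without running it.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import List, Optional


class BatchSystem(ABC):
    """Hypothetical scheduler adapter; not the BOSS API."""
    name: str

    @abstractmethod
    def submit_command(self, script: str) -> List[str]:
        """Return the command line that would submit `script`."""


class PBS(BatchSystem):
    name = "pbs"

    def submit_command(self, script: str) -> List[str]:
        return ["qsub", script]


class LSF(BatchSystem):
    name = "lsf"

    def submit_command(self, script: str) -> List[str]:
        return ["bsub", script]


class JobTracker:
    """Job table in SQLite (stand-in for the MySQL tracking database)."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE jobs (id INTEGER PRIMARY KEY, scheduler TEXT, "
            "command TEXT, status TEXT)"
        )

    def register(self, scheduler: BatchSystem, script: str) -> int:
        # Record the job; a real system would also execute the command.
        command = " ".join(scheduler.submit_command(script))
        cur = self.db.execute(
            "INSERT INTO jobs (scheduler, command, status) VALUES (?, ?, ?)",
            (scheduler.name, command, "prepared"),
        )
        self.db.commit()
        return cur.lastrowid

    def status(self, job_id: int) -> Optional[str]:
        row = self.db.execute(
            "SELECT status FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        return row[0] if row else None


tracker = JobTracker()
pbs_job = tracker.register(PBS(), "job.sh")
lsf_job = tracker.register(LSF(), "job.sh")
```

The calling code is identical for both schedulers; only the adapter differs, which is the point of a uniform interface.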

System-side toolset
  RefDB: central bookkeeping / metadata database
    Provides (physicist) user interface for requesting data
    Web interface allows users to track their requests, drill down into detailed metadata corresponding to produced data
    Used remotely by ImpalaLite at job preparation time to establish job input parameters, etc.
    Based upon MySQL database at CERN
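As an illustration of job preparation driven by a central request record, the sketch below splits one request into per-job input parameters. Everything here is assumed for the example: the `REQUESTS` dictionary stands in for the remote database, and the field names, the request number and the splitting rule are invented rather than taken from RefDB or ImpalaLite.

```python
from typing import Dict, Iterator

# Stand-in for the central bookkeeping database: request id -> metadata.
# All field names and values are invented for illustration.
REQUESTS = {
    42: {"dataset": "example_dataset", "total_events": 1000, "events_per_job": 300},
}


def job_parameters(request_id: int) -> Iterator[Dict[str, object]]:
    """Yield one parameter set per job until the request is covered."""
    request = REQUESTS[request_id]
    total = request["total_events"]
    per_job = request["events_per_job"]
    first_event = 0
    job_number = 0
    while first_event < total:
        # The last job may be shorter than the others.
        n_events = min(per_job, total - first_event)
        yield {
            "request": request_id,
            "dataset": request["dataset"],
            "job": job_number,
            "first_event": first_event,
            "n_events": n_events,
        }
        first_event += n_events
        job_number += 1


jobs = list(job_parameters(42))
```

Keeping the request definition in one central place means every site derives its job parameters from the same record, rather than from locally edited configuration.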
  DAR: packaging of applications
    Very simple way of automatically packaging CMS software components (CMKIN, CMSIM, OSCAR, ORCA) with required libraries, etc.
    Minimal dependence upon site conditions
    Ensures uniformity of application versions, etc., across sites
    NB: only one current platform for production, Linux RH 7.3
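One way to picture ‘uniform versions across sites’ is a self-contained archive whose digest every site can compare. The sketch below is not how DAR works internally. It is an assumed approach using only the Python standard library, with made-up file names and contents.

```python
import hashlib
import io
import tarfile
from typing import Dict


def build_package(files: Dict[str, bytes]) -> bytes:
    """Bundle an application and its libraries into one tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # Sorted names and default TarInfo metadata keep the archive
        # byte-identical for identical content.
        for name in sorted(files):
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def package_digest(package: bytes) -> str:
    """SHA-256 of the archive; equal digests mean identical packages."""
    return hashlib.sha256(package).hexdigest()


# Made-up release content, for illustration only.
release = {"bin/app": b"application binary", "lib/libdep.so": b"required library"}

site_a = package_digest(build_package(release))
# Same content supplied in a different order gives the same digest.
site_b = package_digest(build_package(dict(reversed(list(release.items())))))
# A site with a different library version is detected.
patched = package_digest(build_package({**release, "lib/libdep.so": b"different library"}))
```

Shipping the required libraries inside the package is what keeps the dependence on site conditions small; the digest comparison is just a cheap check that two sites really run the same thing.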

RefDB web user interface
  One drawback: need big laptop screen for browser!

Data handling
  dCache: pileup background serving
    Highly challenging from the hardware point of view
      • e.g. need to serve up to ~200 MByte/s to the RAL farm during high-lumi digitisation step; cheap disk servers don’t cut it due to ‘random seek’ access pattern
    Some large sites planning to use dCache for background library
      • Each ‘sub-farm’ (workers on one network switch) has its own local disk pool; should provide a scalable solution without killing network
  SRB: wide-area data management
    Subject of some debate in CMS (versus Grid tools)
    SRB is short-term solution, since nothing else works (at 100TB scale); results from CMS EDG stress test, UK / US work in ‘03
    Supported via UCSD / FNAL and RAL e-Science Centre
      • RAL will host central MCAT server for PCP03 (thanks RAL)
    Generic interface to RAL datastore in testing phase
    CMSUK responsible for roll-out and support for PCP03

Grid integration
  Current status
    Toolset designed for distributed use… but not built on Grid middleware
    Reflection of the current scalability of many Grid components?
    EDG stress test taught us a lot about what is possible (now)
  Plan: Grid tools to be introduced and tested during PCP03
    The goal: Grid data handling, monitoring, job scheduling for DC04
    Some first targets:
      • BOSS + R-GMA for real-time monitoring
      • Replica management to supplement / replace SRB
  CMS ‘owned’ testbed (“LCG-0”) in place at several sites
    Yes, yet another testbed
    Based upon LCG pilot + VOMS + R-GMA + Ganglia
    Can test “CMSprod” product, integrating existing toolset with Grid middleware
  NB: many crucial ‘local’ issues unaddressed by Grid model. Discuss!
The worrying side effects of PCP