
Dave Newbold, University of Bristol 24/6/2003

CMS MC production tools

A lot of work in this area recently!
  Context: PCP03 (100TB+) just started
  Short-term development team ~10 people; core deployment team ~10 people? (incl. UK).

New generation of tools
  Based upon existing distributed toolset: IMPALA, BOSS, RefDB
  Evolution draws from experience gained in DC02
  Not explicitly designed for use on LCG testbed, but intended to operate on Grid later (experience from CMS EDG stress test, etc.).

New umbrella project: OCTOPUS
  Covers all CMS distributed production and Grid tools
    • “Overtly Contrived Toolkit of Previously Unrelated Stuff”?
    • “Oh Crap: Time to Operate Production Uber-Software”
  Formal support system / bug tracking now in place (via Savannah)
  Our worldwide Octopus has more than eight arms…


The problems to solve

The nature of CMS production:
  Highly distributed (~30 sites)
    • Some sites have MUCH more resource (kit, people) than others
  We produce ‘useful data’, so DQM is very important
  The application chain is somewhat complex
    • Different event types require different processing chains
  High-lumi background simulation presents some special problems

Some key issues:
  Communication (~fortnightly VRVS meetings, very useful)
  Documentation, support for installation and use of tools
  Adaptability of production system to local conditions (now easier)
  Real-time data and metadata validation
  Data storage and migration between sites (data is NOT bunged off to CERN)
  ‘Hotspots’ in distributed computing system (CERN + RAL, FNAL)


Core user-side toolset

McRunjob: generic Python local production framework
  Originally a D0 tool – D0 and CMS versions almost merged
  ‘Glues together’ the various stages of a production chain in a consistent and generic way; handles job setup and input / output tracking
  CMS-specific classes are provided to configure our applications.

ImpalaLite: CMS-specific modules in McRunjob
  Core functionality from IMPALA, handling job preparation
  Interfaces to global CMS bookkeeping database (RefDB), data validation, job submission

BOSS: local job submission and tracking
  Provides a uniform interface to the various batch systems (PBS, LSF, BQS, MOP, etc.)
  Based on MySQL job tracking database
  BODE is a web-based front end for local job management
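The idea of a uniform interface over several batch systems, backed by a job tracking database, can be sketched roughly as below. This is not BOSS code: the class names (BatchAdaptor, PBSAdaptor, LSFAdaptor, JobTracker), the table layout and the script name are all hypothetical, an in-memory SQLite database stands in for MySQL, and the submit commands are only recorded, never executed.

```python
import sqlite3
from abc import ABC, abstractmethod


class BatchAdaptor(ABC):
    """Hypothetical adaptor: one subclass per local batch system."""

    @abstractmethod
    def submit_command(self, script):
        """Return the command line that would submit `script`."""


class PBSAdaptor(BatchAdaptor):
    def submit_command(self, script):
        return ["qsub", script]


class LSFAdaptor(BatchAdaptor):
    def submit_command(self, script):
        return ["bsub", script]


class JobTracker:
    """Records every job in one table, whichever batch system runs it."""

    def __init__(self, adaptor):
        self.adaptor = adaptor
        # Assumption: in-memory SQLite stands in for the MySQL tracking database.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE job (id INTEGER PRIMARY KEY, script TEXT, "
            "command TEXT, status TEXT)"
        )

    def submit(self, script):
        command = " ".join(self.adaptor.submit_command(script))
        # Dry run: the command is stored for inspection, not executed.
        cur = self.db.execute(
            "INSERT INTO job (script, command, status) VALUES (?, ?, ?)",
            (script, command, "submitted"),
        )
        self.db.commit()
        return cur.lastrowid

    def set_status(self, job_id, status):
        self.db.execute(
            "UPDATE job SET status = ? WHERE id = ?", (status, job_id)
        )
        self.db.commit()

    def status(self, job_id):
        row = self.db.execute(
            "SELECT status FROM job WHERE id = ?", (job_id,)
        ).fetchone()
        return row[0] if row else None
```

With this shape, moving a site from PBS to LSF would mean passing a different adaptor to JobTracker; the tracking table and any front end reading it stay the same.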


System-side toolset

RefDB: central bookkeeping / metadata database
  Provides (physicist) user interface for requesting data
  Web interface allows users to track their requests, drill down into detailed metadata corresponding to produced data
  Used remotely by ImpalaLite at job preparation time to establish job input parameters, etc.
  Based upon MySQL database at CERN

DAR: packaging of applications
  Very simple way of automatically packaging CMS software components (CMKIN, CMSIM, OSCAR, ORCA) with required libraries, etc.
  Minimal dependence upon site conditions
  Ensures uniformity of application versions, etc., across sites.
  NB: only one current platform for production, Linux RH73
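To illustrate what uniformity of application versions across sites means in practice, the sketch below compares per-site version manifests against a reference. DAR achieves uniformity by shipping the same package everywhere; this check is only an illustration and not part of DAR. The function name, site names and version strings are placeholders.

```python
def find_version_mismatches(reference, site_manifests):
    """Return {site: {component: (expected, found)}} for every deviation.

    `found` is None when a site lacks the component altogether.
    """
    mismatches = {}
    for site, manifest in site_manifests.items():
        diffs = {
            component: (expected, manifest.get(component))
            for component, expected in reference.items()
            if manifest.get(component) != expected
        }
        if diffs:
            mismatches[site] = diffs
    return mismatches


# Placeholder version strings, not real CMS release numbers.
reference = {"CMKIN": "v1", "CMSIM": "v2", "OSCAR": "v3", "ORCA": "v4"}
sites = {
    "site_a": dict(reference),
    "site_b": {"CMKIN": "v1", "CMSIM": "v2-old", "ORCA": "v4"},
}
report = find_version_mismatches(reference, sites)
print(report)
```

Here site_a matches the reference and is absent from the report, while site_b is flagged for an out-of-date CMSIM and a missing OSCAR.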


RefDB web user interface

One drawback: need big laptop screen for browser!


Data handling

dCache: pileup background serving
  Highly challenging from the hardware point of view
    • e.g. need to serve up to ~200MByte/s to the RAL farm during high-lumi digitisation step; cheap disk servers don’t cut it due to ‘random seek’ access pattern
  Some large sites planning to use dCache for background library
    • Each ‘sub-farm’ (workers on one network switch) has its own local disk pool – should provide a scaleable solution without killing network

SRB: wide-area data management
  Subject of some debate in CMS (versus Grid tools)
  SRB is short-term solution, since nothing else works (at 100TB scale) – results from CMS EDG stress test, UK / US work in ’03.
  Supported via UCSD / FNAL and RAL e-science centre
    • RAL will host central MCAT server for PCP03 (thanks RAL).
  Generic interface to RAL datastore in testing phase
  CMSUK responsible for roll-out and support for PCP03
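The pileup-serving figure quoted earlier on this slide (~200MByte/s to one farm) shows why a disk pool per sub-farm helps: the aggregate load divides across the pools and stays local to each switch. A minimal arithmetic sketch follows; the number of sub-farms and the per-pool limit are assumed values for illustration and do not come from the slides.

```python
import math


def per_pool_rate(total_mb_per_s, n_subfarms):
    """Load on each local disk pool if the total is spread evenly."""
    return total_mb_per_s / n_subfarms


def min_subfarms(total_mb_per_s, pool_limit_mb_per_s):
    """Smallest number of pools that keeps each one within its limit."""
    return math.ceil(total_mb_per_s / pool_limit_mb_per_s)


TOTAL_MB_PER_S = 200  # figure from the slide
ASSUMED_SUBFARMS = 8  # hypothetical
ASSUMED_POOL_LIMIT = 30  # hypothetical sustained random-read rate, MByte/s

print(per_pool_rate(TOTAL_MB_PER_S, ASSUMED_SUBFARMS))
print(min_subfarms(TOTAL_MB_PER_S, ASSUMED_POOL_LIMIT))
```

The even split is itself an assumption; real sub-farms differ in size, so the busiest pool would see more than the average.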


Grid integration

Current status
  Toolset designed for distributed use… but not built on Grid middleware
  Reflection of the current scalability of many Grid components?
  EDG stress test taught us a lot about what is possible (now).

Plan: Grid tools to be introduced and tested during PCP03
  The goal: Grid data handling, monitoring, job scheduling for DC04
  Some first targets:
    • BOSS + R-GMA for real-time monitoring
    • replica management to supplement / replace SRB

CMS ‘owned’ testbed (“LCG-0”) in place at several sites
  Yes, yet another testbed
  Based upon LCG pilot + VOMS + R-GMA + Ganglia
  Can test “CMSprod” product, integrating existing toolset with Grid middleware

NB: many crucial ‘local’ issues unaddressed by Grid model – discuss!


The worrying side effects of PCP
