crab: a tool for cms distributed analysis in grid environment

INFSO-RI-508833 Enabling Grids for E- sciencE www.eu-egee.org CRAB: a tool for CMS distributed analysis in grid environment Federica Fanzago INFN PADOVA


DESCRIPTION

Federica Fanzago, INFN PADOVA. CRAB: a tool for CMS distributed analysis in grid environment. Introduction. CMS “Compact Muon Solenoid” is one of the four particle physics experiments that will collect data at the LHC “Large Hadron Collider” at CERN, starting in 2007.

TRANSCRIPT

Page 1: CRAB: a tool for CMS distributed analysis in grid environment

INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

CRAB: a tool for CMS distributed analysis in grid environment

Federica Fanzago INFN PADOVA

Page 2: CRAB: a tool for CMS distributed analysis in grid environment

Federica Fanzago INFN-PADOVA EGEE User Forum 01 March 2006 2

Introduction

• CMS “Compact Muon Solenoid” is one of the four particle physics experiments that will collect data at the LHC “Large Hadron Collider” at CERN, starting in 2007

• CMS will produce a large amount of data (events) that should be made available for analysis to physicists distributed world-wide

• CMS will produce
– ~2 PB of events/year (assumes startup luminosity 2x10^33 cm^-2 s^-1)

• All events will be stored in files
– O(10^6) files/year

• Files will be grouped in Fileblocks
– O(10^3) Fileblocks/year

• Fileblocks will be grouped in Datasets
– O(10^3) Datasets (total after 10 years of CMS)
– 0.1-100 TB

“Bunch crossing” every 25 ns

100 “triggers” per second

Each triggered event is ~1 MB in size
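The figures quoted on this slide can be combined into a few order-of-magnitude estimates. The sketch below is only illustrative arithmetic: it assumes decimal units (1 PB = 10^15 bytes) and treats the O(...) values as exact, and the function and key names are hypothetical, not part of any CMS tool.

```python
# Back-of-the-envelope figures derived only from the numbers quoted on the slide.
# Assumption: decimal units (1 PB = 10**15 bytes, 1 MB = 10**6 bytes).

PB = 10**15
MB = 10**6

DATA_PER_YEAR = 2 * PB        # ~2 PB of events per year
FILES_PER_YEAR = 10**6        # O(10^6) files per year
FILEBLOCKS_PER_YEAR = 10**3   # O(10^3) Fileblocks per year
TRIGGER_RATE_HZ = 100         # 100 triggers per second
EVENT_SIZE = 1 * MB           # ~1 MB per triggered event


def order_of_magnitude_summary():
    """Return rough derived quantities (hypothetical helper for illustration)."""
    return {
        "bytes_per_file": DATA_PER_YEAR // FILES_PER_YEAR,
        "files_per_fileblock": FILES_PER_YEAR // FILEBLOCKS_PER_YEAR,
        "events_per_year": DATA_PER_YEAR // EVENT_SIZE,
        "bytes_per_second_after_trigger": TRIGGER_RATE_HZ * EVENT_SIZE,
    }


if __name__ == "__main__":
    for name, value in order_of_magnitude_summary().items():
        print(f"{name}: {value:.1e}")
```

Under these assumptions an average file is about 2 GB, a Fileblock holds about a thousand files, and the triggered data stream is about 100 MB/s.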

Page 3: CRAB: a tool for CMS distributed analysis in grid environment

Issues and help

• How to manage and where to store this huge quantity of data?

• How to ensure data access for physicists of the CMS collaboration?

• How to have enough computing power for processing and data analysis?

• How to ensure resource and data availability?

• How to define local and global policies about data access and resources?

CMS will use a distributed architecture based on grid infrastructure

Tools for accessing distributed data and resources are provided by WLCG (Worldwide LHC Computing Grid) in two main flavours: LCG/gLite in Europe and OSG in the US

Page 4: CRAB: a tool for CMS distributed analysis in grid environment

CMS computing model

The CMS offline computing system is arranged in four Tiers and is geographically distributed. Remote data are accessible via grid.

Figure (tier diagram), labels: Online system, recorded data, Offline farm, CERN Computer center (Tier 0); Italy, Fermilab and France Regional Centers (Tier 1); Tier2 Centers (Tier 2); Institute A, Institute B, workstations (Tier 3).

Page 5: CRAB: a tool for CMS distributed analysis in grid environment

Analysis: what happens in a local environment...

• The user writes his own analysis code and configuration parameter card
– Starting from CMS-specific analysis software
– Builds executable and libraries

• He applies the code to a given number of events, whose location is known, splitting the load over many jobs
– But generally he is allowed to access only local data

• He writes wrapper scripts and uses a local batch system to exploit all the computing power
– Comfortable as long as the data you are looking for are sitting right by your side

• Then he submits everything by hand and checks the status and overall progress

• Finally he collects all output files and stores them somewhere

Page 6: CRAB: a tool for CMS distributed analysis in grid environment

...and in a distributed grid environment

Distributed analysis is a more complex computing task because it assumes the user knows:

• which data are available
• where data are stored and how to access them
• which resources are available and able to comply with analysis requirements
• grid and CMS infrastructure details

But users don't want to deal with these kinds of problems

Users want to analyze data in “a simple way”, as in a local environment

Page 7: CRAB: a tool for CMS distributed analysis in grid environment

Distributed analysis chain...

To allow analysis in a distributed environment, the CMS collaboration is developing tools interfaced with grid services, which include:

• Installation of CMS software via grid on remote resources

• Data transfer service: to move and manage a large flow of data among Tiers

• Data validation system: to ensure data consistency

• Data location system: to keep track of the data available at each site and to allow data discovery, composed of
– Central database (RefDB) that knows what kind of data (datasets) have been produced in each Tier
– Local database (PubDB) in each Tier, with info about where data are stored and their access protocol

• CRAB: CMS Remote Analysis Builder...
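The two-level data location system described above (a central RefDB plus one PubDB per Tier) can be pictured as a two-step lookup. The following is a minimal sketch using plain dictionaries as stand-ins; the dataset name, Tier names, paths, protocol string and the `discover` function are all hypothetical, and the real databases and their interfaces are not shown in this presentation.

```python
# Hypothetical in-memory stand-ins for the two catalogues named on the slide.
# These are NOT the real RefDB/PubDB schemas or APIs.

# Central catalogue: dataset name -> Tiers where the dataset has been produced.
REFDB = {
    "example_dataset": ["tier_a", "tier_b"],
}

# One local catalogue per Tier: dataset name -> storage location and access protocol.
PUBDB = {
    "tier_a": {
        "example_dataset": {"location": "/example/path", "protocol": "example_protocol"},
    },
    "tier_b": {},  # listed centrally, but nothing published locally
}


def discover(dataset):
    """Return {tier: access_info} for every Tier that publishes the dataset."""
    found = {}
    for tier in REFDB.get(dataset, []):           # step 1: ask the central catalogue
        info = PUBDB.get(tier, {}).get(dataset)   # step 2: ask that Tier's local catalogue
        if info is not None:
            found[tier] = info
    return found


if __name__ == "__main__":
    print(discover("example_dataset"))
```

Only Tiers that appear in both catalogues are returned, which mirrors the idea that a job can be sent only where the data are actually published.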

Page 8: CRAB: a tool for CMS distributed analysis in grid environment

... and CRAB role

• CRAB is a user-friendly tool that aims to simplify the work of users with no knowledge of grid infrastructure: creating, submitting and managing analysis jobs in grid environments.
– written in Python and installed on the UI (grid user access point)

• Users have to develop their analysis code in an interactive environment and decide which data to analyse.

• They have to provide CRAB with:
– Dataset name, number of events
– Analysis code and parameter card
– Output files and handling policy

• CRAB handles data discovery, resource availability, job creation and submission, status monitoring and output retrieval

Page 9: CRAB: a tool for CMS distributed analysis in grid environment

How CRAB works

• Job creation: crab -create N (or all)
– data discovery: sites storing data are found by querying RefDB and local PubDBs
– packaging of user code: creation of a tgz archive with user code (bin, lib and data)
– wrapper script (sh) for the real user executable
– JDL file, script which drives the real job towards the “grid”
– splitting: according to user request (number of events per job and in total)

• Job submission: crab -submit N (or all) -c
– jobs are submitted to the Resource Broker using BOSS, the submitter and tracking tool interfaced with CRAB
– jobs are sent to those sites which host data
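The splitting step of job creation (total number of events and number of events per job, both chosen by the user) amounts to cutting an event range into consecutive chunks. Below is a minimal sketch of that idea; `split_events` is a hypothetical helper written for illustration and is not taken from the CRAB source.

```python
def split_events(total_events, events_per_job):
    """Cut total_events into consecutive jobs of at most events_per_job each.

    Returns a list of (first_event, n_events) pairs; the last job may be shorter.
    Hypothetical illustration of the splitting idea, not CRAB's own code.
    """
    if total_events < 0:
        raise ValueError("total_events must be non-negative")
    if events_per_job <= 0:
        raise ValueError("events_per_job must be positive")

    jobs = []
    first = 0
    while first < total_events:
        n = min(events_per_job, total_events - first)
        jobs.append((first, n))
        first += n
    return jobs


if __name__ == "__main__":
    for job_number, (first, n) in enumerate(split_events(2500, 1000), start=1):
        print(f"job {job_number}: events {first}..{first + n - 1}")
```

In this sketch, a request for 2500 events at 1000 events per job yields three jobs, the last one covering the remaining 500 events.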

Page 10: CRAB: a tool for CMS distributed analysis in grid environment

How CRAB works (2)

• Job monitoring: crab -status (n_of_job)
– the status of all submitted jobs is checked using BOSS

• Job output management: crab -getoutput (n_of_job)
– following user request, CRAB can
copy the output back to the UI ...
... or copy it to a Storage Element

• Job resubmission: crab -resubmit n_of_job
– if a job suffers a grid failure (aborted or cancelled status)
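The resubmission rule stated above (resubmit a job when it ends in an aborted or cancelled status) can be expressed as a simple filter over job statuses. This is a sketch under assumptions: the status strings, the dictionary layout and the `jobs_to_resubmit` name are hypothetical and do not reproduce the output format of CRAB or BOSS.

```python
# Statuses that the slide associates with a grid failure.
# The exact strings are an assumption made for this illustration.
RESUBMITTABLE = {"aborted", "cancelled"}


def jobs_to_resubmit(statuses):
    """Return the sorted job numbers whose status indicates a grid failure.

    statuses: dict mapping job number -> status string (case-insensitive).
    """
    return sorted(
        job for job, status in statuses.items()
        if status.lower() in RESUBMITTABLE
    )


if __name__ == "__main__":
    example = {1: "done", 2: "Aborted", 3: "running", 4: "cancelled"}
    print(jobs_to_resubmit(example))
```

The resulting job numbers are the ones a user would pass, one at a time, as n_of_job to the resubmission command.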

Page 11: CRAB: a tool for CMS distributed analysis in grid environment

CRAB workflow: today

Page 12: CRAB: a tool for CMS distributed analysis in grid environment

CRAB experience

• Used by tens of users to access remote MC data for Physics TDR analysis

• ~7000 Datasets available for O(10^8) total events, full MC production

• CMS users, via CRAB, use two dedicated Resource Brokers (at CERN and at CNAF) that know all CMS sites

CRAB proves that CMS users are able to use available grid services and that the full analysis chain works in a distributed environment!

Page 13: CRAB: a tool for CMS distributed analysis in grid environment

CRAB usage

Top 20 dataset/owner requested by users

Top 20 CEs where CRAB jobs run

CRAB is currently used to analyse data for the CMS Physics TDR (being written now…)

More than 300,000 jobs were submitted to the grid using CRAB during the second half of last year, by 40-50 users.

Page 14: CRAB: a tool for CMS distributed analysis in grid environment

CRAB future

• As the CMS analysis framework and grid middleware evolve:
– CRAB has to adapt to cope with these changes and always guarantee its usability, and thus remote data access, to users
New data discovery components (DBS, DLS) that will replace RefDB and PubDB
New Event Data Model (as analysis framework)
gLite, new middleware for grid computing

• Open issues to be resolved (the number of users and submitted jobs is increasing…)
– Job policies and priorities at VO level, for example:
for the next three weeks Higgs group users have priority over other groups
tracker alignment jobs performed by user xxx must start immediately

– Bulk submission: handle 1000 jobs as a single task, with just one submission/status/...

Page 15: CRAB: a tool for CMS distributed analysis in grid environment

CRAB future (2)

• CRAB will be split into two different components to minimize the user effort needed to manage analysis jobs and obtain their results.

• Some user actions will be delegated to “not user dependent” services, which take care of following job evolution on the grid, getting results and returning them to the user

• The Me/MyFriend idea:
– Me: the user desktop (laptop or shell), where the working environment is and where the user can work interactively. For user operations such as:
job creation
job submission

– MyFriend: a set of robust services running 24x7 to guarantee the execution of:
job tracking
resubmission
output retrieval

Page 16: CRAB: a tool for CMS distributed analysis in grid environment

Conclusion

• CRAB was born in April ’05
• A big effort has been made to understand user needs and how best to use the services provided by the grid
• A lot of work has been done to make it robust, flexible and reliable
• Users appreciate the tool and are asking for further improvements
– CRAB has been used by many CMS collaborators to analyze remote data for the CMS Physics TDR, otherwise not accessible
– CRAB is used to continuously test CMS Tiers to prove the robustness of the whole infrastructure

• The use of CRAB proves that the complete computing chain for distributed analysis works for a generic CMS user!

http://cmsdoc.cern.ch/cms/ccs/wm/www/Crab/

Page 17: CRAB: a tool for CMS distributed analysis in grid environment

back-up

Back-up slide

Page 18: CRAB: a tool for CMS distributed analysis in grid environment

Statistics with CRAB (1)

From 10-07-05 to 22-01-06, the weekly rate of the CRAB-jobs flow is:

Plots (axis labels only): number of jobs per week; percentage of jobs per week for LCG and OSG.

Page 19: CRAB: a tool for CMS distributed analysis in grid environment

Statistics with CRAB (2)

Efficiency: % of jobs which arrive at the WN (remote CE) and run

Plots shown for: INFN CE, All CE
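The efficiency defined on this slide is a simple percentage. The sketch below assumes the denominator is the number of jobs submitted, which the slide does not state explicitly; `efficiency_percent` and its parameter names are hypothetical.

```python
def efficiency_percent(jobs_run_on_wn, jobs_submitted):
    """Percentage of jobs that reached a worker node (WN) and ran.

    Assumption: the reference total is the number of submitted jobs;
    the slide does not spell out the denominator.
    """
    if jobs_submitted <= 0:
        raise ValueError("jobs_submitted must be positive")
    if not 0 <= jobs_run_on_wn <= jobs_submitted:
        raise ValueError("jobs_run_on_wn must be between 0 and jobs_submitted")
    return 100.0 * jobs_run_on_wn / jobs_submitted


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(efficiency_percent(90, 100))
```

The same function can be applied per CE or over all CEs, matching the two groupings named in the plot legend.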

Page 20: CRAB: a tool for CMS distributed analysis in grid environment

CRAB flow: the future

Page 21: CRAB: a tool for CMS distributed analysis in grid environment

• CMS