
Page 1: Grid Production Experience in the ATLAS Experiment

Grid Production Experience in the ATLAS Experiment

Horst Severini, University of Oklahoma

Kaushik De, University of Texas at Arlington

D0-SAR Workshop, LaTech

April 7, 2004

Page 2: Grid Production Experience in the ATLAS Experiment


ATLAS Data Challenges

Original Goals (Nov 15, 2001)
- Test the computing model, its software, and its data model, and ensure the correctness of the technical choices to be made
- Data Challenges should be executed at the prototype Tier centres
- Data Challenges will be used as input for a Computing Technical Design Report due by the end of 2003 (?) and for preparing a MoU

Current Status
- Goals are evolving as we gain experience
- Computing TDR ~end of 2004
- DCs are a ~yearly sequence of increasing scale & complexity
- DC0 and DC1 (completed); DC2 (2004), DC3, and DC4 planned
- Grid deployment and testing is a major part of the DCs

Page 3: Grid Production Experience in the ATLAS Experiment


ATLAS DC1: July 2002-April 2003

Goals:
- Produce the data needed for the HLT TDR
- Get as many ATLAS institutes involved as possible

Worldwide collaborative activity. Participation: 56 institutes

Australia, Austria, Canada, CERN, China, Czech Republic, Denmark*, France, Germany, Greece, Israel, Italy, Japan, Norway*, Poland, Russia, Spain, Sweden*, Taiwan, UK, USA*

(* using Grid)

Page 4: Grid Production Experience in the ATLAS Experiment


DC1 Statistics (G. Poulard, July 2003)

Process                    No. of events   CPU Time         CPU-days        Volume of data
                                           (kSI2k.months)   (400 SI2k)      (TB)
Simulation, Physics evt.   10^7            415              30000           23
Simulation, Single part.   3x10^7          125              9600            2
Lumi02 Pile-up             4x10^6          22               1650            14
Lumi10 Pile-up             2.8x10^6        78               3750            21
Reconstruction             4x10^6          50               6000
Reconstruction + Lvl1/2    2.5x10^6        (84)             (6300)
Total                                      690 (+84)        51000 (+6300)   60

Page 5: Grid Production Experience in the ATLAS Experiment


U.S. ATLAS DC1 Data Production

- Year-long process, Summer 2002-2003
- Played the 2nd largest role in ATLAS DC1
- Exercised both farm- and grid-based production
- 10 U.S. sites participating - Tier 1: BNL; prototype Tier 2s: BU, IU/UC; Grid Testbed sites: ANL, LBNL, UM, OU, SMU, UTA (UNM & UTPA will join for DC2)
- Generated ~2 million fully simulated, piled-up and reconstructed events
- U.S. was the largest grid-based DC1 data producer in ATLAS
- Data used for the HLT TDR, the Athens physics workshop, reconstruction software tests...

Page 6: Grid Production Experience in the ATLAS Experiment


U.S. ATLAS Grid Testbed

- BNL - U.S. Tier 1, 2000 nodes, 5% for ATLAS, 10 TB, HPSS through Magda
- LBNL - pdsf cluster, 400 nodes, 5% for ATLAS (more if idle, ~10-15% used), 1 TB
- Boston U. - prototype Tier 2, 64 nodes
- Indiana U. - prototype Tier 2, 64 nodes
- UT Arlington - new 200 CPUs, 50 TB
- Oklahoma U. - OSCER facility
- U. Michigan - test nodes
- ANL - test nodes, JAZZ cluster
- SMU - 6 production nodes
- UNM - Los Lobos cluster
- U. Chicago - test nodes

Page 7: Grid Production Experience in the ATLAS Experiment


U.S. Production Summary

Sample             No. of files     No. of    CPU hours      CPU hours   CPU hours
                   in Magda         events    (Simulation)   (Pile-up)   (Reconstruction)
25 GeV di-jets     41k              1M        ~60k           56k         60k+
50 GeV di-jets     10k              250k      ~20k           22k         20k+
Single particles   24k              200k      17k            6k
Higgs sample       11k              50k       8k             2k
SUSY sample        7k               50k       13k            2k
minbias sample     7k               ?         ?

* Total ~30 CPU years delivered to DC1 from the U.S.
* Total produced file size ~20 TB on the HPSS tape system, ~10 TB on disk.
* Black - majority grid produced, blue - majority farm produced (coloring in the original slide).

Exercised both farm- and grid-based production - valuable large-scale grid-based production experience.

Page 8: Grid Production Experience in the ATLAS Experiment


DC1 Production Systems

- Local batch systems - bulk of production
- GRAT - grid scripts, generated ~50k files produced in U.S.
- NorduGrid - grid system, ~10k files in Nordic countries
- AtCom - GUI, ~10k files at CERN (mostly batch)
- GCE - Chimera based, ~1k files produced
- GRAPPA - interactive GUI for individual users
- EDG/LCG - test files only
- + systems I forgot...
- More systems coming for DC2: Windmill, GANGA, DIAL

Page 9: Grid Production Experience in the ATLAS Experiment


GRAT Software

- GRid Applications Toolkit, developed by KD, Horst Severini, Mark Sosebee, and students
- Based on Globus, Magda & MySQL
- Shell & Python scripts, modular design (an illustrative sketch follows below)
- Rapid development platform - quickly develop packages as needed by the DC: physics simulation (GEANT/ATLSIM), pile-up production & data management, reconstruction
- Test grid middleware, test grid performance
- Modules can be easily enhanced or replaced, e.g. EDG resource broker, Chimera, replica catalogue... (in progress)
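As an illustration of the kind of modular wrapper GRAT consists of, here is a minimal Python sketch of a single submission step built on the standard Globus Toolkit command-line tools (globusrun, globus-url-copy, globus-job-submit). This is not the actual GRAT code: the gatekeeper names, file URLs, and transformation script are hypothetical placeholders.

```python
#!/usr/bin/env python
# Illustrative sketch of a GRAT-style submission step (not the actual GRAT code).
# Assumes the Globus Toolkit CLI tools are installed and a valid grid proxy exists.
# Gatekeeper names, URLs, and the transformation script below are hypothetical.
import subprocess

GATEKEEPERS = ["gk1.example.edu/jobmanager-pbs",      # hypothetical testbed sites
               "gk2.example.edu/jobmanager-condor"]

def pick_site():
    """Trivial resource discovery: first gatekeeper that passes an authentication test."""
    for gk in GATEKEEPERS:
        if subprocess.call(["globusrun", "-a", "-r", gk]) == 0:
            return gk
    raise RuntimeError("no gatekeeper reachable")

def stage_in(src_url, dst_url):
    """Pre-stage an input file with GridFTP."""
    subprocess.check_call(["globus-url-copy", src_url, dst_url])

def submit(gk, executable, *args):
    """Submit through the Globus gatekeeper; returns the job contact string for monitoring."""
    out = subprocess.check_output(["globus-job-submit", gk, executable] + list(args))
    return out.decode().strip()

if __name__ == "__main__":
    gk = pick_site()
    host = gk.split("/")[0]
    stage_in("gsiftp://storage.example.org/dc1/input.zebra",   # in GRAT this would be resolved via Magda
             "gsiftp://%s/scratch/input.zebra" % host)
    print("submitted:", submit(gk, "/usr/local/grat/run_atlsim.sh", "partition=0042"))
```

Because each step is its own small function, swapping in a different resource broker or replica catalogue (as mentioned above for EDG and Chimera) only touches one module.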

Page 10: Grid Production Experience in the ATLAS Experiment


GRAT Execution Model

1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-staging
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-staging
9. Cataloging
10. Monitoring

[Diagram: numbered arrows map these ten steps onto the components involved - DC1 Prod. (UTA), the Remote Gatekeeper, Batch Execution with its scratch area, a local Replica store, MAGDA (BNL), and the Param database (CERN). A control-flow sketch follows below.]
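The diagram itself cannot be reproduced here, so the following is a purely illustrative Python sketch of the control flow those ten steps imply. Every function is a stub that only prints what the real GRAT shell/Python scripts would do; the function names, site, and partition numbers are invented.

```python
# Illustrative control flow for the GRAT execution model; all helpers are stubs and
# all names (functions, site, partitions) are invented for this sketch.
import time

def discover_resource(sites):
    print("1. resource discovery"); return sites[0]
def create_job(partition):
    print("3. job creation"); return {"partition": partition}
def prestage_inputs(job, site):
    print("4. pre-staging inputs to", site)
def submit_to_batch(job, site):
    print("5. batch submission via gatekeeper"); return "contact-" + job["partition"]
def parameterize(job, params):
    print("6. job parameterization:", params)
def monitor(handle):
    print("10. monitoring", handle, "while 7. simulation runs remotely")
def poststage_outputs(job, site):
    print("8. post-staging outputs from", site)
def catalog_outputs(job):
    print("9. cataloging outputs (e.g. registering them in Magda)")

def produce(partitions, sites):
    for partition in partitions:            # 2. partition selection (a pre-built list here)
        site = discover_resource(sites)
        job = create_job(partition)
        prestage_inputs(job, site)
        handle = submit_to_batch(job, site)
        parameterize(job, {"random_seed": partition})
        for _ in range(2):                  # poll the remote job a couple of times
            monitor(handle)
            time.sleep(1)
        poststage_outputs(job, site)
        catalog_outputs(job)

produce(["0042", "0043"], ["gk1.example.edu/jobmanager-pbs"])
```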

Page 11: Grid Production Experience in the ATLAS Experiment


U.S. Middleware Evolution

[Slide graphic: evolution of the U.S. middleware stack, with each component annotated by its status - "used for 95% of DC1 production", "used successfully for simulation", "tested for simulation, used for all grid-based reconstruction", and "used successfully for simulation (complex pile-up workflow not yet)". The component labels themselves were part of the graphic.]

Page 12: Grid Production Experience in the ATLAS Experiment


DC1 Production Experience

- Grid paradigm works, using Globus: opportunistic use of existing resources, run anywhere, from anywhere, by anyone...
- Successfully exercised grid middleware with increasingly complex tasks:
  - Simulation: create physics data from pre-defined parameters and input files, CPU intensive
  - Pile-up: mix ~2500 min-bias data files into physics simulation files, data intensive
  - Reconstruction: data intensive, multiple passes
  - Data tracking: multiple steps, one -> many -> many more mappings (see the sketch below)
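To make the "one -> many -> many more" bookkeeping concrete, here is a minimal, purely illustrative Python sketch of such a provenance map. The file names, naming scheme, and number of min-bias files are invented and do not reflect the actual DC1 catalogue schema.

```python
# Purely illustrative provenance bookkeeping for a one -> many -> many-more chain:
# one generator file fans out into simulated partitions, each of which fans out again
# into pile-up and reconstruction outputs.  All file names are invented.
from collections import defaultdict

parents = defaultdict(list)        # output file -> input files it was derived from

def record(outputs, inputs):
    """Record that each output file was produced from the given input files."""
    for out in outputs:
        parents[out].extend(inputs)

# one -> many: simulation splits a generator file into partitions
record(["dc1.002000.simul._0001.zebra", "dc1.002000.simul._0002.zebra"],
       ["dc1.002000.evgen.root"])
# many -> many more: each simulated partition is mixed with min-bias files, then reconstructed
record(["dc1.002000.pileup._0001.zebra"],
       ["dc1.002000.simul._0001.zebra"] + ["minbias.%04d.zebra" % i for i in range(3)])
record(["dc1.002000.recon._0001.ntuple"], ["dc1.002000.pileup._0001.zebra"])

def ancestry(name):
    """Walk back through the map to list every upstream file of a given output."""
    found = []
    for parent in parents.get(name, []):
        found.append(parent)
        found.extend(ancestry(parent))
    return found

print(ancestry("dc1.002000.recon._0001.ntuple"))
```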

Page 13: Grid Production Experience in the ATLAS Experiment


New Production System for DC2

Goals:
- Automated data production system for all ATLAS facilities
- Common database for all production - Oracle currently
- Common supervisor run by all facilities/managers - Windmill
- Common data management system - Don Quichote
- Executors developed by middleware experts (Capone, LCG, NorduGrid, batch systems, CanadaGrid...)
- Final verification of data done by the supervisor

Page 14: Grid Production Experience in the ATLAS Experiment


Windmill - Supervisor

Supervisor development / U.S. DC production team:
- UTA: Kaushik De, Mark Sosebee, Nurcan Ozturk + students
- BNL: Wensheng Deng, Rich Baker
- OU: Horst Severini
- ANL: Ed May

Windmill web page: http://www-hep.uta.edu/windmill

Windmill status:
- version 0.5 released February 23
- includes a complete library of XML messages between agents
- includes sample executors for local, PBS and web services
- can run on any Linux machine with Python 2.2
- development continuing - Oracle production DB, DMS, new schema

Page 15: Grid Production Experience in the ATLAS Experiment


Windmill Messaging

supervisoragent

executoragent

XML switch(JabberServer)

XMPP(XML)

XMPP(XML)

Webserver

SOAP

- All messaging is XML based
- Agents communicate using the Jabber (open chat) protocol
- Agents have the same command line interface - GUI in future
- Agents & web server can run at the same or different locations
- Executor accesses the grid directly and/or through web services (an illustrative message sketch follows below)
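The actual Windmill message schema is not reproduced in this talk, so the XML below is only a hypothetical illustration of what a supervisor-to-executor request might look like, together with a few lines of Python that build and parse it using the standard library. The element and attribute names are invented.

```python
# Hypothetical supervisor -> executor message in the spirit of Windmill's XML messaging.
# The element and attribute names are invented; the real schema lived in Windmill's
# messages.py / xmlkit.py libraries.
import xml.etree.ElementTree as ET

def build_request(num_jobs):
    """Build a (made-up) 'executeJobs' request the supervisor might send."""
    root = ET.Element("message", {"from": "supervisor", "to": "executor"})
    request = ET.SubElement(root, "executeJobs")
    ET.SubElement(request, "numJobs").text = str(num_jobs)
    return ET.tostring(root, encoding="unicode")

def parse_request(xml_text):
    """Executor side: pull the requested job count back out of the message."""
    root = ET.fromstring(xml_text)
    return int(root.find("./executeJobs/numJobs").text)

msg = build_request(5)
print(msg)                 # <message from="supervisor" to="executor"><executeJobs>...</executeJobs></message>
print(parse_request(msg))  # 5
```

In the deployed system such a string would travel as the body of a Jabber (XMPP) chat message routed by the Jabber server, or over SOAP when the executor sits behind the web service interface.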

Page 16: Grid Production Experience in the ATLAS Experiment


Intelligent Agents

- Supervisor/executor are intelligent communication agents
- Uses the Jabber open source instant messaging framework
- The Jabber server routes XMPP messages - acts as an XML data switch
- Reliable p2p asynchronous message delivery through firewalls
- Built-in support for dynamic 'directory', 'discovery', 'presence'
- Extensible - we can add monitoring, debugging agents easily
- Provides 'chat' capability for free - collaboration among operators
- Jabber grid proxy under development (LBNL - Agarwal)

[Diagram: Jabber clients exchanging XMPP messages through a central Jabber server.]

Page 17: Grid Production Experience in the ATLAS Experiment


Core Windmill Libraries

- interact.py - command line interface library
- agents.py - common intelligent agent library
- xmlkit.py - XML creation (generic) and parsing library
- messages.py - XML message creation (specific)
- proddb.py - production database methods for Oracle, MySQL, local, dummy, and possibly other options
- supervise.py - supervisor methods to drive production
- execute.py - executor methods to run facilities

Page 18: Grid Production Experience in the ATLAS Experiment


Capone Executor

- Various executors are being developed:
  - Capone - U.S. VDT executor, by U. of Chicago and Argonne
  - Lexor - LCG executor, mostly by Italian groups
  - NorduGrid, batch (Munich), Canadian, Australian(?)
- Capone is based on GCE (Grid Computing Environment): VDT Client/Server, Chimera, Pegasus, Condor, Globus
- Status:
  - Python module
  - Process "thread" for each job
  - Archive of managed jobs
  - Job management
  - Grid monitoring
  - Aware of key parameters (e.g. available CPUs, jobs running) - see the sketch below
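Capone's internals are not shown in this talk, so the following is only a generic Python sketch of the "process thread per job, plus an archive of managed jobs" idea. The class and method names are invented, and the actual grid work is replaced by a stub.

```python
# Generic sketch of a thread-per-job executor with an archive of managed jobs.
# This is not Capone's code: names are invented and the grid work is a stub.
import threading, time, uuid

class ManagedJob(threading.Thread):
    def __init__(self, spec, archive):
        threading.Thread.__init__(self)
        self.job_id = str(uuid.uuid4())
        self.spec = spec
        self.archive = archive                  # shared dict: job_id -> status
        self.archive[self.job_id] = "submitted"

    def run(self):
        # Stand-in for the real work: submit via VDT/Globus, watch the job, stage results.
        self.archive[self.job_id] = "running"
        time.sleep(1)                           # pretend the grid job takes a while
        self.archive[self.job_id] = "finished"

class Executor:
    def __init__(self):
        self.archive = {}                       # archive of managed jobs

    def execute(self, spec):
        job = ManagedJob(spec, self.archive)
        job.start()                             # one "thread" per job
        return job.job_id

    def status(self):
        """Key parameters a supervisor might query, e.g. how many jobs are running."""
        running = sum(1 for s in self.archive.values() if s == "running")
        return {"jobs_running": running, "jobs_total": len(self.archive)}

executor = Executor()
executor.execute({"transformation": "dc2.simul", "partition": 1})
print(executor.status())
```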

Page 19: Grid Production Experience in the ATLAS Experiment


Capone Architecture

- Message interface: Web Service, Jabber
- Translation level: Windmill
- CPE (Capone Process Engine)
- Processes: Grid, Stub, DonQuixote

(from Marco Mambelli)

[Diagram: layered view - the message protocols (Web Service, Jabber) feed a translation layer (Windmill, ADA), which drives the CPE, which in turn runs the Grid, Stub, and DonQuixote processes.]

Page 20: Grid Production Experience in the ATLAS Experiment


Windmill Screenshots

Page 21: Grid Production Experience in the ATLAS Experiment


Page 22: Grid Production Experience in the ATLAS Experiment


Web Services Example

Page 23: Grid Production Experience in the ATLAS Experiment


Conclusion

- Data Challenges are important for ATLAS software and computing infrastructure readiness
- Grids will be the default testbed for DC2
- U.S. playing a major role in DC2 planning & production
- 12 U.S. sites ready to participate in DC2
- Major U.S. role in production software development
- Test of the new grid production system imminent
- Physics analysis will be the emphasis of DC2 - new experience
- Stay tuned