ATLAS Grid Computing and Data Challenges
Nurcan Ozturk
University of Texas at Arlington
Recent Progresses in High Energy Physics, Bolu, Turkey, June 23-25, 2004
Outline
• Introduction
• ATLAS Experiment
• ATLAS Computing System
• ATLAS Computing Timeline
• ATLAS Data Challenges
• DC2 Event Samples
• Data Production Scenario
• ATLAS New Production System
• Grid Flavors in Production System
• Windmill-Supervisor
• An Example of XML Messages
• Windmill-Capone Screenshots
• Grid Tools
• Conclusions
Introduction
• Why Grid computing:
• Scientific research is becoming more and more complex, and international teams of scientists are growing larger and larger
• Grid technologies enable scientists to use remote computers and data storage systems to retrieve and analyze data around the world
• Grid computing power will be a key to the success of the LHC experiments
• Grid computing is a challenge not only for particle physics experiments but also for biologists, astrophysicists and gravitational wave researchers
ATLAS Experiment
• The ATLAS (A Toroidal LHC Apparatus) experiment at the Large Hadron Collider at CERN will start taking data in 2007.
• proton-proton collisions at a 14 TeV center-of-mass energy
• ATLAS will study:
• SM Higgs boson
• SUSY states
• SM QCD, EW, heavy-quark physics
• New physics?
• Total amount of “raw” data: 1 PB/year
• Needs the Grid to reconstruct and analyze this data: a complex “Worldwide Computing Model” and “Event Data Model”
• Raw data at CERN
• Reconstructed data “distributed”
• All members of the collaboration must have access to “ALL” public copies of the data
• ~2000 collaborators, ~150 institutes, 34 countries
ATLAS Computing System (R. Jones)
[Diagram of the tiered computing model; approximate figures from the original slide:]
• Detector (~PB/s) → Event Builder → Event Filter (~159 kSI2k, ~10 GB/s) → Tier 0 at CERN (~5 MSI2k), with 450 Mb/s and 100-1000 MB/s links along the chain
• Tier 0 → Tier 1 regional centres (e.g. UK at RAL, US, French and Italian centres) at ~300 MB/s per Tier 1 per experiment; each Tier 1: ~7.7 MSI2k, ~2 PB/year, ~9 PB/year stored, no simulation
• Tier 1 → Tier 2 centres (~200 kSI2k and ~200 TB/year each; e.g. a Northern Tier, or Sheffield/Manchester/Liverpool/Lancaster at ~0.25 TIPS) over 622 Mb/s links
• Tier 2s do the bulk of simulation; each Tier 2 has ~25 physicists working on one or more channels and should have the full AOD, TAG and relevant physics-group summary data
• Some data for calibration and monitoring flow to the institutes; calibrations flow back
• Desktops/workstations for analysis; physics data cache; PC (2004) = ~1 kSpecInt2k
ATLAS Computing Timeline (D. Barberis)
2003:
• POOL/SEAL release (done)
• ATLAS release 7 (with POOL persistency) (done)
• LCG-1 deployment (done)
• ATLAS complete Geant4 validation (done)
2004 (NOW):
• ATLAS release 8 (done)
• DC2 Phase 1: simulation production
• DC2 Phase 2: intensive reconstruction (the real challenge!)
• Combined test beams (barrel wedge)
2005:
• Computing Model paper
• Computing Memorandum of Understanding
• ATLAS Computing TDR and LCG TDR
2006:
• DC3: produce data for PRR and test LCG-n
• Physics Readiness Report
2007:
• Start commissioning run
• GO!
ATLAS Data Challenges
• Data Challenges --> generate and analyze simulated data of increasing scale and complexity, using the Grid as much as possible
• Goals:
• Validate the Computing Model, the software and the data model, and ensure the correctness of the technical choices to be made
• Provide simulated data to design and optimize the detector
• Experience gained in these Data Challenges will be used to formulate the ATLAS Computing Technical Design Report
• Status:
• DC0 (December 2001 - June 2002) and DC1 (July 2002 - March 2003) completed
• DC2 ongoing
• DC3, DC4 planned (one per year)
DC2 Event Samples (G. Poulard)

Channel  Process            Decay / Cuts       Events (10^6)
A0       Top                                   1
A0a      Top (mis-aligned)                     -
A1       Z                  e-e, no Pt cut     1
A2       Z                  mu-mu              1
A3       Z                  tau-tau            1
A4       W                  leptons            1
A5       Z + jet                               0.5
A6       dijets             Pt > 600           0.25
A7       W + 4 jets         W -> leptons       0.25
A8       QCD                                   0.5
A9       SUSY                                  0.1
A10      Higgs              tau-tau            0.1
A11      DC1 SUSY                              0.05
B1       Jets               Pt > 180           1
B2       Gamma + jet        Pt > 20            0.2
B3       bb -> B            mu6-mu6            0.25
B4       Jets               Pt > 17            1
B5       Gamma + jet                           0.05
H1       Higgs (130)        4 leptons          0.04
H2       Higgs (180)        4 leptons          0.04
H3       Higgs (120)        gamma-gamma        0.015
H4       Higgs (170)        W-W                0.015
H5       Higgs (170)                           0.015
H6       Higgs (115)        tau-tau            0.015
H7       Higgs (115)        tau-tau            0.015
H8       MSSM Higgs         b-b-A(300)         0.015
H9       MSSM Higgs         b-b-A(115)         0.015
M1       Minimum bias                          -

Total: 9.435
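As a quick cross-check, the per-sample counts in the table do sum to the quoted total of 9.435 million events. A minimal Python sketch (samples with no count listed on the slide, A0a and M1, are omitted):

```python
# Cross-check of the DC2 event-sample totals in the table above.
# A0a and M1 carry no event count in the source and are left out.
samples = {
    "A0": 1.0, "A1": 1.0, "A2": 1.0, "A3": 1.0, "A4": 1.0,
    "A5": 0.5, "A6": 0.25, "A7": 0.25, "A8": 0.5,
    "A9": 0.1, "A10": 0.1, "A11": 0.05,
    "B1": 1.0, "B2": 0.2, "B3": 0.25, "B4": 1.0, "B5": 0.05,
    "H1": 0.04, "H2": 0.04,
    "H3": 0.015, "H4": 0.015, "H5": 0.015, "H6": 0.015,
    "H7": 0.015, "H8": 0.015, "H9": 0.015,
}
total = sum(samples.values())  # in units of 10^6 events
print(f"{total:.3f} M events")  # matches the 9.435 on the slide
```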
Data Production Scenario (G. Poulard)

Process            Input                                   Output                           Comments
Event generation   none                                    Generated events (1 file)        < 2 GB files
G4 simulation      Generated events                        Hits + MCTruth (several files)   Job duration limited to 24h!
Detector response  Hits + MCTruth (generated events)       Digits + MCTruth = RDO (1 file)  ~2000 jobs/day
Pile-up            Hits “signal” + MCTruth; Hits “min.b”   Digits + MCTruth (1 file)        Input: ~10 GB/job, ~10 TB/day, ~150 MB/s
Byte-stream        “pile-up” data (RDO)                    BS (“part of” < 2 GB files)      No MCTruth if BS
Events mixing      RDO or BS                               RDO (or BS), several ~10 files   Streaming?
Reconstruction     RDO or BS                               ESD                              ~2000 jobs/day, ~500 GB/day, ~5 MB/s
AOD production     ESD                                     AOD                              Still some work
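The sustained rates quoted in the scenario follow from the daily volumes. A small Python sketch of that arithmetic (the ~150 MB/s and ~5 MB/s figures on the slide are rounded order-of-magnitude values):

```python
# Order-of-magnitude check of the sustained rates quoted above.
# Assumes 1 TB = 1e6 MB and an 86400-second day.
SECONDS_PER_DAY = 24 * 3600

def daily_volume_to_rate(mb_per_day):
    """Convert a daily data volume in MB to a sustained rate in MB/s."""
    return mb_per_day / SECONDS_PER_DAY

input_rate = daily_volume_to_rate(10e6)    # ~10 TB/day of input
output_rate = daily_volume_to_rate(500e3)  # ~500 GB/day of output
print(f"input:  {input_rate:.0f} MB/s")    # ~116 MB/s, quoted as ~150 MB/s
print(f"output: {output_rate:.1f} MB/s")   # ~5.8 MB/s, quoted as ~5 MB/s
```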
ATLAS New Production System
[Architecture diagram:] a common supervisor (Windmill) runs one instance per facility and talks to a dedicated executor for each Grid flavor: Lexor for LCG, Dulcinea for NorduGrid (NG), Capone for Grid3, plus a legacy LSF executor. The supervisors pull job definitions from the production database (prodDB); the Don Quijote data management system (dms) and the AMI metadata database sit alongside, with an RLS replica catalog per Grid. Supervisor-executor communication runs over Jabber or SOAP, depending on the executor.
http://www.nordugrid.org/applications/prodsys/
Grid Flavors in Production System
• LCG: LHC Computing Grid, > 40 sites
• Grid3: USA Grid, 27 sites
• NorduGrid: Denmark, Sweden, Norway, Finland, Germany, Estonia, Slovenia, Slovakia, Australia, Switzerland; 35 sites
Regional Centres Connected to the LCG Grid (as of 07-May-04)
• Austria: UIBK
• Canada: TRIUMF (Vancouver), Univ. Montreal, Univ. Alberta
• Czech Republic: CESNET (Prague), University of Prague
• France: IN2P3 (Lyon)**
• Germany: FZK (Karlsruhe), DESY, University of Aachen, University of Wuppertal
• Greece: GRNET (Athens)
• Holland: NIKHEF (Amsterdam)
• Hungary: KFKI (Budapest)
• Israel: Tel Aviv University**, Weizmann Institute
• Italy: CNAF (Bologna), INFN Torino, INFN Milano, INFN Roma, INFN Legnaro
• Japan: ICEPP (Tokyo)**
• Poland: Cyfronet (Krakow)
• Portugal: LIP (Lisbon)
• Russia: SINP (Moscow)
• Spain: PIC (Barcelona), IFIC (Valencia), IFCA (Santander), University of Barcelona, Uni. Santiago de Compostela, CIEMAT (Madrid), UAM (Madrid)
• Switzerland: CERN, CSCS (Manno)**
• Taiwan: Academia Sinica (Taipei), NCU (Taipei)
• UK: RAL, Cavendish (Cambridge), Imperial (London), Lancaster University, Manchester University, Sheffield University, QMUL (London)
• USA: FNAL, BNL**
** not yet in LCG-2
Centres in process of being connected:
• China: IHEP (Beijing)
• India: TIFR (Mumbai)
• Pakistan: NCP (Islamabad)
• Hewlett Packard to provide “Tier 2-like” services for LCG, initially in Puerto Rico
(L. Perini)
Windmill-Supervisor
• Supervisor development team at UTA: Kaushik De, Nurcan Ozturk, Mark Sosebee
• Supervisor-executor communication is via the Jabber protocol, originally developed for instant messaging
• XML (Extensible Markup Language) messages are passed between supervisor and executor
• Supervisor-executor interaction:
• numJobsWanted
• executeJobs
• getExecutorData
• getStatus
• fixJob
• killJob
• Final verification of jobs is done by the supervisor
[Diagram: the supervisor sits between the production database (prodDB), the data management system, the replica catalog, the production manager and the executors.]
Windmill webpage: http://www-hep.uta.edu
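The interaction above can be sketched as a simple request/dispatch cycle. This is a hypothetical Python illustration, not the real Windmill API: class and method names are invented, and the real system exchanges these calls as XML messages over Jabber rather than in-process calls.

```python
# Hypothetical sketch of one supervisor-executor cycle (illustrative names).
class Executor:
    """A Grid-flavor executor answering supervisor requests."""
    def __init__(self, free_slots):
        self.free_slots = free_slots

    def num_jobs_wanted(self, min_resources):
        # Report how many jobs this Grid can accept right now.
        return self.free_slots

    def execute_jobs(self, jobs):
        # Hand jobs to the underlying Grid; return per-job handles.
        return {job: "submitted" for job in jobs}


class Supervisor:
    """Pulls job definitions from prodDB and dispatches them."""
    def __init__(self, prod_db):
        self.prod_db = prod_db  # stand-in for the production database

    def cycle(self, executor):
        wanted = executor.num_jobs_wanted({"minimumRAM": "256 MB"})
        jobs = self.prod_db[:wanted]   # claim that many jobs from prodDB
        del self.prod_db[:wanted]
        return executor.execute_jobs(jobs)


prod_db = ["job-%03d" % i for i in range(10)]
sup = Supervisor(prod_db)
handles = sup.cycle(Executor(free_slots=5))
print(handles)  # 5 jobs dispatched; 5 remain in prodDB
```

The point of the negotiation step is that the supervisor never pushes more jobs than the executor's Grid can currently absorb.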
An Example of XML Messages

numJobsWanted: supervisor-executor negotiation of the number of jobs to process.

Supervisor’s request:

<?xml version="1.0" ?>
<windmill type="request" user="supervisor" version="0.6">
  <numJobsWanted>
    <minimumResources>
      <transUses>JobTransforms-8.0.1.2 Atlas-8.0.1</transUses> <!-- software version -->
      <cpuConsumption>
        <count>100000</count>              <!-- minimum CPU required for a production job -->
        <unit>specint2000seconds</unit>    <!-- unit of CPU usage -->
      </cpuConsumption>
      <diskConsumption>
        <count>500</count>                 <!-- maximum output file size -->
        <unit>MB</unit>
      </diskConsumption>
      <ipConnectivity>no</ipConnectivity>  <!-- IP connection required from CE -->
      <minimumRAM>
        <count>256</count>                 <!-- minimum physical memory requirement -->
        <unit>MB</unit>
      </minimumRAM>
    </minimumResources>
  </numJobsWanted>
</windmill>

Executor’s response:

<?xml version="1.0" ?>
<windmill type="respond" user="executor" version="0.8">
  <numJobsWanted>
    <availableResources>
      <jobCount>5</jobCount>
      <cpuMax>
        <count>100000</count>
        <unit>specint2000</unit>
      </cpuMax>
    </availableResources>
  </numJobsWanted>
</windmill>
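Messages in this form are easy to consume programmatically. A minimal sketch using Python’s standard xml.etree module to pull the negotiated job count out of the executor’s response (the XML here mirrors the example above; the parsing code itself is illustrative, not Windmill’s):

```python
# Extracting fields from the executor's numJobsWanted response
# with the Python standard library.
import xml.etree.ElementTree as ET

response = """<?xml version="1.0" ?>
<windmill type="respond" user="executor" version="0.8">
  <numJobsWanted>
    <availableResources>
      <jobCount>5</jobCount>
      <cpuMax>
        <count>100000</count>
        <unit>specint2000</unit>
      </cpuMax>
    </availableResources>
  </numJobsWanted>
</windmill>"""

root = ET.fromstring(response)  # root element is <windmill>
job_count = int(root.findtext("numJobsWanted/availableResources/jobCount"))
cpu_unit = root.findtext(".//cpuMax/unit")
print(job_count, cpu_unit)  # 5 specint2000
```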
Grid Tools

What tools are needed for a Grid site? An example: Grid3, the USA Grid
• Joint project of USATLAS, USCMS, iVDGL, PPDG, GriPhyN
• Components:
• VDT based
• Classic SE (gridftp)
• Monitoring: Grid site catalog, Ganglia, MonALISA
• Two RLS servers and a VOMS server for ATLAS
• Installation:
• pacman -get iVDGL:Grid3
• Takes ~4 hours to bring up a site from scratch

VDT (Virtual Data Toolkit) version 1.1.14 provides:
• Virtual Data System 1.2.3
• Class Ads 0.9.5
• Condor 6.6.1
• EDG CRL Update 1.2.5
• EDG Make Gridmap 2.1.0
• Fault Tolerant Shell (ftsh) 2.0.0
• Globus 2.4.3 plus patches
• GLUE Information Providers
• GLUE Schema 1.1, extended version 1
• GPT 3.1
• GSI-Enabled OpenSSH 3.0
• Java SDK 1.4.1
• KX509 2031111
• MonALISA 0.95
• MyProxy 1.11
• Netlogger 2.2
• PyGlobus 1.0
• PyGlobus URL Copy 1.1.2.11
• RLS 2.1.4
• UberFTP 1.3
Conclusions
• The Grid paradigm works: opportunistic use of existing resources, run anywhere, from anywhere, by anyone...
• Grid computing is a challenge and needs worldwide collaboration
• Data production using the Grid is possible and has been successful so far
• Data Challenges are the way to test the ATLAS computing model before the real experiment starts
• Data Challenges also provide data for the physics groups
• The Data Challenges are a learning and improving experience