Phase 2 of the Physics Data Challenge ‘04 Latchezar Betev ALICE Offline week Geneva, September 15, 2004


Phase 2 of the Physics Data Challenge ‘04

Latchezar Betev

ALICE Offline week

Geneva, September 15, 2004


Status of PDC04, 15 Sep. 2004, ALICE Offline week

Outline
- Purpose and conditions of Phase 2
- Job structure and improvements to AliEn
- Statistics (up to today)
- Problems
- Toward Phase 3
- Conclusions


Phase 2 purpose and tasks
- Mixing of signal events with different physics content into the underlying Pb+Pb events (underlying events are reused several times)
- Test of:
  - Standard production of signal events
  - Stress test of network and file transfer tools
  - Storage at remote SEs, stability (crucial for Phase 3)
- Conditions, jobs …:
  - 62 different conditions
  - 340K jobs, 15.2M events
  - 10 TB produced data
  - 200 TB data transfer from CERN
  - 500 MSI2K hours CPU
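The planned totals above can be turned into rough per-job averages. The following is a minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and assuming the totals are spread evenly over all jobs, which the slide does not state; the function name is illustrative only.

```python
# Planned Phase 2 totals, as quoted on the slide.
JOBS = 340_000
EVENTS = 15_200_000
PRODUCED_TB = 10
TRANSFER_TB = 200
CPU_MSI2K_HOURS = 500

# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def per_job_averages():
    """Return average per-job quantities, assuming an even spread over jobs."""
    return {
        "events_per_job": EVENTS / JOBS,
        "output_mb_per_job": PRODUCED_TB * MB_PER_TB / JOBS,
        "input_mb_per_job": TRANSFER_TB * MB_PER_TB / JOBS,
        "si2k_hours_per_job": CPU_MSI2K_HOURS * 1_000_000 / JOBS,
    }


if __name__ == "__main__":
    for name, value in per_job_averages().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions an average job handles about 45 events, writes about 29 MB, and reads close to 600 MB transferred from CERN.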


Distribution of tasks (physics signals):


Structure of event production in Phase 2 (diagram, summarised):
- Central servers: master job submission, Job Optimizer (N sub-jobs), RB, file catalogue, process monitoring and control, SE…
- Sub-jobs are sent to CEs for job processing, either directly or through the AliEn-LCG interface and the LCG RB
- Input: underlying event input files from CERN CASTOR
- Output: output files on local SEs (primary copy); zip archive of the output files in CERN CASTOR (backup copy)
- LCG sites: output stored with edg(lcg) copy&register and registered in the AliEn FC, with LCG SE: LCG LFN = AliEn PFN


Jets master job JDL:
- 12 input files from Phase 1
- 8 configuration files, job steering and validation scripts
- 4 output files (local SE), 1 backup zip copy (CERN CASTOR), 4 log files (CERN SE)


AliEn system improvements that made this possible:
- AliEn process tables split into “running” (lightweight) and “done” (archive), which allows faster process tracking
- Symbolic links and event groups (through sophisticated search algorithms): a number of underlying events are grouped, through symbolic links, in a directory for a specific signal event type. For example, 1660 underlying events are used for each jet signal condition, another 1660 for the next, and so on up to 20000 in total (12 conditions)
- Zip archiving, mainly to overcome the limitations of the tape systems (fewer, larger files)
- Fast resubmission of failed jobs: in this phase all jobs must finish
- New job monitoring tools, including single-job trace logs from start to finish with logical steps and timing

LCG improvements: see the talk of Piergiorgio Cerello
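The event-group idea described above can be illustrated on a local filesystem. This is only an analogy: in AliEn the symbolic links live in the file catalogue, whose interface is not shown in the talk. The function names, directory names, and file paths below are hypothetical. Note that 12 groups of 1660 events give 19,920 events, in line with the "up to 20000" quoted on the slide.

```python
import os


def group_events(event_paths, group_size, n_groups):
    """Split the first group_size * n_groups paths into consecutive groups."""
    needed = group_size * n_groups
    if len(event_paths) < needed:
        raise ValueError(f"need {needed} events, got {len(event_paths)}")
    return [
        event_paths[i * group_size:(i + 1) * group_size]
        for i in range(n_groups)
    ]


def link_groups(groups, root, condition_names):
    """Create one directory per condition, holding symlinks to its events."""
    for name, group in zip(condition_names, groups):
        directory = os.path.join(root, name)
        os.makedirs(directory, exist_ok=True)
        for path in group:
            # The link points at the original file; no data is copied.
            os.symlink(path, os.path.join(directory, os.path.basename(path)))


if __name__ == "__main__":
    # Hypothetical paths standing in for the underlying Pb+Pb events.
    events = [f"/underlying/event_{i:05d}.root" for i in range(20000)]
    groups = group_events(events, group_size=1660, n_groups=12)
    print(len(groups), "groups,", sum(len(g) for g in groups), "events used")
```

Because each group is only a directory of links, the same underlying event files can be reused by many jobs without duplicating the data.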


Phase 2 statistics (start July 2004, end September 2004):
- Jet signals, unquenched and quenched, cent1: 90% complete
- Jet signals, unquenched, per1: 60% complete
- Special TRD production at CNAF: phase 1 running
- Number of jobs: 75K (the number of jobs done per day is increasing)
- Number of output files: 375K data, 300K log
- Data volume: 3.2 TB at local SEs, 3.2 TB at CERN (backup)
- Job duration: 2h 30min for cent1, 1h 20min for per1
  - Careful profiling of AliRoot and cleaning up of the code has reduced the processing time by a factor of 2!
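These numbers can be cross-checked against the per-job file counts quoted for the jets master job (4 output files, 1 backup zip, 4 log files). The sketch below assumes that every completed job has the same file layout as a jets job and that "data" files include the zip archive; neither assumption is stated in the talk. It also assumes decimal units (1 TB = 1,000,000 MB).

```python
# Figures quoted on the statistics slide (K = thousand).
JOBS_DONE = 75_000
DATA_FILES = 375_000
LOG_FILES = 300_000
LOCAL_VOLUME_TB = 3.2

# Per-job file counts quoted for the jets master job.
OUTPUT_FILES_PER_JOB = 4
ZIP_ARCHIVES_PER_JOB = 1
LOG_FILES_PER_JOB = 4


def files_per_job(total_files, jobs):
    """Average number of files per completed job."""
    return total_files / jobs


data_per_job = files_per_job(DATA_FILES, JOBS_DONE)
logs_per_job = files_per_job(LOG_FILES, JOBS_DONE)
# Assumption: decimal units, 1 TB = 1,000,000 MB.
local_mb_per_job = LOCAL_VOLUME_TB * 1_000_000 / JOBS_DONE

print(f"data files per job: {data_per_job:.1f}")
print(f"log files per job: {logs_per_job:.1f}")
print(f"local volume per job: {local_mb_per_job:.1f} MB")
```

The averages come out at 5 data files and 4 log files per job, which is consistent with 4 output files plus 1 zip archive and 4 log files per job, and at roughly 43 MB of locally stored output per job.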


Individual sites: CPU contribution

AliEn direct control: 17 CEs, each with an SE; CERN-LCG encompasses the LCG resources worldwide (also with local/close SEs)


Individual sites: jobs successfully done


Current problems

AliEn problems:
- Proxy server: out of memory due to a spiralling number of proxy connections. An attempt to introduce a scheme with a limited number of pre-forked proxies was not successful, and the problem has to be studied further
  - Not a show-stopper: we know what to monitor and how to avoid it
- JobOptimizer: due to the very complex structure of the jobs (many files in the input box), the time needed to prepare one job for submission is large, and the service sometimes cannot supply enough jobs to fill the available resources
  - Not a show-stopper now: we are mixing jobs with different execution times, thus load-balancing the system
  - Has to be fixed for Phase 3, where the input boxes of the jobs will be even larger and the processing time is very short; ideas on how to speed up the system already exist

LCG problems: see the talk of Piergiorgio Cerello
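The JobOptimizer bottleneck can be made concrete with a simple steady-state estimate: if preparing one job takes p seconds and an average job runs for D seconds, a single serial optimizer can keep at most D / p job slots busy. The sketch below uses the job durations quoted in the statistics (2h 30min and 1h 20min); the 30-second preparation time and the 50/50 job mix are purely hypothetical values chosen for illustration, and the serial-optimizer model is an assumption, not a description of AliEn internals.

```python
def mean_job_seconds(mix):
    """Average job duration for a list of (duration_seconds, fraction) pairs."""
    total_fraction = sum(fraction for _, fraction in mix)
    return sum(d * fraction for d, fraction in mix) / total_fraction


def max_busy_slots(prep_seconds, job_seconds):
    """Job slots a serial optimizer can keep busy in steady state."""
    if prep_seconds <= 0:
        raise ValueError("prep_seconds must be positive")
    return job_seconds / prep_seconds


# Job durations quoted in the talk.
CENT1_SECONDS = 2 * 3600 + 30 * 60  # 2h 30min
PER1_SECONDS = 1 * 3600 + 20 * 60   # 1h 20min

# Hypothetical preparation time per job; not a figure from the talk.
PREP_SECONDS = 30

if __name__ == "__main__":
    mixed = mean_job_seconds([(CENT1_SECONDS, 0.5), (PER1_SECONDS, 0.5)])
    for label, seconds in [("cent1 only", CENT1_SECONDS),
                           ("per1 only", PER1_SECONDS),
                           ("50/50 mix", mixed)]:
        print(f"{label}: {max_busy_slots(PREP_SECONDS, seconds):.0f} slots")
```

In this model, shorter jobs lower the number of slots the optimizer can feed, which is why very short Phase 3 jobs with larger input boxes would make the problem worse, and why mixing in longer jobs helps in the meantime.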


Toward Phase 3
- Purpose: distributed analysis of the data processed in Phase 2
- An AliEn analysis prototype already exists:
  - Some poor souls are trying to work with it, but it’s difficult with the production running…
- We want to use gLite during this phase as much as possible (and provide feedback)
- Service requirements:
  - In both Phase 1 and 2 the service quality of the computing centres has been excellent, with very short response times in case of problems
  - Phase 3 will continue until the end of the year:
    - The remote computing centres will have to continue providing the same excellent level of service
    - Since the data are stored locally, interruptions of service will cause the analysis jobs to fail (or run very slowly). The backup copy at CERN is on tape only and will take a considerable amount of time to stage back if the local copy is not accessible
  - The above is valid for the centres directly controlled through AliEn and for the LCG sites


Conclusions
- Phase 2 of the PDC’04 is about 50% finished and is progressing well, despite its complexity
- There is keen competition for resources at all sites (LHCb and ATLAS are also running massive DCs)
- We have not encountered any show-stoppers. All production problems that arise are fixed very quickly by the AliEn crew, and the response of the experts at the computing centres is very efficient
- We are also running a considerable number of jobs on LCG sites, and it is performing very well, with more and more resources being made available for ALICE (see the talk of Piergiorgio Cerello), thanks to the hard work of the LCG team
- In about 3 weeks’ time we will seamlessly enter the last phase of the PDC’04…

It’s not over yet, but we are getting close!


Acknowledgements

Special thanks to the site experts for the computing and storage resources and for the excellent support:
- Francesco Minafra – Bari
- Haavard Helstrup – Bergen
- Roberto Barbera – Catania
- Giuseppe Lo Re – CNAF Bologna
- Kilian Schwarz – FZK Karlsruhe
- Jason Holland – TLC² Houston
- Galina Shabratova – IHEP, ITEP, JINR
- Eygene Ryabinkin – KIAE Moscow
- Doug Olson – LBL
- Yves Schutz – CC-IN2P3 Lyon
- Doug Johnson – OSC Ohio
- Jiri Chudoba – Golias Prague
- Andrey Zarochencev – SPBsU St. Petersburg
- Jean-Michel Barbet – SUBATECH Nantes
- Mario Sitta – Torino

And to Patricia Lorenzo for bearing with us…