TRANSCRIPT
1 US ATLAS Transparent Distributed Facility Workshop, UNC 3/4/2008
FDR - Results so far, schedule and challenges Jim Cochran
Iowa State University
Outline
What is the FDR ?
Event mixing – a moving target
Current Schedule
FDR-1 operation: the week of February 4 (Tier 0)
Data distribution
Preparation for US analysis effort
Early user feedback
Lessons Learned (so far)
Plans for FDR-2

Much (most ?) material stolen from talks by Michael Wilson, Ian Hinchcliffe, Dave Charlton, Alexei Klimentov, Kors Bos, …
What is the FDR ?
The Full Dress Rehearsal (FDR) is an attempt to test the complete chain (or as much of it as we can actually test without real collisions)
There are many steps between DAQ output and final plots
The individual steps have been tested; the full chain has not …
As we'll see, the FDR is not a totally realistic test
Basic idea: feed (appropriately mixed*) raw events into the system at DAQ output
Treat, as much as possible, as if real data (calibration, streaming, reco, …)
To happen in two parts (details later):

        Instantaneous luminosity (cm^-2 s^-1)   Integrated luminosity (pb^-1)
FDR-1   ~10^31                                  0.36
FDR-2   ~10^33                                  20-25 (13.3 in the revised plan)
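The integrated-luminosity figures follow from instantaneous luminosity times live time. As a quick sanity check (a sketch: the 1 pb^-1 = 10^36 cm^-2 conversion is standard, and the ten one-hour runs come from the FDR-1 mixing plan later in the talk):

```python
CM2_PER_INV_PB = 1e36  # 1 pb^-1 corresponds to 1e36 cm^-2

def integrated_lumi_pb(inst_lumi_cm2s, seconds):
    """Integrated luminosity (pb^-1) for a constant instantaneous luminosity."""
    return inst_lumi_cm2s * seconds / CM2_PER_INV_PB

# FDR-1: ten one-hour runs at ~1e31 cm^-2 s^-1
print(integrated_lumi_pb(1e31, 10 * 3600))  # -> 0.36
```

The same formula reproduces the FDR-2 plan's 2.5 pb^-1 (7 h at 10^32) and 10.8 pb^-1 (3 h at 10^33).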
More specifically …
Mixed data in bytestream format is copied to and streamed from the SFOs
What's an SFO?
- Receives event data from Event Filter nodes and writes raw data files
- SFOs are the output of TDAQ and reside at Point 1 on the surface
- Data is copied from SFO disks to T0
- SFOs have a 24 hr buffer and a data rate up to 600 MB/s
- There will be 5-10 SFO PCs
The mixed (bytestream) data is inserted here, at the SFO level
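The quoted buffer depth and rate imply a sizeable aggregate disk capacity for the SFO farm. A rough estimate (assuming the 600 MB/s is the aggregate rate and decimal units, both assumptions not stated on the slide):

```python
def sfo_buffer_tb(hours=24, rate_mb_per_s=600):
    """Aggregate volume the SFO farm must buffer, in decimal TB."""
    return hours * 3600 * rate_mb_per_s * 1e6 / 1e12

print(sfo_buffer_tb())  # -> 51.84, i.e. ~52 TB shared across the 5-10 SFO PCs
```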
Motivation
• If the LHC turned on tomorrow, what would happen to the recorded data?
– We know we are not ready—what don’t we know?
– Realtime tests of hardware, software, databases, storage media, networks, data flow, data integrity and quality checks, calibrations, etc.
– What are the flaws in our processing models?
– Are we able to identify and correct routine running problems? Where are resources needed?
– Tests of distributed computing and analysis machinery (T0 → T1 → T2 → T3)
(these especially should be tested under heavy user load)
[Diagram: data flow from SFO → Tier0 → Tier1, with metadata (trigger config., lumi. info), AOD/ESD/TAG production, DQ monitoring and calibrations, (ongoing) MC prod. import, archiving and export, DPD prod. and reprocessing]
Known unknowns vs. unknown unknowns …
Main FDR steps
Sample preparation
Calibration stream preparation
SFO preparation
FDR run
- SFO to Tier-0 part
- Tier-0 operations
- Data quality, calibration and alignment operations
- Data transfers
- Tier-1 operations
- Tier-2 operations
After main "FDR run"
- Perform distributed analysis, use TAGs, produce DPDs, etc.
- Re-reconstruct from BS after a time
- Remake DPDs as needed for analysis
Schedule
FDR-1
- Sample preparation: 1/3/08 (actual 1/18/08)
- Calibration stream preparation
- SFO preparation
- FDR run: week of Feb 4
- After main "FDR run": Feb 11 and beyond
- (Re)processing at T1: ~March 17 ?
FDR-2 (details later)
- Simulation: already started (done at T2s)
- Digitization: starts April 1
- 90% of RDO at BNL: April 20
- Remaining 10% of RDO at BNL: April 30
- Mixing complete at BNL: May 14
- Data on SFO: May 19
The original estimate for mixing was 30 days
Sample Preparation: Details
• Make use of previously generated events where possible, else generate
• Mix events randomly to get correct physics mixture as expected at HLT output - Fakes will be discussed later
• Event mixing also runs the trigger simulation, only events passing all levels of triggering are kept
• Trigger information is written into event, for later use in analysis
• Events are written into physics streams (+ Express & Calibration)
• Format of written events is bytestream (BS=RAW)
• Files for most streams respect luminosity block (LB) boundaries - not express stream
• Files for physics stream therefore are written per stream, per LB, per SFO
• O(10^7) events for each of FDR-1 and FDR-2
• All MC-truth information is lost by this processing
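The mixing idea in the second bullet can be sketched as weighted sampling from per-process event pools. Everything here (process names, rates, pool contents) is hypothetical; the real mixing runs over full simulation samples and applies the trigger simulation as described above:

```python
import random

# Hypothetical relative rates (cross-section x acceptance) and event pools
rates = {"minbias": 0.97, "w_jets": 0.028, "ttbar": 0.002}
pools = {p: [f"{p}_evt{i}" for i in range(100)] for p in rates}

def mix_events(n, seed=42):
    """Draw n events, choosing the process for each in proportion to its rate."""
    rng = random.Random(seed)
    procs, weights = zip(*rates.items())
    return [rng.choice(pools[rng.choices(procs, weights=weights)[0]])
            for _ in range(n)]

sample = mix_events(5)
print(sample)
```

In the real chain each drawn event would then be passed through the trigger simulation, and only events passing all levels would be written to bytestream.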
Event Mixing: FDR-1
• An evolving compromise between realistic simulation and practical limitations
– Goal: include all significant standard-model processes in known proportions ("mixing") with background events passing trigger chains ("fakes")
- 10 one-hour runs at ~10^31 cm^-2 s^-1 (lumi. varying across run), 1 one-hour run at ~10^32 cm^-2 s^-1 (constant lumi.)
- Produce files in bytestream format, split across streams, SFOs, and luminosity blocks (2 minutes/LB)
– Actually achieved by Thurs Feb 7:
- 8 runs at ~10^31 with no fakes; 2 runs at ~10^31 with some e/γ fakes
– 1 run at ~10^32 finished late, was sent to Tier-1 asynchronously
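Since physics-stream files are written per stream, per LB, per SFO, the 2-minute luminosity blocks imply a concrete file count per run. A rough estimate (the four physics streams and five SFOs are taken from elsewhere in the talk; the express and calibration streams are left out since the express stream does not respect LB boundaries):

```python
def physics_files_per_run(run_minutes=60, lb_minutes=2, n_streams=4, n_sfos=5):
    """Files produced per run: one file per (stream, luminosity block, SFO)."""
    n_lbs = run_minutes // lb_minutes
    return n_lbs * n_streams * n_sfos

print(physics_files_per_run())  # -> 600
```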
Event Mixing: Issues
At 10^32 we have an event rate from minbias of 70 kHz and a trigger output of 200 Hz
The only way to make 10 hr of unbiased data is to start with 2.5B events!
But we only need realistic event rates after the Event Filter: 8M events in total
This low rate is possible only because of trigger rejection and prescales
Need to prefilter and make samples with thresholds matched to triggers – not trivial to get right!
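The 2.5B and 8M figures follow directly from the quoted rates (10 hours of live time assumed, matching the slide):

```python
minbias_rate_hz = 70e3  # minbias event rate into the trigger at 1e32
ef_output_hz = 200      # Event Filter output rate
live_seconds = 10 * 3600

before_trigger = minbias_rate_hz * live_seconds  # events needed for unbiased data
after_trigger = ef_output_hz * live_seconds      # events that survive the trigger
print(before_trigger, after_trigger)  # -> 2.52e9 (~2.5B) and 7.2e6 (~8M)
```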
Mixing logistics:
- Data was not available until Jan 18 (Jan 3 was the original plan)
- All data prestaged to CASTOR disk – output copied back to CASTOR
- Could not get all the data onto the compute nodes fast enough
- CPU time to process events was too long
- To get data in time for T0, had to change strategy: reduce fake samples drastically
Lessons from FDR-1 sample preparation
• Fake samples must be matched in detail to trigger menu
– Fakes need to be prefiltered using same menu as will be used for mixing
• Tier0 is not appropriate for mixing (as noted will use BNL for FDR-2)
• A number of lessons about computing infrastructure:
– Fast turn-around on MC production not yet reliable
– Data distribution is labor intensive
– Software fixes have taken time
• Reallocating Tier0 resources (computers, people) affects operations
– Tier0 managers need time to prepare and maintain infrastructure
– Disk space and CPU capacities are large but limited
FDR-1 Operation: Trigger menu
• The trigger menu for L = 10^31 selects mostly fakes
– The rate of interesting events selected at this lumi. is much lower than the nominal 200 Hz
– Thus, because 8/10 low-lumi. runs do not include enhanced-fake (e.g., EM-like jets) samples, the overall rate is closer to 10 Hz
• Achieved rate is ~50 Hz for runs with enhanced-fake samples
– Conclusions
• This is a menu for first data and detector commissioning
• Not optimal for the final FDR-1 mixed sample
FDR-1 data playing
• Data played from 5 SFOs during 5-8 Feb (Tues-Fri)
– Same events replayed every day with different run numbers
– Reconstruction with different AtlasPoint1 patch every day (for monitoring updates and fixes)
- Flexible: allowed us to run anything at all
- Dangerous: not reproducible; software not properly validated
- Will build AtlasPoint1 13.0.40.2 for Tier1 reconstruction
Exported streams:
- Muon and b physics ("Muon")
- Jets, tau, and missing ET ("Jet")
- Electron and photon ("Egamma")
- Minimum bias ("Minbias")
Non-exported streams:
- Express
- ID tracks for alignment
Express stream
• First use of express stream was successful
– Primary purposes: data quality and calibrations
– Sufficient to detect problems and details, allowing timely validation of data (if DQ is operational)
– Monitoring histograms (detector and performance) available hours after run appeared; shifters spotted problems shortly thereafter
– Histograms moved to AFS to avoid overloading CASTOR with too many requests
– ESDs and AODs not exported—consider temporary storage so users can access them if needed for DQ
Data quality
• Data quality was checked by central shifters and system experts
– Included: Pixel, LAr, Tile, MDT, RPC, L1RPC, ID alignment, e/gamma, jets, missing ET, muon tracks
– Two shifts per day using desk in Tier0 control room
– Exercised ability to spot problems with 10-minute granularity
– Lots of room for improvement; core functionality came online during the week.
Histograms: http://atlasdqm.web.cern.ch/atlasdqm/results.html
Run summary: http://sroe.home.cern.ch/sroe/runlist/query.html
Some unexpected features
- Shoulders at ~150 GeV and ~600 GeV
- TRT timing not configured for one wheel (shift of 15 ns)
Expected DQ flags
Problems were introduced for short time intervals to test data-quality monitoring:
- Run 3062: dead LAr crate (or crates?)
- Run 3062, minutes 20-30: hot LAr cells
- What about the noisy barrel crate?
Calibrations
• Tested the calibration loop: determine calibration constants within 24 hrs of data being recorded and apply new constants during bulk reconstruction
– Pixel calibration planned to run on express stream
• Software only working in 13.X.0; postponed until FDR-2
– TRT calibration running on ID-alignment stream
• First of many iterations completed on Tier0 promptly; remaining iterations finishing on lxbatch
– ID alignment
ID-alignment stream
• First test of a calibration stream
– An efficient and dedicated effort had this ready for testing at the beginning of January
– Used by both ID alignment and TRT calibration in FDR1
– Uses event fragments from L2 ("L2_trck10i_calib")
• Isolated tracks with pT > 10 GeV (5 GeV for FDR-1)
• Select tracks at 60 Hz (L = 10^31) with no additional TDAQ load
• 600k tracks selected for FDR-1 (from a dijet sample; uses 26 GB)
– Attempted to include 50k cosmic events; postponed due to software incompatibilities
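From the numbers on this slide one can back out the average size of a selected event fragment and the stream's bandwidth (a sketch, assuming decimal GB and the quoted 60 Hz selection rate):

```python
n_tracks = 600_000
stream_gb = 26  # total size of the FDR-1 ID-alignment stream

bytes_per_track = stream_gb * 1e9 / n_tracks
rate_mb_per_s = 60 * bytes_per_track / 1e6  # at the 60 Hz selection rate
print(round(bytes_per_track / 1e3, 1), round(rate_mb_per_s, 1))  # -> 43.3 2.6
```

So each selected fragment is ~43 kB, and the stream adds only ~2.6 MB/s on top of the main data flow.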
ID alignment procedure
• The ID alignment group produced constants under tight computing and time constraints
– Only had ID alignment stream from beginning of January for testing
– The alignment procedure necessarily involves iterations
- This is not suitable for Tier0; full implications are still being considered
– Calibration model is being adapted—ID alignment ran outside of production, constants fed back before bulk reconstruction
– Will expect a full test of the new model in FDR-2
ID alignment updating
• A new ID alignment was calculated by 16:00 Thursday Feb 7 (two-day turnaround)
– Used ID-alignment stream;
did not use cosmics or express stream this time
– Only two iterations at level 2 due to time constraints
[Figure: d0, z0, and Q/pT residual distributions comparing perfect, nominal, and aligned geometries]
Daily meeting
• Daily operations meeting at 16:00 from 4-8 Feb
– Review first look at fresh data and shift report
– Complete look at previous day’s data (incl. overnight)
• Calibration signoff—upload new constants?
• If new constants, reprocess express stream ⇒ need to consider resources for this carefully
• Process physics streams if data quality is understood
– Ensure that all data-quality flags have been set
• Need reports from ALL systems and groups
Flagging data quality
• Data-quality assessment has both automatic & manual components
– Automatic tools will be run to check DCS, histograms
– Expert system shifters need to check automatic assessments and possibly override for every run
– After the daily meeting, all assessments are combined into a final flag, which is written into both DB and AODs
– Assessments may span time periods from a single LB to an entire run
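The combination step described above can be sketched as per-interval automatic assessments that a shifter may override, merged into final flags. All names, flag values, and the interval encoding here are hypothetical, not the actual ATLAS DQ schema:

```python
# Keys are (system, (first_lb, last_lb)); values are DQ flags.
automatic = {("LAr", (1, 30)): "GREEN", ("Pixel", (1, 30)): "YELLOW"}
shifter = {("Pixel", (1, 30)): "RED"}  # manual override after the daily meeting

def combine_flags(automatic, shifter):
    """Final flags: start from the automatic assessment, let the shifter win."""
    final = dict(automatic)
    final.update(shifter)  # a manual assessment overrides the automatic one
    return final

final = combine_flags(automatic, shifter)
print(final)
```

The merged result would then be written into both the conditions DB and the AODs, as the slide describes.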
External participation
• Information flow to those outside CERN should be improved
– Ideally, a portal where collaborators can check realtime status of
• Runs recorded
• Runs with express stream processed
• Runs waiting for data-quality signoff
• List of bulk streams processed, and then exported
– How can offsite colleagues participate in meaningful data checks?
Data Distribution
Dataset replication to T2s started immediately after T1s had complete replicas
Data Distribution: Summary
Data replication from CERN to Tier-1s and within all clouds is relatively stable
Notes:
Site problems were fixed by operations team (central and regional) within 24h
Data export from CERN was delayed by 3 days (several technical issues - all under discussion)
Data replication monitoring and book-keeping still have room for improvement
Not enough data to be a transfer challenge - will use other data (M5) for CCRC
Preparation for US analysis effort (and some results)
Ongoing effort in the US to prepare user analysis queues
- queues have been tested and are mostly ready to go (modulo scl3 vs scl4 issues, etc.)
- experts have been very responsive
User participation in AOD analysis frustrated/delayed by:
- lack of a standard analysis package – should be available this week (including trigger info)
- non-existence of the TAGs (?)
- confusion about where to obtain lumi info (not a serious issue for now)
A large set of people have expressed interest in FDR participation – expect a "surge" in activity once analysis doesn't require expert status
Primary DPDs to be generated during production – perhaps during T0 reprocessing?
Individual groups are starting to produce their own secondary & tertiary DPDs (tutorial today)
Dimuon mass – Lashkar Kashif, Harvard
Looking for J/ψ and Upsilon to ee
Andy Nelson, ISU
FDR-2 plan (March 4, 2008)
1. 7h of 10^32 data (2.5 pb^-1)
- with fakes where possible
- some fakes will be reused many times (JF samples); aim to redigitize so the repeated events are not quite the same
2. 3h of 10^33 data (10.8 pb^-1)
- probably no fakes
- with 75 ns pileup
- for some of this data, the beamspot will move
FDR-2 Simulation Production
• Simulation of physics and background for the FDR-2 data sample
• Need to produce before May (1/3 during CCRC-1)
– 0.5M minimum bias and cavern events
– 10M physics events
– 100M fake events
• Simulation HITS (1.5 MB/ev), Digitization RDO (2.5 MB/ev)
• Reconstruction ESD (1 MB/ev), AOD (0.2 MB/ev), TAG (1 kB/ev)
• Simulation + digitization is done at the T2s
• HITS and RDOs uploaded to T1
– HITS to tape at T1
– RDO to CERN for mixing
• Reconstruction done at T1
– ESD, AOD, TAG archived to tape at T1
– ESD copied to other T1s by share (3 full copies world-wide)
– AOD and TAG copied to each other T1
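The per-event sizes above translate into rough storage totals. As a sketch, the volume of each format for the 10M physics events alone (decimal TB; the 100M fakes are left out since most are rejected by the trigger during mixing):

```python
sizes_mb = {"HITS": 1.5, "RDO": 2.5, "ESD": 1.0, "AOD": 0.2, "TAG": 0.001}

def volume_tb(n_events, fmt):
    """Total size of one data format in decimal TB."""
    return n_events * sizes_mb[fmt] * 1e6 / 1e12

for fmt in sizes_mb:
    print(fmt, volume_tb(10_000_000, fmt))  # e.g. HITS -> 15.0 TB
```

Even for physics events only, HITS and RDO dominate (15 and 25 TB), which is why HITS go straight to tape at the T1s and only the RDOs move to CERN for mixing.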
FDR Lessons Learned (so far)
Fake samples must match trigger menu items; purity and rate both important
Software validation must be improved
Calibration procedures need detailed advance planning and coordination
Signoff procedure for bulk processing needs both automation and expert attention
Express-stream reprocessing needs consideration and resources
Always best to try things early
– Finding problems was the goal; most were overcome
– For FDR-2 expect basics to run more smoothly
• Will focus on the operational details and the calibration model
• Tune the existing system; commission new components
• More physics content in streams
• Developers, users and experts have a much better sense of how the system is intended to work; can be more efficient in the future
– All these problems must be faced sometime—better now than later