TRANSCRIPT
1 US ATLAS Transparent Distributed Facility Workshop, UNC 3/4/2008
FDR - Results so far, schedule and challenges Jim Cochran
Iowa State University
Outline
What is the FDR ?
Event mixing – a moving target
Current Schedule
FDR-1 operation: the week of February 4 (Tier 0)
Data distribution
Preparation for US analysis effort
Early user feedback
Lessons Learned (so far)
Plans for FDR-2

Much (most ?) material stolen from talks by Michael Wilson, Ian Hinchcliffe, Dave Charlton, Alexei Klimentov, Kors Bos, …
What is the FDR ?
The Full Dress Rehearsal (FDR) is an attempt to test the complete chain (or as much of it as we can actually test without real collisions)
There are many steps between DAQ output and final plots
The individual steps have been tested; the full chain has not …
As we'll see, the FDR is not a totally realistic test
Basic idea: feed (appropriately mixed*) raw events into the system at DAQ output
Treat, as much as possible, as if real data (calibration, streaming, reco, …)
To happen in two parts (details later):

        Instantaneous luminosity (cm^-2 s^-1)   Integrated luminosity (pb^-1)
FDR-1   ~10^31                                  0.36
FDR-2   ~10^33                                  20-25 (13.3 in the revised plan)
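The integrated-luminosity figures follow from instantaneous luminosity times live time. As a quick sanity check (a sketch: the 1 pb^-1 = 10^36 cm^-2 conversion is standard, and the ten one-hour runs come from the FDR-1 mixing plan later in the talk):

```python
CM2_PER_INV_PB = 1e36  # 1 pb^-1 corresponds to 1e36 cm^-2

def integrated_lumi_pb(inst_lumi_cm2s, seconds):
    """Integrated luminosity (pb^-1) for a constant instantaneous luminosity."""
    return inst_lumi_cm2s * seconds / CM2_PER_INV_PB

# FDR-1: ten one-hour runs at ~1e31 cm^-2 s^-1
print(integrated_lumi_pb(1e31, 10 * 3600))  # -> 0.36
```

The same formula reproduces the FDR-2 plan's 2.5 pb^-1 (7 h at 10^32) and 10.8 pb^-1 (3 h at 10^33).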
More specifically …
Mixed data in bytestream format is copied to and streamed from the SFOs
What's an SFO?
- Receives event data from Event Filter nodes and writes raw data files
- SFOs are the output of TDAQ and reside at Point 1 on the surface
- Data is copied from SFO disks to T0
- SFOs have a 24 hr buffer and a data rate up to 600 MB/s
- There will be 5-10 SFO PCs
The mixed (bytestream) data is inserted here, at the SFO level
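The quoted buffer depth and rate imply a sizeable aggregate disk capacity for the SFO farm. A rough estimate (assuming the 600 MB/s is the aggregate rate and decimal units, both assumptions not stated on the slide):

```python
def sfo_buffer_tb(hours=24, rate_mb_per_s=600):
    """Aggregate volume the SFO farm must buffer, in decimal TB."""
    return hours * 3600 * rate_mb_per_s * 1e6 / 1e12

print(sfo_buffer_tb())  # -> 51.84, i.e. ~52 TB shared across the 5-10 SFO PCs
```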
Motivation
• If the LHC turned on tomorrow, what would happen to the recorded data?
– We know we are not ready—what don’t we know?
– Realtime tests of hardware, software, databases, storage media, networks, data flow, data integrity and quality checks, calibrations, etc.
– What are the flaws in our processing models?
– Are we able to identify and correct routine running problems? Where are resources needed?
– Tests of distributed computing and analysis machinery (T0 → T1 → T2 → T3)
(these especially should be tested under heavy user load)
[Diagram: data flow from SFO → Tier0 → Tier1, with metadata (trigger config., lumi. info), AOD/ESD/TAG production, DQ monitoring and calibrations, (ongoing) MC prod. import, archiving and export, DPD prod. and reprocessing]
Known unknowns vs. unknown unknowns …
Main FDR steps
Sample preparation
Calibration stream preparation
SFO preparation
FDR run
- SFO to Tier-0 part
- Tier-0 operations
- Data quality, calibration and alignment operations
- Data transfers
- Tier-1 operations
- Tier-2 operations
After main "FDR run"
- Perform distributed analysis, use TAGs, produce DPDs, etc.
- Re-reconstruct from BS after a time
- Remake DPDs as needed for analysis
Schedule
FDR-1
- Sample preparation: 1/3/08 (actual 1/18/08)
- Calibration stream preparation
- SFO preparation
- FDR run: week of Feb 4
- After main "FDR run": Feb 11 and beyond
- (Re)processing at T1: ~March 17 ?
FDR-2 (details later)
- Simulation: already started (done at T2s)
- Digitization: starts April 1
- 90% of RDO at BNL: April 20
- Remaining 10% of RDO at BNL: April 30
- Mixing complete at BNL: May 14
- Data on SFO: May 19
The original estimate for mixing was 30 days
Sample Preparation: Details
• Make use of previously generated events where possible, else generate
• Mix events randomly to get correct physics mixture as expected at HLT output - Fakes will be discussed later
• Event mixing also runs the trigger simulation, only events passing all levels of triggering are kept
• Trigger information is written into event, for later use in analysis
• Events are written into physics streams (+ Express & Calibration)
• Format of written events is bytestream (BS=RAW)
• Files for most streams respect luminosity block (LB) boundaries - not express stream
• Files for physics stream therefore are written per stream, per LB, per SFO
• O(10^7) events for each of FDR-1 and FDR-2
• All MC-truth information is lost by this processing
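The mixing idea in the second bullet can be sketched as weighted sampling from per-process event pools. Everything here (process names, rates, pool contents) is hypothetical; the real mixing runs over full simulation samples and applies the trigger simulation as described above:

```python
import random

# Hypothetical relative rates (cross-section x acceptance) and event pools
rates = {"minbias": 0.97, "w_jets": 0.028, "ttbar": 0.002}
pools = {p: [f"{p}_evt{i}" for i in range(100)] for p in rates}

def mix_events(n, seed=42):
    """Draw n events, choosing the process for each in proportion to its rate."""
    rng = random.Random(seed)
    procs, weights = zip(*rates.items())
    return [rng.choice(pools[rng.choices(procs, weights=weights)[0]])
            for _ in range(n)]

sample = mix_events(5)
print(sample)
```

In the real chain each drawn event would then be passed through the trigger simulation, and only events passing all levels would be written to bytestream.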
Event Mixing: FDR-1
• An evolving compromise between realistic simulation and practical limitations
– Goal: include all significant standard-model processes in known proportions ("mixing") with background events passing trigger chains ("fakes")
- 10 one-hour runs at ~10^31 cm^-2 s^-1 (lumi. varying across run), 1 one-hour run at ~10^32 cm^-2 s^-1 (constant lumi.)
- Produce files in bytestream format, split across streams, SFOs, and luminosity blocks (2 minutes/LB)
– Actually achieved by Thurs Feb 7:
- 8 runs at ~10^31 with no fakes; 2 runs at ~10^31 with some e/γ fakes
– 1 run at ~10^32 finished late, was sent to Tier-1 asynchronously
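Since physics-stream files are written per stream, per LB, per SFO, the 2-minute luminosity blocks imply a concrete file count per run. A rough estimate (the four physics streams and five SFOs are taken from elsewhere in the talk; the express and calibration streams are left out since the express stream does not respect LB boundaries):

```python
def physics_files_per_run(run_minutes=60, lb_minutes=2, n_streams=4, n_sfos=5):
    """Files produced per run: one file per (stream, luminosity block, SFO)."""
    n_lbs = run_minutes // lb_minutes
    return n_lbs * n_streams * n_sfos

print(physics_files_per_run())  # -> 600
```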
Event Mixing: Issues
At 10^32 we have an event rate from minbias of 70 kHz and a trigger output of 200 Hz
The only way to make 10 hr of unbiased data is to start with 2.5B events!
But we only need realistic event rates after the Event Filter: 8M events in total
This low rate is possible only because of trigger rejection and prescales
Need to prefilter and make samples with thresholds matched to triggers – not trivial to get right!
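The 2.5B and 8M figures follow directly from the quoted rates (10 hours of live time assumed, matching the slide):

```python
minbias_rate_hz = 70e3  # minbias event rate into the trigger at 1e32
ef_output_hz = 200      # Event Filter output rate
live_seconds = 10 * 3600

before_trigger = minbias_rate_hz * live_seconds  # events needed for unbiased data
after_trigger = ef_output_hz * live_seconds      # events that survive the trigger
print(before_trigger, after_trigger)  # -> 2.52e9 (~2.5B) and 7.2e6 (~8M)
```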
Mixing logistics:
- Data was not available until Jan 18 (Jan 3 was the original plan)
- All data prestaged to CASTOR disk – output copied back to CASTOR
- Could not get all the data onto the compute nodes fast enough
- CPU time to process events was too long
- To get data in time for T0, had to change strategy: reduce fake samples drastically
Lessons from FDR-1 sample preparation
• Fake samples must be matched in detail to trigger menu
– Fakes need to be prefiltered using same menu as will be used for mixing
• Tier0 is not appropriate for mixing (as noted will use BNL for FDR-2)
• A number of lessons about computing infrastructure:
– Fast turn-around on MC production not yet reliable
– Data distribution is labor intensive
– Software fixes have taken time
• Reallocating Tier0 resources (computers, people) affects operations
– Tier0 managers need time to prepare and maintain infrastructure
– Disk space and CPU capacities are large but limited
FDR-1 Operation: Trigger menu
• The trigger menu for L = 10^31 selects mostly fakes
– The rate of interesting events selected at this lumi. is much lower than the nominal 200 Hz
– Thus, because 8/10 low-lumi. runs do not include enhanced-fake (e.g., EM-like jets) samples, the overall rate is closer to 10 Hz
• Achieved rate is ~50 Hz for runs with enhanced-fake samples
– Conclusions
• This is a menu for first data and detector commissioning
• Not optimal for the final FDR-1 mixed sample
FDR-1 data playing
• Data played from 5 SFOs during 5-8 Feb (Tues-Fri)
– Same events replayed every day with different run numbers
– Reconstruction with different AtlasPoint1 patch every day (for monitoring updates and fixes)
- Flexible: allowed us to run anything at all
- Dangerous: not reproducible; software not properly validated
- Will build AtlasPoint1 13.0.40.2 for Tier1 reconstruction
Exported streams:
- Muon and b physics ("Muon")
- Jets, tau, and missing ET ("Jet")
- Electron and photon ("Egamma")
- Minimum bias ("Minbias")
Non-exported streams:
- Express
- ID tracks for alignment
Express stream
• First use of express stream was successful
– Primary purposes: data quality and calibrations
– Sufficient to detect problems and details, allowing timely validation of data (if DQ is operational)
– Monitoring histograms (detector and performance) available hours after run appeared; shifters spotted problems shortly thereafter
– Histograms moved to AFS to avoid overloading CASTOR with too many requests
– ESDs and AODs not exported—consider temporary storage so users can access them if needed for DQ
Data quality
• Data quality was checked by central shifters and system experts
– Included: Pixel, LAr, Tile, MDT, RPC, L1RPC, ID alignment, e/gamma, jets, missing ET, muon tracks
– Two shifts per day using desk in Tier0 control room
– Exercised ability to spot problems with 10-minute granularity
– Lots of room for improvement; core functionality came online during the week.
Histograms: http://atlasdqm.web.cern.ch/atlasdqm/results.html
Run summary: http://sroe.home.cern.ch/sroe/runlist/query.html
Some unexpected features
- Shoulders at ~150 GeV and ~600 GeV
- TRT timing not configured for one wheel (shift of 15 ns)
Expected DQ flags
Problems were introduced for short time intervals to test data-quality monitoring:
- Run 3062: dead LAr crate (or crates?)
- Run 3062, minutes 20-30: hot LAr cells
- What about the noisy barrel crate?
Calibrations
• Tested the calibration loop: determine calibration constants within 24 hrs of data being recorded and apply new constants during bulk reconstruction
– Pixel calibration planned to run on express stream
• Software only working in 13.X.0; postponed until FDR-2
– TRT calibration running on ID-alignment stream
• First of many iterations completed on Tier0 promptly; remaining iterations finishing on lxbatch
– ID alignment
ID-alignment stream
• First test of a calibration stream
– An efficient and dedicated effort had this ready for testing at the beginning of January
– Used by both ID alignment and TRT calibration in FDR1
– Uses event fragments from L2 ("L2_trck10i_calib")
• Isolated tracks with pT > 10 GeV (5 GeV for FDR-1)
• Select tracks at 60 Hz (L = 10^31) with no additional TDAQ load
• 600k tracks selected for FDR-1 (from a dijet sample; uses 26 GB)
– Attempted to include 50k cosmic events; postponed due to software incompatibilities
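From the numbers on this slide one can back out the average size of a selected event fragment and the stream's bandwidth (a sketch, assuming decimal GB and the quoted 60 Hz selection rate):

```python
n_tracks = 600_000
stream_gb = 26  # total size of the FDR-1 ID-alignment stream

bytes_per_track = stream_gb * 1e9 / n_tracks
rate_mb_per_s = 60 * bytes_per_track / 1e6  # at the 60 Hz selection rate
print(round(bytes_per_track / 1e3, 1), round(rate_mb_per_s, 1))  # -> 43.3 2.6
```

So each selected fragment is ~43 kB, and the stream adds only ~2.6 MB/s on top of the main data flow.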
ID alignment procedure
• The ID alignment group produced constants under tight computing and time constraints
– Only had ID alignment stream from beginning of January for testing
– The alignment procedure necessarily involves iterations
- This is not suitable for Tier0; full implications are still being considered
– Calibration model is being adapted—ID alignment ran outside of production, constants fed back before bulk reconstruction
– Will expect a full test of the new model in FDR-2
ID alignment updating
• A new ID alignment was calculated by 16:00 Thursday Feb 7 (two-day turnaround)
– Used ID-alignment stream;
did not use cosmics or express stream this time
– Only two iterations at level 2 due to time constraints
[Figure: d0, z0, and Q/pT residual distributions comparing perfect, nominal, and aligned geometries]
Daily meeting
• Daily operations meeting at 16:00 from 4-8 Feb
– Review first look at fresh data and shift report
– Complete look at previous day’s data (incl. overnight)
• Calibration signoff—upload new constants?
• If new constants, reprocess express stream ⇒ need to consider resources for this carefully
• Process physics streams if data quality is understood
– Ensure that all data-quality flags have been set
• Need reports from ALL systems and groups
Flagging data quality
• Data-quality assessment has both automatic & manual components
– Automatic tools will be run to check DCS, histograms
– Expert system shifters need to check automatic assessments and possibly override for every run
– After the daily meeting, all assessments are combined into a final flag, which is written into both DB and AODs
– Assessments may span time periods from a single LB to an entire run
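The combination step described above can be sketched as per-interval automatic assessments that a shifter may override, merged into final flags. All names, flag values, and the interval encoding here are hypothetical, not the actual ATLAS DQ schema:

```python
# Keys are (system, (first_lb, last_lb)); values are DQ flags.
automatic = {("LAr", (1, 30)): "GREEN", ("Pixel", (1, 30)): "YELLOW"}
shifter = {("Pixel", (1, 30)): "RED"}  # manual override after the daily meeting

def combine_flags(automatic, shifter):
    """Final flags: start from the automatic assessment, let the shifter win."""
    final = dict(automatic)
    final.update(shifter)  # a manual assessment overrides the automatic one
    return final

final = combine_flags(automatic, shifter)
print(final)
```

The merged result would then be written into both the conditions DB and the AODs, as the slide describes.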
External participation
• Information flow to those outside CERN should be improved
– Ideally, a portal where collaborators can check realtime status of
• Runs recorded
• Runs with express stream processed
• Runs waiting for data-quality signoff
• List of bulk streams processed, and then exported
– How can offsite colleagues participate in meaningful data checks?
Data Distribution
Dataset replication to T2s started immediately after T1s had complete replicas
Data Distribution: Summary
Data replication from CERN to Tier-1s and within all clouds is relatively stable
Notes:
Site problems were fixed by operations team (central and regional) within 24h
Data export from CERN was delayed by 3 days (several technical issues - all under discussion)
Data replication monitoring and book-keeping still have room for improvement
Not enough data to be a transfer challenge - will use other data (M5) for CCRC
Preparation for US analysis effort (and some results)
Ongoing effort in the US to prepare user analysis queues
- queues have been tested and are mostly ready to go (modulo scl3 vs scl4 issues, etc.)
- experts have been very responsive
User participation in AOD analysis frustrated/delayed by:
- lack of a standard analysis package – should be available this week (including trigger info)
- non-existence of the TAGs (?)
- confusion about where to obtain lumi info (not a serious issue for now)
A large set of people have expressed interest in FDR participation – expect a "surge" in activity once analysis doesn't require expert status
Primary DPDs to be generated during production – perhaps during T0 reprocessing?
Individual groups are starting to produce their own secondary & tertiary DPDs (tutorial today)
Dimuon mass – Lashkar Kashif, Harvard
Looking for J/ψ and Upsilon to ee
Andy Nelson, ISU
FDR-2 plan (March 4, 2008)
1. 7h of 10^32 data (2.5 pb^-1)
- with fakes where possible
- some fakes will be reused many times (JF samples); aim to redigitize so the repeated events are not quite the same
2. 3h of 10^33 data (10.8 pb^-1)
- probably no fakes
- with 75 ns pileup
- for some of this data, the beamspot will move
FDR-2 Simulation Production
• Simulation of physics and background for the FDR-2 data sample
• Need to produce before May (1/3 during CCRC-1)
– 0.5M minimum bias and cavern events
– 10M physics events
– 100M fake events
• Simulation HITS (1.5 MB/ev), Digitization RDO (2.5 MB/ev)
• Reconstruction ESD (1 MB/ev), AOD (0.2 MB/ev), TAG (1 kB/ev)
• Simulation + digitization is done at the T2s
• HITS and RDOs uploaded to T1
– HITS to tape at T1
– RDO to CERN for mixing
• Reconstruction done at T1
– ESD, AOD, TAG archived to tape at T1
– ESD copied to other T1s by share (3 full copies world-wide)
– AOD and TAG copied to each other T1
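The per-event sizes above translate into rough storage totals. As a sketch, the volume of each format for the 10M physics events alone (decimal TB; the 100M fakes are left out since most are rejected by the trigger during mixing):

```python
sizes_mb = {"HITS": 1.5, "RDO": 2.5, "ESD": 1.0, "AOD": 0.2, "TAG": 0.001}

def volume_tb(n_events, fmt):
    """Total size of one data format in decimal TB."""
    return n_events * sizes_mb[fmt] * 1e6 / 1e12

for fmt in sizes_mb:
    print(fmt, volume_tb(10_000_000, fmt))  # e.g. HITS -> 15.0 TB
```

Even for physics events only, HITS and RDO dominate (15 and 25 TB), which is why HITS go straight to tape at the T1s and only the RDOs move to CERN for mixing.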
FDR Lessons Learned (so far)
Fake samples must match trigger menu items; purity and rate both important
Software validation must be improved
Calibration procedures need detailed advance planning and coordination
Signoff procedure for bulk processing needs both automation and expert attention
Express-stream reprocessing needs consideration and resources
Always best to try things early
– Finding problems was the goal; most were overcome
– For FDR-2 expect basics to run more smoothly
• Will focus on the operational details and the calibration model
• Tune the existing system; commission new components
• More physics content in streams
• Developers, users and experts have a much better sense of how the system is intended to work; can be more efficient in the future
– All these problems must be faced sometime—better now than later