2nd September 2008, Richard Hawkings / Paul Laycock
Conditions data handling in FDR2c
Tag hierarchies set up (largely by Paul) and communicated in advance
  No real problems uploading data to the correct tag
  Calibration experts starting to deal with ‘real’ IOVs (intervals of validity: data valid for the calibration period)
New POOL file registration scripts worked fine
  Calibration users need to be in AFS group atlcond:poolcond
  Consider doing calibration uploads from a ‘calibration’ account, not personal ones?
  No instances of data in COOL with a missing or wrong corresponding POOL file upload
No use of run-signoff database pages yet
  System was not yet ready or integrated (holidays; too busy with other things)
  But only one set of runs, and all calibrations were ‘accepted’, so no real test
Handling of detector status information works technically
  Merging and transfer to LBSUMM folder (for ESD/AOD) still done by hand
  Limited mapping of DQ histograms to status flags restricts usefulness
  Need to make sure this improves for real data
  Need to clarify how detector status flags are dealt with in ES1, ES2 processing
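
To illustrate what a ‘real’ IOV implies for a lookup, here is a minimal sketch in plain Python. It does not use the COOL API; the run numbers and payload names are made up, and the half-open [since, until) convention is an assumption for the example.

```python
from bisect import bisect_right

# Hypothetical IOVs: (since_run, until_run_exclusive, payload).
# Assumed sorted by since_run and non-overlapping.
IOVS = [
    (1000, 1010, "calib_A"),
    (1010, 1025, "calib_B"),
    (1025, 1040, "calib_C"),
]

def payload_for_run(iovs, run):
    """Return the payload whose IOV covers `run`, or None if no IOV does."""
    starts = [since for since, _, _ in iovs]
    # Index of the last IOV starting at or before `run`.
    i = bisect_right(starts, run) - 1
    if i < 0:
        return None
    since, until, payload = iovs[i]
    return payload if since <= run < until else None

if __name__ == "__main__":
    for run in (999, 1010, 1039, 1040):
        print(run, payload_for_run(IOVS, run))
```

With a single open-ended IOV every run would match the same payload; bounded IOVs make it visible when a run falls outside any calibration period.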
Conditions DB access problems
Big problems in Tier-0 conditions DB access Thursday night / Friday morning
  Combination of several factors
2 of the 4 Oracle server nodes got into trouble and restarted
  Kernel patch being applied this week, some interdependencies not fully understood yet
  Server full of ‘stuck’ connections which were never released or cleaned up - deadlock
Very high load due to FDR2 bulk reprocessing and cosmics reprocessing going on in parallel, plus FCT, ATN, RTT, TCT tests, plus user jobs
All jobs accessing Oracle directly, no use of SQLite replicas at present
  Replica only useful once the run is ended online - applicable to ES2, bulk reco only
Vulnerability in that ALL Athena jobs accessing Oracle use the same reader account
  Limit of 800 concurrent sessions, now changed to 4 x 800
  Each Athena job holds O(10) connections in parallel until end of first event (one per subdetector schema), typically for 5 minutes or so. Vulnerable to ‘deadlock’
Further actions being pursued
  Deploy SQLite replica for bulk processing (but not for cosmics / express stream)
  Use a dedicated COOL reader account for Tier-0 jobs, to guarantee the number of connections
  Reduce connection load from Athena jobs (short/long term actions)
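
The session-limit numbers above translate into a rough job budget. The sketch below is back-of-envelope arithmetic only, assuming "O(10)" means exactly 10 connections per job, "5 minutes or so" means exactly 5, and that 4 x 800 simply adds up to 3200 sessions; none of these are measured values.

```python
def max_jobs_in_startup(session_limit, connections_per_job):
    """Jobs that can hold their startup connections at once under the limit."""
    return session_limit // connections_per_job

def sustainable_start_rate(session_limit, connections_per_job, hold_minutes):
    """Steady-state job starts per minute (Little's law: rate = concurrency / hold time)."""
    return max_jobs_in_startup(session_limit, connections_per_job) / hold_minutes

CONNECTIONS_PER_JOB = 10  # assumption: "O(10)" read as exactly 10
HOLD_MINUTES = 5          # assumption: "5 minutes or so" read as exactly 5

if __name__ == "__main__":
    for label, limit in [("old limit", 800), ("new limit", 4 * 800)]:
        jobs = max_jobs_in_startup(limit, CONNECTIONS_PER_JOB)
        rate = sustainable_start_rate(limit, CONNECTIONS_PER_JOB, HOLD_MINUTES)
        print(f"{label}: {jobs} jobs in startup at once, about {rate:.0f} job starts/minute")
```

Under these assumptions the shared reader account saturates at about 80 jobs in their startup phase with the old limit and about 320 with the new one, which is why reducing connections per job and separating the Tier-0 reader account both help.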
Next steps - discussion needed
Work on conditions DB access problems
  Deployment of SQLite replicas to be used where possible
Start to set up tag hierarchies for first data
  Separate top-level tags to be used by HLT, monitoring, Tier-0, reprocessing
Define calibration loop model for first data
  Cosmics processing has no calibration loop, and several ‘express’ streams
  Same plan for single beam running, or move to ‘calibration loop’?
  The 24-hour calibration period might be needed for code fixes even if no prompt calibration can be done yet; might have multiple processings at Tier-0
What to do for first collisions?
  Sign-off tool and Tier-0/conditions integration to support all this ..?
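
The rule "SQLite replicas to be used where possible" could be written down as a simple selection function. This is a sketch only: the function name, the processing-type labels and the returned strings are hypothetical, and the conditions are taken from the statements that a replica is only useful once the run has ended online and applies to ES2 and bulk reconstruction, not to cosmics or the express stream.

```python
# Processing types for which a replica is assumed usable (hypothetical labels).
REPLICA_ELIGIBLE = {"ES2", "bulk"}

def choose_conditions_source(processing, run_ended_online):
    """Return 'sqlite_replica' when a replica can be used, else 'oracle'."""
    if processing in REPLICA_ELIGIBLE and run_ended_online:
        return "sqlite_replica"
    # Cosmics, express stream, or a run still in progress: read Oracle directly.
    return "oracle"

if __name__ == "__main__":
    for processing, ended in [("bulk", True), ("ES2", False), ("express", True), ("cosmics", True)]:
        print(processing, ended, choose_conditions_source(processing, ended))
```

Every job routed to a replica is one that no longer holds Oracle sessions at startup, so the open question is mainly how much of the Tier-0 load falls into the eligible categories.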