Download - LHC Physics Analysis and Databases
LHC Physics Analysis and Databases
Maaike Limper
Introduction
New Oracle sponsored CERN OpenLab fellow: Maaike Limper, started January 2012
Project outline: Investigate possibility of doing LHC-scale data reconstruction and/or physics analysis within an Oracle database
MINI-CV: Maaike Limper has a master in Physics from the University of Amsterdam, and a PhD in Particle Physics completed with Nikhef (The Dutch National Institute for Particle Physics). As a part of her PhD and sub-sequent post-doc she worked on the ATLAS experiment, one of the experiments that measures events produced by the Large Hadron Collider at CERN. During her work as an ATLAS physicist, Maaike worked on the construction of the silicon tracker, developed track and vertex reconstruction algorithms and participated in the analysis of the first LHC data recorded with the ATLAS detector. Maaike was also involved in developing some of the ATLAS data-taking conditions databases for the pixel detector. As Prompt Reconstruction Coordinator for ATLAS, she was responsible for the reconstruction of all data recorded by the ATLAS detector. As of January 2012 she works full-time as an Oracle funded Openlab fellow for the CERN IT department.
2LHC physics analysis and databases - M. Limper
Lake Geneva
CERNLHC
ATLAS
Large Hadron Collider at CERN
Four main experiments recording events produced by the Large Hadron Collider: ATLAS, CMS, LHCb and ALICE
Examples of LHC-scale data-processing from my experience with the ATLAS experiment
3LHC physics analysis and databases - M. Limper
LHC-scale data processing
4
generationsimulationdigitization
reconstruction
analysis
interactivephysics analysis (thousands of users!)
eventanalysis
raw data
eventreconstruction
analysis objects(extracted per physics topic)
data acquisition
event data taking
event summary data
simulated raw data
ntuple1
ntuple2
ntupleN
eventsimulation
LHC physics analysis and databases - M. Limper
~20 thousand events that produced a Standard Model Higgs with a mass of 150 GeV~300 billion inelastic proton-proton interactions ATLAS uses a flexible trigger menu to determine which events are interesting enough to record…
ATLAS recorded 1.6 billion events in 2011
Event data taking
5
raw data
data acquisition
event data taking
In 2011 LHC delivered 5.61 fb-1 of p-p collision data
LHC physics analysis and databases - M. Limper
ATLAS data reconstruction
6
Raw data (detector hits, energy depositions etc) reconstructed by the ATLAS (C++) software framework
Reconstruction task examples:• Fit particle trajectories from
hits measured in the inner detector
• Cluster energy deposits measured in the calorimeter to reconstruct “jets” of particles
• Fit trajectory from hits in muon spectrometer
• Combine track information to determine muon candidate from interaction point
LHC physics analysis and databases - M. Limper
ATLAS data reconstruction
7
Raw data (detector hits, energy depositions etc) reconstructed by the ATLAS (C++) software framework
Reconstruction task examples:• Fit particle trajectories from
hits measured in the inner detector
• Cluster energy deposits measured in the calorimeter to reconstruct “jets” of particles
• Fit trajectory from hits in muon spectrometer
• Combine track information to determine muon candidate from interaction point
LHC physics analysis and databases - M. Limper
ATLAS data reconstruction
8
Raw data (detector hits, energy depositions etc) reconstructed by the ATLAS (C++) software framework
Reconstruction task examples:• Fit particle trajectories from
hits measured in the inner detector
• Cluster energy deposits measured in the calorimeter to reconstruct “jets” of particles
• Fit trajectory from hits in muon spectrometer
• Combine track information to determine muon candidate from interaction point
LHC physics analysis and databases - M. Limper
ATLAS data reconstruction
Reconstruction task examples:• Fit particle trajectories from
hits measured in the inner detector
• Cluster energy deposits measured in the calorimeter to reconstruct “jets” of particles
• Fit trajectory from hits in muon spectrometer
• Combine track information to determine muon candidate from interaction point
9
Raw data (detector hits, energy depositions etc) reconstructed by the ATLAS (C++) software framework
LHC physics analysis and databases - M. Limper
ATLAS data reconstruction
Reconstruction task examples:• Fit particle trajectories from
hits measured in the inner detector
• Cluster energy deposits measured in the calorimeter to reconstruct “jets” of particles
• Fit trajectory from hits in muon spectrometer
• Combine track information to determine muon candidate from interaction point
10
Raw data (detector hits, energy depositions etc) reconstructed by the ATLAS (C++) software framework
LHC physics analysis and databases - M. Limper
ATLAS real physics event example: ATLAS data analysis
Reconstruction focuses on creating physics objects from the information measured in the detectorAnalysis focuses on interpreting information from the reconstructed objects to determine what type of event took place
Z->mm candidate, mmm=93.4 GeV
Example: apply quality criteria on muon candidates and calculate the invariant mass from the sum of the muon 4-momentum to find a Z-boson candidate 11LHC physics analysis and databases - M. Limper
ATLAS data reconstructionATLAS uses flexible trigger menus to reduce data-taking rate to ~300 recorded events/second (2011 rate)Tier-0 computing center at CERN has ~3000 CPUs available to reconstruct ATLAS data while it is recordedInitial data gets reconstructed twice:• “Express reconstruction” during data-taking• “Bulk reconstruction” 36 hours after end of run, using
beamspot and calibration constants derived from express reconstruction
12
Reprocessing campaigns every 2/3 months to re-reconstruct all data with latest version of reconstruction software
number of CPUs increased during busy periods
expressbulk reco
LHC physics analysis and databases - M. Limper
Event Simulation
In addition to the real physics events, physicists require simulated (MC) events to compare/test/understand their resultsEach physics group requests sets of signal and background samples ~100 million simulated events requestedGenerationSimulationDigitizationReconstructionDuring each ATLAS reprocessing campaign new version of all simulation samples are provided
13
generationsimulationdigitization
simulated raw data
eventsimulation
LHC physics analysis and databases - M. Limper
Data analysis
ROOT-ntuples are centrally produced by physics groups from previously reconstructed event summary dataEach physics group determines specific content of ntuple• Physics objects to include • Level of detail to be stored per physics object• Event filter and/or pre-analysis steps
14
event summary data
ntuple1
ntuple2
ntupleNVariables stored for each event in the form of: • scalar (example: missing energy, number of reconstructed muons)• vectors (example: energy, direction, momentum of reconstructed muons)• vector-of-vectors (example: position of hits on reconstructed muons)
Physics analysis at LHC is mainly done with ROOT• C++• analysis tools (plotting/fitting/statistical analysis)
LHC physics analysis and databases - M. Limper
Physics Analysis from DB
Benchmark Physics Analys in an Oracle DB:• Simplified version of HZbbll analysis (search for standard model
Higgs boson produced in association with a Z-boson)• Select muon-candidates to recontruct Z-peak• Select b-jet-candidates to reconstruct Higgs-peak
• Signal sample: 29887 events (3 ntuples)• Background sample (Z->mumu+jets): 289916 events (30 ntuples)• Use ntuple defined by ATLAS Top Physics Group: ”NTUP_TOP”
• 4212 physics attributes per event
Initial challenges:Large number of attributes, many of which are vector-type, difficult to implement in a single table, so I divided data over multiple tablesNeed to select data with SQL-queries instead of C++ code containing a loop over all events in the file
15LHC physics analysis and databases - M. Limper
Physics DB Initial DB implementation holds 695 out of 4212 variables (16.5%):• “EventData”-table: 184 columns
184 event-related variables (scalar), primaryKey=(RunNumber,EventNumber)
• “muon”-table: 271 columns, 268 muon-related variables (muon-vector content), primaryKey=(muonId,RunNumber,EventNumber), foreignKey=(RunNumber,EventNumber)
• “jet”-table: 193 columns 190 jet-related variables (jet-vector content), primaryKey=(jetId,RunNumber,EventNumber), foreignKey=(RunNumber,EventNumber)
• “MET”-table: 55 columns 53 MET (Missing Transverse Energy)-related variables (scalar), primaryKey=(RunNumber,EventNumber), foreignKey=(RunNumber,EventNumber)
16LHC physics analysis and databases - M. Limper
ROOT-ntuple size is 880 MBCurrent DB-size per stored ntuple (16.5% of contents) is ~ 200 MBFull DB-size would be ~1.2 GB per ntuple ~2.6 GB per ntuple
Physics Analysis
Simplified version of HZbbll analysis:• muon selection: “IsMuon”-function to return TRUE, include
requirement pT>20 GeV and |η|<2.4 plus several requirement on hits and holes on tracks
• Require exactly 2 selected muons per event• b-jet selection: tranverse momentum greater than pT>25 GeV, |η|<2.5
and “flavour_weight_Comb”>1.55 (to select b-jets)• Require exactly 2 selected b-jets per event• Require 1 of the 2 b-jets to have pT>45 GeV
• Plot “invariant mass” of muons (Z-peak) and of b-jets (Higgs-peak)
Two versions of this analysis:• Standard ntuple-analysis in ROOT (C++) using locally stored ntuples• Analysis from Oracle Physics DB running on same machine as DB and using
functions implemented in “PHYSANALYSIS” PL/SQL-package: “IsMuon”, “InvariantMassLeptons, “InvariantMassJets”
17LHC physics analysis and databases - M. Limper
select MLIMPER.PHYSANALYSIS.INV_MASS_LEPTONS(mu1."E",mu2."E",mu1."px",mu2."px",mu1."py", mu2."py",mu1."pz",mu2."pz")/1000. as "DiMuonMass", MLIMPER.PHYSANALYSIS.INV_MASS_JETS(jet1."E",jet2."E",jet1."pt",jet2."pt",jet1."phi",jet2."phi",jet1."eta",jet2."eta")/1000. as "DiJetMass" from selectedmuon mu1, selectedmuon mu2, selectedbjet jet1, selectedbjet jet2, selectedevents evSelwhere mu1."muon_i"<mu2."muon_i" and mu1."EventNumber"=evSel."EventNumber" and mu2."EventNumber"=evSel."EventNumber" and jet1."jet_i"<jet2."jet_i" and jet1."EventNumber"=evSel."EventNumber" and jet2."EventNumber"=evSel."EventNumber" and jet1."pt"/1000.>45.
with selectedmuon as (select "muon_i","EventNumber","RunNumber","E","px","py","pz" from "muon" where MLIMPER.PHYSANALYSIS.IS_MUON("muon_i", "pt", "eta", "phi", "E", "me_qoverp_exPV", "id_qoverp_exPV","me_theta_exPV", "id_theta_exPV", "id_theta","isCombinedMuon", "isLowPtReconstructedMuon","tight","expectBLayerHit", "nBLHits", "nPixHits","nPixelDeadSensors", "nPixHoles", "nSCTHits","nSCTDeadSensors", "nSCTHoles","nTRTHits", "nTRTOutliers",0,20000.,2.4) = 1 ),selectedeventsmuon as (select "EventNumber", COUNT(*) as "mu_sel_n" from selectedmuon group by "EventNumber" HAVING COUNT(*)=2),
selectedbjet as (select "jet_i","EventNumber","RunNumber","E","pt","phi","eta" from "jet" INNER JOIN selectedeventsmuon USING("EventNumber") where "pt"/1000>25 and abs("eta")<2.5 and "fl_w_Comb">1.55 ),selectedevents as (select "EventNumber", COUNT(*) as "jet_sel_n" from selectedbjet group by "EventNumber" HAVING COUNT(*)=2)
Physics Analysis in SQLDone using SQL-query making temporary tables for different selections and joining data from different tables via “EventNumber”
18
select MLIMPER.PHYSANALYSIS.INV_MASS_LEPTONS(mu1."E",mu2."E",mu1."px",mu2."px",mu1."py", mu2."py",mu1."pz",mu2."pz")/1000. as "DiMuonMass", MLIMPER.PHYSANALYSIS.INV_MASS_JETS(jet1."E",jet2."E",jet1."pt",jet2."pt",jet1."phi",jet2."phi",jet1."eta",jet2."eta")/1000. as "DiJetMass" from selectedmuon mu1, selectedmuon mu2, selectedbjet jet1, selectedbjet jet2, selectedevents evSelwhere mu1."muon_i"<mu2."muon_i" and mu1."EventNumber"=evSel."EventNumber" and mu2."EventNumber"=evSel."EventNumber" and jet1."jet_i"<jet2."jet_i" and jet1."EventNumber"=evSel."EventNumber" and jet2."EventNumber"=evSel."EventNumber" and jet1."pt"/1000.>45.
LHC physics analysis and databases - M. Limper
with selectedmuon as (select "muon_i","EventNumber","RunNumber","E","px","py","pz" from "muon" where MLIMPER.PHYSANALYSIS.IS_MUON("muon_i", "pt", "eta", "phi", "E", "me_qoverp_exPV", "id_qoverp_exPV","me_theta_exPV", "id_theta_exPV", "id_theta","isCombinedMuon", "isLowPtReconstructedMuon","tight","expectBLayerHit", "nBLHits", "nPixHits","nPixelDeadSensors", "nPixHoles", "nSCTHits","nSCTDeadSensors", "nSCTHoles","nTRTHits", "nTRTOutliers",0,20000.,2.4) = 1 ),selectedeventsmuon as (select "EventNumber", COUNT(*) as "mu_sel_n" from selectedmuon group by "EventNumber" HAVING COUNT(*)=2),
selectedbjet as (select "jet_i","EventNumber","RunNumber","E","pt","phi","eta" from "jet" INNER JOIN selectedeventsmuon USING("EventNumber") where "pt"/1000>25 and abs("eta")<2.5 and "fl_w_Comb">1.55 ),selectedevents as (select "EventNumber", COUNT(*) as "jet_sel_n" from selectedbjet group by "EventNumber" HAVING COUNT(*)=2)
Physics Analysis benchmark
Output of SQL-query send to ROOT to produce standard root-histograms:
19
ROOT-macro using original ntuples as input produces identical histograms:
LHC physics analysis and databases - M. Limper
Physics Analysis in DB
Both tests done on CERN virtual machine, 2 GB RAM, 2 CPU, SLC5 64-bitAverage time of analysis measured after reboot of virtual machineSpeed from ntuple scales depends on:
• number of files • number of used ntuple-branches (=physics-attributes)
DB-speed depends on:• clever implementation of SQL-query: I’m not (yet) an SQL-guru…• table-size: select from jet-table much slower than from muon-table, as
more jet-objects stored per event
20
Time to produce plots from physics DB vs from ntuple-files
Sample #events #sel.events time from DB time from ntupleHZbbll 9993 421 5 s ?? 7 sHZbbll 29987 1326 95 s 14 sZ+2jets 289916 170 85 s 110 s
LHC physics analysis and databases - M. Limper
Physics Analysis in DB
Possible space-gain for DB version of analysis data:Each physics group optimized their own ntuple-size based on physics and level of detail required for their analysis, but sum of different ntuple contains duplicate infoPhysics Analysis DB would contains all physics objects, divided over multiple tables, each physics group can choose which tables to use
21
First implementation of Physics Analysis in Oracle DB • Data in multiple tables• SQL-query can reproduce selection in loop over events• Analysis from DB speed similar to original ntuple-analysis but
many complexities still not implemented…
analysis objects(extracted per physics topic)
event summary data
ntuple1
ntuple2
ntupleN
LHC physics analysis and databases - M. Limper
Physics Analysis in DB
Possible space-gain for DB version of analysis data:Each physics group optimized their own ntuple-size based on physics and level of detail required for their analysis, but sum of different ntuple contains duplicate infoPhysics Analysis DB would contains all physics objects, divided over multiple tables, each physics group can choose which tables to use
22
First implementation of Physics Analysis in Oracle DB • data in multiple tables• SQL-query can reproduce selection in loop over events• Analysis from DB slower than original ntuple-analysis and many
complexities still not implemented…
analysis objectsstored in databaseevent
summary data
physicsDB
LHC physics analysis and databases - M. Limper
Physics Analysis in DB
Space requirement for realistic physics DB (order of magnitude):~1.2 2.6 GB per ntuple (10k events)~2 billion events data+simulation in 2011~240 TB 520 TB of analysis data~10 revisions (reconstruction software versions) actively analyzed at a given time (more revisions may need to be archived) ~2400 TB of analysis data~factor 4 more data from LHC expected in 2012Analysis DB would need to be accessible by thousands of users!
duplicates of DB at different analysis sites needed
2323
analysis objectsstored in databaseevent
summary data
physicsDB
LHC physics analysis and databases - M. Limper
To be continued…Reconstruction versus Analysis of LHC data
Analysis data is relatively easily organized in tables and columns Reconstruction of data starts from raw-event data, less easily organized in tables and columns and likely to require data in blobsAnalysis uses many select-type arguments and functions than can be implemented in PL/SQLReconstruction of data in DB will require use of external agent to run the experiment’s C++ reconstruction software at the databaseAnalysis data will need to be accessible by many users doing many different analysis’ at once Reconstruction tasks are centrally organized, data is reconstructed during data-taking and re-reconstructed during re-processing campaigns
24LHC physics analysis and databases - M. Limper
How to optimize the use of Oracle services for LHC-scale data processing?