Eduardo Rodrigues, University of Cincinnati
On behalf of the LHCb collaboration
Data Analysis in LHCb
An overview of how LHCb performs physics analysis
Amsterdam, The Netherlands, 22-24 May 2017
LHCb in a nutshell
Time-dependent CP violation – Bs → Ds K
Event display: Run 79646, Event 143858637 (23 Sep 2010, 19:49:24)
The LHCb Detector and Cavern
Detector layout (upstream to downstream): VELO, TT, RICH1, Magnet, T stations, RICH2, CALO, MUON
HSF Workshop, Amsterdam, The Netherlands, 22 May 2017 – Eduardo Rodrigues
The data challenge
Run I:
- 2010-2012
- ~3 fb⁻¹ collected, i.e. ~3×10¹¹ b-bbar pairs
produced within LHCb!
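As a back-of-envelope check of the figure above; the cross-section value used here is a round illustrative assumption, not an LHCb measurement:

```python
# Sanity check of the quoted ~3e11 b-bbar pairs from ~3 fb^-1.
# Assumes sigma(bb) within the LHCb acceptance of ~100 microbarn,
# a round illustrative number (the measured value is energy dependent).
lumi_fb = 3.0                  # integrated luminosity [fb^-1]
lumi_mub = lumi_fb * 1e9       # 1 fb^-1 = 1e9 microbarn^-1
sigma_bb_mub = 100.0           # assumed bb cross-section in acceptance [microbarn]
n_bb = lumi_mub * sigma_bb_mub
print(f"{n_bb:.1e}")           # -> 3.0e+11
```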
Run II:
- 2015-present
- By 2018 the beauty sample is expected to increase by a factor of 3 or more …
The analysis challenge(s)
Wildly different challenges!
700M events!
Data flow overview
Data flow overview – the unique LHCb trigger in run II !
HLT1
HLT2
Data flow – the High Level Trigger
Offline quality, more discriminant trigger, data readily usable by analysts, …
New strategy for Run II
Makes use of real-time alignment & calibration performed at each LHC fill !
- Aligns ~ 1700 detector components & computes ~ 2000 calibration constants
Same quality online and offline!
1st use in “2015 early measurements”
Comparison with FULL,
with 2016 pp data:
- FULL stream: ~ 2.7 PB
- Turbo: ~ 0.8 PB
Data flow – «Turbo stream» & «online analysis»
Data flow – calibration samples
Particle identification
Calibration data made available for calculation of (mis)PID efficiencies with data-driven methods
In Run II, a skimming-like step through the “Turbo calibration” stream
- In fact this flow is being used as a test case for so-called WG productions for the upgrade …
Tracking
We also have calibration data and produce tracking-efficiency tables
obtained with data-driven methods
- The preparation process is less streamlined
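A minimal sketch of what a counting-based efficiency from such a calibration sample looks like; the function name and all numbers below are invented for illustration, not LHCb values:

```python
import math

# Minimal sketch: a counting-based (mis)ID efficiency from a
# calibration sample, as in the data-driven methods mentioned above.
def pid_efficiency(n_pass, n_total):
    """Efficiency and its binomial uncertainty from calibration counts."""
    eff = n_pass / n_total
    err = math.sqrt(eff * (1.0 - eff) / n_total)
    return eff, err

eff, err = pid_efficiency(n_pass=9_500, n_total=10_000)
print(f"eps = {eff:.3f} +/- {err:.4f}")   # -> eps = 0.950 +/- 0.0022
```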
Analysis software ecosystem
LHCb software stack
Built atop LCG releases, based on Gaudi framework (shared with ATLAS)
Contains almost all software required for typical analyses with no special needs
Deployed on AFS and CVMFS (the latter now the default, given the AFS phase-out at CERN)
Frameworks and languages
ROOT & non-ROOT frameworks, see above
C++ and/or Python
Python enthusiasts often maintain their own software installations
- Especially when using many non-HEP tools
- Not all packages are available at CERN; even fewer at institutes
Others
Many fitting packages are in use in LHCb, many of them institute-made
and used only by that group (the bulk based on ROOT’s Minuit one way or another)
The LHCb analysis software ecosystem (1/2)
The LHCb analysis software ecosystem (2/2)
Purpose                        Software                     Language of use   HEP?
Data manipulation              ROOT                         C++ & Python      Yes
                               numpy, pandas, bcolz         Python            No
                               root_numpy, root_pandas      Python            Yes
Machine learning               TMVA                         C++ & Python      Yes
(classification, regression)   scikit-learn                 Python            No
                               NeuroBayes                   C++               No
Plotting                       ROOT                         C++ & Python      Yes
                               matplotlib, seaborn, bokeh   Python            No
Fitting                        RooFit                       C++ & Python      Yes
                               <Institute/user packages>    C++               Yes
Statistics                     CLs                          Python            Yes
                               RooStats                     C++ & Python      Yes
Reweighting                    hep_ml                       Python            Yes & no
Error propagation              uncertainties, mcerp         Python            No
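As a flavour of the error-propagation entry, here is a minimal pure-Python sketch of the linear (Gaussian) propagation that the uncertainties package automates; the UFloat class below is a toy stand-in, not the package’s actual API:

```python
import math

# Toy linear error propagation: sums add absolute errors in
# quadrature, products add relative errors in quadrature.
class UFloat:
    def __init__(self, value, sigma):
        self.value, self.sigma = value, sigma

    def __add__(self, other):
        return UFloat(self.value + other.value,
                      math.hypot(self.sigma, other.sigma))

    def __mul__(self, other):
        value = self.value * other.value
        rel = math.hypot(self.sigma / self.value, other.sigma / other.value)
        return UFloat(value, abs(value) * rel)

x = UFloat(10.0, 0.5)
y = UFloat(2.0, 0.1)
z = x * y
print(f"{z.value} +/- {z.sigma:.3f}")   # -> 20.0 +/- 1.414
```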
Other packages some analysts use
Docker for the runtime environment, Snakemake for defining the analysis pipeline
jug for submitting jobs to the batch system
Note: MC programs not listed
Note: not claiming this to be a comprehensive list
On “types of analysis” @ LHCb
«Types of analysis» as far as software is concerned
“Standard analyses”
No peculiar needs
Largely performed with ROOT-based packages and user-written code
Driven with Python, ROOT macros or shell scripts
- A broad mix of C++/Python depending on experience and background, what colleagues provide,
the packages needed, etc.
Alternatives / innovative approaches
Do it all in Python (> 90% CL) exploiting many non-HEP tools (see next slide)
“One of the most important things for me wasn't so much the tools themselves but the ability to use whatever tools
I wanted, and to be able to experiment. I enjoy writing Python …”
BTW, non-HEP tools are used not only for ML but also a lot for data manipulation
There are also even more innovative analyses …
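A taste of such a Python-first workflow: data manipulation with pandas on a toy candidate table (column names and values are invented for illustration):

```python
import pandas as pd

# Toy candidate table, as one might load from an ntuple with
# root_pandas; columns and values are invented for the example.
df = pd.DataFrame({
    "mass": [5250.0, 5280.0, 5279.5, 5400.0, 5281.2],  # MeV
    "pt":   [2.1, 5.3, 4.8, 1.0, 6.2],                 # GeV
})

# A typical preselection: a mass window plus a kinematic cut
selected = df.query("5270 < mass < 5290 and pt > 2.0")
print(len(selected))   # -> 3 candidates survive
```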
Innovative approaches / techniques - GPUs
Why
Computing power & tools adequate for high statistics are paramount (e.g. for fits, toys)
Some LHCb analyses simply require very complex fit models,
too slow to run on standard architectures
How
GPUs used to complement other parts of the analysis done on standard CPUs
- GooFit used in 2-3 analyses
- 1 analysis investigating & exploiting TensorFlow
- 1 analysis using CUDA, from Python ;-)
Looking ahead
Fraction of analyses exploiting GPUs is bound to grow IMO
The trend is already there, with 2 talks related to such activities
at our last LHCb Analysis & Software week!
Another analysis is applying the Hydra package, see presentation at this workshop …
Analyses exploiting GPUs – CUDA or TensorFlow
Ongoing work …
pyCUDA
Probability distribution functions implemented as CUDA C functions (they stay on the GPU)
Fit framework/configuration/OO/etc. in Python
- pyCUDA + python analysis software + custom classes for parameter bookkeeping, etc.
Not using ROOT, but a custom FCN passed to Minuit
Observing a gain ~ 60x in fitting time !
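Conceptually, such a custom FCN is just a vectorised negative log-likelihood. The sketch below illustrates the idea on the CPU with numpy (toy data and invented parameters); the per-event PDF evaluation is exactly the part a GPU accelerates:

```python
import numpy as np

# Toy dataset: 100k "mass" values drawn from a Gaussian.
rng = np.random.default_rng(42)
data = rng.normal(loc=5280.0, scale=15.0, size=100_000)

def fcn(mu, sigma):
    """Gaussian NLL over all events in one vectorised pass,
    the kind of custom FCN one hands to a minimiser like Minuit."""
    z = (data - mu) / sigma
    return np.sum(0.5 * z**2 + np.log(sigma))

# The NLL is smaller at the true parameters than away from them
print(fcn(5280.0, 15.0) < fcn(5300.0, 15.0))   # -> True
```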
TensorFlow
A couple of amplitude analyses are exploring this route
See presentation on Wednesday …
In both cases, motivations are:
- Very complex models requiring numerical convolution
- Necessity to run millions of toys
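The “numerical convolution” ingredient can be sketched on a grid; the lifetime, resolution and grid below are toy values, not those of any LHCb analysis:

```python
import numpy as np

# Exponential decay-time distribution smeared with a Gaussian
# resolution, done numerically with np.convolve on a fixed grid.
t = np.linspace(-2.0, 10.0, 1201)    # decay-time grid [ps], step 0.01
dt = t[1] - t[0]
tau, sigma = 1.5, 0.05               # toy lifetime and resolution [ps]

decay = np.where(t >= 0.0, np.exp(-t / tau) / tau, 0.0)

# Short symmetric resolution kernel centred on zero, unit sum
k = np.arange(-40, 41) * dt
kernel = np.exp(-0.5 * (k / sigma) ** 2)
kernel /= kernel.sum()

smeared = np.convolve(decay, kernel, mode="same")

# The smearing preserves the normalisation on the grid (up to edges)
total = smeared.sum() * dt
print(round(total, 2))               # -> 1.0
```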
Analyses exploiting GPUs – example with Hydra
Hydra
See recent presentation at NVIDIA’s GPU Technology Conference, May 8-11, 2017 - Silicon Valley
And presentation on Wednesday …
GitHub repository
LHCb analysis
Precision K+ mass measurement @ LHCb, see DIANA/HEP topical meeting presentation
Expect to make use of tens of millions of K+ → 3π events …
Ongoing work …
On validation
Data validation
(Nothing particularly fancy here.)
Online & offline data quality
Detector-specific monitoring
LHCb software stack validation
A whole suite of nightly tests for all software projects
Tests of builds (obviously!), comparisons of logged info (e.g. stats), histogrammed contents,
packing/unpacking of event data, etc.
Validation / inspection of code naturally done in Git on merge requests
Analysis (code) validation – see next slide …
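The comparison of histogrammed contents in such tests can be sketched as a bin-by-bin tolerance check; the function and numbers below are illustrative, not the actual LHCb test code:

```python
# Flag any bin whose new content moves by more than a relative
# tolerance with respect to the reference histogram.
def histograms_compatible(reference, new, rel_tol=0.01):
    """True if every bin agrees with the reference within rel_tol."""
    if len(reference) != len(new):
        return False
    for ref, val in zip(reference, new):
        if ref == 0.0:
            if val != 0.0:
                return False
        elif abs(val - ref) / abs(ref) > rel_tol:
            return False
    return True

ref = [100.0, 250.0, 80.0, 0.0]
print(histograms_compatible(ref, [100.5, 249.0, 80.2, 0.0]))   # -> True
print(histograms_compatible(ref, [100.5, 240.0, 80.2, 0.0]))   # -> False
```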
Data / software / analysis validation
Validation comes in various flavours
Reduction (skimming) of DSTs to a size manageable by analysts is called “stripping” in LHCb
- Think of this step as a preselection in the analysis flow, done centrally
Code & results validation done at 3 levels:
- By project coordinators, who ensure the overall goodness/goals/etc. of each campaign
- By physics WG liaisons
- By WG analysts themselves
Validation & checks based on selection rates, timing and, more importantly,
histogramming of relevant physics quantities
Note that simulation has its own validation
Analysis (code) validation – centralised part
The trickiest! How to catch analysts’ mistakes …?
Validation largely done with physics output – no code validation/inspection per se
A suite of WG reviews, followed by the set-up of a dedicated review committee
Largely rely on data-MC comparisons with control samples
Other checks are very much analysis dependent
Example of a relative branching fraction measurement
Calculate the expected yield for the control/normalisation decay
Trivially compare it with the result of the fit to the data
This does not validate the code but provides confidence in the selection chain
and catches silly bugs
Analysis (code) validation – «private» user-code
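The yield cross-check in the example above amounts to a one-line formula; all branching fractions, efficiencies and yields below are invented for illustration:

```python
# Predict the signal yield from the normalisation channel and
# compare with the fitted value; a mismatch flags silly bugs.
def expected_signal_yield(n_norm, bf_sig, bf_norm, eff_sig, eff_norm):
    """N_sig expected, given the normalisation-channel yield."""
    return n_norm * (bf_sig / bf_norm) * (eff_sig / eff_norm)

n_expected = expected_signal_yield(
    n_norm=50_000, bf_sig=1.0e-5, bf_norm=4.0e-5,
    eff_sig=0.012, eff_norm=0.010,
)
print(round(n_expected))   # -> 15000
```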
Looking into the future …
What does the future mean for LHCb ?
Full software trigger in 2020 !
- Yes, at the LHC
Detector will produce 30 MHz of events
Each trigger server will have to sustain 10k events/sec, i.e. 20x more than in the current Run II!
We simply cannot analyse data in the future the way we do it today!
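A rough back-of-envelope reading of those numbers (purely illustrative arithmetic, not an LHCb planning figure):

```python
# 30 MHz into the farm and 10k events/sec sustained per server
# implies of order 3000 trigger servers.
input_rate_hz = 30_000_000     # full detector rate, 30 MHz
per_server_hz = 10_000         # events/sec one server must sustain
n_servers = input_rate_hz // per_server_hz
print(n_servers)               # -> 3000
```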
Trends
Analyses ever more complex, yet already performed straight out of the trigger
- LHCb so far unique in this respect, since such a strategy is already a reality
Computing/analysis model needs to consider various analysis flows
- A single analysis flow will be impossible
- Also means distributed computing (e.g. job submission)
Moving towards reproducible analysis flows
- Define the steps required to run each stage of an analysis,
linking together the data and the analysis code and environment
- Win-win for data & analysis preservation (the latter already actively being developed in LHCb)
- Also makes analysis code validation much much easier
Increasing discussion of collaborative tools
- Sustainability made easier
Centralised MC productions (aka physics working group productions) ever more encouraged
- To minimise data access costs in particular
Thank you
Run I & II integrated luminosities
Data flow overview – Run I
Representation
Representation of LHCb detector in the software stack
Workflows for production of particle ID calibration samples