Eduardo Rodrigues, University of Cincinnati
On behalf of the LHCb collaboration
Data Analysis in LHCb
An overview of how LHCb performs physics analysis
Amsterdam, The Netherlands, 22-24 May 2017
LHCb in a nutshell
Time-dependent CP violation – Bs → Ds K
Event display: Run 79646, Event 143858637 (23 Sep 2010, 19:49:24)
The LHCb Detector and Cavern
Detector layout (upstream to downstream): VELO, TT, RICH1, Magnet, T stations, RICH2, CALO, MUON
HSF Workshop, Amsterdam, The Netherlands, 22 May 2017 – Eduardo Rodrigues
The data challenge
Run I:
- 2010-2012
- ~3 fb⁻¹ collected, i.e. ~3×10¹¹ b-bbar pairs
produced within LHCb!
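As a back-of-envelope check of the figure above; the cross-section value used here is a round illustrative assumption, not an LHCb measurement:

```python
# Sanity check of the quoted ~3e11 b-bbar pairs from ~3 fb^-1.
# Assumes sigma(bb) within the LHCb acceptance of ~100 microbarn,
# a round illustrative number (the measured value is energy dependent).
lumi_fb = 3.0                  # integrated luminosity [fb^-1]
lumi_mub = lumi_fb * 1e9       # 1 fb^-1 = 1e9 microbarn^-1
sigma_bb_mub = 100.0           # assumed bb cross-section in acceptance [microbarn]
n_bb = lumi_mub * sigma_bb_mub
print(f"{n_bb:.1e}")           # -> 3.0e+11
```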
Run II:
- 2015-present
- By 2018 the beauty sample is expected to increase by a factor of 3 or more …
The analysis challenge(s)
Wildly different challenges!
700M events!
Data flow overview
Data flow overview – the unique LHCb trigger in run II !
HLT1
HLT2
Data flow – the High Level Trigger
Offline quality, more discriminant trigger, data readily usable by analysts, …
New strategy for Run II
Makes use of real-time alignment & calibration performed at each LHC fill !
- Aligns ~ 1700 detector components & computes ~ 2000 calibration constants
Same quality online and offline!
1st use in “2015 early measurements”
Comparison with FULL,
with 2016 pp data:
- FULL stream: ~ 2.7 PB
- Turbo: ~ 0.8 PB
Data flow – «Turbo stream» & «online analysis»
Data flow – calibration samples
Particle identification
Calibration data made available for calculation of (mis)PID efficiencies with data-driven methods
In Run II, a skimming-like step through the “Turbo calibration” stream
- In fact this flow is being used as a test case for so-called WG productions for the upgrade …
Tracking
We also have calibration data and produce tracking-efficiency tables
obtained with data-driven methods
- The preparation process is less streamlined
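A minimal sketch of what a counting-based efficiency from such a calibration sample looks like; the function name and all numbers below are invented for illustration, not LHCb values:

```python
import math

# Minimal sketch: a counting-based (mis)ID efficiency from a
# calibration sample, as in the data-driven methods mentioned above.
def pid_efficiency(n_pass, n_total):
    """Efficiency and its binomial uncertainty from calibration counts."""
    eff = n_pass / n_total
    err = math.sqrt(eff * (1.0 - eff) / n_total)
    return eff, err

eff, err = pid_efficiency(n_pass=9_500, n_total=10_000)
print(f"eps = {eff:.3f} +/- {err:.4f}")   # -> eps = 0.950 +/- 0.0022
```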
Analysis software ecosystem
LHCb software stack
Built atop LCG releases, based on Gaudi framework (shared with ATLAS)
Contains almost all software required for typical analyses with no special needs
Deployed on AFS and CVMFS (the latter now the default, given the AFS phase-out at CERN)
Frameworks and languages
ROOT & non-ROOT frameworks, see above
C++ and/or Python
Python enthusiasts often maintain their own software installations
- Especially when using many non-HEP tools
- Not all packages are available at CERN; even fewer at institutes
Others
Many fitting packages are in use in LHCb, many of them institute-made
and used only by that group (the bulk based on ROOT’s Minuit one way or another)
The LHCb analysis software ecosystem (1/2)
The LHCb analysis software ecosystem (2/2)
Purpose                        Software                     Language of use   HEP?
Data manipulation              ROOT                         C++ & Python      Yes
                               numpy, pandas, bcolz         Python            No
                               root_numpy, root_pandas      Python            Yes
Machine learning               TMVA                         C++ & Python      Yes
(classification, regression)   scikit-learn                 Python            No
                               NeuroBayes                   C++               No
Plotting                       ROOT                         C++ & Python      Yes
                               matplotlib, seaborn, bokeh   Python            No
Fitting                        RooFit                       C++ & Python      Yes
                               <Institute/user packages>    C++               Yes
Statistics                     CLs                          Python            Yes
                               RooStats                     C++ & Python      Yes
Reweighting                    hep_ml                       Python            Yes & no
Error propagation              uncertainties, mcerp         Python            No
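As a flavour of the error-propagation entry, here is a minimal pure-Python sketch of the linear (Gaussian) propagation that the uncertainties package automates; the UFloat class below is a toy stand-in, not the package’s actual API:

```python
import math

# Toy linear error propagation: sums add absolute errors in
# quadrature, products add relative errors in quadrature.
class UFloat:
    def __init__(self, value, sigma):
        self.value, self.sigma = value, sigma

    def __add__(self, other):
        return UFloat(self.value + other.value,
                      math.hypot(self.sigma, other.sigma))

    def __mul__(self, other):
        value = self.value * other.value
        rel = math.hypot(self.sigma / self.value, other.sigma / other.value)
        return UFloat(value, abs(value) * rel)

x = UFloat(10.0, 0.5)
y = UFloat(2.0, 0.1)
z = x * y
print(f"{z.value} +/- {z.sigma:.3f}")   # -> 20.0 +/- 1.414
```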
Other packages some analysts use
Docker for the runtime environment, Snakemake for defining the analysis pipeline
jug for submitting jobs to the batch system
Note: MC programs not listed
Note: not claiming this to be a comprehensive list
On “types of analysis” @ LHCb
«Types of analysis» as far as software is concerned
“Standard analyses”
No peculiar needs
Largely performed with ROOT-based packages and user-written code
Driven with Python, ROOT macros or shell scripts
- A broad mix of C++/Python depending on experience and background, what colleagues provide,
the packages needed, etc.
Alternatives / innovative approaches
Do it all in Python (> 90% CL) exploiting many non-HEP tools (see next slide)
“One of the most important things for me wasn't so much the tools themselves but the ability to use whatever tools
I wanted, and to be able to experiment. I enjoy writing Python …”
BTW, non-HEP tools are used not only for ML but also a lot for data manipulation
There are also even more innovative analyses …
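A taste of such a Python-first workflow: data manipulation with pandas on a toy candidate table (column names and values are invented for illustration):

```python
import pandas as pd

# Toy candidate table, as one might load from an ntuple with
# root_pandas; columns and values are invented for the example.
df = pd.DataFrame({
    "mass": [5250.0, 5280.0, 5279.5, 5400.0, 5281.2],  # MeV
    "pt":   [2.1, 5.3, 4.8, 1.0, 6.2],                 # GeV
})

# A typical preselection: a mass window plus a kinematic cut
selected = df.query("5270 < mass < 5290 and pt > 2.0")
print(len(selected))   # -> 3 candidates survive
```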
Innovative approaches / techniques - GPUs
Why
Computing power & tools adequate for high statistics are paramount (e.g. for fits, toys)
Some LHCb analyses simply require very complex fit models,
too slow to run on standard architectures
How
GPUs used to complement other parts of the analysis done on standard CPUs
- GooFit used in 2-3 analyses
- 1 analysis investigating & exploiting TensorFlow
- 1 analysis using CUDA, from Python ;-)
Looking ahead
Fraction of analyses exploiting GPUs is bound to grow IMO
The trend is already there, with 2 talks related to such activities
at our last LHCb Analysis & Software week!
Another analysis is applying the Hydra package, see presentation at this workshop …
Analyses exploiting GPUs – CUDA or TensorFlow
Ongoing work …
pyCUDA
Probability distribution functions implemented as CUDA C functions (they stay on the GPU)
Fit framework/configuration/OO/etc. in Python
- pyCUDA + python analysis software + custom classes for parameter bookkeeping, etc.
Not using ROOT, but a custom FCN passed to Minuit
Observing a gain ~ 60x in fitting time !
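Conceptually, such a custom FCN is just a vectorised negative log-likelihood. The sketch below illustrates the idea on the CPU with numpy (toy data and invented parameters); the per-event PDF evaluation is exactly the part a GPU accelerates:

```python
import numpy as np

# Toy dataset: 100k "mass" values drawn from a Gaussian.
rng = np.random.default_rng(42)
data = rng.normal(loc=5280.0, scale=15.0, size=100_000)

def fcn(mu, sigma):
    """Gaussian NLL over all events in one vectorised pass,
    the kind of custom FCN one hands to a minimiser like Minuit."""
    z = (data - mu) / sigma
    return np.sum(0.5 * z**2 + np.log(sigma))

# The NLL is smaller at the true parameters than away from them
print(fcn(5280.0, 15.0) < fcn(5300.0, 15.0))   # -> True
```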
TensorFlow
A couple of amplitude analyses are exploring this route
See presentation on Wednesday …
In both cases, motivations are:
- Very complex models requiring numerical convolution
- Necessity to run millions of toys
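The “numerical convolution” ingredient can be sketched on a grid; the lifetime, resolution and grid below are toy values, not those of any LHCb analysis:

```python
import numpy as np

# Exponential decay-time distribution smeared with a Gaussian
# resolution, done numerically with np.convolve on a fixed grid.
t = np.linspace(-2.0, 10.0, 1201)    # decay-time grid [ps], step 0.01
dt = t[1] - t[0]
tau, sigma = 1.5, 0.05               # toy lifetime and resolution [ps]

decay = np.where(t >= 0.0, np.exp(-t / tau) / tau, 0.0)

# Short symmetric resolution kernel centred on zero, unit sum
k = np.arange(-40, 41) * dt
kernel = np.exp(-0.5 * (k / sigma) ** 2)
kernel /= kernel.sum()

smeared = np.convolve(decay, kernel, mode="same")

# The smearing preserves the normalisation on the grid (up to edges)
total = smeared.sum() * dt
print(round(total, 2))               # -> 1.0
```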
Analyses exploiting GPUs – example with Hydra
Hydra
See recent presentation at NVIDIA’s GPU Technology Conference, May 8-11, 2017 - Silicon Valley
And presentation on Wednesday …
GitHub repository
LHCb analysis
Precision K+ mass measurement @ LHCb, see DIANA/HEP topical meeting presentation
Expect to make use of tens of millions of K+ → 3π events …
Ongoing work …
On validation
Data validation
(Nothing particularly fancy here.)
Online & offline data quality
Detector-specific monitoring
LHCb software stack validation
A whole suite of nightly tests for all software projects
Tests of builds (obviously!), comparisons of logged info (e.g. stats), histogrammed contents,
packing/unpacking of event data, etc.
Validation / inspection of code naturally done in Git on merge requests
Analysis (code) validation – see next slide …
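The comparison of histogrammed contents in such tests can be sketched as a bin-by-bin tolerance check; the function and numbers below are illustrative, not the actual LHCb test code:

```python
# Flag any bin whose new content moves by more than a relative
# tolerance with respect to the reference histogram.
def histograms_compatible(reference, new, rel_tol=0.01):
    """True if every bin agrees with the reference within rel_tol."""
    if len(reference) != len(new):
        return False
    for ref, val in zip(reference, new):
        if ref == 0.0:
            if val != 0.0:
                return False
        elif abs(val - ref) / abs(ref) > rel_tol:
            return False
    return True

ref = [100.0, 250.0, 80.0, 0.0]
print(histograms_compatible(ref, [100.5, 249.0, 80.2, 0.0]))   # -> True
print(histograms_compatible(ref, [100.5, 240.0, 80.2, 0.0]))   # -> False
```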
Data / software / analysis validation
Validation comes in various flavours
Reduction (skimming) of DSTs to a size manageable by analysts is called “stripping” in LHCb
- Think of this step as a preselection in the analysis flow, done centrally
Code & results validation done at 3 levels:
- By project coordinators, who ensure the overall goodness/goals/etc. of each campaign
- By physics WG liaisons
- By WG analysts themselves
Validation & checks based on selection rates, timing and, more importantly,
histogramming of relevant physics quantities
Note that simulation has its own validation
Analysis (code) validation – centralised part
The trickiest! How to catch analysts’ mistakes …?
Validation largely done with physics output – no code validation/inspection per se
A suite of WG reviews, followed by the set-up of a dedicated review committee
Largely rely on data-MC comparisons with control samples
Other checks are very much analysis dependent
Example of a relative branching fraction measurement
Calculate the expected yield for the control/normalisation decay
Trivially compare it with the result of the fit to the data
This does not validate the code but provides confidence in the selection chain
and catches silly bugs
Analysis (code) validation – «private» user-code
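The yield cross-check in the example above amounts to a one-line formula; all branching fractions, efficiencies and yields below are invented for illustration:

```python
# Predict the signal yield from the normalisation channel and
# compare with the fitted value; a mismatch flags silly bugs.
def expected_signal_yield(n_norm, bf_sig, bf_norm, eff_sig, eff_norm):
    """N_sig expected, given the normalisation-channel yield."""
    return n_norm * (bf_sig / bf_norm) * (eff_sig / eff_norm)

n_expected = expected_signal_yield(
    n_norm=50_000, bf_sig=1.0e-5, bf_norm=4.0e-5,
    eff_sig=0.012, eff_norm=0.010,
)
print(round(n_expected))   # -> 15000
```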
Looking into the future …
What does the future mean for LHCb ?
Full software trigger in 2020 !
- Yes, at the LHC
Detector will produce 30 MHz of events
Each trigger server will have to sustain 10k events/sec, i.e. 20x more than in the current Run II!
We simply cannot analyse data in the future the way we do it today!
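A rough back-of-envelope reading of those numbers (purely illustrative arithmetic, not an LHCb planning figure):

```python
# 30 MHz into the farm and 10k events/sec sustained per server
# implies of order 3000 trigger servers.
input_rate_hz = 30_000_000     # full detector rate, 30 MHz
per_server_hz = 10_000         # events/sec one server must sustain
n_servers = input_rate_hz // per_server_hz
print(n_servers)               # -> 3000
```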
Trends
Analyses ever more complex, yet already performed straight out of the trigger
- LHCb so far unique in this respect, since such a strategy is already a reality
Computing/analysis model needs to consider various analysis flows
- A single analysis flow will be impossible
- Also means distributed computing (e.g. job submission)
Moving towards reproducible analysis flows
- Define the steps required to run each stage of an analysis,
linking together the data and the analysis code and environment
- Win-win for data & analysis preservation (the latter already actively being developed in LHCb)
- Also makes analysis code validation much much easier
Increasing discussion of collaborative tools
- Sustainability made easier
Centralised MC productions (aka physics working group productions) ever more encouraged
- To minimise data access costs in particular
Thank you
Run I & II integrated luminosities
Data flow overview – Run I
Representation
Representation of LHCb detector in the software stack
Workflows for production of particle ID calibration samples