Data Analysis in LHCb

An overview of how LHCb performs physics analysis

Eduardo Rodrigues, University of Cincinnati

On behalf of the LHCb collaboration

Amsterdam, The Netherlands, 22-24 May 2017




LHCb in a nutshell


Time-dependent CP violation – Bs → Ds K

[Event display: 23 Sep 2010 19:49:24, Run 79646, Event 143858637]

The LHCb Detector and Cavern

[Detector layout labels: VELO, RICH1, TT, MAGNET, T, RICH2, CALO, MUON]


HSF Workshop, Amsterdam, The Netherlands, 22 May 2017, Eduardo Rodrigues

The data challenge

Run I:

- 2010-2012

- ~3 fb⁻¹ collected, i.e. ~3×10¹¹ b-bbar pairs produced within LHCb!

Run II:

- 2015-present

- By 2018 beauty sample expected to be increased by a factor 3 or more …
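
The Run I figures above imply an effective b-bbar production cross-section within LHCb, since the number of produced pairs is the cross-section times the integrated luminosity. A minimal sketch of that arithmetic, using only the two numbers quoted on the slide (the function name is illustrative, and the result is a rough derived value, not an official measurement):

```python
# Rough arithmetic only: N_pairs = sigma * integrated_luminosity.
FB_INV_PER_MICROBARN_INV = 1e-9  # 1 ub^-1 = 1e-9 fb^-1


def implied_cross_section_ub(n_pairs, lumi_fb_inv):
    """Effective cross-section in microbarn implied by a yield and a luminosity."""
    lumi_ub_inv = lumi_fb_inv / FB_INV_PER_MICROBARN_INV  # fb^-1 -> ub^-1
    return n_pairs / lumi_ub_inv


# Slide values: ~3e11 pairs produced in ~3 fb^-1.
sigma_ub = implied_cross_section_ub(n_pairs=3e11, lumi_fb_inv=3.0)
print(f"implied effective cross-section: ~{sigma_ub:.0f} microbarn")
```

The result is of order 100 microbarn, which is only as precise as the rounded slide numbers.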


The analysis challenge(s)

Wildly different challenges!

700M events!


Data flow overview


Data flow overview – the unique LHCb trigger in Run II!


Data flow – the High Level Trigger

[Figure labels: HLT1, HLT2]

Offline-like quality, more discriminating trigger, data readily usable by analysts, …


Data flow – «Turbo stream» & «online analysis»

New strategy for Run II

Makes use of real-time alignment & calibration performed at each LHC fill!

- Aligns ~1700 detector components & computes ~2000 calibration constants

- Same quality online and offline!

1st use in “2015 early measurements”

Comparison with FULL, with 2016 pp data:

- FULL stream: ~2.7 PB

- Turbo: ~0.8 PB
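
To put the two 2016 volumes quoted on this slide side by side, a small sketch of the ratio follows. It uses only the slide numbers and makes no assumption about whether the two streams hold the same events:

```python
# Sizes quoted on the slide for 2016 pp data, in petabytes.
FULL_PB = 2.7
TURBO_PB = 0.8

ratio = TURBO_PB / FULL_PB  # Turbo volume relative to the FULL stream

print(f"Turbo is about {ratio:.0%} of the FULL stream volume")
```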


Data flow – calibration samples

Particle identification

Calibration data made available for calculation of (mis)PID efficiencies with data-driven methods

In Run II, a skimming-like step through the “Turbo calibration” stream

- In fact this flow is being used as a test case for so-called WG productions for the upgrade …

Tracking

We also have calibration data and production of tracking efficiency tables obtained with data-driven methods

- Process of preparation is less fancy/streamlined
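
As an illustration of what a data-driven (mis)PID efficiency amounts to, here is a minimal sketch: the fraction of calibration tracks passing a PID requirement, computed in bins of momentum. The sample is synthetic, and the binning, the toy efficiency model and the function name are all assumptions made for the example, not LHCb tools or values.

```python
import random


def binned_efficiency(momenta, passed, bin_edges):
    """Fraction of calibration tracks passing a PID cut, per momentum bin."""
    n_bins = len(bin_edges) - 1
    n_total = [0] * n_bins
    n_pass = [0] * n_bins
    for p, ok in zip(momenta, passed):
        for i in range(n_bins):
            if bin_edges[i] <= p < bin_edges[i + 1]:
                n_total[i] += 1
                n_pass[i] += int(ok)
                break
    # Bins with no calibration tracks have no defined efficiency.
    return [np_ / nt if nt else None for np_, nt in zip(n_pass, n_total)]


# Synthetic stand-in for a calibration sample (NOT real LHCb data).
rng = random.Random(42)
momenta = [rng.uniform(2.0, 100.0) for _ in range(20000)]  # GeV/c, hypothetical
# Toy model: the PID cut becomes less efficient at high momentum.
passed = [rng.random() < (0.95 - 0.003 * p) for p in momenta]

edges = [2.0, 10.0, 25.0, 50.0, 100.0]  # hypothetical binning
efficiencies = binned_efficiency(momenta, passed, edges)
for lo, hi, eff in zip(edges[:-1], edges[1:], efficiencies):
    print(f"{lo:5.1f} <= p < {hi:5.1f} GeV/c : efficiency = {eff:.3f}")
```

In practice such tables are usually binned in more than one variable; one dimension is used here to keep the sketch short.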


Analysis software ecosystem


The LHCb analysis software ecosystem (1/2)

LHCb software stack

Built atop LCG releases, based on the Gaudi framework (shared with ATLAS)

Contains almost all software required for typical analyses with no special needs

Deployed on AFS and CVMFS (the latter is now the default, given the AFS phase-out at CERN)

Frameworks and languages

ROOT & non-ROOT frameworks, see above

C++ and/or Python

Python enthusiasts often have their own software installations

- Especially when using many non-HEP tools

- Not all packages are available at CERN, and even fewer at institutes

Others

Many fitting packages are in use in LHCb, many of them institute-made and exploited by that single group (the bulk based on ROOT’s Minuit one way or another)


The LHCb analysis software ecosystem (2/2)

Purpose                                        Software                    Language of use   HEP?
Data manipulation                              ROOT                        C++ & Python      Yes
                                               numpy, pandas, bcolz        Python            No
                                               root_numpy, root_pandas     Python            Yes
Machine learning (classification, regression)  TMVA                        C++ & Python      Yes
                                               scikit-learn                Python            No
                                               NeuroBayes                  C++               No
Plotting                                       ROOT                        C++ & Python      Yes
                                               matplotlib, seaborn, bokeh  Python            No
Fitting                                        RooFit                      C++ & Python      Yes
                                               <Institute/user packages>   C++               Yes
Statistics                                     CLs                         Python            Yes
                                               RooStats                    C++ & Python      Yes
Reweighting                                    hep_ml                      Python            Yes & no
Error propagation                              uncertainties, mcerp        Python            No

Other packages some analysts use

Docker for the runtime environment, Snakemake for defining the analysis pipeline

jug for submitting jobs to the batch system

Note: MC programs not listed

Note: not claiming this to be a comprehensive list.


On “types of analysis” @ LHCb


«Types of analysis» as far as software is concerned

“Standard analyses”

No peculiar needs

Largely performed with ROOT-based packages and user-written code

Driven with Python, ROOT macros or shell scripts

- Broad mix of C++/Python depending on experience and background, what a friend does/provides, the package needed, etc.

Alternatives / innovative approaches

Do it all in Python (> 90% CL) exploiting many non-HEP tools (see next slide)

“One of the most important things for me wasn't so much the tools themselves but the ability to use whatever tools I wanted, and to be able to experiment. I enjoy writing Python …”

BTW, non-HEP tools being used not only for ML but also a lot for data manipulation

There are also even more innovative analyses …


Innovative approaches / techniques - GPUs

Why

Computing power & tools adequate for high statistics are paramount (e.g. for fits, toys)

Some LHCb analyses simply require very complex fits (models) that are too slow to run on standard architectures

How

GPUs used to complement other parts of the analysis done on standard CPUs

- GooFit being used on 2-3 analyses

- 1 analysis investigating & exploiting TensorFlow

- 1 analysis using CUDA, from Python ;-)

Looking ahead

Fraction of analyses exploiting GPUs is bound to grow IMO

The trend is already there, with 2 talks related to such activities at our last LHCb Analysis & Software week!

Another analysis is applying the Hydra package, see presentation at this workshop …


Analyses exploiting GPUs – CUDA or TensorFlow

Ongoing work …

pyCUDA

Probability distribution functions implemented as CUDA C functions (they stay on the GPU)

Fit framework/configuration/OO/etc. in Python

- pyCUDA + python analysis software + custom classes for parameter bookkeeping, etc.

Not using ROOT but custom FCN passed to Minuit

Observing a gain ~ 60x in fitting time !
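
The pattern described here, a custom FCN handed to a minimiser instead of ROOT fitting classes, can be sketched on a CPU. In this minimal, assumption-laden example the GPU-resident PDF is replaced by a plain Python exponential PDF and Minuit by a simple golden-section search, so that it runs anywhere; the function names and the toy lifetime are hypothetical.

```python
import math
import random


def make_fcn(decay_times):
    """Build the FCN: negative log-likelihood of an exponential PDF in tau."""
    n = len(decay_times)
    total = sum(decay_times)

    def fcn(tau):
        # -ln L = n*ln(tau) + sum(t_i)/tau for pdf(t) = exp(-t/tau)/tau
        return n * math.log(tau) + total / tau

    return fcn


def golden_section_minimise(f, lo, hi, tol=1e-8):
    """Stand-in minimiser for a 1D unimodal function (this is not Minuit)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)


rng = random.Random(1)
true_tau = 1.5  # arbitrary units, toy value
data = [rng.expovariate(1.0 / true_tau) for _ in range(5000)]

fcn = make_fcn(data)
tau_hat = golden_section_minimise(fcn, 0.1, 10.0)
print(f"fitted tau = {tau_hat:.4f} (sample mean = {sum(data) / len(data):.4f})")
```

For this PDF the maximum-likelihood estimate equals the sample mean, which gives a built-in cross-check of the minimiser. The separation between the FCN and the minimiser is the point: the FCN body is what would be offloaded to the GPU.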

TensorFlow

Couple of amplitude analyses exploring this route

See presentation on Wednesday …

In both cases, motivations are:

- Very complex models requiring numerical convolution

- Necessity to run millions of toys
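
To make the “toys” motivation concrete, here is a minimal sketch of a toy study: many pseudo-experiments are generated and fitted, and the pull distribution is inspected. It assumes a trivial exponential model with a closed-form fit, whereas the analyses above need an expensive numerical fit for every toy; all values and names are illustrative.

```python
import math
import random
import statistics


def run_toy(rng, true_tau, n_events):
    """Generate one pseudo-experiment and return the pull of the fitted lifetime."""
    sample = [rng.expovariate(1.0 / true_tau) for _ in range(n_events)]
    tau_hat = sum(sample) / n_events           # closed-form ML estimate
    sigma_hat = tau_hat / math.sqrt(n_events)  # its estimated uncertainty
    return (tau_hat - true_tau) / sigma_hat


rng = random.Random(2017)
pulls = [run_toy(rng, true_tau=1.5, n_events=500) for _ in range(2000)]

# A well-behaved fit gives pulls centred near 0 with a width near 1.
print(f"pull mean  = {statistics.mean(pulls):+.3f}")
print(f"pull width = {statistics.stdev(pulls):.3f}")
```

The cost scales as (number of toys) × (cost of one fit), which is why a large speed-up per fit matters once millions of toys are needed.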


Analyses exploiting GPUs – example with Hydra

Hydra

See recent presentation at NVIDIA’s GPU Technology Conference, May 8-11, 2017 - Silicon Valley

And presentation on Wednesday …

GitHub repository

LHCb analysis

Precision K+ mass measurement @ LHCb, see DIANA/HEP topical meeting presentation

Expect to make use of tens of millions of K+ → 3π events …

Ongoing work …


On validation


Data / software / analysis validation

Validation comes in various flavours

Data validation

(Nothing particularly fancy here.)

Online & offline data quality

Detector-specific monitoring

LHCb software stack validation

A whole suite of nightly tests for all software projects

Tests of builds (obviously!), comparisons of logged info (e.g. stats), histogrammed contents, packing/unpacking of event data, etc.

Validation / inspection of code naturally done in Git on merge requests

Analysis (code) validation – see next slide …
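
One of the nightly checks listed on this slide, the comparison of histogrammed contents, could look roughly like the sketch below. It is an illustration only: the function name, the tolerance convention and the bin contents are assumptions, not the actual LHCb nightly-test code.

```python
def compare_histograms(reference, candidate, rel_tol=0.0):
    """Return the indices of bins whose contents differ beyond rel_tol."""
    if len(reference) != len(candidate):
        raise ValueError("histograms have different binning")
    differing = []
    for i, (ref, new) in enumerate(zip(reference, candidate)):
        scale = max(abs(ref), abs(new), 1.0)
        if abs(ref - new) > rel_tol * scale:
            differing.append(i)
    return differing


# Hypothetical bin contents from a reference build and from a nightly build.
reference_hist = [120, 340, 560, 330, 110]
nightly_hist = [120, 340, 561, 330, 110]

bad_bins = compare_histograms(reference_hist, nightly_hist)
print("differing bins:", bad_bins)
```

With the default `rel_tol=0.0` any change is flagged, which suits a regression test of a supposedly unchanged algorithm; a non-zero tolerance would suit comparisons that are expected to fluctuate.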


Analysis (code) validation – centralised part

Reduction (skimming) of DSTs to a size manageable by analysts is called “stripping” in LHCb

- See this step as a preselection in the analysis flow, done centrally

Code & results validation done at 3 levels:

- By project coordinators, who ensure overall goodness/goals/etc. for each campaign

- By physics WG liaisons

- By WG analysts themselves

Validation & checks based on selection rates, timing and, more importantly, histogramming of relevant physics quantities

Note that simulation has its own validation


Analysis (code) validation – «private» user-code

The trickiest! How to catch analysts’ mistakes …?

Validation largely done with physics output – no code validation/inspection per se

Suite of WG reviews followed by the set-up of a dedicated review committee

Largely rely on data-MC comparisons with control samples

Other checks are very much analysis dependent

Example of a relative branching fraction measurement

Calculate expected yields for the control/normalisation decay

Trivially compare them with the results from the fit to the data

This does not validate the code but provides confidence in the selection chain and catches silly bugs
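
The yield cross-check in the branching-fraction example can be sketched as follows. Every number is a placeholder chosen for illustration, not an LHCb value, and the function names are hypothetical; the comparison also ignores the uncertainty on the expectation itself.

```python
def expected_yield(n_produced, branching_fraction, efficiency):
    """Expected yield: produced hadrons x branching fraction x total efficiency."""
    return n_produced * branching_fraction * efficiency


def pull(observed, observed_err, expected):
    """Difference between fitted and expected yields in units of the fit error."""
    return (observed - expected) / observed_err


# Placeholder inputs for the control/normalisation decay (illustrative only).
n_expected = expected_yield(n_produced=1.0e11,
                            branching_fraction=1.0e-3,
                            efficiency=2.0e-3)
fitted_yield, fitted_err = 1.95e5, 0.05e5  # would come from the fit to data

print(f"expected = {n_expected:.3g}, fitted = {fitted_yield:.3g}, "
      f"pull = {pull(fitted_yield, fitted_err, n_expected):+.1f}")
```

A large pull would point to a problem somewhere in the selection chain or in the efficiency bookkeeping, which is the kind of silly bug this check is meant to catch.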


Looking into the future …


What does the future mean for LHCb ?

Full software trigger in 2020 !

- Yes, at the LHC

Detector will produce 30 MHz of events

Each trigger server will have to sustain 10k events / sec, i.e. 20x more than in the current Run II!

In the future we simply cannot analyse data the way we do today!
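
The rates quoted on this slide can be turned into a few derived numbers. The sketch below assumes the full 30 MHz is shared evenly among servers that each run at the quoted 10k events/s; the server count and the Run II per-server rate are therefore inferences from the slide, not stated figures.

```python
EVENT_RATE_HZ = 30e6       # "30 MHz of events" from the slide
PER_SERVER_RATE_HZ = 10e3  # "10k events / sec" per trigger server
INCREASE_FACTOR = 20       # "20x more" than in the current Run II

implied_servers = EVENT_RATE_HZ / PER_SERVER_RATE_HZ
implied_run2_rate_hz = PER_SERVER_RATE_HZ / INCREASE_FACTOR
time_budget_ms = 1e3 / PER_SERVER_RATE_HZ  # average time per event per server

print(f"implied number of servers: {implied_servers:.0f}")
print(f"implied Run II per-server rate: {implied_run2_rate_hz:.0f} events/s")
print(f"average budget per event: {time_budget_ms:.1f} ms per server")
```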


Trends

Analyses are ever more complex, yet already performed straight out of the trigger

- LHCb so far unique in this respect, since such a strategy is already a reality

Computing/analysis model needs to consider various analysis flows

- A single analysis flow will be impossible

- Also means distributed computing (e.g. job submission)

Moving towards reproducible analysis flows

- Define the steps required to run each stage of an analysis, linking together the data and the analysis code and environment

- Win-win for data & analysis preservation (latter already actively being developed in LHCb)

- Also makes analysis code validation much much easier

Increasing discussion of collaborative tools

- Sustainability made easier

Centralised MC productions (aka physics working group prods.) ever more encouraged

- To minimise data access costs in particular
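
The “reproducible analysis flow” idea from the list above, where each stage is declared together with its data, code and environment, can be sketched with the standard library alone. Stage names, file names and environment tags are hypothetical, and this is not the syntax of Snakemake or of any LHCb tool.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each stage lists its upstream stages, code, environment and input data.
# Stage names, file names and image tags are hypothetical.
stages = {
    "select": {"needs": [], "code": "select.py", "env": "analysis-env:1.0",
               "inputs": ["stripped_data.root"]},
    "train_mva": {"needs": ["select"], "code": "train.py",
                  "env": "analysis-env:1.0", "inputs": ["simulation.root"]},
    "fit": {"needs": ["select", "train_mva"], "code": "fit.py",
            "env": "analysis-env:1.0", "inputs": []},
    "plots": {"needs": ["fit"], "code": "plots.py",
              "env": "analysis-env:1.0", "inputs": []},
}


def execution_order(stage_table):
    """Order the stages so that every stage runs after the stages it needs."""
    graph = {name: set(spec["needs"]) for name, spec in stage_table.items()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
for name in order:
    spec = stages[name]
    print(f"{name}: run {spec['code']} in {spec['env']}")
```

Because the whole flow lives in one declarative table, the same description can serve for execution, for preservation and for review of what was actually run.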


Thank you


Run I & II integrated luminosities


Data flow overview – run I


Representation of LHCb detector in the software stack


Workflows for production of particle ID calibration samples