Deep Neural Networks for Physics Analysis on low-level whole-detector data at the LHC

Wahid Bhimji, Steve Farrell, Thorsten Kurth, Michela Paganini, Prabhat, Evan Racah
Lawrence Berkeley National Laboratory
ACAT 2017: 21 August 2017


Page 1

Deep Neural Networks for Physics Analysis on low-level whole-detector data at the LHC

Wahid Bhimji, Steve Farrell, Thorsten Kurth, Michela Paganini, Prabhat, Evan Racah
Lawrence Berkeley National Laboratory
ACAT 2017: 21 August 2017

- 1 -

Page 2

Introduction / Aims

• Use Deep Neural Networks (DNNs) on 'raw' data directly for physics analysis:
  – Without reconstruction of physics objects like jets; without tuning of analysis variables; using data from the whole calorimeter/detector
  – Cutting-edge methods: performance and interpretation
• Run efficiently on NERSC supercomputers:
  – Primarily Intel Knights Landing (KNL) Xeon Phi CPU based
  – Distributed training (up to ~10k KNL nodes)
  – Timings, optimisations and recipes

- 2 -

Page 3

Physics Use-Case

• Search for RPV SUSY gluino decays
  – Multi-jet final state
  – Analysis from ATLAS-CONF-2016-057 used as a benchmark
  – Classification problem: RPV SUSY vs. QCD
• Simulated samples
  – Pythia event generation (matching ATLAS config)
    • Cascade decay with gluino mass m(g̃) = 1400 GeV and neutralino mass m(χ̃⁰) = 850 GeV by default; explore other masses
  – Delphes detector simulation (ATLAS card)
    • Output calorimeter towers (and tracks) used in analysis

From ATLAS-CONF-2016-057:

- 3 -

Page 4

Data Processing

• Bin calorimeter tower energy in η/φ to form an 'image'
  – 64×64 bins (~0.1 η/φ towers) or 224×224
• Also try 3 'channels' (à la RGB images)¹:
  – Energy in the electromagnetic and hadronic calorimeters, and number of tracks in each bin
• Reconstruct jets using the same algorithm as the physics analysis (anti-kt R=1.0, trimmed) for benchmark comparison and pre-selection

- 4 -

¹ Similar to Komiske, Metodiev, and Schwartz, arXiv:1612.01551
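The binning step above can be sketched in a few lines. This is an illustrative stand-in, not the authors' code: the tower format `(eta, phi, energy)` and the η range are assumptions.

```python
# Sketch of the eta/phi "image" binning: calorimeter towers -> an
# n_bins x n_bins grid of summed energies. Tower coordinates and the
# eta_max coverage are illustrative assumptions, not the analysis config.
import math

def towers_to_image(towers, n_bins=64, eta_max=3.2):
    """Sum tower energies into an n_bins x n_bins eta/phi grid.

    towers: iterable of (eta, phi, energy) with phi in [-pi, pi).
    """
    image = [[0.0] * n_bins for _ in range(n_bins)]
    for eta, phi, energy in towers:
        if abs(eta) >= eta_max:
            continue  # outside the imaged region
        i = int((eta + eta_max) / (2 * eta_max) * n_bins)
        j = int((phi + math.pi) / (2 * math.pi) * n_bins)
        image[i][j] += energy  # towers landing in the same bin accumulate
    return image

# Example: two nearby towers fall in the same bin and are summed.
img = towers_to_image([(0.05, 0.0, 10.0), (0.06, 0.01, 5.0), (2.5, 1.0, 7.0)])
total = sum(sum(row) for row in img)
```

The same function with separate energy sources (EM, hadronic, track counts) per call gives the 3-channel variant described above.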

Page 5

Convolutional (CNN) Architecture

- 5 -

Figure from Dumoulin, Vincent, and Francesco Visin, arXiv:1603.07285

• Popular architecture for natural images and now many HEP studies (so not explained here)
  – Learn non-linear 'filters' that slide across the image: shared filters reduce the number of weights
  – Local structure / translational invariance
  – Stacked layers respond to different scales
• We use 3 alternating convolutional and pooling layers (or 4 for large images), with bias and/or batch normalization
• QCD generated in pT ranges:
  – Cross-section weight in training loss and evaluation

Input          Conv+Pool(1)  Conv+Pool(2)  Conv+Pool(3)  (Conv+Pool(4))  FC    FC   Output
1(or 3)×64×64  64×32×32      128×8×8       256×4×4       –               4096  512  1
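The shape table above can be reproduced by tracking channel counts and pooling factors through the stack. A minimal sketch, assuming 'same'-padded convolutions and pooling factors of (2, 4, 2) to match the listed shapes; these factors are inferred from the table, not stated on the slide.

```python
# Sketch: feature-map shapes through the conv+pool stack. Convolutions use
# 'same' padding (spatial size unchanged); each pooling step divides the
# spatial size by its factor. Filter counts and pool factors (2, 4, 2) are
# chosen to reproduce the table and are an assumption.

def conv_pool_shapes(in_shape, stages):
    """in_shape: (channels, h, w); stages: list of (n_filters, pool_factor)."""
    shapes = [in_shape]
    c, h, w = in_shape
    for n_filters, pool in stages:
        c = n_filters   # the conv layer sets the channel count
        h //= pool      # pooling shrinks the spatial dimensions
        w //= pool
        shapes.append((c, h, w))
    return shapes

shapes = conv_pool_shapes((1, 64, 64), [(64, 2), (128, 4), (256, 2)])
```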

Page 6

CNN Performance

• Need good signal efficiency (True Positive Rate, TPR) and high background rejection (low False Positive Rate, FPR)
  – Compare to physics selections (see backup)
  – ROC curve (relative to preselection)
• Increased signal efficiency at the same background rejection, without using jet variables
• Also compare AMS (approximate median significance), accounting for initial pre-selection and luminosity
  – Physics selections: TPR=0.41, AMS=2.3; CNN: TPR=0.77, AMS=4.2

- 6 -
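For reference, the AMS figure of merit quoted above is commonly defined as sqrt(2·((s+b)·ln(1+s/b) − s)), with s and b the weighted signal and background yields passing the selection. A sketch of that common form; the optional `b_reg` regularisation term is a HiggsML-challenge convention and an assumption here, not necessarily what the slides use.

```python
# Sketch of the approximate median significance (AMS). For s << b this
# reduces to the familiar s / sqrt(b). b_reg is an optional regularisation
# term (an assumption borrowed from the HiggsML challenge definition).
import math

def ams(s, b, b_reg=0.0):
    b_tot = b + b_reg
    return math.sqrt(2.0 * ((s + b_tot) * math.log(1.0 + s / b_tot) - s))

# Small-signal limit: ams(10, 10000) is close to 10 / sqrt(10000) = 0.1
approx = ams(10.0, 10000.0)
```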

Page 7

Compare to Shallow Classifiers

• Try a (gradient) boosted decision tree (GBDT) and a 1-hidden-layer NN (MLP)
  – Inputs: jet variables used in the physics analysis (sum of jet mass, number of jets, Δη between leading 2 jets) and the four-momenta of the leading 5 jets
• These outperform the selections, but the CNN performs better

- 7 -

Page 8

Weights

- 8 -

• Cross-section weights applied in training loss
  – Some QCD background events weighted 10⁷ over RPV signal
• Try log of weights
  – More stable implementation
  – More focussed on signal performance
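The weighting above follows the usual convention w = σ·L/N_generated; a sketch with illustrative numbers showing the 10⁷ weight ratio and how a log transform compresses it. The exact transform (here `log1p`, which keeps weights positive) is an assumption; the slide only says "log of weights".

```python
# Sketch of per-event cross-section weighting and a "log of weights" variant.
# Cross sections, luminosity and sample sizes below are illustrative only.
import math

def xsec_weight(sigma_pb, lumi_invpb, n_generated):
    """Weight each generated event so the sample matches its expected yield."""
    return sigma_pb * lumi_invpb / n_generated

# A high-cross-section QCD slice vs a rare signal: huge weight ratio.
w_qcd = xsec_weight(sigma_pb=1.0e5, lumi_invpb=1.0e4, n_generated=1.0e6)
w_sig = xsec_weight(sigma_pb=1.0e-3, lumi_invpb=1.0e4, n_generated=1.0e5)

ratio = w_qcd / w_sig                                # ~10^7, as on the slide
log_ratio = math.log1p(w_qcd) / math.log1p(w_sig)    # far smaller after log
```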

Page 9

Channels

- 9 -

• Three-channel CNN
  – Separate energy in the electromagnetic and hadronic calorimeters
  – Number of tracks in the same η/φ bin
• Further improves performance

Page 10

Further Improving Performance

- 10 -

• The implementation with full weights and that with log weights focus differently on signal and background
• Can ensemble these by taking the mean of their predictions, which gives the best performance
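The ensembling step above is a simple prediction average; a minimal sketch with illustrative scores (not actual model outputs).

```python
# Sketch of the ensemble: average the per-event P(signal) scores of the
# full-weight and log-weight models. Scores below are illustrative only.

def ensemble_mean(pred_a, pred_b):
    """Element-wise mean of two aligned lists of P(signal) scores."""
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]

full_w = [0.90, 0.20, 0.55]   # model trained with full x-sec weights
log_w  = [0.80, 0.40, 0.65]   # model trained with log weights
combined = ensemble_mean(full_w, log_w)   # ~ [0.85, 0.30, 0.60]
```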

Page 11

Robustness to Different Signals

- 11 -

• Model trained on a specific cascade decay: gluino mass (M_glu) of 1400 GeV and neutralino mass (M_neu) of 850 GeV
• Apply this model to other signal samples without retraining
• Still good performance

Page 12

Pileup

- 12 -

• Most studies here use Delphes without pileup
• Repeat with the Delphes pileup card (μ = 20)
• Physics selections have lower background rejection
• CNN still performs well
  – 1-channel CNN shown

Page 13

Comparing CNN to Jet Variables

- 13 -

• Plot NN output (P(signal)) vs the benchmark analysis variable
• Clear correlation
  – (Signal cuts: N_jets ≥ 4 with ΣM_jet > 800 GeV, or N_jets ≥ 5 with ΣM_jet > 600 GeV)
• Add the jet variable to the CNN output in a 1-layer NN
  – Little/no increase in performance
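The 1-layer combination test above amounts to a logistic model on two inputs. A toy sketch under stated assumptions: the data, learning rate, and feature scaling are illustrative, and this is plain SGD rather than whatever the authors' framework used.

```python
# Sketch of combining the CNN output with one jet variable in a single-layer
# (logistic) model, fitted by per-sample gradient descent on toy data.
import math

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """xs: list of feature tuples, ys: 0/1 labels. Returns (weights, bias)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # dLoss/dz for the cross-entropy loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Features: (CNN P(signal), scaled sum-of-jet-mass). Labels: 1 = signal.
xs = [(0.9, 0.8), (0.8, 0.9), (0.2, 0.3), (0.1, 0.2)]
ys = [1, 1, 0, 0]
w, b = fit_logistic(xs, ys)
```

On the slides' real data the fitted weight on the jet variable carries little extra information once the CNN output is present.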

Page 14

Running at NERSC

- 14 -

Page 15

NERSC and Cori

• NERSC at LBL: production HPC center for the US Dept. of Energy
  – >7000 diverse users across science domains, including many outside HEP
• Cori, NERSC's newest supercomputer: Cray XC40 (31.4 PF peak)
  – Phase 1: 2388 Intel Haswell nodes, dual 16-core (2.3 GHz), 128 GB DDR4 DRAM
  – Phase 2: 9668 Intel Knights Landing (KNL) nodes: Xeon Phi 68-core (1.4 GHz), 4 hardware threads; AVX-512 vector pipelines; 16 GB MCDRAM, 96 GB DDR4
  – Cray Aries high-speed "dragonfly" topology interconnect
• Many popular deep learning frameworks available
  – Caffe, Keras, Lasagne, PyTorch, Tensorflow, Theano
  – Working with Intel to improve CPU (KNL) performance

- 15 -

Page 16

Timing the RPV SUSY CNN

• Implemented the CNN in different frameworks:
  – (Pure) Tensorflow, Keras (Theano and TF), Lasagne (Theano), Caffe
• Aim to drive multi-node Cori CPU performance to be comparable with GPU (for real use-cases)
  – Not aiming for an exact comparison: implementations in the frameworks differ slightly and some have been optimised
• Compare training time (per batch, ignoring I/O) for:
  – GPU: Titan X (Pascal), 10.2 TeraFlops (single-precision) peak
  – CPU: Haswell E5-2698 v3, 32 cores @ 2.3 GHz (2.4 TF)
  – KNL: Xeon Phi 7250, 68 cores @ 1.4 GHz (6 TF)

- 16 -
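A per-batch timing measurement of the kind described above can be sketched as follows; `train_step` is a hypothetical stand-in for a framework's training call, and discarding warm-up batches plus taking a median are illustrative choices, not necessarily the authors' protocol.

```python
# Sketch: time only the training step (excluding I/O), skip warm-up batches,
# and report the median wall-clock time per batch.
import time
import statistics

def time_per_batch(train_step, batches, warmup=2):
    """Median wall-clock seconds per training step, after warm-up batches."""
    times = []
    for i, batch in enumerate(batches):
        t0 = time.perf_counter()
        train_step(batch)
        dt = time.perf_counter() - t0
        if i >= warmup:  # discard warm-up (JIT compilation, cache effects)
            times.append(dt)
    return statistics.median(times)

# Toy "training step": some arithmetic standing in for a real batch update.
def train_step(batch):
    return sum(x * x for x in batch)

batches = [list(range(1000))] * 10
t = time_per_batch(train_step, batches)
```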

Page 17

Timings and Tensorflow

- 17 -

[Bar chart: time per batch (s) for Lasagne+Theano, Keras+Theano, Keras+Tensorflow, Keras+TF(Intel), Keras+TF(Latest) and Caffe on GPU, CPU-HSW, CPU-KNL and 8-node CPU-KNL; times range from ~0.06 s to ~4.6 s]

• CPU performance of default TF 1.2 is poor
• Intel optimisations use the Intel Math Kernel Library (MKL): e.g. conv layers multi-threaded, channels/filters vectorised, cache blocking
  – Now in the main TF repo
• Further optimisations (released soon): e.g. MKL element-wise operations (avoiding MKL->Eigen conversions)
• Batch size: 512
• (Intel)Caffe has similar optimisations, plus multi-node support via the MLSL library: e.g. scaling to 8 nodes gives a ~6× faster time per batch for this 64×64 network


Page 18

Scaling Up

• Train on 10 million 224×224 3-channel images (7.4 TB)
• Caffe implementation: multi-node, data-parallel
  – Uses the Intel MLSL library (wraps comms; portable)
• Sync/async and hybrid strategies:
  – Sync: barriers so nodes iterate together
    • can have straggler nodes, and limits batch size
  – Async: uses parameter servers to scale better
    • can have stale gradients, so may not converge faster
  – Hybrid: sync within a group, async across groups
    • dedicated parameter servers for each layer of the network
• Modified our CNN layers to reduce communication:
  – removed batch norm. and replaced the big (~200 MB) fully connected layers with a convolutional layer

- 18 -

Thorsten Kurth, Jian Zhang, Nadathur Satish, Ioannis Mitliagkas, Evan Racah, Mostofa Patwary, Tareq Malas, Narayanan Sundaram, Wahid Bhimji, Mikhail Smorkalov, Jack Deslippe, Mikhail Shiryaev, Srinivas Sridharan, Prabhat, Pradeep Dubey, "Deep Learning at 15PF" (accepted for SC17), arXiv:1708.05256

Hybrid architecture:

[Diagram: per-layer parameter servers (Layer 1 PS through Layer N PS) exchanging model updates and new models with worker groups Group 1 through Group G]
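The synchronous data-parallel step underlying the strategies above can be sketched in pure Python: each "node" computes a gradient on its own shard, the gradients are averaged (the role an MLSL allreduce plays), and every node applies the same update. The toy least-squares problem and learning rate are illustrative assumptions.

```python
# Sketch of one synchronous data-parallel SGD step (the "Sync" strategy):
# per-node gradients on local shards, an allreduce-style average, and a
# common weight update. Stand-in for the real distributed machinery.

def local_gradient(weights, shard):
    """Toy gradient of a least-squares fit y ~ w*x on this node's shard."""
    g = 0.0
    for x, y in shard:
        g += 2.0 * (weights * x - y) * x
    return g / len(shard)

def sync_step(weights, shards, lr=0.05):
    """One synchronous step: average the per-node gradients, update once."""
    grads = [local_gradient(weights, s) for s in shards]  # parallel in reality
    avg = sum(grads) / len(grads)                         # the "allreduce"
    return weights - lr * avg

# Four "nodes", each holding a shard of data generated from y = 3x.
shards = [[(x, 3.0 * x)] for x in (1.0, 2.0, 3.0, 4.0)]
w = 0.0
for _ in range(200):
    w = sync_step(w, shards)
```

The async and hybrid variants differ in *when* this averaging happens: async workers push gradients to parameter servers without waiting, and the hybrid scheme runs this synchronous step within a group while groups communicate asynchronously.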

Page 19

Scaling Up - Results

• Single node: 1.9 TF (~⅓ of peak)
• Strong scaling (overall batch size fixed):
  – Hybrid approach reduces communication and straggler effects
• Weak scaling (constant batch per node):
  – Good scaling, though affected by variability from communication after the fast convolutional layers
• Scaled to 9600 KNL nodes: 11.73 PF (6170× 1-node), single-precision
• Time to solution (a target loss) also scales (1024-node time is 1/11 of the 64-node time)

- 19 -

T. Kurth et al., "Deep Learning at 15PF" (accepted for SC17), arXiv:1708.05256

[Plots: weak scaling and strong scaling]

Page 20

Conclusions

• Implemented a deep CNN on large whole-detector 'images' directly for physics analysis
  – Outperforms physics-variable-based selections (and shallow classifiers) without jet reconstruction
  – Further improvements from adding 3 channels, modifying weights, and ensembling models
  – Network is robust to pileup, transfers to other signal masses, and appears to learn the physics of interest
• Used to benchmark and improve popular deep learning libraries on CPU, including Xeon Phi/KNL at NERSC
  – Demonstrated distributed training on up to 9600 KNL nodes

- 20 -

Page 21

Thanks: Ben Nachman and Brian Amadio (LBL) for discussions and physics input; Mustafa Mustafa (LBL) for help with Tensorflow optimisations.

Code and sample datasets will be made available with the proceedings.

- 21 -

Page 22

Backups

- 22 -

Page 23

Benchmark Analysis

Fat-jet object selection:
• anti-kt R=1.0, trimmed (R_trim = 0.2, pT_frac = 0.05)
• pT > 200 GeV, |η| < 2.0

Preselection:
• Leading fat-jet pT > 440 GeV
• N_fat-jet > 2

Analysis selection:
• |Δη₁₂| between leading 2 fat-jets < 1.4
• N_fat-jet ≥ 4 and ΣM_fat-jet > 800 GeV
• or N_fat-jet ≥ 5 and ΣM_fat-jet > 600 GeV

- 23 -
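The cut flow above can be written as a predicate on one event. A minimal sketch: the event is a dict whose field names (`n_fatjet`, `sum_mass`, `lead_pt`, `deta12`) are illustrative, not the analysis code's.

```python
# Sketch of the benchmark preselection and analysis selection as boolean
# predicates on a single event. Units: GeV for masses and pT.

def passes_preselection(ev):
    return ev["lead_pt"] > 440.0 and ev["n_fatjet"] > 2

def passes_analysis(ev):
    if not passes_preselection(ev):
        return False
    if ev["deta12"] >= 1.4:          # require |delta eta_12| < 1.4
        return False
    return ((ev["n_fatjet"] >= 4 and ev["sum_mass"] > 800.0) or
            (ev["n_fatjet"] >= 5 and ev["sum_mass"] > 600.0))

signal_like = {"n_fatjet": 4, "sum_mass": 900.0, "lead_pt": 500.0, "deta12": 0.5}
qcd_like    = {"n_fatjet": 3, "sum_mass": 300.0, "lead_pt": 450.0, "deta12": 0.3}
```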

Page 24

Interpretation - Feature Maps

- 24 -

[Figures: CNN feature maps for a background QCD event and a signal RPV event]

Page 25

Scaling

• Single-node performance per layer for the 224×224 Caffe implementation
• Time to 0.05 loss (corresponds to a fixed significance)
• At 1024 nodes the time is 11× faster than at 64 nodes (scales as expected), and the hybrid time is 1.66× better than sync

- 25 -

Page 26

Further Work

(with G. Rochette, J. Bruna, G. Louppe, K. Cranmer, NYU)

Exploring graph CNNs:
• Use a list of clusters rather than an image
• Hybrid between graph and CNN
  – Represent clusters as nodes of a graph, with interactions/similarity as edge weights
• Model the interaction, and achieve precision without sparsity

- 26 -
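The graph representation above can be sketched directly: clusters become nodes and each pair gets a similarity edge weight. A Gaussian kernel on the η/φ distance is one common choice and is an assumption here, not the authors' definition (and φ wrap-around is ignored for brevity).

```python
# Sketch: calorimeter clusters as graph nodes with pairwise similarity edge
# weights, here a Gaussian kernel on the eta/phi distance. The kernel and
# its scale are illustrative assumptions.
import math

def edge_weights(clusters, scale=0.5):
    """clusters: list of (eta, phi). Returns a symmetric adjacency matrix."""
    n = len(clusters)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-edges
            deta = clusters[i][0] - clusters[j][0]
            dphi = clusters[i][1] - clusters[j][1]  # phi wrap-around ignored
            adj[i][j] = math.exp(-(deta**2 + dphi**2) / (2 * scale**2))
    return adj

# Two nearby clusters and one far-away cluster.
adj = edge_weights([(0.0, 0.0), (0.1, 0.1), (2.0, -1.0)])
```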