Deep Neural Networks for Physics Analysis on Low-Level Whole-Detector Data at the LHC

Wahid Bhimji, Steve Farrell, Thorsten Kurth, Michela Paganini, Prabhat, Evan Racah
Lawrence Berkeley National Laboratory
ACAT 2017: 21st August 2017
Introduction / Aims

• Use Deep Neural Networks (NNs) on 'raw' data directly for physics analysis:
  – Without reconstruction of physics objects like jets; without tuning of analysis variables; using data from the whole calorimeter/detector
  – Cutting-edge methods: performance and interpretation
• Run efficiently on NERSC supercomputers:
  – Primarily Intel Knights Landing (KNL) Xeon Phi CPU based
  – Distributed training (up to ~10k KNL nodes)
  – Timings, optimisations and recipes
Physics Use-Case

• Search for RPV SUSY gluino decays:
  – Multi-jet final state
  – Analysis from ATLAS-CONF-2016-057 used as a benchmark
  – Classification problem: RPV SUSY vs. QCD
• Simulated samples:
  – Pythia event generation (matching ATLAS config)
    • Cascade decay with m(gluino) = 1400 GeV, m(neutralino) = 850 GeV as default; also explore other masses
  – Delphes detector simulation (ATLAS card)
    • Output calorimeter towers (and tracks) used in the analysis

(Decay diagram from ATLAS-CONF-2016-057)
Data Processing

• Bin calorimeter tower energy in η/φ to form an 'image':
  – 64x64 bins (~0.1 in η/φ, matching tower size) or 224x224
• Also try 3 'channels' (à la RGB images)¹:
  – Energy in the electromagnetic and hadronic calorimeters, and the number of tracks in each bin
• Reconstruct jets with the same algorithm as the physics analysis (anti-kt R=1.0, trimmed) for benchmark comparison and pre-selection

¹ Similar to Komiske, Metodiev, and Schwartz, arXiv:1612.01551
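As a rough sketch of the binning step above (hypothetical tower arrays stand in for the Delphes calorimeter output; the η range is an illustrative assumption, chosen so 64 bins over |η| < 3.2 gives ~0.1-wide cells):

```python
import numpy as np

def towers_to_image(eta, phi, energy, n_bins=64, eta_max=3.2):
    """Histogram tower energies into an n_bins x n_bins eta/phi grid."""
    image, _, _ = np.histogram2d(
        eta, phi, bins=n_bins,
        range=[[-eta_max, eta_max], [-np.pi, np.pi]],
        weights=energy)
    return image

# Example: two hypothetical towers landing in different bins
eta = np.array([0.05, -1.2])
phi = np.array([0.1, 2.0])
energy = np.array([50.0, 120.0])
img = towers_to_image(eta, phi, energy)   # 64x64 'image' of summed energy
```

A three-channel image would simply stack three such grids (EM energy, hadronic energy, track counts) along a new axis.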
Convolutional (CNN) Architecture

• Popular architecture for natural images and now many HEP studies (so not explained here)
  – Learns non-linear 'filters' slid across the image; a shared filter reduces the number of weights
  – Local structure / translational invariance
  – Stacked layers respond to different scales
• We use 3 alternating convolutional and pooling layers (or 4 for large images), with bias and/or batch normalization
• QCD generated in pT ranges:
  – Cross-section weight applied in the training loss and in evaluation

Layer:  Input         → Conv+Pool(1) → Conv+Pool(2) → Conv+Pool(3) → (Conv+Pool(4)) → Fully Connected (FC) → FC  → Output
Shape:  1(or 3)x64x64 → 64x32x32     → 128x8x8      → 256x4x4      →                → 4096                 → 512 → 1

(Architecture figure from Dumoulin, Vincent, and Francesco Visin, arXiv:1603.07285)
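The conv+pool mechanics the slide relies on (a shared filter slid across the image, then pooling) can be sketched in plain NumPy; the tiny sizes below are stand-ins for illustration, not the actual 64x64 network:

```python
import numpy as np

def conv2d_same(x, w, b):
    """'Same'-padded cross-correlation + ReLU. x: (C,H,W); w: (F,C,k,k); b: (F,)."""
    C, H, W = x.shape
    F, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((F, H, W))
    for f in range(F):
        for i in range(H):
            for j in range(W):
                # Same filter w[f] applied at every position: shared weights
                out[f, i, j] = np.sum(xp[:, i:i+k, j:j+k] * w[f]) + b[f]
    return np.maximum(out, 0.0)

def max_pool(x, s):
    """Non-overlapping s x s max pooling. H and W must be divisible by s."""
    C, H, W = x.shape
    return x.reshape(C, H // s, s, W // s, s).max(axis=(2, 4))

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, 8))           # tiny stand-in for a 1x64x64 image
w = rng.standard_normal((4, 1, 3, 3)) * 0.1  # 4 filters shared across the image
b = np.zeros(4)
h = max_pool(conv2d_same(x, w, b), 2)        # one conv+pool stage
```

Stacking three such stages, then flattening into the fully connected layers, reproduces the shape progression in the table above.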
CNN Performance

• Need good signal efficiency (high True Positive Rate, TPR) and high background rejection (low False Positive Rate, FPR)
  – Compare to physics selections (see backup)
  – ROC curve (relative to preselection)
• Increased signal efficiency at the same background rejection, without using jet variables
• Also compare AMS (approximate median significance), accounting for initial pre-selection and luminosity
  – Working points shown: TPR = 0.41, AMS = 2.3 and TPR = 0.77, AMS = 4.2
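A minimal sketch of the two metrics used here; the slide does not spell out its exact AMS definition, so this uses the common regularised form (the b_reg term is an assumption):

```python
import numpy as np

def roc_points(scores_sig, scores_bkg, thresholds):
    """TPR and FPR at each score threshold for a signal-vs-background classifier."""
    tpr = np.array([(scores_sig >= t).mean() for t in thresholds])
    fpr = np.array([(scores_bkg >= t).mean() for t in thresholds])
    return tpr, fpr

def ams(s, b, b_reg=10.0):
    """Approximate median significance for expected signal s and background b.
    b_reg is a regularisation term (illustrative choice, not from the talk)."""
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))
```

In practice s and b would be the cross-section-weighted signal and background yields passing the classifier cut, scaled to the target luminosity.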
Compare to Shallow Classifiers

• Try a (gradient) boosted decision tree (GBDT) and a 1-hidden-layer NN (MLP)
  – Inputs: jet variables used in the physics analysis (sum of jet mass, number of jets, Δη between the leading 2 jets) and the 4-momenta of the first 5 jets
• These outperform the selections, but the CNN performs better
Weights

• Cross-section weights applied in the training loss
  – Some QCD background events weighted 10⁷ over RPV signal
• Try log of weights:
  – More stable implementation
  – More focussed on signal performance
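One hedged way to implement the 'log of weights' idea; the exact transformation and normalisation used in the talk are not specified, so the floor and mean-normalisation here are illustrative choices:

```python
import numpy as np

def log_compress_weights(w, floor=1.0):
    """Compress a huge dynamic range of cross-section weights (up to ~1e7)
    by taking logs, then renormalise so the average weight is 1.
    'floor' keeps the smallest weights strictly positive (illustrative)."""
    lw = np.log(np.asarray(w, dtype=float) + floor)
    return lw / lw.mean()

# Example: a 1e7 weight ratio collapses to roughly 20x after the log
w = np.array([1.0, 1e7])
c = log_compress_weights(w)
```

These compressed values would then be passed as per-event sample weights to the training loss in place of the raw cross-section weights.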
Channels

• Three-channel CNN:
  – Separate energy in the electromagnetic and hadronic calorimeters
  – Number of tracks in the same η/φ bin
• Further improves performance
Further Improving Performance

• The implementation with full weights and the one with log weights focus differently on signal and background
• Can ensemble these by taking the mean of their predictions, which gives the best performance
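A sketch of the ensembling step; the arrays below are hypothetical stand-ins for the two models' per-event signal probabilities:

```python
import numpy as np

def ensemble_predictions(preds):
    """Mean of per-event P(signal) from several trainings
    (here: the full-weight and log-weight models)."""
    return np.mean(np.stack([np.asarray(p, dtype=float) for p in preds]), axis=0)

p_full = [0.90, 0.10, 0.60]   # hypothetical P(signal) from the full-weight model
p_log  = [0.70, 0.30, 0.80]   # hypothetical P(signal) from the log-weight model
p_ens  = ensemble_predictions([p_full, p_log])
```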
Robustness to Different Signals

• Model trained on a specific cascade decay: gluino mass (MGlu) of 1400 GeV and neutralino mass (MNeu) of 850 GeV
• Apply this model to other signal samples without retraining
• Performance remains good
Pileup

• Most studies here use Delphes without pileup
• Repeat with the Delphes pileup card (μ = 20)
• Physics selections have lower background rejection
• CNN still performs well (1-channel CNN shown)
Comparing CNN to Jet Variables

• Plot NN output (P(signal)) vs. the benchmark analysis variable
• Clear correlation
  – (Signal cuts: NJets >= 4/5, MJet >= 800/600 GeV)
• Add the jet variable to the CNN output in a 1-layer NN:
  – Little or no increase in performance
Running at NERSC
NERSC and Cori

• NERSC, at LBL, is the production HPC center for the US Dept. of Energy
  – >7000 diverse users across science domains, including many outside HEP
• Cori, NERSC's newest supercomputer: Cray XC40 (31.4 PF peak)
  – Phase 1: 2388 Intel Haswell nodes: dual 16-core (2.3 GHz), 128 GB DDR4 DRAM
  – Phase 2: 9668 Intel Knights Landing (KNL) nodes: Xeon Phi 68-core (1.4 GHz), 4 hardware threads; AVX-512 vector pipelines; 16 GB MCDRAM, 96 GB DDR4
  – Cray Aries high-speed "dragonfly" topology interconnect
• Many popular deep learning frameworks available:
  – Caffe, Keras, Lasagne, PyTorch, Tensorflow, Theano
  – Working with Intel to improve CPU (KNL) performance
Timing the RPV SUSY CNN

• Implemented the CNN in different frameworks:
  – (Pure) Tensorflow, Keras (Theano and TF), Lasagne (Theano), Caffe
• Aim to drive multi-node Cori CPU performance to be comparable with GPU (for real use-cases):
  – Not aiming for an exact comparison: implementations differ slightly between frameworks, and some have been optimised
• Compare training time (per batch, ignoring I/O) for:
  – GPU: Titan X (Pascal) (10.2 TeraFlops single-precision peak)
  – CPU: Haswell E5-2698 v3, 32 cores @ 2.3 GHz (2.4 TF)
  – KNL: Xeon Phi 7250, 68 cores @ 1.4 GHz (6 TF)
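A minimal sketch of how per-batch training time might be measured while excluding I/O, as described above (not the actual benchmarking harness used in the talk; step_fn stands in for one framework training step):

```python
import time
import numpy as np

def time_per_batch(step_fn, batches, n_warmup=2):
    """Average wall-clock time per training step, ignoring data loading:
    batches are materialised up front so I/O is excluded from the timing."""
    batches = list(batches)            # pre-load all batches: excludes I/O
    for b in batches[:n_warmup]:
        step_fn(b)                     # warm-up steps (graph build, caches)
    t0 = time.perf_counter()
    for b in batches[n_warmup:]:
        step_fn(b)
    return (time.perf_counter() - t0) / max(1, len(batches) - n_warmup)

# Toy usage with a dummy "training step"
t = time_per_batch(lambda b: np.sum(b * b), [np.ones(512)] * 10)
```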
Timings and Tensorflow

• CPU performance of default TF 1.2 is poor
• Intel optimisations with the Intel Math Kernel Library (MKL), e.g. multi-threaded conv layers, vectorisation over channels/filters, and cache blocking
  – Now in the main TF repo
• Further optimisations (released soon), e.g. MKL element-wise operations (avoiding MKL->Eigen conversions)
• (Intel) Caffe has similar optimisations, plus multi-node support via the MLSL library; e.g. scaling to 8 nodes is ~6x faster for this 64x64 network

[Chart: time per batch (s) at batch size 512 for Lasagne+Theano, Keras+Theano, Keras+Tensorflow, Keras+TF(Intel), Keras+TF(Latest) and Caffe, on GPU, CPU-HSW, CPU-KNL and 8-node KNL; times range from 4.6 s (default Keras+TF on KNL) down to 0.06 s (8-node KNL)]
Scaling Up

• Train on 10 million 224x224 3-channel images (7.4 TB)
• Caffe implementation: multi-node, data parallel
  – Uses the Intel MLSL library (wraps communications; portable)
• Sync/Async and Hybrid strategies:
  – Sync: barriers so nodes iterate together; can suffer from straggler nodes, and limits batch size
  – Async: uses parameter servers to scale better; can have stale gradients, so may not converge as fast
  – Hybrid: sync within a group, async across groups, with dedicated parameter servers for each layer of the network
• Modify our CNN layers to reduce communication: remove batch norm and replace the big (~200 MB) fully connected layers with a convolutional layer

Thorsten Kurth, Jian Zhang, Nadathur Satish, Ioannis Mitliagkas, Evan Racah, Mostofa Patwary, Tareq Malas, Narayanan Sundaram, Wahid Bhimji, Mikhail Smorkalov, Jack Deslippe, Mikhail Shiryaev, Srinivas Sridharan, Prabhat, Pradeep Dubey, "Deep Learning at 15PF" (accepted for SC17), arXiv:1708.05256
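The synchronous data-parallel update at the heart of these strategies can be sketched in NumPy; this is a toy stand-in for the MLSL allreduce, not the actual Intel Caffe implementation:

```python
import numpy as np

def sync_sgd_step(params, per_node_grads, lr):
    """One synchronous data-parallel step: every node computes a gradient
    on its own data shard, all gradients are averaged (an allreduce in
    MPI/MLSL terms), and every node applies the identical update."""
    g = np.mean(per_node_grads, axis=0)   # allreduce: average across nodes
    return params - lr * g

# Toy usage: two "nodes" with different local gradients
params = np.zeros(2)
grads = [np.array([1.0, 0.0]), np.array([3.0, 0.0])]
params = sync_sgd_step(params, grads, lr=0.1)
```

The async variant would instead push each node's gradient to a parameter server as it arrives, trading gradient staleness for the removal of the barrier; hybrid mixes the two as described above.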
[Diagram: hybrid architecture — per-layer parameter servers (Layer 1 PS … Layer N PS) exchange model updates and new models with node groups (Group 1 … Group G)]
Scaling Up: Results

• Single node: 1.9 TF (~1/3 of peak)
• Strong scaling (overall batch size fixed):
  – The hybrid approach reduces communication and straggler effects
• Weak scaling (constant batch per node):
  – Good scaling, though affected by variability from communication after the fast convolutional layers
• Scaled to 9600 KNL nodes: 11.73 PF (6170x single-node, single-precision)
• Time to solution (a target loss) also scales: the 1024-node time is 1/11 of the 64-node time

T. Kurth et al., "Deep Learning at 15PF" (accepted for SC17), arXiv:1708.05256

[Plots: weak scaling and strong scaling]
Conclusions

• Implemented a deep CNN directly on large whole-detector 'images' for physics analysis
  – Outperforms physics-variable-based selections (and shallow classifiers) without jet reconstruction
  – Further improvements from adding 3 channels, modifying weights, and ensembling models
  – Network is robust to pileup, applies to other signal masses, and appears to learn the physics of interest
• Used to benchmark and improve popular deep learning libraries on CPU, including Xeon Phi/KNL at NERSC
  – Demonstrated distributed training up to 9600 KNL nodes
Thanks: Ben Nachman and Brian Amadio (LBL) for discussions and physics input; Mustafa Mustafa (LBL) for help with Tensorflow optimisations.

Code and sample datasets will be made available with the proceedings.

Backups
Benchmark Analysis

Fat-jet object selection:
• Anti-kt R=1.0 trimmed (Rtrim = 0.2, pT-frac = 0.05)
• pT > 200 GeV, |η| < 2.0

Preselection:
• Leading fat-jet pT > 440 GeV
• NFat-Jet > 2

Analysis selection:
• |Δη12| between leading 2 fat jets < 1.4
• NFat-Jet >= 4 and Sum MFat-jet > 800 GeV
• or NFat-Jet >= 5 and Sum MFat-jet > 600 GeV
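The cuts above can be sketched as plain predicates; function names and argument conventions are illustrative, and jets are assumed to have already passed the fat-jet object selection:

```python
def passes_preselection(fatjet_pts):
    """Preselection: leading fat-jet pT > 440 GeV and NFat-Jet > 2.
    fatjet_pts: pT values (GeV) of the selected fat jets in the event."""
    return len(fatjet_pts) > 2 and max(fatjet_pts) > 440.0

def passes_signal_region(n_jets, sum_mass, deta12):
    """Analysis selection: |deta12| < 1.4, and either
    (>= 4 fat jets and sum of fat-jet masses > 800 GeV) or
    (>= 5 fat jets and sum of fat-jet masses > 600 GeV)."""
    if abs(deta12) >= 1.4:
        return False
    return (n_jets >= 4 and sum_mass > 800.0) or \
           (n_jets >= 5 and sum_mass > 600.0)
```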
Interpretation: Feature Maps

[Feature-map visualisations for a background QCD event and a signal RPV event]
Scaling

• Single-node performance per layer for the 224x224 Caffe implementation
• Time to a loss of 0.05 (corresponds to a fixed significance)
• At 1024 nodes: time is 1/11 of the 64-node time (scales as expected), and the Hybrid time is 1.66x that of Sync
Further Work

(with G. Rochette, J. Bruna, G. Louppe, K. Cranmer, NYU)

Exploring Graph CNNs:
• Use a list of clusters rather than an image
• Hybrid between graph and CNN:
  – Represent clusters as nodes of a graph, with interactions/similarity as edge weights
• Model the interaction, and achieve precision without sparsity