Deep Neural Networks for Physics Analysis on Low-Level Whole-Detector Data at the LHC

Wahid Bhimji, Steve Farrell, Thorsten Kurth, Michela Paganini, Prabhat, Evan Racah
Lawrence Berkeley National Laboratory
ACAT 2017: 21st August 2017
Introduction / Aims

• Use Deep Neural Networks (NNs) on 'raw' data directly for physics analysis:
  – Without reconstruction of physics objects like jets; without tuning of analysis variables; using data from the whole calorimeter/detector
  – Cutting-edge methods: performance and interpretation
• Run efficiently on NERSC supercomputers:
  – Primarily Intel Knights Landing (KNL) Xeon Phi CPU based
  – Distributed training (up to ~10k KNL nodes)
  – Timings, optimisations and recipes
Physics Use-Case

• Search for RPV SUSY gluino decays:
  – Multi-jet final state
  – Analysis from ATLAS-CONF-2016-057 used as a benchmark
  – Classification problem: RPV SUSY vs. QCD
• Simulated samples:
  – Pythia event generation (matching ATLAS config)
    • Cascade decay with m(gluino) = 1400 GeV, m(neutralino) = 850 GeV as default; also explore other masses
  – Delphes detector simulation (ATLAS card)
    • Output calorimeter towers (and tracks) used in the analysis

(Decay diagram from ATLAS-CONF-2016-057)
Data Processing

• Bin calorimeter tower energy in η/φ to form an 'image':
  – 64x64 bins (~0.1 in η/φ, matching tower size) or 224x224
• Also try 3 'channels' (à la RGB images)¹:
  – Energy in the electromagnetic and hadronic calorimeters, and the number of tracks in each bin
• Reconstruct jets with the same algorithm as the physics analysis (anti-kt R=1.0, trimmed) for benchmark comparison and pre-selection

¹ Similar to Komiske, Metodiev, and Schwartz, arXiv:1612.01551
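As a rough sketch of the binning step above (hypothetical tower arrays stand in for the Delphes calorimeter output; the η range is an illustrative assumption, chosen so 64 bins over |η| < 3.2 gives ~0.1-wide cells):

```python
import numpy as np

def towers_to_image(eta, phi, energy, n_bins=64, eta_max=3.2):
    """Histogram tower energies into an n_bins x n_bins eta/phi grid."""
    image, _, _ = np.histogram2d(
        eta, phi, bins=n_bins,
        range=[[-eta_max, eta_max], [-np.pi, np.pi]],
        weights=energy)
    return image

# Example: two hypothetical towers landing in different bins
eta = np.array([0.05, -1.2])
phi = np.array([0.1, 2.0])
energy = np.array([50.0, 120.0])
img = towers_to_image(eta, phi, energy)   # 64x64 'image' of summed energy
```

A three-channel image would simply stack three such grids (EM energy, hadronic energy, track counts) along a new axis.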
Convolutional (CNN) Architecture

• Popular architecture for natural images and now many HEP studies (so not explained here)
  – Learns non-linear 'filters' slid across the image; a shared filter reduces the number of weights
  – Local structure / translational invariance
  – Stacked layers respond to different scales
• We use 3 alternating convolutional and pooling layers (or 4 for large images), with bias and/or batch normalization
• QCD generated in pT ranges:
  – Cross-section weight applied in the training loss and in evaluation

Layer:  Input         → Conv+Pool(1) → Conv+Pool(2) → Conv+Pool(3) → (Conv+Pool(4)) → Fully Connected (FC) → FC  → Output
Shape:  1(or 3)x64x64 → 64x32x32     → 128x8x8      → 256x4x4      →                → 4096                 → 512 → 1

(Architecture figure from Dumoulin, Vincent, and Francesco Visin, arXiv:1603.07285)
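The conv+pool mechanics the slide relies on (a shared filter slid across the image, then pooling) can be sketched in plain NumPy; the tiny sizes below are stand-ins for illustration, not the actual 64x64 network:

```python
import numpy as np

def conv2d_same(x, w, b):
    """'Same'-padded cross-correlation + ReLU. x: (C,H,W); w: (F,C,k,k); b: (F,)."""
    C, H, W = x.shape
    F, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((F, H, W))
    for f in range(F):
        for i in range(H):
            for j in range(W):
                # Same filter w[f] applied at every position: shared weights
                out[f, i, j] = np.sum(xp[:, i:i+k, j:j+k] * w[f]) + b[f]
    return np.maximum(out, 0.0)

def max_pool(x, s):
    """Non-overlapping s x s max pooling. H and W must be divisible by s."""
    C, H, W = x.shape
    return x.reshape(C, H // s, s, W // s, s).max(axis=(2, 4))

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, 8))           # tiny stand-in for a 1x64x64 image
w = rng.standard_normal((4, 1, 3, 3)) * 0.1  # 4 filters shared across the image
b = np.zeros(4)
h = max_pool(conv2d_same(x, w, b), 2)        # one conv+pool stage
```

Stacking three such stages, then flattening into the fully connected layers, reproduces the shape progression in the table above.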
CNN Performance

• Need good signal efficiency (high True Positive Rate, TPR) and high background rejection (low False Positive Rate, FPR)
  – Compare to physics selections (see backup)
  – ROC curve (relative to preselection)
• Increased signal efficiency at the same background rejection, without using jet variables
• Also compare AMS (approximate median significance), accounting for initial pre-selection and luminosity
  – Working points shown: TPR = 0.41, AMS = 2.3 and TPR = 0.77, AMS = 4.2
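A minimal sketch of the two metrics used here; the slide does not spell out its exact AMS definition, so this uses the common regularised form (the b_reg term is an assumption):

```python
import numpy as np

def roc_points(scores_sig, scores_bkg, thresholds):
    """TPR and FPR at each score threshold for a signal-vs-background classifier."""
    tpr = np.array([(scores_sig >= t).mean() for t in thresholds])
    fpr = np.array([(scores_bkg >= t).mean() for t in thresholds])
    return tpr, fpr

def ams(s, b, b_reg=10.0):
    """Approximate median significance for expected signal s and background b.
    b_reg is a regularisation term (illustrative choice, not from the talk)."""
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))
```

In practice s and b would be the cross-section-weighted signal and background yields passing the classifier cut, scaled to the target luminosity.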
Compare to Shallow Classifiers

• Try a (gradient) boosted decision tree (GBDT) and a 1-hidden-layer NN (MLP)
  – Inputs: jet variables used in the physics analysis (sum of jet mass, number of jets, Δη between the leading 2 jets) and the 4-momenta of the first 5 jets
• These outperform the selections, but the CNN performs better
Weights

• Cross-section weights applied in the training loss
  – Some QCD background events weighted 10⁷ over RPV signal
• Try log of weights:
  – More stable implementation
  – More focussed on signal performance
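One hedged way to implement the 'log of weights' idea; the exact transformation and normalisation used in the talk are not specified, so the floor and mean-normalisation here are illustrative choices:

```python
import numpy as np

def log_compress_weights(w, floor=1.0):
    """Compress a huge dynamic range of cross-section weights (up to ~1e7)
    by taking logs, then renormalise so the average weight is 1.
    'floor' keeps the smallest weights strictly positive (illustrative)."""
    lw = np.log(np.asarray(w, dtype=float) + floor)
    return lw / lw.mean()

# Example: a 1e7 weight ratio collapses to roughly 20x after the log
w = np.array([1.0, 1e7])
c = log_compress_weights(w)
```

These compressed values would then be passed as per-event sample weights to the training loss in place of the raw cross-section weights.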
Channels

• Three-channel CNN:
  – Separate energy in the electromagnetic and hadronic calorimeters
  – Number of tracks in the same η/φ bin
• Further improves performance
Further Improving Performance

• The implementation with full weights and the one with log weights focus differently on signal and background
• Can ensemble these by taking the mean of their predictions, which gives the best performance
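A sketch of the ensembling step; the arrays below are hypothetical stand-ins for the two models' per-event signal probabilities:

```python
import numpy as np

def ensemble_predictions(preds):
    """Mean of per-event P(signal) from several trainings
    (here: the full-weight and log-weight models)."""
    return np.mean(np.stack([np.asarray(p, dtype=float) for p in preds]), axis=0)

p_full = [0.90, 0.10, 0.60]   # hypothetical P(signal) from the full-weight model
p_log  = [0.70, 0.30, 0.80]   # hypothetical P(signal) from the log-weight model
p_ens  = ensemble_predictions([p_full, p_log])
```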
Robustness to Different Signals

• Model trained on a specific cascade decay: gluino mass (MGlu) of 1400 GeV and neutralino mass (MNeu) of 850 GeV
• Apply this model to other signal samples without retraining
• Performance remains good
Pileup

• Most studies here use Delphes without pileup
• Repeat with the Delphes pileup card (μ = 20)
• Physics selections have lower background rejection
• CNN still performs well (1-channel CNN shown)
Comparing CNN to Jet Variables

• Plot NN output (P(signal)) vs. the benchmark analysis variable
• Clear correlation
  – (Signal cuts: NJets >= 4/5, MJet >= 800/600 GeV)
• Add the jet variable to the CNN output in a 1-layer NN:
  – Little or no increase in performance
Running at NERSC
NERSC and Cori

• NERSC, at LBL, is the production HPC center for the US Dept. of Energy
  – >7000 diverse users across science domains, including many outside HEP
• Cori, NERSC's newest supercomputer: Cray XC40 (31.4 PF peak)
  – Phase 1: 2388 Intel Haswell nodes: dual 16-core (2.3 GHz), 128 GB DDR4 DRAM
  – Phase 2: 9668 Intel Knights Landing (KNL) nodes: Xeon Phi 68-core (1.4 GHz), 4 hardware threads; AVX-512 vector pipelines; 16 GB MCDRAM, 96 GB DDR4
  – Cray Aries high-speed "dragonfly" topology interconnect
• Many popular deep learning frameworks available:
  – Caffe, Keras, Lasagne, PyTorch, Tensorflow, Theano
  – Working with Intel to improve CPU (KNL) performance
Timing the RPV SUSY CNN

• Implemented the CNN in different frameworks:
  – (Pure) Tensorflow, Keras (Theano and TF), Lasagne (Theano), Caffe
• Aim to drive multi-node Cori CPU performance to be comparable with GPU (for real use-cases):
  – Not aiming for an exact comparison: implementations differ slightly between frameworks, and some have been optimised
• Compare training time (per batch, ignoring I/O) for:
  – GPU: Titan X (Pascal) (10.2 TeraFlops single-precision peak)
  – CPU: Haswell E5-2698 v3, 32 cores @ 2.3 GHz (2.4 TF)
  – KNL: Xeon Phi 7250, 68 cores @ 1.4 GHz (6 TF)
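A minimal sketch of how per-batch training time might be measured while excluding I/O, as described above (not the actual benchmarking harness used in the talk; step_fn stands in for one framework training step):

```python
import time
import numpy as np

def time_per_batch(step_fn, batches, n_warmup=2):
    """Average wall-clock time per training step, ignoring data loading:
    batches are materialised up front so I/O is excluded from the timing."""
    batches = list(batches)            # pre-load all batches: excludes I/O
    for b in batches[:n_warmup]:
        step_fn(b)                     # warm-up steps (graph build, caches)
    t0 = time.perf_counter()
    for b in batches[n_warmup:]:
        step_fn(b)
    return (time.perf_counter() - t0) / max(1, len(batches) - n_warmup)

# Toy usage with a dummy "training step"
t = time_per_batch(lambda b: np.sum(b * b), [np.ones(512)] * 10)
```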
Timings and Tensorflow

• CPU performance of default TF 1.2 is poor
• Intel optimisations with the Intel Math Kernel Library (MKL), e.g. multi-threaded conv layers, vectorisation over channels/filters, and cache blocking
  – Now in the main TF repo
• Further optimisations (released soon), e.g. MKL element-wise operations (avoiding MKL->Eigen conversions)
• (Intel) Caffe has similar optimisations, plus multi-node support via the MLSL library; e.g. scaling to 8 nodes is ~6x faster for this 64x64 network

[Chart: time per batch (s) at batch size 512 for Lasagne+Theano, Keras+Theano, Keras+Tensorflow, Keras+TF(Intel), Keras+TF(Latest) and Caffe, on GPU, CPU-HSW, CPU-KNL and 8-node KNL; times range from 4.6 s (default Keras+TF on KNL) down to 0.06 s (8-node KNL)]
Scaling Up

• Train on 10 million 224x224 3-channel images (7.4 TB)
• Caffe implementation: multi-node, data parallel
  – Uses the Intel MLSL library (wraps communications; portable)
• Sync/Async and Hybrid strategies:
  – Sync: barriers so nodes iterate together; can suffer from straggler nodes, and limits batch size
  – Async: uses parameter servers to scale better; can have stale gradients, so may not converge as fast
  – Hybrid: sync within a group, async across groups, with dedicated parameter servers for each layer of the network
• Modify our CNN layers to reduce communication: remove batch norm and replace the big (~200 MB) fully connected layers with a convolutional layer

Thorsten Kurth, Jian Zhang, Nadathur Satish, Ioannis Mitliagkas, Evan Racah, Mostofa Patwary, Tareq Malas, Narayanan Sundaram, Wahid Bhimji, Mikhail Smorkalov, Jack Deslippe, Mikhail Shiryaev, Srinivas Sridharan, Prabhat, Pradeep Dubey, "Deep Learning at 15PF" (accepted for SC17), arXiv:1708.05256
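The synchronous data-parallel update at the heart of these strategies can be sketched in NumPy; this is a toy stand-in for the MLSL allreduce, not the actual Intel Caffe implementation:

```python
import numpy as np

def sync_sgd_step(params, per_node_grads, lr):
    """One synchronous data-parallel step: every node computes a gradient
    on its own data shard, all gradients are averaged (an allreduce in
    MPI/MLSL terms), and every node applies the identical update."""
    g = np.mean(per_node_grads, axis=0)   # allreduce: average across nodes
    return params - lr * g

# Toy usage: two "nodes" with different local gradients
params = np.zeros(2)
grads = [np.array([1.0, 0.0]), np.array([3.0, 0.0])]
params = sync_sgd_step(params, grads, lr=0.1)
```

The async variant would instead push each node's gradient to a parameter server as it arrives, trading gradient staleness for the removal of the barrier; hybrid mixes the two as described above.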
[Diagram: hybrid architecture — per-layer parameter servers (Layer 1 PS … Layer N PS) exchange model updates and new models with node groups (Group 1 … Group G)]
Scaling Up: Results

• Single node: 1.9 TF (~1/3 of peak)
• Strong scaling (overall batch size fixed):
  – The hybrid approach reduces communication and straggler effects
• Weak scaling (constant batch per node):
  – Good scaling, though affected by variability from communication after the fast convolutional layers
• Scaled to 9600 KNL nodes: 11.73 PF (6170x single-node, single-precision)
• Time to solution (a target loss) also scales: the 1024-node time is 1/11 of the 64-node time

T. Kurth et al., "Deep Learning at 15PF" (accepted for SC17), arXiv:1708.05256

[Plots: weak scaling and strong scaling]
Conclusions

• Implemented a deep CNN directly on large whole-detector 'images' for physics analysis
  – Outperforms physics-variable-based selections (and shallow classifiers) without jet reconstruction
  – Further improvements from adding 3 channels, modifying weights, and ensembling models
  – Network is robust to pileup, applies to other signal masses, and appears to learn the physics of interest
• Used to benchmark and improve popular deep learning libraries on CPU, including Xeon Phi/KNL at NERSC
  – Demonstrated distributed training up to 9600 KNL nodes
Thanks: Ben Nachman and Brian Amadio (LBL) for discussions and physics input; Mustafa Mustafa (LBL) for help with Tensorflow optimisations.

Code and sample datasets will be made available with the proceedings.

Backups
Benchmark Analysis

Fat-jet object selection:
• Anti-kt R=1.0 trimmed (Rtrim = 0.2, pT-frac = 0.05)
• pT > 200 GeV, |η| < 2.0

Preselection:
• Leading fat-jet pT > 440 GeV
• NFat-Jet > 2

Analysis selection:
• |Δη12| between leading 2 fat jets < 1.4
• NFat-Jet >= 4 and Sum MFat-jet > 800 GeV
• or NFat-Jet >= 5 and Sum MFat-jet > 600 GeV
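The cuts above can be sketched as plain predicates; function names and argument conventions are illustrative, and jets are assumed to have already passed the fat-jet object selection:

```python
def passes_preselection(fatjet_pts):
    """Preselection: leading fat-jet pT > 440 GeV and NFat-Jet > 2.
    fatjet_pts: pT values (GeV) of the selected fat jets in the event."""
    return len(fatjet_pts) > 2 and max(fatjet_pts) > 440.0

def passes_signal_region(n_jets, sum_mass, deta12):
    """Analysis selection: |deta12| < 1.4, and either
    (>= 4 fat jets and sum of fat-jet masses > 800 GeV) or
    (>= 5 fat jets and sum of fat-jet masses > 600 GeV)."""
    if abs(deta12) >= 1.4:
        return False
    return (n_jets >= 4 and sum_mass > 800.0) or \
           (n_jets >= 5 and sum_mass > 600.0)
```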
Interpretation: Feature Maps

[Feature-map visualisations for a background QCD event and a signal RPV event]
Scaling

• Single-node performance per layer for the 224x224 Caffe implementation
• Time to a loss of 0.05 (corresponds to a fixed significance)
• At 1024 nodes: time is 1/11 of the 64-node time (scales as expected), and the Hybrid time is 1.66x that of Sync
Further Work

(with G. Rochette, J. Bruna, G. Louppe, K. Cranmer, NYU)

Exploring Graph CNNs:
• Use a list of clusters rather than an image
• Hybrid between graph and CNN:
  – Represent clusters as nodes of a graph, with interactions/similarity as edge weights
• Model the interaction, and achieve precision without sparsity