TRANSCRIPT
“Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
Luca Canali (CERN), Vaggelis Motesnitsalis (CERN), Oliver Gutsche (Fermilab)
IXPUG Annual Conference 2019, September 24th, 2019
Experimental Particle Physics - the Journey
Particle Collisions → Large Scale Computing → Physics Discoveries
Large Scale Computing
Device / Simulation → RAW Data → Algorithms to reconstruct data → RECO Data → Analysis software → Plots
Analysis in CMS
• Hundreds of physicists analyze the data with different goals at the same time
[Diagram: the same Device / Simulation → RAW Data → reconstruction → RECO Data → analysis software → plots chain]
Analysis: A Multi-Step Process
• Minimize Time to Insight
• Analysis is a conversation with data - interactivity is key
• Many different physics topics are concurrently under investigation
  • Different slices of data are relevant for each analysis
• Programmatically, the analysis steps are the same (a minimal sketch follows after this list):
  • Skimming (dropping events in a disk-to-disk copy)
  • Slimming (dropping branches in a disk-to-disk copy)
  • Filtering (selectively reading events into memory)
  • Pruning (selectively reading branches into memory)
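These reduction steps map naturally onto DataFrame operations. A minimal, hedged PySpark sketch (the input/output paths and column names such as nMuon and Muon_pt are purely illustrative, not the actual CMS schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduction-sketch").getOrCreate()

# Hypothetical input: a columnar copy of RECO-level events
events = spark.read.parquet("/data/reco_events.parquet")

reduced = (events
           .filter("nMuon >= 2")                        # skimming / filtering: keep only interesting events
           .select("Muon_pt", "Muon_eta", "Muon_phi"))  # slimming / pruning: keep only the needed branches

# Disk-to-disk copy of the reduced dataset (an analysis ntuple)
reduced.write.mode("overwrite").parquet("/data/reduced_ntuple.parquet")
```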
Big Data
• New toolkits and systems collectively called “Big Data” technologies have emerged to support the analysis of PB and EB datasets in industry.
• Our goals in applying these technologies to the HEP analysis challenge:
  • Reduce Time to Insight
  • Educate our graduate students and post-docs to use industry-based technologies
    • Improves their chances on the job market outside academia
    • Increases the attractiveness of our field
  • Be part of an even larger community
Bridging the Gap
• Physics analysis is typically done with the ROOT framework, which uses physics data saved in ROOT format files. At CERN these files are stored in the EOS Storage Service.
[Diagram: EOS Storage Service → 1. access data → 2. read format → 3. visualize]
CMS Data Reduction and Analysis Facility
• CERN openlab / Intel project
• Demonstrate reduction capabilities producing analysis ntuples using Apache Spark
• Demonstrator’s goal: reduce 1 PB input in 5 hours
Milestones and Achievements
• We solved two important data engineering challenges:
1. Read files in ROOT Format using Spark
2. Access files stored in EOS directly from Hadoop/Spark
• This enabled us to produce, scale up, and optimize Physics Analysis Workloads with data input up to 1 PB.
[Architecture diagram: the data reduction workload runs on the Hadoop/Spark clusters and accesses physics data from the EOS Storage Service]
Scalability Tests
• The data processing job of this project was developed in Scala by CMS members.
• Performs event selection (i.e. Data Reduction)
• Uses the filtered events to compute the dimuon invariant mass
• On a single thread/core with a single file as input, the workload reads one branch and calculates the dimuon invariant mass in approximately 10 minutes for a 4 GB file
[Diagram: test workload architecture and file-task mapping. The Spark driver on the IT Hadoop and Spark Service (analytix) distributes the dimuon invariant mass calculation across executor tasks, each reading a ROOT file from the EOS Storage Service]
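For context, the dimuon invariant mass used by this workload follows the standard formula in the massless-muon approximation, M = sqrt(2 * pt1 * pt2 * (cosh(eta1 - eta2) - cos(phi1 - phi2))). A small Python sketch of that calculation (not the project's Scala code; the example values are illustrative):

```python
from math import sqrt, cosh, cos, pi

def dimuon_invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of a muon pair, neglecting the muon rest mass:
    M^2 = 2 * pt1 * pt2 * (cosh(eta1 - eta2) - cos(phi1 - phi2))."""
    return sqrt(2.0 * pt1 * pt2 * (cosh(eta1 - eta2) - cos(phi1 - phi2)))

# Two roughly back-to-back 45 GeV muons give a mass close to the Z peak (~91 GeV)
print(dimuon_invariant_mass(45.0, 0.1, 0.0, 45.0, -0.1, pi))
```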
Scalability Tests: Technology
• Apache Spark
• Hadoop YARN
• Kubernetes and OpenStack
• Collaborated with Intel on Spark job optimizations, using Intel CoFluent cluster simulation technology
• Services/tools used:
  • EOS Public, CERN open data
  • Hadoop-XRootD Connector (allows Spark to access the CERN EOS storage system)
  • spark-root (Spark data source for the ROOT format)
  • sparkMeasure (Spark instrumentation)
  • Spark on Kubernetes Service
• Issues that we tackled:
  • Network bottleneck at scale: “readAhead” buffer size configuration of the Hadoop-XRootD connector
  • Running tests on shared clusters and shared infrastructure in the IT data centre
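As an illustration of how sparkMeasure is used to collect workload metrics such as those reported below, a minimal sketch (the measured query is a placeholder for the actual reduction job, and `spark` is an existing SparkSession):

```python
from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)
stagemetrics.begin()

# Placeholder workload; the real job runs the dimuon event selection
spark.sql("SELECT count(*) FROM events").show()

stagemetrics.end()
stagemetrics.print_report()   # aggregated stage metrics: run time, executor CPU time, I/O, GC, ...
```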
Hadoop and Spark Clusters at CERN
• Clusters:
• YARN/Hadoop
• Spark on Kubernetes
• Hardware: Intel based servers, continuous refresh and capacity expansion
• Accelerator logging (part of LHC infrastructure): Hadoop - YARN, 30 nodes (800 cores, 13 TB memory, 7.5 PB storage)
• General purpose: Hadoop - YARN, 65 nodes (1.3k cores, 20 TB memory, 12.5 PB storage)
• Cloud containers: Kubernetes on OpenStack VMs (250 cores, 2 TB memory)
• Storage: remote HDFS or EOS (for physics data)
Scalability Tests – Optimization Results
• Key workload metrics and time spent,
measured with Spark custom instrumentation
for 1 PB of input with 804 logical cores, 8
logical cores per Spark executor
Metric
Name
Total Time Spent (Sum
Over al Executors)% (Comparedto Execution Time)
Total
Execution
Time
~3000 - 3500 hours 1
CPU
Time
~1200 hours 40%
EOS Read
Time
~1200 - 1800 hours,
depending on
readAhead size
40-50%
Garbage
Collection
Time
~200 hours 7-8 % • Read Throughput in GB/s
• Measure throughout during job execution for 1 PB of input with, 100 Spark executors, each using 8 logical cores.
Scalability Tests - Results
• Performance and scalability of the tests for different input sizes (run time in minutes), with 800 logical cores and 8 logical cores per Spark executor
[Chart: data reduction job run time (minutes) vs. input data size]
  Input Data | Time for EOS Public
  22 TB      | 7.3 mins
  44 TB      | 11.9 mins
  110 TB     | 27 mins (±2)
  220 TB     | 59 mins (±5)
  1 PB       | 228 mins (±10) (~3.8 hours)
• Can we reduce 1 PB in 5 hours (the original project milestone)? YES.
• We even dropped to 4 hours in our latest tests
Machine Learning Use Case
Deep Learning Pipeline for Physics Data
[Diagram: particle classifier assigning events to W + jets, QCD and tt̄ categories]
• R&D to improve the quality of filtering systems
• Develop a “Deep Learning classifier” to be used by the filtering system
• Goal: reduce false positives -> do not store or process uninteresting events
• “Topology classification with deep learning to improve real-time event selection at the
LHC”, Nguyen et al. Comput.Softw.Big Sci. 3 (2019) no.1, 12
Engineering Efforts to Enable Effective ML
• From “Hidden Technical Debt in Machine Learning Systems”, D. Sculley et al. (Google), paper at NIPS 2015
Analytics Platform at CERN
• Integrating new “Big Data” components with existing infrastructure:
  • Software distribution
  • Data platforms
[Diagram: platform components: HEP software, experiments storage, HDFS, personal storage; notebooks combining code, monitoring and visualizations]
Analytics with SWAN
All the required tools, software and data available in a single window!
Extending Spark to Read Physics Data
• Physics data is stored in the EOS system, accessible via the XRootD protocol: we extended the Hadoop HDFS APIs
• Data is stored in ROOT format: we developed a Spark data source
• Currently ~300 PB, growing by >50 PB/year
• https://github.com/cerndb/hadoop-xrootd
• https://github.com/diana-hep/spark-root
[Diagram: the Hadoop-XRootD Connector bridges the Hadoop HDFS API (Java) and the XRootD client (C++) via JNI to reach the EOS Storage Service]
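A minimal sketch of reading ROOT data from EOS in PySpark, combining the two components above (the EOS path is only an example, and both packages must be available on the Spark classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-physics-data").getOrCreate()

# spark-root exposes ROOT TTrees as Spark DataFrames;
# the Hadoop-XRootD connector lets Spark resolve root:// URLs to EOS.
df = (spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eospublic.cern.ch//eos/opendata/cms/some_dataset/file.root"))

df.printSchema()
```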
Deep Learning Pipeline for Physics Data
• Data Ingestion: read physics data and perform feature engineering
• Feature Preparation: prepare input for the deep learning network
• Model Development: 1. specify the model topology, 2. tune the model topology on a small dataset
• Training: train the best model
Built with Apache Spark + Analytics Zoo + Python Notebooks
The Dataset
● Software simulators generate events and calculate the detector response
● Every event is an 801x19 matrix: for every particle, the momentum, position, energy, charge and particle type are given
Data Ingestion
• Read input files (4.5 TB) from ROOT format
• Compute physics-motivated features
• Store to parquet format
• 54 M events: 4.5 TB read from the physics data storage, 750 GB written to HDFS
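A hedged PySpark sketch of this ingestion step (paths and the derived column are illustrative; the real job computes the physics-motivated features described on the following slides):

```python
from pyspark.sql import functions as F

# Read simulated events in ROOT format (example path), derive a feature, persist as Parquet on HDFS
raw = (spark.read
       .format("org.dianahep.sparkroot")
       .load("root://eospublic.cern.ch//eos/opendata/cms/simulated/*.root"))

ingested = raw.withColumn("n_particles", F.size("Particle_pt"))   # illustrative derived feature

ingested.write.mode("overwrite").parquet("hdfs:///project/ml/events.parquet")
```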
Feature Engineering
• From the 19 features recorded in the experiment, 14 more are calculated based on domain-specific knowledge: these are called High Level Features (HLF)
• Order the sequence of particles to be fed to a sequence-based classifier
  • The final sequence is ordered using custom Python code implementing physics knowledge
Feature Preparation
• All features need to be converted to a format consumable by the neural network
• One-hot encoding of categories
• Sort the particles for the sequence classifier with a UDF
• Executed in PySpark using Spark SQL and ML (a minimal sketch follows below)
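A hedged PySpark sketch of these two operations (the particle categories, column names and the pT-based ordering are illustrative choices, not necessarily the ones used in the actual pipeline):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

PARTICLE_CLASSES = ["electron", "muon", "photon", "charged_hadron", "neutral_hadron"]

@F.udf(ArrayType(DoubleType()))
def one_hot(particle_type):
    # One-hot encoding of the particle-type category
    return [1.0 if particle_type == c else 0.0 for c in PARTICLE_CLASSES]

@F.udf(ArrayType(ArrayType(DoubleType())))
def sort_particles(particles):
    # Order each event's particle list for the sequence classifier
    # (here simply by decreasing pT, assumed to be the first feature)
    return sorted(particles, key=lambda p: -p[0])

events = spark.read.parquet("hdfs:///project/ml/events.parquet")
prepared = (events
            .withColumn("type_encoded", one_hot("particle_type"))
            .withColumn("particle_seq", sort_particles("particles")))
```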
Models Investigated
1. Fully connected feed-forward DNN with High Level Features
2. DNN with a recurrent layer (based on GRUs)
3. Combination of (1) + (2)
[Diagram: the three models ordered by increasing complexity and performance]
Hyper-Parameter Tuning - DNN
• Once the network topology is chosen, hyper-parameter tuning is done with scikit-learn + Keras and parallelized with Spark (a minimal sketch below)
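One possible way to parallelize such a grid search with Spark, sketched with TF 2.x Keras on a small in-memory sample (the grid values, layer sizes and random stand-in data are all illustrative; this is not necessarily the exact mechanism used in the project):

```python
import numpy as np
from itertools import product

# Small stand-in tuning sample (random data for illustration only)
X = np.random.rand(1000, 14).astype("float32")
y = np.random.randint(0, 3, size=1000)

def train_and_score(params):
    from tensorflow import keras          # imported inside so it also runs on executors
    learning_rate, units = params
    model = keras.Sequential([
        keras.layers.Dense(units, activation="relu", input_shape=(14,)),
        keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(X, y, epochs=3, batch_size=128, validation_split=0.2, verbose=0)
    return params, history.history["val_accuracy"][-1]

grid = list(product([1e-3, 1e-4], [50, 100]))            # learning rate x hidden units
results = (spark.sparkContext
           .parallelize(grid, numSlices=len(grid))
           .map(train_and_score)
           .collect())
print(max(results, key=lambda r: r[1]))
```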
Analytics Zoo & BigDL
• Analytics Zoo is a platform for unified analytics and AI on Apache Spark leveraging BigDL / TensorFlow
  • For service developers: integration with the existing distributed and scalable analytics infrastructure (hardware, data access, data processing, configuration and operations)
  • For users: Keras APIs to run user models, integration with Spark data structures and pipelines
• BigDL is a distributed deep learning framework for Apache Spark
Model Development - DNN
• The model is instantiated with the Keras-compatible API provided by Analytics Zoo (a minimal sketch below)
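A minimal sketch of what such a model definition looks like with the Analytics Zoo Keras-style API (the layer sizes and the 14 HLF inputs / 3 output classes are illustrative):

```python
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense

# Fully connected feed-forward classifier on the High Level Features
model = Sequential()
model.add(Dense(50, activation="relu", input_shape=(14,)))
model.add(Dense(20, activation="relu"))
model.add(Dense(10, activation="relu"))
model.add(Dense(3, activation="softmax"))
```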
Model Development - GRU + HLF
• A more complex topology for the network
Distributed Training
• Instantiate the estimator using Analytics Zoo / BigDL
• The actual training is distributed to Spark executors
• Store the model for later use (a minimal sketch below)
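A hedged sketch of that pattern using the Analytics Zoo NNEstimator (column names, feature/label sizes, criterion and training settings are illustrative and depend on the Zoo/BigDL version in use):

```python
from zoo.pipeline.nnframes import NNEstimator
from bigdl.nn.criterion import CategoricalCrossEntropy

# Wrap the Keras-style model defined above into a Spark ML estimator;
# [14] and [3] are illustrative feature/label sizes for the HLF classifier.
estimator = (NNEstimator(model, CategoricalCrossEntropy(), [14], [3])
             .setBatchSize(256)
             .setMaxEpoch(12)
             .setFeaturesCol("hlf_features")
             .setLabelCol("label"))

# fit() distributes the training over the Spark executors;
# the returned NNModel can then be stored for later use.
trained = estimator.fit(train_df)
```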
Performance and Scalability of Analytics Zoo & BigDL
• Analytics Zoo & BigDL scale very well in the ranges tested
Results
• Trained models with Analytics Zoo and BigDL
• Met the expected accuracy results
TensorFlow on Kubernetes
• Additional results using TensorFlow 2.0 on Kubernetes
• CERN Cloud on OpenStack
• We developed integration for distributed TensorFlow on Kubernetes: https://github.com/cerndb/tf-spawner
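A minimal sketch of the worker-side code for such a setup with TF 2.0 (tf-spawner, or whatever launches the pods, is expected to set TF_CONFIG for each worker; the model and dataset details are illustrative and omitted):

```python
import tensorflow as tf

# Each worker pod runs this script; TF_CONFIG describes the cluster topology.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(50, activation="relu", input_shape=(14,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Each worker then reads its shard of the training data and calls, e.g.:
# model.fit(train_dataset, epochs=5)
```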
Machine Learning with Spark and Keras
[Pipeline diagram: labeled data and DL models from researchers (input) → feature engineering at scale → hyperparameter optimization (random/grid search) → distributed model training with TF 2.0 → particle selector model (output)]
Conclusions
• Spark and “Big Data”-based analysis platforms can improve High Energy Physics data pipelines
  • Industry-standard APIs
  • Run natively on “data lakes” and cloud
  • Benefit from large communities in industry and open source
• Two use cases developed
  • CMS data reduction at scale with Apache Spark
  • Deep learning pipeline with Spark + BigDL and TensorFlow
• Analytics platform at CERN
  • Open for access to the CERN community, notably users in Physics, Beams and Accelerators, and IT
Acknowledgments
• CERN openlab: Riccardo Castellotti, Michał Bień, Viktor Khristenko, Maria Girone
• CERN Spark and Hadoop service
• CMS, Big Data team: Matteo Cremonesi, Jim Pivarski
• CMS, University of Padova: Matteo Migliorini, Marco Zanetti
• Intel team for BigDL and Analytics Zoo: Jiao (Jennie) Wang, Sajan Govindan
• References:
  • Using Big Data Technologies for HEP Analysis, https://doi.org/10.1051/epjconf/201921406030
  • Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, http://arxiv.org/abs/1909.10389