TRANSCRIPT
“Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
Luca Canali (CERN), Vaggelis Motesnitsalis (CERN), Oliver Gutsche (Fermilab)
IXPUG Annual Conference 2019, September 24th, 2019
Experimental Particle Physics - the Journey
Particle Collisions → Large Scale Computing → Physics Discoveries
Large Scale Computing
Device / Simulation → RAW Data → Algorithms to reconstruct data → RECO Data → Analysis software → Plots
Analysis in CMS
• Hundreds of physicists analyze the data with different goals at the same time
[Diagram: the same Device / Simulation → RAW Data → reconstruction → RECO Data → analysis software → plots chain]
Analysis: A Multi-Step Process
• Minimize Time to Insight
• Analysis is a conversation with data - interactivity is key
• Many different physics topics are concurrently under investigation
  • Different slices of data are relevant for each analysis
• Programmatically, the analysis steps are the same (a minimal sketch follows after this list):
  • Skimming (dropping events in a disk-to-disk copy)
  • Slimming (dropping branches in a disk-to-disk copy)
  • Filtering (selectively reading events into memory)
  • Pruning (selectively reading branches into memory)
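These reduction steps map naturally onto DataFrame operations. A minimal, hedged PySpark sketch (the input/output paths and column names such as nMuon and Muon_pt are purely illustrative, not the actual CMS schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduction-sketch").getOrCreate()

# Hypothetical input: a columnar copy of RECO-level events
events = spark.read.parquet("/data/reco_events.parquet")

reduced = (events
           .filter("nMuon >= 2")                        # skimming / filtering: keep only interesting events
           .select("Muon_pt", "Muon_eta", "Muon_phi"))  # slimming / pruning: keep only the needed branches

# Disk-to-disk copy of the reduced dataset (an analysis ntuple)
reduced.write.mode("overwrite").parquet("/data/reduced_ntuple.parquet")
```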
Big Data
• New toolkits and systems collectively called “Big Data” technologies have emerged to support the analysis of PB and EB datasets in industry.
• Our goals in applying these technologies to the HEP analysis challenge:
  • Reduce Time to Insight
  • Educate our graduate students and post-docs to use industry-based technologies
    • Improves their chances on the job market outside academia
    • Increases the attractiveness of our field
  • Be part of an even larger community
Bridging the Gap
• Physics analysis is typically done with the ROOT framework, which uses physics data saved in ROOT format files. At CERN these files are stored in the EOS Storage Service.
[Diagram: EOS Storage Service → 1. access data → 2. read format → 3. visualize]
CMS Data Reduction and Analysis Facility
• CERN openlab / Intel project
• Demonstrate reduction capabilities producing analysis ntuples using Apache Spark
• Demonstrator’s goal: reduce 1 PB input in 5 hours
Milestones and Achievements
• We solved two important data engineering challenges:
1. Read files in ROOT Format using Spark
2. Access files stored in EOS directly from Hadoop/Spark
• This enabled us to produce, scale up, and optimize Physics Analysis Workloads with data input up to 1 PB.
[Architecture diagram: the data reduction workload runs on the Hadoop/Spark clusters and accesses physics data from the EOS Storage Service]
Scalability Tests
• The data processing job of this project was developed in Scala by CMS members.
• Performs event selection (i.e. Data Reduction)
• Uses the filtered events to compute the dimuon invariant mass
• On a single thread/core with a single file as input, the workload reads one branch and calculates the dimuon invariant mass in approximately 10 minutes for a 4 GB file
[Diagram: test workload architecture and file-task mapping. The Spark driver on the IT Hadoop and Spark Service (analytix) distributes the dimuon invariant mass calculation across executor tasks, each reading a ROOT file from the EOS Storage Service]
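For context, the dimuon invariant mass used by this workload follows the standard formula in the massless-muon approximation, M = sqrt(2 * pt1 * pt2 * (cosh(eta1 - eta2) - cos(phi1 - phi2))). A small Python sketch of that calculation (not the project's Scala code; the example values are illustrative):

```python
from math import sqrt, cosh, cos, pi

def dimuon_invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of a muon pair, neglecting the muon rest mass:
    M^2 = 2 * pt1 * pt2 * (cosh(eta1 - eta2) - cos(phi1 - phi2))."""
    return sqrt(2.0 * pt1 * pt2 * (cosh(eta1 - eta2) - cos(phi1 - phi2)))

# Two roughly back-to-back 45 GeV muons give a mass close to the Z peak (~91 GeV)
print(dimuon_invariant_mass(45.0, 0.1, 0.0, 45.0, -0.1, pi))
```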
Scalability Tests: Technology
• Apache Spark
• Hadoop YARN
• Kubernetes and OpenStack
• Collaborated with Intel on Spark job optimizations, using Intel CoFluent cluster simulation technology
• Services/tools used:
  • EOS Public, CERN open data
  • Hadoop-XRootD Connector (allows Spark to access the CERN EOS storage system)
  • spark-root (Spark data source for the ROOT format)
  • sparkMeasure (Spark instrumentation)
  • Spark on Kubernetes Service
• Issues that we tackled:
  • Network bottleneck at scale: “readAhead” buffer size configuration of the Hadoop-XRootD connector
  • Running tests on shared clusters and shared infrastructure in the IT data centre
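As an illustration of how sparkMeasure is used to collect workload metrics such as those reported below, a minimal sketch (the measured query is a placeholder for the actual reduction job, and `spark` is an existing SparkSession):

```python
from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)
stagemetrics.begin()

# Placeholder workload; the real job runs the dimuon event selection
spark.sql("SELECT count(*) FROM events").show()

stagemetrics.end()
stagemetrics.print_report()   # aggregated stage metrics: run time, executor CPU time, I/O, GC, ...
```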
Hadoop and Spark Clusters at CERN
• Clusters:
• YARN/Hadoop
• Spark on Kubernetes
• Hardware: Intel based servers, continuous refresh and capacity expansion
• Accelerator logging (part of LHC infrastructure): Hadoop - YARN, 30 nodes (800 cores, 13 TB memory, 7.5 PB storage)
• General purpose: Hadoop - YARN, 65 nodes (1.3k cores, 20 TB memory, 12.5 PB storage)
• Cloud containers: Kubernetes on OpenStack VMs (250 cores, 2 TB memory)
• Storage: remote HDFS or EOS (for physics data)
Scalability Tests – Optimization Results
• Key workload metrics and time spent,
measured with Spark custom instrumentation
for 1 PB of input with 804 logical cores, 8
logical cores per Spark executor
Metric
Name
Total Time Spent (Sum
Over al Executors)% (Comparedto Execution Time)
Total
Execution
Time
~3000 - 3500 hours 1
CPU
Time
~1200 hours 40%
EOS Read
Time
~1200 - 1800 hours,
depending on
readAhead size
40-50%
Garbage
Collection
Time
~200 hours 7-8 % • Read Throughput in GB/s
• Measure throughout during job execution for 1 PB of input with, 100 Spark executors, each using 8 logical cores.
Scalability Tests - Results
• Performance and scalability of the tests for different input sizes (run time in minutes), with 800 logical cores and 8 logical cores per Spark executor
[Chart: data reduction job run time (minutes) vs. input data size]
  Input Data | Time for EOS Public
  22 TB      | 7.3 mins
  44 TB      | 11.9 mins
  110 TB     | 27 mins (±2)
  220 TB     | 59 mins (±5)
  1 PB       | 228 mins (±10) (~3.8 hours)
• Can we reduce 1 PB in 5 hours (the original project milestone)? YES.
• We even dropped to 4 hours in our latest tests
Machine Learning Use Case
Deep Learning Pipeline for Physics Data
[Diagram: particle classifier assigning events to W + jets, QCD and tt̄ categories]
• R&D to improve the quality of filtering systems
• Develop a “Deep Learning classifier” to be used by the filtering system
• Goal: reduce false positives -> do not store or process uninteresting events
• “Topology classification with deep learning to improve real-time event selection at the
LHC”, Nguyen et al. Comput.Softw.Big Sci. 3 (2019) no.1, 12
Engineering Efforts to Enable Effective ML
• From “Hidden Technical Debt in Machine Learning Systems”, D. Sculley et al. (Google), paper at NIPS 2015
Analytics Platform at CERN
• Integrating new “Big Data” components with existing infrastructure:
  • Software distribution
  • Data platforms
[Diagram: platform components: HEP software, experiments storage, HDFS, personal storage; notebooks combining code, monitoring and visualizations]
Analytics with SWAN
All the required tools, software and data available in a single window!
Extending Spark to Read Physics Data
• Physics data is stored in the EOS system, accessible via the XRootD protocol: we extended the Hadoop HDFS APIs
• Data is stored in ROOT format: we developed a Spark data source
• Currently ~300 PB, growing by >50 PB/year
• https://github.com/cerndb/hadoop-xrootd
• https://github.com/diana-hep/spark-root
[Diagram: the Hadoop-XRootD Connector bridges the Hadoop HDFS API (Java) and the XRootD client (C++) via JNI to reach the EOS Storage Service]
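A minimal sketch of reading ROOT data from EOS in PySpark, combining the two components above (the EOS path is only an example, and both packages must be available on the Spark classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-physics-data").getOrCreate()

# spark-root exposes ROOT TTrees as Spark DataFrames;
# the Hadoop-XRootD connector lets Spark resolve root:// URLs to EOS.
df = (spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eospublic.cern.ch//eos/opendata/cms/some_dataset/file.root"))

df.printSchema()
```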
Deep Learning Pipeline for Physics Data
• Data Ingestion: read physics data and perform feature engineering
• Feature Preparation: prepare input for the deep learning network
• Model Development: 1. specify the model topology, 2. tune the model topology on a small dataset
• Training: train the best model
Built with Apache Spark + Analytics Zoo + Python Notebooks
The Dataset
● Software simulators generate events and calculate the detector response
● Every event is an 801x19 matrix: for every particle, the momentum, position, energy, charge and particle type are given
Data Ingestion
• Read input files (4.5 TB) from ROOT format
• Compute physics-motivated features
• Store to parquet format
• 54 M events: 4.5 TB read from the physics data storage, 750 GB written to HDFS
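A hedged PySpark sketch of this ingestion step (paths and the derived column are illustrative; the real job computes the physics-motivated features described on the following slides):

```python
from pyspark.sql import functions as F

# Read simulated events in ROOT format (example path), derive a feature, persist as Parquet on HDFS
raw = (spark.read
       .format("org.dianahep.sparkroot")
       .load("root://eospublic.cern.ch//eos/opendata/cms/simulated/*.root"))

ingested = raw.withColumn("n_particles", F.size("Particle_pt"))   # illustrative derived feature

ingested.write.mode("overwrite").parquet("hdfs:///project/ml/events.parquet")
```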
Feature Engineering
• From the 19 features recorded in the experiment, 14 more are calculated based on domain-specific knowledge: these are called High Level Features (HLF)
• Order the sequence of particles to be fed to a sequence-based classifier
  • The final sequence is ordered using custom Python code implementing physics knowledge
Feature Preparation
• All features need to be converted to a format consumable by the neural network
• One-hot encoding of categories
• Sort the particles for the sequence classifier with a UDF
• Executed in PySpark using Spark SQL and ML (a minimal sketch follows below)
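A hedged PySpark sketch of these two operations (the particle categories, column names and the pT-based ordering are illustrative choices, not necessarily the ones used in the actual pipeline):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

PARTICLE_CLASSES = ["electron", "muon", "photon", "charged_hadron", "neutral_hadron"]

@F.udf(ArrayType(DoubleType()))
def one_hot(particle_type):
    # One-hot encoding of the particle-type category
    return [1.0 if particle_type == c else 0.0 for c in PARTICLE_CLASSES]

@F.udf(ArrayType(ArrayType(DoubleType())))
def sort_particles(particles):
    # Order each event's particle list for the sequence classifier
    # (here simply by decreasing pT, assumed to be the first feature)
    return sorted(particles, key=lambda p: -p[0])

events = spark.read.parquet("hdfs:///project/ml/events.parquet")
prepared = (events
            .withColumn("type_encoded", one_hot("particle_type"))
            .withColumn("particle_seq", sort_particles("particles")))
```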
Models Investigated
1. Fully connected feed-forward DNN with High Level Features
2. DNN with a recurrent layer (based on GRUs)
3. Combination of (1) + (2)
[Diagram: the three models ordered by increasing complexity and performance]
Hyper-Parameter Tuning - DNN
• Once the network topology is chosen, hyper-parameter tuning is done with scikit-learn + Keras and parallelized with Spark (a minimal sketch below)
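One possible way to parallelize such a grid search with Spark, sketched with TF 2.x Keras on a small in-memory sample (the grid values, layer sizes and random stand-in data are all illustrative; this is not necessarily the exact mechanism used in the project):

```python
import numpy as np
from itertools import product

# Small stand-in tuning sample (random data for illustration only)
X = np.random.rand(1000, 14).astype("float32")
y = np.random.randint(0, 3, size=1000)

def train_and_score(params):
    from tensorflow import keras          # imported inside so it also runs on executors
    learning_rate, units = params
    model = keras.Sequential([
        keras.layers.Dense(units, activation="relu", input_shape=(14,)),
        keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(X, y, epochs=3, batch_size=128, validation_split=0.2, verbose=0)
    return params, history.history["val_accuracy"][-1]

grid = list(product([1e-3, 1e-4], [50, 100]))            # learning rate x hidden units
results = (spark.sparkContext
           .parallelize(grid, numSlices=len(grid))
           .map(train_and_score)
           .collect())
print(max(results, key=lambda r: r[1]))
```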
Analytics Zoo & BigDL
• Analytics Zoo is a platform for unified analytics and AI on Apache Spark leveraging BigDL / TensorFlow
  • For service developers: integration with the existing distributed and scalable analytics infrastructure (hardware, data access, data processing, configuration and operations)
  • For users: Keras APIs to run user models, integration with Spark data structures and pipelines
• BigDL is a distributed deep learning framework for Apache Spark
Model Development - DNN
• The model is instantiated with the Keras-compatible API provided by Analytics Zoo (a minimal sketch below)
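A minimal sketch of what such a model definition looks like with the Analytics Zoo Keras-style API (the layer sizes and the 14 HLF inputs / 3 output classes are illustrative):

```python
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense

# Fully connected feed-forward classifier on the High Level Features
model = Sequential()
model.add(Dense(50, activation="relu", input_shape=(14,)))
model.add(Dense(20, activation="relu"))
model.add(Dense(10, activation="relu"))
model.add(Dense(3, activation="softmax"))
```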
Model Development - GRU + HLF
• A more complex topology for the network
Distributed Training
• Instantiate the estimator using Analytics Zoo / BigDL
• The actual training is distributed to Spark executors
• Store the model for later use (a minimal sketch below)
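A hedged sketch of that pattern using the Analytics Zoo NNEstimator (column names, feature/label sizes, criterion and training settings are illustrative and depend on the Zoo/BigDL version in use):

```python
from zoo.pipeline.nnframes import NNEstimator
from bigdl.nn.criterion import CategoricalCrossEntropy

# Wrap the Keras-style model defined above into a Spark ML estimator;
# [14] and [3] are illustrative feature/label sizes for the HLF classifier.
estimator = (NNEstimator(model, CategoricalCrossEntropy(), [14], [3])
             .setBatchSize(256)
             .setMaxEpoch(12)
             .setFeaturesCol("hlf_features")
             .setLabelCol("label"))

# fit() distributes the training over the Spark executors;
# the returned NNModel can then be stored for later use.
trained = estimator.fit(train_df)
```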
Performance and Scalability of Analytics Zoo & BigDL
• Analytics Zoo & BigDL scale very well in the ranges tested
Results
• Trained models with Analytics Zoo and BigDL
• Met the expected accuracy results
TensorFlow on Kubernetes
• Additional results using TensorFlow 2.0 on Kubernetes
• CERN Cloud on OpenStack
• We developed integration for distributed TensorFlow on Kubernetes: https://github.com/cerndb/tf-spawner
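A minimal sketch of the worker-side code for such a setup with TF 2.0 (tf-spawner, or whatever launches the pods, is expected to set TF_CONFIG for each worker; the model and dataset details are illustrative and omitted):

```python
import tensorflow as tf

# Each worker pod runs this script; TF_CONFIG describes the cluster topology.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(50, activation="relu", input_shape=(14,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Each worker then reads its shard of the training data and calls, e.g.:
# model.fit(train_dataset, epochs=5)
```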
Machine Learning with Spark and Keras
[Pipeline diagram: labeled data and DL models from researchers (input) → feature engineering at scale → hyperparameter optimization (random/grid search) → distributed model training with TF 2.0 → particle selector model (output)]
Conclusions
• Spark and “Big Data”-based analysis platforms can improve High Energy Physics data pipelines
  • Industry-standard APIs
  • Run natively on “data lakes” and cloud
  • Benefit from large communities in industry and open source
• Two use cases developed
  • CMS data reduction at scale with Apache Spark
  • Deep learning pipeline with Spark + BigDL and TensorFlow
• Analytics platform at CERN
  • Open for access to the CERN community, notably users in Physics, Beams and Accelerators, and IT
Acknowledgments
• CERN openlab: Riccardo Castellotti, Michał Bień, Viktor Khristenko, Maria Girone
• CERN Spark and Hadoop service
• CMS, Big Data team: Matteo Cremonesi, Jim Pivarski
• CMS, University of Padova: Matteo Migliorini, Marco Zanetti
• Intel team for BigDL and Analytics Zoo: Jiao (Jennie) Wang, Sajan Govindan
• References:
  • Using Big Data Technologies for HEP Analysis, https://doi.org/10.1051/epjconf/201921406030
  • Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, http://arxiv.org/abs/1909.10389