arXiv:1909.10389v4 [cs.DC] 17 Mar 2020

Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics

M. Migliorini1,2, R. Castellotti1, L. Canali1,*, and M. Zanetti2

1 European Organization for Nuclear Research (CERN), Geneva, Switzerland
2 University of Padova, Padua, Italy

*[email protected]

Abstract

The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to these challenges is presented, which allows training neural network classifiers using solutions from the Big Data and data science ecosystems, integrated with tools, software, and platforms common in the HEP environment. In particular, Apache Spark is exploited for data preparation and feature engineering, running the corresponding (Python) code interactively on Jupyter notebooks. Key integrations and libraries that make Spark capable of ingesting data stored using ROOT format and accessed via the XRootD protocol are described and discussed. Training of the neural network models, defined using the Keras API, is performed in a distributed fashion on Spark clusters by using BigDL with Analytics Zoo and also by using TensorFlow, notably for distributed training on CPU and using GPUs. The implementation and the results of the distributed training are described in detail in this work.

Introduction

High energy physics (HEP) experiments like those at the Large Hadron Collider (LHC) are paramount examples of "big-data" endeavors: chasing extremely rare physics processes requires producing, managing and analyzing large amounts of complex data. The data processing throughput of those experiments is expected to exceed 200 TB/s in 2025 with the upgrade of the LHC (HL-LHC project), which, after tight online filtering, implies saving 100 PB of data per year on permanent storage. Thorough and complex data processing enables major scientific achievements, like the discovery of the Higgs boson in 2012. Data ingestion, feature engineering, data reduction and classification are complex tasks, each requiring advanced techniques to be accomplished. While custom solutions have been employed so far, recent developments of open source tools are making the latter compelling options for HEP-specific use cases. Furthermore, physics data analysis is profiting to a large extent from modern Machine Learning (ML) techniques, which are revolutionizing each processing step, from physics object reconstruction (feature engineering), to parameter estimation (regression) and signal selection (classification). In this scope, Apache Spark [1] represents a very promising tool to extend the traditional HEP approach, by combining in a single system powerful means for both sophisticated data engineering and Machine Learning. Among the most popular analytics engines for big data processing, Spark allows performing interactive analysis and data exploration through its mature data processing engine and API for distributed data processing, its integration with cluster systems, and its ML libraries, which make it possible to train all common classifiers and regressors on large datasets in a distributed fashion.

It has been shown (e.g. in [2]) that Deep Learning can considerably boost the performance of physics data analysis, yielding remarkable results from larger sets of low-level (i.e. less "engineered") quantities. There is thus a significant interest in integrating Spark with tools, like BigDL [3], allowing distributed training of deep learning models.

The development of an end-to-end machine learning pipeline to analyze HEP data using Apache Spark is described in this paper. After briefly recalling the traditional data processing and analysis workflow in HEP, the specific physics use case addressed in this work is presented; the various steps of the pipeline are then described in detail, from data ingestion to model training, whereas the overall results are reported in the final section.

The primary goal of this work is to reproduce the classification performance results of [4] using tools from the Big Data and data science ecosystems, showing that the proposed pipeline makes more efficient use of computing resources and provides a more productive interface for the physicists, along all the steps of the processing pipeline.

Traditional Analysis Workflow and Tools

The backbone of the traditional HEP analysis workflow is ROOT [5], a multipurpose C++ toolkit developed at CERN implementing functionalities for I/O operations, persistent storage, statistical analysis, and data visualization. Data gathered by LHC experiments or produced by their simulation software are provided in ROOT format, with file-based data representation and an event-based class structure with branches. The latter is a feature of paramount importance, as it enables the flexibility required to preserve the complexity of the recorded data, making it possible to keep track of intrinsic dependencies among the physics objects of each collision event.

Centralized production systems orchestrate the data processing workflow, converting the raw information into higher-level quantities (data or feature engineering). Computing resources, organized worldwide in hierarchical tiers, are exploited employing GRID protocols [6]. Although the centrally produced datasets may require some additional feature preparation, from this stage on the processing is analysis-dependent and is done by the users using batch systems. Machine Learning algorithms are executed either from within the ROOT framework (using TMVA [7], a toolkit for multivariate data analysis) or using more common open source frameworks (Keras/TensorFlow, PyTorch, etc.).

The Physics use case

Physicists primarily aim at distinguishing interesting collision events from uninteresting ones, the former being those associated with specific physics signals whose existence is sought or whose properties are worth being studied. Typically, those signals are extremely rare and correspond to a tiny fraction of the whole dataset. Data analysis then results in a classification problem, where, in addition to the signal category, the background is often also split into several classes.

Out of the 40 million collisions produced by the LHC every second, only a small fraction (about 1000 events) can currently be stored by the online pipelines of the two omni-purpose detectors, CMS and ATLAS. A haphazard selection of those events would dilute the already rare signal processes, thus efficient data classification needs to take place already online. Experiments implement complex trigger systems, designed to maximize the true-positive rate and minimize the false-positive rate, thus allowing effective utilization of computing resources both online and offline (e.g. processing units, storage, etc.).

The Machine Learning pipeline described in this paper addresses the same physics use case considered in the work by Nguyen et al. [4], where event topology classification, based on deep learning, is used to improve the purity of data samples selected at trigger level. The dataset is the result of a Monte Carlo event generation, where three different processes (categories) have been simulated: the inclusive production of a leptonically decaying W± boson, the production of a top-antitop quark pair (tt), and the hadronic production of multijet events. Variables of low and high level are included in the dataset.

Data Pipeline for Machine Learning

Data pipelines are of paramount importance to make machine learning projects successful, by integrating multiple components and APIs used across the entire data processing chain. A good data pipeline implementation can help to achieve analysis results faster by improving productivity and by reducing the amount of work and toil around core machine learning tasks. In particular, data pipelines are expected to provide solid tools for data processing, a task that ends up being one of the most time-consuming for physicists, and data scientists in general, approaching data analysis problems. Traditionally, HEP has developed custom tools for data processing, which have been successfully used for decades. Recently, a large range of solutions for data processing and machine learning have become available from open source communities. The maturity and adoption of such solutions continue to grow both in industry and academia. Using software from open source communities comes with several advantages, including lowering the cost of development and support, and the possibility of sharing solutions and expertise with a large user community. In this work, we implement the machine learning pipeline detailed in [4] using tools from the "Big Data" ecosystem. One of the key objectives for the machine learning data pipeline is to transform raw data into more valuable information used to train the ML/DL models. Apache Spark provides the backbone of the pipeline, from the task of fetching data from the storage system to feature processing and feeding training data into a DL engine (BigDL and Analytics Zoo are used in this work). Additionally, distributed deep learning results are detailed, obtained using TensorFlow on cloud resources and container-based environments.

Figure 1: Scheme of the data processing pipeline used for training the event topology classifier.

The four steps of the pipeline we built are (see also Figure 1):

• Data Ingestion, where we read data in ROOT format from the CERN EOS storage system into a Spark DataFrame and save the result as a large table stored in a set of Apache Parquet files.

• Feature Engineering and Event Selection, where the Parquet files containing all the event details processed in Data Ingestion are filtered, and datasets with new features are produced.

• Parameter Tuning, where the hyperparameters for each model architecture are optimized by performing a grid search.

• Training, where the neural network models are trained on the full dataset.

In the next sections, we will describe in detail each step of this pipeline.

Data Source

The data used for this work have been generated using software simulators to generate events and to calculate the detector response, as previously discussed; see also [4] for details. For this exercise, the generated training data amount to 4.5 TB, for a total of 54 million events, divided into three classes: "W + jet", "QCD", and "tt" events. The generated training data are stored using the ROOT format, as it is a common format for HEP. The data are originally stored in the CERN EOS storage system, as is the case for the majority of HEP data at CERN at present. The authors of [4] have kindly shared the training data for this work. Each event of the dataset consists of a list of reconstructed particles. Each particle is associated with features providing information on the particle kinematics (position and momentum) and on the type of particle.

Data Ingestion

Data ingestion is the first step of the pipeline, where we read ROOT files from the CERN EOS [8] storage system into a Spark DataFrame. For this, we use a dedicated library able to ingest ROOT data into Spark DataFrames: spark-root [9], an Apache Spark data source for the ROOT file format. It is based on a Java implementation of the ROOT I/O libraries, which offers the ability to translate ROOT files into Spark DataFrames and RDDs. To access the files stored in the EOS storage system from Spark applications, another library was developed: the Hadoop-XRootD connector [10]. The Hadoop-XRootD connector is a Java library extending the Apache Hadoop Filesystem [11] to allow accessing files stored in EOS via the XRootD protocol. This allows Spark to read directly from the EOS storage system, which is convenient for our use case as it avoids the need for copying data into HDFS or other storage compatible with Spark/Hadoop libraries. At the end of the data ingestion step, data with the same structure as the original ROOT files are made available as a Spark DataFrame, on which we can perform event selection and feature engineering.
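
To make the ingestion step concrete, the following PySpark sketch shows what reading ROOT data from EOS into a DataFrame and saving it as Parquet could look like. The EOS path, the output location, and the way the connector JARs are made available are illustrative assumptions; the data source name follows the spark-root documentation.

```python
from pyspark.sql import SparkSession

# The spark-root and Hadoop-XRootD connector JARs are assumed to be available
# to the Spark session (e.g. passed with --packages/--jars at submit time).
spark = SparkSession.builder.appName("data-ingestion").getOrCreate()

# Read ROOT files directly from EOS over the XRootD protocol into a DataFrame.
# The path below is a placeholder for the actual dataset location.
df = (spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eospublic.cern.ch//eos/path/to/dataset/*.root"))

df.printSchema()  # the DataFrame keeps the structure of the original ROOT branches

# Persist the ingested events as Parquet for the next steps of the pipeline.
df.write.mode("overwrite").parquet("hdfs:///user/analysis/events.parquet")
```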

Event Selection and Feature Engineer-ing

In this step of the pipeline, we process the dataset by applying relevant filters, by computing derived features and by applying data normalization techniques. The first part of the processing requires domain-specific knowledge in HEP to simulate trigger selection: this is emulated by requiring all the events to include one isolated electron or muon with transverse momentum (pT) above a given threshold, pT ≥ 23 GeV. All particles passing the trigger selection are then ranked in decreasing order of pT.

For each event, the isolated lepton is the first entry of the list of particles. Together with the isolated lepton, the first 450 charged particles, the first 150 photons, and the first 200 neutral hadrons have been considered, for a total of 801 particles with 19 features each. The result is that each event is associated with a matrix with shape 801 × 19. This defines the Low-Level Features (LLF) dataset. Starting from the LLF, an additional set of 14 High-Level Features (HLF) is computed. These additional features are motivated by known physics and data processing steps and will be of great help for improving the neural network models in later steps. The LLF and HLF datasets, computed as described above, are saved in Apache Parquet [12] format: the amount of training data is reduced at this point from the original 4.5 TB of ROOT files to 950 GB of snappy-compressed Parquet files. Additional processing steps are performed and include operations frequently found when preparing training data for classifiers, notably:

• Data undersampling. This is done to work around the class imbalance in the original training data. After this step, we have the same number of events (1.4 million events) for each of the three classes.

• Data shuffling. This is standard practice, useful to improve the convergence of mini-batch gradient descent-based training. Shuffling is done at this stage over the whole dataset, making use of Spark's ability to perform large sorts.

• All the features present in the datasets have been pre-processed, by scaling them to take values between 0 and 1 (using MinMaxScaler) or by normalizing them (using StandardScaler), as needed for the different classifiers.

• The datasets, containing HLF and LLF features and labels, have been split into training and test datasets (80% and 20% respectively) and saved in two separate folders, as Parquet files. Smaller datasets, samples of the full train and test datasets, have also been generated for test and development purposes. A PySpark sketch of these preprocessing steps is shown after this list.
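
A minimal PySpark sketch of the undersampling, shuffling, splitting, and saving steps listed above follows. The column name label, the paths, and the seeds are illustrative assumptions; the real pipeline also handles the feature scaling mentioned above.

```python
from pyspark.sql.functions import rand

df = spark.read.parquet("events_with_features.parquet")   # hypothetical input path

# Undersampling: sample each class down to the size of the smallest class
# to work around the class imbalance of the original training data.
counts = {r["label"]: r["count"] for r in df.groupBy("label").count().collect()}
smallest = min(counts.values())
fractions = {label: smallest / n for label, n in counts.items()}
balanced = df.sampleBy("label", fractions=fractions, seed=42)

# Shuffling over the whole dataset, using Spark's ability to perform large sorts.
shuffled = balanced.orderBy(rand(seed=42))

# 80/20 train/test split, saved as (snappy-compressed by default) Parquet files.
train, test = shuffled.randomSplit([0.8, 0.2], seed=42)
train.write.mode("overwrite").parquet("train.parquet")
test.write.mode("overwrite").parquet("test.parquet")
```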


Training and test datasets were saved as files in snappy-compressed Parquet format at the end of this stage. To use TensorFlow with the data prepared in our pipeline we had to introduce an additional step, where we converted the training and test datasets into TFRecord format. The TFRecord file format is a simple record-oriented binary format that in essence consists of serialized protocol buffer entries [13]. We used Apache Spark to convert the training and test datasets from Parquet to TFRecord using the library and Spark data source "spark-tensorflow-connector", which is distributed as open source software by the TensorFlow project. The conversion took just a few minutes, as we ran it on a cluster, taking advantage of Spark's parallel processing capabilities. Test and training data after this stage amount to about 250 GB. The decrease in total data size from the previous step is mostly due to the undersampling step and the fact that the population of the three topology classes in the original training data used for this work is not balanced.
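
The Parquet-to-TFRecord conversion can be expressed with a few lines of PySpark, as sketched below; the format name and recordType option follow the spark-tensorflow-connector documentation, while the paths are placeholders.

```python
# The spark-tensorflow-connector JAR is assumed to be on the Spark classpath.
df = spark.read.parquet("train.parquet")        # hypothetical training set path

(df.write
   .format("tfrecords")                         # data source registered by the connector
   .option("recordType", "Example")             # serialize each row as a tf.train.Example
   .mode("overwrite")
   .save("train.tfrecord"))
```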

Neural Network Models

We have tested three different neural network models, following [4]:

• The first and simplest model is the "HLF Classifier". It is a fully connected feed-forward deep neural network taking as input the 14 high-level features. The chosen architecture consists of three hidden layers with 50, 20, and 10 nodes, activated by Rectified Linear Units (ReLU). The output layer consists of 3 nodes, activated by the softmax activation function.

• The "Particle Sequence Classifier" is trained using recurrent layers, taking as input the 801 particles in the Low-Level Features dataset. The list of particles is ordered before feeding it into a recurrent neural network. Particles are ordered by decreasing ∆R distance from the isolated lepton, calculated as

∆R = √(∆η² + ∆φ²)

where η is the pseudorapidity and φ the azimuthal angle of the particle. Gated Recurrent Units (GRU) have been used to aggregate the particle input sequence, using a recurrent layer of width 50. The output of the GRU layer is fed into a fully connected layer with 3 softmax-activated nodes. Notably, this model does not make use of the High-Level Features, but only uses the "Low-Level Features" from simulation data.

• The "Inclusive Classifier" is the most complex and complete of the three models tested. This classifier combines the "HLF Classifier" with the "Particle Sequence Classifier". The model concatenates the 14 High-Level Features with the output of the GRU layer, after a dropout layer. An additional dense layer of 25 nodes is introduced before the final output layer, which consists of 3 nodes activated by the softmax activation function. A Keras sketch of these models is shown after this list.
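
To make the architectures concrete, the sketch below shows how the HLF classifier and the inclusive classifier described above could be written with the Keras API. The layer sizes follow the text; the builder names, the dropout rate, and the activation of the 25-node dense layer are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_hlf_classifier():
    # Fully connected feed-forward network on the 14 high-level features.
    hlf_in = layers.Input(shape=(14,), name="hlf")
    x = layers.Dense(50, activation="relu")(hlf_in)
    x = layers.Dense(20, activation="relu")(x)
    x = layers.Dense(10, activation="relu")(x)
    out = layers.Dense(3, activation="softmax")(x)
    return models.Model(hlf_in, out, name="hlf_classifier")

def build_inclusive_classifier(dropout_rate=0.5):
    # GRU of width 50 over the ordered list of 801 particles (19 features each),
    # concatenated with the 14 high-level features after a dropout layer.
    seq_in = layers.Input(shape=(801, 19), name="particles")
    hlf_in = layers.Input(shape=(14,), name="hlf")
    gru = layers.GRU(50)(seq_in)
    gru = layers.Dropout(dropout_rate)(gru)
    x = layers.Concatenate()([gru, hlf_in])
    x = layers.Dense(25, activation="relu")(x)
    out = layers.Dense(3, activation="softmax")(x)
    return models.Model([seq_in, hlf_in], out, name="inclusive_classifier")

model = build_inclusive_classifier()
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```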

Parameter Tuning

Hyperparameter tuning is a common step for improving machine learning pipelines. In this work, we have used Spark to speed up this step. Spark allows training multiple models with different parameters concurrently on a cluster, with the result of speeding up the hyperparameter tuning step. We used AUC, the Area Under the ROC curve, as the performance metric to compare different classifiers. When performing a grid search each run is independent of the others, hence this process can be easily parallelized and scales very well. For example, we tested the High-Level Features classifier, a feed-forward neural network taking as input the 14 High-Level Features. For this model, we tested changing the number of layers and units per layer, the activation function, the optimizer, etc. As an experiment, we ran grid search on a small dataset containing 100K events sampled from the full dataset, using a grid of 200 hyperparameter sets (models). Hyperparameter tuning can similarly be repeated for all three classifiers described above. For the following work, we have decided to use the same models as the ones presented in [4], as they offered the best trade-off between performance and complexity. To run the grid search in parallel, we used Spark with spark-sklearn and the TensorFlow/Keras wrapper for scikit-learn [14]. Additionally, we obtained similar results for parallelizing hyperparameter search using keras-tuner [15], a specialized package for hyperparameter tuning suitable for TensorFlow, which we integrated with cloud resources on Kubernetes using a custom script.
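
A sketch of such a parallel grid search is shown below, assuming the spark-sklearn GridSearchCV interface (which mirrors scikit-learn's but takes a SparkContext as its first argument) together with the Keras scikit-learn wrapper. The parameter grid, the model builder, and the data variables are illustrative; the actual grid used in this work was larger.

```python
from pyspark import SparkContext
from spark_sklearn import GridSearchCV            # distributes the grid over a Spark cluster
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
import tensorflow as tf

def build_model(units=50, activation="relu", optimizer="adam"):
    # Small feed-forward network over the 14 high-level features.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation=activation, input_shape=(14,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer=optimizer, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

param_grid = {                                    # illustrative grid
    "units": [20, 50, 100],
    "activation": ["relu", "tanh"],
    "optimizer": ["adam", "rmsprop"],
    "epochs": [5],
    "batch_size": [128],
}

sc = SparkContext.getOrCreate()
search = GridSearchCV(sc, KerasClassifier(build_fn=build_model),
                      param_grid=param_grid, cv=8)

# X_sample, y_sample: NumPy arrays with the 100K-event sample of HLF data.
# search.fit(X_sample, y_sample)
# print(search.best_params_)
```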

Distributed Training with Spark, BigDL/Analytics Zoo and TensorFlow

There are many suitable software and platform solutions for deep learning training nowadays; however, choosing among them is not straightforward, as many products are available with different characteristics and optimizations for different areas of application. For this work, we wanted to use a solution that easily integrates with the Spark service at CERN, running on Hadoop YARN [11] clusters and, more recently, also running Spark on Kubernetes [16] using cloud resources. GPUs and other hardware accelerators are only available in limited quantities at CERN at the time of this work, so we also wanted to explore a solution that could scale on CPUs. Moreover, we wanted to use Python/PySpark and well-known APIs for neural network processing: the Keras API [17] in this case. Those reasons, combined with an ongoing collaboration between CERN openlab [18] and Intel, have led us to test and develop this work using BigDL [3] and Analytics Zoo [19] for distributed model training. BigDL and Analytics Zoo are open source projects distributed under the Apache 2.0 license. They provide a distributed deep learning framework for Big Data platforms and are implemented as libraries on top of Apache Spark. Analytics Zoo, in particular, provides a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline. Notably, with Analytics Zoo and BigDL users can work with models defined with the Keras and TensorFlow [20] APIs and run the training at scale using Spark. More information on Analytics Zoo can be found in the Analytics Zoo repository [19]. BigDL provides data parallelism to train models in a distributed fashion across a cluster using synchronous mini-batch Stochastic Gradient Descent. Data are automatically partitioned across Spark executors as an RDD of Samples, i.e. an RDD of N-dimensional arrays containing the input features and label. Distributed training is implemented as an iterative process, thus there will be multiple iterations over the same data. Since reading data from disk multiple times is slow, BigDL exploits the in-memory capabilities of Spark to cache the training RDD in the memory of each worker, allowing faster access during the iterations (this also means that sufficient memory needs to be allocated for training large datasets). More information on the BigDL architecture can be found in the BigDL white paper [3].

In addition, we have tested distributed training using TensorFlow. TensorFlow versions 1.14 and higher introduce an easy-to-use API for running distributed training. Using the package tf.distribute, bundled with the TensorFlow distribution, distributed training of a Keras model can be activated by simply wrapping the code defining and compiling the model with the chosen strategy for distributed training. In this work, we have used the "multi worker mirrored strategy" to distribute training across multiple machines. The multi worker mirrored strategy is an experimental API in TensorFlow 2.0; it uses the all-reduce algorithm to keep the neural network variables in sync during training.
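
The sketch below summarizes how this is activated on the TensorFlow side: model creation and compilation are wrapped in the strategy scope, and each worker needs a TF_CONFIG environment variable describing the cluster (shown as a comment). The model builder and the tf.data datasets are placeholders referring to sketches shown elsewhere in the text.

```python
import tensorflow as tf

# Each worker needs TF_CONFIG describing the cluster and its own role, e.g.:
# TF_CONFIG='{"cluster": {"worker": ["host1:12345", "host2:12345"]},
#             "task": {"type": "worker", "index": 0}}'

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # build_inclusive_classifier() is the hypothetical builder sketched earlier.
    model = build_inclusive_classifier()
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC()])

# train_ds / test_ds: tf.data.Dataset objects reading the TFRecord files
# (see the tf.data sketch later in the text).
# model.fit(train_ds, validation_data=test_ds, epochs=12)
```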

Models Training Results

After training the three neural network models, we evaluated the results using the test dataset. Each of the models presented in the previous section returns as output the probability that an input event is associated with a given topology: yQCD, yW or ytt. This can be used to define a classifier; for example, it suffices to apply a threshold requirement on ytt or yW to define a tt or a W classifier, respectively. A common technique to evaluate the performance of classifiers, also utilized in the reference work [4], is to compute and compare their corresponding ROC curves (receiver operating characteristic curves) and AUC (area under the ROC curve). Figure 2 shows the comparison of the ROC curves of the three classifiers for a tt selector.
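
For reference, the sketch below shows how a tt selector and its ROC curve and AUC can be computed from the model outputs with scikit-learn; y_true and y_prob are placeholders for the true labels and the predicted class probabilities (column order assumed to be [yQCD, yW, ytt]).

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def tt_selector_roc(y_true, y_prob, tt_index=2):
    # A tt selector is obtained by thresholding the y_tt probability;
    # the ROC curve scans the true/false positive rates over all thresholds.
    fpr, tpr, _ = roc_curve(y_true[:, tt_index], y_prob[:, tt_index])
    return fpr, tpr, auc(fpr, tpr)

# Example call with random placeholder data, only to show the call pattern:
rng = np.random.default_rng(0)
y_true = np.eye(3)[rng.integers(0, 3, size=1000)]       # one-hot true labels
y_prob = rng.random((1000, 3))
y_prob /= y_prob.sum(axis=1, keepdims=True)             # normalize to probabilities
fpr, tpr, auc_value = tt_selector_roc(y_true, y_prob)
print(f"AUC on placeholder data: {auc_value:.3f}")
```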

Figure 2: ROC curves (signal efficiency/TPR versus background contamination/FPR) for the tt selector trained using BigDL: HLF classifier (AUC = 0.9821), Particle-sequence classifier (AUC = 0.9906), Inclusive classifier (AUC = 0.9929). All three models perform well, with the inclusive classifier giving the best result of the three, matching the results of Nguyen et al. [4].

The fact that the HLF classifier performs well, despite its simplicity, can be justified by the fact that considerable physics knowledge is built into the definition of the 14 high-level features. This conclusion is further reinforced by the results of additional tests, where we have built the topology classifier using a "random forest" and a "gradient boosting" (XGBoost) model, respectively, and trained them using the high-level features dataset. In both cases, we have obtained very good performance of the classifiers, close to what has been achieved with the HLF classifier model based on feed-forward neural networks. In contrast, the fact that the particle sequence classifier performs better than the HLF classifier is remarkable, because we are not putting any a priori knowledge into the particle sequence classifier: the model just uses a list of reconstructed particles as input for the training data. In some sense, the GRU layer is identifying important features and physics quantities out of the training data, rather than using knowledge injected via feature engineering. Further improvements are obtained by combining the HLF classifier and the particle sequence classifier; we call the resulting model the "inclusive classifier" model. Another remarkable fact is that the training procedure used in this work converges to the same results presented in the original paper [4], even though we are using distributed training methods in this work.

Figure 3: Training and validation loss (categorical cross entropy) plotted as a function of iteration number for the HLF classifier. Training convergence for the neural networks used in this work is obtained at around 50 epochs, depending on the model.

Workload and Performance

Platforms and hardware

Multiple Spark clusters have been used to develop, run and test the data pipeline. The first group consisted of two Hadoop/YARN clusters, part of the CERN Hadoop and Spark service: one development cluster, and one production cluster consisting of 52 nodes, with the following resources: 1800 vcores, 14 TB of RAM, and 9 PB of storage. The production cluster is a shared, general-purpose, multi-tenant Hadoop YARN cluster built using commodity hardware and running Linux (CentOS 7). Only a fraction of the production cluster capacity was used when executing the pipeline (up to about 30% of the core capacity). Jobs from other workloads were concurrently running on the system used to develop this work, which has introduced some noise and possibly impacted performance; however, this has also provided a "real-life" scenario of what physicists could achieve when running their data pipelines and DL training jobs on a shared Spark cluster. Further work has been done running on cloud resources, on-premises and at a public cloud. In particular, Kubernetes clusters on the CERN Cloud Service and on Oracle Cloud Infrastructure have been tested. Analogously to the YARN tests, also in this case resources were allocated on a multi-tenant system, as is the case for cloud systems, although with the notable difference that the allocated VM resources were not further shared, but used exclusively for this workload. When running Spark workloads on cloud resources, we have used Kubernetes as the cluster solution for Spark.

Data Preparation

Data ingestion and event filtering is a very resource-demanding step of the pipeline. Processing the original dataset of 4.5 TB took approximately 3 hours when deployed on a Hadoop/YARN cluster with 50 executors, each executor allocating 8 cores, for a total of 400 cores allocated to the Spark application. The data ingestion and feature preparation workloads were monitored and measured using metrics from OS tools and from the Spark metrics system. Monitoring data showed that the majority of the application time was spent by tasks running "on CPU". CPU cycles were spent mostly executing Python code, therefore outside of the Spark JVM. This can be explained by the fact that we chose to process the bulk of the training data using Python UDF functions. This is a well-known behavior in current versions of Apache Spark when using Python UDFs extensively: many CPU cycles are "wasted" in data serialization and deserialization operations, going back and forth between the JVM and Python, in the current Spark implementation. Spark runs more efficiently when using the DataFrame API and/or Spark SQL.

To improve the performance of the event filtering we have refactored part of the Python UDF code in the data ingestion step, by using Spark SQL functions. Notably, we have made use of Spark SQL Higher Order Functions, a specialized category of SQL functions, introduced in Spark from version 2.4.0, that improve processing for nested data (arrays) using SQL and Spark's DataFrame API. In the case of our workload, this has introduced the benefit of running a significant fraction of the filtering operations fully inside the JVM, optimized by the Spark code for DataFrame operations. The result is that the data ingestion step, optimized with Spark SQL and higher-order functions, ran in about 2 hours, improving on the previously measured job duration of 3 hours in the implementation that uses only Python UDFs. Future work on the pipeline may further address the issue of reducing the time spent in serialization and deserialization when using Python UDFs; for example, we might decide to rewrite the critical parts of the ingestion code in Scala. However, with the current training and test data size (4.5 TB of raw data), the performance of the data preparation steps, currently of the order of 3 hours for the combined data ingestion and feature preparation steps, is acceptable. Notably, running data preparation on the full dataset is in practice not the most time-consuming part of the process: development and testing on subsets of the full dataset is typically where the physicists spend most of their time for the data preparation steps. A standard and easy-to-use API to process data at scale, such as the one provided by Spark, can play an important role in helping the physicists to be more productive and ultimately improve the performance of the data preparation step.
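
As an illustration of this optimization, the snippet below filters the nested particle array with the higher-order filter function available from Spark 2.4, keeping the computation inside the JVM; the column and field names are illustrative and do not reproduce the full trigger selection.

```python
from pyspark.sql import functions as F

events = spark.read.parquet("events.parquet")    # hypothetical ingested dataset

# Keep, for each event, only particles above the pT threshold and require at
# least one selected particle, using SQL expressions evaluated inside the JVM
# instead of Python UDFs.
filtered = (events
    .withColumn("selected_particles",
                F.expr("filter(particles, p -> p.pT >= 23.0)"))
    .where(F.expr("size(selected_particles) > 0")))
```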

Hyperparameter tuning

Performance of hyperparameter tuning with grid search: grid search runs multiple training jobs in parallel, therefore it is expected to scale well when run in parallel, as was the case in our work (see also the paragraph on hyperparameter tuning). This was confirmed by measurements, as shown in Figure 4: by adding executors, and consequently increasing the number of models trained in parallel, the time required to scan the parameter space decreased, as expected.

Figure 4: Speedup of the grid search time for the High-Level Features classifier on 100K events and on a grid of about 200 hyperparameter sets with 8-fold cross validation. The plot compares the measured speedup with the ideal speedup as a function of the number of executors.

Training with BigDL

Performance of distributed training with BigDL and Analytics Zoo is also important for this exercise, as faster training time means improved productivity of the physicists, who will typically perform many experiments and much fine-tuning on the models' structure and training parameters. Figure 5 shows a few measurements of the training speed for the HLF classifier. The tests have been run on the development cluster, using a batch size of 32 per worker, and show very good scalability behavior of the training at the scale tested. Additional measurements, using 20 executors with 6 cores each (for a total of 120 allocated cores) and a batch size of 128 per worker, showed a training speed for the HLF classifier of the order of 100K rows/second, sustained for the training of the full dataset. We were able to train the model in less than 5 minutes, running for 12 epochs (each epoch processing 3.4 million training events). We found that in our setup the batch size had an impact on the execution time and on the accuracy of the training: a batch size of 128 for the HLF classifier was a good compromise for speed, while a batch size of 32 was slower but gave improved results. We would use the former for model development and the latter, for example, for producing the final training results.

Figure 5: Throughput of BigDL as a function of the number of executors and cores per executor for the HLF classifier.

An important point to keep in mind when training models with BigDL is that the RDDs containing the features and labels datasets need to be cached in the executors' JVM memory. This can be a significant amount of memory, of about 250 GB when training the "particle sequence" and "inclusive classifier" models, as they feed on training data with features containing large particle matrices (801 × 19).

The training dataset size and the model complexity of the HLF classifier are of relatively small scale; this makes the HLF classifier suitable for training on desktop computers, i.e. without using distributed computing solutions. In contrast, the Particle Sequence Classifier and the Inclusive Classifier models have much higher complexity and require processing hundreds of GB of training data. Figure 6 shows the amount of CPU time consumed during the training of the Inclusive Classifier using BigDL/Analytics Zoo on a Spark cluster, running on 30 executor instances (12 cores allocated for each executor) and training with a batch size of 32. Notably, the neural network training in this case lasted 8.5 hours (50 epochs) and utilized 6.6 × 10^6 CPU seconds; this means that, on average, 215 CPU cores were concurrently active for the duration of the training job. The executor CPU time measurements have been collected using Spark metrics instrumentation feeding into an InfluxDB instance; the monitoring plots have been generated using a Grafana dashboard, custom-built to visualize Spark monitoring data. By measuring the training performance multiple times across several days and months, we noticed that the training performance in our setup depends on the state of the cluster and is affected by stragglers and executors under-performing because of hardware faults or high server load.

Figure 6: Measured aggregated CPU utilization of the Spark executors during the training of the Inclusive Classifier. The neural network training lasted 8.5 hours and utilized about 215 concurrently running CPU cores for the whole duration of the training job.

Training with TensorFlow

We have also run tests and measured the distributed training performance of the Inclusive Classifier model using TensorFlow. In particular, we have used tf.keras and the module tf.distribute implementing the "multi worker mirrored strategy" to run and distribute the training over multiple workers. We have tested different configurations, varying the number of workers, servers, CPU cores, and GPU devices. Our tests with TensorFlow (tf.keras) used training and test data in TFRecord format, produced at the end of the data preparation part of the pipeline. TensorFlow reads the TFRecord format natively and has tunable parameters and optimizations for ingesting this type of data using the modules tf.data and tf.io. In particular, we followed TensorFlow's documentation recommendations for improving the data pipeline performance, by using prefetching, parallel data extraction, interleaving, a large read buffer, and caching (in particular, caching was used for distributed training when running on multiple nodes, to take advantage of the aggregated memory available in the cluster).
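
The resulting tf.data input pipeline can be sketched as follows; the file pattern, feature specification, and tuning constants are illustrative assumptions, while the interleaved parallel reads, large read buffer, caching, and prefetching mirror the optimizations listed above.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Placeholder feature specification: the real one describes the 801x19 particle
# matrix, the 14 high-level features, and the one-hot encoded label.
feature_spec = {
    "particles": tf.io.FixedLenFeature([801 * 19], tf.float32),
    "hlf": tf.io.FixedLenFeature([14], tf.float32),
    "label": tf.io.FixedLenFeature([3], tf.float32),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    particles = tf.reshape(example["particles"], (801, 19))
    return (particles, example["hlf"]), example["label"]

files = tf.data.Dataset.list_files("train.tfrecord/part-*")    # placeholder pattern
train_ds = (files
    .interleave(lambda f: tf.data.TFRecordDataset(f, buffer_size=8 * 1024 * 1024),
                cycle_length=4, num_parallel_calls=AUTOTUNE)   # parallel extraction
    .map(parse_example, num_parallel_calls=AUTOTUNE)
    .cache()                                                   # keep parsed data in memory
    .shuffle(10_000)
    .batch(128)
    .prefetch(AUTOTUNE))
```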

The first TensorFlow training tests we performed were deployed on a single bare-metal machine, with 24 physical cores (Broadwell) and 512 GB of RAM. Training and test data were stored locally on an SSD, therefore minimizing the time spent doing I/O. Training for 5 epochs, with batch size 128, using TensorFlow 2.0-rc0 on this system took 21 hours and reached results for loss and AUC similar to what had been obtained in previous runs with BigDL (however, from further tests with this type of configuration using distributed training, discussed later, we found that 12 epochs of training would provide the highest quality of training results). While running the training experiments with TensorFlow on the bare-metal machine, we noticed that the TensorFlow process performing the training would not utilize more than about 5 concurrent cores, despite the availability of idle CPU cores in the machine. External causes of this behavior, such as an I/O bottleneck, could be ruled out; we explain the observed behavior as due to the way TensorFlow internally parallelizes its operations for training the Inclusive Classifier neural network on CPU, as this would naturally lead to a maximum number of executing threads, depending on the implementation. To take advantage of the full 24 physical cores of the test machine, we therefore implemented distributed training, by using the tf.distribute module implementing the multi worker mirrored strategy and by manually starting multiple workers. We found that we could scale up to 4 concurrent training workers on the bare-metal machine used for testing. Training of the Inclusive Classifier with 4 concurrent workers for 6 epochs, with batch size 128 (per worker), using TensorFlow 2.0.1 took 7.7 hours.

We have also run training using TensorFlow on a dedicated test machine equipped with a GPU. The first finding was a striking difference in performance depending on the TensorFlow version: TensorFlow 2.0 has introduced an optimization for training GRU layers on GPU, which improves the performance by about 2 orders of magnitude compared to TensorFlow 1.14. We used a bare-metal machine with an NVIDIA P100, running TensorFlow 2.0-rc1 with CUDA 10.0, with the training and test data stored locally on the machine filesystem. Training the Inclusive Classifier took 2.3 hours for 6 epochs with batch size 128, producing very good results for the trained classifier. We have also trained, using the same hardware and software setup, i.e. using TensorFlow 2.0-rc0 and an NVIDIA P100 GPU, a neural network for the Inclusive Classifier with a small modification, consisting in replacing the GRU layer with an LSTM layer. The neural network training was run for 6 epochs, using batch size 128. The training time in this case was 1.5 hours, with the achieved loss and AUC comparable to the results obtained with the GRU model. Previously, a model based on LSTM was discarded by the authors of [4], and the GRU layer was chosen for performance reasons; we find this to be reproducible in our training tests on CPU resources. We believe that the performance improvement we measured on GPU training tests is due to an optimization introduced in the implementation of LSTM layers for CUDA/GPU in the version of TensorFlow used in our tests.

TensorFlow on Kubernetes

Cloud resources provide a suitable environment for scaling the distributed training of neural networks. One of the key advantages of using cloud resources is the elasticity of the platform, which allows allocating resources when needed. Moreover, container orchestration systems, in particular Kubernetes, provide a powerful and flexible API for deploying many types of workloads on cloud resources, including machine learning and data processing tools. CERN physicists can access cloud resources and Kubernetes clusters via the CERN private cloud. The use of public clouds is also being actively tested for HEP workloads. In particular, the tests reported here have been run using resources from Oracle's OCI. For this work, we have developed a custom launcher script (tf-spawner [21]) for running distributed TensorFlow training code on Kubernetes clusters. Tf-spawner takes as input the Python code used for TensorFlow training described earlier, which makes use of tf.distribute for distributing the training and of tf.data for the data pipeline, and it runs the code on a Kubernetes cluster using container images. We used the official TensorFlow images from Docker Hub for these tests. Moreover, tf-spawner takes care of distributing the necessary credentials for authenticating with cloud storage and the environment variables needed by tf.distribute; it also takes care of allocating the desired number of workers, as pods (units of execution) on a Kubernetes cluster, and of cleaning up the resources.

Training and test data had been copied to the cloud object storage prior to running the tests, notably copying the TFRecord files to the OCI object storage (for the tests run at CERN we used the S3 object storage). Reading from OCI storage can become a bottleneck for distributed training, as it requires reading data over the network, which can suffer from bandwidth saturation, latency spikes and/or multi-tenancy noise. TensorFlow's optimizations for the data pipeline performance discussed earlier were applied to these tests too. Notably, caching has proven to be very useful for distributed training with GPUs and for some of the largest tests on CPU, as we observed that in those cases the first training epoch, which has to read the data into the cache, was much slower than subsequent epochs, which would find the data already cached. Tests discussed in this section were run using TensorFlow version 2.0.1, using the tf.distribute strategy "multi worker mirrored strategy". Additional care was taken to make sure that the different tests would also yield the same good results in terms of accuracy on the test dataset as what was found with previously tested training methods. To achieve this, we found that additional tuning was needed on the settings of the learning rate for the optimizer (we use the Adam optimizer for all the tests discussed in this article). We scaled the learning rate with the number of workers, to match the increase in effective batch size (we used 128 for each worker); this is a well-known technique and it is described for example in [22]. In addition, we found that slowly reducing the learning rate as the number of epochs progressed was beneficial to the convergence of the network. This additional step is an ad hoc tuning that we developed by trial and error and that we validated by monitoring the accuracy and loss on the test set at the end of each training. To gather performance data, we ran the training for 6 epochs, which provided accuracy and loss very close to the best results reported earlier, while optimizing on the time needed to take the measurements. Similarly to what we observed with previous tests, we also confirmed for this case that training the network up to 12 epochs provides better results for accuracy and loss.
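
A minimal sketch of the learning-rate handling described above, assuming a linear scaling of the base Adam learning rate with the number of workers and a simple multiplicative per-epoch decay; the base rate, decay factor, and worker count are illustrative values, not the exact ones used in our runs.

```python
import tensorflow as tf

NUM_WORKERS = 8      # illustrative number of tf.distribute workers
BASE_LR = 0.001      # illustrative single-worker learning rate
DECAY = 0.9          # illustrative per-epoch multiplicative decay

def lr_schedule(epoch, lr=None):
    # Scale the learning rate with the number of workers to compensate for the
    # larger effective batch size (128 per worker), then decay it slowly per epoch.
    return BASE_LR * NUM_WORKERS * (DECAY ** epoch)

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)

# model.fit(train_ds, epochs=6, callbacks=[lr_callback])   # model/dataset as sketched earlier
```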

Figure 7 shows the results of the Inclusive Classifier model training speedup for a variable number of nodes and CPU cores. Measurements show that the training time decreases as the number of allocated cores is increased. The speedup grows close to linearly in the range tested: from 32 cores to 480 cores. Tests used resources from Oracle's OCI, where we built a Kubernetes cluster using virtual machines (VMs) and configured it with a set of Terraform scripts to automate the process. The cluster for CPU tests used VMs of the flavor "VM.Standard2.16", based on 2.0 GHz Intel Xeon Platinum 8167M, each providing 16 physical cores (Oracle cloud refers to these as OCPUs) and 240 GB of RAM. Tests in this configuration deployed 3 pods for each VM, each pod running one TensorFlow worker taking part in the distributed training cluster. Additional OS-based measurements on the VMs confirmed that this was a suitable configuration, as we could measure that the CPU utilization on each VM matched the number of available physical cores (OCPUs), therefore providing good utilization without saturation. The available RAM in the worker nodes was used to cache the training dataset using the tf.data API (data populate the cache during the first epoch).

Figure 7: Measured speedup for the distributed training of the Inclusive Classifier model using TensorFlow and tf.distribute with the "multi worker mirrored strategy", running on cloud resources with CPU and GPU nodes, training for 6 epochs. The speedup values indicate how well the distributed training scales as the number of worker nodes with CPU and/or GPU resources increases.

Similarly, we have performed tests using GPU resources on OCI and tf-spawner. For the GPU tests we have used the VM flavor "GPU 2.1", which comes equipped with an NVIDIA P100 GPU, 12 physical cores (OCPUs) and 72 GB of RAM. We have tested up to 10 GPUs and found almost linear scalability in the tested range. One important lesson learned is that, when using GPUs, the slow performance of reading data from OCI storage makes the first training epoch much slower than the rest of the epochs (up to 3-4 times slower). It was therefore very important to use TensorFlow's caching for the training dataset in our tests with GPUs; however, we could only do that for tests with 4 nodes or more, given the limited amount of memory in the VM flavor used (72 GB of RAM per node) compared to the size of the training set (200 GB). Distributed training tests with CPUs and GPUs were performed using the same infrastructure, namely a Kubernetes cluster built on cloud resources and cloud storage allocated on OCI; moreover, we used the same script for training with tf.distribute and tf.keras, and the same TensorFlow version. Figure 8 shows the distributed training time measured for some selected cluster configurations. We can use these results to compare the performance we found when training on GPU and on CPU. For example, we find there that the time to train the Inclusive Classifier for 6 epochs using 400 CPU cores (distributed over 25 VMs equipped with 16 physical cores each) is about 2000 seconds, which is similar to the training time we measured when distributing the training over 6 nodes equipped with GPUs. We do not believe these results can be easily generalized to other environments and models; however, they are reported here as they can be useful as an example and future reference.

Figure 8: Selected measurements of the distributed training time for the Inclusive Classifier model using TensorFlow and tf.distribute with the "multi worker mirrored strategy", training for 6 epochs, running on cloud resources using CPU and GPU nodes.

Conclusions and Future Outlook

This work shows an example of a pipeline for end-to-end data preparation and deep learning for a high energy physics use case and details how it can be implemented using tools and techniques from open source projects and "big data" communities at large. In particular, it addresses the implementation of a pipeline for data preparation and training of a particle topology classifier based on deep learning. Following the work and models developed by Nguyen et al. [4], event topology classification, based on deep learning, is used to improve the purity of data samples selected at trigger level. The application of the classifier developed in this work is intended to improve the efficiency of LHC detectors and data flow systems for data acquisition. The application of the methods and techniques developed in this paper, using open source and big data tools, is intended to improve scientist productivity and resource utilization when working on data analysis and machine learning model development.

Machine learning and deep learning on large amounts of data are standard tools for particle physics, and their use is expected to increase in the HEP community in the coming years, both for data acquisition and for data analysis workflows, notably in the context of the challenges of the High Luminosity LHC project [23]. Improvements in productivity and cost reduction for the development, deployment, and maintenance of machine learning pipelines on HEP data are of high interest. The authors believe that using tools and methods from open source and big data communities at large brings important advantages in this area, in particular in terms of usability and developers' productivity. Key advantages are the use of standard APIs supported by a large community, and ease of deployment and integration with modern computing systems, notably cloud systems.

The results discussed in this paper provide insights on how to perform data preparation at scale and how to use scale-out cluster resources like YARN and Kubernetes. Moreover, we showed how one can easily deploy distributed training on cloud resources, both using CPUs and GPUs. We expect that many of the tools and methods discussed here will evolve considerably in the near future, following the directions taken by the relevant communities. This will most likely make the details of the implementations discussed in this paper obsolete; however, we can also expect that the evolution will bring important improvements and will profit greatly from the experience and lessons learned by a large user and developer base.

Reference notebooks with code developed for this work and data samples are available at: https://github.com/cerndb/SparkDLTrigger

The authors would like to thank the authors of [4], and in particular Thong Nguyen and Maurizio Pierini, for their help and suggestions; the team of the CERN Spark, Hadoop and streaming service; CERN openlab, in particular Maria Girone, Viktor Khristenko, and Michal Bien; the team of the CMS Bigdata project, in particular Oliver Gutsche, Jim Pivarski, and Matteo Cremonesi; and the BigDL and Analytics Zoo team at Intel, in particular Jiao (Jennie) Wang and Sajan Govindan.

References

[1] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud'10, (Berkeley, CA, USA), pp. 10–10, USENIX Association, 2010.

[2] P. Baldi, P. Sadowski, and D. Whiteson, "Searching for Exotic Particles in High-Energy Physics with Deep Learning," Nature Commun., vol. 5, p. 4308, 2014.

[3] J. Dai, Y. Wang, X. Qiu, D. Ding, Y. Zhang, Y. Wang, X. Jia, C. Zhang, Y. Wan, Z. Li, J. Wang, S. Huang, Z. Wu, Y. Wang, Y. Yang, B. She, D. Shi, Q. Lu, K. Huang, and G. Song, "BigDL: A Distributed Deep Learning Framework for Big Data," arXiv e-prints, p. arXiv:1804.05839, Apr 2018.

[4] T. Q. Nguyen, D. Weitekamp III, D. Anderson, R. Castello, O. Cerri, M. Pierini, M. Spiropulu, and J.-R. Vlimant, "Topology classification with deep learning to improve real-time event selection at the LHC," Comput Softw Big Sci (2019) 3: 12, August 2019.

[5] R. Brun and F. Rademakers, "ROOT — An object oriented data analysis framework," Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 389, no. 1, pp. 81–86, 1997. New Computing Techniques in Physics Research V.

[6] I. Bird, P. Buncic, F. Carminati, M. Cattaneo, P. Clarke, I. Fisk, M. Girone, J. Harvey, B. Kersevan, P. Mato, R. Mount, and B. Panzer-Steindel, "Update of the Computing Models of the WLCG and the LHC Experiments," Tech. Rep. CERN-LHCC-2014-014, LCG-TDR-002, Apr 2014.

[7] A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. von Toerne, H. Voss, M. Backes, T. Carli, O. Cohen, A. Christov, D. Dannheim, K. Danielowski, S. Henrot-Versille, M. Jachowski, K. Kraszewski, A. Krasznahorkay, M. Kruk, Y. Mahalalel, R. Ospanov, X. Prudent, A. Robert, D. Schouten, F. Tegenfeldt, A. Voigt, K. Voss, M. Wolter, and A. Zemla, "TMVA — Toolkit for Multivariate Data Analysis," arXiv e-prints, p. physics/0703039, Mar 2007.

[8] A. J. Peters and L. Janyst, "Exabyte scale storage at CERN," Journal of Physics: Conference Series, vol. 331, 12 2011.

[9] V. Khristenko and J. Pivarski, “diana-hep/spark-root: Release 0.1.14,” Oct 2017.

[10] CERN-DB, "Hadoop-XRootD connector." https://github.com/cerndb/hadoop-xrootd, 2013.

[11] "Apache Hadoop YARN." https://hadoop.apache.org/.

[12] "Apache Parquet." https://parquet.apache.org/.

[13] Google, "Protocol buffers." http://code.google.com/apis/protocolbuffers/.

[14] “Scikit-learn.” https://scikit-learn.org/.

[15] "Keras Tuner." https://keras-team.github.io/keras-tuner/.

[16] “Kubernetes.” https://kubernetes.io/.

[17] F. Chollet et al., “Keras.” https://keras.io, 2015.

[18] “CERN openlab.” https://openlab.cern/.

[19] "Analytics Zoo." https://analytics-zoo.github.io/.

[20] "TensorFlow." https://www.tensorflow.org/.

[21] "tf-spawner." https://github.com/cerndb/tf-spawner.


[22] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour," arXiv e-prints, p. arXiv:1706.02677, Jun 2017.

[23] "High Luminosity LHC." http://hilumilhc.web.cern.ch/about/hl-lhc-project.
