
Page 1: ROCm and Hopsworks for end-to-end deep learning pipelines

ROCm and Hopsworks for end-to-end deep learning pipelines

Jim Dowling, CEO, Logical Clocks
Ajit Mathews, CVP ML Software Eng, AMD

17 October 2019, O'Reilly AI, London

Page 2: ROCm and Hopsworks for end-to-end deep learning pipelines

Great Hedge of India

● The East India Company was one of the industrial world's first monopolies.

● They assembled a thorny hedge (not a wall!) spanning India.

● You paid customs duty to bring salt over the wall (sorry, hedge).

In 2019, not all graphics cards are allowed to be used in a data center.

Monopolies are not good for deep learning!

[Image from Wikipedia]

Page 3: ROCm and Hopsworks for end-to-end deep learning pipelines

Nvidia™ 2080Ti vs AMD Radeon™ VII: ResNet-50

Nvidia™ 2080Ti
Memory: 11 GB
TensorFlow 1.12, CUDA 10.0.130, cuDNN 7.4.1
Model: ResNet-50, Dataset: ImageNet (synthetic)
------------------------------------------------------------
FP32 total images/sec: ~322
FP16 total images/sec: ~560

AMD Radeon™ VII
Memory: 16 GB
TensorFlow 1.14, ROCm 2.7
Model: ResNet-50, Dataset: ImageNet (synthetic)
------------------------------------------------------------
FP32 total images/sec: ~316
FP16 total images/sec: ~421

References:
https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx2080ti-tensorflow&num=2
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173

Page 4: ROCm and Hopsworks for end-to-end deep learning pipelines

Hopsworks hides the Complexity of Deep Learning

[Diagram: the core model (Data → Model φ(x) → Prediction) is surrounded by Data Collection, Feature Engineering, Data Validation, Hyperparameter Tuning, Distributed Training, Model Serving, A/B Testing, Monitoring, Pipeline Management, and Hardware Management, all accessed through the Hopsworks REST API and Hopsworks Feature Store.]

[Adapted from Sculley et al., "Hidden Technical Debt in Machine Learning Systems"]

Page 5: ROCm and Hopsworks for end-to-end deep learning pipelines
Page 6: ROCm and Hopsworks for end-to-end deep learning pipelines
Page 7: ROCm and Hopsworks for end-to-end deep learning pipelines

Hopsworks

Data sources feed Hopsworks; Applications, APIs, and Dashboards consume from it.

Pipeline phases: Data Preparation & Ingestion; Experimentation & Model Training; Deploy & Productionalize; Streaming.

Components: Apache Beam, Apache Spark, Apache Flink, Pip, Conda, TensorFlow, scikit-learn, PyTorch, Jupyter Notebooks, Tensorboard, Kubernetes; batch and distributed ML & DL; Model Serving and Model Monitoring; Kafka + Spark Streaming; the Hopsworks Feature Store; orchestration in Airflow.

Filesystem and metadata storage: HopsFS, Apache Kafka.

Page 8: ROCm and Hopsworks for end-to-end deep learning pipelines

Open Source Foundation for Machine Learning

Machine Learning: TRANSLATE, SPEECH, RECOMMENDER
Scientific Apps: WEATHER, EDUCATION, SEISMIC PROCESSING, SIMULATIONS, MOLECULAR SIMULATIONS

Stack layers:
● Frameworks
● Middleware and Libraries: MIOpen, rocBLAS, RCCL, rocFFT, rocPRIM, Eigen
● Programming models: OpenMP, HIP, OpenCL™, Python™
● Runtimes & Drivers: ROCm Kernel Driver
● Devices: GPU, CPU, APU, DL Accelerator

● Open source stack
● Latest ML frameworks
● Optimized compilers, math & communication libraries
● Write parallel GPU algorithms entirely from Python™
● Cluster scale: Docker™, SLURM and Kubernetes extensions

Page 9: ROCm and Hopsworks for end-to-end deep learning pipelines

Distro: Upstream Linux Kernel Support

1100+ upstream ROCm driver commits since the 4.12 kernel.
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver

Page 10: ROCm and Hopsworks for end-to-end deep learning pipelines

Languages: Multiple Programming Options

Programming models: OpenMP, Python™, OpenCL™, HIP
Compilation path: LLVM → AMDGCN compiler → AMDGPU code

LLVM: https://llvm.org/docs/AMDGPUUsage.html
CLANG HIP: https://clang.llvm.org/doxygen/HIP_8h_source.html

Page 11: ROCm and Hopsworks for end-to-end deep learning pipelines

Machine Learning Frameworks

● ROCm enablement upstreamed to mainline; latest versions supported
● Support for major applications and benchmarks
● Support for fp32 and fp16 precisions
● Support for automatic mixed precision via rocPyX (ROCm PyTorch Extensions)

Available today as a Docker™ container or build from source:
https://medium.com/tensorflow/community-supported-amd-rocm-build-for-tensorflow-e8e9ac258369

Available today as a Docker™ container or as a Python™ pip wheel.

Page 12: ROCm and Hopsworks for end-to-end deep learning pipelines

MIOpen: Open Source Machine Learning Library

https://github.com/ROCmSoftwarePlatform/MIOpen
https://arxiv.org/abs/1910.00078

Helps realize the incredible benefits of high-performance, highly-tuned deep learning primitives:
● Single Precision (FP32), Half Precision (FP16), Mixed Precision and bfloat16 supported
● >500 operators accelerated; hand-tuned assembly operations

RNNs: LSTMs, GRU
Activation: ReLU, Sigmoid, TanH, Leaky ReLU, Dropout
Convolution: Winograd, FFT, Direct GEMM, Implicit GEMM, Dilated Convolutions, Group Convolutions, Transpose Convolutions
Pooling: Max, Average
Normalization: LRN, Batch Normalization
Other: Embeddings, SoftMax
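The "Direct GEMM" algorithm above reformulates convolution as a matrix multiply: image patches are unrolled into rows (im2col) and multiplied against the flattened kernel. A minimal pure-Python sketch of that idea, for clarity only (MIOpen's actual kernels are hand-tuned GPU code, and all names here are illustrative):

```python
# Sketch of the "convolution as GEMM" idea that libraries like MIOpen
# implement with hand-tuned kernels. Pure Python for clarity, not speed.

def im2col(x, kh, kw):
    """Unroll each kh x kw patch of a 2D input into one row of a matrix."""
    h, w = len(x), len(x[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([x[i + di][j + dj] for di in range(kh) for dj in range(kw)])
    return rows

def conv2d_gemm(x, k):
    """Valid 2D convolution (cross-correlation) via im2col + matrix product."""
    kh, kw = len(k), len(k[0])
    cols = im2col(x, kh, kw)
    kvec = [k[i][j] for i in range(kh) for j in range(kw)]
    out_w = len(x[0]) - kw + 1
    # One flattened kernel => this GEMM degenerates to a matrix-vector product.
    flat = [sum(a * b for a, b in zip(row, kvec)) for row in cols]
    return [flat[r * out_w:(r + 1) * out_w] for r in range(len(flat) // out_w)]

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, 1]]
print(conv2d_gemm(x, k))  # → [[6, 8], [12, 14]]
```

The payoff of this formulation is that the whole convolution becomes one dense GEMM, which is exactly the operation GPU libraries optimize hardest.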

Page 13: ROCm and Hopsworks for end-to-end deep learning pipelines

Support for Robust AI Compiler Technologies

XLA: enabled on ROCm and upstreamed; performance improvements realized over the classic backend.

MLIR: AMD is a launch partner of Google's MLIR technology. The pipeline goes TensorFlow → MLIR MIOpen Dialect → LLVM IR and kernel execution, with an MLIR-aware MIOpen exposing "composable" convolutions via Stream Executor to MLIR-based applications. Basic functionality is enabled on ROCm; upstreaming is in progress.

Eliminating the "black-box" nature of convolutions drives next-gen performance optimizations.

Learn more about MLIR on ROCm at TensorFlow World in 2 weeks.

Page 14: ROCm and Hopsworks for end-to-end deep learning pipelines

ROCm Performance Improvements

>1.5X year-over-year software performance improvements on the same hardware.

[Chart: relative performance, Oct '18 (100%) vs Oct '19, measured on a single MI25 GPU using fp32. ResNet50_128: 157%. Inception3_128: 153%.]

ROCm is on a monthly release cadence packed with new features and performance optimizations, released both as source and as Docker containers.

Page 15: ROCm and Hopsworks for end-to-end deep learning pipelines

Intra-Node and Inter-Node ready

[Chart: PyTorch ResNext101, Radeon Instinct™ MI50, FP32, BS 32, 1 proc/GPU. Multi-GPU scaling efficiency (images/sec): 100% at 1 GPU, 96% at 2, 94% at 4, 92% at 8, 92% at 16, 91% at 32, 84% at 64.]

[Chart: PyTorch ResNet-50, Radeon Instinct™ MI50, FP32, BS 128, 1 proc/GPU. Multi-GPU scaling efficiency (images/sec): 100% at 1 GPU, 97% at 2, 96% at 4, 95% at 8, 94% at 16, 94% at 32, 92% at 64.]

[Diagram: eight nodes, each with two CPUs and eight MI50 GPUs, connected over InfiniBand to a 100G switch with RDMA. Within a node, a chordal ring over Infinity Fabric approximates all-to-all communication.]

RCCL: optimized collective communication operations library, with support for InfiniBand and RoCE high-speed network fabrics. Designed for easy MPI integration.
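The scaling percentages in the charts above are throughput relative to perfect linear scaling. A small helper showing the calculation (the numbers below are hypothetical, not the slide's raw data):

```python
def scaling_efficiency(images_per_sec, n_gpus, single_gpu_rate):
    """Measured multi-GPU throughput as a fraction of perfect linear scaling,
    i.e. n_gpus times the single-GPU throughput."""
    return images_per_sec / (n_gpus * single_gpu_rate)

# Hypothetical throughputs: 1 GPU at 250 img/s, 8 GPUs at 1840 img/s total.
eff = scaling_efficiency(1840, 8, 250)
print(f"{eff:.0%}")  # → 92%
```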

Page 16: ROCm and Hopsworks for end-to-end deep learning pipelines

Tools: System Management

ROCm Validation Suite (RVS): a cluster management tool for detecting and troubleshooting software and hardware configuration issues, basic diagnostics, integration issues and system performance.
https://github.com/ROCm-Developer-Tools/ROCmValidationSuite

ROCm System Management Interface (SMI): a system administrator's tool for management and monitoring of GPU devices in a ROCm-enabled system.
https://github.com/RadeonOpenCompute/ROC-smi

Page 17: ROCm and Hopsworks for end-to-end deep learning pipelines

Tools: Debugger and Performance Profiling

ROCm Debugger: a multi-architecture x86_64 CPU and amdgcn GPU debugger, with support for amdgcn ISA-level and x86_64 debugging via GDB. ROCm debugger APIs will be available for third-party debugger developers.

ROCm Performance Profiling: rocProfiler allows collection of GPU HW counters. rocTracer provides on-demand tracing with a generic runtime's Events, Callback and Activity APIs. rocTX provides code annotation markers to profile specific sections of code.

https://github.com/ROCm-Developer-Tools/rocprofiler
https://github.com/ROCm-Developer-Tools/roctracer

Page 18: ROCm and Hopsworks for end-to-end deep learning pipelines

ROCm: Machine Learning Applications

All state-of-the-art models are enabled on ROCm using TensorFlow & PyTorch. All performance measured on a single MI50 GPU using fp32.

Image Classification: ResNet50, ResNet101, VGG, ResNext101, Inception3, Inception4, ShuffleNet, DenseNet, MobileNet, SqueezeNet, …
ResNext101 (fp32), batch=32: 63 img/sec

Object Detection: Faster-RCNN-Resnet50, Mask-RCNN-Resnet50, SSD-Resnet50, …
Mask-RCNN-Resnet101 (fp32), batch=2, 1K x 1K images: 144 img/sec

Machine Translation: GNMT (LSTMs), Translate (LSTMs), BERT (Transformer), GPT-2 (Transformer)
BERT: 66 examples/sec

Recommendation Systems: DLRM, NCF
Neural Collaborative Filtering: 2,961,459 tokens/sec

Page 19: ROCm and Hopsworks for end-to-end deep learning pipelines

NEW ERA OF GPU-ACCELERATED MACHINE LEARNING

OPEN SOFTWARE PLATFORM FOR COMPUTE

Enabling innovation, collaboration, and efficiency. Optimized for HPC and deep learning at scale.

Call to action: download the latest ROCm release and contribute.

Page 20: ROCm and Hopsworks for end-to-end deep learning pipelines

ROCm in Hopsworks ML Pipelines

Page 21: ROCm and Hopsworks for end-to-end deep learning pipelines

Spark / TensorFlow Applications on ROCm

Goal: Spark/TensorFlow applications in Hopsworks should run unchanged on ROCm.

Solution: Hopsworks runs Spark/TensorFlow on YARN, with support for ROCm through CGroups.

Page 22: ROCm and Hopsworks for end-to-end deep learning pipelines

YARN support for ROCm in Hopsworks

A Container is a CGroup that isolates CPU, memory, and GPU resources, and has a conda environment and TLS certs.

[Diagram: the Resource Manager allocates Containers across NodeManagers; the Driver and each Executor run in their own Container.]

Page 23: ROCm and Hopsworks for end-to-end deep learning pipelines

Kubernetes support for Jupyter/ROCm in Hopsworks

[Diagram: Hopsworks schedules Jupyter via the KubeMaster into a POD on a Kubernetes worker, while distributed ML runs with Spark-on-YARN.]

https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
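The Kubernetes GPU-scheduling documentation linked above covers requesting AMD GPUs through the `amd.com/gpu` extended resource, which requires the AMD GPU device plugin on the cluster. A minimal pod spec sketch (the pod name and image tag are illustrative, not from the slides):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-rocm            # hypothetical pod name
spec:
  containers:
    - name: jupyter
      image: rocm/tensorflow    # hypothetical ROCm image tag
      resources:
        limits:
          amd.com/gpu: 1        # requires the AMD GPU device plugin
```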

Page 24: ROCm and Hopsworks for end-to-end deep learning pipelines

End-to-End ML Pipelines

Raw Data / Event Data → Data Lake → Data Prep → Feature Store → Train & Validate → Model Serving → Monitor

Data-Intensive / GPUs / Scale In/Out

Page 25: ROCm and Hopsworks for end-to-end deep learning pipelines

End-to-End ML Pipeline Technologies in Hopsworks

Raw Data / Event Data → Data Lake → Data Prep → Feature Store → Train & Validate → Model Serving → Monitor

Orchestrated by Airflow.

Page 26: ROCm and Hopsworks for end-to-end deep learning pipelines

ML Pipelines of Jupyter Notebooks with Airflow

Dataprep Pipeline: Select Features & File Format → Feature Engineering → Feature Store

Training and Deployment Pipeline: Feature Store → Experiment & Train Model → Validate & Deploy Model
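Airflow runs stages like these as a DAG of tasks and derives the execution order from their dependencies. A toy stdlib sketch of that dependency resolution (not the Airflow API; stage names are illustrative):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Stage -> set of upstream stages, mirroring the two pipelines above.
dag = {
    "feature_engineering": {"select_features"},
    "feature_store_ingest": {"feature_engineering"},
    "train_model": {"feature_store_ingest"},
    "validate_and_deploy": {"train_model"},
}

# static_order() yields stages so every stage runs after its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)  # select_features first, validate_and_deploy last
```

Airflow does the same resolution, then additionally schedules ready tasks in parallel and retries failures.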

Page 27: ROCm and Hopsworks for end-to-end deep learning pipelines

Don't just Throw Away Notebooks

Explore Data & Train Model → Hyperparam Opt. → Distributed Training

[Diagram: the usual alternative is to port/rewrite notebook code into Python + YML in Git, after which iteration is hard/impossible/a bad idea.]

Page 28: ROCm and Hopsworks for end-to-end deep learning pipelines

PySpark Notebooks as Jobs in ML Pipelines

Page 29: ROCm and Hopsworks for end-to-end deep learning pipelines

Apache Airflow to Orchestrate ML Pipelines

Page 30: ROCm and Hopsworks for end-to-end deep learning pipelines

Apache Airflow to Orchestrate ML Pipelines

Airflow → Jobs REST API → Hopsworks Jobs: PySpark, Spark, Flink, Beam/Flink

Page 31: ROCm and Hopsworks for end-to-end deep learning pipelines

Distributed Deep Learning with ROCm in Hopsworks

Page 32: ROCm and Hopsworks for end-to-end deep learning pipelines

HParam Tuning with Spark+TensorFlow

[Diagram: the Driver fans work out to Executor 1 … Executor N, each with its own conda_env; TensorBoard, models, checkpoints, training data, and logs live in HopsFS.]

# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn():  # return dataset
        ...
    optimizer = ...
    model = ...
    model.add(Conv2D(...))
    model.compile(...)
    model.fit(...)
    model.evaluate(...)

# RUNS ON THE DRIVER
hparams = {'lr': [0.001, 0.0001], 'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, hparams)

https://github.com/logicalclocks/hops-examples
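Conceptually, `experiment.grid_search` calls the train function once per point in the Cartesian product of the hyperparameter lists, one run per Spark executor. A serial stdlib sketch of that fan-out (not the hops API; the toy train function stands in for the Keras model above):

```python
from itertools import product

def grid_search(train_fn, hparams):
    """Call train_fn once per point in the hyperparameter grid.
    (The real experiment.grid_search runs these calls on Spark executors.)"""
    names = list(hparams)
    results = {}
    for values in product(*(hparams[n] for n in names)):
        combo = dict(zip(names, values))
        results[tuple(values)] = train_fn(**combo)
    return results

# Toy stand-in for the training function: returns a pretend "accuracy".
def train(lr, dropout):
    return 1.0 - lr - dropout

hparams = {'lr': [0.001, 0.0001], 'dropout': [0.25, 0.5, 0.75]}
results = grid_search(train, hparams)
print(len(results))  # → 6 (2 learning rates x 3 dropout values)
```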

Page 33: ROCm and Hopsworks for end-to-end deep learning pipelines

Distributed Training with Spark+TensorFlow

# RUNS ON THE EXECUTORS
def train():
    def input_fn():  # return dataset
        ...
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)

[Diagram: the Driver coordinates Executor 1 … Executor N, each with its own conda_env; TensorBoard, models, checkpoints, training data, and logs live in HopsFS.]

https://github.com/logicalclocks/hops-examples
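CollectiveAllReduceStrategy (backed by RCCL/NCCL at the library level) synchronizes gradients with an all-reduce: after the step, every worker holds the element-wise combination of all workers' gradients. A toy single-process sketch of that invariant (real implementations use a bandwidth-optimal ring over the network, not a local sum):

```python
def all_reduce(worker_grads):
    """Element-wise sum across workers; every worker receives the full result.
    Stand-in for what RCCL/NCCL do with a ring algorithm over the fabric."""
    total = [sum(vals) for vals in zip(*worker_grads)]
    return [list(total) for _ in worker_grads]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 workers, 2 parameters each
print(all_reduce(grads))  # → [[9.0, 12.0], [9.0, 12.0], [9.0, 12.0]]
```

In training, each worker would then divide by the worker count and apply the averaged gradient, keeping all model replicas identical.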

Page 34: ROCm and Hopsworks for end-to-end deep learning pipelines

DEMO: End-to-End ML Workflow with Spark/TensorFlow and ROCm

Page 35: ROCm and Hopsworks for end-to-end deep learning pipelines

That was Hopsworks with ROCm

Security & Governance:
● Secure Multi-Tenancy: project-based restricted access
● Encryption At-Rest, In-Motion: TLS/SSL everywhere
● AI-Asset Governance: models, experiments, data, GPUs
● Data/Model/Feature Lineage: discover/track dependencies

Development & Operations:
● Development Environment: first-class Python support
● Version Everything: code, infrastructure, data
● Model Serving on Kubernetes: TF Serving, SkLearn
● End-to-End ML Pipelines: orchestrated by Airflow

Efficiency & Performance:
● Feature Store: data warehouse for ML
● Distributed Deep Learning: faster with more GPUs
● HopsFS: NVMe speed with Big Data
● Horizontally Scalable: ingestion, data prep, training, serving

Page 36: ROCm and Hopsworks for end-to-end deep learning pipelines

@logicalclocks
www.logicalclocks.com

OPEN SOURCE FOUNDATION FOR MACHINE LEARNING

Page 37: ROCm and Hopsworks for end-to-end deep learning pipelines

Acknowledgements and References

Slides and Diagrams from colleagues:

● Maggy: Moritz Meister and Sina Sheikholeslami

● Feature Store: Kim Hammar

● Beam/Flink on Hopsworks: Theofilos Kakantousis

References

● HopsFS: Scaling hierarchical file system metadata …, USENIX FAST 2017.

● Size matters: Improving the performance of small files …, ACM Middleware 2018.

● ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.

● Hopsworks Demo, SysML 2019.

Page 38: ROCm and Hopsworks for end-to-end deep learning pipelines

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. Use of third party marks / names is for informational purposes only and no endorsement of or by AMD is intended or implied.