Scaling Deep Learning to 100s of GPUs on Hops Hadoop Fabio Buso Software Engineer Logical Clocks AB


Post on 08-Oct-2020


Scaling Deep Learning to 100s of GPUs on Hops Hadoop

Fabio Buso
Software Engineer
Logical Clocks AB


HopsFS: Next generation HDFS

37x number of files

16x throughput

Scale Challenge Winner (2017)

* https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
** https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf


Hops platform

Projects, Datasets, Users

HopsFS, HopsYARN, MySQL NDB Cluster

Spark, TensorFlow, Hive, Kafka, Flink

Jupyter, Zeppelin

Jobs, Grafana, ELK

REST API

Version 0.3.0 just released!


Python first

Conda Repo

Project Conda env

Search

Install/Remove

Python-3.6, pandas-1.4, Numpy-0.9

Environment usable by Spark/TensorFlow

Hops Python library: make development easy
● Hyperparameter searching
● Manage TensorBoard lifecycle


Find big datasets - Dela*

● Discover, share and experiment with interesting datasets
● P2P network of Hops clusters
● ImageNet, YouTube8M, Reddit comments...
● Exploits unused bandwidth

*http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)


Scale out level 1: Parallel hyperparameter searching


Parallel Hyperparameter searching

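from hops import tflauncher, util

# Training function; one instance runs per hyperparameter combination
def model(lr, dropout):
    …

# Candidate values for each hyperparameter
args_dict = {'learning_rate': [0.001, 0.0005, 0.0001], 'dropout': [0.45, 0.7]}

# Expand to every combination (3 learning rates x 2 dropout values)
args_dict_grid = util.grid_params(args_dict)

# Launch the experiments in parallel on Spark
tflauncher.launch(spark, model, args_dict_grid)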

Starts 6 parallel experiments
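
Why 6: the grid is the Cartesian product of the listed values, 3 learning rates times 2 dropout values. A minimal sketch of that expansion in plain Python follows. The helper name expand_grid is hypothetical and is not the implementation of util.grid_params, whose exact output format is not shown on the slide.

```python
from itertools import product


def expand_grid(args_dict):
    """Return one {name: value} dict per combination of the listed values.

    Hypothetical helper for illustration only; not the hops library code.
    """
    names = list(args_dict)
    value_lists = [args_dict[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


args_dict = {'learning_rate': [0.001, 0.0005, 0.0001], 'dropout': [0.45, 0.7]}
grid = expand_grid(args_dict)

# One entry per experiment that would be started
print(len(grid))
for params in grid:
    print(params)
```

Adding a third hyperparameter with k candidate values multiplies the number of experiments by k, so grid search grows quickly with the number of hyperparameters.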


Scale out level 2: Distributed training


TensorFlowOnSpark (TFoS) by Yahoo!

● Distributed TensorFlow over Spark
● Runs on top of a Hadoop cluster
● PS/Workers executed inside Spark executors
● Uses Spark for resource allocation
    – Our version: exclusive GPU allocation
    – Parameter server(s) do not get GPU(s)
● Manages TensorBoard
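
To make the PS/worker split concrete, here is a minimal sketch of how a set of Spark executors could be divided into parameter servers and workers, with GPUs requested only for workers. The function assign_roles, the "first executors become parameter servers" rule and the one-GPU-per-worker default are all assumptions for illustration; this is not the TensorFlowOnSpark or Hops scheduling code.

```python
def assign_roles(num_executors, num_ps, gpus_per_worker=1):
    """Split executors into parameter servers (no GPU) and workers (with GPU).

    Hypothetical illustration: the first num_ps executors act as parameter
    servers and the remaining ones as workers.
    """
    if not 0 <= num_ps < num_executors:
        raise ValueError("need 0 <= num_ps < num_executors (at least one worker)")
    roles = []
    for executor_id in range(num_executors):
        if executor_id < num_ps:
            # Parameter servers only hold and update variables: no GPU requested
            roles.append({"executor": executor_id, "job": "ps",
                          "task": executor_id, "gpus": 0})
        else:
            # Workers compute gradients: each gets its own GPU(s)
            roles.append({"executor": executor_id, "job": "worker",
                          "task": executor_id - num_ps, "gpus": gpus_per_worker})
    return roles


for role in assign_roles(num_executors=4, num_ps=1):
    print(role)
```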


Run TFoS

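from tensorflowonspark import TFCluster, TFNode

# Runs inside every Spark executor; ctx describes this node's role in the cluster
def training_fun(argv, ctx):
    …..
    TFNode.start_cluster_server()
    …..

# Start the PS/worker cluster on the Spark executors
TFCluster.run(spark, training_fun, num_exec, num_ps…)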

Full conversion guide: https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide


Scale out level: Master of the dark arts

Horovod


Parameter server architecture doesn't scale

From: https://github.com/uber/horovod
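
One way to see the scaling problem is to count the bytes each node has to send per training step. The sketch below is a back-of-the-envelope model under stated assumptions: a single parameter server, dense gradients the size of the full model, no compression, and a hypothetical 100 MB model. The function names are illustrative and come from neither TensorFlow nor Horovod.

```python
def ps_server_bytes_sent(num_workers, model_bytes):
    """Bytes the single parameter server sends per step.

    It ships the updated model to every worker, so its outbound traffic
    grows linearly with the number of workers (and it receives as much).
    """
    return num_workers * model_bytes


def ring_worker_bytes_sent(num_workers, model_bytes):
    """Bytes each worker sends per step with ring all-reduce.

    Scatter-reduce and all-gather each take (N - 1) steps of model_bytes / N,
    so the total stays below 2 * model_bytes however many workers join.
    """
    return 2 * (num_workers - 1) * model_bytes / num_workers


MODEL_MB = 100  # hypothetical model size
for n in (2, 8, 64):
    print(n, ps_server_bytes_sent(n, MODEL_MB), ring_worker_bytes_sent(n, MODEL_MB))
```

In this model the parameter server's link becomes the bottleneck as workers are added, while the per-worker load in the ring stays roughly constant.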


Horovod by Uber

● Based on previous work done by Baidu
● Organizes workers in a ring
● Gradient updates distributed using All-Reduce
● Synchronous protocol
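
"Synchronous" here means every worker computes a gradient on its own shard of the batch, the gradients are averaged across all workers, and every worker applies the same update before the next step starts. A toy sketch in plain Python, assuming a one-parameter model y ≈ w * x and made-up data; the averaging line stands in for what the all-reduce provides and is not Horovod code.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error of y ≈ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(w, shards, lr):
    """One synchronous step: all workers contribute, all apply the same update."""
    # In a real job each gradient is computed on a different GPU in parallel
    grads = [local_gradient(w, shard) for shard in shards]
    # Stand-in for all-reduce: sum across workers, then divide by worker count
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


# Two workers, each holding half of a toy dataset generated from y = 2 * x
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(50):
    w = synchronous_step(w, shards, lr=0.05)
print(w)
```

With equally sized shards the averaged gradient equals the gradient over the whole batch, so the result matches single-machine training on the combined data.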


All-Reduce

GPU1: a0 | b0 | c0
GPU2: a1 | b1 | c1
GPU3: a2 | b2 | c2


All-Reduce

GPU1: a0 | b0 | c0 + c2
GPU2: a0 + a1 | b1 | c1
GPU3: a2 | b1 + b2 | c2


All-Reduce

GPU1: a0 | b0 + b1 + b2 | c0 + c2
GPU2: a0 + a1 | b1 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b1 + b2 | c2



All-Reduce

GPU1: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c2
GPU2: a0 + a1 | b0 + b1 + b2 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b1 + b2 | c0 + c1 + c2


All-Reduce

GPU1: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
GPU2: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
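
The All-Reduce sequence above can be replayed with a short simulation. This is a minimal single-process sketch of ring all-reduce with summation, assuming one chunk per worker as on the slides (chunks a, b, c on three GPUs). The function ring_allreduce is written for illustration and is not Horovod's or Baidu's implementation.

```python
def ring_allreduce(buffers):
    """Simulate ring all-reduce (sum) where buffers[r][k] is chunk k on worker r.

    Returns the per-worker buffers after the last step. This sketch expects
    exactly one chunk per worker.
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("this sketch expects exactly one chunk per worker")
    state = [list(b) for b in buffers]

    # Phase 1, scatter-reduce: at each step worker r passes chunk (r - step) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so the sends happen "simultaneously"
        sends = [(r, (r - step) % n, state[r][(r - step) % n]) for r in range(n)]
        for r, k, value in sends:
            state[(r + 1) % n][k] += value

    # Phase 2, all-gather: worker r now holds the complete sum of chunk (r + 1) mod n
    # and the finished chunks travel around the ring, overwriting partial sums.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, state[r][(r + 1 - step) % n]) for r in range(n)]
        for r, k, value in sends:
            state[(r + 1) % n][k] = value

    return state


# GPU1 holds (a0, b0, c0), GPU2 holds (a1, b1, c1), GPU3 holds (a2, b2, c2)
result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
for gpu, buf in enumerate(result, start=1):
    print(f"GPU{gpu}: {buf}")
```

With N workers the procedure takes 2 × (N - 1) steps, and every worker sends only one chunk (1/N of the data) per step.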


Hops AllReduce

import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …..

def main(_):
    hvd.init()
    # Wrap the optimizer so gradients are averaged across workers with all-reduce
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        # Broadcast initial variables from rank 0 so all workers start identical
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
    …..

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')


Demo time!


Play with it → hops.io/?q=content/hopsworks-vagrant

Doc → hops.io
Star us! → github.com/hopshadoop
Follow us! → @hopshadoop