ray, distributed computing, and · tensorflow serving flink, many others baselines, rllab, elf,...

104
Ray, Distributed Computing, and Machine Learning Robert Nishihara 11/15/2008

Upload: others

Post on 21-May-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Ray, Distributed Computing, and Machine Learning

Robert Nishihara

11/15/2008

Page 2: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Page 3: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training

Page 4: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training

Page 5: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training

Page 6: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training

Page 7: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training RL

Page 8: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training RL

Page 9: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training RL

Page 10: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training RL

Page 11: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training RL

Page 12: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training Model Serving RL

Page 13: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training Model Serving

Hyperparameter SearchRL

Page 14: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training Model Serving RL Hyperparameter

Search

Page 15: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training Data Processing

Model Serving

Hyperparameter SearchRL

Page 16: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training Model Serving

Hyperparameter SearchRL Data

Processing

Page 17: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training Model Serving

Hyperparameter SearchRL Data

Processing

Page 18: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

The Machine Learning Ecosystem

Machine Learning Ecosystem

Training Data ProcessingStreamingModel

ServingHyperparameter

SearchRL

Page 19: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Machine Learning Ecosystem

Distributed System

DistributedSystem

Distributed System

Distributed System

The Machine Learning Ecosystem

Training Data ProcessingStreaming RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Page 20: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Machine Learning Ecosystem

Distributed System

Distributed System

Distributed System

Distributed System

The Machine Learning Ecosystem

Training Data ProcessingStreaming RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF,

Parameter Server

Clipper, TensorFlow

Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce,

ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Page 21: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Distributed System

Distributed System

Distributed System

Distributed System

The Machine Learning Ecosystem

Training Data ProcessingStreaming RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF,

Parameter Server

Clipper, TensorFlow

Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce,

ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Why distributed systems?

Page 22: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Distributed System

Distributed System

Distributed System

Distributed System

The Machine Learning Ecosystem

Training Data ProcessingStreaming RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF,

Parameter Server

Clipper, TensorFlow

Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce,

ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Why distributed systems?● More work (computation) than one

machine can do in a reasonable amount of time.

● More data than can fit in one machine.

Page 23: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Machine Learning Ecosystem

Distributed System

Distributed System

Distributed System

Distributed System

The Machine Learning Ecosystem

Training Data ProcessingStreaming RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF,

Parameter Server

Clipper, TensorFlow

Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce,

ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Aspects of a distributed system Why distributed systems?● More work (computation) than one

machine can do in a reasonable amount of time.

● More data than can fit in one machine.

Page 24: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Machine Learning Ecosystem

Distributed System

Distributed System

Distributed System

Distributed System

The Machine Learning Ecosystem

Training Data ProcessingStreaming RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF,

Parameter Server

Clipper, TensorFlow

Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce,

ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Aspects of a distributed system● Units of work “tasks” executed in parallel● Scheduling (which tasks run on which machines and when)● Data transfer● Failure handling● Resource management (CPUs, GPUs, memory)

Why distributed systems?● More work (computation) than one

machine can do in a reasonable amount of time.

● More data than can fit in one machine.

Page 25: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Machine Learning Ecosystem

What is Ray?

Training Data ProcessingStreaming RLModel

ServingHyperparameter

Search

Aspects of a distributed system● Units of work “tasks” executed in parallel● Scheduling (which tasks run on which machines and when)● Data transfer● Failure handling● Resource management (CPUs, GPUs, memory)

Why distributed systems?● More work (computation) than one

machine can do in a reasonable amount of time.

● More data than can fit in one machine.

Page 26: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Machine Learning Ecosystem

Distributed System (Ray)

What is Ray?

Libraries

Training Data ProcessingStreaming RLModel

ServingHyperparameter

Search

Aspects of a distributed system● Units of work “tasks” executed in parallel● Scheduling (which tasks run on which machines and when)● Data transfer● Failure handling● Resource management (CPUs, GPUs, memory)

Why distributed systems?● More work (computation) than one

machine can do in a reasonable amount of time.

● More data than can fit in one machine.

Page 27: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Distributed System

Distributed System

Distributed System

Distributed System

Training Data ProcessingStreaming RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF,

Parameter Server

Clipper, TensorFlow

Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce,

ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Page 28: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Distributed System (Ray)

Libraries

Training Data ProcessingStreaming RLModel

ServingHyperparameter

Search

Distributed System

Distributed System

Distributed System

Distributed System

Training Data ProcessingStreaming RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF,

Parameter Server

Clipper, TensorFlow

Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce,

ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Page 29: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Distributed System (Ray)

Libraries

Training Data ProcessingStreaming RLModel

ServingHyperparameter

Search

Distributed System

Distributed System

Distributed System

Distributed System

Training Data ProcessingStreaming RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF,

Parameter Server

Clipper, TensorFlow

Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce,

ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

This requires a very general underlying distributed system.

Page 30: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Distributed System (Ray)

Libraries

Training Data ProcessingStreaming RLModel

ServingHyperparameter

Search

Distributed System

Distributed System

Distributed System

Distributed System

Training Data ProcessingStreaming RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF,

Parameter Server

Clipper, TensorFlow

Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce,

ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

This requires a very general underlying distributed system.

Generality comes from tasks (functions) and actors (classes).

Page 31: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

31 ©2017 RISELab

def zeros(shape): return np.zeros(shape)

def dot(a, b): return np.dot(a, b)

Ray API

Page 32: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

32 ©2017 RISELab

@ray.remotedef zeros(shape): return np.zeros(shape)

@ray.remotedef dot(a, b): return np.dot(a, b)

Ray APITasks

Page 33: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

33 ©2017 RISELab

@ray.remotedef zeros(shape): return np.zeros(shape)

@ray.remotedef dot(a, b): return np.dot(a, b)

id1 = zeros.remote([5, 5])

Ray APITasks

id1

zeros

Page 34: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

34 ©2017 RISELab

@ray.remotedef zeros(shape): return np.zeros(shape)

@ray.remotedef dot(a, b): return np.dot(a, b)

id1 = zeros.remote([5, 5])id2 = zeros.remote([5, 5])

Ray APITasks

id1 id2

zeros zeros

Page 35: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

35 ©2017 RISELab

@ray.remotedef zeros(shape): return np.zeros(shape)

@ray.remotedef dot(a, b): return np.dot(a, b)

id1 = zeros.remote([5, 5])id2 = zeros.remote([5, 5])id3 = dot.remote(id1, id2)

Ray APITasks

id1 id2

id3

zeros zeros

dot

Page 36: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

36 ©2017 RISELab

@ray.remotedef zeros(shape): return np.zeros(shape)

@ray.remotedef dot(a, b): return np.dot(a, b)

id1 = zeros.remote([5, 5])id2 = zeros.remote([5, 5])id3 = dot.remote(id1, id2)ray.get(id3)

Ray APITasks

id1 id2

id3

zeros zeros

dot

Page 37: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

37 ©2017 RISELab

@ray.remotedef zeros(shape): return np.zeros(shape)

@ray.remotedef dot(a, b): return np.dot(a, b)

id1 = zeros.remote([5, 5])id2 = zeros.remote([5, 5])id3 = dot.remote(id1, id2)ray.get(id3)

Ray APITasks

id1 id2

id3

zeros zeros

dot

Show in Jupyter notebook!

Page 38: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

38 ©2017 RISELab

class Counter(object): def __init__(self): self.value = 0 def inc(self): self.value += 1 return self.value

Ray APIActors

Page 39: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

39 ©2017 RISELab

@ray.remote(num_gpus=1)class Counter(object): def __init__(self): self.value = 0 def inc(self): self.value += 1 return self.value

Ray APIActors

Page 40: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

40 ©2017 RISELab

@ray.remote(num_gpus=1)class Counter(object): def __init__(self): self.value = 0 def inc(self): self.value += 1 return self.value

c = Counter.remote()

Ray APIActors

Counter

Page 41: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

41 ©2017 RISELab

@ray.remote(num_gpus=1)class Counter(object): def __init__(self): self.value = 0 def inc(self): self.value += 1 return self.value

c = Counter.remote()id4 = c.inc.remote()

Ray APIActors

id4inc

Counter

Page 42: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

42 ©2017 RISELab

@ray.remote(num_gpus=1)class Counter(object): def __init__(self): self.value = 0 def inc(self): self.value += 1 return self.value

c = Counter.remote()id4 = c.inc.remote()id5 = c.inc.remote()

Ray APIActors

id4inc

Counter

inc id5

Page 43: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

43 ©2017 RISELab

@ray.remote(num_gpus=1)class Counter(object): def __init__(self): self.value = 0 def inc(self): self.value += 1 return self.value

c = Counter.remote()id4 = c.inc.remote()id5 = c.inc.remote()ray.get([id4, id5])

Ray APIActors

id4inc

Counter

inc id5

Page 44: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

How does this work under the hood?

Page 45: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

How does this work under the hood?

@ray.remotedef zeros(shape): return np.zeros(shape)

@ray.remotedef dot(a, b): return np.dot(a, b)

id1 = zeros.remote([5, 5])id2 = zeros.remote([5, 5])id3 = dot.remote(id1, id2)ray.get(id3)

Tasks

id1 id2

id3

zeros zeros

dot

Page 46: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

How does this work under the hood?

id1 id2

id3

zeros zeros

dot

Page 47: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

Page 48: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

Page 49: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

zeroszeros

dot

Page 50: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

zeros

dot

zeros

Page 51: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

zeroszeros

dot

Page 52: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

zeroszeros

obj2obj1

dot

Page 53: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

obj2obj1

dot

Page 54: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

obj2obj1

dot

Page 55: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

obj2obj1

dot

Page 56: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

obj2obj1

dot

obj2

Page 57: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

obj2obj1

dot

obj2

Page 58: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

obj2obj1

dot

obj2obj3

Page 59: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

WorkerWorker

Object Store

Scheduler

Global Control StoreGlobal Control Store

WorkerWorker

Object Store

Scheduler

Global Control Store

The Ray Architecture

WorkerWorker

Object Store

Scheduler

id1 id2

id3

zeros zeros

dot

obj2obj1

dot

obj2obj3

id1

zeros

id2

zeros

Page 60: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Primality Testing

Page 61: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Primality TestingWrite a function to test if p is a prime number (no divisors other than 1 and p).

Page 62: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Primality TestingWrite a function to test if p is a prime number (no divisors other than 1 and p).

Page 63: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Primality TestingWrite a function to test if p is a prime number (no divisors other than 1 and p).

Can this be parallelized?

Page 64: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Primality TestingWrite a function to test if p is a prime number (no divisors other than 1 and p).

Can this be parallelized?Yes! “For loops” are great candidates for parallelization.

Page 65: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Primality TestingWrite a function to test if p is a prime number (no divisors other than 1 and p).

1,2,3,...,k k+1,...,2k … p-k+1,...pCheck batches of divisors in parallel

Page 66: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Primality TestingWrite a function to test if p is a prime number (no divisors other than 1 and p).

1,2,3,...,k k+1,...,2k … p-k+1,...pCheck batches of divisors in parallel

Show in Jupyter notebook!

Page 67: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Page 68: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Relevant for upcoming homework assignment!

Page 69: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

1 2 3 4 5 6 7 8

Page 70: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

1 2 3 4 5 6 7 8

Page 71: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

add

1 2 3 4 5 6 7 8

Page 72: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

Page 73: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

Page 74: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2

Page 75: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2 add

Page 76: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2 add

id3

Page 77: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2 add

id3 add

Page 78: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2 add

id3 add

id4

Page 79: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2 add

id3 add

addid4

Page 80: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2 add

id3 add

addid4

id5

Page 81: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2 add

id3 add

add

add

id4

id5

Page 82: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2 add

id3 add

add

add

id4

id5

id6

Page 83: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2 add

id3 add

add

add

add

id4

id5

id6

Page 84: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

id1

add

1 2 3 4 5 6 7 8

add

id2 add

id3 add

add

add

add

id4

id5

id6 id7

Page 85: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction1 2 3 4 5 6 7 8

Add a bunch of values (2 at a time).

Page 86: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

add

1 2 3 4 5 6 7 8

add add add

Add a bunch of values (2 at a time).

Page 87: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

id1

add

1 2 3 4 5 6 7 8

add add add

id2 id3 id4

Add a bunch of values (2 at a time).

Page 88: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

id1

add

1 2 3 4 5 6 7 8

add add add

id2 id3 id4

add add

Add a bunch of values (2 at a time).

Page 89: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

id1

add

1 2 3 4 5 6 7 8

add add add

id2 id3 id4

add add

id5 id6

Add a bunch of values (2 at a time).

Page 90: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

id1

add

1 2 3 4 5 6 7 8

add add add

id2 id3 id4

add add

id5 id6

add

Add a bunch of values (2 at a time).

Page 91: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

id1

add

1 2 3 4 5 6 7 8

add add add

id2 id3 id4

add add

id5 id6

add

id7

Add a bunch of values (2 at a time).

Page 92: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

Page 93: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

What does the difference look like in code?

Page 94: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

What does the difference look like in code?

Page 95: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

What does the difference look like in code?

Page 96: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Parameter Server

Page 97: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Parameter Server

Page 98: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Parameter Server

Page 99: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Parameter Server

Page 100: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Parameter Server

Worker Worker Worker Worker

Page 101: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Parameter Server

Worker

Parameter Server Actor

Worker Worker Worker

Page 102: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Parameter Server

Worker

Parameter Server Actor

Worker Worker Worker

para

met

ers

grad

ient

s

Page 103: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Parallelism Example: Parameter Server

Parameter Server Actor

Worker

Parameter Server Actor

Parameter Server Actor

Worker Worker Worker

para

met

ers

grad

ient

s

Page 104: Ray, Distributed Computing, and · TensorFlow Serving Flink, many others Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL MapReduce, Hadoop, Spark Vizier, many internal systems

Conclusion● The ML ecosystem consists of many special-purpose systems

○ We are moving toward a single general-purpose system○ “Distributed systems” are becoming “libraries” and “applications”○ Examples like MapReduce, tree reduce, parameter server○ Interesting applications will combine all of these elements

● Ray is open source at https://github.com/ray-project/ray○ pip install ray (on Linux/Mac)○ Being developed here at Berkeley!

● Will be used in the next homework assignment