
Ray, Distributed Computing, and Machine Learning

Robert Nishihara

11/15/2018

The Machine Learning Ecosystem

● Training
● Reinforcement Learning (RL)
● Model Serving
● Hyperparameter Search
● Data Processing
● Streaming

Each of these workloads is handled today by its own special-purpose distributed system:

● Training: Horovod, Distributed TF, Parameter Server
● Model Serving: Clipper, TensorFlow Serving
● Streaming: Flink, many others
● RL: Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL
● Data Processing: MapReduce, Hadoop, Spark
● Hyperparameter Search: Vizier, many internal systems at companies

Why distributed systems?

● More work (computation) than one machine can do in a reasonable amount of time.
● More data than can fit on one machine.

Aspects of a distributed system

● Units of work ("tasks") executed in parallel
● Scheduling (which tasks run on which machines, and when)
● Data transfer
● Failure handling
● Resource management (CPUs, GPUs, memory)

What is Ray?

Ray is a single general-purpose distributed system, with the workloads above provided as libraries on top of it: Training, Data Processing, Streaming, RL, Model Serving, Hyperparameter Search.


This requires a very general underlying distributed system.


Generality comes from tasks (functions) and actors (classes).

Ray API: Tasks

Ordinary Python functions become remote tasks with the @ray.remote decorator:

import ray
import numpy as np

ray.init()

@ray.remote
def zeros(shape):
    return np.zeros(shape)

@ray.remote
def dot(a, b):
    return np.dot(a, b)

id1 = zeros.remote([5, 5])  # returns an object ID immediately; the task runs asynchronously
id2 = zeros.remote([5, 5])
id3 = dot.remote(id1, id2)  # object IDs can be passed to other tasks; Ray tracks the dependency
ray.get(id3)                # blocks until the result is ready and returns the 5x5 matrix

This builds the task graph: zeros → id1, zeros → id2, dot(id1, id2) → id3.

Show in Jupyter notebook!

Ray API: Actors

Python classes become actors, i.e. stateful remote workers:

@ray.remote(num_gpus=1)  # the actor reserves one GPU
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()   # instantiate the actor as a remote process
id4 = c.inc.remote()   # method calls run as tasks on the actor, in order
id5 = c.inc.remote()
ray.get([id4, id5])    # [1, 2]

How does this work under the hood?

Recall the task graph from the tasks example: zeros → id1, zeros → id2, dot(id1, id2) → id3.

The Ray Architecture

Each machine in the cluster runs several workers, a shared-memory object store, and a local scheduler; a Global Control Store holds the cluster's metadata. [Architecture diagram: two machines, each with Workers, an Object Store, and a Scheduler, all connected to the Global Control Store.]

Walking through the tasks example on this architecture:

● The two zeros tasks are scheduled onto workers (possibly on different machines); each writes its result (obj1, obj2) into the object store on its machine.
● The dot task needs both arguments locally, so obj2 is transferred into the object store on the machine where dot is scheduled.
● dot executes and writes its result obj3 into the object store, where ray.get(id3) can retrieve it.

Parallelism Example: Primality Testing

Write a function to test whether p is a prime number (i.e., has no divisors other than 1 and p).

Can this be parallelized? Yes! "For loops" are great candidates for parallelization: check batches of divisors (1, 2, 3, ..., k; k+1, ..., 2k; ...; p−k+1, ..., p) in parallel.

Show in Jupyter notebook!
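As a concrete illustration, here is a minimal sketch of the batched approach using Ray tasks; the function name and batch size are illustrative, not taken from the slides:

import ray

ray.init()

@ray.remote
def has_divisor_in_range(p, start, end):
    # True if any integer in [start, end) divides p.
    return any(p % d == 0 for d in range(start, end))

def is_prime(p, batch_size=10_000):
    if p < 2:
        return False
    # One task per batch of candidate divisors in [2, p); batches run in parallel.
    ids = [has_divisor_in_range.remote(p, s, min(s + batch_size, p))
           for s in range(2, p, batch_size)]
    # p is prime iff no batch found a divisor.
    return not any(ray.get(ids))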

Parallelism Example: Tree Reduction

Relevant for upcoming homework assignment!

Add a bunch of values (two at a time), e.g. 1 2 3 4 5 6 7 8.

Sequential reduction: add the values one after another (add(1, 2) → id1, add(id1, 3) → id2, ..., → id7). Each add depends on the previous result, so the task graph is a chain: summing 8 values takes 7 sequential steps.

Tree reduction: add disjoint pairs in parallel (add(1, 2) → id1, add(3, 4) → id2, add(5, 6) → id3, add(7, 8) → id4), then combine the partial sums (add(id1, id2) → id5, add(id3, id4) → id6, add(id5, id6) → id7). Summing 8 values now takes only 3 rounds of parallel adds.

What does the difference look like in code?
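Here is a minimal sketch of both versions with Ray, assuming the number of values is a power of two (the exact code shown in the notebook may differ):

import ray

ray.init()

@ray.remote
def add(x, y):
    return x + y

values = [1, 2, 3, 4, 5, 6, 7, 8]

# Sequential reduction: each add depends on the previous result,
# so the task graph is a chain of depth len(values) - 1.
result = values[0]
for v in values[1:]:
    result = add.remote(result, v)
print(ray.get(result))  # 36

# Tree reduction: adjacent pairs are combined in parallel,
# so the task graph has depth log2(len(values)).
ids = list(values)
while len(ids) > 1:
    ids = [add.remote(ids[i], ids[i + 1]) for i in range(0, len(ids), 2)]
print(ray.get(ids[0]))  # 36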

Parallelism Example: Parameter Server

A parameter server actor holds the model parameters. Each worker repeatedly pulls the current parameters from the parameter server, computes gradients, and pushes those gradients back to update the parameters. The parameter server can also be sharded across several actors, with each worker exchanging parameters and gradients with every shard.
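A minimal sketch of this pattern with a single parameter-server actor; the class, update rule, and dummy gradient below are illustrative, not the slides' code:

import ray
import numpy as np

ray.init()

@ray.remote
class ParameterServer(object):
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def get_params(self):
        return self.params

    def update(self, grad):
        # Plain SGD step, for illustration.
        self.params -= 0.01 * grad

@ray.remote
def worker(ps, num_steps):
    for _ in range(num_steps):
        params = ray.get(ps.get_params.remote())  # pull parameters
        grad = np.ones_like(params)               # placeholder for a real gradient
        ps.update.remote(grad)                    # push gradients

ps = ParameterServer.remote(10)
ray.get([worker.remote(ps, 100) for _ in range(4)])  # run 4 workers in parallel
print(ray.get(ps.get_params.remote()))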

Conclusion

● The ML ecosystem consists of many special-purpose systems.
  ○ We are moving toward a single general-purpose system.
  ○ "Distributed systems" are becoming "libraries" and "applications".
  ○ Examples like MapReduce, tree reduce, parameter server.
  ○ Interesting applications will combine all of these elements.
● Ray is open source at https://github.com/ray-project/ray
  ○ pip install ray (on Linux/Mac)
  ○ Being developed here at Berkeley!
● Will be used in the next homework assignment.
