Transcript
Page 1: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML

Scaling TensorFlow models for training using multi-GPUs & Google Cloud ML

BEE PART OF THE CHANGE

Avenida de Burgos, 16 D, 28036 [email protected]

www.beeva.com

Page 2: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Topics

Cloud Machine Learning Engine -> a.k.a. Cloud ML
NVIDIA GPUs

Distributed computing

TensorFlow

Page 3: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Index

1. What is BEEVA? Who are we?
2. High Performance Computing: objectives
3. Experimental setup
4. Scenario 1: Distributed TensorFlow
5. Scenario 2: Cloud ML
6. Overall Conclusions
7. Future lines

Page 4: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


What is BEEVA?

Page 5: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


“WE MAKE COMPLEX THINGS SIMPLE”

100%
+40% annual growth in the last 3 years
+550 employees in Spain

BIG DATA · CLOUD COMPUTING · MACHINE INTELLIGENCE

● HIGH VALUE FOR INNOVATION
● PRODUCT DEVELOPMENT (APIVERSITY, lince.io, Clever)

Page 6: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Technological Partners


CLOUD: In cloud we bet on the partners that we believe work best and cover the client's needs, which makes us experts and lets us find the best cloud solution for each project.
AWS, Azure & Google Cloud Platform

DATA: Data is the oil of the 21st century. At BEEVA we seek to ally with the best providers of data solutions.
Cloudera, Hortonworks, MongoDB & Neo4j

TECH: BEEVA's needs are constantly renewed, and we always look to add new, powerful names in the sector to our portfolio of technological partners.
RedHat, Puppet & Docker

Page 7: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML

BEE DIFFERENT WORK DIFFERENT

PROVIDE PASSION AND VALUE TO THE WORK

LEARN AND ENJOY WHAT YOU DO

CREATE A GOOD ENVIRONMENT EVERY DAY

‘OUT OF THE BOX’ THINKING

BEE DIFFERENT AND SPECIAL

www.beeva.com/empleo

[email protected]

Page 8: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Who am I?

Ricardo Guerrero
A (very geeky) Telecommunications Engineer.

1. Research: Computer Vision
2. Development: Embedded systems (routers)
3. Innovation: Data scientist at BEEVA

Free time: not much (self-driving cars, Plants vs. Zombies)

Page 9: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Who is this?

Telecommunications Engineer

Data Scientist (Innovation team)
Geek

Free time: computing decimals of π (just kidding… I hope)

Enrique Otero

Page 10: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


High Performance Computing: objectives

Page 11: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


HPC line

1. Scaling ML models over GPU clusters.
2. Ease ML deployments and their consumption by analysts.
3. Analyze GPU cloud providers.
4. Study vertical scaling vs. horizontal scaling.
5. Paradigms of parallelization: data parallelism (sync or async) vs. model parallelism.

Page 12: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Experimental setup

Page 13: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


MNIST problem

The “Hello World” of Machine Learning: it is easy to reach an accuracy over 97%.

MNIST Dataset

Page 14: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


MNIST problem

MNIST Dataset

Classifying digits on bank checks (1998)

Page 15: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


MNIST problem

ICLR 2017
This happy guy is me.

Page 16: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Benchmark

Model employed

A 5-layer neural network proposed by Yann LeCun
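The slides do not spell out the exact architecture, so here is a minimal sketch of a LeNet-style 5-layer network for MNIST in Keras; the layer sizes are illustrative assumptions, not the benchmark model itself.

```python
# Minimal LeNet-style sketch for MNIST in Keras (illustrative layer sizes).
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 1)),  # conv layer 1
    MaxPooling2D((2, 2)),
    Conv2D(64, (5, 5), activation='relu'),                            # conv layer 2
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(120, activation='relu'),                                    # fully connected 1
    Dense(84, activation='relu'),                                     # fully connected 2
    Dense(10, activation='softmax')                                   # output: 10 digit classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

Counting only the trainable layers (2 convolutional + 3 dense), this gives the 5 layers mentioned above.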

Page 17: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Scenario 1: Distributed TensorFlow

Page 18: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

CLUSTERS

Communication issues: Latency

Page 19: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Single-machine learning

Forward prop -> compute output
Y_true = 3, Y_est = -201.2 (random initialization of weights)

Page 20: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Single-machine learning

Backprop -> weight update
Y_true = 3, Y_est = -201.2, Err = 204.2

Page 21: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

Machine Learning: Andrew Ng

Page 22: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

Example:

● Optimizer: mini-batch gradient descent.
● Training set: 10 samples.
● Iterations: 1000 (10 × 100) -> the network will see the whole training set 100 times.
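As a toy illustration of the example above (not from the slides), a bare NumPy loop for mini-batch gradient descent on a linear model shows how iterations relate to passes over the training set:

```python
# Toy mini-batch gradient descent on a linear model (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))             # 10 training samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])       # targets from a known linear rule
w = np.zeros(3)                          # initial weights
batch_size, lr, epochs = 1, 0.1, 100     # 10 iterations/epoch x 100 epochs = 1000 iterations

for epoch in range(epochs):
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)   # gradient of the squared error
        w -= lr * grad                              # weight update
```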

Page 23: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

Equation warning

Page 24: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

Page 25: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

Page 26: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

[Figure: a neuron's weights and the famous gradients]
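For reference, the standard mini-batch gradient descent update that uses those gradients (a textbook formulation, not transcribed from the slide) is:

```latex
% w: weights, \eta: learning rate, B: mini-batch, L: per-example loss
w \leftarrow w - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_{w} L(x_i, y_i; w)
```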

Page 27: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

Page 28: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

5 examples 5 examples

Page 29: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

5 examples 5 examples

Parameter server

● Distribute data

● Aggregate gradients

Page 30: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

Synchronous training:

M machines, each processing N examples per step (batch_size = N), coordinated by a parameter server.
Mathematically equivalent to a single machine training with batch_size = M * N.
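The equivalence follows from a short, standard derivation (assuming the parameter server averages the workers' gradients computed on the same weights):

```latex
% M workers, each with a mini-batch B_k of N examples; averaging their gradients
% equals the gradient over the combined batch of M*N examples.
\frac{1}{M} \sum_{k=1}^{M} \frac{1}{N} \sum_{i \in B_k} \nabla_{w} L(x_i, y_i; w)
  = \frac{1}{M N} \sum_{i \in B_1 \cup \dots \cup B_M} \nabla_{w} L(x_i, y_i; w)
```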

Page 31: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

Synchronous training

Page 32: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

Synchronous training
Asynchronous training

Page 33: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

Analogy (car driver -> machine): fast-response driver vs. slow-response driver

Synchronous training vs. asynchronous training
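As context for the next slides, here is a minimal sketch of asynchronous data parallelism in distributed TensorFlow 1.x, using a parameter server and between-graph replicated workers; host names, ports and the tiny model are assumptions, not the code used in the experiments.

```python
# Minimal async data-parallel sketch with distributed TensorFlow 1.x (illustrative).
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222"],                                   # parameter server task
    "worker": ["worker0:2222", "worker1:2222"]            # two worker tasks
})
job_name, task_index = "worker", 0                        # set per process
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                                         # parameter servers host the variables
else:
    # Variables are placed on the ps; ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        y = tf.placeholder(tf.float32, [None, 10])
        logits = tf.layers.dense(x, 10)
        loss = tf.losses.softmax_cross_entropy(y, logits)
        global_step = tf.train.get_or_create_global_step()
        # Each worker applies its own gradients independently -> asynchronous updates.
        train_op = tf.train.GradientDescentOptimizer(0.5).minimize(
            loss, global_step=global_step)

    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0)) as sess:
        pass  # feed MNIST mini-batches and run train_op here
```

Synchronous training would additionally aggregate the gradients from all workers before each update (TensorFlow 1.x ships tf.train.SyncReplicasOptimizer for this).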

Page 34: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML

Preliminary conclusions

● TensorFlow examples are hard to adapt to other scenarios.
○ High coupling between model, input, and parallel paradigm.
○ Not a Deep Learning library, but a mathematical engine. Very high verbosity.
○ A high-level abstraction is recommended:
■ Keras, TF-Slim, TF Learn (old skflow, now tf.contrib.learn), TFLearn, Sonnet (DeepMind).

Page 35: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML

Distributed TensorFlow. Results

● We were not able to use a GPU cluster on GKE (Google Container Engine).
○ Not enough documentation on this issue.
● Parallel paradigm (on a single machine):
○ Asynchronous data parallelism is much faster than synchronous, and a little less accurate.
● We tried TF-Slim first, but we were not able to make it work with multiple workers :(

paradigm   workers   accuracy   steps   time
sync.      3         0.975      5000    62.8
async.     3         0.967      5000    21.6

Page 36: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML

Single machine, multi-GPU. Results (I)

● Keras was our final choice.
○ We patched an external project and made it work on AWS p2.8xlarge :)
○ With 4 GPUs we got (only) a 30% speedup. With 8 GPUs, even worse :(

GPUs   epochs   accuracy   time (s/epoch)
1      12       0.9884     6.8
2      12       0.9898     5.2
4      12       0.9891     4.9
8      12       0.9899     6.4
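For reference, a minimal sketch of the batch-splitting data parallelism this kind of patch implements in Keras over TensorFlow; the helper below is an illustrative reconstruction (names and details are assumptions), not the exact code used for the benchmark.

```python
# Batch-splitting data parallelism for a Keras model (illustrative sketch).
import tensorflow as tf
from keras.layers import Input, Lambda, concatenate
from keras.models import Model

def make_parallel(model, gpu_count):
    """Replicate `model` on `gpu_count` GPUs; each replica processes a slice of the batch."""
    with tf.device('/cpu:0'):
        x = Input(shape=model.input_shape[1:])
    outputs = []
    for g in range(gpu_count):
        with tf.device('/gpu:%d' % g):
            def slice_batch(t, parts=gpu_count, idx=g):
                # Take the idx-th slice of the batch for this GPU.
                size = tf.shape(t)[0] // parts
                return t[idx * size:(idx + 1) * size]
            shard = Lambda(slice_batch)(x)
            outputs.append(model(shard))
    with tf.device('/cpu:0'):
        merged = concatenate(outputs, axis=0)   # reassemble the full batch on the CPU
    return Model(inputs=x, outputs=merged)
```

Used as `parallel_model = make_parallel(model, 4)` before compiling, the replicas share one set of weights, so all shards contribute to a single update; the speedup then depends on how much the per-GPU compute outweighs the CPU-side merging and data transfer.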

Page 37: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


How can we parallelize learning?

CLUSTERS

Communication issues: Latency

Page 38: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML

Preliminary conclusions

● The TensorFlow ecosystem is a bit immature.
○ v1.0 is not backwards compatible with v0.12.
■ Google provides tf_upgrade.py, but manual changes are sometimes necessary.
○ Many open issues are awaiting a tensorflower...

Page 39: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML

Preliminary conclusions

● Scaling to serve models seems to be a solved issue.
○ Seldon, TensorFlow Serving...
● Scaling to train models efficiently is not a solved issue.
○ Our first experiments and external benchmarks confirm this point.
○ Horizontal scaling is not efficient.
○ Data parallelism (sync or async) and GPU optimization are not solved issues.

Page 40: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Scenario 2: Cloud ML

Page 41: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Are you more familiar with Amazon?

AWS -> Google Cloud Platform (GCP)
EC2 -> Google Compute Engine
S3 -> Google Cloud Storage
?? -> Google Cloud Machine Learning Engine (Cloud ML): it's like Heroku, a PaaS, but for Machine Learning

Page 42: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


What is Cloud ML?

Page 43: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


What is Google Cloud ML?

Google Cloud Storage

Page 44: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Cloud ML & Kaggle

The free trial account includes $300 in credits!

Page 45: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Pricing

It's a bit complex. Let's read it:

“Pricing for training your models in the cloud is defined in terms of ML training units, which are an abstract measurement of the processing power involved. 1 ML training unit represents a standard machine configuration used by the training service.”

Page 46: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Cluster configuration

Page 47: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Cluster configuration

“many workers”, “a few servers”, “a large number”

Page 48: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Cluster configuration

“The following table uses rough "t-shirt" sizing to describe the machine types.”
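For completeness, a hypothetical sketch of launching a training job from Python and picking the cluster via a scale tier (project, bucket and package names are placeholders; the gcloud CLI offers the same options):

```python
# Hypothetical sketch: submit a Cloud ML Engine training job with a chosen scale tier.
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')
job_spec = {
    'jobId': 'mnist_training_1',
    'trainingInput': {
        'scaleTier': 'BASIC_GPU',      # or BASIC, STANDARD_1, CUSTOM, ...
        'packageUris': ['gs://my-bucket/packages/trainer-0.1.tar.gz'],
        'pythonModule': 'trainer.task',
        'region': 'us-central1',
        'jobDir': 'gs://my-bucket/jobs/mnist_training_1',
    },
}
request = ml.projects().jobs().create(parent='projects/my-project', body=job_spec)
response = request.execute()   # the job then runs on Google-managed infrastructure
```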

Page 49: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Cluster configuration

Page 50: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Results

Tier         Duration       Price                       Accuracy
BASIC        1 h 2 min      0.01 ML units = $0.0049     0.9886
STANDARD_1   16 min 4 s     1.67 ML units = $0.818      0.99
BASIC_GPU    23 min 56 s    0.82 ML units = $0.4018     0.989

Infrastructure provisioning time is not negligible (~8 minutes).

Page 51: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Conclusion

Page 52: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Overall Conclusions

● Distributed computing for ML is not a commodity: you need highly qualified engineers.

● Don’t scale horizontally in ML. Most of the time it is not worth it, unless you have special conditions:
○ A huge dataset (really huge).
○ A medium-sized dataset + InfiniBand connections + an ML/DL framework with RDMA support (to reduce latency).

Page 53: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Overall Conclusions

● Google GPUs (beta) vs AWS GPUs: more cons than pros :(

● TensorFlow is growing fast, but...
a. It is not easy; however, there is Keras.
b. We recommend (careful) adoption because of its big community.

Page 54: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Future lines

Page 55: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Future lines: Cloud ML changes very fast

CIFAR10

Recommender Systems (MovieLens)

Page 56: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


ANY QUESTIONS?


Page 57: Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML

Ricardo Guerrero Gómez-Olmedo

Email: [email protected]
Twitter: @ricgu8086
Medium: medium.com/@ricardo.guerrero
IT Researcher | BEEVA LABS
[email protected] | www.beeva.com
We are hiring!!

