Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


  • Scaling Tensorflow models for training using multi-GPUs &

    Google Cloud ML

    BEE PART OF THE CHANGE

    Avenida de Burgos, 16 D, 28036 Madrid | hablemos@beeva.com

    www.beeva.com

  • 2

    Topics

    Cloud Machine Learning Engine (a.k.a. Cloud ML)

    NVIDIA GPUs

    Distributed computing

    Tensorflow

  • 3

    Index

    1. What is BEEVA? Who are we?
    2. High Performance Computing: objectives
    3. Experimental setup
    4. Scenario 1: Distributed Tensorflow
    5. Scenario 2: Cloud ML
    6. Overall Conclusions
    7. Future lines

  • 4

    What is BEEVA?

  • WWW.BEEVA.COM 5

    WE MAKE COMPLEX THINGS SIMPLE

    +40% annual growth over the last 3 years

    +550 employees in Spain

    BIG DATA | CLOUD COMPUTING | MACHINE INTELLIGENCE

    HIGH VALUE FOR INNOVATION: PRODUCT DEVELOPMENT (APIVERSITY, lince.io, Clever)

  • WWW.BEEVA.COM 6

    Technological Partners


    CLOUD: In the cloud we bet on the partners we believe work best and
    cover each client's needs, becoming experts in them and finding the
    best cloud solution for each project.
    AWS, Azure & Google Cloud Platform

    DATA: Data is the oil of the 21st century. At BEEVA we seek to ally
    with the best providers of data solutions.
    Cloudera, Hortonworks, MongoDB & Neo4j

    TECH: BEEVA's needs are constantly renewed, and we always seek to add
    new and powerful names in the sector to our portfolio of technological
    partners.
    RedHat, Puppet & Docker

  • BEE DIFFERENT WORK DIFFERENT

    PROVIDE PASSION AND VALUE TO THE WORK

    LEARN AND ENJOY WHAT YOU DO

    CREATE A GOOD ENVIRONMENT EVERY DAY

    OUT OF THE BOX THINKING

    BEE DIFFERENT AND SPECIAL

    www.beeva.com/empleo

    rrhh@beeva.com

  • 8

    Who am I?

    Ricardo Guerrero. A (very geeky) Telecommunications Engineer.

    1. Research: Computer Vision
    2. Development: Embedded systems (routers)
    3. Innovation: Data Scientist at BEEVA

    Free time: not too much (self-driving cars, Plants vs. Zombies)

  • 9

    Who is this?

    Telecommunications Engineer

    Data Scientist (Innovation team). Geek.

    Free time: computing digits of pi (just kidding, I hope)

    Enrique Otero

  • 10

    High Performance Computing: objectives

  • 11

    HPC line

    1. Scale ML models over GPU clusters.
    2. Ease ML deployments and their consumption by analysts.
    3. Analyze GPU cloud providers.
    4. Study vertical vs. horizontal scaling.
    5. Paradigms of parallelization: data parallelism (sync or async) vs. model parallelism.

  • 12

    Experimental setup

  • 13

    MNIST problem

    The "Hello World" of Machine Learning:
    it is easy to reach an accuracy above 97%.

    MNIST Dataset

  • 14

    MNIST problem

    MNIST Dataset

    Classify digits in bank checks (1998)

  • 15

    MNIST problem

    ICLR 2017. This happy guy is me.

  • 16

    Benchmark

    Model employed

    A 5-layer neural network proposed by Yann LeCun
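
    For reference, a minimal Keras sketch of a LeNet-style 5-layer network
    for MNIST (the layer sizes here are illustrative assumptions, not
    necessarily the exact configuration used in the benchmark):

    # LeNet-5-style CNN for MNIST in Keras (illustrative layer sizes).
    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential([
        Conv2D(6, (5, 5), activation='relu', input_shape=(28, 28, 1)),
        MaxPooling2D((2, 2)),
        Conv2D(16, (5, 5), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(120, activation='relu'),
        Dense(84, activation='relu'),
        Dense(10, activation='softmax'),  # 10 digit classes
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])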

  • 17

    Scenario 1: Distributed Tensorflow

  • 18

    How can we parallelize learning?

    CLUSTERS

    Communication issues: Latency

  • 19

    Single-machine learning

    Forward prop -> compute output

    Ytrue = 3

    Yest = -201.2

    Random initialization of weights

  • 20

    Single-machine learning

    Backprop -> weights update

    Ytrue = 3

    Yest = -201.2

    Err = 204.2
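
    A toy sketch of what these two slides show, for a single linear neuron
    with a squared-error loss (the specific numbers on the slides come from
    their own random initialization, so the values below will differ):

    # One training step on a single machine: forward prop, error, weight update.
    import numpy as np

    rng = np.random.RandomState(0)
    x = np.array([1.0, 2.0, 3.0])   # one training example
    y_true = 3.0                    # target output (as on the slide)
    w = rng.randn(3) * 10           # random initialization of weights
    lr = 0.01                       # learning rate

    y_est = w.dot(x)                # forward prop -> compute output
    err = y_true - y_est            # error between target and estimate
    grad = -2.0 * err * x           # gradient of (y_true - y_est)**2 w.r.t. w
    w = w - lr * grad               # backprop -> weights update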

  • 21

    How can we parallelize learning?

    Machine Learning:

    Andrew Ng

    https://es.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent

  • 22

    How can we parallelize learning?

    Example:

    Optimizer: Mini-batch Gradient Descent.
    Training set: 10 samples.
    Iterations: 1000 (10 x 100) -> the network will see the whole training
    set 100 times.

  • 23

    How can we parallelize learning?

    Equation warning

  • 24

    How can we parallelize learning?

  • 25

    How can we parallelize learning?

  • 26

    How can we parallelize learning?

    (Equation on slide: the update of a neuron's weights in terms of "the famous gradients".)
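
    The update these equation slides build up to is presumably the standard
    mini-batch gradient-descent rule (a generic statement, not necessarily
    the exact notation used on the slides). For a mini-batch B, learning
    rate \eta, and weights w:

        w \leftarrow w - \frac{\eta}{|B|} \sum_{(x_i, y_i) \in B} \nabla_w \, \ell(f(x_i; w), y_i)

    The \nabla_w \ell terms are "the famous gradients", averaged over the
    mini-batch.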

  • 27

    How can we parallelize learning?

  • 28

    How can we parallelize learning?

    (Diagram: the training set split across two machines, 5 examples each.)

  • 29

    How can we parallelize learning?

    (Diagram: two workers with 5 examples each and a parameter server that
    distributes the data and aggregates the gradients.)

  • 30

    How can we parallelize learning?

    Synchronous training: each of the M machines processes N examples
    (batch_size = N) and sends its gradients to the parameter server.
    Mathematically, this is equivalent to a single machine training with
    batch_size = M * N.
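
    Written out, the equivalence is just linearity of averaging (assuming
    all workers hold the same weights w when they compute their gradients):

        \frac{1}{M} \sum_{m=1}^{M} \frac{1}{N} \sum_{i=1}^{N} \nabla_w \ell(x_{m,i}, y_{m,i}; w)
        \;=\; \frac{1}{MN} \sum_{j=1}^{MN} \nabla_w \ell(x_j, y_j; w)

    i.e. the gradient the parameter server applies after aggregating M
    workers (N examples each) is exactly the gradient of one big mini-batch
    of M * N examples on a single machine.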

  • 31

    How can we parallelize learning?

    Synchronous training

  • 32

    How can we parallelize learning?

    Synchronous training vs. asynchronous training
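
    A minimal sketch of how this looks in distributed Tensorflow 1.x
    (between-graph replication with a parameter server; the cluster
    addresses, task index and tiny model are placeholders, and the script
    only trains once one process per cluster member is started):

    # Data-parallel training with a parameter server in TF 1.x (sketch).
    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0:2222"],                      # parameter server(s)
        "worker": ["worker0:2222", "worker1:2222"],  # workers
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # Variables go to the parameter server, ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        y = tf.placeholder(tf.float32, [None, 10])
        logits = tf.layers.dense(x, 10)
        loss = tf.losses.softmax_cross_entropy(y, logits)
        step = tf.train.get_or_create_global_step()

        opt = tf.train.GradientDescentOptimizer(0.5)
        # Synchronous training: wait for all replicas and apply the averaged
        # gradient. Remove this wrapper for asynchronous training.
        opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=2,
                                             total_num_replicas=2)
        train_op = opt.minimize(loss, global_step=step)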

  • 33

    How can we parallelize learning?

    (Analogy: car driver -> machine; a fast-response driver vs. a
    slow-response driver, for synchronous vs. asynchronous training.)

  • Preliminary conclusions

    Tensorflow examples are hard to adapt to other scenarios:
    high coupling between model, input, and parallel paradigm.

    Tensorflow is not a Deep Learning library but a mathematical engine,
    and it is very verbose.

    A high-level abstraction is recommended:
    Keras, TF-slim, TF Learn (old skflow, now tf.contrib.learn), TFLearn,
    Sonnet (DeepMind).

  • Distributed Tensorflow. Results

    We were not able to use a GPU cluster on GKE (Google Container Engine):
    not enough documentation on this issue.

    We tried TF-Slim first, but we were not able to make it work with
    multiple workers :(

    Parallel paradigm (on a single machine): asynchronous data parallelism
    is much faster than synchronous, and a little less accurate.

    paradigm | workers | accuracy | steps | time
    sync.    | 3       | 0.975    | 5000  | 62.8
    async.   | 3       | 0.967    | 5000  | 21.6

  • Single machine multi-GPUs. Results (I)

    Keras was our final choice.

    We patched an external project and made it work on AWS p2.8x :)

    With 4 GPUs we got (only) a 30% speedup. With 8 GPUs, even worse :(

    GPUs | epochs | accuracy | time (s/epoch)
    1    | 12     | 0.9884   | 6.8
    2    | 12     | 0.9898   | 5.2
    4    | 12     | 0.9891   | 4.9
    8    | 12     | 0.9899   | 6.4
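
    The usual single-machine multi-GPU pattern in Keras replicates the
    model on every GPU and splits each batch between the replicas. A sketch
    with the built-in keras.utils.multi_gpu_model (this is not the external
    project that was patched for the benchmark, and it assumes Keras >=
    2.0.9 with 4 visible GPUs):

    # Single-machine multi-GPU data parallelism in Keras (sketch).
    from keras.datasets import mnist
    from keras.utils import multi_gpu_model, to_categorical

    (x_train, y_train), _ = mnist.load_data()
    x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
    y_train = to_categorical(y_train, 10)

    # `model` is the LeNet-style network from the earlier sketch.
    parallel_model = multi_gpu_model(model, gpus=4)   # one replica per GPU
    parallel_model.compile(optimizer='adam', loss='categorical_crossentropy',
                           metrics=['accuracy'])
    # Each batch of 512 is split into 4 sub-batches of 128, one per GPU;
    # the sub-batch results are merged on the CPU, which is one place the
    # speedup can stall.
    parallel_model.fit(x_train, y_train, batch_size=512, epochs=12)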

  • 37

    How can we parallelize learning?

    CLUSTERS

    Communication issues: Latency

  • Preliminary conclusions

    The Tensorflow ecosystem is a bit immature:

    v1.0 is not backwards compatible with v0.12.

    Google provides tf_upgrade.py, but manual changes are sometimes
    necessary (see the sketch below).

    Many open issues awaiting a tensorflower...
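
    For reference, the upgrade script is run on a file or a whole source
    tree roughly like this (paths are illustrative):

    # Convert a single 0.x script to the 1.0 API:
    python tf_upgrade.py --infile model_v012.py --outfile model_v1.py
    # Or convert a whole tree:
    python tf_upgrade.py --intree src/ --outtree src_v1/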

  • Preliminary conclusions

    Scaling to serve models seems a solved issue:
    Seldon, Tensorflow Serving...

    Scaling to train models efficiently is not a solved issue.
    Our first experiments and external benchmarks confirm this point:
    horizontal scaling is not efficient.

    Data parallelism (sync or async) and GPU optimization are not solved issues.

    http://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html

  • 40

    Scenario 2: Cloud ML

  • 41

    Are you more familiar with Amazon?

    AWS -> Google Cloud Platform (GCP)
    EC2 -> Google Cloud Compute Engine
    S3  -> Google Cloud Storage
    ??  -> Google Cloud Machine Learning Engine (Cloud ML)

    Cloud ML has no direct AWS equivalent: it's like Heroku, a PaaS,
    but for Machine Learning.
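
    In practice you package the training code as a Python module and submit
    it from the command line; a sketch (the job name, bucket, paths and
    region are placeholders, and the command group has been renamed across
    gcloud releases):

    gcloud ml-engine jobs submit training mnist_job_1 \
        --module-name trainer.task \
        --package-path trainer/ \
        --region us-central1 \
        --job-dir gs://my-bucket/mnist_job_1 \
        --scale-tier BASIC_GPU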

  • 42

    What is Cloud ML?

  • 43

    What is Google Cloud ML?

    (Diagram: Cloud ML and Google Cloud Storage.)

  • 44

    Cloud ML & Kaggle

    The free trial account includes $300 in credits!

  • 45

    Pricing

    Pricing for training your models in the cloud is defined in terms of
    ML training units, which are an abstract measurement of the processing
    power involved. 1 ML training unit represents a standard machine
    configuration used by the training service.

    It's a bit complex. Let's read it:

  • 46

    Cluster configuration

  • 47

    Cluster configuration

    (Table: predefined scale tiers, ranging from many workers with a few
    parameter servers to a large number of both.)

  • 48

    Cluster configuration

    The following table uses rough "t-shirt" sizing to describe the machine types.

  • 49

    Cluster configuration

  • 50

    Results

    Scale tier  | Duration      | Price                    | Accuracy
    BASIC       | 1 h 2 min     | 0.01 ML units = $0.0049  | 0.9886
    STANDARD_1  | 16 min 4 sec  | 1.67 ML units = $0.818   | 0.99
    BASIC_GPU   | 23 min 56 sec | 0.82 ML units = $0.4018  | 0.989

    Infrastructure provisioning time is not negligible (~8 minutes).
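
    A quick sanity check on the price column: all three rows imply the same
    rate of roughly $0.49 per ML training unit (inferred from the table
    itself, not quoted from Google's price list):

        0.01 ML units x $0.49 ~= $0.0049
        1.67 ML units x $0.49 ~= $0.818
        0.82 ML units x $0.49 ~= $0.4018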

  • 51

    Conclusion

  • 52

    Overall Conclusions

    Distributed computing for ML is not a commodity: you need highly
    qualified engineers.

    Don't scale horizontally in ML. Most of the time it is not worth it,
    unless you have special conditions:

    A huge dataset (really huge), or
    a medium-sized dataset + Infiniband connections + an ML/DL framework
    with RDMA support (to reduce latency).

  • 53

    Overall Conclusions

    Google GPUs (beta) vs AWS GPUs: more cons than pros :(

    Tensorflow is growing fast but...

    a. Not easy, but there is Keras.

    b. We recommend (careful) adoption because of its big community.

  • 54

    Future lines

  • 55

    Future lines: Cloud ML changes very fast

    CIFAR10

    Recommender Systems (MovieLens)

  • 56

    ANY QUESTIONS?


  • Ricardo Guerrero Gómez-Olmedo

    Email: ricardo.guerrero@beeva.com

    Twitter: @ricgu8086 | Medium: medium.com/@ricardo.guerrero

    IT Researcher | BEEVA LABS

    hablemos@beeva.com | www.beeva.com

    We are hiring!!