Scaling TensorFlow models for training using multi-GPUs & Google Cloud ML
BEE PART OF THE CHANGE
Avenida de Burgos, 16 D, 28036 [email protected]
www.beeva.com
Topics
● Cloud Machine Learning Engine, a.k.a. Cloud ML
● NVIDIA GPUs
● Distributed computing
● TensorFlow
Index
1. What is BEEVA? Who are we?
2. High Performance Computing: objectives
3. Experimental setup
4. Scenario 1: Distributed TensorFlow
5. Scenario 2: Cloud ML
6. Overall Conclusions
7. Future lines
What is BEEVA?
“ WE MAKE COMPLEX THINGS SIMPLE”
100%
+40% annual growth over the last 3 years
+550 employees in Spain
BIG DATA · CLOUD COMPUTING · MACHINE INTELLIGENCE
● HIGH VALUE FOR INNOVATION
● PRODUCT DEVELOPMENT (APIVERSITY, lince.io, Clever)
Technological Partners
In cloud, we bet on the partners that we believe work best and cover the client's needs, making ourselves experts and finding the best cloud solution for each project.
AWS, Azure & Google Cloud Platform
Data is the oil of the 21st century. At BEEVA we seek to ally with the best providers of data solutions.
Cloudera, Hortonworks, MongoDB & Neo4j
BEEVA's needs are constantly renewed, and we always seek to add new, powerful names from the sector to our portfolio of technological partners.
RedHat, Puppet & Docker
CLOUD DATA TECH
BEE DIFFERENT, WORK DIFFERENT
PROVIDE PASSION AND VALUE TO THE WORK
LEARN AND ENJOY WHAT YOU DO
CREATE A GOOD ENVIRONMENT EVERY DAY
‘OUT OF THE BOX’ THINKING
BEE DIFFERENT AND SPECIAL
www.beeva.com/empleo
Who am I?
Ricardo Guerrero
A (very geeky) Telecommunications Engineer.
1. Research: Computer Vision
2. Development: Embedded systems (routers)
3. Innovation: Data Scientist at BEEVA
Free time: not too much (self-driving cars, Plants vs. Zombies)
Who is this?
Telecommunications Engineer
Data Scientist (Innovation team)
Geek
Free time: computing decimals of pi (just kidding… I hope)
Enrique Otero
High Performance Computing: objectives
HPC line
1. Scaling ML models over GPU clusters.
2. Easing ML deployments and their consumption by analysts.
3. Analyzing GPU cloud providers.
4. Studying vertical vs. horizontal scaling.
5. Parallelization paradigms: data parallelism (sync or async) vs. model parallelism.
Experimental setup
MNIST problem
The "Hello World" of Machine Learning: it is easy to reach an accuracy above 97%.
MNIST Dataset
MNIST problem
MNIST Dataset
Classifying digits on bank checks (1998)
MNIST problem
ICLR 2017. This happy guy is me.
Benchmark
Model employed
A 5-layer neural network proposed by Yann LeCun
Scenario 1: Distributed TensorFlow
How can we parallelize learning?
CLUSTERS
Communication issues: Latency
Single-machine learning
Forward prop -> compute output
Ytrue = 3
Yest = -201.2
Random initialization of weights
Single-machine learning
Backprop -> weights update
Ytrue = 3
Yest = -201.2
Err = 204.2
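The two slides above fit in a few lines of code. A minimal numpy sketch (a single linear neuron with made-up data, not the actual network): the forward pass computes the estimate, the error is Ytrue - Yest, and backprop nudges the weights against the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-neuron "network": y_est = w . x + b
x = np.array([1.0, 2.0, 3.0])
y_true = 3.0                      # target output, as in the slides
w = rng.normal(size=3) * 100      # random initialization of weights
b = 0.0

def forward(w, b, x):
    return w @ x + b

# Forward prop -> compute output
y_est = forward(w, b, x)
err = y_true - y_est              # in the slides: 3 - (-201.2) = 204.2

# Backprop -> weights update, for the squared-error loss L = 0.5 * err**2
lr = 0.01
grad_w = -err * x                 # dL/dw
grad_b = -err                     # dL/db
w -= lr * grad_w
b -= lr * grad_b

# After one update, the error shrinks
assert abs(y_true - forward(w, b, x)) < abs(err)
```

One such forward + backward pass is the unit of work that the rest of the talk tries to parallelize.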
How can we parallelize learning?
Machine Learning:
Andrew Ng
How can we parallelize learning?
Example:
● Optimizer: mini-batch gradient descent.
● Training set: 10 samples.
● Iterations: 1000 (10 x 100) -> the network will see the whole training set 100 times.
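The iteration count above can be checked with simple arithmetic, assuming a mini-batch size of 1 so that one epoch over the 10-sample training set takes 10 iterations:

```python
train_size = 10     # samples in the training set
batch_size = 1      # assumed: one sample per iteration
iterations = 1000

iterations_per_epoch = train_size // batch_size   # 10
epochs = iterations // iterations_per_epoch       # 100

# 1000 iterations = 10 x 100: the network sees the whole training set 100 times
assert iterations_per_epoch == 10 and epochs == 100
```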
How can we parallelize learning?
Equation warning
How can we parallelize learning?
[Figure: the mini-batch gradient descent equations: the loss averaged over the batch, and its gradient with respect to the neuron weights ("the famous gradients").]
How can we parallelize learning?
5 examples 5 examples
How can we parallelize learning?
5 examples 5 examples
Parameter server
● Distribute data
● Aggregate gradients
How can we parallelize learning?
[Figure: M machines, each processing N examples (batch_size = N) through a parameter server, are mathematically equivalent to a single machine with batch_size = M * N.]
Synchronous training
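The "mathematically equivalent" claim can be checked numerically. For a model trained on a mean loss, averaging the gradients that M workers compute on mini-batches of N examples gives exactly the gradient a single machine would compute on the combined batch of M * N examples. A minimal numpy sketch with a linear model and made-up data:

```python
import numpy as np

rng = np.random.default_rng(42)
M, N, D = 4, 8, 3                      # M workers, N examples each, D features
X = rng.normal(size=(M * N, D))        # the combined batch of M * N examples
y = rng.normal(size=M * N)
w = rng.normal(size=D)

def grad(Xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)**2)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Each worker computes a gradient on its own shard of N examples...
worker_grads = [grad(X[m*N:(m+1)*N], y[m*N:(m+1)*N], w) for m in range(M)]
# ...and the parameter server averages them.
averaged = np.mean(worker_grads, axis=0)

# Single machine, batch_size = M * N: mathematically equivalent
single = grad(X, y, w)
assert np.allclose(averaged, single)
```

This is why synchronous data parallelism behaves like large-batch training on one machine; asynchronous training breaks this equivalence in exchange for speed.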
How can we parallelize learning?
Synchronous training
How can we parallelize learning?
Synchronous training
Asynchronous training
How can we parallelize learning?
[Figure: car-driver analogy (driver -> machine): a fast-response driver vs. a slow-response driver illustrates synchronous vs. asynchronous training.]
Preliminary conclusions
● TensorFlow examples are hard to adapt to other scenarios.
○ High coupling between the model, the input, and the parallelization paradigm.
○ TensorFlow is not a deep learning library but a mathematical engine: very high verbosity.
○ Using a high-level abstraction is recommended:
■ Keras, TF-Slim, TF Learn (old skflow, now tf.contrib.learn), TFLearn, Sonnet (DeepMind).
Distributed TensorFlow. Results
● We were not able to use a GPU cluster on GKE (Google Container Engine).
○ Not enough documentation on this issue.
● Parallelization paradigm (on a single machine):
○ Asynchronous data parallelism is much faster than synchronous, but a little less accurate.
● We tried TF-Slim first, but we were not able to make it work with multiple workers :(

paradigm   workers   accuracy   steps   time
sync.      3         0.975      5000    62.8
async.     3         0.967      5000    21.6
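A quick check of the trade-off the table shows (time in the same units as the table):

```python
sync_time, async_time = 62.8, 21.6      # same 5000 steps, 3 workers
sync_acc, async_acc = 0.975, 0.967

speedup = sync_time / async_time        # async is roughly 2.9x faster
acc_drop = sync_acc - async_acc         # at the cost of ~0.8 accuracy points

assert round(speedup, 1) == 2.9
assert round(acc_drop, 3) == 0.008
```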
Single machine, multi-GPUs. Results (I)
● Keras was our final choice.
○ We patched an external project and made it work on an AWS p2.8x :)
○ With 4 GPUs we got (only) a 30% speedup. With 8 GPUs, even worse :(

GPUs   epochs   accuracy   time (s/epoch)
1      12       0.9884     6.8
2      12       0.9898     5.2
4      12       0.9891     4.9
8      12       0.9899     6.4
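Recomputing the speedup from the table: with 4 GPUs the time per epoch drops from 6.8 s to 4.9 s, roughly 1.4x, i.e. about 30% less time per epoch, and 8 GPUs are slower than 4:

```python
times = {1: 6.8, 2: 5.2, 4: 4.9, 8: 6.4}   # seconds per epoch, from the table

speedup_4 = times[1] / times[4]             # ~1.39x with 4 GPUs
time_saved_4 = 1 - times[4] / times[1]      # ~28% less time per epoch

assert round(speedup_4, 2) == 1.39
assert round(time_saved_4, 2) == 0.28
assert times[8] > times[4]                  # 8 GPUs: even worse
```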
How can we parallelize learning?
CLUSTERS
Communication issues: Latency
Preliminary conclusions
● The TensorFlow ecosystem is a bit immature.
○ v1.0 is not backwards compatible with v0.12.
■ Google provides tf_upgrade.py, but manual changes are sometimes necessary.
○ Many open issues awaiting a tensorflower...
Preliminary conclusions
● Scaling to serve models seems a solved issue.
○ Seldon, TensorFlow Serving...
● Scaling to train models efficiently is not a solved issue.
○ Our first experiments and external benchmarks confirm this point.
○ Horizontal scaling is not efficient.
○ Data parallelism (sync or async) and GPU optimization are not solved issues.
Scenario 2: Cloud ML
Are you more familiar with Amazon?
AWS -> Google Cloud Platform (GCP)
EC2 -> Google Compute Engine
S3 -> Google Cloud Storage
?? -> Google Cloud Machine Learning Engine (Cloud ML)
It's like Heroku, a PaaS, but for Machine Learning.
What is Cloud ML?

What is Google Cloud ML?
[Figure: architecture diagram involving Cloud Storage]
Cloud ML & Kaggle
The free trial account includes $300 in credits!
Pricing
It's a bit complex. Let's read it:
“Pricing for training your models in the cloud is defined in terms of ML training units, which are an abstract measurement of the processing power involved. 1 ML training unit represents a standard machine configuration used by the training service.”
Cluster configuration
“many workers”, “a few servers”, “a large number”
“The following table uses rough "t-shirt" sizing to describe the machine types.”
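Besides the predefined tiers (BASIC, STANDARD_1, BASIC_GPU), a training job can request a custom cluster built from these "t-shirt"-sized machine types. A hypothetical config.yaml sketch (field names follow the Cloud ML Engine job spec of the time; the exact machine-type names and limits may have changed):

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu        # "t-shirt" sized machine types
  workerType: standard_gpu
  workerCount: 3
  parameterServerType: large_model
  parameterServerCount: 1
```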
Results

Tier         Duration      Price                      Accuracy
BASIC        1 h 2 min     0.01 ML units = $0.0049    0.9886
STANDARD_1   16 min 4 s    1.67 ML units = $0.818     0.99
BASIC_GPU    23 min 56 s   0.82 ML units = $0.4018    0.989

Infrastructure provisioning time is not negligible (~8 minutes).
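The prices in the table imply a flat rate per ML training unit, roughly $0.49 (this rate is inferred from the table itself, not taken from the official price list):

```python
rate = 0.49  # USD per ML training unit, inferred from the table

jobs = {"BASIC": 0.01, "STANDARD_1": 1.67, "BASIC_GPU": 0.82}  # ML units consumed
prices = {tier: units * rate for tier, units in jobs.items()}

# Reproduces the prices in the results table
assert round(prices["BASIC"], 4) == 0.0049
assert round(prices["STANDARD_1"], 3) == 0.818
assert round(prices["BASIC_GPU"], 4) == 0.4018
```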
Conclusion
Overall Conclusions
● Distributed computing for ML is not a commodity: you need highly qualified engineers.
● Don't scale horizontally in ML. Most of the time it is not worth it, unless you have special conditions:
○ A huge dataset (really huge).
○ A medium-sized dataset + InfiniBand connections + an ML/DL framework with RDMA support (to reduce latency).
Overall Conclusions
● Google GPUs (beta) vs. AWS GPUs: more cons than pros :(
● TensorFlow is growing fast but...
a. It is not easy, but there is Keras.
b. We recommend (careful) adoption because of its big community.
Future lines
Future lines: Cloud ML changes very fast
● CIFAR10
● Recommender systems (MovieLens)
ANY QUESTIONS?
Ricardo Guerrero Gómez-Olmedo
Email: [email protected]
Twitter: @ricgu8086
Medium: medium.com/@ricardo.guerrero
IT Researcher | BEEVA LABS
[email protected] | www.beeva.com
We are hiring!!