Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


  • Scaling Tensorflow models for training using multi-GPUs &

    Google Cloud ML

    BEE PART OF THE CHANGE

    Avenida de Burgos, 16 D, 28036 Madrid | hablemos@beeva.com

    www.beeva.com

  • 2

    Topics

    Cloud Machine Learning Engine (a.k.a. Cloud ML)

    NVIDIA GPUs

    Distributed computing

    Tensorflow

  • 3

    Index

    1. What is BEEVA? Who are we?
    2. High Performance Computing: objectives
    3. Experimental setup
    4. Scenario 1: Distributed Tensorflow
    5. Scenario 2: Cloud ML
    6. Overall Conclusions
    7. Future lines

  • 4

    What is BEEVA?

  • WWW.BEEVA.COM 5

    WE MAKE COMPLEX THINGS SIMPLE

    +40% annual growth over the last 3 years

    +550 employees in Spain

    BIG DATA | CLOUD COMPUTING | MACHINE INTELLIGENCE

    HIGH VALUE FOR INNOVATION: PRODUCT DEVELOPMENT (APIVERSITY, lince.io, Clever)

  • WWW.BEEVA.COM 6

    Technological Partners


    CLOUD: In the cloud we bet on the partners we believe work best and
    cover each client's needs, becoming experts in them and finding the
    best cloud solution for each project.
    AWS, Azure & Google Cloud Platform

    DATA: Data is the oil of the 21st century. At BEEVA we seek to ally
    with the best providers of data solutions.
    Cloudera, Hortonworks, MongoDB & Neo4j

    TECH: BEEVA's needs are constantly renewed, and we always seek to add
    new and powerful names in the sector to our portfolio of technological
    partners.
    RedHat, Puppet & Docker

  • BEE DIFFERENT WORK DIFFERENT

    PROVIDE PASSION AND VALUE TO THE WORK

    LEARN AND ENJOY WHAT YOU DO

    CREATE A GOOD ENVIRONMENT EVERY DAY

    OUT OF THE BOX THINKING

    BEE DIFFERENT AND SPECIAL

    www.beeva.com/empleo

    rrhh@beeva.com

  • 8

    Who am I?

    Ricardo Guerrero. A (very geeky) Telecommunications Engineer.

    1. Research: Computer Vision
    2. Development: Embedded systems (routers)
    3. Innovation: Data Scientist at BEEVA

    Free time: not too much (self-driving cars, Plants vs. Zombies)

  • 9

    Who is this?

    Telecommunications Engineer

    Data Scientist (Innovation team). Geek.

    Free time: computing digits of pi (just kidding, I hope)

    Enrique Otero

  • 10

    High Performance Computing: objectives

  • 11

    HPC line

    1. Scale ML models over GPU clusters.
    2. Ease ML deployments and their consumption by analysts.
    3. Analyze GPU cloud providers.
    4. Study vertical vs. horizontal scaling.
    5. Paradigms of parallelization: data parallelism (sync or async) vs. model parallelism.

  • 12

    Experimental setup

  • 13

    MNIST problem

    The "Hello World" of Machine Learning:
    it is easy to reach an accuracy above 97%.

    MNIST Dataset

  • 14

    MNIST problem

    MNIST Dataset

    Classify digits in bank checks (1998)

  • 15

    MNIST problem

    ICLR 2017. This happy guy is me.

  • 16

    Benchmark

    Model employed

    A 5-layer neural network proposed by Yann LeCun
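
    For reference, a minimal Keras sketch of a LeNet-style 5-layer network
    for MNIST (the layer sizes here are illustrative assumptions, not
    necessarily the exact configuration used in the benchmark):

    # LeNet-5-style CNN for MNIST in Keras (illustrative layer sizes).
    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential([
        Conv2D(6, (5, 5), activation='relu', input_shape=(28, 28, 1)),
        MaxPooling2D((2, 2)),
        Conv2D(16, (5, 5), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(120, activation='relu'),
        Dense(84, activation='relu'),
        Dense(10, activation='softmax'),  # 10 digit classes
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])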

  • 17

    Scenario 1: Distributed Tensorflow

  • 18

    How can we parallelize learning?

    CLUSTERS

    Communication issues: Latency

  • 19

    Single-machine learning

    Forward prop -> compute output

    Ytrue = 3

    Yest = -201.2

    Random initialization of weights

  • 20

    Single-machine learning

    Backprop -> weights update

    Ytrue = 3

    Yest = -201.2

    Err = 204.2
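
    A toy sketch of what these two slides show, for a single linear neuron
    with a squared-error loss (the specific numbers on the slides come from
    their own random initialization, so the values below will differ):

    # One training step on a single machine: forward prop, error, weight update.
    import numpy as np

    rng = np.random.RandomState(0)
    x = np.array([1.0, 2.0, 3.0])   # one training example
    y_true = 3.0                    # target output (as on the slide)
    w = rng.randn(3) * 10           # random initialization of weights
    lr = 0.01                       # learning rate

    y_est = w.dot(x)                # forward prop -> compute output
    err = y_true - y_est            # error between target and estimate
    grad = -2.0 * err * x           # gradient of (y_true - y_est)**2 w.r.t. w
    w = w - lr * grad               # backprop -> weights update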

  • 21

    How can we parallelize learning?

    Machine Learning:

    Andrew Ng

    https://es.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent

  • 22

    How can we parallelize learning?

    Example:

    Optimizer: Mini-batch Gradient Descent.
    Training set: 10 samples.
    Iterations: 1000 (10 x 100) -> the network will see the whole training
    set 100 times.

  • 23

    How can we parallelize learning?

    Equation warning

  • 24

    How can we parallelize learning?

  • 25

    How can we parallelize learning?

  • 26

    How can we parallelize learning?

    (Equation on slide: the update of a neuron's weights in terms of "the famous gradients".)
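
    The update these equation slides build up to is presumably the standard
    mini-batch gradient-descent rule (a generic statement, not necessarily
    the exact notation used on the slides). For a mini-batch B, learning
    rate \eta, and weights w:

        w \leftarrow w - \frac{\eta}{|B|} \sum_{(x_i, y_i) \in B} \nabla_w \, \ell(f(x_i; w), y_i)

    The \nabla_w \ell terms are "the famous gradients", averaged over the
    mini-batch.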

  • 27

    How can we parallelize learning?

  • 28

    How can we parallelize learning?

    (Diagram: the training set split across two machines, 5 examples each.)

  • 29

    How can we parallelize learning?

    (Diagram: two workers with 5 examples each and a parameter server that
    distributes the data and aggregates the gradients.)

  • 30

    How can we parallelize learning?

    Synchronous training: each of the M machines processes N examples
    (batch_size = N) and sends its gradients to the parameter server.
    Mathematically, this is equivalent to a single machine training with
    batch_size = M * N.
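
    Written out, the equivalence is just linearity of averaging (assuming
    all workers hold the same weights w when they compute their gradients):

        \frac{1}{M} \sum_{m=1}^{M} \frac{1}{N} \sum_{i=1}^{N} \nabla_w \ell(x_{m,i}, y_{m,i}; w)
        \;=\; \frac{1}{MN} \sum_{j=1}^{MN} \nabla_w \ell(x_j, y_j; w)

    i.e. the gradient the parameter server applies after aggregating M
    workers (N examples each) is exactly the gradient of one big mini-batch
    of M * N examples on a single machine.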

  • 31

    How can we parallelize learning?

    Synchronous training

  • 32

    How can we parallelize learning?

    Synchronous training vs. asynchronous training
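
    A minimal sketch of how this looks in distributed Tensorflow 1.x
    (between-graph replication with a parameter server; the cluster
    addresses, task index and tiny model are placeholders, and the script
    only trains once one process per cluster member is started):

    # Data-parallel training with a parameter server in TF 1.x (sketch).
    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0:2222"],                      # parameter server(s)
        "worker": ["worker0:2222", "worker1:2222"],  # workers
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # Variables go to the parameter server, ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        y = tf.placeholder(tf.float32, [None, 10])
        logits = tf.layers.dense(x, 10)
        loss = tf.losses.softmax_cross_entropy(y, logits)
        step = tf.train.get_or_create_global_step()

        opt = tf.train.GradientDescentOptimizer(0.5)
        # Synchronous training: wait for all replicas and apply the averaged
        # gradient. Remove this wrapper for asynchronous training.
        opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=2,
                                             total_num_replicas=2)
        train_op = opt.minimize(loss, global_step=step)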

  • 33

    How can we parallelize learning?

    (Analogy: car driver -> machine; a fast-response driver vs. a
    slow-response driver, for synchronous vs. asynchronous training.)

  • Preliminary conclusions

    Tensorflow examples are hard to adapt to other scenarios:
    high coupling between model, input, and parallel paradigm.

    Tensorflow is not a Deep Learning library but a mathematical engine,
    and it is very verbose.

    A high-level abstraction is recommended:
    Keras, TF-slim, TF Learn (old skflow, now tf.contrib.learn), TFLearn,
    Sonnet (DeepMind).

  • Distributed Tensorflow. Results

    We were not able to use a GPU cluster on GKE (Google Container Engine):
    not enough documentation on this issue.

    We tried TF-Slim first, but we were not able to make it work with
    multiple workers :(

    Parallel paradigm (on a single machine): asynchronous data parallelism
    is much faster than synchronous, and a little less accurate.

    paradigm | workers | accuracy | steps | time
    sync.    | 3       | 0.975    | 5000  | 62.8
    async.   | 3       | 0.967    | 5000  | 21.6

  • Single machine multi-GPUs. Results (I)

    Keras was our final choice.

    We patched an external project and made it work on AWS p2.8x :)

    With 4 GPUs we got (only) a 30% speedup. With 8 GPUs, even worse :(

    GPUs | epochs | accuracy | time (s/epoch)
    1    | 12     | 0.9884   | 6.8
    2    | 12     | 0.9898   | 5.2
    4    | 12     | 0.9891   | 4.9
    8    | 12     | 0.9899   | 6.4
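
    The usual single-machine multi-GPU pattern in Keras replicates the
    model on every GPU and splits each batch between the replicas. A sketch
    with the built-in keras.utils.multi_gpu_model (this is not the external
    project that was patched for the benchmark, and it assumes Keras >=
    2.0.9 with 4 visible GPUs):

    # Single-machine multi-GPU data parallelism in Keras (sketch).
    from keras.datasets import mnist
    from keras.utils import multi_gpu_model, to_categorical

    (x_train, y_train), _ = mnist.load_data()
    x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
    y_train = to_categorical(y_train, 10)

    # `model` is the LeNet-style network from the earlier sketch.
    parallel_model = multi_gpu_model(model, gpus=4)   # one replica per GPU
    parallel_model.compile(optimizer='adam', loss='categorical_crossentropy',
                           metrics=['accuracy'])
    # Each batch of 512 is split into 4 sub-batches of 128, one per GPU;
    # the sub-batch results are merged on the CPU, which is one place the
    # speedup can stall.
    parallel_model.fit(x_train, y_train, batch_size=512, epochs=12)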

  • 37

    How can we parallelize learning?

    CLUSTERS

    Communication issues: Latency

  • Preliminary conclusions

    The Tensorflow ecosystem is a bit immature:

    v1.0 is not backwards compatible with v0.12.

    Google provides tf_upgrade.py, but manual changes are sometimes
    necessary (see the sketch below).

    Many open issues awaiting a tensorflower...
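
    For reference, the upgrade script is run on a file or a whole source
    tree roughly like this (paths are illustrative):

    # Convert a single 0.x script to the 1.0 API:
    python tf_upgrade.py --infile model_v012.py --outfile model_v1.py
    # Or convert a whole tree:
    python tf_upgrade.py --intree src/ --outtree src_v1/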

  • Preliminary conclusions

    Scaling to serve models seems a solved issue:
    Seldon, Tensorflow Serving...

    Scaling to train models efficiently is not a solved issue.
    Our first experiments and external benchmarks confirm this point:
    horizontal scaling is not efficient.

    Data parallelism (sync or async) and GPU optimization are not solved issues.

    http://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html

  • 40

    Scenario 2: Cloud ML

  • 41

    Are you more familiar with Amazon?

    AWS -> Google Cloud Platform (GCP)
    EC2 -> Google Cloud Compute Engine
    S3  -> Google Cloud Storage
    ??  -> Google Cloud Machine Learning Engine (Cloud ML)

    Cloud ML has no direct AWS equivalent: it's like Heroku, a PaaS,
    but for Machine Learning.
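
    In practice you package the training code as a Python module and submit
    it from the command line; a sketch (the job name, bucket, paths and
    region are placeholders, and the command group has been renamed across
    gcloud releases):

    gcloud ml-engine jobs submit training mnist_job_1 \
        --module-name trainer.task \
        --package-path trainer/ \
        --region us-central1 \
        --job-dir gs://my-bucket/mnist_job_1 \
        --scale-tier BASIC_GPU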

  • 42

    What is Cloud ML?

  • 43

    What is Google Cloud ML?

    (Diagram: Cloud ML and Google Cloud Storage.)

  • 44

    Cloud ML & Kaggle

    The free trial account includes $300 in credits!

  • 45

    Pricing

    Pricing for training your models in the cloud is defined in terms of
    ML training units, which are an abstract measurement of the processing
    power involved. 1 ML training unit represents a standard machine
    configuration used by the training service.

    It's a bit complex. Let's read it:

  • 46

    Cluster configuration

  • 47

    Cluster configuration

    (Table: predefined scale tiers, ranging from many workers with a few
    parameter servers to a large number of both.)

  • 48

    Cluster configuration

    The following table uses rough "t-shirt" sizing to describe the machine types.

  • 49

    Cluster configuration

  • 50

    Results

    Scale tier  | Duration      | Price                    | Accuracy
    BASIC       | 1 h 2 min     | 0.01 ML units = $0.0049  | 0.9886
    STANDARD_1  | 16 min 4 sec  | 1.67 ML units = $0.818   | 0.99
    BASIC_GPU   | 23 min 56 sec | 0.82 ML units = $0.4018  | 0.989

    Infrastructure provisioning time is not negligible (~8 minutes).
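
    A quick sanity check on the price column: all three rows imply the same
    rate of roughly $0.49 per ML training unit (inferred from the table
    itself, not quoted from Google's price list):

        0.01 ML units x $0.49 ~= $0.0049
        1.67 ML units x $0.49 ~= $0.818
        0.82 ML units x $0.49 ~= $0.4018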

  • 51

    Conclusion

  • 52

    Overall Conclusions

    Distributed computing for ML is not a commodity: you need highly
    qualified engineers.

    Don't scale horizontally in ML. Most of the time it is not worth it,
    unless you have special conditions:

    A huge dataset (really huge), or
    a medium-sized dataset + Infiniband connections + an ML/DL framework
    with RDMA support (to reduce latency).

  • 53

    Overall Conclusions

    Google GPUs (beta) vs AWS GPUs: more cons than pros :(

    Tensorflow is growing fast but...

    a. Not easy, but there is Keras.

    b. We recommend (careful) adoption because of its big community.

  • 54

    Future lines

  • 55

    Future lines: Cloud ML changes very fast

    CIFAR10

    Recommender Systems (MovieLens)

  • 56

    ANY QUESTIONS?


  • Ricardo Guerrero Gómez-Olmedo

    Email: ricardo.guerrero@beeva.com

    Twitter: @ricgu8086 | Medium: medium.com/@ricardo.guerrero

    IT Researcher | BEEVA LABS

    hablemos@beeva.com | www.beeva.com

    We are hiring!!