Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML


Post on 23-Jan-2018






Scaling Tensorflow models for training using multi-GPUs & Google Cloud ML
BEE PART OF THE CHANGE. Avenida de Burgos, 16 D, 28036 Madrid. hablemos@beeva.com. www.beeva.com

Topics
- Cloud Machine Learning Engine, a.k.a. Cloud ML
- NVIDIA GPUs
- Distributed computing
- Tensorflow

Index
1. What is BEEVA? Who are we?
2. High Performance Computing: objectives
3. Experimental setup
4. Scenario 1: Distributed Tensorflow
5. Scenario 2: Cloud ML
6. Overall conclusions
7. Future lines

What is BEEVA?
We make complex things simple: +40% annual growth over the last 3 years and +550 employees in Spain, working on big data, cloud computing and machine intelligence, with high value for innovation and product development (APIVERSITY, Clever).

Technological partners
- Cloud: we bet on the partners that we believe work best and cover the client's needs, making us experts at finding the best cloud solution for each project: AWS, Azure & Google Cloud Platform.
- Data: data is the oil of the 21st century, so at BEEVA we seek to ally with the best providers of data solutions: Cloudera, Hortonworks, MongoDB & Neo4j.
- Tech: BEEVA's needs are constantly renewed, and we always seek to add powerful new references from the sector to our portfolio of technological partners: RedHat, Puppet & Docker.

Bee different, work different: provide passion and value to the work, learn and enjoy what you do, create a good environment every day, think out of the box.

Who am I?
Ricardo Guerrero, a (very geeky) Telecommunications Engineer.
1. Research: computer vision
2. Development: embedded systems (routers)
3. Innovation: data scientist at BEEVA
Free time: not too much (self-driving cars, Plants vs. Zombies).

Who is this?
Enrique Otero: Telecommunications Engineer, data scientist (Innovation team), geek. Free time: computing decimals of pi (just kidding, I hope).

High Performance Computing: objectives (HPC line)
1. Scaling ML models over GPU clusters.
2. Easing ML deployments and their consumption by analysts.
3. Analyzing GPU cloud providers.
4.
Studying vertical scaling vs. horizontal scaling.
5. Paradigms of parallelization: data parallelism (sync or async) vs. model parallelism.

Experimental setup

The MNIST problem
MNIST is the "Hello World" of machine learning: it is easy to reach an accuracy over 97%. The dataset was originally used to classify digits in bank checks (1998). [Photo from ICLR 2017: "This happy guy is me."]

Benchmark
Model employed: the 5-layered neural network proposed by Yann LeCun.

Scenario 1: Distributed Tensorflow

How can we parallelize learning?
On clusters, communication issues (latency) are the main obstacle.

In single-machine learning, forward propagation computes the output (with randomly initialized weights, Ytrue = 3 but Yest = -201.2), and backpropagation updates the weights from the error (Err = 204.2). [Illustrations from Machine Learning, Andrew Ng.]

Example: with the mini-batch gradient descent optimizer, a training set of 10 samples and 1000 iterations (10 x 100), the network will see the whole training set 100 times.

[Equation slides: the mini-batch gradient descent update of the neuron weights from the famous gradients.]

With 10 samples split as 5 + 5 examples across two workers, a parameter server distributes the data and aggregates the gradients. Synchronous training on M machines with batch_size = N each is mathematically equivalent to a single machine with batch_size = M * N.

Synchronous vs. asynchronous training can be pictured with a car analogy (driver = machine): in synchronous training everybody advances at the pace of the slowest-response driver, while in asynchronous training the fast-response drivers do not wait.

Tensorflow examples are hard to adapt to other scenarios: there is high coupling between model, input, and parallel paradigm. Tensorflow is not a deep learning library but a mathematical engine.
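The "mathematically equivalent" claim for synchronous training can be checked on a toy model. The sketch below is illustrative only (a hypothetical one-weight linear model with mean-squared-error loss, all names invented here); a real setup would use Distributed Tensorflow workers and parameter servers.

```python
# Synchronous data parallelism on a toy model y_est = w * x:
# averaging the gradients of M workers (batch_size = N each) gives the
# same result as one machine processing batch_size = M * N.

def mse_grad(w, batch):
    """Gradient dL/dw of the mean squared error for y_est = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def single_machine_grad(w, data):
    # One machine sees the whole batch: batch_size = M * N.
    return mse_grad(w, data)

def parameter_server_grad(w, data, m_workers):
    # Distribute the data: M workers, N examples each.
    shards = [data[i::m_workers] for i in range(m_workers)]
    # Each worker computes its own gradient (in parallel in real life).
    worker_grads = [mse_grad(w, shard) for shard in shards]
    # The parameter server aggregates (averages) the gradients.
    return sum(worker_grads) / m_workers

data = [(x, 3.0 * x) for x in range(1, 9)]   # 8 samples of y = 3x
w = 0.5
assert abs(single_machine_grad(w, data)
           - parameter_server_grad(w, data, m_workers=4)) < 1e-9
```

The equivalence is exact only when all workers get equal-sized shards; uneven shards turn the average into a weighted one.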
It is also very verbose, so a high-level abstraction is recommended: Keras, TF-Slim, TF Learn (the old skflow, now tf.contrib.learn), TFLearn, or Sonnet (from DeepMind).

Preliminary conclusions
- We were not able to use a GPU cluster on GKE (Google Container Engine); there is not enough documentation on this issue.
- Parallel paradigm (on a single machine): asynchronous data parallelism is much faster than synchronous, but a little less accurate.
- We tried TF-Slim first, but we were not able to make it work with multiple workers :(

Distributed Tensorflow: results

paradigm  workers  accuracy  steps  time
sync.     3        0.975     5000   62.8
async.    3        0.967     5000   21.6

Keras was our final choice: we patched an external project and made it work on an AWS p2.8xlarge :) With 4 GPUs we got (only) a 30% speedup; with 8 GPUs it was even worse :(

Single machine, multi-GPU: results (I)

GPUs  epochs  accuracy  time (s/epoch)
1     12      0.9884    6.8
2     12      0.9898    5.2
4     12      0.9891    4.9
8     12      0.9899    6.4

How can we parallelize learning? On clusters, communication issues (latency) are the main obstacle, and the Tensorflow ecosystem is a bit immature: v1.0 is not backwards compatible with v0.12. Google provides a migration script, but manual changes are sometimes necessary, and many open issues are still awaiting a tensorflower...

Preliminary conclusions: scaling to serve models seems a solved issue (Seldon, Tensorflow Serving...).
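The accuracy gap between the sync and async rows above comes from stale gradients: an asynchronous worker pushes a gradient computed against parameters that other workers have already updated. A minimal single-weight simulation of this effect (illustrative only, not the actual benchmark; the `staleness` knob stands in for the lag between reading and applying parameters):

```python
# SGD on y_est = w * x where each gradient is computed from parameters
# that are `staleness` update steps old, as happens when asynchronous
# workers push gradients to a parameter server without synchronizing.

def sgd(batches, w=0.5, lr=0.02, staleness=0):
    history = [w]
    for t, batch in enumerate(batches):
        w_stale = history[max(0, t - staleness)]   # params the worker read
        grad = sum(2 * (w_stale * x - y) * x for x, y in batch) / len(batch)
        w = w - lr * grad                          # applied to current params
        history.append(w)
    return w

batches = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]] * 50   # samples of y = 3x
w_sync = sgd(batches, staleness=0)    # plain synchronous SGD
w_async = sgd(batches, staleness=2)   # gradients lag two updates behind
# Both approach the true weight w = 3, but the stale run follows a
# noisier, oscillating trajectory; in larger models that noise shows up
# as slightly lower final accuracy, as in the results table.
```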
Scaling to train models efficiently, however, is not a solved issue: our first experiments and external benchmarks confirm this point. Horizontal scaling is not efficient, and data parallelism (sync or async) and GPU optimization are not solved issues.

Scenario 2: Cloud ML

Are you more familiar with Amazon? In AWS terms:
- EC2 -> Google Cloud Compute Engine
- S3 -> Google Cloud Storage
- ?? -> Google Cloud Machine Learning Engine (Cloud ML)
Cloud ML is like Heroku, a PaaS, but for machine learning.

What is Google Cloud ML? [Diagram: training jobs read their data from Google Cloud Storage.]

Cloud ML & Kaggle: the free trial account includes $300 in credits!

Pricing
"Pricing for training your models in the cloud is defined in terms of ML training units, which are an abstract measurement of the processing power involved. 1 ML training unit represents a standard machine configuration used by the training service." It's a bit complex; let's read it.

Cluster configuration
[Tables from the documentation: the predefined scale tiers range from a single machine to many workers with a few parameter servers; the machine types are described with rough "t-shirt" sizing.]

Results

tier        duration      price                     accuracy
BASIC       1 h 2 min     0.01 ML units = $0.0049   0.9886
STANDARD_1  16 min 4 s    1.67 ML units = $0.818    0.99
BASIC_GPU   23 min 56 s   0.82 ML units = $0.4018   0.989

Infrastructure provisioning time is not negligible (~8 minutes).

Overall conclusions
- Distributed computing for ML is not a commodity: you need highly qualified engineers.
- Don't scale horizontally in ML. Most of the time it is not worth it, unless you have special conditions: a huge dataset (really huge), or a medium-size dataset plus Infiniband connections plus an ML/DL framework with RDMA support (to reduce latency).
- Google GPUs (beta) vs. AWS GPUs: more cons than pros :(
- Tensorflow is growing fast, but...
  a. It is not easy, though there is Keras.
  b.
We recommend (careful) adoption because of its big community.

Future lines
Cloud ML changes very fast. Next steps: CIFAR-10 and recommender systems (MovieLens).

Any questions?
Ricardo Guerrero. Twitter: @ricgu8086. Medium: Researcher | BEEVA | www.beeva.com
We are hiring!!
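As a footnote on the Cloud ML pricing table above: the three rows are consistent with a flat rate of about $0.49 per consumed ML training unit. That rate is inferred here from the slide's own numbers, not taken from an official quote, so treat it as an assumption:

```python
# Price of a Cloud ML training job = consumed ML training units x rate.
# RATE_USD is inferred from the results table (0.82 units -> $0.4018);
# it is an assumption, not an official Google price.
RATE_USD = 0.49   # assumed USD per ML training unit

def job_price(ml_units):
    return round(ml_units * RATE_USD, 4)

# Rows from the results table: (tier, consumed ML units, listed price).
jobs = [("BASIC", 0.01, 0.0049),
        ("STANDARD_1", 1.67, 0.818),   # 1.67 * 0.49 = 0.8183
        ("BASIC_GPU", 0.82, 0.4018)]
for tier, units, listed in jobs:
    assert abs(job_price(units) - listed) < 0.001, tier
```

Note that the ML-unit figures already fold in the job duration, which is why the hour-long BASIC run is the cheapest row despite being the slowest.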