TRANSCRIPT
IBM Cloud / March, 2018 / © 2018 IBM Corporation
NVIDIA™ GPU Performance Testing and PowerAI on IBM Cloud
Alex Hudak, Offering Manager, IBM Cloud
Brian Wan, Software Engineer, IBM Cloud
Please note
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Contents
Part One: The AI & GPU Market 04
Part Two: NVIDIA GPUs on IBM Cloud 09
Part Three: Performance Data 16
Part Four: PowerAI on IBM Cloud 25
Part One: The Market
% of organizations that report some form of AI is already in production in their organization
Source: Teradata 2017
% of these organizations that still believe they will need to invest more in AI tech over the next 36 months
30%
“Nowadays, for machine learning and particularly deep learning, it’s all about GPUs.”
Forbes, December 2017
“Humanity’s moonshots like eradicating cancer, intelligent customer experiences, and self-driving vehicles are within reach of this next era of AI.”
Financial Services
Gaming
Medical Research
Automotive
Manufacturing
GPU Use Cases
Part Two: NVIDIA GPUs on IBM Cloud
Tesla V100: Bare Metal Monthly; Virtual (Coming Soon) Monthly & Hourly
Tesla K80: Bare Metal Monthly & Hourly
Tesla P100: Bare Metal Monthly; Virtual Monthly & Hourly
Tesla M60: Bare Metal Monthly & Hourly
NVIDIA GPUs on IBM Cloud
Tesla M60: Used for enterprise virtualization as well as boosting professional graphics performance. Bare Metal Monthly & Hourly.
Tesla K80: Reliable GPU performance, ideal for introductory AI computing at an affordable price. Bare Metal Monthly & Hourly.
Tesla P100: Essential performance for standard AI and HPC capabilities. Bare Metal Monthly; Virtual Monthly & Hourly.
Tesla V100: IBM Cloud's most powerful and advanced GPU, purpose-built for demanding Deep Learning workloads, with the performance of 100 CPUs in a single GPU. Bare Metal Monthly; Virtual (Coming Soon) Monthly & Hourly.
NVIDIA GPUs on IBM Cloud
NVIDIA GPU-Enabled Data Centers by region
North America: Dallas, Houston, Mexico, Montreal, Seattle, San Jose, Toronto, Washington DC
Europe: Amsterdam, Frankfurt, London, Milan, Oslo, Paris
South America: Sao Paulo
Asia: Chennai, Hong Kong, Seoul, Singapore, Tokyo
Australia: Melbourne, Sydney
NVIDIA GPU-Enabled Data Centers by GPU
NVIDIA™ Tesla™ V100 (Bare Metal): Dallas, Washington DC
NVIDIA™ Tesla™ P100 (Bare Metal): Dallas, San Jose, Washington DC
NVIDIA™ Tesla™ P100 (Virtual): Amsterdam, Seoul, Tokyo; Dallas and Washington DC (Mar 2018); London (April 2018)
NVIDIA™ Tesla™ K80 (Bare Metal): All GPU DCs
NVIDIA™ Tesla™ M60 (Bare Metal): All GPU DCs
NVIDIA GPUs on Virtual Servers
Run HPC, AI, and simulation workloads with efficiency and scalability in a virtual environment
Source: https://console.bluemix.net/docs/vsi/vsi_public_gpu.html#gpu
Profile               ac1.8x60      acl1.8x60     ac1.16x120    acl1.16x120
GPU                   1 x P100      1 x P100      2 x P100      2 x P100
GPU RAM (GB)          16            16            32            32
vCPU                  8             8             16            16
vCPU RAM (GB)         60            60            120           120
Storage Type          Block (SAN)   Local SSD     Block (SAN)   Local SSD
Boot Disk (GB)        25 and 100    100           25 and 100    100
Secondary Disks (GB)  4 x 2000      2 x 300       4 x 2000      2 x 300
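For scripting around these offerings, the profile table can be captured as a small lookup. A minimal sketch; the dictionary values are copied from the table above, while `pick_profile` is an illustrative helper, not part of any IBM Cloud SDK:

```python
# GPU virtual-server profiles from the table above, as a lookup.
PROFILES = {
    "ac1.8x60":    {"gpus": 1, "gpu_ram_gb": 16, "vcpus": 8,  "ram_gb": 60,  "storage": "Block (SAN)"},
    "acl1.8x60":   {"gpus": 1, "gpu_ram_gb": 16, "vcpus": 8,  "ram_gb": 60,  "storage": "Local SSD"},
    "ac1.16x120":  {"gpus": 2, "gpu_ram_gb": 32, "vcpus": 16, "ram_gb": 120, "storage": "Block (SAN)"},
    "acl1.16x120": {"gpus": 2, "gpu_ram_gb": 32, "vcpus": 16, "ram_gb": 120, "storage": "Local SSD"},
}

def pick_profile(min_gpus, local_ssd=False):
    """Return the smallest profile meeting the GPU count and storage need."""
    candidates = [
        (spec["vcpus"], name) for name, spec in PROFILES.items()
        if spec["gpus"] >= min_gpus
        and (spec["storage"] == "Local SSD") == local_ssd
    ]
    return min(candidates)[1]

print(pick_profile(2, local_ssd=True))  # acl1.16x120
```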
The only bare metal GPU provider
Superior security over competitors
All resources dedicated to a single user
NVIDIA GPUs on Bare Metal
Hard Drive: 1 TB SATA – 3.8 TB SSD
Cores: 12 – 28
Network: 100 Mbps – 10 Gbps
Memory: 64 GB – 1.5 TB
Part Three: Performance Data
Notes:
• Input dataset: ImageNet (crop size = 224x224); batch size = 64 per GPU (for both InceptionV3 and ResNet50 neural net models)
• With NVIDIA V100 GPUs, independent distribution mode for model variables and gradients was used for optimal performance.
• Mixed precision (FP16/32) leverages Tensor Cores in V100 GPUs.
• The SoftLayer bare-metal server has 48 logical CPU cores, while the Power9 server and AWS P3 instance have 64 logical CPU cores.
• With the same number of V100 GPUs, Power9 servers deliver better performance (up to 1.58X in single precision) than Amazon P3 instances. This could be attributed to Power9 CPU optimizations and CPU-GPU NVLink support.
• With 4 x V100 GPUs, the Power9 server delivers higher performance than the Amazon P3 instance with 8 x V100 GPUs in single precision mode.
• The AWS P3 instance does not scale well beyond 4 x V100 GPUs in single precision mode (although it does scale well leveraging Tensor Cores in half precision (FP16) mode).
• TensorFlow 1.4 and 1.5 versions do not leverage the Tensor Cores in V100 GPUs very well, so the latest TensorFlow 1.6-dev build was used for optimal half precision (FP16) performance.
• For InceptionV3 on TensorFlow, half precision (FP16) on the V100 GPUs uses Tensor Cores to achieve ~1.8X better performance than single precision. The larger performance gain (up to 4.4X) of FP16 on AWS P3 is due to the relatively low performance of the Deep Learning AMI with TensorFlow v1.4 used for single precision compared to TensorFlow 1.6-dev used for half precision mode.
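The runs described in these notes use the public tf_cnn_benchmarks script. A sketch of how such a command line could be assembled; the flag names follow that script, while the concrete values simply restate the settings above:

```python
# Sketch: assemble a tf_cnn_benchmarks command line matching the runs
# described in the notes (InceptionV3, batch size 64 per GPU, FP16 to
# engage the V100 Tensor Cores). Paths and defaults are illustrative.
def benchmark_cmd(model="inception3", num_gpus=4, batch_size=64,
                  fp16=False, variable_update="parameter_server"):
    args = [
        "python", "tf_cnn_benchmarks.py",
        f"--model={model}",
        f"--num_gpus={num_gpus}",
        f"--batch_size={batch_size}",           # per-GPU batch size
        f"--variable_update={variable_update}",
    ]
    if fp16:
        args.append("--use_fp16=True")          # mixed precision (FP16/32)
    return " ".join(args)

print(benchmark_cmd(fp16=True, variable_update="replicated"))
```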
Higher is better
• For TensorFlow 1.6, in the default single precision mode and parameter_server method of model variable and gradient distribution, we need at least 32 logical/virtual CPU cores (or 16 physical CPU cores) for each V100 GPU for optimal performance. TensorFlow 1.5 does not leverage the V100 GPU well.
• For the current version of TensorFlow (v1.5) in half-precision (FP16) mode, we still need 32 logical/virtual CPU cores (16 physical CPU cores) to drive a V100 GPU optimally.
• For the latest TensorFlow v1.6-dev build in half-precision (FP16) mode where the Tensor Cores on the V100 GPU are fully leveraged, the default parameter_server mode for model variable and gradient distribution becomes the performance bottleneck, so having more than 8 logical/virtual CPU cores (4 physical CPU cores) would not improve performance further.
• To alleviate the parameter_server performance bottleneck, we can use the independent (replicated_distributed) method, where model variables are replicated on each GPU. In this independent mode, we only need 4 logical/virtual CPU cores (2 physical CPU cores) for each V100 GPU.
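The sizing guidance in these bullets can be condensed into one helper. A sketch that simply encodes the measurements above as rules of thumb (assuming SMT-2, i.e. two logical threads per physical core):

```python
# Minimum logical (SMT) CPU cores needed to drive one V100 optimally,
# per the measurements above: parameter_server mode needs 32 logical
# cores per GPU (FP32, or FP16 before Tensor Cores are fully used);
# once Tensor Cores are the bottleneck-free path (TF 1.6-dev FP16),
# ~8 cores suffice; independent (replicated_distributed) mode needs 4.
def min_logical_cores_per_v100(variable_update, tensor_cores_fp16=False):
    if variable_update == "parameter_server":
        return 8 if tensor_cores_fp16 else 32
    return 4  # independent (replicated_distributed) mode

def physical_cores(logical):
    return logical // 2  # assumes SMT-2 (two hardware threads per core)

print(min_logical_cores_per_v100("parameter_server"))  # 32
print(min_logical_cores_per_v100("independent"))       # 4
```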
CPU cores required for an NVIDIA GPU
Deep Learning Model Training – Impact of vCPUs on NVIDIA V100 GPUs
Deep Learning Model Training – Impact of vCPUs on NVIDIA V100 GPUs
• For each P100 GPU, we need at least 8 logical/virtual CPU cores (4 physical CPU cores) for optimal performance. For 2 x P100 GPUs, we need at least 16 logical/virtual CPU cores (8 physical CPU cores) for optimal performance.
• For 4 x K80 GPU cores (2 x K80 PCIe GPU cards), we need at least 8 logical/virtual CPU cores (4 physical CPU cores) for optimal performance.
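The same sizing idea applies to the other GPUs. A rough rule-of-thumb lookup taken from the bullets above (illustrative only, not a hard requirement):

```python
# Minimum logical CPU cores per GPU for optimal training throughput,
# per the measurements above. The K80 entry is per GPU core: 8 logical
# cores drive 4 K80 GPU cores (2 PCIe cards), i.e. 2 per GPU core.
MIN_LOGICAL_CORES_PER_GPU = {"p100": 8, "k80": 2}

def min_logical_cores(gpu, count):
    """Logical (SMT) cores needed for `count` GPUs/GPU cores of a type."""
    return MIN_LOGICAL_CORES_PER_GPU[gpu] * count

print(min_logical_cores("p100", 2))  # 16
print(min_logical_cores("k80", 4))   # 8
```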
CPU cores required for an NVIDIA GPU
Deep Learning Model Training – Impact of vCPUs on NVIDIA P100 GPUs
Deep Learning Model Training – Impact of vCPUs on NVIDIA K80 GPUs
Amazon P3 instance in single and half-precision modes
IBM Cloud (SoftLayer) server delivers better price-performance in single and half-precision modes than Amazon P3 instance
Higher is better
Multi-Node - Distributed Deep Learning
• Optimal deep-learning training across a large number of GPUs (> 16 GPUs)
• Two options:
  – IBM Research's Distributed Deep Learning (DDL): an AI-framework-independent, low-latency communication library for implementing distributed deep learning. Currently for PowerAI only. Supports multiple AI frameworks (TensorFlow, Caffe, Caffe2, Torch, Keras, PyTorch, etc.). Tested – scales very well up to 256 GPUs across multiple servers.
  – Horovod (open source, contributed by Uber): novel overlapping of compute and communication. Requires integration of communication protocols into each AI framework. Only supports TensorFlow at this time. Tested – scales very well up to (at least) 24 GPUs.
IBM DDL Scaling Up to 256 GPUs on 64 Power8 Servers (Nodes)
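Both DDL and Horovod are built around an allreduce over worker gradients. A toy single-process sketch of the arithmetic contract only; a real ring allreduce pipelines chunks of the gradient around the ring of workers instead of gathering them in one place:

```python
# Toy sketch of the allreduce averaging at the heart of data-parallel
# training: each worker holds its own gradient vector, and allreduce
# leaves every worker with the element-wise mean across workers.
def allreduce_mean(worker_grads):
    n = len(worker_grads)
    summed = [sum(g) for g in zip(*worker_grads)]  # element-wise sum
    return [s / n for s in summed]                 # element-wise mean

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 workers
print(allreduce_mean(grads))  # [4.0, 5.0]
```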
Notes:
• Input dataset: ImageNet (crop size = 224x224)
• For Caffe, the highest batch sizes were used to fully exploit GPU memory. For TensorFlow, batch size = 64 per GPU.
• Mixed precision (16-bit input matrices, 32-bit accumulator) leverages Tensor Cores in V100 GPUs
For TensorFlow, independent distribution mode (replicated_distributed) for model variables and gradient aggregation delivers much better performance for 4 GPUs (and higher) than the default parameter_server mode.
Higher is better
Deep Learning Model Training – Power9 Server w/ NVIDIA Tesla V100 GPUs
NVIDIA Caffe: Impact of Batch Size – NVIDIA Tesla V100 GPUs – Single Precision
Deep Learning Training – VGG-16 on NVIDIA Caffe
Deep Learning Training – BVLC Caffe vs. NVIDIA Caffe – NVIDIA Tesla V100 GPUs
[Chart legend: NVIDIA Caffe w/ 1xV100, NVIDIA Caffe w/ 2xV100]
Part Four: PowerAI on IBM Cloud
In IBM Cloud
Available early 2Q 2018. Delivered via IBM Cloud Catalog. Billed through IBM Cloud. Supported by IBM and Nimbix.
• PowerAI Version 5
• On-Demand Cloud Provisioning
• Superb Price Performance
• Highly Scalable Distributed Deep Learning (DDL)
• Large Model Support
• Containerized and Extensible
• Powered by Trusted Partner Nimbix
PowerAI on IBM Cloud
(1) Based on IBM internal measurements on 1/25/18 on the following configuration: InceptionV3 neural network model training with TensorFlow benchmark script (https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks) on IBM Power System AC922 with 4 Nvidia V100 GPUs vs. Amazon P3 instance p3.8xlarge with 4 Nvidia V100 GPUs. Software versions on Power System AC922: CUDA 9.1, CuDNN 7.1, TensorFlow 1.5; on Amazon P3 instance: CUDA 9, CuDNN 7.0.5, TensorFlow 1.6-dev, Amazon Deep Learning AMI (Ubuntu) Version 2.0 (ami-9ba7c4e1). Hourly retail pricing for IBM PowerAI in IBM Cloud and Amazon P3 instances (https://aws.amazon.com/ec2/pricing/on-demand/).
(2) POWER8 performance data was collected on 3/7/2018 in Nimbix Cloud with IBM Power System S822LC for InceptionV3 model training with TensorFlow benchmark script (https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks), 4 Nvidia P100 GPUs, CUDA 9.0.176, CuDNN 7.0.5, and TensorFlow 1.4 in PowerAI R5. Amazon p3.8xlarge with 4 Nvidia V100 GPUs, CUDA 9, CuDNN 7.0.5, TensorFlow 1.6-dev, Amazon Deep Learning AMI (Ubuntu) Version 2.0 (ami-9ba7c4e1). Hourly retail pricing for IBM PowerAI in IBM Cloud and Amazon P3 instances (https://aws.amazon.com/ec2/pricing/on-demand/).
(3) Based on IBM internal measurements on 1/25/18 on the following configuration: InceptionV3 neural network model training with TensorFlow benchmark script (https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks) on IBM Power System AC922 with 4 Nvidia V100 GPUs vs. Amazon P3 instance p3.8xlarge with 4 Nvidia V100 GPUs. Software versions on Power System AC922: CUDA 9.1, CuDNN 7.1, TensorFlow 1.5; on Amazon P3 instance: CUDA 9, CuDNN 7.0.5, TensorFlow 1.6-dev, Amazon Deep Learning AMI (Ubuntu) Version 2.0 (ami-9ba7c4e1).
(4) Based on IBM internal measurements on 11/26/2017: https://developer.ibm.com/linuxonpower/perfcol/perfcol-mldl/.
1.6x (3): DL Training Throughput (Images/sec) with 4 NVIDIA V100 GPUs vs. Amazon AWS P3 instance
1.1x (2): Better Price Performance with POWER8 vs. Amazon AWS P3 instance
1.2x (1): Better Price Performance (DL Training Throughput per US$) with POWER9 vs. Amazon AWS P3 instance
3.7x (4): DL Training Throughput with Large Model Support (LMS) feature vs. comparable x86 server
Cloud Price-Performance Leadership
POWER9 Performance Leadership
Open Source Frameworks: Supported Distribution
Developer Ease-of-Use Tools
Faster Training Times viaHW & SW Performance Optimizations
PowerAI Service on IBM Cloud
• PowerAI with Distributed Deep Learning will be a generally available offering in the IBM Cloud Catalog
• Users will be able to provision PowerAI instances of various sizes through the web interface and CLI
• The offering will enable automated transfer and loading of data from IBM Cloud Object Storage into instances
• Authentication through IBM Identity Management
• IBM Cloud Logging and Auditing for user provisioning actions
• Data transfer between IBM Cloud and Nimbix over Direct Link Dedicated
• Data encrypted at rest and in transit between IBM and Nimbix
• Nimbix environment physical and digital security inspected
• PowerAI delivered to Nimbix by IBM via images inspected by IBM
PowerAI Service on IBM Cloud
[Diagram: IBM Cloud Catalog and IBM Cloud Object Storage connect over Direct Link Dedicated to the IBM Cloud PowerAI Service running at Nimbix]
Notices and disclaimers
© 2018 International Business Machines Corporation. No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights — use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is distributed "as is" without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted per the terms and conditions of the agreements under which they are provided.
IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in a controlled, isolated environment. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute, legal or other guidance or advice to any individual participant or their specific situation.
It is the customer's responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer's business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer follows any law.
Notices and disclaimers (continued)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements, or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM's products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com and [names of other referenced IBM products and services used in the presentation] are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Thank you
Alex Hudak, Offering Manager, IBM Cloud
[email protected]
+1-469-766-8058
ibm.com
Brian Wan, Software Engineer, IBM Watson and Cloud Platform
[email protected]
+1-512-286-8711
ibm.com