
Page 1: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Letian Feng, Senior Product Manager - Technical, Amazon EC2

Pouyan Djahani, Director, Aon Benfield

Oliver Gunasekara, CEO & Co-founder, NGCodec Inc.

December 1, 2016

CMP317

Massively Parallel, Compute Intensive Workloads in the Cloud

Choosing the right hardware accelerator and example use cases

Page 2: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

What to Expect from the Session

• Overview of hardware acceleration

• Hardware acceleration on AWS

• GPU and FPGA use cases

• Guest speakers

• Best practices

Page 3: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Compute Intensive Workload

Virtual reality · Fluid dynamics · Genomics

[Diagram: these compute intensive workloads running on the CPU]

Page 4: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Compute Intensive Workload

• Scale on CPUs

• Batch jobs: Spot Instances

• Can we do better?
  • Some workloads are practically impossible to run on CPUs only – they take weeks
  • Execution latency reduction
  • Performance / cost optimization

Page 5: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

What is Hardware Acceleration?

• Use of specialized hardware (a hardware accelerator) to perform some functions more efficiently than software running on CPUs

GPU / FPGA / Custom Accelerator

Page 6: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

What is Hardware Acceleration?

CPUs are like Swiss army knives; hardware accelerators are like egg slicers.

Page 7: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

GPU for Accelerated Computing

• Ubiquitous

• High degree of data parallelism

• High floating-point arithmetic intensity

• Consistent, well-documented set of APIs (CUDA, OpenCL)

• Supported by a wide variety of ISVs and open-source frameworks

Page 8: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

GPU for Accelerated Computing

for (i = 0; i < N; i++) { ... }

for (j = 0; j < M; j++) { ... }

The GPU handles the compute-intensive functions – typically ~5% of the code but ~80% of the run-time. The CPU handles the rest.
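A minimal CUDA sketch of this offload pattern (illustrative only; the SAXPY kernel, sizes, and constants below are invented for this example, not taken from the deck):

#include <cuda_runtime.h>
#include <stdio.h>

// The hot loop from the slide, rewritten as a kernel: each GPU thread
// handles one iteration instead of the CPU looping over all of them.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // the "5% of code, 80% of run-time" part
}

int main(void) {
    const int N = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));   // unified memory keeps the sketch short
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }   // CPU handles setup

    saxpy<<<(N + 255) / 256, 256>>>(N, 3.0f, x, y);   // GPU handles the compute
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);   // CPU handles the rest (control flow, I/O)
    cudaFree(x); cudaFree(y);
    return 0;
}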

Page 9: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

FPGA for Accelerated Computing

• Custom hardware for specific algorithms

• Supports non-standard data structures

• Easier maintenance through field re-programmability

• Dataflow programming

• Suitable for applications with high dependencies between threads

• Offers large local memory and high memory bandwidth

• Cost-effective

Page 10: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

GPU vs FPGA

• GPU – data parallel
• FPGA – data flow

Page 11: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Hardware Acceleration on AWS

• 2010 - First GPU Instance – CG1

• 2013 - GPU-Graphics Instance – G2

• 2016 - GPU-Compute Instance – P2

• 2016 - FPGA Instance – F1 (Preview)

• And we continue to innovate…

Page 12: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Benefit of Hardware Acceleration on AWS

Easy-to-use Flexible Scalable Cost-effective

Page 13: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Benefit of FPGA on AWS

• Simplified hardware development process

• AWS takes care of the non-differentiated heavy lifting

• Hardware Development Kit (HDK)

• Free FPGA Developer AMI on the AWS Marketplace

Page 14: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Hardware Acceleration Use Cases

Page 15: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

GPU-Compute Use Cases

• Machine learning

• Engineering simulation

• Financial simulation

• Virtual reality

• In-memory database

• Rendering

• Transcoding

• And many more…

Page 16: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

FPGA Use Cases

• Genomics and proteomics

• Security analytics

• Big data analytics and search

• Video encoding

• Financial simulation

• Cryptography

• Data compression

• Chip simulation acceleration

• And many more…

Page 17: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Machine Learning

• Deep learning training requires massive parallel floating-point performance – a good candidate for accelerated computing

Page 18: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Engineering Simulation

Page 19: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Engineering Simulation

Page 20: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Pouyan Djahani, Director, Aon Benfield

PathWise™

Page 21: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

PathWise™ HPC Risk Management Platform

• Aon Benfield's PathWise™ is the fastest, most scalable (up to 1 million cores) integrated high performance computing risk management platform in the industry today

• The computational capabilities of GPU-driven HPC enable quantitative analysts and actuaries to accelerate financial computations from days to minutes, with 50-300 times the throughput of conventional legacy business solutions

• The platform includes tools for scenario generation, hedging, pricing, financial reporting, forecasting of capital and reserves, and complex asset and liability management strategies for life insurance companies

Page 22: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Hardware Accelerator Benefits for PathWise™

• Many financial computations in the life insurance business fall into one of the following categories:
  • Data parallel computing for deterministic calculations
    • Pricing a large number of financial instruments using closed-form methods (e.g., Black-Scholes), or pricing simple instruments such as interest rate swaps
  • Monte Carlo simulations for stochastic calculations
    • Pricing of exotic options and complex insurance products with no closed-form solutions
    • Stochastic-on-stochastic (SoS) calculations for capital and hedging simulations
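As a rough sketch of the data-parallel category (illustrative only – this is not PathWise code, and every name and parameter below is invented), one CUDA thread can price one European call with the closed-form Black-Scholes formula:

#include <cuda_runtime.h>

// One thread prices one option: the same instructions run in parallel
// over different data, which is the SIMD/SIMT fit described on Page 24.
__global__ void bs_call(int n, const float *S, const float *K,
                        float r, float sigma, float T, float *price) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sqrtT = sqrtf(T);
    float d1 = (logf(S[i] / K[i]) + (r + 0.5f * sigma * sigma) * T)
             / (sigma * sqrtT);
    float d2 = d1 - sigma * sqrtT;
    // normcdff() is CUDA's built-in single-precision standard normal CDF
    price[i] = S[i] * normcdff(d1) - K[i] * expf(-r * T) * normcdff(d2);
}

Launched over millions of instruments, every thread executes the same instructions on different data; the Monte Carlo and SoS categories replace the closed form with simulated paths per thread.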

Page 23: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Example: Stochastic-on-Stochastic Calculations

Typical number of valuations per policy:

  Outer loop    1,000
  Inner loop    5,000
  Time steps    360
  Shocks        30
  Total         54 billion  (1,000 × 5,000 × 360 × 30)

At each valuation, calculate:

• Stochastic lapses and mortalities

• Stochastic equity returns and interest rates over 30 years

• Cash flows for benefits/claims and premiums

• Stochastic valuation of assets and liabilities, and investment management strategies

Page 24: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Hardware Accelerator Benefits for PathWise™

• Traditional CPU-based hardware with a limited number of cores is not well suited to parallel calculations and cannot meet the computational demands.

• SIMD (Single Instruction, Multiple Data) architectures with a large number of cores and fast memory bandwidth are well suited to Monte Carlo simulations and data parallel computations, allowing thousands of paths/instances to execute the same instructions in parallel but on different sets of data.

• SIMT (Single Instruction, Multiple Threads) architectures provide even more flexibility and improved performance over SIMD, allowing a higher level of parallelization through multiple flow paths.

Page 25: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Hardware Accelerator Choices for PathWise™

Cell Processor
• Specialized hardware
• STI (Sony, Toshiba, IBM)
• Used in the Sony PlayStation 3
• Extremely difficult to program
• Discontinued abruptly in late 2009

GPUs
• Commodity hardware
• Proven track record for quality and performance in millions of graphics accelerators
• Comprehensive CUDA SDK, with active support of and commitment to GPGPU computing innovation from NVIDIA
• Our benchmarking in 2010 showed an average 150x performance advantage for NVIDIA C2050 GPUs over state-of-the-art Intel Xeon quad-core CPUs
• Availability in the AWS cloud

Page 26: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

The PathWise™ Model for Accelerated Hardware

PWML (PathWise™ Modeling Language)
        ↓
CUDA / OpenCL / HDL
        ↓
GPU / CPU / FPGA

• Business logic is implemented in PWML in a spreadsheet-like interface, completely decoupled from the underlying hardware

• No advanced programming skills are required to leverage the power of high performance computers
  • Syntax similar to Excel/VBA
  • System functions: min(), max(), iif(), avg(), …
  • User-defined functions
  • Shared libraries
  • Support for a wide range of RNGs

• Currently investigating FPGAs
  • Very excited about the AWS F1 announcement
  • Would accelerate calculations even further
  • Performance-per-watt advantage over GPUs

Page 27: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Oliver Gunasekara, CEO & Co-founder, NGCodec Inc.

Using Cloud FPGA Acceleration for Live H.265/HEVC Video Encoding

[email protected]

Page 28: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Founded in 2012: Belmore Capital + Xilinx Capital + NSF + customers

Technology: real-time, low-latency video codec in hardware (RTL FPGA)

World's top experts in low-latency video compression (~15 people)

Team has 10 years' experience with low-latency video codec hardware

1 granted patent + 8 pending patent applications + trade secrets on low latency

HQ in Sunnyvale, CA

Page 29: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Tsunamis Coming: Traditional Video Encoding

Online Video Exploding

Higher Quality Video

Up to 10x more CPUs

We need a new Accelerator

Page 30: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

New Type of Cloud Accelerator - FPGAs

[Diagram: performance and power efficiency both improve along the spectrum CPU → GPU → FPGA → ASIC]

Page 31: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Live Transcoding Demo (H.264 to H.265)

Live stream in (HD H.264, 5 Mbps) → live stream out (HD H.265, 2.5 Mbps)

Page 32: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Comparison for Live H.265/HEVC Encoding

EC2 instance type         CPU (C4)    CPU + GPU (P2)    FPGA (F1)
Type of hardware          Xeon E5     E5 & Tesla K80    Virtex UltraScale+
Video quality (VQ)        Average     Excellent         Excellent*
Video latency             Medium      Long              Very low
Cost to encode (4K)       Medium      High              Low
Time to develop encoder   Short       Medium            Long

*NGCodec Broadcast VQ is coming in 2017

Page 33: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

NGCodec H.265/HEVC Encoding on F1 Instance

Ported to an EC2 F1 instance in just 3 weeks!

A single F1 instance handles 2160p30; multiple instances for 8K

Enables ultra-low latency for new applications like cloud VR/AR

Significantly better VQ for live encoding at lower cost

Page 34: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Enabling New Applications: Cloud VR/AR

Mobile (smartphone-powered or standalone)
• Performance limited
• Limited battery life
• Poor positional tracking

Desktop, tethered
• Tether
• System cost

Desktop, un-tethered (WiFi)
• Low-latency codec
• Error robustness
• System cost

Cloud (4.5G / fiber + WiFi)
• Low-latency network
• Local data center
• Low-latency codec
• Error robustness

Page 35: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Best Practices

Page 36: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Best Practices

• AWS Deep Learning AMI

• AWS FPGA Developer AMI

• General system tuning tips

• NVIDIA driver settings for P2

• Data transfer between memory and GPU

• GPUDirect (GPU peer-to-peer communication)

Page 37: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

General System Tuning Tips

• Keep the Linux kernel up to date (3.10+)

• Use Enhanced Networking (Elastic Network Adapter) for the best network performance

• Use a placement group to achieve maximum network bandwidth within a cluster

• Use the TSC clock source

• Fully utilize host memory to cache hot data

• Amazon Linux is fully optimized for P2 and F1

Page 38: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

NVIDIA Driver Settings for P2

• Keep the GPU driver up to date
  • NVIDIA driver version 352.99 or above is required for P2 instances

• Enable persistence mode
  • nvidia-smi -pm 1

• Set clock speeds to the maximum frequency
  • nvidia-smi -ac 2505,875

• Enable/disable auto boost for burst performance or higher consistency
  • nvidia-smi --auto-boost-permission=0

Page 39: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Data Transfer Between Memory and GPU

• Minimize data transfer between host memory and the GPU
  • PCIe bandwidth is lower than local memory bandwidth

• Bulk copy before processing
  • Each cudaMemcpy call has overhead
  • Use cudaMemcpy2D or cudaMemcpy3D when copying higher-dimensional arrays

• Transferring from pinned host memory to the GPU is faster than transferring from pageable host memory

Page 40: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Data Transfer Between Memory and GPU

// 128MB copied in 1 cudaMemcpy call
Time(%)     Time   Calls       Avg       Min       Max  Name
100.00%  21.967ms      1  21.967ms  21.967ms  21.967ms  [CUDA memcpyHtoD]

// 128MB copied in 32768 chunks
Time(%)     Time   Calls       Avg       Min       Max  Name
100.00%  80.819ms  32768  2.4660us  2.3990us  9.0560us  [CUDA memcpyHtoD]
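A minimal sketch of how such a trace can be produced (sizes chosen to match the slide; run it under nvprof to get a table like the one above):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t bytes  = 128UL << 20;   // 128 MB, as on the slide
    const int    chunks = 32768;         // 4 KB per chunk
    const size_t chunk  = bytes / chunks;

    char *h = (char *)malloc(bytes);
    char *d;
    cudaMalloc(&d, bytes);

    // One bulk copy: a single call amortizes the per-call overhead.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // Same data volume in 32768 small copies: each call pays a fixed setup
    // cost, which is what the ~4x slowdown in the trace above reflects.
    for (int i = 0; i < chunks; i++)
        cudaMemcpy(d + i * chunk, h + i * chunk, chunk, cudaMemcpyHostToDevice);

    cudaFree(d);
    free(h);
    return 0;
}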

Page 41: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Data Transfer Between Memory and GPU

Block size: 512 MB

pPageable = (float*)malloc(bytes);          // pageable host memory
Host to Device bandwidth: 6.136164 GB/s
Device to Host bandwidth: 7.666220 GB/s

cudaMallocHost((void**)&pPinned, bytes);    // pinned host memory
Host to Device bandwidth: 7.932625 GB/s
Device to Host bandwidth: 7.953571 GB/s
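A sketch of the measurement (the exact benchmark behind these numbers isn't shown; this assumes the standard CUDA event-timing pattern):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Time one host-to-device copy with CUDA events and return GB/s.
static float h2d_gbps(float *src, float *dst_dev, size_t bytes) {
    cudaEvent_t t0, t1;
    float ms;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpy(dst_dev, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);      // milliseconds
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return bytes / ms / 1e6f;               // bytes/ms -> GB/s
}

int main(void) {
    const size_t bytes = 512UL << 20;            // 512 MB block, as on the slide
    float *pPageable = (float *)malloc(bytes);   // pageable: driver stages through
                                                 // an internal pinned buffer first
    float *pPinned;
    cudaMallocHost((void **)&pPinned, bytes);    // pinned: DMA reads host memory
                                                 // directly, and cudaMemcpyAsync
                                                 // can overlap with kernels
    float *d;
    cudaMalloc(&d, bytes);

    h2d_gbps(pPageable, d, bytes);               // warm-up (context init)
    printf("pageable H2D: %.2f GB/s\n", h2d_gbps(pPageable, d, bytes));
    printf("pinned   H2D: %.2f GB/s\n", h2d_gbps(pPinned, d, bytes));

    cudaFree(d); cudaFreeHost(pPinned); free(pPageable);
    return 0;
}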

Page 42: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

GPUDirect (GPU peer-to-peer communication)

• Use high-speed DMA transfers to copy data between the memories of two GPUs
  • Use cudaMemcpyPeer / cudaMemcpyPeerAsync

• NUMA-style access to memory on other GPUs from within CUDA kernels
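A minimal sketch (device IDs and buffer size assumed) of a direct GPU-to-GPU copy on a multi-GPU P2 instance:

#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 64UL << 20;
    float *d0, *d1;

    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    // Enable peer access in both directions (once per device pair); with it,
    // cudaMemcpyPeer uses direct DMA instead of staging through host memory.
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    // Copy: destination d1 on device 1, source d0 on device 0.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    // With peer access enabled, a kernel running on device 1 can also
    // dereference d0 directly -- the NUMA-style access mentioned above.
    cudaSetDevice(0); cudaFree(d0);
    cudaSetDevice(1); cudaFree(d1);
    return 0;
}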

Page 43: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Thank you!

Page 44: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Remember to complete your evaluations!

Page 45: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Related Sessions

• CMP207 - High Performance Computing on AWS

• CMP312 - Powering the Next Generation of Virtual Reality with Verizon

• CMP314 - Bringing Deep Learning to the Cloud with Amazon EC2