
Page 1: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Letian Feng, Senior Product Manager - Technical, Amazon EC2

Pouyan Djahani, Director, Aon Benfield

Oliver Gunasekara, CEO & Co-founder, NGCodec Inc.

December 1, 2016

CMP317

Massively Parallel, Compute Intensive Workloads in the Cloud

Choosing the right hardware accelerator and example use cases

Page 2: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

What to Expect from the Session

• Overview of hardware acceleration

• Hardware acceleration on AWS

• GPU and FPGA use cases

• Guest speakers

• Best practices

Page 3: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Compute Intensive Workload

Virtual reality · Fluid dynamics · Genomics

[Diagram: these compute intensive workloads running on the CPU]

Page 4: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Compute Intensive Workload

• Scale on CPUs

• Batch jobs: Spot Instances

• Can we do better?
  • Some workloads are practically impossible to run on CPUs only – they take weeks
  • Execution latency reduction
  • Performance / cost optimization

Page 5: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

What is Hardware Acceleration?

• Use of specialized hardware (a hardware accelerator) to perform some functions more efficiently than software running on CPUs

GPU / FPGA / Custom Accelerator

Page 6: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

What is Hardware Acceleration?

CPUs are like Swiss army knives; hardware accelerators are like egg slicers.

Page 7: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

GPU for Accelerated Computing

• Ubiquitous

• High degree of data parallelism

• High floating-point arithmetic intensity

• Consistent, well-documented set of APIs (CUDA, OpenCL)

• Supported by a wide variety of ISVs and open-source frameworks

Page 8: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

GPU for Accelerated Computing

for (i = 0; i < N; i++) { ... }

for (j = 0; j < M; j++) { ... }

The GPU handles the compute-intensive functions – typically ~5% of the code but ~80% of the run-time. The CPU handles the rest.
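A minimal CUDA sketch of this offload pattern (illustrative only; the SAXPY kernel, sizes, and constants below are invented for this example, not taken from the deck):

#include <cuda_runtime.h>
#include <stdio.h>

// The hot loop from the slide, rewritten as a kernel: each GPU thread
// handles one iteration instead of the CPU looping over all of them.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // the "5% of code, 80% of run-time" part
}

int main(void) {
    const int N = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));   // unified memory keeps the sketch short
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }   // CPU handles setup

    saxpy<<<(N + 255) / 256, 256>>>(N, 3.0f, x, y);   // GPU handles the compute
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);   // CPU handles the rest (control flow, I/O)
    cudaFree(x); cudaFree(y);
    return 0;
}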

Page 9: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

FPGA for Accelerated Computing

• Custom hardware for specific algorithms

• Supports non-standard data structures

• Easier maintenance through field re-programmability

• Dataflow programming

• Suitable for applications with high dependencies between threads

• Offers large local memory and high memory bandwidth

• Cost-effective

Page 10: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

GPU vs FPGA

• GPU – data parallel
• FPGA – data flow

Page 11: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Hardware Acceleration on AWS

• 2010 - First GPU Instance – CG1

• 2013 - GPU-Graphics Instance – G2

• 2016 - GPU-Compute Instance – P2

• 2016 - FPGA Instance – F1 (Preview)

• And we continue to innovate…

Page 12: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Benefit of Hardware Acceleration on AWS

Easy-to-use Flexible Scalable Cost-effective

Page 13: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Benefit of FPGA on AWS

• Simplified hardware development process

• AWS takes care of the non-differentiated heavy lifting

• Hardware Development Kit (HDK)

• Free FPGA Developer AMI on the AWS Marketplace

Page 14: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Hardware Acceleration Use Cases

Page 15: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

GPU-Compute Use Cases

• Machine learning

• Engineering simulation

• Financial simulation

• Virtual reality

• In-memory database

• Rendering

• Transcoding

• And many more…

Page 16: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

FPGA Use Cases

• Genomics and proteomics

• Security analytics

• Big data analytics and search

• Video encoding

• Financial simulation

• Cryptography

• Data compression

• Chip simulation acceleration

• And many more…

Page 17: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Machine Learning

• Deep learning training requires massive parallel floating-point performance – a good candidate for accelerated computing

Page 18: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Engineering Simulation

Page 19: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Engineering Simulation

Page 20: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Pouyan Djahani, Director, Aon Benfield

PathWise™

Page 21: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

PathWise™ HPC Risk Management Platform

• Aon Benfield's PathWise™ is the fastest, most scalable (up to 1 million cores) integrated high performance computing risk management platform in the industry today

• The computational capabilities of GPU-driven HPC enable quantitative analysts and actuaries to accelerate financial computations from days to minutes, with 50-300 times the throughput of conventional legacy business solutions

• The platform includes tools for scenario generation, hedging, pricing, financial reporting, forecasting of capital and reserves, and complex asset and liability management strategies for life insurance companies

Page 22: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Hardware Accelerator Benefits for PathWise™

• Many financial computations in the life insurance business fall into one of the following categories:
  • Data parallel computing for deterministic calculations
    • Pricing a large number of financial instruments using closed-form methods (e.g., Black-Scholes), or pricing simple instruments such as interest rate swaps
  • Monte Carlo simulations for stochastic calculations
    • Pricing of exotic options and complex insurance products with no closed-form solutions
    • Stochastic-on-stochastic (SoS) calculations for capital and hedging simulations
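As a rough sketch of the data-parallel category (illustrative only – this is not PathWise code, and every name and parameter below is invented), one CUDA thread can price one European call with the closed-form Black-Scholes formula:

#include <cuda_runtime.h>

// One thread prices one option: the same instructions run in parallel
// over different data, which is the SIMD/SIMT fit described on Page 24.
__global__ void bs_call(int n, const float *S, const float *K,
                        float r, float sigma, float T, float *price) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sqrtT = sqrtf(T);
    float d1 = (logf(S[i] / K[i]) + (r + 0.5f * sigma * sigma) * T)
             / (sigma * sqrtT);
    float d2 = d1 - sigma * sqrtT;
    // normcdff() is CUDA's built-in single-precision standard normal CDF
    price[i] = S[i] * normcdff(d1) - K[i] * expf(-r * T) * normcdff(d2);
}

Launched over millions of instruments, every thread executes the same instructions on different data; the Monte Carlo and SoS categories replace the closed form with simulated paths per thread.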

Page 23: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Example: Stochastic-on-Stochastic Calculations

Typical number of valuations per policy:

  Outer loop    1,000
  Inner loop    5,000
  Time steps    360
  Shocks        30
  Total         54 billion  (1,000 × 5,000 × 360 × 30)

At each valuation, calculate:

• Stochastic lapses and mortalities

• Stochastic equity returns and interest rates over 30 years

• Cash flows for benefits/claims and premiums

• Stochastic valuation of assets and liabilities, and investment management strategies

Page 24: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Hardware Accelerator Benefits for PathWise™

• Traditional CPU-based hardware with a limited number of cores is not well suited to parallel calculations and cannot meet the computational demands.

• SIMD (Single Instruction, Multiple Data) architectures with a large number of cores and fast memory bandwidth are well suited to Monte Carlo simulations and data parallel computations, allowing thousands of paths/instances to execute the same instructions in parallel but on different sets of data.

• SIMT (Single Instruction, Multiple Threads) architectures provide even more flexibility and improved performance over SIMD, allowing a higher level of parallelization through multiple flow paths.

Page 25: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Hardware Accelerator Choices for PathWise™

Cell Processor
• Specialized hardware
• STI (Sony, Toshiba, IBM)
• Used in the Sony PlayStation 3
• Extremely difficult to program
• Discontinued abruptly in late 2009

GPUs
• Commodity hardware
• Proven track record for quality and performance in millions of graphics accelerators
• Comprehensive CUDA SDK, with active support of and commitment to GPGPU computing innovation from NVIDIA
• Our benchmarking in 2010 showed an average 150x performance advantage for NVIDIA C2050 GPUs over state-of-the-art Intel Xeon quad-core CPUs
• Availability in the AWS cloud

Page 26: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

The PathWise™ Model for Accelerated Hardware

PWML (PathWise™ Modeling Language)
        ↓
CUDA / OpenCL / HDL
        ↓
GPU / CPU / FPGA

• Business logic is implemented in PWML in a spreadsheet-like interface, completely decoupled from the underlying hardware

• No advanced programming skills are required to leverage the power of high performance computers
  • Syntax similar to Excel/VBA
  • System functions: min(), max(), iif(), avg(), …
  • User-defined functions
  • Shared libraries
  • Support for a wide range of RNGs

• Currently investigating FPGAs
  • Very excited about the AWS F1 announcement
  • Would accelerate calculations even further
  • Performance-per-watt advantage over GPUs

Page 27: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Oliver Gunasekara, CEO & Co-founder, NGCodec Inc.

Using Cloud FPGA Acceleration for Live H.265/HEVC Video Encoding

[email protected]

Page 28: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Founded in 2012: Belmore Capital + Xilinx Capital + NSF + customers

Technology: real-time, low-latency video codec in hardware (RTL FPGA)

World's top experts in low-latency video compression (~15 people)

Team has 10 years' experience with low-latency video codec hardware

1 granted patent + 8 pending patent applications + trade secrets on low latency

HQ in Sunnyvale, CA

Page 29: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Tsunamis Coming: Traditional Video Encoding

Online Video Exploding

Higher Quality Video

Up to 10x more CPUs

We need a new Accelerator

Page 30: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

New Type of Cloud Accelerator - FPGAs

[Diagram: performance and power efficiency both improve along the spectrum CPU → GPU → FPGA → ASIC]

Page 31: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Live Transcoding Demo (H.264 to H.265)

Live stream in (HD H.264, 5 Mbps) → live stream out (HD H.265, 2.5 Mbps)

Page 32: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Comparison for Live H.265/HEVC Encoding

EC2 instance type         CPU (C4)    CPU + GPU (P2)    FPGA (F1)
Type of hardware          Xeon E5     E5 & Tesla K80    Virtex UltraScale+
Video quality (VQ)        Average     Excellent         Excellent*
Video latency             Medium      Long              Very low
Cost to encode (4K)       Medium      High              Low
Time to develop encoder   Short       Medium            Long

*NGCodec Broadcast VQ is coming in 2017

Page 33: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

NGCodec H.265/HEVC Encoding on F1 Instance

Ported to an EC2 F1 instance in just 3 weeks!

A single F1 instance handles 2160p30; multiple instances for 8K

Enables ultra-low latency for new applications like cloud VR/AR

Significantly better VQ for live encoding at lower cost

Page 34: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Enabling New Applications: Cloud VR/AR

Mobile (smartphone-powered or standalone)
• Performance limited
• Limited battery life
• Poor positional tracking

Desktop, tethered
• Tether
• System cost

Desktop, un-tethered (WiFi)
• Low-latency codec
• Error robustness
• System cost

Cloud (4.5G / fiber + WiFi)
• Low-latency network
• Local data center
• Low-latency codec
• Error robustness

Page 35: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Best Practices

Page 36: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Best Practices

• AWS Deep Learning AMI

• AWS FPGA Developer AMI

• General system tuning tips

• NVIDIA driver settings for P2

• Data transfer between memory and GPU

• GPUDirect (GPU peer-to-peer communication)

Page 37: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

General System Tuning Tips

• Keep the Linux kernel up to date (3.10+)

• Use Enhanced Networking (Elastic Network Adapter) for the best network performance

• Use a placement group to achieve maximum network bandwidth within a cluster

• Use the TSC clock source

• Fully utilize host memory to cache hot data

• Amazon Linux is fully optimized for P2 and F1

Page 38: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

NVIDIA Driver Settings for P2

• Keep the GPU driver up to date
  • NVIDIA driver version 352.99 or above is required for P2 instances

• Enable persistence mode
  • nvidia-smi -pm 1

• Set clock speeds to the maximum frequency
  • nvidia-smi -ac 2505,875

• Enable/disable auto boost for burst performance or higher consistency
  • nvidia-smi --auto-boost-permission=0

Page 39: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Data Transfer Between Memory and GPU

• Minimize data transfer between host memory and the GPU
  • PCIe bandwidth is lower than local memory bandwidth

• Bulk copy before processing
  • Each cudaMemcpy call has overhead
  • Use cudaMemcpy2D or cudaMemcpy3D when copying higher-dimensional arrays

• Transferring from pinned host memory to the GPU is faster than transferring from pageable host memory

Page 40: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Data Transfer Between Memory and GPU

// 128MB copied in 1 cudaMemcpy call
Time(%)     Time   Calls       Avg       Min       Max  Name
100.00%  21.967ms      1  21.967ms  21.967ms  21.967ms  [CUDA memcpyHtoD]

// 128MB copied in 32768 chunks
Time(%)     Time   Calls       Avg       Min       Max  Name
100.00%  80.819ms  32768  2.4660us  2.3990us  9.0560us  [CUDA memcpyHtoD]
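A minimal sketch of how such a trace can be produced (sizes chosen to match the slide; run it under nvprof to get a table like the one above):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t bytes  = 128UL << 20;   // 128 MB, as on the slide
    const int    chunks = 32768;         // 4 KB per chunk
    const size_t chunk  = bytes / chunks;

    char *h = (char *)malloc(bytes);
    char *d;
    cudaMalloc(&d, bytes);

    // One bulk copy: a single call amortizes the per-call overhead.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // Same data volume in 32768 small copies: each call pays a fixed setup
    // cost, which is what the ~4x slowdown in the trace above reflects.
    for (int i = 0; i < chunks; i++)
        cudaMemcpy(d + i * chunk, h + i * chunk, chunk, cudaMemcpyHostToDevice);

    cudaFree(d);
    free(h);
    return 0;
}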

Page 41: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Data Transfer Between Memory and GPU

Block size: 512 MB

pPageable = (float*)malloc(bytes);          // pageable host memory
Host to Device bandwidth: 6.136164 GB/s
Device to Host bandwidth: 7.666220 GB/s

cudaMallocHost((void**)&pPinned, bytes);    // pinned host memory
Host to Device bandwidth: 7.932625 GB/s
Device to Host bandwidth: 7.953571 GB/s
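A sketch of the measurement (the exact benchmark behind these numbers isn't shown; this assumes the standard CUDA event-timing pattern):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Time one host-to-device copy with CUDA events and return GB/s.
static float h2d_gbps(float *src, float *dst_dev, size_t bytes) {
    cudaEvent_t t0, t1;
    float ms;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpy(dst_dev, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);      // milliseconds
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return bytes / ms / 1e6f;               // bytes/ms -> GB/s
}

int main(void) {
    const size_t bytes = 512UL << 20;            // 512 MB block, as on the slide
    float *pPageable = (float *)malloc(bytes);   // pageable: driver stages through
                                                 // an internal pinned buffer first
    float *pPinned;
    cudaMallocHost((void **)&pPinned, bytes);    // pinned: DMA reads host memory
                                                 // directly, and cudaMemcpyAsync
                                                 // can overlap with kernels
    float *d;
    cudaMalloc(&d, bytes);

    h2d_gbps(pPageable, d, bytes);               // warm-up (context init)
    printf("pageable H2D: %.2f GB/s\n", h2d_gbps(pPageable, d, bytes));
    printf("pinned   H2D: %.2f GB/s\n", h2d_gbps(pPinned, d, bytes));

    cudaFree(d); cudaFreeHost(pPinned); free(pPageable);
    return 0;
}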

Page 42: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

GPUDirect (GPU peer-to-peer communication)

• Use high-speed DMA transfers to copy data between the memories of two GPUs
  • Use cudaMemcpyPeer / cudaMemcpyPeerAsync

• NUMA-style access to memory on other GPUs from within CUDA kernels
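A minimal sketch (device IDs and buffer size assumed) of a direct GPU-to-GPU copy on a multi-GPU P2 instance:

#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 64UL << 20;
    float *d0, *d1;

    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    // Enable peer access in both directions (once per device pair); with it,
    // cudaMemcpyPeer uses direct DMA instead of staging through host memory.
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    // Copy: destination d1 on device 1, source d0 on device 0.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    // With peer access enabled, a kernel running on device 1 can also
    // dereference d0 directly -- the NUMA-style access mentioned above.
    cudaSetDevice(0); cudaFree(d0);
    cudaSetDevice(1); cudaFree(d1);
    return 0;
}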

Page 43: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Thank you!

Page 44: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Remember to complete your evaluations!

Page 45: AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive Workloads in the Cloud (CMP317)

Related Sessions

• CMP207 - High Performance Computing on AWS

• CMP312 - Powering the Next Generation of Virtual Reality with Verizon

• CMP314 - Bringing Deep Learning to the Cloud with Amazon EC2