AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parallel, Compute Intensive...
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Letian Feng, Senior Product Manager - Technical, Amazon EC2
Pouyan Djahani, Director, Aon Benfield
Oliver Gunasekara, CEO & Co-founder, NGCodec Inc.
December 1, 2016
CMP317
Massively Parallel, Compute Intensive Workloads in the Cloud
Choosing the right hardware accelerator and example use cases
What to Expect from the Session
• Overview of hardware acceleration
• Hardware acceleration on AWS
• GPU and FPGA use cases
• Guest speakers
• Best practices
Compute Intensive Workloads
• Virtual reality
• Fluid dynamics
• Genomics
(diagram: these workloads running on a CPU)
Compute Intensive Workloads
• Scale out on CPUs
• Batch jobs: Spot Instances
• Can we do better?
• Some workloads are practically impossible to run on CPUs alone – they take weeks
• Execution latency reduction
• Performance / cost optimization
What is Hardware Acceleration?
• Use of specialized hardware (a hardware accelerator) to perform some functions more efficiently than software running on CPUs
• Examples: GPU, FPGA, custom accelerators
• CPUs are like Swiss army knives; hardware accelerators are like egg slicers
GPU for Accelerated Computing
• Ubiquitous
• High degree of data parallelism
• High floating-point arithmetic intensity
• Consistent, well-documented set of APIs (CUDA, OpenCL)
• Supported by a wide variety of ISVs and open source frameworks
GPU for Accelerated Computing

for (i = 0; i < N; i++) { … }   // compute-intensive loop → GPU
…
for (j = 0; j < M; j++) { … }   // compute-intensive loop → GPU

GPU handles the compute-intensive functions: ~5% of the code, ~80% of the run-time
CPU handles the rest
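As a minimal sketch of this offload pattern (a hypothetical SAXPY-style hot loop in CUDA, not code from the talk), the compute-intensive loop becomes a kernel while the CPU keeps the surrounding control flow:

#include <cuda_runtime.h>

// The "5% of code, 80% of run-time" loop, moved onto the GPU:
// each thread handles one iteration of the original for loop.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void run(int n, float a, const float *hx, float *hy) {
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, a, dx, dy);   // hot loop on the GPU

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);  // CPU handles the rest
    cudaFree(dx);
    cudaFree(dy);
}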
FPGA for Accelerated Computing
• Custom hardware for specific algorithms
• Supports non-standard data structures
• Easier maintenance through field re-programmability
• Dataflow programming
• Suitable for applications with high dependencies between threads
• Offers large local memory and high memory bandwidth
• Cost-effective
GPU vs FPGA
• GPU – data parallel
• FPGA – data flow
Hardware Acceleration on AWS
• 2010 – First GPU Instance: CG1
• 2013 – GPU-Graphics Instance: G2
• 2016 – GPU-Compute Instance: P2
• 2016 – FPGA Instance: F1 (Preview)
• And we continue to innovate…
Benefits of Hardware Acceleration on AWS
• Easy to use
• Flexible
• Scalable
• Cost-effective
Benefits of FPGAs on AWS
• Simplified hardware development process
• AWS takes care of the non-differentiated heavy lifting
• Hardware Development Kit (HDK)
• Free FPGA Developer AMI on the AWS Marketplace
Hardware Acceleration Use Cases
GPU-Compute Use Cases
• Machine learning
• Engineering simulation
• Financial simulation
• Virtual reality
• In-memory database
• Rendering
• Transcoding
• And many more…
FPGA Use Cases
• Genomics and proteomics
• Security analytics
• Big data analytics and search
• Video encoding
• Financial simulation
• Cryptography
• Data compression
• Chip simulation acceleration
• And many more…
Machine Learning
• Deep learning training requires massive parallel floating-point performance – a good candidate for accelerated computing
Engineering Simulation
Pouyan Djahani, Director, Aon Benfield
PathWise™
PathWise™ HPC Risk Management Platform
• Aon Benfield's PathWise™ is the fastest, most scalable (up to 1 million cores), fully integrated high performance computing risk management platform in the industry today
• The computational capability of GPU-driven HPC enables quantitative analysts and actuaries to accelerate financial computations from days to minutes, with 50-300 times the throughput of conventional legacy business solutions
• The platform includes tools for scenario generation, hedging, pricing, financial reporting, and forecasting of capital and reserves, plus complex asset and liability management strategies for life insurance companies
Hardware Accelerator Benefits for PathWise™
• Many financial computations in the life insurance business fall into one of the following categories (a generic sketch of the Monte Carlo case follows this list):
• Data-parallel computing for deterministic calculations
  • Pricing a large number of financial instruments using closed-form methods (e.g., Black-Scholes), or pricing simple instruments such as interest rate swaps
• Monte Carlo simulations for stochastic calculations
  • Pricing of exotic options and complex insurance products with no closed-form solutions
  • Stochastic-on-stochastic (SoS) calculations for capital and hedging simulations
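As a rough illustration of the Monte Carlo category (a generic CUDA sketch under geometric Brownian motion; this is not PathWise's PWML or its actual models, and every name and parameter here is hypothetical), each thread simulates one path and records a discounted European call payoff:

#include <curand_kernel.h>

// Generic Monte Carlo pricing sketch: one thread per simulated path.
__global__ void mc_call(float S0, float K, float r, float sigma, float T,
                        int nSteps, int nPaths, unsigned long long seed,
                        float *payoffs) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPaths) return;

    curandState state;
    curand_init(seed, p, 0, &state);     // independent RNG stream per path

    float dt = T / nSteps, S = S0;
    for (int t = 0; t < nSteps; t++) {   // evolve the path under GBM
        float z = curand_normal(&state);
        S *= expf((r - 0.5f * sigma * sigma) * dt + sigma * sqrtf(dt) * z);
    }
    payoffs[p] = expf(-r * T) * fmaxf(S - K, 0.0f);  // discounted payoff; average on the host
}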
Example: Stochastic-on-Stochastic Calculations

Typical number of valuations per policy:
• Outer loop: 1,000
• Inner loop: 5,000
• Time steps: 360
• Shocks: 30
• Total: 1,000 × 5,000 × 360 × 30 = 54 billion

At each valuation, calculate:
• Stochastic lapses and mortalities
• Stochastic equity returns and interest rates over 30 years
• Cash flows for benefits/claims and premiums
• Stochastic valuation of assets and liabilities and investment management strategies
Hardware Accelerator Benefits for PathWise™
• Traditional CPU-based hardware with a limited number of cores is not well suited to parallel calculations and cannot meet the computational demands.
• SIMD (Single Instruction, Multiple Data) architectures with a large number of cores and fast memory bandwidth are well suited to Monte Carlo simulations and data-parallel computations, allowing thousands of paths/instances to execute the same instructions in parallel but on different sets of data.
• SIMT (Single Instruction, Multiple Threads) architectures provide even more flexibility and improved performance over SIMD, allowing a higher level of parallelization through multiple flow paths (see the sketch below).
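A small sketch of the SIMT point (illustrative only): in CUDA's SIMT model each thread can take its own branch per element, where a pure SIMD lane would need hand-written masking:

// Each thread follows its own flow path, e.g. an in/out-of-the-money
// branch; the hardware serializes divergence within a warp, but the
// programmer simply writes per-thread control flow.
__global__ void per_path_branch(const float *S, float K, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (S[i] > K)
        out[i] = S[i] - K;   // this thread's flow path: exercise value
    else
        out[i] = 0.0f;       // a different flow path in the same warp
}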
Hardware Accelerator Choices for PathWise™

Cell Processor
• Specialized hardware from STI (Sony, Toshiba, IBM)
• Used in the Sony PlayStation 3
• Extremely difficult to program
• Discontinued abruptly in late 2009
GPUs
• Commodity hardware
• Proven track record for quality and performance in millions of graphics accelerators
• Comprehensive CUDA SDK, active support, and commitment to innovation in GPGPU computing from NVIDIA
• Our benchmarking in 2010 showed an average 150x performance advantage of NVIDIA C2050 GPUs over state-of-the-art Intel Xeon quad-core CPUs
• Availability in the AWS cloud
The PathWise™ Model for Accelerated Hardware
(diagram: PWML (PathWise™ Modeling Language) compiles via CUDA, OpenCL, or HDL to target GPUs, CPUs, or FPGAs)
• Business logic is implemented in PWML in a spreadsheet-like interface, completely decoupled from the underlying hardware
• No advanced programming skills are required to leverage the power of high performance computers
• Syntax similar to Excel/VBA
  • System functions: min(), max(), iif(), avg(), …
  • User-defined functions
  • Shared libraries
  • Support for a wide range of RNGs
• Currently investigating FPGAs
  • Very excited about the AWS F1 announcement
  • Accelerate calculations even further
  • Performance-per-watt advantage over GPUs
Oliver Gunasekara, CEO & Co-founder, NGCodec Inc.
Using Cloud FPGA Acceleration for Live H.265/HEVC Video Encoding
Oliver.Gunasekara@NGCodec.com
• Founded in 2012: Belmore Capital + Xilinx Capital + NSF + customers
• Technology: real-time, low-latency video codec in hardware (RTL FPGA)
• World's top experts in low-latency video compression (~15 people)
• Team has 10 years' experience in low-latency video codec hardware
• 1 granted patent + 8 pending applications + trade secrets on low latency
• HQ in Sunnyvale, CA
Tsunamis Coming: Traditional Video Encoding
• Online video is exploding
• Higher quality video
• Up to 10x more CPUs
We need a new accelerator
New Type of Cloud Accelerator - FPGAs
(chart: performance and power efficiency improve moving from CPU to GPU to FPGA to ASIC)
Live Transcoding Demo (H.264 to H.265)
• Input: live stream (HD H.264, 5 Mbps)
• Output: live stream (HD H.265, 2.5 Mbps)
Comparison for Live H.265/HEVC Encoding

EC2 instance type         CPU (C4)     CPU + GPU (P2)     FPGA (F1)
Type of hardware          Xeon E5      E5 & Tesla K80     Virtex UltraScale+
Video quality (VQ)        Average      Excellent          Excellent*
Video latency             Medium       Long               Very low
Cost to encode (4K)       Medium       High               Low
Time to develop encoder   Short        Medium             Long

*NGCodec broadcast VQ is coming in 2017
NGCodec H.265/HEVC Encoding on the F1 Instance
• Ported to the EC2 F1 instance in just 3 weeks!
• A single F1 instance handles 2160p30; multiple instances for 8K
• Enables ultra-low latency for new applications like cloud VR/AR
• Significantly better VQ for live encoding at lower cost
Enabling New Applications: Cloud VR/AR
(diagram comparing VR/AR architectures:
• Mobile – smartphone-powered or standalone headsets: performance limited, limited battery life, poor positional tracking
• Desktop – tethered (the tether itself, system cost) or un-tethered over WiFi (needs a low-latency codec, error robustness, system cost)
• Cloud – un-tethered over 4.5G / fiber + WiFi: needs a low-latency network, a local data center, a low-latency codec, and error robustness)
Best Practices
• AWS Deep Learning AMI
• AWS FPGA Developer AMI
• General system tuning tips
• NVIDIA driver settings for P2
• Data transfer between memory and GPU
• GPUDirect (GPU peer-to-peer communication)
General System Tuning Tips
• Keep the Linux kernel up to date (3.10+)
• Use enhanced networking (Elastic Network Adapter) for best network performance
• Use placement groups to achieve maximum network bandwidth within a cluster
• Use the TSC clock source
• Fully utilize host memory to cache hot data
• Amazon Linux is fully optimized for P2 and F1
NVIDIA Driver Settings for P2
• Keep the GPU driver up to date
  • NVIDIA driver version 352.99 or above is required for P2 instances
• Enable persistence mode
  • nvidia-smi -pm 1
• Set clock speeds to the maximum frequency
  • nvidia-smi -ac 2505,875
• Enable or disable auto boost, for bursting performance or for high consistency
  • nvidia-smi --auto-boost-default=0
Data Transfer Between Memory and GPU
• Minimize data transfer between host memory and the GPU
  • PCIe bandwidth is lower than local memory bandwidth
• Bulk copy before processing (measured on the next slide, with a sketch after it)
  • Each cudaMemcpy call has overhead
  • Use cudaMemcpy2D or cudaMemcpy3D when copying higher-dimensional arrays
• Transferring from pinned host memory to the GPU is faster than transferring from pageable host memory (also measured below)
Data Transfer Between Memory and GPU
// 128MB copied in 1 cudaMemcpy call
Time(%) Time Calls Avg Min Max Name
100.00% 21.967ms 1 21.967ms 21.967ms 21.967ms [CUDA memcpyHtoD]
// 128MB copied in 32768 chunks
Time(%) Time Calls Avg Min Max Name
100.00% 80.819ms 32768 2.4660us 2.3990us 9.0560us [CUDA memcpyHtoD]
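A minimal sketch of the kind of code behind these two profiles (sizes taken from the slide; run it under nvprof to reproduce the shape of the numbers):

#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    size_t bytes = 128u << 20;        // 128 MB
    size_t chunk = bytes / 32768;     // 4 KB per chunk
    char *src = (char *)malloc(bytes), *dst;
    cudaMalloc(&dst, bytes);

    // 1 bulk copy: one cudaMemcpy call amortizes the per-call overhead
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);

    // 32768 small copies: the same data, dominated by per-call overhead
    for (size_t off = 0; off < bytes; off += chunk)
        cudaMemcpy(dst + off, src + off, chunk, cudaMemcpyHostToDevice);

    cudaFree(dst);
    free(src);
    return 0;
}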
Data Transfer Between Memory and GPU
Block size: 512 MB
pPageable = (float*)malloc(bytes);
Host to Device bandwidth: 6.136164 GB/s
Device to Host bandwidth: 7.666220 GB/s
cudaMallocHost((void**)&pPinned, bytes);
Host to Device bandwidth: 7.932625 GB/s
Device to Host bandwidth: 7.953571 GB/s
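A sketch of the pageable-versus-pinned measurement above (bandwidthTest-style timing with CUDA events; the 512 MB block size is from the slide):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time one host-to-device copy and return the achieved bandwidth.
static float copyGBps(float *host, float *dev, size_t bytes) {
    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop); cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return bytes / ms / 1e6f;          // bytes per ms -> GB/s
}

int main() {
    size_t bytes = 512u << 20;         // 512 MB
    float *pageable = (float *)malloc(bytes);
    float *pinned, *dev;
    cudaMallocHost((void **)&pinned, bytes);   // page-locked: DMA skips a staging copy
    cudaMalloc((void **)&dev, bytes);

    printf("pageable HtoD: %.2f GB/s\n", copyGBps(pageable, dev, bytes));
    printf("pinned   HtoD: %.2f GB/s\n", copyGBps(pinned, dev, bytes));

    cudaFreeHost(pinned); cudaFree(dev); free(pageable);
    return 0;
}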
GPUDirect (GPU Peer-to-Peer Communication)
• Use high-speed DMA transfers to copy data between the memories of two GPUs (a minimal sketch follows)
  • Use cudaMemcpyPeer / cudaMemcpyPeerAsync
• NUMA-style access to memory on other GPUs from within CUDA kernels
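A minimal peer-to-peer sketch (illustrative, error handling omitted; assumes a multi-GPU instance such as p2.8xlarge):

#include <cuda_runtime.h>

int main() {
    size_t bytes = 64u << 20;
    float *buf0, *buf1;

    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // also enables NUMA-style loads of GPU 1 memory from kernels on GPU 0
    }

    // Direct DMA between the two GPUs (falls back to staging through the host if P2P is unavailable)
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaSetDevice(0); cudaFree(buf0);
    cudaSetDevice(1); cudaFree(buf1);
    return 0;
}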
Thank you!
Remember to complete your evaluations!
Related Sessions
• CMP207 - High Performance Computing on AWS
• CMP312 - Powering the Next Generation of Virtual Reality with Verizon
• CMP314 - Bringing Deep Learning to the Cloud with Amazon EC2