IEEE HPEC 2014
A Multi-Tiered Optimization Framework for Heterogeneous Computing
Andrew Milluzzi, Ph.D. Student, University of Florida
Justin Richardson, Ph.D. Candidate, University of Florida
Alan George, Professor of ECE, University of Florida
Herman Lam, Assoc. Professor of ECE, University of Florida
Agenda
- Motivation
- Device Metrics
- Approach Overview
- Kernel Implementation Tier
- Device Performance Tier
- System Configuration Tier
- Case Study
- Conclusions
Motivation
- Device-metric comparisons provide only a first-order estimate of performance
- Performance can vary based on the computational kernel and the size of the data to process
- DSP, GPU, and CPU devices are rarely optimized for your particular application
- Benchmarking is an expensive process
  - Access to hardware requires computational time or purchase of a given device
  - Large non-recurring engineering (NRE) costs for developing a platform-specific application
- Lack of quantifiable data for kernel performance
  - Microbenchmarks do not always correlate to kernel performance
  - Kernel performance is not the same across all types of devices
Device Metrics

CD of Devices Studied:

Device                 Int8 (GOPS)  Int16 (GOPS)  Int32 (GOPS)  SPFP (GOPS)  DPFP (GOPS)
Intel Xeon E5-2670         998.40       499.20        249.60       332.80       166.40
Intel Xeon Phi 5110P      1074.06      1074.06       1074.06      1074.06       568.62
NVIDIA K20                1762.18      1762.18       1762.18      1762.18       587.39
NVIDIA K20X               1967.60      1967.60       1967.60      1967.60       655.87
NVIDIA K40                2145.60      2145.60       2145.60      2145.60       715.20

- Computational Density (CD): sustained operations, assuming a random stream of adds and multiplies
- Computational Density per Watt (CD/W): CD normalized by TDP
- External Memory Bandwidth (EMB): device to RAM
- Internal Memory Bandwidth (IMB): cache bandwidth
- I/O Bandwidth (IOB): EMB plus the bandwidth of all I/O ports (e.g., I2C, UART, SPI)

GOPS = giga operations per second; GB/s = gigabytes per second
Computational Density Example

NVIDIA GK110 Architecture SMX Unit
- 192 single-precision floating-point (SPFP) cores
- 64 double-precision floating-point (DPFP) cores
- Frequency of 700+ MHz

NVIDIA K40 GPU Stats
- Operating frequency of 745 MHz
- 15 SMX units

NVIDIA K40 Int8, Int16, Int32, SPFP CD: 15 x 192 x 0.745 GHz = 2145.60 GOPS
NVIDIA K40 DPFP CD: 15 x 64 x 0.745 GHz = 715.20 GOPS

Note: 1 MAC = 1 OP; 1 MAC = 2 FLOPs
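The CD arithmetic above can be written as a minimal sketch. The core counts and clock are the K40 figures given on this slide (15 SMX units, 192 SPFP and 64 DPFP cores per SMX, 745 MHz); the function itself is just units x cores per unit x clock.

```python
# Minimal sketch of the peak computational-density (CD) estimate above.
# 1 MAC is counted as 1 op, matching the slide's convention.

def computational_density_gops(units, cores_per_unit, freq_ghz):
    """Peak GOPS = units x cores-per-unit x clock (GHz)."""
    return units * cores_per_unit * freq_ghz

spfp_cd = computational_density_gops(15, 192, 0.745)  # 2145.60 GOPS
dpfp_cd = computational_density_gops(15, 64, 0.745)   # 715.20 GOPS
```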
Approach Overview

Framework Inputs
- Application kernels: a subset of kernels already existing in the benchmarking database
- Target device list: optional input; if not included, the framework assumes all possible devices

Framework Outputs
- Pareto set of best system configurations and application mappings
- Set is scoped to only the kernels of interest to the user

Framework Processing
- Kernel Implementation Tier: compare and contrast various kernel implementations for optimal performance
- Device Performance Tier: identify the most efficient kernel for a given architecture
- System Configuration Tier: leverage data from the other two tiers to determine the optimal mapping
Approach – Concept Diagram

[Diagram: user-specified application kernels and an optional list of n target devices feed the framework. The implementations of each kernel enter the Kernel Implementation Tier, which produces a Pareto set of best implementations. A Device Performance Tier for each of the n devices produces a Pareto set of the best kernels on that device. The System Configuration Tier combines these results into a Pareto set of the best devices and mappings for the specified kernels.]
Kernel Implementation Tier

- Leverage database of existing benchmarking results

Tier Function
- Compare implementations of a given computational kernel on a given device
- Identify the optimal implementation at each dataset size
- Easily expanded to new implementations of kernels

Tier Output
- Pareto set of optimally performing benchmarks
- Currently considers only performance; can be extended to include productivity in terms of NRE

Figure: DPFP 1D FFT Implementations on Intel E5-2670
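The tier's selection step can be sketched as follows: given benchmark results as (implementation, dataset size, performance) tuples, keep the best-performing implementation at each dataset size. The implementation names and numbers below are illustrative placeholders, not results from the paper.

```python
# Hypothetical sketch of the Kernel Implementation Tier's per-size
# selection. Benchmark values are made up for illustration.

benchmarks = [
    ("fftw", 1024, 12.0), ("mkl", 1024, 15.5),
    ("fftw", 4096,  9.8), ("mkl", 4096,  9.1),
]

def best_per_size(results):
    """Best (implementation, perf) at each dataset size."""
    best = {}
    for impl, size, perf in results:
        if size not in best or perf > best[size][1]:
            best[size] = (impl, perf)
    return best

print(best_per_size(benchmarks))
# {1024: ('mkl', 15.5), 4096: ('fftw', 9.8)}
```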
Device Performance Tier

- Leverage the Pareto set for each kernel on each device

Tier Function
- Combine Kernel Implementation Tier Pareto sets for a given device
- Identify the most efficient computational kernel on a given device
- Evaluate performance at each dataset size and select the highest performance

- Tier can extrapolate performance based on Device Metrics and Realizable Utilization
  - Expands the range of the framework
  - Discussed later (see Device Performance Extrapolation)

Figure: Optimal Kernel Implementation for NVIDIA K20X
Device Performance Tier (cont.)

Tier Outputs
- Pareto set of optimally performing computational kernels at various dataset sizes for a given device
- Kernels can vary with different dataset sizes
  - E.g., both 1D and 2D FFTs tend to perform better than matrix multiplication at small dataset sizes
- Tier outputs can later be expanded to include additional factors such as data transfer
- Tier's focus on device architecture enables analysis of various devices in the same family

Figure: Pareto Front of Kernel Performance for NVIDIA K20X (note: no DGEMM data point at this specific size)
Device Performance Extrapolation

- Benchmarking every computational device is implausible

CD and RU
- Device metrics enable application-independent architecture comparison
- Realizable Utilization (RU) relates real-world benchmarking results to device metrics
  - RU is typically expressed as a percent of CD
- Apply RU results for a given architecture to the CD of another device in the same family

Figure: Extension of K20X Pareto Front for NVIDIA Kepler Family
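The extrapolation idea above reduces to two ratios, sketched below: measure RU (achieved throughput divided by peak CD) on a benchmarked device, then project performance on an unbenchmarked device of the same family by applying that RU to the second device's CD. The 1200 GOPS figure is an illustrative assumption; the CD values are the K20X and K40 SPFP numbers from the Device Metrics table.

```python
# Hedged sketch of CD/RU-based performance projection.

def realizable_utilization(achieved_gops, peak_cd_gops):
    """RU = fraction of peak CD actually sustained by a kernel."""
    return achieved_gops / peak_cd_gops

def project_performance(ru, target_cd_gops):
    """Projected throughput on a same-family device with known CD."""
    return ru * target_cd_gops

# E.g., a kernel sustaining 1200 GOPS on a K20X (SPFP CD 1967.60 GOPS)...
ru = realizable_utilization(1200.0, 1967.60)
# ...projected onto a K40 (SPFP CD 2145.60 GOPS):
k40_estimate = project_performance(ru, 2145.60)
```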
System Configuration Tier

- Tier chooses devices and kernel mappings based on the Pareto set from the Device Performance Tier

Tier Inputs
- Pareto fronts from both the Kernel Implementation and Device Performance Tiers
- If a kernel is not present on a Pareto set in the Device Performance Tier, compare Kernel Implementation Tier results between devices
- Concept diagram presents Device Performance and Kernel Implementation Tiers as children of the System Configuration Tier

Tier Outputs
- Optimal mapping of application kernels onto hardware devices
- Tier produces the outputs for the framework
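The mapping step can be sketched as a per-kernel argmax over the combined Pareto data: for each required (kernel, dataset size), choose the device with the highest recorded performance. The devices match the case study, but the performance numbers are placeholders, not measured results.

```python
# Illustrative sketch of the System Configuration Tier's mapping step.
# (device, kernel, size) -> performance; values are made up.

pareto = {
    ("Xeon E5-2670", "matmul", 1024): 250.0,
    ("K40",          "matmul", 1024): 230.0,
    ("Xeon E5-2670", "svd",    4096):  90.0,
    ("K40",          "svd",    4096): 310.0,
}

def map_kernels(required):
    """Map each required (kernel, size) to its best device."""
    mapping = {}
    for kernel, size in required:
        candidates = {dev: perf for (dev, k, s), perf in pareto.items()
                      if k == kernel and s == size}
        mapping[(kernel, size)] = max(candidates, key=candidates.get)
    return mapping

print(map_kernels([("matmul", 1024), ("svd", 4096)]))
# {('matmul', 1024): 'Xeon E5-2670', ('svd', 4096): 'K40'}
```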
Direct Kernel Comparisons

- Dataset size plays a large part in application performance
- Memory access and sustained computation are large factors in observed performance
- In comparing devices, there is often a "crossover point" in observed performance
  - Example: matrix-multiplication results for the K20X, Phi, and Xeon CPU each have their own dataset size of optimal performance

Figure: Comparison of Matrix Multiplication Performance
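One way to locate such a crossover point is sketched below: scan the common dataset sizes in increasing order and report the smallest size at which one device overtakes the other. The two performance curves are invented placeholders, not the measured matrix-multiplication data.

```python
# Sketch of finding a "crossover point" between two devices for one
# kernel. Performance values (by dataset size) are illustrative only.

cpu = {256: 80.0, 1024: 150.0, 4096: 170.0}
gpu = {256: 40.0, 1024: 160.0, 4096: 900.0}

def crossover_size(a, b):
    """Smallest common dataset size where b outperforms a, else None."""
    for size in sorted(set(a) & set(b)):
        if b[size] > a[size]:
            return size
    return None

print(crossover_size(cpu, gpu))  # 1024
```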
Direct Kernel Comparisons (cont.)

- Some kernels are never optimal on any device
  - More complex kernels will never outperform some kernels
  - Limits due to memory access (cache) or computational complexity slow down some kernels
- Framework must find an optimal mapping for all input kernels
  - Leverage Kernel Implementation Tier data to fill in gaps
  - Similar approach as the Device Performance Tier, with a subset of kernels to map

Figure: Comparison of Singular Value Decomposition Performance
Case Study

Requirements
Kernel                  Dataset Size
2D FFT                  4096
Matrix Multiplication   1024
SVD                     4096

Devices
NVIDIA K20
NVIDIA K20X
NVIDIA K40
Intel Xeon E5-2670
Intel Xeon Phi 5110P

- Sample application consisting of common computational kernels
- Explore common accelerated libraries such as CUBLAS, Intel MKL, LAPACK, ATLAS, etc.
- Leverage common dataset sizes
- Assume pipelining (concurrent kernel execution)
- Test a range of devices leveraging projected performance and actual benchmarking
Case Study Results

System 1
Device              Quantity  Kernel           Dataset Size
Intel Xeon E5-2670  1         Matrix Multiply  1024
NVIDIA K40          2         SVD, 1D FFT      4096, 4096

System 2
Device              Quantity  Kernel           Dataset Size
Intel Xeon E5-2670  1         Matrix Multiply  1024
NVIDIA K20X         2         SVD, 1D FFT      4096, 4096

- NVIDIA K20X and K40 GPUs show similar performance at the given dataset sizes
- NVIDIA K40 carries a significant cost over the NVIDIA K20X relative to its performance gain
Future Work

Expand framework to consider additional factors
- Data transfer between devices in a node, or between nodes, can be a significant performance hit
- Augment framework to consider data locality

Expand devices and benchmarking
- Include additional device families
  - Intel Xeon Phi family
  - NVIDIA Maxwell family
- Grow benchmarking suite
  - Sorting
  - Image processing
  - Additional BLAS functions
Conclusions

- Determining optimal system mappings with only application-independent metrics is difficult
- Benchmarking is both expensive and time-consuming
- Limited benchmarking plus realizable utilization can enable projection of device performance
- A structured optimization framework enables transparency at critical decision points in the mapping process
- Observed kernel performance varies significantly with dataset size
- Hardware accelerators do not always provide the best kernel performance for every situation
Questions

Andrew Milluzzi
[email protected]