IEEE HPEC 2014
A Multi-Tiered Optimization Framework for Heterogeneous Computing
Andrew Milluzzi, Ph.D. Student, University of Florida
Justin Richardson, Ph.D. Candidate, University of Florida
Alan George, Professor of ECE, University of Florida
Herman Lam, Assoc. Professor of ECE, University of Florida
Agenda
- Motivation
- Device Metrics
- Approach Overview
- Kernel Implementation Tier
- Device Performance Tier
- System Configuration Tier
- Case Study
- Conclusions
Motivation
- Device-metric comparisons provide only a first-order estimate of performance
- Performance can vary based on the computational kernel and the size of the data to process
- DSP, GPU, and CPU devices are rarely optimized for your particular application
- Benchmarking is an expensive process
  - Access to hardware requires computational time or purchase of a given device
  - Large non-recurring engineering (NRE) costs for developing a platform-specific application
- Lack of quantifiable data for kernel performance
  - Microbenchmarks do not always correlate to kernel performance
  - Kernel performance is not the same across all types of devices
Device Metrics

CD of Devices Studied:

Device                 Int8 (GOPS)  Int16 (GOPS)  Int32 (GOPS)  SPFP (GOPS)  DPFP (GOPS)
Intel Xeon E5-2670         998.40       499.20        249.60       332.80       166.40
Intel Xeon Phi 5110P      1074.06      1074.06       1074.06      1074.06       568.62
NVIDIA K20                1762.18      1762.18       1762.18      1762.18       587.39
NVIDIA K20X               1967.60      1967.60       1967.60      1967.60       655.87
NVIDIA K40                2145.60      2145.60       2145.60      2145.60       715.20

- Computational Density (CD): sustained operations, assuming a random stream of adds and multiplies
- Computational Density per Watt (CD/W): CD normalized by TDP
- External Memory Bandwidth (EMB): device to RAM
- Internal Memory Bandwidth (IMB): cache bandwidth
- I/O Bandwidth (IOB): EMB plus the bandwidth of all I/O ports (e.g., I2C, UART, SPI)

GOPS = giga operations per second; GB/s = gigabytes per second
Computational Density Example

NVIDIA GK110 Architecture SMX Unit
- 192 single-precision floating-point (SPFP) cores
- 64 double-precision floating-point (DPFP) cores
- Frequency of 700+ MHz

NVIDIA K40 GPU Stats
- Operating frequency of 745 MHz
- 15 SMX units

NVIDIA K40 Int8, Int16, Int32, SPFP CD: 15 x 192 x 0.745 GHz = 2145.60 GOPS
NVIDIA K40 DPFP CD: 15 x 64 x 0.745 GHz = 715.20 GOPS

Note: 1 MAC = 1 OP; 1 MAC = 2 FLOPs
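The CD arithmetic above can be written as a minimal sketch. The core counts and clock are the K40 figures given on this slide (15 SMX units, 192 SPFP and 64 DPFP cores per SMX, 745 MHz); the function itself is just units x cores per unit x clock.

```python
# Minimal sketch of the peak computational-density (CD) estimate above.
# 1 MAC is counted as 1 op, matching the slide's convention.

def computational_density_gops(units, cores_per_unit, freq_ghz):
    """Peak GOPS = units x cores-per-unit x clock (GHz)."""
    return units * cores_per_unit * freq_ghz

spfp_cd = computational_density_gops(15, 192, 0.745)  # 2145.60 GOPS
dpfp_cd = computational_density_gops(15, 64, 0.745)   # 715.20 GOPS
```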
Approach Overview

Framework Inputs
- Application kernels: a subset of kernels already existing in the benchmarking database
- Target device list: optional input; if not included, the framework assumes all possible devices

Framework Outputs
- Pareto set of best system configurations and application mappings
- Set is scoped to only the kernels of interest to the user

Framework Processing
- Kernel Implementation Tier: compare and contrast various kernel implementations for optimal performance
- Device Performance Tier: identify the most efficient kernel for a given architecture
- System Configuration Tier: leverage data from the other two tiers to determine the optimal mapping
Approach – Concept Diagram

[Diagram: user-specified application kernels and an optional list of n target devices feed the framework. The implementations of each kernel enter the Kernel Implementation Tier, which produces a Pareto set of best implementations. A Device Performance Tier for each of the n devices produces a Pareto set of the best kernels on that device. The System Configuration Tier combines these results into a Pareto set of the best devices and mappings for the specified kernels.]
Kernel Implementation Tier

- Leverage database of existing benchmarking results

Tier Function
- Compare implementations of a given computational kernel on a given device
- Identify the optimal implementation at each dataset size
- Easily expanded to new implementations of kernels

Tier Output
- Pareto set of optimally performing benchmarks
- Currently considers only performance; can be extended to include productivity in terms of NRE

Figure: DPFP 1D FFT Implementations on Intel E5-2670
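The tier's selection step can be sketched as follows: given benchmark results as (implementation, dataset size, performance) tuples, keep the best-performing implementation at each dataset size. The implementation names and numbers below are illustrative placeholders, not results from the paper.

```python
# Hypothetical sketch of the Kernel Implementation Tier's per-size
# selection. Benchmark values are made up for illustration.

benchmarks = [
    ("fftw", 1024, 12.0), ("mkl", 1024, 15.5),
    ("fftw", 4096,  9.8), ("mkl", 4096,  9.1),
]

def best_per_size(results):
    """Best (implementation, perf) at each dataset size."""
    best = {}
    for impl, size, perf in results:
        if size not in best or perf > best[size][1]:
            best[size] = (impl, perf)
    return best

print(best_per_size(benchmarks))
# {1024: ('mkl', 15.5), 4096: ('fftw', 9.8)}
```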
Device Performance Tier

- Leverage the Pareto set for each kernel on each device

Tier Function
- Combine Kernel Implementation Tier Pareto sets for a given device
- Identify the most efficient computational kernel on a given device
- Evaluate performance at each dataset size and select the highest performance

- Tier can extrapolate performance based on Device Metrics and Realizable Utilization
  - Expands the range of the framework
  - Discussed later (see Device Performance Extrapolation)

Figure: Optimal Kernel Implementation for NVIDIA K20X
Device Performance Tier (cont.)

Tier Outputs
- Pareto set of optimally performing computational kernels at various dataset sizes for a given device
- Kernels can vary with different dataset sizes
  - E.g., both 1D and 2D FFTs tend to perform better than matrix multiplication at small dataset sizes
- Tier outputs can later be expanded to include additional factors such as data transfer
- Tier's focus on device architecture enables analysis of various devices in the same family

Figure: Pareto Front of Kernel Performance for NVIDIA K20X (note: no DGEMM data point at this specific size)
Device Performance Extrapolation

- Benchmarking every computational device is implausible

CD and RU
- Device metrics enable application-independent architecture comparison
- Realizable Utilization (RU) relates real-world benchmarking results to device metrics
  - RU is typically expressed as a percent of CD
- Apply RU results for a given architecture to the CD of another device in the same family

Figure: Extension of K20X Pareto Front for NVIDIA Kepler Family
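The extrapolation idea above reduces to two ratios, sketched below: measure RU (achieved throughput divided by peak CD) on a benchmarked device, then project performance on an unbenchmarked device of the same family by applying that RU to the second device's CD. The 1200 GOPS figure is an illustrative assumption; the CD values are the K20X and K40 SPFP numbers from the Device Metrics table.

```python
# Hedged sketch of CD/RU-based performance projection.

def realizable_utilization(achieved_gops, peak_cd_gops):
    """RU = fraction of peak CD actually sustained by a kernel."""
    return achieved_gops / peak_cd_gops

def project_performance(ru, target_cd_gops):
    """Projected throughput on a same-family device with known CD."""
    return ru * target_cd_gops

# E.g., a kernel sustaining 1200 GOPS on a K20X (SPFP CD 1967.60 GOPS)...
ru = realizable_utilization(1200.0, 1967.60)
# ...projected onto a K40 (SPFP CD 2145.60 GOPS):
k40_estimate = project_performance(ru, 2145.60)
```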
System Configuration Tier

- Tier chooses devices and kernel mappings based on the Pareto set from the Device Performance Tier

Tier Inputs
- Pareto fronts from both the Kernel Implementation and Device Performance Tiers
- If a kernel is not present on a Pareto set in the Device Performance Tier, compare Kernel Implementation Tier results between devices
- Concept diagram presents Device Performance and Kernel Implementation Tiers as children of the System Configuration Tier

Tier Outputs
- Optimal mapping of application kernels onto hardware devices
- Tier produces the outputs for the framework
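The mapping step can be sketched as a per-kernel argmax over the combined Pareto data: for each required (kernel, dataset size), choose the device with the highest recorded performance. The devices match the case study, but the performance numbers are placeholders, not measured results.

```python
# Illustrative sketch of the System Configuration Tier's mapping step.
# (device, kernel, size) -> performance; values are made up.

pareto = {
    ("Xeon E5-2670", "matmul", 1024): 250.0,
    ("K40",          "matmul", 1024): 230.0,
    ("Xeon E5-2670", "svd",    4096):  90.0,
    ("K40",          "svd",    4096): 310.0,
}

def map_kernels(required):
    """Map each required (kernel, size) to its best device."""
    mapping = {}
    for kernel, size in required:
        candidates = {dev: perf for (dev, k, s), perf in pareto.items()
                      if k == kernel and s == size}
        mapping[(kernel, size)] = max(candidates, key=candidates.get)
    return mapping

print(map_kernels([("matmul", 1024), ("svd", 4096)]))
# {('matmul', 1024): 'Xeon E5-2670', ('svd', 4096): 'K40'}
```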
Direct Kernel Comparisons

- Dataset size plays a large part in application performance
- Memory access and sustained computation are large factors in observed performance
- In comparing devices, there is often a "crossover point" in observed performance
  - Example: matrix-multiplication results for the K20X, Phi, and Xeon CPU each have their own dataset size of optimal performance

Figure: Comparison of Matrix Multiplication Performance
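One way to locate such a crossover point is sketched below: scan the common dataset sizes in increasing order and report the smallest size at which one device overtakes the other. The two performance curves are invented placeholders, not the measured matrix-multiplication data.

```python
# Sketch of finding a "crossover point" between two devices for one
# kernel. Performance values (by dataset size) are illustrative only.

cpu = {256: 80.0, 1024: 150.0, 4096: 170.0}
gpu = {256: 40.0, 1024: 160.0, 4096: 900.0}

def crossover_size(a, b):
    """Smallest common dataset size where b outperforms a, else None."""
    for size in sorted(set(a) & set(b)):
        if b[size] > a[size]:
            return size
    return None

print(crossover_size(cpu, gpu))  # 1024
```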
Direct Kernel Comparisons (cont.)

- Some kernels are never optimal on any device
  - More complex kernels will never outperform some kernels
  - Limits due to memory access (cache) or computational complexity slow down some kernels
- Framework must find an optimal mapping for all input kernels
  - Leverage Kernel Implementation Tier data to fill in gaps
  - Similar approach as the Device Performance Tier, with a subset of kernels to map

Figure: Comparison of Singular Value Decomposition Performance
Case Study

Requirements
Kernel                  Dataset Size
2D FFT                  4096
Matrix Multiplication   1024
SVD                     4096

Devices
NVIDIA K20
NVIDIA K20X
NVIDIA K40
Intel Xeon E5-2670
Intel Xeon Phi 5110P

- Sample application consisting of common computational kernels
- Explore common accelerated libraries such as CUBLAS, Intel MKL, LAPACK, ATLAS, etc.
- Leverage common dataset sizes
- Assume pipelining (concurrent kernel execution)
- Test a range of devices leveraging projected performance and actual benchmarking
Case Study Results

System 1
Device              Quantity  Kernel           Dataset Size
Intel Xeon E5-2670  1         Matrix Multiply  1024
NVIDIA K40          2         SVD, 1D FFT      4096, 4096

System 2
Device              Quantity  Kernel           Dataset Size
Intel Xeon E5-2670  1         Matrix Multiply  1024
NVIDIA K20X         2         SVD, 1D FFT      4096, 4096

- NVIDIA K20X and K40 GPUs show similar performance at the given dataset sizes
- NVIDIA K40 carries a significant cost over the NVIDIA K20X relative to its performance gain
Future Work

Expand framework to consider additional factors
- Data transfer between devices in a node, or between nodes, can be a significant performance hit
- Augment framework to consider data locality

Expand devices and benchmarking
- Include additional device families
  - Intel Xeon Phi family
  - NVIDIA Maxwell family
- Grow benchmarking suite
  - Sorting
  - Image processing
  - Additional BLAS functions
Conclusions

- Determining optimal system mappings with only application-independent metrics is difficult
- Benchmarking is both expensive and time-consuming
- Limited benchmarking plus realizable utilization can enable projection of device performance
- A structured optimization framework enables transparency at critical decision points in the mapping process
- Observed kernel performance varies significantly with dataset size
- Hardware accelerators do not always provide the best kernel performance for every situation
Questions

Andrew Milluzzi
[email protected]