Uniform Abstractions for
Heterogeneous Parallel Systems
Vikram Adve
With:
Maria Kotsifakou, Prakalp Srivastava, Adel Ejjeh, Hashim Sharif, Matt Sinclair,
Rakesh Komuravelli, Sarita Adve and Sasa Misailovic
University of Illinois at Urbana-Champaign
Supported by: NSF, SRC, DARPA, Intel
A Modern Mobile SoC
[Figure: SoC block diagram. CPUs with L1/L2 caches and vector units, a GPU, DSPs, multimedia and A/V hardware accelerators, a modem, and GPS, connected through an interconnect to main memory]
• Different parallelism models
• Incompatible memory systems
• Different hardware ISAs
And different SoCs have different combinations of such hardware!
Key to Programmability:
Common abstractions for heterogeneous parallel hardware
Interface Levels and Key Benefit
Hardware: CPUs + vector SIMD units, GPU, DSP, domain-specific accelerators, FPGA, …
Interface level | Examples | Key benefit
"Hardware" ISA | GPU ISAs, SIMD ISAs, TPU, domain-specific accelerators, … | Hardware innovation
Virtual ISA | IBM AS/400, PTX, SPIR-V, HSAIL, HPVM | Object-code portability
Language-neutral compiler IR | Delite IR, HPVM, MLIR | Compiler investment
Language-level compiler IR | Delite DSL IR, DLVM, TVM, … | Language innovation
General-purpose prog. language | CUDA, OpenCL, OpenACC, OpenMP, Python, Julia | App. performance
Domain-specific prog. language | TensorFlow, MXNet, Halide, … | App. productivity
Which Interface Levels Can Be Uniform?
Hardware: CPUs + vector SIMD units, GPU, DSP, domain-specific accelerators, FPGA, …
Interface level | Examples
"Hardware" ISA | GPU ISAs, SIMD ISAs, TPU, domain-specific accelerators, …
Virtual ISA | IBM AS/400, PTX, SPIR, HSAIL, HPVM
Language-neutral compiler IR | Delite IR, HPVM, MLIR
Language-level compiler IR | Delite DSL IR, XLA IR, TVM, …
General-purpose prog. language | CUDA, OpenCL, OpenACC, OpenMP, Python, Julia, …
Domain-specific prog. language | TensorFlow, MXNet, Halide, …
Annotations on the levels:
• Too diverse to define a uniform interface
• Also too diverse …
• Much more uniform
What Should the Interface Enable?
• Uniform parallel abstraction for diverse hardware
• Aggressive compiler optimizations
• Vendor-provided back ends
• Use of target-specific low-level libraries: MKL, cuDNN, …
• Partitioning, static scheduling, dynamic scheduling
• Application-guided error vs. energy vs. performance tradeoffs
• H/w-agnostic HLS for FPGAs, ASICs
• Application-driven software + hardware specialization
• Mechanized formal verification of designs
WITHIN REACH: 2-5 Yrs
10 Yr. GOALS
The HPVM Program Representation
• A common parallel abstraction
• Compiler IR + Virtual ISA + Run-time scheduling
Kotsifakou et al., PPOPP 2018
Goal: Programmability for Heterogeneous Parallel Systems
• Mobile phone SoCs
• Supercomputers
• Cloud with accelerators
Key to Programmability:
Common abstractions for heterogeneous parallel hardware
Heterogeneous Parallel Virtual Machine
Use HPVM for:
1. Portable object code
2. Retargetable parallel compiler IR and system
3. Run-time scheduling
HPVM: IR and Tools
[Figure: front ends (C+HPVM, Keras, TensorFlow, Halide, other DSLs) feed HPVM (virtual ISA, runtime scheduler); translators target CPUs + vector SIMD units, GPU, DSP, domain-specific accelerators, FPGA, …]
HPVM Abstraction of Parallel Computation
Dataflow graph with side effects. Each node is either hierarchical (a nested graph) or a leaf, which may hold vector code:
    VA = load <4 x float>* A
    VB = load <4 x float>* B
    …
    VC = fmul <4 x float> VA, VB
• Graph nodes – coarse-grain or fine-grain computational tasks
• Graph edges – explicit data transfer between nodes
• Loads and stores – implicit communication via shared memory
• Hierarchical – multiple levels of nested parallelism
Node instantiation
[Figure: a node marked [N] in the static dataflow graph expands into instances 1, 2, …, N in the dynamic dataflow graph]
HPVM Abstractions
✓ Graph structure – coarse-grain task parallelism, streams, pipelines
✓ Graph hierarchy – nested parallelism
✓ Node instantiation – captures SPMD-style data parallelism
✓ Vector instructions in leaf nodes – fine-grain vector parallelism
✓ Supports high-level optimizations
✓ Captures FPGAs, some semi-custom hardware
N different parallelism models → single unified parallelism model!
E.g., Edge Detection in Images
✓ Pipelined (task) parallelism with streaming input images
✓ Medium-grain data parallelism within pipeline stages
✓ Fine-grain data parallelism in most stages
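The graph hierarchy and node instantiation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not HPVM's actual data structures or API: the `Node` class, the `expand` function, and the node names (`root`, `stage`, `blur`) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical dataflow-graph node: a leaf has a body, an internal node has children."""
    name: str
    instances: int = 1                              # dynamic instances of this static node
    body: Optional[Callable[[int], object]] = None  # leaf computation, given the instance index
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return self.body is not None


def expand(node: Node, prefix: str = "") -> List[str]:
    """Expand the static graph into the path names of all dynamic leaf instances."""
    paths: List[str] = []
    for i in range(node.instances):
        path = f"{prefix}/{node.name}[{i}]"
        if node.is_leaf():
            paths.append(path)
        else:
            # Every instance of an internal node contains its own copy of the child graph.
            for child in node.children:
                paths.extend(expand(child, path))
    return paths


# Two levels of nested parallelism: 2 stage instances, each holding 4 leaf instances.
blur = Node("blur", instances=4, body=lambda i: i)
stage = Node("stage", instances=2, children=[blur])
root = Node("root", children=[stage])
dynamic = expand(root)
```

Here `dynamic` lists the eight leaf instances of the dynamic graph, which is the "[N] → 1, 2, …, N" expansion applied at each level of the hierarchy.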
HPVM Compiler Optimizations
Complex optimizations as simple graph transforms
• Graph node “tiling” for memory hierarchy
• Graph node merging
• Graph pipelining
• Graph partitioning and mapping
• (Future) Graph-based loop optimizations
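As an illustration of "complex optimizations as simple graph transforms", here is a minimal Python sketch of node merging on a toy graph. It assumes a graph stored as a dictionary of single-input functions plus a list of edges; `merge_nodes` and the node names are hypothetical and are not taken from the HPVM code base.

```python
from typing import Callable, Dict, List, Tuple

Nodes = Dict[str, Callable[[float], float]]
Edges = List[Tuple[str, str]]


def merge_nodes(nodes: Nodes, edges: Edges, producer: str, consumer: str) -> Tuple[Nodes, Edges]:
    """Fuse `consumer` into `producer` when they are linked by exactly one private edge."""
    outs = [e for e in edges if e[0] == producer]
    ins = [e for e in edges if e[1] == consumer]
    if outs != [(producer, consumer)] or ins != [(producer, consumer)]:
        raise ValueError("merge requires a single producer -> consumer edge")

    f, g = nodes[producer], nodes[consumer]
    merged = f"{producer}+{consumer}"

    # The merged node computes the composition, so the intermediate value
    # no longer has to travel along a graph edge.
    new_nodes = {n: fn for n, fn in nodes.items() if n not in (producer, consumer)}
    new_nodes[merged] = lambda x: g(f(x))

    # Drop the internal edge and redirect all remaining edges to the merged node.
    new_edges: Edges = []
    for src, dst in edges:
        if (src, dst) == (producer, consumer):
            continue
        src = merged if src in (producer, consumer) else src
        dst = merged if dst in (producer, consumer) else dst
        new_edges.append((src, dst))
    return new_nodes, new_edges


nodes = {
    "scale": lambda x: 2.0 * x,
    "offset": lambda x: x + 1.0,
    "square": lambda x: x * x,
}
edges = [("scale", "offset"), ("offset", "square")]
merged_nodes, merged_edges = merge_nodes(nodes, edges, "scale", "offset")
```

The transform touches only graph structure: one edge disappears, two nodes become one, and the rest of the graph is unchanged.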
Code Generation Strategy – Overview
[Figure: compilation flow split between a developer site and a user site. Source program → front end → .bc (with HPVM intrinsics) → HPVM graph optimizer → code-gen, bottom-up on the graph hierarchy]
Code-gen back ends:
• HPVM-to-PTX: PTX binary + host code (x86 binary) → nVidia OpenCL runtime → nVidia GeForce GTX 680 GPU
• HPVM-to-SPIR-to-AVX: SPIR binary + host code (x86 binary) → Intel OpenCL runtime → Intel Xeon E5 core i7 + AVX
• HPVM-to-x86: host code (x86 binary) → P-threads → Intel Xeon E5 core i7
Key:
1. Any node → any device
2. Reuse vendor back ends
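A bottom-up walk over the graph hierarchy can be sketched as a post-order traversal. The Python below is a rough illustration under stated assumptions: `GraphNode`, its per-node `target` hint, the `BACK_ENDS` table, and the choice to emit host code for every internal node are hypothetical simplifications; only the back-end labels come from the overview above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GraphNode:
    """Hypothetical node of the graph hierarchy with a per-node target hint."""
    name: str
    target: str = "cpu"
    children: List["GraphNode"] = field(default_factory=list)


# Assumed mapping from target hint to the translator label used for leaf nodes.
BACK_ENDS = {
    "gpu": "HPVM-to-PTX",
    "avx": "HPVM-to-SPIR-to-AVX",
    "cpu": "HPVM-to-x86",
}


def codegen(node: GraphNode, plan: Optional[List[Tuple[str, str]]] = None) -> List[Tuple[str, str]]:
    """Post-order traversal: children are translated before the node that contains them."""
    if plan is None:
        plan = []
    for child in node.children:
        codegen(child, plan)
    if node.children:
        # Assumption: an internal node becomes host code that launches its children.
        plan.append((node.name, "host code (x86)"))
    else:
        plan.append((node.name, BACK_ENDS[node.target]))
    return plan


root = GraphNode("root", children=[
    GraphNode("a", target="gpu"),
    GraphNode("inner", children=[GraphNode("b", target="avx")]),
])
plan = codegen(root)
```

Because each leaf picks its back end independently, changing one `target` hint remaps that node to a different device without touching the rest of the graph.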
Evaluation: Summary
Abstraction and object-code portability
➢A single HPVM program performs close to (or slightly worse than) separately hand-tuned code on both GPU and AVX
➢HPVM performance is limited by vendor-specific back ends, not by HPVM abstractions
Flexible scheduling
➢HPVM enables highly flexible mappings to diverse h/w
Ongoing Research (1)
ApproxHPVM for accuracy-aware optimization
• App developers only express end-to-end accuracy goals
• Domain-specific strategy:
➢Extend HPVM with tensor domain ops
➢Express hardware-independent accuracy metrics in IR
• Algorithmic approximations as well as system-level
• Portable virtual ISA after hardware-agnostic autotuning
• Dynamic optimization to adapt to run-time conditions
Sharif et al., OOPSLA 2019
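The idea of tuning against an end-to-end accuracy goal can be sketched as a search over per-operation approximation choices. The Python below is a toy brute-force version, not the ApproxHPVM tuner: `tune`, `toy_evaluate`, the `fp32`/`fp16` knobs, and the loss and cost numbers are all made up for illustration.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

Config = Tuple[str, ...]


def tune(
    knobs_per_op: Sequence[Sequence[str]],
    evaluate: Callable[[Config], Tuple[int, int]],
    min_accuracy: int,
) -> Optional[Tuple[Config, int]]:
    """Return the cheapest configuration whose end-to-end accuracy meets the goal."""
    best: Optional[Tuple[Config, int]] = None
    for config in product(*knobs_per_op):
        accuracy, cost = evaluate(config)
        if accuracy >= min_accuracy and (best is None or cost < best[1]):
            best = (config, cost)
    return best


# Made-up model: accuracy points lost per op when it runs in fp16, and per-knob cost.
FP16_LOSS = [1, 3]
COST = {"fp32": 2, "fp16": 1}


def toy_evaluate(config: Config) -> Tuple[int, int]:
    accuracy = 100 - sum(loss for knob, loss in zip(config, FP16_LOSS) if knob == "fp16")
    cost = sum(COST[knob] for knob in config)
    return accuracy, cost


best = tune([["fp32", "fp16"], ["fp32", "fp16"]], toy_evaluate, min_accuracy=98)
```

The developer supplies only `min_accuracy`; which operations get approximated is decided by the search. A real tuner would need something smarter than exhaustive enumeration once the number of operations grows.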
Ongoing Research (2)
Hardware-agnostic programming of FPGAs
• FPGAs are becoming widely available in data centers
• Application users lack hardware expertise
[Figure: HPVM-OpenCL compilation flow for FPGAs. HPVM virtual object code → transformations → code gen → kernel (.cl) → AOC compiler. Intermediate compilation produces an optimization report (analyze report step); full compilation produces the bitstream (.aocx)]
Goal: Use compiler optimizations to achieve high-perf. FPGA designs from hardware-agnostic code
Ongoing Research (3)
Integrate ApproxHPVM with Jasmine Toolflow
• Improve hw-agnostic tuning to match hw-specific
• Partition application + iterate through design space
• Explore approximate hw, sw mechanisms
DSSoC: Hardware Design Space Exploration
[Figure: ontology learning. Ontology discovery using graph analytics & static analysis (HPVM) groups operations such as Conv 1D, Conv 2D, convolution, MatMul, ReLU, … into ontologies 1 … n, using a "training" set of workloads (WL1, WL2, WL3, …, WLn; CV) and a "test" set with a workload mix (38%, 41%, 3%). Jasmine explores DSSoC candidates that combine CPUs, GPUs, and accelerators, e.g., a CNN design space]
[Figure: DSSoC design exploration, with compiler, training, and design flows connecting:
• DSSoC applications (IBM, Columbia, Harvard, Illinois)
• HPVM
• Jasmine: design space exploration (Harvard)
• ESP: SoC design framework (Columbia)
Artifacts exchanged: hierarchical DFG, dynamic DDG, hierarchical static DDG, performance results, accelerator Pareto curves, NOC architectures, design constraints, physical interface]
Ongoing Research (4)
Domain-specific programming of edge systems
• Xilinx Zynq, NVIDIA Jetson Nano, Intel Movidius …
• Example: ARM (+ GPU) (+ FPGA) (+ DNN)
• Users: Crop scientists, civil engrs, medical researchers…
• Can we enable non-expert users to program complex
heterogeneous SoCs?
➢Very high-level DSLs
➢Automatic partitioning, approximation, mapping, code generation
➢Automatic run-time scheduling, performance analysis
Summary
HPVM: portability + performance for heterogeneous systems
ApproxHPVM: easy access to approximation techniques
Long-term goals:
➢Application-driven hardware design needs uniform interfaces!
➢Rich compiler infrastructure for DSLs
➢Easy programming of energy-efficient edge compute systems
Questions?