
Developing an OpenMP Offloading Runtime for UVM-Capable GPUs

Hashim Sharif and Vikram Adve

University of Illinois at Urbana-Champaign

Hal Finkel

Argonne National Laboratory

Lingda Li

Brookhaven National Laboratory

Heterogeneous Programming

"Many Core" CPUs · GPUs

➢ Heterogeneous programming allows application subcomponents to be tuned to their specific computational needs
➢ OpenMP 4.0/4.5 supports heterogeneous programming

CUDA Unified Virtual Memory

New technology!

➢ UVM provides a single memory space accessible by all GPUs and CPUs in the system
➢ Pascal introduces a page migration engine that automatically migrates pages on data access
➢ Pascal UVM is limited only by the overall system memory size

CUDA UVM Example

CUDA Code (explicit data copies):

  int main(int argc, char* argv[]) {
    double *x, *y, *z;            // host arrays (allocation/initialization omitted)
    double *X, *Y, *Z;            // device arrays
    size_t bytes = N * sizeof(double);

    // Allocate memory for each vector on the GPU
    cudaMalloc(&X, bytes);
    cudaMalloc(&Y, bytes);
    cudaMalloc(&Z, bytes);

    // Copy host vectors to the device (cudaMemcpy takes dst, src, size, kind)
    cudaMemcpy(X, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(Y, y, bytes, cudaMemcpyHostToDevice);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(X, Y, Z);

    // Copy the vector-add result back to the host
    cudaMemcpy(z, Z, bytes, cudaMemcpyDeviceToHost);
  }

CUDA UVM Code (no explicit copies):

  int main(int argc, char* argv[]) {
    double *X, *Y, *Z;            // managed arrays, visible to host and device
    size_t bytes = N * sizeof(double);

    // Allocate memory in the unified virtual space;
    // pages migrate automatically on access
    cudaMallocManaged(&X, bytes);
    cudaMallocManaged(&Y, bytes);
    cudaMallocManaged(&Z, bytes);

    // NOTE: No need for explicit memory copies
    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(X, Y, Z);

    // Wait for the kernel to finish before the host touches the managed data
    cudaDeviceSynchronize();
  }

OpenMP Offloading

[Diagram: Processor X and Processor Y, each with its own cache, share data in RAM. An accelerator with its own device memory is attached to the host over a bus.]

➢ OpenMP threads share memory
➢ OpenMP 4 includes directives for mapping data to device memory

OpenMP Target Directives - Saxpy

  #pragma omp target data map(to: x[0:N])
  #pragma omp target data map(tofrom: y[0:N])
  {
    #pragma omp target
    #pragma omp teams
    #pragma omp distribute parallel for
    for (long int i = 0; i < N; i++)
      y[i] = a*x[i] + y[i];
  }

➢ The target directive directs target (device) compilation
➢ The map clauses specify data movement between host and device
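For context, here is a minimal self-contained version of the saxpy example. Only the pragmas and the loop come from the slide; the driver code (N, the arrays' initialization, the check) and the build command are assumptions for illustration.

  // Assumed build command: clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.c
  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
    long int N = 1 << 20;
    float a = 2.0f;
    float *x = malloc(N * sizeof(float));
    float *y = malloc(N * sizeof(float));
    for (long int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    #pragma omp target data map(to: x[0:N])
    #pragma omp target data map(tofrom: y[0:N])
    {
      #pragma omp target
      #pragma omp teams
      #pragma omp distribute parallel for
      for (long int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
    }

    printf("y[0] = %f (expected 4.0)\n", y[0]);
    free(x);
    free(y);
    return 0;
  }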

OpenMP Offloading Runtime - libomptarget

[Diagram: Clang/LLVM compiles the application into a binary containing host code for the non-offloading constructs and device code for the target constructs.]

➢ Host OpenMP runtime (libomp): serves OpenMP runtime calls for non-target regions
➢ Offloading runtime (libomptarget): device-agnostic OpenMP runtime implementing support for target offload regions
➢ CUDA device plugin (libomptarget): implements device-specific operations, e.g. data copies and kernel execution, on top of the CUDA Driver API
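For orientation, a libomptarget device plugin exports a C interface of __tgt_rtl_* entry points. A few representative ones are sketched below; the names are real, but the signatures are paraphrased from memory of that era's plugin interface and should be checked against omptargetplugin.h for the LLVM version in use.

  // Representative libomptarget plugin entry points (signatures approximate).
  int32_t __tgt_rtl_number_of_devices(void);
  int32_t __tgt_rtl_init_device(int32_t device_id);
  void   *__tgt_rtl_data_alloc(int32_t device_id, int64_t size, void *hst_ptr);
  int32_t __tgt_rtl_data_submit(int32_t device_id, void *tgt_ptr,
                                void *hst_ptr, int64_t size);   // host -> device
  int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *hst_ptr,
                                  void *tgt_ptr, int64_t size); // device -> host
  int32_t __tgt_rtl_run_target_region(int32_t device_id, void *tgt_entry_ptr,
                                      void **tgt_args, ptrdiff_t *tgt_offsets,
                                      int32_t arg_num);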

An OpenMP Framework for UVM

Goals:
➢ Improved performance
➢ Performance portability

Design Considerations:
➢ How can we leverage OpenMP target constructs?
➢ Does performance scale with large datasets?

An OpenMP Framework for UVM

➢Developed as LLVM Transformations➢Extracts important application-specific information

➢e.g data access probability

➢Our implementation extends the libomptarget library➢More specifically, we extend the CUDA device plugin➢Developed a UVM-compatible offloading plugin➢ Includes UVM-specific performance optimizations

OpenMP Runtime

Compiler Technology

Optimizations - Prefetching

  #pragma omp target data map(to: x[0:N])
  #pragma omp target data map(tofrom: y[0:N])
  {
    #pragma omp target
    #pragma omp teams
    #pragma omp distribute parallel for
    for (long int i = 0; i < N; i++)
      y[i] = a*x[i] + y[i];
  }

Without prefetching, the runtime simply executes the region:

  ExecuteTargetRegion()

Data pages are then fetched on demand, which incurs page-fault processing overhead.

With prefetching, the runtime issues prefetches around the region:

  PrefetchToDevice(x)
  PrefetchToDevice(y)
  ExecuteTargetRegion()
  PrefetchToHost(y)

Prefetching the data pages that will be used avoids the page-fault processing overhead.
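The slides do not show how the runtime's prefetch calls are implemented. On Pascal-class hardware they plausibly map onto CUDA's managed-memory prefetch API, roughly as in this sketch: PrefetchToDevice/PrefetchToHost are the talk's names, cudaMemPrefetchAsync is the real CUDA 8+ API, and the wrapper bodies are an assumption.

  #include <cuda_runtime.h>

  // Sketch: migrate a managed allocation's pages to the GPU ahead of a launch.
  static void PrefetchToDevice(void *ptr, size_t bytes, int device,
                               cudaStream_t stream) {
    cudaMemPrefetchAsync(ptr, bytes, device, stream);
  }

  // Sketch: migrate pages back to host memory after the region completes.
  static void PrefetchToHost(void *ptr, size_t bytes, cudaStream_t stream) {
    cudaMemPrefetchAsync(ptr, bytes, cudaCpuDeviceId, stream);
  }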

Prefetching Workflow - Compiler

Goal: the compiler extracts the access probabilities associated with the OpenMP-mapped data.

App source → (build) → LLVM IR
→ Profile the application
→ LLVM pass: extract access probabilities from the profile
→ LLVM pass: add access probabilities to the IR
→ Transformed IR → (build) → Binary

Build

ExecutableOffloading Runtime -

libomptargetAPI calls including access probabilities

Prefetching Workflow - Runtime

ExecutableOffloading Runtime -

libomptargetAPI calls including access probabilities

CUDA Device Plugin-

libomptarget

Prefetching Workflow - Runtime

ExecutableOffloading Runtime -

libomptargetAPI calls including access probabilities

CUDA Device Plugin-

libomptarget

Prefetching Workflow - Runtime

Uses a cost-model to

determine the profitability of

prefetching

ExecutableOffloading Runtime -

libomptarget

Device Memory

RAM

API calls including access probabilities

Prefetch data with high access probability

Data Prefetching

CUDA Device Plugin-

libomptarget

Prefetching Workflow - Runtime

Uses a cost-model to

determine the profitability of

prefetching
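The talk does not give the cost model itself. One plausible shape, purely as an illustration, is to prefetch when the expected demand-paging cost exceeds the cost of one bulk prefetch; every constant and the formula below are assumptions, not the authors' model.

  #include <stdbool.h>
  #include <stddef.h>

  // Illustrative cost model (all numbers assumed, not from the talk).
  bool should_prefetch(double access_prob, size_t bytes) {
    const double page_bytes  = 64.0 * 1024.0; // UVM migrates data in pages
    const double fault_us    = 20.0;          // assumed cost per GPU page fault
    const double prefetch_bw = 40e9;          // assumed link bandwidth, bytes/s
    double fault_cost    = access_prob * (bytes / page_bytes) * fault_us;
    double prefetch_cost = bytes / prefetch_bw * 1e6; // microseconds
    return fault_cost > prefetch_cost;
  }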

Memory Oversubscription

➢ With device memory oversubscription, naïve data prefetching leads to memory thrashing
➢ Solution: pipeline partial prefetches with partial compute, so each chunk fits in device memory (see the sketch below)

[Diagram: instead of one large Prefetch followed by one large Compute, the runtime alternates Prefetch (partial) → Compute (partial) → Prefetch (partial) → Compute (partial). Now everything fits.]
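The slides do not specify how the chunk count is chosen. A simple heuristic based on free device memory might look like this; cudaMemGetInfo is the real CUDA API, while the footprint argument and the ceiling division are assumptions, not the talk's getChunkCount() implementation.

  #include <cuda_runtime.h>

  // Illustrative: pick enough chunks that one chunk's data fits in free
  // device memory.
  int get_chunk_count(size_t region_footprint_bytes) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    if (region_footprint_bytes <= free_bytes)
      return 1; // everything fits; no pipelining needed
    return (int)((region_footprint_bytes + free_bytes - 1) / free_bytes);
  }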

Pipelining Workflow - Compiler

App source → (build) → LLVM IR
→ LLVM pass: extract OpenMP loop bounds (required for chunking the iteration space)
→ LLVM pass: extract memory access expressions (required for chunking the data prefetches)
→ LLVM pass: add OpenMP region info
→ Transformed IR → (build) → Binary
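To make the two "required for chunking" notes concrete, here is an illustrative computation of one chunk's iteration range and the matching prefetch range for a linear access like y[i]. All names are hypothetical; the metadata the passes actually emit is not shown in the talk.

  #include <stddef.h>

  // Illustrative: derive chunk c's iteration range from the extracted loop
  // bounds [0, N), and the byte range to prefetch for a linear access y[i].
  void chunk_ranges(long N, int chunk_count, int c,
                    long *lo, long *hi, size_t *byte_off, size_t *byte_len) {
    long chunk_size = (N + chunk_count - 1) / chunk_count; // ceil(N / chunks)
    *lo = (long)c * chunk_size;                            // first iteration
    *hi = (*lo + chunk_size < N) ? *lo + chunk_size : N;   // one past the last
    *byte_off = (size_t)*lo * sizeof(double);              // prefetch offset
    *byte_len = (size_t)(*hi - *lo) * sizeof(double);      // prefetch length
  }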

Pipelining Workflow - Runtime

The application binary calls into the runtime to invoke the target offload region; the runtime API call (tgt_run_target()) includes the OpenMP region info. If device memory is being oversubscribed, the device plugin chunks the kernel into multiple smaller kernels:

  tgt_run_target()
    chunkCount = getChunkCount()
    for (i = 0; i < chunkCount; i++) {
      prefetchChunk()
      executeChunk()
      synchronizeTask()
    }
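Putting the pieces together, the plugin's per-chunk loop might look roughly like this in CUDA terms. cudaMemPrefetchAsync and cudaStreamSynchronize are real CUDA APIs; the kernel, its launch configuration, and the loop structure are assumptions mirroring the prefetchChunk/executeChunk/synchronizeTask pseudocode above.

  #include <cuda_runtime.h>

  // Hypothetical device kernel: saxpy over iterations [lo, hi) with a = 2.0.
  __global__ void saxpy_chunk(double *x, double *y, long lo, long hi) {
    long i = lo + (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < hi)
      y[i] = 2.0 * x[i] + y[i];
  }

  // Illustrative pipelined execution over managed arrays x and y of N doubles.
  void run_chunked(double *x, double *y, long N, int chunk_count,
                   int device, cudaStream_t stream) {
    long chunk_size = (N + chunk_count - 1) / chunk_count;
    for (int c = 0; c < chunk_count; c++) {
      long lo = (long)c * chunk_size;
      long hi = (lo + chunk_size < N) ? lo + chunk_size : N;
      size_t bytes = (size_t)(hi - lo) * sizeof(double);
      // prefetchChunk(): migrate only this chunk's pages to the device
      cudaMemPrefetchAsync(x + lo, bytes, device, stream);
      cudaMemPrefetchAsync(y + lo, bytes, device, stream);
      // executeChunk(): launch the kernel over this chunk's iterations
      int threads = 256;
      int blocks = (int)((hi - lo + threads - 1) / threads);
      saxpy_chunk<<<blocks, threads, 0, stream>>>(x, y, lo, hi);
      // synchronizeTask(): finish before the next chunk reuses device memory
      cudaStreamSynchronize(stream);
    }
  }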

Experiments

➢ Do the optimizations improve performance for computational kernels?
➢ Do the optimizations scale with increasing dataset sizes?

➢ Baseline: demand paging
➢ Comparisons: prefetching, pipelined prefetching
➢ Benchmarks: Saxpy, Kmeans (Rodinia)

Experimental Setup

Hardware:
➢ SummitDev cluster at ORNL
➢ Pascal P100 GPU cards with 16 GB device memory
➢ NVLink interconnect

Software:
➢ Clang, LLVM (clang-ykt project)
➢ libomp, libomptarget (clang-ykt project)

Results - SAXPY

[Chart: execution time (%) vs. log(N) elements, for log(N) = 27 to 31, comparing Paging, Prefetch, and Pipelining+Prefetching; the largest problem sizes do not fit in device memory.]

Kmeans – 10 Iterations

[Chart: execution time (%) vs. log(N) points, for log(N) = 27 to 31, comparing Paging, Prefetch, and Pipelining+Prefetching; the largest problem sizes do not fit in device memory.]

Kmeans – 20 Iterations

[Chart: execution time (%) vs. log(N) points, for log(N) = 27 to 31, comparing Paging, Prefetch, and Pipelining; times reach up to 200%.]

Here demand paging performs better: the hardware learns to use a more suitable eviction policy.

Summary

➢ Developed an OpenMP framework for UVM-capable GPUs
➢ Developed optimizations to reduce page-fault processing overhead:
  ➢ Prefetching
  ➢ Pipelining
➢ The optimizations enable reasonable improvements on benchmarks
➢ Future work: developing more sophisticated pipelining strategies

Thanks & Questions?

Acknowledgements

• The LLVM community (including our many contributing vendors)
• The SOLLVE project (part of the Exascale Computing Project)
• OLCF and ORNL for access to the SummitDev system
• ALCF, ANL, and DOE

References

• https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/

• http://on-demand.gputechconf.com/gtc/2016/presentation/s6510-jeff-larkin-targeting-gpus-openmp.pdf

• https://drive.google.com/file/d/0B-jX56_FbGKRM21sYlNYVnB4eFk/view

• https://www.nersc.gov/assets/Uploads/XE62011OpenMP.pdf

• http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf

• https://llvm-hpc3-workshop.github.io/slides/Bertolli.pdf
