
Page 1:

Developing an OpenMP Offloading Runtime for UVM-Capable GPUs

Hashim Sharif and Vikram Adve

University of Illinois at Urbana-Champaign

Hal Finkel

Argonne National Laboratory

Lingda Li

Brookhaven National Laboratory

Pages 2-3: Heterogeneous Programming

“Many Core” CPUs, GPUs

➢ Heterogeneous programming allows each application subcomponent to run on, and be optimized for, the hardware that best matches its computational needs

➢ OpenMP 4.0/4.5 supports heterogeneous programming

Pages 4-6: CUDA Unified Virtual Memory

New technology!

➢ UVM provides a single memory space accessible by all GPUs and CPUs in the system

➢ Pascal introduces a page migration engine that automatically migrates pages on data access

➢ On Pascal, UVM allocations are limited only by the overall system memory size, not by device memory

Pages 7-9: CUDA UVM Example

CUDA Code (explicit data copies):

int main(int argc, char* argv[]) {
    // Allocate memory for each vector on the GPU
    cudaMalloc(&X, bytes);
    cudaMalloc(&Y, bytes);
    cudaMalloc(&Z, bytes);

    // Copy host vectors to the device
    cudaMemcpy(X, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(Y, y, bytes, cudaMemcpyHostToDevice);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(X, Y, Z);

    // Copy the vector-add result back to the host
    cudaMemcpy(z, Z, bytes, cudaMemcpyDeviceToHost);
}

CUDA UVM Code (no explicit copies):

int main(int argc, char* argv[]) {
    // Allocate memory in the unified virtual address space
    cudaMallocManaged(&X, bytes);
    cudaMallocManaged(&Y, bytes);
    cudaMallocManaged(&Z, bytes);

    // NOTE: no explicit memory copies are needed

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(X, Y, Z);
}

Pages 10-13: OpenMP Offloading

[Diagram: Processor X and Processor Y, each with its own cache (Cache X, Cache Y), access shared data in RAM; an accelerator with its own device memory is attached over a bus.]

➢ OpenMP threads share memory

➢ OpenMP 4 includes directives for mapping data to device memory

Pages 14-16: OpenMP Target Directives - Saxpy

#pragma omp target data map(to: x[0:N])
#pragma omp target data map(tofrom: y[0:N])
{
    #pragma omp target
    #pragma omp teams
    #pragma omp distribute parallel for
    for (long int i = 0; i < N; i++)
        y[i] = a*x[i] + y[i];
}

➢ The target directive directs compilation of the region for the device

➢ The map clauses specify data movement between host and device
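A usage note (not from the slides): with a Clang/LLVM build that supports NVPTX offloading, code like the Saxpy above is typically built with something along the lines of clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.c -o saxpy, which produces a single binary containing both the host code and the device code for the target regions.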

Pages 17-20: OpenMP Offloading Runtime - libomptarget

[Diagram: Clang/LLVM compiles the application into a single binary containing the host code (non-offloading constructs and target constructs) and the device code.]

➢ Host OpenMP runtime (libomp): serves OpenMP runtime calls for the non-target regions

➢ Offloading runtime (libomptarget): device-agnostic OpenMP runtime implementing support for target offload regions

➢ CUDA device plugin (libomptarget): implements device-specific operations, e.g., data copies and kernel execution, on top of the CUDA Driver API
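To make the device-plugin layer more concrete, the declarations below paraphrase the generic libomptarget plugin interface that a CUDA (or any other) plugin implements; the exact signatures vary across LLVM versions, so treat this as an approximation and consult omptargetplugin.h for the authoritative ones.

#include <stddef.h>
#include <stdint.h>

/* Approximate shape of the libomptarget device-plugin entry points. */
int32_t __tgt_rtl_number_of_devices(void);              /* devices managed by this plugin */
int32_t __tgt_rtl_init_device(int32_t device_id);       /* per-device initialization */
void   *__tgt_rtl_data_alloc(int32_t device_id, int64_t size, void *hst_ptr);                   /* device allocation */
int32_t __tgt_rtl_data_submit(int32_t device_id, void *tgt_ptr, void *hst_ptr, int64_t size);   /* host-to-device copy */
int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *hst_ptr, void *tgt_ptr, int64_t size); /* device-to-host copy */
int32_t __tgt_rtl_data_delete(int32_t device_id, void *tgt_ptr);                                /* device deallocation */
int32_t __tgt_rtl_run_target_region(int32_t device_id, void *tgt_entry_ptr,
                                    void **tgt_args, ptrdiff_t *tgt_offsets, int32_t arg_num);  /* kernel launch */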

Page 21: An OpenMP Framework for UVM

Goals
➢ Improved performance
➢ Performance portability

Design Considerations
➢ How can we leverage OpenMP target constructs?
➢ Does performance scale with large datasets?

Page 22: An OpenMP Framework for UVM

Compiler Technology
➢ Developed as LLVM transformations
➢ Extracts important application-specific information, e.g., data access probability

OpenMP Runtime
➢ Our implementation extends the libomptarget library
➢ More specifically, we extend the CUDA device plugin
➢ Developed a UVM-compatible offloading plugin
➢ Includes UVM-specific performance optimizations

Pages 23-26: Optimizations - Prefetching

#pragma omp target data map(to: x[0:N])
#pragma omp target data map(tofrom: y[0:N])
{
    #pragma omp target
    #pragma omp teams
    #pragma omp distribute parallel for
    for (long int i = 0; i < N; i++)
        y[i] = a*x[i] + y[i];
}

Without prefetching, the runtime simply executes the target region:

    ExecuteTargetRegion()

Data pages are then fetched on demand, which incurs page-fault processing overhead.

With prefetching, the runtime issues prefetches for the mapped data around the target region:

    PrefetchToDevice(x)
    PrefetchToDevice(y)
    ExecuteTargetRegion()
    PrefetchToHost(y)

Prefetching the data pages that will be used helps avoid the page-fault processing overhead.
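As a point of reference (a minimal sketch, not the runtime's actual implementation), the PrefetchToDevice/PrefetchToHost steps above correspond naturally to CUDA's managed-memory prefetch API; the helper names are illustrative:

#include <cuda_runtime.h>

// Prefetch a managed (cudaMallocManaged) allocation to a GPU, or back to the
// CPU, using cudaMemPrefetchAsync. 'device' is the CUDA device id.
static void prefetchToDevice(const void *ptr, size_t bytes, int device, cudaStream_t s) {
    cudaMemPrefetchAsync(ptr, bytes, device, s);           // migrate pages to the GPU
}

static void prefetchToHost(const void *ptr, size_t bytes, cudaStream_t s) {
    cudaMemPrefetchAsync(ptr, bytes, cudaCpuDeviceId, s);  // migrate pages back to the CPU
}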

Pages 27-30: Prefetching Workflow - Compiler

Goal: the compiler extracts the access probabilities associated with the OpenMP-mapped data.

[Compiler pipeline: App source → build LLVM IR → profile the application → LLVM pass: extract access probabilities → LLVM pass: add access probabilities to the IR → transformed IR → build → binary]
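To make the "LLVM pass" boxes concrete, here is an illustrative new-pass-manager skeleton (my own sketch, not the authors' pass) that locates libomptarget offload entry points in the IR; the actual passes that extract and attach access probabilities are not shown in these slides, and API details vary across LLVM versions.

// Build as a pass plugin and run with:
//   opt -load-pass-plugin=./FindTargetRegions.so -passes=find-target-regions in.ll
#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Passes/PassPlugin.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
struct FindTargetRegions : PassInfoMixin<FindTargetRegions> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    for (Function &F : M)
      for (BasicBlock &BB : F)
        for (Instruction &I : BB)
          if (auto *CB = dyn_cast<CallBase>(&I))
            if (Function *Callee = CB->getCalledFunction())
              if (Callee->getName().startswith("__tgt_target"))
                errs() << "offload call " << Callee->getName()
                       << " in function " << F.getName() << "\n";
    return PreservedAnalyses::all(); // analysis only, IR unchanged
  }
};
} // namespace

// Standard plugin registration boilerplate.
extern "C" LLVM_ATTRIBUTE_WEAK PassPluginLibraryInfo llvmGetPassPluginInfo() {
  return {LLVM_PLUGIN_API_VERSION, "FindTargetRegions", "0.1",
          [](PassBuilder &PB) {
            PB.registerPipelineParsingCallback(
                [](StringRef Name, ModulePassManager &MPM,
                   ArrayRef<PassBuilder::PipelineElement>) {
                  if (Name != "find-target-regions")
                    return false;
                  MPM.addPass(FindTargetRegions());
                  return true;
                });
          }};
}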

Pages 31-34: Prefetching Workflow - Runtime

[Runtime flow: the executable makes API calls, which now include the access probabilities, into the offloading runtime (libomptarget); libomptarget passes them on to the CUDA device plugin.]

➢ The CUDA device plugin uses a cost model to determine the profitability of prefetching

➢ Data with a high access probability is prefetched from RAM into device memory
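The slides do not spell the cost model out. Purely as an illustration of the kind of check such a model could make (the helper name, page granularity, and inputs below are assumptions, not the runtime's), one could compare the expected demand-fault cost against the cost of the bulk prefetch:

#include <cstddef>

// Entirely illustrative profitability check: prefetch only if the expected
// cost of servicing demand faults for this mapping exceeds the cost of
// prefetching it up front. Fault cost, bandwidth, and page granularity are
// inputs a real runtime would have to measure or model.
bool shouldPrefetch(double accessProbability, size_t bytes,
                    double faultCostSecPerPage, double prefetchBytesPerSec,
                    size_t pageBytes = 64 * 1024) {
    double pages = static_cast<double>(bytes) / pageBytes;
    double expectedFaultCost = accessProbability * pages * faultCostSecPerPage;
    double prefetchCost = static_cast<double>(bytes) / prefetchBytesPerSec;
    return expectedFaultCost > prefetchCost;
}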

Pages 35-36: Memory Oversubscription

➢ With device memory oversubscription, naïve data prefetching leads to memory thrashing:

    Prefetch → Compute

➢ Pipelining partial prefetches with partial compute avoids this:

    Prefetch (partial) → Compute (partial) → Prefetch (partial) → Compute (partial) → ...
    (now everything fits)

Pages 37-43: Pipelining Workflow - Compiler

[Compiler pipeline: App source → build LLVM IR → LLVM pass: extract OpenMP loop bounds → LLVM pass: extract memory access expressions → LLVM pass: add OpenMP region info → transformed IR → build → binary]

➢ The loop bounds are required for chunking the iteration space

➢ The memory access expressions are required for chunking the data prefetches

Pages 44-52: Pipelining Workflow - Runtime

➢ The application binary calls into libomptarget to invoke the target offload region; the runtime API call (tgt_run_target()) now carries the OpenMP region info

➢ If device memory would be oversubscribed, the device plugin chunks the kernel into multiple smaller kernels:

    chunkCount = getChunkCount()
    for (i = 0; i < chunkCount; i++) {
        prefetchChunk()
        executeChunk()
        synchronizeTask()
    }
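To make the chunking loop concrete, here is a self-contained sketch written directly against the CUDA runtime (the vecAdd kernel, chunk count, and problem size are illustrative assumptions; this is not the runtime's code). Each iteration prefetches one chunk of the managed arrays and launches the kernel over just that chunk; with two streams, the next chunk's prefetch can, in principle, overlap the current chunk's compute.

// Compile with, e.g.: nvcc -arch=sm_60 pipeline.cu -o pipeline
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *x, const float *y, float *z, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        z[i] = x[i] + y[i];
}

int main() {
    const size_t N = 1ull << 28;   // ~1 GiB per array; raise to oversubscribe device memory
    const int numChunks = 8;       // illustrative; a real runtime derives this from memory pressure
    int device = 0;
    cudaSetDevice(device);

    float *x, *y, *z;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    cudaMallocManaged(&z, N * sizeof(float));
    for (size_t i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    const size_t chunk = (N + numChunks - 1) / numChunks;
    for (int c = 0; c < numChunks; ++c) {
        size_t off = (size_t)c * chunk;
        size_t len = (off + chunk <= N) ? chunk : N - off;
        cudaStream_t s = streams[c % 2];
        // prefetchChunk(): migrate only this chunk's pages to the device
        cudaMemPrefetchAsync(x + off, len * sizeof(float), device, s);
        cudaMemPrefetchAsync(y + off, len * sizeof(float), device, s);
        // executeChunk(): launch the kernel over this chunk only
        int blocks = (int)((len + 255) / 256);
        vecAdd<<<blocks, 256, 0, s>>>(x + off, y + off, z + off, len);
    }
    cudaDeviceSynchronize();       // synchronizeTask(): wait for all chunks
    printf("z[0] = %f (expected 3.0)\n", z[0]);

    cudaFree(x); cudaFree(y); cudaFree(z);
    return 0;
}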

Pages 53-54: Experiments

➢ Do the optimizations help improve performance for computational kernels?

➢ Do the optimizations scale with increasing dataset sizes?

Page 55: Experiments

➢ Baseline: demand paging

➢ Comparisons: prefetching, pipelined prefetching

➢ Benchmarks: Saxpy, Kmeans (Rodinia)

Page 56: Experimental Setup

Hardware
➢ SummitDev cluster at ORNL
➢ NVIDIA Pascal P100 GPUs with 16 GB device memory
➢ NVLink interconnect

Software
➢ Clang/LLVM (clang-ykt project)
➢ libomp, libomptarget (clang-ykt project)

Pages 57-58: Results - SAXPY

[Bar chart: time (%) vs. log(N) elements (27-31), comparing Paging, Prefetch, and Pipelining+Prefetching. An annotation marks the problem sizes that do not fit in device memory.]

Pages 59-60: Kmeans – 10 Iterations

[Bar chart: time (%) vs. log(N) points (27-31), comparing Paging, Prefetch, and Pipelining+Prefetching. An annotation marks the dataset sizes that do not fit in device memory.]

Pages 61-62: Kmeans – 20 Iterations

[Bar chart: time (%) vs. log(N) points (27-31), comparing Paging, Prefetch, and Pipelining.]

Here demand paging performs better: the hardware learns to use a more suitable eviction policy.

Page 63: Summary

➢ Developed an OpenMP framework for UVM-capable GPUs

➢ Developed optimizations to reduce page-fault processing overhead
    ➢ Prefetching
    ➢ Pipelining

➢ The optimizations enable reasonable improvements on benchmarks

➢ Future work: developing more sophisticated pipelining strategies

Page 64:

Thanks & Questions?

Page 65:

• The LLVM community (including our many contributing vendors)

• The SOLLVE project (part of the Exascale Computing Project)

• OLCF, ORNL for access to the SummitDev system.

• ALCF, ANL, and DOE

Acknowledgements

Page 66:

References

• https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/

• http://on-demand.gputechconf.com/gtc/2016/presentation/s6510-jeff-larkin-targeting-gpus-openmp.pdf

• https://drive.google.com/file/d/0B-jX56_FbGKRM21sYlNYVnB4eFk/view

• https://www.nersc.gov/assets/Uploads/XE62011OpenMP.pdf

• http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf

• https://llvm-hpc3-workshop.github.io/slides/Bertolli.pdf