
Developing an OpenMP Offloading Runtime for UVM-Capable GPUs

Hashim Sharif and Vikram Adve

University of Illinois at Urbana-Champaign

Hal Finkel

Argonne National Laboratory

Lingda Li

Brookhaven National Laboratory

Heterogeneous Programming

"Many Core" CPUs · GPUs

➢ Heterogeneous programming allows application subcomponents to be tuned to their specific computational needs
➢ OpenMP 4.0/4.5 supports heterogeneous programming

CUDA Unified Virtual Memory

New technology!

➢ UVM provides a single memory space accessible by all GPUs and CPUs in the system
➢ Pascal introduces a page migration engine that automatically migrates pages on data access
➢ Pascal UVM is limited only by the overall system memory size

CUDA UVM Example

CUDA Code (explicit data copies):

  int main(int argc, char* argv[]) {
    double *x, *y, *z;            // host arrays (allocation/initialization omitted)
    double *X, *Y, *Z;            // device arrays
    size_t bytes = N * sizeof(double);

    // Allocate memory for each vector on the GPU
    cudaMalloc(&X, bytes);
    cudaMalloc(&Y, bytes);
    cudaMalloc(&Z, bytes);

    // Copy host vectors to the device (cudaMemcpy takes dst, src, size, kind)
    cudaMemcpy(X, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(Y, y, bytes, cudaMemcpyHostToDevice);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(X, Y, Z);

    // Copy the vector-add result back to the host
    cudaMemcpy(z, Z, bytes, cudaMemcpyDeviceToHost);
  }

CUDA UVM Code (no explicit copies):

  int main(int argc, char* argv[]) {
    double *X, *Y, *Z;            // managed arrays, visible to host and device
    size_t bytes = N * sizeof(double);

    // Allocate memory in the unified virtual space;
    // pages migrate automatically on access
    cudaMallocManaged(&X, bytes);
    cudaMallocManaged(&Y, bytes);
    cudaMallocManaged(&Z, bytes);

    // NOTE: No need for explicit memory copies
    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(X, Y, Z);

    // Wait for the kernel to finish before the host touches the managed data
    cudaDeviceSynchronize();
  }

OpenMP Offloading

[Diagram: Processor X and Processor Y, each with its own cache, share data in RAM. An accelerator with its own device memory is attached to the host over a bus.]

➢ OpenMP threads share memory
➢ OpenMP 4 includes directives for mapping data to device memory

OpenMP Target Directives - Saxpy

  #pragma omp target data map(to: x[0:N])
  #pragma omp target data map(tofrom: y[0:N])
  {
    #pragma omp target
    #pragma omp teams
    #pragma omp distribute parallel for
    for (long int i = 0; i < N; i++)
      y[i] = a*x[i] + y[i];
  }

➢ The target directive directs target (device) compilation
➢ The map clauses specify data movement between host and device
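For context, here is a minimal self-contained version of the saxpy example. Only the pragmas and the loop come from the slide; the driver code (N, the arrays' initialization, the check) and the build command are assumptions for illustration.

  // Assumed build command: clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.c
  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
    long int N = 1 << 20;
    float a = 2.0f;
    float *x = malloc(N * sizeof(float));
    float *y = malloc(N * sizeof(float));
    for (long int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    #pragma omp target data map(to: x[0:N])
    #pragma omp target data map(tofrom: y[0:N])
    {
      #pragma omp target
      #pragma omp teams
      #pragma omp distribute parallel for
      for (long int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
    }

    printf("y[0] = %f (expected 4.0)\n", y[0]);
    free(x);
    free(y);
    return 0;
  }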

OpenMP Offloading Runtime - libomptarget

[Diagram: Clang/LLVM compiles the application into a binary containing host code for the non-offloading constructs and device code for the target constructs.]

➢ Host OpenMP runtime (libomp): serves OpenMP runtime calls for non-target regions
➢ Offloading runtime (libomptarget): device-agnostic OpenMP runtime implementing support for target offload regions
➢ CUDA device plugin (libomptarget): implements device-specific operations, e.g. data copies and kernel execution, on top of the CUDA Driver API
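For orientation, a libomptarget device plugin exports a C interface of __tgt_rtl_* entry points. A few representative ones are sketched below; the names are real, but the signatures are paraphrased from memory of that era's plugin interface and should be checked against omptargetplugin.h for the LLVM version in use.

  // Representative libomptarget plugin entry points (signatures approximate).
  int32_t __tgt_rtl_number_of_devices(void);
  int32_t __tgt_rtl_init_device(int32_t device_id);
  void   *__tgt_rtl_data_alloc(int32_t device_id, int64_t size, void *hst_ptr);
  int32_t __tgt_rtl_data_submit(int32_t device_id, void *tgt_ptr,
                                void *hst_ptr, int64_t size);   // host -> device
  int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *hst_ptr,
                                  void *tgt_ptr, int64_t size); // device -> host
  int32_t __tgt_rtl_run_target_region(int32_t device_id, void *tgt_entry_ptr,
                                      void **tgt_args, ptrdiff_t *tgt_offsets,
                                      int32_t arg_num);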

An OpenMP Framework for UVM

Goals:
➢ Improved performance
➢ Performance portability

Design Considerations:
➢ How can we leverage OpenMP target constructs?
➢ Does performance scale with large datasets?

An OpenMP Framework for UVM

➢Developed as LLVM Transformations➢Extracts important application-specific information

➢e.g data access probability

➢Our implementation extends the libomptarget library➢More specifically, we extend the CUDA device plugin➢Developed a UVM-compatible offloading plugin➢ Includes UVM-specific performance optimizations

OpenMP Runtime

Compiler Technology

Optimizations - Prefetching

  #pragma omp target data map(to: x[0:N])
  #pragma omp target data map(tofrom: y[0:N])
  {
    #pragma omp target
    #pragma omp teams
    #pragma omp distribute parallel for
    for (long int i = 0; i < N; i++)
      y[i] = a*x[i] + y[i];
  }

Without prefetching, the runtime simply executes the region:

  ExecuteTargetRegion()

Data pages are then fetched on demand, which incurs page-fault processing overhead.

With prefetching, the runtime issues prefetches around the region:

  PrefetchToDevice(x)
  PrefetchToDevice(y)
  ExecuteTargetRegion()
  PrefetchToHost(y)

Prefetching the data pages that will be used avoids the page-fault processing overhead.
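The slides do not show how the runtime's prefetch calls are implemented. On Pascal-class hardware they plausibly map onto CUDA's managed-memory prefetch API, roughly as in this sketch: PrefetchToDevice/PrefetchToHost are the talk's names, cudaMemPrefetchAsync is the real CUDA 8+ API, and the wrapper bodies are an assumption.

  #include <cuda_runtime.h>

  // Sketch: migrate a managed allocation's pages to the GPU ahead of a launch.
  static void PrefetchToDevice(void *ptr, size_t bytes, int device,
                               cudaStream_t stream) {
    cudaMemPrefetchAsync(ptr, bytes, device, stream);
  }

  // Sketch: migrate pages back to host memory after the region completes.
  static void PrefetchToHost(void *ptr, size_t bytes, cudaStream_t stream) {
    cudaMemPrefetchAsync(ptr, bytes, cudaCpuDeviceId, stream);
  }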

Prefetching Workflow - Compiler

Goal: the compiler extracts the access probabilities associated with the OpenMP-mapped data.

App source → (build) → LLVM IR
→ Profile the application
→ LLVM pass: extract access probabilities from the profile
→ LLVM pass: add access probabilities to the IR
→ Transformed IR → (build) → Binary

Build

ExecutableOffloading Runtime -

libomptargetAPI calls including access probabilities

Prefetching Workflow - Runtime

ExecutableOffloading Runtime -

libomptargetAPI calls including access probabilities

CUDA Device Plugin-

libomptarget

Prefetching Workflow - Runtime

ExecutableOffloading Runtime -

libomptargetAPI calls including access probabilities

CUDA Device Plugin-

libomptarget

Prefetching Workflow - Runtime

Uses a cost-model to

determine the profitability of

prefetching

ExecutableOffloading Runtime -

libomptarget

Device Memory

RAM

API calls including access probabilities

Prefetch data with high access probability

Data Prefetching

CUDA Device Plugin-

libomptarget

Prefetching Workflow - Runtime

Uses a cost-model to

determine the profitability of

prefetching
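The talk does not give the cost model itself. One plausible shape, purely as an illustration, is to prefetch when the expected demand-paging cost exceeds the cost of one bulk prefetch; every constant and the formula below are assumptions, not the authors' model.

  #include <stdbool.h>
  #include <stddef.h>

  // Illustrative cost model (all numbers assumed, not from the talk).
  bool should_prefetch(double access_prob, size_t bytes) {
    const double page_bytes  = 64.0 * 1024.0; // UVM migrates data in pages
    const double fault_us    = 20.0;          // assumed cost per GPU page fault
    const double prefetch_bw = 40e9;          // assumed link bandwidth, bytes/s
    double fault_cost    = access_prob * (bytes / page_bytes) * fault_us;
    double prefetch_cost = bytes / prefetch_bw * 1e6; // microseconds
    return fault_cost > prefetch_cost;
  }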

Memory Oversubscription

➢ With device memory oversubscription, naïve data prefetching leads to memory thrashing
➢ Solution: pipeline partial prefetches with partial compute, so each chunk fits in device memory (see the sketch below)

[Diagram: instead of one large Prefetch followed by one large Compute, the runtime alternates Prefetch (partial) → Compute (partial) → Prefetch (partial) → Compute (partial). Now everything fits.]
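The slides do not specify how the chunk count is chosen. A simple heuristic based on free device memory might look like this; cudaMemGetInfo is the real CUDA API, while the footprint argument and the ceiling division are assumptions, not the talk's getChunkCount() implementation.

  #include <cuda_runtime.h>

  // Illustrative: pick enough chunks that one chunk's data fits in free
  // device memory.
  int get_chunk_count(size_t region_footprint_bytes) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    if (region_footprint_bytes <= free_bytes)
      return 1; // everything fits; no pipelining needed
    return (int)((region_footprint_bytes + free_bytes - 1) / free_bytes);
  }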

Pipelining Workflow - Compiler

App source → (build) → LLVM IR
→ LLVM pass: extract OpenMP loop bounds (required for chunking the iteration space)
→ LLVM pass: extract memory access expressions (required for chunking the data prefetches)
→ LLVM pass: add OpenMP region info
→ Transformed IR → (build) → Binary
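To make the two "required for chunking" notes concrete, here is an illustrative computation of one chunk's iteration range and the matching prefetch range for a linear access like y[i]. All names are hypothetical; the metadata the passes actually emit is not shown in the talk.

  #include <stddef.h>

  // Illustrative: derive chunk c's iteration range from the extracted loop
  // bounds [0, N), and the byte range to prefetch for a linear access y[i].
  void chunk_ranges(long N, int chunk_count, int c,
                    long *lo, long *hi, size_t *byte_off, size_t *byte_len) {
    long chunk_size = (N + chunk_count - 1) / chunk_count; // ceil(N / chunks)
    *lo = (long)c * chunk_size;                            // first iteration
    *hi = (*lo + chunk_size < N) ? *lo + chunk_size : N;   // one past the last
    *byte_off = (size_t)*lo * sizeof(double);              // prefetch offset
    *byte_len = (size_t)(*hi - *lo) * sizeof(double);      // prefetch length
  }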

Pipelining Workflow - Runtime

The application binary calls into the runtime to invoke the target offload region; the runtime API call (tgt_run_target()) includes the OpenMP region info. If device memory is being oversubscribed, the device plugin chunks the kernel into multiple smaller kernels:

  tgt_run_target()
    chunkCount = getChunkCount()
    for (i = 0; i < chunkCount; i++) {
      prefetchChunk()
      executeChunk()
      synchronizeTask()
    }
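Putting the pieces together, the plugin's per-chunk loop might look roughly like this in CUDA terms. cudaMemPrefetchAsync and cudaStreamSynchronize are real CUDA APIs; the kernel, its launch configuration, and the loop structure are assumptions mirroring the prefetchChunk/executeChunk/synchronizeTask pseudocode above.

  #include <cuda_runtime.h>

  // Hypothetical device kernel: saxpy over iterations [lo, hi) with a = 2.0.
  __global__ void saxpy_chunk(double *x, double *y, long lo, long hi) {
    long i = lo + (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < hi)
      y[i] = 2.0 * x[i] + y[i];
  }

  // Illustrative pipelined execution over managed arrays x and y of N doubles.
  void run_chunked(double *x, double *y, long N, int chunk_count,
                   int device, cudaStream_t stream) {
    long chunk_size = (N + chunk_count - 1) / chunk_count;
    for (int c = 0; c < chunk_count; c++) {
      long lo = (long)c * chunk_size;
      long hi = (lo + chunk_size < N) ? lo + chunk_size : N;
      size_t bytes = (size_t)(hi - lo) * sizeof(double);
      // prefetchChunk(): migrate only this chunk's pages to the device
      cudaMemPrefetchAsync(x + lo, bytes, device, stream);
      cudaMemPrefetchAsync(y + lo, bytes, device, stream);
      // executeChunk(): launch the kernel over this chunk's iterations
      int threads = 256;
      int blocks = (int)((hi - lo + threads - 1) / threads);
      saxpy_chunk<<<blocks, threads, 0, stream>>>(x, y, lo, hi);
      // synchronizeTask(): finish before the next chunk reuses device memory
      cudaStreamSynchronize(stream);
    }
  }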

Experiments

➢ Do the optimizations improve performance for computational kernels?
➢ Do the optimizations scale with increasing dataset sizes?

➢ Baseline: demand paging
➢ Comparisons: prefetching, pipelined prefetching
➢ Benchmarks: Saxpy, Kmeans (Rodinia)

Experimental Setup

Hardware:
➢ SummitDev cluster at ORNL
➢ Pascal P100 GPU cards with 16 GB device memory
➢ NVLink interconnect

Software:
➢ Clang, LLVM (clang-ykt project)
➢ libomp, libomptarget (clang-ykt project)

Results - SAXPY

[Chart: execution time (%) vs. log(N) elements, for log(N) = 27 to 31, comparing Paging, Prefetch, and Pipelining+Prefetching; the largest problem sizes do not fit in device memory.]

Kmeans – 10 Iterations

[Chart: execution time (%) vs. log(N) points, for log(N) = 27 to 31, comparing Paging, Prefetch, and Pipelining+Prefetching; the largest problem sizes do not fit in device memory.]

Kmeans – 20 Iterations

[Chart: execution time (%) vs. log(N) points, for log(N) = 27 to 31, comparing Paging, Prefetch, and Pipelining; times reach up to 200%.]

Here demand paging performs better: the hardware learns to use a more suitable eviction policy.

Summary

➢ Developed an OpenMP framework for UVM-capable GPUs
➢ Developed optimizations to reduce page-fault processing overhead:
  ➢ Prefetching
  ➢ Pipelining
➢ The optimizations enable reasonable improvements on benchmarks
➢ Future work: developing more sophisticated pipelining strategies

Thanks & Questions?

Acknowledgements

• The LLVM community (including our many contributing vendors)
• The SOLLVE project (part of the Exascale Computing Project)
• OLCF and ORNL for access to the SummitDev system
• ALCF, ANL, and DOE

References

• https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/

• http://on-demand.gputechconf.com/gtc/2016/presentation/s6510-jeff-larkin-targeting-gpus-openmp.pdf

• https://drive.google.com/file/d/0B-jX56_FbGKRM21sYlNYVnB4eFk/view

• https://www.nersc.gov/assets/Uploads/XE62011OpenMP.pdf

• http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf

• https://llvm-hpc3-workshop.github.io/slides/Bertolli.pdf
