a gpu kernel transactionization scheme for preemptive priority...

A GPU Kernel Transactionization Schemefor Preemptive Priority Scheduling

Hyeonsu Lee, Jaehun Roh, Euiseong Seo

School of Software

Sungkyunkwan University

• Introduction

• Motivation

• Related Work

• Background

• Our Approach• Design• Implementation

• Evaluation

• Conclusion

IndexContents

2

• Introduction

• Motivation

• Related Work

• Background


• Evaluation

• Conclusion

Introduction

● GPU is specialized in parallel computing workload○ Embedded systems can earn substantial benefit from GPU

computing

3

Embedded

Voice Recognition

Face Detection

Health Analysis

NVIDIA’s Machine Learning Benchmark

13.6× Faster

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

• Introduction

• Motivation

• Related Work

• Background


• Evaluation

• Conclusion4

Delayed

Motivation

● Example: Autonomous vehicle control systems using GPU

5

W..wait!

GPU Embedded System

Object Detection

Run

High PriorityEvasion Path Finder

Crash


Motivation

● Priority inversion problem on GPU○ No GPU Preemption HW and Interfaces

6

Priority : Kernel B > Kernel A

CPU

GPU

Kernel BDeadline

Kernel A Kernel B

Time


Motivation

● Priority inversion problem on GPU○ No GPU Preemption HW and Interfaces

7

Priority : Kernel B > Kernel A

CPU

GPU

Kernel BDeadline

Kernel A Kernel B

Time


Embedded Systems NeedPreemptive Priority Scheduling of GPU Kernels

• Introduction

• Motivation

• Related Work

• Background


• Evaluation

• Conclusion8

Related Work - Hardware

● HW extension for preemptive GPU scheduling• Experimental architecture was proposed and evaluated with

simulation• Nvidia commercialized GPUs with HW preemption

● Limitations of HW GPU preemption• Up to 100μs context switching time from massive amount of state

saving/restoring • Increased complexity in HW design and thus cost to manufacture

9

(But, they have not disclosed API for its, yet)


● Kernel slicing approach

● Thread-block-level scheduling approach

High PrioritySub Kernel

Related Work - Software

10

Short Delay

SubKernel



Determined through static analysis

BlockTask

Preemption?

Repeat

No

ThreadBlock

YesSwitching to the

highest-priority kernel

Delayed

We propose a software preemption solution that• Immediately aborts currently running kernel • Re-executes aborted kernels later


• Introduction

• Motivation

• Related Work

• Background


• Evaluation

• Conclusion11

Background – Shared System Memory

● GPU memory management in embedded systems

12

KernelBuffer 1

KernelBuffer 2

GPU Virtual Address SpaceCPU Virtual Address Space

Host-sideBuffer

GPUIOMMUCPUMMU

EntryShared

EntryShared


CPU can directly access GPU memory

Background – Shared Kernel Buffers

● Producer-consumer relationships through sharing kernel buffers● Blind abortion and re-execution may corrupt data delivered from

its producer

13

KernelB

CPU(ready)

SharedKernelBuffer

GPU(running)

GPU MemoryTask

KernelA

Input Data

Result Data


Producer Consumer Relationship

Write

Re-useAbortion

• Introduction

• Motivation

• Related Work

• Background


• Evaluation

• Conclusion14

Design - Preemption with Transactionized Kernels

● Our design is structured in

15GPU

SchedulingQueue

Running Queue

Low PriorityKernel

AbortionCommand

Low PriorityKernel


GPU kernel abortion : To immediately abort the low priority kernel



16GPU

SchedulingQueue

Running Queue

Kernel BufferSnapshot

Kernel BufferRoll-back


GPU memory context saving/restoring : For re-execution of a preempted kernel



17GPU

SchedulingQueue

Running Queue

Dirty DataSnapshot

Inconsistency of a kernel buffer


Transaction of GPU kernel execution :To prevent simultaneously access shared kernel buffer

Design - Transactionization Scheme

● GPU kernel transaction structure for re-execution of kernel○ Procedures from snapshotting to kernel finish

18

GPU JobScheduling

SchedulingProcess

GPU KernelTransaction Job Running Job DoneSnapshotting

Scheduling Request

Kernel

GPU KernelTransaction


Design – Example of Transactionization Scheme

● Pseudo-preemptive priority scheduling flow of TransactionizedGPU Kernels

19

GPU JobScheduler

SnapshotModule

SystemMemory

GPU

KernelA

KernelA

Schedule

Submit

KernelB

Submit

Running

LaunchKernel

A

Running

KernelB

Launch

KernelA

AbortAbort

Command

Done & Schedule

KernelB

KernelA Re-schedule

Rolling backSnapshot Mem.GPU Mem.

Running

KernelA

Launch

Priority : Kernel A < Kernel B

SnapshottingGPU Mem.Snapshot Mem.


• Introduction

• Motivation

• Related Work

• Background


• Evaluation

• Conclusion20

Implementation – Snapshot Mechanism

● Implementing snapshot mechanism through GPGPU programming model

21

Task

GPUDeviceDriver

Snapshot TargetDetermination SnapshottingSnapshot

Allocation

Kernel InstanceCreation

KernelLaunch

Kernel BufferAllocation


Implementation – Snapshot Memory Management

● Allocating snapshot memory when allocating a kernel buffer● Tracking kernel buffer addresses thorough metadata of kernel

when launching kernel

22

Root

Buffer Buffer

Buffer Buffer Buffer Buffer

JobJobKernel

Instance

KernelBuffer

SnapshotMemory

Allocation with same size

Kernel Buffer Address List

Snapshot TargetAddress List

Binary DataDecoding &Generating

Address pointing


Implementation – Snapshot Process

● Snapshot process is implemented by kernel threads

23

while(1){if( ! Is_next_kernel )

interruptible_sleep()

if( re-execution )roll-back();

elsesnapshot();

submit_to_GPU();}

Snapshot Thread Pseudo Code

Snapshot Cancel Range

Low PriorityTask

High PriorityTask

SnapshotThread-1

SnapshotThread-2

Kernel 1.Launch(wake-up)4.Launch

2.SnapshotCancel

3.Switching

Kernel


Implementation – Preemption Mechanism

24

● Pre-preemption process○ Minimizing process of aborting scheduled kernels

● Post-preemption process○ Recovering aborted kernels to the scheduling queue○ Running a kernel launch and a post-preemption process

K Scheduling &Preemption

SchedulingQueue Slot Kernel

ExecutionSnapshotting InterruptHandler

De-queueSlot

SnapshotCancel

GPUAbortion Done

P_RUNP_SNAPP_SLOT

Pre-Preemption Process

Post-Preemption Process

STATUS CODE ->


• Introduction

• Motivation

• Related Work

• Background


• Evaluation

• Conclusion25

Evaluation Setup (1)

● Target device environment

● Rodinia benchmark suite 3.1

26A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Evaluation Setup (2)

● Micro benchmark (M-Bench)○ We made simple GPGPU workload to be used for lower priority

workload than Rodinia benchmark’s workloads○ M-Bench running in background disturb the high priority task

● Priority configuration○ [E] : Emergency priority

§ This priority is the highest priority and does not require snapshotting○ [C] : Common priority

§ This priority is a higher priority than M-Bench○ [N] : Normal environment

§ This is result of evaluation in vanilla kernel


Performance of High Priority Task

● Performance of a high priority task○ Degrading performance up to 186.9× depending on low priority tasks

on normal environment○ Guaranteeing consistent performance regardless of background load

on Transactionization scheme

28

0.5 1

1.5 2

2.5 3

3.5

NN BP FP BF KM 3D LUD NW GE MC CFD

Nor

mal

ized

Tur

naro

und

Tim

e

20 40 60 80

100 186.91xM-Bench[N]1xM-Bench[C]1xM-Bench[E]

2xM-Bench[N]2xM-Bench[C]2xM-Bench[E]


Scheduling Delay on Original Environment

● Histogram of scheduling delay of high priority kernels○ Scheduling delay depending on the running kernel length in normal

environment● Our approach suppressed the launch delay of high-priority kernels

within 18μs

29

!

"

"!

"!!

"!!!

"!!!!

"! #! $! %! &! '! (! )! *! "!! ""! "#!!"#

$%&'()

**"&&%+*%,

+,-./01234 5.167 89:;

!

"

"!

"!!

"!!!

"!!!!

"!!!!!

"! "# "$ "% "& #! ## #$ #% #& '! '# '$ '% '& $! $# $$ $% $& (!!"#

$%&'()

**"&&%+*%,

!"#$%#&' ()*+, -./0

Normal environment

99.9% percentile within 18μs

Transactionization scheme

Increased delay due to kernel lengthA GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Analysis of Preemption Delay

● There is no delay caused by eviction in the preemption process since it exceeds the majority of 18μs

30

5 10 15 20 25 30 35 40 45 50

NN BP PF BFS KM 3D LUD NW GS MC CFS

Del

ayed

Tim

e (µ

s)

Launch Pre-preemption

Preemption overheadwithin 10μs

> Eviction Delay (18μs)


Snapshotting Overhead

● Our scheme has significant overhead due to snapshotting○ Advantages of performance improvement due to launch without

snapshotting at emergency priority

31

0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4

1.5

NN BP FP BF KM 3D LUD NW GE MC CFD

5.13

Aver

age

Nor

mal

ized

Turn

arou

nd T

ime

Common Priority Emergency Priority

Using invalidated kernel bufferSnapshottingOverhead


Discussion

● Reducing snapshot overhead through high bandwidth memory (HBM)○ Exynos 8890 provides up to 51 GB/s memory bandwidth while the

device used in our evaluation presents 28.7 GB/s ○ Snapshot process is more efficient when using HBM because it

implements memory copying using ARM's vector instruction set

● Selective snapshot process using compile-time hints○ GPU kernel is idempotent sometimes○ If enabling to detect idempotent kernels, then selective snapshot

processes can be used


• Introduction

• Motivation

• Related Work

• Background


• Evaluation

• Conclusion33

Conclusion

● A GPU kernel transactionization for preemptive scheduling○ We proposed a GPU kernel transactionization scheme that enables

immediate abortion and re-execution of a GPU kernel○ Based on the transactionization scheme, we developed a pseudo-

preemptive GPU kernel scheduler

● Evaluation results showed that the scheduling delay for the urgent task was reduced to approximately 18μs

● Source code is available to public at ○ https://github.com/Hyunsu-Lee/psched_gpu


Thank youQ&A

35

Backup Slide (1)

● Multiple priority evaluation


1 2 3 4

NN BP PF BFS KM 3D LUD NW GS MC CFD

Norm

alize

d Tu

rnar

ound

Tim

e

6 8

10 12 14 16 18

Priority [-10]Priority [ 0]Priority [ 10]

a gpu kernel transactionization scheme for preemptive priority...

Documents