a gpu kernel transactionization scheme for preemptive priority...

36
A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling Hyeonsu Lee, Jaehun Roh, Euiseong Seo School of Software Sungkyunkwan University

Upload: others

Post on 24-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

A GPU Kernel Transactionization Schemefor Preemptive Priority Scheduling

Hyeonsu Lee, Jaehun Roh, Euiseong Seo

School of Software

Sungkyunkwan University

Page 2: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

• Introduction

• Motivation

• Related Work

• Background

• Our Approach• Design• Implementation

• Evaluation

• Conclusion

IndexContents

2

• Introduction

• Motivation

• Related Work

• Background

• Our Approach• Design• Implementation

• Evaluation

• Conclusion

Page 3: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Introduction

● GPU is specialized in parallel computing workload○ Embedded systems can earn substantial benefit from GPU

computing

3

Embedded

Voice Recognition

Face Detection

Health Analysis

NVIDIA’s Machine Learning Benchmark

13.6× Faster

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 4: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

• Introduction

• Motivation

• Related Work

• Background

• Our Approach• Design• Implementation

• Evaluation

• Conclusion4

Page 5: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Delayed

Motivation

● Example: Autonomous vehicle control systems using GPU

5

W..wait!

GPU Embedded System

Object Detection

Run

High PriorityEvasion Path Finder

Crash

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 6: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Motivation

● Priority inversion problem on GPU○ No GPU Preemption HW and Interfaces

6

Priority : Kernel B > Kernel A

CPU

GPU

Kernel BDeadline

Kernel A Kernel B

Time

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 7: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Motivation

● Priority inversion problem on GPU○ No GPU Preemption HW and Interfaces

7

Priority : Kernel B > Kernel A

CPU

GPU

Kernel BDeadline

Kernel A Kernel B

Time

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Embedded Systems NeedPreemptive Priority Scheduling of GPU Kernels

Page 8: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

• Introduction

• Motivation

• Related Work

• Background

• Our Approach• Design• Implementation

• Evaluation

• Conclusion8

Page 9: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Related Work - Hardware

● HW extension for preemptive GPU scheduling• Experimental architecture was proposed and evaluated with

simulation• Nvidia commercialized GPUs with HW preemption

● Limitations of HW GPU preemption• Up to 100μs context switching time from massive amount of state

saving/restoring • Increased complexity in HW design and thus cost to manufacture

9

(But, they have not disclosed API for its, yet)

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 10: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

● Kernel slicing approach

● Thread-block-level scheduling approach

High PrioritySub Kernel

Related Work - Software

10

Short Delay

SubKernel

High PrioritySub Kernel

High PrioritySub Kernel

Determined through static analysis

BlockTask

Preemption?

Repeat

No

ThreadBlock

YesSwitching to the

highest-priority kernel

Delayed

We propose a software preemption solution that• Immediately aborts currently running kernel • Re-executes aborted kernels later

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 11: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

• Introduction

• Motivation

• Related Work

• Background

• Our Approach• Design• Implementation

• Evaluation

• Conclusion11

Page 12: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Background – Shared System Memory

● GPU memory management in embedded systems

12

KernelBuffer 1

KernelBuffer 2

GPU Virtual Address SpaceCPU Virtual Address Space

Host-sideBuffer

GPUIOMMUCPUMMU

EntryShared

EntryShared

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

CPU can directly access GPU memory

Page 13: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Background – Shared Kernel Buffers

● Producer-consumer relationships through sharing kernel buffers● Blind abortion and re-execution may corrupt data delivered from

its producer

13

KernelB

CPU(ready)

SharedKernelBuffer

GPU(running)

GPU MemoryTask

KernelA

Input Data

Result Data

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Producer Consumer Relationship

Write

Re-useAbortion

Page 14: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

• Introduction

• Motivation

• Related Work

• Background

• Our Approach• Design• Implementation

• Evaluation

• Conclusion14

Page 15: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Design - Preemption with Transactionized Kernels

● Our design is structured in

15GPU

SchedulingQueue

Running Queue

Low PriorityKernel

AbortionCommand

Low PriorityKernel

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

GPU kernel abortion : To immediately abort the low priority kernel

Page 16: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Design - Preemption with Transactionized Kernels

● Our design is structured in

16GPU

SchedulingQueue

Running Queue

Kernel BufferSnapshot

Kernel BufferRoll-back

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

GPU memory context saving/restoring : For re-execution of a preempted kernel

Page 17: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Design - Preemption with Transactionized Kernels

● Our design is structured in

17GPU

SchedulingQueue

Running Queue

Dirty DataSnapshot

Inconsistency of a kernel buffer

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Transaction of GPU kernel execution :To prevent simultaneously access shared kernel buffer

Page 18: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Design - Transactionization Scheme

● GPU kernel transaction structure for re-execution of kernel○ Procedures from snapshotting to kernel finish

18

GPU JobScheduling

SchedulingProcess

GPU KernelTransaction Job Running Job DoneSnapshotting

Scheduling Request

Kernel

GPU KernelTransaction

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 19: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Design – Example of Transactionization Scheme

● Pseudo-preemptive priority scheduling flow of TransactionizedGPU Kernels

19

GPU JobScheduler

SnapshotModule

SystemMemory

GPU

KernelA

KernelA

Schedule

Submit

KernelB

Submit

Running

LaunchKernel

A

Running

KernelB

Launch

KernelA

AbortAbort

Command

Done & Schedule

KernelB

KernelA Re-schedule

Rolling backSnapshot Mem.GPU Mem.

Running

KernelA

Launch

Priority : Kernel A < Kernel B

SnapshottingGPU Mem.Snapshot Mem.

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 20: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

• Introduction

• Motivation

• Related Work

• Background

• Our Approach• Design• Implementation

• Evaluation

• Conclusion20

Page 21: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Implementation – Snapshot Mechanism

● Implementing snapshot mechanism through GPGPU programming model

21

Task

GPUDeviceDriver

Snapshot TargetDetermination SnapshottingSnapshot

Allocation

Kernel InstanceCreation

KernelLaunch

Kernel BufferAllocation

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 22: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Implementation – Snapshot Memory Management

● Allocating snapshot memory when allocating a kernel buffer● Tracking kernel buffer addresses thorough metadata of kernel

when launching kernel

22

Root

Buffer Buffer

Buffer Buffer Buffer Buffer

JobJobKernel

Instance

KernelBuffer

SnapshotMemory

Allocation with same size

Kernel Buffer Address List

Snapshot TargetAddress List

Binary DataDecoding &Generating

Address pointing

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 23: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Implementation – Snapshot Process

● Snapshot process is implemented by kernel threads

23

while(1){if( ! Is_next_kernel )

interruptible_sleep()

if( re-execution )roll-back();

elsesnapshot();

submit_to_GPU();}

Snapshot Thread Pseudo Code

Snapshot Cancel Range

Low PriorityTask

High PriorityTask

SnapshotThread-1

SnapshotThread-2

Kernel 1.Launch(wake-up)4.Launch

2.SnapshotCancel

3.Switching

Kernel

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 24: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Implementation – Preemption Mechanism

24

● Pre-preemption process○ Minimizing process of aborting scheduled kernels

● Post-preemption process○ Recovering aborted kernels to the scheduling queue○ Running a kernel launch and a post-preemption process

K Scheduling &Preemption

SchedulingQueue Slot Kernel

ExecutionSnapshotting InterruptHandler

De-queueSlot

SnapshotCancel

GPUAbortion Done

P_RUNP_SNAPP_SLOT

Pre-Preemption Process

Post-Preemption Process

STATUS CODE ->

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 25: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

• Introduction

• Motivation

• Related Work

• Background

• Our Approach• Design• Implementation

• Evaluation

• Conclusion25

Page 26: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Evaluation Setup (1)

● Target device environment

● Rodinia benchmark suite 3.1

26A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 27: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Evaluation Setup (2)

● Micro benchmark (M-Bench)○ We made simple GPGPU workload to be used for lower priority

workload than Rodinia benchmark’s workloads○ M-Bench running in background disturb the high priority task

● Priority configuration○ [E] : Emergency priority

§ This priority is the highest priority and does not require snapshotting○ [C] : Common priority

§ This priority is a higher priority than M-Bench○ [N] : Normal environment

§ This is result of evaluation in vanilla kernel

27A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 28: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Performance of High Priority Task

● Performance of a high priority task○ Degrading performance up to 186.9× depending on low priority tasks

on normal environment○ Guaranteeing consistent performance regardless of background load

on Transactionization scheme

28

0.5 1

1.5 2

2.5 3

3.5

NN BP FP BF KM 3D LUD NW GE MC CFD

Nor

mal

ized

Tur

naro

und

Tim

e

20 40 60 80

100 186.91xM-Bench[N]1xM-Bench[C]1xM-Bench[E]

2xM-Bench[N]2xM-Bench[C]2xM-Bench[E]

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 29: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Scheduling Delay on Original Environment

● Histogram of scheduling delay of high priority kernels○ Scheduling delay depending on the running kernel length in normal

environment● Our approach suppressed the launch delay of high-priority kernels

within 18μs

29

!

"

"!

"!!

"!!!

"!!!!

"! #! $! %! &! '! (! )! *! "!! ""! "#!!"#

$%&'()

**"&&%+*%,

+,-./01234 5.167 89:;

!

"

"!

"!!

"!!!

"!!!!

"!!!!!

"! "# "$ "% "& #! ## #$ #% #& '! '# '$ '% '& $! $# $$ $% $& (!!"#

$%&'()

**"&&%+*%,

!"#$%#&' ()*+, -./0

Normal environment

99.9% percentile within 18μs

Transactionization scheme

Increased delay due to kernel lengthA GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 30: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Analysis of Preemption Delay

● There is no delay caused by eviction in the preemption process since it exceeds the majority of 18μs

30

5 10 15 20 25 30 35 40 45 50

NN BP PF BFS KM 3D LUD NW GS MC CFS

Del

ayed

Tim

e (µ

s)

Launch Pre-preemption

Preemption overheadwithin 10μs

> Eviction Delay (18μs)

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 31: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Snapshotting Overhead

● Our scheme has significant overhead due to snapshotting○ Advantages of performance improvement due to launch without

snapshotting at emergency priority

31

0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4

1.5

NN BP FP BF KM 3D LUD NW GE MC CFD

5.13

Aver

age

Nor

mal

ized

Turn

arou

nd T

ime

Common Priority Emergency Priority

Using invalidated kernel bufferSnapshottingOverhead

A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 32: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Discussion

● Reducing snapshot overhead through high bandwidth memory (HBM)○ Exynos 8890 provides up to 51 GB/s memory bandwidth while the

device used in our evaluation presents 28.7 GB/s ○ Snapshot process is more efficient when using HBM because it

implements memory copying using ARM's vector instruction set

● Selective snapshot process using compile-time hints○ GPU kernel is idempotent sometimes○ If enabling to detect idempotent kernels, then selective snapshot

processes can be used

32A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 33: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

• Introduction

• Motivation

• Related Work

• Background

• Our Approach• Design• Implementation

• Evaluation

• Conclusion33

Page 34: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Conclusion

● A GPU kernel transactionization for preemptive scheduling○ We proposed a GPU kernel transactionization scheme that enables

immediate abortion and re-execution of a GPU kernel○ Based on the transactionization scheme, we developed a pseudo-

preemptive GPU kernel scheduler

● Evaluation results showed that the scheduling delay for the urgent task was reduced to approximately 18μs

● Source code is available to public at ○ https://github.com/Hyunsu-Lee/psched_gpu

34A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

Page 35: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Thank youQ&A

35

Page 36: A GPU Kernel Transactionization Scheme for Preemptive Priority …2018.rtas.org/wp-content/uploads/2018/05/S6.2.pdf · A GPU Kernel Transactionization Scheme for Preemptive Priority

Backup Slide (1)

● Multiple priority evaluation

36A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling

1 2 3 4

NN BP PF BFS KM 3D LUD NW GS MC CFD

Norm

alize

d Tu

rnar

ound

Tim

e

6 8

10 12 14 16 18

Priority [-10]Priority [ 0]Priority [ 10]