A GPU Kernel Transactionization Scheme for Preemptive Priority Scheduling
Hyeonsu Lee, Jaehun Roh, Euiseong Seo
School of Software
Sungkyunkwan University
Contents
• Introduction
• Motivation
• Related Work
• Background
• Our Approach
  • Design
  • Implementation
• Evaluation
• Conclusion
Introduction
● GPUs are specialized for parallel computing workloads
  ○ Embedded systems can benefit substantially from GPU computing
  ○ Embedded applications: voice recognition, face detection, health analysis
  ○ NVIDIA's machine learning benchmark: 13.6× faster
Motivation
● Example: autonomous vehicle control systems using a GPU
  ○ [Figure: on a GPU embedded system, object detection is running while the high-priority evasion path finder is delayed ("W..wait!"), ending in a crash]
Motivation
● Priority inversion problem on GPUs
  ○ No GPU preemption HW or interfaces
  ○ [Figure: CPU/GPU timeline with priority Kernel B > Kernel A, showing Kernel A, Kernel B, and Kernel B's deadline]
Embedded Systems Need Preemptive Priority Scheduling of GPU Kernels
Related Work - Hardware
● HW extensions for preemptive GPU scheduling
  • An experimental architecture was proposed and evaluated through simulation
  • NVIDIA commercialized GPUs with HW preemption (but has not yet disclosed an API for it)
● Limitations of HW GPU preemption
  • Up to 100 μs context-switching time due to the massive amount of state saving/restoring
  • Increased HW design complexity and thus manufacturing cost
Related Work - Software
● Kernel slicing approach
  ○ [Figure: a kernel is split into sub-kernels, determined through static analysis; a high-priority sub-kernel runs after a short delay]
● Thread-block-level scheduling approach
  ○ [Figure: after each thread block (block task), a preemption check either repeats (no) or switches to the highest-priority kernel (yes); the high-priority kernel is delayed]
● We propose a software preemption solution that
  ○ Immediately aborts the currently running kernel
  ○ Re-executes aborted kernels later
Background – Shared System Memory
● GPU memory management in embedded systems
  ○ The CPU can directly access GPU memory
  ○ [Figure: a host-side buffer in the CPU virtual address space and kernel buffers 1 and 2 in the GPU virtual address space reach shared entries through the CPU MMU and the GPU IOMMU]
Background – Shared Kernel Buffers
● Producer-consumer relationships arise through shared kernel buffers
● Blind abortion and re-execution may corrupt data delivered from a kernel's producer
  ○ [Figure: producer-consumer relationship between Kernel A and Kernel B of a task through a shared kernel buffer in GPU memory; labels: CPU (ready), GPU (running), input data, result data, write, re-use, abortion]
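The corruption hazard described on this slide can be reproduced in a small simulation. The sketch below is a hypothetical illustration and not the authors' code: `scale_in_place` stands in for a GPU kernel that reads and writes the same shared buffer, and the `abort_after` parameter is an invented knob that models an abortion partway through execution.

```python
def scale_in_place(buf, factor, abort_after=None):
    """Toy 'kernel': multiply every element of buf in place.

    abort_after (hypothetical) stops the kernel after that many elements,
    modeling an abortion in the middle of execution.
    Returns True if the kernel finished, False if it was aborted.
    """
    for i in range(len(buf)):
        if abort_after is not None and i >= abort_after:
            return False
        buf[i] *= factor
    return True


def run_blind(buf, factor, abort_after):
    """Abort once, then re-execute on whatever the buffer now holds."""
    scale_in_place(buf, factor, abort_after)
    scale_in_place(buf, factor)
    return buf


def run_with_snapshot(buf, factor, abort_after):
    """Snapshot before launch; on abort, roll back and re-execute."""
    snapshot = list(buf)
    if scale_in_place(buf, factor, abort_after):
        return buf
    buf[:] = snapshot  # roll-back restores the data delivered by the producer
    scale_in_place(buf, factor)
    return buf


producer_output = [1, 2, 3, 4]
blind = run_blind(list(producer_output), 2, abort_after=2)
safe = run_with_snapshot(list(producer_output), 2, abort_after=2)
print("blind re-execution:", blind)
print("snapshot + roll-back:", safe)
```

In the blind case the elements processed before the abortion are scaled twice, while the snapshot path yields the same result as an uninterrupted run.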
Design - Preemption with Transactionized Kernels
● Our design is structured in three parts
  ○ GPU kernel abortion: to immediately abort the low-priority kernel
    [Figure: scheduling queue, running queue, and GPU; an abortion command is issued for the low-priority kernel]
  ○ GPU memory context saving/restoring: for re-execution of a preempted kernel
    [Figure: kernel buffer snapshot and kernel buffer roll-back]
  ○ Transaction of GPU kernel execution: to prevent simultaneous access to a shared kernel buffer
    [Figure: a snapshot of dirty data causes inconsistency of a kernel buffer]
Design - Transactionization Scheme
● GPU kernel transaction structure for re-execution of a kernel
  ○ A transaction covers the procedures from snapshotting to kernel finish
  ○ [Figure: a kernel's scheduling request enters the GPU job scheduling process; the GPU kernel transaction spans snapshotting, job running, and job done]
Design – Example of Transactionization Scheme
● Pseudo-preemptive priority scheduling flow of transactionized GPU kernels (priority: Kernel A < Kernel B)
  ○ [Figure: sequence diagram among the GPU job scheduler, snapshot module, system memory, and GPU]
  ○ Kernel A is submitted and scheduled; its buffers are snapshotted (GPU mem. → snapshot mem.); Kernel A is launched and runs
  ○ Kernel B is submitted; an abort command aborts Kernel A; Kernel B is launched and runs
  ○ Kernel B is done and Kernel A is re-scheduled; its buffers are rolled back (snapshot mem. → GPU mem.); Kernel A is launched and runs again
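The flow on this slide can be walked through with a toy scheduler. This is a minimal sketch under stated assumptions, not the authors' scheduler: the `Kernel` and `Scheduler` classes, the `tick` time step, and the log strings are all hypothetical, a Python list models a kernel buffer, and every kernel is snapshotted before its first launch.

```python
import heapq


class Kernel:
    def __init__(self, name, priority, buffer):
        self.name = name
        self.priority = priority  # larger value = more urgent (assumption)
        self.buffer = buffer      # list standing in for a kernel buffer
        self.snapshot = None      # filled in before the first launch


class Scheduler:
    def __init__(self):
        self.queue = []       # heap of (-priority, sequence number, kernel)
        self.seq = 0
        self.running = None
        self.progress = 0
        self.log = []

    def _push(self, kernel):
        heapq.heappush(self.queue, (-kernel.priority, self.seq, kernel))
        self.seq += 1

    def submit(self, kernel):
        self._push(kernel)
        if self.running is not None and kernel.priority > self.running.priority:
            # Abort the running kernel immediately and queue it for re-execution.
            self.log.append(f"abort {self.running.name}")
            self._push(self.running)
            self.running = None
        if self.running is None:
            self._launch_next()

    def _launch_next(self):
        if not self.queue:
            return
        _, _, kernel = heapq.heappop(self.queue)
        if kernel.snapshot is None:
            kernel.snapshot = list(kernel.buffer)   # GPU mem. -> snapshot mem.
            self.log.append(f"snapshot {kernel.name}")
        else:
            kernel.buffer[:] = kernel.snapshot      # snapshot mem. -> GPU mem.
            self.log.append(f"rollback {kernel.name}")
        self.running, self.progress = kernel, 0
        self.log.append(f"launch {kernel.name}")

    def tick(self):
        """Advance the running kernel by one step; each step dirties one element."""
        kernel = self.running
        if kernel is None:
            return
        kernel.buffer[self.progress] += 1
        self.progress += 1
        if self.progress == len(kernel.buffer):
            self.log.append(f"done {kernel.name}")
            self.running = None
            self._launch_next()


sched = Scheduler()
kernel_a = Kernel("A", priority=0, buffer=[0, 0, 0, 0])
kernel_b = Kernel("B", priority=10, buffer=[0, 0])
sched.submit(kernel_a)
sched.tick()
sched.tick()            # Kernel A is half done when Kernel B arrives
sched.submit(kernel_b)
while sched.running is not None:
    sched.tick()
print(sched.log)
```

Because Kernel A is rolled back before it is relaunched, its final buffer matches an uninterrupted run even though it was aborted halfway.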
Implementation – Snapshot Mechanism
● The snapshot mechanism is implemented through the GPGPU programming model
  ○ [Figure: task steps (kernel buffer allocation, kernel instance creation, kernel launch) alongside GPU device driver steps (snapshot allocation, snapshot target determination, snapshotting)]
Implementation – Snapshot Memory Management
● Snapshot memory is allocated when a kernel buffer is allocated
● Kernel buffer addresses are tracked through the kernel's metadata when the kernel is launched
  ○ [Figure: each kernel buffer has snapshot memory allocated with the same size; buffers are organized in a tree under a root; a job's binary data is decoded to generate the kernel buffer address list of the kernel instance, whose addresses point to the snapshot target address list]
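The bookkeeping on this slide can be sketched as a registry keyed by buffer address. This is an assumption-laden illustration rather than the driver's data structure: `SnapshotRegistry`, its method names, and the fake address counter are hypothetical, a dictionary replaces the buffer tree shown in the figure, and a plain argument list stands in for the addresses that the real driver decodes from the job's binary data.

```python
class SnapshotRegistry:
    """Tracks kernel buffers and their same-size snapshot memory by address."""

    def __init__(self):
        self._entries = {}        # address -> (kernel buffer, snapshot memory)
        self._next_addr = 0x1000  # hypothetical address counter

    def alloc_buffer(self, size):
        """Allocate a kernel buffer and, at the same time, its snapshot memory."""
        addr = self._next_addr
        self._next_addr += size
        self._entries[addr] = (bytearray(size), bytearray(size))
        return addr

    def buffer(self, addr):
        return self._entries[addr][0]

    def snapshot_targets(self, kernel_args):
        """Keep only the launch arguments that are known kernel buffer addresses."""
        return [arg for arg in kernel_args if arg in self._entries]

    def snapshot(self, kernel_args):
        for addr in self.snapshot_targets(kernel_args):
            buf, snap = self._entries[addr]
            snap[:] = buf

    def roll_back(self, kernel_args):
        for addr in self.snapshot_targets(kernel_args):
            buf, snap = self._entries[addr]
            buf[:] = snap


registry = SnapshotRegistry()
in_addr = registry.alloc_buffer(4)
out_addr = registry.alloc_buffer(4)
registry.buffer(in_addr)[:] = b"\x01\x02\x03\x04"

launch_args = [in_addr, out_addr, 42]      # 42 is a scalar, not a buffer
registry.snapshot(launch_args)
registry.buffer(in_addr)[0] = 0xFF          # kernel dirties its input in place
registry.roll_back(launch_args)
print(list(registry.buffer(in_addr)))
```

Allocating the snapshot together with the buffer keeps allocation off the launch path, which is the design choice the slide describes; matching scalars against addresses by value, as done here, is a simplification.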
Implementation – Snapshot Process
● The snapshot process is implemented by kernel threads
Snapshot Thread Pseudo Code
while (1) {
    if (!is_next_kernel)
        interruptible_sleep();   /* wait until a kernel is scheduled */
    if (re_execution)
        roll_back();             /* restore buffers of an aborted kernel */
    else
        snapshot();              /* save buffers before the first launch */
    submit_to_GPU();
}
  ○ [Figure: snapshot cancel range across a low-priority task with snapshot thread-1 and a high-priority task with snapshot thread-2; steps: 1. launch (wake-up), 2. snapshot cancel, 3. switching kernel, 4. launch]
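The snapshot cancel step in the figure suggests that an in-progress snapshot can be abandoned when a higher-priority kernel arrives. One way to sketch that, assuming a chunked copy with a cancellation flag checked between chunks, is shown below; `snapshot_chunked`, the chunk size, and the use of `threading.Event` are illustrative assumptions and do not describe the actual kernel-thread implementation.

```python
import threading


def snapshot_chunked(src, dst, cancel, chunk=4096):
    """Copy src into dst chunk by chunk, checking for cancellation in between.

    Returns True if the whole buffer was copied, False if the snapshot was
    cancelled (dst may then hold a partial, unusable copy).
    """
    copied = 0
    while copied < len(src):
        if cancel.is_set():
            return False
        end = min(copied + chunk, len(src))
        dst[copied:end] = src[copied:end]
        copied = end
    return True


kernel_buffer = bytearray(range(256)) * 64      # 16 KiB of sample data
snapshot_memory = bytearray(len(kernel_buffer))

cancel = threading.Event()
completed = snapshot_chunked(kernel_buffer, snapshot_memory, cancel)

cancel.set()                                    # a high-priority kernel arrived
scratch = bytearray(len(kernel_buffer))
cancelled = snapshot_chunked(kernel_buffer, scratch, cancel)
print(completed, cancelled)
```

A smaller chunk shortens the wait before a cancellation takes effect at the cost of more flag checks per snapshot.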
Implementation – Preemption Mechanism
● Pre-preemption process
  ○ Minimizing the process of aborting scheduled kernels
● Post-preemption process
  ○ Recovering aborted kernels to the scheduling queue
  ○ Running a kernel launch and a post-preemption process
  ○ [Figure: scheduling and preemption of a kernel through the stages scheduling queue slot, snapshotting, kernel execution, and interrupt handler, with status codes P_SLOT, P_SNAP, and P_RUN; pre-preemption actions: de-queue slot, snapshot cancel, GPU abortion; the post-preemption process follows abortion done]
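The status codes in the figure suggest that the pre-preemption step depends on where the victim kernel currently is. The sketch below encodes one reading of the figure, which is an assumption: P_SLOT pairs with de-queueing the slot, P_SNAP with cancelling the snapshot, and P_RUN with aborting on the GPU. The `Status` enum, `pre_preempt`, and the action strings are hypothetical names.

```python
from enum import Enum


class Status(Enum):
    P_SLOT = "waiting in a scheduling queue slot"
    P_SNAP = "snapshotting"
    P_RUN = "executing on the GPU"


# Assumed pairing of status code and pre-preemption action, read off the figure.
PRE_PREEMPTION_ACTION = {
    Status.P_SLOT: "dequeue_slot",
    Status.P_SNAP: "cancel_snapshot",
    Status.P_RUN: "abort_on_gpu",
}


def pre_preempt(status):
    """Return the cheapest action that stops a kernel in the given state."""
    return PRE_PREEMPTION_ACTION[status]


for status in Status:
    print(status.name, "->", pre_preempt(status))
```

Under this reading, only a kernel that has already reached the GPU needs an abortion command; earlier stages are undone in software.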
Evaluation Setup (1)
● Target device environment
● Rodinia benchmark suite 3.1
Evaluation Setup (2)
● Micro benchmark (M-Bench)
  ○ We made a simple GPGPU workload to serve as a lower-priority workload than the Rodinia benchmark workloads
  ○ M-Bench runs in the background and disturbs the high-priority task
● Priority configuration
  ○ [E]: Emergency priority
    § The highest priority; it does not require snapshotting
  ○ [C]: Common priority
    § A higher priority than M-Bench
  ○ [N]: Normal environment
    § Results of evaluation on the vanilla kernel
Performance of High Priority Task
● Performance of a high-priority task
  ○ In the normal environment, performance degrades by up to 186.9× depending on the low-priority tasks
  ○ With the transactionization scheme, performance stays consistent regardless of background load
  ○ [Figure: normalized turnaround time of NN, BP, FP, BF, KM, 3D, LUD, NW, GE, MC, and CFD with 1×M-Bench and 2×M-Bench under [N], [C], and [E]; the tallest bar is labeled 186.9]
Scheduling Delay on Original Environment
● Histogram of the scheduling delay of high-priority kernels
  ○ In the normal environment, the scheduling delay depends on the length of the running kernel
● Our approach kept the launch delay of high-priority kernels within 18 μs
  ○ [Figure: two scheduling-delay histograms. Normal environment: increased delay due to kernel length. Transactionization scheme: 99.9th percentile within 18 μs]
Analysis of Preemption Delay
● There is no delay caused by eviction in the preemption process since it exceeds the majority of 18 μs
  ○ [Figure: delayed time (μs), split into launch and pre-preemption, for NN, BP, PF, BFS, KM, 3D, LUD, NW, GS, MC, and CFD; annotations: preemption overhead within 10 μs; > eviction delay (18 μs)]
Snapshotting Overhead
● Our scheme has significant overhead due to snapshotting
  ○ At emergency priority, kernels launch without snapshotting, which improves performance
  ○ [Figure: average normalized turnaround time at common priority and emergency priority for NN, BP, FP, BF, KM, 3D, LUD, NW, GE, MC, and CFD; one off-scale bar is labeled 5.13; annotations: snapshotting overhead; using invalidated kernel buffer]
Discussion
● Reducing snapshot overhead through high bandwidth memory (HBM)
  ○ Exynos 8890 provides up to 51 GB/s of memory bandwidth, while the device used in our evaluation provides 28.7 GB/s
  ○ The snapshot process is more efficient with HBM because it implements memory copying using ARM's vector instruction set
● Selective snapshot process using compile-time hints
  ○ GPU kernels are sometimes idempotent
  ○ If idempotent kernels can be detected, snapshots can be taken selectively
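The effect of memory bandwidth on snapshot cost can be estimated with simple arithmetic. The sketch below is a rough lower bound under stated assumptions: the 1 MiB buffer size is hypothetical, 1 GB is taken as 10^9 bytes, the copy is assumed to run at the full quoted bandwidth, and the extra traffic from reading and writing each byte is ignored. Only the two bandwidth figures come from the slide.

```python
def copy_time_us(size_bytes, bandwidth_gb_s):
    """Ideal time in microseconds to move size_bytes at the given bandwidth."""
    return size_bytes / (bandwidth_gb_s * 1e9) * 1e6


buffer_size = 1 << 20                          # hypothetical 1 MiB kernel buffer
t_eval = copy_time_us(buffer_size, 28.7)       # device used in the evaluation
t_fast = copy_time_us(buffer_size, 51.0)       # bandwidth quoted for Exynos 8890
print(f"28.7 GB/s: {t_eval:.1f} us")
print(f"51.0 GB/s: {t_fast:.1f} us")
print(f"speed-up: {t_eval / t_fast:.2f}x")
```

Under these assumptions the snapshot time scales inversely with bandwidth, so the higher-bandwidth device would cut the copy time by the ratio 51 / 28.7.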
Conclusion
● A GPU kernel transactionization scheme for preemptive scheduling
  ○ We proposed a GPU kernel transactionization scheme that enables immediate abortion and re-execution of a GPU kernel
  ○ Based on the transactionization scheme, we developed a pseudo-preemptive GPU kernel scheduler
● Evaluation results showed that the scheduling delay for the urgent task was reduced to approximately 18 μs
● Source code is publicly available at
  ○ https://github.com/Hyunsu-Lee/psched_gpu
Thank you
Q&A
Backup Slide (1)
● Multiple priority evaluation
  ○ [Figure: normalized turnaround time of NN, BP, PF, BFS, KM, 3D, LUD, NW, GS, MC, and CFD at priorities -10, 0, and 10]