SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
OCELOT: SUPPORTED DEVICES

Overview
• Ocelot PTX Emulator
• Multicore Backend
• NVIDIA GPU Backend
• AMD GPU Backend

Multicore CPU Backend: Introduction
Target: efficient execution of PTX kernels on CPUs
• ISA translation from PTX to LLVM
• Execution-model translation from the PTX thread hierarchy to serialized PTX threads
• Light-weight thread scheduler
• LLVM just-in-time compilation to x86
• LLVM transformations applied before code generation

Some Interesting Features
• Serialization transforms
• JIT for parallel code
• Utilize all resources

Translation to CPUs: Thread Fusion
[Figure: thread blocks are mapped onto multicore host threads by thread serialization. Labels: Execution Manager (thread scheduling, context management), Thread Blocks, Multicore Host Threads, Thread serialization]
Execution model translation
• Thread scheduling
• Dealing with specialized operations, e.g. custom hardware
• Control flow restructuring
• Resource management (multiple cores)
Execute a kernel
• One worker pthread per CPU core
J. Stratton, S. Stone, and W.-m. Hwu, "MCUDA: An efficient implementation of CUDA kernels on multi-cores," University of Illinois at Urbana-Champaign, Tech. Rep. IMPACT-08-01, March 2008.
G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: A Dynamic Optimizing Compiler for Bulk-Synchronous Applications in Heterogeneous Systems," PACT, October 2010.

Ocelot Source Code: Multicore CPU Backend
• ocelot/
  • executive/
    • interface/MulticoreCPUDevice.h
    • interface/LLVMContext.h
    • interface/LLVMExecutableKernel.h
    • interface/LLVMCooperativeThreadArray.h
    • interface/LLVMModuleManager.h
    • interface/TextureOperations.h
  • ir/
    • interface/LLVMInstruction.h
  • translator/
    • interface/PTXToLLVMTranslator.h
  • transforms/
    • interface/SubkernelFormationPass.h
    • interface/RemoveBarrierPass.h

Multicore CPU: ISA Translation
Translate PTX IR to LLVM internal representation
• Arithmetic instructions have a one-to-few mapping
• Special instructions and registers are handled by LLVM intrinsics (e.g. cos, clock64, bar.sync)
• Texture sampling calls Ocelot's texture library
LLVMContext contains pointers to address spaces, next entry ID, thread ID
Custom LLVM IR implementation insulates Ocelot from LLVM changes
• LLVM requires SSA form, so Ocelot converts PTX to SSA
• Remove predication

PTX to LLVM ISA Translation

    //
    // ocelot/translation/implementation/PTXToLLVMTranslator.cpp
    //
    void PTXToLLVMTranslator::_translateAdd( const ir::PTXInstruction& i )
    {
        if( ir::PTXOperand::isFloat( i.type ) )
        {
            // Floating-point add maps onto a single LLVM fadd.
            ir::LLVMFadd add;
            ir::LLVMInstruction::Operand result = _destination( i );
            add.a = _translate( i.a );
            add.b = _translate( i.b );
            add.d = result;
            _llvmKernel->_statements.push_back(
                ir::LLVMStatement( add ) );
        }
        else
        {
            ..
            ..
            ..
        }
    }

• Translate each PTX instruction to an LLVM IR instruction sequence
• Special PTX registers and instructions are mapped to LLVM intrinsics:
  • llvm.readcyclecounter()
  • llvm.sqrt.f32()
• Result is an LLVM function implementing the PTX kernel
• Should be invertible if coupled to an LLVM->PTX code generator (not implemented)

Thread Serialization
• Thread loops
• Enter the next executable region via a scheduler block
• Barriers: store live values into thread-local memory, then return to the thread scheduler
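
The serialization scheme above can be illustrated with a small C++ sketch. It is not Ocelot's implementation: the ThreadContext struct, the runRegion function, and the toy two-region kernel are all hypothetical, chosen only to show a thread loop that runs every thread of a block up to a barrier, keeps the live value in per-thread storage, and re-enters the next region through a scheduler loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-thread context (illustrative names, not Ocelot's types).
struct ThreadContext {
    int tid = 0;
    int saved = 0;      // live value stored to thread-local memory at the barrier
    int nextEntry = 0;  // 0: region before the barrier, 1: region after, -1: exited
};

// One executable region per barrier-delimited section of a toy kernel.
// Region 0 publishes tid * 10 to shared memory and then reaches the barrier.
// Region 1 adds the restored live value to the neighbouring thread's slot.
int runRegion(ThreadContext& t, std::vector<int>& shared, std::vector<int>& out) {
    const int n = static_cast<int>(shared.size());
    switch (t.nextEntry) {
    case 0:
        t.saved = t.tid * 10;      // value that stays live across the barrier
        shared[t.tid] = t.saved;
        return 1;                  // return to the scheduler at the barrier
    case 1:
        out[t.tid] = t.saved + shared[(t.tid + 1) % n];
        return -1;                 // thread has finished
    default:
        return -1;
    }
}

// Scheduler with a thread loop: every thread of the block completes the
// current region before any thread enters the next one, which is exactly
// the ordering a barrier requires.
std::vector<int> runBlockSerialized(int threads) {
    assert(threads > 0);
    std::vector<ThreadContext> ctx(threads);
    std::vector<int> shared(threads, 0), out(threads, 0);
    for (int i = 0; i < threads; ++i) ctx[i].tid = i;

    bool anyLive = true;
    while (anyLive) {
        anyLive = false;
        for (auto& t : ctx) {
            if (t.nextEntry < 0) continue;
            t.nextEntry = runRegion(t, shared, out);
            anyLive = anyLive || (t.nextEntry >= 0);
        }
    }
    return out;
}
```

Calling runBlockSerialized(4) runs four logical PTX threads on a single host thread. Each thread reads a slot written by a different thread, so the result is only correct because the thread loop finishes region 0 for the whole block first.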

Using the Multicore Backend
Edit configure.ocelot
executive:
• devices: llvm – efficient execution of PTX on multicore CPU
• optimizationLevel – basic, none, full, memory, debug
• workerThreadLimit – number of worker threads
optimizations:
• subkernelSize – size of subkernels in instructions
• simplifyCFG – whether to apply CFG simplification pass
• hoistSpecialValues – whether to load LLVMContext values at launch of kernel

    executive: {
        devices: [ llvm ],
        asynchronousKernelLaunch: true,
        optimizationLevel: none,
        workerThreadLimit: 1,
        warpSize: 1
    },
    optimizations: {
        subkernelSize: 1000,
        simplifyCFG: true,
        hoistSpecialValues: true
    },

NVIDIA GPU: Introduction
• Executes PTX kernels on GPUs via the CUDA Driver API
• Thin layer on top of the CUDA Driver API
• Ocelot enables rewriting of PTX kernels
  • Register reallocation
  • Runtime optimizations
  • Instrumentation

Ocelot Source Code: NVIDIA GPU Device Backend
• ocelot/
  • executive/
    • interface/NVIDIAGPUDevice.h
    • interface/NVIDIAExecutableKernel.h

Using the NVIDIA GPU Backend
Edit configure.ocelot
executive:
• devices: nvidia – invokes NVIDIA GPU backend

    executive: {
        devices: [ nvidia ],
    },

Dynamic Instrumentation
• Run-time generation of user-defined, custom instrumentation code for CUDA kernels
• Harness chip-level instrumentation when possible
• Instrumentation data to drive
  • Off-line workload characterization
  • On-line debugging and program optimization
  • On-line resource management
• Inspired in part by the Pin [1] infrastructure
Naila Farooqui, Andrew Kerr, Greg Eisenhauer, Karsten Schwan, and Sudhakar Yalamanchili. "Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures." ISPASS, April 2012.
PhD student: Naila Farooqui, joint with K. Schwan and A. Gavrilovska
[1] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. "Pin: Building customized program analysis tools with dynamic instrumentation." PLDI '05.

Instrumentation Support in Ocelot
• High-level C constructs to define instrumentation + (C-to-PTX) JIT
• Integration with system management software and dynamic compiler
  • Online resource management based on profiling
• Additional Instrumentor APIs to provide criteria for instrumentation
  • Selectively perform instrumentation on kernels

Custom Instrumentation
Transparent profiling and characterization of library implementations
[Figure: Lynx within the Ocelot run time. Labels: nvcc, PTX, Ocelot Run Time, CUDA, Libraries, Instrumentation APIs, Instrumentor, C-on-Demand JIT, C-PTX Translator, PTX-PTX Transformer, Lynx, Example Instrumentation Code]

Instrumentation: Instruction count
[Figure: instruction count results. Footnote: * Scan (CUDA SDK)]

Remote Device Layer
• Remote procedure call layer for Ocelot device calls
• Execute local applications that run kernels remotely
• Multi-GPU applications can become multi-node

Switchable Compute
• Switch devices at runtime
• Load balancing
• Remote execution

AMD GPU Backend
• Executes PTX kernels on GPUs via the CAL Driver API
• Rewriting of PTX kernels (for optimization, instrumentation, etc.) also carries over to the AMD backend
• Ocelot Device Interface:
  • Module registration
  • Memory management
    • Global/Shared/Constant/Parameter memory allocation
  • Kernel launches
    • Translation from PTX to IL
  • Texture management
  • OpenGL interoperability
  • Streams and Events
Rodrigo Dominguez, Dana Schaa, and David Kaeli. "Caracal: Dynamic Translation of Runtime Environments for GPUs." In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4.

AMD Evergreen Architecture
AMD Radeon HD 5870
• 20 SIMD cores
• 16 Stream Cores (SC) per SIMD core
• Each SC is VLIW-5
• A total of 1600 ALUs
• Wavefronts of 64 threads
• Peak is 2.72 TFLOPS (SP) and 544 GFLOPS (DP)
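
The ALU count and peak rates quoted above can be checked with a little arithmetic. The sketch below assumes two things the slide does not state: an 850 MHz engine clock, and that each ALU retires one multiply-add (2 FLOPs) per cycle. The double-precision figure further assumes one DP fma per stream core per cycle. All constant and function names are illustrative.

```cpp
// Figures taken from the slide.
constexpr int kSimdCores = 20;
constexpr int kStreamCoresPerSimd = 16;
constexpr int kLanesPerStreamCore = 5;   // VLIW-5

// Assumptions that are NOT on the slide.
constexpr int kClockMHz = 850;           // assumed HD 5870 engine clock
constexpr int kFlopsPerFma = 2;          // a multiply-add counts as 2 FLOPs

// Total number of ALUs on the chip.
constexpr int totalAlus() {
    return kSimdCores * kStreamCoresPerSimd * kLanesPerStreamCore;
}

// Peak single-precision rate in GFLOPS: every ALU issues one fma per cycle.
constexpr int peakSpGflops() {
    return totalAlus() * kFlopsPerFma * kClockMHz / 1000;
}

// Peak double-precision rate in GFLOPS: one DP fma per stream core per cycle.
constexpr int peakDpGflops() {
    return kSimdCores * kStreamCoresPerSimd * kFlopsPerFma * kClockMHz / 1000;
}

// Compile-time checks against the numbers quoted on the slide.
static_assert(totalAlus() == 1600, "20 x 16 x 5 ALUs");
static_assert(peakSpGflops() == 2720, "2.72 TFLOPS single precision");
static_assert(peakDpGflops() == 544, "544 GFLOPS double precision");
```

Under these assumptions the DP peak is one fifth of the SP peak, because the five VLIW lanes of a stream core combine to issue a single double-precision fma.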

AMD Evergreen Architecture
[Figure: one SIMD core and one stream core. Labels: General Purpose Registers, Processing Elements, T-Processing Element, Branch Execution Unit, Instruction and Control Flow. Source: AMD OpenCL University Kit]
Each Stream Core includes:
• 4 Processing Elements
  • 4 independent SP or integer operations
  • 2 DP operations
  • 1 DP fma or mult operation
• 1 Special Function Unit
  • 1 SP or integer operation
  • SP or DP transcendental
• Branch Execution Unit
GPR = 5.24 MB

AMD Evergreen Architecture
• Local Data Share
  • 2 TB/s
  • 32 KB per SIMD
• Global Data Share
  • Shared between all threads in a kernel
  • Low latency global reductions
• L1 (8 KB)
• L2 (512 KB, 450 GB/s)
• Global Memory (GDDR5, 153 GB/s)

Translation from PTX to IL
PTX
• RISC-style syntax
• Load-store instruction set
• Registers are typed and scalar
• Unlimited virtual registers
• Predicate registers
• Control flow based on branches and labels
• Designed for compute (GPGPU)

    .entry vecAdd (.param .u64 A,
                   .param .u64 B,
                   .param .u64 C,
                   .param .s32 N)
    {
        mov.u16 rh1, ctaid.x;
        mov.u16 rh2, ntid.x;
        mul.wide.u16 r1, rh1, rh2;
        cvt.u32.u16 r2, tid.x;
        add.u32 r3, r2, r1;
        ld.param.s32 r4, [N];
        setp.le.s32 p1, r4, r3;
        @p1 bra Label_1;
        ...
    }

Translation from PTX to IL
IL
• Registers are 32-bit and vectors (4 components)
• Registers have no type
• Swizzles
• Resources are globally scoped
• Structured control flow (if-end, while-end)
• Designed for graphics, not compute (see FSAIL)

    il_cs_2_0
    dcl_raw_uav_id(0)
    dcl_cb cb0[2]
    dcl_cb cb1[4]
    dcl_literal l0, 4, 4, 4, 4
    mov r0.x, vThreadGrpId.x
    mov r1.x, cb0[0].x
    imul r2.x, r0.x, r1.x
    mov r3.x, vTidInGrp.x
    iadd r4.x, r3.x, r2.x
    mov r5.x, cb1[3].x
    ige r6.x, r4.x, r5.x
    if_logicalz r6.x
    ...
    endif
    end
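
The PTX and IL listings express the same bounds check in two control-flow styles: PTX branches around the kernel body when a predicate is true, while IL wraps the body in a structured if that runs when the comparison result is zero. The C++ sketch below models both conditions so they can be compared. It assumes, from the way the two listings line up, that cb0[0].x holds the thread-group size and cb1[3].x holds N; the function names are hypothetical.

```cpp
// Global thread index, as computed by mul.wide.u16/add.u32 in the PTX
// listing and by imul/iadd in the IL listing.
constexpr int globalIndex(int group, int groupSize, int local) {
    return group * groupSize + local;
}

// PTX form: "setp.le.s32 p1, r4, r3; @p1 bra Label_1" jumps past the body
// when N <= index.
constexpr bool ptxBranchTaken(int ctaid, int ntid, int tid, int n) {
    return n <= globalIndex(ctaid, ntid, tid);
}

// IL form: "ige r6.x, r4.x, r5.x; if_logicalz r6.x" executes the body when
// (index >= N) evaluates to zero, i.e. when index < N.
constexpr bool ilBodyRuns(int groupId, int groupSize, int tidInGroup, int n) {
    return !(globalIndex(groupId, groupSize, tidInGroup) >= n);
}

// The structured condition is the negation of the branch predicate.
static_assert(ptxBranchTaken(1, 256, 0, 256) && !ilBodyRuns(1, 256, 0, 256),
              "first out-of-range thread skips the body in both forms");
static_assert(!ptxBranchTaken(0, 256, 255, 256) && ilBodyRuns(0, 256, 255, 256),
              "last in-range thread runs the body in both forms");
```

A forward branch like this one maps directly onto an if-endif pair. More general branch patterns need the structural analysis and transform passes listed in the AMD backend sources.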

AMD GPU Backend
• Validated over 30 applications from the CUDA SDK
• Support for pre-compiled libraries
• Device selection can be made at runtime
• What is supported?
  • Global memory (cudaMalloc, cudaMemcpy)
  • Shared memory (including extern)
  • Constant memory
  • Atomics (global and shared)
  • Barriers and fences
  • 30+ PTX instructions

Ocelot Source Code: AMD GPU Device Backend
• ocelot/
  • analysis/
    • interface/StructuralAnalysis.h
  • executive/
    • interface/ATIGPUDevice.h
    • interface/ATIExecutableKernel.h
  • transforms/
    • interface/StructuralTransform.h

Using the AMD GPU Backend
Edit configure.ocelot
executive:
• devices: amd – invokes AMD GPU backend

    executive: {
        devices: [ amd ],
    },

Unstructured to Structured Control Flow*
• Handling branch divergence well is key to high performance on GPUs
  • Its impact differs depending on whether the control flow is structured or unstructured
• Not all GPUs support unstructured CFGs directly
  • Using dynamic translation to support AMD GPUs**
* H. Wu, G. Diamos, S. Li, and S. Yalamanchili. Characterization and Transformation of Unstructured Control Flow in GPU Applications. CACHES, 2011.
** R. Dominguez, D. Schaa, and D. Kaeli. Caracal: Dynamic translation of runtime environments for GPUs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 5–11. ACM, 2011.