ocelot: supported devices

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY OCELOT: SUPPORTED DEVICES

Upload: rowena

Post on 24-Feb-2016



TRANSCRIPT

Page 1: Ocelot: supported devices


OCELOT: SUPPORTED DEVICES

Page 2: Ocelot: supported devices

Overview

• Ocelot PTX Emulator
• Multicore Backend
• NVIDIA GPU Backend
• AMD GPU Backend

Page 3: Ocelot: supported devices

Multicore CPU Backend: Introduction

Target: efficient execution of PTX kernels on CPUs

• ISA translation from PTX to LLVM
• Execution-model translation from the PTX thread hierarchy to serialized PTX threads
• Light-weight thread scheduler
• LLVM just-in-time compilation to x86
• LLVM transformations applied before code generation

Page 4: Ocelot: supported devices

Some Interesting Features

• Serialization transforms
• JIT for parallel code
• Utilize all resources

Page 5: Ocelot: supported devices

Translation to CPUs: Thread Fusion

[Figure: thread blocks are mapped onto multicore host threads through thread serialization. Labels: Execution Manager (thread scheduling, context management); Thread Blocks; Multicore Host Threads; Thread serialization; One worker pthread per CPU core; Execute a kernel]

Execution model translation:

• Thread scheduling
• Dealing with specialized operations, e.g. custom hardware
• Control flow restructuring
• Resource management (multiple cores)

J. Stratton, S. Stone, and W.-m. Hwu, “MCUDA: An efficient implementation of CUDA kernels on multi-cores,” University of Illinois at Urbana-Champaign, Tech. Rep. IMPACT-08-01, March 2008.

G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk-Synchronous Applications in Heterogeneous Systems,” PACT, October 2010.
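The mapping of thread blocks onto host threads can be illustrated with a minimal sketch. It is not Ocelot's execution manager: `launchKernel`, `kernelBody`, and the dynamic claiming of blocks through an atomic counter are assumptions made for illustration. `std::thread` stands in for the pthreads named on the slide, and the calling thread doubles as worker 0.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical kernel body: each PTX thread writes twice its global index.
static void kernelBody(std::vector<int>& out, int globalId) {
    out[static_cast<std::size_t>(globalId)] = 2 * globalId;
}

// Run ctaCount thread blocks on at most workerLimit host threads.
std::vector<int> launchKernel(int ctaCount, int threadsPerCta, unsigned workerLimit) {
    std::vector<int> out(static_cast<std::size_t>(ctaCount) * threadsPerCta, 0);

    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;  // the call may return 0 when the count is unknown
    const unsigned workers = std::max(1u, std::min(cores, workerLimit));

    std::atomic<int> nextCta{0};
    auto worker = [&]() {
        // Claim thread blocks one at a time until none are left.
        for (int cta = nextCta.fetch_add(1); cta < ctaCount; cta = nextCta.fetch_add(1)) {
            // Thread serialization: the block's threads run back to back.
            for (int tid = 0; tid < threadsPerCta; ++tid) {
                kernelBody(out, cta * threadsPerCta + tid);
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 1; w < workers; ++w) pool.emplace_back(worker);
    worker();  // the calling thread acts as worker 0
    for (std::thread& t : pool) t.join();
    return out;
}
```

Each block writes a disjoint slice of `out`, so the workers need no locking beyond the shared block counter.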

Page 6: Ocelot: supported devices

Ocelot Source Code: Multicore CPU Backend

• ocelot/
  • executive/
    • interface/MulticoreCPUDevice.h
    • interface/LLVMContext.h
    • interface/LLVMExecutableKernel.h
    • interface/LLVMCooperativeThreadArray.h
    • interface/LLVMModuleManager.h
    • interface/TextureOperations.h
  • ir/
    • interface/LLVMInstruction.h
  • translator/
    • interface/PTXToLLVMTranslator.h
  • transforms/
    • interface/SubkernelFormationPass.h
    • interface/RemoveBarrierPass.h

Page 7: Ocelot: supported devices

Multicore CPU: ISA Translation

• Translate PTX IR to LLVM internal representation
  • Arithmetic instructions have a one-to-few mapping
  • Special instructions and registers are handled by LLVM intrinsics (e.g. cos, clock64, bar.sync)
  • Texture sampling calls Ocelot's texture library
• LLVMContext contains pointers to address spaces, next entry ID, thread ID
• Custom LLVM IR implementation insulates Ocelot from LLVM changes
• LLVM requires SSA form -> Ocelot converts PTX to SSA
• Remove predication
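One common way to remove predication is to compute the result unconditionally into a temporary and then select between it and the old destination value. The sketch below illustrates that idea on a toy instruction type. `Instr` and `removePredication` are hypothetical names, not Ocelot's classes, and the slide does not say which strategy Ocelot applies to a given instruction. The rewrite is only valid for side-effect-free instructions; guarded stores and branches need real control flow.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified instruction (not Ocelot's ir::PTXInstruction).
struct Instr {
    std::string op;     // e.g. "add.u32"
    std::string d;      // destination register
    std::string a, b;   // source operands
    std::string c;      // third source operand (used by selp only)
    std::string guard;  // predicate register, empty if unguarded
};

// Rewrite each guarded instruction as an unguarded computation into a
// temporary, followed by a select on the former guard.
// Assumes every guarded instruction is free of side effects.
std::vector<Instr> removePredication(const std::vector<Instr>& in) {
    std::vector<Instr> out;
    int nextTemp = 0;
    for (const Instr& i : in) {
        if (i.guard.empty()) {
            out.push_back(i);
            continue;
        }
        const std::string tmp = "%tmp" + std::to_string(nextTemp++);
        out.push_back({i.op, tmp, i.a, i.b, "", ""});
        // selp d, a, b, c computes d = c ? a : b, so the old value of d
        // survives when the guard is false.
        out.push_back({"selp", i.d, tmp, i.d, i.guard, ""});
    }
    return out;
}
```

After the rewrite no instruction carries a guard, which matches the straight-line, SSA-friendly form that LLVM expects.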

Page 8: Ocelot: supported devices

PTX to LLVM ISA Translation

//
// ocelot/translator/implementation/PTXToLLVMTranslator.cpp
//
void PTXToLLVMTranslator::_translateAdd( const ir::PTXInstruction& i )
{
    if( ir::PTXOperand::isFloat( i.type ) )
    {
        // A floating-point PTX add becomes a single LLVM fadd
        ir::LLVMFadd add;
        ir::LLVMInstruction::Operand result = _destination( i );

        add.a = _translate( i.a );
        add.b = _translate( i.b );
        add.d = result;

        _llvmKernel->_statements.push_back( ir::LLVMStatement( add ) );
    }
    else
    {
        // ... (remaining cases elided on the slide)
    }
}

• Translate each PTX instruction to an LLVM IR instruction sequence
• Special PTX registers and instructions are mapped to LLVM intrinsics:
  • llvm.readcyclecounter()
  • llvm.sqrt.f32()
• The result is an LLVM function implementing the PTX kernel
• Should be invertible if coupled to an LLVM->PTX code generator (not implemented)

Page 9: Ocelot: supported devices

Thread Serialization

• Thread loops
• Enter next executable region via scheduler block
• Barriers: store live values into thread-local memory, return to thread scheduler
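The three points above can be sketched in a few lines. This is an illustration, not Ocelot's generated code: `ThreadContext`, `runRegion`, `runCta`, and the toy kernel are made up for the example. The kernel is split at its single barrier into two regions, a `switch` plays the role of the scheduler block, and the one value that is live across the barrier is spilled into the per-thread context before control returns to the scheduler.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-thread context kept by the scheduler.
struct ThreadContext {
    int resumePoint = 0;  // region to enter next; -1 once the thread has exited
    int live = 0;         // value live across the barrier, spilled here
};

// Toy kernel split at its barrier:
//   region 0: shared[tid] = tid + 1; live = tid * 10; barrier
//   region 1: out[tid] = live + shared[(tid + 1) % n]
void runRegion(int tid, ThreadContext& ctx, std::vector<int>& shared, std::vector<int>& out) {
    const int n = static_cast<int>(shared.size());
    switch (ctx.resumePoint) {  // scheduler block: jump to the next region
    case 0:
        shared[tid] = tid + 1;
        ctx.live = tid * 10;    // spill the live value before yielding
        ctx.resumePoint = 1;
        return;                 // barrier reached: back to the scheduler
    case 1:
        out[tid] = ctx.live + shared[(tid + 1) % n];
        ctx.resumePoint = -1;   // thread exits
        return;
    }
}

// Thread loop: run every live thread up to its next barrier, then repeat.
std::vector<int> runCta(int n) {
    std::vector<ThreadContext> ctx(n);
    std::vector<int> shared(n), out(n);
    bool anyAlive = true;
    while (anyAlive) {
        anyAlive = false;
        for (int tid = 0; tid < n; ++tid) {
            if (ctx[tid].resumePoint < 0) continue;
            runRegion(tid, ctx[tid], shared, out);
            anyAlive = anyAlive || ctx[tid].resumePoint >= 0;
        }
    }
    return out;
}
```

Because every thread finishes region 0 before any thread enters region 1, the read of a neighbour's `shared` slot sees the value written before the barrier, which is the guarantee the barrier provides on a GPU.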

Page 10: Ocelot: supported devices

Using the Multicore Backend

Edit configure.ocelot

executive:

• devices: llvm – efficient execution of PTX on multicore CPU
• optimizationLevel – basic, none, full, memory, debug
• workerThreadLimit – number of worker threads

optimizations:

• subkernelSize – size of subkernels in instructions
• simplifyCFG – whether to apply CFG simplification pass
• hoistSpecialValues – whether to load LLVMContext values at launch of kernel

executive: {
    devices: [ llvm ],
    asynchronousKernelLaunch: true,
    optimizationLevel: none,
    workerThreadLimit: 1,
    warpSize: 1
},
optimizations: {
    subkernelSize: 1000,
    simplifyCFG: true,
    hoistSpecialValues: true
},

Page 11: Ocelot: supported devices

Overview

• Ocelot PTX Emulator
• Multicore Backend
• NVIDIA GPU Backend
• AMD GPU Backend

Page 12: Ocelot: supported devices

NVIDIA GPU: Introduction

• Executes PTX kernels on GPUs via the CUDA Driver API
• Thin layer on top of the CUDA Driver API
• Ocelot enables rewriting of PTX kernels
  • Register reallocation
  • Runtime optimizations
  • Instrumentation

Page 13: Ocelot: supported devices

Ocelot Source Code: NVIDIA GPU Device Backend

• ocelot/
  • executive/
    • interface/NVIDIAGPUDevice.h
    • interface/NVIDIAExecutableKernel.h

Page 14: Ocelot: supported devices

Using the NVIDIA GPU Backend

Edit configure.ocelot

executive:

• devices: nvidia – invokes NVIDIA GPU backend

executive: {
    devices: [ nvidia ],
},

Page 15: Ocelot: supported devices

Dynamic Instrumentation

• Run-time generation of user-defined, custom instrumentation code for CUDA kernels
• Harness chip-level instrumentation when possible
• Instrumentation data to drive
  • Off-line workload characterization
  • On-line debugging & program optimization
  • On-line resource management
• Inspired in part by the PIN [1] infrastructure

Naila Farooqui, Andrew Kerr, Greg Eisenhauer, Karsten Schwan, Sudhakar Yalamanchili. Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures. ISPASS. April 2012.

PhD student: Naila Farooqui, joint with K. Schwan and A. Gavrilovska

[1] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. “Pin: building customized program analysis tools with dynamic instrumentation,” PLDI '05

Page 16: Ocelot: supported devices

Instrumentation Support in Ocelot

• High-level C constructs to define instrumentation + (C-to-PTX) JIT
• Integration with system management software and dynamic compiler
  • Online resource management based on profiling
• Additional Instrumentor APIs to provide criteria for instrumentation
  • Selectively perform instrumentation on kernels

Page 17: Ocelot: supported devices

Custom Instrumentation

• Transparent profiling and characterization of library implementations

[Figure: instrumentation tool flow. Labels: nvcc; PTX; Ocelot Run Time; CUDA; Libraries; Instrumentation APIs; Instrumentor; C-on-Demand JIT; C-PTX Translator; PTX-PTX Transformer; Lynx; Example Instrumentation Code]

Page 18: Ocelot: supported devices


Instrumentation: Instruction count

* Scan (CUDA SDK)
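As a rough illustration of what an instruction-count instrumentation measures, the sketch below adds each basic block's static instruction count to a per-thread counter whenever that thread enters the block. It does not use the Lynx API: the block sizes, the toy control flow, and `countInstructions` are assumptions made for the example, and the loop over `tid` merely simulates the threads of a kernel on the host.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical static instruction counts of three basic blocks.
constexpr std::uint64_t kEntryBlock = 3;
constexpr std::uint64_t kLoopBlock = 5;
constexpr std::uint64_t kExitBlock = 2;

// Simulated kernel whose loop trip count depends on the thread index.
// The "counter[tid] +=" lines stand for the code an instrumentor would
// insert at the top of each basic block.
std::vector<std::uint64_t> countInstructions(int threads) {
    std::vector<std::uint64_t> counter(threads, 0);
    for (int tid = 0; tid < threads; ++tid) {
        counter[tid] += kEntryBlock;      // inserted at the entry block
        for (int i = 0; i < tid; ++i) {
            counter[tid] += kLoopBlock;   // inserted at the loop body
        }
        counter[tid] += kExitBlock;       // inserted at the exit block
    }
    return counter;
}
```

Counting per basic block instead of per instruction keeps the inserted code to one update per block, and the differing totals across threads expose the load imbalance caused by the data-dependent loop.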

Page 19: Ocelot: supported devices

Remote Device Layer

• Remote procedure call layer for Ocelot device calls
• Execute local applications that run kernels remotely
• Multi-GPU applications can become multi-node

Page 20: Ocelot: supported devices

Switchable Compute

• Switch devices at runtime
• Load balancing
• Remote execution

Page 21: Ocelot: supported devices

Overview

• Ocelot PTX Emulator
• Multicore Backend
• NVIDIA GPU Backend
• AMD GPU Backend

Rodrigo Dominguez, Dana Schaa, and David Kaeli. “Caracal: Dynamic Translation of Runtime Environments for GPUs.” In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4.

Page 22: Ocelot: supported devices

AMD GPU Backend

• Executes PTX kernels on GPUs via the CAL Driver API
• Rewritten PTX kernels (for optimization, instrumentation, etc.) are also translated for the AMD backend
• Ocelot Device Interface:
  • Module registration
  • Memory management
    • Global/Shared/Constant/Parameter memory allocation
  • Kernel launches
    • Translation from PTX to IL
  • Texture management
  • OpenGL interoperability
  • Streams and Events

Rodrigo Dominguez, Dana Schaa, and David Kaeli. “Caracal: Dynamic Translation of Runtime Environments for GPUs.” In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4.

Page 23: Ocelot: supported devices

AMD Evergreen Architecture

AMD Radeon HD 5870

• 20 SIMD cores
• 16 Stream Cores (SC) per SIMD core
• Each SC is VLIW-5
• A total of 1600 ALUs
• Wavefronts of 64 threads
• Peak is 2.72 TFLOPS (SP) and 544 GFLOPS (DP)
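The ALU total and the peak rates follow from the other figures on this slide once a clock rate is fixed. The sketch below reproduces the arithmetic. The 850 MHz engine clock is an assumption that does not appear on the slide, as is the convention of counting a fused multiply-add as two floating-point operations. The double-precision line assumes one DP FMA per stream core per cycle.

```cpp
#include <cassert>

// Figures taken from the slide.
constexpr int kSimdCores = 20;
constexpr int kStreamCoresPerSimd = 16;
constexpr int kAlusPerStreamCore = 5;  // VLIW-5

// Assumptions (not on the slide): 850 MHz engine clock, and a fused
// multiply-add counted as two floating-point operations.
constexpr double kClockGHz = 0.85;
constexpr double kFlopsPerFma = 2.0;

constexpr int streamCores() { return kSimdCores * kStreamCoresPerSimd; }
constexpr int totalAlus() { return streamCores() * kAlusPerStreamCore; }

// Single precision: every ALU issues one FMA per cycle.
constexpr double peakSpGflops() { return totalAlus() * kFlopsPerFma * kClockGHz; }

// Double precision: one DP FMA per stream core per cycle.
constexpr double peakDpGflops() { return streamCores() * kFlopsPerFma * kClockGHz; }

static_assert(totalAlus() == 1600, "matches the ALU total on the slide");
```

Under these assumptions the single-precision figure is 1600 x 2 x 0.85 = 2720 GFLOPS and the double-precision figure is 320 x 2 x 0.85 = 544 GFLOPS, which agree with the peak rates quoted above.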

Page 24: Ocelot: supported devices

AMD Evergreen Architecture

[Figure: one SIMD core and one stream core. Labels: General Purpose Registers; Processing Elements; T-Processing Element; Branch Execution Unit; Instruction and Control Flow. Source: AMD OpenCL University Kit]

Each Stream Core includes:

• 4 Processing Elements
  • 4 independent SP or integer operations
  • 2 DP operations
  • 1 DP fma or mult operation
• 1 Special Function Unit
  • 1 SP or integer operation
  • SP or DP transcendental
• Branch Execution Unit
• GPR = 5.24 MB

Page 25: Ocelot: supported devices

AMD Evergreen Architecture

• Local Data Share
  • 2 TB/s
  • 32 KB per SIMD
• Global Data Share
  • Shared between all threads in a kernel
  • Low latency global reductions
• L1 (8 KB)
• L2 (512 KB, 450 GB/s)
• Global Memory (GDDR5, 153 GB/s)

Page 26: Ocelot: supported devices

Translation from PTX to IL

PTX

• RISC style syntax
• Load-Store instruction set
• Registers are typed and scalar
• Unlimited virtual registers
• Predicate registers
• Control flow based on branches and labels
• Designed for compute (GPGPU)

.entry vecAdd (
    .param .u64 A,
    .param .u64 B,
    .param .u64 C,
    .param .s32 N)
{
    mov.u16 rh1, ctaid.x;
    mov.u16 rh2, ntid.x;
    mul.wide.u16 r1, rh1, rh2;
    cvt.u32.u16 r2, tid.x;
    add.u32 r3, r2, r1;
    ld.param.s32 r4, [N];
    setp.le.s32 p1, r4, r3;
    @p1 bra Label_1;
    ...
}

Page 27: Ocelot: supported devices

Translation from PTX to IL

IL

• Registers are 32-bit, 4-component vectors
• Registers have no type
• Swizzles
• Resources are globally scoped
• Structured control flow (if-end, while-end)
• Designed for graphics, not compute (see FSAIL)

il_cs_2_0
dcl_raw_uav_id(0)
dcl_cb cb0[2]
dcl_cb cb1[4]
dcl_literal l0, 4, 4, 4, 4
mov r0.x, vThreadGrpId.x
mov r1.x, cb0[0].x
imul r2.x, r0.x, r1.x
mov r3.x, vTidInGrp.x
iadd r4.x, r3.x, r2.x
mov r5.x, cb1[3].x
ige r6.x, r4.x, r5.x
if_logicalz r6.x
...
endif
end
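The two listings differ most visibly in their control flow: PTX branches over the kernel body when the bounds check fails, while IL wraps the body in a structured if with the complementary test. The sketch below mirrors both shapes in C++ to show that they compute the same thing. The function names are made up, and the body, which is elided as "..." on both slides, is assumed here to be the element-wise add that the kernel name vecAdd suggests.

```cpp
#include <cassert>
#include <vector>

// PTX shape: a guarded branch jumps over the body.
void vecAddBranch(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, int n, int idx) {
    const bool p1 = (n <= idx);  // setp.le.s32 p1, r4, r3
    if (p1) goto Label_1;        // @p1 bra Label_1
    c[idx] = a[idx] + b[idx];    // assumed body (elided on the slide)
Label_1:
    return;
}

// IL shape: the body sits inside a structured if that runs when the
// comparison result is zero.
void vecAddStructured(const std::vector<float>& a, const std::vector<float>& b,
                      std::vector<float>& c, int n, int idx) {
    const bool r6 = (idx >= n);  // ige r6.x, r4.x, r5.x
    if (!r6) {                   // if_logicalz r6.x
        c[idx] = a[idx] + b[idx];  // assumed body (elided on the slide)
    }                            // endif
}
```

A single forward branch over a block is the easy case for this conversion; branches that cannot be nested this way are what the structural analysis and transform passes listed for the AMD backend have to handle.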

Page 28: Ocelot: supported devices

AMD GPU Backend

• Validated on over 30 applications from the CUDA SDK
• Support for pre-compiled libraries
• Device selection can be made at runtime
• What is supported?
  • Global memory (cudaMalloc, cudaMemcpy)
  • Shared memory (including extern)
  • Constant memory
  • Atomics (global and shared)
  • Barriers and Fences
  • 30+ PTX instructions

Rodrigo Dominguez, Dana Schaa, and David Kaeli. “Caracal: Dynamic Translation of Runtime Environments for GPUs.” In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4.

Page 29: Ocelot: supported devices

Ocelot Source Code: AMD GPU Device Backend

• ocelot/
  • analysis/
    • interface/StructuralAnalysis.h
  • executive/
    • interface/ATIGPUDevice.h
    • interface/ATIExecutableKernel.h
  • transforms/
    • interface/StructuralTransform.h

Page 30: Ocelot: supported devices

Using the AMD GPU Backend

Edit configure.ocelot

executive:

• devices: amd – invokes AMD GPU backend

executive: {
    devices: [ amd ],
},

Page 31: Ocelot: supported devices

Unstructured to Structured Control Flow*

• Handling branch divergence is key to high performance on GPUs
  • Its impact differs depending on whether the control flow is structured or unstructured
• Not all GPUs support unstructured CFGs directly
  • Using dynamic translation to support AMD GPUs**

* H. Wu, G. Diamos, S. Li, S. Yalamanchili. Characterization and Transformation of Unstructured Control Flow in GPU Applications. CACHES. 2011.

** R. Dominguez, D. Schaa, and D. Kaeli. Caracal: Dynamic translation of runtime environments for GPUs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 5–11. ACM, 2011.