leone sig graph 2011

7/29/2019 Leone Sig Graph 2011

1/37

Native shader compilation

with LLVM

Mark Leone


2/37

Why compile shaders?

RenderMan's SIMD interpreter is hard to beat.

Amortizes interpretive overhead over batches of points.

Shading is dominated by floating point calculations.


3/37

SIMD interpreter

For each instruction in shader:
    Decode and dispatch instruction.
    For each point in batch:
        If run flag is on:
            Load operands.
            Compute.
            Store result.
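
The loop structure above can be sketched in C++ roughly as follows. This is a minimal illustration, not RenderMan's implementation: the `Op` enum, the `Instr` layout, and the register file of per-point arrays are all assumptions made up for the example. It shows where the decode/dispatch cost is paid once per batch and where the per-point load, compute, store happens.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical opcodes and instruction layout, for illustration only.
enum class Op { Add, Mul };

struct Instr {
    Op op;
    int dst, src1, src2;  // indices into the register file
};

// Each register holds one float per point in the batch.
using Register = std::vector<float>;

void interpret(const std::vector<Instr>& shader,
               std::vector<Register>& regs,
               const std::vector<bool>& runflags) {
    const std::size_t n = runflags.size();
    for (const Instr& in : shader) {            // for each instruction in shader
        Register& dst = regs[in.dst];
        const Register& a = regs[in.src1];
        const Register& b = regs[in.src2];
        assert(dst.size() == n && a.size() == n && b.size() == n);
        switch (in.op) {                        // decode and dispatch, once per batch
        case Op::Add:
            for (std::size_t p = 0; p < n; ++p) // for each point in batch
                if (runflags[p])                // if run flag is on
                    dst[p] = a[p] + b[p];       // load operands, compute, store result
            break;
        case Op::Mul:
            for (std::size_t p = 0; p < n; ++p)
                if (runflags[p])
                    dst[p] = a[p] * b[p];
            break;
        }
    }
}
```

Every intermediate value lives in a `Register` array in memory, which is the memory traffic the later slides count as a drawback.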


4/37

SIMD interpreter: example inner loop


5/37

SIMD interpreter: benefits

Interpretive overhead is amortized (if batch is large).

Uniform operations can be executed once per batch.

Derivatives are easy: neighboring values are always ready.


6/37

SIMD interpreter: drawbacks

Low compute density, poor instruction-level parallelism


7/37

SIMD interpreter: example inner loop


8/37

SIMD interpreter: drawbacks

Low compute density, poor instruction-level parallelism

Load, compute, store, repeat.

Poor locality, high memory traffic

Intermediate results are stored in memory, not registers.

High overhead for small batches

Difficult to vectorize (pointers and conditionals).


9/37

Compiled shader execution


Load inputs.


Compute.

Store outputs.
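
As a rough sketch of what compiled execution looks like for a batch, the function below runs a made-up shader body (`out = a * b + a`) once per point. The shader body and the function name are assumptions for illustration; the point is only that the whole shader runs for one point before moving to the next, so intermediates can stay in locals that a compiler will keep in registers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shader body: out = a * b + a, standing in for the
// straight-line code a native compiler might emit for one point.
void shade_compiled(const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    for (std::size_t p = 0; p < a.size(); ++p) {
        float x = a[p];   // load inputs
        float y = b[p];
        float t = x * y;  // compute: intermediates stay in locals
        float r = t + x;
        out[p] = r;       // store outputs
    }
}
```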


10/37

Benefits of native compilation

Eliminates interpretive overhead. Good for small batches.

Good locality and register utilization.

Intermediate results are stored in registers, not memory.

Good instruction-level parallelism.

Instruction scheduling avoids pipeline stalls.

Vectorizes easily.


11/37

Issues: batch shading

Use vectorized shaders on small batches.

Uniform operations: once per grid, not once per point.

Some are very expensive (e.g. plugin calls).

Derivatives: need "previously" computed values from neighboring points.

RSL permits derivatives of arbitrary expressions.
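
To make the derivative issue concrete, here is a small sketch assuming a forward difference along one row of grid points (the function name `du` and the choice of difference scheme are assumptions, not something the slides specify). The expression being differentiated must already be evaluated at every point in the row before any derivative can be formed, which is why batch shading needs a synchronization point here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward difference of an already-evaluated expression along one row
// of grid points. The last point reuses its neighbor's difference.
std::vector<float> du(const std::vector<float>& values) {
    assert(values.size() >= 2);
    const std::size_t n = values.size();
    std::vector<float> d(n);
    for (std::size_t i = 0; i + 1 < n; ++i)
        d[i] = values[i + 1] - values[i];  // needs the neighbor's value
    d[n - 1] = d[n - 2];
    return d;
}
```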


12/37

Why vectorize?

Consider batch execution of a compiled shader:


Load inputs.


Compute.

Store outputs.


13/37

Why vectorize?

Consider batch execution of a vectorized shader:

For each block of 4 or 8 points in batch:

Load inputs.


Compute on vector registers (with mask).

Store outputs.
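
The blocked loop can be sketched portably as below. `Float4` is a hypothetical stand-in for a vector register, the shader body (`out = a * b + a`) is made up, and a real implementation would use SSE or AVX instructions rather than lane loops. The sketch assumes the batch is padded to a multiple of four.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kWidth = 4;           // assumed vector width
using Float4 = std::array<float, kWidth>;   // hypothetical stand-in for a vector register

void shade_vectorized(const std::vector<float>& a,
                      const std::vector<float>& b,
                      const std::vector<bool>& active,
                      std::vector<float>& out) {
    assert(a.size() % kWidth == 0);  // this sketch assumes padded batches
    assert(b.size() == a.size() && active.size() == a.size() && out.size() == a.size());
    for (std::size_t base = 0; base < a.size(); base += kWidth) {
        Float4 x, y, r;
        for (std::size_t l = 0; l < kWidth; ++l) {  // load inputs
            x[l] = a[base + l];
            y[l] = b[base + l];
        }
        for (std::size_t l = 0; l < kWidth; ++l)    // compute on all lanes
            r[l] = x[l] * y[l] + x[l];
        for (std::size_t l = 0; l < kWidth; ++l)    // store outputs under the mask
            if (active[base + l])
                out[base + l] = r[l];
    }
}
```

Inactive lanes are still computed; the mask only controls which results are written back.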


14/37

Simple vector code generation

Consider using SSE instructions only for vectors and matrices:

float dot(vector v1, vector v2)
{
    vector v0 = v1 * v2;
    return v0.x + v0.y + v0.z;
}

load4 r1, [v1]
load4 r2, [v2]
mult4 r3, r1, r2
move  r0, r3.x
add   r0, r3.y
add   r0, r3.z


15/37

Shader vectorization

To vectorize, first scalarize:

float dot(vector v1, vector v2)
{
    vector v0 = v1 * v2;
    return v0.x + v0.y + v0.z;
}

becomes

float dot(vector v1, vector v2)
{
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}


16/37

Scalar code generation

Next, generate ordinary scalar code:

float dot(vector v1, vector v2)
{
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}

load r1, [v1.x]
load r2, [v2.x]
mult r0, r1, r2

load r1, [v1.y]
load r2, [v2.y]
mult r3, r1, r2
add  r0, r0, r3

load r1, [v1.z]
load r2, [v2.z]
mult r3, r1, r2
add  r0, r0, r3


17/37

Vectorize for batch of four

Finally, widen each instruction for a batch size of four:

float dot(vector v1, vector v2)
{
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}

load4 r1, [v1.x]
load4 r2, [v2.x]
mult4 r0, r1, r2

load4 r1, [v1.y]
load4 r2, [v2.y]
mult4 r3, r1, r2
add4  r0, r0, r3

load4 r1, [v1.z]
load4 r2, [v2.z]
mult4 r3, r1, r2
add4  r0, r0, r3
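
The widened instruction sequence can be mirrored in portable C++ as a sketch. `Float4`, `VectorBatch`, and the helper functions `load4`, `mult4`, and `add4` are hypothetical names chosen to echo the pseudo-assembly; they are lane loops here, whereas real generated code would use SSE instructions. The sketch assumes each component of the batch is stored contiguously so that four x values can be loaded together.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Float4 = std::array<float, 4>;  // hypothetical stand-in for one vector register

// A batch of vectors with each component stored contiguously (an assumption).
struct VectorBatch {
    std::vector<float> x, y, z;
};

Float4 load4(const std::vector<float>& src, std::size_t base) {
    assert(base + 4 <= src.size());
    return Float4{{src[base], src[base + 1], src[base + 2], src[base + 3]}};
}

Float4 mult4(const Float4& a, const Float4& b) {
    Float4 r;
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] * b[i];
    return r;
}

Float4 add4(const Float4& a, const Float4& b) {
    Float4 r;
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] + b[i];
    return r;
}

// Dot products for the four points starting at index base.
Float4 dot4(const VectorBatch& v1, const VectorBatch& v2, std::size_t base) {
    Float4 r0 = mult4(load4(v1.x, base), load4(v2.x, base));
    Float4 r3 = mult4(load4(v1.y, base), load4(v2.y, base));
    r0 = add4(r0, r3);
    r3 = mult4(load4(v1.z, base), load4(v2.z, base));
    return add4(r0, r3);
}
```

Each lane computes the same scalar program for a different point, so no horizontal operations across a register are needed.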


18/37

Struct of arrays (SOA)

Normally a batch of vectors is an array of structs (AOS):

x y z x y z x y z x y z . . .

Vector load instructions (in SSE) require contiguous data.

Store batch of vectors as a struct of arrays (SOA):

x x x x . . . y y y y . . . z z z z . . .
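
A minimal sketch of the two layouts and a conversion between them follows. The type names `Vec3` and `Vec3SOA` and the function `to_soa` are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };                  // one element of the AOS layout
struct Vec3SOA { std::vector<float> x, y, z; };  // hypothetical SOA container

// Transpose an array of structs into a struct of arrays so that
// consecutive x (or y, or z) values are contiguous in memory.
Vec3SOA to_soa(const std::vector<Vec3>& aos) {
    Vec3SOA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const Vec3& v : aos) {
        soa.x.push_back(v.x);
        soa.y.push_back(v.y);
        soa.z.push_back(v.z);
    }
    assert(soa.x.size() == aos.size());
    return soa;
}
```

With this layout, a four-wide load from `soa.x` picks up the x components of four consecutive points in one contiguous read.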


19/37

Masking / blending

Use a mask to avoid clobbering components of registers used by the other branch.

No masking in SSE. Use variable blend in SSE4:

blend(a, b, mask)
{
    return (a & mask) | (b & ~mask)
}

No need to blend each instruction.

Blend at basic block boundaries (at phi nodes in SSA).
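
The blend can be sketched on raw 32-bit lanes as below, assuming each mask lane is either all ones (take `a`) or all zeros (take `b`). `Bits4` is a hypothetical stand-in for a vector register viewed as bits; float lanes would be reinterpreted bitwise before blending.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bits4 = std::array<std::uint32_t, 4>;  // raw bits of four 32-bit lanes

// Per lane: select a where the mask is all ones, b where it is all zeros.
Bits4 blend(const Bits4& a, const Bits4& b, const Bits4& mask) {
    Bits4 r;
    for (std::size_t i = 0; i < 4; ++i) {
        assert(mask[i] == 0u || mask[i] == 0xFFFFFFFFu);
        r[i] = (a[i] & mask[i]) | (b[i] & ~mask[i]);
    }
    return r;
}
```

For an if/else, both branches run on all lanes and a single blend at the join point merges the two results, which matches placing blends at phi nodes rather than after every instruction.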


20/37

Vectorization: recent work

ispc: Intel SPMD program compiler (Matt Pharr)

Beyond Programmable Shading course, SIGGRAPH 201

Open source: ispc.github.com

Whole function vectorization in AnySL (Karrenberg et al.)

Code Generation and Optimization 2011

http://www.cdl.uni-saarland.de/projects/wfv/
http://www.cdl.uni-saarland.de/projects/anysl/


21/37

Film shading on GPUs

Previous work

LightSpeed (Ragan-Kelley et al. SIGGRAPH 2007)

RenderAnts (Zhou et al. SIGGRAPH Asia 2009)

Code generation is easier now (thanks CUDA, OpenCL)

PTX, AMD IL

LLVM and Clang


22/37

GPU code generation with LLVM

NVIDIA's LLVM to PTX code generator (Grover)

Not to be confused with PTX to LLVM front end (PL

Incomplete PTX support in llvm-trunk (Chiou)

Google summer of code project (Holewinski)

Experimental PTX back end for AnySL (Rhodin)

LLVM to AMD IL (Villmow)

http://sourceforge.net/projects/llvmptxbackend/
https://sites.google.com/site/justinholewinski/projects/gsoc/llvm-ptx-back-end-2011
http://llvm.org/devmtg/2010-11


23/37

Issues: GPU code generation

Film shaders interoperate with the renderer.

File I/O: textures, pointclouds, etc. (out of core).

Shader plugins (DSOs).

Sampling, ray tracing.

Answer: multi-pass partitioning (Riffel et al. GH 2004)


24/37

Partitioning


25/37

Multi-pass partitioning for CPU

Synchronize for GPU calls, uniform operations, derivatives.

Does not require hardware threads or locks.

A thread yields by returning (to a scheduler).

Intermediate data is stored in a cactus stack (Cilk) or continuation closures (CPS).

Data management and scheduling is a key problem (Budge et al. Eurographics 2009).
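
One way to picture "a thread yields by returning" is the continuation-closure sketch below. The `Scheduler` class, the `shade` function, and the two-pass shader body are all assumptions invented for illustration; this is not the talk's implementation. The first pass runs up to a synchronization point, packages the rest of the shader as a closure, and returns to the scheduler, which runs the closure later without any hardware threads or locks.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical cooperative scheduler: tasks run to completion and
// yield only by returning after enqueueing a continuation.
class Scheduler {
public:
    using Task = std::function<void(Scheduler&)>;
    void spawn(Task t) { ready_.push(std::move(t)); }
    void run() {
        while (!ready_.empty()) {
            Task t = std::move(ready_.front());
            ready_.pop();
            t(*this);
        }
    }
private:
    std::queue<Task> ready_;
};

// Made-up two-pass shader: square each point, synchronize for a
// uniform operation (a sum over the grid), then add it to each point.
void shade(Scheduler& sched, std::vector<float>& grid) {
    assert(!grid.empty());
    for (float& v : grid) v = v * v;       // pass 1: varying computation
    sched.spawn([&grid](Scheduler&) {      // continuation closure holds live state
        float sum = 0.0f;                  // uniform operation: once per grid
        for (float v : grid) sum += v;
        for (float& v : grid) v += sum;    // pass 2: varying computation
    });
    // Returning here is the yield; the scheduler decides what runs next.
}
```

A scheduler of this shape can interleave passes from many grids, which is where the data management and scheduling questions arise.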


26/37

Issues: summary

CPU code generation (perhaps JIT)

Vectorization

GPU code generation

Multi-pass partitioning


27/37

Introduction to LLVM

Mid-level intermediate representation (IR)

High-level types: structs, arrays, vectors, functions.

Control-flow graph: basic blocks with branches.

Many modular analysis and optimization passes.

Code generation for x86, x64, ARM, ...

Just-in-time (JIT) compiler too.


28/37

Example: from C to LLVM IR

define float @sqrt(float %f) {
entry:
  %0 = fcmp ogt float %f, 0.0
  br i1 %0, label %bb1, label %bb2
  ...
}


29/37


define float @sqrt(float %f) {
entry:
  %0 = fcmp ogt float %f, 0.0
  br i1 %0, label %bb1, label %bb2
bb1:
  %1 = call float @fabsf(float %f)
  ret float %1
bb2:
  ret float 0.0
}


30/37


define void @foo(i32 %x, i32 %y) {
  %z = alloca i32
  %1 = add i32 %y, %x
  store i32 %1, i32* %z
  ...
}


31/37

Writing a simple code generator


32/37

Writing a simple code generator



33/37

Advantages of LLVM

Well designed intermediate representation (IR).

Wide range of optimizations (configurable).

JIT code generation.

Interoperability.



34/37

Interoperability

Shaders can call out to renderer via C ABI.

We can inline library code into compiled shaders.

Compile C++ to LLVM IR with Clang.

This greatly simplifies code generation.



35/37

Weaknesses of LLVM

No automatic vectorization.

Poor support for vector-oriented code generation.

No predication.

Few vector instructions, must resort to SSE/AVX intrinsics.



36/37

LLVM resources

www.llvm.org/docs

Language Reference Manual (http://www.llvm.org/docs/LangRef.html)

Getting Started Guide (http://www.llvm.org/docs/GettingStarted.html)

LLVM Tutorial (section 3) (http://www.llvm.org/docs/tutorial/LangImpl3.html)

Relevant open source projects

ispc.github.com

github.com/MarkLeone/PostHaste


37/37

Questions?

Mark Leone
