7/29/2019 Leone Sig Graph 2011
Native shader compilation with LLVM
Mark Leone
-
Why compile shaders?
RenderMan's SIMD interpreter is hard to beat.
Amortizes interpretive overhead over batches of points.
Shading is dominated by floating point calculations.
-
SIMD interpreter
For each instruction in shader:
    Decode and dispatch instruction.
    For each point in batch:
        If runflag is on:
            Load operands.
            Compute.
            Store result.
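The loop structure above can be sketched in C++ roughly as follows. This is a minimal illustration, not RenderMan's implementation: the two-operation instruction set and the names `Op`, `Instr`, `Batch`, and `interpret` are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical two-operation instruction set, for illustration only.
enum class Op { Add, Mul };

struct Instr {
    Op op;
    std::size_t dst, lhs, rhs;  // indices into the register file
};

// One "register" holds a value for every point in the batch.
using Batch = std::vector<float>;

void interpret(const std::vector<Instr>& shader,
               std::vector<Batch>& regs,
               const std::vector<bool>& runflag)
{
    const std::size_t n = runflag.size();
    for (const Instr& in : shader) {
        assert(in.dst < regs.size() && in.lhs < regs.size() && in.rhs < regs.size());
        Batch& dst = regs[in.dst];
        const Batch& a = regs[in.lhs];
        const Batch& b = regs[in.rhs];
        switch (in.op) {  // decode and dispatch once per instruction
        case Op::Add:
            for (std::size_t i = 0; i < n; ++i)
                if (runflag[i]) dst[i] = a[i] + b[i];  // load, compute, store
            break;
        case Op::Mul:
            for (std::size_t i = 0; i < n; ++i)
                if (runflag[i]) dst[i] = a[i] * b[i];  // load, compute, store
            break;
        }
    }
}
```

Every intermediate value goes through `regs`, which lives in memory. The dispatch cost is paid once per instruction, however many points the batch holds.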
-
SIMD interpreter: example inner loop
-
SIMD interpreter: benefits
Interpretive overhead is amortized (if batch is large).
Uniform operations can be executed once per batch.
Derivatives are easy: neighboring values are always ready.
-
SIMD interpreter: drawbacks
Low compute density, poor instruction-level parallelism
Load, compute, store, repeat.
Poor locality, high memory traffic
Intermediate results are stored in memory, not registers.
High overhead for small batches
Difficult to vectorize (pointers and conditionals).
-
Compiled shader execution
For each point in batch:
    Load inputs.
    For each instruction in shader:
        Compute.
    Store outputs.
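In compiled form the loops are inverted: the outer loop runs over points and the shader body becomes straight-line code. A minimal sketch, assuming a hypothetical shader that computes `out = a * b + a` (the function name and the shader itself are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical compiled shader body: out = a * b + a.
void shade_compiled(const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& out)
{
    assert(a.size() == b.size() && out.size() == a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float ai = a[i];    // load inputs
        const float bi = b[i];
        const float t = ai * bi;  // intermediate is a local, so it can stay in a register
        out[i] = t + ai;          // store outputs
    }
}
```

There is no per-instruction dispatch left, and the only memory traffic is for the shader's inputs and outputs.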
-
Benefits of native compilation
Eliminates interpretive overhead. Good for small batches.
Good locality and register utilization.
Intermediate results are stored in registers, not memory.
Good instruction-level parallelism.
Instruction scheduling avoids pipeline stalls.
Vectorizes easily.
-
Issues: batch shading
Use vectorized shaders on small batches.
Uniform operations: once per grid, not once per point.
Some are very expensive (e.g. plugin calls).
Derivatives: need "previously" computed values from neighboring points.
RSL permits derivatives of arbitrary expressions.
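To illustrate why derivatives constrain evaluation order, here is a minimal finite-difference sketch. The function name `du`, the row-major grid layout, and the unit spacing between neighbors are assumptions made for the example, not details from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Difference along u over a row-major grid with `width` points per row.
// Unit spacing between neighbors is an assumption of this sketch.
std::vector<float> du(const std::vector<float>& values, std::size_t width)
{
    assert(width >= 2 && values.size() % width == 0);
    std::vector<float> result(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        const bool last_in_row = (i % width == width - 1);
        // Forward difference, except at the end of a row where we look backward.
        result[i] = last_in_row ? values[i] - values[i - 1]
                                : values[i + 1] - values[i];
    }
    return result;
}
```

The expression being differentiated must already be evaluated at every neighbor before `du` runs, so a compiled shader cannot simply run each point to completion on its own.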
-
Why vectorize?
Consider batch execution of a compiled shader:
For each point in batch:
    Load inputs.
    For each instruction in shader:
        Compute.
    Store outputs.
-
Why vectorize?
Consider batch execution of a vectorized shader:
For each block of 4 or 8 points in batch:
    Load inputs.
    For each instruction in shader:
        Compute on vector registers (with mask).
    Store outputs.
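A portable sketch of the vectorized loop, assuming a block width of four. `Float4` is a stand-in for a vector register, and `mul4`, `add4`, and `shade_vectorized` are hypothetical names; a real back end would emit SSE or AVX instructions rather than these lane loops.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWidth = 4;          // assumed width: one SSE register holds 4 floats
using Float4 = std::array<float, kWidth>;  // portable stand-in for a vector register

Float4 mul4(const Float4& a, const Float4& b)
{
    Float4 r{};
    for (std::size_t l = 0; l < kWidth; ++l) r[l] = a[l] * b[l];
    return r;
}

Float4 add4(const Float4& a, const Float4& b)
{
    Float4 r{};
    for (std::size_t l = 0; l < kWidth; ++l) r[l] = a[l] + b[l];
    return r;
}

// Hypothetical shader body: out = a * b + a, written only on active lanes.
void shade_vectorized(const float* a, const float* b, const bool* active,
                      float* out, std::size_t n)
{
    assert(n % kWidth == 0);  // sketch only: no remainder loop
    for (std::size_t base = 0; base < n; base += kWidth) {
        Float4 va{}, vb{};
        for (std::size_t l = 0; l < kWidth; ++l) {  // load inputs
            va[l] = a[base + l];
            vb[l] = b[base + l];
        }
        const Float4 r = add4(mul4(va, vb), va);    // compute on whole vectors
        for (std::size_t l = 0; l < kWidth; ++l)    // store outputs under the mask
            if (active[base + l]) out[base + l] = r[l];
    }
}
```

All four lanes are computed whether or not they are active; the mask only decides which results are kept.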
-
Simple vector code generation
Consider using SSE instructions only for vectors and matrices:
float dot(vector v1, vector v2) {
    vector v0 = v1 * v2;
    return v0.x + v0.y + v0.z;
}
load4 r1, [v1]
load4 r2, [v2]
mult4 r3, r1, r2
move r0, r3.x
add r0, r3.y
add r0, r3.z
-
Shader vectorization
To vectorize, first scalarize:
float dot(vector v1, vector v2) {
    vector v0 = v1 * v2;
    return v0.x + v0.y + v0.z;
}
becomes
float dot(vector v1, vector v2) {
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}
-
Scalar code generation
Next, generate ordinary scalar code:
float dot(vector v1, vector v2) {
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}
load r1, [v1.x]
load r2, [v2.x]
mult r0, r1, r2
load r1, [v1.y]
load r2, [v2.y]
mult r3, r1, r2
add r0, r0, r3
load r1, [v1.z]
load r2, [v2.z]
mult r3, r1, r2
add r0, r0, r3
-
Vectorize for batch of four
Finally, widen each instruction for a batch size of four:
float dot(vector v1, vector v2) {
    float x = v1.x * v2.x;
    float y = v1.y * v2.y;
    float z = v1.z * v2.z;
    return x + y + z;
}
load4 r1, [v1.x]
load4 r2, [v2.x]
mult4 r0, r1, r2
load4 r1, [v1.y]
load4 r2, [v2.y]
mult4 r3, r1, r2
add4 r0, r0, r3
load4 r1, [v1.z]
load4 r2, [v2.z]
mult4 r3, r1, r2
add4 r0, r0, r3
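The widened instruction sequence maps onto C++ roughly as below. This is a sketch under assumptions: `Float4` stands in for a four-wide register, and `VectorBatch4`, `mult4`, `add4`, and `dot4` are hypothetical names chosen to mirror the pseudo-assembly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Float4 = std::array<float, 4>;  // stand-in for a four-wide vector register

// Four 3-vectors stored component by component (hypothetical layout).
struct VectorBatch4 {
    Float4 x, y, z;
};

Float4 mult4(const Float4& a, const Float4& b)
{
    Float4 r{};
    for (std::size_t l = 0; l < r.size(); ++l) r[l] = a[l] * b[l];
    return r;
}

Float4 add4(const Float4& a, const Float4& b)
{
    Float4 r{};
    for (std::size_t l = 0; l < r.size(); ++l) r[l] = a[l] + b[l];
    return r;
}

// Four dot products at once: each statement is one widened instruction.
Float4 dot4(const VectorBatch4& v1, const VectorBatch4& v2)
{
    Float4 r0 = mult4(v1.x, v2.x);
    Float4 r3 = mult4(v1.y, v2.y);
    r0 = add4(r0, r3);
    r3 = mult4(v1.z, v2.z);
    return add4(r0, r3);
}
```

Each lane belongs to a different point, so no horizontal add across a register is needed, unlike the approach that packs x, y, and z of a single vector into one register.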
-
Struct of arrays (SOA)
Normally a batch of vectors is an array of structs (AOS):
x y z x y z x y z x y z . . .
Vector load instructions (in SSE) require contiguous data.
Store batch of vectors as a struct of arrays (SOA):
x x x x . . . y y y y . . . z z z z . . .
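The two layouts can be sketched as C++ types with a conversion between them. `Vec3`, `Vec3SoA`, and `to_soa` are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };  // one element of an array of structs (AOS)

struct Vec3SoA {                 // struct of arrays (SOA): one array per component
    std::vector<float> x, y, z;
};

Vec3SoA to_soa(const std::vector<Vec3>& aos)
{
    Vec3SoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const Vec3& v : aos) {  // gather each component into its own contiguous array
        soa.x.push_back(v.x);
        soa.y.push_back(v.y);
        soa.z.push_back(v.z);
    }
    assert(soa.x.size() == aos.size());
    return soa;
}
```

After conversion, four consecutive x values sit next to each other in memory, which is what a four-wide vector load needs.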
-
Masking / blending
Use a mask to avoid clobbering components of registers used by the other branch.
No masking in SSE. Use variable blend in SSE4:
blend(a, b, mask) {
    return (a & mask) | (b & ~mask)
}
No need to blend each instruction.
Blend at basic block boundaries (at phi nodes in SSA).
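A common formulation of a bitwise blend selects `a` where the mask bits are set and `b` elsewhere. The sketch below applies it to a single 32-bit lane; `blend_bits` and `blend` are hypothetical names, and the code assumes `float` is 32 bits wide.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Bitwise select: bits of a where mask is 1, bits of b where mask is 0.
std::uint32_t blend_bits(std::uint32_t a, std::uint32_t b, std::uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

// One-lane float blend built on the bitwise select.
float blend(float a, float b, bool take_a)
{
    std::uint32_t ua = 0, ub = 0;
    std::memcpy(&ua, &a, sizeof ua);  // reinterpret the float bits without undefined behavior
    std::memcpy(&ub, &b, sizeof ub);
    const std::uint32_t mask = take_a ? 0xFFFFFFFFu : 0u;
    const std::uint32_t ur = blend_bits(ua, ub, mask);
    float r = 0.0f;
    std::memcpy(&r, &ur, sizeof r);
    return r;
}
```

In a vectorized shader, both branches of a conditional write to separate temporaries and one blend per live variable merges them where the branches rejoin.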
-
Vectorization: recent work
ispc: Intel SPMD program compiler (Matt Pharr)
Beyond Programmable Shading course, SIGGRAPH 201
Open source: ispc.github.com
Whole function vectorization in AnySL (Karrenberg et al.)
Code Generation and Optimization 2011
http://www.cdl.uni-saarland.de/projects/wfv/
http://www.cdl.uni-saarland.de/projects/anysl/
-
Film shading on GPUs
Previous work
LightSpeed (Ragan-Kelley et al. SIGGRAPH 2007)
RenderAnts (Zhou et al. SIGGRAPH Asia 2009)
Code generation is easier now (thanks CUDA, OpenCL).
PTX, AMD IL
LLVM and Clang
-
GPU code generation with LLVM
NVIDIA's LLVM to PTX code generator (Grover)
Not to be confused with PTX to LLVM front end (PL
Incomplete PTX support in llvm-trunk (Chiou)
Google Summer of Code project (Holewinski)
Experimental PTX back end for AnySL (Rhodin)
LLVM to AMD IL (Villmow)
http://llvm.org/devmtg/2010-11
https://sites.google.com/site/justinholewinski/projects/gsoc/llvm-ptx-back-end-2011
http://sourceforge.net/projects/llvmptxbackend/
-
Issues: GPU code generation
Film shaders interoperate with the renderer.
File I/O: textures, pointclouds, etc. (out of core).
Shader plugins (DSOs).
Sampling, ray tracing.
Answer: multi-pass partitioning (Riffel et al. GH 2004)
-
Partitioning
-
Multi-pass partitioning for CPU
Synchronize for GPU calls, uniform operations, derivatives.
Does not require hardware threads or locks.
A thread yields by returning (to a scheduler).
Intermediate data is stored in a cactus stack (Cilk) or continuation closures (CPS).
Data management and scheduling is a key problem (Budge et al. Eurographics 2009).
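One way to picture "a thread yields by returning" is a shader split into passes, each an ordinary function that runs up to a synchronization point and returns the index of the pass to resume at. The sketch below is an illustration under assumptions, not the design from the talk: `ShadeState`, the two passes, and the round-robin `run_all` scheduler are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

constexpr int kDone = -1;

// Intermediate data that must survive between passes (a tiny continuation closure).
struct ShadeState {
    float input;
    float partial;
    float result;
    int next_pass;  // index of the pass to resume at, or kDone
};

using Pass = int (*)(ShadeState&);

// Each pass runs up to a synchronization point, then yields by returning.
int pass0(ShadeState& s) { s.partial = s.input * 2.0f; return 1; }
int pass1(ShadeState& s) { s.result = s.partial + 1.0f; return kDone; }

const Pass kPasses[] = {pass0, pass1};

// Round-robin scheduler on a single thread; returns how many passes it ran.
std::size_t run_all(std::vector<ShadeState>& tasks)
{
    std::deque<std::size_t> ready;
    for (std::size_t i = 0; i < tasks.size(); ++i) ready.push_back(i);
    std::size_t steps = 0;
    while (!ready.empty()) {
        const std::size_t i = ready.front();
        ready.pop_front();
        ShadeState& s = tasks[i];
        assert(s.next_pass >= 0);
        s.next_pass = kPasses[s.next_pass](s);         // resume the task at its saved pass
        ++steps;
        if (s.next_pass != kDone) ready.push_back(i);  // not finished: requeue
    }
    return steps;
}
```

No hardware threads or locks are involved. Everything a task needs after a yield lives in its `ShadeState`, which is why data management becomes the central problem.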
-
Issues: summary
CPU code generation (perhaps JIT)
Vectorization
GPU code generation
Multi-pass partitioning
-
Introduction to LLVM
Mid-level intermediate representation (IR)
High-level types: structs, arrays, vectors, functions.
Control-flow graph: basic blocks with branches
Many modular analysis and optimization passes.
Code generation for x86, x64, ARM, ...
Just-in-time (JIT) compiler too.
-
Example: from C to LLVM IR
define float @sqrt(float %f) {
entry:
  %0 = fcmp ogt float %f, 0.0
  br i1 %0, label %bb1, label %bb2
  ...
}
-
Example: from C to LLVM IR
define float @sqrt(float %f) {
entry:
  %0 = fcmp ogt float %f, 0.0
  br i1 %0, label %bb1, label %bb2
bb1:
  %1 = call float @fabsf(float %f)
  ret float %1
bb2:
  ret float 0.0
}
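The C source for this slide did not survive in the transcript. A source-level function that plausibly corresponds to the IR above, reconstructed from the IR alone, might look like the following. The name `sqrt_example` is hypothetical, chosen to avoid clashing with the standard library's `sqrt`.

```cpp
#include <cassert>
#include <cmath>

// Reconstructed from the IR, not taken from the original slide.
float sqrt_example(float f)
{
    if (f > 0.0f)             // entry: fcmp ogt, then conditional br
        return std::fabs(f);  // bb1: call to fabsf, then ret
    return 0.0f;              // bb2: ret 0.0
}
```

Each basic block in the IR ends in exactly one terminator (`br` or `ret`), which is why the `if` turns into three labeled blocks.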
-
Example: from C to LLVM IR
define void @foo(i32 %x, i32 %y) {
  %z = alloca i32
  %1 = add i32 %y, %x
  store i32 %1, i32* %z
  ...
}
-
Writing a simple code generator
-
Advantages of LLVM
Well designed intermediate representation (IR).
Wide range of optimizations (configurable).
JIT code generation.
Interoperability.
-
Interoperability
Shaders can call out to renderer via C ABI.
We can inline library code into compiled shaders.
Compile C++ to LLVM IR with Clang.
This greatly simplifies code generation.
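A minimal sketch of the C ABI boundary, assuming a hypothetical renderer entry point. `renderer_scale_sample`, `RendererFn`, and `shader_body` are invented for illustration; `shader_body` stands in for generated shader code.

```cpp
#include <cassert>

// Hypothetical renderer entry point. extern "C" gives it an unmangled symbol
// name and C linkage, so generated code can refer to it by name.
extern "C" float renderer_scale_sample(float value, float gain)
{
    return value * gain;
}

// Generated code only needs to know the C signature of the callee.
using RendererFn = float (*)(float, float);

// Stand-in for a compiled shader that calls back into the renderer.
float shader_body(RendererFn call_renderer, float s)
{
    assert(call_renderer != nullptr);
    return call_renderer(s, 2.0f) + 1.0f;
}
```

If the same library function is also available as LLVM IR (for example, compiled with Clang), the optimizer can inline it into the shader instead of emitting a call.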
-
Weaknesses of LLVM
No automatic vectorization.
Poor support for vector-oriented code generation.
No predication.
Few vector instructions, must resort to SSE/AVX intrinsics.
-
LLVM resources
www.llvm.org/docs
Language Reference Manual: http://www.llvm.org/docs/LangRef.html
Getting Started Guide: http://www.llvm.org/docs/GettingStarted.html
LLVM Tutorial (section 3): http://www.llvm.org/docs/tutorial/LangImpl3.html
Relevant open source projects
ispc.github.com
github.com/MarkLeone/PostHaste
-
Questions?
Mark Leone