SPU Render - Arseny Kapoulkine
Introduction
● Smash Cars 2 project
– Static scene of moderate size
– Many dynamic objects
– Multiple render passes
– Totals up to 3000 batches per frame
● PPU render up to 12 ms
– Target – 60 fps :(
Optimization techniques
● PPU code optimizations
– Has been done several times
– Would like PPU time to become ~0
● Static command buffers
– Somewhat restricted
– Culling is unclear
● Move code to SPU
Agenda
● Render design
● Brief description of SPU
● Porting
● Development
● Q & A
Render – “high” level
● Rendering is done on sets of RenderItem
– The sets are already sorted and culled
● RenderItem aggregates:
– SceneNode
– Material
– Shader
– RenderEntity
Render – SceneNode
● Transform graph node
– Local transform
– Global transform (derived from local)
● Local transforms are set by misc code
– Animations
– Physics
– Game logic
Render – Shader
● Render pipeline setup algorithm
– virtual void apply
  ● Sets up auto-parameters
    – Are computed automagically by the system
    – WorldViewProjection, ShadowMap, etc.
– virtual void setup
  ● Sets up material
    – Material parameters (including textures)
    – Shaders
● 99% of objects are of final type HWShader
Render – Material
● Container of instance data for Shader
– Data layout description
  ● Parameter name/type
  ● Offset in data array
– Data array
– Accessors for name/index (get/set)
– Render states
  ● Blend, alpha test, depth, cull
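
The container described above can be sketched as a layout table plus a raw byte array. This is only an illustration: ParamDesc, MaterialSketch, Float4, addFloat4, find, set and get are hypothetical names, and only float4 parameters are handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical layout entry: parameter name and its offset in the data array.
struct ParamDesc {
    std::string name;
    std::uint32_t offset;  // byte offset into MaterialSketch::data
};

struct Float4 { float v[4]; };

// Hypothetical material: layout description + raw data array.
struct MaterialSketch {
    std::vector<ParamDesc> layout;
    std::vector<unsigned char> data;

    // Appends a zero-initialized float4 parameter and returns its index.
    std::size_t addFloat4(const std::string& name) {
        ParamDesc desc;
        desc.name = name;
        desc.offset = static_cast<std::uint32_t>(data.size());
        layout.push_back(desc);
        data.resize(data.size() + sizeof(Float4));
        return layout.size() - 1;
    }

    // Accessor by name: linear search, returns -1 if the name is unknown.
    int find(const std::string& name) const {
        for (std::size_t i = 0; i < layout.size(); ++i)
            if (layout[i].name == name) return static_cast<int>(i);
        return -1;
    }

    // Accessors by index; memcpy keeps the raw byte array free of aliasing issues.
    void set(std::size_t index, const Float4& value) {
        std::memcpy(&data[layout[index].offset], &value, sizeof(value));
    }
    Float4 get(std::size_t index) const {
        Float4 out;
        std::memcpy(&out, &data[layout[index].offset], sizeof(out));
        return out;
    }
};
```

Keeping all instance data in one array is what later makes it possible to transfer a material with a single DMA request.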
Render – RenderEntity
● Drawing algorithm
– virtual void render
● Several implementations
– RenderStaticGeometry
– RenderSkinnedGeometry
– RenderMorphedGeometry
– DynamicObject
Render – low level
● Cross-platform wrappers
– State setup (with cache)
– Vertex/pixel constant setup
– Shader setup
● GCM implementation
– PS3-specific API for CB generation
  ● Is mostly present on SPU
    – This makes porting easier
SPU – what is it?
● 6 identical cores
– 3.2 GHz, in-order, dual-issue
– 128 vector registers
– Local Storage (LS)
  ● 256 KB – code + data
  ● 6 cycle latency
  ● External memory is accessed via DMA
    – Asynchronous memcpy (LS ↔ memory)
    – Alignment/size restrictions
SPU – porting tasks
● Build code for SPU
● Run code on SPU
– Task/job manager
– Code/data size
– Virtual functions
● Optimization
– Effective usage of DMA
– Code optimization
Porting steps
● Step 1 – working prototype
● Step 2 – data optimization
● Step 3 – code optimization
Porting steps
● Step 1 – working prototype
– Speed does not matter
  ● Non-optimal code, synchronous DMA
– Complete functionality
● Step 2 – data optimization
● Step 3 – code optimization
Step 1 – PPU interface
● async::Renderer
– Simple interface
  ● push(RenderItem) (+ batch versions)
  ● flush()
  ● kick()
– The limits are set when creating renderer
  ● Maximum number of items
  ● Maximum CB size
– Double-buffering for CB
Step 1 – PPU interface
[Diagram: the PPU pushes items into Renderer 1 and Renderer 2. Each flush starts a render job on an SPU (SPU 1, SPU 2), which writes its own CB. kick1 and kick2 place render 1 and render 2 between the "render misc" work in the GPU CB, and the GPU reads the CBs the SPUs wrote.]
Step 1 – DMA helpers
● Convenience functions to simplify DMA
– Allocator
  ● Trivial stack allocator, ptr += size
– fetchData(ea, size)
  ● Memory allocation and synchronous DMA
  ● Can handle misalignment
– fetchObject / fetchObjectArray
  ● Typed versions of fetchData
– Later we made asynchronous variants
Step 1 – DMA helpers
● void* fetchData(alloc, ea, size)
  uint32_t sizeAligned = (size + (ea & 15) + 15) & ~15;
  void* ls = alloc.allocate(sizeAligned);
  DmaGet(ls, ea & ~15, sizeAligned);
  DmaWait();
  return (char*)ls + (ea & 15);
● T* fetchObject(alloc, ea)
  return (T*)fetchData(alloc, ea, sizeof(T));
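
The alignment arithmetic above can be tried on a PC with a stand-in for DMA. In this sketch g_mainMemory, StackAllocator, dmaGetSync and alignedTransferSize are hypothetical names, the "DMA" is a plain memcpy, and the allocator does no overflow checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for main memory; an "EA" here is just an index into this array.
unsigned char g_mainMemory[256];

// Trivial stack allocator over a 16-byte aligned local buffer (ptr += size).
struct StackAllocator {
    alignas(16) unsigned char storage[256];
    std::size_t used = 0;
    void* allocate(std::size_t size) {
        void* p = storage + used;
        used += size;
        return p;
    }
};

// Hypothetical synchronous DMA get. Real SPU DMA has alignment/size
// restrictions; this memcpy stub has none.
void dmaGetSync(void* ls, std::uint32_t ea, std::uint32_t size) {
    std::memcpy(ls, g_mainMemory + ea, size);
}

// Widens the transfer to 16-byte boundaries on both sides of [ea, ea + size).
std::uint32_t alignedTransferSize(std::uint32_t ea, std::uint32_t size) {
    return (size + (ea & 15u) + 15u) & ~15u;
}

void* fetchData(StackAllocator& alloc, std::uint32_t ea, std::uint32_t size) {
    std::uint32_t sizeAligned = alignedTransferSize(ea, size);
    void* ls = alloc.allocate(sizeAligned);
    dmaGetSync(ls, ea & ~15u, sizeAligned);
    // Skip the leading bytes that were fetched only to satisfy alignment.
    return static_cast<unsigned char*>(ls) + (ea & 15u);
}
```

For ea = 20 and size = 20 the transfer starts at address 16 and covers 32 bytes, and the returned pointer is 4 bytes into the local buffer.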
Step 1 – virtual functions
● PPU vfptr does not make sense on SPU
● The solution varies by interface
– Shader
  ● Single supported shader type – HWShader
– RenderEntity
  ● Enum for all supported types
  ● Enum value is stored in unused pointer bits
    – ptr = actual_ptr | type // actual_ptr % 4 == 0
Step 1 – encapsulation
● Makes porting harder
– Methods with incorrect SPU code
  ● CRT_ASSERT(next->prev == this)
– Additional method parameters
  ● render() → render(Context)
● Makes SPU code refactoring harder
● Solution (some people don't like this...)
– #define private public [SPU-only!]
Step 1 – shader patch
● RSX lacks PS constant registers
– Constants are embedded into microcode
– Microcode has to be patched
  ● RSX blitting
    – Huge RSX cost (up to 50% frame time)
  ● PPU render
    – Ring buffer for microcode instances
    – Complex synchronization
  ● SPU render
    – Instances are stored in the same buffer where CB resides
Step 1 – synchronization
● PPU/SPU
– Data races
  ● Transformation matrices
  ● Material parameters
– Objects can be deleted
– Solution
  ● SPU code has to be fast
  ● PPU waits for SPU before changing data
Step 1 – synchronization
● SPU/RSX
– PPU
  ● flush() inserts WAIT at the beginning of CB
    – Waits indefinitely
  ● kick() inserts CALL in main CB
– SPU
  ● Fills CB with rendering commands/shaders
  ● Appends RET to the end
  ● Replaces WAIT with NOP*
Step 1 – results
● Porting time – 3 days
● Render time – 25 ms
– PPU render time is 12.5 ms
– How to make it faster?
  ● Brute-force – split queue into 5 chunks
    – 5 ms for 5 SPUs
  ● Write better code
● Completely separate code branch
– Common data structures
Porting steps
● Step 1 – working prototype
● Step 2 – data optimization
– Change data layout
  ● Lower indirection count
– Asynchronous DMA
  ● Double-buffering for input/output data
● Step 3 – code optimization
Step 2 – memory layout
[Diagram: object graph before optimization. Nodes: RenderItem, SceneNode, RenderStaticGeometry, Material, VertexDeclaration, HWShader, Parameters, Textures, TextureDesc, TextureData, GCMTexture, Param ctab, Auto ctab, HWShaderImpl, VS command buffer, PS program.]
Step 2 – data layout
● Goal – lower indirection count
– Actually, make graph paths shorter
● Where do they come from?
– Shared data
– “Variable” length arrays
  ● Size is known at load time
– “Good” architecture
  ● Law of Demeter
– Pimpl
Step 2 – materials
● Textures
– struct TextureInfo
  ● Stored in data array
  ● Is updated in setValue
  ● The contents are sufficient for texture setup
    – 4 bytes – sampler state, 12 bytes – texture header
● Render States
– Stored in data array
  ● 16 bytes for all states
Step 2 – materials
● Before: Material + Render States, Parameters, Textures, TextureDesc, TextureData, GCMTexture (separate blocks)
● After: Material, Parameters + Textures + Render States (one data block)
Step 2 – HWShaderImpl
● Lots of “variable” length arrays
– Constant tables
– Shader data
● Solution
– Sequential layout of everything in memory
– Header contains offsets
– DMA get and pointer fixup
  ● vsCB = (char*)impl + impl->vsCBOffset
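
A sketch of the sequential layout, assuming a header followed by three word arrays. Only vsCBOffset is named in the slides; ShaderImplHeader, its other fields, buildBlob and vsWord are hypothetical, and a vector copy stands in for the DMA get.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

typedef std::vector<std::uint32_t> Words;

// Hypothetical header placed at the start of the block.
struct ShaderImplHeader {
    std::uint32_t totalSize;
    std::uint32_t paramCtabOffset;
    std::uint32_t vsCBOffset;
    std::uint32_t psProgramOffset;
};

std::uint32_t byteSize(const Words& words) {
    return static_cast<std::uint32_t>(words.size() * sizeof(std::uint32_t));
}

void copyWords(std::vector<unsigned char>& blob, std::uint32_t offset, const Words& words) {
    if (!words.empty()) std::memcpy(blob.data() + offset, words.data(), byteSize(words));
}

// Lays out header + arrays back to back, so one transfer fetches everything.
std::vector<unsigned char> buildBlob(const Words& ctab, const Words& vsCB, const Words& ps) {
    ShaderImplHeader h;
    h.paramCtabOffset = static_cast<std::uint32_t>(sizeof(ShaderImplHeader));
    h.vsCBOffset = h.paramCtabOffset + byteSize(ctab);
    h.psProgramOffset = h.vsCBOffset + byteSize(vsCB);
    h.totalSize = h.psProgramOffset + byteSize(ps);
    std::vector<unsigned char> blob(h.totalSize);
    std::memcpy(blob.data(), &h, sizeof(h));
    copyWords(blob, h.paramCtabOffset, ctab);
    copyWords(blob, h.vsCBOffset, vsCB);
    copyWords(blob, h.psProgramOffset, ps);
    return blob;
}

// Pointer fixup: vsCB = (char*)impl + impl->vsCBOffset, then read word i.
std::uint32_t vsWord(const unsigned char* impl, std::size_t i) {
    ShaderImplHeader h;
    std::memcpy(&h, impl, sizeof(h));
    const unsigned char* vsCB = impl + h.vsCBOffset;
    std::uint32_t word;
    std::memcpy(&word, vsCB + i * sizeof(word), sizeof(word));
    return word;
}
```

Because the header stores offsets rather than pointers, the block stays valid at whatever local address it lands.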
Step 2 – HWShaderImpl
● Before: HWShader, HWShaderImpl, Param ctab, Auto ctab, VS command buffer, PS program (separate blocks)
● After: HWShaderImpl, Param ctab, Auto ctab, VS CB, PS program (one sequential block)
Step 2 – VertexDeclaration
● class RenderStaticGeometry
– VertexDeclaration* vdecl
  ● Can store vdecl by value
    – Space penalty
● There are not a lot of unique instances
– There is a declaration cache anyway
– Can implement a software cache!
  ● 4 element cache, DMA stall on cache miss
  ● 30 cache misses for 3500 batches
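
A minimal sketch of such a 4-element software cache keyed by effective address. The slides do not state the replacement policy; round-robin is an assumption here, as are the names VertexDeclData, fetchDeclBlocking and VdeclCache and the convention that address 0 marks an empty slot.

```cpp
#include <cstdint>

// Hypothetical stand-in for the declaration contents.
struct VertexDeclData { std::uint32_t stride; };

// Hypothetical fetch; on the SPU this would be a blocking DMA get from `ea`.
VertexDeclData fetchDeclBlocking(std::uint32_t ea) {
    VertexDeclData d;
    d.stride = ea / 16;  // fake payload derived from the address
    return d;
}

struct VdeclCache {
    static const int kEntries = 4;
    std::uint32_t tags[kEntries];   // EA stored in each slot, 0 = empty
    VertexDeclData data[kEntries];
    int next;                       // round-robin replacement cursor (assumption)
    int misses;

    VdeclCache() : next(0), misses(0) {
        for (int i = 0; i < kEntries; ++i) tags[i] = 0;
    }

    const VertexDeclData& get(std::uint32_t ea) {
        for (int i = 0; i < kEntries; ++i)
            if (tags[i] == ea) return data[i];  // hit: no DMA at all
        // Miss: stall on the fetch, then overwrite the oldest slot.
        int slot = next;
        next = (next + 1) % kEntries;
        tags[slot] = ea;
        data[slot] = fetchDeclBlocking(ea);
        ++misses;
        return data[slot];
    }
};
```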
Step 2 – FlatRenderItem
● Graph path to HWShaderImpl is long
– item->material->shader->impl
● FlatRenderItem
– Caches pointers/sizes
  ● Material data EA/size
  ● Shader impl EA/size
  ● Scene node/render entity EA
– Created at level load time
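
One possible shape for the flattened item, built once by walking item->material->shader->impl. The field names, the EA typedef, the flatten function and the contents of the PPU-side structs are all hypothetical; only the list of cached values comes from the slide.

```cpp
#include <cstdint>

// Hypothetical PPU-side objects; only the pointer chain matters here.
struct HWShaderImpl { std::uint32_t size; };
struct HWShader { HWShaderImpl* impl; };
struct Material { HWShader* shader; const void* data; std::uint32_t dataSize; };
struct SceneNode {};
struct RenderEntity {};
struct RenderItem { SceneNode* node; Material* material; RenderEntity* entity; };

// Effective address stored as an integer.
typedef std::uintptr_t EA;

// Cached addresses/sizes so the SPU needs a single hop per resource.
struct FlatRenderItem {
    EA materialDataEA;
    std::uint32_t materialDataSize;
    EA shaderImplEA;
    std::uint32_t shaderImplSize;
    EA sceneNodeEA;
    EA renderEntityEA;
};

// Resolves the long graph path once, at load time.
FlatRenderItem flatten(const RenderItem& item) {
    FlatRenderItem flat;
    flat.materialDataEA = reinterpret_cast<EA>(item.material->data);
    flat.materialDataSize = item.material->dataSize;
    flat.shaderImplEA = reinterpret_cast<EA>(item.material->shader->impl);
    flat.shaderImplSize = item.material->shader->impl->size;
    flat.sceneNodeEA = reinterpret_cast<EA>(item.node);
    flat.renderEntityEA = reinterpret_cast<EA>(item.entity);
    return flat;
}
```

With sizes cached next to the addresses, all transfers for an item can be issued without first fetching any intermediate object.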
Step 2 – FlatRenderItem
● Before: RenderItem → SceneNode, RenderStaticGeometry, Material → HWShader → HWShaderImpl
● After: FlatRenderItem → SceneNode, RenderStaticGeometry, Material data, HWShaderImpl
Step 2 – DMA optimizations
● Up to now all DMA transfers are synchronous
● Can hide DMA latency!
– Launch several requests
  ● Wait for all at once
– Double buffering
  ● While current batch is being processed
    – Source data for next batch is being read
    – Result for previous batch is being written
  ● Requires additional LS memory
    – Not a problem in our case
Step 2 – output DMA
● Command buffer
– Two 8 KB buffers
– Swap on buffer overflow
● Shader buffer
– Can do double buffering
– It's easier to wait for transfer though
  ● But before processing instead of after!
  ● DmaPut has enough time to complete
Step 2 – input DMA
● For each batch
– Wait for previous transfers
– Prefetch next batch
  ● 4 DmaGet at once
– Current batch processing
● Requires loop prologue
– Prefetch for first batch
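
The loop structure above can be sketched on a PC as follows. Batch, dmaGetAsync, dmaWaitAll and processAll are hypothetical names, the "asynchronous" get is an immediate memcpy, and each batch is a single transfer rather than four. Unlike the code described in the talk, this sketch guards the prefetch so it never reads past the last batch.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct Batch { int value; };

// Hypothetical stand-ins: the "DMA" completes immediately via memcpy.
void dmaGetAsync(Batch* ls, const Batch* ea) { std::memcpy(ls, ea, sizeof(Batch)); }
void dmaWaitAll() {}  // on the SPU this would block until transfers finish

// Sums all batch values while the next batch is being fetched.
int processAll(const std::vector<Batch>& mainMemory) {
    if (mainMemory.empty()) return 0;
    Batch buffers[2];
    int sum = 0;
    dmaGetAsync(&buffers[0], &mainMemory[0]);   // loop prologue: first batch
    for (std::size_t i = 0; i < mainMemory.size(); ++i) {
        dmaWaitAll();                            // wait for previous transfers
        if (i + 1 < mainMemory.size())           // prefetch into the other buffer
            dmaGetAsync(&buffers[(i + 1) & 1], &mainMemory[i + 1]);
        sum += buffers[i & 1].value;             // process the current batch
    }
    return sum;
}
```

The wait sits at the top of the loop body, so by the time the last iteration finishes no transfer into the stack buffers is still outstanding.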
Step 2 – input DMA
● Complex code (lack of experience at the time)
– FlatRenderItem are fetched one by one
  ● It's easier to fetch in groups
● Bugs
– The code prefetches one item past the end
  ● PPU duplicates last item to avoid errors
– Don't forget to wait for the last DmaGet!
  ● Otherwise stack corruption is possible
Step 2 – results
● Optimization time – 3 days
● Render time – 8 ms
– Without double buffering – 12 ms
● PPU time did not change (don't ask)
[Diagram: resulting data layout. Nodes: FlatRenderItem, Material data, SceneNode, RenderStaticGeometry, VertexDeclaration, HWShaderImpl.]
Porting steps
● Step 1 – working prototype
● Step 2 – data optimization
● Step 3 – code optimization
– Profiling
  ● SN Tuner
  ● SPUsim
– Optimization
Step 3 – SN Tuner
● CPU/GPU profiler for PS3
– SPU performance counters
  ● DMA stalls
  ● Instruction scheduling
    – Overview of code quality
– SPU PC sampling
  ● No overhead as opposed to PPU sampling
  ● Used for function cost overview
    – Had to selectively remove inlining
Step 3 – SPUsim
● SPU simulator for PC
– Awesome for prototyping
  ● Lightning fast iterations
  ● Stalls statistics
  ● Instruction trace
    – Shows stalls, lack of pairing
– For small self-contained functions
  ● You can set up DMA, but it's not very easy
Step 3 – branching
● Branching carries a lot of overhead
● Reduce branch counts
– Branch flattening
– Loop unrolling
– Switch → function pointer table
● Zero-size DMA
● Branch hinting
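
Two of the techniques above, sketched in portable C++. The render functions, kRenderTable, dispatch and selectBranchless are hypothetical, and whether the compiler actually emits branch-free code for the select is not guaranteed.

```cpp
#include <cstdint>

typedef int (*RenderFn)(int);

// Hypothetical per-type handlers; return values only make the calls observable.
int renderStatic(int batch)  { return batch + 100; }
int renderSkinned(int batch) { return batch + 200; }
int renderMorphed(int batch) { return batch + 300; }
int renderDynamic(int batch) { return batch + 400; }

// Switch → function pointer table: one indirect call, no compare chain.
const RenderFn kRenderTable[4] = {
    renderStatic, renderSkinned, renderMorphed, renderDynamic
};

int dispatch(unsigned type, int batch) {
    return kRenderTable[type & 3u](batch);  // mask keeps the index in range
}

// Branch flattening: compute both sides and select with a mask.
std::uint32_t selectBranchless(bool cond, std::uint32_t a, std::uint32_t b) {
    std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);  // all ones if cond
    return (a & mask) | (b & ~mask);
}
```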
Step 3 – LS load/store
● LS load/store is limited to 16-byte size/alignment
– Compiler performs shuffle / masking
● 16-byte reads
– Padding for input data
– Loop unrolling
● 16-byte writes
– Write several RSX commands at a time
– Padding for output data (via NOP for RSX)
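
The output padding can be sketched as rounding a command stream up to whole 16-byte units, assuming 4-byte commands. kNopCommand is a placeholder value and not the real RSX NOP encoding; padToQuadword is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMPTION: placeholder value, not the actual RSX NOP encoding.
const std::uint32_t kNopCommand = 0;

// Pads the stream with NOPs until its length is a multiple of 16 bytes
// (4 words), so every store to the buffer can be a full quadword.
void padToQuadword(std::vector<std::uint32_t>& commands) {
    while (commands.size() % 4 != 0) commands.push_back(kNopCommand);
}
```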
Step 3 – results
● Optimization time – 5 days
● Render time – 2 ms
● Further optimizations
– Code optimization is still possible
  ● But is not worth it for now
– Parallel rendering with N SPUs
  ● Different scene chunks
  ● Different passes
Porting results
● PPU time – 12.5 ms
● SPU time (prototype) – 25 ms (3 days)
● SPU time (layout) – 12 ms (2.5 days)
● SPU time (async DMA) – 8 ms (1 day)
● SPU time (code) – 2 ms (5 days)
● 75 KB SPU code, 20 KB PPU code
– Currently 105 / 26 KB
Development
● Already implemented
– Batch sorting
– Culling (frustum, screen size)
– Custom game parameter setup
● Future work
– Occlusion culling (already implemented)
– Single buffered context
– Uber shaders