Post on 29-Mar-2015
Streaming v. Multicore in Graphics Applications
Jared Hoberock, Victor Lu, Sanjay Patel, John C. Hart
Univ. of Illinois
Dynamic Virtual Environments
World of Warcraft
• Social Internet World
• Completely Unconstrained (can build & share things)
• Lower Quality Graphics
Grand Theft Auto IV
• “Sandbox” World
• Free Interaction (within gamespace)
• High Quality Graphics
Halo 3
• First-Person Shooter
• Constrained Interaction
• Photorealistic Graphics (much precomputation)
Dynamic, Flexible “Game” Graphics
Precomputed, Rigid “Film” Graphics
Multicore enables both flexibility and photorealism
Videogame Production
• Costly
  ° Expensive: $10M/title
  ° Slow: 3+ years/title
• Compromises
  ° Precomputed visibility – restricts viewer mobility and environment complexity
  ° Precomputed lighting – restricts scene dynamics, user alterations
  ° Precomputed motion – restricts movement to mocap data, rigging
• Consequences
  ° Significant development effort to achieve realtime rates
  ° Dynamic social gamespace quality lags that of solo/team shooter levels
• Solution
  ° Leverage multicore power to ray trace for dynamic visibility & lighting
How Close Are We?
• Single CPU ray tracing
  ° RTRT Core renders at 1~5 Hz on 2.5 GHz P4
  ° Need 60 Hz for games
  ° 30 GHz CPU needed to ray trace game scenes [Schmittler et al., Realtime Ray Tracing for Current & Future Games, SIGGRAPH 2006 Real Time Ray Tracing Course Notes]
• We won’t see a 30 GHz serial processor (burns too brightly!)
• We will see 16+ cores
• But can we do in parallel what we predict in serial?
Ingo Wald, RTRT Core, SIGGRAPH 2005 Real Time Ray Tracing Course Notes
Spatial Data Structures
Nearest Neighbor Problems in Graphics
• Rendering: Photon Mapping (k-NN)
  ° Find 500 photons nearest to a ray-surface intersection to compute surface’s illumination
• Modeling: Surface Reconstruction (ε-NN)
  ° Surface reconstructed at each point depends on locations of nearest points within a given distance
• Animation: Collision Detection (ε-NN)
  ° Collision between multiple interacting elements accelerated by avoiding all pairs intersections
Built on hierarchical spatial data structures
How can we build, query and maintain on SIMD GPUs?
kD-Tree
• Hierarchy of axis-aligned partitions
  ° 2-D partitions are lines
  ° 3-D partitions are planes
• Axis of partitions alternates wrt depth of the tree
• Average access time is O(log n)
• Worst case O(n) when tree is severely lopsided
• Need to maintain a balanced tree O(n log n)
• Can find k nearest neighbors in O(k + log n) time using a heap
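The kD-tree bullets above can be sketched as follows. This is a minimal CPU-side illustration, not the authors' code: the names `build` and `knn` and the dictionary node layout are assumptions made for the example, and the build re-sorts at every level for brevity, so it is simpler but slower than the O(n log n) construction the slide mentions. The query keeps the k best candidates in a bounded max-heap and prunes subtrees that cannot contain a closer point.

```python
import heapq

def build(points, depth=0):
    """Build a balanced kD-tree: split at the median, alternating axes by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

def knn(root, query, k):
    """Return the k points nearest to query, closest first."""
    heap = []  # max-heap on squared distance, stored as (-d2, point)

    def visit(node):
        if node is None:
            return
        p, axis = node["point"], node["axis"]
        d2 = sum((a - b) ** 2 for a, b in zip(p, query))
        if len(heap) < k:
            heapq.heappush(heap, (-d2, p))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, p))
        diff = query[axis] - p[axis]
        near, far = ((node["left"], node["right"]) if diff < 0
                     else (node["right"], node["left"]))
        visit(near)
        # Cross the splitting plane only if it is closer than the current k-th best.
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(root)
    return [p for _, p in sorted(heap, reverse=True)]
```

For photon mapping, `query` would be the ray-surface intersection point and `k` the photon count (500 on the slide).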
GPU Hierarchy Traversal
• SIMD “stackless” hierarchy traversal
  ° Prethread with hit/miss pointers
  ° Hit pointer points to first child
  ° Miss pointer points to next sibling or, if last sibling, then ancestor’s sibling
• References
  ° Foley & Sugerman, kD-tree Acceleration Structures for a GPU raytracer, Graphics HW 05
  ° Carr, Hoberock, Crane & Hart, Fast GPU Ray Tracing of Dynamic Meshes Using Geometry Images, Graphics Interface 2006
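A minimal sketch of hit/miss threading, under simplifying assumptions that are not from the slides: nodes are 1-D intervals given as nested `(lo, hi, children)` tuples, and the hierarchy is stored in preorder. In that layout the hit pointer is always the next array slot and the miss pointer is the first slot past the node's subtree, which is exactly the next sibling or, for a last child, the nearest ancestor's sibling. The names `thread` and `traverse` are hypothetical.

```python
def thread(tree):
    """Flatten a nested (lo, hi, children) tree into a preorder list with hit/miss links."""
    nodes = []

    def walk(node):
        lo, hi, children = node
        index = len(nodes)
        nodes.append(None)  # reserve this node's slot before laying out its subtree
        for child in children:
            walk(child)
        nodes[index] = {
            "lo": lo, "hi": hi, "leaf": not children,
            "hit": index + 1,     # first child (for a leaf this equals miss)
            "miss": len(nodes),   # first slot past this subtree
        }

    walk(tree)
    return nodes

def traverse(nodes, x):
    """Stackless traversal: follow hit on overlap, miss otherwise. Returns leaf indices."""
    found = []
    i = 0
    while i < len(nodes):  # an index of len(nodes) means "done"
        n = nodes[i]
        if n["lo"] <= x <= n["hi"]:
            if n["leaf"]:
                found.append(i)
            i = n["hit"]
        else:
            i = n["miss"]
    return found
```

Because the loop carries only one index, every SIMD lane runs the same small loop without a per-thread stack, which is the point of prethreading.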
GPU Hierarchy Construction
• Recent approaches sort first, then organize into hierarchy
  ° Zhou, Hou, Wang, Guo, “Real-Time KD-Tree Construction on Graphics Hardware,” SIGGRAPH Asia 2008
  ° Godiyal, Hoberock, Hart, Garland, “Rapid Multipole Graph Drawing on the GPU,” Graph Drawing 2008
• Latter uses kD-tree for fast n-body approximation to compute force directed layout
• CPU+GPU
  ° CPU builds kD-tree
  ° GPU performs median selection
  ° Practical when > 50K elements
Incoherent Shader Execution
• Videogame graphics rasterize triangles
  ° Same shader applied to all pixels (fragments) in triangle
  ° Shading & visibility occur simultaneously
• Future videogames will also trace rays
  ° Visibility first, then shading
• Primary eye rays are coherent
• Secondary rays are reflected or scattered into incoherent shader queries
• Different shader (not just different shader data) applied to each ray
  ° e.g. hair, skin, cloth, liquids, foliage
Chris Wyman
GPU Architecture
• GPU = MIMD of SIMD
• MIMD processing
  ° Cell: 8 MIMD nodes
  ° GF8800: 16 MIMD nodes
  ° LRB: 32 MIMD nodes
• SIMD processing
  ° Cell: 4 per MIMD node
  ° GF8800: 8 per MIMD node
  ° LRB: 16 per MIMD node
• Some MIMD nodes have distinct “control” processors, though similar processing could occur via one SIMD node (masking rest)
• LRB “core” is a MIMD proc., NVIDIA “core” is a SIMD proc.
• NVIDIA “warp” is 32 threads streaming on one MIMD node
[Figure labels: MIMD Node, SIMD Node]
IBM Cell Architecture
[Figure: Cell block diagram. Labels: Power processor element (Power processor unit, Power execution unit, L1 cache, L2 cache; 64-bit Power Architecture with vector media extensions); eight Synergistic processor elements (each labeled SPU, SXU, Local store, SMF); Element interconnect bus (up to 96 bytes/cycle); Memory interface controller (Dual XDR); Bus interface controller (Flex I/O); link widths of 16 and 32 bytes/cycle]
Gschwind et al., Synergistic Processing in Cell’s Multicore Architecture, IEEE Micro, 2006
NVIDIA Tesla Architecture
Conditional Program Flow
• High-performance stuck with low-level streaming SIMD
• Even in multicore
• Problem with SIMD: Conditional Program Flow
  ° If a data-dependent condition leads to two different program flows
  ° Then both program flows must be executed on all SIMD nodes (serialization)
  ° Result masked per SIMD processor by the condition data

MIMD for loop
  SIMD for loop
    if (X) then A else B

[Figure: X = T T T T T T T F; both A and B are executed; masking on X yields A A A A A A A B]
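The masking behavior can be emulated on the CPU. The sketch below is an illustration only: `simd_if_else` is a hypothetical helper, Python lists stand in for SIMD lanes, and `a` and `b` are arbitrary per-lane functions. Every lane executes each branch that at least one lane needs, and the condition mask selects which result each lane keeps.

```python
def simd_if_else(x, data, a, b):
    """Emulate one SIMD group running `if x: a(v) else: b(v)` under a mask.

    Returns (per-lane results, number of branch bodies executed by the group).
    """
    result = list(data)
    passes = 0
    if any(x):
        # At least one lane takes A, so all lanes step through A.
        a_out = [a(v) for v in data]
        result = [ao if m else r for ao, m, r in zip(a_out, x, result)]
        passes += 1
    if not all(x):
        # At least one lane takes B, so all lanes step through B as well.
        b_out = [b(v) for v in data]
        result = [r if m else bo for bo, m, r in zip(b_out, x, result)]
        passes += 1
    return result, passes
```

With the mask from the figure (seven true lanes, one false), the group pays for both branches even though only one lane needs B.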
Deferred Shading
• Handle visibility first
  ° Intersect rays w/scene
  ° Store result for later shading
• Shade ray intersections
• If different rays in the same MIMD node need different shaders, then shaders are serialized
• O(NS) performance
  ° N = # of rays
  ° S = # of shaders (per MIMD node)
  ° O(S) when distributed across N processes

MIMD for all rays
  SIMD for all rays
    intersect ray with scene
    set mask to shader #
MIMD for all rays
  SIMD for all rays
    for all shaders in SIMD ray warp
      shader(ray) if mask == shader
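The second pass of the pseudocode above might look like this on the CPU. It is a sketch under stated assumptions: `shade_deferred` is a hypothetical name, `shader_ids` holds the mask stored by the visibility pass, and each shader is a plain function of the ray index. The returned `passes` counts how many shader bodies the warps step through in total, which is the serialization cost.

```python
def shade_deferred(shader_ids, shaders, warp_size=32):
    """Shade rays warp by warp, serializing over the shaders each warp needs."""
    n = len(shader_ids)
    colors = [None] * n
    passes = 0
    for start in range(0, n, warp_size):
        warp = range(start, min(start + warp_size, n))
        needed = sorted({shader_ids[i] for i in warp})
        for s in needed:
            passes += 1  # the whole warp steps through shader s
            for i in warp:
                if shader_ids[i] == s:  # mask == shader
                    colors[i] = shaders[s](i)
    return colors, passes
```

A warp whose rays all request one shader costs a single pass; a warp that mixes S shaders costs S passes, which is where the O(NS) bound comes from.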
Process Sorting
• Need to bucket computations to move those with identical control flows onto the SIMD processors of the same MIMD node
• When is it worth the trouble?
Scan (Prefix Sum)
Input flags:    1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
Exclusive scan: 0 1 1 1 1 1 1 1 2 2 3 3 3 3 4 4
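The two rows above are a flag array and its exclusive prefix sum. The sketch below is a serial reference, not a parallel GPU scan, and the function names are chosen for illustration. It shows how the scan output doubles as scatter addresses for stream compaction.

```python
def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] is the sum of flags[0..i-1]."""
    out = []
    total = 0
    for f in flags:
        out.append(total)
        total += f
    return out

def compact(items, flags):
    """Keep items whose flag is 1, writing each to the slot given by the scan."""
    offsets = exclusive_scan(flags)
    out = [None] * sum(flags)
    for item, flag, offset in zip(items, flags, offsets):
        if flag:
            out[offset] = item  # scatter to the compacted position
    return out
```

Applied to the flags above, the four flagged elements at positions 0, 7, 9 and 13 land in output slots 0 through 3.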
Shader Scheduling
• Sort jobs based on shader request
  ° Radix sort
  ° Segmented scan
  ° Global v. local sort
• Load MIMD nodes only with rays requesting the same shader
• Still O(NS)
  ° Performing O(N) scan on each of S shaders
• Can we scan on all shaders simultaneously?

MIMD for all rays
  SIMD for all rays
    intersect ray with scene
MIMD for all shaders
  Scan rays needing that shader
  MIMD for all rays needing that shader
    SIMD for all rays
      shader(ray)
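One way to picture the scheduling step is to group ray indices by shader and cut each group into warps. In this sketch Python's stable `sorted` stands in for the radix sort named on the slide, and `schedule_by_shader` is a hypothetical helper, not the authors' scheduler.

```python
def schedule_by_shader(shader_ids, warp_size=32):
    """Return warps (lists of ray indices) in which every ray requests the same shader."""
    # Stable sort by shader id; a GPU version would use radix sort or per-shader scans.
    order = sorted(range(len(shader_ids)), key=lambda i: shader_ids[i])
    warps = []
    for i in order:
        last = warps[-1] if warps else None
        if last and shader_ids[last[0]] == shader_ids[i] and len(last) < warp_size:
            last.append(i)
        else:
            warps.append([i])  # new shader, or the current warp is full
    return warps
```

Each warp now runs exactly one shader, at the price of the sort itself and of partially filled warps at the end of each shader's group.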
Stanford Bunny in Cornell Box
• Three shaders: wall, glass, light
• Shaders simple
• Warp size: 32

Hit  Incoherence  Branches  Eff.
1    0.6%         1.15      87%
2    30%          2.40      42%
3    38%          2.55      39%
4    40%          2.65      37%
5    40%          2.67      37%

Incoherence: how often ray’s shader differs from previous ray’s
Branches: average # of branches per warp
Automotive/CAD Viz
• DJ_Designs via Google 3D WH
• 16 simple shaders
• Small parts ameliorate their shader’s impact on overall efficiency

Bounce  Incoherence  Efficiency
1       1.6%         28%
2       40%          13%
3       30%          14%
4       22%          15%
5       17%          17%
Angel in Cornell Box
• Four shaders:
  ° wall, light simple
  ° marble, wood are more expensive, procedural

Bounce  Incoherence  Efficiency
1       1.2%         77%
2       52%          23%
3       53%          21%
4       47%          22%
5       40%          23%
Siebel Center Staircase
• Six shaders
  ° Copper, glass girder, chrome, marble, light
• Efficiency bump due to smooth glass/chrome coherence and rays exiting the scene

Bounce  Incoherence  Efficiency
1       3%           68%
2       34%          34%
3       36%          33%
4       33%          32%
5       30%          34%
Efficiency Images
Branching Penalties
Warp size: 32
All 32 SIMD threads must follow the same control flow
[Figure labels: 16 shaders, one shader]
• Shader execution
  ° Serial: one at a time
  ° SIMD: as a “big switch”
• Serialized
  ° Slower, wastes processors
  ° Avoids locks
  ° Can conserve memory
• Compare w/ & w/o stream compaction
Memory Coherence
[Figure: three panels, each mapping Processes to Memory]
Scheduling Approaches
• Five Options
  ° Serial Unsorted
  ° Serial Global Compaction
  ° Parallel Unsorted
  ° Parallel Local Compaction
  ° Parallel Global Compaction
• Each variation involves bookkeeping overhead
[Chart: bars for Serialized (Unsorted, Sorted Global) and SIMD Parallel (Unsorted, Sorted Local, Sorted Global); vertical axis from 0 to 3500]
Observations
• Even for these modest scenes there are significant performance gains
• Local per-node compaction doesn't work
• Even zero-time sort would not improve most cases
• Local per-node workloads hindered by too many shaders to schedule
• Faster stream compaction: Prefix sum, Scatter/Gather
Conclusions
• Stream compaction
  ° Not practical for simple shaders
  ° Practical for procedural textures (wood, marble)
  ° Probably for complex shaders (hair, cloth, skin)
• Warp coherence nevertheless leads to data incoherence
  ° Even when all shaders in a MIMD node run the same shader, their data is still distributed across memory, outside of cache boundaries
• Static tuning ok, but run-time better
• Broader implication to object polymorphism
  ° Streaming same objects with different virtual function tables