4Developers 2015: Gamedev-grade debugging - Leszek Godlewski
Gamedev-grade debugging
Leszek Godlewski, The Astronauts
Source: http://igetyourfail.blogspot.com/2009/01/reaching-out-tale-of-failed-skinning.html
Who is this guy?
● Engine Programmer, The Astronauts (Nov 2014 – present)
– PS4 port of The Vanishing of Ethan Carter
● Programmer, Nordic Games (early 2014 – Nov 2014)
● Freelance Programmer (Sep 2013 – early 2014)
● Generalist Programmer, The Farm 51 (Mar 2010 – Aug 2013)
Agenda
● How is gamedev different?
● Bug species
● Case studies
● Conclusions
[Diagram: game loop flowchart with nodes Start, Update, Draw, Exit? (Yes/No), End]
How is gamedev different?
33 milliseconds
● How much time you have to get shit done™
– 30 Hz → 33⅓ ms per frame
– 60 Hz → 16⅔ ms per frame
[Diagram: game technology stack. Editor (Level tools, Asset tools); Engine (Physics, Rendering, Audio, Network, Platform, Input); Network back-end; Game (UI, Logic, AI)]
Interdisciplinary working environment
● Designers
– Game, Level, Quest, Audio…
● Artists
– Environment, Character, 2D, UI, Concept…
● Programmers
– Gameplay, Engine, Tools, UI, Audio…
● Writers
● Composers
● Actors
● Producers
● PR & Marketing Specialists
● …
} Tightly woven teams
Severe, fixed hardware constraints
● Main reason for extensive use of native code
Different trade-offs
[Chart: trade-offs between Robustness, Cost, Performance and Fun/Coolness, comparing Enterprise/B2B/webdev with Gamedev]
Indeterminism & complexity
● Leads to poor testability
– Parts make no sense in isolation
– What exactly is correct?
– Performance regressions?
Source: https://github.com/memononen/recastnavigation
Aversion to general software engineering
● Modelling
● Object-Oriented Programming
● Design patterns
● C++ STL
● Templates in general
● …
Agenda
● How is gamedev different?
● Bug species
● Case studies
● Conclusions
Source: http://benigoat.tumblr.com/post/100306422911/press-b-to-crouch
Bug species
General programming bugs
● Memory access violations
● Memory stomping/buffer overflows
● Infinite loops
● Uninitialized variables
● Reference cycles
● Floating point precision errors
● Out-Of-Memory/memory fragmentation
● Memory leaks
● Threading errors
Bad maths
● Incorrect transform order
– Matrix multiplication not commutative
– AB ≠ BA
● Incorrect transform space
Source: http://leadwerks.com/wiki/index.php?title=TFormQuat
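The transform-order point can be checked with a tiny worked example. The sketch below is not taken from any engine: the `Vec2` and `Mat3` types and the helper names are hypothetical, and integer components are used so the results are exact. It assumes the column-vector convention, where `mul(A, B)` applies `B` first and `A` second.

```cpp
#include <cassert>

// Hypothetical minimal 2D types with integer components, for illustration only.
struct Vec2 { int x, y; };
struct Mat3 { int m[3][3]; };  // homogeneous 2D transform, column-vector convention

// Plain 3x3 matrix product: the result applies b first, then a.
Mat3 mul(const Mat3& a, const Mat3& b) {
    Mat3 r = {};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Transforms a point (implicit w = 1).
Vec2 transformPoint(const Mat3& t, Vec2 p) {
    return { t.m[0][0] * p.x + t.m[0][1] * p.y + t.m[0][2],
             t.m[1][0] * p.x + t.m[1][1] * p.y + t.m[1][2] };
}

Mat3 translation(int tx, int ty) { return Mat3{{{1, 0, tx}, {0, 1, ty}, {0, 0, 1}}}; }

// Counter-clockwise rotation by 90 degrees (cos = 0, sin = 1).
Mat3 rotation90() { return Mat3{{{0, -1, 0}, {1, 0, 0}, {0, 0, 1}}}; }
```

With `T = translation(1, 0)` and `R = rotation90()`, the origin maps to (0, 1) under `mul(R, T)` (translate, then rotate) but to (1, 0) under `mul(T, R)` (rotate, then translate).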
Temporal bugs
● Incorrect update order
– for (int i = 0; i < entities.size(); ++i)
    entities[i].update();
● Incorrect interpolation/blending
– Bad alpha term
– Bad blending mode (additive/modulate)
● Deferred effects
– After n frames
– After n times an action happens
– n may be random, indeterministic
Graphical glitches
● Incorrect render state
● Shader code bugs
● Precision
Source: http://igetyourfail.blogspot.com/2009/01/visit-lake-fail-this-weekend.html
Content bugs
● Incorrect scripts
● Buggy assets
Source: http://www.polycount.com/forum/showpost.php?p=1263124&postcount=10466
Worst part?
● Most cases are two or more of the aforementioned, intertwined
Agenda
● How is gamedev different?
● Bug species
● Case studies
● Conclusions
Most material captured by
Case studies
Video settings not updating
Incorrect weapon after demon mode foreshadowing
Post-death sprint camera anim
Corpses teleported on death
● In normal gameplay, pawns have simplified movement
– Sweep the actor's collision primitive through the world
– Slide along slopes, stop against walls
Source: http://udn.epicgames.com/Three/PhysicalAnimation.html
Corpses teleported on death
● Upon death, pawns switch to physics-based movement (ragdoll)
Source: http://udn.epicgames.com/Three/PhysicalAnimation.html
Corpses teleported on death (cont.)
● Physics bodies have separate state from the game actor
– Actor does not drive physics bodies, unless requested
– If the actor is driven by physics simulation, its location is synchronized to the hips bone body's
Source: http://udn.epicgames.com/Three/PhysicalAnimation.html
Corpses teleported on death (cont.)
● Idea: breakpoint in FarMove()?
– One function because world octree is updated
– Function gets called a gazillion times per frame
– Terrible noise
● Breakpoint condition?
– Teleport from arbitrary point A to arbitrary point B
– Distance?
● Breakpoint sequence?
– Break on death instead
– When breakpoint hit, break in FarMove()
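The breakpoint sequence can also be expressed in code when the debugger makes chaining breakpoints awkward. The following is only a sketch under assumptions: `TeleportTrap`, its method names and the distance threshold are all hypothetical, and the `onTrigger` callback stands in for whatever debug-break facility the platform offers. The trap is armed by the death event and fires on the first suspiciously long move afterwards.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct Vec3 { float x, y, z; };

float dist(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Hypothetical two-stage trap: stage 1 is the death event, stage 2 is a
// long-distance move. Neither name is an engine API.
struct TeleportTrap {
    bool armed = false;               // set once the pawn has died
    float minDistance = 500.0f;       // assumed "this is a teleport" threshold
    std::function<void()> onTrigger;  // e.g. wraps a debug-break intrinsic

    void notifyDeath() { armed = true; }

    // Intended to be called from the hot move function; cheap when not armed.
    void notifyFarMove(const Vec3& from, const Vec3& to) {
        if (!armed || dist(from, to) < minDistance) return;
        armed = false;                // fire once, then go quiet again
        if (onTrigger) onTrigger();
    }
};
```

Because the check is a single boolean test until the pawn dies, the "gazillion" calls per frame add almost no noise or cost.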
Corpses teleported on death (cont.)
● Cause: physics body driving the actor with out-of-date state
● Fix: request physics body state synchronization to animation before switching to ragdoll
Weapons floating away from the player
● Extremely rare, only encountered on consoles
– Reproduction rate somewhere at 1 in 50 attempts
– And never on developer machines
● Player pawn in a special state for the rollercoaster ride
– Many things could go wrong
● For lack of a repro, sprinkled the code with debug logs
Weapons floating away from the player (cont.)
● Cause: incorrect update order
– for (int i = 0; i < entities.size(); ++i)
    entities[i].update();
– Player pawn forced to update after rollercoaster car
– Possible for weapons to be updated before player pawns
● Fix: enforce weapon update after player pawns
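One way to enforce such an ordering is to give each entity an explicit update group and sort before the update loop. This is a sketch of that idea, not the fix as shipped: the `UpdateGroup` enum, the `Entity` struct and `sortForUpdate()` are hypothetical names. A stable sort keeps the existing relative order within a group.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical update groups; lower values update first.
enum class UpdateGroup { Vehicle = 0, Pawn = 1, Weapon = 2 };

struct Entity {
    std::string name;
    UpdateGroup group;
};

// Orders entities so that vehicles update before pawns and pawns before weapons.
// std::stable_sort preserves the original order among entities of the same group.
void sortForUpdate(std::vector<Entity>& entities) {
    std::stable_sort(entities.begin(), entities.end(),
                     [](const Entity& a, const Entity& b) {
                         return static_cast<int>(a.group) < static_cast<int>(b.group);
                     });
}
```

After sorting a list holding a weapon, the player pawn and the rollercoaster car in that order, the plain index loop visits the car, then the pawn, then the weapon.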
Characters with “rapiers”
● UE3 has “content cooking” as part of game build pipeline
– Redistributable builds are “cooked” builds
● Artifact appears only in cooked builds
Characters with “rapiers” (cont.)
● Logs contained assertions for “out-of-bounds vertices”
● Mesh vertex compression scheme
– 32-bit float → 16-bit short int (~50% savings)
– Find bounding sphere for all vertices
– Normalize all vertices to said sphere radius
– Map [-1; 1] floats to [-32768; 32767] 16-bit integers
● Assert condition
– for (int i = 0; i < 3; ++i)
    assert(v[i] >= -1.f && v[i] <= 1.f, "Out-of-bound vertex!");
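The last step of the compression scheme can be sketched as a pair of functions. This is an assumed linear mapping, not necessarily the exact one UE3 uses: `quantize()` and `dequantize()` are hypothetical names, and the input is expected to be already normalized to the bounding sphere radius.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Maps a normalized coordinate in [-1, 1] to a 16-bit integer in [-32768, 32767].
// The range check mirrors the assert from the slide; a NaN input fails it too.
std::int16_t quantize(float v) {
    assert(v >= -1.0f && v <= 1.0f && "Out-of-bound vertex!");
    const double scaled = (static_cast<double>(v) + 1.0) * 0.5 * 65535.0;  // [0, 65535]
    return static_cast<std::int16_t>(std::lround(scaled) - 32768);
}

// Inverse mapping back to [-1, 1]; the round trip loses at most half a step.
float dequantize(std::int16_t q) {
    return static_cast<float>((static_cast<double>(q) + 32768.0) / 65535.0 * 2.0 - 1.0);
}
```

One quantization step is 2/65535, roughly 3e-5 in normalized units, so the absolute error scales with the bounding sphere radius of the mesh.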
Characters with “rapiers” (cont.)
● v[i] was NaN
– Interesting property of NaN: all comparisons fail (only != yields true)
– Even with itself
● float f = nanf("");
bool b = (f == f); // b is false
● How did it get there?!
● Tracked the NaN all the way down to the raw engine asset!
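A cheap guard against this class of problem is to scan vertex data for non-finite values as early as possible, for instance at asset import. The helper below is a hypothetical sketch (`firstNonFinite()` is not an engine function) that reports where the first NaN or infinity sits in a flat array of components.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first NaN or infinite component, or -1 if all are finite.
std::ptrdiff_t firstNonFinite(const std::vector<float>& components) {
    for (std::size_t i = 0; i < components.size(); ++i) {
        if (!std::isfinite(components[i]))
            return static_cast<std::ptrdiff_t>(i);
    }
    return -1;
}
```

Knowing the index narrows the search to a specific vertex, which makes it easier to tell a broken export from a bug further down the pipeline.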
Characters with “rapiers” (cont.)
● Cause: ???
● Fix: re-export the mesh from 3D software
– Magic!
Meta-case: undeniable assertion
Undeniable assertion
● Happened while debugging “rapiers”
● Texture compression library without sources
● Flood of non-critical assertions
– For almost every texture
– Could not ignore in bulk
– Terrible noise
● Solution suggestion taken from [SINILO12]
Undeniable assertion (cont.)
● Enter disassembly
Undeniable assertion (cont.)
● Locate assert message function call instruction
Undeniable assertion (cont.)
● Enter memory view and look up the address
– 0xE8 is the CALL opcode
– 4-byte address argument
Undeniable assertion (cont.)
● NOP it out!
– 0x90 is the NOP opcode
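The byte-level edit itself is simple enough to sketch. The snippet below operates on a plain byte buffer standing in for the code bytes seen in the memory view; it does not touch real process memory, and `nopOutCall()` is a hypothetical helper. It assumes the x86 near-call encoding: one 0xE8 opcode byte followed by a 4-byte operand, five bytes in total.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const std::uint8_t kCallOpcode = 0xE8;  // x86 near CALL
const std::uint8_t kNopOpcode = 0x90;   // x86 NOP
const std::size_t kCallLength = 5;      // opcode byte + 4-byte operand

// Overwrites the CALL instruction at `offset` with NOPs.
// Returns false (and changes nothing) if there is no CALL at that offset.
bool nopOutCall(std::vector<std::uint8_t>& code, std::size_t offset) {
    if (offset > code.size() || code.size() - offset < kCallLength) return false;
    if (code[offset] != kCallOpcode) return false;
    std::fill_n(code.begin() + static_cast<std::ptrdiff_t>(offset), kCallLength, kNopOpcode);
    return true;
}
```

All five bytes have to be replaced: leaving part of the operand behind would make the CPU decode the leftovers as unrelated instructions.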
Undeniable assertion (cont.)
Incorrect player movement
● Recreating player movement from one engine in another (Pain Engine → Unreal Engine 3)
● Different physics engines (Havok vs PhysX)
● Many nuances
– Air control
– Jump and fall heights
– Slope & stair climbing & sliding down
Incorrect player movement (cont.)
● Main nuance: capsule vs cylinder
Incorrect player movement (cont.)
● Switching our pawn collision to capsule-based was not an option
● Emulate by sampling the ground under the cylinder instead
● No clever way to debug, just make it “bug out” and break in debugger
Incorrect player movement (cont.)
● Situation when getting stuck
● Cause: vanilla UE3 code sent a player locked between non-walkable surfaces into the “falling” state
● Fix: keep the player in the “walking” state
Incorrect player movement (cont.)
● Situation when moving without player intent
● Added visualization of sampling, turned on collision display
● Cause: undersampling
● Fix: increase radial sampling resolution
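To make "radial sampling resolution" concrete, here is a sketch of how sample positions around the base of a cylinder might be generated. It is an illustration under assumptions, not the game's code: `radialSamples()` is a hypothetical helper, and the actual ground probe at each point (a trace into the world) is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

// Generates `count` points evenly spaced on a circle around `center`.
// A higher count means finer angular resolution and fewer missed features.
std::vector<Vec2> radialSamples(Vec2 center, float radius, int count) {
    assert(count > 0);
    const float kTwoPi = 6.28318530718f;
    std::vector<Vec2> points;
    points.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        const float angle = kTwoPi * static_cast<float>(i) / static_cast<float>(count);
        points.push_back(Vec2{center.x + radius * std::cos(angle),
                              center.y + radius * std::sin(angle)});
    }
    return points;
}
```

With too few samples, a narrow ledge or gap can fall between two neighbouring sample points and go unnoticed, which is the undersampling failure described above.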
Blinking full-screen damage effects
● Post-process effects are organized in one-way chains
Blinking full-screen damage effects (cont.)
● No debugger available to observe the PP chain
● Rolled my own overlay that walked and dumped the chain contents
MaterialEffect 'Vignette'
  Param 'Strength' 0.83 [IIIIIIII  ]
MaterialEffect 'FilmGrain'
  Param 'Strength' 0.00 [          ]
UberPostProcessEffect 'None'
  SceneHighLights (X=0.80,Y=0.80,Z=0.80)
  SceneMidTones (X=0.80,Y=0.80,Z=0.80)
  …
MaterialEffect 'Blood'
  Param 'Strength' 1.00 [IIIIIIIIII]
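An overlay like this boils down to a list walk plus a text bar per parameter. The sketch below only approximates the idea: `PostProcessEffect` is a hypothetical stand-in for the engine's chain nodes, and the output format is simplified to one line per effect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical chain node: each effect points at the next one in the chain.
struct PostProcessEffect {
    std::string type;
    std::string name;
    float strength;                 // assumed to lie in [0, 1]
    const PostProcessEffect* next;
};

// Renders a strength in [0, 1] as a fixed-width text bar, e.g. "[IIIIIIII  ]".
std::string strengthBar(float strength, int width = 10) {
    int filled = static_cast<int>(strength * static_cast<float>(width));
    if (filled < 0) filled = 0;
    if (filled > width) filled = width;
    return "[" + std::string(static_cast<std::size_t>(filled), 'I') +
           std::string(static_cast<std::size_t>(width - filled), ' ') + "]";
}

// Walks the chain from head to tail and returns one text line per effect.
std::string dumpChain(const PostProcessEffect* head) {
    std::string out;
    for (const PostProcessEffect* e = head; e != nullptr; e = e->next) {
        char line[160];
        std::snprintf(line, sizeof line, "%s '%s' Strength %.2f %s\n",
                      e->type.c_str(), e->name.c_str(), e->strength,
                      strengthBar(e->strength).c_str());
        out += line;
    }
    return out;
}
```

Drawn every frame, a dump like this turns a blinking effect into a visibly flickering bar, which points at the parameter or chain entry being overridden.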
Blinking full-screen damage effects (cont.)
● Cause: entire PP chain override
– Breakpoint in chain setting revealed the level script as the source
– Overeager level designer ticking one checkbox too many when setting up thunderstorm effects
● Fix: disable chain overriding altogether
– No use case for it in our game anyway
Incorrect animation states
● Animation in UE3 is done by evaluating a tree
– Branches are weight-blended (either replacement or additive blend)
– Sequences (raw animations) for whole-skeleton poses
– Skeletal controls for fine-tuning of individual bones
Source: http://udn.epicgames.com/Three/AnimTreeEditorUserGuide.html
Incorrect animation states (cont.)
● Prominent case for domain-specific debuggers
● No tools for that in UE3, rolled my own visualizer
– Walks the animation tree and dumps active branches
– Allows inspection of states, but not transitions
– Conventional debugging still required, but greatly narrowed down
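The core of such a visualizer is a recursive walk that skips branches with zero blend weight. The following is a sketch with made-up types: `AnimNode` is a hypothetical simplification of an animation tree node, not the UE3 class.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical animation tree node: a name, a blend weight and child branches.
struct AnimNode {
    std::string name;
    float weight;
    std::vector<AnimNode> children;
};

// Appends one indented line per active node; a branch with zero weight is
// skipped together with everything below it.
void dumpActiveBranches(const AnimNode& node, int depth, std::string& out) {
    if (node.weight <= 0.0f) return;
    char line[160];
    std::snprintf(line, sizeof line, "%*s%s %.2f\n",
                  depth * 2, "", node.name.c_str(), node.weight);
    out += line;
    for (const AnimNode& child : node.children)
        dumpActiveBranches(child, depth + 1, out);
}
```

Pruning inactive branches keeps the dump short enough to read in real time, at the cost of hiding why a branch is inactive, which is where conventional debugging takes over.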
Incorrect animation states (cont.)
● Animation bug “checklist”
● Inspect the animation state in slow motion
– Is the correct blending mode used?
● Inspect the AI and cutscene state
– Capable of full animation overrides
● Inspect the assets (animation sequences)
– Is the root bone correctly oriented?
– Is the root bone motion correct?
– Are inverse kinematics targets present and correctly placed?
– Is the mesh skeleton complete and correct?
Incorrect animation states (cont.)
● Incorrect blend of reload animation
– Cause: bad root bone orientation in animation sequence
● Left hand off the weapon
– Cause: left hand inverse kinematics was off
– Fix: revise IK state control code
● Left hand incorrectly oriented
– Cause: bad IK target marker orientation on weapon mesh
Viewport stretched when portals are in view
● Graphics debugging is:
– Tracing & recording graphics API calls
– Replaying the trace
– Reviewing the renderer state and resources
● Trace may be somewhat unreadable at first…
Viewport stretched when portals are… (cont.)
● Traces may be annotated for clarity
– Direct3D: ID3DUserDefinedAnnotation
– OpenGL: GL_KHR_debug (more info: [GODLEWSKI01])
Viewport stretched when portals are… (cont.)
● Quick renderer state inspection revealed that viewport dimensions were off
– 1024x1024, 1:1 aspect ratio instead of 1280x720, 16:9
– Looks like shadow map resolution?
● Found the latest glViewport() call
– Shadow map code indeed
● Why wasn't the viewport updated for main scene rendering?
Viewport stretched when portals are… (cont.)
● Renderer state changes are expensive
– New state needs to be validated
– Modern graphics APIs are asynchronous
– State reading may require synchronization → stalls
● Cache the current renderer state to avoid redundant calls
– Cache ↔ state divergence → bugs!
Viewport stretched when portals are… (cont.)
● Cause: cache ↔ state divergence
– Difference between Direct3D and OpenGL: viewport dimensions as part of render target state, or global state
● Fix: tie viewport dimensions to render target in the cache
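A state cache with the viewport tied to the render target could look roughly like the sketch below. Everything here is hypothetical: `FakeGL` stands in for the graphics API and merely counts calls, and `RenderStateCache` is an illustrative class, not the engine's. The cache remembers the last viewport set for each target and re-applies it on bind, while still skipping calls that would not change the API state.

```cpp
#include <cassert>
#include <map>

struct Viewport { int width; int height; };

bool operator==(const Viewport& a, const Viewport& b) {
    return a.width == b.width && a.height == b.height;
}

// Stand-in for the graphics API: the viewport is global state, as in OpenGL.
struct FakeGL {
    Viewport current;
    int viewportCalls;
    FakeGL() : current{0, 0}, viewportCalls(0) {}
    void viewport(const Viewport& v) { current = v; ++viewportCalls; }
};

class RenderStateCache {
public:
    explicit RenderStateCache(FakeGL& gl) : gl_(gl), boundTarget_(-1), applied_{0, 0} {}

    // Binding a target restores the viewport last set while it was bound.
    void bindTarget(int targetId) {
        boundTarget_ = targetId;
        std::map<int, Viewport>::const_iterator it = perTarget_.find(targetId);
        if (it != perTarget_.end()) apply(it->second);
    }

    void setViewport(const Viewport& v) {
        perTarget_[boundTarget_] = v;  // remembered per render target
        apply(v);
    }

private:
    // Issues the API call only when the global state actually changes.
    void apply(const Viewport& v) {
        if (v == applied_) return;
        gl_.viewport(v);
        applied_ = v;
    }

    FakeGL& gl_;
    int boundTarget_;
    Viewport applied_;                   // mirror of the API's global viewport
    std::map<int, Viewport> perTarget_;  // last viewport requested per target
};
```

Switching from the main target to a 1024x1024 shadow map and back then restores 1280x720 automatically, and a repeated `setViewport()` with the same dimensions issues no API call.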
Black artifacts
● First thing to do is to inspect the state
● Nothing suspicious found, turned to shaders
● On OpenGL 4.2+, shaders could be debugged in Nsight…
● OpenGL 2.1, so had to resort to early returns from shader with debug colours
– Shader equivalent of debug logs, a.k.a. “Your Mum's Debugger”
● “Shotgun debugging” with is*() functions
– isnan(), isinf()
● isnan() returned true!
Black artifacts (cont.)
● Cause: undefined behaviour in NVIDIA's pow() implementation
– Results are undefined if x < 0. Results are undefined if x = 0 and y <= 0. [GLSL120]
– Undefined means the implementation is free to do whatever
● NVIDIA returns QNaN the Barbarian (displayed as black, poisoning all involved calculations)
● Other vendors usually return 0
● Fix: for all pow() calls, clamp either:
– Arguments to their proper ranges
– Output to [0; ∞)
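The argument-clamping variant of the fix can be illustrated outside a shader. The sketch below is C++ rather than GLSL, so it only mirrors the idea: `safePow()` and the `kMinBase` epsilon are assumptions made for this example, and in a shader the same clamp would be written with the built-in `max()`. C++'s `std::pow` also yields NaN for a negative base with a non-integer exponent, which makes the effect easy to reproduce on the CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Assumed lower bound for the base; keeps x strictly positive so that both
// "x < 0" and "x = 0 and y <= 0" are avoided.
const float kMinBase = 1e-6f;

// pow() with the base clamped into the range where the result is well defined.
float safePow(float x, float y) {
    return std::pow(std::max(x, kMinBase), y);
}
```

The price is a small error near zero: `safePow(0.0f, y)` returns a tiny positive number instead of exactly 0 for positive `y`, which is usually invisible in colour computations.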
Mysterious crash
● Game in content lock (feature freeze) for a while
● PlayStation 3 port nearly done
● Crash ~3-5 frames after entering a specific room
● First report included a perfectly normal callstack but no obvious reason
● QA reassigned to another task, could not pursue more
● Concluded it must've been an OOM crash
Mysterious crash (cont.)
● Bug comes back, albeit with wildly different callstack
● Asked QA to reproduce multiple times, including other platforms
– No crashes on X360 & Windows!
● Totally different callstack each time
● Confusion!
– OOM? Even in 512 MB developer mode (256 MB in retail units)?
– Bad content?
– Console OS bug?
– Audio thread?
– ???
Mysterious crash (cont.)
● Reviewed a larger sample of callstacks
● Most ended in dlmalloc's integrity checks
– Assertions triggered upon allocations and frees
● Memory stomping…? Could it be…?
Mysterious crash (cont.)
● Started researching memory debugging
– No tools provided by Sony
● Tried using debug allocators (dmalloc et al.)
– Most use the concept of memory fences
– Difficult to hook up to UE3
[Diagram: memory returned by malloc, regular allocation vs. fenced allocation]
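The memory fence concept can be shown with a toy allocator. This is a minimal sketch of the general technique, not how dmalloc or any particular tool implements it: `fencedAlloc()`, `fencesIntact()` and `fencedFree()` are hypothetical names, and the fence size and fill byte are arbitrary choices. Each block is surrounded by known byte patterns that are verified later, so a write past either end of the user area is detected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

const std::size_t kHeaderSize = 16;      // stores the user size, keeps alignment
const std::size_t kFenceSize = 16;       // guard bytes on each side
const unsigned char kFenceByte = 0xFD;   // arbitrary fill pattern

// Layout: [header: user size][front fence][user data][back fence]
void* fencedAlloc(std::size_t size) {
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(kHeaderSize + 2 * kFenceSize + size));
    if (raw == nullptr) return nullptr;
    std::memcpy(raw, &size, sizeof size);
    std::memset(raw + kHeaderSize, kFenceByte, kFenceSize);
    std::memset(raw + kHeaderSize + kFenceSize + size, kFenceByte, kFenceSize);
    return raw + kHeaderSize + kFenceSize;
}

// Returns true if neither fence has been overwritten.
bool fencesIntact(const void* user) {
    const unsigned char* data = static_cast<const unsigned char*>(user);
    const unsigned char* raw = data - kFenceSize - kHeaderSize;
    std::size_t size = 0;
    std::memcpy(&size, raw, sizeof size);
    const unsigned char* front = raw + kHeaderSize;
    const unsigned char* back = data + size;
    for (std::size_t i = 0; i < kFenceSize; ++i) {
        if (front[i] != kFenceByte || back[i] != kFenceByte) return false;
    }
    return true;
}

// Verifies the fences, then releases the whole block.
void fencedFree(void* user) {
    if (user == nullptr) return;
    assert(fencesIntact(user) && "memory stomp detected");
    std::free(static_cast<unsigned char*>(user) - kFenceSize - kHeaderSize);
}
```

The check only fires when the fences are inspected, so the stomp is reported at free time (or at an explicit `fencesIntact()` call) rather than at the moment of the bad write.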
Mysterious crash (cont.)
● Found and integrated a community-developed tool, Heap Inspector [VANDERBEEK14]
– Memory analyzer
– Focused on consumption and usage patterns monitoring
– Records callstacks for allocations and frees
● Several reproduction attempts revealed a correlation
– Crash address
– Construction of a specific class
● Gotcha!
Mysterious crash (cont.)
// class declaration
class Crasher extends ActorComponent;
var int DummyArray[1024];
// in ammo consumption code
Crash = new class'Crasher';
Comp = new class'ActorComponent'(Crash);
Mysterious crash (cont.)
● Cause: buffer overflow vulnerability in UnrealScript VM
– No manifestation on X360 & Windows due to larger allocation alignment (8 vs 16 bytes)
● Fix: make copy-construction fail when template is a subclassed object
● I wish I had Valgrind! [GODLEWSKI02]
Agenda
● How is gamedev different?
● Bug species
● Case studies
● Conclusions
Takeaway
● Time is of the essence!
● Always on a tight schedule
● Constantly in motion
– Temporal visualization is key
– Custom, domain-specific tools
● Complex and indeterministic
– Difficult to automate testing
– Wide knowledge required
● Prone to bugs outside the code
– Custom, domain-specific tools, again
Takeaway (cont.)
● Rendering is a whole separate beast
– Absolutely custom tools in isolation from the rest of the game
– Still far from ideal usability
● Good to know your machine down to the metal
● Good memory debugging tools make a world of difference
● You are never safe, not even in managed languages!
Questions?
@ [email protected] @TheIneQuationK www.inequation.org
Thank you!
References
● SINILO12 – Sinilo, M. “Coding in a debugger” [link]
● GODLEWSKI01 – Godlewski, L. “OpenGL (ES) debugging” [link]
● GLSL120 – Kessenich, J. “The OpenGL® Shading Language”, Language Version: 1.20, Document Revision: 8, p. 57 [link]
● VANDERBEEK14 – van der Beek, J. “Heap Inspector” [link]
● GODLEWSKI02 – Godlewski, L. “Advanced Linux Game Programming” [link]