advanced linux game programming

Advanced Linux Game Programming

Advanced Linux
Game Programming

Leszek Godlewski
Programmer, Nordic Games

Hi everyone, my name is Leszek Godlewski and I'm here today to talk about some of the more advanced aspects of Linux game programming.

Or, porting your game engine to Linux, to be more precise.

These notes will only supplement the slides, they are not a transcript.

Nordic Games GmbH

Started in 2011 as a sister company to Nordic Games Publishing (We Sing)

Base IP acquired from JoWooD and DreamCatcher (SpellForce, The Guild, Aquanox, Painkiller)

Initially focusing on smaller, niche games

Acquired THQ IPs in 2013 (Darksiders, Titan Quest, Red Faction, MX vs. ATV)

Now shifting towards being a production company with internal devs

Since fall 2013: internal studio in Munich, Germany (Grimlore Games)

Leszek Godlewski
Programmer, Nordic Games

PortsPainkiller Hell & Damnation (The Farm 51)

Deadfall Adventures (The Farm 51)

Darksiders (Nordic Games)

Formerly generalist programmer on PKHD & DA at TF51

I'm a programmer with Nordic Games. A bit unusual role with the company since it's mostly a publisher, but I'm the resident Linux guy and generalist programmer where necessary.

I've got three ports under my belt: Painkiller Hell & Damnation and Deadfall Adventures for The Farm 51, and Darksiders for Nordic.

I was also a generalist programmer at TF51 on the original teams for those two games.

Objective of this talk

Your game engine on Linux, before porting:

Missing!

Objective of this talk (cont.)

Your first working Linux port:

Oops. Bat-Signal!

Objective of this talk (cont.)

Where I want to try helping you get to:

In other words, from this:

To this:

And that's mostly debugging

All sorts of debuggers!

apitrace

Who can guess what these are, apart from the obvious ones? :)

Answers, clockwise from top left: GDB, Valgrind, apitrace (doh), Nsight (doh!), GPU PerfStudio, VOGL (doh...).

Demo code available

is.gd/GDCE14Linux

Intended takeaway

Build system improvements

Signal handlers

Memory debugging with Valgrind

OpenGL debugging techniques

The focus of this talk lies in these 4 areas. I'd like to share my findings about optimizing native code builds, signal (crash) handling, jump on to memory debugging stuff using the Valgrind tool, and finish off with OpenGL debugging techniques. There are some bonus slides if there is still time, too.

Intended takeaway Agenda


Signal handlers



Actually, let's adopt this as the agenda. :) The upcoming element will be in bold.

Build systems

What I had initially with UE3:Copy/paste of the Mac OS X toolchain

It worked, but...Slow

Huge binaries because of debug symbols

Problematic linking of circular dependencies

This is based on the two UE3 titles, but also applied to Darksiders to some extent (there was no Mac toolchain, had to make one from scratch). It displayed the same problems.

Build systems (cont.)

32-bit binaries required for feature/hardware parity with Windows

Original solution: a chroot jail with an entire 32-bit Ubuntu system just for building

As much as I would love 32-bit to die, I had to support it to achieve full parity with Windows.

A chroot is a bit like a VM a separate, sandboxed view of the entire system environment (i.e. file system), but running on the same kernel.

Host system is Debian amd64.

Cross-compiling for 32/64-bit

gcc -m32/-m64 is not enough!Only sets target code generation

Not headers & libraries (CRT, OpenMP, libgcc etc.)

Fixed by installing gcc-multilibDependency package for non-default architectures (i.e. i386 on an amd64 system and vice versa)

Fortunately, I could get rid of that chroot jail thanks to gcc-multilib.

Clang (ad nauseam)

Clang is fastergcc:3m47s

Clang:3m05s

More benchmarks at Phoronix [LARABEL13]

Clang has different diagnostics than gcc

Did you mean x? and colours for the win.

Clang (cont.)

Preprocessor macro compatibilityDeclares __GNUC__ etc.

Command line compatibilityEasily switch back & forth between Clang & gcc

Clang caveats

C++ object files may be incompatible with gcc & fail to link (need full rebuilds)

Clang is not as mature as gccOccasionally has generated faulty code for me (YMMV)

C object files are unaffected.

Clang seems to have slightly different C++ symbol name generation rules (pre-mangling).

Clang caveats (cont.)

Slight inconsistencies in C++ standard strictnessTemplates

Anonymous structs/unions

May need to add this-> in some places

May need to name some anonymous types

So: Clang or gcc?

Both:Clang quick iterations during development

gcc final shipping binaries

Linking GNU ld

Default linker on Linux

Ancient

Single-threaded

Requires specification of libraries in the order of reverse dependency...

We are not doomed to use it!

Linking GNU gold

Multi-threaded linker for ELF binariesld:18s

gold:5s

Developed at Google, now officially part of GNU binutils

Officially in binutils ships as an optional part of default toolchain on all modern distros.

Linking GNU gold (cont.)

Drop-in replacement for ldMay need an additional parameter or toolchain setupclang++ -B/usr/lib/gold-ld ...

g++ -fuse-ld=gold ...

Still needs libs in the order of reverse dependency...

Linking reverse dependency

Major headache/game-breaker with circular dependencies

Proper fix: re-specify the same libraries over and over againgcc app.o -lA -lB -lA

And not only circular sometimes it's tricky to even get regular dependencies in the right order.

Circular actually happened to me with the PhysX 2.8.4 loader. That was even worse because even though it seemed to have linked, the dynamic linker at runtime still failed to resolve everything...

Linking reverse dep. (cont.)

app

A

B

Here's how this reverse dependency thing works: you first declare the dependent object, then its dependency. Starting the construction of a building from the roof, if you will.

First, an app object file is linked. It has unsatisfied imports for symbols from library A.


app

A

B

Library A's exports satisfy the imports of app, but they import symbols from B.


app

A

B

Library B's exports match library A's imports, but B now has unsatisfied imports of A's symbols.


app

A

B

A

Just the missing symbols

By re-specifying A, the linker satisfies the A imports in library B.

Linking library groups

Declare library groups insteadWrap library list with --start-group, --end-groupShorthand: -(, -)

g++ foo.obj -Wl,-$ -lA -lB -Wl,-$

Results in exhaustive search for symbols

Linking library groups (cont.)

Actually used for non-library objects (TUs)

Caveat: the exhaustive search!Manual warns of possible performance hit

Not observed here, but keep that in mind!

Running the binary in debugger

inequation@spearhead:~/projects/largebinary$ gdb -silent largebinaryReading symbols from /home/inequation/projects/largebinary/largebinary...[zzz... several minutes later...]done.(gdb)

This is your typical GDB session start at the beginning of porting.

Caching the gdb-index

Large codebases generate heavy debug symbols (hundreds of MBs)

GDB does symbol indexing at every single startup Massive waste of time!

Caching the gdb-index (cont.)

Solution: fold indexing into the build process

Old linkers: as described in [GNU01]

New linkers (i.e. gold): --gdb-indexMay need to forward from compiler driver: -Wl,--gdb-index

Old linkers need some shell script magic run GDB in batch mode etc.

Here's what I had been using before discovering --gdb-index:

@echo Generating gdb-index in $(OUTPUT_PATH)/gdb-index/$(notdir $@).gdb-index@mkdir -p $(OUTPUT_PATH)/gdb-index@$(GDB) -batch -ex "save gdb-index $(OUTPUT_PATH)/gdb-index" $@@echo Merging gdb-index into $@@$(OBJCOPY) --add-section .gdb_index=$(OUTPUT_PATH)/gdb-index/$(notdir $@).gdb-index --set-section-flags .gdb_index=readonly $@ $@

Agenda


Signal handlers



Signal handlers

Unix signals are async notifications

Sources can be:the process itself

another process

user

kernel

Signal handlers (cont.)

A lot like interruptsJump to handler upon first non-atomic op

Normal programflowNormal flowresumesSignalhandlerexecution

SIGNAL

A program can possibly resume normal execution after a signal.


System installs a default handlerUsually terminates and/or dumps coreCore minidump in Windows parlance, but entire mapped address range is dumped (truncated to RLIMIT_CORE bytes)

See signal(7) for default actions

Signal handler can do different things by default: terminate, dump core or stop (suspend).

I'll talk about RLIMIT_CORE a bit later.

It's actually preferable to perform a clean exit (instead of aborting) in response to SIGTERM and SIGQUIT.


Can (should!) specify custom handlers

Get/set handlers via sigaction(2)void handler(int, siginfo_t *, void *);

Needs SA_SIGINFO flag in sigaction() call

Extensively covered in [BENYOSSEF08]

Interesting siginfo_t fields

si_code reason for sending the signalExamples: signal source, FP over/underflow, memory permissions, unmapped address

si_addr memory location (if relevant)SIGILL, SIGFPE, SIGSEGV, SIGBUS and SIGTRAP

These fields are a good reason for using the extended information handlers (with siginfo_t), as opposed to the traditional signal(2) calls with void sighandler(int) handlers.

si_code is an enumeration, not a bit field. Includes reasons general (sent by user, kernel, async I/O operation etc.) and specific to the signal (e.g. whether SIGILL was caused by invalid opcode, operand or addressing mode etc.).

Interesting signals

Worth catchingSIGSEGV, SIGILL, SIGHUP, SIGQUIT, SIGTRAP, SIGIOT, SIGBUS, SIGFPE, SIGTERM, SIGINT

Worth ignoringsignal(signum, SIG_IGN);

SIGCHLD, SIGPIPE

Not all signals are interesting.

SIGTRAP is a programmatical breakpoint (you can also raise(SIGTRAP) from code to have the debugger stop). It's up to you to decide how to handle that.

SIGCHLD is only of note if you plan on spawning sub-processes. Similarly for SIGPIPE, if you're interested in notifications for the other end breaking a FIFO pipe.

Signal handling caveats

Prone to race conditionsSignals may be nested

NormalflowSignalhandler

SIGNAL

SignalhandlerSIGNAL

SIGNAL

In reality, this is slightly more complex than that because of masking and deferring of signals. See sigaction(2) for details. Worth noting that by default, further signalling of the same signal type will be deferred until the return of the handler, unless you specify SA_NODEFER or explicitly unmask that signal.

Ben-Yossef recommends using a spinlock, but warns of deadlocking if contending against a thread with higher priority. In that case it's better to just try locking and sleep with pselect(), but that's a overkill in most cases.

Signal handling caveats (cont.)

Prone to race conditionsCan't share locks with the main program

LockmutexLock mutexDeadlock Signalhandler

SIGNAL

Normalflow

What if your code is interrupted right after grabbing the lock?

Unless your mutex is re-entrant and the lock happened on the signalled thread, of course.


Prone to race conditionsCan't call async-unsafe/non-reentrant functionsSee signal(7) for a list of safe ones

Notable functions not on the list:printf() and friends (formatted output)

malloc() and free()

Speaking of re-entrant


Not safe to allocate or free heap memory

STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?Source: [LEA01]

Allocator's data structures may have been damaged in the fault.


Custom handlers do not dump coreAt handler installation time:Raise RLIMIT_CORE to desired core size

Inside handler, after custom logging:Restore default handler using signal(2) or sigaction(2)

raise(signum);

When using sigaction(2) to install your handler, you will need a stash of struct sigaction in static memory and populate it when installing your handler.

You might not be able to bump RLIMIT_CORE due to permissions. You may also want to limit it, as core dumps are often huge because they include a dump of all the mapped memory tranges. However, smaller core size truncation not all structures/ranges can be dumped dump may actually be useless (stack not in the dumped memory range). Important to get a stack trace in the handler anyway.

The raise(3) call may not work unless you unmask the signal. This is demonstrated in the code example.

Also worth noting that apps are not really in control of core dump filename (/proc/sys/kernel/core_pattern or sysctl).

Safe stack walking

glibc provides backtrace(3) and friends

Symbols are read from the dynamic symbol tablePass -rdynamic at compile-time to populate

The backtrace() function is easy to use, just pass it a static array of void pointers and its size.

Safe stack walking (cont.)

backtrace_symbols() internally calls malloc()Not safe...

Still, can get away with it most of the time (YMMV)

Personally, I haven't had any problems with just calling backtrace_symbols() and calling it a day, but I imagine it may be a bad idea in memory-constrained situations or when nested signalling happens.

Example proper solution

Fork a watchdog process in main()Communicate over a FIFO pipe

In signal handler:Collect & send information down the pipe

backtrace_symbols_fd() down the pipe

Demo code: is.gd/GDCE14Linux

Pipe communication protocol up to the developer.

Agenda


Signal handlers



Is this even related to porting?

Yes! Portability bugs easily overlooked

Hardcoded struct sizes/offsets

OpenGL buffers

Incorrect binary packing/unpacking

How did we/they manage to ship that?!

One such case was not setting the right OpenGL pixel packing/unpacking rules when uploading textures, resulting in reading beyond the buffer.

What is Valgrind?

Framework for dynamic, runtime analysis

Dynamic recompilationmachine code IR tool machine code

Performance typically at 25-20% of unmodified codeWorse if heavily threaded execution is serialized

What is Valgrind? (cont.)

Many tools in it:Memory error detectors (Memcheck, SGcheck)

Cache profilers (Cachegrind, Callgrind)

Thread error detectors (Helgrind, DRD)

Heap profilers (Massif, DHAT)

Memcheck doesn't watch the stack or static variables, only the heap. SGcheck the reverse.

Memcheck basics

Basic usage extremely simpleas long as you use the vanilla libc malloc()

valgrind ./app

Will probably report a ton of errors on the first run!Again: How did they manage to ship that?!

Memcheck basics (cont.)

Many false positives, esp. in 3rd partiesXlib, NVIDIA driver

Can suppress them via suppress filesCall Valgrind with --gen-suppressions=yes to generate suppression definitions

Be careful with that! Can let OpenGL bugs slip!

Valgrind will also watch out for allocations made by the driver.

Contrived example

#include int main(int argc, char *argv[]) { int foo, *ptr1 = &foo; int *ptr2 = malloc(sizeof(int)); if (*ptr1) ptr2[1] = 0xabad1dea; else ptr2[1] = 0x15bad700; ptr2[0] = ptr2[2]; return *ptr1;}

Demo code:

is.gd/GDCE14Linux

This very bad code makes several mistakes:Branch on uninitialized variable

Overwrite memory past malloced area

Read past malloced area

Make a syscall with uninitialized memory as a parameter (return code)

Leak memory allocated to ptr2.

Actual demo code in repo also has an example for use-after-free.

Valgrind output for such

==8925== Conditional jump or move depends on uninitialised value(s)==8925== Invalid write of size 4==8925== Invalid read of size 4==8925== Syscall param exit_group(status) contains uninitialised byte(s)==8925== LEAK SUMMARY:==8925== definitely lost: 4 bytes in 1 blocks

Memcheck easily identifies all the bugs. Actual output includes call stacks, both for when the errors occur and for the allocation/freeing of the relevant memory chunk.

Running with --leak-check=full also pinpoints the source of the leak.

What about custom allocators?

Custom memory pool & allocation algo

Valgrind only sees mmap()/munmap() of multiples of entire memory pages

All access within those pages now valid!

How to track errors?

Valgrind has no way of knowing about individual allocations and frees from the pool. Thus it won't protect control structures etc.

Worth noting that mmap() calls for mapping large address ranges will cause Valgrind to issue a warning (not sure about the threshold value, but a 512MB heap does trigger it).

Client requests

Allow annotation of custom allocators

~20 C macros defined in valgrind.hCommon and per-tool requests exist

Can be cut out with -DNVALGRIND

Detailed description in [VALGRIND01]

Client requests have negligible performance impact and are no-ops if Valgrind is not present.

Example: Instrumenting dlmalloc

2.8.4 instrumentation from [CRYSTAL01]

Demo code: is.gd/GDCE14LinuxCompile the sample with -DDLMALLOC

Similar results to libc malloc()

This instrumentation could be improved by using the MEMPOOL family of client requests instead, but it's still quite verbose and useful.

Other uses of client requests

Pointer validationIs address mapped? Is it defined?

Mid-session leak checksLevel transitions

Pointer validation could be used in garbage collectors, for instance.

Other uses of client req. (cont.)

Poisoning memory regionsEnsuring signal handlers don't touch the heap

Ensuring geometry buffers aren't read on CPU

Source: [LEA01]

I'm sure you'll find a ton of other uses for making sure some data stays untouched.

Debugging inside Valgrind

A gdbserver for remote debugging

SIGTRAP (breakpoint) on every error

Unlimited memory watchpoints!Data breakpoints in Visual Studio parlance

Cf. 4 single-word hardware debug registers on x86

Debugging inside Valgrind (cont.)

Terminal A:valgrind --vgdb=yes --vgdb-error=0 ./MyGame

Terminal B:gdb ./MyGame

target remote | vgdb

Agenda


Signal handlers



I will deliberately not speak of IHV tools because they are complex enough to warrant separate talks. I'll leave that to the IHV's dev relations teams. :)

Ye Olde Way

Call glGetError() after each OpenGL call

Get 1 of 8 (sic!) error codes

Look up the call in the manual

See what this particular error means in this particular context

Ye Olde Way (cont.)

Then check what was actually the case6 possible reasons for GL_INVALID_VALUE in glTexImage*() alone! See [OPENGL01]

Usually: attach a debugger, replay the scenario

This sucks!

Ye Olde Way (cont.)

Then check what was actually the case6 possible reasons for GL_INVALID_VALUE in glTexImage*() alone! See [OPENGL01]

Usually: attach a debugger, replay the scenario

This sucks! used to suck

No longer does, because we no longer need to do that.

Debug callback

Never call glGetError() again!

Much more detailed informationIncl. performance tips from the driver

Good to check what different drivers say

May not work without a debug OpenGL context (GLX_CONTEXT_DEBUG_BIT_ARB)

The extension spec says that it's only suggested that driver implementers make it available only in debug contexts. It might just work in a regular one with your driver, but better stay on the safe side.

Debug callback (cont.)

Provided by either of (ABI-compatible): GL_KHR_debug [OPENGL02], GL_ARB_debug_output [OPENGL03]

OpenGLOpenGL ESNVIDIA (official)AMD (official)Intel (Mesa)AMD (Mesa)

ARB_debug_output

KHR_debug

State of Mesa at the time of writing this, on my ThinkPad T500 with Debian amd64 testing.


void callback(GLenum source, GLenum type, GLuint id, GLenum severity, GLsizei length, const GLchar* message, const void* userParam);

Filter by source, type, severity or individual messages

You don't get just the message, but also some metadata and a user pointer.

IDs are unregulated, so drivers may use them as they see fit. They will definitely vary from vendor to vendor.


Verbosity can be controlled (filtering)glDebugMessageControl[ARB]()

[OPENGL02][OPENGL03]

Turn to 11 (GL_DONT_CARE) for valuable perf information!Memory type for buffers, unused mip levels

You can also filter the messages programatically with ifs.

Intel is pretty verbose by default; NVIDIA needs setting all channels to GL_DONT_CARE to spew out more performance hints.

API call tracing

Record a trace of the run of the application

Replay and review the traceLook up OpenGL state at a particular call

Inspect state variables, resources and objects: textures, shaders, buffers...

apitrace or VOGL

Well, this is not helpful...

It's a flat list. Good luck finding a faulty call in a multi-pass renderer in it.

Much better!

Does not a tree view improve things?

KHR_debugEXT_debug_markerEXT_debug_labelGREMEDY_string_markerGREMEDY_frame_terminator

One-off messages

Call grouping

Object labels

Frame terminators

SupportGoodLimitedLimitedLimitedLimited

Annotating the call stream

Annotating the call stream (cont.)

All aforementioned extensions supported by apitrace regardless of driver

Recommended: GL_KHR_debug

Annotating the call stream (cont.)

Call groupingglPushDebugGroup()/glPopDebugGroup()

One-off messagesglDebugMessageInsert[ARB]()

glStringMarkerGREMEDY()

Object labelling

glObjectLabel(), glGetObjectLabel()Buffer, shader, program, vertex array, query, program pipeline, transform feedback, sampler, texture, render buffer, frame buffer, display list

glObjectPtrLabel(), glGetObjectPtrLabel()Sync objects

Annotation caveats

Multi-threaded grouping may break hierarchy

glDebugMessageInsert() calls the debug callback, polluting error streamsWorkaround: drop if source == GL_DEBUG_SOURCE_APPLICATION

In multi-threaded apps, look at the thread ID from which you're trying to push or pop a debug group.

Callback filtering can be achieved via glDebugMessageControl[ARB]() or programmatically. Just remember to always set source to GL_DEBUG_SOURCE_APPLICATION in your glDebugMessageInsert[ARB]() calls.

Example 1: PIX events emulation

#define D3DPERF_BeginEvent(colour, name) \ if (GLEW_KHR_debug && threadOwnsDevice()) \ glPushDebugGroup(GL_DEBUG_SOURCE_APPLICATION,\ (GLuint)colour, -1, name)#define D3DPERF_EndEvent()\ if (GLEW_KHR_debug && threadOwnsDevice())\ glPopDebugGroup()

There's a little trick here since there are no set rules about message IDs, I'm casting the D3DCOLOR (32-bit pixel) to GLuint and using that as an ID.

This was extremely cool to do, because I could instantly take advantage of the existing instrumentation on the whole engine.

Example 2: Game tech demo

University assignment from 2009

Annotated OpenGL 1.4

Demo code: is.gd/GDCE14Linux

The reason this demo was written for such an old version of OpenGL is that it needed to run on an old mobile Intel GPU (i915 I think).

Still works nicely with KHR_debug, though.

Takeaway

gcc-multilib is the prerequisite for 32/64-bit cross-compilation

Switching back and forth between Clang and gcc is easy and useful

Link times can be greatly improved by using gold

Takeaway (cont.)

Caching the gdb-index improves debugging experience

Crash handling is easy to do, tricky to get right

Takeaway (cont.)

Valgrind is an enormous aid in memory debugging

Even when employing custom allocators

OpenGL debugging experience can be vastly improved using some extensions

Questions?

@[email protected]@TheIneQuationKinequation.org

Thank you!

Further Nordic Games information:Kwww.nordicgames.atDevelopment information:Kwww.grimloregames.com

References

LARABEL13 Larabel, M. Clang 3.4 Performance Very Strong Against GCC 4.9 [link]

GNU01 Index Files Speed Up GDB [link]

GNU02 Options for Debugging Your Program or GCC [link]

BENYOSSEF08 Ben-Yossef, G. Crash N' Burn: Writing Linux application fault handlers [link]

LEA01 Lea, D. A Memory Allocator [link]

VALGRIND01 The Client Request mechanism [link]

CRYSTAL01 Crystal Space 3D SDK [link]

OPENGL01 glTexImage2D [link]

OPENGL02 ARB_debug_output [link]

OPENGL03 KHR_debug [link]

XDG01 XDG Base Directory Specification [link]

(), e.g. sigaction(2) Linux Programmer's Manual; to view, type man into a terminal or a web search engine

Special thanks

Fabian GiesenKatarzyna GriksaDamian KobiakaJetro LauhaEric LengyelKrzysztof NarkowiczReinhard PolliceBartomiej WroskiKacper Zber

Shout out to people who helped me (in various ways) to prepare this talk!

Bonus slides!

OpenGL resource leak checking

Intel i965 driver vs stack

Locating user data according to FreeDesktop.org guidelines

Thread priorities in Linux

Additional/new debug features

FREE!!!

OpenGL resource leak checking

Courtesy of Eric Lengyel & Fabian Giesenstatic void check_for_leaks(){ GLuint max_id = 10000; // better idea would be to keep track of assigned names. GLuint id; // if brute force doesn't work, you're not applying it hard enough for ( id = 1 ; id

advanced linux game programming

Software