
Upload: lee-hanxue

Post on 14-Jul-2015

TRANSCRIPT

Page 1: Coding for multiple cores

Coding for Multiple Cores

Bruce Dawson & Chuck Walbourn

Programmers

Game Technology Group

Page 2: Coding for multiple cores

Why multi-threading/multi-core?

Clock rates are stagnant

Future CPUs will be predominantly multi-thread/multi-core

Xbox 360 has 3 cores

PS3 will be multi-core

>70% of PC sales will be multi-core by end of 2006

Most Windows Vista systems will be multi-core

Two performance possibilities:

Single-threaded? Minimal performance growth

Multi-threaded? Exponential performance growth

Page 3: Coding for multiple cores

Design for Multithreading

Good design is critical

Bad multithreading can be worse than no multithreading

Deadlocks, synchronization bugs, poor performance, etc.

Page 4: Coding for multiple cores

Bad Multithreading

[Diagram: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5]

Page 5: Coding for multiple cores

Good Multithreading

[Diagram: Main Thread, Game Thread, Rendering Thread, Physics, Animation/Skinning, Particle Systems, Networking, File I/O]

Page 6: Coding for multiple cores

Another Paradigm: Cascades

[Diagram: Threads 1-5; stages Input, Physics, AI, Rendering, Present; Frames 1-4]

Advantages:

Synchronization points are few and well-defined

Disadvantages:

Increases latency (for constant frame rate)

Needs simple (one-way) data flow
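The cascade idea can be sketched roughly as follows. This is a minimal illustration, not the presenters' code: it uses standard C++ threads rather than the Win32 calls shown later in the deck, it collapses the pipeline to three stages, and `StageQueue`, `RunCascade`, and `kStopFrame` are hypothetical names. Each stage blocks on a queue fed by the previous stage, so data flows one way and the only synchronization points are the hand-offs.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical one-way hand-off queue between two pipeline stages.
class StageQueue {
public:
    void Push(int frame) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            frames_.push(frame);
        }
        ready_.notify_one();
    }
    // Blocks (no busy waiting) until a frame is available.
    int Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !frames_.empty(); });
        int frame = frames_.front();
        frames_.pop();
        return frame;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> frames_;
};

const int kStopFrame = -1;  // Sentinel that tells a stage to shut down.

// Pushes frameCount frames through input -> physics -> render and
// returns the order in which frames reached the last stage.
std::vector<int> RunCascade(int frameCount) {
    assert(frameCount >= 0);
    StageQueue toPhysics, toRender;
    std::vector<int> presented;  // Written only by the render thread.

    std::thread physics([&] {
        for (;;) {
            int frame = toPhysics.Pop();
            // Physics work for this frame would go here.
            toRender.Push(frame);  // Forward frames and the stop sentinel.
            if (frame == kStopFrame) return;
        }
    });
    std::thread render([&] {
        for (;;) {
            int frame = toRender.Pop();
            if (frame == kStopFrame) return;
            presented.push_back(frame);
        }
    });

    // The calling thread plays the role of the input stage.
    for (int frame = 0; frame < frameCount; ++frame)
        toPhysics.Push(frame);
    toPhysics.Push(kStopFrame);

    physics.join();
    render.join();
    return presented;  // Safe to read: both threads have exited.
}
```

Because each queue has one producer and one consumer, frames leave the pipeline in the order they entered it. The latency cost mentioned above shows up as the number of stages a frame must pass through before it is presented.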

Page 7: Coding for multiple cores

Typical Threaded Tasks

File Decompression

Rendering

Graphics Fluff

Physics

Page 8: Coding for multiple cores

File Decompression

Most common CPU heavy thread on the Xbox 360

Easy to multithread

Allows use of aggressive compression to improve load times

Don’t throw a thread at a problem better solved by offline processing

Texture compression, file packing, etc.

Page 9: Coding for multiple cores

Rendering

Separate update and render threads

Rendering on multiple threads (D3DCREATE_MULTITHREADED) works poorly

Exception: Xbox 360 command buffers

Special case of cascades paradigm

Pass render state from update to render

With constant workload gives same latency, better frame rate

With increased workload gives same frame rate, worse latency

Page 10: Coding for multiple cores

Graphics Fluff

Extra graphics that doesn't affect play

Procedurally generated animating cloud textures

Cloth simulations

Dynamic ambient occlusion

Procedurally generated vegetation, etc.

Extra particles, better particle physics, etc.

Easy to synchronize

Potentially expensive, but if the core is otherwise idle...?

Page 11: Coding for multiple cores

Physics?

Could cascade from update to physics to rendering

Makes use of three threads

May be too much latency

Could run physics on many threads

Uses many threads while doing physics

May leave threads mostly idle elsewhere

Page 12: Coding for multiple cores

Overcommitted Multithreading?

[Diagram: Game Thread, Rendering Thread, Physics, Animation/Skinning, Particle Systems]

Page 13: Coding for multiple cores

How Many Threads?

No more than one CPU intensive software thread per core

3-6 on Xbox 360

1-? on PC (1-4 for now, need to query)

Too many busy threads adds complexity, and lowers performance

Context switches are not free

Can have many non-CPU intensive threads

I/O threads that block, or intermittent tasks
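One way to "query" on a PC is sketched below. It is only an illustration under assumptions: it uses the standard C++ `std::thread::hardware_concurrency()` call rather than a Win32 API, and `ExtraHeavyThreads` and `ExtraHeavyThreadsForThisMachine` are hypothetical helper names. The reported count includes SMT logical processors, so it can overstate the number of physical cores.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: how many CPU-heavy worker threads to create in
// addition to the main thread, given the reported hardware thread count.
// A report of 0 means "unknown" and is treated like a single core.
unsigned ExtraHeavyThreads(unsigned reportedHardwareThreads) {
    unsigned total = (reportedHardwareThreads == 0) ? 1 : reportedHardwareThreads;
    assert(total >= 1);  // Guarantees the subtraction cannot underflow.
    return total - 1;    // The main thread already occupies one.
}

// Queries the current machine. The value counts logical processors,
// so SMT (Hyper-Threading) siblings are included.
unsigned ExtraHeavyThreadsForThisMachine() {
    return ExtraHeavyThreads(std::thread::hardware_concurrency());
}
```

Blocking I/O threads and intermittent tasks would be created on top of this budget, since they spend most of their time waiting.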

Page 14: Coding for multiple cores

Simultaneous Multi-Threading

Be careful with Simultaneous Multi-Threading (SMT) threads

Not the same as double the number of cores

Can give a small perf boost

Can cause a perf drop

Can avoid scheduler latency

Ideally one heavy thread per core plus some additional intermittent threads

Page 15: Coding for multiple cores

Case Study: Kameo (Xbox 360)

Started single threaded

Rendering was taking half of time—put on separate thread

Two render-description buffers created to communicate from update to render

Linear read/write access for best cache usage

Doesn't copy const data

File I/O and decompress on other threads
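A rough sketch of the two-buffer hand-off from update to render follows. It is not Kameo's implementation: it uses standard C++ threads, and `RenderDescription`, `DoubleBuffer`, and `RunFrames` are hypothetical names. The update thread fills one buffer while the render thread reads the other, and each buffer carries a flag that says who currently owns it.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical per-frame description of what to draw.
struct RenderDescription {
    std::vector<int> drawCalls;
    bool readyToRender = false;  // true: filled by update, not yet rendered
};

struct DoubleBuffer {
    RenderDescription buffers[2];
    std::mutex mutex;               // Protects the readyToRender flags only.
    std::condition_variable changed;
};

// Runs frameCount frames and returns the number of draw calls rendered.
int RunFrames(int frameCount) {
    assert(frameCount >= 0);
    DoubleBuffer db;
    int renderedDrawCalls = 0;  // Written only by the render thread.

    std::thread render([&] {
        for (int frame = 0; frame < frameCount; ++frame) {
            RenderDescription& buf = db.buffers[frame % 2];
            {
                std::unique_lock<std::mutex> lock(db.mutex);
                db.changed.wait(lock, [&] { return buf.readyToRender; });
            }
            // Read the buffer linearly, outside the lock.
            renderedDrawCalls += static_cast<int>(buf.drawCalls.size());
            {
                std::lock_guard<std::mutex> lock(db.mutex);
                buf.readyToRender = false;  // Hand the buffer back to update.
            }
            db.changed.notify_all();
        }
    });

    // The calling thread plays the role of the update thread.
    for (int frame = 0; frame < frameCount; ++frame) {
        RenderDescription& buf = db.buffers[frame % 2];
        {
            std::unique_lock<std::mutex> lock(db.mutex);
            db.changed.wait(lock, [&] { return !buf.readyToRender; });
        }
        buf.drawCalls.assign(3, frame);  // Pretend each frame has 3 draw calls.
        {
            std::lock_guard<std::mutex> lock(db.mutex);
            buf.readyToRender = true;  // Hand the buffer to render.
        }
        db.changed.notify_all();
    }

    render.join();
    return renderedDrawCalls;
}
```

The lock is held only while a flag changes hands, never while a buffer is being filled or drawn, so update can run at most one frame ahead of render without the two threads contending on the data itself.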

Page 16: Coding for multiple cores

Separate Rendering Thread

Update Thread

Buffer 1

Render Thread

Buffer 0

Page 17: Coding for multiple cores

Case Study: Kameo (Xbox 360)

Core  Thread  Software threads
0     0       Game update
0     1       File I/O
1     0       Rendering
1     1       -
2     0       XAudio
2     1       File decompression

Total usage was ~2.2-2.5 cores

Page 18: Coding for multiple cores

Case Study: Project Gotham Racing

Core  Thread  Software threads
0     0       Update, physics, rendering, UI
0     1       Audio update, networking
1     0       Crowd update, texture decompression
1     1       Texture decompression
2     0       XAudio
2     1       -

Total usage was ~2.0-3.0 cores

Page 19: Coding for multiple cores

Managing Your Threads

Creating threads

Synchronizing

Terminating

Don't use TerminateThread()

Bad idea on Windows: leaves the process in an indeterminate state, doesn't allow clean-up, etc.

Unavailable on Xbox 360

Instead return from your thread function, or call ExitThread

Page 20: Coding for multiple cores

Creating Threads Poorly

const int stackSize = 0;
HANDLE hThread = CreateThread(0, stackSize, ThreadFunctionBad, 0, 0, 0);
// Do work on main thread here.
for (;;) {
    // Wait for child thread to complete
    DWORD exitCode;
    GetExitCodeThread(hThread, &exitCode);
    if (exitCode != STILL_ACTIVE)
        break;
}

...

DWORD __stdcall ThreadFunctionBad(void* data)
{
#ifdef WIN32
    SetThreadAffinityMask(GetCurrentThread(), 8);
#endif
    // Do child thread work here.
    return 0;
}

CreateThread doesn't initialize C runtime

Stack size of zero means inherit parent's stack size

Busy waiting is bad!

Don't forget to close this when done with it

Be careful with thread affinities on Windows

Page 21: Coding for multiple cores

Creating Threads Well

const int stackSize = 65536;
HANDLE hThread = (HANDLE)_beginthreadex(0, stackSize, ThreadFunction, 0, 0, 0);
// Do work on main thread here.
// Wait for child thread to complete
WaitForSingleObject(hThread, INFINITE);
CloseHandle(hThread);

...

unsigned __stdcall ThreadFunction(void* data)
{
#ifdef XBOX
    // On Xbox 360 you must explicitly assign
    // software threads to hardware threads.
    XSetThreadProcessor(GetCurrentThread(), 2);
#endif
    // Do child thread work here.
    return 0;
}

_beginthreadex initializes CRT

Specify stack size on Xbox 360

The correct way to wait for a thread to exit

Don't forget to close this when done with it

Thread affinities must be specified on Xbox 360

Page 22: Coding for multiple cores

Alternative: OpenMP

Available in VC++ 2005

Simple way to parallelize loops and some other constructs

Works best on long symmetric tasks—particles?

Game tasks are short—16.6 ms

Many game tasks are not symmetric

OpenMP is nice, but not ideal
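For the particle case mentioned above, an OpenMP loop might look like the sketch below. `Particle` and `UpdateParticles` are hypothetical names, and the physics is deliberately trivial. Every iteration touches only its own particle, which is what makes the loop safe to split across threads. If the compiler is not invoked with its OpenMP switch, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical particle with one-dimensional motion.
struct Particle {
    float position;
    float velocity;
};

// Advances every particle by dt seconds.
void UpdateParticles(std::vector<Particle>& particles, float dt) {
    // OpenMP 2.0 (the version in VC++ 2005) requires a signed integer
    // loop index, so the count must fit in an int.
    assert(particles.size() <= static_cast<std::size_t>(INT_MAX));
    const int count = static_cast<int>(particles.size());

    // Iterations are independent, so they can be divided among threads.
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        particles[i].position += particles[i].velocity * dt;
    }
}
```

The fork and join around the loop have a fixed cost, which is one reason short or uneven game tasks benefit less than long, symmetric ones.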

Page 23: Coding for multiple cores

Available Synchronization Objects

Events

Semaphores

Mutexes

Critical Sections

Don't use SuspendThread()

Some titles have used this for synchronization

Can easily lead to deadlocks

Interacts badly with Visual Studio debugger

Page 24: Coding for multiple cores

Exclusive Access: Mutex

// Initialize
HANDLE mutex = CreateMutex(0, FALSE, 0);

// Use
void ManipulateSharedData() {
    WaitForSingleObject(mutex, INFINITE);
    // Manipulate stuff...
    ReleaseMutex(mutex);
}

// Destroy
CloseHandle(mutex);

Page 25: Coding for multiple cores

Exclusive Access: CRITICAL_SECTION

// Initialize
CRITICAL_SECTION cs;
InitializeCriticalSection(&cs);

// Use
void ManipulateSharedData() {
    EnterCriticalSection(&cs);
    // Manipulate stuff...
    LeaveCriticalSection(&cs);
}

// Destroy
DeleteCriticalSection(&cs);
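The same exclusive-access pattern can be written with a scoped lock, so the lock is released on every exit path, including early returns. The sketch below swaps the Win32 objects for the standard C++ `std::mutex` and `std::lock_guard`. The names `g_sharedMutex`, `g_sharedCounter`, and `RunWorkers` are hypothetical, and the shared data is reduced to a single counter.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::mutex g_sharedMutex;  // Guards g_sharedCounter.
int g_sharedCounter = 0;

void ManipulateSharedData() {
    // The guard locks here and unlocks when it goes out of scope.
    std::lock_guard<std::mutex> lock(g_sharedMutex);
    ++g_sharedCounter;  // Manipulate stuff...
}

// Starts threadCount workers that each call ManipulateSharedData()
// callsPerThread times, waits for them, and returns the final count.
int RunWorkers(int threadCount, int callsPerThread) {
    assert(threadCount >= 0 && callsPerThread >= 0);
    g_sharedCounter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([callsPerThread] {
            for (int i = 0; i < callsPerThread; ++i)
                ManipulateSharedData();
        });
    }
    for (std::thread& worker : workers)
        worker.join();
    return g_sharedCounter;  // All writers have exited.
}
```

Locking around a single increment like this is exactly the frequent, fine-grained synchronization that real code should try to avoid. It is used here only to keep the example small.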

Page 26: Coding for multiple cores

Lockless programming

Trendy technique to use clever programming to share resources without locking

Includes InterlockedXXX(), lockless message passing, Double Checked Locking, etc.

Very hard to get right:

Compiler can reorder instructions

CPU can reorder instructions

CPU can reorder reads and writes

Not as fast as avoiding synchronization entirely

Page 27: Coding for multiple cores

Lockless Messages: Buggy

void SendMessage(void* input) {
    // Wait for the message to be 'empty'.
    while (g_msg.filled)
        ;
    memcpy(g_msg.data, input, MESSAGESIZE);
    g_msg.filled = true;
}

void GetMessage() {
    // Wait for the message to be 'filled'.
    while (!g_msg.filled)
        ;
    memcpy(localMsg.data, g_msg.data, MESSAGESIZE);
    g_msg.filled = false;
}
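The slide does not show a fix, so the following is only one possible repair, written with the C++11 `std::atomic` library rather than anything available to the original code. It assumes exactly one sending thread and one receiving thread. `SendMessageSafe`, `GetMessageSafe`, and the value of `MESSAGESIZE` are hypothetical. The release store on `filled` keeps the `memcpy` from being reordered after it, and the acquire load on the other side keeps the reads from being reordered before it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>

const std::size_t MESSAGESIZE = 64;  // Assumed size for this sketch.

struct Message {
    char data[MESSAGESIZE];
    std::atomic<bool> filled{false};
};

Message g_msg;

// Single producer only.
void SendMessageSafe(const void* input) {
    assert(input != nullptr);
    // Wait for the message to be 'empty'. The acquire load orders this
    // thread's later writes after the receiver's reads of the old data.
    while (g_msg.filled.load(std::memory_order_acquire))
        std::this_thread::yield();
    std::memcpy(g_msg.data, input, MESSAGESIZE);
    // Release: the copied bytes become visible before 'filled' does.
    g_msg.filled.store(true, std::memory_order_release);
}

// Single consumer only.
void GetMessageSafe(void* output) {
    assert(output != nullptr);
    // Wait for the message to be 'filled'.
    while (!g_msg.filled.load(std::memory_order_acquire))
        std::this_thread::yield();
    std::memcpy(output, g_msg.data, MESSAGESIZE);
    // Release: the reads above complete before the slot is marked empty.
    g_msg.filled.store(false, std::memory_order_release);
}
```

This version still spins while it waits, so it addresses the reordering problem but not the busy-waiting cost. A blocking primitive is usually the simpler choice unless the wait is known to be very short.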

Page 28: Coding for multiple cores

Synchronization tips/costs:

Synchronization is moderately expensive when there is no contention

Hundreds to thousands of cycles

Synchronization can be arbitrarily expensive when there is contention!

Goals:

Synchronize rarely

Hold locks briefly

Minimize shared data

Page 29: Coding for multiple cores

Beware hidden synchronization:

Allocations are (generally) a synch point

Consider per-thread heaps with no locking

HEAP_NO_SERIALIZE flag avoids lock on Win32 heaps

Consider custom single-purpose allocators

Consider avoiding memory allocations!

Avoid synch in in-house profilers

D3DCREATE_MULTITHREADED causes synchronization on almost every Direct3D call

Page 30: Coding for multiple cores

Threading File I/O & Decompression

First: use large reads and asynchronous I/O

Then: consider compression to accelerate loading

Don't do format conversions etc. that are better done at build time!

Have resource proxies to allow rendering to continue

Page 31: Coding for multiple cores

File I/O Implementation Details

vector<Resource*> g_resources;

Worst design: decompressor locks g_resources while decompressing

Better design: decompressor adds resources to vector after decompressing

Still requires renderer to synch on every resource access

Best design: two Resource* vectors

Renderer has private vector, no locking required

Decompressor uses shared vector, syncs when adding new Resource*

Renderer moves Resource* from shared to private vector once per frame
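The "best design" can be sketched as below, assuming standard C++ `std::mutex` in place of a Win32 lock. `Resource` is reduced to a placeholder struct, and `g_resourceMutex`, `g_sharedResources`, `g_renderResources`, `PublishResource`, and `CollectNewResources` are hypothetical names. The renderer takes the lock once per frame, and only for as long as a vector swap takes.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

struct Resource { int id; };  // Placeholder for a decompressed resource.

std::mutex g_resourceMutex;                // Guards g_sharedResources only.
std::vector<Resource*> g_sharedResources;  // Written by the decompressor.
std::vector<Resource*> g_renderResources;  // Private to the renderer.

// Decompressor thread: called after a resource is fully decompressed.
void PublishResource(Resource* resource) {
    assert(resource != nullptr);
    std::lock_guard<std::mutex> lock(g_resourceMutex);
    g_sharedResources.push_back(resource);
}

// Render thread: called once per frame.
void CollectNewResources() {
    std::vector<Resource*> incoming;
    {
        // Hold the lock just long enough to take the pending pointers.
        std::lock_guard<std::mutex> lock(g_resourceMutex);
        incoming.swap(g_sharedResources);
    }
    // No lock needed: only the render thread touches this vector.
    g_renderResources.insert(g_renderResources.end(),
                             incoming.begin(), incoming.end());
}
```

For the rest of the frame the renderer reads `g_renderResources` without any synchronization, which is the point of keeping two vectors.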

Page 32: Coding for multiple cores

Profiling multi-threaded apps

Need thread-aware profilers

Profiling may hide many synchronization stalls

Home-grown spin locks make profiling harder

Consider instrumenting calls to synchronization functions

Don't use locks in instrumentation—use TLS variables to store results

Windows: Intel VTune, AMD CodeAnalyst, and the Visual Studio Team System Profiler

Xbox 360: PIX, XbPerfView, etc.
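Instrumenting synchronization calls without adding locks might look like the sketch below. It uses the standard C++ `thread_local` keyword as a stand-in for `__declspec(thread)`, and `std::mutex` as a stand-in for whatever lock the game uses. `SyncStats`, `t_syncStats`, and `InstrumentedLock` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>

// Per-thread totals: each thread updates only its own copy, so recording
// a measurement never needs a lock.
struct SyncStats {
    long long lockCalls = 0;
    std::chrono::nanoseconds timeWaiting{0};
};

thread_local SyncStats t_syncStats;

// Drop-in wrapper around mutex.lock() that records how long the
// calling thread waited to acquire the lock.
void InstrumentedLock(std::mutex& mutex) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    mutex.lock();
    const auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
        Clock::now() - start);
    assert(elapsed.count() >= 0);  // steady_clock never goes backwards.
    t_syncStats.timeWaiting += elapsed;
    ++t_syncStats.lockCalls;
}
```

Each thread would report or merge its own `t_syncStats` at a quiet point such as the end of a frame, which keeps the measurement from adding contention of its own.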

Page 33: Coding for multiple cores

PIX timing capture

Page 34: Coding for multiple cores

Naming Threads

typedef struct tagTHREADNAME_INFO {
    DWORD dwType;      // must be 0x1000
    LPCSTR szName;     // pointer to name (in user addr space)
    DWORD dwThreadID;  // thread ID (-1=caller thread)
    DWORD dwFlags;     // reserved for future use, must be zero
} THREADNAME_INFO;

void SetThreadName(DWORD dwThreadID, LPCSTR szThreadName) {
    THREADNAME_INFO info;
    info.dwType = 0x1000;
    info.szName = szThreadName;
    info.dwThreadID = dwThreadID;
    info.dwFlags = 0;

    __try {
        RaiseException(0x406D1388, 0, sizeof(info)/sizeof(DWORD), (DWORD*)&info);
    } __except(EXCEPTION_CONTINUE_EXECUTION) {
    }
}

SetThreadName(-1, "Main thread");

Page 35: Coding for multiple cores

Other Ideas

Debugging tips for MT

Visual Studio does support multi-threaded debugging

Use threads window

Use @hwthread in watch window on Xbox 360

KD and WinDBG support multi-threaded debugging

Thread Local Storage (TLS)

__declspec(thread) declares per-thread variables

But doesn't work in dynamically loaded DLLs

TlsAlloc is less efficient, less convenient, but works in dynamically loaded DLLs

Page 36: Coding for multiple cores

Windows tips

Avoid using D3DCREATE_MULTITHREADED

It’s easy, it works, it’s really really slow

Best to do all calls to Direct3D from a single thread

Could pass off locked resource pointers to a queue for loading threads to work with

Test on multiple machines and configurations

Single-core, SMT (i.e. Hyper-Threading), Dual-core, Intel and AMD chips, Multi-socket multicore (4+ cores)

Page 37: Coding for multiple cores

Windows API features

WaitForMultipleObjects

Obviously better than a series of WaitForSingleObject calls

The OS is highly optimized around multithreading and event-based blocking

I/O Completion Ports

Very efficient way to have the OS assign a pool of worker threads to incoming I/O requests

Useful construct for implementing a game server

Page 38: Coding for multiple cores

SMT versus Multicore

OS returns number of logical processors in GetSystemInfo(), so a 2 could mean an SMT machine with only 1 actual core, or 2 cores

Detailed Win32 APIs exposing this distinction not available until Windows XP x64, Windows Server 2003 SP1, Windows Vista, etc.

GetLogicalProcessorInformation()

For now you have to use CPUID detailed by Intel and AMD to parse this out…

Page 39: Coding for multiple cores

Timing with Multiple Cores

RDTSC is not always synced between cores!

As your thread moves from core to core, results of RDTSC counter deltas may be nonsense

CPU frequency itself can change at run-time through speed step technologies

See Power Management APIs for more information

Best thing to do is use Win32 API QueryPerformanceCounter / QueryPerformanceFrequency

See DirectX SDK article Game Timing and Multiple Cores
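A small timing helper along these lines is sketched below. `SecondsNow` is a hypothetical name. On Windows it calls QueryPerformanceCounter and QueryPerformanceFrequency as recommended above. On other platforms it falls back to the standard C++ `std::chrono::steady_clock`, which is an assumption added here so the sketch stays portable.

```cpp
#include <cassert>
#include <chrono>
#ifdef _WIN32
#include <windows.h>
#endif

// Returns a timestamp in seconds, measured from an arbitrary starting
// point. Only differences between two calls are meaningful.
double SecondsNow() {
#ifdef _WIN32
    LARGE_INTEGER frequency;
    LARGE_INTEGER counter;
    QueryPerformanceFrequency(&frequency);  // Counts per second.
    QueryPerformanceCounter(&counter);      // Current count.
    assert(frequency.QuadPart > 0);
    return static_cast<double>(counter.QuadPart) /
           static_cast<double>(frequency.QuadPart);
#else
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
#endif
}
```

A frame timer would call `SecondsNow()` at the start and end of the frame and subtract. In real code the frequency would normally be queried once at startup and cached rather than on every call.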

Page 40: Coding for multiple cores

Thread Micromanagement

Use SetThreadAffinityMask with caution!

May be useful for assigning ‘heavy’ work threads

This mask is technically a hint, not a commitment

RDTSC-based instrumenting will require locking the game threads to a single core

Otherwise let the Windows scheduler do the right thing

CreateDevice/Reset might have a side-effect on the calling thread’s affinity with software vertex processing enabled

Page 41: Coding for multiple cores

Thread Micromanagement (cont)

Be careful about boosting thread priority

If the priority is too high, you could cause the system to hang and become unresponsive

If the priority is too low, the thread may starve

Page 42: Coding for multiple cores

DLLs and Multithreading

DllMain for every DLL is informed of thread creation/destruction

For some DLLs this is required to initialize TLS

For many this is a waste of time, so call DisableThreadLibraryCalls() from your DllMain during process creation (DLL_PROCESS_ATTACH)

The OS serializes access to the entry point

This means threads created during DllMain won’t start for a while, so don’t wait on them in the DLL startup

Page 43: Coding for multiple cores

Resources

Multithreading Applications in Win32, Jim Beveridge & Robert Weiner, Addison-Wesley, 1997

Multiprocessor Considerations for Kernel-Mode Drivers
http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/MP_issues.doc

Determining Logical Processors per Physical Processor
http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/knowledgebase/43842.htm

GetLogicalProcessorInformation
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/getlogicalprocessorinformation.asp

Double checked locking
http://en.wikipedia.org/wiki/Double-checked_locking

Page 44: Coding for multiple cores

Resources

GDC 2006 Presentations
http://msdn.com/directx/presentations

DirectX Developer Center
http://msdn.com/directx

XNA Developer Center
http://msdn.com/xna

Xbox Developer Center (Registered Devs Only)
https://xds.xbox.com

XNA, DirectX, XACT Forums
http://msdn.com/directx/forums

Email [email protected] (DirectX Feedback)

[email protected] (Xbox Developers Only)

[email protected] (XNA Feedback)

Page 45: Coding for multiple cores

© 2006 Microsoft Corporation. All rights reserved.

Microsoft, DirectX, Xbox 360, the Xbox logo, and XNA are either registered trademarks or trademarks of Microsoft Corporation in the United States and / or other countries.

This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.