Paris Game/AI Conference 2011

DESCRIPTION
Slides from the Paris Game/AI Conference 2011 talk by Neil Henning.

TRANSCRIPT

Preparing AI for Parallelism
Lessons from NASCAR The Game 2011
Neil Henning – Technology Lead
Paris Game AI Conference 2011
Neil Henning – [email protected]
Introduction
● I am sure some of you are wondering...
● Why a guy from Codeplay is doing a talk about NASCAR The Game 2011, which was developed by Eutechnyx
Introduction
● A team from Codeplay worked for 15 months on the game
● NASCAR isn’t just about driving straight, then turning left
● 43 cars on screen at the same time
● Cars race in tight packs on the circuit
● Overtaking is all about navigating through these packs
● Cannot simply make the AI use LODs – the cars are nearly always in view
Agenda
● How to prepare AI for parallelism
● …by investigating NASCAR The Game 2011's AI
Agenda
● During the investigation I will answer the questions:
● Why prepare your AI for parallelism?
● What changes should be made?
● How did these changes help when optimizing NASCAR?
● How did we make use of the PS3's unique hardware?
● What common issues are there?
● What performance improvement was achieved?
Why prepare your AI for parallelism?
● Without parallelism, tighter limits on number of bots
● Say we have four bots
● In serial – can easily fit in a frame
● Want to increase the bots by 3x?
● Have to either optimize or parallelize (or both)
● Split the work between threads
● Only possible with parallelism
Why prepare your AI for parallelism?
● Multicore is the future (has been for some time)
● Even the iPad uses dual-core processors now!
● Sony's new PS Vita is quad core
● This generation of consoles is multicore
● Being able to split work amongst cores is key
● Might not be required yet, but could be essential later
Why prepare your AI for parallelism?
● Helps during crunch time
● Have your AI prepared to become parallel
● Either optimize the engine or cut features
● Optimization is being sought throughout the engine
● The optimization folks will love you!
What changes should be made?
● Split work into manageable chunks
● In NASCAR, had 18 components for each car
● Components are in groups
(Diagram: component groups, e.g. Stay Behind, Stay Beside, Obstacle Detection, Driving Controllers)
What changes should be made?
● All components in a group can be run in parallel
● 43 cars = 43 AIs
What changes should be made?
● Each car’s groups can be run in parallel too
(Diagram: cars 0, 1, 2 … 42, each running its groups in parallel)
What changes should be made?
● Read/Write phases
● Two phases for your AI
● Read phase can read world/other car state
● Write phase can modify own car state
What changes should be made?
● Use temporary data to store read values from
environment
● In read phase, store needed reads into temporary data
● In write phase, read from the temporary data
● AI is one frame behind world events
● Effect on AI is minimal
What changes should be made?
● In NASCAR a read/write phase was used
(Diagram: Read Phase feeding into the Write Phase)
● Write phase uses data from the previous frame's read phase
● Minimal set of components in read/write phase group
● Only components that required world/other car state
What changes should be made?
● Remove large stack locals
● Having two or more threads means lots of duplicate locals

void func() {
    char localBuffer[1024];
    // … do something with localBuffer
}

● If func is called from many threads, many copies of this data are in use!
What changes should be made?
● Document code – describe relationship between data
struct Foo {
    Bar * bar; // one-to-one? one-to-many? many-to-one?
};

● Knowing how data is shared is critical for threading
● Documenting the relationship saves time and effort later
What common issues are there?
● Virtual functions – can have a high runtime cost
● ~500-1200 cycles on PowerPC if the virtual lookup misses cache
● Can equate to a large amount of time doing no work
What common issues are there?
● In NASCAR, components had virtual update method
● Based on the previous game (Supercar Challenge)
● 16 cars in the previous game, now 43 cars
● 5 component types previously, now 18 component types
● Now a read/write phase too
● 80 virtual calls to update became 1333 virtual calls!
What common issues are there?
● In NASCAR, components had virtual update method
● In real terms, 3ms of virtual function lookup per frame
● First optimization was to have typed buckets of components
● 1333 virtual calls went down to 31 virtual calls
● Platform agnostic (PS3, 360 and Wii all sped up)
What common issues are there?
● Virtual functions not just a code abstraction
● Virtual functions hide data too
● Not knowing the size of data kills SPU/Compute development

struct Foo { virtual void func(); };
struct Bar : public Foo { virtual void func(); };

Foo * foo;
foo->func();
// don't know the size of *foo! Could be sizeof(Foo) or sizeof(Bar)
What common issues are there?
● Naïve multithreading – locks galore
● Locks can be a solution, but be very careful with their use

void func() {
    lock->lock();
    // … do something
    lock->unlock();
}
● Read/write phases allow removal of most (if not all) locks
● Avoid/reduce/remove locks if possible
What common issues are there?
● Physics subsystem caused issues with NASCAR
● The AI required knowledge of obstacles
● The physics system's raycast was used to find problematic obstacles
● Each call to raycast took a mutex, so every thread would halt!
● Had to refactor the code to remove the need for locking
What common issues are there?
● Know your data – how is it accessed? Where is it shared?
struct RaceCar { Brain * brain; };
struct Brain { RaceCar * raceCar; Obstacle ** obstacles; };
struct Obstacle { BrainInterface * interface; };
struct BrainInterface { RaceCar * raceCar; Brain * brain; };
● Very easy for systems grown over time to have convoluted struct layouts
How did these changes help when optimizing NASCAR?
● Read/Write phase was key to performance on Xbox 360
● Allowed work to be split across all 6 hardware threads
● Each thread was given 1/6th of the cars to process
● Takes 2ms of all CPU resources on 360 in a frame
(Diagram: the threads synchronize at barriers)
How did these changes help when optimizing NASCAR?
● Tried the same approach on PS3
● Only 2 threads on PS3, but 6 sub-processors (the SPUs)
● Both threads on PS3 were completely full
● Any multithreading speedup has to be on the SPUs
● Code was ~2Mb and data was ~8Mb – far too large!
● Each SPU has 256kb local storage (for code & data)
● Unfeasible to mimic the 360 approach
● On PS3 the most costly components were targeted
How did these changes help when optimizing NASCAR?
● PS3 version relied on components being run in parallel
● And all components in a group being able to run in parallel
● Costly groups were made to use the SPUs
● Knowing relationship between data was key
● Well documented code made life so much easier!
How did we make use of the PS3's unique hardware?
● Codeplay was asked by Eutechnyx to optimize the AI
● Very tight deadlines, 1 month to reduce the time taken in AI
● No main thread time left – have to use the SPUs
● Our Offload compiler technology crucial
How did we make use of the PS3's unique hardware?
● For those unfamiliar with coding for the SPU…
● They are amazingly fast, if you code correctly for them
● Normally requires total rewrite of existing codebase
● Painful to access global variables
● Virtual functions are a complete write off
How did we make use of the PS3's unique hardware?
● SPU development typically takes many months
● Common to have 4-5 SPU programmers for ~10 months
● Not feasible for late-in-cycle development
● Offload aims to mitigate the issues with getting code onto the SPU
● Can offload code to SPU much quicker (typically a few man-days)
● Much easier to move existing code bases to the SPU
How did we make use of the PS3's unique hardware?
● Small language extension moves work from PPU to SPU
● Any work within an offload block is performed on the SPU

__blockingoffload() {
    // do some work on SPU, PPU waits for completion!
};

offloadThread_t handle = __offload() {
    // do some work on SPU!
};
// can do some work on PPU before waiting for the SPU
offloadThreadJoin(handle);
● All PPU code is duplicated for the SPU
How did we make use of the PS3's unique hardware?
● Offload allows access to global variables
● Just use them as normal!
int aGlobalVariable;
__blockingoffload() {
    int aLocalVariable = aGlobalVariable;
};
How did we make use of the PS3's unique hardware?
● Offload allows virtual function calls too
● Just have to specify which virtual functions may be called
struct Foo { virtual void bar() {} };
__blockingoffload[Foo::bar this]() {
    Foo foo;
    foo.bar();
};
How did we make use of the PS3's unique hardware?
● First, profiled the AI during a typical race
Driving Controllers
Obstacle Detection
Stay Behind Other Car
Stay Beside Other Car
● Four components taking most of the frame time
How did we make use of the PS3's unique hardware?
● Used four slightly different strategies when multithreading
Driving Controllers
Obstacle Detection
Stay Behind Other Car
Stay Beside Other Car
How did we make use of the PS3's unique hardware?
● Obstacle Detection only component in its group
Obstacle Detection
● Very inefficient code for the SPU, but moved 1/3 of it onto 4 SPUs
How did we make use of the PS3's unique hardware?
● Looked at Stay Behind/Beside Other Car together
Stay Behind Other Car
Stay Beside Other Car
● In the same group, can be run in parallel
How did we make use of the PS3's unique hardware?
● Moved Stay Behind component to SPU
Stay Behind Other Car
Stay Beside Other Car
● Stay Beside component would continue to be run on PPU
How did we make use of the PS3's unique hardware?
● As long as the SPU work took less time than the PPU work, there was no cost!
Stay Behind Other Car
Stay Beside Other Car
● Effectively ‘hid’ the cost of calculating the Stay Behind component
How did we make use of the PS3's unique hardware?
● Lastly, driving controllers took 1/3 of AI cost alone
Driving Controllers
● Split the cars across 4 SPUs, and ran in parallel
How did we make use of the PS3's unique hardware?
● In total ~170 lines of source code changed
● Changes were purely optimization

AIObstacle ** obstacles;
unsigned int numObstacles;

offloadThread_t handle = __offload(obstacles, numObstacles) {
    for (unsigned int i = 0; i < numObstacles; i++) {
        AIObstacle * obstacle = obstacles[i];
        // use obstacle for calculations
    }
};
How did we make use of the PS3's unique hardware?
● In total ~170 lines of source code changed
● Changes were purely optimization

// array of AIObstacle *'s in main memory
AIObstacle ** obstacles;
unsigned int numObstacles;

offloadThread_t handle = __offload(obstacles, numObstacles) {
    for (unsigned int i = 0; i < numObstacles; i++) {
        // AIObstacle * points to main memory
        AIObstacle * obstacle = obstacles[i];
        // use obstacle for calculations
    }
};
How did we make use of the PS3's unique hardware?
● In total ~170 lines of source code changed
● Changes were purely optimization

// array of AIObstacle *'s in main memory
AIObstacle ** obstacles;
unsigned int numObstacles;

offloadThread_t handle = __offload(obstacles, numObstacles) {
    CachedPointer<AIObstacle *> innerObstacles(obstacles, numObstacles);
    for (unsigned int i = 0; i < numObstacles; i++) {
        // AIObstacle * points to main memory
        CachedPointer<AIObstacle> obstacle(innerObstacles[i]);
        // use obstacle for calculations
    }
};
What performance improvement was achieved?
Obstacle Detection
● Obstacle detection went from 2ms -> 1.1ms
● ~100 lines of source code changed
● 2½ weeks development time
What performance improvement was achieved?
Stay Behind Other Car
Stay Beside Other Car
● Stay Behind went from 1.1ms -> 0ms (hidden behind the other)
● ~50 lines of source code changed
● 1 week development time
What performance improvement was achieved?
● Driving Controllers went from 4ms -> 0.6ms
Driving Controllers
● Driving Controllers went from 4ms -> 0.6ms
● ~20 lines of source code changed
● 8 hours development time
What performance improvement was achieved?
● Performance speaks for itself!
● 50% speed improvement on PS3
Takeaway
● It is possible to parallelise late in development
● But the code needs to be ready to be parallelised
● Small changes in coding style lead to hugely better results
● Better to plan systems from the beginning with multicore in mind
Questions?
Can also catch me on twitter @sheredom