QuakeTM: Parallelizing a Complex Serial Application
Using Transactional Memory
Vladimir Gajinov1,2, Ferad Zyulkyarov1,2, Osman S. Unsal1, Adrián Cristal1, Eduard Ayguadé1,2, Tim Harris3, Mateo Valero1,2
1Barcelona Supercomputing Center
2Universitat Politècnica de Catalunya
3Microsoft Research
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
Introduction
• Topic of this work: parallelization of the Quake server.
• What is Quake? A first-person shooter game.
  – A sequential application.
• Requirements of a sequential game server:
  – Close to instantaneous control of player actions.
  – High degree of interaction among players in a detailed 3D virtual world.
• CPU processing is the bottleneck.
• Method: OpenMP + Transactional Memory.
Background
• OpenMP:
  – API for shared-memory parallel programming in C/C++ and Fortran.
  – Compiler directives and library routines.
  – Fork-join parallelism.
• Transactional Memory (TM):
  – A concurrency control mechanism.
  – A series of reads and writes to shared memory is handled atomically.
  – When successful, a transaction commits; otherwise it aborts.
Motivation
• Just a few TM applications are available:
  – STAMP, the Haskell STM benchmark, RMS-TM, …
  – There is a clear need for more complex applications.
• Contribution: parallelization of a complex sequential application using TM.
• Question: is it possible to achieve fine-grained locking performance with a coarse-grained parallelization effort?
• Testing TM programmability:
  – Start with a coarse-grained approach.
  – Test the performance.
  – Determine the problems.
  – Compare with a fine-grained approach.
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
Quake Organization
• Typical client-server architecture.
• Server:
  – Maintains the consistency of the game world.
  – Handles the coordination among clients.
• Clients:
  – Update graphics.
  – Implement user-interface operations.
The Server
• The main server task: computing a new frame.
• Frame execution diagram (figure): receive (Rx), Read, and Request
  Processing repeat while SELECT finds more packets; then Physics
  Update, Reply, and transmit (Tx).
• Execution breakdown, sequential server with 8 connected clients
  (figure): request processing dominates at 87.8%; the other stages
  shown take 2.1% and 3.1%.
• We concentrate on the request processing stage.
Quake Map
• A 3D volume in a 3D coordinate space.
• Represented as a binary space partition (BSP) tree, which is too
  fine-grained and inefficient for locating entities.
Areanode tree (figure: top view, levels 1-5):
• A balanced binary tree.
• Each 3D point in the map must be either in an areanode that is a
  leaf or in a division plane.
• Areanodes maintain a list of game objects (entities).
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
Parallelization
• Only the request processing stage is parallelized.
• OpenMP is used to start parallel execution.
• Transactions are used for synchronization.
• Coarse-grained approach, compared with the fine-grained
  implementation of Atomic Quake [PPoPP 2009].
• Application characteristics:
  – Coarse-grained: 8 TM blocks; big read and write sets; long
    transactions; 35.3% abort rate.
  – Fine-grained: 58 TM blocks; 4.1% abort rate.
Shared Data
• Three types of shared data structures:
  – Areanode tree
  – Game objects
  – Message buffers:
    • Common global state buffer
    • Per-player reply buffers
• Sharing is most intensive inside the request processing stage.
Client Requests
Two types of requests:
• Connection-related messages:
  – Associated with the connection and disconnection protocols, used
    when a client wants to join or leave the server game session, and
    with other facilities that do not affect gameplay.
• Gameplay messages:
  – The most important type of request.
  – Model the player's interaction with the game world.
  – The most frequently used: the MOVE command.
Pseudocode for the request processing stage

Sequential:

    while (NET_GetPacket ()) {
        // Filter packets
        if (connection related packet) {
            SV_ConnectionlessPacket ();
            continue;
        }
        // Gameplay packets
        for (i = 0; i < MAX_CLIENTS; i++) {
            // Do some checking here
            SV_ExecuteClientMessage ();
        }
    }

Parallel (packets are first collected, then processed as tasks):

    while (NET_GetPacket ()) {
        // Filter packets
        if (connection related packet) {
            SV_ConnectionlessPacket ();
            continue;
        }
        AddPacketToList ();
        CopyBuffer ();
    }

    #pragma intel omp parallel taskq shared(packetlist, ...)
    {
        while (packetlist != NULL) {
            #pragma intel omp task captureprivate(packetlist)
            {
                NET_Message_Init (..);
                // Check for packets from connected clients
                for (i = 0, cl = svs.clients; i < MAX_CLIENTS; i++, cl++) {
                    // Do some checking here
                    SV_ExecuteClientMessage (cl);
                }
            }
            packetlist = packetlist->next;
        }
    }
The Move Command
Parameters:
• Player's origin
• View angles
• Motion indicators
• Time to run
Execution:
1. Construct the bounding box.
2. Traverse the areanode tree.
3. Find the objects contained in the bounding box and associate them
   with the command.
4. Simulate the move.
5. Remove the player from the old position and add him to the new one.
Move Command Execution
The stages of a move command are grouped into four transactions
(T1-T4, see figure):
• ClientPhysics: client's physics update.
• ClientThink: execute actions registered in previous frames.
• PmoveInit: pmove (player move) structure initialization.
• AddLinksToPmove: determines which entities could be affected by the
  current move command.
• PlayerMove: constructs a trajectory line and determines the client's
  final position.
• LinkEntity: re-links the player's entity to the new position in the
  areanode tree.
• PlayerTouch: models the influence on the other game objects.
ReachPoints
A profiling aid: per-thread counters incremented at chosen points
inside transactions.

    int reachpoints[NumThreads][x*16];

    TM_PURE
    void PointReached (int check) {
        reachpoints[ThreadId][check]++;
    }

    int main () {
        . . .
        TRANSACTION
            PointReached (1);
            statement_1;
            PointReached (2);
        TRANSACTION_END
        . . .
    }

Helps to:
• Identify thread-private variables.
• Discover where transactions abort.
• Discover the causes of the aborts.
• Discover TM false sharing conflicts (conflict management
  granularity).
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
Evaluation
• TraceBot:
  – An automatic trace client.
  – Behavior is controlled by a finite state machine.
• VideoClient:
  – A normal graphical client, used for proving correctness and for
    trace creation.
• The server runs on one machine, the clients on another.
  – Server: 8 cores (4 x dual-core 64-bit Intel Xeon).
• Frame execution time is the performance measure.
• Prototype version 3.0 of the Intel STM C/C++ compiler:
  – In-place updates.
  – Cache-line granularity conflict detection.
  – Transactions validate the read set at commit time and, if
    necessary, during read operations.
  – Function annotations: tm_callable, tm_pure and tm_unknown.
  – Closed nesting with flattening.
Results: Normalized Average Frame Execution Times (coarse-grained)
(Figure: normalized execution time vs. number of clients (1, 2, 4, 8,
16) for the serial, global_lock and TM_coarse configurations.)
• The baseline is always the average frame execution time of the
  sequential server for the respective number of clients.
• The TM version's overhead is 3.5x-6x: too high.
• More than 85% of the time is spent in critical sections.
Results: Performance of Coarse-Grained Configurations
(Figure, left: comparative performance of the parallel
configurations; frame time [ms] for 1, 2, 4 and 8 threads with 1, 2,
4, 8 and 16 clients, comparing global_lock and TM_coarse. Figure,
right: transactional server running with 16 clients, speedup and
scalability; TM_coarse speedup with 2, 4 and 8 threads, and average
frame time [ms] vs. thread count for global_lock and TM_coarse.)
Transactional Statistics: Coarse-Grained
TM server running with 8 threads.

  Clients  Transactions  Aborts  Abort rate [%]          Mean [KB]  Max [KB]  Total [MB]
  1        34754         0       0.0             Reads   3.0        104       105
                                                 Writes  0.6        17        20
  2        95980         1970    2.1             Reads   2.8        863       263
                                                 Writes  0.6        164       55
  4        179241        10820   6.0             Reads   3.4        1413      570
                                                 Writes  0.6        269       108
  8        364305        76560   21.0            Reads   4.2        1478      1207
                                                 Writes  0.8        251       216
  16       524561        184992  35.3            Reads   5.1        1704      1725
                                                 Writes  0.9        262       296

The abort rate is significant.
The Overhead Breakdown
Multithreaded execution: 8 threads, 16 clients.

  TM     Total          Instrumentation time    Abort overhead          Abort
  block  [10^9 cycles]  [10^9 cycles]  [%]      [10^9 cycles]  [%]      rate [%]
  1      13.5           10.3           75.8     3.3            24.2     19.5
  2      9.5            9.0            94.1     0.6            5.9      18.0
  3      17.2           15.1           87.9     2.1            12.1     52.7
  4      11.6           10.9           94.3     0.7            5.7      22.4
  5      5.9            3.2            53.7     2.8            46.3     61.1
  all    57.9           48.5           83.8     9.4            16.2     35.2

• We have limited possibilities for profiling.
• The TM instrumentation overhead appears to be the dominant cost.
Results: Normalized Average Frame Execution Times (fine-grained)
(Figure: normalized execution time vs. number of clients (1, 2, 4, 8,
16) for the serial, lock_fine and TM_fine configurations.)
• The TM version's overhead is 2.4x-3x.
Results: Performance of Fine-Grained Configurations
(Figure, left: comparative performance of the parallel
configurations; frame time [ms] for 1, 2, 4 and 8 threads with 1, 2,
4, 8 and 16 clients, comparing lock_fine and TM_fine. Figure, right:
transactional server running with 16 clients, speedup and
scalability; lock_fine and TM_fine speedup with 2, 4 and 8 threads,
and average frame time [ms] vs. thread count for global_lock,
lock_fine, TM_coarse and TM_fine.)
Transactional Statistics: Fine-Grained
TM server running with 8 threads.

  Clients  Transactions  Aborts  Abort rate [%]          Mean [B]  Max [B]  Total [MB]
  1        190206        0       0.0             Reads   65.1      58511    12
                                                 Writes  5.2       20102    1
  2        367118        826     0.2             Reads   66.0      62728    25
                                                 Writes  5.7       24397    2
  4        655020        4165    0.6             Reads   83.7      80275    55
                                                 Writes  8.2       39726    5
  8        1439874       20593   1.4             Reads   102.5     102470   145
                                                 Writes  9.6       57552    14
  16       3226759       131814  4.1             Reads   133.3     231593   192
                                                 Writes  15.5      211651   22
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
QuakeTM Characteristics
• 27,600 lines of code in 49 files.
• Configurable with macros:
  – Synchronization, granularity, nesting, TM implementation.
• Coarse-grained setup: 8 critical regions (TM or a global lock).
• Fine-grained setup: 58 critical regions (TM or fine-grained locks).
• Available at www.bscmsrc.eu
Conclusion
• The transactional overhead is excessive: a 6x slowdown and a 35.3%
  abort rate.
• A coarse-grained approach is not a good option for current STM
  systems.
• The parallelization required a significant investment of programmer
  time (10 man-months).
• A fine-grained approach may be the only solution.
Questions?
Thank you!
Download QuakeTM: www.bscmsrc.eu
Intel Compiler
• Single-lock atomicity semantics and weak atomicity guarantees.
  – As opposed to strongly atomic semantics, where non-transactional
    accesses are treated as implicit single-operation transactions.
Atomic Quake
• The main objective was to evaluate the effort of replacing locks
  with transactions.
• The lock parallelization is not block-structured, which required
  code reorganization to fit the TM model.
• The second problem was avoiding I/O operations inside transactions,
  which is not an issue in a lock-based system.
• Finally, a big fraction of the development time was spent
  understanding how locks are associated with variables and getting a
  grip on the locking strategy.
Atomic Quake 2
• Thread-private data: calls to get_specific.
• Condition variables: no retry construct available.
• I/O in transactions: handled with tm_pure.
• A proposition for error handling:
  – When an error happens, commit the transaction and handle the
    error outside the atomic block.
• Privatization examples:
  – A custom memory manager allocates a block of memory for string
    operations.
• TM fits well for guarding access to different shared data
  (previously guarded by separate locks).