QuakeTM: Parallelizing a Complex Serial Application
Using Transactional Memory
Vladimir Gajinov1,2, Ferad Zyulkyarov1,2, Osman S. Unsal1, Adrián Cristal1, Eduard Ayguadé1,2, Tim Harris3, Mateo Valero1,2
1Barcelona Supercomputing Center
2Universitat Politècnica de Catalunya
3Microsoft Research
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
Introduction
• Topic of this work: parallelization of the Quake server.
• What is Quake? A first-person shooter game.
  – A sequential application.
• Requirements of a sequential game server:
  – Close to instantaneous control of player actions.
  – High degree of interaction among players in a detailed 3D virtual world.
• CPU processing is the bottleneck.
• Method: OpenMP + Transactional Memory.
Background
• OpenMP:
  – API for shared-memory parallel programming in C/C++ and Fortran.
  – Compiler directives and library routines.
  – Fork-join parallelism.
• Transactional Memory (TM):
  – A concurrency control mechanism.
  – A series of reads and writes to shared memory is handled atomically.
  – When successful, a transaction commits; otherwise it aborts.
Motivation
• Just a few TM applications are available:
  – STAMP, the Haskell STM benchmark, RMS-TM, …
  – There is a clear need for more complex applications.
• Contribution: parallelization of a complex sequential application using TM.
• Question: is it possible to achieve fine-grained locking performance with a coarse-grained parallelization effort?
• Testing TM programmability:
  – Start with a coarse-grained approach.
  – Test the performance.
  – Determine the problems.
  – Compare with a fine-grained approach.
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
Quake Organization
• Typical client-server architecture.
• Server:
  – Maintains the consistency of the game world.
  – Handles the coordination among clients.
• Clients:
  – Update graphics.
  – Implement user-interface operations.
The Server
• The main server task: computing a new frame.
• Frame execution diagram (figure): receive (Rx), Read, and Request
  Processing repeat while SELECT finds more packets; then Physics
  Update, Reply, and transmit (Tx).
• Execution breakdown, sequential server with 8 connected clients
  (figure): request processing dominates at 87.8%; the other stages
  shown take 2.1% and 3.1%.
• We concentrate on the request processing stage.
Quake Map
• A 3D volume in a 3D coordinate space.
• Represented as a binary space partition (BSP) tree, which is too
  fine-grained and inefficient for locating entities.
Areanode tree (figure: top view, levels 1-5):
• A balanced binary tree.
• Each 3D point in the map must be either in an areanode that is a
  leaf or in a division plane.
• Areanodes maintain a list of game objects (entities).
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
Parallelization
• Only the request processing stage is parallelized.
• OpenMP is used to start parallel execution.
• Transactions are used for synchronization.
• Coarse-grained approach, compared with the fine-grained
  implementation of Atomic Quake [PPoPP 2009].
• Application characteristics:
  – Coarse-grained: 8 TM blocks; big read and write sets; long
    transactions; 35.3% abort rate.
  – Fine-grained: 58 TM blocks; 4.1% abort rate.
Shared Data
• Three types of shared data structures:
  – Areanode tree
  – Game objects
  – Message buffers:
    • Common global state buffer
    • Per-player reply buffers
• Sharing is most intensive inside the request processing stage.
Client Requests
Two types of requests:
• Connection-related messages:
  – Associated with the connection and disconnection protocols, used
    when a client wants to join or leave the server game session, and
    with other facilities that do not affect gameplay.
• Gameplay messages:
  – The most important type of request.
  – Model the player's interaction with the game world.
  – The most frequently used: the MOVE command.
Pseudocode for the request processing stage

Sequential:

    while (NET_GetPacket ()) {
        // Filter packets
        if (connection related packet) {
            SV_ConnectionlessPacket ();
            continue;
        }
        // Gameplay packets
        for (i = 0; i < MAX_CLIENTS; i++) {
            // Do some checking here
            SV_ExecuteClientMessage ();
        }
    }

Parallel (packets are first collected, then processed as tasks):

    while (NET_GetPacket ()) {
        // Filter packets
        if (connection related packet) {
            SV_ConnectionlessPacket ();
            continue;
        }
        AddPacketToList ();
        CopyBuffer ();
    }

    #pragma intel omp parallel taskq shared(packetlist, ...)
    {
        while (packetlist != NULL) {
            #pragma intel omp task captureprivate(packetlist)
            {
                NET_Message_Init (..);
                // Check for packets from connected clients
                for (i = 0, cl = svs.clients; i < MAX_CLIENTS; i++, cl++) {
                    // Do some checking here
                    SV_ExecuteClientMessage (cl);
                }
            }
            packetlist = packetlist->next;
        }
    }
The Move Command
Parameters:
• Player's origin
• View angles
• Motion indicators
• Time to run
Execution:
1. Construct the bounding box.
2. Traverse the areanode tree.
3. Find the objects contained in the bounding box and associate them
   with the command.
4. Simulate the move.
5. Remove the player from the old position and add him to the new one.
Move Command Execution
The stages of a move command are grouped into four transactions
(T1-T4, see figure):
• ClientPhysics: client's physics update.
• ClientThink: execute actions registered in previous frames.
• PmoveInit: pmove (player move) structure initialization.
• AddLinksToPmove: determines which entities could be affected by the
  current move command.
• PlayerMove: constructs a trajectory line and determines the client's
  final position.
• LinkEntity: re-links the player's entity to the new position in the
  areanode tree.
• PlayerTouch: models the influence on the other game objects.
ReachPoints
A profiling aid: per-thread counters incremented at chosen points
inside transactions.

    int reachpoints[NumThreads][x*16];

    TM_PURE
    void PointReached (int check) {
        reachpoints[ThreadId][check]++;
    }

    int main () {
        . . .
        TRANSACTION
            PointReached (1);
            statement_1;
            PointReached (2);
        TRANSACTION_END
        . . .
    }

Helps to:
• Identify thread-private variables.
• Discover where transactions abort.
• Discover the causes of the aborts.
• Discover TM false sharing conflicts (conflict management
  granularity).
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
Evaluation
• TraceBot:
  – An automatic trace client.
  – Behavior is controlled by a finite state machine.
• VideoClient:
  – A normal graphical client, used for proving correctness and for
    trace creation.
• The server runs on one machine, the clients on another.
  – Server: 8 cores (4 x dual-core 64-bit Intel Xeon).
• Frame execution time is the performance measure.
• Prototype version 3.0 of the Intel STM C/C++ compiler:
  – In-place updates.
  – Cache-line granularity conflict detection.
  – Transactions validate the read set at commit time and, if
    necessary, during read operations.
  – Function annotations: tm_callable, tm_pure and tm_unknown.
  – Closed nesting with flattening.
Results: Normalized Average Frame Execution Times (coarse-grained)
(Figure: normalized execution time vs. number of clients (1, 2, 4, 8,
16) for the serial, global_lock and TM_coarse configurations.)
• The baseline is always the average frame execution time of the
  sequential server for the respective number of clients.
• The TM version's overhead is 3.5x-6x: too high.
• More than 85% of the time is spent in critical sections.
Results: Performance of Coarse-Grained Configurations
(Figure, left: comparative performance of the parallel
configurations; frame time [ms] for 1, 2, 4 and 8 threads with 1, 2,
4, 8 and 16 clients, comparing global_lock and TM_coarse. Figure,
right: transactional server running with 16 clients, speedup and
scalability; TM_coarse speedup with 2, 4 and 8 threads, and average
frame time [ms] vs. thread count for global_lock and TM_coarse.)
Transactional Statistics: Coarse-Grained
TM server running with 8 threads.

  Clients  Transactions  Aborts  Abort rate [%]          Mean [KB]  Max [KB]  Total [MB]
  1        34754         0       0.0             Reads   3.0        104       105
                                                 Writes  0.6        17        20
  2        95980         1970    2.1             Reads   2.8        863       263
                                                 Writes  0.6        164       55
  4        179241        10820   6.0             Reads   3.4        1413      570
                                                 Writes  0.6        269       108
  8        364305        76560   21.0            Reads   4.2        1478      1207
                                                 Writes  0.8        251       216
  16       524561        184992  35.3            Reads   5.1        1704      1725
                                                 Writes  0.9        262       296

The abort rate is significant.
The Overhead Breakdown
Multithreaded execution: 8 threads, 16 clients.

  TM     Total          Instrumentation time    Abort overhead          Abort
  block  [10^9 cycles]  [10^9 cycles]  [%]      [10^9 cycles]  [%]      rate [%]
  1      13.5           10.3           75.8     3.3            24.2     19.5
  2      9.5            9.0            94.1     0.6            5.9      18.0
  3      17.2           15.1           87.9     2.1            12.1     52.7
  4      11.6           10.9           94.3     0.7            5.7      22.4
  5      5.9            3.2            53.7     2.8            46.3     61.1
  all    57.9           48.5           83.8     9.4            16.2     35.2

• We have limited possibilities for profiling.
• The TM instrumentation overhead appears to be the dominant cost.
Results: Normalized Average Frame Execution Times (fine-grained)
(Figure: normalized execution time vs. number of clients (1, 2, 4, 8,
16) for the serial, lock_fine and TM_fine configurations.)
• The TM version's overhead is 2.4x-3x.
Results: Performance of Fine-Grained Configurations
(Figure, left: comparative performance of the parallel
configurations; frame time [ms] for 1, 2, 4 and 8 threads with 1, 2,
4, 8 and 16 clients, comparing lock_fine and TM_fine. Figure, right:
transactional server running with 16 clients, speedup and
scalability; lock_fine and TM_fine speedup with 2, 4 and 8 threads,
and average frame time [ms] vs. thread count for global_lock,
lock_fine, TM_coarse and TM_fine.)
Transactional Statistics: Fine-Grained
TM server running with 8 threads.

  Clients  Transactions  Aborts  Abort rate [%]          Mean [B]  Max [B]  Total [MB]
  1        190206        0       0.0             Reads   65.1      58511    12
                                                 Writes  5.2       20102    1
  2        367118        826     0.2             Reads   66.0      62728    25
                                                 Writes  5.7       24397    2
  4        655020        4165    0.6             Reads   83.7      80275    55
                                                 Writes  8.2       39726    5
  8        1439874       20593   1.4             Reads   102.5     102470   145
                                                 Writes  9.6       57552    14
  16       3226759       131814  4.1             Reads   133.3     231593   192
                                                 Writes  15.5      211651   22
Outline
Introduction & motivation
Quake description
Parallelization
Results
Conclusion
QuakeTM Characteristics
• 27,600 lines of code in 49 files.
• Configurable with macros:
  – Synchronization, granularity, nesting, TM implementation.
• Coarse-grained setup: 8 critical regions (TM or a global lock).
• Fine-grained setup: 58 critical regions (TM or fine-grained locks).
• Available at www.bscmsrc.eu
Conclusion
• The transactional overhead is excessive: a 6x slowdown and a 35.3%
  abort rate.
• A coarse-grained approach is not a good option for current STM
  systems.
• The parallelization required a significant investment of programmer
  time (10 man-months).
• A fine-grained approach may be the only solution.
Questions?
Thank you!
Download QuakeTM: www.bscmsrc.eu
Intel Compiler
• Single-lock atomicity semantics and weak atomicity guarantees.
  – As opposed to strongly atomic semantics, where non-transactional
    accesses are treated as implicit single-operation transactions.
Atomic Quake
• The main objective was to evaluate the effort of replacing locks
  with transactions.
• The lock parallelization is not block-structured, which required
  code reorganization to fit the TM model.
• The second problem was avoiding I/O operations inside transactions,
  which is not an issue in a lock-based system.
• Finally, a big fraction of the development time was spent
  understanding how locks are associated with variables and getting a
  grip on the locking strategy.
Atomic Quake 2
• Thread-private data: calls to get_specific.
• Condition variables: no retry construct available.
• I/O in transactions: handled with tm_pure.
• A proposition for error handling:
  – When an error happens, commit the transaction and handle the
    error outside the atomic block.
• Privatization examples:
  – A custom memory manager allocates a block of memory for string
    operations.
• TM fits well for guarding access to different shared data
  (previously guarded by separate locks).