RH Lock: A Scalable Hierarchical Spin Lock

Zoran Radovic and Erik Hagersten
{zoranr, eh}@it.uu.se
Uppsala University, Information Technology
Department of Computer Systems, Uppsala Architecture Research Team (UART)

2nd Annual Workshop on Memory Performance Issues (WMPI 2002)
May 25, 2002, Anchorage, Alaska
WMPI 2002, Alaska Uppsala Architecture Research Team (UART) RH Locks
Synchronization History

• Spin locks:
  • test_and_set (TAS), e.g., IBM System/360, '64; Rudolph and Segall, ISCA '84
  • test_and_test_and_set (TATAS)
  • TATAS with exponential backoff (TATAS_EXP), '90 – '91
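The TATAS_EXP idea above can be sketched in C11. This is a minimal illustration, not the original implementation: the function names, the use of usleep() for the delay, and the backoff constants are my assumptions.

```c
#include <stdatomic.h>
#include <unistd.h>

typedef atomic_int tatas_lock;          /* 0 = FREE, 1 = BUSY */

/* TATAS with exponential backoff: spin on a plain read (stays in the
   cache), attempt the atomic exchange only when the lock looks free,
   and back off for an exponentially growing delay after each failure. */
void tatas_exp_acquire(tatas_lock *L) {
    unsigned delay = 1;                              /* microseconds */
    for (;;) {
        while (atomic_load_explicit(L, memory_order_relaxed) != 0)
            ;                                        /* the extra "test" */
        if (atomic_exchange_explicit(L, 1, memory_order_acquire) == 0)
            return;                                  /* the "set" won */
        usleep(delay);
        if (delay < 1024)
            delay *= 2;                              /* capped exponential backoff */
    }
}

void tatas_exp_release(tatas_lock *L) {
    atomic_store_explicit(L, 0, memory_order_release);
}
```

The extra read before the exchange is what distinguishes TATAS from plain TAS: waiters spin in their own caches instead of hammering the bus with atomic operations.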
[Figure: processors P1 … Pn, each with a cache, busy-waiting (with backoff) on a single lock word in shared memory as it transitions FREE → BUSY.]
Performance, 12 Years Ago
Traditional microbenchmark:

for (i = 0; i < iterations; i++) {
  ACQUIRE(lock);
  // Null Critical Section (CS)
  RELEASE(lock);
}

Thanks: Michael L. Scott

IF (more contention) THEN less efficient CS …
Making it Scalable: Queues …

• Spin on your predecessor's flag
• First-come first-served order
• Queue-based locks: QOLB/QOSB '89, MCS '91, CLH '93
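The spin-on-predecessor idea can be sketched with a minimal CLH lock in C11. The field and function names are my own, and this sketch omits the node-recycling discipline a production CLH lock uses.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* CLH queue lock sketch: each acquirer enqueues its own node with an
   atomic SWAP on the tail, then spins on its PREDECESSOR's flag. This
   gives first-come first-served order and purely local spinning. */
typedef struct clh_node { atomic_bool locked; } clh_node;
typedef struct { _Atomic(clh_node *) tail; } clh_lock;  /* tail = NULL: free */

clh_node *clh_acquire(clh_lock *L, clh_node *me) {
    atomic_store(&me->locked, true);
    clh_node *pred = atomic_exchange(&L->tail, me);     /* join the queue */
    if (pred != NULL)
        while (atomic_load(&pred->locked))              /* spin locally */
            ;
    return pred;          /* caller may reuse pred as its next node */
}

void clh_release(clh_node *me) {
    atomic_store(&me->locked, false);   /* hand the lock to our successor */
}
```

The strict FIFO handoff is exactly the property the talk later argues is unwanted on NUCAs: the next holder is whoever queued first, regardless of which node it sits in.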
Performance, May 2002
Traditional microbenchmark, 16-CPU Sun Enterprise E6000 SMP

[Figure: Time/Processor [seconds] (0.00 – 0.25) vs. Processors (0 – 16) for TATAS, TATAS_EXP, MCS, and CLH.]
Synchronization Today

• Commercial applications use spin locks (!), usually TATAS & TATAS_EXP with timeout, for
  • recovery from transaction deadlock
  • recovery from preemption of the lock holder
• POSIX threads:
  • pthread_mutex_lock
  • pthread_mutex_unlock
• HPC: runtime systems, OpenMP, …
Non-Uniform Memory Architecture (NUMA)

• NUMA optimizations:
  • Page migration
  • Page replication

[Figure: two nodes, each with processors P1 … Pn, caches, and local memory, connected by a switch; a remote memory access costs 2 – 10x a local one (local = 1).]
Non-Uniform Communication Architecture (NUCA)

• NUCA examples (NUCA ratios):
  • 1992: Stanford DASH (~4.5)
  • 1996: Sequent NUMA-Q (~10)
  • 1999: Sun WildFire (~6)
  • 2000: Compaq DS-320 (~3.5)
  • Future: CMP, SMT (~10)

[Figure: two nodes connected by a switch; the NUCA ratio compares the latency of a cache-to-cache transfer between nodes (2 – 10) with one within a node (1).]

Our NUCA …
Our NUCA: Sun WildFire

• Two E6000s connected through a hardware-coherent interface with a raw bandwidth of 800 MB/s in each direction
• 16 UltraSPARC II (250 MHz) CPUs per node
• 8 GB memory
• NUCA ratio ~6
Performance on Our NUCA

[Figure, left: Time/Processor [seconds] (0.00 – 0.50) vs. Processors (0 – 32) for TATAS, TATAS_EXP, MCS, and CLH. Right: Node-handoffs [%] (0 – 100) vs. Processors (0 – 32) for the same locks.]
Our Goals

• Demonstrate that the first-come first-served nature of queue-based locks is unwanted for NUCAs
  • new microbenchmark: "more realistic" behavior, and a real application study
• Design a scalable spin lock that exploits NUCAs by
  • creating controlled unfairness (a stable lock), and
  • reducing traffic compared with the test&set locks
Outline

• History & Background
• NUMA vs. NUCA
• Experimentation Environment
• The RH Lock
• Performance Results
• Application Performance
• Conclusions
Key Ideas Behind the RH Lock

• Minimize global traffic at lock handover
  • Only one thread per node tries to acquire a "remote" lock
• Maximize node locality on NUCAs
  • Hand the lock over to a neighbor in the same node
  • Creates locality for the critical-section (CS) data as well
  • Especially good for large CSs and high contention
• RH lock in a nutshell: double TATAS_EXP, one node-local lock + one "global"
The RH Lock Algorithm

One lock word per node, kept in that node's memory (Lock1 in Cabinet 1, Lock2 in Cabinet 2). A word holds FREE, L_FREE (freed locally), REMOTE, or the holder's thread ID (TID).

Acquire: SWAP(my_TID, Lock)
  • if FREE or L_FREE: you've got it!
  • if REMOTE: spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp. backoff)
  • else: TATAS(my_TID, Lock) until FREE or L_FREE

Release: CAS(my_TID, FREE), else L_FREE

[Figure: two 16-CPU cabinets; P2 acquires Lock1 locally while P19's SWAP on Lock2 returns REMOTE, so P19 spins remotely on Lock1 in Cabinet 1.]

IF (more contention) THEN more efficient CS
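The steps above can be sketched in C11. This is a simplified single-word-per-node model: the encoding of FREE/L_FREE/REMOTE as negative values, the two-argument interface, and the omission of exponential backoff are my assumptions, not the original implementation.

```c
#include <stdatomic.h>

enum { FREE = -1, L_FREE = -2, REMOTE = -3 };   /* thread IDs are >= 0 */
typedef atomic_int rh_word;

/* Acquire on this node's word; `other` is the other node's word. */
void rh_acquire(rh_word *mine, rh_word *other, int my_tid) {
    int prev = atomic_exchange(mine, my_tid);        /* SWAP(my_TID, Lock) */
    for (;;) {
        if (prev == FREE || prev == L_FREE)
            return;                                  /* you've got it */
        if (prev == REMOTE) {
            /* The lock lives in the other node: spin remotely until we
               can flip its word from FREE to REMOTE (backoff omitted). */
            int expected = FREE;
            while (!atomic_compare_exchange_weak(other, &expected, REMOTE))
                expected = FREE;
            return;
        }
        /* Another local thread holds it: TATAS until FREE or L_FREE. */
        while ((prev = atomic_load(mine)) != FREE && prev != L_FREE)
            ;
        prev = atomic_exchange(mine, my_tid);
    }
}

void rh_release(rh_word *mine, int my_tid) {
    int expected = my_tid;
    /* CAS(my_TID, FREE): if nobody SWAPped a TID in meanwhile, the lock
       is globally free; otherwise mark it L_FREE for a local waiter. */
    if (!atomic_compare_exchange_strong(mine, &expected, FREE))
        atomic_store(mine, L_FREE);
}
```

The key effect: a failed release-CAS means a local thread is already waiting, so the lock is handed off within the node (L_FREE) instead of crossing the interconnect.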
Performance Results
Traditional microbenchmark, 2-node Sun WildFire

[Figure, left: Time/Processor [seconds] (0.00 – 0.50) vs. Processors (0 – 32) for TATAS, TATAS_EXP, MCS, CLH, and RH. Right: Node-handoffs [%] (0 – 100) vs. Processors for the same locks, with RH at Fair_factor = 1, 50, and 100.]
Controlling Unfairness …

void rh_acquire_slowpath(rh_lock *L)
{
  ...
  /* roughly one handover in FAIR_FACTOR is forced to be fair */
  if ((random() % FAIR_FACTOR) == 0)
    be_fair = TRUE;
  else
    be_fair = FALSE;
  ...
}

void rh_release(rh_lock *L)
{
  if (be_fair)
    *L = FREE;                            /* let any node take the lock */
  else if (cas(L, my_tid, FREE) != my_tid)
    *L = L_FREE;                          /* hand off within the node */
}
Node-handoffs
Traditional microbenchmark, 2-node Sun WildFire

[Figure: Node-handoffs [%] (0 – 100) vs. Processors (0 – 32) for TATAS, TATAS_EXP, MCS, CLH, and RH with Fair_factor = 1, 50, and 100.]
New Microbenchmark

for (i = 0; i < iterations; i++) {
  ACQUIRE(lock);
  // Critical Section (CS) work
  RELEASE(lock);
  // Non-CS work, STATIC part +
  // Non-CS work, RANDOM part
}

• More realistic node-handoffs for queue-based locks
• Constant number of processors
• The amount of critical-section (CS) work can be increased, so we can control the "amount of contention"
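A runnable version of the loop body might look as follows; the plain test-and-set spinlock stands in for whichever lock is under test, and the array sizes and work amounts are illustrative assumptions.

```c
#include <stdatomic.h>
#include <stdlib.h>

static atomic_flag bench_lock = ATOMIC_FLAG_INIT;
static long shared_array[2048];          /* critical-section data */

/* One benchmark iteration: touch `cs_size` shared elements inside the
   lock, then do a static + random amount of private non-CS work. */
long bench_iteration(unsigned cs_size, unsigned noncs_static,
                     unsigned noncs_random) {
    while (atomic_flag_test_and_set(&bench_lock))    /* ACQUIRE */
        ;
    for (unsigned j = 0; j < cs_size; j++)           /* CS work */
        shared_array[j]++;
    atomic_flag_clear(&bench_lock);                  /* RELEASE */

    long local = 0;
    unsigned n = noncs_static + (unsigned)rand() % (noncs_random + 1);
    for (unsigned j = 0; j < n; j++)                 /* non-CS work */
        local += j;
    return local;
}
```

Raising `cs_size` lengthens the critical section relative to the non-CS work, which is how the benchmark dials contention up or down at a fixed processor count.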
Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs

[Figure, left: Time [seconds] (0 – 30) vs. Critical Work [array size] (0 – 2000) for TATAS, TATAS_EXP, MCS, CLH, and RH. Right: Node-handoffs [%] (0 – 60) vs. Critical Work for the same locks.]
Application Performance (1): Methodology

• The SPLASH-2 programs (14 apps)
• We study only applications with more than 10,000 acquire/release operations: Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, and Water-Nsq
• Synchronization algorithms: TATAS, TATAS_EXP, MCS, CLH, and RH
• 2-node Sun WildFire

Program        Lock Acquires
Barnes              69,193
Cholesky            74,284
FFT                     32
FMM                 80,528
LU-c & LU-nc            32
Ocean-c              6,304
Ocean-nc             6,656
Radiosity          295,627
Radix                   32
Raytrace           366,450
Volrend             38,456
Water-Nsq          112,415
Water-Sp               510
Application Performance (2): Raytrace Speedup

[Figure: Speedup (0 – 8) vs. Number of Processors (0 – 28) on WildFire for TATAS, TATAS_EXP, MCS, CLH, and RH.]
Single-Processor Results
Traditional microbenchmark, null CS

for (i = 0; i < iterations; i++) {
  ACQUIRE(lock);
  RELEASE(lock);
}

Lock        Latency
TATAS        97 ns
TATAS_EXP    97 ns
MCS         202 ns
CLH         137 ns
RH          121 ns
Performance Results
Traditional microbenchmark, single-node E6000

• Bind all threads to only one of the E6000 nodes

[Figure: Time/Processor [seconds] (0.00 – 0.25) vs. Processors (0 – 16) for TATAS, TATAS_EXP, MCS, CLH, and RH.]

As expected: RH lock ≈ TATAS_EXP
Conclusions

• First-come first-served is not desirable for NUCAs
• The RH lock exploits NUCAs by
  • creating locality through controlled unfairness (a stable lock)
  • reducing traffic compared with the test&set locks
• The only lock that performs better under contention
  • A critical section (CS) guarded by the RH lock takes less than half the time to execute compared with the same CS guarded by any other lock
• Raytrace on 30 CPUs: 1.83 – 5.70 "better"
• Works best for NUCAs with a few large "nodes"