RH Lock: A Scalable Hierarchical Spin Lock

Zoran Radovic and Erik Hagersten, {zoranr, eh}@it.uu.se
Uppsala University, Information Technology, Department of Computer Systems, Uppsala Architecture Research Team [UART]
2nd Annual Workshop on Memory Performance Issues (WMPI 2002), May 25, 2002, Anchorage, Alaska


Page 1: RH Lock: A Scalable Hierarchical Spin Lock

Uppsala University
Information Technology
Department of Computer Systems
Uppsala Architecture Research Team [UART]

RH Lock: A Scalable Hierarchical Spin Lock

Zoran Radovic and Erik Hagersten
{zoranr, eh}@it.uu.se

2nd ANNUAL WORKSHOP ON MEMORY PERFORMANCE ISSUES (WMPI 2002)
May 25, 2002, Anchorage, Alaska

Page 2: RH Lock: A Scalable Hierarchical Spin Lock

WMPI 2002, Alaska Uppsala Architecture Research Team (UART) RH Locks

Synchronization History

Spin-Locks
• test_and_set (TAS), e.g., IBM System/360, ’64
• Rudolph and Segall, ISCA’84: test_and_test_and_set (TATAS)
• TATAS with exponential backoff (TATAS_EXP), ’90 – ’91

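[Figure: processors P1, P2, P3, … Pn, each with a cache ($), share one memory that holds the lock. P3 takes the lock (FREE becomes BUSY) while the other processors busy-wait or back off until it is FREE again.]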
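The TATAS_EXP idea named on this slide can be sketched as follows. This is a minimal illustration using C11 atomics, not the code measured in the talk: the names `tatas_lock`, `tatas_try_acquire`, `tatas_exp_acquire`, and the `BACKOFF_MIN`/`BACKOFF_MAX` limits are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative constants; the values are assumptions, not tuned numbers. */
enum { TATAS_FREE = 0, TATAS_BUSY = 1, BACKOFF_MIN = 4, BACKOFF_MAX = 4096 };

typedef struct { atomic_int state; } tatas_lock;

static void tatas_init(tatas_lock *l) { atomic_init(&l->state, TATAS_FREE); }

/* One attempt: "test" with a plain read first, then "test_and_set" with an
 * atomic exchange only if the lock looked free. */
static bool tatas_try_acquire(tatas_lock *l)
{
    if (atomic_load_explicit(&l->state, memory_order_relaxed) != TATAS_FREE)
        return false;
    return atomic_exchange_explicit(&l->state, TATAS_BUSY,
                                    memory_order_acquire) == TATAS_FREE;
}

/* Crude delay loop standing in for a real pause/backoff primitive. */
static void backoff_delay(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* TATAS with exponential backoff: double the delay after every failure. */
static void tatas_exp_acquire(tatas_lock *l)
{
    unsigned delay = BACKOFF_MIN;
    while (!tatas_try_acquire(l)) {
        backoff_delay(delay);
        if (delay < BACKOFF_MAX)
            delay *= 2;
    }
}

static void tatas_release(tatas_lock *l)
{
    assert(atomic_load(&l->state) == TATAS_BUSY);
    atomic_store_explicit(&l->state, TATAS_FREE, memory_order_release);
}
```

Reading before exchanging keeps waiting processors spinning in their own caches, and the growing delay reduces the burst of traffic when the lock is released.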

Page 3: RH Lock: A Scalable Hierarchical Spin Lock


Performance, 12 years ago …
Traditional microbenchmark

for (i = 0; i < iterations; i++) {
    ACQUIRE(lock);
    // Null Critical Section (CS)
    RELEASE(lock);
}

Thanks: Michael L. Scott

IF (more contention) THEN less efficient CS …

Page 4: RH Lock: A Scalable Hierarchical Spin Lock


Making it Scalable: Queues …

Spin on your predecessor’s flag

First-come first-served order

Queue-Based Locks
• QOLB/QOSB, ’89
• MCS, ’91
• CLH, ’93
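"Spin on your predecessor’s flag" is the scheme used by CLH-style locks. The following is a minimal sketch of that idea with C11 atomics; the type and function names (`clh_lock`, `clh_thread`, `clh_acquire`, and so on) are assumptions for illustration and are not taken from the talk.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct clh_node { atomic_bool locked; } clh_node;

/* The lock is only a pointer to the most recently enqueued node. */
typedef struct { _Atomic(clh_node *) tail; } clh_lock;

/* Per-thread state: my queue node and my predecessor's node. */
typedef struct { clh_node *mine; clh_node *pred; } clh_thread;

static clh_node *clh_new_node(bool locked)
{
    clh_node *n = malloc(sizeof *n);
    assert(n != NULL);
    atomic_init(&n->locked, locked);
    return n;
}

/* The queue starts with one unlocked dummy node. */
static void clh_init(clh_lock *l) { atomic_init(&l->tail, clh_new_node(false)); }

static void clh_thread_init(clh_thread *t)
{
    t->mine = clh_new_node(false);
    t->pred = NULL;
}

static void clh_acquire(clh_lock *l, clh_thread *t)
{
    atomic_store_explicit(&t->mine->locked, true, memory_order_relaxed);
    /* Enqueue: the swap yields the predecessor, giving first-come first-served order. */
    t->pred = atomic_exchange_explicit(&l->tail, t->mine, memory_order_acq_rel);
    /* Spin on the predecessor's flag only. */
    while (atomic_load_explicit(&t->pred->locked, memory_order_acquire)) { }
}

static void clh_release(clh_thread *t)
{
    clh_node *recycled = t->pred;
    /* Clearing my flag lets my successor, if any, proceed. */
    atomic_store_explicit(&t->mine->locked, false, memory_order_release);
    /* My old node may still be watched by a successor, so reuse the predecessor's. */
    t->mine = recycled;
}
```

Each waiter spins on a different location, which removes the contention on a single lock word, but the strict queue order also fixes which thread, and therefore which node, gets the lock next.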

Page 5: RH Lock: A Scalable Hierarchical Spin Lock


Performance, May 2002
Traditional microbenchmark

[Figure: Time/Processors [seconds] (0,00 to 0,25) versus Processors (0 to 16) for TATAS, TATAS_EXP, MCS, and CLH on a Sun Enterprise E6000 SMP (16 processors).]

Page 6: RH Lock: A Scalable Hierarchical Spin Lock


Synchronization Today

Commercial applications use spin-locks (!)
• usually TATAS & TATAS_EXP, with timeout for
  – recovery from transaction deadlock
  – recovery from preemption of the lock holder

POSIX threads:
• pthread_mutex_lock
• pthread_mutex_unlock

HPC: runtime systems, OpenMP, …

Page 7: RH Lock: A Scalable Hierarchical Spin Lock


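Non-Uniform Memory Architecture (NUMA)

NUMA optimizations
• Page migration
• Page replication

[Figure: two nodes connected through a switch. Each node has processors P1, P2, P3, … Pn, each with a cache ($), and its own memory. Relative access cost: 1 within a node, 2 – 10 across the switch.]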

Page 8: RH Lock: A Scalable Hierarchical Spin Lock


Non-Uniform Communication Architecture (NUCA)

NUCA examples (NUCA ratios):
• 1992: Stanford DASH (~ 4.5)
• 1996: Sequent NUMA-Q (~ 10)
• 1999: Sun WildFire (~ 6)
• 2000: Compaq DS-320 (~ 3.5)
• Future: CMP, SMT (~ 10)

[Figure: the same two-node system, processors with caches ($) and a memory per node, connected through a switch. NUCA ratio: relative communication cost 1 within a node, 2 – 10 between nodes.]

Our NUCA …

Page 9: RH Lock: A Scalable Hierarchical Spin Lock


Our NUCA: Sun WildFire

• Two E6000 nodes connected through a hardware-coherent interface with a raw bandwidth of 800 MB/s in each direction
• 16 UltraSPARC II (250 MHz) CPUs per node
• 8 GB memory
• NUCA ratio 6

Page 10: RH Lock: A Scalable Hierarchical Spin Lock


Performance on our NUCA

[Figure, two panels, each for TATAS, TATAS_EXP, MCS, and CLH with 0 to 32 processors (16 per node): Time/Processors [seconds] (0,00 to 0,50) and Node-handoffs [%] (0 to 100).]

Page 11: RH Lock: A Scalable Hierarchical Spin Lock


Our Goals

Demonstrate that the first-come first-served nature of queue-based locks is unwanted for NUCAs
• new microbenchmark: “more realistic” behavior, and
• real application study

Design a scalable spin lock that exploits the NUCAs
• creating a controlled unfairness (stable lock), and
• reducing the traffic compared with the test&set locks

Page 12: RH Lock: A Scalable Hierarchical Spin Lock


Outline

• History & Background
• NUMA vs. NUCA
• Experimentation Environment
• The RH Lock
• Performance Results
• Application Performance
• Conclusions

Page 13: RH Lock: A Scalable Hierarchical Spin Lock


Key Ideas Behind RH Lock

Minimizing global traffic at lock-handover
• Only one thread per node will try to acquire a “remote” lock

Maximizing node locality of NUCAs
• Hand over the lock to a neighbor in the same node
• Creates locality for the critical section (CS) data as well
• Especially good for large CS and high contention

RH lock in a nutshell:
• Double TATAS_EXP: one node-local lock + one “global”

Page 14: RH Lock: A Scalable Hierarchical Spin Lock


The RH Lock Algorithm

[Figure: two cabinets. Cabinet 1 holds P1, P2, P3, … P16 (each with a cache, $) and its memory; Cabinet 2 holds P17, P18, P19, … P32 and its memory. Lock1 and Lock2 appear in each cabinet's memory, holding FREE, L_FREE, REMOTE, or a thread ID (TID). In the animation, P2 (TID 2) and P19 (TID 19) compete for the lock and enter the critical section (CS) in turn.]

Acquire: SWAP(my_TID, Lock)
• If (FREE or L_FREE): You’ve got it!
• if “REMOTE”: Spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp backoff)
• else: TATAS(my_TID, Lock) until FREE or L_FREE

Release: CAS(my_TID, FREE), else L_FREE

IF (more contention) THEN more efficient CS
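The acquire and release rules on this slide can be sketched in C11 as follows. This is an illustrative reading of the slide under several assumptions, not the authors' implementation: the lock is modeled as one word per node in a two-node system (`copy[2]`), the node and TID are passed explicitly, the numeric encodings of FREE, L_FREE, and REMOTE are arbitrary, and fairness control is left out.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed encoding: any value below RH_FREE is a thread ID (TID). */
enum { RH_FREE = 1000, RH_L_FREE = 1001, RH_REMOTE = 1002 };

/* Assumed layout: one lock word per node (cabinet), two nodes only. */
typedef struct { atomic_int copy[2]; } rh_lock;

static void rh_init(rh_lock *l)
{
    atomic_init(&l->copy[0], RH_FREE);    /* the lock starts in node 0 */
    atomic_init(&l->copy[1], RH_REMOTE);  /* node 1 sees it as remote  */
}

static void rh_backoff(unsigned *delay)
{
    for (volatile unsigned i = 0; i < *delay; i++) { }
    if (*delay < 1024)
        *delay *= 2;                      /* exponential backoff */
}

/* Spin remotely: CAS(FREE, REMOTE) on the other node's word until it succeeds. */
static void rh_acquire_remote(rh_lock *l, int node)
{
    unsigned delay = 1;
    for (;;) {
        int expected = RH_FREE;
        if (atomic_compare_exchange_strong(&l->copy[1 - node], &expected, RH_REMOTE))
            return;
        rh_backoff(&delay);
    }
}

static void rh_acquire(rh_lock *l, int node, int tid)
{
    unsigned delay = 1;
    assert((node == 0 || node == 1) && tid >= 0 && tid < RH_FREE);
    for (;;) {
        int old = atomic_exchange(&l->copy[node], tid);  /* SWAP(my_TID, Lock) */
        if (old == RH_FREE || old == RH_L_FREE)
            return;                                      /* you've got it */
        if (old == RH_REMOTE) {                          /* only this thread goes remote */
            rh_acquire_remote(l, node);
            return;
        }
        /* A neighbor's TID came back: spin locally until the word is no longer a TID. */
        do {
            rh_backoff(&delay);
        } while (atomic_load(&l->copy[node]) < RH_FREE);
    }
}

static void rh_release(rh_lock *l, int node, int tid)
{
    int expected = tid;
    /* CAS(my_TID, FREE) succeeds only if no neighbor swapped in its own TID. */
    if (!atomic_compare_exchange_strong(&l->copy[node], &expected, RH_FREE))
        atomic_store(&l->copy[node], RH_L_FREE);         /* hand over inside the node */
}
```

Because a waiting neighbor overwrites the holder's TID, the failed CAS at release is what detects local contention and keeps the lock, and the data it protects, inside the node.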

Page 15: RH Lock: A Scalable Hierarchical Spin Lock


Performance Results
Traditional microbenchmark, 2-node Sun WildFire

[Figure, two panels, 0 to 32 processors: Node-handoffs [%] (0 to 100) for TATAS, TATAS_EXP, MCS, CLH, and RH with Fair_factor = 1, 50, and 100; Time/Processors [seconds] (0,00 to 0,50) for TATAS, TATAS_EXP, MCS, CLH, and RH.]

Page 16: RH Lock: A Scalable Hierarchical Spin Lock


Controlling Unfairness …

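[Figure: Cabinet 1 with processors P1, P2, P3, … Pn (each with a cache, $) and its memory. The lock words (Lock1, Lock2) hold FREE, L_FREE, or a TID; P2 is the thread releasing the lock.]

void rh_acquire_slowpath(rh_lock *L)
{
    ...
    /* On average, one slow-path acquire in FAIR_FACTOR is marked fair. */
    if ((random() % FAIR_FACTOR) == 0)
        be_fare = TRUE;
    else
        be_fare = FALSE;
    ...
}

void rh_release(rh_lock *L)
{
    if (be_fare)
        *L = FREE;      /* fair: release so that the other node can take the lock */
    else if (cas(L, my_tid, FREE) != my_tid)
        *L = L_FREE;    /* a neighbor is waiting: hand over within the node */
}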

Page 17: RH Lock: A Scalable Hierarchical Spin Lock


Node-handoffs
Traditional microbenchmark, 2-node Sun WildFire

[Figure: Node-handoffs [%] (0 to 100) versus Processors (0 to 32) for TATAS, TATAS_EXP, MCS, CLH, and RH with Fair_factor = 1, 50, and 100.]
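The slides do not spell out how the node-handoff percentage is computed. One plausible reading, assumed here, is the share of lock handovers in which the new holder runs on a different node than the previous holder. The helper below, with the hypothetical name `node_handoff_percent`, computes that from a recorded sequence of holder nodes.

```c
#include <assert.h>
#include <stddef.h>

/* holder_node[i] is the node of the thread that made the i-th acquire.
 * Returns the percentage of handovers that crossed a node boundary.
 * This definition is an assumption; the slides only show the metric's name. */
static double node_handoff_percent(const int *holder_node, size_t n)
{
    size_t handoffs = 0;
    assert(holder_node != NULL || n == 0);
    if (n < 2)
        return 0.0;                       /* no handover has happened yet */
    for (size_t i = 1; i < n; i++)
        if (holder_node[i] != holder_node[i - 1])
            handoffs++;
    return 100.0 * (double)handoffs / (double)(n - 1);
}
```

Under this reading, a strictly alternating sequence such as 0, 1, 0, 1 gives 100 %, while a lock that stays in one node for long runs gives a value close to 0 %.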

Page 18: RH Lock: A Scalable Hierarchical Spin Lock


New Microbenchmark

for (i = 0; i < iterations; i++) {
    ACQUIRE(lock);
    // Critical Section (CS) work
    RELEASE(lock);
    // Non-CS work STATIC part +
    // Non-CS work RANDOM part
}

• More realistic node-handoffs for queue-based locks
• Constant number of processors
• The amount of Critical Section (CS) work can be increased: we can control the “amount of contention”
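The loop above could be filled in along these lines. This is a sketch under stated assumptions: the CS work is modeled as incrementing every element of a shared array (suggested by the "Critical Work [array size]" axis used later in the talk), the lock is a placeholder test_and_set spin lock, and all names (`spin_lock`, `busy_work`, `next_random`, `new_microbenchmark`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder test_and_set lock; any of the locks discussed could be used instead. */
typedef struct { atomic_bool busy; } spin_lock;

static void spin_init(spin_lock *l) { atomic_init(&l->busy, false); }

static void spin_acquire(spin_lock *l)
{
    while (atomic_exchange_explicit(&l->busy, true, memory_order_acquire)) { }
}

static void spin_release(spin_lock *l)
{
    atomic_store_explicit(&l->busy, false, memory_order_release);
}

/* Delay loop standing in for non-critical computation. */
static void busy_work(unsigned units)
{
    for (volatile unsigned i = 0; i < units; i++) { }
}

/* Small linear congruential generator so every thread can own its seed. */
static unsigned next_random(unsigned *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Body of one benchmark thread. */
static void new_microbenchmark(spin_lock *lock, int *shared, size_t array_size,
                               unsigned iterations, unsigned static_work,
                               unsigned random_work, unsigned seed)
{
    assert(shared != NULL && array_size > 0);
    for (unsigned i = 0; i < iterations; i++) {
        spin_acquire(lock);
        for (size_t j = 0; j < array_size; j++)   /* CS work grows with array_size */
            shared[j]++;
        spin_release(lock);
        busy_work(static_work);                   /* non-CS work, STATIC part */
        busy_work(random_work ? next_random(&seed) % random_work : 0); /* RANDOM part */
    }
}
```

With the thread count held constant, raising `array_size` lengthens the critical section relative to the non-CS work, which is how the amount of contention is controlled.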

Page 19: RH Lock: A Scalable Hierarchical Spin Lock


Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs

[Figure, two panels, Critical Work [array size] from 0 to 2000, WildFire (WF) with 14 CPUs per node: Time [seconds] (0 to 30) for TATAS, TATAS_EXP, MCS, CLH, and RH; Node-handoffs [%] (0 to 60).]

Page 20: RH Lock: A Scalable Hierarchical Spin Lock


Application Performance (1)
Methodology

The SPLASH-2 programs
• 14 apps

We study only applications with more than 10,000 acquire/release operations
• Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, and Water-Nsq

Synchronization algorithms
• TATAS, TATAS_EXP, MCS, CLH, and RH

2-node Sun WildFire

Program         Lock Acquires
Barnes          69,193
Cholesky        74,284
FFT             32
FMM             80,528
LU-c & LU-nc    32
Ocean-c         6,304
Ocean-nc        6,656
Radiosity       295,627
Radix           32
Raytrace        366,450
Volrend         38,456
Water-Nsq       112,415
Water-Sp        510

Page 21: RH Lock: A Scalable Hierarchical Spin Lock


Application Performance (2)
Raytrace Speedup

[Figure: Speedup (0 to 8) versus Number of Processors (0 to 28) on the WildFire (WF) for TATAS, TATAS_EXP, MCS, CLH, and RH.]

Page 22: RH Lock: A Scalable Hierarchical Spin Lock


Single-Processor Results
Traditional microbenchmark, null CS

TATAS        97 ns
TATAS_EXP    97 ns
MCS          202 ns
CLH          137 ns
RH           121 ns

for (i = 0; i < iterations; i++) {
    ACQUIRE(lock);
    RELEASE(lock);
}
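A per-iteration figure like the ones above can be obtained by timing the whole loop and dividing by the iteration count. The sketch below assumes a placeholder test_and_set lock behind the `ACQUIRE`/`RELEASE` macros and uses the standard `clock()` function; the name `ns_per_iteration` is hypothetical, and the numbers it yields on a modern machine will not match the slide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Placeholder lock operations; substitute the lock under test. */
#define ACQUIRE(l) \
    do { while (atomic_exchange_explicit((l), true, memory_order_acquire)) { } } while (0)
#define RELEASE(l) atomic_store_explicit((l), false, memory_order_release)

/* Times the uncontended acquire/release loop and returns nanoseconds per iteration. */
static double ns_per_iteration(atomic_bool *lock, long iterations)
{
    assert(lock != NULL && iterations > 0);
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        ACQUIRE(lock);
        /* null critical section */
        RELEASE(lock);
    }
    clock_t end = clock();
    return 1e9 * (double)(end - start) / (double)CLOCKS_PER_SEC / (double)iterations;
}
```

A large iteration count is needed because `clock()` has coarse resolution compared with a single acquire/release pair.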

Page 23: RH Lock: A Scalable Hierarchical Spin Lock


Performance Results
Traditional microbenchmark, single-node E6000

Bind all threads to only one of the E6000 nodes

[Figure: Time/Processors [seconds] (0,00 to 0,25) versus Processors (0 to 16) for TATAS, TATAS_EXP, MCS, CLH, and RH.]

As expected: RH lock ≈ TATAS_EXP

Page 24: RH Lock: A Scalable Hierarchical Spin Lock


Conclusions

• First-come first-served not desirable for NUCAs
• The RH lock exploits NUCAs by
  – creating locality through controlled unfairness (stable lock)
  – reducing traffic compared with the test&set locks
• The only lock that performs better under contention
• A critical section (CS) guarded by the RH lock takes less than half the time to execute compared with the same CS guarded by any other lock
• Raytrace on 30 CPUs: 1.83 – 5.70 “better”
• Works best for NUCA with a few large “nodes”

Page 25: RH Lock: A Scalable Hierarchical Spin Lock


http://www.it.uu.se/research/group/uart

UART’s Home Page