RH Lock: A Scalable Hierarchical Spin Lock

Zoran Radovic and Erik Hagersten
{zoranr, eh}@it.uu.se
Uppsala University, Information Technology
Department of Computer Systems, Uppsala Architecture Research Team (UART)

2nd Annual Workshop on Memory Performance Issues (WMPI 2002)
May 25, 2002, Anchorage, Alaska
WMPI 2002, Alaska Uppsala Architecture Research Team (UART) RH Locks
Synchronization History

• Spin locks:
  • test_and_set (TAS), e.g., IBM System/360, '64; Rudolph and Segall, ISCA '84
  • test_and_test_and_set (TATAS)
  • TATAS with exponential backoff (TATAS_EXP), '90 – '91
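The TATAS_EXP idea above can be sketched in C11. This is a minimal illustration, not the original implementation: the function names, the use of usleep() for the delay, and the backoff constants are my assumptions.

```c
#include <stdatomic.h>
#include <unistd.h>

typedef atomic_int tatas_lock;          /* 0 = FREE, 1 = BUSY */

/* TATAS with exponential backoff: spin on a plain read (stays in the
   cache), attempt the atomic exchange only when the lock looks free,
   and back off for an exponentially growing delay after each failure. */
void tatas_exp_acquire(tatas_lock *L) {
    unsigned delay = 1;                              /* microseconds */
    for (;;) {
        while (atomic_load_explicit(L, memory_order_relaxed) != 0)
            ;                                        /* the extra "test" */
        if (atomic_exchange_explicit(L, 1, memory_order_acquire) == 0)
            return;                                  /* the "set" won */
        usleep(delay);
        if (delay < 1024)
            delay *= 2;                              /* capped exponential backoff */
    }
}

void tatas_exp_release(tatas_lock *L) {
    atomic_store_explicit(L, 0, memory_order_release);
}
```

The extra read before the exchange is what distinguishes TATAS from plain TAS: waiters spin in their own caches instead of hammering the bus with atomic operations.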
[Figure: processors P1 … Pn, each with a cache, busy-waiting (with backoff) on a single lock word in shared memory as it transitions FREE → BUSY.]
Performance, 12 Years Ago
Traditional microbenchmark:

for (i = 0; i < iterations; i++) {
  ACQUIRE(lock);
  // Null Critical Section (CS)
  RELEASE(lock);
}

Thanks: Michael L. Scott

IF (more contention) THEN less efficient CS …
Making it Scalable: Queues …

• Spin on your predecessor's flag
• First-come first-served order
• Queue-based locks: QOLB/QOSB '89, MCS '91, CLH '93
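The spin-on-predecessor idea can be sketched with a minimal CLH lock in C11. The field and function names are my own, and this sketch omits the node-recycling discipline a production CLH lock uses.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* CLH queue lock sketch: each acquirer enqueues its own node with an
   atomic SWAP on the tail, then spins on its PREDECESSOR's flag. This
   gives first-come first-served order and purely local spinning. */
typedef struct clh_node { atomic_bool locked; } clh_node;
typedef struct { _Atomic(clh_node *) tail; } clh_lock;  /* tail = NULL: free */

clh_node *clh_acquire(clh_lock *L, clh_node *me) {
    atomic_store(&me->locked, true);
    clh_node *pred = atomic_exchange(&L->tail, me);     /* join the queue */
    if (pred != NULL)
        while (atomic_load(&pred->locked))              /* spin locally */
            ;
    return pred;          /* caller may reuse pred as its next node */
}

void clh_release(clh_node *me) {
    atomic_store(&me->locked, false);   /* hand the lock to our successor */
}
```

The strict FIFO handoff is exactly the property the talk later argues is unwanted on NUCAs: the next holder is whoever queued first, regardless of which node it sits in.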
Performance, May 2002
Traditional microbenchmark, 16-CPU Sun Enterprise E6000 SMP

[Figure: Time/Processor [seconds] (0.00 – 0.25) vs. Processors (0 – 16) for TATAS, TATAS_EXP, MCS, and CLH.]
Synchronization Today

• Commercial applications use spin locks (!), usually TATAS & TATAS_EXP with timeout, for
  • recovery from transaction deadlock
  • recovery from preemption of the lock holder
• POSIX threads:
  • pthread_mutex_lock
  • pthread_mutex_unlock
• HPC: runtime systems, OpenMP, …
Non-Uniform Memory Architecture (NUMA)

• NUMA optimizations:
  • Page migration
  • Page replication

[Figure: two nodes, each with processors P1 … Pn, caches, and local memory, connected by a switch; a remote memory access costs 2 – 10x a local one (local = 1).]
Non-Uniform Communication Architecture (NUCA)

• NUCA examples (NUCA ratios):
  • 1992: Stanford DASH (~4.5)
  • 1996: Sequent NUMA-Q (~10)
  • 1999: Sun WildFire (~6)
  • 2000: Compaq DS-320 (~3.5)
  • Future: CMP, SMT (~10)

[Figure: two nodes connected by a switch; the NUCA ratio compares the latency of a cache-to-cache transfer between nodes (2 – 10) with one within a node (1).]

Our NUCA …
Our NUCA: Sun WildFire

• Two E6000s connected through a hardware-coherent interface with a raw bandwidth of 800 MB/s in each direction
• 16 UltraSPARC II (250 MHz) CPUs per node
• 8 GB memory
• NUCA ratio ~6
Performance on Our NUCA

[Figure, left: Time/Processor [seconds] (0.00 – 0.50) vs. Processors (0 – 32) for TATAS, TATAS_EXP, MCS, and CLH. Right: Node-handoffs [%] (0 – 100) vs. Processors (0 – 32) for the same locks.]
Our Goals

• Demonstrate that the first-come first-served nature of queue-based locks is unwanted for NUCAs
  • new microbenchmark: "more realistic" behavior, and a real application study
• Design a scalable spin lock that exploits NUCAs by
  • creating controlled unfairness (a stable lock), and
  • reducing traffic compared with the test&set locks
Outline

• History & Background
• NUMA vs. NUCA
• Experimentation Environment
• The RH Lock
• Performance Results
• Application Performance
• Conclusions
Key Ideas Behind the RH Lock

• Minimize global traffic at lock handover
  • Only one thread per node tries to acquire a "remote" lock
• Maximize node locality on NUCAs
  • Hand the lock over to a neighbor in the same node
  • Creates locality for the critical-section (CS) data as well
  • Especially good for large CSs and high contention
• RH lock in a nutshell: double TATAS_EXP, one node-local lock + one "global"
The RH Lock Algorithm

One lock word per node, kept in that node's memory (Lock1 in Cabinet 1, Lock2 in Cabinet 2). A word holds FREE, L_FREE (freed locally), REMOTE, or the holder's thread ID (TID).

Acquire: SWAP(my_TID, Lock)
  • if FREE or L_FREE: you've got it!
  • if REMOTE: spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp. backoff)
  • else: TATAS(my_TID, Lock) until FREE or L_FREE

Release: CAS(my_TID, FREE), else L_FREE

[Figure: two 16-CPU cabinets; P2 acquires Lock1 locally while P19's SWAP on Lock2 returns REMOTE, so P19 spins remotely on Lock1 in Cabinet 1.]

IF (more contention) THEN more efficient CS
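The steps above can be sketched in C11. This is a simplified single-word-per-node model: the encoding of FREE/L_FREE/REMOTE as negative values, the two-argument interface, and the omission of exponential backoff are my assumptions, not the original implementation.

```c
#include <stdatomic.h>

enum { FREE = -1, L_FREE = -2, REMOTE = -3 };   /* thread IDs are >= 0 */
typedef atomic_int rh_word;

/* Acquire on this node's word; `other` is the other node's word. */
void rh_acquire(rh_word *mine, rh_word *other, int my_tid) {
    int prev = atomic_exchange(mine, my_tid);        /* SWAP(my_TID, Lock) */
    for (;;) {
        if (prev == FREE || prev == L_FREE)
            return;                                  /* you've got it */
        if (prev == REMOTE) {
            /* The lock lives in the other node: spin remotely until we
               can flip its word from FREE to REMOTE (backoff omitted). */
            int expected = FREE;
            while (!atomic_compare_exchange_weak(other, &expected, REMOTE))
                expected = FREE;
            return;
        }
        /* Another local thread holds it: TATAS until FREE or L_FREE. */
        while ((prev = atomic_load(mine)) != FREE && prev != L_FREE)
            ;
        prev = atomic_exchange(mine, my_tid);
    }
}

void rh_release(rh_word *mine, int my_tid) {
    int expected = my_tid;
    /* CAS(my_TID, FREE): if nobody SWAPped a TID in meanwhile, the lock
       is globally free; otherwise mark it L_FREE for a local waiter. */
    if (!atomic_compare_exchange_strong(mine, &expected, FREE))
        atomic_store(mine, L_FREE);
}
```

The key effect: a failed release-CAS means a local thread is already waiting, so the lock is handed off within the node (L_FREE) instead of crossing the interconnect.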
Performance Results
Traditional microbenchmark, 2-node Sun WildFire

[Figure, left: Time/Processor [seconds] (0.00 – 0.50) vs. Processors (0 – 32) for TATAS, TATAS_EXP, MCS, CLH, and RH. Right: Node-handoffs [%] (0 – 100) vs. Processors for the same locks, with RH at Fair_factor = 1, 50, and 100.]
Controlling Unfairness …

void rh_acquire_slowpath(rh_lock *L)
{
  ...
  /* roughly one handover in FAIR_FACTOR is forced to be fair */
  if ((random() % FAIR_FACTOR) == 0)
    be_fair = TRUE;
  else
    be_fair = FALSE;
  ...
}

void rh_release(rh_lock *L)
{
  if (be_fair)
    *L = FREE;                            /* let any node take the lock */
  else if (cas(L, my_tid, FREE) != my_tid)
    *L = L_FREE;                          /* hand off within the node */
}
Node-handoffs
Traditional microbenchmark, 2-node Sun WildFire

[Figure: Node-handoffs [%] (0 – 100) vs. Processors (0 – 32) for TATAS, TATAS_EXP, MCS, CLH, and RH with Fair_factor = 1, 50, and 100.]
New Microbenchmark

for (i = 0; i < iterations; i++) {
  ACQUIRE(lock);
  // Critical Section (CS) work
  RELEASE(lock);
  // Non-CS work, STATIC part +
  // Non-CS work, RANDOM part
}

• More realistic node-handoffs for queue-based locks
• Constant number of processors
• The amount of critical-section (CS) work can be increased, so we can control the "amount of contention"
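A runnable version of the loop body might look as follows; the plain test-and-set spinlock stands in for whichever lock is under test, and the array sizes and work amounts are illustrative assumptions.

```c
#include <stdatomic.h>
#include <stdlib.h>

static atomic_flag bench_lock = ATOMIC_FLAG_INIT;
static long shared_array[2048];          /* critical-section data */

/* One benchmark iteration: touch `cs_size` shared elements inside the
   lock, then do a static + random amount of private non-CS work. */
long bench_iteration(unsigned cs_size, unsigned noncs_static,
                     unsigned noncs_random) {
    while (atomic_flag_test_and_set(&bench_lock))    /* ACQUIRE */
        ;
    for (unsigned j = 0; j < cs_size; j++)           /* CS work */
        shared_array[j]++;
    atomic_flag_clear(&bench_lock);                  /* RELEASE */

    long local = 0;
    unsigned n = noncs_static + (unsigned)rand() % (noncs_random + 1);
    for (unsigned j = 0; j < n; j++)                 /* non-CS work */
        local += j;
    return local;
}
```

Raising `cs_size` lengthens the critical section relative to the non-CS work, which is how the benchmark dials contention up or down at a fixed processor count.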
Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs

[Figure, left: Time [seconds] (0 – 30) vs. Critical Work [array size] (0 – 2000) for TATAS, TATAS_EXP, MCS, CLH, and RH. Right: Node-handoffs [%] (0 – 60) vs. Critical Work for the same locks.]
Application Performance (1): Methodology

• The SPLASH-2 programs (14 apps)
• We study only applications with more than 10,000 acquire/release operations: Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, and Water-Nsq
• Synchronization algorithms: TATAS, TATAS_EXP, MCS, CLH, and RH
• 2-node Sun WildFire

Program        Lock Acquires
Barnes              69,193
Cholesky            74,284
FFT                     32
FMM                 80,528
LU-c & LU-nc            32
Ocean-c              6,304
Ocean-nc             6,656
Radiosity          295,627
Radix                   32
Raytrace           366,450
Volrend             38,456
Water-Nsq          112,415
Water-Sp               510
Application Performance (2): Raytrace Speedup

[Figure: Speedup (0 – 8) vs. Number of Processors (0 – 28) on WildFire for TATAS, TATAS_EXP, MCS, CLH, and RH.]
Single-Processor Results
Traditional microbenchmark, null CS

for (i = 0; i < iterations; i++) {
  ACQUIRE(lock);
  RELEASE(lock);
}

Lock        Latency
TATAS        97 ns
TATAS_EXP    97 ns
MCS         202 ns
CLH         137 ns
RH          121 ns
Performance Results
Traditional microbenchmark, single-node E6000

• Bind all threads to only one of the E6000 nodes

[Figure: Time/Processor [seconds] (0.00 – 0.25) vs. Processors (0 – 16) for TATAS, TATAS_EXP, MCS, CLH, and RH.]

As expected: RH lock ≈ TATAS_EXP
Conclusions

• First-come first-served is not desirable for NUCAs
• The RH lock exploits NUCAs by
  • creating locality through controlled unfairness (a stable lock)
  • reducing traffic compared with the test&set locks
• The only lock that performs better under contention
  • A critical section (CS) guarded by the RH lock takes less than half the time to execute compared with the same CS guarded by any other lock
• Raytrace on 30 CPUs: 1.83 – 5.70 "better"
• Works best for NUCAs with a few large "nodes"