![Page 1: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/1.jpg)
@kavya719
Let’s talk locks!
![Page 2: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/2.jpg)
kavya
![Page 3: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/3.jpg)
locks.
![Page 7: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/7.jpg)
“locks are slow”
lock contention causes ~10x latency
[Figure: latency (ms) over time]
…but they’re used everywhere, from schedulers to databases and web servers.
?
![Page 8: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/8.jpg)
let’s analyze its performance! performance models for contention
let’s build a lock! a tour through lock internals
let’s use it, smartly! a few closing strategies
![Page 9: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/9.jpg)
our case-study
Lock implementations are hardware, ISA, OS and language specific:
We assume an x86_64 SMP machine running a modern Linux.
We’ll look at the lock implementation in Go 1.12.
[Figure: simplified SMP system diagram. CPU 0 and CPU 1, each with its own cache, joined by an interconnect to memory.]
![Page 11: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/11.jpg)
a brief go primer
The unit of concurrent execution: goroutines.
• use as you would threads:
> go handle_request(r)
• but user-space threads: managed entirely by the Go runtime, not the operating system.
Data shared between goroutines must be synchronized. One way is to use the blocking, non-recursive lock construct:
> var mu sync.Mutex
> mu.Lock()
> …
> mu.Unlock()
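To make the primer concrete, here is a minimal sketch of goroutines sharing a counter behind a `sync.Mutex`. The `counter` type and the `run` helper are hypothetical names made up for this illustration; they are not from the talk.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical shared value guarded by a sync.Mutex.
type counter struct {
	mu sync.Mutex
	n  int
}

// inc takes the lock, updates the shared field, and releases the lock.
func (c *counter) inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// run starts `workers` goroutines that each call inc `perWorker` times
// and returns the final count once all of them have finished.
func run(workers, perWorker int) int {
	var c counter
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				c.inc()
			}
		}()
	}
	wg.Wait() // all goroutines are done, so reading c.n is safe
	return c.n
}

func main() {
	fmt.Println(run(8, 1000)) // prints 8000
}
```

Without the `Lock`/`Unlock` pair in `inc`, the increments would race and the final count would no longer be guaranteed.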
![Page 12: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/12.jpg)
let’s build a lock!
a tour through lock internals.
![Page 14: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/14.jpg)
want: “mutual exclusion”
only one thread has access to shared data at any given time

    // shared ring buffer
    var tasks Tasks

    // T1 running on CPU 1
    func reader() {
        // Read a task
        t := tasks.get()
        // Do something with it.
        ...
    }

    // T2 running on CPU 2
    func writer() {
        // Write to tasks
        tasks.put(t)
    }
![Page 15: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/15.jpg)
want: “mutual exclusion”
…idea! use a flag?
![Page 19: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/19.jpg)
    // track whether tasks can be
    // accessed (0) or not (1)
    var flag int
    var tasks Tasks

    // T1 running on CPU 1
    func reader() {
        for {
            /* If flag is 0, can access tasks. */
            if flag == 0 {
                /* Set flag */
                flag++
                ...
                /* Unset flag */
                flag--
                return
            }
            /* Else, keep looping. */
        }
    }

    // T2 running on CPU 2
    func writer() {
        for {
            /* If flag is 0, can access tasks. */
            if flag == 0 {
                /* Set flag */
                flag++
                ...
                /* Unset flag */
                flag--
                return
            }
            /* Else, keep looping. */
        }
    }
![Page 21: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/21.jpg)
flag++ (T1 running on CPU 1)
[Figure: CPU and memory]
1. Read (0)
2. Modify
3. Write (1)
![Page 23: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/23.jpg)
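[Figure: timeline of memory operations. T1 (running on CPU 1) executes flag++ as a read (R) followed by a write (W); T2 (running on CPU 2) executes if flag == 0, and its read (R) can fall between T1’s R and W.]
T2 may observe T1’s RMW half-complete.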
![Page 24: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/24.jpg)
atomicity
A memory operation is non-atomic if it can be observed half-complete by another thread.
An operation may be non-atomic because it:
• uses multiple CPU instructions: operations on a large data structure; compiler decisions.

    o := Order{
        id:    10,
        name:  "yogi bear",
        order: "pie",
        count: 3,
    }

• uses a single non-atomic CPU instruction: RMW instructions; unaligned loads and stores.
> flag++
![Page 26: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/26.jpg)
An atomic operation is an “indivisible” memory access.
On x86_64, loads and stores that are naturally aligned, up to 64 bits, are atomic.*
Natural alignment guarantees the data item fits within a cache line; cache coherency guarantees a consistent view for a single cache line.
* these are not the only guaranteed atomic operations.
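A small sketch of the difference in Go, assuming the shared value is a counter rather than the talk's flag: `atomic.AddInt64` performs the read-modify-write as one indivisible step, so no increment can be lost. The `atomicCount` helper is a hypothetical name for this example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// atomicCount has `workers` goroutines each add 1 to a shared counter
// `perWorker` times using atomic.AddInt64, an indivisible read-modify-write.
func atomicCount(workers, perWorker int) int64 {
	var n int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				atomic.AddInt64(&n, 1) // in place of the racy n++
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&n)
}

func main() {
	fmt.Println(atomicCount(4, 10000)) // prints 40000
}
```

Replacing the atomic add with a plain `n++` would turn this into a data race: two goroutines could both read the same old value and one update would be overwritten.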
![Page 27: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/27.jpg)
…idea! use a flag?
nope; not atomic.
![Page 28: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/28.jpg)
    // T1 running on CPU 1
    func reader() {
        for {
            /* If flag is 0, can access tasks. */
            if flag == 0 {
                /* Set flag */
                flag = 1
                t := tasks.get()
                ...
                /* Unset flag */
                flag = 0
                return
            }
            /* Else, keep looping. */
        }
    }
![Page 29: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/29.jpg)
the compiler may reorder operations.

    // Sets flag to 1 & reads data.
    func reader() {
        flag = 1
        t := tasks.get()
        ...
        flag = 0
    }
![Page 30: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/30.jpg)
the processor may reorder operations.
StoreLoad reordering: load t before store flag = 1

    // Sets flag to 1 & reads data.
    func reader() {
        flag = 1
        t := tasks.get()
        ...
        flag = 0
    }
![Page 34: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/34.jpg)
memory access reordering
The compiler and the processor can reorder memory operations to optimize execution.
• The only cardinal rule is sequential consistency for single-threaded programs.
• Other guarantees about compiler reordering are captured by a language’s memory model: C++ and Go guarantee that data-race-free programs will be sequentially consistent.
• For processor reordering, by the hardware memory model: x86_64 provides Total Store Ordering (TSO), a relaxed consistency model. Most reorderings are invalid but StoreLoad is game; this allows the processor to hide the latency of writes.
![Page 37: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/37.jpg)
…idea! use a flag?
nope; not atomic and no memory order guarantees.
need a construct that provides atomicity and prevents memory reordering.
…the hardware provides!
![Page 38: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/38.jpg)
special hardware instructions
For guaranteed atomicity and to prevent memory reordering.
• for atomicity, x86 example: XCHG (exchange).
• to prevent reordering: these instructions are called memory barriers. They prevent reordering by the compiler too. x86 example: MFENCE, LFENCE, SFENCE.
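Go code does not issue fences directly; the closest tool available to user code is `sync/atomic`, whose operations the Go memory model describes as behaving as if executed in a sequentially consistent order. The sketch below is the classic store-buffer litmus test written with those atomics, so the StoreLoad outcome (both loads seeing 0) should never appear. `storeBufferLitmus` is a hypothetical helper name, and the iteration count is arbitrary.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// storeBufferLitmus runs the two-thread "store then load the other variable"
// pattern `iterations` times and counts how often both loads observed 0.
// With sequentially consistent atomics that outcome is not allowed.
func storeBufferLitmus(iterations int) int {
	bothZero := 0
	for i := 0; i < iterations; i++ {
		var x, y, r1, r2 int32
		var wg sync.WaitGroup
		wg.Add(2)
		go func() {
			defer wg.Done()
			atomic.StoreInt32(&x, 1)  // store x
			r1 = atomic.LoadInt32(&y) // then load y
		}()
		go func() {
			defer wg.Done()
			atomic.StoreInt32(&y, 1)  // store y
			r2 = atomic.LoadInt32(&x) // then load x
		}()
		wg.Wait()
		if r1 == 0 && r2 == 0 {
			bothZero++
		}
	}
	return bothZero
}

func main() {
	fmt.Println(storeBufferLitmus(1000)) // prints 0
}
```

With plain (non-atomic) stores and loads the same program would be a data race, and nothing would rule out the load being satisfied before the earlier store becomes visible.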
![Page 40: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/40.jpg)
special hardware instructions
For guaranteed atomicity and to prevent memory reordering.
The x86 LOCK instruction prefix provides both.
Used to prefix memory access instructions; these back the atomic operations in languages like Go:
• LOCK ADD: atomic.Add
• LOCK CMPXCHG: atomic.CompareAndSwap
Atomic compare-and-swap (CAS) conditionally updates a variable: it checks if the variable has the expected value and, if so, changes it to the desired value.
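The slides abbreviate the Go API names; in the standard library the typed forms are `atomic.AddInt32`, `atomic.CompareAndSwapInt32`, and so on. A minimal sketch of the CAS semantics just described, with a hypothetical `tryAcquire` helper:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tryAcquire attempts to move flag from 0 (free) to 1 (taken).
// It reports whether this call performed the swap.
func tryAcquire(flag *int32) bool {
	return atomic.CompareAndSwapInt32(flag, 0, 1)
}

func main() {
	var flag int32
	first := tryAcquire(&flag)  // flag was 0: swapped to 1
	second := tryAcquire(&flag) // flag is already 1: left unchanged
	fmt.Println(first, second, atomic.LoadInt32(&flag)) // prints: true false 1
}
```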
![Page 41: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/41.jpg)
baby’s first lock

    var flag int
    var tasks Tasks

    func reader() {
        for {
            // Try to atomically CAS flag from 0 -> 1
            if atomic.CompareAndSwap(&flag, 0, 1) {
                // the CAS succeeded; we set flag to 1.
                ...
                // Atomically set flag back to 0.
                atomic.Store(&flag, 0)
                return
            }
            // flag was 1 so our CAS failed; try again :)
        }
    }
![Page 42: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/42.jpg)
baby’s first lock: spinlocks
This is a simplified spinlock.
Spinlocks are used extensively in the Linux kernel.
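The slide code uses shorthand names, so here is a runnable sketch of the same idea with the real `sync/atomic` calls. The `SpinLock` type and `countWithSpinLock` helper are hypothetical, and the `runtime.Gosched()` call in the loop is an addition of this sketch (it lets other goroutines run while we spin); it is not part of the talk's version.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// SpinLock is a hypothetical minimal spinlock: 0 = unlocked, 1 = locked.
type SpinLock struct{ flag int32 }

// Lock spins until the CAS from 0 to 1 succeeds.
func (l *SpinLock) Lock() {
	for !atomic.CompareAndSwapInt32(&l.flag, 0, 1) {
		runtime.Gosched() // CAS failed: yield, then try again
	}
}

// Unlock atomically sets the flag back to 0.
func (l *SpinLock) Unlock() { atomic.StoreInt32(&l.flag, 0) }

// countWithSpinLock increments a plain int from several goroutines,
// using the spinlock to provide mutual exclusion.
func countWithSpinLock(workers, perWorker int) int {
	var l SpinLock
	n := 0
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				l.Lock()
				n++ // critical section
				l.Unlock()
			}
		}()
	}
	wg.Wait()
	return n
}

func main() {
	fmt.Println(countWithSpinLock(4, 5000)) // prints 20000
}
```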
![Page 43: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/43.jpg)
The atomic CAS is the quintessence of any lock implementation.
![Page 44: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/44.jpg)
cost of an atomic operation
Run on a 12-core x86_64 SMP machine.
Atomic store to a C _Atomic int, 10M times in a tight loop. Measure average time taken per operation (from within the program).
• With 1 thread: ~13ns (vs. regular operation: ~2ns)
• With 12 cpu-pinned threads: ~110ns (threads are effectively serialized)
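The measurement above was done in C with pinned threads. A rough Go analogue might look like the sketch below; it is an assumption-laden approximation, not the talk's benchmark: goroutines are not pinned to CPUs, the timing includes goroutine start-up, and the numbers it prints will differ from the ones quoted. `nsPerStore` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// nsPerStore runs `workers` goroutines that each perform `ops` atomic stores
// to one shared variable. It returns the elapsed wall-clock time divided by
// `ops`, i.e. the average time per store as seen by a single goroutine.
func nsPerStore(workers, ops int) float64 {
	var shared int32
	var wg sync.WaitGroup
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < ops; i++ {
				atomic.StoreInt32(&shared, int32(i))
			}
		}()
	}
	wg.Wait()
	return float64(time.Since(start).Nanoseconds()) / float64(ops)
}

func main() {
	const ops = 1000000
	n := runtime.NumCPU()
	fmt.Printf("1 goroutine:   ~%.1f ns per store\n", nsPerStore(1, ops))
	fmt.Printf("%d goroutines: ~%.1f ns per store\n", n, nsPerStore(n, ops))
}
```

If the contended figure comes out several times larger than the single-goroutine figure, that matches the point of the slide: stores to one shared location are effectively serialized.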
![Page 47: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/47.jpg)
sweet.
We have a scheme for mutual exclusion that provides atomicity and memory ordering guarantees.
…but spinning for long durations is wasteful; it takes away CPU time from other threads.
enter the operating system!
![Page 48: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/48.jpg)
Linux’s futex
Interface and mechanism for userspace code to ask the kernel to suspend/resume threads.
• interface: the futex syscall
• mechanism: a kernel-managed queue
![Page 50: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/50.jpg)
flag can be 0: unlocked, 1: locked, 2: there’s a waiter

    var flag int
    var tasks Tasks

    // T1
    func reader() {
        for {
            if atomic.CompareAndSwap(&flag, 0, 1) {
                ...
            }

            // T1’s CAS fails (because T2 has set the flag).
            // CAS failed, set flag to sleeping:
            // set flag to 2 (there’s a waiter)
            v := atomic.Xchg(&flag, 2)

            // and go to sleep:
            // futex syscall to tell the kernel to suspend us until flag changes.
            // when we’re resumed, we’ll CAS again.
            futex(&flag, FUTEX_WAIT, ...)
        }
    }
![Page 53: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/53.jpg)
in the kernel:
1. arrange for thread to be resumed in the future: add an entry for this thread in the kernel queue for the address we care about.
2. deschedule the calling thread to suspend it.
[Figure: keyA (from the userspace address &flag) is hashed; hash(keyA) selects a bucket whose queue holds futex_q entries, here (keyA, T1) and (keyother, Tother).]
![Page 56: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/56.jpg)
T2 is done (accessing the shared data)

    // T2
    func writer() {
        for {
            if atomic.CompareAndSwap(&flag, 0, 1) {
                ...
                // Set flag to unlocked.
                v := atomic.Xchg(&flag, 0)
                // if flag was 2, there’s at least one waiter
                if v == 2 {
                    // If there was a waiter, issue a wake up:
                    // futex syscall to tell the kernel to wake a waiter up.
                    futex(&flag, FUTEX_WAKE, ...)
                }
                return
            }
            v := atomic.Xchg(&flag, 2)
            futex(&flag, FUTEX_WAKE, …)
        }
    }

On the wake-up, the kernel:
• hashes the key
• walks the hash bucket’s futex queue
• finds the first thread waiting on the address
• schedules it to run again!
![Page 57: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/57.jpg)
pretty convenient!
That was a hella simplified futex.
…but we still have a nice, lightweight primitive to build synchronization constructs.
pthread mutexes use futexes.
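To tie the pieces together, here is a toy model in plain Go, with no real syscalls. `futexTable` is a hypothetical user-space stand-in for the kernel side (a map from address to a FIFO queue of parked waiters, with channels playing the role of descheduling), and `Mutex` uses the three flag states from the slides. One deliberate difference from the simplified slide code: the lock path checks the value returned by the exchange, so a waiter that swaps in 2 and finds the old value was 0 takes the lock instead of going to sleep.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// futexTable is a toy stand-in for the kernel's futex hash table.
type futexTable struct {
	mu     sync.Mutex
	queues map[*int32][]chan struct{}
}

func newFutexTable() *futexTable {
	return &futexTable{queues: make(map[*int32][]chan struct{})}
}

// wait parks the caller on addr, but only if *addr still equals want.
// Checking under the table lock means a concurrent wake cannot be missed.
func (ft *futexTable) wait(addr *int32, want int32) {
	ft.mu.Lock()
	if atomic.LoadInt32(addr) != want {
		ft.mu.Unlock()
		return
	}
	ch := make(chan struct{})
	ft.queues[addr] = append(ft.queues[addr], ch)
	ft.mu.Unlock()
	<-ch // "descheduled" until wake closes the channel
}

// wake resumes the first waiter queued on addr, if there is one.
func (ft *futexTable) wake(addr *int32) {
	ft.mu.Lock()
	defer ft.mu.Unlock()
	if q := ft.queues[addr]; len(q) > 0 {
		close(q[0])
		ft.queues[addr] = q[1:]
	}
}

// Mutex flag states: 0 unlocked, 1 locked, 2 locked and there may be a waiter.
type Mutex struct {
	flag int32
	ft   *futexTable
}

func (m *Mutex) Lock() {
	if atomic.CompareAndSwapInt32(&m.flag, 0, 1) {
		return // uncontended fast path
	}
	// Contended: mark "there's a waiter". If the old value was 0, the swap
	// itself took the lock; otherwise sleep and retry when resumed.
	for atomic.SwapInt32(&m.flag, 2) != 0 {
		m.ft.wait(&m.flag, 2)
	}
}

func (m *Mutex) Unlock() {
	if atomic.SwapInt32(&m.flag, 0) == 2 {
		m.ft.wake(&m.flag) // flag was 2: wake one waiter
	}
}

// countWithMutex increments a plain int under the toy mutex.
func countWithMutex(workers, perWorker int) int {
	m := &Mutex{ft: newFutexTable()}
	n := 0
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				m.Lock()
				n++
				m.Unlock()
			}
		}()
	}
	wg.Wait()
	return n
}

func main() {
	fmt.Println(countWithMutex(8, 2000)) // prints 16000
}
```

The real kernel keys its queues by a hash of the address and deschedules OS threads; the map and channels here only mimic that shape.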
![Page 59: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/59.jpg)
cost of a futex
Run on a 12-core x86_64 SMP machine.
Lock & unlock a pthread mutex 10M times in a loop (lock, increment an integer, unlock).
Measure average time taken per lock/unlock pair (from within the program).
• uncontended case (1 thread): ~13ns, the cost of the user-space atomic CAS
• contended case (12 cpu-pinned threads): ~0.9us, the cost of the atomic CAS + syscall + thread context switch
![Page 60: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/60.jpg)
![Page 61: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/61.jpg)
spinning vs. sleeping
Spinning makes sense for short durations; it keeps the thread on the CPU. The trade-off is that it uses CPU cycles without making progress. So at some point, it makes sense to pay the cost of the context switch and go to sleep.
There are smart “hybrid” futexes: CAS-spin a small, fixed number of times; if that didn’t lock, make the futex syscall. Examples: the Go runtime’s futex implementation; a variant of the pthread_mutex.
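The spin-then-sleep idea can be sketched in plain Go. This is only an illustration, not the Go runtime's or pthread's implementation: `HybridLock`, `spinLimit` and `CountWithLock` are hypothetical names, and `runtime.Gosched` (yield the processor) stands in for the futex sleep, since user Go code does not call futex directly.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// spinLimit is a hypothetical tuning knob: how many CAS attempts to make
// before giving up the CPU.
const spinLimit = 4

// HybridLock is an illustrative spin-then-yield lock.
// state is 0 when unlocked and 1 when locked.
type HybridLock struct {
	state int32
}

// Lock CAS-spins a small, fixed number of times. If the lock is still held,
// it yields with runtime.Gosched (the stand-in for sleeping) and retries.
func (l *HybridLock) Lock() {
	for {
		for i := 0; i < spinLimit; i++ {
			if atomic.CompareAndSwapInt32(&l.state, 0, 1) {
				return
			}
		}
		runtime.Gosched()
	}
}

// Unlock releases the lock. This sketch keeps no wait queue, so there is
// nobody to wake.
func (l *HybridLock) Unlock() {
	atomic.StoreInt32(&l.state, 0)
}

// CountWithLock increments a shared counter from several goroutines, taking
// the lock around every increment, and returns the final value.
func CountWithLock(goroutines, perGoroutine int) int {
	var (
		lock    HybridLock
		counter int
		wg      sync.WaitGroup
	)
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				lock.Lock()
				counter++
				lock.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(CountWithLock(4, 1000)) // prints 4000
}
```

A real hybrid lock would put the waiter to sleep on a wait queue instead of yielding and retrying; the sketch only shows where the spin budget ends and the "give up the CPU" path begins.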
![Page 62: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/62.jpg)
…can we do better for user-space threads?
![Page 63: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/63.jpg)
![Page 64: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/64.jpg)
…can we do better for user-space threads?
goroutines are user-space threads. The Go runtime multiplexes them onto threads.
They are lighter-weight and cheaper than threads: goroutine switches = ~tens of ns; thread switches = ~a µs.
[Diagram: the Go scheduler multiplexes goroutines (g1, g2, g6) onto a thread; the OS scheduler multiplexes threads onto CPU cores.]
We can block the goroutine without blocking the underlying thread, to avoid the thread context switch cost!
![Page 65: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/65.jpg)
This is what the Go runtime’s semaphore does!
The semaphore is conceptually very similar to futexes in Linux*, but it is used to sleep/wake goroutines:
• a goroutine that blocks on a mutex is descheduled, but not the underlying thread.
• the goroutine wait queues are managed by the runtime, in user-space.
* There are, of course, differences in implementation though.
![Page 66: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/66.jpg)
the goroutine wait queues are managed by the Go runtime, in user-space.
    var flag int
    var tasks Tasks

    func reader() {
        for {
            // Attempt to CAS flag.
            if atomic.CompareAndSwap(&flag, ...) {
                ...
            }
            // CAS failed; add G1 as a waiter for flag.
            root.queue()
            // and go to sleep.
            futex(&flag, FUTEX_WAIT, ...)
        }
    }

G1’s CAS fails (because G2 has set the flag).
![Page 67: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/67.jpg)
the goroutine wait queues (in user-space, managed by the Go runtime):
[Diagram: hash(&flag) selects a hash bucket; the bucket holds entries for &flag (the userspace address) and &other, with goroutines G1, G3 and G4 shown as waiters.]
• the top-level waitlist for a hash bucket is implemented as a treap.
• there’s a second-level wait queue for each unique address.
![Page 68: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/68.jpg)
the goroutine wait queues are managed by the Go runtime, in user-space.
    var flag int
    var tasks Tasks

    func reader() {
        for {
            // Attempt to CAS flag.
            if atomic.CompareAndSwap(&flag, ...) {
                ...
            }
            // CAS failed; add G1 as a waiter for flag.
            root.queue()
            // and suspend G1.
            gopark()
        }
    }

G1’s CAS fails (because G2 has set the flag).
The Go runtime deschedules the goroutine; keeps the thread running!
![Page 69: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/69.jpg)
G2’s done (accessing the shared data).

    func writer() {
        for {
            if atomic.CompareAndSwap(&flag, 0, 1) {
                ...
                // Set flag to unlocked.
                atomic.Xadd(&flag, ...)
                // If there’s a waiter, reschedule it.
                waiter := root.dequeue(&flag)
                goready(waiter)
                return
            }
            root.queue()
            gopark()
        }
    }

The dequeue + goready step finds the first waiter goroutine and reschedules it.
![Page 70: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/70.jpg)
![Page 71: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/71.jpg)
this is clever. Avoids the hefty thread context switch cost in the contended case, up to a point.
but…
![Page 72: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/72.jpg)
    func reader() {
        for {
            if atomic.CompareAndSwap(&flag, ...) {
                ...
            }
            // CAS failed; add G1 as a waiter for flag.
            semaroot.queue()
            // and suspend G1.
            gopark()
        }
    }

once G1 is resumed, it will try to CAS again.
Resumed goroutines have to compete with any other goroutines trying to CAS. They will likely lose: there’s a delay between when the flag was set to 0 and when this goroutine was rescheduled.
![Page 73: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/73.jpg)
The unlock path, where that delay arises:

    // Set flag to unlocked.
    atomic.Xadd(&flag, …)
    // If there’s a waiter, reschedule it.
    waiter := root.dequeue(&flag)
    goready(waiter)
    return
![Page 74: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/74.jpg)
![Page 75: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/75.jpg)
Resumed goroutines have to compete with any other goroutines trying to CAS. They will likely lose: there’s a delay between when the flag was set to 0 and when this goroutine was rescheduled.
So, the semaphore implementation may end up:
• unnecessarily resuming a waiter goroutine, which results in a goroutine context switch again.
• causing goroutine starvation, which can result in long wait times and high tail latencies.
The sync.Mutex implementation adds a layer that fixes these.
![Page 76: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/76.jpg)
Go’s sync.Mutex
Is a hybrid lock that uses a semaphore to sleep / wake goroutines.
![Page 77: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/77.jpg)
![Page 78: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/78.jpg)
Go’s sync.Mutex
Is a hybrid lock that uses a semaphore to sleep / wake goroutines.
Additionally, it tracks extra state to:
• prevent unnecessarily waking up a goroutine. “There’s a goroutine actively trying to CAS”: an unlock in this case does not wake a waiter.
• prevent severe goroutine starvation. “A waiter has been waiting”: if a waiter is resumed but loses the CAS again, it’s queued at the head of the wait queue. If a waiter fails to lock for 1ms, switch the mutex to “starvation mode”.
In starvation mode, other goroutines cannot CAS; they must queue. The unlock hands the mutex off to the first waiter, i.e. the waiter does not have to compete.
![Page 79: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/79.jpg)
how does it perform?
Run on a 12-core x86_64 SMP machine.
Lock & unlock a Go sync.Mutex 10M times in a loop (lock, increment an integer, unlock).
Measure average time taken per lock/unlock pair (from within the program).
• uncontended case (1 goroutine): ~13ns
• contended case (12 goroutines): ~0.8us
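A measurement along these lines could look like the sketch below. It is not the speaker's benchmark program: `LockUnlockLoop` is a hypothetical helper, the goroutines are not pinned to CPUs, and the numbers it prints will differ by machine and Go version.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// LockUnlockLoop runs `goroutines` goroutines that together perform about
// `total` lock/increment/unlock operations on a single sync.Mutex. It returns
// the final counter value and the average wall-clock time per lock/unlock pair.
func LockUnlockLoop(goroutines, total int) (int, time.Duration) {
	if goroutines < 1 {
		goroutines = 1
	}
	var (
		mu      sync.Mutex
		counter int
		wg      sync.WaitGroup
	)
	perGoroutine := total / goroutines
	start := time.Now()
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	elapsed := time.Since(start)
	if counter == 0 {
		return 0, 0
	}
	return counter, elapsed / time.Duration(counter)
}

func main() {
	// 1 goroutine = uncontended case; 12 goroutines = contended case.
	for _, g := range []int{1, 12} {
		n, avg := LockUnlockLoop(g, 10_000_000)
		fmt.Printf("%2d goroutine(s): %d ops, ~%v per lock/unlock pair\n", g, n, avg)
	}
}
```

Timing from inside the program, as here, includes goroutine start-up and the final `wg.Wait`, so the per-pair figure is only meaningful when the iteration count is large.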
![Page 80: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/80.jpg)
how does it perform?
Contended case performance of C vs. Go:
Go initially performs better than C, but they ~converge as concurrency gets high enough.
![Page 81: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/81.jpg)
![Page 82: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/82.jpg)
sync.Mutex uses a semaphore.
![Page 83: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/83.jpg)
the Go runtime semaphore’s hash table for waiting goroutines:
[Diagram: a hash bucket holding the addresses &flag and &other, with waiting goroutines G1, G3, G4.]
each hash bucket needs a lock… and it’s a futex!
![Page 84: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/84.jpg)
![Page 85: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/85.jpg)
the Go runtime semaphore’s hash table for waiting goroutines:
each hash bucket needs a lock… it’s a futex!
the Linux kernel’s futex hash table for waiting threads:
each hash bucket needs a lock… it’s a spin lock!
[Diagram: the two hash tables side by side.]
![Page 86: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/86.jpg)
![Page 87: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/87.jpg)
sync.Mutex uses a semaphore, which uses futexes, which use spin-locks.
It’s locks all the way down!
![Page 88: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/88.jpg)
let’s analyze its performance!
performance models for contention.
![Page 89: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/89.jpg)
uncontended case: cost of the atomic CAS.
contended case: in the worst case, cost of failed atomic operations + spinning + goroutine context switch + thread context switch. …But really, it depends on the degree of contention.
![Page 90: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/90.jpg)
how many threads do we need to support a target throughput? while keeping response time the same.
how does response time change with the number of threads? assuming a constant workload.
“How does application performance change with concurrency?”
![Page 91: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/91.jpg)
Amdahl’s Law
Speed-up depends on the fraction of the workload that can be parallelized (p).
$$\text{speed-up with } N \text{ threads} = \frac{1}{(1 - p) + \frac{p}{N}}$$
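To get a feel for the formula, here is a small Go sketch that evaluates it for a few values. `AmdahlSpeedup` is a hypothetical helper name; the values of p and N are picked to match the range used in the experiment that follows.

```go
package main

import "fmt"

// AmdahlSpeedup returns 1 / ((1 - p) + p/n), where p is the fraction of the
// workload that can be parallelized and n is the number of threads.
func AmdahlSpeedup(p float64, n int) float64 {
	return 1 / ((1 - p) + p/float64(n))
}

func main() {
	for _, p := range []float64{0.25, 0.75} {
		for _, n := range []int{1, 2, 4, 12} {
			fmt.Printf("p=%.2f N=%2d speed-up=%.2f\n", p, n, AmdahlSpeedup(p, n))
		}
	}
}
```

For p = 0.75 and N = 12 this gives 1 / (0.25 + 0.0625) = 3.2, and no number of threads can push it past 1 / (1 - p) = 4: the serial fraction sets the ceiling.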
![Page 92: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/92.jpg)
a simple experiment
Measure time taken to complete a fixed workload.
• the serial fraction holds a lock (sync.Mutex).
• scale the parallel fraction (p) from 0.25 to 0.75.
• measure time taken for number of goroutines (N) = 1 to 12.
![Page 93: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/93.jpg)
Amdahl’s Law
Speed-up depends on the fraction of the workload that can be parallelized (p).
[Plot: curves for p = 0.25 and p = 0.75.]
![Page 94: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/94.jpg)
![Page 95: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/95.jpg)
Universal Scalability Law (USL)
Scalability depends on contention and cross-talk.
• contention penalty ($\alpha N$): due to serialization for shared resources. Examples: lock contention, database contention.
• crosstalk penalty ($\beta N^2$): due to coordination for coherence. Examples: servers coordinating to synchronize mutable state.
![Page 96: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/96.jpg)
Universal Scalability Law (USL)
$$\text{throughput of } N \text{ threads} = \frac{N}{\alpha N + \beta N^2 + C}$$
[Plot: throughput vs. concurrency. Linear scaling: $N / C$. Contention: $N / (\alpha N + C)$. Contention and crosstalk: $N / (\alpha N + \beta N^2 + C)$.]
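The three curves can be tabulated with a few lines of Go. This is a sketch of the formula as written on the slide; `Throughput` is a hypothetical helper and the coefficient values are made up purely to show the shape, not fitted to any measurement.

```go
package main

import "fmt"

// Throughput evaluates the slide's form of the USL:
// N / (alpha*N + beta*N^2 + c).
func Throughput(n, alpha, beta, c float64) float64 {
	return n / (alpha*n + beta*n*n + c)
}

func main() {
	// Hypothetical coefficients, chosen only to illustrate the curves.
	const alpha, beta, c = 0.05, 0.002, 1.0
	for _, n := range []float64{1, 4, 8, 16, 32, 64} {
		fmt.Printf("N=%2.0f linear=%6.2f contention=%5.2f contention+crosstalk=%5.2f\n",
			n,
			Throughput(n, 0, 0, c),        // linear scaling: N / C
			Throughput(n, alpha, 0, c),    // contention only
			Throughput(n, alpha, beta, c), // contention and crosstalk
		)
	}
}
```

With contention alone, throughput flattens out towards 1/α. Once β > 0, it peaks near N = sqrt(C/β) and then declines, which is the retrograde part of the curve.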
![Page 97: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/97.jpg)
[Plot: USL curves for p = 0.25 and p = 0.75, where p = parallel fraction of workload. USL curves plotted using the R usl package.]
![Page 98: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/98.jpg)
let’s use it, smartly!
a few closing strategies.
![Page 99: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/99.jpg)
but first, profile!
Go mutex
• Go mutex contention profiler: https://golang.org/doc/diagnostics.html
Linux
• perf-lock: perf examples by Brendan Gregg; Brendan Gregg article on off-cpu analysis
• eBPF: example bcc tool to measure user lock contention
• Dtrace, systemtap
• mutrace, Valgrind-drd
[Screenshot: pprof mutex contention profile]
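For the Go side, a minimal sketch of switching the mutex contention profiler on and dumping its output might look like this. `Contend` is a hypothetical workload written only to generate contention; `runtime.SetMutexProfileFraction` and `pprof.Lookup("mutex")` are the standard-library entry points.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// Contend makes several goroutines fight over one mutex so the profiler has
// contention events to record. It returns the number of lock acquisitions.
func Contend(goroutines, iterations int) int {
	var (
		mu    sync.Mutex
		count int
		wg    sync.WaitGroup
	)
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				mu.Lock()
				count++
				time.Sleep(10 * time.Microsecond) // hold the lock briefly
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return count
}

func main() {
	// A fraction of 1 reports every contention event; a long-running service
	// would usually sample less aggressively.
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)

	Contend(8, 200)

	// debug=1 writes a human-readable text profile to stdout.
	if err := pprof.Lookup("mutex").WriteTo(os.Stdout, 1); err != nil {
		fmt.Fprintln(os.Stderr, "writing mutex profile:", err)
	}
}
```

Writing the profile with debug=0 to a file instead produces the binary format that `go tool pprof` reads.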
![Page 100: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/100.jpg)
strategy I: don’t use a lock
• remove the need for synchronization from hot-paths: typically involves rearchitecting.
• reduce the number of lock operations: doing more thread local work, buffering, batching, copy-on-write.
• use atomic operations.
• use lock-free data structures; see: http://www.1024cores.net/
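Two of these ideas are easy to show on a shared counter. The sketch below is illustrative only (`CountAtomic` and `CountBatched` are hypothetical names): the first replaces the lock with an atomic add, the second keeps a lock but does the work locally and takes the lock once per goroutine instead of once per increment.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// CountAtomic uses an atomic add for every increment; there is no lock.
func CountAtomic(goroutines, perGoroutine int) int64 {
	var total int64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				atomic.AddInt64(&total, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&total)
}

// CountBatched lets each goroutine count in a local variable and take the
// lock only once, to publish its batch.
func CountBatched(goroutines, perGoroutine int) int64 {
	var (
		mu    sync.Mutex
		total int64
		wg    sync.WaitGroup
	)
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local int64
			for i := 0; i < perGoroutine; i++ {
				local++ // goroutine-local work, no synchronization
			}
			mu.Lock()
			total += local
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(CountAtomic(8, 10000), CountBatched(8, 10000)) // prints 80000 80000
}
```

The atomic version still bounces one cache line between cores on every increment, so under heavy contention the batched version can be the cheaper of the two.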
![Page 101: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/101.jpg)
strategy II: granular locks
• shard data: but ensure no false sharing, by padding to cache line size. Examples: Go runtime semaphore’s hash table buckets; Linux scheduler’s per-CPU runqueues; Go scheduler’s per-CPU runqueues.
• use read-write locks
[Plot: scheduler benchmark (CreateGoroutineParallel). Modified scheduler: global lock; runqueue. Go scheduler: per-CPU core, lock-free runqueues.]
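A sharded counter shows both halves of the advice: one lock per shard, and padding so that neighbouring shards do not land on the same cache line. This is a sketch under assumptions: `ShardedCounter` and its methods are hypothetical, and the 64-byte cache line is an assumption that holds on many x86_64 parts but not on every CPU.

```go
package main

import (
	"fmt"
	"sync"
)

// cacheLineSize is an assumption: 64 bytes is common on x86_64.
const cacheLineSize = 64

// shard holds one lock and one counter. The trailing padding is a full cache
// line, which is wasteful but keeps the hot fields of adjacent shards on
// different cache lines without depending on the size of sync.Mutex.
type shard struct {
	mu sync.Mutex
	n  int64
	_  [cacheLineSize]byte
}

// ShardedCounter spreads increments over several independently locked shards.
type ShardedCounter struct {
	shards []shard
}

// NewShardedCounter creates a counter with n shards (n must be positive).
func NewShardedCounter(n int) *ShardedCounter {
	return &ShardedCounter{shards: make([]shard, n)}
}

// Add adds delta to the shard selected by key (key must be non-negative).
func (c *ShardedCounter) Add(key int, delta int64) {
	s := &c.shards[key%len(c.shards)]
	s.mu.Lock()
	s.n += delta
	s.mu.Unlock()
}

// Sum locks each shard in turn and adds up the per-shard counts.
func (c *ShardedCounter) Sum() int64 {
	var total int64
	for i := range c.shards {
		s := &c.shards[i]
		s.mu.Lock()
		total += s.n
		s.mu.Unlock()
	}
	return total
}

func main() {
	c := NewShardedCounter(8)
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 10000; i++ {
				c.Add(id, 1) // each goroutine hits its own shard
			}
		}(g)
	}
	wg.Wait()
	fmt.Println(c.Sum()) // prints 80000
}
```

Writes scale because goroutines rarely meet on the same lock; the price is that reading the total has to visit every shard.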
![Page 102: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/102.jpg)
strategy III: do less serial work
• move computation out of critical section: typically involves rearchitecting.
[Plots: latency over time. Annotations: “lock contention causes ~10x latency”; “smaller critical section change”.]
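As a small illustration of shrinking the critical section, compare the two methods below. `Cache`, `PutSlow` and `PutFast` are hypothetical names and the SHA-256 hash is just a stand-in for "some computation"; the point is only where the lock is taken relative to the expensive work.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// Cache maps a key to the SHA-256 digest of its payload.
type Cache struct {
	mu      sync.Mutex
	digests map[string][32]byte
}

// NewCache returns an empty cache.
func NewCache() *Cache {
	return &Cache{digests: make(map[string][32]byte)}
}

// PutSlow hashes while holding the lock, so all hashing is serialized.
func (c *Cache) PutSlow(key string, payload []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.digests[key] = sha256.Sum256(payload)
}

// PutFast hashes first and holds the lock only for the map write.
func (c *Cache) PutFast(key string, payload []byte) {
	digest := sha256.Sum256(payload) // computed outside the critical section
	c.mu.Lock()
	c.digests[key] = digest
	c.mu.Unlock()
}

// Len returns the number of cached digests.
func (c *Cache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.digests)
}

func main() {
	c := NewCache()
	payload := make([]byte, 1<<20)
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			c.PutFast(fmt.Sprintf("key-%d", id), payload)
		}(g)
	}
	wg.Wait()
	fmt.Println(c.Len()) // prints 8
}
```

Both methods store the same result; `PutFast` simply lets the goroutines hash in parallel and serializes only the cheap map update.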
![Page 103: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/103.jpg)
bonus strategy:
• contention-aware schedulers. Example: Contention-aware scheduling in MySQL 8.0 InnoDB
![Page 104: Let’s talk locks!...use as you would threads > go handle_request(r) but user-space threads: managed entirely by the Go runtime, not the operating system. The unit of concurrent execution:](https://reader034.vdocuments.us/reader034/viewer/2022042909/5f3a69024bff3821fa0ffd0a/html5/thumbnails/104.jpg)
Special thanks to Eben Freeman, Justin Delegard, Austin Duffield for reading drafts of this.
@kavya719
speakerdeck.com/kavya719/lets-talk-locks
References
• Jeff Preshing’s excellent blog series
• Memory Barriers: A Hardware View for Software Hackers
• LWN.net on futexes
• The Go source code
• The Universal Scalability Law Manifesto, Neil Gunther