In Search of the Perfect Global Interpreter Lock

DESCRIPTION
Presentation on the Python/Ruby Global Interpreter Lock at RuPy 2011. October 14, 2011. Poznan, Poland.

TRANSCRIPT
Copyright (C) 2010, David Beazley, http://www.dabeaz.com
In Search of the Perfect Global Interpreter Lock

David Beazley
http://www.dabeaz.com
@dabeaz

Presented at RuPy 2011
Poznan, Poland
October 15, 2011
Introduction

• As many programmers know, Python and Ruby feature a Global Interpreter Lock (GIL)
• More precisely: CPython and MRI
• It limits thread performance on multicore
• Theoretically restricts code to a single CPU
An Experiment

• Consider a trivial CPU-bound function

    def countdown(n):
        while n > 0:
            n -= 1

• Run it once with a lot of work

    COUNT = 100000000   # 100 million
    countdown(COUNT)

• Now, divide the work across two threads

    t1 = Thread(target=countdown, args=(COUNT//2,))
    t2 = Thread(target=countdown, args=(COUNT//2,))
    t1.start(); t2.start()
    t1.join(); t2.join()
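The Python snippets above can be assembled into one runnable benchmark; a sketch with COUNT scaled down so it finishes quickly (absolute timings depend on your machine and interpreter version):

```python
import time
from threading import Thread

def countdown(n):
    # Trivial CPU-bound loop: pure interpreter work, no I/O.
    while n > 0:
        n -= 1

def timed(fn):
    start = time.time()
    fn()
    return time.time() - start

COUNT = 10_000_000   # scaled down from the talk's 100 million

sequential = timed(lambda: countdown(COUNT))

def threaded():
    # Same total work, split across two threads.
    t1 = Thread(target=countdown, args=(COUNT // 2,))
    t2 = Thread(target=countdown, args=(COUNT // 2,))
    t1.start(); t2.start()
    t1.join(); t2.join()

parallel = timed(threaded)
print(f"sequential: {sequential:.2f}s  threaded: {parallel:.2f}s")
```

On a GIL interpreter the threaded run executes the same number of interpreter instructions, so any slowdown it shows comes purely from lock handling.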
An Experiment

• Some Ruby

    def countdown(n)
      while n > 0
        n -= 1
      end
    end

• Sequential

    COUNT = 100000000   # 100 million
    countdown(COUNT)

• Subdivided across threads

    t1 = Thread.new { countdown(COUNT/2) }
    t2 = Thread.new { countdown(COUNT/2) }
    t1.join
    t2.join
Expectations
• Sequential and threaded versions perform the same amount of work (same # calculations)
• There is the GIL... so no parallelism
• Performance should be about the same
Results

• Ruby 1.9 on OS-X (4 cores)

    Sequential           : 2.46s
    Threaded (2 threads) : 2.55s (~ same)

• Python 2.7 on OS-X (4 cores)

    Sequential           : 6.12s
    Threaded (2 threads) : 9.28s (1.5x slower!)

• Question: Why does it get slower in Python?
Results

• Ruby 1.9 on Windows Server 2008 (2 cores)

    Sequential           : 3.32s
    Threaded (2 threads) : 3.45s (~ same)

• Python 2.7 on Windows Server 2008 (2 cores)

    Sequential           : 6.9s
    Threaded (2 threads) : 63.0s (9.1x slower!)

• Why does it get that much slower on Windows?
An Experiment: Messaging

• A request/reply server for size-prefixed messages

    Client <--------> Server

• Each message: a size header + payload
• Similar: ZeroMQ
An Experiment: Messaging

• A simple test - message echo (pseudocode)

    def client(nummsg, msg):
        while nummsg > 0:
            send(msg)
            resp = recv()
            sleep(0.001)
            nummsg -= 1

    def server():
        while True:
            msg = recv()
            send(msg)

• To be less evil, the client is throttled (<1000 msg/sec)
• This is not a messaging stress test
An Experiment: Messaging

• A test: send/receive 1000 8K messages

• Scenario 1 : An unloaded server

    Client <--------> Server

• Scenario 2 : Server competing with one CPU-bound thread

    Client <--------> Server
                      CPU-Thread
Results

• Messaging with no threads (OS-X, 4 cores)

    CPython 2.7 : 1.29s
    Ruby 1.9    : 1.29s

• Messaging with one CPU-bound thread*

    CPython 2.7 : 12.3s (10x slower)
    Ruby 1.9    : 42.0s (33x slower)

• Hmmm. Curious.

  * On Ruby, the CPU-bound thread was also given lower priority
Results

• Messaging with no threads (Linux, 8 CPUs)

    CPython 2.7 : 1.18s
    Ruby 1.9    : 1.18s

• Messaging with one CPU-bound thread

    CPython 2.7 : 1.60s (1.4x slower - better than OS-X)
    Ruby 1.9    : 5839.4s (~5000x slower - worse!)

• 5000x slower? Really? Why?
The Mystery Deepens

• Disable all but one CPU core

• CPU-bound threads (OS-X)

    Python 2.7 (4 cores + hyperthreading) : 9.28s
    Python 2.7 (1 core)                   : 7.9s (faster!)

• Messaging with one CPU-bound thread

    Ruby 1.9 (4 cores + hyperthreading) : 42.0s
    Ruby 1.9 (1 core)                   : 10.5s (much faster!)

• ?!?!?!?!?!?
Better is Worse

• Change software versions

• Let's upgrade to Python 3 (Linux)

    Python 2.7 (Messaging) : 12.3s
    Python 3.2 (Messaging) : 20.1s (1.6x slower)

• Let's downgrade to Ruby 1.8 (Linux)

    Ruby 1.9 (Messaging)   : 42.0s
    Ruby 1.8.7 (Messaging) : 10.0s (4x faster)

• So much for progress (sigh)
What's Happening?
• The GIL does far more than limit cores
• It can make performance much worse
• Better performance by turning off cores?
• 5000x performance hit on Linux?
• Why?
Why You Might Care
• Must you abandon Python/Ruby for concurrency?
• Having threads restricted to one CPU core might be okay if it were sane
• Analogy: A multitasking operating system (e.g., Linux) runs fine on a single CPU
• Plus, threads get used a lot behind the scenes (even in thread alternatives, e.g., async)
Why I Care
• It's an interesting little systems problem
• How do you make a better GIL?
• It's fun.
Some Background

• I have been discussing some of these issues in the Python community since 2009

    http://www.dabeaz.com/GIL

• I'm less familiar with Ruby, but I've looked at its GIL implementation and experimented
• Very interested in commonalities/differences
A Tale of Two GILs
Thread Implementation

• Python:
  • System threads (e.g., pthreads)
  • Managed by OS
  • Concurrent execution of the Python interpreter (written in C)

• Ruby 1.9:
  • System threads (e.g., pthreads)
  • Managed by OS
  • Concurrent execution of the Ruby VM (written in C)
Alas, the GIL
• Parallel execution is forbidden
• There is a "global interpreter lock"
• The GIL ensures that only one thread runs in the interpreter at once
• Simplifies many low-level details (memory management, callouts to C extensions, etc.)
GIL Implementation

• Python 2 : a locked flag protected by a mutex and a condition variable

    int gil_locked = 0;
    mutex_t gil_mutex;
    cond_t gil_cond;

    void gil_acquire() {
        mutex_lock(gil_mutex);
        while (gil_locked)
            cond_wait(gil_cond, gil_mutex);
        gil_locked = 1;
        mutex_unlock(gil_mutex);
    }

    void gil_release() {
        mutex_lock(gil_mutex);
        gil_locked = 0;
        cond_notify(gil_cond);
        mutex_unlock(gil_mutex);
    }

• Ruby 1.9 : a simple mutex lock

    mutex_t gil;

    void gil_acquire() {
        mutex_lock(gil);
    }

    void gil_release() {
        mutex_unlock(gil);
    }
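The two C sketches can be mirrored in Python with `threading` primitives. This is an illustrative model of the two locking styles (class names are mine), not the interpreters' actual code:

```python
import threading

class MutexGIL:
    """Ruby-1.9-style GIL sketch: just a plain mutex."""
    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self):
        self._lock.acquire()

    def release(self):
        self._lock.release()

class CondGIL:
    """Python-2-style GIL sketch: a 'locked' flag guarded by a
    mutex plus condition variable, as in gil_acquire/gil_release."""
    def __init__(self):
        self._cond = threading.Condition()
        self._locked = False

    def acquire(self):
        with self._cond:
            while self._locked:      # wait until the flag clears
                self._cond.wait()
            self._locked = True

    def release(self):
        with self._cond:
            self._locked = False
            self._cond.notify()      # wake one waiter

# Quick sanity check: two threads take turns under the CondGIL.
gil = CondGIL()
counter = 0

def work():
    global counter
    for _ in range(10_000):
        gil.acquire()
        counter += 1
        gil.release()

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that neither style says anything about *which* waiter gets the lock next; that choice is left to the OS, which is exactly where the trouble starts.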
Thread Execution Model

• The GIL results in cooperative multitasking
• When a thread is running, it holds the GIL
• GIL released on blocking (e.g., I/O operations)

    (Diagram: Threads 1-3 take turns; each runs while holding the GIL,
     releases it when it blocks, and the next thread acquires it.)
Threads for I/O
• For I/O it works great
• GIL is never held very long
• Most threads just sit around sleeping
• Life is good
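This I/O-friendly behavior is easy to verify: blocking calls such as time.sleep release the GIL, so sleeping threads overlap instead of running back-to-back. A small sketch:

```python
import threading
import time

def sleeper():
    time.sleep(0.2)   # blocking call: the GIL is released while waiting

start = time.time()
threads = [threading.Thread(target=sleeper) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# Four 0.2s sleeps finish in roughly 0.2s total, not 0.8s,
# because each thread gives up the GIL while blocked.
print(f"elapsed: {elapsed:.2f}s")
```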
Threads for Computation
• You may actually want to compute something!
• Fibonacci numbers
• Image/audio processing
• Parsing
• The CPU will be busy
• And it won't give up the GIL on its own
CPU-Bound Switching

• Python: releases and reacquires the GIL every 100 "ticks" (1 tick ~= 1 interpreter instruction)
• Ruby: a background thread generates a timer interrupt every 10ms; the GIL is released and reacquired by the current thread on each interrupt
Python Thread Switching

• Every 100 VM instructions, the GIL is dropped, allowing other threads to run if they want
• Not time based--the switching interval depends on the kind of instructions executed

    (Diagram: a CPU-bound thread runs 100 ticks, releases the GIL,
     reacquires it, runs another 100 ticks, and so on.)
Ruby Thread Switching

• Loosely mimics the time-slice of the OS
• Every 10ms, the timer thread fires and the GIL is released/acquired

    (Diagram: a timer thread ticks every 10ms; on each tick the
     CPU-bound thread releases and reacquires the GIL.)
A Common Theme

• Both Python and Ruby have C code like this:

    void execute() {
        while (inst = next_instruction()) {
            // Run the VM instruction
            ...
            if (must_release_gil) {
                GIL_release();
                /* Other threads may run now */
                GIL_acquire();
            }
        }
    }

• Exact details vary, but the concept is the same
• Each thread has a periodic release/acquire in the VM to allow other threads to run
Question

    if (must_release_gil) {
        GIL_release();
        /* Other threads may run now */
        GIL_acquire();
    }

• What can go wrong with this bit of code?
• Short answer: Everything!
Pathology
Thread Switching

• Suppose you have two threads
• Thread 1 : Running (holds the GIL)
• Thread 2 : READY (waiting for the GIL)
Thread Switching

• Easy case : Thread 1 performs I/O (read/write)
• Thread 1 : Releases the GIL and blocks in the I/O operation
• Thread 2 : Gets scheduled by pthreads/the OS, acquires the GIL, starts running
Thread Switching

• Tricky case : Thread 1 runs until preempted, then releases the GIL
• Which thread runs next?
Thread Switching

• You might expect that Thread 2 will run: Thread 1 releases the GIL, the OS schedules Thread 2, Thread 2 acquires the GIL, and Thread 1 goes READY
• But that assumes the GIL plays nice...
Thread Switching

• What might actually happen on multicore: Thread 1 releases the GIL, the OS schedules Thread 2 on another core, but Thread 1 keeps running and reacquires the GIL first--Thread 2's acquire fails (GIL locked) and it goes back to READY
• Both threads attempt to run simultaneously
• ... but only one will succeed (depends on timing)
Fallacy

    if (must_release_gil) {
        GIL_release();
        /* Other threads may run now */
        GIL_acquire();
    }

• This code doesn't actually switch threads
• It might switch threads, but it depends on
  • The operating system
  • The number of cores
  • The lock scheduling policy (if any)
Fallacy

    if (must_release_gil) {
        GIL_release();
        sleep(0);
        /* Other threads may run now */
        GIL_acquire();
    }

• Adding a sleep doesn't force switching either
• It might switch threads, but it depends on
  • The operating system
  • The number of cores
  • The lock scheduling policy (if any)
Fallacy

    if (must_release_gil) {
        GIL_release();
        sched_yield();
        /* Other threads may run now */
        GIL_acquire();
    }

• Neither does explicitly calling the scheduler
• It might switch threads, but it depends on
  • The operating system
  • The number of cores
  • The lock scheduling policy (if any)
A Conflict

• There are conflicting goals
• Python/Ruby : wants to run threads on a single CPU, but doesn't want to do thread scheduling itself (i.e., let the OS do it)
• OS : "Oooh. Multiple cores." Schedules as many runnable tasks as possible at any instant
• Result: Threads fight with each other over the GIL
Multicore GIL Battle

• Python 2.7 on OS-X (4 cores)

    Sequential           : 6.12s
    Threaded (2 threads) : 9.28s (1.5x slower!)

• Each thread runs its 100 ticks, releases the GIL, and immediately reacquires it; the other thread gets scheduled, its acquire fails (GIL already locked), and it goes back to READY--eventually it wins, and the battle repeats in the other direction
• Millions of failed GIL acquisitions
Multicore GIL Battle

• You can see it! (2 CPU-bound threads)

    (Screenshot: system monitor showing combined CPU use above 100%--why?)

• Comment: In Python, it's very rapid
• GIL is released every few microseconds!
I/O Handling

• If there is a CPU-bound thread, I/O-bound threads have a hard time getting the GIL

    Thread 1 (CPU 1) : run ... preempt ... run ... preempt ... run ...
    Thread 2 (CPU 2) : network packet arrives
                       acquire GIL (fails)
                       acquire GIL (fails)
                       ...
                       acquire GIL (success)

• Might repeat 100s-1000s of times
Messaging Pathology

• Messaging on Linux (8 cores)

    Ruby 1.9 (no threads)   : 1.18s
    Ruby 1.9 (1 CPU thread) : 5839.4s

• Locks in Linux have no fairness
• Consequence: Really hard to steal the GIL
• And Ruby only retries every 10ms
Let's Talk Fairness

• Fair locking means that locks have some notion of priorities, arrival order, queuing, etc.

    t0 running | waiting: t1 t2 t3 t4 t5
        -- release -->
    t1 running | waiting: t2 t3 t4 t5 t0

• Releasing means you go to the end of the line
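A "fair" lock of this kind can be sketched as a FIFO handoff lock (the class name and structure are mine; real fair locks live in the OS/pthreads layer):

```python
import threading
from collections import deque

class FairLock:
    """FIFO ('fair') lock sketch: waiters are served strictly in
    arrival order, so a thread that releases and immediately
    re-acquires goes to the back of the line."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._waiters = deque()
        self._held = False

    def acquire(self):
        with self._mutex:
            if not self._held and not self._waiters:
                self._held = True        # uncontended fast path
                return
            turn = threading.Event()
            self._waiters.append(turn)
        turn.wait()                      # block until handed the lock

    def release(self):
        with self._mutex:
            if self._waiters:
                # Hand ownership directly to the oldest waiter;
                # _held stays True because the lock never goes free.
                self._waiters.popleft().set()
            else:
                self._held = False

# Demo: four threads increment a counter under the fair lock.
lock = FairLock()
count = 0

def bump():
    global count
    for _ in range(2_000):
        lock.acquire()
        count += 1
        lock.release()

ts = [threading.Thread(target=bump) for _ in range(4)]
for t in ts:
    t.start()
for t in ts:
    t.join()
```

The direct handoff is what makes it fair, and also what makes it slow: every contended release forces a context switch to the waiter at the head of the queue.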
Effect of Fair-Locking

• Ruby 1.9 (multiple cores)

    Messages + 1 CPU Thread (OS-X)  : 42.0s (Fair)
    Messages + 1 CPU Thread (Linux) : 5839.4s

• Benefit : I/O threads get their turn (yay!)

• Python 2.7 (multiple cores)

    2 CPU-Bound Threads (OS-X)    : 9.28s
    2 CPU-Bound Threads (Windows) : 63.0s (Fair)

• Problem: Too much context switching
Fair-Locking - Bah!

• In reality, you don't want fairness
• Messaging revisited (OS-X, 4 cores)

    Ruby 1.9 (No threads)         : 1.29s
    Ruby 1.9 (1 CPU-bound thread) : 42.0s (33x slower)

• Why is it still 33x slower?
• Answer: Fair locking! (and convoying)
Messaging Revisited

• Go back to the messaging server

    def server():
        while True:
            msg = recv()
            send(msg)
Messaging Revisited

• The actual implementation (size-prefixed messages)

    def server():
        while True:
            size = recv(4)
            msg = recv(size)
            send(size)
            send(msg)
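The server above can be made concrete with real sockets. A minimal sketch (helper names like recv_exact are mine, and the framing uses a 4-byte big-endian length, one plausible choice):

```python
import socket
import struct
import threading

def recv_exact(sock, n):
    """Read exactly n bytes (a single recv may return fewer)."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed connection")
        data += chunk
    return data

def serve_one(listener):
    """Accept one client and echo size-prefixed messages back to it."""
    conn, _ = listener.accept()
    with conn:
        try:
            while True:
                size = struct.unpack("!I", recv_exact(conn, 4))[0]
                msg = recv_exact(conn, size)
                conn.sendall(struct.pack("!I", size))   # send(size)
                conn.sendall(msg)                       # send(msg)
        except ConnectionError:
            pass

listener = socket.socket()
listener.bind(("127.0.0.1", 0))      # any free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=serve_one, args=(listener,), daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
payload = b"x" * 8192                # one 8K message, as in the talk
client.sendall(struct.pack("!I", len(payload)))
client.sendall(payload)
echoed_size = struct.unpack("!I", recv_exact(client, 4))[0]
reply = recv_exact(client, echoed_size)
client.close()
```

Each recv and send here is a point where the interpreter may block, and therefore a point where the GIL is released.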
Performance Explained

• What actually happens under the covers

    def server():
        while True:
            size = recv(4)     # GIL release
            msg = recv(size)   # GIL release
            send(size)         # GIL release
            send(msg)          # GIL release

• Why? Each operation might block
• Catch: Each release passes control back to the CPU-bound thread
Performance Illustrated

• In Ruby, each recv/send hands the GIL back to the CPU-bound thread, which then runs a full 10ms timer slice before the I/O thread can continue:

    I/O thread : data arrives -> recv | recv | send | send | done
    CPU thread :      run 10ms | run 10ms | run 10ms | run 10ms

• Each message has a 40ms response cycle
• 1000 messages x 40ms = 40s (42.0s measured)
Despair
A Solution?

    Don't use threads!

• Yes, yes, everyone hates threads
• However, that's only because they're useful!
• Threads are used for all sorts of things
• Even if they're hidden behind the scenes
A Better Solution

    Make the GIL better

• The GIL is probably not going away (removing it is very difficult)
• However, does it have to thrash wildly?
• Question: Can you do anything?
GIL Efforts in Python 3
• Python 3.2 has a new GIL implementation
• It's imperfect--in fact, it has a lot of problems
• However, people are experimenting with it
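One piece of the new implementation is visible from Python itself: the switch interval (the time-based replacement for the old 100-tick check interval) is tunable via sys.setswitchinterval, available since 3.2. A quick sketch:

```python
import sys

# The Python 3 GIL switch interval is time-based (5 ms by default).
default = sys.getswitchinterval()
print(f"default switch interval: {default * 1000:.1f} ms")

# Shortening it trades throughput for responsiveness; lengthening it
# does the opposite. Restore the default afterwards.
sys.setswitchinterval(0.001)
shortened = sys.getswitchinterval()
sys.setswitchinterval(default)
```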
Python 3 GIL

• GIL acquisition is now based on timeouts
• A thread waiting for the GIL does a timed wait on a condition variable: wait(gil, TIMEOUT)
• If the holder doesn't release within the timeout (5ms), the waiter sets a drop_request flag, and the running thread releases the GIL at its next check
• An I/O thread that was blocked (IOWAIT) reacquires the same way when its data arrives
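The timeout mechanism can be modeled in Python; a sketch of the idea (not CPython's actual C implementation), with the 5ms interval from the slide:

```python
import threading

class TimeoutGIL:
    """Sketch of the Python 3.2 'new GIL' idea: a waiting thread does
    a timed wait; when the timeout expires it sets drop_request, and a
    cooperative holder is expected to release at its next check."""
    SWITCH_INTERVAL = 0.005              # 5 ms, as in CPython 3.2

    def __init__(self):
        self._cond = threading.Condition()
        self._locked = False
        self.drop_request = False        # the holder polls this flag

    def acquire(self):
        with self._cond:
            while self._locked:
                if not self._cond.wait(timeout=self.SWITCH_INTERVAL):
                    self.drop_request = True   # timed out: ask holder to yield
            self._locked = True
            self.drop_request = False

    def release(self):
        with self._cond:
            self._locked = False
            self._cond.notify()

# Two threads sharing the sketched GIL.
gil = TimeoutGIL()
total = 0

def work():
    global total
    for _ in range(10_000):
        gil.acquire()
        total += 1
        gil.release()

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The key consequence follows directly from the timed wait: a waiter never gets the GIL sooner than the holder's next release or timeout expiry, which is exactly the convoying delay described next.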
Problem: Convoying

• CPU-bound threads significantly degrade I/O
• Each time data arrives for the I/O thread, the running CPU-bound thread keeps the GIL until the 5ms timeout expires, so every I/O operation eats a 5ms delay
• This is the same problem as in Ruby
• Just a shorter time delay (5ms instead of 10ms)
Problem: Convoying

• You can directly observe the delays (messaging)

    Python/Ruby (No threads) : 1.29s (no delays)
    Python 3.2 (1 Thread)    : 20.1s (5ms delays)
    Ruby 1.9 (1 Thread)      : 42.0s (10ms delays)

• Still not great, but the problem is understood
Promise
Priorities

• Best promise : priority scheduling
• Earlier versions of Ruby had it
• It works (OS-X, 4 cores)

    Ruby 1.9 (1 Thread)                   : 42.0s
    Ruby 1.8.7 (1 Thread)                 : 40.2s
    Ruby 1.8.7 (1 Thread, lower priority) : 10.0s

• Comment: Ruby 1.9 allows thread priorities to be set via pthreads, but it doesn't seem to have much (if any) effect
Priorities

• Experimental Python 3.2 with a priority scheduler
• Also features immediate preemption
• Messages (OS-X, 4 cores)

    Python 3.2 (No threads)            : 1.29s
    Python 3.2 (1 Thread)              : 20.2s
    Python 3.2 + priorities (1 Thread) : 1.21s (faster?)

• That's a lot more promising!
New Problems
• Priorities bring new challenges
• Starvation
• Priority inversion
• Implementation complexity
• Do you have to write a full OS scheduler?
• Hopefully not, but it's an open question
Final Words
• Implementing a GIL is a lot trickier than it looks
• Even work with priorities has problems
• Good example of how multicore is diabolical
Thanks for Listening!
• I hope you learned at least one new thing
• I'm always interested in feedback
• Follow me on Twitter (@dabeaz)