database systems do not scale to 1000 cpu cores and...
TRANSCRIPT
Database Systems Do Not Scale to 1000 CPU Cores And Other Tales of the Macabre
@andy_pavlo
2
Three million children die per year due to poor nutrition.
Source: http://www.wfp.org/hunger/stats
3
Three days after you die, stomach enzymes start to digest you.
Source: http://discovermagazine.com/2006/sep/10-20thingsdeath
4
Everyone in this room will be dead in 65 years.
Source: http://discovermagazine.com/2006/sep/10-20thingsdeath
5
Database systems cannot scale to 1000 CPU cores.
Source: http://www.vldb.org/pvldb/vol8/p209-yu.pdf
YCSB // 6
DBx1000 on Graphite Simulator Write-Intensive Workload High Contention
YCSB // 6
DBx1000 on Graphite Simulator Write-Intensive Workload High Contention
Why This Matters
• The era of single-core CPU speed-up is over. • Database applications are getting more
complex and larger. • Existing DBMSs are unable to take advantage of
future “many-core” CPU architectures.
7
Today’s Talk
• Transaction Processing • Experimental Platform • Evaluation & Discussion
8
Today’s Talk
• Transaction Processing • Experimental Platform • Evaluation & Discussion
8
9
Transaction Processing
On-line Transaction Processing
• Fast operations that ingest new data and then update state using ACID transactions.
• Transaction Example: – Send $50 from user A to user B
10
Concurrency Control
• Allows transactions to access a database in a multi-programmed fashion while preserving the illusion that each of them is executing alone on a dedicated system.
• Provides Atomicity + Isolation in ACID
11
Concurrency Control
• Two-Phase Locking (Pessimistic) • Timestamp Ordering (Optimistic)
12
Two-Phase Locking (2PL) 13
Transaction #1 BE
GIN
COMM
IT
LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B)
Shrinking Phase
LOCK(A) LOCK(B)
Growing Phase
Transaction #2
BEGI
N
COMM
IT
LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)
Two-Phase Locking (2PL) 13
Transaction #1 BE
GIN
COMM
IT
LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)
Transaction #2
BEGI
N
COMM
IT
LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)
Two-Phase Locking (2PL) 13
Transaction #1 BE
GIN
COMM
IT
LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)
Transaction #2
BEGI
N
COMM
IT
LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)
Two-Phase Locking (2PL) 13
Transaction #1 BE
GIN
COMM
IT
LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)
Transaction #2
BEGI
N
COMM
IT
LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)
Two-Phase Locking (2PL) 13
Transaction #1 BE
GIN
COMM
IT
LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)
Transaction #2
BEGI
N
COMM
IT
LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)
Two-Phase Locking (2PL) 13
Transaction #1 BE
GIN
COMM
IT
LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)
Transaction #2
BEGI
N
COMM
IT
LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)
Two-Phase Locking (2PL) 13
Transaction #1 BE
GIN
COMM
IT
LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)
Two-Phase Locking (2PL)
• Deadlock Detection (DEADLOCK) • Non-waiting Deadlock Prevention (NO_WAIT) • Wait-and-Die Deadlock Prevention (WAIT_DIE)
14
Record Read Timestamp
Write Timestamp
A
B 10000
Timestamp Ordering (T/O) 15
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • •
10000
10000
• • •
10000
Record Read Timestamp
Write Timestamp
A
B 10000
Timestamp Ordering (T/O) 15
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • •
10000
10000
• • •
10000
Record Read Timestamp
Write Timestamp
A
B 10000
Timestamp Ordering (T/O) 15
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • •
10000
10000
• • •
10000
10001
Record Read Timestamp
Write Timestamp
A
B 10000
Timestamp Ordering (T/O) 15
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • •
10001
10000
• • •
10000
10001
Record Read Timestamp
Write Timestamp
A
B 10000
Timestamp Ordering (T/O) 15
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • •
10001
10001
• • •
10000
10001
Record Read Timestamp
Write Timestamp
A
B 10000
Timestamp Ordering (T/O) 15
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • •
10001
10001
• • •
10000
10001
Record Read Timestamp
Write Timestamp
A
B 10000
Timestamp Ordering (T/O) 15
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • •
10001
10001
• • •
10005
10001
Record Read Timestamp
Write Timestamp
A
B 10000
Timestamp Ordering (T/O) 15
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • •
10001
10001
• • •
10005
10001
Timestamp Ordering (T/O)
• Basic T/O (TIMESTAMP) • Multi-Version Concurrency Control (MVCC) • Optimistic Concurrency Control (OCC)
16
Multi-Version CC 17
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • • • • •
Record Write Timestamp
A1 10000
B1 10000
Multi-Version CC 17
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • • • • •
10001
Record Write Timestamp
A1 10000
B1 10000
Multi-Version CC 17
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • • • • •
10001
Record Write Timestamp
A1 10000
B1 10000
Txn Read Set
Write Set
10001 { A1 }
Multi-Version CC 17
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • • • • •
10001
Record Write Timestamp
A1 10000
B1 10000
10001 B2
Txn Read Set
Write Set
10001 { A1 } { B2 }
Multi-Version CC 17
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • • • • •
10001
Record Write Timestamp
A1 10000
B1 10000
10001 B2
Txn Read Set
Write Set
10001 { A1 } { B2 }
Multi-Version CC 17
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • • • • •
10001
Record Write Timestamp
A1 10000
B1 10000
10001 B2
10003 A2
Txn Read Set
Write Set
10001
10003
{ A1 } { B2 }
{ A2 } { B2 }
Multi-Version CC 17
Transaction #1 BE
GIN
COMM
IT
READ(A) WRITE(B) WRITE(A)
• • • • • • •
10001
Record Write Timestamp
A1 10000
B1 10000
10001 B2
10003 A2
Txn Read Set
Write Set
10001
10003
{ A1 } { B2 }
{ A2 } { B2 }
Concurrency Control Schemes 18
DL_DETECT NO_WAIT WAIT_DIE
2PL w/ Deadlock Detection 2PL w/ Non-waiting Prevention 2PL w/ Wait-and-Die Prevention
TIMESTAMP MVCC OCC
Basic T/O Algorithm Multi-Version T/O Optimistic Concurrency Control
Concurrency Control Schemes 18
DL_DETECT NO_WAIT WAIT_DIE
2PL w/ Deadlock Detection 2PL w/ Non-waiting Prevention 2PL w/ Wait-and-Die Prevention
TIMESTAMP MVCC OCC
Basic T/O Algorithm Multi-Version T/O Optimistic Concurrency Control
Concurrency Control Schemes 18
DL_DETECT NO_WAIT WAIT_DIE
2PL w/ Deadlock Detection 2PL w/ Non-waiting Prevention 2PL w/ Wait-and-Die Prevention
TIMESTAMP MVCC OCC
Basic T/O Algorithm Multi-Version T/O Optimistic Concurrency Control
19
Evaluation Testbed
20
No DBMS supports multiple CC schemes.
No CPU supports 1000 cores.
Experimental Platform 21
DBx1000
Worker Threads
Experimental Platform 21
DBx1000 Graphite Simulator
Core
L2
L1
Worker Threads
Experimental Platform 21
DBx1000 Graphite Simulator
Compute Cluster
Core
L2
L1
Worker Threads
Target Workload
• Yahoo! Cloud Serving Benchmark (YCSB) – 20 million tuples – Each tuple is 1KB (total database is ~20GB)
• Each transactions reads/modifies 16 tuples. • Varying skew in transaction access patterns. • Serializable isolation level.
22
23
Evaluation
YCSB // 24
DBx1000 on Graphite Simulator Read-Only Workload No Contention
YCSB // 25
DBx1000 on Graphite Simulator Write-Intensive Workload Medium Contention
YCSB // 26
DBx1000 on Graphite Simulator Write-Intensive Workload High Contention
YCSB //
Time % Breakdown (512 Cores)
26 DBx1000 on Graphite Simulator Write-Intensive Workload High Contention
Bottlenecks
• Lock Thrashing – DL_DETECT, WAIT_DIE
• Timestamp Allocation – All T/O algorithms + WAIT_DIE
• Memory Allocations – OCC + MVCC
27
Bottlenecks
• Lock Thrashing – DL_DETECT, WAIT_DIE
• Timestamp Allocation – All T/O algorithms + WAIT_DIE
• Memory Allocations – OCC + MVCC
27
Locking Thrashing
• Each transaction waits longer to acquire locks, causing other transactions to wait a longer to acquire locks.
• The perfect workload is where transactions acquire locks in primary key order.
28
YCSB // 29
DBx1000 with 2PL DL_DETECT Write-Intensive Workload No Deadlocks (Ordered Lock Acquisition)
YCSB // 29
DBx1000 with 2PL DL_DETECT Write-Intensive Workload No Deadlocks (Ordered Lock Acquisition)
YCSB // 29
DBx1000 with 2PL DL_DETECT Write-Intensive Workload No Deadlocks (Ordered Lock Acquisition)
30
30
END @andy_pavlo
32
Maybe it’s time for me to get my life back together…
Hardware/Software Co-Design
• Bottlenecks can only be overcome through new hardware-level optimizations: – Hardware-accelerated Lock Sharing – Asynchronous Memory Copying – Decentralized Memory Controller.
33
Next Steps
• Evaluating other main bottlenecks in DBMSs: – Logging + Recovery – Indexes
• Extend DBx1000 to support distributed concurrency control algorithms.
34
35
Andy Pavlo
Mike Stonebraker
Srini Devadas
Xiangyao Yu
http://cmudb.io/1000cores
END @andy_pavlo