database systems do not scale to 1000 cpu cores and...

Post on 28-Apr-2018

220 Views

Category:

Documents

2 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Database Systems Do Not Scale to 1000 CPU Cores And Other Tales of the Macabre

@andy_pavlo

2

Three million children die per year due to poor nutrition.

Source: http://www.wfp.org/hunger/stats

3

Three days after you die, stomach enzymes start to digest you.

Source: http://discovermagazine.com/2006/sep/10-20thingsdeath

4

Everyone in this room will be dead in 65 years.

Source: http://discovermagazine.com/2006/sep/10-20thingsdeath

5

Database systems cannot scale to 1000 CPU cores.

Source: http://www.vldb.org/pvldb/vol8/p209-yu.pdf

YCSB // 6

DBx1000 on Graphite Simulator Write-Intensive Workload High Contention

YCSB // 6

DBx1000 on Graphite Simulator Write-Intensive Workload High Contention

Why This Matters

• The era of single-core CPU speed-up is over. • Database applications are getting more

complex and larger. • Existing DBMSs are unable to take advantage of

future “many-core” CPU architectures.

7

Today’s Talk

• Transaction Processing • Experimental Platform • Evaluation & Discussion

8

Today’s Talk

• Transaction Processing • Experimental Platform • Evaluation & Discussion

8

9

Transaction Processing

On-line Transaction Processing

• Fast operations that ingest new data and then update state using ACID transactions.

• Transaction Example: – Send $50 from user A to user B

10

Concurrency Control

• Allows transactions to access a database in a multi-programmed fashion while preserving the illusion that each of them is executing alone on a dedicated system.

• Provides Atomicity + Isolation in ACID

11

Concurrency Control

• Two-Phase Locking (Pessimistic) • Timestamp Ordering (Optimistic)

12

Two-Phase Locking (2PL) 13

Transaction #1 BE

GIN

COMM

IT

LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B)

Shrinking Phase

LOCK(A) LOCK(B)

Growing Phase

Transaction #2

BEGI

N

COMM

IT

LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)

Two-Phase Locking (2PL) 13

Transaction #1 BE

GIN

COMM

IT

LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)

Transaction #2

BEGI

N

COMM

IT

LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)

Two-Phase Locking (2PL) 13

Transaction #1 BE

GIN

COMM

IT

LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)

Transaction #2

BEGI

N

COMM

IT

LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)

Two-Phase Locking (2PL) 13

Transaction #1 BE

GIN

COMM

IT

LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)

Transaction #2

BEGI

N

COMM

IT

LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)

Two-Phase Locking (2PL) 13

Transaction #1 BE

GIN

COMM

IT

LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)

Transaction #2

BEGI

N

COMM

IT

LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)

Two-Phase Locking (2PL) 13

Transaction #1 BE

GIN

COMM

IT

LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)

Transaction #2

BEGI

N

COMM

IT

LOCK(B) LOCK(A) WRITE(A) UNLOCK(A) UNLOCK(B) WRITE(B)

Two-Phase Locking (2PL) 13

Transaction #1 BE

GIN

COMM

IT

LOCK(A) LOCK(B) UNLOCK(A) UNLOCK(B) READ(A) WRITE(B) LOCK(A) LOCK(B)

Two-Phase Locking (2PL)

• Deadlock Detection (DEADLOCK) • Non-waiting Deadlock Prevention (NO_WAIT) • Wait-and-Die Deadlock Prevention (WAIT_DIE)

14

Record Read Timestamp

Write Timestamp

A

B 10000

Timestamp Ordering (T/O) 15

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • •

10000

10000

• • •

10000

Record Read Timestamp

Write Timestamp

A

B 10000

Timestamp Ordering (T/O) 15

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • •

10000

10000

• • •

10000

Record Read Timestamp

Write Timestamp

A

B 10000

Timestamp Ordering (T/O) 15

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • •

10000

10000

• • •

10000

10001

Record Read Timestamp

Write Timestamp

A

B 10000

Timestamp Ordering (T/O) 15

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • •

10001

10000

• • •

10000

10001

Record Read Timestamp

Write Timestamp

A

B 10000

Timestamp Ordering (T/O) 15

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • •

10001

10001

• • •

10000

10001

Record Read Timestamp

Write Timestamp

A

B 10000

Timestamp Ordering (T/O) 15

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • •

10001

10001

• • •

10000

10001

Record Read Timestamp

Write Timestamp

A

B 10000

Timestamp Ordering (T/O) 15

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • •

10001

10001

• • •

10005

10001

Record Read Timestamp

Write Timestamp

A

B 10000

Timestamp Ordering (T/O) 15

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • •

10001

10001

• • •

10005

10001

Timestamp Ordering (T/O)

• Basic T/O (TIMESTAMP) • Multi-Version Concurrency Control (MVCC) • Optimistic Concurrency Control (OCC)

16

Multi-Version CC 17

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • • • • •

Record Write Timestamp

A1 10000

B1 10000

Multi-Version CC 17

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • • • • •

10001

Record Write Timestamp

A1 10000

B1 10000

Multi-Version CC 17

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • • • • •

10001

Record Write Timestamp

A1 10000

B1 10000

Txn Read Set

Write Set

10001 { A1 }

Multi-Version CC 17

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • • • • •

10001

Record Write Timestamp

A1 10000

B1 10000

10001 B2

Txn Read Set

Write Set

10001 { A1 } { B2 }

Multi-Version CC 17

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • • • • •

10001

Record Write Timestamp

A1 10000

B1 10000

10001 B2

Txn Read Set

Write Set

10001 { A1 } { B2 }

Multi-Version CC 17

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • • • • •

10001

Record Write Timestamp

A1 10000

B1 10000

10001 B2

10003 A2

Txn Read Set

Write Set

10001

10003

{ A1 } { B2 }

{ A2 } { B2 }

Multi-Version CC 17

Transaction #1 BE

GIN

COMM

IT

READ(A) WRITE(B) WRITE(A)

• • • • • • •

10001

Record Write Timestamp

A1 10000

B1 10000

10001 B2

10003 A2

Txn Read Set

Write Set

10001

10003

{ A1 } { B2 }

{ A2 } { B2 }

Concurrency Control Schemes 18

DL_DETECT NO_WAIT WAIT_DIE

2PL w/ Deadlock Detection 2PL w/ Non-waiting Prevention 2PL w/ Wait-and-Die Prevention

TIMESTAMP MVCC OCC

Basic T/O Algorithm Multi-Version T/O Optimistic Concurrency Control

Concurrency Control Schemes 18

DL_DETECT NO_WAIT WAIT_DIE

2PL w/ Deadlock Detection 2PL w/ Non-waiting Prevention 2PL w/ Wait-and-Die Prevention

TIMESTAMP MVCC OCC

Basic T/O Algorithm Multi-Version T/O Optimistic Concurrency Control

Concurrency Control Schemes 18

DL_DETECT NO_WAIT WAIT_DIE

2PL w/ Deadlock Detection 2PL w/ Non-waiting Prevention 2PL w/ Wait-and-Die Prevention

TIMESTAMP MVCC OCC

Basic T/O Algorithm Multi-Version T/O Optimistic Concurrency Control

19

Evaluation Testbed

20

No DBMS supports multiple CC schemes.

No CPU supports 1000 cores.

Experimental Platform 21

DBx1000

Worker Threads

Experimental Platform 21

DBx1000 Graphite Simulator

Core

L2

L1

Worker Threads

Experimental Platform 21

DBx1000 Graphite Simulator

Compute Cluster

Core

L2

L1

Worker Threads

Target Workload

• Yahoo! Cloud Serving Benchmark (YCSB) – 20 million tuples – Each tuple is 1KB (total database is ~20GB)

• Each transactions reads/modifies 16 tuples. • Varying skew in transaction access patterns. • Serializable isolation level.

22

23

Evaluation

YCSB // 24

DBx1000 on Graphite Simulator Read-Only Workload No Contention

YCSB // 25

DBx1000 on Graphite Simulator Write-Intensive Workload Medium Contention

YCSB // 26

DBx1000 on Graphite Simulator Write-Intensive Workload High Contention

YCSB //

Time % Breakdown (512 Cores)

26 DBx1000 on Graphite Simulator Write-Intensive Workload High Contention

Bottlenecks

• Lock Thrashing – DL_DETECT, WAIT_DIE

• Timestamp Allocation – All T/O algorithms + WAIT_DIE

• Memory Allocations – OCC + MVCC

27

Bottlenecks

• Lock Thrashing – DL_DETECT, WAIT_DIE

• Timestamp Allocation – All T/O algorithms + WAIT_DIE

• Memory Allocations – OCC + MVCC

27

Locking Thrashing

• Each transaction waits longer to acquire locks, causing other transactions to wait a longer to acquire locks.

• The perfect workload is where transactions acquire locks in primary key order.

28

YCSB // 29

DBx1000 with 2PL DL_DETECT Write-Intensive Workload No Deadlocks (Ordered Lock Acquisition)

YCSB // 29

DBx1000 with 2PL DL_DETECT Write-Intensive Workload No Deadlocks (Ordered Lock Acquisition)

YCSB // 29

DBx1000 with 2PL DL_DETECT Write-Intensive Workload No Deadlocks (Ordered Lock Acquisition)

30

30

END @andy_pavlo

32

Maybe it’s time for me to get my life back together…

Hardware/Software Co-Design

• Bottlenecks can only be overcome through new hardware-level optimizations: – Hardware-accelerated Lock Sharing – Asynchronous Memory Copying – Decentralized Memory Controller.

33

Next Steps

• Evaluating other main bottlenecks in DBMSs: – Logging + Recovery – Indexes

• Extend DBx1000 to support distributed concurrency control algorithms.

34

35

Andy Pavlo

Mike Stonebraker

Srini Devadas

Xiangyao Yu

http://cmudb.io/1000cores

http://15445.courses.cs.cmu.edu

15-445/645 Database Systems

http://pelotondb.io

END @andy_pavlo

top related