philip a. bernstein joint work with colin reid, sudipto ...philip a. bernstein joint work with colin...

33
Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation Presented at Amazon.com December 5, 2011

Upload: others

Post on 15-Apr-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation Presented at Amazon.com December 5, 2011

Page 2: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

It’s a research project.

We have two implementations

A software stack for transactional record management Stores [key, value] pairs, which are accessed within transactions

It’s a standard interface that underlies all database systems

Functionality

Records: Stored [key, value] pairs

Record operations: Insert, Delete, Update, Get record where field = X; Get next

Transactions: Start, Commit, Abort

2

Page 3: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Enables scaling-out large-scale web services without partitioning data or application

Supports real-time data analytics Uses multi-version data for high-speed transaction processing and queries on the same server

All isolation levels, including concurrency control over key-range operations.

Exploits technology trends flash memory, high-speed networks, multi-core

3

Page 4: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

The log is the database. All servers can access it.

Each transaction executes against its partial, cached, DB copy

Then it appends its after-images to the log.

Each server rolls forward the log on its partial, cached DB copy

Roll forward (a.k.a. meld) does optimistic concurrency control

N.B.: Log-append is the only server-to-server synchronization

Log

4

DBMS cache

App Server

Meld DBMS cache

App Server

Meld DBMS cache

App Server

Meld

Page 5: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Log

DBMS

App Core

Cache

DBMS

App Core

Core

Roll Forward

The log is the database.

All cores can access it.

Each transaction appends its after-images to the log.

One core runs meld to do OCC and roll forward the log

Page 6: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Motivation

System architecture

Performance

Related Work

Conclusion

6

Page 7: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

7

G

B

A

H

C I

D

A D C B I H G

Binary Search Tree

Tree is marshaled into the log

7

Page 8: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

D

Update D’s value

8

G

B

A

H

C I

D

Copy on write

To update a node, replace nodes up to the root

C

G

B

8

Page 9: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Each server has a cache of the last committed DB state

9

A D C B I H G

Transaction execution 1. Get pointer to snapshot 2. Generate updates locally 3. Append intention log record

D C B G Snapshot

G

B

C

D

DB cache

G

B

C

D

H

I A

last committed

Page 10: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Each server processes intention records in sequence

To process transaction T’s intention record. Check whether T experienced a conflict

If not, T committed, so the server merges the intention into its last committed state

All servers make the same commit/abort decisions

10

A D C B I H G D C B G

T’s conflict zone

transaction T

Did a committed transaction write into T’s readset or writeset here?

Snapshot

Page 11: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

1. Run transaction

2. Broadcast intention

3. Append intention to log

4. Send log location

5. De-serialize intention

6. Meld

11 11

Page 12: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

1. Broadcasting the intention

2. Appending intention to the log

3. Optimistic concurrency control (OCC)

4. Meld

Technology will improve 1 & 2

For 3, app behavior drives OCC performance

But 4 depends on single-threaded processor performance, which isn’t improving

Hence, it’s important to optimize Meld

12

Page 13: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Last-committed state before T is melded

Compare transaction T’s after-image to the last committed state

which is annotated with version and dependency metadata

Traverse T’s intention, comparing versions to last-committed state

Stop traversing when you reach an unchanged subtree

If version(x)=version(x) then simply replace x by x

Log x

x[read: x] x

state when T executed Transaction T’s intention

x

13

Page 14: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

14

• T1 creates keys B,C,D,E

D

B E

C

T1

T2 D

B E

C A

D

B E

C F

T3

D

B E

C F A

• T2 and T3 do not conflict, so the resulting melded state is A, B, C, D, E, F

• Then T2 and T3 execute concurrently, both based on the result of T1

• T2 inserts A

• T3 inserts F

14

Page 15: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

15

Root

D

B E

C

Node Metadata • version of the subtree • dependency info

metadata

… …

Every node n has a unique version number, VN(n), which identifies the exact content of n’s subtree

Every node n in an intention T stores metadata about T’s snapshot

Version of n in T’s snapshot

Dependency information

metadata compresses to ~30 bytes

15

Page 16: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

16

T1 D

B E

C VN=51

VN=53

VN=54

VN=52

C B E D

VN Offset +1 +2 +3 +4

Absolute VN: 50 51 52 53 54

T1 T0

We need to avoid synchronization when assigning VNs

VN(n) = intention base location + offset of n in its intention

The base location is assigned when the intention is logged

Given: T0’s root subtree has VN 50

VN of each node n in T1= 50 + n’s offset

16

Page 17: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Subtree metadata includes a source structure version (SSV).

Intutively, SSV(n) = version of n in transaction T’s snapshot

DependsOn(n) = Yes if T depends on n not having changed while T executed

T1’s root subtree depends on the entire tree version 50.

Since SSV(D) = VN(), T1 becomes the last-committed state.

17

Absolute VN 50 51 52 53 54

C B E D

VN Offset: +1 +2 +3 +4 SSV: 0 0 0 50 DependsOn: N N N Y

T0 T1

17

Page 18: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

C B E D

VN Offset: +1 +2 +3 +4 SSV: 0 0 0 50 DependsOn: N N N Y

Absolute VN 51 52 53 54 55 56 57

A B D

+1 +2 +3 0 52 54 N N N

T0 T1 T2

A serial intention is one whose source version is the last committed state.

Meld is then trivial and needs to consider only the root node.

T1 was serial.

T2 is serial, so meld makes T2 the last committed state.

Thus, a meld of a serial intention executes in constant time.

18

Page 19: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

19

D

B E

C

T1

T2 D

B E

C A

D

B E

C F

T3

D

B E

C F A

19

Page 20: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

20 Absolute VN 51 52 53 54 55 56 57 58 59 60

T0 T1

C B E D A B D

T2

VN Offset: +1 +2 +3 +4 SSV: 0 0 0 50 DependsOn: N N N Y

+1 +2 +3 0 53 54 N N N

F E D

T3

T3 is not serial because VN of D in T2 (= 57) SSV(D) in T3 (= 54).

Meld checks if T3 conflicts with a transaction in its conflict zone

Traverses T3, comparing T3’s nodes to the last-committed state

When a concurrent transaction (e.g. T3) experiences no conflicts, meld creates an ephemeral intention to merge its state

+1 +2 +3 0 52 54 N N N

20

Page 21: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

21

Absolute VN 51 52 53 54 55 56 57 58 59 60 61

+1 +2 +3 0 52 54 N N N

T0 T1

C B E D A B D

T2

VN Offset: +1 +2 +3 +4 SSV: 0 0 0 50 DependsOn: N N N Y

+1 +2 +3 0 53 54 N N N

F E D

T3 M3

+1 57 N

D

A committed concurrent intention produces an ephemeral intention

It’s created deterministically in memory on all servers.

It logically commits immediately after the intention it melds.

21

Ephemeral intention

Page 22: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

22

Distinguishing payload updates from subtree updates

Phantom detection

Asymmetric meld operations

Deletions, using tombstones in the intention header

Garbage collection

Checkpointing and recovery

See [Bernstein et al., VLDB 2011]

22

Page 23: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Focus here is on meld throughput only

For latency, see our VLDB 2011 paper

We count committed and aborted transactions

Experiment setup

128K keys, all in main memory. Keys and payloads are 8 bytes.

Serializable isolation, so intentions contain readsets

De-serialize intentions on separate threads before meld

Meld throughput depends on transaction size and conflict zone size (“concurrency degree”)

As transaction size or concurrency degree increase more concurrent transactions update keys with common ancestors meld has to traverse deeper in the tree

23

Page 24: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

r:w ratio is 1:1 con-di = concurrency degree i

24

Page 25: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

25

Page 26: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Brute force = traverse the whole tree

26

Page 27: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Hardly any effect, indicating most traversals short-circuit high in the tree.

27

Page 28: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Hyder resembles a primary-copy replicated DB

Primary copy broadcasts only committed updates

Central transaction server is a bottleneck

In Hyder, only the log is centralized

Hyder is a “data-sharing” DB system

Classical approach uses a distributed lock manager

Each server runs an ordinary single-server DBMS

But, before a server fetches a page, it locks the page

28

Page 29: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

29

Performance issues: remote lock requests; ping-pong pages

Used in Oracle RAC & Exadata and IBM DB2 Data-Sharing

Have not yet compared its performance to Hyder

Server A Server B

P

• Server A gets a write-lock on page P and fetches P Request P

P

r2

• Server B requests a lock on P • Lock manager forward request to A

• When A is able to unlock P, it releases the lock and sends P to B

• Need this synchronization even if B wants a different record than A

Page 30: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

Lots of OCC papers but none that give details of efficient conflict-testing

By contrast, there’s a huge literature on conflict-testing for locking

Oxenstored [Gazagnairem & Hanquezis, ICFP 09] Similar scenario: MV trees and OCC

However, very coarse-grain conflict-testing

Uses none of our optimizations

30

Page 31: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

New algorithm for OCC

Developed many optimizations to truncate the conflict checking early in the tree traversal

Implemented and measure it

Future work: Apply it to other tree structures

Measure it on various storage devices

Compare it with locking and other OCC methods on multiversion trees

Try to apply it to physiological logging

31

Page 32: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

C.W. Reid, P.A. Bernstein: Implementing an Append-Only Interface for Semiconductor Storage. IEEE Data Eng. Bull. 33(4): 14-20 (2010)

P.A. Bernstein, C.W. Reid, S. Das: Hyder - A Transactional Record Manager for Shared Flash. CIDR 2011: 9-20

P.A. Bernstein, C.W. Reid, M. Wu, X. Yuan: Optimistic Concurrency Control by Melding Trees. PVLDB 4(11): 944-955 (2011)

Page 33: Philip A. Bernstein Joint work with Colin Reid, Sudipto ...Philip A. Bernstein Joint work with Colin Reid, Sudipto Das, Ming Wu, Xinhao Yuan Microsoft Corporation ... Used in Oracle

© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it

should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.