TRANSCRIPT
SPANNER: GOOGLE’S GLOBALLY-DISTRIBUTED DATABASE
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford Google, Inc
OUTLINE
Introduction
Implementation
Data model
True Time
Concurrency Control
Conclusion
INTRODUCTION
•First multi-version, globally distributed, synchronously replicated database
•Bigtable-like versioned key-value store
•Megastore-like schematized semi-relational tables
•Interesting features: data replication can be configured at a fine grain by applications
•Externally consistent reads and writes
•Globally consistent reads across the database at a timestamp
•Assigns timestamps to transactions even though the system is distributed
IMPLEMENTATION
•A Spanner deployment is called a universe.
•Spanner is organized as a set of zones.
•Zones are the unit of administrative deployment.
•The set of zones is also the set of locations across which data can be replicated.
•Each zone has a zonemaster and between one hundred and several thousand spanservers.
•The zonemaster assigns data to spanservers; spanservers serve data to clients.
•The universe master displays status information (for debugging and more).
•The placement driver handles automated movement of data across zones by communicating with spanservers.
STATE MACHINE REPLICATION AND PAXOS
State Machine Approach:
1. Place copies of the State Machine on multiple, independent servers.
2. Receive client requests, interpreted as Inputs to the State Machine.
3. Choose an ordering for the Inputs.
4. Execute Inputs in the chosen order on each server.
5. Respond to clients with the Output from the State Machine.
6. Monitor replicas for differences in State or Output.
An order of inputs may be defined using a voting protocol.
The problem of a group of independent entities voting for a single value is called consensus.
Inputs may be ordered by their position in the series of consensus instances (consensus order).
Paxos [6] is a protocol for solving consensus, and may be used as the protocol for implementing consensus order.
Paxos requires a single leader to ensure liveness: one of the replicas must remain leader long enough to achieve consensus on the next operation of the state machine and move the system forward.
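The numbered steps above can be sketched in a few lines. This is a minimal illustration (all names are invented for this sketch, not Spanner code): independent replicas that apply the same inputs in the same consensus-chosen order end up with identical state and outputs.

```python
# Minimal sketch of the state machine approach: every replica applies the
# same inputs in the same (consensus-chosen) order, so their states agree.
# All names here are illustrative, not Spanner's actual implementation.

class Replica:
    def __init__(self):
        self.state = {}   # replicated key-value state
        self.log = []     # inputs in consensus order

    def apply(self, op):
        """Execute one input deterministically and return its output."""
        self.log.append(op)
        kind, key, *rest = op
        if kind == "put":
            self.state[key] = rest[0]
            return "ok"
        if kind == "get":
            return self.state.get(key)

# Given the same ordered inputs, independent replicas reach the same state
# and produce the same outputs (step 6: monitoring would detect divergence).
ops = [("put", "x", 1), ("put", "y", 2), ("get", "x")]
r1, r2 = Replica(), Replica()
outputs1 = [r1.apply(op) for op in ops]
outputs2 = [r2.apply(op) for op in ops]
assert r1.state == r2.state and outputs1 == outputs2
```

The hard part, which the sketch omits, is agreeing on the order of `ops` in the first place; that is exactly what Paxos provides.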
SPAN SERVER SOFTWARE STACK
•A tablet implements a bag of mappings similar to Bigtable's:
•(key:string, timestamp:int64) -> string
•A tablet's state is stored in the form of B-tree-like files in Colossus.
•To support replication, each spanserver implements a Paxos state machine on top of each tablet.
•The Paxos state machines are used to implement a consistently replicated bag of mappings.
•Writes must initiate the Paxos protocol at the leader; reads access state directly from the tablet at any replica that is sufficiently up to date.
•The set of replicas is collectively a Paxos group.
At each replica that is a leader, the spanserver implements a lock table to implement concurrency control.
Long-lived Paxos leaders are critical to implementing concurrency control efficiently.
Transaction managers are used to support distributed transactions.
The replica used to implement the transaction manager is the participant leader; the other replicas are participant slaves.
If a transaction involves more than one Paxos group, the group leaders must coordinate to perform two-phase commit.
One group's participant leader is chosen as the coordinator leader, and the others are coordinator slaves.
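The multi-version mapping at the heart of the stack, (key, timestamp) -> string, can be sketched as a toy in-memory tablet. This is an assumption-laden illustration (class and method names are invented); the real tablet is stored as B-tree-like files in Colossus and replicated via Paxos.

```python
import bisect

# Toy sketch of the tablet abstraction from the slide: a bag of
# (key: string, timestamp: int64) -> string mappings. Illustrative only.

class Tablet:
    def __init__(self):
        self.versions = {}  # key -> sorted list of (timestamp, value)

    def write(self, key, timestamp, value):
        bisect.insort(self.versions.setdefault(key, []), (timestamp, value))

    def read(self, key, timestamp):
        """Return the newest value whose version timestamp is <= timestamp."""
        versions = self.versions.get(key, [])
        # chr(0x10FFFF) is a sentinel larger than any stored value string.
        i = bisect.bisect_right(versions, (timestamp, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None

t = Tablet()
t.write("row1", 10, "a")
t.write("row1", 20, "b")
assert t.read("row1", 15) == "a"  # a snapshot at t=15 sees the older version
assert t.read("row1", 25) == "b"
```

Keeping multiple timestamped versions per key is what later makes lock-free reads at a chosen timestamp possible.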
DIRECTORIES AND PLACEMENT
A directory is a bucketing abstraction implemented on top of the key-value mappings.
It allows applications to control the locality of their data.
A directory is the unit of data placement: all data in a directory has the same replication configuration.
Movedir is the background task used to move directories between Paxos groups.
When data is transferred from one Paxos group to another, the transfer occurs directory by directory.
DATA MODEL
•Supports schematized semi-relational tables
•The data model is layered on top of the directory-bucketed key-value mappings.
•An application creates one or more databases in a universe.
•Each database can contain an unlimited number of schematized tables.
•Spanner's data model is not purely relational, in that rows must have names.
•A Spanner database must be partitioned by clients into one or more hierarchies of tables.
TRUE TIME
Notation:
TTinterval – a time interval with bounded time uncertainty: [earliest, latest]
TTstamp – the type of the endpoints of a TTinterval
TT.now() – returns a TTinterval
TT.after(t) – true if t has definitely passed
TT.before(t) – true if t has definitely not arrived
tabs(e) – the absolute time of an event e
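The API above can be mimicked with ordinary clocks plus an assumed uncertainty bound. A minimal sketch, assuming a fixed EPSILON (real TrueTime derives its bound from GPS and atomic-clock references, and the bound varies over time):

```python
import time

# Hedged sketch of the TrueTime API. EPSILON is a stand-in uncertainty
# bound chosen for illustration; it is NOT how TrueTime computes it.
EPSILON = 0.007  # assumed 7 ms bound

def TT_now():
    """Return a TTinterval [earliest, latest] bracketing absolute time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def TT_after(t):
    """True only if t has definitely passed."""
    earliest, _ = TT_now()
    return earliest > t

def TT_before(t):
    """True only if t has definitely not arrived."""
    _, latest = TT_now()
    return latest < t

iv = TT_now()
assert iv[0] < iv[1]          # the interval has nonzero width
assert TT_after(iv[0] - 1.0)  # a time 1 s in the past has definitely passed
```

The key point is that callers never get a single timestamp, only an interval; everything Spanner builds on TrueTime works by reasoning about interval endpoints.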
WHAT GOOD DOES TRUE TIME DO?
Guarantees:
For an invocation tt = TT.now(), tt.earliest <= tabs(e_now) <= tt.latest, where e_now is the invocation event.
•That is, the absolute time at which TT.now() is invoked is guaranteed to lie within the returned interval.
•Guarantees correctness properties around concurrency control
•Implements features like:
I. Externally consistent transactions
II. Lock-free read only transactions
III. Non-blocking reads in the past
These features guarantee that, for example, a whole-database audit read at a timestamp t will see exactly the effects of every transaction committed as of t.
CONCURRENCY CONTROL
Timestamp Management:
Spanner supports read-write transactions, read-only transactions, and snapshot reads.
•Read-only transactions:
A read-only transaction must be pre-declared as not having any writes.
Reads in a read-only transaction execute at a system-chosen timestamp without locking, so that incoming writes are not blocked.
Reads in a read-only transaction can execute on any replica that is sufficiently up to date.
•Snapshot reads:
A snapshot read is a read in the past that executes without locking.
The client can either specify the timestamp or provide an upper bound on staleness and let Spanner choose one.
Execution takes place on any replica that is sufficiently up to date.
For both read-only transactions and snapshot reads, commit is inevitable once a timestamp has been chosen, unless the data at that timestamp has been garbage-collected.
When a server fails, clients can internally continue the query on a different server by repeating the timestamp and the current read position.
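The failure-handling point above follows from the timestamp fully determining the result. A small sketch (replica classes and the `snapshot_read` helper are invented for illustration): because retries replay the same timestamp, any up-to-date replica gives the same answer.

```python
# Sketch: a snapshot read is determined by its timestamp, so a client can
# replay it on another replica after a failure. All names are hypothetical.

def snapshot_read(replicas, key, timestamp):
    """Try each replica in turn; the fixed timestamp makes retries equivalent."""
    for replica in replicas:
        try:
            return replica.read(key, timestamp)
        except ConnectionError:
            continue  # server failed: repeat the same timestamp elsewhere
    raise RuntimeError("no replica available")

class FailingReplica:
    def read(self, key, timestamp):
        raise ConnectionError("replica down")

class GoodReplica:
    def __init__(self, data):
        self.data = data
    def read(self, key, timestamp):
        return self.data[(key, timestamp)]

replicas = [FailingReplica(), GoodReplica({("row1", 42): "v"})]
assert snapshot_read(replicas, "row1", 42) == "v"
```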
DETAILS
Spanner requires a scope expression for every read-only transaction: an expression that summarizes the keys that will be read by the entire transaction.
If the scope's values are served by a single Paxos group:
•The client issues the read-only transaction to that group's leader.
•The leader assigns sread and executes the read.
•LastTS() denotes the timestamp of the last committed write at a Paxos group.
•The assignment sread = LastTS() trivially satisfies external consistency.
If the scope's values are served by multiple Paxos groups:
•Spanner has its reads execute at sread = TT.now().latest.
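The two cases above reduce to a small decision. A sketch, with the TT.now().latest and LastTS() values passed in as callables (the function and parameter names are invented for this illustration):

```python
# Sketch of the read-only timestamp choice described on this slide.
# scope_groups: Paxos groups serving the scope expression's keys.

def assign_sread(scope_groups, tt_now_latest, last_ts):
    if len(scope_groups) == 1:
        # Single group: the group's last committed write timestamp suffices
        # and avoids blocking behind in-flight transactions.
        return last_ts(scope_groups[0])
    # Multiple groups: skip a round of negotiation among the groups' leaders
    # by simply taking TT.now().latest (which may wait for safe time).
    return tt_now_latest()

assert assign_sread(["g1"], lambda: 100, lambda g: 42) == 42
assert assign_sread(["g1", "g2"], lambda: 100, lambda g: 42) == 100
```

The multi-group choice trades a possibly later (more conservative) timestamp for not having to coordinate across leaders.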
Assigning timestamps to read-write transactions:
Uses two-phase locking; timestamps can be assigned at any time after all locks have been acquired and before any locks are released.
For a transaction, Spanner assigns the same timestamp that Paxos assigns to the Paxos write representing the transaction commit.
This depends on the monotonicity invariant:
Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders.
A single leader replica can trivially assign timestamps in monotonically increasing order.
Across leaders, the invariant is enforced by requiring a leader to assign timestamps only within the interval of its leader lease.
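The monotonicity invariant can be sketched as a leader that hands out strictly increasing timestamps and refuses any timestamp outside its lease. All names and the lease representation are illustrative assumptions, not Spanner's actual bookkeeping:

```python
# Sketch of the monotonicity invariant: timestamps strictly increase and
# stay within the leader lease. Illustrative only.

class Leader:
    def __init__(self, lease_start, lease_end):
        self.lease_start, self.lease_end = lease_start, lease_end
        self.last_ts = lease_start

    def assign_timestamp(self, now_latest):
        """Assign a timestamp > last_ts that lies within the leader lease."""
        ts = max(self.last_ts + 1, now_latest)
        if not (self.lease_start <= ts <= self.lease_end):
            raise RuntimeError("timestamp outside leader lease")
        self.last_ts = ts
        return ts

leader = Leader(lease_start=0, lease_end=1000)
ts = [leader.assign_timestamp(now) for now in (5, 5, 7)]
assert ts == [5, 6, 7]  # strictly increasing even when the clock repeats
```

Because lease intervals of successive leaders are disjoint, per-leader monotonicity within the lease composes into monotonicity across leaders.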
Enforcing External-consistency invariant:
If the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1.
tabs(e1^commit) < tabs(e2^start)  =>  s1 < s2
The protocol for executing transactions and assigning timestamps obeys two rules:
Start: the coordinator leader assigns transaction Ti a commit timestamp si no less than the value of TT.now().latest, computed after receiving the commit request.
Commit Wait: the coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(si) is true.
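The two rules can be sketched together using a simulated TrueTime with an assumed uncertainty bound (EPSILON and all function names are stand-ins for illustration): pick si at or above TT.now().latest, then block until TT.after(si) before acknowledging.

```python
import time

# Sketch of Start + Commit Wait with a simulated TrueTime. EPSILON is an
# assumed uncertainty bound, for illustration only.
EPSILON = 0.007

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)  # (earliest, latest)

def commit(apply_writes):
    s_i = tt_now()[1]            # Start rule: s_i >= TT.now().latest
    apply_writes(s_i)
    while not tt_now()[0] > s_i:  # Commit Wait: until TT.after(s_i)
        time.sleep(EPSILON / 2)
    return s_i                    # only now may clients see the commit

s = commit(lambda ts: None)
assert tt_now()[0] > s  # after commit wait, s_i has definitely passed
```

Commit wait is what makes the external-consistency invariant hold: any transaction that starts after this commit returns will observe TT.now().latest > si and thus be assigned a larger timestamp.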
DETAILS
Reads within read-write transactions use wound-wait to avoid deadlocks.
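The wound-wait rule itself is small enough to state as code. A sketch (names are illustrative): on a lock conflict, an older transaction wounds a younger holder, while a younger transaction waits for an older holder.

```python
# Sketch of wound-wait deadlock avoidance for lock conflicts: transaction
# age is given by timestamp (smaller = older). Names are illustrative.

def resolve(requester_ts, holder_ts):
    """Return the action for a lock conflict under wound-wait."""
    if requester_ts < holder_ts:
        return "wound"  # older requester aborts (wounds) the younger holder
    return "wait"       # younger requester waits for the older holder

# Deadlock is impossible: waits only ever go from younger to older
# transactions, so the waits-for graph cannot contain a cycle.
assert resolve(requester_ts=1, holder_ts=2) == "wound"
assert resolve(requester_ts=3, holder_ts=2) == "wait"
```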
CONCLUSION
Distributed transaction consistency
Externally consistent, synchronous replication across datacenters
Closes the gap between Bigtable and Megastore while preserving their positives
THANK YOU!