building reliable systems with apache bookkeeper
DESCRIPTION
A presentation at the Barcelona JUG 19/06/2014TRANSCRIPT
Building reliable systems with
Apache BookKeeper
Matthieu MorelIvan Kelly
Challenges in distributed systems
Ryan Lintleman cc-by-nc 2.0 https://flic.kr/p/5XNGow
Expect failures
up to 10% annual failure rates for disks/servers
The network is reliable
NOT
Jonathan Briggs - cc by 2.0 https://flic.kr/p/bnAxuz
SymptomsAlex Proimos cc-by-2.0 https://flic.kr/p/bt29wL
Problem 1: not available
Problem 1: not available
Problem 2: inconsistencies
CAP
consistency
partition toleranceavailability
zookeeper / bookkeeper
cassandra
More issues...
cc-by-2.0 https://flic.kr/p/8j57SG
Problem 3: split brain
writer A writer A writer A
writer A’
2 writers !
writer A’
single writer for A
Problem 4: failure detection
A
B
C
Problem 5: recovery
Recovery protocol?
For many systems,we need consistent datafor recovery
Solutions !
Steven Depolo cc-by-2.0 https://flic.kr/p/9APgFF
guarantees
protocols
tools / building blocks
techniques
Primary backup, active replication
active standby
active
active
Replication
f+1 replicas for f concurrent failures
Quorums
ensemble
quorum 1
quorum 2 quorum 3
Useful building block: ZooKeeper
● Centralized coordination serviceo configuration, service discoveryo locking, queues, barrierso failure detectiono leader election, membership
● Reliable
● Source of truth
Failure detection with ZooKeeper
ZooKeeper
Ephemeral znodes
Heartbeats
Timeouts
Triggers cluster update
Recovery: protocol
1. provision / acquire new node
2. fetch under-replicated data
3. rebuild state
4. join ensemble
Requirement: durability
sync mutationsto storage
append-onlyjournal
Journaling:persisting mutations
Imaging:persisting the state when it changes
Concretely
● Write-Ahead Logging
● Databases - data stores
● Durable Messaging
Enter Apache BookKeeper
Reliable distributed logging
CC-BY-2.0 https://flic.kr/p/dSHr87
BookKeeper : durability service
durability
replication consistency
on commodity hardware
recovery
user library
A building block for reliable systems
The ledger abstraction
op op op op op op op op op op opop opop opop op
addread
checkpoint
Ledger 1Ledger 2 Ledger 3
Guarantees
If an entry has been acknowledged,
it must be readable
If an entry is read once,
it must always be readable
History
Initial use case : Hadoop name node recovery
2008: open sourced contrib of ZooKeeper
2011: sub-project of ZooKeeper
2012: Production
Community
Committers from:
● Yahoo!● Twitter● Microsoft● Huawei● Facebook
Inside of Apache BookKeeper
CC-BY-2.0 https://flic.kr/p/agpLTR
ledgerçledger
Architecture
bookie
zookeeper
bookie
bookie
client library
client system
ledgerledger entry
index
metadatastore
Write path
ledgerçledger
bookie
bookie
bookie
client library
ledgerledger entry
replication+striping
Reliable writes
● store digest along with entry
● fsync each entry before returning
● ACK when:○ all previous
entries ○ this entry
accepted by quorum
Read path
bookie
bookie
bookie
read(ledger X entry Y)
Partial writes
bookie
bookie
bookie
read(ledger X entry Y)no quorum for this entry!
Read would be inconsistent
Last Add Confirmed
Consensus on written entries
bookie
bookie
bookie
read(ledger X entry Y)
Zookeeper
close(ledger X)
Ledger X, LAC
Recovery of a ledger
bookie
bookie
bookie
zookeeper
What is the last entry?
piggy-backlast add confirmed
Fencing: prevent multiple brains
writer writer
writer writer
roll
Inside of a bookie
L2 - E7L2 - E6
L2 - E3L2 - E4
L1 - E4
L1 - E2L2 - E1
L1 - E1
L3 - E1
...
sequential entries
interleaved
physical file
Storage device
disk
fsync Sequential entriesSynchronous writes
OK for writing
Reads interfere with writes
add ack
L2 - E3L3 - E7
L1 - E4
L1 - E2L2 - E1
L1 - E1
Separate read and write devices
disk 2
fsync
ackadd
L2 - E3L3 - E7
L1 - E4
L1 - E2L2 - E1
L1 - E1disk 1
L2 - E3L3 - E7
L1 - E4L1 - E2
L2 - E1
L1 - E1
async flush
cache Similar ratesDurabilityRead-efficient
INDEX
Ledger device Journal device
Garbage collection / compaction
disk 2
L2 - E3L3 - E7
L1 - E4
L1 - E2L2 - E1
L1 - E1
disk 1
L2 - E3L3 - E7
L2 - E1L1 - E4L1 - E2L1 - E1
L1 - E4L1 - E2
L2 - E1L1 - E1
L1 - E4
L1 - E2L2 - E1
L1 - E1
Ledger 1 deleted
L2 - E1
Entry log Journal
Using Apache BookKeeper
as a building block
Raul Hernandez - CC-BY-2.0 https://flic.kr/p/aSwTKT
Guarantees
If an entry has been acknowledged,
it must be readable
If an entry is read once,
it must always be readable
API
BookKeeper
createLedgeropenLedgerdeleteLedger
LedgerHandle
addEntryreadEntryclose
asyncCreateLedgerasyncOpenLedgerasyncDeleteLedger
asyncAddEntryasyncReadEntryasyncClose
Asynchronouswithcallbacks
Tech stack
● Java● Netty● ZooKeeper
Performance considerations
I/O bound- disk IOPS: ~ 120/s HDD, 500 000/s SSD- network: 1Gb/s ~ 100MB/s max or less in practice
~ 1KB msgs: 100 000/s per node
Public use cases
● Hadoop namenode (Huawei)
● WAL (HubSpot)
● Hedwig (open source)
● PNUTS cross-colo replication (Yahoo)
● Push notifications (Yahoo)
● Cloud messaging (Yahoo)
Primary backup
BookKeeper
active standby
write tail
apply opsbuild backup state
Reads from open ledger
Asks for current Last Add Confirmedfrom bookies
Data store WAL
Bookkeeper library
bookies
datastore
Bookkeeper library
bookies
Data structure
Durability for arbitrary (distributed)data structures!
Elasticity
Bookkeeper library
bookies
Elasticity
Bookkeeper library
bookies
Shared log infrastructure
Bookkeeper library
bookies
Application A Application B
System C
http://zookeeper.apache.org/bookkeeper/
https://github.com/apache/bookkeeper
Matthieu Morel: mmorel
Ivan Kelly: ivank
@apache.org