building reliable systems with apache bookkeeper

58
Building reliable systems with Apache BookKeeper Matthieu Morel Ivan Kelly

Upload: matthieu-morel

Post on 14-Jun-2015

1.601 views

Category:

Technology


1 download

DESCRIPTION

A presentation at the Barcelona JUG 19/06/2014

TRANSCRIPT

Page 1: Building reliable systems with Apache BookKeeper

Building reliable systems with

Apache BookKeeper

Matthieu MorelIvan Kelly

Page 2: Building reliable systems with Apache BookKeeper

Challenges in distributed systems

Page 3: Building reliable systems with Apache BookKeeper

Ryan Lintleman cc-by-nc 2.0 https://flic.kr/p/5XNGow

Expect failures

Page 4: Building reliable systems with Apache BookKeeper

up to 10% annual failure rates for disks/servers

Page 5: Building reliable systems with Apache BookKeeper

The network is reliable

NOT

Jonathan Briggs - cc by 2.0 https://flic.kr/p/bnAxuz

Page 6: Building reliable systems with Apache BookKeeper

SymptomsAlex Proimos cc-by-2.0 https://flic.kr/p/bt29wL

Page 7: Building reliable systems with Apache BookKeeper

Problem 1: not available

Page 8: Building reliable systems with Apache BookKeeper

Problem 1: not available

Page 9: Building reliable systems with Apache BookKeeper

Problem 2: inconsistencies

Page 10: Building reliable systems with Apache BookKeeper

CAP

consistency

partition toleranceavailability

zookeeper / bookkeeper

cassandra

Page 11: Building reliable systems with Apache BookKeeper

More issues...

cc-by-2.0 https://flic.kr/p/8j57SG

Page 12: Building reliable systems with Apache BookKeeper

Problem 3: split brain

writer A writer A writer A

writer A’

2 writers !

writer A’

single writer for A

Page 13: Building reliable systems with Apache BookKeeper

Problem 4: failure detection

A

B

C

Page 14: Building reliable systems with Apache BookKeeper

Problem 5: recovery

Recovery protocol?

For many systems,we need consistent datafor recovery

Page 15: Building reliable systems with Apache BookKeeper

Solutions !

Steven Depolo cc-by-2.0 https://flic.kr/p/9APgFF

Page 16: Building reliable systems with Apache BookKeeper

guarantees

protocols

tools / building blocks

techniques

Page 17: Building reliable systems with Apache BookKeeper

Primary backup, active replication

active standby

active

active

Page 18: Building reliable systems with Apache BookKeeper

Replication

f+1 replicas for f concurrent failures

Page 19: Building reliable systems with Apache BookKeeper

Quorums

ensemble

quorum 1

quorum 2 quorum 3

Page 20: Building reliable systems with Apache BookKeeper

Useful building block: ZooKeeper

● Centralized coordination serviceo configuration, service discoveryo locking, queues, barrierso failure detectiono leader election, membership

● Reliable

● Source of truth

Page 21: Building reliable systems with Apache BookKeeper

Failure detection with ZooKeeper

ZooKeeper

Ephemeral znodes

Heartbeats

Timeouts

Triggers cluster update

Page 22: Building reliable systems with Apache BookKeeper

Recovery: protocol

1. provision / acquire new node

2. fetch under-replicated data

3. rebuild state

4. join ensemble

Page 23: Building reliable systems with Apache BookKeeper

Requirement: durability

sync mutationsto storage

append-onlyjournal

Journaling:persisting mutations

Imaging:persisting the state when it changes

Page 24: Building reliable systems with Apache BookKeeper

Concretely

● Write-Ahead Logging

● Databases - data stores

● Durable Messaging

Page 25: Building reliable systems with Apache BookKeeper

Enter Apache BookKeeper

Reliable distributed logging

CC-BY-2.0 https://flic.kr/p/dSHr87

Page 26: Building reliable systems with Apache BookKeeper

BookKeeper : durability service

durability

replication consistency

on commodity hardware

recovery

user library

A building block for reliable systems

Page 27: Building reliable systems with Apache BookKeeper

The ledger abstraction

op op op op op op op op op op opop opop opop op

addread

checkpoint

Ledger 1Ledger 2 Ledger 3

Page 28: Building reliable systems with Apache BookKeeper

Guarantees

If an entry has been acknowledged,

it must be readable

If an entry is read once,

it must always be readable

Page 29: Building reliable systems with Apache BookKeeper

History

Initial use case : Hadoop name node recovery

2008: open sourced contrib of ZooKeeper

2011: sub-project of ZooKeeper

2012: Production

Page 30: Building reliable systems with Apache BookKeeper

Community

Committers from:

● Yahoo!● Twitter● Microsoft● Huawei● Facebook

Page 31: Building reliable systems with Apache BookKeeper

Inside of Apache BookKeeper

CC-BY-2.0 https://flic.kr/p/agpLTR

Page 32: Building reliable systems with Apache BookKeeper

ledgerçledger

Architecture

bookie

zookeeper

bookie

bookie

client library

client system

ledgerledger entry

index

metadatastore

Page 33: Building reliable systems with Apache BookKeeper

Write path

ledgerçledger

bookie

bookie

bookie

client library

ledgerledger entry

replication+striping

Page 34: Building reliable systems with Apache BookKeeper

Reliable writes

● store digest along with entry

● fsync each entry before returning

● ACK when:○ all previous

entries ○ this entry

accepted by quorum

Page 35: Building reliable systems with Apache BookKeeper

Read path

bookie

bookie

bookie

read(ledger X entry Y)

Page 36: Building reliable systems with Apache BookKeeper

Partial writes

bookie

bookie

bookie

read(ledger X entry Y)no quorum for this entry!

Read would be inconsistent

Page 37: Building reliable systems with Apache BookKeeper

Last Add Confirmed

Consensus on written entries

bookie

bookie

bookie

read(ledger X entry Y)

Zookeeper

close(ledger X)

Ledger X, LAC

Page 38: Building reliable systems with Apache BookKeeper

Recovery of a ledger

bookie

bookie

bookie

zookeeper

What is the last entry?

piggy-backlast add confirmed

Page 39: Building reliable systems with Apache BookKeeper

Fencing: prevent multiple brains

writer writer

writer writer

Page 40: Building reliable systems with Apache BookKeeper

roll

Inside of a bookie

L2 - E7L2 - E6

L2 - E3L2 - E4

L1 - E4

L1 - E2L2 - E1

L1 - E1

L3 - E1

...

sequential entries

interleaved

physical file

Page 41: Building reliable systems with Apache BookKeeper

Storage device

disk

fsync Sequential entriesSynchronous writes

OK for writing

Reads interfere with writes

add ack

L2 - E3L3 - E7

L1 - E4

L1 - E2L2 - E1

L1 - E1

Page 42: Building reliable systems with Apache BookKeeper

Separate read and write devices

disk 2

fsync

ackadd

L2 - E3L3 - E7

L1 - E4

L1 - E2L2 - E1

L1 - E1disk 1

L2 - E3L3 - E7

L1 - E4L1 - E2

L2 - E1

L1 - E1

async flush

cache Similar ratesDurabilityRead-efficient

INDEX

Ledger device Journal device

Page 43: Building reliable systems with Apache BookKeeper
Page 44: Building reliable systems with Apache BookKeeper

Garbage collection / compaction

disk 2

L2 - E3L3 - E7

L1 - E4

L1 - E2L2 - E1

L1 - E1

disk 1

L2 - E3L3 - E7

L2 - E1L1 - E4L1 - E2L1 - E1

L1 - E4L1 - E2

L2 - E1L1 - E1

L1 - E4

L1 - E2L2 - E1

L1 - E1

Ledger 1 deleted

L2 - E1

Entry log Journal

Page 45: Building reliable systems with Apache BookKeeper
Page 46: Building reliable systems with Apache BookKeeper

Using Apache BookKeeper

as a building block

Raul Hernandez - CC-BY-2.0 https://flic.kr/p/aSwTKT

Page 47: Building reliable systems with Apache BookKeeper

Guarantees

If an entry has been acknowledged,

it must be readable

If an entry is read once,

it must always be readable

Page 48: Building reliable systems with Apache BookKeeper

API

BookKeeper

createLedgeropenLedgerdeleteLedger

LedgerHandle

addEntryreadEntryclose

asyncCreateLedgerasyncOpenLedgerasyncDeleteLedger

asyncAddEntryasyncReadEntryasyncClose

Asynchronouswithcallbacks

Page 49: Building reliable systems with Apache BookKeeper

Tech stack

● Java● Netty● ZooKeeper

Page 50: Building reliable systems with Apache BookKeeper

Performance considerations

I/O bound- disk IOPS: ~ 120/s HDD, 500 000/s SSD- network: 1Gb/s ~ 100MB/s max or less in practice

~ 1KB msgs: 100 000/s per node

Page 51: Building reliable systems with Apache BookKeeper

Public use cases

● Hadoop namenode (Huawei)

● WAL (HubSpot)

● Hedwig (open source)

● PNUTS cross-colo replication (Yahoo)

● Push notifications (Yahoo)

● Cloud messaging (Yahoo)

Page 52: Building reliable systems with Apache BookKeeper

Primary backup

BookKeeper

active standby

write tail

apply opsbuild backup state

Reads from open ledger

Asks for current Last Add Confirmedfrom bookies

Page 53: Building reliable systems with Apache BookKeeper

Data store WAL

Bookkeeper library

bookies

datastore

Page 54: Building reliable systems with Apache BookKeeper

Bookkeeper library

bookies

Data structure

Durability for arbitrary (distributed)data structures!

Page 55: Building reliable systems with Apache BookKeeper

Elasticity

Bookkeeper library

bookies

Page 56: Building reliable systems with Apache BookKeeper

Elasticity

Bookkeeper library

bookies

Page 57: Building reliable systems with Apache BookKeeper

Shared log infrastructure

Bookkeeper library

bookies

Application A Application B

System C

Page 58: Building reliable systems with Apache BookKeeper

http://zookeeper.apache.org/bookkeeper/

https://github.com/apache/bookkeeper

Matthieu Morel: mmorel

Ivan Kelly: ivank

@apache.org