Building reliable systems with Apache BookKeeper

Post on 14-Jun-2015

DESCRIPTION

A presentation at the Barcelona JUG 19/06/2014

TRANSCRIPT

Building reliable systems with

Apache BookKeeper

Matthieu Morel, Ivan Kelly

Challenges in distributed systems

Ryan Lintleman cc-by-nc 2.0 https://flic.kr/p/5XNGow

Expect failures

up to 10% annual failure rates for disks/servers

The network is reliable

NOT

Jonathan Briggs - cc by 2.0 https://flic.kr/p/bnAxuz

Symptoms

Alex Proimos cc-by-2.0 https://flic.kr/p/bt29wL

Problem 1: not available

Problem 2: inconsistencies

CAP

consistency

partition tolerance

availability

ZooKeeper / BookKeeper

Cassandra

More issues...

cc-by-2.0 https://flic.kr/p/8j57SG

Problem 3: split brain

writer A writer A writer A

writer A’

2 writers !

writer A’

single writer for A

Problem 4: failure detection

A

B

C

Problem 5: recovery

Recovery protocol?

For many systems, we need consistent data for recovery

Solutions !

Steven Depolo cc-by-2.0 https://flic.kr/p/9APgFF

guarantees

protocols

tools / building blocks

techniques

Primary backup, active replication

active standby

active

active

Replication

f+1 replicas for f concurrent failures

Quorums

ensemble

quorum 1

quorum 2 quorum 3
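The replica count and the way quorums are laid over an ensemble can be sketched as follows. This is a minimal illustration, assuming round-robin placement of entries across the ensemble; the class and method names (QuorumSketch, replicasFor, quorumFor) are hypothetical and not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

final class QuorumSketch {
    /** Replicas needed so that data survives f concurrent failures. */
    static int replicasFor(int f) {
        return f + 1;
    }

    /**
     * Indices (into the ensemble) of the nodes that store a given entry.
     * Assumption: quorums are assigned round-robin, so consecutive entries
     * land on different, overlapping subsets of the ensemble.
     */
    static List<Integer> quorumFor(long entryId, int ensembleSize, int quorumSize) {
        List<Integer> nodes = new ArrayList<>();
        for (int i = 0; i < quorumSize; i++) {
            nodes.add((int) ((entryId + i) % ensembleSize));
        }
        return nodes;
    }
}
```

With an ensemble of 3 and a quorum of 2, entries 0, 1 and 2 map to nodes [0, 1], [1, 2] and [2, 0], which matches the three quorums drawn over one ensemble.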

Useful building block: ZooKeeper

● Centralized coordination service
  ○ configuration, service discovery
  ○ locking, queues, barriers
  ○ failure detection
  ○ leader election, membership

● Reliable

● Source of truth

Failure detection with ZooKeeper

ZooKeeper

Ephemeral znodes

Heartbeats

Timeouts

Triggers cluster update
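The heartbeat and timeout mechanism can be modeled without a ZooKeeper server. The sketch below is a plain in-memory simulation, not the ZooKeeper API: a node that stops sending heartbeats is expired after a timeout, which is roughly what happens to the ephemeral znode of a dead session. HeartbeatDetector and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HeartbeatDetector {
    private final long timeoutMillis;
    // node name -> time of last heartbeat; plays the role of the ephemeral registrations
    private final Map<String, Long> lastHeartbeat = new TreeMap<>();

    HeartbeatDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever a node reports that it is alive. */
    void heartbeat(String node, long nowMillis) {
        lastHeartbeat.put(node, nowMillis);
    }

    /** Removes and returns the nodes whose last heartbeat is older than the timeout. */
    List<String> expire(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        for (String node : dead) {
            lastHeartbeat.remove(node); // the "ephemeral" registration disappears
        }
        return dead; // the caller would trigger the cluster update for these nodes
    }

    boolean isAlive(String node) {
        return lastHeartbeat.containsKey(node);
    }
}
```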

Recovery: protocol

1. provision / acquire new node

2. fetch under-replicated data

3. rebuild state

4. join ensemble

Requirement: durability

sync mutations to storage

append-only journal

Journaling: persisting mutations

Imaging: persisting the state when it changes
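Journaling and imaging combine naturally: recovery loads the last image and replays the journal entries written after it. The sketch below keeps both in memory for brevity and assumes they would live on durable storage in a real system; JournaledStore is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JournaledStore {
    private final List<String[]> journal = new ArrayList<>(); // append-only: {key, value}
    private Map<String, String> image = new HashMap<>();       // last persisted snapshot
    private int imageUpTo = 0;                                  // journal entries covered by the image
    private final Map<String, String> state = new HashMap<>(); // live state

    /** Journaling: record the mutation before applying it. */
    void put(String key, String value) {
        journal.add(new String[] {key, value});
        state.put(key, value);
    }

    /** Imaging: persist a copy of the whole state. */
    void checkpoint() {
        image = new HashMap<>(state);
        imageUpTo = journal.size();
    }

    /** Recovery: start from the image, then replay the journal suffix. */
    Map<String, String> recover() {
        Map<String, String> rebuilt = new HashMap<>(image);
        for (int i = imageUpTo; i < journal.size(); i++) {
            rebuilt.put(journal.get(i)[0], journal.get(i)[1]);
        }
        return rebuilt;
    }
}
```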

Concretely

● Write-Ahead Logging

● Databases - data stores

● Durable Messaging

Enter Apache BookKeeper

Reliable distributed logging

CC-BY-2.0 https://flic.kr/p/dSHr87

BookKeeper : durability service

durability

replication consistency

on commodity hardware

recovery

user library

A building block for reliable systems

The ledger abstraction

op op op op op op op op op op op op op op op op op

add

read

checkpoint

Ledger 1

Ledger 2

Ledger 3

Guarantees

If an entry has been acknowledged,

it must be readable

If an entry is read once,

it must always be readable

History

Initial use case: Hadoop NameNode recovery

2008: open-sourced as a contrib of ZooKeeper

2011: sub-project of ZooKeeper

2012: Production

Community

Committers from:

● Yahoo!
● Twitter
● Microsoft
● Huawei
● Facebook

Inside of Apache BookKeeper

CC-BY-2.0 https://flic.kr/p/agpLTR

Architecture

ledger

ledger

bookie

zookeeper

bookie

bookie

client library

client system

ledger

ledger entry

index

metadata store

Write path

ledger

ledger

bookie

bookie

bookie

client library

ledger

ledger entry

replication + striping

Reliable writes

● store digest along with entry

● fsync each entry before returning

● ACK when both have been accepted by a quorum:
  ○ all previous entries
  ○ this entry
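The acknowledgement rule can be sketched as a small tracker: an entry is acknowledged to the application only once it, and every entry before it, has been stored by a quorum of bookies. This is an illustrative model, not BookKeeper's client code; AckTracker is a hypothetical name, and CRC32 stands in for whichever digest is configured.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.zip.CRC32;

final class AckTracker {
    private final int ackQuorum;
    private final Map<Long, Set<String>> acks = new HashMap<>(); // entryId -> bookies that stored it
    private long lastAcked = -1;                                 // highest entry acknowledged so far

    AckTracker(int ackQuorum) {
        this.ackQuorum = ackQuorum;
    }

    /** Digest stored along with the entry so that corruption can be detected on read. */
    static long digest(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    /** Records one bookie response; returns the highest entry id that may now be acknowledged. */
    long onBookieAck(long entryId, String bookie) {
        if (entryId <= lastAcked) {
            return lastAcked; // late response for an entry that is already acknowledged
        }
        acks.computeIfAbsent(entryId, k -> new HashSet<>()).add(bookie);
        // Advance only through a gap-free prefix of quorum-accepted entries.
        while (acks.containsKey(lastAcked + 1) && acks.get(lastAcked + 1).size() >= ackQuorum) {
            acks.remove(lastAcked + 1);
            lastAcked++;
        }
        return lastAcked;
    }
}
```

In this model, entry 1 reaching its quorum before entry 0 does not acknowledge anything; both are acknowledged together once entry 0 catches up.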

Read path

bookie

bookie

bookie

read(ledger X entry Y)

Partial writes

bookie

bookie

bookie

read(ledger X entry Y)

no quorum for this entry!

Read would be inconsistent

Last Add Confirmed

Consensus on written entries

bookie

bookie

bookie

read(ledger X entry Y)

ZooKeeper

close(ledger X)

Ledger X, LAC

Recovery of a ledger

bookie

bookie

bookie

zookeeper

What is the last entry?

piggy-back last add confirmed
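One way to answer "what is the last entry?" is sketched below. It is a simplified model of ledger recovery under stated assumptions: each bookie is a map from entry id to the last-add-confirmed value piggy-backed on that entry, recovery starts from the highest such value, then reads forward for as long as some bookie still holds the next entry. A real recovery would also re-replicate those trailing entries before closing the ledger. LedgerRecoverySketch is a hypothetical name.

```java
import java.util.List;
import java.util.Map;

final class LedgerRecoverySketch {
    /**
     * Each map models one bookie: entryId -> last add confirmed piggy-backed on that entry.
     * Returns the id of the last recoverable entry, or -1 for an empty ledger.
     */
    static long recoverLastEntry(List<Map<Long, Long>> bookies) {
        // Step 1: the highest piggy-backed LAC is a safe lower bound.
        long lac = -1;
        for (Map<Long, Long> bookie : bookies) {
            for (long piggyBacked : bookie.values()) {
                lac = Math.max(lac, piggyBacked);
            }
        }
        // Step 2: read forward past the LAC while an entry can still be found.
        long last = lac;
        while (anyHas(bookies, last + 1)) {
            last++;
        }
        return last;
    }

    private static boolean anyHas(List<Map<Long, Long>> bookies, long entryId) {
        for (Map<Long, Long> bookie : bookies) {
            if (bookie.containsKey(entryId)) {
                return true;
            }
        }
        return false;
    }
}
```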

Fencing: prevent multiple brains

writer writer

writer writer

roll
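Fencing can be pictured from the bookie's side: once a ledger is fenced, further adds from the old writer are refused, so only one writer can make progress. The sketch below is an in-memory model under that assumption; FencingBookie and its methods are hypothetical names, not BookKeeper classes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class FencingBookie {
    private final Set<Long> fencedLedgers = new HashSet<>();
    private final Map<Long, TreeMap<Long, byte[]>> ledgers = new HashMap<>();

    /** Stores an entry unless the ledger has been fenced; returns whether it was accepted. */
    boolean addEntry(long ledgerId, long entryId, byte[] data) {
        if (fencedLedgers.contains(ledgerId)) {
            return false; // the old writer learns it has lost ownership
        }
        ledgers.computeIfAbsent(ledgerId, k -> new TreeMap<>()).put(entryId, data);
        return true;
    }

    /** Fences the ledger and returns the id of the last entry stored here, or -1 if none. */
    long fence(long ledgerId) {
        fencedLedgers.add(ledgerId);
        TreeMap<Long, byte[]> entries = ledgers.get(ledgerId);
        return (entries == null || entries.isEmpty()) ? -1 : entries.lastKey();
    }
}
```

After fencing, the previous writer has to roll over to a new ledger to keep writing, while the fenced ledger stays untouched for whoever recovers it.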

Inside of a bookie

L2 - E7, L2 - E6, L2 - E3, L2 - E4, L1 - E4, L1 - E2, L2 - E1, L1 - E1, L3 - E1

...

sequential entries

interleaved

physical file

Storage device

disk

fsync

Sequential entries

Synchronous writes

OK for writing

Reads interfere with writes

add ack

L2 - E3, L3 - E7, L1 - E4, L1 - E2, L2 - E1, L1 - E1

Separate read and write devices

disk 2

fsync

add

ack

L2 - E3, L3 - E7, L1 - E4, L1 - E2, L2 - E1, L1 - E1

disk 1

L2 - E3, L3 - E7, L1 - E4, L1 - E2, L2 - E1, L1 - E1

async flush

cache

Similar rates

Durability

Read-efficient

INDEX

Ledger device

Journal device
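The split between journal device and ledger device can be sketched as two append-only structures plus a cache and an index. The model below is in-memory and deliberately simplified: adds go to the journal (where a real bookie would fsync before acknowledging) and to a cache, a later flush moves cached entries to the entry log and records their position in the index, and reads never touch the journal. BookieStorageSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BookieStorageSketch {
    private final List<String> journal = new ArrayList<>();          // journal device: interleaved, sequential
    private final Map<String, String> cache = new LinkedHashMap<>(); // entries not yet flushed
    private final List<String> entryLog = new ArrayList<>();         // ledger device
    private final Map<String, Integer> index = new HashMap<>();      // "L<ledger>-E<entry>" -> position in entryLog

    private static String key(long ledgerId, long entryId) {
        return "L" + ledgerId + "-E" + entryId;
    }

    /** Write path: append to the journal, keep a copy in the cache, then the add can be acked. */
    void add(long ledgerId, long entryId, String payload) {
        journal.add(key(ledgerId, entryId) + "=" + payload);
        cache.put(key(ledgerId, entryId), payload);
    }

    /** Asynchronous flush: move cached entries to the entry log and index them. */
    void flush() {
        for (Map.Entry<String, String> e : cache.entrySet()) {
            index.put(e.getKey(), entryLog.size());
            entryLog.add(e.getValue());
        }
        cache.clear();
    }

    /** Read path: cache first, then index and entry log. */
    String read(long ledgerId, long entryId) {
        String k = key(ledgerId, entryId);
        if (cache.containsKey(k)) {
            return cache.get(k);
        }
        Integer position = index.get(k);
        return position == null ? null : entryLog.get(position);
    }
}
```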

Garbage collection / compaction

disk 2

L2 - E3, L3 - E7, L1 - E4, L1 - E2, L2 - E1, L1 - E1

disk 1

L2 - E3, L3 - E7, L2 - E1, L1 - E4, L1 - E2, L1 - E1

L1 - E4, L1 - E2, L2 - E1, L1 - E1

L1 - E4, L1 - E2, L2 - E1, L1 - E1

Ledger 1 deleted

L2 - E1

Entry log

Journal
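Compaction after a ledger deletion amounts to rewriting the entry log with only the entries of ledgers that still exist. The sketch below uses the "L<ledger> - E<entry>" labels from the diagram as stand-ins for real entries; CompactionSketch is a hypothetical name and the set of live ledgers is assumed to come from the metadata store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class CompactionSketch {
    /** Returns a new entry log containing only the entries whose ledger is still live. */
    static List<String> compact(List<String> entryLog, Set<Long> liveLedgers) {
        List<String> compacted = new ArrayList<>();
        for (String entry : entryLog) {
            // Labels look like "L1 - E4": the ledger id sits between 'L' and the first space.
            long ledgerId = Long.parseLong(entry.substring(1, entry.indexOf(' ')));
            if (liveLedgers.contains(ledgerId)) {
                compacted.add(entry);
            }
        }
        return compacted;
    }
}
```

For an entry log holding L1 - E1, L2 - E1, L1 - E2, L1 - E4, deleting ledger 1 leaves only L2 - E1.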

Using Apache BookKeeper

as a building block

Raul Hernandez - CC-BY-2.0 https://flic.kr/p/aSwTKT

Guarantees

If an entry has been acknowledged,

it must be readable

If an entry is read once,

it must always be readable

API

BookKeeper

createLedger, openLedger, deleteLedger

asyncCreateLedger, asyncOpenLedger, asyncDeleteLedger

LedgerHandle

addEntry, readEntry, close

asyncAddEntry, asyncReadEntry, asyncClose

Asynchronous with callbacks
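The relationship between the synchronous and the callback-based calls can be illustrated without the BookKeeper library. The sketch below is not the BookKeeper API: AsyncLogSketch, AddCallback and the result code 0 for success are assumptions made for the example. It shows a blocking add layered on top of an asynchronous one.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

final class AsyncLogSketch {
    /** Hypothetical callback type: invoked once the add has completed. */
    interface AddCallback {
        void addComplete(int resultCode, long entryId);
    }

    private final AtomicLong nextEntryId = new AtomicLong();

    /** Asynchronous add: returns immediately and reports the result through the callback. */
    void asyncAddEntry(byte[] data, AddCallback callback) {
        long entryId = nextEntryId.getAndIncrement();
        // A real client would send the entry to the bookies here; this sketch completes at once.
        CompletableFuture.runAsync(() -> callback.addComplete(0, entryId));
    }

    /** Synchronous add: issues the asynchronous call and blocks until the callback fires. */
    long addEntry(byte[] data) {
        CompletableFuture<Long> done = new CompletableFuture<>();
        asyncAddEntry(data, (resultCode, entryId) -> {
            if (resultCode == 0) {
                done.complete(entryId);
            } else {
                done.completeExceptionally(new IllegalStateException("add failed: " + resultCode));
            }
        });
        return done.join();
    }
}
```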

Tech stack

● Java
● Netty
● ZooKeeper

Performance considerations

I/O bound
- disk IOPS: ~120/s HDD, 500 000/s SSD
- network: 1 Gb/s ~ 100 MB/s max, or less in practice

~1 KB msgs: 100 000/s per node
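The per-node figure follows from the network bound. A quick check, assuming decimal units (100 MB/s = 100 000 000 bytes/s, 1 KB = 1 000 bytes) and ignoring protocol overhead; ThroughputEstimate is a hypothetical helper.

```java
final class ThroughputEstimate {
    /** Upper bound on messages per second for a given link rate and message size. */
    static long messagesPerSecond(long bytesPerSecond, long messageBytes) {
        return bytesPerSecond / messageBytes;
    }
}
```

With 100 000 000 bytes/s and 1 000-byte messages this gives 100 000 messages per second, the number quoted above.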

Public use cases

● Hadoop namenode (Huawei)

● WAL (HubSpot)

● Hedwig (open source)

● PNUTS cross-colo replication (Yahoo)

● Push notifications (Yahoo)

● Cloud messaging (Yahoo)

Primary backup

BookKeeper

active standby

write tail

apply ops

build backup state

Reads from open ledger

Asks for current Last Add Confirmed from bookies
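A standby that tails an open ledger can be sketched as a loop that applies operations up to the current Last Add Confirmed and never beyond it. The model below uses a plain list in place of the ledger and takes the LAC as a parameter; TailingStandby is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class TailingStandby {
    private long nextToApply = 0;                               // first entry not yet applied
    private final List<String> backupState = new ArrayList<>(); // state rebuilt from the log

    /** Applies every entry up to and including lastAddConfirmed; returns how many were applied. */
    int catchUp(List<String> ledgerEntries, long lastAddConfirmed) {
        int applied = 0;
        while (nextToApply <= lastAddConfirmed && nextToApply < ledgerEntries.size()) {
            backupState.add(ledgerEntries.get((int) nextToApply)); // "apply op"
            nextToApply++;
            applied++;
        }
        return applied;
    }

    List<String> state() {
        return backupState;
    }
}
```

Entries beyond the LAC may exist on some bookies but are not yet confirmed, so the standby leaves them for the next round.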

Data store WAL

BookKeeper library

bookies

datastore

BookKeeper library

bookies

Data structure

Durability for arbitrary (distributed) data structures!

Elasticity

BookKeeper library

bookies

Shared log infrastructure

BookKeeper library

bookies

Application A Application B

System C

http://zookeeper.apache.org/bookkeeper/

https://github.com/apache/bookkeeper

Matthieu Morel: mmorel

Ivan Kelly: ivank

@apache.org
