Free Recovery: A Step Towards Self-Managing State

Andy Huang and Armando Fox
Stanford University

Jan 2004 ROC Retreat - Lake Tahoe, CA © 2004 Andy Huang

Persistent hash tables

[Figure: frontends and app servers connected over a LAN to a hash table and a DB]

Key              Value
Yahoo! user ID   User profile
ISBN             Amazon catalog metadata


Two state management challenges

Failure handling
• Consistency requirements
  • Node recovery costly
  • Reliable failure detection
• Relax internal consistency
  • Fast, non-intrusive recovery (“free”)

System evolution
• Large data sets
  • Repartitioning is costly
  • Good resource provisioning
• Free recovery
  • Automatic, online repartitioning

DStore: an easy-to-manage cluster-based persistent hash table for Internet services


DStore architecture

[Figure: app servers, each with a Dlib, connected over a LAN to bricks]

Dlib: exposes hash table API and is the “coordinator” for distributed operations

Brick: stores data by writing synchronously to disk





Focusing on recovery

Technique 1: Quorums
• Write: send to all, wait for majority
• Read: read from majority
• Tolerant to brick inconsistency: OK if some bricks’ data differs
• Failure = missing some writes

Technique 2: Single-phase writes
• 2PC: failure between phases complicates protocol
  • 2nd phase depends on particular set of bricks
  • Relies on reliable failure detection
• Single-phase quorum writes: can be completed by any majority of bricks
• No request relies on specific bricks
• Any brick can fail at any time

Simple, non-intrusive recovery
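The quorum rules on this slide (write: send to all, wait for a majority; read: read from a majority) can be sketched in a few lines of Python. This is a minimal in-memory illustration, not DStore's implementation: the `Brick` class, its `up` flag, and the integer timestamps are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Brick:
    """Hypothetical in-memory stand-in for a brick (not DStore's real interface)."""
    up: bool = True
    data: dict = field(default_factory=dict)  # key -> (timestamp, value)

    def put(self, key, ts, value):
        if not self.up:
            return False  # a down brick simply misses the write
        # Keep only the newest version of each key.
        if key not in self.data or self.data[key][0] < ts:
            self.data[key] = (ts, value)
        return True

    def get(self, key):
        if not self.up:
            return None
        return self.data.get(key, (0, None))


def quorum_write(bricks, key, ts, value):
    """Send the write to all bricks; succeed if a majority acknowledged it."""
    acks = sum(brick.put(key, ts, value) for brick in bricks)
    return acks > len(bricks) // 2


def quorum_read(bricks, key):
    """Read from a majority and return the value with the newest timestamp."""
    majority = len(bricks) // 2 + 1
    replies = []
    for brick in bricks:
        reply = brick.get(key)
        if reply is not None:
            replies.append(reply)
        if len(replies) == majority:
            break
    if len(replies) < majority:
        raise RuntimeError("no majority of bricks reachable")
    return max(replies, key=lambda r: r[0])[1]
```

Because any two majorities overlap in at least one brick, a read still sees the latest completed write even when one brick missed that write and a different brick is down at read time.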


Considering consistency

[Figure: timeline with Dlibs Dl1 and Dl2 and bricks B1, B2, B3, starting from x = 0, with a write(1) and two reads (values shown: 0 and 1)]

• Dlib failure can cause a partial write, violating the quorum property
• If timestamps differ, read-repair restores majority invariant
• Delayed commit
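Read-repair after a partial write could look roughly like the following. This is a minimal sketch under stated assumptions: each brick is modeled as a plain dict mapping a key to a (timestamp, value) pair, the reader contacts the first majority of bricks in the list, and the function name `read_with_repair` is hypothetical.

```python
def read_with_repair(bricks, key):
    """Read a majority; if timestamps differ, push the newest version to stale bricks.

    Each brick is modeled as a dict: key -> (timestamp, value).
    Returns (value, number_of_bricks_repaired).
    """
    majority = len(bricks) // 2 + 1
    replies = [brick.get(key, (0, None)) for brick in bricks[:majority]]
    newest = max(replies, key=lambda r: r[0])

    repaired = 0
    if any(r[0] != newest[0] for r in replies):
        # Timestamps differ, so a partial write was observed. Write the newest
        # version back so that a majority holds it before the read returns.
        for brick in bricks:
            if brick.get(key, (0, None))[0] < newest[0]:
                brick[key] = newest
                repaired += 1
    return newest[1], repaired
```

In this model a read that happens to contact only bricks holding the old value returns the old value without repairing anything, while a read that sees both versions completes the write on behalf of the failed Dlib, which is one way to read the "delayed commit" note.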


Considering consistency

[Figure: timeline with bricks B1, B2, B3 and Dlibs Dl1 and Dl2, starting from x = 0, with a write(1), a read, and a write (value shown: 1)]

• A write-in-progress cookie can be used to detect partial writes and commit/abort on the next read
• An individual client’s view of DStore is consistent with that of a single centralized server (Bayou)
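One way to picture the write-in-progress cookie is the sketch below. The slide does not say where the cookie is stored or what it contains, so everything here is an assumption: the cookie is taken to be a (key, timestamp) pair saved before the write starts and left behind if the coordinating Dlib fails, bricks are plain dicts mapping a key to a (timestamp, value) pair, all bricks are assumed reachable, and `resolve_cookie` is a hypothetical name.

```python
def resolve_cookie(bricks, cookie):
    """Commit or abort the write named by a leftover write-in-progress cookie.

    Bricks are dicts: key -> (timestamp, value). `cookie` is (key, timestamp).
    This sketch inspects every brick and assumes all of them are reachable.
    """
    key, ts = cookie
    found = [b[key] for b in bricks if key in b and b[key][0] == ts]
    if found:
        # At least one brick saw the write: commit it on every brick that
        # still holds an older version.
        for brick in bricks:
            if brick.get(key, (0, None))[0] < ts:
                brick[key] = found[0]
        return "commit"
    # No brick saw the write: there is nothing to undo, so it is aborted.
    return "abort"
```

Running this check before serving the next read means the client never observes the new value on one read and the old value on a later one.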


Benchmark: Free recovery

[Figure: GET req/sec, PUT req/sec, and repairs/sec versus time (0 to 30 minutes), with the points where a brick is killed and where it recovers marked. Two panels: worst-case behavior (100% cache hit rate) and expected behavior (85% cache hit rate).]

Recovery: fast and non-intrusive


Benchmark: Automatic failure detection

[Figure: GET req/sec, PUT req/sec, and repairs/sec versus time (0 to 15 minutes), with the fail-stutter event marked. Two panels: modest policy (anomaly threshold = 8) and aggressive policy (anomaly threshold = 5).]

False positives: low cost
Fail-stutter: detected by Pinpoint
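The difference between the modest and aggressive policies can be illustrated with a small threshold check. This is only a sketch: how Pinpoint computes an anomaly score is not described on the slide, so the scores below are made-up inputs, the "score at or above threshold" rule is an assumption, and `bricks_to_reboot` is a hypothetical helper.

```python
def bricks_to_reboot(anomaly_scores, threshold):
    """Return the bricks whose anomaly score meets or exceeds the threshold.

    `anomaly_scores` maps brick name -> score (higher means more anomalous).
    How the score is computed is outside this sketch.
    """
    return sorted(name for name, score in anomaly_scores.items() if score >= threshold)


scores = {"B1": 2, "B2": 6, "B3": 9}  # made-up scores for illustration

modest = bricks_to_reboot(scores, threshold=8)      # only B3
aggressive = bricks_to_reboot(scores, threshold=5)  # B2 and B3
```

The lower threshold also reboots the borderline brick B2, which may be a false positive. That is acceptable only because a reboot is cheap, which is the point of the "False positives: low cost" result.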


Online repartitioning

1. Take brick offline
2. Copy data to new brick
3. Bring both bricks online

[Figure: rows of bricks labeled 0 and 1, one row per step]

Appears as if brick just failed and recovered
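The three repartitioning steps can be sketched as follows. This is a simplified model, not DStore's procedure: bricks are plain dicts mapping a key to a (timestamp, value) pair, the new brick is assumed to join the same replica group as the brick it copies from, how partitions are split or assigned is not modeled, and `repartition` is a hypothetical name.

```python
def repartition(group, old, new, writes_during_copy):
    """Add `new` to a replica group by copying from `old`, which is taken offline.

    Bricks are modeled as dicts: key -> (timestamp, value).
    `writes_during_copy` are (key, timestamp, value) writes that arrive mid-copy.
    """
    # 1. Take the brick offline: it stops receiving writes.
    online = [brick for brick in group if brick is not old]

    # 2. Copy data to the new brick while the remaining bricks keep serving writes.
    new.update(old)
    for key, ts, value in writes_during_copy:
        for brick in online:
            brick[key] = (ts, value)

    # 3. Bring both bricks online. Each has missed some writes, exactly like
    #    a brick that failed and then recovered.
    return online + [old, new]
```

With a group that grows from three bricks to four, only two bricks are stale for any key written during the copy, so every majority of three still contains an up-to-date copy and ordinary quorum reads keep returning the newest value.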


Benchmark: Automatic online repartitioning

[Figure: GET req/sec, PUT req/sec, and repairs/sec versus time, annotated with the number of bricks. Two panels: evenly-distributed load (3 to 6 bricks, 0 to 40 minutes) and hotspot in 01 partition (6 to 12 bricks, 0 to 80 minutes). Some curves are labeled “Naive”.]

Brick selection: effective
Repartitioning: non-intrusive


Next up for free recovery

• Perform online checkpoints
  • Take checkpointing brick offline
  • Just like failure+recovery
• See if free recovery can simplify online data reconstruction after hard failures
• Any other state management challenges you can think of?

[Figure: PUT req/sec versus time (0 to 10 minutes)]


Summary

DStore = Decoupled Storage

Free recovery
• Quorums [spatial decoupling]
  • Cost: extra overprovisioning
  • Gain: fast, non-intrusive recovery
• Single-phase ops [temporal decoupling]
  • Cost: temporarily violates “majority” invariant
  • Gain: any brick can fail at any time

Managed like a stateless Web farm
• Failure handling: fast, non-intrusive
  • Mechanism: simple reboot
  • Policy: aggressively reboot anomalous bricks
• System evolution: “plug-and-play”
  • Mechanism: automatic, online repartitioning
  • Policy: dynamically add and remove nodes based on predicted load





[email protected]@stanford.edu
