CS 443 Advanced OS Fabián E. Bustamante, Spring 2005 Glacier: Highly Durable, Decentralized Storage Despite Massive Correlated Failures Andreas Haeberlen, Alan Mislove, and Peter Druschel Presenter: Yi Qiao

CS 443 Advanced OS

Fabián E. Bustamante, Spring 2005

Glacier: Highly Durable, Decentralized Storage Despite Massive Correlated

Failures

Andreas Haeberlen, Alan Mislove, and Peter Druschel

Presenter: Yi Qiao

Outline

Introduction

Related Work

Assumption and Targeted Environment

Glacier

Object Aggregation

Security

Evaluation

Conclusion

Introduction

How to achieve high availability in decentralized storage systems?

Replication

Problems
– Failures are not independent
– Worms make the problem worse
– Losing some data can have catastrophic effects

Glacier
– A distributed storage system that is robust to large-scale correlated failures
  • Highly durable, decentralized storage
– Trades storage efficiency for durability
– Makes no assumptions about the nature and correlation of failures
– Aggregates small objects and uses a fragment maintenance protocol to reduce message overhead

Related Work

OceanStore and Phoenix
– Apply introspection to defend against correlated failures
  • Difficult to capture all correlations
  • Introspection itself can make the system vulnerable to attacks
– Glacier instead relies on minimal assumptions about the nature of failures, at the cost of larger storage overhead

TotalRecall
– Optimizes availability under churn; no worst-case guarantees

PAST, Farsite
– Replication against data loss

Weatherspoon et al.
– Erasure codes can achieve better MTTF than plain replication

Assumptions and Intended Environment

Intended for an environment consisting of desktop computers within an organizational intranet
– Some fraction of nodes can be home desktops connected via DSL or wireless LAN
– Modest amount of churn and good network connectivity
– Used in combination with a conventional decentralized replication storage layer

Lifetime – hundreds of days; session time – days or hours

Three operating modes
– Normal operation
– Large-scale failure – up to a fraction fmax of the nodes fail
  • Protects data stored on non-faulty nodes
– Recovery mode
  • Reconstitutes aggregates and restores missing fragments

Glacier

Participating storage nodes form an overlay network
– The set of keys forms a circular space
– Each node stores objects with keys in its own key segment
– Uses the underlying DHT layer for secure routing and communication

Operates alongside a primary store with full replicas

Aggregation of small objects

Erasure coding of aggregates

Fragment placement at randomly selected nodes

Glacier

Durability guarantee – as long as the failed fraction f <= fmax
– Each object survives with probability p >= pmin

Application interface
– put(i, v, o, l)
– get(i, v) → o
– refresh(i, v, l)
– No primitives for deletion or overwriting
  • Leases are used instead and can be renewed when necessary
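
The interface above can be illustrated with a minimal in-memory sketch. This is not Glacier's implementation: the class name `LeaseStore`, the string identifiers, and the choice to treat the lease l as an absolute expiry time are all assumptions made for illustration.

```python
import time
from typing import Dict, Optional, Tuple


class LeaseStore:
    """Toy stand-in for the put/get/refresh interface (hypothetical, in-memory)."""

    def __init__(self) -> None:
        # (identifier, version) -> (object bytes, lease expiry time)
        self._items: Dict[Tuple[str, int], Tuple[bytes, float]] = {}

    def put(self, i: str, v: int, o: bytes, l: float) -> None:
        """Store object o under identifier i, version v, with lease expiry l."""
        if (i, v) in self._items:
            # No overwriting: a change is expressed as a new version.
            raise KeyError("version already exists; insert a new version instead")
        self._items[(i, v)] = (o, l)

    def get(self, i: str, v: int, now: Optional[float] = None) -> Optional[bytes]:
        """Return the object, or None if it is unknown or its lease has expired."""
        now = time.time() if now is None else now
        entry = self._items.get((i, v))
        if entry is None or entry[1] < now:
            return None
        return entry[0]

    def refresh(self, i: str, v: int, l: float) -> None:
        """Extend the lease of an existing version; a lease never moves earlier."""
        o, old = self._items[(i, v)]
        self._items[(i, v)] = (o, max(old, l))
```

There is deliberately no delete method: in this sketch an object disappears only when its lease is allowed to lapse.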

Glacier

Fragments and manifests
– Erasure code – reduces storage overhead
  • An object O of size |O| is stored as n fragments F1, F2, ..., Fn of size |O|/r, any r of which are sufficient to restore the object
  • Object key k; fragment Fi has key (k, i, v)
– Object authenticator and manifest
  • AO = (H(O), H(F1), H(F2), …, H(Fn), v, l)
  • Corrupted fragments can be detected and removed

Key ownership
– Keys are assigned by consistent hashing over the set of nodes that are either online or were online within a period Tmax
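
As a rough sketch of how the authenticator AO lets a node discard corrupted fragments, the following builds a manifest from the hashes listed above and filters received fragments against it. The erasure coder itself is not shown: fragments are treated as opaque byte strings. SHA-256 as the hash H and the names `Manifest`, `make_manifest`, and `valid_fragments` are assumptions, not taken from the slides.

```python
import hashlib
from typing import Dict, List, NamedTuple, Tuple


class Manifest(NamedTuple):
    """A_O = (H(O), H(F1), ..., H(Fn), v, l)."""
    object_hash: bytes
    fragment_hashes: Tuple[bytes, ...]
    version: int
    lease: float


def H(data: bytes) -> bytes:
    # Assumption: SHA-256 stands in for the hash function H.
    return hashlib.sha256(data).digest()


def make_manifest(obj: bytes, fragments: List[bytes],
                  version: int, lease: float) -> Manifest:
    """Build the authenticator from the object and its n fragments."""
    return Manifest(H(obj), tuple(H(f) for f in fragments), version, lease)


def valid_fragments(manifest: Manifest,
                    received: Dict[int, bytes]) -> Dict[int, bytes]:
    """Keep only fragments whose index (1-based) and hash match the manifest."""
    n = len(manifest.fragment_hashes)
    return {
        i: f for i, f in received.items()
        if 1 <= i <= n and H(f) == manifest.fragment_hashes[i - 1]
    }
```

Once r valid fragments remain, the object could be decoded and checked once more against `object_hash`.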

Glacier

Fragment placement
– Fragments of the same object are placed on different, randomly chosen nodes
– Fragments of objects with similar keys should be grouped together
– The placement function should be stable
– P(k, i, v) = k + i/(n+1) + H(v)
  • Primary replica – position k
  • n fragments – n+1 equidistant points in the circular space
  • H(v) – prevents load imbalance
– When inserting a new object (k, v), if the owner of P(k, i, v) is offline, the fragment is discarded and restored later
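
The placement function can be tried out numerically. The sketch below models the circular key space as the interval [0, 1) with wrap-around, which is an assumption made for readability (a real overlay would use large integer identifiers). The helper `hash_to_unit` and the use of SHA-256 for H are likewise hypothetical.

```python
import hashlib


def hash_to_unit(v: int) -> float:
    """Map a version number to a pseudo-random point in [0, 1) (assumed H)."""
    digest = hashlib.sha256(str(v).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def placement(k: float, i: int, v: int, n: int) -> float:
    """P(k, i, v) = k + i/(n+1) + H(v), taken modulo 1 on the circular space."""
    return (k + i / (n + 1) + hash_to_unit(v)) % 1.0


# Example: positions of the n = 48 fragments of version 7 of the object at key 0.25.
positions = [placement(0.25, i, 7, 48) for i in range(1, 49)]
```

Consecutive fragments of one version end up 1/(n+1) apart, and the function depends only on (k, i, v, n), so the same inputs always map to the same position regardless of which nodes are currently online.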

Glacier

Fragment maintenance
– Missed fragment insertions, changes in key-space ownership, and failures may cause fragments to be lost
– A simple protocol
  • The node compiles a list of all keys (k, v) in its local fragment store and sends the list to some of its peers
  • Each peer replies with a list of manifests for the missing objects
  • The node requests r fragments from its peers, validates them, and computes the fragment to store locally
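
The first two steps of this exchange amount to a set difference, sketched below from the peer's point of view. Manifests are represented as opaque strings, and the function name `missing_manifests` is hypothetical. The sketch also ignores which key ranges the requesting node is actually responsible for, which a real protocol would have to take into account.

```python
from typing import Dict, Set, Tuple

Key = Tuple[str, int]  # (k, v): object key and version


def missing_manifests(node_keys: Set[Key],
                      peer_manifests: Dict[Key, str]) -> Dict[Key, str]:
    """Peer side: return manifests for objects the requesting node did not list."""
    return {key: m for key, m in peer_manifests.items() if key not in node_keys}


# The node advertises what it holds; the peer answers with what is missing.
node_keys = {("a", 1), ("b", 1)}
peer_manifests = {("a", 1): "manifest-a1", ("b", 2): "manifest-b2"}
reply = missing_manifests(node_keys, peer_manifests)
```

For each manifest in the reply, the node would then fetch enough fragments to restore the object and recompute its own fragment.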

Glacier

Recovery
– Glacier does not need to detect failures explicitly
  • Compromised nodes either
    – Fail permanently, and other nodes take over their key segments, or
    – Are repaired and rejoin the system with an empty fragment store
– Limits the number of simultaneous fragment reconstructions to a fixed number to avoid congestive collapse

Garbage collection
– Happens when a lease expires
– Can be carried out independently by each storage node
– A grace period TG allows for the maximal clock difference
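
The local garbage-collection decision can be written as a one-line check. This is a sketch under the assumption that leases are stored as absolute expiry times and that TG is expressed in the same unit; the function name `can_collect` is hypothetical.

```python
def can_collect(lease_expiry: float, local_now: float, grace: float) -> bool:
    """True once the lease has expired by more than the grace period T_G.

    The grace period covers the largest expected difference between clocks,
    so a node with a fast clock does not delete a fragment that other nodes
    still consider live.
    """
    return local_now > lease_expiry + grace
```

Because the check uses only the node's own clock and the lease stored with the fragment, each node can run it without coordinating with its peers.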

Glacier

Configuration
– An object can be reconstructed if r out of N fragments can be obtained
– Choose N and r so that the survival probability P meets the desired durability
– Still offers protection even when fmax is chosen too low
– Lease time must be larger than the maximal duration of a large-scale failure, on the order of months
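
One way to explore the choice of N and r is to compute the probability that fewer than r fragments survive. The sketch below assumes each fragment is lost independently with probability f, which is a modeling assumption made here; the function names are hypothetical.

```python
from math import comb
from typing import Optional


def loss_probability(n_fragments: int, r: int, f: float) -> float:
    """P(fewer than r of n_fragments survive), each lost independently w.p. f."""
    survive = 1.0 - f
    return sum(
        comb(n_fragments, k) * survive**k * f**(n_fragments - k)
        for k in range(r)
    )


def smallest_n(r: int, f: float, p_min: float, n_max: int = 1000) -> Optional[int]:
    """Smallest fragment count N whose loss probability is at most 1 - p_min."""
    for n in range(r, n_max + 1):
        if loss_probability(n, r, f) <= 1.0 - p_min:
            return n
    return None  # no N up to n_max meets the target


# Example call with r = 5, a 60% failure fraction, and a six-nines target.
n_needed = smallest_n(r=5, f=0.6, p_min=0.999999)
```

Under this independence assumption, raising N for a fixed r increases durability at the price of a storage overhead of roughly N/r.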

Glacier – Object Aggregation

Massive redundancy – a substantially larger number of internal objects than application objects

Aggregation of small application objects to reduce the cost of fragment creation and maintenance – tuples (oi, ki, vi)

Aggregation is performed on a per-user basis
– Simple, but loses the opportunity to bundle objects from different users

Local aggregate directory
– Aggregates are linked, forming a directed acyclic graph
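
The effect of aggregation on the number of keys can be seen with a simple greedy packer. This is only a sketch: the size threshold `max_bytes`, the greedy strategy, and the function name `aggregate` are assumptions, and the links between aggregates (the DAG) are not modeled.

```python
from typing import List, Tuple

AppObject = Tuple[bytes, str, int]  # (o_i, k_i, v_i): payload, key, version


def aggregate(objects: List[AppObject], max_bytes: int) -> List[List[AppObject]]:
    """Greedily pack one user's objects into aggregates of at most max_bytes.

    An object larger than max_bytes ends up alone in its own aggregate.
    """
    aggregates: List[List[AppObject]] = []
    current: List[AppObject] = []
    size = 0
    for obj in objects:
        payload = obj[0]
        if current and size + len(payload) > max_bytes:
            aggregates.append(current)
            current, size = [], 0
        current.append(obj)
        size += len(payload)
    if current:
        aggregates.append(current)
    return aggregates
```

Each resulting aggregate would then be erasure-coded and maintained as a single internal object, so many small application objects share one set of fragments.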

Glacier – Object Aggregation

Recovery
– Both the primary store data and the aggregate directory could be lost after a correlated failure
– Recover the aggregate directory by walking the DAG

Consolidation
– Periodically check the aggregate directory for aggregates whose leases will expire soon
  • Do not renew the lease if many objects have expired leases
– Non-expired objects are consolidated with new objects to generate a new aggregate
– Particularly effective when object lifetimes are bimodal
  • Consolidated aggregates contain mostly long-lived objects

Glacier - Security

Potential attacks against either the durability or the integrity of data stored in Glacier
– Attacks on integrity
– Attacks on durability
– Attacks on the time source
– Space-filling attacks
– Attacks on Glacier itself
– Haystack-needle attacks

Evaluation

Glacier prototype
– Built on top of the FreePastry implementation of the Pastry structured overlay
– Uses PAST as its primary store

Two sets of experiments
– ePOST
  • A cooperative, serverless email system for a small group of users
  • Glacier used as the storage layer
– Trace-driven simulations
  • A much larger workload with 147 users and up to 1,000 nodes

Evaluation

ePOST experiments
– 20 to 30 nodes, mostly desktop PCs running Linux
– 8 passive users and 9 active users
– Uses Glacier to store email and the corresponding metadata
  • N=48, r=5, fmax=60%, pmin=0.999999
  • Experiment too small to guarantee uncorrelated fragment losses
– Glacier was able to handle all the failures that occurred during the development and testing of ePOST

Evaluation

ePOST workload
– Cumulative size of inserted objects over time
  • Live – objects that have not yet expired
– Histogram of object sizes
  • Bimodal
    – Large number of objects between 1 and 10 KB
      » Justifies aggregation
    – A small number of large objects
      » Usually attachments

Evaluation

ePOST storage
– Amount of storage required by Glacier for the workload
  • Grows slowly as new emails enter the system
  • XML data structure creates an additional 32% overhead
– On-disk data structures vs. actual email payload
  • Storage overhead close to 9.6 * 1.32
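
The product quoted above can be worked through with the parameters from the ePOST deployment, reading 9.6 as the erasure-coding expansion N/r = 48/5. That reading is an inference from the configuration values rather than something the slide states. The 1.32 factor is the additional 32% from the XML data structure.

```python
def storage_overhead(n_fragments: int, r: int, format_overhead: float) -> float:
    """Bytes stored per byte of payload: erasure-code expansion times format cost."""
    return (n_fragments / r) * (1.0 + format_overhead)


# N = 48 fragments, any r = 5 suffice, plus 32% serialization overhead.
total = storage_overhead(48, 5, 0.32)
print(f"{total:.3f}")
```

With these numbers each byte of email payload costs a little under 13 bytes of disk space across the system.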

Evaluation

ePOST traffic
– Five categories
  • Insertion, refresh, maintenance, handoff, lookup
– In periods with few failures, traffic is dominated by insertions and refreshes
– During unstable periods, handoff and maintenance traffic increases

Evaluation

ePOST aggregation
– Compares the number of objects with the number of aggregates in the system
  • Aggregation reduces the number of keys by over one order of magnitude
  • The low number of expired objects indicates that aggregate consolidation is effective

Evaluation

Simulation Study – Diurnal Behavior
– Glacier and the aggregation layer are implemented
– Trace from a department email server
– Diurnal behavior affects message overhead
  • Higher churn
    – Less insertion traffic
    – More maintenance messages for recovering lost fragments

Evaluation

Simulation Study – Load
– Examines how load influences the message overhead
  • Under light load, message overhead remains about constant
    – Aggregation is performed periodically by every node
  • Under higher load, overhead increases about linearly

Evaluation

Simulation Study – Scalability
– Increase the overlay size and study the per-node traffic overhead
  • Remains approximately constant
  • Grows slowly since the messages are routed using Pastry

Conclusions

Ensures durability of unrecoverable data in a cooperative, decentralized storage system

Robust to large-scale, correlated, Byzantine failures of storage nodes

No introspection

Massive redundancy to mask the effects of correlated failures

Erasure codes and garbage collection to reduce storage cost

Aggregation and fragment maintenance protocol to reduce message costs