cs 443 advanced os fabián e. bustamante, spring 2005 glacier: highly durable, decentralized storage...

CS 443 Advanced OS

Fabián E. Bustamante, Spring 2005

Glacier: Highly Durable, Decentralized Storage Despite Massive Correlated Failures

Andreas Haeberlen, Alan Mislove, and Peter Druschel

Presenter: Yi Qiao

2

Outline

Introduction

Related Work

Assumptions and Intended Environment

Glacier

Object Aggregation

Security

Evaluation

Conclusion

3

Introduction

How to achieve high availability in decentralized storage systems?

Replication

Problems

– Failures are not independent

– Worms make the problem worse

– Losing some data can have catastrophic effects

Glacier

– A distributed storage system that is robust to large-scale correlated failures

• Highly durable, decentralized storage

– Trades storage efficiency for durability

– Makes minimal assumptions about the nature and correlation of failures

– Aggregates small objects and uses a fragment maintenance protocol to reduce the message overhead

4

Related Work

OceanStore and Phoenix

– Apply introspection to defend against correlated failures

• Difficult to capture all correlations

• Introspection itself can make the system vulnerable to attacks

– Glacier relies on minimal assumptions about the nature of failures, at the cost of a larger storage overhead

TotalRecall

– Optimizes availability under churn, but gives no worst-case guarantees

PAST, Farsite

– Replication against data loss

Weatherspoon et al.

– Erasure codes can achieve better MTTF than plain replication

5

Assumptions and Intended Environment

Intended for an environment consisting of desktop computers within an organizational intranet

– Some fraction of the nodes can be home desktops connected via DSL or wireless LAN

– Modest amount of churn and good network connectivity

– Used in combination with a conventional decentralized replicated storage layer

Node lifetime: hundreds of days; session time: hours to days

Three operating modes

– Normal operation

– Large-scale failure: up to a fraction fmax of the nodes fail

• Data stored on non-faulty nodes is protected

– Recovery mode

• Reconstitutes aggregates and restores missing fragments

6

Glacier

Participating storage nodes form an overlay network

– The set of keys forms a circular space

– Each node stores objects with keys in its own key segment

– Uses the underlying DHT layer for secure routing and communication

Operates along with a primary store with full replicas

Aggregation of small objects

Erasure coding of aggregates

Fragment placement at randomly selected nodes

7

Glacier

Durability guarantee

– As long as the failed fraction f ≤ fmax, each object survives with probability p ≥ pmin

Application Interface

– put(i,v,o,l)

– get(i,v) → o

– refresh(i,v,l)

– No primitives for deletion or overwriting

• Leases are used instead, and can be renewed when necessary
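The lease-based interface above can be illustrated with a small in-memory sketch. It is not Glacier's implementation: the class name `LeaseStore` is hypothetical, the reading of the parameters (i as identifier, v as version, o as object, l as lease) is inferred from their names, and l is assumed here to be an absolute expiry time.

```python
import time
from dataclasses import dataclass


@dataclass
class StoredObject:
    data: bytes
    lease_expiry: float  # assumed: absolute time at which the lease ends


class LeaseStore:
    """Toy in-memory stand-in for put/get/refresh (hypothetical, single node)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._objects = {}  # (i, v) -> StoredObject

    def put(self, i, v, o, l):
        """Store object o under identifier i and version v with lease l."""
        if (i, v) in self._objects:
            # There is no overwrite primitive: a change needs a new version.
            raise KeyError("version already stored; insert a new version")
        self._objects[(i, v)] = StoredObject(o, l)

    def get(self, i, v):
        """Return the object, or None if it is unknown or its lease expired."""
        entry = self._objects.get((i, v))
        if entry is None or entry.lease_expiry <= self._clock():
            return None
        return entry.data

    def refresh(self, i, v, l):
        """Extend the lease of an existing object; never shorten it."""
        entry = self._objects[(i, v)]
        entry.lease_expiry = max(entry.lease_expiry, l)
```

Because there is no delete call, an object simply becomes unavailable once its lease runs out, unless `refresh` is called with a later expiry time.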

8

Glacier

Fragments and manifests

– Erasure code: reduces storage overhead

• Object O of size |O| is stored as n fragments F1, F2, …, Fn of size |O|/r, any r of which are sufficient to restore the object

• Object key k, fragment Fi key (k,i,v)

– Object authenticator and manifest

• AO = (H(O), H(F1), H(F2), …, H(Fn), v, l)

• Corrupted fragments can be detected and removed
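As a rough illustration of the manifest idea, the sketch below builds an authenticator from fragment hashes and uses it to reject a corrupted fragment. SHA-256 stands in for H, which the slides do not specify. The fragments are placeholder byte strings rather than the output of a real erasure code. All function names are hypothetical.

```python
import hashlib


def sha(data: bytes) -> str:
    """Assumed stand-in for the hash function H."""
    return hashlib.sha256(data).hexdigest()


def fragment_key(k, i, v):
    """Fragment Fi of the object with key k and version v is keyed (k, i, v)."""
    return (k, i, v)


def make_manifest(obj: bytes, fragments: list, v: int, lease: float) -> dict:
    """Build AO = (H(O), H(F1), ..., H(Fn), v, l) as a dictionary."""
    return {
        "object_hash": sha(obj),
        "fragment_hashes": [sha(f) for f in fragments],
        "version": v,
        "lease": lease,
    }


def is_valid_fragment(manifest: dict, i: int, fragment: bytes) -> bool:
    """Check fragment Fi (1-based, as on the slide) against the manifest."""
    return manifest["fragment_hashes"][i - 1] == sha(fragment)


def storage_overhead(n: int, r: int) -> float:
    """n fragments of size |O|/r each take n/r times the object size."""
    return n / r
```

With n = 48 and r = 5, `storage_overhead` gives 9.6, so durability is bought with roughly ten times the raw object size.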

Key ownership

– Keys are assigned by consistent hashing over the set of nodes that are either on-line or were on-line within a period Tmax

9

Glacier

Fragment placement

– Fragments of the same object are placed on different, randomly chosen nodes

– Fragments of objects with similar keys should be grouped together

– The placement function should be stable

– P(k,i,v) = k + i/(n+1) + H(v)

• Primary replica: position k

• n fragments: placed at the remaining n of the n+1 equidistant points in the circular space

• H(v): prevents load imbalance

– When inserting a new object (k,v), if the owner of P(k,i,v) is offline, the fragment is discarded and restored later
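The placement function can be sketched in a few lines. This is an illustration under stated assumptions: the circular key space is normalized to [0, 1) so that addition wraps modulo 1, and H is replaced by a SHA-256-based mapping into that interval. Neither detail comes from the slides.

```python
import hashlib


def hash_to_unit(value) -> float:
    """Map a value into [0, 1); an assumed stand-in for H(v)."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def placement(k: float, i: int, v: int, n: int) -> float:
    """P(k, i, v) = k + i/(n+1) + H(v), wrapped onto the unit circle."""
    return (k + i / (n + 1) + hash_to_unit(v)) % 1.0


# Positions of the n = 48 fragments of one object version.
positions = [placement(0.25, i, 1, 48) for i in range(1, 49)]
```

The function depends only on k, i, v, and n, never on which nodes are currently online, so it is stable. Objects with nearby keys map to nearby positions for the same i and v, which keeps their fragments on the same nodes.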

10

Glacier

Fragment maintenance

– Missed fragment insertions, changes in key space ownership, and failures may cause fragments to be lost

– A simple protocol

• The node compiles a list of all keys (k,v) in its local fragment store and sends the list to some of its peers

• Each peer replies with a list of manifests for the missing objects

• The node requests r fragments from its peers, validates them against the manifest, and computes the fragment to store locally
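The two comparison steps of the maintenance protocol can be sketched as follows. Network messaging and the erasure decoding that recomputes the local fragment are left out. The function names, the dictionary layout of a manifest, and the use of SHA-256 for H are assumptions made for this illustration.

```python
import hashlib


def sha(data: bytes) -> str:
    """Assumed stand-in for the hash function H."""
    return hashlib.sha256(data).hexdigest()


def missing_manifests(requester_keys: set, peer_manifests: dict) -> dict:
    """Peer side: return manifests for objects the requester did not list."""
    return {
        key: manifest
        for key, manifest in peer_manifests.items()
        if key not in requester_keys
    }


def collect_valid_fragments(manifest: dict, offered: dict, r: int) -> dict:
    """Requester side: keep fragments whose hash matches, stop once r are found.

    `offered` maps a fragment index to the bytes received from a peer.
    """
    valid = {}
    for index, fragment in offered.items():
        if manifest["fragment_hashes"].get(index) == sha(fragment):
            valid[index] = fragment
        if len(valid) == r:
            break
    return valid
```

Once r valid fragments are in hand, any erasure decoder can rebuild the object and re-encode the one fragment this node is responsible for.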

11

Glacier

Recovery

– Glacier does not need to detect failures explicitly

• Compromised nodes either

– fail permanently, and other nodes take over their key segments, or

– are repaired and rejoin the system with an empty fragment store

– The number of simultaneous fragment reconstructions is limited to a fixed number to avoid congestive collapse

Garbage Collection

– Happens when a lease expires

– Can be carried out independently by each storage node

– A grace period TG allows for the maximal clock difference between nodes
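Lease-based garbage collection with a grace period might look like the sketch below. The store layout (key mapped to lease expiry time) and the function names are hypothetical. The only idea taken from the slide is that a node waits TG past the lease expiry before deleting, so that clock skew between nodes cannot cause premature deletion.

```python
def collectable(lease_expiry: float, now: float, grace_period: float) -> bool:
    """A fragment may be deleted only after its lease plus TG has passed."""
    return now > lease_expiry + grace_period


def garbage_collect(store: dict, now: float, grace_period: float) -> list:
    """Delete expired entries from `store` (key -> lease expiry); return them."""
    expired = [
        key for key, lease in store.items()
        if collectable(lease, now, grace_period)
    ]
    for key in expired:
        del store[key]
    return expired
```

Each node can run this on its own local clock, which is why no coordination between storage nodes is needed.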

12

Glacier

Configuration

– An object can be reconstructed if r out of N fragments can be obtained

– Choose N and r so that the survival probability P meets the desired durability

– Still offers protection even when fmax is chosen too low

– Lease time must be longer than the maximal duration of a large-scale failure, on the order of months
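One way to explore the choice of N and r is a simple binomial model. It assumes that each fragment is lost independently with probability fmax. That assumption is a simplification made for this sketch and not necessarily the exact analysis behind Glacier's configuration. An object is lost when fewer than r of its N fragments survive.

```python
from math import comb


def loss_probability(n: int, r: int, f: float) -> float:
    """P(fewer than r of n fragments survive), each lost independently w.p. f."""
    return sum(comb(n, k) * (1 - f) ** k * f ** (n - k) for k in range(r))


def smallest_n(r: int, f: float, p_min: float, n_max: int = 1000):
    """Smallest n whose survival probability reaches p_min, or None."""
    for n in range(r, n_max + 1):
        if 1.0 - loss_probability(n, r, f) >= p_min:
            return n
    return None
```

Under this model, N = 48, r = 5, and fmax = 0.6 give a loss probability just under 10^-6, which is consistent with a target of pmin = 0.999999.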

13

Glacier – Object Aggregation

Massive redundancy: a substantially larger number of internal objects than application objects

Aggregation of small application objects, stored as tuples (oi, ki, vi), reduces the cost of fragment creation and maintenance

Aggregation is performed on a per-user basis

– Simple, but loses the opportunity to bundle objects from different users

Local aggregate directory

– Aggregates link to each other, forming a directed acyclic graph

14

Glacier – Object Aggregation

Recovery

– Both the primary store data and the aggregate directory could be lost after a correlated failure

– The aggregate directory is recovered by walking the DAG
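Recovering the directory by walking the DAG amounts to a graph traversal over the links between aggregates. In the sketch below, `fetch_aggregate` is a hypothetical callback that retrieves an aggregate by key (for example, by reassembling its fragments). The `objects` and `links` field names are assumptions as well.

```python
def rebuild_directory(fetch_aggregate, root_keys) -> dict:
    """Visit every aggregate reachable from root_keys.

    Returns {aggregate_key: list of object ids stored in that aggregate}.
    """
    directory = {}
    stack = list(root_keys)
    while stack:
        key = stack.pop()
        if key in directory:
            continue  # several aggregates may link to the same one
        aggregate = fetch_aggregate(key)
        if aggregate is None:
            continue  # this aggregate could not be retrieved
        directory[key] = aggregate["objects"]
        stack.extend(aggregate["links"])
    return directory
```

The visited check matters because the structure is a DAG and not a tree: the same aggregate can be reached along several paths.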

Consolidation

– Periodically check the aggregate directory for aggregates whose leases will expire soon

• The lease is not renewed if many of the objects have expired leases

– Non-expired objects are consolidated with new objects to generate a new aggregate

– Particularly effective when object lifetimes are bimodal

• The consolidated aggregate contains mostly long-lived objects
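The consolidation decision can be sketched as a pass over the aggregate directory. The directory layout, the `horizon` parameter that defines "soon", and the 50% threshold for "many" expired objects are all assumptions chosen for illustration. The slides give no concrete values.

```python
def plan_consolidation(directory: dict, now: float, horizon: float,
                       max_expired_fraction: float = 0.5):
    """Decide which aggregates to renew and which live objects to re-bundle.

    directory: {agg_key: {"lease": t, "objects": {object_id: object_lease}}}
    Returns (aggregates_to_renew, live_objects_to_carry_over).
    """
    renew, carry_over = [], []
    for key, agg in directory.items():
        if agg["lease"] - now > horizon:
            continue  # lease is not close to expiring yet
        objects = agg["objects"]
        if not objects:
            continue  # nothing worth keeping in an empty aggregate
        live = [oid for oid, lease in objects.items() if lease > now]
        expired_fraction = 1 - len(live) / len(objects)
        if expired_fraction > max_expired_fraction:
            carry_over.extend(live)  # let the lease lapse, keep live objects
        else:
            renew.append(key)
    return renew, carry_over
```

The carried-over objects would then be packed with newly inserted objects into a fresh aggregate, which is how long-lived objects gradually end up grouped together.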

15

Glacier – Security

Potential attacks against either the durability or the integrity of data stored in Glacier

– Attacks on integrity

– Attacks on durability

– Attacks on the time source

– Space-filling attacks

– Attacks on Glacier itself

– Haystack-needle attacks

16

Evaluation

Glacier prototype

– Built on top of the FreePastry implementation of the Pastry structured overlay

– Uses PAST as its primary store

Two sets of experiments

– ePost

• A cooperative, server-less email system for a small group of users

• Glacier used as the storage layer

– Trace-driven simulations

• A much larger workload with 147 users and up to 1,000 nodes

17

Evaluation

ePost experiments

– 20 to 30 nodes, mostly desktop PCs running Linux

– 8 passive users and 9 active users

– Uses Glacier to store email and the corresponding metadata

• N=48, r=5, fmax=60%, pmin=0.999999

• Experiment too small to guarantee uncorrelated fragment losses

– Glacier was able to handle all the failures that occurred during the development and testing of ePost

18

Evaluation

ePost Workload

– Cumulative size of inserted objects over time

• Live: objects that have not expired yet

– Histogram of object sizes

• Bimodal

– Large number of objects between 1 and 10 KB

» Justifies aggregation

– A low number of large objects

» Usually attachments

19

Evaluation

ePost storage

– Amount of storage required by Glacier for the workload

• Grows slowly as new emails enter the system

• The XML data structure creates an additional 32% overhead

– On-disk data structures vs. actual email payload

• Storage overhead close to 9.6 × 1.32 ≈ 12.7, where 9.6 = N/r = 48/5

20

Evaluation

ePost traffic

– Five categories

• Insertion, refresh, maintenance, handoff, lookup

– In periods with few failures, traffic is dominated by insertions and refreshes

– During unstable periods, handoff and maintenance traffic increases

21

Evaluation

ePost aggregation

– Compares the number of objects with the number of aggregates in the system

• Aggregation reduces the number of keys by over one order of magnitude

• Low number of expired objects: aggregate consolidation is effective

22

Evaluation

Simulation Study – Diurnal Behavior

– Glacier and the aggregation layer are implemented

– Trace from a department email server

– Diurnal behavior affects message overhead

• Higher churn

– Less insertion traffic

– More maintenance messages to recover lost fragments

23

Evaluation

Simulation Study – Load

– Shows how load influences the message overhead

• Under light load, message overhead remains about constant

– Aggregation is performed periodically by every node

• Under higher load, overhead increases about linearly

24

Evaluation

Simulation Study – Scalability

– Increase the overlay size and study the per-node traffic overhead

• Remains approximately constant

• Grows slowly, since the messages are routed using Pastry

25

Conclusions

Ensures durability of unrecoverable data in a cooperative, decentralized storage system

Robust to large-scale, correlated, Byzantine failures of storage nodes

No introspection

Massive redundancy to mask the effects of correlated failures

Erasure codes and garbage collection to reduce storage cost

Aggregation and fragment maintenance protocol to reduce message costs
