oceanstore an architecture for global-scale persistent storage

54
OceanStore An Architecture for Global-scale Persistent Storage By John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao http://oceanstore.cs.berkeley.edu Presented by Yongbo Wang, Hailing Yu

Upload: alec

Post on 04-Jan-2016

25 views

Category:

Documents


0 download

DESCRIPTION

OceanStore An Architecture for Global-scale Persistent Storage. By John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao http://oceanstore.cs.berkeley.edu. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: OceanStore An Architecture for Global-scale Persistent Storage

OceanStoreAn Architecture for Global-scale Persistent Storage

By John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao

http://oceanstore.cs.berkeley.edu

Presented by Yongbo Wang, Hailing Yu

Page 2: OceanStore An Architecture for Global-scale Persistent Storage

Ubiquitous Computing

Page 3: OceanStore An Architecture for Global-scale Persistent Storage

OceanStore Overview

A global-scale utility infrastructure Internet-based, distributed storage

system for information appliances such as computers, PDAs, cellular phones,…

It is designed to support 1010 users, each having 104 data files (Support over 1014 files)

Page 4: OceanStore An Architecture for Global-scale Persistent Storage

OceanStore Overview (cont) Automatically recovers from server and

network failures Utilizes redundancy and client-side

cryptographic techniques to protect data Allows replicas of a data object to exist

anywhere, at any time Incorporates new resources Adjusts to usage patterns

Page 5: OceanStore An Architecture for Global-scale Persistent Storage
Page 6: OceanStore An Architecture for Global-scale Persistent Storage

OceanStore Two Unique design goals:

Ability to be constructed from an untrusted infrastructure

Servers may crash Information can be stolen

Support of nomadic data Data can be cached anywhere, anytime

(promiscuous caching) Data is separated from its physical

location

Page 7: OceanStore An Architecture for Global-scale Persistent Storage

Underlying Technologies

Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Page 8: OceanStore An Architecture for Global-scale Persistent Storage

Naming Objects are identified by a globally

unique identifier (GUID) Different objects in OceanStore

use different mechanism to generate their GUID

Page 9: OceanStore An Architecture for Global-scale Persistent Storage

Underlying Technologies

Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Page 10: OceanStore An Architecture for Global-scale Persistent Storage

Access Control Reader Restriction

Encrypt the data that is not public Distribute the encryption key to users

having read permission Writer Restriction

The owner of an object can decide an access control list (ACL) for the object

All writes are verified by well-behaved servers and clients based on the ACL.

Page 11: OceanStore An Architecture for Global-scale Persistent Storage

Underlying Technologies

Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Page 12: OceanStore An Architecture for Global-scale Persistent Storage

Data Location and Routing Provides necessary service to

route messages to their destinations and to locate objects in the system

Works on top of IP

Page 13: OceanStore An Architecture for Global-scale Persistent Storage

Data Location and Routing Each object in the system is

identified by a globally unique identifier ,GUID (a pseudo-random fixed length bit string) An object GUID is a secure hash

function over the object’s contents OceanStore uses 160-bit SHA-1 hash

for which the probability that two out of 1014 objects hash to the same value is approximately 1 in 1020.

Page 14: OceanStore An Architecture for Global-scale Persistent Storage

Data Location and Routing In OceanStore system, entities that

are accessed frequently are likely to reside close to where they are being used

Two-tiered approach: First use a fast probabilistic algorithm If necessary, use a slower but reliable

hierarchical algorithm

Page 15: OceanStore An Architecture for Global-scale Persistent Storage

Probabilistic algorithm Each server has a set of neighbors,

chosen from servers closest to it in network latency

A server associates with each neighbor a probability of finding each object in the system through that neighbor

This association is maintained in constant space using an attenuated Bloom filter

Page 16: OceanStore An Architecture for Global-scale Persistent Storage

Bloom Filters An efficient, lossy way of describing sets A Bloom filter is a bit-vector of length w

with a family of hash functions Each hash function maps the elements of

the represented set to an integer in [0,w) To form a representation of a set, each

set element is hashed and the bits in the vector corresponding to has functions’ results are set

Page 17: OceanStore An Architecture for Global-scale Persistent Storage

Bloom Filters To check if an element is in the set

Element is hashed Corresponding bits in the filter are checked

- If any of the bits are not set, it is not in the set- If all bits are set, it may be in the set

The element may not be in the set even if all of the hashed bits are set (false positive)

False positive rate of a Bloom filter is a linear function of its width, number of hash functions and cardinality of the represented set

Page 18: OceanStore An Architecture for Global-scale Persistent Storage

A Bloom Filter: To check an object’s name against a Bloom filter summary, the name is hashed with n different hash functions (here, n=3) and bits corresponding to the result are checked

Page 19: OceanStore An Architecture for Global-scale Persistent Storage

Attenuated Bloom Filters An attenuated Bloom filter of depth d is

an array of d normal bloom filters For each neighbor link, an attenuated

Bloom filter is kept The k th bloom filter in the array is the

merger of all Bloom filters for all of the nodes k hops away through any path starting with that neighbor link

Page 20: OceanStore An Architecture for Global-scale Persistent Storage

Attenuated Bloom Filter for the outgoing link AB

In FAB,the document “Uncle John’s Band” would map to potential value 1/4+1/8=3/8.

Page 21: OceanStore An Architecture for Global-scale Persistent Storage

The Query Algorithm

The query node examines the 1st level of each of its neighbors’ filters If matches are found, the query is

forwarded to closest neighbor If no filter matches, the querying

node examines the next level of each filter at each step and forwards the query if a matching node founds

Page 22: OceanStore An Architecture for Global-scale Persistent Storage

The probabilistic query process: n1 is looking for object X, which is hashed to bits 0,1, and 3.

Page 23: OceanStore An Architecture for Global-scale Persistent Storage

Probabilistic location and routing A filter of depth d stores information

about servers d hops from the server If a query reaches a server d hops

away from its source due to a false positive, it is not forwarded further

In this case, the probabilistic algorithm gives up and forwards the query to deterministic algorithm

Page 24: OceanStore An Architecture for Global-scale Persistent Storage

Deterministic location and routing

Tapestry: OceanStore’s self-organizing routing and object location subsystem IP overlay network with a distributed,

fault tolerant architecture A query is routed from node to node

until the location of a replica is discovered

Page 25: OceanStore An Architecture for Global-scale Persistent Storage

Tapestry A hierarchical distributed data

structure Every server is assigned a random

and unique node-ID The node-ID ’s are then used to

construct a mesh of neighbor links

Page 26: OceanStore An Architecture for Global-scale Persistent Storage

Tapestry Every node is connected to other

nodes via neighbor links of various levels Level-1 edges connect to a set of nodes

closest in network latency with different values in the lowest digit of their node-ID’s

Level-2 edges connect to the closest nodes that match in the lowest digit and different second digits, etc.

Page 27: OceanStore An Architecture for Global-scale Persistent Storage

Tapestry Each node has a neighbor map

with multiple levels for example, the 9th entry of the 4th

level for node 325AE is the node closest to 325AE which ends in 95AE

Messages are routed to the destination ID digit by digit ***8=>**98=>*598=>4598

Page 28: OceanStore An Architecture for Global-scale Persistent Storage

Neighbor Map for Tapestry node 0642

Page 29: OceanStore An Architecture for Global-scale Persistent Storage

Tapestry routing example: A potential path for a message originating at node 0325 destined for node 4598

Page 30: OceanStore An Architecture for Global-scale Persistent Storage

Tapestry Each object is associated with a

location root through a deterministic mapping function

To advertise an object o, the server s storing the object sends a publish message toward the object’s root, leaving location pointers at each hop

Page 31: OceanStore An Architecture for Global-scale Persistent Storage

Tapestry routing example: To publish an object, the server storing the object sends a publish message toward the object’s root (e.g. node 4598), leaving location pointers at each node

Page 32: OceanStore An Architecture for Global-scale Persistent Storage

Locating an object To locate an object, a client sends a

message toward the object’s root. When the message encounters a pointer, it routes directly to the object

It is proved that Tapestry can route the request to the asymptotically optimal node (in terms of the shortest path network distance) containing a replica

Page 33: OceanStore An Architecture for Global-scale Persistent Storage

Tapestry routing example: To locate an object, node 0325 sends a message toward the object’s root (e.g. node 4598)

Page 34: OceanStore An Architecture for Global-scale Persistent Storage

Data Location and Routing Fault tolerance:

Tapestry uses redundant neighbor pointers when it detects a primary route failure

Uses periodic UDP probes to check link conditions

Tapestry deterministically chooses multiple root nodes for each object

Page 35: OceanStore An Architecture for Global-scale Persistent Storage

Data Location and Routing Automatic repair:

Node insertions: A new node needs the address of at least one

existing node It then starts advertising its services and the roles it

can assume to the system through the existing node

Exiting nodes: If possible, the exiting node runs a shutdown script

to inform the system In any case, neighbors will detect its absence and

update routing tables accordingly

Page 36: OceanStore An Architecture for Global-scale Persistent Storage

Underlying Technologies

Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Page 37: OceanStore An Architecture for Global-scale Persistent Storage

Updates Updates are made by clients and all

updates are logged OceanStore allows concurrent updates Serializing updates:

Since the infrastructure is untrusted, using a master replica will not work

Instead, a group of server’s called inner ring is responsible for choosing final commit order

Page 38: OceanStore An Architecture for Global-scale Persistent Storage

Update commitment

Inner ring is a group of servers working on behalf of an object.

It consists of a small number of highly-connected servers.

Each object has an inner ring which can be located through Tapestry

Page 39: OceanStore An Architecture for Global-scale Persistent Storage

Inner ring An object’s inner ring,

Generates new versions of an object from client updates

Generates encoded, archival fragments and distributes them

Provides mapping from active GUID to the GUID of most recent version of the object

Verifies a data object’s legitimate writers Maintains an update history providing an

undo mechanism

Page 40: OceanStore An Architecture for Global-scale Persistent Storage

Update commitment

Each inner ring makes its decisions through a Byzantine agreement protocol

Byzantine agreement lets a group of 3n+1 servers reach a agreement whenever no more than n of them are faulty

Page 41: OceanStore An Architecture for Global-scale Persistent Storage

Update commitment

Other nodes containing the data of that object are called secondary nodes

They do not participate in serialization protocol

They are organized into one or more multicast trees (dissemination trees)

Page 42: OceanStore An Architecture for Global-scale Persistent Storage

Path of an update:

a) After generating an update, a client sends it directly to the object’s inner ring

b) While inner ring performs a Byzantine agreement to commit the update, secondary nodes propagate the update among themselves

c) The result of update is multicast down the dissemination tree to all secondary nodes

Page 43: OceanStore An Architecture for Global-scale Persistent Storage

Cost of an update in bytes sent across the network, normalized to

minimum cost needed to send the update to each of the replicas

Page 44: OceanStore An Architecture for Global-scale Persistent Storage

Update commitment

Fault tolerance: Guarantees fault tolerance if less than

one third of the servers in the inner ring is malicious

Secondary nodes do not participate in the Byzantine protocol, but receive consistency information

Page 45: OceanStore An Architecture for Global-scale Persistent Storage

Update commitment

Automatic repair: Servers of the inner ring can be

changed without affecting the rest of the system

Servers participating in the inner ring are altered continuously to maintain the Byzantine assumption

Page 46: OceanStore An Architecture for Global-scale Persistent Storage

Underlying Technologies

Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Page 47: OceanStore An Architecture for Global-scale Persistent Storage

Deep Archival Storage

Each object is treated as a series of m fragments and then transformed into n fragments, where n>m

That uses Reed-Solomon encoding. Any m of the n coded fragments are

sufficient to construct the original data Rate of encoding: r=m/n Storage overhead=1/r=n/m

Page 48: OceanStore An Architecture for Global-scale Persistent Storage

Underlying Technologies

Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Page 49: OceanStore An Architecture for Global-scale Persistent Storage

Introspection

It is impossible to manually administer millions of servers and objects

OceanStore contains introspection tools Event monitoring Event analysis Self-adaptation

Page 50: OceanStore An Architecture for Global-scale Persistent Storage

Introspection

Introspective modules on servers observe network traffic and measure local traffic.

They automatically create, replace, and remove replicas in response to object’s usage patterns

Page 51: OceanStore An Architecture for Global-scale Persistent Storage

Introspection

If a replica becomes unavailable: Clients will receive service from a more

distant replica This produces extra load on distant

replicas Introspective mechanism detects this

and new replicas are created Above actions provide fault tolerance

and automatic repair

Page 52: OceanStore An Architecture for Global-scale Persistent Storage

Event handlers summarizes local events. These summaries are stored in a database. The information in the database is periodically analyzed and necessary actions are taken. A summary is sent to other nodes.

Page 53: OceanStore An Architecture for Global-scale Persistent Storage

Conclusion OceanStore provides a global-scale,

distributed storage platform through adaptation, fault tolerance and repair

It is self-maintaining A prototype implemented in Java is

under construction at UC Berkeley. Although it is not operational yet, many components are already functioning in isolation

Page 54: OceanStore An Architecture for Global-scale Persistent Storage

The end…

Questions?