POND: THE OCEANSTORE PROTOTYPE
S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, J. Kubiatowicz
U. C. Berkeley

Key Ideas
• Versioning file system
• Location-independent routing
  – Uses hashes instead of addresses
  – Mapping is done through Tapestry
• Byzantine update commitment
  – By nodes holding primary copies (inner ring)
  – Proactive threshold signatures allow inner ring membership updates

Key Ideas
• Push-based update of other copies
  – Through an overlay multicast network
  – Copies are not permanent
• Continuous archiving in erasure-coded form
  – Very reliable
  – Very slow access

Motivation
• Find a better solution for long-term management of data
• Enabling trends:
  – Near-universal connectivity through high-bandwidth links
  – Very fast increase of disk storage capacity per unit cost

OceanStore
• Internet-scale cooperative file system
• Will provide
  – High durability
  – Universal accessibility
• Will use a two-tiered storage system
• Stores data objects

Two-tiered organization
• Upper tier
  – Powerful, well-connected hosts
  – Serialize changes and archive results
• Lower tier
  – Less powerful hosts
    • Can be user workstations
  – Provide storage resources

Two-tiered organization
[Figure: an archive, a primary replica (in inner ring), and three secondary replicas]

Basic requirements
• OceanStore should
  1. Let information be accessed from any location
  2. Balance the tension between privacy and information sharing
  3. Offer an easily understandable and usable model of data consistency
  4. Guarantee data integrity

First basic assumption
• Infrastructure cannot be trusted, except in aggregate
  – Hosts and routers can fail arbitrarily
  – Must consider
    • Passive failures: host snooping, …
    • Active failures: host injecting malicious messages, …

Second basic assumption
• Infrastructure is continuously changing
  – Performance of communication paths varies
  – Resources enter and leave the network without warning
  – System should
    • Be at least self-organizing and self-repairing
    • Aim to be self-tuning

The challenge
• Build a system that provides
  – An expressive user interface
  – High data availability
  – High data durability
  – High data privacy and integrity
  atop an untrusted and ever-changing base
More ambitious than FARSITE

The data model
• OceanStore data object
  – Similar to a traditional file
  – Ordered sequence of read-only versions
• Versioning
  – Simplifies consistency issues
  – Allows recovery of previous versions
• Identical blocks are shared among versions

Data object implementation (I)
• Each data object has an AGUID (Active Globally-Unique Identifier)
  – Secure hash of application-level name and public key of owner
• Each version has a VGUID (Version GUID)
  – BGUID of root block of a version
• Each block has a BGUID (Block GUID)
  – Secure hash of block contents
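
The three identifiers can be sketched in a few lines of Python. This is only an illustration: the choice of SHA-1, the byte layout of the root block, and the names `bguid`, `aguid`, and `owner_key` are assumptions made for the example, not details taken from the slides.

```python
import hashlib

def bguid(block: bytes) -> str:
    """Block GUID: secure hash of the block contents."""
    return hashlib.sha1(block).hexdigest()

def aguid(name: str, owner_key: bytes) -> str:
    """Active GUID: secure hash of the application-level name and the owner's key."""
    return hashlib.sha1(name.encode("utf-8") + b"\x00" + owner_key).hexdigest()

# Hypothetical two-block version: the root block simply lists the BGUIDs
# of its data blocks (a real root block would hold more metadata).
data_blocks = [b"hello ", b"world"]
root_block = "\n".join(bguid(b) for b in data_blocks).encode("ascii")
vguid = bguid(root_block)  # VGUID = BGUID of the root block
```

Because every identifier is derived from content or from a name and key, none of them says anything about where the data lives.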

A data object
[Figure labels: AGUID; VGUIDi and VGUIDi+1; M (twice); root block; indirect blocks; data blocks; COW (twice)]
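
A minimal sketch of how copy-on-write versions can share identical blocks, assuming a content-addressed store keyed by block hash. The `store` dictionary, the helper names, and the flat root block (no indirect blocks) are simplifications invented for this example.

```python
import hashlib

store = {}  # hypothetical content-addressed block store: BGUID -> block bytes

def put(block: bytes) -> str:
    """Store a block under the hash of its contents and return that hash."""
    guid = hashlib.sha1(block).hexdigest()
    store[guid] = block  # storing an identical block again changes nothing
    return guid

def make_version(blocks) -> str:
    """Store the data blocks plus a root block listing their BGUIDs; return the VGUID."""
    root = "\n".join(put(b) for b in blocks).encode("ascii")
    return put(root)

v1 = make_version([b"alpha", b"beta", b"gamma"])
v2 = make_version([b"alpha", b"BETA", b"gamma"])  # only the middle block changes
```

After both calls the store holds six blocks rather than eight: the second version adds one new data block and one new root block, and reuses the two unchanged data blocks.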

Data object implementation
• AGUID, VGUID and BGUID are location-transparent
  – OceanStore relies on a lower-level service to map GUIDs into addresses

Application-level consistency (I)
• Updating an object means creating a new version
• Updates are
  – Atomic
  – Represented as an array of potential actions, each guarded by a predicate

Application-level consistency (II)
• Actions can be
  – Appending data
  – Replacing bytes at a specific address
• Predicates can be
  – Checking the latest version number of the object
  – Verifying values of bytes at a specific address
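
The predicate-and-action model can be sketched as follows. The sketch assumes that predicates are tried in order and that only the action of the first true predicate is applied, with the update aborting if none holds; the slides do not spell out this rule, and the `Version` class and `apply_update` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class Version:
    number: int
    data: bytes

Predicate = Callable[[Version], bool]
Action = Callable[[bytes], bytes]

def apply_update(latest: Version, update: List[Tuple[Predicate, Action]]) -> Version:
    """Apply the action of the first predicate that holds; otherwise abort."""
    for predicate, action in update:
        if predicate(latest):
            # Commit: a new read-only version is created, the old one is untouched.
            return Version(latest.number + 1, action(latest.data))
    return latest  # abort: no new version is created

v1 = Version(1, b"hello")
update = [
    (lambda v: v.number == 1, lambda d: d + b" world"),      # check version, then append
    (lambda v: v.data[:1] == b"h", lambda d: b"J" + d[1:]),  # check a byte, then replace it
]
v2 = apply_update(v1, update)
```

Applying the same update to `v2` falls through to the second guard, because the version number is no longer 1.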

Application-level consistency (III)
• Predicate and action model
  – Allows multiple levels of consistency to be implemented
    • Atomic transactions satisfying ACID properties for database applications
    • Weaker consistency for mailboxes

A footnote
• ACID properties of atomic transactions mean that atomic transactions
  – Are Atomic
  – Bring the database from one consistent state to another consistent state
  – Isolate their partial results until the transaction is completed
  – Guarantee the durability of the final result

Virtualization through Tapestry
• OceanStore messages are addressed with a GUID
• Tapestry forwards these messages to a host containing a resource with that GUID
  – Fully decentralized service
• Hosts can
  – Join Tapestry by supplying their GUID
  – Publish the GUIDs of the resources they have

Replication and consistency (I)
• Each object has a single primary replica
• Primary replica
  – Serializes and applies all updates
  – Creates a certificate (heartbeat) mapping the AGUID of the object to the VGUID of its latest version
  – Controls access to the object
  – …

Replication and consistency (II)
• Heartbeat contains
  – An AGUID
  – A VGUID
  – A timestamp
  – A version sequence number
• Getting the most recent version of an object means getting its most recent heartbeat
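
A heartbeat can be modeled as a small record with the four fields listed above. The sketch below picks the newest heartbeat by sequence number; the field names and the `latest_vguid` helper are assumptions, and signature verification is omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Heartbeat:
    aguid: str
    vguid: str
    timestamp: float
    sequence: int

def latest_vguid(heartbeats, aguid):
    """Return the VGUID named by the newest heartbeat for this AGUID, or None."""
    mine = [h for h in heartbeats if h.aguid == aguid]
    if not mine:
        return None
    return max(mine, key=lambda h: h.sequence).vguid
```

A client holding several heartbeats for the same object would read the version named by the one with the highest sequence number.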

The inner ring
• Small set of co-operating servers that manage primary replicas
• Implement a Byzantine fault-tolerant protocol to
  – Agree on all updates to an object
  – Digitally sign the result

Archival storage
• Stores object versions that are not frequently accessed
• Uses erasure codes
  – Each block
    • Partitioned into m fragments
    • Encoded into n > m fragments
  – Any subset of m fragments suffices to reconstitute the block
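
The m-of-n property can be demonstrated with the simplest erasure code there is: m data fragments plus one XOR parity fragment, so n = m + 1. This is a swapped-in teaching example, not the code used by the prototype; a production system would use a stronger code that tolerates more than one lost fragment. The function names are invented here.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(block: bytes, m: int):
    """Split block into m equal fragments and append one XOR parity fragment (n = m + 1)."""
    size = -(-len(block) // m)  # ceiling division
    padded = block.ljust(size * m, b"\0")
    frags = [padded[i * size:(i + 1) * size] for i in range(m)]
    return frags + [reduce(xor, frags)]

def decode(frags, length: int) -> bytes:
    """Rebuild the block from n fragments of which at most one is None."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost fragment")
    if missing:
        present = [f for f in frags if f is not None]
        frags = list(frags)
        frags[missing[0]] = reduce(xor, present)  # XOR of the rest restores the lost one
    return b"".join(frags[:-1])[:length]
```

With m = 4 the block is stored as five fragments, and any four of them are enough to rebuild it.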

Caching of data objects
• Retrieving data from archive is slow
• OceanStore also maintains copies of whole blocks
  – Secondary replicas
• Heartbeats always come from the primary replica
• Updates of secondary replicas are done through a dissemination tree
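
Push-based propagation down a dissemination tree can be sketched with a recursive call. The `Replica` class and the tree shape are hypothetical, and a version number stands in for the update payload.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.version = 0

    def push(self, version):
        """Apply a committed update locally, then forward it to every child."""
        self.version = version
        for child in self.children:
            child.push(version)

primary = Replica("primary")
a, b, c = Replica("a"), Replica("b"), Replica("c")
primary.children = [a, b]
a.children = [c]
primary.push(7)  # the update reaches a and b directly, and c through a
```

The primary sends each update only to its direct children, so its outgoing load does not grow with the total number of secondary replicas.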

Path of an OceanStore update
[Figure: an application, the archive, the primary replica in inner ring, and three secondary replicas]

Updating primary replicas (I)
• Use a Byzantine fault-tolerant protocol
  – Tolerates up to f failures in a system made up of 3f + 1 hosts
• Protocol authenticates messages with symmetric-key message authentication codes (MACs)
  – Faster than using public keys
  – Complicates the Byzantine agreement protocol
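
The 3f + 1 bound translates into simple arithmetic on the ring size. The quorum of 2f + 1 matching replies is the size commonly used by protocols of this family and is not stated on the slide; both function names are made up for the example.

```python
def max_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3

def quorum(n: int) -> int:
    """2f + 1 matching replies, a common quorum size in 3f + 1 protocols."""
    return 2 * max_faults(n) + 1
```

A ring of four hosts therefore tolerates one faulty host, and going from four to six hosts buys no extra tolerance; the next step up is seven.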

Updating primary replicas (II)
• Solution was to use
  – Symmetric keys for all communications within the inner ring
  – Public keys to communicate with all other machines
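
The symmetric-key half of this split can be sketched with Python's standard `hmac` module. The key, the message, and the use of SHA-256 are assumptions for the example; the public-key half is not shown because it needs a third-party library.

```python
import hashlib
import hmac

def mac(key: bytes, message: bytes) -> bytes:
    """Compute a message authentication code over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check a tag in constant time; requires the same shared key."""
    return hmac.compare_digest(mac(key, message), tag)

shared_key = b"hypothetical key shared by two inner-ring hosts"
tag = mac(shared_key, b"commit update 42")
```

Anyone able to verify such a tag holds the key and could also forge one, which is why a MAC is only meaningful between hosts that share a key and why hosts outside the ring need public-key signatures instead.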

Proactive threshold signatures
• (listen to lecture)

Prototype software architecture
[Figure components: Network (Java NBIO); Tapestry; Byzantine agreement; Inner ring; Archive; Dissemination tree/replicas; Client interface; Application]

The prototype
• Written in Java

Conclusion