Post on 16-Dec-2015
Pond: The OceanStore Prototype
Introduction
Problem: Rising cost of storage management
Observations:
Universal connectivity via the Internet
$100 terabyte storage within three years
Solution: OceanStore
OceanStore
Internet-scale cooperative file system
High durability
Universal availability
Two-tier storage system
Upper tier: powerful servers
Lower tier: less powerful hosts
OceanStore
More on OceanStore
Unit of storage: data object
Applications: email, UNIX file system
Requirements for the object interface:
Information universally accessible
Balance between privacy and sharing
Simple and usable consistency model
Data integrity
OceanStore Assumptions
Infrastructure untrusted except in aggregate
Most nodes are neither faulty nor malicious
Infrastructure constantly changing
Resources enter and exit the network without prior warning
Self-organizing, self-repairing, self-tuning
OceanStore Challenges
Expressive storage interface
High durability on an untrusted and constantly changing base
Data Model
The view of the system that is presented to client applications
Storage Organization
OceanStore data object ~= file
Ordered sequence of read-only versions
Every version of every object kept forever
Can be used as backup
An object contains metadata, data, and references to previous versions
Storage Organization
A data object is a stream of versions identified by an AGUID
Active globally-unique identifier
Cryptographically-secure hash of an application-specific name and the owner's public key
Prevents namespace collisions
Storage Organization
Each version of a data object is stored in a B-tree-like data structure
Each block has a BGUID
Cryptographically-secure hash of the block content
Each version has a VGUID
Two versions may share blocks
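Since BGUIDs (and the other GUIDs) are secure hashes, the naming scheme can be sketched in a few lines. This is an illustrative sketch, not Pond's actual code; the class name is an assumption, and SHA-1 is chosen as the era-appropriate hash.

```java
import java.security.MessageDigest;
import java.util.HexFormat;

// Illustrative sketch of self-certifying GUIDs in the spirit of
// OceanStore's BGUIDs: the name of a block is a secure hash of its content.
public class Guid {
    // Hash raw bytes to a hex GUID. SHA-1 matches the paper's era;
    // any cryptographically secure hash fits the scheme.
    public static String of(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return HexFormat.of().formatHex(md.digest(content));
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError(e); // SHA-1 is always present in the JDK
        }
    }

    public static void main(String[] args) {
        byte[] block = "some block content".getBytes();
        // Identical content always yields the same BGUID, so two versions
        // that share a block automatically share its name.
        System.out.println(Guid.of(block).equals(Guid.of(block))); // true
    }
}
```

Because names are content hashes, sharing blocks between versions is free: equal content implies an equal BGUID.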
Storage Organization
Application-Specific Consistency
An update is the operation of adding a new version to the head of a version stream
Updates are applied atomically
Represented as an array of potential actions
Each guarded by a predicate
Application-Specific Consistency
Example actions:
Replacing some bytes
Appending new data to an object
Truncating an object
Example predicates:
Check for the latest version number
Compare bytes
Application-Specific Consistency
To implement ACID semantics:
Check for readers
If none, update
To append to a mailbox:
No checking needed
No explicit locks or leases
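The predicate-guarded update model above can be sketched as follows; the `GuardedUpdate` class and the string-valued versions are illustrative assumptions, not Pond's actual interface.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Sketch of an OceanStore-style update: an array of potential actions,
// each guarded by a predicate, applied atomically against the latest version.
public class GuardedUpdate {
    final Predicate<String> guard;       // e.g. "latest version number is 7"
    final UnaryOperator<String> action;  // e.g. append, replace, truncate

    GuardedUpdate(Predicate<String> g, UnaryOperator<String> a) {
        guard = g;
        action = a;
    }

    // Apply the first action whose predicate holds. The update is atomic:
    // either one action commits a new version, or nothing changes.
    static String apply(String latest, List<GuardedUpdate> update) {
        for (GuardedUpdate u : update)
            if (u.guard.test(latest)) return u.action.apply(latest);
        return latest; // all predicates failed: the update aborts
    }

    public static void main(String[] args) {
        // Mailbox-style append: predicate is always true, no checking.
        List<GuardedUpdate> append =
            List.of(new GuardedUpdate(v -> true, v -> v + "+msg"));
        System.out.println(apply("inbox", append)); // inbox+msg
    }
}
```

A compare-version predicate gives optimistic concurrency control: the action fires only if the version the client saw is still the latest.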
Application-Specific Consistency
Predicates for reads
Examples:
Can't read something older than 30 seconds
Can only read data from a specific time frame
System Architecture
Unit of synchronization: data object
Changes to different objects are independent
Virtualization through Tapestry
Resources are virtual and not tied to particular hardware
A virtual resource has a GUID, a globally unique identifier
Use Tapestry, a decentralized object location and routing system
A scalable overlay network, built on TCP/IP
Virtualization through Tapestry
Use GUIDs to address hosts and resources
Hosts publish the GUIDs of their resources in Tapestry
Hosts can also unpublish GUIDs and leave the network
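A toy stand-in for the publish/unpublish/locate interface helps make the idea concrete. The names here are assumptions, and a shared in-memory map only mimics the API: real Tapestry is a decentralized overlay with no central directory.

```java
import java.util.Map;
import java.util.Set;
import java.util.HashMap;
import java.util.HashSet;

// Toy model of Tapestry's interface: hosts publish the GUIDs of resources
// they hold, and anyone can locate the set of hosts serving a GUID.
public class Directory {
    private final Map<String, Set<String>> whoHas = new HashMap<>();

    public void publish(String guid, String host) {
        whoHas.computeIfAbsent(guid, g -> new HashSet<>()).add(host);
    }

    public void unpublish(String guid, String host) {
        Set<String> hosts = whoHas.get(guid);
        if (hosts != null) hosts.remove(host); // host leaves or drops the resource
    }

    public Set<String> locate(String guid) {
        return whoHas.getOrDefault(guid, Set.of());
    }
}
```

In the real system the mapping is spread across the overlay's routing tables, so publish, unpublish, and locate all happen via routed messages rather than map operations.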
Replication and Consistency
A data object is a sequence of read-only versions, consisting of read-only blocks named by BGUIDs
No issues for replication
The mapping from AGUID to the latest VGUID may change
Use primary-copy replication
Replication and Consistency
The primary copy:
Enforces access control
Serializes concurrent updates
Archival Storage
Replication: 2x storage to tolerate one failure
Erasure codes are much better:
A block is divided into m fragments
The m fragments are encoded into n > m fragments
Any m fragments can restore the original block
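A minimal illustration of the any-m-of-n property uses a single XOR parity fragment (m = 2, n = 3): any two of the three fragments recover the block. This is only a sketch of the principle; Pond actually uses a Cauchy Reed-Solomon code with much larger m and n.

```java
// Minimal erasure-coding sketch: split a block into m = 2 data fragments
// plus one XOR parity fragment (n = 3). Any 2 of the 3 fragments suffice.
public class XorCode {
    // Split a block (even length, for simplicity) into a, b, and a^b.
    public static byte[][] encode(byte[] block) {
        int half = block.length / 2;
        byte[] a = java.util.Arrays.copyOfRange(block, 0, half);
        byte[] b = java.util.Arrays.copyOfRange(block, half, block.length);
        byte[] p = new byte[half];
        for (int i = 0; i < half; i++) p[i] = (byte) (a[i] ^ b[i]);
        return new byte[][] { a, b, p }; // fragment indices 0, 1, 2
    }

    // Recover the block from any two fragments, passed in index order
    // (xi < yi). Index 2 is the parity fragment.
    public static byte[] decode(byte[] x, int xi, byte[] y, int yi) {
        byte[] a, b;
        if (xi == 0 && yi == 1) { a = x; b = y; }   // both data fragments
        else if (xi == 0)       { a = x; b = xor(x, y); } // a and parity
        else                    { b = x; a = xor(x, y); } // b and parity
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    private static byte[] xor(byte[] u, byte[] v) {
        byte[] w = new byte[u.length];
        for (int i = 0; i < u.length; i++) w[i] = (byte) (u[i] ^ v[i]);
        return w;
    }
}
```

The storage cost of an m-of-n code is n/m times the original data, which is why a 32-choose-16 code costs roughly 2x raw storage yet tolerates the loss of any 16 fragments.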
Caching of Data Objects
Reconstructing a block from its erasure-coded fragments is an expensive process
Need to locate m fragments from m machines
Use whole-block caching for frequently read objects
Caching of Data Objects
To read a block, look for a cached copy first
If not available:
Find the block's fragments
Decode the fragments
Publish that the host now caches the block
This amortizes the cost of erasure encoding/decoding
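The read path above can be sketched as follows; the class name and the `reconstruct` callback (which stands in for fragment lookup plus erasure decoding) are illustrative assumptions.

```java
import java.util.Map;
import java.util.function.Function;

// Sketch of the cached read path: return a whole cached block if present,
// otherwise reconstruct it from fragments and cache the result.
public class BlockCache {
    private final Map<String, byte[]> cache = new java.util.HashMap<>();

    public byte[] read(String bguid, Function<String, byte[]> reconstruct) {
        byte[] hit = cache.get(bguid);
        if (hit != null) return hit;             // fast path: whole-block cache
        byte[] block = reconstruct.apply(bguid); // slow path: m fragments + decode
        cache.put(bguid, block);                 // "publish": later reads are cheap
        return block;
    }
}
```

After the first read, the host would also publish the block's BGUID in Tapestry so that other hosts can find the cached copy instead of decoding fragments themselves.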
Caching of Data Objects
Updates are pushed to secondary replicas via an application-level multicast tree
The Full Update Path
Serialized updates are disseminated via the object's multicast tree
At the same time, updates are encoded into fragments for long-term archival storage
The Full Update Path
The Primary Replica
Primary servers run a Byzantine agreement protocol
Need more than 2/3 of the participants to be nonfaulty
The number of messages required grows quadratically with the number of participants
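The arithmetic behind these limits can be stated as a short sketch: tolerating f Byzantine servers requires n >= 3f + 1 participants, and agreement traffic grows on the order of n^2 messages.

```java
// Quorum arithmetic for Byzantine agreement: correctness requires more
// than 2/3 of participants to be nonfaulty, i.e. n >= 3f + 1.
public class Quorum {
    // Minimum ring size to tolerate f Byzantine faults.
    public static int minServers(int f) { return 3 * f + 1; }

    // Maximum Byzantine faults an n-server ring can tolerate.
    public static int maxFaulty(int n) { return (n - 1) / 3; }
}
```

This is why the experiments use a four-node inner ring: it is the smallest configuration that tolerates one faulty server, and the quadratic message cost keeps the ring small.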
Public-Key Cryptography
Public-key cryptography is too expensive
Use symmetric-key message authentication codes (MACs) instead
Two to three orders of magnitude faster
Downside: can't prove the authenticity of a message to a third party
Used only within the inner ring
Public-key cryptography for the outer ring
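A hedged sketch of the MAC substitution, using the JDK's `javax.crypto.Mac`; the key and the HmacSHA1 algorithm choice are illustrative, era-appropriate assumptions.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Symmetric-key MAC as used within the inner ring in place of public-key
// signatures: only holders of the shared key can create or verify a tag.
public class RingMac {
    public static byte[] tag(byte[] key, byte[] message) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(key, "HmacSHA1"));
            return mac.doFinal(message);
        } catch (java.security.GeneralSecurityException e) {
            throw new AssertionError(e); // HmacSHA1 is a standard JDK algorithm
        }
    }
}
```

The limitation named above falls out of the construction: since verification uses the same shared key as creation, a valid tag convinces another key holder but proves nothing to a third party, which is why results leaving the inner ring still need public-key signatures.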
Proactive Threshold Signatures
Byzantine agreement guarantees correctness only if no more than 1/3 of the servers fail during the life of the system
Not practical for a long-lived system
Need to reboot servers at regular intervals
Key holders are fixed
Proactive Threshold Signatures
Proactive threshold signatures allow more flexibility in choosing the membership of the inner ring
A public key is paired with a number of private keys
Each server uses its key to generate a signature share
Proactive Threshold Signatures
Any k shares may be combined to produce a full signature
To change the membership of an inner ring:
Regenerate the signature shares
No need to change the public key
Transparent to secondary hosts
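To see why any k shares suffice, here is an illustrative k-of-n secret-sharing sketch in the style of Shamir: a degree-(k-1) polynomial is fully determined by any k points. Real proactive threshold signatures combine signature shares rather than reconstructing a key, so this is an analogy for the threshold property only; all names and the tiny prime field are assumptions.

```java
import java.math.BigInteger;

// Shamir-style k-of-n sharing over a small prime field, to illustrate the
// threshold property behind proactive threshold signatures.
public class Shamir {
    static final BigInteger P = BigInteger.valueOf(2089); // toy prime field

    // Shares are points (x, f(x)) of f(x) = secret + c1*x + c2*x^2 + ... mod P;
    // the polynomial degree is coeffs.length, so k = coeffs.length + 1.
    public static BigInteger[][] share(long secret, long[] coeffs, int n) {
        BigInteger[][] shares = new BigInteger[n][];
        for (int x = 1; x <= n; x++) {
            BigInteger xi = BigInteger.valueOf(x);
            BigInteger y = BigInteger.valueOf(secret);
            BigInteger pow = xi;
            for (long c : coeffs) {
                y = y.add(BigInteger.valueOf(c).multiply(pow)).mod(P);
                pow = pow.multiply(xi).mod(P);
            }
            shares[x - 1] = new BigInteger[] { xi, y };
        }
        return shares;
    }

    // Lagrange interpolation at x = 0 recovers the secret from any k shares.
    public static BigInteger combine(BigInteger[][] shares) {
        BigInteger secret = BigInteger.ZERO;
        for (int i = 0; i < shares.length; i++) {
            BigInteger num = BigInteger.ONE, den = BigInteger.ONE;
            for (int j = 0; j < shares.length; j++) {
                if (i == j) continue;
                num = num.multiply(shares[j][0].negate()).mod(P);
                den = den.multiply(shares[i][0].subtract(shares[j][0])).mod(P);
            }
            secret = secret.add(
                shares[i][1].multiply(num).multiply(den.modInverse(P))).mod(P);
        }
        return secret;
    }
}
```

Changing the inner ring's membership then amounts to dealing fresh shares of the same polynomial's secret: the public key (here, the secret) never changes, so secondary hosts notice nothing.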
The Responsible Party
Who chooses the inner ring?
Responsible party:
A server that publishes sets of failure-independent nodes
Determined through offline measurement and analysis
Software Architecture
Java atop the Staged Event-Driven Architecture (SEDA)
Each subsystem is implemented as a stage
Each with its own state and thread pool
Stages communicate through events
50,000 semicolons, written by five graduate students and many undergraduate interns
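A minimal SEDA-style stage can be sketched as a private event queue plus its own thread pool, with stages interacting only by enqueueing events on each other. The names are illustrative, not Pond's actual classes.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Sketch of a SEDA stage: its own state (the queue), its own thread pool,
// and an event handler; communication with other stages is via enqueue.
public class Stage<E> {
    private final BlockingQueue<E> events = new LinkedBlockingQueue<>();
    private final ExecutorService pool;

    public Stage(int threads, Consumer<E> handler) {
        pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++)
            pool.submit(() -> {
                try {
                    while (true) handler.accept(events.take());
                } catch (InterruptedException e) {
                    // stage shutting down
                }
            });
    }

    public void enqueue(E event) { events.add(event); }

    public void shutdown() { pool.shutdownNow(); }
}
```

Decoupling stages through queues lets each stage size its thread pool independently and absorb bursts, at the cost of queueing latency between stages.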
Software Architecture
Language Choice
Java: speed of development
Strongly typed
Garbage collected
Reduced debugging time
Support for events
Easy to port multithreaded code in Java
Ported to Windows 2000 in one week
Language Choice
Problems with Java:
Unpredictability introduced by garbage collection
Every thread in the system is halted while the garbage collector runs
Any ongoing process stalls for ~100 milliseconds
May add several seconds to requests traveling across machines
Experimental Setup
Two test beds
Local cluster of 42 machines at Berkeley:
Each with two 1.0 GHz Pentium III CPUs
1.5 GB PC133 SDRAM
Two 36 GB hard drives, RAID 0
Gigabit Ethernet adaptor
Linux 2.4.18 SMP
Experimental Setup
PlanetLab, ~100 nodes across ~40 sites
1.2 GHz Pentium III, 1 GB RAM
~1000 virtual nodes
Storage Overhead
For 32-choose-16 erasure encoding: 2.7x for data > 8 kB
For 64-choose-16 erasure encoding: 4.8x for data > 8 kB
The Latency Benchmark
A single client submits updates of various sizes to a four-node inner ring
Metric: time from just before the request is signed to just after the signature over the result is checked
Update 40 MB of data over 1000 updates, with 100 ms between updates
The Latency Benchmark
Update Latency (ms)

Key Size   Update Size   5% Time   Median Time   95% Time
512-bit    4 kB          39        40            41
512-bit    2 MB          1037      1086          1348
1024-bit   4 kB          98        99            100
1024-bit   2 MB          1098      1150          1448
Latency Breakdown
Phase       Time (ms)
Check       0.3
Serialize   6.1
Apply       1.5
Archive     4.5
Sign        77.8
The Throughput Microbenchmark
A number of clients submit updates of various sizes to disjoint objects on a four-node inner ring
The clients:
Create their objects
Synchronize themselves
Update the objects as many times as possible for 100 seconds
The Throughput Microbenchmark
Archive Retrieval Performance
Populate the archive by submitting updates of various sizes to a four-node inner ring
Delete all copies of the data in its reconstructed form
A single client submits reads
Archive Retrieval Performance
Throughput: 1.19 MB/s (PlanetLab), 2.59 MB/s (local cluster)
Latency: ~30-70 milliseconds
The Stream Benchmark
Ran 500 virtual nodes on PlanetLab
Inner ring in the SF Bay Area
Replicas clustered in the 7 largest PlanetLab sites
Streams updates to all replicas
One writer, the content creator, repeatedly appends to the data object
Others read new versions as they arrive
Measure network resource consumption
The Stream Benchmark
The Tag Benchmark
Measures the latency of token passing
OceanStore is 2.2 times slower than TCP/IP
The Andrew Benchmark
File system benchmark
4.6x slower than NFS in read-intensive phases
7.3x slower in write-intensive phases