oceanstore global-scale persistent storage john kubiatowicz university of california at berkeley
Post on 19-Dec-2015
221 views
TRANSCRIPT
OStore
John Kubiatowicz FDIS 2000
Context: Project Endeavour
Interdisciplinary, Technology-Centered
Team• Alex Aiken, PL• Eric Brewer, OS• John Canny, AI• David Culler, OS/Arch• Joseph Hellerstein, DB• Michael Jordan, Learning• Anthony Joseph, OS• Randy Katz, Nets• John Kubiatowicz, Arch• James Landay, UI
• Jitendra Malik, Vision• George Necula, PL• Christos Papadimitriou,
Theory• David Patterson, Arch• Kris Pister, Mems• Larry Rowe, MM• Alberto Sangiovanni-
Vincentelli, CAD• Doug Tygar, Security• Robert Wilensky, DL/AI
OStore
John Kubiatowicz FDIS 2000
Endeavour Goals:• Enhancing human understanding
– Help people to interact with information, devices, and people - exploit Moore’s law growth in everything
– Enable new approaches for problem solving & learning– Figure of merit: how effectively we amplify and leverage human
intellect
• Enabling and exploiting ubiquitous computing– Small devices, sensors, smart materials, cars, etc
• New methods for design, construction, and administration of ultra-scale systems– “Planetary-scale” Information Utilities
• Infrastructure is transparent and always active• Extensive use of redundancy of hardware and data
– Devices that negotiate their interfaces automatically– Elements that tune, repair, and maintain themselves
OStore
John Kubiatowicz FDIS 2000
Endeavour Maxims
• Exploit Moore’s law growth for better behavior– Use of “excess” capacity for better human interface
• Personal Information Mgmt is the Killer App– Not corporate processing but management, analysis,
aggregation, dissemination, filtering for the individual– Automated extraction and organization of daily
activities to assist people
• Time to move beyond the Desktop– Community computing: infer relationships among
information, delegate control, establish authority
• Information Technology as a Utility– Continuous service delivery, on a planetary-scale, on
top of a highly dynamic information base
OStore
John Kubiatowicz FDIS 2000
Endeavour Approach
“Fluid”, Network-Centric System Software– Partitioning and
management of state between soft and persistent state
– Data processing placement and movement
– Component discovery and negotiation
– Flexible capture, self-organization, and re-use of information
• Information Devices– Beyond desktop computers to MEMS-
sensors/actuators with capture/display to yield enhanced activity spaces
• Information Utility• Information Applications
– High Speed/Collaborative Decision Making and Learning
– Augmented “Smart” Spaces: Rooms and Vehicles
• Design Methodology– User-centric Design with
HW/SW Co-design;– Formal methods for safe and trustworthy
decomposable and reusable components
OStore
John Kubiatowicz FDIS 2000
OceanStore Context: Ubiquitous Computing
• Computing everywhere:– Desktop, Laptop, Palmtop– Cars, Cellphones– Shoes? Clothing? Walls?
• Connectivity everywhere:– Rapid growth of bandwidth in the interior of the net– Broadband to the home and office– Wireless technologies such as CMDA, Satelite, laser
• Rise of the thin-client metaphor:– Services provided by interior of network– Incredibly thin clients on the leaves
• MEMs devices -- sensors+CPU+wireless net in 1mm3
• Mobile society: people move and devices are disposable
OStore
John Kubiatowicz FDIS 2000
Questions about information:
• Where is persistent information stored?– 20th-century tie between location and content outdated (we all
survived the Feb 29th bug -- let’s move on!)– In world-scale system, locality is key
• How is it protected?– Can disgruntled employee of ISP sell your secrets?– Can’t trust anyone (how paranoid are you?)
• Can we make it indestructible? – Want our data to survive “the big one”! – Highly resistant to hackers (denial of service)– Wide-scale disaster recovery
• Is it hard to manage?– Worst failures are human-related– Want automatic (introspective) diagnose and repair
OStore
John Kubiatowicz FDIS 2000
First Observation:Want Utility Infrastructure
• Mark Weiser from Xerox: Transparent computing is the ultimate goal– Computers should disappear into the background
• In storage context:– Don’t want to worry about backup– Don’t want to worry about obsolescence– Need lots of resources to make data secure and highly
available, BUT don’t want to own them– Outsourcing of storage already becoming popular
• Pay monthly fee and your “data is out there” – Simple payment interface
one bill from one company
OStore
John Kubiatowicz FDIS 2000
Second Observation:Need wide-scale
deployment• Many components with geographic separation
– System not disabled by natural disasters– Can adapt to changes in demand and regional outages– Gain in stability through statistics– Difference between thermodynamics and mechanics
surprising stability of temperature and pressure given 1030 molecules with highly variable behavior!
• Wide-scale use and sharing also requires wide-scale deployment– Bandwidth increasing rapidly, but latency bounded by
speed of light
• Handling many people with same system leads to economies of scale
OStore
John Kubiatowicz FDIS 2000
OceanStore:Everyone’s data, One big
Utility “The data is just out there”
• Separate information from location– Locality is an only an optimization (an important one!)– Wide-scale coding and replication for durability
• All information is globally identified– Unique identifiers are hashes over names & keys– Single uniform lookup interface replaces: DNS, server location, data
location– No centralized namespace required (such as SDSI)
OStore
John Kubiatowicz FDIS 2000
Amusing back of the envelope calculation
(courtesy Bill Bolotsky, Microsoft)
• How many files in the OceanStore?– Assume 1010 people in world– Say 10,000 files/person (very conservative?)– So 1014 files in OceanStore!
– If 1 gig files (not likely), get 1 mole of files!
Truly impressive number of elements…… but small relative to physical constants
OStore
John Kubiatowicz FDIS 2000
• Service provided by confederation of companies– Monthly fee paid to one service provider– Companies buy and sell capacity from each other
Utility-based Infrastructure
Pac Bell
Sprint
IBMAT&T
CanadianOceanStore
IBM
OStore
John Kubiatowicz FDIS 2000
Outline• Motivation• Properties of the OceanStore and
Assumptions• Specific Technologies and approaches:
– Conflict resolution on encrypted data– Replication and Deep archival storage– Naming and Data Location– Introspective computing for optimization and
repair– Economic models
• Conclusion
OStore
John Kubiatowicz FDIS 2000
Ubiquitous Devices Ubiquitous Storage
• Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc.
• Properties REQUIRED for OceanStore storage substrate:– Strong Security: data encrypted in the infrastructure;
resistance to monitoring and denial of service attacks– Coherence: too much data for naïve users to keep
coherent “by hand”– Automatic replica management and optimization: huge
quantities of data cannot be managed manually – Simple and automatic recovery from disasters:
probability of failure increases with size of system– Utility model: world-scale system requires cooperation
across administrative boundaries
OStore
John Kubiatowicz FDIS 2000
State of the Art?• Widely deployed systems: NFS, AFS (/DFS)
– Single “regions” of failure, caching only at endpoints– ClearText exposed at various levels of system– Compromised server all data on server compromised
• Mobile computing community: Coda, Ficus, Bayou– Small scale, fixed coherence mechanism– Not optimized to take advantage of high-bandwidth
connections between server components– ClearText also exposed at various levels of system
• Web caching community: Inktomi, Akamai– Specialized, incremental solutions– Caching along client/server path, various bottlenecks
• Database Community: – Interfaces not usable by legacy applications– ACID update semantics not always appropriate
OStore
John Kubiatowicz FDIS 2000
OceanStore Assumptions• Untrusted Infrastructure:
– The OceanStore is comprised of untrusted components– Only cyphertext within the infrastructure– Information must not be “leaked” over time
• Principle Party:– There is one organization that is financially responsible for the
integrity of your data• Mostly Well-Connected:
– Data producers and consumers are connected to a high-bandwidth network most of the time
– Exploit multicast for quicker consistency when possible• Promiscuous Caching:
– Data may be cached anywhere, anytime • Operations Interface with Conflict Resolution:
– Applications employ an operations-oriented interface, rather than a file-systems interface
– Coherence is centered around conflict resolution
OStore
John Kubiatowicz FDIS 2000
OceanStore Technologies I:Naming and Data Location
• Requirements:– System-level names should help to authenticate data– Route to nearby data without global communication– Don’t inhibit rapid relocation of data
• OceanStore approach: Two-level search with embedded routing– Underlying namespace is flat and built from secure
cryptographic hashes (160-bit SHA-1)– Search process combines quick, probabilistic search with
slower guaranteed search– Long-distance data location and routing are integrated
• Every source/destination pair has multiple routing paths• Continuous, on-line optimization adapts for hot spots,
denial of service, and inefficiencies in routing
OStore
John Kubiatowicz FDIS 2000
Universal Location Facility
Universal Name
Name OID
Root StructureUpdate OID:
Archive versions:Version OID1Version OID2Version OID3
Global Object Resolution
FloatingReplica
Active Data
CommitLogs
CheckpointOIDGlobal Object
Resolution
Version OID
Archival copyor snapshot
Archival copyor snapshot
Archival copyor snapshot
Global Object Resolution
Global Object Resolution
ErasureCoded:
• Takes 160-bit unique identifier (GUID) and Returns the nearest object that matches
OStore
John Kubiatowicz FDIS 2000
Some current results:• Have a working algorithm for local search
– Uses attenuated bloom filters– Performs search by passing messages from node to node. All
state kept in messages!– Updates filters through semi-chaotic passing of information
between neighbors• Resembles compiler dataflow algorithm• Can be shown to converge
• Have candidate for “backing store” index– Randomized data structure with locality properties
• Every document has multiple roots in the OceanStore• Searches “close” to copy tend to find copy quickly
– Redundant, insensitive to faults, and repairable– Investigating algorithms to continually adapt routing
structure to adjust for faults and denial of service
OStore
John Kubiatowicz FDIS 2000
OceanStore Technologies II:Rapid Update in an
Untrusted Infrastructure• Requirements:
– Scalable coherence mechanism which can operate directly on encrypted data without revealing information
– Handle Byzantine failures – Rapid dissemination of committed information
• OceanStore Approach:– Operations-based interface using conflict resolution
• Modeled after Xerox Bayou updates packets include:Predicate/update pairs which operate on encryped data
• Use of oblivious function techniques to perform this update• Use of incremental cryptographic techniques
– User signs Updates and principle party signs commits– Committed data multicast to clients
OStore
John Kubiatowicz FDIS 2000
Our State of the Art• Have techniques for protecting metadata
– Uses encryption and signatures to provide protection against substitution attacks
– Provides “secure pointer” technology• Have a working scheme that can do some forms of
conflict resolution directly on encryped data– Uses new technique for searching on encrypted data. – Can be generalized to perform optimistic concurrency,
but at cost in performance and possibly privacy• Byzantine assumptions for update commitment:
– Signatures on update requests from clients• Compromised servers are unable to produce valid updates• Uncompromised second-tier servers can make consistent
ordering decision with respect to tentative commits– Use of threshold cryptography in inner-tier of servers– Signatures on update stream from inner-tier
• Use of chained MACs to reduce overhead
OStore
John Kubiatowicz FDIS 2000
OceanStore Technologies III:High-Availability and
Disaster Recovery• Requirements:
– Handle diverse, unstable participants in OceanStore– Mitigate denial of service attacks– Eliminate backup as independent (and fallible) technology– Flexible “disaster recovery” for everyone
• OceanStore Approach:– Use of erasure-codes to provide stable storage for archival
copies and snapshots of live data– Mobile replicas are self-contained centers for logging and
conflict resolution– Version-based update for painless recovery– Continuous introspection repairs data structures and
degree of redundancy
OStore
John Kubiatowicz FDIS 2000
Floating Replicas and “Deep Archival Coding”
• Floating Replicas are per-object virtual servers– Complete copy of data– logging for updates/conflict resolution– Interaction with other centers to keep data consistent– May appear and disappear like bubbles
• Erasure coded fragments provide very stable store– Multi-level codes spread over 1000s of nodes– Could lose 1/2 of nodes and still recover data– Archive: old versions of data and checkpoints– Inactive data may only be in erasure-coded form
OStore
John Kubiatowicz FDIS 2000
Floating Replica and Deep Archival Coding
Erasure-coded Fragments
Ver1: 0x34243Ver2: 0x49873Ver3: …
FullCop
y
Conflict ResolutionLogs
Ver1: 0x34243Ver2: 0x49873Ver3: …
FullCop
y
Conflict ResolutionLogs
Ver1: 0x34243Ver2: 0x49873Ver3: …
FullCop
y
Conflict ResolutionLogs
FloatingReplica
OStore
John Kubiatowicz FDIS 2000
Structure of Archival Checkpoints
• All blocks and fragments signed• “Copy on Write” behavior• Older metablocks fragmented also
Metadata
RedoLogs
Checkpoint Reference (GUID)
. . . . .
Blocks
Unit of Archival Storage
Unit of Coding
Fragments
NOTE: Each Block needs a GUID
Metadata
RedoLogs
Checkpoint Reference (Later Version)
OStore
John Kubiatowicz FDIS 2000
Proactive Self-Maintenance• Continuous testing and repair of information
– Slow sweep through all information to make sure there are sufficient erasure-coded fragments
– Continuously reevaluate of risk and redistribute data– Slow sweep and repair of metadata/search trees
• Continuous online self-testing of HW and SW– detects flaky, failing, or buggy components via:
• fault injection: triggering hardware and software error handling paths to verify their integrity/existence
• stress testing: pushing HW/SW components past normal operating parameters
• scrubbing: periodic restoration of potentially “decaying” hardware or software state
– automates preventive maintenance
OStore
John Kubiatowicz FDIS 2000
OceanStore Technologies IV:Introspective Optimization
• Requirements:– Reasonable job on global-scale optimization problem
• Take advantage of locality whenever possible• Sensitivity to limited storage and bandwidth at endpoints
– Repair of data structures, increasing of redundancy– Stability in chaotic environment Active Feedback
• OceanStore Approach:– Introspective Monitoring and analysis of relationships to
cluster information by relatedness– Time series-analysis of user and data motion– Rearrangement and replication in response to monitoring
• Clustered prefetching: fetch related objects• Proactive-prefetching: get data there before needed• Rearrangement in response to overload and attack
OStore
John Kubiatowicz FDIS 2000
• Client observer and optimizer components– greedy agents working on the behalf of the client
• Watches client activity/combines with historical info• Performs clustering and time-series analysis • forwards results to infrastructure (privacy issues!)
– Monitoring of state of network to adapt behavior• Typical Actions:
– cluster related files together– prefetch files that will be needed soon– Create/destroy floating replicas
Example: Client Introspection
OStore
John Kubiatowicz FDIS 2000
OceanStore Technologies V:The oceanic data market
• Properties:– Utility providers have resources (storage and bandwidth) – Clients use resources both directly and indirectly
• Use of data storage and bandwidth on demand• Data movement “on behalf” of users
– Some customers are more important than others
• Techniques that we are exploring (very preliminary)– Data market driven by principle party
• Tradeoff between performance (replication) and cost– Secure signatures on data packets permit:
• Accounting of bandwidth and CPU utilization• Access control policies (Bays in OceanStore nomenclature)
– Use of challenge-response protocols (similar to zero-knowledge proofs) to demonstrate possession of data
OStore
John Kubiatowicz FDIS 2000
Two-Phase Implementation:• This term: Read-Mostly Prototype
– Construction of data location facility– Initial introspective gathering of tacit info and
adaptation– Initial archival techniques (use of erasure codes)– Unix file-system interface under Linux (“legacy
apps”)
• Later?: Full Prototype– Final conflict resolution and encryption techniques– More sophisticated tacit info gathering and
rearrangement– Final object interface and integration with Endeavour
applications– Wide-scale deployment via NTON and Internet-2
OStore
John Kubiatowicz FDIS 2000
OceanStore Conclusion• The Time is now for a Universal Data Utility
– Ubiquitous computing and connectivity is (almost) here!– Confederation of utility providers is right model
• OceanStore holds all data, everywhere– Local storage is a cache on global storage– Provides security in an untrusted infrastructure– Large scale system has good statistical properties
• Use of introspection for performance and stability• Quality of individual servers enhances reliability
• Exploits economies of scale to:– Provide high-availability and extreme survivability– Lower maintenance cost:
• self-diagnosis and repair• Insensitivity to technology changes:
Just unplug one set of servers, plug in others