oceanstore global-scale persistent storage john kubiatowicz university of california at berkeley

34
OceanStore Global-Scale Persistent Storage John Kubiatowicz University of California at Berkeley

Post on 19-Dec-2015

221 views

Category:

Documents


1 download

TRANSCRIPT

OceanStoreGlobal-Scale Persistent Storage

John KubiatowiczUniversity of California at Berkeley

OStore

John Kubiatowicz FDIS 2000

Context: Project Endeavour

Interdisciplinary, Technology-Centered

Team• Alex Aiken, PL• Eric Brewer, OS• John Canny, AI• David Culler, OS/Arch• Joseph Hellerstein, DB• Michael Jordan, Learning• Anthony Joseph, OS• Randy Katz, Nets• John Kubiatowicz, Arch• James Landay, UI

• Jitendra Malik, Vision• George Necula, PL• Christos Papadimitriou,

Theory• David Patterson, Arch• Kris Pister, Mems• Larry Rowe, MM• Alberto Sangiovanni-

Vincentelli, CAD• Doug Tygar, Security• Robert Wilensky, DL/AI

OStore

John Kubiatowicz FDIS 2000

Endeavour Goals:• Enhancing human understanding

– Help people to interact with information, devices, and people - exploit Moore’s law growth in everything

– Enable new approaches for problem solving & learning– Figure of merit: how effectively we amplify and leverage human

intellect

• Enabling and exploiting ubiquitous computing– Small devices, sensors, smart materials, cars, etc

• New methods for design, construction, and administration of ultra-scale systems– “Planetary-scale” Information Utilities

• Infrastructure is transparent and always active• Extensive use of redundancy of hardware and data

– Devices that negotiate their interfaces automatically– Elements that tune, repair, and maintain themselves

OStore

John Kubiatowicz FDIS 2000

Endeavour Maxims

• Exploit Moore’s law growth for better behavior– Use of “excess” capacity for better human interface

• Personal Information Mgmt is the Killer App– Not corporate processing but management, analysis,

aggregation, dissemination, filtering for the individual– Automated extraction and organization of daily

activities to assist people

• Time to move beyond the Desktop– Community computing: infer relationships among

information, delegate control, establish authority

• Information Technology as a Utility– Continuous service delivery, on a planetary-scale, on

top of a highly dynamic information base

OStore

John Kubiatowicz FDIS 2000

Endeavour Approach

“Fluid”, Network-Centric System Software– Partitioning and

management of state between soft and persistent state

– Data processing placement and movement

– Component discovery and negotiation

– Flexible capture, self-organization, and re-use of information

• Information Devices– Beyond desktop computers to MEMS-

sensors/actuators with capture/display to yield enhanced activity spaces

• Information Utility• Information Applications

– High Speed/Collaborative Decision Making and Learning

– Augmented “Smart” Spaces: Rooms and Vehicles

• Design Methodology– User-centric Design with

HW/SW Co-design;– Formal methods for safe and trustworthy

decomposable and reusable components

OStore

John Kubiatowicz FDIS 2000

OceanStore Context: Ubiquitous Computing

• Computing everywhere:– Desktop, Laptop, Palmtop– Cars, Cellphones– Shoes? Clothing? Walls?

• Connectivity everywhere:– Rapid growth of bandwidth in the interior of the net– Broadband to the home and office– Wireless technologies such as CMDA, Satelite, laser

• Rise of the thin-client metaphor:– Services provided by interior of network– Incredibly thin clients on the leaves

• MEMs devices -- sensors+CPU+wireless net in 1mm3

• Mobile society: people move and devices are disposable

OStore

John Kubiatowicz FDIS 2000

Questions about information:

• Where is persistent information stored?– 20th-century tie between location and content outdated (we all

survived the Feb 29th bug -- let’s move on!)– In world-scale system, locality is key

• How is it protected?– Can disgruntled employee of ISP sell your secrets?– Can’t trust anyone (how paranoid are you?)

• Can we make it indestructible? – Want our data to survive “the big one”! – Highly resistant to hackers (denial of service)– Wide-scale disaster recovery

• Is it hard to manage?– Worst failures are human-related– Want automatic (introspective) diagnose and repair

OStore

John Kubiatowicz FDIS 2000

First Observation:Want Utility Infrastructure

• Mark Weiser from Xerox: Transparent computing is the ultimate goal– Computers should disappear into the background

• In storage context:– Don’t want to worry about backup– Don’t want to worry about obsolescence– Need lots of resources to make data secure and highly

available, BUT don’t want to own them– Outsourcing of storage already becoming popular

• Pay monthly fee and your “data is out there” – Simple payment interface

one bill from one company

OStore

John Kubiatowicz FDIS 2000

Second Observation:Need wide-scale

deployment• Many components with geographic separation

– System not disabled by natural disasters– Can adapt to changes in demand and regional outages– Gain in stability through statistics– Difference between thermodynamics and mechanics

surprising stability of temperature and pressure given 1030 molecules with highly variable behavior!

• Wide-scale use and sharing also requires wide-scale deployment– Bandwidth increasing rapidly, but latency bounded by

speed of light

• Handling many people with same system leads to economies of scale

OStore

John Kubiatowicz FDIS 2000

OceanStore:Everyone’s data, One big

Utility “The data is just out there”

• Separate information from location– Locality is an only an optimization (an important one!)– Wide-scale coding and replication for durability

• All information is globally identified– Unique identifiers are hashes over names & keys– Single uniform lookup interface replaces: DNS, server location, data

location– No centralized namespace required (such as SDSI)

OStore

John Kubiatowicz FDIS 2000

Basic Structure:Irregular Mesh of “Pools”

OStore

John Kubiatowicz FDIS 2000

Amusing back of the envelope calculation

(courtesy Bill Bolotsky, Microsoft)

• How many files in the OceanStore?– Assume 1010 people in world– Say 10,000 files/person (very conservative?)– So 1014 files in OceanStore!

– If 1 gig files (not likely), get 1 mole of files!

Truly impressive number of elements…… but small relative to physical constants

OStore

John Kubiatowicz FDIS 2000

• Service provided by confederation of companies– Monthly fee paid to one service provider– Companies buy and sell capacity from each other

Utility-based Infrastructure

Pac Bell

Sprint

IBMAT&T

CanadianOceanStore

IBM

OStore

John Kubiatowicz FDIS 2000

Outline• Motivation• Properties of the OceanStore and

Assumptions• Specific Technologies and approaches:

– Conflict resolution on encrypted data– Replication and Deep archival storage– Naming and Data Location– Introspective computing for optimization and

repair– Economic models

• Conclusion

OStore

John Kubiatowicz FDIS 2000

Ubiquitous Devices Ubiquitous Storage

• Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc.

• Properties REQUIRED for OceanStore storage substrate:– Strong Security: data encrypted in the infrastructure;

resistance to monitoring and denial of service attacks– Coherence: too much data for naïve users to keep

coherent “by hand”– Automatic replica management and optimization: huge

quantities of data cannot be managed manually – Simple and automatic recovery from disasters:

probability of failure increases with size of system– Utility model: world-scale system requires cooperation

across administrative boundaries

OStore

John Kubiatowicz FDIS 2000

State of the Art?• Widely deployed systems: NFS, AFS (/DFS)

– Single “regions” of failure, caching only at endpoints– ClearText exposed at various levels of system– Compromised server all data on server compromised

• Mobile computing community: Coda, Ficus, Bayou– Small scale, fixed coherence mechanism– Not optimized to take advantage of high-bandwidth

connections between server components– ClearText also exposed at various levels of system

• Web caching community: Inktomi, Akamai– Specialized, incremental solutions– Caching along client/server path, various bottlenecks

• Database Community: – Interfaces not usable by legacy applications– ACID update semantics not always appropriate

OStore

John Kubiatowicz FDIS 2000

OceanStore Assumptions• Untrusted Infrastructure:

– The OceanStore is comprised of untrusted components– Only cyphertext within the infrastructure– Information must not be “leaked” over time

• Principle Party:– There is one organization that is financially responsible for the

integrity of your data• Mostly Well-Connected:

– Data producers and consumers are connected to a high-bandwidth network most of the time

– Exploit multicast for quicker consistency when possible• Promiscuous Caching:

– Data may be cached anywhere, anytime • Operations Interface with Conflict Resolution:

– Applications employ an operations-oriented interface, rather than a file-systems interface

– Coherence is centered around conflict resolution

OStore

John Kubiatowicz FDIS 2000

OceanStore Technologies I:Naming and Data Location

• Requirements:– System-level names should help to authenticate data– Route to nearby data without global communication– Don’t inhibit rapid relocation of data

• OceanStore approach: Two-level search with embedded routing– Underlying namespace is flat and built from secure

cryptographic hashes (160-bit SHA-1)– Search process combines quick, probabilistic search with

slower guaranteed search– Long-distance data location and routing are integrated

• Every source/destination pair has multiple routing paths• Continuous, on-line optimization adapts for hot spots,

denial of service, and inefficiencies in routing

OStore

John Kubiatowicz FDIS 2000

Universal Location Facility

Universal Name

Name OID

Root StructureUpdate OID:

Archive versions:Version OID1Version OID2Version OID3

Global Object Resolution

FloatingReplica

Active Data

CommitLogs

CheckpointOIDGlobal Object

Resolution

Version OID

Archival copyor snapshot

Archival copyor snapshot

Archival copyor snapshot

Global Object Resolution

Global Object Resolution

ErasureCoded:

• Takes 160-bit unique identifier (GUID) and Returns the nearest object that matches

OStore

John Kubiatowicz FDIS 2000

Some current results:• Have a working algorithm for local search

– Uses attenuated bloom filters– Performs search by passing messages from node to node. All

state kept in messages!– Updates filters through semi-chaotic passing of information

between neighbors• Resembles compiler dataflow algorithm• Can be shown to converge

• Have candidate for “backing store” index– Randomized data structure with locality properties

• Every document has multiple roots in the OceanStore• Searches “close” to copy tend to find copy quickly

– Redundant, insensitive to faults, and repairable– Investigating algorithms to continually adapt routing

structure to adjust for faults and denial of service

OStore

John Kubiatowicz FDIS 2000

OceanStore Technologies II:Rapid Update in an

Untrusted Infrastructure• Requirements:

– Scalable coherence mechanism which can operate directly on encrypted data without revealing information

– Handle Byzantine failures – Rapid dissemination of committed information

• OceanStore Approach:– Operations-based interface using conflict resolution

• Modeled after Xerox Bayou updates packets include:Predicate/update pairs which operate on encryped data

• Use of oblivious function techniques to perform this update• Use of incremental cryptographic techniques

– User signs Updates and principle party signs commits– Committed data multicast to clients

OStore

John Kubiatowicz FDIS 2000

Tentative Updates:Epidemic Disemination

OStore

John Kubiatowicz FDIS 2000

Committed Updates:Multicast Dissemination

OStore

John Kubiatowicz FDIS 2000

Our State of the Art• Have techniques for protecting metadata

– Uses encryption and signatures to provide protection against substitution attacks

– Provides “secure pointer” technology• Have a working scheme that can do some forms of

conflict resolution directly on encryped data– Uses new technique for searching on encrypted data. – Can be generalized to perform optimistic concurrency,

but at cost in performance and possibly privacy• Byzantine assumptions for update commitment:

– Signatures on update requests from clients• Compromised servers are unable to produce valid updates• Uncompromised second-tier servers can make consistent

ordering decision with respect to tentative commits– Use of threshold cryptography in inner-tier of servers– Signatures on update stream from inner-tier

• Use of chained MACs to reduce overhead

OStore

John Kubiatowicz FDIS 2000

OceanStore Technologies III:High-Availability and

Disaster Recovery• Requirements:

– Handle diverse, unstable participants in OceanStore– Mitigate denial of service attacks– Eliminate backup as independent (and fallible) technology– Flexible “disaster recovery” for everyone

• OceanStore Approach:– Use of erasure-codes to provide stable storage for archival

copies and snapshots of live data– Mobile replicas are self-contained centers for logging and

conflict resolution– Version-based update for painless recovery– Continuous introspection repairs data structures and

degree of redundancy

OStore

John Kubiatowicz FDIS 2000

Floating Replicas and “Deep Archival Coding”

• Floating Replicas are per-object virtual servers– Complete copy of data– logging for updates/conflict resolution– Interaction with other centers to keep data consistent– May appear and disappear like bubbles

• Erasure coded fragments provide very stable store– Multi-level codes spread over 1000s of nodes– Could lose 1/2 of nodes and still recover data– Archive: old versions of data and checkpoints– Inactive data may only be in erasure-coded form

OStore

John Kubiatowicz FDIS 2000

Floating Replica and Deep Archival Coding

Erasure-coded Fragments

Ver1: 0x34243Ver2: 0x49873Ver3: …

FullCop

y

Conflict ResolutionLogs

Ver1: 0x34243Ver2: 0x49873Ver3: …

FullCop

y

Conflict ResolutionLogs

Ver1: 0x34243Ver2: 0x49873Ver3: …

FullCop

y

Conflict ResolutionLogs

FloatingReplica

OStore

John Kubiatowicz FDIS 2000

Structure of Archival Checkpoints

• All blocks and fragments signed• “Copy on Write” behavior• Older metablocks fragmented also

Metadata

RedoLogs

Checkpoint Reference (GUID)

. . . . .

Blocks

Unit of Archival Storage

Unit of Coding

Fragments

NOTE: Each Block needs a GUID

Metadata

RedoLogs

Checkpoint Reference (Later Version)

OStore

John Kubiatowicz FDIS 2000

Proactive Self-Maintenance• Continuous testing and repair of information

– Slow sweep through all information to make sure there are sufficient erasure-coded fragments

– Continuously reevaluate of risk and redistribute data– Slow sweep and repair of metadata/search trees

• Continuous online self-testing of HW and SW– detects flaky, failing, or buggy components via:

• fault injection: triggering hardware and software error handling paths to verify their integrity/existence

• stress testing: pushing HW/SW components past normal operating parameters

• scrubbing: periodic restoration of potentially “decaying” hardware or software state

– automates preventive maintenance

OStore

John Kubiatowicz FDIS 2000

OceanStore Technologies IV:Introspective Optimization

• Requirements:– Reasonable job on global-scale optimization problem

• Take advantage of locality whenever possible• Sensitivity to limited storage and bandwidth at endpoints

– Repair of data structures, increasing of redundancy– Stability in chaotic environment Active Feedback

• OceanStore Approach:– Introspective Monitoring and analysis of relationships to

cluster information by relatedness– Time series-analysis of user and data motion– Rearrangement and replication in response to monitoring

• Clustered prefetching: fetch related objects• Proactive-prefetching: get data there before needed• Rearrangement in response to overload and attack

OStore

John Kubiatowicz FDIS 2000

• Client observer and optimizer components– greedy agents working on the behalf of the client

• Watches client activity/combines with historical info• Performs clustering and time-series analysis • forwards results to infrastructure (privacy issues!)

– Monitoring of state of network to adapt behavior• Typical Actions:

– cluster related files together– prefetch files that will be needed soon– Create/destroy floating replicas

Example: Client Introspection

OStore

John Kubiatowicz FDIS 2000

OceanStore Technologies V:The oceanic data market

• Properties:– Utility providers have resources (storage and bandwidth) – Clients use resources both directly and indirectly

• Use of data storage and bandwidth on demand• Data movement “on behalf” of users

– Some customers are more important than others

• Techniques that we are exploring (very preliminary)– Data market driven by principle party

• Tradeoff between performance (replication) and cost– Secure signatures on data packets permit:

• Accounting of bandwidth and CPU utilization• Access control policies (Bays in OceanStore nomenclature)

– Use of challenge-response protocols (similar to zero-knowledge proofs) to demonstrate possession of data

OStore

John Kubiatowicz FDIS 2000

Two-Phase Implementation:• This term: Read-Mostly Prototype

– Construction of data location facility– Initial introspective gathering of tacit info and

adaptation– Initial archival techniques (use of erasure codes)– Unix file-system interface under Linux (“legacy

apps”)

• Later?: Full Prototype– Final conflict resolution and encryption techniques– More sophisticated tacit info gathering and

rearrangement– Final object interface and integration with Endeavour

applications– Wide-scale deployment via NTON and Internet-2

OStore

John Kubiatowicz FDIS 2000

OceanStore Conclusion• The Time is now for a Universal Data Utility

– Ubiquitous computing and connectivity is (almost) here!– Confederation of utility providers is right model

• OceanStore holds all data, everywhere– Local storage is a cache on global storage– Provides security in an untrusted infrastructure– Large scale system has good statistical properties

• Use of introspection for performance and stability• Quality of individual servers enhances reliability

• Exploits economies of scale to:– Provide high-availability and extreme survivability– Lower maintenance cost:

• self-diagnosis and repair• Insensitivity to technology changes:

Just unplug one set of servers, plug in others