Reconceiving the Web as a Distributed (NoSQL) Data System

Daniel Austin
PayPal, Inc.
NoSQL Now! Conference
August 22, 2013
V1.2

The Big Idea

“The World-Wide Web is the World’s Largest NoSQL Distributed Data System”

The Mind Map

History

• DNS (1983)
– The first large-scale distributed data system (DDS), using flat files

• WWW (1989)
– “a single user-interface to many large classes of stored information such as reports, notes, data-bases, computer documentation and on-line systems help”

Berners-Lee & Cailliau, 1989

But Why NoSQL?

WWWDB: Anatomy

WWW

HTML (Presentation)

URI (Addressing)

HTTP (Transport)

Typology of Hyperlink Queries

• Hypertext links come in two flavors: transitive and intransitive

• Transitive queries are usually for inactive content – presentation material to supplement the user’s queried data

• Intransitive queries are user-actuated and usually provide navigation and business logic for the query
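
A minimal sketch of the typology above, in Python. The mapping of HTML tags to link flavors is an assumption made for illustration (tags a browser fetches automatically are treated as transitive, tags a user must activate as intransitive); the talk does not define it, and the sample page and paths are hypothetical.

```python
from html.parser import HTMLParser

# Assumed mapping (not from the talk): tag -> attribute holding the link target.
TRANSITIVE_TAGS = {"img": "src", "script": "src", "link": "href", "iframe": "src"}
INTRANSITIVE_TAGS = {"a": "href", "form": "action"}


class LinkClassifier(HTMLParser):
    """Collects hyperlink targets from a page, split by link flavor."""

    def __init__(self):
        super().__init__()
        self.transitive = []
        self.intransitive = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in TRANSITIVE_TAGS:
            target = attributes.get(TRANSITIVE_TAGS[tag])
            if target:
                self.transitive.append(target)
        elif tag in INTRANSITIVE_TAGS:
            target = attributes.get(INTRANSITIVE_TAGS[tag])
            if target:
                self.intransitive.append(target)


def classify_links(html):
    """Return (transitive_targets, intransitive_targets) in document order."""
    parser = LinkClassifier()
    parser.feed(html)
    return parser.transitive, parser.intransitive


# Hypothetical page used only to exercise the classifier.
page = """
<html><head><link rel="stylesheet" href="/static/site.css"></head>
<body>
  <img src="/static/logo.png">
  <a href="/account/history">History</a>
  <form action="/payments/send" method="post"></form>
</body></html>
"""
transitive, intransitive = classify_links(page)
print("transitive:", transitive)
print("intransitive:", intransitive)
```

A real browser decides what to fetch automatically from much more context (scripts, preload hints, CSS), so this tag-based split is only a rough approximation.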

Data Clients Query Data Sources

What Do HTTP URIs Identify?

• Not a single resource

• WWWDB query syntax is split between HTTP ‘verbs’ (POST, GET, PUT, DELETE) and their objects, addressed by URIs

• URI encapsulates a resource as the object identified by a query

(Note that transitive and intransitive hyperlinks almost always go to different locations)
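
One way to picture the verb/object split is a toy in-memory store where the HTTP verb selects the operation and the URI path selects the object. This is a sketch under assumptions: `TinyWebStore`, its `query` method, and the example.com URIs are hypothetical names, and POST is simplified to "append to a collection".

```python
from urllib.parse import urlsplit


class TinyWebStore:
    """Toy key-value store: the verb is the operation, the URI path is the key."""

    def __init__(self):
        self._objects = {}

    def query(self, verb, uri, body=None):
        path = urlsplit(uri).path  # the object the verb acts on
        if verb == "GET":
            return self._objects.get(path)
        if verb == "PUT":
            self._objects[path] = body  # create or replace the object
            return body
        if verb == "POST":
            # Simplification: treat the path as a collection and append to it.
            self._objects.setdefault(path, []).append(body)
            return body
        if verb == "DELETE":
            return self._objects.pop(path, None)
        raise ValueError(f"unsupported verb: {verb}")


store = TinyWebStore()
store.query("PUT", "http://example.com/users/42", {"name": "Ada"})
store.query("POST", "http://example.com/users/42/notes", "first note")
user = store.query("GET", "http://example.com/users/42")
notes = store.query("GET", "http://example.com/users/42/notes")
store.query("DELETE", "http://example.com/users/42")
gone = store.query("GET", "http://example.com/users/42")
print(user, notes, gone)
```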

CDN as a Caching Mechanism

• CDNs such as Akamai and CloudFront provide local caching services for WWWDB, mostly for static, presentation-related objects
– Frequency-based caching for transitive hyperlinks
– Most secondary queries go to the CDN
– 95%+ of all the bytes transported over the Web
– ~90% of all WWWDB queries (HTTP requests/responses)
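
Frequency-based caching can be sketched as a small least-frequently-used cache sitting in front of an origin. This is an illustrative sketch, not how any particular CDN works; `FrequencyCache`, `fetch_from_origin`, and the sample paths are hypothetical, and capacity is assumed to be at least 1.

```python
from collections import Counter


class FrequencyCache:
    """Edge cache that evicts the least frequently requested object when full."""

    def __init__(self, capacity, fetch_from_origin):
        self.capacity = capacity  # assumed >= 1
        self.fetch_from_origin = fetch_from_origin
        self.store = {}
        self.hits = Counter()  # request count per URI, cached or not
        self.origin_requests = 0

    def get(self, uri):
        self.hits[uri] += 1
        if uri in self.store:
            return self.store[uri]  # served from the edge
        self.origin_requests += 1
        body = self.fetch_from_origin(uri)
        if len(self.store) >= self.capacity:
            # Evict the cached object with the fewest requests so far.
            coldest = min(self.store, key=lambda cached: self.hits[cached])
            del self.store[coldest]
        self.store[uri] = body
        return body


cache = FrequencyCache(capacity=2, fetch_from_origin=lambda uri: f"<body of {uri}>")
for uri in ["/logo.png", "/logo.png", "/site.css", "/logo.png", "/promo.jpg", "/logo.png"]:
    cache.get(uri)
print(cache.origin_requests, sorted(cache.store))
```

In this run the frequently requested logo stays cached while the stylesheet, requested only once, is evicted to make room for the new image.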

APIs as Secondary Queries

• Active Subqueries

• Usually dynamic

• URIs function as a selection mechanism

• Often User-Actuated, Intransitive Events

• Query results often modify the display

REST as a Query Syntax Mechanism

• Common Semantics
– REST provides a means of specifying the proper query for an object in a specific state

• Demands NoSQL due to state constraints

• Uses query strings for ranged searches
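
A ranged search expressed in a query string might look like the sketch below. The parameter names `min_price` and `max_price`, the `PRODUCTS` records, and the example.com URI are all hypothetical; REST itself does not prescribe them.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical records standing in for the data behind a REST endpoint.
PRODUCTS = [
    {"id": 1, "price": 5},
    {"id": 2, "price": 20},
    {"id": 3, "price": 45},
    {"id": 4, "price": 80},
]


def ranged_search(uri, records):
    """Filter records by the optional min_price/max_price query parameters."""
    params = parse_qs(urlsplit(uri).query)
    # A missing bound leaves that side of the range open.
    low = float(params.get("min_price", ["-inf"])[0])
    high = float(params.get("max_price", ["inf"])[0])
    return [record for record in records if low <= record["price"] <= high]


matches = ranged_search(
    "http://example.com/products?min_price=10&max_price=50", PRODUCTS
)
open_ended = ranged_search("http://example.com/products?min_price=50", PRODUCTS)
print([m["id"] for m in matches], [m["id"] for m in open_ended])
```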

Image courtesy IBM

Indexing WWWDB

• Google, Bing, Yahoo! and other ‘index searches’ on WWWDB
– Inconsistent results are accepted

• Query Cache or a Data Cache?

• Secondary Query Routing

• Alternative query indices – Wolfram Alpha, Index Mundi, Twitter act as ‘almanacs’

Does the CAP Theorem Apply?

Yes, It Does, But Only Partially

• Partition and Availability – 404s, DDoS

• WWWDB Relaxes the Consistency Constraint

• We accept inconsistent query results and broken links as a tradeoff for real-time availability and high-velocity updates
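
The tradeoff above (availability preferred over consistency) can be sketched as a client that serves its last good copy when the origin is unreachable. `StaleTolerantClient` and `FlakyOrigin` are hypothetical names for illustration, and using `ConnectionError` to signal a partition is an assumption of this sketch.

```python
class FlakyOrigin:
    """Stand-in for a single-source site that can become unreachable."""

    def __init__(self):
        self.available = True
        self.content = {"/rates": "v1"}

    def fetch(self, uri):
        if not self.available:
            raise ConnectionError("origin unreachable")
        return self.content[uri]


class StaleTolerantClient:
    """Prefers availability: returns a possibly stale copy instead of failing."""

    def __init__(self, fetch):
        self.fetch = fetch
        self.last_good = {}

    def get(self, uri):
        try:
            body = self.fetch(uri)
        except ConnectionError:
            if uri in self.last_good:
                return self.last_good[uri], "stale"  # inconsistent but available
            raise  # nothing cached: the query simply fails
        self.last_good[uri] = body
        return body, "fresh"


origin = FlakyOrigin()
client = StaleTolerantClient(origin.fetch)
first = client.get("/rates")
origin.content["/rates"] = "v2"  # the data changes at the origin...
origin.available = False         # ...and then the origin is partitioned away
second = client.get("/rates")
print(first, second)
```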

But We Can Do Better!

Drawbacks of the CAP Model

• Caching – not all data is cached everywhere
– Some sites are single-location/single-source
– Hard (static) assets are far more widely cached

• What does CAP mean when data is only partially distributed?
– Very little – consistency only applies to part of the queries

Improving WWWDB

• Better Data Clients
– HTML5 provides new query mechanisms via WebSockets, Web Storage, and other means
– Still mostly presentation-level improvements

• Better Caching, Distribution & Transport
– Work currently being done at IETF on HTTP 2.0

• Better Queries
– Very little work being done – more on this later!

RDF and the Semantic Web

• Changes query patterns but not storage
– Queries based on semantic ID of resource

• Requires content to be semantically labeled

• Work on SPARQL reduces query limitations
– But may also make things slower (!)

• Cloud computing and query distribution will prove a more powerful force for improving WWWDB than semantic queries

Browsers as Data Clients

• Presentation First!
– Data is treated as secondary

• Designed for Browsing, Not Querying
– Query patterns are inefficient
– Semi-stateful nature of Web sessions

• Bedeviled with Legacy Issues

Optimizing Web Queries

• REST doesn’t imply FAST
– Use a domain model to limit query endpoints
– May require unnecessary requests

• Query-string semantics allow for joins and arbitrary comparisons

• Recognize that some queries require state and use it

• Distribute intransitive queries more widely

Reforming Hypertext for Querying WWWDB

• Enlarge the number of link types

• Distinguish transitive links

• Add bidirectional linking

• Enhance the semantics of the query string

• Make hypertext more useful for mobile and devices
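
As a rough illustration of typed, bidirectional links, the sketch below keeps an index that can be queried from either end of a link. This is one possible reading of the proposal, not a design from the talk; `LinkIndex`, the link type names, and the paths are hypothetical.

```python
from collections import defaultdict


class LinkIndex:
    """Stores each typed link once per direction so both ends can be queried."""

    def __init__(self):
        self.outgoing = defaultdict(set)  # source -> {(link_type, target)}
        self.incoming = defaultdict(set)  # target -> {(link_type, source)}

    def add_link(self, source, target, link_type="related"):
        self.outgoing[source].add((link_type, target))
        self.incoming[target].add((link_type, source))

    def links_from(self, uri):
        # .get avoids creating empty entries for unknown URIs.
        return sorted(self.outgoing.get(uri, ()))

    def links_to(self, uri):
        return sorted(self.incoming.get(uri, ()))


index = LinkIndex()
index.add_link("/articles/cap", "/articles/nosql", link_type="cites")
index.add_link("/articles/web", "/articles/nosql", link_type="cites")
index.add_link("/articles/cap", "/static/diagram.png", link_type="embeds")

print(index.links_from("/articles/cap"))
print(index.links_to("/articles/nosql"))
```

On today's Web only the forward direction is recorded in the page itself; the backward lookup here is what search indices reconstruct after the fact.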

IPv6 and Query Routing for WWWDB

• The IPv6 space is large enough to allow for multiple query addressing schemes:
– Semantic addressing of objects by type
– Objects in the Internet of Things
– Dynamic, context-driven addressing
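
To make "semantic addressing of objects by type" concrete, here is a speculative sketch that packs a type code and an object ID into the low 64 bits of an IPv6 address. The bit layout (16-bit type, 48-bit ID), the `TYPE_CODES` table, and the function names are assumptions invented for illustration; the prefix is the reserved documentation range 2001:db8::/32, and no such scheme is standardized.

```python
import ipaddress

# Documentation prefix, used here only as a placeholder network.
PREFIX = ipaddress.IPv6Network("2001:db8::/64")

# Hypothetical type registry: 16-bit codes for object types.
TYPE_CODES = {"document": 1, "image": 2, "sensor": 3}
TYPE_NAMES = {code: name for name, code in TYPE_CODES.items()}

ID_BITS = 48  # assumed layout: [16-bit type][48-bit object ID]


def semantic_address(object_type, object_id):
    """Build an address whose host bits encode the object's type and ID."""
    if not 0 <= object_id < 2**ID_BITS:
        raise ValueError("object_id does not fit in 48 bits")
    host_bits = (TYPE_CODES[object_type] << ID_BITS) | object_id
    return PREFIX.network_address + host_bits


def decode_address(address):
    """Recover (object_type, object_id) from an address built above."""
    host_bits = int(address) & (2**64 - 1)
    return TYPE_NAMES[host_bits >> ID_BITS], host_bits & (2**ID_BITS - 1)


address = semantic_address("sensor", 42)
print(address, decode_address(address))
```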

Scaling the WWWDB

• This may require expanding our notions of URIs and links (queries)

• Semantic mapping of resources requires additional complexity for queries

• Explicit state management for efficiency

Every system has a scaling limit

Final Thoughts

• The Web is the largest NoSQL Distributed Data System
– URIs address the resultset of a NoSQL query
– Transitive and Intransitive hyperlinks

• We can add power and simplicity to our queries by carefully reforming the URI syntax and the current implementations of hypertext

• HTTP and HTML are undergoing significant evolution – now it’s time for URIs!

Reconceiving the Web as a Distributed Data System

Thank You!

@daniel_b_austin
