Reconceiving the Web as a Distributed (NoSQL) Data System Daniel Austin PayPal, Inc. NoSQL Now! Conference August 22, 2013 V1.2

Upload: daniel-austin

Post on 08-May-2015

DESCRIPTION

[Slides from NoSQL Now! 2013] Nearly every Web request is a request for information from a database or a front-end caching system for one. Based on this concept, we can reconceive the Web as a large-scale distributed data system using NoSQL query languages across high-level protocols such as HTTP. Exploring this idea further leads us to a better understanding of the structure of the Web, and invites us to apply modern NoSQL thinking toward making it better. My goal is to re-orient people’s thinking toward the Web as a big NoSQL data system and then explore the implications.

TRANSCRIPT

Page 1: Reconceiving the Web as a Distributed (NoSQL) Data System

Reconceiving the Web as a Distributed (NoSQL) Data System

Daniel Austin
PayPal, Inc.
NoSQL Now! Conference
August 22, 2013
V1.2

Page 2: Reconceiving the Web as a Distributed (NoSQL) Data System

The Big Idea

“The World-Wide Web is the World’s Largest NoSQL Distributed Data System”

Page 3: Reconceiving the Web as a Distributed (NoSQL) Data System

The Mind Map

Page 4: Reconceiving the Web as a Distributed (NoSQL) Data System

History

• DNS (1983): The first large-scale DDS, using flat files
• WWW (1989): “a single user-interface to many large classes of stored information such as reports, notes, data-bases, computer documentation and on-line systems help” (Berners-Lee & Cailliau, 1989)

But Why NoSQL?

Page 5: Reconceiving the Web as a Distributed (NoSQL) Data System

WWWDB: Anatomy

WWW

HTML (Presentation)

URI (Addressing)

HTTP (Transport)

Page 6: Reconceiving the Web as a Distributed (NoSQL) Data System

Typology of Hyperlink Queries

• Hypertext links come in two flavors: transitive and intransitive
• Transitive queries are usually for inactive content – presentation material to supplement the user’s queried data
• Intransitive queries are user-actuated and usually provide navigation and business logic for the query
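The two link flavors can be illustrated with a small sketch. It assumes one reading of the slide: links a browser fetches automatically (images, scripts, stylesheets) count as transitive, and links that wait for a user action (anchors, forms) count as intransitive. The tag mapping, the `LinkClassifier` name, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Assumed mapping (not from the slides): (tag, attribute) pairs whose targets
# are fetched automatically vs. those that wait for a user action.
TRANSITIVE = {("img", "src"), ("script", "src"), ("link", "href")}
INTRANSITIVE = {("a", "href"), ("form", "action")}


class LinkClassifier(HTMLParser):
    """Collects link targets from a page, split into the two flavors."""

    def __init__(self):
        super().__init__()
        self.transitive = []
        self.intransitive = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if value is None:
                continue
            if (tag, name) in TRANSITIVE:
                self.transitive.append(value)
            elif (tag, name) in INTRANSITIVE:
                self.intransitive.append(value)


def classify_links(html):
    """Return (transitive_targets, intransitive_targets) for an HTML string."""
    parser = LinkClassifier()
    parser.feed(html)
    return parser.transitive, parser.intransitive


page = """
<html><head><link rel="stylesheet" href="/static/site.css"></head>
<body><img src="/static/logo.png">
<a href="/account">My account</a>
<form action="/search"><input name="q"></form>
</body></html>
"""
transitive, intransitive = classify_links(page)
print("transitive:", transitive)
print("intransitive:", intransitive)
```

In this sample the stylesheet and logo are classified as transitive, while the account link and search form are classified as intransitive.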

Page 7: Reconceiving the Web as a Distributed (NoSQL) Data System

Data Clients Query Data Sources

Page 8: Reconceiving the Web as a Distributed (NoSQL) Data System

What Do HTTP URIs Identify?

• Not a single resource
• WWWDB query syntax is split between HTTP ‘verbs’ (POST, GET, PUT, DELETE) and their objects, addressed by URIs
• URI encapsulates a resource as the object identified by a query

(Note that transitive and intransitive hyperlinks almost always go to different locations)
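As a rough illustration of the verb/object split, the sketch below turns an HTTP request line into a query-like record. The verb-to-operation table is a loose, assumed analogy rather than anything stated on the slide, and `as_query` and the example URI are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed analogy between HTTP verbs and data operations (not from the slides).
VERB_TO_OPERATION = {
    "GET": "read",
    "POST": "create",
    "PUT": "replace",
    "DELETE": "delete",
}


def as_query(verb, uri):
    """Describe an HTTP request as a query: the verb is the operation, the URI is its object."""
    parts = urlsplit(uri)
    return {
        "operation": VERB_TO_OPERATION[verb.upper()],
        "source": parts.netloc,                  # which data source is asked
        "object": parts.path,                    # what the query addresses
        "predicates": parse_qs(parts.query),     # selection criteria, if any
    }


query = as_query("GET", "http://example.com/orders?status=open&limit=10")
print(query)
```

The same URI paired with a different verb yields a different operation on the same object, which is one way to read the claim that the query syntax is split across the two.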

Page 9: Reconceiving the Web as a Distributed (NoSQL) Data System

CDN as a Caching Mechanism

• CDNs such as Akamai and CloudFront provide local caching services for WWWDB, mostly for static, presentation-related objects
– Frequency-based caching for transitive hyperlinks
– Most secondary queries go to the CDN
– 95%+ of all the bytes transported over the Web
– ~90% of all WWWDB queries (HTTP requests/responses)
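Frequency-based caching can be sketched as a tiny least-frequently-used cache in front of an origin server. This is a toy model under stated assumptions, not a description of how Akamai or CloudFront work; `FrequencyCache`, `origin`, and the URIs are hypothetical, and the capacity is assumed to be at least 1.

```python
from collections import Counter


class FrequencyCache:
    """Toy edge cache that keeps the most frequently requested objects."""

    def __init__(self, capacity, fetch_from_origin):
        self.capacity = capacity          # assumed >= 1
        self.fetch = fetch_from_origin    # callable: uri -> body
        self.store = {}                   # uri -> cached body
        self.hits = Counter()             # uri -> request count
        self.origin_requests = 0

    def get(self, uri):
        self.hits[uri] += 1
        if uri in self.store:
            return self.store[uri]        # served from the edge
        self.origin_requests += 1
        body = self.fetch(uri)
        if len(self.store) >= self.capacity:
            coldest = min(self.store, key=lambda u: self.hits[u])
            if self.hits[coldest] >= self.hits[uri]:
                return body               # newcomer is not popular enough to cache
            del self.store[coldest]       # evict the least requested object
        self.store[uri] = body
        return body


def origin(uri):
    return f"body of {uri}"


cdn = FrequencyCache(capacity=2, fetch_from_origin=origin)
for uri in ["/logo.png", "/logo.png", "/site.css", "/page1", "/logo.png", "/site.css"]:
    cdn.get(uri)
print(cdn.origin_requests, sorted(cdn.store))
```

Of the six requests, only three reach the origin; the repeated static assets stay at the edge while the one-off page is passed through uncached.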

Page 10: Reconceiving the Web as a Distributed (NoSQL) Data System

APIs as Secondary Queries

• Active Subqueries
• Usually dynamic
• URIs function as a selection mechanism
• Often User-Actuated, Intransitive Events
• Query results often modify the display

Page 11: Reconceiving the Web as a Distributed (NoSQL) Data System

REST as a Query Syntax Mechanism

• Common Semantics
– REST provides a means of specifying the proper query for an object in a specific state
• Demands NoSQL due to state constraints
• Uses query strings for ranged searches

Image courtesy IBM
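A ranged search expressed in a query string might look like the following sketch. The parameter names `min_total` and `max_total`, the `/orders` path, and the sample rows are assumptions made for illustration; REST itself does not prescribe them.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical data held behind a REST endpoint.
ORDERS = [
    {"id": 1, "total": 5.0},
    {"id": 2, "total": 20.0},
    {"id": 3, "total": 50.0},
    {"id": 4, "total": 75.0},
]


def ranged_search(uri, rows):
    """Filter rows by the inclusive range given in the URI's query string."""
    params = parse_qs(urlsplit(uri).query)
    # A missing bound leaves that side of the range open.
    low = float(params.get("min_total", ["-inf"])[0])
    high = float(params.get("max_total", ["inf"])[0])
    return [row for row in rows if low <= row["total"] <= high]


matches = ranged_search("/orders?min_total=10&max_total=50", ORDERS)
print([row["id"] for row in matches])
```

Dropping either parameter turns the query into a half-open range, so `/orders?min_total=50` selects every order of 50 or more.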

Page 12: Reconceiving the Web as a Distributed (NoSQL) Data System

Indexing WWWDB

• Google, Bing, Yahoo! and other ‘index searches’ on WWWDB
– Inconsistent results are accepted
• Query Cache or a Data Cache?
• Secondary Query Routing
• Alternative query indices – Wolfram Alpha, Index Mundi, Twitter act as ‘almanacs’

Page 13: Reconceiving the Web as a Distributed (NoSQL) Data System

Does the CAP Theorem Apply?

Yes, It Does, But Only Partially

• Partition and Availability – 404’s, DDOS
• WWWDB Relaxes the Consistency Constraint
• We accept inconsistent queries and broken links as a tradeoff for real-time availability and high-velocity updates

But We Can Do Better!
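The relaxed consistency constraint can be seen in a minimal sketch of a cache that keeps answering from a stale copy until its time-to-live expires. The `Origin` and `EdgeCache` classes, the 60-second TTL, and the explicit `now` clock argument are all assumptions made to keep the example deterministic.

```python
class Origin:
    """Hypothetical authoritative data source."""

    def __init__(self):
        self.data = {}


class EdgeCache:
    """Serves cached copies for `ttl` seconds, even if the origin has changed."""

    def __init__(self, origin, ttl):
        self.origin = origin
        self.ttl = ttl
        self.cached = {}  # uri -> (value, time it was fetched)

    def get(self, uri, now):
        entry = self.cached.get(uri)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]              # always available, possibly stale
        value = self.origin.data[uri]    # refresh from the origin
        self.cached[uri] = (value, now)
        return value


origin = Origin()
edge = EdgeCache(origin, ttl=60)

origin.data["/price"] = "10"
first = edge.get("/price", now=0)        # fetched from the origin

origin.data["/price"] = "12"             # the origin is updated
stale = edge.get("/price", now=30)       # still inside the TTL: old value
fresh = edge.get("/price", now=61)       # TTL expired: new value
print(first, stale, fresh)
```

Between the update and the TTL expiry, the edge and the origin disagree; the client gets an answer immediately, which is the trade the slide describes.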

Page 14: Reconceiving the Web as a Distributed (NoSQL) Data System

Drawbacks of the CAP Model

• Caching – All data is not cached everywhere
– Some sites are single-location/single-source
– Hard (static) assets are far more widely cached
• What does CAP mean when data is only partially distributed?
– Very little – consistency only applies to part of the queries

Page 15: Reconceiving the Web as a Distributed (NoSQL) Data System

Improving WWWDB

• Better Data Clients
– HTML5 provides new query mechanisms via Web Sockets, WebStorage, and other means
– Still mostly presentation-level improvements
• Better Caching, Distribution & Transport
– Work currently being done at IETF on HTTP 2.0
• Better Queries
– Very little work being done – more on this later!

Page 16: Reconceiving the Web as a Distributed (NoSQL) Data System

RDF and the Semantic Web

• Changes query patterns but not storage
– Queries based on semantic ID of resource
• Requires content to be semantically labeled
• Work on SPARQL reduces query limitations
– But may also make things slower (!)
• Cloud computing and query distribution will prove a more powerful force for improving WWWDB than semantic queries

Page 17: Reconceiving the Web as a Distributed (NoSQL) Data System

Browsers as Data Clients

• Presentation First!
– Data is treated as secondary
• Designed for Browsing Not Querying
– Query patterns are inefficient
– Semi-stateful nature of Web sessions
• Bedeviled with Legacy Issues

Page 18: Reconceiving the Web as a Distributed (NoSQL) Data System

Optimizing Web Queries

• REST doesn’t imply FAST
– Use a domain model to limit query endpoints
– May require unnecessary requests

• Query-string semantics allows for joins, arbitrary comparison

• Recognize that some queries require state and use it

• Distribute intransitive queries more widely

Page 19: Reconceiving the Web as a Distributed (NoSQL) Data System

Reforming Hypertext for Querying WWWDB

• Enlarge the number of link types
• Distinguish transitive links
• Add bidirectional linking
• Enhance the semantics of the query string
• Make hypertext more useful for mobile and devices
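Bidirectional linking can be approximated today by inverting a forward-link map into a backlink index. The sketch below assumes the forward links have already been collected; `build_backlinks` and the sample pages are hypothetical.

```python
from collections import defaultdict


def build_backlinks(forward_links):
    """Invert {source: [targets]} into {target: [sources]}."""
    backlinks = defaultdict(set)
    for source, targets in forward_links.items():
        for target in targets:
            backlinks[target].add(source)
    # Sort each source list so the result is stable and easy to compare.
    return {target: sorted(sources) for target, sources in backlinks.items()}


forward = {
    "/home": ["/products", "/about"],
    "/products": ["/about", "/home"],
    "/blog": ["/products"],
}
backlinks = build_backlinks(forward)
print(backlinks["/products"])
```

With the inverted index, a page can answer "who links to me?" as cheaply as "whom do I link to?", which ordinary one-way hyperlinks cannot do on their own.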

Page 20: Reconceiving the Web as a Distributed (NoSQL) Data System

IPv6 and Query Routing for WWWDB

• The IPv6 space is large enough to allow for multiple query addressing schemes:
– Semantic addressing of objects by type
– Objects in the Internet of Things
– Dynamic, context driven addressing
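One way to picture semantic addressing by type is to reserve some bits of the 64-bit interface identifier for an object type and the rest for an object ID. The layout below (16 type bits, 48 ID bits), the `TYPE_IDS` table, and the function names are purely hypothetical; the slides do not specify a scheme. The prefix is the IPv6 documentation range.

```python
import ipaddress

# Documentation prefix, used here only as a placeholder network.
PREFIX = ipaddress.IPv6Network("2001:db8::/64")

# Hypothetical registry of object types.
TYPE_IDS = {"sensor": 1, "document": 2}
TYPE_NAMES = {number: name for name, number in TYPE_IDS.items()}

ID_BITS = 48  # assumed split: 16 bits of type, 48 bits of object ID


def semantic_address(object_type, object_id):
    """Encode (type, id) into the low 64 bits of an address under PREFIX."""
    if not 0 <= object_id < 2**ID_BITS:
        raise ValueError("object_id does not fit in 48 bits")
    suffix = (TYPE_IDS[object_type] << ID_BITS) | object_id
    return PREFIX[suffix]  # indexing a network yields the address at that offset


def describe(address):
    """Decode an address produced by semantic_address back into (type, id)."""
    suffix = int(address) & (2**64 - 1)
    return TYPE_NAMES[suffix >> ID_BITS], suffix & (2**ID_BITS - 1)


address = semantic_address("sensor", 42)
print(address, describe(address))
```

Because the type lives in fixed bit positions, a router or resolver could in principle select all objects of one type by matching on those bits alone.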

Page 21: Reconceiving the Web as a Distributed (NoSQL) Data System

Scaling the WWWDB

• This may require expanding our notions of URIs and links (queries)

• Semantic mapping of resources requires additional complexity for queries

• Explicit state management for efficiency

Every system has a scaling limit

Page 22: Reconceiving the Web as a Distributed (NoSQL) Data System

Final Thoughts

• The Web is the largest NoSQL Distributed Data System
– URIs address the resultset of a NoSQL query
– Transitive and Intransitive hyperlinks

• We can add power and simplicity to our queries by carefully reforming the URI syntax and the current implementations of hypertext

• HTTP and HTML are undergoing significant evolution – now it’s time for URIs!

Page 23: Reconceiving the Web as a Distributed (NoSQL) Data System

Reconceiving the Web as a Distributed Data System

Thank You!

@daniel_b_austin