Alternatives to relational DBs: What and Why…
Relational DBs (SQL) - Fixed schema, row oriented & optimized
SQL - Rigid two-phase transactions, with a locking bottleneck
SQL - Set theoretic
SQL - Centralized distribution
SQL - Set-oriented & computational, not navigational/inter-connected
SQL - Poor support for heterogeneity & compression
NoSQL - "no SQL" or "not only SQL"
Column-oriented - HBase (uses column families and no schema, has versioning and consistent transactions)
Key/value pairs - Amazon Dynamo
Graph-like - Neo4j
Document-based - MongoDB (cluster-based for huge scale, supports nested docs, uses JavaScript for queries, and no schema)
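For instance, a minimal PyMongo sketch of the document-based style; the server address, database, collection, and field names below are invented for illustration.

```python
# Minimal sketch using the PyMongo driver; server address, database,
# collection, and field names are invented for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]          # no schema is declared anywhere

# Nested, heterogeneous documents are accepted as-is.
orders.insert_one({
    "customer": {"name": "Ada", "city": "Oslo"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
})
orders.insert_one({"customer": {"name": "Bo"}, "note": "no items yet"})

# Query on a nested field; documents lacking it simply do not match.
for doc in orders.find({"customer.city": "Oslo"}):
    print(doc["_id"], doc["customer"]["name"])
```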
But remember - Categories are not distinct - take each one for what it is
Heterogeneous structure & polyglot language environments are common
NoSQL DBs tend to be unsupported, with funky GUIs - but there are very active volunteer user bases maintaining and evolving them
NoSQL DBs also tend to use programming languages for queries
When do you want non-2P transactions and no SQL?
Interactive, zillion-user apps where the user fixes errors via some form of compensation
Minimal interconnectedness
Individual data values are not mission-critical
Read-heavy environments
Cloud-based environments
Queries are not set-oriented & are computational and imperative, and perhaps long-running
Real-time apps
SQL is here to stay...
Formal & unambiguous semantics
Declarative language with clean separation of application and queries
Consistent
Flexible
Black boxed, tested, and supported - and very well understood, with many thousands of trained programmers - SQL is a basic language, like Java, JavaScript, PHP, C#, etc.
Great GUIs that are very rich and debugged
And importantly... Lots of apps need clean, well-understood stacks, not speed or the cloud
In particular, websites that do retail business need consistent transactions and do not need the speed that comes with delayed updates
Relational DBs scale reasonably well, too, at least in non-cloud environments
Changing trends
Relational DBMSs were engineered for multi-app sharing
Relational DBMSs were focused on an independent "state", separate from apps
But now, many complex environments are essentially single-app
Correspondingly, the data-structuring impedance mismatch seems unreasonable
Aspects of modern apps
Databases are being used more and more for planning and approximate uses
But relational databases are very flat
They are engineered for atomic, exact transactions
Modern applications also tend to manipulate objects that consist of multiple sets or lists of things, because they are inexact or partly user-driven - websites are often like this
The new, single-app, aggregate approach
The application does not compete with other applications for data access
So the application takes care of ACID properties implicitly
The set or list or aggregate approach provides a convenient unit for correctness
Modern apps tend to group things in sets of temporal versions, like the past five versions of a document or webpage
New approaches to data modeling
Key/value
Keys are arbitrary and identify a set of values
What is returned, though, is a blob as far as the DB is concerned
Key/document
The key can in effect be a query based on the content of the document, but it is a rigid query based on the structure and purpose of the document
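A small in-memory sketch of the contrast (the store shapes and field names are invented): in the key/value case the value is an opaque blob the store cannot look inside, while in the key/document case a lookup can be driven by the document's content.

```python
# Key/value: the store maps arbitrary keys to opaque blobs.
kv_store = {}
kv_store["user:42:cart"] = b'{"items": [1, 2, 3]}'   # the DB sees only bytes
blob = kv_store["user:42:cart"]                       # the app must decode it

# Key/document: the store understands some of the document's structure,
# so a lookup can be a content-based query rather than an exact key.
doc_store = {
    "claim-1": {"subscriber": "S9", "amount": 120.0},
    "claim-2": {"subscriber": "S9", "amount": 40.0},
    "claim-3": {"subscriber": "S3", "amount": 75.0},
}

def find(store, **criteria):
    """Return documents whose fields match all of the given criteria."""
    return [d for d in store.values()
            if all(d.get(k) == v for k, v in criteria.items())]

print(find(doc_store, subscriber="S9"))   # the two claims for subscriber S9
```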
New approaches, continued
Column-family approaches
Like Google BigTable
Groups of columns are stored together
There might be a group for items ordered
And a group for attributes of a customer, like name, address, credit card, etc.
Important: it isn't key/value, or key/document, or key/column families that matter, it is the overall philosophy
And many new database approaches meet multiple criteria
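A rough in-memory sketch of the column-family layout above (family and column names are invented): each row key maps to named groups of columns, and a group can be fetched without touching the others.

```python
# Each row key maps to named column families; columns within a family are
# stored (and typically co-located) together, independently of other families.
rows = {
    "order-1001": {
        "items":    {"sku:A1": 2, "sku:B7": 1},          # items-ordered family
        "customer": {"name": "Ada", "city": "Oslo",      # customer-attributes family
                     "card": "****-4242"},
    },
    "order-1002": {
        "items":    {"sku:C3": 5},
        "customer": {"name": "Bo"},                      # sparse rows are fine
    },
}

def read_family(row_key, family):
    """Fetch one column family for one row without touching the others."""
    return rows[row_key].get(family, {})

print(read_family("order-1001", "customer"))
```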
Accommodating widely distributed, cluster-based environments
A group of values (list, set, aggregate) can be co-located on a cluster, even if the set of values is big
The set is often fixed and not fluid, as in a set-oriented, atomic approach
Differences
Key/value has a black-boxed data set
Key/document supports internal structure, but not in the schema sense; each doc is unique
Key/column-family DBs give us something similar to a relational schema, but still give us that cluster-friendly notion
Graph databases
What do we do when lists/groups/aggregates must be restructured dynamically?
This is the heart of flat, set-based data grouping, as in relational DBs
But with no static graph/object structure, this leads to costly joins
A non-cluster approach
Lots of complex interconnections between atomic things
Graph-based
Flat, with no structure within nodes
The expensive operation is insert, where the graph is updated
Searching involves mostly short graph traversals
Performance is better on a single server
The issue of "no schema"
Allows dynamic decisions about placing an object in a distributed store
For graph databases, we can insert data at will, but work with a small number (or one) of servers
Allows for similar, but not identically structured, data - like documents
Complications of schema-less
We cannot leverage a small structure against a large volume of data
This forces application programs to infer a schema from their logical flow and from the values stored in the data they retrieve
So it is easy to misinterpret data
It is harder to share data accurately
Could this trend of having no schema die off?
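A small sketch of that inference (the record shapes are invented): without a schema, the reading code has to probe each retrieved record for the fields and shapes it hopes are there.

```python
# Without a schema, every reader must guess the structure of what comes back.
records = [
    {"name": "Ada", "phone": "555-0100"},
    {"full_name": "Bo Berg", "phones": ["555-0101", "555-0102"]},
    {"name": {"first": "Cy", "last": "Chan"}},
]

def display_name(rec):
    """Infer a printable name from whichever fields happen to exist."""
    name = rec.get("name") or rec.get("full_name")
    if isinstance(name, dict):                      # a nested variant
        return f"{name.get('first', '')} {name.get('last', '')}".strip()
    return name or "<unknown>"

for rec in records:
    print(display_name(rec))   # easy to get subtly wrong, hard to share
```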
The cloud and cluster focus…
Mixture of replication and partitioning
Major trade-off: read accessibility vs. minimizing write costs
Focus on maximizing machine assets as opposed to maximum control/security
Various server and cluster cloud models…
Single server
Breaking a complex object up by attributes (columns) - this is sharding - and placing them on a small number of machines
Replication
Primary copy? (n/2)+1 locking? Lock one for reading, lock all for writing?
Bottleneck issues
Primary copies are worst
Sharding creates localized bottlenecks where all machines holding a shard must be accessible
The read-to-write ratio of the transaction load reveals the benefits of lock-half vs. lock-all for writing (see the sketch below)
Try to keep replicas and related shards in a single cluster in the cloud
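A back-of-the-envelope sketch of that read/write trade-off (the cost model, counting one unit per replica lock, is an assumption, not from the source): with n replicas, lock-one-for-reading/lock-all-for-writing is compared against (n/2)+1 majority locking for both reads and writes.

```python
# Rough cost model (assumed): the cost of an operation is the number of
# replica locks it must acquire; n replicas, majority = n//2 + 1.
def expected_locks(n, read_fraction, scheme):
    majority = n // 2 + 1
    if scheme == "read_one_write_all":
        read_cost, write_cost = 1, n
    elif scheme == "majority_majority":              # "(n/2)+1 locking"
        read_cost, write_cost = majority, majority
    else:
        raise ValueError(scheme)
    return read_fraction * read_cost + (1 - read_fraction) * write_cost

n = 5
for read_fraction in (0.5, 0.9, 0.99):
    a = expected_locks(n, read_fraction, "read_one_write_all")
    b = expected_locks(n, read_fraction, "majority_majority")
    print(f"reads={read_fraction:.0%}: read-one/write-all={a:.2f}  "
          f"majority/majority={b:.2f}")
# Read-heavy loads favour read-one/write-all; write-heavy loads favour majorities.
```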
Major concern: consistency
Without a 2-phase protocol, we can get inconsistent reads, lost writes, all the things we talked about with respect to transactions
But locking is costly in a cloud environment, especially in a column environment
And locking makes graph insertions expensive - lots of connections
Things to be flexible about…
Replication consistency
Durability of updates
A way to control both of these: a sort of voting where, if enough copies agree, the value is correct and is made durable
Another approach to loosening up…
Named versions
Clock versions
Averaging or voting
A common technique: Map-Reduce
Cluster friendly
Another perspective on cluster-based processing - comparing three environments
In transaction processing systems, we execute a query on a database server
Or we are in a distributed environment with multiple machines providing data and processing capabilities
In a cluster environment, we are midway between these two environments
Hadoop
An example of this is Hadoop
The goal is to support parallel processing of huge databases
It is an Apache project
It contains HBase, a distributed, highly scalable database
Map-Reduce example
Consider a transaction-based system that processes many thousands of claims a day
We might have sets of related claims submitted by the same person
We could form aggregate objects
A claim set ID references a subscriber ID and a policy ID of a person who is a subscriber
A claim set ID also references a set of triples, each of which has a claim amount, a medical procedure number, and a count of the number of times the procedure was carried out; there can be any number of these triples
Each aggregate object is a single record
Map:
Input - one aggregate record: claim set ID, subscriber ID, policy ID, and a set of (medical procedure ID, count of the number of applications of the procedure, cost of a single procedure instance)
Given an instance of the above, mapping produces a set of (medical procedure ID, total cost) pairs. The total cost is summed over all applications of the procedure, but only at the per-subscriber level, so a procedure ID repeats across the output of different maps.
Reduce:
Takes a set of these pairs and returns a single object containing a procedure ID and a total amount
Important: these are drawn from multiple subscribers and across all applications of a procedure, so the result tells us what we are being charged for each procedure as a whole.
Things to note…
Each map process is independent, so the maps can be done in parallel without any bottleneck for summing them up or sharing the pieces of one aggregate record
The reduce grabs objects with the same procedure ID and sums up the total dollars
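A minimal single-process sketch of the map and reduce just described (the record layout and numbers are invented; a real cluster job such as Hadoop would run the same two functions in parallel across machines):

```python
from collections import defaultdict

# One aggregate record per claim set: (claim_set_id, subscriber_id, policy_id,
# [(procedure_id, application_count, cost_per_application), ...])
claim_sets = [
    ("cs1", "sub1", "pol1", [("proc7", 2, 50.0), ("proc9", 1, 200.0)]),
    ("cs2", "sub2", "pol9", [("proc7", 1, 50.0)]),
]

def map_claim_set(record):
    """Emit (procedure ID, total cost) pairs for one aggregate record."""
    _cs_id, _sub_id, _pol_id, triples = record
    per_procedure = defaultdict(float)
    for procedure_id, count, unit_cost in triples:
        per_procedure[procedure_id] += count * unit_cost
    return list(per_procedure.items())

def reduce_by_procedure(pairs):
    """Group pairs from all subscribers by procedure ID and sum the dollars."""
    totals = defaultdict(float)
    for procedure_id, cost in pairs:
        totals[procedure_id] += cost
    return dict(totals)

# Each map call is independent, so the calls could run in parallel.
mapped = [pair for record in claim_sets for pair in map_claim_set(record)]
print(reduce_by_procedure(mapped))   # {'proc7': 150.0, 'proc9': 200.0}
```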
Sequencing of map/reduce procedures
The approach can be used not just to increase parallelism in a cluster-based environment with a single pass over the data
Reduce ops can be chained together: map/reduce -> map/reduce
Suppose each original object also carries a month, at the level of the inner triple
We can create a key/value association for each triple (map), where there is a month
And then these triples can be reduced into groups according to dates
Each aggregate record is now: claim set ID, subscriber ID, policy ID, and a set of (medical procedure ID, total cost, number of applications, month)
Mapping produces sets of (medical procedure ID, total dollars, month), where a medical procedure ID can repeat
Reducing gives us sets of (procedure ID, month, total amount), where we add over procedure ID
We can further map this to give us sets of (month, total amount) pairs, where we ignore procedure ID and a month can repeat
We can then reduce this to give us a set of (month, dollar amount), where we sum over months
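Continuing the same invented layout, a sketch of the chained passes: a first map/reduce that sums per (procedure, month), and a second map/reduce over its output that sums per month alone.

```python
from collections import defaultdict

# Inner entries now carry a month: (procedure_id, total_cost, applications, month)
claim_sets = [
    ("cs1", "sub1", "pol1", [("proc7", 100.0, 2, "Jan"), ("proc9", 200.0, 1, "Feb")]),
    ("cs2", "sub2", "pol9", [("proc7", 50.0, 1, "Jan")]),
]

# Pass 1: map to (procedure ID, dollars, month), reduce over (procedure, month).
def map_pass1(record):
    _, _, _, entries = record
    return [(proc, cost, month) for proc, cost, _count, month in entries]

def reduce_pass1(triples):
    totals = defaultdict(float)
    for proc, cost, month in triples:
        totals[(proc, month)] += cost
    return [(proc, month, total) for (proc, month), total in totals.items()]

# Pass 2: map drops the procedure ID, reduce sums per month.
def map_pass2(triples):
    return [(month, total) for _proc, month, total in triples]

def reduce_pass2(pairs):
    totals = defaultdict(float)
    for month, amount in pairs:
        totals[month] += amount
    return dict(totals)

pass1 = reduce_pass1([t for r in claim_sets for t in map_pass1(r)])
print(reduce_pass2(map_pass2(pass1)))   # {'Jan': 150.0, 'Feb': 200.0}
```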
Important
A data environment must be modeled carefully to allow for multiple layers
There are other ways that map/reduce instances can be combined
Why no schemas?
You don't have to know the structure of what you are going to store
Heterogeneous objects can be treated as equals
But the application is now controlling data semantics
It has nothing to start with
Code is much harder to view than schemas
Views and key/value, key/document DBs
Materialize?
Always up to date
Recreate/update upon request
Take advantage of heterogeneous structure: materialized views can be treated like core data
Let a wide variety of users use the DB and materialize whatever they want
The CAP theorem
Consistency, Availability, Partition tolerance
You can only have two of these
Consistency - rigid
Availability - if you can communicate with a cluster, you can read and write from it
Partition tolerance - refers to the cluster becoming partitioned
What this really means
A system might suffer partitions from time to time
You have to trade off consistency with availability
In other words, we have to relax the ACID conditions if we want high throughput
Key/value server operations from the client
Given a key, get a value
Assign a key and value
Delete a key/value pair
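A toy in-memory sketch of that client-facing interface (the names are invented); everything the client does goes through these three operations.

```python
class KeyValueClient:
    """Toy stand-in for a key/value server's client API."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Given a key, return its value (or None if absent)."""
        return self._data.get(key)

    def put(self, key, value):
        """Assign a key and value."""
        self._data[key] = value

    def delete(self, key):
        """Delete a key/value pair."""
        self._data.pop(key, None)

kv = KeyValueClient()
kv.put("session:abc", b"cart=3 items")
print(kv.get("session:abc"))
kv.delete("session:abc")
```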
Protections
Eventually consistent in the presence of replication
Consistent if there are no replicas - due to the lack of connections between objects
Alternatives for replicas - return both values, or use a timestamp, or pick arbitrarily, or "average"
On update, consider it finished if one more than half of the copies have the value
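A sketch of those protections (replica values and timestamps invented): when replicas disagree, either return every distinct value or pick the newest by timestamp, and treat an update as finished once one more than half of the copies acknowledge it.

```python
def resolve(replica_values):
    """Return every distinct value, plus the newest one by timestamp."""
    distinct = {value for _ts, value in replica_values}
    newest = max(replica_values)[1]                # highest timestamp wins
    return distinct, newest

def update_finished(acks, replicas):
    """An update counts as done once one more than half the copies have it."""
    return acks >= replicas // 2 + 1

# Three replicas hold (timestamp, value) versions of the same key.
values = [(101, "blue"), (99, "green"), (101, "blue")]
print(resolve(values))                             # ({'blue', 'green'}, 'blue')
print(update_finished(acks=2, replicas=3))         # True: 2 of 3 is a majority
```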
Issues of keys
Usually, if you don't have the key, you don't get the data
Usually, keys are not handed out
But the key may consist of items the client user is aware of, such as dates, a session ID, or a user login
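For example, a key built only from things the client already knows (the parts here are hypothetical):

```python
from datetime import date

def session_key(user_login, session_id, day):
    """Compose a lookup key from values the client user already has."""
    return f"{user_login}:{session_id}:{day.isoformat()}"

print(session_key("ada", "sess-7f3", date(2024, 3, 1)))   # ada:sess-7f3:2024-03-01
```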
Issues of scale
Large objects can be sharded across a cluster
But when objects are extremely large, the notion of a cluster fails and we can no longer use replication and eventual consistency
Key/document DBs
Docs are hierarchical
Can be anything, e.g., XML
They are "self-defining"
Sometimes they have common attributes
Example: MongoDB
You can have an update wait until k out of m replicas are updated
Consistency cannot be guaranteed if more than one document is involved
By definition, an operation on a single document is atomic
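A PyMongo sketch of the first point (server address, database, collection, field names, and the w value are illustrative, and a running replica set is assumed): a write concern of w=k makes the insert wait until k members have the update.

```python
# Sketch using the PyMongo driver; names and w=2 are illustrative.
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")
claims = client["insurance"]["claims"]

# Wait until 2 replica-set members have the write before returning.
claims_w2 = claims.with_options(write_concern=WriteConcern(w=2))
claims_w2.insert_one({"subscriber": "S9", "procedure": "proc7", "amount": 150.0})

# A single-document update like this is atomic; no such guarantee is given
# for an operation that spans several documents.
claims_w2.update_one({"subscriber": "S9"}, {"$set": {"reviewed": True}})
```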
Examining the inside of a document
Sometimes you can query the internals of a document without returning all of it
Sometimes you can create materialized views that span multiple documents
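Continuing the PyMongo sketch (field and collection names still invented): a projection returns only part of each document, and an aggregation pipeline can materialize a result that spans many documents.

```python
from pymongo import MongoClient

claims = MongoClient("mongodb://localhost:27017")["insurance"]["claims"]

# Query a document's internals without returning all of it: the second
# argument is a projection that keeps only the requested fields.
for doc in claims.find({"subscriber": "S9"}, {"procedure": 1, "amount": 1, "_id": 0}):
    print(doc)

# A view that spans multiple documents: total charged per procedure, written
# out to its own collection ($out) so it can be queried like core data.
claims.aggregate([
    {"$group": {"_id": "$procedure", "total": {"$sum": "$amount"}}},
    {"$out": "procedure_totals"},
])
```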