Alternatives to relational DBs: What and Why…
Relational DBs (SQL) - Fixed schema, row oriented & optimized
SQL - Rigid two-phase transactions, with a locking bottleneck
SQL - Set theoretic
SQL - Centralized distribution
SQL - Set-oriented & computational, not navigational/inter-connected
SQL - Poor support for heterogeneity & compression
NoSQL - "no SQL" or "not only SQL"
Column-oriented - HBase (uses column families and no schema, has versioning and consistent transactions)
Key/value pairs - Amazon Dynamo
Graph-like - Neo4j
Document-based - MongoDB (cluster-based for huge scale, supports nested docs, uses JavaScript for queries, and no schema)
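For instance, a minimal PyMongo sketch of the document-based style; the server address, database, collection, and field names below are invented for illustration.

```python
# Minimal sketch using the PyMongo driver; server address, database,
# collection, and field names are invented for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]          # no schema is declared anywhere

# Nested, heterogeneous documents are accepted as-is.
orders.insert_one({
    "customer": {"name": "Ada", "city": "Oslo"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
})
orders.insert_one({"customer": {"name": "Bo"}, "note": "no items yet"})

# Query on a nested field; documents lacking it simply do not match.
for doc in orders.find({"customer.city": "Oslo"}):
    print(doc["_id"], doc["customer"]["name"])
```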
But remember - Categories are not distinct - take each one for what it is
Heterogeneous structure & polyglot language environments are common
NoSQL DBs tend to be unsupported, with funky GUIs - but there are very active volunteer user bases maintaining and evolving them
NoSQL DBs also tend to use programming languages for queries
When do you want non-2P transactions and no SQL?
Interactive, zillion-user apps where the user fixes errors via some form of compensation
Minimal interconnectedness
Individual data values are not mission-critical
Read-heavy environments
Cloud-based environments
Queries are not set-oriented & are computational and imperative, and perhaps long-running
Real-time apps
SQL is here to stay...
Formal & unambiguous semantics
Declarative language with clean separation of application and queries
Consistent
Flexible
Black boxed, tested, and supported - and very well understood, with many thousands of trained programmers - SQL is a basic language, like Java, JavaScript, PHP, C#, etc.
Great GUIs that are very rich and debugged
And importantly... Lots of apps need clean, well-understood stacks, not speed or the cloud
In particular, websites that do retail business need consistent transactions and do not need the speed that comes with delayed updates
Relational DBs scale reasonably well, too, at least in non-cloud environments
Changing trends
Relational DBMSs were engineered for multi-app sharing
Relational DBMSs were focused on an independent "state", separate from apps
But now, many complex environments are essentially single-app
Correspondingly, the data-structuring impedance mismatch seems unreasonable
Aspects of modern apps
Databases are being used more and more for planning and approximate uses
But relational databases are very flat
They are engineered for atomic, exact transactions
Modern applications also tend to manipulate objects that consist of multiple sets or lists of things, because they are inexact or partly user-driven - websites are often like this
The new, single-app, aggregate approach
The application does not compete with other applications for data access
So the application takes care of ACID properties implicitly
The set or list or aggregate approach provides a convenient unit for correctness
Modern apps tend to group things in sets of temporal versions, like the past five versions of a document or webpage
New approaches to data modeling
Key/value
Keys are arbitrary and identify a set of values
What is returned, though, is a blob as far as the DB is concerned
Key/document
The key can in effect be a query based on the content of the document, but it is a rigid query based on the structure and purpose of the document
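A small in-memory sketch of the contrast (the store shapes and field names are invented): in the key/value case the value is an opaque blob the store cannot look inside, while in the key/document case a lookup can be driven by the document's content.

```python
# Key/value: the store maps arbitrary keys to opaque blobs.
kv_store = {}
kv_store["user:42:cart"] = b'{"items": [1, 2, 3]}'   # the DB sees only bytes
blob = kv_store["user:42:cart"]                       # the app must decode it

# Key/document: the store understands some of the document's structure,
# so a lookup can be a content-based query rather than an exact key.
doc_store = {
    "claim-1": {"subscriber": "S9", "amount": 120.0},
    "claim-2": {"subscriber": "S9", "amount": 40.0},
    "claim-3": {"subscriber": "S3", "amount": 75.0},
}

def find(store, **criteria):
    """Return documents whose fields match all of the given criteria."""
    return [d for d in store.values()
            if all(d.get(k) == v for k, v in criteria.items())]

print(find(doc_store, subscriber="S9"))   # the two claims for subscriber S9
```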
New approaches, continued
Column-family approaches
Like Google BigTable
Groups of columns are stored together
There might be a group for items ordered
And a group for attributes of a customer, like name, address, credit card, etc.
Important: it isn't key/value, or key/document, or key/column families that matter, it is the overall philosophy
And many new database approaches meet multiple criteria
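A rough in-memory sketch of the column-family layout above (family and column names are invented): each row key maps to named groups of columns, and a group can be fetched without touching the others.

```python
# Each row key maps to named column families; columns within a family are
# stored (and typically co-located) together, independently of other families.
rows = {
    "order-1001": {
        "items":    {"sku:A1": 2, "sku:B7": 1},          # items-ordered family
        "customer": {"name": "Ada", "city": "Oslo",      # customer-attributes family
                     "card": "****-4242"},
    },
    "order-1002": {
        "items":    {"sku:C3": 5},
        "customer": {"name": "Bo"},                      # sparse rows are fine
    },
}

def read_family(row_key, family):
    """Fetch one column family for one row without touching the others."""
    return rows[row_key].get(family, {})

print(read_family("order-1001", "customer"))
```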
Accommodating widely distributed, cluster-based environments
A group of values (list, set, aggregate) can be co-located on a cluster, even if the set of values is big
The set is often fixed and not fluid, as in a set-oriented, atomic approach
Differences
Key/value has a black-boxed data set
Key/document supports internal structure, but not in the schema sense; each doc is unique
Key/column-family DBs give us something similar to a relational schema, but still give us that cluster-friendly notion
Graph databases
What do we do when lists/groups/aggregates must be restructured dynamically?
This is the heart of flat, set-based data grouping, as in relational DBs
But with no static graph/object structure, this leads to costly joins
A non-cluster approach
Lots of complex interconnections between atomic things
Graph-based
Flat, with no structure within nodes
The expensive operation is insert, where the graph is updated
Searching involves mostly short graph traversals
Performance is better on a single server
The issue of "no schema"
Allows dynamic decisions about placing an object in a distributed store
For graph databases, we can insert data at will, but work with a small number (or one) of servers
Allows for similar, but not identically structured, data - like documents
Complications of schema-less
We cannot leverage a small structure against a large volume of data
This forces application programs to infer a schema from their logical flow and from the values stored in the data they retrieve
So it is easy to misinterpret data
It is harder to share data accurately
Could this trend of having no schema die off?
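A small sketch of that inference (the record shapes are invented): without a schema, the reading code has to probe each retrieved record for the fields and shapes it hopes are there.

```python
# Without a schema, every reader must guess the structure of what comes back.
records = [
    {"name": "Ada", "phone": "555-0100"},
    {"full_name": "Bo Berg", "phones": ["555-0101", "555-0102"]},
    {"name": {"first": "Cy", "last": "Chan"}},
]

def display_name(rec):
    """Infer a printable name from whichever fields happen to exist."""
    name = rec.get("name") or rec.get("full_name")
    if isinstance(name, dict):                      # a nested variant
        return f"{name.get('first', '')} {name.get('last', '')}".strip()
    return name or "<unknown>"

for rec in records:
    print(display_name(rec))   # easy to get subtly wrong, hard to share
```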
The cloud and cluster focus…
Mixture of replication and partitioning
Major trade-off: read accessibility vs. minimizing write costs
Focus on maximizing machine assets as opposed to maximum control/security
Various server and cluster cloud models…
Single server
Breaking a complex object up by attributes (columns) - this is sharding - and placing them on a small number of machines
Replication
Primary copy? (n/2)+1 locking? Lock one for reading, lock all for writing?
Bottleneck issues
Primary copies are worst
Sharding creates localized bottlenecks where all machines holding a shard must be accessible
The read-to-write ratio of the transaction load reveals the benefits of lock-half vs. lock-all for writing (see the sketch below)
Try to keep replicas and related shards in a single cluster in the cloud
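A back-of-the-envelope sketch of that read/write trade-off (the cost model, counting one unit per replica lock, is an assumption, not from the source): with n replicas, lock-one-for-reading/lock-all-for-writing is compared against (n/2)+1 majority locking for both reads and writes.

```python
# Rough cost model (assumed): the cost of an operation is the number of
# replica locks it must acquire; n replicas, majority = n//2 + 1.
def expected_locks(n, read_fraction, scheme):
    majority = n // 2 + 1
    if scheme == "read_one_write_all":
        read_cost, write_cost = 1, n
    elif scheme == "majority_majority":              # "(n/2)+1 locking"
        read_cost, write_cost = majority, majority
    else:
        raise ValueError(scheme)
    return read_fraction * read_cost + (1 - read_fraction) * write_cost

n = 5
for read_fraction in (0.5, 0.9, 0.99):
    a = expected_locks(n, read_fraction, "read_one_write_all")
    b = expected_locks(n, read_fraction, "majority_majority")
    print(f"reads={read_fraction:.0%}: read-one/write-all={a:.2f}  "
          f"majority/majority={b:.2f}")
# Read-heavy loads favour read-one/write-all; write-heavy loads favour majorities.
```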
Major concern: consistency
Without a 2-phase protocol, we can get inconsistent reads, lost writes, all the things we talked about with respect to transactions
But locking is costly in a cloud environment, especially in a column environment
And locking makes graph insertions expensive - lots of connections
Things to be flexible about…
Replication consistency
Durability of updates
A way to control both of these: a sort of voting where, if enough copies agree, the value is correct and is made durable
Another approach to loosening up…
Named versions
Clock versions
Averaging or voting
A common technique: Map-Reduce
Cluster friendly
Another perspective on cluster-based processing - comparing three environments
In transaction processing systems, we execute a query on a database server
Or we are in a distributed environment with multiple machines providing data and processing capabilities
In a cluster environment, we are midway between these two environments
Hadoop
An example of this is Hadoop
The goal is to support parallel processing of huge databases
It is an Apache project
It contains HBase, a distributed, highly scalable database
Map-Reduce example
Consider a transaction-based system that processes many thousands of claims a day
We might have sets of related claims submitted by the same person
We could form aggregate objects
A claim set ID references a subscriber ID and a policy ID of a person who is a subscriber
A claim set ID also references a set of triples, each of which has a claim amount, a medical procedure number, and a count of the number of times the procedure was carried out; there can be any number of these triples
Each aggregate object is a single record
Map:
Input - one aggregate record: claim set ID, subscriber ID, policy ID, and a set of (medical procedure ID, count of the number of applications of the procedure, cost of a single procedure instance)
Given an instance of the above, mapping produces a set of (medical procedure ID, total cost) pairs. The total cost is summed over all applications of the procedure, but only at the per-subscriber level, so a procedure ID repeats across the output of different maps.
Reduce:
Takes a set of these pairs and returns a single object containing a procedure ID and a total amount
Important: these are drawn from multiple subscribers and across all applications of a procedure, so the result tells us what we are being charged for each procedure as a whole.
Things to note…
Each map process is independent, so the maps can be done in parallel without any bottleneck for summing them up or sharing the pieces of one aggregate record
The reduce grabs objects with the same procedure ID and sums up the total dollars
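A minimal single-process sketch of the map and reduce just described (the record layout and numbers are invented; a real cluster job such as Hadoop would run the same two functions in parallel across machines):

```python
from collections import defaultdict

# One aggregate record per claim set: (claim_set_id, subscriber_id, policy_id,
# [(procedure_id, application_count, cost_per_application), ...])
claim_sets = [
    ("cs1", "sub1", "pol1", [("proc7", 2, 50.0), ("proc9", 1, 200.0)]),
    ("cs2", "sub2", "pol9", [("proc7", 1, 50.0)]),
]

def map_claim_set(record):
    """Emit (procedure ID, total cost) pairs for one aggregate record."""
    _cs_id, _sub_id, _pol_id, triples = record
    per_procedure = defaultdict(float)
    for procedure_id, count, unit_cost in triples:
        per_procedure[procedure_id] += count * unit_cost
    return list(per_procedure.items())

def reduce_by_procedure(pairs):
    """Group pairs from all subscribers by procedure ID and sum the dollars."""
    totals = defaultdict(float)
    for procedure_id, cost in pairs:
        totals[procedure_id] += cost
    return dict(totals)

# Each map call is independent, so the calls could run in parallel.
mapped = [pair for record in claim_sets for pair in map_claim_set(record)]
print(reduce_by_procedure(mapped))   # {'proc7': 150.0, 'proc9': 200.0}
```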
Sequencing of map/reduce procedures
The approach can be used not just to increase parallelism in a cluster-based environment with a single pass over the data
Reduce ops can be chained together: map/reduce -> map/reduce
Suppose each original object also carries a month, at the level of the inner triple
We can create a key/value association for each triple (map), where there is a month
And then these triples can be reduced into groups according to dates
Each aggregate record is now: claim set ID, subscriber ID, policy ID, and a set of (medical procedure ID, total cost, number of applications, month)
Mapping produces sets of (medical procedure ID, total dollars, month), where a medical procedure ID can repeat
Reducing gives us sets of (procedure ID, month, total amount), where we add over procedure ID
We can further map this to give us sets of (month, total amount) pairs, where we ignore procedure ID and a month can repeat
We can then reduce this to give us a set of (month, dollar amount), where we sum over months
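Continuing the same invented layout, a sketch of the chained passes: a first map/reduce that sums per (procedure, month), and a second map/reduce over its output that sums per month alone.

```python
from collections import defaultdict

# Inner entries now carry a month: (procedure_id, total_cost, applications, month)
claim_sets = [
    ("cs1", "sub1", "pol1", [("proc7", 100.0, 2, "Jan"), ("proc9", 200.0, 1, "Feb")]),
    ("cs2", "sub2", "pol9", [("proc7", 50.0, 1, "Jan")]),
]

# Pass 1: map to (procedure ID, dollars, month), reduce over (procedure, month).
def map_pass1(record):
    _, _, _, entries = record
    return [(proc, cost, month) for proc, cost, _count, month in entries]

def reduce_pass1(triples):
    totals = defaultdict(float)
    for proc, cost, month in triples:
        totals[(proc, month)] += cost
    return [(proc, month, total) for (proc, month), total in totals.items()]

# Pass 2: map drops the procedure ID, reduce sums per month.
def map_pass2(triples):
    return [(month, total) for _proc, month, total in triples]

def reduce_pass2(pairs):
    totals = defaultdict(float)
    for month, amount in pairs:
        totals[month] += amount
    return dict(totals)

pass1 = reduce_pass1([t for r in claim_sets for t in map_pass1(r)])
print(reduce_pass2(map_pass2(pass1)))   # {'Jan': 150.0, 'Feb': 200.0}
```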
Important
A data environment must be modeled carefully to allow for multiple layers
There are other ways that map/reduce instances can be combined
Why no schemas?
You don't have to know the structure of what you are going to store
Heterogeneous objects can be treated as equals
But the application is now controlling data semantics
It has nothing to start with
Code is much harder to view than schemas
Views and key/value, key/document DBs
Materialize?
Always up to date
Recreate/update upon request
Take advantage of heterogeneous structure: materialized views can be treated like core data
Let a wide variety of users use the DB and materialize whatever they want
The CAP theorem
Consistency, Availability, Partition tolerance
You can only have two of these
Consistency - rigid
Availability - if you can communicate with a cluster, you can read and write from it
Partition tolerance - refers to the cluster becoming partitioned
What this really means
A system might suffer partitions from time to time
You have to trade off consistency with availability
In other words, we have to relax the ACID conditions if we want high throughput
Key/value server operations from the client
Given a key, get a value
Assign a key and value
Delete a key/value pair
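A toy in-memory sketch of that client-facing interface (the names are invented); everything the client does goes through these three operations.

```python
class KeyValueClient:
    """Toy stand-in for a key/value server's client API."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Given a key, return its value (or None if absent)."""
        return self._data.get(key)

    def put(self, key, value):
        """Assign a key and value."""
        self._data[key] = value

    def delete(self, key):
        """Delete a key/value pair."""
        self._data.pop(key, None)

kv = KeyValueClient()
kv.put("session:abc", b"cart=3 items")
print(kv.get("session:abc"))
kv.delete("session:abc")
```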
Protections
Eventually consistent in the presence of replication
Consistent if there are no replicas - due to the lack of connections between objects
Alternatives for replicas - return both values, or use a timestamp, or pick arbitrarily, or "average"
On update, consider it finished if one more than half of the copies have the value
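A sketch of those protections (replica values and timestamps invented): when replicas disagree, either return every distinct value or pick the newest by timestamp, and treat an update as finished once one more than half of the copies acknowledge it.

```python
def resolve(replica_values):
    """Return every distinct value, plus the newest one by timestamp."""
    distinct = {value for _ts, value in replica_values}
    newest = max(replica_values)[1]                # highest timestamp wins
    return distinct, newest

def update_finished(acks, replicas):
    """An update counts as done once one more than half the copies have it."""
    return acks >= replicas // 2 + 1

# Three replicas hold (timestamp, value) versions of the same key.
values = [(101, "blue"), (99, "green"), (101, "blue")]
print(resolve(values))                             # ({'blue', 'green'}, 'blue')
print(update_finished(acks=2, replicas=3))         # True: 2 of 3 is a majority
```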
Issues of keys
Usually, if you don't have the key, you don't get the data
Usually, keys are not handed out
But the key may consist of items the client user is aware of, such as dates, a session ID, or a user login
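For example, a key built only from things the client already knows (the parts here are hypothetical):

```python
from datetime import date

def session_key(user_login, session_id, day):
    """Compose a lookup key from values the client user already has."""
    return f"{user_login}:{session_id}:{day.isoformat()}"

print(session_key("ada", "sess-7f3", date(2024, 3, 1)))   # ada:sess-7f3:2024-03-01
```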
Issues of scale
Large objects can be sharded across a cluster
But when objects are extremely large, the notion of a cluster fails and we can no longer use replication and eventual consistency
Key/document DBs
Docs are hierarchical
Can be anything, e.g., XML
They are "self-defining"
Sometimes they have common attributes
Example: MongoDB
You can have an update wait until k out of m replicas are updated
Consistency cannot be guaranteed if more than one document is involved
By definition, an operation on a single document is atomic
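A PyMongo sketch of the first point (server address, database, collection, field names, and the w value are illustrative, and a running replica set is assumed): a write concern of w=k makes the insert wait until k members have the update.

```python
# Sketch using the PyMongo driver; names and w=2 are illustrative.
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")
claims = client["insurance"]["claims"]

# Wait until 2 replica-set members have the write before returning.
claims_w2 = claims.with_options(write_concern=WriteConcern(w=2))
claims_w2.insert_one({"subscriber": "S9", "procedure": "proc7", "amount": 150.0})

# A single-document update like this is atomic; no such guarantee is given
# for an operation that spans several documents.
claims_w2.update_one({"subscriber": "S9"}, {"$set": {"reviewed": True}})
```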
Examining the inside of a document
Sometimes you can query the internals of a document without returning all of it
Sometimes you can create materialized views that span multiple documents
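Continuing the PyMongo sketch (field and collection names still invented): a projection returns only part of each document, and an aggregation pipeline can materialize a result that spans many documents.

```python
from pymongo import MongoClient

claims = MongoClient("mongodb://localhost:27017")["insurance"]["claims"]

# Query a document's internals without returning all of it: the second
# argument is a projection that keeps only the requested fields.
for doc in claims.find({"subscriber": "S9"}, {"procedure": 1, "amount": 1, "_id": 0}):
    print(doc)

# A view that spans multiple documents: total charged per procedure, written
# out to its own collection ($out) so it can be queried like core data.
claims.aggregate([
    {"$group": {"_id": "$procedure", "total": {"$sum": "$amount"}}},
    {"$out": "procedure_totals"},
])
```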