Marlon Louise Parco | Emmanuel Alarillo | Alex Rabago | Ezekiel Contreras | Bayani Sagisag

SQL has ruled for two decades

• Store persistent data
• Application Integration
• Mostly Standard
• Concurrency Control
• Reporting

but SQL’s dominance is cracking

• Relational databases are designed to run on a single machine, so to scale, you need to buy a bigger machine.
• But it’s cheaper and more effective to scale horizontally by buying lots of machines.
• The machines in these large clusters are individually unreliable, but the overall cluster keeps working even as machines die, so the overall cluster is reliable.
• The “cloud” is exactly this kind of cluster, which means relational databases don’t play well with the cloud.
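To make the "unreliable machines, reliable cluster" idea concrete, here is a minimal in-memory sketch, assuming simple hash partitioning with one extra replica per key. The names `Cluster`, `replicas_for`, and the node labels are hypothetical and do not come from any particular database; real systems typically use consistent hashing so that adding or removing a node does not reshuffle most keys.

```python
import hashlib


def replicas_for(key: str, nodes: list[str], copies: int = 2) -> list[str]:
    """Return the nodes that should hold `key`: a hashed primary plus its successors."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(min(copies, len(nodes)))]


class Cluster:
    """Toy cluster: each node is a dict, and every key is written to `copies` nodes."""

    def __init__(self, names: list[str], copies: int = 2) -> None:
        self.names = list(names)
        self.copies = copies
        self.stores = {name: {} for name in self.names}
        self.alive = set(self.names)

    def put(self, key: str, value) -> None:
        # Write to every replica that is currently up.
        for name in replicas_for(key, self.names, self.copies):
            if name in self.alive:
                self.stores[name][key] = value

    def get(self, key: str):
        # Read from the first live replica that has the key.
        for name in replicas_for(key, self.names, self.copies):
            if name in self.alive and key in self.stores[name]:
                return self.stores[name][key]
        raise KeyError(key)

    def kill(self, name: str) -> None:
        # Simulate a machine dying.
        self.alive.discard(name)


cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.put("user:42", {"name": "Ana"})
primary = replicas_for("user:42", cluster.names)[0]
cluster.kill(primary)
print(cluster.get("user:42"))  # served by the surviving replica
```

The read still succeeds after the primary dies because the write went to two machines; that redundancy is what lets the cluster outlive its individual nodes.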

but SQL’s dominance is cracking

• The rise of web services provides an effective alternative to shared databases for application integration, making it easier for different applications to choose their own data storage.

so now we have NoSQL databases

• There is no formal definition of NoSQL, but there are some common characteristics of NoSQL databases:

• Big Data
• Schemaless
• Programmer-friendly
• Availability
• Highly scalable
• Low-latency
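As an illustration of what "schemaless" means in practice, here is a minimal sketch, assuming a document-style collection held in memory. The `Collection` class and the sample records are hypothetical; the point is only that records in the same collection need not share the same fields.

```python
class Collection:
    """Toy schemaless collection: each record is a plain dict with any fields."""

    def __init__(self) -> None:
        self._docs: list[dict] = []

    def insert(self, doc: dict) -> None:
        # No schema check: whatever fields the caller supplies are stored.
        self._docs.append(dict(doc))

    def find(self, **criteria) -> list[dict]:
        # A record matches only if it has every requested field with that value.
        return [
            doc for doc in self._docs
            if all(key in doc and doc[key] == value for key, value in criteria.items())
        ]


people = Collection()
people.insert({"name": "Ana", "city": "Manila"})
people.insert({"name": "Ben", "city": "Cebu", "languages": ["Erlang", "Java"]})
people.insert({"name": "Cara"})  # no city at all, still accepted

print(people.find(city="Manila"))
```

In a relational table the third record would need a `NULL` column or a schema change; here the missing field is simply absent.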

Examples include…

• Key-Value Store
• Document Store
• Big Table
• XML DB
• Object Store

Let's compare!

Scalaris

Written in: Erlang
Main point: Distributed P2P key-value store
License: Apache
Protocol: Proprietary & JSON-RPC

Best used: If you like Erlang and wanted to use Mnesia or DETS or ETS, but you need something that is accessible from more languages (and scales much better than ETS or DETS).

For example: In an Erlang-based system when you want to give access to the DB to Python, Ruby or Java programmers.

Kyoto Tycoon

Written in: C++
Main point: A lightweight network DBM
License: GPL
Protocol: HTTP (TSV-RPC or REST)

Best used: When you want to choose the backend storage algorithm engine very precisely. When speed is of the essence.

For example: Caching server. Stock prices. Analytics. Real-time data collection. Real-time communication. And wherever you used memcached before.

VoltDB

Written in: Java
Main point: Fast transactions and rapidly changing data
License: GPL 3
Protocol: Proprietary

Best used: Where you need to act fast on massive amounts of incoming data.

For example: Point-of-sales data analysis. Factory control systems.

Couchbase (ex-Membase)

Written in: Erlang & C
Main point: Memcache compatible, but with persistence and clustering
License: Apache
Protocol: memcached + extensions

Best used: Any application where low-latency data access, high concurrency support and high availability are requirements.

For example: Low-latency use-cases like ad targeting or highly-concurrent web apps like online gaming (e.g. Zynga).

The "long tail" (not widely known, but definitely worthy ones)

ElasticSearch

Written in: Java
Main point: Advanced Search
License: Apache
Protocol: JSON over HTTP (Plugins: Thrift, memcached)

Best used: When you have objects with (flexible) fields, and you need "advanced search" functionality.

For example: A dating service that handles age difference, geographic location, tastes and dislikes, etc. Or a leaderboard system that depends on many variables.

Neo4j

Written in: Java
Main point: Graph database - connected data
License: GPL, some features AGPL/commercial
Protocol: HTTP/REST (or embedding in Java)

Best used: For graph-style, rich or complex, interconnected data. Neo4j is quite different from the others in this sense.

For example: For searching routes in social relations, public transport links, road maps, or network topologies.
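To show the kind of question "searching routes in social relations" refers to, here is a minimal sketch, assuming the graph is a plain Python adjacency dictionary and the search is a breadth-first traversal. This is not Neo4j's API or query language; `shortest_path` and the `friends` data are hypothetical stand-ins for illustration only.

```python
from collections import deque


def shortest_path(graph: dict, start: str, goal: str):
    """Breadth-first search: return the shortest list of hops, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> the node we reached it from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour in previous:
                continue  # already visited
            previous[neighbour] = current
            if neighbour == goal:
                # Walk the back-pointers to rebuild the route.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


friends = {
    "ana": ["ben", "cara"],
    "ben": ["ana", "dan"],
    "cara": ["ana"],
    "dan": ["ben", "eli"],
    "eli": ["dan"],
}
print(shortest_path(friends, "ana", "eli"))  # ['ana', 'ben', 'dan', 'eli']
```

Expressing the same traversal over relational tables would take one self-join per hop, which is why connected data is a natural fit for a graph store.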

Special-purpose

Accumulo

Written in: Java and C++
Main point: A BigTable with Cell-level security
License: Apache
Protocol: Thrift

Best used: If you need a different HBase.

For example: Same as HBase, since it's basically a replacement: Search engines. Analysing log data. Any place where scanning huge, two-dimensional join-less tables is a requirement.

Hypertable

Written in: C++
Main point: A faster, smaller HBase
License: GPL 2.0
Protocol: Thrift, C++ library, or HQL shell

Best used: If you need a better HBase.

For example: Same as HBase, since it's basically a replacement: Search engines. Analysing log data. Any place where scanning huge, two-dimensional join-less tables is a requirement.

Cassandra

Written in: Java
Main point: Best of BigTable and Dynamo
License: Apache
Protocol: Thrift & custom binary CQL3

Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")

For example: Banking, financial industry (though not necessarily for financial transactions, but these industries are much bigger than that.) Writes are faster than reads, so one natural niche is data analysis.

HBase

Written in: Java
Main point: Billions of rows X millions of columns
License: Apache
Protocol: HTTP/REST (also Thrift)

Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already.

For example: Search engines. Analysing log data. Any place where scanning huge, two-dimensional join-less tables is a requirement.

Clones of Google’s Bigtable

Redis

Written in: C
Main point: Blazing fast
License: BSD
Protocol: Telnet-like

Best used: For rapidly changing data with a foreseeable database size (should fit mostly in memory).

For example: Stock prices. Analytics. Real-time data collection. Real-time communication. And wherever you used memcached before.
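The "wherever you used memcached before" use case usually means caching values that go stale quickly, such as a stock price. The following is a minimal sketch of that pattern, assuming a plain in-memory dictionary with a time-to-live per key. It does not use Redis or its client libraries; `ExpiringCache` and the sample key are hypothetical.

```python
import time


class ExpiringCache:
    """Toy in-memory cache in which every entry expires after a time-to-live."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = {}     # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds: float) -> None:
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Stale entry: drop it and behave as if it was never cached.
            del self._items[key]
            return default
        return value


prices = ExpiringCache()
prices.set("ACME", 101.5, ttl_seconds=5)
print(prices.get("ACME"))  # the cached value while it is still fresh
```

Because the whole data set lives in memory, reads and writes avoid disk access entirely, which is the trade-off behind the "should fit mostly in memory" advice.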

CouchDB

Written in: Erlang
Main point: DB consistency, ease of use
License: Apache
Protocol: HTTP/REST

Best used: For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.

For example: CRM, CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.

Riak

Written in: Erlang & C, some JavaScript
Main point: Fault tolerance
License: Apache
Protocol: HTTP/REST or custom binary

Best used: If you want Dynamo-like data storage, but there is no way you're going to deal with the bloat and complexity. If you need very good single-site scalability, availability and fault-tolerance, but you're ready to pay for multi-site replication.

For example: Point-of-sales data collection. Factory control systems. Places where even seconds of downtime hurt. Could be used as a well-update-able web server.

MongoDB

Written in: C++
Main point: Retains some friendly properties of SQL. (Query, index)
License: AGPL (Drivers: Apache)
Protocol: Custom, binary (BSON)

Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.

For example: For most things that you would do with MySQL or PostgreSQL, but where having predefined columns really holds you back.

The popular ones

So this means we can…

• Reduce Development Drag
• Embrace Large Scale

but this does not mean relational is dead

• The relational model is still relevant
• ACID Transactions
• Tools
• Familiarity

this leads us to a world of Polyglot Persistence

• Using multiple data storage technologies, chosen based upon the way data is being used by individual applications.
• Polyglot persistence will occur over the enterprise as different applications use different data storage technologies.
• It will also occur within a single application as different parts of an application’s data store have different access characteristics.
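Within a single application, polyglot persistence might look roughly like the sketch below, assuming three kinds of data with different access patterns. Everything here is hypothetical: `SessionStore` and `CatalogStore` are in-memory stand-ins for a key-value store and a document store, and `OrderLedger` uses Python's built-in `sqlite3` module as the relational store where transactions matter.

```python
import sqlite3


class SessionStore:
    """Key-value stand-in: fast lookups by session id, nothing else."""

    def __init__(self) -> None:
        self._data = {}

    def save(self, session_id: str, payload: dict) -> None:
        self._data[session_id] = payload

    def load(self, session_id: str):
        return self._data.get(session_id)


class CatalogStore:
    """Document stand-in: products with varying fields, queried by attribute."""

    def __init__(self) -> None:
        self._docs = {}

    def add(self, product_id: str, doc: dict) -> None:
        self._docs[product_id] = dict(doc)

    def search(self, **criteria) -> list[dict]:
        return [
            doc for doc in self._docs.values()
            if all(k in doc and doc[k] == v for k, v in criteria.items())
        ]


class OrderLedger:
    """Relational store: orders are written inside a transaction."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE orders ("
            "id INTEGER PRIMARY KEY, product_id TEXT NOT NULL, quantity INTEGER NOT NULL)"
        )

    def place(self, product_id: str, quantity: int) -> int:
        # `with` commits on success and rolls back if the insert raises.
        with self._conn:
            cursor = self._conn.execute(
                "INSERT INTO orders (product_id, quantity) VALUES (?, ?)",
                (product_id, quantity),
            )
        return cursor.lastrowid

    def count(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


sessions, catalog, ledger = SessionStore(), CatalogStore(), OrderLedger()
catalog.add("p1", {"name": "Lamp", "colour": "red"})
sessions.save("s1", {"cart": ["p1"]})
order_id = ledger.place("p1", 2)
print(order_id, ledger.count(), catalog.search(colour="red"))
```

Each store is chosen for how its data is used: throwaway session state by key, flexible product documents by attribute, and orders under transactional guarantees.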

What might Polyglot look like?

Polyglot Persistence provides lots of new opportunities for enterprises

THANK YOU!

NoSQL Databases

THE FUTURE IS:

Polyglot Persistence
