Marlon Louise Parco | Emmanuel Alarillo | Alex Rabago | Ezekiel Contreras | Bayani Sagisag
SQL has ruled for two decades
• Store persistent data
• Application Integration
• Mostly Standard
• Concurrency Control
• Reporting
but SQL’s dominance is cracking
• Relational databases are designed to run on a single machine, so to scale, you need to buy a bigger machine.
• But it’s cheaper and more effective to scale horizontally by buying lots of machines.
• The machines in these large clusters are individually unreliable, but the overall cluster keeps working even as machines die, so the overall cluster is reliable.
• The “cloud” is exactly this kind of cluster, which means relational databases don’t play well with the cloud.
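The reliability point above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming each machine fails independently and that the data is copied to every replica; the function name and the 0.9 figure are illustrative assumptions, not values from the slides.

```python
def cluster_availability(machine_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent machines is up.

    Assumes failures are independent, which real clusters only approximate.
    """
    if not 0.0 <= machine_availability <= 1.0:
        raise ValueError("machine_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The data is unavailable only if every replica is down at the same time.
    all_down = (1.0 - machine_availability) ** replicas
    return 1.0 - all_down


if __name__ == "__main__":
    # Hypothetical machines that are each up 90% of the time.
    for n in (1, 2, 3, 5):
        print(n, "replica(s):", cluster_availability(0.9, n))
```

Under these assumptions, adding cheap, unreliable machines raises the availability of the cluster as a whole, which is the trade the slide describes.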
• The rise of web services provides an effective alternative to shared databases for application integration, making it easier for different applications to choose their own data storage.
so now we have NoSQL databases
• There is no formal definition of NoSQL, but there are some common characteristics of NoSQL databases
Big Data
Schemaless
Programmer-friendly
Availability
Highly Scalable
Low-latency
Examples include…
Key-Value Store
Document Store
Big Table
XMLDB
Object Store
Let’s compare!
Scalaris
Written in: Erlang
Main point: Distributed P2P key-value store
License: Apache
Protocol: Proprietary & JSON-RPC
Best used: If you like Erlang and wanted to use Mnesia or DETS or ETS, but you need something that is accessible from more languages (and scales much better than ETS or DETS).
For example: In an Erlang-based system when you want to give access to the DB to Python, Ruby or Java programmers.
Kyoto Tycoon
Written in: C++
Main point: A lightweight network DBM
License: GPL
Protocol: HTTP (TSV-RPC or REST)
Best used: When you want to choose the backend storage algorithm engine very precisely. When speed is of the essence.
For example: Caching server. Stock prices. Analytics. Real-time data collection. Real-time communication. And wherever you used memcached before.
VoltDB
Written in: Java
Main point: Fast transactions and rapidly changing data
License: GPL 3
Protocol: Proprietary
Best used: Where you need to act fast on massive amounts of incoming data.
For example: Point-of-sales data analysis. Factory control systems.
Couchbase (ex-Membase)
Written in: Erlang & C
Main point: Memcache compatible, but with persistence and clustering
License: Apache
Protocol: memcached + extensions
Best used: Any application where low-latency data access, high concurrency support and high availability are requirements.
For example: Low-latency use-cases like ad targeting or highly-concurrent web apps like online gaming (e.g. Zynga).
The "long tail" (not widely known, but definitely worthy ones)
ElasticSearch
Written in: Java
Main point: Advanced Search
License: Apache
Protocol: JSON over HTTP (Plugins: Thrift, memcached)
Best used: When you have objects with (flexible) fields, and you need "advanced search" functionality.
For example: A dating service that handles age difference, geographic location, tastes and dislikes, etc. Or a leaderboard system that depends on many variables.
Neo4j
Written in: Java
Main point: Graph database - connected data
License: GPL, some features AGPL/commercial
Protocol: HTTP/REST (or embedding in Java)
Best used: For graph-style, rich or complex, interconnected data. Neo4j is quite different from the others in this sense.
For example: For searching routes in social relations, public transport links, road maps, or network topologies.
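To illustrate what "searching routes in social relations" means, here is a minimal sketch of a breadth-first search over a small in-memory friendship graph. It is plain Python, not Neo4j's API or its query language; the `shortest_route` function and the `friends` data are hypothetical and exist only for this example.

```python
from collections import deque


def shortest_route(graph, start, goal):
    """Return the shortest path from start to goal as a list of nodes, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in previous:
                continue  # already visited
            previous[neighbour] = node
            if neighbour == goal:
                # Walk the chain of predecessors back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None  # goal is not reachable from start


# Hypothetical social graph: who knows whom.
friends = {
    "ana": ["ben", "carla"],
    "ben": ["ana", "dino"],
    "carla": ["ana", "dino"],
    "dino": ["ben", "carla", "ella"],
    "ella": ["dino"],
    "felix": [],
}

if __name__ == "__main__":
    print(shortest_route(friends, "ana", "ella"))
```

A graph database is aimed at making this kind of traversal a first-class operation on stored data, rather than something the application has to assemble from joins.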
Special-purpose
Accumulo
Written in: Java and C++
Main point: A BigTable with Cell-level security
License: Apache
Protocol: Thrift
Best used: If you need a different HBase.
For example: Same as HBase, since it's basically a replacement: Search engines. Analysing log data. Any place where scanning huge, two-dimensional join-less tables is a requirement.
Hypertable
Written in: C++
Main point: A faster, smaller HBase
License: GPL 2.0
Protocol: Thrift, C++ library, or HQL shell
Best used: If you need a better HBase.
For example: Same as HBase, since it's basically a replacement: Search engines. Analysing log data. Any place where scanning huge, two-dimensional join-less tables is a requirement.
Cassandra
Written in: Java
Main point: Best of BigTable and Dynamo
License: Apache
Protocol: Thrift & custom binary CQL3
Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")
For example: Banking, financial industry (though not necessarily for financial transactions, but these industries are much bigger than that.) Writes are faster than reads, so one natural niche is data analysis.
HBase
Written in: Java
Main point: Billions of rows X millions of columns
License: Apache
Protocol: HTTP/REST (also Thrift)
Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already.
For example: Search engines. Analysing log data. Any place where scanning huge, two-dimensional join-less tables is a requirement.
Clones of Google’s Bigtable
Redis
Written in: C/C++
Main point: Blazing fast
License: BSD
Protocol: Telnet-like
Best used: For rapidly changing data with a foreseeable database size (should fit mostly in memory).
For example: Stock prices. Analytics. Real-time data collection. Real-time communication. And wherever you used memcached before.
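The "rapidly changing data that fits in memory" use case can be pictured with a toy cache. This is a minimal sketch in plain Python, not the Redis protocol or any Redis client library; the `TinyCache` class, its method names and the injectable `clock` parameter are assumptions made for illustration.

```python
import time


class TinyCache:
    """A toy in-memory key-value cache with optional per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data = {}      # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value


if __name__ == "__main__":
    prices = TinyCache()
    prices.set("ACME", 101.5, ttl_seconds=5)  # hypothetical stock quote
    print(prices.get("ACME"))
```

Everything lives in a dictionary in RAM, which is why the whole dataset needs a foreseeable size; persistence and networking are deliberately left out of the sketch.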
CouchDB
Written in: Erlang
Main point: DB consistency, ease of use
License: Apache
Protocol: HTTP/REST
Best used: For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.
For example: CRM, CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.
Riak
Written in: Erlang & C, some JavaScript
Main point: Fault tolerance
License: Apache
Protocol: HTTP/REST or custom binary
Best used: If you want Dynamo-like data storage, but there's no way you're gonna deal with the bloat and complexity. If you need very good single-site scalability, availability and fault-tolerance, but you're ready to pay for multi-site replication.
For example: Point-of-sales data collection. Factory control systems. Places where even seconds of downtime hurt. Could be used as a well-update-able web server.
MongoDB
Written in: C++
Main point: Retains some friendly properties of SQL. (Query, index)
License: AGPL (Drivers: Apache)
Protocol: Custom, binary (BSON)
Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.
For example: For most things that you would do with MySQL or PostgreSQL, but having predefined columns really holds you back.
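The idea of dynamic queries without predefined columns can be sketched in a few lines. This is an in-memory illustration only, not MongoDB's API or its drivers; `DocumentCollection`, `insert` and `find` are hypothetical names chosen for the example.

```python
class DocumentCollection:
    """A toy schemaless collection: each document is a free-form dict."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        # Store a copy so later changes to the caller's dict don't leak in.
        self._docs.append(dict(doc))

    def find(self, **criteria):
        """Return copies of all documents whose fields equal every criterion."""
        return [
            dict(doc)
            for doc in self._docs
            if all(key in doc and doc[key] == value
                   for key, value in criteria.items())
        ]


people = DocumentCollection()
# Documents in the same collection do not have to share the same fields.
people.insert({"name": "Ana", "city": "Manila", "languages": ["Python", "Ruby"]})
people.insert({"name": "Ben", "city": "Cebu"})
people.insert({"name": "Carla", "city": "Manila", "age": 31})

if __name__ == "__main__":
    print(people.find(city="Manila"))
```

No table definition is needed before inserting, and a query can name any field, including one that only some documents have.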
The popular ones
So this means we can…
• Reduce Development Drag
• Embrace Large Scale
but this does not mean relational is dead
• The relational model is still relevant
• ACID Transactions
• Tools
• Familiarity
this leads us to a world of Polyglot Persistence
• Using multiple data storage technologies, chosen based upon the way data is being used by individual applications.
• Polyglot persistence will occur over the enterprise as different applications use different data storage technologies.
• It will also occur within a single application as different parts of an application’s data store have different access characteristics.
What might Polyglot look like?
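One possible shape, as a minimal sketch under stated assumptions: a single application that keeps sessions in a key-value structure, the product catalog as free-form documents, and orders in a relational table with transactions. The `Shop` class and its methods are hypothetical; plain dicts stand in for real key-value and document stores, and the standard-library `sqlite3` module stands in for a relational database.

```python
import sqlite3


class Shop:
    """Toy application that uses a different kind of store per kind of data."""

    def __init__(self):
        self.sessions = {}  # key-value style: session_id -> user
        self.catalog = {}   # document style: product_id -> free-form dict
        self.orders_db = sqlite3.connect(":memory:")  # relational, transactional
        self.orders_db.execute(
            "CREATE TABLE orders ("
            "id INTEGER PRIMARY KEY, user TEXT NOT NULL, "
            "product_id TEXT NOT NULL, "
            "quantity INTEGER NOT NULL CHECK (quantity > 0))"
        )

    def login(self, session_id, user):
        self.sessions[session_id] = user

    def add_product(self, product_id, **attributes):
        self.catalog[product_id] = attributes

    def place_order(self, session_id, product_id, quantity):
        user = self.sessions[session_id]
        if product_id not in self.catalog:
            raise KeyError(product_id)
        # The connection as a context manager commits on success and
        # rolls back if the INSERT raises (e.g. the CHECK constraint fails).
        with self.orders_db:
            cursor = self.orders_db.execute(
                "INSERT INTO orders (user, product_id, quantity) VALUES (?, ?, ?)",
                (user, product_id, quantity),
            )
        return cursor.lastrowid

    def orders_for(self, user):
        return self.orders_db.execute(
            "SELECT product_id, quantity FROM orders WHERE user = ? ORDER BY id",
            (user,),
        ).fetchall()
```

Each part of the data gets the storage style that matches how it is accessed: fast lookups for sessions, flexible fields for the catalog, and constraints plus transactions for orders.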
Polyglot Persistence provides lots of new opportunities for enterprises
THANK YOU!
NoSQL Databases
THE FUTURE IS:
Polyglot Persistence