
 

CS 401R NoSQL Database Report 

Jon Belyeu 

Fall 2015 

 


Introduction

This report compares four NoSQL data systems: Redis, Apache CouchDB, Apache Cassandra, and Hypertable. Redis (REmote DIctionary Server) began as an in-memory data store, often used as a cache, but has become an extremely popular key-value database. CouchDB is a document-model database, oriented toward JSON object access and designed mainly for ease of use and web friendliness. Cassandra is a highly scalable database originally created at Facebook, but it is now open-source and widely available. Hypertable is an open-source implementation of BigTable, a high-performance Google project designed for use with the Google File System; it runs on top of a distributed file system and makes large-scale data analytics convenient. All of these tools are popular and effective for different applications, and the goal of this report is to describe each system so that the best use cases for each can be accurately assessed.

Database Descriptions

Redis

History. Redis was developed by Salvatore Sanfilippo, an Italian programmer who needed a highly scalable database that would efficiently support write-intensive use for a real-time analytics project he was working on. He thought of the problem as similar to a very large linked list, where natural order is preserved, adding data is fast and easy, and the last n nodes are very accessible (1).

Sanfilippo developed the database and used it locally at first, then made it publicly available as an open-source project. Redis quickly became popular and drew enough attention that Sanfilippo continued to develop and improve it until VMware, Inc. contacted him and offered to sponsor the project. He continued working on the project as an employee of VMware (2).

After some time with VMware, Sanfilippo moved to Pivotal Software, Inc., which was created as a spin-off company in a VMware reorganization (3). Redis remains an open-source project, freely available and widely popular for data uses that require high-speed writes and streaming data.

Data Model. The Redis data model is based on key-value pairing, similar to a regular hashmap or dictionary (4). Both keys and values are stored as binary-safe strings, which means that Redis can store most kinds of values as strings (including some image formats), as long as the string is under the size limit of 512 MB. Redis also provides several data structures built on those strings (5), specifically:

● Lists. The Redis list is a list of strings, ordered by insertion, that allows adding strings at either the end (tail) or the beginning (head). Pushing a value at either end takes constant time, but operations in the middle of a list require O(n) time and may be slow.

● Sets. Redis sets are unordered collections of strings with constant-time addition, removal, and membership testing. They do not allow repeated members, so values may be added without first checking for uniqueness, and they natively support union, intersection, and set difference operations.

● Hashes. Redis hashes are maps between string keys and string values. They are often used to represent objects, and do so with minimal space usage, although they can represent many other concepts as well.

● Sorted Sets. Redis also provides a sorted set structure similar to the unordered set type. These sets associate a score with each member string and use the scores to order all elements. Sorted sets perform add, remove, and update operations in O(log n) time.

Redis also supports two other types, called bitmaps and HyperLogLogs (HLLs). Rather than being true data types, bitmaps are sets of bit-oriented operations on the string type and allow storage of individual facts associated with a key. An example is tracking the number of days a user visits a website, by simply setting a bit each time the user visits and counting the bits. HLLs, also based on the string type, provide set cardinality estimation. They behave much like sets but count memory-efficiently because instead of storing the values of the set, they store only a state representation (6).
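The bitmap idea above can be sketched in a few lines of Python, using a plain integer as the bit array; in Redis itself this would be the SETBIT and BITCOUNT operations on a string key, and the function names here are hypothetical stand-ins.

```python
# Sketch of the Redis bitmap pattern: one bit per day, set on a visit.
# A Python int stands in for the Redis string value.

def set_visit_bit(bitmap: int, day: int) -> int:
    """Record a visit on the given day by setting that bit (SETBIT stand-in)."""
    return bitmap | (1 << day)

def count_visits(bitmap: int) -> int:
    """Count distinct days visited (BITCOUNT stand-in)."""
    return bin(bitmap).count("1")

visits = 0
for day in (0, 3, 3, 7):   # a repeat visit on day 3 is idempotent
    visits = set_visit_bit(visits, day)

print(count_visits(visits))  # 3 distinct days
```

Because setting an already-set bit changes nothing, repeated visits on the same day are counted once, which is exactly why the pattern is so compact.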

Physical Storage. As noted above, Redis stores values as key-value pairs. Both are held in memory on the server instance where Redis runs, allowing extremely fast operations. There are two persistence options for Redis, RDB persistence and AOF persistence, both of which write data to disk only to maintain persistence between Redis sessions. RDB persistence takes snapshots of the dataset at set intervals, while AOF persistence logs every write operation in real time and replays the operations at startup. It is also possible to combine these two persistence methods or to disable persistence completely. Each has its own advantages, with an obvious trade-off between performance and durability: RDB is very efficient and provides faster operations than AOF, but AOF runs little risk of data loss in case of a shutdown. Redis developers recommend combining AOF and RDB (7).
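A minimal configuration sketch of the combined approach follows. The directive names come from the standard redis.conf file; the specific thresholds are illustrative, not a recommendation.

```conf
# RDB: snapshot if at least 1 key changed in 900 s, or 10 keys in 300 s
save 900 1
save 300 10

# AOF: log every write; fsync once per second (a common middle ground
# between performance and durability)
appendonly yes
appendfsync everysec
```

With `appendfsync everysec`, at most about one second of writes is at risk on a crash, which is why the Redis documentation presents it as the usual compromise.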

Transactions. Redis has immediate data consistency due to its in-memory data storage and its preference for modifying values in place rather than removing and rewriting them. Transaction isolation is guaranteed by sequential execution of each command, so requests by multiple clients cannot overlap. Commands are executed in an all-or-none fashion, which guarantees that transactions are atomic. If there is a server failure during transaction processing, a write to disk may end in a partially complete state; however, Redis detects this on server startup and exits with an error, allowing the server administrator to rectify the problem before continuing (8).

Redis durability, as described above, is customizable and may be good, but due to the in-memory nature of the data store, durability is not a strength of Redis and some data may be lost if there is a server failure (7). In ACID summary:

● Atomicity. Redis transactions are atomic, but can develop faults in some cases of write failure and require human intervention to effect repairs.

● Consistency. Redis is fully consistent due to in-memory data storage.

● Isolation. Requests are handled sequentially, which guarantees isolation for each.

● Durability. Redis durability is customizable.

Scalability. Redis partition tolerance allows sets of key-value pairs to be split up among many different Redis instances and stored in the memory of all servers in use (9). This allows both initial large dataset storage and easy scaling.
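The simplest client-side partitioning scheme described in the Redis partitioning documentation is hash partitioning: hash the key and take it modulo the number of instances. A minimal sketch, with hypothetical node names:

```python
import zlib

# Client-side hash partitioning across Redis instances: any client
# using the same scheme routes a given key to the same node.
NODES = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]

def node_for(key: str) -> str:
    """Map a key to one instance via CRC32 modulo the node count."""
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

print(node_for("user:1001"))          # always the same node for this key
assert node_for("user:1001") in NODES
```

The weakness of plain modulo hashing is that adding a node remaps most keys, which is why production deployments tend toward consistent hashing or the fixed-slot scheme used by Redis Cluster.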


Redis provides documentation (10) of scaling benchmarks, including multiple-socket server performance.


CouchDB

History. Couch is an acronym for Cluster Of Unreliable Commodity Hardware. The database was created by Damien Katz in 2005 and became an Apache open-source project in 2008 (11). Katz began researching new projects after being laid off from his position as a software developer at a startup, with the express goals of finding interesting new work and having more time to spend with his family (12). He slowly developed the idea of a document-based database model, with the main benefits being web-friendliness and ease of use. IBM contacted him when CouchDB began to be successful and, after some negotiations, he transferred ownership of the project to IBM. The deal required that the code be donated to Apache once developed, which is how CouchDB became an open-source project.

After finishing development of CouchDB, Katz moved to the Couchbase Server project, which combines the preexisting Membase database with CouchDB technology. He intends it eventually to replace CouchDB (13).

Data Model. CouchDB is a document model database. Entries consist of key-value pairs, with a unique string as the key and a JSON document as the value (14). This allows a great deal of flexibility in data, which may have any values representable as a JSON object. This flexibility also results in an essentially structureless design; any structure beyond the key-value pairing that the database has is dependent on the user.

CouchDB documents may have any number of fields and no requirement is made that different documents be related. All data types supported by JSON are allowed within the documents and the only field requirement is that no two fields within the same document may have the same name (15).

Data retrieval is based on the unique string names that identify the documents, which are contained within document-specific metadata fields and make document lookup operations very fast. CouchDB uses 'view' models that make the otherwise unstructured data model usable by defining aggregate, join, and report operations (15). These operations execute using the concept of MapReduce, passing each document as the argument to a map function, which emits the intended data. A reduce function, if needed, is used to aggregate the results.
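The map/emit/reduce flow above can be simulated in a few lines. Real CouchDB views are JavaScript functions stored in a design document; this Python sketch (with made-up sample documents) only mimics the mechanics.

```python
# Simulation of a CouchDB view: map emits (key, value) rows per
# document, reduce aggregates the emitted values.

docs = [
    {"_id": "a1", "type": "order", "total": 30},
    {"_id": "a2", "type": "order", "total": 12},
    {"_id": "a3", "type": "user"},
]

def map_fn(doc, emit):
    # emit one row per order document, keyed by type
    if doc.get("type") == "order":
        emit(doc["type"], doc["total"])

rows = []
for doc in docs:
    map_fn(doc, lambda k, v: rows.append((k, v)))

# reduce step: aggregate the emitted values (here, a simple sum)
print(sum(v for _, v in rows))  # 42
```

In CouchDB the emitted rows are kept sorted by key in the view's B-tree, so the map function runs once per document change rather than once per query.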

Physical Storage. Data is stored in a B-tree data structure, which maintains sorting and provides search, insert, and delete operations in logarithmic time (15). The map and reduce functions produce the key-value pairs CouchDB needs and allow data to be inserted into the B-tree, sorted by keys. This allows for simple partition tolerance, as pairs can easily be stored in any node and retrieved based on the key.

Transactions. There is no locking mechanism in CouchDB, and multiple users may concurrently pull and edit the same document (15). If multiple users make changes, later commits must be applied to freshly pulled versions of the document, and each change replaces the existing version rather than modifying it. This means that, much as in version control software like Git or Subversion, if a client pulls a document while another client is making changes, data may not be consistent between pulls; however, CouchDB promises eventual consistency, as the document updates completely when committed.
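This optimistic, version-control-like scheme can be sketched as follows. In CouchDB every document carries a revision field (`_rev`, actually an opaque string like "1-abc"); here plain integers and a hypothetical `Conflict` exception stand in.

```python
# Sketch of CouchDB-style optimistic concurrency: an update must name
# the revision it was based on, or it is rejected as a conflict.

class Conflict(Exception):
    pass

store = {"doc1": {"_rev": 1, "title": "draft"}}

def update(doc_id, expected_rev, fields):
    current = store[doc_id]
    if current["_rev"] != expected_rev:
        # the caller edited a stale copy; it must re-pull and retry
        raise Conflict("rev mismatch; fetch the latest revision first")
    store[doc_id] = {**current, **fields, "_rev": current["_rev"] + 1}

update("doc1", 1, {"title": "final"})       # succeeds, rev becomes 2
try:
    update("doc1", 1, {"title": "stale"})   # based on the old revision
except Conflict:
    print("conflict: update was based on an old revision")
```

No locks are ever taken; the loser of a race simply gets a conflict response and retries against the new revision, which is exactly the behavior CouchDB exposes over HTTP as a 409 status.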

CouchDB writes to disk whenever a document is updated by first flushing the document data and index update, then writing the header out in two identical chunks. If the database crashes during the first step, the partial updates are forgotten; if there is a crash during the second step, the old headers remain (16). In ACID summary:

● Atomicity. Any write operation that fails is deleted, so partial documents are never saved.

● Consistency. CouchDB never overwrites committed data, thus ensuring that the data on disk is always consistent.

● Isolation. Since data is written to disk at every document update, transactions are isolated by nature. Each must terminate before another can begin.

● Durability. Data is written to disk when documents are updated, so data is durable.

Scalability. The key-value nature of CouchDB makes it very simple to scale; in fact, it was designed from the ground up to be readily distributed across partitions (17). In addition, CouchDB has merged with the BigCouch project (18), a CouchDB-based database designed by Cloudant for large-scale data applications. BigCouch adds improved distributed-application support and clustering capability to the existing CouchDB codebase.

CouchDB uses asynchronous replication rather than sharding to scale the framework, although the CouchDB Lounge wrapper can provide sharding on top of CouchDB (19).

Cassandra

History. Cassandra was originally developed at Facebook by Avinash Lakshman and Prashant Malik, initially as the back end for Facebook's Inbox Search (20). It was released to Apache as an open-source project in 2008, became a top-level project in 2010, and has since been deployed by many companies managing large datasets, including Netflix, Adobe, Twitter, HP, IBM, and Reddit (21).

Data Model. Cassandra uses a hybrid between key-value pairing and a column-oriented design. It stores rows and columns in column families, which are similar to RDBMS tables. The rows in a column family are not constrained to a fixed set of columns, and columns may be added to any set of rows within a family at any time (22). Cassandra uses the Cassandra Query Language (CQL) to perform operations and supports the following data types (23):

● Ascii. The ascii type is a string based on the US ASCII character set.

● Bigint. Bigints are 64-bit signed integers.

● Blob. Cassandra blobs are standard binary large objects of arbitrary size with no automatic validation.

● Boolean. The Cassandra boolean type is a normal true/false boolean variable.

● Counter. Counters are 64-bit integer values that support only increment and decrement operations.

● Decimal. Decimals in Cassandra are of variable precision and may be either integers or floats.

● Double. The double is a 64-bit floating-point numeric.

● Float. The float is a 32-bit floating-point numeric.

● Inet. Inets are string representations of IP addresses (either IPv4 or IPv6).

● Int. Ints are 32-bit signed integers.

● List. Lists are ordered data structures and may hold one or more elements.

● Map. Maps in Cassandra are JSON-style collections of key-value literals.

● Set. Sets may hold one or more elements and do not maintain order.

● Text. Cassandra texts are strings encoded with UTF-8.

● Timestamp. The timestamp holds both date and time, encoded as 8 bytes since the epoch.

● UUID. UUIDs are Universally Unique Identifiers.

● TimeUUID. TimeUUIDs are version 1 UUIDs.

● Varchar. The varchar is a UTF-8 encoded string.

● Varint. The varint is an integer of arbitrary precision.

Cassandra uses a partitioned row store (24) with tunable consistency and organizes rows into tables. Each table has a partition key, repeated for every row, and rows are clustered by the remaining columns. The more partitions required to satisfy a query, the slower the query runs, so latency depends heavily on structuring the data efficiently into partitions.
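A hypothetical CQL table makes the partition key and clustering column roles concrete (the table and column names here are illustrative, not from the cited documentation):

```sql
-- user_id is the partition key: all of one user's rows live together
-- in a single partition. visit_time is a clustering column: it orders
-- rows within that partition.
CREATE TABLE user_visits (
    user_id    uuid,
    visit_time timestamp,
    page       text,
    PRIMARY KEY (user_id, visit_time)
);

-- Served from a single partition, so it stays fast:
SELECT page FROM user_visits WHERE user_id = ?;
```

A query that restricts only on `page` would have to scan every partition, which is the slow case the paragraph above warns about.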

Physical Storage. Cassandra is made to run on a distributed file system, with fault tolerance handled by peer-to-peer data distribution across nodes. Clients can connect to any node, and the node a client connects to becomes the coordinator for that request. Key elements of the storage model (26) are:

● Nodes. A node is the basic component of a Cassandra instance, running on one server.

● Data Centers. Data centers are collections of related nodes and handle data replication.

● Clusters. A cluster is a collection of data centers.

● Commit Log. The commit log is the write location for all data, maintained for durability.

● Table. A table is a collection of ordered columns related by row.

Transactions. As stated above, consistency is tunable in Cassandra (25). Cassandra extends the idea of eventual consistency by setting a consistency level at each read or write request and holding the response to the client until the consistency level (corresponding to a number of nodes that have responded) reaches the desired value.

● Atomicity. Cassandra handles each write request individually by partition (27), so a write can fail in one partition, returning a failure status to the client, while succeeding in another partition and storing the value. A write therefore does not return success unless all elements of a command succeed, but data from partially complete commands may still be saved. Because Cassandra does not support bundled transactional commands as a relational database would, individual operations are atomic but atomicity in the transactional sense does not strictly apply.

● Consistency. When a write request is made, it is sent out to all partitions for replication and the write operations continue after the system has reached the desired consistency level specified by the client. This means that aside from the partition-specific failures noted above, Cassandra is eventually consistent. The client can even choose to require that all replica nodes write the update to commit log and memory table before returning (28), which decreases the availability of the data but ensures consistency.

● Isolation. Isolation was not maintained in early versions of Cassandra (29) (a client could read partial rows from a concurrently processed write request), but it has been added to later versions and transactions are now processed in isolation.

● Durability. Write requests are not considered a success in Cassandra until they have been recorded in memory and written to disk in the commit log (30). If any error, such as a server failure, occurs during a write, the commit log is used to restore the memtables on reboot. This allows Cassandra to maintain full durability.
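The arithmetic behind tunable consistency is worth making explicit. With replication factor N, a QUORUM read or write involves floor(N/2) + 1 replicas, and whenever the read and write replica counts together exceed N, every read must overlap the most recent write. A small sketch (function name hypothetical):

```python
# Cassandra-style quorum arithmetic: R + W > N guarantees that read
# and write replica sets intersect, so reads see the latest write.

def quorum(replication_factor: int) -> int:
    """Replicas involved in a QUORUM operation: floor(N/2) + 1."""
    return replication_factor // 2 + 1

N = 3
R = W = quorum(N)          # 2 of 3 replicas for both reads and writes
print(R, W, R + W > N)     # 2 2 True
```

Choosing weaker levels (e.g. ONE for both reads and writes) makes R + W <= N, trading this overlap guarantee away for lower latency and higher availability, which is the "tunable" part.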


Scalability. Cassandra was designed for scaling and is exceptionally efficient for large applications. The replication-based partitioning allows large numbers of separate machines to run Cassandra in clusters, and more machines can easily be added. Benchmarks from a Netflix test run with Cassandra (31) showed write throughput scaling nearly linearly as nodes were added to the cluster.


Hypertable

History. Hypertable was based on publications describing the structure of Google's BigTable database and was specifically designed for scalability (32). BigTable was created by Google engineers to address the need for a highly scalable, high-performance database that could be applied to many problems as needed by Google's customers.

Data Model. Hypertable uses multi-dimensional tables to store data (32). Each table stores a row key as the first dimension, which serves as the primary identifier and defines row order. The second dimension holds the column family, the set of columns that define the data in the rows. The third dimension is the column qualifier, and the fourth holds a timestamp representing the insertion time of the cell.

Data are unstructured in Hypertable in that there are no explicit types (33); everything inserted is stored as a string.
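The four-dimensional addressing can be sketched with a dictionary keyed by (row, column family, column qualifier, timestamp) tuples; the names and sample values below are hypothetical.

```python
import time

# Sketch of Hypertable's cell model: each value is addressed by
# (row key, column family, column qualifier, timestamp), and values
# are stored as uninterpreted strings.

table = {}

def insert(row, family, qualifier, value, ts=None):
    ts = ts if ts is not None else time.time()
    table[(row, family, qualifier, ts)] = str(value)  # stored as a string

insert("user42", "info", "age", 29, ts=1)
insert("user42", "info", "age", 30, ts=2)   # newer cell, same column

# Both versions coexist; the timestamp dimension distinguishes them.
print(table[("user42", "info", "age", 2)])  # '30'
```

Because inserts never overwrite earlier cells, the timestamp dimension gives the table built-in versioning: older values remain readable until garbage-collected.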

Physical Storage. Hypertable stores sorted lists of key/value pairs, compressed and written to disk as files called CellStores, which are saved in the underlying distributed file system (34). CellStores are divided into blocks, with an index entry for each block containing the last key in the block as well as the block offset.
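The block index makes point lookups cheap: because blocks are sorted, a binary search over the last keys identifies the single block that could contain a key. A minimal sketch with made-up keys and offsets:

```python
import bisect

# Sketch of a CellStore block index: one (last key, file offset) entry
# per block, binary-searched to pick the one block worth reading.

last_keys = ["apple", "mango", "zebra"]   # last key in each block
offsets   = [0, 4096, 8192]               # corresponding block offsets

def block_offset_for(key: str) -> int:
    # first block whose last key is >= the search key
    i = bisect.bisect_left(last_keys, key)
    return offsets[i]

print(block_offset_for("banana"))  # 4096 (falls in the second block)
```

Only the small index needs to stay in memory; the block itself is fetched from the distributed file system and decompressed on demand.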

Transactions. Hypertable does not support transactions in the sense that a relational database does, which affects its ACID compliance. Therefore an ACID evaluation reveals:

● Atomicity. Without transactions, atomicity is less meaningful. As each command is entered individually rather than as part of a transaction, Hypertable is generally considered atomic (35); however, an update to multiple rows in different tables performs each update independently and does not maintain atomicity.


● Consistency. Hypertable maintains consistency with Hyperspace (36), a service based on Google's Chubby locking service; it maintains metadata on database status and enforces a distributed consensus protocol.

● Isolation. Hypertable uses snapshot isolation (36), which gives every query a consistent view of the database as of the moment the request began executing (37).

● Durability. The durability level of Hypertable depends on the configuration of HDFS or other file system that Hypertable runs on. Durability is guaranteed for file blocks and commit logs, and it may be configured for the non-system table commit log.

Scalability. The scalability of Hypertable comes from the underlying distributed file system. Hypertable is generally run on GFS (Google File System) or the open-source derivative HDFS (Hadoop Distributed File System), both of which distribute file storage across multiple machines for maximum availability (36). This means any number of additional machines can be added to scale effectively.

Database Comparison

These four databases are all intended for use with potentially large datasets, so there is a strong focus on scalability and partition tolerance. Redis and CouchDB both have some intrinsic weaknesses as they scale to large datasets, however.

Since Redis stores all data in memory, the hard limit on what can be stored in a Redis cluster is the combined memory of the computers in the cluster. While this may be considerable, it is generally more limited than a disk-based approach and carries a greater monetary cost for the same amount of storage.

CouchDB does store data on disk, but the document-model system leaves all the work of organizing and using the data to the user. Since CouchDB supports MapReduce operations, it can be used for large-scale applications, but the search system is still relatively time consuming, and more complex queries will be fairly slow on very large datasets. The design suits the purpose it is generally intended to fill, however: CouchDB is an excellent user-friendly web database.

Hypertable and Cassandra scale exceptionally well, and large applications could use either if the use cases otherwise fit their strengths. Hypertable is a strong multi-purpose database and is becoming even better with ongoing development. Cassandra is a fairly mature technology and is especially well suited to data modeling cases that involve related data points (like Facebook group members).

Conclusion

Choosing a database is critical for long-term effective use of data. There are many excellent database designs, optimized for different use cases, so if the needs of a particular application are well understood, it should be possible to find a database that will work well. Conversely, if a database is chosen without careful consideration, the choice can easily become a costly mistake. Because of this, any developer who expects to take part in database decisions should maintain a good general knowledge of the available options and their strengths and weaknesses.

Additionally, although relational database systems and MongoDB have many applications and a large market share between them, they may not be the best databases for all purposes. Only four databases were considered here, but each has its own purpose and each is better than either MongoDB or a relational database for some purposes. The best way to ensure that the best database will be used for a given application is to understand both the database options and the problem, and to match them accordingly.


References:

1. http://www.eu-startups.com/2011/01/an-interview-with-salvatore-sanfilippo-creator-of-redis-working-out-of-sicily/
2. http://blogs.vmware.com/tribalknowledge/2010/03/vmware-hires-key-developer-for-redis.html
3. https://gigaom.com/2012/07/16/vmware-plans-cloud-spin-out-to-keep-up-with-microsoft-amazon-and-google/
4. http://redis.io/topics/twitter-clone
5. http://redis.io/topics/data-types
6. http://redis.io/topics/data-types-intro
7. http://redis.io/topics/persistence
8. http://antirez.com/news/36
9. http://redis.io/topics/partitioning
10. http://redis.io/topics/benchmarks
11. http://www.pcworld.com/article/201046/article.html
12. http://www.infoq.com/presentations/katz-couchdb-and-me
13. http://damienkatz.net/2012/01/the_future_of_couchdb.html
14. http://www.ibm.com/developerworks/opensource/library/os-couchdb/index.html
15. http://www.ibm.com/developerworks/opensource/library/os-couchdb/index.html
16. http://wiki.apache.org/couchdb/Technical%20Overview
17. https://cwiki.apache.org/confluence/display/COUCHDB/Introduction
18. https://cloudant.com/press-releases/cloudant-contributes-database-scalability-and-fault-tolerance-framework-to-apache-couchdb/
19. http://delivery.acm.org/10.1145/1980000/1978919/p12-cattell.pdf?ip=128.187.103.98&id=1978919&acc=ACTIVE%20SERVICE&key=B63ACEF81C6334F5%2EE4A70A1B77145BCF%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=722407042&CFTOKEN=78331827&__acm__=1445138081_0579cba103bdc2546fe04097d5820987
20. http://www.tutorialspoint.com/cassandra/cassandra_tutorial.pdf
21. http://whatis.techtarget.com/definition/Cassandra-Apache-Cassandra
22. http://docs.datastax.com/en/archived/cassandra/0.7/docs/data_model/column_families.html
23. http://docs.datastax.com/en/cql/3.0/cql/cql_reference/cql_data_types_c.html
24. http://docs.datastax.com/en/cql/3.1/cql/ddl/dataModelingApproach.html
25. http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dmlAboutDataConsistency.html
26. http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureIntro_c.html
27. http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_atomicity_c.html
28. http://docs.datastax.com/en/cassandra/1.2/cassandra/dml/dml_config_consistency_c.html
29. http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_isolation_c.html
30. http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_durability_c.html
31. http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
32. http://www.linux-mag.com/id/6645/
33. https://code.google.com/p/hypertable/wiki/HyperRecord
34. https://code.google.com/p/hypertable/wiki/ArchitecturalOverview#Data_Model
35. https://books.google.com/books?id=g_QwBgAAQBAJ&pg=PA186&lpg=PA186&dq=hypertable+atomicity&source=bl&ots=CXJd1qIXoY&sig=kXypIRiV_hnbDcwOFgSyIfDGCvM&hl=en&sa=X&ved=0CC8Q6AEwA2oVChMIju7w1bvRyAIVWNpjCh0d9gll#v=onepage&q=hypertable%20atomicity&f=false
36. http://hypertable.com/documentation/architecture/
37. https://msdn.microsoft.com/en-us/library/tcbchxcb(v=vs.110).aspx