
Benchmarking the suitability of key-value stores for distributed scientific data

Hua Feng

August 22, 2012

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2012

Abstract

This dissertation explores the suitability of key-value stores for use in a scientific environment. We present a benchmarking approach that evaluates the performance of a key-value store in terms of the CAP (consistency, availability and partition tolerance) theorem, and then explore how suitable a key-value store is when a scientific application needs to use one.

With the development of Internet technology, more and more data is generated for scientific research, which creates a demand for partitioning data across multiple nodes. However, the performance of relational databases drops in this setting because of the complexity of their data model.

Therefore, this dissertation investigates the suitability of two key-value stores, Redis and Voldemort, by benchmarking their consistency and availability under different configurations (number of nodes and dataset size).

The benchmarking results show that, regarding consistency, Redis scales with the size of the dataset but not with the number of nodes, while Voldemort scales with the number of nodes but not with the size of the dataset. The availability of Redis is better than that of Voldemort, but the availability of both decreases as the dataset grows. In addition, the results suggest that adding more nodes to a Voldemort system may increase its availability. Furthermore, because of client issues with both data stores, partition tolerance could not be evaluated at present.

Finally, based on an analysis of demanding scientific applications, we conclude that both Voldemort and Redis could be suitable for a scientific application, depending on the requirements of that application.


Contents

Chapter 1 Introduction

Chapter 2 Motivation

2.1 The growing requirement from scientific application

2.2 Reasons of using NOSQL

2.3 If key-value stores are suitable for scientific application

Chapter 3 Intensive data concepts

3.1 NOSQL Data store

3.2 Sharding and Replication

3.2.1 Replication

3.2.2 Sharding

3.3 The CAP Theorem

3.3.1 Consistency

3.3.2 Availability

3.3.3 Partition Tolerance

3.4 Buffer and Index

3.4.1 Buffer

3.4.2 Index

Chapter 4 NOSQL data stores

4.1 The different types of NOSQL

4.1.1 Key-value stores

4.1.2 Document oriented stores

4.1.3 Graph stores

4.1.4 Column-oriented stores

4.2 Target data store to benchmark

4.3 Voldemort

4.3.1 Data Sharding and replication[48]

4.3.2 Index Pre-calculation

4.4 Redis

4.4.1 Persistence

4.4.2 Virtual Memory

4.4.3 Replication

4.4.4 Sharding[68]

Chapter 5 Evaluation Preparation

5.1 Evaluation

5.1.1 Evaluation of Consistency

5.1.2 Evaluation of Availability

5.1.3 Evaluation of Partition Tolerance

5.1.4 Other scripts for evaluation

5.2 Platform

5.2.1 Amazon EC2 deployment

5.2.2 EDIM1 deployment

5.3 Data Store Configuration

5.3.1 Client Setup

5.3.2 Data used in evaluation

5.4 Expected Result

Chapter 6 Data Analysis

6.1 Consistency Benchmark

6.1.1 Result with 4-nodes cluster

6.1.2 Result with 8-nodes cluster

6.1.3 Result with 16-nodes cluster

6.1.4 Consistency Contrast

6.1.5 Conclusion on consistency evaluation

6.2 Availability Benchmark

6.3 Partition Tolerance Benchmark

Chapter 7 Suitability of key-value store

7.1 Exascale Challenge of Scientific Application

7.2 Fitness of the key-value store for scientific application

7.2.1 Scalability

7.2.2 Schema for scientific data

7.2.3 Input and output

7.3 Conclusion for the suitability of key-value store

Chapter 8 Summary and Future Work

Appendix A Node setting up procedure

Appendix B Source script-startvoldemort.sh

Appendix C Source file-startredis.sh

Appendix D Source script-conV.sh

Appendix E Source script-conR.sh

Appendix F Source script-ycsbV.sh

Appendix G Source script-ycsbR.sh

Appendix H Source script-pa.sh

Appendix I Client setting up- Voldemort

Appendix J Client setting up- Redis

Appendix K Benchmark file download

Appendix L Voldemort configuration file

References


List of Tables

Table 1: A comparison between Redis and Voldemort[89][19]

Table 2: Specification of single EDIM1 and AmazonEC2 node[60][61]

Table 3: Write performance evaluation

Table 4: Write performance evaluation

Table 5: Partition tolerance evaluation

Table 6: Some public data set on Internet

Table 7: Simple section from each data set

Table 8: Summary of scalability


List of Figures

Figure 1: A simple comparison between Sharding and Replication

Figure 2: The strong consistency needs N<W+R[32]

Figure 3: The location of Voldemort[19] and Redis[51] according to CAP theorem.[66]

Figure 4: Consistent hash ring of Voldemort[48]

Figure 5: The step of consistency benchmark

Figure 6: YCSB architecture[81]

Figure 7: Produce of partition tolerance benchmarking

Figure 8: The script procedure of loading content

Figure 9 Simple records from Nomao dataset and Record Linkage Comparison pattern dataset

Figure 10 Consistency Benchmark Result of four-node cluster

Figure 11 Consistency Benchmark Result of four-node cluster (threads 80 to 100)

Figure 12 Consistency Benchmark Result of four-node cluster (further investigation)

Figure 13 Consistency Benchmark Result of eight-node cluster

Figure 14 Consistency Benchmark Result of eight-node cluster (from threads 80 to 100)

Figure 15 Consistency Benchmark Result of sixteen-node cluster

Figure 16 Consistency Benchmark Result of sixteen-node cluster

Figure 17 Consistency Benchmark with different nodes configuration

Figure 18 Consistency Benchmark with different nodes configuration at the time point of 80%

Figure 19 Voldemort Availability Benchmark (write)

Figure 20 Voldemort Availability Benchmark (throughput)

Figure 21 Voldemort Availability Benchmark (read)


Acknowledgements

I would like to express my appreciation to my dissertation supervisors, Mr. Neil Chue Hong and Dr Michael Jackson, for their supervision and especially for their feedback on my dissertation, which helped me a lot.

I would also like to thank Mr. Paul Martain (School of Informatics, University of Edinburgh), who helped me set up the environment on EDIM1.

Furthermore, I would like to express my gratitude to my classmate Sinan Shi, who shared his room with me during the dissertation period.

Finally, I would like to thank everyone at EPCC, who gave me an impressive experience of the MSc course in High Performance Computing.


Chapter 1 Introduction

“Data explosion”, regarded as a hallmark of the information age, can be described as the rapidly increasing volume of data about everything that happens in the world. As a result, the demand to store, index and query data is climbing for nearly every big company. Not only do big I.T. (information technology) companies such as Facebook and Google need to store huge amounts of user data; Walmart also has a data warehouse[2] to analyse the shopping trends of its customers. In addition, data-intensive technologies such as Map-Reduce[83] have helped to process millions of data items in the areas of artificial intelligence, machine learning and web search[83]. However, a relational database reaches its limits when the dataset becomes huge[3]. This has led to the development of highly scalable distributed data stores that are not based on the relational database model, known as NOSQL (not only structured query language) databases. For instance, Google’s Bigtable[4], Amazon’s Dynamo[5] and Facebook’s Cassandra[16] are high-quality implementations of NOSQL stores. Nearly all NOSQL stores are designed to store and index huge amounts of data efficiently and are highly optimized for read/write operations. However, few scientific applications use a NOSQL database for massive data processing, since NOSQL databases have only emerged in recent years. Hence, this dissertation investigates the suitability of key-value stores, which are one kind of NOSQL store, for scientific data applications.

In chapter 2, we introduce the motivation and present the approaches for the evaluation. In chapter 3, we discuss the background to this dissertation, such as NOSQL, sharding, replication and the CAP theorem. In chapter 4, we present detailed information about the target data stores (Redis and Voldemort). In chapter 5, we describe the evaluation approaches, and the results of the evaluation are discussed in chapter 6. Finally, we assess the suitability of key-value stores in chapter 7 and summarise the conclusions along with further work in chapter 8.


Chapter 2 Motivation

In this chapter, we describe why we benchmark NOSQL data stores for scientific applications and state the aim of the dissertation.

2.1 The growing requirement from scientific application

One definition of a “scientific application” is “An application that simulates real-world activities using mathematics”[82]. In other words, real-world objects are modelled and simulated with mathematical formulas. For example, weather forecasting is a conventional scientific application that predicts the weather from information about the atmosphere, as described in the talk by Jing Jiang, Jayanth Gummaraju and Rohit Gupta[54]. They identify several challenges in the development of scientific applications, one of which is IO (input/output). The whole data set of a weather forecast can be more than a thousand gigabytes[54]. Similarly, a report from Oak Ridge National Laboratory[55] on the requirements of leading scientific applications describes data analysis as “producing data at a rate of about 60 GB every minute”. Considering an exaflop situation, “analysis output will amount to 20 PB/hour.”[55]. With such high IO requirements, storage becomes a critical overhead, and a traditional database might not be suitable, because the relational model may reach its limits, which leads to a drop in performance[46].

2.2 Reasons of using NOSQL

A NOSQL data store may benefit a scientific application because of its suitability for data-intensive processing, its scalability and its flexibility, as follows:

One reason to use NOSQL technology is that many NOSQL data stores are well suited to data-intensive processing. For example, Bigtable is used in a large number of Google services such as YouTube, Google Analytics and Google Earth[4], whose workloads behave similarly to those of scientific applications. For instance, Google Earth “serves tens of thousands of queries per second per data center with low latency”[4].


Scalability may be one factor in the choice of NOSQL for an application. A relational database such as Oracle Database is usually scaled by purchasing more powerful servers. In contrast, NOSQL data stores are designed to scale out easily, since data in a NOSQL system is partitioned across multiple nodes.

Flexibility, especially data flexibility, may be another primary motivation for selecting NOSQL. The key-value model provides a high level of flexibility, allowing images, documents, strings, integers and serialized objects to be stored in the same store. The schema-free model of a key-value store may also improve performance through the choice of key-value pairs: the system can reduce its seek time with small keys, and small keys are more likely to fit into memory. A more detailed description of key-value stores is given in Section 4.1.

2.3 If key-value stores are suitable for scientific application

The aim of the dissertation is to explore the suitability of a NOSQL data store for a scientific application. Two kinds of key-value implementation are involved: a strictly consistent memory-based key-value store and an eventually consistent disk-based key-value store. We evaluate the suitability of these two target data stores for scientific applications. By comparing the different key-value implementations, we can evaluate how a key-value store performs when it is deployed with a scientific application, and whether a key-value store is suitable for scientific applications.


Chapter 3 Intensive data concepts

In this chapter, we introduce background information about distributed systems and some concepts relevant to the performance of NOSQL systems.

3.1 NOSQL Data store

A NOSQL data store can be defined as “Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable”[21]. The most significant feature of a NOSQL store is scalability, which means the ability of a system to “accommodate an increasing number of elements or objects, to process growing volumes of work gracefully, and/or to be susceptible to enlargement.”[87]. Here scalability usually refers to horizontal scalability, that is, the ability of a distributed system or network to connect multiple nodes so that they work as one unit. In a distributed system, performance can therefore be improved by adding more nodes. The name NOSQL implies that these stores do not execute the SQL query language and do not follow the relational model. It is also read as “Not Only SQL”, which indicates the need to build different databases for different demands, such as storing huge datasets or providing high availability, as in the architecture of Amazon[24]. In the past, relational databases were used in a number of application areas such as financial services and computer science[22]. However, their richness of features leads to complexity when data needs to be distributed over a whole network. In particular, transactions and join operations are inefficient in a distributed system, as claimed by Michael Stonebraker[23] with regard to “logging, locking and latching”. This is why NOSQL data stores do not provide full ACID (atomicity, consistency, isolation, durability) support. In addition, most NOSQL stores are designed to be deployed on clusters of distributed low-cost computers and networks. Such systems are therefore designed for failure tolerance and make trade-offs between ACID guarantees, transactions, query capabilities and performance. Furthermore, NOSQL stores are usually schema free and have their own query languages.

3.2 Sharding and Replication

A data store can be scalable in three respects: in read operations, in write operations and in the size of the store. Replication and sharding are currently the two major techniques for achieving scalability in NOSQL data stores. Figure 1 shows a simple comparison between sharding and replication.

Figure 1: A simple comparison between sharding and replication

3.2.1 Replication

As shown in figure 1, replication means that the data set is stored in full on more than one node. This approach is effective at increasing the read performance of a NOSQL store, since read operations can be distributed by a load balancer over many machines. Another advantage is that it improves the failure tolerance of the whole system: when a node fails, there is always another node holding the same data set that can serve read operations.

On the other hand, write operations are the downside of data replication. After a write has finished on one node, it also has to be executed on every node that stores a replica, which may reduce write performance. There are two ways to do this: either the write is issued to all replica nodes and the system waits for a positive response from each of them, or the write is committed on only one node (or a limited number of nodes) and asynchronous requests are then sent to the remaining nodes. These two approaches lead to different guarantees of availability and consistency, which are explained with the CAP (Consistency, Availability and Partition Tolerance) theorem in Section 3.3.
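The two write strategies can be illustrated with a minimal in-process sketch. The class `ReplicatedStore` and its method names are hypothetical and are not taken from Redis or Voldemort; plain dictionaries stand in for nodes, and a queue stands in for the asynchronous replication channel.

```python
from collections import deque


class ReplicatedStore:
    """Toy model of one primary node and several replicas (in-process dicts)."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = deque()  # queued (replica_index, key, value) updates

    def write_sync(self, key, value):
        """Apply the write on every node before acknowledging it."""
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value
        return "ack"

    def write_async(self, key, value):
        """Acknowledge after the primary write; replicas are updated later."""
        self.primary[key] = value
        for index in range(len(self.replicas)):
            self.pending.append((index, key, value))
        return "ack"

    def flush(self):
        """Deliver all queued updates, as a background process would."""
        while self.pending:
            index, key, value = self.pending.popleft()
            self.replicas[index][key] = value

    def read_replica(self, index, key):
        return self.replicas[index].get(key)


replicated = ReplicatedStore(replica_count=2)
replicated.write_sync("x", 1)
sync_read = replicated.read_replica(0, "x")    # replica already has the value

replicated.write_async("x", 2)
stale_read = replicated.read_replica(0, "x")   # replica still holds the old value
replicated.flush()
fresh_read = replicated.read_replica(0, "x")   # replica has caught up
```

In this sketch the synchronous write pays the cost of touching every replica before it returns, while the asynchronous write returns early and leaves a window in which a replica serves the old value.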

3.2.2 Sharding

As shown in figure 1, sharding means that the data in a store is divided into many shards, which are distributed over the nodes. Usually the data is distributed by a master node, which uses a consistent hash function to decide the associated shard or node. Sharding also requires a partition table, stored on the master node, to locate the requested data. In addition, with sharding a complete table (or list or column, depending on the data model of the NOSQL store) is not stored on a single node. By distributing the data over multiple nodes, sharding can increase the capacity and the read/write performance, because more nodes can be queried during an operation, and the number of nodes can even be adjusted according to demand without modifying the application.
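A consistent hash function of the kind mentioned above can be sketched as follows. This is an illustration only: the class `HashRing`, the use of MD5 and the choice of 100 ring points per node are assumptions made for the example and do not describe the implementation of any particular store.

```python
import bisect
import hashlib


def ring_position(text):
    """Map a string to an integer position on the hash ring."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, points_per_node=100):
        self.points_per_node = points_per_node
        self.owner = {}       # ring position -> node name
        self.positions = []   # sorted ring positions
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Several points per node spread the keys more evenly.
        for i in range(self.points_per_node):
            self.owner[ring_position(f"{node}#{i}")] = node
        self.positions = sorted(self.owner)

    def node_for(self, key):
        """Return the node owning the first point clockwise from the key."""
        index = bisect.bisect(self.positions, ring_position(key))
        return self.owner[self.positions[index % len(self.positions)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"record-{i}" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}

ring.add_node("node-d")
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

The property of interest is that adding a node relocates only the keys that now belong to the new node; all other keys stay where they were, which is what allows the number of nodes to change without reshuffling the whole data set.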

On the other hand, sharding may also bring complexity and inefficiency to data operations. The join operation, which expresses the relation between data items, is one of the most important operations in relational databases. Two data sets are involved in a join, the left one and the right one, and they are connected through one or more attributes. Performing a join in a NOSQL data store carries the high cost of making requests to every node that contains relevant data, which results in network overhead. Therefore, most NOSQL data stores do not support the join operation.

Because the probability of failure goes up as more nodes are added to the system, sharding and replication are often combined to guarantee the robustness of the system.
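The cost of a join over sharded data can be made concrete with a small sketch. The station and reading records, the modulo placement rule and the function names are all hypothetical; each loop iteration over a shard stands in for one network request.

```python
def build_shards(records, shard_count):
    """Distribute (record_id, record) pairs over shards by record_id."""
    shards = [{} for _ in range(shard_count)]
    for record_id, record in records:
        shards[record_id % shard_count][record_id] = record
    return shards


def join_on_station(station_shards, reading_shards):
    """Join stations to readings on station_id, counting shard requests."""
    requests = 0
    joined = []
    for station_shard in station_shards:
        requests += 1  # fetch the stations held by this shard
        for station_id, name in station_shard.items():
            # Readings are sharded by reading_id, so any shard may hold
            # a reading for this station: every shard has to be asked.
            for reading_shard in reading_shards:
                requests += 1
                for reading in reading_shard.values():
                    if reading["station_id"] == station_id:
                        joined.append((name, reading["value"]))
    return sorted(joined), requests


stations = [(1, "Edinburgh"), (2, "Glasgow")]
readings = [
    (10, {"station_id": 1, "value": 11.5}),
    (11, {"station_id": 2, "value": 9.0}),
    (12, {"station_id": 1, "value": 12.1}),
]
station_shards = build_shards(stations, shard_count=2)
reading_shards = build_shards(readings, shard_count=3)
joined, requests = join_on_station(station_shards, reading_shards)
```

Because the readings are not partitioned by the join attribute, the number of requests grows with the number of stations multiplied by the number of reading shards, even though only three rows are returned.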

3.3 The CAP Theorem

The CAP theorem was introduced by Dr. Brewer[27] to describe the trade-offs in distributed systems and was later explained by Lynch[28]. It states that a distributed data store such as a NOSQL data store can support only two of the three attributes availability, consistency and partition tolerance.

3.3.1 Consistency

Consistency means that all nodes in the distributed system see the same data at the same time. In a distributed NOSQL system, consistency is usually provided as “eventual consistency”:

Eventual consistency means that the client will receive responses as follows: “In a steady state, the system will eventually return the last written value”[29]. Therefore, clients may encounter an inconsistent state while data is being updated. For example, an update request may reach only one node while another process is reading the same data from another node, so that the two processes see two different results[29].

To describe eventual consistency more precisely, the following notation is used:

N: The number of nodes that hold the replicated data.

R: The number of nodes that have to be read for one read operation.

W: The number of nodes that have to be written for one write operation.

In a strongly consistent configuration (such as a distributed relational database), N must be smaller than the sum of R and W[31]. Figure 2 shows an example of concurrent reading and writing with N=3, R=2 and W=2. The process writes the new value x2 to two nodes and reads from two nodes, so at least one of the nodes that is read holds the new value and the result is consistent.

Figure 2: The strong consistency needs N<W+R[32]

When a system only guarantees eventual consistency, it only needs N >= R+W, which favours partition tolerance or availability. Some NOSQL stores such as Voldemort therefore allow W and R to be adjusted for specific scenarios. A small R with a large W favours read performance, because each read has to contact only a few nodes; conversely, a small W with a large R favours write performance. Small values of both R and W improve availability but reduce consistency, and with R = 1 and W = 1 the system behaves as plain replication[32].
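The role of the condition N < R+W can be checked by brute force. The sketch below is an illustration, with function names chosen for this example: it enumerates every possible read set and write set and tests whether they always share at least one node, which is what guarantees that a read sees the latest write.

```python
from itertools import combinations


def is_strongly_consistent(n, r, w):
    """The quorum condition from the text: N < R + W."""
    return n < r + w


def quorums_always_intersect(n, r, w):
    """Check every read set against every write set for a shared node."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


# The configuration of figure 2: every read overlaps every write.
figure_2 = quorums_always_intersect(3, 2, 2)

# R = 1 and W = 1 on three nodes: a read can miss the written node.
plain_replication = quorums_always_intersect(3, 1, 1)
```

For small clusters the exhaustive check agrees with the inequality in every case, which is the pigeonhole argument behind it: if R + W exceeds N, a read set and a write set cannot be disjoint.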


Furthermore, Todd Lipcon explained and contrasted strict and eventual consistency in “Design Patterns for Distributed Non-Relational Databases”[29]. In his definition, consistency “determines rules for visibility and apparent order of updates”. He describes strict consistency as follows:

Strict consistency is defined as “All read operations must return data from the latest completed write operation, regardless of which replica the operations went to”[29]. This suggests that all read and write operations for a data set should be executed on the same node. Therefore, according to the CAP theorem, strict consistency cannot be achieved together with both availability and partition tolerance[29].

3.3.2 Availability

Availability means that the client always gets a response to a read or write operation within a specific period of time. If availability is not guaranteed in a CAP system, some clients may get no response, even though the data is stored in the distributed system. In that case, the client receives an error message or a null value, depending on the client implementation. Generally, this happens because the connection reaches its timeout limit. Although the read/write operation may still be executing inside the data store, the client has no way to obtain the return value.
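The timeout behaviour described above can be reproduced with a minimal sketch. The functions `slow_put` and `put_with_timeout` and the delay values are hypothetical; a sleeping thread stands in for a slow data store.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

backing_store = {}
executor = ThreadPoolExecutor(max_workers=2)


def slow_put(key, value, delay):
    """Simulated server-side write that takes `delay` seconds."""
    time.sleep(delay)
    backing_store[key] = value
    return "OK"


def put_with_timeout(key, value, delay, timeout):
    """Client-side call: return the reply, or None if the timeout expires."""
    future = executor.submit(slow_put, key, value, delay)
    try:
        return future.result(timeout=timeout), future
    except TimeoutError:
        return None, future


fast_reply, _ = put_with_timeout("a", 1, delay=0.0, timeout=1.0)
slow_reply, slow_future = put_with_timeout("b", 2, delay=0.3, timeout=0.05)

slow_future.result()  # let the background write finish
write_still_happened = backing_store.get("b") == 2
executor.shutdown()
```

From the point of view of the client the second request was unavailable, since it received a null value, even though the write was eventually applied inside the store.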

3.3.3 Partition Tolerance

Partition tolerance means that the system continues to operate despite a network failure in which two or more isolated nodes cannot connect to each other. It can also be explained as the ability of the system to support the dynamic addition and removal of nodes in the network.

When the amount of data exceeds the capacity of a single machine, it has to be partitioned over several nodes so that the distributed data store can continue to serve requests; partition tolerance usually depends on how the data is partitioned. The size of the system and other factors, such as the dynamic removal of nodes, can also affect partition tolerance. In short, there are various approaches to partitioning data:

Memory caches (memcached)[7] can be seen as partitioned in-memory databases, because they replicate the most recent data of the database in main memory and serve it to clients, which reduces the load on the server side. A memory cache consists of an array of processes, each with an assigned amount of memory, which are launched and made known to the application through configuration. The memcached protocol, which an application can use like a key/value store, has implementations in different programming languages. It stores objects under a key in the cache provided by the configured memcached instances. If a memcached process does not respond, most clients can ignore it and connect to another memcached server to process the current data.

Separating reads from writes means dividing the nodes into a master and slaves. Write operations are routed to the master, while the slaves only handle read operations. However, since the master replicates to the slaves asynchronously, data will be inconsistent if the master crashes before it has propagated the new data to the slaves.


Sharding partitions the data in such a way that data requested together is usually stored on the same node and the load is distributed roughly evenly among the nodes. Data may also be replicated for reliability and load balancing. Moreover, implementing sharding requires a mapping from data to storage nodes. Depending on the system, this mapping can be dynamic or static. Furthermore, since sharding disables the join function of the system, it has been commented that “you lose all the features that make a relational database useful”[28]. Nevertheless, most NOSQL data stores use sharding to partition data, because it performs well with automatic partitioning and load balancing.
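The failover behaviour described for memory caches, where a client skips an unresponsive cache process and tries another one, can be sketched as follows. The classes `CacheServer` and `FailoverClient` are in-process stand-ins written for this illustration; they do not use the real memcached protocol or any real client library.

```python
import zlib


class CacheServer:
    """In-process stand-in for one cache process (not a real memcached)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is not responding")
        return self.data.get(key)

    def set(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is not responding")
        self.data[key] = value


class FailoverClient:
    def __init__(self, servers):
        self.servers = servers

    def _candidates(self, key):
        """Servers in the order in which they are tried for this key."""
        start = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        return self.servers[start:] + self.servers[:start]

    def set(self, key, value):
        for server in self._candidates(key):
            try:
                server.set(key, value)
                return server.name
            except ConnectionError:
                continue  # skip the unresponsive server
        raise ConnectionError("no cache server available")

    def get(self, key):
        for server in self._candidates(key):
            try:
                return server.get(key)
            except ConnectionError:
                continue
        return None  # treated as a cache miss


servers = [CacheServer("cache-0"), CacheServer("cache-1")]
client = FailoverClient(servers)
first_home = client.set("temperature:1", 11.5)

for server in servers:
    if server.name == first_home:
        server.alive = False  # simulate a cache process that stops responding

miss = client.get("temperature:1")            # surviving server has no copy
second_home = client.set("temperature:1", 11.5)
hit = client.get("temperature:1")
```

Losing a cache process only causes a miss in this sketch, since the cached value can be written again to a surviving server.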

3.4 Buffer and Index

Buffers and indexes are two major factors that may influence performance when a database handles a large dataset[69].

3.4.1 Buffer

A number of databases keep part of their data in memory as a buffer to improve performance. In many databases the buffer works like a CPU cache: it lowers latency because memory is faster than hard disk. The issue with a buffer is the difference between in-memory and out-of-memory data. The performance gap between them may be a factor of one thousand, because out-of-memory data needs more I/O steps[69].
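The buffer mechanism can be sketched as a small in-memory cache in front of a slower store. This is a minimal illustration under stated assumptions: the least-recently-used eviction policy, the class name `BufferedStore` and the dictionary standing in for on-disk data are all hypothetical and not taken from any of the databases discussed.

```python
from collections import OrderedDict

class BufferedStore:
    """Keeps the most recently used entries in memory in front of a slower store."""

    def __init__(self, backing, capacity):
        self.backing = backing      # stand-in for on-disk data (a dict here)
        self.capacity = capacity    # maximum number of buffered entries
        self.buffer = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.buffer:
            self.buffer.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.buffer[key]
        self.misses += 1
        value = self.backing[key]            # the slow path in a real database
        self.buffer[key] = value
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)  # evict the least recently used entry
        return value

store = BufferedStore({k: k * k for k in range(100)}, capacity=10)
for k in [1, 2, 1, 1, 50, 2]:
    store.get(k)
```

Counting hits and misses separately makes the in-memory versus out-of-memory distinction visible: only the misses would pay the extra I/O cost.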

3.4.2 Index

An index is information built on the current data that helps to find certain data without searching the whole database. Generally, an index is built on a data field that can be easily sorted, such as an ID or a date. A NOSQL system usually builds the index on its keys, so that it can help to locate the target of a read or write operation. An index does not always speed up access to a database; this depends on how the index is constructed. If index entries are placed at random locations, an index scan may be slowed significantly[69].


Chapter 4 NOSQL data stores

After introducing the common concepts of distributed data stores, this chapter presents further information about NOSQL data stores, focusing on key-value stores. Key-value stores have a simple data model, which allows clients to put and request values by key. Finally, the two target data stores for evaluation, Voldemort and Redis, are briefly introduced.

4.1 The different types of NOSQL

NOSQL stores can be viewed as non-relational databases based on a shared-nothing architecture1[88]. They include limited support for transactions but offer a wide range of stores to choose from[37]. Generally, NOSQL stores can be grouped into four categories:

Key-value stores

Document oriented stores

Graph stores

Column stores

These categories reflect how data is manipulated in the NOSQL stores and their capability for querying, adding and changing data.

4.1.1 Key-value stores

Key-value stores are the most straightforward implementation of NOSQL stores. The main idea of a key-value store is to save objects. Originally, key-value stores were built as back-ends for web servers with an object-oriented architecture[38]. The approach is very simple: every object is stored under a unique key. The value is usually a serialized form of the object, stored without any additional functions, so searching, comparison or sorting cannot be performed on the values stored in the database. However, this results in a completely schema-free architecture. For example, Amazon S3 [39] is a key-value store service provided by Amazon with simple APIs based on the HTTP methods GET, PUT and DELETE, applied to objects containing from 1 byte to 5 terabytes of data each.

1 A distributed computing architecture in which nothing is shared between nodes.

4.1.2 Document oriented stores

A document-oriented store encapsulates key-value pairs in documents with unique keys. Every document contains a key “ID”, which is unique within a collection of documents. Unlike a key-value store, a document-oriented store can query values, which makes it convenient to process complex data structures such as nested objects. Like key-value stores, document-oriented stores have no schema restrictions. However, they offer multi-attribute searching on documents, which key-value stores do not. Document-oriented stores are therefore very convenient for data integration.

4.1.3 Graph stores

In contrast to the key-oriented NOSQL stores introduced above, graph stores specialise in handling heavily linked data, that is, data with many relationships. For example, Neo4j[40] and GraphDB[41] are based on multi-relational property graphs. Nodes and edges carry key-value pairs, and the schema can be defined by keys and values. A graph store is therefore able to hold complex expressions. Twitter stores the relationships between users for its follower service in its own graph store, FlockDB[42], which is optimized for large lists and fast reads and writes. Most graph stores are used in location-based services and navigation systems[93].

4.1.4 Column-oriented stores

A column-oriented store is described as a “distributed storage system for managing structured data that is designed to scale to a very large size”[4]. An example is Bigtable, which is used in many Google projects because it provides high throughput and low latency. The data model of a column-oriented store is a “sparse, distributed, persistent multidimensional sorted map”. Key-value pairs are stored in this map, whose keys are arbitrary strings. As with key-value stores, relational operations are not supported natively because values cannot be interpreted by the system. However, columns can be grouped into column families and rows can be added, which gives flexibility at runtime. HBase[43] and Hypertable[44] are open source implementations of Bigtable.


4.2 Target data store to benchmark

Figure 3: The location of Voldemort[19] and Redis[51] according to CAP theorem.[66]

Figure 3 illustrates the NOSQL implementations that could be selected for investigation in this dissertation. As described in Chapter 2, we are looking for key-value stores because of their schema-free architecture, and we want to evaluate different key-value store implementations. We therefore pick one store from the set of CP key-value stores and one from the set of AP key-value stores.

During the first stages of our research, we found that the configuration documentation of Redis and Voldemort was much better than that of other implementations such as Riak[79]. Both were therefore selected for the benchmarking experiments in this dissertation.

Redis is a strictly consistent in-memory key-value store, which keeps its data in memory to enhance performance. Voldemort, in contrast, is an eventually consistent disk-based key-value store inspired by the key-value store “Dynamo”[5], which was developed to support distributed storage. Table 1 compares Redis and Voldemort, and Figure 3 illustrates the different focus of each data store.

| | Redis | Voldemort |
|---|---|---|
| Written in | C | Java |
| Main features | Data store in memory, a fast response-focused implementation of key-value store | Inspired by Dynamo to provide a huge data store without compromising performance |
| Protocol | Telnet | HTTP |
| Best used | For rapidly changing data with a foreseeable database size (should fit mostly in memory) | Applications with large requirements on data capacity, such as geological data |
| For example | Real-time data collection from sensors; real-time communication | Personal profiles, metadata of maps |
| Highlights | Currently without disk swap; no overhead from writing to and reading from hard drives | Highly configurable with no requirements of consistency and availability |
| Supported clients | C, C++, Python, Erlang, Java, Lua | C++, Ruby, Python |

Table 1: A comparison between Redis and Voldemort[89][19]

As shown in Figure 3, Redis prioritises consistency and partition tolerance while Voldemort prioritises availability and partition tolerance. According to the CAP theorem, a NOSQL implementation can achieve only two of the three guarantees: consistency, availability and partition tolerance. However, it is still possible to evaluate consistency and availability for both stores and to examine the performance gap between them.

4.3 Voldemort

Project Voldemort, developed by LinkedIn[84], is an implementation of the Dynamo paper. It is a key-value store that provides the following HTTP-based API:

GET(KEY)

PUT(KEY, VALUE)

DELETE(KEY)

As in other key-value stores such as Dynamo and Cassandra, both key and value can be complex objects, and Voldemort provides only a simple data structure and API with no complex querying capabilities.


4.3.1 Data Sharding and replication[48]

Data in Voldemort is partitioned across a cluster of nodes so that no single node holds the complete data set. Even when the data set fits on a single node, access to a small data set is dominated by disk seek time. Hence, partitioning can improve cache efficiency by splitting the “hot” data (data that is requested regularly) into smaller chunks that can hopefully fit into memory on the node that stores that partition.

Voldemort partitions the data into S chunks (one per node) and stores copies of a given key on R nodes (R being the replication factor) with the help of consistent hashing. Consistent hashing is the technique Voldemort uses to compute the location of a key in the cluster. Figure 4 shows how Voldemort distributes the data equally over all nodes:

Figure 4: Consistent hash ring of Voldemort[48]

In Figure 4, the consistent hashing ring begins at 0 and circles around to 2^31-1. It is divided into Q equally sized partitions with Q>>S, and each of the S nodes is assigned Q/S partitions. A key is mapped onto the ring by an arbitrary hash function; Voldemort then computes the list of R nodes responsible for the key by taking the first R unique nodes encountered when moving clockwise from the key's position. The diagram shows a hash ring for nodes A, B, C and D. The arrows illustrate the resulting list of nodes that store each key when keys are mapped onto the hash ring with R = 3, S = 4 and Q = 8, so that each node is assigned two partitions.
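The ring lookup described above can be sketched in a few lines. This is a minimal illustration, not Voldemort's implementation: the round-robin assignment of partitions to nodes, the use of MD5 as the hash function and all function names are assumptions made for the example.

```python
import hashlib

RING_SIZE = 2 ** 31  # ring positions 0 .. 2^31 - 1

def build_ring(nodes, q):
    """Assign q equally sized partitions to the nodes round-robin (an assumption)."""
    return [nodes[i % len(nodes)] for i in range(q)]

def partition_of(key, q):
    """Hash the key onto the ring (MD5 is an arbitrary choice), then find its partition."""
    position = int(hashlib.md5(str(key).encode("utf-8")).hexdigest(), 16) % RING_SIZE
    return position * q // RING_SIZE

def preference_list(key, ring, r):
    """Walk clockwise from the key's partition, collecting the first r distinct nodes."""
    start = partition_of(key, len(ring))
    owners = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        if node not in owners:
            owners.append(node)
        if len(owners) == r:
            break
    return owners

ring = build_ring(["A", "B", "C", "D"], q=8)   # S = 4, Q = 8
replicas = preference_list("some-key", ring, r=3)
```

With S = 4 and Q = 8 each node owns two partitions, and every key resolves to three distinct nodes, matching the R = 3 configuration in the text.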

4.3.2 Index Pre-calculation

Voldemort allows an index to be pre-built and uploaded to the nodes. This benefits batch operations, such as inserting or updating a large number of objects, which would otherwise cause the index to be rebuilt.

4.4 Redis

Voldemort is intended to run on a large number of compute nodes or on clusters. In contrast, some distributed systems such as Redis[51] can be deployed on a single node, and additional nodes can be linked to the system to improve performance and availability. Redis was released as a small open source project in 2008 and has gained great momentum since VMware got involved. The idea of Redis is exactly that of a key-value store: a simple interface in which each key has an associated value of one of several data types. The value can be queried by using the key. However, Redis generally keeps all key-value pairs in the main memory of the compute node to improve performance. If memory is insufficient, part of the key-value pairs can be swapped to disk by the virtual memory system[85].

4.4.1 Persistence

Since Redis is designed as an in-memory data store, it keeps all key-value pairs in memory during operation. All data is lost from memory if Redis is stopped by a crash or an error. To achieve data persistence and fault tolerance, Redis uses periodic snapshots and append-only journaling (AOJ)[70]. However, because in-memory objects are much larger than their form in Redis snapshots, data objects are usually stored in a hybrid manner for certain structures, such as a huge hash table. When the hash table reaches its capacity limit, it must be resized and its keys rehashed, which can reduce performance significantly. Redis keeps performance stable during rehashing by maintaining two hash tables and migrating the keys between them[52].

4.4.2 Virtual Memory

Since version 2.0, Redis can handle excess data through virtual memory when main memory is insufficient. Redis then swaps objects out of memory to storage. Since the hash table needs to stay in memory during operation, and an object has to be rewritten in full if it needs to be swapped, a large object such as a sorted set will never be swapped out, which may cause a performance drop[70].

4.4.3 Replication

Redis supports unidirectional replication, which means that a second node can be configured as a slave of the master node and replicates its data objects. Replication is asynchronous, which means that an object on the slave node may be outdated for some time. Any write operation sent to a slave is ignored during the replication process, which makes a slave a read-only node. Replication always copies all data objects from the master node, so Redis provides no selective replication. As a result, read operations can be performed on multiple slave nodes, while write operations have to go to the master[70].
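The effect of asynchronous master-slave replication on reads can be shown with a toy model. The sketch below is an illustration only and does not use the Redis protocol or client: the classes `Master` and `Slave` are hypothetical, replication is triggered by an explicit call instead of running in the background, and pending writes are shipped incrementally rather than as a full copy.

```python
class Master:
    """Accepts writes and ships them to its slaves later."""

    def __init__(self):
        self.data = {}
        self.pending = []   # writes not yet shipped to the slaves
        self.slaves = []

    def set(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Ship pending writes; a real master would do this in the background."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending = []

class Slave:
    """Read-only replica of a master."""

    def __init__(self, master):
        self.data = {}
        master.slaves.append(self)

    def get(self, key):
        return self.data.get(key)

master = Master()
slave = Slave(master)
master.set("k", "new")
stale = slave.get("k")   # replication has not run yet, so the slave lags behind
master.replicate()
fresh = slave.get("k")   # the slave has now caught up
```

The window between `set` and `replicate` is the period in which a client reading from the slave observes outdated data.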

4.4.4 Sharding[68]

The cluster version of Redis is still under development, so we used “Pre-shard(0.1.5)”[62], a Redis client extension, to implement sharding for Redis. The idea of Pre-shard is described by Antirez on his website[68].

Pre-sharding is a technique that maps each key to one of N Redis instances. The limitation of this kind of partitioning lies in moving keys from one instance to another. Apart from that, the client needs to know which node is the slave of each master node, so that it can find a replicated data object for fault tolerance when needed.
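The cost of moving keys between instances can be made concrete with a small experiment. The sketch assumes a plain hash-modulo-N mapping based on CRC32, which is an assumption for illustration and not necessarily how the Pre-shard client maps keys; `instance_for` and `moved_fraction` are hypothetical names.

```python
import zlib

def instance_for(key, n):
    """Map a key to one of n Redis instances (CRC32 modulo n, an assumed scheme)."""
    return zlib.crc32(str(key).encode("utf-8")) % n

def moved_fraction(keys, old_n, new_n):
    """Fraction of keys whose instance changes when the instance count changes."""
    moved = sum(1 for k in keys if instance_for(k, old_n) != instance_for(k, new_n))
    return moved / len(keys)

keys = ["row:%d" % i for i in range(10000)]
fraction = moved_fraction(keys, old_n=4, new_n=5)
```

Under a uniform hash, a key keeps its instance when going from four to five instances only if its hash gives the same remainder for both counts, which holds for about one key in five. Most keys would therefore have to be migrated, which is why the number of instances is normally fixed in advance.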


Chapter 5 Evaluation Preparation

In this chapter we present the details of the evaluation approaches taken in the project and explain the procedures for evaluating consistency, availability and partition tolerance.

5.1 Evaluation

The evaluation uses the three guarantees described in chapter 3.3 as criteria to assess the performance of Redis and Voldemort. To investigate the suitability of a key-value store for a scientific application, the aims of the evaluation are:

o To compare the consistency for two types of key-value stores (CP and AP) across varying node and data set sizes.

o To compare the availability for two types of key-value stores (CP and AP) across varying data set sizes.

o To evaluate the partition tolerance of both target data stores.

5.1.1 Evaluation of Consistency

The definition of consistency is given in chapter 3.3. Evaluating consistency directly by this definition has limitations: in an eventually consistent NOSQL data store such as Voldemort, the data store may forward a request to any node that holds a replica of the data, so it is not always possible to identify which node will respond to the request. This is because the API does not allow the client to look into the partitioning mechanism and identify the specific node that responds to a request.

For a strictly consistent key-value store such as Redis, as described in 3.3.1, all read and write operations are executed on the master node, which means that only the owner node of a given key can see the key, apart from its slave nodes (the data object on a slave node cannot be seen by the master node but can be queried by the client).

As a result, we present an approach that evaluates consistency by using a large number of read threads operating concurrently.


To estimate the difference between the strict consistency of Redis and the eventual consistency of Voldemort, and how the consistency of both data stores varies with different configurations (nodes/datasets), an experiment was designed as follows:

Figure 5: The steps of the consistency benchmark

As shown in Figure 5, the Python script for the consistency benchmark first initializes 101 threads: 100 read threads and one write thread. Every thread is assigned the same single key, generated by the script. After initialization, the read threads wait for the completion event of the write thread. Meanwhile, the write thread accesses the target data store, changes the value of the generated key to “test” and then sends the write completion event to all read threads. Finally, all read threads access the data store to get the value of the generated key until the updated value (“test”) is returned. The benchmark measures the time between the write thread updating the value and all read threads receiving the updated value. In general, the less time it takes until all threads read the new value, the better the consistency of that data store. This evaluation tries to simulate the situation where a read operation and a write operation happen concurrently, which is the usual case in a scientific monitoring application that observes data, such as an earthquake monitoring application or a typhoon analysis application. In addition, to obtain a result significant enough to be visualized, the number of read operations is expanded to 100.

Test.py is the main benchmark Python script used to evaluate the data store. As demonstrated in figure 16, test.py uses the Python multiprocessing module[94] to implement the consistency evaluation: the Process class starts the processes, the Event class synchronizes them, and the Lock class serializes the output, so that each thread prints its result one at a time rather than concurrently.
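The event-gated procedure can be sketched as follows. This is not the dissertation's test.py: to stay self-contained, the sketch uses threads from the threading module in place of multiprocessing processes, and a hypothetical in-memory class `InMemoryStore` stands in for the Redis or Voldemort client. The function name and its parameters are likewise assumptions.

```python
import threading
import time

class InMemoryStore:
    """Stand-in for a Redis or Voldemort client (hypothetical get/put interface)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

def run_consistency_benchmark(store, key, readers=100, new_value="test"):
    """Return the sorted times at which each reader first saw the new value."""
    written = threading.Event()     # signals completion of the write
    output_lock = threading.Lock()  # lets readers record results one at a time
    timings = []
    start = {}

    def writer():
        start["t"] = time.perf_counter()
        store.put(key, new_value)
        written.set()               # wake all waiting readers

    def reader():
        written.wait()              # sleep until the write has completed
        while store.get(key) != new_value:
            pass                    # poll until the new value is visible
        elapsed = time.perf_counter() - start["t"]
        with output_lock:
            timings.append(elapsed)

    threads = [threading.Thread(target=reader) for _ in range(readers)]
    threads.append(threading.Thread(target=writer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(timings)

store = InMemoryStore()
store.put("k", "old")
timings = run_consistency_benchmark(store, "k", readers=100)
```

The sorted list corresponds to the kind of curve plotted in the consistency figures: the n-th entry is the time by which n readers had observed the new value.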

However, the following classes, which also belong to the multiprocessing module, did not work for the implementation: Pool, Value and Array. During the development phase, the Pool class was used to control the processes, and the Value and Array classes were used to synchronize them. None of them succeeded because the classes did not work as expected.

Furthermore, the initial version of test.py had two issues. The first was thread control: the first version of test.py used the threading module to start threads, but it did not return correct results. This might have been caused by a cache shared between threads and was fixed by switching to the multiprocessing module. The other issue concerned concurrent operation. When a thread or process kept generating read operations, Voldemort could return an incorrect response, such as null or an error message, to the client. This problem might be caused by the native Python client or by Voldemort itself, because it did not guarantee availability. The issue was solved by using the Event class of the multiprocessing module, which does not wake the read processes until the write process has completed its write operation.

The consistency evaluation uses the line count as the key and each record line as the value. The loading procedure usually takes more than two hours to insert 0.5 million rows of data. For each node count and data size, the script is executed five times to obtain an average result and calculate the standard error. The total data size involved in the consistency benchmark is around 9.5 million rows of key-value pairs (20 bytes per row on average), which may take more than 12 hours.
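The average and standard error over five runs can be computed as shown below. The timings are made-up values for illustration, not measurements from the benchmark, and the function name is hypothetical. The standard error is taken as the sample standard deviation divided by the square root of the number of runs.

```python
import math

def mean_and_standard_error(samples):
    """Return the mean and the standard error of the mean for a list of numbers."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    return mean, math.sqrt(variance / n)

runs = [0.10, 0.12, 0.11, 0.13, 0.09]  # made-up timings in seconds for five runs
mean, sem = mean_and_standard_error(runs)
```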

5.1.2 Evaluation of Availability

The definition of availability is given in chapter 3.3.2, and in the availability benchmark the two target data stores are always considered to be “available”. To evaluate availability, the latency of read and write operations is measured in order to compare how quickly the two target data stores respond.

To benchmark availability, the YCSB[56] (Yahoo! Cloud Serving Benchmark) framework is used to evaluate the latency of randomized multiple read/write operations. YCSB is a framework designed by the Yahoo research lab to compare different implementations of distributed data store systems. It can simulate a scenario with a very high operation throughput, so that availability can be benchmarked by measuring the average latency of operations. Figure 6 illustrates the architecture of YCSB, which can be considered as an additional layer on top of a NOSQL data store:


Figure 6: YCSB architecture[81]

The scripts for using the YCSB framework are provided in Appendix F and Appendix G. The output of YCSB is reported through the following attributes:

Throughput: the number of operations performed per second.

Latency: the average latency of all operations.

99th Latency: the latency within which 99% of operations completed.

95th Latency: the latency within which 95% of operations completed.
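The four attributes can be derived from a list of per-operation latencies, as the sketch below shows. It is an illustration of the definitions only and not YCSB's own code: the nearest-rank percentile method, the function name `summarise` and the synthetic latency values are assumptions.

```python
def summarise(latencies_ms, elapsed_s):
    """Compute throughput, average latency and 95th/99th percentile latencies."""
    ordered = sorted(latencies_ms)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank method: smallest sample with at least p% of samples at or below it.
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    return {
        "throughput_ops_per_s": n / elapsed_s,
        "average_latency_ms": sum(ordered) / n,
        "p95_latency_ms": percentile(95),
        "p99_latency_ms": percentile(99),
    }

latencies = [float(i) for i in range(1, 101)]  # synthetic latencies: 1 ms .. 100 ms
report = summarise(latencies, elapsed_s=2.0)
```

The percentile latencies are more sensitive to slow outliers than the average, which is why both are worth reading when comparing two stores.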

In general, the YCSB framework offers a very friendly interface with enough statistics for the availability evaluation. It can be downloaded from the link provided on Yahoo’s webpage, with no need to compile code. There are two main phases in the availability evaluation, “load” and “run”. Both are completed by YCSB automatically, and the data is also generated by YCSB. Two data sizes are used in the availability evaluation: 500 thousand and 5 million. However, because YCSB can only connect to Redis through the Java Redis client it provides, it cannot evaluate Redis running on a cluster. The availability evaluation is therefore mainly based on a single node for both target data stores.

5.1.3 Evaluation of Partition Tolerance

The definition of partition tolerance is given in chapter 3.3.3, and this evaluation aims to find out which target data store works better when one or more nodes are isolated. To evaluate the ability of a NOSQL system to keep working without impact, the benchmark measures the number of available key-value pairs returned to a randomized read thread after a node has been isolated from the rest. Although there is no guarantee that the read thread will come across the isolated node during the operation, the percentage of random key-value pairs that are available still demonstrates the partition tolerance of the target data store. To achieve a representative result, the evaluation is performed only with 16 nodes. Figure 7 shows the procedure of the partition tolerance evaluation:

Figure 7: Procedure of partition tolerance benchmarking

Patest.py is a Python script written to evaluate partition tolerance according to Figure 7. After each read operation, Patest.py generates another key for the next read operation. Keys are generated with Python's random module, and the script repeats the read operation 10,000 times.

For the evaluation, one node is rebooted to simulate isolation. The script then evaluates the store to obtain the result. The same procedure is repeated until data has been gathered for the isolation of one node, two nodes and four nodes.
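The measurement loop can be sketched as below. This is not Patest.py itself: the class `PartitionedStore` is a hypothetical stand-in in which each key lives on exactly one node with no replication, and an isolated node simply returns nothing. The seed parameter is added only to make the sketch repeatable.

```python
import random

class PartitionedStore:
    """Stand-in cluster: each key lives on one node; isolated nodes do not answer."""

    def __init__(self, node_count, key_count):
        self.node_count = node_count
        self.data = {k: "value-%d" % k for k in range(key_count)}
        self.isolated = set()

    def get(self, key):
        if key % self.node_count in self.isolated:
            return None   # the owning node is unreachable
        return self.data.get(key)

def available_percentage(store, key_count, reads=10000, seed=0):
    """Read random keys and return the percentage that could be retrieved."""
    rng = random.Random(seed)
    found = 0
    for _ in range(reads):
        key = rng.randrange(key_count)   # a fresh random key for every read
        if store.get(key) is not None:
            found += 1
    return 100.0 * found / reads

store = PartitionedStore(node_count=16, key_count=30000)
baseline = available_percentage(store, 30000)
store.isolated = {3}                     # simulate the isolation of one node
after_one = available_percentage(store, 30000)
```

In this unreplicated stand-in, isolating one of 16 nodes removes roughly one sixteenth of the keys. A store that replicates data would be expected to lose fewer, which is the difference the benchmark is meant to expose.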

5.1.4 Other scripts for evaluation

Loadfile.py is a script that loads a data file into a data store. Figure 8 illustrates how it works:


Figure 8: The script procedure of loading content

As shown in Figure 8, loadfile.py is able to load the content of any kind of data file. It first reads the value K stored under “key” in the data store and uses K+1 as the key for the data on the first line; the data on the next line is then stored as the pair [K+2, data on second line]. The iteration continues until Python detects the EOF (end of file). The strength of this implementation is its flexibility with respect to data structure, and it normalizes the influence of different kinds of keys. However, the script is not efficient enough: because it is not parallelized, loading data takes a long time between evaluations. It might be possible to parallelize the Python script to reduce the data loading time.

DB.py is a client script that wraps both the Redis and the Voldemort client, so that a user can create an instance of either Redis or Voldemort with one script. Every Python script for the evaluation includes DB.py and requires its first parameter to be “redis” or “voldemort” to choose the data store. This improves reuse of the code and avoids maintaining two versions of each Python script.
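The line-numbered loading scheme can be sketched as follows. This is not the original loadfile.py: a plain dictionary stands in for the data store client, the function name `load_lines` is hypothetical, and two details are assumptions, namely that blank lines are skipped and that the counter stored under “key” is written back after loading.

```python
import io

def load_lines(store, lines, counter_key="key"):
    """Store each line under consecutive integer keys, continuing from the stored counter."""
    k = int(store.get(counter_key, 0))   # K: the last key used so far
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                     # skip blank lines (an assumption)
        k += 1
        store[k] = line                  # pair [K+1, first line], [K+2, second line], ...
    store[counter_key] = k               # remember where to continue (an assumption)
    return k

store = {}                               # stand-in for a Redis or Voldemort client
sample = io.StringIO("first record\nsecond record\n")
last_key = load_lines(store, sample)
```

Because the keys are plain line numbers, the same loader works for any line-oriented file, and loading a second file simply continues the numbering.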

5.2 Platform

There are two major platforms that have been involved in the dissertation: Amazon EC2 (Elastic Compute Cloud) and EDIM1.

• Amazon EC2 is a web service provided by Amazon that offers scalable compute capacity through web-based cloud technology.

• EDIM1 is an experimental cluster built to support the requirements of data-intensive research. The machine was built by the staff of EPCC and the School of Informatics of the University of Edinburgh. EDIM1 is not intended as a high-end computing machine like HECToR[90] but as a power-efficient alternative for research projects.

Table 2 shows the specification of a single node in Amazon EC2 and EDIM1.

| | EDIM1 | Amazon EC2 |
|---|---|---|
| Motherboard | Zotac ION-ITX-K 1.6 GHz Atom/ION Mini-ITX | AWS service |
| Processor | Dual-Core Intel 1.6 GHz Atom | 2 EC2 Compute Units (equivalent to a 1.0-1.2 GHz 2007 Opteron) |
| Memory | 4 GB DDR3 | 613 MB memory |
| Storage | 1 x 256 MB SSD (RealSSD C300), 3 x 2 TB HDD (Hitachi Deskstar 7K3000) | 8 GB EBS storage |

Table 2: Specification of a single EDIM1 and Amazon EC2 node[60] [61]

There are certain hardware differences between an Amazon EC2 node and an EDIM1 node, but few software differences. Both use a Linux operating system, though in different distributions: Rocks 5.4 on EDIM1 is a Linux distribution for HPC clusters, while Amazon Linux AMI is optimized for running on Amazon EC2 instances. Since the operating systems are similar, the configuration procedures are similar as well. Considering that it may cost more than 120 dollars for CPU and memory if the whole evaluation were performed on Amazon EC2 with a configuration similar to EDIM1, EDIM1 was selected as the main evaluation platform.

5.2.1 Amazon EC2 deployment

After obtaining the verification file (username.pem, which contains the authorization key) from Amazon, we can create a system instance from an image provided by Amazon. In this dissertation, we used Amazon Linux AMI 2012.03 64 bit on a micro instance (up to 2 compute units, 613 MB memory) as the development platform. However, the following components must be added to the instance before it can be used as a development platform:

o Development Libraries

o Administration Tools

o Development Tools (C/Java compilers)

o Latest version of Ant[57] (to compile the Voldemort source files)

o Log4j[58] (needed by YCSB)

o Java[59] (needed by Voldemort)

The versions of Python and Java installed automatically are sufficient for the evaluation. The detailed commands can be found in Appendix A. As an Amazon image reference, ami-657d7a11 can be queried when a new instance is to be created. This image has been customized with all components needed for the dissertation, including the latest versions of Redis and Voldemort.


Furthermore, one more setting deserves attention: the quick-setup configuration of the Amazon EC2 security group may block TCP connections, which causes Voldemort to fail to establish a link.

5.2.2 EDIM1 deployment

The deployment of the base components on EDIM1 is similar to that on Amazon EC2. However, since EDIM1 uses Rocks 5.4, a Linux distribution for high performance computing clusters, two additional steps are needed: the installation of Java (1.6) and Python (2.7), because the automatically installed versions are lower than required. The Java version must be higher than 1.5 and the Python version must be 2.7 or higher.

5.3 Data Store Configuration

Redis (2.4.14) and Voldemort (0.90.1) are configured in different ways. The default configuration of Redis is stored in the redis.conf file, which is located in the Redis folder. Because sharding is an extension to Redis, Redis does not provide any configuration for it, so the default settings of Redis are sufficient for the evaluation.

Voldemort requires three files to store all configuration information: cluster.xml, server.properties and stores.xml. Cluster.xml keeps a copy of the node information, including IP address, partitions and node ID. Server.properties keeps the local server settings, and stores.xml keeps a list of the stores (similar to tables in SQL) in the Voldemort system. Both cluster.xml and stores.xml should be identical on all Voldemort nodes, while server.properties may differ from one node to another. A template of a Voldemort cluster configuration is included in Appendix L.

5.3.1 Client Setup

The dissertation uses the Python clients of Redis and Voldemort as well as the clients provided by YCSB (0.14).

Redis-preshade (0.1.5)[62], as described in chapter 4.4.4, is the native Python client for Redis that enables sharding for Redis.

Voldemort-Python-client (0.90.1)[63] is the native Python client for Voldemort, released together with Voldemort.

Instructions for installing both Python clients can be found in Appendix I and Appendix J. Both packages have been wrapped into one Python script in this project.

5.3.2 Data used in evaluation

The data files used in the benchmarking have been obtained from the UCI Machine Learning Repository[86], which offers several files for machine learning purposes. One is the Nomao data set (last updated 04-07-2011)[64] and the other is the “Record Linkage Comparison patterns” data set (last updated 03-10-2011)[65]. The Nomao data set collects data about personal information such as name, phone and location, while the Record Linkage Comparison pattern (RLC) data represents individual data including name, sex, date of birth and postal code. Both data sets can be downloaded from the website for free, and sample records are listed in Figure 9:

Figure 9 Sample records from Nomao dataset and Record Linkage Comparison pattern dataset

The RLC pattern dataset is used in the consistency evaluation. By inserting the RLC pattern data set twice, a data store will hold about 9.5 million data rows. The Nomao dataset is used in the partition tolerance evaluation; the script loads about 30 thousand data rows from it into a data store. This is because the data set used in the consistency evaluation must be large enough to show the variation, while the data set for partition tolerance should be of moderate size to save the time spent reinserting the data.

5.4 Expected Result

As shown in Figure 3, Redis is a CP type key-value store while Voldemort is an AP type key-value store. The result of the consistency evaluation may vary according to the consistency model of the two target data stores. Generally, if a data store is good at maintaining consistency, the visualization of its result should be a line parallel to the x-axis. By adjusting the number of nodes or the size of the data set, the evaluation can measure the scalability of a data store. If a data store can keep its line parallel or close to parallel to the x-axis, it has good scalability.

In the availability evaluation, the comparison between the two target data stores may be straightforward. A data store with better availability will have a lower latency and a higher throughput than a data store with poorer availability.

For the partition tolerance evaluation, a data store for which the test script reports a high percentage of available key-value pairs when nodes become isolated can be regarded as the better data store for partition tolerance.

Sample record from the Nomao dataset (Figure 9):

0#1,1,1,1,1,1,1,s,s,?,?,?,?,?,?,m,m,?,?,?,?,?,?,m,m,?,?,?,?,?,?,m,m,?,?,?,?,?,?,m,m,?,?,?,?,?,?,m,m,1,1,1,1,1,1,s,s,0.833333,0.714286,0.846154,0.8,0.804878,0.731707,n,n,1,1,1,1,1,1,s,s,1,1,1,1,1,1,s,s,1,1,1,1,1,1,s,s,?,?,?,m,?,?,?,m,1.0,1.0,1.0,s,1.0,1.0,1.0,s,1.0,1.0,1.0,s,0.999967001089,0.75,0.5,n,0.999953002209,0.777777777778,0.461538463831,n,1.0,0.999999996702,+1

Sample record from the Record Linkage Comparison pattern dataset (Figure 9):

37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE

Chapter 6 Data Analysis

This chapter gathers the data generated during the evaluation period and analyses the performance of target data stores. We will compare the two data stores using the three criteria defined by CAP theorem: consistency, availability and partition tolerance.

6.1 Consistency Benchmark

The consistency benchmark is divided into three different configurations: 4-nodes cluster, 8-nodes cluster and 16-nodes cluster. For each configuration, we load 0.5 million rows into cluster first evaluate the consistent performance then add the data size up to around 9.5 million rows. Between the two evaluations with the same number of node, there is no data flushing or other operations and the benchmark script is executed for five times. Between the evaluations with different number of node, all data will be deleted from the target data store. Since the loading procedure consumes more than one day to load million rows the data size will be no larger than 15 million rows.

6.1.1 Result with 4-nodes cluster

We set up a four-node cluster for both Voldemort and Redis and ran the benchmark script on each. The result is shown in figure 10:


Figure 10 Consistency Benchmark Result of four-node cluster

The x-axis of figure 10 represents the number of read threads that have returned the correct new value for a random key. The y-axis represents the time taken for that number of read threads to return the correct value. For each data size in the data store, five benchmark runs are recorded in the figure to show the general trend.
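The measurement behind these axes can be sketched as follows. This is an illustration under stated assumptions, not the benchmark script used in the evaluation: `DelayedStore` is a hypothetical in-memory stand-in whose writes become visible after a fixed delay, and the thread count and delay are arbitrary.

```python
import threading
import time


class DelayedStore:
    """Toy in-memory store whose writes become visible after a delay,
    standing in for propagation lag in a real data store."""

    def __init__(self, visibility_delay):
        self._delay = visibility_delay
        self._lock = threading.Lock()
        self._data = {}     # key -> currently visible value
        self._pending = {}  # key -> (value, time at which it becomes visible)

    def put(self, key, value):
        with self._lock:
            self._pending[key] = (value, time.monotonic() + self._delay)

    def get(self, key):
        with self._lock:
            pending = self._pending.get(key)
            if pending and time.monotonic() >= pending[1]:
                self._data[key] = pending[0]
                del self._pending[key]
            return self._data.get(key)


def measure_consistency(store, key, new_value, n_readers=10):
    """Write new_value, then let n_readers threads poll until each one
    sees it. Returns the sorted elapsed times, one per reader thread."""
    elapsed = []
    elapsed_lock = threading.Lock()
    start = time.monotonic()

    def reader():
        while store.get(key) != new_value:
            time.sleep(0.001)
        with elapsed_lock:
            elapsed.append(time.monotonic() - start)

    store.put(key, new_value)
    threads = [threading.Thread(target=reader) for _ in range(n_readers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(elapsed)


times = measure_consistency(DelayedStore(visibility_delay=0.05), "k", "new")
for count, seconds in enumerate(times, start=1):
    # count corresponds to the x-axis, seconds to the y-axis.
    print(count, round(seconds, 3))
```

Plotting the thread count against the sorted times gives a curve of the same kind as figure 10: a flat line means every reader saw the new value at about the same time.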

Figure 10 shows that, with the 4-node cluster, a single read is generally faster on Redis than on Voldemort. Nearly every Redis thread returns the new value within 0.1 second, while the time taken by the Voldemort threads climbs linearly from 0.1 second to above 0.3 second. Figure 11 gives more detail:


Figure 11 Consistency Benchmark Result of four-node cluster (threads 80 to 100)

Figure 11 shows the detailed graph for thread counts between 80 and 100. As described in chapter 4.4, Redis is an in-memory key-value implementation, so all Redis data is stored in memory. This gives it the benefit of the buffer discussed in chapter 3.4.1, since memory is generally much faster than disk. Voldemort stores only part of its key-value pairs in memory, which may improve performance in certain situations but not in general. Furthermore, Redis is written in C, which allows it to carry out operations efficiently.

Another observation from figure 11 is the increase in latency after adding 9 million data rows to the data store. Redis does not seem to be affected by the increased data size, while the performance of Voldemort becomes noticeably worse towards the end. In other words, the latency of Voldemort may rise with the number of threads that are reading concurrently. To test this assumption, we inserted 5 million more rows into each data store and ran the script again. The result is shown in figure 12:


Figure 12 Consistency Benchmark Result of four-node cluster (further investigation)

Figure 12 demonstrates that the latency of Voldemort increases more than that of Redis as more data rows are inserted. The number after each data store name indicates the data size at the time of the evaluation. According to the current graph, this effect may become more significant as more data is inserted.


6.1.2 Result with 8-nodes cluster

The result of the evaluation on each data store is illustrated in figure 13:


Figure 13 Consistency Benchmark Result of eight-node cluster

Figure 13 shows the consistency results for Redis and Voldemort. One significant observation from comparing figure 13 with figure 10 is the improved performance of Voldemort: the read latency stays below 0.05 second until 50 percent of the threads have completed their operation. This is a striking improvement over the 4-node Voldemort cluster, which is two times slower. Meanwhile, Redis performs similarly with the 8-node and 4-node clusters, and its consistency is maintained after almost one million data rows are inserted.

However, Voldemort still shows a performance drop after nearly nine million rows are inserted into the 8-node cluster, and with this configuration the impact is more considerable than in figure 11. Figure 14 gives more detail:


Figure 14 Consistency Benchmark Result of eight-node cluster (from threads 80 to 100)

Figure 14 shows how Voldemort becomes slower than Redis after nine million rows are inserted. Compared with the stable consistency performance of Redis, the latency of Voldemort not only rises to more than 0.3 second but also varies erratically, which means that the consistency performance of Voldemort is unpredictable at this point. Voldemort also shows this issue in figure 10, so it is not an occasional event. It may affect scientific applications that require good consistency for concurrent operations, such as disaster monitoring[55]. As described in chapter 5.1.1, such applications rely on consistency to present the current result, and a data store with good consistency could even enhance the performance of the application.


6.1.3 Result with 16-nodes cluster

The result of the consistency evaluation on each data store is shown in figure 15:


Figure 15 Consistency Benchmark Result of sixteen-node cluster

Figure 15 shows a dramatic reversal between Redis and Voldemort. Redis is two times slower from the beginning and stays behind Voldemort until the end. Voldemort, on the other hand, behaves the same with the 16-node cluster as with the 8-node cluster. This comparison illustrates that Redis does not scale well with the number of nodes. One possible reason is that the strict consistency model used by Redis requires reads and writes of a key to be performed on a single node, which may put heavy pressure on that node. However, since we used the “redis-shard” module described in chapter 4.4.4 to implement the sharding of Redis, the sharding layer is another potential reason for the unpredictable consistency of Redis: Pre-shard has not been officially released, so it might not have been fully tested under all conditions, such as a 16-node configuration of Redis.

Moreover, Redis comes across the same issue as Voldemort, as shown in figure 16:


Figure 16 Consistency Benchmark Result of sixteen-node cluster

Figure 16 reveals that Redis now shows the issue that occurred only with Voldemort in the 4-node and 8-node configurations. Because Redis does not have this problem with 4 or 8 nodes, the additional nodes in the 16-node configuration may be the cause. As described in the previous chapter, additional nodes put more pressure on the Redis client, since the client is responsible for the sharding of Redis.

6.1.4 Consistency Contrast

After comparing the consistency performance with different data sizes, we also compare the two data stores with different numbers of nodes:


Figure 17 Consistency Benchmark with different nodes configuration

Figure 18 Consistency Benchmark with different nodes configuration at the time point of 80%

Figure 17 gives an overview of the two data stores, and figure 18 compares the time at which each target data store reaches 80 completed threads.

As shown in figures 17 and 18, the consistency performance of Voldemort is more stable and fluctuates within a minor range, while the consistency performance of Redis drops when new nodes are added to the system. In addition, according to figure 17, the performance of Redis keeps dropping as the number of nodes grows and is exceeded by Voldemort before 8 nodes. Meanwhile, Voldemort achieves its best performance at 8 nodes, and its latency climbs slightly with 16 nodes.

One possible explanation is that Pre-shard follows the idea published by the Redis development team[68], which simply hashes the key to find the correct node to write to. The advantage of this implementation is its simplicity, which makes the program efficient. However, the client program is not parallel, so it might have to handle every request in turn when it comes across numerous concurrent operations. With a 16-node cluster, the client may spend a lot of time waiting for responses, since more nodes cause more communication overhead. Another possible reason is the strict consistency model used by Redis. As discussed in 6.1.3, strict consistency may lead to an imbalanced distribution of operations that concentrates on a single node.
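The idea of hashing the key to select a node can be sketched in a few lines. This is only an illustration of the general technique, not the code of the sharding module used in the evaluation: the CRC32 hash, the function name `node_for_key` and the node addresses are all assumptions.

```python
import zlib


def node_for_key(key, nodes):
    """Pick the node that owns `key` by hashing the key and taking the
    result modulo the number of nodes."""
    digest = zlib.crc32(key.encode("utf-8"))
    return nodes[digest % len(nodes)]


# Hypothetical node addresses, for illustration only.
nodes = ["node%d:6379" % i for i in range(4)]
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", node_for_key(key, nodes))
```

With this scheme every key maps to exactly one node and the client performs the mapping itself, so all requests for a key, and all the coordination work, pass through the single client process.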

As far as Voldemort is concerned, the reason why it scales better than Redis may be its architecture. As described in chapter 4.3.1, Voldemort uses a hash ring to calculate the correct node to read from or write to. A Voldemort node can replicate a data object to another node and forward requests to that node when it is busy. The replication itself might be the root cause of the unstable consistency performance during the evaluation, because the distance between the client and the current node or the replica node varies. Nevertheless, this implementation could enhance the scalability of Voldemort at the cost of less stable consistency.

Another observation about Voldemort is that its consistency performance improves up to 8 nodes and drops slightly after 8 nodes. One potential explanation is that several factors besides the number of nodes influence the consistency performance of Voldemort. The improvement between 5 and 8 nodes may come from the consistent hashing ring, while the slight drop after 8 nodes may result from the delay of forwarding requests.
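A hash ring of the kind discussed above can be illustrated with a generic consistent-hashing sketch. It is not Voldemort's implementation: the class name `HashRing`, the use of MD5, the number of ring positions per node and the replica count are assumptions chosen for the example.

```python
import bisect
import hashlib


def _hash(value):
    # Map a string to a position on the ring (a large integer).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, points_per_node=100):
        # Each node is placed at several positions to even out the load.
        self._ring = sorted(
            (_hash("%s#%d" % (node, i)), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]

    def preference_list(self, key, replicas=2):
        """Return the first `replicas` distinct nodes found walking
        clockwise from the key's position on the ring."""
        start = bisect.bisect(self._positions, _hash(key))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == replicas:
                break
        return found


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("some-key"))
```

The first node in the returned list would hold the primary copy and the following nodes the replicas, so a request can be served by another node in the list when the first one is busy or unreachable. Adding a node to the ring moves only the keys that fall next to its positions, which is one reason such a design can scale with the number of nodes.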

6.1.5 Conclusion on consistency evaluation

As described in 4.2, Voldemort does not guarantee consistency as set out by the CAP theorem, while Redis does. The key difference between the two target data stores is therefore the consistency model: the strict consistency model of Redis and the eventual consistency model of Voldemort may lead to the performance variation between them.

Consequently, the comparison between Redis and Voldemort shows a remarkable contrast across different node counts and data sizes. The consistency performance of Redis remains respectable with a small cluster, and Redis delivers stable consistency at every data size. In contrast, the consistency performance of Voldemort tends to fluctuate with a larger number of concurrent operations, but Voldemort scales very well with the number of nodes.


6.2 Availability Benchmark

The availability evaluation is performed with the YCSB[56] framework. Because YCSB does not provide a sharding function for Redis, the evaluation of availability is based on a single node of Redis and of Voldemort. The results of the evaluation are given in table 3 and table 4:

Table 3: Write performance evaluation

                  Write Throughput   Average Latency    95% Latency   99% Latency
                  (ops/sec)          (us)               (ms)          (ms)
Redis 0.5M        606.53±21.30       873.97±21.30       1±0           2±0.44
Redis 5M          494.96±30.63       1155.30±104.36     2±0.44        3.2±0.73
Voldemort 0.5M    163.65±2.36704     5373.806±23.6      7.4±0.24      10.2±0.66
Voldemort 5M      129.38±2.57        7632.11±293.89     10±0.54       16.2±1.59

Table 4: Read performance evaluation

                  Read Average Latency   95% Latency   99% Latency
                  (us)                   (ms)          (ms)
Redis 0.5M        1673.35±61.13          2±0           3.2±0.2
Redis 5M          2019.47±120.84         3.6±0.81      4.8±0.96
Voldemort 0.5M    2823.01±65.98          3.6±0.24      5.6±0.87
Voldemort 5M      4080.86±85.96          5.4±0.24      10±1.58

In general, the availability performance of both Voldemort and Redis dropped after more than five million data rows had been loaded into the data store. The drop is inevitable for both data stores because of the larger index required by the larger data set. Just like a relational database[69], a NOSQL data store needs to deal with buffering and indexing, which were briefly introduced in chapter 3.4. According to tables 3 and 4, the buffering mechanism of Redis is better and leads to a remarkably low latency compared with Voldemort, because Redis stores all data in memory as described in chapter 4.4. As a result, Redis not only achieves low latency but also remains stable after a new data set has been inserted. Moreover, we noted in chapter 4.2 that the focus of Voldemort is partition tolerance and availability, which means that Voldemort should have better availability than Redis. However, both tables show that the availability performance of Voldemort is poorer, in both average latency and 95% latency. Our assumption is that the availability of Voldemort improves in a cluster, because both the consistent hash ring and the pre-built index are optimized for a multiple-node cluster, so the architecture used by Voldemort may cause its low availability performance with a single node. To investigate this, we evaluated Voldemort clusters with three different node configurations and produced the graphs below:

Figure 19 Voldemort Availability Benchmark (write)

Figure 20 Voldemort Availability Benchmark (throughput)

Figure 21 Voldemort Availability Benchmark (read)

Comparing figures 19, 20 and 21, the main conclusion is that the availability of Voldemort increases with the number of nodes but not with the size of the data set, which supports the assumption that Voldemort is optimized for a cluster environment. When more nodes are added to the distributed system, Voldemort achieves lower read/write latency and higher throughput, and thus better availability. As discussed in chapter 4.3.1, Voldemort has a consistent hashing ring to manage the nodes in the distributed system, and the pre-built index mechanism described in chapter 4.3.2 helps its availability performance in a cluster environment.

Consequently, the availability of Voldemort scales with the number of nodes in the Voldemort system, which means the latency could be improved by adding more nodes to the distributed system. Hence, Voldemort might have better availability performance in a multiple-node environment.

Meanwhile, Redis continues to achieve low latency on a single node, but its availability goes down as the data size increases. Moreover, as described in chapter 4.4, Redis keeps all objects in memory and writes them to disk frequently. This approach enhances its availability by using low-latency memory, but it also limits the data size that can be stored on a single node. Redis developed the Virtual Memory mechanism introduced in chapter 4.4.2 to remove this limit, but performance drops when Virtual Memory is turned on[70]. As a result, Redis might keep its high availability on a single node only with a small data size and enough available memory.


6.3 Partition Tolerance Benchmark

In the third part of the evaluation, the partition tolerance of each data store was evaluated. In chapter 4.2, Redis was identified as a CP-type key-value store and Voldemort as an AP-focused key-value store. Since partition tolerance is guaranteed by both target data stores, the evaluation follows the procedure given in chapter 5.1.3, and the result is listed in table 5:

             Number of isolated nodes
             1 node isolated   2 nodes isolated   4 nodes isolated
Voldemort    92.64%±0.71%      86.66%±0.49%       73.88%±0.19%
Redis        94.14%±0.77%      86.84%±0.40%       74.48%±0.96%

Table 5: Partition tolerance evaluation

Table 5 shows the percentage of keys that could still be accessed by the script when some of the nodes were isolated. The data shows that the partition tolerance of Voldemort and Redis is similar, which means that both the Voldemort cluster and the Redis cluster lose part of their data in this situation. However, the result may not represent the true level of partition tolerance, because Voldemort is described as “automatically replicated over multiple servers[19]” and Redis can replicate data objects in master-slave mode[71].

For Voldemort, the investigation shows that the data had been replicated to nodes other than the failed one, as described in chapter 4.3.1, but the client could not return the replicated data object to the test script. The root cause remains unidentified: the configuration file was corrected several times, but the client still did not return the replicated data object. We assume that there are still bugs in the native Python client of Voldemort that cause this problem, and the time available for the dissertation did not allow further investigation.

Meanwhile, Redis can replicate all data from a master node to a slave node, and any node in the Redis distributed system can act as master or slave. The reason why Redis did not demonstrate better partition tolerance is its sharding layer, “pre-shard”. As noted in chapter 4.4.4, Pre-shard supports only sharding for Redis and not replication, which means that a request cannot be forwarded to another node when a node fails. Although the Pre-shard script could be rewritten to include replication, the time left for the dissertation did not allow this work. As a result, the partition tolerance of Redis remains unclear until the Redis team releases a cluster version of Redis.

Therefore, in the current evaluation, the two data stores show a similar level of partition tolerance, but this outcome might not be accurate. Because of the unresolved client issues, neither Redis nor Voldemort achieves the partition tolerance described in its documentation.
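The effect of a client that cannot reach replicas can be illustrated with a small simulation. It is a sketch under assumptions, not the test script of chapter 5.1.3: the 16-node cluster size, the MD5-based key placement and the function name `available_fraction` are hypothetical. When each key is readable from only one node, isolating k of n nodes is expected to leave roughly 1 - k/n of the keys available.

```python
import hashlib


def available_fraction(n_keys, n_nodes, isolated, replicas=1):
    """Fraction of keys still readable when the nodes in `isolated` are
    unreachable. Each key lives on `replicas` consecutive nodes."""
    reachable = 0
    for i in range(n_keys):
        digest = hashlib.md5(("key%d" % i).encode("utf-8")).hexdigest()
        primary = int(digest, 16) % n_nodes
        owners = [(primary + r) % n_nodes for r in range(replicas)]
        if any(node not in isolated for node in owners):
            reachable += 1
    return reachable / n_keys


# Assumed 16-node cluster; the node count is not taken from the evaluation.
for isolated in ({0}, {0, 1}, {0, 1, 2, 3}):
    no_replica = available_fraction(10000, 16, isolated, replicas=1)
    with_replica = available_fraction(10000, 16, isolated, replicas=2)
    print(len(isolated), round(no_replica, 4), round(with_replica, 4))
```

Under these assumptions the values without replication fall near 94%, 88% and 75%, while a client that can fall back to a second copy keeps more keys available. This is the behaviour that a working replica-aware client would be expected to show in a repeated evaluation.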


Chapter 7 . Suitability of key-value store

In this chapter, the results of the evaluation are examined to assess the suitability of a key-value store for a scientific application.

7.1 Exascale Challenge of Scientific Application

In the motivation in Chapter 2, we already established that the requirements of scientific applications are growing. Most scientific applications involve “collecting and maintaining very large datasets and applying vast amounts of computational power to the data”[72], which implies “capturing and storing data” from all over the world. With the development of web technology, it is quite common for the datasets used in scientific research to be dynamic datasets that are updated regularly. As a result, there are three criteria that can be used to evaluate the suitability of a data store for a scientific application:

Extreme scale stands for the challenge of managing petabyte-level data objects. An example given by Randal E. Bryant is the Millennium simulation, during which a 12,8880-core supercomputer computed for 9.3 days and generated nearly 700 terabytes of data[72]. Scaling with the size of data objects is therefore a major challenge for scientific applications. One major approach to this challenge is to partition the data across a distributed shared-nothing architecture (similar to the architecture of a NOSQL system), which leads to a requirement on the consistency of the data store.

Complexity of scientific data sets is another challenge for scientific research, as compared in table 6:

Table 6: Some public data sets on the Internet

Dataset         SCAR[73]                      SDSS (Sloan Digital Sky Survey)[74]   REDD[75]
Data Type       Plain text, semi-structured   Plain text, semi-structured           Plain text, semi-structured
Data Category   Environmental                 Astronomy                             Energy

SCAR   68994 Marion 46.8S 37.8E 24m ...
       68906 Gough 40.4S 9.9 W 54m ...
SDSS   SIMPLE = T
       BITPIX = 8
       NAXIS = 0
REDD   1297340206.597013 135.000000 0.000000 3.623859 7.254136 10.949398 ...
       1297340208.844086 722.000000 0.000000 3.638527 7.249567 10.929027 ...

Table 7: Sample section from each data set

Table 6 lists three datasets that could be used for scientific research, and table 7 shows a small sample from each of them. As table 6 suggests, many datasets are currently available on the Internet. Most of them are semi-structured, as demonstrated in table 7, which adds to the complexity of the datasets. The sample data structures shown are straightforward, but a scientific data set might be more complex if locality (spatial or temporal) or order (time series) computation is involved.

Low latency of IO is particularly critical in the field of real-time scientific analytics. For scientific applications in areas such as financial services, imaging analytics and geological data, the demands of real-time analysis require low latency for input and output, which corresponds to the availability of data stores described in 3.3. For instance, MICE/MAUS[91], a “tracking, detector reconstruction and accelerator physics analysis framework”, requires real-time analysis of data for “online diagnostics during running of MICE”.

7.2 Fitness of the key-value store for scientific application

We evaluate whether a key-value store is suitable for a scientific application based on the benchmark data in Chapter 6.

7.2.1 Scalability

As described in 7.1, some scientific applications manage a huge number of data objects, which need to be partitioned across many compute nodes. As a result, scalability is the first criterion for evaluating the suitability of a key-value store for scientific applications.

We evaluated the consistency and availability of each data store and observed that the scalability of the two key-value stores varies with the configuration of nodes and dataset. The scalability of both data stores is summarised in table 8:

Scalability for a scientific application

               Redis                                    Voldemort
Consistency    Consistency scales with the size of      Consistency scales with the number of
               the dataset but not with the number      nodes but not with the size of the
               of nodes.                                dataset.
Availability   Availability does not scale well with    Availability scales with the number of
               the size of the dataset.                 nodes but drops with a large dataset.

Table 8: Summary of scalability

All statements in table 8 are supported by Chapter 6. As table 8 shows, neither data store keeps both consistency and availability scalable with both the number of nodes and the size of the dataset. Because of the advantage of its in-memory implementation, Redis keeps a remarkably low latency with a small cluster at every dataset size, while Voldemort seems to be delayed by its persistence layer. In general, we assume that the two different implementations compared in Chapter 4 lead to this difference. Although we did not evaluate the availability of a Redis cluster, in theory a Redis cluster might also improve availability, because the amount of data partitioned to each node is smaller in a cluster.

For a scientific application, scalability is crucial to performance and reliability. Both Redis and Voldemort can achieve scalable consistency and availability with the right configuration: a small number of nodes for a Redis cluster and a larger number of nodes for a Voldemort cluster.

7.2.2 Schema for scientific data

The ability to handle complex structured data files is another vital factor demanded by scientific applications, so how a data store can process complex data is the second criterion.

Redis supports a List data model in addition to strings as values, which is implemented as “Linked Lists”[77]. The main benefit of a linked list is that “even if you have millions of elements inside a list, the operation of adding a new element in the head or in the tail of the list is performed in constant time.” In other words, adding a new object with the LPUSH command to a ten-element list is as fast as adding an object to a list with ten million elements. Consequently, Redis can use this schema-free data model to deal with any type of structured data file without compromising performance. Meanwhile, Voldemort can define a custom data type for the data object associated with a key. One example is[48]:

{"fname":"string", "lname":"string", "id":"int32", "emails":["string"]}

Therefore, Voldemort can also process any type of structured data with a custom type.

In addition, a key-value store can itself process semi-structured files because of its simple data model. For example, table 7 shows three semi-structured datasets. For the SCAR dataset, the first field could be taken as the key and the rest handled as the value. For the SDSS dataset, data rows can be combined with other rows and associated with a key such as the line count. For the REDD dataset, it may be better to select the time order as the key. For a data store like Voldemort, it is more convenient to use the JSON data format. However, one disadvantage of a schema-free data format is that if a key is too complicated or too long, the data store might incur additional indexing overhead, which decreases performance.
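The key choices suggested for the three sample datasets can be sketched as follows. The helper names are hypothetical, the sample rows are the fragments shown in table 7, and a plain dictionary stands in for a real Redis or Voldemort client.

```python
def scar_pair(line):
    """SCAR-style row: use the first field as the key."""
    key, _, value = line.strip().partition(" ")
    return key, value


def redd_pair(line):
    """REDD-style row: use the leading timestamp as the key."""
    timestamp, _, readings = line.strip().partition(" ")
    return timestamp, readings


def sdss_pairs(lines):
    """SDSS-style header lines: key each line by its line count."""
    return [(str(n), line.strip()) for n, line in enumerate(lines, start=1)]


store = {}  # a plain dict stands in for the key-value store client
store.update([scar_pair("68994 Marion 46.8S 37.8E 24m")])
store.update([redd_pair("1297340206.597013 135.000000 0.000000 3.623859")])
store.update(sdss_pairs(["SIMPLE = T", "BITPIX = 8", "NAXIS = 0"]))
print(store)
```

In each case the value is kept as an opaque string, which is what makes the schema-free model simple; any structure inside the value has to be interpreted by the application.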

7.2.3 Input and output

More and more scientific applications are required to perform computation and output the result in real time. Thus, availability is the third criterion that can be used to evaluate the suitability of a key-value store for scientific applications.

According to the result of the availability evaluation, on a single node Redis is two times faster than Voldemort, but a Voldemort cluster can improve its availability by adding more nodes. Since Redis has no official cluster version that can be evaluated, we cannot tell whether a Redis cluster would enhance its availability. However, judging from the consistency benchmark of Redis, a Redis cluster built with “pre-shard” might slow performance down, since it is not optimized for the cluster environment.

7.3 Conclusion for the suitability of key-value store

We have evaluated the requirements of scientific applications and the capability of key-value stores such as Redis and Voldemort. We assessed the possibility of using a key-value store for scientific applications from the following three aspects: scalability, schema and IO.

Based on the above evaluation of the suitability of key-value stores, we conclude that both Redis and Voldemort could meet the demands of a scientific application under certain conditions.

First, Voldemort could offer consistency performance that scales with the number of nodes, while Redis could provide smooth consistency with a small number of nodes. Meanwhile, only Voldemort scales with the number of nodes in the availability evaluation.

Secondly, a key-value store can process most semi-structured data files. For instance, as demonstrated in chapter 7.2.2, Voldemort can define the data type used for a data object, while Redis has a list type for manipulating complex data. Moreover, support for multiple types is becoming widespread among key-value store implementations.

Finally, the evaluation in chapter 6.2 demonstrates that the latency of Redis is extremely low with a small dataset, while Voldemort can improve its availability by adding more nodes.

As a result, Voldemort might be suitable for a large scientific application that requires relatively low latency for data analysis and partitions large data across nodes. Redis is more suitable for a small scientific application that does not require a large dataset but is sensitive to latency.


Chapter 8 . Summary and Future Work

In this chapter we summarise the statements made in the dissertation and discuss future work. The purpose of this dissertation is to evaluate the suitability of a key-value store for use by a scientific application. We selected two representative data stores as the targets of the evaluation. The first is Redis, a strictly consistent in-memory data store that concentrates on consistency and partition tolerance; the second is Voldemort, an eventually consistent disk-based key-value store that concentrates on availability and partition tolerance. These two implementations were selected because they are representative of several currently available key-value implementations. By evaluating the performance of two different key-value stores, we could investigate their suitability for scientific applications.

We evaluated the two target data stores according to the criteria of the CAP theorem: consistency, availability and partition tolerance. Although a complete evaluation that follows the definitions of the CAP theorem cannot be performed because of the implementation of the key-value stores, the evaluation was conducted with three client-based approaches: observing the time distribution of multiple read threads after a write operation for the consistency evaluation, examining the latency of multiple read and write operations for the availability evaluation, and counting the lost key-value pairs with several isolated nodes for the partition tolerance evaluation.

The result of the evaluation shows that the two target data stores perform differently. In the consistency evaluation, Redis performs better with a small number of nodes and small datasets, while Voldemort maintains stable consistency and scales with the number of nodes. In the availability benchmark, the performance of Redis is much higher than that of Voldemort, but the follow-up experiment shows that the availability of Voldemort can be increased by adding more nodes to the Voldemort system. Finally, although the partition tolerance results of Redis and Voldemort are at a similar level, the investigation of the clients shows that neither the Voldemort client nor the Redis client could locate the replicas. As a result, the evaluation of partition tolerance is left for future work.

To assess the suitability of the target data stores, we identified three criteria for evaluating the suitability of a key-value store for scientific applications: scalability, schema and IO.


According to the evaluation result, Redis scales with the size of the dataset in the consistency evaluation and provides extremely low latency with a small number of nodes and small datasets in the availability evaluation. As a result, Redis is best suited to a scientific application that meets those conditions.

The other target data store, Voldemort, has been shown to scale with the number of nodes in the consistency evaluation and to improve its availability, reaching relatively low latency, as more nodes are added in the availability evaluation. Thus, Voldemort can be considered for a scientific application that needs a large number of nodes and is not sensitive to latency.

We divide the future work for this dissertation into the following categories:

More data stores should be involved in the benchmark stage. For example, we may include MySQL[78], a relational database, to compare a NOSQL data store with a relational database. In addition, although we picked Redis and Voldemort as the two target key-value stores, there are still a large number of key-value stores that may perform differently. For instance, Riak[79] is also a disk-based key-value store like Voldemort, but uses only replication to focus on availability, and Kyoto Cabinet is a key-value implementation that uses a B+ tree[92] to enhance its performance. These two key-value stores were excluded from this dissertation because of time limitations. By comparing various key-value implementations and relational databases, we might reach a more precise conclusion about the suitability of data stores for scientific applications.

Additionally, we could develop new interfaces based on the YCSB framework. Although YCSB saved us a lot of time, it still lacks support for many NOSQL data stores. Rather than developing new benchmarking software, it may be better to develop new YCSB interfaces for benchmarking the availability of those NOSQL data stores.

Furthermore, partition tolerance should be evaluated again to obtain an accurate result. Currently the partition tolerance evaluation cannot be described as a success because of the client issues for both Redis and Voldemort. We could develop a new client based on the existing one to make sure that the client returns any available replicated data object.

Finally, the evaluation scripts could be improved in several ways. First, the loading script could be parallelized to reduce the time taken to insert data rows. Second, the Redis client script could be rewritten to involve the slave nodes in the evaluation of partition tolerance. Third, the source code of the Voldemort client could be examined to find the root cause of the partition tolerance issue.
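One possible shape for a parallelized loading script is sketched below. It is an assumption about how such a script could look, not the loadfile.py used in this dissertation: the function names are hypothetical, a plain dictionary stands in for the data store client, and a real client would probably need one connection per worker thread.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def load_chunk(store, rows):
    """Insert one chunk of (key, value) rows; returns the number inserted.
    `store` is any object that supports dict-style item assignment."""
    for key, value in rows:
        store[key] = value
    return len(rows)


def parallel_load(store, rows, workers=4, chunk_size=1000):
    """Split the rows into chunks and insert the chunks concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda chunk: load_chunk(store, chunk),
                          chunked(rows, chunk_size))
        return sum(counts)


rows = [("key%d" % i, "value%d" % i) for i in range(10000)]
store = {}  # hypothetical stand-in for a Redis or Voldemort client
print(parallel_load(store, rows))
```

Because loading is dominated by waiting for the data store to respond, several workers issuing requests at the same time could shorten the loading period, provided the data store and the network can absorb the extra load.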


Appendix A Node setting up procedure

sudo yum groupinstall "Development Libraries"

sudo yum groupinstall "Administration Tools"

sudo yum groupinstall "Development Tools"

sudo yum install ant

sudo yum install log4j

sudo yum install java

sudo yum install xml-commons-apis

sudo reboot

sudo mkdir benchmark

cd benchmark

sudo wget http://redis.googlecode.com/files/redis-2.4.15.tar.gz

sudo tar xzf redis-2.4.15.tar.gz

cd redis-2.4.15

sudo make

sudo wget https://github.com/downloads/voldemort/voldemort/voldemort-0.90.1.tar.gz --no-check-certificate

sudo tar xzf voldemort-0.90.1.tar.gz

cd voldemort-0.90.1

sudo ant

sudo vi /etc/sysctl.conf    # add the line: vm.overcommit_memory = 1

sudo wget https://github.com/downloads/brianfrankcooper/YCSB/ycsb-0.1.4.tar.gz --no-check-certificate


tar xzf ycsb-0.1.4.tar.gz


Appendix B Source script-startvoldemort.sh

cd /path/to/voldemort/voldemort-0.90.1

sudo ./bin/voldemort-server.sh /state/partition1/ycsb/voldemort-binding


Appendix C Source file-startredis.sh

sudo ./redis-2.4.15/src/redis-server redis-2.4.15/redis.conf


Appendix D Source script-conV.sh

python loadfile.py voldemort Nomao/Nomao.data

python test16.py voldemort

sleep 2


sleep 2


sleep 2


sleep 2


sleep 2


python loadfile.py voldemort block_1.csv


sleep 2


sleep 2


sleep 2


sleep 2



sleep 2




sleep 2


sleep 2


sleep 2


sleep 2


sleep 2




sleep 2


sleep 2


sleep 2


sleep 2


sleep 2





sleep 2


sleep 2


sleep 2


sleep 2


sleep 2




sleep 2


sleep 2


sleep 2


sleep 2


sleep 2




sleep 2


sleep 2



sleep 2


sleep 2


sleep 2




sleep 2


sleep 2


sleep 2


sleep 2


sleep 2




sleep 2


sleep 2


sleep 2


sleep 2



sleep 2




sleep 2


sleep 2


sleep 2


sleep 2



Appendix E Source script-conR.sh

python loadfile.py redis Nomao/Nomao.data

python test16.py redis

sleep 2


sleep 2


sleep 2


sleep 2


sleep 2


python loadfile.py redis block_1.csv


sleep 2


sleep 2


sleep 2


sleep 2



sleep 2




sleep 2


sleep 2


sleep 2


sleep 2


sleep 2




sleep 2


sleep 2


sleep 2


sleep 2


sleep 2





sleep 2


sleep 2


sleep 2


sleep 2


sleep 2




sleep 2


sleep 2


sleep 2


sleep 2


sleep 2




sleep 2


sleep 2



sleep 2


sleep 2


sleep 2




sleep 2


sleep 2


sleep 2


sleep 2


sleep 2




sleep 2


sleep 2


sleep 2


sleep 2



sleep 2




sleep 2


sleep 2


sleep 2


sleep 2



Appendix F Source script-ycsbV.sh

./ycsb-0.1.4/bin/ycsb load voldemort -P ycsb-0.1.4/workloads/workloada -p recordcount=50000 -target 1000 -p bootstrap_urls=tcp://10.1.255.227:6666

./ycsb-0.1.4/bin/ycsb run voldemort -P ycsb-0.1.4/workloads/workloada -p recordcount=50000 -target 1000 -p bootstrap_urls=tcp://10.1.255.227:6666





./ycsb-0.1.4/bin/ycsb load voldemort -P ycsb-0.1.4/workloads/workloada -p recordcount=500000 -target 1000 -p bootstrap_urls=tcp://10.1.255.227:6666







Appendix G Source script-ycsbR.sh

./ycsb-0.1.4/bin/ycsb load redis -P ycsb-0.1.4/workloads/workloada -p recordcount=50000 -target 1000 -p bootstrap_urls=tcp://localhost:6666

./ycsb-0.1.4/bin/ycsb run redis -P ycsb-0.1.4/workloads/workloada -p recordcount=50000 -target 1000 -p bootstrap_urls=tcp://localhost:6666





./ycsb-0.1.4/bin/ycsb load redis -P ycsb-0.1.4/workloads/workloada -p recordcount=1000000 -target 1000 -p bootstrap_urls=tcp://localhost:6666







Appendix H Source script-pa.sh

python patest.py voldemort





python patest.py redis






Appendix I Client setting up- Voldemort

Once the dependencies are installed, run the test suite as a sanity check. You need to first start up a Voldemort server locally, pointing to the config files in tests/voldemort_config. From the root of the Voldemort source tree, run:

> bin/voldemort-server.sh clients/python/tests/voldemort_config

In a separate shell, change into the clients/python directory and run:

> python setup.py nosetests

If all tests pass, you can install the package with the command:

> python setup.py install


Appendix J Client setting up- Redis

sudo wget https://github.com/youngking/redis-shard/zipball/master

sudo unzip master

cd redis-shard

sudo python setup.py install


Appendix K Benchmark file download

wget http://archive.ics.uci.edu/ml/machine-learning-databases/00227/Nomao.zip

sudo unzip Nomao.zip

wget http://archive.ics.uci.edu/ml/machine-learning-databases/00210/documentation

sudo unzip documentation


Appendix L Voldemort configuration file

cluster.xml:

<cluster>

<name>mycluster</name>

<server>

<id>0</id>

<host>10.1.255.227</host>

<http-port>8081</http-port>

<socket-port>6666</socket-port>

<admin-port>6667</admin-port>

<partitions>0, 1</partitions>

</server>

<server>

<id>1</id>

<host>10.1.255.173</host>





</server>

<server>

<id>2</id>


<host>10.1.255.223</host>





</server>

<server>

<id>3</id>

<host>10.1.255.221</host>





</server>

<server>

<id>4</id>

<host>10.1.255.220</host>





</server>

<server>

<id>5</id>

<host>10.1.255.219</host>






</server>

<server>

<id>6</id>

<host>10.1.255.216</host>





</server>

<server>

<id>7</id>

<host>10.1.255.215</host>





</server>

<server>

<id>8</id>

<host>10.1.255.214</host>





</server>

<server>

<id>9</id>


<host>10.1.255.212</host>





</server>

<server>

<id>10</id>

<host>10.1.255.209</host>





</server>

<server>

<id>11</id>

<host>10.1.255.195</host>





</server>

<server>

<id>12</id>

<host>10.1.255.129</host>






</server>

<server>

<id>13</id>

<host>10.1.255.182</host>





</server>

<server>

<id>14</id>

<host>10.1.255.179</host>





</server>

<server>

<id>15</id>

<host>10.1.255.175</host>





</server>

</cluster>
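Because the server entries in cluster.xml follow a regular pattern, the file could be generated rather than written by hand. The Python sketch below is one possible way to do this and is not the procedure used in this dissertation. It rests on two assumptions that the listing above confirms only for node 0: every node uses the same http, socket and admin ports, and partition IDs are assigned contiguously, two per node. The function name `build_cluster_xml` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_cluster_xml(name, hosts, partitions_per_node=2,
                      http_port=8081, socket_port=6666, admin_port=6667):
    """Return a Voldemort-style cluster definition for `hosts` as an XML string.

    Assumption: all nodes share the same ports, and node i owns the
    partitions i*k .. i*k + k - 1, where k is `partitions_per_node`.
    """
    cluster = ET.Element("cluster")
    ET.SubElement(cluster, "name").text = name
    for node_id, host in enumerate(hosts):
        server = ET.SubElement(cluster, "server")
        ET.SubElement(server, "id").text = str(node_id)
        ET.SubElement(server, "host").text = host
        ET.SubElement(server, "http-port").text = str(http_port)
        ET.SubElement(server, "socket-port").text = str(socket_port)
        ET.SubElement(server, "admin-port").text = str(admin_port)
        first = node_id * partitions_per_node
        ids = range(first, first + partitions_per_node)
        ET.SubElement(server, "partitions").text = ", ".join(map(str, ids))
    return ET.tostring(cluster, encoding="unicode")
```

Calling `build_cluster_xml("mycluster", ["10.1.255.227", "10.1.255.173"])` produces a two-node definition in which node 0 owns partitions 0 and 1, as in the listing above, and node 1 is given partitions 2 and 3 under the stated assumption.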


stores.xml:

<stores>

<store>

<name>test</name>

<persistence>bdb</persistence>

<routing-strategy>all-routing</routing-strategy>

<routing>server</routing>

<replication-factor>9</replication-factor>

<required-reads>2</required-reads>

<required-writes>1</required-writes>

<key-serializer>

<type>string</type>

</key-serializer>

<value-serializer>

<type>string</type>

</value-serializer>

</store>

<store>

<name>usertable</name>


<routing-strategy>consistent-routing</routing-strategy>

<routing>client</routing>




<key-serializer>

<type>string</type>

</key-serializer>


<value-serializer>

<type>java-serialization</type>

</value-serializer>

</store>

<store>

<name>json_test</name>


<routing-strategy>consistent-routing</routing-strategy>

<routing>client</routing>




<key-serializer>

<type>json</type>

<schema-info version="0">"int32"</schema-info>

</key-serializer>

<value-serializer>

<type>json</type>

<schema-info version="0">{ "a":"float32", "b":["int16"], "c":"string", "d":{ "foo":"boolean", "bar":"date" }}</schema-info>

</value-serializer>

</store>

</stores>
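The replication-factor (N), required-reads (R) and required-writes (W) values of the test store determine how its consistency and availability trade off. In Dynamo-style quorum systems, a read is guaranteed to overlap with the latest acknowledged write only when R + W > N. The following Python sketch checks this condition for the values listed above; the function names are hypothetical, and the failure count is a simplified estimate that ignores implementation details of Voldemort.

```python
def quorum_overlaps(n, r, w):
    """Return True if every read quorum intersects every write quorum (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("required reads and writes must lie between 1 and N")
    return r + w > n


def tolerated_failures(n, r, w):
    """Number of replicas that may be down while reads and writes still reach quorum."""
    return n - max(r, w)


# Values of the "test" store in stores.xml: N = 9, R = 2, W = 1.
print(quorum_overlaps(9, 2, 1))     # False: a read may miss the latest write
print(tolerated_failures(9, 2, 1))  # 7
```

With N = 9, R = 2 and W = 1 the quorums do not overlap, so this configuration favours availability: in this simplified model up to seven replicas can be unreachable before requests start to fail, at the cost of reads that may return stale values.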


References

[1] Python. Python programming language. http://www.python.org/ (Online. Accessed 24 August 2012)

[2] Patrick Ohlinger. Wal-Mart’s Data Warehouse. Vienna University of Technology, 2006.

[3] Daniel J. Abadi. Data Management in the Cloud: Limitations and Opportunities. IEEE Data Eng. Bull., 32(1):3–12, 2009.

[4] Fay Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006.

[5] Giuseppe DeCandia et al. Dynamo: Amazon’s Highly Available Key-value Store. In SOSP, page 220, 2007.

[6] Avinash Lakshman. Cassandra - A Decentralized Structured Storage System. In LADIS, 2009.

[7] Memcached. http://code.google.com/p/memcached/wiki/NewOverview (Online. Accessed 24 August 2012)

[8] Ioannis Konstantinou. Distributed Indexing of Web Scale Datasets for the Cloud. MDAC’10, 2010.

[9] Mingyuan An. Using Index in the MapReduce Framework. IEEE 10.1109 2010.12

[10] Vinay Sudhakaran. Programming Abstractions for Dynamic, Distributed, Data-intensive Computing. MSc dissertation, 2011.

[11] Omkar Kulkarni. Benchmarking an Amdahl-balanced Cluster for Data Intensive Computing. MSc dissertation, 2011.


[12] Jeff Terrace. Object storage on CRAQ: High-throughput chain replication for read-mostly workloads. In Proc. USENIX Annual Technical Conference, June 2009.

[13] Robbert van Renesse. Chain replication for supporting high throughput and availability. In Proc. 6th USENIX OSDI, Dec. 2004.

[14] Add data replication to memcached. http://repcached.lab.klab.org/

[15] Ion Stoica. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proc. ACM SIGCOMM, Aug. 2001.

[16] Avinash Lakshman, Prashant Malik. Cassandra - A Decentralized Structured Storage System. ACM, 2009.

[17] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. (2008), no. 2.

[18] MongoDB. http://www.mongodb.org/ (Online. Accessed 24 August 2012)

[19] Project Voldemort. http://project-voldemort.com/ (Online. Accessed 24 August 2012)

[20] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. SoCC, 2010, pp. 143–154.

[21] NOSQL store. http://NOSQL-database.org/ (Online. Accessed 24 August 2012)

[22] C. J. Date. Introduction to Database Systems. 8th edition, Addison-Wesley, 2003. ISBN 0-321-19784-4.

[23] M. Stonebraker. SQL databases v. NOSQL databases. Communications of the ACM, Vol. 53 No. 4, pp. 10-11.

[24] A Conversation with Werner Vogels. http://queue.acm.org/detail.cfm?id=1142065, 2006. (Online. Accessed 17 June 2012)


[25] Theo Härder. Principles of transaction-oriented database recovery. Computing Surveys, 1983.

[26] Philip A. Bernstein. Multiversion Concurrency Control - Theory and Algorithms. ACM Transactions on Database Systems, Vol. 8, No. 4, 1983, pp. 465-483.

[27] Eric A. Brewer. PODC keynote. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf, 2000. (Online. Accessed 16 June 2012)

[28] Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 2002.

[29] Todd Lipcon. Design Patterns for Distributed Non-Relational Databases. June 2009. Presentation of 2009-06-11. http://www.slideshare.net/guestdfd1ec/design-patterns-for-distributed-nonrelationaldatabases (Online. Accessed 24 August 2012)

[30] Jay K. Project Voldemort – Design. 2010. http://project-voldemort.com/design.php (Online. Accessed 24 August 2012)

[31] Werner Vogels. Eventually Consistent - Revisited. All Things Distributed. http://www.allthingsdistributed.com/2008/12/eventually_consistent.html, 2008. (Online. Accessed 16 June 2012)

[32] David Karger. Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. Annual ACM Symposium on Theory of Computing, 1997, pp. 654–663.

[33] Parallel & Distributed Operating Systems Group: The Chord/DHash Project. http://pdos.csail.mit.edu/chord/ (Online. Accessed 24 August 2012)

[34] Sanjay Ghemawat, Shun-Tak Leung. The Google File System. http://labs.google.com/papers/gfs-sosp2003.pdf (Online. Accessed 24 August 2012)

[35] Tom White. Consistent Hashing. http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html

[36] Ricky Ho. NOSQL Patterns. http://horicky.blogspot.com/2009/11/NOSQL-patterns.html (Online. Accessed 24 August 2012)

[37] Rick Cattell. Scalable SQL and NOSQL Data Stores. ACM vol. 39 No. 4, 2011.

[38] M. Seeger. Key-Value Stores: a practical overview, 2009. Hochschule der Medien, Computer Science, Stuttgart, Germany.

[39] Amazon S3. http://aws.amazon.com/s3/ (Online. Accessed 24 August 2012)

[40] Neo4j. http://neo4j.org (Online. Accessed 24 August 2012)

[41] GraphDB. http://www.sones.com (Online. Accessed 24 August 2012)

[42] FlockDB. http://github.com/twitter/flockdb (Online. Accessed 24 August 2012)

[43] HBase. http://hbase.apache.org (Online. Accessed 24 August 2012)

[44] Hypertable. http://hypertable.org (Online. Accessed 24 August 2012)

[45] Ricky Ho. Query processing for NOSQL DB. 2009. http://horicky.blogspot.com/2009/11/query-processing-for-NOSQL-db.html (Online. Accessed 24 August 2012)

[46] Limitations of the Relational Model. http://highscalability.com/blog/2009/11/4/damn-which-database-do-i-use-now.html

[47] Bob I. Drop ACID and think about Data. March 2009. http://blip.tv/pycon-us-videos-2009-2010-2011/drop-acid-and-think-about-data-1959086 (Online. Accessed 24 August 2012)

[48] Project Voldemort Design. http://project-voldemort.com/design.php

[49] D. Crockford. The application/json Media Type for JavaScript Object Notation (JSON). IETF (Internet Engineering Task Force), 2006. http://tools.ietf.org/html/rfc4627 (Online. Accessed 24 August 2012)


[50] Google Code – Protocol Buffers. http://code.google.com/intl/de/apis/protocolbuffers/ (Online. Accessed 24 August 2012)

[51] Redis. http://redis.io/ (Online. Accessed 24 August 2012)

[52] Non blocking rehashing. http://antirez.com/m/p.php?i=209 (Online. Accessed 24 August 2012)

[53] CouchDB. http://couchdb.apache.org/ (Online. Accessed 24 August 2012)

[54] Jing Jiang, Jayanth Gummaraju, Rohit Gupta. Scientific Applications. http://www.stanford.edu/class/ee392c/handouts/apps/scientific_long.pdf

[55] Sean Ahern. Scientific Application Requirements for Leadership Computing at the Exascale. December 2007. Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831-6283.

[56] YCSB. https://github.com/brianfrankcooper/YCSB/wiki/ (Online. Accessed 24 Aug 2012)

[57] Ant. http://ant.apache.org/ (Online. Accessed 24 Aug 2012)

[58] Log4j. http://logging.apache.org/log4j/1.2/ (Online. Accessed 24 Aug 2012)

[59] Java. http://www.java.com/ (Online. Accessed 24 Aug 2012)

[60] Paul Martin. EDIM1 Progress Report. Dec. 2011. http://research.nesc.ac.uk/files/report.pdf (Online. Accessed 24 Aug 2012)

[61] Amazon Instance Type. http://aws.amazon.com/ec2/instance-types/ (Online. Accessed 22 Aug 2012)

[62] Redis-shard. https://github.com/youngking/redis-shard (Online. Accessed 22 Aug 2012)

[63] Voldemort Python client. https://github.com/voldemort/voldemort/tree/master/clients/python (Online.

[64] Nomao dataset. http://archive.ics.uci.edu/ml/datasets/Nomao (Online.

[65] Record Linkage Comparison Patterns Data Set. http://archive.ics.uci.edu/ml/datasets/Record+Linkage+Comparison+Patterns

[66] Olivier Curé. Introduction to NOSQL. 2011. http://igm.univ-mlv.fr/~ocure/LIGM_LIKE/Teaching/ir3/cloudCourse3.pdf (Online. Accessed Aug. 2012)

[67] Lutz Prechelt. Comparing Java vs. C/C++ Efficiency Differences to Interpersonal Differences. Communications of the ACM, October 1999, Vol. 42, No. 10.

[68] Redis PreSharding. http://antirez.com/post/redis-preSharding.html (Online. Accessed Aug 2012)

[69] Peter Zaitsev. Why MySQL could be slow with large data set? http://www.mysqlperformanceblog.com/2006/06/09/why-mysql-could-be-slow-with-large-tables/ (Online. Accessed 17 Aug 2012)

[70] Redis FAQ. http://redis.io/topics/faq (Online. Accessed 17 Aug 2012)

[71] Redis Replication. http://redis.io/topics/replication (Online. Accessed 17 Aug 2012)

[72] Randal E. Bryant. Data-Intensive Scalable Computing for Scientific Applications. IEEE CS and the AIP. 1521-9615/11. 2011.

[73] Scientific Committee on Antarctic Research. http://www.antarctica.ac.uk/met/READER/data.html (Online. Accessed 17 Aug 2012)

[74] SDSS data set. http://data.sdss3.org/sas/dr8/common/ (Online. Accessed 17 Aug 2012)

[75] REDD data set. http://redd.csail.mit.edu/ (Online. Accessed 17 Aug 2012)

[76] Sean Ahern. Scientific Application Requirements for Leadership Computing at the Exascale. Oak Ridge National Laboratory, December 2007.

[77] A fifteen-minute introduction to Redis data types. http://redis.io/topics/data-types-intro (Online. Accessed 17 Aug 2012)

[78] MySQL. http://www.mysql.com/ (Online. Accessed 17 Aug 2012)

[79] Riak. http://wiki.basho.com/Riak.html (Online. Accessed 17 Aug 2012)

[80] Kyoto Cabinet. http://fallabs.com/kyotocabinet/ (Online. Accessed 17 Aug 2012)

[81] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears. Benchmarking Cloud Serving Systems with YCSB. SoCC’10, June 10–11, 2010, Indianapolis, Indiana, USA.

[82] Definition of scientific application. http://www.pcmag.com/encyclopedia_term/0,1237,t=scientific+application&i=50872,00.asp (Online. Accessed 17 Aug 2012)

[83] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. OSDI 2004.

[84] LinkedIn. http://linkedin.com (Online. Accessed 17 Aug 2012)

[85] Karl Seguin. The Little Redis Book. http://openmymind.net/redis.pdf (Online. Accessed 17 Aug 2012)

[86] UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/ (Online. Accessed 17 Aug 2012)

[87] André B. Bondi. Characteristics of scalability and their impact on performance. Proceedings of the 2nd International Workshop on Software and Performance, Ottawa, Ontario, Canada, 2000, ISBN 1-58113-195-X, pp. 195–203.

[88] Mike Hogan. Shared-Disk vs. Shared-Nothing. http://www.scaledb.com/pdfs/WP_SDvSN.pdf (Online. Accessed 17 Aug 2012)

[89] Kristóf Kovács. Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Membase vs Neo4j comparison. http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis (Online. Accessed 17 Aug 2012)

[90] HECToR: UK National Supercomputing Service. http://www.hector.ac.uk/ (Online. Accessed 22 Aug 2012)

[91] MAUS (MICE Analysis User Software). http://micewww.pp.rl.ac.uk/projects/maus (Online. Accessed 22 Aug 2012)

[92] B+ Trees. http://baze.fri.uni-lj.si/dokumenti/B+%20Trees.pdf (Online. Accessed 22 Aug 2012)


[93] Derek Greene. Graph and Network Analysis. http://mlg.ucd.ie/files/summer/tutorial.pdf (Online. Accessed 22 Aug 2012)

[94] Python Multiprocessing module. http://docs.python.org/library/multiprocessing.html (Online. Accessed 22 Aug 2012)
