geographical sharding in mongodb - ulisboa€¦ · vertical scaling eventually hits a limit when...

Geographical sharding in MongoDBUsing Voronoi partitioning to geographically shard point data

Ricardo Filipe Amendoeira

Thesis to obtain the Master of Science Degree in

Electrical and Computer Engineering

Supervisor: Prof. Joao Nuno de Oliveira e Silva

Examination Committee

Chairperson: Prof. Antonio Manuel Raminhos Cordeiro Grilo

Supervisor: Prof. Joao Nuno de Oliveira e Silva

Member of the Committee: Prof. Joao Carlos Antunes Leitao

May 2017

Acknowledgments

I have tremendous gratitude for the help given by my supervisor, Professor Joao Nuno de Oliveira

e Silva, who not only started my journey in the world of computer programming and widened my

horizons with various other courses, but was also always available and happy to help me and make me

focus on the important tasks at hand, keeping me motivated and smiling and pushing me to do better

until the very end.

Of course I am incredibly thankful to my parents for my upbringing, their sacrifices so that I could

have a good education, their forgiveness for my dumb mistakes and their unconditional support during

my time at IST.

To my girlfriend, Rita, thank you for always being positive, supportive and enjoying life with me.

I would also like to acknowledge the support of my friends, who made my time at IST a great and

enjoyable experience.

i

Abstract

The growth of dataset sizes in the Big Data era has created the need for horizontally scalable Database

Management Systems (DBMS), like MongoDB, Apache Cassandra and many others, generally known

as NoSQL DBMSs. These systems need to be able to be configured in ways that best fit the data

that they’ll contain, partitioning the data (Sharding) in an context-aware way so that queries can be

executed as efficiently as possible.

Spatial data (associated with a location) availability has been steadily growing as more devices get

geolocation capabilities, making it important for NoSQL DBMSs to be able to handle this type of data.

While NoSQL DBMSs are capable of sharding, not all (MongoDB included), can do it based on

location or do it in an efficient way.

Following the work by Joao Pedro Martins Graca in his Master’s thesis [6], exploring the various

spatial partitioning policies, this work aims to implement and evaluating the practical usefulness of

the proposed policy GeoSharding, based on Voronoi indexing, on the popular open-source project

MongoDB.

Benchmark results show that this policy can have a large positive impact in performance when

used with datasets of large amounts of small documents, such as sensor data or statistical geography

information.

Keywords: Big Data, Spatial Data, Sharding, MongoDB, NoSQL

iii

Resumo

O crescimento dos conjuntos de dados na era da Big Data criou a necessidade de sistemas de gestao de

bases de dados com capacidade para escalar horizontalmente, como o MongoDB, o Apache Cassandra

e muitos outros, geralmente conhecidos como sistemas NoSQL. Estes sistemas devem ser capazes de ser

configurados de formas adequadas aos dados que vao armazenar, segmentando os conjuntos de dados

(tecnica conhecida como Sharding) de formas que aproveitem o contexto dos mesmos, permitindo que

as interaccoes com a base de dados sejam executadas de forma o mais eficiente possıvel.

A disponibilidade de dados espaciais (associados com uma localizacao) tem aumentado consis-

tentemente ao longo dos anos, com o aumento do numero de dispositivos com capacidade para

geolocalizacao, tornando importante que os sistemas NoSQL sejam capazes de lidar e tirar partido

deste tipo de dados.

Embora os sistemas NoSQL suportem sharding, nem todos (incluındo o MongoDB), sao capazes de

o fazer com base em geolocalizacao, ou de faze-lo de forma eficiente.

Em seguimento ao trabalho feito por Joao Pedro Martins Graca na sua tese de Mestrado [6], ex-

plorando varias polıticas de segmentacao espacial, este trabalho tem como objectivo implementar e

avaliar a utilidade pratica da polıtica proposta GeoSharding, baseada em indexacao de Voronoi, no

popular projecto de codigo aberto MongoDB.

Os resultados dos testes efectuados mostram que esta polıtica pode ter um grande impacto positivo

no desempenho do sistema quando usada com conjuntos de dados que contenham grandes numeros

de pequenos documentos, como dados de sensores ou informacao de geografia estatıstica.

Palavras-Chave: Big Data, Dados Espaciais, Sharding, MongoDB, NoSQL

v

Contents

Abstract iii

Resumo v

List of Tables ix

List of Figures xi

List of Listings xiii

Acronyms xv

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Objectives and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Related work 5

2.1 Geographical Information Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 NoSQL database management systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Geographically aware sharding algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.4 Existing solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5 MongoDB (as of version 3.4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 GeoMongo 21

3.1 GeoMongo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4 Evaluation 29

4.1 Performance evaluation methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.2 Performance evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.3 Evaluation of MongoDB as a base for GeoMongo . . . . . . . . . . . . . . . . . . . . . . 39

5 Conclusions 41

5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

Bibliography 45

A Write performance benchmark script A-1

vii

Contents

B Write performance results B-1

B.1 Documents with attached image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2

B.2 Text-only documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-3

C Read performance benchmark script C-1

D Read performance results D-1

D.1 Documents with attached image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-2

D.2 Text-only documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-3

E Example document from the dataset E-1

viii

List of Tables

B.1 Write benchmark for documents with image attached (total running time) - 1 shard . . . B-2

B.2 Write benchmark for documents with image attached (total running time) - 2 shards . . B-2

B.3 Write benchmark for documents with image attached (total running time) - 4 shards . . B-2

B.4 Write benchmark for text-only documents (total running time) - 1 shard . . . . . . . . . B-3

B.5 Write benchmark for text-only documents (total running time) - 2 shards . . . . . . . . . B-3

B.6 Write benchmark for text-only documents (total running time) - 4 shards . . . . . . . . . B-3

D.1 Read benchmark for documents with image attached (total running time) - 1 shard . . . D-2

D.2 Read benchmark for documents with image attached (total running time) - 2 shards . . . D-2

D.3 Read benchmark for documents with image attached (total running time) - 4 shards . . . D-2

D.4 Read benchmark for text-only documents (total running time) - 1 shard . . . . . . . . . . D-3

D.5 Read benchmark for text-only documents (total running time) - 2 shards . . . . . . . . . D-3

D.6 Read benchmark for text-only documents (total running time) - 4 shards . . . . . . . . . D-3

ix

List of Figures

2.1 A practical example of the MapReduce process, on MongoDB . . . . . . . . . . . . . . . . 8

2.2 An example of kriging interpolation (red full line). Also shown: smooth polynomial

(blue dotted line) and Gaussian confidence intervals (shaded area). Squares indicate

existing data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 A Z-curve used to partition the Earth into a 4x4 grid . . . . . . . . . . . . . . . . . . . . 11

2.4 The quadtree data structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.5 A spherical Voronoi diagram of the world’s capitals . . . . . . . . . . . . . . . . . . . . . 12

2.6 JSON document example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.7 MongoDB Sharded Cluster Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.8 Partition of the dataset into chunks based on shard key values . . . . . . . . . . . . . . . 15

2.9 Examples of bad shard key properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.10 An index being used to return sorted results . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.11 Replica set architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.12 Replica set failover process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.1 A visual summary of what data is stored in each type of machine in a MongoDB cluster . 23

3.2 Great-circle distance, ∆σ , between two points, P and Q. Also shown: λp and ϕp . . . . . . 27

4.1 Architecture of the benchmarking cluster (with varying numbers of shards) . . . . . . . . 31

4.2 Underwater resources of USA (mainland region shown) . . . . . . . . . . . . . . . . . . 32

4.3 Read performance of documents with attached image with: a) 1 shard, b) 2 shards and

c) 4 shards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.4 Horizontal scaling of read performance of documents with attached image, measured

with 64 client threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.5 Write performance of documents with attached image with: a) 1 shard, b) 2 shards and

c) 4 shards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.6 Horizontal scaling of write performance of documents with attached image, measured

with 64 client threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.7 Read performance of text-only documents with: a) 1 shard, b) 2 shards and c) 4 shards . 36

4.8 Horizontal scaling of read performance of text-only documents, measured with 64 client

threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.9 Write performance of text-only documents with: a) 1 shard, b) 2 shards and c) 4 shards 38

4.10 Horizontal scaling of write performance of text-only documents, measured with 64

client threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

xi

List of Listings

A.1 The benchmark script for write performance, in Python3 . . . . . . . . . . . . . . . . . . A-2

C.1 The benchmark script for read performance, in Python3 . . . . . . . . . . . . . . . . . . C-2

xiii

Acronyms

DB Database.

DBMS Database Management Systems.

GIS Geographical Information Systems.

HA High Availability.

JSON JavaScript Object Notation.

RDBMS Relational database management systems.

SQL Serial Query Language.

xv

1Introduction

Contents1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Objectives and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1

1. Introduction

1.1 Motivation

Geo-referenced data, data that can be associated with some sort of location, has been growing in

availability as more devices like smart-phones get geolocation capabilities and satellite imagery and

sensor networks become better more available [12].

Photos posted to social-media, business information that can be searched by customers or naviga-

tion devices, weather data, census data among many other examples can all be associated with general

(regions) or specific (coordinates) locations.

As these datasets grow they become increasingly more difficult to manage and store while main-

taining good performance when querying or being used in computation [12].

A common way to scale databases is via vertical scaling, where the machine that runs the database

management system is upgraded with better hardware. Vertical scaling eventually hits a limit when the

cost of better hardware becomes uneconomical or there is no possible upgrade path. An alternative way

of scaling exists, horizontal scaling, where the system is scaled by adding more machines (usually with

commodity hardware to keep costs down) [2]. Horizontal scaling has no defined limit on scalability

but can of course be limited by communication overhead within the system or external factors like

network capacity.

The most common type of DMBS, Relational DBMSs, are usually very complex to scale horizontally

because of the way data is stored, with relations between tables and normalization, that makes tables

heavily dependent on each other to complete each table’s information [2].

A more recent kind of DBMS, the NoSQL DBMS, which aim to improve horizontal scalability so that

these systems can handle incredibly large datasets or/and very large request loads for web services.

They mostly achieve this by lowering some of the requirements/promises of RDBMSs, usually data

durability and/or data consistency, and by data de-normalization, which makes tables and rows (or the

equivalent data types) mostly independent, allowing for data to be easily partitioned and distributed

among several machines.

This type of horizontal scalability via data partitioning is called sharding. Sharding policies are

usually one of the main things to worry about when trying to improve the performance of a sharded

cluster, since the way data is partitioned and distributed affects the efficiency of operations on the

database. The policies should be adequate to both the data itself and the operations that will commonly

be done on it, so that data locality can be leveraged. Ideally queries should only require contacting a

single machine to be executed, the goal of choosing a good sharding policy is to approximate this as

much as possible.

Not many NoSQL DMBSs support sharding policies adequate for geographical data [17]: common

operations on georeferenced data usually involve fetching information related to a certain area, or do-

ing interpolation based on neighboring data, so a good distribution policy for geographical data should

take this into account to take advantage of data locality. If not, a simple query for data related with a

certain small area may have to communicate with all machines in the cluster, degrading performance

for any simultaneously occurring database operations.

A previous work compared different sharding policies by means of simulations, and concluded that

a new suggested policy named GeoSharding can have better performance than most of the policies

that are currently used in NoSQL DBMSs.

An implementation and evaluation of this policy could thus be valuable to measure the practical

usefulness of GeoSharding for NoSQL DBMSs.

2

1.2 Proposed Solution

1.2 Proposed Solution

An implementation of the partitioning policy proposed by Joao Graca [6] in the DBMS MongoDB is

proposed as a way to shard geo-referenced data without losing the context of the data’s location.

MongoDB was chosen as a base to the work because it’s a popular open-source NoSQL DMBS with

API-level support for interacting with geographical data (like range queries) and has existing support

for spatial indexing (which improves query performance at the machine level) but doesn’t have an

adequate sharding policy for georeferenced data, which would be the next logical step for geographical

data support on MongoDB. MongoDB can also do distributed computation, possibly making it a one-

stop shop for geographical data storage and processing.

The partitioning policy is conceptually quite simple: Each shard is assigned a virtual geographical

location in the form of latitude/longitude coordinates (doesn’t have to be related to it’s physical lo-

cation) and then Geo-referenced data is placed on the shard that is determined to be the closest one,

based on the shard’s virtual location. GeoSharding is a form of Voronoi Indexing, a spatial partitioning

technique that is used in many different fields [9].

This policy allows for irregularly sized regions to be assigned, as opposed to alternatives like geo-

hashing (z-curves) or quad-tree hashing (explained in chapter 2) where the regions are fixed sized.

This characteristic makes GeoSharding more flexible and better suited to irregular distributions of

spatial data, which is common (business locations are overwhelmingly located in cities, for example).

1.3 Objectives and results

The main objective of this work is to follow-up on the study done in [6] and implement the GeoShard-

ing partition policy for sharding in an existing database management system and to evaluate its real-

world effectiveness.

This boils down to making a working version of MongoDB that can be configured to shard by

geographical coordinates and allows the user to issue different virtual locations to different shards,

insert and search for Geo-referenced data and issue range-based queries that can be executed in an

efficient manner.

These objectives were achieved successfully, with the exception of range queries which, while pos-

sible, are not optimized to take full advantage of the data distribution of GeoSharding.

The results of the evaluation from chapter 4 show that GeoSharding is very promising for real-world

use, with a good choice of shard locations the performance is equivalent or better than hashed sharding

on all tested scenarios, with big performance improvements when using datasets where documents are

relatively small, which is common with sensor data and census information.

GeoSharding also has the advantage of distributing data in a way that allows queries and compu-

tations to take advantage of data locality, for example, range queries can be implemented efficiently

by contacting only a small subset of the cluster’s servers, which is not possible when using hashed-

sharding.

1.4 Outline

Chapter 2 presents some background that is relevant to understand GeoSharding and GeoMongo, as

well as some related work.

Chapter 3 goes into more detail about the requirements for a useful use of GeoMongo, an overview

of the MongoDB platform that GeoMongo is built on and the GeoMongo implementation details.

Results and evaluation are presented on chapter 4 and the conclusions are presented on chapter 5.

3

2Related work

Contents2.1 Geographical Information Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 NoSQL database management systems . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Geographically aware sharding algorithms . . . . . . . . . . . . . . . . . . . . . . . 10

2.4 Existing solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5 MongoDB (as of version 3.4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

5

2. Related work

This chapter presents some relevant background and Geographical Information Systems and data,

a small introduction to the requirements of data computation, an introduction to NoSQL systems and

popular algorithms for spatial partitioning, an introduction to some popular systems for data storage

and/or computation and finally an overview of MongoDB from a user’s perspective.

2.1 Geographical Information Systems

Geographical Information Systems (GIS) are systems designed to store, manipulate, analyze manage

or present many types of spatial or georeferenced data. These systems can vary much in both intended

use and technology stack, but always involve some kind of data store.

In the following sections several features of these systems are analyzed.

2.1.1 Spatial data types

Spatial data types are the backbone of GISs, as they are used to represent and store the information

in the system. As mentioned in [1], there are two types of spatial data types, which will be somewhat

familiar to those used to dealing with image formats: vector and raster data.

Like with images, vector data types represent geometric features, such as points, lines or polygons,

while raster data types use point grids that cover the desired space.

The use case will determine the ideal type to use, but the vector model can offer more flexibility

and precision. An overview of the most common types of vector data is given below:

• Point Data: This is the simplest data type and it is stored as a pair of X, Y Coordinates. It

represents the entity whose size is negligible as compared to the scale of the whole map such as

it may represent a city or a town on the map of a country.

• Lines and Polylines: A Line is formed by joining two points in an end to end manner and it

is represented as a pair of Point data types. In Spatial Databases, Polyline is represented as a

sequence of lines such that the end point of a line is the starting point of next line in series.

Polylines are used to represent spatial features such as Roads, Streets, Rivers or Routes, for

example.

• Polygons: Polygon is one of the most widely used spatial data type. They capture two dimensional

spatial features such as Cities, States, and Countries etc. In Spatial Databases, polygons are

represented by the ordered sequence of Coordinates of its Vertices, first and the last Coordinates

being the same.

• Regions: Regions is another significant spatial data model. A Region is a collection of overlap-

ping, non-overlapping or disjoint polygons. Regions are used to represent spatial features such

as the State of Hawaii, including several islands (polygons), for example.

2.1.2 Data computation

Data computation of spatial data can be useful for many different purposes, such as Geostatistics, sim-

ulations (of weather, for example), data mining or even machine learning (a transportation company

could use collected data to train an efficient route planner that takes several factors like weather and

time into account, for example).

The following are some important factors/features to consider when planning to do computation

on spatial data.

6

2.1 Geographical Information Systems

Server-side computation

Server-side computation can be a very important factor with huge datasets [1]. It avoids transferring

the dataset to the client by instead having the client upload the program to the server and having the

server send the results back to the client, which are almost always several orders of magnitude smaller

than the dataset, in the Big Data space.

Storage servers that have to do double-duty as processing servers can be much more expensive than

simple storage servers and vertical scalability is limited, so cluster processing with horizontal scaling

is often used when server-side computation is desired. This field is mostly dominated by Hadoop and,

more recently, Spark, both discussed in more detail on section 2.4.

Data locality

Data locality is an important concept to take into account when performing computation [1]. The

closer it is to the hardware that processes it, the faster the computation can be done. Note that in this

context closer does not necessarily mean physically closer but lower latency access by the CPU. This

is true even with small amounts of data, modern CPU’s have several levels of cache beyond the main

memory, which can itself be used to cache data from storage drives, so that the data in use can be more

readily available to the processor.

With large amounts of data this becomes increasingly important, since moving the data can actually

take much longer than the time it takes to process it, depending on how slow the connection used to

load the data is.

In the case of clustered data stores that support server-side computation the partitioning policy

can have big effects on performance, by distributing data appropriately according to the types of

computations that are desired and thus avoiding unnecessary data-transfer between nodes, which is

typically much slower than the internal connections to a machine’s storage hardware.

Temporal efficiency

Temporal efficiency of a computation system can be of two kinds: latency and throughput.

Throughput refers to the amount of data that can be processed in a given amount of time, ans is

usually more relevant for after the fact data analysis and data mining. This is the purposed of systems

like Hadoop, which are capable of crunching very large amounts of data at reasonable speed but suffer

in start-up latency, and can’t thus be expected to suit real-time or near real-time applications. lack

When doing

Latency refers to the amount of time the system will take to give the first result after starting the

computation, and is thus crucial for real-time or live computations (a user searching for nearby points

of interest, for example) but not very relevant for analysis applications where the goal is to be able to

handle larger amounts of data and latency isn’t relevant.

2.1.3 MapReduce

MapReduce [4] is a programming model that allows even programmers with little to no experience

with parallel programming to take advantage of massively parallel clusters.

To use MapReduce the programmer writes two pure (no side-effects) functions, Map and Reduce.

The requirement that both functions have to pure functions limits flexibility: they cannot, for example,

make database or network requests. MapReduce does not allow communication between running Map

or Reduce functions, which makes its use limited to embarrassingly parallel computations but is also

one of the reasons for its simplicity. When running the MapReduce computation, a set of key/value

pairs is fetched from the data store and the following order of operations occurs:

7

2. Related work

1. Some process is used to select the data that is to be processed. Not all implementations of

MapReduce include this step because undesired data can be filtered and thrown out by the Map

function. Making this a separate, explicit, step has the advantage of making Map functions

cleaner and can allow for some simple but effective optimizations.

2. The Map function is applied to each input key/value pair, executing the desired computation and

emitting an output key/value pair.

3. When all input pairs have been processed by the Map function a new set consisting of the pairs

emitted by the Map function is now available. This set may have keys with more than one value

if certain keys weren’t unique in the input set.

4. The Reduce function is applied for each key in the sub-set of keys from the output set that had

more than one associated value. The Reduce function’s purpose it to take the multiple values

and generate a single value for the respective key

5. After Reduce has been applied to all relevant keys the final set is ready, where every key from

Map’s output set is associated with a single value.

A practical example of the MapReduce process, on MongoDB, can be seen in fig. 2.1. The example

includes the data selection step, via a query filter that selects only the documents with the statusattribute equal to ”A”. In the case of the key ”A123”, which has two values after the Map step, 500 and

250, the Reduce function is applied so that in the result set each key only has a single value associated.

Figure 2.1: A practical example of the MapReduce process, on MongoDB

Source: https://docs.mongodb.com

8

2.2 NoSQL database management systems

2.1.4 Geostatistics

Geostatistics is the use of statistics with a focus on spatial or spatiotemporal data. It is used in many

different industries, like mining (to predict probability distributions or ore in the soil), petroleum

geology, oceanography, agriculture [14], forestry and, of course, geography [13]. It can generally be

used by businesses and the military to plan logistics and by health organizations to predict the spread

of diseases [11].

Kriging

Kriging is widely used class of geostatistical methods that are used to interpolate values of arbitrary

data (such as temperature or elevation) of unobserved locations based on observations from nearby

areas.

These methods are based on Gaussian process regression, the value interpolation is modeled using

a Gaussian process that takes into account prior covariances, as opposed to piecewise-polynomials that

optimize smoothness of the fitted values. If assumptions on the priors are appropriate, kriging can

yield the best linear unbiased prediction of the interpolated values [13].

Kriging techniques, when used on processing clusters, benefit from good data locality to avoid the

slow network transfer of data about surrounding areas on which the computation is dependent.

As can be seen in fig. 2.2, kriging interpolation chooses more likely results instead of optimizing

for line smoothness.

Figure 2.2: An example of kriging interpolation (red full line). Also shown: smooth polynomial (blue dotted line) and Gaussianconfidence intervals (shaded area). Squares indicate existing data.

Source: Wikimedia Commons

2.2 NoSQL database management systems

NoSQL, or Not Only SQL, is a category of database management systems (DBMS) that differentiate

themselves from the more common SQL DBMSs, that use the relational model, by basing themselves

on non-relational or not only relational models. Although NoSQL is a somewhat vague term that also

includes niche categories like Graph DMBSs, it’s commonly used to refer to systems that focus on

simplifying the design to more easily achieve horizontal scaling and high availability, which usually

means dropping the relational model [2].

To achieve simpler scalability and/or availability, NoSQL DBMSs usually relax the consistency guar-

antee on data they store. This follows from the CAP theorem, which states that a distributed system

9

2. Related work

can only provide two of the following guarantees:

• Consistency: Operations on stored data should always be based on the latest write to the system

or return an error

• Availability: Requests should receive non-error responses, although not necessarily from the

latest write

• Partition tolerance: The system continues to function correctly even if parts of it become un-

reachable or messages get delayed

Since many NoSQL DBMSs prioritize horizontal scalability and high-availability, partition tolerance

and availability become hard requirements of these systems. Since all three guarantees can’t simul-

taneously be achieved, a common solution is to degrade consistency to what is known as eventualconsistency: After a write the system might become inconsistent for a short period of time but eventu-

ally achieves consistency as the changes propagate to all machines.

Sharding

Sharding is a method to achieve horizontal scaling of a database via data partitioning between sev-

eral machines. This allows data and load to be spread among the various machines, providing high-

throughput and supporting very large datasets, bigger than what could fit on a single server.

Because of the focus on horizontal scalability these systems are useful for use with very large

datasets (that don’t fit in a single machine’s memory) or loads bigger than one machine can easily

handle by itself. They’re therefore good candidates for computational analysis with Geographical

Information Systems (GIS) or other use cases where consistency isn’t a hard requirement and capacity

and performance are priorities.

Some well-known examples of NoSQL DBMSs are MongoDB (used by several big tech companies

such as Google, Facebook and ebay), Cassandra (created and used by Facebook), DynamoDB (part of

Amazon Web Services) and BigTable (part of the Google Cloud platform).

2.3 Geographically aware sharding algorithms

In this section a quick overview of the relevant algorithms for geographically aware sharding is given.

An in-depth analysis of these algorithms can be seen on the work of Joao Graca in his Master’s thesis

[6], for which this work is a follow-up to.

Geohashing (Z-curve)

Geohashing 1 is a hierarchical geocoding system that subdivides the space into several equal-sized

grid-shaped regions that can be assigned to different machines. A Z-curve is used to reduce the dimen-

sionality of longitude/latitude coordinates by mapping them to a location along the curve, with closely

located coordinates being mapped to [6].

Some advantages of Geohashing are its simplicity and easily tunable precision (by truncating the

code). Z-curves are, however, not very efficient for certain types of queries (like range-queries) since

points on either end of a Z’s diagonal seem to be closely located, which is not the case. Since this

method involves projecting the spherical earth into a plane, points that are close by may end up on

opposite sides of the plane and appear far away to the curve.

Figure 2.3 shows a map of the Earth partitioned by a Z-curve.

1not to be confused with xkcd 426

10

2.3 Geographically aware sharding algorithms

Figure 2.3: A Z-curve used to partition the Earth into a 4x4 grid

Quadtree hashing

A Quadtree is a tree data structure that recursively subdivides a two-dimensional space into four

quadrants, as shown on fig. 2.4. The subdivision occurs per cell as its capacity (which can be an

arbitrary function) is reached, allowing the tree to be more detailed where appropriate [6].

Figure 2.4: The quadtree data structure

Quadtree hashing allows large empty sections to be quickly pruned when querying.

GeoSharding (Voronoi partitioning)

GeoSharding is the name of the policy used in this work, proposed by [6]. Based on Voronoi partition-

ing, which is a method for dividing a space into several non-uniform regions based on distance to a

given set of points in that space. The number of regions is equal to the number of points in the given

set as each point gets assigned the region of the space that no other point is closer to. By computing

the boundaries between regions a Voronoi diagram can be drawn, such as the spherical one shown on

fig.2.5, which has the benefit of being easily understandable visually.

11

2. Related work

Figure 2.5: A spherical Voronoi diagram of the world’s capitals

Source: rendered using jasondavies.com/maps/voronoi/capitals/

Unlike GeoHashing, Voronoi partitioning doesn’t suffer from the same disadvantages: points that

are close to each other, regardless of where they are on the map, are correctly considered close and

vice-versa.

GeoSharding is a very versatile and efficient space partitioning policy, as more points can be added

where more detail/capacity is required and the lack of regularity in the partitions.

A big drawback for GeoSharding can be it’s complexity in implementation for range or similar

queries. In terms of computational complexity, although computing the Voronoi diagram can be slower

than some alternatives, it can pre-computed and cached when changes to the cluster configuration are

made, so that the impact is minimal when actual queries are performed.

2.4 Existing solutions

In this section some viable existing solutions for either storage and/or computation of data are pre-

sented.

2.4.1 Spatial Hadoop

Hadoop is an open-source framework for distributed storage and computing of large scale datasets.

It’s one of the most mature and popular solutions for Big Data processing, it appeared at a time when

large scale computation was still done with powerful computers [10].

Hadoop’s design is based on taking advantage locality, it gets around the problem of moving very

large datasets to the processing nodes by instead moving the (small) processing software to storage

nodes and executing it there, in parallel. This design has two major parts: the Hadoop Distributed

File System (HDFS) and the MapReduce programming model, which allows the programmer to easily

write massively parallel programs at the cost of flexibility.

The Hadoop Distributed File System is capable of storing very large files, even up to the range

of terabytes, with horizontal scalability and high-availability via data redundancy. These files are as

flexible as regular files, they can be structured, unstructured, json, images, etc.

The recommended use of Hadoop is large-scale data analysis or other types of long-running compu-

tations that don’t need to be done in real-time, since Hadoop does not support streaming computation

as new data comes in.

12

jasondavies.com/maps/voronoi/capitals/

2.4 Existing solutions

Spatial Hadoop [5] extends Hadoop with spatial capabilities: data types, indexes, operations and a

new high-level language called Pigeon for use with spatial data. The use of a dedicated language can

be seen as downside since it makes existing Hadoop programs harder to reuse and might make spa-

tial programs incompatible with other Hadoop extensions. Experienced Hadoop programmers might

appreciate the high-level nature of the language and its familiar MapReduce paradigm, as well as the

several tutorials are available on the project’s website.

2.4.2 GeoSpark

GeoSpark [17] is a recent and promising project built on the popular Apache Spark. First released

in July 2015 and still currently in an experimental stage as of writing, it’s an open-source cluster

computing system designed for large-scale spatial data processing. GeoSpark inherits many of it’s

characteristics from Spark.

Compared to Hadoop, Spark provides parallel, in-memory processing, whereas traditional Hadoop

is focused on MapReduce and distributed storage.

Spark doesn’t have a native data storage solution, so it’s often used with the Hadoop distributed

file-system or other compatible data-stores like MongoDB or Cassandra, making it’s configuration more

complex than other alternatives.

The main advantage of Spark (and GeoSpark) is it’s performance, being up to 100 times faster

than Spatial Hadoop’s MapReduce [10] which is, in part, a consequence of its much more aggressive

memory usage for it’s in-memory computation. The heavy memory usage makes Spark a very costly

option since RAM capacity is much more limited and expensive than SSD or HDD capacity.

GeoSpark adds spatial partitioning (including Voronoi diagram) and spatial indexing to Spark along

with geometrical operations, most notably range queries for searching inside regions of space [18].

Since spatial/geometrical support isn’t native to Spark but is added by GeoSpark, the API for these

operations is GeoSpark specific, meaning that programs have to be re-written to take advantage of

these features. Developing for GeoSpark is complicated not only by the underlying system’s complexity,

Spark, but also some consequences of it’s experimental status, like the lack of comprehensive and

organized documentation for it’s API.

2.4.3 Cassandra

Cassandra [8] is an open-source NoSQL database designed for use with multiple, geographically dis-

tributed, data-centers, with high availability and large total throughput at the cost of extra read and

write latency.

It’s what’s knows as a wide column store, data is stored in tables with a big number of columns of

different data-types. While it may seem similar to relational DBMS’s, Cassandra does not support join

operations or nested queries.

Cassandra’s recommended use is as a storage back-end to online services, with large number of

concurrent interaction, possibly in different geographical locations, with small data involved in each

request. The lack of transaction support makes Cassandra unsuitable for applications that require

atomic and durable write operations.

It does not support native server-side computation but it can work as a storage back-end for

Hadoop, replacing the Hadoop distributed file system [2].

2.4.4 MongoDB

MongoDB is a very popular open-source NoSQL DBMS that was designed to be very easily scal-

able. To achieve simple horizontal scalability it avoids the complexity of Relational DBMS, lowers

13

2. Related work

the consistency to eventual consistency and provides built-in support for sharding, replication and

load-balancing.

It’s a document database, meaning that instead of rows in tables it has JSON documents in no-

schema collections. The JSON format allows for the serialization of common programming data-types

like maps, strings, numbers and arrays. A simple JSON document example is given on fig.2.6.

Figure 2.6: JSON document example


High Availability (HA) is an important characteristic of MongoDB, data replication and automatic

fail-over are built-in and easy to set-up by using replica sets.

MongoDB supports server-side distributed computation via MapReduce and also an abstraction of

MapReduce called Aggregation Pipeline, which aims to be a simpler way of using MapReduce. It sup-

ports other types of server-side computation that exist mostly as a convenience for small computations

like stored procedures, since they are not distributed.

Overall MongoDB is mostly designed for low-latency storage and querying but is also quite flexible,

with support for server-side computation, good scaling mechanisms and simple replication [2].

2.5 MongoDB (as of version 3.4)

In this section an overview of the relevant MongoDB features and characteristics is given.

2.5.1 Spatial data support

MongoDB has a native API for spatial data, which includes special queries for this data types and

indexing support. It can handle both 2D plane data and 2D sphere, which represents the surface of

the Earth and is the concern of this work. MongoDB can not only understand point data but also more

complex geometry, such as lines and polygons.

GeoJSON

MongoDB uses GeoJSON as the format for spatial data. It’s essentially a JSON that includes a special

field with the geometry related information of the object: it’s type and coordinates.

Each GeoJSON object can be one of the following types: Point, LineString, Polygon, MultiPoint,

MultiLineString, and MultiPolygon. The coordinates field can vary in shape depending on the type, for

a simple point it’s just a JSON array with the longitude and latitude of the object, in that order.

2.5.2 The layout of a MongoDB cluster

MongoDB sharding clusters are built from 3 different components:

• Router: Provides the interface between clients and the cluster and acts as a load balancer for the

shards in the cluster

• Config server: stores the meta-data and configuration of the cluster, like what data partition each

shard owns

14


• Shard: Stores a subset of the sharded database data

For high availability multiple router’s can be used and Config servers and Shards can be set-up as

replica sets, described in 2.5.4. A standard highly-available 2 shard cluster architecture is shown on

fig.2.7.

Figure 2.7: MongoDB Sharded Cluster Architecture


2.5.3 Sharding

Sharding is a method to achieve horizontal scaling of a database via data partitioning between several

machines. This allows the load to be spread among the various machines, providing high-throughput

and supporting bigger datasets than can fit on a single server.

When coupled with the use of MongoDB’s router processes sharding can become transparent to the

client.

MongoDB’s sharding mechanism works per collection and uses shard keys partition the data into

several parts called chunks, which are then distributed between the various shards of the cluster.

Shard Keys

A Shard key is an immutable field/fields selected by the user when sharding a collection and should

exist in all documents of that collection. MongoDB requires that each unique shard key value maps to

only one chunk, so that routers can contact only the responsible shard when routing queries.

Figure 2.8: Partition of the dataset into chunks based on shard key values


15

2. Related work

Performance and efficiency of a sharded cluster is affected by the choice of shard key which should

ideally distribute data evenly among all shards in the cluster.

A good shard key should have:

• Many possible values (high cardinality): If a shard key has only 4 possible values there can’t be

more than 4 chunks, limiting the horizontal scalability of the collection

• High frequency in the dataset: If only a small subset of shard key values appear in the dataset

then chunks that contain those values may limit throughput as most of the dataset will be served

from a minority of the existing chunks, as shown in fig.2.9a

• Non-monotonic behaviour: If a shard key is a field that changes monotonically (like a counter)

inserts are very likely to target a single shard at a time, as shown in fig.2.9b

(a) Low frequency shard key (b) Monotonically increasing shard key

Figure 2.9: Examples of bad shard key properties


Multiple fields in a compound index can be used as a shard key to attain these beneficial properties

if no single field is an ideal choice.

Besides the standard sharding method (called ranged sharding), hashed sharding is the only other

supported method. Hashed sharding can be used to avoid monotonic behavior. In this method the

shard key must be a single field which is passed through an hash function to get a pseudo-random

output with a uniform distribution. An unfortunate side effect of hashed sharding is that documents

with close shard keys won’t end up in the same chunks but that might not matter depending on the

meaning of the chosen field.

Chunks

Chunks are continuous ranges of the shard key’s key space that associated with a specific shard. Chunk

ranges cannot overlap each other and must collectively occupy the whole key space, to guarantee that

each shard key value is always mapped to a single chunk.

To prevent the data distribution throughout the cluster from becoming too unbalanced, the con-

fig server periodically runs a background process that migrates chunks to different shards called the

balancer and also uses a mechanism called chunk splitting.

A chunk splitting is a process where a big chunk (a chunk with a lot of documents in it’s range) gets

split into two smaller chunks with a similar number of documents and that together cover the same

key range as the original chunk. Chunk splitting does not cause migrations and happens when a chunk

goes over the chunk size (64MiB by default), a value that can be set by the user, with smaller chunk

sizes increasing granularity which allows for better data distribution at the cost of more migration

overhead (chunks moving between shards more often). If a chunk’s range is already a single value of

the key space it becomes indivisible and can become a performance bottleneck.

16


2.5.3.A Zones

Zones can be used to get more manual control of chunk migration, by allowing the user to choose

specific shard key ranges and assign them to a shard or a group of shards. The balancer prioritizes

zones before other metrics when deciding where to migrate a chunk to. There’s no requirement of

covering the whole key space with zones, so the user can assign zones to only a subset of it.

Some use cases for this feature are to route data based on shard performance or, in larger systems,

to store data based on the location of data centers. An example of the second use case would be to add

a field to each document with the continent where that data is relevant, like where a user is located,

and then assign the various values of that field to different sharding zones, each going to a different

data center.

Indexes

Indexes are required when sharding, more specifically: When sharding on a certain key a index for

that same key must exist first. Indexes exist per-collection and are essentially sorted lists of key values

that exist in the collection, with pointers to the documents that contain those values. Indexes allow

certain queries to happen without requiring collection scans, which are very expensive, as well as

sorted results at little cost by using the sorted nature of the index (an example of this is shown on fig.

2.10).

Figure 2.10: An index being used to return sorted results


2.5.4 Replica sets

Replica sets are small clusters of MongoDB daemon instances (Config server or Shards) that store the

same data set. A replica set has one primary instance that can both respond to reads and writes and

one or more secondary instances that only respond to reads (but by default don’t do so). A simple

replica set with architecture with two secondary instances is shown on fig.2.11.

17

2. Related work

Figure 2.11: Replica set architecture


Changes to the primary’s dataset are propagated asynchronously to the secondary instances, mean-

ing that after a write is confirmed to a client reads on a secondary instance might still return old data

for a short amount of time. This eventual consistency allows for better write performance by not wait-

ing for the secondaries before confirming writes and is the reason that secondary instances by default

don’t respond to read queries. Clients have the option of requesting that a write is acknowledged by

at least n instances before being confirmed, to ensure write durability at the cost of performance.

In case of a failure of the primary (detected by a periodic heartbeat connection) the failover process

consists in a vote takes place and one of the secondary instances is promoted to primary, as shown on

fig.2.12.

Figure 2.12: Replica set failover process


2.5.5 Storage engines

MongoDB (as of version 3.0) supports multiple different storage engines which can add new capabil-

ities or be optimized for certain hardware, making MongoDB a very versatile DBMS. This is done via

an exposed API, with the following ones being shipped with MongoDB:

• WiredTiger storage engine: the default, suited for most use cases. Includes native compression

and concurrency control

18


• Encrypted storage engine: very similar to WiredTiger but including support for database encryp-

tion

• In-Memory storage engine: stores data in memory to provide low and predictable latency

19

3GeoMongo

Contents3.1 GeoMongo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

21

3. GeoMongo

3.1 GeoMongo

GeoMongo is the name chosen for this work. It aims to make MongoDB capable of geographically

aware sharding by using the GeoSharding partitioning policy discussed in section 2.3 and study the

usefulness of this feature.

Some knowledge of MongoDB’s internals is required to understand the reasoning behind or the

necessity of some changes that have been made. For this reason section 3.1.1 gives an overview

of MongoDB’s internals specific to sharding and any extra details that are considered relevant are

mentioned alongside the overview of the changes in section 3.1.2.

3.1.1 Internals of MongoDB’s sharding

Routing table

The routing table is a map of where a collection’s data is located, more specifically it’s a map of chunks

to shards.

Router (mongos)

The mongos process acts as the router/load-balancer for the cluster. It caches the routing table and

shard list in memory and has no persistent state.

The router also acts as a proxy between clients and the database cluster, so that clients can transpar-

ently access data without knowing the cluster configuration. A disadvantage of this is that the router’s

network connection can become limit, but luckily MongoDB is designed to support many different

routers being used simultaneously.

Shards (mongod)

Shards are mongod instances in its default operation mode (or a replica set of them). They also cache

the routing table and shard list to be able to perform certain cluster operations like chunk migrations

or sharded mapReduce.

Config server (mongod)

A mongod process can be configured as a config server, which is responsible for storing the cluster’s

metadata, like the routing table. Although the other server types cache some of the metadata the

config server acts as the single source of truth and is the only server type that stores the routing table

on disk.

Chunk and shard versions

Chunk versions are used as a way detect stale routing table caches. Each new chunk of a collection

must have a higher version than the previous ones and chunk migrations increment the version of

migrated chunks. Shards have a different shard version for each collection and which is simply the

highest version among chunks of the collection that it owns.

Whenever a request is sent to a shard, the request must include the shard version it thinks the shard

is currently at. The shard compares its local version with the sent version and updates its cache if the

sent version is higher. If the sent version is lower, an error is returned telling the sender to update its

cache.

22

3.1 GeoMongo

Chunks

Each chunk has a few important pieces of information:

• Minimum and maximum keys: the values that define the chunk’s range. From the minimum

(inclusive) to the maximum (exclusive)

• Version: chunk version, which includes a time-stamp

• Shard: the shard that owns the chunk

The use of minimum and maximum keys as an integral part of a chunk implies that sharding should

always be done by splitting up along a single dimension. Dimensionality reduction methods (like Z-

curves) can be used to shard spaces with multiple dimensions but a better solution would be to use a

more general approach that could handle multiple-dimension spaces defined by the policies.

Balancer

The balancer is a routine that runs periodically in the config server, checking for the data distribution

among the shards, and can begin chunk migrations between shards if it detects that one shard is

under/over utilized.

Chunk splitting

Every time an insert is made, the Router handling the request checks if the chunk goes over the chunk

size limit and, if so, issues a chunk split command to the shard that owns chunk.

Once a chunk is split there is no way to reduce the number of chunks as there is no chunk merging

capability.

Request flow in a router

Whenever the router gets a request a new thread is created to handle the connection. The request is

then interpreted in order to prepare the resources needed to process it and async connections to config

servers or shards are created, if required. After these connections are finished the result is assembled

and sent to the client. Note that the router doesn’t just point the client to the relevant shards, it makes

the connections itself so that the cluster configuration is transparent to the client.

Shard 1(mongod)

Shard 2(mongod)

Router(mongos)

Cluster operationsConfig server

(mongod)

Stores the actual data. For sharded collections, stores

only a subset of themAlso caches sharding

information.

Single source of truth about where data is in the cluster:

chunks, chunk ranges, shards, collections, etc

Caches sharding information, like

chunks, but otherwise has no permanent

data

Figure 3.1: A visual summary of what data is stored in each type of machine in a MongoDB cluster

23

3. GeoMongo

3.1.2 GeoMongo Implementation

Most of the time spent on implementing GeoMongo was to get an understanding of the relevant code

for sharding in MongoDB, which is a very large project with over 5000 files and over 1.2 million lines

of source code.

The MongoDB’s code-base has some internal assumptions on the properties of sharding keys and

chunks that are broken by the GeoSharding partition policy. This lead to a much more complex and

time-consuming implementation than originally predicted.

There are five major code sections that had to be changed to add support for a new partition policy:

• User interface for sharding

• Chunk creation

• Sharding initialization

• Write operations

• Read operations

There are also several other minor code sections that required updates, however most of these are

very similar and are either related to chunk validation or assumptions about chunk object structure

and not worth mentioning.

In the following sections the major changes are discussed.

User interface

From the user’s perspective not much changed. Two new commands were added to the client shell

and the Router process, a command to shard a collection with the GeoSharding policy and one to

add/change chunks locations or assigned shards.

The user interface remains very similar to btree and hashed sharding with the difference that the

user gets manual control over chunk ranges and what shards they’re assigned to. Although the choice

of shard can be automated via the balancer, dynamically choosing good locations to assign to each

chunk is a complex problem that should be studied in a future work.

Since MongoDB has native support for geographic queries, like geoNear, the user can possibly get

immediate performance benefits without changing existing application code, just by choosing a new

sharding policy.

Chunk creation

The chunk creation process is different for geographical chunks but also simpler in it’s current form, as

the chunk ranges (the Voronoi diagram) aren’t calculated at creation time, which could be a valuable

optimization.

Apart from verifying if a chunk with the same position already exists, creating a geographical chunk

involves using a higher version than the last created chunk, storing longitude and latitude instead of

min and max, setting the chunk’s name-space (what database and collection it belongs to) and what

shard it will initially be assigned to. In practical terms this boils down to creating a document in

a special collection that is stored on the config server, representing the chunk. These documents

can be fetched and/or updated by other routers or shards when stale caches are detected or chunk

modifications are made.

Unlike what’s done for the standard btree and hashed sharding mechanisms, automatically select-

ing uniformly distributed locations for chunks when setting up sharding on a collection isn’t desirable,

since expecting a dataset to be uniformly distributed over the entire sphere is unreasonable.

24

3.1 GeoMongo

There is, however, an opportunity for automatic adjustment of chunk locations after the dataset is

loaded or dynamically as it grows by perhaps using statistical analysis or machine learning algorithms.

Sharding initialization

When initializing sharding for an empty collection MongoDB still expects at least one chunk to exist

at the end of the process. Either a default location can be chosen, which would better fit the existing

interface of MongoDB, or the user can be asked for the location of the first chunk.

The implemented solution uses a default location that can later be changed by the user, since this

fits the existing interface and can transparently support a possible future work that dynamically selects

new locations for chunks as the database is used.

Write operations

When writing to the database, whether it’s a single document or a batch of multiple documents, each

document’s shard key goes through a method of the ChunkManager object, called findIntersect-

ingChunk, that is responsible for mapping the document’s shard key to the correct chunk. Thus the

necessary changes here involved adding a new code path, for when the shard key is of type 2dsphere,

that extracts the coordinates and iterates through the existing chunks to find the closest one. A new

method, findNearestGeoChunk, was created for this since it can be reused for reading queries. That

method is described in detail in GeoSharding routing, at the end of this section.

Note that depending on the database configuration documents might not be required to have a

shard key, in which case they get placed into the primary shard.

Read operations

Reading operations can be made up of multiple filters.

Read operations, unlike write operations, aren’t handled document by document since there’s no

way to know how many documents will be involved in a query before it’s executed. This means that

multiple chunks and shards may need to be contacted in order to execute the request. ChunkMan-

ager objects have a method, getShardIdsForQuery, responsible for interpreting the query’s filters and

returning a list of shards that may contain documents that fit the filters.

In the case of GeoSharding the changes are similar to what was done for write operations: A new

code path was added for when one of the filters included coordinates to search for. The same process

of matching the relevant chunk is done and the respective shard is returned.

When sharding is enabled an additional filter is added to every read query that instructs shards that

are contacted to filter out any documents that are found but don’t belong to chunks they own, since

during/after a migration documents might still exist on the old shard for a while.

If none of the filters are related to a document’s coordinates all shards have to be contacted.

Minor changes

There were also other, smaller, changes necessary for GeoMongo to work:

• Chunk ranges: MongoDB includes a mechanism for btree and hashed sharding to cache chunk

ranges. In it’s current form GeoSharding makes no use of this feature but it could be very useful

for range based queries, allowing the Voronoi diagram to be computed at chunk insertion/modi-

fication time rather than query time.

• Chunk ID’s: Chunk’s were previously identified by their minimum key, since the complete shard

key space had to be assigned and chunk ranges couldn’t overlap. Longitude and latitude don’t

have the same characteristics so both are needed to identify a chunk.

25

3. GeoMongo

• Chunk validation: chunks ranges and min/max values are validated in all server types and several

parts of the code-base. These checks range from verifying that the minimum key is smaller than

the maximum key to verifying that the whole key space is accounted for with no overlaps. For

GeoSharding the checks are simpler: make sure that the longitude and latitude values are valid

and prevent multiple chunks having the same location.

• Chunk splitting: Chunk splitting was disabled for GeoSharding since, as stated before, choosing

a location for new chunks can’t be trivially automated. Furthermore, the policy dictates that each

shard should own a single region of space and there’s no clear benefit to allowing more at the

cost of extra implementation complexity.

Internal changes

Like mentioned before, the GeoSharding policy breaks several of the assumptions that MongoDB makes

with regards to sharding keys and chunks, the biggest one being that it’s not 1-dimensional.

A new shard key type was created, 2dsphere, matching the key type used for geographical indexes.

Unlike other shard key types, for GeoSharding 2 values are needed, the coordinates.

Most of the changes that had to be made were, unsurprisingly, to code that handles chunks.

Because MongoDB is designed with 1-dimensional sharding in mind, chunks include a minimum

and maximum key. Since GeoSharding is 2-dimensional policy based on locations the minimum and

maximum keys have no meaning, so when using GeoSharding the chunk stores it’s virtual longitude

and latitude instead. This change affected many parts of the code-base and all three types of server

because the minimum and maximum keys are assumed to exist and are validated in several different

components responsible for the sharding mechanism, instead of relying on an abstraction for chunk

identification and validation.

Chunk creation, validation and chunk ID’s also required updates to remove limitations imposed by

the lack of an abstraction over chunk identification and validation.

The auto-splitting feature is disabled under GeoSharding because it was outside the scope of this

work to dynamically create and choose locations for chunks, which would have been required to

support the feature.

GeoSharding routing

When inserting data or making queries the router calculates what shard it should forward the request

to. While it is conceptually simple (select the shard that owns the chunk closest to the relevant doc-

ument) computational precision must be taken into account. Since the number of chunks should be

equal to the number of shards, and therefore very small, simply iterating over the chunks and calcu-

lating the great-circle distance to the document’s coordinates is expected to have a negligible impact

in performance.

26

3.1 GeoMongo

λ

ϕ

P

ΔσQ

Figure 3.2: Great-circle distance, ∆σ , between two points, P and Q. Also shown: λp and ϕp

Source: Wikimedia commons

Vincenty’s formula was used to compute the great-circle distance, as it’s more accurate than com-

mon alternative, the Haversine formula, and is computationally inexpensive [16]. Although the Earth

is not a sphere but an oblate spheroid and Vincenty’s formula still has a small amount of error, these

details aren’t problematic since precision isn’t crucial for GeoSharding, as it only affects the locations

of chunk boundaries, points that are close still end up in neighboring chunks. Performance is a concern

because this calculation is done multiple times for each write or read on the collection but Vincenty’s

formula is a relatively simple computation and should have a negligible impact.

Vincenty’s formula for a sphere (which the Earth closely approximates) is shown on 3.1, where

λp and λq are the longitudes of two points, P and Q, and ϕp and ϕq are the latitudes, ∆λ and ∆ϕ are

the differences in longitude and latitude between the two points and ∆σ is the great-circle distance

between them. Some of these angles are depicted in fig. 3.2.

∆σ = arctan

√(cosϕq · sin(∆λ)

)2+

(cosϕp · sinϕq − sinϕp · cosϕq · cos(∆λ)

)2

sinϕp · sinϕq + cosϕp · cosϕq · cos(∆λ)(3.1)

For computation the function atan2(y,x) should be used in place of arctan(y/x) because it can

calculate the quadrant of the computed angle and return valid results when x = 0, as it receives the

arguments separately and can thus know the values and signs of x and y.

In the case of ties when computing the closest chunk the one with the smallest Id is chosen.

Range based queries and similar

Unfortunately GeoMongo does not have optimized range queries (in it’s current form, GeoMongo’s

range queries send requests to all shards), which would be able to take full advantage of the data

distribution of GeoSharding. This is because range based queries (and similar, like intersection) ag-

gregate data points that are geographically close, which means that with GeoSharding there’s a good

probability that only a single (or very few) shards need to be contacted, lowering the resource impact

of the query, which means higher capacity for the cluster to handle other queries simultaneously [7].

If these queries had been optimized, this would be the order of steps:

1. When a chunk is added/modified, re-compute the Voronoi diagram of the chunk ranges. More

specifically, compute the boundaries (edges) of the ranges using Delaunay triangulation, which

27

3. GeoMongo

can be done in O(nloдn) time (n being the number of regions) with the divide and conquer

algorithm by Leach [9].

2. Cache the chunk range edges, to avoid unnecessary computation when running queries

3. When a range query is received, contact any chunks with a center within the range, as these are

definitely included in the search.

4. Compute the range boundaries that intersect the query’s search circle. The chunks to which these

boundaries belong contain a region within the search circle and should also be contacted

28

4Evaluation

Contents4.1 Performance evaluation methodology . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.2 Performance evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.3 Evaluation of MongoDB as a base for GeoMongo . . . . . . . . . . . . . . . . . . 39

29

4. Evaluation

In this chapter an evaluation of GeoMongo is done, both in terms of performance and implementa-

tion

4.1 Performance evaluation methodology

The performance testing was done on a cluster of 6 equivalent machines and a 7th, different, machine

connected on the same network was used as the client for all the benchmarks. In this section the

details about the machines and benchmarking methodology are given.

Cluster machines specifications

All the cluster machines were identical, both in terms of hardware, software, and network connection.

The relevant specifications for these machines:

• CPU: Intel Core i7-2600k @ 3.4GHz, a 4-core, hyper-threaded CPU with Sandy Bridge architec-

ture and 8MB of L3 cache.

• RAM: 12GB of DDR3 in dual-channel configuration.

• OS and Storage drive: 1TB, 7200RPM Western Digital Blue hard disk drive.

The machines were running Ubuntu 14.04 - Trusty Tahr, a long term support version of Ubuntu.

Client machine specifications

All the cluster machines were identical, both in terms of hardware, software, and network connection.

The relevant specifications for these machines:

• CPU: Intel Xeon E3-1230v5 @ 3.4GHz, a 4-core, hyper-threaded CPU with Skylake architecture

and 8MB of L3 cache.

• RAM: 16GB of DDR4 in dual-channel configuration.

• OS and Storage drive: 1TB, Samsung 850 Pro solid state drive.

The client machine was running Ubuntu 16.04 - Xenial Xerus, a long term support version of

Ubuntu.

Network specifications

All 7 machines were connected on the same local area network via a 1Gb/s switch. During all the

test runs, with the exception of reads for documents with images, the total throughput of any of the

machines never reached the network’s Gigabit limit so there’s no reason to believe it was a bottleneck

on those tests. More details about the exception are given in the test results discussion.

Benchmark software

Two different scripts were made to serve as benchmarking software. The scripts were written in

Python3 and used the PyMongo driver library to interact with the MongoDB router. Pool threads were

used to simulate multiple concurrent clients.

Because CPython uses a Global Interpreter Lock (GIL) to prevent memory corruption from data

races, threads can’t run in parallel, only interleave their execution (unless multiple processes are used).

While this prevents parallel computation, it’s not an issue for network-bound programs that block on

I/O, since threads quickly switch between each other as they block waiting for request responses.

30

4.1 Performance evaluation methodology

With both reads and writes, the insert/query order was randomized for each run, to avoid impacting

the machines in order and thus reducing performance to a single machine at a time.

The scripts can be seen in appendix A (write benchmark) and appendix C (read benchmark).

Cluster configurations tested

Three different configurations were tested for the cluster: 1, 2 and 4 shards. Using no more than 4

shards allowed the router, config server and client to live on their own machines and avoid any load

from these processes to interfere with the performance of the shard processes.

Shard 1(mongod)

Shard 2(mongod)

Shard 3(mongod)

Shard 4(mongod)

Router(mongos)

Cluster operationsConfig server

(mongod)

Figure 4.1: Architecture of the benchmarking cluster (with varying numbers of shards)

Policies tested

GeoSharding was of course tested. For all tests with more than one shard, the chunk locations were

handpicked to achieve a near uniform distribution between all shards. Other than time-constraints, this

decision was made because the data distribution in GeoSharding depends completely on the locations

chosen for the chunks, and so they should picked at least somewhat competently or be dynamically

calculated (unfortunately this is out of the scope of this work). The tests done with a single shard

should be nearly equivalent to the worst case scenario, where all the data gets allocated to a single

chunk.

Besides GeoSharding, Hashed sharding was also tested as a means of comparison. Hashed sharding

was chosen as it’s the most adequate policy when a context-aware one isn’t available, and MongoDB

does not have a geography-aware sharding policy. The sharding key used for Hashed sharding was the

auto-generated document id since each document gets a unique one.

Dataset used

The performance evaluation was made using a real-world dataset of underground water resources

in the United States of America. The dataset is comprised of 4648 entries distributed throughout

the country’s territory, including Alaska and islands in the Pacific Ocean like Guam but mostly in the

mainland. Figure 4.2 shows that the distribution isn’t uniform even on mainland USA.

31

4. Evaluation

Figure 4.2: Underwater resources of USA (mainland region shown)

Two variations of the dataset were tested: the raw dataset, where each document is very small (on

the order of 100 bytes) and a variation where each document had a 500KB image attached to it, to

compare the effects of document.

The dataset can be downloaded from 1. An example of a document from the dataset can be seen in

appendix E.

Benchmark execution

Each test was repeated three times and it’s results were averaged. The variation between each run

was small so little information is lost by averaging the results. The complete results are included in

appendix B (writes) and appendix D (reads).

Write performance test consisted of writing the whole dataset, randomly ordered, to the cluster.

Different amounts of threads (1, 2, 4, 16 and 64) were used to simulate multiple concurrent clients.

The database was dropped between each run of a write performance test.

For read performance tests, the dataset was loaded once at the start and then several runs of

reads were performed, with different amounts of concurrent threads to simulate multiple clients. The

performance for reads seemed to scale much more for write performance test: 1, 2, 4, 16 and 64.

4.2 Performance evaluation results

In this section the benchmark results are presented and discussed. As said before, two versions of

the dataset were used, with remarkably different, but expected, outcomes: a text-only version and a

version where every document had a moderately sized image attached to it. The same benchmarks

were used for both versions.

4.2.1 Attached image documents

For this set of tests an image of roughly 0.5MB (473.35KB) was attached to each document, bringing

the total size of the dataset to about 2.4GB. Because of the size of each document the tests became

network/communication driven and the difference between GeoSharding and hashed sharding was

hidden by the transfer time of each document.

1https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels

32

https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels


Read performance

0

20

40

60

80

100

120

140

160

180

1 2 4 16 64

TIM

E TO

REA

D T

HE

DA

TASE

T (S

)

NUMBER OF CONCURRENT READS

Hashed Sharding GeoSharding

(a)

0

20

40

60

80

100

120

140

160

180

1 2 4 16 64

TIM

E TO

REA

D T

HE

DA

TASE

T (S

)



(b)

0

20

40

60

80

100

120

140

160

180

1 2 4 16 64

TIM

E TO

REA

D T

HE

DA

TASE

T (S

)



(c)

Figure 4.3: Read performance of documents with attached image with: a) 1 shard, b) 2 shards and c) 4 shards

Analyzing fig. 4.3, both policies had very similar results, performing equally within margin of error.

Because of how large each document was, the test became communication driven: the transfer time of

each document was much larger than the time taken to find it, which significantly reduced the impact

of the sharding policy.

0

5

10

15

20

25

30

1 2 4

TIM

E TO

REA

D T

HE

DA

TASE

T 1

0 T

IMES

(S)

NUMBER OF SHARDS


Figure 4.4: Horizontal scaling of read performance of documents with attached image, measured with 64 client threads

33

4. Evaluation

In fig. 4.4 the horizontal scaling for reads on both policies can be seen. There is very little improve-

ment, which seems to be a consequence of even 1 shard being able to get very close to theoretical

maximum throughput of the network connection: at 1Gb/s, 2.5GB of data should take 20 seconds

to transfer. 80% of that throughput (800Mb/s for a running time of 25 seconds) is achieved with a

single shard so by adding any additional machines one can only expect at most a 20% improvement

in this test. Unfortunately the equipment needed to remove this bottleneck was not available but as

the running time was dominated by document transfer it’s unlikely that the two policies would show a

significant difference between each other.

Write performance

0

20

40

60

80

100

120

140

160

1 2 4 16 64

TIM

E TO

INSE

RT

THE

DA

TASE

T (S

)

NUMBER OF CONCURRENT INSERTIONS


(a)

0

20

40

60

80

100

120

140

160

1 2 4 16 64

TIM

E TO

INSE

RT

THE

DA

TASE

T (S

)



(b)

0

20

40

60

80

100

120

140

160

1 2 4 16 64

TIM

E TO

INSE

RT

THE

DA

TASE

T (S

)



(c)

Figure 4.5: Write performance of documents with attached image with: a) 1 shard, b) 2 shards and c) 4 shards

Writing larger documents is the only test in this evaluation where increasing the number of threads

on the client has a very small performance benefit, as seen on fig. 4.5a. This points to a limitation

on the write performance of a single shard. Writing 2.5GB in 125 seconds means a throughput of

20MB/s, which is not unreasonable for a 7200RPM drive when not writing serially (as it’s writing

several thousand documents and not a single large file). The storage layer of MongoDB features

compression, which might also slow down write performance.

34


0

20

40

60

80

100

120

140

160

1 2 4TI

ME

TO IN

SER

T TH

E D

ATA

SET

(S)

NUMBER OF SHARDS


Figure 4.6: Horizontal scaling of write performance of documents with attached image, measured with 64 client threads

Horizontal scaling of writes was pretty much linear with the number of shards, as shown in fig. 4.6,

unlike with reads (which were limited by the network), writes seem to be directly limited by the write

capability of each shard, hence the linear scaling. With 4 shards the network throughput approached

the theoretical limit but with a big enough margin that it’s safe to assume it wasn’t limited by the

connection.

Performance was pretty much the same for both policies, which is expected on insertions and

even more in the case of larger documents, as the transfer time (and in this case, disk-write time)

overshadows their impact.

4.2.2 Text-only documents

The documents in the text-only version of the dataset were quite small, on the order of hundreds of

bytes. GeoSharding performed either on par or significantly better than hashed sharding in both reads

and writes.

Because the text-only version of the dataset is quite small, at roughly 0.5MB total, the running time

of some test runs took less than 1 second. To prevent this, the text-only benchmark was changed to

read the dataset 10 times, raising the running time of the fastest test runs to an acceptable level, where

start-up overheads and initial network latency should have negligible impact.

35

4. Evaluation

Read performance

0

200

400

600

800

1000

1200

1 2 4 16 64

TIM

E TO

REA

D T

HE

DA

TASE

T 1

0 T

IMES

(S)



(a)

0

200

400

600

800

1000

1200

1 2 4 16 64

TIM

E TO

REA

D T

HE

DA

TASE

T 1

0 T

IMES

(S)



(b)

0

200

400

600

800

1000

1200

1 2 4 16 64

TIM

E TO

REA

D T

HE

DA

TASE

T 1

0 T

IMES

(S)



(c)

Figure 4.7: Read performance of text-only documents with: a) 1 shard, b) 2 shards and c) 4 shards

As can be seen in fig. 4.7a multi-threading the client (or using multiple clients) can significantly

improve the read times, even with a single shard in the cluster. This is because MongoDB is capable of

opening multiple threads to respond to clients, parallelizing the data transfer and reducing the effect

of round trip latency between requests. Beyond 16 threads the improvement was small to insignificant,

however.

By default, MongoDB will accept up to 65536 simultaneous connections and the storage engine,

WiredTiger, has a limit of 128 concurrent simultaneous read/write operations. Thus the bottleneck

appears to lie elsewhere, perhaps in the 8 CPU cores of the machines, although the CPU’s didn’t seem

over-utilized during the test runs.

Still in fig. 4.7a, GeoSharding can be seen to perform on par or even slightly better than hashed

sharding, showing that GeoSharding routing does not introduce significant read overhead when com-

pared to hashed sharding.

On fig.4.7b, with 2 shards, GeoSharding shows significantly smaller running times, especially when

the client uses a large amount of threads. When using 64 threads GeoSharding is 3.1 times faster

than hashed sharding. This is likely because with hashed sharding the router, when given a pair of

coordinates, can’t calculate which shards might have documents with those coordinates and has to

contact both shards. Because there are multiple concurrent requests this can actually slow down the

36


test more than expected, as each request causes an unnecessary slowdown to other shards, impacting

other requests as well.

GeoSharding allows the router to contact only the relevant shard for each query, lowering the

running time for the query and minimizing the impact on other concurrent queries.

With 4 shards (fig. 4.7c) the same effect can be observed, this time with an even bigger difference,

with GeoSharding being 5.1 times faster than hashed sharding.

0

50

100

150

200

250

1 2 4

TIM

E TO

REA

D T

HE

DA

TASE

T 1

0 T

IMES

(S)

NUMBER OF SHARDS


Figure 4.8: Horizontal scaling of read performance of text-only documents, measured with 64 client threads

Fig. 4.8 shows the best case scenario for each policy and each number of shards, so that the

horizontal scaling capability of both policies can be analyzed. The advantage of GeoSharding over

hashed sharding is quite large, with GeoSharding even getting more than linear scaling: by moving

from 1 shard to 2, performance increases by a factor of 4.1, and then by a factor of 2.8 when going

to 4 shards.

Besides the reasons mentioned on the discussion of fig. 4.7, read locks might also explain the big

improvements in performance. With MongoDB, as with most DBMSs, reads and writes attempt to lock

the database to prevent other requests from using incomplete/incorrect document. MongoDB allows

multiple reads to share a single lock but there is still a performance impact, as locks are released with

their respective read queries and a new lock must be obtained for the next group of queries. When

using sharding on MongoDB read locks happen at the level of each shard, essentially multiplying the

number of possible concurrent query batches by the number of shards.

37

4. Evaluation

Write performance

0

0,2

0,4

0,6

0,8

1

1,2

1,4

1 2 4 16 64

TIM

E TO

INSE

RT

THE

DA

TASE

T (S

)



(a)

0

0,2

0,4

0,6

0,8

1

1,2

1,4

1 2 4 16 64

TIM

E TO

INSE

RT

THE

DA

TASE

T (S

)



(b)

0

0,2

0,4

0,6

0,8

1

1,2

1,4

1 2 4 16 64

TIM

E TO

INSE

RT

THE

DA

TASE

T (S

)



(c)

Figure 4.9: Write performance of text-only documents with: a) 1 shard, b) 2 shards and c) 4 shards

With writes (fig. 4.9) a similar trend can be seen in all cluster configurations in regards to threading

on the client: Performance steadily improves until somewhere between 4 and 16, probably 8 as that is

the number of threads on the CPU’s of all machines involved.

As expected, GeoSharding is very much on par with hashed sharding during insertions, since both

policies only contact one shard when inserting a document.

38

4.3 Evaluation of MongoDB as a base for GeoMongo

0

0,05

0,1

0,15

0,2

0,25

0,3

1 2 4TI

ME

TO IN

SER

T TH

E D

ATA

SET

(S)

NUMBER OF SHARDS


Figure 4.10: Horizontal scaling of write performance of text-only documents, measured with 64 client threads

With regards to horizontal scaling, in fig. 4.10, both policies behave similarly: a modest improve-

ment can be seen when going from 1 to 2 shards but there are no significant gains when going beyond

2 shards.

The bottleneck might be the router, since it acts as a proxy between the client and the cluster. The

default connection limit for this component is 10 thousand connections, however, so at least it doesn’t

seem to be a configuration issue.

Another possibility is that, with a total running time of less than a quarter of a second, the dominant

factors might simply be the initial connection latency and the time it takes to start the threads on the

server-side. Unfortunately the text-only dataset is small enough for this to be a possible issue.

Inserting the data-set multiple times was not a reasonable option because MongoDB’s storage layer

features compression, which might produce inconsistent results, as the insert order was randomized

for each test run.

4.2.3 Network usage

Network usage during the tests was observed but not accurately recorded. It was fairly consistent for

the duration of each test, except for the very small ramp-up and ramp-down times. The following

observations could be made:

• The 1Gb/s connections bottlenecked the tests on the bigger dataset.

• For text-only files the network was not a bottleneck, none of the machines had a total throughput

close to the Gigabit limit of the connection.

• The router’s role as a proxy was clear, as it’s network throughput in each direction was symmet-

rical to the sum of the rest of the machines’ throughput.

• The config server used very little bandwidth on all test runs, rarely reaching 15Kb/s and below

1Kb/s most of the time.

4.3 Evaluation of MongoDB as a base for GeoMongo

The decision to use MongoDB came with some unfortunate setbacks, due to lack of knowledge about

the project’s limitations. In this section the pros and cons of this decision are mentioned.

39

4. Evaluation

4.3.1 The pros

The native support for geographical queries on MongoDB meant that most of the client interface for

the work would already be in place. It also makes it possible for code already written for this API to

get a performance boost for free, by changing the cluster sharding configuration, which means that

existing libraries like PyMongo (the Python3 client for MongoDB) and many others would be able to

take advantage of the new policy from day 1.

Another big advantage for basing the work on MongoDB was the existence of geographical in-

dex support. Indexes can be very important to query performance by avoiding collection scans and

should come before sharding when attempting to improve query performance, since sharding makes

the system more complex and comes with added costs for hardware. With both existing API for geo-

graphical operations and geographical indexes, geographic sharding was the next logical step in terms

of geographical data support for MongoDB.

Besides the technical aspects, MongoDB’s excellent user documentation was of great help when

trying to setup a cluster, writing the benchmarks and also understanding the impact of various things

on the performance of a MongoDB cluster.

4.3.2 The cons

The biggest issues were the sheer size and complexity of MongoDB’s code-base, with over 5000 files

of source code, and the very limited documentation about the code-base. Because of this, getting a

decent understanding of the source code organization took a long time, which delayed GeoMongo’s

implementation much more than expected.

The way sharding is implemented in MongoDB assumes too much about partition policies, with

many types of checks and validations in several parts of the code-base that have to be worked around

when implementing a policy that does not fit those assumptions. A more general implementation, using

the object-oriented features of C++, could allow for different policies to define their own validation

rules for chunks and chunk ranges. It also has some other limitations, like forbidding the assignment

of a document to more than one chunk, that may make some of the future work suggestions hard to

implement.

In terms of server-side computation MongoDB was a letdown. Although MongoDB supports several

options for server-side computation, most of them run only on the router, which means that data must

be transfered from the shards, processed by a single node and then sent back to update the results.

MongoDB does support MapReduce and an easier to use abstraction called the Aggregation pipeline,

which uses several rounds of MapReduce to perform a series of computations one after the other. The

problem here is that MapReduce functions can’t interact with the database for any reason, meaning

that many types of computations, like Kriging interpolation, can’t take full advantage of data locality:

the results have to be sent to the router and then sent back to write to the shards [3].

4.3.3 Conclusion

Overall, MongoDB was probably not the best choice to test the GeoSharding partitioning policy. Al-

though the feature can be useful for users of this DBMS, it was unnecessarily complex to modify.

Academically it was limiting in several areas, especially for server-side computation, where the advan-

tage of data locality is wasted by the lack of a mechanism to to store results without going through the

router.

40

5Conclusions

Contents5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

41

5. Conclusions

5.1 Summary

The work presented in this dissertation was aimed at evaluating, with a practical implementation,

the usefulness of the GeoSharding partitioning policy as suggested by Joao Graca’s work, which used

simulations to compare different policies in [6].

The very popular database management system MongoDB was chosen as the base for this work,

which seemed on paper like a good fit but its internal architecture proved to be limiting or difficult

to work with in several ways, for example for distributed computation. Despite the setbacks from

this choice, a working version of MongoDB with GeoSharding for point data, named GeoMongo, was

achieved.

In the evaluation chapter two scenarios were analyzed: datasets with moderately sized data points,

like georeferenced images, and datasets with small data points, like georeferenced sensor data. The

results showed that for larger data points MongoDB’s request latency and throughput is good enough

that hardware limitations came into play, as well as the large transfer times for these bigger data

points making the choice of policy less important. For small data points, where the transfer time is

small and internal overhead of the database is more important, the outcome was quite different, with

GeoSharding performing incredibly well when compared to hashed sharding, making GeoMongo very

well suited to these kinds of datasets.

The results show great promise for use with datasets with very large numbers of small data points,

such as sensor data [15], and the way spatial-data is distributed in GeoMongo clusters has great

potential to be leveraged further by future work.

Community contributions

During the development of GeoMongo a bug was found on the most recent released version of Mon-

goDB (3.4), which was able to crash a config server when receiving malformed input. The bug (issue

#27258) was reported to the project maintainers, who assigned it a critical severity level. The bug

was fixed a few days later, after requesting some more information and diagnostics to help determine

the root cause.

5.2 Future Work

In this work the GeoSharding partitioning policy was proved to work with point data on MongoDB, and

with very good results when used appropriately. Due to time constraints, some interesting ideas (range

queries and non-point data) could not be tested and unfortunately had to be left out. In this section

these and other relevant ideas are discussed, both on their possible benefits and their complications

5.2.1 Generalizing MongoDB’s sharding internals

As discussed in section 4.3, MongoDB’s sharding internals are not a good fit for GeoSharding, requiring

a ugly duct-tape in several parts of the code-base to work. Generalizing chunks, chunk creation, chunk

ranges, chunk id’s and chunk splits so that they can be overridden by subclasses of chunks could allow

for a very clean and maintainable implementation of GeoSharding and possibly other policies that

don’t be reduced to a single dimension without information loss.

5.2.2 Leveraging GeoSharding for range and intersect queries

A major feature of GeoSharding is how it keeps points located closely together in the same chunks/shards.

This can be leveraged by the native geographical queries of MongoDB to hit only the necessary shards

42

5.2 Future Work

when searching for documents, instead of the current behavior of hitting every shard because previ-

ously there was no way to predict what shards could have data for a certain location.

This is bound to be more complex than searching for specific locations, as the problem becomes

about calculating with what chunk regions the range query intersects, which can be more than one. To

avoid performance issues caching the chunk regions might be necessary as well.

5.2.3 GeoSharding for non-point data

Sharding non-point geographical data, such as lines or polygons, in MongoDB would be a very complex

problem, due to certain limitations imposed by MongoDB on it’s sharding mechanisms (for very valid

reasons).

The two main issues are that in MongoDB each document must be assigned to one and only one

chunk and documents must be filtered at the shard level. This means that a range query that intersects

a line or polygon may not find the line if the range is contained to a set of chunks than don’t include

the chunk that the line is assigned to.

One way to solve this problem is to make range queries (and similar queries) search also on chunks

outside it’s range, which is not optimal and can be very inefficient. Another solution would be to place

copies of the document on chunks it intersects, however, MongoDB’s requirements don’t allow this,

which means at least a heavy rewrite of several parts of MongoDB would be needed or it may even

incompatible with other existing features of MongoDB.

Although seemingly complex, implementing this would be very valuable and make GeoSharding

on MongoDB much more versatile, as lines are often used to represent roads, rivers and other similar

objects while polygons can often be used to represent buildings, land properties, cities, natural parks

and many other types of geographic areas.

5.2.4 Dynamic allocation of chunk’s virtual coordinates

Dynamic allocation of chunk regions would make GeoMongo much more useful, practical and intuitive,

since manually calculating good locations based on the geographical distribution of a dataset can

be time-consuming or otherwise impossible if the data-set is still growing. For datasets that grow

over time, dynamic allocation can make it possible to keep the cluster balanced, without manual

intervention, long after it’s been setup.

A possible method of doing this allocation is using the K-means algorithm, with K equal to the

number of shards, implemented as MapReduce, to take advantage of MongoDB’s distributed compu-

tation capabilities, as described in [19]. This computation can be used in a similar fashion to garbage

collectors in programming languages with managed memory, being done only periodically, when the

system load is low or when data distribution imbalance in the cluster rises above a certain threshold.

A study on the characteristics of different algorithms for this purpose, using simulations, could be

very valuable not only for GeoMongo but also for other projects that have or plan to have similar

features.

43

Bibliography

[1] Abhishek S (2014) Distributed Computation On Spatial Data on Hadoop Cluster (10305042):1–39

[2] Abramova V, Bernardino J (2013) NoSQL databases: MongoDB vs cassandra. Proceedings of the

International C* Conference on Computer Science and Software Engineering, ACM 2013 pp 14–22,

DOI 10.1145/2494444.2494447

[3] Akdogan A, Demiryurek U, Banaei-Kashani F, Shahabi C (2010) Voronoi-based geospatial query pro-

cessing with MapReduce. In: Proceedings - 2nd IEEE International Conference on Cloud Computing

Technology and Science, CloudCom 2010, pp 9–16, DOI 10.1109/CloudCom.2010.92

[4] Dean J, Ghemawat S (2004) MapReduce: Simplied Data Processing on Large Clusters. Pro-

ceedings of 6th Symposium on Operating Systems Design and Implementation pp 137–149,

DOI 10.1145/1327452.1327492, URL http://static.googleusercontent.com/media/research.

google.com/es/us/archive/mapreduce-osdi04.pdf, 10.1.1.163.5292

[5] Eldawy A (2014) SpatialHadoop: towards flexible and scalable spatial processing using mapreduce.

Proceedings of the 2014 SIGMOD PhD symposium (December):46–50, DOI 10.1145/2602622.

2602625, URL http://dl.acm.org/citation.cfm?id=2602625

[6] Graca JPM, Silva JNdOe (2016) GeoSharding : Optimization of data partitioning in sharded geo-

referenced databases. PhD thesis, Instituto Superior Tecnico

[7] Ilarri S, Mena E, Illarramendi A (2010) Location-dependent query processing. ACM Computing

Surveys 42(3):1–73, DOI 10.1145/1670679.1670682

[8] Lakshman A, Malik P (2010) Cassandra. ACM SIGOPS Operating Systems Review 44(2):35, DOI

10.1145/1773912.1773922, URL http://portal.acm.org/citation.cfm?doid=1773912.1773922

[9] Leach G (1992) Improving Worst-Case Optimal Delaunay Triangulation Algorithms. In 4th Canadian

Conference on Computational Geometry 2:15, DOI 10.1.1.56.2323, URL http://citeseerx.ist.

psu.edu/viewdoc/summary?doi=10.1.1.56.2323

[10] Lenka RK, Barik RK, Gupta N, Ali SM, Rath A, Dubey H (2016) Comparative Analysis of Spatial-

Hadoop and GeoSpark for Geospatial Big Data Analytics URL http://arxiv.org/abs/1612.07433,

1612.07433

[11] Li aD, Wang S (2005) Concepts, principles and applications of spatial data mining and knowledge

discovery. in ISSTM 2005 pp 27–29, DOI 10.1109/WKDD.2009.13, URL http://citeseerx.ist.

psu.edu/viewdoc/summary?doi=10.1.1.368.6770

[12] Mennis J, Guo D (2009) Spatial data mining and geographic knowledge discovery-An introduction.

Computers, Environment and Urban Systems 33(6):403–408, DOI 10.1016/j.compenvurbsys.2009.

11.001

45

http://static.googleusercontent.com/media/research.google.com/es/us/archive/mapreduce-osdi04.pdf

http://static.googleusercontent.com/media/research.google.com/es/us/archive/mapreduce-osdi04.pdf

10.1.1.163.5292

http://dl.acm.org/citation.cfm?id=2602625

http://portal.acm.org/citation.cfm?doid=1773912.1773922

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.2323


http://arxiv.org/abs/1612.07433

1612.07433



[13] Olea RA (2009) A Practical Primer on Geostatistics. Tech. rep., URL http://pubs.usgs.gov/of/

2009/1103/

[14] Rajesh D (2011) Application of Spatial Data Mining for Agriculture. International Journal of Com-

puter Applications 15(2):975–8887, DOI 10.5120/1922-2566

[15] Tsai CW, Lai CF, Chiang MC, Yang LT (2014) Data mining for internet of things: A survey. IEEE

Communications Surveys and Tutorials 16(1):77–97, DOI 10.1109/SURV.2013.103013.00206,

1111.6189v1

[16] Vincenty T (1975) Direct and Inverse Solutions of Geodesics on the Ellipsoid With Application of

Nested Equations. Survey Review 23(176):88–93, DOI 10.1179/sre.1975.23.176.88, URL http:

//www.tandfonline.com/doi/full/10.1179/sre.1975.23.176.88

[17] Yu J, Jinxuan W, Mohamed S (2015) GeoSpark: A Cluster Computing Framework for Process-

ing Large-Scale Spatial Data. 23th International Conference on Advances in Geographic Infor-

mation Systems (3):4–7, DOI 10.1145/2820783.2820860, URL http://www.public.asu.edu/∼

jinxuanw/papers/GeoSpark.pdf%5Cnhttp://www.public.asu.edu/∼jinxuanw/

[18] Yu J, Wu J, Sarwat M (2016) A demonstration of GeoSpark: A cluster computing framework for

processing big spatial data. In: 2016 IEEE 32nd International Conference on Data Engineering,

ICDE 2016, pp 1410–1413, DOI 10.1109/ICDE.2016.7498357

[19] Zhao W, Ma H, He Q (2009) Parallel K-means clustering based on MapReduce. In: Lecture Notes in

Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in

Bioinformatics), vol 5931 LNCS, pp 674–679, DOI 10.1007/978-3-642-10665-1 71

46

http://pubs.usgs.gov/of/2009/1103/

http://pubs.usgs.gov/of/2009/1103/

1111.6189v1

http://www.tandfonline.com/doi/full/10.1179/sre.1975.23.176.88

http://www.tandfonline.com/doi/full/10.1179/sre.1975.23.176.88

http://www.public.asu.edu/~jinxuanw/papers/GeoSpark.pdf%5Cnhttp://www.public.asu.edu/~jinxuanw/

http://www.public.asu.edu/~jinxuanw/papers/GeoSpark.pdf%5Cnhttp://www.public.asu.edu/~jinxuanw/

AWrite performance benchmark script

A-1

1 from pymongo import MongoClient

2 from threading import Thread

3 from queue import Queue

4 import json , time , os, random

56 LIMIT_ENTRIES = None # Can be set to None to write all

7 ATTACH_FILE = False

8 N_THREADS = 1

910 ROUTER_IP = ’localhost ’

11 ROUTER_PORT = 27017

1213 DB_NAME = ’wells_US ’

14 COLL_NAME = ’points ’

15 DATASET_FILENAME = ’wells_US.json’

16 BIN_FILENAME = ’earth0 .5MB.jpg’

1718 client = MongoClient(ROUTER_IP , ROUTER_PORT)

19 db = client[DB_NAME]

20 collection = db[COLL_NAME]

2122 with open(DATASET_FILENAME) as dataset:

23 data_entries = json.load(dataset)

2425 if LIMIT_ENTRIES is None:

26 LIMIT_ENTRIES = len(data_entries)

2728 if ATTACH_FILE:

29 with open(BIN_FILENAME , ’rb’) as binary_file:

30 for entry in data_entries [: LIMIT_ENTRIES ]:

31 entry[’binary_data ’] = binary_file.read()

3233 q = Queue () # shared FIFO queue , to send work to the threads

34 def worker ():

35 while True:

36 batch = q.get()

37 result = collection.insert_many(batch , ordered=False)

38 q.task_done ()

3940 for i in range(N_THREADS):

41 t = Thread(target=worker)

42 t.daemon = True

43 t.start ()

4445 print("Created ", N_THREADS , " threads")

4647 random.shuffle(data_entries)

4849 start_time = time.time()


51 q.put(data_entries[i:: N_THREADS ]) # N_THREADS batches

5253 q.join() # wait for all threads to finish

54 time_insert_many = time.time() - start_time

5556 print("insert many: " + str(time_insert_many) + " seconds elapsed")

57 print("wrote " + str(LIMIT_ENTRIES) + " documents totaling about " + str(LIMIT_ENTRIES * sys.

getsizeof(data_entries [0]) / 10**6) + "MB")

Codigo A.1: The benchmark script for write performance, in Python3

A-2

BWrite performance results

B-1

B.1 Documents with attached image

B.1.1 1 shard

Table B.1: Write benchmark for documents with image attached (total running time) - 1 shard(a) Hashed sharding

Threads Trial 1[s] Trial 2[s] Trial 3[s] Average[s]1 145.53 143.53 145.93 145.002 131.91 129.67 126.54 129.374 130.29 128.19 127.54 128.67

16 124.66 127.26 129.62 127.1864 124.87 125.34 126.00 125.40

(b) GeoSharding


16 122.14 124.67 126.04 124.2864 125.00 127.07 123.58 125.22

B.1.2 2 shards

Table B.2: Write benchmark for documents with image attached (total running time) - 2 shards(a) Hashed sharding


16 62.35 63.22 62.27 62.6164 63.50 63.01 62.58 63.03

(b) GeoSharding


16 62.10 64.11 63.21 63.1464 62.23 63.76 62.23 62.74

B.1.3 4 shards

Table B.3: Write benchmark for documents with image attached (total running time) - 4 shards(a) Hashed sharding


16 38.12 37.14 36.63 37.3064 34.44 33.81 33.11 33.79

(b) GeoSharding


16 38.38 36.95 36.91 37.4164 34.87 33.62 32.14 33.54

B-2

B.2 Text-only documents

B.2.1 1 shard

Table B.4: Write benchmark for text-only documents (total running time) - 1 shard(a) Hashed sharding


16 0.28 0.28 0.28 0.2864 0.28 0.27 0.28 0.28

(b) GeoSharding


16 0.29 0.29 0.27 0.2864 0.27 0.26 0.26 0.26

B.2.2 2 shards

Table B.5: Write benchmark for text-only documents (total running time) - 2 shards(a) Hashed sharding


16 0.23 0.22 0.23 0.2364 0.21 0.20 0.21 0.21

(b) GeoSharding


16 0.21 0.21 0.22 0.2164 0.20 0.19 0.19 0.19

B.2.3 4 shards

Table B.6: Write benchmark for text-only documents (total running time) - 4 shards(a) Hashed sharding


16 0.20 0.20 0.20 0.2064 0.19 0.20 0.20 0.20

(b) GeoSharding


16 0.21 0.20 0.20 0.2064 0.20 0.19 0.19 0.19

B-3

CRead performance benchmark script

C-1

1 from pymongo import MongoClient

2 from threading import Thread

3 from queue import Queue

4 import json , time , os, random

56 NUMBER_READS = None # Can be set to None to read all

7 N_THREADS = 1

89 ROUTER_IP = ’localhost ’

10 ROUTER_PORT = 27017

1112 DB_NAME = ’wells_US ’

13 COLL_NAME = ’points ’

14 DATASET_FILENAME = ’wells_US.json’

1516 client = MongoClient(ROUTER_IP , ROUTER_PORT)

17 db = client[DB_NAME]

18 collection = db[COLL_NAME]

1920 with open(DATASET_FILENAME) as dataset:

21 data_entries = json.load(dataset)

2223 if NUMBER_READS is None:

24 NUMBER_READS = len(data_entries)

2526 q = Queue () # shared FIFO queue , to send work to the threads

27 def worker ():

28 while True:

29 coordinates = q.get()

30 result = collection.find_one (’geometry ’:

31 ’type’: ’Point’,

32 ’coordinates ’: coordinates

33

34 )

35 q.task_done ()


38 t = Thread(target=worker)

39 t.daemon = True

40 t.start ()

4142 print("Created ", N_THREADS , " threads")

4344 start_time = time.time()

45 for i in range(NUMBER_READS):

46 coordinates = random.choice(data_entries)[’geometry ’][’coordinates ’]

47 q.put(coordinates)

4849 q.join() # wait for all threads to finish

50 time_lookup_fromDB = time.time() - start_time

5152 print(’took ’ + str(time_lookup_fromDB) + ’s to look up ’ + str(NUMBER_READS) + ’ coordinates

that exist in the dataset ’)

Codigo C.1: The benchmark script for read performance, in Python3

C-2

DRead performance results

D-1

D.1 Documents with attached image

D.1.1 1 shard

Table D.1: Read benchmark for documents with image attached (total running time) - 1 shard(a) Hashed sharding


16 26.40 26.10 25.80 26.1064 25.50 24.90 25.60 25.33

(b) GeoSharding


16 25.50 25.50 25.30 25.4364 24.50 24.80 25.10 24.80

D.1.2 2 shards

Table D.2: Read benchmark for documents with image attached (total running time) - 2 shards(a) Hashed sharding


16 28.20 33.20 25.80 29.0764 24.30 25.90 25.00 25.07

(b) GeoSharding


16 21.90 24.20 23.00 23.0364 24.10 22.90 22.30 23.10

D.1.3 4 shards

Table D.3: Read benchmark for documents with image attached (total running time) - 4 shards(a) Hashed sharding


16 22.40 22.00 23.10 22.5064 20.70 21.10 23.20 21.67

(b) GeoSharding


16 21.40 22.10 23.00 22.1764 19.82 20.30 21.80 20.64

D-2

D.2 Text-only documents

D.2.1 1 shard

Table D.4: Read benchmark for text-only documents (total running time) - 1 shard(a) Hashed sharding


16 207.50 208.23 208.49 208.0764 207.21 210.02 209.10 208.78

(b) GeoSharding


16 206.34 206.85 205.73 206.3164 208.11 200.80 208.96 205.96

D.2.2 2 shards

Table D.5: Read benchmark for text-only documents (total running time) - 2 shards(a) Hashed sharding


16 158.19 158.24 158.40 158.2864 154.51 154.59 155.37 154.82

(b) GeoSharding


16 58.33 57.83 58.69 58.2864 50.20 49.75 49.62 49.86

D.2.3 4 shards

Table D.6: Read benchmark for text-only documents (total running time) - 4 shards(a) Hashed sharding


16 92.84 93.71 94.21 93.5964 90.26 91.56 92.91 91.58

(b) GeoSharding


16 28.32 28.03 28.21 28.1964 18.50 17.21 18.01 17.91

D-3

EExample document from the dataset

E-1

1

2 "type": "Feature",

3 "geometry ":

4 "type": "Point",

5 "coordinates ": [ -69.08976219 ,47.11225721 ]

6 ,

7 "properties ":

8 "SITE_NO ":1010500 ,

9 "SNAME ":"ST JOHN R A DICKEY ME",

10 "NWISDA1 ":2680 ,

11 "STATE ":" Maine",

12 "COUNTY_NAME ":" Aroostook",

13 "ECO_L3_CODE ":"5.3.1" ,

14 "ECO_L3_NAME ":" Northern Appalachian and Atlantic Maritime Highlands",

15 "ECO_L2_CODE ":5.3,

16 "ECO_L2_NAME ":" ATLANTIC HIGHLANDS",

17 "ECO_L1_NAME ":" NORTHERN FORESTS",

18 "ECO_L1_CODE ":5,

19 "HUC_REGION_NAME ":"New England Region",

20 "HUC_SUBREGION_NAME ":"St. John",

21 "HUC_BASIN_NAME ":"St. John",

22 "HUC_SUBBASIN_NAME ":" Upper St. John",

23 "HUC_2 ":1,

24 "HUC_4 ":101,

25 "HUC_6 ":"010100" ,

26 "HUC_8 ":1010001 ,

27 "PERM ":1.6567 ,

28 "BFI ":48.3947 ,

29 "KFACT ":0.2243 ,

30 "RFACT ":69.79872 ,

31 "PPT30 ":98.8643 ,

32 "URBAN ":0.146723998 ,

33 "FOREST ":59.02735921 ,

34 "AGRIC ":0.003490023 ,

35 "MAJ_DAMS ":0,

36 "NID_STOR ":0,

37 "CLAY ":17.4695 ,

38 "SAND ":22.1427 ,

39 "SILT ":60.3878 ,

40 "BENCHMARK_SITE ":"0",

41 "NHDP1 ":255,

42 "NHDP5 ":420,

43 "NHDP10 ":550,

44 "NHDP20 ":815,

45 "NHDP25 ":943.5 ,

46 "NHDP30 ":1100 ,

47 "NHDP40 ":1470 ,

48 "NHDP50 ":2000 ,

49 "NHDP60 ":2720 ,

50 "NHDP70 ":3850 ,

51 "NHDP75 ":4735 ,

52 "NHDP80 ":6000 ,

53 "NHDP90 ":11800 ,

54 "NHDP95 ":19700 ,

55 "NHDP99 ":39300 ,

56 "sample_count ":30

57

58

E-2