Hello everyone, I’m Hanson. I come from China, and I’m a Ph.D. student majoring in GIS. Today I’m going to share some of my work related to MongoDB. As you may have noticed, MongoDB has a geospatial part, which is really great and, in some ways, quite revolutionary. But if you examine this geospatial part from a GIS point of view, you will find that MongoDB is actually quite lonely. GIS is the main field that deals with geospatial information, and after decades of development the GIS ecosystem has become very prosperous, with hundreds of libraries, tools and software packages to handle all kinds of geospatial problems. But among them you can hardly find any that cooperates directly with MongoDB. That’s really a pity. The work we did here tries to change this awkward situation. The hope is to power the GIS ecosystem with MongoDB, the new generation of database technology, and meanwhile to give MongoDB a way to play with the GIS community. OK, let’s start from a real-world example and see how we met MongoDB.
Spatial Pyramid – View the world with multiple spatiotemporal scales
1
Real world example - Spatial Pyramid
Challenges with PostGIS
Handling with MongoDB cluster
Presenter
Presentation Notes
One of our projects needs to generate a spatial pyramid. So what is a spatial pyramid?
Global
North America
Canada U.S.A.
Illinois
Champaign
UIUC Campus
Downtown
Chicago
New York
Asia
South Asia East Asia
China
Shanghai Beijing
Olympic Park
Xidan Street
Japan
Spatial Pyramid | Introduction
Presenter
Presentation Notes
A spatial pyramid is actually quite simple. Suppose we have millions of point records, for example geotagged photos and tweets distributed all over the world. We want to analyze these data at multiple scales, at different levels. That requires a recursive bottom-up aggregation, which we call a spatial pyramid. The diagram here illustrates what a spatial pyramid can look like. I have to mention three outstanding characteristics of the spatial pyramid. First, the data size is huge, and millions of records keep pouring in every day. Second, it requires lots of spatial queries. Third, lots of data move back and forth, with writes and reads at each scale, so it’s a very typical I/O-intensive task.
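The bottom-up aggregation described in these notes can be sketched in a few lines of Python. This is only a minimal illustration, not the project's generator: it assumes a simple square-grid pyramid in which each 2x2 block of cells merges into one parent cell, and the function names (`base_level`, `aggregate_up`, `build_pyramid`) are hypothetical.

```python
from collections import Counter


def base_level(points, cell_size):
    """Count points per square grid cell at the finest level.

    points: iterable of (x, y) pairs; cell_size: width of one cell.
    """
    counts = Counter()
    for x, y in points:
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts


def aggregate_up(counts):
    """Merge each 2x2 block of cells into a single parent cell."""
    parent = Counter()
    for (cx, cy), n in counts.items():
        parent[(cx // 2, cy // 2)] += n
    return parent


def build_pyramid(points, cell_size, levels):
    """Return the levels of the pyramid, finest first, coarsest last."""
    pyramid = [base_level(points, cell_size)]
    for _ in range(levels - 1):
        # Each level is derived only from the one below it (recursive aggregation).
        pyramid.append(aggregate_up(pyramid[-1]))
    return pyramid
```

Every level is written once and read once by the level above it, which is why a real pyramid over millions of records is dominated by database reads and writes rather than computation.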
OpenLayers
Internet
Leaflet
ArcJs
Spatial Pyramid | PostGIS in the Open Stack
LAN uDig
QGIS
GRASS
ArcGIS
Mapserver
GeoServer
ArcServer
PostGIS
Presenter
Presentation Notes
Well, at first, I think it was quite natural that we decided to use PostGIS as our backend database server. PostGIS, built upon PostgreSQL, is a very famous open source spatial database and has very wide support from the GIS ecosystem. So it was very convenient for us to build a solution on the PostGIS stack.
Spatial Pyramid | Generator Architecture
Spatial Pyramid Generator Architecture Data Server
Spatial Pyramid Generator
PostGIS
HPC Cluster
Pyramid Model
Python OGR, MPI
PostgreSQL
What is ArcSDE 8?
2.3 hours !!
Presenter
Presentation Notes
So we built one and put all our data in the PostGIS database. But it turned out that PostGIS was extremely slow in this scenario: it took us 2.3 hours on average to finish 10 layers with a single thread. We know that with multithreading and parallel computing it can be faster. But since the spatial pyramid is an I/O-intensive task, we believed that the throughput of PostGIS would be the bottleneck for further performance tuning. Besides, our data keep growing every day, and we need a high-performance spatial database with good scalability. So we started to search for new solutions for our project, and we found MongoDB.
Spatial Pyramid | MongoDB Approach
[Diagram labels: Spatial Pyramid; GDAL/OGR; Requests; Load Balance; MongoS (x2); Shard (x2), each with nodes P, S, S; Config with nodes C, C, C]
15 minutes !!
Presenter
Presentation Notes
MongoDB was designed as a next-generation database with high throughput, high performance, and good scalability. And the most exciting thing about MongoDB is that it has a well-defined geospatial part. The spatial query in the spatial pyramid is actually quite simple, just a kind of nearby query. So we found that MongoDB fits our demands very nicely, and we decided to build our project on a MongoDB cluster. We spent several weeks setting up a MongoDB cluster, importing all our data and sharding them, then another several weeks rewriting the spatial pyramid generator to run on the MongoDB cluster, and then did lots of performance tuning. Finally we managed to bring the time down to about 15 minutes. That’s acceptable. But today I’m not going to talk about the performance tuning part. What I want to talk about is the problem this MongoDB approach has. The problem is that we use a library called GDAL as a fundamental library in our later spatial analysis part, and MongoDB does not have GDAL support. So the workflow in our project is actually broken.
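As an illustration of the kind of nearby query mentioned in these notes, the sketch below builds a MongoDB `$near` filter as a plain Python dictionary. The talk does not show the project's actual queries, so the field name `loc`, the helper name `near_query`, and the 5 km radius are assumptions. The filter uses the GeoJSON form of `$near`, which expects a `2dsphere` index on the field and a distance in meters.

```python
def near_query(field, lng, lat, max_distance_m):
    """Build a MongoDB $near filter around a GeoJSON point.

    GeoJSON coordinates are ordered longitude first, then latitude.
    """
    return {
        field: {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lng, lat]},
                "$maxDistance": max_distance_m,
            }
        }
    }


# Hypothetical example: records within 5 km of Chicago.
query = near_query("loc", -87.65, 41.90, 5000)
```

With a driver such as PyMongo, a dictionary like this would be passed as the filter argument of `find` on a collection; that step needs a running server and is left out here.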
Open Source – GDAL is released under an X/MIT style Open Source license
– supported by the Open Source Geospatial Foundation
A library for geospatial data formats – abstract data model conforming to OGC standards.
– 133 raster data formats, 79 vector data formats
Widely used by the GIS community – 88 software packages listed on gdal.org use GDAL
Basic Library for HPGC – We use GDAL as the basic tool to build high performance computing algorithms
Spatial Pyramid | GDAL Library
Presenter
Presentation Notes
So what is GDAL? Can we get rid of it? Yes, we can, but it’s not a good idea. GDAL is an open source library whose main purpose is to load spatial data from all kinds of data formats. It has been widely used throughout the GIS community and serves as a basic element of the ecosystem. So algorithms written on top of GDAL gain good interoperability and an easy way to cooperate with other GIS tools in the stack. A better choice would be to keep both GDAL and MongoDB in our project and build a bridge between the two. Of course you can write an ad hoc program in some programming language to glue them together, but in fact there is a much more elegant way to handle this.
Spatial Pyramid | GDAL Architecture
Presenter
Presentation Notes
GDAL actually has an extensible architecture for new data formats. So instead of gluing the two together, we can regard MongoDB as a new data format for GDAL and write a GDAL driver for MongoDB.
GDAL Driver for MongoDB – Giving MongoDB the way to play with the GIS community
2
View MongoDB as a spatial database
Design GDAL Driver for MongoDB
Cooperate with other GIS tools
Presenter
Presentation Notes
So let’s see how we did it. But before we go into the details, for a better understanding we first have to go through how GDAL organizes its spatial data. I’ll also talk about the challenges of writing a GDAL driver for MongoDB and the way we solved them.
FID Geometry Name States Time Zone
10001 POINT(40.77, 73.98) NYC New York UTC-05:00
10002 POINT(41.90, 87.65) Chicago Illinois UTC-06:00
Feature – a spatial object
Point
Line
Polygon
Geometries
Attributes, Non-Spatial Data
GDAL | spatial database structure
Spatial Relational Table
1
2
3
Presenter
Presentation Notes
So how does GDAL organize geospatial data? Well, there is a fundamental concept called a feature. Each feature is designed to represent a spatial object in the real world; let’s say we use a point to represent a city. In the database each feature is stored as a row and is composed of three parts: the geometry, its reference, and the attribute data. You put many features of the same theme in one table to form a spatial layer.
Then you overlay those layers to get a whole map. That is the basic structure: tables for layers and rows for features. But the problem is that in MongoDB you can’t find a table anywhere. So the first challenge is how to organize geospatial data in MongoDB.
GDAL | Simple Feature Access
Presenter
Presentation Notes
Another thing is that GDAL follows an international standard called Simple Feature Access, which defines a geometry hierarchy, as illustrated in this diagram. Unfortunately the standard has no definition for representing these geometry types in JSON style. So we have a problem again: how do we represent these geometries in MongoDB?
RDBMS     | GeoDatabase | MongoDB
----------+-------------+---------------------
Database  | Datasource  | Database
Table     | Layer       | Collection
Row(s)    | Feature(s)  | JSON Document
Field(s)  | Field(s)    | Key:Value
Index     | R tree      | Index
Join      | Join        | Embedding & Linking
Partition | --          | Shard
GDAL | Terminology
Presenter
Presentation Notes
The first challenge is very easy to handle. Yes, MongoDB doesn’t have tables, but it has collections, and we can treat each JSON document as a row. Putting all these terms side by side, we quickly get the idea of how to organize geospatial data in MongoDB. The second problem is not an easy one, and I’ll talk about three approaches to work around it.
WKT, Well-known text, originally defined by the Open Geospatial
Consortium (OGC) and described in their Simple Feature Access and
In total, there are 18 distinct geometric objects that can be represented.
http://en.wikipedia.org/wiki/Well-known_text
Presenter
Presentation Notes
Well, the first approach lies in the standard itself. The standard defines a string format called WKT to represent geometries. The table below gives some WKT examples showing how to represent a point, a linestring, and a polygon. So we can take advantage of this string format to represent geometries in the JSON document.
10002 POINT(41.90, 87.65) Chicago Illinois UTC-06:00
WKT
Geospatial Metadata collection
Presenter
Presentation Notes
So it becomes very simple: you just add a geometry field and use a WKT string to represent the geometry.
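A minimal sketch of what such a document could look like, written as a Python dictionary. The field names (`fid`, `geometry`, `name`, `state`, `time_zone`) and the helper names are assumptions for illustration, not the driver's actual schema. Note that standard WKT separates the two coordinates with a space and puts x (longitude) before y (latitude).

```python
def point_wkt(x, y):
    """Format a 2D point as a WKT string, e.g. 'POINT (30 10)'."""
    return f"POINT ({x:g} {y:g})"


def wkt_feature_document(fid, wkt, **attributes):
    """Build one feature as a JSON-style document with a WKT geometry field."""
    document = {"fid": fid, "geometry": wkt}
    document.update(attributes)  # non-spatial attributes become plain keys
    return document


chicago = wkt_feature_document(
    10002,
    point_wkt(-87.65, 41.90),
    name="Chicago",
    state="Illinois",
    time_zone="UTC-06:00",
)
```

To the database the geometry here is just a string, so any geometry type that WKT can express fits in the same field.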
GDAL | WKT for Spatial data
U.S.A
States
Cities
Canada
Roads
G_sys_Metadata
MongoDB Cluster
NYC
Chicago
……
Database
Collection
WKT
Feature
Layer
Datasource
| c_name | coord_d | src  | type    | Extent  |
+--------+---------+------+---------+---------+
| Cities | 2       | 4326 | Point   | [p1,p2] |
| States | 2       | 4326 | Polygon | [p1,p2] |
No spatial Index
Presenter
Presentation Notes
So the overall solution can be described by this diagram. A database, in database terminology, corresponds to a datasource in GIS terminology. Each collection serves as a layer, and each JSON document is a feature with a field that holds a WKT string. We also keep a metadata collection in each database, just as traditional spatial databases do. So here we go: we can organize all the geospatial information in MongoDB. But there is a problem. Right now you have no way to build a spatial index on the WKT field in MongoDB. And that’s terrible.
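One entry of the metadata collection could be sketched as follows. The keys mirror the column names shown on the slide (`c_name`, `coord_d`, `src`, `type`, `Extent`), but their exact meaning is an assumption here: `src` is taken to be an EPSG code such as 4326, and the extent `[p1,p2]` is taken to be the lower-left and upper-right corners of the layer's bounding box. The helper name `layer_metadata` is hypothetical.

```python
def layer_metadata(c_name, geom_type, points, srid=4326, coord_dim=2):
    """Build a metadata document describing one layer (collection).

    points: iterable of (x, y) pairs covering the layer, used for the extent.
    """
    points = list(points)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return {
        "c_name": c_name,        # name of the collection that holds the layer
        "coord_d": coord_dim,    # coordinate dimension (2 for x/y)
        "src": srid,             # assumed: spatial reference as an EPSG code
        "type": geom_type,       # geometry type stored in the layer
        "extent": [[min(xs), min(ys)], [max(xs), max(ys)]],
    }


cities_meta = layer_metadata("Cities", "Point", [(-73.98, 40.77), (-87.65, 41.90)])
```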
GDAL | GeoJSON for spatial data
FID Geometry Name States Time Zone
10001 POINT(40.77, 73.98) NYC New York UTC-05:00
10002 POINT(41.90, 87.65) Chicago Illinois UTC-06:00
So another way to work around it is GeoJSON. Since version 2.4, released last year, MongoDB has used GeoJSON to store geospatial data. GeoJSON is a specification for encoding a variety of geospatial data in JSON style.
U.S.A
States
Cities
Canada
Roads
G_sys_Metadata
MongoDB Cluster
NYC
Chicago
……
Database
Collection
GeoJSON
Feature
Layer
Datasource
| c_name | coord_d | src  | type    | Extent  |
+--------+---------+------+---------+---------+
| Cities | 2       | 4326 | Point   | [p1,p2] |
| States | 2       | 4326 | Polygon | [p1,p2] |
GDAL | GeoJSON for spatial data
Presenter
Presentation Notes
If we just follow the database structure of the WKT approach and change the document style to GeoJSON, we get our second approach. And since MongoDB has native support for GeoJSON, the nice thing about this approach is that you can build a spatial index and send spatial queries as you want.
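For comparison with the WKT approach, here is a minimal sketch of a feature document that stores its geometry as a GeoJSON object, together with an index specification for it. The field and helper names are assumptions for illustration; `2dsphere` is the MongoDB index type for GeoJSON geometries.

```python
def geojson_point(lng, lat):
    """Build a GeoJSON Point; coordinates are longitude first, then latitude."""
    return {"type": "Point", "coordinates": [lng, lat]}


def geojson_feature_document(fid, geometry, **attributes):
    """Build one feature as a document with an embedded GeoJSON geometry."""
    document = {"fid": fid, "geometry": geometry}
    document.update(attributes)
    return document


nyc = geojson_feature_document(
    10001, geojson_point(-73.98, 40.77), name="NYC", state="New York"
)

# Index specification as a list of (field, index type) pairs. With PyMongo a
# list like this could be passed to Collection.create_index; that call needs a
# running server and is not executed in this sketch.
GEOMETRY_INDEX = [("geometry", "2dsphere")]
```

Because the geometry is a structured object rather than an opaque string, the server can index it and answer spatial queries against it.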
But there is still one thing left, because in the GeoJSON specification a document is not necessarily a feature. It can be a FeatureCollection as well, which means you have a lot of features in one document. And that changes the spatial database structure in MongoDB.
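A FeatureCollection document can be sketched like this. The `type`, `features`, `geometry` and `properties` members come from the GeoJSON specification; the extra `layer` key and the helper names are assumptions added for illustration.

```python
import json


def feature(geometry, properties):
    """Wrap a geometry and its attributes as a GeoJSON Feature."""
    return {"type": "Feature", "geometry": geometry, "properties": properties}


def feature_collection(layer, features):
    """Bundle many features into one GeoJSON FeatureCollection document."""
    return {
        "layer": layer,  # hypothetical extra key naming the layer
        "type": "FeatureCollection",
        "features": list(features),
    }


cities = feature_collection(
    "Cities",
    [
        feature({"type": "Point", "coordinates": [-73.98, 40.77]}, {"name": "NYC"}),
        feature({"type": "Point", "coordinates": [-87.65, 41.90]}, {"name": "Chicago"}),
    ],
)

# The whole layer serializes as a single JSON document.
as_json = json.dumps(cities)
```

One design consideration with this layout is that a MongoDB document is limited to 16 MB, so a very large layer may not fit in a single FeatureCollection document.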
So here is the terminology comparison table of the three approaches. You see that the WKT and GeoJSON approaches are alike, but the FeatureCollection approach has a slightly different structure: in the FeatureCollection (FCL) approach each document serves as a layer.
Geometry types ALL SFA, 18* LIMITED, 6** LIMITED, 6**
Presenter
Presentation Notes
But which is better? It depends. The WKT approach is poor at spatial queries but fully supports the standard. GeoJSON has a spatial index but ONLY a limited set of spatial data types (SDTs). The FeatureCollection approach keeps all the information in one document, so it is very convenient for sharing. OK, so here we go: we have three approaches for developing the GDAL driver for MongoDB to solve the problem in our project. However, the driver is by no means only for our project; it has a much broader impact.
ogr2ogr – convert simple features data between file formats
– spatial or attribute selections, reducing the set of attributes,
– setting the output coordinate system or even reprojecting
– Extract, Transform, and Load (ETL) Tools for MongoDB Geospatial
GDAL | Load all sorts of spatial data
Presenter
Presentation Notes
The most direct result is that we can use the tools of the GDAL project. For example, there is a tool called ogr2ogr, which converts spatial data between formats. With this tool MongoDB can directly load spatial data from all kinds of spatial data formats, e.g. ESRI Shapefile, PostGIS, Oracle Spatial …
Presenter
Presentation Notes
The second benefit is the algorithms and software built on the GDAL library. You do not have to make any changes to the algorithms; you just use MongoDB to organize your geospatial data. The algorithms will take advantage of the high-performance capability of MongoDB and fly.
Work with various GIS software
Presenter
Presentation Notes
As I mentioned, GDAL is widely used in the GIS ecosystem and serves as a fundamental library for lots of other GIS software. So with the GDAL driver for MongoDB you suddenly have a number of GIS software packages that can cooperate directly with the MongoDB geospatial part, for example ArcGIS, QGIS, MapServer and so on.
MongoDB Works with QGIS
Presenter
Presentation Notes
Here is one experiment I did. I loaded global airport data from a shapefile into MongoDB using the ogr2ogr tool I mentioned. Then I used QGIS, a popular desktop GIS package, to visualize the airports and calculate their Voronoi polygons. So in this map, each polygon represents the nearest service area of one airport. So you see, with the GDAL driver for MongoDB you can use lots of GIS software directly to visualize, analyze and publish the geospatial data stored in MongoDB.
A step forward : MongoGIS – Mend the way for the GIS community to play with MongoDB
3
Evolution of spatial database Tech
Comparison of spatial database solutions
Roadmap to make the way
Presenter
Presentation Notes
So that’s what we did to help MongoDB cooperate with the GIS ecosystem. But today I want to talk about a little more than that. MongoDB is a great product, and for the GIS community it should not serve just as a container, a box for spatial data. It can have a much more significant destiny, and we should move a step forward. But in order to see that, let’s first take a step backward and review the evolution of spatial database technology.
GIS Application
Geometries Geometries Geometries files
FID
20th Century late 80s & early 90s
RDBMS for attribute data
File systems for geometry data
A unique feature ID links the two
ESRI Shapefile is one of the most famous
Problems with data integrity, multiuser access and editing
1st Generation | Hybrid Solution
Standard SQL Geoprocessing
Attributes
Presenter
Presentation Notes
Spatial database technology has gone through three generations so far. Before spatial databases, GIS used files to manage all geospatial data. Later on, GIS could put some of its data, the attributes, in the database, while still keeping the spatial part, the geometries, in files. ESRI Shapefile is one of the most famous examples and is still widely used even today. But the problem with this generation is that when the data volume becomes huge, maintaining data integrity tends to be really complicated. And you also get problems with multiuser access and editing.
IT
20th Century mid 90s
Attributes & Geometries in database
But geometry as binary large object
SDE as middleware by GIS vendors
Geometries are not understandable.
Poor integration, no spatial structured query language
2nd Generation | Spatial Database Engine
SDE
Attributes Geometries Geometries Geometries
blobs SQL
GIS Application
Presenter
Presentation Notes
So when database technology provided the binary large object (BLOB) for unstructured data, GIS quickly took advantage of it and developed the spatial database engine on top of it. For the first time GIS could store all its data in the database. ESRI ArcSDE is a good example. But the problem is that the database has no idea what these binaries are, so you can’t use SQL to do spatial queries.
GIS
eBusiness
Geometries Attributes
E-SQL
20th Century late 90s
Spatial is a native Data Type
Attributes & geometries all in
Rich GIS functions built inside
Supported by major DB vendors
Spatial data queried using E-SQL
DB functionality fully supported
E-SQL
GIS GIS
eBusiness eBusiness
3rd Generation | Object-based Spatial Database
Presenter
Presentation Notes
The third generation came with object-relational database technology, which allows you to define new data types, including spatial data types. For the first time, spatial data in the database were no longer treated as second-class citizens, and spatial databases got full support from database technology. PostGIS is the most famous example of this kind.
BIG DATA Spreading
2008.9 | Nature
2009.1 | Google | Detecting influenza epidemics using search engine query data
2009.5 | UN | Global Pulse Project; "Big Data for Development: Opportunities & Challenges": A Global Pulse White Paper
2009.12 | Microsoft | The Fourth Paradigm: Data-Intensive Scientific Discovery
2011.2 | Science | Dealing with data: highlights both the challenges posed by the data deluge and the opportunities that can be realized if we can better organize and access the data.
2012.3 | The White House | Big Data Initiative: more than $200 million to big data research projects.
Presenter
Presentation Notes
We are marching into the fourth generation of spatial databases because of big data. In the era of big data, database technology tends to run on large-scale clusters with high performance, high availability, and good scalability, which has been well discussed in the context of NoSQL technology. Geospatial data are no exception, especially given the wide spread of location-aware mobile devices such as smartphones, tablets, and wearable sensors. So the GIS community needs a next-generation spatial database technology. But unfortunately, when you look around the world, you can find very few solutions, and in fact MongoDB is a leader in this direction.
Feature \ Solutions: PostGIS As A Cluster (first six columns), MongoDB (last column)
Cluster | Shared Disk Failover | File System Replication | Transaction Log Shipping | Trigger-Based Master-Standby Replication | Statement-Based Replication Middleware | Asynchronous Multi-Master Replication | MongoDB
Implementation | NAS | DRBD | Streaming | Slony-I | pgpool-II | Bucardo | Sharding
Communication | Shared Disk | Disk Blocks | WAL | Table Rows | SQL | Table Rows | oplog
No Special Hardware | × | √ | √ | √ | √ | √ | √
Data Synchronous | Sync | Sync | Sync, Async | Async | Sync | Async | Sync
Failover for HA | Fast | Fast | Fast with Hot | Manual | Hard to Re-attach | × | Fast
Writes Scalability | × | × | × | × | With M-M | √ | Good
Reads Scalability | × | × | With Hot | √ | √ | √ | Good
Parallel Query | × | × | × | × | With M-M | √ | √
Complexity For Admin | Low | Low | Low | High | Very High | High | Low
Load Balancing | × | × | × | × | √ | × | √
MongoDB as a High Performance Database
Presenter
Presentation Notes
In this table, from the perspective of next-generation technology, I compared the 7 methods listed on the PostgreSQL website that can be used to deploy a PostGIS cluster. The table seems a little complex, but let’s focus on the colors. There are three colors in the table: green means very good in this category, yellow means it’s OK, and red means you should pay close attention, otherwise you may get into lots of trouble. As you can see, none of these solutions fits the demands of a next-generation spatial database; they all have problems here or there. And you’ll get similar results if you review other popular existing spatial database solutions. But if you use these criteria to examine MongoDB, you get all green. That is to say, from the perspective of next-generation technology, MongoDB fits the needs quite nicely.
Spatial Index | -- | -- | -- | R tree | Gist, Rtree | R tree | GeoHash
Geometry I/O | √ | √ | -- | +++ | +++ | ++ | ×
Geometry Accessors | √ | √ | -- | +++ | ++ | ++ | ×
Geometry Editors | -- | -- | -- | +++ | ++ | + | ×
Topological Info | -- | √ | -- | +++ | ++ | +++ | ×
Spatial Measurements | √ | √ | -- | +++ | ++ | ++ | ×
Geo-processing | √ | √ | -- | +++ | ++ | ++ | ×
Spatial Relationships | √ | √ | -- | +++ | ++ | ++ | 4
GIS Tech Ecosystems | -- | -- | -- | +++ | +++ | + | ×
MongoDB as a spatial database
Presenter
Presentation Notes
But how about the geospatial functionality? Here is another table, in which I examined the richness of the spatial operators of the most popular spatial database solutions today. The first three columns are the related standards: Simple Feature Access, SQL/MM, and GeoJSON. You see that the GeoJSON specification, which MongoDB follows, defines only a limited set of spatial data types and no additional operators. In the next three columns, about spatial database solutions, a plus means the solution has everything the standards define and more. ArcSDE, the spatial database solution provided by ESRI, the world’s largest GIS vendor, has the richest spatial operators. Well, what happens if we use those categories to examine MongoDB? Right, you get almost all red. For example, all the spatial databases support thousands of spatial references, but MongoDB supports only one. All the spatial databases have a rich geospatial ecosystem built on them, but you can hardly find any for MongoDB. That is to say, from the professional spatial database point of view, both the spatial functionality within MongoDB and the geospatial ecosystem built on it are incomplete. That’s really a sad loss for the GIS community. MongoDB has great potential to serve as the next-generation spatial database, and the GIS community should not lose such a great opportunity to promote its power. It is time for the GIS community to help the MongoDB geospatial part. But we do not have to start from scratch. There are a bunch of well-developed open source spatial libraries in the GIS ecosystem that can be harnessed to improve the MongoDB geospatial part. For example, the PROJ.4 library can be used to deal with the spatial reference part, and the GDAL library can help to bring up the geospatial ecosystem on MongoDB.
GDAL driver for MongoDB – The way that MongoDB plays with the GIS community
– Work with the GDAL community to have it included in the next release
– Open Source: https://github.com/mongogis/mongodb-gdal-driver
MongoGIS – The Next Generation Infrastructure for the GIS community
– MongoGIS group on GitHub: https://github.com/mongogis
– We may build it together!
MongoGIS on GitHub
Presenter
Presentation Notes
Therefore, together with some of my colleagues, we set up a group called MongoGIS on GitHub, aiming to help MongoDB improve its geospatial part. You can also find the GDAL driver for MongoDB there. And of course, if you are interested, you are very welcome to join this group. It’s our honor to work with talented people from all over the world to build the next-generation infrastructure for the GIS community.
Appreciate Your Time!
Sponsored by the China Scholarship Council for a one-year program at UIUC, Illinois, USA. Supported by the Scientific Research Foundation of Graduate School of Nanjing University.
Great thanks go to Craig Wilson and Greg Steinbruner for their valuable advice.