Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup
![Page 1: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/1.jpg)
Using Big Data techniques with OpenStreetMap
Stephen Knox, Arup
Partly based on research for an MSc in Geographical Information Systems and Science, Kingston University, 2015
![Page 2: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/2.jpg)
Disclaimer
• I am in no way an expert on Hadoop!
• I am a Geographic Information Systems specialist who can program (and is interested in big data)
• Hopefully I can tell you something you didn’t know about OpenStreetMap and geographic big data processing
![Page 3: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/3.jpg)
Outline
• Background to OpenStreetMap (OSM) and growth
• Background to Geographic Big Data
• Dissertation Research
• Aims & Objectives
• Methodology
• Results
• Conclusions
• My general experiences of using Hadoop/SpatialHadoop and related tools
![Page 4: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/4.jpg)
[Figure: two panels labelled 2006 and 2016]
![Page 5: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/5.jpg)
![Page 6: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/6.jpg)
![Page 7: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/7.jpg)
![Page 8: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/8.jpg)
![Page 9: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/9.jpg)
![Page 10: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/10.jpg)
[Diagram labels: INPUT, STORAGE, GRAPHICAL OUTPUT (MAPS), DATA OUTPUT]
![Page 11: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/11.jpg)
OSM Size and Growth
• Current Data – c. 0.5 – 1 TB
• Current and Historical Data – 5.15TB
• Growing at 1TB per annum

[Chart: DB dump size (XML BZ2) by year, 2006 to 2015; vertical axis from 0 to 50]

OSM DB server (successive configurations, with the next one shown as "?"):
• 2 processor cores, 8GB RAM, 6TB disk
• 4 processor cores, 64GB RAM, 6TB disk, 64GB SSD
• 8 processor cores, 256GB RAM, 24TB disk, 400GB SSD
• ?

Source: Planet OSM http://planet.openstreetmap.org
Source: OSM http://wiki.openstreetmap.org/wiki/Servers
Source: OSM http://munin.openstreetmap.org/openstreetmap/katla.openstreetmap/postgres_size_openstreetmap_9_1_main.html
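As a rough illustration of the growth figures on this slide, the sketch below projects how long linear growth of 1TB per annum would take to fill a given disk. It is only a back-of-envelope calculation under loud assumptions: growth is treated as linear, the 5.15TB current-and-historical figure is taken as the starting size, and the 24TB disk is used as the ceiling, ignoring index and database overhead. The function name is hypothetical.

```python
def years_until_capacity(current_tb, growth_tb_per_year, capacity_tb):
    """Years of linear growth before current_tb reaches capacity_tb."""
    if growth_tb_per_year <= 0:
        raise ValueError("growth must be positive")
    # Never report a negative number of years if capacity is already exceeded.
    return max(0.0, (capacity_tb - current_tb) / growth_tb_per_year)


# Figures from the slide: 5.15TB of data, 1TB per annum, 24TB of disk.
# Assumption: the raw data size can be compared directly with disk size.
years = years_until_capacity(5.15, 1.0, 24.0)
print(f"{years:.2f} years")
```

Real database size on disk is larger than a compressed dump, so the true headroom is likely smaller than this simple estimate suggests.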
![Page 12: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/12.jpg)
OSM Potential Growth (1)
[Charts: share by continent (Africa, Antarctica, Asia, Australia, Central America, Europe, N. America, S. America) of Population, Land Area and Data in OSM. Growth labels shown on the chart: +38%, +29%, +22%, +27%, +16%, +10%, +21%]
Source: Geofabrik individual region download pages for OSM size and growth (http://download.geofabrik.de/index.html), ArcGIS Continents (http://www.arcgis.com/home/item.html?id=3c4741e22e2e4af2bd4050511b9fc6ad) and UN Department of Economic & Social Affairs Total Population – Both Sexes (http://esa.un.org/unpd/wpp/Excel-Data/EXCEL_FILES/1_Population/WPP2012_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.XLS)
![Page 13: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/13.jpg)
Scaling systems
Three approaches: Scale-up, Scale-out (parallel), Scale-out (NoSQL)
Scale-up:
• More memory
• More cores
• More SSD
• More hard disk
[Diagram: scale-out architecture with a controlling server; relative cost of each approach shown in $ symbols across hardware costs, software acquisition & development costs, maintenance costs and training costs]
Sources: Scale-up vs Scale-out for Hadoop: Time to rethink? http://www.msr-waypoint.com/pubs/204499/a20-appuswamy.pdf
Scaling Up vs. Scaling Out: Hidden Costs http://blog.codinghorror.com/scaling-up-vs-scaling-out-hidden-costs/
![Page 14: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/14.jpg)
It’s getting complicated…!
Source: The 451 Group https://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/
![Page 15: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/15.jpg)
What is the right tool for the job?
[Chart: tools and applications / data (e.g. transaction logs) placed on a data-size scale of 1MB, 1GB, 1TB, 1PB, 1EB]
![Page 16: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/16.jpg)
NoSQL Spatial
• Key research topic is indexing across multiple nodes
Source: Geowave Docs http://ngageoint.github.io/geowave/documentation.html#theory
• Implementations that add spatial capabilities to NoSQL databases
  • SpatialHadoop, Hadoop GIS, ESRI tools for Hadoop
  • SpatialSpark, GeoTrellis
  • Geomesa, Geowave
  • MongoDB (extension)
  • Geocouch
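To make the indexing problem a little more concrete, here is a minimal sketch of one widely used idea: mapping a 2D coordinate to a single sortable key with a Z-order (Morton) space-filling curve, so that a key-value store can keep nearby features in nearby key ranges. This is an illustration only and is not taken from any of the tools listed; the function names and the 16-bit resolution are assumptions.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x (even positions) and y (odd positions)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def z_order_key(lon: float, lat: float, bits: int = 16) -> int:
    """Quantise lon/lat onto a 2^bits grid and return the Z-order key."""
    cells = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * cells)
    y = int((lat + 90.0) / 180.0 * cells)
    return interleave_bits(x, y, bits)


# Two nearby points (illustrative coordinates) and one far away.
print(z_order_key(-0.1276, 51.5072))
print(z_order_key(-0.1300, 51.5100))
print(z_order_key(151.2093, -33.8688))
```

Sorting rows by such a key tends to keep spatially close features together, although points on either side of a major cell boundary can still end up far apart in key order.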
![Page 17: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/17.jpg)
Dissertation - Aims
• Investigate whether a parallel non-relational solution could be used to:
  • Analyse data from OSM (read-only)?
  • Become the main storage platform (reads & writes)?
  In terms of performance, and practicality (whole life cost)
• Does the size and growth rate of OSM make it likely that a non-relational parallel storage solution will become technically or economically desirable in the future?
![Page 18: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/18.jpg)
Dissertation - Methodology
• Compare common current OSM tasks to an equivalent task using Big Data tools
• Chose technologies in the Hadoop ecosystem rather than parallel databases. Used SpatialHadoop and HBase as principal platforms
• Started using a test Hadoop cluster at work, but ran into issues, so used cloud platforms
• Keep processing power and cost constant, so performance could be directly compared
[Diagram: 1 16-core server with 64GB RAM, compared with 8 2-core servers with 8GB RAM each plus a master node]
Broadly equivalent in cost and equivalent in nominal performance
![Page 19: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/19.jpg)
SpatialHadoop
• University of Minnesota Open Source project
• Uses Pig as an execution engine
• Creates spatial indexes and operators for big geographic datasets
![Page 20: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/20.jpg)
Methodology (continued)
• 3 stages:
  • Data loading & preparation for data analysis
    • Test whether a data reader to read the OSM binary format was quicker than using the XML format
  • Data querying (read / analyse data)
    • Spatial – give me the total features in this area [using spatial index]
    • Non-spatial (e.g. count the total number of shops in the OSM database)
  • Simulation of master database (reads and writes)
    • downloading existing data to work on (by bounding box)
    • uploading new data changes
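The two query types in the second stage can be sketched on a toy in-memory dataset. This is a plain linear scan with no spatial index, intended only to show what each query asks for; the `Feature` class, the helper names and the sample coordinates are all hypothetical. OSM does tag shops with the `shop` key.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    lon: float
    lat: float
    tags: dict = field(default_factory=dict)


def count_in_bbox(features, min_lon, min_lat, max_lon, max_lat):
    """Spatial query: how many features fall inside the bounding box?"""
    return sum(
        1
        for f in features
        if min_lon <= f.lon <= max_lon and min_lat <= f.lat <= max_lat
    )


def count_with_tag(features, key):
    """Non-spatial query: how many features carry the given tag key?"""
    return sum(1 for f in features if key in f.tags)


# Toy data with illustrative coordinates, not real OSM records.
features = [
    Feature(-0.1276, 51.5072, {"shop": "bakery"}),
    Feature(-0.1300, 51.5100, {"amenity": "cafe"}),
    Feature(-2.2426, 53.4808, {"shop": "supermarket"}),
    Feature(-3.1883, 55.9533),
]

print(count_in_bbox(features, -0.5, 51.3, 0.3, 51.7))
print(count_with_tag(features, "shop"))
```

A spatial index lets the bounding-box query skip most of the data, whereas the tag count has to touch every feature unless a separate index on tags exists.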
![Page 21: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/21.jpg)
| | Uncompressed XML | Compressed XML | PBF |
|---|---|---|---|
| UK OSM data | 17GB | 1.2GB | 765MB |
![Page 22: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/22.jpg)
Results – Loading Data

Hadoop (cluster):

| File & size | Cluster | Time |
|---|---|---|
| UK PBF* (765MB) | 4 high memory nodes | 37m |
| UK XML (17GB) | 4 high memory nodes | 75.5m |
| UK XML BZ2+ (1.2GB) | 4 high memory nodes | 66m |
| Europe PBF (15.7GB) | 8 high memory nodes | 246m |
| Europe XML (345GB) | Not undertaken – too big to process | |
| Europe XML BZ2 (24GB) | 8 high memory nodes | Did not complete |
| Europe PBF (15.7GB) | 16 high memory nodes | 143m |
| Europe XML | Not undertaken – too big to process | |
| Europe XML BZ2 (24GB) | 16 high memory nodes | Did not complete |

* PBF: Protocolbuffer Binary Format, the OSM binary format
+ without taking into account decompression time – c. 7 minutes

Overpass (single machine):

| File & size | Cluster | Time |
|---|---|---|
| UK XML BZ2 (1.2GB) | 1 x 8 core machine (52GB RAM) | 17m |
| Europe XML BZ2 (24GB) | 1 x 16 core machine (104GB RAM) | 578m |
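The Europe PBF load times reported above (246m on 8 nodes, 143m on 16 nodes) allow a quick check of how well loading scaled when the cluster doubled. The sketch below is simple arithmetic on those two figures; the function name is hypothetical, and it assumes the two runs differed only in node count.

```python
def scaling_efficiency(t_small, n_small, t_large, n_large):
    """Return (speedup, efficiency) when growing a cluster from n_small to n_large nodes."""
    speedup = t_small / t_large          # how much faster the larger cluster was
    ideal = n_large / n_small            # perfect linear scaling
    return speedup, speedup / ideal


# Europe PBF (15.7GB): 246 minutes on 8 nodes, 143 minutes on 16 nodes.
speedup, efficiency = scaling_efficiency(246, 8, 143, 16)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

On these figures, doubling the nodes gave roughly a 1.7x speedup rather than the ideal 2x.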
![Page 23: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/23.jpg)
Results – Querying Data

| Index type | Time Taken |
|---|---|
| Grid | 75m |
| R-tree | 81m |
| Quad-tree | 56m |

| Operation | Cluster config | Cluster Time | Standalone config | Standalone time |
|---|---|---|---|---|
| Europe data small bounding box | 8 x 2-core high memory nodes (13GB RAM) | Grid: 50s; R-tree: 25s; Q-tree: 6s | 1 x 16 core machine (104GB RAM) | <1s |
| Europe data medium bounding box | 8 x 2-core high memory nodes (13GB RAM) | Grid: 85s; R-tree: 141s; Q-tree: 12s | 1 x 16 core machine (104GB RAM) | 4s |
| Europe data large bounding box (1°²) | 8 x 2-core high memory nodes (13GB RAM) | Grid: 91m; R-tree: 83s; Q-tree: 56s | 1 x 16 core machine (104GB RAM) | 39s |
| Europe data huge bounding box (3°²) | 8 x 2-core high memory nodes (13GB RAM) | Only attempted with Q-tree: 88s | 1 x 16 core machine (104GB RAM) | Out of memory |
| Shops query | 8 x 2-core high memory nodes (13GB RAM) | 729s | 1 x 16 core machine (104GB RAM) | 349s (but also got out of mem errors) |
| Shops query after indexing | 8 x 2-core high memory nodes (13GB RAM) | 40s BUT… indexing took 714 seconds! | | |
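The shops query figures raise an obvious question: how many queries are needed before the 714 seconds spent indexing pays for itself? The sketch below works this out from the three cluster timings on the slide (729s unindexed, 40s indexed, 714s to build the index). It assumes the index is built once and reused unchanged for every later query; the function name is hypothetical.

```python
import math


def break_even_queries(index_cost_s, unindexed_s, indexed_s):
    """Number of queries at which indexed and unindexed total times are equal."""
    saving_per_query = unindexed_s - indexed_s
    if saving_per_query <= 0:
        return None  # the index never pays off
    return index_cost_s / saving_per_query


n = break_even_queries(714, 729, 40)
# First whole number of queries for which the indexed route is cheaper overall.
first_winning = math.floor(n) + 1
print(f"break-even at {n:.2f} queries; indexing wins from query {first_winning}")
```

For a one-off query the index costs slightly more than it saves, but it is already ahead by the second query.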
![Page 24: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/24.jpg)
Results – Reading & Writing Data
• Used HBase and Jython, but did not have time to implement spatial indexes

| Operation | Cluster configuration | Cluster Time | Standalone configuration | Standalone time |
|---|---|---|---|---|
| Data loading England PBF (610MB) | 8 x 2-core high memory nodes (13GB RAM each) | 30m | 1 x 16 core machine (104GB RAM) | 527m |
| Data retrieval (small town) | 8 x 2-core high memory nodes (13GB RAM each) | | 1 x 16 core machine (104GB RAM) | 3s |
| Data retrieval (large town) | 8 x 2-core high memory nodes (13GB RAM each) | | 1 x 16 core machine (104GB RAM) | 113s |
| Data retrieval (city) | 8 x 2-core high memory nodes (13GB RAM each) | | 1 x 16 core machine (104GB RAM) | Did not complete (> 300s and 50,000 nodes) |
![Page 25: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/25.jpg)
Conclusions
• It’s possible to replicate much of what OSM requires in Hadoop
• OpenStreetMap is growing quickly, but it is a long way from requiring horizontal sharding of databases
• In general, it is not quicker to run geographic queries in a cluster at the TB order of magnitude (at least with current OSM tools)
• Indexes do significantly speed up geographic queries (Quad-tree seems to be the best)
• There is a high barrier to entry (technical & cost) for Hadoop and its ecosystem that will make it difficult for OSM to adopt the technology
• OSM should also consider parallel databases if they do have a requirement to scale out, as there is less mismatch with their current system
• Spatial extensions to big data platforms are relatively immature, but there is huge potential there to do data analytics on massive datasets and gain new insights
• I’ve learnt a lot personally!
![Page 26: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/26.jpg)
Experiences with Azure
+ Easy to use – click to deploy
+ Good free trial program
+ Good integration with storage
- Less customisable
- It was impossible to deploy >= 8 node clusters (rate limits?) so I gave up
- Technical support was responsive but not especially helpful
![Page 27: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/27.jpg)
Experiences with Google Cloud
+ Already had Hortonworks Hadoop distribution automated setup
+ Easy to customise – everything on GitHub.
+ Uses a standard setup (Ambari)
- Not always reliable
- Free trial was quite limited
- More difficult to connect with Google Storage buckets
- A bit more work to deploy the solution, as it is code-based and you have to download a 3rd party tool (gcloud)
![Page 28: Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox, digital.Arup](https://reader036.vdocuments.us/reader036/viewer/2022062522/58808b791a28ab35718b6a4b/html5/thumbnails/28.jpg)
General Hadoop experiences
• Choosing the correct tool can be a significant part of the problem
• Setting up Hadoop clusters is hard!
• Spatial Big Data is still a little niche (although I did get lots of help)
• Running Hadoop jobs (even with Pig) is hard!
• Trial and error to experiment with memory requirements
• Size of files is a real barrier (especially when you are paying!)
• Often jobs failed half way through
• Debugging is not easy
• Have to recompile Java whenever there is a change (and sometimes deploy to nodes)