geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

A GEO-stack for Big DataDriving Spatial Analysis Beyond the Limits of Traditional Storage

Joana Simoes 1

1Bdigital, CASA, CICS.NOVA

March 12, 2015

1 / 62



Big Data Word Cloud; source:http://olap.com/big-data/

2 / 62

http://olap.com/big-data/



Table of Contents

1 The Value of Data

2 The Big Data Revolution

3 An Use Case

4 Final Remarks

3 / 62



Warning

\* This presentation may contain tech talk, such as: databases,clusters, NoSQL, cloud, parallelism, scalability.If you are susceptible to these concepts, you may want to leave theroom now.From this point beyond, you are at your own risk! \*

4 / 62



The Value of Data

Not a New Idea!Matthew Fontaine Maury(1806-1873).

Foresaw the hidden value oncaptain’s ships logs, whenanalysed collectively.Used time series data tocarry out analysis that wouldenable him to recomendoptimal shipping routes.

5 / 62



The Value of Data

Not a New Idea!Matthew Fontaine Maury(1806-1873).Foresaw the hidden value oncaptain’s ships logs, whenanalysed collectively.Used time series data tocarry out analysis that wouldenable him to recomendoptimal shipping routes.

6 / 62



The Value of DataData Mining, Open-source and Crowd-sourcing

In 1848, Captain Jackson was the the first person to try the routerecommended by Maury, and as a result he was able to save 17 dayson his outbond trip.Apart from collecting existing logs, Maury encouraged the collectionof more regular and systematic time series, by creating a template.Collected Data: longitude, latitude, currents, magnetic variation, airand water temperature, general wind direction, etc.

7 / 62




In 1848, Captain Jackson was the the first person to try the routerecommended by Maury, and as a result he was able to save 17 dayson his outbond trip.

Apart from collecting existing logs, Maury encouraged the collectionof more regular and systematic time series, by creating a template.Collected Data: longitude, latitude, currents, magnetic variation, airand water temperature, general wind direction, etc.

8 / 62




In 1848, Captain Jackson was the the first person to try the routerecommended by Maury, and as a result he was able to save 17 dayson his outbond trip.Apart from collecting existing logs, Maury encouraged the collectionof more regular and systematic time series, by creating a template.

Collected Data: longitude, latitude, currents, magnetic variation, airand water temperature, general wind direction, etc.

9 / 62




In 1848, Captain Jackson was the the first person to try the routerecommended by Maury, and as a result he was able to save 17 dayson his outbond trip.Apart from collecting existing logs, Maury encouraged the collectionof more regular and systematic time series, by creating a template.Collected Data: longitude, latitude, currents, magnetic variation, airand water temperature, general wind direction, etc.

10 / 62



Data Analysis

A multidisciplinary field

11 / 62



Data Analysis

Traditional StackSpreadsheets (e.g.: Excel, OpenOffice)RDBMS (e.g.: Oracle, PostgreSQL, MySQL)Statistical Packages (e.g.: R, Matlab)GIS Packages (e.g.: QGIS)Scripting and programming languages (e.g.: Python, Java)Libraries

12 / 62



Big Data

What changed in recent years?Differences in the way global business and transportation aredone have exploded the volume of traditional data sources.Widespread increase in data from sensors.

13 / 62



Big Data

Smart Citizen Project

Platform to generateparticipatory processes ofpeople in the cities, byconnecting data, people andknowledge.

Based on geolocation, Internetand free hardware and softwarefor data collection and sharing.

Relies on the production ofobjects to connect people withtheir environment and theircity.

https://smartcitizen.me/

14 / 62




Big Data






15 / 62




Big Data






16 / 62




Big Data






17 / 62




Big Data

A great deal of this data is actually geo-located (e.g.: satellitenavigate coordinates, ip addresses).Geography has finally the opportunity to switch from being basedon guesses and samples, to become a truly data-driven science.

Based on reports, Fillipo Arrieta published mapsdescribing a multi-stage containment plan designed tolimit the plague in Bari (1864).

The iLab used the ushahidi platform to collect and displaycrowd-sorcing information about Ebola in Liberia (2014).

18 / 62



Big Data

A great deal of this data is actually geo-located (e.g.: satellitenavigate coordinates, ip addresses).

Geography has finally the opportunity to switch from being basedon guesses and samples, to become a truly data-driven science.



19 / 62



Big Data

A great deal of this data is actually geo-located (e.g.: satellitenavigate coordinates, ip addresses).Geography has finally the opportunity to switch from being basedon guesses and samples, to become a truly data-driven science.



20 / 62



Big DataWhat characterizes Big Data?

Volume: Large amounts of data.Variety: Different types of structured, unstructured, andmulti-structured data.Velocity: Needs to be analyzed quickly.

21 / 62




Volume: Large amounts of data.

Variety: Different types of structured, unstructured, andmulti-structured data.Velocity: Needs to be analyzed quickly.

22 / 62




Volume: Large amounts of data.Variety: Different types of structured, unstructured, andmulti-structured data.

Velocity: Needs to be analyzed quickly.

23 / 62




Volume: Large amounts of data.Variety: Different types of structured, unstructured, andmulti-structured data.Velocity: Needs to be analyzed quickly.

24 / 62



Technological Challenges

These characteristics map into challenges:ScalabilityHeterogeneityLow latency

When the traditional stack is no longer enough, a paradigm shift isrequired.

25 / 62



RDBMS vs NoSQL

NoSQL databases trade away some capabilities of relationaldatabases (SQL), in order to improve scalability.Advantages: NoSQL databases are simpler, can handlesemi-structured and denormalized data and have an higherscalability.Disadvantages: loss of abstraction provided by the queryoptimizer, that increases the complexity of the applications.Recently, tools were developed that bring back the full powerof SQL language to the NoSQL ecosystem (e.g.: Apache Drill,Hive).

26 / 62



RDBMS vs NoSQL

NoSQL databases trade away some capabilities of relationaldatabases (SQL), in order to improve scalability.

Advantages: NoSQL databases are simpler, can handlesemi-structured and denormalized data and have an higherscalability.Disadvantages: loss of abstraction provided by the queryoptimizer, that increases the complexity of the applications.Recently, tools were developed that bring back the full powerof SQL language to the NoSQL ecosystem (e.g.: Apache Drill,Hive).

27 / 62



RDBMS vs NoSQL

NoSQL databases trade away some capabilities of relationaldatabases (SQL), in order to improve scalability.Advantages: NoSQL databases are simpler, can handlesemi-structured and denormalized data and have an higherscalability.

Disadvantages: loss of abstraction provided by the queryoptimizer, that increases the complexity of the applications.Recently, tools were developed that bring back the full powerof SQL language to the NoSQL ecosystem (e.g.: Apache Drill,Hive).

28 / 62



RDBMS vs NoSQL

NoSQL databases trade away some capabilities of relationaldatabases (SQL), in order to improve scalability.Advantages: NoSQL databases are simpler, can handlesemi-structured and denormalized data and have an higherscalability.Disadvantages: loss of abstraction provided by the queryoptimizer, that increases the complexity of the applications.

Recently, tools were developed that bring back the full powerof SQL language to the NoSQL ecosystem (e.g.: Apache Drill,Hive).

29 / 62



RDBMS vs NoSQL

NoSQL databases trade away some capabilities of relationaldatabases (SQL), in order to improve scalability.Advantages: NoSQL databases are simpler, can handlesemi-structured and denormalized data and have an higherscalability.Disadvantages: loss of abstraction provided by the queryoptimizer, that increases the complexity of the applications.Recently, tools were developed that bring back the full powerof SQL language to the NoSQL ecosystem (e.g.: Apache Drill,Hive).

30 / 62



RDBMS vs NoSQL

31 / 62



Examples

32 / 62



MapReduce

Programming model for processing and generating large data setswith a parallel, distributed algorithm on a cluster.

Map(): performs filtering and sorting.Reduce(): performs a summary operation.

The framework coordinates the processing, by marshalling thedistributed servers, running the tasks and parallel andmanaging all communications and data between the variousparts of the system.There are many libraries that implement MapReduce.

33 / 62



MapReduce


Map(): performs filtering and sorting.Reduce(): performs a summary operation.The framework coordinates the processing, by marshalling thedistributed servers, running the tasks and parallel andmanaging all communications and data between the variousparts of the system.

There are many libraries that implement MapReduce.

34 / 62



MapReduce


Map(): performs filtering and sorting.Reduce(): performs a summary operation.The framework coordinates the processing, by marshalling thedistributed servers, running the tasks and parallel andmanaging all communications and data between the variousparts of the system.There are many libraries that implement MapReduce.

35 / 62



A Big Data Approach

Does this mean we have to throw away our traditional toolsand methods?

First ensure that you really have, or will have Big Data atsome point in the future.Then identify the stages of the workflow that are bottlenecks,in terms of the current technologies.It is possible to mix & and match.

36 / 62



A Big Data Approach


First ensure that you really have, or will have Big Data atsome point in the future.

Then identify the stages of the workflow that are bottlenecks,in terms of the current technologies.It is possible to mix & and match.

37 / 62



A Big Data Approach


First ensure that you really have, or will have Big Data atsome point in the future.Then identify the stages of the workflow that are bottlenecks,in terms of the current technologies.

It is possible to mix & and match.

38 / 62



A Big Data Approach


First ensure that you really have, or will have Big Data atsome point in the future.Then identify the stages of the workflow that are bottlenecks,in terms of the current technologies.It is possible to mix & and match.

39 / 62



Analysing geo-located Tweets

The purpose of this use case was to analyse the stream ofgeo-located Tweets, as a sensor for citizen presence.Number of tweets in Catalunya in around 3 months: +- 6million.Continuous stream of data.This amount of data is not easily assimilated by the”human-eye”, so we decided to create clusters of Tweets.

40 / 62



Analysing geo-located TweetsThe purpose of this use case was to analyse the stream ofgeo-located Tweets, as a sensor for citizen presence.

Number of tweets in Catalunya in around 3 months: +- 6million.Continuous stream of data.This amount of data is not easily assimilated by the”human-eye”, so we decided to create clusters of Tweets.

41 / 62



Analysing geo-located TweetsThe purpose of this use case was to analyse the stream ofgeo-located Tweets, as a sensor for citizen presence.Number of tweets in Catalunya in around 3 months: +- 6million.

Continuous stream of data.This amount of data is not easily assimilated by the”human-eye”, so we decided to create clusters of Tweets.

42 / 62



Analysing geo-located TweetsThe purpose of this use case was to analyse the stream ofgeo-located Tweets, as a sensor for citizen presence.Number of tweets in Catalunya in around 3 months: +- 6million.Continuous stream of data.

This amount of data is not easily assimilated by the”human-eye”, so we decided to create clusters of Tweets.

43 / 62



Analysing geo-located TweetsThe purpose of this use case was to analyse the stream ofgeo-located Tweets, as a sensor for citizen presence.Number of tweets in Catalunya in around 3 months: +- 6million.Continuous stream of data.This amount of data is not easily assimilated by the”human-eye”, so we decided to create clusters of Tweets.

44 / 62



An Use Case

Clustering

It is a descriptive data mining technique, often used fordimensionality reduction.It groups a set of objects in such a way that objects in thesame group are more similar to each other, than to object inother groups.Strictly, it corresponds to a family of unsupervised machinelearning algorithms.We wanted to implement this concept using only Hadoop, andapply it to spatial attributes.

45 / 62



An Use Case

ClusteringIt is a descriptive data mining technique, often used fordimensionality reduction.

It groups a set of objects in such a way that objects in thesame group are more similar to each other, than to object inother groups.Strictly, it corresponds to a family of unsupervised machinelearning algorithms.We wanted to implement this concept using only Hadoop, andapply it to spatial attributes.

46 / 62



An Use Case

ClusteringIt is a descriptive data mining technique, often used fordimensionality reduction.It groups a set of objects in such a way that objects in thesame group are more similar to each other, than to object inother groups.

Strictly, it corresponds to a family of unsupervised machinelearning algorithms.We wanted to implement this concept using only Hadoop, andapply it to spatial attributes.

47 / 62



An Use Case

ClusteringIt is a descriptive data mining technique, often used fordimensionality reduction.It groups a set of objects in such a way that objects in thesame group are more similar to each other, than to object inother groups.Strictly, it corresponds to a family of unsupervised machinelearning algorithms.

We wanted to implement this concept using only Hadoop, andapply it to spatial attributes.

48 / 62



An Use Case

ClusteringIt is a descriptive data mining technique, often used fordimensionality reduction.It groups a set of objects in such a way that objects in thesame group are more similar to each other, than to object inother groups.Strictly, it corresponds to a family of unsupervised machinelearning algorithms.We wanted to implement this concept using only Hadoop, andapply it to spatial attributes.

49 / 62



Technological Stack

50 / 62



Workflow

51 / 62



ResultsUsing this workflow we were able to turn the original point cloud,into a density grid, and then into clusters.

Clusters expose patterns of presence over time, and can help us tore-define boundaries for the city that are data-driven, rather thanadministrative.

52 / 62



Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.Since it is a niche, this is even more true for spatialtechnologies.There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

53 / 62



Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.

We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.Since it is a niche, this is even more true for spatialtechnologies.There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

54 / 62



Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.

Since it is a niche, this is even more true for spatialtechnologies.There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

55 / 62



Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.Since it is a niche, this is even more true for spatialtechnologies.

There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

56 / 62



Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.Since it is a niche, this is even more true for spatialtechnologies.There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.

this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

57 / 62



Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.Since it is a niche, this is even more true for spatialtechnologies.There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

58 / 62



AcknowledgementsI would like to thank Ellen Friedman (MapR, Apache Mahout,Apache Drill), for her inspiring work on Big Data and MachineLearning.

59 / 62



References

Aji, A., Wang, F., Vo, H., Lee, R., Liu, Q., Zhang, X. andSaltz, J.” Hadoop-GIS: A High Performance Spatial DataWarehousing System over MapReduce”. Proceedings of theVLDB Endowment International Conference on Very LargeData Bases, 6(11), p1009. (2013)Cuesta, H. ”Practical Data Analysis”. Packt Publishing (2013)Dunning, T. and Friedman, E. ”Time Series Databases: NewWays to Store and Access Data”. O’Reilly Media; 1 edition(October, 2014)Myatt, G and Johnson, W. ”Making Sense Of Data I: APractical Guide to Exploratory Data Analysis and DataMining”. O’Reilly : 2 edition (2014).

60 / 62



Thank You!

This presentation is available at:http://tinyurl.com/pcmgzxp

Next:

9as Jornadas de SIG Libre - 26th-27th March 2015, Girona.

61 / 62

http://tinyurl.com/pcmgzxp



Thank You!

This presentation is available at:http://tinyurl.com/pcmgzxp

Next:9as Jornadas de SIG Libre - 26th-27th March 2015, Girona.

62 / 62

http://tinyurl.com/pcmgzxp

geostack

Documents