geostack

62
The Value of Data The Big Data Revolution An Use Case Final Remarks A GEO-stack for Big Data Driving Spatial Analysis Beyond the Limits of Traditional Storage Joana Sim˜ oes 1 1 Bdigital, CASA, CICS.NOVA March 12, 2015 1 / 62

Upload: joana-simoes

Post on 16-Apr-2017

166 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

A GEO-stack for Big DataDriving Spatial Analysis Beyond the Limits of Traditional Storage

Joana Simoes 1

1Bdigital, CASA, CICS.NOVA

March 12, 2015

1 / 62

Page 2: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big Data Word Cloud; source:http://olap.com/big-data/

2 / 62

Page 3: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Table of Contents

1 The Value of Data

2 The Big Data Revolution

3 An Use Case

4 Final Remarks

3 / 62

Page 4: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Warning

\* This presentation may contain tech talk, such as: databases,clusters, NoSQL, cloud, parallelism, scalability.If you are susceptible to these concepts, you may want to leave theroom now.From this point beyond, you are at your own risk! \*

4 / 62

Page 5: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

The Value of Data

Not a New Idea!Matthew Fontaine Maury(1806-1873).

Foresaw the hidden value oncaptain’s ships logs, whenanalysed collectively.Used time series data tocarry out analysis that wouldenable him to recomendoptimal shipping routes.

5 / 62

Page 6: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

The Value of Data

Not a New Idea!Matthew Fontaine Maury(1806-1873).Foresaw the hidden value oncaptain’s ships logs, whenanalysed collectively.Used time series data tocarry out analysis that wouldenable him to recomendoptimal shipping routes.

6 / 62

Page 7: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

The Value of DataData Mining, Open-source and Crowd-sourcing

In 1848, Captain Jackson was the the first person to try the routerecommended by Maury, and as a result he was able to save 17 dayson his outbond trip.Apart from collecting existing logs, Maury encouraged the collectionof more regular and systematic time series, by creating a template.Collected Data: longitude, latitude, currents, magnetic variation, airand water temperature, general wind direction, etc.

7 / 62

Page 8: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

The Value of DataData Mining, Open-source and Crowd-sourcing

In 1848, Captain Jackson was the the first person to try the routerecommended by Maury, and as a result he was able to save 17 dayson his outbond trip.

Apart from collecting existing logs, Maury encouraged the collectionof more regular and systematic time series, by creating a template.Collected Data: longitude, latitude, currents, magnetic variation, airand water temperature, general wind direction, etc.

8 / 62

Page 9: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

The Value of DataData Mining, Open-source and Crowd-sourcing

In 1848, Captain Jackson was the the first person to try the routerecommended by Maury, and as a result he was able to save 17 dayson his outbond trip.Apart from collecting existing logs, Maury encouraged the collectionof more regular and systematic time series, by creating a template.

Collected Data: longitude, latitude, currents, magnetic variation, airand water temperature, general wind direction, etc.

9 / 62

Page 10: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

The Value of DataData Mining, Open-source and Crowd-sourcing

In 1848, Captain Jackson was the the first person to try the routerecommended by Maury, and as a result he was able to save 17 dayson his outbond trip.Apart from collecting existing logs, Maury encouraged the collectionof more regular and systematic time series, by creating a template.Collected Data: longitude, latitude, currents, magnetic variation, airand water temperature, general wind direction, etc.

10 / 62

Page 11: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Data Analysis

A multidisciplinary field

11 / 62

Page 12: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Data Analysis

Traditional StackSpreadsheets (e.g.: Excel, OpenOffice)RDBMS (e.g.: Oracle, PostgreSQL, MySQL)Statistical Packages (e.g.: R, Matlab)GIS Packages (e.g.: QGIS)Scripting and programming languages (e.g.: Python, Java)Libraries

12 / 62

Page 13: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big Data

What changed in recent years?Differences in the way global business and transportation aredone have exploded the volume of traditional data sources.Widespread increase in data from sensors.

13 / 62

Page 14: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big Data

Smart Citizen Project

Platform to generateparticipatory processes ofpeople in the cities, byconnecting data, people andknowledge.

Based on geolocation, Internetand free hardware and softwarefor data collection and sharing.

Relies on the production ofobjects to connect people withtheir environment and theircity.

https://smartcitizen.me/

14 / 62

Page 15: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big Data

Smart Citizen Project

Platform to generateparticipatory processes ofpeople in the cities, byconnecting data, people andknowledge.

Based on geolocation, Internetand free hardware and softwarefor data collection and sharing.

Relies on the production ofobjects to connect people withtheir environment and theircity.

https://smartcitizen.me/

15 / 62

Page 16: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big Data

Smart Citizen Project

Platform to generateparticipatory processes ofpeople in the cities, byconnecting data, people andknowledge.

Based on geolocation, Internetand free hardware and softwarefor data collection and sharing.

Relies on the production ofobjects to connect people withtheir environment and theircity.

https://smartcitizen.me/

16 / 62

Page 17: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big Data

Smart Citizen Project

Platform to generateparticipatory processes ofpeople in the cities, byconnecting data, people andknowledge.

Based on geolocation, Internetand free hardware and softwarefor data collection and sharing.

Relies on the production ofobjects to connect people withtheir environment and theircity.

https://smartcitizen.me/

17 / 62

Page 18: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big Data

A great deal of this data is actually geo-located (e.g.: satellitenavigate coordinates, ip addresses).Geography has finally the opportunity to switch from being basedon guesses and samples, to become a truly data-driven science.

Based on reports, Fillipo Arrieta published mapsdescribing a multi-stage containment plan designed tolimit the plague in Bari (1864).

The iLab used the ushahidi platform to collect and displaycrowd-sorcing information about Ebola in Liberia (2014).

18 / 62

Page 19: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big Data

A great deal of this data is actually geo-located (e.g.: satellitenavigate coordinates, ip addresses).

Geography has finally the opportunity to switch from being basedon guesses and samples, to become a truly data-driven science.

Based on reports, Fillipo Arrieta published mapsdescribing a multi-stage containment plan designed tolimit the plague in Bari (1864).

The iLab used the ushahidi platform to collect and displaycrowd-sorcing information about Ebola in Liberia (2014).

19 / 62

Page 20: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big Data

A great deal of this data is actually geo-located (e.g.: satellitenavigate coordinates, ip addresses).Geography has finally the opportunity to switch from being basedon guesses and samples, to become a truly data-driven science.

Based on reports, Fillipo Arrieta published mapsdescribing a multi-stage containment plan designed tolimit the plague in Bari (1864).

The iLab used the ushahidi platform to collect and displaycrowd-sorcing information about Ebola in Liberia (2014).

20 / 62

Page 21: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big DataWhat characterizes Big Data?

Volume: Large amounts of data.Variety: Different types of structured, unstructured, andmulti-structured data.Velocity: Needs to be analyzed quickly.

21 / 62

Page 22: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big DataWhat characterizes Big Data?

Volume: Large amounts of data.

Variety: Different types of structured, unstructured, andmulti-structured data.Velocity: Needs to be analyzed quickly.

22 / 62

Page 23: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big DataWhat characterizes Big Data?

Volume: Large amounts of data.Variety: Different types of structured, unstructured, andmulti-structured data.

Velocity: Needs to be analyzed quickly.

23 / 62

Page 24: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Big DataWhat characterizes Big Data?

Volume: Large amounts of data.Variety: Different types of structured, unstructured, andmulti-structured data.Velocity: Needs to be analyzed quickly.

24 / 62

Page 25: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Technological Challenges

These characteristics map into challenges:ScalabilityHeterogeneityLow latency

When the traditional stack is no longer enough, a paradigm shift isrequired.

25 / 62

Page 26: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

RDBMS vs NoSQL

NoSQL databases trade away some capabilities of relationaldatabases (SQL), in order to improve scalability.Advantages: NoSQL databases are simpler, can handlesemi-structured and denormalized data and have an higherscalability.Disadvantages: loss of abstraction provided by the queryoptimizer, that increases the complexity of the applications.Recently, tools were developed that bring back the full powerof SQL language to the NoSQL ecosystem (e.g.: Apache Drill,Hive).

26 / 62

Page 27: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

RDBMS vs NoSQL

NoSQL databases trade away some capabilities of relationaldatabases (SQL), in order to improve scalability.

Advantages: NoSQL databases are simpler, can handlesemi-structured and denormalized data and have an higherscalability.Disadvantages: loss of abstraction provided by the queryoptimizer, that increases the complexity of the applications.Recently, tools were developed that bring back the full powerof SQL language to the NoSQL ecosystem (e.g.: Apache Drill,Hive).

27 / 62

Page 28: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

RDBMS vs NoSQL

NoSQL databases trade away some capabilities of relationaldatabases (SQL), in order to improve scalability.Advantages: NoSQL databases are simpler, can handlesemi-structured and denormalized data and have an higherscalability.

Disadvantages: loss of abstraction provided by the queryoptimizer, that increases the complexity of the applications.Recently, tools were developed that bring back the full powerof SQL language to the NoSQL ecosystem (e.g.: Apache Drill,Hive).

28 / 62

Page 29: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

RDBMS vs NoSQL

NoSQL databases trade away some capabilities of relationaldatabases (SQL), in order to improve scalability.Advantages: NoSQL databases are simpler, can handlesemi-structured and denormalized data and have an higherscalability.Disadvantages: loss of abstraction provided by the queryoptimizer, that increases the complexity of the applications.

Recently, tools were developed that bring back the full powerof SQL language to the NoSQL ecosystem (e.g.: Apache Drill,Hive).

29 / 62

Page 30: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

RDBMS vs NoSQL

NoSQL databases trade away some capabilities of relationaldatabases (SQL), in order to improve scalability.Advantages: NoSQL databases are simpler, can handlesemi-structured and denormalized data and have an higherscalability.Disadvantages: loss of abstraction provided by the queryoptimizer, that increases the complexity of the applications.Recently, tools were developed that bring back the full powerof SQL language to the NoSQL ecosystem (e.g.: Apache Drill,Hive).

30 / 62

Page 31: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

RDBMS vs NoSQL

31 / 62

Page 32: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Examples

32 / 62

Page 33: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

MapReduce

Programming model for processing and generating large data setswith a parallel, distributed algorithm on a cluster.

Map(): performs filtering and sorting.Reduce(): performs a summary operation.

The framework coordinates the processing, by marshalling thedistributed servers, running the tasks and parallel andmanaging all communications and data between the variousparts of the system.There are many libraries that implement MapReduce.

33 / 62

Page 34: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

MapReduce

Programming model for processing and generating large data setswith a parallel, distributed algorithm on a cluster.

Map(): performs filtering and sorting.Reduce(): performs a summary operation.The framework coordinates the processing, by marshalling thedistributed servers, running the tasks and parallel andmanaging all communications and data between the variousparts of the system.

There are many libraries that implement MapReduce.

34 / 62

Page 35: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

MapReduce

Programming model for processing and generating large data setswith a parallel, distributed algorithm on a cluster.

Map(): performs filtering and sorting.Reduce(): performs a summary operation.The framework coordinates the processing, by marshalling thedistributed servers, running the tasks and parallel andmanaging all communications and data between the variousparts of the system.There are many libraries that implement MapReduce.

35 / 62

Page 36: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

A Big Data Approach

Does this mean we have to throw away our traditional toolsand methods?

First ensure that you really have, or will have Big Data atsome point in the future.Then identify the stages of the workflow that are bottlenecks,in terms of the current technologies.It is possible to mix & and match.

36 / 62

Page 37: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

A Big Data Approach

Does this mean we have to throw away our traditional toolsand methods?

First ensure that you really have, or will have Big Data atsome point in the future.

Then identify the stages of the workflow that are bottlenecks,in terms of the current technologies.It is possible to mix & and match.

37 / 62

Page 38: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

A Big Data Approach

Does this mean we have to throw away our traditional toolsand methods?

First ensure that you really have, or will have Big Data atsome point in the future.Then identify the stages of the workflow that are bottlenecks,in terms of the current technologies.

It is possible to mix & and match.

38 / 62

Page 39: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

A Big Data Approach

Does this mean we have to throw away our traditional toolsand methods?

First ensure that you really have, or will have Big Data atsome point in the future.Then identify the stages of the workflow that are bottlenecks,in terms of the current technologies.It is possible to mix & and match.

39 / 62

Page 40: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Analysing geo-located Tweets

The purpose of this use case was to analyse the stream ofgeo-located Tweets, as a sensor for citizen presence.Number of tweets in Catalunya in around 3 months: +- 6million.Continuous stream of data.This amount of data is not easily assimilated by the”human-eye”, so we decided to create clusters of Tweets.

40 / 62

Page 41: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Analysing geo-located TweetsThe purpose of this use case was to analyse the stream ofgeo-located Tweets, as a sensor for citizen presence.

Number of tweets in Catalunya in around 3 months: +- 6million.Continuous stream of data.This amount of data is not easily assimilated by the”human-eye”, so we decided to create clusters of Tweets.

41 / 62

Page 42: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Analysing geo-located TweetsThe purpose of this use case was to analyse the stream ofgeo-located Tweets, as a sensor for citizen presence.Number of tweets in Catalunya in around 3 months: +- 6million.

Continuous stream of data.This amount of data is not easily assimilated by the”human-eye”, so we decided to create clusters of Tweets.

42 / 62

Page 43: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Analysing geo-located TweetsThe purpose of this use case was to analyse the stream ofgeo-located Tweets, as a sensor for citizen presence.Number of tweets in Catalunya in around 3 months: +- 6million.Continuous stream of data.

This amount of data is not easily assimilated by the”human-eye”, so we decided to create clusters of Tweets.

43 / 62

Page 44: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Analysing geo-located TweetsThe purpose of this use case was to analyse the stream ofgeo-located Tweets, as a sensor for citizen presence.Number of tweets in Catalunya in around 3 months: +- 6million.Continuous stream of data.This amount of data is not easily assimilated by the”human-eye”, so we decided to create clusters of Tweets.

44 / 62

Page 45: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

An Use Case

Clustering

It is a descriptive data mining technique, often used fordimensionality reduction.It groups a set of objects in such a way that objects in thesame group are more similar to each other, than to object inother groups.Strictly, it corresponds to a family of unsupervised machinelearning algorithms.We wanted to implement this concept using only Hadoop, andapply it to spatial attributes.

45 / 62

Page 46: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

An Use Case

ClusteringIt is a descriptive data mining technique, often used fordimensionality reduction.

It groups a set of objects in such a way that objects in thesame group are more similar to each other, than to object inother groups.Strictly, it corresponds to a family of unsupervised machinelearning algorithms.We wanted to implement this concept using only Hadoop, andapply it to spatial attributes.

46 / 62

Page 47: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

An Use Case

ClusteringIt is a descriptive data mining technique, often used fordimensionality reduction.It groups a set of objects in such a way that objects in thesame group are more similar to each other, than to object inother groups.

Strictly, it corresponds to a family of unsupervised machinelearning algorithms.We wanted to implement this concept using only Hadoop, andapply it to spatial attributes.

47 / 62

Page 48: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

An Use Case

ClusteringIt is a descriptive data mining technique, often used fordimensionality reduction.It groups a set of objects in such a way that objects in thesame group are more similar to each other, than to object inother groups.Strictly, it corresponds to a family of unsupervised machinelearning algorithms.

We wanted to implement this concept using only Hadoop, andapply it to spatial attributes.

48 / 62

Page 49: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

An Use Case

ClusteringIt is a descriptive data mining technique, often used fordimensionality reduction.It groups a set of objects in such a way that objects in thesame group are more similar to each other, than to object inother groups.Strictly, it corresponds to a family of unsupervised machinelearning algorithms.We wanted to implement this concept using only Hadoop, andapply it to spatial attributes.

49 / 62

Page 50: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Technological Stack

50 / 62

Page 51: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Workflow

51 / 62

Page 52: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

ResultsUsing this workflow we were able to turn the original point cloud,into a density grid, and then into clusters.

Clusters expose patterns of presence over time, and can help us tore-define boundaries for the city that are data-driven, rather thanadministrative.

52 / 62

Page 53: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.Since it is a niche, this is even more true for spatialtechnologies.There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

53 / 62

Page 54: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.

We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.Since it is a niche, this is even more true for spatialtechnologies.There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

54 / 62

Page 55: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.

Since it is a niche, this is even more true for spatialtechnologies.There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

55 / 62

Page 56: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.Since it is a niche, this is even more true for spatialtechnologies.

There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

56 / 62

Page 57: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.Since it is a niche, this is even more true for spatialtechnologies.There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.

this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

57 / 62

Page 58: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Final Remarks

It is ok to use non scalable tools, at certain points of theworkflow.We are working on the edge of existing technologies; somefunctions were not implemented yet, and bugs are common.Since it is a niche, this is even more true for spatialtechnologies.There are not ready-made solutions: the particular stack andworkflow should be compiled for the specific case study.this is one stack to solve this problem; it is not the only one,and it may not even be the ”best” one;

58 / 62

Page 59: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

AcknowledgementsI would like to thank Ellen Friedman (MapR, Apache Mahout,Apache Drill), for her inspiring work on Big Data and MachineLearning.

59 / 62

Page 60: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

References

Aji, A., Wang, F., Vo, H., Lee, R., Liu, Q., Zhang, X. andSaltz, J.” Hadoop-GIS: A High Performance Spatial DataWarehousing System over MapReduce”. Proceedings of theVLDB Endowment International Conference on Very LargeData Bases, 6(11), p1009. (2013)Cuesta, H. ”Practical Data Analysis”. Packt Publishing (2013)Dunning, T. and Friedman, E. ”Time Series Databases: NewWays to Store and Access Data”. O’Reilly Media; 1 edition(October, 2014)Myatt, G and Johnson, W. ”Making Sense Of Data I: APractical Guide to Exploratory Data Analysis and DataMining”. O’Reilly : 2 edition (2014).

60 / 62

Page 61: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Thank You!

This presentation is available at:http://tinyurl.com/pcmgzxp

Next:

9as Jornadas de SIG Libre - 26th-27th March 2015, Girona.

61 / 62

Page 62: geostack

The Value of DataThe Big Data Revolution

An Use CaseFinal Remarks

Thank You!

This presentation is available at:http://tinyurl.com/pcmgzxp

Next:9as Jornadas de SIG Libre - 26th-27th March 2015, Girona.

62 / 62