bigdata - the practice€¦ · • hbase - columnar keyvalue database • cascading - flow data...

38
BigData - The Practice Mansour Raad http://thunderheadxpler.blogspot.com/ [email protected] @mraad February 10–11, 2014 | Washington DC Federal GIS Conference 2014

Upload: others

Post on 24-Sep-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

BigData - The PracticeMansour Raad

http://thunderheadxpler.blogspot.com/ [email protected]

@mraad

February 10–11, 2014 | Washington DC

Federal GIS Conference 2014

Page 2: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

On Today’s Todo List:

• Run into store • Frantically ask “What year is it ?” • When they reply • Yell “It Works, because of BigData !” • And run out

Page 3: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS
Page 4: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

MapReduce

Commodity Servers

Hadoop Basic Stack

Page 5: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

The Zoo

• Hive - Ad Hoc Query - “SQL” to MapReduce • Pig - High Level Data Analysis Language • Impala - MPP SQL Engine • Mahout - Machine Learning Toolbox • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management

Page 6: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS
Page 7: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

GIS Tools For Hadoop

• Geometry API • Point / Line / Polygon • Operations - Contains, Intersect, Buffer • I/O - WKT, GeoJSON, Shape

• Hive Spatial UDF • ST_POINT, ST_CONTAINS

• GeoProcessing Extensions

Page 8: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Cloudera Quick Start VM

Page 9: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Hello, MapReduce !

Page 10: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Density Analysis - Cell Count

Page 11: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

MapReduce Recap

• Map • Extract • Filter • Transform

• Reduce • Group By • Aggregate

Page 12: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Cell Count

function map(lineno,text) { (x,y) = tokenize(text) if(inGrid(x,y)){ (cellX,cellY) = toCell(x,y) emit((cellX,cellY),1) } }

function reduce((cellX,cellY),iterator){ sum = 0 for( one in iterator){ sum = sum + one } emit((cellX,cellY), sum) }

Page 13: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

In Action Demo

Page 14: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

MapReduce Is Hard

Page 15: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Thinking Of Data As Water

Page 16: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS
Page 17: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

To Cell

GroupBy

count

X,Y Collection

Cell Count

RM

SourceSink

Filter

Cascading Pipeline

Page 18: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS
Page 19: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Cascading In Action

Page 20: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

How About No Programming ? What About SQL ?

Page 21: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Hive and Impala

Page 22: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

drop table if exists zipcodes;

create external table if not exists zipcodes( id int, lon double, lat double ) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile location '/user/cloudera/zipcodes';

Page 23: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Cell Density in SQL

SELECT T.X-180+0.5 AS LON,T.Y-90+0.5 AS LAT,COUNT(*) AS POPULATION FROM ( SELECT FLOOR(LON+180) AS X,FLOOR(LAT+90) AS Y FROM ZIPCODES) T GROUP BY T.X,T.Y;

Page 24: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Hive and Impala In Action

Page 25: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

In Memory Spatial Index

Page 26: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

In Memory Spatial Index

• Geometry API in GIS Tools For Hadoop • new SpatialIndex( new Envelope2D(), depth); • insert( new Envelope2D(), id) • iterator = query( new Envelope2D()) • Use in mapper in “small” spatial joins

Page 27: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Spatial Index In Action

Page 28: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

ArcGIS Desktop and Hadoop

Page 29: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS
Page 30: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

AIS DATA

• 14.8 Million data points • 1 Month • Zone 18 (North East / NY Area) • MMSI, Zulu Time, Lat, Lon, Vessel ID, Draught

Page 31: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

DEMO Steps

• GP Toolbox • Track Assembly • Hex Generation • Density Analysis

Page 32: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Import Job

AIS CSV

Import Partitioner

/ais/YYYY/MM/dd/HH/UUID.csv

HDFS MapReduce

Page 33: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS
Page 34: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS
Page 35: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS
Page 36: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS
Page 37: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS
Page 38: BigData - The Practice€¦ · • HBase - Columnar KeyValue Database • Cascading - Flow Data Analysis • Avro - Data Serializer • Zookeeper - Centralized State Management. GIS

Q&A http://thunderheadxpler.blogspot.com

[email protected] @mraad