apache con big data 2015 magellan

42
Page 1 Magellan: Geospatial Analytics on Spark Ram Sriharsha Spark and Data Science Architect, Hortonworks

Upload: ram-sriharsha

Post on 15-Apr-2017

1.917 views

Category:

Data & Analytics


4 download

TRANSCRIPT

Page 1

Magellan: Geospatial Analytics on SparkRam Sriharsha

Spark and Data Science Architect, Hortonworks

Page 2

Agenda• Geospatial Analytics• Geospatial Data Formats• Challenges• Magellan• Spark SQL and Catalyst: An Intro• How does Magellan use Spark SQL?• Demo• Q & A

Page 3

Geospatial Context is useful

Where do people go on weekends?Does usage pattern change with time?Predict the drop off point of a user?Predict the location where next pick up can be expected?

Identify crime hotspotsHow do these hotspots evolve with time?Predict the likelihood of crime occurring at a given neighborhood

Predict climate at fairly granular levelClimate insurance: do I need to buy insurance for my crops?Climate as a factor in crime: Join climate dataset with Crimes

Page 4

Geospatial Data is pervasive

Page 5

Obscure Data Formats• ESRI Shapefile format

–SHP–SHX–DBX (Dbase File Format, very few legitimate parsers exist)

• GeoJSON• Open Source Parsers exist (GIS 4 Hadoop)

Page 6

Why do we need a proper framework?• No standardized way of dealing with data and metadata• Esoteric Data Formats and Coordinate Systems• No optimizations for efficient joins • APIs are too low level• Language integration simplifies exploratory analytics• Commonly used algorithms can be made available at scale

–Map matching–Geohash indices–Markov models

Page 7

What do we want to support?

• Parse geospatial data and metadata into Shapes + Metadata Map• Python and Scala support• Geometric Queries

–efficiently!–simple and intuitive syntax!

• Scalable implementations of common algorithms–Map Matching–Geohash Indexing–Spatial Joins

Page 8

Where are we at?• Magellan available on Github (https://github.com/harsha2010/magellan)• Can parse and understand most widely used formats

–GeoJSON, ESRIShapefile

• All geometries supported• 1.0.3 released (http://spark-packages.org/package/harsha2010/magellan)• Broadcast join available for common scenarios• Work in progress (targeted 1.0.4)

–Geohash Join optimization–Map Matching algorithm using Markov Models

• Python and Scala support• Please give it a try and give us feedback!

Page 9

Geospatial Data Structures

Page 10

Operations• intersection• union• Symmetrical difference• Difference• Convex hull

Page 11

Queries• contains (within)• Covers (covered-by)• intersects• touches

Page 12

Geospatial Queries

Is a triplet of points (A, B, C) Clockwise or Counter Clockwise ordered?

Page 13

Geospatial Queries

Ray Tracing Algorithm:Draw a ray from point and countthe # of times it intersects polygon

Page 14

Accelerating Spatial Queries• Bounding Box• Geohashing• R-Tree (and other) indices

Page 15

Magellan• Basic Abstraction in terms of Shape

–Point, PolyLine, Polygon–Supports multiple implementations (currently uses ESRI-Java-API)

• SQL Data Type = Shape–Efficient: Construct once and use

• Operations supported as SQL operations–within, intersects, contains etc.

• Allows efficient Join implementations using Catalyst–Broadcast join already available–Geohash based join algorithm in progress

Page 16

load

Page 17

query metadata

Page 18

geometric query

Page 19

Python API

Page 20

Python API, join

Page 21

Why spark?• DataFrames

– Intuitive manipulation of distributed structured data

• Catalyst Optimizer–Push predicates to Data Source, allows optimized filters

• Memory Optimized Execution Engine

Page 22

The Spark ecosystem

Spark Core

Spark SQL

Spark Streaming

ML-Lib GraphX

Distributed Compute Engine• Speed, ease of use and fast prototyping• Open source• Powerful abstractions• Python, R , Scala, Java support

Page 23

Spark DataFrames are intuitive

RDD

DataFrame

dept name age

Bio H Smith 48

CS A Turing 54

Bio B Jones 43

Phys E Witten 61

Page 24

Spark DataFrames are fast!

Page 25

Spark SQL under the covers

Page 26

Catalyst• Rows, Data Types• Expressions• Operators• Rules• Optimization

Page 27

Rows and Data Types• Standard SQL Data Types

–Date, Int, Long, String, etc

• Complex Data Types–Array, Map, Struct, etc

• Custom Data Types• Row = Collection of Data Types

–Represents a single row

Page 28

Expressions• Literals• Arithmetic Expressions

–maxOf, unaryMinus

• Predicates–Not, and, in, case when

• Cast• String Expressions

–substring, like, startsWith

Page 29

Optimizations• Constant Folding• Predicate Pushdown

–Combine Filters–Push Predicate Through Join

• Null Propagation• Boolean Simplification

Page 30

Execution Engine• Data Sources to read data into Data Frames

–Supports extending pushdowns to data store

• Optimized in memory layout–ORC, Tungsten etc.

• Spark Strategies–Convert logical plan -> physical plan–Rules based on statistics

Page 31

How does Magellan use Catalyst?• Custom Data Source

–Parses GeoJSON, ESRIShapefile etc into (Shape, Metadata) pairs–Returns a DataFrame with columns (point, polygon, polyline, metadata)–Overrides PrunedFilteredScan–Outputs Shape instances

• Custom Data Type–Point, Polygon, Polyline instances of Shape–Each Shape has a python counterpart. –Each Shape is its own SQL type (=> no serialization overhead for SQL -> Scala and back)

• Magellan Context–Overrides Spark Planner allowing custom join implementations

• Python wrappers

Page 32

Leveraging Data Sources

Page 33

Leveraging Catalyst

Binary Expression

Page 34

Leveraging Catalyst

Spatial Join + Predicate Pushdown

Page 35

Python Bindings

Page 36

• Adds Custom Python Data Types–Point, PolyLine, Polygon wrappers around Scala Data Types

• Wraps coordinate transformations and expressions• Custom Picklers and Unpicklers

–Serialize to and from Scala

Page 37

Future Work• Geohash Indices• Spatial Join Optimization• Map Matching Algorithms• Improving pyspark bindings

Page 38

geohashing

Page 39

Base 32 encoding

Page 40

geohashing

Page 41

Map Matching• Given sequence of points (representing a trip?) what was the road path taken?• Challenges

–Error in GPS measurements–Error in coordinate projections–Time Gap between measurements, cannot just snap to nearest road

Page 42

DemoUber use case?