ApacheCon Big Data 2015: Magellan
TRANSCRIPT
Page 1
Magellan: Geospatial Analytics on Spark
Ram Sriharsha
Spark and Data Science Architect, Hortonworks
Page 2
Agenda
• Geospatial Analytics
• Geospatial Data Formats
• Challenges
• Magellan
• Spark SQL and Catalyst: An Intro
• How does Magellan use Spark SQL?
• Demo
• Q & A
Page 3
Geospatial Context is useful
Where do people go on weekends?
Does the usage pattern change with time?
Predict the drop-off point of a user
Predict the location where the next pick-up can be expected
Identify crime hotspots
How do these hotspots evolve with time?
Predict the likelihood of crime occurring in a given neighborhood
Predict climate at a fairly granular level
Climate insurance: do I need to buy insurance for my crops?
Climate as a factor in crime: join the climate dataset with the crimes dataset
Page 5
Obscure Data Formats
• ESRI Shapefile format
–SHP
–SHX
–DBF (dBase file format; very few legitimate parsers exist)
• GeoJSON
• Open source parsers exist (GIS 4 Hadoop)
Page 6
Why do we need a proper framework?
• No standardized way of dealing with data and metadata
• Esoteric data formats and coordinate systems
• No optimizations for efficient joins
• APIs are too low level
• Language integration simplifies exploratory analytics
• Commonly used algorithms can be made available at scale
–Map matching
–Geohash indices
–Markov models
Page 7
What do we want to support?
• Parse geospatial data and metadata into Shapes + a metadata map
• Python and Scala support
• Geometric queries
–efficiently!
–simple and intuitive syntax!
• Scalable implementations of common algorithms
–Map matching
–Geohash indexing
–Spatial joins
Page 8
Where are we at?
• Magellan available on GitHub (https://github.com/harsha2010/magellan)
• Can parse and understand the most widely used formats
–GeoJSON, ESRI Shapefile
• All geometries supported
• 1.0.3 released (http://spark-packages.org/package/harsha2010/magellan)
• Broadcast join available for common scenarios
• Work in progress (targeted for 1.0.4)
–Geohash join optimization
–Map matching algorithm using Markov models
• Python and Scala support
• Please give it a try and give us feedback! (A quick usage sketch follows this list.)
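For a sense of the intended usage, here is a rough Scala sketch of reading a shapefile directory and asking which polygon each point falls in. It is illustrative only: the format name, the DSL import path, and the point/within helpers follow the project README, but the exact syntax may differ between releases, and the data path and column names are made up for the example.

    import org.apache.spark.sql.SQLContext
    // Magellan's expression DSL (import path may vary by release)
    import org.apache.spark.sql.magellan.dsl.expressions._

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext.implicits._

    // Read an ESRI Shapefile directory into a DataFrame of shapes + metadata
    val neighborhoods = sqlContext.read
      .format("magellan")
      .load("data/neighborhoods")          // hypothetical path
      .select($"polygon", $"metadata")

    // A DataFrame of raw (x, y) coordinates, e.g. GPS pings
    val points = Seq((-122.44, 37.77), (-122.40, 37.79)).toDF("x", "y")

    // Which neighborhood does each point fall in?
    val joined = points.join(neighborhoods)
      .where(point($"x", $"y") within $"polygon")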
Page 13
Geospatial Queries
Ray Tracing Algorithm: draw a ray from the point and count the number of times it intersects the polygon; an odd count means the point lies inside.
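A minimal, self-contained Scala sketch of that even-odd test (not Magellan's implementation, which delegates to the ESRI Java API): cast a horizontal ray to the right of the query point and toggle a flag each time it crosses a polygon edge.

    case class Pt(x: Double, y: Double)

    // Even-odd rule: a point is inside a simple polygon if a ray cast from it
    // crosses the polygon's boundary an odd number of times.
    def contains(polygon: IndexedSeq[Pt], p: Pt): Boolean = {
      var inside = false
      var j = polygon.length - 1
      for (i <- polygon.indices) {
        val (a, b) = (polygon(i), polygon(j))
        // Does edge (a, b) straddle the horizontal line y = p.y?
        if ((a.y > p.y) != (b.y > p.y)) {
          // x-coordinate where the edge crosses that line
          val xAtY = a.x + (p.y - a.y) * (b.x - a.x) / (b.y - a.y)
          if (p.x < xAtY) inside = !inside   // crossing lies to the right of p
        }
        j = i
      }
      inside
    }

For example, contains(IndexedSeq(Pt(0, 0), Pt(4, 0), Pt(4, 4), Pt(0, 4)), Pt(2, 2)) returns true, while the same call with Pt(5, 2) returns false.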
Page 15
Magellan
• Basic abstraction in terms of Shape
–Point, PolyLine, Polygon
–Supports multiple implementations (currently uses the ESRI Java API)
• SQL Data Type = Shape
–Efficient: construct once and use
• Operations supported as SQL operations
–within, intersects, contains, etc.
• Allows efficient join implementations using Catalyst
–Broadcast join already available
–Geohash-based join algorithm in progress
Page 21
Why Spark?
• DataFrames
–Intuitive manipulation of distributed structured data
• Catalyst optimizer
–Pushes predicates down to the data source, allowing optimized filters
• Memory-optimized execution engine
Page 22
The Spark ecosystem
[Diagram: Spark SQL, Spark Streaming, MLlib and GraphX built on top of Spark Core]
Distributed compute engine
• Speed, ease of use and fast prototyping
• Open source
• Powerful abstractions
• Python, R, Scala, Java support
Page 23
Spark DataFrames are intuitive
[Diagram: the same records shown as an RDD of objects vs. as a DataFrame with named columns]

dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61
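As a concrete sketch (plain Spark, not Magellan-specific), the table above can be built as a DataFrame whose named columns make it easy to manipulate; the column names simply mirror the slide.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext.implicits._

    // The example table as a DataFrame with named, typed columns
    val people = Seq(
      ("Bio",  "H Smith",  48),
      ("CS",   "A Turing", 54),
      ("Bio",  "B Jones",  43),
      ("Phys", "E Witten", 61)
    ).toDF("dept", "name", "age")

    // Column names make manipulation more intuitive than raw RDD tuples
    people.groupBy("dept").avg("age").show()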
Page 27
Rows and Data Types
• Standard SQL data types
–Date, Int, Long, String, etc.
• Complex data types
–Array, Map, Struct, etc.
• Custom data types
• Row = collection of data types
–Represents a single row
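For illustration, a schema mixing standard and complex types, and a Row that conforms to it (plain Spark SQL; Magellan's Shape types plug into this same machinery as custom data types):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // A schema combining standard and complex data types
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = true),
      StructField("tags", ArrayType(StringType), nullable = true),
      StructField("attributes", MapType(StringType, StringType), nullable = true)
    ))

    // A Row is an ordered collection of values matching the schema
    val row = Row("H Smith", 48, Seq("faculty"), Map("dept" -> "Bio"))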
Page 28
Expressions
• Literals
• Arithmetic expressions
–maxOf, unaryMinus
• Predicates
–not, and, in, case when
• Cast
• String expressions
–substring, like, startsWith
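A quick sketch of how these expression kinds surface in the DataFrame column API (reusing the hypothetical people DataFrame and the sqlContext.implicits._ import from the earlier sketch):

    import org.apache.spark.sql.functions._

    people.select(
      lit(1).as("literal"),                                // literal
      (-$"age").as("negated"),                             // arithmetic (unary minus)
      ($"age" > 50 && $"dept" === "Bio").as("seniorBio"),  // predicates
      $"age".cast("string").as("ageAsString"),             // cast
      $"name".substr(1, 1).as("initial"),                  // string expression
      $"name".startsWith("A").as("startsWithA")
    ).show()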
Page 29
Optimizations
• Constant folding
• Predicate pushdown
–Combine filters
–Push predicate through join
• Null propagation
• Boolean simplification
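These rewrites are easy to observe with explain(). For instance, two chained filters over a foldable constant expression come out of the optimizer as a single filter with the constant already computed (a sketch, again using the hypothetical people DataFrame from above; the exact plan text varies by Spark version):

    import org.apache.spark.sql.functions._

    // Two filters written separately, with a foldable constant expression
    val q = people.filter($"age" > lit(20) + lit(20)).filter($"dept" === "Bio")

    // The optimized logical plan shows a single combined predicate with the
    // constant already folded, roughly:  Filter ((age > 40) && (dept = Bio))
    q.explain(true)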
Page 30
Execution Engine
• Data sources to read data into DataFrames
–Supports extending pushdowns to the data store
• Optimized in-memory layout
–ORC, Tungsten, etc.
• Spark strategies
–Convert logical plan -> physical plan
–Rules based on statistics
Page 31
How does Magellan use Catalyst?
• Custom data source
–Parses GeoJSON, ESRI Shapefile, etc. into (Shape, Metadata) pairs
–Returns a DataFrame with columns (point, polygon, polyline, metadata)
–Overrides PrunedFilteredScan (see the sketch after this list)
–Outputs Shape instances
• Custom data type
–Point, Polygon, PolyLine instances of Shape
–Each Shape has a Python counterpart
–Each Shape is its own SQL type (=> no serialization overhead for SQL -> Scala and back)
• Magellan context
–Overrides the Spark planner, allowing custom join implementations
• Python wrappers
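For context, PrunedFilteredScan is part of Spark SQL's (1.x) data sources API: a relation implementing it is told exactly which columns and which pushed-down filters the query needs, so the source can avoid reading or emitting anything else. The skeleton below is an illustrative sketch of that contract, not Magellan's actual relation class; the class name, path handling and schema are made up for the example.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types._

    // Hypothetical relation: reads shapes and exposes shape + metadata columns
    class ShapeRelation(path: String)(@transient val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType = StructType(Seq(
        StructField("polygon", StringType),                      // placeholder type for the sketch
        StructField("metadata", MapType(StringType, StringType))
      ))

      // Spark hands the source only the columns and filters the query needs
      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        // ... read `path`, apply whatever `filters` the format can evaluate,
        // and emit Rows containing only `requiredColumns` ...
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }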
Page 36
Python wrappers
• Adds custom Python data types
–Point, PolyLine, Polygon wrappers around the Scala data types
• Wraps coordinate transformations and expressions
• Custom picklers and unpicklers
–Serialize to and from Scala
Page 37
Future Work
• Geohash indices
• Spatial join optimization
• Map matching algorithms
• Improving the PySpark bindings
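As background for the geohash items: a geohash interleaves longitude and latitude bits and base32-encodes them, so nearby locations share a string prefix, which is what makes it usable as an approximate join key and index. A standalone encoder sketch (not Magellan code):

    object Geohash {
      private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

      // Encode (lat, lon) as a geohash of `precision` characters by repeatedly
      // halving the longitude/latitude ranges and recording which half the
      // coordinate falls in, five bits per output character.
      def encode(lat: Double, lon: Double, precision: Int = 9): String = {
        var (latLo, latHi) = (-90.0, 90.0)
        var (lonLo, lonHi) = (-180.0, 180.0)
        val out = new StringBuilder
        var evenBit = true   // bits alternate, starting with longitude
        var bits = 0
        var ch = 0
        while (out.length < precision) {
          if (evenBit) {
            val mid = (lonLo + lonHi) / 2
            if (lon >= mid) { ch = (ch << 1) | 1; lonLo = mid } else { ch = ch << 1; lonHi = mid }
          } else {
            val mid = (latLo + latHi) / 2
            if (lat >= mid) { ch = (ch << 1) | 1; latLo = mid } else { ch = ch << 1; latHi = mid }
          }
          evenBit = !evenBit
          bits += 1
          if (bits == 5) { out.append(Base32(ch)); bits = 0; ch = 0 }
        }
        out.toString
      }
    }

For example, Geohash.encode(37.77, -122.42, 4) yields "9q8y" (the San Francisco area); a coarser precision gives a larger shared cell, which is the knob a geohash-based join would tune.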
Page 41
Map Matching
• Given a sequence of points (e.g. the GPS trace of a trip), what road path was taken?
• Challenges
–Error in GPS measurements
–Error in coordinate projections
–Time gap between measurements: cannot just snap each point to the nearest road
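This is where the Markov models from the roadmap come in: the common approach treats candidate road segments as hidden states of an HMM, scores each GPS point against nearby segments (emission, e.g. based on distance) and consecutive segment pairs against each other (transition, e.g. based on route feasibility), then recovers the most likely segment sequence with the Viterbi algorithm. Below is a toy sketch of that decoding step, illustrative only and not Magellan's implementation; the scoring functions are stand-ins supplied by the caller.

    // Hidden state = candidate road segment id; observation index = GPS point.
    // emission(t, s): log-score that observation t was emitted from segment s
    // transition(s1, s2): log-score of moving from segment s1 to segment s2
    def mapMatch(numObs: Int,
                 segments: IndexedSeq[Int],
                 emission: (Int, Int) => Double,
                 transition: (Int, Int) => Double): Seq[Int] = {
      require(numObs > 0 && segments.nonEmpty)

      // best(t)(j) = best log-score of any path ending in segments(j) at time t
      val best = Array.fill(numObs, segments.length)(Double.NegativeInfinity)
      val back = Array.fill(numObs, segments.length)(0)

      for (j <- segments.indices) best(0)(j) = emission(0, segments(j))
      for (t <- 1 until numObs; j <- segments.indices; i <- segments.indices) {
        val score = best(t - 1)(i) + transition(segments(i), segments(j)) + emission(t, segments(j))
        if (score > best(t)(j)) { best(t)(j) = score; back(t)(j) = i }
      }

      // Backtrack from the best final state to recover the matched road path
      var j = segments.indices.maxBy(k => best(numObs - 1)(k))
      val path = Array.fill(numObs)(0)
      for (t <- (numObs - 1) to 0 by -1) { path(t) = segments(j); if (t > 0) j = back(t)(j) }
      path.toSeq
    }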