ApacheCon Big Data 2015: Magellan
TRANSCRIPT
Page 1
Magellan: Geospatial Analytics on Spark
Ram Sriharsha
Spark and Data Science Architect, Hortonworks
Page 2
Agenda
• Geospatial Analytics
• Geospatial Data Formats
• Challenges
• Magellan
• Spark SQL and Catalyst: An Intro
• How does Magellan use Spark SQL?
• Demo
• Q & A
Page 3
Geospatial Context is useful
Where do people go on weekends?
Does the usage pattern change with time?
Predict the drop-off point of a user
Predict the location where the next pick-up can be expected
Identify crime hotspots
How do these hotspots evolve with time?
Predict the likelihood of crime occurring in a given neighborhood
Predict climate at a fairly granular level
Climate insurance: do I need to buy insurance for my crops?
Climate as a factor in crime: join the climate dataset with the crimes dataset
Page 5
Obscure Data Formats
• ESRI Shapefile format
–SHP
–SHX
–DBF (dBase file format; very few legitimate parsers exist)
• GeoJSON
• Open source parsers exist (GIS 4 Hadoop)
Page 6
Why do we need a proper framework?
• No standardized way of dealing with data and metadata
• Esoteric data formats and coordinate systems
• No optimizations for efficient joins
• APIs are too low level
• Language integration simplifies exploratory analytics
• Commonly used algorithms can be made available at scale
–Map matching
–Geohash indices
–Markov models
Page 7
What do we want to support?
• Parse geospatial data and metadata into Shapes + a metadata map
• Python and Scala support
• Geometric queries
–efficiently!
–simple and intuitive syntax!
• Scalable implementations of common algorithms
–Map matching
–Geohash indexing
–Spatial joins
Page 8
Where are we at?
• Magellan available on GitHub (https://github.com/harsha2010/magellan)
• Can parse and understand the most widely used formats
–GeoJSON, ESRI Shapefile
• All geometries supported
• 1.0.3 released (http://spark-packages.org/package/harsha2010/magellan)
• Broadcast join available for common scenarios
• Work in progress (targeted for 1.0.4)
–Geohash join optimization
–Map matching algorithm using Markov models
• Python and Scala support
• Please give it a try and give us feedback! (A quick usage sketch follows this list.)
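For a sense of the intended usage, here is a rough Scala sketch of reading a shapefile directory and asking which polygon each point falls in. It is illustrative only: the format name, the DSL import path, and the point/within helpers follow the project README, but the exact syntax may differ between releases, and the data path and column names are made up for the example.

    import org.apache.spark.sql.SQLContext
    // Magellan's expression DSL (import path may vary by release)
    import org.apache.spark.sql.magellan.dsl.expressions._

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext.implicits._

    // Read an ESRI Shapefile directory into a DataFrame of shapes + metadata
    val neighborhoods = sqlContext.read
      .format("magellan")
      .load("data/neighborhoods")          // hypothetical path
      .select($"polygon", $"metadata")

    // A DataFrame of raw (x, y) coordinates, e.g. GPS pings
    val points = Seq((-122.44, 37.77), (-122.40, 37.79)).toDF("x", "y")

    // Which neighborhood does each point fall in?
    val joined = points.join(neighborhoods)
      .where(point($"x", $"y") within $"polygon")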
Page 13
Geospatial Queries
Ray Tracing Algorithm: draw a ray from the point and count the number of times it intersects the polygon; an odd count means the point lies inside.
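A minimal, self-contained Scala sketch of that even-odd test (not Magellan's implementation, which delegates to the ESRI Java API): cast a horizontal ray to the right of the query point and toggle a flag each time it crosses a polygon edge.

    case class Pt(x: Double, y: Double)

    // Even-odd rule: a point is inside a simple polygon if a ray cast from it
    // crosses the polygon's boundary an odd number of times.
    def contains(polygon: IndexedSeq[Pt], p: Pt): Boolean = {
      var inside = false
      var j = polygon.length - 1
      for (i <- polygon.indices) {
        val (a, b) = (polygon(i), polygon(j))
        // Does edge (a, b) straddle the horizontal line y = p.y?
        if ((a.y > p.y) != (b.y > p.y)) {
          // x-coordinate where the edge crosses that line
          val xAtY = a.x + (p.y - a.y) * (b.x - a.x) / (b.y - a.y)
          if (p.x < xAtY) inside = !inside   // crossing lies to the right of p
        }
        j = i
      }
      inside
    }

For example, contains(IndexedSeq(Pt(0, 0), Pt(4, 0), Pt(4, 4), Pt(0, 4)), Pt(2, 2)) returns true, while the same call with Pt(5, 2) returns false.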
Page 15
Magellan
• Basic abstraction in terms of Shape
–Point, PolyLine, Polygon
–Supports multiple implementations (currently uses the ESRI Java API)
• SQL Data Type = Shape
–Efficient: construct once and use
• Operations supported as SQL operations
–within, intersects, contains, etc.
• Allows efficient join implementations using Catalyst
–Broadcast join already available
–Geohash-based join algorithm in progress
Page 21
Why Spark?
• DataFrames
–Intuitive manipulation of distributed structured data
• Catalyst optimizer
–Pushes predicates down to the data source, allowing optimized filters
• Memory-optimized execution engine
Page 22
The Spark ecosystem
[Diagram: Spark SQL, Spark Streaming, MLlib and GraphX built on top of Spark Core]
Distributed compute engine
• Speed, ease of use and fast prototyping
• Open source
• Powerful abstractions
• Python, R, Scala, Java support
Page 23
Spark DataFrames are intuitive
[Diagram: the same records shown as an RDD of objects vs. as a DataFrame with named columns]

dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61
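As a concrete sketch (plain Spark, not Magellan-specific), the table above can be built as a DataFrame whose named columns make it easy to manipulate; the column names simply mirror the slide.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext.implicits._

    // The example table as a DataFrame with named, typed columns
    val people = Seq(
      ("Bio",  "H Smith",  48),
      ("CS",   "A Turing", 54),
      ("Bio",  "B Jones",  43),
      ("Phys", "E Witten", 61)
    ).toDF("dept", "name", "age")

    // Column names make manipulation more intuitive than raw RDD tuples
    people.groupBy("dept").avg("age").show()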
Page 27
Rows and Data Types
• Standard SQL data types
–Date, Int, Long, String, etc.
• Complex data types
–Array, Map, Struct, etc.
• Custom data types
• Row = collection of data types
–Represents a single row
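For illustration, a schema mixing standard and complex types, and a Row that conforms to it (plain Spark SQL; Magellan's Shape types plug into this same machinery as custom data types):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // A schema combining standard and complex data types
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = true),
      StructField("tags", ArrayType(StringType), nullable = true),
      StructField("attributes", MapType(StringType, StringType), nullable = true)
    ))

    // A Row is an ordered collection of values matching the schema
    val row = Row("H Smith", 48, Seq("faculty"), Map("dept" -> "Bio"))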
Page 28
Expressions
• Literals
• Arithmetic expressions
–maxOf, unaryMinus
• Predicates
–not, and, in, case when
• Cast
• String expressions
–substring, like, startsWith
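A quick sketch of how these expression kinds surface in the DataFrame column API (reusing the hypothetical people DataFrame and the sqlContext.implicits._ import from the earlier sketch):

    import org.apache.spark.sql.functions._

    people.select(
      lit(1).as("literal"),                                // literal
      (-$"age").as("negated"),                             // arithmetic (unary minus)
      ($"age" > 50 && $"dept" === "Bio").as("seniorBio"),  // predicates
      $"age".cast("string").as("ageAsString"),             // cast
      $"name".substr(1, 1).as("initial"),                  // string expression
      $"name".startsWith("A").as("startsWithA")
    ).show()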
Page 29
Optimizations
• Constant folding
• Predicate pushdown
–Combine filters
–Push predicate through join
• Null propagation
• Boolean simplification
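These rewrites are easy to observe with explain(). For instance, two chained filters over a foldable constant expression come out of the optimizer as a single filter with the constant already computed (a sketch, again using the hypothetical people DataFrame from above; the exact plan text varies by Spark version):

    import org.apache.spark.sql.functions._

    // Two filters written separately, with a foldable constant expression
    val q = people.filter($"age" > lit(20) + lit(20)).filter($"dept" === "Bio")

    // The optimized logical plan shows a single combined predicate with the
    // constant already folded, roughly:  Filter ((age > 40) && (dept = Bio))
    q.explain(true)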
Page 30
Execution Engine
• Data sources to read data into DataFrames
–Supports extending pushdowns to the data store
• Optimized in-memory layout
–ORC, Tungsten, etc.
• Spark strategies
–Convert logical plan -> physical plan
–Rules based on statistics
Page 31
How does Magellan use Catalyst?
• Custom data source
–Parses GeoJSON, ESRI Shapefile, etc. into (Shape, Metadata) pairs
–Returns a DataFrame with columns (point, polygon, polyline, metadata)
–Overrides PrunedFilteredScan (see the sketch after this list)
–Outputs Shape instances
• Custom data type
–Point, Polygon, PolyLine instances of Shape
–Each Shape has a Python counterpart
–Each Shape is its own SQL type (=> no serialization overhead for SQL -> Scala and back)
• Magellan context
–Overrides the Spark planner, allowing custom join implementations
• Python wrappers
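For context, PrunedFilteredScan is part of Spark SQL's (1.x) data sources API: a relation implementing it is told exactly which columns and which pushed-down filters the query needs, so the source can avoid reading or emitting anything else. The skeleton below is an illustrative sketch of that contract, not Magellan's actual relation class; the class name, path handling and schema are made up for the example.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types._

    // Hypothetical relation: reads shapes and exposes shape + metadata columns
    class ShapeRelation(path: String)(@transient val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType = StructType(Seq(
        StructField("polygon", StringType),                      // placeholder type for the sketch
        StructField("metadata", MapType(StringType, StringType))
      ))

      // Spark hands the source only the columns and filters the query needs
      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        // ... read `path`, apply whatever `filters` the format can evaluate,
        // and emit Rows containing only `requiredColumns` ...
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }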
Page 36
Python wrappers
• Adds custom Python data types
–Point, PolyLine, Polygon wrappers around the Scala data types
• Wraps coordinate transformations and expressions
• Custom picklers and unpicklers
–Serialize to and from Scala
Page 37
Future Work
• Geohash indices
• Spatial join optimization
• Map matching algorithms
• Improving the PySpark bindings
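As background for the geohash items: a geohash interleaves longitude and latitude bits and base32-encodes them, so nearby locations share a string prefix, which is what makes it usable as an approximate join key and index. A standalone encoder sketch (not Magellan code):

    object Geohash {
      private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

      // Encode (lat, lon) as a geohash of `precision` characters by repeatedly
      // halving the longitude/latitude ranges and recording which half the
      // coordinate falls in, five bits per output character.
      def encode(lat: Double, lon: Double, precision: Int = 9): String = {
        var (latLo, latHi) = (-90.0, 90.0)
        var (lonLo, lonHi) = (-180.0, 180.0)
        val out = new StringBuilder
        var evenBit = true   // bits alternate, starting with longitude
        var bits = 0
        var ch = 0
        while (out.length < precision) {
          if (evenBit) {
            val mid = (lonLo + lonHi) / 2
            if (lon >= mid) { ch = (ch << 1) | 1; lonLo = mid } else { ch = ch << 1; lonHi = mid }
          } else {
            val mid = (latLo + latHi) / 2
            if (lat >= mid) { ch = (ch << 1) | 1; latLo = mid } else { ch = ch << 1; latHi = mid }
          }
          evenBit = !evenBit
          bits += 1
          if (bits == 5) { out.append(Base32(ch)); bits = 0; ch = 0 }
        }
        out.toString
      }
    }

For example, Geohash.encode(37.77, -122.42, 4) yields "9q8y" (the San Francisco area); a coarser precision gives a larger shared cell, which is the knob a geohash-based join would tune.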
Page 41
Map Matching
• Given a sequence of points (e.g. the GPS trace of a trip), what road path was taken?
• Challenges
–Error in GPS measurements
–Error in coordinate projections
–Time gap between measurements: cannot just snap each point to the nearest road
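This is where the Markov models from the roadmap come in: the common approach treats candidate road segments as hidden states of an HMM, scores each GPS point against nearby segments (emission, e.g. based on distance) and consecutive segment pairs against each other (transition, e.g. based on route feasibility), then recovers the most likely segment sequence with the Viterbi algorithm. Below is a toy sketch of that decoding step, illustrative only and not Magellan's implementation; the scoring functions are stand-ins supplied by the caller.

    // Hidden state = candidate road segment id; observation index = GPS point.
    // emission(t, s): log-score that observation t was emitted from segment s
    // transition(s1, s2): log-score of moving from segment s1 to segment s2
    def mapMatch(numObs: Int,
                 segments: IndexedSeq[Int],
                 emission: (Int, Int) => Double,
                 transition: (Int, Int) => Double): Seq[Int] = {
      require(numObs > 0 && segments.nonEmpty)

      // best(t)(j) = best log-score of any path ending in segments(j) at time t
      val best = Array.fill(numObs, segments.length)(Double.NegativeInfinity)
      val back = Array.fill(numObs, segments.length)(0)

      for (j <- segments.indices) best(0)(j) = emission(0, segments(j))
      for (t <- 1 until numObs; j <- segments.indices; i <- segments.indices) {
        val score = best(t - 1)(i) + transition(segments(i), segments(j)) + emission(t, segments(j))
        if (score > best(t)(j)) { best(t)(j) = score; back(t)(j) = i }
      }

      // Backtrack from the best final state to recover the matched road path
      var j = segments.indices.maxBy(k => best(numObs - 1)(k))
      val path = Array.fill(numObs)(0)
      for (t <- (numObs - 1) to 0 by -1) { path(t) = segments(j); if (t > 0) j = back(t)(j) }
      path.toSeq
    }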