h-hypermap heatmap analytics at scale
TRANSCRIPT
O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A
H-Hypermap: Heatmap Analytics at ScaleDavid Smiley
Freelance Search Developer/Consultant
About: David Smiley• Software Engineer (16 years)• Search (7 years)• Java (full-stack), Web, Spatial• Freelance search consultant / developer• Apache Lucene / Solr committer & PMC• Wrote first book on Solr, updated twice
Agenda• About this project• Architecture• Solr & time sharding• Experiences with:
– Kotlin, Dropwizard, Swagger
– Kafka– Docker, Kontena
• Solr for geo-enrichment• Solr adapter for Lucene
BKD Lat-Lon point search & sort
• Heatmaps– Existing functionality
• demo– New functionality
H-Hypermap / BOP• Harvard University, CGA:
Center for Geospatial Analysis http://gis.harvard.edu
• Harvard Hypermap Project– Managed by Ben Lewis
• BOP “Billion Object Platform”– Funded by the Sloan Foundation
BOP Requirements Summary
• Most recent ~billion geo-tweets• Realtime search (<5 sec latency)• Sub-second queries– Including heatmaps!
• On the cheap: ~6 mediocre boxes
Provide a proof-of-concept platform designed to lower the barrier for researchers who need to access big streaming spatio-temporal datasets.
Logical High-Level Architecture
Archival
Realtime
Harvesting Enrichment
various clients...
various clients...
Data flows via Apache Kafka Systems exposeHTTP web services
“BOP”
Shard: W51
The BOPKafka Topic Ingester
ZooKeeper
Shard: W52Shard: W53Shard: W54Shard: RT
...
Web-Service
Kafka Streams• Create Solr doc• Routes to shard
REST/JSON API• Keyword search• Faceting• Heatmaps• CSV export
...
BOP Solr Sharding ArchitectureRealtime
T2016_05_20T2016_05_06T2016_04_22T2016_04_08
… 4-5 mo.
T2016_05_20T2016_05_06T2016_04_22T2016_04_08
… 4-5 mo.
G_North_America G_Elsewhere
Lone Realtime Collection/Shard. 1-25 hrsCopy then delete, at night
• Realtime shard is where realtime search happens. No caches, but small.
• Primary collections have useful caches• Housekeeping Tasks:
• Move data from RT to primary• Create new shards; expire old• Merge/optimize shards
Building a Search Web-Service• Kotlin language (JVM based)– Nullity as first-class language feature
• DropWizard framework– Designed for web-services
• Swagger– Dynamically generated dev UI for web-services
Apache Kafka• Kafka: a scalable message/queue platform• See new Kafka Streams & Kafka Connect APIs• No back-pressure; can be a challenge• Non-obvious use:– For storage; time partitioning• Lots of benefits yet serious limitations
Docker• Easy to find/try/use
software– No installation– Simplified configuration
(env variables)– Common logging– Isolated
• Ideal for:– Continuous Int. servers– Trying new software– Production advantages
• But “new”
Docker in Production• I use “Kontena”• Common logging, machine/proc stats, security– VPN to secure network; access everything as local
• No longer need to care about:– Ansible, Chef, Puppet, etc.– Security at network or proxy; not service specific
• Challenges: state & big-data
Enrichment
Geo: Query Solr via spatial point query; attach related metadata to tweet
Kafka Topic Enrich Kafka
Topic
Twitter Sentiment Classifier
Geo: Solr with regional polygons & metadata
Solr for Geo Enrichment• Tweets (docs) can have a geo lat/lon• Enrich tweet with Country, State/Province, …– Gazetteer lookup (point-in-polygon)
Data Set Features Raw size Index time Index size
Admin2 46,311 824 MB 510 min 892 MB
US States 74,002 747 MB 4.9 min 840 MB
Massachusetts Census Blocks 154,621 152 MB 5.9 min 507 MB
Fast Point-in-Polygon TricksIndex/Config• Optimize to 1 segment• RptWithGeometry
SpatialField– precisionModel=
"floating_single"– autoIndex="true"
• <cache name="perSegSpatialFieldCache_WKT" …
Search• Embed Solr (in-process)• Use docValues, not stored
– fl=block:field(GEOID10)Query like this:• q={!field cache=false
f=WKT}Intersects(POINT($lon $lat))
Sub-Millisecond!
Lucene “LatLonPoint”• Uses new PointValues (BKD index) in Lucene 6• Fastest: http://home.apache.org/~mikemccand/geobench.html
• Presently in Lucene sandbox module• Some limitations: WGS84 points only• Credit to Rob Muir and Mike McCandless
Solr Adapter For LatLonPoint• New Solr FieldType for Lucene LatLonPoint– Filter points by circle, rect, polygon– Distance sort; but no boosting
Coming soon! Solr 6.4?
Heatmaps: Spatial Grid Faceting• Spatial density summary grid faceting,
also useful for point-plotting search results• Lucene & Solr APIs• Scalable & fast usually…
• Usually rendered with a gradient radius ->• See: http://spacemansteve.github.io/leaflet-solr-heatmap/example/
index.html
How-to: Heatmaps• On an RPT field geo="false" worldBounds= "ENVELOPE( -180, 180, 180, -180)" prefixTree="packedQuad"
• Query: /select?facet=true&facet.heatmap=geo_rpt&facet.heatmap.geom= ["-180 -90" TO "180 90”]&facet.heatmap.format= ints2D or png
// Normal Solr response..."facet_counts":{ ... // facet response fields "facet_heatmaps":{ "geo_rpt":[ "gridLevel",2, "columns",32, "rows",32, "minX",-180.0, "maxX",180.0, "minY",-90.0, "maxY",90.0, "counts_ints2D”,[null, null, [0, 1, ... ]]...
New HeatmapSpatialField• Why?– With new BKD/PointValues, no “RPT” field to use– Scalable for heatmaps; don’t worry about search• Scalable at all resolutions; many millions of docs/shard
– Can be specific about grid resolutions
Coming soon! Solr 6.4?
Heatmaps with Stats• Instead of counting docs; calculate a metric– Ex: avg(minuteOfDay)
• Will require JSON Facet API• Inherently slower than just doc counts
Coming soon! Solr 6.4?
Final Remarks• Open-Source– https://github.com/dsmiley/hhypermap-bop
• In-progress• Improvements to Solr expected to be available
before December; officially in Solr 6.4.