h-hypermap heatmap analytics at scale

25
OCTOBER 11-14, 2016 BOSTON, MA

Upload: david-smiley

Post on 07-Jan-2017

49 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: H-Hypermap Heatmap Analytics at Scale

O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A

Page 2: H-Hypermap Heatmap Analytics at Scale

H-Hypermap: Heatmap Analytics at ScaleDavid Smiley

Freelance Search Developer/Consultant

Page 3: H-Hypermap Heatmap Analytics at Scale

About: David Smiley• Software Engineer (16 years)• Search (7 years)• Java (full-stack), Web, Spatial• Freelance search consultant / developer• Apache Lucene / Solr committer & PMC• Wrote first book on Solr, updated twice

Page 4: H-Hypermap Heatmap Analytics at Scale

Agenda• About this project• Architecture• Solr & time sharding• Experiences with:

– Kotlin, Dropwizard, Swagger

– Kafka– Docker, Kontena

• Solr for geo-enrichment• Solr adapter for Lucene

BKD Lat-Lon point search & sort

• Heatmaps– Existing functionality

• demo– New functionality

Page 5: H-Hypermap Heatmap Analytics at Scale

H-Hypermap / BOP• Harvard University, CGA:

Center for Geospatial Analysis http://gis.harvard.edu

• Harvard Hypermap Project– Managed by Ben Lewis

• BOP “Billion Object Platform”– Funded by the Sloan Foundation

Page 6: H-Hypermap Heatmap Analytics at Scale

BOP Requirements Summary

• Most recent ~billion geo-tweets• Realtime search (<5 sec latency)• Sub-second queries– Including heatmaps!

• On the cheap: ~6 mediocre boxes

Provide a proof-of-concept platform designed to lower the barrier for researchers who need to access big streaming spatio-temporal datasets.

Page 7: H-Hypermap Heatmap Analytics at Scale

Logical High-Level Architecture

Archival

Realtime

Harvesting Enrichment

various clients...

various clients...

Data flows via Apache Kafka Systems exposeHTTP web services

“BOP”

Page 8: H-Hypermap Heatmap Analytics at Scale

Shard: W51

The BOPKafka Topic Ingester

ZooKeeper

Shard: W52Shard: W53Shard: W54Shard: RT

...

Web-Service

Kafka Streams• Create Solr doc• Routes to shard

REST/JSON API• Keyword search• Faceting• Heatmaps• CSV export

...

Page 9: H-Hypermap Heatmap Analytics at Scale

BOP Solr Sharding ArchitectureRealtime

T2016_05_20T2016_05_06T2016_04_22T2016_04_08

… 4-5 mo.

T2016_05_20T2016_05_06T2016_04_22T2016_04_08

… 4-5 mo.

G_North_America G_Elsewhere

Lone Realtime Collection/Shard. 1-25 hrsCopy then delete, at night

• Realtime shard is where realtime search happens. No caches, but small.

• Primary collections have useful caches• Housekeeping Tasks:

• Move data from RT to primary• Create new shards; expire old• Merge/optimize shards

Page 10: H-Hypermap Heatmap Analytics at Scale

Building a Search Web-Service• Kotlin language (JVM based)– Nullity as first-class language feature

• DropWizard framework– Designed for web-services

• Swagger– Dynamically generated dev UI for web-services

Page 11: H-Hypermap Heatmap Analytics at Scale

Apache Kafka• Kafka: a scalable message/queue platform• See new Kafka Streams & Kafka Connect APIs• No back-pressure; can be a challenge• Non-obvious use:– For storage; time partitioning• Lots of benefits yet serious limitations

Page 12: H-Hypermap Heatmap Analytics at Scale

Docker• Easy to find/try/use

software– No installation– Simplified configuration

(env variables)– Common logging– Isolated

• Ideal for:– Continuous Int. servers– Trying new software– Production advantages

• But “new”

Page 13: H-Hypermap Heatmap Analytics at Scale

Docker in Production• I use “Kontena”• Common logging, machine/proc stats, security– VPN to secure network; access everything as local

• No longer need to care about:– Ansible, Chef, Puppet, etc.– Security at network or proxy; not service specific

• Challenges: state & big-data

Page 14: H-Hypermap Heatmap Analytics at Scale

Enrichment

Geo: Query Solr via spatial point query; attach related metadata to tweet

Kafka Topic Enrich Kafka

Topic

Twitter Sentiment Classifier

Geo: Solr with regional polygons & metadata

Page 15: H-Hypermap Heatmap Analytics at Scale

Solr for Geo Enrichment• Tweets (docs) can have a geo lat/lon• Enrich tweet with Country, State/Province, …– Gazetteer lookup (point-in-polygon)

Data Set Features Raw size Index time Index size

Admin2 46,311 824 MB 510 min 892 MB

US States 74,002 747 MB 4.9 min 840 MB

Massachusetts Census Blocks 154,621 152 MB 5.9 min 507 MB

Page 16: H-Hypermap Heatmap Analytics at Scale

Fast Point-in-Polygon TricksIndex/Config• Optimize to 1 segment• RptWithGeometry

SpatialField– precisionModel=

"floating_single"– autoIndex="true"

• <cache name="perSegSpatialFieldCache_WKT" …

Search• Embed Solr (in-process)• Use docValues, not stored

– fl=block:field(GEOID10)Query like this:• q={!field cache=false

f=WKT}Intersects(POINT($lon $lat))

Sub-Millisecond!

Page 17: H-Hypermap Heatmap Analytics at Scale

Lucene “LatLonPoint”• Uses new PointValues (BKD index) in Lucene 6• Fastest: http://home.apache.org/~mikemccand/geobench.html

• Presently in Lucene sandbox module• Some limitations: WGS84 points only• Credit to Rob Muir and Mike McCandless

Page 18: H-Hypermap Heatmap Analytics at Scale

Solr Adapter For LatLonPoint• New Solr FieldType for Lucene LatLonPoint– Filter points by circle, rect, polygon– Distance sort; but no boosting

Coming soon! Solr 6.4?

Page 20: H-Hypermap Heatmap Analytics at Scale

How-to: Heatmaps• On an RPT field geo="false" worldBounds= "ENVELOPE( -180, 180, 180, -180)" prefixTree="packedQuad"

• Query: /select?facet=true&facet.heatmap=geo_rpt&facet.heatmap.geom= ["-180 -90" TO "180 90”]&facet.heatmap.format= ints2D or png

// Normal Solr response..."facet_counts":{ ... // facet response fields "facet_heatmaps":{ "geo_rpt":[ "gridLevel",2, "columns",32, "rows",32, "minX",-180.0, "maxX",180.0, "minY",-90.0, "maxY",90.0, "counts_ints2D”,[null, null, [0, 1, ... ]]...

Page 21: H-Hypermap Heatmap Analytics at Scale

New HeatmapSpatialField• Why?– With new BKD/PointValues, no “RPT” field to use– Scalable for heatmaps; don’t worry about search• Scalable at all resolutions; many millions of docs/shard

– Can be specific about grid resolutions

Coming soon! Solr 6.4?

Page 22: H-Hypermap Heatmap Analytics at Scale

Heatmaps with Stats• Instead of counting docs; calculate a metric– Ex: avg(minuteOfDay)

• Will require JSON Facet API• Inherently slower than just doc counts

Coming soon! Solr 6.4?

Page 23: H-Hypermap Heatmap Analytics at Scale
Page 24: H-Hypermap Heatmap Analytics at Scale
Page 25: H-Hypermap Heatmap Analytics at Scale

Final Remarks• Open-Source– https://github.com/dsmiley/hhypermap-bop

• In-progress• Improvements to Solr expected to be available

before December; officially in Solr 6.4.