webinar: solr 6 deep dive - sql and graph
TRANSCRIPT
Solr 6 Deep Dive: SQL and Graph
Grant Ingersoll
@gsingers
CTO, Lucidworks
Tim Potter
@thelabdude
Sr. Software Engineer, Lucidworks
• Motivations
• Streaming Expressions and Parallel SQL
• Graph Capabilities
• How does this compare to…?
• Future Directions
Agenda
Search-Driven Everything
Customer Service
Customer Insights
Fraud Surveillance
Research Portal
Online Retail Digital Content
• Big data systems have grown too complex trying to satisfy a variety of access patterns
• Fast primary key lookups / atomic updates (Solr, HBase, Cassandra, …)
• Low-latency ranked retrieval (Solr, Elastic, DataStax, …)
• Large, distributed table scans (Spark, M/R, Pig, Cassandra, Hive, Impala, …)
• Graph traversal (Graphx, Giraph, Neo4j, …)
• De-normalization can be inconvenient as related data sets can change at different velocities (movies vs. movie ratings)
• Leverage progress made by the Solr community to support big data in Solr using horizontal scalability (shards & replicas)
• Don’t forget about speed ~ Search engines in general and Solr in particular are extremely fast!
Why Solr needs Parallel Computation
Lucidworks Fusion Is Search-Driven Everything
• Drive next generation relevance via Content, Collaboration and Context
• Harness best in class Open Source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance
CATALOG
DYNAMIC NAVIGATION AND LANDING PAGES
INSTANT INSIGHTS AND ANALYTICS
PERSONALIZED SHOPPING EXPERIENCE
PROMOTIONS USER HISTORY
Data Acquisition
Indexing & Streaming
Smart Access API
Recommendations & Alerts
Analytics & InsightsExtreme Relevancy
Access data from anywhere to build intelligent, data-driven applications.
Fusion Architecture
REST
API
Worker Worker Cluster Mgr.
Apache Spark
Shards Shards
Apache Solr
HD
FS (O
ptio
nal)
Shared Config Mgmt
Leader Election
Load Balancing
ZK 1
Apache Zookeeper
ZK N
DATABASEWEBFILELOGSHADOOP CLOUD
Connectors
Alerting/Messaging
NLP
Pipelines
Blob Storage
Scheduling
Recommenders/Signals
…
Core Services
Admin UI
SECURITY BUILT-IN
Lucidworks View
• SQL is ubiquitous language for analytics
• People: Less training and easier to understand
• Tools! Solr as JDBC data source (DbVisualizer, Apache Zeppelin, and SQuirreL SQL)
• Query planning / optimization can evolve iteratively
SQL is natural extension for Solr’s parallel computing engine
Give me the top 5 action movies with rating of 4 or better
Mental Warm-up
/select?q=*:* &fq=genre_ss:action &fq=rating_i:[4 TO *] &facet=true &facet.limit=5 &facet.mincount=1 &facet.field=title_s
SELECT title_s, COUNT(*) as cnt FROM movielens WHERE genre_ss='action' AND rating_i='[4 TO *]’ GROUP BY title_s ORDER BY cnt desc LIMIT 5
{ ... "facet_counts":{ "facet_fields":{ "title_s":[ "Star Wars (1977)",501, "Return of the Jedi (1983)",379, "Godfather, The (1972)",351, "Raiders of the Lost Ark (1981)",348, "Empire Strikes Back, The (1980)",293]}, ...}}
{"result-‐set":{"docs":[ {"title_s":"Star Wars (1977)”,"cnt":501}, {"title_s":"Return of the Jedi (1983)","cnt":379}, {"title_s":"Godfather, The (1972)","cnt":351}, {"title_s":"Raiders of the Lost Ark (1981)","cnt":348}, {"title_s":"Empire Strikes Back, The (1980)","cnt":293}, {"EOF":true,"RESPONSE_TIME":42}]}}
SELECT gender_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating FROM movielens WHERE genre_ss='romance' AND age_i='[30 TO *]' GROUP BY gender_s ORDER BY num_ratings desc
SQL Examples
SELECT title_s, genre_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating FROM movielens GROUP BY title_s, genre_s HAVING num_ratings >= 100 ORDER BY avg_rating desc LIMIT 5
SELECT DISTINCT(user_id_i) as user_id FROM movielens WHERE genre_ss='documentary' ORDER BY user_id desc
Give me the avg rating for men and women over 30 for romance movies
Give me the top 5 rated movies with at least 100 ratings
Give me the set of unique users that have rated documentaries
• Perform relational operations on streams
• Stream sources: search, jdbc, facets, stats, topic, gatherNodes
• Stream decorators: complement, daemon, leftOuterJoin, hashJoin, innerJoin, intersect, merge, outerHashJoin, parallel, reduce, random, rollup, select, shortestPath, sort, top, unique, update
Streaming Expressions
• Relies on docValues (column-oriented data structure) and /export handler
• Extreme read performance (8-10x faster than queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture
• SQL interface tier
• Worker tier (scale a pool of worker “nodes” independently of the data collection)
• Data tier (Solr collection)
Streaming API: Nuts and Bolts
parallel(workers, hashJoin( search(movielens, q=*:*, fl="user_id_i,movie_id_i,rating_i", sort="movie_id_i asc", partitionKeys="movie_id_i"), hashed=search(movielens_movies, q=*:*, fl="movie_id_i,title_s,genre_s", sort="movie_id_i asc", partitionKeys="movie_id_i"), on="movie_id_i" ), workers="4", sort="movie_id_i asc" )
Streaming Expression Example: hashJoin
The small “right” side of the join gets loaded into memory on each worker node
Each shard queried by N workers, so 4 workers x 4 shards means 16 queries (usually all replicas per shard are hit)
Workers collection isolates parallel computation nodes from data nodes
Aggregation Modes
• Map/Reduce aggregationMode — for high cardinality aggregations and distributed joins (requires a shuffle phase to move keys to correct worker)
curl -‐-‐data-‐urlencode "stmt=SELECT user_id_i, avg(rating_i) as avg_rating FROM movielens GROUP BY user_id_i" \ “http://host:port/solr/movielens/sql?aggregationMode=map_reduce”
• Facet aggregationMode — Uses JSON facet engine for high performance on low-to-moderate cardinality fields (e.g. movies)
curl -‐-‐data-‐urlencode "stmt=SELECT movie_id_i, avg(rating_i) as avg_rating FROM movielens GROUP BY movie_id_i" \ “http://host:port/solr/movielens/sql?aggregationMode=facet”
• spark-solr project uses streaming API to pull data from Solr into Spark jobs if docValues enabled, see: https://github.com/lucidworks/spark-solr
• Perform aggregations of “signals”, e.g clicks, to compute boosts and recommendations using Spark
• Custom Scala script jobs to perform complex analysis on data in Solr, e.g. sessionize request logs
• Power rich data visualizations using Spark SQL over Solr streaming aggregations
How we use Solr streaming API in Fusion
• Anomaly detection and fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Examples:
• Find all tweets mentioning “Solr” by me or people I follow
• Find all draft blog posts about “Parallel SQL” written by a developer
• Find 3-star hotels in NYC my friends stayed in last year
Graph Use Cases
• Some data is much more naturally represented as a graph structure
• Traditionally hard to deal with in search’s inverted index
• Solr 6.0 introduces the Graph Query Parser
• Solr 6.1 brings Graph Streaming expressions
Graph Basics
• Query-time, cyclic aware graph traversal is able to rank documents based on relationships
• Provides controls for depth, filtering of results and inclusion of root and/or leaves
• Limitations: single node/shard only
• Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&q={!graph+from=in_edge+to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge+traversalFilter='foo:[*+TO+15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge+maxDepth=1}foo:[*+TO+10]
Graph Query Parser
• Part of Solr’s broader Streaming Expressions capability
• Implements a powerful, breadth-first traversal
• Works across shards AND collections
• Supports aggregations
• Cycle aware
Graph Streaming Expressions (Solr 6.1)
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d ‘expr=…’ "http://localhost:18984/solr/movielens/stream"
All movies that user 389 watched
expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")
All the Movies that viewers of Movie 161 watched
expr:gatherNodes(movielens,
gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true")
Movie 161: “The Air Up There”
Collaborative Filtering Example
expr=top(n="5", sort="count(*) desc",
gatherNodes(movielens, top(n="30", sort="count(*) desc",
gatherNodes(movielens,
search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt=“/export"),
walk="movie_id_i->movie_id_i", gather="user_id_i",
maxDocFreq="10000", count(*))),
walk="node->user_id_i", gather="movie_id_i", count(*)))'
Comparing Graph Choices
Solr Elastic Graph Neo4J Spark GraphX
Best Use Case
QParser: predef. relationships as filters Expressions: fast, query-based, dist.
graph ops
Term relationship exploration
Graph ops and querying that fit on a
single nodeLarge-scale, iterative
graph ops
Common Graph Algorithms (e.g.
Pregel, Traversal)Partial No Yes Yes
Scaling QParser: no Expressions: yes Yes Master/Replica Yes
Commercial License Required No Yes GPLv3 No
Visualizations GraphML support (Gephi) Kibana Neo4j browser 3rd party
Comparing Big Data SQL Choices
Solr Hive Drill SparkSQL
Secret SaucePush complex query
constructs into engine (full text, spatial, functions, etc)
Mature SQL solution for Hadoop stack
Execute SQL over NoSQL data sources
Spark core (optimized shuffle, in-memory, etc), integration of
other APIs: ML, Streaming, GraphX
SQL Features Evolving Mature Maturing Maturing
ScalingLinear (shards and replicas) backed by
inverted index
Limited by Hadoop infrastructure (table
scans)Good, but need to
benchmark
Memory intensive; Scale out using Spark
cluster, backed by RDDs
Integration w/ external systems JDBC stream source external tables /
plugin APImany drivers
availableDataSource API, many systems
supported
• Alternate graph traversal approaches, e.g. depth-first
• Possible support for Gremlin (Graph Traversal Language from Tinker Pop)
• Additional graph algorithms (e.g. strongly conn. components, page rank)
Future Work
• No support for pushing >, >=, <, <= operators in WHERE clause down into Solr as range queries; use range syntax [4 TO *] for now
• Using Solr function queries in WHERE clause, e.g. WHERE location_p='{!geofilt d=90 pt=37.773972,-‐122.431297 sfield=location_p}’
• SQL Joins (SOLR-8593)
• Port SQL layer to use Apache Calcite vs. Presto
SQL: Current Limitations and Future Plans