webinar: solr 6 deep dive - sql and graph

31

Upload: lucidworks

Post on 07-Jan-2017

1.295 views

Category:

Technology


1 download

TRANSCRIPT

2016

OCTOBER 11-14BOSTON, MA

http://lucenerevolution.com

Solr 6 Deep Dive: SQL and Graph

Grant Ingersoll

@gsingers

CTO, Lucidworks

Tim Potter

@thelabdude

Sr. Software Engineer, Lucidworks

• Motivations

• Streaming Expressions and Parallel SQL

• Graph Capabilities

• How does this compare to…?

• Future Directions

Agenda

Search-Driven Everything

Customer Service

Customer Insights

Fraud Surveillance

Research Portal

Online Retail Digital Content

• Big data systems have grown too complex trying to satisfy a variety of access patterns

• Fast primary key lookups / atomic updates (Solr, HBase, Cassandra, …)

• Low-latency ranked retrieval (Solr, Elastic, DataStax, …)

• Large, distributed table scans (Spark, M/R, Pig, Cassandra, Hive, Impala, …)

• Graph traversal (Graphx, Giraph, Neo4j, …)

• De-normalization can be inconvenient as related data sets can change at different velocities (movies vs. movie ratings)

• Leverage progress made by the Solr community to support big data in Solr using horizontal scalability (shards & replicas)

• Don’t forget about speed ~ Search engines in general and Solr in particular are extremely fast!

Why Solr needs Parallel Computation

Lucidworks Fusion Is Search-Driven Everything

• Drive next generation relevance via Content, Collaboration and Context

• Harness best in class Open Source: Apache Solr + Spark

• Simplify application development and reduce ongoing maintenance

CATALOG

DYNAMIC NAVIGATION AND LANDING PAGES

INSTANT INSIGHTS AND ANALYTICS

PERSONALIZED SHOPPING EXPERIENCE

PROMOTIONS USER HISTORY

Data Acquisition

Indexing & Streaming

Smart Access API

Recommendations & Alerts

Analytics & InsightsExtreme Relevancy

Access data from anywhere to build intelligent, data-driven applications.

Fusion Architecture

REST

API

Worker Worker Cluster Mgr.

Apache Spark

Shards Shards

Apache Solr

HD

FS (O

ptio

nal)

Shared Config Mgmt

Leader Election

Load Balancing

ZK 1

Apache Zookeeper

ZK N

DATABASEWEBFILELOGSHADOOP CLOUD

Connectors

Alerting/Messaging

NLP

Pipelines

Blob Storage

Scheduling

Recommenders/Signals

Core Services

Admin UI

SECURITY BUILT-IN

Lucidworks View

Streaming Expressions and SQL

• SQL is ubiquitous language for analytics

• People: Less training and easier to understand

• Tools! Solr as JDBC data source (DbVisualizer, Apache Zeppelin, and SQuirreL SQL)

• Query planning / optimization can evolve iteratively

SQL is natural extension for Solr’s parallel computing engine

Give me the top 5 action movies with rating of 4 or better

Mental Warm-up

/select?q=*:* &fq=genre_ss:action &fq=rating_i:[4 TO *] &facet=true &facet.limit=5 &facet.mincount=1 &facet.field=title_s

SELECT title_s, COUNT(*) as cnt FROM movielens WHERE genre_ss='action' AND rating_i='[4 TO *]’ GROUP BY title_s ORDER BY cnt desc LIMIT 5

{  ...        "facet_counts":{          "facet_fields":{              "title_s":[                  "Star  Wars  (1977)",501,                  "Return  of  the  Jedi  (1983)",379,                  "Godfather,  The  (1972)",351,                  "Raiders  of  the  Lost  Ark  (1981)",348,                  "Empire  Strikes  Back,  The  (1980)",293]},          ...}}

{"result-­‐set":{"docs":[  {"title_s":"Star  Wars  (1977)”,"cnt":501},  {"title_s":"Return  of  the  Jedi  (1983)","cnt":379},  {"title_s":"Godfather,  The  (1972)","cnt":351},  {"title_s":"Raiders  of  the  Lost  Ark  (1981)","cnt":348},  {"title_s":"Empire  Strikes  Back,  The  (1980)","cnt":293},  {"EOF":true,"RESPONSE_TIME":42}]}}

   SELECT  gender_s,  COUNT(*)  as  num_ratings,  avg(rating_i)  as  avg_rating            FROM  movielens          WHERE  genre_ss='romance'  AND  age_i='[30  TO  *]'  GROUP  BY  gender_s    ORDER  BY  num_ratings  desc

SQL Examples

   SELECT  title_s,  genre_s,  COUNT(*)  as  num_ratings,  avg(rating_i)  as  avg_rating            FROM  movielens    GROUP  BY  title_s,  genre_s        HAVING  num_ratings  >=  100    ORDER  BY  avg_rating  desc          LIMIT  5

   SELECT  DISTINCT(user_id_i)  as  user_id            FROM  movielens          WHERE  genre_ss='documentary'    ORDER  BY  user_id  desc

Give me the avg rating for men and women over 30 for romance movies

Give me the top 5 rated movies with at least 100 ratings

Give me the set of unique users that have rated documentaries

• Perform relational operations on streams

• Stream sources: search, jdbc, facets, stats, topic, gatherNodes

• Stream decorators: complement, daemon, leftOuterJoin, hashJoin, innerJoin, intersect, merge, outerHashJoin, parallel, reduce, random, rollup, select, shortestPath, sort, top, unique, update

Streaming Expressions

• Relies on docValues (column-oriented data structure) and /export handler

• Extreme read performance (8-10x faster than queries using cursorMark)

• Facet or map/reduce style aggregation modes

• Tiered architecture

• SQL interface tier

• Worker tier (scale a pool of worker “nodes” independently of the data collection)

• Data tier (Solr collection)

Streaming API: Nuts and Bolts

parallel(workers,      hashJoin(          search(movielens,  q=*:*,                              fl="user_id_i,movie_id_i,rating_i",                              sort="movie_id_i  asc",                              partitionKeys="movie_id_i"),          hashed=search(movielens_movies,  q=*:*,                                                fl="movie_id_i,title_s,genre_s",                                                sort="movie_id_i  asc",                                                partitionKeys="movie_id_i"),          on="movie_id_i"      ),      workers="4",      sort="movie_id_i  asc"  )  

Streaming Expression Example: hashJoin

The small “right” side of the join gets loaded into memory on each worker node

Each shard queried by N workers, so 4 workers x 4 shards means 16 queries (usually all replicas per shard are hit)

Workers collection isolates parallel computation nodes from data nodes

Aggregation Modes

• Map/Reduce aggregationMode — for high cardinality aggregations and distributed joins (requires a shuffle phase to move keys to correct worker)

curl  -­‐-­‐data-­‐urlencode  "stmt=SELECT  user_id_i,  avg(rating_i)  as  avg_rating  FROM  movielens  GROUP  BY  user_id_i"  \  “http://host:port/solr/movielens/sql?aggregationMode=map_reduce”  

• Facet aggregationMode — Uses JSON facet engine for high performance on low-to-moderate cardinality fields (e.g. movies)

curl  -­‐-­‐data-­‐urlencode  "stmt=SELECT  movie_id_i,  avg(rating_i)  as  avg_rating  FROM  movielens  GROUP  BY  movie_id_i"  \      “http://host:port/solr/movielens/sql?aggregationMode=facet”

• spark-solr project uses streaming API to pull data from Solr into Spark jobs if docValues enabled, see: https://github.com/lucidworks/spark-solr

• Perform aggregations of “signals”, e.g clicks, to compute boosts and recommendations using Spark

• Custom Scala script jobs to perform complex analysis on data in Solr, e.g. sessionize request logs

• Power rich data visualizations using Spark SQL over Solr streaming aggregations

How we use Solr streaming API in Fusion

Graph

• Anomaly detection and fraud detection

• Recommenders

• Social network analysis

• Graph Search

• Access Control

• Examples:

• Find all tweets mentioning “Solr” by me or people I follow

• Find all draft blog posts about “Parallel SQL” written by a developer

• Find 3-star hotels in NYC my friends stayed in last year

Graph Use Cases

• Some data is much more naturally represented as a graph structure

• Traditionally hard to deal with in search’s inverted index

• Solr 6.0 introduces the Graph Query Parser

• Solr 6.1 brings Graph Streaming expressions

Graph Basics

• Query-time, cyclic aware graph traversal is able to rank documents based on relationships

• Provides controls for depth, filtering of results and inclusion of root and/or leaves

• Limitations: single node/shard only

• Examples:

• http://localhost:8983/solr/graph/query?fl=id,score&q={!graph+from=in_edge+to=out_edge}id:A

• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge+traversalFilter='foo:[*+TO+15]'}id:A

• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge+maxDepth=1}foo:[*+TO+10]

Graph Query Parser

• Part of Solr’s broader Streaming Expressions capability

• Implements a powerful, breadth-first traversal

• Works across shards AND collections

• Supports aggregations

• Cycle aware

Graph Streaming Expressions (Solr 6.1)

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d ‘expr=…’ "http://localhost:18984/solr/movielens/stream"

All movies that user 389 watched

expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")

All the Movies that viewers of Movie 161 watched

expr:gatherNodes(movielens,

gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"),

walk="node->user_id_i",gather="movie_id_i", trackTraversal="true")

Movie 161: “The Air Up There”

Collaborative Filtering Example

expr=top(n="5", sort="count(*) desc",

gatherNodes(movielens, top(n="30", sort="count(*) desc",

gatherNodes(movielens,

search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt=“/export"),

walk="movie_id_i->movie_id_i", gather="user_id_i",

maxDocFreq="10000", count(*))),

walk="node->user_id_i", gather="movie_id_i", count(*)))'

Comparisons

Comparing Graph Choices

Solr Elastic Graph Neo4J Spark GraphX

Best Use Case

QParser: predef. relationships as filters Expressions: fast, query-based, dist.

graph ops

Term relationship exploration

Graph ops and querying that fit on a

single nodeLarge-scale, iterative

graph ops

Common Graph Algorithms (e.g.

Pregel, Traversal)Partial No Yes Yes

Scaling QParser: no Expressions: yes Yes Master/Replica Yes

Commercial License Required No Yes GPLv3 No

Visualizations GraphML support (Gephi) Kibana Neo4j browser 3rd party

Comparing Big Data SQL Choices

Solr Hive Drill SparkSQL

Secret SaucePush complex query

constructs into engine (full text, spatial, functions, etc)

Mature SQL solution for Hadoop stack

Execute SQL over NoSQL data sources

Spark core (optimized shuffle, in-memory, etc), integration of

other APIs: ML, Streaming, GraphX

SQL Features Evolving Mature Maturing Maturing

ScalingLinear (shards and replicas) backed by

inverted index

Limited by Hadoop infrastructure (table

scans)Good, but need to

benchmark

Memory intensive; Scale out using Spark

cluster, backed by RDDs

Integration w/ external systems JDBC stream source external tables /

plugin APImany drivers

availableDataSource API, many systems

supported

Future Work

• Alternate graph traversal approaches, e.g. depth-first

• Possible support for Gremlin (Graph Traversal Language from Tinker Pop)

• Additional graph algorithms (e.g. strongly conn. components, page rank)

Future Work

• No support for pushing >, >=, <, <= operators in WHERE clause down into Solr as range queries; use range syntax [4 TO *] for now

• Using Solr function queries in WHERE clause, e.g. WHERE  location_p='{!geofilt  d=90  pt=37.773972,-­‐122.431297  sfield=location_p}’  

• SQL Joins (SOLR-8593)

• Port SQL layer to use Apache Calcite vs. Presto

SQL: Current Limitations and Future Plans