solr 6.0 graph query overview

Post on 07-Jan-2017

Solr 6.0 Graph Query Overview

Kevin Watters
KMW Technology
kwatters@kmwllc.com
http://www.kmwllc.com/
03/29/2016

KMW Technology Overview
Boston-based software consulting and professional services organization.
Founded in 2010.
Seven consultants with deep industry experience.
Boutique firm specializing in Search and Big Data technologies.
Custom Connectors, Pipelines, Search, Analytics, and UI development.

Search vs. Join vs. Graph
Which query should I use?
Search is for flat data with no relationships.
◦ Data is often de-normalized; updates can potentially require large amounts of re-indexing.
Join is for one level of relationships.
◦ Data is normalized, but when more than 2 tables are involved, join queries must be nested.
Graph is for arbitrary depth/levels of relationships.
◦ Data can be completely normalized; arbitrary numbers of tables can be joined together.
A one-level hop on a graph is roughly equivalent to a join query.

What is a Graph?
A generic representation of all data models. “One data model to rule them all”!
G = <V,E> ?!?!
Vertices/Nodes
◦ Can have properties as key-value pairs.
Edges
◦ Can have properties as key-value pairs.

Graph Traversal
There are many graph traversal / exploration algorithms: DFS, BFS, A*, Alpha-beta, etc.
Solr graph query implements “BFS”
Breadth-first search: each hop expands the “frontier” of the graph. It explores all current edges in a single step, also known as a “hop”.
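To make the frontier idea concrete, here is a minimal in-memory sketch of a hop-by-hop breadth-first traversal. It is not Solr's implementation: the function name `bfs_hops`, the `edges` dictionary, and the `max_depth` argument are assumptions made for illustration, with `max_depth=-1` meaning "no limit".

```python
def bfs_hops(edges, roots, max_depth=-1):
    """Breadth-first traversal that expands one frontier per hop.

    edges: dict mapping a node id to the list of node ids it points to.
    roots: iterable of starting node ids.
    max_depth: number of hops to follow; -1 means no limit.
    """
    visited = set(roots)    # every node collected so far
    frontier = set(roots)   # nodes discovered in the previous hop
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        next_frontier = set()
        for node in frontier:
            for neighbor in edges.get(node, []):
                if neighbor not in visited:   # already seen: do not revisit
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
        depth += 1
    return visited


# Toy graph (made up for this example); note the cycle d -> a.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
print(sorted(bfs_hops(edges, ["a"], max_depth=1)))  # one hop from "a"
print(sorted(bfs_hops(edges, ["a"])))               # traverse until exhausted
```

Because nodes that were already visited never re-enter the frontier, the loop terminates even though the toy graph contains a cycle.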

Key Features and Design Goals
“Graph is a Filter on top of your data” -someone
Designed for large scale, large numbers of edges, and very deep traversals.
Limited memory usage for traversal
Cycle detection for “free”
Highly cacheable
Supports multiValued fields for nodes and/or edges
Supports filters during the traversal
Follow Every Edge! No edge left behind!
Works with Facets & Facet Queries!

A Word about Memory Usage
One bit set to rule them all!
BitSet provides cycle detection implicitly. (Have I been here before?)
BitSet is equal to the size of the index (one bit per document).
A 100 million doc index only uses about 12 MB per query! (Same size as 1 filter cache entry!)
Additional bitsets may be used during query execution depending on query params (leaf nodes and root nodes bitsets).
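The "about 12 MB" figure follows from storing one bit per document. A quick back-of-the-envelope check, assuming exactly one bit per document and ignoring any object overhead (the helper name `bitset_bytes` is made up for this sketch):

```python
def bitset_bytes(num_docs):
    """Bytes needed for a bit set holding one bit per document."""
    return (num_docs + 7) // 8   # round up to whole bytes


size = bitset_bytes(100_000_000)
print(size)                   # bytes for a 100 million doc index
print(size / 1_000_000)       # the same size in decimal megabytes
print(size / (1024 * 1024))   # the same size in binary mebibytes
```

Each additional bit set used by a query (for example for leaf or root nodes) would add roughly the same amount again.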

Graph Query Parser Syntax

Parameter        Default  Description
from                      Field containing the node id.
to                        Field containing the edge id(s).
maxDepth         -1       Number of hops to traverse from the root of the graph. -1 means traverse until all edges and documents have been collected. maxDepth=1 behaves similarly to a JOIN.
traversalFilter  null     Arbitrary query string to apply at each hop of the traversal.
returnRoot       true     true|false: whether the documents matching the root query should be returned.
returnOnlyLeaf   false    true|false: return only documents in the result set that do not have a value in the “to” field.
useAutn          true     Performance trade-off based on use case. Mileage may vary.

Uses Solr’s query parser plugin and “local params” syntax:
{!graph param="value" … }
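As a rough illustration of the local params syntax, the sketch below assembles a `{!graph ...}` query string from Python values. The helper `graph_query` is hypothetical (it is not part of Solr or any client library), and it does no escaping, so values containing double quotes would need extra handling.

```python
def graph_query(root_query, from_field, to_field, **params):
    """Build a {!graph ...} local-params query string.

    Extra keyword arguments (for example maxDepth or returnRoot) are
    appended as additional local params. Booleans are lower-cased.
    """
    parts = ['from="%s"' % from_field, 'to="%s"' % to_field]
    for name, value in params.items():
        if isinstance(value, bool):
            value = str(value).lower()   # True -> "true"
        parts.append('%s="%s"' % (name, value))
    return "{!graph %s}%s" % (" ".join(parts), root_query)


# Field names and the root query here are placeholders for this example.
print(graph_query("id:doc_1", "node_id", "edge_ids", maxDepth=1, returnRoot=False))
```

The part after the closing brace is the root query; the traversal starts from the documents it matches.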

Princeton Wordnet
Princeton Wordnet has an ontology for many of the words in the English language. These relationships contain hierarchies of words that represent more general and more specific classes of relationships. https://wordnet.princeton.edu/
Words have a “sense”, or meaning.
A hypernym is a more general related word.
A hyponym is a more specific related word.
◦ Jaguar is a type of Cat
◦ Large Cat is a type of Animal
Intersections of this hierarchy can answer questions: “Is a jaguar an animal?”

Wordnet Hypernym Traversal
Start traversing from the word sense “jaguar” up the hypernym graph 9 levels.
+{!graph from="synset_id" to="hypernym_id" maxDepth=9}sense_lemma:jaguar

Wordnet Graph Intersections
Is a jaguar an animal? Query for an intersection between the two graphs.
If a graph intersection exists, the answer is yes!
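The intersection idea can be sketched in memory as follows. The `hypernyms` data is a made-up toy hierarchy rather than real WordNet content, the function names are hypothetical, and the second "graph" is simplified to the single node matching the category.

```python
# Toy hypernym links (word -> more general words); illustrative data only.
hypernyms = {
    "jaguar": ["large cat"],
    "large cat": ["animal"],
    "sports car": ["car"],
    "car": ["vehicle"],
}


def ancestors(word, max_depth=9):
    """Collect the word plus everything reachable within max_depth hops up."""
    seen, frontier = {word}, {word}
    for _ in range(max_depth):
        frontier = {h for w in frontier for h in hypernyms.get(w, [])} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


def is_a(word, category):
    """True when the traversal from word intersects the category node."""
    return bool(ancestors(word) & {category})


print(is_a("jaguar", "animal"))
print(is_a("sports car", "animal"))
```

In Solr terms, this corresponds to requiring both the graph traversal clause and the clause matching the category in the same query; a non-empty result means the intersection exists.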

OpenCV, Video Recognition
Imagine indexing each frame of video from security cameras. Pass each frame of video through OpenCV for object recognition & face recognition.
Each frame record stores its own frame number and the frame number of the previous frame.
Search for object/face “A” detected, followed by object/face “B” detected, across all of your video streams.

Users, Items and Actions
Model your browsing/purchase history as
◦ Users (have an ID)
◦ Items (have an ID, metadata, category, etc.)
◦ Actions (link between a user and items, such as rating, purchase, like/dislike)
User -> Action -> Item -> Action -> User …
Use Graph + maxDepth to get from a user to an item. maxDepth = 2 gets from a user to an item. maxDepth = 4 gets from one user to a new set of users, and so on.
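The effect of maxDepth on this model can be tried out with a small in-memory sketch. The documents, ids, and the `traverse` helper below are hypothetical; links are stored in both directions here (an assumption of this example) so that the traversal can continue from an item back out to other users.

```python
# Hypothetical records: node_id identifies a record, edge_ids lists the
# records it links to.
docs = [
    {"node_id": "user_1", "edge_ids": ["act_1"]},
    {"node_id": "user_2", "edge_ids": ["act_2"]},
    {"node_id": "act_1", "edge_ids": ["user_1", "item_1"]},
    {"node_id": "act_2", "edge_ids": ["user_2", "item_1"]},
    {"node_id": "item_1", "edge_ids": ["act_1", "act_2"]},
]


def traverse(root_ids, max_depth, return_root=True):
    """Follow edge_ids -> node_id links for max_depth hops."""
    by_id = {d["node_id"]: d for d in docs}
    seen, frontier = set(root_ids), set(root_ids)
    for _ in range(max_depth):
        # Collect edge values from the frontier, then match them to node ids.
        targets = {e for n in frontier for e in by_id.get(n, {}).get("edge_ids", [])}
        frontier = {t for t in targets if t in by_id} - seen
        seen |= frontier
    return seen if return_root else seen - set(root_ids)


print(sorted(traverse(["user_1"], 2)))                     # reaches the item
print(sorted(traverse(["user_1"], 4, return_root=False)))  # reaches user_2
```

Two hops stop at the item, four hops reach the other user, and `return_root=False` mirrors the returnRoot parameter by dropping the starting user from the result.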

Actions occur over time
These events can’t easily be aggregated or flattened onto a record.
Model this as a “person” record, with a set of “action” records.
Each action record has the id of the “previous” action.
Search for an action, graph traverse based on person id to another action, then finally to the person record.

Find similar users
Graph traversal from a user (or set of users) through their actions to items they like, to find similar users, and out to items they like.
Now, exclude the original starting set with “returnRoot=false”.

Graph Query For Security
Graph queries are elegant and simple to use for traversing security hierarchies such as LDAP and AD, as well as custom security models that are hierarchical or folder-based in nature.

Example Company with Security Model

Document/Security Model within the Solr Index

Graph Traversal for User 1

Graph Traversal for User 2

Security Query
A single security query term traverses the entire graph:
{!graph from="node_id" to="edge_ids" returnOnlyLeaf="true"}id:user_1
The graph query is applied as a filter query on the request; the normal query is used for matching against documents.
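A sketch of how the request parameters might be assembled, with the user's search in `q` and the security traversal in `fq`. The helper name `secured_search_params` and the sample search text are assumptions for illustration; no request is sent, and a real application would need to validate or escape the user id before placing it in the query.

```python
from urllib.parse import urlencode


def secured_search_params(user_query, user_id):
    """Combine a user's search with a graph-based security filter."""
    security_filter = (
        '{!graph from="node_id" to="edge_ids" returnOnlyLeaf="true"}'
        "id:%s" % user_id
    )
    # q carries the normal search; fq restricts it to reachable documents.
    return {"q": user_query, "fq": security_filter}


params = secured_search_params("quarterly report", "user_1")
print(params["fq"])
print(urlencode(params))   # URL-encoded form, ready to append to a request
```

Keeping the security traversal in `fq` means it does not affect relevancy scoring and can be cached independently of the user's search terms.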

FoaF
Friend of a Friend of a Friend of a Friend…
2 ways to model in the index:
Multi-valued “friendid” field that points to other person records.
◦ More efficient and faster search.
◦ Filter traversal based on metadata on the person record.
Single-valued field on a document that represents the link/edge between two person records.
◦ More flexible, but slower search.
◦ Can filter edges with metadata about the edge record.

Graph Analytics via Faceting
What do my friend’s friends like that live in Boston?
Identify a graph/dataset with a graph query to identify the people records.
Use facets to generate analytics on the result set based on the values in the person record “like” field.
Use drill down to understand characteristics of different demographics/cohorts.
Get counts at various levels using maxDepth graph queries as facet queries.
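One way to get counts per level is to send one facet query per maxDepth value. The sketch below only builds the parameter list; the helper name, the `id` and `friendid` field names, and the root query are assumptions for illustration.

```python
from urllib.parse import urlencode


def depth_facet_params(root_query, from_field, to_field, depths):
    """Build request params with one graph facet query per depth."""
    params = [("q", "*:*"), ("rows", "0"), ("facet", "true")]
    for depth in depths:
        facet_query = '{!graph from="%s" to="%s" maxDepth="%d"}%s' % (
            from_field, to_field, depth, root_query)
        params.append(("facet.query", facet_query))
    return params


# A list of pairs is used because facet.query repeats in the request.
params = depth_facet_params("id:person_1", "id", "friendid", [1, 2, 3])
for name, value in params:
    print(name, "=", value)
print(urlencode(params))
```

Each facet query would report how many documents are reachable within that many hops, so differences between consecutive counts indicate how many records each extra hop adds.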

What next?
Edge weights & Relevancy
◦ Based on tf/idf or bm25?
◦ Based on numerical field values (min/max/sum/avg weight application)?
Min distance computation
Better support for D3.js and other visualization tools
Driving directions?
Distributed traversal via Kafka frontier query broker
Spark RDD support? GraphX?
minDepth parameter? Only return records that are at least N hops away?
