the apache solr smart data ecosystem

Post on 17-Feb-2017


The Apache Solr Smart Data Ecosystem

Trey Grainger
SVP of Engineering, Lucidworks

DFW Data Science 2017.01.09

About Me

Trey Grainger, SVP of Engineering

• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search – Stanford University

Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor

Agenda

• Apache Solr Overview
• Lucidworks Fusion Overview
• Search & Relevancy
  - Keyword Search
  - Text Analysis
  - Multilingual Text Analysis
• Recommendations (Demo)
• Relevancy Spectrum
• Reflected Intelligence
  - Relevancy Tuning
  - Learning to Rank (Demo)
  - Signals (Demo)
• Semantic Search
  - Entity Extraction (Demo)
  - Query Parsing (Demo)
  - Semantic Knowledge Graph (Demo)
• Streaming Expressions
• Solr / Fusion SQL (Demo)
• Solr Graph

what do you do?

Search-Driven Everything

Lucidworks enables Search-Driven Everything:
• Customer Service
• Customer Insights
• Fraud Surveillance
• Research Portal
• Online Retail
• Digital Content

Platform layers: Data Acquisition; Indexing & Streaming; Smart Access API; Recommendations & Alerts; Analytics & Insights; Extreme Relevancy

•Access all your data in a number of ways from one place.

•Secure storage and processing from Solr and Spark.

•Acquire data from any source with pre-built connectors and adapters.

Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.

Apache Solr

“Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”

Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more

*source: Solr in Action, chapter 2

The standard for enterprise search.

90% of Fortune 500 uses Solr.

Lucidworks Fusion


All Your Data

• Over 50 connectors to integrate all your data

• Robust parsing framework to seamlessly ingest all your document types

• Point and click Indexing configuration and iterative simulation of results for full control over your ETL process

• Your security model enforced end-to-end from ingest to search across your different datasources

Experience Management

• Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results.

• Machine-driven relevancy: Signals aggregation learns and automatically tunes relevancy and drives recommendations out of the box.

• Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages.

• Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.

Operational Simplicity

[Architecture diagram: Fusion core services (NLP, Recommenders / Signals, Blob Storage, Pipelines, Scheduling, Alerting / Messaging, Connectors) are exposed through a REST API, the Admin UI, and Lucidworks View, with security built in. They run on Apache Solr (shards), Apache ZooKeeper (leader election, load balancing, shared config management), and an Apache Spark cluster (cluster manager and workers). Data sources: logs, file, web, database, cloud. HDFS is optional.]

• 75% decrease in development time

• Licensing costs cut by 50%

“With Fusion’s out-of-the-box capabilities, we skipped months in our dev cycle so we could focus our team where they would have the most impact. We cut our licensing costs by 50% and improved application usability. The Lucidworks professional services team amplified our success even further. We’re all Fusion from here on out!”

- Lourduraju Pamishetty, Senior IT Application Architect

• Seamless integration of your entire search & analytics platform

• All capabilities exposed through secured APIs, so you can use our UI or build your own.

• End-to-end security policies can be applied out of the box to every aspect of your search ecosystem.

• Distributed, fault-tolerant scaling and supervision of your entire search application


Fusion powers search for the brightest companies in the world.

Lucidworks Fusion

search & relevancy

Basic Keyword Search

The beginning of a typical search journey

The inverted index

What you SEND to Lucene/Solr:

Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term    Documents
a       doc1 [2x]
brown   doc3 [1x], doc5 [1x]
cat     doc4 [1x]
cow     doc2 [1x], doc5 [1x]
…       …
once    doc1 [1x], doc5 [1x]
over    doc2 [1x], doc3 [1x]
the     doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…       …
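The mapping from terms to documents can be sketched in a few lines of Python. This is only a conceptual illustration, not how Lucene stores its index: the regular-expression tokenizer and the `build_inverted_index` helper are assumptions made for the example, and the five documents are the ones from the slide.

```python
import re
from collections import defaultdict

# The five example documents from the slide.
DOCS = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def build_inverted_index(docs):
    """Map each lowercased term to {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Hypothetical tokenizer: lowercase, then keep runs of letters/digits.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


index = build_inverted_index(DOCS)
print(index["the"])
```

Looking up a term is then a single dictionary access, which is why matching by keyword stays fast as the number of documents grows.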

Matching queries to documents

/solr/select/?q=apache solr

Term     Documents
…        …
apache   doc1, doc3, doc4, doc5
hadoop   doc2, doc4, doc6
…        …
solr     doc1, doc3, doc4, doc7, doc8
…        …

[Venn diagram: “apache” only: doc5; “solr” only: doc7, doc8; both “apache” and “solr”: doc1, doc3, doc4]
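Matching a multi-term query against the postings above comes down to set operations. The sketch below is a simplified illustration using the slide's postings; `match_all` and `match_any` are hypothetical helper names, and which behavior a real query gets depends on the query parser's default operator.

```python
# Postings lists from the slide, as sets of document ids.
POSTINGS = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match_all(terms, postings):
    """Documents containing every term (AND semantics)."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def match_any(terms, postings):
    """Documents containing at least one term (OR semantics)."""
    return set().union(*(postings.get(term, set()) for term in terms))


both = match_all(["apache", "solr"], POSTINGS)
either = match_any(["apache", "solr"], POSTINGS)
print(sorted(both), sorted(either))
```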

Text Analysis

Generating terms to index from raw text

Text Analysis in Solr

A text field in Lucene/Solr has an Analyzer containing:

① Zero or more CharFilters
   Take incoming text and “clean it up” before it is tokenized

② One Tokenizer
   Splits incoming text into a Token Stream containing zero or more Tokens

③ Zero or more TokenFilters
   Examine and optionally modify each Token in the Token Stream

*From Solr in Action, Chapter 6
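The three-stage structure of an Analyzer can be mimicked with plain functions. This is a minimal sketch of the idea only: the function names, the tiny stopword list, and the regex-based HTML stripping are assumptions for illustration and do not reproduce Solr's actual filter implementations.

```python
import html
import re


def strip_html(text):
    """Char filter: remove tags and decode entities before tokenizing."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def whitespace_tokenizer(text):
    """Tokenizer: split cleaned text into a list of tokens."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: normalize case."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords=frozenset({"a", "an", "the"})):
    """Token filter: drop very common words (hypothetical stopword list)."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze(
    "<p>The Quick brown fox</p>",
    [strip_html],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
print(terms)
```

Order matters: here the stop filter runs after lowercasing, so “The” is removed; swapping the two filters would let it through.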


Multi-lingual Text Analysis

Analyzing text across multiple languages

Example English Analysis Chains

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Per-language Analysis Chains

*Some of the 32 different language configurations in Appendix B of Solr in Action

Which Stemmer do I choose?

*From Solr in Action, Chapter 14


Common English Stemmers


*From Solr in Action, Chapter 14

When Stemming goes awry

Fixing Stemming Mistakes:
• Unfortunately, every stemmer will have problem cases that aren’t handled as you would expect
• Thankfully, stemmers can be overridden
  • KeywordMarkerFilter: protects a list of terms you specify from being stemmed
  • StemmerOverrideFilter: applies a list of custom term mappings you specify

Alternate strategy:
• Use Lemmatization (root-form analysis) instead of Stemming
• Commercial vendors help tremendously in this space
• The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages
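The two override mechanisms can be illustrated with a toy example. Everything here is a stand-in: `naive_stem` is a deliberately crude suffix stripper (not the Porter or KStem algorithm), and `stem_with_overrides` only imitates the idea behind a protected-words list and a custom mapping list.

```python
def naive_stem(token):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def stem_with_overrides(tokens, protected=frozenset(), overrides=None):
    """Apply custom mappings first, skip protected terms, stem the rest."""
    overrides = overrides or {}
    result = []
    for token in tokens:
        if token in overrides:
            result.append(overrides[token])   # custom term mapping
        elif token in protected:
            result.append(token)              # protected from stemming
        else:
            result.append(naive_stem(token))
    return result


stemmed = stem_with_overrides(
    ["jumping", "news", "mice"],
    protected={"news"},
    overrides={"mice": "mouse"},
)
print(stemmed)
```

Without the protected list, the toy stemmer would turn “news” into “new”, which is the kind of mistake these filters exist to prevent.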

Relevancy

Scoring the results, returning the best matches

Classic Lucene Relevancy Algorithm (now switched to BM25):

*Source: Solr in Action, chapter 3

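\[
\mathrm{Score}(q, d) = \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)
\]

Where: t = term; d = document; q = query; f = field

\[
\begin{aligned}
\mathrm{tf}(t \in d) &= \mathrm{numTermOccurrencesInDocument}^{1/2} \\
\mathrm{idf}(t) &= 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right) \\
\mathrm{coord}(q, d) &= \frac{\mathrm{numTermsInDocumentFromQuery}}{\mathrm{numTermsInQuery}} \\
\mathrm{queryNorm}(q) &= \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}} \\
\mathrm{sumOfSquaredWeights} &= q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2 \\
\mathrm{norm}(t, d) &= d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()
\end{aligned}
\]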

TF * IDF

• Term Frequency: “How well does a term describe a document?”
  – Measure: how often a term occurs per document

• Inverse Document Frequency: “How important is a term overall?”
  – Measure: how rare the term is across all documents

*Source: Solr in Action, chapter 3
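The two measures can be computed directly from the definitions given for the classic formula (square root of the term count, and 1 + log(numDocs / (docFreq + 1))). The sketch below is a simplification: it multiplies tf by idf once and leaves out boosts, norms, coord, queryNorm, and the squared idf of the full score. The sample documents are made up for the example.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: square root of the term's count in the document."""
    return math.sqrt(doc_tokens.count(term))


def idf(term, all_docs):
    """Inverse document frequency: rarer terms get a larger value."""
    doc_freq = sum(1 for tokens in all_docs if term in tokens)
    return 1 + math.log(len(all_docs) / (doc_freq + 1))


def tf_idf(term, doc_tokens, all_docs):
    """Simplified weight: tf * idf, ignoring boosts and normalization."""
    return tf(term, doc_tokens) * idf(term, all_docs)


# Hypothetical, already-analyzed documents.
docs = [
    ["the", "cat", "in", "the", "hat"],
    ["the", "cow"],
    ["a", "cat"],
    ["moon"],
]
print(tf_idf("the", docs[0], docs), tf_idf("hat", docs[0], docs))
```

Here “the” occurs twice in the first document but appears in half of all documents, so its idf is lower than that of the rarer term “hat”.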

That’s great, but what about domain-specific knowledge?

• News search: popularity and freshness drive relevance
• Restaurant search: geographical proximity and price range are critical
• Ecommerce: likelihood of a purchase is key
• Movie search: more popular titles are generally more relevant
• Job search: category of job, salary range, and geographical proximity matter

TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!

Consider what you know about users

John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.

Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry.

Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.

Jane is a nurse educator in Boston seeking between $40K and $60K

*Example from chapter 16 of Solr in Action

Query for Jane

Jane is a nurse educator in Boston seeking between $40K and $60K

http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=(
      jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
    )
    AND (
      (city:"Boston" AND state:"MA")^15 OR state:"MA"
    )
    AND _val_:"map(salary, 40000, 60000, 10, 0)"

*Example from chapter 16 of Solr in Action
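A query like the one for Jane can be assembled from a user profile. The sketch below only builds the URL and sends nothing; `build_job_query` is a hypothetical helper, the boosts are copied from the slide, and no escaping of special query characters is attempted, which real code would need.

```python
from urllib.parse import urlencode


def build_job_query(title, city, state, min_salary, max_salary):
    """Build a boosted Solr query string from simple profile attributes."""
    return (
        f'(jobtitle:"{title}"^25 OR jobtitle:({title})^10) '
        f'AND ((city:"{city}" AND state:"{state}")^15 OR state:"{state}") '
        f'AND _val_:"map(salary, {min_salary}, {max_salary}, 10, 0)"'
    )


params = {
    "fl": "jobtitle,city,state,salary",
    "q": build_job_query("nurse educator", "Boston", "MA", 40000, 60000),
}
# urlencode percent-encodes quotes, spaces, and other reserved characters.
url = "http://localhost:8983/solr/jobs/select/?" + urlencode(params)
print(url)
```

Changing the profile values produces a different personalized query without touching the query template.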

Search Results for Jane

{ ...
  "response":{"numFound":22,"start":0,"docs":[
    {"jobtitle":" Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503},
    {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183},
    {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359}
    …]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action

You just built a recommendation engine!

Demo: Recommendations

The Relevancy Spectrum

[Diagram labels: Traditional Keyword Search; Recommendations; Semantic Search; User Intent; Personalized Search; Augmented Search; Domain-aware Matching]

Data-driven App Sophistication

• Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)
• Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)
• Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)
• Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)
• Self-learning

what is “reflected intelligence”?

The Three C’s

Content: Keywords and other features in your documents

Collaboration: How others have chosen to interact with your system

Context: Available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”

Feedback Loops

User searches → User sees results → User takes an action → Users’ actions inform system improvements

Examples of Reflected Intelligence

● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content

Relevancy Tuning

Improving ranking algorithms through experiments and models

How to Measure Relevancy?

[Venn diagram: A = retrieved documents, C = related documents, B = their overlap]

Precision = B/A

Recall = B/C

Problem: Assume precision is 90% and recall is 100%, but the 10% of irrelevant documents are ranked at the top of the retrieved documents. Is that OK?
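The two ratios are easy to compute from sets of document ids, and doing so makes the problem visible: neither number changes when the result order changes. The document ids below are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Precision = |overlap| / |retrieved|; recall = |overlap| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = retrieved & relevant
    precision = len(overlap) / len(retrieved) if retrieved else 0.0
    recall = len(overlap) / len(relevant) if relevant else 0.0
    return precision, recall


# Ten retrieved documents; d0 is irrelevant and happens to be ranked first.
retrieved = [f"d{i}" for i in range(10)]
relevant = [f"d{i}" for i in range(1, 10)]

print(precision_recall(retrieved, relevant))
# Reversing the ranking gives exactly the same precision and recall.
print(precision_recall(list(reversed(retrieved)), relevant))
```

Because both metrics ignore position, a ranking-aware measure is needed to tell a good ordering from a bad one.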

Normalized Discounted Cumulative Gain

Rankings shown on the slide (labeled “Given” and “Ideal”):

Rank   Relevancy
3      0.95
1      0.70
2      0.60
4      0.45

Rank   Relevancy
1      0.95
2      0.85
3      0.80
4      0.65

• Position is considered in quantifying relevancy.

• A labeled dataset is required.
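A minimal sketch of the metric follows, assuming one common formulation (linear gain with a log2 position discount); other variants exist. The input list reads the first table above as rank-to-relevancy pairs, which is an assumption about how the slide was laid out.

```python
import math


def dcg(relevancies):
    """Discounted cumulative gain: later positions contribute less."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevancies, start=1)
    )


def ndcg(relevancies):
    """DCG of the given order divided by DCG of the ideal (sorted) order."""
    ideal = dcg(sorted(relevancies, reverse=True))
    return dcg(relevancies) / ideal if ideal > 0 else 0.0


# Relevancy labels in the order the results were returned (rank 1 first).
given = [0.70, 0.60, 0.95, 0.45]
print(ndcg(given))
print(ndcg(sorted(given, reverse=True)))
```

An ordering that already places the most relevant documents first scores 1.0; any other ordering of the same labels scores lower.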


Learning to Rank

Learning to Rank (LTR)

● It applies machine learning techniques to discover the combination of features that provides the best ranking.

● It requires a labeled set of documents with relevancy scores for a given set of queries.

● Features used for ranking are usually more computationally expensive than the ones used for matching.

● It typically re-ranks a subset of the matched documents (e.g. top 1000).
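The re-ranking step can be sketched as follows. The linear model, the feature names, and the `rerank_top_n` helper are all assumptions chosen to keep the example short; they stand in for whatever trained model is actually used and are not Solr's LTR API.

```python
def rerank_top_n(ranked_docs, features, weights, n=1000):
    """Re-score only the first n matches with a model; keep the tail as-is."""

    def model_score(doc_id):
        # Hypothetical linear model: weighted sum of per-document features.
        doc_features = features[doc_id]
        return sum(w * doc_features.get(name, 0.0) for name, w in weights.items())

    head = sorted(ranked_docs[:n], key=model_score, reverse=True)
    return head + ranked_docs[n:]


ranked = ["d1", "d2", "d3", "d4"]          # order from the cheap matching phase
features = {
    "d1": {"click_rate": 0.1},
    "d2": {"click_rate": 0.9},
    "d3": {"click_rate": 0.5},
    "d4": {"click_rate": 1.0},
}
weights = {"click_rate": 1.0}

reranked = rerank_top_n(ranked, features, weights, n=3)
print(reranked)
```

Only the top three documents are re-scored here, so `d4` stays last even though the model would rank it first; that is the cost trade-off of re-ranking a subset.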


Common LTR Algorithms

• RankNet* (Neural Network, boosted trees)

• LambdaMart* (set of regression trees)

• SVM Rank** (SVM classifier)

** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf

* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf


LambdaMart Example

Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016

Demo: Solr Learning to Rank

Obtaining Relevancy Judgements

Typical Methodologies

1) Hire employees, contractors, or interns
   - Pros: Accuracy
   - Cons: Expensive; not scalable (cost or man-power-wise); data becomes stale

2) Crowdsource
   - Pros: Less cost, more scalable
   - Cons: Less accurate; data still becomes stale

Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016

Reflected Intelligence: Possible to infer relevancy judgements?

[Figure: a ranked result list for a query (rank 1: Doc1, rank 2: Doc2, rank 3: Doc3, rank 4: Doc4), shown next to a click graph (Query to Doc1, Doc2, Doc3 with edge values 0, 1, 1) and a skip graph (Query to Doc1, Doc2, Doc3 with edge values 1, 0, 0)]

Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016

Automated Relevancy Benchmarking


Demo: Fusion Signals

• 200%+ increase in click-through rates
• 91% lower TCO
• Fewer support tickets
• Increased customer satisfaction

semantic search


Building a Taxonomy of Entities

Many ways to generate this:
• Topic Modelling
• Clustering of documents
• Statistical Analysis of interesting phrases
  - Word2Vec / GloVe / Dice Conceptual Search
• Buy a dictionary (often doesn’t work for domain-specific search problems)
• Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain*

* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

entity extraction


Demo: Solr Text Tagger

semantic query parsing


Probabilistic Query Parser

Goal: given a query, predict which combinations of keywords should be combined together as phrases

Example: senior java developer hadoop

Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"

Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
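The candidate parsings are the ways of cutting the query into contiguous phrases. The sketch below only enumerates them (a four-word query has eight); `possible_parsings` is a hypothetical helper, and the probabilistic part, scoring each candidate with corpus statistics, is not shown.

```python
from itertools import product


def possible_parsings(query):
    """Enumerate every way to split the query into contiguous phrases."""
    words = query.split()
    if not words:
        return []
    parsings = []
    # One break/no-break decision for each gap between adjacent words.
    for breaks in product([True, False], repeat=len(words) - 1):
        phrases, current = [], [words[0]]
        for word, is_break in zip(words[1:], breaks):
            if is_break:
                phrases.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        phrases.append(" ".join(current))
        parsings.append(phrases)
    return parsings


parsings = possible_parsings("senior java developer hadoop")
for phrases in parsings:
    print(phrases)
```

A parser would then keep the candidate whose phrases are most likely given the language model, for example preferring "java developer" over "developer hadoop".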


Demo: Probabilistic Query Parser

Semantic Query Parsing

Identification of phrases in queries using two steps:

1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.*

2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model)

*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

query augmentation


Knowledge Graph

Semantic Data Encoded into Free Text Content


Documents

id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java

id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy

id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate

Terms-Docs Inverted Index

field      term            doc  pos
desc       a               1    4
                           2    1
                           3    1, 5
           at              1    3
                           2    4
           company         1    6
           doing           2    6
                           3    8
           engineer        1    2
                           3    3, 7
           great           1    5
           hard            2    7
           hospital        2    5
           java            3    6
           nurse           2    3
           or              3    4
           registered      2    2
           software        1    1
                           3    2
           work            2    10
                           3    9
job_title  java developer  3    1
…          …               …    …

Docs-Terms Forward Index

field      doc  term
desc       1    a, at, company, engineer, great, software
           2    a, at, doing, hard, hospital, nurse, registered, work
           3    a, doing, engineer, java, or, software, work
job_title  1    Software Engineer
…          …    …

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph: How the Graph Traversal Works

[Figure with three panels. Set-theory view and graph view: skill nodes (Java, Scala, Hibernate, Oncology) are linked by has_related_skill edges through the documents they share (doc 1 to doc 6). Data structure view: Java: docs 1, 2, 6; Scala and Hibernate: docs 3, 4; Oncology: doc 5.]

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Graph Model

Structure:

Single-level Traversal / Scoring:

Multi-level Traversal / Scoring:

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph: Multi-level Traversal

[Figure with two panels. Data structure view: skill terms (Java, Scala, Hibernate, Oncology) resolve to documents (doc 1 to doc 6) through an inverted index lookup; those documents resolve to job_title values (Software Engineer, Data Scientist, Java Developer, …) through a forward index lookup; the traversal continues by alternating forward and inverted index lookups. Graph view: Java is connected to Hibernate and Scala by has_related_skill edges, and the skills are connected to Java Developer, Software Engineer, and Data Scientist by has_related_job_title edges.]

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph: Scoring nodes in the Graph

Foreground vs. Background Analysis

Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

\[
z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot \big(1 - \mathrm{probBG}(x)\big)}}
\]

Foreground Query: "Hadoop"

{ "type":"keywords",
  "values":[
    { "value":"hive", "relatedness": 0.9765, "popularity":369 },
    { "value":"spark", "relatedness": 0.9634, "popularity":15653 },
    { "value":".net", "relatedness": 0.5417, "popularity":17683 },
    { "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
    { "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
    { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }
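The z formula above translates directly into code. This sketch computes only the raw z value; how that value is mapped onto the relatedness scores in the example response is not specified here, and returning 0.0 when the variance is zero is an assumption made to avoid division by zero.

```python
import math


def z_score(count_fg, total_docs_fg, prob_bg):
    """z = (observed foreground count - expected count) / standard deviation."""
    expected = total_docs_fg * prob_bg
    variance = total_docs_fg * prob_bg * (1 - prob_bg)
    if variance == 0:
        return 0.0  # assumption: undefined cases are treated as unrelated
    return (count_fg - expected) / math.sqrt(variance)


# Hypothetical counts: 100 foreground docs, term found in half of all docs.
print(z_score(90, 100, 0.5))   # far more common in the foreground
print(z_score(50, 100, 0.5))   # exactly as common as in the background
print(z_score(10, 100, 0.5))   # far less common in the foreground
```

A positive value means the term is over-represented in the foreground set, a negative value means it is under-represented.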

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph: Multi-level Graph Traversal with Scores

[Figure: starting from the materialized node “software engineer*”, has_related_skill edges lead to skill nodes (Java, C#, .NET) and on to further skill nodes (Hibernate, Scala, VB.NET); has_related_job_title edges lead to job title nodes (.NET Developer, Java Developer, Software Engineer, Data Scientist). Every edge carries a relatedness score; the scores on the slide range from 0.34 to 0.93.]

Knowledge Graph

Use Case: Document Summarization

Experiment: Pass in raw text (extracting phrases as needed), and rank their similarity to the documents using the SKG.

Additionally, can traverse the graph to “related” entities/keyword phrases NOT found in the original document

Applications: Content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring

Demo: Semantic Knowledge Graph


streaming expressions

Streaming Expressions

• Perform relational operations on streams

• Stream sources: search, jdbc, facets, features, gatherNodes, shortestPath, train, model, random, stats, topic

• Stream decorators: classify, commit, complement, daemon, executor, fetch, having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null, outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select, sort, top, unique, update

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Streaming API: Nuts and Bolts

• Relies on docValues (column-oriented data structure) and /export handler
• Extreme read performance (8-10x faster than queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture
  • SQL interface tier
  • Worker tier (scale a pool of worker “nodes” independently of the data collection)
  • Data tier (Solr collection)

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Streaming Expressions - Examples

• Shortest-path Graph Traversal
• Parallel Batch Processing
• Train a Logistic Regression Model
• Distributed Joins
• Rapid Export of all Search Results
• Pull Results from External Database
• Classifying Search Results

Sources:
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Solr SQL

SQL is a natural extension for Solr’s parallel computing engine

• SQL is a ubiquitous language for analytics
• People: Less training and easier to understand
• Tools! Solr as JDBC data source (DbVisualizer, Apache Zeppelin, and SQuirreL SQL)
• Query planning / optimization can evolve iteratively

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Mental Warm-up

Give me the top 5 action movies with rating of 4 or better

Solr facet query:

/select?q=*:*
  &fq=genre_ss:action
  &fq=rating_i:[4 TO *]
  &facet=true
  &facet.limit=5
  &facet.mincount=1
  &facet.field=title_s

{ ...
  "facet_counts":{
    "facet_fields":{
      "title_s":[
        "Star Wars (1977)",501,
        "Return of the Jedi (1983)",379,
        "Godfather, The (1972)",351,
        "Raiders of the Lost Ark (1981)",348,
        "Empire Strikes Back, The (1980)",293]},
  ...}}

Equivalent SQL:

SELECT title_s, COUNT(*) as cnt
  FROM movielens
 WHERE genre_ss='action' AND rating_i='[4 TO *]'
 GROUP BY title_s
 ORDER BY cnt desc
 LIMIT 5

{"result-set":{"docs":[
  {"title_s":"Star Wars (1977)","cnt":501},
  {"title_s":"Return of the Jedi (1983)","cnt":379},
  {"title_s":"Godfather, The (1972)","cnt":351},
  {"title_s":"Raiders of the Lost Ark (1981)","cnt":348},
  {"title_s":"Empire Strikes Back, The (1980)","cnt":293},
  {"EOF":true,"RESPONSE_TIME":42}]}}
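The two responses carry the same information in different shapes: the facet response is a flat list alternating value and count, while the SQL response is a list of row objects followed by an EOF marker. A small sketch, using the values from the slide and a hypothetical `facet_pairs` helper, shows the correspondence.

```python
def facet_pairs(flat, value_key, count_key):
    """Turn a flat [value, count, value, count, ...] list into row dicts."""
    return [
        {value_key: value, count_key: count}
        for value, count in zip(flat[0::2], flat[1::2])
    ]


facet_field = [
    "Star Wars (1977)", 501,
    "Return of the Jedi (1983)", 379,
    "Godfather, The (1972)", 351,
    "Raiders of the Lost Ark (1981)", 348,
    "Empire Strikes Back, The (1980)", 293,
]
sql_docs = [
    {"title_s": "Star Wars (1977)", "cnt": 501},
    {"title_s": "Return of the Jedi (1983)", "cnt": 379},
    {"title_s": "Godfather, The (1972)", "cnt": 351},
    {"title_s": "Raiders of the Lost Ark (1981)", "cnt": 348},
    {"title_s": "Empire Strikes Back, The (1980)", "cnt": 293},
    {"EOF": True, "RESPONSE_TIME": 42},
]

# Drop the trailing EOF marker before comparing the two result shapes.
rows = [doc for doc in sql_docs if "EOF" not in doc]
print(facet_pairs(facet_field, "title_s", "cnt") == rows)
```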

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

SQL Examples

Give me the avg rating for men and women over 30 for romance movies

SELECT gender_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating
  FROM movielens
 WHERE genre_ss='romance' AND age_i='[30 TO *]'
 GROUP BY gender_s
 ORDER BY num_ratings desc

Give me the top 5 rated movies with at least 100 ratings

SELECT title_s, genre_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating
  FROM movielens
 GROUP BY title_s, genre_s
HAVING num_ratings >= 100
 ORDER BY avg_rating desc
 LIMIT 5

Give me the set of unique users that have rated documentaries

SELECT DISTINCT(user_id_i) as user_id
  FROM movielens
 WHERE genre_ss='documentary'
 ORDER BY user_id desc

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Streaming Expression Example: hashJoin

parallel(workers,
  hashJoin(
    search(movielens, q=*:*, fl="user_id_i,movie_id_i,rating_i", sort="movie_id_i asc", partitionKeys="movie_id_i"),
    hashed=search(movielens_movies, q=*:*, fl="movie_id_i,title_s,genre_s", sort="movie_id_i asc", partitionKeys="movie_id_i"),
    on="movie_id_i"
  ),
  workers="4",
  sort="movie_id_i asc"
)

• The small “right” side of the join gets loaded into memory on each worker node

• Each shard is queried by N workers, so 4 workers x 4 shards means 16 queries (usually all replicas per shard are hit)

• The workers collection isolates parallel computation nodes from data nodes
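The join strategy itself is simple to sketch: hold the small side in a hash table and stream the large side past it. The code below is a single-process illustration with made-up rows; it does not model partitioning across workers, and `hash_join` is a hypothetical helper rather than Solr's implementation.

```python
def hash_join(left_rows, right_rows, on):
    """Inner join: hash the small right side, then stream the left side."""
    hashed = {}
    for row in right_rows:                 # small side, held in memory
        hashed.setdefault(row[on], []).append(row)
    for row in left_rows:                  # large side, read once in order
        for match in hashed.get(row[on], []):
            yield {**match, **row}


# Hypothetical sample rows using the field names from the expression.
ratings = [
    {"user_id_i": 1, "movie_id_i": 10, "rating_i": 5},
    {"user_id_i": 2, "movie_id_i": 11, "rating_i": 3},
    {"user_id_i": 3, "movie_id_i": 99, "rating_i": 1},   # no matching movie
]
movies = [
    {"movie_id_i": 10, "title_s": "Movie A", "genre_s": "action"},
    {"movie_id_i": 11, "title_s": "Movie B", "genre_s": "drama"},
]

joined = list(hash_join(ratings, movies, on="movie_id_i"))
for row in joined:
    print(row)
```

Memory use is proportional to the hashed side only, which is why the smaller stream should be the one that is hashed.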


Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

How we use Solr streaming API in Fusion

• The spark-solr project uses the streaming API to pull data from Solr into Spark jobs if docValues are enabled, see: https://github.com/lucidworks/spark-solr

• Perform aggregations of “signals”, e.g. clicks, to compute boosts and recommendations using Spark

• Custom Scala script jobs to perform complex analysis on data in Solr, e.g. sessionize request logs

• Power rich data visualizations using Fusion’s SQL Engine powered by SparkSQL + Solr streaming aggregations

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Comparing SQL Capabilities

Secret Sauce
• Fusion: SparkSQL benefits + Solr benefits + enterprise security
• Solr: Push complex query constructs into engine (full text, spatial, relevancy, graph, functions, etc)
• Hive: Mature SQL solution for Hadoop stack
• Drill: Execute SQL over NoSQL data sources
• SparkSQL: Spark core (optimized shuffle, in-memory, etc), integration of other APIs: ML, Streaming, GraphX

SQL Features
• Fusion: Maturing
• Solr: Evolving
• Hive: Mature
• Drill: Maturing
• SparkSQL: Maturing

Scaling
• Fusion: Linear (shards and replicas) backed by inverted index
• Solr: Linear (shards and replicas) backed by inverted index
• Hive: Limited by Hadoop infrastructure (table scans)
• Drill: Good, but need to benchmark
• SparkSQL: Memory intensive; scale out using Spark cluster, backed by RDDs

Integration w/ external systems
• Fusion: Analytics Catalog API, JDBC Driver, ODBC Bridge
• Solr: JDBC stream source
• Hive: external tables / plugin API
• Drill: many drivers available
• SparkSQL: DataSource API, many systems supported

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Demo: Fusion SQL Engine

Graph

Graph Use Cases
• Anomaly detection / fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Relationship discovery / scoring

Examples
o Find all draft blog posts about “Parallel SQL” written by a developer
o Find all tweets mentioning “Solr” by me or people I follow
o Find 3-star hotels in NYC my friends stayed in last year

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Solr Graph Timeline

• Some data is much more naturally represented as a graph structure

• Solr 6.0: Introduced the Graph Query Parser
• Solr 6.1: Introduced Graph Streaming expressions
…
• Solr 6.3: Current Version
• TBD: Semantic Knowledge Graph (patch available)

Graph Query Parser
• Query-time, cyclic aware graph traversal is able to rank documents based on relationships
• Provides controls for depth, filtering of results and inclusion of root and/or leaves
• Limitations: single node/shard only

Examples:

• http://localhost:8983/solr/graph/query?fl=id,score&q={!graph from=in_edge to=out_edge}id:A

• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A

• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
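The traversal behind queries like these can be pictured as a breadth-first walk: take the `to` field values of the current documents, find documents whose `from` field contains any of them, and repeat. The sketch below is an in-memory illustration of that idea only; the sample documents and the `graph_traverse` helper are assumptions, and it is not the parser's actual implementation.

```python
def graph_traverse(docs, root_ids, from_field, to_field, max_depth=-1):
    """Breadth-first, cycle-aware walk; max_depth < 0 means unlimited."""
    visited = set(root_ids)
    frontier = set(root_ids)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        # Outgoing edge values of every document reached in the last step.
        edge_values = {
            value for doc_id in frontier for value in docs[doc_id].get(to_field, [])
        }
        # Unvisited documents whose incoming edge field matches one of them.
        frontier = {
            doc_id
            for doc_id, doc in docs.items()
            if doc_id not in visited and edge_values & set(doc.get(from_field, []))
        }
        visited |= frontier
        depth += 1
    return visited


# Hypothetical documents: A -> B -> C -> B (a cycle); D is unreachable.
docs = {
    "A": {"out_edge": ["1"]},
    "B": {"in_edge": ["1"], "out_edge": ["2"]},
    "C": {"in_edge": ["2"], "out_edge": ["1"]},
    "D": {"in_edge": ["9"]},
}

print(sorted(graph_traverse(docs, {"A"}, "in_edge", "out_edge")))
print(sorted(graph_traverse(docs, {"A"}, "in_edge", "out_edge", max_depth=1)))
```

The visited set is what makes the walk cycle aware: document B is reached again from C but is not expanded a second time.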


Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Graph Streaming Expressions
• Part of Solr’s broader Streaming Expressions capability
• Implements a powerful, breadth-first traversal
• Works across shards AND collections
• Supports aggregations
• Cycle aware

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'expr=…' "http://localhost:18984/solr/movielens/stream"

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

All movies that user 389 watched

expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

All movies that viewers of a specific movie watched

Movie 161: “The Air Up There”

expr:gatherNodes(movielens,
  gatherNodes(movielens, walk="161->movie_id_i", gather="user_id_i"),
  walk="node->user_id_i", gather="movie_id_i",
  trackTraversal="true")

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Collaborative Filtering

expr=top(n="5", sort="count(*) desc",
  gatherNodes(movielens,
    top(n="30", sort="count(*) desc",
      gatherNodes(movielens,
        search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt="/export"),
        walk="movie_id_i->movie_id_i", gather="user_id_i", maxDocFreq="10000", count(*)
      )
    ),
    walk="node->user_id_i", gather="movie_id_i", count(*)
  )
)
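The same three steps (the user's movies, the users who share the most of them, the movies those users rated most) can be mirrored in memory. The sketch below uses invented ratings and a hypothetical `recommend` helper; it ignores `maxDocFreq`, and it does not filter out the seed user or movies the user has already rated.

```python
from collections import Counter


def recommend(ratings, user_id, n_users=30, n_movies=5):
    """Co-occurrence recommendation over (user, movie) rating pairs."""
    # Step 1: movies the seed user has rated.
    seed_movies = {movie for user, movie in ratings if user == user_id}
    # Step 2: users who rated those movies, keeping the n_users with most overlap.
    user_counts = Counter(user for user, movie in ratings if movie in seed_movies)
    similar_users = {user for user, _ in user_counts.most_common(n_users)}
    # Step 3: movies rated by those users, keeping the n_movies most frequent.
    movie_counts = Counter(movie for user, movie in ratings if user in similar_users)
    return movie_counts.most_common(n_movies)


# Hypothetical (user, movie) pairs.
ratings = [
    (1, "a"), (1, "b"),
    (2, "a"), (2, "b"), (2, "c"),
    (3, "a"), (3, "c"),
    (4, "z"),
]

recs = recommend(ratings, user_id=1)
print(recs)
```

User 4 shares no movies with user 1, so movie "z" never enters the candidate set.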


Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Comparing Graph Choices

Best Use Case
• Solr: QParser: predef. relationships as filters; Expressions: fast, query-based, dist. graph ops
• Elastic Graph: Limited to sequential, term relatedness exploration only
• Neo4J: Graph ops and querying that fit on a single node
• Spark GraphX: Large-scale, iterative graph ops

Common Graph Algorithms (e.g. Pregel, Traversal)
• Solr: Partial
• Elastic Graph: No
• Neo4J: Yes
• Spark GraphX: Yes

Scaling
• Solr: QParser: Co-located shards only; Expressions: Yes
• Elastic Graph: Yes
• Neo4J: Master/Replica
• Spark GraphX: Yes

Commercial License Required
• Solr: No
• Elastic Graph: Yes
• Neo4J: GPLv3
• Spark GraphX: No

Visualizations
• Solr: GraphML support (e.g. Gephi)
• Elastic Graph: Kibana
• Neo4J: Neo4j browser
• Spark GraphX: 3rd party

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Additional References:

Contact Info

Trey Grainger
trey.grainger@lucidworks.com
@treygrainger

http://solrinaction.com
Meetup discount (39% off): 39grainger

Other presentations: http://www.treygrainger.com