The Apache Solr Smart Data Ecosystem Trey Grainger SVP of Engineering, Lucidworks DFW Data Science 2017.01.09


Post on 17-Feb-2017


Page 1: The Apache Solr Smart Data Ecosystem

The Apache Solr Smart Data Ecosystem

Trey Grainger

SVP of Engineering, Lucidworks

DFW Data Science 2017.01.09

Page 2: The Apache Solr Smart Data Ecosystem

About Me

Trey Grainger
SVP of Engineering

• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search – Stanford University

Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor

Page 3: The Apache Solr Smart Data Ecosystem

Agenda

• Apache Solr Overview
• Lucidworks Fusion Overview
• Search & Relevancy
  - Keyword Search
  - Text Analysis
  - Multilingual Text Analysis
• Recommendations (Demo)
• Relevancy Spectrum
• Reflected Intelligence
  - Relevancy Tuning
  - Learning to Rank (Demo)
  - Signals (Demo)
• Semantic Search
  - Entity Extraction (Demo)
  - Query Parsing (Demo)
  - Semantic Knowledge Graph (Demo)
• Streaming Expressions
• Solr / Fusion SQL (Demo)
• Solr Graph

DFW Data Science

Page 4: The Apache Solr Smart Data Ecosystem

what do you do?

Page 5: The Apache Solr Smart Data Ecosystem
Page 6: The Apache Solr Smart Data Ecosystem

Search-Driven

Everything

Customer Service

Customer Insights

Fraud Surveillance

Research Portal

Online Retail

Digital Content

Page 7: The Apache Solr Smart Data Ecosystem

Lucidworks enables Search-Driven Everything

Data Acquisition

Indexing & Streaming

Smart Access API

Recommendations & Alerts

Analytics & Insights

Extreme Relevancy

CUSTOMER SERVICE

RESEARCH PORTAL

DIGITAL CONTENT

CUSTOMER INSIGHTS

FRAUD SURVEILLANCE

ONLINE RETAIL

• Access all your data in a number of ways from one place.

• Secure storage and processing from Solr and Spark.

• Acquire data from any source with pre-built connectors and adapters.

Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.

Page 8: The Apache Solr Smart Data Ecosystem

Apache Solr

Page 9: The Apache Solr Smart Data Ecosystem

“Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”

Page 10: The Apache Solr Smart Data Ecosystem

Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more

*source: Solr in Action, chapter 2

Page 11: The Apache Solr Smart Data Ecosystem

The standard for enterprise search.

90% of the Fortune 500 uses Solr.

Page 12: The Apache Solr Smart Data Ecosystem

Lucidworks Fusion

Page 13: The Apache Solr Smart Data Ecosystem


Page 14: The Apache Solr Smart Data Ecosystem
Page 15: The Apache Solr Smart Data Ecosystem

All Your Data

Page 16: The Apache Solr Smart Data Ecosystem

• Over 50 connectors to integrate all your data

• Robust parsing framework to seamlessly ingest all your document types

• Point and click Indexing configuration and iterative simulation of results for full control over your ETL process

• Your security model enforced end-to-end from ingest to search across your different datasources

Page 17: The Apache Solr Smart Data Ecosystem

Experience Management

Page 18: The Apache Solr Smart Data Ecosystem

• Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results.

• Machine-driven relevancy: Signals aggregation learns and automatically tunes relevancy and drives recommendations out of the box.

• Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages.

• Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.

Page 19: The Apache Solr Smart Data Ecosystem

Operational Simplicity

Page 20: The Apache Solr Smart Data Ecosystem

[Architecture diagram: Lucidworks Fusion, with security built in. Components shown: Apache Solr (shards); Apache ZooKeeper (leader election, load balancing, shared config management); Apache Spark (cluster manager, workers); HDFS (optional); Core Services (NLP, Recommenders / Signals, Blob Storage, Pipelines, Scheduling, Alerting / Messaging, Connectors); REST API; Admin UI; Lucidworks View. Data sources: logs, file, web, database, cloud.]

Page 21: The Apache Solr Smart Data Ecosystem

• 75% decrease in development time

• Licensing costs cut by 50%

“With Fusion’s out-of-the-box capabilities, we skipped months in our dev cycle so we could focus our team where they would have the most impact. We cut our licensing costs by 50% and improved application usability. The Lucidworks professional services team amplified our success even further. We’re all Fusion from here on out!”

Lourduraju Pamishetty, Senior IT Application Architect

Page 22: The Apache Solr Smart Data Ecosystem

• Seamless integration of your entire search & analytics platform

• All capabilities are exposed through secured APIs, so you can use our UI or build your own.

• End-to-end security policies can be applied out of the box to every aspect of your search ecosystem.

• Distributed, fault-tolerant scaling and supervision of your entire search application


Page 24: The Apache Solr Smart Data Ecosystem

Fusion powers search for the brightest companies in the world.

Page 25: The Apache Solr Smart Data Ecosystem

Lucidworks Fusion

Page 26: The Apache Solr Smart Data Ecosystem

search & relevancy

Page 27: The Apache Solr Smart Data Ecosystem

Basic Keyword Search

The beginning of a typical search journey

Page 28: The Apache Solr Smart Data Ecosystem

What you SEND to Lucene/Solr:

Document | Content Field
doc1 | once upon a time, in a land far, far away
doc2 | the cow jumped over the moon.
doc3 | the quick brown fox jumped over the lazy dog.
doc4 | the cat in the hat
doc5 | The brown cow said “moo” once.
… | …

How the content is INDEXED into Lucene/Solr (conceptually):

Term | Documents
a | doc1 [2x]
brown | doc3 [1x], doc5 [1x]
cat | doc4 [1x]
cow | doc2 [1x], doc5 [1x]
… | …
once | doc1 [1x], doc5 [1x]
over | doc2 [1x], doc3 [1x]
the | doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
… | …

The inverted index
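A minimal Python sketch of how such an index could be built from the five example documents. The function name `build_inverted_index` and the simple lowercase, word-character tokenization are assumptions for illustration; Lucene's real analysis and index structures are far more involved.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Hypothetical tokenization: lowercase, then split on non-word characters.
        for term in re.findall(r"\w+", text.lower()):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}

index = build_inverted_index(docs)
for term in ("a", "brown", "once", "the"):
    print(term, index[term])
```

Looking up a term is then a single dictionary access, which is why keyword matching stays fast as the corpus grows.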


Page 29: The Apache Solr Smart Data Ecosystem

/solr/select/?q=apache solr

Term | Documents
… | …
apache | doc1, doc3, doc4, doc5
hadoop | doc2, doc4, doc6
… | …
solr | doc1, doc3, doc4, doc7, doc8
… | …

Venn diagram of matching documents:
apache only: doc5
solr only: doc7, doc8
apache and solr: doc1, doc3, doc4

Matching queries to documents
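The matching step can be sketched as set operations over postings lists. This is an illustrative Python sketch, not Solr's implementation; the `match` helper and its `operator` parameter are hypothetical names.

```python
# Postings lists copied from the table for this slide.
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match(terms, operator="OR"):
    """Return the documents matching any (OR) or all (AND) of the terms."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    return set.union(*sets)


print(sorted(match(["apache", "solr"], "AND")))
print(sorted(match(["apache", "solr"], "OR")))
```

The AND result is the overlap region of the Venn diagram, and the OR result is the whole diagram.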


Page 30: The Apache Solr Smart Data Ecosystem

Text Analysis

Generating terms to index from raw text

Page 31: The Apache Solr Smart Data Ecosystem

Text Analysis in Solr

A text field in Lucene/Solr has an Analyzer containing:

① Zero or more CharFilters
Takes incoming text and “cleans it up” before it is tokenized

② One Tokenizer
Splits incoming text into a Token Stream containing zero or more Tokens

③ Zero or more TokenFilters
Examines and optionally modifies each Token in the Token Stream

*From Solr in Action, Chapter 6
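The three-stage analyzer structure can be mimicked in a few lines of Python. This is only a conceptual sketch: the function names are hypothetical stand-ins, loosely modeled on an HTML-stripping char filter, a whitespace tokenizer, and lowercase and stopword token filters, and they do not reproduce the behavior of the real Lucene classes.

```python
import re


def html_strip_char_filter(text):
    """CharFilter stage: clean raw text before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer stage: split text into a stream of tokens."""
    return text.split()


def lowercase_filter(tokens):
    """TokenFilter stage: modify each token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
    """TokenFilter stage: drop tokens found in a (hypothetical) stopword list."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, exactly one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>The</b> Quick brown Fox",
    [html_strip_char_filter],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
print(tokens)  # ['quick', 'brown', 'fox']
```

The order of the token filters matters: here lowercasing runs before stopword removal so that "The" is caught by the lowercase stopword list.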



Page 35: The Apache Solr Smart Data Ecosystem

Multi-lingual Text Analysis

Analyzing text across multiple languages

Page 36: The Apache Solr Smart Data Ecosystem

Example English Analysis Chains

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Page 37: The Apache Solr Smart Data Ecosystem

Per-language Analysis Chains

*Some of the 32 different language configurations in Appendix B of Solr in Action


Page 39: The Apache Solr Smart Data Ecosystem

Which Stemmer do I choose?

*From Solr in Action, Chapter 14


Page 40: The Apache Solr Smart Data Ecosystem

Common English Stemmers


*From Solr in Action, Chapter 14

Page 41: The Apache Solr Smart Data Ecosystem

When Stemming goes awry

Fixing Stemming Mistakes:
• Unfortunately, every stemmer will have problem cases that aren’t handled as you would expect
• Thankfully, stemmers can be overridden
  • KeywordMarkerFilter: protects a list of terms you specify from being stemmed
  • StemmerOverrideFilter: applies a list of custom term mappings you specify

Alternate strategy:
• Use Lemmatization (root-form analysis) instead of Stemming
• Commercial vendors help tremendously in this space
• The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages

Page 42: The Apache Solr Smart Data Ecosystem

Relevancy

Scoring the results, returning the best matches

Page 43: The Apache Solr Smart Data Ecosystem

Classic Lucene Relevancy Algorithm (now switched to BM25):

*Source: Solr in Action, chapter 3

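$$\text{Score}(q, d) = \sum_{t \in q} \Big( \text{tf}(t \text{ in } d) \cdot \text{idf}(t)^2 \cdot t.\text{getBoost}() \cdot \text{norm}(t, d) \Big) \cdot \text{coord}(q, d) \cdot \text{queryNorm}(q)$$

Where: t = term; d = document; q = query; f = field

$$\text{tf}(t \text{ in } d) = \text{numTermOccurrencesInDocument}^{1/2}$$

$$\text{idf}(t) = 1 + \log\big(\text{numDocs} / (\text{docFreq} + 1)\big)$$

$$\text{coord}(q, d) = \text{numTermsInDocumentFromQuery} / \text{numTermsInQuery}$$

$$\text{queryNorm}(q) = 1 / \text{sumOfSquaredWeights}^{1/2}$$

$$\text{sumOfSquaredWeights} = q.\text{getBoost}()^2 \cdot \sum_{t \in q} \big( \text{idf}(t) \cdot t.\text{getBoost}() \big)^2$$

$$\text{norm}(t, d) = d.\text{getBoost}() \cdot \text{lengthNorm}(f) \cdot f.\text{getBoost}()$$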

Page 44: The Apache Solr Smart Data Ecosystem

TF * IDF

• Term Frequency: “How well does a term describe a document?”
  – Measure: how often a term occurs per document

• Inverse Document Frequency: “How important is a term overall?”
  – Measure: how rare the term is across all documents

*Source: Solr in Action, chapter 3
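To make the two measures concrete, here is a small Python sketch of the tf and idf terms as defined in the classic formula on the previous slide. It assumes the natural logarithm for `log`, and it leaves out boosts, norms, `coord`, and `queryNorm`, so the numbers are illustrative rather than real Lucene scores. The function names and example counts are hypothetical.

```python
import math


def tf(term_count):
    """Term frequency: square root of the occurrences in the document."""
    return math.sqrt(term_count)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms get larger values."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def term_weight(term_count, num_docs, doc_freq):
    """Per-term core of the classic score: tf * idf^2 (boosts and norms omitted)."""
    return tf(term_count) * idf(num_docs, doc_freq) ** 2


# Hypothetical corpus of 1,000 documents: one rare term, one common term.
rare = term_weight(term_count=1, num_docs=1000, doc_freq=9)
common = term_weight(term_count=4, num_docs=1000, doc_freq=899)
print(rare, common)
```

Even though the common term occurs four times in the document, the rare term scores higher, because the idf factor enters the weight squared.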


Page 45: The Apache Solr Smart Data Ecosystem

News Search : popularity and freshness drive relevance

Restaurant Search: geographical proximity and price range are critical

Ecommerce: likelihood of a purchase is key

Movie search: More popular titles are generally more relevant

Job search: category of job, salary range, and geographical proximity matter

TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!

That’s great, but what about domain-specific knowledge?

Page 46: The Apache Solr Smart Data Ecosystem

John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.

Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry.

Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.

Jane is a nurse educator in Boston seeking between $40K and $60K

*Example from chapter 16 of Solr in Action

Consider what you know about users


Page 47: The Apache Solr Smart Data Ecosystem

http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=(
    jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
  )
  AND (
    (city:"Boston" AND state:"MA")^15 OR state:"MA"
  )
  AND _val_:"map(salary, 40000, 60000,10, 0)"

*Example from chapter 16 of Solr in Action

Query for Jane

Jane is a nurse educator in Boston seeking between $40K and $60K
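The `map(salary, 40000, 60000,10, 0)` clause is what turns Jane's salary preference into a score contribution. A small Python sketch of the semantics of Solr's `map` function query follows; `solr_map` is a hypothetical stand-in written for illustration, not Solr code.

```python
def solr_map(value, low, high, target, default=None):
    """Mimic map(x, min, max, target, default).

    Values inside [low, high] (inclusive) map to `target`. Values outside
    map to `default` when one is given, otherwise they pass through unchanged.
    """
    if low <= value <= high:
        return target
    return value if default is None else default


# Salaries taken from the example search results for Jane.
for salary in (41503, 56183, 71359):
    print(salary, solr_map(salary, 40000, 60000, 10, 0))
```

Jobs paying inside the $40K to $60K window contribute 10 to the score, and jobs outside it contribute 0 but can still match on title and location.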


Page 48: The Apache Solr Smart Data Ecosystem

Search Results for Jane

{ ...
  "response":{"numFound":22,"start":0,"docs":[
    {"jobtitle":" Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503},
    {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183},
    {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359}
    …]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action

Page 49: The Apache Solr Smart Data Ecosystem

You just built a recommendation engine!

Page 50: The Apache Solr Smart Data Ecosystem

Demo: Recommendations

Page 51: The Apache Solr Smart Data Ecosystem

The Relevancy Spectrum

[Figure labels: Traditional Keyword Search, Recommendations, Semantic Search, User Intent, Personalized Search, Augmented Search, Domain-aware Matching]

Page 52: The Apache Solr Smart Data Ecosystem

Data-driven App Sophistication

Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)

Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)

Self-learning

Page 53: The Apache Solr Smart Data Ecosystem

what is “reflected intelligence”?

Page 54: The Apache Solr Smart Data Ecosystem

Reflected Intelligence
“Leveraging previous data and interactions to improve how new data and interactions should be interpreted”

The Three C’s

Content: Keywords and other features in your documents

Collaboration: How others have chosen to interact with your system

Context: Available information about your users and their intent

Page 55: The Apache Solr Smart Data Ecosystem

Feedback Loops

User searches → User sees results → User takes an action → Users’ actions inform system improvements

Page 56: The Apache Solr Smart Data Ecosystem

Examples of Reflected Intelligence

● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content

Page 57: The Apache Solr Smart Data Ecosystem

Relevancy Tuning

Improving ranking algorithms through experiments and models

Page 58: The Apache Solr Smart Data Ecosystem

How to Measure Relevancy?

[Venn diagram: A = retrieved documents, C = related documents, B = their overlap]

Precision = B/A

Recall = B/C

Problem:

Assume precision is 90% and recall is 100%, but the 10% of retrieved documents that are irrelevant are ranked at the top of the results. Is that OK?
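A short Python sketch makes the problem visible: precision and recall are computed from sets, so they cannot see ordering. The document IDs below are hypothetical, chosen to give 90% precision and 100% recall.

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|; recall = the same overlap / |relevant|."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {f"d{i}" for i in range(9)}            # nine relevant documents
good_order = [f"d{i}" for i in range(9)] + ["x"]  # irrelevant doc ranked last
bad_order = ["x"] + [f"d{i}" for i in range(9)]   # irrelevant doc ranked first

print(precision_recall(good_order, relevant))
print(precision_recall(bad_order, relevant))
```

Both orderings produce identical numbers, which is why a position-aware metric is needed to judge ranking quality.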


Page 59: The Apache Solr Smart Data Ecosystem

Normalized Discounted Cumulative Gain

Given ranking:

Rank | Relevancy
3 | 0.95
1 | 0.70
2 | 0.60
4 | 0.45

Ideal ranking:

Rank | Relevancy
1 | 0.95
2 | 0.85
3 | 0.80
4 | 0.65

• Position is considered in quantifying relevancy.

• Labeled dataset is required.
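A minimal Python sketch of nDCG, using the common formulation DCG = Σ rel_i / log2(i + 1) over 1-based positions i. The slide does not state which variant it uses, so the formula is an assumption. The `given` list takes the relevancy values from the out-of-order table and places them in rank order, which is also an assumption about how that table is meant to be read.

```python
import math


def dcg(relevancies):
    """Discounted cumulative gain: each gain is discounted by log2(position + 1)."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevancies, start=1)
    )


def ndcg(relevancies):
    """DCG of the given order divided by the DCG of the best possible order."""
    ideal = dcg(sorted(relevancies, reverse=True))
    return dcg(relevancies) / ideal if ideal > 0 else 0.0


given = [0.70, 0.60, 0.95, 0.45]  # relevancy at ranks 1, 2, 3, 4
print(ndcg(given))
print(ndcg(sorted(given, reverse=True)))
```

Placing the most relevant document third pulls the score below 1.0, while the ideally ordered list scores exactly 1.0.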


Page 60: The Apache Solr Smart Data Ecosystem

Learning to Rank

Page 61: The Apache Solr Smart Data Ecosystem

Learning to Rank (LTR)

● It applies machine learning techniques to discover the combination of features that provides the best ranking.

● It requires a labeled set of documents with relevancy scores for a given set of queries.

● Features used for ranking are usually more computationally expensive than the ones used for matching.

● It typically re-ranks a subset of the matched documents (e.g. top 1000).

Page 62: The Apache Solr Smart Data Ecosystem


Page 63: The Apache Solr Smart Data Ecosystem

Common LTR Algorithms

• RankNet* (Neural Network, boosted trees)

• LambdaMart* (set of regression trees)

• SVM Rank** (SVM classifier)

** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf

* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf


Page 64: The Apache Solr Smart Data Ecosystem

LambdaMart Example

Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016


Page 65: The Apache Solr Smart Data Ecosystem

Demo: Solr Learning to Rank

Page 66: The Apache Solr Smart Data Ecosystem

Obtaining Relevancy Judgements

Typical Methodologies

1) Hire employees, contractors, or interns
   - Pros: Accuracy
   - Cons: Expensive; not scalable (cost or manpower-wise); data becomes stale

2) Crowdsource
   - Pros: Less cost, more scalable
   - Cons: Less accurate; data still becomes stale

Source: T. Grainger, K. AlJadda. "Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016

Page 67: The Apache Solr Smart Data Ecosystem

Reflected Intelligence: Possible to infer relevancy judgements?

Results shown for the query:

Rank | Document ID
1 | Doc1
2 | Doc2
3 | Doc3
4 | Doc4

Click Graph (Query → document): Doc1 = 0, Doc2 = 1, Doc3 = 1

Skip Graph (Query → document): Doc1 = 1, Doc2 = 0, Doc3 = 0

Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
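One way such judgements might be inferred is sketched below in Python. The rule used here is an assumption, not something the slide states: a clicked document counts as a click, an unclicked document ranked above the lowest click counts as a skip, and documents below the lowest click get no judgement because the user may never have seen them. The function name `infer_judgements` is hypothetical.

```python
def infer_judgements(ranked_docs, clicked):
    """Return {doc: 1 for a click, 0 for a skip}; unseen docs are omitted."""
    clicked = set(clicked)
    click_positions = [i for i, doc in enumerate(ranked_docs) if doc in clicked]
    if not click_positions:
        return {}
    lowest_click = max(click_positions)
    return {
        doc: (1 if doc in clicked else 0)
        for doc in ranked_docs[: lowest_click + 1]
    }


ranking = ["Doc1", "Doc2", "Doc3", "Doc4"]
judgements = infer_judgements(ranking, clicked={"Doc2", "Doc3"})
print(judgements)  # {'Doc1': 0, 'Doc2': 1, 'Doc3': 1}
```

Under this rule Doc1 is recorded as skipped, Doc2 and Doc3 as clicked, and Doc4 receives no judgement. Aggregated over many users, such counts can stand in for manually labeled relevancy data.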


Page 68: The Apache Solr Smart Data Ecosystem

Automated Relevancy Benchmarking


Page 69: The Apache Solr Smart Data Ecosystem

Demo: Fusion Signals

Page 70: The Apache Solr Smart Data Ecosystem
Page 71: The Apache Solr Smart Data Ecosystem

• 200%+ increase in click-through rates
• 91% lower TCO
• Fewer support tickets
• Increased customer satisfaction

Page 72: The Apache Solr Smart Data Ecosystem

semantic search

Page 73: The Apache Solr Smart Data Ecosystem


Page 74: The Apache Solr Smart Data Ecosystem

Building a Taxonomy of Entities

Many ways to generate this:

• Topic Modelling

• Clustering of documents

• Statistical Analysis of interesting phrases
  - Word2Vec / GloVe / Dice Conceptual Search

• Buy a dictionary (often doesn’t work for domain-specific search problems)

• Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain*

* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.


Page 75: The Apache Solr Smart Data Ecosystem


Page 76: The Apache Solr Smart Data Ecosystem


Page 77: The Apache Solr Smart Data Ecosystem

entity extraction

Page 78: The Apache Solr Smart Data Ecosystem


Page 79: The Apache Solr Smart Data Ecosystem

Demo: Solr Text Tagger

Page 80: The Apache Solr Smart Data Ecosystem

semantic query parsing

Page 81: The Apache Solr Smart Data Ecosystem


Page 82: The Apache Solr Smart Data Ecosystem

Probabilistic Query Parser

Goal: given a query, predict which combinations of keywords should be combined together as phrases

Example: senior java developer hadoop

Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"

Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
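The candidate parsings can be enumerated mechanically: between each pair of adjacent words there is either a phrase break or not. The Python sketch below only generates the candidates. Choosing among them with corpus statistics is the probabilistic part and is not shown. The function name `possible_parsings` is hypothetical.

```python
from itertools import product


def possible_parsings(query):
    """Return every way to group adjacent words of the query into phrases."""
    words = query.split()
    if not words:
        return []
    parsings = []
    # One boolean per gap between adjacent words: True means "break here".
    for breaks in product([True, False], repeat=len(words) - 1):
        phrases, current = [], [words[0]]
        for word, is_break in zip(words[1:], breaks):
            if is_break:
                phrases.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        phrases.append(" ".join(current))
        parsings.append(phrases)
    return parsings


parsings = possible_parsings("senior java developer hadoop")
for phrases in parsings:
    print(phrases)
```

A four-word query has three gaps and therefore 2^3 = 8 candidates: the seven listed on the slide plus senior, "java developer hadoop".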


Page 83: The Apache Solr Smart Data Ecosystem

Demo: Probabilistic Query Parser

Page 84: The Apache Solr Smart Data Ecosystem

Semantic Query Parsing

Identification of phrases in queries using two steps:

1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.*

2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model)

*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.


Page 85: The Apache Solr Smart Data Ecosystem

query augmentation

Page 86: The Apache Solr Smart Data Ecosystem


Page 87: The Apache Solr Smart Data Ecosystem

Knowledge Graph

Semantic Data Encoded into Free Text Content


Page 88: The Apache Solr Smart Data Ecosystem

Documents

id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java

id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy

id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate

Terms-Docs Inverted Index

field | term | doc | pos
desc | a | 1 | 4
desc | a | 2 | 1
desc | a | 3 | 1, 5
desc | at | 1 | 3
desc | at | 2 | 4
desc | company | 1 | 6
desc | doing | 2 | 6
desc | doing | 3 | 8
desc | engineer | 1 | 2
desc | engineer | 3 | 3, 7
desc | great | 1 | 5
desc | hard | 2 | 7
desc | hospital | 2 | 5
desc | java | 3 | 6
desc | nurse | 2 | 3
desc | or | 3 | 4
desc | registered | 2 | 2
desc | software | 1 | 1
desc | software | 3 | 2
desc | work | 2 | 10
desc | work | 3 | 9
job_title | java developer | 3 | 1
… | … | … | …

Docs-Terms Forward Index

field | doc | terms
desc | 1 | a, at, company, engineer, great, software
desc | 2 | a, at, doing, hard, hospital, nurse, registered, work
desc | 3 | a, doing, engineer, java, or, software, work
job_title | 1 | Software Engineer
… | … | …

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
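The pairing of the two indexes is what makes graph traversal possible: an inverted index lookup goes from a term to its documents, and a forward index lookup goes from those documents back to their other terms. The Python sketch below illustrates that idea on the `skills` field of the three example documents. The dictionary layout and the `related_terms` helper are assumptions for illustration, not the actual Solr data structures.

```python
from collections import Counter, defaultdict

docs = {
    1: {"job_title": "Software Engineer", "skills": [".net", "c#", "java"]},
    2: {"job_title": "Registered Nurse", "skills": ["oncology", "phlebotomy"]},
    3: {"job_title": "Java Developer", "skills": ["java", "scala", "hibernate"]},
}

inverted = defaultdict(set)  # (field, term) -> doc ids
forward = defaultdict(set)   # (field, doc id) -> terms
for doc_id, doc in docs.items():
    for skill in doc["skills"]:
        inverted[("skills", skill)].add(doc_id)
        forward[("skills", doc_id)].add(skill)


def related_terms(field, term):
    """Term -> docs (inverted lookup) -> co-occurring terms (forward lookup)."""
    counts = Counter()
    for doc_id in inverted[(field, term)]:
        for other in forward[(field, doc_id)]:
            if other != term:
                counts[other] += 1
    return counts


print(related_terms("skills", "java"))
```

Each co-occurring term found this way is a candidate edge in the graph, so no edges ever need to be stored explicitly.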

Knowledge Graph


Page 89: The Apache Solr Smart Data Ecosystem

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

How the Graph Traversal Works

[Figure with three panels. Set-theory View: docs 1 to 6 grouped under skill: Java, skill: Scala, skill: Hibernate, and skill: Oncology. Graph View: skill nodes (Java, Scala, Hibernate, Oncology) joined by has_related_skill edges. Data Structure View: the terms Java, Scala, Hibernate, and Oncology with postings labeled docs 1, 2, 6; docs 3, 4; and doc 5.]

Page 90: The Apache Solr Smart Data Ecosystem

Knowledge Graph

Graph Model

Structure:

Single-level Traversal / Scoring:

Multi-level Traversal / Scoring:

Page 91: The Apache Solr Smart Data Ecosystem

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Multi-level Traversal

[Figure with two panels. Data Structure View: docs 1 to 6 linked to skill terms (Java, Scala, Hibernate, Oncology) and to job_title terms (Software Engineer, Data Scientist, Java Developer, …) through alternating Inverted Index Lookup and Forward Index Lookup steps. Graph View: skill nodes Java, Hibernate, and Scala joined by has_related_skill edges, and job title nodes Java Developer, Software Engineer, and Data Scientist reached through has_related_job_title edges.]

Page 92: The Apache Solr Smart Data Ecosystem

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Scoring nodes in the Graph

Foreground vs. Background Analysis
Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

$$z = \frac{\text{countFG}(x) - \text{totalDocsFG} \cdot \text{probBG}(x)}{\sqrt{\text{totalDocsFG} \cdot \text{probBG}(x) \cdot (1 - \text{probBG}(x))}}$$

Foreground Query: "Hadoop"

{ "type":"keywords",
  "values":[
    { "value":"hive", "relatedness": 0.9765, "popularity":369 },
    { "value":"spark", "relatedness": 0.9634, "popularity":15653 },
    { "value":".net", "relatedness": 0.5417, "popularity":17683 },
    { "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
    { "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
    { "value":"CPR", "relatedness": -0.4012, "popularity":27089 }
  ] }
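The z-score formula can be sketched directly in Python. The counts used below are hypothetical, and returning 0.0 when the variance is zero (a term that never appears in the background) is a choice made for this sketch. The relatedness values in the example response lie between -1 and 1, so they are presumably a normalized form of this score. That normalization is not shown on the slide and is not reproduced here.

```python
import math


def z_score(count_fg, total_docs_fg, prob_bg):
    """Compare the observed foreground count with the count expected from the background."""
    expected = total_docs_fg * prob_bg
    variance = total_docs_fg * prob_bg * (1 - prob_bg)
    if variance == 0:
        return 0.0  # assumption: no background evidence means no signal
    return (count_fg - expected) / math.sqrt(variance)


# Hypothetical numbers: 1,000 foreground documents match the query.
print(z_score(count_fg=300, total_docs_fg=1000, prob_bg=0.01))  # far above expectation
print(z_score(count_fg=10, total_docs_fg=1000, prob_bg=0.01))   # exactly as expected
print(z_score(count_fg=0, total_docs_fg=1000, prob_bg=0.10))    # far below expectation
```

A positive score means the term is over-represented in the foreground, a score near zero means it behaves as it does in the background, and a negative score means it is under-represented.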


Page 93: The Apache Solr Smart Data Ecosystem

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Multi-level Graph Traversal with Scores

[Figure: traversal from the starting node "software engineer*" (materialized node) across has_related_skill edges to skill nodes (Java, C#, .NET), across has_related_skill edges to further skill nodes (Hibernate, Scala, VB.NET), and across has_related_job_title edges to job title nodes (.NET Developer, Java Developer, Software Engineer, Data Scientist). Each edge carries a score; the scores shown range from 0.34 to 0.93.]

Page 94: The Apache Solr Smart Data Ecosystem

Knowledge Graph

Use Case: Document Summarization

Experiment: Pass in raw text (extracting phrases as needed), and rank their similarity to the documents using the SKG.

Additionally, can traverse the graph to “related” entities/keyword phrases NOT found in the original document

Applications: Content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring

Page 95: The Apache Solr Smart Data Ecosystem

Demo: Semantic Knowledge Graph

Page 96: The Apache Solr Smart Data Ecosystem

Knowledge Graph


Page 97: The Apache Solr Smart Data Ecosystem

Knowledge Graph


Page 98: The Apache Solr Smart Data Ecosystem


Page 99: The Apache Solr Smart Data Ecosystem

streaming expressions

Page 100: The Apache Solr Smart Data Ecosystem

• Perform relational operations on streams

• Stream sources: search, jdbc, facets, features, gatherNodes, shortestPath, train, model, random, stats, topic

• Stream decorators: classify, commit, complement, daemon, executor, fetch, having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null, outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select, sort, top, unique, update

Streaming Expressions

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.


Page 101: The Apache Solr Smart Data Ecosystem

Streaming API: Nuts and Bolts

• Relies on docValues (column-oriented data structure) and /export handler

• Extreme read performance (8-10x faster than queries using cursorMark)

• Facet or map/reduce style aggregation modes

• Tiered architecture
  • SQL interface tier
  • Worker tier (scale a pool of worker “nodes” independently of the data collection)
  • Data tier (Solr collection)

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Page 102: The Apache Solr Smart Data Ecosystem

Streaming Expressions - Examples

Shortest-path Graph Traversal

Parallel Batch Processing

Train a Logistic Regression Model

Distributed Joins

Rapid Export of all Search Results

Pull Results from External Database

Sources:
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Classifying Search Results

Page 103: The Apache Solr Smart Data Ecosystem

Solr SQL

Page 104: The Apache Solr Smart Data Ecosystem

SQL is a natural extension for Solr’s parallel computing engine

• SQL is a ubiquitous language for analytics
• People: Less training and easier to understand
• Tools! Solr as JDBC data source (DbVisualizer, Apache Zeppelin, and SQuirreL SQL)
• Query planning / optimization can evolve iteratively

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Page 105: The Apache Solr Smart Data Ecosystem

Give me the top 5 action movies with rating of 4 or better

Mental Warm-up

/select?q=*:* &fq=genre_ss:action &fq=rating_i:[4 TO *] &facet=true &facet.limit=5 &facet.mincount=1 &facet.field=title_s

SELECT title_s, COUNT(*) as cnt FROM movielens WHERE genre_ss='action' AND rating_i='[4 TO *]’ GROUP BY title_s ORDER BY cnt desc LIMIT 5

{ ... "facet_counts":{ "facet_fields":{ "title_s":[ "Star Wars (1977)",501, "Return of the Jedi (1983)",379, "Godfather, The (1972)",351, "Raiders of the Lost Ark (1981)",348, "Empire Strikes Back, The (1980)",293]}, ...}}

{"result-set":{"docs":[{"title_s":"Star Wars (1977)","cnt":501},{"title_s":"Return of the Jedi (1983)","cnt":379},{"title_s":"Godfather, The (1972)","cnt":351},{"title_s":"Raiders of the Lost Ark (1981)","cnt":348},{"title_s":"Empire Strikes Back, The (1980)","cnt":293},{"EOF":true,"RESPONSE_TIME":42}]}}

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
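The two responses on this slide carry the same information in different shapes: the facet response is a flat list alternating value and count, while the SQL response is a list of tuples ending in an EOF marker. A minimal Python sketch, using only the response fragments shown on the slide (the helper names are made up for illustration), normalizes both into (title, count) pairs:

```python
def facet_pairs(flat):
    """Turn a flat [value, count, value, count, ...] facet list into pairs."""
    return list(zip(flat[0::2], flat[1::2]))

def sql_pairs(docs, key, count_field):
    """Extract (value, count) pairs from a result-set, skipping the EOF tuple."""
    return [(d[key], d[count_field]) for d in docs if not d.get("EOF")]

# Response fragments copied from the slide
facet_response = {"facet_counts": {"facet_fields": {"title_s": [
    "Star Wars (1977)", 501,
    "Return of the Jedi (1983)", 379,
    "Godfather, The (1972)", 351,
    "Raiders of the Lost Ark (1981)", 348,
    "Empire Strikes Back, The (1980)", 293]}}}

sql_response = {"result-set": {"docs": [
    {"title_s": "Star Wars (1977)", "cnt": 501},
    {"title_s": "Return of the Jedi (1983)", "cnt": 379},
    {"title_s": "Godfather, The (1972)", "cnt": 351},
    {"title_s": "Raiders of the Lost Ark (1981)", "cnt": 348},
    {"title_s": "Empire Strikes Back, The (1980)", "cnt": 293},
    {"EOF": True, "RESPONSE_TIME": 42}]}}

from_facets = facet_pairs(facet_response["facet_counts"]["facet_fields"]["title_s"])
from_sql = sql_pairs(sql_response["result-set"]["docs"], "title_s", "cnt")
print(from_facets == from_sql)
```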

Page 106: The Apache Solr Smart Data Ecosystem

SQL Examples

Give me the avg rating for men and women over 30 for romance movies

SELECT gender_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating FROM movielens WHERE genre_ss='romance' AND age_i='[30 TO *]' GROUP BY gender_s ORDER BY num_ratings desc

Give me the top 5 rated movies with at least 100 ratings

SELECT title_s, genre_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating FROM movielens GROUP BY title_s, genre_s HAVING num_ratings >= 100 ORDER BY avg_rating desc LIMIT 5

Give me the set of unique users that have rated documentaries

SELECT DISTINCT(user_id_i) as user_id FROM movielens WHERE genre_ss='documentary' ORDER BY user_id desc

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
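One way to submit statements like these is to POST them to the collection's /sql handler. The sketch below only prepares the request and does not send it. It assumes a Solr node at localhost:8983 and a collection named movielens; stmt and aggregationMode are the handler's request parameters, and the sql_request helper is a made-up name for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

def sql_request(base, collection, stmt, aggregation_mode="facet"):
    """Prepare (but do not send) a POST to Solr's /sql handler."""
    body = urlencode({"stmt": stmt, "aggregationMode": aggregation_mode})
    return Request(
        f"{base}/{collection}/sql",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

stmt = (
    "SELECT DISTINCT(user_id_i) as user_id FROM movielens "
    "WHERE genre_ss='documentary' ORDER BY user_id desc"
)
# Host and port are assumptions; adjust to the actual cluster.
req = sql_request("http://localhost:8983/solr", "movielens", stmt)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Passing req to urllib.request.urlopen would execute the statement against a live cluster; switching aggregation_mode to "map_reduce" selects the other aggregation style mentioned earlier.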

Page 107: The Apache Solr Smart Data Ecosystem

parallel(workers, hashJoin( search(movielens, q=*:*, fl="user_id_i,movie_id_i,rating_i", sort="movie_id_i asc", partitionKeys="movie_id_i"), hashed=search(movielens_movies, q=*:*, fl="movie_id_i,title_s,genre_s", sort="movie_id_i asc", partitionKeys="movie_id_i"), on="movie_id_i" ), workers="4", sort="movie_id_i asc")

Streaming Expression Example: hashJoin

The small “right” side of the join gets loaded into memory on each worker node

Each shard queried by N workers, so 4 workers x 4 shards means 16 queries (usually all replicas per shard are hit)

Workers collection isolates parallel computation nodes from data nodes

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
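The idea behind hashJoin can be sketched in plain Python: the small side is loaded into an in-memory hash table keyed on the join field, and the large side is streamed past it. This is a generic illustration of the technique, not Solr's implementation, and the sample tuples are hypothetical values using the field names from the expression above.

```python
from collections import defaultdict

def hash_join(left, right, on):
    """Join two iterables of dicts on a shared key.

    The (small) right side is loaded into an in-memory hash table;
    the (large) left side is streamed past it one tuple at a time.
    """
    table = defaultdict(list)
    for row in right:
        table[row[on]].append(row)
    for row in left:
        for match in table.get(row[on], []):
            yield {**match, **row}

# Hypothetical sample tuples
ratings = [
    {"user_id_i": 1, "movie_id_i": 10, "rating_i": 5},
    {"user_id_i": 2, "movie_id_i": 10, "rating_i": 4},
    {"user_id_i": 1, "movie_id_i": 20, "rating_i": 3},
    {"user_id_i": 3, "movie_id_i": 30, "rating_i": 2},  # no matching movie
]
movies = [
    {"movie_id_i": 10, "title_s": "Movie A", "genre_s": "action"},
    {"movie_id_i": 20, "title_s": "Movie B", "genre_s": "drama"},
]

joined = list(hash_join(ratings, movies, on="movie_id_i"))
for row in joined:
    print(row)

# Fan-out noted on the slide: every worker queries every shard.
workers, shards = 4, 4
queries = workers * shards
print(queries)
```

Only the right side has to fit in memory, which is why the slide stresses keeping that side small.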

Page 108: The Apache Solr Smart Data Ecosystem

• spark-solr project uses streaming API to pull data from Solr into Spark jobs if docValues enabled, see: https://github.com/lucidworks/spark-solr

• Perform aggregations of “signals”, e.g. clicks, to compute boosts and recommendations using Spark

• Custom Scala script jobs to perform complex analysis on data in Solr, e.g. sessionize request logs

• Power rich data visualizations using Fusion’s SQL Engine powered by SparkSQL + Solr streaming aggregations

How we use Solr streaming API in Fusion

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Page 109: The Apache Solr Smart Data Ecosystem

Page 110: The Apache Solr Smart Data Ecosystem

Comparing SQL Capabilities

Secret Sauce
• Fusion: SparkSQL Benefits + Solr Benefits + Enterprise Security
• Solr: Push complex query constructs into engine (full text, spatial, relevancy, graph, functions, etc)
• Hive: Mature SQL solution for Hadoop stack
• Drill: Execute SQL over NoSQL data sources
• SparkSQL: Spark core (optimized shuffle, in-memory, etc), integration of other APIs: ML, Streaming, GraphX

SQL Features
• Fusion: Maturing
• Solr: Evolving
• Hive: Mature
• Drill: Maturing
• SparkSQL: Maturing

Scaling
• Fusion: Linear (shards and replicas) backed by inverted index
• Solr: Linear (shards and replicas) backed by inverted index
• Hive: Limited by Hadoop infrastructure (table scans)
• Drill: Good, but need to benchmark
• SparkSQL: Memory intensive; scale out using Spark cluster, backed by RDDs

Integration w/ external systems
• Fusion: Analytics Catalog API, JDBC Driver, ODBC Bridge
• Solr: JDBC stream source
• Hive: External tables / plugin API
• Drill: Many drivers available
• SparkSQL: DataSource API, many systems supported

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Page 111: The Apache Solr Smart Data Ecosystem

Demo: Fusion SQL Engine

Page 112: The Apache Solr Smart Data Ecosystem

Graph

Page 113: The Apache Solr Smart Data Ecosystem

Graph Use Cases

• Anomaly detection / fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Relationship discovery / scoring

Examples

o Find all draft blog posts about “Parallel SQL” written by a developer
o Find all tweets mentioning “Solr” by me or people I follow
o Find 3-star hotels in NYC my friends stayed in last year

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Page 114: The Apache Solr Smart Data Ecosystem

Solr Graph Timeline

• Some data is much more naturally represented as a graph structure

• Solr 6.0: Introduced the Graph Query Parser

• Solr 6.1: Introduced Graph Streaming expressions…

• Solr 6.3: Current Version

• TBD: Semantic Knowledge Graph (patch available)

Page 115: The Apache Solr Smart Data Ecosystem

Graph Query Parser

• Query-time, cyclic aware graph traversal is able to rank documents based on relationships

• Provides controls for depth, filtering of results and inclusion of root and/or leaves

• Limitations: single node/shard only

Examples:

• http://localhost:8983/solr/graph/query?fl=id,score&q={!graph from=in_edge to=out_edge}id:A

• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A

• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
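What “cycle aware” and depth-limited traversal mean can be sketched in a few lines of Python. This is a generic breadth-first walk over a hypothetical in-memory adjacency map, not the parser's actual implementation; max_depth loosely mirrors the maxDepth parameter in the examples above, and return_root is an illustrative name for the root-inclusion control.

```python
from collections import deque

def traverse(edges, roots, max_depth=None, return_root=True):
    """Breadth-first, cycle-aware traversal.

    edges: dict mapping a node id to the ids it links to (hypothetical data model).
    max_depth: number of hops to follow; None means no limit.
    return_root: whether the starting nodes are included in the result.
    """
    visited = set(roots)
    order = list(roots) if return_root else []
    frontier = deque((node, 0) for node in roots)
    while frontier:
        node, depth = frontier.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for neighbor in edges.get(node, []):
            if neighbor not in visited:  # already-seen nodes are skipped, so cycles terminate
                visited.add(neighbor)
                order.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return order

graph = {"A": ["B", "C"], "B": ["D"], "C": ["A"], "D": ["B"]}  # contains cycles
full_walk = traverse(graph, ["A"])
one_hop = traverse(graph, ["A"], max_depth=1)
one_hop_no_root = traverse(graph, ["A"], max_depth=1, return_root=False)
print(full_walk, one_hop, one_hop_no_root)
```

The visited set is what keeps the A → C → A and B → D → B loops from running forever.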

Page 116: The Apache Solr Smart Data Ecosystem

Graph Streaming Expressions

• Part of Solr’s broader Streaming Expressions capability

• Implements a powerful, breadth-first traversal

• Works across shards AND collections

• Supports aggregations

• Cycle aware

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'expr=…' "http://localhost:18984/solr/movielens/stream"

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Page 117: The Apache Solr Smart Data Ecosystem

All movies that user 389 watched

expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Page 118: The Apache Solr Smart Data Ecosystem

All movies that viewers of a specific movie watched

expr:gatherNodes(movielens, gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"), walk="node->user_id_i",gather="movie_id_i", trackTraversal="true")

Movie 161: “The Air Up There”

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
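Nested gatherNodes expressions are just strings, so they can be composed programmatically. The helper below is a hypothetical convenience function (not part of any Solr client library) that reproduces the one-hop and two-hop expressions from these slides; the resulting string would be sent as the expr parameter of a /stream request.

```python
def gather_nodes(collection, walk, gather, inner=None, **options):
    """Compose a gatherNodes(...) streaming expression as a string.

    inner: optional nested expression whose output nodes seed the walk.
    options: extra name="value" parameters, e.g. trackTraversal="true".
    """
    parts = [collection]
    if inner is not None:
        parts.append(inner)
    parts.append(f'walk="{walk}"')
    parts.append(f'gather="{gather}"')
    parts.extend(f'{name}="{value}"' for name, value in options.items())
    return "gatherNodes(" + ",".join(parts) + ")"

# All movies that user 389 watched
one_hop = gather_nodes("movielens", "389->user_id_i", "movie_id_i")

# All movies that viewers of movie 161 watched
viewers = gather_nodes("movielens", "161->movie_id_i", "user_id_i")
two_hop = gather_nodes("movielens", "node->user_id_i", "movie_id_i",
                       inner=viewers, trackTraversal="true")

print(one_hop)
print(two_hop)
```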

Page 119: The Apache Solr Smart Data Ecosystem

Collaborative Filtering

expr=top(n="5", sort="count(*) desc",
  gatherNodes(movielens,
    top(n="30", sort="count(*) desc",
      gatherNodes(movielens,
        search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt="/export"),
        walk="movie_id_i->movie_id_i", gather="user_id_i", maxDocFreq="10000", count(*))),
    walk="node->user_id_i", gather="movie_id_i", count(*)))

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
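The logic of the expression above can be approximated on in-memory data. The sketch below is an assumption-laden stand-in, not what Solr executes: it works on a hypothetical list of (user_id, movie_id) pairs, ignores the maxDocFreq cutoff, and, like the expression, does not filter out the starting user or movies that user has already rated.

```python
from collections import Counter

def recommend(ratings, user, n_users=30, n_movies=5):
    """Two-hop co-occurrence walk over (user_id, movie_id) pairs.

    1. Collect the movies the user rated.
    2. Count, per user, how many of those movies they also rated; keep the top n_users.
    3. Count, per movie, how many of those users rated it; keep the top n_movies.
    """
    seen = {m for u, m in ratings if u == user}
    user_counts = Counter(u for u, m in ratings if m in seen)
    similar = {u for u, _ in user_counts.most_common(n_users)}
    movie_counts = Counter(m for u, m in ratings if u in similar)
    return movie_counts.most_common(n_movies)

# Hypothetical sample data: (user_id, movie_id)
ratings = [
    (305, 1), (305, 2),
    (1, 1), (1, 2), (1, 3),
    (2, 1), (2, 3),
    (3, 4),  # user 3 shares no movies with user 305
]

recommendations = recommend(ratings, user=305)
print(recommendations)
```

A production recommender would usually drop the movies in seen from the final list; that step is left out here to stay close to the expression on the slide.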

Page 120: The Apache Solr Smart Data Ecosystem

Comparing Graph Choices

Best Use Case
• Solr: QParser: predef. relationships as filters; Expressions: fast, query-based, dist. graph ops
• Elastic Graph: Limited to sequential, term relatedness exploration only
• Neo4J: Graph ops and querying that fit on a single node
• Spark GraphX: Large-scale, iterative graph ops

Common Graph Algorithms (e.g. Pregel, Traversal)
• Solr: Partial
• Elastic Graph: No
• Neo4J: Yes
• Spark GraphX: Yes

Scaling
• Solr: QParser: Co-located Shards only; Expressions: Yes
• Elastic Graph: Yes
• Neo4J: Master/Replica
• Spark GraphX: Yes

Commercial License Required
• Solr: No
• Elastic Graph: Yes
• Neo4J: GPLv3
• Spark GraphX: No

Visualizations
• Solr: GraphML support (e.g. Gephi)
• Elastic Graph: Kibana
• Neo4J: Neo4j browser
• Spark GraphX: 3rd party

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Page 121: The Apache Solr Smart Data Ecosystem

Additional References:

Page 122: The Apache Solr Smart Data Ecosystem

Contact Info

Trey Grainger

[email protected] @treygrainger

http://solrinaction.com

Meetup discount (39% off): 39grainger

Other presentations: http://www.treygrainger.com
