
The Apache Solr Smart Data Ecosystem

Trey Grainger

SVP of Engineering, Lucidworks

DFW Data Science 2017.01.09

Trey Grainger SVP of Engineering

• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search – Stanford University

Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor

About Me

• Apache Solr Overview
• Lucidworks Fusion Overview
• Search & Relevancy
  - Keyword Search
  - Text Analysis
  - Multilingual Text Analysis
• Recommendations (Demo)
• Relevancy Spectrum
• Reflected Intelligence
  - Relevancy Tuning
  - Learning to Rank (Demo)
  - Signals (Demo)
• Semantic Search
  - Entity Extraction (Demo)
  - Query Parsing (Demo)
  - Semantic Knowledge Graph (Demo)
• Streaming Expressions
• Solr / Fusion SQL (Demo)
• Solr Graph

Agenda


what do you do?

Search-Driven

Everything

Customer Service

Customer Insights

Fraud Surveillance

Research Portal

Online Retail

Digital Content

Lucidworks enables Search-Driven Everything

Data Acquisition

Indexing & Streaming

Smart Access API

Recommendations & Alerts

Analytics & Insights

Extreme Relevancy

CUSTOMER SERVICE

RESEARCH PORTAL

DIGITAL CONTENT

CUSTOMER INSIGHTS

FRAUD SURVEILLANCE

ONLINE RETAIL

• Access all your data in a number of ways from one place.

• Secure storage and processing from Solr and Spark.

• Acquire data from any source with pre-built connectors and adapters.

Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.

Apache Solr

“Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”

Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more

*source: Solr in Action, chapter 2

The standard for enterprise search.

of Fortune 500 uses Solr.

Lucidworks Fusion


All Your Data

• Over 50 connectors to integrate all your data

• Robust parsing framework to seamlessly ingest all your document types

• Point and click Indexing configuration and iterative simulation of results for full control over your ETL process

• Your security model enforced end-to-end from ingest to search across your different datasources

Experience Management

• Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results.

• Machine-driven relevancy: Signals aggregation learns and automatically tunes relevancy and drives recommendations out of the box.

• Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages.

• Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.

Operational Simplicity

SECURITY BUILT-IN

[Architecture diagram: Apache Solr (shards); Apache ZooKeeper (ZK 1: leader election, load balancing, shared config management); Apache Spark (cluster manager, workers); Core Services (recommenders / signals, blob storage, pipelines, scheduling, alerting / messaging, connectors); Admin UI; Lucidworks View; data sources: logs, file, web, database, cloud]

• 75% decrease in development time

• Licensing costs cut by 50%

“With Fusion’s out-of-the-box capabilities, we skipped months in our dev cycle so we could focus our team where they would have the most impact.

We cut our licensing costs by 50% and improved application usability. The Lucidworks professional services team amplified our success even further. We’re all Fusion from here on out!”

Lourduraju Pamishetty, Senior IT Application Architect

• Seamless integration of your entire search & analytics platform

• All capabilities exposed through secured APIs, so you can use our UI or build your own.

• End-to-end security policies can be applied out of the box to every aspect of your search ecosystem.

• Distributed, fault-tolerant scaling and supervision of your entire search application


Fusion powers search for the brightest companies in the world.

Lucidworks Fusion

search & relevancy

Basic Keyword Search

The beginning of a typical search journey

What you SEND to Lucene/Solr:

Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term    Documents
a       doc1 [2x]
brown   doc3 [1x], doc5 [1x]
cat     doc4 [1x]
cow     doc2 [1x], doc5 [1x]
…       …
once    doc1 [1x], doc5 [1x]
over    doc2 [1x], doc3 [1x]
the     doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…       …

The inverted index
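As a rough illustration of the mapping above, the following Python sketch builds a term-to-document index with per-document counts from the five example documents. The `tokenize` helper (lowercase, letters only) and the function names are assumptions made for this sketch; real Lucene/Solr analysis is configurable and far richer.

```python
import re
from collections import Counter, defaultdict

# The five example documents from the slide.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": "The brown cow said “moo” once.",
}


def tokenize(text):
    """Naive analysis (an assumption): lowercase, keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


index = build_inverted_index(docs)
for term in ("a", "brown", "the"):
    print(term, index[term])
```

Looking up a term is then a single dictionary access, which is what makes keyword lookup fast regardless of how many documents are indexed.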


/solr/select/?q=apache solr

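Term     Documents
…        …
apache   doc1, doc3, doc4, doc5
hadoop   doc2, doc4, doc6
…        …
solr     doc1, doc3, doc4, doc7, doc8
…        …

[Venn diagram: the "apache" and "solr" document sets overlap on doc1, doc3, and doc4, the documents that match "apache solr"; doc7 and doc8 fall in the "solr" set only]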

Matching queries to documents
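Matching a multi-term query against the index comes down to set operations on the posting lists. The sketch below uses the posting lists from the slide; the `match` function and its `require_all` flag are hypothetical names for this illustration, standing in for AND versus OR query semantics.

```python
# Posting lists copied from the slide.
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match(query, require_all):
    """Return doc ids matching all terms (AND) or any term (OR)."""
    term_sets = [postings.get(term, set()) for term in query.lower().split()]
    if not term_sets:
        return set()
    combine = set.intersection if require_all else set.union
    return combine(*term_sets)


both = sorted(match("apache solr", require_all=True))
either = sorted(match("apache solr", require_all=False))
print("AND:", both)
print("OR: ", either)
```

With AND semantics only the overlap of the two sets is returned; with OR semantics every document containing either term matches, and ranking decides which come first.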


Text Analysis

Generating terms to index from raw text

Text Analysis in Solr

A text field in Lucene/Solr has an Analyzer containing:

① Zero or more CharFilters
Takes incoming text and “cleans it up” before it is tokenized

② One Tokenizer
Splits incoming text into a Token Stream containing zero or more Tokens

③ Zero or more TokenFilters
Examines and optionally modifies each Token in the Token Stream

*From Solr in Action, Chapter 6
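The three-stage structure can be mimicked in a few lines of Python. This is only a conceptual sketch: the function names below are made up for illustration and are not Solr classes, and the stop word list is an arbitrary assumption.

```python
import re


def html_strip_char_filter(text):
    """CharFilter stand-in: clean raw text before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer stand-in: split cleaned text into a token stream."""
    return text.split()


def lowercase_filter(tokens):
    """TokenFilter stand-in: modify each token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stop_words=frozenset({"a", "an", "the", "of"})):
    """TokenFilter stand-in: drop tokens found in a (hypothetical) stop list."""
    return [token for token in tokens if token not in stop_words]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze(
    "<b>The</b> Cat in the Hat",
    char_filters=[html_strip_char_filter],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(terms)
```

The order of token filters matters: here lowercasing runs before stop word removal so that "The" is also dropped.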


Multi-lingual Text Analysis

Analyzing text across multiple languages

Example English Analysis Chains


Per-language Analysis Chains


*Some of the 32 different language configurations in Appendix B of Solr in Action


Which Stemmer do I choose?


Common English Stemmers


When Stemming goes awry

Fixing Stemming Mistakes:
• Unfortunately, every stemmer will have problem cases that aren’t handled as you would expect
• Thankfully, stemmers can be overridden
  • KeywordMarkerFilter: protects a list of terms you specify from being stemmed
  • StemmerOverrideFilter: applies a list of custom term mappings you specify

Alternate strategy:
• Use Lemmatization (root-form analysis) instead of Stemming
• Commercial vendors help tremendously in this space
• The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages
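The two override mechanisms can be pictured with a toy example. The `naive_stem` function below is a deliberately crude suffix stripper invented for this sketch (it is not the Porter stemmer or any Solr stemmer), and the protected list and override map are hypothetical; the point is only to show how a protected term and a custom mapping bypass the stemmer.

```python
def naive_stem(term):
    """Toy stemmer (an assumption): strip one common English suffix."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def stem_with_overrides(term, protected, overrides):
    """Apply protection and overrides before falling back to the stemmer."""
    if term in protected:      # like a keyword marker: leave the term untouched
        return term
    if term in overrides:      # like a stemmer override: use the custom mapping
        return overrides[term]
    return naive_stem(term)


protected = {"news"}           # hypothetical protected-words list
overrides = {"mice": "mouse"}  # hypothetical custom mappings

for word in ("jumping", "news", "mice"):
    print(word, "->", naive_stem(word), "|", stem_with_overrides(word, protected, overrides))
```

Without the overrides the toy stemmer turns "news" into "new" and leaves "mice" unrelated to "mouse", which is the kind of problem case the two filters are meant to patch.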

Relevancy

Scoring the results, returning the best matches

Classic Lucene Relevancy Algorithm (now switched to BM25):

*Source: Solr in Action, chapter 3

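$$\mathrm{Score}(q, d) = \left[ \sum_{t \in q} \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \right] \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)$$

Where: $t$ = term; $d$ = document; $q$ = query; $f$ = field

$$\mathrm{tf}(t \in d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$$

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$

$$\mathrm{coord}(q, d) = \frac{\mathrm{numTermsInDocumentFromQuery}}{\mathrm{numTermsInQuery}}$$

$$\mathrm{queryNorm}(q) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$

$$\mathrm{norm}(t, d) = d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()$$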
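To make the formula concrete, here is a small Python sketch that evaluates it term by term. It assumes all boosts are 1, a natural logarithm, and a length norm of 1/sqrt(number of terms in the field); those simplifications, along with the function name and the toy document frequencies, are assumptions of this sketch rather than something stated on the slide.

```python
import math


def classic_score(query_terms, doc_terms, doc_freq, num_docs):
    """Classic TF-IDF style score with every boost assumed to be 1."""
    matched = [t for t in query_terms if t in doc_terms]
    if not matched:
        return 0.0

    def idf(term):
        return 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))

    # queryNorm(q) = 1 / sqrt(sumOfSquaredWeights), boosts = 1
    sum_of_squared_weights = sum(idf(t) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_of_squared_weights)

    # coord(q, d) = matched query terms / total query terms
    coord = len(matched) / len(query_terms)

    # norm(t, d) reduces to lengthNorm(f); assumed to be 1 / sqrt(field length)
    length_norm = 1.0 / math.sqrt(len(doc_terms))

    total = sum(
        math.sqrt(doc_terms.count(t)) * idf(t) ** 2 * length_norm
        for t in matched
    )
    return total * coord * query_norm


# Toy corpus statistics (assumed): 5 documents, each term appears in 2 of them.
doc_freq = {"brown": 2, "cow": 2}
doc2 = "the cow jumped over the moon".split()
doc4 = "the cat in the hat".split()
doc5 = "the brown cow said moo once".split()

query = ["brown", "cow"]
scores = {
    name: classic_score(query, terms, doc_freq, num_docs=5)
    for name, terms in (("doc2", doc2), ("doc4", doc4), ("doc5", doc5))
}
print(scores)
```

A document containing both query terms benefits twice: it sums two term contributions and gets the full coord factor, so it outranks a document of the same length that matches only one term.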


• Term Frequency: “How well does a term describe a document?”
  – Measure: how often a term occurs per document

• Inverse Document Frequency: “How important is a term overall?”
  – Measure: how rare the term is across all documents

TF * IDF

*Source: Solr in Action, chapter 3


News Search : popularity and freshness drive relevance

Restaurant Search: geographical proximity and price range are critical

Ecommerce: likelihood of a purchase is key

Movie search: More popular titles are generally more relevant

Job search: category of job, salary range, and geographical proximity matter

TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!

That’s great, but what about domain-specific knowledge?


John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.

Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry.

Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.

Jane is a nurse educator in Boston seeking between $40K and $60K

*Example from chapter 16 of Solr in Action

Consider what you know about users


http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=(
    jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
  )
  AND (
    (city:"Boston" AND state:"MA")^15 OR state:"MA"
  )
  AND _val_:"map(salary, 40000, 60000, 10, 0)"

*Example from chapter 16 of Solr in Action

Query for Jane

Jane is a nurse educator in Boston seeking between $40K and $60K
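A query like this can be assembled from a user profile programmatically. The Python sketch below builds the same request with the standard library; the `build_job_query` function and its parameters are hypothetical, the boost values are simply those from the slide, and no escaping of special query characters is attempted.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_job_query(title, city, state, min_salary, max_salary):
    """Build a boosted Solr query URL from a (hypothetical) user profile."""
    q = (
        f'(jobtitle:"{title}"^25 OR jobtitle:({title})^10) '
        f'AND ((city:"{city}" AND state:"{state}")^15 OR state:"{state}") '
        # map(x, min, max, target, default): 10 inside the salary range, else 0
        f'AND _val_:"map(salary, {min_salary}, {max_salary}, 10, 0)"'
    )
    params = {"fl": "jobtitle,city,state,salary", "q": q}
    # Host, port, and collection name are taken from the slide's example.
    return "http://localhost:8983/solr/jobs/select/?" + urlencode(params)


url = build_job_query("nurse educator", "Boston", "MA", 40000, 60000)
print(url)

# Decode the query string again to check what Solr would receive.
decoded_q = parse_qs(urlsplit(url).query)["q"][0]
print(decoded_q)
```

Using `urlencode` takes care of the quotes, spaces, and carets that would otherwise need to be escaped by hand in the URL.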


{ ...
  "response":{"numFound":22,"start":0,"docs":[
    {"jobtitle":" Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503},
    {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183},
    {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359}
    …]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action

Search Results for Jane


You just built a recommendation engine!

Demo: Recommendations

Traditional Keyword Search

Recommendations

Semantic Search

User Intent

Personalized Search

Augmented Search

Domain-aware Matching

The Relevancy Spectrum


Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)

Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)

Self-learning

Data-driven App Sophistication


what is “reflected intelligence”?

The Three C’s

Content: Keywords and other features in your documents

Collaboration: How others have chosen to interact with your system

Context: Available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”


Feedback Loops

[Cycle diagram: user searches, user sees results, user takes an action]

Users’ actions inform system improvements
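One simple way to picture such a loop is to aggregate logged user actions per query and use the totals to reorder results. The sketch below is illustrative only: the signal tuples, the action weights, and the function names are all hypothetical and do not reflect how Fusion stores or aggregates signals.

```python
from collections import Counter, defaultdict

# Hypothetical signal log: (query, document id, action taken by a user).
signals = [
    ("solr book", "docA", "click"),
    ("solr book", "docB", "click"),
    ("solr book", "docB", "purchase"),
]

# Hypothetical weights: stronger actions count for more.
action_weights = {"click": 1.0, "purchase": 5.0}


def aggregate_signals(signal_log, weights):
    """Sum weighted actions per (query, document) pair."""
    boosts = defaultdict(Counter)
    for query, doc_id, action in signal_log:
        boosts[query][doc_id] += weights.get(action, 0.0)
    return boosts


def rerank(query, results, boosts):
    """Order results by aggregated boost; ties keep their original order."""
    return sorted(results, key=lambda doc_id: boosts[query][doc_id], reverse=True)


boosts = aggregate_signals(signals, action_weights)
reranked = rerank("solr book", ["docA", "docB", "docC"], boosts)
print(reranked)
```

Each new search that leads to an action adds to the log, so the next aggregation run reflects what users actually found useful.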


● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content

Examples of Reflected Intelligence


Relevancy Tuning

Improving ranking algorithms through experiments and models

How to Measure Relevancy?

[Diagram: regions A, B, and C, with the set of Retrieved Documents labeled]
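Precision, recall, and nDCG are the measures named earlier in the deck for judging relevancy. The following Python sketch computes them from a set of retrieved documents and a set of documents judged relevant; the document ids and graded gains are made-up inputs, and the nDCG variant shown (linear gain, log2 discount) is one common formulation among several.

```python
import math


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def ndcg(gains):
    """Normalized discounted cumulative gain for gains listed in ranked order."""
    def dcg(values):
        return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(values))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical judgements: four documents returned, three judged relevant overall.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(precision, recall)

# Graded relevance of results in the order they were shown.
print(ndcg([3, 2, 1]), ndcg([1, 2, 3]))
```

Precision and recall ignore ordering, while nDCG rewards putting the most relevant results at the top, which is why it is often preferred for evaluating ranking changes.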
