intent algorithms: the data science of smart information retrieval systems

Intent Algorithms: The data science of smart information retrieval systems Trey Grainger SVP of Engineering, Lucidworks Southern Data Science Conference 2017.04.07

Upload: trey-grainger

Post on 29-Jan-2018


Page 1: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Intent Algorithms: The data science of smart information retrieval systems

Trey Grainger
SVP of Engineering, Lucidworks

Southern Data Science Conference
2017.04.07

Page 2: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Trey Grainger
SVP of Engineering

• Previously Director of Engineering @ CareerBuilder

• MBA, Management of Technology – Georgia Tech

• BA, Computer Science, Business, & Philosophy – Furman University

• Information Retrieval & Web Search - Stanford University

Other fun projects:

• Co-author of Solr in Action, plus numerous research papers

• Frequent conference speaker

• Founder of Celiaccess.com, the gluten-free search engine

• Lucene/Solr contributor

About Me

Page 3: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

• Introduction

- Apache Solr

- Lucidworks / Fusion

• Search Engine Fundamentals

- Keyword Search

- Relevancy Ranking

- Domain-specific Relevancy

- Crafting Relevancy Functions

Agenda

• Reflected Intelligence

- Signals (Demo)

- Recommendations (Demo)

- Learning to Rank (Demo)

• Semantic Search

- RDF / SPARQL

- Entity Extraction

- Query Parsing

- Semantic Knowledge Graph

Southern Data Science

Page 4: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Dimensions of User Intent

[Figure: overlapping-regions diagram with the labels Traditional Keyword Search, Semantic Search, Recommendations, Personalized Search, Augmented Search, Domain-aware Matching, and User Intent]

Page 5: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

what do you do?

Page 6: Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Page 7: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Search-Driven Everything

Customer Service

Customer Insights

Fraud Surveillance

Research Portal

Online Retail

Digital Content

Page 8: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Apache Solr

Page 9: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

“Solr is the popular, blazing-fast,

open source enterprise search

platform built on Apache Lucene™.”

Page 10: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Key Solr Features:

● Multilingual Keyword search

● Relevancy Ranking of results

● Faceting & Analytics (nested / relational)

● Highlighting

● Spelling Correction

● Autocomplete/Type-ahead Prediction

● Sorting, Grouping, Deduplication

● Distributed, Fault-tolerant, Scalable

● Geospatial search

● Complex Function queries

● Recommendations (More Like This)

● Graph Queries and Traversals

● SQL Query Support

● Streaming Aggregations

● Batch and Streaming processing

● Highly Configurable / Plugins

● Learning to Rank

● Building machine-learning models

● … many more

*source: Solr in Action, chapter 2

Page 11: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

The standard for enterprise search.

90% of Fortune 500 uses Solr.

Page 12: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Lucidworks Fusion

Page 13: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Typical Search Architecture Evolution

[Figure: architecture diagram of a search stack assembled from separate open source components. Labels shown:]

• Solr (Shards)
• ZooKeeper (ZK 1 … ZK N): Shared Config Management, Leader Election, Load Balancing
• Spark/Hadoop (Optional): Workers, Cluster Manager
• HDFS
• Nutch/Heritrix
• Log Proc.
• Mahout (Recommender)
• ManifoldCF* (Connectors)
• Security (Roll your own)
• SiLK
• Scheduling (cron?)
• Admin UI
• Deployment (Roll your own)
• Monitoring (Roll your own)
• Relevance Tools (Roll your own)
• NLP tools

*only 12 connectors available, compared w/ 60+ in Fusion

Tika ships w/ Solr, but can’t be scaled independently

Page 14: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Fusion Simplifies the Deployment

[Figure: Fusion architecture diagram. Labels shown:]

• SECURITY BUILT-IN
• Data sources: LOGS, FILE, WEB, DATABASE, CLOUD
• Core Services: ETL and Query Pipelines, Recommenders/Signals, NLP, Machine Learning, Alerting and Messaging, Security, Connectors, Scheduling
• Apache Solr (Shards)
• Apache ZooKeeper (ZK 1 …): Leader Election, Load Balancing, Shared Config Management
• Apache Spark: Workers, Cluster Manager
• HDFS (Optional)
• REST API
• Admin UI
• Lucidworks View

Page 15: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Lucidworks Fusion

Page 16: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Fusion powers search for the brightest companies in the world.

Page 17: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

search & relevancy

Page 18: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Basic Keyword Search (inverted index, tf-idf, bm25, multilingual text analysis, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)

Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)

Self-learning

Intent Algorithm Spectrum

Southern Data Science

Page 19: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Basic Keyword Search

The beginning of a typical search journey

Page 20: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

The inverted index

What you SEND to Lucene/Solr:

Document  Content Field
doc1      once upon a time, in a land far, far away
doc2      the cow jumped over the moon.
doc3      the quick brown fox jumped over the lazy dog.
doc4      the cat in the hat
doc5      The brown cow said “moo” once.
…         …

How the content is INDEXED into Lucene/Solr (conceptually):

Term   Documents
a      doc1 [2x]
brown  doc3 [1x], doc5 [1x]
cat    doc4 [1x]
cow    doc2 [1x], doc5 [1x]
…      …
once   doc1 [1x], doc5 [1x]
over   doc2 [1x], doc3 [1x]
the    doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…      …

Southern Data Science
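The mapping from terms to documents on this slide can be reproduced with a few lines of Python. This is only a sketch under simple assumptions: the `tokenize` helper is hypothetical (lowercasing plus alphabetic runs), far cruder than a real Lucene/Solr analyzer, and the index is a plain in-memory dictionary rather than Lucene's on-disk structure.

```python
import re
from collections import Counter, defaultdict

# The five example documents from the slide.
DOCS = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def tokenize(text):
    """Hypothetical, naive analysis: lowercase and keep alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


index = build_inverted_index(DOCS)
print(index["the"])
print(index["cow"])
```

Looking up a term is then a single dictionary access, which is the property that makes keyword matching fast regardless of how many documents are indexed.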

Page 21: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Matching queries to documents

/solr/select/?q=apache solr

Term    Documents
…       …
apache  doc1, doc3, doc4, doc5
hadoop  doc2, doc4, doc6
…       …
solr    doc1, doc3, doc4, doc7, doc8
…       …

[Figure: Venn diagram of the documents matching "apache solr". apache only: doc5; solr only: doc7, doc8; both apache and solr: doc1, doc3, doc4]

Southern Data Science
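The matching step on this slide amounts to set operations over postings lists. A minimal sketch, assuming the postings are held as Python sets; the `match` function and its `operator` parameter are illustrative names, not part of Solr:

```python
# Postings lists copied from the slide: term -> set of matching documents.
POSTINGS = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match(terms, postings, operator="OR"):
    """Return the documents matching any (OR) or all (AND) of the terms."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    return set.union(*sets)


either = match(["apache", "solr"], POSTINGS, "OR")
both = match(["apache", "solr"], POSTINGS, "AND")
print(sorted(either))
print(sorted(both))
```

The AND result corresponds to the overlap region of the diagram, and the OR result to the whole diagram.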

Page 22: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Text Analysis

Generating terms to index from raw text

Page 23: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Text Analysis in Solr

A text field in Lucene/Solr has an Analyzer containing:

① Zero or more CharFilters: take incoming text and “clean it up” before it is tokenized

② One Tokenizer: splits incoming text into a Token Stream containing zero or more Tokens

③ Zero or more TokenFilters: examine and optionally modify each Token in the Token Stream

*From Solr in Action, Chapter 6

Southern Data Science
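The three-stage structure of an analyzer can be imitated in plain Python. The sketch below is not Solr's API: every function name is hypothetical, and the HTML stripping, tokenization, and stopword list are deliberately simplistic stand-ins for real CharFilters, Tokenizers, and TokenFilters.

```python
import re

STOPWORDS = {"a", "an", "the", "in", "of"}  # illustrative stopword list


def strip_html(text):
    """Char filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text):
    """Tokenizer: split cleaned text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: normalize case."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Token filter: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze(
    "<b>The</b> Cat in the Hat",
    char_filters=[strip_html],
    tokenizer=word_tokenizer,
    token_filters=[lowercase, remove_stopwords],
)
print(terms)
```

The order of token filters matters: here stopword removal only works because lowercasing runs first.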

Page 24: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Page 25: Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Page 26: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Page 27: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Per-language Analysis Chains

*Some of the 32 different language configurations in Appendix B of Solr in Action

Southern Data Science

Page 28: Intent Algorithms: The Data Science of Smart Information Retrieval Systems


Page 29: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Page 30: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Relevancy Ranking

Scoring the results, returning the best matches

Page 31: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Classic Lucene/Solr Relevancy Algorithm:

*Source: Solr in Action, chapter 3

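\[
\mathrm{Score}(q, d) = \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)
\]

Where: t = term; d = document; q = query; f = field

\[
\mathrm{tf}(t \in d) = \text{numTermOccurrencesInDocument}^{1/2}
\]
\[
\mathrm{idf}(t) = 1 + \log\left(\frac{\text{numDocs}}{\text{docFreq} + 1}\right)
\]
\[
\mathrm{coord}(q, d) = \frac{\text{numTermsInDocumentFromQuery}}{\text{numTermsInQuery}}
\]
\[
\mathrm{queryNorm}(q) = \frac{1}{\text{sumOfSquaredWeights}^{1/2}}
\]
\[
\text{sumOfSquaredWeights} = q.\mathrm{getBoost}()^{2} \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^{2}
\]
\[
\mathrm{norm}(t, d) = d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()
\]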

Southern Data Science
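The core of the classic formula can be traced with a small Python function. This is a simplified sketch, not Lucene's implementation: it assumes all boosts, `norm`, and `queryNorm` equal 1, uses the natural logarithm, and takes term counts and document frequencies as plain dictionaries (hypothetical inputs chosen for illustration).

```python
import math


def tf(term_count):
    """tf(t in d): square root of the number of occurrences in the document."""
    return math.sqrt(term_count)


def idf(num_docs, doc_freq):
    """idf(t) = 1 + log(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def classic_score(query_terms, doc_term_counts, doc_freqs, num_docs):
    """Sum tf * idf^2 over matching query terms, then multiply by coord.

    Assumption: boosts, norm(t, d) and queryNorm(q) are all taken to be 1.
    """
    matched = [t for t in query_terms if doc_term_counts.get(t, 0) > 0]
    if not matched:
        return 0.0
    total = sum(
        tf(doc_term_counts[t]) * idf(num_docs, doc_freqs[t]) ** 2
        for t in matched
    )
    coord = len(matched) / len(query_terms)
    return total * coord


# One of two query terms matches, so coord halves the score.
full = classic_score(["cow"], {"cow": 1}, {"cow": 1}, num_docs=2)
half = classic_score(["cow", "moon"], {"cow": 1}, {"cow": 1, "moon": 1}, num_docs=2)
print(full, half)
```

The `coord` factor is what rewards documents that match more of the query terms, independently of how rare those terms are.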

Page 32: Intent Algorithms: The Data Science of Smart Information Retrieval Systems


Page 33: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

• Term Frequency: “How well does a term describe a document?”

– Measure: how often a term occurs per document

• Inverse Document Frequency: “How important is a term overall?”

– Measure: how rare the term is across all documents

TF * IDF

*Source: Solr in Action, chapter 3

Southern Data Science

Page 34: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

BM25 (Okapi “Best Match” 25th Iteration)

\[
\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d) \cdot (k + 1)}{\mathrm{tf}(t \in d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
\]

Where: t = term; d = document; q = query; i = index

\[
\mathrm{tf}(t \in d) = \text{numTermOccurrencesInDocument}^{1/2}
\]
\[
\mathrm{idf}(t) = 1 + \log\left(\frac{\text{numDocs}}{\text{docFreq} + 1}\right)
\]
\[
|d| = \sum_{t \in d} 1
\]
\[
\mathrm{avgdl} = \frac{\sum_{d \in i} |d|}{\sum_{d \in i} 1}
\]

k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.

b = Free parameter. Usually ~0.75. Increases impact of document normalization.

Southern Data Science
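A direct transcription of the formula into Python makes the roles of k and b easier to see. This sketch follows the definitions exactly as given on the slide (square-root tf, the same idf as the classic formula); library implementations such as Lucene's BM25Similarity differ in details, so treat it as an illustration rather than a reference. The input dictionaries and parameter names are assumptions.

```python
import math


def bm25_score(query_terms, doc_term_counts, doc_freqs, num_docs, avgdl,
               k=1.2, b=0.75):
    """Score one document against a query using the slide's BM25 definitions."""
    doc_len = sum(doc_term_counts.values())  # |d|: number of terms in the doc
    # Length normalization: 1 for an average-length doc, larger for longer docs.
    length_norm = 1.0 - b + b * doc_len / avgdl
    score = 0.0
    for term in query_terms:
        count = doc_term_counts.get(term, 0)
        if count == 0:
            continue
        tf = math.sqrt(count)
        idf = 1.0 + math.log(num_docs / (doc_freqs[term] + 1))
        score += idf * (tf * (k + 1)) / (tf + k * length_norm)
    return score


# An average-length document containing the term once.
once = bm25_score(["cow"], {"cow": 1}, {"cow": 1}, num_docs=2, avgdl=1.0)
# Same term repeated 100 times in a document of average length 100.
many = bm25_score(["cow"], {"cow": 100}, {"cow": 1}, num_docs=2, avgdl=100.0)
print(once, many)
```

However often the term repeats, its contribution stays below idf · (k + 1), which is the saturation behaviour the k parameter controls.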

Page 35: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

News Search : popularity and freshness drive relevance

Restaurant Search: geographical proximity and price range are critical

Ecommerce: likelihood of a purchase is key

Movie search: More popular titles are generally more relevant

Job search: category of job, salary range, and geographical proximity matter

TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!

That’s great, but what about domain-specific knowledge?

Southern Data Science

Page 36: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

*Example from chapter 16 of Solr in Action

Domain-specific relevancy calculation (News Website Example)

News website:

/select?
  fq=$myQuery&
  q=_query_:"{!func}scale(query($myQuery),0,100)"
    AND _query_:"{!func}div(100,map(geodist(),0,1,1))"
    AND _query_:"{!func}recip(rord(publicationDate),0,100,100)"
    AND _query_:"{!func}scale(popularity,0,100)"&
  myQuery="street festival"&
  sfield=location&
  pt=33.748,-84.391

Each of the four function clauses in q is labeled as contributing 25% of the score.
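The idea behind the equal weighting can be sketched outside Solr: bring every relevance factor onto the same 0 to 100 range, then average them. The code below assumes that `scale(x,0,100)` acts as a min-max rescaling across the matched documents; the function names and the sample factor values are hypothetical.

```python
def min_max_scale(values, lo=0.0, hi=100.0):
    """Rescale raw factor values so the smallest maps to lo and the largest to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def combine(factors):
    """Average equally weighted factors.

    factors: dict of factor name -> list of raw values, one per document.
    With four factors each one contributes 25% of the final score.
    """
    scaled = [min_max_scale(values) for values in factors.values()]
    return [sum(column) / len(scaled) for column in zip(*scaled)]


# Two hypothetical factors for three documents (so 50% each here).
scores = combine({"keywords": [1, 2, 3], "popularity": [10, 10, 40]})
print(scores)
```

Without the rescaling step, whichever factor has the largest raw magnitude would dominate the sum, which is why the slide normalizes each clause before combining.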

Page 37: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Fancy boosting functions (Restaurant Search Example)

Distance (50%) + keywords (30%) + category (20%)

q=_val_:"scale(mul(query($keywords),1),0,30)" AND
  _val_:"scale(sum($radiusInKm,mul(query($distance),-1)),0,50)" AND
  _val_:"scale(mul(query($category),1),0,20)"
&keywords=filet mignon
&radiusInKm=48.28
&distance=_val_:"geodist(latitudelongitude.latlon_is,33.77402,-84.29659)"
&category="fine dining"
&fq={!cache=false v=$keywords}

Page 38: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

This is powerful, but feels like

a lot of work to get right…

Page 39: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

what is “reflected intelligence”?

Page 40: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

The Three C’s

Content: Keywords and other features in your documents

Collaboration: How others have chosen to interact with your system

Context: Available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”

Southern Data Science

Page 41: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

● Recommendation Algorithms

● Building user profiles from past searches, clicks, and other actions

● Identifying correlations between keywords/phrases

● Building out automatically-generated ontologies from content and queries

● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs

● Learning to Rank - using relevancy judgements and machine learning to train a relevance model

● Discovering misspellings, synonyms, acronyms, and related keywords

● Disambiguation of keyword phrases with multiple meanings

● Learning what’s important in your content

Examples of Reflected Intelligence

Southern Data Science

Page 42: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

John lives in Boston but wants to move to New York or possibly another big city. He is

currently a sales manager but wants to move towards business development.

Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location

in the food service industry.

Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a

Big Data company. He is happy to move across the U.S. for the right job.

Jane is a nurse educator in Boston seeking between $40K and $60K

*Example from chapter 16 of Solr in Action

Consider what you know about users

Southern Data Science

Page 43: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=(
    jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
  )
  AND (
    (city:"Boston" AND state:"MA")^15
    OR state:"MA")
  AND _val_:"map(salary, 40000, 60000,10, 0)"

*Example from chapter 16 of Solr in Action

Query for Jane

Jane is a nurse educator in Boston seeking between $40K and $60K

Southern Data Science
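A query like the one for Jane can be generated from whatever is known about a user. The sketch below assumes a hypothetical profile dictionary (the field names `jobtitle`, `city`, `state`, and `salary_range` are illustrative) and reuses the boost values shown on the slide; it only builds the request URL and does not contact Solr.

```python
from urllib.parse import urlencode

# Hypothetical user profile for Jane, based on the slide's description.
JANE = {
    "jobtitle": "nurse educator",
    "city": "Boston",
    "state": "MA",
    "salary_range": (40000, 60000),
}


def build_profile_query(profile):
    """Turn a user profile into a boosted Solr query string."""
    title = profile["jobtitle"]
    city = profile["city"]
    state = profile["state"]
    low, high = profile["salary_range"]
    clauses = [
        # Exact phrase match on the title scores higher than loose term match.
        f'(jobtitle:"{title}"^25 OR jobtitle:({title})^10)',
        # Same city is preferred, same state is required.
        f'((city:"{city}" AND state:"{state}")^15 OR state:"{state}")',
        # Salaries inside the desired range get a fixed bonus of 10.
        f'_val_:"map(salary, {low}, {high},10, 0)"',
    ]
    return " AND ".join(clauses)


query = build_profile_query(JANE)
url = "http://localhost:8983/solr/jobs/select/?" + urlencode(
    {"fl": "jobtitle,city,state,salary", "q": query}
)
print(query)
print(url)
```

Because the profile drives the query, the same function yields a different ranking for each user without any per-user model training.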

Page 44: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Search Results for Jane

{ ...
"response":{"numFound":22,"start":0,"docs":[
  {"jobtitle":" Clinical Educator (New England/ Boston)",
   "city":"Boston",
   "state":"MA",
   "salary":41503},
  {"jobtitle":"Nurse Educator",
   "city":"Braintree",
   "state":"MA",
   "salary":56183},
  {"jobtitle":"Nurse Educator",
   "city":"Brighton",
   "state":"MA",
   "salary":71359}
  …]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action

Southern Data Science

Page 45: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

You just built a

recommendation engine!

Page 46: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Collaborative Filtering

What you SEND to Lucene/Solr:

Document  “Users who bought this product” field
doc1      user1, user4, user5
doc2      user2, user3
doc3      user4
doc4      user4, user5
doc5      user4, user1
…         …

How the content is INDEXED into Lucene/Solr (conceptually):

Term   Documents
user1  doc1, doc5
user2  doc2
user3  doc2
user4  doc1, doc3, doc4, doc5
user5  doc1, doc4
…      …

Page 47: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Step 1: Find similar users who like the same documents

Document  “Users who bought this product” field
doc1      user1, user4, user5
doc2      user2, user3
doc3      user4
doc4      user4, user5
doc5      user4, user1
…         …

q=documentid: ("doc1" OR "doc4")

[Figure: Venn diagram of the users attached to doc1 (user1, user4, user5) and doc4 (user4, user5)]

Top-scoring results (most similar users):
1) user4 (2 shared likes)
2) user5 (2 shared likes)
3) user1 (1 shared like)

*Source: Solr in Action, chapter 16

Page 48: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

/solr/select/?q=userlikes:("user4"^2 OR "user5"^2 OR "user1"^1)

Southern Data Science

Step 2: Search for docs “liked” by those similar users

Term   Documents
user1  doc1, doc5
user2  doc2
user3  doc2
user4  doc1, doc3, doc4, doc5
user5  doc1, doc4
…      …

Most similar users:
1) user4 (2 shared likes)
2) user5 (2 shared likes)
3) user1 (1 shared like)

Top recommended documents:
1) doc1 (matches user4, user5, user1)
2) doc4 (matches user4, user5)
3) doc5 (matches user4, user1)
4) doc3 (matches user4)

// doc2 does not match

*Source: Solr in Action, chapter 16
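The two steps can be reproduced in memory to check the rankings shown on the slides. This is a sketch of the idea only, with the slide's data held in a Python dictionary; it assumes that a user's weight in step 2 is simply the number of shared likes from step 1, mirroring the `^2` and `^1` boosts in the query.

```python
from collections import Counter

# user -> documents that user liked (the "Term -> Documents" view on the slide).
LIKES = {
    "user1": {"doc1", "doc5"},
    "user2": {"doc2"},
    "user3": {"doc2"},
    "user4": {"doc1", "doc3", "doc4", "doc5"},
    "user5": {"doc1", "doc4"},
}


def similar_users(seed_docs, likes):
    """Step 1: score each user by how many of the seed documents they liked."""
    shared = Counter()
    for user, docs in likes.items():
        overlap = len(docs & seed_docs)
        if overlap:
            shared[user] = overlap
    return shared


def recommend(user_weights, likes):
    """Step 2: score each document by the summed weights of users who liked it."""
    scores = Counter()
    for user, weight in user_weights.items():
        for doc in likes[user]:
            scores[doc] += weight
    return scores


users = similar_users({"doc1", "doc4"}, LIKES)
recommendations = recommend(users, LIKES)
print(users.most_common())
print(recommendations.most_common())
```

As on the slide, the seed documents themselves come back at the top; a production recommender would normally filter out items the user has already seen.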

Page 49: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Using matrix factorization is typically more efficient (Ships with Fusion 3.1):

Southern Data Science

Page 50: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Feedback Loops

[Figure: cycle diagram] User searches → User sees results → User takes an action → Users’ actions inform system improvements

Southern Data Science

Page 51: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Demo:

Signals & Recommendations

Page 52: Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Page 53: Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Page 54: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

• 200%+ increase in click-through rates

• 91% lower TCO

• 50,000 fewer support tickets

• Increased customer satisfaction

Page 55: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Learning to Rank

Page 56: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Learning to Rank (LTR)

● It applies machine learning techniques to discover the combination of features that provides the best ranking.

● It requires a labeled set of documents with relevancy scores for a given set of queries.

● Features used for ranking are usually more computationally expensive than the ones used for matching.

● It typically re-ranks a subset of the matched documents (e.g. top 1000).

Southern Data Science
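The re-ranking step in the last bullet can be sketched as follows. The code assumes a linear model whose weights have already been learned elsewhere; the feature names, weights, and document ids are invented for illustration and do not come from the talk or from Solr's LTR plugin.

```python
def rerank(ranked_docs, features, weights, rerank_docs=1000):
    """Re-score only the top `rerank_docs` results with a learned linear model.

    ranked_docs: document ids in their original (cheap) ranking order.
    features:    doc id -> {feature name: value}, the expensive features.
    weights:     feature name -> learned weight (assumed to be given).
    """
    head, tail = ranked_docs[:rerank_docs], ranked_docs[rerank_docs:]

    def model_score(doc_id):
        return sum(weights[name] * value
                   for name, value in features[doc_id].items())

    # Only the head is re-sorted; the tail keeps its original order.
    return sorted(head, key=model_score, reverse=True) + tail


# Hypothetical example: one feature, re-rank the top 2 of 3 results.
original = ["a", "b", "c"]
doc_features = {
    "a": {"click_rate": 0.1},
    "b": {"click_rate": 0.9},
    "c": {"click_rate": 5.0},
}
reranked = rerank(original, doc_features, {"click_rate": 1.0}, rerank_docs=2)
print(reranked)
```

Limiting the model to the head of the result list keeps the cost of expensive features bounded, at the price of never promoting a document that the first-pass ranking placed too low.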

Page 57: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Page 58: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Common LTR Algorithms

• RankNet* (neural networks, boosted trees)

• LambdaMART* (regression trees)

• SVM Rank** (SVM classifier)

* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf

** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf

Southern Data Science

Page 59: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Demo: Learning to Rank

Page 60: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

#1: Pull, Build, Start Solr
git clone https://github.com/apache/lucene-solr.git && cd lucene-solr/solr
ant server
bin/solr -e techproducts -Dsolr.ltr.enabled=true

#2: Run Searches
http://localhost:8983/solr/techproducts/browse?q=ipod

#3: Supply User Relevancy Judgements
cd contrib/ltr/example/
nano user_queries.txt

#4: Install Training Library
curl -L https://github.com/cjlin1/liblinear/archive/v210.zip > liblinear-2.1.tar.gz
tar -xf liblinear-2.1.tar.gz && mv liblinear-210 liblinear
cd liblinear && make && cd ../

#5: Train and Upload Model
./train_and_upload_demo_model.py -c config.json

#6: Re-run Searches using Machine-learned Ranking Model
http://localhost:8983/solr/techproducts/browse?q=ipod&rq={!ltr model=exampleModel reRankDocs=25 efi.user_query=$q}

Page 61: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

# Run Searches
http://localhost:8983/solr/techproducts/select?q=ipod

Page 62: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

# Supply User Relevancy Judgements
nano contrib/ltr/example/user_queries.txt

# Format: query | doc id | relevancy judgement | source

# Train and Upload Model
./train_and_upload_demo_model.py -c config.json

Page 63: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

# Re-run Searches using Machine-learned Ranking Model
http://localhost:8984/solr/techproducts/browse?q=ipod&rq={!ltr model=exampleModel reRankDocs=100 efi.user_query=$q}

Page 64: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

semantic search

Page 65: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Building a Taxonomy of Entities

Many ways to generate this:

• Statistical Analysis of interesting phrases

- Word2Vec / GloVe / Dice Conceptual Search

• Topic Modelling

• Clustering of documents / phrases

• Buy a dictionary (often doesn’t work for domain-specific search problems)

• Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain*

* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

Page 66: Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Page 67: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Page 68: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Page 69: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

entity extraction

Page 70: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Page 71: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

semantic query parsing

Page 72: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Page 73: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Probabilistic Query Parser

Goal: given a query, predict which combinations of keywords should be combined together as phrases

Example:

senior java developer hadoop

Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"

Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.

Southern Data Science
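The candidate parsings can be generated mechanically: between each pair of adjacent keywords there is either a phrase break or not. The sketch below only enumerates the candidates; how a probabilistic parser would score them is not shown on the slide and is not attempted here. The function name is hypothetical.

```python
from itertools import product


def possible_parsings(query):
    """Enumerate every way to group adjacent keywords into phrases."""
    tokens = query.split()
    if not tokens:
        return []
    parsings = []
    # Each gap between adjacent tokens is either a break (True) or a join (False).
    for breaks in product([True, False], repeat=len(tokens) - 1):
        phrases, current = [], [tokens[0]]
        for token, is_break in zip(tokens[1:], breaks):
            if is_break:
                phrases.append(" ".join(current))
                current = [token]
            else:
                current.append(token)
        phrases.append(" ".join(current))
        parsings.append(phrases)
    return parsings


parsings = possible_parsings("senior java developer hadoop")
for phrases in parsings:
    print(phrases)
```

For n keywords this yields 2^(n-1) candidates, so the four-word example produces eight: the seven listed on the slide plus senior, "java developer hadoop". The exponential growth is one reason a parser needs statistics to prune unlikely groupings rather than trying them all against the index.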

Page 74: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Semantic Query Parsing

Identification of phrases in queries using two steps:

1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.*

2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model)

*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

Southern Data Science

Page 75: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

query augmentation

Page 76: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Page 77: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Page 78: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Knowledge Graph

Documents

id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java

id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy

id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate

Terms-Docs Inverted Index

field      term            doc  pos
desc       a               1    4
                           2    1
                           3    1, 5
           at              1    3
                           2    4
           company         1    6
           doing           2    6
                           3    8
           engineer        1    2
                           3    3, 7
           great           1    5
           hard            2    7
           hospital        2    5
           java            3    6
           nurse           2    3
           or              3    4
           registered      2    2
           software        1    1
                           3    2
           work            2    10
                           3    9
job_title  java developer  3    1
…          …               …    …

Docs-Terms Forward Index

field      doc  terms
desc       1    a, at, company, engineer, great, software
           2    a, at, doing, hard, hospital, nurse, registered, work
           3    a, doing, engineer, java, or, software, work
job_title  1    Software Engineer
…          …    …

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Southern Data Science

Page 79: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

How the Graph Traversal Works

[Figure: the same example shown in three panels, Set-theory View, Graph View, and Data Structure View. Nodes: skill: Java, skill: Scala, skill: Hibernate, skill: Oncology, and doc 1 through doc 6. In the Data Structure View, Java points to docs 1, 2, 6; Scala and Hibernate point to docs 3, 4; Oncology points to doc 5.]

Southern Data Science

Page 80: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Scoring nodes in the Graph

Foreground vs. Background Analysis

Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

\[
z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot \big(1 - \mathrm{probBG}(x)\big)}}
\]

Foreground Query: "Hadoop"

{ "type":"keywords", "values":[
  { "value":"hive", "relatedness": 0.9765, "popularity":369 },
  { "value":"spark", "relatedness": 0.9634, "popularity":15653 },
  { "value":".net", "relatedness": 0.5417, "popularity":17683 },
  { "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
  { "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
  { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }

Southern Data Science
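The z-score on this slide translates directly into a small function. The sketch assumes the three inputs are already available as numbers; the parameter names mirror the formula, and the sample counts are invented. The relatedness values in the JSON above lie between -1 and 1, which suggests a further normalization of z that the slide does not show and that is not reproduced here.

```python
import math


def relatedness_z(count_fg, total_docs_fg, prob_bg):
    """z-score of a term's foreground count against its background probability.

    count_fg:      documents in the foreground set that contain the term
    total_docs_fg: total documents in the foreground set
    prob_bg:       probability of the term appearing in a background document
    """
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1.0 - prob_bg))
    if std_dev == 0:
        # Term never (or always) appears in the background: no signal.
        return 0.0
    return (count_fg - expected) / std_dev


# Hypothetical numbers: 100 foreground docs, term appears in 10% of all docs.
over = relatedness_z(count_fg=50, total_docs_fg=100, prob_bg=0.1)
typical = relatedness_z(count_fg=10, total_docs_fg=100, prob_bg=0.1)
under = relatedness_z(count_fg=1, total_docs_fg=100, prob_bg=0.1)
print(over, typical, under)
```

A positive z means the term is over-represented in the foreground set, a value near zero means it appears about as often as chance would predict, and a negative z means it is under-represented.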

Page 81: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Multi-level Graph Traversal with Scores

[Figure: graph traversal from a Starting Node, software engineer* (*materialized node), through has_related_skill edges to Skill Nodes, through further has_related_skill edges to more Skill Nodes, and through has_related_job_title edges to Job Title Nodes. Skill labels shown: Java, C#, .NET, Hibernate, Scala, VB.NET. Job title labels shown: .NET Developer, Java Developer, Software Engineer, Data Scientist. Edges carry relatedness scores ranging from 0.34 to 0.93.]

Southern Data Science

Page 82: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Knowledge Graph

Southern Data Science

Page 83: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Knowledge Graph

Southern Data Science

Page 84: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Southern Data Science

Page 85: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Knowledge Graph

Use Case: Summarizing Document Intent

Experiment: Pass in raw text (extracting phrases as needed), and rank their similarity to the documents using the SKG.

Additionally, can traverse the graph to “related” entities/keyword phrases NOT found in the original document

Applications: Content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring

Page 86: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Basic Keyword Search (inverted index, tf-idf, bm25, multilingual text analysis, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)

Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)

Self-learning

Intent Algorithm Spectrum

Southern Data Science

Page 87: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Contact Info

Trey Grainger
[email protected]
@treygrainger

http://solrinaction.com
Meetup discount (39% off): 39grainger

Other presentations: http://www.treygrainger.com

Southern Data Science

Page 88: Intent Algorithms: The Data Science of Smart Information Retrieval Systems

Additional References:

Southern Data Science