Webinar: Simpler Semantic Search with Solr

Upload: lucidworks

Post on 03-Aug-2015

Category:

Software


TRANSCRIPT

Page 1: Webinar: Simpler Semantic Search with Solr
Page 2: Webinar: Simpler Semantic Search with Solr

Simpler Semantic Search

Ted Sullivan

Senior Solutions Architect

lucidworks.com

Page 3: Webinar: Simpler Semantic Search with Solr

Building Better Search Applications

Search is about Technology & Language

• These are difficult but also different problems

• Solving the “language problem” requires that we understand how language is used in search

• We understand language at the semantic level – where “meaning” or intent lives

• Search engines deal with language at the syntactic level

• Most problems relating to search quality stem from this basic “disconnect” – the “what” vs “what words” dichotomy

Page 4: Webinar: Simpler Semantic Search with Solr

Technology – Horizontal Concerns

Search applications share these requirements with other information retrieval systems

• Performance – returning results in HTT (Human Tolerable Time)

• Scalability – being able to search “billions and billions” of documents while serving thousands or tens of thousands of users at a time.

• Reliability – fault tolerance, fail-over, redundancy

• Maintainability – easy to upgrade, search index can be kept current in the face of rapidly changing content.

• Usability – User experience is critical to success. For UI and UX, mobile technology is a game changer!

Page 5: Webinar: Simpler Semantic Search with Solr

Language – Vertical Concerns

These requirements are more specific to search systems.

• Accuracy – returning the “correct” results.

• Precision – few false positives

• Recall – few false negatives

• Relevance – returning the “best” results at the top

Returning the wrong results very fast is not necessarily a good thing. Returning too many results can affect performance.
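Precision and recall as defined above can be made concrete with a few lines of Python. This is a minimal sketch; the document IDs and the function name `precision_recall` are made up for illustration and are not part of the talk.

```python
def precision_recall(returned, relevant):
    """Compute precision and recall for a single query.

    returned: IDs the search engine returned
    relevant: IDs a human judged to be correct answers
    """
    returned, relevant = set(returned), set(relevant)
    true_positives = returned & relevant
    # Precision: what fraction of returned results are correct (few false positives).
    precision = len(true_positives) / len(returned) if returned else 0.0
    # Recall: what fraction of correct results were returned (few false negatives).
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs: 4 returned, 3 relevant, 2 in common.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Returning more results tends to raise recall and lower precision, which is why relevance ranking (the “best” results at the top) matters as a separate concern.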

Page 6: Webinar: Simpler Semantic Search with Solr

Time flies like an arrow

Fruit flies like a banana

Our mental image for the second sentence depends on how we “parse” it. It depends on what the subject noun or noun phrase is.

Page 7: Webinar: Simpler Semantic Search with Solr

The subject can be “fruit” or “fruit flies”. This decision changes the verb which is either “flies” or “like” respectively.

Fruit flies like a banana (subject: “fruit”, verb: “flies”)

Fruit flies like a banana (subject: “fruit flies”, verb: “like”)

Page 8: Webinar: Simpler Semantic Search with Solr

We can do this because we know that both “fruit” and “fruit flies” represent single concepts – even though “fruit flies” is two words – i.e. a “noun phrase”.

Fruit flies like a banana (“fruit” as the single concept)

Fruit flies like a banana (“fruit flies” as the single concept)

Page 9: Webinar: Simpler Semantic Search with Solr

Search algorithms and semantics

Tokenization plus vector mathematics (TF/IDF or one of its cousins) – “bag-of-words”

Algorithmic tweaks – enhanced bag-of-words:

1. Some fields are more relevant than others

2. Hitting on more terms in the query is better than hitting on fewer (token scores are summed)

3. The nearer the query terms are to each other in the document the better – same order as query is best

4. Getting 0 results provides no feedback – OR is safer than AND (we already have a “fuzzy” AND with bullet 2)

Problem: Search engines don’t understand semantics
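The bag-of-words idea can be sketched in plain Python. This is an illustration only, assuming whitespace tokenization and a simple tf * log(1 + N/df) weight; it is not Lucene's actual similarity formula, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lower-case and split on whitespace.
    return text.lower().split()


def score(query, docs):
    """Score each document as the sum of tf * idf over matching query tokens.

    Any matching token contributes (OR semantics), so documents that hit on
    more query terms accumulate higher scores.
    """
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    scores = []
    for counts in tokenized:
        total = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for c in tokenized if term in c)
            if df == 0 or term not in counts:
                continue
            idf = math.log(1 + n / df)  # rarer terms weigh more
            total += counts[term] * idf
        scores.append(total)
    return scores


docs = [
    "fruit flies like a banana",
    "time flies like an arrow",
    "banana bread recipe",
]
scores = score("fruit flies", docs)
print(scores)
```

The second document still gets a non-zero score for the query “fruit flies” even though it has nothing to do with insects: the scorer sees the token “flies”, not the concept.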

Page 10: Webinar: Simpler Semantic Search with Solr

Better Search: Detecting Noun Phrases

Can algorithms be used to detect noun phrases?

Yes, but not perfectly, and they may need too much CPU at query time

Another way is to use knowledge bases. That is a lot of extra work, but in some cases we already have one – the search index itself!

Page 11: Webinar: Simpler Semantic Search with Solr

Better Search: Detecting Noun Phrases

The basic technique is called “autophrasing” – recognizing when more than one word represents just one thing.

Autophrasing – uses an extra knowledge-base file “autophrases.txt”

Query Autofiltering – uses the phrases that are stored as metadata values in the index.

Page 12: Webinar: Simpler Semantic Search with Solr

Multi-term Synonym Problem

Subject was inspired by an old JIRA ticket: LUCENE-1622

“if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., ‘big’ in the synonym ‘big apple’ for ‘new york city’) causes the document to match”

(or “apple” which will hit on my blog post if you crawl lucidworks.com !)
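The partial-match problem quoted above can be reproduced with a toy positional index. This is a sketch under simplifying assumptions (whitespace tokens, one synonym pair); `index_with_stacked_synonym` is a hypothetical helper, not a Lucene API.

```python
from collections import defaultdict


def index_with_stacked_synonym(text, phrase, synonym):
    """Build term -> positions, stacking the synonym's tokens on top of the
    original phrase (overlapping positions), as described in the ticket."""
    tokens = text.lower().split()
    p, s = phrase.split(), synonym.split()
    postings = defaultdict(set)
    for pos, tok in enumerate(tokens):
        postings[tok].add(pos)
    for start in range(len(tokens) - len(p) + 1):
        if tokens[start:start + len(p)] == p:
            # Each synonym token is indexed on its own at an overlapping position.
            for offset, syn_tok in enumerate(s):
                postings[syn_tok].add(start + offset)
    return postings


postings = index_with_stacked_synonym(
    "I love new york city", phrase="new york city", synonym="big apple"
)
# A single-term query for "apple" (or "big") now matches this document,
# even though the document is not about apples.
print(sorted(postings["apple"]), sorted(postings["big"]))
```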

Page 13: Webinar: Simpler Semantic Search with Solr

Sausagization

From Mike McCandless blog: Changing Bits: Lucene's TokenStreams are actually graphs!

• This means certain phrase queries should match but don't (e.g.: "hotspot is down"), and other phrase queries shouldn't match but do (e.g.: "fast hotspot fi").

• Other cases do work correctly (e.g.: "fast hotspot"). We refer to this "lossy serialization" as sausagization, because the incoming graph is unexpectedly turned from a correct word lattice into an incorrect sausage.

• This limitation is challenging to fix: it requires changing the index format (and Codec APIs) to store an additional int position length per position, and then fixing positional queries to respect this value.

http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Page 14: Webinar: Simpler Semantic Search with Solr

Multi-term Synonym Demo

autophrases.txt

new york
new york state
empire state
new york city
new york new york
big apple
ny ny
city of new york
state of new york
ny state

synonyms.txt

new_york => new_york_state,new_york_city,big_apple,new_york_new_york,ny_ny,nyc,empire_state,ny_state,state_of_new_york

new_york_state,empire_state,ny_state,state_of_new_york

new_york_city,big_apple,new_york_new_york,ny_ny,nyc,city_of_new_york
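The effect of autophrasing on a token stream can be sketched as a greedy longest-match over the phrase list. This is an illustration of the idea only, assuming lower-cased whitespace tokens and an underscore as the joining character (as in the synonyms.txt entries above); it is not the code of the actual token filter.

```python
def autophrase(tokens, phrases, sep="_"):
    """Replace the longest known multi-word phrase at each position with a
    single token, so that it can be treated as one concept downstream."""
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first; only phrases of 2+ words are joined.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrase_set:
                out.append(sep.join(candidate))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


PHRASES = [
    "new york", "new york state", "empire state", "new york city",
    "new york new york", "big apple", "ny ny", "city of new york",
    "state of new york", "ny state",
]

print(autophrase("this document is about new york state".split(), PHRASES))
print(autophrase("i heart the big apple".split(), PHRASES))
```

Once “new york state” is the single token `new_york_state`, single-token synonym rules like the ones in synonyms.txt apply without partial matches on “new” or “york”.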

Page 15: Webinar: Simpler Semantic Search with Solr

Multi-term Synonym Demo

This document is about new york state.

This document is about new york city.

There is a lot going on in NYC.

I heart the big apple.

The empire state is a great state.

New York, New York is a hellova town.

I am a native of the great state of New York.

[Demo screenshot: queries “New York”, “New York City” and “New York State”, run against the /select and /autophrase handlers]

Page 16: Webinar: Simpler Semantic Search with Solr

Multi-term Synonym Demo

This document is about new york state.

This document is about new york city.

There is a lot going on in NYC.

I heart the big apple.

The empire state is a great state.

New York, New York is a hellova town.

I am a native of the great state of New York.

[Demo screenshot: query “Empire State”, run against the /select and /autophrase handlers]

Page 17: Webinar: Simpler Semantic Search with Solr

Query Autofiltering

Content Tagging and Intelligent Query Filtering. Using the search index itself as the knowledge source:

[Diagram: Content -> Content Tagging -> Search Index; Query -> Auto Filtering -> The Answer]

Page 18: Webinar: Simpler Semantic Search with Solr

Lucene FieldCache “In Action”

Standard “Inverted Index” (Lucene itself):

• Show all documents that have this term value in this field

• Used to get initial set of search result IDs

Uninverted or Forward Index (FieldCache):

• Show all term values that have been indexed in this field

• Can lookup term value for a doc ID

• Used to facet and get display values for doc IDs.
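The difference between the two lookups can be shown with ordinary dictionaries. This is a conceptual sketch, not Lucene's data structures; the documents reuse field names from the demo later in the talk.

```python
from collections import defaultdict

docs = {
    95: {"brand": "J Crew", "color": "grey"},
    154: {"brand": "Joe Boxer", "color": "white"},
    17: {"brand": "Fruit of the Loom", "color": "white"},
}

# Inverted index: (field, term value) -> document IDs that contain it.
inverted = defaultdict(set)
# Forward ("uninverted") index: field -> {document ID -> term value}.
forward = defaultdict(dict)

for doc_id, fields in docs.items():
    for field, value in fields.items():
        inverted[(field, value)].add(doc_id)
        forward[field][doc_id] = value

white_ids = inverted[("color", "white")]     # which docs have color = white?
brand_154 = forward["brand"][154]            # display value for a doc ID
all_colors = set(forward["color"].values())  # every value indexed in a field
print(white_ids, brand_154, all_colors)
```

The last lookup, listing every value indexed in a field, is the one that query autofiltering relies on.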

Page 19: Webinar: Simpler Semantic Search with Solr

Query Autofiltering Implementation

Use Lucene FieldCache to build a map of field values to field names (of string fields)

Add synonym mappings from synonyms.txt and stemming to this value(s) -> field(s) map

Use this map to discover noun phrases in the query that correspond to field values in the index – longest contiguous phrase wins

Build filter or boost queries based on these discovered mappings
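The steps above can be sketched in Python. This is a minimal illustration, assuming exact lower-cased matching, a plain dictionary in place of the FieldCache, and no stemming; `build_value_map` and `autofilter` are hypothetical names, not the component's API.

```python
def build_value_map(docs, synonyms=None):
    """Map each lower-cased field value (and its synonyms) to (field, stored value)."""
    value_map = {}
    for doc in docs:
        for field, value in doc.items():
            value_map[value.lower()] = (field, value)
    # Synonyms point at the same (field, value) entry as their target.
    for alias, target in (synonyms or {}).items():
        if target.lower() in value_map:
            value_map[alias.lower()] = value_map[target.lower()]
    return value_map


def autofilter(query, value_map):
    """Return (filter clauses, leftover terms); longest contiguous phrase wins."""
    tokens = query.lower().split()
    filters, leftover, i = [], [], 0
    while i < len(tokens):
        for n in range(len(tokens) - i, 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in value_map:
                field, value = value_map[phrase]
                filters.append(f'{field}:"{value}"')
                i += n
                break
        else:
            leftover.append(tokens[i])  # no field value starts here
            i += 1
    return filters, leftover


docs = [
    {"product_type": "socks", "color": "red", "brand": "Red Lion"},
    {"product_type": "dress shirt", "color": "white", "brand": "J Crew"},
]
value_map = build_value_map(docs, synonyms={"scarlet": "red"})
print(autofilter("Red Lion socks", value_map))
print(autofilter("scarlet socks", value_map))
```

Because the three-word and two-word candidates are tried before single words, “Red Lion” is recognized as a brand rather than as the color red plus an unknown term.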

Page 20: Webinar: Simpler Semantic Search with Solr

QueryAutoFilteringComponent

Solr SearchComponent

github: https://github.com/LucidWorks/query-autofiltering-component

JIRA: SOLR-7539

<requestHandler name="/autofilter" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
  </lst>
  <arr name="first-components">
    <str>queryAutofiltering</str>
  </arr>
</requestHandler>

<searchComponent name="queryAutofiltering" class="org.apache.solr.handler.component.QueryAutoFilteringComponent" />
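With a handler registered as above, a client only needs to send its query to the /autofilter path instead of /select. The sketch below just builds such a URL; the host, port and core name (`localhost:8983`, `collection1`) are assumptions for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode


def autofilter_url(query, base="http://localhost:8983/solr/collection1"):
    """Build a request URL for the /autofilter handler.

    `base` is a placeholder; substitute the address of your own Solr core.
    """
    params = {"q": query, "wt": "json"}
    return f"{base}/autofilter?{urlencode(params)}"


url = autofilter_url("red socks")
print(url)
```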

Page 21: Webinar: Simpler Semantic Search with Solr

Query Autofiltering Demo

Hypothetical eCommerce app for a fictional department store

• Metadata has Noun Phrases!

<doc>
  <field name="id">95</field>
  <field name="product_type">sweat shirt</field>
  <field name="product_category">shirt</field>
  <field name="style">V neck</field>
  <field name="style">short sleeve</field>
  <field name="brand">J Crew</field>
  <field name="color">grey</field>
  <field name="material">cotton</field>
  <field name="consumer_type">womens</field>
</doc>
<doc>
  <field name="id">154</field>
  <field name="product_type">crew socks</field>
  <field name="product_category">socks</field>
  <field name="color">white</field>
  <field name="brand">Joe Boxer</field>
  <field name="consumer_type">mens</field>
</doc>
<doc>
  <field name="id">17</field>
  <field name="product_type">boxer shorts</field>
  <field name="product_category">underwear</field>
  <field name="color">white</field>
  <field name="brand">Fruit of the Loom</field>
  <field name="consumer_type">mens</field>
</doc>

Page 22: Webinar: Simpler Semantic Search with Solr

Query Autofiltering – Basic Behavior

q = red socks -> fq=color:red&fq=product_type:socks

or bq=(color:red AND product_type:socks)^20

q = Red Lion socks -> fq=brand:"Red Lion"&fq=product_type:socks

q = scarlet Chaise Lounge -> color:red AND product_type:"Lounge Chair"

q = white dress shirts -> color:white AND product_type:"dress shirt"
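The two output modes shown above, hard filtering with fq or soft boosting with bq, can be sketched as one small function. It takes the already discovered (field, value) pairs as input; the function name and the `mode` argument are hypothetical, and bq assumes a query parser that supports boost queries (such as dismax or edismax).

```python
def to_solr_params(query, matches, mode="filter", boost=20):
    """Turn discovered (field, value) pairs into Solr request parameters.

    mode="filter": one fq per match (non-matching documents are excluded)
    mode="boost":  a single bq clause (matching documents rank higher)
    """
    # Quote multi-word values so they are treated as phrases.
    clauses = [
        f'{field}:"{value}"' if " " in value else f"{field}:{value}"
        for field, value in matches
    ]
    params = [("q", query)]
    if mode == "filter":
        params += [("fq", clause) for clause in clauses]
    else:
        params.append(("bq", f"({' AND '.join(clauses)})^{boost}"))
    return params


matches = [("color", "red"), ("product_type", "socks")]
print(to_solr_params("red socks", matches, mode="filter"))
print(to_solr_params("red socks", matches, mode="boost"))
```

Filtering gives the highest precision but can return zero results if the inference is wrong; boosting keeps recall and only changes the ordering.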

Page 23: Webinar: Simpler Semantic Search with Solr

Dealing With “Unstructured” Text

The term itself is evidence that we think of language as unstructured when we know that it is not. It HAS to have structure or we couldn’t communicate very well.

“The Lady Is A Tramp” vs “Lady And The Tramp”

Dealing with unstructured text means better handling of phrases.

Little words, like “if”, can have big meaning!

Page 24: Webinar: Simpler Semantic Search with Solr

Classification Technologies

Machine Learning

• Automated vs Semi-Automated

Natural Language Processing (NLP)

• Parts Of Speech

Taxonomy / Ontology

• Relationships

• Handles Phrases naturally

• Knows what is what and what is related to what!

Page 25: Webinar: Simpler Semantic Search with Solr

Ontologies Designed for Search

Category Nodes – ‘parent’ nodes that can have child nodes, including:

• Sub Categories

• Evidence Nodes

Evidence Nodes – tend to be leaf nodes (with no children) and contain key terms (synonyms)

• May contain “rules”, e.g. (if contains term a and term b but not term c)

• Evidence Nodes can have more than one category node parent

Hits on Evidence Nodes add to the cumulative score of a Category Node.

Scores can be diluted as they traverse the graph – so that the nearest category gets the strongest ‘vote’.
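The voting scheme can be sketched as a small graph walk. This is an illustration under stated assumptions: single-word key terms, an acyclic graph, and a dilution factor of 0.5 per hop. The node names and key terms are invented for the example.

```python
def score_categories(text, evidence, parents, dilution=0.5):
    """Accumulate category scores from evidence-node hits.

    evidence: evidence node -> list of key terms
    parents:  node -> list of parent category nodes
    Each hop up the graph multiplies the vote by `dilution`, so the nearest
    category gets the strongest vote.
    """
    tokens = set(text.lower().split())
    scores = {}
    for node, terms in evidence.items():
        hits = sum(1 for term in terms if term in tokens)
        if not hits:
            continue
        frontier = [(parent, float(hits)) for parent in parents.get(node, [])]
        while frontier:
            category, vote = frontier.pop()
            scores[category] = scores.get(category, 0.0) + vote
            frontier += [(gp, vote * dilution) for gp in parents.get(category, [])]
    return scores


evidence = {
    "pharma_terms": ["drug", "fda", "clinical"],
    "bank_terms": ["loan", "deposit"],
}
parents = {
    "pharma_terms": ["Pharmaceuticals"],
    "bank_terms": ["Commercial Banks"],
    "Pharmaceuticals": ["Health Care"],
    "Commercial Banks": ["Financial Services"],
}
scores = score_categories(
    "the fda approved the drug after a clinical trial", evidence, parents
)
print(scores)
```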

Page 26: Webinar: Simpler Semantic Search with Solr

Fortune 100 Companies

Energy • Financial Services

• Investment Banks

• Commercial Banks

Health Care • Health Insurance

• HMO

• Medical Devices

• Pharmaceuticals

Hospitality

Manufacturing • Aircraft

• Automobiles

• Electrical Equipment

Corporations • US

• British

• Chinese

• French

• German

• Japanese

• Russian

• +

Page 28: Webinar: Simpler Semantic Search with Solr

The Basic Search “Use Case”

Traditional – Brief display: snippeting, hyperlinks and paging

• Faceted Navigation

• Highlighting

• Need To RETHINK for Mobile!!!

Query Formulation

–> Result Inspection

–> Query Refinement

Page 29: Webinar: Simpler Semantic Search with Solr

Shortening The Loop

Query Suggestion (aka autocomplete, typeahead)

• “Predictive” search

• Single field restriction

Recommendation

• Query – result – click – store – aggregate

• Boosting results or Suggesting queries

Best Bets (Query Elevation) – i.e. Punting

• Spotlighting

• Making it dynamic

Faceting

• Takes advantage of classification tagging

• Can be used to generate multi-field phrases for suggestion

Inferential Search

• “I’m Feeling Lucky”

• Query Autofiltering

Page 30: Webinar: Simpler Semantic Search with Solr

Enhanced Search: Pipelines

Document and Query Pre-Processing

Internal to Solr:

• Update Request Processor

• Data Import Handler (DIH)

• Search Component Chain

Big Data = Big Problem or just a Big Opportunity:

• Hadoop – Solr

• Spark – Solr

• Morphlines

External to Solr:

• Custom ETL + SolrJ Integration

• Apache UIMA *

• DIH Client (SOLR-7188)

• Lucidworks Fusion

• Modular Informatic Designs framework (coming soon to Open Source?)

Page 31: Webinar: Simpler Semantic Search with Solr

Index Pipelines – Good Ole ETL + ______

Annotations!

Subject - Verb - Object

Entity Extractors – Identify Subject and Object (noun phrases)

Annotations – mark locations of entities in document

Discover Facts from Semantic Patterns

• $Person joined $Company

• $Drug is used to treat $Disease

• $Company acquired $Company

• $Person wrote $Song
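A pattern such as “$Person joined $Company” can be matched once entity annotations exist. The sketch below assumes annotations are simple (start, end, type) character spans of the kind an entity extractor might emit; the sentence, the names in it and the helper functions are invented for illustration.

```python
def span(text, phrase, entity_type):
    """Build an annotation (start, end, type) for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), entity_type)


def find_facts(text, annotations, subject_type, verb, object_type):
    """Return (subject, verb, object) triples where an entity of subject_type
    is followed by `verb` and then by an entity of object_type."""
    facts = []
    for s_start, s_end, s_type in annotations:
        if s_type != subject_type:
            continue
        for o_start, o_end, o_type in annotations:
            # The text between the two entities must be exactly the verb.
            if o_type == object_type and text[s_end:o_start].strip() == verb:
                facts.append((text[s_start:s_end], verb, text[o_start:o_end]))
    return facts


text = "Jane Doe joined Acme Corp last year."
annotations = [span(text, "Jane Doe", "Person"), span(text, "Acme Corp", "Company")]
facts = find_facts(text, annotations, "Person", "joined", "Company")
print(facts)
```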

Watson used IBM’s (now Apache’s) UIMA (+40,000 PCs)

Jeopardy is a “guess the subject, given the object and verb, posed as a question” game

Page 32: Webinar: Simpler Semantic Search with Solr

Who Needs Query Pipelines?

Who, What, Where, When:

• Security Filtering – Entitlements

• Dynamic Boost Block based on Preferences, Search History

• Geo Filtering – IP to geolocation

• Content Spotlighting based on time, place and search history

• Query Introspection – Infer User Intent
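A query pipeline of this kind can be sketched as a chain of functions that each enrich the request parameters before they reach the search engine. This is a generic illustration, not the API of any particular product; the field names (`acl`, `brand`), the user record and the stage names are all assumptions.

```python
def security_stage(params, user):
    """Restrict results to documents the user is entitled to see."""
    groups = " OR ".join(user["groups"])
    params.setdefault("fq", []).append(f"acl:({groups})")
    return params


def preference_boost_stage(params, user):
    """Boost documents that match the user's stored preferences."""
    for brand in user.get("preferred_brands", []):
        params.setdefault("bq", []).append(f'brand:"{brand}"^5')
    return params


def run_pipeline(params, user, stages):
    """Pass the request through each stage in order."""
    for stage in stages:
        params = stage(params, user)
    return params


user = {"groups": ["staff", "public"], "preferred_brands": ["J Crew"]}
params = run_pipeline({"q": "socks"}, user, [security_stage, preference_boost_stage])
print(params)
```

Geo filtering, spotlighting and query introspection would fit the same shape: each is one more stage that reads the request and its context and adds or rewrites parameters.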

Page 33: Webinar: Simpler Semantic Search with Solr

Lucidworks Fusion: Pipelines Proliferate

Documents and Queries are dynamic Metadata Objects

• PipelineDocument and QueryRequestAndResponse, respectively

Lots of Stages – more coming with every release

• Metadata -> metadata – lookup, clone, map, join

• Content -> metadata – extract, transform, classify

Index Pipelines: One-Way

Query Pipelines: Round-Trip

• Both pre- and post-Query filtering opportunities

[Diagram: Connector or Query -> Stage -> Stage -> Stage -> Stage -> Solr Cloud]

Page 34: Webinar: Simpler Semantic Search with Solr

Thank you!

lucidworks.com

Ted Sullivan

Senior Solutions Architect