De Bitmanager, 2016

You Know, for Search

Peter van der Weerd

Who am I?

• Peter van der Weerd

• Search specialist

• Self-employed at De Bitmanager

• Enormous span of control

Search

• Common sense:

Easy

Solved

Yeah, true…

• Install ES

• Fill it with some data

• And \o/: we can search

But…

• Are the users satisfied?

• Many people struggle with sub-optimal search results.

Search as a toolbox

• It consists of 1 or more(!) tools to find what you need

Searchbox

Faceting (intersecting)

Sorting

More like this

Not more like this (this is not what I mean)

Etc…

Search at Booking

• Destination based (city, region, airport, etc)

• Autocomplete: results in max 5 destinations, one query per keystroke

• Disambiguation: show a partitioned result that enables people to choose a destination

Autocomplete in action

Disambiguation in action

Scoring

• In general, Lucene scores like: tf * idf

• Tf = term frequency: the more often a term matches, the more important

• Idf = inverse document frequency: the more documents match the term, the less important

Term frequency

• Used to give more importance to terms that occur relatively often.

• Scoring examples for ‘house’ (from highest to lowest score):

House

The house

The little house on the prairie

The little house on the prairie blah blah blah
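A minimal sketch of why the shorter titles win, assuming a square-root term frequency combined with a 1/sqrt(length) field-length norm. This is modeled on Lucene's classic TF-IDF similarity; the function names are made up for this example and real Lucene scores will differ in detail.

```python
import math

def classic_tf(freq: int) -> float:
    """Term-frequency factor: square root of the raw count."""
    return math.sqrt(freq)

def length_norm(num_tokens: int) -> float:
    """Shorter fields get a higher norm, so one hit in a short title counts more."""
    return 1.0 / math.sqrt(num_tokens)

def relative_tf_score(title: str, term: str) -> float:
    """Score a title for a single term using tf * length norm (idf left out)."""
    tokens = title.lower().split()
    freq = tokens.count(term)
    if freq == 0:
        return 0.0
    return classic_tf(freq) * length_norm(len(tokens))

TITLES = [
    "House",
    "The house",
    "The little house on the prairie",
    "The little house on the prairie blah blah blah",
]

scores = {title: relative_tf_score(title, "house") for title in TITLES}
for title in sorted(TITLES, key=lambda t: scores[t], reverse=True):
    print(f"{scores[title]:.3f}  {title}")
```

Every title contains ‘house’ exactly once, so the ranking is driven by title length only.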

Inverse document frequency

• Prefers less frequent tokens.

• Useless for single-token queries: it is only used to score multiple tokens relative to each other

• Examples (from highest to lowest score):

house

little

on

the
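A small sketch of the idf ordering, assuming the formula 1 + ln(numDocs / (docFreq + 1)) as used by Lucene's classic TF-IDF similarity (the exact form varies between Lucene versions). The document counts are invented for illustration only.

```python
import math

def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: rare terms get a high value, common terms a low one."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_DOCS = 1000  # hypothetical index size

# Hypothetical number of documents containing each token.
DOC_FREQ = {"house": 20, "little": 50, "on": 400, "the": 900}

idf = {token: classic_idf(df, NUM_DOCS) for token, df in DOC_FREQ.items()}
for token, value in sorted(idf.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{value:.2f}  {token}")
```

With a single query token every matching document gets the same idf factor, so the ranking does not change.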

Drawback of idf

• Another example (from highest to lowest score):

Pekela

Haarlem

Amsterdam

Paris

• Booking switched off idf, but could have used df instead…

When does idf work

• Idf typically works for large, text-like queries.

• The documents *must* be evenly distributed over shards (or use dfs_query_then_fetch)
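A sketch of what goes wrong with an uneven distribution, assuming each shard computes idf from its own local statistics. The shard sizes and counts are invented; the idf formula is the classic Lucene-style one and is an assumption here. With dfs_query_then_fetch, Elasticsearch first collects term statistics from all shards, which corresponds to the global value below.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style idf (assumed form)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: (documents on the shard, documents containing 'amsterdam').
SHARDS = {"shard_0": (1000, 200), "shard_1": (1000, 5)}

# Default behaviour: every shard scores with its own local statistics.
local_idf = {name: idf(df, n) for name, (n, df) in SHARDS.items()}

# Global statistics: sum the counts over all shards before computing idf.
total_docs = sum(n for n, _ in SHARDS.values())
total_df = sum(df for _, df in SHARDS.values())
global_idf = idf(total_df, total_docs)

print(local_idf, global_idf)
```

The same term gets a much higher idf on shard_1 than on shard_0, so hits from shard_1 would outrank equally good hits from shard_0.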

Is tf * idf enough?

• Well, no…

• What to deliver on a query for ‘Paris’?

The city (ehm, there are several cities called Paris)

Airports?

Hotels? Which one? There are 1000’s of them.

• Even worse: what to deliver for query ‘p’ or ‘pa’?

Record boost

• Based on

Popularity

From where booked

Language

o Same (doc language == site language)

o Local translations

o English

o Mismatch

+ or x?

• Boosts are implemented by adding

• Intuitive justification:

Language could be seen as yet another (implicit!) search term

Same for popularity: people are typically not searching for unpopular things

• Example (from an English site): amsterdam -> amsterdam english popular

But wait…

• How big should the record-boost be?

0..1? 100?

• Lucene scores might vary heavily, sometimes by more than 10x

• So let’s take 10 as max record-boost

But now the record-boost might outweigh smaller scores

• Argggggg….

Score ranges

• Difficult to tinker with:

For instance, use a stemmed token with boost 0.5: house^1.0 vs houses^0.5

What if the Lucene score is more than 2 times higher than the stem itself?

• We are doing entity search vs text search

Different scorers

Querying for ‘house’:

Title                              Score: default   Score: BM25   Score: custom
House                              1.22             0.77          1.20
The house                          0.76             0.61          1.10
The little house on the prairie    0.46             0.39          1.05

Normalizing scores

• Goal: each term is scored around 1.0

Base score 1.0

Tf is normalized between 0 .. 0.2 and added to the base score

Idf is normalized between 0 .. 0.2 and added to the base score

Giving a score varying between 1 and 1.4 per term (sometimes we don’t use idf)
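A minimal sketch of such a normalized per-term score. The talk does not say how tf and idf are scaled, so this assumes both arrive as fractions between 0 and 1; the function and parameter names are hypothetical.

```python
def clamp01(value: float) -> float:
    """Keep a value inside the 0..1 range."""
    return max(0.0, min(1.0, value))

def normalized_term_score(tf_fraction: float, idf_fraction: float, use_idf: bool = True) -> float:
    """Base score 1.0, plus at most 0.2 for tf and at most 0.2 for idf.

    tf_fraction and idf_fraction are assumed to be pre-scaled to 0..1.
    """
    score = 1.0 + 0.2 * clamp01(tf_fraction)
    if use_idf:
        score += 0.2 * clamp01(idf_fraction)
    return score

print(normalized_term_score(0.0, 0.0))         # lowest possible score for a matching term
print(normalized_term_score(1.0, 1.0))         # highest possible score
print(normalized_term_score(1.0, 1.0, False))  # idf switched off
```

Because every term stays in a narrow band around 1.0, a record boost of a few tenths has a predictable effect.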

Language boosting

• Same language or English: +0.7

• Local language: +0.3 (Roma vs Rome on an English site)

• Mismatched language: -0.3
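The boosts above could be applied additively along these lines. The language codes, the category names and the function names are assumptions for this sketch; only the three boost values come from the slide.

```python
# Boost values from the slide; the category names are made up for this sketch.
LANGUAGE_BOOST = {"same": 0.7, "english": 0.7, "local": 0.3, "mismatch": -0.3}

def language_category(doc_lang: str, site_lang: str, local_lang: str) -> str:
    """Classify a name by its language relative to the site and the destination."""
    if doc_lang == site_lang:
        return "same"
    if doc_lang == "en":
        return "english"
    if doc_lang == local_lang:
        return "local"
    return "mismatch"

def apply_language_boost(term_score: float, doc_lang: str, site_lang: str, local_lang: str) -> float:
    """Add the language boost to an already normalized term score."""
    return term_score + LANGUAGE_BOOST[language_category(doc_lang, site_lang, local_lang)]

# English site, destination in Italy: 'Rome' (en), 'Roma' (it), 'Rom' (de).
for name, lang in [("Rome", "en"), ("Roma", "it"), ("Rom", "de")]:
    print(name, round(apply_language_boost(1.0, lang, "en", "it"), 2))
```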

About N-grams

• For auto-complete: left-edge N-Grams

• Rome: rome, rom, ro, r
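A minimal sketch of left-edge N-gram generation in plain Python. In practice the analyzer of the search engine does this at index time; the function name and the min_len parameter are made up for this example.

```python
def edge_ngrams(token: str, min_len: int = 1) -> list[str]:
    """Return all prefixes of the lowercased token, longest first."""
    token = token.lower()
    return [token[:n] for n in range(len(token), min_len - 1, -1)]

print(edge_ngrams("Rome"))
```

Indexing every prefix means that a query for ‘ro’ becomes a cheap exact-term lookup instead of a prefix scan.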

About N-grams

• When a user types ‘ro’, the matches are (from highest to lowest score):

Rome

Ródos

Rotterdam

Etc

• Score depends on percentage of match (or Levenshtein distance)
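A sketch of scoring by percentage of match, assuming the score is simply the typed length divided by the length of the full name, with accents folded so that ‘ro’ matches ‘Ródos’. Both helper names are hypothetical, and the Levenshtein variant is not shown.

```python
import unicodedata

def fold(text: str) -> str:
    """Lowercase and strip accents, so 'Ródos' becomes 'rodos'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def match_fraction(typed: str, name: str) -> float:
    """Fraction of the name covered by the typed prefix, 0.0 if it is no prefix."""
    typed, name = fold(typed), fold(name)
    if not typed or not name.startswith(typed):
        return 0.0
    return len(typed) / len(name)

names = ["Rotterdam", "Ródos", "Rome"]
for name in sorted(names, key=lambda n: match_fraction("ro", n), reverse=True):
    print(f"{match_fraction('ro', name):.2f}  {name}")
```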

Original approach

• Multiple fields (name, city, region, etc)

• Combining them by a weighted dismax query

Dismax query

• More subtle way of combining scores.

• Score = max + (sum - max) * tieBreaker

In words: the max plus a percentage of the others

• Edge cases:

Tiebreaker=0: score is the max score

Tiebreaker=1: score is the sum of all the individual scores (same behavior as boolean or)

Dismax example

• Q = the house. Suppose S[the] = 0.8, S[house] = 1.2

• Scores for different tiebreakers:

Bool score (tiebreaker=1): 2.0

Max score (tiebreaker=0): 1.2

Score with tiebreaker=0.1: 1.28. This makes documents containing ‘the house’ a little bit more important than ‘house’ only.
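The dismax formula, score = max + (sum - max) * tieBreaker, written out as a small Python function and applied to the numbers of this example. The function name is made up for this sketch.

```python
def dismax(scores: list[float], tie_breaker: float) -> float:
    """Best sub-score plus a fraction (tie_breaker) of all the other sub-scores."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

term_scores = [0.8, 1.2]  # S[the], S[house]

for tie_breaker in (1.0, 0.0, 0.1):
    print(tie_breaker, round(dismax(term_scores, tie_breaker), 2))
```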

Difficulties

• Lack of context

• Hard to create a reliable scoring model

Different approach

• Canonical name: Hotel V Frederiksplein, Amsterdam, Noord-Holland, Netherlands

• Self name (indexed)

Hotel V Frederiksplein

• Rest (indexed)

Amsterdam, Noord-Holland, Netherlands

Weighting fields

• All fields are equal but some fields are more equal than others…

Self name is most important

Other names (like the city where a hotel resides) are less important

• Dismax over self name and other

Payload

• Small piece of information that is added to every occurrence

• Basically a byte[]

Nowadays: payloads

• We need more information per occurrence of a token:

Length of the original token

Self-name or other location info

Type of the name (hotel, city, landmark, etc)

• All the above info is encoded in a 32 bit integer, and indexed as a payload
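One way such a 32-bit payload could be packed. The bit layout, the list of name types and the function names are purely hypothetical; the talk only states that the three pieces of information end up in one 32-bit integer.

```python
import struct

# Hypothetical bit layout (not from the talk):
#   bits 0-7  : length of the original token (0..255)
#   bit  8    : 1 if the token comes from the self name, 0 for other location info
#   bits 9-12 : type of the name, as an index into NAME_TYPES
NAME_TYPES = ["hotel", "city", "landmark", "region", "airport"]

def encode_payload(token_length: int, is_self_name: bool, name_type: str) -> bytes:
    """Pack the three fields into a big-endian 32-bit integer (4 bytes)."""
    if not 0 <= token_length <= 0xFF:
        raise ValueError("token_length must fit in 8 bits")
    type_id = NAME_TYPES.index(name_type)
    value = token_length | (int(is_self_name) << 8) | (type_id << 9)
    return struct.pack(">I", value)

def decode_payload(payload: bytes) -> tuple[int, bool, str]:
    """Inverse of encode_payload."""
    (value,) = struct.unpack(">I", payload)
    return value & 0xFF, bool((value >> 8) & 1), NAME_TYPES[(value >> 9) & 0xF]

payload = encode_payload(len("frederiksplein"), True, "hotel")
print(payload.hex(), decode_payload(payload))
```

A custom scorer can decode these 4 bytes per matching occurrence and decide how much that occurrence should count.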

Dismax vs payload

• With field info in the payload we can simulate dismax behavior

• We query only 1 index-field (instead of 5)

• Context: easier to do advanced scoring: all info is in 1 scorer.

• Payloads *are* possible in Elasticsearch, but more difficult to use

Search

• Difficult

• Sensitive equilibrium

• Impossible to serve them all

Suits

• Reasons for people to wear a suit might include:

Hiding the fact that you cannot trust them

Hiding their incompetence

etc

Combining fields

• To prevent double counting, a dismax is advised.

• The fact that a term occurs in both the title and the abstract doesn’t make it twice as important.

But it does make it somewhat more important

Combining fields

• Intuitive reaction: query terms in each other’s neighborhood are more important…

• Example: search for a book: chamber secrets rowling

• Expected top result: Harry Potter and the Chamber of Secrets / J.K. Rowling

Combining fields

"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"

"_score": 1.2030121,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."

• More important if in the same field?

Combining fields

• But: we get an excerpt book that contains the requested terms

(all terms were present in the abstract field)

• Phrases behave even worse

Combining fields

• Suppose:

we have 2 fields: F1 and F2

2 query terms: qt1 and qt2

• Now we have choices how to combine…

Combining fields

• (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)

this will prefer records where both terms are found in the same field

• (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)

this behaves more as if there were no fields
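A sketch comparing the two combinations on made-up scores, assuming ‘|’ is a boolean OR that sums its sub-scores and every matching term scores 1.0. The function names and the tiebreaker of 0.1 are choices for this example only.

```python
def dismax(scores: list[float], tie_breaker: float) -> float:
    """Best sub-score plus a fraction (tie_breaker) of the others."""
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

def fields_first(f1: list[float], f2: list[float], tie: float) -> float:
    """(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)"""
    return dismax([sum(f1), sum(f2)], tie)

def terms_first(f1: list[float], f2: list[float], tie: float) -> float:
    """(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)"""
    return sum(dismax([a, b], tie) for a, b in zip(f1, f2))

# Per-field term scores as [qt1, qt2].
same_field = {"f1": [1.0, 1.0], "f2": [0.0, 0.0]}  # both terms in F1
split = {"f1": [1.0, 0.0], "f2": [0.0, 1.0]}       # qt1 in F1, qt2 in F2

for name, doc in (("same field", same_field), ("split", split)):
    print(name, fields_first(doc["f1"], doc["f2"], 0.1), terms_first(doc["f1"], doc["f2"], 0.1))
```

The first form scores the same-field document clearly higher than the split one; the second form scores both documents the same.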

Combining fields

(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)




Combining fields

(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)




Combining fields

• Of course: way more possibilities.

See the multi-match query for examples

Most but not all possibilities can be done by hand (blending)

Combining fields

• Different strategy:

Combine all fields as if they were one field

Do some re-scoring afterwards

Example:

o Search ‘rowling’ anywhere, score 1

o Search ‘potter’ anywhere, score 1

o Combine with additional queries to do a finishing touch

Explain

• Always use explain (in debug mode)

• Did I already tell you to always use explain?

• Create a new application by first making explain part of your infrastructure

• At least expose the scores in debug mode.
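One way to make explain part of the infrastructure is to let a single debug switch add it to every search request. The sketch below only builds the request body; the helper name and the field name "name" are hypothetical, while the top-level "explain" flag is the standard Elasticsearch search option.

```python
import json

def build_search_body(query_text: str, field: str, debug: bool = False) -> dict:
    """Build an Elasticsearch search body; in debug mode ask for score explanations."""
    body = {"query": {"match": {field: query_text}}}
    if debug:
        # Each hit then carries an _explanation tree next to its _score.
        body["explain"] = True
    return body

print(json.dumps(build_search_body("amsterdam", "name", debug=True), indent=2))
```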

Suits: beware the logic rules…

• Cannot be reversed:

• The fact that I am not wearing a suit does not imply that:

I am trustworthy

I am competent

You Know, for Bits…

Peter @ bitmanager.nl
