De Bitmanager, 2016
You Know, for Search
Peter van der Weerd
Who am I?
• Peter van der Weerd
• Search specialist
• Self-employed (De Bitmanager)
• Enormous span of control
Search
• Common sense:
Easy
Solved
Yeah, true…
• Install ES
• Fill it with some data
• And \o/: we can search
But…
• Are the users satisfied?
• Many people struggle with sub-optimal search results.
Search as a toolbox
• It consists of one or more (!) tools to find what you need
Searchbox
Faceting (intersecting)
Sorting
More like this
Not more like this (this is not what I mean)
Etc…
Search at Booking
• Destination based (city, region, airport, etc)
• Autocomplete: results in at most 5 destinations, one query per keystroke
• Disambiguation: show a partitioned result that enables people to choose a destination
Autocomplete in action
Disambiguation in action
Scoring
Scoring
• In general, Lucene scores like: tf * idf
• Tf = term frequency: the more often a term matches in a document, the more important
• Idf = inverse document frequency: the more documents match the term, the less important
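A minimal sketch of the tf * idf idea in Python. It assumes the shape of the classic Lucene similarity (square root of the term count, logarithmic idf) and leaves out length norms and query normalization, so the numbers are illustrative only. The index size and document frequencies are made up.

```python
import math

def tf(freq: int) -> float:
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)

def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: the more documents contain the term, the lower."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def tf_idf(freq: int, doc_freq: int, num_docs: int) -> float:
    """Simplified per-term score: tf * idf, nothing else."""
    return tf(freq) * idf(doc_freq, num_docs)

# Hypothetical index of 1000 documents.
rare = tf_idf(freq=1, doc_freq=9, num_docs=1000)      # term found in few documents
common = tf_idf(freq=1, doc_freq=999, num_docs=1000)  # term found in almost all documents
print(rare > common)  # True
```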
Term frequency
• Used to give more importance to terms that occur relatively often in a document.
• Scoring examples for ‘house’
House
The house
The little house on the prairie
The little house on the prairie blah blah blah
(ordered by score, highest first)
Inverse document frequency
• Prefers less frequent tokens.
• Useless on single-token queries: it is only used to score multiple tokens relative to each other
• Examples:
house
little
on
the
(ordered by score, highest first)
Drawback of idf
• Other example…
Pekela
Haarlem
Amsterdam
Paris
(ordered by score, highest first)
• Booking switched off idf, but could have used df instead…
When does idf work
• Idf typically works for large text-like queries.
• The documents *must* be evenly distributed over shards (or use dfs_query_then_fetch)
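Why the distribution matters can be sketched with two hypothetical shards. Each shard computes idf from its own statistics by default, so the same term gets a different weight depending on where the document lives. The search type dfs_query_then_fetch first gathers term statistics across all shards, which corresponds to the global value below. The idf formula and the shard numbers are assumptions for illustration.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Simplified logarithmic idf (assumed formula, see lead-in)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: (number of documents, documents containing the term).
shards = {"shard_a": (1000, 500), "shard_b": (1000, 5)}

# Default behavior: every shard scores with its local statistics.
local_idf = {name: idf(df, n) for name, (n, df) in shards.items()}

# dfs_query_then_fetch behavior: statistics are summed over all shards first.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_df, total_docs)

for name, value in local_idf.items():
    print(name, round(value, 3))
print("global", round(global_idf, 3))
```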
Is tf * idf enough?
• Well, no…
• What to deliver on a query for ‘Paris’?
The city (ehm, there are several cities named Paris)
Airports?
Hotels? Which one? There are thousands of them.
• Even worse: what to deliver for query ‘p’ or ‘pa’?
Record boost
• Based on
Popularity
From where booked
Language
o Same (doc language == site language)
o Local translations
o English
o Mismatch
+ or x?
• Boosts are implemented by adding
• Intuitive justification:
Language could be seen as yet another (implicit!) search term
Same for popularity: people are typically not searching for unpopular things
• Example (from an English site): amsterdam -> amsterdam english popular
But wait…
• How big should the record-boost be?
0..1? 100?
• Lucene scores might vary heavily, sometimes by more than 10x
• So let's take 10 as max record-boost
But now the record-boost might outweigh smaller scores
• Argggggg….
Score ranges
• Difficult to tinker with:
For instance, use a stemmed token with boost 0.5: house^1.0 vs houses^0.5
What if the Lucene score is more than 2 times higher than the stem itself?
• We are doing entity search vs text search
Different scorers
Querying for ‘house’:
Title                              Score: default   Score: BM25   Score: custom
House                              1.22             0.77          1.20
The house                          0.76             0.61          1.10
The little house on the prairie    0.46             0.39          1.05
Normalizing scores
• Goal: each term is scored around 1.0
Base score 1.0
Tf is normalized between 0 .. 0.2 and added to the base score
Idf is normalized between 0 .. 0.2 and added to the base score
Giving a score varying between 1 and 1.4 per term (sometimes we don’t use idf)
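A possible reading of this normalization as code. The slides do not say how tf and idf are squeezed into 0 .. 0.2, so the linear scaling with clamping and the max_tf / max_idf parameters are assumptions.

```python
def scale(value: float, max_value: float, cap: float = 0.2) -> float:
    """Map value from [0, max_value] linearly onto [0, cap], clamping outliers."""
    if max_value <= 0:
        return 0.0
    return cap * min(max(value / max_value, 0.0), 1.0)

def term_score(raw_tf: float, raw_idf: float,
               max_tf: float, max_idf: float, use_idf: bool = True) -> float:
    """Base score 1.0, plus up to 0.2 for tf and (optionally) up to 0.2 for idf."""
    score = 1.0 + scale(raw_tf, max_tf)
    if use_idf:
        score += scale(raw_idf, max_idf)
    return score

# Hypothetical raw values: the result always stays between 1.0 and 1.4.
print(round(term_score(raw_tf=1.0, raw_idf=3.0, max_tf=2.0, max_idf=6.0), 2))  # 1.2
```

With every term scored around 1.0, a record boost of a few tenths has a predictable effect, which is much harder to achieve with raw Lucene scores.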
Language boosting
• Same language or English: +0.7
• Local language: +0.3 (Roma vs Rome on an English site)
• Mismatched language: -0.3
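The additive language boost could look roughly like this. The boost values come from the slide. The function names, the language codes and the order in which the cases are checked are assumptions.

```python
# Boost values from the slide; the keys are hypothetical labels.
LANGUAGE_BOOST = {"same": 0.7, "english": 0.7, "local": 0.3, "mismatch": -0.3}

def classify_language(doc_lang: str, site_lang: str, local_lang: str) -> str:
    """Decide which boost applies to a name in doc_lang (assumed check order)."""
    if doc_lang == site_lang:
        return "same"
    if doc_lang == "en":
        return "english"
    if doc_lang == local_lang:
        return "local"
    return "mismatch"

def boosted_score(term_score: float, doc_lang: str,
                  site_lang: str, local_lang: str) -> float:
    """Boosts are added to the (normalized) term score, not multiplied."""
    return term_score + LANGUAGE_BOOST[classify_language(doc_lang, site_lang, local_lang)]

# English site, destination in Italy: 'Rome' (en) beats 'Roma' (it) beats 'Rom' (de).
for name, lang in [("Rome", "en"), ("Roma", "it"), ("Rom", "de")]:
    print(name, round(boosted_score(1.0, lang, site_lang="en", local_lang="it"), 2))
```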
About N-grams
• For auto-complete: left-edge N-Grams
• Rome: rome, rom, ro, r
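Generating left-edge N-grams is simple enough to sketch directly. The function name and the min_len parameter are illustrative, not taken from any library.

```python
def edge_ngrams(token: str, min_len: int = 1) -> list[str]:
    """Return all prefixes of the lowercased token, longest first."""
    token = token.lower()
    return [token[:n] for n in range(len(token), min_len - 1, -1)]

print(edge_ngrams("Rome"))  # ['rome', 'rom', 'ro', 'r']
```

Indexing every prefix means a query for ‘ro’ is an exact term match at search time, which is what makes a query per keystroke affordable.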
About N-grams
• When a user types ‘ro’…
Rome
Ródos
Rotterdam
Etc
(ordered by score, highest first)
• Score depends on percentage of match (or Levenshtein distance)
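A sketch of the percentage-of-match idea: the score is the fraction of the name that the user has already typed. Accent folding is added here as an assumption so that ‘ro’ can match Ródos. The function names are hypothetical.

```python
import unicodedata

def fold(text: str) -> str:
    """Lowercase and strip accents, so 'Ródos' becomes 'rodos'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def match_fraction(typed: str, name: str) -> float:
    """Fraction of the name covered by the typed prefix, 0.0 if it is no prefix."""
    typed, name = fold(typed), fold(name)
    if not name or not name.startswith(typed):
        return 0.0
    return len(typed) / len(name)

names = ["Rotterdam", "Ródos", "Rome"]
ranked = sorted(names, key=lambda n: match_fraction("ro", n), reverse=True)
print(ranked)  # ['Rome', 'Ródos', 'Rotterdam']
```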
Original approach
• Multiple fields (name, city, region, etc)
• Combining them by a weighted dismax query
Dismax query
• More subtle way of combining scores.
• Score = max + (sum - max) * tieBreaker
In words: the max plus a percentage of the others
• Edge cases:
Tiebreaker = 0: score is the max score
Tiebreaker = 1: score is the sum of all the individual scores (same behavior as boolean or)
Dismax example
• Q = the house. Suppose S[the] = 0.8, S[house] = 1.2
• Scores for different tiebreakers:
Bool score (tiebreaker=1): 2.0
Max score (tiebreaker=0): 1.2
Score with tiebreaker=0.1: 1.28. This makes documents containing ‘the house’ a little bit more important than ‘house’ only.
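The dismax formula and the example above, written out as a small Python function. The function name is illustrative. Results are rounded because of floating-point arithmetic.

```python
def dismax(scores: list[float], tie_breaker: float = 0.0) -> float:
    """Score = max + (sum - max) * tie_breaker."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

scores = [0.8, 1.2]  # S[the], S[house]
print(round(dismax(scores, tie_breaker=1.0), 2))  # 2.0, same as boolean or
print(round(dismax(scores, tie_breaker=0.0), 2))  # 1.2, the max score only
print(round(dismax(scores, tie_breaker=0.1), 2))  # 1.28
```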
Difficulties
• Lack of context
• Hard to create a reliable scoring model
Different approach
• Canonical name: Hotel V Frederiksplein, Amsterdam, Noord-Holland, Netherlands
• Self name (indexed)
Hotel V Frederiksplein
• Rest (indexed)
Amsterdam, Noord-Holland, Netherlands
Weighting fields
• All fields are equal but some fields are more equal than others…
Self name is most important
Other names (like the city where a hotel resides) are less important
• Dismax over self name and other
Payload
• Small piece of information that is added to every occurrence
• Basically a byte[]
Nowadays: payloads
• We need more information per occurrence of a token:
Length of the original token
Self-name or other location info
Type of the name (hotel, city, landmark, etc)
• All the above info is encoded in a 32 bit integer, and indexed as a payload
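How three pieces of information could be packed into one 32-bit integer and stored as a byte[] payload. The slides do not give the real bit layout, so the layout below (8 bits token length, 1 bit self-name flag, remaining bits name type) and the type codes are purely hypothetical.

```python
import struct

# Hypothetical type codes; the real encoding is not given in the slides.
NAME_TYPES = {"hotel": 0, "city": 1, "landmark": 2}

def encode_payload(token_length: int, is_self_name: bool, name_type: str) -> bytes:
    """Pack into 4 bytes: bits 0-7 length, bit 8 self-name flag, bits 9+ type."""
    if not 0 <= token_length <= 0xFF:
        raise ValueError("token length must fit in 8 bits")
    value = token_length | (int(is_self_name) << 8) | (NAME_TYPES[name_type] << 9)
    return struct.pack(">I", value)  # big-endian unsigned 32-bit integer

def decode_payload(payload: bytes) -> tuple[int, bool, str]:
    """Inverse of encode_payload."""
    (value,) = struct.unpack(">I", payload)
    types = {code: name for name, code in NAME_TYPES.items()}
    return value & 0xFF, bool((value >> 8) & 1), types[value >> 9]

payload = encode_payload(len("frederiksplein"), True, "hotel")
print(len(payload), decode_payload(payload))
```

A scorer that reads this payload knows, per matched token, how much of the original token was matched and whether the hit was in the self name, without querying separate fields.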
Dismax vs payload
• With field info in the payload we can simulate dismax behavior
• We query only 1 index-field (instead of 5)
• Context: easier to do advanced scoring: all info is in 1 scorer.
• Payloads *are* possible in Elasticsearch, but more difficult to use
Search
• Difficult
• Sensitive equilibrium
• Impossible to serve them all
Suits
Suits
• Reasons for people to wear a suit might include:
Hiding the fact that you cannot trust them
Hiding their incompetence
etc
Combining fields
• To prevent double counting, a dismax is advised.
• The fact that a term occurs in both the title and the abstract doesn’t make it roughly twice as important.
But it does make it somewhat more important
Combining fields
• Intuitive reaction: query terms in each other's neighborhood are more important…
• Example: search for a book: chamber secrets rowling
• Expected top result: Harry Potter and the Chamber of Secrets / J.K. Rowling
Combining fields
"_score": 2.0767038, "author": "De Bitmanager", "title": "Excerpt book", "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121, "author": "J.K. Rowling", "title": "Harry Potter and the Chamber of Secrets", "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."
• More important if in the same field?
Combining fields
• But: we get an excerpt book that contains the requested book
(all terms were present in the abstract field)
• Phrases behave even worse
Combining fields
• Suppose:
we have 2 fields: F1 and F2
2 query terms: qt1 and qt2
• Now we have choices how to combine…
Combining fields
• (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
this will prefer records where both terms are found in the same field
• (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
this behaves more as if there were no fields
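The two combinations can be compared with a small Python sketch, where | is a plain sum and dismax is max plus a fraction of the rest. The per-term scores for the two books are made up. Only the resulting order is the point.

```python
def dismax(scores, tie_breaker=0.1):
    """Score = max + (sum - max) * tie_breaker."""
    best = max(scores, default=0.0)
    return best + (sum(scores) - best) * tie_breaker

def field_centric(doc, terms, tie_breaker=0.1):
    """(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2): sum per field, dismax over fields."""
    per_field = [sum(field.get(t, 0.0) for t in terms) for field in doc.values()]
    return dismax(per_field, tie_breaker)

def term_centric(doc, terms, tie_breaker=0.1):
    """(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2): dismax per term, sum over terms."""
    return sum(dismax([field.get(t, 0.0) for field in doc.values()], tie_breaker)
               for t in terms)

terms = ["chamber", "secrets", "rowling"]
# Hypothetical per-field term scores: doc[field][term].
excerpt = {"title": {}, "author": {},
           "abstract": {"chamber": 0.75, "secrets": 0.75, "rowling": 0.75}}
book = {"title": {"chamber": 1.0, "secrets": 1.0},
        "author": {"rowling": 1.0}, "abstract": {}}

print(field_centric(excerpt, terms) > field_centric(book, terms))  # True: excerpt wins
print(term_centric(book, terms) > term_centric(excerpt, terms))    # True: real book wins
```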
Combining fields
(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
"_score": 2.0767038, "author": "De Bitmanager", "title": "Excerpt book", "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121, "author": "J.K. Rowling", "title": "Harry Potter and the Chamber of Secrets", "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."
Combining fields
(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
"_score": 2.1447253, "author": "J.K. Rowling", "title": "Harry Potter and the Chamber of Secrets", "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."
"_score": 2.0767038, "author": "De Bitmanager", "title": "Excerpt book", "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
Combining fields
• Of course: way more possibilities.
See the multi-match query for examples
Most but not all possibilities can be done by hand (blending)
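For reference, two multi-match request bodies built as Python dictionaries. The type values best_fields and cross_fields and the tie_breaker parameter are part of the Elasticsearch multi_match query. The helper function and the choice of fields (taken from the example documents) are only a sketch.

```python
import json

def multi_match(query_text, fields, match_type, tie_breaker=None):
    """Build a search body with a multi_match query (hypothetical helper)."""
    body = {"query": query_text, "fields": list(fields), "type": match_type}
    if tie_breaker is not None:
        body["tie_breaker"] = tie_breaker
    return {"query": {"multi_match": body}}

fields = ["title", "author", "abstract"]

# Field-centric: one query per field, combined with dismax.
field_centric_body = multi_match("chamber secrets rowling", fields,
                                 "best_fields", tie_breaker=0.1)

# Term-centric: treats the fields more like one big field.
term_centric_body = multi_match("chamber secrets rowling", fields, "cross_fields")

print(json.dumps(term_centric_body, indent=2))
```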
Combining fields
• Different strategy:
Combine all fields as if they were one field
Do some re-scoring afterwards
Example:
o Search ‘rowling’ anywhere, score 1
o Search ‘potter’ anywhere, score 1
o Combine with additional queries to do a finishing touch
Explain
• Always use explain (in debug mode)
• Did I already tell you to always use explain?
• Create a new application by first making explain part of your infrastructure
• At least expose the scores in debug mode.
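One way to make explain part of the infrastructure, sketched in Python. Elasticsearch accepts "explain": true in the search body and returns an _explanation tree (value, description, details) per hit. The debug switch, the field name title and the sample tree below are assumptions for illustration.

```python
def search_body(query_text: str, debug: bool = False) -> dict:
    """Build a search body; in debug mode ask Elasticsearch to explain the scores."""
    body = {"query": {"match": {"title": query_text}}}  # 'title' is a hypothetical field
    if debug:
        body["explain"] = True
    return body

def format_explanation(node: dict, depth: int = 0) -> list[str]:
    """Flatten an _explanation tree into indented 'value  description' lines."""
    lines = ["  " * depth + f"{node['value']:.4f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Hand-written sample in the shape of an _explanation, not real output.
sample = {
    "value": 1.2,
    "description": "sum of:",
    "details": [{"value": 1.2, "description": "weight(title:house)", "details": []}],
}
print("\n".join(format_explanation(sample)))
```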
Suits: beware the logic rules…
• Cannot be reversed:
• The fact that I am not wearing a suit does not imply that:
I am trustworthy
I am competent
You Know, for Bits…
Peter @ bitmanager.nl