The riddles of the Sphinx
Full-text engine anatomy atlas
Who are you?
• Sphinx – FOSS full-text search engine
Who are you?
• Sphinx – FOSS full-text search engine• Good at playing ball
Who are you?
• Sphinx – FOSS full-text search engine• Good at playing ball• Good at not playing ball
Who are you?
• Sphinx – FOSS full-text search engine• Good at playing ball• Good at not playing ball• Good at passing the ball to a team-mate
Who are you?
• Sphinx – FOSS full-text search engine• Good at playing ball• Good at not playing ball• Good at passing the ball to a team-mate• Good at many other “inferior” games
– “faceted” search, geosearch, snippet extraction, multi-queries, IO throttling, and 10-20 other interesting directives
What are you here for?
• What will not be covered?– No entry-level “what’s that Sphinx and
what’s in it for me” overview– No long quotes from the documentation– No C++ architecture details
What are you here for?
• What will not be covered?– No entry-level “what’s that Sphinx and
what’s in it for me” overview– No long quotes from the documentation– No C++ architecture details
• What will be?– How does it generally work inside– How things can be optimized– How things can be parallelized
Chapter 1. Engine insides
Total workflow
• Indexing first• Searching second
Total workflow
• Indexing first• Searching second
• There are data sources (what to fetch, where from)
• There are indexes– What data sources to index– How to process the incoming text– Where to put the results
How indexing works
• In two acts, with an intermission• Phase 1 – collecting documents
– Fetch the documents (loop over the sources)– Split the documents into words– Process the words (morphology, *fixes)– Replace the words with their wordid’s
(CRC32/64)– Emit a number of temp files
How indexing works
• Phase 2 – sorting hits– Hit (occurrence) is a (docid,wordid,wordpos) record– Input is a number of partially sorted (by wordid) hit
lists– The incoming lists are merge-sorted– Output is essentially a single fully sorted hit list
• Intermezzo– Collect and sort MVA values– Sort ordinals– Sort extern attributes
Dumb & dumber
• The index format is… simple• Several sorted lists
– Dictionary (the complete list of wordid’s)– Attributes (only if docinfo=extern)– Document lists (for each keyword)– Hit lists (for each keyword)
• Everything is laid out linearly, good for IO
How searching works
• For each local index– Build a list of candidates (documents that
satisfy the full-text query)– Filter (the analogy is WHERE)– Rank (compute the documents’ relevance
values)– Sort (the analogy is ORDER BY)– Group (the analogy is GROUP BY)
• Merge the results from all the local indexes
1. Searching cost
• Building the candidates list– 1 keyword = 1+ IO (document list)– Boolean operations on document lists– Cost is proportional (~) to the lists lengths– That is, to the sum of all the keyword
frequencies– In case of phrase/proximity/etc search,
there also will be operations on hit lists – approx. 2x IO/CPU
1. Searching cost
• Building the candidates list– 1 keyword = 1+ IO (document list)– Boolean operations on document lists– Cost is proportional (~) to the lists lengths– That is, to the sum of all the keyword
frequencies– In case of phrase/proximity/etc search, there also
are operations on hit lists – approx. 2x IO/CPU
• Bottom line – “The Who” are really bad
2. Filtering cost• docinfo=inline
– Attributes are inlined in the document lists– ALL the values are duplicated MANY times! – Immediately accessible after disk read
• docinfo=extern– Attributes are stored in a separate list (file)– Fully cached in RAM– Hashed by docid + binary search
• Simple loop over all filters• Cost ~ number of candidates and filters
3. Ranking cost
• Direct – depends on the ranker– To account for keyword positions –
• Helps the relevancy• But costs extra resources – double impact!
• Cost ~ number of results• Most expensive – phrase proximity + BM25• Most cheap – none (weight=1)
• Indirect – induced in the sorting
4. Sorting cost
• Cost ~ number of results• Also depends on the sorting criteria
(documents will be supplied in @id asc order)• Also depends on max_matches• The more the max, the worse the server feels• 1-10K is acceptable, 100K is way too much• 10-20 is not enough (makes little sense)
5. Grouping cost
• Grouping is internally a kind of sorting• Cost affected by the number of results,
too• Cost affected by max_matches, too• Additionally, max_matches setting
affects @count and @distinct precision
Chapter 2. Optimizing things
How to optimize queries
• Partitioning the data
• Choosing ranking vs. sorting mode• Filters vs. keywords• Filters vs. manual MTF• Multi queries
How to optimize queries
• Partitioning the data
• Choosing ranking vs. sorting mode• Filters vs. keywords• Filters vs. manual MTF• Multi queries
• Last line of defense – Three Big Buttons
1. Partitioning the data
• Swiss army knife, for different tasks• Bound by indexing time?
– Partition, re-index the recent changes only
• Bound by filtering?– Partition, search the needed indexes only
• Bound by CPU/HDD?– Partition, move out to different
cores/HDDs/boxes
1a. Partitioning vs. indexing
• Vital to keep the balance right• Under-partition – and indexing will be
slow• Over-partition – and searching will be
slow• 1-10 indexes – work reasonably well• Some users are fine with 50+ (30+24...)• Some users are fine with 2000+ (!!!)
1b. Partitioning vs. filtering
• Totally, 100% dependent on production query statistics– Analyze your very own production logs– Add comments if needed (3rd arg to Query())
• Justified only if the amount of processed data is going to decrease significantly– Move out last week’s documents – yes– Move out English-only documents – no (!)
1c. Partitioning vs. CPU/HDD
• Use a distributed index, explicitly map the chunks to physical devices
• Point searchd “at itself” –
index dist1{
type = distributedlocal = chunk01agent = localhost:3312:chunk02agent = localhost:3312:chunk03agent = localhost:3312:chunk04
}
1c. How to find CPU/HDD bottlenecks
• Three standard tools– vmstat – what’s the CPU busy with? how busy is it?– oprofile – specifically who eats the CPU?– iostat – how busy is the HDD?
• Also use logs, also use searchd --iostats option• Normally everything is clear (us/sy/bi/bo…),
but!• Caveat – HDD might be iops bound• Caveat – CPU load from Sphinx might be
induced and “hidden” in sy
2. Ranking
• Can now be very different (so called rankers in extended2 mode)
• Default ranker – phrase+BM25, accounts for keyword positions – not for free
• Sometimes it’s ok to use simpler ranker• Sometimes @weight is ignored at all
совсем (searching for ipod, sorting by price)
• Sometimes you can save on ranker
3. Filters vs. keywords• Well-known trick
– When indexing, add a special, fake keyword to the document (_authorid123)
– When searching, add it to the query• Obvious questions
– What’s faster, what’s better?• Simple answer
– Count the change before moving away from the cashier
3. Filters vs. keywords• Cost of searching ~ keyword frequencies• Cost of filtering ~ number of candidates• Searching – CPU+IO, filtering – CPU only
• Fake keyword frequency= filter value selectivity
• Frequent value + few candidates → bad!• Rare value + many candidates → good!
4. Filters vs. manual MTF
• Filters are looped over sequentially• In the order specified by the app!
• Narrowest filter – better at the start• Widest filter – better at the end
• Does not matter if you use fake keywords• Exercise to the reader – why?
5. Multi-queries
• Any queries can be sent together in a batch
• Always saves on network roundtrip• Sometimes allows the optimizer to trigger
• Especially important and frequent case –different sorting/grouping modes
• 2x+ optimization for “faceted” searches
5. Multi-queries
$client = new SphinxClient ();$q = “laptop”; // coming from website user
$client->SetSortMode ( SPH_SORT_EXTENDED, “@weight desc”);$client->AddQuery ( $q, “products” );
$client->SetGroupBy ( SPH_GROUPBY_ATTR, “vendor_id” );$client->AddQuery ( $q, “products” );
$client->ResetGroupBy ();$client->SetSortMode ( SPH_SORT_EXTENDED, “price asc” );$client->SetLimit ( 0, 10 );
$result = $client->RunQueries ();
6. Three Big Buttons
• If nothing else helps…• Cutoff (см. SetLimits())
– Forcibly stops searching after first N matches– Per-index, not overall
• MaxQueryTime (см. SetMaxQueryTime())– Forcibly stops searching after M milli-seconds– Per-index, not overall
6. Three Big Buttons
• If nothing else helps…• Consulting
– We can notice the unnoticed– We can implement the unimplemented
Chapter 3. Parallelization sample
Combat mission
• Got ~160M cross-links• Needed misc reports (by
domains→groupby)*************************** 1. row *************************** domain_id: 440682 link_id: 15 url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750 url_to: http://xbox360achievements.org/content/view/101/114/ anchor: NULL from_site_id: 9835 from_forum_id: 1818 from_author_id: 282 from_message_id: 2586message_published: 2006-09-30 00:00:00 ...
Tackling – one
• Partitioned the data• 8 boxes, 4x CPU, ~5M links per CPU
• Used Sphinx• In theory, we could had used MySQL• It practice, way too complicated
– Would had resulted in 15-20M+ rows/CPU– Would had resulted in “manual” aggregation
code
Tackling – two
• Extracted “interesting parts” of the URL when indexing, using an UDF
• Replaced the SELECT with full-text query*************************** 1. row *************************** url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750urlize(url_from,0): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum insidegamer$nl$forum$viewtopic.php insidegamer$nl$forum$viewtopic.php$t=40750urlize(url_from,1): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum insidegamer$nl$forum$viewtopic.php
Tackling – three
• 64 indexes– 4 searchd instances per box, by CPU/HDD
count– 2 indexes (main+delta) per CPU
• All searched in parallel– Web box queries the main instance at each box– Main instance queries itself and other 3 copies– Using 4 instances, because of startup/update– Using plain HDDs, because of IO stepping
Results
• The precision is acceptable• “Rare” domains – precise results• “Frequent” domains – precision within 0.5%
• Average query time – 0.125 sec• 90% queries – under 0.227 sec• 95% queries – under 0.352 sec• 99% queries – under 2.888 sec
The end