Faceting optimizations for Solr: presented by Toke Eskildsen, State & University Library, Denmark
TRANSCRIPT
OCTOBER 13-16, 2015 • AUSTIN, TX
Faceting optimizations for Solr Toke Eskildsen
Search Engineer / Solr Hacker State and University Library, Denmark
@TokeEskildsen / [email protected]
3/55
Overview
l Web scale at the State and University Library, Denmark
l Field faceting 101
l Optimizations
  - Reuse
  - Tracking
  - Caching
  - Alternative counters
4/55
Web scale for a small web
l Denmark
  - Consolidated circa 10th century
  - 5.6 million people
l Danish Net Archive (http://netarkivet.dk)
  - Established 2005
  - 20 billion items / 590TB+ raw data
5/55
Indexing 20 billion web items / 590TB into Solr
l Solr index size is 1/9th of the real data = 70TB
l Each shard holds 200M documents / 900GB
  - Shards are built chronologically by a dedicated machine
  - Projected 80 shards
  - Current build time per shard: 4 days
  - Total build time is 20 CPU-core years
  - So far only 7.4 billion documents / 27TB in the index
6/55
Searching a 7.4-billion-document / 27TB Solr index
l SolrCloud with 2 machines, each having
  - 16 HT-cores, 256GB RAM, 25 * 930GB SSD
  - 25 shards @ 900GB
  - 1 Solr/shard/SSD, Xmx=8g, Solr 4.10
  - Disk cache 100GB or < 1% of index size
7/55
Danish Net Archive Search, late 2014
8/55
String faceting 101 (single shard)
counter = new int[ordinals]
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
for entry: priorityQueue
    result.add(resolveTerm(entry.ordinal), entry.count)
ord  term  counter
0    A     0
1    B     3
2    C     0
3    D     1006
4    E     1
5    F     1
6    G     0
7    H     0
8    I     3
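The counting loop above translates directly into runnable form. Below is a minimal Python sketch; get_ordinals, resolve_term, and the toy data are invented stand-ins for Solr's ordinal lookup structures, not Solr's actual API:

```python
import heapq

def facet_counts(doc_ids, get_ordinals, resolve_term, num_ordinals, top_x=3):
    """Count facet ordinals for all docs in the result set, then pick top-X."""
    counter = [0] * num_ordinals          # counter = new int[ordinals]
    for doc_id in doc_ids:                # for docID: result.getDocIDs()
        for ordinal in get_ordinals(doc_id):
            counter[ordinal] += 1
    # Priority queue over *all* ordinals, even the untouched ones.
    top = heapq.nlargest(top_x, range(num_ordinals), key=lambda o: counter[o])
    return [(resolve_term(o), counter[o]) for o in top]

# Toy data loosely mirroring the slide's table: ordinal 3 ("D") dominates.
terms = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
docs = {0: [3, 1], 1: [3], 2: [3, 8], 3: [1, 8, 4], 4: [3, 1, 5, 8]}
result = facet_counts(docs.keys(), lambda d: docs[d], lambda o: terms[o], len(terms))
print(result)  # → [('D', 4), ('B', 3), ('I', 3)]
```

Note that the final sweep touches every ordinal regardless of how few were updated; that is exactly the cost the later tracking optimization attacks.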
9/55
Test setup 1 (easy start)
l Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - Single shard, 250M documents / 900GB
l URL field
  - Single String value
  - 200M unique terms
l 3 concurrent “users”
l Random search terms
10/55
Vanilla Solr, single shard, 250M documents, 200M values, 3 users
11/55
Allocating and dereferencing 800MB arrays
12/55
Reuse the counter
counter = new int[ordinals]
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
<counter is no longer referenced and will be garbage collected at some point>
13/55
Reuse the counter
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
Note: The JSON Facet API in Solr 5 already supports reuse of counters
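A counter pool can be sketched in a few lines. This is an illustration of the idea only, not Solr's implementation (the Solr 5 JSON Facet API has its own reuse mechanism); the class and method names are invented:

```python
class CounterPool:
    """Recycle cleared counter arrays instead of allocating a fresh
    ~800MB int array per faceting request."""

    def __init__(self, num_ordinals):
        self.num_ordinals = num_ordinals
        self.free = []              # cleared counters ready for reuse

    def get_counter(self):
        # Reuse a cleared counter if available, else allocate a new one.
        return self.free.pop() if self.free else [0] * self.num_ordinals

    def release(self, counter):
        # Clearing on release keeps allocation/GC cost off the query
        # path (in practice a background thread could do the clearing).
        for i in range(len(counter)):
            counter[i] = 0
        self.free.append(counter)

pool = CounterPool(num_ordinals=10)
c1 = pool.get_counter()
c1[3] += 1
pool.release(c1)
c2 = pool.get_counter()     # same backing array, now cleared
print(c2 is c1, sum(c2))    # → True 0
```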
14/55
Using and clearing 800MB arrays
15/55
Reusing counters vs. not doing so
16/55
Reusing counters, now with readable visualization
17/55
Reusing counters, now with readable visualization
Why does it always take more than 500ms?
18/55
Iteration is not free
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
200M unique terms = 800MB
19/55
Tracking updated counters

Start state: all counters zero, tracker empty.

ord:     0  1  2  3  4  5  6  7  8
counter: 0  0  0  0  0  0  0  0  0
tracker: (empty)

Each time a counter goes from 0 to 1, its ordinal is appended to the tracker:

counter[3]++                              tracker: 3
counter[1]++                              tracker: 3 1
counter[1]++ counter[1]++                 tracker unchanged
counter[8]++ counter[8]++ counter[4]++
counter[8]++ counter[5]++ counter[1]++
counter[1]++ … counter[1]++               tracker: 3 1 8 4 5

End state:

ord:     0  1  2  3     4  5  6  7  8
counter: 0  3  0  1006  1  1  0  0  3
tracker: 3 1 8 4 5
24/55
Tracking updated counters
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        if counter[ordinal]++ == 0 && tracked < maxTracked
            tracker[tracked++] = ordinal
if tracked < maxTracked
    for i = 0 ; i < tracked ; i++
        priorityQueue.add(tracker[i], counter[tracker[i]])
else
    for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])

ord:     0  1  2  3     4  5  6  7  8
counter: 0  3  0  1006  1  1  0  0  3
tracker: 3 1 8 4 5
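The tracked variant can be sketched as runnable Python; the helper names and toy data are invented for illustration:

```python
def tracked_facet_counts(doc_ids, get_ordinals, num_ordinals, max_tracked):
    """Remember which counters were touched, so the final sweep can skip
    the (mostly zero) full array when few distinct ordinals were updated."""
    counter = [0] * num_ordinals
    tracker = [0] * max_tracked
    tracked = 0
    for doc_id in doc_ids:
        for ordinal in get_ordinals(doc_id):
            # A counter moving 0 -> 1 is being seen for the first time.
            if counter[ordinal] == 0 and tracked < max_tracked:
                tracker[tracked] = ordinal
                tracked += 1
            counter[ordinal] += 1
    if tracked < max_tracked:   # tracker complete: sweep only touched ordinals
        candidates = tracker[:tracked]
    else:                       # tracker overflowed: fall back to the full sweep
        candidates = range(num_ordinals)
    return {o: counter[o] for o in candidates if counter[o] > 0}

docs = {0: [3], 1: [1], 2: [1], 3: [1]}
counts = tracked_facet_counts(docs.keys(), lambda d: docs[d],
                              num_ordinals=9, max_tracked=4)
print(counts)  # → {3: 1, 1: 3}
```

With 200M ordinals and a small result set, sweeping only the tracked ordinals is what removes the constant ~500ms floor seen in the earlier graphs.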
25/55
Tracking updated counters
26/55
Distributed faceting
Phase 1) All shards perform faceting; the merger calculates the top-X terms.
Phase 2) The term counts are requested from the shards that did not return them in phase 1; the merger calculates the final counts for the top-X terms.

for term: fineCountRequest.getTerms()
    result.add(term,
        searcher.numDocs(query(field:term), base.getDocIDs()))
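The two-phase merge can be sketched with shards simulated as plain dicts (the function and data are hypothetical, purely to show the control flow):

```python
from collections import Counter

def merge_facets(shards, top_x, shard_top):
    """Phase 1: each shard reports only its own top `shard_top` terms;
    the merger picks provisional top-X candidates.
    Phase 2: exact counts for each candidate are gathered from every
    shard (covering shards that did not report that term) and summed."""
    merged = Counter()
    for shard in shards:
        merged.update(dict(Counter(shard).most_common(shard_top)))
    candidates = [t for t, _ in merged.most_common(top_x)]
    return {t: sum(s.get(t, 0) for s in shards) for t in candidates}

shards = [{"a": 5, "b": 4, "c": 1},
          {"a": 2, "c": 3, "b": 1}]
top = merge_facets(shards, top_x=2, shard_top=2)
print(top)  # → {'a': 7, 'b': 5}
```

Note that shard 2 never reported "b" in phase 1; phase 2 is what recovers its count of 1 there.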
27/55
Test setup 2 (more shards, smaller field)
l Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - 9 shards @ 250M documents / 900GB
l domain field
  - Single String value
  - 1.1M unique terms per shard
l 1 concurrent “user”
l Random search terms
28/55
Pit of Pain™ (or maybe “Horrible Hill”?)
29/55
Fine counting can be slow
Phase 1: Standard faceting
Phase 2:

for term: fineCountRequest.getTerms()
    result.add(term,
        searcher.numDocs(query(field:term), base.getDocIDs()))
30/55
Alternative fine counting
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter.increment(ordinal)
for term: fineCountRequest.getTerms()
    result.add(term, counter.get(getOrdinal(term)))

The counting step is the same as phase 1, which yields:

ord:     0  1  2  3     4  5  6  7  8
counter: 0  3  0  1006  1  1  0  0  3
31/55
Using cached counters from phase 1 in phase 2
counter = pool.getCounter(key)
for term: query.getTerms()
    result.add(term, counter.get(getOrdinal(term)))
pool.release(counter)
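The cached-counter idea can be sketched as follows; the key "query#42", the pool class, and the ordinal map are all invented for illustration (in practice the key would identify the query that produced the phase-1 counter):

```python
class CachedCounterPool:
    """Keep the phase-1 counter around under a key, so the phase-2 fine
    count can read counts directly instead of re-running per-term
    set intersections."""
    def __init__(self):
        self.cache = {}
    def put(self, key, counter):
        self.cache[key] = counter
    def get_counter(self, key):
        # Each phase-1 counter serves exactly one phase-2 request.
        return self.cache.pop(key)

def fine_count(pool, key, terms, get_ordinal):
    counter = pool.get_counter(key)
    return {term: counter[get_ordinal(term)] for term in terms}

ordinals = {"A": 0, "B": 1, "D": 3}
pool = CachedCounterPool()
pool.put("query#42", [0, 3, 0, 1006, 1, 1, 0, 0, 3])   # phase-1 result
counts = fine_count(pool, "query#42", ["D", "B"], ordinals.__getitem__)
print(counts)  # → {'D': 1006, 'B': 3}
```

Phase 2 thus becomes a handful of array lookups, which is what flattens the Pit of Pain™ on the next slide.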
32/55
Pit of Pain™ practically eliminated
33/55
Pit of Pain™ practically eliminated
Stick figure CC BY-NC 2.5 Randall Munroe xkcd.com
34/55
Test setup 3 (more shards, more fields)
l Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - 23 shards @ 250M documents / 900GB
l Faceting on 6 fields
  - url: ~200M unique terms / shard
  - domain & host: ~1M unique terms each / shard
  - type, suffix, year: < 1000 unique terms / shard
35/55
1 machine, 7 billion documents / 23TB total index, 6 facet fields
36/55
High-cardinality can mean different things
Single shard / 250,000,000 docs / 900GB

Field   References      Max docs/term   Unique terms
domain  250,000,000     3,000,000       1,100,000
url     250,000,000     56,000          200,000,000
links   5,800,000,000   5,000,000       610,000,000

2440 MB / counter
37/55
Different distributions

domain 1.1M / url 200M / links 600M unique terms
[chart: the fields differ in max docs/term (high vs. low max) and in tail length (very long vs. short tail)]
38/55
Theoretical lower limit per counter: log2(max_count)
[chart: per-counter maxima vary wildly, e.g. max=1, max=3, max=7, max=63, max=2047]
39/55
int vs. PackedInts

        int[ordinals]   PackedInts(ordinals, maxBPV)
domain  4 MB            3 MB (72%)
url     780 MB          420 MB (53%)
links   2350 MB         1760 MB (75%)
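The packed sizes follow from bits-per-value arithmetic. A small Python sketch of that calculation (the raw byte counts land close to, but not exactly on, the slide's figures, since real PackedInts structures carry some implementation overhead):

```python
from math import ceil, log2

def bits_per_value(max_count):
    """Bits needed to hold counts in [0, max_count] - PackedInts' maxBPV."""
    return max(1, ceil(log2(max_count + 1)))

def packed_mb(num_ordinals, max_count):
    return num_ordinals * bits_per_value(max_count) / 8 / 1_000_000

# Per-shard figures from the slides: (unique terms, max docs/term).
fields = {"domain": (1_100_000, 3_000_000),
          "url": (200_000_000, 56_000),
          "links": (610_000_000, 5_000_000)}
for name, (ordinals, max_count) in fields.items():
    int_mb = ordinals * 32 / 8 / 1_000_000          # plain int[ordinals]
    print(f"{name}: int[] {int_mb:.0f} MB -> packed "
          f"{packed_mb(ordinals, max_count):.0f} MB "
          f"({bits_per_value(max_count)} bits/value)")
```

For url the win is largest: a max of 56,000 docs/term fits in 16 bits, half of a 32-bit int, across 200M ordinals.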
40/55
n-plane-z counters

[diagram: counter bits split across planes a-d; the "Platonic ideal" layout vs. the "harsh reality"]
41/55
[diagram: planes a-d filling as counter values grow]

L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101
L: 4 ≣ 000111
L: 5 ≣ 001001
L: 6 ≣ 001011
L: 7 ≣ 001101
...
L: 12 ≣ 010111
46/55
Comparison of counter structures

        int[ordinals]   PackedInts(ordinals, maxBPV)   n-plane-z
domain  4 MB            3 MB (72%)                     1 MB (30%)
url     780 MB          420 MB (53%)                   66 MB (8%)
links   2350 MB         1760 MB (75%)                  311 MB (13%)
47/55
Speed comparison
48/55
I could go on about
l Threaded counting
l Heuristic faceting
l Fine count skipping
l Counter capping
l Monotonically increasing tracker for n-plane-z
l Regexp filtering
49/55
What about huge result sets?
l Rare for explorative term-based searches
l Common for batch extractions
l Threading works poorly as #shards > #CPUs
l But how bad is it really?
50/55
Really bad! 8 minutes
51/55
Heuristic faceting
l Use sampling to guess the top-X terms
  - Re-use the existing tracked counters
  - 1:1000 sampling seems usable for the field links, which has 5 billion references per shard
l Fine-count the guessed terms
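The sample-then-fine-count idea can be sketched as follows; the helper names and data are invented, and real code would over-provision (guess more than top-X candidates) as the next slide discusses:

```python
import heapq

def heuristic_top_terms(doc_ids, get_ordinals, num_ordinals, top_x,
                        sample_rate=1000):
    """Count only every Nth document to guess the top-X ordinals, then
    fine-count just those guesses over the full result set."""
    counter = [0] * num_ordinals
    for i, doc_id in enumerate(doc_ids):
        if i % sample_rate == 0:           # 1:sample_rate sampling
            for ordinal in get_ordinals(doc_id):
                counter[ordinal] += 1
    guessed = heapq.nlargest(top_x, range(num_ordinals),
                             key=lambda o: counter[o])
    # Fine count: exact counts, but only for the guessed ordinals.
    exact = {o: 0 for o in guessed}
    for doc_id in doc_ids:
        for ordinal in get_ordinals(doc_id):
            if ordinal in exact:
                exact[ordinal] += 1
    return exact

# Toy run: ordinal 0 is in every doc, ordinal 1 in every other doc.
top = heuristic_top_terms(range(10), lambda d: [0] if d % 2 else [0, 1],
                          num_ordinals=4, top_x=1, sample_rate=2)
print(top)  # → {0: 10}
```

The final counts for the guessed terms are exact; only the *choice* of terms is heuristic.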
52/55
Over provisioning helps validity
53/55
10 seconds < 8 minutes
54/55
Web scale for a small web
l Denmark
  - Consolidated circa 10th century
  - 5.6 million people
l Danish Net Archive (http://netarkivet.dk)
  - Established 2005
  - 20 billion items / 590TB+ raw data
55/55
Never enough time, but talk to me about
l Threaded counting
l Monotonically increasing tracker for n-plane-z
l Regexp filtering
l Fine count skipping
l Counter capping