Solr Performance & Key Innovations
Yonik Seeley, Lucid Imagination, May 26 2011
Solr 3.1 Highlights
• Numeric range facets (similar to date faceting).
• New spatial search, including spatial filtering, boosting and sorting capabilities.
• Example Velocity-driven search UI at http://localhost:8983/solr/browse
• A new, faster termvector-based highlighter.
• Extended dismax (edismax) query parser with support for fielded queries, enhanced relevancy, and full Lucene syntax support.
• Distributed search support for the Spell check and Terms components.
Solr 3.1 Highlights (continued)
• Suggester, a fast trie-based autocomplete component.
• Sort results by any function query.
• JSON document indexing.
• CSV response format.
• Apache UIMA integration for metadata extraction.
• Tons of optimizations, bugfixes, and new analysis capabilities via Apache Lucene 3.1.
What’s not in 3.1?
• Result Grouping (AKA Field Collapsing)
• Pivot Faceting
• SolrCloud
• Pseudo-fields
• Pseudo-join
• Relevancy function queries
• Per-segment faceting
• *Tons* of new Lucene performance/efficiency goodness
Recent Lucene Performance
• TieredMergePolicy – the new default
    • Much better for incremental indexing / NRT
    • Ignores segment order when selecting best merge
    • Takes deletes into account
    • Does not over-merge (no cascading merges)
• Finite State Transducer (FST) based terms index
DocumentsWriterPerThread (DWPT)
[Diagram: indexing threads feed an IndexWriter that holds several in-memory DWPTs; each DWPT flushes its own segment to disk (files such as _1_0.tiv, _1_0.prx, _1_0.frq, …; _2_0.*; _3_0.*).]
• Flushing a new segment is now concurrent with indexing
• Use multiple indexing threads/connections
• When max memory is hit, the biggest DWPT is concurrently flushed
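The flush rule on this slide can be pictured with a toy model. This is only a sketch of the stated policy (over budget, flush the biggest DWPT), not Lucene's implementation; the function name, the byte counts and the dictionary of buffers are all made up for illustration, and the real flush runs concurrently rather than inline.

```python
def add_document(buffers, thread_id, doc_bytes, max_total_bytes):
    """Buffer a document in its thread's DWPT; flush the largest DWPT if over budget.

    Returns the id of the flushed DWPT, or None if nothing was flushed.
    Toy model only: a "flush" here just resets the byte counter.
    """
    buffers[thread_id] = buffers.get(thread_id, 0) + doc_bytes
    if sum(buffers.values()) <= max_total_bytes:
        return None
    biggest = max(buffers, key=buffers.get)  # the DWPT holding the most memory
    buffers[biggest] = 0  # stand-in for writing that DWPT's segment to disk
    return biggest


buffers = {}
flushed = [
    add_document(buffers, "t1", 40, max_total_bytes=100),
    add_document(buffers, "t2", 30, max_total_bytes=100),
    add_document(buffers, "t1", 40, max_total_bytes=100),  # total exceeds the budget
]
```

Only the third call goes over the budget, and it is the larger buffer (t1) that gets flushed while t2 keeps its documents in memory.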
Solr Cloud
[Diagram: a collection split into shard1 and shard2, each with replica1, replica2 and replica3, coordinated by a ZooKeeper quorum of ZK nodes. A request to http://.../solr/collection1?distrib=true fans out as load-balanced sub-requests.]
ZooKeeper layout:
/configs
  /myconf
    solrconfig.xml
    schema.xml
/livenodes
  server1:8983/solr
  server2:8983/solr
  server2:8983/solr
/collections
  /collection1
    configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr
Solr Cloud: Getting Started
http://wiki.apache.org/solr/SolrCloud
java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar
• -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf: upload /solr/conf to ZK and call it “myconf”
• -DzkRun: run an internal ZK server
http://localhost:8983/solr/collection1/admin/zookeeper.jsp
Distributed Requests
• Explicitly specify node addresses to load-balance across
    shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
    • Equivalent nodes in a list are separated by “|”
    • Different phases of the same distributed request use the same node
• Specify logical shard ids to search across
    shards=NY_shard,NJ_shard
• Query across all shards in the collection
    http://localhost:8983/solr/collection1/select?distrib=true
• public CloudSolrServer(String zkHost)
    • SolrJ Java client that load-balances across all nodes in the cluster
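The explicit shards syntax is easy to get wrong by hand, so here is a minimal sketch of assembling it. The helper name build_shards_param is hypothetical; the only rule it encodes is the one stated on the slide: “|” between equivalent nodes, “,” between shards.

```python
def build_shards_param(shards):
    """Join equivalent nodes with '|' inside a shard and shards with ','."""
    return ",".join("|".join(replicas) for replicas in shards)


# Node addresses taken from the slide: two shards, two equivalent nodes each.
shards_param = build_shards_param([
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
])
```

The resulting string is what would be passed as the value of the shards request parameter.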
Extended Dismax Parser
• Superset of dismax
    &defType=edismax&q=foo&qf=body
• Designed to directly handle user queries w/o exceptions
    • Fixes edge cases where dismax could still throw exceptions: OR AND NOT - “
• Full Lucene syntax support
    • Tries Lucene syntax first
    • Falls back to smart escaping on syntax errors
• Optionally supports treating “and”/“or” as AND/OR in Lucene syntax
• Fielded queries (e.g. myfield:foo) even in degraded mode
    • uf parameter controls what field names may be directly specified in “q”
Extended Dismax Parser (continued)
• boost parameter for multiplicative boost-by-function
• Pure negative query clauses
    • Example: solr OR (-solr)
• Enhanced term proximity boosting
    • pf2=myfield – results in term bigrams in sloppy phrase queries
    • myfield:“aa bb cc” -> myfield:“aa bb” myfield:“bb cc”
• Enhanced stopword handling
    • Stopwords omitted in main query, but added in optional proximity boosting part
    • Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:“solr is” myfield:“is awesome”)
    • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
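The pf2 bigram expansion can be reproduced in a few lines. This is a sketch of the idea only, not edismax's parser code: pf2_clauses is a hypothetical helper, it takes already-tokenized terms, and it ignores slop and boosts.

```python
def pf2_clauses(field, terms):
    """Return one quoted phrase clause per pair of adjacent terms (term bigrams)."""
    return [f'{field}:"{a} {b}"' for a, b in zip(terms, terms[1:])]


# The two examples from the slide.
clauses = pf2_clauses("myfield", ["aa", "bb", "cc"])
stopword_clauses = pf2_clauses("myfield", "solr is awesome".split())
```

In the second call the stopword “is” is kept, which mirrors the slide: stopwords drop out of the main query but still take part in the optional proximity clauses.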
Faceting Performance Improvements
• For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement
• Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
• Optimized deep facet paging – up to 10x faster with really large facet.offsets
• Less memory consumed by field cache entries
• Per-segment faceting with facet.method=fcs
    • Only faster when re-opening index frequently (many times a second)
    • Only works for single-valued fields
Pivot Faceting
Other names that could have made sense: Grid Faceting, Cross-Product Faceting, Matrix Faceting
Syntax: facet.pivot=field1,field2,field3,…
Example: facet.pivot=cat,inStock

                    #docs   #docs w/ inStock:true   #docs w/ inStock:false
cat:electronics     14      10                      4
cat:memory          3       3                       0
cat:connector       2       0                       2
cat:graphics card   2       0                       2
cat:hard drive      2       2                       0
Pivot Faceting (continued)
http://...&facet=true&facet.pivot=cat,popularity
"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          {
            "field":"popularity",
            "value":"7",
            "count":4},
          {
            "field":"popularity",
            "value":"1",
            "count":2}]},
      {
        "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]
• "count":14 – 14 docs w/ cat==electronics
• "count":5 – 5 docs w/ cat==electronics && popularity==6
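A client usually wants the nested pivot response as flat rows. The following is a minimal sketch, assuming the response has already been parsed into Python lists and dicts; flatten_pivot is a hypothetical helper, and the sample data is only the fragment visible on the slide.

```python
def flatten_pivot(entries, path=()):
    """Yield (path, count) for every node in a nested facet_pivot list."""
    for entry in entries:
        here = path + ((entry["field"], entry["value"]),)
        yield here, entry["count"]
        # Recurse into the next pivot level, if any.
        yield from flatten_pivot(entry.get("pivot", []), here)


# Fragment of the "cat,popularity" pivot as shown on the slide.
pivot = [
    {"field": "cat", "value": "electronics", "count": 14, "pivot": [
        {"field": "popularity", "value": "6", "count": 5},
        {"field": "popularity", "value": "7", "count": 4},
        {"field": "popularity", "value": "1", "count": 2},
    ]},
    {"field": "cat", "value": "memory", "count": 3, "pivot": []},
]
rows = list(flatten_pivot(pivot))
```

Each row pairs a path such as (("cat", "electronics"), ("popularity", "6")) with its document count, which maps directly onto a cell of the grid view.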
Range Faceting
Like Date faceting, but more generic
http://...&facet=true&facet.range=price&facet.range.start=0&facet.range.end=500&facet.range.gap=50
"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}
Spatial Search
Step 1: Index some locations!
    <field name="name">The Alpine Shop</field>
    <field name="store">44.013617,-73.168264</field>
Step 2: Decide where you are
    &pt=44.0153371,-73.16734&d=1&sfield=store
Step 3: Profit!
    • Spatial Filter: &fq={!geofilt}
    • Bounding Box: &fq={!bbox}
    • Distance Function: &sort=geodist() asc (Note: You can now sort by any arbitrary function query!)
    • Returning the distance: &fl=geodist() (Pseudo-fields!)
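To get a feel for the numbers in this example, the distance between the query point and the store can be estimated with the haversine formula. This is a sketch under assumptions: the Earth radius constant is an approximation chosen here, and d=1 is taken to mean one kilometre.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)  # the indexed "store" field
pt = (44.0153371, -73.16734)     # the pt request parameter
distance_km = haversine_km(*pt, *store)
within_d = distance_km <= 1      # would this store pass a d=1 filter?
```

The two points are a few hundred metres apart, so the store falls inside the filter radius.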
Pseudo-Fields
Returns other info along with document stored fields
• Function queries
    fl=name,location,geodist(),add(myfield,10)
• Fieldname globs
    fl=id,attr_*
• Multiple “fl” (field list) values
    &fl=id,attr_*&fl=geodist()&fl=termfreq(text,'solr')
• Aliasing
    fl=id,location:loc,_dist_:geodist()
• Future: inlined highlighting, “explain”, sort-values, group-value
Result Grouping / Field Collapsing
• Goal
    • Limit the number of results per category
    • “category” normally defined by unique values in a field
• Uses
    • Web Search – collapse by web site
    • Email threads – collapse by thread id
    • Ecommerce/retail: show the top 5 items for each store category (music, movies, etc)
[Screenshots: Field Collapsing by Site; Field Collapse on Product Type; Result Grouping by Category]
Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      {
        "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            {
              "id":"MA147LL/A",
              "name":"Apple 60 GB iPod with Video Playback Black"}]
        }}]}}}
Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5
"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        {
          "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        {
          "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
        {
          "id":"MA147LL/A",
          "name":"Apple 60 GB iPod with Video Playback Black"}]
    }}}}
Grouping Params
parameter                        meaning                                                  default
group.field=<field>              Like facet.field – group by unique field values
group.query=<query>              Like facet.query – top docs that also match
group.function=<function query>  Group by unique values produced by the function query
group.limit=<n>                  How many docs per group                                  1
group.sort=<sort spec>           How to sort documents within a group                     Same as sort
rows=<n>                         How many groups to return                                10
sort=<sort spec>                 How to sort the groups relative to each other
                                 (based on top doc)
group.format=<format>            grouped/simple – if simple, a single flat list is used   grouped
                                 and rows units are “docs”
group.main=true/false            If true, the first field grouping command is used as     false
                                 main result set
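How group.field, group.limit and rows interact can be shown with a small in-memory model. This is a sketch of the semantics in the table, not Solr's collector: group_by_field is a hypothetical function, the input is assumed to be already ranked, and the manu_exact value for the second Belkin document is inferred from the numFound of 2 in the earlier response.

```python
def group_by_field(docs, field, limit=1, rows=10):
    """Group ranked docs by a field value, keeping at most `limit` docs per group."""
    groups = {}  # insertion-ordered, so groups stay ordered by their top document
    for doc in docs:
        group = groups.setdefault(doc[field], {"numFound": 0, "docs": []})
        group["numFound"] += 1  # counts every match, even those beyond the limit
        if len(group["docs"]) < limit:
            group["docs"].append(doc)
    return [
        {"groupValue": value, "doclist": group}
        for value, group in list(groups.items())[:rows]
    ]


docs = [
    {"id": "IW-02", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "manu_exact": "Belkin"},  # manufacturer assumed
    {"id": "MA147LL/A", "manu_exact": "Apple Computer Inc."},
]
grouped = group_by_field(docs, "manu_exact")  # defaults: group.limit=1, rows=10
```

With the default limit of 1 the Belkin group reports numFound 2 but returns a single document, the same shape as the Group by Field response.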
Pseudo-Join
Blogs:
    id: blog1, name: Solr ‘n Stuff, owner: Yonik Seeley, started: 2007-10-26
    id: blog2, name: lifehacker, owner: Gawker Media, started: 2005-1-31
Posts:
    id: post1, blog_id: blog1, author: Yonik Seeley, title: Solr relevancy function queries, body: Lucene’s default ranking […]
    id: post2, blog_id: blog1, author: Yonik Seeley, title: Solr result grouping, body: Result Grouping, also called […]
    id: post3, blog_id: blog2, author: Whitson Gordon, title: How to Install Netflix on Almost Any Android Device
Restrict to blogs mentioning netflix:
    fq={!join from=blog_id to=id}body:netflix
    - Finds all documents matching “netflix”
    - Maps to different docs by following blog_id to id
Pseudo-Join Examples
• Only show posts from blogs started after 2010
    q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]
• If any post in a blog mentions “obama”, then search all posts in that blog for “bomb” (self-join)
    q=bomb&fq={!join from=blog_id to=blog_id}obama
• If any blog post mentions “obama”, then search all websites with the same blog owner for “bomb”
    q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama
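The from/to direction of a join is the part people tend to get backwards, so here is a toy in-memory model of {!join from=blog_id to=id}. It is a sketch, not how Solr executes joins: pseudo_join and mentions_netflix are hypothetical names, only single-valued fields are handled, and the match is done on the title because the slide does not show the body of post3.

```python
def pseudo_join(docs, matches, from_field, to_field):
    """Return docs whose `to_field` equals a `from_field` value of any matching doc."""
    # Step 1: run the inner query and collect the "from" values of its matches.
    keys = {doc[from_field] for doc in docs if from_field in doc and matches(doc)}
    # Step 2: map to the docs whose "to" field carries one of those values.
    return [doc for doc in docs if doc.get(to_field) in keys]


docs = [
    {"id": "blog1", "name": "Solr 'n Stuff"},
    {"id": "blog2", "name": "lifehacker"},
    {"id": "post1", "blog_id": "blog1", "title": "Solr relevancy function queries"},
    {"id": "post2", "blog_id": "blog1", "title": "Solr result grouping"},
    {"id": "post3", "blog_id": "blog2", "title": "How to Install Netflix on Almost Any Android Device"},
]


def mentions_netflix(doc):
    # Stand-in for body:netflix, matching on the title instead.
    return "netflix" in doc.get("title", "").lower()


blogs = pseudo_join(docs, mentions_netflix, from_field="blog_id", to_field="id")
```

The inner query matches a post, but the filter returns the blog that post points to. As the comparison slide notes, nothing else from the post is carried over to the blog document.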
Cross-Core Join
http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john
[Diagram: a single Solr server hosting two cores, collection1 and sec1]
collection1:
    id: doc1, security: managers, title: doc for managers only, body: …
    id: doc1, security: managers, employees, title: doc for everyone, body: …
sec1:
    id: mary, security_groups: managers, employees
    id: john, security_groups: employees
Pseudo-Join vs Grouping
Pseudo-Join:
• O(n_terms_in_join_fields)
• Single or multi-valued fields
• Filters only (no info currently passed from the “from” docs to the “to” docs)
• Chainable (one join can be the input to another)
• Affects which documents match a request, so naturally affects facet numbers (e.g. you can search posts and get numbers of blogs)
Result Grouping / Field Collapsing:
• O(n_docs_in_result)
• Single-valued fields only
• Can order docs within a group and groups by top doc within that group using normal sort criteria
• Not currently chainable – can only group one field deep
• Grouping does not currently affect the set of documents matching the query, so faceting is unaffected
Auto-Suggest
• Many people previously used terms component
    • Can be slow for a large corpus
• New auto-suggest builds off SpellCheck component
    • TST implementation: compact memory based trie
    • FST implementation: slower to build, but smaller & faster lookup
    • Based on a field in the main index, or on a dictionary file
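Why a trie suits prefix lookup can be seen in a few lines. The sketch below is a plain dictionary-based trie, not the TST or FST implementation named above; the class and method names are illustrative, and apart from “ultrasharp” the dictionary terms are invented.

```python
class Trie:
    """Minimal character trie supporting prefix lookup."""

    def __init__(self):
        self.children = {}
        self.is_term = False

    def insert(self, term):
        node = self
        for ch in term:
            node = node.children.setdefault(ch, Trie())
        node.is_term = True

    def suggest(self, prefix):
        # Walk down to the node for the prefix; no node means no suggestions.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every complete term below that node.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_term:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for term in ["ultrasharp", "usb", "video"]:  # hypothetical dictionary
    trie.insert(term)
suggestions = trie.suggest("ult")
```

Lookup cost depends on the prefix length and the size of the matching subtree rather than on the total number of terms.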
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}
Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
[
  {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
]'
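The same update can be prepared from Python with only the standard library. This is a sketch equivalent to the curl call above, assuming the same local URL; it builds the request without sending it, since sending needs a running Solr instance.

```python
import json
import urllib.request

URL = "http://localhost:8983/solr/update/json"

# The document from the slide, as a list with one entry.
docs = [{
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}]

request = urllib.request.Request(
    URL,
    data=json.dumps(docs).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# To actually index the document against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```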
Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
• Can handle multi-valued fields (see “cat” field in example)
• Completely compatible with the CSV update handler (can round-trip)
• Results are streamed – good for dumping entire parts of the index
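Reading this output back on the client side is straightforward with Python's csv module. A minimal sketch, using the three rows shown above as input; it assumes multi-valued fields are comma-separated inside one quoted cell, as in the “cat” column, and that the values themselves contain no commas.

```python
import csv
import io

csv_text = """name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
"""

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    row["price"] = float(row["price"])
    row["cat"] = row["cat"].split(",")  # multi-valued field packed into one cell
    row["popularity"] = int(row["popularity"])
    rows.append(row)
```

For a real export the text would come from the streamed HTTP response instead of a string literal, so rows can be processed one at a time.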
http://localhost:8983/solr/browse
Q&A