
Solr Performance & Key Innovations

Yonik Seeley, Lucid Imagination, May 26 2011

Solr 3.1 Highlights

Numeric range facets (similar to date faceting).
New spatial search, including spatial filtering, boosting and sorting capabilities.
Example Velocity-driven search UI at http://localhost:8983/solr/browse
A new, faster termvector-based highlighter.
Extended dismax (edismax) query parser with support for fielded queries, enhanced relevancy, and full Lucene syntax support.
Distributed search support for the Spell check and Terms components.


http://wiki.apache.org/solr/SimpleFacetParameters%23Facet_by_Range

http://wiki.apache.org/solr/SpatialSearch

http://wiki.apache.org/solr/HighlightingParameters

http://wiki.apache.org/solr/SpellCheckComponent

http://wiki.apache.org/solr/TermsComponent

Solr 3.1 Highlights (continued)

Suggester, a fast trie-based autocomplete component.
Sort results by any function query.
JSON document indexing.
CSV response format.
Apache UIMA integration for metadata extraction.
Tons of optimizations, bugfixes, and new analysis capabilities via Apache Lucene 3.1.


http://wiki.apache.org/solr/Suggester

http://wiki.apache.org/solr/FunctionQuery%23Sort_By_Function%20any%20function

http://wiki.apache.org/solr/UpdateJSON

http://wiki.apache.org/solr/CSVResponseWriter

http://wiki.apache.org/solr/SolrUIMA

http://lucene.apache.org/java/docs/index.html

What’s not in 3.1?

Result Grouping (AKA Field Collapsing)
Pivot Faceting
SolrCloud
Pseudo-fields
Pseudo-join
Relevancy function queries
Per-segment faceting
*Tons* of new Lucene performance/efficiency goodness

Recent Lucene Performance

TieredMergePolicy: the new default
• Much better for incremental indexing / NRT
• Ignores segment order when selecting best merge
• Takes deletes into account
• Does not over-merge (no cascading merges)

Finite State Transducer (FST) based terms index

DocumentsWriterPerThread (DWPT)

[Diagram: indexing threads feed in-memory DWPTs inside the IndexWriter; each DWPT flushes its own segment to disk (_1_0.tiv, _1_0.prx, _1_0.frq, …; _2_0.tiv, _2_0.prx, _2_0.frq, …; _3_0.tiv, _3_0.prx, _3_0.frq, …)]

Flushing a new segment is now concurrent w/ indexing
Use multiple indexing threads/connections
When max mem is hit, the biggest DWPT is concurrently flushed

Solr Cloud

[Diagram: a collection split into two shards, shard1 and shard2, each with three replicas (replica1, replica2, replica3), alongside a ZooKeeper quorum of ZK nodes. A request to http://.../solr/collection1?distrib=true is served through load-balanced sub-requests to the shards.]

ZooKeeper layout:

/configs
  /myconf
    solrconfig.xml
    schema.xml
/livenodes
  server1:8983/solr
  server2:8983/solr
  server2:8983/solr
/collections
  /collection1
    configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr

Solr Cloud: Getting Started

http://wiki.apache.org/solr/SolrCloud

java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar

-Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf: upload ./solr/conf to ZK and call it "myconf"
-DzkRun: run an internal ZK server

http://localhost:8983/solr/collection1/admin/zookeeper.jsp

Distributed Requests

Explicitly specify node addresses to load-balance across
• shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
• Equivalent nodes are listed together, separated by "|"
• Different phases of the same distributed request use the same node

Specify logical shard ids to search across
• shards=NY_shard,NJ_shard

Query across all shards in the collection
• http://localhost:8983/solr/collection1/select?distrib=true

public CloudSolrServer(String zkHost)
• SolrJ Java client that load-balances across all nodes in cluster
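The explicit shards syntax above can be assembled mechanically. The following is a minimal sketch, not part of Solr or SolrJ: build_shards_param is a hypothetical helper, and it only does the string joining described on the slide ("|" between equivalent nodes, "," between shards).

```python
def build_shards_param(shards):
    """Join equivalent nodes with '|' and distinct shards with ','.

    `shards` is a list of shards; each shard is a list of equivalent
    node addresses. This helper is illustrative, not a Solr API.
    """
    return ",".join("|".join(nodes) for nodes in shards)


# Node addresses taken from the slide.
shards = [
    ["localhost:8983/solr", "localhost:8900/solr"],  # shard 1 and its equivalent
    ["localhost:7574/solr", "localhost:7500/solr"],  # shard 2 and its equivalent
]
shards_param = build_shards_param(shards)
print("shards=" + shards_param)
```

The resulting value would be passed as the shards request parameter; URL-encoding of "|" may be needed depending on the HTTP client.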

Extended Dismax Parser

Superset of dismax
Designed to directly handle user queries w/o exceptions
• &defType=edismax&q=foo&qf=body
Fixes edge cases where dismax could still throw exceptions
• OR AND NOT - "
Full Lucene syntax support
• Tries Lucene syntax first
• Smart escaping is done if there are syntax errors
Optionally supports treating “and”/“or” as AND/OR in Lucene syntax
Fielded queries (e.g. myfield:foo) even in degraded mode
• uf parameter controls what field names may be directly specified in “q”

Extended Dismax Parser (continued)

boost parameter for multiplicative boost-by-function
Pure negative query clauses
• Example: solr OR (-solr)
Enhanced term proximity boosting
• pf2=myfield results in term bigrams in sloppy phrase queries
• myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
Enhanced stopword handling
• Stopwords omitted in main query, but added in optional proximity boosting part
• Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
• Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
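To make the pf2 rewrite concrete, here is a small sketch that reproduces the two transformations shown on the slide. It only imitates the visible result; it is not how edismax is implemented, and the function name pf2_rewrite and its whitespace tokenization are assumptions made for illustration.

```python
def pf2_rewrite(field, query, stopwords=frozenset()):
    """Imitate the slide's example: stopwords are dropped from the
    mandatory main clause but kept when forming phrase bigrams.

    Hypothetical helper; real analysis is done by Solr's analyzers.
    """
    terms = query.split()
    main_terms = [t for t in terms if t not in stopwords]
    main_clause = "+{}:({})".format(field, " ".join(main_terms))
    # Adjacent term pairs become optional phrase queries.
    bigrams = ['{}:"{} {}"'.format(field, a, b) for a, b in zip(terms, terms[1:])]
    return main_clause, bigrams


main_clause, bigrams = pf2_rewrite("myfield", "solr is awesome", stopwords={"is"})
print(main_clause, "(" + " ".join(bigrams) + ")")
```

The optional bigram clauses only boost documents where the terms occur near each other; they do not change which documents match.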

Faceting Performance Improvements

For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement

Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster

Optimized deep facet paging – up to 10x faster with really large facet.offsets

Less memory consumed by field cache entries
Per-segment faceting with facet.method=fcs
• Only faster when re-opening index frequently (many times a second)
• Only works for single-valued fields

Pivot Faceting

Other names that could have made sense: Grid Faceting, Cross-Product Faceting, Matrix Faceting

Syntax: facet.pivot=field1,field2,field3,…

facet.pivot=cat,inStock

                    #docs   #docs w/ inStock:true   #docs w/ inStock:false
cat:electronics     14      10                      4
cat:memory          3       3                       0
cat:connector       2       0                       2
cat:graphics card   2       0                       2
cat:hard drive      2       2                       0

Pivot Faceting

http://localhost:8983/solr/select?q=*:*&rows=0&wt=json&indent=true&facet=true&facet.pivot=cat,popularity

"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          {
            "field":"popularity",
            "value":"7",
            "count":4},
          (continued)
          {
            "field":"popularity",
            "value":"1",
            "count":2}]},
      {
        "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]

"count":14 on the cat entry: 14 docs w/ cat==electronics
"count":5 on the nested entry: 5 docs w/ cat==electronics && popularity==6
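A pivot response is a tree, so a short recursive walk is enough to turn it into flat rows. The sketch below assumes the response has already been parsed into Python dicts; flatten_pivot is a hypothetical helper, and the sample data is the abbreviated excerpt from the slide (the elided entries are simply absent, so child counts do not add up to the parent).

```python
def flatten_pivot(nodes, path=()):
    """Yield (path, count) for every node in a facet_pivot tree.

    `path` is a tuple of (field, value) pairs from the root down.
    """
    for node in nodes:
        here = path + ((node["field"], node["value"]),)
        yield here, node["count"]
        # Leaf nodes may carry an empty "pivot" list or none at all.
        yield from flatten_pivot(node.get("pivot", []), here)


# Abbreviated excerpt of the "cat,popularity" pivot shown above.
sample = [
    {"field": "cat", "value": "electronics", "count": 14, "pivot": [
        {"field": "popularity", "value": "6", "count": 5},
        {"field": "popularity", "value": "7", "count": 4},
        {"field": "popularity", "value": "1", "count": 2},
    ]},
    {"field": "cat", "value": "memory", "count": 3, "pivot": []},
]

rows = list(flatten_pivot(sample))
for path, count in rows:
    print(" && ".join(f"{f}=={v}" for f, v in path), count)
```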





Range Faceting

Like Date faceting, but more generic

http://localhost:8983/solr/select?wt=json&indent=true&rows=0&q=*:*&facet=true&facet.range=price&facet.range.start=0&facet.range.end=500&facet.range.gap=50

"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}
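The bucket labels in the response are the lower bounds implied by start, end and gap. As a rough sketch (both function names are hypothetical, and options such as facet.range.hardend are not modeled), the parameters and the expected bucket labels can be derived like this:

```python
import math
from urllib.parse import urlencode


def range_facet_params(field, start, end, gap):
    """Build the facet.range parameters used in the request above."""
    return {
        "facet": "true",
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
    }


def bucket_starts(start, end, gap):
    """Lower bound of each bucket, assuming buckets of equal width."""
    n_buckets = math.ceil((end - start) / gap)
    return [float(start + i * gap) for i in range(n_buckets)]


query_string = urlencode(range_facet_params("price", 0, 500, 50))
starts = bucket_starts(0, 500, 50)
print(query_string)
print(starts)
```

With start=0, end=500 and gap=50 this yields ten buckets, matching the ten keys under "counts" in the response.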




Spatial Search

Step 1: Index some locations!
  <field name="name">The Alpine Shop</field>
  <field name="store">44.013617,-73.168264</field>

Step 2: Decide where you are
  &pt=44.0153371,-73.16734&d=1&sfield=store

Step 3: Profit!
  Spatial Filter: &fq={!geofilt}
  Bounding Box: &fq={!bbox}
  Distance Function: &sort=geodist() asc
  Returning the distance: &fl=geodist() (Pseudo-fields!)

Note: You can now sort by any arbitrary function query!
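To see why the example store passes a d=1 filter, the distance between the two points can be checked by hand. This sketch uses the haversine great-circle formula with a mean Earth radius of 6371 km; both the radius and the helper name haversine_km are assumptions for illustration, and Solr's own result may differ slightly.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)   # indexed "store" field
pt = (44.0153371, -73.16734)      # &pt= query point
distance = haversine_km(*pt, *store)
within_d = distance <= 1.0        # &d=1 (kilometers)
print(round(distance, 3), within_d)
```

The store lies a few hundred meters from the query point, so it falls inside the 1 km radius.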

Pseudo-Fields

Returns other info along with document stored fields

Function queries
• fl=name,location,geodist(),add(myfield,10)
Fieldname globs
• fl=id,attr_*
Multiple “fl” (field list) values
• &fl=id,attr_*&fl=geodist()&fl=termfreq(text,'solr')
Aliasing
• fl=id,location:loc,_dist_:geodist()
Future: inlined highlighting, “explain”, sort-values, group-value
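When a client builds such a request, repeated fl values are just repeated key/value pairs. A minimal sketch with the Python standard library (the q value is an arbitrary placeholder, not from the slide):

```python
from urllib.parse import urlencode, parse_qs

# A list of pairs keeps the repeated "fl" keys; a dict would collapse them.
params = [
    ("q", "*:*"),  # placeholder query, assumed for illustration
    ("fl", "id,attr_*"),
    ("fl", "geodist()"),
    ("fl", "termfreq(text,'solr')"),
]
query_string = urlencode(params)
print(query_string)

# Round-trip to confirm all three fl values survive encoding.
decoded_fl = parse_qs(query_string)["fl"]
print(decoded_fl)
```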


Result Grouping / Field Collapsing

Goal
• Limit the number of results per category
• “category” normally defined by unique values in a field
Uses
• Web Search: collapse by web site
• Email threads: collapse by thread id
• Ecommerce/retail: show the top 5 items for each store category (music, movies, etc)

[Screenshots: Field Collapsing by Site; Field Collapse on Product Type; Result Grouping by Category]

Group by Field

http://localhost:8983/solr/select?wt=json&indent=true&fl=id,name&q=ipod&group=true&group.field=manu_exact

"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      {
        "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            {
              "id":"MA147LL/A",
              "name":"Apple 60 GB iPod with Video Playback Black"}]
        }}]}}}
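On the client side, a grouped response is walked group by group rather than as a flat doc list. A small sketch, assuming the JSON above has been parsed into a dict (top_docs_per_group is a hypothetical helper, and the sample reproduces only the fields shown on the slide):

```python
def top_docs_per_group(response, field):
    """Map each groupValue to (numFound, ids of the docs returned)."""
    groups = response["grouped"][field]["groups"]
    return {
        g["groupValue"]: (g["doclist"]["numFound"],
                          [d["id"] for d in g["doclist"]["docs"]])
        for g in groups
    }


sample = {"grouped": {"manu_exact": {"matches": 3, "groups": [
    {"groupValue": "Belkin",
     "doclist": {"numFound": 2, "start": 0, "docs": [
         {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"}]}},
    {"groupValue": "Apple Computer Inc.",
     "doclist": {"numFound": 1, "start": 0, "docs": [
         {"id": "MA147LL/A",
          "name": "Apple 60 GB iPod with Video Playback Black"}]}},
]}}}

summary = top_docs_per_group(sample, "manu_exact")
print(summary)
```

Note that the Belkin group reports numFound 2 but returns one document: numFound counts all matches in the group, while the number of docs returned per group is capped by group.limit.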









Group by Query

http://localhost:8983/solr/select?wt=json&indent=true&fl=id,name&q=ipod&group=true&group.query=price:%5B0%20TO%2099.99%5D&group.query=price:%5B100%20TO%20*%5D&group.limit=5

Unencoded: http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5

"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        {
          "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        {
          "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
        {
          "id":"MA147LL/A",
          "name":"Apple 60 GB iPod with Video Playback Black"}]
    }}}}







Grouping Params

parameter                         meaning                                                                            default
group.field=<field>               Like facet.field: group by unique field values
group.query=<query>               Like facet.query: top docs that also match
group.function=<function query>   Group by unique values produced by the function query
group.limit=<n>                   How many docs per group                                                            1
group.sort=<sort spec>            How to sort documents within a group                                               Same as sort
rows=<n>                          How many groups to return                                                          10
sort=<sort spec>                  How to sort the groups relative to each other (based on top doc)
group.format=<format>             grouped/simple: if simple, a single flat list is used and rows units are “docs”    grouped
group.main=true/false             If true, the first field grouping command is used as main result set               false

Pseudo-Join

Blogs:
  id: blog1; name: Solr ‘n Stuff; owner: Yonik Seeley; started: 2007-10-26
  id: blog2; name: lifehacker; owner: Gawker Media; started: 2005-1-31

Posts:
  id: post1; blog_id: blog1; author: Yonik Seeley; title: Solr relevancy function queries; body: Lucene’s default ranking […]
  id: post2; blog_id: blog1; author: Yonik Seeley; title: Solr result grouping; body: Result Grouping, also called […]
  id: post3; blog_id: blog2; author: Whitson Gordon; title: How to Install Netflix on Almost Any Android Device

Restrict to blogs mentioning netflix:
fq={!join from=blog_id to=id}body:netflix
• Finds all documents matching “netflix”
• Maps to different docs by following blog_id to id
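The join semantics can be imitated in a few lines over the sample documents. This is only an in-memory illustration of the two steps listed above, not how Solr executes joins; pseudo_join and mentions_netflix are hypothetical names. Because the slide shows only a title for post3, the predicate searches title as well as body, which is an assumption.

```python
def pseudo_join(docs, from_field, to_field, matches):
    """Step 1: collect from_field values of docs that match the query.
    Step 2: return docs whose to_field holds one of those values.
    Single-valued fields only, for simplicity."""
    join_values = {d[from_field] for d in docs if from_field in d and matches(d)}
    return [d for d in docs if d.get(to_field) in join_values]


def mentions_netflix(doc):
    # Assumption: title stands in for body where the slide omits it.
    text = doc.get("body", "") + " " + doc.get("title", "")
    return "netflix" in text.lower()


docs = [
    {"id": "blog1", "name": "Solr 'n Stuff", "owner": "Yonik Seeley"},
    {"id": "blog2", "name": "lifehacker", "owner": "Gawker Media"},
    {"id": "post1", "blog_id": "blog1", "title": "Solr relevancy function queries"},
    {"id": "post2", "blog_id": "blog1", "title": "Solr result grouping"},
    {"id": "post3", "blog_id": "blog2",
     "title": "How to Install Netflix on Almost Any Android Device"},
]

# Equivalent in spirit to fq={!join from=blog_id to=id}body:netflix
blogs = pseudo_join(docs, "blog_id", "id", mentions_netflix)
print([d["id"] for d in blogs])
```

Swapping the from and to fields reverses the direction, which is how the "posts from blogs started after 2010" style of query works.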

Pseudo-Join Examples

Only show posts from blogs started after 2010
• q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]

If any post in a blog mentions “obama”, then search all posts in that blog for “bomb” (self-join)
• q=bomb&fq={!join from=blog_id to=blog_id}obama

If any blog post mentions “obama”, then search all websites with the same blog owner for “bomb”
• q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama

Cross-Core Join

http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john

collection1:
  id: doc1; security: managers; title: doc for managers only; body: …
  id: doc1; security: managers, employees; title: doc for everyone; body: …

sec1:
  id: mary; security_groups: managers, employees
  id: john; security_groups: employees

Both cores (collection1 and sec1) live in a single Solr server.

Pseudo-Join vs Grouping

Pseudo-Join
• O(n_terms_in_join_fields)
• Single or multi-valued fields
• Filters only (no info currently passed from the “from” docs to the “to” docs).
• Chainable (one join can be the input to another)
• Affects which documents match a request, so naturally affects facet numbers (e.g. you can search posts and get numbers of blogs)

Result Grouping / Field Collapsing
• O(n_docs_in_result)
• Single-valued fields only
• Can order docs within a group and groups by top doc within that group using normal sort criteria.
• Not currently chainable: can only group one field deep
• Grouping does not currently affect the set of documents matching the query, so faceting is unaffected.

Auto-Suggest

Many people previously used terms component
• Can be slow for a large corpus
New auto-suggest builds off SpellCheck component
• TST implementation: compact memory based trie
• FST implementation: slower to build, but smaller & faster lookup
• Based on a field in the main index, or on a dictionary file

http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}

Index with JSON

$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
[
  {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
]'
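The same update can be prepared from a program instead of curl. The sketch below uses only the Python standard library and stops short of sending the request, so it runs without a Solr server; build_update_request is a hypothetical helper, and the URL is the one from the slide.

```python
import json
import urllib.request


def build_update_request(docs, url="http://localhost:8983/solr/update/json"):
    """Prepare (but do not send) a JSON update request for a list of docs."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],  # multi-valued field as a JSON array
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "inStock": True,
    "price": 12.50,
}
request = build_update_request([doc])
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request)
```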

Query Results in CSV

http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

name,price,cat,popularity

iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1

Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1

Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

Can handle multi-valued fields (see “cat” field in example)
Completely compatible with the CSV update handler (can round-trip)
Results are streamed: good for dumping entire parts of the index
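Reading this output back is straightforward with a standard CSV parser, since the multi-valued "cat" field arrives as one quoted cell. A minimal sketch using the rows shown above; splitting the cell on "," is an assumption that holds only when individual values contain no commas.

```python
import csv
import io

# The CSV response shown above, reproduced as a string.
csv_text = "\n".join([
    "name,price,cat,popularity",
    'iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1',
    'Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1',
    'Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10',
])

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # Expand the multi-valued field into a list (naive split, see note above).
    row["cat"] = row["cat"].split(",")
    row["price"] = float(row["price"])

print(rows[0]["name"], rows[0]["cat"], rows[0]["price"])
```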

http://localhost:8983/solr/browse
