Solr 1.5 and Beyond
Yonik Seeley
May 11, 2010
NYC Lucene/Solr Meetup
Lucid Imagination, Inc.
Agenda
Lucene/Solr merge
Relevancy (Extended Dismax Parser)
Scalability (Solr Cloud)
Spatial/Geo Search
Near Real Time
Field Collapsing
Q&A
Lucene-Solr Merge
Lucene/Solr voted to merge (March 2010)
• Were already separate sub-projects of the Lucene TLP
• High committer overlap
• Solr had stopped using Lucene trunk/development versions
• Much code duplication
What it means
• Single set of committers
• Single developer mailing list ([email protected])
• Single subversion trunk
• Keep separate downloads, user mailing lists
Lucene/Solr Development Changes
Nutch, Tika, Mahout spun off to their own TLP
• Still may be considered part of “Lucene Ecosystem”
Lucene/Solr development changes
• trunk is now always next major release (currently 4.0)
• branch_3x will be base for all 3.x releases
• No back compat guarantees between major releases
Relevance
Extended Dismax Parser
Superset of dismax
• &defType=edismax&q=foo&qf=body
Fixes edge cases where dismax could still throw exceptions
• OR AND NOT - "
Full lucene syntax support
• Tries lucene syntax first
• Smart escaping is done if syntax errors
• Optionally supports treating "and"/"or" as AND/OR in lucene syntax
Fielded queries (e.g. myfield:foo) even in degraded mode
• uf parameter controls what field names may be directly specified in "q"
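As a rough illustration of how these parameters combine into one request, the sketch below builds a query string with Python's standard library. The host, port, and /solr/select path are assumptions for a local test instance, and build_edismax_url is a hypothetical helper; only defType, q, qf, and uf come from the slide.

```python
from urllib.parse import urlencode

# Assumed base URL for a local Solr instance; adjust to your deployment.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_edismax_url(user_query, query_fields, allowed_fields=None):
    """Hypothetical helper: build a select URL that uses the edismax parser."""
    params = {
        "defType": "edismax",          # switch the query parser to extended dismax
        "q": user_query,               # raw user input, passed through unchanged
        "qf": " ".join(query_fields),  # fields searched for unfielded terms
    }
    if allowed_fields:
        # uf limits which field names may be specified directly in q
        params["uf"] = " ".join(allowed_fields)
    return SOLR_SELECT + "?" + urlencode(params)

url = build_edismax_url("myfield:foo OR bar", ["body"], allowed_fields=["myfield"])
print(url)
```

Because urlencode escapes the colon and spaces, the user's input can be sent as typed and the parser decides how to interpret it.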
Extended Dismax Parser (continued)
boost parameter for multiplicative boost-by-function
Pure negative query clauses
• Example: solr OR (-solr)
Enhanced term proximity boosting
• pf2=myfield: results in term bigrams in sloppy phrase queries
• myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
Enhanced stopword handling
• stopwords omitted in main query, but added in optional proximity boosting part
• Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
• Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
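The bigram expansion and the stopword handling can be mimicked in a few lines. This is only a sketch of the rewriting shown on the slide, not the parser's actual code: the stopword set and the function names are made up for illustration, and whitespace splitting stands in for real analysis.

```python
# Illustrative stopword set; a real list would come from the analyzer configuration.
STOPWORDS = {"is", "a", "the", "of"}

def main_clause(terms, field):
    """Mandatory clause: stopwords are dropped from the main query."""
    kept = [t for t in terms if t not in STOPWORDS]
    return "+{}:({})".format(field, " ".join(kept))

def bigram_phrases(terms, field):
    """Optional pf2-style clauses: one phrase per adjacent pair, stopwords kept."""
    return ['{}:"{} {}"'.format(field, a, b) for a, b in zip(terms, terms[1:])]

def expand(query, field):
    """Combine the mandatory clause with the optional proximity clauses."""
    terms = query.split()
    phrases = bigram_phrases(terms, field)
    if not phrases:  # a single-term query has no bigrams
        return main_clause(terms, field)
    return "{} ({})".format(main_clause(terms, field), " ".join(phrases))

print(expand("solr is awesome", "myfield"))
# +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
```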
Scalability
SolrCloud
First steps toward simplifying cluster management
Integrates ZooKeeper
• Central configuration (schema.xml, solrconfig.xml, etc)
• Tracks live nodes + shards of collections
Removes need for external load balancers
• shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
Can specify logical shard ids
• shards=NY_shard,NJ_shard
Clients don’t need to know shards:
• http://localhost:8983/solr/collection1/select?distrib=true
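One way to read the shards syntax above is that commas separate shards and "|" separates equivalent replicas of the same shard. The sketch below parses a value under that reading and picks one replica per shard at random; parse_shards and pick_replicas are hypothetical names, and this is not how Solr itself implements replica selection.

```python
import random

def parse_shards(shards_param):
    """Split a shards value into a list of shards, each a list of replica addresses."""
    return [
        [replica.strip() for replica in shard.split("|")]
        for shard in shards_param.split(",")
    ]

def pick_replicas(shards, rng=random):
    """Choose one replica per shard at random, a simple form of load balancing."""
    return [rng.choice(replicas) for replicas in shards]

shards = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr"
)
print(shards)                 # two shards with two replicas each
print(pick_replicas(shards))  # one address per shard; varies between runs
```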
SolrCloud : The Future
Eliminate all single points of failure
Remove Master/Searcher distinction
• Enables near real-time search in a highly scalable environment
High Availability for Writes
• Eventual consistency model (like Amazon Dynamo, Cassandra)
Elastic
• Simply add/subtract servers, cluster will rebalance automatically
• By default, Solr will handle document partitioning
Spatial Search
Spatial Search
PointType
• Generic improvement: polyField: single value -> multiple indexed fields
• Compound values: 38.89,-77.03
• Range queries and exact matches supported
  • q=location:21.33,51.37
  • q=location:[10,20 TO 30,40]
Distance Functions
• Generic improvement: function queries can yield multiple values
• Haversine: hsin(3963.205, store, vector(10,20))
  • Many possibilities, including boost by distance
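For readers unfamiliar with the haversine formula behind hsin, here is a standalone sketch of the great-circle calculation. It assumes points are (lat, lon) pairs in degrees, which may differ from the units Solr's function expects; the store names are made up and the coordinates simply reuse numbers from the slides.

```python
import math

EARTH_RADIUS_MILES = 3963.205  # radius used in the slide's hsin() example

def haversine(radius, point_a, point_b):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))

# Hypothetical stores, sorted by distance from the point (10, 20).
stores = {"store_a": (10.0, 20.0), "store_b": (38.89, -77.03), "store_c": (21.33, 51.37)}
origin = (10.0, 20.0)
by_distance = sorted(
    stores, key=lambda name: haversine(EARTH_RADIUS_MILES, stores[name], origin)
)
print(by_distance)
```

The same distance value could serve as a sort key, a filter threshold, or an input to a boost, which is the range of uses the slide alludes to.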
Spatial Search (continued)
Sorting by function query
• sort=hsin(3963.205,store,vector(10,20)) asc
Distance Filtering (SOLR-1568)
• fq={!sfilt fl=store_tiles}&pt=45.17614,-93.87341&d=1
• Implementations: trie range queries, spatial tiles, geohash
Return sort values or function query values for each doc
• FunctionQuery results as pseudo-fields (SOLR-1298)
• fl=field1,field2,{!func key=dist}hsin(…) ???
Near Real Time
Near Real-Time Search
Shorter times until updates are searchable/visible
Lucene 2.9 first laid the groundwork w/ per-segment searching
• Per-segment FieldCache entries for sorting and FunctionQueries
• NRT IndexWriter.getReader()
  • Make new segments available before merging is done in background
  • Doesn’t cause commit/fsync first
Solr still needs
• Per-segment faceting
• Per-segment caching
• Per-segment statistics (and anything else that uses FieldCache)
Existing single-valued faceting algorithm
[Figure: Lucene FieldCache Entry (StringIndex) for the “hero” field]
order (for each doc, an index into the lookup array): 5 3 5 1 4 5 2 1
lookup (the string values): (null), batman, flash, spiderman, superman, wolverine
Documents matching the base query “Juggernaut”: 0 2 7
For each matching document: lookup its order value, then increment that slot of the accumulator
accumulator: 0 1 0 0 0 2
q=Juggernaut&facet=true&facet.field=hero
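The counting loop in this diagram is small enough to write out. The sketch below uses the order and lookup values as read from the slide's figure; facet_counts is a hypothetical name, and plain Python lists stand in for the FieldCache arrays.

```python
# Values as read from the slide's diagram for the "hero" field.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]   # order[doc] is an index into lookup
base_docs = [0, 2, 7]              # documents matching q=Juggernaut

def facet_counts(order, lookup, docs):
    """Count how many of the given docs carry each term of a single-valued field."""
    accumulator = [0] * len(lookup)
    for doc in docs:
        accumulator[order[doc]] += 1  # one array lookup and one increment per doc
    return accumulator

accumulator = facet_counts(order, lookup, base_docs)
counts = {lookup[i]: n for i, n in enumerate(accumulator) if n and lookup[i] is not None}
print(accumulator)  # [0, 1, 0, 0, 0, 2]
print(counts)       # {'batman': 1, 'wolverine': 2}
```

The work per request is proportional to the size of the base DocSet, but the order and lookup arrays cover the whole index, so any index change forces them to be rebuilt.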
Per-segment single-valued faceting algorithm
[Figure: per-segment faceting. Each of four segments (Segment1 to Segment4) has its own FieldCache Entry and its own accumulator (accumulator1 to accumulator4), filled by its own thread (thread1 to thread4) through lookup and increment over the Base DocSet. A FieldCache + accumulator merger (Priority queue) combines the per-segment counts into the final facet counts, e.g. Batman, 3 and flash, 5.]
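A minimal sketch of the per-segment idea follows: each segment is counted independently, here on a thread pool, and the sorted per-segment results are then merged with a heap-based k-way merge. The segment data is made up, the function names are hypothetical, and heapq.merge stands in for the priority queue in the diagram rather than reproducing Solr's implementation.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

def count_segment(segment, docs):
    """Fill one segment's accumulator and return sorted (term, count) pairs."""
    lookup, order = segment["lookup"], segment["order"]
    accumulator = [0] * len(lookup)
    for doc in docs:
        accumulator[order[doc]] += 1
    # lookup is sorted, so pairs come out in term order; index 0 is the null entry
    return [(lookup[i], n) for i, n in enumerate(accumulator) if i > 0 and n]

def merge_counts(per_segment):
    """Merge sorted per-segment counts, summing the counts of equal terms."""
    merged = heapq.merge(*per_segment)  # heap-based k-way merge
    return {term: sum(n for _, n in group)
            for term, group in groupby(merged, key=lambda pair: pair[0])}

# Made-up segments; document ids are local to each segment.
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2, 0]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2]},
]
base_docs = [[0, 1, 2], [0, 1, 2]]  # matching documents per segment

with ThreadPoolExecutor(max_workers=4) as pool:
    per_segment = list(pool.map(count_segment, segments, base_docs))

print(merge_counts(per_segment))  # {'batman': 1, 'flash': 4, 'superman': 1}
```

The merge step is the extra cost relative to the single-array approach, since term indexes are only meaningful within their own segment and must be reconciled by term value.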
Per-segment faceting
Enable with facet.method=fcs
Controllable multi-threading
• facet.field={!threads=4}myfield
Disadvantages
• Larger memory use (FieldCaches + accumulators)
• Slower (extra FieldCache merge step needed)
Advantages
• Rebuilds FieldCache entries only for new segments (NRT friendly)
• Multi-threaded
Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single valued field

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

Time for request*        facet.method=fc   facet.method=fcs
static index             3 ms              244 ms
quickly changing index   1388 ms           267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

Time for request*        facet.method=fc   facet.method=fcs
static index             26 ms             34 ms
quickly changing index   741 ms            94 ms

*complete request time, measured externally
Field Collapsing
Field Collapsing
Field collapsing
• Limit the number of results per category
• “category” defined by unique values in a field
Uses
• Web Search: collapse by web site
• Email threads: collapse by thread id
• Ecommerce/retail
  • Show the top 5 items for each store category (music, movies, etc)
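The core of collapsing, keeping only the top few results per distinct field value, can be sketched over an already-ranked result list. This illustrates the concept only and says nothing about how Solr implements it; the collapse function, the documents, and the site field are made up for the example.

```python
from collections import defaultdict

def collapse(results, field, limit):
    """Keep at most `limit` results per distinct value of `field`, preserving rank order."""
    seen = defaultdict(int)
    collapsed = []
    for doc in results:
        key = doc.get(field)
        if seen[key] < limit:
            collapsed.append(doc)
            seen[key] += 1
    return collapsed

# Made-up results, already sorted by relevance.
results = [
    {"id": 1, "site": "a.example"},
    {"id": 2, "site": "a.example"},
    {"id": 3, "site": "b.example"},
    {"id": 4, "site": "a.example"},
    {"id": 5, "site": "b.example"},
]
print([d["id"] for d in collapse(results, "site", limit=1)])  # [1, 3]
```

With limit=5 and a store-category field, the same loop would produce the "top 5 items for each store category" view described in the retail example.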
Field Collapsing by Site
Field Collapse on Product Type
Q&A