Solr 1.5 and Beyond
Yonik Seeley
May 11, 2010
NYC Lucene/Solr Meetup



Lucid Imagination, Inc.

Agenda

Lucene/Solr merge

Relevancy (Extended Dismax Parser)

Scalability (Solr Cloud)

Spatial/Geo Search

Near Real Time

Field Collapsing

Q&A


Lucene-Solr Merge

Lucene/Solr voted to merge (March 2010)
• Were already separate sub-projects of the Lucene TLP
• High committer overlap
• Solr had stopped using Lucene trunk/development versions
• Much code duplication

What it means
• Single set of committers
• Single developer mailing list ([email protected])
• Single subversion trunk
• Keep separate downloads, user mailing lists


Lucene/Solr Development Changes

Nutch, Tika, Mahout spun off to their own TLP
• Still may be considered part of “Lucene Ecosystem”

Lucene/Solr development changes
• trunk is now always next major release (currently 4.0)
• branch_3x will be base for all 3.x releases
• No back compat guarantees between major releases


Relevance


Extended Dismax Parser

Superset of dismax
• &defType=edismax&q=foo&qf=body

Fixes edge cases where dismax could still throw exceptions
• OR AND NOT - “

Full lucene syntax support
• Tries lucene syntax first
• Smart escaping is done if there are syntax errors
• Optionally supports treating “and”/“or” as AND/OR in lucene syntax

Fielded queries (e.g. myfield:foo) even in degraded mode
• uf parameter controls what field names may be directly specified in “q”
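A minimal sketch of sending the request parameters from this slide, using only the Python standard library. The helper name edismax_params is hypothetical, and joining several qf fields with spaces is an assumption about how multiple fields are passed; only defType, q and qf come from the slide.

```python
from urllib.parse import urlencode, parse_qs


def edismax_params(user_query, query_fields):
    """Build request parameters that select the edismax parser.

    user_query is passed through untouched: the parser itself is what
    tries lucene syntax first and falls back to escaping.
    """
    return {
        "defType": "edismax",            # select the extended dismax parser
        "q": user_query,                 # raw user input
        "qf": " ".join(query_fields),    # assumption: space-separated fields
    }


# The example from the slide.
query_string = urlencode(edismax_params("foo", ["body"]))
print(query_string)  # defType=edismax&q=foo&qf=body

# Special characters in user input are URL-encoded, not stripped.
encoded = urlencode(edismax_params("solr OR (-solr)", ["title", "body"]))
print(parse_qs(encoded)["q"])  # ['solr OR (-solr)']
```

The point of the sketch is that the client does no escaping of query syntax itself; it only URL-encodes the raw input and leaves the syntax handling to the parser.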


Extended Dismax Parser (continued)

boost parameter for multiplicative boost-by-function

Pure negative query clauses
• Example: solr OR (-solr)

Enhanced term proximity boosting
• pf2=myfield: results in term bigrams in sloppy phrase queries
• myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"

Enhanced stopword handling
• stopwords omitted in main query, but added in optional proximity boosting part
• Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
• Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
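The pf2 and stopword examples above can be mimicked with a few lines of string handling. This is only an illustration of the rewrite shown on the slide, not the parser's actual code: the function names are hypothetical, tokenization is a plain whitespace split, and the stopword set is supplied by hand.

```python
def word_bigrams(text):
    """Return adjacent word pairs, as pf2 does for phrase boosting."""
    words = text.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def sketch_rewrite(q, field, stopwords):
    """Mimic the slide's rewrite: stopwords are dropped from the
    mandatory clause but kept in the optional bigram phrases."""
    main_terms = [w for w in q.split() if w.lower() not in stopwords]
    main = f"+{field}:({' '.join(main_terms)})"
    phrases = " ".join(f'{field}:"{bg}"' for bg in word_bigrams(q))
    return f"{main} ({phrases})"


print(word_bigrams("aa bb cc"))
# ['aa bb', 'bb cc']

print(sketch_rewrite("solr is awesome", "myfield", {"is"}))
# +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
```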


Scalability


SolrCloud

First steps toward simplifying cluster management

Integrates ZooKeeper
• Central configuration (schema.xml, solrconfig.xml, etc)
• Tracks live nodes + shards of collections

Removes need for external load balancers
• shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr

Can specify logical shard ids
• shards=NY_shard,NJ_shard

Clients don’t need to know shards:
• http://localhost:8983/solr/collection1/select?distrib=true
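To make the shards syntax above concrete: commas separate shards, and "|" separates equivalent replicas of the same shard. The sketch below parses such a value and picks one replica per shard. The function names are hypothetical, and the uniformly random choice is an assumption standing in for whatever balancing policy the server actually applies.

```python
import random


def parse_shards(shards_param):
    """Split a shards value into a list of shards,
    each shard being a list of interchangeable replica addresses."""
    return [
        [replica.strip() for replica in shard.split("|")]
        for shard in shards_param.split(",")
    ]


def pick_replicas(shards, rng=random):
    """Choose one replica per shard (assumed policy: uniform random)."""
    return [rng.choice(replicas) for replicas in shards]


shards = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,"
    "localhost:7574/solr|localhost:7500/solr"
)
print(shards)
# [['localhost:8983/solr', 'localhost:8900/solr'],
#  ['localhost:7574/solr', 'localhost:7500/solr']]

# One address per shard; a request fans out to exactly these.
print(pick_replicas(shards))
```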


SolrCloud : The Future

Eliminate all single points of failure

Remove Master/Searcher distinction
• Enables near real-time search in a highly scalable environment

High Availability for Writes
• Eventual consistency model (like Amazon Dynamo, Cassandra)

Elastic
• Simply add/subtract servers, cluster will rebalance automatically
• By default, Solr will handle document partitioning


Spatial Search


Spatial Search

PointType
• Generic improvement: polyField, single value -> multiple indexed fields
• Compound values: 38.89,-77.03
• Range queries and exact matches supported
  • q=location:21.33,51.37
  • q=location:[10,20 TO 30,40]

Distance Functions
• Generic improvement: function queries can yield multiple values
• Haversine: hsin(3963.205, store, vector(10,20))
• Many possibilities, including boost by distance
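For reference, the haversine formula behind hsin() is

d = 2r · asin( sqrt( sin²(Δφ/2) + cos φ1 · cos φ2 · sin²(Δλ/2) ) )

where φ is latitude, λ is longitude and r is the sphere radius (3963.205 on the slide, the Earth's radius in miles). A small Python sketch follows. It takes plain degrees for convenience, which is an assumption of this example and not a statement about the units Solr's hsin expects.

```python
import math

EARTH_RADIUS_MILES = 3963.205  # the radius passed to hsin() on the slide


def hsin(radius, lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


# Distance in miles between the slide's compound value 38.89,-77.03
# and the query point 10,20.
store = (38.89, -77.03)
print(hsin(EARTH_RADIUS_MILES, store[0], store[1], 10, 20))
```

The same value can serve as a sort key or feed a boost function; which one fits depends on the application.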


Spatial Search (continued)

Sorting by function query
• sort=hsin(3963.205,store,vector(10,20)) asc

Distance Filtering (SOLR-1568)
• fq={!sfilt fl=store_tiles}&pt=45.17614,-93.87341&d=1
• Implementations: trie range queries, spatial tiles, geohash

Return sort values or function query values for each doc
• FunctionQuery results as pseudo-fields (SOLR-1298)
• fl=field1,field2,{!func key=dist}hsin(…) ???


Near Real Time


Near Real-Time Search

Shorter times until updates are searchable/visible

Lucene 2.9 first laid the groundwork w/ per-segment searching
• Per-segment FieldCache entries for sorting and FunctionQueries
• NRT IndexWriter.getReader()
  • Make new segments available before merging is done in background
  • Doesn’t cause commit/fsync first

Solr still needs
• Per-segment faceting
• Per-segment caching
• Per-segment statistics (and anything else that uses FieldCache)


Existing single-valued faceting algorithm

[Figure: Lucene FieldCache Entry (StringIndex) for the “hero” field]

lookup (the string values): (null), batman, flash, spiderman, superman, wolverine

order (for each doc, an index into the lookup array): 5 3 5 1 4 5 2 1

Documents matching the base query “Juggernaut”: 0, 2, 7

accumulator (for each matching doc: lookup its order value, increment that slot): 0 1 0 0 0 2

q=Juggernaut&facet=true&facet.field=hero
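The counting loop in the figure fits in a few lines of Python. The arrays below are read off the slide's diagram, assuming each digit in the figure is one array element (order 5 3 5 1 4 5 2 1, matching docs 0, 2, 7); the function name is hypothetical.

```python
# FieldCache-style entry for the "hero" field, as read from the figure.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]   # per document: index into lookup
matching_docs = [0, 2, 7]          # docs matching the base query


def facet_counts(order, lookup, doc_ids):
    """One accumulator slot per unique term; one increment per matching doc."""
    accumulator = [0] * len(lookup)
    for doc in doc_ids:
        accumulator[order[doc]] += 1
    return accumulator


accumulator = facet_counts(order, lookup, matching_docs)
print(accumulator)  # [0, 1, 0, 0, 0, 2]

# Translate slots back to terms only at the end, skipping the null slot.
counts = {lookup[i]: n for i, n in enumerate(accumulator)
          if n and lookup[i] is not None}
print(counts)  # {'batman': 1, 'wolverine': 2}
```

The cost per request is one array read and one increment per matching document; the expensive part is building order and lookup for the whole index, which is what a quickly changing index forces again and again.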


Per-segment single-valued faceting algorithm

[Figure: the Base DocSet is processed per segment. Segment1 through Segment4 each have their own FieldCache Entry; thread1 through thread4 each do lookup + inc into accumulator1 through accumulator4. A FieldCache + accumulator merger (Priority queue, with entries such as “batman, 3” and “flash, 5”) combines the per-segment results.]
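A sketch of the per-segment variant in Python, under stated assumptions: the three segments and their contents are invented for illustration (chosen so the merged totals match the “batman, 3” and “flash, 5” entries in the figure), each segment's lookup array is sorted, and heapq.merge stands in for the priority-queue merger. Threads mirror the structure of the diagram only; they are not a claim about speedup in Python.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical segments: (lookup, order) per segment, lookup sorted by term.
segments = [
    ([None, "batman", "flash"], [1, 2, 2]),
    ([None, "batman", "superman"], [1, 1, 2, 0]),
    ([None, "flash"], [1, 1, 1]),
]
# Matching docs per segment, as segment-local doc ids (also hypothetical).
docsets = [[0, 1, 2], [0, 1, 3], [0, 1, 2]]


def segment_counts(segment, doc_ids):
    """Count into a per-segment accumulator, then emit (term, count)
    pairs in term order, skipping empty slots and the null slot."""
    lookup, order = segment
    acc = [0] * len(lookup)
    for doc in doc_ids:
        acc[order[doc]] += 1
    return [(lookup[i], n) for i, n in enumerate(acc)
            if n and lookup[i] is not None]


def merge_counts(per_segment):
    """Merge sorted per-segment streams; equal terms arrive adjacent."""
    merged = {}
    for term, n in heapq.merge(*per_segment):
        merged[term] = merged.get(term, 0) + n
    return merged


# One worker per segment, as in the figure.
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    per_segment = list(pool.map(segment_counts, segments, docsets))

totals = merge_counts(per_segment)
print(totals)  # {'batman': 3, 'flash': 5}
```

Ordinals are only meaningful inside their own segment, which is why the merge has to go through the term strings. That extra step is the merge cost the per-segment method pays on a static index.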


Per-segment faceting

Enable with facet.method=fcs

Controllable multi-threading
• facet.field={!threads=4}myfield

Disadvantages
• Larger memory use (FieldCaches + accumulators)
• Slower (extra FieldCache merge step needed)

Advantages
• Rebuilds FieldCache entries only for new segments (NRT friendly)
• Multi-threaded


Per-segment faceting performance comparison

Test index: 10M documents, 18 segments, single valued field

A. Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

Time for request*         facet.method=fc    facet.method=fcs
static index              3 ms               244 ms
quickly changing index    1388 ms            267 ms

B. Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

Time for request*         facet.method=fc    facet.method=fcs
static index              26 ms              34 ms
quickly changing index    741 ms             94 ms

*complete request time, measured externally


Field Collapsing


Field Collapsing

Field collapsing
• Limit the number of results per category
• “category” defined by unique values in a field

Uses
• Web Search: collapse by web site
• Email threads: collapse by thread id
• Ecommerce/retail
  • Show the top 5 items for each store category (music, movies, etc)
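The idea can be sketched in Python as a single pass over an already ranked result list. This illustrates the concept only and says nothing about how Solr implements it; the function name, the document dictionaries and the "site" field are all hypothetical.

```python
def collapse(ranked_docs, field, limit):
    """Keep at most `limit` docs per unique value of `field`,
    preserving rank order within and across groups."""
    groups = {}
    for doc in ranked_docs:
        bucket = groups.setdefault(doc[field], [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups


# Hypothetical web search results, best first.
results = [
    {"id": 1, "site": "a.com"},
    {"id": 2, "site": "b.com"},
    {"id": 3, "site": "a.com"},
    {"id": 4, "site": "a.com"},
    {"id": 5, "site": "b.com"},
]

collapsed = collapse(results, "site", limit=2)
for site, docs in collapsed.items():
    print(site, [d["id"] for d in docs])
# a.com [1, 3]
# b.com [2, 5]
```

Groups appear in the order of their best-ranked document, so the top result overall still leads the page.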


Field Collapsing by Site


Field Collapse on Product Type


Q&A