solr 3.1 and beyond

Upload: ervin-miller

Post on 10-Apr-2018

  • 8/8/2019 Solr 3.1 and Beyond

    Solr 3.1 and Beyond

    [email protected]

    October 8, 2010

    Lucid Imagination

    Yonik Seeley

    http://www.lucidimagination.com/events/revolution2010

    Agenda

    Goal: Introduce new features you can try and use now in Solr development versions 3.1 or 4.0

      Relevancy (Extended Dismax Parser)
      Spatial/Geo Search
      Search Result Grouping / Field Collapsing
      Faceting (Pivot, Range, Per-segment)
      Scalability (Solr Cloud)
      Odds & Ends
      Q&A

    10/12/10 3

    Solr 3.1? What happened to 1.5?

    Lucene/Solr merged (March 2010)
      Single set of committers
      Single dev mailing list ([email protected])
      Single shared subversion trunk
      Keep separate downloads, user mailing lists
      Other former Lucene subprojects spun off (Nutch, Tika, Mahout, etc.)

    Development
      trunk is now always the next major release (currently 4.0)
      branch_3x will be the base for all 3.x releases
      Branch together, release together, share version numbers

    RELEVANCE

    Extended Dismax Parser

    Superset of dismax
      &defType=edismax&q=foo&qf=body

    Fixes edge cases where dismax could still throw exceptions
      OR AND NOT -

    Full Lucene syntax support
      Tries Lucene syntax first
      Smart escaping is done if there are syntax errors

    Optionally supports treating "and"/"or" as AND/OR in Lucene syntax

    Fielded queries (e.g. myfield:foo) even in degraded mode
      uf parameter controls what field names may be directly specified in q
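As a rough illustration of the request parameters named on this slide, the sketch below assembles a select URL that asks for the edismax parser. The base URL and the helper name `edismax_url` are assumptions made for the example; only `defType`, `q`, `qf`, and `uf` come from the slide.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host, port, and core for your own setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def edismax_url(user_query, query_fields, allowed_fields=None):
    """Build a select URL that routes the query through the edismax parser."""
    params = [
        ("defType", "edismax"),
        ("q", user_query),
        ("qf", " ".join(query_fields)),
    ]
    if allowed_fields is not None:
        # uf limits which field names a user may reference directly in q.
        params.append(("uf", " ".join(allowed_fields)))
    return SOLR_SELECT + "?" + urlencode(params)


url = edismax_url("foo", ["body"])
print(url)
```

Passing the raw user input as `q` is the point of the parser: syntax errors are escaped by Solr rather than by the client.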

    Extended Dismax Parser (continued)

    boost parameter for multiplicative boost-by-function

    Pure negative query clauses
      Example: solr OR (-solr)

    Enhanced term proximity boosting
      pf2=myfield results in term bigrams in sloppy phrase queries
      myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"

    Enhanced stopword handling
      stopwords omitted in main query, but added in optional proximity boosting part
      Example: q=solr is awesome&qf=myfield&pf2=myfield ->
        +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
      Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
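The bigram expansion behind pf2 can be pictured with a few lines of Python. This is only an illustration of the idea on the slide, not Solr's implementation; the function name `pf2_clauses` is made up for the example.

```python
def pf2_clauses(field, terms):
    """Return one quoted two-term phrase clause per adjacent pair of terms."""
    return ['%s:"%s %s"' % (field, a, b) for a, b in zip(terms, terms[1:])]


clauses = pf2_clauses("myfield", ["aa", "bb", "cc"])
print(" ".join(clauses))
```

Note that the stopword "is" still takes part in the bigrams for `solr is awesome`, which matches the optional proximity part shown in the example above.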

    SPATIAL SEARCH

    Spatial Search

    Step 1: Index some locations!
      The Alpine Shop: 44.013617,-73.168264

    Step 2: Decide where you are
      &pt=44.0153371,-73.16734&d=1&sfield=store

    Step 3: Profit!
      Spatial Filter: &fq={!geofilt}
      Bounding Box: &fq={!bbox}
      Distance Function: &sort=geodist() asc
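The three steps can be combined into one request. The sketch below only builds the query string; the helper name `spatial_params`, the match-all `q=*:*`, and the reading of `d` as kilometers are assumptions for the example rather than something stated on the slide.

```python
from urllib.parse import urlencode


def spatial_params(lat, lon, distance, sfield, use_bbox=False):
    """Filter by distance from a point and sort nearest-first."""
    parser = "bbox" if use_bbox else "geofilt"
    return [
        ("q", "*:*"),                    # assumed: match everything, then filter
        ("fq", "{!%s}" % parser),        # spatial filter or bounding box
        ("sfield", sfield),              # field holding the indexed location
        ("pt", "%s,%s" % (lat, lon)),    # where you are
        ("d", str(distance)),            # radius (assumed to be kilometers)
        ("sort", "geodist() asc"),       # nearest results first
    ]


params = spatial_params(44.0153371, -73.16734, 1, "store")
query_string = urlencode(params)
print(query_string)
```

The bounding box variant is usually the cheaper filter, at the cost of including some points slightly outside the radius.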

    RESULT GROUPING / FIELD COLLAPSING

    Field Collapsing Definition

    Field collapsing
      Limit the number of results per "category"
      "category" normally defined by unique values in a field

    Uses
      Web search: collapse by web site
      Email threads: collapse by thread id
      Ecommerce/retail: show the top 5 items for each store category (music, movies, etc.)
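A minimal in-memory sketch of the definition above: walk an already ranked result list and keep at most `limit` documents per distinct field value. The document data and the function name `collapse` are invented for illustration; Solr does this inside the search, not as a post-processing step.

```python
from collections import OrderedDict


def collapse(ranked_docs, field, limit=1):
    """Keep at most `limit` docs per distinct value of `field`, in rank order."""
    groups = OrderedDict()
    for doc in ranked_docs:
        bucket = groups.setdefault(doc.get(field), [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups


# Hypothetical ranked results, best match first.
docs = [
    {"id": 1, "site": "a.example"},
    {"id": 2, "site": "a.example"},
    {"id": 3, "site": "b.example"},
    {"id": 4, "site": "a.example"},
]
top = collapse(docs, "site", limit=2)
for site, members in top.items():
    print(site, [d["id"] for d in members])
```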

    Field Collapsing by Site

    Field Collapse on Product Type

    Result Grouping by Category

    Group by Field

    http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact

    "grouped":{
      "manu_exact":{
        "matches":3,
        "groups":[{
            "groupValue":"Belkin",
            "doclist":{"numFound":2,"start":0,"docs":[
              {
                "id":"IW-02",
                "name":"iPod & iPod Mini USB 2.0 Cable"}]}},
          {
            "groupValue":"Apple Computer Inc.",
            "doclist":{"numFound":1,"start":0,"docs":[
              {
                "id":"MA147LL/A",
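To show how a client might walk a response of this shape, here is a small sketch. The sample is an abbreviated stand-in that contains only the values visible on the slide (the second document's remaining fields are cut off there, so they are omitted); `group_summary` is a hypothetical helper.

```python
import json

# Abbreviated sample shaped like the grouped response on the slide.
sample = json.loads("""
{"grouped": {"manu_exact": {"matches": 3, "groups": [
  {"groupValue": "Belkin",
   "doclist": {"numFound": 2, "start": 0, "docs": [
     {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"}]}},
  {"groupValue": "Apple Computer Inc.",
   "doclist": {"numFound": 1, "start": 0, "docs": [
     {"id": "MA147LL/A"}]}}]}}}
""")


def group_summary(response, field):
    """Map each group value to (numFound, ids of the returned top docs)."""
    groups = response["grouped"][field]["groups"]
    return {
        g["groupValue"]: (g["doclist"]["numFound"],
                          [d["id"] for d in g["doclist"]["docs"]])
        for g in groups
    }


summary = group_summary(sample, "manu_exact")
print(summary)
```

`numFound` counts every document in the group, while `docs` holds only the top documents returned per group.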

    Group by Query

    http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5

    "grouped":{
      "price:[0 TO 99.99]":{
        "matches":3,
        "doclist":{"numFound":2,"start":0,"docs":[
          {
            "id":"IW-02",
            "name":"iPod & iPod Mini USB 2.0 Cable"},
          {
            "id":"F8V7067-APL-KIT",
            "name":"Belkin Mobile Power Cord for iPod"}]
        }},
      "price:[100 TO *]":{
        "matches":3,
        "doclist":{"numFound":1,"start":0,"docs":[{

    Grouping Params

    parameter         meaning                                                            default
    group.field=      Like facet.field: group by unique field values
    group.query=      Like facet.query: top docs that also match
    group.function=   Group by unique values produced by the function query
    group.limit=      How many docs per group                                            1
    group.sort=       How to sort documents within a group                               Same as sort param
    rows=             How many groups to return                                          10
    sort=             How to sort the groups relative to each other (based on top doc)
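The parameters in the table can be combined in a single request. The sketch below assembles them as a list of pairs so that `group.query` can repeat; `grouping_params` is a hypothetical helper, and the defaults simply mirror the table.

```python
from urllib.parse import urlencode


def grouping_params(q, field=None, queries=(), limit=1, rows=10):
    """Build result-grouping parameters; group.query may be given several times."""
    params = [
        ("q", q),
        ("group", "true"),
        ("group.limit", str(limit)),   # docs per group
        ("rows", str(rows)),           # number of groups
    ]
    if field:
        params.append(("group.field", field))
    for group_query in queries:
        params.append(("group.query", group_query))
    return params


qs = urlencode(grouping_params(
    "ipod",
    queries=["price:[0 TO 99.99]", "price:[100 TO *]"],
    limit=5,
))
print(qs)
```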

    FACETING

    Pivot Faceting

    Other names that could have made sense: Grid Faceting, Cross-Product Faceting, Matrix Faceting

    Syntax: facet.pivot=field1,field2,field3,

    facet.pivot=cat,inStock

                         #docs   #docs w/ inStock:true   #docs w/ inStock:false
    cat:electronics      14      10                      4
    cat:memory           3       3                       0
    cat:connector        2       0                       2
    cat:graphics card    2       0                       2
    cat:hard drive       2       2                       0
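The table is a cross-product count: for each value of the first field, count the documents, then break that count down by the second field. A small in-memory sketch of that idea follows; the three documents and the name `pivot_counts` are made up, and Solr computes this from the index rather than from dicts.

```python
from collections import Counter


def pivot_counts(docs, outer, inner):
    """Count docs per outer value, and per (outer value, inner value) pair."""
    totals = Counter()
    nested = {}
    for doc in docs:
        for value in doc.get(outer, []):      # outer field may be multi-valued
            totals[value] += 1
            nested.setdefault(value, Counter())[doc.get(inner)] += 1
    return totals, nested


# Hypothetical documents.
docs = [
    {"cat": ["electronics", "memory"], "inStock": True},
    {"cat": ["electronics", "connector"], "inStock": False},
    {"cat": ["electronics"], "inStock": True},
]
totals, nested = pivot_counts(docs, "cat", "inStock")
for value, count in totals.most_common():
    print(value, count, dict(nested[value]))
```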

    Pivot Faceting

    http://...&facet=true&facet.pivot=cat,popularity

    "facet_counts":{
      "facet_pivot":{
        "cat,popularity":[{
            "field":"cat",
            "value":"electronics",
            "count":14,
            "pivot":[{
                "field":"popularity",
                "value":"6",
                "count":5},
              {
                "field":"popularity",
                "value":"7",
                "count":4},

    (continued)

              {
                "field":"popularity",
                "value":"1",
                "count":2}]},
          {
            "field":"cat",
            "value":"memory",
            "count":3,
            "pivot":[]},
          []

    Callouts: "count":14 is 14 docs w/ cat==electronics; "count":5 is 5 docs w/ cat==electronics && popularity==6

    Range Faceting

    Like date faceting, but more generic

    http://...&facet=true&facet.range=price&facet.range.start=0&facet.range.end=500&facet.range.gap=50

    "facet_counts":{
      "facet_ranges":{
        "price":{
          "counts":{
            "0.0":5,
            "50.0":2,
            "100.0":0,
            "150.0":2,
            "200.0":0,
            "250.0":1,
            "300.0":2,
            "350.0":2,
            "400.0":0,
            "450.0":1},
          "gap":50.0,
          "start":0.0,
          "end":500.0}}}}
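The bucketing in that response can be imitated in a few lines. This sketch assumes each bucket includes its lower bound and excludes its upper bound, and that the gap divides the range evenly; `range_facet` and the sample prices are illustrative only.

```python
def range_facet(values, start, end, gap):
    """Count values into fixed-width buckets keyed by each bucket's lower bound."""
    bucket_count = int((end - start) / gap)   # assumes gap divides the range evenly
    counts = {str(float(start + i * gap)): 0 for i in range(bucket_count)}
    for value in values:
        if start <= value < end:              # assumed: lower bound inclusive
            index = int((value - start) // gap)
            counts[str(float(start + index * gap))] += 1
    return counts


prices = [11.5, 19.95, 399.0]                 # sample prices
counts = range_facet(prices, start=0, end=500, gap=50)
print(counts)
```

Buckets with no matches are still reported with a count of 0, as in the response above.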

    Existing single-valued faceting algorithm

    [Figure: q=Juggernaut&facet=true&facet.field=hero. The Lucene FieldCache entry (StringIndex) for the hero field has two arrays. "lookup" holds the string values: (null), batman, flash, spiderman, superman, wolverine. "order" holds, for each doc, an index into the lookup array. For each of the documents matching the base query (Juggernaut), the doc's order value is looked up and the matching accumulator slot is incremented. A priority queue then collects the top counts (e.g. Batman, 3; flash, 5).]
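The lookup/increment loop in that diagram can be sketched as follows. The `lookup` values come from the slide; the `order` array, the matching doc ids, and the function name are invented for the example, and a heap stands in for the priority queue.

```python
import heapq

# String values from the slide; slot 0 stands for "no value".
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
# Hypothetical per-document ordinals: order[doc_id] is an index into lookup.
order = [0, 2, 1, 2, 5, 2, 1, 0]


def facet_single_valued(matching_docs, order, lookup, limit=2):
    """Count field values for the matching docs and return the top `limit`."""
    accumulator = [0] * len(lookup)
    for doc_id in matching_docs:
        accumulator[order[doc_id]] += 1       # lookup, then increment
    candidates = [
        (count, lookup[i])
        for i, count in enumerate(accumulator)
        if i > 0 and count > 0                # skip the null slot and empty counts
    ]
    return [(term, count) for count, term in heapq.nlargest(limit, candidates)]


top = facet_single_valued([1, 2, 3, 5, 6], order, lookup)
print(top)
```

The counting pass touches only integers; strings are resolved once at the end, for the winners only.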

    Per-segment single-valued algorithm

    [Figure: Segment1 to Segment4 each have their own FieldCache entry and their own accumulator (accumulator1 to accumulator4). Docs from the base DocSet go through lookup and inc against the accumulator of their segment, with one thread per segment (thread1 to thread4). A FieldCache + accumulator merger (priority queue) combines the per-segment counts into the final result (e.g. Batman, 3; flash, 5).]

    Per-segment faceting

    Enable with facet.method=fcs

    Controllable multi-threading
      facet.field={!threads=4}myfield

    Disadvantages
      Larger memory use (FieldCaches + accumulators)
      Slower (extra FieldCache merge step needed)

    Advantages
      Rebuilds FieldCache entries only for new segments (NRT friendly)
      Multi-threaded
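The trade-off listed above (an extra merge step in exchange for per-segment caches and threading) can be seen in a toy version. Everything here is illustrative: the segment tuples, the data, and the function names are assumptions, and a `Counter` stands in for the merger.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment):
    """Count values in one segment using that segment's own order/lookup arrays."""
    order, lookup, matching_docs = segment
    accumulator = [0] * len(lookup)
    for doc_id in matching_docs:
        accumulator[order[doc_id]] += 1
    # Ordinals differ between segments, so report counts keyed by the string value.
    return {lookup[i]: c for i, c in enumerate(accumulator)
            if c and lookup[i] is not None}


def facet_per_segment(segments, threads=4):
    """Count each segment on its own thread, then merge the partial counts."""
    merged = Counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for partial in pool.map(count_segment, segments):
            merged.update(partial)            # the extra merge step
    return merged


# Hypothetical segments: (order, lookup, matching doc ids within the segment).
segments = [
    ([1, 2, 1], [None, "batman", "flash"], [0, 1, 2]),
    ([1, 0, 1], [None, "flash"], [0, 2]),
]
counts = facet_per_segment(segments)
print(counts.most_common())
```

Adding a new segment only requires a new (order, lookup) pair for that segment, which is why this layout suits near real-time indexing.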

    Faceting Performance Improvements

    For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement

    Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster

    Optimized deep facet paging: up to 10x faster with really large facet.offsets

    Less memory consumed by field cache entries

    SCALABILITY

    SolrCloud

    First steps toward simplifying cluster management

    Integrates ZooKeeper
      Central configuration (schema.xml, solrconfig.xml, etc.)
      Tracks live nodes + shards of collections

    Removes need for external load balancers
      shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr

    Can specify logical shard ids
      shards=NY_shard,NJ_shard

    Clients don't need to know shards at all:
      http://localhost:8983/solr/collection1/select?distrib=true
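The shards value above reads as a comma-separated list of shards, where each shard is a pipe-separated list of equivalent servers. A short sketch of that reading follows; `parse_shards` is a hypothetical helper, not part of any Solr client.

```python
def parse_shards(shards_param):
    """Split a shards value into shards, each a list of alternative servers."""
    return [shard.split("|") for shard in shards_param.split(",")]


shards = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,"
    "localhost:7574/solr|localhost:7500/solr"
)
for number, alternatives in enumerate(shards, start=1):
    print("shard", number, "can be served by", alternatives)
```

Because every shard lists its own alternatives, the node handling the request can pick one per shard, which is what makes an external load balancer unnecessary.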

    SolrCloud : The Future

    Eliminate all single points of failure

    Remove Master/Searcher distinction
      Enables near real-time search in a highly scalable environment

    High Availability for Writes
      Eventual consistency model (like Amazon Dynamo, Cassandra)

    Elastic
      Simply add/subtract servers, cluster will rebalance automatically
      By default, Solr will handle document partitioning

    ODDS & ENDS

    Auto-Suggest

    Many people currently use terms component
      Can be slow for a large corpus

    New auto-suggest builds off SpellCheck component
      Compact memory-based trie for really fast completions
      Based on a field in the main index, or on a dictionary file
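To make the trie idea concrete, here is a plain (uncompacted) trie with prefix completion. It is a teaching sketch under simple assumptions, not the data structure Solr ships; the class name and the terms other than "ultrasharp" are made up.

```python
class Trie:
    """A plain character trie; each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`, in sorted order."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            # Push in reverse so the smallest character is visited first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


trie = Trie()
for term in ["ultrasharp", "ultra", "usb"]:
    trie.insert(term)
print(trie.complete("ult"))
```

Lookup cost depends on the prefix length and the number of completions returned, not on the size of the corpus, which is the advantage over scanning terms.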

    http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

    "spellcheck":{
      "suggestions":[
        "ult",{
          "numFound":1,
          "startOffset":0,
          "endOffset":3,
          "suggestion":["ultrasharp"]},
        "collation","ultrasharp"]}}

    Index with JSON

    $ URL=http://localhost:8983/solr/update/json
    $ curl $URL -H 'Content-type:application/json' -d '
    {"add": {"doc": {
      "id": "978-0641723445",
      "cat": ["book","hardcover"],
      "title": "The Lightning Thief",
      "author": "Rick Riordan",
      "series_t": "Percy Jackson and the Olympians",
      "sequence_i": 1,
      "genre_s": "fantasy",
      "inStock": true,
      "price": 12.50,
      "pages_i": 384}}}'
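The same update can be prepared from Python's standard library. The sketch below only builds the request so it can be inspected without a running server; `build_add_request` is a hypothetical helper, and the document is a shortened copy of the one in the curl example.

```python
import json
from urllib.request import Request

UPDATE_URL = "http://localhost:8983/solr/update/json"


def build_add_request(doc, url=UPDATE_URL):
    """Wrap one document in an "add" command and prepare a JSON POST request."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return Request(url, data=body, headers={"Content-type": "application/json"})


request = build_add_request({
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "inStock": True,
    "price": 12.50,
})
print(request.get_method(), request.full_url)
# To send it against a running server: urllib.request.urlopen(request)
```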

    Query Results in CSV

    http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

    name,price,cat,popularity

    iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1

    Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
    Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

    Can handle multi-valued fields (see cat field in example)
    Completely compatible with the CSV update handler (can round-trip)
    Results are streamed: good for dumping entire parts of the index
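On the client side, output like this can be read with Python's csv module. The sketch assumes, as the sample rows suggest, that a multi-valued field arrives as one quoted cell with comma-separated values; `parse_rows` is a hypothetical helper.

```python
import csv
import io

CSV_TEXT = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''


def parse_rows(text, multi_valued=("cat",)):
    """Parse CSV results and split the named multi-valued fields into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            row[field] = row[field].split(",")
        rows.append(row)
    return rows


rows = parse_rows(CSV_TEXT)
for row in rows:
    print(row["name"], float(row["price"]), row["cat"])
```

For a large dump the response could be read line by line instead of from a string, which fits the streaming behaviour noted above.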

    http://localhost:8983/solr/browse

    Q&A

    For more information about Solr visit

    www.lucidimagination.com

    http://www.lucidimagination.com/events/revolution2010