The Future of Search in Plone
Sally Kleinfeldt and friends
Plone Conference, San Francisco
November 3, 2011
Tuesday, November 29, 2011

Upload: sally-kleinfeldt

Post on 29-Nov-2014


DESCRIPTION

From the beginning, the Zope Catalog has provided Plone with out-of-the-box content search, an important feature not found in all open source content management systems. However, search engine technology has been racing ahead, and user expectations of what search should do have been changing. At the same time, search engines have gone from premium enterprise products to cheap commodities. The most important search engine worth considering these days is also open source: Lucene/Solr. Several add-on products exist that integrate Solr with Plone, and interest in this technology is growing. In this talk, Sally Kleinfeldt provides an information retrieval tutorial and discusses two questions: What does Solr bring to Plone? Should Solr become part of Plone core? These slides include conclusions from the conference discussion. A link to audio of the presentation is here: http://2011ploneconference.sched.org/event/095b67970b402c319721239711033d65

TRANSCRIPT

Page 1: The Future of Search in Plone

The Future of Search in Plone

Sally Kleinfeldt and friends

Plone Conference, San Francisco

November 3, 2011

Tuesday, November 29, 2011

Page 2: The Future of Search in Plone

Motivation

• Raise awareness

• Promote discussion

• Forge consensus


Page 3: The Future of Search in Plone

Agenda

• Introduction to IR concepts

• Description of Solr and ZCatalog

• Discussion


Page 4: The Future of Search in Plone

IR 101


Page 5: The Future of Search in Plone

IR 101

• Transformations

• Terms

• Models

• Measures


Page 6: The Future of Search in Plone

IR 101: Transformations

• Turn binary, HTML, or other document formats into fields and strings

• Parse the strings into a set of terms

• Build indexes of the terms specific to the IR model used

• Queries are parsed into query operators and strings, which are parsed into terms


Page 7: The Future of Search in Plone

IR 101: String => Terms

• Tokenization - locate word boundaries

• Normalization - remove capitals and diacritics

• Stopping - remove stop words (a, of, on, the...)

• Stemming - reduce to word stems (walks, walking => walk)

• Recognizers - concepts, parts of speech, names, locations...

• Must be identical for documents and queries
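The pipeline on this slide can be sketched in a few lines of Python. This is only an illustration: the stop list is the four words from the slide, the `stem` function is a toy suffix stripper rather than a real stemmer such as Porter's, and all function names are made up for this example. Recognizers are left out.

```python
import re
import unicodedata

# Tiny illustrative stop list taken from the slide; real lists are longer.
STOP_WORDS = {"a", "of", "on", "the"}

def tokenize(text):
    """Tokenization: locate word boundaries by grabbing runs of word characters."""
    return re.findall(r"\w+", text)

def normalize(token):
    """Normalization: lowercase, then strip diacritics via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def stem(token):
    """Stemming (toy version): strip a trailing 'ing' or 's' if a stem of 3+ letters remains."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def to_terms(text):
    """Run the same pipeline for documents and for queries."""
    tokens = (normalize(t) for t in tokenize(text))
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

Because documents and queries both go through `to_terms`, a query for "Walking" and a document containing "walks" end up with the same term.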


Page 8: The Future of Search in Plone

IR 101: Terms

• Application specific

• Words or phrases

• IR models assign weights to terms in documents


Page 9: The Future of Search in Plone

IR 101: Term Weighting

• Simplest: Yes/No Boolean value

• Better: Term Frequency - # occurrences

• More meaningful: tf-idf

  • Term Frequency × Inverse Document Frequency

  • Document frequency: how many documents contain the term?

  • Raises the weight of rare terms and lowers the weight of common ones
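As a rough sketch of the weighting, the Python below computes tf-idf for a small collection. It assumes one common variant, raw term count times log(N / document frequency); the slide does not say which variant is meant, and the function name is made up for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency: occurrences within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights
```

With this variant, a term that appears in every document gets weight zero, while a term found in only one document keeps its full frequency scaled by log(N).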


Page 10: The Future of Search in Plone

IR 101: Boolean Model

• First and most adopted

• Based on Boolean logic + set theory

• Does a document contain query terms - Y/N

• Intuitive, easy to implement

• Drawbacks: no ranking, requires a special query language, returns too many or too few results

• Typical for library systems
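A minimal sketch of the Boolean model, assuming an inverted index that maps each term to the set of documents containing it. The function names are invented for illustration. Each operator is plain set arithmetic, which is why the model is easy to implement and why it produces no ranking.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: list of terms}. Returns {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing every term (set intersection)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, *terms):
    """Documents containing at least one term (set union)."""
    return set().union(*(index.get(t, set()) for t in terms))

def boolean_not(index, all_ids, term):
    """Documents that do not contain the term (set difference)."""
    return set(all_ids) - index.get(term, set())
```

The result is an unordered set: a document either matches or it does not.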


Page 11: The Future of Search in Plone

IR 101: Vector Space Models

• Represent documents and queries as vectors of terms

• Term values are weighted - by count or tf-idf

• Use vector operations to compare documents with queries

• Relevance score based on cosine of angle between doc/query vectors
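The cosine score can be sketched as follows, assuming sparse vectors stored as `{term: weight}` dicts and raw term counts as weights (tf-idf weights would plug in the same way). The names `cosine` and `rank` are made up for this example.

```python
import math
from collections import Counter

def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)

def rank(query_terms, docs):
    """docs: {doc_id: list of terms}. Returns (score, doc_id) pairs, best first."""
    query_vec = Counter(query_terms)
    scored = [(cosine(query_vec, Counter(terms)), doc_id)
              for doc_id, terms in docs.items()]
    return sorted(scored, reverse=True)
```

Unlike the Boolean model, every document gets a score between 0 and 1, so results can be ordered by relevance.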


Page 12: The Future of Search in Plone

IR 101: Probabilistic Models

• Compute probability that a document is relevant to a query

• Relevance ranking functions range from simple to complex

• Sophisticated ranking functions include

  • Okapi BM25 (uses tf and idf)

  • Machine learning formulas (use training data)
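For reference, here is a sketch of the textbook Okapi BM25 formula. The parameter values `k1=1.2` and `b=0.75` are commonly used defaults, and the idf form with `+ 1` inside the logarithm is one widely used variant that keeps scores non-negative. None of this is taken from the slides or from any particular implementation, and the function name is invented. The sketch assumes a non-empty collection.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """docs: non-empty {doc_id: list of terms}. Returns {doc_id: BM25 score}."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    # Document frequency of each term across the collection.
    df = Counter(t for terms in docs.values() for t in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - b + b * len(terms) / avg_len
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term frequency saturates: extra occurrences add less and less.
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        scores[doc_id] = score
    return scores
```

The two ingredients named on the slide are both visible: `tf[t]` rewards repeated occurrences within a document, and `idf` rewards terms that are rare across the collection.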


Page 13: The Future of Search in Plone

IR 101: Extending the Models

• Many many refinements possible

• Term interdependencies

• Fuzzy sets

• Semantic analysis, link analysis

• Combining models (Extended Boolean)

• The best search engines represent thousands of engineering hours


Page 14: The Future of Search in Plone

IR 101: Measures

• Search engine results are measured against:

  • Precision - Percent of returned results that are relevant

  • Recall - Percent of relevant documents that are returned

  • F-Score - Harmonic mean of precision and recall
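The three measures are simple to compute once you know which documents are relevant. A small sketch, with a function name made up for this example and the balanced F-score (F1) assumed:

```python
def precision_recall_f1(returned, relevant):
    """returned: doc ids the engine gave back; relevant: doc ids judged relevant."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: low if either precision or recall is low.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if the engine returns documents 1 to 4 and the relevant set is {2, 4, 5}, two of four results are relevant (precision 0.5) and two of three relevant documents were found (recall 2/3).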


Page 15: The Future of Search in Plone

ZCatalog and Solr


Page 16: The Future of Search in Plone

ZCatalog

• Zope/Plone search engine

• Full text and field searching

• Probabilistic model using Okapi BM25

• The out-of-the-box ZCTextIndex is very simple

• TextIndexNG adds multilingual, better parsing components, binary transforms, synonyms


Page 17: The Future of Search in Plone

Solr

• Popular open source enterprise search platform

• Eliminating smaller commercial search companies

• Written in Java, based on the Lucene search library; uses a sophisticated, extended vector space model

• RESTful APIs

• Large, active community

• Powers Twitter, Wikipedia, Netflix...


Page 18: The Future of Search in Plone

What Does Solr Have That ZCatalog Doesn’t?

• Better relevance ranking

• More search features: snippets, hit highlighting, spelling suggestions, synonyms, more like this, faceted search

• More configurable: stop words, field boosting, parsing components

• An army of engineers working on it
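To give a feel for how some of these features are requested, the sketch below builds a query URL for Solr's HTTP `select` handler using the standard `q`, `rows`, `wt`, `facet`, `facet.field`, `hl` and `hl.fl` parameters. The base URL and the field names `portal_type` and `SearchableText` are assumptions for illustration: they depend entirely on how a given Solr instance and its schema are set up. The helper name is invented.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, text, facet_fields=(), highlight_field=None, rows=10):
    """Build a Solr select URL asking for JSON results, facets and highlighting."""
    params = [("q", text), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        # Faceted search: ask Solr to count matches per value of each field.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    if highlight_field:
        # Hit highlighting: ask for snippets from this field with matches marked.
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# Hypothetical local instance and Plone-style field names.
url = solr_select_url(
    "http://localhost:8983/solr",
    "plone search",
    facet_fields=["portal_type"],
    highlight_field="SearchableText",
)
```

The function only constructs the URL; fetching it requires a running Solr server whose schema actually defines those fields.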


Page 19: The Future of Search in Plone

Plone + Solr Today

• Two add-ons available

  • collective.solr - intercepts catalog queries and dispatches them to Solr

  • alm.solrindex - adds a new index type, SolrIndex, to the catalog

• Plus a buildout recipe: collective.recipe.solrinstance


Page 20: The Future of Search in Plone

Conclusions from Conference Discussion


Page 21: The Future of Search in Plone

Why Does Plone Need Solr?

• Certain types of projects need it, for features or because ZCatalog can’t scale to very large sites

• We need it to keep up with the enterprise CMS pack


Page 22: The Future of Search in Plone

Points of Agreement

• It will be impossible to completely replace ZCatalog with Solr

• Solr indexing will never be transactional

• Removing ZCatalog from Zope would be very difficult

• Tackle small, focused ZCatalog improvements when possible, such as improving the indexing interface


Page 23: The Future of Search in Plone

Points of Agreement

• Navigation and search should be handled separately

• Navigation needs to be transactional; search does not

• Split out a catalog used for navigation from the general catalog

• Explore a non-catalog utility to support navigation, optimize for speed


Page 24: The Future of Search in Plone

Points of Agreement

• Treating Solr integration simply as a ZCatalog replacement does not take best advantage of Solr features

• ZCatalog can’t represent the richness of Solr; focus on the Solr API

• Take advantage of spelling suggestions, facets, results snippets with hit highlighting, synonyms, more like this, etc.

• Expose Solr configuration choices (indexing, field weighting, etc.) in the control panel


Page 25: The Future of Search in Plone

Points of Agreement

• Neither of the current Solr add-ons provides the best foundation for the future

• But they’ve taught us how to do things better

• Non-Solr approaches to improved Plone search should be deprecated

• Andreas Jung is not planning improvements to TextIndexNG!


Page 26: The Future of Search in Plone

Points of Agreement

• Stop investing in ZCatalog as a search engine; Solr is the future


Page 27: The Future of Search in Plone

Plone + Solr Roadmap

• Short term: Make Solr integration easy with an approved add-on (like LDAP)

• Build on what we’ve learned and create a better add-on to replace collective.solr and alm.solrindex

• Who wants to sponsor a sprint?


Page 28: The Future of Search in Plone

Plone + Solr Roadmap

• Long term: Ship Solr integration with Plone, but don’t require Solr

• Solr has a lot of overhead and is not always needed

• But using it should be as easy as answering yes to a “Build with Solr?” installation option
