The Future of Search in Plone
Sally Kleinfeldt and friends
Plone Conference, San Francisco
November 3, 2011
Tuesday, November 29, 2011
Motivation
• Raise awareness
• Promote discussion
• Forge consensus
Agenda
• Introduction to IR concepts
• Description of Solr and ZCatalog
• Discussion
IR 101
• Transformations
• Terms
• Models
• Measures
IR 101: Transformations
• Turn binary, HTML, or other document formats into fields and strings
• Parse the strings into a set of terms
• Build indexes of the terms specific to the IR model used
• Queries are parsed into query operators and strings, which are parsed into terms
IR 101: String => Terms
• Tokenization - locate word boundaries
• Normalization - remove capitals and diacritics
• Stopping - remove stop words (a, of, on, the...)
• Stemming - reduce to word stems (walks, walking => walk)
• Recognizers - concepts, parts of speech, names, locations...
• Must be identical for documents and queries
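The string-to-terms steps above can be sketched in a few lines of Python. This is a toy illustration, not what ZCatalog or Solr actually do: the stop word list is just the four words named on the slide, and the `stem` helper is a hypothetical suffix stripper standing in for a real stemmer. The point it shows is that documents and queries go through the same `to_terms` function.

```python
import re
import unicodedata

# Assumption: only the stop words named on the slide.
STOP_WORDS = {"a", "of", "on", "the"}


def tokenize(text):
    """Tokenization: locate word boundaries."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Normalization: lower-case and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(token):
    """Stemming: toy suffix stripping (hypothetical, not a real stemmer)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def to_terms(text):
    """Apply the same pipeline to document text and to query text."""
    terms = []
    for token in tokenize(text):
        token = normalize(token)
        if token in STOP_WORDS:  # Stopping: drop stop words
            continue
        terms.append(stem(token))
    return terms
```

With this sketch, the query "walking" and the document text "Walks" both reduce to the single term "walk", which is why they can match.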
IR 101: Terms
• Application specific
• Words or phrases
• IR models assign weights to terms in documents
IR 101: Term Weighting
• Simplest: Yes/No Boolean value
• Better: term frequency (tf) - the number of occurrences of the term in a document
• More meaningful: tf-idf
  • Term frequency * inverse document frequency
  • Document frequency: how many documents contain the term?
  • Raises the weight of rare terms and lowers the weight of common ones
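A minimal sketch of tf-idf weighting, assuming the common textbook form weight = tf * log(N / df), where N is the number of documents and df is the number of documents containing the term. Real engines use smoothed variants; the function name `tf_idf` and the input shape (a list of term lists) are illustrative choices.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of term lists. Assumes weight = tf * log(N / df).
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in docs:
        df.update(set(terms))

    weights = []
    for terms in docs:
        tf = Counter(terms)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

A term that occurs in every document gets log(N / N) = 0, so it contributes nothing to ranking, while a term found in only one document gets the largest idf.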
IR 101: Boolean Model
• First and most adopted
• Based on Boolean logic + set theory
• Does a document contain query terms - Y/N
• Intuitive, easy to implement
• Drawbacks: no ranking, requires a special query language, often returns too many or too few results
• Typical for library systems
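The Boolean model maps directly onto set operations over an inverted index. The sketch below is an illustration under simple assumptions (documents are already reduced to term lists, and the helper names are made up for this example); it also makes the "no ranking" drawback visible, since every result is just a set of matching document ids.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search_and(index, terms):
    """Documents containing ALL of the terms (set intersection)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, terms):
    """Documents containing ANY of the terms (set union)."""
    return set().union(*(index.get(t, set()) for t in terms))


def search_not(index, all_ids, term):
    """Documents that do NOT contain the term (set difference)."""
    return set(all_ids) - index.get(term, set())
```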
IR 101: Vector Space Models
• Represent documents and queries as vectors of terms
• Term values are weighted - by count or tf-idf
• Use vector operations to compare documents with queries
• Relevance score based on cosine of angle between doc/query vectors
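As a rough sketch of the vector space idea, documents and queries can be held as sparse {term: weight} dicts and compared by the cosine of the angle between them. The weights could be raw counts or tf-idf values; the function names here are illustrative.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return document ids ordered from most to least similar to the query."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Because the cosine depends only on direction, a long document is not favored merely for repeating a term more often than a short one.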
IR 101: Probabilistic Models
• Compute probability that a document is relevant to a query
• Relevance ranking functions range from simple to complex
• Sophisticated ranking functions include
  • Okapi BM25 (uses tf and idf)
  • Machine learning formulas (use training data)
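To make "uses tf and idf" concrete, here is a sketch of one common formulation of Okapi BM25. It assumes the non-negative idf variant log(1 + (N - df + 0.5) / (df + 0.5)) and the frequently used parameter values k1 = 1.2 and b = 0.75; it is not taken from the ZCTextIndex or Lucene source, and details differ between implementations.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of terms) against the query terms.

    Assumed formulation, per query term t:
        idf(t) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    """
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # term absent from this document
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
        scores.append(score)
    return scores
```

Compared with plain tf-idf, the tf contribution saturates as the term repeats, and b controls how strongly long documents are penalized.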
IR 101: Extending the Models
• Many refinements are possible
• Term interdependencies
• Fuzzy sets
• Semantic analysis, link analysis
• Combining models (Extended Boolean)
• The best search engines represent thousands of engineering hours
IR 101: Measures
• Search engine results are measured by:
  • Precision - percent of returned results that are relevant
  • Recall - percent of relevant documents that are returned
  • F-score - harmonic mean of precision and recall
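The three measures can be computed from two sets: the documents the engine returned and the documents judged relevant. This is a small sketch (the function name is illustrative); it returns fractions between 0 and 1 rather than percentages.

```python
def precision_recall_f1(returned, relevant):
    """Return (precision, recall, F-score) for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned

    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```

For example, if 4 documents are returned, 3 documents are relevant, and 2 of the returned ones are relevant, precision is 2/4, recall is 2/3, and the F-score is 4/7.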
ZCatalog and Solr
ZCatalog
• Zope/Plone search engine
• Full text and field searching
• Probabilistic model using Okapi BM25
• Out of the box, ZCTextIndex is very simple
• TextIndexNG adds multilingual, better parsing components, binary transforms, synonyms
Solr
• Popular open source enterprise search platform
• Displacing smaller commercial search companies
• Written in Java, based on the Lucene search library; sophisticated, extended vector space model
• RESTful APIs
• Large, active community
• Powers Twitter, Wikipedia, Netflix...
What does Solr have that ZCatalog Doesn’t?
• Better relevance ranking
• More search features: snippets, hit highlighting, spelling suggestions, synonyms, more like this, faceted search
• More configurable: stop words, field boosting, parsing components
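Several of these features are switched on per request through query parameters on Solr's HTTP search handler. The sketch below only builds such a URL and sends nothing. The base URL and the field names `portal_type` and `SearchableText` are assumptions chosen for illustration (they mirror Plone index names, but a Solr schema may name fields differently), and spelling suggestions only work if the spellcheck component is configured on the Solr side.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, text, facet_field, highlight_field, rows=10):
    """Build a Solr select URL asking for highlighting, facets, and spellcheck.

    base_url and the field names are supplied by the caller; the values used
    in the example below are hypothetical.
    """
    params = [
        ("q", text),
        ("wt", "json"),                # response format
        ("rows", rows),                # number of results to return
        ("hl", "true"),                # hit highlighting / snippets
        ("hl.fl", highlight_field),    # field to build snippets from
        ("facet", "true"),             # faceted search
        ("facet.field", facet_field),  # field to compute facet counts on
        ("spellcheck", "true"),        # spelling suggestions, if configured
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr instance and field names.
url = build_solr_query_url(
    "http://localhost:8983/solr/select", "plone search",
    facet_field="portal_type", highlight_field="SearchableText",
)
```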
• An army of engineers working on it
Plone + Solr: Today
• Two add-ons available
  • collective.solr - intercepts catalog queries and dispatches them to Solr
  • alm.solrindex - adds a new index type to the catalog, SolrIndex
• Plus a buildout recipe: collective.recipe.solrinstance
Conclusions from Conference Discussion
Why Does Plone Need Solr?
• Certain types of projects need it, for features or because ZCatalog can’t scale to very large sites
• We need it to keep up with the enterprise CMS pack
Points of Agreement
• It will be impossible to completely replace ZCatalog with Solr
• Solr indexing will never be transactional
• Removing ZCatalog from Zope would be very difficult
• Tackle small, focused ZCatalog improvements when possible, such as improving the indexing interface
Points of Agreement
• Navigation and search should be handled separately
• Navigation needs to be transactional; search does not
• Split out a catalog used for navigation from the general catalog
• Explore a non-catalog utility to support navigation, optimize for speed
Points of Agreement
• Treating Solr integration simply as ZCatalog replacement does not take best advantage of Solr features
• ZCatalog can’t represent the richness of Solr; focus on the Solr API instead
• Take advantage of spelling suggestions, facets, results snippets with hit highlighting, synonyms, more like this, etc.
• Provide Solr indexing, field weighting, etc. configuration choices in the control panel
Points of Agreement
• Neither of the current Solr add-ons provides the best foundation for the future
• But they’ve taught us how to do things better
• Non-Solr approaches to improved Plone search should be deprecated
• Andreas Jung is not planning improvements to TextIndexNG!
Points of Agreement
• Stop investing in ZCatalog as a search engine; Solr is the future
Plone + Solr: Roadmap
• Short term: Make Solr integration easy with an approved add-on (like LDAP)
• Build on what we’ve learned and create a better add-on to replace collective.solr and alm.solrindex
• Who wants to sponsor a sprint?
Plone + Solr: Roadmap
• Long term: Ship Solr integration with Plone, but don’t require Solr
• Solr has a lot of overhead and is not always needed
• But using it should be as easy as answering yes to a “Build with Solr?” installation option