solr: 4 big features
DESCRIPTION
Four Big Features: * Faceting * Query auto-complete * Geospatial * Scaling
Presented at a Meetup in Boston, 11 March 2014.
TRANSCRIPT
APACHE SOLR
Four Big Features: • Faceting • Query auto-complete • Geospatial • Scaling
2014 March Presented by David Smiley at the Boston Java Meetup Group
About David Smiley
➢ Software Engineer (14 years)
  ○ Search (5 years)
  ○ Java, Web, Spatial
➢ Part-time employed at MITRE
➢ Part-time search consultant
➢ Apache Lucene / Solr committer & PMC
➢ Published 1st book on Solr
➢ Presented at several conferences
➢ Taught several Solr classes
Faceting
• Do you know what I mean by “faceting”?
• AKA: faceted navigation, or parametric search
• Popular Apps:
  • eBay, Amazon, and many e-commerce sites
• Apps I use that don’t use faceting but I wish they did:
  • http://search.maven.org and all Maven repository software: Nexus, Artifactory, Archiva
  • JIRA
    • Compare this to: http://jirasearch.mikemccandless.com/
Faceted Navigation & Analytics by example…
Notice the counts
Optionally start with a keyword search or filter
Extremely useful feature supported by very few platforms: Solr, ElasticSearch, Sphinx, … (no DBs)
Credit: Trey Grainger; CareerBuilder
How to: Field Faceting
• Index setup: schema.xml:
    <field name="category" type="string" />
    <field name="manufacturer" type="string" />
• Facet search:
    http://localhost:8983/solr/collection1/select?
      q=*:*&
      facet=true&
      facet.field=category&
      facet.field=manufacturer
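As a rough sketch of the request above, assuming Python on the client side: the helper names `facet_url` and `facet_counts` and the sample response values are made up for illustration. The one real detail to notice is that Solr's default JSON output returns each facet as a flat list of alternating values and counts.

```python
from urllib.parse import urlencode

def facet_url(base, fields, query="*:*"):
    """Build a select URL that facets on every field in `fields`."""
    params = [("q", query), ("facet", "true"), ("wt", "json")]
    # facet.field is simply repeated, once per field
    params += [("facet.field", f) for f in fields]
    return base + "?" + urlencode(params)

def facet_counts(response, field):
    """Pair up the flat [value, count, value, count, ...] list."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))

url = facet_url("http://localhost:8983/solr/collection1/select",
                ["category", "manufacturer"])

# Hypothetical response fragment; the values and counts are invented.
sample = {"facet_counts": {"facet_fields": {
    "category": ["electronics", 12, "memory", 3]}}}
pairs = facet_counts(sample, "category")
```

The pairs are what you would render as the clickable values with counts beside them.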
How to: Numeric/Date Faceting
• Index setup: schema.xml:
    <field name="timestamp" type="tdate" />
• Facet search (the + in date math is URL-encoded as %2B):
    http://localhost:8983/solr/collection1/select?
      q=*:*&
      facet=true&
      facet.range=timestamp&
      facet.range.start=NOW/YEAR-10YEAR&
      facet.range.end=NOW/YEAR%2B1YEAR&
      facet.range.gap=%2B1YEAR
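A raw + in a query string is decoded as a space, so the date-math values are easy to get wrong when the URL is typed by hand. A minimal sketch, assuming a Python client, that lets the standard library do the encoding and then decodes the URL again to check that nothing was lost:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = {
    "q": "*:*",
    "facet": "true",
    "facet.range": "timestamp",
    "facet.range.start": "NOW/YEAR-10YEAR",  # ten years back, rounded to the year
    "facet.range.end": "NOW/YEAR+1YEAR",     # through the end of this year
    "facet.range.gap": "+1YEAR",             # one bucket per year
}

# urlencode percent-encodes "+" as %2B and "/" as %2F
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)

# Round trip: what a server would see after decoding the query string
decoded = parse_qs(urlsplit(url).query)
```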
Query Suggest / Autocomplete
If you aren’t doing this then you really should!
Several Types
• Instant search
  • Direct navigation to documents, usually by name/title/id, etc.
  • Implement via edge n-grams or a Suggester
  • Ex: iTunes, Netflix, …
• Query log completion
  • Searches user queries you’ve captured & indexed
  • Implement via edge n-grams or FreeTextSuggester
  • Ex: Google
• Term completion
  • Completes indexed words
  • Implement via facet.prefix technique or a Suggester
• Facet / field value completion
  • Ex: Mint.com
Not mutually exclusive!
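To make "edge n-grams" concrete, here is a toy sketch in Python. It is not how Solr stores anything; it only shows the idea that every leading prefix of a title is indexed as its own term, so a plain term lookup on whatever the user has typed finds the titles. The function names and sample titles are invented for illustration.

```python
def edge_ngrams(text, min_len=1, max_len=10):
    """All leading prefixes of `text`, from min_len up to max_len characters."""
    text = text.lower()
    longest = min(len(text), max_len)
    return [text[:n] for n in range(min_len, longest + 1)]

def build_index(titles):
    """Map each prefix to the titles that start with it."""
    index = {}
    for title in titles:
        for gram in edge_ngrams(title):
            index.setdefault(gram, []).append(title)
    return index

def complete(index, typed):
    """Instant search: an exact lookup on the typed prefix."""
    return index.get(typed.lower(), [])

index = build_index(["Small Faces", "Smash Mouth", "Snow Patrol"])
matches = complete(index, "sma")  # ["Small Faces", "Smash Mouth"]
```

The trade-off is a bigger index in exchange for very cheap lookups at query time.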
Tools for Completing / Suggesting
• The Suggester
  • A specialization of the spell-check Solr component
  • 8 implementations to choose from! Different pros/cons
  • Weighted? Analyzing? Infix? Highlight? Fuzzy? N-gram model?
• Faceting with facet.prefix
  • Respects your current filters – don’t suggest a 0-result response
• Edge n-grams, with standard search
• Terms component
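The facet.prefix technique could look roughly like this from a Python client. The field name `name_terms` and the filter `type:artist` are assumptions for illustration; the request parameters themselves are standard Solr faceting parameters.

```python
from urllib.parse import urlencode

def term_completion_url(base, field, typed, filters=(), limit=10):
    """Ask for indexed terms in `field` that start with `typed`,
    counted only over documents matching the current filters."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),                    # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", field),
        ("facet.prefix", typed.lower()),  # assumes the field is lowercased at index time
        ("facet.limit", str(limit)),
        ("facet.mincount", "1"),          # never suggest a 0-result term
    ]
    params += [("fq", f) for f in filters]
    return base + "?" + urlencode(params)

url = term_completion_url(
    "http://localhost:8983/solr/collection1/select",
    "name_terms",              # hypothetical tokenized field
    "Sma",
    filters=["type:artist"],   # hypothetical filter already applied by the user
)
```

Because the counts are computed over the filtered result set, every suggestion is guaranteed to lead somewhere.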
Sample Suggester Search
Search
    http://localhost:8983/solr/mbartists/a_term_suggest?q=sma
Response
    {
      "responseHeader":{
        "status":0,
        "QTime":1},
      "spellcheck":{
        "suggestions":[
          "sma",{
            "numFound":4,
            "startOffset":0,
            "endOffset":3,
            "suggestion":["small", "smart", "smash", "smalley"]},
          "collation","small"]}}
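Reading that response takes a little care, because "suggestions" is a flat list of alternating keys and values rather than an object. A minimal sketch in Python, using the response from the slide as input; the function name is made up for illustration.

```python
import json

raw = """{"responseHeader":{"status":0,"QTime":1},
 "spellcheck":{"suggestions":[
   "sma",{"numFound":4,"startOffset":0,"endOffset":3,
          "suggestion":["small","smart","smash","smalley"]},
   "collation","small"]}}"""

def suggestions_by_token(response):
    """Map each typed token to its list of suggested completions."""
    flat = response["spellcheck"]["suggestions"]
    result = {}
    # Walk the list two items at a time: key, value, key, value, ...
    for key, value in zip(flat[0::2], flat[1::2]):
        if isinstance(value, dict):  # skips entries such as "collation"
            result[key] = value["suggestion"]
    return result

parsed = suggestions_by_token(json.loads(raw))
```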
Geospatial Features
• Lucene/Solr can index text, numbers, dates, and spatial data
• Features:
  • Index latitude & longitude coordinates or any X Y pairs
  • Index polygons or other geometry
  • Query by point-radius, rectangle, or polygon geometry
    • Including “IsWithin” vs “Intersects” vs “Contains” predicates
  • 2d/flat Euclidean OR geodetic spherical world model
  • Sort or relevancy-boost by distance to indexed points
The NoSQL solutions with the best spatial support are CouchDB, MongoDB, Solr, and ElasticSearch
How to: Spatial Filter & Sort
• Index setup: schema.xml:
    <field name="geo" type="location_rpt" />
• Index latitude comma longitude in your document: 37.7752,-100.0232
• Filter & sort:
    http://localhost:8983/solr/collection1/select?
      q=*:*&
      fq={!geofilt}&
      sort=geodist() asc&
      sfield=geo&
      pt=45.15,-93.85&
      d=5
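What the filter and sort compute can be sketched in plain Python: keep the documents within d kilometers of pt, nearest first. This is a brute-force illustration under a spherical-earth assumption, not how Solr evaluates the query (it uses the index rather than scanning). The documents and the rounded earth radius are made up for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # rounded mean radius, an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat = p2 - p1
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def filter_and_sort(docs, pt, d):
    """Like fq={!geofilt} plus sort=geodist() asc, done by brute force."""
    lat, lon = pt
    hits = [(haversine_km(lat, lon, *doc["geo"]), doc["id"]) for doc in docs]
    return [doc_id for dist, doc_id in sorted(hits) if dist <= d]

# Hypothetical documents with a "geo" value of (latitude, longitude)
docs = [
    {"id": "c", "geo": (46.00, -93.85)},  # roughly 95 km away
    {"id": "b", "geo": (45.17, -93.87)},  # a few km away
    {"id": "a", "geo": (45.15, -93.85)},  # at the query point
]
nearby = filter_and_sort(docs, pt=(45.15, -93.85), d=5)
```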
Cool Technology Under the Hood
• Grid / tile based recursive indexed structure using a prefix tree / trie indexing approach on standard Lucene inverted index
• Future:
  • Precise indexed shapes
  • Geodetic polygons
  • Hilbert curve ordering
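One well-known grid of this kind is the geohash, sketched below in Python as an illustration of the prefix-tree idea rather than as Lucene's actual code. Each extra character halves the cell several more times, so a short hash names a big tile and every longer hash inside that tile starts with it. That prefix property is what lets an ordinary inverted index of terms answer "what is in this area?".

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision):
    """Encode a point as a geohash with `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate: longitude, latitude, longitude, ...
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if lon >= mid else 0
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | bit
        bit_count += 1
        use_lon = not use_lon
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

coarse = geohash(57.64911, 10.40744, 2)  # a large tile
fine = geohash(57.64911, 10.40744, 8)    # a small tile inside it
```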
Scaling Solr
Solr’s mechanisms for scaling:
• Replication
  • Eliminates single point of failure
  • Reduces query load on any one node
  • Backups
• Distributed-search (for sharded indexes)
  • For large collections of many millions of documents
• SolrCloud
  • Combines distributed-search and real-time replicated indexing
  • Centrally manages configuration
  • A higher level logical API, manages lots of coordination underneath
  • Advanced: doc routing, shard splitting, migration
Replication & Sharding
Illustrated with a metaphor of an encyclopedia at a library
[Figure: the volumes A through Z form 26 shards; three full sets of the volumes form 3 replicas]
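The encyclopedia metaphor can be sketched in a few lines of Python: the first letter picks the volume (shard), and any of the three copies (replicas) can serve the lookup, so reads are spread round-robin. This is only the metaphor in code; the names are invented and real Solr does not shard by first letter.

```python
import itertools
import string

SHARDS = list(string.ascii_uppercase)  # 26 shards, one per volume
REPLICAS = 3                           # three full sets on the shelves

# One round-robin counter per shard spreads reads across its replicas
_next_replica = {s: itertools.cycle(range(REPLICAS)) for s in SHARDS}

def locate(term):
    """Return (shard, replica) for a term that starts with a letter A-Z."""
    shard = term[0].upper()
    return shard, next(_next_replica[shard])

first = locate("Solr")     # volume S, first copy
second = locate("search")  # volume S again, so the next copy
third = locate("Lucene")   # volume L, first copy
```

Sharding splits the data so each volume stays small; replication adds copies so a missing volume or a busy reader does not block everyone else.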
Nice Admin Screen UI
More Advanced SolrCloud Features
• Document routing customization
  • Answers: Which shard does a document belong in?
    • Hash (i.e. random) distribution
    • Or keep certain related documents together (ex: for same user)
      • Helps scale when searching by a subset
    • Or manage it yourself manually (ex: index by month)
• Shard splitting
  • When your shard(s) get to be too big
  • Live; no down-time
• Inter-collection document migration
  • Copies a subset of one collection to another, possibly new collection
  • Live; no down-time
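The two routing styles can be contrasted with a small Python sketch. It is a simplification: the shard count, the md5 hash, and the modulo step are stand-ins chosen for the example, and SolrCloud's own hashing and shard assignment work differently. The "prefix!id" convention for keeping related documents together does mirror the ID style SolrCloud's composite routing uses.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def _bucket(key):
    """Stable hash of `key` mapped to a shard number (md5 is a stand-in)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def shard_for(doc_id):
    """Plain IDs are spread by hash; "user42!doc7" routes on "user42" only,
    so all of one user's documents land on the same shard."""
    route_key = doc_id.split("!", 1)[0] if "!" in doc_id else doc_id
    return _bucket(route_key)

same_user = {shard_for("user42!doc%d" % n) for n in range(100)}
```

A search restricted to one user then only has to visit one shard, which is the "helps scale when searching by a subset" point above.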
That’s all for now; thanks for coming!
Need Lucene/Solr guidance or custom development?
Contact me: [email protected]
ETA: June 2014