Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013. http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a
![Page 1: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/1.jpg)
Search Engine-Building with Lucene and Solr
Part 2
Kai Chan
SoCal Code Camp, November 2013
![Page 2: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/2.jpg)
Overview
● indexing process
● searching process
● advanced features
● scaling/redundancy
● resources
● demo
● questions/answers
![Page 3: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/3.jpg)
Indexing Process
● request handler
  ○ data are read to create documents
● update request processor chain
  ○ optional document-wide processing
  ○ fields can be added, changed, removed
  ○ analysis
  ○ creation of indexed and stored fields
● update handler
  ○ the index is updated
![Page 4: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/4.jpg)
Update Request Processor Chain
● de-duplication
  ○ creates a signature (hash) for each document to be added
  ○ replaces (deletes) existing documents with the same signature
  ○ MD5Signature
    ■ exact hashing
  ○ Lookup3Signature
    ■ faster calculation and smaller hash than MD5
  ○ TextProfileSignature
    ■ fuzzy hashing, near-duplicate detection
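
The de-duplication idea above can be sketched in a few lines of Python. This is only an illustration of exact hashing, not Solr's implementation: `ToyIndex`, `md5_signature` and the choice of signature fields are hypothetical names made up for this example.

```python
import hashlib


def md5_signature(doc, fields):
    """Exact hash over the chosen fields (hypothetical helper, not a Solr API)."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so field boundaries affect the hash
    return h.hexdigest()


class ToyIndex:
    """Keeps at most one document per signature."""

    def __init__(self, signature_fields):
        self.signature_fields = signature_fields
        self.docs_by_signature = {}

    def add(self, doc):
        sig = md5_signature(doc, self.signature_fields)
        replaced = sig in self.docs_by_signature
        # An existing document with the same signature is overwritten.
        self.docs_by_signature[sig] = dict(doc, signature=sig)
        return replaced


index = ToyIndex(["title", "body"])
index.add({"id": "1", "title": "Solr", "body": "search server"})
index.add({"id": "2", "title": "Solr", "body": "search server"})  # replaces id 1
index.add({"id": "3", "title": "Lucene", "body": "search library"})
print(len(index.docs_by_signature))
```

Documents 1 and 2 have identical signature fields, so only the later one stays in the index. A fuzzy signature would differ only in how the hash is computed.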
![Page 5: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/5.jpg)
Update Request Processor Chain
● language detection
  ○ detects the language used in field(s)
  ○ adds a language field to the document
  ○ TikaLanguageIdentifierUpdateProcessorFactory
    ■ uses Apache Tika
  ○ LangDetectLanguageIdentifierUpdateProcessorFactory
    ■ uses language-detection library
  ○ external programs
    ■ e.g. Chromium Compact Language Detector
See Also: Language detection with Google's Compact Language Detector <http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html>
![Page 6: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/6.jpg)
Analysis
● analyzed
  ○ tokenization, i.e. breaking down the content to be searched into smaller units (“tokens”)
  ○ manipulation of tokens
● not analyzed
  ○ the whole content is treated as 1 unit for searching
● analyzed vs. not analyzed
  ○ are individual tokens meaningful on their own?
  ○ are individual tokens used in queries?
![Page 7: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/7.jpg)
Example 1: book title
Lucene in Action, Second Edition: Covers Apache Lucene 3.0
● makes more sense to tokenize
● if not tokenized, search for “Lucene”: no match

Example 2: ISBN
1-933-98817-7
● makes more sense to not tokenize
● if tokenized (1 933 98817 7), search for “933”: match
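
The two examples can be reproduced with a toy matcher. This is a rough sketch, assuming a simple regex tokenizer and exact string equality for the non-analyzed case; `tokenize`, `matches_analyzed` and `matches_not_analyzed` are hypothetical helpers, not Lucene classes.

```python
import re


def tokenize(text):
    """Toy tokenizer: lowercase, then split on anything that is not a word character."""
    return re.findall(r"\w+", text.lower())


def matches_analyzed(field_value, query):
    # The field was tokenized, so a single token can match.
    return query.lower() in tokenize(field_value)


def matches_not_analyzed(field_value, query):
    # The whole field value is one unit, so only the exact value matches.
    return field_value == query


title = "Lucene in Action, Second Edition: Covers Apache Lucene 3.0"
isbn = "1-933-98817-7"

print(matches_analyzed(title, "Lucene"))      # tokenized title: single word matches
print(matches_not_analyzed(title, "Lucene"))  # untokenized title: no match
print(matches_analyzed(isbn, "933"))          # tokenized ISBN: a fragment matches
print(matches_not_analyzed(isbn, "933"))      # untokenized ISBN: fragment does not match
```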
![Page 8: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/8.jpg)
Analysis
analyzed:
● text

not analyzed:
● number
● serial number
● GUID
● checksum

How about URL?
![Page 9: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/9.jpg)
Analysis
● character filter(s)
  ○ character replacement
  ○ e.g. replacing accented characters with their base forms
    café → cafe
    jalapeño → jalapeno
● tokenizer
● token filter(s)
![Page 10: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/10.jpg)
Analysis
● character filter(s)
● tokenizer
  ○ create tokens (“words”) from characters
  ○ sometimes straightforward
  ○ many unusual cases: e-mail address, URL, code, etc.
● token filter(s)
![Page 11: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/11.jpg)
Analysis
● character filter(s)
● tokenizer
● token filter(s)
  ○ token replacement
    ■ change case, remove apostrophe
    ■ remove stop words (a, and, the, for)
    ■ split/join words (ice-cream, ice cream, icecream)
    ■ stemming (importing, imported → import)
    ■ synonym (nation → country)
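
The three analysis stages can be chained as in the sketch below. It is a simplified stand-in for a Solr analyzer, under stated assumptions: the stop-word set and the synonym come from the slide, while `toy_stem` is a hypothetical suffix stripper and not a real stemming algorithm.

```python
import re
import unicodedata

STOP_WORDS = {"a", "and", "the", "for"}
SYNONYMS = {"nation": "country"}


def char_filter(text):
    """Character filter: replace accented characters with their base forms."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokenizer(text):
    """Tokenizer: runs of ASCII letters, digits and apostrophes become tokens."""
    return re.findall(r"[A-Za-z0-9']+", text)


def toy_stem(token):
    """Hypothetical stemmer: strips 'ing' or 'ed' if at least 3 letters remain."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def token_filters(tokens):
    """Token filters: lowercase, drop apostrophes and stop words, stem, map synonyms."""
    out = []
    for token in tokens:
        token = token.lower().replace("'", "")
        if not token or token in STOP_WORDS:
            continue
        token = toy_stem(token)
        out.append(SYNONYMS.get(token, token))
    return out


def analyze(text):
    return token_filters(tokenizer(char_filter(text)))


print(analyze("The café is importing jalapeño for the nation"))
```

The same chain would normally run on both the indexed content and the query, so that "imported" in a query meets "importing" in a document at the token "import".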
![Page 12: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/12.jpg)
Field value:
Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free Wi-Fi!

Tokens (text_en), shown as token(position):
let(1) sign(2) up(3) amaz(6) so(7) cal(8) code(9) camp(10) http(12) bit.li(13) ozizsu(14) free(15) wi(16) fi(17)

Tokens (text_en_splitting), shown as token(position):
let(1) sign(2) up(3) amaz(6) so(7) cal(8) code(9) camp(10) http(12) bit(13) ly(14) o(15) zi(16) zsu(17) free(18) wi(19) fi(20)
socal(8) httpbitlyozizsu(17) wifi(20)

Tokens (text_general), shown as token(position):
let's(1) sign(2) up(3) for(4) the(5) amazing(6) so(7) cal(8) code(9) camp(10) at(11) http(12) bit.ly(13) oZiZsu(14) free(15) wi(16) fi(17)
![Page 13: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/13.jpg)
Searching Process
● query parsing
● analysis
● scoring
● sorting
● loading of stored fields
● optional search components
  ○ faceting
  ○ term vector
  ○ More Like This
  ○ highlighting
![Page 14: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/14.jpg)
Scoring
● for a given query, each document not filtered out gets a score (float)
● higher score: higher in the results
● scoring algorithms
  ○ default: TF-IDF
  ○ other: Okapi BM25, etc.
  ○ very customizable
See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”
![Page 15: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/15.jpg)
Scoring - TF-IDF
● term frequency (TF)
  ○ how many times does this term appear in this document?
● inverse document frequency (IDF)
  ○ how many documents contain this term?
  ○ score proportional to the inverse of document frequency
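
A minimal sketch of the two factors above, assuming the square-root (TF) and logarithmic (IDF) damping used by Lucene's classic TF-IDF similarity. The coordination factor, norms and boosts are left out, and the three toy documents are invented for illustration.

```python
import math

# Toy collection: document id -> list of tokens (already analyzed).
docs = {
    "d1": "lucene is a search library".split(),
    "d2": "solr is a search server built on lucene".split(),
    "d3": "zookeeper coordinates solr instances".split(),
}


def tf(term, tokens):
    """Term frequency, damped with a square root."""
    return math.sqrt(tokens.count(term))


def idf(term, docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    doc_freq = sum(1 for tokens in docs.values() if term in tokens)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))


def score(query_terms, tokens, docs):
    """Sum of tf * idf^2 over the query terms (other factors omitted)."""
    return sum(tf(t, tokens) * idf(t, docs) ** 2 for t in query_terms)


query = ["zookeeper", "solr"]
ranked = sorted(docs, key=lambda d: score(query, docs[d], docs), reverse=True)
print(ranked)
```

"zookeeper" appears in one document and "solr" in two, so the rarer term contributes more to the score of the document that contains it.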
![Page 16: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/16.jpg)
Scoring - Other Factors
● coordination factor (coord)
  ○ documents that contain all or most query terms get higher scores
● normalizing factor (norm)
  ○ adjust for field length and query complexity
![Page 17: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/17.jpg)
Scoring - Boost
● manual override: ask Lucene/Solr to give a higher score to some particular thing(s)
● index-time
  ○ per document
  ○ per field (of a particular document)
● search-time
  ○ per query
![Page 18: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/18.jpg)
More Like This
● finds documents similar in content (of one field) to those matched
● constructs a query based on the highest scoring terms in a document
● requires the field to:
  ○ have stored term vectors (recommended), or
  ○ be stored
Credit: How MoreLikeThis Works in Lucene <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>
![Page 19: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/19.jpg)
Spell Checking
● typos in queries happen
● returns spell checking suggestions (if any) within the same result
● can also be used for auto-complete
  ○ treating a prefix as a spelling mistake
  ○ returning full words as suggestions
![Page 20: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/20.jpg)
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <int name="startOffset">6</int>
      <int name="endOffset">13</int>
      <arr name="suggestion">
        <str>business</str>
      </arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <int name="startOffset">14</int>
      <int name="endOffset">26</int>
      <arr name="suggestion">
        <str>communication</str>
      </arr>
    </lst>
  </lst>
</lst>

/select?q=text:"busness comunication"&spellcheck=true&wt=xml
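
One way a client might consume the response above, sketched with Python's standard library. It assumes the `spellcheck` fragment has already been extracted from the full response; `parse_suggestions` and `corrected_query` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# The spellcheck fragment from the slide (a full response has more around it).
RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <arr name="suggestion"><str>business</str></arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <arr name="suggestion"><str>communication</str></arr>
    </lst>
  </lst>
</lst>
"""


def parse_suggestions(xml_text):
    """Map each misspelled word to its list of suggested replacements."""
    root = ET.fromstring(xml_text)
    suggestions = {}
    for entry in root.findall("./lst[@name='suggestions']/lst"):
        words = [s.text for s in entry.findall("./arr[@name='suggestion']/str")]
        suggestions[entry.get("name")] = words
    return suggestions


def corrected_query(query, suggestions):
    """Replace each misspelled word with its first suggestion."""
    for wrong, candidates in suggestions.items():
        if candidates:
            query = query.replace(wrong, candidates[0])
    return query


suggestions = parse_suggestions(RESPONSE)
print(corrected_query("busness comunication", suggestions))
```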
![Page 21: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/21.jpg)
Query Elevation
● a.k.a. “sponsored search”
● make sure certain documents appear at the top of the results for a certain query
![Page 22: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/22.jpg)
Credit: Google Web Search <http://www.google.com/>
![Page 23: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/23.jpg)
Query Elevation
● configure the elevator search component in solrconfig.xml
● in elevate.xml, specify the queries and the list of documents (by id) to elevate or exclude
● enable query elevation:
  enableElevation=true
● (optional) override the sort parameter:
  forceElevation=true
![Page 24: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/24.jpg)
Function Query
● like formulas in Excel
● apply functions to field values for filtering and scoring
![Page 25: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/25.jpg)
Function Query
● query:
  q={!func} cos(angle)
● query (range):
  q={!frange l=0.5 u=1} cos(angle)
● field:
  fl=angle,cos(angle)
● sort:
  sort=cos(angle) desc
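
What these four uses compute can be imitated on a toy list of documents. The sketch assumes `angle` is stored in radians and that the `l`/`u` bounds are inclusive; the documents and the `frange` helper are made up for illustration.

```python
import math

# Toy documents with an "angle" field (assumed to be in radians).
docs = [
    {"id": "a", "angle": 0.0},
    {"id": "b", "angle": 1.0},
    {"id": "c", "angle": 2.0},
]


def frange(docs, func, lower, upper):
    """Keep documents whose function value lies in [lower, upper]."""
    return [d for d in docs if lower <= func(d) <= upper]


def cos_angle(doc):
    return math.cos(doc["angle"])


# q={!frange l=0.5 u=1} cos(angle): filter by function value
in_range = [d["id"] for d in frange(docs, cos_angle, 0.5, 1.0)]

# fl=angle,cos(angle): return the computed value next to the field
with_values = [(d["id"], d["angle"], cos_angle(d)) for d in docs]

# sort=cos(angle) desc: order by function value
by_cos_desc = [d["id"] for d in sorted(docs, key=cos_angle, reverse=True)]

print(in_range, by_cos_desc)
```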
![Page 26: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/26.jpg)
Spatial Search
● data: contains locations (longitudes, latitudes)
  ○ e.g. merchants with store locations
● search: filter and/or sort by location
![Page 27: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/27.jpg)
Credit: Google Maps <http://maps.google.com/>
![Page 28: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/28.jpg)
Spatial Search
● geofilt
  ○ circle centered at a given point
  ○ distance from a given point
  ○ fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
● bbox
  ○ square (“bounding box”) centered at a given point
  ○ distance from a given point + corners
  ○ fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
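
The difference between the two filters shows up for points near a corner of the box. The sketch below is an approximation, assuming a spherical Earth and the haversine formula; it does not claim to reproduce Solr's exact computation, and the three sample points are invented.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_geofilt(point, center, d_km):
    """Circle: the point is within d_km of the center."""
    return haversine_km(point, center) <= d_km


def in_bbox(point, center, d_km):
    """Box: the point is within the square that encloses the circle."""
    lat_delta = math.degrees(d_km / EARTH_RADIUS_KM)
    lon_delta = lat_delta / math.cos(math.radians(center[0]))
    return (abs(point[0] - center[0]) <= lat_delta
            and abs(point[1] - center[1]) <= lon_delta)


center = (45.15, -93.85)
near = (45.16, -93.85)      # about 1 km north of the center
corner = (45.186, -93.799)  # about 4 km north and 4 km east
far = (45.30, -93.85)       # well outside both shapes

for name, p in [("near", near), ("corner", corner), ("far", far)]:
    print(name, in_geofilt(p, center, 5), in_bbox(p, center, 5))
```

The corner point is inside the box but more than 5 km from the center, so bbox accepts it while geofilt rejects it.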
![Page 29: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/29.jpg)
[Figure: geofilt (circle, 5 km) and bbox (square, 5 km) drawn side by side, each centered at (45.15, -93.85)]
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
![Page 30: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/30.jpg)
[Figure: geofilt (circle, 5 km) and bbox (square, 5 km) drawn side by side, each centered at (45.15, -93.85), with sample points marked “o” and “x”]
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
![Page 31: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/31.jpg)
Spatial Search
● geodist
  ○ returns the distance between the location given in a field and a certain coordinate
  ○ e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score:
    q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
![Page 32: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/32.jpg)
Scaling/Redundancy - Problems
● collection too large for a single machine
● too many requests for a single machine
● a machine can go down
![Page 33: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/33.jpg)
Scaling/Redundancy - Solutions
● collection too large for a single machine
  ○ distribution
    ■ spread the collection across multiple machines
● too many requests for a single machine
  ○ distribution
    ■ spread the requests across multiple machines
● a machine can go down
  ○ replication
    ■ copy data and configuration across multiple machines
    ■ make sure there is no single point of failure
![Page 34: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/34.jpg)
SolrCloud
● Solr instances
● ZooKeeper instances
![Page 35: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/35.jpg)
SolrCloud
● Solr instances
  ○ collection (logical index) divided into one or more partial collections (“shards”)
  ○ for each shard, one or more Solr instances keep copies of the data
    ■ one as leader - handles reads and writes
    ■ others as replicas - handle reads
● ZooKeeper instances
![Page 36: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/36.jpg)
SolrCloud
● Solr instances
● ZooKeeper instances
  ○ management of Solr instances
  ○ leader election
  ○ node discovery
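
The roles of shards, leaders and replicas can be pictured with a toy model. This is a deliberately simplified sketch: in SolrCloud the routing and the leader election are handled by Solr and ZooKeeper, whereas here a CRC32 modulo and "first live node wins" are stand-ins, and the `Shard` class and node names are hypothetical.

```python
import zlib


class Shard:
    """One partial collection, kept on one or more nodes."""

    def __init__(self, name, nodes):
        self.name = name
        self.live_nodes = list(nodes)
        self.leader = self.live_nodes[0]

    def node_down(self, node):
        self.live_nodes.remove(node)
        if node == self.leader:
            # Stand-in for leader election: promote the first live node.
            self.leader = self.live_nodes[0] if self.live_nodes else None

    def node_up(self, node):
        # A returning node rejoins as a replica unless the shard has no leader.
        self.live_nodes.append(node)
        if self.leader is None:
            self.leader = node


def shard_for(doc_id, shards):
    """Stand-in for document routing: hash the id, pick a shard."""
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]


shards = [
    Shard("shard1", ["node1", "node2", "node3"]),
    Shard("shard2", ["node4", "node5"]),
    Shard("shard3", ["node6", "node7"]),
]

shard2 = shards[1]
shard2.node_down("node4")  # the leader goes offline
print(shard2.leader)       # a former replica now leads
shard2.node_up("node4")    # the old leader comes back as a replica
print(shard2.leader, shard2.live_nodes)
```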
![Page 37: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/37.jpg)
[Diagram: a collection (i.e. logical index) split into shard 1, shard 2 and shard 3, each holding ⅓ of the collection; each shard has one leader and one or more replicas]
![Page 38: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/38.jpg)
[Diagram: a collection (i.e. logical index) split into shard 1, shard 2 and shard 3, each holding ⅓ of the collection; each shard has one leader and one or more replicas, with one more replica than on the previous slide]
![Page 39: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/39.jpg)
[Diagram: a collection (i.e. logical index) split into shard 1, shard 2 and shard 3, each holding ⅓ of the collection; in one shard the original leader is offline and a former replica is now the leader]
![Page 40: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/40.jpg)
[Diagram: a collection (i.e. logical index) split into shard 1, shard 2 and shard 3, each holding ⅓ of the collection; in one shard the node that had been offline is back as a replica, and the node that took over remains the leader]
![Page 41: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/41.jpg)
Resources - Books
● Lucene in Action
  ○ written by 3 committers and PMC members
  ○ somewhat outdated (2010; covers Lucene 3.0)
  ○ http://www.manning.com/hatcher3/
● Solr in Action
  ○ early access; coming out later this year
  ○ http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
  ○ common problems and useful tips
  ○ http://www.packtpub.com/apache-solr-4-cookbook/book
![Page 42: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/42.jpg)
Resources - Books
● Introduction to Information Retrieval
  ○ not specific to Lucene/Solr, but about IR concepts
  ○ free e-book
  ○ http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
  ○ indexing, compression and other topics
  ○ accompanied by MG4J - a full-text search software
  ○ http://mg4j.di.unimi.it/
![Page 43: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/43.jpg)
Resources - Web
● official websites
  ○ Lucene Core - http://lucene.apache.org/core/
  ○ Solr - http://lucene.apache.org/solr/
● mailing lists
● Wiki sites
  ○ Lucene Core - http://wiki.apache.org/lucene-java/
  ○ Solr - http://wiki.apache.org/solr/
● reference guides
  ○ API Documentation for Lucene and Solr
  ○ Apache Solr Reference Guide
![Page 44: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)](https://reader033.vdocuments.us/reader033/viewer/2022052821/554a572bb4c905572f8b4cbe/html5/thumbnails/44.jpg)
Getting Started
● download Solr
  ○ requires Java 6 or newer to run
● Solr comes bundled/configured with Jetty
  ○ <Solr directory>/example/start.jar
● "exampledocs" directory contains sample documents
  ○ <Solr directory>/example/exampledocs/post.jar
  ○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
● use the Solr admin interface
  ○ http://localhost:8983/solr/
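
Once the sample documents are posted, a search request is just a URL. The sketch below only builds that URL, assuming the example server's default address and the `/select` handler used earlier in these slides; `select_url` is a hypothetical helper and the query string is an arbitrary example.

```python
from urllib.parse import urlencode


def select_url(query, base="http://localhost:8983/solr", rows=10):
    """Build a search URL for the /select handler (JSON response format)."""
    params = {"q": query, "wt": "json", "rows": rows}
    return f"{base}/select?{urlencode(params)}"


print(select_url("text:solr"))
```

Opening the printed URL in a browser, or fetching it with `urllib.request.urlopen`, should return matching documents once Solr is running and the sample documents have been indexed.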