IR with Lucene
DESCRIPTION
Presentation at the Greek Java Hellenic group about the open source search engine Lucene
TRANSCRIPT
Introduction to Information Retrieval with Lucene
By Stylianos Gkorilas
Introductions
- Presenter
  - Architect/Development Team Leader @ Trasys Greece
  - Java EE projects for European Agencies
- IR (Information Retrieval)
  - The tracing and recovery of specific information from stored data
  - IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics
- Lucene
  - Open Source – Apache Software License (http://lucene.apache.org)
  - Founder: Doug Cutting
  - 0.01 release in March 2000 (SourceForge)
  - 1.2 release in June 2002 (first Apache Jakarta release)
  - Its own top-level Apache project in 2005
  - Current version is 3.1
More Lucene Intro…
- Lucene is a high-performance, scalable IR library (not a ready-to-use application)
- A number of full-featured search applications are built on top (more later…)
- Lucene ports and bindings exist in many other programming environments, incl. Perl, Python, Ruby, C/C++, PHP and C# (.NET)
- Lucene "Powered By" apps (a few of many): LinkedIn, Apple, MySpace, Eclipse IDE, MS Outlook, Atlassian (JIRA). See more @ http://wiki.apache.org/lucene-java/PoweredBy
Components of a Search Application (1/4)
- Acquire Content
  - Gather and scope the content, e.g. from the web with a spider or crawler, a CMS, a database or a file system
  - Projects helping
    - Solr: handles RDBMS and XML feeds, and rich documents through Tika integration
    - Nutch: web crawler, a sister project at Apache
    - Grub: open source web crawler
Components of a Search Application (2/4)
- Build Document
  - Define the document
    - The unit of the search engine
    - Has fields
    - De-normalization involved
  - Projects helping: usually the same frameworks cover both this and the previous step
    - Compass and its evolution, ElasticSearch
    - Hibernate Search
    - DBSight
    - Oracle/Lucene Integration
Components of a Search Application (3/4)
- Analyze Document
  - Handled by Analyzers
    - Built-in and contributed
    - Built with tokenizers and token filters
- Index Document
  - Through the Lucene API or your framework of choice
- Search User Interface/Render Results
  - Application specific
Components of a Search Application (4/4)
- Query Builder
  - Lucene provides one
  - Frameworks provide extensions, and so may the application itself, e.g. advanced search
- Run Query
  - Retrieve documents by running the query that was built
  - Three common theoretical models
    - Boolean model
    - Vector space model
    - Probabilistic model
- Administration
  - e.g. tuning options
- Analytics
  - Reporting
How Lucene models content
Documents
Fields
Denormalization of content
Flexible Schema
Inverted Index
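The inverted index is what makes term lookups fast: instead of scanning documents for a term, the index maps each term to the documents that contain it. A minimal sketch in plain Java follows; the class name `InvertedIndexSketch` and its methods are hypothetical, and this is not Lucene's on-disk format, only the idea behind it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene's format): term -> sorted ids of documents containing it. */
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds a document and returns its id (ids follow insertion order, starting at 0). */
    int add(String text) {
        int docId = nextDocId++;
        // Crude analysis step: lowercase, then split at whitespace
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of the documents containing the term, in ascending order. */
    List<Integer> search(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("the JHUG meeting is on this Saturday");
        index.add("Lucene is an IR library");
        index.add("the meeting is about Lucene");
        System.out.println(index.search("lucene"));  // [1, 2]
        System.out.println(index.search("meeting")); // [0, 2]
    }
}
```

A search never touches the document text; it only reads the posting list for the term, which is why the cost depends on the number of matches rather than on the size of the collection.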
Basic Lucene Classes
- Indexing
  - IndexWriter
  - Directory
  - Analyzer
  - Document
  - Field
- Searching
  - IndexSearcher
  - Query
  - TopDocs
  - Term
  - QueryParser
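Basic Indexing
- Adding documents

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

RAMDirectory directory = new RAMDirectory(); // in-memory index
IndexWriter writer = new IndexWriter(directory,
    new WhitespaceAnalyzer(),
    IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("post",
    "the JHUG meeting is on this Saturday",
    Field.Store.YES,
    Field.Index.ANALYZED));
writer.addDocument(doc); // the document reaches the index only through this call
writer.close();          // commits the changes and releases the write lock

- Deleting and updating documents
- Field options
  - Store
  - Analyze
  - Norms
  - Term vectors
  - Boost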
Scoring – The formula
tf(t in d): Term frequency factor for the term (t) in the document (d), i.e. how many times the term t occurs in the document.
idf(t): Inverse document frequency of the term: a measure of how “unique” the term is. Very common terms have a low idf; very rare terms have a high idf.
boost(t.field in d): Field & Document boost, as set during indexing. This may be used to statically boost certain fields and certain documents over others.
lengthNorm(t.field in d): Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index norms. Shorter fields (fewer tokens) get a bigger boost from this factor.
coord(q, d): Coordination factor, based on the number of query terms the document contains. The coordination factor gives an AND-like boost to documents that contain more of the search terms than other documents.
queryNorm(q): Normalization value for a query, given the sum of the squared weights of each of the query terms.
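The factors above multiply together inside a sum over the query terms. The arrangement below is a reconstruction, assuming the slide refers to Lucene's classic TF-IDF scoring as it stood in the 3.x line; the symbols are the ones defined in the list above.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q}\Big(\mathrm{tf}(t \in d)\cdot\mathrm{idf}(t)^{2}\cdot
\mathrm{boost}(t.\mathrm{field} \in d)\cdot\mathrm{lengthNorm}(t.\mathrm{field} \in d)\Big)
\]
```

In the default similarity of that line, tf is the square root of the term frequency and lengthNorm is one over the square root of the number of terms in the field, which is why a single hit in a short field can outrank several hits in a long one.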
Querying – the API
- Variety of Query class implementations
  - TermQuery
  - PhraseQuery
  - TermRangeQuery
  - NumericRangeQuery
  - PrefixQuery
  - BooleanQuery
  - WildcardQuery
  - FuzzyQuery
  - MatchAllDocsQuery
  - …
Querying - Example
// Assumed to be a field of the enclosing class; shared by indexing and searching
private Directory directory = new RAMDirectory();

// Indexes each field as its own single-field document
private void indexSingleFieldDocs(Field[] fields) throws Exception {
    IndexWriter writer = new IndexWriter(directory,
        new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < fields.length; i++) {
        Document doc = new Document();
        doc.add(fields[i]);
        writer.addDocument(doc);
    }
    writer.optimize();
    writer.close();
}

public void wildcard() throws Exception {
    indexSingleFieldDocs(new Field[] {
        new Field("contents", "wild", Field.Store.YES, Field.Index.ANALYZED),
        new Field("contents", "child", Field.Store.YES, Field.Index.ANALYZED),
        new Field("contents", "mild", Field.Store.YES, Field.Index.ANALYZED),
        new Field("contents", "mildew", Field.Store.YES, Field.Index.ANALYZED) });
    IndexSearcher searcher = new IndexSearcher(directory, true); // true = read-only
    // ? matches exactly one character, * matches zero or more characters
    Query query = new WildcardQuery(new Term("contents", "?ild*"));
    TopDocs matches = searcher.search(query, 10); // wild, mild and mildew match; child does not
    searcher.close();
}
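To see why `?ild*` selects three of the four terms, the wildcard semantics can be reproduced without Lucene. This is a minimal sketch that translates the pattern into a regular expression; the class `WildcardSketch` is hypothetical and this is not how Lucene evaluates a WildcardQuery internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrates wildcard term semantics with a regex (hypothetical helper, not Lucene's implementation). */
class WildcardSketch {
    /** Translates ? (exactly one character) and * (zero or more characters) into a regex. */
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns the terms that match the wildcard expression as a whole. */
    static List<String> matching(String wildcard, List<String> terms) {
        Pattern pattern = toPattern(wildcard);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("wild", "child", "mild", "mildew");
        System.out.println(matching("?ild*", terms)); // [wild, mild, mildew]
    }
}
```

The term child is rejected because `?` consumes exactly one character, so the second character would have to be `i`.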
Querying - QueryParser
// Lucene 3.x constructors take a Version argument first
Query query = new QueryParser(Version.LUCENE_31, "subject",
    analyzer).parse("(clinical OR ethics) AND methodology");
Example query expressions:
- trachea AND esophagus
- The default join condition is OR, e.g. trachea esophagus
- cough AND (trachea OR esophagus)
- trachea NOT esophagus
- full_title:trachea
- "trachea disease"
- "trachea disease"~5
- is_gender_male:y
- [2010-01-01 TO 2010-07-01]
- esophaguz~
- Trachea^5 esophagus
Analyzers - Internals
- Used at indexing and at querying time
- Inside an analyzer
  - Operates on a TokenStream
  - A token has a text value and metadata like
    - Start and end character offsets
    - Token type
    - Position increment
    - Optionally, application-specific bit flags and a byte[] payload
  - TokenStream is abstract; Tokenizer and TokenFilter are its two kinds of subclass
  - A Tokenizer reads chars and produces tokens
  - A TokenFilter ingests tokens and produces new ones
  - The composite pattern is implemented: a tokenizer and any number of filters wrap one another to form a chain
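The chain described above can be sketched in plain Java: one tokenizer stage feeds a sequence of filter stages, each wrapping the previous one. The interface `TokenSource`, the stage methods and the four-word stop set are all hypothetical stand-ins chosen for illustration; Lucene's real TokenStream API carries offsets, types and position increments that this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Sketch of an analysis chain: one tokenizer feeding a sequence of token filters. */
class AnalyzerChainSketch {
    /** Hypothetical stand-in for a token stream: returns the next token, or null when exhausted. */
    interface TokenSource {
        String next();
    }

    /** Tokenizer stage: reads characters and produces whitespace-separated tokens. */
    static TokenSource whitespaceTokenizer(String text) {
        String trimmed = text.trim();
        String[] parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        int[] position = {0};
        return () -> position[0] < parts.length ? parts[position[0]++] : null;
    }

    /** Filter stage: lowercases every token it receives. */
    static TokenSource lowerCaseFilter(TokenSource input) {
        return () -> {
            String token = input.next();
            return token == null ? null : token.toLowerCase(Locale.ROOT);
        };
    }

    /** Filter stage: drops tokens found in the stop set. */
    static TokenSource stopFilter(TokenSource input, Set<String> stopWords) {
        return () -> {
            String token = input.next();
            while (token != null && stopWords.contains(token)) {
                token = input.next();
            }
            return token;
        };
    }

    /** Order matters: lowercasing runs before the (lowercase) stop set is consulted. */
    static List<String> analyze(String text) {
        // Illustrative stop set only; not Lucene's default list
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is", "on", "this"));
        TokenSource chain = stopFilter(lowerCaseFilter(whitespaceTokenizer(text)), stopWords);
        List<String> tokens = new ArrayList<>();
        for (String token = chain.next(); token != null; token = chain.next()) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The JHUG meeting is on this Saturday")); // [jhug, meeting, saturday]
    }
}
```

Swapping the two filters would let "The" slip through, because the stop set holds only lowercase entries; that is the sense in which the order of the chain is significant.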
Analyzers – building blocks
- Analyzers can be created by combining token streams (order is important)
- Building blocks provided in core
  - CharTokenizer
  - WhitespaceTokenizer
  - KeywordTokenizer
  - LetterTokenizer
  - LowerCaseTokenizer
  - SinkTokenizer
  - StandardTokenizer
  - LowerCaseFilter
  - StopFilter
  - PorterStemFilter
  - TeeTokenFilter
  - ASCIIFoldingFilter
  - CachingTokenFilter
  - LengthFilter
  - StandardFilter
Analyzers - core
- WhitespaceAnalyzer: splits tokens at whitespace
- SimpleAnalyzer: divides text at non-letter characters and lowercases
- StopAnalyzer: divides text at non-letter characters, lowercases, and removes stop words
- KeywordAnalyzer: treats the entire text as a single token
- StandardAnalyzer: tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases and removes stop words
Analyzers – Example (1/2)
Analyzing "The JHUG meeting is on this Saturday"
WhitespaceAnalyzer:
[The] [JHUG] [meeting] [is] [on] [this] [Saturday]
SimpleAnalyzer:
[the] [jhug] [meeting] [is] [on] [this] [saturday]
StopAnalyzer:
[jhug] [meeting] [saturday]
StandardAnalyzer:
[jhug] [meeting] [saturday]
Analyzers – Example (2/2)
Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
Analyzers – Beyond the built in
- Language-specific analyzers, under contrib/analyzers: language-specific stemming and stop-word removal
- Sounds-like analyzer, e.g. MetaphoneReplacementAnalyzer, which transforms terms to their phonetic roots
- SynonymAnalyzer
- Nutch analysis: bigrams for stop words
- Stemming analysis
  - The PorterStemFilter stems words using the Porter stemming algorithm created by Dr. Martin Porter, and it's best defined in his own words: "The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems."
  - SnowballAnalyzer: stemming for many European languages
Filters
- Narrow the search space
- Overloaded search methods that accept Filter instances
- Examples
  - TermRangeFilter
  - NumericRangeFilter
  - PrefixFilter
  - QueryWrapperFilter
  - SpanQueryFilter
  - ChainedFilter
Example: Filters for Security
- Constraints known at indexing time
  - Index the constraint as a field
  - Search by wrapping a TermQuery on the constraint field with a QueryWrapperFilter
- Factor in information at search time
  - A custom filter
  - The filter will access an external privilege store that provides some means of identifying documents in the index, e.g. a unique term with regard to permissions
  - Return a DocIdSet to Lucene. Bit positions match the document numbers. Enabled bits mean the document at that position is available to be searched against the query; unset bits mean the document won't be considered in the search
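The bit-per-document idea can be sketched with `java.util.BitSet`. The class `SecurityFilterSketch`, the per-document owner array and the user names are hypothetical; a real custom filter would build the equivalent set from the index and the external privilege store.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Sketch of a security filter as a bit set over document numbers (hypothetical, not Lucene's Filter class). */
class SecurityFilterSketch {
    /** Builds the allowed set: bit i is enabled when document i is visible to the user. */
    static BitSet allowedDocs(String[] ownerByDoc, String user) {
        BitSet allowed = new BitSet(ownerByDoc.length);
        for (int docId = 0; docId < ownerByDoc.length; docId++) {
            if (ownerByDoc[docId].equals(user)) {
                allowed.set(docId);
            }
        }
        return allowed;
    }

    /** Keeps only the query matches whose bit is enabled; unset bits are never considered. */
    static List<Integer> applyFilter(int[] matchingDocs, BitSet allowed) {
        List<Integer> visible = new ArrayList<>();
        for (int docId : matchingDocs) {
            if (allowed.get(docId)) {
                visible.add(docId);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // Stand-in for the external privilege store: the owner of each document number
        String[] ownerByDoc = {"alice", "bob", "alice", "carol"};
        int[] matchingDocs = {0, 1, 2, 3}; // documents matched by the query itself
        System.out.println(applyFilter(matchingDocs, allowedDocs(ownerByDoc, "alice"))); // [0, 2]
    }
}
```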
Internals - Concurrency
- Any number of IndexReaders may be open
  - IndexSearchers use underlying IndexReaders
- Only one IndexWriter at a time
  - Locking with a write lock file
- IndexReaders may be open while the index is being changed by an IndexWriter
  - A reader sees the changes only after the writer commits and the reader is reopened
- Both are thread-safe/friendly classes
Internals - Indexing concepts
- The index is made up of segment files
- Deleting documents does not actually delete them; it only marks them for deletion
- Index writes are buffered and flushed periodically
- Segments need to be merged
  - Automatically by the IndexWriter
  - Explicit calls to optimize
- There is the notion of commit (as you would expect), which has 4 steps
  1. Flush buffered documents and deletions
  2. Sync files; force the OS to write to stable storage of the underlying I/O system
  3. Write and sync the segments_N file
  4. Remove old commits
Internals - Transactions
- Two-phase commit is supported
  - prepareCommit performs commit steps 1, 2 and most of 3
- Lucene implements the ACID transactional model
  - Atomicity: all-or-nothing commit
  - Consistency: e.g. an update will mean both delete and add
  - Isolation: IndexReaders cannot see what has not been committed
  - Durability: the index is not corrupted and persists in storage
Architectures
- Cluster nodes that share a remote file system index
  - Slower than local
  - Possible limitations due to client-side caching (Samba, NFS, AFP) or stale file handles (NFS)
- Index in database
  - Much slower
- Separate write and read indexes (replication)
  - Relies on the IndexDeletionPolicy feature of Lucene
  - Out of the box in Solr and ElasticSearch
- Autonomous search servers (e.g. Solr, ElasticSearch)
  - Loose coupling through JSON or XML
Frameworks – Compass Document definition via JPA mapping
<compass-core-mapping package="eu.emea.eudract.model.entity">
<class name="cta.sectiona.CtaIdentification" alias="cta" root="true" support-unmarshall="false">
<id name="ctaIdentificationId">
<meta-data>cta_id</meta-data>
</id>
<dynamic-meta-data name="ncaName" converter="jexl" store="yes">data.submissionOrg.name
</dynamic-meta-data>
<property name="fullTitle">
<meta-data>cta_full_title</meta-data>
</property>
<property name="sponsorProtocolVersionDate">
<meta-data format="yyyy-MM-dd" store="no">cta_sponsor_protocol_version_date</meta-data>
</property>
<property name="isResubmission">
<meta-data converter="shortToYesNoNaConverter" store="no">cta_is_resubmission</meta-data>
</property>
<component name="eudractNumber" />
</class>
<class name="eudractnumber.EudractNumber" alias="eudract_number" root="false">
<property name="eudractNumberId">
<meta-data converter="dashHandlingConverter" store="no">filteredEudractNumberId</meta-data>
<meta-data>eudract_number</meta-data>
</property>
<property name="paediatricClinicalTrial">
<meta-data converter="shortToYesNoNaConverter" store="no">paediatric_clinical_trial
</meta-data>
</property>
</class>
.....
</compass-core-mapping>
Frameworks – Solr Document definition via DB mapping
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document name="products">
<entity name="item" query="select * from item">
<field column="ID" name="id" />
<field column="NAME" name="name" />
<field column="MANU" name="manu" />
<field column="WEIGHT" name="weight" />
<field column="PRICE" name="price" />
<field column="POPULARITY" name="popularity" />
<field column="INSTOCK" name="inStock" />
<field column="INCLUDES" name="includes" />
<entity name="feature" query="select description from feature where item_id='${item.ID}'">
<field name="features" column="description" />
</entity>
<entity name="item_category" query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
<entity name="category" query="select description from category where id = '${item_category.CATEGORY_ID}'">
<field column="description" name="cat" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
Frameworks – Compass/Lucene Configuration
<compass name="default">
<setting name="compass.transaction.managerLookup">
org.compass.core.transaction.manager.OC4J</setting>
<setting name="compass.transaction.factory">
org.compass.core.transaction.JTASyncTransactionFactory</setting>
<setting name="compass.transaction.lockPollInterval">400</setting>
<setting name="compass.transaction.lockTimeout">90</setting>
<setting name="compass.engine.connection">file://P:/Tmp/stelinio</setting>
<!--<setting name="compass.engine.connection">
jdbc://jdbc/EudractV8DataSourceSecure</setting>-->
<!--<setting name="compass.engine.store.jdbc.connection.provider.class">-->
<!--org.compass.core.lucene.engine.store.jdbc.JndiDataSourceProvider-->
<!--</setting>-->
<!--<setting name="compass.engine.ramBufferSize">512</setting>-->
<!--<setting name="compass.engine.maxBufferedDocs">-1</setting>-->
<setting name="compass.converter.dashHandlingConverter.type">
eu.emea.eudract.compasssearch.DashHandlingConverter
</setting>
<setting name="compass.converter.shortToYesNoNaConverter.type">
eu.emea.eudract.compasssearch.ShortToYesNoNaConverter
</setting>
<setting name="compass.converter.shortToPerDayOrTotalConverter.type">
eu.emea.eudract.compasssearch.ShortToPerDayOrTotalConverter
</setting>
<setting name="compass.engine.store.jdbc.dialect">
org.apache.lucene.store.jdbc.dialect.OracleDialect
</setting>
<setting name="compass.engine.analyzer.default.type">
org.apache.lucene.analysis.standard.StandardAnalyzer
</setting>
</compass>
Cool extra features - Spellchecking
- You will need a dictionary of valid words
  - You could use the unique terms in your index
- Given the dictionary you could
  - Use a sounds-like algorithm such as Soundex or Metaphone
  - Or use n-grams
    - E.g. squirrel as a 3gram is squ, qui, uir, irr, rre, rel. As a 4gram: squi, quir, uirr, irre, rrel. Mistakenly searching for squirel would match 6 grams: 4 of the 3grams (squ, qui, uir, rel) and 2 of the 4grams (squi, quir). This would score high!
- To present or not to present (the suggestion)
  - Maybe use the Levenshtein distance
- Other ideas
  - Use the rest of the terms in the query to bias
  - Maybe combine distance with frequency of term
  - Check result numbers in initial and corrected searches
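Both ingredients mentioned above, n-gram overlap for finding candidates and Levenshtein distance for deciding whether a candidate is close enough to present, are small enough to sketch in plain Java. The class `SpellcheckSketch` and its method names are hypothetical; this is not Lucene's spellchecker, only the two measures it can be built from.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch of the two spellchecking measures: shared n-grams and Levenshtein edit distance. */
class SpellcheckSketch {
    /** All distinct substrings of length n, in order of first appearance. */
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** Number of distinct n-grams the two words have in common. */
    static int sharedGrams(String a, String b, int n) {
        Set<String> shared = ngrams(a, n);
        shared.retainAll(ngrams(b, n));
        return shared.size();
    }

    /** Classic dynamic-programming edit distance (insertions, deletions, substitutions). */
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            previous = current;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(ngrams("squirrel", 3)); // [squ, qui, uir, irr, rre, rel]
        System.out.println(sharedGrams("squirel", "squirrel", 3));
        System.out.println(sharedGrams("squirel", "squirrel", 4));
        System.out.println(levenshtein("squirel", "squirrel")); // 1
    }
}
```

One possible design is to use the n-gram overlap to shortlist dictionary words cheaply and then apply the edit distance only to that shortlist, since the distance computation is quadratic in word length.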
Even More features
- Sorting
  - Use a field for sorting instead of relevance, e.g. when you use the MatchAllDocsQuery
  - Beware: it uses the FieldCache, which resides in RAM!
- SpanQueries
  - Distance between terms (span)
  - Family of queries like SpanNearQuery, SpanOrQuery and others
- Synonyms
  - Injection during indexing or during searching?
    - A MultiPhraseQuery is appropriate for injection at search time
    - Injection during indexing will allow faster searches
  - Leverage a synonyms knowledge base
    - A good strategy is to convert it into an index
  - The key thing to understand is that synonyms must be injected at the same position increments
- Spatial Searches
  - Answer to the query "Greek Restaurants Near Me"
  - An efficient technique is to use grids
    - Assign non-unique grid numbers to areas (e.g. in a Mercator space)
    - Index documents with a field containing the grid numbers that match their longitude and latitude
- MoreLikeThis
  - One use of term vectors
- Function Queries
  - e.g. add boosts for fields at search time
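The grid technique under Spatial Searches can be sketched as follows: each document is indexed with the id of the cell its coordinates fall into, and a "near me" query becomes a lookup of the user's cell plus its neighbours. This sketch assumes a plain latitude/longitude grid with a fixed cell size in degrees rather than a Mercator projection, and the class `GridSketch`, the `row_col` id format and the sample coordinates are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of grid-based spatial indexing on a plain lat/lon grid (assumption: no Mercator projection). */
class GridSketch {
    /** Id of the cell containing the point; all points in the same cell share it, so it is non-unique. */
    static String cellId(double lat, double lon, double cellDegrees) {
        long row = (long) Math.floor((lat + 90.0) / cellDegrees);
        long col = (long) Math.floor((lon + 180.0) / cellDegrees);
        return row + "_" + col;
    }

    /**
     * The cell containing the point plus its eight neighbours; a query would OR these ids together.
     * Wrap-around at the poles and the antimeridian is ignored in this sketch.
     */
    static List<String> cellsAround(double lat, double lon, double cellDegrees) {
        List<String> cells = new ArrayList<>();
        for (int dRow = -1; dRow <= 1; dRow++) {
            for (int dCol = -1; dCol <= 1; dCol++) {
                cells.add(cellId(lat + dRow * cellDegrees, lon + dCol * cellDegrees, cellDegrees));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Two nearby points fall into the same one-degree cell
        System.out.println(cellId(37.98, 23.73, 1.0)); // 127_203
        System.out.println(cellId(37.50, 23.20, 1.0)); // 127_203
        System.out.println(cellsAround(37.98, 23.73, 1.0).size()); // 9
    }
}
```

Including the neighbouring cells matters because a point close to a cell border has nearby documents on the other side of that border.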
Some last things to bear in mind
- It would be wise to back up your index
  - You can have hot backups (supported through the SnapshotDeletionPolicy)
- Performance has some trade-offs
  - Search latency
  - Indexing throughput
  - Near-real-time results
  - Index replication
  - Index optimization
- Resource consumption
  - Disk space
  - File descriptors
  - Memory
- "Luke" is a really handy tool
- You can repair a corrupted index (corrupted segments are just lost… D'oh!)
Resources
Book: Lucene in Action
Solr: http://lucene.apache.org/solr/
Vector Space Model: http://en.wikipedia.org/wiki/Vector_Space_Model
IR Links: http://wiki.apache.org/lucene-java/InformationRetrieval
That’s it
Questions?