hatcher erik lucene
TRANSCRIPT
-
8/14/2019 hatcher erik lucene
1/32
presented byErik Hatcher
OReilly Open Source ConventionJuly 7 - 11, 2003
Introducing Lucene
-
8/14/2019 hatcher erik lucene
2/32
2
High-performance, full-featured text searchengine
100% Java http://jakarta.apache.org/lucene It is an API only, but...
What is Lucene
-
8/14/2019 hatcher erik lucene
3/32
3
Learn what Lucene is, see how easy it is touse
Gain appreciation for what is hidden
underneath Benefit from some best practices andexperience
Goals
-
8/14/2019 hatcher erik lucene
4/32
4
Powered by BobDylan.com
jGuru ZOE Jive Forums
Scarab JavaDevWithAnt BlogScene
... and many others
-
8/14/2019 hatcher erik lucene
5/32
5
An index contains a sequence of documents A document is a sequence of fields
A field is a named sequence of terms A term is a text string, keyed by field name
Concepts
-
8/14/2019 hatcher erik lucene
6/32
6
Simple Examples
Because for the developer, Lucene is, in fact,quite simple
Keep this simplicity in mind as we delveunder the hood later
-
8/14/2019 hatcher erik lucene
7/327
Quiz: what is a document?
Indexing
-
8/14/2019 hatcher erik lucene
8/328
Example Data
String[] docs = {"doc1 - present!","doc2 is right here","and do not forget lil ol doc3"
};
-
8/14/2019 hatcher erik lucene
9/329
Creating an Index
Directory directory = new RAMDirectory(); Analyzer analyzer = new StandardAnalyzer();
IndexWriter writer =new IndexWriter(directory,
analyzer, true);
for (int j = 0; j < docs.length; j++) {Document d = new Document();d.add(Field.Text("contents", docs[j]));writer.addDocument(d);
}
writer.close();
-
8/14/2019 hatcher erik lucene
10/3210
... is where Lucene really shines
Searching
-
8/14/2019 hatcher erik lucene
11/3211
Searching
Searcher searcher =new IndexSearcher(directory);
Query query =new TermQuery(
new Term("contents", "doc1"));
Hits hits = searcher.search(query);System.out.println("doc1 hits = " +
hits.length());
searcher.close();
doc1 hits = 1
-
8/14/2019 hatcher erik lucene
12/3212
Stores statistics about terms For a term, the documents that contain it can
be quickly retrieved
Inverted Index
-
8/14/2019 hatcher erik lucene
13/3213
An index is composed of segments
Segments are created as documents areadded, and can be merged
A segment is a fully independent index
Searching can (and often does) spanmultiple segments Each segment index contains:
field names, stored field values, termdictionary, frequency data, proximitydata, normalization factors, deleteddocuments
Index Structure
-
8/14/2019 hatcher erik lucene
14/3214
Stored- Non-inverted, content retrievable
Indexed
- Inverted, searchable Tokenized- Broken into tokens
Fields
-
8/14/2019 hatcher erik lucene
15/3215
Field object
method Stored? Indexed? Tokenized?
Keyword 4 4
Text
(w/Reader)4 4
Text(w/String)
4 4 4
UnIndexed 4
UnStored 4 4
-
8/14/2019 hatcher erik lucene
16/3216
Field examples
doc.add(Field.Text("title", title));doc.add(Field.UnIndexed("body", body));
doc.add(Field.Keyword("category",category));
doc.add(Field.Keyword("path", category));doc.add(Field.Keyword("path", ymd));
doc.add(Field.UnStored("contents", title +" " + bodyText +" " + category +
" " + docname));
-
8/14/2019 hatcher erik lucene
17/3217
Tokenize text Can be field-specific, but not generally Used during both indexing and searching
Choosing (or writing) the right Analyzer(s) isone of the most important decision to makewith Lucene!
Analyzers
-
8/14/2019 hatcher erik lucene
18/32
18
Tokenize a phrase
Analyzer Demo
-
8/14/2019 hatcher erik lucene
19/32
19
StandardAnalyzer Features
Sequences of letters and digits Interior apostrophe's (e.g. o'clock) Acroynms (e.g. H.P.) Companies (e.g. AT&T) E-mail addresses Hostname (e.g. jakarta.apache.org) Serial numbers, IP addresses, floating
point numbers (e.g. R2D2 and192.168.1.1)
-
8/14/2019 hatcher erik lucene
20/32
20
GermanAnalyzer RussianAnalyzer SnowballAnalyzer (currently in "sandbox")
Danish, Dutch, English, Finnish, French,German, Italian, Lovins, Norwegian, Porter,Portuguese, Russian, Spanish, Swedish
Other Analyzers
-
8/14/2019 hatcher erik lucene
21/32
21
Instantiate IndexSearcher Call one of the search methods
search(Query)
search(Query, Filter) search(Query, HitCollector)
Process the hits
Searching
-
8/14/2019 hatcher erik lucene
22/32
22
Subclasses include:
TermQuery, RangeQuery, PrefixQuery,PhraseQuery, WildcardQuery, FuzzyQuery,and BooleanQuery
Use these subclasses for code-generatedqueries
QueryParser is useful for human-enteredqueries
Query object
-
8/14/2019 hatcher erik lucene
23/32
23
Provides powerful expression syntax Boolean operators (+, AND, OR) Field selection (title:foo) Wildcard support (blog*, bl?g, b*g) Fuzzy terms (socks~) using Levenshtein
algorithm Proximity ("blog lucene"~5) Boosting (ant^4 blog) Grouping ((lucene OR blog) AND jakarta)
Range (20020101 TO 20021231) -lexicographic
Special character escaping (+ - && || ( ) { } [] ^ " ~ * ? : \)
QueryParser
-
8/14/2019 hatcher erik lucene
24/32
24
QueryParser example
Analyzer analyzer = new StandardAnalyzer();
String q = "+blog -radio +title:lucene";
Query query =QueryParser.parse(q, // query
"contents", // field analyzer);
Hits hits = searcher.search(query);
-
8/14/2019 hatcher erik lucene
25/32
25
Deleting documents
IndexReader reader = ....
// By document id reader.delete(docId);
// or by Term int numDeleted =
reader.delete(new Term("contents", "xyz"));
Documents have a document id id can change when index is optimized
Do not rely on document id's between sessions(essentially per IndexReader instantiation)
-
8/14/2019 hatcher erik lucene
26/32
26
Is Lucene thread-safe?
IndexSearcher and IndexWriter are thread-safe.QueryParser is not.
Do document id's change?Yes.
How do I update a document in an index?Delete it and re-add it.
Can Lucene index documents of type ...?
Can you extract the text from files of that type? How do I deal with dates?
Lexicographically
A few FAQs
-
8/14/2019 hatcher erik lucene
27/32
27
Sandbox
CVS: jakarta-lucene-sandbox LARM - Lucene Advanced Retrieval Machine Others
XML-Indexing-Demo searchbean parsers - only PDF currently javascript - browser query creation/validation
utils ant - create/update index during build process Etc... check repository before writing your own
Batteries not included
-
8/14/2019 hatcher erik lucene
28/32
28
PDFBox - http://www.pdfbox.org POI - http://jakarta.apache.org/poi
TextMining - http://www.textmining.org wraps PDFBox and POI
Third-party
-
8/14/2019 hatcher erik lucene
29/32
29
Michaels.com - searching for colors JavaDevWithAnt - documentation search
engine
ZOE - "intertwingled e-mail" BlogScene - flattened datastore
Interesting Usages
-
8/14/2019 hatcher erik lucene
30/32
30
Lucene is a fast, flexible, Java-basedsearch engine
It is an API, with a few builtinconveniences, but the developer must do
some work Its very simple to realize powerful
capabilities
Lots of tips, examples, and layers abovethe API available online
Session Summary
-
8/14/2019 hatcher erik lucene
31/32
31
Lucene official sitehttp://jakarta.apache.org/lucene "Search-Enable Your Application with
Lucene" by Craig WallsJava Developers Journal, December 2002
JavaDevWithAnthttp://www.ehatchersolutions.com/JavaDevWithAnt
from Java Development with Ant (Manning) jGuru Lucene FAQ
http://www.jguru.com/faq/Lucene
Resources
-
8/14/2019 hatcher erik lucene
32/32