hatcher erik lucene

Upload: ritesh

Post on 31-May-2018

232 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/14/2019 hatcher erik lucene

    1/32

    presented byErik Hatcher

    OReilly Open Source ConventionJuly 7 - 11, 2003

    Introducing Lucene

  • 8/14/2019 hatcher erik lucene

    2/32

    2

    High-performance, full-featured text searchengine

    100% Java http://jakarta.apache.org/lucene It is an API only, but...

    What is Lucene

  • 8/14/2019 hatcher erik lucene

    3/32

    3

    Learn what Lucene is, see how easy it is touse

    Gain appreciation for what is hidden

    underneath Benefit from some best practices andexperience

    Goals

  • 8/14/2019 hatcher erik lucene

    4/32

    4

    Powered by BobDylan.com

    jGuru ZOE Jive Forums

    Scarab JavaDevWithAnt BlogScene

    ... and many others

  • 8/14/2019 hatcher erik lucene

    5/32

    5

    An index contains a sequence of documents A document is a sequence of fields

    A field is a named sequence of terms A term is a text string, keyed by field name

    Concepts

  • 8/14/2019 hatcher erik lucene

    6/32

    6

    Simple Examples

    Because for the developer, Lucene is, in fact,quite simple

    Keep this simplicity in mind as we delveunder the hood later

  • 8/14/2019 hatcher erik lucene

    7/327

    Quiz: what is a document?

    Indexing

  • 8/14/2019 hatcher erik lucene

    8/328

    Example Data

    String[] docs = {"doc1 - present!","doc2 is right here","and do not forget lil ol doc3"

    };

  • 8/14/2019 hatcher erik lucene

    9/329

    Creating an Index

    Directory directory = new RAMDirectory(); Analyzer analyzer = new StandardAnalyzer();

    IndexWriter writer =new IndexWriter(directory,

    analyzer, true);

    for (int j = 0; j < docs.length; j++) {Document d = new Document();d.add(Field.Text("contents", docs[j]));writer.addDocument(d);

    }

    writer.close();

  • 8/14/2019 hatcher erik lucene

    10/3210

    ... is where Lucene really shines

    Searching

  • 8/14/2019 hatcher erik lucene

    11/3211

    Searching

    Searcher searcher =new IndexSearcher(directory);

    Query query =new TermQuery(

    new Term("contents", "doc1"));

    Hits hits = searcher.search(query);System.out.println("doc1 hits = " +

    hits.length());

    searcher.close();

    doc1 hits = 1

  • 8/14/2019 hatcher erik lucene

    12/3212

    Stores statistics about terms For a term, the documents that contain it can

    be quickly retrieved

    Inverted Index

  • 8/14/2019 hatcher erik lucene

    13/3213

    An index is composed of segments

    Segments are created as documents areadded, and can be merged

    A segment is a fully independent index

    Searching can (and often does) spanmultiple segments Each segment index contains:

    field names, stored field values, termdictionary, frequency data, proximitydata, normalization factors, deleteddocuments

    Index Structure

  • 8/14/2019 hatcher erik lucene

    14/3214

    Stored- Non-inverted, content retrievable

    Indexed

    - Inverted, searchable Tokenized- Broken into tokens

    Fields

  • 8/14/2019 hatcher erik lucene

    15/3215

    Field object

    method Stored? Indexed? Tokenized?

    Keyword 4 4

    Text

    (w/Reader)4 4

    Text(w/String)

    4 4 4

    UnIndexed 4

    UnStored 4 4

  • 8/14/2019 hatcher erik lucene

    16/3216

    Field examples

    doc.add(Field.Text("title", title));doc.add(Field.UnIndexed("body", body));

    doc.add(Field.Keyword("category",category));

    doc.add(Field.Keyword("path", category));doc.add(Field.Keyword("path", ymd));

    doc.add(Field.UnStored("contents", title +" " + bodyText +" " + category +

    " " + docname));

  • 8/14/2019 hatcher erik lucene

    17/3217

    Tokenize text Can be field-specific, but not generally Used during both indexing and searching

    Choosing (or writing) the right Analyzer(s) isone of the most important decision to makewith Lucene!

    Analyzers

  • 8/14/2019 hatcher erik lucene

    18/32

    18

    Tokenize a phrase

    Analyzer Demo

  • 8/14/2019 hatcher erik lucene

    19/32

    19

    StandardAnalyzer Features

    Sequences of letters and digits Interior apostrophe's (e.g. o'clock) Acroynms (e.g. H.P.) Companies (e.g. AT&T) E-mail addresses Hostname (e.g. jakarta.apache.org) Serial numbers, IP addresses, floating

    point numbers (e.g. R2D2 and192.168.1.1)

  • 8/14/2019 hatcher erik lucene

    20/32

    20

    GermanAnalyzer RussianAnalyzer SnowballAnalyzer (currently in "sandbox")

    Danish, Dutch, English, Finnish, French,German, Italian, Lovins, Norwegian, Porter,Portuguese, Russian, Spanish, Swedish

    Other Analyzers

  • 8/14/2019 hatcher erik lucene

    21/32

    21

    Instantiate IndexSearcher Call one of the search methods

    search(Query)

    search(Query, Filter) search(Query, HitCollector)

    Process the hits

    Searching

  • 8/14/2019 hatcher erik lucene

    22/32

    22

    Subclasses include:

    TermQuery, RangeQuery, PrefixQuery,PhraseQuery, WildcardQuery, FuzzyQuery,and BooleanQuery

    Use these subclasses for code-generatedqueries

    QueryParser is useful for human-enteredqueries

    Query object

  • 8/14/2019 hatcher erik lucene

    23/32

    23

    Provides powerful expression syntax Boolean operators (+, AND, OR) Field selection (title:foo) Wildcard support (blog*, bl?g, b*g) Fuzzy terms (socks~) using Levenshtein

    algorithm Proximity ("blog lucene"~5) Boosting (ant^4 blog) Grouping ((lucene OR blog) AND jakarta)

    Range (20020101 TO 20021231) -lexicographic

    Special character escaping (+ - && || ( ) { } [] ^ " ~ * ? : \)

    QueryParser

  • 8/14/2019 hatcher erik lucene

    24/32

    24

    QueryParser example

    Analyzer analyzer = new StandardAnalyzer();

    String q = "+blog -radio +title:lucene";

    Query query =QueryParser.parse(q, // query

    "contents", // field analyzer);

    Hits hits = searcher.search(query);

  • 8/14/2019 hatcher erik lucene

    25/32

    25

    Deleting documents

    IndexReader reader = ....

    // By document id reader.delete(docId);

    // or by Term int numDeleted =

    reader.delete(new Term("contents", "xyz"));

    Documents have a document id id can change when index is optimized

    Do not rely on document id's between sessions(essentially per IndexReader instantiation)

  • 8/14/2019 hatcher erik lucene

    26/32

    26

    Is Lucene thread-safe?

    IndexSearcher and IndexWriter are thread-safe.QueryParser is not.

    Do document id's change?Yes.

    How do I update a document in an index?Delete it and re-add it.

    Can Lucene index documents of type ...?

    Can you extract the text from files of that type? How do I deal with dates?

    Lexicographically

    A few FAQs

  • 8/14/2019 hatcher erik lucene

    27/32

    27

    Sandbox

    CVS: jakarta-lucene-sandbox LARM - Lucene Advanced Retrieval Machine Others

    XML-Indexing-Demo searchbean parsers - only PDF currently javascript - browser query creation/validation

    utils ant - create/update index during build process Etc... check repository before writing your own

    Batteries not included

  • 8/14/2019 hatcher erik lucene

    28/32

    28

    PDFBox - http://www.pdfbox.org POI - http://jakarta.apache.org/poi

    TextMining - http://www.textmining.org wraps PDFBox and POI

    Third-party

  • 8/14/2019 hatcher erik lucene

    29/32

    29

    Michaels.com - searching for colors JavaDevWithAnt - documentation search

    engine

    ZOE - "intertwingled e-mail" BlogScene - flattened datastore

    Interesting Usages

  • 8/14/2019 hatcher erik lucene

    30/32

    30

    Lucene is a fast, flexible, Java-basedsearch engine

    It is an API, with a few builtinconveniences, but the developer must do

    some work Its very simple to realize powerful

    capabilities

    Lots of tips, examples, and layers abovethe API available online

    Session Summary

  • 8/14/2019 hatcher erik lucene

    31/32

    31

    Lucene official sitehttp://jakarta.apache.org/lucene "Search-Enable Your Application with

    Lucene" by Craig WallsJava Developers Journal, December 2002

    JavaDevWithAnthttp://www.ehatchersolutions.com/JavaDevWithAnt

    from Java Development with Ant (Manning) jGuru Lucene FAQ

    http://www.jguru.com/faq/Lucene

    Resources

  • 8/14/2019 hatcher erik lucene

    32/32