luceneintroduction-091118115519-phpapp01

Upload: felipe-guimaraes

Post on 07-Apr-2018

TRANSCRIPT

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    1/26

    Lucene Introduction

    Otis Gospodnetic, Sematext Intl @otisg

    [email protected]

    http://jroller.com/otis

    http://sematext.com/

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    2/26

    About Otis

    Lucener since pre-Apache (cca 2000)

    Committer: Lucene, Solr, Nutch, Mahout, Open Relevance

    Lucene in Action 1 & 2 co-author

    Solr in Action author

    Sematext co-founder

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    3/26

    What is Lucene?

    Free, ASL, Java IR library, Jar

    Doug Cutting, ASF, 2001

    Application agnostic: Indexing & Searching

    High performance, scalable

    No dependencies

    Heavily ported

    Otis Gospodnetic, Sematext Intl

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    4/26

    What Lucene Ain't

    Turn key solution

    Application, no installer/wizard needed

    (Web) crawler

    Insert-doc-format-here parser / filter

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    5/26

    The Lucene Family

    Lucene vs. Apache Lucene vs. Java Lucene: IR library

    Nutch: Hadoop-loving crawler, indexer, searcher for web-wide scale SE

    Solr: Search server

    Droids: Standalone framework for writing crawlers

    Lucene.Net: C#, Incubator graduate

    Lucy: C Lucene impl

    Mahout: Hadoop-loving ML library

    Open Relevance: Relevance judgments

    PyLucene: Python port

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    6/26

    Integration

    [Diagram: data sources feed the Gather, Parse and Make Doc steps, which write to the Index; a Search UI and a Search App (e.g. webapp) run Search against the Index.]

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    7/26

    Integration: Rich Doc Indexing

    [Diagram: HTML, PDF and MS Word sources go through Gather, Parse (with Tika) and Make Doc, and are then written to the Index.]

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    8/26

    Lucene Strengths

    Simple API

    Fast

    Concurrent indexing and searching

    Incremental indexing

    NRT: Near-Real-Time

    Boolean + Vector space, sorting, etc.

    Cheap

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    9/26

    Query Types

    Single and multi-term queries

    Phrase queries (sloppiness allowed)

    Wildcard and fuzzy

    Range queries

    Boolean: required, prohibited, should

    Grouping

    Fields
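The required / prohibited / should distinction can be sketched without Lucene at all. The class below is a toy, not Lucene's BooleanQuery: `ToyBooleanMatch`, `terms` and `matches` are hypothetical names, documents are plain whitespace-separated strings, and optional ("should") terms only decide a match when no required term is given.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ToyBooleanMatch {

    /** Splits a whitespace-separated string into a set of terms. */
    static Set<String> terms(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptySet();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Checks one document against required, prohibited and optional terms. */
    public static boolean matches(String doc, String required, String prohibited, String should) {
        Set<String> docTerms = terms(doc);
        Set<String> must = terms(required);
        if (!docTerms.containsAll(must)) {
            return false;                      // every required term must be present
        }
        for (String t : terms(prohibited)) {
            if (docTerms.contains(t)) {
                return false;                  // no prohibited term may be present
            }
        }
        if (!must.isEmpty()) {
            return true;                       // optional terms would only affect ranking
        }
        for (String t : terms(should)) {
            if (docTerms.contains(t)) {
                return true;                   // without required terms, one optional hit is needed
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("dog beagle", "dog", "snoopy", ""));   // true
        System.out.println(matches("dog snoopy", "dog", "snoopy", ""));   // false
        System.out.println(matches("pork", "", "", "pork flu"));          // true
    }
}
```

Scoring is left out on purpose; the sketch only answers whether a document matches.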

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    10/26

    Query Syntax

    +monkey +banana   (same as: monkey AND banana)

    +dog -snoopy   (same as: dog AND NOT snoopy)

    "pork flu"

    "pork flu" -"new york"   (same as: "pork flu" NOT "new york")

    "sweet pork"~3

    natur*

    schmidt~

    createDate:[200901 TO 201001]

    author:doug

    author:"doug cutting"

    author:"doug cutting" AND project:(lucene OR nutch OR hadoop)

    title:lucene^5.0 body:lucene
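A fuzzy query such as `schmidt~` matches terms that are close to the query term by edit distance. As a rough illustration, here is the classic Levenshtein distance in plain Java. `ToyEditDistance` is a hypothetical name, and this is a textbook sketch rather than the code Lucene runs internally.

```java
public class ToyEditDistance {

    /** Levenshtein distance: minimum number of single-character edits from a to b. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                       // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                       // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;                  // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("schmidt", "schmitt"));  // 1
        System.out.println(distance("kitten", "sitting"));   // 3
    }
}
```

A fuzzy match would then accept index terms whose distance to the query term stays under some threshold; how that threshold is chosen is not covered by the slide.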

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    11/26

    Code: FS Indexer

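    import java.io.File;
    import java.io.FileFilter;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class Indexer {
        private IndexWriter writer;

        public Indexer(String indexDir) throws IOException {
            Directory dir = FSDirectory.open(new File(indexDir));
            // true = create a new index, replacing any existing one
            writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);
        }

        public void close() throws IOException {
            writer.close();
        }

        public void index(String dataDir, FileFilter filter) throws Exception {
            // Pass the filter to listFiles so that it is actually applied
            File[] files = new File(dataDir).listFiles(filter);
            for (File f : files) {
                Document doc = new Document();
                // Reader-valued field: tokenized and indexed, but not stored
                doc.add(new Field("contents", new FileReader(f)));
                // Stored as-is so the search side can display it
                doc.add(new Field("filename", f.getName(),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.addDocument(doc);
            }
        }
    }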

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    12/26

    Indexing Pipeline

    [Diagram: a Document is added to the Document Writer, its text passes through the Tokenizer and TokenFilter, and the result is written to the Inverted Index.]

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    13/26

    Indexer Pipeline: Analysis

    Source: Lucene in Action

    1 Tokenizer

    N TokenFilters

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    14/26

    Analysis in Action

    "The quick brown fox jumped over the lazy dogs"

    WhitespaceAnalyzer:

    [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

    SimpleAnalyzer:

    [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

    StopAnalyzer:

    [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

    StandardAnalyzer:

    [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

    "XY&Z Corporation - [email protected]"

    WhitespaceAnalyzer:

    [XY&Z] [Corporation] [-] [[email protected]]

    SimpleAnalyzer:

    [xy] [z] [corporation] [xyz] [example] [com]

    StopAnalyzer:

    [xy] [z] [corporation] [xyz] [example] [com]

    StandardAnalyzer:

    [xy&z] [corporation] [[email protected]]
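The "1 Tokenizer, N TokenFilters" idea behind the analyzer outputs above can be mimicked in a few lines of plain Java. This is a toy sketch that does not use Lucene classes: `ToyAnalyzer` and its methods are hypothetical names, the tokenizer simply keeps runs of letters, and the stop list is a small made-up set rather than Lucene's built-in one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ToyAnalyzer {
    // Illustrative stop list (an assumption, not Lucene's default set).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Tokenizer: emits maximal runs of letters and drops everything else. */
    public static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Filter 1: lower-cases every token. */
    public static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Filter 2: drops tokens found in the stop list. */
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** One tokenizer plus one filter, similar in spirit to SimpleAnalyzer. */
    public static List<String> simpleAnalyze(String text) {
        return lowerCase(letterTokenize(text));
    }

    /** One tokenizer plus two filters, similar in spirit to StopAnalyzer. */
    public static List<String> stopAnalyze(String text) {
        return removeStopWords(simpleAnalyze(text));
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dogs";
        System.out.println(simpleAnalyze(text)); // [the, quick, brown, fox, jumped, over, the, lazy, dogs]
        System.out.println(stopAnalyze(text));   // [quick, brown, fox, jumped, over, lazy, dogs]
        System.out.println(simpleAnalyze("XY&Z Corporation")); // [xy, z, corporation]
    }
}
```

Each analyzer is just a different composition of the same building blocks, which is the point of the pipeline.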

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    15/26

    Field Options

    Doc has 1+ Fields. Field has name+value

    Field.Index.(no, (not) analyzed, no norms, not analyzed no norms)

    Field.Store.(yes, no)

    Field.TermVector.(yes, no, with pos., with offset, with both)

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    16/26

    Inverted Index

    Source: developer.apple.com
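Since the figure itself is not in the transcript, a tiny in-memory version may help: an inverted index maps each term to the sorted set of documents that contain it. `ToyInvertedIndex` is a hypothetical class for illustration only; Lucene's on-disk structure is far more elaborate.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class ToyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Splits the text on non-word characters and records docId under every term. */
    public void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;                      // split can yield a leading empty string
            }
            SortedSet<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    /** Returns the ids of all documents containing the term (empty if none). */
    public SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(0, "Lucene is a search library");
        index.add(1, "Solr is a search server");
        System.out.println(index.lookup("search")); // [0, 1]
        System.out.println(index.lookup("solr"));   // [1]
    }
}
```

Looking up a term is a single map access, which is why search cost depends on the number of matching documents rather than on the size of the collection.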

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    17/26

    Index Directory

    # ls -lh
    total 1.1G
    -rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
    -rw-r--r-- 1 root root  44M 2009-03-14 10:29 _0.fdx
    -rw-r--r-- 1 root root   33 2009-03-14 10:31 _9j.fnm
    -rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
    -rw-r--r-- 1 root root  11M 2009-03-14 10:36 _9j.nrm
    -rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
    -rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
    -rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
    -rw-r--r-- 1 root root   64 2009-03-14 10:36 segments_2
    -rw-r--r-- 1 root root   20 2009-03-14 10:36 segments.gen

    Details: http://lucene.apache.org/java/2_9_0/fileformats.html

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    18/26

    Code: Searcher

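    public void search(String indexDir, String q) throws IOException, ParseException {
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexSearcher is = new IndexSearcher(dir, true);  // true = read-only
        QueryParser parser = new QueryParser("contents",
                new StandardAnalyzer(Version.LUCENE_CURRENT));
        Query query = parser.parse(q);
        TopDocs hits = is.search(query, 10);  // keep the 10 best hits
        System.err.println("Found " + hits.totalHits + " document(s)");
        // The transcript cuts off after "for (int i=0; i". The rest of the loop is an
        // assumed reconstruction that prints the stored "filename" field of each hit.
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            Document doc = is.doc(hits.scoreDocs[i].doc);
            System.out.println(doc.get("filename"));
        }
        is.close();
    }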

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    19/26

    Code: Doc Deletion

    Via IndexReader

    void deleteDocument(int docNum)

    Deletes the document numbered docNum

    int deleteDocuments(Term term)

    Deletes all documents that have a given term indexed.

    Via IndexWriter

    void deleteAll()

    Delete all documents in the index.

    void deleteDocuments(Query query)

    Deletes the document(s) matching the provided query.

    void deleteDocuments(Query[] queries)

    Deletes the document(s) matching any of the provided queries.

    void deleteDocuments(Term term)

    Deletes the document(s) containing term.

    void deleteDocuments(Term[] terms)

    Deletes the document(s) containing any of the terms.

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    20/26

    Code: Doc Updates

    Via IndexWriter facade

    void updateDocument(Term term, Document doc)

    Updates a document by first deleting the document(s) containing term and then adding the new document.

    void updateDocument(Term term, Document doc, Analyzer analyzer)

    Updates a document by first deleting the document(s) containing term and then adding the new document.

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    21/26

    Pitfalls

    Update = delete + add

    No partial doc update

    No joins
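The first two pitfalls are easy to see in a toy document store. The sketch below is plain Java with hypothetical names (`ToyUpdate`, `doc`, `update`), not Lucene code: an update removes every document whose field has the given value and then appends the replacement, so any field missing from the replacement is simply gone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class ToyUpdate {
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Builds a document from alternating field names and values. */
    public static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    public void add(Map<String, String> d) {
        docs.add(new HashMap<>(d));
    }

    /** Removes all documents whose field equals value; returns how many were removed. */
    public int delete(String field, String value) {
        int removed = 0;
        for (Iterator<Map<String, String>> it = docs.iterator(); it.hasNext();) {
            if (value.equals(it.next().get(field))) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    /** Update = delete + add: the old document is dropped entirely, not patched. */
    public void update(String field, String value, Map<String, String> replacement) {
        delete(field, value);
        add(replacement);
    }

    public int size() {
        return docs.size();
    }

    public Map<String, String> get(int i) {
        return docs.get(i);
    }

    public static void main(String[] args) {
        ToyUpdate store = new ToyUpdate();
        store.add(doc("id", "1", "title", "Lucene", "body", "IR library"));
        store.update("id", "1", doc("id", "1", "title", "Apache Lucene"));
        System.out.println(store.size());             // 1
        System.out.println(store.get(0).get("body")); // null, the old body was not carried over
    }
}
```

To change a single field, the application has to rebuild the whole document from its own source data and submit it again.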

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    22/26

    Performance Tips

    Index: -Xmx, setRAMBufferSizeMB, !optimize, !compound, !NFS, multi-thread, analysis, NO_NORMS

    Search: 1 searcher, !NFS, RAM vs. heap, SSD, optimize, FieldSelector

    Details:

    http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

    http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    23/26

    Lucene 2.9 & 3.0

    Per-segment searching and caching (can lead to much faster reopen, among other things)

    Near real-time search (aka NRT)

    New Query types

    Smarter, more scalable multi-term queries (wildcard, range, etc.)

    Freshly optimized Collector/Scorer API

    Improved Unicode support and the addition of Collation contrib

    New Attribute-based TokenStream API

    New QueryParser framework in contrib, with a core QueryParser replacement impl included

    Scoring is now optional when sorting by Field or using a custom Collector, gaining sizable performance when scores are not required

    New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)

    New fast-vector-highlighter for large documents

    Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple-to-use and much faster numeric range searching without having to externally pre-process numeric values into textual values.

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    24/26

    Community

    [email protected] [email protected]

    "I posted, went to get a sandwich, and came back to see two answers.

    The change works, and I can get the fix into production today. This list is magic."

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    25/26

    Resources

    http://lucene.apache.org/java

    Wiki, MLs, javadoc

    http://manning.com/lucene

    LIA2 soon, MEAP available

    @lucene

  • 8/4/2019 luceneintroduction-091118115519-phpapp01

    26/26

    Contact

    @otisg

    [email protected]

    [email protected]

    jroller.com/otis

    blog.sematext.com
