Lucene Introduction

Otis Gospodnetic, Sematext Int’l
@[email protected]
http://jroller.com/otis
http://sematext.com/

About Otis

• Lucener since pre-Apache (ca. 2000)

• Committer: Lucene, Solr, Nutch, Mahout, Open Relevance

• Lucene in Action 1 & 2 co-author

• Solr in Action author

• Sematext co-founder

What is Lucene?

• Free, ASL, Java IR library, Jar

• Doug Cutting, ASF, 2001

• Application agnostic: Indexing & Searching

• High performance, scalable

• No dependencies

• Heavily ported

What Lucene Ain’t

• Turn key “solution”

• An application: there is no installer or wizard

• (Web) crawler

• Insert-doc-format-here parser / filter


The Lucene Family

• Lucene vs. Apache Lucene vs. Java Lucene: IR library

• Nutch: Hadoop-loving crawler, indexer, searcher for web-wide scale SE

• Solr: Search server

• Droids: Standalone framework for writing crawlers

• Lucene.Net: C#, Incubator graduate

• Lucy: C Lucene impl

• Mahout: Hadoop-loving ML library

• Open Relevance: Relevance judgments

• PyLucene: Python port


Integration

[Diagram: data sources are gathered, parsed, and made into Docs that are added to the Index; the Search UI of a search app (e.g. a webapp) searches that Index.]


Integration: Rich Doc Indexing

[Diagram: HTML, PDF, and MS Word sources are gathered, parsed with Tika, made into Docs, and added to the Index.]


Lucene Strengths

• Simple API

• Fast

• Concurrent indexing and searching

• Incremental indexing

• NRT: Near-Real-Time

• Boolean + Vector space, sorting, etc.

• Cheap


Query Types

• Single and multi-term queries

• Phrase queries (sloppiness allowed)

• Wildcard and fuzzy

• Range queries

• “Boolean”: required, prohibited, “should”

• Grouping

• Fields
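Fuzzy queries in the list above match terms that are spelled almost like the query term. A common way to measure "almost" is Levenshtein edit distance. The following is a minimal plain-Java sketch of that distance, not Lucene's own implementation; the class name EditDistanceSketch is made up for illustration.

```java
final class EditDistanceSketch {
    /**
     * Levenshtein distance: the minimum number of single-character
     * insertions, deletions, or substitutions that turn a into b.
     */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j chars of b takes j inserts.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // A fuzzy query for "schmidt" would be expected to reach close spellings.
        System.out.println(distance("schmidt", "schmitt"));
        System.out.println(distance("schmidt", "smith"));
    }
}
```

A fuzzy matcher would accept a candidate term when this distance is small relative to the term length; the exact threshold is a tuning choice and is not shown here.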


Query Syntax

• +monkey +banana   (same as: monkey AND banana)

• +dog -snoopy   (same as: dog AND NOT snoopy)

• "pork flu"

• "pork flu" -"new york"   (same as: "pork flu" NOT "new york")

• "sweet pork"~3

• natur*

• schmidt~

• createDate:[200901 TO 201001]

• author:doug

• author:"doug cutting"

• author:"doug cutting" AND project:(lucene OR nutch OR hadoop)

• title:lucene^5.0 body:lucene
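The + and - prefixes above mark clauses as required and prohibited; unmarked clauses are "should" clauses. The sketch below shows one way to think about that matching logic over a document's set of terms. It is plain Java written for illustration, not Lucene's BooleanQuery, and the class and method names are assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BooleanClauseSketch {
    /**
     * Returns true when the document (given as its set of terms) contains
     * every required term and no prohibited term. If there are no required
     * terms, at least one "should" term must be present.
     */
    static boolean matches(Set<String> docTerms, List<String> required,
                           List<String> prohibited, List<String> should) {
        for (String term : prohibited) {
            if (docTerms.contains(term)) {
                return false;
            }
        }
        for (String term : required) {
            if (!docTerms.contains(term)) {
                return false;
            }
        }
        if (!required.isEmpty()) {
            return true; // "should" terms would only influence ranking here
        }
        for (String term : should) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false; // nothing positive matched
    }

    public static void main(String[] args) {
        Set<String> doc1 = new HashSet<>(Arrays.asList("dog", "snoopy"));
        Set<String> doc2 = new HashSet<>(Arrays.asList("dog", "pluto"));
        List<String> none = Collections.emptyList();
        // Query: +dog -snoopy
        System.out.println(matches(doc1, Arrays.asList("dog"), Arrays.asList("snoopy"), none)); // false
        System.out.println(matches(doc2, Arrays.asList("dog"), Arrays.asList("snoopy"), none)); // true
    }
}
```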


Code: FS Indexer


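import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        // true = create a new index, replacing any existing one
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                                 IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    public void index(String dataDir, FileFilter filter) throws Exception {
        // listFiles(filter) accepts every file when filter is null
        File[] files = new File(dataDir).listFiles(filter);
        if (files == null) {
            return; // dataDir is not a readable directory
        }
        for (File f : files) {
            Document doc = new Document();
            // contents: tokenized from the reader, not stored
            doc.add(new Field("contents", new FileReader(f)));
            // filename: stored and indexed as a single token
            doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
    }
}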

Indexing Pipeline


[Diagram: an added Document passes through the Tokenizer and TokenFilter, and the DocumentWriter writes the result to the Inverted Index.]

Indexer Pipeline: Analysis

Source: Lucene in Action


• 1 Tokenizer

• N TokenFilters
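The chain above, one tokenizer followed by any number of filters, can be sketched in plain Java as follows. This is an illustration of the idea only; it does not use Lucene's Tokenizer or TokenFilter classes, the class name AnalysisSketch is made up, and the stop list is a tiny assumed sample rather than Lucene's default set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalysisSketch {
    // Small illustrative stop list (an assumption, not Lucene's default).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    /** Tokenizer: split on runs of whitespace, keeping case and punctuation. */
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Tokenizer: split on anything that is not a letter. */
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Filter: lowercase every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Filter: drop tokens that appear in the stop list. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dogs";
        // One tokenizer, then zero, one, or two filters chained after it.
        System.out.println(whitespaceTokenize(text));
        System.out.println(lowerCase(letterTokenize(text)));
        System.out.println(removeStopWords(lowerCase(letterTokenize(text))));
    }
}
```

The order of filters matters: lowercasing before stop-word removal lets "The" match the lowercase entry "the" in the stop list.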

Analysis in Action


"The quick brown fox jumped over the lazy dogs"
  WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
  SimpleAnalyzer:     [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
  StopAnalyzer:       [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
  StandardAnalyzer:   [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

"XY&Z Corporation - [email protected]"
  WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [[email protected]]
  SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
  StopAnalyzer:       [xy] [z] [corporation] [xyz] [example] [com]
  StandardAnalyzer:   [xy&z] [corporation] [[email protected]]

Field Options

• Doc has 1+ Fields. Field has name+value

• Field.Index: NO, ANALYZED, NOT_ANALYZED, ANALYZED_NO_NORMS, NOT_ANALYZED_NO_NORMS

• Field.Store: YES, NO

• Field.TermVector: NO, YES, WITH_POSITIONS, WITH_OFFSETS, WITH_POSITIONS_OFFSETS


Inverted Index

Source: developer.apple.com
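An inverted index maps each term to the list of documents that contain it, so a lookup by term does not need to scan the documents. The following is a minimal in-memory sketch in plain Java, not Lucene's on-disk format; the class name InvertedIndexSketch and its simple lowercase-and-split tokenization are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    // term -> ascending list of ids of the documents containing that term
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds a document and returns the id assigned to it. */
    int add(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of documents containing the term, or an empty list. */
    List<Integer> lookup(String term) {
        List<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<Integer>emptyList()
                            : Collections.unmodifiableList(docs);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene is a Java IR library");            // doc 0
        index.add("Solr is a search server built on Lucene"); // doc 1
        index.add("Nutch is a crawler");                      // doc 2
        System.out.println(index.lookup("lucene"));  // [0, 1]
        System.out.println(index.lookup("crawler")); // [2]
    }
}
```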


Index Directory

# ls -lh
total 1.1G
-rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
-rw-r--r-- 1 root root  44M 2009-03-14 10:29 _0.fdx
-rw-r--r-- 1 root root   33 2009-03-14 10:31 _9j.fnm
-rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
-rw-r--r-- 1 root root  11M 2009-03-14 10:36 _9j.nrm
-rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
-rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
-rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
-rw-r--r-- 1 root root   64 2009-03-14 10:36 segments_2
-rw-r--r-- 1 root root   20 2009-03-14 10:36 segments.gen

Details: http://lucene.apache.org/java/2_9_0/fileformats.html



Code: Searcher


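import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public void search(String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher is = new IndexSearcher(dir, true); // true = read-only
    try {
        // Parse the query text against the "contents" field by default
        QueryParser parser = new QueryParser("contents",
                                             new StandardAnalyzer(Version.LUCENE_CURRENT));
        Query query = parser.parse(q);

        TopDocs hits = is.search(query, 10); // top 10 hits
        System.err.println("Found " + hits.totalHits + " document(s)");

        for (int i = 0; i < hits.scoreDocs.length; i++) {
            ScoreDoc scoreDoc = hits.scoreDocs[i];
            Document doc = is.doc(scoreDoc.doc); // load stored fields
            System.out.println(doc.get("filename"));
        }
    } finally {
        is.close(); // release the searcher even if parsing or searching fails
    }
}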

Code: Doc Deletion

Via IndexReader

void deleteDocument(int docNum) Deletes the document numbered docNum

int deleteDocuments(Term term) Deletes all documents that have a given term indexed.

Via IndexWriter

void deleteAll() Delete all documents in the index.

void deleteDocuments(Query query) Deletes the document(s) matching the provided query.

void deleteDocuments(Query[] queries) Deletes the document(s) matching any of the provided queries.

void deleteDocuments(Term term) Deletes the document(s) containing term.

void deleteDocuments(Term[] terms) Deletes the document(s) containing any of the terms.


Code: Doc Updates

Via IndexWriter facade

void updateDocument(Term term, Document doc) Updates a document by first deleting the document(s) containing term and then adding the new document.

void updateDocument(Term term, Document doc, Analyzer analyzer) Updates a document by first deleting the document(s) containing term and then adding the new document.


Pitfalls

• Update = delete + add

• No partial doc update

• No joins
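The first two pitfalls follow from each other: because an update is a delete followed by an add, any field left out of the replacement document is lost. The sketch below models this with a plain in-memory list of field maps. It is not Lucene code; the UpdateSketch class and its methods are hypothetical stand-ins that only mirror the delete-then-add behaviour.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UpdateSketch {
    // Each document is a map of field name -> value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    /** Deletes every document whose field equals value; returns how many were removed. */
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    /** Update = delete the matching document(s), then add the replacement. */
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    /** Returns the first document whose field equals value, or null. */
    Map<String, String> find(String field, String value) {
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                return d;
            }
        }
        return null;
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UpdateSketch store = new UpdateSketch();
        Map<String, String> original = new HashMap<>();
        original.put("id", "1");
        original.put("title", "Lucene");
        original.put("author", "doug");
        store.addDocument(original);

        // The replacement carries only id and title.
        Map<String, String> replacement = new HashMap<>();
        replacement.put("id", "1");
        replacement.put("title", "Lucene in Action");
        store.updateDocument("id", "1", replacement);

        // The author field is gone because the whole document was replaced.
        System.out.println(store.find("id", "1").get("author")); // null
    }
}
```

To change a single field, the application has to rebuild the full document from its own source data and submit it again.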


Performance Tips

• Index: -Xmx, setRAMBufferSizeMB, !optimize, !compound, !NFS, multi-thread, analysis, NO_NORMS

• Search: 1 searcher, !NFS, RAM vs. heap, SSD, optimize, FieldSelector

Details:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed



Lucene 2.9 & 3.0

• Per segment searching and caching (can lead to much faster reopen among other things)

• Near real-time search (aka NRT)

• New Query types

• Smarter, more scalable multi-term queries (wildcard, range, etc)

• Freshly optimized Collector/Scorer API

• Improved Unicode support and the addition of Collation contrib

• New Attribute based TokenStream API

• New QueryParser framework in contrib with a core QueryParser replacement impl included

• Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required

• New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)

• New fast-vector-highlighter for large documents

• Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values.


Community

[email protected] [email protected]


"I posted, went to get a sandwich, and came back to see two answers. The change works, and I can get the fix into production today. This list is magic."

Resources

• http://lucene.apache.org/java (Wiki, MLs, javadoc)

• http://manning.com/lucene (LIA2 soon, MEAP available)

• @lucene



Contact

@otisg

[email protected]

[email protected]

sematext.com

jroller.com/otis

blog.sematext.com

