lucene introduction

Lucene Introduction. Otis Gospodnetic, Sematext Int’l, @otisg, [email protected], http://jroller.com/otis, http://sematext.com/

Upload: otisg

Post on 14-Jan-2015

DESCRIPTION

Lucene introduction / overview, also touching on Lucene 2.9/3.0 features

TRANSCRIPT

Page 1: Lucene Introduction

Lucene Introduction

Otis Gospodnetic, Sematext Int’l
@otisg
[email protected]
http://jroller.com/otis
http://sematext.com/

Page 2: Lucene Introduction

About Otis

• Lucener since pre-Apache (circa 2000)

• Committer: Lucene, Solr, Nutch, Mahout, Open Relevance

• Lucene in Action 1 & 2 co-author

• Solr in Action author

• Sematext co-founder

Page 3: Lucene Introduction

What is Lucene?

• Free, ASL, Java IR library, Jar

• Doug Cutting, ASF, 2001

• Application agnostic: Indexing & Searching

• High performance, scalable

• No dependencies

• Heavily ported

Page 4: Lucene Introduction

What Lucene Ain’t

• Turn key “solution”

• Application, no installer/wizard needed

• (Web) crawler

• Insert-doc-format-here parser / filter

Page 5: Lucene Introduction

The Lucene Family

• Lucene vs. Apache Lucene vs. Java Lucene: IR library

• Nutch: Hadoop-loving crawler, indexer, searcher for web-wide scale SE

• Solr: Search server

• Droids: Standalone framework for writing crawlers

• Lucene.Net: C#, Incubator graduate

• Lucy: C Lucene impl

• Mahout: Hadoop-loving ML library

• Open Relevance: Relevance judgments

• PyLucene: Python port

Page 6: Lucene Introduction

Integration

[Diagram. Indexing side: Data Sources, Gather, Parse, Make Doc, Index. Search side: Search UI, Search App (e.g. webapp), Search, Index.]

Page 7: Lucene Introduction

Integration: Rich Doc Indexing

[Diagram. Inputs: HTML, PDF, MS Word. Steps: Gather, Parse with Tika, Make Doc, Index.]

Page 8: Lucene Introduction

Lucene Strengths

• Simple API

• Fast

• Concurrent indexing and searching

• Incremental indexing

• NRT: Near-Real-Time

• Boolean + Vector space, sorting, etc.

• Cheap

Page 9: Lucene Introduction

Query Types

• Single and multi-term queries

• Phrase queries (sloppiness allowed)

• Wildcard and fuzzy

• Range queries

• “Boolean”: required, prohibited, “should”

• Grouping

• Fields

Page 10: Lucene Introduction

Query Syntax

• +monkey +banana (or: monkey AND banana)

• +dog -snoopy (or: dog AND NOT snoopy)

• "pork flu"

• "pork flu" -"new york" (or: "pork flu" NOT "new york")

• "sweet pork"~3

• natur*

• schmidt~

• createDate:[200901 TO 201001]

• author:doug

• author:"doug cutting"

• author:"doug cutting" AND project:(lucene OR nutch OR hadoop)

• title:lucene^5.0 body:lucene
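
The + and - prefixes above mark required and prohibited clauses; unmarked terms are optional ("should"). A minimal plain-Java sketch of that matching rule follows. It is not Lucene's implementation: the class `BooleanMatchSketch` and its `matches` method are hypothetical names, documents are reduced to bare term sets, and scoring is ignored.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
public class BooleanMatchSketch {

    /**
     * Decides whether a document (given as its set of terms) matches.
     * required   : every term must be present   ("+term")
     * prohibited : no term may be present       ("-term")
     * should     : optional terms; when nothing is required,
     *              at least one of them must be present
     */
    public static boolean matches(Set<String> docTerms, List<String> required,
                                  List<String> prohibited, List<String> should) {
        for (String t : prohibited) {
            if (docTerms.contains(t)) return false;
        }
        for (String t : required) {
            if (!docTerms.contains(t)) return false;
        }
        if (!required.isEmpty()) return true; // optional terms would only affect ranking
        for (String t : should) {
            if (docTerms.contains(t)) return true;
        }
        return false; // nothing positive matched (a purely negative query matches nothing)
    }

    public static void main(String[] args) {
        Set<String> doc1 = new HashSet<String>(Arrays.asList("dog", "snoopy"));
        Set<String> doc2 = new HashSet<String>(Arrays.asList("dog", "pluto"));
        List<String> none = Collections.<String>emptyList();

        // Equivalent of the query: +dog -snoopy
        System.out.println(matches(doc1, Arrays.asList("dog"), Arrays.asList("snoopy"), none)); // false
        System.out.println(matches(doc2, Arrays.asList("dog"), Arrays.asList("snoopy"), none)); // true
    }
}
```

In real code the query parser builds these clauses from the query string; the sketch only shows which documents survive.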

Page 11: Lucene Introduction

Code: FS Indexer

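import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        // true = create a new index, replacing any existing one
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                                 IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    public void index(String dataDir, FileFilter filter) throws Exception {
        // Apply the filter while listing; a null filter accepts every file
        File[] files = new File(dataDir).listFiles(filter);
        for (File f : files) {
            Document doc = new Document();
            doc.add(new Field("contents", new FileReader(f)));  // tokenized, not stored
            doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
    }
}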

Page 12: Lucene Introduction

Indexing Pipeline

[Diagram. Stages: Document, Tokenizer, TokenFilter, DocumentWriter; the result is added to the Inverted Index.]

Page 13: Lucene Introduction

Indexer Pipeline: Analysis

Source: Lucene in Action

• 1 Tokenizer

• N TokenFilters
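
The "one tokenizer, then any number of token filters" chain can be sketched in plain Java. This is an illustration under stated assumptions, not Lucene's analysis API: the class `AnalysisSketch`, its method names, and the tiny stop list are hypothetical, and the methods only imitate the general behaviour of WhitespaceAnalyzer, SimpleAnalyzer, and a stop-word filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
public class AnalysisSketch {

    // Small stop list for illustration only; Lucene's default English stop set is larger.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "an", "and", "of"));

    /** Tokenizer in the style of WhitespaceAnalyzer: split on whitespace, keep case. */
    public static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Tokenizer in the style of SimpleAnalyzer: split on non-letters, then lowercase. */
    public static List<String> letterLowercaseTokens(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Filter step in the style of a stop filter: drop tokens found in the stop list. */
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dogs";
        System.out.println(whitespaceTokens(text));
        System.out.println(letterLowercaseTokens(text));
        // Tokenizer followed by one filter
        System.out.println(removeStopWords(letterLowercaseTokens(text)));
    }
}
```

Each filter takes the previous step's tokens and returns a new list, which is why filters can be stacked in any number after the single tokenizer.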

Page 14: Lucene Introduction

Analysis in Action

"The quick brown fox jumped over the lazy dogs"
  WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
  SimpleAnalyzer:     [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
  StopAnalyzer:       [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
  StandardAnalyzer:   [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

"XY&Z Corporation - [email protected]"
  WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [[email protected]]
  SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
  StopAnalyzer:       [xy] [z] [corporation] [xyz] [example] [com]
  StandardAnalyzer:   [xy&z] [corporation] [[email protected]]

Page 15: Lucene Introduction

Field Options

• Doc has 1+ Fields. Field has name+value

• Field.Index: NO, ANALYZED, NOT_ANALYZED, ANALYZED_NO_NORMS, NOT_ANALYZED_NO_NORMS

• Field.Store: YES, NO

• Field.TermVector: NO, YES, WITH_POSITIONS, WITH_OFFSETS, WITH_POSITIONS_OFFSETS

Page 16: Lucene Introduction

Inverted Index

Source: developer.apple.com
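
As a rough illustration of the idea, an inverted index maps each term to the documents that contain it, so a lookup never scans document text. The sketch below is plain Java, not Lucene's on-disk structure: the class `InvertedIndexSketch` and its methods are hypothetical, tokenization is a naive split on non-word characters, and positions and frequencies are left out.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
public class InvertedIndexSketch {

    // term -> sorted ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings =
            new TreeMap<String, SortedSet<Integer>>();
    private int nextDocId = 0;

    /** Adds a document and returns the id assigned to it. */
    public int add(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) continue;
            SortedSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    /** Single-term lookup: returns a copy of the posting list for the term. */
    public SortedSet<Integer> docsWith(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<Integer>() : new TreeSet<Integer>(ids);
    }

    /** AND query: intersects the posting lists of all given terms. */
    public SortedSet<Integer> docsWithAll(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> ids = docsWith(term);
            if (result == null) {
                result = ids;
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("monkey eats banana");   // doc 0
        index.add("monkey sees monkey");   // doc 1
        index.add("banana split");         // doc 2
        System.out.println(index.docsWith("banana"));              // [0, 2]
        System.out.println(index.docsWithAll("monkey", "banana")); // [0]
    }
}
```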

Page 17: Lucene Introduction

Index Directory

# ls -lh
total 1.1G
-rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
-rw-r--r-- 1 root root  44M 2009-03-14 10:29 _0.fdx
-rw-r--r-- 1 root root   33 2009-03-14 10:31 _9j.fnm
-rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
-rw-r--r-- 1 root root  11M 2009-03-14 10:36 _9j.nrm
-rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
-rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
-rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
-rw-r--r-- 1 root root   64 2009-03-14 10:36 segments_2
-rw-r--r-- 1 root root   20 2009-03-14 10:36 segments.gen

Details: http://lucene.apache.org/java/2_9_0/fileformats.html

Page 18: Lucene Introduction

Code: Searcher

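import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public void search(String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher is = new IndexSearcher(dir, true);  // true = open read-only

    // The Version-taking constructor is the one available in both 2.9 and 3.0
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
                                         new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse(q);
    TopDocs hits = is.search(query, 10);  // keep the 10 best hits
    System.err.println("Found " + hits.totalHits + " document(s)");

    for (int i = 0; i < hits.scoreDocs.length; i++) {
        ScoreDoc scoreDoc = hits.scoreDocs[i];
        Document doc = is.doc(scoreDoc.doc);  // load stored fields for this hit
        System.out.println(doc.get("filename"));
    }

    is.close();
}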

Page 19: Lucene Introduction

Code: Doc Deletion

Via IndexReader

void deleteDocument(int docNum)
    Deletes the document numbered docNum.

int deleteDocuments(Term term)
    Deletes all documents that have a given term indexed.

Via IndexWriter

void deleteAll()
    Delete all documents in the index.

void deleteDocuments(Query query)
    Deletes the document(s) matching the provided query.

void deleteDocuments(Query[] queries)
    Deletes the document(s) matching any of the provided queries.

void deleteDocuments(Term term)
    Deletes the document(s) containing term.

void deleteDocuments(Term[] terms)
    Deletes the document(s) containing any of the terms.

Page 20: Lucene Introduction

Code: Doc Updates

Via IndexWriter facade

void updateDocument(Term term, Document doc)
    Updates a document by first deleting the document(s) containing term and then adding the new document.

void updateDocument(Term term, Document doc, Analyzer analyzer)
    Updates a document by first deleting the document(s) containing term and then adding the new document.

Page 21: Lucene Introduction

Pitfalls

• Update = delete + add

• No partial doc update

• No joins
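
The first two pitfalls can be illustrated with a small plain-Java sketch. It is only an illustration under stated assumptions, not Lucene code: the class `UpdateSketch`, its in-memory map, and the helper `doc(...)` are hypothetical. It shows that an update removes the old document entirely and adds a new one, so any field missing from the replacement is gone and the internal document number changes.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Lucene.
public class UpdateSketch {

    // internal doc id -> field name -> field value
    private final Map<Integer, Map<String, String>> docs =
            new LinkedHashMap<Integer, Map<String, String>>();
    private int nextDocId = 0;

    public int add(Map<String, String> doc) {
        int id = nextDocId++;
        docs.put(id, new LinkedHashMap<String, String>(doc));
        return id;
    }

    /** Deletes every document whose field equals value; returns how many were removed. */
    public int delete(String field, String value) {
        int removed = 0;
        Iterator<Map<String, String>> it = docs.values().iterator();
        while (it.hasNext()) {
            if (value.equals(it.next().get(field))) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    /** Update = delete by term, then add the complete replacement document. */
    public int update(String field, String value, Map<String, String> newDoc) {
        delete(field, value);
        return add(newDoc);
    }

    public Map<String, String> get(int id) {
        return docs.get(id);
    }

    /** Helper: builds a document from alternating field names and values. */
    public static Map<String, String> doc(String... kv) {
        Map<String, String> m = new LinkedHashMap<String, String>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        int first = index.add(doc("id", "42", "title", "Lucene", "body", "search library"));
        int second = index.update("id", "42", doc("id", "42", "title", "Lucene 3.0"));
        System.out.println(index.get(first));   // null: the old document is gone
        System.out.println(index.get(second));  // {id=42, title=Lucene 3.0}: no "body" field
    }
}
```

To change one field, the application therefore has to rebuild the whole document from its own source data and pass it to the update call.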

Page 22: Lucene Introduction

Performance Tips

• Index: -Xmx, setRAMBufferSizeMB, !optimize, !compound, !NFS, multi-thread, analysis, NO_NORMS

• Search: 1 searcher, !NFS, RAM vs. heap, SSD, optimize, FieldSelector

Details:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Page 23: Lucene Introduction

Lucene 2.9 & 3.0

• Per segment searching and caching (can lead to much faster reopen among other things)

• Near real-time search (aka NRT)

• New Query types

• Smarter, more scalable multi-term queries (wildcard, range, etc)

• Freshly optimized Collector/Scorer API

• Improved Unicode support and the addition of Collation contrib

• New Attribute based TokenStream API

• New QueryParser framework in contrib with a core QueryParser replacement impl included

• Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required

• New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)

• New fast-vector-highlighter for large documents

• Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values.

Page 24: Lucene Introduction

Community

[email protected] [email protected]

"I posted, went to get a sandwich, and came back to see two answers. The change works, and I can get the fix into production today. This list is magic."

Page 25: Lucene Introduction

Resources

• http://lucene.apache.org/java
  – Wiki, MLs, javadoc

• http://manning.com/lucene
  – LIA2 soon, MEAP available

• @lucene

Page 26: Lucene Introduction

Contact

@otisg

[email protected]

[email protected]

sematext.com

jroller.com/otis

blog.sematext.com
