1/26

Lucene Introduction

Otis Gospodnetic, Sematext Intl @otisg

[email protected]

http://jroller.com/otis

http://sematext.com/


2/26

About Otis

Lucener since pre-Apache (cca 2000)

Committer: Lucene, Solr, Nutch, Mahout, Open Relevance

Lucene in Action 1 & 2 co-author

Solr in Action author

Sematext co-founder


3/26

What is Lucene?

Free, ASL, Java IR library, Jar

Doug Cutting, ASF, 2001

Application agnostic: Indexing & Searching

High performance, scalable

No dependencies

Heavily ported



4/26

What Lucene Ain't

Turn key solution

Application, no installer/wizard needed

(Web) crawler

Insert-doc-format-here parser / filter



5/26

The Lucene Family

Lucene vs. Apache Lucene vs. Java Lucene: IR library

Nutch: Hadoop-loving crawler, indexer, searcher for web-wide scale SE

Solr: Search server

Droids: Standalone framework for writing crawlers

Lucene.Net: C#, Incubator graduate

Lucy: C Lucene impl

Mahout: Hadoop-loving ML library

Open Relevance: Relevance judgments

PyLucene: Python port



6/26

Integration

Indexing: Data Sources -> Gather -> Parse -> Make Doc -> Index

Searching: Search UI -> Search App (e.g. webapp) -> Search -> Index



7/26

Integration: Rich Doc Indexing

Sources (HTML, PDF, MS Word) -> Gather -> Parse with Tika -> Make Doc -> Index



8/26

Lucene Strengths

Simple API

Fast

Concurrent indexing and searching

Incremental indexing

NRT: Near-Real-Time

Boolean + Vector space, sorting, etc.

Cheap



9/26

Query Types

Single and multi-term queries

Phrase queries (sloppiness allowed)

Wildcard and fuzzy

Range queries

Boolean: required, prohibited, should

Grouping

Fields



10/26

Query Syntax

+monkey +banana                  monkey AND banana

+dog -snoopy                     dog AND NOT snoopy

"pork flu"

"pork flu" -"new york"           "pork flu" NOT "new york"

"sweet pork"~3

natur*

schmidt~

createDate:[200901 TO 201001]

author:doug

author:"doug cutting"

author:"doug cutting" AND project:(lucene OR nutch OR hadoop)

title:lucene^5.0 body:lucene
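
The required (+), prohibited (-) and "should" (bare) clauses above can be pictured as set operations over document IDs. The following is a minimal sketch in plain Java, not Lucene's own BooleanQuery; the class name ToyBooleanQuery, the postings map and the document IDs are all made up for illustration, and scoring is ignored.

```java
import java.util.*;

class ToyBooleanQuery {
    // term -> IDs of documents containing it (hypothetical sample data)
    static final Map<String, Set<Integer>> POSTINGS = new HashMap<>();
    static {
        POSTINGS.put("dog", new TreeSet<>(Arrays.asList(1, 2, 3)));
        POSTINGS.put("snoopy", new TreeSet<>(Arrays.asList(2)));
        POSTINGS.put("monkey", new TreeSet<>(Arrays.asList(3, 4)));
        POSTINGS.put("banana", new TreeSet<>(Arrays.asList(4, 5)));
    }

    static Set<Integer> docs(String term) {
        return POSTINGS.getOrDefault(term, Collections.emptySet());
    }

    /**
     * Matches documents that contain every required term and no prohibited term.
     * When there are no required terms, a document must contain at least one
     * "should" term; otherwise "should" terms do not restrict the match.
     */
    static Set<Integer> search(List<String> required, List<String> prohibited,
                               List<String> should) {
        Set<Integer> result = new TreeSet<>();
        if (!required.isEmpty()) {
            result.addAll(docs(required.get(0)));
            for (String term : required) {
                result.retainAll(docs(term));   // intersection = AND
            }
        } else {
            for (String term : should) {
                result.addAll(docs(term));      // union = OR
            }
        }
        for (String term : prohibited) {
            result.removeAll(docs(term));       // difference = NOT
        }
        return result;
    }
}
```

With this sample data, search(Arrays.asList("dog"), Arrays.asList("snoopy"), Collections.emptyList()) mirrors +dog -snoopy and returns documents 1 and 3.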



11/26

Code: FS Indexer


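import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    public void index(String dataDir, FileFilter filter) throws Exception {
        // Pass the filter to listFiles so it is actually applied (null accepts all)
        File[] files = new File(dataDir).listFiles(filter);
        for (File f : files) {
            Document doc = new Document();
            doc.add(new Field("contents", new FileReader(f)));   // tokenized, not stored
            doc.add(new Field("filename", f.getName(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));     // stored, indexed verbatim
            writer.addDocument(doc);
        }
    }
}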


12/26

Indexing Pipeline


Diagram: a Document is added to the Document Writer, its text passes through a Tokenizer and TokenFilter, and the result is written to the Inverted Index.


13/26

Indexer Pipeline: Analysis

Source: Lucene in Action


1 Tokenizer

N TokenFilters


14/26

Analysis in Action


"The quick brown fox jumped over the lazy dogs"

WhitespaceAnalyzer:

[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

SimpleAnalyzer:

[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

StopAnalyzer:

[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

StandardAnalyzer:

[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

"XY&Z Corporation - xyz@example.com"

WhitespaceAnalyzer:

[XY&Z] [Corporation] [-] [xyz@example.com]

SimpleAnalyzer:

[xy] [z] [corporation] [xyz] [example] [com]

StopAnalyzer:

[xy] [z] [corporation] [xyz] [example] [com]

StandardAnalyzer:

[xy&z] [corporation] [xyz@example.com]
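
The differences between the first three analyzers come down to how text is split and which filters run afterwards. The sketch below imitates that behaviour in plain Java so the chain (one tokenizer, then N filters) is visible. It is a simplified stand-in, not Lucene's implementation: the class ToyAnalysis and its three-word stop list are assumptions made for illustration, and StandardAnalyzer's grammar-based tokenization is not covered.

```java
import java.util.*;

class ToyAnalysis {
    // Hypothetical stop list for illustration; Lucene's default list is longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    /** Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Like SimpleAnalyzer: split on non-letters, then lowercase each token. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Like StopAnalyzer: the simple chain followed by a stop-word filter. */
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }
}
```

Calling ToyAnalysis.stop("The quick brown fox jumped over the lazy dogs") drops both occurrences of "the" and leaves the seven remaining lowercase tokens, matching the StopAnalyzer line above.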


15/26

Field Options

Doc has 1+ Fields. Field has name + value

Field.Index.(no, (not) analyzed, no norms, not analyzed no norms)

Field.Store.(yes, no)

Field.TermVector.(yes, no, with pos., with offset, with both)



16/26

Inverted Index

Source: developer.apple.com
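
The idea behind an inverted index is a map from each term to the documents that contain it, so a lookup never has to scan document text. Here is a minimal in-memory sketch in plain Java. It is not how Lucene stores its index on disk; the class ToyInvertedIndex, its crude tokenization and the sample sentences are assumptions for illustration only.

```java
import java.util.*;

class ToyInvertedIndex {
    // term -> sorted IDs of the documents that contain it (the "postings")
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds a document and returns the ID assigned to it. */
    int add(String text) {
        int docId = nextDocId++;
        // Crude analysis: lowercase, then split on non-word characters
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Looks a term up directly instead of scanning every document. */
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("The quick brown fox");   // doc 0
        index.add("The lazy dogs");         // doc 1
        index.add("Quick dogs");            // doc 2
        System.out.println(index.lookup("quick"));  // prints [0, 2]
        System.out.println(index.lookup("dogs"));   // prints [1, 2]
    }
}
```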



17/26

Index Directory

# ls -lh

total 1.1G

-rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
-rw-r--r-- 1 root root  44M 2009-03-14 10:29 _0.fdx
-rw-r--r-- 1 root root   33 2009-03-14 10:31 _9j.fnm
-rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
-rw-r--r-- 1 root root  11M 2009-03-14 10:36 _9j.nrm
-rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
-rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
-rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
-rw-r--r-- 1 root root   64 2009-03-14 10:36 segments_2
-rw-r--r-- 1 root root   20 2009-03-14 10:36 segments.gen

Details: http://lucene.apache.org/java/2_9_0/fileformats.html



18/26

Code: Searcher


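public void search(String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher is = new IndexSearcher(dir, true);   // true = read-only
    QueryParser parser = new QueryParser("contents",
        new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse(q);
    TopDocs hits = is.search(query, 10);               // top 10 hits
    System.err.println("Found " + hits.totalHits + " document(s)");
    // The slide is cut off after "for (int i=0; i". The loop below is a likely
    // reconstruction that prints the stored "filename" field of each hit.
    for (int i = 0; i < hits.scoreDocs.length; i++) {
        Document doc = is.doc(hits.scoreDocs[i].doc);
        System.out.println(doc.get("filename"));
    }
    is.close();
}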


19/26

Code: Doc Deletion

Via IndexReader

void deleteDocument(int docNum)

Deletes the document numbered docNum

int deleteDocuments(Term term)

Deletes all documents that have a given term indexed.

Via IndexWriter

void deleteAll()

Delete all documents in the index.

void deleteDocuments(Query query)

Deletes the document(s) matching the provided query.

void deleteDocuments(Query[] queries)

Deletes the document(s) matching any of the provided queries.

void deleteDocuments(Term term)

Deletes the document(s) containing term.

void deleteDocuments(Term[] terms)

Deletes the document(s) containing any of the terms.



20/26

Code: Doc Updates

Via IndexWriter facade

void updateDocument(Term term, Document doc)

Updates a document by first deleting the document(s) containing term and then adding the new document.

void updateDocument(Term term, Document doc, Analyzer analyzer)

Updates a document by first deleting the document(s) containing term and then adding the new document.



21/26

Pitfalls

Update = delete + add

No partial doc update

No joins
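
The first two pitfalls follow from each other: because an update is a delete followed by an add, the new document replaces the old one as a whole. The plain-Java sketch below shows the consequence. It is a toy model, not Lucene code: ToyUpdatableIndex is hypothetical, documents are simple field-to-value maps, and a Term is modelled as an exact field/value match.

```java
import java.util.*;

class ToyUpdatableIndex {
    // Each document is a map of field name -> value (hypothetical model)
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    /** Deletes every document whose field equals value; returns how many were removed. */
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    /**
     * Update = delete + add. The old document is replaced as a whole,
     * so any field missing from newDoc is gone afterwards.
     */
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    List<Map<String, String>> all() {
        return docs;
    }
}
```

If a document with fields id, title and body is updated with a new document that carries only id and title, the body is no longer in the index. To keep it, the caller has to rebuild the full document before updating.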



22/26

Performance Tips

Index: -Xmx, setRAMBufferSizeMB, !optimize, !compound, !NFS, multi-thread, analysis, NO_NORMS

Search: 1 searcher, !NFS, RAM vs. heap, SSD, optimize, FieldSelector

Details:

http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

http://wiki.apache.org/lucene-java/ImproveSearchingSpeed



23/26

Lucene 2.9 & 3.0

Per segment searching and caching (can lead to much faster reopen among other things)

Near real-time search (aka NRT)

New Query types

Smarter, more scalable multi-term queries (wildcard, range, etc.)

Freshly optimized Collector/Scorer API

Improved Unicode support and the addition of Collation contrib

New Attribute based TokenStream API

New QueryParser framework in contrib with a core QueryParser replacement impl included

Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required

New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)

New fast-vector-highlighter for large documents

Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values.



24/26

Community

[email protected] [email protected]


"I posted, went to get a sandwich, and came back to see two answers.

The change works, and I can get the fix into production today. This list is magic."


25/26

Resources

http://lucene.apache.org/java

Wiki, MLs, javadoc

http://manning.com/lucene

LIA2 soon, MEAP available

@lucene



26/26

Contact

@otisg

[email protected]

[email protected]

jroller.com/otis

blog.sematext.com