luceneintroduction-091118115519-phpapp01
TRANSCRIPT
-
8/4/2019 luceneintroduction-091118115519-phpapp01
1/26
Lucene Introduction
Otis Gospodnetic, Sematext Intl @otisg
http://jroller.com/otis
http://sematext.com/
-
About Otis
Lucener since pre-Apache (circa 2000)
Committer: Lucene, Solr, Nutch, Mahout, Open Relevance
Lucene in Action 1 & 2 co-author
Solr in Action author
Sematext co-founder
-
What is Lucene?
Free, ASL, Java IR library, Jar
Doug Cutting, ASF, 2001
Application agnostic: Indexing & Searching
High performance, scalable
No dependencies
Heavily ported
-
What Lucene Ain't
Turn key solution
Application, no installer/wizard needed
(Web) crawler
Insert-doc-format-here parser / filter
-
The Lucene Family
Lucene vs. Apache Lucene vs. Java Lucene: IR library
Nutch: Hadoop-loving crawler, indexer, searcher for web-wide scale SE
Solr: Search server
Droids: Standalone framework for writing crawlers
Lucene.Net: C#, Incubator graduate
Lucy: C Lucene impl
Mahout: Hadoop-loving ML library
Open Relevance: Relevance judgments
PyLucene: Python port
-
Integration
[Diagram: Data Sources -> Gather -> Parse -> Make Doc -> Index; Search UI -> Search App (e.g. webapp) -> Search -> Index]
-
Integration: Rich Doc Indexing
[Diagram: HTML, PDF, MSWord -> Gather -> Parse with Tika -> Make Doc -> Index]
-
Lucene Strengths
Simple API
Fast
Concurrent indexing and searching
Incremental indexing
NRT: Near-Real-Time
Boolean + Vector space, sorting, etc.
Cheap
-
Query Types
Single and multi-term queries
Phrase queries (sloppiness allowed)
Wildcard and fuzzy
Range queries
Boolean: required, prohibited, should
Grouping
Fields
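The required / prohibited / should distinction can be sketched with a toy matcher. This is a plain-Java illustration, not Lucene's BooleanQuery; the class name ToyBooleanQuery and its methods are hypothetical, and scoring is ignored entirely.

```java
import java.util.List;
import java.util.Set;

/** Toy model of Boolean clause semantics (hypothetical class, not Lucene's API). */
final class ToyBooleanQuery {
    private final List<String> required;   // every term must occur (+term)
    private final List<String> prohibited; // none of these terms may occur (-term)
    private final List<String> should;     // optional terms

    ToyBooleanQuery(List<String> required, List<String> prohibited, List<String> should) {
        this.required = required;
        this.prohibited = prohibited;
        this.should = should;
    }

    /** Returns true if a document with the given term set satisfies all clauses. */
    boolean matches(Set<String> docTerms) {
        for (String t : prohibited) {
            if (docTerms.contains(t)) return false;
        }
        for (String t : required) {
            if (!docTerms.contains(t)) return false;
        }
        // When something is required, "should" terms are optional.
        if (!required.isEmpty()) return true;
        // Otherwise at least one "should" term has to occur.
        for (String t : should) {
            if (docTerms.contains(t)) return true;
        }
        return false;
    }
}
```

In this sketch a query made only of prohibited terms matches nothing, because there is no positive clause to select documents.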
-
Query Syntax
+monkey +banana                  (same as: monkey AND banana)
+dog -snoopy                     (same as: dog AND NOT snoopy)
"pork flu"
"pork flu" -"new york"           (same as: "pork flu" NOT "new york")
"sweet pork"~3
natur*
schmidt~
createDate:[200901 TO 201001]
author:doug
author:"doug cutting"
author:"doug cutting" AND project:(lucene OR nutch OR hadoop)
title:lucene^5.0 body:lucene
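A fuzzy query such as schmidt~ relies on edit distance between terms. The following is a minimal sketch of the classic Levenshtein distance in plain Java; the class name EditDistance is hypothetical, and how Lucene turns a distance into a match threshold is not shown here.

```java
/** Levenshtein edit distance (hypothetical helper, not Lucene's FuzzyQuery). */
final class EditDistance {
    /** Minimum number of single-character inserts, deletes, or substitutions turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, EditDistance.levenshtein("schmidt", "schmitt") is 1, which is why a fuzzy search for one spelling can find the other.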
-
Code: FS Indexer
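import java.io.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;

public class Indexer {
  private IndexWriter writer;

  public Indexer(String indexDir) throws IOException {
    Directory dir = FSDirectory.open(new File(indexDir));
    // true = create a new index, replacing any existing one
    writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
  }

  public void close() throws IOException {
    writer.close();
  }

  public void index(String dataDir, FileFilter filter) throws Exception {
    // Pass the filter to listFiles so it is actually applied (null accepts all files)
    File[] files = new File(dataDir).listFiles(filter);
    for (File f : files) {
      Document doc = new Document();
      doc.add(new Field("contents", new FileReader(f)));   // tokenized, not stored
      doc.add(new Field("filename", f.getName(),
          Field.Store.YES, Field.Index.NOT_ANALYZED));      // stored, indexed as one term
      writer.addDocument(doc);
    }
  }
}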
-
Indexing Pipeline
[Diagram labels: Document, add, Document Writer, Tokenizer, TokenFilter, Inverted Index]
-
Indexer Pipeline: Analysis
Source: Lucene in Action
1 Tokenizer
N TokenFilters
-
Analysis in Action
"The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
"XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
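The differences between the first three analyzers can be imitated in a few lines of plain Java. This is a toy sketch, not Lucene's analysis code: the class ToyAnalyzers, its method names, and the short stop-word list are all assumptions made for illustration (Lucene ships its own default stop set).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy imitations of three analyzers (hypothetical class, not Lucene's API). */
final class ToyAnalyzers {
    // Small illustrative stop list; an assumption, not Lucene's default set.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Like SimpleAnalyzer: split on non-letters, then lowercase. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Like StopAnalyzer: the simple tokens with stop words removed. */
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }
}
```

Each method plays the role of one tokenizer plus zero or more filters: simple adds lowercasing to letter-based tokenization, and stop adds a stop-word filter on top of that.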
-
Field Options
Doc has 1+ Fields. Field has name+value
Field.Index.(NO, ANALYZED, NOT_ANALYZED, ANALYZED_NO_NORMS, NOT_ANALYZED_NO_NORMS)
Field.Store.(YES, NO)
Field.TermVector.(YES, NO, WITH_POSITIONS, WITH_OFFSETS, WITH_POSITIONS_OFFSETS)
-
Inverted Index
Source: developer.apple.com
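The idea behind an inverted index is a map from each term to the documents that contain it. A minimal plain-Java sketch follows; the class ToyInvertedIndex is hypothetical and leaves out everything that makes Lucene's on-disk index efficient (segments, positions, frequencies, norms).

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy in-memory inverted index (hypothetical class, not Lucene's index format). */
final class ToyInvertedIndex {
    // term -> sorted ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Tokenizes the text, records each term against a new document id, and returns that id. */
    int add(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Looks up a single term; returns the matching document ids in ascending order. */
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null
            ? Collections.<Integer>emptySortedSet()
            : Collections.unmodifiableSortedSet(ids);
    }
}
```

A search is then a map lookup rather than a scan over every document, which is the property that makes term queries fast.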
-
Index Directory
# ls -lh
total 1.1G
-rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
-rw-r--r-- 1 root root  44M 2009-03-14 10:29 _0.fdx
-rw-r--r-- 1 root root   33 2009-03-14 10:31 _9j.fnm
-rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
-rw-r--r-- 1 root root  11M 2009-03-14 10:36 _9j.nrm
-rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
-rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
-rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
-rw-r--r-- 1 root root   64 2009-03-14 10:36 segments_2
-rw-r--r-- 1 root root   20 2009-03-14 10:36 segments.gen
Details: http://lucene.apache.org/java/2_9_0/fileformats.html
-
Code: Searcher
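import java.io.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;

public void search(String indexDir, String q) throws IOException, ParseException {
  Directory dir = FSDirectory.open(new File(indexDir));
  IndexSearcher is = new IndexSearcher(dir, true);  // true = read-only
  QueryParser parser = new QueryParser("contents",
      new StandardAnalyzer(Version.LUCENE_CURRENT));
  Query query = parser.parse(q);
  TopDocs hits = is.search(query, 10);  // top 10 hits
  System.err.println("Found " + hits.totalHits + " document(s)");
  // The slide text ends at "for (int i=0; i"; the loop bound and body are assumed.
  for (int i = 0; i < hits.scoreDocs.length; i++) {
    Document doc = is.doc(hits.scoreDocs[i].doc);
    System.out.println(doc.get("filename"));  // stored field written by the indexer
  }
  is.close();
}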
-
Code: Doc Deletion
Via IndexReader
void deleteDocument(int docNum)
Deletes the document numbered docNum
int deleteDocuments(Term term)
Deletes all documents that have a given term indexed.
Via IndexWriter
void deleteAll()
Delete all documents in the index.
void deleteDocuments(Query query)
Deletes the document(s) matching the provided query.
void deleteDocuments(Query[] queries)
Deletes the document(s) matching any of the provided queries.
void deleteDocuments(Term term)
Deletes the document(s) containing term.
void deleteDocuments(Term[] terms)
Deletes the document(s) containing any of the terms.
-
Code: Doc Updates
Via IndexWriter facade
void updateDocument(Term term, Document doc)
Updates a document by first deleting the document(s) containing term and
then adding the new document.
void updateDocument(Term term, Document doc, Analyzer analyzer)
Updates a document by first deleting the document(s) containing term and
then adding the new document.
-
Pitfalls
Update = delete + add
No partial doc update
No joins
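The first two pitfalls follow from how an update works: the old document is removed and a complete new one is added, so any field left out of the new document is gone. A toy plain-Java sketch of that behavior is below; ToyDocStore and its methods are hypothetical stand-ins, not Lucene's IndexWriter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy document store showing update = delete + add (hypothetical class, not Lucene's API). */
final class ToyDocStore {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc)); // defensive copy
    }

    /** Deletes every document whose field equals value; returns how many were removed. */
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    /** Replaces all documents matching field=value with newDoc; nothing is merged. */
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    /** Returns copies of the documents whose field equals value. */
    List<Map<String, String>> find(String field, String value) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) result.add(new HashMap<>(d));
        }
        return result;
    }
}
```

To change one field in this model, the caller has to rebuild the whole document from its original source and pass it to updateDocument.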
-
Performance Tips
Index: -Xmx, setRAMBufferSizeMB, !optimize, !compound, !NFS, multi-thread, analysis, NO_NORMS
Search: 1 searcher, !NFS, RAM vs. heap, SSD, optimize, FieldSelector
Details:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
-
Lucene 2.9 & 3.0
Per segment searching and caching (can lead to much faster reopen among other things)
Near real-time search (aka NRT)
New Query types
Smarter, more scalable multi-term queries (wildcard, range, etc)
Freshly optimized Collector/Scorer API
Improved Unicode support and the addition of Collation contrib
New Attribute based TokenStream API
New QueryParser framework in contrib with a core QueryParser replacement impl included
Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required
New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)
New fast-vector-highlighter for large documents
Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values.
-
Community
[email protected] [email protected]
"I posted, went to get a sandwich, and came back to see two answers.
The change works, and I can get the fix into production today. This list is magic."
-
Resources
http://lucene.apache.org/java
Wiki, MLs, javadoc
http://manning.com/lucene
LIA2 soon, MEAP available
@lucene
-
Contact
@otisg
jroller.com/otis
blog.sematext.com