Lucene Part2

Upload: brittany-gardner

Post on 16-Jan-2016


Lucene

Jakarta Lucene (http://jakarta.apache.org/lucene/) is a high-performance, full-featured, open-source text search engine API written in Java by Doug Cutting.

API

Note that Lucene is specifically an API, not an application. This means that all the hard parts have been done, but the easy programming has been left to you. The payoff is that, unlike with a packaged search engine application, you spend less time wading through tons of options and instead build a custom search application that is specifically suited to what you're doing. Lucene is startlingly easy to develop with and use.


Use the Source, Luke

This tutorial is a brief overview; the Lucene distribution comes with four example classes:

FileDocument
IndexFiles
SearchFiles
DeleteFiles

These classes are a good introduction to how to use Lucene.

Indexes

Here's a simple attempt to diagram how the Lucene classes go together:

Index
  Document 1
    Field A (name/value)
    Field B (name/value)
  Document 2
    Field A (name/value)
    Field B (name/value)

Index and Searching

At the heart of Lucene is an Index. This class usually gets its data from a filesystem directory that contains a certain set of files that follow a certain structure, but it doesn't absolutely have to be a directory.

You pump data into the Index, then do searches on the Index to get results out. To build the Index, you use an IndexWriter object. To run a search on the Index you use an IndexSearcher object.
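To make the "pump data in, search to get results out" idea concrete, here is a minimal sketch of an inverted index written with plain Java collections. It is a toy model, not Lucene code: the class ToyIndex and its methods add(), search() and document() are invented for illustration, and they only loosely play the roles of IndexWriter.addDocument() and IndexSearcher.search().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy stand-in for the index/writer/searcher trio; none of this is Lucene code.
class ToyIndex {
    // term -> ids of the documents containing it (an "inverted index")
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Loosely plays the role of IndexWriter.addDocument(): tokenize and record postings.
    int add(String text) {
        int id = documents.size();
        documents.add(text);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Loosely plays the role of IndexSearcher.search(): look up a single term.
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    // Return the original text of a document by id.
    String document(int id) {
        return documents.get(id);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("Let's get together for dinner tonight!");
        index.add("Lunch tomorrow?");
        for (int id : index.search("dinner")) {
            System.out.println(id + ": " + index.document(id));
        }
    }
}
```

The point of the sketch is the shape of the workflow: all the tokenizing work happens once at add() time, so that search() is only a map lookup.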

Search

The search itself is a Query object, which you pass into IndexSearcher.search(). IndexSearcher.search() returns a Hits object, which contains a Vector of Document objects.

Documents

Document objects are stored in the Index, but they have to be put into the Index at some point, and that's your job. You have to select what data to enter, and convert it into Documents. You read in each data file (or database entry, or whatever), instantiate a Document for it, break the data down into chunks, and store the chunks in the Document as Field objects (name/value pairs). When you're done building a Document, you write it to the Index using the IndexWriter.

Query Parser

Queries can be quite complicated, so Lucene includes a tool to help generate Query objects, called a QueryParser. The QueryParser takes a query string, much like what you'd put into an Internet search engine, and generates a Query object.

Analyzer

Lucene indexes text, and part of the first step is cleaning up the text. You use an Analyzer to do this: it drops out punctuation and commonly occurring but meaningless words (the, a, an, etc.). Lucene provides a couple of different Analyzers, and you can make your own, but the BIG GOTCHA people keep running into is that you must use the same sort of Analyzer for both indexing and searching. You must feed the same sort of Analyzer to the QueryParser that you originally fed to the IndexWriter.
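A small sketch of why the gotcha bites. It uses two made-up "analyzers" written as plain Java functions (they are not Lucene classes): one lowercases its tokens and one does not. If the index was built with the lowercasing one, a query analyzed with the other one can silently miss.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Toy model of the analyzer-mismatch gotcha; all names here are hypothetical.
class AnalyzerMismatchDemo {
    // Two toy "analyzers": one only splits on whitespace, one also lowercases.
    static final Function<String, Set<String>> WHITESPACE_ONLY = text -> tokens(text, false);
    static final Function<String, Set<String>> LOWERCASING = text -> tokens(text, true);

    static Set<String> tokens(String text, boolean lowercase) {
        Set<String> out = new HashSet<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(lowercase ? t.toLowerCase(Locale.ROOT) : t);
            }
        }
        return out;
    }

    // True if every query token (as analyzed at search time) is among the indexed tokens.
    static boolean matches(String docText, Function<String, Set<String>> indexAnalyzer,
                           String query, Function<String, Set<String>> queryAnalyzer) {
        return indexAnalyzer.apply(docText).containsAll(queryAnalyzer.apply(query));
    }

    public static void main(String[] args) {
        String doc = "Dinner Tonight";
        // Same analyzer on both sides: prints true
        System.out.println(matches(doc, LOWERCASING, "Dinner", LOWERCASING));
        // Index lowercased, query not: prints false
        System.out.println(matches(doc, LOWERCASING, "Dinner", WHITESPACE_ONLY));
    }
}
```

In the mismatched case the index holds "dinner" while the query asks for "Dinner", so nothing lines up even though the user typed a word that is plainly in the document.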


What you have to do

Lucene handles the indexing, searching and retrieving, but it doesn't handle:

managing the process (instantiating the objects and hooking them together, both for indexing and for searching)
selecting the data files
parsing the data files
getting the search string from the user
displaying the search results to the user

Indexing In Depth

You index by creating Documents full of Fields (which contain name/value pairs) and pumping them into an IndexWriter, which parses the contents of the Field values into tokens and creates an index.


Document Objects

Lucene doesn't index files, it indexes Document objects. To index and then search files, you first need to write code that converts your files into Document objects.

A Document object is a collection of Field objects (name/value pairs). So, for each file, instantiate a Document, then populate it with Fields.


Lucene Key Feature

Lucene just handles name/value pairs. Email, for example, is mostly name/value oriented:

to: fred
from: barney
subject: dinner?
body: Let's get together for dinner tonight!

For more complex files, you have to "flatten" that structure out into a set of name/value fields.

Standard Lucene

A minimum set of fields, as in the standard Lucene examples, would be:

the path to the original document, so you can actually show the user the original document after the search
a modification date, so you can compare it against the original document's modification date to see if it needs to be reindexed
the contents of the file, which is what you run the search against

The All Field

You also ought to think about glomming all of the Field data together and storing it as some sort of "all" Field. This is the easiest way to set it up so your users can search all Fields at once, if they want. Yes, you could come up with a complex scheme to rewrite your users' query so it searches across all of the known fields, but remember, keep it simple. See tutorial.
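As a rough illustration of flattening plus the "all" Field, here is a sketch that turns the email example into name/value pairs and then adds one combined field. It uses a plain Java Map rather than Lucene's Document and Field classes, and the method name toFields and the field name "all" are just choices made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: a "document" is an ordered map of field name -> field value.
class AllFieldSketch {
    // Flatten an email into name/value pairs, then add a combined "all" field.
    static Map<String, String> toFields(String to, String from, String subject, String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("to", to);
        fields.put("from", from);
        fields.put("subject", subject);
        fields.put("body", body);
        // Join the four values collected so far; this runs before "all" itself is added.
        fields.put("all", String.join(" ", fields.values()));
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> doc =
                toFields("fred", "barney", "dinner?", "Let's get together for dinner tonight!");
        doc.forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

A search against the "all" field then finds "barney" or "dinner" without the query having to name a specific field, at the cost of storing the text twice in the map.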

Field Objects

A Field object contains a name (a String) and a value (a String or a Reader), and three booleans that control whether or not the value will be indexed for searches, tokenized prior to indexing, and stored in the index so it can be returned with the search results.

Indexed for Searches

Sometimes you'll want to have fields available in your Documents that don't really have anything to do with searching. Two examples I can think of off the top of my head are creation dates and file names, so you can compare when the Document was created against the file modification date, and decide if the document needs to be reindexed. Since these fields won't ever make sense to use in an actual search, you can decrease the amount of work Lucene does by marking them as not indexed for searches.

Tokenized prior to indexing

Tokenizing refers to taking a piece of text, cleaning it up, and breaking it down into individual pieces (tokens) for the indexer. This is done by the Analyzer. Some fields you may not want to be tokenized, for example a serial number field.

Stored in the index

Even if a field is entirely indexed, it doesn't necessarily mean that it'll be easy for Lucene to reconstruct it. Although Lucene is a search index, and not a database, if your fields are reasonably small, you can ask Lucene to store them in the index. With the fields stored in the index, instead of using the Document to locate the original file or data and load it, you can actually pull the data out of the Document. This works best with fairly small fields and documents that you'd need to parse for display anyway.

Field Factory

Factory Method
Field.Text(String name, String value)
Field.Text(String name, Reader value)
Field.Keyword(String name, String value)
Field.UnIndexed(String name, String value)
Field.UnStored(String name, String value)
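Each factory method is a shorthand for one combination of the three booleans. The sketch below records those combinations as I understand them from the Lucene 1.x Javadoc for Field; treat the flag values as an assumption to check against the Lucene version you actually use. The enum and class names are invented for this sketch and are not part of Lucene.

```java
// Lookup table for the factory methods listed above. The flag values reflect my
// reading of the Lucene 1.x Field Javadoc and are an assumption, not Lucene code.
class FieldFlagsTable {
    enum Factory {
        TEXT_STRING("Field.Text(String, String)", true, true, true),
        TEXT_READER("Field.Text(String, Reader)", true, true, false),
        KEYWORD("Field.Keyword(String, String)", true, false, true),
        UNINDEXED("Field.UnIndexed(String, String)", false, false, true),
        UNSTORED("Field.UnStored(String, String)", true, true, false);

        final String signature;
        final boolean indexed;
        final boolean tokenized;
        final boolean stored;

        Factory(String signature, boolean indexed, boolean tokenized, boolean stored) {
            this.signature = signature;
            this.indexed = indexed;
            this.tokenized = tokenized;
            this.stored = stored;
        }
    }

    public static void main(String[] args) {
        for (Factory f : Factory.values()) {
            System.out.printf("%-34s indexed=%-5b tokenized=%-5b stored=%b%n",
                    f.signature, f.indexed, f.tokenized, f.stored);
        }
    }
}
```

Read this way, Keyword suits values like a serial number (searchable as one exact token), UnIndexed suits data you only want back with the hit, such as a file path, and UnStored suits large body text you search but reload from the original file.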

IndexWriter

The IndexWriter's job is to take the input (a Document), feed it through the Analyzer you instantiate it with, and create an index. Using the IndexWriter itself is fairly simple. You instantiate it with parameters for where to put the index files and the Analyzer you want it to use for cleaning up the tokens. Then feed Documents into IndexWriter.addDocument(). The actual index is a set of data files that the IndexWriter creates in a location defined (depending on how you instantiate the IndexWriter) by a Lucene Directory object, a File, or a path string.

Directory Objects

You can also store the index in a Lucene Directory object. A Lucene Directory is an abstraction around the Java filesystem classes. Using a Directory lets the Lucene classes hide what exactly is going on. This in turn lets you do clever behind-the-scenes things like keeping the index cached in memory for really high performance by using the RAM-based Directory class (Lucene comes with two Directory classes, one file-based and one RAM-based).

Analyzers and Tokenizers

The analyzer's job is to take apart a string of text and give you back a stream of tokens. The tokens are usually words from the text content of the string, and that's what gets stored (along with the location and other details) in the index.

Each analyzer includes one or more tokenizers and may include filters. The tokenizers take care of the actual rules for where to break the text up into words (typically whitespace). The filters do any post-tokenizing work on the tokens (typically dropping out punctuation and commonly occurring words like "the", "an", "a", etc).
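The tokenizer-then-filters arrangement can be sketched in a few lines of plain Java. This is a toy pipeline under my own assumptions (whitespace tokenizing, a three-word stop list, ASCII-only clean-up); the class ToyAnalyzer and its methods are hypothetical and are not Lucene's Analyzer, Tokenizer or TokenFilter classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: one tokenizer step followed by one filter step.
class ToyAnalyzer {
    // A tiny stop list chosen for the example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Tokenizer step: break the text up on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter step: lowercase, strip punctuation, drop stop words.
    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String cleaned = t.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
            if (!cleaned.isEmpty() && !STOP_WORDS.contains(cleaned)) {
                out.add(cleaned);
            }
        }
        return out;
    }

    static List<String> analyze(String text) {
        return filter(tokenize(text));
    }

    public static void main(String[] args) {
        // prints [quick, fox, agile, one]
        System.out.println(analyze("The quick fox, an agile one!"));
    }
}
```

Keeping the two steps separate is what lets different analyzers share a tokenizer and differ only in which filters they stack on top of it.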

Analyzers

SimpleAnalyzer seems to just use a Tokenizer that converts all of the input to lower case.

StopAnalyzer includes the lower-case filter, and also has a filter that drops out any "stop words", words like articles (a, an, the, etc.) that occur so commonly in English that they might as well be noise for searching purposes. StopAnalyzer comes with a set of stop words, but you can instantiate it with your own array of stop words.

StandardAnalyzer does both lower-case and stop-word filtering, and in addition tries to do some basic clean-up of words, for example taking out apostrophes ( ' ) and removing periods from acronyms (e.g. "T.L.A." becomes "TLA").

Searching In Depth

To actually do the search, you need an IndexSearcher, but we'll get to that in a moment; before you can even think about feeding the IndexSearcher a query, you have to have a Query object. The IndexSearcher does the actual munging through the index, but it only understands Query objects.

Query and QueryParser Objects

You produce the Query object by feeding the user's argument string into QueryParser.parse(), along with a string for the default field to search (if the user doesn't specify which field to search) and an Analyzer. The Analyzer is what QueryParser uses to tokenize the argument string. (Gotcha Warning: remember, again, you have to make sure that you use the same flavor Analyzer for tokenizing the argument string as you used for tokenizing the Index. StopAnalyzer is probably a safe choice for this, since that's the one used in the example code.) QueryParser.parse() returns a Query.

Thread Safety

Multiple index searchers can read the Lucene index files at the same time.
An index writer or reader can edit the Lucene index files while searches are ongoing.
Multiple index writers or readers can try to edit the Lucene index files at the same time, but they contend for a file lock, so it's important for each index writer/reader to be closed so it will release the file lock.

More on Thread Safety

However, the query parser is not thread safe, so each thread using the index should have its own query parser.

The index writer, however, is thread safe, so you can update the index while people are searching it. You then have to make sure that the threads with open index searchers close them and open new ones to get the newly updated data.
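Why searchers have to be reopened is easier to see with a toy model in which a searcher works on a snapshot taken when it was opened. This is only an illustration of the behavior described above, written with plain Java lists; ToySnapshotIndex, openSearcher() and count() are invented names, and Lucene's real mechanism (index files on disk) is different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model: a searcher sees the index as it was when the searcher was opened.
class ToySnapshotIndex {
    private final List<String> docs = new ArrayList<>();

    // Writer side: safe to call while searchers are in use.
    synchronized void add(String doc) {
        docs.add(doc);
    }

    // Opening a searcher copies the current state; later adds are invisible to it.
    synchronized Searcher openSearcher() {
        return new Searcher(new ArrayList<>(docs));
    }

    static final class Searcher {
        private final List<String> snapshot;

        Searcher(List<String> snapshot) {
            this.snapshot = snapshot;
        }

        // Count the documents in the snapshot that contain the word.
        int count(String word) {
            String wanted = word.toLowerCase(Locale.ROOT);
            int n = 0;
            for (String d : snapshot) {
                if (d.toLowerCase(Locale.ROOT).contains(wanted)) {
                    n++;
                }
            }
            return n;
        }
    }

    public static void main(String[] args) {
        ToySnapshotIndex index = new ToySnapshotIndex();
        index.add("dinner tonight");
        Searcher old = index.openSearcher();
        index.add("dinner tomorrow");
        System.out.println(old.count("dinner"));                  // stale view: 1
        System.out.println(index.openSearcher().count("dinner")); // fresh view: 2
    }
}
```

The old searcher keeps answering correctly for the data it knows about; it simply never learns about the second document until it is replaced by a new one.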

IndexSearchers

To get an IndexSearcher you simply instantiate an IndexSearcher with a single argument that tells Lucene where to find an existing index. The argument is either of these two:

a string containing a path to the index files
a Lucene Directory object (see the section about Directory objects under "Indexing In Depth", above)

IndexReaders

There's actually a third option for instantiating an IndexSearcher: you can instantiate it with an instance of any concrete subclass of the abstract class IndexReader.

This makes more sense if you take a peek at the code for IndexSearcher. The other two constructors just turn your file path or Directory object into an IndexReader by calling the static method IndexReader.open().

Multiple Indexes

If you're searching a single index, you use an IndexSearcher with a single index. If you need to search across multiple indexes, you instantiate one IndexSearcher per index, create an array, stick the IndexSearcher instances in the array, and instantiate a MultiSearcher with the array as an argument.

Doing The Search

To actually do the search, you take the argument string the user enters, pass it to a QueryParser, and get back a parsed Query object. Remember (third time's the charm) to use the right kind of Analyzer when you instantiate the QueryParser: use the same sort of Analyzer that you used when you built the index, because the QueryParser will use it to tokenize the argument string.

Then you feed the parsed Query to IndexSearcher.search(). The return is a Hits object, which is a collection of Document objects for documents that matched the search parameters. The Hits object also includes a score for each Document, indicating how well it matched.

Hits

IndexSearcher.search(Query) returns a "Hits" object, which is sort of like a Vector, containing a ranked list of Lucene Document objects. These are the same Document objects you fed into the IndexWriter, but specifically the ones that matched your search. Now you need to format the hits for a display, or manufacture HREFs pointing to the original documents, or whatever you were basically planning to do with the search results.
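To show what "a ranked list with a score per document" looks like in practice, here is a toy search that returns hits sorted by score. The scoring rule (the fraction of a document's tokens that equal the search term) is a crude stand-in I picked for the example; Lucene's real scoring formula is more involved. ToyHits and Hit are hypothetical names, not Lucene's Hits class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Toy ranked search over a list of document texts.
class ToyHits {
    // One hit: the matching document's position in the list and its score.
    static final class Hit {
        final int docId;
        final double score;

        Hit(int docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Score = matching tokens / total tokens; documents with no match are left out.
    static List<Hit> search(List<String> docs, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        List<Hit> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            int total = 0;
            int matched = 0;
            for (String t : docs.get(id).toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (t.isEmpty()) {
                    continue;
                }
                total++;
                if (t.equals(wanted)) {
                    matched++;
                }
            }
            if (matched > 0) {
                hits.add(new Hit(id, (double) matched / total));
            }
        }
        // Best score first, like a ranked result list.
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("dinner tonight", "lunch tomorrow", "dinner dinner dinner");
        for (Hit h : search(docs, "dinner")) {
            System.out.println(h.docId + " score=" + h.score + " -> " + docs.get(h.docId));
        }
    }
}
```

From here the formatting work is yours: walk the hits in order, pull out whatever stored values you need (a path, a title), and build the display or the HREFs.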

Summary

Lucene is a good fit:

for fast retrieval of key/value pairs
for read-oriented data