epl 660: lab 2 apache lucene part ii department of

21
University of Cyprus Department of Computer Science EPL 660: Lab 2 Apache Lucene Part II Andreas Kamilaris

Upload: others

Post on 20-Jan-2022

7 views

Category:

Documents


0 download

TRANSCRIPT

University of CyprusDepartment of Computer Science

EPL 660: Lab 2Apache Lucene Part II

Andreas Kamilaris

University of CyprusIndexing with Lucene• Fundamental Lucene classes for indexing text:

– IndexWriter– Analyzer– Document– Field

• IndexWriter is used to create a new index and to add Documents to an existing index

Tutorial based on http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusIndexing with Lucene• Before text is indexed, it is passed through an

Analyzer.• Analyzers extract indexable tokens out of text to

be indexed, and eliminating the rest.• Lucene comes with a few different Analyzer

implementations:– Some of them deal with skipping stop words

(frequently-used words that don't help distinguish onedocument from the other, such as "a," "an," "the” etc.)

– Some deal with converting all tokens to lowercaseletters, so that searches are not case-sensitive.

Tutorial based on http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusIndexing with Lucene• An index consists of a set of Documents.• Each Document consists of one or more Fields.• Each Field has a name and a value:

– Think of a document as a row in a database table andfields as columns in that row.

Tutorial based on http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusField Types• Lucene offers four different types of fields:

– Keyword, – UnIndexed, – UnStored, – Text

• Keyword fields: not tokenized, but indexed andstored in the index verbatim. This field is suitable for fields whose original valueshould be preserved in its entirety, such as URLs, dates, personal names, telephone numbers etc.

Tutorial based on http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusField Types• UnIndexed fields: neither tokenized nor indexed,

but their value is stored in the index as it is. • This field is suitable for fields that you will display

with search results, but whose values you ‘ll neversearch directly (URL, database primary key).

• Because this type of field is not indexed, searchesagainst it are slow.

• Since the original value of a field of this type isstored in the index, this type is not suitable forstoring fields with very large values, if index sizeis an issue.

Tutorial based on http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusField Types• UnStored fields: the opposite of UnIndexed fields.

Fields of this type are tokenized and indexed, butare not stored in the index (as they are). This field is suitable for indexing large amounts of text that does not need to be retrieved in itsoriginal form, such as the bodies of Web pages, or any other type of text document.

• Text fields: are tokenized, indexed, and stored in the index. This implies that fields of this type canbe searched, but be cautious about the size of thefield stored as Text field.

Tutorial based on http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusOverview of different field types

Tutorial based on http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusSimple Scenario

Text stored in a String variable

Lucene stores the index in a directory on the file system

IndexWriter’sconstructor

Standard Analyzer eliminatesstop words, converts tokens to lower case, and performs a fewother small input modifications, such as eliminating periodsfrom acronyms

createFlag: when true, tells IndexWriter to create a new index in the specifieddirectory, or overwrite an index in that directory, if it already existsA value of false instructs IndexWriter to instead add Documents to an existing index.

LuceneIndexExample.java

University of Cyprus

add it to the index via the instance of IndexWriter

Simple Scenario

IndexWriter’sconstructor

Create blank document

add a Field called fieldname to the document, with a value of theString that we want to index

LuceneIndexExample.java

Tutorial based on http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusAnalyzers• Components that pre-process input text.• Also used for searching:

– Because the search string has to be processed thesame way that the indexed text was processed, it iscrucial to use the same Analyzer for both indexing andsearching. Not using the same Analyzer will result in invalid search results.

• Analyzer class is an abstract class.• Lucene comes with a few concrete analyzers that

pre-process their input in different ways.• Custom analyzers can be also implemented.

Tutorial based on http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusThe PorterStemAnalyzer• Porter stemming algorithm is a process for

removing the more common morphological andinflexional endings from words in English.

• This analyzer will use an implementation of thePorter stemming algorithm provided by Lucene'sPorterStemFilter class.

PorterStemAnalyzer Tutorial: http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusThe PorterStemAnalyzer

Standard Analyzer’s constructor acceptsa File with stopwords

PorterStemAnalyzer’s constructors

University of CyprusThe PorterStemAnalyzer

Core method of the PorterStemAnalyzer

• lower-cases input, • eliminates stop words, • uses the PorterStemFilter to removecommon morphological and inflexionalendings

To use the new PorterStemAnalyzer class, we need to modify a single line of ourLuceneIndexExample class, to instantiate PorterStemAnalyzer instead of StandardAnalyzer:

The rest of the code remains unchanged

PorterStemAnalyzer Tutorial: http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusText indexing with PorterStemAnalyzer

PorterStemAnalyzer Tutorial: http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

University of CyprusLucene in 5 minutes!

allocate memory in RAM to store the index

create a new index, overwriting anyexisting index

add document to the index

create a new doc add a title field

Source: http://www.lucenetutorial.com

University of CyprusLucene in 5 minutes!

the "title" argumentspecifies the defaultfield to use when nofield is explicitlyspecified in the query

Source: http://www.lucenetutorial.com

The QueryParser uses the same analyzer as the IndexWriter

University of CyprusLucene in 5 minutes!

Source: http://www.lucenetutorial.com

University of CyprusLucene in 5 minutes!

Source: http://www.lucenetutorial.com

Full source code is shown here: http://www.lucenetutorial.com/lucene-in-5-minutes.html

University of CyprusExamples• Get deeper into source code…

<lucene_dir>src/org/apache/lucene/demo

• Under <lucene_dir> run:java org.apache.lucene.demo.IndexFiles /src

• This program builds an index. It produces a subdirectory called index which contains an index of all of the Lucene source code files.

• To search the index type: java org.apache.lucene.demo.SearchFiles

University of CyprusUseful Info• Official Apache Lucene site: http://lucene.apache.org/java/docs/• Lucene-java Wiki: http://wiki.apache.org/lucene-

java/FrontPage?action=show&redirect=FrontPageEN• Lucene Intro (java.net):

http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html• Lucene Tutorial.com: http://www.lucenetutorial.com/