19lucene

Upload: kicao

Post on 02-Jun-2018


  • 8/11/2019 19lucene


    LUCENE .NET SEARCH ENGINE

    Introduction:

    Lucene is a free, open-source information retrieval library, originally implemented in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene has been ported to other programming languages, including Delphi, Perl, C#, C++, Python, Ruby and PHP. Lucene is a search engine library that takes full-text search a step further.

    While suitable for any application which requires full-text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. Lucene itself is just an indexing and search library and does not contain crawling and HTML parsing functionality.

    In the present application we implement Lucene in the .NET Framework using C# to perform indexing and searching on a database.

    Overview:

    In the present application we perform a search on the local database using Lucene. We pass an English query as the input, or keyword, for the search. Lucene takes the keyword as its input, creates an index on the particular table, and searches for matches. Consider the following example:

    INPUT: PROTON

    OUTPUT: DISPLAYS ALL MATCHES FOR PROTON

    In the above example, "PROTON" is an element in a particular table residing in the database. Lucene takes PROTON as the keyword and displays all the elements in the table which contain "proton".

    It sounds easy, doesn't it? But to perform this functionality Lucene uses a number of classes and functions. Let's discuss them in detail.


    LUCENE DOTNET ARCHITECTURE

    As discussed earlier, Lucene.Net is an open-source search engine library that takes the implementation of indexing and searching a step further. To do this, Lucene.Net defines a set of classes.

    The Lucene.Net namespace has 11 other namespaces defined in it. They are as follows:

    Namespace: Description
    Lucene.Net.Analysis: Analyzes the given keyword or input, builds tokens from it, and then filters them to search for matches.
    Lucene.Net.Analysis.Standard
    Lucene.Net.Documents: Used to define the given document (for example: name, field, etc.).
    Lucene.Net.Index: This namespace is the key to the Lucene search engine. It contains the classes that create the index on a document.
    Lucene.Net.QueryParsers: Used to parse the given query, which is nothing but the given keyword.
    Lucene.Net.Search: This is also key to Lucene. It contains the classes that perform the search operation.
    Lucene.Net.Search.Spans
    Lucene.Net.Store: Contains classes to store the created indexes in a specified directory.
    Lucene.Net.Util: Contains classes for manipulating the given string, giving priorities to the keyword, etc.

    Let's discuss the namespaces in further detail.


    Lucene.Net.Analysis Namespace:

    Classes:

    Class: Description
    Analyzer: An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text. Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer. WARNING: You must override one of the methods defined by this class in your subclass or the Analyzer will enter an infinite loop.
    CharTokenizer: An abstract base class for simple, character-oriented tokenizers.
    LetterTokenizer: A LetterTokenizer is a tokenizer that divides text at non-letters. That is to say, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter predicate. Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
    LowerCaseFilter: Normalizes token text to lower case.
    LowerCaseTokenizer
    PerFieldAnalyzerWrapper: This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use addAnalyzer to add a non-default analyzer on a Field name basis. See TestPerFieldAnalyzerWrapper.java for example usage.
    PorterStemFilter
    SimpleAnalyzer: An Analyzer that filters LetterTokenizer with LowerCaseFilter.
    StopAnalyzer: Filters LetterTokenizer with LowerCaseFilter and StopFilter.
    StopFilter: Removes stop words from a token stream.
    Token: A Token is an occurrence of a term from the text of a Field. It consists of a term's text, the start and end offset of the term in the text of the Field, and a type string. The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC display, etc. The type is an interned string, assigned by a lexical analyzer, naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with type "eos". The default token type is "word".
    TokenFilter
    Tokenizer
    TokenStream
    WhitespaceAnalyzer: An Analyzer that uses WhitespaceTokenizer.
    WhitespaceTokenizer: A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.
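The tokenizer-then-filter chain described in the table can be sketched in a few lines. This is a minimal Python illustration of the idea only, not the Lucene.Net API: the function names, the `Token` dataclass and the small stop-word list are assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    start: int           # offset of the first character in the source text
    end: int             # offset one past the last character
    type: str = "word"   # default token type, as in the Token description


def letter_tokenize(text):
    """Divide text at non-letters: tokens are maximal runs of letters."""
    # [^\W\d_] matches word characters that are neither digits nor underscore.
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"[^\W\d_]+", text)]


def whitespace_tokenize(text):
    """Divide text at whitespace: tokens are runs of non-whitespace."""
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]


def lower_case_filter(tokens):
    """Normalize token text to lower case, keeping offsets and type."""
    return [Token(t.text.lower(), t.start, t.end, t.type) for t in tokens]


STOP_WORDS = frozenset({"a", "an", "and", "of", "the"})  # illustrative list


def stop_filter(tokens, stop_words=STOP_WORDS):
    """Remove stop words from a token stream."""
    return [t for t in tokens if t.text not in stop_words]


def stop_analyze(text):
    """Letter tokenizer, then lower-case filter, then stop filter."""
    return stop_filter(lower_case_filter(letter_tokenize(text)))


SOURCE = "The mass of a Proton-antiproton pair"
tokens = stop_analyze(SOURCE)
print([t.text for t in tokens])  # ['mass', 'proton', 'antiproton', 'pair']
```

Because each token keeps its start and end offsets, `SOURCE[tokens[1].start:tokens[1].end]` recovers the original spelling "Proton", which is what makes highlighting possible.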


    Lucene.Net.Analysis.Standard:

    Classes:

    Class: Description
    FastCharStream
    ParseException: This exception is thrown when parse errors are encountered. You can explicitly create objects of this exception type by calling the method generateParseException in the generated parser. You can modify this class to customize your error reporting mechanisms so long as you retain the public fields.
    StandardAnalyzer: Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter.
    StandardFilter: Normalizes tokens extracted with StandardTokenizer.
    StandardTokenizer
    StandardTokenizerConstants
    StandardTokenizerTokenManager
    Token: Describes the input token stream.
    TokenMgrError

    Lucene.Net.Documents:

    Classes:

    Class: Description
    DateField
    Document
    Field: A Field is a section of a Document. Each Field has two parts, a name and a value. Values may be free text, provided as a String or as a Reader, or they may be atomic keywords, which are not further processed. Such keywords may be used to represent dates, URLs, etc. Fields are optionally stored in the index, so that they may be returned with hits on the document.

    Lucene.Net.Index:

    Classes:

    Class: Description
    CompoundFileReader: Class for accessing a compound stream. This class implements a directory, but is limited to only read operations. Directory methods that would normally modify data throw an exception.
    CompoundFileReader.CSInputStream: Implementation of an InputStream that reads from a portion of the compound file. The visibility is left as "package" only because this helps with testing, since JUnit test cases in a different class can then access package fields of this class.
    CompoundFileWriter
    DocumentWriter
    FieldInfo
    FieldInfos: Access to the Field Info file that describes document fields and whether or not they are indexed. Each segment has a separate Field Info file. Objects of this class are thread-safe for multiple readers, but only one thread can be adding documents at a time, with no other reader or writer threads accessing this object.
    FieldsReader: Class responsible for access to stored document fields. It uses the <segment>.fdt and <segment>.fdx files.
    FilterIndexReader: A FilterIndexReader contains another IndexReader, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality. The class FilterIndexReader itself simply implements all abstract methods of IndexReader with versions that pass all requests to the contained index reader. Subclasses of FilterIndexReader may further override some of these methods and may also provide additional methods and fields.
    FilterIndexReader.FilterTermDocs: Base class for filtering TermDocs implementations.
    FilterIndexReader.FilterTermEnum: Base class for filtering TermEnum implementations.
    FilterIndexReader.FilterTermPositions: Base class for filtering TermPositions implementations.
    IndexReader: Reads the index.
    IndexWriter: An IndexWriter creates and maintains an index. The third argument to the constructor determines whether a new index is created, or whether an existing index is opened for the addition of new documents. In either case, documents are added with the addDocument method. When finished adding documents, close should be called. If an index will not have more documents added for a while and optimal search performance is desired, then the optimize method should be called before the index is closed.
    MultipleTermPositions: Describe class MultipleTermPositions here.
    MultiReader: An IndexReader which reads multiple indexes, appending their content.
    SegmentInfo
    SegmentInfos
    SegmentMerger
    SegmentReader: FIXME: Describe class SegmentReader here.
    SegmentTermDocs
    SegmentTermEnum
    Term: A Term represents a word from text. This is the unit of search. It is composed of two elements: the text of the word, as a string, and the name of the Field that the text occurred in, an interned string. Note that terms may represent more than words from text fields, but also things like dates, email addresses, URLs, etc.
    TermEnum
    TermInfo: A TermInfo is the record of information stored for a term.
    TermInfosReader
    TermInfosWriter
    TermVectorsReader: TODO: relax synchro!
    TermVectorsWriter: Writer works by opening a document, then opening the fields within the document, and then writing out the vectors for each Field.

    IndexWriter plays a major role in creating the index.

    Public Static Fields:

    COMMIT_LOCK_NAME
    COMMIT_LOCK_TIMEOUT: Default value is 10000. Use the Lucene.Net.commitLockTimeout system property to override.
    DEFAULT_MAX_FIELD_LENGTH: Default value is 10000. Use the Lucene.Net.maxFieldLength system property to override.
    DEFAULT_MAX_MERGE_DOCS: Default value is Integer.MAX_VALUE. Use the Lucene.Net.maxMergeDocs system property to override.
    DEFAULT_MERGE_FACTOR: Default value is 10. Use the Lucene.Net.mergeFactor system property to override.
    DEFAULT_MIN_MERGE_DOCS: Default value is 10. Use the Lucene.Net.minMergeDocs system property to override.
    WRITE_LOCK_NAME
    WRITE_LOCK_TIMEOUT: Default value is 1000. Use the Lucene.Net.writeLockTimeout system property to override.

    Public Instance Constructors:

    IndexWriter: Overloaded. Initializes a new instance of the IndexWriter class.

    Public Instance Fields:

    infoStream: If non-null, information about merges will be printed to this.
    maxFieldLength: The maximum number of terms that will be indexed for a single Field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a Field.
    maxMergeDocs
    mergeFactor
    minMergeDocs
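The effect of maxFieldLength is easiest to see on a field that exceeds the limit. The sketch below is a plain Python illustration of the truncation rule described above; `index_field_terms` is a hypothetical helper, not a Lucene.Net method.

```python
DEFAULT_MAX_FIELD_LENGTH = 10_000  # default quoted in the field list above


def index_field_terms(terms, max_field_length=DEFAULT_MAX_FIELD_LENGTH):
    """Return the terms that would be indexed for one field.

    Terms beyond max_field_length are silently discarded, so words that
    occur late in a very large document are not searchable.
    """
    return terms[:max_field_length]


# A field with 12,000 terms: only the first 10,000 reach the index.
terms = ["t%d" % i for i in range(12_000)]
indexed = index_field_terms(terms)
dropped = len(terms) - len(indexed)
print(len(indexed), dropped)  # 10000 2000
```

Raising the limit trades memory for completeness, which is why the description warns about running out of memory when the limit is effectively removed.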

    Public Instance Methods:

    AddDocument: Overloaded. Adds a document to this index, using the provided analyzer instead of the value of GetAnalyzer. If the document contains more than maxFieldLength terms for a given Field, the remainder are discarded.
    AddIndexes: Overloaded. Merges the provided indexes into this index. After this completes, the index is optimized. The provided IndexReaders are not closed.
    Close: Flushes all changes to an index and closes all associated files.
    DocCount: Returns the number of documents currently in this index.
    Equals: Determines whether the specified Object is equal to the current Object.
    GetAnalyzer: Returns the analyzer used by this index.
    GetHashCode: Serves as a hash function for a particular type, suitable for use in hashing algorithms and data structures like a hash table.
    GetSimilarity
    GetType: Gets the Type of the current instance.
    GetUseCompoundFile: Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use.
    Optimize: Merges all segments together into a single segment, optimizing an index for search.
    SetSimilarity: Expert: Set the Similarity implementation used by this IndexWriter.
    SetUseCompoundFile: Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use.
    ToString: Returns a String that represents the current Object.

    Protected Instance Methods:

    Finalize: Releases the write lock, if needed.
    MemberwiseClone: Creates a shallow copy of the current Object.


    Lucene.Net.QueryParsers:

    Classes:

    Class: Description
    FastCharStream
    MultiFieldQueryParser: A QueryParser which constructs queries to search multiple fields.
    ParseException: This exception is thrown when parse errors are encountered. You can explicitly create objects of this exception type by calling the method generateParseException in the generated parser. You can modify this class to customize your error reporting mechanisms so long as you retain the public fields.
    QueryParser
    QueryParserConstants
    QueryParserTokenManager
    Token: Describes the input token stream.
    TokenMgrError


    Lucene.Net.Search:

    Classes:

    Class: Description
    AnonymousClassScoreDocComparator
    AnonymousClassScoreDocComparator1
    BooleanClause: A clause in a BooleanQuery.
    BooleanQuery: A Query that matches documents matching boolean combinations of other queries, typically TermQuerys or PhraseQuerys.
    BooleanQuery.TooManyClauses: Thrown when an attempt is made to add more than GetMaxClauseCount clauses.
    CachingWrapperFilter: Wraps another filter's result and caches it. The caching behavior is like QueryFilter. The purpose is to allow filters to simply filter, and then wrap with this class to add caching, keeping the two concerns decoupled yet composable.
    DateFilter
    DefaultSimilarity: Expert: Default scoring implementation.
    Explanation: Expert: Describes the score computation for document and query.
    FieldDoc
    Filter: Abstract base class providing a mechanism to restrict searches to a subset of an index.
    FilteredQuery
    FilteredTermEnum
    FuzzyQuery: Implements the fuzzy search query. The similarity measurement is based on the Levenshtein algorithm.
    FuzzyTermEnum
    HitCollector: Lower-level search API.
    Hits: A ranked list of documents, used to hold search results.
    IndexSearcher
    MultiSearcher
    MultiTermQuery
    ParallelMultiSearcher
    PhrasePrefixQuery: PhrasePrefixQuery is a generalized version of PhraseQuery, with an added method Add(Term[]). To use this class to search for the phrase "Microsoft app*", first use add(Term) on the term "Microsoft", then find all terms that have "app" as prefix using IndexReader.terms(Term), and use PhrasePrefixQuery.add(Term[] terms) to add them to the query.
    PhraseQuery: A Query that matches documents containing a particular sequence of terms. This may be combined with other terms with a BooleanQuery.
    PrefixQuery: A Query that matches documents containing terms with a specified prefix.
    Query
    QueryFilter
    QueryTermVector
    RangeQuery: A Query that matches documents within an exclusive range.
    RemoteSearchable: A remote searchable implementation.
    ScoreDoc: Expert: Returned by low-level search implementations.
    Scorer: Expert: Implements scoring for a class of queries.
    Searcher: An abstract base class for search implementations. Implements some common utility methods.
    Similarity
    Sort
    SortComparator
    SortField
    StringIndex
    TermQuery: A Query that matches documents containing a term. This may be combined with other terms with a BooleanQuery.
    TopDocs: Expert: Returned by low-level search implementations.
    TopFieldDocs
    WildcardQuery: Implements the wildcard search query. Supported wildcards are *, which matches any character sequence, and ?, which matches any single character. Note that this query can be slow, as it needs to iterate over all terms. In order to prevent extremely slow WildcardQueries, a wildcard term must not start with one of the wildcards * or ?.
    WildcardTermEnum
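Both FuzzyQuery and WildcardQuery work by comparing the query term against the terms in the index. The sketch below illustrates the two matching rules in plain Python. It is not Lucene.Net code: the helper names are hypothetical, and the `max_edits` cut-off is a simplification assumed for the example (the table only says the similarity measure is based on the Levenshtein algorithm).

```python
import re


def levenshtein(a, b):
    """Edit distance between strings a and b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def fuzzy_terms(word, terms, max_edits=1):
    """Terms within max_edits edits of word (assumed cut-off, see lead-in)."""
    return [t for t in terms if levenshtein(word, t) <= max_edits]


def wildcard_to_regex(pattern):
    """Translate * (any sequence) and ? (any single character) to a regex."""
    if pattern[:1] in ("*", "?"):
        raise ValueError("a wildcard term must not start with * or ?")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def wildcard_terms(pattern, terms):
    """Iterate over all terms, keeping those the whole pattern matches."""
    rx = wildcard_to_regex(pattern)
    return [t for t in terms if rx.fullmatch(t)]


TERMS = ["photon", "proton", "protein", "protons", "neutron"]
print(wildcard_terms("prot*", TERMS))   # ['proton', 'protein', 'protons']
print(wildcard_terms("proto?", TERMS))  # ['proton']
print(fuzzy_terms("proton", TERMS))     # ['photon', 'proton', 'protons']
```

The loop over every term in `wildcard_terms` is the reason the description warns that wildcard queries can be slow.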

    Lucene.Net.Search.Spans:

    Classes:

    Class: Description
    SpanFirstQuery: Matches spans near the beginning of a Field.
    SpanNearQuery: Matches spans which are near one another. One can specify slop, the maximum number of intervening unmatched positions, as well as whether matches are required to be in order.
    SpanNotQuery: Removes matches which overlap with another SpanQuery.
    SpanOrQuery: Matches the union of its clauses.
    SpanQuery: Base class for span-based queries.
    SpanTermQuery: Matches spans containing a term.
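To make the meaning of slop concrete, here is a simplified two-term version of a "near" match in Python. It only illustrates the definition given for SpanNearQuery (slop as the number of intervening unmatched positions, plus an ordering flag); the function names are hypothetical and the real class handles arbitrary nested span queries.

```python
def positions(tokens, term):
    """Positions at which term occurs in a tokenized field."""
    return [i for i, t in enumerate(tokens) if t == term]


def span_near(tokens, first, second, slop, in_order=True):
    """Return (start, end) position pairs where the two terms are near.

    slop is the maximum number of positions allowed between the two
    terms. With in_order=True, `first` must occur before `second`.
    """
    spans = []
    for p in positions(tokens, first):
        for q in positions(tokens, second):
            if p == q:
                continue
            if in_order and q < p:
                continue
            gap = abs(q - p) - 1  # positions strictly between the terms
            if gap <= slop:
                spans.append((min(p, q), max(p, q)))
    return sorted(spans)


tokens = "lucene is a fast search library for text search".split()
print(span_near(tokens, "fast", "search", slop=0))  # [(3, 4)]
print(span_near(tokens, "fast", "search", slop=4))  # [(3, 4), (3, 8)]
print(span_near(tokens, "text", "library", slop=1, in_order=False))  # [(5, 7)]
```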

    Lucene.Net.Store:

    Lucene.Net.Util:

    Classes:

    Class: Description
    BitVector: Optimized implementation of a vector of bits. This is more or less like java.util.BitSet, but also includes the following: a count method, which efficiently computes the number of one bits; optimized read from and write to disk; an inlinable get method.
    Constants: Some useful constants.
    PriorityQueue: A PriorityQueue maintains a partial ordering of its elements such that the least element can always be found in constant time. Puts and pops require log(size) time.
    StringHelper: Methods for manipulating strings.
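The two data structures described above are small enough to sketch. The Python below is an assumption-laden illustration of the behaviour the table describes (a bit vector with a count method, and a heap-backed queue whose least element is available in constant time); it is not the Lucene.Net implementation, and the method names are chosen for the example.

```python
import heapq


class BitVector:
    """Fixed-size vector of bits packed into bytes."""

    def __init__(self, size):
        self.size = size
        self.bits = bytearray((size + 7) // 8)

    def set(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)

    def get(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

    def count(self):
        """Number of one bits, computed a byte at a time."""
        return sum(bin(b).count("1") for b in self.bits)


class PriorityQueue:
    """Binary heap: top() is O(1), put() and pop() are O(log size)."""

    def __init__(self):
        self._heap = []

    def put(self, item):
        heapq.heappush(self._heap, item)

    def top(self):
        return self._heap[0]

    def pop(self):
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)


bv = BitVector(20)
for i in (0, 3, 19):
    bv.set(i)
print(bv.count(), bv.get(3), bv.get(4))  # 3 True False

pq = PriorityQueue()
for score in (0.7, 0.2, 0.9):
    pq.put(score)
print(pq.top())  # 0.2
```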

    HOW DOES IT WORK?

    Lucene is a high-performance, scalable Information Retrieval library. It lets you add indexing and searching capabilities to your applications. People new to Lucene often mistake it for a ready-to-use application like a file-search program, a web crawler, or a web site search engine. That isn't what Lucene is: Lucene is a software library, a toolkit if you will, not a full-featured search application. It concerns itself with text indexing and searching, and it does those things very well. Lucene lets your application deal with business rules specific to its problem domain while hiding the complexity of the indexing and searching implementation behind a simple-to-use API.

    As said earlier, Lucene allows you to add indexing and searching capabilities to your applications. Lucene can index and make searchable any data that can be converted to a textual format. Lucene doesn't care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information. Similarly, with Lucene's help you can index data stored in your databases, giving your users full-text search capabilities that many databases don't provide.

    At the heart of all search engines is the concept of indexing: processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching. Let's take a quick high-level look at both the indexing and searching processes.

    Indexing:

    Suppose you needed to search a large number of files, and you wanted to be able to find files that contained a certain word or phrase. How would you go about writing a program to do this? A naive approach would be to sequentially scan each file for the given word or phrase. This approach has a number of flaws, the most obvious of which is that it doesn't scale to larger file sets or to cases where files are very large. This is where indexing comes in: to search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index.

    You can think of an index as a data structure that allows fast random access to words stored inside it. The concept behind it is analogous to an index at the end of a book, which lets you quickly locate pages that discuss certain topics. In the case of Lucene, an index is a specially designed data structure, typically stored on the file system as a set of index files. A Lucene index is a tool that allows quick word lookup.
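The difference between sequential scanning and an index lookup can be shown with a toy inverted index. This is a minimal in-memory sketch in Python, assuming a very simple tokenizer and made-up sample documents; Lucene's on-disk index format is far more elaborate.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_scan(docs, word):
    """Sequentially scan every document for the word (slow for big sets)."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if word.lower() in tokenize(text))


def build_index(docs):
    """Indexing: convert the documents into a word -> document-ids lookup."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, word):
    """Searching the index is a single dictionary access per word."""
    return sorted(index.get(word.lower(), ()))


# Hypothetical sample data for the example.
DOCS = {
    1: "Proton mass and charge",
    2: "Neutron mass",
    3: "The proton and the electron",
}
index = build_index(DOCS)
print(lookup(index, "proton"))  # [1, 3]
print(lookup(index, "mass"))    # [1, 2]
```

Both functions return the same answers; the point of the index is that the cost of reading every document is paid once, at indexing time, instead of on every search.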

    Searching:

    Searching is the process of looking up words in an index to find documents where they appear. The quality of a search is typically described using precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out the irrelevant documents. However, you must consider a number of other factors when thinking about searching. We already mentioned speed and the ability to quickly search large quantities of text. Support for single and multiterm queries, phrase queries, wildcards, result ranking, and sorting is also important, as is a friendly syntax for entering those queries. Lucene's powerful software library offers a number of search features.
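Precision and recall are simple ratios once the set of relevant documents is known. The following Python sketch uses the standard definitions (precision = relevant results / all results returned, recall = relevant results / all relevant documents); the document ids are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the search system returned
    relevant:  ids a human judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned 4 documents; 2 of the 3 relevant ones are among them.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision)  # 0.5
```

Here recall is 2/3: the system found two of the three relevant documents, while half of what it returned was irrelevant.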

    Now that we understand the concepts of "Indexing" and "Searching", let's see Lucene in action.

    To search the database, we first need to create an index on it. To do this we use a specific class, "IndexWriter". The importance and functionality of this class were discussed above, in the Lucene.Net.Index section. The created index is stored in a particular path that is specified as one of the arguments of the IndexWriter class.

    After creating the index, we save it by calling the function Close().

    Now that the index is created, we can perform the search operation on it. Searching in Lucene is as fast and simple as indexing. The search operation is controlled by the classes "Searcher" and "IndexSearcher". We need to specify the path of the index as the argument to IndexSearcher, so that it performs the search on that particular index.

    The search operation is based on whatever keyword or query we give as the input. The given query, which is human-readable, is first parsed into Lucene's Query class. This is done by the class "QueryParser".

    Searching the index returns the output, i.e., the hits, in the form of a "Hits" object. Note that the Hits object contains only references to the underlying documents. In other words, instead of being loaded immediately upon search, matches are loaded from the index in a lazy fashion, only when requested with a call to hits.doc(int).

    Finally, the hits or matches for our given query are displayed, and the search has been performed successfully.
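The steps above (create the index, close it, parse the query, search, read hits lazily) can be traced end to end with a toy model. The Python below is a sketch of the workflow only: `ToyIndexWriter`, `ToyIndexSearcher`, `ToyHits` and `parse_query` are hypothetical stand-ins that keep everything in memory, and they should not be read as the Lucene.Net API.

```python
import re
from collections import defaultdict


def analyze(text):
    """Shared analysis step for indexing and querying."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndexWriter:
    """Builds term -> row-id postings and keeps the stored row text."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.stored = {}
        self.closed = False

    def add_document(self, row_id, text):
        if self.closed:
            raise RuntimeError("index is closed")
        self.stored[row_id] = text
        for term in analyze(text):
            self.postings[term].add(row_id)

    def close(self):
        self.closed = True


class ToyHits:
    """Holds only row ids; the row text is fetched when doc(n) is called."""

    def __init__(self, row_ids, stored):
        self._row_ids = row_ids
        self._stored = stored
        self.loaded = []  # records which rows were actually fetched

    def length(self):
        return len(self._row_ids)

    def doc(self, n):
        row_id = self._row_ids[n]
        self.loaded.append(row_id)
        return self._stored[row_id]


class ToyIndexSearcher:
    def __init__(self, index):
        self._index = index

    def search(self, query_terms):
        """Return rows containing every query term."""
        sets = [self._index.postings.get(t, set()) for t in query_terms]
        row_ids = sorted(set.intersection(*sets)) if sets else []
        return ToyHits(row_ids, self._index.stored)


def parse_query(text):
    """Turn the human-readable keyword into a list of query terms."""
    return analyze(text)


# Index a hypothetical table, then search it for the keyword PROTON.
writer = ToyIndexWriter()
for row_id, name in {1: "Proton", 2: "Neutron", 3: "Proton beam"}.items():
    writer.add_document(row_id, name)
writer.close()

searcher = ToyIndexSearcher(writer)
hits = searcher.search(parse_query("PROTON"))
print(hits.length(), hits.loaded)  # 2 []
print(hits.doc(0), hits.loaded)    # Proton [1]
```

Nothing is fetched until `doc(0)` is called, which mirrors the lazy loading described for the Hits object.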

    These are the basic steps followed in a simple Lucene search operation. The flowchart below explains this in detail.

    FLOWCHART FOR LUCENE .NET DATABASE SEARCH

    START
    AN ENGLISH QUERY IS PASSED AS A KEYWORD.
    EXISTING DATABASE
    CREATES AN INDEX ON THAT PARTICULAR TABLE TO WHICH THE GIVEN QUERY BELONGS AND STORES THE INDEXES IN THE LOCAL DIRECTORY
    SEARCH OPERATION IS CALLED ON THAT PARTICULAR INDEX
    HITS FOUND (branches: YES / NO)
    MATCHES OF THE GIVEN KEYWORD ARE DISPLAYED
    SEARCH IS PERFORMED

    Who uses Lucene?

    Who doesn't? A number of large, well-known, multinational organizations are using Lucene. It provides searching capabilities for the Eclipse IDE, the Encyclopedia Britannica CD-ROM/DVD, FedEx, the Mayo Clinic, Hewlett-Packard, New Scientist magazine, Epiphany, MIT's OpenCourseWare and DSpace, Akamai's EdgeComputing platform, and so on. Your name will be on this list soon, too. Some of the other applications are:

    Google Desktop has made a splash by bringing this functionality to end users. Now you have the power to bring the same indexing and searching capabilities into your applications using Lucene.Net, a high-performance, scalable search engine library written in the C# language and utilizing the .NET Framework.

    It can be used for any web application which needs a search portal.

    It can also be used for searching the local database, and can be handy for local computer search.