07 - string processing

Upload: adelino-thesaria

Post on 09-Apr-2018


  • 8/8/2019 07 - String Processing


    CS4323 / 0708-2

    YFA

    Available online at http://www.ittelkom.ac.id/staf/yanuar

    Institut Teknologi Telkom


    Query Languages: The Common Query Language

    The Common Query Language (CQL) is a formal language for representing queries to information retrieval systems such as web indexes, bibliographic catalogs and museum collection information.

    Objective: human readable and human writable; intuitive while maintaining the expressiveness of more complex languages.

    Traditionally, query languages have fallen into two camps:

    (a) Powerful and expressive languages which are not easily readable or writable by non-experts (e.g. SQL and XQuery).

    (b) Simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL or Google's query language).


    The Common Query Language

    The Common Query Language is maintained by the Z39.50 International Maintenance Agency at the Library of Congress.

    http://www.loc.gov/z3950/agency/zing/cql/

    The examples that follow are taken from the CQL tutorial, A Gentle Introduction to CQL.


    The Common Query Language: Examples

    dinosaur
    comp.sources.misc
    "the complete dinosaur"
    "ext->u.generic"
    "and"

    Booleans

    dinosaur and bird or dinobird
    (bird or dinosaur) and (feathers or scales)
    (((a and b) or (c not d) not (e or f and g)) and h not i) or j


    The Common Query Language: Examples

    title = dinosaur
    title = dinosaur and bird or dinobird
    dc.title = saurischia
    bath.title="the complete dinosaur"
    srw.resultSet=bar

    Index-set mapping (definition of fields)

    >dc="http://www.loc.gov/srw/index-sets/dc"

    dc.title=dinosaur and dc.author=farlow


    The Common Query Language: Examples

    Proximity

    The prox operator: prox/relation/distance/unit/ordering

    Examples:

    complete prox dinosaur [adjacent]
    (caudal or dorsal) prox vertebra
    ribs prox//0/sentence chevrons [same sentence]
    ribs prox/>/0/paragraph chevrons [different paragraphs]
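    The proximity test above can be sketched in a few lines of Python. This is only an illustration for the word unit: the function name prox, its parameters, and the defaults used here (relation "<=", distance 1, unordered) are assumptions made for the sketch, and real CQL servers may tokenize and count distance differently.

```python
import operator
import re

# Map the relation part of prox/relation/distance/unit/ordering to a comparison.
RELATIONS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    ">": operator.gt, ">=": operator.ge,
}

def prox(text, a, b, relation="<=", distance=1, ordered=False):
    """Return True if words a and b occur in text at a word distance
    that satisfies `relation distance` (unit: word only)."""
    words = re.findall(r"\w+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == a]
    positions_b = [i for i, w in enumerate(words) if w == b]
    compare = RELATIONS[relation]
    for i in positions_a:
        for j in positions_b:
            if ordered and j < i:
                continue  # ordered: a must come before b
            if compare(abs(j - i), distance):
                return True
    return False

print(prox("The Complete Dinosaur", "complete", "dinosaur"))
print(prox("a complete guide to the dinosaur", "complete", "dinosaur", distance=4))
```

    Adjacent words are at distance 1, so the first call corresponds to "complete prox dinosaur".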


    The Common Query Language: Examples

    Relations

    year > 1998
    title all "complete dinosaur"
    title any "dinosaur bird reptile"
    title exact "the complete dinosaur"
    publicationYear < 1980
    numberOfWheels 2.4
    bioMass >= 100
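    As a rough illustration of the word relations any, all and exact, the sketch below applies them to a single field value. The function name matches, the whitespace tokenization and the case-insensitive comparison are assumptions for the example; CQL leaves such details to the server.

```python
def matches(field_value, relation, term):
    """Evaluate `field relation "term"` against one field value."""
    field_words = field_value.lower().split()
    term_words = term.lower().split()
    if relation == "any":    # at least one of the words occurs
        return any(w in field_words for w in term_words)
    if relation == "all":    # every word occurs, in any order
        return all(w in field_words for w in term_words)
    if relation == "exact":  # the whole field equals the term
        return field_value.lower() == term.lower()
    raise ValueError("unsupported relation: " + relation)

title = "The Complete Dinosaur"
print(matches(title, "any", "dinosaur bird reptile"))
print(matches(title, "all", "complete dinosaur"))
print(matches(title, "exact", "the complete dinosaur"))
```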


    The Common Query Language: Examples

    Relation Modifiers

    title all/stem "complete dinosaur"
    title any/relevant "dinosaur bird reptile"
    title exact/fuzzy "the complete dinosaur"

    The implementations of relevant and fuzzy are not defined by the query language.


    The Common Query Language: Examples

    dinosaur* [zero or more characters]
    *sauria
    man?raptor [exactly one character]
    man?raptor*
    char\* [literal "*"]
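    One way to make the wildcard rules concrete is to translate a CQL-style pattern into a regular expression. The helper names cql_pattern_to_regex and cql_match are hypothetical, and matching whole words case-insensitively is an assumption of this sketch rather than something the slides specify.

```python
import re

def cql_pattern_to_regex(pattern):
    """Translate * (zero or more characters), ? (exactly one character)
    and backslash escapes into an equivalent compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))  # escaped character is literal
            i += 2
            continue
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.IGNORECASE)

def cql_match(pattern, word):
    """True if the whole word matches the pattern."""
    return cql_pattern_to_regex(pattern).fullmatch(word) is not None

print(cql_match("dinosaur*", "dinosaurs"))
print(cql_match("man?raptor", "maniraptor"))
print(cql_match(r"char\*", "char*"))
```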

    Word Anchoring

    title="^the complete dinosaur" [beginning of field]

    author="bakker^" [end of field]

    author any "^kernighan ^ritchie ^thompson"


    The Common Query Language: Examples

    dc.author=(kern* or ritchie) and
    (bath.title exact "the c programming language" or
    dc.title=elements prox///4 dc.title=programming) and
    subject any/relevant "style design analysis"


    Find records whose author (in the Dublin Core sense) includes either a word beginning kern or the word ritchie, and which have either the exact title (in the sense of the Bath profile) the c programming language or a title containing the words elements and programming not more than four words apart, and whose subject is relevant to one or more of the words style, design or analysis.


    Regular Expressions in Java

    Package java.util.regex

    Classes for matching character sequences against patterns specified by regular expressions.

    An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.

    Instances of the Matcher class are used to match character sequences against a given pattern.

    Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.


    String Searching: Naive Algorithm

    Objective: Given a pattern, find any substring of a given text that matches the pattern.

    p  the pattern to be matched
    m  length of pattern p (characters)
    t  the text to be searched
    n  length of t (characters)

    The naive algorithm examines the characters of t in sequence.

    for j from 1 to n-m+1
        if character j of t matches the first character of p
            (compare following characters of t and p
            until a complete match or a mismatch is found)
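    The loop above translates almost directly into Python. This is a minimal sketch: the function name naive_search is made up for the example, and it returns 0-based starting positions, whereas the slide counts j from 1.

```python
def naive_search(p, t):
    """Return every 0-based position in t where pattern p starts."""
    m, n = len(p), len(t)
    matches = []
    for j in range(n - m + 1):
        k = 0
        # Compare p with the text starting at j until a mismatch or a full match.
        while k < m and t[j + k] == p[k]:
            k += 1
        if k == m:
            matches.append(j)
    return matches

print(naive_search("uni", "the uniform commercial code"))
```

    In the worst case every starting position costs up to m comparisons, so the running time is proportional to n times m.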


    String Searching: Knuth-Morris-Pratt Algorithm

    Concept: The naive algorithm is modified, so that whenever a partial match is found, it may be possible to advance the character index, j, by more than 1.

    Example:

    p = "university"

    t = "the uniform commercial code ..."

    At j=5 the partial match "uni" is found; after the partial match, the search continues at j=8.

    To indicate how far to advance the character pointer, p is preprocessed to create a table, which lists how far to advance against a given length of partial match.

    In the example, j is advanced by the length of the partial match, 3.
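    A compact Python sketch of the idea follows. It uses the common failure-function formulation of Knuth-Morris-Pratt, which may be laid out differently from the table the slide has in mind; the names failure_table and kmp_search are chosen for this example, and positions are 0-based.

```python
def failure_table(p):
    """fail[i] = length of the longest proper prefix of p[:i+1]
    that is also a suffix of p[:i+1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(p, t):
    """Return every 0-based position in t where pattern p starts."""
    if not p:
        return list(range(len(t) + 1))
    fail = failure_table(p)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(t):
        while k > 0 and ch != p[k]:
            k = fail[k - 1]  # reuse the partial match instead of restarting
        if ch == p[k]:
            k += 1
        if k == len(p):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

print(failure_table("university"))
print(kmp_search("uni", "the uniform commercial code"))
```

    For "university" the table is all zeros, because no proper prefix reappears later in the pattern. After a partial match of length 3 the pattern can therefore move forward by 3 - 0 = 3 characters, which agrees with the example.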


    Signature Files: Sequential Search without Inverted File

    Inexact filter: a quick test that discards many of the non-qualifying items.

    Advantages

    Much faster than full text scanning -- 1 or 2 orders of magnitude

    Modest space overhead -- 10% to 15% of file

    Insertion is straightforward

    Disadvantages

    Sequential searching is no good for very large files


    Signature Files

    Signature size. Number of bits in a signature, F.

    Word signature. A bit pattern of size F with m bits set to 1 and the others 0.

    The word signature is calculated by a hash function.

    Block. A sequence of text that contains D distinct words.

    Block signature. The logical or of all the word signatures in a block of text.
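    The definitions above can be sketched as follows. The slides do not say which hash function is used, so the SHA-256 based scheme here is purely an assumption for illustration, as are the function names word_signature and block_signature.

```python
import hashlib

def word_signature(word, F=12, m=4):
    """Return an F-bit integer with exactly m bits set, derived from the word."""
    if not 0 < m <= F:
        raise ValueError("need 0 < m <= F")
    signature = 0
    counter = 0
    # Keep hashing (word, counter) until m distinct bit positions are set.
    while bin(signature).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        position = int.from_bytes(digest[:4], "big") % F
        signature |= 1 << position
        counter += 1
    return signature

def block_signature(words, F=12, m=4):
    """Logical OR of the word signatures of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word, F, m)
    return signature

block = block_signature(["free", "text"])
print(format(word_signature("free"), "012b"))
print(format(block, "012b"))
```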


    Signature Files

    Example

    Word               Signature
    free               001 000 110 010
    text               000 010 101 001
    block signature    001 010 111 011

    F = 12 bits in a signature
    m = 4 bits per word
    D = 2 words per block


    Signature Files

    A query term is processed by matching its signature against the block signature.

    (a) If the term is in the block, its word signature will always match the block signature.

    (b) A word signature may match the block signature, but the word is not in the block. This is a false hit.

    The probability of a false hit is the false drop probability, Fd.

    Frakes, Section 4.2, page 47 discusses how to minimize Fd. The rest of this chapter discusses enhancements to the basic algorithm.
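    The matching rule and a false hit can be shown with 12-bit signatures. In this sketch FREE and the resulting block signature follow the earlier example; the second word's signature and the two query signatures are assumed values chosen only to illustrate the cases, and the name may_contain is hypothetical.

```python
FREE = 0b001_000_110_010
SECOND_WORD = 0b000_010_101_001   # assumed signature of the block's other word
BLOCK = FREE | SECOND_WORD        # block signature: logical OR

def may_contain(block_signature, word_signature):
    """A word can be in the block only if all of its 1 bits are set in the block."""
    return (block_signature & word_signature) == word_signature

# Hypothetical query words that are NOT in the block:
FALSE_HIT = 0b001_010_100_001     # all four bits happen to be covered by the block
REJECTED = 0b110_000_000_011      # needs bits the block signature does not have

print(format(BLOCK, "012b"))
print(may_contain(BLOCK, FREE))       # word is in the block
print(may_contain(BLOCK, FALSE_HIT))  # match although the word is absent
print(may_contain(BLOCK, REJECTED))   # correctly discarded
```

    Because a match only says the word may be present, qualifying blocks still have to be checked against the text itself.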


    String Matching

    Simple algorithm: Build an inverted index of all substrings of the file names of the form *f, that is, the suffixes of each file name.

    Example: if the file name is foo.txt, search terms are:

    foo.txt
    oo.txt
    o.txt
    .txt
    txt
    xt

    Lexicographic processing of these terms allows searching by any query string q.
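    The reason suffixes are enough is that every substring of a name is a prefix of one of its suffixes, so a sorted suffix list can be searched by prefix. The sketch below illustrates this with a binary search; the function names and the sample file names other than foo.txt are made up for the example, and it also indexes the final one-character suffix.

```python
import bisect

def build_suffix_index(names):
    """Sorted list of (suffix, file name) pairs for every suffix of every name."""
    index = []
    for name in names:
        for start in range(len(name)):
            index.append((name[start:], name))
    index.sort()
    return index

def find_files(index, q):
    """Return the names that contain q anywhere, using a prefix search."""
    start = bisect.bisect_left(index, (q,))  # first entry with suffix >= q
    found = set()
    for suffix, name in index[start:]:
        if not suffix.startswith(q):
            break  # sorted order: no later suffix can start with q
        found.add(name)
    return sorted(found)

index = build_suffix_index(["foo.txt", "bar.txt", "food.doc"])
print(find_files(index, "oo."))
print(find_files(index, ".txt"))
```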


    Search for Substring

    In some information retrieval applications, any substring can be a search term.

    Tries, using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.


    Tries: Search for Substring

    Basic concept

    The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique.

    The sistrings are stored in (the leaves of) a tree, the suffix tree.

    Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node.

    Suffix trees have a size of the same order of magnitude as the input documents.


    Tries: Suffix Tree

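    A suffix tree for the following words:

    begin, beginning, between, bread, break

    [Figure: tree whose root edge b branches into e and rea. Below e are gin and tween; below gin are null (end of word) and ning. Below rea are d and k.]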
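    The shape of this tree can be reproduced with a small Python sketch: build a character trie of the five words, then merge every chain of single-child nodes into one edge label. This is only an illustration on whole words, as in the example; the function names and the nested-dictionary representation (with "" marking the end of a word) are assumptions of the sketch.

```python
def build_trie(words):
    """Character trie as nested dicts; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root

def compress(node):
    """Merge chains of single-child nodes into one multi-character edge."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label  # an end marker "" adds nothing to the label
            child = next_child
        out[label] = compress(child)
    return out

tree = compress(build_trie(["begin", "beginning", "between", "bread", "break"]))
print(tree)
```

    The resulting edges are b, then e and rea, then gin and tween under e, d and k under rea, and an end-of-word marker and ning under gin.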


    Tries: Sistrings

    A binary example

    String: 01 100 100 010 111

    Sistrings:

    1 01 100 100 010 111
    2 11 001 000 101 11
    3 10 010 001 011 1
    4 00 100 010 111
    5 01 000 101 11
    6 10 001 011 1
    7 00 010 111
    8 00 101 11
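    Each sistring is simply the text from a given starting position onward, which the following sketch makes explicit. The function names are chosen for this example; grouped only reproduces the display grouping used here (two bits, then groups of three).

```python
def sistrings(text, count):
    """Map each 1-based starting position to the text from there to the end."""
    return {pos: text[pos - 1:] for pos in range(1, count + 1)}

def grouped(bits):
    """Format a bit string as two bits followed by groups of three."""
    return " ".join([bits[:2]] + [bits[i:i + 3] for i in range(2, len(bits), 3)])

TEXT = "01100100010111"
for pos, s in sistrings(TEXT, 8).items():
    print(pos, grouped(s))
```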


    Tries: Lexical Ordering

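    7 00 010 111              000
    4 00 100 010 111          00100
    8 00 101 11               00101
    5 01 000 101 11           010
    1 01 100 100 010 111      011
    6 10 001 011 1            1000
    3 10 010 001 011 1        1001
    2 11 001 000 101 11       11

    The unique string of each sistring is shown in the right-hand column.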
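    Both the ordering and the unique strings can be computed directly. The sketch below assumes the eight sistrings of the binary example (text 01100100010111, starting positions 1 to 8); the function name shortest_unique_prefixes is made up, and the brute-force comparison is meant for clarity, not efficiency.

```python
TEXT = "01100100010111"
SISTRINGS = {pos: TEXT[pos - 1:] for pos in range(1, 9)}

def shortest_unique_prefixes(strings):
    """For each key, the shortest prefix that no other string shares.
    A string that is a prefix of another one has no such prefix and is skipped."""
    result = {}
    for key, s in strings.items():
        others = [o for k, o in strings.items() if k != key]
        for length in range(1, len(s) + 1):
            prefix = s[:length]
            if not any(o.startswith(prefix) for o in others):
                result[key] = prefix
                break
    return result

order = sorted(SISTRINGS, key=SISTRINGS.get)  # lexical order of the sistrings
unique = shortest_unique_prefixes(SISTRINGS)
for pos in order:
    print(pos, SISTRINGS[pos], unique[pos])
```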


    Trie: Basic Concept

    [Figure: binary trie of the sistrings from the binary example. Each edge is labeled 0 or 1, and each leaf holds the starting position of a sistring.]
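    A trie of this kind can be sketched with nested dictionaries: each sistring is inserted bit by bit and becomes a leaf as soon as its prefix is unique. The representation, the function names and the brute-force uniqueness test are assumptions of this example; the text and the eight starting positions are those of the binary example.

```python
def build_sistring_trie(text, count):
    """Nested dicts keyed by "0"/"1"; a leaf is the 1-based start position."""
    sistrings = {pos: text[pos - 1:] for pos in range(1, count + 1)}
    root = {}
    for pos, s in sistrings.items():
        node = root
        for depth, bit in enumerate(s):
            prefix = s[:depth + 1]
            shared = any(other.startswith(prefix)
                         for p, other in sistrings.items() if p != pos)
            if not shared:
                node[bit] = pos  # prefix is unique: store the leaf here
                break
            node = node.setdefault(bit, {})
    return root

def leaves(node):
    """All start positions stored below a node, in 0-before-1 order."""
    if isinstance(node, int):
        return [node]
    return [pos for bit in sorted(node) for pos in leaves(node[bit])]

def occurrences(root, text, q):
    """Start positions (among the indexed sistrings) where q occurs."""
    node = root
    for bit in q:
        if isinstance(node, int):
            # Reached a leaf early: check the rest of q against the text itself.
            return [node] if text[node - 1:].startswith(q) else []
        node = node.get(bit)
        if node is None:
            return []
    return leaves(node)

TEXT = "01100100010111"
trie = build_sistring_trie(TEXT, 8)
print(occurrences(trie, TEXT, "001"))
```

    The subtree reached by following 0, 0, 1 holds exactly the sistrings that begin with 001, which is the sense in which a subtree represents all occurrences of a substring.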


    Patricia Tree

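    [Figure: Patricia tree for the same sistrings. Internal nodes are numbered, edges are labeled 0 or 1, and the leaves hold the sistring starting positions.]

    Single-descendant nodes are eliminated.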


    YFA, April 2008

    Adapted from cs.cornell.edu