![Page 1: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/1.jpg)
ISP 433/633 Week 4
Text operation, indexing and search
![Page 2: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/2.jpg)
Document Process Steps
![Page 3: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/3.jpg)
Example Collection
Documents
D1: It is a dog eat dog world!
D2: While the world sleeps.
D3: Let sleeping dogs lie.
D4: I will eat my hat.
D5: My dog wears a hat.
![Page 4: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/4.jpg)
Step 1: Parse Text Into Words
• break at spaces and punctuation
D1: IT IS A DOG EAT DOG WORLD
D2: WHILE THE WORLD SLEEPS
D3: LET SLEEPING DOGS LIE
D4: I WILL EAT MY HAT
D5: MY DOG WEARS A HAT
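A minimal sketch of this parsing step, assuming that a "word" is any run of ASCII letters and that case is folded to upper case as on the slide. The names `DOCS` and `tokenize` are illustrative, not part of the course material.

```python
import re

# The example collection from the slides.
DOCS = {
    "D1": "It is a dog eat dog world!",
    "D2": "While the world sleeps.",
    "D3": "Let sleeping dogs lie.",
    "D4": "I will eat my hat.",
    "D5": "My dog wears a hat.",
}

def tokenize(text):
    """Break text at spaces and punctuation; return upper-cased words."""
    # Assumption: only runs of ASCII letters count as words.
    return re.findall(r"[A-Z]+", text.upper())

for doc_id, text in DOCS.items():
    print(doc_id + ":", " ".join(tokenize(text)))
```

Real collections need more care (digits, hyphens, apostrophes, non-English text), which this sketch ignores.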
![Page 5: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/5.jpg)
Step 2: Stop Words Elimination
• Remove non-distinguishing words
• Pronouns, …; prepositions, …; articles, …; to be, to have, to do
• I, MY, IT, YOUR, …, OF, BY, ON, …, A, THE, THIS, …, IS, HAS, WILL, …
D1: DOG EAT DOG WORLD
D2: WORLD SLEEPS
D3: LET SLEEPING DOGS LIE
D4: EAT HAT
D5: DOG WEARS HAT
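Stop word elimination can be sketched as a simple set lookup. The `STOP_WORDS` set below is a small hypothetical list: it contains only the words the slide names explicitly, plus WHILE, which is an assumption added so that the output matches the slide for D2.

```python
# Hypothetical stop list: words named on the slide, plus WHILE (assumed).
STOP_WORDS = {
    "I", "MY", "IT", "YOUR", "OF", "BY", "ON",
    "A", "THE", "THIS", "IS", "HAS", "WILL", "WHILE",
}

# Output of Step 1 for the example collection.
PARSED = {
    "D1": ["IT", "IS", "A", "DOG", "EAT", "DOG", "WORLD"],
    "D2": ["WHILE", "THE", "WORLD", "SLEEPS"],
    "D3": ["LET", "SLEEPING", "DOGS", "LIE"],
    "D4": ["I", "WILL", "EAT", "MY", "HAT"],
    "D5": ["MY", "DOG", "WEARS", "A", "HAT"],
}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list, preserving order."""
    return [t for t in tokens if t not in stop_words]

for doc_id, tokens in PARSED.items():
    print(doc_id + ":", " ".join(remove_stop_words(tokens)))
```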
![Page 6: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/6.jpg)
Stop Words List
• The 250-300 most common words in English account for 50% or more of a given text.
  – Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.
• Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens).
  – Top 65 types cover 1132 tokens (> 50%).
  – Token/type ratio: 2256/859 = 2.63
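The type and token figures above can be computed for any token list with a frequency count. The sketch below is illustrative only: it runs on the 25 parsed tokens of the example collection rather than on Moby Dick, and the function name `type_token_stats` is made up for this example.

```python
from collections import Counter

def type_token_stats(tokens, top_n):
    """Return (types, tokens, token/type ratio, share covered by top_n types)."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    covered = sum(c for _, c in counts.most_common(top_n))
    return n_types, n_tokens, n_tokens / n_types, covered / n_tokens

# Parsed tokens of D1..D5 from Step 1.
TOKENS = (
    "IT IS A DOG EAT DOG WORLD "
    "WHILE THE WORLD SLEEPS "
    "LET SLEEPING DOGS LIE "
    "I WILL EAT MY HAT "
    "MY DOG WEARS A HAT"
).split()

types, tokens, ratio, coverage = type_token_stats(TOKENS, top_n=6)
print(types, tokens, round(ratio, 2), round(coverage, 2))

# The slide's own arithmetic for Moby Dick Ch.1:
print(round(2256 / 859, 2), round(1132 / 2256, 3))
```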
![Page 7: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/7.jpg)
Step 3: Stemming
• Goal: “normalize” similar words
D1: DOG EAT DOG WORLD
D2: WORLD SLEEP
D3: LET SLEEP DOG LIE
D4: EAT HAT
D5: DOG WEAR HAT
![Page 8: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/8.jpg)
Stemming and Morphological Analysis
Morphology (“form” of words)
– Inflectional Morphology
  • E.g., inflect verb endings and noun number
  • Never change grammatical class
    – dog, dogs
– Derivational Morphology
  • Derive one word from another
  • Often change grammatical class
    – build, building; health, healthy
![Page 9: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/9.jpg)
Simple “S” stemming
• IF a word ends in “ies”, but not “eies” or “aies”
  – THEN “ies” → “y”
• IF a word ends in “es”, but not “aes”, “ees”, or “oes”
  – THEN “es” → “e”
• IF a word ends in “s”, but not “us” or “ss”
  – THEN “s” → NULL
Harman, JASIS 1991
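The three rules above translate almost directly into code. This is a sketch under one assumption that the slide does not state: a word that is excluded by one rule's exception list falls through to the next rule (so “toes” is handled by the plain “s” rule). The function name `s_stem` is illustrative.

```python
def s_stem(word):
    """Simple "S" stemmer: strips plural endings using the three rules above."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"          # "ies" -> "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"          # "es" -> "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]                # "s" -> NULL
    return w

for word in ["dogs", "sleeps", "wears", "ponies", "horses", "glass", "corpus"]:
    print(word, "->", s_stem(word))
```

Note that this stemmer only removes plural-like endings; it would not reduce SLEEPING to SLEEP as in the Step 3 example.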
![Page 10: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/10.jpg)
Porter’s Algorithm
• An effective, simple and popular English stemmer
• Official URL http://www.tartarus.org/~martin/PorterStemmer/
• A demo http://snowball.tartarus.org/demo.php
![Page 11: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/11.jpg)
Porter’s Algorithm
• 1. The measure, m, of a stem is based on its alternating sequences of vowels and consonants. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form [C](VC)^m[V], where the initial C and final V are optional and m is the number of VC repeats.
  m=0: free, why
  m=1: frees, whose
  m=2: prologue, compute
• 2. *<X> - stem ends with letter X
• 3. *v* - stem contains a vowel
• 4. *d - stem ends in a double consonant
• 5. *o - stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
Porter, Program 1980
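The measure m can be computed by counting how often a vowel run is followed by a consonant. The sketch below follows Porter's convention that “y” counts as a vowel when it comes after a consonant; the helper names `is_consonant` and `measure` are illustrative.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a vowel when preceded by a consonant, otherwise a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC repeats in the form [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1                   # a vowel run just ended in a consonant
        prev_was_vowel = not cons
    return m

for word in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(word, measure(word))
```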
![Page 12: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/12.jpg)
Porter’s Algorithm
• Suffix conditions take the form current_suffix == pattern
• Actions are in the form old_suffix -> new_suffix
• Rules are divided into steps to define the order of applying the rules. The following are some examples of the rules:

STEP  CONDITION  SUFFIX  REPLACEMENT  EXAMPLE
1a    NULL       sses    ss           stresses->stress
1b    *v*        ing     NULL         making->mak
1b1   NULL       at      ate          inflat(ed)->inflate
1c    *v*        y       i            happy->happi
2     m>0        aliti   al           formaliti->formal
3     m>0        icate   ic           duplicate->duplic
4     m>1        able    NULL         adjustable->adjust
5a    m>1        e       NULL         inflate->inflat
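To make the condition/suffix/replacement format concrete, here is a sketch that applies a single sample rule to a word. It is not the full Porter stemmer: it covers only the rules listed above, omits 1b1 (which only fires after another 1b rule has removed a suffix), and the names `RULES`, `apply_rule`, and `apply_step` are made up for illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m in [C](VC)^m[V]: count vowel-to-consonant transitions."""
    m, prev_was_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m

def contains_vowel(stem):
    """The *v* condition: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

# (step, condition on the stem, suffix, replacement); None means no condition.
RULES = [
    ("1a", None, "sses", "ss"),
    ("1b", contains_vowel, "ing", ""),
    ("1c", contains_vowel, "y", "i"),
    ("2", lambda s: measure(s) > 0, "aliti", "al"),
    ("3", lambda s: measure(s) > 0, "icate", "ic"),
    ("4", lambda s: measure(s) > 1, "able", ""),
    ("5a", lambda s: measure(s) > 1, "e", ""),
]

def apply_rule(word, condition, suffix, replacement):
    """Replace suffix if the word ends with it and the stem meets the condition."""
    if not word.endswith(suffix):
        return word
    stem = word[: -len(suffix)]
    if condition is not None and not condition(stem):
        return word
    return stem + replacement

def apply_step(step, word):
    for name, condition, suffix, replacement in RULES:
        if name == step:
            return apply_rule(word, condition, suffix, replacement)
    raise KeyError(step)

print(apply_step("1b", "making"))   # stem "mak" contains a vowel
print(apply_step("1b", "sing"))     # stem "s" has no vowel, so unchanged
```

The condition is always tested on the stem that remains after the suffix is removed, which is why “sing” keeps its “ing”.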
![Page 13: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/13.jpg)
Problems of Porter’s Algorithm
• Unreadable results
• Does not handle some irregular verbs and adjectives
  – Take/took
  – Bad/worse
• Possible errors:

Too Aggressive        Too Timid
Organization/organ    Relatedness/related
Executive/execute     Create/creation
![Page 14: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/14.jpg)
Step 4: Indexing
• Inverted Files
Vocabulary  Occurrences
DOG         D1 D3 D5
EAT         D1 D4
HAT         D4 D5
LET         D3
LIE         D3
SLEEP       D2 D3
WEAR        D5
WORLD       D1 D2
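An inverted file like this one can be sketched as a dictionary that maps each vocabulary term to the sorted list of documents containing it. The input below is the stemmed output of Step 3; the names `STEMMED` and `build_inverted_index` are illustrative.

```python
from collections import defaultdict

# Stemmed terms per document (output of Step 3).
STEMMED = {
    "D1": ["DOG", "EAT", "DOG", "WORLD"],
    "D2": ["WORLD", "SLEEP"],
    "D3": ["LET", "SLEEP", "DOG", "LIE"],
    "D4": ["EAT", "HAT"],
    "D5": ["DOG", "WEAR", "HAT"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids in which it occurs."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # Sort the vocabulary and each occurrence list for stable output.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

INDEX = build_inverted_index(STEMMED)
for term, doc_ids in INDEX.items():
    print(term.ljust(6), " ".join(doc_ids))
```

A single-term query is then a dictionary lookup, for example `INDEX["SLEEP"]`.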
![Page 15: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/15.jpg)
Inverted Files
• Occurrences can point to
  – Documents
  – Positions in a document
  – Weight
• Most commonly used indexing method
• Based on words
  – Queries such as phrases are expensive to solve
  – Some data does not have words
    • Genetic data
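When occurrences point to positions rather than whole documents, phrase queries become possible, at the cost of checking positions for every candidate. The sketch below is one way to illustrate this, assuming positions are counted after stop word removal and stemming; `build_positional_index` and `phrase_query` are hypothetical names.

```python
from collections import defaultdict

STEMMED = {
    "D1": ["DOG", "EAT", "DOG", "WORLD"],
    "D2": ["WORLD", "SLEEP"],
    "D3": ["LET", "SLEEP", "DOG", "LIE"],
    "D4": ["EAT", "HAT"],
    "D5": ["DOG", "WEAR", "HAT"],
}

def build_positional_index(docs):
    """Map term -> document id -> list of positions of the term."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return documents in which the (non-empty) term list occurs consecutively."""
    first, rest = phrase[0], phrase[1:]
    hits = []
    for doc_id, starts in index.get(first, {}).items():
        for start in starts:
            # Every later term must appear at the next consecutive position.
            if all(start + k in index.get(term, {}).get(doc_id, [])
                   for k, term in enumerate(rest, 1)):
                hits.append(doc_id)
                break
    return sorted(hits)

POS_INDEX = build_positional_index(STEMMED)
print(phrase_query(POS_INDEX, ["DOG", "EAT"]))
print(phrase_query(POS_INDEX, ["DOG", "HAT"]))
```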
![Page 16: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/16.jpg)
Suffix Trees
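1234567890123456789012345678901234567890123456789012345678901234567
This is a text. A text has many words. Words are made from letters.

Index points: 11 (text), 19 (text), 28 (many), 33 (words), 40 (Words), 50 (made), 60 (letters)
[Figure: suffix tree over the suffixes starting at the index points; the root branches on l, m, t, and w, and the leaves are 60, 50, 28, 11, 19, 33, 40]
Patricia tree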
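Building a real suffix tree or Patricia tree takes more code than fits here, so the sketch below swaps in a simpler structure, a sorted array of the same suffixes (a suffix array), and searches it by binary search. It answers the same kind of prefix query over the index points. Case folding is an assumption (the figure files “words” and “Words” under the same branch), and all names are illustrative.

```python
import bisect

TEXT = "This is a text. A text has many words. Words are made from letters."
INDEX_POINTS = [11, 19, 28, 33, 40, 50, 60]   # 1-based, as on the slide

def build_suffix_array(text, points):
    """Sorted list of (suffix, position) for the suffixes at the index points."""
    lowered = text.lower()                     # assumption: case is folded
    return sorted((lowered[p - 1:], p) for p in points)

def search(suffix_array, prefix):
    """Return the index points whose suffix starts with prefix."""
    prefix = prefix.lower()
    keys = [suffix for suffix, _ in suffix_array]
    i = bisect.bisect_left(keys, prefix)       # first suffix >= prefix
    hits = []
    while i < len(keys) and keys[i].startswith(prefix):
        hits.append(suffix_array[i][1])
        i += 1
    return sorted(hits)

SA = build_suffix_array(TEXT, INDEX_POINTS)
print(search(SA, "text"))    # [11, 19]
print(search(SA, "ma"))      # [28, 50]
```

Because whole suffixes are indexed rather than single words, a query such as "text has" also works, which is what makes this family of structures suitable for phrases and for data without word boundaries.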
![Page 17: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/17.jpg)
Text Compression
• Represent text in fewer bits
• Symbols to be compressed are words
• Method of choice
  – Huffman coding
![Page 18: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/18.jpg)
Huffman Coding
• Developed by David Huffman (1952)
• Average of 5 bits per character
• Based on frequency distributions of symbols
• Idea: assign shorter code to more frequent symbols
• Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
![Page 19: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/19.jpg)
An Example
Symbol Frequency
A 7
B 4
C 10
D 5
E 2
F 11
G 15
H 3
I 7
J 8
[Figure: Huffman tree built from the frequencies above; each branch is labeled 0 or 1 and the leaves are the symbols a through j]
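The tree construction can be sketched with a priority queue: repeatedly remove the two least frequent nodes and merge them. The frequencies are the ones from the table above. The exact bit patterns depend on how ties are broken and on which branch gets 0, so this sketch may produce different codewords from the slide's tree while using the same total number of bits. The names `FREQ` and `huffman_codes` are illustrative.

```python
import heapq
from itertools import count

FREQ = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
        "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}

def huffman_codes(freq):
    """Build a Huffman tree and return a dict symbol -> bit string."""
    tiebreak = count()                      # keeps heap entries comparable
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # least frequent node
        f2, _, right = heapq.heappop(heap)  # second least frequent node
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: a symbol
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes

CODES = huffman_codes(FREQ)
total_bits = sum(FREQ[s] * len(CODES[s]) for s in FREQ)
for sym in sorted(CODES):
    print(sym, CODES[sym])
print("total bits:", total_bits, "average:", round(total_bits / sum(FREQ.values()), 2))
```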
![Page 20: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/20.jpg)
Example Coding
[Figure: the Huffman tree from the previous slide, with each branch labeled 0 or 1]
Symbol Code
A 0110
B 0010
C 000
D 0011
E 01110
F 010
G 10
H 01111
I 110
J 111
![Page 21: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/21.jpg)
Exercise
• Consider the bit string: 011011011110001001100011101001110001101011010111
• Use the Huffman code from the example to decode it.
• Try inserting, deleting, and switching some bits at random locations and try decoding
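A small decoder makes it easier to experiment with the exercise. This sketch reads bits one at a time and emits a symbol as soon as the buffer matches a codeword, which works because no codeword is a prefix of another. The code table is copied from the example slide; `decode` and `encode` are illustrative names.

```python
CODES = {
    "A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
    "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111",
}

BITS = "011011011110001001100011101001110001101011010111"

def decode(bits, codes=CODES):
    """Decode a bit string with a prefix code; raise if bits are left over."""
    by_code = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in by_code:           # a complete codeword has been read
            out.append(by_code[buf])
            buf = ""
    if buf:
        raise ValueError("leftover bits do not form a codeword: " + buf)
    return "".join(out)

def encode(text, codes=CODES):
    return "".join(codes[ch] for ch in text)

message = decode(BITS)
print(len(message), "symbols decoded")

# Flip one bit and decode again to see how far the damage spreads.
flipped = BITS[:5] + ("1" if BITS[5] == "0" else "0") + BITS[6:]
try:
    print(decode(flipped))
except ValueError as err:
    print("decoding failed:", err)
```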
![Page 22: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/22.jpg)
Huffman Code
• Prefix property
  – No word in the code is a prefix of any other word in the code
• Random access
  – Decompress starting from anywhere
• Not the fastest
![Page 23: ISP 433/633 Week 4](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814e1d550346895dbb84cf/html5/thumbnails/23.jpg)
Sequential string searching
• Boyer-Moore algorithm
• Example: search for “cats” in “the catalog of all cats”
• Some preprocessing is needed.
• Demos: http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html
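The sketch below shows the preprocessing idea on the slide's example. It is the Horspool simplification of Boyer-Moore, which uses only a bad-character shift table and leaves out the good-suffix rule of the full algorithm. The function name `horspool_search` is illustrative.

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Preprocessing: for each pattern character except the last, how far the
    # window may move when that character is aligned with the window's end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:   # compare right to left
            j -= 1
        if j < 0:
            return pos
        # Characters that do not occur in the pattern allow a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1

print(horspool_search("the catalog of all cats", "cats"))   # 19
```

On this example the window jumps from position 0 to 4, 6, 10, 14, 18, and 19, so only seven alignments are tried instead of twenty.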