2002.10.29 – SLIDE 1 – IS 202 – FALL 2002
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2002
http://www.sims.berkeley.edu/academics/courses/is202/f02/
SIMS 202:
Information Organization
and Retrieval
Lecture 17: Statistical Properties of Text
Lecture Overview
• Review
– Central Concepts in IR
– Boolean Logic
• Content Analysis
• Statistical Properties of Text
– Zipf distribution
– Statistical dependence
• Indexing and Inverted Files
Credit for some of the slides in this lecture goes to Marti Hearst
Central Concepts in IR
• Documents
• Queries
• Collections
• Evaluation
• Relevance
Relevance (introduction)
• In what ways can a document be relevant to a query?
– Answer a precise question precisely
• Who is buried in Grant's tomb? Grant.
– Partially answer a question
• Where is Danville? Near Walnut Creek.
– Suggest a source for more information
• What is lymphedema? Look in this medical dictionary…
– Give background information
– Remind the user of other knowledge
– Others…
Relevance
• "Intuitively, we understand quite well what relevance means. It is a primitive 'y'know' concept, as is information, for which we hardly need a definition. … if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance."
» Saracevic, 1975, p. 324
Janes’ View
Topicality
Pertinence
Relevance
Utility
Satisfaction
Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
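As a sketch (terms and document IDs invented for illustration, not from the lecture), these queries can be evaluated directly with Python sets, treating each term as the set of documents matching it:

```python
# Each term maps to the set of (invented) doc IDs containing it.
cat, dog = {1, 2, 5}, {2, 3, 5}
collar, leash = {4, 5}, {3, 4}

print(cat | dog)                       # Cat OR Dog
print(cat & dog)                       # Cat AND Dog
print((cat & dog) | collar)            # (Cat AND Dog) OR Collar
print((cat & dog) | (collar & leash))  # (Cat AND Dog) OR (Collar AND Leash)
print((cat | dog) & (collar | leash))  # (Cat OR Dog) AND (Collar OR Leash)
```

Note how the last two queries return different documents even though they use the same four terms; operator grouping matters.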
Boolean Logic
(Venn diagrams of sets A and B, illustrating C = A ∧ B, C = A ∨ B, and C = ¬A)

De Morgan's Law:
¬(A ∧ B) = ¬A ∨ ¬B
¬(A ∨ B) = ¬A ∧ ¬B
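A quick way to convince yourself of De Morgan's laws is to check them over a small, invented universe of document IDs, where NOT is set complement:

```python
# De Morgan: the complement of a union is the intersection of the
# complements, and vice versa. U, A, B are invented for the example.
U = set(range(10))
A, B = {1, 2, 3, 4}, {3, 4, 5, 6}

assert U - (A | B) == (U - A) & (U - B)   # NOT(A OR B) == NOT A AND NOT B
assert U - (A & B) == (U - A) | (U - B)   # NOT(A AND B) == NOT A OR NOT B
print("De Morgan's laws hold on this example")
```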
Boolean Logic
(Venn diagram of three index terms t1, t2, t3: the circles partition the space into eight minterm regions m1–m8, each the conjunction of t1, t2, t3 or their negations, e.g. m1 = ¬t1 ∧ ¬t2 ∧ ¬t3 and m8 = t1 ∧ t2 ∧ t3. Documents D1–D11 are placed in the regions corresponding to the terms they contain.)
Boolean Systems
• Most of the commercial database search systems that pre-date the WWW are based on Boolean search
– Dialog, Lexis-Nexis, etc.
• Most online library catalogs are Boolean systems
– E.g., MELVYL
• Database systems use Boolean logic for searching
• Many of the search engines sold for intranet search of web sites are Boolean
Content Analysis
• Automated transformation of raw text into a form that represents some aspect(s) of its meaning
• Including, but not limited to:
– Automated thesaurus generation
– Phrase detection
– Categorization
– Clustering
– Summarization
Techniques for Content Analysis
• Statistical
– Single document
– Full collection
• Linguistic
– Syntactic
– Semantic
– Pragmatic
• Knowledge-Based (Artificial Intelligence)
• Hybrid (Combinations)
Text Processing
• Standard steps:
– Recognize document structure
• titles, sections, paragraphs, etc.
– Break into tokens
• usually space- and punctuation-delineated
• special issues with Asian languages
– Stemming/morphological analysis
– Store in inverted index (to be discussed later)
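A minimal sketch of these steps in Python; the tokenizer, the toy suffix rule, and the two sample documents are illustrative assumptions, not the course's actual code:

```python
import re

# Sketch: tokenize, apply a toy suffix-stripping "stemmer", and store
# terms in an inverted index mapping each stem to its documents.
def tokenize(text):
    # Space/punctuation-delineated tokens, lowercased.
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Toy rule: strip a common suffix if enough of the word remains.
    for suffix in ("ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_documents(docs):
    inverted = {}
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            inverted.setdefault(stem(tok), set()).add(doc_id)
    return inverted

idx = index_documents({1: "Cats chasing dogs.", 2: "A dog's collar."})
print(idx["dog"])   # stemming matches "dogs" and "dog's" to one stem
```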
Content Analysis Areas
(Diagram: on the document side, text input is parsed and pre-processed into an index over the collections, annotated "How is the text processed?"; on the user side, an information need is expressed as a query, annotated "How is the query constructed?", and documents are ranked against the index.)
Document Processing Steps
From “Modern IR” textbook
Stemming and Morphological Analysis
• Goal: "normalize" similar words
• Morphology ("form" of words)
– Inflectional morphology
• E.g., inflect verb endings and noun number
• Never changes grammatical class
– dog, dogs
– tengo, tienes, tiene, tenemos, tienen
– Derivational morphology
• Derives one word from another
• Often changes grammatical class
– build, building; health, healthy
Automated Methods
• Powerful multilingual tools exist for morphological analysis
– PCKimmo, Xerox lexical technology
– Require a grammar and dictionary
– Use "two-level" automata
• Stemmers:
– Very dumb rules work well (for English)
– Porter stemmer: iteratively remove suffixes
– Improvement: pass results through a lexicon
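A toy illustration of iterative suffix removal plus the lexicon pass-through improvement; the suffix list and lexicon here are invented for the example, and this is not the real Porter algorithm:

```python
# Toy iterative suffix-stripper (NOT the real Porter stemmer), with
# the lexicon improvement: a stripped form is kept only if the
# lexicon recognizes it. SUFFIXES and LEXICON are invented.
SUFFIXES = ["ization", "ement", "tion", "ness", "ing", "ed", "s"]
LEXICON = {"connect", "connection", "organ", "organize"}

def iterative_stem(word, lexicon=None):
    changed = True
    while changed:
        changed = False
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) - len(suf) >= 3:
                candidate = word[: -len(suf)]
                if lexicon is None or candidate in lexicon:
                    word = candidate
                    changed = True
                break   # rescan suffixes on the (maybe) shorter word
    return word

print(iterative_stem("connections"))           # blind stripping over-stems
print(iterative_stem("connections", LEXICON))  # lexicon stops at a real word
```

With no lexicon the rules happily strip "connections" down past "connection"; passing the result through a lexicon stops the stripping at an attested word.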
Errors Generated by Porter Stemmer
Too Aggressive            Too Timid
organization / organ      european / europe
policy / police           cylinder / cylindrical
execute / executive       create / creation
arm / army                search / searcher
From Krovetz ‘93
Statistical Properties of Text
• Token occurrences in text are not uniformly distributed
• They are also not normally distributed
• They do exhibit a Zipf distribution
Plotting Word Frequency by Rank
• Main idea:
– Count how many times tokens occur in the text
• Sum over all of the texts in the collection
– Now order these tokens according to how often they occur (highest to lowest)
– This ordering is called the rank
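The counting-and-ranking procedure can be sketched as follows; the two-line "collection" is invented for illustration:

```python
from collections import Counter

# Count token occurrences, summed over all texts in the collection,
# then order tokens by frequency; list position gives the rank.
collection = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
counts = Counter()
for text in collection:
    counts.update(text.split())

ranked = counts.most_common()   # highest to lowest frequency
for rank, (token, freq) in enumerate(ranked, start=1):
    print(rank, token, freq)
```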
A Typical Collection
8164 the
4771 of
4005 to
2834 a
2827 and
2802 in
1592 The
1370 for
1326 is
1324 s
1194 that
973 by
969 on
915 FT
883 Mr
860 was
855 be
849 Pounds
798 TEXT
798 PUB
798 PROFILE
798 PAGE
798 HEADLINE
798 DOCNO

1 ABC
1 ABFT
1 ABOUT
1 ACFT
1 ACI
1 ACQUI
1 ACQUISITIONS
1 ACSIS
1 ADFT
1 ADVISERS
1 AE
Government documents, 157734 tokens, 32259 unique
A Small Collection (Stems)

Rank Freq Term
1 37 system
2 32 knowledg
3 24 base
4 20 problem
5 18 abstract
6 15 model
7 15 languag
8 15 implem
9 13 reason
10 13 inform
11 11 expert
12 11 analysi
13 10 rule
14 10 program
15 10 oper
16 10 evalu
17 10 comput
18 10 case
19 9 gener
20 9 form

150 2 enhanc
151 2 energi
152 2 emphasi
153 2 detect
154 2 desir
155 2 date
156 2 critic
157 2 content
158 2 consider
159 2 concern
160 2 compon
161 2 compar
162 2 commerci
163 2 clause
164 2 aspect
165 2 area
166 2 aim
167 2 affect
The Corresponding Zipf Curve
(Zipf curve plotted from the rank/frequency data on the previous slide.)
Zoom in on the Knee of the Curve
Rank Freq Term
43 6 approach
44 5 work
45 5 variabl
46 5 theori
47 5 specif
48 5 softwar
49 5 requir
50 5 potenti
51 5 method
52 5 mean
53 5 inher
54 5 data
55 5 commit
56 5 applic
57 4 tool
58 4 technolog
59 4 techniqu
Zipf Distribution
• The Important Points:
– a few elements occur very frequently
– a medium number of elements have medium frequency
– many elements occur very infrequently
• The product of the frequency of words (f) and their rank (r) is approximately constant
– Rank = order of words' frequency of occurrence
• Another way to state this is with an approximately correct rule of thumb:
– Say the most common term occurs C times
– The second most common occurs C/2 times
– The third most common occurs C/3 times

f ≈ C · (1/r)
C ≈ N/10
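Applying the rule of thumb to the 157,734-token government-documents collection from the earlier slide:

```python
# Rule-of-thumb sketch: rank * frequency is roughly constant, with
# C ~= N/10 as the estimated count of the most common term.
N = 157_734           # tokens in the sample collection above
C = N / 10
for r in range(1, 6):
    print(r, round(C / r))   # expected count of the r-th ranked term
```

The estimate for rank 1 (about 15,773) overshoots the observed 8,164 for "the"; the rule is only approximately correct, as the slide says.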
Zipf Distribution
(The Zipf curve plotted on a linear scale and on a logarithmic scale.)
What has a Zipf Distribution?
• Words in a text collection
– Virtually any use of natural language
• Library book checkout patterns
• Incoming Web Page Requests (Nielsen)
• Outgoing Web Page Requests (Cunha & Crovella)
• Document Size on Web (Cunha & Crovella)
Related Distributions/”Laws”
• Bradford’s Law of Scattering
• Lotka’s Law of Productivity
• De Solla Price’s Urn Model for “Cumulative Advantage Processes”
(Urn-model illustration: pick a ball, then replace it along with one more of the same color; the probability of picking that color again grows: 1/2 = 50%, 2/3 = 66%, 3/4 = 75%, …)
Very frequent word stems

WORD FREQ
u 63245
ha 65470
california 67251
m 67903
1998 68662
system 69345
t 70014
about 70923
servic 71822
work 71958
home 72131
other 72726
research 74264
1997 75323
can 76762
next 77973
your 78489
all 79993
public 81427
us 82551
c 83250
www 87029
wa 92384
program 95260
not 100204
http 100696
d 101034
html 103698
student 104635
univers 105183
inform 106463
will 109700
new 115937
have 119428
page 128702
messag 141542
from 147440
you 162499
edu 167298
be 185162
publib 189334
librari 189347
i 190635
lib 223851
that 227311
s 234467
berkelei 245406
re 272123
web 280966
archiv 305834
From the Cha-Cha Web Index for the Berkeley.EDU domain
Frequent words on the WWW

• 65002930 the
• 62789720 a
• 60857930 to
• 57248022 of
• 54078359 and
• 52928506 in
• 50686940 s
• 49986064 for
• 45999001 on
• 42205245 this
• 41203451 is
• 39779377 by
• 35439894 with
• 35284151 or
• 34446866 at
• 33528897 all
• 31583607 are
• 30998255 from
• 30755410 e
• 30080013 you
• 29669506 be
• 29417504 that
• 28542378 not
• 28162417 an
• 28110383 as
• 28076530 home
• 27650474 it
• 27572533 i
• 24548796 have
• 24420453 if
• 24376758 new
• 24171603 t
• 23951805 your
• 23875218 page
• 22292805 about
• 22265579 com
• 22107392 information
• 21647927 will
• 21368265 can
• 21367950 more
• 21102223 has
• 20621335 no
• 19898015 other
• 19689603 one
• 19613061 c
• 19394862 d
• 19279458 m
• 19199145 was
• 19075253 copyright
• 18636563 us
(see http://elib.cs.berkeley.edu/docfreq/docfreq.html)
Words that occur few times

WORD FREQ
agendaaugust 1
anelectronic 1
centerjanuary 1
packardequipment 1
systemjuly 1
systemscs186 1
todaymcb 1
workshopsfinding 1
workshopsthe 1
lollini 1
0+ 1
0 1
00summary 1
35816 1
35823 1
01d 1
35830 1
35837 1
02-156-10 1
35844 1
35851 1
02aframst 1
311 1
313 1
03agenvchm 1
401 1
408 1
408 1
422 1
424 1
429 1
04agrcecon 1
04cklist 1
05-128-10 1
501 1
506 1
05amstud 1
06anhist 1
07-149 1
07-800-80 1
07anthro 1
08apst 1
From the Cha-Cha Web Index for the Berkeley.EDU domain
Consequences of Zipf
• There are always a few very frequent tokens that are not good discriminators
– Called "stop words" in IR
– Usually correspond to the linguistic notion of "closed-class" words
• English examples: to, from, on, and, the, ...
• Grammatical classes that don't take on new members
• There are always a large number of tokens that occur once (and can have unexpected consequences for some IR algorithms)
• Medium-frequency words are the most descriptive
Word Frequency vs. Resolving Power
The most frequent words are not the most descriptive.
(from van Rijsbergen 79)
Statistical Independence vs. Dependence

• How likely is a red car to drive by, given we've seen a black one?
• How likely is the word "ambulance" to appear, given that we've seen "car accident"?
• Colors of cars driving by are independent (although more frequent colors are more likely)
• Words in text are not independent (although again more frequent words are more likely)
Statistical Independence
• Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together
P(x) P(y) = P(x, y)
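A quick simulated check of this definition, using coin flips (an invented stand-in, not IR data): for independent events the product of the individual probabilities should be close to the joint probability.

```python
import random

# Estimate P(x), P(y), and P(x, y) from simulated independent coin
# flips; P(x) * P(y) should nearly equal P(x, y).
random.seed(0)
flips = [(random.random() < 0.5, random.random() < 0.5)
         for _ in range(100_000)]

p_x = sum(x for x, _ in flips) / len(flips)
p_y = sum(y for _, y in flips) / len(flips)
p_xy = sum(x and y for x, y in flips) / len(flips)
print(p_x * p_y, p_xy)   # the two estimates should nearly agree
```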
Statistical Independence and Dependence
• What are examples of things that are statistically independent?
• What are examples of things that are statistically dependent?
Lexical Associations
• Subjects write first word that comes to mind
– doctor/nurse; black/white (Palermo & Jenkins 64)
• Text corpora can yield similar associations
• One measure: Mutual Information (Church and Hanks 89)

I(x,y) = log₂ [ P(x,y) / ( P(x) P(y) ) ]

• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
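Plugging the Doctors/Nurses row of the association table that follows (f(x,y)=30, f(x)=1105, f(y)=241, N=15 million) into this formula reproduces its I(x,y) value of 10.7:

```python
import math

# Mutual information from raw counts, with probabilities estimated as
# frequency over collection size N.
def mutual_information(f_xy, f_x, f_y, n):
    p_xy = f_xy / n
    p_x, p_y = f_x / n, f_y / n
    return math.log2(p_xy / (p_x * p_y))

# Doctors/Nurses row from the AP-corpus table (N = 15 million).
print(round(mutual_information(30, 1105, 241, 15_000_000), 1))  # 10.7
```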
Statistical Independence
• Compute for a window of words

P(x) ≈ f(x)/N

P(x)P(y) = P(x,y) if independent.

We'll approximate P(x,y) as follows:

P(x,y) ≈ (1/N) Σᵢ wᵢ(x,y)

where:
– |w| = length of window (say 5)
– wᵢ = words within window starting at position i
– wᵢ(x,y) = number of times x and y co-occur in wᵢ
– N = number of words in collection

(Illustration: windows w1, w11, w21 sliding over the word sequence a b c d e f g h i j k l m n o p)
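A direct sketch of this windowed estimate over the illustrative word sequence a b c … p:

```python
# Slide a length-|w| window over the token stream and count windows
# in which both x and y appear, normalized by collection size N.
def cooccurrence_prob(tokens, x, y, window=5):
    n = len(tokens)
    count = 0
    for i in range(n - window + 1):
        span = tokens[i : i + window]
        if x in span and y in span:
            count += 1
    return count / n

tokens = "a b c d e f g h i j k l m n o p".split()
print(cooccurrence_prob(tokens, "c", "e"))   # 3 qualifying windows / 16 tokens
```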
Interesting Associations with “Doctor”
I(x,y)  f(x,y)  f(x)  x         f(y)  y
11.3    12      111   Honorary  621   Doctor
11.3 8 1105 Doctors 44 Dentists
10.7 30 1105 Doctors 241 Nurses
9.4 8 1105 Doctors 154 Treating
9.0 6 275 Examined 621 Doctor
8.9 11 1105 Doctors 317 Treat
8.7 25 621 Doctor 1407 Bills
AP Corpus, N=15 million, Church & Hanks 89
I(x,y)  f(x,y)  f(x)    x       f(y)   y
0.96    6       621     doctor  73785  with
0.95 41 284690 a 1105 doctors
0.93 12 84716 is 1105 doctors
These associations were likely to happen because the non-doctor words shown here are very common, and therefore likely to co-occur with any noun.
Un-Interesting Associations with “Doctor”
AP Corpus, N=15 million, Church & Hanks 89
Document Vectors
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
– A vector is like an array of floating-point numbers
– Has direction and magnitude
– Each vector holds a place for every term in the collection
– Therefore, most vectors are sparse
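A sketch of sparse storage as term-to-count dicts; the counts echo rows A and I of the document-vector table on the slides that follow, and the key "hwood" stands in for h'wood:

```python
# Because most entries are zero, document vectors are usually stored
# sparsely, e.g. as {term: count} dicts rather than full arrays.
doc_a = {"nova": 10, "galaxy": 5, "heat": 3}
doc_i = {"hwood": 7, "film": 5, "diet": 1, "fur": 3}

def dense(vector, vocabulary):
    # Expand a sparse vector over the full collection vocabulary.
    return [vector.get(term, 0) for term in vocabulary]

vocab = ["nova", "galaxy", "heat", "hwood", "film", "role", "diet", "fur"]
print(dense(doc_a, vocab))   # mostly zeros: the vector is sparse
```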
Document Vectors
ID (term columns: nova, galaxy, heat, h'wood, film, role, diet, fur; blank cells = 0; exact column alignment was lost in transcription)
A: 10 5 3
B: 5 10
C: 10 8 7
D: 9 10 5
E: 10 10
F: 9 10
G: 5 7 9
H: 6 10 2 8
I: 7 5 1 3
“Nova” occurs 10 times in text A“Galaxy” occurs 5 times in text A“Heat” occurs 3 times in text A(Blank means 0 occurrences.)
Document Vectors
(Document-vector table repeated from the previous slide.)
“Hollywood” occurs 7 times in text I“Film” occurs 5 times in text I“Diet” occurs 1 time in text I“Fur” occurs 3 times in text I
Document Vectors
(Document-vector table repeated from the previous slide.)
We Can Plot the Vectors
(Plot with axes "Star" and "Diet": a document about astronomy lies near the Star axis, a document about movie stars between the axes, and a document about mammal behavior near the Diet axis.)
Documents in 3D Space
Content Analysis Summary
• Content analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
– Word frequencies have a Zipf distribution
– Word co-occurrences exhibit dependencies
• Text documents are transformed to vectors
– Pre-processing includes tokenization, stemming, collocations/phrases
– Documents occupy a multi-dimensional space
Inverted Index
• This is the primary data structure for text indexes
• Main idea:
– Invert documents into a big index
• Basic steps:
– Make a "dictionary" of all the tokens in the collection
– For each token, list all the docs it occurs in
– Do a few things to reduce redundancy in the data structure
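The basic steps above, sketched with two invented mini-documents:

```python
# Build a "dictionary" of all tokens, and for each token list the
# documents it occurs in.
docs = {
    1: "now is the time for all good men",
    2: "it was a dark and stormy night",
}
index = {}
for doc_id, text in docs.items():
    for token in text.split():
        index.setdefault(token, []).append(doc_id)

# Reduce redundancy: keep one posting per (term, document).
index = {t: sorted(set(ids)) for t, ids in index.items()}
print(index["the"], index["dark"])
```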
(The same indexing/retrieval diagram as earlier: text input is parsed, pre-processed, and indexed over the collections, and queries from an information need are ranked against the index; here annotated "How is the index constructed?")
Inverted Indexes
• We have seen "vector files" conceptually
– An inverted file is a vector file "inverted" so that rows become columns and columns become rows
docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
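In Python, this inversion is literally a transpose of the incidence matrix (only the D1–D7 columns shown above are included here):

```python
# "Inverting" the document-term incidence matrix is a transpose:
# document rows become term rows, i.e. postings.
docs = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0),
}
matrix = [docs[d] for d in sorted(docs)]   # rows D1..D7 in order
inverted = list(zip(*matrix))              # one row per term t1..t3
print(inverted[0])   # t1's incidence across D1..D7
```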
How Inverted Files Are Created
• Documents are parsed to extract tokens. These are saved with the Document ID.
Now is the time for all good men
to come to the aid of their country
Doc 1
It was a dark and stormy night in
the country manor. The time was past midnight
Doc 2
Term, Doc #: now 1; is 1; the 1; time 1; for 1; all 1; good 1; men 1; to 1; come 1; to 1; the 1; aid 1; of 1; their 1; country 1; it 2; was 2; a 2; dark 2; and 2; stormy 2; night 2; in 2; the 2; country 2; manor 2; the 2; time 2; was 2; past 2; midnight 2
How Inverted Files are Created
• After all documents have been parsed the inverted file is sorted alphabetically.
Sorted (Term, Doc #): a 2; aid 1; all 1; and 2; come 1; country 1; country 2; dark 2; for 1; good 1; in 2; is 1; it 2; manor 2; men 1; midnight 2; night 2; now 1; of 1; past 2; stormy 2; the 1; the 1; the 2; the 2; their 1; time 1; time 2; to 1; to 1; was 2; was 2
(shown alongside the unsorted list from the previous slide)
How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.
Merged (Term, Doc #, Freq): a 2 1; aid 1 1; all 1 1; and 2 1; come 1 1; country 1 1; country 2 1; dark 2 1; for 1 1; good 1 1; in 2 1; is 1 1; it 2 1; manor 2 1; men 1 1; midnight 2 1; night 2 1; now 1 1; of 1 1; past 2 1; stormy 2 1; the 1 2; the 2 2; their 1 1; time 1 1; time 2 1; to 1 2; was 2 2
(derived from the sorted list on the previous slide)
How Inverted Files are Created
• Then the file can be split into:
– A dictionary file, and
– A postings file
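The whole build (parse, sort, merge with within-document frequencies, split into dictionary and postings) can be sketched as follows, with invented mini-documents:

```python
from collections import Counter

# Count (term, doc) pairs, sort them, then split into a dictionary
# file (term -> n docs, total freq) and a postings file (doc, freq).
docs = {1: "now is the time the time", 2: "the dark night"}

pairs = Counter()                       # (term, doc_id) -> within-doc freq
for doc_id, text in docs.items():
    for token in text.split():
        pairs[(token, doc_id)] += 1

postings = {}                           # term -> [(doc_id, freq), ...]
for (term, doc_id), freq in sorted(pairs.items()):
    postings.setdefault(term, []).append((doc_id, freq))

dictionary = {t: (len(p), sum(f for _, f in p)) for t, p in postings.items()}
print(dictionary["the"], postings["the"])
```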
How Inverted Files are Created
Dictionary — Term (N docs, Tot Freq): a (1,1); aid (1,1); all (1,1); and (1,1); come (1,1); country (2,2); dark (1,1); for (1,1); good (1,1); in (1,1); is (1,1); it (1,1); manor (1,1); men (1,1); midnight (1,1); night (1,1); now (1,1); of (1,1); past (1,1); stormy (1,1); the (2,4); their (1,1); time (2,2); to (1,2); was (1,2)

Postings — (Doc #, Freq): (2,1) (1,1) (1,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,2) (2,2) (1,1) (1,1) (2,1) (1,2) (2,2)
Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)
• These lists can be used to solve Boolean queries:
– country -> d1, d2
– manor -> d2
– country AND manor -> d2
• Also used for statistical ranking algorithms
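The "country AND manor" example can be solved by intersecting postings lists; the doc IDs below come from the two-document example on the surrounding slides:

```python
# Boolean AND over an inverted index: intersect the postings lists.
postings = {
    "country": [1, 2],
    "manor": [2],
}

def boolean_and(term_a, term_b, index):
    return sorted(set(index.get(term_a, [])) & set(index.get(term_b, [])))

print(boolean_and("country", "manor", postings))   # -> [2]
```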
How Inverted Files are Used
(The same dictionary and postings files as on the previous slide.)
Query on "time" AND "dark":

• 2 docs with "time" in dictionary -> IDs 1 and 2 from postings file
• 1 doc with "dark" in dictionary -> ID 2 from postings file
• Therefore, only doc 2 satisfies the query
Next Time
• More on Vector Representation
• The Vector Model of IR
• Term weighting
• Statistical ranking methods