The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt,...
![Page 1: The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,](https://reader035.vdocuments.us/reader035/viewer/2022062716/56649de75503460f94ae0d84/html5/thumbnails/1.jpg)
What can text statistics reveal? {week 05a}
The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
Text transformation
How do we best convert documents to their index terms?
How do we make acquired documents searchable?
Find/Replace
Simplest approach is find, which requires no text transformation
Useful in user applications, but not in search (why?)
Optional transformation handled during the find operation: case sensitivity
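As a minimal sketch of the find operation described above, the following scans the raw text directly, with case folding as the only optional transformation. The names `find_all` and `case_sensitive` are hypothetical, and `str.lower()` is assumed to be adequate for the text at hand. Note that every call rescans the whole text, which hints at why plain find does not scale to search over a large collection.

```python
def find_all(text: str, pattern: str, case_sensitive: bool = True) -> list[int]:
    """Return the start offset of every occurrence of pattern in text."""
    if not pattern:
        return []
    if not case_sensitive:
        # The only "transformation": fold case at search time, not in advance.
        # Assumes lower() preserves string length, which holds for ASCII text.
        text, pattern = text.lower(), pattern.lower()
    offsets = []
    start = text.find(pattern)
    while start != -1:
        offsets.append(start)
        # Resume one character later so overlapping matches are also found.
        start = text.find(pattern, start + 1)
    return offsets


if __name__ == "__main__":
    sample = "The fish and the Fish"
    print(find_all(sample, "fish"))
    print(find_all(sample, "fish", case_sensitive=False))
```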
Text statistics (i)
English documents are predictable:
Top two most frequently occurring words are “the” and “of” (10% of word occurrences)
Top six most frequently occurring words account for 20% of word occurrences
Top fifty most frequently occurring words account for 50% of word occurrences
Given all unique words in a (large) document, approximately 50% occur only once
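These proportions can be measured on any text. The sketch below is one way to do it, assuming a deliberately simple tokenizer (lowercased runs of letters and apostrophes); the function name `text_statistics` and its `top_k` parameter are hypothetical.

```python
import re
from collections import Counter


def text_statistics(text: str, top_k: int = 2) -> dict[str, float]:
    """Summarize how concentrated the word occurrences in text are."""
    # Assumed tokenizer: lowercase, then keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    top_occurrences = sum(count for _, count in counts.most_common(top_k))
    singletons = sum(1 for count in counts.values() if count == 1)
    return {
        "total_words": len(words),
        "vocabulary_size": len(counts),
        # Share of all word occurrences covered by the top_k most frequent words.
        "top_k_share": top_occurrences / len(words),
        # Share of unique words that occur exactly once.
        "singleton_share": singletons / len(counts),
    }


if __name__ == "__main__":
    print(text_statistics("the cat and the dog and the bird", top_k=2))
```

On a toy sentence the numbers are not meaningful; the regularities on the slide only emerge for large samples of English text.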
Text statistics (ii)
Zipf’s law:
Rank words in order of decreasing frequency
The rank (r) of a word times its frequency (f) is approximately equal to a constant (k):
$r \times f = k$
In other words, the frequency of the rth most common word is inversely proportional to r
George Kingsley Zipf (1902-1950)
Text statistics (iii)
The probability of occurrence ($P_r$) of a word is the word frequency divided by the total number of words in the document
Revise Zipf’s law as: $r \times P_r = c$
For English, $c \approx 0.1$
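One way to check the revised law on a tokenized text is to tabulate rank, frequency, and the product of rank and probability for the most frequent words. This is a sketch only: `zipf_table` and `top_n` are hypothetical names, and the input is assumed to be an already tokenized list of words.

```python
from collections import Counter


def zipf_table(words: list[str], top_n: int = 10) -> list[tuple[int, str, int, float]]:
    """Return (rank, word, frequency, rank * P_r) for the top_n words."""
    if not words:
        raise ValueError("words must not be empty")
    counts = Counter(words)
    total = len(words)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        probability = freq / total  # P_r: frequency over total word occurrences
        rows.append((rank, word, freq, rank * probability))
    return rows


if __name__ == "__main__":
    # Toy input built so that frequency is close to inversely proportional to rank.
    toy = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
    for rank, word, freq, c in zipf_table(toy):
        print(rank, word, freq, round(c, 3))
```

If Zipf's law holds for a text, the last column stays roughly constant across ranks; for large English collections the slide gives a value near 0.1.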
Text statistics (iv)
Verify Zipf’s law using the AP89 dataset:
Collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):
Total documents: 84,678
Total word occurrences: 39,749,179
Vocabulary size: 198,763
Words occurring > 1000 times: 4,169
Words occurring once: 70,064
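A few ratios derived from the AP89 figures above help relate them to the earlier rules of thumb. This is plain arithmetic on the numbers from the slide; the variable names are illustrative only.

```python
# AP89 collection statistics, copied from the slide.
total_documents = 84_678
total_occurrences = 39_749_179
vocabulary_size = 198_763
words_over_1000 = 4_169
words_occurring_once = 70_064

# Share of the vocabulary that occurs exactly once in the collection.
singleton_share = words_occurring_once / vocabulary_size
# Share of the vocabulary that occurs more than 1000 times.
frequent_share = words_over_1000 / vocabulary_size
# Average number of word occurrences per document.
average_document_length = total_occurrences / total_documents

print(f"occur once:      {singleton_share:.1%}")
print(f"occur > 1000:    {frequent_share:.1%}")
print(f"avg. doc length: {average_document_length:.0f} words")
```

In AP89 roughly 35% of the unique words occur only once, which is somewhat below the "approximately 50%" rule of thumb quoted earlier.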
Text statistics (v)
Top 50 words of AP89
Vocabulary growth (i)
As the corpus grows, so does vocabulary size
Fewer new words when corpus is already large
The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and called Heaps’ law:
$v = k \times n^{\beta}$
Constants k and β vary
Typically 10 ≤ k ≤ 100 and β ≈ 0.5
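The formula is easy to evaluate directly. The sketch below uses a hypothetical function name, and the values k = 50 and β = 0.5 are picked from the typical ranges above purely for illustration; they are not fitted to any real collection.

```python
def heaps_vocabulary_size(n: int, k: float, beta: float) -> float:
    """Predict vocabulary size v = k * n**beta for a corpus of n word occurrences."""
    if n < 0:
        raise ValueError("corpus size must be non-negative")
    return k * n ** beta


if __name__ == "__main__":
    # Illustrative constants only (assumed, not measured).
    k, beta = 50.0, 0.5
    for n in (1_000_000, 4_000_000, 16_000_000):
        print(n, round(heaps_vocabulary_size(n, k, beta)))
```

With β = 0.5, quadrupling the corpus only doubles the predicted vocabulary, which matches the observation that fewer new words appear once the corpus is already large.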
Vocabulary growth (ii)
note values of k and β
Vocabulary growth (iii)
Web pages crawled from .gov in early 2004
Estimating result set size (i)
Word occurrence statistics can be used to estimate result set size of a user query
Aside from stop words, how many pages contain all of the query terms?
▪ To figure this out, first assume that words occur independently of one another
▪ Also assume that the search engine knows N, the number of documents it indexes
Estimating result set size (ii)
Given three query terms a, b, and c
Probability of a document containing all three is the product of individual probabilities for each query term:
$P(a \cap b \cap c) = P(a) \times P(b) \times P(c)$
$P(a \cap b \cap c)$ is the joint probability of events a, b, and c occurring
Estimating result set size (iii)
We assume the search engine knows the number of documents that a word occurs in
Call these $n_a$, $n_b$, and $n_c$
▪ Note that the book uses $f_a$, $f_b$, and $f_c$
Estimate individual query term probabilities:
$P(a) = n_a / N$
$P(b) = n_b / N$
$P(c) = n_c / N$
Estimating result set size (iv)
Given P(a), P(b), and P(c), we estimate the result set size as:
$n_{abc} = N \times (n_a / N) \times (n_b / N) \times (n_c / N)$
$n_{abc} = (n_a \times n_b \times n_c) / N^2$
This estimation sounds good, but is lacking due to our query term independence assumption
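The independence-based estimate can be sketched for any number of query terms as follows. The function name and the example document frequencies are hypothetical; only the formula comes from the slides.

```python
from math import prod


def estimate_result_size_independent(doc_freqs: list[int], n_docs: int) -> float:
    """Estimate how many documents contain all query terms, assuming independence.

    doc_freqs holds, for each query term, the number of documents containing it;
    n_docs is N, the number of indexed documents.
    """
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    # N times the product of the per-term probabilities n_t / N.
    return n_docs * prod(df / n_docs for df in doc_freqs)


if __name__ == "__main__":
    # Made-up counts for a 1,000-document collection.
    print(estimate_result_size_independent([100, 50, 20], 1_000))
```

For three terms this reduces to the product of the three document frequencies divided by N squared, as in the second form of the formula above.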
Estimating result set size (v)
Using the GOV2 dataset with N = 25,205,179
Poor results, because of the query term independence assumption
Could use word co-occurrence data...
Estimating result set size (vi)
Extrapolate based on the size of the current result set:
The current result set is the subset of documents that have been ranked thus far
Let C be the number of documents found thus far containing all the query words
Let s be the proportion of the total documents ranked (use least frequently occurring term)
Estimate result set size via $n_{abc} = C / s$
Estimating result set size (vii)
Given example query: tropical fish aquarium
Least frequently occurring term is aquarium (which occurs in 26,480 documents)
After ranking 3,000 documents, 258 documents contain all three query terms
Thus, $n_{abc} = C / s = 258 / (3{,}000 \div 26{,}480) \approx 2{,}277$
After processing 20% of the documents, the estimate is 1,778
▪ Which overshoots actual value of 1,529
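The extrapolation can be written as a small function and checked against the tropical fish aquarium numbers. The function and parameter names are hypothetical; s is computed as the fraction of the least frequent term's documents ranked so far, as in the worked example.

```python
def estimate_result_size_extrapolated(
    matches_so_far: int, docs_ranked: int, least_frequent_df: int
) -> float:
    """Extrapolate the final result set size from a partially ranked result set.

    matches_so_far:    C, documents found so far containing all query words
    docs_ranked:       documents ranked so far
    least_frequent_df: documents containing the least frequent query term
    """
    if docs_ranked <= 0 or least_frequent_df <= 0:
        raise ValueError("docs_ranked and least_frequent_df must be positive")
    s = docs_ranked / least_frequent_df  # proportion processed so far
    return matches_so_far / s


if __name__ == "__main__":
    # Numbers from the example: 258 matches after ranking 3,000 of 26,480 documents.
    print(round(estimate_result_size_extrapolated(258, 3_000, 26_480)))
```

The estimate is recomputed as ranking proceeds, so it tends to move closer to the true count as s grows.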
What next?
Read and study Chapter 4
Do Exercises 4.1, 4.2, and 4.3
Start thinking about how to write code to implement the stopping & stemming techniques of Ch.4