
INFO 4300 / CS4300

Information Retrieval

slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 3: Dictionaries and tolerant retrieval

Paul Ginsparg

Cornell University, Ithaca, NY

3 Sep 2009


Administrativa

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2009fa/

Lectures: Tuesday and Thursday 11:40-12:55, Hollister B14

Instructor: Paul Ginsparg, [email protected], 255-7371, Cornell Information Science, 301 College Avenue

Instructor’s Assistant: Corinne Russell, [email protected], 255-5925, Cornell Information Science, 301 College Avenue

Instructor’s Office Hours: Wed 1–2pm on 9 and 16 Sep, then Wed 1–3, or e-mail instructor to schedule an appointment

Teaching Assistant: Nam Nguyen, [email protected]. The Teaching Assistant does not have scheduled office hours but is available to help you by email.


Overview

1 Recap

2 Skip pointers

3 Phrase queries

4 Dictionaries

5 Wildcard queries

6 Discussion


Type/token distinction

Token – An instance of a word or term occurring in a document.

Type – An equivalence class of tokens.

In June, the dog likes to chase the cat in the barn.

12 word tokens, 9 word types
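The count above can be reproduced with a small Python sketch. The tokenizer (maximal runs of letters) and the equivalence relation (equality after lowercasing) are assumptions made for illustration; the function name is hypothetical.

```python
import re

def tokens_and_types(text):
    """Return (tokens, types) for a text.

    Assumptions: a token is a maximal run of ASCII letters, and two tokens
    belong to the same type if they are equal after case folding.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    types = {tok.lower() for tok in tokens}
    return tokens, types

sentence = "In June, the dog likes to chase the cat in the barn."
tokens, types = tokens_and_types(sentence)
print(len(tokens), len(types))  # prints: 12 9
```

Without case folding, "In" and "in" would be different types and the count would be 10 rather than 9.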


Problems in tokenization

What are the delimiters? Space? Apostrophe? Hyphen?

For each of these: sometimes they delimit, sometimes they don’t.

No whitespace in many languages! (e.g., Chinese)

No whitespace in Dutch, German, Swedish compounds (Lebensversicherungsgesellschaftsangestellter)


Problems in “equivalence classing”

A term is an equivalence class of tokens.

How do we define equivalence classes?

Numbers (3/20/91 vs. 20/3/91)

Case folding

Stemming, Porter stemmer

Morphological analysis: inflectional vs. derivational

Equivalence classing problems in other languages

More complex morphology than in English

Finnish: a single verb may have 12,000 different forms

Accents, umlauts


Recall basic intersection algorithm

Brutus −→ 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174

Calpurnia −→ 2 → 31 → 54 → 101

Intersection =⇒ 2 → 31

Linear in the length of the postings lists.
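A minimal Python sketch of this linear merge, assuming each postings list is a Python list of docIDs in increasing order (the function name is hypothetical):

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # prints: [2, 31]
```

Each step advances at least one of the two indices, so the loop runs at most len(p1) + len(p2) times.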

Can we do better?


Skip pointers

Skip pointers allow us to skip postings that will not figure in the search results.

This makes intersecting postings lists more efficient.

Some postings lists contain several million entries – so efficiency can be an issue even if basic intersection is linear.

Where do we put skip pointers?

How do we make sure intersection results are correct?


Basic idea

Brutus −→ 2 → 4 → 8 → 16 → 32 → 64 → 128 (skip pointers lead to 16 and 128)

Caesar −→ 1 → 2 → 3 → 5 → 8 → 17 → 21 → 31 → 75 → 81 → 84 → 89 → 92 (skip pointers lead to 8 and 31)


Skip lists: Larger example


Intersecting with skip pointers

IntersectWithSkips(p1, p2)
  answer ← 〈 〉
  while p1 ≠ nil and p2 ≠ nil
  do if docID(p1) = docID(p2)
       then Add(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then if hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
                     then while hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
                          do p1 ← skip(p1)
                     else p1 ← next(p1)
              else if hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
                     then while hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
                          do p2 ← skip(p2)
                     else p2 ← next(p2)
  return answer
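The pseudocode can be sketched in Python as follows. This is an illustration under assumptions, not the slides' implementation: postings are plain sorted lists, and skip pointers are implicit, meaning the posting at index i has a skip pointer exactly when i is a multiple of a per-list stride and i + stride is still inside the list. The stride parameters and function names are hypothetical.

```python
def intersect_with_skips(p1, p2, stride1, stride2):
    """Intersect sorted docID lists p1 and p2 using implicit skip pointers."""

    def has_skip(i, postings, stride):
        # Assumed layout: a skip pointer sits at every stride-th posting
        # and leads to the posting `stride` entries further on.
        return i % stride == 0 and i + stride < len(postings)

    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if has_skip(i, p1, stride1) and p1[i + stride1] <= p2[j]:
                # Follow skips while the target does not overshoot p2[j].
                while has_skip(i, p1, stride1) and p1[i + stride1] <= p2[j]:
                    i += stride1
            else:
                i += 1
        else:
            if has_skip(j, p2, stride2) and p2[j + stride2] <= p1[i]:
                while has_skip(j, p2, stride2) and p2[j + stride2] <= p1[i]:
                    j += stride2
            else:
                j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 17, 21, 31, 75, 81, 84, 89, 92]
print(intersect_with_skips(brutus, caesar, 3, 4))  # prints: [2, 8]
```

A skip is only taken when its target docID is still ≤ the current docID on the other list, so no possible match is ever jumped over.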


Where do we place skips?

Tradeoff: number of items skipped vs. frequency skip can be taken

More skips: Each skip pointer skips only a few items, but we can frequently use it.

Fewer skips: Each skip pointer skips many items, but we cannot use it very often.


Where do we place skips? (cont)

Simple heuristic: for a postings list of length P, use √P evenly spaced skip pointers.

This ignores the distribution of query terms.

Easy if the index is static; harder in a dynamic environment because of updates.
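One way to read the √P heuristic is as a stride of about √P postings between skip sources, which gives roughly √P pointers. The sketch below assumes that reading; rounding down with math.isqrt and the function name are choices made for illustration.

```python
import math

def skip_positions(postings):
    """Return {source index: target index} for evenly spaced skip pointers."""
    n = len(postings)
    stride = math.isqrt(n)   # floor(sqrt(P)); the rounding is an assumption
    if stride < 2:
        return {}            # a skip of length 1 is just next()
    # Only place a pointer where the target index still exists.
    return {i: i + stride for i in range(0, n - stride, stride)}

postings = list(range(0, 160, 10))   # a static list with P = 16 postings
print(skip_positions(postings))      # prints: {0: 4, 4: 8, 8: 12}
```

For a static index this table is computed once. With updates, insertions shift the indices and the spacing drifts, which is the difficulty noted above.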

How much do skip pointers help?

They used to help a lot.

With today’s fast CPUs, they don’t help that much anymore.


Phrase queries

We want to answer a query such as “stanford university” – as a phrase.

Thus “The inventor Stanford Ovshinsky never went to university” shouldn’t be a match.

The concept of phrase query has proven easily understood by users.

About 10% of web queries are phrase queries.

Consequence for inverted index: it no longer suffices to store docIDs in postings lists.

Any ideas?


Biword indexes

Index every consecutive pair of terms in the text as a phrase.

For example, Friends, Romans, Countrymen would generate two biwords: “friends romans” and “romans countrymen”

Each of these biwords is now a vocabulary term.

Two-word phrases can now easily be answered.
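A toy biword index in Python, assuming documents arrive as already normalized token lists keyed by docID (the function names and the second document are made up for illustration):

```python
from collections import defaultdict

def biwords(tokens):
    """Yield every consecutive pair of tokens as one biword term."""
    for first, second in zip(tokens, tokens[1:]):
        yield f"{first} {second}"

def build_biword_index(docs):
    """Map each biword to the sorted list of docIDs that contain it.

    `docs` maps a docID to its token list.
    """
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for bw in biwords(tokens):
            index[bw].add(doc_id)
    return {bw: sorted(ids) for bw, ids in index.items()}

print(list(biwords(["friends", "romans", "countrymen"])))
# prints: ['friends romans', 'romans countrymen']

docs = {1: ["friends", "romans", "countrymen"],
        2: ["romans", "countrymen", "lend"]}
print(build_biword_index(docs)["romans countrymen"])  # prints: [1, 2]
```

A two-word phrase query is then a single dictionary lookup.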


Longer phrase queries

A long phrase like “stanford university palo alto” can be represented as the Boolean query “stanford university” AND “university palo” AND “palo alto”

We need to do post-filtering of hits to identify the subset that actually contains the 4-word phrase.
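The sketch below shows why the post-filter is needed. It is a toy under assumptions: documents are token lists held in memory, the biword sets are recomputed per query instead of being read from an index, and the second document is invented to produce a false positive.

```python
def phrase_to_biwords(phrase):
    """Split a phrase into its consecutive biwords."""
    words = phrase.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def contains_phrase(tokens, words):
    """True if `words` occurs as a contiguous run inside `tokens`."""
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def phrase_query(phrase, docs):
    """Return (candidates, hits) for a phrase over docs: docID -> token list."""
    words = phrase.lower().split()
    needed = set(phrase_to_biwords(phrase))
    # Boolean AND over biwords: every biword must occur somewhere in the doc.
    candidates = [
        doc_id for doc_id, toks in docs.items()
        if needed <= {f"{a} {b}" for a, b in zip(toks, toks[1:])}
    ]
    # Post-filter: keep only documents that contain the full phrase.
    hits = [d for d in candidates if contains_phrase(docs[d], words)]
    return candidates, hits

docs = {
    1: "stanford university palo alto".split(),
    2: "palo alto is near stanford university palo said".split(),
}
print(phrase_query("stanford university palo alto", docs))
# prints: ([1, 2], [1])
```

Document 2 contains all three biwords but never the four words in a row, so it passes the Boolean query and is removed by the post-filter.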


Extended biwords

Parse each document and perform part-of-speech tagging

Bucket the terms into (say) nouns (N) and articles/prepositions (X)

Now deem any string of terms of the form NX*N to be an extended biword

Examples:

catcher in the rye: N X X N

king of Denmark: N X N

Include extended biwords in the term vocabulary

Queries are processed accordingly


Issues with biword indexes

Why are biword indexes rarely used?

False positives, as noted above

Index blowup due to very large term vocabulary


Positional indexes

Positional indexes are a more efficient alternative to biword indexes.

Postings lists in a nonpositional index: each posting is just a docID

Postings lists in a positional index: each posting is a docID and a list of positions

Example query: “to₁ be₂ or₃ not₄ to₅ be₆”

to, 993427:
  〈 1: 〈7, 18, 33, 72, 86, 231〉;
    2: 〈1, 17, 74, 222, 255〉;
    4: 〈8, 16, 190, 429, 433〉;
    5: 〈363, 367〉;
    7: 〈13, 23, 191〉; . . . 〉

be, 178239:
  〈 1: 〈17, 25〉;
    4: 〈17, 191, 291, 430, 434〉;
    5: 〈14, 19, 101〉; . . . 〉

Document 4 is a match!
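The match can be checked mechanically. The sketch below uses only the postings shown above; since no postings for "or" and "not" are given, it tests just the constraint that to, be, to, be occur at offsets 0, 1, 4, 5. The nested-dict layout and the function name are assumptions.

```python
def match_positions(index, pattern, doc_id):
    """Return start positions in doc_id where every (term, offset) pair fits.

    `index` maps term -> {docID: sorted positions}. The first pair in
    `pattern` is expected to have offset 0.
    """
    first_term, _ = pattern[0]
    starts = []
    for start in index[first_term].get(doc_id, []):
        if all(start + off in index[term].get(doc_id, [])
               for term, off in pattern):
            starts.append(start)
    return starts

index = {
    "to": {1: [7, 18, 33, 72, 86, 231], 2: [1, 17, 74, 222, 255],
           4: [8, 16, 190, 429, 433], 5: [363, 367], 7: [13, 23, 191]},
    "be": {1: [17, 25], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
# "to be or not to be": to at +0 and +4, be at +1 and +5.
pattern = [("to", 0), ("be", 1), ("to", 4), ("be", 5)]
print({d: match_positions(index, pattern, d) for d in (1, 2, 4, 5, 7)})
# prints: {1: [], 2: [], 4: [429], 5: [], 7: []}
```

In document 4 the phrase can start at position 429 (to 429, be 430, to 433, be 434). Confirming the full phrase would also require "or" at 431 and "not" at 432.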


Proximity search

We just saw how to use a positional index for phrase searches.

We can also use it for proximity search.

For example: employment /4 place

Find all documents that contain employment and place within 4 words of each other.

“Employment agencies that place healthcare workers are seeing growth” is a hit.

“Employment agencies that have learned to adapt now place healthcare workers” is not a hit.


Proximity search

Use the positional index

Simplest algorithm: look at cross-product of positions of (i) employment in document and (ii) place in document

Very inefficient for frequent words, especially stop words

Note that we want to return the actual matching positions, not just a list of documents.

This is important for dynamic summaries etc.


“Proximity” intersection

PositionalIntersect(p1, p2, k)
  answer ← 〈 〉
  while p1 ≠ nil and p2 ≠ nil
  do if docID(p1) = docID(p2)
       then l ← 〈 〉
            pp1 ← positions(p1)
            pp2 ← positions(p2)
            while pp1 ≠ nil
            do while pp2 ≠ nil
               do if |pos(pp1) − pos(pp2)| ≤ k
                    then Add(l, pos(pp2))
                    else if pos(pp2) > pos(pp1)
                           then break
                  pp2 ← next(pp2)
               while l ≠ 〈 〉 and |l[0] − pos(pp1)| > k
               do Delete(l[0])
               for each ps ∈ l
               do Add(answer, 〈docID(p1), pos(pp1), ps〉)
               pp1 ← next(pp1)
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then p1 ← next(p1)
              else p2 ← next(p2)
  return answer
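A Python sketch of the same idea, under assumptions: each postings list is a dict from docID to a sorted position list, and the docID merge of the pseudocode is replaced by a set intersection of the dict keys for brevity. Positions in the example are word offsets counted from 1 in the two example sentences.

```python
def positional_intersect(p1, p2, k):
    """Return (docID, pos1, pos2) triples with |pos1 - pos2| <= k.

    p1 and p2 map docID -> sorted list of positions of one term.
    """
    answer = []
    for doc_id in sorted(p1.keys() & p2.keys()):
        pp1, pp2 = p1[doc_id], p2[doc_id]
        window = []   # positions of term 2 close to the current pos1
        j = 0         # pointer into pp2; it never moves backwards
        for pos1 in pp1:
            while j < len(pp2):
                if abs(pos1 - pp2[j]) <= k:
                    window.append(pp2[j])
                elif pp2[j] > pos1:
                    break          # too far to the right of pos1
                j += 1
            # Drop window entries that fell out of range on the left.
            while window and abs(window[0] - pos1) > k:
                window.pop(0)
            for pos2 in window:
                answer.append((doc_id, pos1, pos2))
    return answer

# Doc 1: "Employment agencies that place healthcare workers ..."
# Doc 2: "Employment agencies that have learned to adapt now place ..."
employment = {1: [1], 2: [1]}
place = {1: [4], 2: [9]}
print(positional_intersect(employment, place, 4))  # prints: [(1, 1, 4)]
```

Because both position lists are sorted, the window only grows at the right and shrinks at the left, which avoids the full cross-product of positions.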


Combination scheme

Biword indexes and positional indexes can be profitably combined.

Many biwords are extremely frequent: Michael Jackson, Britney Spears, etc.

For these biwords, increased speed compared to positional postings intersection is substantial.

Combination scheme: Include frequent biwords as vocabulary terms in the index. Do all other phrases by positional intersection.

Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme. Faster than a positional index, at a cost of 26% more space for index.


“Positional” queries on Google

For web search engines, positional queries are much more expensive than regular Boolean queries.

Let’s look at the example of phrase queries.

Why are they more expensive than regular Boolean queries?

Can you demonstrate on Google that phrase queries are more expensive than Boolean queries?


Inverted index

For each term t, we store a list of all documents that contain t.

Brutus −→ 1 2 4 11 31 45 173 174

Caesar −→ 1 2 4 5 6 16 57 132 . . .

Calpurnia −→ 2 31 54 101

...

(left column: dictionary; right column: postings)


Dictionaries

Term vocabulary: the data

Dictionary: the data structure for storing the term vocabulary


Dictionary as array of fixed-width entries

For each term, we need to store a couple of items:

document frequency

pointer to postings list

. . .

Assume for the time being that we can store this information in a fixed-length entry.

Assume that we store these entries in an array.


Dictionary as array of fixed-width entries

term       document frequency    pointer to postings list
a          656,265               −→
aachen     65                    −→
. . .      . . .                 . . .
zulu       221                   −→

space needed: 20 bytes (term), 4 bytes (document frequency), 4 bytes (pointer)

How do we look up an element in this array at query time?


Data structures for looking up term

Two main classes of data structures: hashes and trees

Some IR systems use hashes, some use trees.

Criteria for when to use hashes vs. trees:

Is there a fixed number of terms or will it keep growing?

What are the relative frequencies with which various keys will be accessed?

How many terms are we likely to have?


Hashes

Each vocabulary term is hashed into an integer.

Try to avoid collisions

At query time, do the following: hash query term, resolve collisions, locate entry in fixed-width array

Pros: Lookup in a hash is faster than lookup in a tree.

Lookup time is constant.

Cons

no way to find minor variants (resume vs. résumé)

no prefix search (all terms starting with automat)

need to rehash everything periodically if vocabulary keeps growing


Trees

Trees solve the prefix problem (find all terms starting with automat).

Simplest tree: binary tree

Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary.

O(log M) only holds for balanced trees.

Rebalancing binary trees is expensive.

B-trees mitigate the rebalancing problem.

B-tree definition: every internal node has a number of children in the interval [a, b] where a, b are appropriate positive integers, e.g., [2, 4].


Binary tree


B-tree


Wildcard queries

mon*: find all docs containing any term beginning with mon

Easy with B-tree dictionary: retrieve all terms t in the range: mon ≤ t < moo

*mon: find all docs containing any term ending with mon

Maintain an additional tree for terms written backwards

Then retrieve all terms t in the range: nom ≤ t < non
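The two range lookups can be sketched with a sorted Python list and binary search standing in for the B-tree (an assumption made to keep the example short; a real dictionary would use a tree on disk or in memory). The vocabulary and function names are made up, and the prefix is assumed to be non-empty.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return all terms t with prefix <= t < upper.

    `upper` is the prefix with its last character incremented,
    e.g. "mon" -> "moo".
    """
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

def suffix_range(sorted_reversed_terms, suffix):
    """Answer *suffix using a second sorted list of reversed terms."""
    reversed_matches = prefix_range(sorted_reversed_terms, suffix[::-1])
    return [t[::-1] for t in reversed_matches]

terms = sorted(["demon", "lemon", "moat", "monday", "money",
                "month", "moon", "salmon"])
reversed_terms = sorted(t[::-1] for t in terms)

print(prefix_range(terms, "mon"))           # prints: ['monday', 'money', 'month']
print(suffix_range(reversed_terms, "mon"))  # prints: ['demon', 'lemon', 'salmon']
```

The suffix query *mon becomes the prefix query nom* on the reversed list, which is the range nom ≤ t < non.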


Query processing

At this point, we have an enumeration of all terms in the dictionary that match the wildcard query.

We still have to look up the postings for each enumerated term.

E.g., consider the query: gen* AND universit*

This may result in the execution of many Boolean AND queries.

If there are 100 terms matching gen* and 50 terms matching universit*, then we have to run 100 × 50 = 5000 Boolean queries!


How to handle * in the middle of a term

Example: m*nchen

We could look up m* and *nchen in the B-tree and intersect the two term sets.

Expensive

Alternative: permuterm index

Basic idea: Rotate every wildcard query, so that the * occurs at the end.


Permuterm index

For term hello: add hello$, ello$h, llo$he, lo$hel, o$hell, and $hello to the B-tree, where $ is a special symbol


Permuterm → term mapping


Permuterm index

For hello, we’ve stored: hello$, ello$h, llo$he, lo$hel, o$hell, and $hello

Queries

For X, look up X$

For X*, look up $X*

For *X, look up X$*

For *X*, look up X*

For X*Y, look up Y$X*

Example: For hel*o, look up o$hel*

It’s really a tree and should be called permuterm tree.

But permuterm index is the more common name.
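A small Python sketch of the permuterm idea, under assumptions: a sorted list of (rotation, term) pairs stands in for the B-tree, and only queries with exactly one * are handled. The sketch generates every rotation of term + $, including $hello, which prefix queries such as hel* rely on. All function names and the vocabulary are hypothetical.

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$' ($ marks the end of the term)."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(vocabulary):
    """Sorted (rotation, original term) pairs; stands in for the B-tree."""
    return sorted((rot, term) for term in vocabulary for rot in rotations(term))

def rotate_query(query):
    """Rotate a single-'*' query so the '*' comes last, then drop it."""
    head, tail = (query + "$").split("*")
    return tail + head           # e.g. "hel*o" -> "o$hel"

def wildcard_lookup(permuterm, query):
    """Return the sorted terms whose rotations start with the rotated query."""
    prefix = rotate_query(query)
    lo = bisect_left(permuterm, (prefix,))
    matches = set()
    for rot, term in permuterm[lo:]:
        if not rot.startswith(prefix):
            break
        matches.add(term)
    return sorted(matches)

permuterm = build_permuterm(["hello", "help", "halo", "jello"])
print(rotations("hello"))
# prints: ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(wildcard_lookup(permuterm, "hel*o"))  # prints: ['hello']
print(wildcard_lookup(permuterm, "hel*"))   # prints: ['hello', 'help']
print(wildcard_lookup(permuterm, "*llo"))   # prints: ['hello', 'jello']
```

Every wildcard query becomes one prefix scan, at the price of storing one entry per rotation of every term.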


Processing a lookup in the permuterm index

Rotate query wildcard to the right

Use B-tree lookup as before

Problem: Permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical number)


k-gram indexes

More space-efficient than permuterm index

Enumerate all character k-grams (sequence of k characters) occurring in a term

2-grams are called bigrams.

Example: from April is the cruelest month we get the bigrams: $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$

$ is a special word boundary symbol, as before.

Maintain an inverted index from bigrams to the terms that contain the bigram
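A minimal sketch of k-gram enumeration and the k-gram-to-term index. It assumes terms are already lowercased and that each term is padded with one $ on both sides; the function names and the sample vocabulary are made up.

```python
from collections import defaultdict

def kgrams(term, k=2):
    """Character k-grams of a term padded with the boundary symbol $."""
    padded = f"${term}$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def build_kgram_index(vocabulary, k=2):
    """Map each k-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

print(kgrams("month"))  # prints: ['$m', 'mo', 'on', 'nt', 'th', 'h$']

index3 = build_kgram_index(
    ["beetroot", "metric", "petrify", "retrieval", "moon"], k=3)
print(sorted(index3["etr"]))
# prints: ['beetroot', 'metric', 'petrify', 'retrieval']
```

The postings of this index are terms, not documents.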


Postings list in a 3-gram inverted index

etr −→ beetroot → metric → petrify → retrieval


Bigram indexes

Note that we now have two different types of inverted indexes

The term-document inverted index for finding documents based on a query consisting of terms

The k-gram index for finding terms based on a query consisting of k-grams


Processing wildcarded terms in a bigram index

Query mon* can now be run as: $m AND mo AND on

Gets us all terms with the prefix mon . . .

. . . but also many “false positives” like moon.

We must postfilter these terms against query.

Surviving terms are then looked up in the term-document inverted index.

k-gram indexes are more space efficient than permuterm indexes.
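The bigram lookup and the postfilter can be sketched as follows. This is an illustration under assumptions: the bigram index is rebuilt in memory on every call, the vocabulary is made up, and the postfilter uses Python's fnmatch.fnmatchcase, which also treats ? and [ ] as wildcards, so terms containing those characters would need escaping.

```python
from fnmatch import fnmatchcase

def kgrams(text, k=2):
    """Character k-grams of a string, without adding padding."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def wildcard_terms(vocabulary, query, k=2):
    """Return (candidates, matches) for a wildcard query over terms."""
    # k-gram index: k-gram -> set of terms (terms are padded with $).
    index = {}
    for term in vocabulary:
        for gram in kgrams(f"${term}$", k):
            index.setdefault(gram, set()).add(term)

    # k-grams of the query: pad it, split at '*', take k-grams per piece.
    pieces = f"${query}$".split("*")
    grams = [g for piece in pieces for g in kgrams(piece, k)]

    # Boolean AND over the k-gram postings.
    candidates = set(vocabulary)
    for gram in grams:
        candidates &= index.get(gram, set())

    # Postfilter: keep only terms that really match the wildcard pattern.
    matches = sorted(t for t in candidates if fnmatchcase(t, query))
    return sorted(candidates), matches

vocab = ["moon", "month", "monday", "lemon", "salmon"]
print(wildcard_terms(vocab, "mon*"))
# prints: (['monday', 'month', 'moon'], ['monday', 'month'])
```

moon contains $m, mo and on, so it survives the AND and is only removed by the postfilter.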


Processing wildcard queries in the term-document index

As before, we must potentially execute a large number of Boolean queries

Most straightforward semantics: Conjunction of disjunctions

Recall the query: gen* AND universit*

(geneva AND university) OR (geneva AND universite) OR (geneve AND university) OR (geneve AND universite) OR (general AND universities) OR . . .

Very expensive

Does Google allow wildcard queries?

Users hate to type.

If abbreviated queries like [pyth* theo*] for pythagoras’ theorem are legal, users will use them . . .

. . . a lot

According to Google search basics, 2009.04.27: “Note that the * operator works only on whole words, not parts of words.”

But that does not seem to be true. Try [pythag*] and [m*nchen]


Discussion 1

Objective: explore three information retrieval systems (Bing, LOC, PubMed), and use each for the discovery task: “What is the medical evidence that a low calorie diet will extend your lifetime?”

Some general questions and observations:

How to authenticate the information?

Is the information up to date? (how to find updated info?)

In what order are items returned? (by “relevance”, but how is relevance determined: link analysis? tf.idf?)

Use results of Bing search to refine vocabulary: e.g., “low calorie diet” → “calorie restriction”, “lifetime” → “longevity”

Assignment: everyone upload as a test of CMS the best reference found, and outline of strategy used to find it


Discussion 1: epilogue

Challenge: What principled search strategy might turn up http://www.nytimes.com/2009/08/18/science/18aging.html (without knowing the answer in advance)?

“Two experts on aging . . . argued recently in Nature that the whole phenomenon of caloric restriction may be a misleading result unwittingly produced in laboratory mice. The mice are selected for quick breeding and fed on rich diets. A low-calorie diet could be much closer to the diet that mice are adapted to in the wild, and therefore it could extend life simply because it is much healthier for them. ‘Life extension in model organisms may be an artifact to some extent,’ they wrote. To the extent caloric restriction works at all, it may have a bigger impact in short-lived organisms that do not have to worry about cancer than in humans. Thus the hope of mimicking caloric restriction with drugs ‘may be an illusion,’ . . .”

And what is the final word?


Discussion 2 (Thurs 10 Sep)

Read and be prepared to discuss:

G. Salton, A. Wong and C. S. Yang, “A vector space model for automatic indexing”. Communications of the ACM, Volume 18, Issue 11 (November 1975), pages 613–620. http://doi.acm.org/10.1145/361219.361220

This paper describes many of the concepts behind the vector space model and the SMART system.

(Note that to access this paper from the ACM Digital Library, you need to use a computer with a Cornell IP address [or use /readings/p613-salton.pdf])
